Send the link below via email or IMCopy
Present to your audienceStart remote presentation
- Invited audience members will follow you as you navigate and present
- People invited to a presentation do not need a Prezi account
- This link expires 10 minutes after you close the presentation
- A maximum of 30 users can follow your presentation
- Learn more about this feature in our knowledge base article
Optimizing Schema and Data Types
Transcript of Optimizing Schema and Data Types
you must design your schema for the
specific queries you will run Optimizing Schema and Data Types MySQL supports a large variety of data types, and choosing the correct type to store your data is crucial to getting good performance. The following simple guidelines can help you make better choices, no matter what type of data you are storing: Choosing Optimal Data Types In general, try to use the smallest data type that can correctly store and represent your data. Smaller data types are usually faster, because they use less space on the disk, in memory, and in the CPU cache. They also generally require fewer CPU cycles to process.
Make sure you don’t underestimate the range of values you need to store, though, because increasing the data type range in multiple places in your schema can be a painful and time-consuming operation. If you’re in doubt as to which is the best data type to use, choose the smallest one that you don’t think you’ll exceed. (If the system is not very busy or doesn’t store much data, or if you’re at an early phase in the design process, you can change it easily later.) Smaller
is usually better. Fewer CPU cycles are typically required to process operations on simpler data types. For example, integers are cheaper to compare than characters, because character sets and collations (sorting rules) make character comparisons compli-cated. Here are two examples: you should store dates and times in MySQL’s built-
in types instead of as strings, and you should use integers for IP addresses. Simple is good. A lot of tables include nullable columns even when the application does not need to store NULL (the absence of a value), merely because it’s the default. It’s usually best to specify columns as NOT NULL unless you intend to store NULL in them. It’s harder for MySQL to optimize queries that refer to nullable columns, because
they make indexes, index statistics, and value comparisons more complicated. A nullable column uses more storage space and requires special processing inside MySQL. When a nullable column is indexed, it requires an extra byte per entry and can even cause a fixed-size index (such as an index on a single integer column) to be converted to a variable-sized one in MyISAM. Avoid NULL if possible. Smaller is usually better.
Simple is good.
Avoid NULL if possible. The first step in deciding what data type to use for a given column is to determine what general class of types is appropriate: numeric, string, temporal, and so on. This is usually pretty straightforward, but we mention some special cases where the choice is unintuitive.
The next step is to choose the specific type. Many of MySQL’s data types can store the same kind of data but vary in the range of values they can store, the precision they permit, or the physical space (on disk and in memory) they require. Some data types
also have special behaviors or properties.
For example, a DATETIME and a TIMESTAMP column can store the same kind of data: date and time, to a precision of one second. However, TIMESTAMP uses only half as much storage space, is time zone–aware, and has special autoupdating capabilities. On the other hand, it has a much smaller range of allowable values, and sometimes its special capabilities can be a handicap.
MySQL supports many aliases for compatibility, such as INTEGER, BOOL, and NUMERIC. These are only aliases. They can be confusing, but they don’t affect performance. If you create a table with an aliased data type and then examine SHOW CREATE TABLE, you’ll see that MySQL reports the base type, not the alias you used. There are two kinds of numbers: whole numbers and real numbers (numbers with a fractional part). If you’re storing whole numbers, use one of the integer types: TINYINT, SMALLINT, MEDIUMINT, INT, or BIGINT. These require 8, 16, 24, 32, and 64 bits of storage space, respectively. They can store values from 2(N–1) to 2(N–1)–1, where N is the number of bits of storage space they use.
Integer types can optionally have the UNSIGNED attribute, which disallows negative values and approximately doubles the upper limit of positive values you can store. For example, a TINYINT UNSIGNED can store values ranging from 0 to 255 instead of from 128 to 127. Signed and unsigned types use the same amount of storage space and have the same
performance, so use whatever’s best for your data range. Whole Numbers Real numbers are numbers that have a fractional part. However, they aren’t just for fractional numbers; you can also use DECIMAL to store integers that are so large they don’t fit in BIGINT. MySQL supports both exact and inexact types. Real Numbers Third-party storage engines, such as Infobright, sometimes have their own storage formats and compression schemes, and don’t necessarily use those that are common to MySQL’s built-in storage engines The FLOAT and DOUBLE types support approximate calculations with standard floating-point math. The DECIMAL type is for storing exact fractional numbers. In MySQL 5.0 and newer, the DECIMAL type supports exact math. Both floating-point and DECIMAL types let you specify a precision. For a DECIMAL column, you can specify the maximum allowed digits before and after the decimal point. This influences the column’s space consumption. MySQL 5.0 and newer pack the digits into a binary string (nine digits per four bytes). For example, DECIMAL(18, 9) will store nine
digits from each side of the decimal point, using nine bytes in total: four for the digits before the decimal point, one for the decimal point itself, and four for the digits after the decimal point. MySQL supports quite a few string data types, with many variations on each. These data types changed greatly in versions 4.1 and 5.0, which makes them even more complicated. Since MySQL 4.1, each string column can have its own character set and set of sorting rules for that character set, or collation. This can impact performance greatly. String Types VARCHAR type stores variable-length character strings and is the most common string data type. It can require less storage space than fixed-length types, because it uses only as much space as it needs
uses 1 or 2 extra bytes to record the value’s length: 1 byte if the column’s maximum length is 255 bytes or less, and 2 bytes if it’s more.
helps performance because it saves space. However, because the rows are variable-length, they can grow when you update them, which can cause extra work CHAR is fixed-length: MySQL always allocates enough space for the specified number of characters.
CHAR were logically identical and differed only in storage format.) Values are padded
with spaces as needed for comparisons.
CHAR is useful if you want to store very short strings, or if all the values are nearly the same length. For example, CHAR is a good choice for MD5 values for user passwords, which are always the same length. CHAR is also better than VARCHAR for data that’s changed frequently, because a fixed-length row is not prone to fragmentation. CHAR BLOB and TEXT are string data types designed to store large amounts of data as either binary or character strings, respectively.
In fact, they are each families of data types: the character types are TINYTEXT, SMALLTEXT, TEXT, MEDIUMTEXT, and LONGTEXT, and the binary types are TINYBLOB, SMALLBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. BLOB is a synonym for SMALLBLOB, and TEXT is a synonym for SMALLTEXT.
Unlike with all other data types, MySQL handles each BLOB and TEXT value as an object with its own identity. Storage engines often store them specially; InnoDB may use a separate “external” storage area for them when they’re large. BLOB and TEXT types Using ENUM instead of a string type Sometimes you can use an ENUM column instead of conventional string types. An ENUM column can store a predefined set of distinct string values. MySQL stores them very compactly, packed into one or two bytes depending on the number of values in the list. It stores each value internally as an integer representing its position in the field definition list, and it keeps the “lookup table” that defines the number-to-string correspondence in the table’s .frm file. Here’s an example:
mysql> CREATE TABLE enum_test(
e ENUM('fish', 'apple', 'dog') NOT NULL
mysql> INSERT INTO enum_test(e) VALUES('fish'), ('dog'), ('apple'); Using ENUM instead of
a string type MySQL has many types for various kinds of date and time values, such as YEAR and DATE. The finest granularity of time MySQL can store is one second. (MariaDB has microsecond-granularity temporal types.) However, it can do temporal computations with microsecond granularity, and we’ll show you how to work around the storage limitations.
MySQL offers two very similar data types for this purpose:
TIMESTAMP. Date and Time Types DATETIME
This type can hold a large range of values, from the year 1001 to the year 9999, with a precision of one second. It stores the date and time packed into an integer in YYYYMMDDHHMMSS format, independent of time zone. This uses eight bytes of storage space.
By default, MySQL displays DATETIME values in a sortable, unambiguous format, such as 2008-01-16 22:37:08. This is the ANSI standard way to represent dates and times. As its name implies, the TIMESTAMP type stores the number of seconds elapsed since midnight, January 1, 1970, Greenwich Mean Time (GMT)—the same as a Unix timestamp. TIMESTAMP uses only four bytes of storage, so it has a much smaller range than DATETIME: from the year 1970 to partway through the year 2038. MySQL provides the FROM_UNIXTIME() and UNIX_TIMESTAMP() functions to convert a Unix timestamp to a date, and vice versa.
The value a TIMESTAMP displays also depends on the time zone. The MySQL server,
operating system, and client connections all have time zone settings.
TIMESTAMP also has special properties that DATETIME doesn’t have. By default,
MySQL will set the first TIMESTAMP column to the current time when you insert a
row without specifying a value for the column. TIMESTAMP There are usually many ways to represent any given data, ranging from fully normalized to fully denormalized and anything in between. In a normalized database, each fact is represented once and only once. Conversely, in a denormalized database, information is duplicated, or stored in multiple places. Normalization and Denormalization People who ask for help with performance issues are frequently advised to normalize their schemas, especially if the workload is write-heavy. This is often good advice. It works well for the following reasons:
• Normalized updates are usually faster than denormalized updates.
• When the data is well normalized, there’s little or no duplicated data, so there’s less data to change.
• Normalized tables are usually smaller, so they fit better in memory and perform better. Pros and Cons of a Normalized Schema A denormalized schema works well because everything is in the same table, which avoids joins. If you don’t need to join tables, the worst case for most queries—even the ones that don’t use indexes—is a full table scan. This can be much faster than a join when the data doesn’t fit in memory, because it avoids random I/O Pros and Cons of a Denormalized Schema Given that both normalized and denormalized schemas have benefits and drawbacks,
how can you choose the best design?
The truth is, fully normalized and fully denormalized schemas are like laboratory rats: they usually have little to do with the real world. In the real world, you often need to mix the approaches, possibly using a partially normalized schema, cache tables, and other techniques.
The most common way to denormalize data is to duplicate, or cache, selected columns
from one table in another table. In MySQL 5.0 and newer, you can use triggers to update the cached values, which makes the implementation easier. A Mixture of Normalized and Denormalized Sometimes the best way to improve performance is to keep redundant data in the same table as the data from which it was derived. However, sometimes you’ll need to build completely separate summary or cache tables, specially tuned for your retrieval needs.
This approach works best if you can tolerate slightly stale data, but sometimes you really don’t have a choice (for instance, when you need to avoid complex and expensive real-time updates).
The terms “cache table” and “summary table” don’t have standardized meanings. We use the term “cache tables” to refer to tables that contain data that can be easily, if more slowly, retrieved from the schema (i.e., data that is logically redundant). When we say “summary tables,” we mean tables that hold aggregated data from GROUP BY queries (i.e., data that is not logically redundant). Some people also use the term “roll-up tables”
for these tables, because the data has been “rolled up.” Cache and Summary Tables Many database management systems, such as Oracle or Microsoft SQL Server, offer a feature called materialized views. These are views that are actually precomputed and stored as tables on disk, and can be refreshed and updated through various strategies. MySQL doesn’t support this natively (we’ll go into details about its support for views . However, you can implement materialized views yourself, using Justin
Swanhart’s open source Flexviews tools . Flex-views is more sophisticated than roll-your-own solutions and offers a lot of nice features that make materialized views simpler to create and maintain. It consists of a few parts:
• A Change Data Capture (CDC) utility that reads the server’s binary logs and ex-tracts relevant changes to rows
• A set of stored procedures that help define and manage the view definitions
• Tools to apply the changes to the materialized data in the database Materialized Views Counter Tables
An application that keeps counts in a table can run into concurrency problems when updating the counters. Such tables are very common in web applications. You can use them to cache the number of friends a user has, the number of downloads of a file, and so on. It’s often a good idea to build a separate table for the counters, to keep it small
and fast. Speeding Up ALTER TABLE
MySQL’s ALTER TABLE performance can become a problem with very large tables.
MySQL performs most alterations by making an empty table with the desired new structure, inserting all the data from the old table into the new one, and deleting the old table. This can take a very long time, especially if you’re short on memory and the table is large and has lots of indexes. Many people have experience with ALTER TABLE operations that have taken hours or days to complete. We’ve seen that modifying a table’s .frm file is fast and that MySQL sometimes rebuilds a table when it doesn’t have to. If you’re willing to take some risks, you can convince MySQL to do several other types of modifications without rebuilding the table.
You can potentially do the following types of operations without a table rebuild:
• Remove (but not add) a column’s AUTO_INCREMENT attribute.
• Add, remove, or change ENUM and SET constants. If you remove a constant and some rows contain that value, queries will return the value as the empty string. The basic technique is to create a .frm file for the desired table structure and copy it into the place of the existing table’s .frm file, as follows:
1. Create an empty table with exactly the same layout, except for the desired modification (such as added ENUM constants).
2. Execute FLUSH TABLES WITH READ LOCK. This will close all tables in use and prevent any tables from being opened.
3. Swap the .frm files.
4. Execute UNLOCK TABLES to release the read lock. Modifying Only the .frm File Building MyISAM Indexes Quickly
The usual trick for loading MyISAM tables efficiently is to disable keys, load the data, and reenable the keys:
mysql> ALTER TABLE test.load_data DISABLE KEYS; -- load the data
mysql> ALTER TABLE test.load_data ENABLE KEYS;
This works because it lets MyISAM delay building the keys until all the data is loaded,
at which point it can build the indexes by sorting. This is much faster and results in a
defragmented, compact index tree. Unfortunately, it doesn’t work for unique indexes, because DISABLE KEYS applies only to nonunique indexes. MyISAM builds unique indexes in memory and checks the uniqueness as it loads each row. Loading becomes extremely slow as soon as the
index’s size exceeds the available memory. Good schema design is pretty universal, but of course MySQL has special implementation details to consider. In a nutshell, it’s a good idea to keep things as small and simple as you can. MySQL likes simplicity, and so will the people who have to work with your database:
• Try to avoid extremes in your design, such as a schema that will force enormously complex queries, or tables with oodles and oodles of columns. (An oodle is somewhere between a scad and a gazillion.)
• Use small, simple, appropriate data types, and avoid NULL unless it’s actually the right way to model your data’s reality.
• Try to use the same data types to store similar or related values, especially if they’ll be used in a join condition.
• Watch out for variable-length strings, which might cause pessimistic full-length memory allocation for temporary tables and sorting.
• Try to use integers for identifiers if you can. • Avoid the legacy MySQL-isms such as specifying precisions for floating-point numbers or display widths for integers.
• Be careful with ENUM and SET. They’re handy, but they can be abused, and they’re tricky sometimes. BIT is best avoided.
Normalization is good, but denormalization (duplication of data, in most cases) is sometimes actually necessary and beneficial. We’ll see more examples of that in the next chapter. And precomputing, caching, or generating summary tables can also be a
big win. Justin Swanhart’s Flexviews tool can help maintain summary tables. Finally, ALTER TABLE can be painful because in most cases, it locks and rebuilds the whole table. We showed a number of workarounds for specific cases; for the general
case, you’ll have to use other techniques, such as performing the ALTER on a replica and then promoting it to master. There’s more about this later in the book.