Archive

Archive for the ‘Tech’ Category

TiskTasks for Todoist

August 8th, 2011

Two years ago I gushed over how much I love todoist.com. I still use Todoist every day, for both work and personal tasks. It’s an amazingly simple, clean, user-friendly website that has kept me mentally organized for several years.

Yesterday, I released my second Android app, TiskTasks for Todoist.  TiskTasks attempts to mimic the beautiful simplicity of the todoist.com website in a touch-friendly format.  This is my first paid app ($0.99), and I’m pretty happy with how it’s turned out.

If you’re a Todoist user, please consider TiskTasks.

TiskTasks is available on the Android Market and via the Amazon Android Appstore.

MySQL: Converting an incorrect latin1 column to utf8

April 17th, 2011

I recently stumbled across a major character encoding issue on one of the websites I run.  Through resolving the issue, I learned a lot about the complexities of supporting international character sets in a LAMP (Linux, Apache, MySQL, PHP) environment.  Fixing the problem was a challenge, so I wanted to share some of the knowledge I gained in case anyone else finds similar issues on their own websites.

The post below is a long yet detailed account of my experience.  Let me know if you’ve had similar experiences or found another solution for this type of issue.

Oh My Münchhausen!!!

A couple of days ago I was notified by a visitor of one of my websites that searching for a term with a non-ASCII character in it (in this case, “Münchhausen”) was returning over 500 results, though none of the results actually matched the given search term.

I started looking into the issue, and saw the same thing he was. The debug logs from the search page showed the following SQL query being used:

mysql> SELECT * FROM MyTable WHERE city = 'Münchhausen';
...
(7ms, 532 rows affected)

However, none of the results actually contained “Münchhausen” for the city.

I took the exact same query and ran it in the command-line mysql client. Strangely, this returned a different result:

mysql> SELECT * FROM MyTable WHERE city = 'Münchhausen';
(0ms, 0 rows affected)

The exact same query, run instead from the command line, returned 0 rows.

Assuming this had something to do with the “ü” character, I started a long journey of re-learning what character encodings are all about, including what UTF-8, latin1 and Unicode are, and how they are used in MySQL.

What’s Going On?

latin1, AKA ISO 8859-1 is the default character set in MySQL 5.0. latin1 is a 8-bit-single-byte character encoding, as opposed to UTF-8 which is a 8-bit-multi-byte character encoding. latin1 can represent most of the characters in the English and European alphabets with just a single byte (up to 256 characters at a time). UTF-8, on the other hand, can represent every character in the Unicode character set (over 109,000 currently) and is the best way to communicate on the Internet if you need to store or display any of the world’s various characters.

You can specify a default character set per MySQL server, database, or table. The defaults for a database will get applied to new tables, and the defaults for a table will get applied to new columns. You can change the defaults at any time (ALTER TABLE, ALTER DATABASE), but they will only get applied to new tables and columns.

You can also specify the character set you’re using for client connections (via the command line, or through an API like PHP’s mysql functions).

Getting back to the Münchhausen Problem, one of the things I initially checked was what character set PHP was talking to MySQL with:

mysql>  SHOW VARIABLES LIKE 'character_set_client';
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| character_set_client | latin1 |
+----------------------+--------+
1 row in set (0.00 sec)

Knowing the character “ü” is represented differently in latin1 versus UTF-8 (see below), and taking a wild stab in the dark, I tried to force my PHP application to use UTF-8 when talking to the database to see if this would fix the issue:

mysql_set_charset('utf8');

Voila! The problem was fixed! Searching for “Münchhausen” on the site returned 0 results ( the correct number of matches).

The Problem With “ø” Is That It’s Not “ø”

Or was it? A couple minutes later, I was browsing the site and started coming across funky characters everywhere. For example, a page that previously had the text “Graffiti by Dolk and Pøbel” was now reading “Graffiti by Dolk and Pøbel”. These strange character sequences also looked like an issue I had noticed from time to time in phpMyAdmin with edit fields showing strange characters. As long as I didn’t edit the strange characters, they displayed correctly when PHP spit them back out as HTML, so I hadn’t though much of it until now.

Seeing these strange characters sequences everywhere scared me enough to look into the problem a bit more. I disabled the call to mysql_set_charset() and the site reverted to the previous “correct” behavior of talking to the server via latin1 and displaying “Graffiti by Dolk and Pøbel”.

Back to square one.

It’s All Bits From MySQL’s Point of View

Since the term “Münchhausen” was returning inappropriate results, I tried other search terms that contained non-ASCII characters. For example, I searched for the city “São Paulo”:

mysql> SELECT city FROM MyTable WHERE city = 'São Paulo';
+-----------+
| city      |
+-----------+
| Sao Paulo |
| Sao Paulo |
| Sao Paulo |
...
| Sao Paulo |
(2ms, 72 rows affected)

As you can see, the search term kind-of worked. It found occurrences of “Sao Paulo” but not “São Paulo”. I know there are rows with São in the database, so the query wasn’t working 100% correctly.

I changed the query slightly to a wildcard match instead of the non-ASCII character:

mysql> SELECT city FROM MyTable WHERE city LIKE 'S%o Paulo';
+------------+
| city       |
+------------+
| Sao Paulo  |
| Sao Paulo  |
| Sao Paulo  |
...
| São Paulo |
| São Paulo |
(5ms, 348 rows affected)

This search worked a bit better — it found rows with cities of both Sao Paulo and São Paulo. However, it returned the character sequence “ã” for São Paulo for some reason.

The reason for this is, from MySQL’s point of view, the data stored within its tables are all just bits. The column type and character set of a column determine how queries work against the data and how the data is returned as a result of a SELECT query.

It’s probably pretty obvious by now that my city column wasn’t the right character set. It was set to latin1 when the database was created.

The interesting thing is that my web application, which uses PHP, didn’t seem to mind this very much. Storing and retrieving from the city column is binary-safe — that is, MySQL doesn’t modify the data PHP sends it via the mysql extension.   Later, MySQL will give PHP the exact same data (bits) back.  The problems only occur when you ask MySQL to, on its own, analyze the column or present it.

So all this time, my PHP web application had been storing UTF-8-encoded data in the city column, and later retrieving the exact same (binary) data which it display on the website. MySQL doesn’t modify the data for simple UPDATEs and SELECTs, so the UTF-8 characters were all still displayed properly on the website.

To fix the above SQL query, we can actually force MySQL to re-interpret the data as a specific character encoding by first converting the data to a BINARY type then casting that as UTF-8.  If you try to simply CONVERT USING utf8, MySQL will helpfully convert your garbage-latin1 characters to garbage-utf8 characters.  The two-step process of temporarily converting to BINARY ensures that MySQL doesn’t try to re-interpret the column in the other character encoding.  If we don’t convert to BINARY, MySQL would end up displaying the same ã characters even in UTF-8 output.

So we CAST to BINARY temporarily first, then CONVERT this USING UTF-8:

mysql> SELECT city, CONVERT(CAST(city as BINARY) USING utf8) FROM MyTable WHERE city = 'São Paulo';
+------------+-------------+
| city       | CONVERT...  |
+------------+-------------+
| Sao Paulo  | Sao Paulo   |
| Sao Paulo  | Sao Paulo   |
...
| São Paulo | São Paulo   | ...

Success!  We’ve tricked MySQL into giving us the UTF-8 interpretation of our latin1 column on the fly, and we see that São Paulo is represented properly.

Why do they differ?

Some background: Why is “ã” represented differently in latin1 vs UTF-8?

The character ã in latin1 is character code 0xE3 in hex, or 227 in decimal. It’s 8 bits would be represented as:

1110 0011

latin1 is a single-byte encoding, so each of the 256 characters are just a single byte.

The UTF-8 encoding was designed to be backward-compatible with ASCII documents, for the first 128 characters. For characters above #128, a multi-byte sequence describes the character.

Our character ã, #227, misses the single-byte compatibility with ASCII’s first 128 characters and must be represented in two bytes as described on the Wikipedia UTF-8 page.

Here’s a representation of the character ã in both encodings:

   0000 0000 1110 0011 ASCII character (latin1) (0x00E3)
-> 1100 0011 1010 0011 UTF-8 encoding (0xC3A3)

UTF-8 encoding turns our ã, represented as 0xE3 in latin1, into two bytes, 0xC3A3 in UTF-8.

Note that these two bytes 0xC3 and 0xA3 in UTF-8 happen to look like this in latin1:

0xC3 = Ã in latin1

0xA3 = £ in latin1

So the UTF-8 encoding of ã explains precisely why we see it reinterpreted as ã in latin1.

What To Do, What To Do

At this point, it’s obvious that I messed up somewhere. More precisely, the city column should be UTF-8, since PHP has always been putting UTF-8 data in it. The big reason I hadn’t noticed an issue up to this point is that while the MySQL column is latin1, my PHP app was getting this data and calling htmlentities to convert the UTF-8 characters to HTML codes before displaying them. My website’s visitors saw proper UTF-8 characters on the website even though the MySQL column was latin1.

As we’ve seen, issues start occurring when you do queries against the data. The SELECT above was using a UTF-8 character “ü” for “Münchhausen”, and when comparing this to “latin1” data in the column, MySQL gets confused (can you blame it?). Additional issues can appear with applications that display the “natural” encoding of the column (such as phpMyAdmin): they show the strange character sequences as seen above, instead of UTF-8 decoded characters.

The core of the problem is that the MySQL database was created several years ago and the default collation at the time was latin1_swedish_ci. Over the years, I changed the default to utf8_general_ci for new columns, but existing tables and columns weren’t changed.  I have over 100 tables in latin1 that should be UTF-8 and need to be converted.

So I started investigating what it takes to convert my existing latin1 tables to UTF-8 as appropriate. Once I set the character encoding properly, queries against the database should work better and I shouldn’t have to worry about these types of issues in the future.

There are a couple ways to make the conversion.

Some people have successfully exported their data to latin1, converted the resulting file to UTF-8 via iconv or a similar utility, updated their column definitions, then re-imported that data. Unfortunately this requires taking the database down as tables are dropped and re-created, and this can be a bit time-consuming.

I was hoping for a process that I could apply to an online database, and luckily I found some good notes by Paul Kortman and fabio, so I combined some of their ideas and automated the process for my site. I hit a couple issues along the way, so I wanted to share the steps that worked for me.

The Process

Using the method described on fabio’s blog, we can convert latin1 columns that have UTF-8 characters into proper UTF-8 columns by doing the following steps:

  1. Convert the column to the associated BINARY-type (ALTER TABLE MyTable MODIFY MyColumn BINARY)
  2. Convert the column back to the original type and set the character set to UTF-8 at the same time (ALTER TABLE MyTable MODIFY MyColumn TEXT CHARACTER SET utf8 COLLATE utf8_general_ci)

This is a similar approach to our SELECT CONVERT(CAST(city as BINARY) USING utf8) trick above, where we basically hide the column’s actual data from MySQL by masking it as BINARY temporarily.

If you simply force the column to UTF-8 without the BINARY conversion, MySQL does a data-changing conversion of your “latin1” characters into “UTF-8” and you end up with improperly converted data. Converting the column to BINARY first forces MySQL to not realize the data was in UTF-8 in the first place.

We need to convert each source column type (CHAR vs. VARCHAR vs. TEXT, etc) into its associated BINARY type (BINARY vs. VARBINARY vs. BLOB).

Additionally, the MODIFYs to BINARY and back need to retain the entire column definition. For TEXT types, a simple TEXT to BLOB conversion is sufficient. But for column definitions that have specified lengths, defaults or NOT NULL:

MyColumn VARCHAR(255) DEFAULT '' NOT NULL

We need to MODIFY keeping the same attributes, or the column definition will be fundamentally changed (see notes in ALTER TABLE). In this case, we would specify:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(255) DEFAULT '' NOT NULL;

If we don’t specify the length, default and NOT NULL, the columns aren’t the same as before the conversion.

The Script

The script at the bottom of this post automates the conversion of any UTF-8 data stored in latin1 columns to proper UTF-8 columns.

I modified fabio’s script to automate the conversion for all of the latin1 columns for whatever database you configure it to look at. It converts the columns first to the proper BINARY cousin, then to utf8_general_ci, while retaining the column lengths, defaults and NULL attributes.

Warning: This script assumes you know you have UTF-8 characters in a latin1 column. Please test your changes before blindly running the script!

Here are the steps you should take to use the script:

Step 1. Determine Which Columns Need Updating

If you’re like me, you may have a mixture of latin1 and UTF-8 columns in your databases.  Not all of the columns in my database needed to be updated from latin1 to UTF-8.  For example, some of the tables belonged to other PHP apps on the server, and I only wanted to update the columns that I knew had to be fixed.  The script will currently convert all of the tables for the specified database – you could modify the script to change specific tables or columns if you need.

Additionally, the script will only update appropriate text-based columns.  Character sets are only appropriate for some types of data: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT. Other column types such as numeric (INT) and BLOBs do not have a “character set”.

You can see what character sets your columns are using via the MySQL Administration tool, phpMyAdmin, or even using a SQL query against the information_schema:

mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable" AND COLLATION_NAME LIKE "latin1%";
...
115 rows in set (0.03 sec)

Step 2. Test Convert the Columns

You should test all of the changes before committing them to your database.

The first thing to test is that the SQL generated from the conversion script is correct.  To do this, you can dump the structure of your database:

server> mysqldump -d -h localhost -u dbuser -p mydatabase > structure.sql

And import this structure to another test MySQL database:

server> mysql -u dbuser -p mydatabase_test < structure.sql

Next, run the conversion script (below) against your temporary database:

server> php -f mysql_convert_latin1_to_utf8.php

The script will spit out “!!! ERROR” statements if a change fails.  If you encounter ERRORs, modifications may be needed based on your requirements.  Some of the common problems are listed in Step 3.

After you run the script against your temporary database, check the information_schema tables to ensure the conversion was successful:

mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable";

As long as you see all of your columns in UTF8, you should be all set!

Step 3. Problems You May Encounter

If it were only that simple. I hit some issues along the way.

Step 3.1. FULLTEXT indexes

I have several columns with FULLTEXT indexes on them.  The ALTER TABLE to BINARY command for a column that has a FULLTEXT index will cause an error:

mysql> ALTER TABLE MyTable MODIFY MyColumn BLOB;
ERROR 1283 (HY000): Column 'MyColumn' cannot be part of FULLTEXT index

The simple solution I came up with was to modify the script to drop the index prior to the conversion, and restore it afterward:

ALTER TABLE MyTable DROP INDEX `mycolumn_fulltext`

... (convert all columns) ...

ALTER TABLE MyTable ADD FULLTEXT KEY `mycolumn_fulltext` (`MyColumn`)

There are TODOs listed in the script where you should make these changes.

Step 3.2. Invalid UTF-8 data

Since my database was over 5 years old, it had acquired some cruft over time. I’m not sure exactly how this happened, but some of the columns had data that are not valid UTF-8 encodings, though they were valid latin1 characters. I believe this occurred before I hardened my PHP application to reject non-UTF-8 data, but I’m not sure. I found this out when initially trying to do the conversion:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(3000) NOT NULL DEFAULT '';
Query OK, 21171 rows affected (0.66 sec)

mysql> ALTER TABLE MyTable MODIFY MyColumn varchar(3000) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '';
ERROR 1366 (HY000): Incorrect string value: '\xE2\x80? fl...' for column 'MyColumn' at row 128

What’s going on?

At some point, a character sequence that contained invalid UTF-8 characters was entered into the database, and now MySQL refuses to call the column VARCHAR (as UTF-8) because it has these invalid character sequences.

I checked the HTML representation of this column in my PHP website, and sure enough, the garbage shows up there too:

… �? flown …

The � is the actual character that your browser shows. Not the best user experience, and definitely not the correct character.

I fixed that single row (via phpMyAdmin), and ran the ALTER TABLE MODIFY command again — same issue, another row. Looks like there is more than a single corrupt row.

I found a good way of rooting out all of the columns that will cause the conversion to fail. If you SELECT CONVERT (MyColumn USING utf8) as a new column, any NULL columns returned are columns that would cause the ALTER TABLE to fail.

For example:

mysql> SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
       FROM MyTable
       WHERE CONVERT(MyColumn USING utf8) IS NULL
...
5 rows in set, 10 warnings (0.05 sec)

This showed me the specific rows that contained invalid UTF-8, so I hand-edited to fix them. You could manually NULL them out using an UPDATE if you’re not afraid of losing data. I had to do this for 6 columns out of the 115 columns that were converted.  Only 30 rows in total were corrupt.

Step 4. Profit?

At this point, it may take some guts for you to hit the go button on your live database.

Personally, I ran the script against a test (empty) database, then a copy of my live data, then a staging server before finally executing it on the live data.

Warning: Please be careful when using the script and test, test, test before committing to it!

The Script

The script can be found at Github: https://github.com/nicjansma/mysql-convert-latin1-to-utf8. If you find bugs or want to contribute changes, please head there.

Final Words

I hope what I’ve learned will be useful to others.

If you hit any problems with the conversion script, please let me know.

Android development and the Minifig Collector app

January 31st, 2011
Comments Off on Android development and the Minifig Collector app

About a year ago, I started taking a look at the Android OS as a platform for mobile application development.  I had recently purchased a Motorola Droid (on Verizon), and wanted to learn what it takes to develop an Android app.

My goal was to start out small, and learn the ins and outs of the platform.  I was also a bit rusty with Java, so I wanted to develop a small yet fun app.

Around the same time, LEGO came out with their Series 1 Collectible Minifigures (minifigs).  There are 18 minifigs to collect.  Inside each $3 bag sold is a single minifig, but the bags don’t say which minifig is inside.  You can feel around the contents of the bag and search for known shapes of their accessories, but that is tedious work and error prone.  Luckily someone noticed that each bag has two barcodes, and one of them is unique and consistent to each minifig.

So after a few weeks of coding, I released my first (free) app, called Minifig Collector on the Android marketplace:

It’s a pretty simple app, but it allows you to determine which minifig is in each bag by pointing your phone’s camera at the barcode, which is then scanned and decoded:

Once you’ve scanned the minifig, you can keep track of which ones you’ve found:

The app is pretty simple, but developing it gave me a good introduction to the platform and how the Android UI works.  I was able to utilize a free library called ZXing to handle the camera’s barcode scanning/decoding.

I’ve learned a lot about Android development along the way, and I am preparing a few follow-up posts about some of the things I’ve learned.  While the app is simple, I went a little overboard and explored unit testing, code style enforcement, continuous integration and debugging to see how it would work for future (paid) apps.  This lead to a bit of head-scratching about how to do various things, so I want to share my findings with others that are interested in Android development.

The app has about 5,000 downloads in the Android marketplace, and it looks like it’s helped a few people along the way.  Considering how much I’ve learned through its development, I’m pretty happy with my time investment in it.  I’m now working on a paid app.

Unfortunately, LEGO learned from community’s resourcefulness and removed the identifying barcodes in Series 3, though they left some identifying marks (‘bumpcodes’) on them so people can still (sometimes) determine which minifig is inside.  The phone’s camera can’t detect these codes, but the Minifig Collector been updated to show an image of the bumpcodes and still allows you to keep track of which Minifigs you have.

MySQL: TEXT vs. VARCHAR Performance

January 20th, 2011

Starting with MySQL 5.0.3, the maximum field length for VARCHAR fields was increased from 255 characters to 65,535 characters.  This is good news, as VARCHAR fields, as opposed to TEXT fields, are stored in-row for the MyISAM storage engine (InnoDB has different characteristics).  TEXT and BLOB fields are not stored in-row — they require a separate lookup (and a potential disk read) if their column is included in a SELECT clause.  Additionally, the inclusion of a TEXT or BLOB column in any sort will force the sort to use a disk-based temporary table, as the MEMORY (HEAP) storage engine, which is used for temporary tables, requires.  Thus, the benefits of using a VARCHAR field instead of TEXT for columns between 255 and 65k characters seem obvious at first glance in some scenarios: potentially less disk reads (for queries including the column, as there is no out-of-row data) and less writes (for queries with sorts including the column).

Following a review of my MySQL server instance’s performance metrics (mysqltuner.sh and mysql tuning-primer.sh point out interesting things), I found that approximately 33% of the server’s temporary tables used on-disk temporary tables (as opposed in-memory tables).  On-disk temporary tables can be created for several reasons, most notably if the resulting table will be larger than the minimum of MySQL’s tmp_table_size/max_heap_table_size variables OR when a TEXT/BLOB field is included in the sort.

Since any operation that uses the disk for sorting will be noticeably slower than using RAM, I started investigating why so many of my temporary tables went to disk.  One of my website’s most heavily used MyISAM tables contained two TEXT columns, which were textual descriptions of an object, submitted by the website’s visitors.  The table schema was created prior to MySQL v5.0.3, when VARCHAR fields were limited to 255 characters or less.  For this object, I wanted to allow descriptions larger than 255 characters, so the table utilized the two TEXT fields to store this data.

So my first thought in reducing the number of on-disk temporary tables was to convert these TEXT fields to VARCHAR.  I sampled the column’s values, and found that the two fields’ maximum input sizes were currently around 8KB.  Not thinking too much about what size I wanted to support, I decided that I could set both fields to VARCHAR(30000) instead of TEXT.  I changed the fields, verified through automated tests that everything still worked, and called it a night.

Over the next two days, I noticed that there were several alarming metrics trending the wrong way.  I utilize Cacti to monitor several hundred server health metrics — it is great for showing trends and alerting about changes.  Unfortunately, it was reporting that server load, page load time and disk activity were all up significantly — especially disk writes.  Wondering if it was a fluke, I left the server alone for another day to see if it would subside, but the high load and disk writes continued consistently.  The load was causing real user-perceived performance impacts for the website’s visitors, causing average page load time to increase from 70ms to 470ms.

Here’s what the Cacti graphs looked like:

Wouldn’t you be alarmed?

Not wanting to run an intensive performance review or diagnostics on the live website, I came up with a plan for how I would diagnose what the problem was:

  1. Enable temporary lightweight tracing on the server to try to determine the source of the increased disk activity (iotop or dstat).
  2. If MySQL is causing most of the disk activity, temporarily revert the VARCHAR(30000) columns back to TEXT, as I suspected they were somehow the cause of the slowdown.
  3. Perform a deeper analysis of the problem on a separate machine.

Running iotop on the server confirmed that a majority of the disk writes were coming from MySQL.  Additionally, after I reverted the columns to TEXT, the server load and page load times went back to normal.

So why did my seemingly obvious “performance win” end up as a performance dud?  Shame on me for not testing and verifying the changes before I pushed them live!

I didn’t want to do any more diagnosis of the problem on the live Linux server — there’s no point in punishing my visitors.  I have a separate Linux development server that I could have used to dig a little deeper, but I’m more comfortable doing performance analysis on Windows, and luckily, MySQL has Windows ports.

For almost all of my performance analysis work, I use the excellent Windows Performance Tools (WPT) Kit (using ETW and xperf).  If you haven’t used the WPT and xperf tools before, there are some good guides on trace capturing using ETW and visual analysis via xperfview on MSDN.  ETW is a lightweight run-time tracing environment for Windows.  ETW can trace both kernel (CPU, disk, network and other activity) and user mode (application) events, and saves them to a .ETL file that can later be processed.  The Windows Performance team and many other teams in Windows regularly use ETW/xperf for a majority of their performance analysis.

To figure out what’s going on with our VARCHAR columns, I first needed to ensure that I could replicate the problem on my Windows MySQL machine.  I installed MySQL 5 and loaded a snapshot of my live database into the instance.

I then looked at the MySQL Slow Query Log from the real server to see what queries were taking a long time (>1 second).  There were thousands of instances of a query that looked something like this:

SELECT   t2.*
FROM     table1 t1, table2 t2
WHERE    t2.t1id = t1.id
ORDER BY t1.id
LIMIT    0, 10

Which looks innocent enough, but Table2 is the table that I changed the two TEXT fields to VARCHAR, and I’m querying all of the columns (SELECT *) from it.  Before, because of the TEXT column, this query would’ve used an on-disk temporary table for the results (because the MySQL manual tells us this is the case for results that need temporary tables and have TEXT columns).  So why is this query appearing to be so much slower now?

First of all, I checked how this query responded on my Windows MySQL instance:

mysql> SELECT t2.*
FROM table1 t1, table2 t2
WHERE t2.t1id = t1.id
ORDER BY t1.id
LIMIT 0, 10;
...
10 rows in set (1.71 sec)

This confirmed the issue appeared on my development machine as well!  This query should be nearly instantaneous (less than 50ms), and 1,710 milliseconds is a long time to wait for a single query of many in a page load.

My guess at this point was it had something to do with temporary tables.  And disks.  Since that was what I was trying to improve with my TEXT to VARCHAR change, it only makes sense that I somehow made it worse.  So to validate this theory, I enabled a bit of lightweight ETW tracing to see how the OS handled the query.

I started ETW tracing to get basic CPU and disk IO information:

xperf -on base+fileio

Then I re-ran the above query in the MySQL command line, and saved the trace to disk (trace.etl) after the query finished:

xperf -d trace.etl

Loading the trace up in xperfview showed some interesting things:

xperfview trace.etl

From a CPU usage perspective, the CPU (on my quad-core system) seemed to be pretty active, but not 100% utilized.  Looking at the CPU Sampling Summary Table, I found that mysqld-nt.exe was using 1,830ms / 17% of my CPU (essentially 68% of one of the cores).  Not bad, but not maxed out either.  But what’s interesting here was the disk utilization graph.  For a period of ~700ms, we’re 100% utilized.  Highlighting that region and viewing the summary table showed where we spent our time:

#sql_1a00_0.MYD is a temporary table from MySQL (which can be confirmed from the File IO graph).  In this case, our single query caused 38MB of disk writes and ~626ms to write/read it back in.

Huh?

At this point, I wanted to double-check that the TEXT to VARCHAR change caused this.  I updated the column to TEXT, and re-run the same query:

mysql> SELECT t2.*
FROM table1 t1, table2 t2
WHERE t2.t1id = t1.id
ORDER BY t1.id
LIMIT 0, 10;
...
10 rows in set (0.03 sec)

Well, 0.03 seconds is a lot faster than 1.71 seconds.  This is promising.  I took another ETW trace of the query on with the TEXT field:

After switching back to TEXT fields, mysql used ~30ms of CPU and caused no disk activity.

Now that I knew what was causing the slowdown, I wanted to try to fully understand why this was the case.  Remember, I started down this path originally because I found that I had a high portion of temporary tables were on-disk temporary tables.  In the interest of seeing less disk activity on my server, I attempted to change several TEXT columns (which can cause on-disk temporary tables) to VARCHAR(30000) columns.  However, I didn’t fully look into what was causing the on-disk temporary tables, and instead just guessed.  As a result, my server’s perf tanked!

Now’s a good time to review the first paragraph of this post.  There are several reasons MySQL may use an internal temporary table.  One interesting quote:

Such a [temporary] table can be held in memory and processed by the MEMORY storage engine, or stored on disk and processed by the MyISAM storage engine.

and

If an internal temporary table is created initially as an in-memory table but becomes too large, MySQL automatically converts it to an on-disk table. The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values. This differs from MEMORY tables explicitly created with CREATE TABLE: such tables, the max_heap_table_size system variable determines how large the table is permitted to grow and there is no conversion to on-disk format.

So a temporary table can start out as a MEMORY table, then if MySQL realizes it’s too big for tmp_table_size and max_heap_table_size, it may convert it to a MyISAM table on the disk.  One caveat with the MEMORY engine is:

MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length.

How did this behavior affect me, and cause the 30ms query to take 1,710 ms?  Well let’s start out with the two newly VARCHAR(30000) columns.  In a normal MyISAM table, with a dynamically sized row, these two columns only take as much space as the data they contain (plus 1 byte).  That is, if I had a row and these two columns only had 10 bytes of data in them, the row size would be (10+1)*2+[other columns].  However, if I happened to convert this MyISAM table to use fixed-length rows, or I was using the MEMORY storage engine, the row size would be 30000*2+[other columns].  Currently, according to my dataset, these dynamically sized rows only required an average of 1,648 bytes per row.

And that’s the crux of the problem.  My query above, simple enough, requires a temporary table to do its work.  We can verify this via the MySQL EXPLAIN command:

mysql> EXPLAIN SELECT t2.*
FROM table1 t1, table2 t2
WHERE t2.t1id = t1.id
ORDER BY t1.id
LIMIT 0, 25;
+----+-------------+-------+--------+---------------+---------+---------+------
| id | select_type | table | key     | rows | Extra                           |
+----+-------------+-------+--------+---------------+---------+---------+------
|  1 | SIMPLE      | t2    | NULL    | 8813 | Using temporary; Using filesort |
|  1 | SIMPLE      | t1    | PRIMARY | 1    | Using index                     |
+----+-------------+-------+--------+---------------+---------+---------+------

(I trimmed a couple columns to fit to the page’s width).

Here, we see our t2 table Using temporary.  MySQL converted the 8,813 dynamic-row columns to fixed-length, which expanded the VARCHARS to their full size: approximately 60,600 byes per-row.  That’s 8,813 rows * 60,600 bytes = 534,067,800 bytes to deal with!  The server’s tmp_table_size variables decided this wasn’t good for an in-memory temporary table, so MySQL ended up moving a lot of this work to disk.  As a result, we had ~700ms of disk writes with this query when using VARCHAR(30000) columns.

There are a couple ways to avoid this behavior in MyISAM tables:

  1. Use TEXT fields, with their known caveats.
  2. Use a smaller, more reasonable VARCHAR size.  These fields probably don’t need to hold more than 10k of data.  One could reduce their size to 10k or smaller, or even move them to another table and only retrieve them when necessary.
  3. Fiddle with the tmp_table_size and max_heap_table_size variables.  These two variables dictate which queries use on-disk temporary tables, as described here.  They are set at approximately 35mb/16mb by default (on my Windows MySQL 5.1 instance).

I made a couple changes.  I changed one of the original 30k fields to 10k, and and changed the other one to 1k.  This reduced the potential row size in MEMORY temporary tables tremendously.  I also upped the tmp_table_size and max_heap_table_size variables to 128MB on my server.  The combination of these two changes ensured that the specific query above was no longer causing all of the performance issues (for now).  I should probably move the 10k field to another table (or back to TEXT) to be sure.

I probably didn’t need to use ETW and xperf here to look into things.  Since I was aware that the changes I made to the database had a high correlation with the slow-downs I was seeing, and reverting these changes fixed the issue, I could have probably figured out what was going on by reading the MySQL manual a bit more.  But I like to dig into things, and I think xperf can help visually communicate what’s going on with your system for problems like this.  Seeing how much blocking IO a single query can cause really sheds light on things!

Admittedly, the server MySQL is running on also hosts a web server and multiple sites.  A dedicated SQL server with fast, dedicated disks would help minimize problems like this.

One interesting note is that MySQL version 5.1 on Windows doesn’t have the same IO patterns as version 5.0 did – I see the same File IO for temporary tables with the VARCHAR fields, but not the same amount of disk activity.  This could mean that 5.1 memory maps or caches a lot of the on-disk temporary file and doesn’t end up actually writing it out to disk.  I am unsure why.

And again, InnoDB has different performance characteristics.  I did not do any testing on InnoDB.

So at the end of the day, my server is back to normal, and I’ve learned a little bit more about how MySQL works.  I hope you did too.

The Economist, and, The Kindle: Take 2

January 3rd, 2010

A while ago I had written about how you can get The Economist on your Kindle (and other e-readers) by running a simple PHP script that crawls the economist.com and generates a .mobi file that it emails it to your Kindle weekly. Unfortunately (though understandably), around July 2009 they locked out their This Week’s Print Edition website to only subscribers of their online and print editions.

With a little bit of work, I’ve updated the economist-to-kindle.php PHP script to handle logging into the economist.com’s website with your user-name and password so it can generate a Kindle version again:

https://github.com/nicjansma/economist-to-kindle

With this update, and if you’re a print edition subscriber, you should be able to get this week’s edition on your Kindle again.

Updated 2010/01/25: Several bugfixes, see comments for details.

Updated 2010/07/22: slifox and crosscode have made some great additions to the code and got it working with the Economist.com’s latest site structure. Check out crosscode’s latest version or read the comments for details.

Updated 2011/05/02: Based on crosscode’s latest version, I’ve updated the script on this site (https://github.com/nicjansma/economist-to-kindle) to work with recent Economist.com articles.

Updated 2011/07/26: Small update to work with the economist.com’s latest updates: https://github.com/nicjansma/economist-to-kindle

Updated 2012/01/04: I’ve moded this project to Github: https://github.com/nicjansma/economist-to-kindle. If you have any suggestions, find bugs, or want to contribute, please head there.