NicJ.net

MySQL: Converting an incorrect latin1 column to utf8

I recently stumbled across a major character encoding issue on one of the websites I run.  Through resolving the issue, I learned a lot about the complexities of supporting international character sets in a LAMP (Linux, Apache, MySQL, PHP) environment.  Fixing the problem was a challenge, so I wanted to share some of the knowledge I gained in case anyone else finds similar issues on their own websites.

The post below is a long yet detailed account of my experience.  Let me know if you’ve had similar experiences or found another solution for this type of issue.

Oh My Münchhausen!!!

A couple of days ago I was notified by a visitor of one of my websites that searching for a term with a non-ASCII character in it (in this case, “Münchhausen”) was returning over 500 results, though none of the results actually matched the given search term.

I started looking into the issue, and saw the same thing he was. The debug logs from the search page showed the following SQL query being used:

mysql> SELECT * FROM MyTable WHERE city = 'Münchhausen';
...
(7ms, 532 rows affected)

However, none of the results actually contained “Münchhausen” for the city.

I took the exact same query and ran it in the command-line mysql client. Strangely, this returned a different result:

mysql> SELECT * FROM MyTable WHERE city = 'Münchhausen';
(0ms, 0 rows affected)

The exact same query, run instead from the command line, returned 0 rows.

Assuming this had something to do with the “ü” character, I started a long journey of re-learning what character encodings are all about, including what UTF-8, latin1 and Unicode are, and how they are used in MySQL.

What’s Going On?

latin1, AKA ISO 8859-1 is the default character set in MySQL 5.0. latin1 is a 8-bit-single-byte character encoding, as opposed to UTF-8 which is a 8-bit-multi-byte character encoding. latin1 can represent most of the characters in the English and European alphabets with just a single byte (up to 256 characters at a time). UTF-8, on the other hand, can represent every character in the Unicode character set (over 109,000 currently) and is the best way to communicate on the Internet if you need to store or display any of the world’s various characters.

You can specify a default character set per MySQL server, database, or table. The defaults for a database will get applied to new tables, and the defaults for a table will get applied to new columns. You can change the defaults at any time (ALTER TABLE, ALTER DATABASE), but they will only get applied to new tables and columns.

You can also specify the character set you’re using for client connections (via the command line, or through an API like PHP’s mysql functions).

Getting back to the Münchhausen Problem, one of the things I initially checked was what character set PHP was talking to MySQL with:

mysql>  SHOW VARIABLES LIKE 'character_set_client';
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| character_set_client | latin1 |
+----------------------+--------+
1 row in set (0.00 sec)

Knowing the character “ü” is represented differently in latin1 versus UTF-8 (see below), and taking a wild stab in the dark, I tried to force my PHP application to use UTF-8 when talking to the database to see if this would fix the issue:

mysql_set_charset('utf8');

Voila! The problem was fixed! Searching for “Münchhausen” on the site returned 0 results ( the correct number of matches).

The Problem With “ø” Is That It’s Not “ø”

Or was it? A couple minutes later, I was browsing the site and started coming across funky characters everywhere. For example, a page that previously had the text “Graffiti by Dolk and Pøbel” was now reading “Graffiti by Dolk and Pøbel”. These strange character sequences also looked like an issue I had noticed from time to time in phpMyAdmin with edit fields showing strange characters. As long as I didn’t edit the strange characters, they displayed correctly when PHP spit them back out as HTML, so I hadn’t though much of it until now.

Seeing these strange characters sequences everywhere scared me enough to look into the problem a bit more. I disabled the call to mysql_set_charset() and the site reverted to the previous “correct” behavior of talking to the server via latin1 and displaying “Graffiti by Dolk and Pøbel”.

Back to square one.

It’s All Bits From MySQL’s Point of View

Since the term “Münchhausen” was returning inappropriate results, I tried other search terms that contained non-ASCII characters. For example, I searched for the city “São Paulo”:

mysql> SELECT city FROM MyTable WHERE city = 'São Paulo';
+-----------+
| city      |
+-----------+
| Sao Paulo |
| Sao Paulo |
| Sao Paulo |
...
| Sao Paulo |
(2ms, 72 rows affected)

As you can see, the search term kind-of worked. It found occurrences of “Sao Paulo” but not “São Paulo”. I know there are rows with São in the database, so the query wasn’t working 100% correctly.

I changed the query slightly to a wildcard match instead of the non-ASCII character:

mysql> SELECT city FROM MyTable WHERE city LIKE 'S%o Paulo';
+------------+
| city       |
+------------+
| Sao Paulo  |
| Sao Paulo  |
| Sao Paulo  |
...
| São Paulo |
| São Paulo |
(5ms, 348 rows affected)

This search worked a bit better — it found rows with cities of both Sao Paulo and São Paulo. However, it returned the character sequence “ã” for São Paulo for some reason.

The reason for this is, from MySQL’s point of view, the data stored within its tables are all just bits. The column type and character set of a column determine how queries work against the data and how the data is returned as a result of a SELECT query.

It’s probably pretty obvious by now that my city column wasn’t the right character set. It was set to latin1 when the database was created.

The interesting thing is that my web application, which uses PHP, didn’t seem to mind this very much. Storing and retrieving from the city column is binary-safe — that is, MySQL doesn’t modify the data PHP sends it via the mysql extension.   Later, MySQL will give PHP the exact same data (bits) back.  The problems only occur when you ask MySQL to, on its own, analyze the column or present it.

So all this time, my PHP web application had been storing UTF-8-encoded data in the city column, and later retrieving the exact same (binary) data which it display on the website. MySQL doesn’t modify the data for simple UPDATEs and SELECTs, so the UTF-8 characters were all still displayed properly on the website.

To fix the above SQL query, we can actually force MySQL to re-interpret the data as a specific character encoding by first converting the data to a BINARY type then casting that as UTF-8.  If you try to simply CONVERT USING utf8, MySQL will helpfully convert your garbage-latin1 characters to garbage-utf8 characters.  The two-step process of temporarily converting to BINARY ensures that MySQL doesn’t try to re-interpret the column in the other character encoding.  If we don’t convert to BINARY, MySQL would end up displaying the same ã characters even in UTF-8 output.

So we CAST to BINARY temporarily first, then CONVERT this USING UTF-8:

mysql> SELECT city, CONVERT(CAST(city as BINARY) USING utf8) FROM MyTable WHERE city = 'São Paulo';
+------------+-------------+
| city       | CONVERT...  |
+------------+-------------+
| Sao Paulo  | Sao Paulo   |
| Sao Paulo  | Sao Paulo   |
...
| São Paulo | São Paulo   | ...

Success!  We’ve tricked MySQL into giving us the UTF-8 interpretation of our latin1 column on the fly, and we see that São Paulo is represented properly.

Why do they differ?

Some background: Why is “ã” represented differently in latin1 vs UTF-8?

The character ã in latin1 is character code 0xE3 in hex, or 227 in decimal. It’s 8 bits would be represented as:

1110 0011

latin1 is a single-byte encoding, so each of the 256 characters are just a single byte.

The UTF-8 encoding was designed to be backward-compatible with ASCII documents, for the first 128 characters. For characters above #128, a multi-byte sequence describes the character.

Our character ã, #227, misses the single-byte compatibility with ASCII’s first 128 characters and must be represented in two bytes as described on the Wikipedia UTF-8 page.

Here’s a representation of the character ã in both encodings:

   0000 0000 1110 0011 ASCII character (latin1) (0x00E3)
-> 1100 0011 1010 0011 UTF-8 encoding (0xC3A3)

UTF-8 encoding turns our ã, represented as 0xE3 in latin1, into two bytes, 0xC3A3 in UTF-8.

Note that these two bytes 0xC3 and 0xA3 in UTF-8 happen to look like this in latin1:

0xC3 = Ã in latin1

0xA3 = £ in latin1

So the UTF-8 encoding of ã explains precisely why we see it reinterpreted as ã in latin1.

What To Do, What To Do

At this point, it’s obvious that I messed up somewhere. More precisely, the city column should be UTF-8, since PHP has always been putting UTF-8 data in it. The big reason I hadn’t noticed an issue up to this point is that while the MySQL column is latin1, my PHP app was getting this data and calling htmlentities to convert the UTF-8 characters to HTML codes before displaying them. My website’s visitors saw proper UTF-8 characters on the website even though the MySQL column was latin1.

As we’ve seen, issues start occurring when you do queries against the data. The SELECT above was using a UTF-8 character “ü” for “Münchhausen”, and when comparing this to “latin1” data in the column, MySQL gets confused (can you blame it?). Additional issues can appear with applications that display the “natural” encoding of the column (such as phpMyAdmin): they show the strange character sequences as seen above, instead of UTF-8 decoded characters.

The core of the problem is that the MySQL database was created several years ago and the default collation at the time was latin1_swedish_ci. Over the years, I changed the default to utf8_general_ci for new columns, but existing tables and columns weren’t changed.  I have over 100 tables in latin1 that should be UTF-8 and need to be converted.

So I started investigating what it takes to convert my existing latin1 tables to UTF-8 as appropriate. Once I set the character encoding properly, queries against the database should work better and I shouldn’t have to worry about these types of issues in the future.

There are a couple ways to make the conversion.

Some people have successfully exported their data to latin1, converted the resulting file to UTF-8 via iconv or a similar utility, updated their column definitions, then re-imported that data. Unfortunately this requires taking the database down as tables are dropped and re-created, and this can be a bit time-consuming.

I was hoping for a process that I could apply to an online database, and luckily I found some good notes by Paul Kortman and fabio, so I combined some of their ideas and automated the process for my site. I hit a couple issues along the way, so I wanted to share the steps that worked for me.

The Process

Using the method described on fabio’s blog, we can convert latin1 columns that have UTF-8 characters into proper UTF-8 columns by doing the following steps:

  1. Convert the column to the associated BINARY-type (ALTER TABLE MyTable MODIFY MyColumn BINARY)
  2. Convert the column back to the original type and set the character set to UTF-8 at the same time (ALTER TABLE MyTable MODIFY MyColumn TEXT CHARACTER SET utf8 COLLATE utf8_general_ci)

This is a similar approach to our SELECT CONVERT(CAST(city as BINARY) USING utf8) trick above, where we basically hide the column’s actual data from MySQL by masking it as BINARY temporarily.

If you simply force the column to UTF-8 without the BINARY conversion, MySQL does a data-changing conversion of your “latin1” characters into “UTF-8” and you end up with improperly converted data. Converting the column to BINARY first forces MySQL to not realize the data was in UTF-8 in the first place.

We need to convert each source column type (CHAR vs. VARCHAR vs. TEXT, etc) into its associated BINARY type (BINARY vs. VARBINARY vs. BLOB).

Additionally, the MODIFYs to BINARY and back need to retain the entire column definition. For TEXT types, a simple TEXT to BLOB conversion is sufficient. But for column definitions that have specified lengths, defaults or NOT NULL:

MyColumn VARCHAR(255) DEFAULT '' NOT NULL

We need to MODIFY keeping the same attributes, or the column definition will be fundamentally changed (see notes in ALTER TABLE). In this case, we would specify:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(255) DEFAULT '' NOT NULL;

If we don’t specify the length, default and NOT NULL, the columns aren’t the same as before the conversion.

The Script

The script at the bottom of this post automates the conversion of any UTF-8 data stored in latin1 columns to proper UTF-8 columns.

I modified fabio’s script to automate the conversion for all of the latin1 columns for whatever database you configure it to look at. It converts the columns first to the proper BINARY cousin, then to utf8_general_ci, while retaining the column lengths, defaults and NULL attributes.

Warning: This script assumes you know you have UTF-8 characters in a latin1 column. Please test your changes before blindly running the script!

Here are the steps you should take to use the script:

Step 1. Determine Which Columns Need Updating

If you’re like me, you may have a mixture of latin1 and UTF-8 columns in your databases.  Not all of the columns in my database needed to be updated from latin1 to UTF-8.  For example, some of the tables belonged to other PHP apps on the server, and I only wanted to update the columns that I knew had to be fixed.  The script will currently convert all of the tables for the specified database – you could modify the script to change specific tables or columns if you need.

Additionally, the script will only update appropriate text-based columns.  Character sets are only appropriate for some types of data: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT. Other column types such as numeric (INT) and BLOBs do not have a “character set”.

You can see what character sets your columns are using via the MySQL Administration tool, phpMyAdmin, or even using a SQL query against the information_schema:

mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable" AND COLLATION_NAME LIKE "latin1%";
...
115 rows in set (0.03 sec)

Step 2. Test Convert the Columns

You should test all of the changes before committing them to your database.

The first thing to test is that the SQL generated from the conversion script is correct.  To do this, you can dump the structure of your database:

server> mysqldump -d -h localhost -u dbuser -p mydatabase > structure.sql

And import this structure to another test MySQL database:

server> mysql -u dbuser -p mydatabase_test < structure.sql

Next, run the conversion script (below) against your temporary database:

server> php -f mysql_convert_latin1_to_utf8.php

The script will spit out “!!! ERROR” statements if a change fails.  If you encounter ERRORs, modifications may be needed based on your requirements.  Some of the common problems are listed in Step 3.

After you run the script against your temporary database, check the information_schema tables to ensure the conversion was successful:

mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable";

As long as you see all of your columns in UTF8, you should be all set!

Step 3. Problems You May Encounter

If it were only that simple. I hit some issues along the way.

Step 3.1. FULLTEXT indexes

I have several columns with FULLTEXT indexes on them.  The ALTER TABLE to BINARY command for a column that has a FULLTEXT index will cause an error:

mysql> ALTER TABLE MyTable MODIFY MyColumn BLOB;
ERROR 1283 (HY000): Column 'MyColumn' cannot be part of FULLTEXT index

The simple solution I came up with was to modify the script to drop the index prior to the conversion, and restore it afterward:

ALTER TABLE MyTable DROP INDEX `mycolumn_fulltext`

... (convert all columns) ...

ALTER TABLE MyTable ADD FULLTEXT KEY `mycolumn_fulltext` (`MyColumn`)

There are TODOs listed in the script where you should make these changes.

Step 3.2. Invalid UTF-8 data

Since my database was over 5 years old, it had acquired some cruft over time. I’m not sure exactly how this happened, but some of the columns had data that are not valid UTF-8 encodings, though they were valid latin1 characters. I believe this occurred before I hardened my PHP application to reject non-UTF-8 data, but I’m not sure. I found this out when initially trying to do the conversion:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(3000) NOT NULL DEFAULT '';
Query OK, 21171 rows affected (0.66 sec)

mysql> ALTER TABLE MyTable MODIFY MyColumn varchar(3000) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '';
ERROR 1366 (HY000): Incorrect string value: '\xE2\x80? fl...' for column 'MyColumn' at row 128

What’s going on?

At some point, a character sequence that contained invalid UTF-8 characters was entered into the database, and now MySQL refuses to call the column VARCHAR (as UTF-8) because it has these invalid character sequences.

I checked the HTML representation of this column in my PHP website, and sure enough, the garbage shows up there too:

… �? flown …

The � is the actual character that your browser shows. Not the best user experience, and definitely not the correct character.

I fixed that single row (via phpMyAdmin), and ran the ALTER TABLE MODIFY command again — same issue, another row. Looks like there is more than a single corrupt row.

I found a good way of rooting out all of the columns that will cause the conversion to fail. If you SELECT CONVERT (MyColumn USING utf8) as a new column, any NULL columns returned are columns that would cause the ALTER TABLE to fail.

For example:

mysql> SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
       FROM MyTable
       WHERE CONVERT(MyColumn USING utf8) IS NULL
...
5 rows in set, 10 warnings (0.05 sec)

This showed me the specific rows that contained invalid UTF-8, so I hand-edited to fix them. You could manually NULL them out using an UPDATE if you’re not afraid of losing data. I had to do this for 6 columns out of the 115 columns that were converted.  Only 30 rows in total were corrupt.

Step 4. Profit?

At this point, it may take some guts for you to hit the go button on your live database.

Personally, I ran the script against a test (empty) database, then a copy of my live data, then a staging server before finally executing it on the live data.

Warning: Please be careful when using the script and test, test, test before committing to it!

The Script

The script can be found at Github: https://github.com/nicjansma/mysql-convert-latin1-to-utf8. If you find bugs or want to contribute changes, please head there.

Final Words

I hope what I’ve learned will be useful to others.

If you hit any problems with the conversion script, please let me know.

Share this: