MySQL: Converting an incorrect latin1 column to utf8

April 17th, 2011

I recently stumbled across a major character encoding issue on one of the websites I run.  Through resolving the issue, I learned a lot about the complexities of supporting international character sets in a LAMP (Linux, Apache, MySQL, PHP) environment.  Fixing the problem was a challenge, so I wanted to share some of the knowledge I gained in case anyone else finds similar issues on their own websites.

The post below is a long yet detailed account of my experience.  Let me know if you’ve had similar experiences or found another solution for this type of issue.

Oh My Münchhausen!!!

A couple of days ago I was notified by a visitor of one of my websites that searching for a term with a non-ASCII character in it (in this case, “Münchhausen”) was returning over 500 results, though none of the results actually matched the given search term.

I started looking into the issue and saw the same thing he did. The debug logs from the search page showed the following SQL query being used:

mysql> SELECT * FROM MyTable WHERE city = 'Münchhausen';
...
(7ms, 532 rows affected)

However, none of the results actually contained “Münchhausen” for the city.

I took the exact same query and ran it in the command-line mysql client. Strangely, this returned a different result:

mysql> SELECT * FROM MyTable WHERE city = 'Münchhausen';
(0ms, 0 rows affected)

The exact same query, run instead from the command line, returned 0 rows.

Assuming this had something to do with the “ü” character, I started a long journey of re-learning what character encodings are all about, including what UTF-8, latin1 and Unicode are, and how they are used in MySQL.

What’s Going On?

latin1, AKA ISO 8859-1, is the default character set in MySQL 5.0. latin1 is a single-byte encoding, as opposed to UTF-8, which is a multi-byte encoding. latin1 can represent most of the characters in the English and European alphabets with just a single byte (up to 256 characters at a time). UTF-8, on the other hand, can represent every character in the Unicode character set (over 109,000 currently) and is the best way to communicate on the Internet if you need to store or display any of the world’s various characters.

You can specify a default character set per MySQL server, database, or table. The defaults for a database will get applied to new tables, and the defaults for a table will get applied to new columns. You can change the defaults at any time (ALTER TABLE, ALTER DATABASE), but they will only get applied to new tables and columns.
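For example, to change the defaults for a database and a table (hypothetical names; existing columns are left untouched):

mysql> ALTER DATABASE MyDatabase DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
mysql> ALTER TABLE MyTable DEFAULT CHARACTER SET utf8;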

You can also specify the character set you’re using for client connections (via the command line, or through an API like PHP’s mysql functions).
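For example, from the command-line client you can switch the connection character set with SET NAMES, which is roughly what PHP’s mysql_set_charset() does for you:

mysql> SET NAMES 'utf8';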

Getting back to the Münchhausen Problem, one of the things I initially checked was what character set PHP was talking to MySQL with:

mysql>  SHOW VARIABLES LIKE 'character_set_client';
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| character_set_client | latin1 |
+----------------------+--------+
1 row in set (0.00 sec)

Knowing the character “ü” is represented differently in latin1 versus UTF-8 (see below), and taking a wild stab in the dark, I tried to force my PHP application to use UTF-8 when talking to the database to see if this would fix the issue:

mysql_set_charset('utf8');

Voila! The problem was fixed! Searching for “Münchhausen” on the site returned 0 results (the correct number of matches).

The Problem With “ø” Is That It’s Not “ø”

Or was it? A couple minutes later, I was browsing the site and started coming across funky characters everywhere. For example, a page that previously had the text “Graffiti by Dolk and Pøbel” was now reading “Graffiti by Dolk and Pøbel”. These strange character sequences also looked like an issue I had noticed from time to time in phpMyAdmin, where edit fields would show strange characters. As long as I didn’t edit the strange characters, they displayed correctly when PHP spit them back out as HTML, so I hadn’t thought much of it until now.

Seeing these strange character sequences everywhere scared me enough to look into the problem a bit more. I disabled the call to mysql_set_charset() and the site reverted to the previous “correct” behavior of talking to the server via latin1 and displaying “Graffiti by Dolk and Pøbel”.

Back to square one.

It’s All Bits From MySQL’s Point of View

Since the term “Münchhausen” was returning inappropriate results, I tried other search terms that contained non-ASCII characters. For example, I searched for the city “São Paulo”:

mysql> SELECT city FROM MyTable WHERE city = 'São Paulo';
+-----------+
| city      |
+-----------+
| Sao Paulo |
| Sao Paulo |
| Sao Paulo |
...
| Sao Paulo |
(2ms, 72 rows affected)

As you can see, the search term kind-of worked. It found occurrences of “Sao Paulo” but not “São Paulo”. I know there are rows with São in the database, so the query wasn’t working 100% correctly.

I changed the query slightly to a wildcard match instead of the non-ASCII character:

mysql> SELECT city FROM MyTable WHERE city LIKE 'S%o Paulo';
+------------+
| city       |
+------------+
| Sao Paulo  |
| Sao Paulo  |
| Sao Paulo  |
...
| São Paulo |
| São Paulo |
(5ms, 348 rows affected)

This search worked a bit better — it found rows with cities of both Sao Paulo and São Paulo. However, it returned the character sequence “ã” for São Paulo for some reason.

The reason for this is that, from MySQL’s point of view, the data stored within its tables is all just bits. The column type and character set of a column determine how queries work against the data and how the data is returned as a result of a SELECT query.

It’s probably pretty obvious by now that my city column wasn’t the right character set. It was set to latin1 when the database was created.

The interesting thing is that my web application, which uses PHP, didn’t seem to mind this very much. Storing and retrieving from the city column is binary-safe — that is, MySQL doesn’t modify the data PHP sends it via the mysql extension.   Later, MySQL will give PHP the exact same data (bits) back.  The problems only occur when you ask MySQL to, on its own, analyze the column or present it.

So all this time, my PHP web application had been storing UTF-8-encoded data in the city column, and later retrieving the exact same (binary) data, which it displayed on the website. MySQL doesn’t modify the data for simple UPDATEs and SELECTs, so the UTF-8 characters were all still displayed properly on the website.
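Here’s a minimal sketch of that round trip, assuming the old mysql extension, a latin1 connection, and a hypothetical id column:

// The PHP source file is saved as UTF-8, so this literal is UTF-8 bytes.
// With a latin1 connection and a latin1 column, MySQL stores those bytes
// verbatim; no transcoding happens on the way in.
mysql_query("UPDATE MyTable SET city = 'São Paulo' WHERE id = 1");

// ...and no transcoding happens on the way out, either.
$result = mysql_query("SELECT city FROM MyTable WHERE id = 1");
$row = mysql_fetch_assoc($result);
echo $row['city']; // the same UTF-8 bytes; the browser renders São Paulo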

To fix the above SQL query, we can actually force MySQL to re-interpret the data as a specific character encoding by first converting the data to a BINARY type, then casting that as UTF-8.  If you try to simply CONVERT USING utf8, MySQL will helpfully convert your garbage-latin1 characters to garbage-utf8 characters.  The two-step process of temporarily converting to BINARY ensures that MySQL doesn’t try to re-interpret the column in the other character encoding.  If we didn’t convert to BINARY first, MySQL would end up displaying the same ã characters even in UTF-8 output.

So we CAST to BINARY temporarily first, then CONVERT this USING UTF-8:

mysql> SELECT city, CONVERT(CAST(city as BINARY) USING utf8) FROM MyTable WHERE city = 'São Paulo';
+------------+-------------+
| city       | CONVERT...  |
+------------+-------------+
| Sao Paulo  | Sao Paulo   |
| Sao Paulo  | Sao Paulo   |
...
| São Paulo | São Paulo   | ...

Success!  We’ve tricked MySQL into giving us the UTF-8 interpretation of our latin1 column on the fly, and we see that São Paulo is represented properly.

Why do they differ?

Some background: Why is “ã” represented differently in latin1 vs UTF-8?

The character ã in latin1 is character code 0xE3 in hex, or 227 in decimal. Its 8 bits would be represented as:

1110 0011

latin1 is a single-byte encoding, so each of the 256 characters is just a single byte.

The UTF-8 encoding was designed to be backward-compatible with ASCII documents, for the first 128 characters. For characters above #128, a multi-byte sequence describes the character.

Our character ã, #227, is above ASCII’s first 128 characters, so it misses out on the single-byte compatibility and must be represented in two bytes, as described on the Wikipedia UTF-8 page.

Here’s a representation of the character ã in both encodings:

   0000 0000 1110 0011 latin1 character (0x00E3)
-> 1100 0011 1010 0011 UTF-8 encoding (0xC3A3)

UTF-8 encoding turns our ã, represented as 0xE3 in latin1, into two bytes, 0xC3A3 in UTF-8.

Note that these two bytes 0xC3 and 0xA3 in UTF-8 happen to look like this in latin1:

0xC3 = Ã in latin1

0xA3 = £ in latin1

So the UTF-8 encoding of ã explains precisely why we see it reinterpreted as ã in latin1.
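You can see the raw bytes for yourself from within MySQL: HEX() returns a column’s bytes regardless of its declared character set, so a latin1 column holding UTF-8 data gives itself away (hypothetical output for one of my rows):

mysql> SELECT city, HEX(city) FROM MyTable WHERE city LIKE 'S%o Paulo' LIMIT 1;
+------------+----------------------+
| city       | HEX(city)            |
+------------+----------------------+
| São Paulo | 53C3A36F205061756C6F |
+------------+----------------------+

The ã is stored as the two UTF-8 bytes 0xC3 0xA3, even though the column claims to be latin1.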

What To Do, What To Do

At this point, it’s obvious that I messed up somewhere. More precisely, the city column should be UTF-8, since PHP has always been putting UTF-8 data in it. The big reason I hadn’t noticed an issue up to this point is that while the MySQL column is latin1, my PHP app was getting this data and calling htmlentities to convert the UTF-8 characters to HTML codes before displaying them. My website’s visitors saw proper UTF-8 characters on the website even though the MySQL column was latin1.
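In other words, something like this in the display path was masking the mismatch (a sketch; $row is whatever the query returned):

// htmlentities with an explicit UTF-8 charset turns each multi-byte
// character into an entity like &uuml; before it reaches the browser,
// so the page looks right regardless of the column's character set
echo htmlentities($row['city'], ENT_QUOTES, 'UTF-8');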

As we’ve seen, issues start occurring when you do queries against the data. The SELECT above was using a UTF-8 character “ü” for “Münchhausen”, and when comparing this to “latin1” data in the column, MySQL gets confused (can you blame it?). Additional issues can appear with applications that display the “natural” encoding of the column (such as phpMyAdmin): they show the strange character sequences as seen above, instead of UTF-8 decoded characters.

The core of the problem is that the MySQL database was created several years ago and the default collation at the time was latin1_swedish_ci. Over the years, I changed the default to utf8_general_ci for new columns, but existing tables and columns weren’t changed.  I have over 100 tables in latin1 that should be UTF-8 and need to be converted.

So I started investigating what it takes to convert my existing latin1 tables to UTF-8 as appropriate. Once I set the character encoding properly, queries against the database should work better and I shouldn’t have to worry about these types of issues in the future.

There are a couple ways to make the conversion.

Some people have successfully exported their data as latin1, converted the resulting file to UTF-8 via iconv or a similar utility, updated their column definitions, then re-imported the data. Unfortunately this requires taking the database down while tables are dropped and re-created, and it can be a bit time-consuming.
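For reference, that route looks something like this (hypothetical names; the dump is taken as latin1 so the bytes come out untouched, and the column definitions in the dump file still need to be edited to utf8 before re-importing):

server> mysqldump --default-character-set=latin1 -u dbuser -p mydatabase > dump.sql
server> iconv -f ISO-8859-1 -t UTF-8 dump.sql > dump-utf8.sql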

I was hoping for a process that I could apply to an online database, and luckily I found some good notes by Paul Kortman and fabio, so I combined some of their ideas and automated the process for my site. I hit a couple issues along the way, so I wanted to share the steps that worked for me.

The Process

Using the method described on fabio’s blog, we can convert latin1 columns that have UTF-8 characters into proper UTF-8 columns by doing the following steps:

  1. Convert the column to the associated BINARY-type (ALTER TABLE MyTable MODIFY MyColumn BINARY)
  2. Convert the column back to the original type and set the character set to UTF-8 at the same time (ALTER TABLE MyTable MODIFY MyColumn TEXT CHARACTER SET utf8 COLLATE utf8_general_ci); see the example just below
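For example, for a simple TEXT column the two steps would be:

mysql> ALTER TABLE MyTable MODIFY MyColumn BLOB;
mysql> ALTER TABLE MyTable MODIFY MyColumn TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;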

This is a similar approach to our SELECT CONVERT(CAST(city as BINARY) USING utf8) trick above, where we basically hide the column’s actual data from MySQL by masking it as BINARY temporarily.

If you simply force the column to UTF-8 without the BINARY conversion, MySQL does a data-changing conversion of your “latin1” characters into “UTF-8” and you end up with improperly converted data. Converting the column to BINARY first strips the latin1 label from the bytes, so MySQL doesn’t try to transcode data that is already UTF-8.

We need to convert each source column type (CHAR vs. VARCHAR vs. TEXT, etc) into its associated BINARY type (BINARY vs. VARBINARY vs. BLOB).

Additionally, the MODIFYs to BINARY and back need to retain the entire column definition. For TEXT types, a simple TEXT-to-BLOB conversion is sufficient. But for column definitions that specify lengths, defaults or NOT NULL, such as:

MyColumn VARCHAR(255) DEFAULT '' NOT NULL

We need to MODIFY keeping the same attributes, or the column definition will be fundamentally changed (see notes in ALTER TABLE). In this case, we would specify:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(255) DEFAULT '' NOT NULL;

If we don’t specify the length, default and NOT NULL, the columns won’t be the same as they were before the conversion.
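The matching second step needs the same attributes, restoring the original type while setting the new character set:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT '' NOT NULL;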

The Script

The script at the bottom of this post automates the conversion of any UTF-8 data stored in latin1 columns to proper UTF-8 columns.

I modified fabio’s script to automate the conversion for all of the latin1 columns for whatever database you configure it to look at. It converts the columns first to the proper BINARY cousin, then to utf8_general_ci, while retaining the column lengths, defaults and NULL attributes.
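At its core, the script does something like the following. This is a simplified sketch with hypothetical connection details, not the actual code; the real script adds error handling, a pretend mode, and the FULLTEXT handling described below:

<?php
// Sketch: convert every latin1 text column in one database to utf8.
mysql_connect('localhost', 'dbuser', 'dbpass');

// each text type and its BINARY cousin
$typeMap = array('char' => 'binary', 'varchar' => 'varbinary',
                 'tinytext' => 'tinyblob', 'text' => 'blob',
                 'mediumtext' => 'mediumblob', 'longtext' => 'longblob');

$res = mysql_query(
    "SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, IS_NULLABLE, COLUMN_DEFAULT
     FROM information_schema.COLUMNS
     WHERE TABLE_SCHEMA = 'MyDatabase'
       AND COLLATION_NAME LIKE 'latin1%'
       AND DATA_TYPE IN ('char','varchar','tinytext','text','mediumtext','longtext')");

while ($col = mysql_fetch_assoc($res)) {
    // COLUMN_TYPE looks like "varchar(255)"; swap the base type for its
    // BINARY cousin while keeping any length
    $base    = preg_replace('/\(.*$/', '', $col['COLUMN_TYPE']);
    $binType = $typeMap[$base] . substr($col['COLUMN_TYPE'], strlen($base));

    // rebuild NULL/DEFAULT so the column definition survives the round trip
    $attrs = ($col['IS_NULLABLE'] === 'NO') ? ' NOT NULL' : '';
    if ($col['COLUMN_DEFAULT'] !== null) {
        $attrs .= " DEFAULT '" . mysql_real_escape_string($col['COLUMN_DEFAULT']) . "'";
    }

    mysql_query("ALTER TABLE `{$col['TABLE_NAME']}`
                 MODIFY `{$col['COLUMN_NAME']}` {$binType}{$attrs}");
    mysql_query("ALTER TABLE `{$col['TABLE_NAME']}`
                 MODIFY `{$col['COLUMN_NAME']}` {$col['COLUMN_TYPE']}
                 CHARACTER SET utf8 COLLATE utf8_general_ci{$attrs}");
}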

Warning: This script assumes you know you have UTF-8 characters in a latin1 column. Please test your changes before blindly running the script!

Here are the steps you should take to use the script:

Step 1. Determine Which Columns Need Updating

If you’re like me, you may have a mixture of latin1 and UTF-8 columns in your databases.  Not all of the columns in my database needed to be updated from latin1 to UTF-8.  For example, some of the tables belonged to other PHP apps on the server, and I only wanted to update the columns that I knew had to be fixed.  The script currently converts all of the tables in the specified database – you could modify the script to target specific tables or columns if you need to.

Additionally, the script will only update appropriate text-based columns.  Character sets are only appropriate for some types of data: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT. Other column types such as numeric (INT) and BLOBs do not have a “character set”.

You can see what character sets your columns are using via the MySQL Administration tool, phpMyAdmin, or even using a SQL query against the information_schema:

mysql> SELECT * FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = "MyDatabase" AND COLLATION_NAME LIKE "latin1%";
...
115 rows in set (0.03 sec)

Step 2. Test Convert the Columns

You should test all of the changes before committing them to your database.

The first thing to test is that the SQL generated from the conversion script is correct.  To do this, you can dump the structure of your database:

server> mysqldump -d -h localhost -u dbuser -p mydatabase > structure.sql

And import this structure to another test MySQL database:

server> mysql -u dbuser -p mydatabase_test < structure.sql

Next, run the conversion script (below) against your temporary database:

server> php -f mysql_convert_latin1_to_utf8.php

The script will spit out “!!! ERROR” statements if a change fails.  If you encounter ERRORs, modifications may be needed based on your requirements.  Some of the common problems are listed in Step 3.

After you run the script against your temporary database, check the information_schema tables to ensure the conversion was successful:

mysql> SELECT * FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = "MyDatabase";

As long as you see all of your columns in UTF8, you should be all set!

Step 3. Problems You May Encounter

If only it were that simple. I hit some issues along the way.

Step 3.1. FULLTEXT indexes

I have several columns with FULLTEXT indexes on them.  The ALTER TABLE to BINARY command for a column that has a FULLTEXT index will cause an error:

mysql> ALTER TABLE MyTable MODIFY MyColumn BLOB;
ERROR 1283 (HY000): Column 'MyColumn' cannot be part of FULLTEXT index

The simple solution I came up with was to modify the script to drop the index prior to the conversion, and restore it afterward:

ALTER TABLE MyTable DROP INDEX `mycolumn_fulltext`

... (convert all columns) ...

ALTER TABLE MyTable ADD FULLTEXT KEY `mycolumn_fulltext` (`MyColumn`)

There are TODOs listed in the script where you should make these changes.

Step 3.2. Invalid UTF-8 data

Since my database was over 5 years old, it had acquired some cruft over time. I’m not sure exactly how this happened, but some of the columns had data that wasn’t valid UTF-8, though it was valid latin1. I believe this occurred before I hardened my PHP application to reject non-UTF-8 data, but I’m not sure. I found this out when initially trying to do the conversion:

mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(3000) NOT NULL DEFAULT '';
Query OK, 21171 rows affected (0.66 sec)

mysql> ALTER TABLE MyTable MODIFY MyColumn varchar(3000) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '';
ERROR 1366 (HY000): Incorrect string value: '\xE2\x80? fl...' for column 'MyColumn' at row 128

What’s going on?

At some point, a character sequence that contained invalid UTF-8 characters was entered into the database, and now MySQL refuses to call the column VARCHAR (as UTF-8) because it has these invalid character sequences.

I checked the HTML representation of this column in my PHP website, and sure enough, the garbage shows up there too:

… �? flown …

The � is the actual character that your browser shows. Not the best user experience, and definitely not the correct character.

I fixed that single row (via phpMyAdmin), and ran the ALTER TABLE MODIFY command again — same issue, another row. It looked like there was more than a single corrupt row.

I found a good way of rooting out all of the rows that would cause the conversion to fail. If you SELECT CONVERT(MyColumn USING utf8) as a new column, any row where the converted value comes back NULL is a row that would cause the ALTER TABLE to fail.

For example:

mysql> SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
       FROM MyTable
       WHERE CONVERT(MyColumn USING utf8) IS NULL
...
5 rows in set, 10 warnings (0.05 sec)

This showed me the specific rows that contained invalid UTF-8, so I hand-edited them to fix the data. You could manually NULL them out using an UPDATE if you’re not afraid of losing data, as shown below. I had to do this for 6 of the 115 columns that were converted.  Only 30 rows in total were corrupt.
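If you do want to blank them out wholesale, something along these lines would work (hypothetical, and only if the column allows NULLs; you will lose the corrupt values):

mysql> UPDATE MyTable SET MyColumn = NULL WHERE CONVERT(MyColumn USING utf8) IS NULL;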

Step 4. Profit?

At this point, it may take some guts for you to hit the go button on your live database.

Personally, I ran the script against a test (empty) database, then a copy of my live data, then a staging server before finally executing it on the live data.

Warning: Please be careful when using the script and test, test, test before committing to it!

The Script

The script can be found at Github: https://github.com/nicjansma/mysql-convert-latin1-to-utf8. If you find bugs or want to contribute changes, please head there.

Final Words

I hope what I’ve learned will be useful to others.

If you hit any problems with the conversion script, please let me know.


  1. Henry C
    April 28th, 2011 at 09:02 | #1

    I hit a snag with this gr8 script on a table that has enum for the column type, e.g. enum('taxonomy','edited','grouped','un-grouped'). How to fix this?

  2. April 28th, 2011 at 20:43 | #2

    Hi Henry,

    I’m not using ENUMs for any of my column types. Here’s another article on wordpress.org that suggests how you might change an ENUM: http://codex.wordpress.org/Converting_Database_Character_Sets#Special_case:_ENUM_-_Different_process

  3. August 28th, 2011 at 01:29 | #3

    Hi!

    I’m not quite getting this to work. I have an InnoDB table which uses utf8_swedish_ci as collation. But I still get the ?-mark when presenting the data on my website. In phpMyAdmin the characters show fine.

    I use AJAX to retrieve data from the table in realtime, so I’ve made sure the headers of the retrieved file are using UTF8, but it doesn’t seem to help. I tried your ALTER TABLE-fix, but no change.

    The data I filled the table with came from a file, but also that was encoded in UTF8.

    Any ideas?

    /Björn

  4. August 28th, 2011 at 01:45 | #4

    @ Björn F
    Update: when I set the response file’s header to iso-8859-1 the characters show correctly. Although they are never stored as iso-8859-1/latin1…

  5. Ricardas
    December 30th, 2011 at 05:29 | #5

    Thank you, very helpful post!

  6. January 23rd, 2012 at 12:40 | #6

    Hi,

    Thanks for this very informational post although I have some problems that I can not fix with your guidelines.

    The problem is that on our website we see invalid utf8 characters showing as �. I use MySQL workbench and if I select the column with the problem I also see a � as the query result.

    So I thought the script should fail on these columns. But the script never failed. Then I thought maybe I should get a list of all such values that are not valid, as you suggested. So I ran this query:

    mysql> SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
    FROM MyTable
    WHERE CONVERT(MyColumn USING utf8) IS NULL

    and it returned 0 results.

    Does this mean that the data is actually proper utf8? (The conversion does not fail.) If so, why is it showing as � in MySQL Workbench when I view the value of that specific column?

    Any help on this will be greatly appreciated. Once again thanks for sharing this with us.

  7. January 24th, 2012 at 10:33 | #7

    @Maryam
    Maryam,

    Are you using PHP on your website? Make sure you’re talking to the database in the right charset, for example:

    mysql_set_charset('utf8');

    Does MySQL Workbench report the columns as being utf8 now?

  8. Luca
    January 28th, 2012 at 04:01 | #8

    Hi, very interesting article, and thanks for explaining everything. From the look of it I thought I might have finally found the solution to my problem… but it looks like I have a different problem, even if the description is exactly the same… in the end, running the convert query I get the exact same result I get when selecting the original data… if I run it using a PuTTY connection. If I run the console on my laptop, ssh to the server, and run the query, I get the correct Italian letters I’m trying to put in the DB… (à é and so on) in BOTH columns… O_o

    Thank you and.. “Any ideas?” 😀

  9. February 29th, 2012 at 20:44 | #9

    I have also
    SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
    FROM MyTable
    WHERE CONVERT(MyColumn USING utf8) IS NULL
    also returns 0 results. But if I try insert values from MyColumn to other utf8 Table/Column it returns ERROR 1366: Incorrect string value

  10. February 29th, 2012 at 22:36 | #10

    @Rafis
    Hey Rafis,

    Are you using Windows cmd window? Some other folks are reporting issues on Windows here: http://bugs.mysql.com/bug.php?id=30131

    Is it reporting exactly which characters are the issue after “Incorrect string value”?

  11. February 29th, 2012 at 23:17 | #11

    NICE ONE!!!
    been searching for a week already.

    Thanks a lot for the code and explanation

  12. February 29th, 2012 at 23:55 | #12

    @ Nic
    It was:

    Error Number: 1366

    Incorrect string value: '\xD1\x80\xD0\xB5\xD0\xB3…' for column 'content' at row 1

    At last, it worked! After:
    ALTER TABLE `med_news` DEFAULT CHARACTER SET utf8 COLLATE utf8_bin
    Thank you very much! It was utf8_general_ci before.

    But how can I tell which characters '\xD1\x80\xD0\xB5\xD0\xB3' are?

  13. March 1st, 2012 at 00:33 | #13

    Solved. Seems the problem was not in the charset or collation! It was the size of the field: TEXT = 64KB, MEDIUMTEXT = 16MB; truncating to 64KB was breaking the last character.

  14. March 18th, 2012 at 02:31 | #14

    Great writeup! Thanks!

  15. sleep-o-matic
    May 8th, 2012 at 10:59 | #15

    Thank you so much … this saved me loads of time 😉
    very much appreciated.

  16. May 16th, 2012 at 11:32 | #16

    Thank you so much for the detailed explanation of the issue and the helpful script. We ran into this issue converting a very large EE 1.x database for use in EE 2.x and this did the trick.

  17. Mark
    May 16th, 2012 at 23:50 | #17

    I’m working on a related problem that your article and PHP do not seem to solve.

    I get this error when working with some of my data:

    Warning (Code 1366): Incorrect string value: '\xFCrttem…' for column 'name' at row 1

    Here’s an example:

    select unhex('426164656E2D57FC727474656D626572672C2044452C204445') with_fc
    , unhex('426164656E2D57C3BC727474656D626572672C2044452C204445') with_c3bc;

    They could both evaluate to “Baden-Württemberg, DE, DE”, but only the second option works with hex and utf8.

    It may be that I have to convert from latin1 to utf16 and then to utf8.

  18. j
    June 18th, 2012 at 04:35 | #18

    Used your script, but seems like there is a character limit to it.

  19. June 18th, 2012 at 05:42 | #19

    j :

    Used your script, but seems like there is a character limit to it.

    Could you explain more? Are you saying you had a column with data, and after the conversion, some of the rows had their data truncated?

  20. August 17th, 2012 at 03:09 | #20

    used your script to convert a typo3 database from 4.2 to 4.7 where character sets seem to have changed, as i had many garbled chars after the update. i just ran it on the live-db after i made a backup and it worked like a charm. all garbled chars are now gone, and i did not even have to change any part of the script. THANKS! this really saved me a lot of time.

  21. Ed
    October 19th, 2012 at 10:31 | #21

    The emails I receive from just one department in my job look like this in Thunderbird/Brazilian Portuguese:
    .
    ã = ã
    í = í
    ç = ç
    é = é
    Â = null
    ém = é
    etc…
    .
    Com a finalidade de não interferir no trabalho logístico da biblioteca peço a gentileza de avisarem aos profissionais que a frequentam, para solicitarem livretos e revistas formalmente através do email ou do Fale Conosco (site) com identificação do pedido e indicação de quantidade. Na mensagem deverão constar dados pessoais como: nome completo, nº, endereço completo, telefone e email para contato, deixando claro que desta forma ele serà atendido eficazmente e também passará a receber a nova revista.
    .
    However, those same emails show up OK when opened in the SquirrelMail client. Re-sending a messed-up text like the one above from Thunderbird through SquirrelMail does not make it show up OK again. I could not find anyone to offer a solution or explanation. Any hints?

    • October 27th, 2012 at 06:54 | #22

      Looks like the character encoding of the email sent out (from whatever email client they’re using) might be specified improperly, and possibly, SquirrelMail notices the error and corrects it.

      Sounds like an issue with the Thunderbird display engine or the sending email app though, not MySQL.

  22. Tei
    November 30th, 2012 at 02:35 | #23

    Thanks for this post. Help me fix a problem with a php app where everything was UTF8, but still something refused to work properly.

  23. January 19th, 2013 at 20:26 | #24

    Saved my time! Thanks a lot!

  24. Naomi
    January 23rd, 2013 at 14:17 | #25

    Useful script! For the conversion from BINARY back to CHAR, I think the ALTER TABLE command will actually pad extra 0x00 bytes at the end. So if you have an empty string in the column, after converting the column back to CHAR type, it’ll actually inflate your column.

    For that case, you may want to do something like this after the ALTER TABLE command:

    sqlExec($targetDB, "UPDATE `$tableName` SET `$colName` = TRIM(TRAILING 0x00 FROM `$colName`)", $pretend);

  25. yip
    February 5th, 2013 at 19:06 | #26

    just to let you know,
    this statement:
    “latin1, AKA ISO 8859-1 is the default character set in MySQL 5.0”
    is false.

    MySQL’s latin1 is NOT iso-8859-1(5).
    It is Windows-1252, also known as CP1252. Check the conversion tables to confirm.

    MySQL foolishly calls it latin1. For ALL other systems, latin1 = iso-8859-1(5).
    Thanks MySQL for the confusion.
    Thousands of devs, including me, fall for the trap.

  26. February 21st, 2013 at 03:53 | #27

    Thanks a lot for providing this script!
    I spent hours to find a way out of this encoding-hell!

    The script worked for me without any problems.

  27. Vijay
    February 8th, 2016 at 09:16 | #28

    Wow! Great article. That saved us from a production issue (that encoding hell)!

  28. June 6th, 2016 at 10:11 | #29

    Awesome !
    Thank you very much.

  29. October 13th, 2017 at 01:51 | #30

    Thanks NIC

    This article was indeed helpful.
    For me, I was looking for this:
    mysql> SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8)
    FROM MyTable
    WHERE CONVERT(MyColumn USING utf8) IS NULL

  30. BBking
    May 27th, 2018 at 11:36 | #31

    When I ran your PHP script (many thanks for that!!) I get this message for every ALTER/MODIFY command:
    PHP Notice: Undefined variable: res in /usr/home/bbking/mysql-convert-latin1-to-utf8.php on line 201

    and the tables don’t change, neither in encoding nor in content. Any ideas?
    Thanks

  31. D
    September 4th, 2018 at 09:59 | #33

    Thank you for this.

  32. Ryan
    October 17th, 2018 at 18:50 | #34

    Have you considered updating this article to refer to `utf8mb4`, which is *actually utf8* instead of the `utf8` type? Yes, that’s ridiculous.

  33. Guru
    February 15th, 2019 at 00:24 | #36

    Could you please comment on the time that we can expect for this activity on per table basis in case the amount of data already present in the table is huge?

    • February 17th, 2019 at 19:17 | #37

      Hi @Guru! That entirely depends on your data set, the processing power of the machine, etc. My guess is it should be similar to the time it takes to duplicate (or export) a table.

  34. April 28th, 2019 at 23:05 | #38

    Thanks for this Nic – I am using Media Wiki and they are actually abandoning utf8, and going binary. I assume that your scripts would work that way also… however do you see any reasons why such a conversion would create new challenges?
    https://www.mediawiki.org/w/index.php?title=Topic:Uygrdvlsipucegw6&topic_showPostId=uyr7f40seatbtn0g#flow-post-uyr7f40seatbtn0g

    • April 30th, 2019 at 17:50 | #39

      Interesting! I would assume it would work that way as well, but haven’t tested it. I’m using MediaWiki for a few sites as well, so I may have to try it out soon!

  35. Luca
    October 17th, 2019 at 11:18 | #40

    used also with cp1251 and works
    but there’s an error here
    $colDefault = '';
    if ($col->COLUMN_DEFAULT !== null) {
        $colDefault = "DEFAULT '{$col->COLUMN_DEFAULT}'";

    MODIFY `grouplevel` varchar(100) COLLATE utf8_unicode_ci NOT NULL DEFAULT ''all'',
    MODIFY `start` varchar(15) COLLATE utf8_unicode_ci NOT NULL DEFAULT '''',

    !!! ERROR: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'all'',
    MODIFY `start` varchar(15) COLLATE utf8_unicode_ci NOT NULL DEFAULT '''',' at line 6

    result in this example: NOT NULL DEFAULT ''all'',
    should be: NOT NULL DEFAULT 'all',
    so I’ve removed the extra quotes here: $colDefault = "DEFAULT {$col->COLUMN_DEFAULT}";

  36. John Seaber
    January 26th, 2021 at 18:09 | #42

    Thank you for this fantastic article! I manage a database with over 10 years of MySQL data, originally in latin1_swedish_ci. It sounds like we’ve had a similar experience with past encodings. It was like treasure finding your article during a MySQL 8 upgrade.

    I modified and tested your script from GitHub to convert latin1_swedish_ci -> utf8mb4 and the transition went fairly well. I’ll share bugs on Github as requested. Just wanted to say thanks first!

  37. March 18th, 2022 at 18:38 | #44

    Great!, Awesome.

    Thank you so much Nic for creating the script; it really helped us fix the incorrect encoding on our 30GB MySQL database.

    Our Best Regards

  38. Rafael
    May 23rd, 2023 at 02:19 | #45

    I think that “CAST(city as BINARY)” could be changed to “BINARY city” to be clearer.

  39. DrFriendless
    March 27th, 2024 at 20:05 | #46

    Oh goodness, thank you from the future! My problem is that our LONGTEXT column uses latin1, but we put UTF-8 into it. Every other mention of this problem on the internet swears that it’s not a thing, but it absolutely is. I need to do this:

    ALTER TABLE thing MODIFY description BLOB(999999);
    ALTER TABLE thing MODIFY description LONGTEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
