Fixing MySQL collation/charset issues

cli-sql-collationI recently needed to fix some collation issues on some old databases which were caused by my own neglect. I became aware of this as I was finding a lot of errors in the PHP error.log pertaining to MySQL collation errors (actually mostly derived from WordPress sites receiving spam comments in oriental languages).

A fresh install of MySQL appears to have the default database collation set at latin1_swedish_ci. Obviously this can be avoided by making sure to specify your collation on database/table creation, but I’m of the opinion that I’m better off just setting all databases to run in utf8_unicode_ci no matter what.

Step One: Set default collation

Even if you go back and fix all the messed up DBs, you want to make sure that future databases are set to utf8_unicode_ci collation by default. Fortunately this is fairly easy. Please note that these instructions are for MySQL 5.5+. There’s a slightly different requirement for earlier versions of MySQL.

$ sudo cp /etc/mysql/my.cnf /etc/mysql/my.cnf.bak
$ sudo vi /etc/mysql/my.cnf

Now look for a line that says “[mysqld]” and add the following lines directly after it.

collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8

Once you’ve saved that file, run:

$ sudo service mysql restart

And that’s your default collation looked after. Next time you create a database without specifying collation it will automatically be utf8_unicode_ci.

Step Two: Locate old databases with incorrect collation

So, how do you find out which databases have the wrong collation, and more importantly, which database tables, have the wrong collation.

The commands below will show you what’s not in utf8_unicode_ci, the first command showing you the databases that are wrong, and the second command shows all tables that are wrong.

mysql> SELECT * FROM information_schema.SCHEMATA WHERE DEFAULT_COLLATION_NAME != 'utf8_unicode_ci';
mysql> SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_COLLATION != 'utf8_unicode_ci';

Step Three: Change Collations

NB: Please check the note at the end of this article with an important warning about this approach.

First make sure to backup all databases you’re changing!

[email protected]:~$ mysqldump -u user -p**** db_name > db_name.sql

Then login to mysql and run the following command to change the whole database to utf8_unicode_ci.

mysql> ALTER DATABASE db_name CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Now, return to bash and run the following command. If you want, you can drop the “&” at the end so that you see everything it runs rather than pushing it to the background. Remember to edit mysql user/pass and database name twice.

[email protected]:~$ mysql -u user -p**** --database=db_name -B -N -e "SHOW TABLES" \
| awk '{print "SET foreign_key_checks = 0; ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; SET foreign_key_checks = 1; "}' \
| mysql -u user -p**** --database=db_name &

Data Loss Warning

The method described above is a bit of a ‘quick and dirty’ fix. If you’re changing collation from non-utf8 to utf8 (instead of changing from utf8_general_ci to utf8_unicode_ci) you will end up changing the data type of certain tables. For example a TEXT column will become MEDIUMTEXT. As far as I can tell if you’re changing from say utf8_general_ci to utf8_unicode_ci this is not a problem and the table data types stay the same.

If you’re dealing with non-utf8 to utf8 you would probably be best advised altering your table columns twice to avoid data loss, however this means that you need to manually go through them (or write a script). Doing this manually would be painful on a large scale, so you’ll definitely want to script it.

For example, if you’re changing a WordPress posts table you’ll need to run:

ALTER TABLE wp_posts CHANGE post_content post_content LONGBLOB;
ALTER TABLE wp_posts CHANGE post_content post_content LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

So you need to change the content type to a binary/blob equivalent first, then convert it back to UTF8 and specify the original data type of the column!

These are the data conversions you need to make first before converting them back.

  • CHAR ⇾ BINARY
  • TEXT ⇾ BLOB
  • TINYTEXT ⇾ TINYBLOB
  • MEDIUMTEXT ⇾ MEDIUMBLOB
  • LONGTEXT ⇾ LONGBLOB
  • VARCHAR(x) ⇾ VARBINARY(x)

I haven’t found/written a full script for doing this the entirely correct way yet, but hopefully can sit down and draft something up soon.

You can read some more detailed explanations about this on the WordPress Codex (but applies for non WordPress sites) and the MySQL Performance Blog.

Comments

  • Awesome stuff, sorted that out for me on one of my older blogs that’s been ported around a ton of different hosts.

    • Huzzah! Delighted it was of some help. Always great to hear these posts are helpful to people – makes it worthwhile to keep on writing them 🙂

      Thanks for the comment!