Character encoding configuration in MySQL has always been confusing. With too many options to set, unclear relationships between them, and default settings that make MySQL incompatible with most languages, it is a headache for many users, many of whom end up with broken data. This lecture gives an overview of character set support in MySQL, offers guidelines on how to use it correctly, and demonstrates several methods of detecting and repairing mangled data.
Character Encoding - MySQL DevRoom - FOSDEM 2015
1. Character encoding
Breaking and unbreaking your data
Maciej Dobrzanski
maciek@psce.com | @mushupl
Brussels, 1 Feb 2015
01.02.2015 Follow us on Twitter @dbasquare www.psce.com
2. Character Encoding
• Binary representation of glyphs
• Each character can be represented by 1 or more bytes
• Popular schemes
• ASCII
• Unicode
• UTF-8, UTF-16, UTF-32
• Language specific character sets
• US (Latin US)
• Europe (Latin 1, Latin 2)
• Asia (EUC-KR, GB18030)
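To make the "1 or more bytes" point concrete, here is a minimal Python sketch (not from the talk) that counts how many bytes a few sample characters need under several schemes. The sample characters and the helper names `encoded_length` and `length_table` are illustrative assumptions.

```python
# Encoded size of a few sample characters under different schemes.
# Sample characters and helper names are illustrative, not from the talk.
SAMPLES = ["A", "ü", "ń", "€"]
ENCODINGS = ["ascii", "latin-1", "iso8859_2", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_length(char, encoding):
    """Return the number of bytes needed, or None if the charset lacks the character."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


def length_table(samples=SAMPLES, encodings=ENCODINGS):
    """Build {character: {encoding: byte count or None}}."""
    return {ch: {enc: encoded_length(ch, enc) for enc in encodings} for ch in samples}


if __name__ == "__main__":
    for ch, row in length_table().items():
        print(ch, row)
```

A `None` entry means the character cannot be represented in that character set at all, which is where data loss starts.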
3. Character Encoding
• Character set defines the visual interpretation of binary information
• One glyph can be associated with several numeric codes
• One numeric code may be used to represent several different glyphs
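Both directions of this ambiguity can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names are hypothetical, and Python's `latin-1` and `iso8859_2` codecs stand in for Latin 1 and Latin 2.

```python
def glyphs_for_byte(byte_value, encodings=("latin-1", "iso8859_2")):
    """Decode the same single byte under several single-byte character sets."""
    raw = bytes([byte_value])
    return {enc: raw.decode(enc) for enc in encodings}


def codes_for_glyph(glyph, encodings=("iso8859_2", "utf-8", "utf-16-be")):
    """Encode the same glyph under several character sets; values are hex strings."""
    return {enc: glyph.encode(enc).hex() for enc in encodings}


if __name__ == "__main__":
    # One numeric code, several glyphs.
    print(glyphs_for_byte(0xF1))
    # One glyph, several numeric codes.
    print(codes_for_glyph("ń"))
```

Byte 0xF1 reads as 'ñ' in Latin 1 but as 'ń' in Latin 2, while 'ń' itself is F1 in Latin 2 and C5 84 in UTF-8. The bytes alone never say which reading is intended.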
4. Please state the nature of the emergency
• Application configuration
• Database configuration
• Table/column definitions
5. Problem #1: We are all born Swedish
• MySQL uses latin1 by default
• MySQL 5.7 too
• Is anyone actually aware of that?
• Why Swedish?
• latin1_swedish_ci is the default collation
6. Problem #1
• Let’s build an application
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1 | latin1 |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)
mysql> USE fosdem;
mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec)
mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
Table: locations
Create Table: CREATE TABLE `locations` (
`city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
16. Problem #2
• Why is the table character set latin1?
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8 | utf8 |
+--------------------------------+--------------------------------+
1 row in set (0.00 sec)
mysql> USE fosdem;
mysql> SHOW CREATE TABLE people\G
*************************** 1. row ***************************
Table: people
Create Table: CREATE TABLE `people` (
`first_name` varchar(30) NOT NULL,
`last_name` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
17. Problem #2
• What’s all this, then?
mysql> SHOW SESSION VARIABLES LIKE 'character_set_%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
mysql> SHOW CREATE DATABASE fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)
18. Problem #2
• Can we fix this?
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name | HEX(last_name) |
+------------+----------------------+
| Lemon | 4C656D6F6E |
| Müller | 4DFC6C6C6572 |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)
mysql> SET NAMES latin2;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name | HEX(last_name) |
+------------+----------------------+
| Lemon | 4C656D6F6E |
| Müller | 4DFC6C6C6572 |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)
• We can’t! :-(
• 0x3F is '?', so my 'ń' was lost
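The same loss can be reproduced outside the database. The sketch below is an analogy, not MySQL's conversion code: it uses Python's `latin-1` codec (ISO-8859-1, whereas MySQL's latin1 is closer to cp1252) with the `replace` error handler, which substitutes '?' for anything the target character set cannot hold. The function name `store_as_latin1` is hypothetical.

```python
def store_as_latin1(text):
    """Mimic a lossy conversion into a latin1 column: unmappable characters become '?'."""
    return text.encode("latin-1", errors="replace")


if __name__ == "__main__":
    for name in ["Lemon", "Müller", "Dobrzański"]:
        stored = store_as_latin1(name)
        # Print the original, the stored bytes as hex, and what can be read back.
        print(name, stored.hex().upper(), stored.decode("latin-1"))
```

Once the byte 0x3F has been written, nothing in the stored value records which character it replaced, so no later `SET NAMES` can bring it back.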
19. Problem #2: The bad news
• It may not be enough to configure the server correctly
• A mismatch between client and server can permanently break data
• Implicit conversion inside MySQL server
20. Problem #2: Settings, defaults, inheritance
• Where do you set character sets in MySQL?
• Session settings
• character_set_server
• character_set_client
• character_set_connection
• character_set_database
• character_set_results
• Schema level defaults – affect new tables
• Table level defaults – affect new columns
• Column charsets
21. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1 | utf8 |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
master [localhost] {msandbox} ((none)) > CREATE SCHEMA fosdem\G
Query OK, 1 row affected (0.00 sec)
master [localhost] {msandbox} ((none)) > SHOW CREATE SCHEMA fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)
22. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > USE fosdem;
Database changed
master [localhost] {msandbox} (fosdem) > CREATE TABLE test (a VARCHAR(300), INDEX (a));
Query OK, 0 rows affected (0.62 sec)
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
23. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test DEFAULT CHARSET = utf8;
Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
24. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test ADD b VARCHAR(10);
Query OK, 0 rows affected (0.74 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
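The inheritance behaviour seen in the last four slides can be summarised as a toy model. This is only a sketch of the rule "defaults are copied at creation time and never propagate afterwards"; the `Table` class and `replay_slides` function are hypothetical and have nothing to do with how the server is implemented.

```python
class Table:
    """Toy table: remembers its default charset and each column's charset."""

    def __init__(self, name, schema_default):
        self.name = name
        # The schema default is copied when the table is created.
        self.default_charset = schema_default
        self.columns = {}

    def add_column(self, column, charset=None):
        # A column without an explicit charset copies the table default *now*.
        self.columns[column] = charset or self.default_charset

    def alter_default_charset(self, charset):
        # Only the default changes; existing columns keep what they copied earlier.
        self.default_charset = charset


def replay_slides(server_charset="latin1"):
    schema_default = server_charset       # CREATE SCHEMA without a charset clause
    test = Table("test", schema_default)  # CREATE TABLE without a charset clause
    test.add_column("a")                  # inherits latin1
    test.alter_default_charset("utf8")    # ALTER TABLE test DEFAULT CHARSET = utf8
    test.add_column("b")                  # inherits utf8
    return test


if __name__ == "__main__":
    table = replay_slides()
    print(table.default_charset, table.columns)
```

Column `a` stays latin1 even though the table default is now utf8, which matches the `SHOW CREATE TABLE` output above.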
25. I f**ckd up. What do I do?
• Let’s start with what you shouldn’t do
• Keep calm and don’t start by changing something
• Analyze the situation
• Why did the problem occur in the first place?
• Reassess the damage
• Is it consistent?
• Are all rows broken in the same way?
• Are some rows bad, but others are okay?
• Are all rows bad, but in several different ways?
• Is it actually repairable?
• No character mapping occurred during writes (e.g. unicode over latin1/latin1)
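One way to assess the damage is to look at the raw bytes, for example the output of `HEX(col)`, and check what they really contain. The sketch below is a heuristic under assumptions: the function name and verdict labels are hypothetical, and short genuine latin1 strings can occasionally happen to be valid UTF-8, so treat the verdict as a hint rather than proof.

```python
def classify_stored_bytes(hex_value):
    """Classify a HEX() dump of a value stored in a latin1 column.

    Returns a (verdict, text) pair where verdict is one of:
      "ascii"       - nothing to repair (or characters were already lost as '?')
      "utf8-bytes"  - valid multi-byte UTF-8 stored unconverted; relabeling can fix it
      "single-byte" - not valid UTF-8; probably genuine single-byte data
    """
    raw = bytes.fromhex(hex_value)
    if all(b < 0x80 for b in raw):
        return "ascii", raw.decode("ascii")
    try:
        return "utf8-bytes", raw.decode("utf-8")
    except UnicodeDecodeError:
        return "single-byte", raw.decode("latin-1")


if __name__ == "__main__":
    for sample in ["4DC3BC6C6C6572", "4DFC6C6C6572", "446F62727A613F736B69"]:
        print(sample, classify_stored_bytes(sample))
```

Running it over a sample of rows shows whether the table is consistently broken in one way or holds a mix of cases, which is the question the checklist above asks.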
26. I f**ckd up. What else shouldn’t I do, then?
• Do not rush things as you may easily go from bad to worse
• Do not start fixing this on a replication slave
• You can’t fix this by fixing tables one by one on a live database
• Unless you really have everything in one table
• Do not use: ALTER TABLE … DEFAULT CHARSET = …
• It only changes the default character set for new columns
• Do not use: ALTER TABLE … CONVERT TO CHARACTER SET …
• It’s not for fixing broken encoding
• Do not use: ALTER TABLE … MODIFY col_name … CHARACTER SET …
27. I f**ckd up. So how do I fix it?
• What needs to be fixed?
• Schema default character set
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
• What about ENUM?
• Use INFORMATION_SCHEMA to grab a list
• What about other tables?
• They too (eventually), but it’s not critical
SELECT CONCAT(c.table_schema, '.', c.table_name) AS candidate_table
FROM information_schema.columns c
WHERE c.table_schema = 'fosdem'
AND c.column_type REGEXP '^(.*CHAR|.*TEXT|ENUM)((.+))?$'
GROUP BY candidate_table;
28. I f**ckd up. So how do I fix it?
• Option 1 – requires downtime
• Dump and restore
• Dump the data preserving the bad configuration and drop the old database
bash# mysqldump -u root -p --skip-set-charset --default-character-set=latin1 fosdem > fosdem.sql
mysql> DROP SCHEMA fosdem;
• Correct table definitions in the dump file
• Edit DEFAULT CHARSET in all CREATE TABLE statements
• Create the database again and import the data back
mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
bash# mysql -u root -p --default-character-set=utf8 fosdem < fosdem.sql
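The "edit DEFAULT CHARSET" step can be scripted instead of done by hand. The following is a minimal sketch, assuming a dump where each INSERT statement sits on a single line; `retarget_charset` is a hypothetical helper, and it deliberately leaves COLLATE clauses and INSERT lines untouched.

```python
import re


def retarget_charset(dump_sql, old="latin1", new="utf8"):
    """Rewrite charset clauses in table definitions, leaving INSERT data alone."""
    pattern = re.compile(r"(DEFAULT CHARSET=|CHARACTER SET )" + re.escape(old) + r"\b")
    out = []
    for line in dump_sql.splitlines(keepends=True):
        # Row data may legitimately contain the same text, so skip INSERT lines.
        if not line.startswith("INSERT "):
            line = pattern.sub(lambda m: m.group(1) + new, line)
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    sample = (
        "CREATE TABLE `test` (\n"
        "  `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL\n"
        ") ENGINE=InnoDB DEFAULT CHARSET=latin1;\n"
    )
    print(retarget_charset(sample))
```

When applying this to a real file, read and write it with one and the same single-byte codec (for example `open(path, encoding="latin-1")`) so the data bytes pass through unchanged. Review the result before importing.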
29. I f**ckd up. So how do I fix it?
• Option 2 – requires downtime
• Perform a two step conversion with ALTER TABLE
• Original encoding -> VARBINARY/BLOB -> Target encoding
• Conversion from/to BINARY/BLOB removes character set context
• How?
• Stop applications
• On each table, for each text column perform:
ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;
• You may specify multiple columns per ALTER TABLE
• Fix the problems (application and/or db configs)
• Restart applications
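With many tables, the pair of statements is easier to generate than to type. The sketch below only builds the SQL text and does not run anything. The helper name `two_step_alters` and its `(name, length, attributes)` tuple format are assumptions, and it covers VARCHAR columns only; TEXT columns would go through BLOB instead.

```python
def two_step_alters(table, columns, target="utf8"):
    """Build the two ALTER TABLE statements for the VARBINARY round trip.

    columns is a list of (name, length, attributes) tuples, where attributes
    is a string such as "NOT NULL" or "" (MODIFY needs the full definition).
    """
    def clause(name, type_sql, attrs):
        return " ".join(part for part in (f"MODIFY `{name}`", type_sql, attrs) if part)

    to_binary = ", ".join(
        clause(name, f"VARBINARY({length})", attrs) for name, length, attrs in columns
    )
    to_text = ", ".join(
        clause(name, f"VARCHAR({length}) CHARACTER SET {target}", attrs)
        for name, length, attrs in columns
    )
    # Step 1 drops the character set context, step 2 attaches the correct one.
    return [f"ALTER TABLE `{table}` {to_binary};", f"ALTER TABLE `{table}` {to_text};"]


if __name__ == "__main__":
    cols = [("first_name", 30, "NOT NULL"), ("last_name", 30, "NOT NULL")]
    for statement in two_step_alters("people", cols):
        print(statement)
```

Putting all columns of a table into one statement per step means the table is rebuilt twice in total rather than twice per column.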
30. I f**ckd up. So how do I fix it?
• Option 3 – online character set fix; no downtime*
• Thanks to our plugin for pt-online-schema-change
• and a tiny patch for pt-online-schema-change that goes with the plugin
• How?
• Start pt-online-schema-change on all tables – one by one
• Do not rotate tables (--no-swap-tables) or drop pt-osc triggers
• Wait until all tables have been converted
• Stop applications
• Fix the problems (application and/or db configs)
• Rotate tables – takes just 1 minute
• Restart applications
• Et voilà
31. GOTCHAs!
• Data space requirements may change during conversion
• latin1 uses 1 byte per character; utf8 must reserve up to 3 bytes per character
• VARCHAR/TEXT fit up to 64KB – it won’t fit 65536 multi-byte characters
• Key length limit is 767 bytes
• Data type and/or index length changes may be required
• Test and plan this ahead
• There may be more problems than you think
• Detect irrecoverable problems with a simple stored function
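DELIMITER ;;
CREATE FUNCTION `cnv_test_conversion` (`value_before` LONGTEXT, `value_after` LONGTEXT) RETURNS tinyint(1)
DETERMINISTIC
BEGIN
-- Returns 1 when the latin1 bytes of the old value equal the raw bytes of the new value
RETURN (IFNULL(CONVERT(CONVERT(`value_before` USING latin1) USING binary), "") =
IFNULL(CONVERT(`value_after` USING binary), ""));
END;;
DELIMITER ;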
32. GOTCHAs!
master [localhost] {msandbox} (fosdem) > ALTER TABLE test MODIFY a VARCHAR(300) CHARACTER SET utf8;
Query OK, 0 rows affected, 1 warning (1.23 sec)
Records: 0 Duplicates: 0 Warnings: 1
master [localhost] {msandbox} (fosdem) > SHOW WARNINGS\G
*************************** 1. row ***************************
Level: Warning
Code: 1071
Message: Specified key was too long; max key length is 767 bytes
1 row in set (0.00 sec)
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
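The 255-character prefix in the output above follows directly from the numbers on the GOTCHAs slide: 767 bytes of key length divided by 3 bytes per utf8 character. A small worked calculation, with hypothetical helper names and only the limits quoted in the slides:

```python
MAX_KEY_BYTES = 767                       # key length limit quoted in the slides
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3} # worst-case bytes per character


def max_index_prefix(charset, declared_length):
    """Longest prefix, in characters, that fits within the key length limit."""
    return min(declared_length, MAX_KEY_BYTES // BYTES_PER_CHAR[charset])


def max_chars_in_bytes(byte_budget, charset):
    """Upper bound on characters that fit in a byte budget at worst-case width."""
    return byte_budget // BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    print(max_index_prefix("latin1", 300))    # whole column fits in the index
    print(max_index_prefix("utf8", 300))      # index is silently cut to a prefix
    print(max_chars_in_bytes(65535, "utf8"))  # rough ceiling for a 64KB column
```

Running this for every indexed text column before the conversion shows which indexes would be truncated and which column lengths need revisiting.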
33. How to do it right?
• Set character-set-server during initial configuration
• When creating new schemas, always specify the desired charset
• CREATE SCHEMA fosdem DEFAULT CHARSET = utf8
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• When creating new tables, also explicitly specify the charset
• CREATE TABLE people (…) DEFAULT CHARSET = utf8
• And don’t forget to configure applications too
• You can try to force charset on the clients
• init-connect = "SET NAMES utf8"
• It might also break applications that don’t want to talk to MySQL using utf8
34. Oh, and one more thing…
35. WebScaleSQL
• We are sharing WebScaleSQL packages with the MySQL Community!
• Check out http://www.psce.com/blog for details
• Follow @dbasquare to receive updates
What is WebScaleSQL?
WebScaleSQL is a collaboration among engineers from several companies,
such as Facebook, Twitter, Google, and LinkedIn, that face the same challenges
in deploying MySQL at scale and seek greater performance from a database
technology tailored to their needs.