|
Chapter 9. Internationalization and Localization
This chapter covers issues of internationalization (MySQL's
capabilities for adapting to local use) and localization (selecting
particular local conventions):
MySQL support for character sets in SQL statements.
How to configure the server to support different character sets.
Selecting the language for error messages.
How to set the server's time zone and enable per-connection time
zone support.
Selecting the locale for day and month names.
9.1. Character Set Support
MySQL includes character set support that enables you to store
data using a variety of character sets and perform comparisons
according to a variety of collations. You can specify character
sets at the server, database, table, and column level. MySQL
supports the use of character sets for the
MyISAM , MEMORY ,
NDBCLUSTER , and
InnoDB storage engines.
This chapter discusses the following topics:
What are character sets and collations?
The multiple-level default system for character set assignment
Syntax for specifying character sets and collations
Affected functions and operations
Unicode support
The character sets and collations that are available, with
notes
Character set issues affect not only data storage, but also
communication between client programs and the MySQL server. If you
want the client program to communicate with the server using a
character set different from the default, you'll need to indicate
which one. For example, to use the utf8 Unicode
character set, issue this statement after connecting to the
server:
SET NAMES 'utf8';
For more information about configuring character sets for
application use and character set-related issues in client/server
communication, see Section 9.1.5, “Configuring the Character Set and Collation for Applications”, and
Section 9.1.4, “Connection Character Sets and Collations”.
9.1.1. Character Sets and Collations in General
A character set is a set of symbols and
encodings. A collation is a set of rules
for comparing characters in a character set. Let's make the
distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters:
“A ”,
“B ”,
“a ”,
“b ”. We give each letter a
number: “A ” = 0,
“B ” = 1,
“a ” = 2,
“b ” = 3. The letter
“A ” is a symbol, the number 0 is
the encoding for
“A ”, and the combination of all
four letters and their encodings is a
character set.
Suppose that we want to compare two string values,
“A ” and
“B ”. The simplest way to do this
is to look at the encodings: 0 for
“A ” and 1 for
“B ”. Because 0 is less than 1,
we say “A ” is less than
“B ”. What we've just done is
apply a collation to our character set. The collation is a set
of rules (only one rule in this case): “compare the
encodings.” We call this simplest of all possible
collations a binary collation.
But what if we want to say that the lowercase and uppercase
letters are equivalent? Then we would have at least two rules:
(1) treat the lowercase letters
“a ” and
“b ” as equivalent to
“A ” and
“B ”; (2) then compare the
encodings. We call this a
case-insensitive collation. It is a
little more complex than a binary collation.
In real life, most character sets have many characters: not just
“A ” and
“B ” but whole alphabets,
sometimes multiple alphabets or eastern writing systems with
thousands of characters, along with many special symbols and
punctuation marks. Also in real life, most collations have many
rules, not just for whether to distinguish lettercase, but also
for whether to distinguish accents (an “accent” is
a mark attached to a character as in German
“? ”), and for
multiple-character mappings (such as the rule that
“? ” =
“OE ” in one of the two German
collations).
MySQL can do these things for you:
Store strings using a variety of character sets
Compare strings using a variety of collations
Mix strings with different character sets or collations in
the same server, the same database, or even the same table
Allow specification of character set and collation at any
level
In these respects, MySQL is far ahead of most other database
management systems. However, to use these features effectively,
you need to know what character sets and collations are
available, how to change the defaults, and how they affect the
behavior of string operators and functions.
9.1.2. Character Sets and Collations in MySQL
The MySQL server can support multiple character sets. To list
the available character sets, use the SHOW
CHARACTER SET statement. A partial listing follows.
For more complete information, see
Section 9.1.13, “Character Sets and Collations That MySQL Supports”.
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
...
Any given character set always has at least one collation. It
may have several collations. To list the collations for a
character set, use the SHOW
COLLATION statement. For example, to see the
collations for the latin1 (cp1252 West
European) character set, use this statement to find those
collation names that begin with latin1 :
mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 1 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+---------------------+---------+----+---------+----------+---------+
The latin1 collations have the following
meanings.
Collations have these general characteristics:
Two different character sets cannot have the same collation.
Each character set has one collation that is the
default collation. For example, the
default collation for latin1 is
latin1_swedish_ci . The output for
SHOW CHARACTER SET indicates
which collation is the default for each displayed character
set.
There is a convention for collation names: They start with
the name of the character set with which they are
associated, they usually include a language name, and they
end with _ci (case insensitive),
_cs (case sensitive), or
_bin (binary).
In cases where a character set has multiple collations, it might
not be clear which collation is most suitable for a given
application. To avoid choosing the wrong collation, it can be
helpful to perform some comparisons with representative data
values to make sure that a given collation sorts values the way
you expect.
Collation-Charts.Org
is a useful site for information that shows how one collation
compares to another.
9.1.3. Specifying Character Sets and Collations
There are default settings for character sets and collations at
four levels: server, database, table, and column. The
description in the following sections may appear complex, but it
has been found in practice that multiple-level defaulting leads
to natural and obvious results.
CHARACTER SET is used in clauses that specify
a character set. CHARSET can be used as a
synonym for CHARACTER SET .
Character set issues affect not only data storage, but also
communication between client programs and the MySQL server. If
you want the client program to communicate with the server using
a character set different from the default, you'll need to
indicate which one. For example, to use the
utf8 Unicode character set, issue this
statement after connecting to the server:
SET NAMES 'utf8';
For more information about character set-related issues in
client/server communication, see
Section 9.1.4, “Connection Character Sets and Collations”.
9.1.3.1. Server Character Set and Collation
MySQL Server has a server character set and a server
collation. These can be set at server startup on the command
line or in an option file and changed at runtime.
Initially, the server character set and collation depend on
the options that you use when you start
mysqld. You can use
--character-set-server for the
character set. Along with it, you can add
--collation-server for the
collation. If you don't specify a character set, that is the
same as saying
--character-set-server=latin1 .
If you specify only a character set (for example,
latin1 ) but not a collation, that is the
same as saying
--character-set-server=latin1
--collation-server=latin1_swedish_ci
because latin1_swedish_ci is the default
collation for latin1 . Therefore, the
following three commands all have the same effect:
shell> mysqld
shell> mysqld --character-set-server=latin1
shell> mysqld --character-set-server=latin1 \
--collation-server=latin1_swedish_ci
One way to change the settings is by recompiling. If you want
to change the default server character set and collation when
building from sources, use:
--with-charset and
--with-collation as
arguments for configure. For example:
shell> ./configure --with-charset=latin1
Or:
shell> ./configure --with-charset=latin1 \
--with-collation=latin1_german1_ci
Both mysqld and
configure verify that the character
set/collation combination is valid. If not, each program
displays an error message and terminates.
The server character set and collation are used as default
values if the database character set and collation are not
specified in CREATE DATABASE
statements. They have no other purpose.
The current server character set and collation can be
determined from the values of the
character_set_server and
collation_server system
variables. These variables can be changed at runtime.
9.1.3.2. Database Character Set and Collation
Every database has a database character set and a database
collation. The CREATE DATABASE
and ALTER DATABASE statements
have optional clauses for specifying the database character
set and collation:
CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name ]
[[DEFAULT] COLLATE collation_name ]
ALTER DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name ]
[[DEFAULT] COLLATE collation_name ]
The keyword SCHEMA can be used instead of
DATABASE .
All database options are stored in a text file named
db.opt that can be found in the database
directory.
The CHARACTER SET and
COLLATE clauses make it possible to create
databases with different character sets and collations on the
same MySQL server.
Example:
CREATE DATABASE db_name CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the database character set and database
collation in the following manner:
If both CHARACTER SET
X and
COLLATE Y
are specified, character set X
and collation Y are used.
If CHARACTER SET
X is specified
without COLLATE , character set
X and its default collation are
used. To see the default collation for each character set,
use the SHOW COLLATION
statement.
If COLLATE Y
is specified without CHARACTER SET , the
character set associated with Y
and collation Y are used.
Otherwise, the server character set and server collation
are used.
The database character set and collation are used as default
values for table definitions if the table character set and
collation are not specified in CREATE
TABLE statements. The database character set also is
used by LOAD DATA
INFILE . The character set and collation have no
other purposes.
The character set and collation for the default database can
be determined from the values of the
character_set_database and
collation_database system
variables. The server sets these variables whenever the
default database changes. If there is no default database, the
variables have the same value as the corresponding
server-level system variables,
character_set_server and
collation_server .
9.1.3.3. Table Character Set and Collation
Every table has a table character set and a table collation.
The CREATE TABLE and
ALTER TABLE statements have
optional clauses for specifying the table character set and
collation:
CREATE TABLE tbl_name (column_list )
[[DEFAULT] CHARACTER SET charset_name ]
[COLLATE collation_name ]]
ALTER TABLE tbl_name
[[DEFAULT] CHARACTER SET charset_name ]
[COLLATE collation_name ]
Example:
CREATE TABLE t1 ( ... )
CHARACTER SET latin1 COLLATE latin1_danish_ci;
MySQL chooses the table character set and collation in the
following manner:
If both CHARACTER SET
X and
COLLATE Y
are specified, character set X
and collation Y are used.
If CHARACTER SET
X is specified
without COLLATE , character set
X and its default collation are
used. To see the default collation for each character set,
use the SHOW COLLATION
statement.
If COLLATE Y
is specified without CHARACTER SET , the
character set associated with Y
and collation Y are used.
Otherwise, the database character set and collation are
used.
The table character set and collation are used as default
values for column definitions if the column character set and
collation are not specified in individual column definitions.
The table character set and collation are MySQL extensions;
there are no such things in standard SQL.
9.1.3.4. Column Character Set and Collation
Every “character” column (that is, a column of
type CHAR ,
VARCHAR , or
TEXT ) has a column character
set and a column collation. Column definition syntax for
CREATE TABLE and
ALTER TABLE has optional
clauses for specifying the column character set and collation:
col_name {CHAR | VARCHAR | TEXT} (col_length )
[CHARACTER SET charset_name ]
[COLLATE collation_name ]
These clauses can also be used for
ENUM and
SET columns:
col_name {ENUM | SET} (val_list )
[CHARACTER SET charset_name ]
[COLLATE collation_name ]
Examples:
CREATE TABLE t1
(
col1 VARCHAR(5)
CHARACTER SET latin1
COLLATE latin1_german1_ci
);
ALTER TABLE t1 MODIFY
col1 VARCHAR(5)
CHARACTER SET latin1
COLLATE latin1_swedish_ci;
MySQL chooses the column character set and collation in the
following manner:
If both CHARACTER SET
X and
COLLATE Y
are specified, character set X
and collation Y are used.
CREATE TABLE t1
(
col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci
) CHARACTER SET latin1 COLLATE latin1_bin;
The character set and collation are specified for the
column, so they are used. The column has character set
utf8 and collation
utf8_unicode_ci .
If CHARACTER SET
X is specified
without COLLATE , character set
X and its default collation are
used.
CREATE TABLE t1
(
col1 CHAR(10) CHARACTER SET utf8
) CHARACTER SET latin1 COLLATE latin1_bin;
The character set is specified for the column, but the
collation is not. The column has character set
utf8 and the default collation for
utf8 , which is
utf8_general_ci . To see the default
collation for each character set, use the
SHOW COLLATION statement.
If COLLATE Y
is specified without CHARACTER SET , the
character set associated with Y
and collation Y are used.
CREATE TABLE t1
(
col1 CHAR(10) COLLATE utf8_polish_ci
) CHARACTER SET latin1 COLLATE latin1_bin;
The collation is specified for the column, but the
character set is not. The column has collation
utf8_polish_ci and the character set is
the one associated with the collation, which is
utf8 .
Otherwise, the table character set and collation are used.
CREATE TABLE t1
(
col1 CHAR(10)
) CHARACTER SET latin1 COLLATE latin1_bin;
Neither the character set nor collation are specified for
the column, so the table defaults are used. The column has
character set latin1 and collation
latin1_bin .
The CHARACTER SET and
COLLATE clauses are standard SQL.
If you use ALTER TABLE to
convert a column from one character set to another, MySQL
attempts to map the data values, but if the character sets are
incompatible, there may be data loss.
9.1.3.5. Character String Literal Character Set and Collation
Every character string literal has a character set and a
collation.
A character string literal may have an optional character set
introducer and COLLATE clause:
[_charset_name ]'string ' [COLLATE collation_name ]
Examples:
SELECT 'string ';
SELECT _latin1'string ';
SELECT _latin1'string ' COLLATE latin1_danish_ci;
For the simple statement SELECT
'string ' , the string has
the character set and collation defined by the
character_set_connection and
collation_connection system
variables.
The
_charset_name
expression is formally called an
introducer. It tells the parser,
“the string that is about to follow uses character set
X .” Because this has
confused people in the past, we emphasize that an introducer
does not change the string to the introducer character set
like CONVERT() would do. It
does not change the string's value, although padding may
occur. The introducer is just a signal. An introducer is also
legal before standard hex literal and numeric hex literal
notation
(x'literal ' and
0xnnnn ), or
before bit-field literal notation
(b'literal ' and
0bnnnn ).
Examples:
SELECT _latin1 x'AABBCC';
SELECT _latin1 0xAABBCC;
SELECT _latin1 b'1100011';
SELECT _latin1 0b1100011;
MySQL determines a literal's character set and collation in
the following manner:
If both _X and COLLATE
Y are specified,
character set X and collation
Y are used.
If _X is specified but
COLLATE is not specified, character set
X and its default collation are
used. To see the default collation for each character set,
use the SHOW COLLATION
statement.
Otherwise, the character set and collation given by the
character_set_connection
and collation_connection
system variables are used.
Examples:
A string with latin1 character set and
latin1_german1_ci collation:
SELECT _latin1'M?ller' COLLATE latin1_german1_ci;
A string with latin1 character set and
its default collation (that is,
latin1_swedish_ci ):
SELECT _latin1'M?ller';
A string with the connection default character set and
collation:
SELECT 'M?ller';
Character set introducers and the COLLATE
clause are implemented according to standard SQL
specifications.
An introducer indicates the character set for the following
string, but does not change now how the parser performs escape
processing within the string. Escapes are always interpreted
by the parser according to the character set given by
character_set_connection .
The following examples show that escape processing occurs
using
character_set_connection even
in the presence of an introducer. The examples use
SET NAMES (which changes
character_set_connection , as
discussed in Section 9.1.4, “Connection Character Sets and Collations”), and
display the resulting strings using the
HEX() function so that the
exact string contents can be seen.
Example 1:
mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.01 sec)
mysql> SELECT HEX('?\n'), HEX(_sjis'?\n');
+------------+-----------------+
| HEX('?\n') | HEX(_sjis'?\n') |
+------------+-----------------+
| E00A | E00A |
+------------+-----------------+
1 row in set (0.00 sec)
Here, “? ” (hex value
E0 ) is followed by
“\n ”, the escape sequence for
newline. The escape sequence is interpreted using the
character_set_connection
value of latin1 to produce a literal
newline (hex value 0A ). This happens even
for the second string. That is, the introducer of
_sjis does not affect the parser's escape
processing.
Example 2:
mysql> SET NAMES sjis;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT HEX('?\n'), HEX(_latin1'?\n');
+------------+-------------------+
| HEX('?\n') | HEX(_latin1'?\n') |
+------------+-------------------+
| E05C6E | E05C6E |
+------------+-------------------+
1 row in set (0.04 sec)
Here,
character_set_connection is
sjis , a character set in which the sequence
of “? ” followed by
“\ ” (hex values
05 and 5C ) is a valid
multi-byte character. Hence, the first two bytes of the string
are interpreted as a single sjis character,
and the “\ ” is not interpreted
as an escape character. The following
“n ” (hex value
6E ) is not interpreted as part of an escape
sequence. This is true even for the second string; the
introducer of _latin1 does not affect
escape processing.
9.1.3.6. National Character Set
Standard SQL defines NCHAR or
NATIONAL CHAR as a way to
indicate that a CHAR column
should use some predefined character set. MySQL
5.0 uses utf8 as this
predefined character set. For example, these data type
declarations are equivalent:
CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)
As are these:
VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)
You can use
N'literal ' (or
n'literal ' ) to
create a string in the national character set. These
statements are equivalent:
SELECT N'some text';
SELECT n'some text';
SELECT _utf8'some text';
For information on upgrading character sets to MySQL
5.0 from versions prior to 4.1, see the
MySQL 3.23, 4.0, 4.1 Reference Manual.
9.1.3.7. Examples of Character Set and Collation Assignment
The following examples show how MySQL determines default
character set and collation values.
Example 1: Table and Column
Definition
CREATE TABLE t1
(
c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
Here we have a column with a latin1
character set and a latin1_german1_ci
collation. The definition is explicit, so that is
straightforward. Notice that there is no problem with storing
a latin1 column in a
latin2 table.
Example 2: Table and Column
Definition
CREATE TABLE t1
(
c1 CHAR(10) CHARACTER SET latin1
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
This time we have a column with a latin1
character set and a default collation. Although it might seem
natural, the default collation is not taken from the table
level. Instead, because the default collation for
latin1 is always
latin1_swedish_ci , column
c1 has a collation of
latin1_swedish_ci (not
latin1_danish_ci ).
Example 3: Table and Column
Definition
CREATE TABLE t1
(
c1 CHAR(10)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
We have a column with a default character set and a default
collation. In this circumstance, MySQL checks the table level
to determine the column character set and collation.
Consequently, the character set for column
c1 is latin1 and its
collation is latin1_danish_ci .
Example 4: Database, Table, and Column
Definition
CREATE DATABASE d1
DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
c1 CHAR(10)
);
We create a column without specifying its character set and
collation. We're also not specifying a character set and a
collation at the table level. In this circumstance, MySQL
checks the database level to determine the table settings,
which thereafter become the column settings.) Consequently,
the character set for column c1 is
latin2 and its collation is
latin2_czech_ci .
9.1.3.8. Compatibility with Other DBMSs
For MaxDB compatibility these two statements are the same:
CREATE TABLE t1 (f1 CHAR(N ) UNICODE);
CREATE TABLE t1 (f1 CHAR(N ) CHARACTER SET ucs2);
9.1.4. Connection Character Sets and Collations
Several character set and collation system variables relate to a
client's interaction with the server. Some of these have been
mentioned in earlier sections:
Additional character set and collation system variables are
involved in handling traffic for the connection between a client
and the server. Every client has connection-related character
set and collation system variables.
Consider what a “connection” is: It is what you
make when you connect to the server. The client sends SQL
statements, such as queries, over the connection to the server.
The server sends responses, such as result sets or error
messages, over the connection back to the client. This leads to
several questions about character set and collation handling for
client connections, each of which can be answered in terms of
system variables:
What character set is the statement in when it leaves the
client?
The server takes the
character_set_client system
variable to be the character set in which statements are
sent by the client.
What character set should the server translate a statement
to after receiving it?
For this, the server uses the
character_set_connection
and collation_connection
system variables. It converts statements sent by the client
from character_set_client
to character_set_connection
(except for string literals that have an introducer such as
_latin1 or _utf8 ).
collation_connection is
important for comparisons of literal strings. For
comparisons of strings with column values,
collation_connection does
not matter because columns have their own collation, which
has a higher collation precedence.
What character set should the server translate to before
shipping result sets back to the client?
The character_set_results
system variable indicates the character set in which the
server returns query results to the client. This includes
result data such as column values, and result metadata such
as column names.
Clients can fine-tune the settings for these variables, or
depend on the defaults (in which case, you can skip the rest of
this section). If you do not use the defaults, you must change
the character settings for each connection to the
server.
There are two statements that affect the connection-related
character set variables as a group:
SET NAMES 'charset_name '
[COLLATE
'collation_name ']
SET NAMES indicates what character set
the client will use to send SQL statements to the server.
Thus, SET NAMES 'cp1251' tells the
server, “future incoming messages from this client are
in character set cp1251 .” It also
specifies the character set that the server should use for
sending results back to the client. (For example, it
indicates what character set to use for column values if you
use a SELECT statement.)
A SET NAMES
'x ' statement is
equivalent to these three statements:
SET character_set_client = x ;
SET character_set_results = x ;
SET character_set_connection = x ;
Setting each of these character set variables also sets its
corresponding collation variable to the default correlation
for the character set. For example, setting
character_set_connection to
x also sets
collation_connection to the
default collation for x . It is
not necessary to set that collation explicitly. To specify a
particular collation for the character sets, use the
optional COLLATE clause:
SET NAMES 'charset_name ' COLLATE 'collation_name '
SET CHARACTER SET
charset_name
SET CHARACTER SET is similar to
SET NAMES but sets
character_set_connection
and collation_connection to
character_set_database and
collation_database . A
SET CHARACTER SET
x statement is
equivalent to these three statements:
SET character_set_client = x ;
SET character_set_results = x ;
SET collation_connection = @@collation_database;
Setting
collation_connection also
sets
character_set_connection to
the character set associated with the collation (equivalent
to executing SET character_set_connection =
@@character_set_database ). It is not necessary to
set
character_set_connection
explicitly.
Note
ucs2 cannot be used as a client character
set, which means that it does not work for SET
NAMES or SET CHARACTER SET .
The MySQL client programs mysql ,
mysqladmin , mysqlcheck ,
mysqlimport , and mysqlshow
determine the default character set to use as follows:
In the absence of other information, the programs use the
compiled-in default character set, usually
latin1 .
The programs support a
--default-character-set
option, which enables users to specify the character set
explicitly to override whatever default the client otherwise
determines.
When a client connects to the server, it sends the name of the
character set that it wants to use. The server uses the name to
set the character_set_client ,
character_set_results , and
character_set_connection system
variables. In effect, the server performs a SET
NAMES operation using the character set name.
With the mysql client, if you want to use a
character set different from the default, you could explicitly
execute SET NAMES every time you start up.
However, to accomplish the same result more easily, you can add
the --default-character-set option
setting to your mysql command line or in your
option file. For example, the following option file setting
changes the three connection-related character set variables set
to koi8r each time you invoke
mysql:
[mysql]
default-character-set=koi8r
If you are using the mysql client with
auto-reconnect enabled (which is not recommended), it is
preferable to use the charset command rather
than SET NAMES . For example:
mysql> charset utf8
Charset changed
The charset command issues a SET
NAMES statement, and also changes the default
character set that mysql uses when it
reconnects after the connection has dropped.
Example: Suppose that column1 is defined as
CHAR(5) CHARACTER SET latin2 . If you do not
say SET NAMES or SET CHARACTER
SET , then for SELECT column1 FROM
t , the server sends back all the values for
column1 using the character set that the
client specified when it connected. On the other hand, if you
say SET NAMES 'latin1' or SET
CHARACTER SET latin1 before issuing the
SELECT statement, the server
converts the latin2 values to
latin1 just before sending results back.
Conversion may be lossy if there are characters that are not in
both character sets.
If you do not want the server to perform any conversion of
result sets or error messages, set
character_set_results to
NULL or binary :
SET character_set_results = NULL;
To see the values of the character set and collation system
variables that apply to your connection, use these statements:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
You must also consider the environment within which your MySQL
applications execute. See
Section 9.1.5, “Configuring the Character Set and Collation for Applications”.
For more information about character sets and error messages,
see Section 9.1.6, “Character Set for Error Messages”.
9.1.5. Configuring the Character Set and Collation for Applications
For applications that store data using the default MySQL
character set and collation (latin1 ,
latin1_swedish_ci ), no special configuration
should be needed. If applications require data storage using a
different character set or collation, you can configure
character set information several ways:
Specify character settings per database. For example,
applications that use one database might require
utf8 , whereas applications that use
another database might require sjis .
Specify character settings at server startup. This causes
the server to use the given settings for all applications
that do not make other arrangements.
Specify character settings at configuration time, if you
build MySQL from source. This causes the server to use the
given settings for all applications, without having to
specify them at server startup.
When different applications require different character
settings, the per-database technique provides a good deal of
flexibility. If most or all applications use the same character
set, specifying character settings at server startup or
configuration time may be most convenient.
For the per-database or server-startup techniques, the settings
control the character set for data storage. Applications must
also tell the server which character set to use for
client/server communications, as described in the following
instructions.
The examples shown here assume use of the
utf8 character set and
utf8_general_ci collation.
Specify character settings per
database. To create a database such that its tables
will use a given default character set and collation for data
storage, use a CREATE DATABASE
statement like this:
CREATE DATABASE mydb
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
Tables created in the database will use utf8
and utf8_general_ci by default for any
character columns.
Applications that use the database should also configure their
connection to the server each time they connect. This can be
done by executing a SET NAMES 'utf8'
statement after connecting. The statement can be used regardless
of connection method: The mysql client, PHP
scripts, and so forth.
In some cases, it may be possible to configure the connection to
use the desired character set some other way. For example, for
connections made using mysql, you can specify
the --default-character-set=utf8
command-line option to achieve the same effect as SET
NAMES 'utf8' .
For more information about configuring client connections, see
Section 9.1.4, “Connection Character Sets and Collations”.
Specify character settings at server
startup. To select a character set and collation at
server startup, use the
--character-set-server and
--collation-server options. For
example, to specify the options in an option file, include these
lines:
[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci
These settings apply server-wide and apply as the defaults for
databases created by any application, and for tables created in
those databases.
It is still necessary for applications to configure their
connection using SET NAMES or equivalent
after they connect, as described previously. You might be
tempted to start the server with the
--init_connect="SET NAMES 'utf8'"
option to cause SET NAMES to be executed
automatically for each client that connects. However, this will
yield inconsistent results because the
init_connect value is not
executed for users who have the
SUPER privilege.
Specify character settings at MySQL
configuration time. To select a character set and
collation when you configure and build MySQL from source, use
the --with-charset and
--with-collation options:
shell> ./configure --with-charset=utf8 --with-collation=utf8_general_ci
The resulting server uses utf8 and
utf8_general_ci as the default for databases
and tables and for client connections. It is unnecessary to use
--character-set-server and
--collation-server to specify
those defaults at server startup. It is also unnecessary for
applications to configure their connection using SET
NAMES or equivalent after they connect to the server.
Regardless of how you configure the MySQL character set for
application use, you must also consider the environment within
which those applications execute. If you will send statements
using UTF-8 text taken from a file that you create in an editor,
you should edit the file with the locale of your environment set
to UTF-8 so that the file's encoding is correct and so that the
operating system handles it correctly. If you use the
mysql client from within a terminal window,
the window must be configured to use UTF-8 or characters may not
display properly. For a script that executes in a Web
environment, the script must handle character encoding properly
for its interaction with the MySQL server, and it must generate
pages that correctly indicate the encoding so that browsers know
how to display the content of the pages. For example, you can
include this <meta> tag within your
<head> element:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
9.1.6. Character Set for Error Messages
This section describes how the server uses character sets for
constructing error messages and returning them to clients. For
information about the language of error messages (rather than
the character set), see
Section 9.3, “Setting the Error Message Language”.
In MySQL 5.0, the server constructs error messages
and returns them to clients as follows:
The message template has the character set associated with
the error message language. For example, English, Korean,
and Russian messages use latin1 ,
euckr , and koi8r ,
respectively.
Parameters in the message template are replaced with values
that apply to a specific error occurrence. These parameters
use their own character set. Identifiers such as table or
column names use UTF-8. Data values retain their character
set. For example, in the following duplicate-key message,
'xxx ' has the
character set of the table column associated with key 1:
Duplicate entry 'xxx ' for key1
The preceding method of error-message construction can result in
messages that contain a mix of character sets unless all items
involved contain only ASCII characters. This issue is resolved
in MySQL 5.5, in which error messages are constructed internally
within the server using UTF-8 and returned to the client in the
character set specified by the
character_set_results system
variable.
The following sections discuss various aspects of character set
collations.
9.1.7.1. Using COLLATE in SQL Statements
With the COLLATE clause, you can override
whatever the default collation is for a comparison.
COLLATE may be used in various parts of SQL
statements. Here are some examples:
With ORDER BY :
SELECT k
FROM t1
ORDER BY k COLLATE latin1_german2_ci;
With AS :
SELECT k COLLATE latin1_german2_ci AS k1
FROM t1
ORDER BY k1;
With GROUP BY :
SELECT k
FROM t1
GROUP BY k COLLATE latin1_german2_ci;
With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci)
FROM t1;
With DISTINCT :
SELECT DISTINCT k COLLATE latin1_german2_ci
FROM t1;
With WHERE :
SELECT *
FROM t1
WHERE _latin1 'M?ller' COLLATE latin1_german2_ci = k;
SELECT *
FROM t1
WHERE k LIKE _latin1 'M?ller' COLLATE latin1_german2_ci;
With HAVING :
SELECT k
FROM t1
GROUP BY k
HAVING k = _latin1 'M?ller' COLLATE latin1_german2_ci;
9.1.7.2. COLLATE Clause Precedence
The COLLATE clause has high precedence
(higher than || ),
so the following two expressions are equivalent:
x || y COLLATE z
x || (y COLLATE z)
The BINARY operator casts the
string following it to a binary string. This is an easy way to
force a comparison to be done byte by byte rather than
character by character. BINARY
also causes trailing spaces to be significant.
mysql> SELECT 'a' = 'A';
-> 1
mysql> SELECT BINARY 'a' = 'A';
-> 0
mysql> SELECT 'a' = 'a ';
-> 1
mysql> SELECT BINARY 'a' = 'a ';
-> 0
BINARY str is
shorthand for
CAST(str AS
BINARY) .
The BINARY attribute in character column
definitions has a different effect. A character column defined
with the BINARY attribute is assigned the
binary collation of the column's character set. Every
character set has a binary collation. For example, the binary
collation for the latin1 character set is
latin1_bin , so if the table default
character set is latin1 , these two column
definitions are equivalent:
CHAR(10) BINARY
CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
The effect of BINARY as a column attribute
differs from its effect prior to MySQL 4.1. Formerly,
BINARY resulted in a column that was
treated as a binary string. A binary string is a string of
bytes that has no character set or collation, which differs
from a nonbinary character string that has a binary collation.
For both types of strings, comparisons are based on the
numeric values of the string unit, but for nonbinary strings
the unit is the character and some character sets allow
multi-byte characters. Section 10.4.2, “The BINARY and
VARBINARY Types”.
The use of CHARACTER SET binary in the
definition of a CHAR ,
VARCHAR , or
TEXT column causes the column
to be treated as a binary data type. For example, the
following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary
BINARY(10)
VARCHAR(10) CHARACTER SET binary
VARBINARY(10)
TEXT CHARACTER SET binary
BLOB
9.1.7.4. The _bin and binary Collations
This section describes how _bin collations
for nonbinary strings differ from the
binary “collation” for binary
strings.
Nonbinary strings (as stored in the
CHAR ,
VARCHAR , and
TEXT data types) have a
character set and collation. A given character set can have
several collations, each of which defines a particular sorting
and comparison order for the characters in the set. One of
these is the binary collation for the character set, indicated
by a _bin suffix in the collation name. For
example, latin1 and utf8
have binary collations named latin1_bin and
utf8_bin .
Binary strings (as stored in the
BINARY ,
VARBINARY , and
BLOB data types) have no
character set or collation in the sense that nonbinary strings
do. (Applied to a binary string, the
CHARSET() and
COLLATION() functions both return a value
of binary .) Binary strings are sequences of
bytes and the numeric values of those bytes determine sort
order.
The _bin collations differ from the
binary collation in several respects.
The unit for sorting and
comparison. Binary strings are sequences of bytes.
Sorting and comparison is always based on numeric byte values.
Nonbinary strings are sequences of characters, which might be
multi-byte. Collations for nonbinary strings define an
ordering of the character values for sorting and comparison.
For the _bin collation, this ordering is
based solely on numeric values of the characters (which is
similar to ordering for binary strings except that a
_bin collation must take into account that
a character might contain multiple bytes). For other
collations, character ordering might take additional factors
such as lettercase into account.
Character set conversion. A
nonbinary string has a character set and is converted to
another character set in many cases, even when the string has
a _bin collation:
When assigning column values from another column that has
a different character set:
UPDATE t1 SET utf8_bin_column=latin1_column;
INSERT INTO t1 (latin1_column) SELECT utf8_bin_column FROM t2;
When assigning column values for
INSERT or
UPDATE using a string
literal:
SET NAMES latin1;
INSERT INTO t1 (utf8_bin_column) VALUES ('string-in-latin1');
When sending results from the server to a client:
SET NAMES latin1;
SELECT utf8_bin_column FROM t2;
For binary string columns, no conversion occurs. For the
preceding cases, the string value is copied byte-wise.
Lettercase conversion.
Collations provide information about lettercase of characters,
so characters in a nonbinary string can be converted from one
lettercase to another, even for _bin
collations that ignore lettercase for ordering:
mysql> SET NAMES latin1 COLLATE latin1_bin;
Query OK, 0 rows affected (0.02 sec)
mysql> SELECT LOWER('aA'), UPPER('zZ');
+-------------+-------------+
| LOWER('aA') | UPPER('zZ') |
+-------------+-------------+
| aa | ZZ |
+-------------+-------------+
1 row in set (0.13 sec)
The concept of lettercase does not apply to bytes in a binary
string. To perform lettercase conversion, the string must be
converted to a nonbinary string:
mysql> SET NAMES binary;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT LOWER('aA'), LOWER(CONVERT('aA' USING latin1));
+-------------+-----------------------------------+
| LOWER('aA') | LOWER(CONVERT('aA' USING latin1)) |
+-------------+-----------------------------------+
| aA | aa |
+-------------+-----------------------------------+
1 row in set (0.00 sec)
Trailing space handling in
comparisons. Nonbinary strings have
PADSPACE behavior for all collations,
including _bin collations. Trailing spaces
are insignificant in comparisons:
mysql> SET NAMES utf8 COLLATE utf8_bin;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'a ' = 'a';
+------------+
| 'a ' = 'a' |
+------------+
| 1 |
+------------+
1 row in set (0.00 sec)
For binary strings, all characters are significant in
comparisons, including trailing spaces:
mysql> SET NAMES binary;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'a ' = 'a';
+------------+
| 'a ' = 'a' |
+------------+
| 0 |
+------------+
1 row in set (0.00 sec)
Trailing space handling for inserts and
retrievals.
CHAR(N ) columns
store nonbinary strings. Values shorter than
N characters are extended with
spaces on insertion. For retrieval, trailing spaces are
removed.
BINARY(N )
columns store binary strings. Values shorter than
N bytes are extended with
0x00 bytes on insertion. For retrieval,
nothing is removed; a value of the declared length is always
returned.
mysql> CREATE TABLE t1 (
-> a CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin,
-> b BINARY(10)
-> );
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO t1 VALUES ('a','a');
Query OK, 1 row affected (0.01 sec)
mysql> SELECT HEX(a), HEX(b) FROM t1;
+--------+----------------------+
| HEX(a) | HEX(b) |
+--------+----------------------+
| 61 | 61000000000000000000 |
+--------+----------------------+
1 row in set (0.04 sec)
9.1.7.5. Special Cases Where Collation Determination Is Tricky
In the great majority of statements, it is obvious what
collation MySQL uses to resolve a comparison operation. For
example, in the following cases, it should be clear that the
collation is the collation of column x :
SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;
However, when multiple operands are involved, there can be
ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should this query use the collation of the column
x , or of the string literal
'Y' ?
Standard SQL resolves such questions using what used to be
called “coercibility” rules. Basically, this
means: Both x and 'Y'
have collations, so which collation takes precedence? This can
be difficult to resolve, but the following rules cover most
situations:
An explicit COLLATE clause has a
coercibility of 0. (Not coercible at all.)
The concatenation of two strings with different collations
has a coercibility of 1.
The collation of a column or a stored routine parameter or
local variable has a coercibility of 2.
A “system constant” (the string returned by
functions such as USER() or
VERSION() ) has a
coercibility of 3.
A literal's collation has a coercibility of 4.
NULL or an expression that is derived
from NULL has a coercibility of 5.
The preceding coercibility values are current as of MySQL
5.0.3. In MySQL 5.0 prior to 5.0.3, there is no
system constant or ignorable coercibility. Functions such as
USER() have a coercibility of 2
rather than 3, and literals have a coercibility of 3 rather
than 4.
Those rules resolve ambiguities in the following manner:
Use the collation with the lowest coercibility value.
If both sides have the same coercibility, then:
If both sides are Unicode, or both sides are not
Unicode, it is an error.
If one of the sides has a Unicode character set, and
another side has a non-Unicode character set, the side
with Unicode character set wins, and automatic
character set conversion is applied to the non-Unicode
side. For example, the following statement will not
return an error:
SELECT CONCAT(utf8_column, latin1_column) FROM t1;
It will return a result, and the character set of the
result will be utf8 . The collation
of the result will be the collation of
utf8_column . Values of
latin1_column will be automatically
converted to utf8 before
concatenating.
For an operation with operands from the same character
set but that mix a _bin collation
and a _ci or _cs
collation, the _bin collation is
used. This is similar to how operations that mix
nonbinary and binary strings evaluate the operands as
binary strings, except that it is for collations
rather than data types.
Although automatic conversion is not in the SQL standard, the
SQL standard document does say that every character set is (in
terms of supported characters) a “subset” of
Unicode. Because it is a well-known principle that “what
applies to a superset can apply to a subset,” we
believe that a collation for Unicode can apply for comparisons
with non-Unicode strings.
Examples:
The COERCIBILITY() function can
be used to determine the coercibility of a string expression:
mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0
mysql> SELECT COERCIBILITY(VERSION());
-> 3
mysql> SELECT COERCIBILITY('A');
-> 4
See Section 11.10.3, “Information Functions”.
9.1.7.6. Collations Must Be for the Right Character Set
Each character set has one or more collations, but each
collation is associated with one and only one character set.
Therefore, the following statement causes an error message
because the latin2_bin collation is not
legal with the latin1 character set:
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
9.1.7.7. Examples of the Effect of Collation
Example 1: Sorting German
Umlauts
Suppose that column X in table
T has these latin1
column values:
Muffler
M?ller
MX Systems
MySQL
Suppose also that the column values are retrieved using the
following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name ;
The following table shows the resulting order of the values if
we use ORDER BY with different collations.
The character that causes the different sort orders in this
example is the U with two dots over it
(? ), which the Germans call
“U-umlaut.”
The first column shows the result of the
SELECT using the
Swedish/Finnish collating rule, which says that U-umlaut
sorts with Y.
The second column shows the result of the
SELECT using the German
DIN-1 rule, which says that U-umlaut sorts with U.
The third column shows the result of the
SELECT using the German
DIN-2 rule, which says that U-umlaut sorts with UE.
Example 2: Searching for German
Umlauts
Suppose that you have three tables that differ only by the
character set and collation used:
mysql> CREATE TABLE german1 (
-> c CHAR(10)
-> ) CHARACTER SET latin1 COLLATE latin1_german1_ci;
mysql> CREATE TABLE german2 (
-> c CHAR(10)
-> ) CHARACTER SET latin1 COLLATE latin1_german2_ci;
mysql> CREATE TABLE germanutf8 (
-> c CHAR(10)
-> ) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Each table contains two records:
mysql> INSERT INTO german1 VALUES ('Bar'), ('B?r');
mysql> INSERT INTO german2 VALUES ('Bar'), ('B?r');
mysql> INSERT INTO germanutf8 VALUES ('Bar'), ('B?r');
Two of the above collations have an A = ?
equality, and one has no such equality
(latin1_german2_ci ). For that reason,
you'll get these results in comparisons:
mysql> SELECT * FROM german1 WHERE c = 'B?r';
+------+
| c |
+------+
| Bar |
| B?r |
+------+
mysql> SELECT * FROM german2 WHERE c = 'B?r';
+------+
| c |
+------+
| B?r |
+------+
mysql> SELECT * FROM germanutf8 WHERE c = 'B?r';
+------+
| c |
+------+
| Bar |
| B?r |
+------+
This is not a bug but rather a consequence of the sorting that
latin1_german1_ci or
utf8_unicode_ci do (the sorting shown is
done according to the German DIN 5007 standard).
The repertoire of a character set is the
collection of characters in the set.
As of MySQL 5.0.48, string expressions have a repertoire
attribute, which can have two values:
The ASCII range is a subset of
UNICODE range, so a string with
ASCII repertoire can be converted safely
without loss of information to the character set of any string
with UNICODE repertoire or to a character set
that is a superset of ASCII . (All MySQL
character sets are supersets of ASCII with
the exception of swe7 , which reuses some
punctuation characters for Swedish accented characters.) The use
of repertoire enables character set conversion in expressions
for many cases where MySQL would otherwise return an
“illegal mix of collations” error.
The following discussion provides examples of expressions and
their repertoires, and describes how the use of repertoire
changes string expression evaluation:
The repertoire for string constants depends on string
content:
SET NAMES utf8; SELECT 'abc';
SELECT _utf8'def';
SELECT N'MySQL';
Although the character set is utf8 in
each of the preceding cases, the strings do not actually
contain any characters outside the ASCII range, so their
repertoire is ASCII rather than
UNICODE .
Columns having the ascii character set
have ASCII repertoire because of their
character set. In the following table, c1
has ASCII repertoire:
CREATE TABLE t1 (c1 CHAR(1) CHARACTER SET ascii);
The following example illustrates how repertoire enables a
result to be determined in a case where an error occurs
without repertoire:
CREATE TABLE t1 (
c1 CHAR(1) CHARACTER SET latin1,
c2 CHAR(1) CHARACTER SET ascii
);
INSERT INTO t1 VALUES ('a','b');
SELECT CONCAT(c1,c2) FROM t1;
Without repertoire, this error occurs:
ERROR 1267 (HY000): Illegal mix of collations (latin1_swedish_ci,IMPLICIT)
and (ascii_general_ci,IMPLICIT) for operation 'concat'
Using repertoire, subset to superset
(ascii to latin1 )
conversion can occur and a result is returned:
+---------------+
| CONCAT(c1,c2) |
+---------------+
| ab |
+---------------+
Functions with one string argument inherit the repertoire of
their argument. The result of
UPPER(_utf8'abc ')
has ASCII repertoire, because its
argument has ASCII repertoire.
For functions that return a string but do not have string
arguments and use
character_set_connection as
the result character set, the result repertoire is
ASCII if
character_set_connection is
ascii , and UNICODE
otherwise:
FORMAT(numeric_column , 4);
Use of repertoire changes how MySQL evaluates the following
example:
SET NAMES ascii;
CREATE TABLE t1 (a INT, b VARCHAR(10) CHARACTER SET latin1);
INSERT INTO t1 VALUES (1,'b');
SELECT CONCAT(FORMAT(a, 4), b) FROM t1;
Without repertoire, this error occurs:
ERROR 1267 (HY000): Illegal mix of collations (ascii_general_ci,COERCIBLE)
and (latin1_swedish_ci,IMPLICIT) for operation 'concat'
With repertoire, a result is returned:
+-------------------------+
| CONCAT(FORMAT(a, 4), b) |
+-------------------------+
| 1.0000b |
+-------------------------+
Functions with two or more string arguments use the
“widest” argument repertoire for the result
repertoire (UNICODE is wider than
ASCII ). Consider the following
CONCAT() calls:
CONCAT(_ucs2 0x0041, _ucs2 0x0042)
CONCAT(_ucs2 0x0041, _ucs2 0x00C2)
For the first call, the repertoire is
ASCII because both arguments are within
the range of the ascii character set. For
the second call, the repertoire is
UNICODE because the second argument is
outside the ascii character set range.
The repertoire for function return values is determined
based only on the repertoire of the arguments that affect
the result's character set and collation.
IF(column1 < column2, 'smaller', 'greater')
The result repertoire is ASCII because
the two string arguments (the second argument and the third
argument) both have ASCII repertoire. The
first argument does not matter for the result repertoire,
even if the expression uses string values.
9.1.9. Operations Affected by Character Set Support
This section describes operations that take character set
information into account.
MySQL has many operators and functions that return a string.
This section answers the question: What is the character set
and collation of such a string?
For simple functions that take string input and return a
string result as output, the output's character set and
collation are the same as those of the principal input value.
For example,
UPPER(X )
returns a string whose character string and collation are the
same as that of X . The same applies
for INSTR() ,
LCASE() ,
LOWER() ,
LTRIM() ,
MID() ,
REPEAT() ,
REPLACE() ,
REVERSE() ,
RIGHT() ,
RPAD() ,
RTRIM() ,
SOUNDEX() ,
SUBSTRING() ,
TRIM() ,
UCASE() , and
UPPER() .
Note: The REPLACE() function,
unlike all other functions, always ignores the collation of
the string input and performs a case-sensitive comparison.
If a string input or function result is a binary string, the
string has no character set or collation. This can be checked
by using the CHARSET() and
COLLATION() functions, both of
which return binary to indicate that their
argument is a binary string:
mysql> SELECT CHARSET(BINARY 'a'), COLLATION(BINARY 'a');
+---------------------+-----------------------+
| CHARSET(BINARY 'a') | COLLATION(BINARY 'a') |
+---------------------+-----------------------+
| binary | binary |
+---------------------+-----------------------+
For operations that combine multiple string inputs and return
a single string output, the “aggregation rules”
of standard SQL apply for determining the collation of the
result:
If an explicit COLLATE
X occurs, use
X .
If explicit COLLATE
X and
COLLATE Y
occur, raise an error.
Otherwise, if all collations are
X , use
X .
Otherwise, the result has no collation.
For example, with CASE ... WHEN a THEN b WHEN b THEN
c COLLATE X END , the
resulting collation is X . The same
applies for UNION ,
|| ,
CONCAT() ,
ELT() ,
GREATEST() ,
IF() , and
LEAST() .
For operations that convert to character data, the character
set and collation of the strings that result from the
operations are defined by the
character_set_connection and
collation_connection system
variables. This applies only to
CAST() ,
CONV() ,
FORMAT() ,
HEX() ,
SPACE() . Before MySQL 5.0.15,
it also applies to CHAR() .
If you are uncertain about the character set or collation of
the result returned by a string function, you can use the
CHARSET() or
COLLATION() function to find
out:
mysql> SELECT USER(), CHARSET(USER()), COLLATION(USER());
+----------------+-----------------+-------------------+
| USER() | CHARSET(USER()) | COLLATION(USER()) |
+----------------+-----------------+-------------------+
| test@localhost | utf8 | utf8_general_ci |
+----------------+-----------------+-------------------+
9.1.9.2. CONVERT() and
CAST()
CONVERT() provides a way to
convert data between different character sets. The syntax is:
CONVERT(expr USING transcoding_name )
In MySQL, transcoding names are the same as the corresponding
character set names.
Examples:
SELECT CONVERT(_latin1'M?ller' USING utf8);
INSERT INTO utf8table (utf8column)
SELECT CONVERT(latin1field USING utf8) FROM latin1table;
CONVERT(... USING ...) is
implemented according to the standard SQL specification.
You may also use CAST() to
convert a string to a different character set. The syntax is:
CAST(character_string AS character_data_type CHARACTER SET charset_name )
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
If you use CAST() without
specifying CHARACTER SET , the resulting
character set and collation are defined by the
character_set_connection and
collation_connection system
variables. If you use CAST()
with CHARACTER SET X , the resulting
character set and collation are X and the
default collation of X .
You may not use a COLLATE clause inside a
CAST() , but you may use it
outside. That is, CAST(... COLLATE
...) is illegal, but CAST(...)
COLLATE ... is legal.
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;
9.1.9.3. SHOW Statements and
INFORMATION_SCHEMA
Several SHOW statements provide
additional character set information. These include
SHOW CHARACTER SET ,
SHOW COLLATION ,
SHOW CREATE DATABASE ,
SHOW CREATE TABLE and
SHOW COLUMNS . These statements
are described here briefly. For more information, see
Section 12.5.5, “SHOW Syntax”.
INFORMATION_SCHEMA has several tables that
contain information similar to that displayed by the
SHOW statements. For example,
the CHARACTER_SETS and
COLLATIONS tables contain the
information displayed by SHOW CHARACTER
SET and SHOW
COLLATION . See Chapter 19, INFORMATION_SCHEMA Tables.
The SHOW CHARACTER SET command
shows all available character sets. It takes an optional
LIKE clause that indicates which
character set names to match. For example:
mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
+---------+-----------------------------+-------------------+--------+
The output from SHOW COLLATION
includes all available character sets. It takes an optional
LIKE clause that indicates which
collation names to match. For example:
mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 0 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 0 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+-------------------+---------+----+---------+----------+---------+
SHOW CREATE DATABASE displays
the CREATE DATABASE statement
that creates a given database:
mysql> SHOW CREATE DATABASE test;
+----------+-----------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------+
| test | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
If no COLLATE clause is shown, the default
collation for the character set applies.
SHOW CREATE TABLE is similar,
but displays the CREATE TABLE
statement to create a given table. The column definitions
indicate any character set specifications, and the table
options include character set information.
The SHOW COLUMNS statement
displays the collations of a table's columns when invoked as
SHOW FULL
COLUMNS . Columns with
CHAR ,
VARCHAR , or
TEXT data types have
collations. Numeric and other noncharacter types have no
collation (indicated by NULL as the
Collation value). For example:
mysql> SHOW FULL COLUMNS FROM person\G
*************************** 1. row ***************************
Field: id
Type: smallint(5) unsigned
Collation: NULL
Null: NO
Key: PRI
Default: NULL
Extra: auto_increment
Privileges: select,insert,update,references
Comment:
*************************** 2. row ***************************
Field: name
Type: char(60)
Collation: latin1_swedish_ci
Null: NO
Key:
Default:
Extra:
Privileges: select,insert,update,references
Comment:
The character set is not part of the display but is implied by
the collation name.
MySQL 5.0 supports two character sets for storing
Unicode data:
ucs2 , the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf8 , a UTF-8 encoding of the Unicode
character set using one to three bytes per character
These two character sets support the characters from the Basic
Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters
have these characteristics:
Their code values are between 0 and 65535 (or
U+0000 .. U+FFFF )
They can be encoded with a fixed 16-bit word, as in
ucs2
They can be encoded with 8, 16, or 24 bits, as in
utf8
They are sufficient for almost all characters in major
languages
The ucs2 and utf8
character sets do not support supplementary characters that lie
outside the BMP.
A similar set of collations is available for each Unicode
character set. For example, each has a Danish collation, the
names of which are ucs2_danish_ci and
utf8_danish_ci . All Unicode collations are
listed at Section 9.1.13.1, “Unicode Character Sets”.
In UCS-2, every character is represented by a two-byte Unicode
code with the most significant byte first. For example:
LATIN CAPITAL LETTER A has the code
0x0041 and it is stored as a two-byte
sequence: 0x00 0x41 . CYRILLIC SMALL
LETTER YERU (Unicode 0x044B ) is
stored as a two-byte sequence: 0x04 0x4B . For
Unicode characters and their codes, please refer to the
Unicode Home Page.
The MySQL implementation of UCS-2 stores characters in
big-endian byte order and does not use a byte order mark (BOM)
at the beginning of UCS-2 values. Other database systems might
use little-endian byte order or a BOM, in which case, conversion
of UCS-2 values will need to be performed when transferring data
between those systems and MySQL.
UTF-8 (Unicode Transformation Format with 8-bit units) is an
alternative way to store Unicode data. It is implemented
according to RFC 3629. RFC 3629 describes encoding sequences
that take from one to four bytes. Currently, MySQL support for
UTF-8 does not include four-byte sequences. (An older standard
for UTF-8 encoding is given by RFC 2279, which describes UTF-8
sequences that take from one to six bytes. RFC 3629 renders RFC
2279 obsolete; for this reason, sequences with five and six
bytes are no longer used.)
The idea of UTF-8 is that various Unicode characters are encoded
using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one
byte.
Most European and Middle East script letters fit into a
two-byte sequence: extended Latin letters (with tilde,
macron, acute, grave and other accents), Cyrillic, Greek,
Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte
sequences.
MySQL uses no BOM for UTF-8 values.
Tip: To save space with UTF-8,
use VARCHAR instead of
CHAR . Otherwise, MySQL must
reserve three bytes for each character in a CHAR
CHARACTER SET utf8 column because that is the maximum
possible length. For example, MySQL must reserve 30 bytes for a
CHAR(10) CHARACTER SET utf8 column.
UCS-2 cannot be used as a client character set, which means that
SET NAMES 'ucs2' does not work. (See
Section 9.1.4, “Connection Character Sets and Collations”.)
Client applications that need to communicate with the server
using Unicode should set the client character set accordingly;
for example, by issuing a SET NAMES 'utf8'
statement. ucs2 cannot be used as a client
character set, which means that it does not work for
SET NAMES or SET CHARACTER
SET . (See Section 9.1.4, “Connection Character Sets and Collations”.)
9.1.11. UTF-8 for Metadata
Metadata is “the data about the
data.” Anything that describes the
database — as opposed to being the
contents of the database — is
metadata. Thus column names, database names, user names, version
names, and most of the string results from
SHOW are metadata. This is also
true of the contents of tables in
INFORMATION_SCHEMA , because those tables by
definition contain information about database objects.
Representation of metadata must satisfy these requirements:
All metadata must be in the same character set. Otherwise,
neither the SHOW commands nor
SELECT statements for tables
in INFORMATION_SCHEMA would work properly
because different rows in the same column of the results of
these operations would be in different character sets.
Metadata must include all characters in all languages.
Otherwise, users would not be able to name columns and
tables using their own languages.
To satisfy both requirements, MySQL stores metadata in a Unicode
character set, namely UTF-8. This does not cause any disruption
if you never use accented or non-Latin characters. But if you
do, you should be aware that metadata is in UTF-8.
The metadata requirements mean that the return values of the
USER() ,
CURRENT_USER() ,
SESSION_USER() ,
SYSTEM_USER() ,
DATABASE() , and
VERSION() functions have the
UTF-8 character set by default.
The server sets the
character_set_system system
variable to the name of the metadata character set:
mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| character_set_system | utf8 |
+----------------------+-------+
Storage of metadata using Unicode does not
mean that the server returns headers of columns and the results
of DESCRIBE functions in the
character_set_system character
set by default. When you use SELECT column1 FROM
t , the name column1 itself is
returned from the server to the client in the character set
determined by the value of the
character_set_results system
variable, which has a default value of
latin1 . If you want the server to pass
metadata results back in a different character set, use the
SET NAMES statement to force the server to
perform character set conversion. SET NAMES
sets the character_set_results
and other related system variables. (See
Section 9.1.4, “Connection Character Sets and Collations”.) Alternatively, a client
program can perform the conversion after receiving the result
from the server. It is more efficient for the client perform the
conversion, but this option is not always available for all
clients.
If character_set_results is set
to NULL , no conversion is performed and the
server returns metadata using its original character set (the
set indicated by
character_set_system ).
Error messages returned from the server to the client are
converted to the client character set automatically, as with
metadata.
If you are using (for example) the
USER() function for comparison or
assignment within a single statement, don't worry. MySQL
performs some automatic conversion for you.
SELECT * FROM t1 WHERE USER() = latin1_column;
This works because the contents of
latin1_column are automatically converted to
UTF-8 before the comparison.
INSERT INTO t1 (latin1_column) SELECT USER();
This works because the contents of
USER() are automatically
converted to latin1 before the assignment.
Although automatic conversion is not in the SQL standard, the
SQL standard document does say that every character set is (in
terms of supported characters) a “subset” of
Unicode. Because it is a well-known principle that “what
applies to a superset can apply to a subset,” we believe
that a collation for Unicode can apply for comparisons with
non-Unicode strings. For more information about coercion of
strings, see Section 9.1.7.5, “Special Cases Where Collation Determination Is Tricky”.
9.1.12. Column Character Set Conversion
To convert a binary or nonbinary string column to use a
particular character set, use ALTER
TABLE . For successful conversion to occur, one of the
following conditions must apply:
If the column has a binary data type
(BINARY ,
VARBINARY ,
BLOB ), all the values that it
contains must be encoded using a single character set (the
character set you're converting the column to). If you use a
binary column to store information in multiple character
sets, MySQL has no way to know which values use which
character set and cannot convert the data properly.
If the column has a nonbinary data type
(CHAR ,
VARCHAR ,
TEXT ), its contents should be
encoded in the column's character set, not some other
character set. If the contents are encoded in a different
character set, you can convert the column to use a binary
data type first, and then to a nonbinary column with the
desired character set.
Suppose that a table t has a binary column
named col1 defined as
VARBINARY(50) . Assuming that the information
in the column is encoded using a single character set, you can
convert it to a nonbinary column that has that character set.
For example, if col1 contains binary data
representing characters in the greek
character set, you can convert it as follows:
ALTER TABLE t MODIFY col1 VARCHAR(50) CHARACTER SET greek;
If your original column has a type of
BINARY(50) , you could convert it to
CHAR(50) , but the resulting values will be
padded with 0x00 bytes at the end, which may
be undesirable. To remove these bytes, use the
TRIM() function:
UPDATE t SET col1 = TRIM(TRAILING 0x00 FROM col1);
Suppose that table t has a nonbinary column
named col1 defined as CHAR(50)
CHARACTER SET latin1 but you want to convert it to use
utf8 so that you can store values from many
languages. The following statement accomplishes this:
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8;
Conversion may be lossy if the column contains characters that
are not in both character sets.
A special case occurs if you have old tables from MySQL 4.0 or
earlier where a nonbinary column contains values that actually
are encoded in a character set different from the server's
default character set. For example, an application might have
stored sjis values in a column, even though
MySQL's default character set was latin1 . It
is possible to convert the column to use the proper character
set but an additional step is required. Suppose that the
server's default character set was latin1 and
col1 is defined as
CHAR(50) but its contents are
sjis values. The first step is to convert the
column to a binary data type, which removes the existing
character set information without performing any character
conversion:
ALTER TABLE t MODIFY col1 BLOB;
The next step is to convert the column to a nonbinary data type
with the proper character set:
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET sjis;
This procedure requires that the table not have been modified
already with statements such as
INSERT or
UPDATE after an upgrade to MySQL
4.1 or later. In that case, MySQL would store new values in the
column using latin1 , and the column will
contain a mix of sjis and
latin1 values and cannot be converted
properly.
If you specified attributes when creating a column initially,
you should also specify them when altering the table with
ALTER TABLE . For example, if you
specified NOT NULL and an explicit
DEFAULT value, you should also provide them
in the ALTER TABLE statement.
Otherwise, the resulting column definition will not include
those attributes.
9.1.13. Character Sets and Collations That MySQL Supports
MySQL supports 70+ collations for 30+ character sets. This
section indicates which character sets MySQL supports. There is
one subsection for each group of related character sets. For
each character set, the allowable collations are listed.
You can always list the available character sets and their
default collations with the SHOW CHARACTER
SET statement:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset | Description | Default collation |
+----------+-----------------------------+---------------------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci |
| dec8 | DEC West European | dec8_swedish_ci |
| cp850 | DOS West European | cp850_general_ci |
| hp8 | HP West European | hp8_english_ci |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |
| latin1 | cp1252 West European | latin1_swedish_ci |
| latin2 | ISO 8859-2 Central European | latin2_general_ci |
| swe7 | 7bit Swedish | swe7_swedish_ci |
| ascii | US ASCII | ascii_general_ci |
| ujis | EUC-JP Japanese | ujis_japanese_ci |
| sjis | Shift-JIS Japanese | sjis_japanese_ci |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |
| tis620 | TIS620 Thai | tis620_thai_ci |
| euckr | EUC-KR Korean | euckr_korean_ci |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |
| greek | ISO 8859-7 Greek | greek_general_ci |
| cp1250 | Windows Central European | cp1250_general_ci |
| gbk | GBK Simplified Chinese | gbk_chinese_ci |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |
| utf8 | UTF-8 Unicode | utf8_general_ci |
| ucs2 | UCS-2 Unicode | ucs2_general_ci |
| cp866 | DOS Russian | cp866_general_ci |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |
| macce | Mac Central European | macce_general_ci |
| macroman | Mac West European | macroman_general_ci |
| cp852 | DOS Central European | cp852_general_ci |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci |
| cp1251 | Windows Cyrillic | cp1251_general_ci |
| cp1256 | Windows Arabic | cp1256_general_ci |
| cp1257 | Windows Baltic | cp1257_general_ci |
| binary | Binary pseudo charset | binary |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
In cases where a character set has multiple collations, it might
not be clear which collation is most suitable for a given
application. To avoid choosing the wrong collation, it can be
helpful to perform some comparisons with representative data
values to make sure that a given collation sorts values the way
you expect.
Collation-Charts.Org
is a useful site for information that shows how one collation
compares to another.
9.1.13.1. Unicode Character Sets
MySQL 5.0 has two Unicode character sets:
ucs2 , the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf8 , a UTF-8 encoding of the Unicode
character set using one to three bytes per character
You can store text in about 650 languages using these
character sets. This section lists the collations available
for each Unicode character set. For general information about
the character sets, see Section 9.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx represents the character set
name. For example,
xxx _danish_ci
represents the Danish collations, the specific names of which
are ucs2_danish_ci and
utf8_danish_ci .
xxx _bin
xxx _czech_ci
xxx _danish_ci
xxx _esperanto_ci
xxx _estonian_ci
xxx _general_ci
(default)
xxx _hungarian_ci
xxx _icelandic_ci
xxx _latvian_ci
xxx _lithuanian_ci
xxx _persian_ci
xxx _polish_ci
xxx _roman_ci
xxx _romanian_ci
xxx _slovak_ci
xxx _slovenian_ci
xxx _spanish2_ci
xxx _spanish_ci
xxx _swedish_ci
xxx _turkish_ci
xxx _unicode_ci
The
xxx _esperanto_ci
collations were added in MySQL 5.0.13. The
xxx _hungarian_ci
collations were added in MySQL 5.0.19.
MySQL implements the
xxx _unicode_ci
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
Currently, the
xxx _unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo. The following discussion uses
utf8_unicode_ci for concreteness.
For any Unicode character set, operations performed using the
_general_ci collation are faster than those
for the _unicode_ci collation. For example,
comparisons for the utf8_general_ci
collation are faster, but slightly less correct, than
comparisons for utf8_unicode_ci . The reason
for this is that utf8_unicode_ci supports
mappings such as expansions; that is, when one character
compares as equal to combinations of other characters. For
example, in German and some other languages
“? ” is equal to
“ss ”.
utf8_unicode_ci also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci (for the effect this has in
comparisons or when doing searches, see
Section 9.1.7.7, “Examples of the Effect of Collation”):
? = A
? = O
? = U
A difference between the collations is that this is true for
utf8_general_ci :
? = s
Whereas this is true for utf8_unicode_ci :
? = ss
MySQL implements language-specific collations for the
utf8 character set only if the ordering
with utf8_unicode_ci does not work well for
a language. For example, utf8_unicode_ci
works fine for German and French, so there is no need to
create special utf8 collations for these
two languages.
utf8_general_ci also is satisfactory for
both German and French, except that
“? ” is equal to
“s ”, and not to
“ss ”. If this is acceptable
for your application, then you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it
is more accurate.
utf8_swedish_ci , like other
utf8 language-specific collations, is
derived from utf8_unicode_ci with
additional language rules. For example, in Swedish, the
following relationship holds, which is not something expected
by a German or French speaker:
? = Y < ?
The
xxx _spanish_ci
and
xxx _spanish2_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“? ” (n-tilde) is a separate
letter between “n ” and
“o ”. In addition, for
traditional Spanish, “ch ” is a
separate letter between “c ”
and “d ”, and
“ll ” is a separate letter
between “l ” and
“m ”
In the
xxx _roman_ci
collations, I and J
compare as equal, and U and
V compare as equal.
For additional information about Unicode collations in MySQL,
see Collation-Charts.Org
(utf8).
9.1.13.2. West European Character Sets
Western European character sets cover most West European
languages, such as French, Spanish, Catalan, Basque,
Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish,
Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and
English.
ascii (US ASCII) collations:
cp850 (DOS West European) collations:
dec8 (DEC Western European) collations:
hp8 (HP Western European) collations:
hp8_bin
hp8_english_ci (default)
latin1 (cp1252 West European)
collations:
latin1 is the default character set.
MySQL's latin1 is the same as the
Windows cp1252 character set. This
means it is the same as the official ISO
8859-1 or IANA (Internet Assigned Numbers
Authority) latin1 , except that IANA
latin1 treats the code points between
0x80 and 0x9f as
“undefined,” whereas
cp1252 , and therefore MySQL's
latin1 , assign characters for those
positions. For example, 0x80 is the
Euro sign. For the “undefined” entries in
cp1252 , MySQL translates
0x81 to Unicode
0x0081 , 0x8d to
0x008d , 0x8f to
0x008f , 0x90 to
0x0090 , and 0x9d to
0x009d .
The latin1_swedish_ci collation is the
default that probably is used by the majority of MySQL
customers. Although it is frequently said that it is based
on the Swedish/Finnish collation rules, there are Swedes
and Finns who disagree with this statement.
The latin1_german1_ci and
latin1_german2_ci collations are based
on the DIN-1 and DIN-2 standards, where DIN stands for
Deutsches Institut f?r
Normung (the German equivalent of ANSI).
DIN-1 is called the “dictionary collation”
and DIN-2 is called the “phone book
collation.” For an example of the effect this has
in comparisons or when doing searches, see
Section 9.1.7.7, “Examples of the Effect of Collation”.
latin1_german1_ci (dictionary)
rules:
? = A
? = O
? = U
? = s
latin1_german2_ci (phone-book)
rules:
? = AE
? = OE
? = UE
? = ss
For an example of the effect this has in comparisons or
when doing searches, see
Section 9.1.7.7, “Examples of the Effect of Collation”.
In the latin1_spanish_ci collation,
“? ” (n-tilde) is a
separate letter between
“n ” and
“o ”.
macroman (Mac West European)
collations:
swe7 (7bit Swedish) collations:
For additional information about Western European collations
in MySQL, see Collation-Charts.Org
(ascii,
cp850,
dec8,
hp8,
latin1,
macroman,
swe7).
9.1.13.3. Central European Character Sets
MySQL provides some support for character sets used in the
Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia,
Poland, and Serbia (Latin).
cp1250 (Windows Central European)
collations:
cp852 (DOS Central European)
collations:
keybcs2 (DOS Kamenicky Czech-Slovak)
collations:
latin2 (ISO 8859-2 Central European)
collations:
macce (Mac Central European)
collations:
For additional information about Central European collations
in MySQL, see Collation-Charts.Org
(cp1250,
cp852,
keybcs2,
latin2,
macce).
9.1.13.4. South European and Middle East Character Sets
South European and Middle Eastern character sets supported by
MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and
Turkish.
armscii8 (ARMSCII-8 Armenian)
collations:
cp1256 (Windows Arabic) collations:
geostd8 (GEOSTD8 Georgian) collations:
greek (ISO 8859-7 Greek) collations:
hebrew (ISO 8859-8 Hebrew) collations:
latin5 (ISO 8859-9 Turkish) collations:
For additional information about South European and Middle
Eastern collations in MySQL, see Collation-Charts.Org
(armscii8,
cp1256,
geostd8,
greek,
hebrew,
latin5).
9.1.13.5. Baltic Character Sets
The Baltic character sets cover Estonian, Latvian, and
Lithuanian languages.
For additional information about Baltic collations in MySQL,
see Collation-Charts.Org
(cp1257,
latin7).
9.1.13.6. Cyrillic Character Sets
The Cyrillic character sets and collations are for use with
Belarusian, Bulgarian, Russian, Ukrainian, and Serbian
(Cyrillic) languages.
cp1251 (Windows Cyrillic) collations:
cp866 (DOS Russian) collations:
koi8r (KOI8-R Relcom Russian)
collations:
koi8u (KOI8-U Ukrainian) collations:
For additional information about Cyrillic collations in MySQL,
see Collation-Charts.Org
(cp1251,
cp866,
koi8r,
koi8u).
).
9.1.13.7. Asian Character Sets
The Asian character sets that we support include Chinese,
Japanese, Korean, and Thai. These can be complicated. For
example, the Chinese sets must allow for thousands of
different characters. See Section 9.1.13.7.1, “The cp932 Character Set”, for
additional information about the cp932 and
sjis character sets.
For answers to some common questions and problems relating
support for Asian character sets in MySQL, see
Section A.11, “MySQL 5.0 FAQ — MySQL Chinese, Japanese, and Korean
Character Sets”.
big5 (Big5 Traditional Chinese)
collations:
cp932 (SJIS for Windows Japanese)
collations:
eucjpms (UJIS for Windows Japanese)
collations:
euckr (EUC-KR Korean) collations:
gb2312 (GB2312 Simplified Chinese)
collations:
gbk (GBK Simplified Chinese)
collations:
gbk_bin
gbk_chinese_ci (default)
sjis (Shift-JIS Japanese) collations:
tis620 (TIS620 Thai) collations:
tis620_bin
tis620_thai_ci (default)
ujis (EUC-JP Japanese) collations:
The big5_chinese_ci collation sorts on
number of strokes.
For additional information about Asian collations in MySQL,
see Collation-Charts.Org
(big5,
cp932,
eucjpms,
euckr,
gb2312,
gbk,
sjis,
tis620,
ujis).
9.1.13.7.1. The cp932 Character Set
Why is cp932
needed?
In MySQL, the sjis character set
corresponds to the Shift_JIS character
set defined by IANA, which supports JIS X0201 and JIS X0208
characters. (See
http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a
descriptive term has become very vague and it often includes
the extensions to Shift_JIS that are
defined by various vendors.
For example, “SHIFT JIS” used in Japanese
Windows environments is a Microsoft extension of
Shift_JIS and its exact name is
Microsoft Windows Codepage : 932 or
cp932 . In addition to the characters
supported by Shift_JIS ,
cp932 supports extension characters such
as NEC special characters, NEC selected — IBM extended
characters, and IBM extended characters.
Many Japanese users have experienced problems using these
extension characters. These problems stem from the following
factors:
MySQL automatically converts character sets.
Character sets are converted via Unicode
(ucs2 ).
The sjis character set does not
support the conversion of these extension characters.
There are several conversion rules from so-called
“SHIFT JIS” to Unicode, and some characters
are converted to Unicode differently depending on the
conversion rule. MySQL supports only one of these rules
(described later).
The MySQL cp932 character set is designed
to solve these problems. It is available as of MySQL 5.0.3.
Because MySQL supports character set conversion, it is
important to separate IANA Shift_JIS and
cp932 into two different character sets
because they provide different conversion rules.
How does cp932
differ from sjis ?
The cp932 character set differs from
sjis in the following ways:
cp932 supports NEC special
characters, NEC selected — IBM extended
characters, and IBM selected characters.
Some cp932 characters have two
different code points, both of which convert to the same
Unicode code point. When converting from Unicode back to
cp932 , one of the code points must be
selected. For this “round trip conversion,”
the rule recommended by Microsoft is used. (See
http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC
special characters, use the code point of JIS X
0208.
If the character is in both NEC special characters
and IBM selected characters, use the code point of
NEC special characters.
If the character is in both IBM selected characters
and NEC selected — IBM extended characters,
use the code point of IBM extended characters.
The table shown at
http://www.microsoft.com/globaldev/reference/dbcs/932.htm
provides information about the Unicode values of
cp932 characters. For
cp932 table entries with characters
under which a four-digit number appears, the number
represents the corresponding Unicode
(ucs2 ) encoding. For table entries
with an underlined two-digit value appears, there is a
range of cp932 character values that
begin with those two digits. Clicking such a table entry
takes you to a page that displays the Unicode value for
each of the cp932 characters that
begin with those digits.
The following links are of special interest. They
correspond to the encodings for the following sets of
characters:
Starting from version 5.0.3, cp932
supports conversion of user-defined characters in
combination with eucjpms , and solves
the problems with
sjis /ujis
conversion. For details, please refer to
http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from
ucs2 is different for
sjis and cp932 . The
following tables illustrate these differences.
Conversion to ucs2 :
Conversion from ucs2 :
Users of any Japanese character sets should be aware that
using
--character-set-client-handshake
(or
--skip-character-set-client-handshake )
has an important effect. See
Section 5.1.2, “Server Command Options”.
9.2. The Character Set Used for Data and Sorting
The character set determines what characters are allowed in
identifiers. The collation determines how strings are sorted by
the ORDER BY and GROUP BY
clauses of the SELECT statement.
By default, MySQL uses the latin1 (cp1252 West
European) character set and the
latin1_swedish_ci collation that sorts
according to Swedish/Finnish rules. These defaults are suitable
for the United States and most of Western Europe.
All MySQL binary distributions are compiled with
--with-extra-charsets=complex .
This adds code to all standard programs that enables them to
handle latin1 and all multi-byte character sets
within the binary. Other character sets are loaded from a
character-set definition file when needed.
You can change the default server character set and collation with
the --character-set-server and
--collation-server options when you
start the server. The collation must be a legal collation for the
default character set. (Use the SHOW
COLLATION statement to determine which collations are
available for each character set.) See
Section 5.1.2, “Server Command Options”.
The character sets available depend on the
--with-charset=charset_name
and
--with-extra-charsets=list-of-charsets
| complex | all | none options to
configure, and the character set configuration
files listed in
SHAREDIR /charsets/Index .
See Section 2.16.2, “Typical configure Options”.
When a client connects to the server, it sends the name of the
character set that it wants to use. The server uses the name to
set various character set system variables for the connection. For
more information, see Section 9.1.4, “Connection Character Sets and Collations”.
Within C programs, you should use
mysql_real_escape_string() when
escaping strings for an SQL query.
mysql_real_escape_string() is
identical to the old
mysql_escape_string() function,
except that it takes the MYSQL connection
handler as the first parameter so that the appropriate character
set can be taken into account when escaping characters.
If the client is compiled with paths that differ from where the
server is installed and the user who configured MySQL did not
include all character sets in the MySQL binary, you must tell the
client where it can find the additional character sets it needs if
the server runs with a different character set from the client.
You can do this by specifying a
--character-sets-dir option to indicate the path
to the directory in which the dynamic MySQL character sets are
stored. For example, you can put the following in an option file:
[client]
character-sets-dir=/usr/local/mysql/share/mysql/charsets
You can force the client to use specific character set as follows:
[client]
default-character-set=charset_name
This is normally unnecessary. However, when
character_set_system differs from
character_set_server or
character_set_client , and you
input characters manually (as database object identifiers, column
values, or both), these may be displayed incorrectly in output
from the client or the output itself may be formatted incorrectly.
In such cases, starting the mysql client with
--default-character-set=system_character_set
— that is, setting the client character set to match the
system character set — should fix the problem.
9.3. Setting the Error Message Language
By default, mysqld produces error messages in
English, but they can also be displayed in any of several other
languages: Czech, Danish, Dutch, Estonian, French, German, Greek,
Hungarian, Italian, Japanese, Korean, Norwegian, Norwegian-ny,
Polish, Portuguese, Romanian, Russian, Slovak, Spanish, or
Swedish.
You can select which language the server uses for error messages
using the instructions in this section.
To start mysqld with a particular language for
error messages, use the --language
or -L option. The option value can be a language
name or the full path to the error message file. For example:
shell> mysqld --language=swedish
Or:
shell> mysqld --language=/usr/local/share/swedish
The language name should be specified in lowercase.
By default, the language files are located in the
share/mysql/LANGUAGE
directory under the MySQL base directory.
For information about changing the character set for error
messages (rather than the language), see
Section 9.1.6, “Character Set for Error Messages”.
You can change the content of the error messages produced by the
server using the instructions in the MySQL Internals manual,
available at
http://forge.mysql.com/wiki/MySQL_Internals_Error_Messages.
If you do change the content of error messages, remember to repeat
your changes after each upgrade to a newer version of MySQL.
9.4. Adding a New Character Set
This section discusses the procedure for adding a new character
set to MySQL. You must have a MySQL source distribution to use
these instructions. The proper procedure depends on whether the
character set is simple or complex:
If the character set does not need to use special string
collating routines for sorting and does not need multi-byte
character support, it is simple.
If the character set needs either of those features, it is
complex.
For example, greek and swe7
are simple character sets, whereas big5 and
czech are complex character sets.
In the following instructions, MYSET
represents the name of the character set that you want to add.
Add a <charset> element for
MYSET to the
sql/share/charsets/Index.xml file. Use
the existing contents in the file as a guide to adding new
contents.
The <charset> element must list all
the collations for the character set. These must include at
least a binary collation and a default collation. The default
collation is usually named using a suffix of
general_ci (general, case insensitive). It
is possible for the binary collation to be the default
collation, but usually they are different. The default
collation should have a primary flag. The
binary collation should have a binary flag.
You must assign a unique ID number to each collation, chosen
from the range 1 to 254. To find the maximum of the currently
used collation IDs, use this query:
SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
This step depends on whether you are adding a simple or
complex character set. A simple character set requires only a
configuration file, whereas a complex character set requires C
source file that defines collation functions, multi-byte
functions, or both.
For a simple character set, create a configuration file,
MYSET .xml ,
that describes the character set properties. Create this file
in the sql/share/charsets directory. (You
can use a copy of latin1.xml as the basis
for this file.) The syntax for the file is very simple:
Comments are written as ordinary XML comments
(<!-- text
--> ).
Words within <map> array elements
are separated by arbitrary amounts of whitespace.
Each word within <map> array
elements must be a number in hexadecimal format.
The <map> array element for the
<ctype> element has 257 words.
The other <map> array elements
after that have 256 words. See
Section 9.4.1, “The Character Definition Arrays”.
For each collation listed in the
<charset> element for the
character set in Index.xml ,
MYSET .xml
must contain a <collation>
element that defines the character ordering.
For a complex character set, create a C source file that
describes the character set properties and defines the support
routines necessary to properly perform operations on the
character set:
Create the file
ctype-MYSET .c
in the strings directory. Look at one
of the existing ctype-*.c files (such
as ctype-big5.c ) to see what needs to
be defined. The arrays in your file must have names like
ctype_MYSET ,
to_lower_MYSET ,
and so on. These correspond to the arrays for a simple
character set. See Section 9.4.1, “The Character Definition Arrays”.
For each collation listed in the
<charset> element for the
character set in Index.xml , the
ctype-MYSET .c
file must provide an implementation of the collation.
If you need string collating functions, see
Section 9.4.2, “String Collating Support”.
If you need multi-byte character support, see
Section 9.4.3, “Multi-Byte Character Support”.
Follow these steps to modify the configuration information.
Use the existing configuration information as a guide to
adding information for MYSYS . The
example here assumes that the character set has default and
binary collations, but more lines will be needed if
MYSET has additional collations.
Edit mysys/charset-def.c , and
“register” the collations for the new
character set.
Add these lines to the “declaration” section:
#ifdef HAVE_CHARSET_MYSET
extern CHARSET_INFO my_charset_MYSET _general_ci;
extern CHARSET_INFO my_charset_MYSET _bin;
#endif
Add these lines to the “registration”
section:
#ifdef HAVE_CHARSET_MYSET
add_compiled_collation(&my_charset_MYSET _general_ci);
add_compiled_collation(&my_charset_MYSET _bin);
#endif
If the character set uses
ctype-MYSET .c ,
edit strings/Makefile.am and add
ctype-MYSET .c
to each definition of the CSRCS
variable, and to the EXTRA_DIST
variable.
If the character set uses
ctype-MYSET .c ,
edit libmysql/Makefile.shared and add
ctype-MYSET .lo
to the mystringsobjects definition.
Edit
config/ac-macros/character_sets.m4 :
Add MYSET to one of the
define(CHARSETS_AVAILABLE...) lines
in alphabetic order.
Add MYSET to
CHARSETS_COMPLEX . This is needed
even for simple character sets, or
configure will not recognize
--with-charset=MYSET .
Add MYSET to the first
case control structure. Omit the
USE_MB and
USE_MB_IDENT lines for 8-bit
character sets.
MYSET )
AC_DEFINE(HAVE_CHARSET_MYSET , 1, [Define to enable charset MYSET ])
AC_DEFINE([USE_MB], 1, [Use multi-byte character routines])
AC_DEFINE(USE_MB_IDENT, 1)
;;
Add MYSET to the second
case control structure:
MYSET )
default_charset_default_collation="MYSET _general_ci"
default_charset_collations="MYSET _general_ci MYSET _bin"
;;
Reconfigure, recompile, and test.
9.4.1. The Character Definition Arrays
Each simple character set has a configuration file located in
the sql/share/charsets directory. The file
is named
MYSET .xml . It
uses <map> array elements to list
character set properties. <map>
elements appear within these elements:
<ctype> defines attributes for each
character
<lower> and
<upper> list the lowercase and
uppercase characters
<unicode> maps 8-bit character
values to Unicode values
<collation> elements indicate
character ordering for comparisons and sorts, one element
per collation (binary collations need no
<map> element because the character
codes themselves provide the ordering)
For a complex character set as implemented in a
ctype-MYSET .c
file in the strings directory, there are
corresponding arrays:
ctype_MYSET [] ,
to_lower_MYSET [] ,
and so forth. Not every complex character set has all of the
arrays. See the existing ctype-*.c files
for examples. See the CHARSET_INFO.txt file
in the strings directory for additional
information.
The ctype array is indexed by character value
+ 1 and has 257 elements. This is an old legacy convention for
handling EOF . The other arrays are indexed by
character value and have 256 elements.
ctype array elements are bit values. Each
element describes the attributes of a single character in the
character set. Each attribute is associated with a bitmask, as
defined in include/m_ctype.h :
#define _MY_U 01 /* Upper case */
#define _MY_L 02 /* Lower case */
#define _MY_NMR 04 /* Numeral (digit) */
#define _MY_SPC 010 /* Spacing character */
#define _MY_PNT 020 /* Punctuation */
#define _MY_CTR 040 /* Control character */
#define _MY_B 0100 /* Blank */
#define _MY_X 0200 /* heXadecimal digit */
The ctype value for a given character should
be the union of the applicable bitmask values that describe the
character. For example, 'A' is an uppercase
character (_MY_U ) as well as a hexadecimal
digit (_MY_X ), so its
ctype value should be defined like this:
ctype['A'+1] = _MY_U | _MY_X = 01 | 0200 = 0201
The bitmask values in m_ctype.h are octal
values, but the elements of the ctype array
in MYSET .xml
should be written as hexadecimal values.
The lower and upper arrays
hold the lowercase and uppercase characters corresponding to
each member of the character set. For example:
lower['A'] should contain 'a'
upper['a'] should contain 'A'
Each collation array is a map indicating how
characters should be ordered for comparison and sorting
purposes. MySQL sorts characters based on the values of this
information. In some cases, this is the same as the
upper array, which means that sorting is
case-insensitive. For more complicated sorting rules (for
complex character sets), see the discussion of string collating
in Section 9.4.2, “String Collating Support”.
9.4.2. String Collating Support
For simple character sets, sorting rules are specified in the
MYSET .xml
configuration file using <map> array
elements within <collation> elements.
If the sorting rules for your language are too complex to be
handled with simple arrays, you need to define string collating
functions in the
ctype-MYSET .c
source file in the strings directory.
The existing character sets provide the best documentation and
examples to show how these functions are implemented. Look at
the ctype-*.c files in the
strings directory, such as the files for
the big5 , czech ,
gbk , sjis , and
tis160 character sets. Take a look at the
MY_COLLATION_HANDLER structures to see how
they are used, and see the CHARSET_INFO.txt
file in the strings directory for
additional information.
9.4.3. Multi-Byte Character Support
If you want to add support for a new character set that includes
multi-byte characters, you need to use multi-byte character
functions in the
ctype-MYSET .c
source file in the strings directory.
The existing character sets provide the best documentation and
examples to show how these functions are implemented. Look at
the ctype-*.c files in the
strings directory, such as the files for
the euc_kr , gb2312 ,
gbk , sjis , and
ujis character sets. Take a look at the
MY_CHARSET_HANDLER structures to see how they
are used, and see the CHARSET_INFO.txt file
in the strings directory for additional
information.
9.5. How to Add a New Collation to a Character Set
A collation is a set of rules that defines how to compare and sort
character strings. Each collation in MySQL belongs to a single
character set. Every character set has at least one collation, and
most have two or more collations.
A collation orders characters based on weights. Each character in
a character set maps to a weight. Characters with equal weights
compare as equal, and characters with unequal weights compare
according to the relative magnitude of their weights.
MySQL supports several collation implementations, as discussed in
Section 9.5.1, “Collation Implementation Types”. Some of these
can be added to MySQL without recompiling:
Simple collations for 8-bit character sets
UCA-based collations for Unicode character sets
Binary (xxx _bin )
collations
The following discussion describes how to add collations of the
first two types to existing character sets. All existing character
sets already have a binary collation, so there is no need here to
describe how to add one.
Summary of the procedure for adding a new collation:
Choose a collation ID
Add configuration information that names the collation and
describes the character-ordering rules
Restart the server
Verify that the collation is present
The instructions here cover only collations that can be added
without recompiling MySQL. To add a collation that does require
recompiling (as implemented by means of functions in a C source
file), use the instructions in
Section 9.4, “Adding a New Character Set”. However, instead of adding
all the information required for a complete character set, just
modify the appropriate files for an existing character set. That
is, based on what is already present for the character set's
current collations, add new data structures, functions, and
configuration information for the new collation. For an example,
see the MySQL Blog article in the following list of additional
resources.
Additional Resources
9.5.1. Collation Implementation Types
MySQL implements several types of collations:
Simple collations for 8-bit character
sets
This kind of collation is implemented using an array of 256
weights that defines a one-to-one mapping from character codes
to weights. latin1_swedish_ci is an example.
It is a case-insensitive collation, so the uppercase and
lowercase versions of a character have the same weights and they
compare as equal.
mysql> SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'a' = 'A';
+-----------+
| 'a' = 'A' |
+-----------+
| 1 |
+-----------+
1 row in set (0.00 sec)
Complex collations for 8-bit character
sets
This kind of collation is implemented using functions in a C
source file that define how to order characters, as described in
Section 9.4, “Adding a New Character Set”.
Collations for non-Unicode multi-byte
character sets
For this type of collation, 8-bit (single-byte) and multi-byte
characters are handled differently. For 8-bit characters,
character codes map to weights in case-insensitive fashion. (For
example, the single-byte characters 'a' and
'A' both have a weight of
0x41 .) For multi-byte characters, there are
two types of relationship between character codes and weights:
Weights equal character codes.
sjis_japanese_ci is an example of this
kind of collation. The multi-byte character
'?' has a character code of
0x82C0 , and the weight is also
0x82C0 .
Character codes map one-to-one to weights, but a code is not
necessarily equal to the weight.
gbk_chinese_ci is an example of this kind
of collation. The multi-byte character
'?' has a character code of
0x81B0 but a weight of
0xC286 .
Collations for Unicode multi-byte
character sets
Some of these collations are based on the Unicode Collation
Algorithm (UCA), others are not.
Non-UCA collations have a one-to-one mapping from character code
to weight. In MySQL, such collations are case insensitive and
accent insensitive. utf8_general_ci is an
example: 'a' , 'A' ,
'?' , and '?' each have
different character codes but all have a weight of
0x0041 and compare as equal.
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'a' = 'A', 'a' = '?', 'a' = '?';
+-----------+-----------+-----------+
| 'a' = 'A' | 'a' = '?' | 'a' = '?' |
+-----------+-----------+-----------+
| 1 | 1 | 1 |
+-----------+-----------+-----------+
1 row in set (0.06 sec)
UCA-based collations in MySQL have these properties:
If a character has weights, each weight uses 2 bytes (16
bits)
A character may have zero weights (or an empty weight). In
this case, the character is ignorable. Example: "U+0000
NULL" does not have a weight and is ignorable.
A character may have one weight. Example:
'a' has a weight of
0x0E33 .
A character may have many weights. This is an expansion.
Example: The German letter '?' (SZ
LEAGUE, or SHARP S) has a weight of
0x0FEA0FEA .
Many characters may have one weight. This is a contraction.
Example: 'ch' is a single letter in Czech
and has a weight of 0x0EE2 .
A many-characters-to-many-weights mapping is also possible (this
is contraction with expansion), but is not supported by MySQL.
Miscellaneous collations
There are also a few collations that do not fall into any of the
previous categories.
9.5.2. Choosing a Collation ID
Each collation must have a unique ID. To add a new collation,
you must choose an ID value that is not currently used. The
value must be in the range from 1 to 254. The collation ID that
you choose will show up in these contexts:
To determine the largest currently used ID, issue the following
statement:
mysql> SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
+---------+
| MAX(ID) |
+---------+
| 210 |
+---------+
For the output just shown, you could choose an ID higher than
210 for the new collation.
To display a list of all currently used IDs, issue this
statement:
mysql> SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;
+-----+
| ID |
+-----+
| 1 |
| 2 |
| ... |
| 52 |
| 53 |
| 57 |
| 58 |
| ... |
| 98 |
| 99 |
| 128 |
| 129 |
| ... |
| 210 |
+-----+
In this case, you can either choose an unused ID from within the
current range of IDs, or choose an ID that is higher than the
current maximum ID. For example, in the output just shown, there
are unused IDs between 53 and 57, and between 99 and 128. Or you
could choose an ID higher than 210.
Warning
If you upgrade MySQL, you may find that the collation ID you
choose has been assigned to a collation included in the new
MySQL distribution. In this case, you will need to choose a
new value for your own collation.
In addition, before upgrading, you should save the
configuration files that you change. If you upgrade in place,
the process will replace the your modified files.
9.5.3. Adding a Simple Collation to an 8-Bit Character Set
To add a simple collation for an 8-bit character set without
recompiling MySQL, use the following procedure. The example adds
a collation named latin1_test_ci to the
latin1 character set.
Choose a collation ID, as shown in
Section 9.5.2, “Choosing a Collation ID”. The
following steps use an ID of 56.
You will need to modify the Index.xml and
latin1.xml configuration files. These
files will be located in the directory named by the
character_sets_dir system
variable. You can check the variable value as follows,
although the path name might be different on your system:
mysql> SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+-----------------------------------------+
| Variable_name | Value |
+--------------------+-----------------------------------------+
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+
Choose a name for the collation and list it in the
Index.xml file. Find the
<charset> element for the character
set to which the collation is being added, and add a
<collation> element that indicates
the collation name and ID. For example:
<charset name="latin1">
...
<!-- associate collation name with its ID -->
<collation name="latin1_test_ci" id="56"/>
...
</charset>
In the latin1.xml configuration file,
add a <collation> element that
names the collation and that contains a
<map> element that defines a
character code-to-weight mapping table. Each word within the
<map> element must be a number in
hexadecimal format.
<collation name="latin1_test_ci">
<map>
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
</map>
</collation>
Restart the server and use this statement to verify that the
collation is present:
mysql> SHOW COLLATION LIKE 'latin1_test_ci';
+----------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+----------------+---------+----+---------+----------+---------+
| latin1_test_ci | latin1 | 56 | | | 1 |
+----------------+---------+----+---------+----------+---------+
9.5.4. Adding a UCA Collation to a Unicode Character Set
UCA collations for Unicode character sets can be added to MySQL
without recompiling by using a subset of the Locale Data Markup
Language (LDML), which is available at
http://www.unicode.org/reports/tr35/. In
5.0, this method of adding collations is supported
as of MySQL 5.0.46. With this method, you begin with an existing
“base” collation. Then you describe the new
collation in terms of how it differs from the base collation,
rather than defining the entire collation. The following table
lists the base collations for the Unicode character sets.
The following brief summary describes the LDML characteristics
required for understanding the procedure for adding a collation
given later in this section:
LDML has reset rules and shift rules.
Characters named in these rules can be written in
\unnnn format,
where nnnn is the hexadecimal
Unicode code point value. Basic Latin letters
A-Z and a-z can also
be written literally (this is a MySQL limitation; the LDML
specification allows literal non-Latin1 characters in the
rules). Only characters in the Basic Multilingual Plane can
be specified. This notation does not apply to characters
outside the BMP range of 0000 to
FFFF .
A reset rule does not specify any ordering in and of itself.
Instead, it “resets” the ordering for
subsequent shift rules to cause them to be taken in relation
to a given character. Either of the following rules resets
subsequent shift rules to be taken in relation to the letter
'A' :
<reset>A</reset>
<reset>\u0041</reset>
Shift rules define primary, secondary, and tertiary
differences of a character from another character. They are
specified using <p> ,
<s> , and
<t> elements. Either of the
following rules specifies a primary shift rule for the
'G' character:
<p>G</p>
<p>\u0047</p>
Use primary differences to distinguish separate letters.
Use secondary differences to distinguish accent
variations.
Use tertiary differences to distinguish lettercase
variations.
To add a UCA collation for a Unicode character set without
recompiling MySQL, use the following procedure. The example adds
a collation named utf8_phone_ci to the
utf8 character set. The collation is designed
for a scenario involving a Web application for which users post
their names and phone numbers. Phone numbers can be given in
very different formats:
+7-12345-67
+7-12-345-67
+7 12 345 67
+7 (12) 345 67
+71234567
The problem raised by dealing with these kinds of values is that
the varying allowable formats make searching for a specific
phone number very difficult. The solution is to define a new
collation that reorders punctuation characters, making them
ignorable.
Choose a collation ID, as shown in
Section 9.5.2, “Choosing a Collation ID”. The
following steps use an ID of 252.
You will need to modify the Index.xml
configuration file. This file will be located in the
directory named by the
character_sets_dir system
variable. You can check the variable value as follows,
although the path name might be different on your system:
mysql> SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+-----------------------------------------+
| Variable_name | Value |
+--------------------+-----------------------------------------+
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+
Choose a name for the collation and list it in the
Index.xml file. In addition, you'll
need to provide the collation ordering rules. Find the
<charset> element for the character
set to which the collation is being added, and add a
<collation> element that indicates
the collation name and ID. Within the
<collation> element, provide a
<rules> element containing the
ordering rules:
<charset name="utf8">
...
<!-- associate collation name with its ID -->
<collation name="utf8_phone_ci" id="252">
<rules>
<reset>\u0000</reset>
<s>\u0020</s> <!-- space -->
<s>\u0028</s> <!-- left parenthesis -->
<s>\u0029</s> <!-- right parenthesis -->
<s>\u002B</s> <!-- plus -->
<s>\u002D</s> <!-- hyphen -->
</rules>
</collation>
...
</charset>
If you want a similar collation for other Unicode character
sets, add other <collation>
elements. For example, to define
ucs2_phone_ci , add a
<collation> element to the
<charset name="ucs2"> element.
Remember that each collation must have its own unique ID.
Restart the server and use this statement to verify that the
collation is present:
mysql> SHOW COLLATION LIKE 'utf8_phone_ci';
+---------------+---------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------+---------+-----+---------+----------+---------+
| utf8_phone_ci | utf8 | 252 | | | 8 |
+---------------+---------+-----+---------+----------+---------+
Now we can test the collation to make sure that it has the
desired properties.
Create a table containing some sample phone numbers using the
new collation:
mysql> CREATE TABLE phonebook (
-> name VARCHAR(64),
-> phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
-> );
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
Query OK, 1 row affected (0.00 sec)
Run some queries to see whether the ignored punctuation
characters are in fact ignored for sorting and comparisons:
mysql> SELECT * FROM phonebook ORDER BY phone;
+-------+--------------------+
| name | phone |
+-------+--------------------+
| Sanja | +380 (912) 8008005 |
| Bar | +7-912-800-80-01 |
| Svoj | +7 912 800 80 02 |
| Ramil | (7912) 800 80 03 |
| Hf | +7 (912) 800 80 04 |
+-------+--------------------+
5 rows in set (0.00 sec)
mysql> SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';
+------+------------------+
| name | phone |
+------+------------------+
| Bar | +7-912-800-80-01 |
+------+------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM phonebook WHERE phone='79128008001';
+------+------------------+
| name | phone |
+------+------------------+
| Bar | +7-912-800-80-01 |
+------+------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';
+------+------------------+
| name | phone |
+------+------------------+
| Bar | +7-912-800-80-01 |
+------+------------------+
1 row in set (0.00 sec)
9.6. Problems With Character Sets
If you try to use a character set that is not compiled into your
binary, you might run into the following problems:
Your program uses an incorrect path to determine where the
character sets are stored (which is typically the
share/mysql/charsets or
share/charsets directory under the MySQL
installation directory). This can be fixed by using the
--character-sets-dir option when you run the
program in question. For example, to specify a directory to be
used by MySQL client programs, list it in the
[client] group of your option file. The
examples given here show what the setting might look like for
Unix or Windows, respectively:
[client]
character-sets-dir=/usr/local/mysql/share/mysql/charsets
[client]
character-sets-dir="C:/Program Files/MySQL/MySQL Server 5.0/share/charsets"
The character set is a complex character set that cannot be
loaded dynamically. In this case, you must recompile the
program with support for the character set.
For Unicode character sets, you can define collations without
recompiling by using LDML notation. See
Section 9.5.4, “Adding a UCA Collation to a Unicode Character Set”.
The character set is a dynamic character set, but you do not
have a configuration file for it. In this case, you should
install the configuration file for the character set from a
new MySQL distribution.
If your character set index file does not contain the name for
the character set, your program displays an error message. The
file is named Index.xml and the message
is:
Character set 'charset_name ' is not a compiled character set and is not
specified in the '/usr/share/mysql/charsets/Index.xml' file
To solve this problem, you should either get a new index file
or manually add the name of any missing character sets to the
current file.
For MyISAM tables, you can check the character
set name and number for a table with myisamchk -dvv
tbl_name .
9.7. MySQL Server Time Zone Support
The MySQL server maintains several time zone settings:
The system time zone. When the server starts, it attempts to
determine the time zone of the host machine and uses it to set
the system_time_zone system
variable. The value does not change thereafter.
You can set the system time zone for MySQL Server at startup
with the
--timezone=timezone_name
option to mysqld_safe. You can also set it
by setting the TZ environment variable
before you start mysqld. The allowable
values for --timezone or
TZ are system-dependent. Consult your
operating system documentation to see what values are
acceptable.
The server's current time zone. The global
time_zone system variable
indicates the time zone the server currently is operating in.
The initial value for
time_zone is
'SYSTEM' , which indicates that the server
time zone is the same as the system time zone.
The initial global server time zone value can be specified
explicitly at startup with the
--default-time-zone=timezone
option on the command line, or you can use the following line
in an option file:
default-time-zone='timezone '
If you have the SUPER
privilege, you can set the global server time zone value at
runtime with this statement:
mysql> SET GLOBAL time_zone = timezone ;
Per-connection time zones. Each client that connects has its
own time zone setting, given by the session
time_zone variable.
Initially, the session variable takes its value from the
global time_zone variable,
but the client can change its own time zone with this
statement:
mysql> SET time_zone = timezone ;
The current session time zone setting affects display and storage
of time values that are zone-sensitive. This includes the values
displayed by functions such as
NOW() or
CURTIME() , and values stored in and
retrieved from TIMESTAMP columns.
Values for TIMESTAMP columns are
converted from the current time zone to UTC for storage, and from
UTC to the current time zone for retrieval.
The current time zone setting does not affect values displayed by
functions such as UTC_TIMESTAMP()
or values in DATE ,
TIME , or
DATETIME columns. Nor are values in
those data types stored in UTC; the time zone applies for them
only when converting from TIMESTAMP values. If
you want locale-specific arithmetic for
DATE ,
TIME , or
DATETIME values, convert them to
UTC, perform the arithmetic, and then convert back.
The current values of the global and client-specific time zones
can be retrieved like this:
mysql> SELECT @@global.time_zone, @@session.time_zone;
timezone values can be given in several
formats, none of which are case sensitive:
The value 'SYSTEM' indicates that the time
zone should be the same as the system time zone.
The value can be given as a string indicating an offset from
UTC, such as '+10:00' or
'-6:00' .
The value can be given as a named time zone, such as
'Europe/Helsinki' ,
'US/Eastern' , or 'MET' .
Named time zones can be used only if the time zone information
tables in the mysql database have been
created and populated.
The MySQL installation procedure creates the time zone tables in
the mysql database, but does not load them. You
must do so manually using the following instructions. (If you are
upgrading to MySQL 4.1.3 or later from an earlier version, you can
create the tables by upgrading your mysql
database. Use the instructions in Section 4.4.9, “mysql_upgrade — Check Tables for MySQL Upgrade”.
After creating the tables, you can load them.)
Note
Loading the time zone information is not necessarily a one-time
operation because the information changes occasionally. For
example, the rules for Daylight Saving Time in the United
States, Mexico, and parts of Canada changed in 2007. When such
changes occur, applications that use the old rules become out of
date and you may find it necessary to reload the time zone
tables to keep the information used by your MySQL server
current. See the notes at the end of this section.
If your system has its own zoneinfo
database (the set of files describing time zones), you should use
the mysql_tzinfo_to_sql program for filling the
time zone tables. Examples of such systems are Linux, FreeBSD, Sun
Solaris, and Mac OS X. One likely location for these files is the
/usr/share/zoneinfo directory. If your system
does not have a zoneinfo database, you can use the downloadable
package described later in this section.
The mysql_tzinfo_to_sql program is used to load
the time zone tables. On the command line, pass the zoneinfo
directory path name to mysql_tzinfo_to_sql and
send the output into the mysql program. For
example:
shell> mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql
mysql_tzinfo_to_sql reads your system's time
zone files and generates SQL statements from them.
mysql processes those statements to load the
time zone tables.
mysql_tzinfo_to_sql also can be used to load a
single time zone file or to generate leap second information:
To load a single time zone file
tz_file that corresponds to a time
zone name tz_name , invoke
mysql_tzinfo_to_sql like this:
shell> mysql_tzinfo_to_sql tz_file tz_name | mysql -u root mysql
With this approach, you must execute a separate command to
load the time zone file for each named zone that the server
needs to know about.
If your time zone needs to account for leap seconds,
initialize the leap second information like this, where
tz_file is the name of your time
zone file:
shell> mysql_tzinfo_to_sql --leap tz_file | mysql -u root mysql
After running mysql_tzinfo_to_sql, it is
best to restart the server so that it does not continue to use
any previously cached time zone data.
If your system is one that has no zoneinfo database (for example,
Windows or HP-UX), you can use the package of pre-built time zone
tables that is available for download at the MySQL Developer Zone:
http://dev.mysql.com/downloads/timezones.html
This time zone package contains .frm ,
.MYD , and .MYI files for
the MyISAM time zone tables. These tables
should be part of the mysql database, so you
should place the files in the mysql
subdirectory of your MySQL server's data directory. The server
should be stopped while you do this and restarted afterward.
Warning
Do not use the downloadable package if your system has a
zoneinfo database. Use the
mysql_tzinfo_to_sql utility instead.
Otherwise, you may cause a difference in datetime handling
between MySQL and other applications on your system.
For information about time zone settings in replication setup,
please see Section 16.3.1, “Replication Features and Issues”.
9.7.1. Staying Current with Time Zone Changes
As mentioned earlier, when the time zone rules change,
applications that use the old rules become out of date. To stay
current, it is necessary to make sure that your system uses
current time zone information is used. For MySQL, there are two
factors to consider in staying current:
The operating system time affects the value that the MySQL
server uses for times if its time zone is set to
SYSTEM . Make sure that your operating
system is using the latest time zone information. For most
operating systems, the latest update or service pack
prepares your system for the time changes. Check the Web
site for your operating system vendor for an update that
addresses the time changes.
If you replace the system's
/etc/localtime timezone file with a
version that uses rules differing from those in effect at
mysqld startup, you should restart
mysqld so that it uses the updated rules.
Otherwise, mysqld might not notice when
the system changes its time.
If you use named time zones with MySQL, make sure that the
time zone tables in the mysql database
are up to date. If your system has its own zoneinfo
database, you should reload the MySQL time zone tables
whenever the zoneinfo database is updated, using the
instructions given earlier in this section. For systems that
do not have their own zoneinfo database, check the MySQL
Developer Zone for updates. When a new update is available,
download it and use it to replace your current time zone
tables. mysqld caches time zone
information that it looks up, so after replacing the time
zone tables, you should restart mysqld to
make sure that it does not continue to serve outdated time
zone data.
If you are uncertain whether named time zones are available, for
use either as the server's time zone setting or by clients that
set their own time zone, check whether your time zone tables are
empty. The following query determines whether the table that
contains time zone names has any rows:
mysql> SELECT COUNT(*) FROM mysql.time_zone_name;
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
A count of zero indicates that the table is empty. In this case,
no one can be using named time zones, and you don't need to
update the tables. A count greater than zero indicates that the
table is not empty and that its contents are available to be
used for named time zone support. In this case, you should be
sure to reload your time zone tables so that anyone who uses
named time zones will get correct query results.
To check whether your MySQL installation is updated properly for
a change in Daylight Saving Time rules, use a test like the one
following. The example uses values that are appropriate for the
2007 DST 1-hour change that occurs in the United States on March
11 at 2 a.m.
The test uses these two queries:
SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
The two time values indicate the times at which the DST change
occurs, and the use of named time zones requires that the time
zone tables be used. The desired result is that both queries
return the same result (the input time, converted to the
equivalent value in the 'US/Central' time zone).
Before updating the time zone tables, you would see an incorrect
result like this:
mysql> SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 01:00:00 |
+------------------------------------------------------------+
mysql> SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 02:00:00 |
+------------------------------------------------------------+
After updating the tables, you should see the correct result:
mysql> SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 01:00:00 |
+------------------------------------------------------------+
mysql> SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 01:00:00 |
+------------------------------------------------------------+
9.7.2. Time Zone Leap Second Support
Before MySQL 5.0.74, if the operating system is configured to
return leap seconds from OS time calls or if the MySQL server
uses a time zone definition that has leap seconds, functions
such as NOW() could return a
value having a time part that ends with
:59:60 or :59:61 . If such
values are inserted into a table, they would be dumped as is by
mysqldump but considered invalid when
reloaded, leading to backup/restore problems.
As of MySQL 5.0.74, leap second values are returned with a time
part that ends with :59:59 . This means that a
function such as NOW() can return
the same value for two or three consecutive seconds during the
leap second. It remains true that literal temporal values having
a time part that ends with :59:60 or
:59:61 are considered invalid.
If it is necessary to search for
TIMESTAMP values one second
before the leap second, anomalous results may be obtained if you
use a comparison with 'YYYY-MM-DD hh:mm:ss'
values:
mysql> CREATE TABLE t1 (a INT, ts TIMESTAMP DEFAULT NOW(), PRIMARY KEY (ts));
Query OK, 0 rows affected (0.11 sec)
mysql> # Simulate NOW() = '2009-01-01 02:59:59'
mysql> SET timestamp = 1230768022;
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO t1 (a) VALUES (1);
Query OK, 1 row affected (0.07 sec)
mysql> # Simulate NOW() = '2009-01-01 02:59:60'
mysql> SET timestamp = 1230768023;
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO t1 (a) VALUES (2);
Query OK, 1 row affected (0.02 sec)
mysql> SELECT * FROM t1;
+------+---------------------+
| a | ts |
+------+---------------------+
| 1 | 2008-12-31 18:00:22 |
| 2 | 2008-12-31 18:00:23 |
+------+---------------------+
2 rows in set (0.02 sec)
mysql> SELECT * FROM t1 WHERE ts = '2009-01-01 02:59:59';
Empty set (0.03 sec)
To work around this, you can use a comparison based on the UTC
value actually stored in column, which has the leap second
correction applied:
mysql> SELECT * FROM t1 WHERE UNIX_TIMESTAMP(ts) = 1230768023;
+------+---------------------+
| a | ts |
+------+---------------------+
| 2 | 2008-12-31 18:00:23 |
+------+---------------------+
1 row in set (0.02 sec)
9.8. MySQL Server Locale Support
Beginning with MySQL 5.0.25, the locale indicated by the
lc_time_names system variable
controls the language used to display day and month names and
abbreviations. This variable affects the output from the
DATE_FORMAT() ,
DAYNAME() , and
MONTHNAME() functions.
Locale names have language and region subtags listed by IANA
(http://www.iana.org/assignments/language-subtag-registry)
such as 'ja_JP' or 'pt_BR' .
The default value is 'en_US' regardless of your
system's locale setting, but you can set the value at server
startup or set the GLOBAL value if you have the
SUPER privilege. Any client can
examine the value of
lc_time_names or set its
SESSION value to affect the locale for its own
connection.
mysql> SET NAMES 'utf8';
Query OK, 0 rows affected (0.09 sec)
mysql> SELECT @@lc_time_names;
+-----------------+
| @@lc_time_names |
+-----------------+
| en_US |
+-----------------+
1 row in set (0.00 sec)
mysql> SELECT DAYNAME('2010-01-01'), MONTHNAME('2010-01-01');
+-----------------------+-------------------------+
| DAYNAME('2010-01-01') | MONTHNAME('2010-01-01') |
+-----------------------+-------------------------+
| Friday | January |
+-----------------------+-------------------------+
1 row in set (0.00 sec)
mysql> SELECT DATE_FORMAT('2010-01-01','%W %a %M %b');
+-----------------------------------------+
| DATE_FORMAT('2010-01-01','%W %a %M %b') |
+-----------------------------------------+
| Friday Fri January Jan |
+-----------------------------------------+
1 row in set (0.00 sec)
mysql> SET lc_time_names = 'es_MX';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT @@lc_time_names;
+-----------------+
| @@lc_time_names |
+-----------------+
| es_MX |
+-----------------+
1 row in set (0.00 sec)
mysql> SELECT DAYNAME('2010-01-01'), MONTHNAME('2010-01-01');
+-----------------------+-------------------------+
| DAYNAME('2010-01-01') | MONTHNAME('2010-01-01') |
+-----------------------+-------------------------+
| viernes | enero |
+-----------------------+-------------------------+
1 row in set (0.00 sec)
mysql> SELECT DATE_FORMAT('2010-01-01','%W %a %M %b');
+-----------------------------------------+
| DATE_FORMAT('2010-01-01','%W %a %M %b') |
+-----------------------------------------+
| viernes vie enero ene |
+-----------------------------------------+
1 row in set (0.00 sec)
The day or month name for each of the affected functions is
converted from utf8 to the character set
indicated by the
character_set_connection system
variable.
lc_time_names may be set to any
of the following locale values.
lc_time_names currently does not
affect the STR_TO_DATE() or
GET_FORMAT() function.
|
|