Detecting and Converting Japanese Multibyte Encodings in PHP

January 30th, 2012

PHP's mbstring extension provides a large collection of functions for handling multibyte strings such as Japanese text. Two of the most useful are mb_check_encoding(), which tests whether a string is valid in a given encoding, and mb_convert_encoding(), which converts a string from one encoding to another.

To check whether $string is valid UTF-8, we call mb_check_encoding() like this:

if (mb_check_encoding($string, "UTF-8")) {
   // do_something();
}

To convert $string, which is currently Shift-JIS, to UTF-8, we call mb_convert_encoding() like this:

$convertedString = mb_convert_encoding($string, "UTF-8", "Shift-JIS");
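As a small worked example of the two calls above, here is a sketch that converts a known Shift-JIS byte sequence and inspects the result. The sample bytes (the word 日本語 in Shift-JIS) and the variable names are my own choices for illustration, and the mbstring extension is assumed to be enabled.

```php
<?php
// "日本語" encoded as Shift-JIS, written as raw bytes so that the
// example does not depend on the encoding of this source file.
$sjis = "\x93\xFA\x96\x7B\x8C\xEA";

// The Shift-JIS bytes are not a valid UTF-8 sequence.
$isUtf8Before = mb_check_encoding($sjis, "UTF-8");   // false

// Convert from Shift-JIS to UTF-8.
$utf8 = mb_convert_encoding($sjis, "UTF-8", "SJIS");

// The converted string is valid UTF-8: three characters, three bytes each.
$isUtf8After = mb_check_encoding($utf8, "UTF-8");    // true
$hex = bin2hex($utf8);                               // "e697a5e69cace8aa9e"
```

Writing the input as escaped bytes keeps the example independent of how the script file itself is saved.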

A convenient feature of mb_convert_encoding() is that the third argument can be a list of candidate source encodings rather than a single one, in which case the function tries to work out which of them the string is actually in before converting. This comes in very handy if you want to convert any Japanese multibyte encoding to UTF-8, or to something else. There are actually 18 Japanese-specific multibyte encodings (that I know of), not including the Unicode variants like UTF-8 and UTF-16. A lot of them come from the Japanese mobile phone carriers.

Let’s put all of this together and check if a string is UTF-8, and if it’s not, meaning it is one of the other 18 Japanese encoding types, let’s convert it to UTF-8.

if (!mb_check_encoding($string, "UTF-8")) {

   // Candidate source encodings, passed as an array so that line breaks
   // and indentation never end up inside an encoding name.
   $string = mb_convert_encoding($string, "UTF-8", array(
      "Shift-JIS", "EUC-JP", "JIS", "SJIS", "JIS-ms", "eucJP-win", "SJIS-win",
      "ISO-2022-JP", "ISO-2022-JP-MS", "SJIS-mac", "SJIS-Mobile#DOCOMO",
      "SJIS-Mobile#KDDI", "SJIS-Mobile#SOFTBANK", "UTF-8-Mobile#DOCOMO",
      "UTF-8-Mobile#KDDI-A", "UTF-8-Mobile#KDDI-B", "UTF-8-Mobile#SOFTBANK",
      "ISO-2022-JP-MOBILE#KDDI"));
}
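If you want to know which encoding was picked, or to handle the case where none of the candidates fits, the same idea can be split into an explicit detect step and a convert step. The following is only a sketch: the function name to_utf8() and the shortened default candidate list are hypothetical choices of mine, and it relies on mb_detect_encoding() in strict mode, whose results are heuristic and depend on the order of the candidates.

```php
<?php
/**
 * Hypothetical helper: return $bytes as UTF-8, or null if no candidate
 * encoding matches. The default candidate list is an assumption made for
 * this sketch, not a complete list of Japanese encodings.
 */
function to_utf8(string $bytes, array $candidates = ["SJIS", "EUC-JP"]): ?string
{
    // Already valid UTF-8 (this includes plain ASCII): nothing to do.
    if (mb_check_encoding($bytes, "UTF-8")) {
        return $bytes;
    }

    // Strict mode only accepts encodings in which the whole string is valid.
    $detected = mb_detect_encoding($bytes, $candidates, true);
    if ($detected === false) {
        return null;
    }

    return mb_convert_encoding($bytes, "UTF-8", $detected);
}

// Usage: "日本語" as raw Shift-JIS bytes.
$converted = to_utf8("\x93\xFA\x96\x7B\x8C\xEA");
```

One caveat worth keeping in mind: ISO-2022-JP is a 7-bit encoding, so its bytes also pass a UTF-8 validity check, and a check-first approach like this one will leave such strings unconverted.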

4 Responses to “Detecting and Converting Japanese Multibyte Encodings in PHP”

  1. Daniele says:

    I’m having some trouble with this.
    When I run the following code, it does not return the actual character encoding. How did you get this to work properly?

  2. Daniele says:

    oops, it didn’t print the code. Here it is:
    ——–
    setlocale(LC_ALL, "ja_JP.utf8");

    echo "Enter website URL: ";
    $string = trim(fgets( STDIN ));

    if (!($handle = file_get_contents($string)) ) die( "Cannot read or access the specified URL." );

    echo "Encoding used: ".mb_detect_encoding($string, "Shift-JIS, EUC-JP, JIS, SJIS, JIS-ms, eucJP-win, SJIS-win, ISO-2022-JP, ISO-2022-JP-MS, SJIS-mac, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI, SJIS-Mobile#SOFTBANK, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, ISO-2022-JP-MOBILE#KDDI")."\n";
    ——–

  3. Daniele says:

    I’ve extended your code a bit to take concatenations of kanji from Asahi news’ articles. Here’s the article and code: http://asia-gazette.com/news/japan/109
