Archive for the ‘Programming’ Category

Unlocking Secured Password Protected PDF Files

Saturday, February 23rd, 2013

Out of all the possible file formats out there, translating a PDF document is usually the worst-case scenario. The only thing worse than a PDF is a locked, password-protected PDF with security settings that don’t even allow you to copy and paste text out of it. It’s very difficult to look up Japanese words if you can’t copy the text, and you definitely can’t make use of translation memory software if you have no access to the text. So here is a simple way to unlock those secured PDFs and get access to the text.

You Will Need

A Linux computer or VM with Ghostscript installed.

How to Unlock the Secured PDF with Ghostscript

We’re actually going to create an identical copy of the PDF that is unlocked. We’ll use Ghostscript to do this. Ghostscript is a PostScript and PDF language interpreter and previewer that is commonly found pre-installed on most major Linux distributions. If it isn’t already installed, it can be easily obtained using your distro’s package management tool.

Using Ghostscript on Linux, you can unlock a PDF with a single command:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -c .setpdfwrite -f INPUT.pdf

That’s a rather long series of parameters, so let’s break it down so you understand what is going on.

  • The gs command invokes Ghostscript.
  • The -q switch invokes quiet mode, which suppresses a lot of messages that you probably aren’t interested in.
  • The -dNOPAUSE switch enables no pause after each page.
  • The -dBATCH switch will exit after the last file completes.
  • The -sDEVICE=pdfwrite switch selects the device. In this case, we select the pdfwrite device to create a PDF. Ghostscript works with numerous devices for almost every possible format, including graphic formats such as jpeg, tiff, png, bmp, etc.
  • The -sOutputFile=OUTPUT.pdf switch selects the output file that we are creating. We read in a locked PDF, and we create a new, unlocked PDF file called OUTPUT.pdf in this case.
  • The .setpdfwrite operator automatically sets up parameters that are useful for creating PDFs using the pdfwrite output device.
  • Our locked PDF file we want to unlock is INPUT.pdf in this example.
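
If you want to script this unlock step from PHP (say, as part of a larger workflow), the same command can be driven with shell_exec(). This is just a hedged sketch, assuming gs is on the PATH; the file names are placeholders:

<?php
// Run the same Ghostscript unlock command from PHP.
// escapeshellarg() guards against odd characters in the file names.
$in  = "INPUT.pdf";
$out = "OUTPUT.pdf";
$cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite"
     . " -sOutputFile=" . escapeshellarg($out)
     . " -c .setpdfwrite -f " . escapeshellarg($in);
shell_exec($cmd);
?>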

Converting Multiple Files in Batch

Suppose you have an entire directory full of locked PDF files you want to unlock. Here is a quick little Bash script you can do on the Linux command line to unlock all the PDFs at once.

for x in *.pdf
do
   gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="converted_$x" -c .setpdfwrite -f "$x"
done

In this script, all of the converted PDFs will be prefixed with converted_. The quotes around $x keep file names containing spaces intact.

That’s all there is to it. If you have a Linux computer or virtual machine, it just takes one command to create an unlocked copy of the PDF.

A Star is Born – Introducing Honyaku Star Japanese/English Dictionary

Thursday, October 4th, 2012

Today I’m launching Honyaku Star, a new online Japanese/English dictionary.

The goal of Honyaku Star is to be the world’s most comprehensive free, online Japanese/English dictionary and corpus. Honyaku Star is built on top of numerous excellent community dictionaries, and adds the Honyaku Star dictionary, which tries to fill in everything that isn’t in those general dictionaries, with the goal of being the best, and only, dictionary you need.

Honyaku Star is more than a dictionary; it is also a Japanese/English bilingual corpus: a database of parallel texts that provides context, usage, and examples of the words and phrases you search for. Many words have several valid translations, and seeing a word used in different contexts can help you understand the different meanings and pick the appropriate usage. When you search in Honyaku Star, you get dictionary results and example sentences together.

I built Honyaku Star because I use online Japanese/English dictionaries every day, and none of them satisfied me. There are certain things I want, and don’t want, out of an online Japanese/English dictionary. So, I built Honyaku Star based on the principles I think are important.

  • Dictionaries and language resources should be free and easily accessible.
  • The one and only dictionary advanced students and translators will need.
  • Lots of relevant results for a search query. 1,000 results if possible.
  • Provide in-context usage and example sentences.
  • Clean, simple user interface.
  • No pagination in the UI.
  • Searches should be super fast. Instantaneous!
  • No visual distractions. No ads. No random Web content. No useless information like character encoding codes.
  • No advanced search! It should be smart and bring back results in an intelligent way.
  • The primary goal of a searchable dictionary is to be a useful language resource; it should not be a means to draw you in and sell you language services or books.

I think I’ve kept to these design principles in this initial version, and it’s only going to get better with time.

The technology behind Honyaku Star is Linux, PHP, Perl, MySQL, and the awesome full-text index Mroonga. I’ll post more about some of the technical challenges in future posts.

Honestly, I made Honyaku Star for myself, to be the ideal dictionary that I’d want to use everyday. But my hope is others will find it useful. All user feedback is welcome and appreciated. And if you use and like Honyaku Star, consider contributing translations to it.

Start using Honyaku Star today at http://honyakustar.com.

Japanese Encoding Conversion

Monday, July 16th, 2012

Japanese has many different text encodings, and one that pops up a lot when you are working on text files is EUC-JP (Japanese Extended Unix Code). You find EUC-JP encoding used in many Japanese Web sites, text documents, JMdict and EDICT glossary files, and so on. This encoding is particularly troublesome because a lot of English-language text editors and utilities don’t know how to deal with it.

Usually you want to work with UTF-8 instead, so here are some strategies for converting EUC-JP encoding into UTF-8.

Simple Command-Line Conversion in Linux

On Linux this is really easy. Use the iconv command-line conversion utility.

iconv -f EUC-JP -t UTF-8 input.txt > output.txt

or

iconv -f EUC-JP -t UTF-8 input.txt -o output.txt

input.txt is in EUC-JP encoding, and the resultant output.txt is converted to UTF-8. Short and sweet, and can easily be piped to further commands.

We can use a Bash loop to automate this. For example, to convert all XML files to UTF-8, placing the results in an existing converted subdirectory:

for x in *.xml;
do
  iconv -f EUC-JP -t UTF-8 "$x" > "converted/$x"
done

Command-Line Perl Program

Let’s write a simple Perl program that will take two command-line arguments: the input file in EUC-JP encoding, and the resultant output file converted to UTF-8. We will be able to run the program like this:

./convert input.txt output.txt

For this, we will use the from_to() function that is part of the Encode module. The from_to() function takes three parameters: the input, the encoding of the input, and the desired encoding of the output.

from_to($input, "euc-jp", "utf8");

Here is the full program:

#!/usr/bin/perl
use strict;
use warnings;
use Encode "from_to";

my $inputFilename  = $ARGV[0];
open(INFILE,  "<", "$inputFilename")  or die "Can't open $inputFilename:  $!";

my $outputFilename = $ARGV[1];
open(OUTFILE, ">", "$outputFilename") or die "Can't open $outputFilename: $!";

while (<INFILE>) {
   from_to($_, "euc-jp", "utf8");
   print OUTFILE $_;
}

close INFILE  or die "INFILE:  $!";
close OUTFILE or die "OUTFILE: $!";

Command-Line PHP Program

PHP programs can also be run on the command line. Let’s add a little bit more this time and convert all non-PHP files in a directory from EUC-JP to UTF-8 and put them in a tmp directory using command-line PHP.

We will use the mb_convert_encoding() function which works on multi-byte strings. The mb_convert_encoding() function takes three parameters: the input, the desired encoding of the output, and the encoding of the input.

mb_convert_encoding($input, "UTF-8", "EUC-JP");

Here is the full program:

<?php

$dirHandler = opendir(".");

while ($fileName = readdir($dirHandler)) {

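   // Skip . and .., and skip the php and tmp entries so they are not converted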
   if ($fileName != '.' && $fileName != '..' 
                        && $fileName != 'php' && $fileName != 'tmp') {

      $file = file_get_contents("./$fileName", FILE_USE_INCLUDE_PATH);
      $convertedText = mb_convert_encoding($file, "UTF-8", "EUC-JP");

      echo "$fileName\n";

      $writeFile  = "../tmp/$fileName";
      $fileHandle = fopen($writeFile, 'w') or die("can't open file");
      fwrite($fileHandle, $convertedText);
      fclose($fileHandle);
   }
 }

?>

Just Use Firefox and a Text Editor

Finally, a very simple way to convert EUC-JP text to UTF-8 if you are working with plain text is to simply open the file in Firefox. Firefox almost always gets the encoding right, and if it doesn’t, you can manually set it in the Character Encoding menu. Then, copy and paste the text into your favorite text editor and save it as UTF-8.

As you can see, you have lots of easy options for converting text between various encodings. And the same scripts can be used for other Japanese encodings such as JIS and Shift-JIS.

Kanji Usage Count Concordance Web App

Sunday, February 5th, 2012

A few posts ago I posted some PHP code to show how to extract all the kanji from a string and create a concordance that orders them by how often each kanji is used.

Expanding on this idea, I created a Web application that allows you to generate a kanji usage count concordance from any Web page to see what the most used kanji are. You can access it at the new Localizing Japan Kanji Usage Count Concordance App page. You can also access it from the main navigation tabs at the top of the site.

It’s very easy to use: just copy and paste the URL of the Japanese Web page you want to analyze and submit it. The kanji concordance will show you, in descending order, the most used kanji on that Web page. Among many other uses, this can help students of Japanese who want to know the most used kanji on certain sites so they can focus their studying.

Enjoy the Kanji Usage Count Concordance App.

Detecting and Converting Japanese Multibyte Encodings in PHP

Monday, January 30th, 2012

PHP has a large collection of multibyte functions in the standard library for handling multibyte strings such as Japanese. Two useful multibyte functions that PHP provides are for detecting the encoding of a multibyte string, and converting from one multibyte encoding to another.

To check if $string is in UTF-8 encoding, we call mb_check_encoding() like this:

if (mb_check_encoding($string, "UTF-8")) { // do_something(); }
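
If you don’t know the source encoding in advance, the related mb_detect_encoding() function will try to guess it from a list of candidate encodings. This is a hedged aside; the candidate list here is only an example:

$encoding = mb_detect_encoding($string, "UTF-8, EUC-JP, SJIS", true);

The third parameter turns on strict detection, which is usually what you want.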

To convert $string, which is currently Shift-JIS, to UTF-8, we call mb_convert_encoding() like this:

$convertedString = mb_convert_encoding($string, "UTF-8", "Shift-JIS");

A convenient feature of mb_convert_encoding() is that you can generalize the function by adding a list of character encodings to convert from. This can come in very handy if you want to convert all Japanese multibyte string encodings to UTF-8, or something else. There are actually 18 Japanese-specific multibyte encodings (that I know of), not including all the Unicode variants like UTF-8, UTF-16, etc. A lot of them come from the Japanese mobile phone carriers.
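
As an aside, if you are curious which encodings your particular PHP build supports, mb_list_encodings() returns the full list:

print_r(mb_list_encodings());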

Let’s put all of this together and check if a string is UTF-8, and if it’s not, meaning it is one of the other 18 Japanese encoding types, let’s convert it to UTF-8.

if (!mb_check_encoding($string, "UTF-8")) {

   $string = mb_convert_encoding($string, "UTF-8",
      "Shift-JIS, EUC-JP, JIS, SJIS, JIS-ms, eucJP-win, SJIS-win, ISO-2022-JP,
       ISO-2022-JP-MS, SJIS-mac, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI,
       SJIS-Mobile#SOFTBANK, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A,
       UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, ISO-2022-JP-MOBILE#KDDI");
}

Regular Expressions for Japanese Text

Friday, January 20th, 2012

Regular expressions are extremely useful for matching patterns in text. But when it comes to Japanese Unicode text, it isn’t obvious what you should do to create regular expressions to match a range of Japanese characters. You can try something like [あ-ん] to match all hiragana characters—and you would be close—but it isn’t the best way to do it. Also, direct input of Japanese isn’t always an option.

To deal with this, know that each character in Unicode has a hexadecimal code point. For example, the code point for the hiragana あ is 3042, and this is designated by U+3042. This code point can be used in a regular expression like this: \x3042. This will match a hiragana あ. This is very useful for programmers who must code pattern matching for Japanese on a system where they cannot input or display Japanese text, or who lack the know-how to do it (see some of my Japanese input posts if you need to know how!).
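
As a quick sanity check in PHP (note the curly braces and the /u modifier, which are explained in the notes at the end of this post):

// This matches, because あ is U+3042
var_dump(preg_match('/\x{3042}/u', 'あい'));  // int(1)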

Additionally, some flavors of regular expressions have what are known as Unicode block properties, or Unicode scripts. These are pre-defined blocks of regex Unicode character classes. Hiragana, katakana, and kanji are included in the block properties—very convenient if you need the full-script match in your regular expression.

With this basic knowledge, the following is a thorough list of different Japanese character classes and the various Japanese regular expressions that match those character classes. And further down, a few programming examples showing them in use.

Hiragana

Unicode code points regex: [\x3041-\x3096]
Unicode block property regex: \p{Hiragana}

ぁ あ ぃ い ぅ う ぇ え ぉ お か が き ぎ く ぐ け げ こ ご さ ざ し じ す ず せ ぜ そ ぞ た だ ち ぢ っ つ づ て で と ど な に ぬ ね の は ば ぱ ひ び ぴ ふ ぶ ぷ へ べ ぺ ほ ぼ ぽ ま み む め も ゃ や ゅ ゆ ょ よ ら り る れ ろ ゎ わ ゐ ゑ を ん ゔ ゕ ゖ  ゙ ゚ ゛ ゜ ゝ ゞ ゟ

Katakana (Full Width)

Unicode code points regex: [\x30A0-\x30FF]
Unicode block property regex: \p{Katakana}

゠ ァ ア ィ イ ゥ ウ ェ エ ォ オ カ ガ キ ギ ク グ ケ ゲ コ ゴ サ ザ シ ジ ス ズ セ ゼ ソ ゾ タ ダ チ ヂ ッ ツ ヅ テ デ ト ド ナ ニ ヌ ネ ノ ハ バ パ ヒ ビ ピ フ ブ プ ヘ ベ ペ ホ ボ ポ マ ミ ム メ モ ャ ヤ ュ ユ ョ ヨ ラ リ ル レ ロ ヮ ワ ヰ ヱ ヲ ン ヴ ヵ ヶ ヷ ヸ ヹ ヺ ・ ー ヽ ヾ ヿ

Kanji

Unicode code points regex: [\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]
Unicode block property regex: \p{Han}

漢字 日本語 文字 言語 言葉 etc. Too many characters to list.

This regular expression will match all the kanji, including those used in Chinese.

Kanji Radicals

Unicode code points regex: [\x2E80-\x2FD5]

⺀ ⺁ ⺂ ⺃ ⺄ ⺅ ⺆ ⺇ ⺈ ⺉ ⺊ ⺋ ⺌ ⺍ ⺎ ⺏ ⺐ ⺑ ⺒ ⺓ ⺔ ⺕ ⺖ ⺗ ⺘ ⺙ ⺚ ⺛ ⺜ ⺝ ⺞ ⺟ ⺠ ⺡ ⺢ ⺣ ⺤ ⺥ ⺦ ⺧ ⺨ ⺩ ⺪ ⺫ ⺬ ⺭ ⺮ ⺯ ⺰ ⺱ ⺲ ⺳ ⺴ ⺵ ⺶ ⺷ ⺸ ⺹ ⺺ ⺻ ⺼ ⺽ ⺾ ⺿ ⻀ ⻁ ⻂ ⻃ ⻄ ⻅ ⻆ ⻇ ⻈ ⻉ ⻊ ⻋ ⻌ ⻍ ⻎ ⻏ ⻐ ⻑ ⻒ ⻓ ⻔ ⻕ ⻖ ⻗ ⻘ ⻙ ⻚ ⻛ ⻜ ⻝ ⻞ ⻟ ⻠ ⻡ ⻢ ⻣ ⻤ ⻥ ⻦ ⻧ ⻨ ⻩ ⻪ ⻫ ⻬ ⻭ ⻮ ⻯ ⻰ ⻱ ⻲ ⻳
⼀ ⼁ ⼂ ⼃ ⼄ ⼅ ⼆ ⼇ ⼈ ⼉ ⼊ ⼋ ⼌ ⼍ ⼎ ⼏ ⼐ ⼑ ⼒ ⼓ ⼔ ⼕ ⼖ ⼗ ⼘ ⼙ ⼚ ⼛ ⼜ ⼝ ⼞ ⼟ ⼠ ⼡ ⼢ ⼣ ⼤ ⼥ ⼦ ⼧ ⼨ ⼩ ⼪ ⼫ ⼬ ⼭ ⼮ ⼯ ⼰ ⼱ ⼲ ⼳ ⼴ ⼵ ⼶ ⼷ ⼸ ⼹ ⼺ ⼻ ⼼ ⼽ ⼾ ⼿ ⽀ ⽁ ⽂ ⽃ ⽄ ⽅ ⽆ ⽇ ⽈ ⽉ ⽊ ⽋ ⽌ ⽍ ⽎ ⽏ ⽐ ⽑ ⽒ ⽓ ⽔ ⽕ ⽖ ⽗ ⽘ ⽙ ⽚ ⽛ ⽜ ⽝ ⽞ ⽟ ⽠ ⽡ ⽢ ⽣ ⽤ ⽥ ⽦ ⽧ ⽨ ⽩ ⽪ ⽫ ⽬ ⽭ ⽮ ⽯ ⽰ ⽱ ⽲ ⽳ ⽴ ⽵ ⽶ ⽷ ⽸ ⽹ ⽺ ⽻ ⽼ ⽽ ⽾ ⽿ ⾀ ⾁ ⾂ ⾃ ⾄ ⾅ ⾆ ⾇ ⾈ ⾉ ⾊ ⾋ ⾌ ⾍ ⾎ ⾏ ⾐ ⾑ ⾒ ⾓ ⾔ ⾕ ⾖ ⾗ ⾘ ⾙ ⾚ ⾛ ⾜ ⾝ ⾞ ⾟ ⾠ ⾡ ⾢ ⾣ ⾤ ⾥ ⾦ ⾧ ⾨ ⾩ ⾪ ⾫ ⾬ ⾭ ⾮ ⾯ ⾰ ⾱ ⾲ ⾳ ⾴ ⾵ ⾶ ⾷ ⾸ ⾹ ⾺ ⾻ ⾼ ⾽ ⾾ ⾿ ⿀ ⿁ ⿂ ⿃ ⿄ ⿅ ⿆ ⿇ ⿈ ⿉ ⿊ ⿋ ⿌ ⿍ ⿎ ⿏ ⿐ ⿑ ⿒ ⿓ ⿔ ⿕

Katakana and Punctuation (Half Width)

Unicode code points regex: [\xFF5F-\xFF9F]

⦅ ⦆ 。 「 」 、 ・ ヲ ァ ィ ゥ ェ ォ ャ ュ ョ ッ ー ア イ ウ エ オ カ キ ク ケ コ サ シ ス セ ソ タ チ ツ テ ト ナ ニ ヌ ネ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ヤ ユ ヨ ラ リ ル レ ロ ワ ン ゙

Japanese Symbols and Punctuation

Unicode code points regex: [\x3000-\x303F]

、 。 〃 〄 々 〆 〇 〈 〉 《 》 「 」 『 』 【 】 〒 〓 〔 〕 〖 〗 〘 〙 〚 〛 〜 〝 〞 〟 〠 〡 〢 〣 〤 〥 〦 〧 〨 〩 〪 〫 〬 〭 〮 〯 〰 〱 〲 〳 〴 〵 〶 〷 〸 〹 〺 〻 〼 〽 〾 〿

Miscellaneous Japanese Symbols and Characters

Unicode code points regex: [\x31F0-\x31FF\x3220-\x3243\x3280-\x337F]

ㇰ ㇱ ㇲ ㇳ ㇴ ㇵ ㇶ ㇷ ㇸ ㇹ ㇺ ㇻ ㇼ ㇽ ㇾ ㇿ
㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩ ㈪ ㈫ ㈬ ㈭ ㈮ ㈯ ㈰ ㈱ ㈲ ㈳ ㈴ ㈵ ㈶ ㈷ ㈸ ㈹ ㈺ ㈻ ㈼ ㈽ ㈾ ㈿ ㉀ ㉁ ㉂ ㉃
㊀ ㊁ ㊂ ㊃ ㊄ ㊅ ㊆ ㊇ ㊈ ㊉ ㊊ ㊋ ㊌ ㊍ ㊎ ㊏ ㊐ ㊑ ㊒ ㊓ ㊔ ㊕ ㊖ ㊗ ㊘ ㊙ ㊚ ㊛ ㊜ ㊝ ㊞ ㊟ ㊠ ㊡ ㊢ ㊣ ㊤ ㊥ ㊦ ㊧ ㊨ ㊩ ㊪ ㊫ ㊬ ㊭ ㊮ ㊯ ㊰ ㊱ ㊲ ㊳ ㊴ ㊵ ㊶ ㊷ ㊸ ㊹ ㊺ ㊻ ㊼ ㊽ ㊾ ㊿
㋀ ㋁ ㋂ ㋃ ㋄ ㋅ ㋆ ㋇ ㋈ ㋉ ㋊ ㋋  ㋐ ㋑ ㋒ ㋓ ㋔ ㋕ ㋖ ㋗ ㋘ ㋙ ㋚ ㋛ ㋜ ㋝ ㋞ ㋟ ㋠ ㋡ ㋢ ㋣ ㋤ ㋥ ㋦ ㋧ ㋨ ㋩ ㋪ ㋫ ㋬ ㋭ ㋮ ㋯ ㋰ ㋱ ㋲ ㋳ ㋴ ㋵ ㋶ ㋷ ㋸ ㋹ ㋺ ㋻ ㋼ ㋽ ㋾
㌀ ㌁ ㌂ ㌃ ㌄ ㌅ ㌆ ㌇ ㌈ ㌉ ㌊ ㌋ ㌌ ㌍ ㌎ ㌏ ㌐ ㌑ ㌒ ㌓ ㌔ ㌕ ㌖ ㌗ ㌘ ㌙ ㌚ ㌛ ㌜ ㌝ ㌞ ㌟ ㌠ ㌡ ㌢ ㌣ ㌤ ㌥ ㌦ ㌧ ㌨ ㌩ ㌪ ㌫ ㌬ ㌭ ㌮ ㌯ ㌰ ㌱ ㌲ ㌳ ㌴ ㌵ ㌶ ㌷ ㌸ ㌹ ㌺ ㌻ ㌼ ㌽ ㌾ ㌿ ㍀ ㍁ ㍂ ㍃ ㍄ ㍅ ㍆ ㍇ ㍈ ㍉ ㍊ ㍋ ㍌ ㍍ ㍎ ㍏ ㍐ ㍑ ㍒ ㍓ ㍔ ㍕ ㍖ ㍗ ㍘ ㍙ ㍚ ㍛ ㍜ ㍝ ㍞ ㍟ ㍠ ㍡ ㍢ ㍣ ㍤ ㍥ ㍦ ㍧ ㍨ ㍩ ㍪ ㍫ ㍬ ㍭ ㍮ ㍯ ㍰ ㍱ ㍲ ㍳ ㍴ ㍵ ㍶  ㍻ ㍼ ㍽ ㍾ ㍿

Alphanumeric and Punctuation (Full Width)

Unicode code points regex: [\xFF01-\xFF5E]

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~


Japanese RegEx Code Examples

Find all hiragana in a text string

// PHP
$pattern = "/[\x{3041}-\x{3096}]/u";
preg_match_all($pattern, $text, $matches);
print_r($matches);

# Perl
if ($text =~ m/[\x{3041}-\x{3096}]/) { print $text; }

Remove all hiragana from a text string

// PHP
$pattern = "/\p{Hiragana}/u";
$text = preg_replace($pattern, "", $text);

# Perl
$text =~ s/\p{Hiragana}//g;

Remove everything but Kanji

// PHP
// \P{Han} matches everything other than kanji
$pattern = "/\P{Han}/u";
$text = preg_replace($pattern, "", $text);

Note: In PHP and Perl, the Unicode code block regular expression is written with curly braces around the hexadecimal codes. So the regex of \x3041 becomes \x{3041} and so on.

Note: In Perl you have to make sure you have Unicode set up properly to get regular expressions to work over Japanese. You may also have to run perl with the -CS option (perl -CS) to get rid of any “wide character in print” warnings. See http://ahinea.com/en/tech/perl-unicode-struggle.html for more information.

Kanji Usage Count Concordance in PHP

Wednesday, December 7th, 2011

A concordance is a list of all words used in a document, Web site, or publication, and some additional useful information about those words. A useful concordance in translation and localization is a list of the most frequently used words. This can be used to identify important terms that should be picked up for a glossary. It can also be used by students of a foreign language to identify important vocabulary words they should dedicate time to study.

Students of Japanese often wonder which kanji they should learn. It can be hard to identify which kanji are most important, and the answer differs between subject matters.

To help with this, I’ve created a kanji concordance application in PHP to create a list of kanji and their usage counts in descending order.

Example

If you had this Japanese text:

私の名前はマークです。私はテキサス大学を卒業しました。すしが大好きです。

The kanji concordance would generate a list that looked like this:

2 私
2 大
1 名
1 前
1 学
1 卒
1 業
1 好

The kanji 私 and 大 are both used twice, so they are at the top of the list with the number 2 for the usage count. The rest of the kanji are used once and show a usage count of 1.

Kanji Concordance Code Explanation

The first thing we do in PHP is set the language locale with the setlocale() function. This is always good practice when dealing with language-related applications.

setlocale(LC_ALL, "ja_JP.utf8");

The LC_ALL parameter sets the locale for all categories, and the ja_JP.utf8 parameter sets the language and locale to Japanese/Japan in Unicode UTF-8.

Next, we will need some string of Japanese text that we want to examine and create our kanji concordance from. In our simple example we will use a hard-coded string. But in a real application we would probably dynamically input the string from some source.

$string = "私の名前はマークです。私はテキサス大学を卒業しました。私はすしが大好きです。
           私は漢字が好きです。";

Once we have our input string, we need to strip it of everything but the kanji, since that is all we are interested in. Japanese text can have hiragana, katakana, English characters, and various punctuation. If we remove all of those, we’ll be left with just the kanji. We will define a regular expression to match these unwanted characters, and then replace them with nothing.

$pattern = "/[a-zA-Z0-90-9あ-んア-ンー。、?!<>: 「」(){}≪≫〈〉《》【】
            『』〔〕[]・\n\r\t\s\(\) ]/u";

This regular expression pattern is fairly straightforward. a-zA-Z0-9 matches the English alphanumeric characters. We also match the double-byte numbers with 0-9. あ-ん will match all the hiragana, and ア-ン will match all of the katakana. Finally, we match the various punctuation marks and other special characters we can expect to find. The u at the end of the pattern is a pattern modifier that tells PHP that this pattern is Unicode UTF-8. I’ve probably left out some punctuation characters but for our example purposes this will do. (We can actually use a much simpler regex than this. For a thorough discussion of regular expressions for Japanese text, see this post on Japanese regex.)

We will use the regular expression search and replace function preg_replace() to match our input string against the regex pattern to remove the unwanted characters.

$kanjiString = preg_replace($pattern, "", $string);

The first parameter, $pattern, is the regular expression pattern to match against. The second parameter, "", is an empty string that we use to replace the regex matches. We match an unwanted non-kanji character and replace it with nothing—in other words, we delete it. The last parameter, $string, is the input string of Japanese to match against the regex pattern and remove everything but the kanji.

The variable $kanjiString now contains only the kanji characters from our original input string.

// $kanjiString = "私名前私大学卒業私大好私漢字好";

Our next step is to split up all the kanji characters and insert them into an array. We will do this in one step with the split by regular expression function preg_split().

$kanjiArray = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

The first parameter, "//u", is an empty regular expression that matches at every character boundary, and the u pattern modifier argument puts it in Unicode match mode; together they split the string into its individual characters. The second parameter, $kanjiString, is the input string to match against the regular expression. The third parameter, -1, is the limit parameter, and -1 indicates no limit. This means it will parse the entire string. The final parameter is the PREG_SPLIT_NO_EMPTY flag. This flag sets it so only non-empty items will be returned.

Now that we have an array full of individual kanji, we want to count them to get our kanji usage numbers. The array_count_values() function will count all the values of our input array, and return a new array with those values and their usage count.

$countedArray = array_count_values($kanjiArray);

With our array of kanji and their usage counts, we just need to sort them in reverse order with the arsort() function.

arsort($countedArray);

The variable $countedArray now contains a list of all the kanji used in our input string in order of their usage counts. In other words, we have successfully built a kanji usage count concordance.

The final step is to iterate through the array and display our concordance to the screen. We will do this with a simple foreach loop over our array.

foreach ($countedArray as $kanji => $count) {
   echo "$count $kanji <br/>";
}

Our kanji usage count concordance will display like this:

4 私
2 好
2 大
1 漢
1 字
1 業
1 学
1 名
1 前
1 卒

There we go. We have a list of all the kanji used and in order from most used to least used. As we can see, 私 seems to be a pretty important kanji. Better put it on your list to study.

In this example our Japanese input string was hard coded, but we can easily expand this code to take in input from a file or even screen scrape a Web site and see what their most used kanji are. With a large enough input sample, we can get a pretty good list of kanji and usage counts for our concordance.
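
For example, replacing the hard-coded string with a Web page fetch is a one-line change (a sketch, assuming allow_url_fopen is enabled; the URL is a placeholder):

$string = file_get_contents("http://example.com/");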

PHP Source Code

Here is the full source code for the kanji usage count concordance in PHP that we built.

<?php
setlocale(LC_ALL, "ja_JP.utf8");

$string = "私の名前はマークです。私はテキサス大学を卒業しました。私はすしが大好きです。
           私は漢字が好きです。";

$pattern = "/[a-zA-Z0-90-9あ-んア-ンー。、?!<>: 「」(){}≪≫〈〉《》【】
            『』〔〕[]・\n\r\t\s\(\) ]/u";
$kanjiString = preg_replace($pattern, "", $string);

$kanjiArray = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

$countedArray = array_count_values($kanjiArray);
arsort($countedArray);

foreach ($countedArray as $kanji => $count) {
   echo "$count $kanji <br/>";
}
?>

Localizing a CakePHP Application

Thursday, November 10th, 2011

If you build a PHP application using the CakePHP framework, it is easy to localize the application into multiple languages, provided you have the proper translations for those languages. If you want to internationalize your application to a global market, it is important to localize it for each language and region you want to target.

Fortunately, CakePHP and PHP itself provide us with some easy mechanisms to provide translations and localize our code without much effort. You do not have to make copies of HTML or PHP files. Everything will be done with the PHP files you already have, and the translated text strings will dynamically be inserted at render time, ready for the user in their localized language and format.

In this example, we will localize a CakePHP application that was written in English into Japanese. Let’s assume we have a menu for e-mail functions that we want to localize. This example assumes CakePHP version 2.0 (But earlier and perhaps later versions of CakePHP will work in a similar manner).

Wrapping Translatable Text in __() Functions

The first step in the localization process is to identify the text strings that will need to be translated, and replace them with CakePHP’s localized string __() function. Supposing our menu looked like the following:

<ul>
   <li>Send</li>
   <li>Reply</li>
   <li>Forward</li>
   <li>Delete</li>
</ul>

We would wrap each text string inside the __() function as follows:

<ul>
   <li><?php echo __('Send') ?></li>
   <li><?php echo __('Reply') ?></li>
   <li><?php echo __('Forward') ?></li>
   <li><?php echo __('Delete') ?></li>
</ul>

The __() function identifies these strings as translatable text that will differ by language locale and uses the text within the __() function as the message ID. If we define the translations for a certain language, those translations will appear in place of these functions. If we do not define the translations for that language, the text within the __() function will display instead by default.

Creating the Localized PO Files

The next step is to create the PO files which will contain the translations for each language, to be dynamically inserted in each of the __() functions. CakePHP gives you two ways to do this: automatically, using the console shell, or manually.

Using the I18N Shell

CakePHP has some console shell programs that you can run on the command line, including one to generate the PO file to use as the original language source file for translations. In our case it will be a file with all the English text strings.

To run the i18n shell command, type the following on the Linux command line in your CakePHP application directory:

./Console/cake i18n extract

Then follow the onscreen menu.

The shell command will examine all of your application files for instances of the __() function and generate a PO file for the original source language that you can use to create the PO files for each of the translations you are going to use.

Creating the PO Files Manually

If you want to do this manually—for example, if you don’t have many translatable text strings, as in our example—you can create the PO files by hand in a text editor.

First we will create the original source language English version here:

/app/Locale/eng/LC_MESSAGES/default.po

The default.po file will have this format:

msgid "ID"
msgstr "STRING"

Here msgid is the ID within the __() function, and msgstr is the localized translation that should appear as output.

Our full English source PO file will look like this:

msgid "Send"
msgstr "Send"

msgid "Reply"
msgstr "Reply"

msgid "Forward"
msgstr "Forward"

msgid "Delete"
msgstr "Delete"

To create the Japanese localized version, we copy the English PO file to the Japanese directory, and then replace the English strings in the msgstr field with the Japanese translations. (If you have a large application being localized into dozens of languages, it is at this point that you send the PO files to a language service provider to translate the localized strings.)

Our localized PO file with Japanese translations will go here:

/app/Locale/jpn/LC_MESSAGES/default.po

Our final localized Japanese PO file will look like this.

msgid "Send"
msgstr "送信"

msgid "Reply"
msgstr "返信"

msgid "Forward"
msgstr "転送"

msgid "Delete"
msgstr "削除"

That’s it. The translations for English and Japanese will display appropriately for the proper locale. If you want to add translations for other languages, follow the same process and put the new PO file in the directory that corresponds with that language code. CakePHP uses the ISO 639-2 standard for naming locales, so follow that standard for naming your localized directories. Make sure you save these files as UTF-8.

Detecting and Changing Languages and Locales

Having the translations ready is nice, but you still have to detect and change to the proper language in your PHP code to get the translations to appear. Detecting the user’s language is tricky. You could do it in JavaScript or try to use something like the following PECL function:

locale_accept_from_http($_SERVER['HTTP_ACCEPT_LANGUAGE']);

In either case, you have to trust the user agent to report back the right language.

Another option is to simply have a button or menu for the user to select their language. Flag icons usually work well for this. Then you can set the language manually. In CakePHP it is easy to do with the CakeSession class:

CakeSession::write('Config.language', 'jpn');

Finally, you also have to set the locale in PHP. Since you already know the language from however you determined it above, you can use PHP’s setlocale() function to do this. This is important for the localization of date, time, money, and numeric separator formats, among others.

setlocale("LC_ALL", "ja_JP.utf8");

That’s all there is to localizing a CakePHP application with proper translations and other locale-specific customizations.

Virtualizing a Linux System (Creating a Linux VM P2V)

Saturday, March 5th, 2011

This tutorial article is going to show you how to create a Linux virtual machine from a physical Linux system. These instructions are generic enough to work with any Linux distribution, such as Ubuntu, Fedora, Red Hat, CentOS, Debian, Mint, etc.

There are many reasons why you would create a VM of a physical system you have running. You might want to test out things before you try them on your actual system. It is useful when you are translating to have both the English and Japanese (or other language) OS and applications open side by side to reference the correct translations easily. Whatever the reason, this article will show you one way to do it pretty easily.

Overview of the Linux VM creation task:

Tools and Resources Needed

  • SystemRescueCd ISO file
  • Blank CD-ROM or USB disk
  • USB disk drive large enough to fit entire Linux system
  • VMware or VirtualBox

Preparation Tasks

  1. Make note of the disk partitioning
  2. Create a bootable Linux rescue disk

Main Tasks

  1. Image the hard drive partitions
  2. Create an empty Virtual Machine
  3. Recreate the hard drive partitions
  4. Restore the hard drive partitions
  5. Set up the boot loader

Final Task – Boot the VM

Optional Task – Configure X11

Preparation Tasks

Make Note of the Disk Partitioning

On the physical Linux system we want to virtualize, run the df command to list the partitions and mount points.

df -h

Make a note of the partitions, their sizes, and mount points. You will use this information later to recreate the disk partitioning in the virtual machine.

Create a Bootable Linux Rescue Disk

For the task of converting a physical Linux system to a virtual machine, we are going to do the work from another version of Linux. Any bootable version of Linux will work, and I really like SystemRescueCd for this task. It is a lightweight Linux system that comes with all the system tools you’ll need for this job, like partimage and fdisk (or GParted).

Download the SystemRescueCd ISO file.

Burn the ISO file to a CD-ROM, or follow the instructions to make a bootable USB stick.

Power down the physical Linux computer we are going to virtualize and put the SystemRescueCd in the CD-ROM drive or USB drive.

Turn the computer on and boot to SystemRescueCd Linux.

Main Tasks

Image the Hard Drive Partitions

Plug in your external USB hard drive.

Run the dmesg command to find the device name of the USB hard drive.

dmesg

Look for your hard drive name and description. For example, if you plugged in a Western Digital My Passport drive you should see something similar to this:

usb 2-1: Product: My Passport 070A
usb 2-1: Manufacturer: Western Digital
sd 4:0:0:0: [sdb] 1463775232 512-byte logical blocks: (749 GB/697 GiB)
sd 4:0:0:0: [sdb] Write Protect is off
sd 4:0:0:0: [sdb] Mode Sense: 23 00 10 00
sd 4:0:0:0: [sdb] Assuming drive cache: write through
sdb: sdb1

The key piece of information here is the sdb1 on the last line. This is the device name we will use to mount the USB hard drive.

Create a directory to mount the USB hard drive. For example, a new directory called flash.

mkdir /mnt/flash

Mount the USB hard drive, device sdb1, on the newly created directory.

mount /dev/sdb1 /mnt/flash

Run the partimage program to image the partitions.

partimage

Use the GUI to select the partition to image.

Press Tab and enter the file name for the partition. For example (assuming the partitions are on device sda):

/mnt/flash/sda1.partimage.gz

Press F5 twice to navigate to the next screens and press OK to start the imaging process.

Repeat this process for each partition on the Linux system. Make sure to name the files appropriately.

Note: Partimage will also show the partitions of the USB drive you mounted. Do not image the partitions of your USB disk. Also, do not image any extended or swap partitions.

When you are finished imaging all of the disk partitions, unmount the USB disk drive.

umount /mnt/flash

Shut down SystemRescueCd and restart your Linux system.

reboot

Create an Empty Virtual Machine

Create a new VM in VMware (or VirtualBox).

Configure the VM to have hardware specifications similar to the physical Linux computer: RAM, processor, hard disk. It is important that the hard disk be the same size or larger than the physical machine’s so the partitions fit.

Set the VM to boot from the CD-ROM drive using the SystemRescueCd ISO file.

Boot the empty virtual machine into SystemRescueCd.

Recreate the Hard Drive Partitions

Run the fdisk command to find the hard drive device.

fdisk -l /dev/sda

If it is sda and your drive is around 100 GB, you will see something like this:

Disk /dev/sda: 105.2 GB, 105226698752 bytes

Use fdisk to recreate the disk partitions of the original physical Linux computer. You should have made note of these in the preparation tasks. fdisk is a command-line program for partitioning the drive. (You can also use the GUI GParted program in X Windows if you prefer. Run startx and select GParted from the menu.)

fdisk /dev/sda

Press n to add new partitions.

Press a to toggle the bootable flag on the boot partition (the /boot partition).

Press t to change the type of the swap partition, setting it to 82 (Linux swap).

Press w to write changes to disk.

Press m at any time for a list of options.

Restore the Hard Drive Partitions

Plug in your external USB hard drive and connect it to the virtual machine.

Run the dmesg command to find the device name of the USB hard drive.

dmesg

Look for your hard drive name and description.

Create a directory to mount the USB hard drive. For example, a new directory called flash.

mkdir /mnt/flash

Mount the USB hard drive, for example device sdb1, on the newly created directory.

mount /dev/sdb1 /mnt/flash

Run the partimage program to restore the partitions.

partimage

Use the GUI to select the partition to restore.

Press Tab and enter the file name and location for the image file. For example:

/mnt/flash/sda1.partimage.gz

Press Tab and change the Action to be done to Restore partition from an image file.

Press F5 twice to navigate to the next screens and press OK to start the image restore process. In VMware you will probably have to press Fn+F5 to get the F5 key through to the guest.

Repeat this process for each partition on the Linux system.

When you are finished imaging all of the disk partitions, unmount the USB disk drive.

umount /mnt/flash

Set Up the Boot Loader

The final step is to set up the boot loader and install it into the master boot record.

Mount the root and boot partitions. For example, if sda1 is the boot partition and sda3 is the root partition:

mkdir /mnt/root
mount /dev/sda3 /mnt/root
mount /dev/sda1 /mnt/root/boot

Verify the boot loader’s configuration file. Assuming you are using GRUB:

nano /mnt/root/boot/grub/device.map

Nano is a Linux text editor. You can also use pico or vi.

You want to verify that the device in the configuration file matches what it is in the VM. For example, if it says this:

(hd0) /dev/hda

You may need to change hda to sda. In this example we need to change it.

(hd0) /dev/sda

Exit Nano or whatever text editor you used.

Run grub-install to install GRUB into the MBR.

grub-install --root-directory=/mnt/root /dev/sda

Final Tasks

We’re all done. Reboot from SystemRescueCd, and your virtual machine should boot into the same Linux setup that is on your physical machine.

reboot

This VM is now an exact copy of the physical Linux computer. You have successfully done a P2V (Physical to Virtual) conversion of your Linux system.

Optional Task – Configure X11

Depending on the version of Linux you are using, it may not be able to use the VMware settings to display X Windows properly. In that case, you will need to make a simple change to the XF86Config file.

First, make a backup of the XF86Config file. This assumes it is located in /etc/X11.

cp /etc/X11/XF86Config /etc/X11/XF86Config.backup2

Edit the XF86Config file.

nano /etc/X11/XF86Config

Change the Driver and BoardName settings in the Device section from the VMware settings to a generic Vesa setting.

Section "Device"
    Identifier "Videocard0"
    Driver     "vesa"
    VendorName "Videocard vendor"
    BoardName  "VESA driver (generic)"
EndSection

Save the file and restart. You should be able to get X Windows to start now.

That’s it. It looks like a lot of steps, but it is not that difficult to do. The longest part is imaging and restoring the partitions.

Now that you have a virtual version of your Linux computer, you are able to do unique things like snapshots and work with multiple configurations or languages at the same time. This is really helpful when translating software from one language to another because you can now have both language versions running at the same time on the same desktop.

Sorting in Japanese — An Unsolved Problem

Sunday, February 13th, 2011

Sorting Japanese is not only difficult—it’s an unsolved problem. This seems hard to believe if you are not familiar with the complexities of processing Japanese digitally. But what is trivially easy in English is impossible in Japanese, even with the amount of computer power we have available today.

The problem comes from the complex nature of written Japanese. Contrast it with English, which only has 26 letters: a comes before b; b comes before c; and so on. On the other hand, Japanese not only has thousands of characters, it also has four different kinds of written characters. But this is only the beginning of the difficulty. The unique nature of kanji characters and their associated pronunciations is the language feature that makes Japanese unsortable.

Let’s work our way through the complexities to understand why Japanese cannot be sorted.

A Simple Sort

Let’s do a simple sort of a list of English words. Here I have a list of characters from the video game Street Fighter.

  • Ryu
  • Ken
  • Chun-Li
  • Yun

Let’s put this list through a simple sort function using PHP.

<?php
   $names = array ("Ryu", "Ken", "Chun-Li", "Yun");
   sort ($names);

   foreach ($names as $name) {
      echo "$name<br/>";
   }
?>

Here is the result:

  • Chun-Li
  • Ken
  • Ryu
  • Yun

This is the result we expect—it’s in alphabetical order. A computer can easily sort English in alphabetical order because there are simple rules. C comes before K; K comes before R; and R comes before Y. You should have learned this in the first grade.

Now let’s start looking at the complexities of Japanese, and see why sorting does not work as easily.

Multiple Character Sets

Japanese has four different character sets in the written language. Don’t worry about why there are four different types of characters; just know that there are.

  • Hiragana alphabet — ひらがな
  • Katakana alphabet — カタカナ
  • Kanji characters — 漢字
  • ABC alphabet — abc

Here is where the difficulty comes in: each character set has characters with the same pronunciations as characters in the other sets. On top of that, all four character sets are written together to form what is modern written Japanese. If you only had to deal with one character set at a time (ignoring kanji for the moment, we will get to that later), you could sort Japanese automatically just like English. Hiragana sorts just fine; katakana sorts just fine; and the ABC alphabet sorts just fine. But, in combination, it is not clear how you would sort these.
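
To see that a single character set really does sort fine, here is a minimal sketch using a few arbitrary hiragana words. A plain sort() puts them in あいうえお order, because within hiragana the character codes follow that order:

<?php
   $words = array("さくら", "あさひ", "ねこ", "うみ");
   sort ($words);

   foreach ($words as $word) {
      echo "$word<br/>";
   }
?>

This prints あさひ, うみ, さくら, ねこ, which is the correct order.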

I should note that there are two different alphabetical sorting orders in Japanese: the gojūon (あいうえお) order and the traditional iroha order. For this article I am going to use the a i u e o (あいうえお) sort order.

Sorting Settings

Now let’s look at an example of sorting mixed character sets. Again, using PHP.

<?php
   setlocale(LC_ALL, 'jpn');
   $settings = array ("システム", "画面", "Windows ファイアウォール",
      "インターネット オプション", "キーボード", "メール", "音声認識", "管理ツール",
      "自動更新", "日付と時刻", "タスク", "プログラムの追加と削除", "フォント",
      "電源オプション", "マウス", "地域と言語のオプション", "電話とモデムのオプション",
      "Java", "NVIDIA");
   sort ($settings);

   foreach ($settings as $setting) {
      echo "$setting<br/>";
   }
?>

Here is the result.

  • Java
  • NVIDIA
  • Windows ファイアウォール
  • インターネット オプション
  • キーボード
  • システム
  • タスク
  • フォント
  • プログラムの追加と削除
  • マウス
  • メール
  • 地域と言語のオプション
  • 日付と時刻
  • 画面
  • 管理ツール
  • 自動更新
  • 電源オプション
  • 電話とモデムのオプション
  • 音声認識

Take a look at what happened with this sort. The first three strings start with characters of the alphabet, and were sorted as we expect. The next eight strings are in katakana, and they are sorted correctly according to the Japanese a i u e o sort order. The rest of the strings all start with kanji and are not sorted in any way that makes sense to a human.

So what is going on here? In this case, it seems that PHP is using the character code to determine the sort order. This works fine with alphabets like English, or even the Japanese katakana, because the character codes follow the alphabetical order. But the character codes do not stay in order when mixed with other character sets. In this example you can see ABC and katakana are separated. Kanji are then separated from katakana. There were no hiragana in this list, but they would do the same. Sorting by character code works fine for alphabets when the alphabets are by themselves. But once you mix alphabets together, you cannot get any sensible sorting order that way.
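
You can see this grouping directly by looking at the UTF-8 byte sequences. A small sketch; the three characters are arbitrary examples:

<?php
   // "J" is 0x4a, "イ" (U+30A4) is 0xe382a4, and "画" (U+753B) is 0xe794bb,
   // so a byte-wise comparison puts ABC first, then katakana, then kanji.
   foreach (array("J", "イ", "画") as $ch) {
      echo bin2hex($ch) . "<br/>";
   }
?>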

An observant reader might have noticed what these items in our list are: Control Panel items in Windows XP. It’s clear that PHP’s sort function can’t sort this properly. But what about Windows XP Japanese edition?

Microsoft seems to have the same problem. They do alright with sorting each character set individually. But they don’t seem to be able to integrate the character sets together like a Japanese user would expect. It’s OK, I don’t expect Microsoft to be able to solve such a hard problem.

Sorting Names

Let’s look at another example to show what happens when you have all four character sets sorted together. Here we have two names, both written four different ways—using each character set: ABC alphabet, hiragana, katakana, and kanji.

Ayumi、 あゆみ、アユミ、歩美

Tanaka、たなか、タナカ、田中

It is very possible to have different people with the same name write their name in different character sets. The traditional way of writing the Japanese name of Ayumi would be written in kanji; a modern, stylish way would be to write it in hiragana, and a second generation Japanese-American might write their name in katakana or the alphabet.

Put these names into the same PHP sort function and look what happens.

<?php
   setlocale(LC_ALL, 'jpn');
   $names = array ("Ayumi", "アユミ", "あゆみ", "歩美",
      "Tanaka", "タナカ", "たなか", "田中");
   sort ($names);

   foreach ($names as $name) {
      echo "$name<br/>";
   }
?>

Here is the result:

  • Ayumi
  • Tanaka
  • あゆみ (Ayumi)
  • たなか (Tanaka)
  • アユミ (Ayumi)
  • タナカ (Tanaka)
  • 歩美 (Ayumi)
  • 田中 (Tanaka)

Within each character set Ayumi is sorted before Tanaka, which is correct for the ABC, hiragana, and katakana alphabets. The kanji pair had a 50/50 chance of being right. But as you can see, the different character sets are not integrated together. If these were all names in your phone’s contact list or your Facebook friends list, you would expect all of the Ayumis and Tanakas to be listed together.

The ABC, hiragana, and katakana alphabets can be sorted, although which character set of Ayumi gets sort preference is a whole other issue; once that preference is agreed upon, sorting can be done just as easily as in English.

Kanji — The Real Problem

The real problem with sorting Japanese text is kanji. Kanji aren’t just difficult for students of Japanese to make sense of; they are literally impossible for computers to process with the same intelligence as a human. The reason is the following:

Kanji have multiple pronunciations, determined by the context in which they appear.

This fact keeps students up nights for years, studying to remember how to pronounce kanji correctly. It also makes our sorting problem extremely nontrivial. We sort things in language by their pronunciations, and up until now we were dealing with letters. ABC, hiragana, katakana: these are all letters with a single pronunciation. There is only one place each can go.

Kanji, on the other hand, all have multiple pronunciations; some have over ten! Only from the context in which a kanji appears do you know how to pronounce it. Our simple sorting problem has now turned into a natural language processing problem.

Here is an example:

私は私立大学で勉強しています。

Here the kanji 私 is used in two different contexts. The first usage is 私 (watashi). The second usage is part of the compound word 私立大学 (shiritsu daigaku). Using the Japanese sort order, these words should be sorted like this:

  • 私立大学 (しりつだいがく)
  • 私(わたし)

A second year Japanese student could figure this out. For a computer, this is a very difficult problem.

Here is another, more extreme example.

There are four Japanese women whose names you have to sort: Junko, Atsuko, Kiyoko, and Akiko. This does not seem difficult, until they each show you how they write their names in kanji:

  • 淳子 (Junko)
  • 淳子 (Atsuko)
  • 淳子 (Kiyoko)
  • 淳子 (Akiko)

As you can see, this is rather troublesome. This comes back to kanji having multiple pronunciations. If this was for an address book of your phone contacts for example, you would want Atsuko and Akiko listed with the A names like Ayumi and Akira. But you would not want Junko and Kiyoko listed there.

And this problem is not limited to names. Regular, everyday words also have multiple pronunciations. For example, 故郷 (ふるさと、こきょう), 上手 (じょうず、じょうて、うわて、かみて…) etc.

So how do we deal with this? They have phones and social networking Web sites in Japan with sorted contact lists, so how can we sort these words properly?

The Wrong Way – Using IME Input

First, let’s look at a good but failed attempt by Microsoft to solve this problem. What good would Excel be if you could not sort on columns and rows? Microsoft clearly understands the issue with sorting Japanese—they just didn’t think through the solution thoroughly.

What Microsoft does in Excel is to capture the input the user types to get the kanji character. For example, if you typed Junko to get 淳子, it will save that input string as meta data in the background. When it is time to sort, it sorts on the input pronunciation meta data rather than the kanji that are displayed. You can actually see what the meta data looks like in Excel 2003 if you save as XML.

In the XML you can see the same kanji 淳子 in two different rows, but the input used to get them was different (Atsuko and Junko), so those readings are saved as meta data to assist with sorting later on.

The problem with this approach is that it doesn’t take into account how people actually interact with computers using a Japanese IME. Japanese input works with a dictionary of possible kanji conversions based on what has been typed, but not every word or name is in that dictionary. Sometimes you have to type each kanji individually, or use a totally different pronunciation, to get the kanji you want to show up. This results in the wrong pronunciation being saved as meta data, and sorting will not work as expected.

This system also doesn’t work with text cut and pasted from other sources, or with any sort of CSV or database import, etc. This was a good try by Microsoft, but it just doesn’t work.

The Right Way – Ask the User

A computer simply cannot guess the correct pronunciation of kanji, even if it logs the user’s input, because that input might not even be correct. The easiest way to solve this problem is to just ask the user for the pronunciation! Most software developed in Japan uses this approach.

Let’s look at this approach done correctly on Amazon.com, in their new user registration screen. First, notice the fields in the English version of the screen.

Now look at the Japanese version of this screen.

As you can see, the Japanese version has an extra field. This is for users to enter the pronunciation of their name in katakana. This way, Amazon has the name in kanji and the correct pronunciation to go with it, and they can now sort their user information correctly. This is the approach that most Japanese software takes. It is an extra step, but it solves the problem.
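
To make the approach concrete, here is a minimal sketch of sorting on a user-supplied reading instead of on the kanji itself. The field names are hypothetical, and the readings are stored in kana:

<?php
   $users = array(
      array("name" => "淳子", "yomi" => "じゅんこ"),
      array("name" => "淳子", "yomi" => "あつこ"),
      array("name" => "歩美", "yomi" => "あゆみ"),
   );

   // Sort on the reading; within a single kana alphabet, byte order
   // matches the あいうえお order, as we saw earlier.
   usort($users, function ($a, $b) {
      return strcmp($a["yomi"], $b["yomi"]);
   });

   foreach ($users as $user) {
      echo $user["yomi"] . " " . $user["name"] . "<br/>";
   }
?>

This lists あつこ and あゆみ together at the top, with じゅんこ after them, which is exactly what the kanji alone could not tell us.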

The big takeaway from this is that you cannot just translate software, or even a Web site, and expect it to work. Something as simple as registering a new user has to be completely reworked. In the case of a simple Web site, you will need to redo not only the Web interface, but also the database back end and the code to interface with the database and Web site generation. Localizing a site into Japanese is much more complicated than other languages because of the extra functionality that is required.

While Amazon.com does the interface and programming localization correctly, they do have something on their site that isn’t localized for the Japanese audience: their logo.

In English, the logo goes with their slogan, “Everything from A to Z,” as indicated by the arrow. But in Japan, and in any other country that doesn’t use the Latin alphabet, A and Z aren’t the first and last letters of the alphabet. The A-to-Z device works in English because the name Amazon has an A and a Z in it. But in other countries, people might not have any idea why there is an arrow under the Amazon logo.

Final Thoughts

Sorting in Japanese is hard. Without user input, it is impossible in some contexts to know how to sort some Japanese words. People developing and localizing software need to understand these issues. But regarding the general problem of sorting Japanese when you don’t have user input to give the pronunciation, there may not be a way to automate this until computers can understand language as well as a native Japanese person. For a computer to understand Japanese is far more complex than most other languages. You can see this first hand by using machine translation software and comparing Japanese to something like French.

I think this is an interesting problem, and it goes beyond just sorting. How can you expect a machine translation program to work if it doesn’t even know the pronunciation of a word—something that can be key to understanding what that word is? I can imagine even statistical machine translation getting confused, especially with names.

Japanese is an interesting language, and processing it with computers is even more interesting.