
Double Clicking Japanese Text

Tuesday, April 5th, 2011

Double clicking text in Asian languages, especially Japanese, is something that has been overlooked in the localization of pretty much every operating system and application. If you want to copy a word into the clipboard, it is convenient to double click somewhere over the word and have the system automatically highlight the entire word for you.

Double clicking text in English and other European languages works like you expect it to. Double click on any word and the entire word is highlighted, like this: “DoubleClickMe.”

This works in English because there are spaces on each side of the word. Japanese, on the other hand, has no spaces between words. This makes the problem very difficult. Next, add in the multiple character types that Japanese has (hiragana, katakana, and kanji), and now you have an extremely difficult problem.

Let’s look at some examples and see what happens.

今日は楽しかったです。

Double click on 「今日」 and it highlights the entire word.
Double click on 「は」 and it highlights the 「は」 particle.
So far so good.

Double click on the kanji of 「楽しかった」 and it highlights only 「楽」.
Double click on the okurigana (hiragana) after 楽 and it highlights 「しかったです」.

It is pretty obvious what is going on. Double clicking on Japanese text will highlight an entire string of one kind of character type.
In other words,

  • If you click on a katakana character, it will highlight the entire katakana string.
  • If you click on a hiragana character, it will highlight the entire hiragana string.
  • If you click on a kanji character, it will highlight the entire kanji string.

This is obviously not what we want it to do, but it makes sense why it does this. The OS or application that you are using doesn’t have any intelligence to be able to parse Japanese into individual, complete words. Therefore, by default, double clicking will highlight the longest string of similar character types.
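
The default behavior described above can be sketched in a few lines of Python. This is only my approximation of what the OS appears to do, not any system's actual implementation. The function names `script_of` and `select_run` are made up for illustration, and the character ranges cover only the basic Unicode hiragana, katakana, and kanji blocks.

```python
def script_of(ch):
    """Classify one character by its basic Unicode block (simplified assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"  # punctuation, Latin letters, digits, and so on


def select_run(text, index):
    """Return the run of same-type characters around the clicked index."""
    kind = script_of(text[index])
    start = index
    # Extend left while the neighbor has the same character type.
    while start > 0 and script_of(text[start - 1]) == kind:
        start -= 1
    end = index + 1
    # Extend right while the neighbor has the same character type.
    while end < len(text) and script_of(text[end]) == kind:
        end += 1
    return text[start:end]


sentence = "今日は楽しかったです。"
print(select_run(sentence, 3))  # clicked on 楽
print(select_run(sentence, 4))  # clicked on the し after 楽
```

Clicking on 楽 selects only the kanji, and clicking on the okurigana selects the whole hiragana run, which is the behavior shown in the examples above.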

Let’s look at a couple of interesting examples that really illustrate this behavior.

First, a complete sentence all in hiragana.
わたしはすしがきらいです。

Clicking anywhere on this sentence highlights the entire sentence. There is no way to highlight individual words by double clicking.

Next, a sentence with a lot of random kanji together.
寿司酒刺身河豚鮪日本米国東京横浜は、ランダムな漢字の文字列です。

Clicking anywhere on the initial long string of random kanji words will highlight the entire string. Without any intelligence, the system does not recognize that there are nine different words there, and as a result, highlights the entire string.

Double clicking Japanese text does not work. It will highlight stuff, but it does not highlight anything meaningful most of the time. This behavior is universally wrong across all operating systems (Windows, Mac OS X, Linux, etc.) and applications.

Is This a Solvable Problem?

The answer is yes. What we need is an OS or application level intelligence about Japanese. One way to achieve this intelligence is to match text against a Japanese dictionary. When you double click Japanese text, the OS should match against the longest hit in the dictionary.

For example:
寿司食べ放題

If you click on 「寿」, it will hit 寿 (ことぶき) as a valid word in the dictionary, but it should continue on and recognize 寿司 (すし) as the longer and correct word, and highlight 「寿司」. It should not highlight the 「食」 character. If you click on 「食」, it should continue on in the dictionary search to highlight 「食べ放題」. The boundary between kanji and hiragana should not act as a word boundary like it currently does.

For example:
わたしはすしが好きです。

If you click on 「好」, it will hit 好 (こう) as a valid word in the dictionary, but it should not stop there. It should also hit 好き (すき) as the longer and correct word to parse in this case.
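
Here is a minimal sketch of the longest-match idea in Python. The word set `TOY_WORDS` and the function `select_longest_word` are hypothetical and only hold the words from the two examples; a real system would use a full Japanese dictionary. I am also assuming that the match may start before the clicked character, so that clicking in the middle of a word still selects the whole word.

```python
# Hypothetical toy dictionary: just enough entries for the examples above.
TOY_WORDS = {"寿", "寿司", "食", "食べ放題", "放題", "好", "好き"}


def select_longest_word(text, index, words=TOY_WORDS):
    """Return the longest dictionary word that covers the clicked index."""
    max_len = max((len(w) for w in words), default=1)
    best_start, best_end = index, index + 1  # fall back to the single character
    # Try every start position that could still reach the clicked character.
    for start in range(max(0, index - max_len + 1), index + 1):
        # Try every end position past the click, up to the longest word length.
        for end in range(index + 1, min(len(text), start + max_len) + 1):
            if text[start:end] in words and end - start > best_end - best_start:
                best_start, best_end = start, end
    return text[best_start:best_end]


print(select_longest_word("寿司食べ放題", 0))            # clicked on 寿
print(select_longest_word("寿司食べ放題", 2))            # clicked on 食
print(select_longest_word("わたしはすしが好きです。", 7))  # clicked on 好
```

With this toy dictionary, the three clicks select 寿司, 食べ放題, and 好き. The switch from kanji to hiragana plays no role in where the selection stops.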

This should give us much better results when double clicking, but a dictionary comparison alone is not enough to give us consistently correct results. The problem is that a dictionary generally lists only the root (dictionary) forms of words, not every conjugation or inflection. Therefore, we also need intelligence to understand parts of speech and conjugations and how they relate to the root words in the dictionary.

For example, the previous sentence:
今日は楽しかったです。

If you click on 「楽」, it will hit 楽 (らく) as a valid word in the dictionary, but it should not stop there. It should also hit 楽しかった even though it is not in the dictionary, because it is the past tense of the word 楽しい, which is in the dictionary. 「楽しかった」 should be completely highlighted, and it should not continue on and highlight です because that is not part of the word 楽しい.
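
One way to sketch this is to add a small table of inflection rules on top of the dictionary lookup. Everything here is an assumption for illustration: the rule table `I_ADJECTIVE_RULES` covers only three i-adjective endings, the dictionary `ROOT_WORDS` holds only a few root words, and the search only extends forward from the clicked character. A real parser would need far more rules and a real dictionary.

```python
# Hypothetical rule table: (inflected ending, dictionary-form ending).
# Only a few i-adjective endings are listed here.
I_ADJECTIVE_RULES = [("かった", "い"), ("くない", "い"), ("くて", "い")]

# Hypothetical toy dictionary containing root forms only.
ROOT_WORDS = {"今日", "楽", "楽しい"}


def dictionary_forms(candidate):
    """Yield the candidate itself plus any root forms the rules can derive."""
    yield candidate
    for inflected, root in I_ADJECTIVE_RULES:
        if candidate.endswith(inflected):
            yield candidate[: -len(inflected)] + root


def select_inflected_word(text, index, words=ROOT_WORDS, max_len=8):
    """Return the longest span from index whose root form is in the dictionary."""
    # Try the longest candidate first and shrink one character at a time.
    for end in range(min(len(text), index + max_len), index, -1):
        candidate = text[index:end]
        if any(form in words for form in dictionary_forms(candidate)):
            return candidate
    return text[index]  # nothing matched: fall back to the single character


print(select_inflected_word("今日は楽しかったです。", 3))  # clicked on 楽
```

Clicking on 楽 selects 楽しかった, because the かった ending maps back to 楽しい, which is in the dictionary. The selection stops before です because no rule or dictionary entry accounts for it.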

Conclusion

There are no operating systems that properly parse and highlight the correct words when double clicking. This is definitely not an easy problem to solve, but it is possible and should be done on Japanese systems.

There are two software applications that I know of that can parse Japanese properly most of the time: Rikaichan and NJStar. Rikaichan is an add-on to the Firefox browser, and NJStar is a Japanese word processing application. They both have a mouse-over hover function that parses complete Japanese words. You can also double click and get the expected result. Both applications use a Japanese dictionary back end and have enough Japanese language intelligence to parse conjugations and inflections of words to get the expected match most of the time.

There are times when Rikaichan does not parse the expected result. I will cover those exceptions in a future article about parsing Japanese. However, Rikaichan is right about 99% of the time, which is great considering that your OS is usually wrong 90-95% of the time when it comes to double clicking Japanese text.