Archive for the ‘Localization’ Category

A Star is Born – Introducing Honyaku Star Japanese/English Dictionary

Thursday, October 4th, 2012

Today I’m launching Honyaku Star, a new online Japanese/English dictionary.

The goal of Honyaku Star is to be the world’s most comprehensive free, online Japanese/English dictionary and corpus. Honyaku Star is built on top of numerous excellent community dictionaries, and adds to them the Honyaku Star dictionary, which tries to fill in everything that isn’t in those general dictionaries, with the goal of being the best and only dictionary you need.

Honyaku Star is more than a dictionary; it is also a Japanese/English bilingual corpus, in other words, a database of parallel texts that provides context, usage, and examples for the words and phrases you search for. Many words have several valid translations, and seeing them used in different contexts can help you understand the different meanings and pick the appropriate usage. When you search in Honyaku Star, you get dictionary results and example sentences together.

I built Honyaku Star because I use online Japanese/English dictionaries every day, and none of them satisfied me. There are certain things I want, and don’t want out of an online Japanese/English dictionary. So, I built Honyaku Star based on the principles I think are important.

  • Dictionaries and language resources should be free and easily accessible.
  • The one and only dictionary advanced students and translators will need.
  • Lots of relevant results for a search query. 1,000 results if possible.
  • Provide in-context usage and example sentences.
  • Clean, simple user interface.
  • No pagination in the UI.
  • Searches should be super fast. Instantaneous!
  • No visual distractions. No ads. No random Web content. No useless information like character encoding codes.
  • No advanced search! It should be smart and bring back results in an intelligent way.
  • The primary goal of a searchable dictionary is to be a useful language resource; it should not be a means to draw you in and sell you language services or books.

I think I’ve kept with my design principles on this initial version, and it’s only going to get better with time.

The technology behind Honyaku Star is Linux, PHP, Perl, MySQL, and the awesome full-text index Mroonga. I’ll post more about some of the technical challenges in future posts.

Honestly, I made Honyaku Star for myself, to be the ideal dictionary that I’d want to use every day. But my hope is others will find it useful. All user feedback is welcome and appreciated. And if you use and like Honyaku Star, consider contributing translations to it.

Start using Honyaku Star today at

Localizing a CakePHP Application

Thursday, November 10th, 2011

If you build a PHP application using the CakePHP framework, it is easy to localize the application into multiple languages, provided you have the proper translations for those languages. If you want to internationalize your application to a global market, it is important to localize it for each language and region you want to target.

Fortunately, CakePHP and PHP itself provide us with some easy mechanisms to provide translations and localize our code without much effort. You do not have to make copies of HTML or PHP files. Everything will be done with the PHP files you already have, and the translated text strings will dynamically be inserted at render time, ready for the user in their localized language and format.

In this example, we will localize a CakePHP application that was written in English into Japanese. Let’s assume we have a menu for e-mail functions that we want to localize. This example assumes CakePHP version 2.0 (But earlier and perhaps later versions of CakePHP will work in a similar manner).

Wrapping Translatable Text in __() Functions

The first step in the localization process is to identify the text strings that will need to be translated, and replace them with CakePHP’s localized string __() function. Supposing our menu looked like the following:

   <li>Send</li>
   <li>Reply</li>
   <li>Forward</li>
   <li>Delete</li>


We would wrap each text string inside the __() function as follows:

   <li><?php echo __('Send') ?></li>
   <li><?php echo __('Reply') ?></li>
   <li><?php echo __('Forward') ?></li>
   <li><?php echo __('Delete') ?></li>

The __() function identifies these strings as translatable text that will differ by language locale and uses the text within the __() function as the message ID. If we define the translations for a certain language, those translations will appear in place of these functions. If we do not define the translations for that language, the text within the __() function will display instead by default.
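
The fallback behavior described above can be sketched in plain PHP. This is not CakePHP's implementation; `translate()` and the `$catalog` array are hypothetical stand-ins for the framework's lookup, used only to show the idea of "use the translation if there is one, otherwise show the message ID".

```php
<?php
// Hypothetical stand-in for a loaded translation catalog (msgid => msgstr).
// 'Delete' is deliberately missing to show the fallback.
$catalog = [
    'Send'    => '送信',
    'Reply'   => '返信',
    'Forward' => '転送',
];

// translate() is an illustrative name, not a CakePHP function.
function translate(string $msgid, array $catalog): string
{
    // Treat a missing or empty msgstr as "not translated".
    if (isset($catalog[$msgid]) && $catalog[$msgid] !== '') {
        return $catalog[$msgid];
    }
    // Fall back to the message ID itself.
    return $msgid;
}

echo translate('Send', $catalog), "\n";   // prints 送信
echo translate('Delete', $catalog), "\n"; // prints Delete
```

This is why it pays to write sensible source-language text inside __(): whatever you put there is what users see when a translation is missing.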

Creating the Localized PO Files

The next step is to create the PO files which will contain the translations for each language to be dynamically inserted in each of the __() functions. CakePHP has two ways you can do this: automatically using the console shell; or manually.

Using the I18N Shell

CakePHP has some console shell programs that you can run on the command line, including one to generate the PO file to use as the original language source file for translations. In our case it will be a file with all the English text strings.

To run the i18n shell command, type the following on the Linux command line in your CakePHP application directory:

./Console/cake i18n extract

Then follow the onscreen menu.

The shell command will examine all of your application files for instances of the __() function and generate a PO file for the original source language that you can use to create the PO files for each of the translations you are going to use.

Creating the PO Files Manually

If you want to do this manually—for example you don’t have many translatable text strings like in our example—you can create the PO files by hand in a text editor.

First we will create the original source language English version here:


The default.po file will have this format:

msgid "ID"
msgstr "STRING"

Where msgid is the ID within the __() function; msgstr is the localized translation that should appear as output.
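
To make the msgid/msgstr pairing concrete, here is a rough sketch of a reader for this simple format. It assumes single-line entries like the ones in this article; `parseSimplePo()` is a hypothetical helper, and real PO files also allow comments, escapes, multi-line strings, and plural forms that this sketch ignores.

```php
<?php
// Minimal sketch: reads only single-line msgid/msgstr pairs.
// Comments, escapes, multi-line strings and plurals are NOT handled.
function parseSimplePo(string $po): array
{
    $catalog = [];
    $currentId = null;

    foreach (preg_split('/\R/u', $po) as $line) {
        $line = trim($line);
        if (preg_match('/^msgid\s+"(.*)"$/u', $line, $m)) {
            // Remember the ID until its msgstr line arrives.
            $currentId = $m[1];
        } elseif ($currentId !== null && preg_match('/^msgstr\s+"(.*)"$/u', $line, $m)) {
            $catalog[$currentId] = $m[1];
            $currentId = null;
        }
    }
    return $catalog;
}

$po = <<<'PO'
msgid "Send"
msgstr "送信"

msgid "Reply"
msgstr "返信"
PO;

// Prints an array mapping each message ID to its translation.
print_r(parseSimplePo($po));
```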

Our full English source PO file will look like this:

msgid "Send"
msgstr "Send"

msgid "Reply"
msgstr "Reply"

msgid "Forward"
msgstr "Forward"

msgid "Delete"
msgstr "Delete"

To create the Japanese localized version, we copy the English PO file to the Japanese directory, and then replace the English strings in the msgstr field with the Japanese translations. (If you have a large application being localized into dozens of languages, it is at this point that you send the PO files to a language service provider to translate the localized strings.)

Our localized PO file with Japanese translations will go here:


Our final localized Japanese PO file will look like this.

msgid "Send"
msgstr "送信"

msgid "Reply"
msgstr "返信"

msgid "Forward"
msgstr "転送"

msgid "Delete"
msgstr "削除"

That’s it. The translations for English and Japanese will display appropriately for the proper locale. If you want to add translations for other languages, you do the same process and put the new PO file in the directory that corresponds with that language code. CakePHP uses the ISO 639-2 standard for naming locales. Follow that standard for naming your localized directories. Make sure you save these files as UTF-8.

Detecting and Changing Languages and Locales

Having the translations ready is nice, but you still have to detect and change to the proper language in your PHP code to get the translations to appear. Detecting the user’s language is tricky. You could do it in JavaScript or try to use something like the following PECL function:


In either case, you have to trust the user agent to report back the right language.
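
If you would rather not depend on an extension, the browser's preference can also be read by hand from the Accept-Language request header. The following is a sketch under assumptions: `pickLanguage()` and the `$supported` map are hypothetical names, and the matching is deliberately naive (primary language tag only, no wildcard handling).

```php
<?php
// Return the best supported language for an Accept-Language header value,
// or $default if nothing matches. Illustrative helper, not a CakePHP API.
function pickLanguage(string $header, array $supported, string $default): string
{
    $prefs = [];
    foreach (explode(',', $header) as $i => $part) {
        $pieces = explode(';', trim($part));
        $tag = strtolower(trim($pieces[0]));
        if ($tag === '') {
            continue;
        }
        // A missing q parameter means full preference (1.0).
        $q = 1.0;
        if (isset($pieces[1]) && preg_match('/^\s*q=([0-9.]+)\s*$/', $pieces[1], $m)) {
            $q = (float) $m[1];
        }
        $prefs[] = [$tag, $q, $i];
    }

    // Highest q first; ties keep the order the browser sent.
    usort($prefs, function ($a, $b) {
        if ($a[1] === $b[1]) {
            return $a[2] <=> $b[2];
        }
        return $b[1] <=> $a[1];
    });

    foreach ($prefs as [$tag, $q]) {
        if ($q <= 0) {
            continue; // q=0 means "not acceptable"
        }
        $primary = explode('-', $tag)[0]; // 'ja-JP' -> 'ja'
        if (isset($supported[$primary])) {
            return $supported[$primary];
        }
    }
    return $default;
}

// Map two-letter browser tags to the three-letter codes used for the PO directories.
$supported = ['en' => 'eng', 'ja' => 'jpn'];
$header = $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '';
echo pickLanguage($header, $supported, 'eng'), "\n";
```

For example, a header of `ja-JP,ja;q=0.9,en;q=0.8` would select `jpn` with the map above.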

Another option is to simply have a button or menu for the user to select their language. Flag icons usually work well for this. Then you can set the language manually. In CakePHP it is easy to do with the CakeSession class:

CakeSession::write('Config.language', 'jpn');

Finally, you also have to set the locale in PHP. Since you already know this from however you determined the language above, you can use PHP’s setlocale function to do this. This is important for localization of date, time, money, and numeric separator formats among others.

setlocale(LC_ALL, 'ja_JP.utf8');
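
Locale names vary between systems, and setlocale() returns false when the requested locale is not installed, so it is worth checking the result. Here is a small sketch; `localeCandidates()` and its mapping are assumptions for illustration, and the names should be adjusted to whatever `locale -a` reports on your server.

```php
<?php
// Hypothetical mapping from the language codes used above to candidate
// system locale names. Several spellings are listed because systems differ.
function localeCandidates(string $language): array
{
    $map = [
        'jpn' => ['ja_JP.UTF-8', 'ja_JP.utf8', 'ja_JP'],
        'eng' => ['en_US.UTF-8', 'en_US.utf8', 'en_US'],
    ];
    return $map[$language] ?? ['C'];
}

// setlocale() accepts an array of candidates and returns the name of the
// first one that could be applied, or false if none is available.
$applied = setlocale(LC_ALL, localeCandidates('jpn'));

if ($applied === false) {
    echo "Japanese locale not installed; keeping the current locale\n";
} else {
    echo "Locale set to {$applied}\n";
}
```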

That’s all there is to localizing a CakePHP application with proper translations and other locale-specific customizations.

Special Concerns for Translating Japanese Using Translation Memory

Thursday, October 6th, 2011

Translation memory software, such as SDL Trados, greatly increases the speed and efficiency of a translator. However, there are special concerns that must be taken into account when translating with a translation memory where Japanese is the source language. Japanese has some linguistic characteristics that are significantly different from English, and when using a Japanese to English translation memory, you can run into trouble if you are not careful.

The biggest benefit a translation memory can bring you is a 100% match, which eliminates any translation work for that sentence. Best practices say you should always proofread your translations, even if they come from a 100% TM match, although this is hardly ever done. With Japanese, however, you must check your 100% matches because the translations may not be accurate, for reasons we will discuss.


Japanese does not have different singular and plural forms of nouns the same way English does. There are specific instances where a plural-like form is used, but these are the exception rather than the norm. Let’s look at a simple example:

ねじを取り外す。


You could translate this sentence two different ways:

  • Remove the screw.
  • Remove the screws.

Which is correct? Well, that depends on how many screws there are. In Japanese, this one sentence covers both instances. Suppose your translation memory had only this translation pair in the database:

  • JA: ねじを取り外す。
  • EN: Remove the screw.

If the sentence you are currently translating matches the Japanese, but in this present context there are multiple screws, the matching 100% translation is not correct.

This shows why context is important, even more so in Japanese. And if you use SDL Trados or some other CAT tool that only provides you with an XLIFF file, you may not have the surrounding images and context to know whether there is one screw or many.

How can we remedy this for the next person that uses our translation memory? We can definitely save a new translation for this sentence, and our TM now looks like this.

  • JA: ねじを取り外す。
  • EN: Remove the screw.
  • EN: Remove the screws.

This is fine for the translator, who can cycle through the multiple translations and select the best one, assuming they know the context well enough to pick the right one. On the other hand, this is not ideal for the person paying for the translation. Generally only 100% matches are done for free or at a greatly reduced price. When duplicate translations exist for a single source segment, SDL Trados and other software will flag this with some sort of penalty so it will be less than a 100% match, often a 99% match, which will cost more to translate.
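
The effect of duplicate translations on the match score can be illustrated with a toy lookup. This is only a sketch: the `$tm` array, `lookupExact()`, and the one-point penalty are assumptions chosen to mirror the 99% figure mentioned above, and actual CAT tools score matches in their own ways.

```php
<?php
// Toy translation memory: source segment => list of stored translations.
$tm = [
    'ねじを取り外す。' => ['Remove the screw.', 'Remove the screws.'],
    '電源を切る。' => ['Turn off the power.'],
];

// Illustrative exact-match lookup. A segment with more than one stored
// translation is scored below 100 so that it gets reviewed (and billed).
function lookupExact(array $tm, string $source, int $multiplePenalty = 1): ?array
{
    if (!isset($tm[$source])) {
        return null; // no exact match at all
    }
    $targets = $tm[$source];
    $score = count($targets) > 1 ? 100 - $multiplePenalty : 100;
    return ['score' => $score, 'targets' => $targets];
}

$hit = lookupExact($tm, 'ねじを取り外す。');
echo $hit['score'], "%: ", implode(' / ', $hit['targets']), "\n";
// prints 99%: Remove the screw. / Remove the screws.
```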

The best way to deal with this, and the hardest to implement, is for the original Japanese language authors to write with context, knowing that their documents will be the source language for translation. Ideally, there should be multiple versions of the Japanese sentence. For example:

  • JA: ねじ(1本)を取り外す。
  • EN: Remove the screw.
  • JA: ねじ(2本)を取り外す。
  • EN: Remove the two screws.
  • JA: ねじ(3本)を取り外す。
  • EN: Remove the three screws.

If the source Japanese text is written with specific contextual information, this solves the problem and there will not be any ambiguous 100% hits in the translation memory. Unfortunately, original source texts are hardly ever written with translation in mind.

Capital and Lowercase Letters

Japanese similarly does not have an equivalent of capital and lowercase letters. A hiragana is a hiragana and a kanji character is a kanji character. In English, usually only the first word of a sentence is capitalized. This is called sentence capitalization. However, titles, headings, etc. have all the major words capitalized. This is called heading capitalization.

Japanese will have exactly the same sentence whether it is a title/heading or a normal sentence in the text body. English will have two variations: one with heading capitalization and one with sentence capitalization. Similar to the ambiguity with plurals, we have ambiguity with capitalization. If we only have the heading capitalization style sentence in the translation memory, that will hit a 100% match when the same Japanese sentence appears in the text body, but the corresponding English 100% match will have the wrong capitalization.

Unlike the pluralization problem, there is no clear fix to avoid the capitalization problem. There is no simple and obvious way we could rewrite the original Japanese text to have multiple variations for heading and sentence style contexts. In these instances, it is important to verify all 100% matches in the translation memory for the proper context.

Sentences with No Subject or Object

This is something uniquely Japanese: sentences with no subject. This is completely normal in Japanese, and absolutely unheard of in English. Sentences can also have transitive verbs with no object whatsoever. There is nothing wrong with sentences without subjects or objects in Japanese. The problem, however, is when translating these sentences. It is difficult without proper context. Now, consider translating with a translation memory, and you can begin to understand the complexity of the situation.

Consider how you would translate this Japanese sentence:


This sentence has no subject and no object. In context it is probably clear what the meaning is, but by itself it is all sorts of vague. Let’s imagine two completely different, but totally reasonable translations for this sentence:

  • When it’s done, remove it.
  • When you are finished, take apart the pieces.

Both of these are reasonable translations in two completely different contexts. However if the first English translation is registered in the translation memory and it came up as a 100% match, it would be totally wrong if the context were the second sentence.

Context is everything when translating ambiguous Japanese sentences. But a translation memory does not preserve that context. Even if you know what came before and what comes afterwards, that still may not be enough to know the full context of the original Japanese meaning. Even though you are getting a 100% match in the translation memory, instead of being just a little wrong such as singular/plural or capitalization mistakes, it may be completely wrong in terms of meaning!

Same Words Written Differently

In English we have many words and expressions that have the same meaning, and therefore, we can say the same thing many different ways. But Japanese takes this a step further: you can write the same word many different ways!

For example, the word screw could be written: ねじ、ネジ、ﾈｼﾞ、螺子. That’s four different ways to write the same word.

Another example, the word install could be written: 取り付ける、取付ける、取りつける、とりつける. Again, that is four different ways to write the same word, and we didn’t even consider other forms such as です・ます調 or 敬語.

Now, take these two words and construct the same sentence, and look at the number of possibilities you have to say the same exact thing with the same exact words, only written differently.

  • ねじを取り付ける。
  • ねじを取付ける。
  • ねじを取りつける。
  • ねじをとりつける。
  • ネジを取り付ける。
  • ネジを取付ける。
  • ネジを取りつける。
  • ネジをとりつける。
  • ﾈｼﾞを取り付ける。
  • ﾈｼﾞを取付ける。
  • ﾈｼﾞを取りつける。
  • ﾈｼﾞをとりつける。
  • 螺子を取り付ける。
  • 螺子を取付ける。
  • 螺子を取りつける。
  • 螺子をとりつける。

That is 16 different possibilities! Now imagine your translation memory only has one of these variations registered in the database. When you come across the exact same sentence, but only written differently, you will not get a 100% match, even though you have basically the exact same sentence in one form or another right in front of you. And this sentence is so short, you might not even get any match at all if the hit percentage is set high.
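
The arithmetic is easy to reproduce. The sketch below builds every combination and checks it against a toy TM that holds only one spelling. It assumes the four spellings of screw are hiragana, katakana, half-width katakana, and kanji. The `normalizeSpelling()` table at the end is a hypothetical mitigation for illustration, not a feature of any particular CAT tool.

```php
<?php
// Four spellings each of "screw" and "install".
$screw   = ['ねじ', 'ネジ', 'ﾈｼﾞ', '螺子'];
$install = ['取り付ける', '取付ける', '取りつける', 'とりつける'];

// Build every combination of the same sentence.
$sentences = [];
foreach ($screw as $s) {
    foreach ($install as $i) {
        $sentences[] = $s . 'を' . $i . '。';
    }
}

// A toy TM that holds only one of the spellings.
$tm = ['ねじを取り付ける。' => 'Install the screw.'];

$exactHits = 0;
foreach ($sentences as $sentence) {
    if (isset($tm[$sentence])) {
        $exactHits++;
    }
}
echo count($sentences), " variants, {$exactHits} exact hit(s)\n";
// prints 16 variants, 1 exact hit(s)

// Hypothetical normalization table mapping each variant to one preferred form.
function normalizeSpelling(string $text): string
{
    return strtr($text, [
        'ネジ' => 'ねじ',
        'ﾈｼﾞ' => 'ねじ',
        '螺子' => 'ねじ',
        '取付ける' => '取り付ける',
        '取りつける' => '取り付ける',
        'とりつける' => '取り付ける',
    ]);
}

$normalizedHits = 0;
foreach ($sentences as $sentence) {
    if (isset($tm[normalizeSpelling($sentence)])) {
        $normalizedHits++;
    }
}
echo "{$normalizedHits} hit(s) after normalizing\n"; // prints 16 hit(s) after normalizing
```

A table like this only works for spellings you already know about, which is why consistency in the source text matters so much.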

Limiting author variation and enforcing style guide conformance are very important in the original source language to prevent these kinds of problems. This is a big issue in itself that I’ll take up in another article.

Japanese is very different from English, and when translating, you have to take into greater account the textual context and other issues. And this becomes even more so when dealing with a translation memory containing Japanese as a source language.

Translation memory software such as SDL Trados is very useful, and can be used to great benefit even in Japanese. However, you must be aware of these kinds of issues and double check all of your translations, especially your 100% matches.

Japanese Related Jobs

Monday, September 5th, 2011

If you speak Japanese, or are learning Japanese in college and want to find a job when you graduate where you can use your Japanese language skills, there are lots of options out there. Here are some industries and strategies for a Japanese-related career.

Local Japanese Companies

Assuming you don’t live in Japan, you can try to find a Japanese company that does business where you live. Many large Japanese companies are multinational and have offices all around the globe. For example, the top Japanese auto manufacturers: Toyota, Honda, Nissan, Mitsubishi, etc; High tech companies: Fujitsu, Toshiba, Tokyo Electron, Sony, Hitachi, Canon, NEC, Sharp, Sanyo, Fujifilm; Airlines: JAL, ANA; Video game companies: Square-Enix, Capcom, Nintendo; and the list goes on.

Japanese companies outside of Japan will always need Japanese speakers for everything from translation to supply chain operations to human resources for expats. Some positions could make full-time use of your Japanese skill, while for others, your main duty is something else, but merely having Japanese ability will be a huge plus in helping the work go more smoothly.

You don’t have to move to Japan to get a job where you use Japanese—you might have a Japanese company in your town that needs Japanese language skills.

Local Companies Doing Business in Japan

Non-Japanese companies are probably the most overlooked sources of Japanese-related jobs for Japanese speakers. Most large companies and multinationals do business abroad. Some have direct operations in other countries, and some work through more indirect methods. In either case, there is often a need for people with language ability for those markets.

If a company has any sort of documentation, manuals, Web sites, etc., they have a need for product localization. Some may do it all in-house, while others contract it out to language service providers. For larger companies that do a lot of business in Japan, they may have on-staff language resources for product localization, testing, translation, etc. And even if they contract the translation work out to a vendor, they will need a localization manager to handle this work. Being bilingual is almost always a requirement for handling the translation vendor relations. Even if Japanese isn’t the most used language, it still helps that you have experience with a second language in order to be familiar with some of the issues that come up with translation and localization.

In addition to translation and localization related jobs, there are many opportunities on the business side for a Japanese speaker: supply chain management, import and export, making business arrangements, product marketing, etc. For any role that is needed to make business operations successful, having language ability for the target market can only improve your ability to handle the business matters even more smoothly.

Working in Japan

If you already live in Japan, then you have an advantage if your goal is to find a Japanese-related job. For everybody else, this will be a little more difficult. You can always apply directly to companies you are interested in, and there are also Web sites and organizations that can help.

The JET program is well known for bringing English teachers to Japan, but they also have a lesser known program called CIR (Coordinator for International Relations). Unlike a JET English teacher, a JET CIR works in a local Japanese city office or something similar and works on international projects and organizes cultural activities. Also unlike the JET English teacher position, the CIR position requires Japanese language ability because you will be working among Japanese coworkers.

If the JET CIR is not what you are looking for, you can always try to find a job from abroad through job posting sites. One site in particular that specializes in jobs in Japan is DaiJob (Work in Japan). DaiJob has a full range of listings. What I find very useful is that each listing states what level of Japanese proficiency is required. For example, some jobs might only require a casual conversational level, whereas others may require level 2 or level 1 proficiency in the JLPT. This can really help you gauge what jobs are appropriate for your language ability level. There is also the business-focused Japanese certification test, the BJT (Business Japanese Proficiency Test).

Language Service Providers

Language service providers, or LSPs, are companies that provide translation and other linguistic services. These are big, often multinational companies that work with clients to provide translation, localization, and even product marketing services to take their products into new markets. Some companies, like SimulTrans for example, concentrate on the translation, localization, and globalization of your products/documentation. Other LSPs, like Sajan and SDL, offer those services and much more. They have their own specialized software and managed services they offer in addition to the traditional translation services.

You can definitely work for an LSP as a translator, but that is not the limit of what they offer. Translation also requires proofreaders, reviews by subject matter experts, graphic designers and desktop publishing experts, audio and visual engineers, programmers, and quality assurance people. Additionally, project managers are there in every step to handle the business aspects as well as the translation resources. I imagine it is a requirement at most LSPs to be bilingual to even be a project manager. You can work in the translation industry and put your Japanese skill to use even if you aren’t doing translation work directly. A good project manager makes a big difference in the quality of the translations.


If you can speak, read and write Japanese proficiently as well as another language, then you can work as a translator. Translation jobs vary greatly by subject matter, location, job arrangement, etc. In other words, there are many ways to get into and go about being a translator.

Language Service Providers

One common option is to work for one of the previously mentioned language service providers. Most LSPs require you to translate into your native language. So for example, if your native language is English and you also speak Japanese, you would translate Japanese texts into English.

While it is probably possible to work at an actual LSP office as a translator, you would most likely work from home. You can set your own hours as long as you meet deadlines and commitments. Contracting with an LSP is similar to freelance translation work, except that the LSP takes care of all the business arrangements and provides you with the work for projects. As a translator, you can focus more on the translation task without worrying about trying to find, line up, and manage clients. The downside to this is you may have to accept their word rate and use whatever translation tools they require or provide. However, the upside is they handle the business arrangements and are more likely to be able to continually supply you with steady work.

LSPs generally contract with translators in any country where there is need. If you live in the U.S.A., Japan, China, Taiwan, Korea, or most European countries, it should be easy to find Japanese translation work. While general translation work is always in demand, specialized translation is also very common. As a translator, it is helpful to have experience or expertise in a certain field. Common areas that might require experience or expertise to get the translation job are legal, IT/computers, patents, medical, manufacturing, marketing/advertising, and so on. If you have a background or past experience in a certain field, that can be the advantage you have over other translators in becoming an LSP’s go-to person for those types of translations.

Translation is a big part of what LSPs do, but it isn’t the only job you can do for them as a contractor working at home. Every translation will need reviewers and proofreaders. Proofreaders may just check translations for language, and subject matter experts may check the translation for technical accuracy. They also need people with language skills who can work with graphics and page layout software.

To work for an LSP, you will most likely need to be familiar with how translation memory software works. The most used software is SDL Trados, although it is certainly not the only translation memory software out there. Learning translation memory software, multilingual glossary software, and learning what other dictionaries and translation tools are out there is definitely worth doing.

Freelance Translation

If you prefer even more freedom than is offered from an LSP, you can go into business for yourself as a freelance translator. You find your own clients, set your own prices, and determine how you will get the work done. For some people, this is a more attractive option than directly contracting with an LSP, especially if you are more business minded. For others, the freedom you have is the key point. For instance, many LSPs will require that you use specific translation software, such as SDL Trados. If you don’t like Trados, find it prohibitively expensive, or just prefer other software, then freelancing usually affords you that freedom.

The downside to freelancing of course is that you are on your own to find work. For any job posted to a job site, you may be competing with any number of other people for the job. There is also an element of risk when always working with different clients. Method of payment, details of work, required software, expectations, and attitudes vary greatly from client to client. However, if you find a good client and do good work, you have the potential to be their go-to person and receive steady work. Also, when you work directly for the client, you have a better chance of getting direct answers to questions that may come up during a job, whereas with an LSP you may have to ask questions indirectly through the LSP’s project manager.

To get started in the world of freelance translation, I would recommend checking out This is an excellent Web site for translators. They have job postings and client ratings, and also an active BBS for a wide range of translation-related discussions.

To get a leg up over other freelance translators, you may also consider improving your qualifications. A language-related certification could be the thing that a client uses to pick you over another translator. For Japanese, the gold standard in language skills is the JLPT (Japanese Language Proficiency Test). For practical translation skills, there are translation certifications as well. One such qualification is from the American Translators Association.

Translator at a Company

Large companies that do business in Japan, especially if they have local documentation, product localization, and Web operations teams, may have a need for local language experts. For example, a company like Apple Inc., which has the majority of its operations in the United States, probably has Japanese speakers at the local U.S. operations to manage the Japanese localization of products, Web sites, and online help bulletin boards. There are probably many companies in the same situation that sell globally, but operate locally.


Interpreters are needed whenever people with different language backgrounds need to communicate. There are two types of interpretation: simultaneous and consecutive. Simultaneous interpretation is when the interpreter is speaking at the same time as the speaker. The people listening to the interpreter usually have a headset on. This is common for the United Nations, and big, general assembly type meetings or conferences. Simultaneous interpreters need to be exceptionally proficient in both languages and be trained to translate in real time. Simultaneous interpreters usually command a high hourly rate and are used only at high profile events.

Consecutive interpretation is when one party speaks, and then you translate what they said. And then the other party speaks, and you translate what they said. This type of interpretation is often needed for business meetings, training conferences, business telephone calls, court rooms, hospitals, and any place where people of different language backgrounds will be working together. A business meeting may last all day and provide a nice hourly rate. In contrast, an insurance company may only need to communicate with someone involved in a car accident over the phone for 10 minutes. And then, legal matters that end up going to court could potentially last for weeks. This type of interpretation is less demanding than simultaneous interpretation, but you still may need expertise in a certain field.

You can definitely freelance as an interpreter. In fact, some people supplement their normal translation jobs with interpretation jobs. However, the need for interpretation can sometimes come up without notice, and many businesses may need to keep on file a language service provider that can accommodate these types of requests. It is very common to work for an LSP and be dispatched to local businesses or take phone calls as the need for interpretation comes up.


If you fly to or from Japan, half of the passengers will likely be Japanese. In turn, the airlines will need bilingual Japanese speakers to be flight attendants and gate personnel and other roles. American carriers like American, Continental, Delta, etc., and overseas carriers like JAL and ANA all fly internationally to and from Japan, and need people who speak Japanese. And from my experience flying, a high level of fluency isn’t required. You just need enough conversational skills to serve meals, take requests, and see that the flight goes smoothly.

Japanese Jobs Summary

Japanese related jobs are out there, and often available in places you may not have even considered looking. While Japanese translators will always be needed, translation is not the only field where Japanese language skills are required. Be business minded and think about what the company’s needs might be, be it locally or globally. And even if a business does not do business in Japan, you might be the person with Japanese language skills they were waiting for to initiate their global entry into Japan.

Good luck in your Japanese related job searches.

Double Clicking Japanese Text

Tuesday, April 5th, 2011

Double clicking text in Asian languages, especially Japanese, is something that has been overlooked in the localization of pretty much every operating system and application. If you want to copy a word into the clipboard, it is convenient to double click somewhere over the word and have the system automatically highlight the entire word for you.

Double clicking text in English and other European languages works like you expect it to. Double click on any word and the entire word is highlighted, like this: “DoubleClickMe.”

This works in English because there are spaces on each side of the word. Japanese, on the other hand, has no spaces between words, which makes the problem very difficult. Add in the multiple character types that Japanese has (hiragana, katakana, and kanji), and now you have an extremely difficult problem.

Let’s look at some examples and see what happens.


Double click on 「今日」 and it highlights the entire word.
Double click on 「は」 and it highlights the 「は」 particle.
So far so good.

Double click on the kanji of 「楽しかった」 and it highlights only 「楽」.
Double click on the okurigana (hiragana) after 楽 and it highlights 「しかったです」.

It is pretty obvious what is going on. Double clicking on Japanese text will highlight an entire string of one kind of character type.
In other words,

  • If you click on a katakana character, it will highlight the entire katakana string.
  • If you click on a hiragana character, it will highlight the entire hiragana string.
  • If you click on a kanji character, it will highlight the entire kanji string.

This is obviously not what we want it to do, but it makes sense why it does this. The OS or application that you are using doesn’t have any intelligence to be able to parse Japanese into individual, complete words. Therefore, by default, double clicking will highlight the longest string of similar character types.
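The behavior described above can be reproduced in a few lines of PHP. This is only a sketch of the naive approach, not how any particular OS implements it. The function names `scriptOf` and `naiveDoubleClick` are made up for illustration, and the sample sentence is an assumption pieced together from the words discussed above.

```php
<?php
// Classify one character by script using PCRE Unicode properties.
function scriptOf(string $ch): string {
    if (preg_match('/^\p{Han}$/u', $ch)) {
        return 'kanji';
    }
    if (preg_match('/^\p{Hiragana}$/u', $ch)) {
        return 'hiragana';
    }
    // The long-vowel mark is added by hand because its script is "Common".
    if (preg_match('/^[\p{Katakana}ー]$/u', $ch)) {
        return 'katakana';
    }
    return 'other';
}

// Return the run of same-script characters around position $index,
// which is the selection a naive double click produces.
function naiveDoubleClick(string $text, int $index): string {
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $script = scriptOf($chars[$index]);

    $start = $index;
    while ($start > 0 && scriptOf($chars[$start - 1]) === $script) {
        $start--;
    }
    $end = $index;
    while ($end < count($chars) - 1 && scriptOf($chars[$end + 1]) === $script) {
        $end++;
    }
    return implode('', array_slice($chars, $start, $end - $start + 1));
}

$sentence = '今日は楽しかったです'; // assumed sample sentence
echo naiveDoubleClick($sentence, 3), "\n"; // click on the kanji 楽
echo naiveDoubleClick($sentence, 4), "\n"; // click on the okurigana し
```

Nothing in this code knows what a word is. It only knows where one character type ends and another begins, which is why the selection stops after 楽 and swallows です.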

Let’s look at a couple of interesting examples that really illustrate this behavior.

First, a complete sentence all in hiragana.

Clicking anywhere on this sentence highlights the entire sentence. There is no way to highlight individual words by double clicking.

Next, a sentence with a lot of random kanji together.

Clicking anywhere on the initial long string of random kanji words will highlight the entire string. Without any intelligence, the system does not recognize that there are nine different words there and, as a result, highlights the entire string.

Double clicking Japanese text does not work. It will highlight stuff, but it does not highlight anything meaningful most of the time. This behavior is universally wrong across all operating systems (Windows, Mac OS X, Linux, etc.) and applications.

Is this a Solvable Problem?

The answer is yes. What we need is an OS or application level intelligence about Japanese. One way to achieve this intelligence is to match text against a Japanese dictionary. When you double click Japanese text, the OS should match against the longest hit in the dictionary.

For example:

If you click on 「寿」, it will hit 寿 (ことぶき) as a valid word in the dictionary, but it should continue on and recognize 寿司 (すし) as the longer and correct word, and highlight 「寿司」. It should not highlight the 「食」 character. If you click on 「食」, it should continue on in the dictionary search to highlight 「食べ放題」. The kanji/hiragana mix should not play a role as a boundary character like it currently does.

For example:

If you click on 「好」, it will hit 好 (こう) as a valid word in the dictionary, but it should not stop there. It should also hit 好き (すき) as the longer, and correct word to parse in this case.
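The longest-hit idea can be sketched as follows. This is a toy illustration under stated assumptions: the six-entry dictionary, the function name `longestMatch`, and the sample strings are all made up here, and the function only handles a click on the first character of a word. A real implementation would also have to look backwards from the click position.

```php
<?php
// Toy dictionary: the keys are headwords. A real system needs a full lexicon.
$dictionary = [
    '寿' => true, '寿司' => true,
    '食' => true, '食べ放題' => true,
    '好' => true, '好き' => true,
];

// Starting at the clicked character, return the longest dictionary entry
// that begins there. Fall back to the single character if nothing matches.
function longestMatch(string $text, int $index, array $dictionary): string {
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $best = $chars[$index];
    $candidate = '';
    for ($i = $index; $i < count($chars); $i++) {
        $candidate .= $chars[$i];
        if (isset($dictionary[$candidate])) {
            $best = $candidate; // keep going, a longer hit may follow
        }
    }
    return $best;
}

echo longestMatch('寿司食べ放題', 0, $dictionary), "\n"; // click on 寿
echo longestMatch('寿司食べ放題', 2, $dictionary), "\n"; // click on 食
echo longestMatch('好きです', 0, $dictionary), "\n";     // click on 好
```

The loop deliberately does not stop at the first hit, and it ignores the kanji/hiragana boundary entirely, so 食べ放題 is found even though it mixes character types.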

This should give us much better results when double clicking, but a dictionary lookup alone is not enough to give us consistently correct results. The problem is that a dictionary is only going to have the root words, not every conjugation and inflection. Therefore, we also need intelligence to understand parts of speech and conjugations and how they relate to the root words in the dictionary.

For example, the previous sentence:

If you click on 「楽」, it will hit 楽 (らく) as a valid word in the dictionary, but it should not stop there. It should also hit 楽しかった even though it is not in the dictionary, because it is the past tense of the word 楽しい, which is in the dictionary. 「楽しかった」 should be completely highlighted, and it should not continue on and highlight です because that is not part of the word 楽しい.
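One way to picture this extra intelligence is a table of endings that maps an inflected form back to its dictionary form. The sketch below is an assumption-heavy toy: the rule table covers only three i-adjective endings, the function name `longestWord` is invented, and real conjugation handling needs far more rules plus part-of-speech checks.

```php
<?php
// Toy dictionary holding root words only.
$dictionary = ['楽' => true, '楽しい' => true];

// Hypothetical rule table: inflected ending => dictionary-form ending.
$rules = ['かった' => 'い', 'くない' => 'い', 'くて' => 'い'];

// Like a longest-match lookup, but a candidate also counts as a hit when
// undoing one inflection rule yields a word that is in the dictionary.
function longestWord(string $text, int $index, array $dictionary, array $rules): string {
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $best = $chars[$index];
    $candidate = '';
    for ($i = $index; $i < count($chars); $i++) {
        $candidate .= $chars[$i];
        if (isset($dictionary[$candidate])) {
            $best = $candidate;
            continue;
        }
        foreach ($rules as $ending => $base) {
            if (substr($candidate, -strlen($ending)) !== $ending) {
                continue;
            }
            // Strip the inflected ending and try the dictionary form.
            $stem = substr($candidate, 0, strlen($candidate) - strlen($ending));
            if (isset($dictionary[$stem . $base])) {
                $best = $candidate;
                break;
            }
        }
    }
    return $best;
}

echo longestWord('今日は楽しかったです', 3, $dictionary, $rules), "\n"; // click on 楽
```

Because です does not undo to any dictionary word here, the match stops at 楽しかった.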


There are no operating systems that properly parse and highlight the correct words when double clicking. This is definitely not an easy problem to solve, but it is possible and should be done on Japanese systems.

There are two software applications that I know of that can parse Japanese properly most of the time: Rikaichan and NJStar. Rikaichan is an add-on to the Firefox browser, and NJStar is a Japanese word processing application. They both have a mouse-over hover function that parses complete Japanese words. You can also double click and get the expected result as well. These two applications both use a Japanese dictionary back end and have enough Japanese language intelligence to parse conjugations and inflections of words to get the expected match most of the time.

There are times when Rikaichan does not parse the expected result. I will cover those exceptions in a future article about parsing Japanese. However, Rikaichan is right about 99% of the time, which is great considering that your OS is usually wrong 90-95% of the time when it comes to double clicking Japanese text.

Italics in Japanese

Sunday, February 27th, 2011

When translating a document with formatting, such as a Microsoft Word document, you can’t always use the original source-language formatting in the translated language as is. This is especially true of italic type in Japanese. What works in italics in English does not work in Japanese. The formatting must be changed.

The main reason for this is Japanese text can become nearly unreadable when set in italic type. This is especially the case on low resolution monitors when displaying kanji in bold, italic fonts.

When translating English into Japanese, it is best to change the formatting for text in Japanese that was originally in italic type in English.

Here is a mini style guide of recommendations of how to format Japanese text that was translated from English set in italic type.


Use a Gothic bold-face type, or write the word in katakana if appropriate.


Another way to show emphasis is to use a well-known English phrase and write it in katakana.


Titles of Books, Publications, Media, etc.

Use the Japanese double quotation marks to quote the name of a publication.


Foreign Words

In English these would be written in italics. In Japanese, they will be written in either katakana or romanized type, which serves the function of designating it a foreign word.

Introducing or Defining Terms

Use the Japanese single quotation marks.



In other instances where italics are used in English, it is usually safe to use the Japanese single quotation marks.

In general, it is best to avoid italic type in Japanese. Certain Japanese typefaces don’t even have an italic font to begin with. It is very important to thoroughly proofread documents translated into Japanese for these types of formatting issues. What is natural in English can produce something almost unreadable in Japanese. And it will be a lot more natural to use something other than italics.

This also works the other way around when translating from Japanese into English. Text set off with quotation marks, katakana, etc. in Japanese should be changed into italics in the English translation where appropriate.

Sorting in Japanese — An Unsolved Problem

Sunday, February 13th, 2011

Sorting Japanese is not only difficult—it’s an unsolved problem. This seems hard to believe if you are not familiar with the complexities of processing Japanese digitally. But what is trivially easy in English is impossible in Japanese, even with the amount of computer power we have available today.

The problem comes from the complex nature of written Japanese. Contrast it with English, which only has 26 letters: a comes before b; b comes before c; and so on. On the other hand, Japanese not only has thousands of characters, it also has four different kinds of written characters. But this is only the beginning of the difficulty. The unique nature of kanji characters and their associated pronunciations is the language feature that makes Japanese unsortable.

Let’s work our way through the complexities to understand why Japanese cannot be sorted.

A Simple Sort

Let’s do a simple sort of a list of English words. Here I have a list of characters from the video game Street Fighter.

  • Ryu
  • Ken
  • Chun Li
  • Yun

Let’s put this list through a simple sort function using PHP.

   // Sort the names in place, then print one name per line.
   $names = array("Ryu", "Ken", "Chun Li", "Yun");
   sort($names);

   foreach ($names as $name) {
      echo "$name<br/>";
   }

Here is the result:

  • Chun Li
  • Ken
  • Ryu
  • Yun

This is the result we expect—it’s in alphabetical order. A computer can easily sort English in alphabetical order because there are simple rules. C comes before K; K comes before R; and R comes before Y. You should have learned this in the first grade.

Now let’s start looking at the complexities of Japanese, and see why sorting does not work as easily.

Multiple Character Sets

Japanese has four different character sets in the written language. Don’t worry about why there are four different types of characters, just know that there are.

  • Hiragana alphabet — ひらがな
  • Katakana alphabet — カタカナ
  • Kanji characters — 漢字
  • ABC alphabet — abc

Here is where the difficulty comes in: each character set has characters with the same pronunciations as characters in the other sets. On top of that, all four character sets are written together to form what is modern written Japanese. If you only had to deal with one character set at a time (ignoring kanji for the moment, we will get to that later), you could sort Japanese automatically just like English. Hiragana sorts just fine; katakana sorts just fine; and the ABC alphabet sorts just fine. But, in combination, it is not clear how you would sort these.

I should note that there are two different alphabetical sorting orders in Japanese. For this article I am going to use the a i u e o (あいうえお) sort order.

Sorting Settings

Now let’s look at an example of sorting mixed character sets. Again, using PHP.

   // Note: sort() ignores the locale unless SORT_LOCALE_STRING is passed.
   setlocale(LC_ALL, 'jpn');
   $settings = array("システム", "画面", "Windows ファイアウォール",
      "インターネット オプション", "キーボード", "メール", "音声認識", "管理ツール",
      "自動更新", "日付と時刻", "タスク", "プログラムの追加と削除", "フォント",
      "電源オプション", "マウス", "地域と言語のオプション", "電話とモデムのオプション",
      "Java", "NVIDIA");
   sort($settings);

   foreach ($settings as $setting) {
      echo "$setting<br/>";
   }

Here is the result.

  • Java
  • NVIDIA
  • Windows ファイアウォール
  • インターネット オプション
  • キーボード
  • システム
  • タスク
  • フォント
  • プログラムの追加と削除
  • マウス
  • メール
  • 地域と言語のオプション
  • 日付と時刻
  • 画面
  • 管理ツール
  • 自動更新
  • 電源オプション
  • 電話とモデムのオプション
  • 音声認識

Take a look at what happened with this sort. The first three strings start with characters of the alphabet, and were sorted as we expect. The next eight strings are in katakana, and they are sorted correctly according to the Japanese a i u e o sort order. The rest of the strings all start with kanji and are not sorted in any way that makes sense to a human.

So what is going on here? In this case, it seems that PHP is using the character code to determine the sort order. This works fine with alphabets like English, or even the Japanese katakana, because the character codes go in order with the sort order. But the character codes do not go in order when mixed with other character sets. In this example you can see ABC and katakana are separated. Kanji are then separated from katakana. There were no hiragana in this list but they would do the same. Sort order by character code works fine for alphabets when the alphabets are by themselves. But once you mix alphabets together, you cannot have any sensible sorting order by doing it that way.
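To see the character codes behind this ordering, here is a small sketch that prints the Unicode code point of the first character of a few strings. It assumes the mbstring extension and PHP 7.2 or later for `mb_ord`; the helper name `firstCodePoint` and the sample list are made up for this illustration.

```php
<?php
// Return the Unicode code point of the first character of a string.
function firstCodePoint(string $s): int {
    return mb_ord(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8');
}

$samples = ['音声認識', 'アユミ', 'Java', '画面', 'あゆみ'];

// SORT_STRING compares the raw bytes. For UTF-8 text that gives the
// same order as comparing code points.
sort($samples, SORT_STRING);

foreach ($samples as $s) {
    printf("U+%04X  %s\n", firstCodePoint($s), $s);
}
```

The alphabet, hiragana, katakana, and kanji each occupy their own range of code points, so they come out in separate groups. Within the kanji group, 画面 (gamen) lands before 音声認識 (onsei ninshiki) even though お comes before が in a i u e o order.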

An observant reader might have noticed what these items in our list are: Control Panel items in Windows XP. It’s clear that PHP’s sort function can’t sort this properly. But what about Windows XP Japanese edition?

Microsoft seems to have the same problem. They do alright with sorting each character set individually. But they don’t seem to be able to integrate the character sets together like a Japanese user would expect. It’s OK, I don’t expect Microsoft to be able to solve such a hard problem.

Sorting Names

Let’s look at another example to show what happens when you have all four character sets sorted together. Here we have two names, both written four different ways—using each character set: ABC alphabet, hiragana, katakana, and kanji.

Ayumi、あゆみ、アユミ、歩美
Tanaka、たなか、タナカ、田中

It is entirely possible for different people with the same name to write it in different character sets. The traditional way of writing the Japanese name Ayumi would be in kanji; a modern, stylish way would be to write it in hiragana; and a second generation Japanese-American might write their name in katakana or the alphabet.

Put these names into the same PHP sort function and look what happens.

   // Each name appears once per character set.
   setlocale(LC_ALL, 'jpn');
   $names = array("Ayumi", "アユミ", "あゆみ", "歩美",
      "Tanaka", "タナカ", "たなか", "田中");
   sort($names);

   foreach ($names as $name) {
      echo "$name<br/>";
   }

Here is the result:

  • Ayumi
  • Tanaka
  • あゆみ (Ayumi)
  • たなか (Tanaka)
  • アユミ (Ayumi)
  • タナカ (Tanaka)
  • 歩美 (Ayumi)
  • 田中 (Tanaka)

Within each character set Ayumi is sorted before Tanaka, which is correct for the ABC, hiragana, and katakana alphabets. The kanji pair had a 50/50 chance of being right. But as you can see, the different character sets are not integrated together. If these were all names in your phone’s contact list or your Facebook friends list, you would expect all of the Ayumis and Tanakas to be listed together.

The ABC, hiragana, and katakana alphabets can be sorted, although which character set of Ayumi gets sort preference is a whole other issue. Once that preference is agreed upon, sorting can be done just as easily as in English.
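For the two kana alphabets, one possible way to integrate them is to fold katakana into hiragana before comparing. The sketch below assumes the mbstring extension. The helper name `kanaKey` is made up, the "hiragana before katakana" tie-break is just one arbitrary preference, and names written in the ABC alphabet are left out because they would first need a romaji-to-kana conversion.

```php
<?php
// Fold katakana to hiragana so both scripts share one sort key.
// Mode 'c' of mb_convert_kana() converts full-width katakana to hiragana.
function kanaKey(string $s): string {
    return mb_convert_kana($s, 'c', 'UTF-8');
}

$names = ['タナカ', 'あゆみ', 'たなか', 'アユミ'];

usort($names, function (string $a, string $b): int {
    // Compare the folded keys first, then break ties on the original
    // string, which puts hiragana ahead of katakana.
    return [kanaKey($a), $a] <=> [kanaKey($b), $b];
});

print_r($names);
```

With this key, both spellings of Ayumi end up next to each other, followed by both spellings of Tanaka.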

Kanji — The Real Problem

The real problem with sorting Japanese text is kanji. Kanji aren’t just difficult for students of Japanese to make sense of, they are literally impossible for computers to process with the same intelligence as a human. The reason for this is the following:

Kanji have multiple pronunciations, determined by the context in which they appear.

This fact keeps students up nights studying for years trying to remember how to pronounce kanji right. And it also makes our sorting problem extremely nontrivial. We sort things in language by their pronunciations. Up until now we were dealing with letters. ABC, hiragana, katakana: these are all letters with a single pronunciation. There is only one place they can go.

Kanji on the other hand all have multiple pronunciations. Some have over ten! Only from the context in which the kanji appears do you know how to pronounce it. Our simple sorting problem has now turned into a natural language processing problem.

Here is an example:


Here the kanji 私 is used in two different contexts. The first usage is 私 (watashi). The second usage is part of the compound word 私立大学 (shiritsu daigaku). Using the Japanese sort order, these words should be sorted like this:

  • 私立大学 (しりつだいがく)
  • 私 (わたし)

A second year Japanese student could figure this out. For a computer, this is a very difficult problem.
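The gap between the two orderings is easy to show if the readings are supplied by hand. In this sketch the `$readings` table is an assumption standing in for the knowledge a human reader has; nothing here derives the pronunciation from the kanji.

```php
<?php
// Readings are supplied by hand. A program cannot derive them reliably.
$readings = ['私' => 'わたし', '私立大学' => 'しりつだいがく'];

// Ordering by character code: the shorter string with the same prefix wins.
$byCodePoint = array_keys($readings);
sort($byCodePoint, SORT_STRING);

// Ordering by pronunciation: し comes before わ in a i u e o order.
$byReading = array_keys($readings);
usort($byReading, function (string $a, string $b) use ($readings): int {
    return strcmp($readings[$a], $readings[$b]);
});

print_r($byCodePoint);
print_r($byReading);
```

The two lists come out in opposite orders, and only the second one matches what a Japanese reader expects.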

Here is another, more extreme example.

There are four Japanese women whose names you have to sort: Junko, Atsuko, Kiyoko, and Akiko. This does not seem difficult, until they each show you how they write their names in kanji:

  • 淳子 (Junko)
  • 淳子 (Atsuko)
  • 淳子 (Kiyoko)
  • 淳子 (Akiko)

As you can see, this is rather troublesome. This comes back to kanji having multiple pronunciations. If this was for an address book of your phone contacts for example, you would want Atsuko and Akiko listed with the A names like Ayumi and Akira. But you would not want Junko and Kiyoko listed there.

And this problem is not limited to names. Regular, everyday words also have multiple pronunciations. For example, 故郷 (ふるさと、こきょう), 上手 (じょうず、じょうて、うわて、かみて…) etc.

So how do we deal with this? They have phones and social networking Web sites in Japan with sorted contact lists, so how can we sort these words properly?

The Wrong Way – Using IME Input

First, let’s look at a good but failed attempt by Microsoft to solve this problem. What good would Excel be if you could not sort on columns and rows? Microsoft clearly understands the issue with sorting Japanese; they just didn’t think through the solution thoroughly.

What Microsoft does in Excel is to capture the input the user types to get the kanji character. For example, if you typed Junko to get 淳子, it will save that input string as meta data in the background. When it is time to sort, it sorts on the input pronunciation meta data rather than the kanji that are displayed. You can actually see what the meta data looks like in Excel 2003 if you save as XML.

You can see the kanji 淳子 is in two different rows, but the input used to get them was different, Atsuko and Junko, so those are saved as meta data to assist with sorting later on.

The problem with this approach is that it doesn’t take into account how people actually interact with computers using a Japanese IME system. Japanese input works with a dictionary of possible kanji conversions based on what has been input. But not every word or name is in that dictionary. Sometimes you have to type each kanji individually or use a totally different pronunciation to get the kanji you want to show up. This results in the wrong pronunciation being saved as meta data, and sorting will not work as expected.

This system also doesn’t work with cutting and pasting text from other sources, as well as any sort of CSV or database import, etc. This was a good try by Microsoft to solve this problem, but it just doesn’t work.

The Right Way – Ask the User

A computer simply cannot guess the correct pronunciation of kanji, even if it logs the user’s input, because that might not even be correct. The easiest way to solve this problem is to just ask the user for the pronunciation! Most software developed in Japan uses this approach.

Let’s look at this approach done correctly by Amazon, starting with their new user registration. First, notice the fields in the English version of this screen.

Now look at the Japanese version of this screen.

As you can see, the Japanese version has an extra field. This is for the user to enter the pronunciation of their name in katakana. This way, Amazon has their name in kanji, and the correct pronunciation to go with. They can now sort their user information correctly. This is the approach that most Japanese software takes. It is an extra step, but it solves the problem.
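Once the reading is stored next to the kanji, the sort itself becomes simple. The following is a minimal sketch, not Amazon’s actual schema or code: the `name` and `reading` field names and the sample records are assumptions, and a plain `strcmp` on katakana only approximates a i u e o order.

```php
<?php
// Each record keeps the displayed kanji and the katakana reading
// that the user typed into the extra field.
$contacts = [
    ['name' => '淳子', 'reading' => 'ジュンコ'],
    ['name' => '田中', 'reading' => 'タナカ'],
    ['name' => '淳子', 'reading' => 'アツコ'],
    ['name' => '歩美', 'reading' => 'アユミ'],
];

// Sort on the reading, never on the kanji.
usort($contacts, function (array $a, array $b): int {
    return strcmp($a['reading'], $b['reading']);
});

foreach ($contacts as $contact) {
    echo $contact['reading'], ' ', $contact['name'], "\n";
}
```

The two women who both write their name as 淳子 now land in different places, Atsuko with the A names and Junko further down, which is what the kanji alone could never tell us.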

The big takeaway from this is that you cannot just translate software, or even a Web site, and expect it to work. Something as simple as registering a new user has to be completely reworked. In the case of a simple Web site, you will need to redo not only the Web interface, but also the database back end and the code to interface with the database and Web site generation. Localizing a site into Japanese is much more complicated than other languages because of the extra functionality that is required.

While Amazon does do the interface and programming localization correctly, they do have something on their site that isn’t localized for the Japanese audience: their logo.

In English, the logo goes with their saying: “Everything from A to Z.” This is indicated by the arrow. But in Japan, and any other country that doesn’t use English, A and Z aren’t always the first and last letters of the alphabet. The A to Z thing works in English because the name Amazon has A and Z in it. But in other countries, they might not have any idea why there is an arrow under the Amazon logo.

Final Thoughts

Sorting in Japanese is hard. Without user input, it is impossible in some contexts to know how to sort some Japanese words. People developing and localizing software need to understand these issues. But regarding the general problem of sorting Japanese when you don’t have user input to give the pronunciation, there may not be a way to automate this until computers can understand language as well as a native Japanese person. For a computer to understand Japanese is far more complex than most other languages. You can see this first hand by using machine translation software and comparing Japanese to something like French.

I think this is an interesting problem. This goes beyond just sorting. How can you expect a machine translation program to work if it doesn’t even know the pronunciation of a word, something that can be key to understanding what that word is? I can imagine even statistical machine translation being confused, especially with names.

Japanese is an interesting language, and processing it with computers is even more interesting.

Translating Sentences for Trados Rather Than Ideas

Sunday, March 30th, 2008

The benefits of translation memory tools such as Trados for translating are numerous; but they have their negatives as well. They encourage the translator to translate everything on a sentence by sentence basis. Every source sentence will have a corresponding target sentence. It is not always ideal to translate in this manner.

For example, consider the following Japanese sentence and its translation:


I like sushi. However, I cannot eat sea urchin.

Notice that in Japanese it is natural to say that all in one sentence. English on the other hand works better as two sentences.

If you translated that Japanese sentence using Trados, you can split the English translation into two sentences. However, if you use that translation memory for translating English, you will get no matches for the sentences “I like sushi.” or “However, I cannot eat sea urchin.” You would have to know to expand the segment to span two sentences.

Translation memory CAT tools like Trados encourage you to translate with a one-to-one correspondence so the translation memory is useful in both directions. It is wasteful to misalign sentences because the resulting TM will not work if the language direction is reversed. Therefore, a translator using Trados will probably translate the above sentence as “I like sushi, but I cannot eat sea urchin.” This sentence is fine by itself, but it doesn’t have the same impact as separating the two into single ideas.

This is just a simple example, but the problem is much bigger than style choices. When using Trados, you translate entire paragraphs line by line. Every source has a matching target. However, the way you organize a paragraph and express an idea in one language may not be the same as in another language. With Trados, you are given a sentence to translate, and then another, and another. You don’t have the freedom you would if you were translating by hand. If you choose to expand the source segment to encompass the entire paragraph, you have essentially made that segment worthless with respect to the translation memory.

Trados and translation memory CAT software are great tools, but they encourage translation of single sentences, rather than ideas or concepts. A test often used after a translation is to run the translation memory that was created against the original source document. You expect to get 100% matches for the entire document. However, a good translator will not translate everything line by line with one-to-one correspondence between source and target.

Translation is more than converting a sentence from one language to another. It’s about expressing something naturally in a different language. CAT tools like Trados don’t encourage the natural translation of ideas, but rather the conversion of sentences.

First Ever to be Trados Certified

Sunday, February 17th, 2008

In 2006, SDL unveiled their Trados Certification program. I had been using Trados extensively at work and thought it would be neat to have the official Trados certification on my resume.

Soon after they released the Trados training program and certification tests, I signed up online to take the tests. To my surprise, the tests were hard. They had questions about what the specific menu names were and little details like that. If I had not used Trados as much as I had, I don’t think I would have passed. You had to be really familiar with the entire suite of tools.

I passed the test and got my own personal certification page generated:

What was surprising was what came next. A few weeks later I got a package at work from SDL Trados. They sent a congratulatory card informing me that I was the first person to pass the Trados certification, and a bottle of vintage champagne! I certainly wasn’t expecting any of that.

Following that, they contacted me again for a quote and profile to put up on their certification Web site. They also asked for a picture of me, but I guess I wasn’t photogenic enough for their site because they put a generic image of someone else above my quote. The current version of the page has a woman and multiple quotes now.

SDL Trados Certification Page

In the end, it’s kind of neat. I can tell people I was the first person to ever be certified by SDL Trados. Since then I have also passed their SDL Trados 2007 certification.

Can You Translate This?

Saturday, January 19th, 2008

At work I’m often asked things such as “How long will it take to translate 10 pages?” Managers usually don’t like my answer; they want to hear a specific time frame to fill in some Gantt chart or something. The reality is, it depends. Most managers don’t understand what goes into translating something. It’s not as straightforward as just translating the words. There is more to the localization process than just translation.

Expertise. No one is an expert on everything. If you have a technical document that needs translating, you first have to understand the content of the document. If you don’t understand electromagnetic fields in your native language, how are you going to translate that subject from another language? You will often have to research the subject matter before and during the translation process. It takes time to get familiar with a topic. It takes more time to look up industry and field specific terminology and concepts. If you have subject matter experts (SMEs) at your company, you are at the mercy of their schedules when you cannot locate information yourself.

Working with others. Unfortunately, not all translation projects can be done solely by the translator. For example, if you are given a video and asked to subtitle it in a different language, there are many steps involved that most people don’t realize:

  • Transcribe the audio
  • Translate the text
  • Match the text to the video
  • Reedit the video with the translated subtitles
  • QA check the subtitled video

Ideally, the translator is given a transcription of the audio along with a copy of the video, so they can start translating immediately. They then work with the editor to set cut points for the translated text. Finally, the editor reedits the video and sends it to the translator to check.

Unfortunately, what usually happens is the translator is sent the video and asked to translate subtitles for it. Now the translator has to spend time transcribing the text, then translate it. Next, they must come up with cut points themselves, and hope the editor understands them. The editor will then receive the translation and edit it into the video. It will probably never be checked, and most likely the subtitles aren’t going to match the on-screen dialog.

File formats. How long it will take to translate a document depends on what format it’s in. An XML file with pure text content and no markup can be translated easily. The text can be extracted and run through the translator’s favorite translation software.

A PDF on the other hand is not as easily accessible. Text may be extractable with some amount of effort, but the original document structure and style cannot be rebuilt automatically. Therefore, the translator will have to spend considerable time doing page layout work.

The worst case scenario is a scanned document, or raster graphics files. The text cannot even be extracted from the document, so translation software can’t be used. With a language like Japanese where a translator may not know the pronunciation of hard technical terms, the inability to cut and paste those words into an online dictionary creates lots of problems.

Most people don’t consider the file format when sending something to be translated. They just want it translated, and don’t want to pay for page layout and text extraction, because they don’t think that is involved with the translation process. If you send a Word document to a translator, but that Word document has 50 JPGs in it with text to be translated, you are asking the translator to be a graphics specialist as well.

There is a lot that goes into the translation process. Translators often have to do much more than just translate words to do a good translation. Managers need to understand what goes into this process and provide translators with the resources they need so they can specialize in what they do best.