Archive for the ‘Japanese’ Category

Japanese Input on Linux Mint 13 Maya LTS (Mate)

Saturday, June 2nd, 2012

This tutorial will show you how to install Japanese input IME (日本語入力方法) in Linux Mint 13 under the Mate desktop environment. Japanese IME is required to be able to type in Japanese. It is pretty easy to get working, so let’s start.

Click on the Mint Menu and select System → Software Manager.

In the Software Manager, search for ibus.

Select ibus.

Click Install.

In the Authentication Required dialog box, enter your system password and press Authenticate.

Software Manager will now download and install IBus in the background.

While IBus is installing, search for anthy.

Select ibus-anthy and click Install.

In the Authentication Required dialog box, enter your system password and press Authenticate.

Software Manager will now download and install ibus-anthy in the background.

When the activity bar on the bottom shows 0 ongoing actions, installation is complete.

Close Software Manager.

From the Mint Menu, select System → Control Center.

Open Language Support.

On the Language Support screen, select Install / Remove Languages….

Scroll down and check Japanese, and then press Apply Changes.

In the Authentication Required dialog box, enter your system password and press Authenticate.

The Applying changes popup screen will display. Wait for it to finish applying changes. It may take a few minutes.

On the Language Support screen, press the Keyboard input method system: drop down and select ibus.

Then press Close.

On the Control Center screen, scroll down and select Other → Keyboard Input Methods.

You may get a popup dialog box that says Keyboard Input Methods (IBus Daemon) has not been started. Do you want to start it now? Select Yes to start IBus.

The IBus daemon is now started and the IBus preferences screen will now display.

On the IBus Preferences screen, go to the Input Method tab.

Select the Customize active input methods check box.

Press the Select an input method dropdown and select Japanese → Anthy.

Press Add on the IBus Preferences screen to add the Anthy Japanese input method.

Press Close to exit the IBus Preferences screen.

The IBus i icon should now display in the bottom panel (the square icon, not the shield icon). If it does not show up, log out and log back in. I recommend doing this anyway to make sure IBus starts properly; if it does not, the icon may show up but give you a No Input Window error.

Open a text application like Text Editor. While the cursor is in the text field, press the IBus i icon in the bottom panel and select Japanese – Anthy.

Anthy is now activated. To toggle between English and Japanese input, press Control + Space Bar. The IBus icon will now change to the Anthy Aち icon, indicating that you can type in Japanese.

That’s all there is to it. Now you can type in 日本語.

Note: Sometimes the input method reverts back to English if you are changing back and forth between windows and applications. Just press Control + Space Bar to toggle back to Anthy if this happens.

Japanese Input on Ubuntu Linux 12.04 LTS Precise Pangolin

Tuesday, May 29th, 2012

This tutorial will show you how to set up Japanese input IME (日本語入力方法) on Ubuntu Linux 12.04 from the Unity interface. The installation procedure is very similar to the previous Unity release of Ubuntu 11.10.

Setup Procedure

To start, select Dash home from the Unity Launcher.

From the Dash home, search for Language Support.

Select Language Support.

On the Language tab of the Language Support screen, press Install / Remove Languages…

On the Installed Languages screen, scroll down to Japanese and check Installed, and then press Apply Changes.

Enter your password on the Authenticate screen.

It will take a few moments to download and install the Japanese IME packages.

Back on the Language Support screen, select ibus for the Keyboard input method system, and then press Close.

Once again select Dash home from the Unity Launcher.

From the Dash home, search for Keyboard Input Methods.

Select Keyboard Input Methods.

You may get a pop up message saying Keyboard Input Methods (IBus Daemon) has not been started. Do you want to start it now? Select Yes.

On the Input Method tab of the IBus Preferences screen, select the Customize active input methods check box.

Press Select an input method and select Japanese → Anthy.

Press Add and then press Close.

The IBus keyboard icon will now display on the top panel.

Open up any application with a text box such as gedit and place the cursor in the text box.

Press the IBus keyboard icon on the top panel and select Japanese – Anthy.

The IBus keyboard icon will now change to the Anthy Aち icon.

That’s it. You can now type in Japanese in Ubuntu 12.04. 難しくない手順ですね。

Kanji Usage Count Concordance Web App

Sunday, February 5th, 2012

A few posts ago I posted some PHP code to show how to extract all the kanji from a string and create a concordance that orders them by how often each kanji is used.

Expanding on this idea, I created a Web application that allows you to generate a kanji usage count concordance from any Web page to see what the most used kanji are. You can access it at the new Localizing Japan Kanji Usage Count Concordance App page. You can also access it from the main navigation tabs at the top of the site.

It’s very easy to use: just copy and paste the URL of a Japanese Web page you want to analyze and submit it. The kanji concordance will show you in descending order what the most used kanji are on that Web page. This can be useful for students studying Japanese who want to know what the most used kanji are on certain sites so they can focus their studying, among many other uses.

Enjoy the Kanji Usage Count Concordance App.

Detecting and Converting Japanese Multibyte Encodings in PHP

Monday, January 30th, 2012

PHP has a large collection of multibyte functions in the standard library for handling multibyte strings such as Japanese. Two useful multibyte functions that PHP provides are for detecting the encoding of a multibyte string, and converting from one multibyte encoding to another.

To check if $string is in UTF-8 encoding, we call mb_check_encoding() like this:

if (mb_check_encoding($string, "UTF-8")) {
   // do_something();
}

To convert $string, which is currently Shift-JIS, to UTF-8, we call mb_convert_encoding() like this:

$convertedString = mb_convert_encoding($string, "UTF-8", "Shift-JIS");

A convenient feature of mb_convert_encoding() is that you can generalize the function by adding a list of character encodings to convert from. This can come in very handy if you want to convert all Japanese multibyte string encodings to UTF-8, or something else. There are actually 18 Japanese-specific multibyte encodings (that I know of), not including all the Unicode variants like UTF-8, UTF-16, etc. A lot of them come from the Japanese mobile phone carriers.

Let’s put all of this together and check if a string is UTF-8, and if it’s not, meaning it is one of the other 18 Japanese encoding types, let’s convert it to UTF-8.

if (!mb_check_encoding($string, "UTF-8")) {

   // The list is concatenated so that no line break ends up inside an encoding name.
   $string = mb_convert_encoding($string, "UTF-8",
      "Shift-JIS, EUC-JP, JIS, SJIS, JIS-ms, eucJP-win, SJIS-win, ISO-2022-JP, " .
      "ISO-2022-JP-MS, SJIS-mac, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI, " .
      "SJIS-Mobile#SOFTBANK, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, " .
      "UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, ISO-2022-JP-MOBILE#KDDI");
}
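The check-then-convert idea can also be wrapped in a small helper. The following is only a sketch: the function name toUtf8() and the short two-entry candidate list are assumptions made for this example, not something from the code above. It uses mb_detect_encoding() in strict mode so that the source encoding is identified explicitly before converting.

```php
<?php
// Hypothetical helper; the name toUtf8() and the short candidate list
// are assumptions made for this sketch.
function toUtf8(string $text, array $candidates = ['SJIS', 'EUC-JP']): string
{
    // Already valid UTF-8 (this includes plain ASCII): nothing to do.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Strict detection returns false unless the bytes are valid
    // in one of the candidate encodings.
    $from = mb_detect_encoding($text, $candidates, true);
    if ($from === false) {
        throw new InvalidArgumentException('Unknown source encoding');
    }

    return mb_convert_encoding($text, 'UTF-8', $from);
}

// "日本語" encoded as Shift-JIS bytes.
$sjis = "\x93\xFA\x96\x7B\x8C\xEA";
echo toUtf8($sjis), "\n"; // 日本語
```

One caveat worth knowing: ISO-2022-JP (JIS) text consists only of 7-bit bytes, so it also passes a UTF-8 validity check and would be returned unchanged by a check like this one.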

Regular Expressions for Japanese Text

Friday, January 20th, 2012

Regular expressions are extremely useful for matching patterns in text. But when it comes to Japanese Unicode text, it isn’t obvious what you should do to create regular expressions to match a range of Japanese characters. You can try something like [あ-ん] to match all hiragana characters—and you would be close—but it isn’t the best way to do it. Also, direct input of Japanese isn’t always an option.

To deal with this, know that each character in Unicode has a hexadecimal code point. For example, the code point for the hiragana あ is 3042, and this is designated by U+3042. This code point can be used in a regular expression, written generically in this post as \x3042, to match a hiragana あ. The exact escape syntax depends on the regex flavor: PHP and Perl write it as \x{3042}, while JavaScript and Java write it as \u3042. This is very useful for programmers who must code pattern matching for Japanese on a system where they cannot input or display Japanese text, or who lack the know-how to do it (see some of my great Japanese input posts if you need to know how!).
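As a small illustration of the code point idea, the sketch below looks up a character's code point in PHP and builds the matching regex escape. The helper names codePointLabel() and codePointRegex() are made up for this example, and it assumes PHP 7.2 or later (for mb_ord()) together with the PCRE \x{...} escape syntax.

```php
<?php
// Hypothetical helpers for illustration; not part of any library.

// Return the "U+XXXX" label for the first character of a UTF-8 string.
function codePointLabel(string $char): string
{
    return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

// Build a PCRE pattern that matches exactly that character.
function codePointRegex(string $char): string
{
    return sprintf('/\x{%04X}/u', mb_ord($char, 'UTF-8'));
}

echo codePointLabel('あ'), "\n"; // U+3042
echo codePointRegex('あ'), "\n"; // /\x{3042}/u
echo preg_match(codePointRegex('あ'), 'ひらがなのあ'), "\n"; // 1
```

Because the pattern contains only ASCII, it can be typed and stored on a system that cannot input or display Japanese.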

Additionally, some flavors of regular expressions have what are known as Unicode block properties, or Unicode scripts. These are pre-defined blocks of regex Unicode character classes. Hiragana, katakana, and kanji are included in the block properties—very convenient if you need the full-script match in your regular expression.

With this basic knowledge, the following is a thorough list of different Japanese character classes and the regular expressions that match them. Further down are a few programming examples showing them in use.

Hiragana

Unicode code points regex: [\x3041-\x3096]
Unicode block property regex: \p{Hiragana}

ぁ あ ぃ い ぅ う ぇ え ぉ お か が き ぎ く ぐ け げ こ ご さ ざ し じ す ず せ ぜ そ ぞ た だ ち ぢ っ つ づ て で と ど な に ぬ ね の は ば ぱ ひ び ぴ ふ ぶ ぷ へ べ ぺ ほ ぼ ぽ ま み む め も ゃ や ゅ ゆ ょ よ ら り る れ ろ ゎ わ ゐ ゑ を ん ゔ ゕ ゖ  ゙ ゚ ゛ ゜ ゝ ゞ ゟ

Katakana (Full Width)

Unicode code points regex: [\x30A0-\x30FF]
Unicode block property regex: \p{Katakana}

゠ ァ ア ィ イ ゥ ウ ェ エ ォ オ カ ガ キ ギ ク グ ケ ゲ コ ゴ サ ザ シ ジ ス ズ セ ゼ ソ ゾ タ ダ チ ヂ ッ ツ ヅ テ デ ト ド ナ ニ ヌ ネ ノ ハ バ パ ヒ ビ ピ フ ブ プ ヘ ベ ペ ホ ボ ポ マ ミ ム メ モ ャ ヤ ュ ユ ョ ヨ ラ リ ル レ ロ ヮ ワ ヰ ヱ ヲ ン ヴ ヵ ヶ ヷ ヸ ヹ ヺ ・ ー ヽ ヾ ヿ

Kanji

Unicode code points regex: [\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]
Unicode block property regex: \p{Han}

漢字 日本語 文字 言語 言葉 etc. Too many characters to list.

This regular expression will match all the kanji, including those used in Chinese.

Kanji Radicals

Unicode code points regex: [\x2E80-\x2FD5]

⺀ ⺁ ⺂ ⺃ ⺄ ⺅ ⺆ ⺇ ⺈ ⺉ ⺊ ⺋ ⺌ ⺍ ⺎ ⺏ ⺐ ⺑ ⺒ ⺓ ⺔ ⺕ ⺖ ⺗ ⺘ ⺙ ⺚ ⺛ ⺜ ⺝ ⺞ ⺟ ⺠ ⺡ ⺢ ⺣ ⺤ ⺥ ⺦ ⺧ ⺨ ⺩ ⺪ ⺫ ⺬ ⺭ ⺮ ⺯ ⺰ ⺱ ⺲ ⺳ ⺴ ⺵ ⺶ ⺷ ⺸ ⺹ ⺺ ⺻ ⺼ ⺽ ⺾ ⺿ ⻀ ⻁ ⻂ ⻃ ⻄ ⻅ ⻆ ⻇ ⻈ ⻉ ⻊ ⻋ ⻌ ⻍ ⻎ ⻏ ⻐ ⻑ ⻒ ⻓ ⻔ ⻕ ⻖ ⻗ ⻘ ⻙ ⻚ ⻛ ⻜ ⻝ ⻞ ⻟ ⻠ ⻡ ⻢ ⻣ ⻤ ⻥ ⻦ ⻧ ⻨ ⻩ ⻪ ⻫ ⻬ ⻭ ⻮ ⻯ ⻰ ⻱ ⻲ ⻳
⼀ ⼁ ⼂ ⼃ ⼄ ⼅ ⼆ ⼇ ⼈ ⼉ ⼊ ⼋ ⼌ ⼍ ⼎ ⼏ ⼐ ⼑ ⼒ ⼓ ⼔ ⼕ ⼖ ⼗ ⼘ ⼙ ⼚ ⼛ ⼜ ⼝ ⼞ ⼟ ⼠ ⼡ ⼢ ⼣ ⼤ ⼥ ⼦ ⼧ ⼨ ⼩ ⼪ ⼫ ⼬ ⼭ ⼮ ⼯ ⼰ ⼱ ⼲ ⼳ ⼴ ⼵ ⼶ ⼷ ⼸ ⼹ ⼺ ⼻ ⼼ ⼽ ⼾ ⼿ ⽀ ⽁ ⽂ ⽃ ⽄ ⽅ ⽆ ⽇ ⽈ ⽉ ⽊ ⽋ ⽌ ⽍ ⽎ ⽏ ⽐ ⽑ ⽒ ⽓ ⽔ ⽕ ⽖ ⽗ ⽘ ⽙ ⽚ ⽛ ⽜ ⽝ ⽞ ⽟ ⽠ ⽡ ⽢ ⽣ ⽤ ⽥ ⽦ ⽧ ⽨ ⽩ ⽪ ⽫ ⽬ ⽭ ⽮ ⽯ ⽰ ⽱ ⽲ ⽳ ⽴ ⽵ ⽶ ⽷ ⽸ ⽹ ⽺ ⽻ ⽼ ⽽ ⽾ ⽿ ⾀ ⾁ ⾂ ⾃ ⾄ ⾅ ⾆ ⾇ ⾈ ⾉ ⾊ ⾋ ⾌ ⾍ ⾎ ⾏ ⾐ ⾑ ⾒ ⾓ ⾔ ⾕ ⾖ ⾗ ⾘ ⾙ ⾚ ⾛ ⾜ ⾝ ⾞ ⾟ ⾠ ⾡ ⾢ ⾣ ⾤ ⾥ ⾦ ⾧ ⾨ ⾩ ⾪ ⾫ ⾬ ⾭ ⾮ ⾯ ⾰ ⾱ ⾲ ⾳ ⾴ ⾵ ⾶ ⾷ ⾸ ⾹ ⾺ ⾻ ⾼ ⾽ ⾾ ⾿ ⿀ ⿁ ⿂ ⿃ ⿄ ⿅ ⿆ ⿇ ⿈ ⿉ ⿊ ⿋ ⿌ ⿍ ⿎ ⿏ ⿐ ⿑ ⿒ ⿓ ⿔ ⿕

Katakana and Punctuation (Half Width)

Unicode code points regex: [\xFF5F-\xFF9F]

｟ ｠ ｡ ｢ ｣ ､ ･ ｦ ｧ ｨ ｩ ｪ ｫ ｬ ｭ ｮ ｯ ｰ ｱ ｲ ｳ ｴ ｵ ｶ ｷ ｸ ｹ ｺ ｻ ｼ ｽ ｾ ｿ ﾀ ﾁ ﾂ ﾃ ﾄ ﾅ ﾆ ﾇ ﾈ ﾉ ﾊ ﾋ ﾌ ﾍ ﾎ ﾏ ﾐ ﾑ ﾒ ﾓ ﾔ ﾕ ﾖ ﾗ ﾘ ﾙ ﾚ ﾛ ﾜ ﾝ ﾞ ﾟ

Japanese Symbols and Punctuation

Unicode code points regex: [\x3000-\x303F]

、 。 〃 〄 々 〆 〇 〈 〉 《 》 「 」 『 』 【 】 〒 〓 〔 〕 〖 〗 〘 〙 〚 〛 〜 〝 〞 〟 〠 〡 〢 〣 〤 〥 〦 〧 〨 〩 〪 〫 〬 〭 〮 〯 〰 〱 〲 〳 〴 〵 〶 〷 〸 〹 〺 〻 〼 〽 〾 〿

Miscellaneous Japanese Symbols and Characters

Unicode code points regex: [\x31F0-\x31FF\x3220-\x3243\x3280-\x337F]

ㇰ ㇱ ㇲ ㇳ ㇴ ㇵ ㇶ ㇷ ㇸ ㇹ ㇺ ㇻ ㇼ ㇽ ㇾ ㇿ
㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩ ㈪ ㈫ ㈬ ㈭ ㈮ ㈯ ㈰ ㈱ ㈲ ㈳ ㈴ ㈵ ㈶ ㈷ ㈸ ㈹ ㈺ ㈻ ㈼ ㈽ ㈾ ㈿ ㉀ ㉁ ㉂ ㉃
㊀ ㊁ ㊂ ㊃ ㊄ ㊅ ㊆ ㊇ ㊈ ㊉ ㊊ ㊋ ㊌ ㊍ ㊎ ㊏ ㊐ ㊑ ㊒ ㊓ ㊔ ㊕ ㊖ ㊗ ㊘ ㊙ ㊚ ㊛ ㊜ ㊝ ㊞ ㊟ ㊠ ㊡ ㊢ ㊣ ㊤ ㊥ ㊦ ㊧ ㊨ ㊩ ㊪ ㊫ ㊬ ㊭ ㊮ ㊯ ㊰ ㊱ ㊲ ㊳ ㊴ ㊵ ㊶ ㊷ ㊸ ㊹ ㊺ ㊻ ㊼ ㊽ ㊾ ㊿
㋀ ㋁ ㋂ ㋃ ㋄ ㋅ ㋆ ㋇ ㋈ ㋉ ㋊ ㋋  ㋐ ㋑ ㋒ ㋓ ㋔ ㋕ ㋖ ㋗ ㋘ ㋙ ㋚ ㋛ ㋜ ㋝ ㋞ ㋟ ㋠ ㋡ ㋢ ㋣ ㋤ ㋥ ㋦ ㋧ ㋨ ㋩ ㋪ ㋫ ㋬ ㋭ ㋮ ㋯ ㋰ ㋱ ㋲ ㋳ ㋴ ㋵ ㋶ ㋷ ㋸ ㋹ ㋺ ㋻ ㋼ ㋽ ㋾
㌀ ㌁ ㌂ ㌃ ㌄ ㌅ ㌆ ㌇ ㌈ ㌉ ㌊ ㌋ ㌌ ㌍ ㌎ ㌏ ㌐ ㌑ ㌒ ㌓ ㌔ ㌕ ㌖ ㌗ ㌘ ㌙ ㌚ ㌛ ㌜ ㌝ ㌞ ㌟ ㌠ ㌡ ㌢ ㌣ ㌤ ㌥ ㌦ ㌧ ㌨ ㌩ ㌪ ㌫ ㌬ ㌭ ㌮ ㌯ ㌰ ㌱ ㌲ ㌳ ㌴ ㌵ ㌶ ㌷ ㌸ ㌹ ㌺ ㌻ ㌼ ㌽ ㌾ ㌿ ㍀ ㍁ ㍂ ㍃ ㍄ ㍅ ㍆ ㍇ ㍈ ㍉ ㍊ ㍋ ㍌ ㍍ ㍎ ㍏ ㍐ ㍑ ㍒ ㍓ ㍔ ㍕ ㍖ ㍗ ㍘ ㍙ ㍚ ㍛ ㍜ ㍝ ㍞ ㍟ ㍠ ㍡ ㍢ ㍣ ㍤ ㍥ ㍦ ㍧ ㍨ ㍩ ㍪ ㍫ ㍬ ㍭ ㍮ ㍯ ㍰ ㍱ ㍲ ㍳ ㍴ ㍵ ㍶  ㍻ ㍼ ㍽ ㍾ ㍿

Alphanumeric and Punctuation (Full Width)

Unicode code points regex: [\xFF01-\xFF5E]

！ ＂ ＃ ＄ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／ ０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？
＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿
｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ～

 

Japanese RegEx Code Examples

Find all hiragana in a text string

// PHP
$pattern = "/[\x{3041}-\x{3096}]/u";
preg_match_all($pattern, $text, $matches);
print_r($matches);
# Perl
if ($text =~ m/[\x{3041}-\x{3096}]/) { print $text; }

Remove all hiragana from a text string

//PHP
$pattern = "/\p{Hiragana}/u";
$text = preg_replace($pattern, "", $text);
# Perl
$text =~ s/\p{Hiragana}//g;

Remove everything but Kanji

// PHP
// \P{Han} matches everything other than kanji
$pattern = "/\P{Han}/u";
$text = preg_replace($pattern, "", $text);

Note: In PHP and Perl, a Unicode code point in a regular expression is written with curly braces around the hexadecimal code. So the regex of \x3041 becomes \x{3041} and so on.

Note: In Perl you have to make sure you have Unicode set up properly to get regular expressions to work over Japanese. You may also have to run perl with the -CS options (perl -CS) to get rid of any Wide character in print warnings. See http://ahinea.com/en/tech/perl-unicode-struggle.html for more information.
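To show several of the character classes from this post working together, here is a rough sketch that counts how many characters of each class appear in a string. The constant name JAPANESE_CLASSES and the function name countJapaneseClasses() are invented for this example; the patterns are simply the code point ranges listed above, written in PHP's \x{...} form. Explicit ranges are used instead of \p{...} properties because, as far as I know, how script properties treat shared punctuation can differ between PCRE versions.

```php
<?php
// Character classes taken from the code point ranges listed in this post.
// The constant name and function name are made up for this sketch.
const JAPANESE_CLASSES = [
    'hiragana'    => '/[\x{3041}-\x{3096}]/u',
    'katakana'    => '/[\x{30A0}-\x{30FF}]/u',
    'kanji'       => '/[\x{3400}-\x{4DB5}\x{4E00}-\x{9FCB}\x{F900}-\x{FA6A}]/u',
    'punctuation' => '/[\x{3000}-\x{303F}]/u',
];

// Count how many characters of each class appear in $text.
function countJapaneseClasses(string $text): array
{
    $counts = [];
    foreach (JAPANESE_CLASSES as $name => $pattern) {
        // With no third argument, preg_match_all() just returns the match count.
        $counts[$name] = preg_match_all($pattern, $text);
    }
    return $counts;
}

print_r(countJapaneseClasses('漢字とカタカナ。'));
```

For the sample string this reports one hiragana, four katakana, two kanji, and one punctuation character.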

Kanji Usage Count Concordance in PHP

Wednesday, December 7th, 2011

A concordance is a list of all words used in a document, Web site, or publication, and some additional useful information about those words. A useful concordance in translation and localization is a list of the most frequently used words. This can be used to identify important terms that should be picked up for a glossary. It can also be used by students of a foreign language to identify important vocabulary words they should dedicate time to study.

Students of Japanese often wonder which kanji they should learn. It can be hard to identify which kanji are most important, and the most important kanji differ from one subject matter to another.

To help with this, I’ve created a kanji concordance application in PHP to create a list of kanji and their usage counts in descending order.

Example

If you had this Japanese text:

私の名前はマークです。私はテキサス大学を卒業しました。すしが大好きです。

The kanji concordance would generate a list that looked like this:

2 私
2 大
1 名
1 前
1 学
1 卒
1 業
1 好

The kanji 私 and 大 are both used twice, so they are at the top of the list with the number 2 for the usage count. The rest of the kanji are used once and show a usage count of 1.

Kanji Concordance Code Explanation

The first thing we do in PHP is set the language locale with the setlocale() function. This is always good practice when dealing with language-related applications.

setlocale(LC_ALL, "ja_JP.utf8");

The LC_ALL parameter sets the locale for all categories, and the ja_JP.utf8 parameter sets the language and locale to Japanese/Japan in Unicode UTF-8.

Next, we will need some string of Japanese text that we want to examine and create our kanji concordance from. In our simple example we will use a hard-coded string. But in a real application we would probably dynamically input the string from some source.

$string = "私の名前はマークです。私はテキサス大学を卒業しました。私はすしが大好きです。
           私は漢字が好きです。";

Once we have our input string, we need to strip it of everything but the kanji, since that is all we are interested in. Japanese text can have hiragana, katakana, English characters, and various punctuation. If we remove all of those, we’ll be left with just the kanji. We will define a regular expression to match these unwanted characters, and then replace them with nothing.

$pattern = "/[a-zA-Z0-9０-９あ-んア-ンー。、？！＜＞：　「」（）｛｝≪≫〈〉《》【】
            『』〔〕［］・\n\r\t\s\(\) ]/u";

This regular expression pattern is fairly straightforward. a-zA-Z0-9 matches the English alphanumeric characters. We also match the double-byte numbers with ０-９. あ-ん will match all the hiragana, and ア-ン will match all of the katakana. Finally, we match the various punctuation marks and other special characters we can expect to find. The u at the end of the pattern is a pattern modifier that tells PHP that this pattern is Unicode UTF-8. I’ve probably left out some punctuation characters but for our example purposes this will do. (We can actually do a much simpler regex than this. For a thorough discussion of regular expressions for Japanese text, see this post on Japanese regex.)

We will use the regular expression search and replace function preg_replace() to match our input string against the regex pattern to remove the unwanted characters.

$kanjiString = preg_replace($pattern, "", $string);

The first parameter, $pattern, is the regular expression pattern to match against. The second parameter “” is an empty string that we use to replace the regex matches. We match an unwanted non-kanji character and replace it with nothing; in other words, we delete it. The last parameter $string is the input string of Japanese to match against the regex pattern and remove everything but the kanji.

The variable $kanjiString now contains only the kanji characters from our original input string.

// $kanjiString = "私名前私大学卒業私大好私漢字好";

Our next step is to split up all the kanji characters and insert them into an array. We will do this in one step with the split by regular expression function preg_split().

$kanjiArray = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

The first parameter “//u” is an empty regular expression, which matches the empty string between every pair of characters, and the u pattern modifier puts it in Unicode (UTF-8) match mode so that multibyte characters are not split into bytes. The second parameter $kanjiString is the input string to match against the regular expression. The third parameter -1 is the limit parameter, and -1 indicates no limit. This means it will parse the entire string. The final parameter is the PREG_SPLIT_NO_EMPTY flag. This flag sets it so only non-empty items will be returned.
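For comparison, here is a short sketch that checks the preg_split() approach against mb_str_split(), which does the same per-character split. This alternative is not part of the original code and assumes PHP 7.4 or later with the mbstring extension; the variable names are made up for the example.

```php
<?php
$kanjiString = "私名前私大学卒業";

// The approach from this post: split on the empty pattern in UTF-8 mode.
$viaPregSplit = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

// Alternative (assumes PHP 7.4+ with the mbstring extension).
$viaMbStrSplit = mb_str_split($kanjiString, 1, "UTF-8");

var_dump($viaPregSplit === $viaMbStrSplit); // bool(true)
echo count($viaPregSplit), "\n";            // 8
```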

Now that we have an array full of individual kanji, we want to count them to get our kanji usage numbers. The array_count_values() function will count all the values of our input array, and return a new array with those values and their usage count.

$countedArray = array_count_values($kanjiArray);

With our array of kanji and their usage counts, we just need to sort them by count in descending order with the arsort() function, which keeps each kanji attached to its count.

arsort($countedArray);

Our $countedArray now contains a list of all the kanji used from our input string in order of their usage counts. In other words, we have successfully built a kanji usage count concordance.

The final step is to iterate through the array and display our concordance to the screen. We will do this with a simple foreach loop over our array.

foreach ($countedArray as $kanji => $count) {
   echo "$count $kanji <br/>";
}

Our kanji usage count concordance will display like this (kanji with the same usage count may appear in a different order depending on the PHP version):

4 私
2 好
2 大
1 漢
1 字
1 業
1 学
1 名
1 前
1 卒

There we go. We have a list of all the kanji used and in order from most used to least used. As we can see, 私 seems to be a pretty important kanji. Better put it on your list to study.

In this example our Japanese input string was hard coded, but we can easily expand this code to take in input from a file or even screen scrape a Web site and see what their most used kanji are. With a large enough input sample, we can get a pretty good list of kanji and usage counts for our concordance.
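As a rough sketch of that expansion, the function below accepts any text or HTML string and returns the concordance as an array. The function name kanjiConcordance() is hypothetical, and so is the design choice of collecting kanji by code point range (using the ranges from the Japanese regex post) rather than deleting everything else.

```php
<?php
// Hypothetical function; the name and the choice to match kanji by
// code point range (instead of deleting everything else) are assumptions.
function kanjiConcordance(string $text): array
{
    // Drop any HTML tags first so markup is never counted.
    $plain = strip_tags($text);

    // Collect every kanji using the code point ranges from the regex post.
    $kanjiPattern = '/[\x{3400}-\x{4DB5}\x{4E00}-\x{9FCB}\x{F900}-\x{FA6A}]/u';
    preg_match_all($kanjiPattern, $plain, $matches);

    $counts = array_count_values($matches[0]);
    arsort($counts); // highest usage count first
    return $counts;
}

// The input could just as well come from file_get_contents() on a file.
$html = "<p>私の名前はマークです。</p><p>私は漢字が好きです。</p>";
foreach (kanjiConcordance($html) as $kanji => $count) {
    echo "$count $kanji\n";
}
```

Matching the kanji directly means there is no punctuation list to maintain, at the cost of depending on the listed ranges being complete for your text.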

PHP Source Code

Here is the full source code for the kanji usage count concordance in PHP that we built.

<?php
setlocale(LC_ALL, "ja_JP.utf8");

$string = "私の名前はマークです。私はテキサス大学を卒業しました。私はすしが大好きです。
           私は漢字が好きです。";

$pattern = "/[a-zA-Z0-9０-９あ-んア-ンー。、？！＜＞：　「」（）｛｝≪≫〈〉《》【】
            『』〔〕［］・\n\r\t\s\(\) ]/u";
$kanjiString = preg_replace($pattern, "", $string);

$kanjiArray = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

$countedArray = array_count_values($kanjiArray);
arsort($countedArray);

foreach ($countedArray as $kanji => $count) {
   echo "$count $kanji <br/>";
}
?>

Japanese Input on OpenSUSE Linux 12.1 (KDE 4.7)

Tuesday, December 6th, 2011

Setting up Japanese input IME (日本語入力方法) on openSUSE Linux 12.1 is not difficult, but it requires a little know-how of what packages need to be installed. It only takes a few minutes to download all the files and get it set up. Once installed and configured, you will be able to type in Japanese in your Linux applications. If you’ve used the previous 11.4 version of openSUSE, it’s exactly the same, although some icons look a little different.

Prerequisites

  • YaST software repositories are configured properly.

Setup Procedure

Click on the Kickoff Application Launcher.

On the Computer tab, click Install/Remove Software.

On the Search tab, search for anthy.

In the search results window showing the matching packages, select the anthy and ibus-anthy packages.

Press the Accept button on the bottom right of the window.

YaST will now download, install, and configure the anthy packages.

Do the same for ibus. Open Install/Remove Software, search for ibus, and select the package for ibus. Press Accept to install.

Click on the Kickoff Application Launcher, and from the Leave tab, click Restart to restart openSUSE with the new configuration.

 

After restarting, log back in.

You will now have the IBus input method framework keyboard icon in the bottom panel.

Right click the IBus input method framework keyboard icon and click on Preferences.

On the Input Method tab, select Japanese → Anthy from the dropdown menu.

Press the Add button to add Japanese Anthy input method, and then press Close.

Open up a text editor or any application with a text input window, and click on the IBus input method framework icon and select Japanese – Anthy.

The IBus input method framework keyboard icon will change to the Anthy Aち icon.

You can now type in Japanese.

Click the Anthy Aち icon to select between the various Japanese input modes.

That’s it. Setting up Japanese input on openSUSE 12.1 is not very difficult. When you try to type Japanese, make sure the cursor is in a text box in an application, or you may get an error saying No input window. 日本語入力方法を楽しんでください。

Japanese Input on Linux Mint 12 Lisa

Saturday, December 3rd, 2011

This tutorial will show you how to set up and install Japanese input method IME (日本語入力方法) on Linux Mint 12 Lisa so you can type in Japanese. Linux Mint is quickly becoming one of the more popular Linux distributions. Linux Mint 12 comes in Gnome 2 and Gnome 3 varieties. This tutorial works for either version; however, the menus look a little different in Gnome 2.

Linux Mint 12 Japanese IME Setup Procedure

Click on the Mint Menu and navigate to Other → Software Manager.

In the Software Manager, search for ibus.

Select ibus.

Click Install.

In the Authentication Required dialog box, enter your system password and press Authenticate.

Software Manager will now download and install IBus in the background.

While IBus is installing, search for anthy.

Select ibus-anthy and click Install.

In the Authentication Required dialog box, enter your system password and press Authenticate.

Software Manager will now download and install ibus-anthy in the background.

When the activity bar on the bottom shows 0 ongoing actions, installation is complete.

Close Software Manager.

From the Mint Menu, navigate to System Tools → System Settings.

Open Language Support.

Note: If language support was not installed during the Mint install process you may get a pop up dialog indicating that the language support is not installed completely. In that case, select Install to install the language support. In the Authentication Required dialog box, enter your system password and press Authenticate. The Applying changes screen will display and show the installation progress. When the language support has been fully installed, the Language Support screen will display.

On the Language Support screen, select Install / Remove Languages….

Scroll down and check Japanese, and then press Apply Changes.

On the Language Support screen, press the Keyboard input method system: drop down and select ibus.

Then press Close.

Click on the Mint Menu and select System Tools → IBus.

You should now have the little IBus keyboard icon displayed somewhere on the right side of your Gnome top panel.

Click on the IBus keyboard icon and select Preferences.

On the IBus Preferences screen, go to the Input Method tab.

Press the Select an input method dropdown and select Japanese → Anthy.

Press Add on the IBus Preferences screen to add the Anthy Japanese input method.

Open a text application like Text Editor. While the cursor is in the text field, press the IBus keyboard icon in the top panel and select Japanese – Anthy.

The Japanese Anthy toolbar should appear and you can now type in Japanese. Place the cursor in a text input application like Text Edit and try to type in Japanese.

That’s all there is to it. Linux Mint is known to be a very easy to use distribution, but it takes quite a few more steps to install Japanese input than the latest versions of Ubuntu or Fedora.

Note: I had issues with the Anthy toolbar not appearing; instead, an icon showed that usually means no input window was found. I could still type in Japanese in this mode, so don't worry if this happens to you.

Note: I had trouble when trying to add Japanese on the Install / Remove Languages screen. It worked fine in the Gnome 2 version of Mint 12 I installed in a virtual machine, but it gave a Software database is broken error message in the Gnome 3 version of Mint 12 I installed on a physical laptop. I tried reinstalling twice but I kept getting the same problem. It may have been a problem with the laptop because I also had issues when trying to install drivers for the wireless card. I had no issues with Japanese input on Mint in a VM.

ALC Advanced Search Options (英辞郎 on the Web)

Wednesday, November 23rd, 2011

Space ALC (英辞郎 on the Web) is one of the most useful Japanese translation tools on the Internet. It is a translation dictionary and translation memory that can be searched in both Japanese and English. It has everything from highly technical terminology to colloquial spoken slang. The key feature that separates ALC from all the other online dictionaries is the huge set of example sentences it has in its database. Whether you are looking up a word or phrase, ALC returns results for what you looked up as well as in-context example sentences.

ALC has many advanced search options similar to search engines like Google that you can use to refine your search queries. Let’s take a look at some of these search options.

Basic Search Options

And Search (Word1 Word2)

Search for phrases containing two or more search terms in the results. The search results will contain all the search terms.

Instructions: Put a space between each search term to be included in the search result.

Example: 野球 サッカー

Example: up down

Or Operator [Standalone] (Word1 | Word2)

Search for phrases containing one or more search terms in the results. The search results will contain at least one of the search terms.

Instructions: Put a | (vertical bar) between the search terms.

Example: 製造装置 | 製造設備

Example: USPS | FedEx

Or Operator [Within Phrase] (Word1 | Word2)

Search for different variations of phrases containing one of the terms in the parenthesis.

Instructions: Put a | (vertical bar) between the search terms that are inside of ().

Example: (ケーキ|ピザ)を食べます

Example: do (one’s | my | your | his | her | its | our | their) best

Exact Phrase Search (“Phrase”)

Search for an exact phrase.

Instructions: Put the phrase within double quotes “”.

Example: “open source software”

Advanced Search Options

Designating Number of Words In Between (Word {#} Word)

Specify a certain number of words between search terms.

Instructions: Put the number of words you want to appear between words in braces. For a specific number of words, put one number, like {2}. For a range of possibilities, put the end limits in braces, like {1,3}.

Example: make {2} request

This example will find phrases like make a personal request that have two words between make and request, but will not find phrases like make a request that only have one word in between.

Example: thank you {2,4} cooperation

Search All Conjugations ([Word])

Search for all variations of an English word such as verb conjugations and plurals, etc.

Instructions: Put the variable word in brackets [].

Example: “[go] the distance”

This example finds all forms of the word go, including the past tense went the distance. Notice we put the entire search query in quotes to find the full phrase.

Example: [take] pictures of

This example finds take, takes, taking, took, etc.

Terms to Exclude (-Word)

Exclude certain translations from your search results. Useful to narrow your focus when there are multiple translations for a word.

Instructions: Put a dash – before the word to exclude.

Example: サッカー -soccer

This example will find examples of the word 「サッカー」 that exclude the American translation of soccer, and finds those examples that use football instead.

Example: diet -国会

This example will exclude the Japanese governmental body the Diet. This is useful if you are looking for food and diet related translations.

Multiple Search Options

You can combine search options for really advanced search queries.

Example: “[take] (my | our | your | his | her) picture -can”

This example uses the exact phrase quotes, the conjugation search [], the or operator within a phrase, and the not operator to remove phrases containing the word can.

Searching ALC can often find hundreds of translations. These advanced search options are easy to use and can help narrow down what you are looking for.

For more information, refer to the ALC Help – Basic Usage, Help – High Level Usage, and Search Tips pages. These ALC help pages are all in Japanese.

Japanese Input on Fedora 16 Linux (Gnome 3)

Saturday, November 12th, 2011

Setting up Japanese input (IME) on Fedora 16 Linux is really easy and only takes a few minutes.

Fedora still uses the IBus keyboard input method system and uses the Anthy Japanese input method for the Japanese keyboard input, so it will be a familiar process to set up and use if you have done it on earlier Fedora Linux distributions.

For previous versions of Fedora, refer to:

Fedora 16 Japanese IME Setup Procedure

To start, open Activities from the Top Panel.

In the Search Box, type Input Method and select the Input Method Selector.

In the Input Method Selector screen, select Use IBus (recommended).

 

Press the Preference… link to the right of Use IBus (recommended) to open the IBus Preferences screen.

On the Input Method tab, check the Customize active input methods check box.

Press the Select an input method dropdown and select Show all input methods.

Press the Select an input method dropdown once again and now select Japanese → Anthy.

Press the Add button, and then press Close.

You must log out for the changes to take effect, so press the Log Out button on the Input Method Selector screen.

When you log back in you will now have the IBus input method framework button on the Gnome top panel (It looks like a small keyboard). This is the button to change input modes. Open a text editor such as gedit or some other application with a text input window.

Press the IBus input method framework button and select Japanese – Anthy.

The keyboard icon has now changed to Aち, which shows the letter A and the hiragana character chi, which is probably trying to get something close to the pronunciation of Anthy while indicating Japanese/English input modes.

You should now be able to type in Japanese.

Use the Anthy Aち button to toggle between Japanese, English, and other Japanese IME modes.

Note: I did not have to log out and log back in for the changes to take effect to allow me to type in Japanese in Firefox. However, there may be applications that cannot take advantage of the IME changes until after logging out.

Note: If you get the message No input window when you try to select Japanese Anthy, make sure you have the mouse cursor in an application with a text input box, such as a text editor or a Web browser.

That’s it. You should be able to type in Japanese now. Setting up Japanese IME input on Fedora Linux is simple and very similar to previous versions of Fedora.