Kanji Usage Count Concordance in PHP

December 7th, 2011

A concordance is a list of all words used in a document, Web site, or publication, and some additional useful information about those words. A useful concordance in translation and localization is a list of the most frequently used words. This can be used to identify important terms that should be picked up for a glossary. It can also be used by students of a foreign language to identify important vocabulary words they should dedicate time to study.

Students of Japanese often wonder which kanji they should learn. It can be hard to identify which kanji are most important, and the most important kanji differ from one subject area to another.

To help with this, I’ve written a kanji concordance application in PHP that produces a list of kanji and their usage counts in descending order.

Example

If you had this Japanese text:

私の名前はマークです。私はテキサス大学を卒業しました。すしが大好きです。

The kanji concordance would generate a list that looked like this:

2 私
2 大
1 名
1 前
1 学
1 卒
1 業
1 好

The kanji 私 and 大 are both used twice, so they are at the top of the list with the number 2 for the usage count. The rest of the kanji are used once and show a usage count of 1.

Kanji Concordance Code Explanation

The first thing we do in PHP is set the language locale with the setlocale() function. This is always good practice when dealing with language-related applications.

setlocale(LC_ALL, "ja_JP.utf8");

The LC_ALL parameter sets the locale for all categories, and the ja_JP.utf8 parameter sets the language and locale to Japanese/Japan in Unicode UTF-8.
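Locale names are not spelled the same way on every system, so it can be worth checking whether the call succeeded. The following is a minimal sketch, assuming that either "ja_JP.utf8" or "ja_JP.UTF-8" may be installed on the machine. setlocale() accepts several candidate names and returns false if none of them can be set.

```php
<?php
// Try two common spellings of the Japanese UTF-8 locale name.
// Which names are installed depends on the operating system (an assumption here).
$locale = setlocale(LC_ALL, "ja_JP.utf8", "ja_JP.UTF-8");

if ($locale === false) {
    // None of the requested locales exist; the current locale is left unchanged.
    echo "Japanese locale not available\n";
} else {
    echo "Locale set to: $locale\n";
}
```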

Next, we need a string of Japanese text to examine and build our kanji concordance from. In our simple example we will use a hard-coded string, but in a real application we would probably read the string from a file or some other source.

$string = "私の名前はマークです。私はテキサス大学を卒業しました。私はすしが大好きです。
           私は漢字が好きです。";

Once we have our input string, we need to strip it of everything but the kanji, since that is all we are interested in. Japanese text can have hiragana, katakana, English characters, and various punctuation. If we remove all of those, we’ll be left with just the kanji. We will define a regular expression to match these unwanted characters, and then replace them with nothing.

$pattern = "/[a-zA-Z0-90-9あ-んア-ンー。、?!<>: 「」(){}≪≫〈〉《》【】
            『』〔〕[]・\n\r\t\s\(\) ]/u";

This regular expression pattern is fairly straightforward. a-zA-Z0-9 matches the English alphanumeric characters. We also match the double-byte (full-width) numbers with 0-9. あ-ん matches the hiragana and ア-ン matches the katakana, apart from a few characters that fall just outside those ranges, such as the small ぁ and ァ and the katakana ヴ. Finally, we match the various punctuation marks and other special characters we can expect to find. The u at the end of the pattern is a pattern modifier that tells PHP to treat the pattern and the input as UTF-8. I’ve probably left out some punctuation characters, but for our example purposes this will do. (We can actually use a much simpler regex than this. For a thorough discussion of regular expressions for Japanese text, see this post on Japanese regex.)

We will use the regular expression search and replace function preg_replace() to match our input string against the regex pattern to remove the unwanted characters.

$kanjiString = preg_replace($pattern, "", $string);

The first parameter, $pattern, is the regular expression pattern to match against. The second parameter, “”, is an empty string that we use to replace the regex matches. We match an unwanted non-kanji character and replace it with nothing; in other words, we delete it. The last parameter, $string, is the input string of Japanese text from which everything but the kanji is removed.
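One thing to watch for: preg_replace() returns null instead of a string when matching fails, for example when the input is not valid UTF-8 and the u modifier is used. The sketch below shows one way to detect that. The short hiragana-only pattern and the deliberately broken input are illustrative assumptions, not part of the concordance code.

```php
<?php
// Illustrative pattern and input only: "\xFF" is never valid in UTF-8.
$pattern  = '/[あ-ん]/u';
$badInput = "\xFF私の名前";

$result = preg_replace($pattern, "", $badInput);

if ($result === null) {
    // preg_last_error() reports why the last PCRE call failed.
    $isBadUtf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR);
    echo $isBadUtf8 ? "Input is not valid UTF-8\n" : "Regex error\n";
} else {
    echo $result, "\n";
}
```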

The variable $kanjiString now contains only the kanji characters from our original input string.

// $kanjiString = "私名前私大学卒業私大好私漢字好";
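As a rough illustration of the simpler regex mentioned earlier, the sketch below removes everything that is not in the Unicode Han script by using the \P{Han} property class, which PCRE supports when the u modifier is set. This swaps in a different technique from the character list above, and treating "Han script" as equivalent to "kanji" is an assumption that may not suit every text.

```php
<?php
$string = "私の名前はマークです。私はテキサス大学を卒業しました。私はすしが大好きです。私は漢字が好きです。";

// \P{Han} (capital P) matches any character that is NOT in the Han script,
// so replacing runs of those with "" leaves only the kanji.
$kanjiOnly = preg_replace('/\P{Han}+/u', '', $string);

echo $kanjiOnly, "\n"; // 私名前私大学卒業私大好私漢字好
```

Because the pattern names what to keep rather than what to remove, it does not need a list of punctuation, kana, or Latin characters.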

Our next step is to split up all the kanji characters and insert them into an array. We will do this in one step with the split by regular expression function preg_split().

$kanjiArray = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

The first parameter “//u” is an empty regular expression, which matches the empty string at every position between characters, so the string is split at each character boundary. The u pattern modifier puts it in Unicode (UTF-8) mode, so the split happens between whole characters rather than between bytes. The second parameter $kanjiString is the input string to split. The third parameter -1 is the limit parameter, and -1 indicates no limit, so the entire string is split. The final parameter is the PREG_SPLIT_NO_EMPTY flag, which makes preg_split() return only non-empty pieces.
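To see what the flag and the u modifier each contribute, here is a small sketch on a four-kanji string. The variable names are made up for the illustration.

```php
<?php
$kanjiString = "私名前私";

// Without the flag, the empty matches at the very start and end of the
// string produce an empty first and last element.
$withEmpty = preg_split("//u", $kanjiString);

// With the flag, only the characters themselves are returned.
$withoutEmpty = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

// Without the u modifier the string is split into bytes, not characters;
// each of these kanji takes three bytes in UTF-8.
$bytes = preg_split("//", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

echo count($withEmpty), "\n";    // 6
echo count($withoutEmpty), "\n"; // 4
echo count($bytes), "\n";        // 12
```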

Now that we have an array full of individual kanji, we want to count them to get our kanji usage numbers. The array_count_values() function will count all the values of our input array, and return a new array with those values and their usage count.

$countedArray = array_count_values($kanjiArray);
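As a quick illustration of what array_count_values() returns, here is a sketch with a short hand-written array. Each distinct value becomes a key, and the number of times it occurs becomes that key's value.

```php
<?php
// A small sample array, written by hand for the illustration.
$kanjiArray = ["私", "大", "私", "好", "大", "私"];

$countedArray = array_count_values($kanjiArray);

// Keys appear in the order each kanji was first seen.
foreach ($countedArray as $kanji => $count) {
    echo "$kanji => $count\n";
}
// 私 => 3
// 大 => 2
// 好 => 1
```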

With our array of kanji and their usage counts, we just need to sort them in descending order by count with the arsort() function, which keeps each kanji key attached to its count.

arsort($countedArray);
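arsort() only compares the counts, so kanji with equal counts can come out in different orders on different PHP versions. If a repeatable order matters, one option is to sort with an explicit tie-breaker. The sketch below is one possible approach and not part of the original code: it uses uksort() to order by count descending and then by the raw UTF-8 bytes of the kanji. The byte-order tie-break is an arbitrary choice made for the illustration.

```php
<?php
// Hand-written counts for the illustration.
$countedArray = ["名" => 1, "私" => 2, "前" => 1, "大" => 2];

// Copy the counts so the callback can look them up while sorting by key.
$counts = $countedArray;

uksort($countedArray, function ($a, $b) use ($counts) {
    if ($counts[$a] !== $counts[$b]) {
        // Higher count first.
        return $counts[$b] - $counts[$a];
    }
    // Equal counts: fall back to byte-wise comparison of the kanji.
    return strcmp($a, $b);
});

foreach ($countedArray as $kanji => $count) {
    echo "$count $kanji\n";
}
```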

$countedArray now contains a list of all the kanji used in our input string, in order of their usage counts. In other words, we have successfully built a kanji usage count concordance.

The final step is to iterate through the array and display our concordance to the screen. We will do this with a simple foreach loop over our array.

foreach ($countedArray as $kanji => $count) {
   echo "$count $kanji <br/>";
}

Our kanji usage count concordance will display like this:

4 私
2 好
2 大
1 漢
1 字
1 業
1 学
1 名
1 前
1 卒

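There we go. We have a list of all the kanji used, in order from most used to least used. Kanji with the same count may come out in a different order than shown here: PHP 8.0 and later keep tied entries in their original order, while earlier versions make no guarantee. As we can see, 私 seems to be a pretty important kanji. Better put it on your list to study.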

In this example our Japanese input string was hard coded, but we can easily expand this code to take in input from a file or even screen scrape a Web site and see what their most used kanji are. With a large enough input sample, we can get a pretty good list of kanji and usage counts for our concordance.
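As a rough sketch of that expansion, the steps can be wrapped in a function and fed from a file. The function name kanjiConcordance() is hypothetical, and the temporary file stands in for whatever real source you would read from. For brevity the sketch strips non-kanji with the \P{Han} Unicode property instead of the longer character list used in this post.

```php
<?php
/**
 * Hypothetical helper: returns an array of kanji => usage count,
 * sorted from most used to least used.
 */
function kanjiConcordance($text)
{
    // Keep only Han-script characters (an assumption about what counts as kanji).
    $kanjiString = preg_replace('/\P{Han}+/u', '', $text);
    if ($kanjiString === null) {
        return []; // e.g. the input was not valid UTF-8
    }

    $kanjiArray = preg_split('//u', $kanjiString, -1, PREG_SPLIT_NO_EMPTY);
    $counted = array_count_values($kanjiArray);
    arsort($counted);

    return $counted;
}

// Write a small sample file so the example is self-contained.
$path = tempnam(sys_get_temp_dir(), 'kanji');
file_put_contents($path, "私は漢字が好きです。私はすしが大好きです。");

$text = file_get_contents($path);
unlink($path);

$concordance = ($text === false) ? [] : kanjiConcordance($text);

foreach ($concordance as $kanji => $count) {
    echo "$count $kanji\n";
}
```

The same function could be given text from any other source, as long as the text is UTF-8.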

PHP Source Code

Here is the full source code for the kanji usage count concordance in PHP that we built.

<?php
setlocale(LC_ALL, "ja_JP.utf8");

$string = "私の名前はマークです。私はテキサス大学を卒業しました。私はすしが大好きです。
           私は漢字が好きです。";

$pattern = "/[a-zA-Z0-90-9あ-んア-ンー。、?!<>: 「」(){}≪≫〈〉《》【】
            『』〔〕[]・\n\r\t\s\(\) ]/u";
$kanjiString = preg_replace($pattern, "", $string);

$kanjiArray = preg_split("//u", $kanjiString, -1, PREG_SPLIT_NO_EMPTY);

$countedArray = array_count_values($kanjiArray);
arsort($countedArray);

foreach ($countedArray as $kanji => $count) {
   echo "$count $kanji <br/>";
}
?>

2 Responses to “Kanji Usage Count Concordance in PHP”

  1. Edwin says:

    Awesome stuff! I was just researching word frequency for foreign language study, and the problem of the differences in written and spoken language.
    A concordance file based on spoken will be very different from that of written!
    Do you know of any programs that I could input text into, such as a movie transcript/s (subtitle file but in plain text) and have a word/kanji frequency count?

    Awesome website by the way. I wish English was taught using a bit of this method.

  2. Bryan says:

    This splits each and every Kanji. But it ignores words. Some words in Japanese contain 1-3 Kanji.

    If you need the text split by words best to use this PHP version of TinySegmentor:
    http://programming-magic.com/?id=172

    Yes the page is in Japanese but the google translation of it is good enough. Besides, it comes with two example php scripts you can hack away at.
