Regular Expressions for Japanese Text

January 20th, 2012

Regular expressions are extremely useful for matching patterns in text. But when it comes to Japanese Unicode text, it isn’t obvious how to write a regular expression that matches a range of Japanese characters. You can try something like [あ-ん] to match all hiragana characters, and you would be close, but that range misses the small ぁ that comes before あ and the characters ゔ, ゕ, and ゖ that come after ん. Also, direct input of Japanese isn’t always an option.

To deal with this, know that each character in Unicode has a hexadecimal code point. For example, the code point for the hiragana あ is 3042, written U+3042. A regular expression can match a character by its code point. In this post I write that generically as \x3042, which stands for a hiragana あ. The exact escape depends on the regex flavor (\x{3042} in PHP and Perl, \u3042 in JavaScript, Java, and Python), and typed literally, \x3042 is often read as \x30 followed by the characters 42. This is very useful for programmers who must code pattern matching for Japanese on a system where they cannot input or display Japanese text, or who lack the know-how to do it (see some of my great Japanese input posts if you need to know how!).
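
As a quick illustration of why the escape syntax matters, here is a minimal sketch in Python (my choice for illustration only; the examples further down use PHP and Perl). It assumes Python 3's built-in re module, where \x takes exactly two hex digits and \u takes exactly four.

```python
import re

# U+3042 is the code point of the hiragana あ.
assert ord("あ") == 0x3042

# \uXXXX takes four hex digits, so this pattern matches あ.
by_code_point = re.compile(r"\u3042")

# \xXX takes only two hex digits, so this is "\x30" (the digit 0)
# followed by the literal characters "42", not あ.
misparsed = re.compile(r"\x3042")

print(bool(by_code_point.fullmatch("あ")))  # True
print(bool(misparsed.fullmatch("あ")))      # False
print(bool(misparsed.fullmatch("042")))     # True
```

The last line is the trap: the pattern silently matches the ASCII text 042 instead of raising an error.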

Additionally, some flavors of regular expressions support Unicode properties for scripts and blocks. These are predefined character classes: a script property such as \p{Hiragana} matches every character belonging to that script, while a block property (written \p{InHiragana} in some flavors) matches one contiguous range of code points. Hiragana, katakana, and kanji each have one, which is very convenient if you need to match a full script in your regular expression.
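
Not every flavor has these properties. Python's built-in re module, for instance, does not understand \p{...} (the third-party regex package does, as far as I know). To show what a script class amounts to, here is a rough sketch that labels characters by the prefix of their Unicode name. The function name rough_script and the label strings are made up for this example, and name prefixes only approximate the real Script property.

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Guess a script label for one character from its Unicode name.

    This is an approximation based on name prefixes, not the real
    Unicode Script property.
    """
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith(("KATAKANA", "HALFWIDTH KATAKANA")):
        return "Katakana"
    if name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")):
        return "Han"
    return "Other"

# Label each character of a mixed string.
labels = [(ch, rough_script(ch)) for ch in "ひらがなカタカナ漢字abc"]
print(labels)
```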

With this basic knowledge in hand, the following is a list of Japanese character classes and the regular expressions that match them. Further down are a few programming examples showing them in use.

Hiragana

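Unicode code points regex: [\x3041-\x3096]
Unicode block property regex: \p{Hiragana}

Note: the code point range ends at ゖ (U+3096). The marks that follow it in the list below (U+3099 to U+309F) fall outside that range.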

ぁ あ ぃ い ぅ う ぇ え ぉ お か が き ぎ く ぐ け げ こ ご さ ざ し じ す ず せ ぜ そ ぞ た だ ち ぢ っ つ づ て で と ど な に ぬ ね の は ば ぱ ひ び ぴ ふ ぶ ぷ へ べ ぺ ほ ぼ ぽ ま み む め も ゃ や ゅ ゆ ょ よ ら り る れ ろ ゎ わ ゐ ゑ を ん ゔ ゕ ゖ  ゙ ゚ ゛ ゜ ゝ ゞ ゟ

Katakana (Full Width)

Unicode code points regex: [\x30A0-\x30FF]
Unicode block property regex: \p{Katakana}

゠ ァ ア ィ イ ゥ ウ ェ エ ォ オ カ ガ キ ギ ク グ ケ ゲ コ ゴ サ ザ シ ジ ス ズ セ ゼ ソ ゾ タ ダ チ ヂ ッ ツ ヅ テ デ ト ド ナ ニ ヌ ネ ノ ハ バ パ ヒ ビ ピ フ ブ プ ヘ ベ ペ ホ ボ ポ マ ミ ム メ モ ャ ヤ ュ ユ ョ ヨ ラ リ ル レ ロ ヮ ワ ヰ ヱ ヲ ン ヴ ヵ ヶ ヷ ヸ ヹ ヺ ・ ー ヽ ヾ ヿ

Kanji

Unicode code points regex: [\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]
Unicode block property regex: \p{Han}

漢字 日本語 文字 言語 言葉 etc. Too many characters to list.

This regular expression will match all the kanji, including those used in Chinese.

Kanji Radicals

Unicode code points regex: [\x2E80-\x2FD5]

⺀ ⺁ ⺂ ⺃ ⺄ ⺅ ⺆ ⺇ ⺈ ⺉ ⺊ ⺋ ⺌ ⺍ ⺎ ⺏ ⺐ ⺑ ⺒ ⺓ ⺔ ⺕ ⺖ ⺗ ⺘ ⺙ ⺚ ⺛ ⺜ ⺝ ⺞ ⺟ ⺠ ⺡ ⺢ ⺣ ⺤ ⺥ ⺦ ⺧ ⺨ ⺩ ⺪ ⺫ ⺬ ⺭ ⺮ ⺯ ⺰ ⺱ ⺲ ⺳ ⺴ ⺵ ⺶ ⺷ ⺸ ⺹ ⺺ ⺻ ⺼ ⺽ ⺾ ⺿ ⻀ ⻁ ⻂ ⻃ ⻄ ⻅ ⻆ ⻇ ⻈ ⻉ ⻊ ⻋ ⻌ ⻍ ⻎ ⻏ ⻐ ⻑ ⻒ ⻓ ⻔ ⻕ ⻖ ⻗ ⻘ ⻙ ⻚ ⻛ ⻜ ⻝ ⻞ ⻟ ⻠ ⻡ ⻢ ⻣ ⻤ ⻥ ⻦ ⻧ ⻨ ⻩ ⻪ ⻫ ⻬ ⻭ ⻮ ⻯ ⻰ ⻱ ⻲ ⻳
⼀ ⼁ ⼂ ⼃ ⼄ ⼅ ⼆ ⼇ ⼈ ⼉ ⼊ ⼋ ⼌ ⼍ ⼎ ⼏ ⼐ ⼑ ⼒ ⼓ ⼔ ⼕ ⼖ ⼗ ⼘ ⼙ ⼚ ⼛ ⼜ ⼝ ⼞ ⼟ ⼠ ⼡ ⼢ ⼣ ⼤ ⼥ ⼦ ⼧ ⼨ ⼩ ⼪ ⼫ ⼬ ⼭ ⼮ ⼯ ⼰ ⼱ ⼲ ⼳ ⼴ ⼵ ⼶ ⼷ ⼸ ⼹ ⼺ ⼻ ⼼ ⼽ ⼾ ⼿ ⽀ ⽁ ⽂ ⽃ ⽄ ⽅ ⽆ ⽇ ⽈ ⽉ ⽊ ⽋ ⽌ ⽍ ⽎ ⽏ ⽐ ⽑ ⽒ ⽓ ⽔ ⽕ ⽖ ⽗ ⽘ ⽙ ⽚ ⽛ ⽜ ⽝ ⽞ ⽟ ⽠ ⽡ ⽢ ⽣ ⽤ ⽥ ⽦ ⽧ ⽨ ⽩ ⽪ ⽫ ⽬ ⽭ ⽮ ⽯ ⽰ ⽱ ⽲ ⽳ ⽴ ⽵ ⽶ ⽷ ⽸ ⽹ ⽺ ⽻ ⽼ ⽽ ⽾ ⽿ ⾀ ⾁ ⾂ ⾃ ⾄ ⾅ ⾆ ⾇ ⾈ ⾉ ⾊ ⾋ ⾌ ⾍ ⾎ ⾏ ⾐ ⾑ ⾒ ⾓ ⾔ ⾕ ⾖ ⾗ ⾘ ⾙ ⾚ ⾛ ⾜ ⾝ ⾞ ⾟ ⾠ ⾡ ⾢ ⾣ ⾤ ⾥ ⾦ ⾧ ⾨ ⾩ ⾪ ⾫ ⾬ ⾭ ⾮ ⾯ ⾰ ⾱ ⾲ ⾳ ⾴ ⾵ ⾶ ⾷ ⾸ ⾹ ⾺ ⾻ ⾼ ⾽ ⾾ ⾿ ⿀ ⿁ ⿂ ⿃ ⿄ ⿅ ⿆ ⿇ ⿈ ⿉ ⿊ ⿋ ⿌ ⿍ ⿎ ⿏ ⿐ ⿑ ⿒ ⿓ ⿔ ⿕

Katakana and Punctuation (Half Width)

Unicode code points regex: [\xFF5F-\xFF9F]

｟ ｠ ｡ ｢ ｣ ､ ･ ｦ ｧ ｨ ｩ ｪ ｫ ｬ ｭ ｮ ｯ ｰ ｱ ｲ ｳ ｴ ｵ ｶ ｷ ｸ ｹ ｺ ｻ ｼ ｽ ｾ ｿ ﾀ ﾁ ﾂ ﾃ ﾄ ﾅ ﾆ ﾇ ﾈ ﾉ ﾊ ﾋ ﾌ ﾍ ﾎ ﾏ ﾐ ﾑ ﾒ ﾓ ﾔ ﾕ ﾖ ﾗ ﾘ ﾙ ﾚ ﾛ ﾜ ﾝ ﾞ ﾟ

Japanese Symbols and Punctuation

Unicode code points regex: [\x3000-\x303F]

、 。 〃 〄 々 〆 〇 〈 〉 《 》 「 」 『 』 【 】 〒 〓 〔 〕 〖 〗 〘 〙 〚 〛 〜 〝 〞 〟 〠 〡 〢 〣 〤 〥 〦 〧 〨 〩 〪 〫 〬 〭 〮 〯 〰 〱 〲 〳 〴 〵 〶 〷 〸 〹 〺 〻 〼 〽 〾 〿

Miscellaneous Japanese Symbols and Characters

Unicode code points regex: [\x31F0-\x31FF\x3220-\x3243\x3280-\x337F]

ㇰ ㇱ ㇲ ㇳ ㇴ ㇵ ㇶ ㇷ ㇸ ㇹ ㇺ ㇻ ㇼ ㇽ ㇾ ㇿ
㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩ ㈪ ㈫ ㈬ ㈭ ㈮ ㈯ ㈰ ㈱ ㈲ ㈳ ㈴ ㈵ ㈶ ㈷ ㈸ ㈹ ㈺ ㈻ ㈼ ㈽ ㈾ ㈿ ㉀ ㉁ ㉂ ㉃
㊀ ㊁ ㊂ ㊃ ㊄ ㊅ ㊆ ㊇ ㊈ ㊉ ㊊ ㊋ ㊌ ㊍ ㊎ ㊏ ㊐ ㊑ ㊒ ㊓ ㊔ ㊕ ㊖ ㊗ ㊘ ㊙ ㊚ ㊛ ㊜ ㊝ ㊞ ㊟ ㊠ ㊡ ㊢ ㊣ ㊤ ㊥ ㊦ ㊧ ㊨ ㊩ ㊪ ㊫ ㊬ ㊭ ㊮ ㊯ ㊰ ㊱ ㊲ ㊳ ㊴ ㊵ ㊶ ㊷ ㊸ ㊹ ㊺ ㊻ ㊼ ㊽ ㊾ ㊿
㋀ ㋁ ㋂ ㋃ ㋄ ㋅ ㋆ ㋇ ㋈ ㋉ ㋊ ㋋  ㋐ ㋑ ㋒ ㋓ ㋔ ㋕ ㋖ ㋗ ㋘ ㋙ ㋚ ㋛ ㋜ ㋝ ㋞ ㋟ ㋠ ㋡ ㋢ ㋣ ㋤ ㋥ ㋦ ㋧ ㋨ ㋩ ㋪ ㋫ ㋬ ㋭ ㋮ ㋯ ㋰ ㋱ ㋲ ㋳ ㋴ ㋵ ㋶ ㋷ ㋸ ㋹ ㋺ ㋻ ㋼ ㋽ ㋾
㌀ ㌁ ㌂ ㌃ ㌄ ㌅ ㌆ ㌇ ㌈ ㌉ ㌊ ㌋ ㌌ ㌍ ㌎ ㌏ ㌐ ㌑ ㌒ ㌓ ㌔ ㌕ ㌖ ㌗ ㌘ ㌙ ㌚ ㌛ ㌜ ㌝ ㌞ ㌟ ㌠ ㌡ ㌢ ㌣ ㌤ ㌥ ㌦ ㌧ ㌨ ㌩ ㌪ ㌫ ㌬ ㌭ ㌮ ㌯ ㌰ ㌱ ㌲ ㌳ ㌴ ㌵ ㌶ ㌷ ㌸ ㌹ ㌺ ㌻ ㌼ ㌽ ㌾ ㌿ ㍀ ㍁ ㍂ ㍃ ㍄ ㍅ ㍆ ㍇ ㍈ ㍉ ㍊ ㍋ ㍌ ㍍ ㍎ ㍏ ㍐ ㍑ ㍒ ㍓ ㍔ ㍕ ㍖ ㍗ ㍘ ㍙ ㍚ ㍛ ㍜ ㍝ ㍞ ㍟ ㍠ ㍡ ㍢ ㍣ ㍤ ㍥ ㍦ ㍧ ㍨ ㍩ ㍪ ㍫ ㍬ ㍭ ㍮ ㍯ ㍰ ㍱ ㍲ ㍳ ㍴ ㍵ ㍶  ㍻ ㍼ ㍽ ㍾ ㍿

Alphanumeric and Punctuation (Full Width)

Unicode code points regex: [\xFF01-\xFF5E]

！ ＂ ＃ ＄ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／ ０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？
＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿
｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ～
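
The ranges listed above can also be collected into one table and used to tally what kind of characters a string contains. The following is only a sketch in Python (my choice for illustration). The class names in the dictionary and the helper names char_class and count_classes are labels I made up; the code points are the ones from the sections above.

```python
import re
from collections import Counter

# Code point ranges copied from the sections above, as (first, last) pairs.
CLASSES = {
    "hiragana": [(0x3041, 0x3096)],
    "katakana": [(0x30A0, 0x30FF)],
    "kanji": [(0x3400, 0x4DB5), (0x4E00, 0x9FCB), (0xF900, 0xFA6A)],
    "radicals": [(0x2E80, 0x2FD5)],
    "halfwidth": [(0xFF5F, 0xFF9F)],
    "punctuation": [(0x3000, 0x303F)],
    "misc": [(0x31F0, 0x31FF), (0x3220, 0x3243), (0x3280, 0x337F)],
    "fullwidth": [(0xFF01, 0xFF5E)],
}

def char_class(ranges):
    """Compile a character class from (first, last) code point pairs."""
    body = "".join(f"\\u{lo:04X}-\\u{hi:04X}" for lo, hi in ranges)
    return re.compile(f"[{body}]")

PATTERNS = {name: char_class(ranges) for name, ranges in CLASSES.items()}

def count_classes(text):
    """Count how many characters of text fall into each class."""
    counts = Counter()
    for ch in text:
        for name, pattern in PATTERNS.items():
            if pattern.match(ch):
                counts[name] += 1
                break
        else:
            # No class matched this character.
            counts["other"] += 1
    return counts

# "\uFF21\uFF22\uFF23" is full-width ABC, written as escapes.
sample = "日本語のテキスト、" + "\uFF21\uFF22\uFF23"
print(count_classes(sample))
```

For the sample string this counts 3 kanji, 1 hiragana, 4 katakana, 1 punctuation mark, and 3 full-width letters.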

Japanese RegEx Code Examples

Find all hiragana in a text string

// PHP
$pattern = "/[\x{3041}-\x{3096}]/u";
preg_match_all($pattern, $text, $matches);
print_r($matches);

# Perl
# In list context, /g returns every match rather than a yes/no result.
my @matches = $text =~ m/[\x{3041}-\x{3096}]/g;
print "@matches\n";
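
For comparison, here is roughly the same search sketched in Python (not one of the languages used in this post, so treat it as an illustration). It assumes Python 3's built-in re module and the \uXXXX form of the code point escape; the sample string is made up.

```python
import re

HIRAGANA = re.compile(r"[\u3041-\u3096]")

text = "漢字とひらがなとカタカナ"

# Every individual hiragana character, in order of appearance.
chars = HIRAGANA.findall(text)

# Adding + groups neighbouring hiragana into runs instead.
runs = re.findall(r"[\u3041-\u3096]+", text)

print(chars)  # ['と', 'ひ', 'ら', 'が', 'な', 'と']
print(runs)   # ['とひらがなと']
```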

Remove all hiragana from a text string

// PHP
$pattern = "/\p{Hiragana}/u";
$text = preg_replace($pattern, "", $text);

# Perl
$text =~ s/\p{Hiragana}//g;

Remove everything but Kanji

// PHP
// \P{Han} matches everything other than kanji
$pattern = "/\P{Han}/u";
$text = preg_replace($pattern, "", $text);
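
In a flavor without \P{Han}, a negated class built from the kanji code point ranges can stand in for it. Below is a minimal sketch in Python (again just an illustration; keep_kanji is a made-up helper name). It is not an exact substitute, since \P{Han} is defined by the Unicode script data while this class only knows the three ranges listed earlier.

```python
import re

# Negated class: anything outside the three kanji ranges listed earlier.
NOT_KANJI = re.compile(r"[^\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]")

def keep_kanji(text: str) -> str:
    """Return text with every non-kanji character removed."""
    return NOT_KANJI.sub("", text)

print(keep_kanji("今日は2012年1月20日です。"))  # 今日年月日
```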

Note: In PHP and Perl, a Unicode code point escape is written with curly braces around the hexadecimal digits. So the generic \x3041 used in the lists above becomes \x{3041}, and so on. In PHP the pattern also needs the /u modifier, as in the examples above.

Note: In Perl, you have to make sure Unicode is set up properly for regular expressions to work on Japanese text. You may also have to run perl with the -CS option (perl -CS) to get rid of any "Wide character in print" warnings. See http://ahinea.com/en/tech/perl-unicode-struggle.html for more information.
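
The same principle applies in any language: the patterns in this post describe Unicode characters, so bytes in Shift_JIS or EUC-JP have to be decoded before matching. Here is a sketch of that step in Python (my choice for illustration). The helper name decode_japanese and the order of candidate encodings are assumptions of mine, and trial decoding is only a heuristic that can pick the wrong encoding.

```python
import re

HIRAGANA = re.compile(r"[\u3041-\u3096]")

def decode_japanese(raw: bytes, candidates=("utf-8", "shift_jis", "euc_jp")) -> str:
    """Decode raw bytes by trying each candidate encoding in turn.

    Trial decoding is a heuristic: the first encoding that does not
    raise wins, which is not always the right one.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# Stand-in for bytes read from a Shift_JIS file or web page.
raw = "これは日本語です".encode("shift_jis")
text = decode_japanese(raw)
print(HIRAGANA.findall(text))
```

If you already know the encoding, pass it as the only candidate rather than guessing.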

6 Responses to “Regular Expressions for Japanese Text”

  1. Daniele says:

    Excellent. Thanks! This is great to know ;)

  2. Daniele says:

    hmm I’m having some trouble though. If a page is encoded in EUC-JP or in Shift_JIS the $pattern = "/\P{Han}/u"; does not work.

    So this code returns a blank page:

    I’m trying to figure out how to make it detect the page’s encoding, convert to UTF-8 so that I can use the \P{Han}/u regexp to only get lists of Kanji from websites.
    Any insight into how this may be done?

    Thanks :D

  3. Daniele says:

    Oops the code didn’t go through. Here it is again:

    setlocale(LC_ALL, "ja_JP.UTF8");

    $url = file("http://www.asahi.com/", FILE_SKIP_EMPTY_LINES);
    $pattern = "/\P{Han}/u";
    $kanjiString = "";

    foreach ($url as $line) {
        $kanjiString .= preg_replace($pattern, "", $line);
    }

    echo $kanjiString;

  4. mark says:

    Hi Daniele,

    PHP has two convenient functions to do what you want.
    mb_check_encoding()
    mb_convert_encoding()

    Something simple like this should do the trick for you.
    if (!mb_check_encoding($text, "UTF-8")) {
    $text = mb_convert_encoding($text, "UTF-8", "Shift-JIS, EUC-JP");
    }

  5. Taylor says:

    I was putting off having to do something like this myself. Thanks so much for this. I’ll have to experiment later, but I trust all is in good order.

  6. Raul V says:

    Thanks for your posts.
    Actually I would like to ask for your help.
    I tried this:
    perl -pe 's!\p{Hiragana}!C!g' myfile.txt
    It doesn’t do anything.
    perl -pe 's!\P{InHiragana}!C!g' myfile.txt
    replaces everything, including the kanji.
    Using the range [\x3041-\x3096] also failed.
    Any ideas?
