Regular Expressions for Japanese Text

January 20th, 2012

Regular expressions are extremely useful for matching patterns in text. But when it comes to Japanese Unicode text, it isn’t obvious how to write a regular expression that matches a range of Japanese characters. You can try something like [あ-ん] to match all hiragana characters, and you would be close, but it misses the small ぁ at the start of the range and ゔ, ゕ, and ゖ at the end. Also, direct input of Japanese isn’t always an option.

To deal with this, know that each character in Unicode has a hexadecimal code point. For example, the code point for the hiragana あ is 3042, written U+3042. A code point can be used directly in a regular expression; this article writes it in the shorthand \x3042, which matches a hiragana あ. The exact escape syntax depends on the regex flavor: PHP and Perl need \x{3042}, while flavors such as JavaScript and Java use \u3042. This is very useful for programmers who must code pattern matching for Japanese on a system where they cannot input or display Japanese text, or who lack the know-how to do it (see some of my great Japanese input posts if you need to know how!).
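
As a minimal sketch of the idea, the PHP snippet below looks up the code point of あ and then matches the character by its code point escape. It assumes PHP 7.2 or later with the mbstring extension (for mb_ord) and a source file saved as UTF-8; the variable names are only illustrative.

```php
<?php
// Look up the code point of a character (requires mbstring, PHP 7.2+).
$char = 'あ';
$codePoint = strtoupper(dechex(mb_ord($char, 'UTF-8')));
echo "U+{$codePoint}\n";

// Match the same character by code point; the /u modifier enables UTF-8 mode.
// Single quotes keep PHP itself from interpreting the backslash escape.
$found = preg_match('/\x{3042}/u', 'これは「あ」です');
echo $found === 1 ? "matched\n" : "no match\n";
```

The pattern itself contains no Japanese characters, so it can be typed and stored on a system that cannot input or display them.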

Additionally, some flavors of regular expressions support Unicode properties: predefined character classes based on Unicode scripts and blocks. The “block property” patterns listed below, such as \p{Hiragana}, are strictly speaking script properties. Hiragana, katakana, and kanji are all covered, which is very convenient if you need a full-script match in your regular expression.
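
To illustrate, here is a small PHP sketch that counts how many characters of a mixed string belong to each script. It assumes a PHP build whose PCRE library has Unicode property support, and the sample string and array keys are made up for the example.

```php
<?php
// Count characters per script in a mixed string using Unicode script properties.
$text = '漢字とカタカナ';

$counts = [
    'Hiragana' => preg_match_all('/\p{Hiragana}/u', $text),
    'Katakana' => preg_match_all('/\p{Katakana}/u', $text),
    'Han'      => preg_match_all('/\p{Han}/u', $text),
];

print_r($counts);
```

Here preg_match_all is used only for its return value, the number of matches, so no matches array is passed.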

With this basic knowledge, the following is a thorough list of different Japanese character classes and the regular expressions that match them. Further down are a few programming examples showing them in use.

Hiragana

Unicode code points regex: [\x3041-\x3096]
Unicode block property regex: \p{Hiragana}

ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ゙゚゛゜ゝゞゟ

Note: the code point range above stops at ゖ (U+3096), so it does not cover the sound marks and iteration marks at the end of this list (U+3099 to U+309F).

Katakana (Full Width)

Unicode code points regex: [\x30A0-\x30FF]
Unicode block property regex: \p{Katakana}

゠ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾヿ

Kanji

Unicode code points regex: [\x3400-\x4DB5\x4E00-\x9FCB\xF900-\xFA6A]
Unicode block property regex: \p{Han}

漢字日本語文字言語言葉 etc. Too many characters to list.

This regular expression will match all the kanji, including those used in Chinese.
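
As a rough sketch of the code point version in use, the PHP below pulls each run of consecutive kanji out of a sentence. The sample text and variable names are illustrative assumptions; the ranges are the ones listed above, rewritten with the braces that PHP requires.

```php
<?php
// Extract runs of consecutive kanji using the code point ranges above.
$text = '日本語の文字と言葉';
$pattern = '/[\x{3400}-\x{4DB5}\x{4E00}-\x{9FCB}\x{F900}-\x{FA6A}]+/u';

preg_match_all($pattern, $text, $matches);
$kanjiRuns = $matches[0];

print_r($kanjiRuns);
```

The + quantifier groups neighboring kanji into one match, so compound words come back whole instead of one character at a time.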

Kanji Radicals

Unicode code points regex: [\x2E80-\x2FD5]

⺀⺁⺂⺃⺄⺅⺆⺇⺈⺉⺊⺋⺌⺍⺎⺏⺐⺑⺒⺓⺔⺕⺖⺗⺘⺙⺚⺛⺜⺝⺞⺟⺠⺡⺢⺣⺤⺥⺦⺧⺨⺩⺪⺫⺬⺭⺮⺯⺰⺱⺲⺳⺴⺵⺶⺷⺸⺹⺺⺻⺼⺽⺾⺿⻀⻁⻂⻃⻄⻅⻆⻇⻈⻉⻊⻋⻌⻍⻎⻏⻐⻑⻒⻓⻔⻕⻖⻗⻘⻙⻚⻛⻜⻝⻞⻟⻠⻡⻢⻣⻤⻥⻦⻧⻨⻩⻪⻫⻬⻭⻮⻯⻰⻱⻲⻳
⼀⼁⼂⼃⼄⼅⼆⼇⼈⼉⼊⼋⼌⼍⼎⼏⼐⼑⼒⼓⼔⼕⼖⼗⼘⼙⼚⼛⼜⼝⼞⼟⼠⼡⼢⼣⼤⼥⼦⼧⼨⼩⼪⼫⼬⼭⼮⼯⼰⼱⼲⼳⼴⼵⼶⼷⼸⼹⼺⼻⼼⼽⼾⼿⽀⽁⽂⽃⽄⽅⽆⽇⽈⽉⽊⽋⽌⽍⽎⽏⽐⽑⽒⽓⽔⽕⽖⽗⽘⽙⽚⽛⽜⽝⽞⽟⽠⽡⽢⽣⽤⽥⽦⽧⽨⽩⽪⽫⽬⽭⽮⽯⽰⽱⽲⽳⽴⽵⽶⽷⽸⽹⽺⽻⽼⽽⽾⽿⾀⾁⾂⾃⾄⾅⾆⾇⾈⾉⾊⾋⾌⾍⾎⾏⾐⾑⾒⾓⾔⾕⾖⾗⾘⾙⾚⾛⾜⾝⾞⾟⾠⾡⾢⾣⾤⾥⾦⾧⾨⾩⾪⾫⾬⾭⾮⾯⾰⾱⾲⾳⾴⾵⾶⾷⾸⾹⾺⾻⾼⾽⾾⾿⿀⿁⿂⿃⿄⿅⿆⿇⿈⿉⿊⿋⿌⿍⿎⿏⿐⿑⿒⿓⿔⿕

Katakana and Punctuation (Half Width)

Unicode code points regex: [\xFF5F-\xFF9F]

｟｠｡｢｣､･ｦｧｨｩｪｫｬｭｮｯｰｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜﾝﾞ
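
A common use for this class is flagging input that contains half-width kana. The sketch below is one way to do that in PHP; containsHalfWidthKana is a hypothetical helper name, and the range is deliberately narrowed to start at U+FF61, because U+FF5F and U+FF60 are full-width white parentheses rather than half-width characters.

```php
<?php
// Hypothetical helper: report whether a string contains half-width
// katakana or half-width Japanese punctuation.
// Assumption: the range starts at U+FF61, skipping U+FF5F and U+FF60,
// which are full-width white parentheses.
function containsHalfWidthKana(string $text): bool
{
    return preg_match('/[\x{FF61}-\x{FF9F}]/u', $text) === 1;
}

var_dump(containsHalfWidthKana('ｶﾀｶﾅ'));     // half-width input
var_dump(containsHalfWidthKana('カタカナ')); // full-width input
```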

Japanese Symbols and Punctuation

Unicode code points regex: [\x3000-\x303F]

、。〃〄々〆〇〈〉《》「」『』【】〒〓〔〕〖〗〘〙〚〛〜〝〞〟〠〡〢〣〤〥〦〧〨〩〪〭〮〯〫〬〰〱〲〳〴〵〶〷〸〹〺〻〼〽〾〿
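
One practical use of this range is splitting text at Japanese punctuation. The following PHP is a minimal sketch of that idea; the sample sentence is invented for illustration.

```php
<?php
// Split a sentence on runs of characters from the
// Japanese symbols and punctuation range (U+3000 to U+303F).
$text = '今日は、晴れです。';
$parts = preg_split('/[\x{3000}-\x{303F}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

PREG_SPLIT_NO_EMPTY drops the empty string that would otherwise follow the final 。.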

Miscellaneous Japanese Symbols and Characters

Unicode code points regex: [\x31F0-\x31FF\x3220-\x3243\x3280-\x337F]

ㇰㇱㇲㇳㇴㇵㇶㇷㇸㇹㇺㇻㇼㇽㇾㇿ
㈠㈡㈢㈣㈤㈥㈦㈧㈨㈩㈪㈫㈬㈭㈮㈯㈰㈱㈲㈳㈴㈵㈶㈷㈸㈹㈺㈻㈼㈽㈾㈿㉀㉁㉂㉃
㊀㊁㊂㊃㊄㊅㊆㊇㊈㊉㊊㊋㊌㊍㊎㊏㊐㊑㊒㊓㊔㊕㊖㊗㊘㊙㊚㊛㊜㊝㊞㊟㊠㊡㊢㊣㊤㊥㊦㊧㊨㊩㊪㊫㊬㊭㊮㊯㊰㊱㊲㊳㊴㊵㊶㊷㊸㊹㊺㊻㊼㊽㊾㊿
㋀㋁㋂㋃㋄㋅㋆㋇㋈㋉㋊㋋㋐㋑㋒㋓㋔㋕㋖㋗㋘㋙㋚㋛㋜㋝㋞㋟㋠㋡㋢㋣㋤㋥㋦㋧㋨㋩㋪㋫㋬㋭㋮㋯㋰㋱㋲㋳㋴㋵㋶㋷㋸㋹㋺㋻㋼㋽㋾
㌀㌁㌂㌃㌄㌅㌆㌇㌈㌉㌊㌋㌌㌍㌎㌏㌐㌑㌒㌓㌔㌕㌖㌗㌘㌙㌚㌛㌜㌝㌞㌟㌠㌡㌢㌣㌤㌥㌦㌧㌨㌩㌪㌫㌬㌭㌮㌯㌰㌱㌲㌳㌴㌵㌶㌷㌸㌹㌺㌻㌼㌽㌾㌿㍀㍁㍂㍃㍄㍅㍆㍇㍈㍉㍊㍋㍌㍍㍎㍏㍐㍑㍒㍓㍔㍕㍖㍗㍘㍙㍚㍛㍜㍝㍞㍟㍠㍡㍢㍣㍤㍥㍦㍧㍨㍩㍪㍫㍬㍭㍮㍯㍰㍱㍲㍳㍴㍵㍶㍻㍼㍽㍾㍿

Alphanumeric and Punctuation (Full Width)

Unicode code points regex: [\xFF01-\xFF5E]

！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞？
＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿
｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ｛｜｝～
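
The ranges listed in this article can also be combined into a single character class. The PHP sketch below does this to check whether a string consists only of kana, kanji, Japanese punctuation, and full-width alphanumerics. isJapaneseOnly is a hypothetical helper, and which ranges count as "Japanese" is an assumption to adjust for your own data.

```php
<?php
// Hypothetical helper: true when every character is kana, kanji,
// Japanese punctuation, or a full-width alphanumeric/punctuation character.
function isJapaneseOnly(string $text): bool
{
    $pattern = '/^[\x{3000}-\x{303F}\x{3041}-\x{3096}\x{30A0}-\x{30FF}'
             . '\x{3400}-\x{4DB5}\x{4E00}-\x{9FCB}\x{F900}-\x{FA6A}'
             . '\x{FF01}-\x{FF5E}]+\z/u';
    return preg_match($pattern, $text) === 1;
}

var_dump(isJapaneseOnly('ＡＢＣは漢字ではありません。'));
var_dump(isJapaneseOnly('ABC is ASCII'));
```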

Japanese RegEx Code Examples

Find all hiragana in a text string

// PHP
$pattern = "/[\x{3041}-\x{3096}]/u";
preg_match_all($pattern, $text, $matches);
print_r($matches);

# Perl
# In list context, m//g returns every match rather than a true/false result
my @matches = $text =~ m/[\x{3041}-\x{3096}]/g;
print "@matches\n";

Remove all hiragana from a text string

// PHP
$pattern = "/\p{Hiragana}/u";
$text = preg_replace($pattern, "", $text);

# Perl
$text =~ s/\p{Hiragana}//g;

Remove everything but Kanji

// PHP
// \P{Han} matches everything other than kanji
$pattern = "/\P{Han}/u";
$text = preg_replace($pattern, "", $text);

Note: In PHP and Perl, a Unicode code point escape is written with curly braces around the hexadecimal digits. So the regex \x3041 becomes \x{3041}, and so on.

Note: In Perl you have to make sure you have Unicode set up properly to get regular expressions to work over Japanese. You may also have to run perl with the -CS option (perl -CS) to get rid of any “Wide character in print” warnings. See http://ahinea.com/en/tech/perl-unicode-struggle.html for more information.


10 Responses to “Regular Expressions for Japanese Text”

Daniele says:

January 22, 2012 at 3:39 pm

Excellent. Thanks! This is great to know 😉

Daniele says:

January 22, 2012 at 6:53 pm

Hmm, I’m having some trouble though. If a page is encoded in EUC-JP or in Shift_JIS, the $pattern = "/\P{Han}/u"; does not work.

So this code returns a blank page:

I’m trying to figure out how to make it detect the page’s encoding, convert to UTF-8 so that I can use the \P{Han}/u regexp to only get lists of Kanji from websites.
Any insight into how this may be done?

Thanks 😀

Daniele says:

January 22, 2012 at 6:54 pm

Oops the code didn’t go through. Here is it again:

setlocale(LC_ALL, "ja_JP.UTF8");

$url = file("http://www.asahi.com/", FILE_SKIP_EMPTY_LINES);
$pattern = "/\P{Han}/u";
$kanjiString = "";

foreach ($url as $line) {
    $kanjiString .= preg_replace($pattern, "", $line);
}

echo $kanjiString;

mark says:

January 23, 2012 at 4:58 pm

Hi Daniele,

PHP has two convenient functions to do what you want.
mb_check_encoding()
mb_convert_encoding()

Something simple like this should do the trick for you.
if (!mb_check_encoding($text, "UTF-8")) {
    $text = mb_convert_encoding($text, "UTF-8", "Shift-JIS, EUC-JP");
}

Taylor says:

June 12, 2012 at 9:05 pm

I was putting off having to do something like this myself. Thanks so much for this. I’ll have to experiment later, but I trust all is in good order.

Raul V says:

August 21, 2012 at 8:19 pm

Thanks for your posts.
Actually I would like to ask for your help.
I tried this:
perl -pe 's!\p{Hiragana}!C!g' myfile.txt
It doesn’t do anything.
perl -pe 's!\P{InHiragana}!C!g' myfile.txt
This replaces everything, including the kanji.
Using the range [\x3041-\x3096] also failed.
Any ideas?

hachi8833 says:

November 22, 2014 at 7:35 am

Thank you for the fine reference!

Philipp Klein says:

May 19, 2015 at 4:03 am

Thanks a lot for this article, helped me a lot!
I wanted to mention that you forgot to put the Japanese whitespace character in the table. Yours starts with U+3001; it should start with U+3000 (“　”).

Kirimaru says:

August 26, 2015 at 8:23 am

If you’re having trouble with your regex, it may be helpful to use block names (\p{InHiragana}, \p{InKatakana}, and the various InCJK* blocks) instead of script names (\p{Hiragana}, \p{Katakana}, \p{Han}).

For example, even though “ー” (KATAKANA-HIRAGANA PROLONGED SOUND MARK) is included in the Katakana block above, in my tests \p{Katakana} doesn’t match it, while \p{InKatakana} does.

Mathias says:

February 15, 2017 at 7:30 pm

The following characters are not half-width: ｟｠.
Half-width detection should be: [\x{FF61}-\x{FF9F}]

But thank you for the great post.
