Japanese Encoding Conversion

Monday, July 16th, 2012

Japanese has many different text encodings, and one that pops up a lot when you are working on text files is EUC-JP (Japanese Extended Unix Code). You find EUC-JP encoding used in many Japanese Web sites, text documents, JMdict and EDICT glossary files, and so on. This encoding is particularly troublesome because a lot of English-language text editors and utilities don’t know how to deal with it.

Usually you want to work with UTF-8 instead, so here are some strategies for converting EUC-JP encoding into UTF-8.

Simple Command-Line Conversion in Linux

On Linux this is really easy. Use the iconv command-line conversion utility.

iconv -f EUC-JP -t UTF-8 input.txt > output.txt

or

iconv -f EUC-JP -t UTF-8 input.txt -o output.txt

Here input.txt is in EUC-JP encoding, and the resulting output.txt is in UTF-8. Short and sweet. The first form writes to standard output, so it can easily be piped to further commands.
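To make the piping idea concrete, here is a minimal sketch that builds its own one-line EUC-JP test file and sends the converted text straight into grep. The scratch directory and the match_count variable are just illustrative names, and the byte values assume the sample text 日本語.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Create a one-line EUC-JP test file. These six bytes spell 日本語 in EUC-JP.
printf '\306\374\313\334\270\354\n' > "$workdir/input.txt"

# iconv writes UTF-8 to standard output, so grep can search it directly.
# -c prints the number of matching lines.
match_count=$(iconv -f EUC-JP -t UTF-8 "$workdir/input.txt" | grep -c '日本')
echo "lines containing 日本: $match_count"
```

Running grep on the unconverted file would find nothing, because the EUC-JP bytes for 日本 differ from the UTF-8 bytes in the search pattern.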

We can use a Bash loop to automate this. For example, to convert all XML files to UTF-8:

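# The output directory must exist before the loop runs.
mkdir -p converted
for x in *.xml
do
  # Quote the filename so names containing spaces still work.
  iconv -f EUC-JP -t UTF-8 "$x" > "converted/$x"
done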
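When you batch-convert a directory, some files may turn out not to be EUC-JP at all. iconv exits with a non-zero status when it hits bytes it cannot decode, so a loop can check for that. The sketch below wraps the conversion in a small function; convert_or_report is a made-up helper name, not something iconv provides.

```shell
# convert_or_report is a hypothetical helper name, not part of iconv.
# $1 is the EUC-JP source file, $2 is the UTF-8 destination file.
convert_or_report() {
  if iconv -f EUC-JP -t UTF-8 "$1" > "$2" 2>/dev/null; then
    echo "converted: $1"
  else
    # iconv exits non-zero when it hits bytes that are not valid EUC-JP.
    echo "skipped (not valid EUC-JP?): $1" >&2
    # Remove the partial output so it is not mistaken for a good conversion.
    rm -f "$2"
    return 1
  fi
}
```

Inside the loop you would call convert_or_report "$x" "converted/$x" in place of the bare iconv line.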

Command-Line Perl Program

Let’s write a simple Perl program that will take two command-line arguments: the input file in EUC-JP encoding, and the resultant output file converted to UTF-8. We will be able to run the program like this:

./convert input.txt output.txt

For this, we will use the from_to() function that is part of the Encode module. The from_to() function converts a string in place and takes three parameters: the input, the encoding of the input, and the desired encoding of the output.

from_to($input, "euc-jp", "utf8");

Here is the full program:

#!/usr/bin/perl
use strict;
use warnings;
use Encode "from_to";

# Expect exactly two arguments: the EUC-JP input file and the UTF-8 output file.
die "Usage: $0 input.txt output.txt\n" unless @ARGV == 2;
my ($inputFilename, $outputFilename) = @ARGV;

open(my $in,  "<", $inputFilename)  or die "Can't open $inputFilename: $!";
open(my $out, ">", $outputFilename) or die "Can't open $outputFilename: $!";

# from_to() converts each line in place, so $_ holds UTF-8 bytes afterwards.
while (<$in>) {
   from_to($_, "euc-jp", "utf8");
   print $out $_;
}

close $in  or die "$inputFilename: $!";
close $out or die "$outputFilename: $!";

Command-Line PHP Program

PHP programs can also be run from the command line. Let’s do a little more this time: convert every non-PHP file in the current directory from EUC-JP to UTF-8 and write the results to a tmp directory.

We will use the mb_convert_encoding() function, which works on multi-byte strings. It takes three parameters: the input, the desired encoding of the output, and the encoding of the input. Note that the two encodings come in the opposite order from Perl’s from_to().

mb_convert_encoding($input, "UTF-8", "EUC-JP");

Here is the full program:

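<?php

$dirHandler = opendir(".") or die("can't open directory");

// Compare against false explicitly so a file named "0" doesn't end the loop.
while (($fileName = readdir($dirHandler)) !== false) {

   // Skip directories (including . and ..) and the PHP scripts themselves.
   if (is_dir("./$fileName") || pathinfo($fileName, PATHINFO_EXTENSION) == 'php') {
      continue;
   }

   $file = file_get_contents("./$fileName");
   $convertedText = mb_convert_encoding($file, "UTF-8", "EUC-JP");

   echo "$fileName\n";

   // The ../tmp directory must already exist.
   $writeFile  = "../tmp/$fileName";
   $fileHandle = fopen($writeFile, 'w') or die("can't open file");
   fwrite($fileHandle, $convertedText);
   fclose($fileHandle);
}

closedir($dirHandler);

?>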

Just Use Firefox and a Text Editor

Finally, if you are working with plain text, a very simple way to convert EUC-JP to UTF-8 is to open the file in Firefox. Firefox almost always gets the encoding right, and if it doesn’t, you can set it manually in the Character Encoding menu. Then copy and paste the text into your favorite text editor and save it as UTF-8.
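After a copy-and-paste round trip it can be worth confirming that the saved file really is UTF-8. One way, sketched below, is to ask iconv to convert the file from UTF-8 to UTF-8 and see whether it complains. The function name is_valid_utf8 is made up for this example.

```shell
# is_valid_utf8 is a hypothetical helper name. It succeeds (exit status 0)
# only if every byte sequence in the file decodes as UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

You could then run something like is_valid_utf8 saved.txt && echo "looks like UTF-8". Keep in mind that a file containing only ASCII passes this check too, since ASCII is a subset of UTF-8.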

As you can see, you have lots of easy options for converting text between encodings. The same scripts work for other Japanese encodings such as JIS (ISO-2022-JP) and Shift-JIS: just change the name of the source encoding.
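As a quick illustration of switching the source encoding, the sketch below writes the same text, 日本語, as raw bytes in EUC-JP and in Shift-JIS, converts both with iconv, and checks that the UTF-8 results match. The file names are arbitrary, and SHIFT_JIS is the encoding name I am assuming your iconv accepts (iconv -l lists the names it knows).

```shell
workdir=$(mktemp -d)

# The same text, 日本語, as raw bytes in two different encodings.
printf '\306\374\313\334\270\354\n' > "$workdir/euc.txt"    # EUC-JP
printf '\223\372\226\173\214\352\n' > "$workdir/sjis.txt"   # Shift-JIS

# Only the -f (from) encoding changes; the target stays UTF-8.
iconv -f EUC-JP    -t UTF-8 "$workdir/euc.txt"  > "$workdir/euc.utf8.txt"
iconv -f SHIFT_JIS -t UTF-8 "$workdir/sjis.txt" > "$workdir/sjis.utf8.txt"

# cmp -s exits 0 when the two files are byte-for-byte identical.
if cmp -s "$workdir/euc.utf8.txt" "$workdir/sjis.utf8.txt"; then
  result="identical"
else
  result="different"
fi
echo "$result"
```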