The Unicode character set provides, for the first time, a convenient way to deal with text files from different parts of East Asia, since the widely used national codes are all mapped to one common codespace. Unfortunately, the source separation rule, which makes sure that characters distinguished in any of the source character sets are also distinguished in Unicode, means that some commonly used characters are assigned different Unicode codepoints even though they are the same character, depending on whether the source is a Big5-encoded document from Taiwan or a JIS-encoded document from Japan.
To give an example, the character nei4 or uchi is 內 U+5167 when converting from Big5, but the same character, with a slightly different glyph shape, becomes 内 U+5185 when the source file was produced in Japan. Obviously, a search program has to be rather smart to realize that these characters are in fact the same.
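The effect on searching can be illustrated with a small Python sketch. This is not the conversion table or the macro described on this page: the one-entry mapping, the name `fold_variants`, and the choice to fold the JIS-derived form onto the Big5-derived one are all assumptions made for the illustration.

```python
# Hypothetical one-entry excerpt; a real conversion table is much larger.
# The direction (JIS-derived form -> Big5-derived form) is an assumption.
VARIANT_TO_CANONICAL = {
    "\u5185": "\u5167",  # 内 (from JIS) -> 內 (from Big5)
}

def fold_variants(text: str) -> str:
    """Replace each known variant character with its chosen canonical form."""
    return text.translate(str.maketrans(VARIANT_TO_CANONICAL))

big5_text = "\u5167\u5bb9"  # 內容, as it arrives when converted from Big5
jis_text = "\u5185\u5bb9"   # 内容, as it arrives when converted from JIS

# A naive comparison treats the two words as different strings.
print(big5_text == jis_text)                                # False
# After folding both sides through the same table, they match.
print(fold_variants(big5_text) == fold_variants(jis_text))  # True
```

Folding both the search term and the searched text through the same table is what lets the comparison succeed, whichever of the two forms is picked as the canonical one.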
To overcome this problem, I created a conversion table and some MS-Word macros to convert the affected text files. Why MS-Word? Currently, Word is one of the few applications with a decent degree of Unicode support, so that is where I am running into this problem. You will need Word 97 or later for this to work, and the same applies to this macro (I am not sure whether it will work on Word 98 for the Macintosh -- if somebody finds out, please tell me!). The macro itself was developed with Word 2000, English version. I am at least sure that it works for that version on my computer :-)
I am also making the table available below, so if anybody wants to create a solution for a different platform, please go ahead. If you drop me a line, I would be glad to link to your site.
If you want to give the conversion a try, please download the file cjkconv.zip to your computer and unzip it. You should find the file CJKconv.doc. Open this file and follow the instructions contained therein.
Since the conversion table is a rather large file, you can browse it in a separate file here. The table is also available as a zipped HTML file and as a zipped text file in 16-bit Unicode or in UTF-8.
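For anyone who wants to use the table outside of Word, the following Python sketch shows one way a text version could be applied. The line layout is an assumption, not the documented format of the downloadable file: each line is taken to hold a source and a target code point in hexadecimal, separated by a tab. The names `load_table` and `convert` are hypothetical, and the sample data stands in for the real file.

```python
import io

def load_table(lines):
    """Build a translation map from lines of 'SOURCE<TAB>TARGET' hex code points.

    The two-column hex layout is an assumption made for this sketch.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        source_hex, target_hex = line.split("\t")[:2]
        table[int(source_hex, 16)] = chr(int(target_hex, 16))
    return table

def convert(text, table):
    """Replace every character listed in the table with its target character."""
    return text.translate(table)

# Stand-in for the real table file, using the assumed layout.
sample = io.StringIO("# source\ttarget (hypothetical layout)\n5185\t5167\n")
table = load_table(sample)
converted = convert("\u5185\u5bb9", table)  # 内容 becomes 內容
```

To read the UTF-8 version from disk, the `io.StringIO` object could be replaced by `open(path, encoding="utf-8")`; the 16-bit version would most likely need `encoding="utf-16"` instead.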