In Japanese, words are often written in a mixture of two scripts: kanji
(logographs) and hiragana (a syllabary). For example, where upper-case
letters indicate kanji, lower-case letters represent hiragana, and a
space indicates a character boundary, you might find the following word:
HI k KO shi
Unfortunately, anything that's written in kanji can alternatively be
written in hiragana:
hi k ko shi
Further complicating the problem, hiragana that follow a kanji
(okurigana) are sometimes omitted and left for the reader to infer:
HI k KOSHI
HIK KO shi
Thus, a word like this can be written five different ways. Given all
this, how would one go about doing a word-frequency count in Japanese?
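To make the variant problem concrete, here is a rough sketch in Python
that enumerates the spellings of the example word and folds them into a
single count. It uses the same notation as above (upper case = kanji,
lower case = hiragana) and assumes just the two rules described: the
whole word may be written in hiragana, and each okurigana may
independently be written or absorbed into the preceding kanji. The
function name `spellings` and the token list are made up for
illustration, and real text would first have to be segmented into words.

```python
from collections import Counter
from itertools import product

def spellings(segments):
    """Enumerate spellings of a word given as kanji/okurigana segments.

    Assumes upper-case segments are kanji, lower-case segments are
    okurigana, and every okurigana directly follows a kanji.
    """
    # Rule 1: the whole word written in hiragana.
    variants = [" ".join(s.lower() for s in segments)]
    # Rule 2: kanji kept; each okurigana is either written or omitted.
    okuri_count = sum(1 for s in segments if s.islower())
    for keep in product([True, False], repeat=okuri_count):
        flags = iter(keep)
        parts = []
        for s in segments:
            if s.isupper():
                parts.append(s)            # kanji character
            elif next(flags):
                parts.append(s)            # okurigana written out
            else:
                parts[-1] += s.upper()     # okurigana absorbed into the kanji
        variants.append(" ".join(parts))
    return variants

word = ["HI", "k", "KO", "shi"]
variants = spellings(word)

# Map every variant to one canonical spelling before counting.
canonical = {v: " ".join(word) for v in variants}
tokens = ["HIK KOSHI", "hi k ko shi", "HI k KO shi"]  # made-up sample
counts = Counter(canonical.get(t, t) for t in tokens)
```

For this word the sketch yields five spellings, including the doubly
abbreviated HIK KOSHI, and all three sample tokens are counted under
HI k KO shi. The lookup table is the expensive part: it needs a
dictionary that records which hiragana are okurigana for each word.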
One option is to standardize everything to hiragana (doable). The
problem with this is that you then end up with a high percentage of
homographic heteronyms (they would be heterographic, were they written
in kanji).
Any other ideas?
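The drawback of the all-hiragana option can be seen in a small Python
sketch. The mini-lexicon `READINGS` and the token list are hypothetical
and only cover a few well-known words (paper, hair, god, dog); a real
count would need a full dictionary and a way to pick the right reading
for kanji that have several.

```python
from collections import Counter, defaultdict

# Hypothetical mini-lexicon: kanji spelling -> hiragana reading.
READINGS = {
    "紙": "かみ",  # paper
    "髪": "かみ",  # hair
    "神": "かみ",  # god
    "犬": "いぬ",  # dog
}

def to_hiragana(token):
    """Replace a known kanji spelling with its reading; pass others through."""
    return READINGS.get(token, token)

tokens = ["紙", "髪", "かみ", "神", "犬", "いぬ"]  # made-up sample
counts = Counter(to_hiragana(t) for t in tokens)

# Find hiragana keys under which several distinct kanji words were merged.
merged = defaultdict(set)
for kanji, kana in READINGS.items():
    merged[kana].add(kanji)
ambiguous = {kana: words for kana, words in merged.items() if len(words) > 1}
```

Here the variants of "dog" are merged correctly, but "paper", "hair"
and "god" all end up under the same key, so their individual
frequencies can no longer be recovered from the counts.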
And a related question: does anyone have an extensive list of Japanese
transitive / intransitive verb pairs?