The other day, someone told me that the kanji 漢 (the 漢 in 漢字) usually indicates a male name. This made me wonder what other kanji there are that might indicate a female or male name. So I downloaded JMnedict.xml and processed it a little bit.
Side note: Both XML::LibXML and MySQL’s import from XML thingy turned out to be uselessly slow, so I did this using XML::Twig. Here’s some lazy code to extract all the <keb> elements and their associated name_type:
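    #!/usr/bin/perl -w
    use strict;
    use XML::Twig;

    binmode(STDOUT, ":utf8");

    # Call entry() for every <entry> element as soon as it has been parsed.
    my $t = XML::Twig->new(twig_handlers => { entry => \&entry });
    $t->parsefile($ARGV[0]);

    sub entry {
        my ($t, $entry) = @_;
        my ($keb, $name_type);
        # Entries without a <k_ele> or a <name_type> make first_child() return
        # undef, so the chained call dies; the eval lets us skip those entries.
        eval {
            $keb = $entry->first_child("k_ele")->first_child("keb")->text;
            $name_type = $entry->first_child("trans")->first_child("name_type")->text;
        };
        print "$keb\t$name_type\n" if (!$@);
        # Free the memory of everything parsed so far, even for skipped entries.
        $t->purge;
    }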
Redirect the script’s output to a file (all_names.csv in the commands below), and we have a tab-separated file that looks like this:

    ゝ泉	given name or forename, gender not specified
    〆	female given name or forename
    ...
Great names, eh? Then we just do:
    grep female all_names.csv > female_names.csv
    grep "[^e]male" all_names.csv > male_names.csv

    # Perl's -C option enables Unicode everywhere. Unfortunately, this option doesn't work on the #! line.
    perl -C -n -e 'while ($_ =~ s/(\p{Block=CJK_Unified_Ideographs})//) { print "$1\n" }' female_names.csv | sort -n | uniq -c | sort -n > female_kanji.csv
    perl -C -n -e 'while ($_ =~ s/(\p{Block=CJK_Unified_Ideographs})//) { print "$1\n" }' male_names.csv | sort -n | uniq -c | sort -n > male_kanji.csv

    sed -i "s/^\s*//" female_kanji.csv  # fix leading whitespace chars from uniq -c
    sed -i "s/^\s*//" male_kanji.csv
Then we import the two files into two different sheets of the same spreadsheet file. (I used LibreOffice, but Excel is better. Seriously.) Call one sheet “Female” and the other “Male”. The lookup formulas below expect the kanji in column A and the counts in column B, so swap the columns after importing if necessary. On a third sheet, concatenate the two lists of kanji in column A, filter out the duplicates, and use the following formulas for columns B to E, respectively:
    B2: =IF(ISERROR(VLOOKUP(A2, Female.$A$1:$B$1601, 2, 0)), 0, VLOOKUP(A2, Female.$A$1:$B$1601, 2, 0))
    C2: =IF(ISERROR(VLOOKUP(A2, Male.$A$1:$B$1601, 2, 0)), 0, VLOOKUP(A2, Male.$A$1:$B$1601, 2, 0))
    D2: =IF(ISERROR(B2/C2), 1000000, B2/C2)
    E2: =B2+C2
(In LibreOffice Calc, you use a period instead of an exclamation mark to reference cells on a different sheet in formulas. At the time of writing, LibreOffice Calc doesn’t support the IFERROR() function, hence the IF(ISERROR(...)) wrappers. I know that 1,000,000 is not the answer to n/0, but we’d like a high number for sorting purposes.)

Copy the formulas down and perhaps add the following headers in the top row: Kanji, Female, Male, Ratio, Count. Copy the whole thing without formulas to a new sheet, sort by ratio, and then by count. Perhaps filter out all the kanji that have a count < 10.

Here’s a link to my files: kanji_usage_in_names.ods (OpenDocument) and kanji_usage_in_names.xlsx (Excel 2007+).
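If you’d rather skip the spreadsheet, roughly the same table can be built on the command line. This is only a sketch under a few assumptions, not the method behind the linked files: merge_counts is a made-up helper name, and it assumes that female_kanji.csv and male_kanji.csv hold one "count kanji" pair per line, separated by a space, as left behind by uniq -c and sed. It keeps the 1,000,000 stand-in for division by zero and takes the minimum count as an optional third argument.

```shell
# merge_counts FEMALE_FILE MALE_FILE [MIN_COUNT]
# Hypothetical helper (not from the original post). Prints tab-separated
# columns: kanji, female count, male count, ratio, total count,
# sorted by ratio and then by count, both ascending.
# Assumes every input line looks like "COUNT KANJI".
merge_counts() {
    LC_ALL=C awk -v min="${3:-0}" '
        # Lines of the first file: female counts, keyed by kanji.
        FILENAME == ARGV[1] { female[$2] = $1; seen[$2] = 1; next }
        # Lines of the second file: male counts.
        { male[$2] = $1; seen[$2] = 1 }
        END {
            for (k in seen) {
                f = female[k] + 0          # a missing kanji counts as 0
                m = male[k] + 0
                # Same stand-in as the spreadsheet: a large number for n/0.
                ratio = (m == 0) ? 1000000 : f / m
                if (f + m >= min)
                    printf "%s\t%d\t%d\t%.4f\t%d\n", k, f, m, ratio, f + m
            }
        }
    ' "$1" "$2" | LC_ALL=C sort -t "$(printf '\t')" -k4,4n -k5,5n
}

# Example call (file names as in the commands further up):
#   merge_counts female_kanji.csv male_kanji.csv 10 > kanji_ratio.tsv
```

LC_ALL=C is set so that awk and sort treat the kanji as plain bytes and the ratio column always uses a period as decimal separator, whatever the system locale is. Kanji near the top of the output lean male, those near the bottom lean female.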
So it turns out that there are hundreds of kanji that are strong indicators for the gender of a name. By the way, JMnedict.xml’s data isn’t very good: for example, even names like 大介 aren’t gender-classified yet. We’ve got only 1,719 unique kanji for all the gender-classified names, 1,601 unique kanji for female names, and 873 unique kanji for male names. Pretty low and weird numbers. So don’t expect too much accuracy.
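The three totals can be rechecked from the kanji files with a few lines of shell. Again a sketch only: count_unique_kanji is a made-up name, and it assumes the "count kanji" line format produced by uniq -c and sed.

```shell
# count_unique_kanji FILE...
# Hypothetical helper: prints how many distinct kanji (second field)
# occur across all the given files.
count_unique_kanji() {
    LC_ALL=C awk '!seen[$2]++ { n++ } END { print n + 0 }' "$@"
}

# Example calls (file names as in the commands further up):
#   count_unique_kanji female_kanji.csv                   # female names
#   count_unique_kanji male_kanji.csv                     # male names
#   count_unique_kanji female_kanji.csv male_kanji.csv    # both combined
```

The combined figure can never exceed the sum of the other two, because kanji that occur in both lists are counted only once.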