Male and female kanji (And how to process ENAMDICT’s XML in Perl)

The other day, someone told me that the kanji 漢 (the 漢 in 漢字) usually indicates a male name. This made me wonder what other kanji there are that might indicate a female or male name. So I downloaded JMnedict.xml and processed it a little bit.

Side note: Both XML::LibXML and MySQL’s import from XML thingy turned out to be uselessly slow, so I did this using XML::Twig. Here’s some lazy code to extract all the <keb> elements and their associated name_type:

#!/usr/bin/perl -w

use XML::Twig;
use strict;

binmode(STDOUT, ":utf8");

my $t = XML::Twig->new(twig_handlers => { entry => \&entry });

$t->parsefile($ARGV[0]);

sub entry {
    my ($t, $entry) = @_;
    my ($keb, $name_type);

    eval {
        $keb = $entry->first_child("k_ele")->first_child("keb")->text;
        $name_type = $entry->first_child("trans")->first_child("name_type")->text;
        $entry->purge;
    };
    print "$keb\t$name_type\n" if (!$@);
}

Now we have a file that looks like this:

ゝ泉    given name or forename, gender not specified
〆      female given name or forename
...

Great names, eh? Then we just do:

grep female all_names.csv > female_names.csv
grep "[^e]male" all_names.csv > male_names.csv
perl -C -n -e 'while ($_ =~ s/(\p{Block=CJK_Unified_Ideographs})//) { print "$1\n" }' female_names.csv | sort -n | uniq -c | sort -n > female_kanji.csv # Perl's -C option enables unicode everywhere. Unfortunately, this option doesn't work on the #! line.
perl -C -n -e 'while ($_ =~ s/(\p{Block=CJK_Unified_Ideographs})//) { print "$1\n" }' male_names.csv | sort -n | uniq -c | sort -n > male_kanji.csv
sed -i "s/^\s*//" female_kanji.csv # fix leading whitespace chars from uniq -c
sed -i "s/^\s*//" male_kanji.csv

Then we put the two files into two different sheets of the same file in a spreadsheet program. (I used LibreOffice, but Excel is better. Seriously.) Call one sheet “Female”, the other “Male”, and on a third sheet, concatenate the two lists of kanji and filter out the duplicates (in column A), and use the following formulas for columns B to E, respectively:

=IF(ISERROR(VLOOKUP(A2, Female.$A$1:$B$1601, 2, 0)), 0, VLOOKUP(A2, Female.$A$1:$B$1601, 2, 0))
=IF(ISERROR(VLOOKUP(A2, Male.$A$1:$B$1601, 2, 0)), 0, VLOOKUP(A2, Male.$A$1:$B$1601, 2, 0))
=IF(ISERROR(B2/C2), 1000000, B2/C2)
=B2+C2

(In LibreOffice Calc, you use a period instead of an exclamation mark to reference cells on a different sheet in formulas. LibreOffice Calc doesn’t support the IFERROR() function. I know that 1,000,000 is not the answer to n/0, but we’d like a high number for sorting purposes.) Copy the formulas down and perhaps add the following headers in the top row: Kanji, Female, Male, Ratio, Count. Copy the whole thing without formulas to a new sheet, sort by ratio, and then by count. Perhaps filter out all the kanji that have a count < 10. Here’s a link to my files: kanji_usage_in_names.ods (OpenDocument) and kanji_usage_in_names.xlsx (Excel 2007+).

So it turns out that there are hundreds of kanji that are strong indicators for the gender of a name. By the way, JMnedict.xml’s data isn’t very good: for example, even names like 大介 aren’t gender-classified yet. We’ve got only 1,719 unique kanji for all the gender-classified names, 1,601 unique kanji for female names, and 873 unique kanji for male names. Pretty low and weird numbers. So don’t expect too much accuracy.