For an assignment at my job, I had to compute distances from a couple dozen locations in Japan (which I will call “source locations” in this article) to all postal codes that exist in Japan. Japan has roughly 123,400 postal codes (123,401 at the time of this writing), and we don’t want straight-line distances, but distances along actual routes. So what do we do now? There’s no clear answer, but you may find an idea or two after reading this article.
Geocoding
“Geocoding” refers to the conversion of addresses to latitude/longitude pairs. In our case (disregarding the couple dozen source locations), addresses consist of only a postal code (and the country, which is the same for all addresses). Unfortunately, there is no publicly accessible database that matches postal codes with latitude/longitude pairs. (For Japan, at least. Maybe you’ll find that such a database exists for your country.) You could, however, cross-reference this database from the Ministry of Land, Infrastructure, Transport and Tourism with this one from the Japan Post website, but be aware that this would be very hard because of differences in the spellings of names (e.g., 旭ヶ丘 vs. 旭ケ丘 and オシツクシ vs. 白糠町). It’s best to buy a database, especially if you’re doing this for a company. GeoPostcodes is one company that sells such databases, and at the time of this writing, the database for Japan costs €69.95.
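If you do try the cross-referencing route, some of the spelling differences can be smoothed over before matching. The following is only a rough sketch in Python: the function name and the table of variant characters are my own assumptions, not part of either database, and it only helps with cases like 旭ヶ丘 vs. 旭ケ丘. It does nothing for names that are written completely differently (such as オシツクシ vs. 白糠町).

```python
import unicodedata

# Hypothetical table of small kana that are often written as their
# full-size counterparts in place names. Extend it as you find more cases.
SMALL_KANA = str.maketrans({"ヶ": "ケ", "ヵ": "カ"})


def normalize_place_name(name: str) -> str:
    """Return a form of a place name that is easier to compare."""
    # NFKC folds half-width katakana and full-width spaces into standard forms.
    name = unicodedata.normalize("NFKC", name)
    # Replace small kana with the full-size characters.
    name = name.translate(SMALL_KANA)
    # Drop all whitespace so "旭 ケ丘" and "旭ケ丘" compare equal.
    return "".join(name.split())


if __name__ == "__main__":
    print(normalize_place_name("旭ヶ丘") == normalize_place_name("旭ケ丘"))
```

You would then join the two data sets on the normalized names and look at whatever fails to match by hand.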
You could also scrape the Google Maps API, but that would violate Google’s terms of service in most cases. You’d also be limited to (currently) 3,000 queries per IP per day. (You could also get access to the enterprise version for $10,000 per year. You’d still be limited to 100,000 requests per day. Hmm. I wonder if you’re allowed to resell your access.) At the time of this writing, you probably won’t find a geocoding API that will let you do this both comfortably and without violating its terms of service.
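To put those quotas in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes one geocoding query per postal code and uses the figures quoted above; the function name is just for illustration.

```python
import math

POSTAL_CODES = 123_401  # number of Japanese postal codes quoted in this article


def days_needed(queries: int, queries_per_day: int) -> int:
    """Number of whole days needed to send all queries under a daily quota."""
    return math.ceil(queries / queries_per_day)


if __name__ == "__main__":
    print(days_needed(POSTAL_CODES, 3_000))    # free limit: 42 days
    print(days_needed(POSTAL_CODES, 100_000))  # enterprise limit: 2 days
```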
Calculating routes
With limits like Google’s 3,000 queries per IP per day, you won’t be able to use the Google Maps API or any other API (e.g., Bing Maps’ or Nokia’s) to calculate a couple million distances. (Probably not even MapQuest Open, which is free and merely requests that you ask for permission before firing off thousands of queries.) This is where OpenStreetMap comes in. You will be calculating routes on your own computer. You will find that there is a lot of open-source software to calculate routes using OpenStreetMap data. I’ve tested the following programs (in the following order):
- Gosmore: Gave up after waiting over 24 hours for the conversion process from OpenStreetMap data (which is XML) to Gosmore’s data format to complete
- Routino: Slow for long distances (think 60 seconds or more for routes that are longer than 1000 km), interface is extremely easy to use programmatically
- Navit: Slightly buggy, reasonably fast, interface is either graphical or dbus-based and hard to use programmatically
- OSRM: Extremely fast, interface is reasonably easy to use programmatically
OSRM is the clear winner here. OSRM manages to calculate even long routes within a few milliseconds. However, you’ll need some dead-serious hardware to convert OpenStreetMap data to a format OSRM can use. The conversion tool ended up using about 30 GB of memory (if I remember correctly) to convert OpenStreetMap data for Japan.
OSRM will start a multithreaded web server (on any port you wish). Within a Perl script, you could perhaps perform queries like this:
# Ask the local OSRM server for a route between two latitude/longitude pairs
my $response = `curl -s "http://127.0.0.1:5000/viaroute?loc=$lat1,$long1&loc=$lat2,$long2"`;
OSRM by default returns JSON, and will by default return alternate routes in addition to the route it deems fastest. By the way, calculating all routes took less than half a day (using ~24 not-too-modern Opteron cores).
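Once you have the JSON, you still need to pull the distance out of it. Below is a minimal Python sketch of building the query URL and reading the result. The field names `route_summary` and `total_distance` (in meters) are an assumption based on the `viaroute` responses I have seen; check what your OSRM version actually returns. The function names are made up for this example.

```python
import json
from urllib.parse import urlencode


def viaroute_url(lat1, lon1, lat2, lon2, host="127.0.0.1", port=5000):
    """Build a viaroute query URL like the one used in the curl call above."""
    query = urlencode(
        [("loc", f"{lat1},{lon1}"), ("loc", f"{lat2},{lon2}")],
        safe=",",  # keep the comma between latitude and longitude readable
    )
    return f"http://{host}:{port}/viaroute?{query}"


def distance_km(response_text):
    """Return the route length in km, or None if no route was found.

    Assumes the response contains route_summary.total_distance in meters.
    """
    data = json.loads(response_text)
    summary = data.get("route_summary")
    if not summary or "total_distance" not in summary:
        return None
    return summary["total_distance"] / 1000.0


if __name__ == "__main__":
    print(viaroute_url(35.0, 139.0, 36.0, 140.0))
    print(distance_km('{"route_summary": {"total_distance": 12345}}'))
```

Returning `None` for a missing route makes it easy to collect the broken routes for later.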
Occasionally, OSRM will fail to find a route between two points. In our case, this happened 5,078 times, and it happens for the following reasons:
- There is no road nearby. (A single postal code can cover a large area.)
- The road that is closest to the specified coordinates is not connected to the wider road network. You might be on an island somewhere. (E.g., Hokkaido! Note that routes in Hokkaido were calculated completely separately from the other routes, and the numbers and statistics in this article may disregard Hokkaido.) Or there might be an error in the OpenStreetMap data. (By the way, it would be great if you could correct some of these errors! Anybody can edit OpenStreetMap data.)
We can fix a large number of these broken routes by calculating routes to points in the vicinity of the latitude/longitude pair in question. I chose to check the points 200 m to the north, south, west, and east, and if that failed, incremented the distance to 400, 600, 800, and finally 1,000 m. This reduced the number of broken routes to 1,411.
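The search described above can be sketched in Python roughly like this. It is not the script I actually used: the meters-to-degrees conversion is a common approximation (about 111,320 m per degree of latitude), and `find_route` is a hypothetical callable that returns `None` when no route is found.

```python
import math

METERS_PER_DEGREE_LAT = 111_320.0  # rough average; good enough for 1 km offsets


def nearby_points(lat, lon, step_m=200, max_m=1000):
    """Yield points north, south, west, and east at 200 m, 400 m, ... 1,000 m."""
    for dist in range(step_m, max_m + 1, step_m):
        dlat = dist / METERS_PER_DEGREE_LAT
        # A degree of longitude shrinks with the cosine of the latitude.
        dlon = dist / (METERS_PER_DEGREE_LAT * math.cos(math.radians(lat)))
        yield (lat + dlat, lon)  # north
        yield (lat - dlat, lon)  # south
        yield (lat, lon - dlon)  # west
        yield (lat, lon + dlon)  # east


def route_with_fallback(lat, lon, find_route):
    """Try the original point first, then nearby points, closest ring first."""
    result = find_route(lat, lon)
    if result is not None:
        return result
    for p_lat, p_lon in nearby_points(lat, lon):
        result = find_route(p_lat, p_lon)
        if result is not None:
            return result
    return None  # still broken; deal with it by hand
```

With the default arguments this tries at most 20 extra points per postal code, so the additional load on the routing server is small.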
So what do we do now? We get rid of all postal codes that point to an island. That will eliminate a couple hundred postal codes, and it’s fun, because you’ll get to click around a lot, just like in a game! We’re going to add map markers on a Google Map for all broken postal codes and then click away and make a list of the ones we don’t need. How do you add markers on a Google Map? The answer is only one Google search away. We take the code from the JSFiddle demo and modify it a little bit to make markers go away when we click them and add the relevant latitude/longitude pair to a text box. That leaves only 336 postal codes. Here’s a link to the modified code: display_lat_long_pairs_on_map_bad_zips.html
So what do you do with the remaining 336 postal codes? That’s up to you to decide. I suggest trying MapQuest Open.
Other facts about postal codes
- There are buildings in Japan that have multiple postal codes, e.g., one for each floor.
- The term “ZIP code” only applies to postal codes in the US. “ZIP” is short for “Zone Improvement Plan”. You can read all about it in the Wikipedia article.
- The number of Japanese postal codes changes (very slightly) every month. (You can find a CSV file that matches postal codes with addresses on the Japan Post website. There you’ll also find small files that contain the updates to this database.) Codes get added, deleted, and re-assigned.
If you have any questions, feel that I’ve left out something important, know of any reasonable alternatives to using OSRM for this, or just found this article helpful, please leave a comment! And if you need distance tables and don’t feel like calculating them yourself, feel free to ask me. I’ll probably manage to get them calculated for you very quickly (for a very modest fee, which mostly depends on whether I need to buy external databases).