Scraping ISO country codes with Nokogiri

Wednesday, May 9 2012

The ISO country codes are used in many business settings to identify the countries of the world: for instance, to indicate the destination of a shipment.

ISO makes the two-letter codes available for free, but the three-letter and numeric codes are jealously guarded. Even if you pay, you will not receive them in a very friendly format: only PDF and Microsoft Access are available.

Wikipedia has a table containing all the ISO country codes you will ever need, so I decided to write a little script to extract them from the web page.

My first thought was to use Python’s Beautiful Soup, but since most people at work are more familiar with Ruby, I decided to use Ruby and the Nokogiri library.

The library is basically a wrapper around the C library libxml2. It lets you build an HTML DOM and navigate it using XPath or CSS selectors. It can also apply XSLT rules, but I decided not to go down that road.

I had no difficulty installing the gem under Ruby 1.9.3 on Mac OS X. The main obstacle to using the library effectively is the documentation. The tutorial covers only a very small subset of the library, and the rest is auto-generated API docs. These are at times fairly obscure (what is the difference between Node#next and Node#next_element?), and you need to know the name of a class before you can look up its usage, which leaves you guessing what a given piece of functionality might be called in order to find its documentation.

Nevertheless, with some familiarity with the DOM and XPath you should be able to guess which methods to use to obtain the desired result. If you want to see the result of my endeavours, see this gist. The code expects that you have already extracted the table containing the ISO country codes from the rest of the page. Maybe you can take advantage of it too, until the Wikipedia markup changes.