Late last week the British Library released over 1 million public domain images from books published between 1500 and 1900. The images themselves can be found on Flickr, which of course has its own search mechanisms, but for me one of the more interesting things was that they released metadata for all of them via GitHub. You can read a bit more about how they generated the data and a tool called the Mechanical Curator here.
First off let’s say a huge thank you to the British Library – this is a massive amount of completely free material available for pretty much any purpose, and there is some very interesting material in there.
The image I chose for this post is entitled “Jupiter, an idol of the Chinese”, which I’ll leave without comment(!) It’s taken from a book about the geography of the world, published in 1808, purportedly based on the voyages of Captain Cook amongst others.
Anyway, back to the data. I’ve been looking for a project, and suddenly 1,000,000+ image catalogue entries land in my lap. Fate? Who knows, but I dove right in.
If you’ve been keeping up with the blog you’ll know I learned a bit of Ruby a couple of weeks ago – and it seemed like an ideal chance to put that knowledge into practice. The metadata is split into 770 tab-separated files, broken down by year and image size. Each line of each file represents one image, and contains the Flickr ID and URL, along with the title, first author, publication date and other information about the book.
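To give a feel for the file format, here is a minimal sketch of turning one tab-separated line into a Ruby hash. The column names in `HEADER` are my own placeholders for illustration, not necessarily the ones used in the British Library files, and the sample line is made up.

```ruby
# Hypothetical column names; the real files may name and order them differently.
HEADER = %w[flickr_id flickr_url title first_author date].freeze

# Turn one tab-separated line into a Hash keyed by column name.
def parse_line(line, header = HEADER)
  values = line.chomp.split("\t", -1) # -1 keeps trailing empty fields
  header.zip(values).to_h
end

# Made-up sample line, not a real catalogue entry.
sample = "11001\thttp://example.org/11001\tA Guide to Herne Bay\tSMITH, J.\t1889\n"
record = parse_line(sample)
puts record["title"]
```

Note that every value comes back as a string at this stage, including the date.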
So, the first thing I wanted to do was to get the metadata into some kind of database, so I could look at it. My initial choice was MongoDB, simply because it’s always my first choice – it’s so simple to get data into and it gives you quite a lot of scope for detailed analysis via the new aggregation framework, straight from the command line interface. I also wanted to try out Elasticsearch, and I had a feeling keyword searching on the data might also be quite interesting.
My first effort was a basic Ruby script which loaded the data into MongoDB. I stuck it on GitHub straight away, but I wasn’t very pleased with the quality of my code, despite the fact it did the job: it couldn’t easily be refactored to re-use components for the Elasticsearch loader. So, I spent an hour or two moving the classes into separate files, and making sure they were decoupled from each other to the right degree. I’ve ended up with a TSV file reading framework that will stand me in good stead, and is particularly useful for loading this BL image data into various repositories. Ruby is absolutely perfect for this task – I really like it as a language. It’s neat and compact, typed where you need it but agnostic enough where you don’t care.
Adding an Elasticsearch loader was pretty trivial at this point, and it would be really easy to add any other database, if you chose – just implement a new tsv/TsvToYourDB.rb file, and a YourDBLoader.rb script to call it. My initial decision to use MongoDB was justified though – the initial load on ES took over ten times as long.
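The general shape of that decoupling looks something like the sketch below. This is not the code from my repo: `TsvReader` and `MemorySink` are hypothetical names, the sink just keeps records in memory where a real one would call a database driver, and I'm assuming each file starts with a header row.

```ruby
require "stringio"

# Reads tab-separated lines and hands each record to whatever sink it was given.
class TsvReader
  def initialize(sink)
    @sink = sink
  end

  # Assumes the first line is a header row. Returns the number of records stored.
  def load(io)
    header = io.gets.chomp.split("\t")
    count = 0
    io.each_line do |line|
      values = line.chomp.split("\t", -1)
      @sink.store(header.zip(values).to_h)
      count += 1
    end
    count
  end
end

# Stand-in sink for illustration; a real one would write to a database instead.
class MemorySink
  attr_reader :records

  def initialize
    @records = []
  end

  def store(record)
    @records << record
  end
end

data = StringIO.new("flickr_id\ttitle\n1\tFirst\n2\tSecond\n")
sink = MemorySink.new
loaded = TsvReader.new(sink).load(data)
puts "#{loaded} records loaded"
```

The reader only ever calls `store`, so supporting another database means writing another class with that one method.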
You can find all the code on GitHub, with instructions – you just need Ruby somewhere, and a couple of gems to access the database drivers.
Expansion and updates
Over the weekend, GitHub user straup added direct image URLs and sizes for small, medium, large and original for all the images, which nearly doubled the size of the data.
Luckily I’d written the loaders in such a way that they worked out of the box with the new fields, and also worked as updates without needing to remove the previously loaded data, so I was pretty happy when I ran them again on the updates and everything went smoothly.
As soon as I started trying to run range queries on the image sizes, to find the biggest ones, I realised I’d made a bit of an error when loading the data from the TSV files. Everything was being loaded as a string, which meant that neither MongoDB nor Elasticsearch was treating the sizes as numbers.
I’ve hacked in a pretty crude algorithm for deciding which fields I want to be treated as integers, which worked great with Mongo, but I realised I had more of a problem with Elasticsearch. ES is very picky about field types within the index – it appears to decide what they should be the first time it sees a field value, and then they’re fixed forever. Unfortunately this meant I had to completely delete the index and start again – another 40 minutes loading time.
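The string problem is easy to see in Ruby itself: `"999" > "2048"` is true, because strings compare character by character. A conversion step might look like the sketch below. This is an illustration rather than the code in my repo, and the field names in `INTEGER_FIELDS` are hypothetical.

```ruby
# Hypothetical list of fields that should hold integers.
INTEGER_FIELDS = %w[date width height].freeze

# Return a copy of the record with the listed fields converted to Integer,
# but only when the value is made up entirely of digits.
def coerce_integers(record, fields = INTEGER_FIELDS)
  record.each_with_object({}) do |(key, value), out|
    out[key] = if fields.include?(key) && value.to_s.match?(/\A-?\d+\z/)
                 value.to_i
               else
                 value
               end
  end
end

raw = { "title" => "A map", "width" => "2048", "height" => "1536", "date" => "1855?" }
typed = coerce_integers(raw)
p typed
```

Values that aren't clean digits, like the made-up `"1855?"` above, are left as strings rather than being silently mangled by `to_i`.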
Anyway, enough about the technical side, what about the images??
Exploring the data
The first thing I did was to figure out the image distribution by year – turns out this is very highly skewed towards the late 1800s, which explains why the data loaders get much slower as they get towards the later files.
The CSV data behind this chart is also in my GitHub repo.
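The counting itself is simple. Here is a plain-Ruby sketch of tallying images per year and printing the result as CSV. It stands in for whatever database query you prefer, it assumes the year has already been converted to an integer under a hypothetical `date` key, and the sample records are made up.

```ruby
# Count how many records fall in each year.
def count_by_year(records)
  counts = Hash.new(0)
  records.each { |record| counts[record["date"]] += 1 }
  counts
end

# Made-up sample records for illustration.
records = [
  { "title" => "A", "date" => 1889 },
  { "title" => "B", "date" => 1855 },
  { "title" => "C", "date" => 1889 }
]

puts "year,images"
count_by_year(records).sort.each { |year, n| puts "#{year},#{n}" }
```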
Next up I spent some time with Elasticsearch exploring various terms to try to find things of local interest to my part of the world. Sadly my home town of Whitstable doesn’t appear anywhere in the metadata, but I was able to find this image of oyster fishermen from a Victorian guide to Herne Bay, published in 1889, and I also found a beautiful map of the Isle of Sheppey, of all places – hardly the best known or most visited place in the world.
I’ve always had a bit of a thing about maps, so this got me started on a treasure hunt, using Elasticsearch to hunt through the database. I’m still finding some amazing old maps, but I’ve collected the best ones so far in a gallery on Flickr.
Back to the metadata, I thought it might be fun to calculate some pointless trivia. So, here we have the 5 biggest images in the collection by pixel area:
|Title|First author|Year|
|---|---|---|
|Wegweiser durch Cleve und dessen nächste Umgebung, etc. [With plates and plans.]|CHAR, F. C.|1855|
|Bibliothèque Bourbonnaise. Générale description du Bourbonnois … Publiée avec une introduction et une table annotée des noms de personnes et de lieux par A. Vayssière|NICOLAY, Nicolas de Seigneur d’Arfeuille|1889|
|The State of Michigan: embracing sketches of its history, position, resources and industries. Compiled … by S. B. McCracken. [With plates and a map.]|MACCRACKEN, Stephen Bromley.|1876|
|Notes historiques sur Sarreguemines depuis l’an 706 jusqu’ après la Révolution française … Cartes et plans|THOMIRE, Auguste.|1887|
|Reality versus Romance in South Central Africa. An account of a journey across the Continent … With … illustrations, etc|JOHNSTON, James M.D., of Brown’s Town, Jamaica|1893|
The scripts I used to create this are also available in my GitHub repo. I used the MongoDB database, mostly from the command line. It’s just a bit of fun, but it should give you pointers if there are any other simple queries of this sort you want to run.
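If you'd rather not set up a database at all, ranking by pixel area can be sketched in plain Ruby. This is not the MongoDB script from the repo, just an illustration of the same idea. The `width` and `height` field names are hypothetical, the sample records are made up, and the sizes are assumed to be integers already.

```ruby
# Made-up sample records; "width" and "height" are hypothetical field names.
images = [
  { "title" => "Small initial letter", "width" => 300,  "height" => 300 },
  { "title" => "Town plan",            "width" => 4000, "height" => 3000 },
  { "title" => "Woodcut",              "width" => 1200, "height" => 900 },
  { "title" => "County map",           "width" => 5000, "height" => 3500 }
]

# Return the `limit` largest images by pixel area, biggest first.
def biggest_by_area(images, limit = 5)
  images.max_by(limit) { |img| img["width"] * img["height"] }
end

biggest_by_area(images, 2).each do |img|
  puts "#{img['title']}: #{img['width'] * img['height']} pixels"
end
```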
You know what? All the biggest images turn out to be maps! I didn’t fudge this, that’s just the way it worked out.
Trawling more randomly through the images, it seems an awful lot of them are illustrated letters from the start of each chapter in the older books, and there’s a lot of intricate design work that’s simply decoration (but often beautifully done).
In between there are a lot of amazing woodcut images, and sometimes you even find the odd photograph.
I’ll leave you with this lovely geometric picture from an 1806 book on travelling in Egypt – I assume it’s the Avenue of the Sphinxes at the Karnak temple but it doesn’t look like that now!
I spruced this one up a tiny bit in Photoshop. One thing you will find is that a number of the images are the wrong way around in Flickr but remember, you’re free to do what you like with them.