Bootstrapping content with Hpricot
On my latest project, I discovered I had to pre-populate the database with existing content. Jon Udell just posted about how much of a waste of time this can be in some circumstances, but here Hpricot and database migrations made it easy. I wouldn't use this approach if I needed the data as anything beyond a one-off bootstrap, but for that purpose it worked really well.
Hpricot, for those who don't know, is an HTML parser for Ruby that's fun to use. When I was first learning Ruby, most of the simple but useful projects I could come up with used Hpricot to grab content off of websites and format or combine it in different ways. Its syntax looks like this:
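require 'hpricot'
require 'open-uri'
# link holds the URL of the page to scrape, as a string
uri = URI.parse(link)
# uri.open comes from open-uri; the bare open(uri) form stopped accepting URIs in Ruby 3.0
doc = Hpricot(uri.open)
# doc/"selector" returns every element that matches the CSS selector
name = (doc/"li.active a").inner_html
page_title = (doc/"title").inner_html
body = (doc/"#content_body").html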
In this example, Hpricot uses CSS selectors to pull different pieces of content out of the page at link. The nice thing about CSS selectors here is that the code tends to be less fragile than screen scrapers that depend on exactly where an element sits in the page's markup.
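To make the fragility point concrete, here is a minimal sketch that contrasts a lookup by position with a lookup by id. It is not Hpricot code: it uses REXML's XPath support so that it runs with only the libraries that ship with Ruby. The markup and the method names body_by_position and body_by_id are made up for illustration. The id-based lookup plays the role of the "#content_body" selector above.

```ruby
require 'rexml/document'

# Hypothetical page markup, before and after a redesign adds a banner div.
BEFORE = <<~HTML
  <html><body>
    <div><ul><li class="active"><a href="/about">About</a></li></ul></div>
    <div id="content_body"><p>Hello</p></div>
  </body></html>
HTML

AFTER = <<~HTML
  <html><body>
    <div class="banner">Sale!</div>
    <div><ul><li class="active"><a href="/about">About</a></li></ul></div>
    <div id="content_body"><p>Hello</p></div>
  </body></html>
HTML

# Serialize the child elements of a node, or return nil if nothing matched.
def inner_markup(node)
  node && node.elements.to_a.map(&:to_s).join
end

# Fragile: assumes the content is always the second div under body.
def body_by_position(html)
  doc = REXML::Document.new(html)
  inner_markup(REXML::XPath.first(doc, "/html/body/div[2]"))
end

# Sturdier: depends only on the id, wherever the div ends up.
def body_by_id(html)
  doc = REXML::Document.new(html)
  inner_markup(REXML::XPath.first(doc, "//div[@id='content_body']"))
end
```

Both methods find the content in BEFORE. Once the banner shifts everything down in AFTER, body_by_position returns the navigation list instead, while body_by_id still returns the content.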
Page scraping can be a frustrating art, especially if the page layout changes or if pages are inconsistent or have their own quirks. Luckily, in this case, I only had to get it right once, and even then I didn't have to get it completely right. I used this four-stage process:
- Use Hpricot to get as much data off the page and into our data structures as possible.
- Persist this data to the database, and fix by hand whatever Hpricot missed or couldn't catch.
- Dump the database to a file, and use it to bootstrap our production database.
- Repeat until finished.
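The loop above can be sketched in plain Ruby. This is only an illustration under some loud assumptions: a YAML file stands in for the database dump, scrape is a stub where the Hpricot parsing would go, and the names scrape, load_dump and run_pass are hypothetical. The real version used a database and Rails migrations rather than anything like this.

```ruby
require 'yaml'
require 'tmpdir'

# Stage 1 stand-in: a real run would fetch and parse the page here.
def scrape(url)
  { "url" => url, "title" => "Title for #{url}", "body" => "" }
end

# Restore whatever earlier passes dumped (nothing on the first pass).
def load_dump(path)
  File.exist?(path) ? YAML.load_file(path) : []
end

# One pass: restore the dump, scrape the pending pages, apply hand
# fixes keyed by URL, then overwrite the dump with the full data set.
def run_pass(path, pending_urls, fixes = {})
  pages = load_dump(path)
  pending_urls.each { |url| pages << scrape(url) }
  pages.each { |page| page.merge!(fixes.fetch(page["url"], {})) }
  File.write(path, pages.to_yaml)
  pages
end

# Two passes against a throwaway dump file.
Dir.mktmpdir do |dir|
  dump = File.join(dir, "pages.yml")
  run_pass(dump, ["/about"], { "/about" => { "body" => "fixed by hand" } })
  run_pass(dump, ["/contact"])
  puts load_dump(dump).map { |page| page["url"] }.inspect
end
```

The point of the design is that each pass starts from the previous dump, so hand corrections made in one pass survive into the next, and only the pages still pending need to be scraped.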
Rails database migrations made this relatively easy. I ended up with three migrations. The first created the structure of the database. The second loaded the current page data from the dump file. The third grabbed a few pages I still needed to parse. That left me with data I could tweak and then dump, overwriting the old dump file with one containing all the page data (including the pages I had just tweaked). I could then blow away the database and repeat until I didn't have any more pages to parse.
This worked perfectly. I didn't have to spend time getting my Hpricot parsing perfect, since I could modify the resulting data using our CMS and re-dump, and I was left with a dump of all the data I needed to dynamically generate these formerly mostly static pages.