Where did the old posts go?

jmcp

2018-05-15 10:00

When I set up my blog in 2006, I chose to use Roller, which was the same engine that the now-defunct blogs.sun.com used at the time. Later, having tired of the interface and being more impressed with WordPress's facility for image galleries, I started running my own instance of that.

After a while, however, the frequent CVEs in both PHP and the WordPress base+plugin system motivated me enough to move away from it.

My friends Shawn and Liane suggested going for a static site generator, so after having a brief look at Pelican and Nikola, I instead chose Hugo.

Hugo's chief attraction was the relative ease with which I could create image galleries, along with the ability to import WordPress sites. Yay, thought I, I can have a fairly seamless transition, and away we went.

I didn't actually check the site backup that I made before turning off the WordPress instance, however. When I later went looking for the pkgrepo procedure that I used for building darktable and couldn't find the content that I wanted, I was a bit annoyed.

Given that Solaris' packaged version of Go is somewhat behind the community version, and that Hugo depends on a much newer version, this was also the trigger for me to re-explore Pelican and Nikola, both of which are written in Python. After a brief flirtation with Pelican I settled instead on Nikola and did the initial Hugo-to-Nikola migration fairly easily. Chris pointed me to the gallery directive plugin and I was able to make a start with some of my more recent gallery collections. A quick implementation of a PR for captioned and ordered images got me the rest of the way, and then I could get back to the real problem: the missing content.

Fortunately for me, the Wayback Machine had a copy of the old site entries, and with a quick installation of the Wayback Machine downloader I was able to grab the whole site as it was up to 2016.

Phew!

Except that all the post files were chock full of WordPress's and Roller's JavaScript and expanded CSS, which were a real mess to look at, let alone extract the posts from.

So I did what anybody else would do, and wrote some Python using BeautifulSoup to provide a best-effort extraction which would translate the HTML+JS+CSS into the plain-text (and therefore portable) reStructuredText format. Since this was a best-effort attempt, I wasn't too concerned about getting output that was perfect rst matching my original posts. I went through about 20 different entries to tidy up the input so that the script would produce something close to what I wanted. I knew that I'd have to go and post-process quite a few entries as well; I just wanted to not have to do too much to get that going.
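To give a flavour of what that kind of best-effort extraction looks like, here is a minimal sketch. It is not the wp-to-rest.py script itself: to keep it runnable without extra packages it uses the standard library's html.parser rather than BeautifulSoup, and the names PostToRst and html_to_rst are made up for illustration. It assumes the interesting content lives in headings, paragraphs, links and simple inline emphasis, and throws away anything inside script or style tags.

```python
from html.parser import HTMLParser


class PostToRst(HTMLParser):
    """Best-effort HTML to reStructuredText converter (hypothetical helper)."""

    SKIP = {"script", "style"}  # page chrome that never belongs in a post
    INLINE = {"em": "*", "i": "*", "strong": "**", "b": "**", "code": "``"}
    HEADINGS = {"h1": "=", "h2": "-"}  # rst underline character per level

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished headings and paragraphs
        self.buf = []        # text fragments of the block being built
        self.skip_depth = 0  # greater than zero while inside <script>/<style>
        self.href = None     # target of the <a> currently open, if any
        self.link_start = 0  # index in buf where the link label begins

    def flush(self, underline=None):
        """Close the current block, collapsing runs of whitespace."""
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if not text:
            return
        if underline:
            text = f"{text}\n{underline * len(text)}"
        self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag == "p" or tag in self.HEADINGS:
            self.flush()  # a new block starts, so finish any stray text
        elif tag in self.INLINE:
            self.buf.append(self.INLINE[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.link_start = len(self.buf)

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "p":
            self.flush()
        elif tag in self.HEADINGS:
            self.flush(self.HEADINGS[tag])
        elif tag in self.INLINE:
            self.buf.append(self.INLINE[tag])
        elif tag == "a":
            # Rewrite the collected label as an rst hyperlink reference.
            label = "".join(self.buf[self.link_start:])
            del self.buf[self.link_start:]
            self.buf.append(f"`{label} <{self.href}>`_" if self.href else label)
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def html_to_rst(html):
    """Convert one archived page to rst text, one blank line between blocks."""
    parser = PostToRst()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text outside a <p>
    return "\n\n".join(parser.blocks) + "\n"
```

Calling html_to_rst() on the contents of each downloaded page gives a first-pass rst body. Anything the sketch doesn't know about (lists, images, tables, nested markup) simply falls through as plain text, which is the sort of thing that then needs tidying by hand.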

I've converted about 340 posts extracted from the Wayback Machine archive using this script, wp-to-rest.py, which took about 2 days to write and finesse, and another day or so to muck around with several posts to get them into better shape (i.e., running nikola build doesn't yell at me). There are still a bunch of broken links in there, but at least now I've got all my content back, and can very easily fix things up as I get the inclination.