Converting old site to a new static site generator
This post is an incomplete walkthrough of converting my old blog, which was built with Trac, to a statically generated blog. For now that means Jekyll, just to get up and running quickly on GitHub Pages.
Dump the site
- Grab and archive the existing site, using wget to mirror it so you have a working copy.
wget --mirror --convert-links --adjust-extension --page-requisites http://www.mywebsite.com/
One could find a way to dump records from the existing site database and then convert the content, but that requires customizing the dump. It also doesn’t make a usable archive of the old site that you can dig through.
Convert the html to markdown
Use Pandoc, generally like so:
pandoc -f html -t markdown
Find the content and ignore everything else. The content sits inside the following HTML tag:
<div id="blog-main">
For a first attempt I followed some tips for trimming the page with sed:
#!/bin/bash
echo "converting $1"
sed '1,/<div id="blog-main">/d' "$1" | sed '/<div class="asset-footer">/,/<\/html>/d' | pandoc --wrap=none --from html --to markdown_strict > "$1.md"
This won’t quite work, since we want only what’s inside the div tag, not the tag itself.
Following some additional ideas, I switched to xmllint and an XPath query:
#!/bin/bash
echo "converting $1"
xmllint --html --xpath "//div[@id='blog-main']/node()" "$1" | pandoc --wrap=none --atx-headers --from html-native_divs-native_spans --to markdown_github > "$1.md"
GitHub Markdown drops the extra attributes from the headers (header_attributes in pandoc; there’s probably a more direct way), but we want those for our Markdown headers.
Now we can run the script over all the files to create a Markdown file for every post.
find *.html -exec ./trac2md.sh '{}' \;
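If xmllint isn't available, the same "contents of the div, but not the div itself" idea can be sketched with Python's standard html.parser. This is a swapped-in alternative, not the script used above. The names DivExtractor and extract_div are made up for illustration, and it assumes the page has a single div with id blog-main.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the inner HTML of the <div> with a given id (hypothetical helper)."""

    def __init__(self, div_id):
        # Keep entities such as &amp; as written instead of decoding them
        super().__init__(convert_charrefs=False)
        self.div_id = div_id
        self.depth = 0   # <div> nesting depth inside the target; 0 means outside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif tag == "div" and dict(attrs).get("id") == self.div_id:
            self.depth = 1   # entering the target; the tag itself is not recorded

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the div depth
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                return       # closing tag of the target; not recorded
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_div(div_id, html):
    """Return everything between <div id=div_id> and its matching </div>."""
    parser = DivExtractor(div_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The returned string could then be piped into pandoc the same way the xmllint output is. Nested divs inside the post body are kept, since only the outermost matching tag pair is dropped.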
- Next, in a new repo set up for Jekyll on GitHub Pages, copy all the Markdown files in as posts.
- Also grab all the files from the raw-attachments folder and put them into assets. These are the images used in posts and the attached files (PDFs, etc.).
mv raw-attachments/*/* ../blog/assets
We could have added command-line arguments to set some of the metadata fields, but instead I put that into a pandoc filter that also sets the title from the first level-1 header.
pandoc _drafts/airlinemap.md -o _drafts/airlinemap.md --filter assets/metadata.py
# Now do it in batch
find _drafts/*.md -exec sh -c 'pandoc -s {} -o {} --filter assets/metadata.py' \;
# now update the datetime
find _drafts/*.md -exec sh -c 'pandoc -s {} -o {} --filter assets/set-datetime.py' \;
# and then drop the Attachment section onward
find _drafts/*.md -exec sed -i '/Posted/Q' {} \;
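The filter scripts themselves aren't shown in this post, so here is a minimal sketch of what a title-setting filter along the lines of assets/metadata.py might look like. It works directly on the JSON AST that pandoc pipes to a filter. The function name set_title_from_h1 and the layout: post field are assumptions for illustration, not the actual contents of my script.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for a metadata filter: title from the first level-1 header."""
import json
import sys


def set_title_from_h1(doc, extra_meta=None):
    """Copy the first level-1 header into meta.title and add fixed string fields."""
    meta = doc.setdefault("meta", {})
    for block in doc.get("blocks", []):
        # A Header block's content is [level, attributes, inlines]
        if block.get("t") == "Header" and block["c"][0] == 1:
            meta["title"] = {"t": "MetaInlines", "c": block["c"][2]}
            break
    # Assumed extras, e.g. a Jekyll layout; adjust or drop as needed
    for key, value in (extra_meta or {}).items():
        meta[key] = {"t": "MetaString", "c": value}
    return doc


if __name__ == "__main__" and not sys.stdin.isatty():
    raw = sys.stdin.read()
    if raw.strip():  # pandoc sends the document AST as JSON on stdin
        json.dump(set_title_from_h1(json.loads(raw), {"layout": "post"}), sys.stdout)
```

Saved as an executable script, this would be passed to pandoc with --filter just like the commands above. Note that it leaves the header itself in the body; removing it would be one more step.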
I’ll probably do this part by hand, since there are so few of them and they aren’t standardized.
# links to attachments (can be images) need to be fixed
# before
../raw-attachment/blog/babyquail/babyquail.jpg
# The regex to find them (sed has no lazy .*?, so match up to the next slash with [^/]*)
\.\./raw-attachment/blog/[^/]*/
# after
attachments/babyquail.jpg
# Now preview the change with sed, then apply it in place
find *.md -type f -exec sed 's|\.\./raw-attachment/blog/[^/]*/|attachments/|g' {} + | less
find *.md -type f -exec sed -i 's|\.\./raw-attachment/blog/[^/]*/|attachments/|g' {} +
# The more complicated ones also need the .html at the end dropped
../attachment/blog/babyquail/babyquail.jpg.html
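Both link shapes can be handled in one pass. Here is a rough sketch in Python, assuming the .html variant should end up at the same attachments/ path as the raw one (the post only shows the "after" form for the raw links, so that target is my assumption). The function name fix_attachment_links is made up for illustration.

```python
import re

# ../attachment/blog/<post>/<file>.html  ->  attachments/<file>   (assumed target)
PAGE_LINK = re.compile(r"\.\./attachment/blog/[^/\s]+/(\S+?)\.html")
# ../raw-attachment/blog/<post>/<file>   ->  attachments/<file>
RAW_LINK = re.compile(r"\.\./raw-attachment/blog/[^/\s]+/")


def fix_attachment_links(text):
    """Rewrite both kinds of old Trac attachment links in a Markdown string."""
    text = PAGE_LINK.sub(r"attachments/\1", text)
    return RAW_LINK.sub("attachments/", text)


if __name__ == "__main__":
    print(fix_attachment_links("![quail](../raw-attachment/blog/babyquail/babyquail.jpg)"))
    print(fix_attachment_links("[quail](../attachment/blog/babyquail/babyquail.jpg.html)"))
```

The lazy \S+? stops at the first .html, so the closing parenthesis of a Markdown link is left alone. A file whose real name contains .html in the middle would need a stricter pattern.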