How to Download an Entire WordPress Blog

Sun, Apr 1, 2012 4-minute read

Sometimes, you stumble on a blog that is so chock full of information that you revel in its every word. And then you realize their archive goes back 5 years!

I’ve read a bunch of great posts on Nate Lawson’s awesome security blog, and decided that I wanted to read it beginning to end.

If you are the owner of said WordPress blog, the solution is easy – use WordPress’ built-in Export feature. There are even handy services that will turn this into a PDF, eBook, or even printed book.

If you’re not the owner, things aren’t so easy:

  1. Sit in front of the computer
  2. Go to the oldest month in the archives menu that you haven’t yet visited
  3. Read that page
  4. Click “Next Page” until those links stop appearing
  5. Go back to the home page
  6. (Repeat steps 2-5 until you’re done with the blog)

Oh, and if you intend to take a break at any point in time, add in a few “try to remember where you were, and find that blog post again” entries.

What I was really hoping for was:

  1. Open a PDF on my Kindle, and read the entire thing in chronological order, letting the Kindle software keep track of where I am.

It turns out that the difference between reality and desire is about twelve lines of PowerShell!

PowerShell’s recent technology previews (and the Windows 8 Consumer and Developer Previews) include the Invoke-WebRequest cmdlet. Think wget / curl, but with PowerShell’s traditional object-based awesome-sauce. For example:

PS C:\temp> Invoke-WebRequest http://www.leeholmes.com/blog |
>>     Foreach-Object Links |
>>     Where-Object InnerText -match "August" |
>>     Foreach-Object Href

http://www.leeholmes.com/blog/2011/08/
http://www.leeholmes.com/blog/2010/08/
http://www.leeholmes.com/blog/2008/08/
http://www.leeholmes.com/blog/2007/08/
http://www.leeholmes.com/blog/2006/08/
http://www.leeholmes.com/blog/2005/08/

When you look at links to the monthly archives, they all follow the pattern:

http://www.example.com/url/<number><number><number><number>/<number><number>/
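
As a quick check, that pattern can be written as a regular expression and tried against a few URLs with PowerShell’s -match operator. This is only a minimal sketch, and the sample URLs are made up for illustration:

```powershell
## Four digits, a slash, two digits, and a trailing slash at the end of the URL
$archiveLinkPattern = '/\d\d\d\d/\d\d/$'

## Hypothetical sample links, for illustration only
$sampleLinks = @(
    'http://www.example.com/url/2011/08/',
    'http://www.example.com/url/about/',
    'http://www.example.com/url/2011/08/page/2/'
)

## Keep only the links that look like monthly archives
$archiveLinks = $sampleLinks | Where-Object { $_ -match $archiveLinkPattern }
$archiveLinks
```

Anchoring the pattern with $ is what keeps paged URLs such as /2011/08/page/2/ out of the list of monthly archives.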

When you visit any of these pages, they have another link. The exact text depends on the blog itself – but it may be “Earlier Entries”, “Next Page”, or similar:

PS C:\temp> $page = Invoke-WebRequest http://www.leeholmes.com/blog/2005/06/
PS C:\temp> $page.Links | Where-Object InnerText -match "Earlier Entries" |
>>     Select-Object -First 1
>>


innerHTML : Earlier Entries ?
innerText : Earlier Entries ?
outerHTML : <A href="http://www.leeholmes.com/blog/2005/06/page/2/">Earlier Entries ?</A>
outerText : Earlier Entries ?
tagName   : A
href      : http://www.leeholmes.com/blog/2005/06/page/2/



Given that knowledge, we can automate the download of the entire blog, dumping it into an HTML file as we go. As a final step, we print this HTML to PDF, and upload it to our Kindle or other reading device.

Note to purists: this HTML file is brutally malformed. It is a collection of HTML pages packed into the same file, rather than one HTML page with all the important content. It is of course possible to make this a valid HTML file by manipulating the content before writing it – there’s just no need to do it if the destination is a PDF anyhow.
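
If you do want a well-formed file, one option is to keep only what sits between each page’s <body> tags and wrap the combined result in a single document. The sketch below is an assumption about how that could look, not part of the script in this post: the Get-BodyContent function name and the sample pages are hypothetical, and a regular expression is a rough tool for HTML that may stumble on unusual markup.

```powershell
## Hypothetical helper: return the text between <body ...> and </body>,
## or the whole input if no body element is found
function Get-BodyContent
{
    param([string] $Html)

    ## (?s) lets "." match newlines; (?i) ignores case
    if($Html -match '(?si)<body[^>]*>(.*)</body>')
    {
        $matches[1]
    }
    else
    {
        $Html
    }
}

## Two made-up pages standing in for downloaded archive pages
$pages = @(
    '<html><head><title>A</title></head><body class="x"><p>First</p></body></html>',
    '<html><body><p>Second</p></body></html>'
)

## Strip each page down to its body, then wrap everything in one document
$bodies = $pages | Foreach-Object { Get-BodyContent $_ }
$combined = "<html><body>`r`n" + ($bodies -join "`r`n") + "`r`n</body></html>"
$combined
```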

And how about the time and effort? In the end, I had a PDF of the entire blog on my Kindle 20 minutes after first having thought of it.

Here’s the PowerShell script that automates this all – cleaned up for your consumption, of course :)

## Things you might want to change
$blogUrl = "http://www.leeholmes.com/blog"
$archiveLinkPattern = '/\d\d\d\d/\d\d/$'
$nextPageText = "Earlier Entries"

## Get the page
$r = Invoke-WebRequest $blogUrl

## Extract the archives links
$links = $r.Links | Where-Object href -match $archiveLinkPattern |
    Foreach-Object href

## The archive links appear newest-first, so reverse them to
## process the oldest month first
$links = @($links)
[array]::Reverse($links)

## Go through each archive page
foreach($link in $links)
{
    ## Create a variable to hold the HTML content for this month
    $monthExport = ""

    do
    {
        ## Get the archives for that month
        $month = Invoke-WebRequest $link

        ## Get the page content, and put it at the beginning of the
        ## monthExport variable. That's because "Earlier Entries"
        ## should be placed before the content we just got.
        $monthExport = $month.Content + "`r`n" + $monthExport

        ## Find the link to "Earlier Entries"
        $link = $month.Links | Where-Object innerText -match $nextPageText |
            Foreach-Object href | Select-Object -First 1

    ## Keep on doing this while we found an "Earlier Entries" link
    } while($link)

    ## Now that we're done with the month, put it at the end of the
    ## HTML file (since we're processing months in order)
    $monthExport >> leeholmes.html
}
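
One thing to watch for on other blogs: the script assumes every href is an absolute URL, as it is in the output shown earlier. If a blog emits relative links instead, they would need to be resolved against the page address before being passed to Invoke-WebRequest. Here is a minimal sketch using the .NET System.Uri class; the Resolve-Link function name and the example URLs are hypothetical.

```powershell
## Hypothetical helper: turn a possibly-relative href into an absolute URL
function Resolve-Link
{
    param([string] $BaseUrl, [string] $Href)

    ## The Uri(Uri, String) constructor resolves $Href against $BaseUrl;
    ## an $Href that is already absolute comes back as-is
    $base = New-Object System.Uri $BaseUrl
    (New-Object System.Uri $base, $Href).AbsoluteUri
}

Resolve-Link "http://www.example.com/blog/" "2011/08/"
Resolve-Link "http://www.example.com/blog/" "/blog/2011/08/page/2/"
Resolve-Link "http://www.example.com/blog/" "http://www.example.com/blog/2005/06/"
```

All three calls produce absolute URLs under http://www.example.com/blog/, so the result can be handed to Invoke-WebRequest regardless of how the blog wrote the link.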