Archives for the Month of March, 2012

How to Download an Entire WordPress Blog

Sometimes, you stumble on a blog that is so chock full of information that you revel in its every word. And then you realize their archive goes back 5 years!

I’ve read a bunch of great posts on Nate Lawson’s awesome security blog, and decided that I wanted to read it beginning to end.

If you are the owner of said WordPress blog, the solution is easy – use WordPress’ built-in Export feature. There are even handy services that will turn this into a PDF, eBook, or even printed book.

If you’re not the owner, things aren’t so easy:

  1. Sit in front of the computer
  2. Go to the oldest month in the archives menu that you haven’t yet visited
  3. Read that page
  4. Click “Next Page” until those links stop appearing
  5. Go back to the home page
  6. (Repeat steps 2-5 until you’re done with the blog)

Oh, and if you intend to take a break at any point in time, add in a few “try to remember where you were, and find that blog post again” entries.

What I was really hoping for was:

  1. Open a PDF on my Kindle, and read the entire thing in chronological order, letting the Kindle software keep track of where I am.

It turns out that the difference between reality and desire is about twelve lines of PowerShell!

PowerShell’s recent technology previews (and the Windows 8 consumer and developer previews) include the Invoke-WebRequest cmdlet. Think wget / curl, but with PowerShell’s traditional object-based awesome-sauce. For example:

PS C:\temp> Invoke-WebRequest http://www.leeholmes.com/blog |
>>     Foreach-Object Links |
>>     Where-Object InnerText -match "August" |
>>     Foreach-Object Href

http://www.leeholmes.com/blog/2011/08/
http://www.leeholmes.com/blog/2010/08/
http://www.leeholmes.com/blog/2008/08/
http://www.leeholmes.com/blog/2007/08/
http://www.leeholmes.com/blog/2006/08/
http://www.leeholmes.com/blog/2005/08/

When you look at links to the monthly archives, they all follow the pattern:

http://www.example.com/url/<number><number><number><number>/<number><number>/
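To see that pattern at work without touching the network, here is a minimal sketch that filters a handful of made-up URLs (the example.com links are assumptions for illustration, not real pages). It uses \d{4} and \d{2} as shorthand for the repeated digits.

```powershell
## Hypothetical sample links -- made up for illustration
$sampleLinks = @(
    "http://www.example.com/blog/2011/08/",
    "http://www.example.com/blog/about/",
    "http://www.example.com/blog/2011/08/page/2/",
    "http://www.example.com/blog/2005/06/"
)

## Four digits, a slash, two digits, and a slash at the very end of the URL
$archiveLinkPattern = '/\d{4}/\d{2}/$'

## Keep only the links that look like monthly archives
$archiveLinks = $sampleLinks | Where-Object { $_ -match $archiveLinkPattern }
$archiveLinks
```

Only the 2011/08 and 2005/06 links survive. The "page/2/" link is filtered out because the pattern is anchored to the end of the URL.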

When you visit any of these pages, they have another link. The exact text depends on the blog itself – but it may be “Earlier Entries”, “Next Page”, or similar:

PS C:\temp> $page = Invoke-WebRequest http://www.leeholmes.com/blog/2005/06/
PS C:\temp> $page.Links | Where-Object InnerText -match "Earlier Entries" |
>>     Select-Object -First 1
>>


innerHTML : Earlier Entries ?
innerText : Earlier Entries ?
outerHTML : <A href="http://www.leeholmes.com/blog/2005/06/page/2/">Earlier Entries ?</A>
outerText : Earlier Entries ?
tagName   : A
href      : http://www.leeholmes.com/blog/2005/06/page/2/



Given that knowledge, we can automate the download of the entire blog, dumping it into an HTML file as we go. As a final step, we print this HTML to PDF, and upload it to our Kindle or other reading device.

Note to purists: this HTML file is brutally malformed. It is a collection of HTML pages packed into the same file, rather than one HTML page with all the important content. It is of course possible to make this a valid HTML file by manipulating the content before writing it – there’s just no need to do it if the destination is a PDF anyhow.
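For the purists who do want a single valid page, here is one rough way it might look. This is only a sketch: Get-BodyContent is a hypothetical helper, the sample pages are made up, and a regular expression is a crude tool for cutting up HTML.

```powershell
## Hypothetical helper: return only what sits between <body> and </body>.
## If there is no body tag, hand the text back unchanged.
function Get-BodyContent([string] $html)
{
    if($html -match '(?is)<body[^>]*>(.*)</body>')
    {
        $matches[1]
    }
    else
    {
        $html
    }
}

## Made-up pages standing in for downloaded archive pages
$pages = @(
    "<html><head><title>June</title></head><body class='archive'><p>June posts</p></body></html>",
    "<html><head><title>July</title></head><body><p>July posts</p></body></html>"
)

## Strip each page down to its body, then wrap everything in one document
$bodies = $pages | Foreach-Object { Get-BodyContent $_ }
$combined = "<html><body>`r`n" + ($bodies -join "`r`n") + "`r`n</body></html>"
$combined
```

A real blog page would also carry sidebars and navigation inside its body, so a serious version would need to pick out just the post content.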

And how about the time and effort? In the end, I had a PDF of the entire blog on my Kindle 20 minutes after first having thought of it.

Here’s the PowerShell script that automates this all – cleaned up for your consumption, of course 🙂

## Things you might want to change
$blogUrl = "http://www.leeholmes.com/blog"
$archiveLinkPattern = '/\d\d\d\d/\d\d/$'
$nextPageText = "Earlier Entries"

## Get the page
$r = Invoke-WebRequest $blogUrl

## Extract the archives links
$links = $r.Links | Where-Object href -match $archiveLinkPattern |
    Foreach-Object href

## Sort the archives in reverse order
$links = $links[($links.Count - 1)..0]

## Go through each archive page
foreach($link in $links)
{
    ## Create a variable to hold the HTML content for this month
    $monthExport = ""

    do
    {
        ## Get the archives for that month
        $month = Invoke-WebRequest $link

        ## Get the page content, and put it at the beginning of the
        ## monthExport variable. That's because "Earlier Entries"
        ## should be placed before the content we just got.
        $monthExport = $month.Content + "`r`n" + $monthExport

        ## Find the link to "Earlier Entries"
        $link = $month.Links | ? innertext -match $nextPageText |
            Foreach-Object href | Select-Object -First 1

    ## Keep on doing this while we found an "Earlier Entries" link
    } while($link)

    ## Now that we're done with the month, put it at the end of the
    ## HTML file (since we're processing months in order)
    $monthExport >> leeholmes.html
}
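The ordering is the fiddly part, so here is a small offline sketch of just that logic. The month names and page text are made up; the assumption is that the archives menu lists months newest first, and that each "Earlier Entries" page holds older posts than the one before it.

```powershell
## Hypothetical data: months newest first, pages within a month newest first
$months = "2012/02", "2012/01"
$pagesByMonth = @{
    "2012/02" = "Feb page 1", "Feb page 2"
    "2012/01" = "Jan page 1", "Jan page 2"
}

## Reverse the months so the oldest month comes first
$months = $months[($months.Count - 1)..0]

$export = foreach($month in $months)
{
    $monthExport = ""
    foreach($page in $pagesByMonth[$month])
    {
        ## Prepend, so that later (older) pages end up before earlier ones
        $monthExport = $page + "`r`n" + $monthExport
    }

    $monthExport
}

$export
```

The months come out oldest first, and within each month the "Earlier Entries" pages come before the first page. Posts within a single page stay in whatever order the blog rendered them.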

A Celebration, if You Can Figure it Out

We were talking about a very cool astrological date on the internal PowerShell mailing list recently. In celebration of this event, Josh Rowe made this brilliant comment. See if you can figure out what it does 🙂

clear;$00=(0..1250|%{9608}),(0..7645|%{9617})|%{$_};(-10..29)|
%{$OO='';$O0=$_;-10..64|%{$0O=$_;$OO+=[char]($00[$0O*$0O-48*$0O+
1720+4*$O0*$O0-96*$O0],@($00[$0O*$0O-52*$0O+1644+4*$O0*$O0-88*
$O0],9617,9617)[(0,1)[($0O-lt28)]+($O0-gt12)])[(0,1)[$0O-gt24]*
($O0-lt14)]};$OO};0..573892165|%{$OO=@((($OO+0)*4*$_*$_/(4*$_*
$_-1)),1d)[$_-lt1];write-progress ":-)"($OO*2)}

I suppose that’s not really fair. Here it is in all of its syntax-highlighted glory:

clear;$00=(0..1250|%{9608}),(0..7645|%{9617})|%{$_};(-10..29)|
%{$OO='';$O0=$_;-10..64|%{$0O=$_;$OO+=[char]($00[$0O*$0O-48*$0O+
1720+4*$O0*$O0-96*$O0],@($00[$0O*$0O-52*$0O+1644+4*$O0*$O0-88*
$O0],9617,9617)[(0,1)[($0O-lt28)]+($O0-gt12)])[(0,1)[$0O-gt24]*
($O0-lt14)]};$OO};0..573892165|%{$OO=@((($OO+0)*4*$_*$_/(4*$_*
$_-1)),1d)[$_-lt1];write-progress ":-)"($OO*2)}

PowerShell Book Reviews

There have been a handful of useful posts recently giving reviews across the spectrum of PowerShell books. I always love reading these posts, as they let you compare and contrast the whole range of quality and approaches. When reading reviews that focus only on a single book, it’s sometimes hard to calibrate – does the reviewer get this excited about blank reams of paper? Slag Shakespeare for his typos?

My favourite is Richard Siddaway’s summary, freshly updated today: http://richardspowershellblog.wordpress.com/2012/03/11/powershell-booksmarch-2012/. The ecosystem of PowerShell books has really blossomed – I love that there are two books on managing VMware on this list!

Jonathan Medd also has a good list here: http://www.jonathanmedd.net/2011/01/recommended-powershell-books.html.

A resource I am hopeful about is Don Jones’ recently launched http://www.powershellbooks.com. Its goal is to “help you select the Windows PowerShell book or books that best fit your current learning and reference needs.” Several of Don’s books are in every “PowerShell must have” list. His “Learn PowerShell in a Month of Lunches” book for beginners is tearing up the charts. Right now, the review site only covers books by Manning Press (plus an additional book that Don wrote), so it’s not an objective survey. I hope he expands his scope to make it one. It also doesn’t actually review the books (it links to product pages), but I’m still crossing my fingers because Don’s touch has a way of turning things into gold 🙂

The books that tend to get left off these lists are the domain-specific ones: PowerShell + SQL, PowerShell + Exchange, PowerShell + SharePoint. Those ecosystems are becoming robust enough to support a survey of “PowerShell SQL” books, for example, although I’m not aware of anyone who has written one.

If you know of any good book comparison lists, please let me know!