Extracting Tables from PowerShell’s Invoke-WebRequest
Monday, 5 January 2015
If you’ve ever wanted to extract tables from a web page in PowerShell, the Invoke-WebRequest cmdlet is exactly what the doctor ordered.
Once you’ve invoked the cmdlet, the ‘ParsedHtml’ property gives you access to the Internet Explorer DOM of that page. From there, you can get elements by tag name (“TABLE”), ID, and more.
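As a minimal sketch of that DOM access, the function below collects every TABLE element from a response. The name Get-HtmlTableElement is hypothetical and not part of any module. The sketch assumes Windows PowerShell, where ParsedHtml is backed by Internet Explorer; PowerShell 6 and later do not expose that property.

```powershell
# Hypothetical helper (not part of any module): returns the TABLE elements
# found in the Internet Explorer DOM of an Invoke-WebRequest response.
function Get-HtmlTableElement {
    param(
        [Parameter(Mandatory)]
        $WebRequest
    )

    # ParsedHtml is the DOM document; getElementsByTagName is a standard DOM method.
    # @(...) collects the result so that the elements are emitted one by one.
    @($WebRequest.ParsedHtml.getElementsByTagName('TABLE'))
}

# Usage, assuming $url holds the address of a page that contains tables:
#   $r = Invoke-WebRequest $url
#   $tables = @(Get-HtmlTableElement $r)
#   $tables.Count
```

Wrapping the call in @(...) on the caller's side keeps $tables an array even when the page contains only a single table.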
One neat application of this technique is to automatically parse data out of tables on the web page. I recently needed to do this, and the PowerShell script really wasn’t that complicated. In true PowerShell style, each row of the table is output as an object, so you can work with the data as you would with the output of any other PowerShell cmdlet. Even better: if the table uses the TH tag (“Table Heading”), the script uses those headings as property names for the output objects. If it doesn’t, the properties get generic names (P1, P2, and so on), as in the example below.
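The row-to-object idea can be sketched on its own, without touching the web. The function below is an illustration under stated assumptions, not the script from this post: ConvertTo-RowObject and its parameters are hypothetical names, and the rows are plain arrays of cell text rather than DOM elements.

```powershell
# Hypothetical illustration: turn rows of cell text into objects.
# Property names come from -Headers when supplied, else fall back to P1, P2, ...
function ConvertTo-RowObject {
    param(
        [object[]] $Rows,      # each entry is an array of cell strings
        [string[]] $Headers    # optional heading text, one per column
    )

    foreach ($row in $Rows) {
        # An ordered dictionary keeps the columns in table order.
        $props = [ordered]@{}
        for ($i = 0; $i -lt $row.Count; $i++) {
            $name = if ($Headers -and $i -lt $Headers.Count) { $Headers[$i] } else { "P$($i + 1)" }
            $props[$name] = $row[$i]
        }
        [PSCustomObject]$props
    }
}

# Sample data taken from the table used in this post.
$rows = @(@('Q1', 'Seat'), @('Q2', 'Portable seat'))
ConvertTo-RowObject -Rows $rows -Headers 'Number', 'Description'
```

Because the output is a stream of objects, it can be piped straight into Where-Object, Sort-Object, or Format-Table.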
Here’s an example of it in action:
1 [C:\Users\leeholm] >> $url = 'http://www.egyptianhieroglyphs.net/gardiners-sign-list/domestic-and-funerary-furniture/'
2 [C:\Users\leeholm] >> $r = Invoke-WebRequest $url
3 [C:\Users\leeholm] >> Get-WebRequestTable.ps1 $r -TableNumber 0 | Format-Table -Auto

P1              P2         P3                   P4
--              --         --                   --
Gardiner Number Hieroglyph Description of Glyph Details
Q1                         Seat                 Phono. st, ws, . In st ?seat, place,? wsir ?Osiris,? ?tm ?perish.?
Q2                         Portable seat        Phono. ws. In wsir ?Osiris.?
Q3                         Stool                Phono. p.
Q4                         Headrest             Det. in wrs ?headrest.?
Q5                         Chest                Det. in hn ?box,? ?fdt ?chest.?
Q6                         Coffin               Det. or Ideo. in qrs ?bury,? krsw ?coffin.?
Q7                         Brazier with flame   Det. of fire. In ?t ?fire,? s?t ?flame,? srf ?temperature.?

4 [C:\Users\leeholm]
And the script: