Extracting Tables from PowerShell’s Invoke-WebRequest
Monday, 5 January 2015
If you’ve ever wanted to extract tables from a web page in PowerShell, the Invoke-WebRequest cmdlet is exactly what the doctor ordered.
Once you’ve invoked the cmdlet, the ‘ParsedHtml’ property gives you access to the Internet Explorer DOM of that page. From there, you can get elements by tag name (“TABLE”), ID, and more.
One neat application of this technique is to automatically parse data out of tables on the web page. I recently needed to do this, and the PowerShell script really wasn’t that complicated. In true PowerShell style, each row of the table is output as an object – that way, you can access the data as you would with any other PowerShell cmdlet. Even better - if the table uses the TH tag (“Table Heading”), it uses those headings as property names for the output objects.
Here’s an example of it in action:
1 [C:\Users\leeholm] >> $url = 'http://www.egyptianhieroglyphs.net/gardiners-sign-list/domestic-and-funerary-furniture/' 2 [C:\Users\leeholm] >> $r = Invoke-WebRequest $url 3 [C:\Users\leeholm] >> Get-WebRequestTable.ps1 $r -TableNumber 0 | Format-Table -Auto P1 P2 P3 P4 -- -- -- -- Gardiner Number Hieroglyph Description of Glyph Details Q1 Seat Phono. st, ws, . In st ?seat, place,? wsir ?Osiris,? ?tm ?perish.? Q2 Portable seat Phono. ws. In wsir ?Osiris.? Q3 Stool Phono. p. Q4 Headrest Det. in wrs ?headrest.? Q5 Chest Det. in hn ?box,? ?fdt ?chest.? Q6 Coffin Det. or Ideo. in qrs ?bury,? krsw ?coffin.? Q7 Brazier with flame Det. of fire. In ?t ?fire,? s?t ?flame,? srf ?temperature.? 4 [C:\Users\leeholm]
And the script:
001
002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 |
param(
[Parameter(Mandatory = $true)] [Microsoft.PowerShell.Commands.HtmlWebResponseObject] $WebRequest,
[Parameter(Mandatory = $true)] [int] $TableNumber ) ## Extract the tables out of the web request $tables = @($WebRequest.ParsedHtml.getElementsByTagName("TABLE")) $table = $tables[$TableNumber] $titles = @() $rows = @($table.Rows) ## Go through all of the rows in the table foreach($row in $rows) { $cells = @($row.Cells)
## If we've found a table header, remember its titles if($cells[0].tagName -eq "TH") { $titles = @($cells | % { ("" + $_.InnerText).Trim() }) continue } ## If we haven't found any table headers, make up names "P1", "P2", etc. if(-not $titles) { $titles = @(1..($cells.Count + 2) | % { "P$_" }) } ## Now go through the cells in the the row. For each, try to find the ## title that represents that column and create a hashtable mapping those ## titles to content $resultObject = [Ordered] @{} for($counter = 0; $counter -lt $cells.Count; $counter++) { $title = $titles[$counter] if(-not $title) { continue }
$resultObject[$title] = ("" + $cells[$counter].InnerText).Trim() } ## And finally cast that hashtable to a PSCustomObject [PSCustomObject] $resultObject } |
No. 1 — January 6th, 2015 at 6:29 am
[…] Extracting Tables from PowerShell’s Invoke-WebRequest (Lee Holmes) […]
No. 2 — January 15th, 2015 at 4:08 pm
I tried modifying your script to use querySelector instead of getElementsByTagName but it always returns $null. What’s up with that?
No. 3 — February 5th, 2015 at 4:29 pm
[…] stored the URL and IP address information into tables on this page, but thanks to Lee Holmes Get-WebRequestTable script I could loop through the various tables that contain the URL and IP address range […]
No. 4 — February 26th, 2015 at 6:02 am
[…] Extracting Tables from PowerShell’s Invoke-WebRequest […]
No. 5 — March 8th, 2015 at 2:35 am
[…] makes Invoke-WebRequest incredibly powerful and useful for a good deal more than just downloading files. If this wasn’t the case, the syntax would be simpler than […]
No. 6 — November 2nd, 2015 at 7:39 am
Very good and handy code published. Thank You
No. 7 — February 10th, 2016 at 6:17 pm
Thanks for this, I incorporated it into automation for pulling down the patch notes from MS every month to assemble a report I am responsible for.
No. 8 — March 1st, 2016 at 11:45 pm
[…] script uses Get-HtmlTable (Invoke-Html wraps this) which is a tweaked version of Lee Holmes Extracting Tables from PowerShell’s Invoke-WebRequest. The property Days On Gallery is calculated and add to PowerShell objects return from the gallery […]
No. 9 — June 22nd, 2016 at 2:18 pm
Great script, so simple, so beautiful. Works absolutely perfect. Now I just have to tweak it a little bit to get it to output data to a CSV.
No. 10 — October 12th, 2016 at 4:57 pm
After login …. how I can do the Get-WebRequestTable.ps1 ??? please
No. 11 — December 14th, 2016 at 2:56 pm
Nice work exactly what I have been looking for, although need to tweak for my individual needs this is the closest I have come! Not very good with PowerShell yet!
Quick question if I wanted to add a column from a variable/Parameter how would I do this? I’m building this into a for each loop to ideally want the server or environment its querying next to the output.
No. 12 — September 21st, 2017 at 1:16 pm
Hi. Great script! However if innerText is empty i want it to get the tag value. Example:
0
0.03 %
The percentage of alien packets
How can I retrieve the value ‘green’for that column instead of empty string?
No. 13 — September 21st, 2017 at 1:18 pm
It parsed my html. Above the first value there is td bgcolor=’green’></.
No. 14 — October 16th, 2017 at 2:07 pm
This is bloody good. Thanks a lot.
No. 15 — November 14th, 2017 at 2:18 pm
This works awesome, just what I was looking for. Thank you.
No. 16 — September 5th, 2018 at 9:22 pm
Hey, i would like to use your code in a blog post on my site, can i be publish your code and reference your site? Thansk.
No. 17 — October 12th, 2020 at 5:09 pm
Encoding of the webpage is not preserved. Non-ascii characters will turn into garbage.