Extracting Tables from PowerShell’s Invoke-WebRequest

If you’ve ever wanted to extract tables from a web page in PowerShell, the Invoke-WebRequest cmdlet is exactly what the doctor ordered.

Once you’ve invoked the cmdlet, the ‘ParsedHtml’ property gives you access to the Internet Explorer DOM of that page. From there, you can get elements by tag name (“TABLE”), ID, and more.

One neat application of this technique is to automatically parse data out of tables on the web page. I recently needed to do this, and the PowerShell script really wasn’t that complicated. In true PowerShell style, each row of the table is output as an object – that way, you can access the data as you would with any other PowerShell cmdlet. Even better – if the table uses the TH tag (“Table Heading”), it uses those headings as property names for the output objects.

Here’s an example of it in action:

1 [C:\Users\leeholm]
>> $url = 'http://www.egyptianhieroglyphs.net/gardiners-sign-list/domestic-and-funerary-furniture/'

2 [C:\Users\leeholm]
>> $r = Invoke-WebRequest $url

3 [C:\Users\leeholm]
>> Get-WebRequestTable.ps1 $r -TableNumber 0 | Format-Table -Auto

P1              P2         P3                   P4
--              --         --                   --
Gardiner Number Hieroglyph Description of Glyph Details
Q1                         Seat                 Phono. st, ws, . In st ?seat, place,? wsir ?Osiris,? ?tm ?perish.?
Q2                         Portable seat        Phono. ws. In wsir ?Osiris.?
Q3                         Stool                Phono. p.
Q4                         Headrest             Det. in wrs ?headrest.?
Q5                         Chest                Det. in hn ?box,? ?fdt ?chest.?
Q6                         Coffin               Det. or Ideo. in qrs ?bury,? krsw ?coffin.?
Q7                         Brazier with flame   Det. of fire. In ?t ?fire,? s?t ?flame,? srf ?temperature.?

4 [C:\Users\leeholm]                                                                                                    

And the script:

001

002

003

004

005

006

007

008

009

010

011

012

013

014

015

016

017

018

019

020

021

022

023

024

025

026

027

028

029

030

031

032

033

034

035

036

037

038

039

040

041

042

043

044

045

046

047

048

param(

    [Parameter(Mandatory = $true)]

    [Microsoft.PowerShell.Commands.HtmlWebResponseObject] $WebRequest,

   

    [Parameter(Mandatory = $true)]

    [int] $TableNumber

)

## Extract the tables out of the web request

$tables = @($WebRequest.ParsedHtml.getElementsByTagName("TABLE"))

$table = $tables[$TableNumber]

$titles = @()

$rows = @($table.Rows)

## Go through all of the rows in the table

foreach($row in $rows)

{

    $cells = @($row.Cells)

   

    ## If we’ve found a table header, remember its titles

    if($cells[0].tagName -eq "TH")

    {

        $titles = @($cells | % { ("" + $_.InnerText).Trim() })

        continue

    }

    ## If we haven’t found any table headers, make up names "P1", "P2", etc.

    if(-not $titles)

    {

        $titles = @(1..($cells.Count + 2) | % { "P$_" })

    }

    ## Now go through the cells in the the row. For each, try to find the

    ## title that represents that column and create a hashtable mapping those

    ## titles to content

    $resultObject = [Ordered] @{}

    for($counter = 0; $counter -lt $cells.Count; $counter++)

    {

        $title = $titles[$counter]

        if(-not $title) { continue }

       

        $resultObject[$title] = ("" + $cells[$counter].InnerText).Trim()

    }

    ## And finally cast that hashtable to a PSCustomObject

    [PSCustomObject] $resultObject

}

11 Responses to “Extracting Tables from PowerShell’s Invoke-WebRequest”

  1. Dew Drop – January 6, 2015 (#1927) | Morning Dew writes:

    […] Extracting Tables from PowerShell’s Invoke-WebRequest (Lee Holmes) […]

  2. asdf writes:

    I tried modifying your script to use querySelector instead of getElementsByTagName but it always returns $null. What’s up with that?

  3. PowerShell Script: Retrieve all Office 365 URLs and IP Ranges | Anything about IT writes:

    […] stored  the URL and IP address information into tables on this page, but thanks to Lee Holmes Get-WebRequestTable script I could loop through the various tables that contain the URL and IP address range […]

  4. February 2015 NoVa PSUG Meeting Notes | PowerShell.org writes:

    […] Extracting Tables from PowerShell’s Invoke-WebRequest […]

  5. PowerShell: Download a list of files | User Error writes:

    […] makes Invoke-WebRequest incredibly powerful and useful for a good deal more than just downloading files. If this wasn’t the case, the syntax would be simpler than […]

  6. Eric writes:

    Very good and handy code published. Thank You

  7. Bronson writes:

    Thanks for this, I incorporated it into automation for pulling down the patch notes from MS every month to assemble a report I am responsible for.

  8. Tracking PowerShell Gallery Downloads – Development in a Blink writes:

    […] script uses Get-HtmlTable (Invoke-Html wraps this) which is a tweaked version of Lee Holmes Extracting Tables from PowerShell’s Invoke-WebRequest. The property Days On Gallery is calculated and add to PowerShell objects return from the gallery […]

  9. Vanegas writes:

    Great script, so simple, so beautiful. Works absolutely perfect. Now I just have to tweak it a little bit to get it to output data to a CSV.

  10. Jean Paul writes:

    After login …. how I can do the Get-WebRequestTable.ps1 ??? please

  11. Graham writes:

    Nice work exactly what I have been looking for, although need to tweak for my individual needs this is the closest I have come! Not very good with PowerShell yet!

    Quick question if I wanted to add a column from a variable/Parameter how would I do this? I’m building this into a for each loop to ideally want the server or environment its querying next to the output.

Leave a Reply