Extracting Tables from PowerShell's Invoke-WebRequest

Mon, Jan 5, 2015 3-minute read

If you’ve ever wanted to extract tables from a web page in PowerShell, the Invoke-WebRequest cmdlet is exactly what the doctor ordered.

Once you’ve invoked the cmdlet, the ‘ParsedHtml’ property gives you access to the Internet Explorer DOM of that page. From there, you can get elements by tag name (“TABLE”), ID, and more.

One neat application of this technique is to automatically parse data out of tables on the web page. I recently needed to do this, and the PowerShell script really wasn’t that complicated. In true PowerShell style, each row of the table is output as an object – that way, you can access the data as you would with any other PowerShell cmdlet. Even better - if the table uses the TH tag (“Table Heading”), it uses those headings as property names for the output objects.

Here’s an example of it in action:

1 [C:\Users\leeholm]
>> $url = 'http://www.egyptianhieroglyphs.net/gardiners-sign-list/domestic-and-funerary-furniture/'

2 [C:\Users\leeholm]
>> $r = Invoke-WebRequest $url

3 [C:\Users\leeholm]
>> Get-WebRequestTable.ps1 $r -TableNumber 0 | Format-Table -Auto

P1              P2         P3                   P4
--              --         --                   --
Gardiner Number Hieroglyph Description of Glyph Details
Q1                         Seat                 Phono. st, ws, . In st ?seat, place,? wsir ?Osiris,? ?tm ?perish.?
Q2                         Portable seat        Phono. ws. In wsir ?Osiris.?
Q3                         Stool                Phono. p.
Q4                         Headrest             Det. in wrs ?headrest.?
Q5                         Chest                Det. in hn ?box,? ?fdt ?chest.?
Q6                         Coffin               Det. or Ideo. in qrs ?bury,? krsw ?coffin.?
Q7                         Brazier with flame   Det. of fire. In ?t ?fire,? s?t ?flame,? srf ?temperature.?

4 [C:\Users\leeholm]

And the script:

param(
    [Parameter(Mandatory = $true)]
    [Microsoft.PowerShell.Commands.HtmlWebResponseObject] $WebRequest,
  
    [Parameter(Mandatory = $true)]
    [int] $TableNumber
)

## Extract the tables out of the web request
$tables = @($WebRequest.ParsedHtml.getElementsByTagName("TABLE"))
$table = $tables[$TableNumber]
$titles = @()
$rows = @($table.Rows)

## Go through all of the rows in the table
foreach($row in $rows)
{
    $cells = @($row.Cells)
   
    ## If we've found a table header, remember its titles
    if($cells[0].tagName -eq "TH")
    {
        $titles = @($cells | % { ("" + $_.InnerText).Trim() })
        continue
    }

    ## If we haven't found any table headers, make up names "P1", "P2", etc.
    if(-not $titles)
    {
        $titles = @(1..($cells.Count + 2) | % { "P$_" })
    }

    ## Now go through the cells in the the row. For each, try to find the
    ## title that represents that column and create a hashtable mapping those
    ## titles to content
    $resultObject = [Ordered] @{}
    for($counter = 0; $counter -lt $cells.Count; $counter++)
    {
        $title = $titles[$counter]
        if(-not $title) { continue }  

        $resultObject[$title] = ("" + $cells[$counter].InnerText).Trim()
    }

    ## And finally cast that hashtable to a PSCustomObject
    [PSCustomObject] $resultObject
}