parse-textObject – AWK with a vengeance.

Hopefully you’ve been following along with Eric’s regular expression exercises, because we’re about to add another cool tool to your Monad toolbox.  If your regex-fu is strong, you will soon dice text streams with ease.


As you well know, one of the strongest features of Monad is that the pipeline is object-based.  You don’t waste your energy creating, destroying, and recreating the object representation of your data.  In past shells, you lose the full-fidelity representation of your data as soon as the pipeline converts it to plain text.  You can regain some of it through excessive text parsing, but not all of it.


However, we still often have to interact with low-fidelity input originating from outside of Monad.  Text-based data files and legacy programs are two examples.


If you’re used to searching through files with Grep, you’ve hopefully discovered Monad’s match-string cmdlet.  If you’re used to dynamically replacing patterns in a stream of text with Sed, you’ve hopefully discovered the [Regex]::Replace() method.  If you’re used to extracting text from a stream with Awk, you’ve hopefully discovered… [String]::Split()?  OK, it’s the best you have so far, but it gets better.
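
To make the Sed and Awk comparisons concrete, here is a minimal sketch of the two .NET calls mentioned above. The sample line and the `//mirror/` prefix are made up for illustration; only `[Regex]::Replace()` and `Split()` come from the discussion.

```powershell
# Sample line, modelled on the source control output shown later in this post
$line = "//depot/main/dirs#2 - edit default change (text)"

# Sed-style: replace a pattern using a .NET regular expression.
# The "//mirror/" prefix is a hypothetical replacement value.
$replaced = [Regex]::Replace($line, "^//depot/", "//mirror/")

# Awk-style: break the line into fields on every space
$fields = $line.Split(" ")

$replaced        # //mirror/main/dirs#2 - edit default change (text)
$fields.Length   # 6
$fields[3]       # default
$fields[4]       # change
```

Notice that Split() cuts "default change" into two separate fields, and leaves the revision number glued to the path. That is the kind of limitation a capture-group approach avoids.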


The following parse-textObject script allows you to convert many text streams into a meaningful object-based representation.  From there, you can use all of Monad’s powerful object-based filtering cmdlets as you would normally.


Here’s an example, using the output of a source control system we use at work.



MSH:48 D:\enlistment > sd opened
//depot/main/dirs#2 - edit default change (text)
//depot/main/output.txt#0 - add default change (text)
//depot/main/sdb.ini#1 - delete default change (text)
MSH:49 D:\enlistment >
MSH:49 D:\enlistment > $sdObjectDefinition = @("Path","Revision","Change","ChangeList","Type")
MSH:50 D:\enlistment > $sdFormat = "(.*)#([^ \t]+) - ([^ \t]+) (.*) \((.*)\)"
MSH:51 D:\enlistment > $results = sd opened | parse-textobject -ParseExpression:$sdFormat `
>> -PropertyName:$sdObjectDefinition
>>


MSH:52 D:\enlistment > $results | format-table


Path                      Revision                  Change                    ChangeList                Type
----                      --------                  ------                    ----------                ----
//depot/main/dirs         2                         edit                      default change            text
//depot/main/output.txt   0                         add                       default change            text
//depot/main/sdb.ini      1                         delete                    default change            text


MSH:53 D:\enlistment > $results | where { $_.Revision -gt 1 }


Path       : //depot/main/dirs
Revision   : 2
Change     : edit
ChangeList : default change
Type       : text
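
If you are curious what happens to a single line, the core idea can be sketched in a few statements. This is a simplified illustration rather than the script itself: it handles one hard-coded line, names only three of the five properties, and converts the revision to an integer with -as (the session above leaves every property as a string).

```powershell
$line = '//depot/main/dirs#2 - edit default change (text)'
$pattern = '(.*)#([^ \t]+) - ([^ \t]+) (.*) \((.*)\)'

if($line -match $pattern)
{
    ## -match fills $matches: index 0 is the whole match,
    ## and the captured groups start at index 1
    $record = New-Object PSObject
    $record | Add-Member NoteProperty Path $matches[1]
    $record | Add-Member NoteProperty Revision ($matches[2] -as [int])
    $record | Add-Member NoteProperty ChangeList $matches[4]
    $record
}
```

Because Revision is a real integer in this sketch, a filter such as `where { $_.Revision -gt 1 }` compares numbers rather than strings.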


And here’s the script:












##############################################################################
##
## Convert-TextObject.ps1 -- Convert a simple string into a custom PowerShell
## object.
##
## From Windows PowerShell Cookbook (O’Reilly)
## by Lee Holmes (http://www.leeholmes.com/guide)
##
## Parameters:
##
## [string] Delimiter
## If specified, gives the .NET Regular Expression with which to
## split the string. The script generates properties for the
## resulting object out of the elements resulting from this split.
## If not specified, defaults to splitting on the maximum amount
## of whitespace: "\s+", as long as ParseExpression is not
## specified either.
##
## [string] ParseExpression
## If specified, gives the .NET Regular Expression with which to
## parse the string. The script generates properties for the
## resulting object out of the groups captured by this regular
## expression.
##
## ** NOTE ** Delimiter and ParseExpression are mutually exclusive.
##
## [string[]] PropertyName
## If specified, the script will pair the names from this object
## definition with the elements from the parsed string. If not
## specified (or if the generated object contains more properties
## than you specify), the script uses property names in the
## pattern of Property1,Property2,…,PropertyN
##
## [type[]] PropertyType
## If specified, the script will pair the types from this list with
## the properties from the parsed string. If not specified (or if the
## generated object contains more properties than you specify), the
## script sets the properties to be of type [string]
##
##
## Example usage:
## "Hello World" | Convert-TextObject
## Generates an Object with "Property1=Hello" and "Property2=World"
##
## "Hello World" | Convert-TextObject -Delimiter "ll"
## Generates an Object with "Property1=He" and "Property2=o World"
##
## "Hello World" | Convert-TextObject -ParseExpression "He(ll.*o)r(ld)"
## Generates an Object with "Property1=llo Wo" and "Property2=ld"
##
## "Hello World" | Convert-TextObject -PropertyName FirstWord,SecondWord
## Generates an Object with "FirstWord=Hello" and "SecondWord=World"
##
## "123 456" | Convert-TextObject -PropertyType $([string],[int])
## Generates an Object with "Property1=123" and "Property2=456"
## The second property is an integer, as opposed to a string
##
##############################################################################
param(
    [string] $delimiter, 
    [string] $parseExpression, 
    [string[]] $propertyName, 
    [type[]] $propertyType
    )

function Main(
    $inputObjects, $parseExpression, $propertyType, 
    $propertyName, $delimiter)
{
    $delimiterSpecified = [bool] $delimiter
    $parseExpressionSpecified = [bool] $parseExpression

    ## If they’ve specified both ParseExpression and Delimiter, show usage
    if($delimiterSpecified -and $parseExpressionSpecified)
    {
        Usage
        return
    }

    ## If they enter no parameters, assume a default delimiter of whitespace
    if(-not $($delimiterSpecified -or $parseExpressionSpecified))
    {
        $delimiter = "\s+"
        $delimiterSpecified = $true
    }

    ## Cycle through the $inputObjects, and parse it into objects
    foreach($inputObject in $inputObjects)
    {
        if(-not $inputObject) { $inputObject = "" }
        foreach($inputLine in $inputObject.ToString())
        {
            ParseTextObject $inputLine $delimiter $parseExpression `
                $propertyType $propertyName
        }
    }
}

function Usage
{
    "Usage: "
    " Convert-TextObject"
    " Convert-TextObject -ParseExpression parseExpression " +
        "[-PropertyName propertyName] [-PropertyType propertyType]"
    " Convert-TextObject -Delimiter delimiter " +
        "[-PropertyName propertyName] [-PropertyType propertyType]"
    return
}

## Function definition -- ParseTextObject.
## Perform the heavy lifting -- parse a string into its components.
## For each component, add it as a note to the Object that we return
function ParseTextObject
{
    param(
        $textInput, $delimiter, $parseExpression,
        $propertyTypes, $propertyNames)

    $parseExpressionSpecified = -not $delimiter

    $returnObject = New-Object PSObject

    $matches = $null
    $matchCount = 0
    if($parseExpressionSpecified)
    {
        ## Populates the matches variable by default
        [void] ($textInput -match $parseExpression)
        $matchCount = $matches.Count
    }
    else
    {
        $matches = [Regex]::Split($textInput, $delimiter)
        $matchCount = $matches.Length
    }

    if(-not $matchCount)
    {
        return
    }

    $counter = 0
    if($parseExpressionSpecified) { $counter++ }
    for(; $counter -lt $matchCount; $counter++)
    {
        $propertyName = "None"
        $propertyType = [string]

        ## Parse by Expression
        if($parseExpressionSpecified)
        {
            $propertyName = "Property$counter"

            ## Get the property name
            if($counter -le $propertyNames.Length)
            {
                if($propertyNames[$counter - 1])
                {
                    $propertyName = $propertyNames[$counter - 1]
                }
            }

            ## Get the property type
            if($counter -le $propertyTypes.Length)
            {
                if($propertyTypes[$counter - 1])
                {
                    $propertyType = $propertyTypes[$counter - 1]
                }
            }
        }
        ## Parse by delimiter
        else
        {
            $propertyName = "Property$($counter + 1)"

            ## Get the property name
            if($counter -lt $propertyNames.Length) 
            {
                if($propertyNames[$counter])
                {
                    $propertyName = $propertyNames[$counter] 
                }
            }

            ## Get the property type
            if($counter -lt $propertyTypes.Length)
            {
                if($propertyTypes[$counter])
                {
                    $propertyType = $propertyTypes[$counter] 
                }
            }
        }

        Add-Note $returnObject $propertyName `
            ($matches[$counter] -as $propertyType)
    }

    $returnObject
}

## Add a note to an object
function Add-Note ($object, $name, $value) 
{
     $object | Add-Member NoteProperty $name $value
}


Main $input $parseExpression $propertyType $propertyName $delimiter
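
The delimiter path of the script leans on two building blocks: [Regex]::Split() to break up the line, and the -as operator to apply a property type. Here is a small sketch of both on their own, using a made-up tab-separated record. It only demonstrates the building blocks; it does not call the script.

```powershell
## A made-up tab-separated record; `t is PowerShell's escape for a tab
$record = "widget`t42`t3.5"

## Single quotes pass \t to the regex engine untouched, where it matches a tab
$fields = [Regex]::Split($record, '\t')

## -as converts when it can, and returns $null when it cannot
$count = $fields[1] -as [int]
$bad = "abc" -as [int]

$fields.Length     # 3
$count             # 42
$bad -eq $null     # True

## Why the type matters: strings and integers compare differently
"10" -gt "9"       # False (compared as strings)
10 -gt 9           # True  (compared as numbers)
```

The last two lines are the practical reason to supply a property type for numeric columns before filtering on them.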

[Edit: Monad has now been renamed to Windows PowerShell. This script or discussion may require slight adjustments before it applies directly to newer builds.]


[Edit 2: Updated script to work with new builds]
[Edit 3: Updated script to add type constraints, and consistent parameter naming]
[Edit 4: Updated again]

4 Responses to “parse-textObject – AWK with a vengeance.”

  1. mitch writes:

    Cool script. Thanks.

    Can’t get it to work on tab delimited text file though. :(

    mitch

  2. Thomas Kunka writes:

    Simple question…how do I include this code with my own? I’m struggling with basics I guess…I have the 2nd edition of your book and I’m not seeing a clear description of how this is done.
    thanks…tk

  3. Lee Holmes writes:

    Hi Thomas;

    If the script is in your path, you should just be able to type the examples as shown at the top of the script. If it’s not in your path, see recipe 1.1 for how to run scripts that are in specific directories.

    Hope this helps.

