WebSockets from Scratch

Background

In the web application world – especially single-page applications – smooth and fluid interaction is key. For many years, these applications have done a pretty good job of getting this fluid interaction through AJAX techniques and browser support for XMLHttpRequest. One issue, however, is that XMLHttpRequest requires that all of your communication go through a text-based HTTP protocol. Another issue is that XMLHttpRequest doesn’t let a server initiate communication back to connected clients. Instead, clients need to continuously poll the server to find out if it has anything to say.

To address that need, browser vendors introduced the WebSocket protocol. The WebSocket protocol provides a persistent, two-way channel between a WebSocket client (traditionally a browser) and a server over a single TCP connection.

DrawOnMyBadge.com

You might wonder: “That’s all well and good, but why should I care?” That’s a good question with a simple answer: http://www.drawonmybadge.com 🙂 This is a really cool project designed by Tim McGuffin (@NotMedic). Whatever you draw on the website goes onto his badge – via a 64x32 LED board and a bunch more cool software.

image

Tim carried this around DEF CON this year, and people had lots of fun 🙂

image

But can we have some _real_ fun? 🙂

Digging into the Implementation

When you dig into the code a little bit, you can see how the browser does the communication. For every pixel you draw on the canvas, it converts it into a JSON object with a command called “DRAW”, and “DATA” that represents an X, Y, Color combination. Then it sends that string data to a connected WebSocket (exampleSocket.send(…)).

image

Partial Automation – Code Generation

The first step to having fun with Tim’s badge was through code generation. It is really easy in PowerShell and .NET to iterate through the pixels of an image (bitmap, PNG, etc.). Each pixel you access gives you its color. From there, I took a picture I manually resized to the correct dimensions and had PowerShell generate JavaScript code that I could copy + paste into the Developer Tools console.

image
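
As a rough sketch of that code-generation idea (the file path, the JSON field names, and the color format are all guesses – the real message shape is whatever the site’s JavaScript expects):

Add-Type -AssemblyName System.Drawing
$bitmap = New-Object System.Drawing.Bitmap "C:\temp\badge.png"   ## assumed path, pre-resized to 64x32
$lines = foreach($y in 0..($bitmap.Height - 1)) {
    foreach($x in 0..($bitmap.Width - 1)) {
        $color = $bitmap.GetPixel($x, $y)
        ## Emit one line of JavaScript per pixel to paste into the Developer Tools console
        'exampleSocket.send(JSON.stringify({{ "COMMAND": "DRAW", "DATA": "{0},{1},#{2:X2}{3:X2}{4:X2}" }}));' -f $x, $y, $color.R, $color.G, $color.B
    }
}
$lines | Set-Content C:\temp\draw.js   ## then copy + paste the contents into the browser console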

And from there, the Mona Lisa made her first appearance on Tim’s badge 🙂

image

That was fun, but copying + pasting code that PowerShell generated into a browser still felt a little hacky. Why don’t I just talk to the WebSocket directly from PowerShell? There are a few C# libraries for doing this out there, but I thought it would be a fun / interesting project to implement the protocol from scratch. So that’s what we’re going to do 🙂

WebSockets from Scratch

The WebSocket protocol is defined by RFC 6455, which goes through the protocol in great detail.

Initial Upgrade Request

For the first phase of the connection, the browser makes a standard TCP connection to the remote server. Here’s how to do that in PowerShell:

image
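
A minimal sketch of that connection (the port is an assumption):

$client = New-Object System.Net.Sockets.TcpClient("www.drawonmybadge.com", 80)
$stream = $client.GetStream()
$writer = New-Object System.IO.StreamWriter($stream)
$reader = New-Object System.IO.StreamReader($stream)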

In that TCP connection, it makes a standard HTTP request, requesting an upgrade to the WebSocket protocol. Part of this HTTP request includes a Sec-WebSocket-Key header, which is intended to ensure that random HTTP requests can’t be retargeted to WebSocket servers, and that WebSocket client requests can’t be targeted to arbitrary other TCP servers. Here’s an example, which hard-codes the key for demonstration purposes.

image
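
A rough sketch of that upgrade request. The Sec-WebSocket-Key is hard-coded for demonstration (a real client sends 16 random bytes, Base64-encoded), and the request path and Host header are assumptions:

$request = "GET / HTTP/1.1`r`n" +
    "Host: www.drawonmybadge.com`r`n" +
    "Upgrade: websocket`r`n" +
    "Connection: Upgrade`r`n" +
    "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==`r`n" +
    "Sec-WebSocket-Version: 13`r`n`r`n"
$writer.Write($request)
$writer.Flush()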

Once the server accepts the connection upgrade, a well-written client will verify that the key included in the server’s response was correctly derived from the Sec-WebSocket-Key you provided. This is what a server response looks like:

image
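
A sketch of that verification, using the same hard-coded key. The derivation is defined by RFC 6455: the accept value is Base64(SHA1(key + a fixed magic GUID string)):

$key = "dGhlIHNhbXBsZSBub25jZQ=="
$magic = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
$sha1 = [System.Security.Cryptography.SHA1]::Create()
$expectedAccept = [Convert]::ToBase64String($sha1.ComputeHash([System.Text.Encoding]::ASCII.GetBytes($key + $magic)))
## $expectedAccept should match the Sec-WebSocket-Accept header in the server's response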

Data Frames

Once the connection has been upgraded, client applications send frames of data. This isn’t the raw data itself – there is a structure around the data to describe it appropriately to the server accepting the connection.

image

To hold the frame data in PowerShell, we start with defining a byte array. The first 8 bits are:

  • FIN: 1 bit describing whether this is the final frame or not. In my experimentation with Chrome, this was always set to ‘1’.
  • RSV1, RSV2, and RSV3: 1 bit each that should always be zero unless you’re implementing a special protocol on top of WebSockets.
  • OPCODE: 4 bits. The one used by Chrome by default is “TEXT FRAME”, which has the value of 0001.

Putting these 8 bits together gives you a single byte with the bits of 10000001. So that’s how we start each frame:
image
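
As a sketch (the screenshot used a byte array; a List[Byte] keeps the sketch simple):

$frame = New-Object System.Collections.Generic.List[Byte]
$frame.Add(0x81)    ## FIN=1, RSV1-3=0, OPCODE=0x1 (text frame) -> 10000001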

The next 8 bits (the next byte) are:

  • MASK: Whether the data is “masked” by a random obfuscation key. The protocol requires browser clients to mask the data they send, so that malicious web applications can’t cause arbitrary, attacker-chosen bytes to be written to the underlying TCP connection itself.
  • Payload Length: If the content is less than 126 bytes, this represents the payload length directly. Otherwise, it needs to be the value ‘126’, with the next two bytes representing the actual payload length. (Payloads larger than 65,535 bytes use the value ‘127’ and eight length bytes, but we don’t need that here.)

Since we are always supposed to mask the data, we define our mask as ‘1’. However, this is supposed to be a bit at the leftmost position in the byte, so we need to shift it left by 7 places.

image

Since the mask and the payload length need to share the same byte (MASK being the leftmost bit and Payload Length being the rest), we use the –bor (Binary Or) bitwise operator to combine them into a single byte and then add that byte to the frame.

image
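
As a sketch, assuming a payload shorter than 126 bytes ($message here stands in for the JSON string we want to send):

$payload = [System.Text.Encoding]::UTF8.GetBytes($message)
$maskBit = 1 -shl 7                          ## MASK flag in the leftmost bit position (0x80)
$frame.Add($maskBit -bor $payload.Length)    ## MASK | payload length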

In the situation where the message length is 126 bytes or greater, though, the next two bytes need to also say how long the payload is. In C# and PowerShell, the way to get the bytes of a number is through the BitConverter.GetBytes() API. This API returns the bytes as they are represented by your processor. On most processors, this is Little Endian, where the least significant digits (think the 1s and 10s columns in decimal) come first.

image

The WebSocket protocol, however, requires that these bytes are in “Network Order” (the opposite of Little Endian), so we need to reverse them.

image
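
As a sketch of what replaces that second byte when the payload is 126 bytes or longer – the ‘126’ marker, followed by the two-byte length reversed into network order on a little-endian machine:

$frame.Add(0x80 -bor 126)          ## MASK | the '126' extended-length marker
$lengthBytes = [BitConverter]::GetBytes([UInt16] $payload.Length)
if([BitConverter]::IsLittleEndian) { [Array]::Reverse($lengthBytes) }
$frame.AddRange($lengthBytes)      ## two-byte payload length, big-endian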

After the frame length, we need to provide the actual masking key. This is 4 bytes. As mentioned earlier, these are values used to obfuscate the content being transmitted to the server so that malicious web applications can’t communicate at the TCP level directly with the WebSocket server. The obfuscation algorithm is a simple XOR, where the four bytes in the masking key are used in a round-robin fashion against bytes in the content. Because this was just for fun and the security protection is irrelevant, I provided a static masking key of “0” – since anything XOR’d with 0 doesn’t change – so I didn’t have to implement the masking algorithm, and the server wouldn’t notice 🙂

image
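
A sketch of that shortcut:

$frame.AddRange([byte[]] (0, 0, 0, 0))   ## all-zero masking key: XOR with 0 is a no-op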

Next, we get to add our actual content to the frame – in this case, a message representing the JSON of pixels that we want to draw.

image

Finally, we write this frame to the TCP connection and flush it.

image
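
Putting the last two steps together as a sketch ($stream is the stream from the TCP connection we opened earlier):

$frame.AddRange($payload)                          ## the "masked" (really untouched) JSON payload
$stream.Write($frame.ToArray(), 0, $frame.Count)   ## send the completed frame
$stream.Flush()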

When all is said and done, pretty pictures show up on Tim’s badge 🙂

When I sent Rick Astley, Defender ATP alerted Microsoft IT that one of its computers had PowerShell making raw TCP connections to a newly-registered domain. Amazing!

image

I tried to WannaCry Tim’s badge, and he still hasn’t paid 🙁

image

And with true raw WebSockets control, I was able to give Tim’s badge a demoscene 🙂

image

So that’s how to go about WebSockets from Scratch. Hope this helps!

Efficiently Generating Python Hash Collisions

In 2003, Crosby and Wallach wrote an excellent paper demonstrating a new form of Denial of Service vulnerability against applications by abusing algorithmic complexity in the data structures that they depend on. For example, some data structures operate quite efficiently when given everyday input, but the performance degrades precipitously in certain edge cases.

If an application provides user-controlled data to one of these data structures, attackers can intentionally provide this worst case form of input to generate a Denial of Service attack against the application.

In 2011, Alexander Klink and Julian demonstrated this form of attack quite effectively against web servers that take attacker-provided HTTP headers and represent them internally as hashtables.

Hashtables work extremely quickly for most input by distributing items into different buckets based on the hash value of the item’s key. There is a degenerate case, however, when hash collisions cause the algorithm to distribute items into the same bucket. Once hundreds or thousands of items start being placed into the same bucket due to having the same hash value, the performance of a hashtable degrades quadratically. A web server that used to be able to handle 1,000 requests per second may now only be able to handle 10. Or 1. Or worse.

image

Many programming languages and frameworks now account for this type of attack. Python fixed this by default in Python 3.3+, released over 7 years ago. Despite that, many applications still build on Python 2.7 – including an application I was testing. Alexander and Julian didn’t ever release the code they used to DoS Python, so I embarked on an interesting journey to find the most efficient way possible to generate hashtable collisions against it.

Before this research, the most efficient mechanism I could find was from Robert Grosse. He focused on 64-bit hashes, and his approach took about 80 hours to generate 50,000 hash collisions. The result of my research (against 32-bit Python) generates billions of collisions essentially instantaneously (as fast as your computer can print them to the screen, write them to a file, etc.).

Background: Understanding Python’s hash implementation

Python (remember: 3.2 and below) uses an FNV-style algorithm to calculate the hash values of strings. The hash value is a single 32-bit number that is the result of running a calculation over all of the characters in the input string. Here’s a simple re-implementation in C that takes the string to hash as its first argument:

image

The variable ‘x’ represents the resulting hash value. The variable ‘a’ represents the string to be hashed. The variable ‘p’ represents the value of the current character being processed.
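
The screenshot shows the C version. As a rough PowerShell sketch of the same FNV-style calculation (unsigned, truncated to 32 bits, skipping CPython’s final adjustment of -1 to -2, and assuming ASCII input):

function Get-PythonStringHash
{
    param([string] $String)

    if($String.Length -eq 0) { return 0 }

    $mask = [uint64] 0xFFFFFFFF
    [uint64] $x = ([uint64][byte][char] $String[0]) -shl 7       ## the first character, shifted left by 7
    foreach($character in $String.ToCharArray())
    {
        ## Multiply by the large prime, truncate to 32 bits, then XOR in the character
        $x = (([uint64] 1000003 * $x) -band $mask) -bxor [uint64][byte][char] $character
    }
    $x = $x -bxor [uint64] $String.Length                        ## final XOR with the string length
    [uint32] ($x -band $mask)
}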

Let’s walk through an example of this algorithm.
   

image

On line 21, the hash value is initialized to zero. The PowerShell console on the right-hand side shows the state of the hash value throughout the algorithm, and the highlighted line shows the bits of the current value of the hash. They are all zero, of course, since we just initialized it to zero.

image

Line 22 takes the value of the first letter (‘L’ is 1001100 in binary), shifts it left by 7 places, and then blends it into the hash value with the XOR operator.

image

Line 25 is where we start the character-by-character processing. Python takes the current value of the hash and multiplies it by 1000003. This is a large prime number, which churns the bits in the leftmost digits largely and unpredictably (the red square – I call that area the “bit blender”). Then, it uses the XOR operator to bring this character into the hash value for the rightmost digits.

image

image

Then, we rinse and repeat until the end of the string.

image

For the final stage, Python takes the current hash value and uses the XOR operation to incorporate the length of the string, which is 3 in our case. This step is done to reduce the number of collisions for short strings.

image

When we convert this final value to an integer, we get the hash value that you know and love from Python 3.2 and below.

Meet in the Middle Attacks

Now that we have an understanding of how Python implements its hash algorithm, we can explore the previous state-of-the-art technique used to attack the algorithm.

Pretend that you’re trying to generate hundreds or thousands of strings that all hash to the value ‘0’ as a way to generate a hashtable denial of service. One way to do this is to brute force your way through billions and billions of strings and hope to find hundreds or thousands that have the hash value of ‘0’. This is incredibly inefficient and will take you many hundreds of years of computation.

Another way is called a “Meet in the Middle Attack”. This is the method used by Alexander Klink and Julian in their CCC presentation, as well as the technique explored more deeply for 64-bit Python by Robert Grosse.

image

As Python iterates through a string, it keeps modifying and altering a running hash value. Once it gets to the end of the string, it outputs the hash value that it’s calculated. For example, imagine that the running hash value of the first part of a string – QCMWaIOFFgH – was 3B847A2A. Well, it turns out that Python’s FNV hash algorithm can partially be reversed as well. If you try to work backwards from a desired hash value (i.e.: 0), you can figure out a set of intermediate hash values such that “if the intermediate hash value is <x> and the rest of the string is <y>, then the resulting hash value will be 0”. This isn’t fully deterministic: for a given hash value or internal state, there are many characters and previous internal states that can lead to it.

You can combine both of these tricks with a lot of brute force to massively speed up a naïve brute force approach.

As the first step, we calculate the hash values of a bunch of strings. We record the internal hash value as Python processes each character. For each internal state value, we save the characters in the string that led up to it.

As the second step, we take a desired hash value (i.e.: 0) and reverse the hashing process. We iterate through a bunch of hypothetical preceding characters and figure out the internal state hash value that could have led to it.

Both of these steps require a lot of computational power and a lot of memory.

The example above demonstrates a situation where we found:

  1. A string, QCMWaIO, where processing it up until that point generated an internal hash value of 3B847A2A
  2. Two strings, vpl and wQl, where if the internal hash value was 3B847A2A then it would result in a hash value of 0

If you combine these two paths together, you get a hash collision: two strings (QCMWaIOvpl and QCMWaIOwQl) that both hash to the value of 0. Apply enough time, memory, and computational power and you have thousands of hash collisions! This is many thousands or millions of times faster than the basic brute force approach.

Hash Stretching: Making Meet-in-the-Middle attacks 250 times more efficient

One thing that became clear after analyzing Python’s FNV-style algorithm is that it has a critical weakness: intermediate hash values of zero.

image

If the intermediate hash state (the variable x in this sample code) ever gets to 0, then multiplying it by 1000003 doesn’t change the intermediate hash value at all and you don’t get the unpredictable mixing that used to previously happen with that step. Then, you blend in an attacker-controlled character (the variable p in this sample code) and attackers have fine-grained control over the internal hash state.

It only takes a few seconds of brute forcing to find strings that at some point have an internal hash state of zero.

This introduces two major attacks:

image

  1. Control of the final hash value. If the internal hash state ever reaches zero, the next character of the string completely defines the new internal hash state. If there are no more characters in the string, then the final hash value (as in the first example above) will simply be the length of the string due to Python’s final XOR with the length of the string. Additionally, since you know that Python will do a final XOR by the length of the string, you can use your next character after the state hits ‘0’ to make this final XOR generate a final hash value of anything between 0 and 255. For example, if you supply the next character as the decimal value ‘5’, an XOR by the new string length of ‘5’ will result in a hash value of zero.
  2. Hash collision stretching. If the internal hash state ever hits zero, processing a character with the value of zero doesn’t change anything except the string length. Because of this, you can stretch a single hash collision into more hash collisions by simply adding new NULL characters when the internal hash value reaches zero. This changes the string length, so you also need to use technique #1 (controlling the final hash value) to remove the string length from the final calculation. This works for strings of length up to 255 (the maximum value of a byte you control), which lets you generate about 250 more collisions for every collision you find where the internal state reaches zero.

Using this approach, I was able to generate 10,000+ collisions with only a few hours of brute force. A fun and huge improvement to the state of the art, but we’re not done yet 🙂

Generating Billions of Collisions: The Null Points method

As mentioned, Python’s FNV implementation has a critical weakness whenever the internal hash state reaches zero. But finding those strings with an internal hash state of zero required a couple of hours of brute force. This leads to a question: Since the algorithm starts the hash state at zero and we find a part of a string (i.e. f8ds3k3kk) that leads to an internal hash state of zero, couldn’t we just repeat that substring to keep on finding new strings with internal hash states of zero? Like f8dsf8dsf8dsf8ds3k3kk?

Initially, the answer appeared to be ‘No’.

image

The big issue is Python’s processing of the initial character. If you recall the very first step, Python shifts the initial character to the left by 7 bits, thoroughly placing it into the unpredictable Bit Blender. If we tried to place that character again after we hit the internal state of zero, we wouldn’t be able to replicate the left shift by 7 places. So it seems we’re out of luck.

Except for one initial character!

image

When we shift left by seven bits, there is exactly one byte that doesn’t shift anything into the bit blender. The value ‘1’. When the algorithm shifts ‘1’ left by 7 bits, you get the value ‘128’ (0x80) with nothing in the bit blender – leaving an intermediate hash value that we can still completely influence with further characters under our control. So if we ever find a string that starts with the byte ‘1’ and eventually gets to an internal state of ‘0’, we can absolutely use these little islands of characters to get back to an internal hash state of ‘0’.

Here’s an example. After a few seconds of brute force, we find the string \x01\x02\xBC\x6F\xA4\xB6 that gets to an internal hash state of ‘0’.

image

As our next character, we use 0x80 (the initial ‘1’ shifted left 7 bits), and then we can clone our attack string again (“\x01\x02\xBC\x6F\xA4\xB6”) to get back to an internal hash state of zero. We use our final byte to influence the string length XOR to produce a final hash result of our choice, and we now have a repeatable way to influence the hashing operation from beginning to end.

More importantly, our little attack meta-pattern (“\x01\x02\xBC\x6F\xA4\xB6”) is only one example of a character sequence vulnerable to the null points method. With a bit more brute forcing, it’s possible to find these meta-patterns pretty quickly. With a few hours of computation, I easily found 16 patterns that met both criteria:

  • They started with the byte ‘1’
  • They caused Python’s internal hash state to reach ‘0’ at some point

The true beauty of this attack is that the meta-patterns used in this attack are interchangeable. The example above demonstrates that we can generate a lot of collisions from strings in the form of "<Pattern1>0x80<Pattern1>0x80<Pattern1>”. But since we’re just abusing the fact that we got to an internal state of ‘0’, we can also generate them in the form of “<Pattern1>0x80<Pattern2>0x80<Pattern1>”. Or “<Pattern1>0x80<Pattern1>0x80<Pattern2>”. Start combining these meta-patterns systematically, and you have nearly infinite collisions at your disposal.

image
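
As a sketch of that combination step – the pattern list below contains only the single example pattern from above as a placeholder, and the whole thing assumes each pattern really does drive the internal state back to zero. Three slots gives 16³ combinations; extend the nesting (or generate combinations programmatically) for billions:

$metaPatterns = @(
    ,(0x01, 0x02, 0xBC, 0x6F, 0xA4, 0xB6)    ## the example pattern above; add the other 15 here
)

foreach($first in $metaPatterns) {
    foreach($second in $metaPatterns) {
        foreach($third in $metaPatterns) {
            ## Each pattern drags the internal state back to zero, and every combination
            ## has the same length, so all of the generated strings share one hash value.
            (($first + 0x80 + $second + 0x80 + $third) | ForEach-Object { "\x{0:X2}" -f $_ }) -join ''
        }
    }
}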

With these 16 meta-patterns of only 6 bytes each, you can generate 4 billion hash collisions against Python’s hash algorithm using only 48 bytes each. Due to the magic of combinatorics, the sky really is the limit: this technique will let you generate hash collisions against Python 3.2 and below as fast as your hard drive or network can process them. Have a 256-byte limit on key length? That’s 3.74144419156711E+50 (374144419156711000000000000000000000000000000000000) collisions! And if you want to experience a true embarrassment of riches, we can make this attack 250x more efficient with the hash collision stretching approach discussed earlier. But with this many zeros, who’s counting 🙂

Impact on Python Dictionary Insertion Performance

In raw numbers, Python’s dictionary insertion performance is pretty fast – even with collisions. But at only 50,000 colliding entries in a hashtable, Python starts to peg a CPU for 80 or 90 seconds. This represents about 2 MB of raw data. So with a second or two of data transfer, you can keep a CPU busy for a minute or two.

image

The critical problem is that this performance degrades quadratically. Submit 4 MB of data (10,000 collisions), and you may never see that CPU again. Submit 4 billion collisions? 3.74144419156711E+50 collisions? It’s safe to say that if you are exposing Python 3.2 (or below) hash tables to untrusted input, you have a significant security vulnerability on your hands. And by the way – json.loads() generates a dictionary backed by a hashtable.

Coordinated Vulnerability Disclosure

While this research demonstrates a fundamental break in the Python 3.2 (or below) hash algorithm, Python fixed this issue 7 years ago. It’s time to upgrade.

Extracting Forensic Script Content from PowerShell Process Dumps

After posting Extracting Activity History from PowerShell Process Dumps, I got an interesting follow up question: “Is it possible to extract the content of scripts (from disk) that were executed, even if those files were not captured?”

The answer is “Yes”, but it’s also complicated. And to make it even more complicated, we’re going to go down a path showing how to do some of this detective work from scratch. This is going to require a lot of WinDbg automation, so for a first step, install the WinDbg module.

To set up our forensics experiment, create this simple script. Save it somewhere like c:\temp:

image
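
The screenshot isn’t reproduced here; any small script with an easy-to-spot marker will do. Something like this, saved as c:\temp\test.ps1 (the GUID is just a placeholder we’ll search for later):

"Hello from my test script: 6a1b3f84-0000-1111-2222-333344445555"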

Open a PowerShell session, run the script, and create a dump file.

image

Now, use the WinDbg module to connect to the dump file:

Connect-DbgSession -ArgumentList '-z "C:\Users\lee\AppData\Local\Temp\powershell.DMP"'

Begin our investigation

To begin our investigation, let’s cast a really wide net. We know we want to extract objects (if they exist) that represent scripts that were run in that session. But how do we find these?

First, let’s use SOS’s “Dump Object” command to dump everything it knows about every single object in the process. So, we’ll start with the !DumpHeap command to find all object instances (i.e.: we won’t even use the –Type filter). There are “smarter” ways to do it, but this step and the next will take a very long time, so maybe get it going before bed or something.

$allReferences = dbg !dumpheap -short

Once we have all object references, let’s use the !do (Dump Object) command to have SOS visualize them all. The output of Dump Object doesn’t include the address of the object being dumped, so we’ll use Add-Member to keep track of that as well.

$allObjects = $allReferences | Foreach-Object { $object = dbg "!do $_"; Add-Member -InputObject $object Address $_ -PassThru -Force }

(The next day) That’s a mighty haystack indeed! On my system, there are about a million objects that SOS knows about in this process instance. Do any of them have any part of the GUID in the way that SOS would visualize them? Let’s find out!

image

Looks like we’re in luck! Out of those million objects, we managed to narrow it down to 7 System.String objects in PowerShell’s memory that somehow referenced the GUID. If we think the information might have been in a System.String all along, we could have made our initial “$allObjects” query faster by using “$allReferences = dbg !dumpheap –type System.String –short”. But how do we figure out what’s holding these GUIDs?
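
A sketch of the search that produced those matches (assuming $guid holds the GUID marker embedded in the test script):

$guid = "6a1b3f84-0000-1111-2222-333344445555"   ## the placeholder marker from the test script
$matchingObjects = @($allObjects | Where-Object { $_ -match $guid })
$matchingObjects.Count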

To find out, we’ll use SOS’s !gcroot command. This is commonly used to diagnose managed memory leaks – for example, “What am I doing that’s causing the CLR to hold onto 10 million instances of this string?” For any given object, the !gcroot command tells you what object is referencing it and what object is referencing that one - all the way until you hit the root of the object tree. Let’s explore some of these roots.

image
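
For example, to walk the roots of the last match from above:

dbg "!gcroot $($matchingObjects[6].Address)"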

Ok, so the last one (item #6 in the array) wasn’t actually rooted. It was no longer referenced, and would be cleaned up by the garbage collector shortly.

Item #5 was rooted through an object array (System.Object[]), where one of those elements was a ConcurrentDictionary, which held a ScriptBlock, which held CompiledScriptBlockData, which held nodes in a PowerShell AST, bottoming out in a CommandAst AST that referenced this GUID.

Sounds cool. What about any others? Here’s item #4 in my instance:

image

This is interesting! This one starts with the same root object array (0000026e101e9a40), the same ConcurrentDictionary (0000026e003bc440), but this time bottoms out into a tuple (a simple pairing of two items) that contains our string and another string. Let’s dive into that tuple and the strings it contains.

image

So this tuple has two elements. The first element looks to be the path to the script that was executed, and the second element looks to be the content that was in that script. Let’s see what the PowerShell Source has to say about these data structures. I’ll search for ConcurrentDictionary to see what I can find. On the third page, we can see exactly what we’re looking at:

image

There’s a class called CompiledScriptBlock. It contains a static (process-wide) cache called “s_cachedScripts”. This is a dictionary that maps a pair of strings to an instance of a ScriptBlock. And if you read the source, you can see exactly what the Tuple is as well – a mapping of a script’s path to the content it contained at the time the ScriptBlock was cached:

image

This data structure is what we ended up poking around in. For performance reasons, PowerShell maintains an internal script block cache so that it doesn’t need to re-compile the script blocks every time it sees a script. That cache is keyed off of the path and script contents. The thing stored in the cache is an instance of a ScriptBlock class, which contains  (among other things) the AST of the script that was compiled.

So now that we know this thing exists, we can be much smarter in our automation and extract this stuff intentionally! Now we’re getting into real scripting, so this is what we’ll do:

  1. Use !dumpheap to find instances of this Tuple class. The dumpheap command does a substring search, so we’ll do a bit of post-processing with a regex.
  2. This gives us the MT of the tuple class that we actually want to investigate.
  3. Run !dumpheap again with that MT as a filter.

image

Now we can explore one of these nodes. It has a m_key that we can dive into.

image

Almost there! Let’s extract out the two items from those resulting keys, and emit a pretty PowerShell object:

image

It’s been a long journey. But: we investigated a hypothesis from scratch, followed it through, and are now able to forensically recover the content of all scripts from PowerShell process memory – even if you no longer have access to the files in question. Awesome 🙂

Here’s a script that packages all of this into a function.

function Get-ScriptBlockCache
{
    ## Find the method table (MT) of the ConcurrentDictionary node type whose
    ## keys are Tuple<string,string>. P1 is the first column of the matching
    ## !dumpheap line - the MT.
    $nodeType = dbg !dumpheap -type ConcurrentDictionary |
        Select-String 'ConcurrentDictionary.*Node.*Tuple.*String.*String.*\]\]$'
    $nodeMT = $nodeType | ConvertFrom-String | Foreach-Object P1

    ## Dump each node of that type, and parse out the address of its m_key field (column P7)
    $nodeAddresses = dbg !dumpheap -mt $nodeMT -short
    $keys = $nodeAddresses | % { dbg !do $_ } | Select-String m_key
    $keyAddresses = $keys | ConvertFrom-String | Foreach-Object P7

    foreach($keyAddress in $keyAddresses) {
        ## Each key is a Tuple<string,string>: Item1 is the script's path, Item2 is its content
        $keyObject = dbg !do $keyAddress

        $item1 = $keyObject | Select-String m_Item1 | ConvertFrom-String | % P7
        $string1 = dbg !do $item1 | Select-String 'String:\s+(.*)' | % { $_.Matches.Groups[1].Value }

        $item2 = $keyObject | Select-String m_Item2 | ConvertFrom-String | % P7
        $string2 = dbg !do $item2 | Select-String 'String:\s+(.*)' | % { $_.Matches.Groups[1].Value }

        [PSCustomObject] @{ Path = $string1; Content = $string2 }
    }
}

Extracting Activity History from PowerShell Process Dumps

Imagine that you’re investigating the compromise of a system. The system doesn’t have PowerShell Logging enabled, but you did capture a process dump while activity was happening.

This memory dump is forensic gold, and the managed code debugging extension for WinDbg (“SOS” – Son of Strike) gives you all the tools you need to mine it. After using File | Open Crash Dump, this is what you see:

image

From there, load the SOS extension, fix symbols, and reload:

.loadby sos clr
.symfix
.reload

Use !Help to see all of the features offered by the SOS CLR Debugging Extension, but one useful one is !DumpHeap. It enumerates all of the objects in managed memory, and lets you filter these by type.

PowerShell stores command history in HistoryInfo objects, so let’s look for those. We can use !DumpHeap –Type HistoryInfo for this. SOS sometimes generates an exception the first time you use it, so just do it again 🙂

image

We can see that there were 6 items of HistoryInfo (commands typed by the attacker), and 7 arrays of HistoryInfo (internal data structures that hold them). The heap includes objects that are temporarily in use, so not all of these represent unique commands.

If we click on the MT column for HistoryInfo itself, we’ll get that type of object explicitly. Then, we can click on any of the Addresses to see what that HistoryInfo contains. Clicking is just a shortcut for the commands that WinDbg ultimately shows in the output anyways: !DumpHeap /d -mt 00007ff8a140be70 and !DumpObj /d 0000024226256100.

image

There’s a string at offset 8 that contains the _cmdline itself. What’s in it? You can click (!DumpObj /d 0000024226255a30) to find out.

image

Ah, yes. The eternal existential question. Who am I?

Doing this manually for each HistoryInfo can get annoying, so you can use WinDbg’s “scripting language” to automate some of it. The “.foreach” statement lets us iterate over debugger output, assign each item to a variable, and then act on that variable. So here’s a little recipe we’ll run:

.foreach (historyinfo { !dumpheap -type HistoryInfo -short }) { .echo ${historyinfo}; !DumpObj poi(${historyinfo}+8) }

We’ll iterate through the !DumpHeap results, assigning each memory address to a ‘historyinfo’ variable. We’ll echo that original memory address to the screen, and then remember that the _cmdline was offset 8 from that memory address. So, we’ll use the poi function to calculate the address of that new object, and use !DumpObj to dump it.

That gives us a bunch of output that we can then review, or script even further.

image

In our forensic analysis of this image, we can see that we are in trouble indeed. After they ran ‘whoami’, they knew the username and password of the Domain Admin, and then used PowerShell remoting to connect to the Domain Controller with it.

Automation

User interfaces are great for initial analysis, but terrible for scale. To automate scenarios like this, you can use this WinDbg Automation module I shared a little while ago. It leverages cdb.exe – the console version of WinDbg. From there, you can do all kinds of crazy stuff.
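
As a sketch – the dump path is a placeholder, and this assumes the module’s Connect-DbgSession / dbg commands as used earlier:

Connect-DbgSession -ArgumentList '-z "C:\dumps\powershell.DMP"'
dbg .loadby sos clr
dbg !dumpheap -type HistoryInfo -short |
    ForEach-Object { dbg "!DumpObj poi($_+8)" } |
    Select-String 'String:'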

As cool as that module is, I’m most excited by far about the recent open sourcing of DbgShell. This is the hobby-time project of @JazzDelightsMe, and exposes much of the debugging engine into PowerShell as PowerShell objects. Incredible!

image

Producer / Consumer Parallelism in PowerShell

After it’s done indexing your data, Scour is blazingly fast at local content searches – far faster than Select-String, grep, or anything else that rips through files when you search. It accomplishes this through the power of the popular Lucene search engine. Similar to the way that publishers have made searching physical books fast, Lucene must first create an index of your content before you can search it.

So while Scour was incredibly fast at searching content, the indexing process took much longer than I wanted it to. In one directory (20,000 files), it took about 13 minutes.

Most of this was my fault – I had put in a call to [GC]::Collect() after every file. This was temporary code I had put in when I was investigating some memory issues, and it destroyed performance. After removing that mistake, indexing my test directory was now down to 45 seconds. After spending an afternoon on further tweaks, the final performance measurement was down to 9 seconds. That’s a 90x improvement if you include the sandbag, and merely a 5x improvement if you don’t 🙂

The majority of this performance came from introducing multi-threading into the indexing process.

There are lots of great threading techniques in PowerShell – both built-in, and from the community: Jobs, PoshRSJob, and Thread Jobs, to name a few. All of these do an excellent job of running multiple blobs of code in parallel.

Managing Parallelism

One thing that’s not often talked about is how you manage this parallelism. Once you have half a dozen threads running, what do you do with them? If you have hundreds of thousands of work items, how do you dispatch them to the threads and receive the output? How do you track overall status?

image

One approach to this is the Producer-Consumer pattern. Many cloud architectures use components like Azure Service Bus to accomplish this at a large scale, but it’s equally helpful for work management among threads. In the Producer-Consumer pattern, you have one thread generating work. In this situation, it was a PowerShell script generating lists of file names for Lucene to index. Then, you have multiple threads consuming this shared pipeline – fetching whatever work is available, doing the work itself, and then submitting the output back to a shared output pipeline. In this situation, these were background threads receiving file names and then doing the work to add those files to the index.

This pattern can be successful when the description of the work (i.e.: a filename, or coordinate of an image) is much smaller than the work itself (indexing that filename, or rendering that portion of the image).

Example Implementation

You can see the implementation of this in Scour if you are interested, but the interesting parts are somewhat obscured by the context of the actual implementation. Instead, here is a heavily-commented example of implementing this pattern in PowerShell, simply passing around GUIDs and printing them out.

image

I’ve also shared this on the PowerShell Gallery (Install-Script Invoke-ProducerConsumer –Scope CurrentUser).

<#PSScriptInfo
.VERSION 1.0
.GUID bfb939b9-03f0-433e-ad0f-e4e12f4a009c
.AUTHOR Lee Holmes
#>

<#
.DESCRIPTION
  Example implementation of producer / consumer parallelism in PowerShell
#>

## The script block we want to run in parallel. Threads will all
## retrieve work from $InputQueue, and send results to $OutputQueue
$parallelScript = {
    param(
        ## An Input queue of work to do
        $InputQueue,
       
        ## The output buffer to write responses to
        $OutputQueue,
       
        ## State tracking, to help threads communicate
        ## how much progress they've made
        $OutputProgress, $ThreadId, $ShouldExit
    )

    ## Continually try to fetch work from the input queue, until
    ## the 'ShouldExit' flag is set
    $processed = 0
    $workItem = $null
    while(! $ShouldExit.Value)
    {
        if($InputQueue.TryDequeue([ref] $workItem))
        {
            ## If we got a work item, do something with it. In this
            ## situation, we just create a string and sleep a bit.
            $workItemResult = "Processing $workItem in thread $ThreadId"
            Start-Sleep -Seconds (Get-Random -Max 3)

            ## Add the result to the output queue
            $OutputQueue.Enqueue($workItemResult)

            ## Update our progress
            $processed++
            $OutputProgress[$ThreadId] = $processed
        }
        else
        {
            ## If there was no work, wait a bit for more.
            Start-Sleep -m 100
        }
    }
}

## Create a set of background PowerShell instances to do work, based on the
## number of available processors.
$threads = Get-WmiObject Win32_Processor | Foreach-Object NumberOfLogicalProcessors
$runspaces = 1..$threads | Foreach-Object { [PowerShell]::Create() }
$outputProgress = New-Object 'Int[]' $threads
$inputQueue = New-Object 'System.Collections.Concurrent.ConcurrentQueue[String]'
$outputQueue = New-Object 'System.Collections.Concurrent.ConcurrentQueue[String]'
$shouldExit = $false

## Spin up each of our PowerShell runspaces. Once invoked, these are actively
## waiting for work and consuming once available.
for($counter = 0; $counter -lt $threads; $counter++)
{  
    $null = $runspaces[$counter].AddScript($parallelScript).
        AddParameter("InputQueue", $inputQueue).
        AddParameter("OutputQueue", $outputQueue).
        AddParameter("OutputProgress", $outputProgress).
        AddParameter("ThreadId", $counter).
        AddParameter("ShouldExit", [ref] $shouldExit).BeginInvoke()
}

## Some fake work - send 50 GUIDs into our worker threads
$estimated = 50
1..$estimated | Foreach-Object {
    $currentInput = New-Guid
    $inputQueue.Enqueue($currentInput)
}

## Wait for our worker threads to complete processing the
## work.
try
{
    do
    {
        ## Update the status of how many items we've processed, based on adding up the
        ## output progress from each of the worker threads
        $totalProcessed = $outputProgress | Measure-Object -Sum | Foreach-Object Sum
        Write-Progress "Processed $totalProcessed of $estimated" -PercentComplete ($totalProcessed * 100 / $estimated)
  
        ## If there were any results, output them.
        $scriptOutput = $null
        while($outputQueue.TryDequeue([ref] $scriptOutput))
        {
            $scriptOutput
        } 

        ## If the threads are done processing the input we gave them, let them know they can exit
        if($inputQueue.Count -eq 0)
        {
            $shouldExit = $true
        }      
       
        Start-Sleep -m 100

        ## See if we still have any busy runspaces. If not, exit the loop.
        $busyRunspaces = $runspaces | Where-Object { $_.InvocationStateInfo.State -ne 'Complete' }
    } while($busyRunspaces)
}
finally
{
    ## Clean up our PowerShell instances
    foreach($runspace in $runspaces)
    {
        $runspace.Stop()
        $runspace.Dispose()
    }
}

Scour: Fast, Personal, Local Content Searches

If you have a large collection of documents (source code or text files), searching them with PowerShell or your favourite code editor can feel like it takes forever.

image

How is it that we can search the entire content of the Internet in milliseconds, but searching your local files can take minutes or hours?

It turns out that there is a solution built into many products and technologies that can help us: the Apache Lucene project. Lucene forms the search engine backbone for many popular products: Solr, Twitter, ElasticSearch, and at least a few hundred others. Unfortunately for us, however, it comes as a software development kit and not a simple tool that you can use from the command line.

With the introduction of Scour – a new PowerShell module – that all changes.

Scour

Scour is a PowerShell module that lets you search any directory on your computer using the power of the Lucene search engine backend. After you run an initial indexing process, future searches are tens or hundreds of times faster than the searches you are currently used to.

Install Scour

To install Scour in PowerShell, simply run:

Install-Module Scour –Scope CurrentUser

Create an Index

Lucene accomplishes its excellent search performance by analyzing your files and then storing information in an index. To create this index, go to a directory that contains the files you care about and run:

Initialize-ScourIndex

By default, Scour indexes text files (*.txt) and the source files for popular programming languages. If you want it to index additional file types, you can use the –Path parameter.

Scour will first scan the current directory (and subdirectories) to determine how many files it has to index, and then display a progress bar to let you know how much time is left in the indexing process.

image

Search your Content

Once you’ve created the index for a directory, use the Search-ScourContent cmdlet:

Search-ScourContent “query”

As mentioned before, Scour leverages the Lucene search engine under the hood, so query should follow the rules of Lucene Search Syntax. This syntax is also described in the about_query_syntax.txt help file included with Scour. Here are a few examples:

  • Search-ScourContent “word1 word2” – Searches for all files that contain word1 or word2
  • Search-ScourContent “word1 AND word2” – Searches for all files that contain both word1 and word2
  • Search-ScourContent word* – Searches for all files that have a word starting with “word”
  • Search-ScourContent word~ – Searches for all files that have a word similar to “word”

By default, Scour returns the files that match your query. You can pipe the results into Select-String, Copy-Item, or any other scripting you might want to do on these results.

image

If you start your search from within a specific directory, Scour automatically limits your results to documents in that directory or below.

In addition to the file content, Scour also indexes the document paths. This lets you add path-based restrictions to your searches. For example:

  • Search-ScourContent 'path:william AND following'

If you want to search for specific regular expressions within your files, Scour lets you combine the efficiency of indexed document searches with a line-by-line regular expression match. To do this, add the –RegularExpression parameter to your call.

Here’s an example of finding all documents that have the word “following” in them, and then returning just the lines that match the regular expression “follow.*Cambri.*”.

image
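
In plain text, that call might look something like this (assuming the –RegularExpression parameter takes the pattern to apply):

Search-ScourContent following -RegularExpression 'follow.*Cambri.*'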

If you want to restrict your searches to a specific file type (i.e.: *.cs), you can use the –Path parameter to Search-ScourContent.

Enjoy!

Creating a Good Security Conference CFP Submission

So you’re interested in submitting a talk for a security conference? Awesome! Above all else, what keeps our industry moving forward is the free and open sharing of information. Submitting a talk can be a scary experience, and the process for how talks are evaluated can feel mysterious.

So what’s the best way to create a good security conference CFP submission?

It’s perhaps best to consider the questions that the review board will ask themselves as they review the submissions:

  • Will this presentation be of broad interest (i.e.: 10-20% of attendees)?
  • Is this presentation innovative and useful?
  • Is this presentation likely to be able to deliver on its outline and promises?
  • Is this presentation consistent with the values of the conference?

They are also likely going to review submissions in an extremely basic web application or Excel.

image

See that scroll bar on the right? It’s tiny. For DerbyCon this year, the CFP board reviewed ~ 500 submissions. That’s a lot of work, but it’s also an incredible honour. It’s like going to an all-you-can-eat buffet prepared by some of the top chefs in the world. But that buffet is 3 miles long. It’s overwhelming, but in a good way 🙂

Let’s talk about how you can create a good CFP submission based on the questions reviewers are asking themselves.

Review Board Criteria

Will this presentation be of broad interest?

If a conference is split into 5 tracks, accepted presentations must be interesting to about 20% of the attendees. If your talk is too specialized - such as the latest advances in super-theoretical quantum cryptography - you might find yourself talking to an audience of 4.

A common problem in this category is vendor-specific talks. Unless the technology is incredibly common, it will just come across as a sales pitch. And nobody wants to see a sales pitch.

That said, some talks are of broad interest exactly because they are so far outside people’s day-to-day experience. While an attendee may never have the opportunity to experience the lifestyle of a spy, an exposé of the life of one would most certainly have popular appeal.

Is this presentation innovative and useful?

The security industry is incredible at sharing information. For example, @irongeek records and shares thousands of videos from DerbyCon, various BSides, and more. DEF CON has been sharing videos for the last couple of years, as have Black Hat and Microsoft’s Blue Hat. If an audience member is interested in a topic, there’s a good chance they’ve already watched something about it through one of these channels. In your CFP submission, demonstrate that your presentation is innovative or useful.

  • Does it advance or significantly extend the current state of the art?
  • Does it distill the battle scars from attempting something in the real world (i.e.: a large company, team, or product?)
  • If it’s a 101 / overview type presentation, does it cover the topic well?

Is this presentation likely to be able to deliver on its outline and promises?

Presentation submissions frequently promise far more than what they can accomplish. For example:

  • Content outlines that could never be successfully delivered in the time slot allotted for a presentation.
  • Descriptions of research that is in progress that the presenter hopes will bear fruit. Or worse, research that hasn’t even started.
  • Exaggerated claims or scope that will disappoint the audience when the actual methods are explained.

Is this presentation consistent with the values of the conference?

Some presentations are racist, sexist, or likely to offend attendees. This might not be obvious at first, but slang you use amongst your friends or coworkers can come across much differently to an audience. These ones are easy to weed out.

Many conferences aim to foster a healthy relationship between various aspects of the community (i.e.: researchers, defenders, vendors), so a talk that is overly negative or discloses an unpatched vulnerability in a product is likely not going to be one that the conference wants to encourage.

On the other hand, some conferences actively cater to the InfoSec tropes of edgy attackers vs defenders and vendors. You might find an otherwise high-quality Blue Team talk rejected from one of those.

Some submissions may appear to skate a fine line on this question, so good ones are explicit about how they will address this concern. For example, mentioning that the vulnerability they are presenting has been disclosed in coordination with the vendor and will be patched by the time the presentation is given.

Common Mistakes

Those are some of the major thoughts going through a reviewer’s mind as they review the conference submissions. Here are a couple of common mistakes that make it hard for a reviewer to judge submissions.

  • Is the talk outline short? If so, the reviewer probably doesn’t have enough information to evaluate how well the presentation addresses the four main questions from above. A good outline is usually between 150 to 500 words. See talks 3, 4, and 5 from the screen shot above to see how this looks in practice!
  • Does the title, description or outline rely heavily on clichés? If so, the presentation is likely going to lack originality or quality – even if it is for fun and profit.
  • Is the talk overly introspective? Talks that focus heavily on the presenter (“My journey to …”) are hard to get right, since attendees usually need to be familiar with the presenter in order for the talk to have an impact. Many review processes are blind (reviewers don’t know who submitted the session), so this kind of talk is almost impossible to judge.
  • Is the talk a minor variation of another talk? Some presenters submit essentially the same talk, but under two or three names or variations. What primarily drives a reviewer’s decision of a talk is the meat, not a bit of subtle word play in the title. They will likely recognize the multiple variations of the talk and select only one – but which one specifically is unpredictable. When votes are tallied, three talks with one vote each are much less likely to be selected than a single talk with three votes.
  • Is the submission rife with grammar and spelling errors? I don’t personally pay much attention to this category of mistake, but many reviewers do. If you haven’t spent the effort running your submission through spelling and grammar check, how much effort will you spend on the talk itself?

XOR is Not as Fancy as Malware Authors Think

FireEye recently posted some research about an attack leveraging the NetSupport Remote Access tool. The first stage of this attack uses a lot of obfuscation tricks to try to make reverse engineering more difficult.

David Ledbetter and I were chatting about some of the lengths the malware authors went through to obfuscate the content.

One of the major sources of complexity is a complicated, iterative XOR:

(Image credit FireEye)

Unlike most malware that obfuscates content by XORing the content with a single byte / key, this malware appears to do something much more clever. See the content starting at ‘var tmpKeyLength = 1;’?

  1. XOR each character of the content with the first byte of the encryption key
  2. XOR characters of the content with bytes from the encryption key in the following pattern: 1, 2, 1, 2, 1, 2, 1, 2, …
  3. XOR characters of the content with bytes from the encryption key in the following pattern: 1, 2, 3, 1, 2, 3, 1, 2, 3, …

When malware uses step #1 alone -- or even a repeating single-key XOR -- I like to call it “Encraption”. It appears complicated, but is vulnerable to many forms of cryptanalysis and can be easily broken. Given that this malware did several levels of Encraption, did they manage to finally invent something more secure than a basic repeating key XOR?

Not even close.

XOR is Associative

One of the biggest challenges with using XOR in cryptography is that it is associative: you can rearrange parenthesis without impacting the final result. For example, consider again a single byte key and the following double XOR encryption:

  1. Take the content
  2. XOR each character by the value ‘123’
  3. XOR each character by the value ‘321’
  4. Emit the result

If we were to add parenthesis to describe the order of operations:

(Content XOR 123) XOR 321

Because XOR is associative, you can rearrange parenthesis (order of operations) to make it:

Content XOR (123 XOR 321)

Which gives 314:

image
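
A quick PowerShell check of that rearrangement:

123 -bxor 321                                         ## 314
$byte = [int][char] 'A'
(($byte -bxor 123) -bxor 321) -eq ($byte -bxor 314)   ## True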

So, encraption with two keys is still just encraption with a single key and is vulnerable to all of the same attacks.

But what about that rolling key?

The malware above used something more like a rolling key, however. It didn’t just use a couple of single-byte keys. If the content was 100 bytes, it did 100 rounds of XOR based on the key. Surely that must be secure.

Fortunately? Unfortunately? The answer is no. If we remember the malware’s algorithm:

    1. XOR each character of the content with the first byte of the encryption key
    2. XOR characters of the content with bytes from the encryption key in the following pattern: 1, 2, 1, 2, 1, 2, 1, 2, …
    3. XOR characters of the content with bytes from the encryption key in the following pattern: 1, 2, 3, 1, 2, 3, 1, 2, 3, …

Consider the perspective of a single character. It gets encrapted by one byte of the key, and then a different byte of the key, and then a different byte of the key… and so on. And because XOR is associative, as we demonstrated above, that is the same thing as the single character being encrapted by a single byte.

A PowerShell Demonstration

Let’s take a look at a demonstration of this in PowerShell.

First, let’s look at a faithful re-implementation of the original algorithm:

$stringToEncrypt = [char[]] "Hello World!"
$encrypted = $stringToEncrypt.Clone()
$key = 97,4,13,252,119,31,208,156,196,56

$tmpKeyLength = 1
while($tmpKeyLength -le $key.Length)
{
    $tmpKey = $key[0..$tmpKeyLength]
    for($i = 0; $i -lt $encrypted.Length; $i++)
    {
        $encrypted[$i] = $encrypted[$i] -bxor $tmpKey[$i % $tmpKey.Length]
    }
    $tmpKeyLength++
}

"Encrypted:"
([byte[]]$encrypted) | Format-Hex | Out-String

When you take a look at the result, here’s what you get:

image

Pretty impressive! Look at all those non-ASCII characters. This must be unbreakable!

To get the equivalent single-pass encraption key, we can just XOR the encrapted string with the original string. How?

XOR is Commutative

We can do this because XOR is commutative as well: you can rearrange the order of terms without impacting the result.

If Encrapted is:

Content XOR Key1 XOR Key2 XOR Key3

then we do:

Encrapted XOR Content

then we get:

Content XOR Key1 XOR Key2 XOR Key3 XOR Content

Because XOR is commutative, we can rearrange terms to get:

Content XOR Content XOR Key1 XOR Key2 XOR Key3

Anything XOR’d with itself can be ignored

One of the reasons XOR encraption works is that anything XOR’d with itself can be ignored. For example:

Encrypted = Content XOR Key

Decrypted = Encrypted XOR Key

By XORing some content with a key twice, you get back the original content. So back to where we got with the last section, if we XOR the final result with the original content and rearrange, we get:

Content XOR Content XOR Key1 XOR Key2 XOR Key3

That gives us an equivalent single key that we can use: Key1 XOR Key2 XOR Key3.

Here’s an example of figuring out this new single-pass key:

$newKey = New-Object 'char[]' $stringToEncrypt.Length
for($i = 0; $i -lt $stringToEncrypt.Length; $i++)
{
    $newKey[$i] = $encrypted[$i] -bxor $stringToEncrypt[$i] 
}

""
"Equivalent key: " + (([int[]]$newKey) -join ",")

And the result:

image

And to prove they give equivalent results:

$encrypted = $stringToEncrypt.Clone()
for($i = 0; $i -lt $encrypted.Length; $i++)
{
    $encrypted[$i] = $encrypted[$i] -bxor $newKey[$i]
}

""
"Easy encrypted:"
([byte[]]$encrypted) | Format-Hex | Out-String

image

The Nelson Moment

This is where we get to point and laugh a little. Remember how the malware repeatedly XORs the content with various bits of the key? It did far more damage than the malware author realized. Let’s consider the character in the content at position 0 for a moment:

    1. XOR with byte 0 of the key
    2. XOR with byte 0 of the key (thereby stripping the encraption altogether!)
    3. XOR with byte 0 of the key
    4. XOR with byte 0 of the key (thereby stripping the encraption altogether!)

Let’s consider byte 1 of the content

    1. XOR with byte 0 of the key
    2. XOR with byte 1 of the key
    3. XOR with byte 1 of the key (thereby stripping the work done in step 2)
    4. XOR with byte 1 of the key
    5. XOR with byte 1 of the key (thereby stripping the work done in step 4)

Depending on the length of the key and the content, this pattern of alternating doing work and removing the work that was previously done continues. This now takes what could have been a potentially strong key of 100s of bytes down to a super expensive way to compute a single key! Here’s a demo of encrapting some GUIDs:

image

Notice that some characters (28 at the beginning, c9 at offset 0x48) didn’t even get encrypted at all?

Remember folks, don’t roll your own crypto, and especially don’t roll your own crapto.

Part-of-Speech Tagging with PowerShell

When analyzing text, a common goal is to identify the parts of speech within that text – what parts are nouns? Adjectives? Verbs in their gerund form?

To accomplish this goal, the area of natural language processing in Computer Science has developed systems for Part of Speech tagging, or “POS Tagging”. The acronym preceded the version in Urban Dictionary 🙂

The version I used in University was a Perl-based Brill Tagger, but things have advanced quite a bit – and the Stanford NLP group has done a great job implementing a Java version with C# wrappers here:

https://nlp.stanford.edu/software/tagger.shtml

The default English model is 97% correct on known words, and 90% correct on unknown words. “SpeechTagger” is a PowerShell interface to this tagger.

image
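
A sketch of basic usage (the exact parameter binding is an assumption – the sentence may also need to be passed by pipeline):

Split-PartOfSpeech "The quick brown fox jumps over the lazy dog"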

By default, Split-PartOfSpeech outputs objects that represent words and the part of speech associated with them. The TaggerModel parameter lets you specify an alternate tagger model: the Stanford Part of Speech Tagger supports:

  • Arabic
  • Chinese
  • English
  • French
  • German
  • Spanish

The –Raw parameter emits sentences in the common text-based format for part-of-speech tagging, separating the word and its part of speech with the ‘/’ character. This is sometimes useful for regular expressions, or for adapting code you might have previously written to consume other part-of-speech taggers.

To install this project, simply run the following command from PowerShell:

Install-Module –Name SpeechTagger

Automatic Word Clustering: K-Means Clustering for Words

K-Means clustering is a popular technique to find clusters of data based only on the data itself. This is most commonly applied to data that you can somehow describe as a series of numbers.

convergence

When you can describe the data points as a series of numbers, K-Means clustering (Lloyd’s Algorithm) takes the following steps:

  1. Randomly pick a set of group representatives. Lloyd’s algorithm generally picks random coordinates, although sometimes picks specific random data points.
  2. Assign all of the items to the nearest representative.
  3. Update the group representative to more accurately represent its members. In Lloyd’s algorithm, this means updating the location of the representative to represent the average location of every item assigned to it.
  4. Revisit all of the items, assigning them to their nearest group representative.
  5. If any items shifted groups, repeat steps 3-5.

Applying this technique directly to words is not possible, as words don’t have coordinates. Because of that:

  • Randomly picking a coordinate cannot be used to randomly create a group representative.
  • Refining a group representative based on its current word cluster is more complicated than simply averaging the coordinates of the items in the cluster.

If we follow the philosophy of Lloyd’s algorithm, however, we can still create a version of K-Means Clustering for Words. In our implementation, we:

  1. Pick random words from the provided list as group representatives.
  2. Use Levenshtein Distance (string similarity) to measure "nearest group representative".
  3. Use word averaging to update the nearest group representative. Word averaging is a new word of the "average" word length, with characters at each position created by taking the most common letter at that position (see the sketch below).

This is very computationally expensive for large data sets, but can provide some very reasonable clustering for small data sets.
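
As a minimal sketch of the word-averaging step from item 3 above (not Get-WordCluster’s actual implementation):

function Get-AverageWord
{
    param([string[]] $Words)

    ## The representative has the average length of its members...
    $averageLength = [int] [Math]::Round(($Words | Measure-Object -Property Length -Average).Average)

    ## ... and each position holds the most common character at that position
    $characters = foreach($position in 0..($averageLength - 1))
    {
        $Words | Where-Object { $_.Length -gt $position } |
            ForEach-Object { $_[$position] } |
            Group-Object | Sort-Object Count -Descending |
            Select-Object -First 1 -ExpandProperty Name
    }

    -join $characters
}

## Example: Get-AverageWord "Hello","Hell","Help"   ->  "Hell"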

To explore this further, you can download Get-WordCluster from the PowerShell Gallery. It’s as simple as:

  1. Install-Script Get-WordCluster –Scope CurrentUser
  2. (-split "Hello world Hell word SomethingElse") | Get-WordCluster -Count 3