Work Simulator 2020

Can you survive March 2020?


It’s March 4, 2020. A pandemic grips the world, so you’re working from home. Can you last the month?

[Web] [Windows] [Mac]

List of InfoSec Cognitive Biases

The mind is an incredibly complex organ. While all of us attempt to be mostly logical and rational in our day-to-day thought processes and decision making, we are hampered by an enormous number of cognitive biases. Cognitive biases are specific natural tendencies of human thought that often result in irrational decision making, and there are hundreds of them. Everybody has them and is impacted by them – it is only through awareness that you can take steps to counteract them.

One of my favourite examples is Loss Aversion. Imagine a game that costs $100 to enter. Most folks would decline to play this game if they had a 49% chance of losing their money and a 51% chance of doubling it. A purely rational decision maker would play this game as often as they could.
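To make the arithmetic concrete, here's a quick expected-value calculation (sketched in Python) for the game described above:

```python
# Expected value of the $100 game described above:
# a 51% chance of doubling your $100, and a 49% chance of losing it.
stake = 100
ev = 0.51 * (2 * stake) + 0.49 * 0 - stake  # net expectation per play

print(ev)  # about +$2 per play - so a rational player keeps playing
```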

The key realization is that being aware of biases helps you limit how much impact they have on your decision making.

InfoSec Cognitive Biases

Like every other avenue of human thought, the Information Security community is impacted by cognitive biases. While many traditional cognitive biases apply directly to the Information Security context, there are plenty that are unique to our space and are worthy of additional awareness.

This post started as a thread on Twitter, and with the participation of several folks has become quite a useful list of thinking patterns to be careful about when making decisions in the realm of Information Security. Thank you to @passingthehash, @gdbassett, @kjstillabower, @joonatankauppi, @marshray, and @mrjbaker for your contributions!

One important point is that a cognitive bias is completely different than being factually incorrect. A cognitive bias represents a flawed mode of thinking, not a flawed thought. For example, a website that disallows special characters in a user name is a decision, not a bias. While a cognitive bias may have been involved in arriving at a factually incorrect decision, the decision itself is not the bias.

Do you have any biases you find common or unique to the security industry? Please comment below and I’ll add them!

Absolutism bias

Description: The tendency to undervalue mitigations that have less-than-ideal security, yet still produce significant risk and harm reduction
Example: Criticizing users or applications that use SMS-based two-factor authentication, despite the alternative often being no two-factor authentication at all.

Actor bias

Description: The tendency to include actor / operator intent in evaluating the security of a system
Example: Designing cryptographic master keys that allow "the good people" to decrypt private data of "the bad people".

Anchoring bias

Description: The tendency to let early facts you learn in a security investigation overly influence your decision-making process
Example: Dismissing an attack as "just drive-by ransomware", missing attackers that use ransomware to burn infrastructure after a much more damaging intrusion.

Authority bias

Description: The tendency to overvalue the opinions that an expert in one domain has about an unrelated domain
Example: Computer Security experts discussing geopolitical events.

Availability bias

Description: The tendency to focus on applications or systems that are recent, nearby, or under active development
Example: Doing deep security analysis of a new buildout at corporate head offices, while systems at an acquired branch office go unpatched.

Bandwagon bias

Description: The tendency to assign excessive merit to a behaviour or technology because others have adopted it, or because it has historically been done that way
Example: Websites that prevent copy + paste of passwords, despite making it difficult for users of password managers

Burner bias

Description: The tendency to overestimate one's ambient risk when at industry events, or to adopt some security practices only at those events
Example: Only using a VPN or being suspicious of ATMs while at Black Hat / DEF CON.

Capability bias

Description: The tendency to overvalue the defensive impact of mitigating a published attack when viewed in the context of an adversary that can adapt
Example: Blocking PowerShell on a server, while still allowing arbitrary unsigned executables.

Commutative bias

Description: The tendency to undervalue the likelihood of an attack that only requires the linking of two or more highly-likely events
Example: Thinking that an internal system is highly protected, despite everybody in the company having access to it - and phishing campaigns industry-wide having nearly a 100% success rate.

Domain bias

Description: The tendency to focus on risks and solutions closely related to one's domain of expertise, rather than all risks to a system
Example: Cryptographic experts adding hardware security modules to an architecture, despite pressing application and network security weaknesses.

Endorsement bias

Description: The tendency to place trust in systems or mechanisms that have financial ability as their only barrier to entry
Example: Making security decisions based on "signed code", despite code signing certificates being available to anybody for $85.

Environment bias

Description: The tendency to undervalue risks to a system when analyzed against minor changes to its threat model
Example: Useful "find my phone" applications that become weapons in the context of domestic abuse.

Fatalism bias

Description: The tendency to think of a system as only compromised or not, without investing in post-breach processes and controls
Example: Threat modeling sessions that include the phrase, "well if they got in there, it's game over."

Headline bias

Description: The tendency to use the summary / headline of an event to understand risk, rather than working to understand mitigating conditions
Example: Mocking Linux for the CVE-2019-14287 "SUDO Backdoor", despite most articles properly explaining the rare and nonstandard configuration that would lead to this being a security vulnerability.

High-profile bias

Description: The tendency to prioritize high-profile events in the media, rather than risks associated with the target environment
Example: Rushing to address CPU side-channel attacks, despite a large fleet of unpatched servers.

Hyperfocus bias

Description: The tendency to inconsistently evaluate security of an application based on its unique capabilities
Example: Criticizing an application for a flaw in a security feature that no comparable application even implements

Impact bias

Description: The tendency to require working proof of a weakness (or impact of a weakness) in a system to sufficiently account for its risk
Example: An unmitigated SQL injection bug that doesn't get fixed until you demonstrate the extraction of data.

Measurability bias

Description: The tendency to place inappropriate weight on the security of a system based on analysis of a measurable security property without regard to context
Example: Criticizing (or applauding) the cryptographic cipher strength used in a system, even when that use has no confidentiality or integrity impact.

More-is-better bias

Description: The tendency to believe that measurable security settings continue to provide return on investment as that control is increased in the "more secure" direction
Example: Recognizing that never-expiring passwords might be a risk, so aggressively pursuing shorter and shorter password expiration durations.

Motivation bias

Description: The tendency to undervalue the risk to a system due to perceived lack of motivation of attackers to target that system
Example: Acknowledging a vulnerability yet dismissing the impact because attackers wouldn't be interested - despite the existence of threat groups that scan the entire internet daily to compromise anything they find exposed.

Novelty bias

Description: The tendency to focus on mitigating the novel aspects of an attack, rather than the root causes and more core defensive mitigations
Example: Focusing on unique command-and-control mechanisms leveraged by an actor, rather than mitigating how they got access in the first place.

Obscurity bias

Description: The tendency to overvalue the security benefit of keeping implementation details secret
Example: Requiring security pen testers to engage in "black box" audits of applications, rather than providing access to source code.

Popularity bias

Description: The tendency to inconsistently evaluate security of an application based on its popularity
Example: Criticizing a popular application for a security weakness that all comparable applications also exhibit.

Publicity bias

Description: The tendency to overestimate the soundness of a decision until subject to broader scrutiny
Example: Deciding to not fix a security issue, yet reversing this decision as management or the public learns about the risk.

Selection bias

Description: The tendency to make absolute security judgments based on a non-statistical observation of outcomes
Example: Evaluating the security of an application based on the number of CVEs reported on it without accounting for popularity or amount of focus given by security researchers.

Client IP Address Disclosure in various consumer mail servers


When users of several email services send mail using mechanisms other than that service's web interface (i.e.: their phone or laptop's email program), the service commonly includes the user's IP address in the message headers. This information disclosure lets recipients of these messages perform some privacy-invasive actions, such as:

    • Determining the sender's approximate geographical location
    • Correlating separate email addresses used by the same sender
    • Identifying the sender's broadband and / or cellphone provider

Users looking to send email in a manner that keeps this information private from message recipients should use either the web interface or an alternative mail provider.


Impacted:

  • Gmail
  • G Suite
  • Apple
  • Office 365
  • Outlook.com (consumer) in uncommon configurations
  • Most ISP email providers

Not impacted:

  • Yahoo, AOL
  • Outlook.com (consumer) in most common configurations
  • Protonmail

Edited 4/13/2020 - A previous version of this post singled out Gmail. While this is the most popular email provider impacted by this flaw in common configurations, they are not the only one.


While the RFC requires that email programs identify themselves to SMTP servers, consumer mail is a relatively new thing as far as internet time goes. It is a change in threat model compared to the work-centric mainframe systems that were at the core when the RFCs were written. Today, people use consumer email systems in privacy sensitive situations that they didn’t before - whistleblowing, at-risk communities, avoiding domestic abuse, and more.

People that are uncomfortable with a mail recipient knowing their approximate geographical location should avoid using email applications to send email via the impacted email services, or use a privacy-preserving consumer email service.


Non-web email clients generally use the Simple Mail Transfer Protocol (SMTP) to connect and send mail via their mail service provider. Examples of these non-web email clients are the mail applications built into phones (Android, iOS) or desktop / laptop operating systems (Apple Mail, Windows Mail).

When these applications connect to their mail provider's SMTP server to send an outgoing mail, part of the protocol requires a “HELO” or “EHLO” message. RFC 821 describes this exchange as:

      At the time the transmission channel is opened there is an
      exchange to ensure that the hosts are communicating with the hosts
      they think they are.

      The following two commands are used in transmission channel
      opening and closing:

         HELO <SP> <domain> <CRLF>

         QUIT <CRLF>

      In the HELO command the host sending the command identifies
      itself; the command may be interpreted as saying "Hello, I am
      <domain>".


                     Example of Connection Opening

         R: 220 BBN-UNIX.ARPA Simple Mail Transfer Service Ready
         S: HELO USC-ISIF.ARPA
         R: 250 BBN-UNIX.ARPA

                               Example 5


Most mail clients include the user’s public IP address and host name as part of this exchange, with Apple Mail even including the computer name and IP address on the internal network used to send the mail. One could argue that mail clients should not send this information as part of the exchange, but some SMTP servers validate that the actual IP address of the sender matches the content of the SMTP exchange as a form of spoofing protection.
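The original post demonstrates this with a mail client; the client's side of the greeting can also be sketched in a few lines of Python. Real clients (and Python's own smtplib) introduce themselves with their local host name, which is what ends up recorded in the message headers:

```python
import socket

# A mail client's side of the SMTP greeting, sketched in Python.
# The local host name it sends is what the server records in the headers.
def build_ehlo() -> bytes:
    local_name = socket.getfqdn()  # e.g. a name that embeds your ISP's domain
    return b"EHLO " + local_name.encode("ascii", "replace") + b"\r\n"

print(build_ehlo())
```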

As mail servers transport the message from its original SMTP server to the destination email server, much of this conversation is retained as part of the message’s mail header content. Microsoft has a good resource on how to see and interpret these mail headers, but an example looks like this:
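Since the original header screenshot doesn't reproduce well here, consider a simplified, hypothetical Received header and a Python sketch that pulls out the client IP the way a snooping recipient might:

```python
import re

# Hypothetical header of the kind an impacted service generates; the
# bracketed address is the sending client's public IP.
received = ("Received: from Lees-MacBook.local "
            "(c-203-0-113-25.example.net. [203.0.113.25]) "
            "by smtp.example.com with ESMTPSA id x123")

match = re.search(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", received)
print(match.group(1))  # the sender's client IP: 203.0.113.25
```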


As you can see in the “Received: from” headers above, some SMTP servers redact the sender’s client IP out of the rest of the exchange, replacing it instead with the address of the SMTP server. The services listed above as not impacted perform this client IP redaction – although, as noted above, there are some uncommon configurations in which Outlook.com does include this information.


The SMTP servers of some services do not perform this redaction, which provides the following data to recipients of the message:



Users looking to send email in a manner that keeps this information private from message recipients should use either the service's web interface or an alternative mail provider.

Disclosure Timeline


March 9, 2020 – Reported to Google security
March 10-19, 2020 – Assisted with steps to reproduce
March 20, 2020 – Resolved by Google as “Won’t Fix (Intended Behavior)”
March 20-31, 2020 – Worked to confirm that there were no future plans to fix
April 2, 2020 – Got confirmation that the team was aware of the issue and has no plans to fix


March 9, 2020 - Reported to Microsoft security
March 19, 2020 - Got confirmation that redaction was intentional for Consumer Outlook. This blog post and responses to it uncovered scenarios where this redaction wasn't being applied, and this is still under investigation.


April 3, 2020 - Reported to Apple security

Searching for Content in Base-64 Strings

You might have run into situations in the past where you’re looking for some specific text or binary sequence, but that content is encoded with Base-64. Base-64 is an incredibly common encoding format in malware and binary obfuscation tools alike.

The basic idea behind Base-64 is that it takes arbitrary binary data and encodes it using a set of 64 (naturally) printable ASCII characters that can be transmitted safely over any normal transmission channel. Wikipedia goes into the full details.
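A quick illustration of that behaviour, using Python's standard library:

```python
import base64

# Three bytes of input always become four Base-64 characters.
print(base64.b64encode(b"Man"))   # b'TWFu'

# Input that isn't a multiple of three bytes gets '=' padding.
print(base64.b64encode(b"Ma"))    # b'TWE='
print(base64.b64encode(b"M"))     # b'TQ=='
```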

Some tooling supports decoding of Base-64 automatically, but that requires some pretty detailed knowledge of where the Base-64 starts and stops.

The Problem

Pretend you’re looking for the string, “Hello World” in a log file or SIEM system, but you know that it’s been Base-64 encoded. You might use PowerShell’s handy Base-64 classes to tell you what to search for:


That seems useful. But what if “Hello World” is in the middle of a longer string? Can you still use ‘SGVsbG8gV29ybGQ=’? It turns out, no. Adding a single character to the beginning changes almost everything:


Now, we’ve got ‘IEhlbGxvIFdvcmxk’.
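Here's the same demonstration sketched in Python, for anyone without PowerShell handy:

```python
import base64

# A one-character prefix changes nearly every character of the encoding.
print(base64.b64encode(b"Hello World"))   # b'SGVsbG8gV29ybGQ='
print(base64.b64encode(b" Hello World"))  # b'IEhlbGxvIFdvcmxk'
```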

The main problem here is the way that Base-64 works. When we’re encoding, Base-64 takes 3 bytes (24 bits) and re-interprets them as 4 segments of 6 bits each. It then encodes each of those 6-bit segments into one of the 64 characters that you know and love. Here’s a graphical example from Wikipedia:


So when we add a character to the beginning, we shift the whole bit pattern to the right and change the encoding of everything that follows!

Another feature of Base-64 is padding. If your content isn’t evenly divisible by 3 bytes (24 bits), Base-64 encoding will pad the remainder with zero bits. It will use the “=” character to denote how many padding bytes were needed:


When final padding is added, you can’t just remove those “=” characters. If additional content is added to the end of your string (i.e.: “Hello World!”), that additional content will influence both the padding characters, as well as the character before them.

Another major challenge is when the content is Unicode rather than ASCII. All of these points still apply – but the bit patterns change. Unicode on Windows (UTF-16) usually represents characters as two bytes (16 bits), the second of which is zero for ASCII text. This is why the Base-64 encoding of Unicode content representing ASCII text has so many of the ‘A’ character: ‘A’ is the Base-64 representation of six zero bits – a NULL byte.
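You can see those 'A' characters appear as soon as the same text is encoded as UTF-16 instead of ASCII:

```python
import base64

# The same word, encoded as ASCII and as UTF-16 (little endian, as on Windows).
print(base64.b64encode("Hello".encode("ascii")))      # b'SGVsbG8='
print(base64.b64encode("Hello".encode("utf-16-le")))  # b'SABlAGwAbABvAA=='
```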


The Solution

When you need to search for content that’s been Base-64 encoded, then, the solution is to encode the text at each of the three possible byte offsets, and remove the characters that might be influenced by the surrounding context: content either preceding what you are looking for, or the content that follows. Additionally, you should do this for both the ASCII as well as Unicode representations of the string.
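The real Get-Base64RegularExpression is a PowerShell script; here's a simplified Python sketch of the same idea. It handles a single byte string at the three offsets (the actual script also emits the Unicode variants):

```python
import base64
import re

def base64_substring_regex(needle: bytes) -> str:
    """Build a regex matching `needle` wherever it appears inside
    Base-64 encoded content, at any of the three possible byte offsets."""
    alternatives = []
    for offset in range(3):
        # Encode with `offset` placeholder bytes in front, then trim the
        # characters whose bits are influenced by surrounding content.
        encoded = base64.b64encode(b"\x00" * offset + needle).rstrip(b"=").decode()
        start = [0, 2, 3][offset]          # chars tainted by preceding bytes
        end = len(encoded) - (1 if (offset + len(needle)) % 3 else 0)
        alternatives.append(re.escape(encoded[start:end]))
    return "(" + "|".join(alternatives) + ")"

pattern = re.compile(base64_substring_regex(b"Hello World"))
print(pattern.search(base64.b64encode(b"some prefix Hello World suffix").decode()))
```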

An Example

One example of Base-64 content is in PowerShell’s -EncodedCommand parameter. This shows up in Windows Event Logs if you have command-line logging enabled (and of course shows up directly if you have PowerShell logging enabled).

Here’s an example of an event log like that:


Here’s an example of launching a bunch of PowerShell instances with the -EncodedCommand parameter, as well as the magical Get-Base64RegularExpression command. That command will generate a regular expression that you can use to match against that content:


As you can see in this example, searching for the Base-64 content of “My voice is my” returned all four log entries, while the “My voice is my passport” search returned the single event log that contained the whole expression.

The Script

Get-Base64RegularExpression is a pretty simple script. You can use this in PowerShell, or any event log system that supports basic regular expression searches.

You can find it on the PowerShell Gallery:

Install-Script Get-Base64RegularExpression

Searching for Content in XOR “Encrypted” Data

A while back, we talked about a common challenge in the security industry – searching for some known bad content (i.e.: “Invoke-WebRequest”) in content that you know has been encoded in base64. In a really cool bout of co-discovery, others simultaneously wrote similar implementations, and this approach is now being integrated into YARA. Very cool times!

Another situation you might have run across is malware authors “encrypting” their content with a static XOR key – a process I like to call “encraption”. One of the neat things about XOR encraption is its simplicity: you encode the data by applying the XOR operator with a single-byte key to each byte of the data. To reverse the process, you just do it again. Despite being horrible from a security perspective, it is reasonably effective basic obfuscation that breaks string searching and simple signatures.
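The whole "encraption" scheme fits in a couple of lines of Python, and applying it twice gets you back where you started:

```python
def xor_encrapt(data: bytes, key: int) -> bytes:
    # XOR every byte with the same single-byte key; running it twice decodes.
    return bytes(b ^ key for b in data)

obfuscated = xor_encrapt(b"encrapted", 40)
print(obfuscated)                   # b'MFKZIX\\ML'
print(xor_encrapt(obfuscated, 40))  # b'encrapted'
```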

This pattern of decoding content (Base64, XOR, etc.) before running it is extremely common – and is a major driver behind why we added the Antimalware Scan Interface in Windows. AMSI is great at stripping these layers of obfuscation from content at runtime.

But what about static analysis or log hunting?

Like the challenge we had with Base64, SIEM systems don’t generally offer a way to decrapt embedded XOR content to let you search within it. But they do offer regular expressions. Can we take a similar approach to what we did in Base64 – generate a regular expression that matches content in XOR-encoded strings? It turns out, yes!

[Aside – in another wonderful bout of co-discovery, YARA added XOR encoding for files in August 2018.]

Let’s take a simple example – data that has been encrapted directly.


So a little script that reverses this and emits the output looks like this:


One of the key weaknesses of XOR is that there are only 255 possible single-byte XOR keys. If this script’s content made it into our SIEM, we could simply brute force the search. We could search for (“encrapted” BXOR 1), and then (“encrapted” BXOR 2), and then … and then (“encrapted” BXOR 40). Eventually, we would end up searching for “MFKZIX\ML” and find it. And fortunately, Regular Expressions support searching for multiple patterns all at once, so we can have a script simply generate a regular expression for all possible XOR keys.
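The real Get-XorRegularExpression is PowerShell, but the brute-force idea can be sketched in Python:

```python
import re

def xor_regex(needle: bytes) -> str:
    """Build a regex with one alternative per possible single-byte XOR key
    (1-255; key 0 would just be the plaintext itself)."""
    alternatives = []
    for key in range(1, 256):
        candidate = bytes(b ^ key for b in needle).decode("latin-1")
        alternatives.append(re.escape(candidate))
    return "(" + "|".join(alternatives) + ")"

pattern = re.compile(xor_regex(b"encrapted"))
# "encrapted" XOR 40 is "MFKZIX\ML" - the regex finds it inside a log line:
print(pattern.search("2020-04-01 suspicious blob: MFKZIX\\ML end"))
```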



The full regex is pretty long (255 elements), but this is a portion of what it looks like under the hood:


Now, XOR content is rarely encoded in scripts directly. Depending on the XOR key, the content will usually end up containing bytes that are not valid for use within a string. Usually, you’ll find that scripts have base64-encoded the XOR encraption.

For this scenario, we can leverage the “-Raw” parameter of Get-XorRegularExpression. This will return the raw bytes (rather than the escaped Regex representation), which we can then feed into our base64 regex generator. The result is quite a beast (765 elements: 3 base64 representations of each XOR key), but still a valuable source to hunt with.

Here’s an example of this happening in a script directly (taken from the AMSI blog post earlier):

In this example, the malware author uses a Unicode encoding of the string, so we use the “-Unicode” parameter of Get-XorRegularExpression to have it operate against the Unicode string.


While large, this is a regex we can now use against SIEM systems as well. Here’s an example of searching (and finding!) this content in PowerShell’s Script Block logs in Azure Sentinel:


And for some additional fun, we can even use the -Raw parameter of Get-Base64RegularExpression to generate Yara rules out of these byte sequences.



So, with a bit of creativity, we can now search for base64 content, XOR encoded content, and more in any SIEM that supports regular expressions. Enjoy!

You can download these scripts from the PowerShell Gallery:

Install-Script Get-Base64RegularExpression -Scope CurrentUser
Install-Script Get-XorRegularExpression -Scope CurrentUser
Install-Script New-YaraStringSearchRule -Scope CurrentUser

Star Trek TOS Science, Engineering, Command Embroidery Patterns

If you’re looking to get one of the Star Trek TOS patches (Science, Engineering, or Command) embroidered on something, embroidery shops often charge a lot of money to convert the logo to the format that embroidery machines understand.


Here are versions I created in the Brother PES format, as well as in the native format I created them in (EmbroideryWare STICH).

Command - [Brother PES] [EmbroideryWare STICH]

Science - [Brother PES] [EmbroideryWare STICH]

Engineering - [Brother PES] [EmbroideryWare STICH]

Dragon Ball Z Logo Embroidery Pattern

If you’re looking to get the Dragon Ball Z logo embroidered on something, embroidery shops often charge a lot of money to convert the logo to the format that embroidery machines understand.


Here’s a version I created in the Brother PES format, as well as in the native format I created it in (EmbroideryWare STICH).

[Brother PES]

[EmbroideryWare STICH]

PowerShell Logo Embroidery Pattern

If you’re looking to get the PowerShell logo embroidered on something, embroidery shops often charge a lot of money to convert the logo to the format that embroidery machines understand.


Here’s a version I created in the Brother PES format, as well as in the native format I created it in (EmbroideryWare STICH).

[Brother PES]

[EmbroideryWare STICH]

WebSockets from Scratch


In the web application world – especially single-page applications – smooth and fluid interaction is key. For many years, these applications have been doing a pretty good job of getting this fluid interaction through AJAX techniques and browser support for XMLHttpRequest. One issue, however, is that XMLHttpRequest requires that all of your communication go through a text-based HTTP protocol. Another issue is that XMLHttpRequest doesn’t let a server initiate communication back to connected clients. Instead, clients need to continuously poll the server to find out if it has anything to say.

To address this need, browser vendors introduced the WebSocket protocol. The WebSocket protocol is a persistent, two-way TCP connection between a WebSocket client (traditionally a browser) and a server.

You might wonder: “That’s all well and good, but why should I care?” That’s a good question with a simple answer :) This is a really cool project designed by Tim McGuffin (@NotMedic). Whatever you draw on the website goes onto his badge – via a 64x32 LED board and a bunch more cool software.


Tim carried this around DEF CON this year, and people had lots of fun :)


But can we have some _real_ fun? :)

Digging into the Implementation

When you dig into the code a little bit, you can see how the browser does the communication. For every pixel you draw on the canvas, it converts it into a JSON object with a command called “DRAW”, and “DATA” that represents an X, Y, Color combination. Then it sends that string data to a connected WebSocket (exampleSocket.send(…)).


Partial Automation – Code Generation

The first step to having fun with Tim’s badge was through code generation. It is really easy in PowerShell and .NET to iterate through the pixels of an image (bitmap, PNG, etc.). Each pixel you access gives you its color. From there, I took a picture I manually resized to the correct dimensions and had PowerShell generate JavaScript code that I could copy + paste into the Developer Tools console.


And from there, the Mona Lisa made her first appearance on Tim’s badge :)


That was fun, but copying + pasting code that PowerShell generated into a browser still felt a little hacky. Why don’t I just talk to the WebSocket directly from PowerShell? There are a few C# libraries out there for doing this, but I thought it would be a fun / interesting project to implement the protocol from scratch. So that’s what we’re going to do :)

WebSockets from Scratch

The WebSocket protocol is defined by RFC6455, which goes through the protocol in great detail.

Initial Upgrade Request

For the first phase of the connection, the browser makes a standard TCP connection to the remote server. Here’s how to do that in PowerShell:


In that TCP connection, it makes a standard HTTP request, requesting an upgrade to the websocket protocol. Part of this HTTP request includes a Sec-WebSocket-Key header, which is intended to ensure that random HTTP requests can’t be retargeted to WebSocket servers, and that WebSocket client requests can’t be targeted to arbitrary other TCP servers. Here’s an example, which hard-codes the key for demonstration purposes.


Once the server accepts the connection upgrade, a well-written client will verify that the key included in the server’s response was correctly derived from the Sec-WebSocket-Key you provided. This is what a server response looks like:
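That verification step is easy to sketch: the accept key is the SHA-1 of the client's key concatenated with a fixed GUID from RFC 6455, Base-64 encoded. A Python sketch, using the sample handshake values from the RFC itself:

```python
import base64
import hashlib

WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed value from RFC 6455

def expected_accept(sec_websocket_key: str) -> str:
    # Sec-WebSocket-Accept = Base64(SHA1(key + GUID))
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Sample values from RFC 6455 section 1.3:
print(expected_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```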


Data Frames

Once the connection has been upgraded, client applications send frames of data. This isn’t the raw data itself – there is a structure around the data to describe it appropriately to the server accepting the connection.


To hold the frame data in PowerShell, we start with defining a byte array. The first 8 bits are:

  • FIN: 1 bit describing whether this is the final frame or not. In my experimentation with Chrome, this was always set to ‘1’.
  • RSV1, RSV2, and RSV3: 1 bit each that should always be zero unless you’re implementing a special protocol on top of WebSockets.
  • OPCODE: 4 bits. The one used by Chrome by default is “TEXT FRAME”, which has the value of 0001.

Putting these 8 bits together gives you a single byte with the bits of 10000001. So that’s how we start each frame:

The next 8 bits (the next byte) are:

  • MASK: Whether the data is “masked” by a random obfuscation key. Masking is required for browser-based clients so that malicious web applications can’t cause attacker-chosen bytes to be written verbatim to the underlying TCP connection itself.
  • Payload Length: If the content is less than 126 bytes, this represents the payload length directly. Otherwise, it needs to be the value ‘126’, with the next two bytes representing the actual payload length.

Since we are always supposed to mask the data, we define our mask as ‘1’. However, this is supposed to be the bit at the leftmost position in the byte, so we need to shift it left by 7 places.


Since the mask and the payload length need to share the same byte (MASK being the leftmost bit and Payload Length being the rest), we use the –bor (Binary Or) bitwise operator to combine them into a single byte and then add that byte to the frame.


In the situation where the message length is 126 bytes or greater, though, the next two bytes need to also say how long the payload is. In C# and PowerShell, the way to get the bytes of a number is through the BitConverter.GetBytes() API. This API returns the bytes as they are represented by your processor. On most processors, this is Little Endian, where the least significant digits (think the 1s and 10s columns in decimal) come first.


The WebSocket protocol, however, requires that these bytes are in “Network Order” (the opposite of Little Endian), so we need to reverse them.


After the frame length, we need to provide the actual masking key. This is 4 bytes. As mentioned earlier, these are values used to obfuscate the content being transmitted to the server so that malicious web applications can’t communicate at the TCP level directly to the websocket server. The obfuscation algorithm is a simple XOR, where the four bytes in the masking key are used in a round-robin fashion against bytes in the content. Because this was just for fun and the security protection is irrelevant, I provided a static masking key of “0”: anything XOR’d with 0 is unchanged, so I didn’t have to implement the masking algorithm and the server wouldn’t notice :)


Next, we get to add our actual content to the frame – in this case, a message representing the JSON of pixels that we want to draw.


Finally, we write this frame to the TCP connection and flush it.
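Putting all of the steps above together, here's a compact Python sketch of the same frame construction (using the same zero-masking-key shortcut; payloads over 65,535 bytes would additionally need the 8-byte length form, which this sketch omits):

```python
import struct

def build_text_frame(payload: bytes) -> bytes:
    frame = bytearray()
    frame.append(0b10000001)        # FIN = 1, RSV1-3 = 000, opcode = 0x1 (TEXT)
    mask_bit = 1 << 7               # MASK bit in the leftmost position
    if len(payload) < 126:
        frame.append(mask_bit | len(payload))
    else:
        frame.append(mask_bit | 126)
        frame += struct.pack(">H", len(payload))  # length in network byte order
    frame += b"\x00\x00\x00\x00"    # all-zero masking key: XOR with 0 is a no-op
    frame += payload                # "masked" payload (unchanged, thanks to the key)
    return bytes(frame)

print(build_text_frame(b"Hi"))  # b'\x81\x82\x00\x00\x00\x00Hi'
```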


When all is said and done, pretty pictures show up on Tim’s badge :)

When I sent Rick Astley, Defender ATP alerted Microsoft IT that one of its computers had PowerShell making raw TCP connections to a newly-registered domain. Amazing!


I tried to WannaCry Tim’s badge, and he still hasn’t paid :(


And with true raw WebSockets control, I was able to give Tim’s badge a demoscene :)

So that’s how to go about WebSockets from Scratch. Hope this helps!

Efficiently Generating Python Hash Collisions

In 2003, Crosby and Wallach wrote an excellent paper demonstrating a new form of Denial of Service vulnerability against applications by abusing algorithmic complexity in the data structures that they depend on. For example, some data structures operate quite efficiently when given everyday input, but the performance degrades precipitously in certain edge cases.

If an application provides user-controlled data to one of these data structures, attackers can intentionally provide this worst case form of input to generate a Denial of Service attack against the application.

In 2011, Alexander Klink and Julian Wälde demonstrated this form of attack quite effectively against web servers that take attacker-provided HTTP headers and represent them internally as hashtables.

Hashtables work extremely quickly for most input by distributing items into different buckets based on the hash value of the item’s key. There is a degenerate case, however, when hash collisions cause the algorithm to distribute items into the same bucket. Once hundreds or thousands of items start being placed into the same bucket due to having the same hash value, the performance of a hashtable degrades drastically: each operation now takes linear rather than constant time, so a flood of colliding insertions takes quadratic time overall. A web server that used to be able to handle 1,000 requests per second may now only be able to handle 10. Or 1. Or worse.
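You can watch the degenerate case happen with a toy chained hashtable, sketched in Python. When every key hashes to the same bucket, each insert must walk all previous entries, so the total work grows quadratically:

```python
class CollidingKey:
    """A key whose hash we control - simulating attacker-chosen colliding input."""
    def __init__(self, name, h):
        self.name, self.h = name, h
    def __hash__(self):
        return self.h
    def __eq__(self, other):
        return self.name == other.name

class TinyHashtable:
    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]
        self.comparisons = 0
    def insert(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for pair in bucket:             # scan the chain for an existing entry
            self.comparisons += 1
            if pair[0] == key:
                pair[1] = value
                return
        bucket.append([key, value])

table = TinyHashtable()
for i in range(1000):
    table.insert(CollidingKey(i, h=42), i)  # every key hashes to 42

print(table.comparisons)  # 499500 comparisons: n*(n-1)/2 for n = 1000
```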


Many programming languages and frameworks now account for this type of attack. Python fixed this by default in Python 3.3+, released over 7 years ago. Despite that, many applications still build on Python 2.7 – including an application I was testing. Alexander and Julian never released the code they used to DoS Python, so I embarked on an interesting journey to find the most efficient way possible to generate hashtable collisions to DoS Python.

Before this research, the most efficient mechanism I could find was from Robert Grosse. He focused on 64-bit hashes, and his approach took about 80 hours to generate 50,000 hash collisions. The result of my research (against 32-bit Python) generates billions of collisions essentially instantaneously (as fast as your computer can print them to the screen, write them to a file, etc.).

Background: Understanding Python’s hash implementation

Python (remember: 3.2 and below) uses a FNV-style algorithm to calculate the hash values of strings. The hash value is a single 32-bit number that is the result of running a calculation over all of the characters in the input string. Here’s a simple re-implementation in C that takes the string to hash as its first argument:


The variable ‘x’ represents the resulting hash value. The variable ‘a’ represents the string to be hashed. The variable ‘p’ represents the value of the current character being processed.

Let’s walk through an example of this algorithm.


On line 21, the hash value is initialized to zero. The PowerShell console on the right-hand side shows the state of the hash value throughout the algorithm, and the highlighted line shows the bits of the current value of the hash. They are all zero, of course, since we just initialized it to zero.


Line 22 takes the value of the first letter (‘L’ is 1001100 in binary), shifts it left by 7 places, and then blends it into the hash value with the XOR operator.


Line 25 is where we start the character-by-character processing. Python takes the current value of the hash, and multiplies it by 1000003. This is a large prime number, which essentially largely and unpredictably churns the bits in the leftmost digits (the red square – I call that area the “bit blender”). Then, it uses the XOR operator to bring this character into the hash value for the rightmost digits.



Then, we rinse and repeat until the end of the string.


For the final stage, Python takes the current hash value and uses the XOR operation to incorporate the length of the string, which is 3 in our case. This step is done to reduce the number of collisions for short strings.


When we convert this final value to an integer, we get the hash value that you know and love from Python 3.2 and below.
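Putting the whole walkthrough together, the algorithm fits in a few lines. This is a Python sketch of the same logic using unsigned 32-bit wrapping arithmetic (CPython’s real implementation returns a signed result and special-cases -1, which is omitted here for clarity):

```python
def pyhash32(s: bytes) -> int:
    """Re-implementation of Python <= 3.2's FNV-style string hash (unsigned, 32-bit)."""
    if not s:
        return 0                                   # the empty string hashes to 0
    x = (s[0] << 7) & 0xFFFFFFFF                   # blend the first character in early
    for c in s:
        x = ((x * 1000003) ^ c) & 0xFFFFFFFF       # "bit blender" multiply, then XOR
    return (x ^ len(s)) & 0xFFFFFFFF               # final XOR with the string length
```

Note how a string of NUL bytes never perturbs the internal state: `pyhash32(b"\x00" * n)` is simply `n` – a property that becomes important shortly.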

Meet in the Middle Attacks

Now that we have an understanding of how Python implements its hash algorithm, we can explore the previous state-of-the-art technique used to attack the algorithm.

Pretend that you’re trying to generate hundreds or thousands of strings that all hash to the value ‘0’ as a way to generate a hashtable denial of service. One way to do this is to brute force your way through billions and billions of strings and hope to find hundreds or thousands that have the hash value of ‘0’. This is incredibly inefficient and will take you many hundreds of years of computation.

Another way is called a “Meet in the Middle Attack”. This is the method used by Alexander Klink and Julian in their CCC presentation, as well as the technique explored more deeply for 64-bit Python by Robert Grosse.


As Python iterates through a string, it keeps updating a running hash value, and once it reaches the end of the string, it outputs the value it has calculated. For example, imagine that as Python processes the string QCMWaIOFFgH, the running hash value after the first seven characters (QCMWaIO) is 3B847A2A. It turns out that Python’s FNV-style hash algorithm can also be partially reversed. If you work backwards from a desired hash value (i.e.: 0), you can figure out a set of intermediate hash values such that “if the intermediate hash value is <x> and the rest of the string is <y>, then the resulting hash value will be 0”. This isn’t fully deterministic: for a given hash value or internal state, there are many characters and previous internal states that can lead to it.

You can combine both of these tricks with a lot of brute force to massively speed up a naïve brute force approach.

As the first step, we calculate the hash values of a bunch of strings. We record the internal hash value as Python processes each character. For each internal state value, we save the characters in the string that led up to it.

As the second step, we take a desired hash value (i.e.: 0) and reverse the hashing process. We iterate through a bunch of hypothetical preceding characters and figure out the internal state hash value that could have led to it.

Both of these steps require a lot of computational power and a lot of memory.

The example above demonstrates a situation where we found:

  1. A string, QCMWaIO, where processing it up until that point generated an internal hash value of 3B847A2A
  2. Two strings, vpl and wQl, where if the internal hash value was 3B847A2A then it would result in a hash value of 0

If you combine these two paths together, you get a hash collision: two strings (QCMWaIOvpl and QCMWaIOwQl) that both hash to the value of 0. Apply enough time, memory, and computational power and you have thousands of hash collisions! This is many thousands or millions of times faster than the basic brute force approach.
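The two steps above can be sketched in runnable form. To keep the search instantaneous, this toy attacks a scaled-down 16-bit version of the same algorithm (my own simplification – the real attack has identical structure against the 32-bit state, just with far more time and memory). Because the multiplier 1000003 is odd, the per-character step can be exactly inverted with a modular inverse:

```python
from itertools import product

M, MASK = 1000003, 0xFFFF                  # toy 16-bit state (real Python: 32-bit)
INV = pow(M, -1, MASK + 1)                 # modular inverse of M (Python 3.8+)
ALPHABET = b"abcdefghijklmnopqrstuvwxyz"

def step(x, c):
    return ((x * M) ^ c) & MASK            # one character of forward hashing

def unstep(x, c):
    return ((x ^ c) * INV) & MASK          # undo one character

def toy_hash(s):
    x = (s[0] << 7) & MASK
    for c in s:
        x = step(x, c)
    return x ^ len(s)

TARGET, K, L = 0, 3, 3                     # desired hash value, prefix/suffix lengths

# Step 1 (forward): record the internal state reached by every 3-byte prefix.
forward = {}
for prefix in product(ALPHABET, repeat=K):
    x = (prefix[0] << 7) & MASK
    for c in prefix:
        x = step(x, c)
    forward[x] = bytes(prefix)

# Step 2 (backward): walk each 3-byte suffix backwards from the target,
# undoing the final length XOR first.
end_state = TARGET ^ (K + L)
collisions = []
for suffix in product(ALPHABET, repeat=L):
    x = end_state
    for c in reversed(suffix):
        x = unstep(x, c)
    if x in forward:                       # the two halves meet in the middle
        collisions.append(forward[x] + bytes(suffix))
```

Every string in `collisions` hashes to `TARGET` under the toy hash; against the real 32-bit state, the forward and backward tables simply need to be billions of entries rather than thousands.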

Hash Stretching: Making Meet-in-the-Middle attacks 250 times more efficient

One thing that became clear after analyzing Python’s FNV-style algorithm is that it has a critical weakness: intermediate hash values of zero.


If the intermediate hash state (the variable x in this sample code) ever gets to 0, then multiplying it by 1000003 doesn’t change the intermediate hash value at all, and you lose the unpredictable mixing that step normally provides. Then, you blend in an attacker-controlled character (the variable p in this sample code), giving attackers fine-grained control over the internal hash state.

It only takes a few seconds of brute forcing to find strings that at some point have an internal hash state of zero.

This introduces two major attacks:


  1. Control of the final hash value. If the internal hash state ever reaches zero, the next character of the string completely defines the new internal hash state. If there are no more characters in the string, then the final hash value (as in the first example above) will simply be the length of the string due to Python’s final XOR with the length of the string. Additionally, since you know that Python will do a final XOR by the length of the string, you can use your next character after the state hits ‘0’ to make this final XOR generate a final hash value of anything between 0 and 255. For example, if you supply the next character as the decimal value ‘5’, an XOR by the new string length of ‘5’ will result in a hash value of zero.
  2. Hash collision stretching. If the internal hash state ever hits zero, processing a character with the value of zero doesn’t change anything except the string length. Because of this, you can stretch a single hash collision into more hash collisions by simply adding new NULL characters when the internal hash value reaches zero. This changes the string length, so you also need to use technique #1 (controlling the final hash value) to remove the string length from the final calculation. This works for strings of length up to 255 (the maximum value of a byte you control), which lets you generate about 250 more collisions for every collision you find where the internal state reaches zero.
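Both techniques are easy to demonstrate with NUL bytes, since a string of NULs keeps the internal state at zero the whole way through (0 × 1000003 ^ 0 == 0). Using a re-implementation of the hash algorithm described above, every string of the form “n−1 NUL bytes followed by the byte n” hashes to exactly zero:

```python
def pyhash32(s: bytes) -> int:
    """Python <= 3.2's FNV-style string hash (unsigned 32-bit arithmetic)."""
    if not s:
        return 0
    x = (s[0] << 7) & 0xFFFFFFFF
    for c in s:
        x = ((x * 1000003) ^ c) & 0xFFFFFFFF
    return (x ^ len(s)) & 0xFFFFFFFF

# With the internal state stuck at zero, the last character completely defines
# the pre-length hash value (technique #1). Setting it to the string's own
# length cancels the final XOR, and varying the number of NULs stretches one
# collision into ~250 (technique #2).
collisions = [b"\x00" * (n - 1) + bytes([n]) for n in range(2, 256)]
```

All 254 strings collide at hash value 0, even though they have different lengths.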

Using this approach, I was able to generate 10,000+ collisions with only a few hours of brute force. A fun and huge improvement to the state of the art, but we’re not done yet 🙂

Generating Billions of Collisions: The Null Points method

As mentioned, Python’s FNV implementation has a critical weakness whenever the internal hash state reaches zero. But finding those strings with an internal hash state of zero required a couple of hours of brute force. This leads to a question: Since the algorithm starts the hash state at zero and we find a part of a string (i.e. f8ds3k3kk) that leads to an internal hash state of zero, couldn’t we just repeat that substring to keep on finding new strings with internal hash states of zero? Like f8dsf8dsf8dsf8ds3k3kk?

Initially, the answer appeared to be ‘No’.


The big issue is Python’s processing of the initial character. If you recall the very first step, Python shifts the initial character to the left by 7 bits, thoroughly placing it into the unpredictable Bit Blender. If we tried to place that character again after we hit the internal state of zero, we wouldn’t be able to replicate the left shift by 7 places. So it seems we’re out of luck.

Except for one initial character!


When we shift left by seven bits, there is exactly one byte that doesn’t shift anything into the bit blender. The value ‘1’. When the algorithm shifts ‘1’ left by 7 bits, you get the value ‘128’ (0x80) with nothing in the bit blender – leaving an intermediate hash value that we can still completely influence with further characters under our control. So if we ever find a string that starts with the byte ‘1’ and eventually gets to an internal state of ‘0’, we can absolutely use these little islands of characters to get back to an internal hash state of ‘0’.

Here’s an example. After a few seconds of brute force, we find the string \x01\x02\xBC\x6F\xA4\xB6 that gets to an internal hash state of ‘0’.


As our next character, we use 0x80 (the initial ‘1’ shifted left 7 bits), and then we can clone our attack string again (“\x01\x02\xBC\x6F\xA4\xB6”) to get back to an internal hash state of zero. We use our final byte to influence the string length XOR to produce a final hash result of our choice, and we now have a repeatable way to influence the hashing operation from beginning to end.

More importantly, our little attack meta-pattern (“\x01\x02\xBC\x6F\xA4\xB6”) is only one example of a character sequence vulnerable to the null points method. With a bit more brute forcing, it’s possible to find these meta-patterns pretty quickly. With a few hours of computation, I easily found 16 patterns that met both criteria:

  • They started with the byte ‘1’
  • They caused Python’s internal hash state to reach ‘0’ at some point

The true beauty of this attack is that the meta-patterns are interchangeable. The example above demonstrates that we can generate a lot of collisions from strings in the form of “<Pattern1>0x80<Pattern1>0x80<Pattern1>”. But since we’re just abusing the fact that we got to an internal state of ‘0’, we can also generate them in the form of “<Pattern1>0x80<Pattern2>0x80<Pattern1>”. Or “<Pattern1>0x80<Pattern1>0x80<Pattern2>”. Start combining these meta-patterns systematically, and you have nearly infinite collisions at your disposal.
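The whole null-points construction fits in a few lines. As before, this sketch runs against a scaled-down 16-bit version of the hash (my own toy stand-in, so the pattern search takes milliseconds rather than hours; against the real 32-bit state the structure is identical):

```python
from itertools import product

M, MASK = 1000003, 0xFFFF                   # toy 16-bit state (real Python: 32-bit)

def toy_hash(s):
    x = (s[0] << 7) & MASK
    for c in s:
        x = ((x * M) ^ c) & MASK
    return x ^ len(s)

# Find meta-patterns: 4-byte strings starting with byte 1 that drive the
# internal state to zero. The last byte is solved for directly rather than
# brute-forced.
patterns = []
for c2, c3 in product(range(256), repeat=2):
    x = 1 << 7                              # byte 1 is the only safe first byte
    for c in (1, c2, c3):
        x = ((x * M) ^ c) & MASK
    last = (x * M) & MASK                   # need ((x * M) ^ last) == 0
    if last < 256:
        patterns.append(bytes([1, c2, c3, last]))

# Chain any two patterns with 0x80 (byte 1 shifted left by 7 bits, restoring
# the initial state) and append a final byte equal to the total string length
# to cancel the length XOR.
length = 2 * 4 + 1 + 1                      # two patterns + separator + final byte
collisions = [p1 + b"\x80" + p2 + bytes([length])
              for p1, p2 in product(patterns, repeat=2)]
```

With p patterns, two slots already give p² collisions at hash value 0; adding more “0x80 + pattern” segments multiplies the count by p each time.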


With these 16 meta-patterns of only 6 bytes each, you can generate 4 billion hash collisions against Python’s hash algorithm using only 48 bytes each. Due to the magic of combinatorics, the sky really is the limit: this technique will let you generate hash collisions against Python 3.2 and below as fast as your hard drive or network can process them. Have a 256-byte limit on key length? That’s 3.74144419156711E+50 (374144419156711000000000000000000000000000000000000) collisions! And if you want to experience a true embarrassment of riches, we can make this attack 250x more efficient with the hash collision stretching approach discussed earlier. But with this many zeros, who’s counting 🙂

Impact on Python Dictionary Insertion Performance

In raw numbers, Python’s dictionary insertion performance is pretty fast – even with collisions. But at only 50,000 colliding entries in a hashtable, Python starts to peg a CPU for 80 or 90 seconds. This represents about 2 MB of raw data. So with a second or two of data transfer, you can keep a CPU busy for a minute or two.


The critical problem is that this performance degrades quadratically. Submit 4 MB of data (100,000 collisions), and you may never see that CPU again. Submit 4 billion collisions? 3.74144419156711E+50 collisions? It’s safe to say that if you are exposing Python 3.2 (or below) hash tables to untrusted input, you have a significant security vulnerability on your hands. And by the way – json.loads() generates a dictionary backed by a hashtable.

Coordinated Vulnerability Disclosure

While this research demonstrates a fundamental break in the Python 3.2 (or below) hash algorithm, Python fixed this issue 7 years ago. It’s time to upgrade.