When we last left off, we discussed hash codes - *a small unique identifier for an item based on its properties or features*.

There are many different ways to calculate unique identifiers for items. These different ways are called *Hash Algorithms*. Each serves a different purpose, but some popular ones right now are are MD5, SHA1, and SHA256.

Why so many?

The reason is security. Our last post finished with the question:

*Using hash codes for simple content verification is one thing. But what if an attacker is trying to trick you with hash code collisions? For example, can you use that Visual Studio download if it’s from your worst enemy – but the hash code still matches? It depends on the algorithm!*

The security industry first started touching on the importance of “secure” hash algorithms in the 1970s. There were a flurry of proposals in the 80s that were immediately dismissed with security problems, but one called “MD5” was introduced in the early 90s that seemed like it might last the test of time.

In 2004, however, advances in security research found issues in MD5 – thus “breaking MD5”. While you can still use MD5 just fine when security is not a concern, you cannot trust a version of Visual Studio from your enemy just because the MD5 hash value matches.

Here are the four features of an ideal cryptographic hash algorithm:

- it is easy (but not necessarily quick) to compute the hash value for any given message
- it is infeasible to generate a message that has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash

If you get #2, then for the most part you get #3 and #4 automatically.

How does one “break” a hash algorithm? Let’s take a look a a few from the .NET Framework for an interesting exercise in reverse engineering.

As a background, every object in the .NET Framework supports a method called “GetHashCode()”. It does a good job of returning a small random(ish) identifier for an object, so it is used primarily for putting objects into hash tables. This method returns a 32-bit integer, and is **not** designed for security purposes:

publicvirtualint GetHashCode()

The HashAlgorithm class **is** designed for security purposes, however, and should be used when security is a concern. We won’t use that class as an example, because if I can break that class in this blog, then it wouldn’t be in the .NET Framework!

## Breaking Int32

Let’s see how this works on integers, and see how it would fare as a cryptographic hash algorithm:

PS > (1).GetHashCode()
1
PS > (2).GetHashCode()
2
PS > (1337).GetHashCode()
1337
PS > (-1337).GetHashCode()
-1337

I think we see a pattern 🙂 It looks like .NET is returning our input right back at us. It’s a sure way to prevent collisions, that’s for sure! If we open Int32.GetHashCode() in .NET Reflector (or ILSpy), we can see why:

**public** override **int** **GetHashCode**()

{

return **this**;

}

The score sheet?

☑ |
it is easy to compute the hash value for any given message |

☒ |
it is infeasible to generate a message that has a given hash |

☑ |
it is infeasible to modify a message without changing the hash |

☑ |
it is infeasible to find two different messages with the same hash |

Not bad, but you can’t do much with just integers.

## Breaking Int64

That wasn’t too challenging to break. What about Int64? That doesn’t fit in 32 bits by definition, so let’s see if there is any pattern:

PS > ([Int64] 1).GetHashCode()
1
PS > ([Int64] 1337).GetHashCode()
1337
PS > ([Int64] 12345678987654321).GetHashCode()
1656367333

For “low” numbers (below 32 bits?) it seems to be the same as the input number. After that, not quite sure. We can’t break this one by intuition. Let’s see the source:

**public** override **int** **GetHashCode**()

{

return (**int**)**this** ^ (**int**)(**this** >> 32);

}

What the .NET Framework is doing here is pretty simple:

(**int**)**this** |
Takes the lower 32-bits of the 64-bit number. |

(**int**)(**this** >>32) |
Shifts the number to the right by 32 places, and then takes the lower 32 bits of the number. That is the same thing as taking the upper 32 bits of the number. |

(**int**)**this** ^(**int**)(**this** >>32) |
Uses the binary XOR operator to combine the upper 32 bits and the lower 32 bits. |

The first row explains why all of our “low” numbers seemed to just return themselves. The low numbers were the lower 32 bits, and there were no upper 32 bits.

Let’s take a look at:

PS > ([Int64] 12345678987654321).GetHashCode()
1656367333

If we convert this to the upper and lower 32 bits, we get:

PS > [Convert]::ToString(12345678987654321, 2)
101011110111000101010001100010100100011111010010110001
PS > [Convert]::ToString(12345678987654321, 2).PadLeft(64, "0")
0000000000101011110111000101010001100010100100011111010010110001

Now, split them into the top and bottom 32 bits take the binary exclusive OR (either bit is 1, but not both):

Upper 32 |
00000000001010111101110001010100 |

Lower 32 |
01100010100100011111010010110001 |

XOR |
01100010101110100010100011100101 |

Taking a look at the final result, it looks like we have the same as what .NET gave us:

PS > [Convert]::ToInt32("01100010101110100010100011100101", 2)
1656367333
PS > ([Int64] 12345678987654321).GetHashCode()
1656367333

Now that we understand the algorithm, let’s see how we might break the strongest guarantee of a hash algorithm – namely, “*it is infeasible to generate a message that has a given hash*.”

Let’s pretend we want to generate a 64-bit number with the hash value of **1337. **That has the following binary value:

PS > [Convert]::ToString(1337, 2).PadLeft(32, "0")
00000000000000000000010100111001

Since the 64-bit hash is generated by combining the upper and lower halves of the number, the simplest way is to simply ignore the upper 32 bits:

Upper 32 |
00000000000000000000000000000000 |

Lower 32 |
00000000000000000000010100111001 |

XOR |
00000000000000000000010100111001 |

That was simple enough. What about breaking the next strongest feature: “*it is infeasible to find two different messages with the same hash.*”

Given that we know how to find an Int64 for any given hash value, how can we find *multiple* Int64s that have the same hash value?

To figure this out, we’ll use a trick from the XOR operation frequently used in XOR cyphers and XOR encryption:

*If A XOR B = C*

*
**Then A = B XOR C *

And B = A XOR C

If we pretend that A is the upper 32 bits of a number, B is the lower 32 bits of a number, and C is the hash code, then we have the following options:

Upper 32 bits = (Lower 32 bits) XOR (Hash code)

or

Lower 32 bits = (Upper 32 bits) XOR (Hash code)

Given that, here’s all we have to do to find a number with a specific hash code:

- Pick an arbitrary number for the lower (or upper) 32 bits
- XOR it with the hash code
- The result will be the other 32 bits you need

Let’s look at an example, using 10001011001010110101101011010111 as a randomly-chosen number:

Upper 32 |
10001011001010110101101011010111 |

Hash Code |
00000000000000000000010100111001 |

XOR (= Lower 32) |
? |

PowerShell has the “-bxor” operator, so let’s make this easier on ourselves:

PS > $upper32 = [Convert]::ToInt32("10001011001010110101101011010111", 2)
PS > $hash = [Convert]::ToInt32("00000000000000000000010100111001", 2)
PS > $lower32 = $upper32 -bxor $hash
PS > [Convert]::ToString($lower32, 2)
10001011001010110101111111101110

So now we have a hash collision – let’s be sure:

PS > $number = [Convert]::ToInt64(
>> "10001011001010110101101011010111" +
>> "10001011001010110101111111101110", 2)
>>
PS > $number
-8418535196639666194

Let’s see:

PS > ([Int64] -8418535196639666194).GetHashCode()
1337

Bingo. We can now generate hash collisions for arbitrary hash codes. What does that mean?

☑ |
it is easy to compute the hash value for any given message |

☒ |
it is infeasible to generate a message that has a given hash |

☒ |
it is infeasible to modify a message without changing the hash |

☒ |
it is infeasible to find two different messages with the same hash |

**Bonus question**: *how many hash collisions can you generate for a given hash code? And how might you do so systematically?*

Chew on that a bit – in our next post, we’ll take a crack at reverse-engineering **String.GetHashCode()**.