Archives for the Month of November, 2017

Automatic Word Clustering: K-Means Clustering for Words

K-Means clustering is a popular technique to find clusters of data based only on the data itself. This is most commonly applied to data that you can somehow describe as a series of numbers.

convergence

When you can describe the data points as a series of numbers, K-Means clustering (Lloyd’s Algorithm) takes the following steps:

  1. Randomly pick a set of group representatives. Lloyd’s algorithm generally picks random coordinates, although sometimes picks specific random data points.
  2. Assign all of the items to the nearest representative.
  3. Update the group representative to more accurately represent its members. In Lloyd’s algorithm, this means updating the location of the representative to represent the average location of every item assigned to it.
  4. Revisit all of the items, assigning them to their nearest group representative.
  5. If any items shifted groups, repeat steps 3-5.

Applying this technique directly to words is not possible, as words don’t have coordinates. Because of that:

  • Randomly picking a coordinate cannot be used to randomly create a group representative.
  • Refining a group representative based on its current word cluster is more complicated than simply averaging the coordinates of the items in the cluster.

If we follow the philosophy of Lloyd’s algorithm, however, we can still create a version of K-Means Clustering for Words. In our implementation, we:

  1. Pick random words from the provided list as group representatives.
  2. Use Levenshtein Distance (string similarity) to measure "nearest group representative".
  3. Use word averaging to update the nearest group representative. Word averaging is a new word of the "average" word length, with characters at each position created by taking the most common letter at that position.

This is very computationally expensive for large data sets, but can provide some very reasonable clustering for small data sets.

To explore this further, you can download Get-WordCluster from the PowerShell Gallery. It’s as simple as:

  1. Install-Script Get-WordCluster –Scope CurrentUser
  2. (-split "Hello world Hell word SomethingElse") | Get-WordCluster -Count 3

Easily Search for Vanity Ham Call Signs

When you first get your ham radio license, the FCC gives you a random call sign based on your location and roughly your date of application. The resulting call sign is usually pretty impersonal, but the FCC lets you apply for a “vanity” call sign for free.

While the rules for these vanity call signs change depending on your license class (Technician, General, Extra), most of the good (shorter) vanity call signs that fall under the “extra” rules are taken. So realistically, your options will be likely be a 2x3 callsign (group C or group D).

image

One place to look for available call signs is at http://callsign.ualr.edu. Please be kind and don’t pound their server! If you want to do hundreds / thousands of lookups, you can download the FCC database directly.

But if you only want to look at tens / hundreds, here’s a PowerShell script to help out.

001
002
003
004
005
006
007
008
009
010
$final3 = "LEE"
$prefixes = "WX","KZ"

foreach($prefix in $prefixes) {
    foreach($number in 0..9) {
        $call = $prefix + $number + $final3
        $wr = Invoke-WebRequest http://callsign.ualr.edu/detail.php -Method Post -Body @{ call = $call }
        if($wr.Content -match "no records were found!") { $call }
    }
}