The Association for Computational Linguistics held its 2013 conference in Bulgaria earlier this month. Computational Linguistics is the study of how to apply computer science to problems involving language and speech. It’s a very broad subject, which also includes studies of how to decipher codes and cryptograms.

An interesting paper to emerge from this conference is “Beam Search for Solving Substitution Ciphers”, authored by some folks from RWTH Aachen University in Germany. They claim the technique in their paper can very quickly crack substitution ciphers with very small decipherment error. They also apply the technique to the Zodiac Killer’s 408-character cryptogram.

The paper has this claim:

Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letter-based 6-gram language model.”

They must not be aware of other effective n-gram based hillclimbers such as zkdecrypto and CryptoCrack. This is probably because those tools are seldom cited in other academic papers, so researchers tend to overlook them.

(Ravi and Knight, 2011a) report the first automatic decipherment of the Zodiac-408 cipher. They use a combination of a 3-gram language model and a word dictionary.

From the referenced paper:

The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.

An early version of zkdecrypto was able to automatically and fully decipher the 408 back in December 2006. My 2008 paper descibes an automatic decipherment of enough of the 408 to easily derive the complete solution. And there must be other automatic solvers out there that could have handled the job. The 408 is not a difficult cryptogram to solve using statistical methods, because there are enough clues in it to give codebreakers leverage.

This small oversight notwithstanding, the new approach described in the ACL 2013 paper is interesting. Beam search is a variation of breadth-first search. A breadth-first search starts with an initial key, and then explores every possible change to it. This is not practical, due to the astronomical number of possible keys. Beam search fixes that by “pruning out” the low-scoring possibilities, so it only has to explore the ones it thinks are worth the trouble. The scores are based on n-gram statistics, weeding out solutions that don’t resemble normal English.

This seems to be a bit more systematic than other hillclimbers such as zkdecrypto that use a randomized search that continually “perturbs” the key, exploring the higher-scoring modifications. It’s unclear how well beam search stacks up against zkdecrypto, since zkdecrypto is already shown to be very fast and effective. The paper does not mention the unsolved 340 at all. The 340 will very likely not succumb to yet another approach that assumes a conventional construction, but I’m still curious what happened when they ran the algorithm against it.

Also in the conference, Kevin Knight gave a tutorial entitled “Decipherment”, in which he discussed the basics of codebreaking, classic historical ciphers, and famous unsolved codes:


Click here to view the slides from his tutorial

Highlights:

  • An interesting angle Knight presents is that translating from a foreign language is a lot like dealing with a cipher, where you substitute words with other words.
  • This slide made me think of zkdecrypto:

    Random restarts are used in zkdecrypto to great effect, helping the hillclimbing search avoid getting stuck on hills that are too small.
  • Knight reviews some recent historical decipherments: Jefferson cipher, a Civil War cryptogram, German Naval Enigma, and Knight’s own work cracking the Copiale cipher.
  • One of the keys to cracking the Copiale cipher was to cluster the cipher alphabet based on similar contexts (i.e., certain symbols tend to be preceded by or followed by specific symbols). The clusters were determined with a very useful “cosine similarity” measurement, which I implemented in CryptoScope. If you switch to the 408 there, and click the “Letter Contacts” button, you’ll see a Cosine similarities section under the contacts grid. It shows how all the real homophones tend to cluster at the top of the list.
  • I’d assumed that much of the Copiale breakthroughs were discovered via computer, but apparently a lot of pencil and paper work was needed:
  • The NSA recently declassified a 1979 paper by Mary D’Imperio about applying modern statistical techniques to the Voynich Manuscript. See also: D’Imperio’s The Voynich Manuscript: An Elegant Enigma.
  • The NSA also recently declassified 4,400 pages of its newsletter, “Cryptolog“. Despite containing numerous redactions, it has some interesting articles on many topics, including the Voynich Manuscript.
  • Knight talked about his 2011 paper with Ravi in which their method produced a very accurate decipherment of the Zodiac’s 408 cipher.
  • He also talked about the unsolved Zodiac 340, and how it has no obvious reading order bias based on analyzing bigrams in several directions:

    There you can see the 408 has a preference for left/right order, since it results in a spike of repeating bigrams. So, the 340 “could be nonsense, or maybe bigrams are smoothed out via more careful substitutions.”

Sounds like it was a fascinating conference. Maybe I can attend the next one. Meanwhile, have a look at the NSA’s upcoming 2013 Cryptologic History Symposium. Among the speakers will be Klaus Schmeh and David Kahn. And I’ve heard that Elonka Dunin will be there. Looks like a unique learning opportunity!

Here’s a fun, but arguably pointless gadget I built as a follow up to the article about ambiguous word searches:


Click here to try out the Word Search Gadget

You enter a word, click the search button, then the gadget hunts through the cipher text for the word. Symbols are interpreted directly as normal letters, and the letters have to be next to each other in the cipher text. Sometimes you can find words that appear in straight lines (HER, BOO, GOD, ZODAIK, etc.) Other times the words are folded up in different directions (BOMB, DOOR, PENTOBARBITONE).

Here are some details about how to work the controls:


Enter your search here. Then fiddle with the other settings, or just click on the Search button to accept the defaults.

The search interprets symbols two different ways: strictly, and loosely. If you choose “strict”, then only these normal-looking letters and flipped letters are interpreted:

If you choose “loose”, then these extra rotations and manipulations are also considered:

Selecting “loose” will usually find more matches, since the search will force many symbols to look like other letters.

The search will match words that wander in different directions. You can limit the number of changes in direction by entering a new number here. For example, if you only want to match words that appear in perfectly straight lines, then enter “0” here.

By default, the search requires that all letters are right next to each other in any direction. But you can modify the search here to allow for some “jumps”, or skipped symbols. Here is an example of the word “KILL” found by allowing at least two jumps:

Be careful with this setting, because sometimes it makes the search take too long, which may make your browser unresponsive for a little while.

Shuffling the cipher text is a way to test how easily words can appear by pure chance alone. When you click the “Shuffle” button, the cipher text is shuffled like a deck of cards until your word is found. It will stop after 1,000 shuffles if your word isn’t found.

After clicking the “Search” button, any results are displayed here. The “permalink” link gives you a way to bookmark the current search. Here are some sample permalinks: POINTYBEARD, DECIPHER, PARADICE, MALEIDENTIFIED, WELLROTTED, and MARTYR.

This part of the results shows you how the symbols are interpreted. Any strongly ambiguous symbols are manipulated to clarify how they are interpreted.

In this list of results, the highlighted result is shown in the cipher text grid to the left. You can see different results by moving your mouse cursor over different rows of the table.
Each result is shown with a score and some other values. “Changes” shows how many direction changes were required to find the word. “Loose” is a count of how many loose interpretations were needed to find the word. “Jumps” indicates how many times a symbol had to be skipped to complete the word. These are combined into the “Score” which is a way to estimate the quality of the result. Lower scores are better, and they are shown at the top of the list.

At the bottom of the cipher grid, you can click these buttons to switch between the 340 and 408 ciphers.


What words can you find? Leave them in the comments!

“EPotts” on Tom’s forum posted this intriguing observation about the names of the Zodiac Killer’s victims:

May be nothing but I noticed names like FErrin-EDwards-STein-BAtes-yet only 2 out of the top 100 most common surnames have letters in alphabetical sequence like this. Maybe just a coincidence? I think it could be but thought I would mention it. sorry if this has been posted b4

Confirmed victims Darlene Ferrin and Paul Stein have last names that begin with two letters that are adjacent in the alphabet. Unconfirmed victims Linda Edwards and Cheri Jo Bates do also.

Let’s look at all the victim names. There are reportedly seven confirmed victims: David Arthur Faraday, Betty Lou Jensen, Michael Renault Mageau, Darlene Elizabeth Ferrin, Bryan Calvin Hartnell, and Paul Lee Stine. There are at least four unconfirmed victims: Robert Domingos, Linda Edwards, Cheri Jo Bates, and Donna Lass (five if you include Kathleen Johns). People often debate over these lists but let’s go with the ten names for now.

As it turns out, if you pick ten last names completely at random, there’s about a 1.3% chance that at least four of them will have this interesting pattern. (If you want to torture yourself by seeing the math behind this, click here.)

Seems very rare, doesn’t it? Did the killer select his victims because of this pattern in their last names?

Well, maybe. But we have to be careful here. We know that the chances of finding this particular interesting pattern are low. But we didn’t consider other interesting patterns.

For instance, what if 4 out of the 10 names start with the same letter? Or if they end with letters that are adjacent in the alphabet? Or if each name starts and ends with the same letter? If we were looking at another list of names, and it had one of these other kinds of patterns, would we think it was significant?

For that reason, we need to change the question from:

  • What are the odds this interesting pattern happens completely by chance?

to:

  • What are the odds any interesting pattern happens completely by chance?

To demonstrate this, click here to run an experiment that shows how patterns can be found in random names. When you click the button there, it will select 10 names at random from the 30,000 most common names in this US census data. The random selections are done 1,000 times. The experiment then displays any sets of names that matched at least four patterns.


Screenshot of experiment page

By default, it only looks for the kind of pattern EPotts discovered. Now, let’s see what happens when we include more kinds of patterns. Click more of the checkboxes to select more patterns, then click the button again. You’ll see that more matches are displayed. In fact, if you select every checkbox, then around 40% of the random trials will result in interesting patterns.

With some creativity, we could think of even more kinds of patterns that would stand out in a list of names. For example, what if the first letters of each name anagrammed to another name or interesting word? If such a pattern is discovered in a list of names, surely it would seem too unlikely to occur by random chance. But when all such patterns are considered, the odds are quite high that we’ll find something interesting.

If the Zodiac killer really did select victims whose names formed a pattern, we’d need some stronger evidence to show that it was intentional. What if all of the victim names followed EPotts’ pattern? If they did, the odds of it happening by chance drop from 1.3% to 0.0000000001%, which is very compelling. But they don’t, so we’d need more proof, such as some other direct reference to the scheme, perhaps among Zodiac’s correspondences.


WARNING: Here comes the boring math part.

As mentioned earlier, if you pick ten last names completely at random, there’s about a 1.3% chance that at least four of them will have this interesting pattern. How can we figure this out directly, instead of running an experiment to estimate the odds?

If you look at the census data, about 1 of every 10 names have EPott’s pattern. So we can say that there’s a 10% chance than any name will have it.

When we pick random names, we are filling 10 empty spots:

After filling the slots, let’s say we found 4 names that have the pattern, and 6 that don’t. We show names with the pattern as green, and the rest as red:

What is the probability of this exact arrangement of slots happening? Take a moment and think of a six-sided die. The probability of rolling a 2 is 1/6. The probability of rolling a 4 is 1/6. What is the probability of first rolling a 2, and then rolling a 4? Answer: 1/6 * 1/6 = 1/36.

Similarly, the probability of a green slot is 1 in 10. The probability of a red slot is 9 in 10. To figure out the probability for all slots, we multply the probabilities: (1/10) * (1/10) * (1/10) * (1/10) * (9/10) * (9/10) * (9/10) * (9/10) * (9/10) * (9/10) = 0.0000531441.

That tiny number is the probability of that one event happening. But there are more ways that four green slots can appear. They can be in different positions:

And there could be more than four green slots:

Again, think about the six-sided die. What is the probability of rolling a 3? Answer: 1/6. What is the probability of rolling a 3 OR rolling a 5? Answer: 1/6 + 1/6 = 1/3.

Similarly, we must add the probabilities for each of the possible events where at least 4 green squares appear. How many arrangements of the slots have at least 4 green squares? We have to use combinatorics to get the answer.

The number of ways to choose 4 slots from 10 is “10 choose 4”, or C(10,4), which is 210.

So, we have to add that tiny probability 210 times. Thus the probability that selecting 10 random names will result in exactly 4 green slots and 6 red slots is: 210 * 0.0000531441 = 0.011160261 (or 1.12%).

But we aren’t quite done yet. We also have to include situations where there are exactly 5 green slots, or 6 green slots, or 7 green slots, etc.

The probability that one particular event has:

  • Exactly 4 green slots and 6 red slots: (1/10)4 * (9/10)6 = 0.0000531441
  • Exactly 5 green slots and 5 red slots: (1/10)5 * (9/10)5 = 5.9049 × 10-6
  • Exactly 6 green slots and 4 red slots: (1/10)6 * (9/10)4 = 6.561 × 10-7
  • Exactly 7 green slots and 3 red slots: (1/10)7 * (9/10)3 = 7.29 × 10-8
  • Exactly 8 green slots and 2 red slots: (1/10)8 * (9/10)2 = 8.1 × 10-9
  • Exactly 9 green slots and 1 red slots: (1/10)9 * (9/10)1 = 9 × 10-10
  • Exactly 10 green slots and 0 red slots: (1/10)10 * (9/10)0 = 1 × 10-10

Now we have to count all the ways for those events to happen:

  • Ways for exactly 4 green slots and 6 red slots to appear: C(10, 4) = 210
  • Ways for exactly 5 green slots and 5 red slots to appear: C(10, 5) = 252
  • Ways for exactly 6 green slots and 4 red slots to appear: C(10, 6) = 210
  • Ways for exactly 7 green slots and 3 red slots to appear: C(10, 7) = 120
  • Ways for exactly 8 green slots and 2 red slots to appear: C(10, 8) = 45
  • Ways for exactly 9 green slots and 1 red slots to appear: C(10, 9) = 10
  • Ways for exactly 10 green slots and 0 red slots to appear: C(10, 10) = 1

Then, to get the total probability that at least 4 green slots will appear among 10 random selections, we multiply the counts by the corresponding probabilities, and then add them all up:

Total probability =
210 * 0.0000531441 +
252 * 5.9049 × 10-6 +
210 * 6.561 × 10-7 +
120 * 7.29 × 10-8 +
45 * 8.1 × 10-9 +
10 * 9 × 10-10 +
1 * 1 × 10-10 = 0.0128 (or 1.28%).

My apologies if these explanations are as cryptic as the ciphers!

Over on the ZodiacKillerSite forums, pi is doing some interesting work investigating the use of route patterns in quadrants of the 340-character cipher.

His idea is to split the cipher text into four quadrants, and then rearrange the text in each quadrant based on all of the possible routes. He then measures the resulting cipher text to see if more repeating patterns emerge from the text, with the hope that the underlying rearrangement has more features of a real message.

What he found was that running this search on the 340 increases the repeating patterns in at least 15% of his tests. By contrast, doing the same search on the 408 only increases the patterns 0.0000076% of the time. I found a similar phenomenon when exploring quadrants with a slightly different approach (see here and here).

But this led to more questions: What would happen to other 340-character test ciphers? How hard is it to make a 340-character cipher that has very few repeated patterns, yet still contains a valid message?

To answer this, pi created a test cipher that contains very few repeated patterns.

This means the repeated pattern count alone is not enough to separate good and bad rearrangements. What can we use instead? Dan’s approach avoids measuring the candidate cipher text altogether by skipping directly to trying to solve the rearranged text via the zkdecrypto hillclimber. It is a slow approach, but methodical. But is there some measurement that we can apply as a short cut? I think this is still an open question, and that we still need to generate more test ciphers to answer it.

Does the 340 cipher contain simple manipulations that have confounded all attempts to crack it? Dan Umanovskis, the skilled programmer who hacked together zkdecrypto-lite, has created a new kind of attack to explore this possibility. Using a genetic programming approach, his new algorithm evolves manipulations of the cipher text, such as swapping rows, removing columns, and flipping the text. The algorithm then feeds the modified cipher text to zkdecrypto-lite which hunts for readable plain text. Candidate manipulations that result in higher zkdecrypto scores are kept (“survival of the fittest”) and “bred” (mixed together) to look for even better scores.

Hopefully his approach will bear fruit. Even if it doesn’t find a solution to the 340, it can be used to exclude certain possibilities. For instance, you can create many test ciphers that resemble the 340 and contain the types of manipulations that the algorithm searches for. If the algorithm can successfully crack all the test ciphers, and no solutions are appearing for the 340, then it’s likely the 340 uses some other method of encryption.

You can download and play with Dan’s algorithm here: https://github.com/umanovskis/lgp-decrypto, and read Dan’s description here: http://www.zodiackillerfacts.com/forum/viewtopic.php?f=50&t=1537.

Clicking the above logo will bring you to a Swedish thrash metal band, rather than mysterious cryptograms. Nevertheless, you may find yourself puzzling over their indecipherable lyrics.

Here is another quick (and pointless) diversion. As explored in the previous post, you can find words in the cryptograms without performing any substitutions. For example, in the 408-character cryptogram, the word “OUT” appears:

And the word “RIDE” can be discerned:

If we apply the solution key to those “words”, we get:


This led me to a question: Using the cipher symbols, can we create any “words” that turn into valid plain text words when deciphered using the 408’s solution key?

The tools I built for the previous post were still lying around, so they were useful in satisfying this nagging curiosity. Turns out that many words can be made this way. In the sampling below, each plaintext word is followed by its encipherment using the 408’s key. Then the interpretation of the encipherment is shown:



(If you really want to challenge your brain, try to find a word whose encipherment’s word can itself be enciphered into yet another valid word.)

One reason I was curious about these “word pairs” was because of the possibility that Zodiac “seeded” his solution key with a keyword. But even if he did, it seems impossible to discover the keyword, since too many words can be found by chance. Here’s another example of how easy it is to draw out ridiculous outcomes solely from chance:

Among the symbols of the 408-character cryptogram, Zodiac includes the normal looking letters A through Z, but excludes C. So, let’s write out those cipher letters, with the corresponding plain text substitutions underneath:


Now we can re-arrange the assignments to look for keywords:


Surely “the buffoon’s sweet genitalia” is not the message Zodiac was trying to convey…

Many people have noticed that words can be found among the symbols in the Zodiac killer’s cryptograms without performing any substitutions. Here are some examples from the 340:

HER

TOM

Looks a lot like “ZODIAC”

FBI

TOO

GOD

BOO is seen three times

BOY is found twice

Five occurrences of BY, resembling the Halloween card

And here are some words easily found in the 408:
(more…)

Between 1956 and 1981, math and science writer Martin Gardner wrote the Mathematical Games column, a series of popular recreational math diversions in Scientific American. Gardner’s gift to the world was to rescue us from the school-borne tedium and fear of math, showing us that mathematics could be elevated into something that could be enjoyed and appreciated. He was one of the heroes of my youth, helping to spark a sense of curiosity in mathematics that still drives me.

Some of Gardner’s wonderful puzzles, games, and curiosities.

He even wrote a book about codes and ciphers.

I recently found an index to a collection of Martin Gardner’s correspondences and notes at Stanford University, and was surprised to discover that some of them pertain to the Zodiac ciphers. Naturally, I had to order photocopies and have a look.

Among the collection are letters that writer and amateur sleuth Gareth Penn wrote to Gardner in 1981 and 1984. Those who follow the Zodiac case have probably heard of Gareth Penn, a man who, for over twenty years, has obsessively written reams of material with bizarre theories linking UC Berkley professor Michael O’Hare to the Zodiac killings. Penn’s fixations have also drawn suspicions that he himself is the Zodiac killer. For a better background on this fascinating example of misplaced obsession, read “With Malice Aforethought” by Michael Martin, and Michael O’Hare’s own responses to the whole affair.

In his first letter to Gardner, Penn writes about how much he enjoys Gardner’s column on mathematical puzzles, and encloses copies of his various articles and writings about the Zodiac Killer:
(more…)


(photo credit: Elonka Dunin)

Thanks to Doc for informing me about this story. Above is CIA analyst David Stein, the first person to crack encrypted messages in the Kryptos sculpture at the CIA, leaving only one message unsolved. Twenty-three years after the sculpture was created, the last message still remains unsolved.

Wired.com posted this story recently, triggered by the NSA widely distributing Stein’s declassified and fascinating account of how he tackled Kryptos for over seven years using only pen and paper. Although Stein was first, his work was only known to CIA colleagues. Cryptographer and computer scientist Jim Gillogly was the first to publicly announce the correct solutions, but he was unaware of Stein’s work. Gillogly is also known for identifying a strange sequence in one of the Beale cryptograms, which he explains is evidence that the Beale papers are a hoax.

Read more about how Stein cracked Kryptos here. And be sure to read Elonka Dunin’s nicely organized breakdown of the entire Stein article. Those of us working on the 340 might find some insight and new ideas in such details.