I had the pleasure of attending this year’s Cryptologic History Symposium, held on Oct 17 and 18 in Laurel, Maryland. The event is organized every two years by the non-profit foundation that supports the NSA’s National Cryptologic Museum.
The proceedings were filled with dozens of fascinating talks by noteworthy professionals in the world of cryptology, and I met several interesting and colorful personalities. Here are some highlights.
NSA Deputy Director John C. Inglis kicked off the symposium with his keynote talk, in which he spent a lot of time defending the NSA in light of the numerous unauthorized disclosures by Edward Snowden of the NSA’s secret and controversal surveillance programs.
York University professor Craig Bauer told a fascinating tale about famed computer scientist Alan Turing’s work on SIGSALY, a telephone voice scrambling machine developed during World War II. Created before the digital age, the machine was humongous and required carefully protected phonograph disks to store the cryptographic keys consisting of random noise.
I spoke with Dr. Bauer and learned that he had included the Zodiac Killer in an interesting talk he gave about famous unsolved codes and ciphers. He’s also working on an upcoming book about this topic. Videos of his talk are available on Youtube:
Click this link to go directly to the Zodiac portion of his talk which starts 12 minutes and 32 seconds into the video. He gives a brief summary of the case, and touches on Gareth Penn’s radian theory and the New York Zodiac copycat killer Heriberto Seda.
He even gave Heriberto Seda’s cryptogram as an exercise for students of York’s cryptology course.
At the conference I also met German cryptology author Klaus Schmeh, who writes an interesting German-language blog about various cryptologic mysteries, including the Zodiac ciphers. His book Nicht Zu Knacken (which I wrote about here) has a chapter devoted to the Zodiac ciphers. At the conference, Klaus gave a talk about compiling a list of 32 encrypted books. He showed some examples, ranging from obscure encrypted texts, to more well-known books such as the Voynich Manuscript and Codex Rohonci. He even brought his own hand-made reproduction of the Codex Rohonci as a prop for people to look through.
Video of Klaus’ talk (Part 1):
Video of Klaus’ talk (Part 2):
(The volume is faint, so you’ll need to crank up the volume on your speakers)
Video game developer, cryptology enthusiast and Kryptos expert Elonka Dunin gave a talk about the use of cryptology in the recreational activity of Geocaching. I was surprised to learn that many caches require solving cryptograms and other puzzles before you can learn the location of the caches. She told me that some of the puzzles remain unsolved. You can view them here. Some puzzles are even based on the German Enigma machine.
Dr. Todd Mateer spoke about his cryptanalysis of one of the Beale ciphers. In the original document, James B. Ward’s The Beale Papers, each number in the solved cipher corresponds to a word from the Declaration of Independence. Mateer tried to reconstruct the solution based on the procedure outlined in Ward’s paper. He found that he ran in to trouble due to the numerous variations of the Declaration of Independence that were published in Ward’s day. He also found that there are mistakes in the encipherment, such as duplicate numberings in the version of the DOI in Ward’s paper. How is it that the so-called Beale cipher uses the exact same DOI variant, and the exact same encipherment mistakes, as that of Ward’s decipherment? What are the chances that Beale and Ward made precisely the same mistakes? Mateer’s conclusion is that it’s because the Beale treasure is likely to be a hoax, invented by whomever authored The Beale Papers. Nick Pelling offers the alternate view that the oddities found in the unsolved Beale ciphers reflect a change in the encipherment procedure that causes bits of an encrypted key to show through. Some experiments in this direction might help clarify his idea.
The talk with the most attention-grabbing title was Shame, Sex and Alcohol: Ciphers in the Context of Everyday Practices of Secrecy in Early Modern Times, about encrypted messages left by noteworthy Hungarian historical figures in diaries and correspondences. Here is one of the more colorful excerpts:
The conference was very educational, provoking a thirst for more knowledge and discovery. I don’t want to miss the next one in 2015!
In the paper, Berk-Kirkpatrick and Klein investigate the effectiveness of the “expectation-maximization algorithm” (EM) when it is used to search for solutions to homophonic ciphers. EM explores cipher solutions by using a statistical model of the English language. The paper describes EM getting stuck in local optima, a common problem with these kinds of automatic decipherment algorithms.
To understand the local optima problem, think of a rugged landscape. Your task is to find the largest mountain in the landscape, but you can’t see more than a few yards ahead of you. You walk around, thinking that your best bet is to find the steepest slope and follow it, You get to the top of a hill, thinking you’ve found the tallest mountain, but you’re only standing atop a medium-sized one. How can you find the tallest one?
For the EM algorithm, the answer is to randomly restart the search many times. In the rugged landscape, this translates to randomly and blindly teleporting you somewhere else in the landscape. Eventually, one of the slopes you follow will be the right one.
Zkdecrypto uses a similar approach. At each step, zkdecrypto is making small changes to a solution key, and keeping the changes that make the plain text look a little bit more like English. But it, too, might get stuck on a “small hill”: no matter what changes it makes to the key, the plain text doesn’t improve, and the real one remains beyond its reach until enough restarts are done.
Long story short: If at first you don’t succeed, try, try again.
The paper tries to answer this question: How many restarts are needed to get the right solution? The researchers’ answer is that it depends on the cipher. They looked at the Zodiac’s 408, which turns out to be an easy cipher to solve: Accurate solutions are discovered without more than a hundred random restarts. Then the researchers created their own 340-character cipher, and accurate solutions were found using hundreds of thousands of restarts. They distributed the numerous restarts to the 512 computational cores of a powerful computer graphics card, an approach that is thousands of times faster than running the restarts one at a time in sequence. The increased difficulty in solving the test cipher is attributed to its smaller length, and its larger number of cipher symbols. But even with only a few hundred restarts, more than half of the correct solution to their test cipher is discovered by the program, which is usually enough for a human solver to take over and complete the solution.
The researchers unfortunately repeat the claim that in 2011, Ravi and Knight were the first to crack the 408 using completely automatic methods. Zkdecrypto has been automatically cracking the 408 since 2006.
Finally, Berk-Kirkpatrick and Klein investigated Zodiac’s 340, and mention the possibility that it may not be a homophonic cipher, or even a valid cipher whatsoever. No one even knows its proper reading order. This paper is the first I’ve seen that presents specific evidence to support the argument that the 340 is not a homophonic cipher in a normal reading order. First, the researchers generated 100 more test ciphers that are similar to the 340, and ran their EM algorithm on them with 10,000 random restarts. The result was an average accuracy of 75%, where only two ciphers had less than 51% accuracy. If the real 340 is “normal”, at least part of its message would surely have been revealed by now. When they ran their EM algorithm on the real 340, the best results were still nonsensical. They also found that the search landscape looks much different for the real 340 than it does for their test ciphers, suggesting a strong difference in its construction.
I’m curious about how closely their test ciphers resemble the real 340. Do they contain similar cyclic sequences of homophones? Do they have the weird pivots? What about spelling errors and encipherment errors, which were abundant in the 408? At this point, it seems best to try to guess what Zodiac may have done to make the cipher unsolvable. Then we can generate more test ciphers and automate their solutions. If the real 340 is constructed the same as the test ciphers, then its solution, too, will emerge.
Then again, perhaps we should heed W.C. Fields’ advice:
If at first you don’t succeed, try, try again. Then quit. There’s no point in being a damn fool about it.
Below is contributor Chris Klein’s interesting analysis of the BTK word puzzle. He breaks down the overall puzzle into three smaller puzzles that are suggested by the appearance of clean breaks between major words and the placements of the letter “X”.
- This puzzle lacks numbers.
- The locations of the red X‘s suggest puzzle boundaries.
- The locations of the red X‘s suggest puzzle boundaries.
- In row 1 column 7 there appears to be a missing number. The letters R, A and D are near this missing number.
- The locations of the red X‘s suggest puzzle boundaries.
- In row 11 column 3 there appears to be a missing number. The letters E and R are near this missing number.
- This puzzle lacks numbers.
- Very many of this puzzle’s letters are used in words.
I believe that the X‘s are some type of directions for the orientation of the three puzzles.
Notes about each puzzle and final summary.
Each of the puzzles has its own theme with regard to the hidden words within. The theme of each puzzle is as follows:
- Puzzle 1 ‐ Recon
- Puzzle 2 ‐ Location
- Puzzle 3 ‐ Method of operation
- Puzzle 4 ‐ Identity?
In each puzzle I did the following:
- Isolate all obvious words by color (yellow).
- Isolate all sequential and repetitive keystrokes that were not part of an obvious word by color (green).
- Isolate all “X”s that were non‐sequential by color (red). (I was drawn to isolate the “X”s separately because of the somewhat uniform location of the “X”s throughout the puzzles.
- Isolate all remaining, apparently random, keystrokes by color(purple).
- Isolate all numbers by color (blue). (I took some liberty here. In the “original” document that I worked from, the numbers were oddly spaced so that they were near adjacent alphabetic keystrokes. I noticed that for every oddly placed number there was a blank space to its left. I shifted the numbers to the blank spaces to the left. After doing this I noticed there were 2 blank spaces left on the chart. Interestingly the blank space in puzzle 2 is surrounded by the letters R, A, D. The blank space in puzzle 3 is preceded by E, R.
- In looking at the progression of the 3 puzzles it seems as if the creator became more efficient as he completed each puzzle. Puzzle 1 seems to have a lot of non‐sequential letters that don’t appear to be part of any hidden words whereas puzzle 3 has very few non‐sequential letters that aren’t part of any obvious words. One question I would ask is why the creator didn’t use sequential letters in puzzle 3 in certain places. Maybe there is no reason but there are only (5) non‐ sequential letters that aren’t part of any obvious word and it seems it would have been easy enough to replace those with some other letter.
- I did not go into the meanings of the different numbers on the chart.
Puzzle 1 ‐ Recon
This puzzle contains no numbers and seems to be pretty uncomplicated.
Puzzle 2 ‐ Location
This puzzle seems to be uncomplicated at first but unlike the previous puzzle it contains sets of numbers. It also contains 3 “X”s on or near the same locations as puzzle 1. This puzzle also contains a blank space adjacent to the letters R,A,D.
Puzzle 3 ‐ Method of operation
This puzzle appears to be the most detailed of the 3 puzzles. Unlike the previous puzzle there are no numbers but there is a blank space again. This blank space is preceded by the letters E, R. It also contains 3 “X”s on or near the same locations as puzzles 1 and 2.
Puzzle 4 ‐ Identity?
Not sure if one would say that all of these puzzles are just parts on 1 main puzzle but I like to consider them separately. That said, I believe the RADER name is announced by the blank space in puzzle 2 and the blank space in puzzle 3. If that were true then I would consider the grouping of puzzles 2&3 to be a separate puzzle and have labeled it as such.
Anyhow, thats what I have.
The May 21, 2011 issue of New Scientist ran a feature called “Uncrackable Codes”, which featured MacGregor Campbell’s summaries of eight famous unsolved mysteries: Somerton Man, Beale’s buried treasure, the MIT time-lock puzzle, Kryptos, the Voynich Manuscript, Enigma, Elgar’s unread message, and the Zodiac Killer.
The Zodiac feature doesn’t have any new details, but here are some highlights:
- “In November 1969, Zodiac sent a code to the local papers that law-enforcers still believe could hold the key to solving the case.” This seems like wishful thinking. Cracking the 408 didn’t catch the killer. But maybe this wish will come true.
- FBI crypto chief Dan Olson says the 340 “is number one on his unit’s internal ‘top 10′ list of unsolved codes”, and that that he gets about 20 to 30 submissions every year from the public. None have led to breakthroughs.
- FBI cryptanalysts believe the 340 contains a real message, since the distribution of characters in the rows is not equal to the distribution of characters in the columns. Olson describes this in more detail in the emails he sent to Tom Voigt back in 2009.
- In 2009, computer scientist Ryan Garlick led his students in an attempt to use genetic algorithms to crack the cryptograms. The attempt was successful for the 408, but not the 340.
- Another attempt at San Jose State University also failed to produce a solution for the 340.
- Garlick thinks to crack the 340, its symbols need to be rearranged somehow. But he says figuring out the rearrangement is very difficult: “You have to happen upon exactly the right thing before any of our software tools would even get close.”
These kinds of “Top Unsolved Codes” lists appear from time to time, and usually only contain the briefest summaries of the mysteries. The MIT time-lock puzzle is one I haven’t seen before. And it was nice to see in the Zodiac feature a summary of academic attempts to crack the codes. How many other places in academia have been working on the dusty old Zodiac ciphers? The ones that I’ve come across include:
- Kevin Knight et. al at the University of Southern California
- Nuhn, Schamper and Ney at RWTH Aachen University, Germany
- Michael James Banks at the University of York
- Raddum and Sys from universities in Norway and Slovakia
- King and Bahler via Hewlett-Packard and North Carolina State University
All the research seems to be centered on attacking the 340 as if it is a simple substitution cipher. Most papers report succeeding at breaking the 408 and failing with the 340. I’ve yet to see any academic research into the idea of rearranging the 340′s symbols, or exploring its other qualities that might offer clues into how it is truly constructed. This seems to point to falsifying the hypothesis that the 340′s plain text is written in valid English, arranged in a normal direction, and enciphered using straightforward homophonic substitution. My guess is that exploring all the strange variations that are possible would result in tools that are too specific to apply to other “pen and paper” style ciphers. A lot of work for potentially zero reward. Who’s up to the challenge?
Audrey Cooper, managing editor of the San Francisco Chronicle, recently tweeted this:
The thickness of that stack of papers suggests lengthy and convoluted attempts to justify the claimed solution. It’d be interesting to know what approach the solver took. In the photo, the letter starts by discussing the cycling of variants (also known as homophones) in the 408:
The above cipher variants were, for the most part, made in a cyclic order, deteriorating toward the end of the third part of the message
The letter shows the key, with all the variants (homophones) displayed beneath each plain text letter:
Then there’s a breakdown of the key, showing normal-looking alphabetic cipher symbols first, followed by the symbols that look like backwards letters, and then the symbols that look like shapes and punctuation:
The letter writer points out the interesting fact that “LMN” happens to decode to “THE”.
So far, the letter is off to a reasonable and logical start. I wonder where it leads.
The file of solution claims at the Chronicle must be massive. Reporter Kevin Fagan once told me they still get about two submissions per week. It’d be interesting to look at the file to see all the different approaches people have taken, and to find out how many of them fall into the usual traps, and if any of them supply new and useful ideas.
Did Zodiac really encipher a name in this letter? If we assume that the cipher is a simple substitution cipher, read from left to right, then it is difficult to find real names that fit. This is because only 8 of the 13 symbols are unique. The repeated symbols are:
- : Occurs 3 times.
- : Occurs 2 times.
- : Occurs 2 times.
- : Occurs 2 times.
Assuming that each of those symbols can only represent one plaintext letter, many names we try to plug in will fail, because they will violate the cipher’s constraints.
We can roughly estimate the probability of 13 characters of plain text fitting into the cipher text. First, we start with these event probabilities:
- Probability of two letters repeating in English (p2): About 0.0655.
- Probability of three letters repeating in English (p3): About 0.00535.
Since we must have three identical letters, AND three sets of identical pairs, we must multiply the probabailities of each event together:
p3 * (p2)3 = 1.5 x 10-6, or about 1 in 670,000
Another way to look at this is to consider all possible 13-letter plain texts. Each position can be one of 26 letters, so that means for 13 spots there are 2613 possible plain texts (about two quintillion). But there are only 8 different symbols in the cipher, which means there are really only 268 possible plain texts (about 200 billion). That means that only about 1 in 12,000,000 plain texts will fit into the cipher without violating its constraints.
I found a database of a hundred million person names and wrote a program that scans them to look for names that fit in the cipher. When examining a name, the program allows the first, middle, and last names to appear in different orders, and allows any of the name parts to be abbreviated with initials. It discards anything that isn’t exactly 13 letters long. This results in almost a half billion tests of name combinations. The algorithm found only 213 plain texts that fit the cipher text. That works out to about 1 hit for every 2,000,000 attempts. Here is a sampling of the more interesting names that fit:
Like Zodiac, serial killer Dennis Rader, also known as BTK, taunted police and newspapers with letters boasting of his crimes.
It was not difficult for people to find several words relevant to Rader’s crimes, such as:
The Association for Computational Linguistics held its 2013 conference in Bulgaria earlier this month. Computational Linguistics is the study of how to apply computer science to problems involving language and speech. It’s a very broad subject, which also includes studies of how to decipher codes and cryptograms.
An interesting paper to emerge from this conference is “Beam Search for Solving Substitution Ciphers”, authored by some folks from RWTH Aachen University in Germany. They claim the technique in their paper can very quickly crack substitution ciphers with very small decipherment error. They also apply the technique to the Zodiac Killer’s 408-character cryptogram.
The paper has this claim:
Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letter-based 6-gram language model.”
They must not be aware of other effective n-gram based hillclimbers such as zkdecrypto and CryptoCrack. This is probably because those tools are seldom cited in other academic papers, so researchers tend to overlook them.
(Ravi and Knight, 2011a) report the ﬁrst automatic decipherment of the Zodiac-408 cipher. They use a combination of a 3-gram language model and a word dictionary.
From the referenced paper:
The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.
An early version of zkdecrypto was able to automatically and fully decipher the 408 back in December 2006. My 2008 paper descibes an automatic decipherment of enough of the 408 to easily derive the complete solution. And there must be other automatic solvers out there that could have handled the job. The 408 is not a difficult cryptogram to solve using statistical methods, because there are enough clues in it to give codebreakers leverage.
This small oversight notwithstanding, the new approach described in the ACL 2013 paper is interesting. Beam search is a variation of breadth-first search. A breadth-first search starts with an initial key, and then explores every possible change to it. This is not practical, due to the astronomical number of possible keys. Beam search fixes that by “pruning out” the low-scoring possibilities, so it only has to explore the ones it thinks are worth the trouble. The scores are based on n-gram statistics, weeding out solutions that don’t resemble normal English.
This seems to be a bit more systematic than other hillclimbers such as zkdecrypto that use a randomized search that continually “perturbs” the key, exploring the higher-scoring modifications. It’s unclear how well beam search stacks up against zkdecrypto, since zkdecrypto is already shown to be very fast and effective. The paper does not mention the unsolved 340 at all. The 340 will very likely not succumb to yet another approach that assumes a conventional construction, but I’m still curious what happened when they ran the algorithm against it.
Also in the conference, Kevin Knight gave a tutorial entitled “Decipherment”, in which he discussed the basics of codebreaking, classic historical ciphers, and famous unsolved codes:
- An interesting angle Knight presents is that translating from a foreign language is a lot like dealing with a cipher, where you substitute words with other words.
- This slide made me think of zkdecrypto:
Random restarts are used in zkdecrypto to great effect, helping the hillclimbing search avoid getting stuck on hills that are too small.
- Knight reviews some recent historical decipherments: Jefferson cipher, a Civil War cryptogram, German Naval Enigma, and Knight’s own work cracking the Copiale cipher.
- One of the keys to cracking the Copiale cipher was to cluster the cipher alphabet based on similar contexts (i.e., certain symbols tend to be preceded by or followed by specific symbols). The clusters were determined with a very useful “cosine similarity” measurement, which I implemented in CryptoScope. If you switch to the 408 there, and click the “Letter Contacts” button, you’ll see a Cosine similarities section under the contacts grid. It shows how all the real homophones tend to cluster at the top of the list.
- I’d assumed that much of the Copiale breakthroughs were discovered via computer, but apparently a lot of pencil and paper work was needed:
- The NSA recently declassified a 1979 paper by Mary D’Imperio about applying modern statistical techniques to the Voynich Manuscript. See also: D’Imperio’s The Voynich Manuscript: An Elegant Enigma.
- The NSA also recently declassified 4,400 pages of its newsletter, “Cryptolog“. Despite containing numerous redactions, it has some interesting articles on many topics, including the Voynich Manuscript.
- Knight talked about his 2011 paper with Ravi in which their method produced a very accurate decipherment of the Zodiac’s 408 cipher.
- He also talked about the unsolved Zodiac 340, and how it has no obvious reading order bias based on analyzing bigrams in several directions:
There you can see the 408 has a preference for left/right order, since it results in a spike of repeating bigrams. So, the 340 “could be nonsense, or maybe bigrams are smoothed out via more careful substitutions.”
Sounds like it was a fascinating conference. Maybe I can attend the next one. Meanwhile, have a look at the NSA’s upcoming 2013 Cryptologic History Symposium. Among the speakers will be Klaus Schmeh and David Kahn. And I’ve heard that Elonka Dunin will be there. Looks like a unique learning opportunity!