Try, try again

From an upcoming natural language process conference in Seattle comes this paper from UC Berkeley: Decipherment with a Million Random Restarts.

In the paper, Berk-Kirkpatrick and Klein investigate the effectiveness of the “expectation-maximization algorithm” (EM) when it is used to search for solutions to homophonic ciphers. EM explores cipher solutions by using a statistical model of the English language. The paper describes EM getting stuck in local optima, a common problem with these kinds of automatic decipherment algorithms.

To understand the local optima problem, think of a rugged landscape. Your task is to find the largest mountain in the landscape, but you can’t see more than a few yards ahead of you. You walk around, thinking that your best bet is to find the steepest slope and follow it, You get to the top of a hill, thinking you’ve found the tallest mountain, but you’re only standing atop a medium-sized one. How can you find the tallest one?

For the EM algorithm, the answer is to randomly restart the search many times. In the rugged landscape, this translates to randomly and blindly teleporting you somewhere else in the landscape. Eventually, one of the slopes you follow will be the right one.

Zkdecrypto uses a similar approach. At each step, zkdecrypto is making small changes to a solution key, and keeping the changes that make the plain text look a little bit more like English. But it, too, might get stuck on a “small hill”: no matter what changes it makes to the key, the plain text doesn’t improve, and the real one remains beyond its reach until enough restarts are done.

Long story short: If at first you don’t succeed, try, try again.

The paper tries to answer this question: How many restarts are needed to get the right solution? The researchers’ answer is that it depends on the cipher. They looked at the Zodiac’s 408, which turns out to be an easy cipher to solve: Accurate solutions are discovered without more than a hundred random restarts. Then the researchers created their own 340-character cipher, and accurate solutions were found using hundreds of thousands of restarts. They distributed the numerous restarts to the 512 computational cores of a powerful computer graphics card, an approach that is thousands of times faster than running the restarts one at a time in sequence. The increased difficulty in solving the test cipher is attributed to its smaller length, and its larger number of cipher symbols. But even with only a few hundred restarts, more than half of the correct solution to their test cipher is discovered by the program, which is usually enough for a human solver to take over and complete the solution.

The researchers unfortunately repeat the claim that in 2011, Ravi and Knight were the first to crack the 408 using completely automatic methods. Zkdecrypto has been automatically cracking the 408 since 2006.

Finally, Berk-Kirkpatrick and Klein investigated Zodiac’s 340, and mention the possibility that it may not be a homophonic cipher, or even a valid cipher whatsoever. No one even knows its proper reading order. This paper is the first I’ve seen that presents specific evidence to support the argument that the 340 is not a homophonic cipher in a normal reading order. First, the researchers generated 100 more test ciphers that are similar to the 340, and ran their EM algorithm on them with 10,000 random restarts. The result was an average accuracy of 75%, where only two ciphers had less than 51% accuracy. If the real 340 is “normal”, at least part of its message would surely have been revealed by now. When they ran their EM algorithm on the real 340, the best results were still nonsensical. They also found that the search landscape looks much different for the real 340 than it does for their test ciphers, suggesting a strong difference in its construction.

I’m curious about how closely their test ciphers resemble the real 340. Do they contain similar cyclic sequences of homophones? Do they have the weird pivots? What about spelling errors and encipherment errors, which were abundant in the 408? At this point, it seems best to try to guess what Zodiac may have done to make the cipher unsolvable. Then we can generate more test ciphers and automate their solutions. If the real 340 is constructed the same as the test ciphers, then its solution, too, will emerge.

Then again, perhaps we should heed W.C. Fields’ advice:

If at first you don’t succeed, try, try again. Then quit. There’s no point in being a damn fool about it.