Friday, February 4, 2011

Digging For D'Agapeyeff DNA

What is the biggest obstacle to solving a 14 column fractionated transposition with an underlying substitution?

The solution space.  It's not feasible to check and score the entire space.  Or is it?

Let's start with a smaller example before leaping into the ADFGX cipher.

There is a dance tonight, and you have to form dance partners from a group of 4 people.  Two men:  Bob and Tom.  Two women:  Jenny and Sara.

There are N! ways to make dance partners, or in this case 4! = 24.  However there are only (N! / ( (N/2)! + (N/2)!) ) = 6 ways to make unique pairs where the order matters.  Bob & Jenny, Tom & Sara.  If you flip the men and women:  Jenny & Bob, Sara & Tom.  If you change the dance partners:  Tom & Jenny, Bob & Sara.  And if you flip the men and women:  Jenny & Tom, Sara & Bob.  The other two unique pairs are pairing the men together and the women together, and flipping their order.

How does this apply to the ADFGX cipher and my theory surrounding the D'Agapeyeff cipher?  For that we'll need one more statistic.  Phi value.  With the ADFGX cipher, each permutation of the column order produces unique pairs of columns that come together to make one "letter".  Even when you don't know the underlying substitution, you can still calculate the Phi value for the ciphertext.

Consider a 14x14 completed filled transposition grid.  There are 196 characters total which will eventually pair together to form 98 plaintext alphabetic characters.  As I've stated previously here, the expected Phi value for 98 letters from English text is 634 (98 * 97 * 0.0667).

Although there are 14! ways to arrange the columns, there are only 14! / ( (14/2)! + (14/2)! ) = 8,648,640 unique ways to calculate a Phi value.  Going back to a smaller example:

1,2 and 3,4 = unique phi value, and will result in the correct plaintext in the substitution phase.
2,1 and 4,3 = same phi value and will in fact produce the same plaintext in the substitution phase.  All you've done is flip all the row and column coordinates from the polybius square.

3,4 and 1,2 = same phi value, but will not result in the correct plaintext during the substitution phase because the letters are not in the correct order.
4,3 and 2,1 = same phi value, but will also not result in the correct plaintext during the substitution phase because the letters are not in the correct order.

So you might be asking, what does this accomplish?  One, for an even numbered transposition key, completely filled rectangle the ADFGX cipher has two correct keys.  Secondly, if you have any permutation of the correct key, you need only try (N/2)! permutations in the substitution solving phase to yield the correct plaintext.  For the 4 column example, I need only try (4/2)! = 2 permutations of the best potential keys to find the solution and correct plaintext.

Back to identifying the best potential keys.  Here is where the Phi value is key.  It's basically like the DNA fingerprint for the correct plaintext.  If 634 is the Phi value you'd expect to see for a 98 character plaintext you'd need only check potential solutions within a certain proximity to that value.

Let's go back to the 14 column example and it's 8,648,640 unique Phi values.  If I generated 8,648,640 random permutations of the 14 columns and had perfect distribution, within that pile of Phi values, one would contain some permutation of the correct transposition key.  It would have all the correct column pairings but be in the wrong order.  I'd need to try all (N/2)! or 7! or 5,040 permutations of that key in a substitution solver to yield the correct plaintext.

Think of it this way, you are looking for a toothpick in a large pile of toothpicks, however there are 5,040 variations of each of the 8,648,640 toothpicks.  Randomly drawing, let's say 20,000,000 toothpicks of the pile should give you at least 1 or 2 of the 5,040 variations of the toothpick you are looking for.  And on top of it, only 0.002% of the toothpicks look like yours.

What I found on trial runs with a known plaintext is that only about 40 high phi values show up in those 8,648,640 phi values.  Thus far, in limited testing with different plaintexts and polybius squares the correct phi value for my plaintext has always shown up in the top 40 unqiue phi values.  Meaning I'd need only check 40 * 5040 = 201,600 possible transposition keys for mono-alphabetic substitution.

In my view this is far superior to starting with a hill-climber and trying to maximize a scoring function on 99.99998% of potential keys that never have a chance to yield a correct plaintext.

3 comments:

  1. It seems to me, without doing any testing, that your approach is very limited for these reasons:

    1. in order to solve you need to know the phi value of the plaintext and that is an unknown;

    2. you can only tackle even length transposition keys.

    Could you please comment on these points?

    ReplyDelete
  2. On point 2, I completely agree with and acknowledge. Odd length keys definitely provide more of a challenge. Same goes for incomplete rectangles. The challenge of determining the long columns vs. short columns for the ADFGX(V) cipher would require multiple messages in the same key or other cryptanalytic work.

    On point 1, my hypothesis is that the correct key variant will produce a phi value in the top x% of all possible keys. The x% is a very small space in comparison to the entire space. If the Phi value is not near the expected value for English like with "the quick brown fox jumps over the lazy dog" then the search space is not going to be narrowed much/at all.

    My overall thought is that there seem to be ways to try and identify key length and my hope was to build upon that to reduce the work needed to be done to identify the correct key.

    ReplyDelete
  3. I think that your hypothesis is very worthwhile and deserves to be thoroughly tested and evaluated. Hopefully it will prove to be correct and I for one will spend some time looking at it with different ciphertexts and different 'depths' -- that is, the ratio of (cipher length)/(transposition key length).

    Your approach is similar to the successful way that Jim Gillogly developed for solving long Enigma ciphertexts. He used the ic to evaluate which disks were employed and their settings. And then used the more accurate log tetragraph scoring to work out the fine detail of the plug settings.

    ReplyDelete