The Dorabella Cipher (Part 5)

There's an argument that the Dorabella Cipher may not be a real cipher at all. The main reason given is that it contains long stretches in which adjacent symbols never have the same number of semicircles. Using theory and practice, I will show that this pattern is not as suspicious as it may seem, although it may still be a hint that something is fishy.

The Dorabella Cipher has remained unsolved for about 130 years. While looking for recent discussion and solution attempts, I went well beyond the first page of Google results and found several excellent deep-dive blogs. One post on aplaceofbrightness.blogspot.com discusses the “alternating semicircle-count” pattern and motivated this analysis.

The findings from that blog are presented below. They address a strange distribution of the chosen cipher symbols. Look at the first line of the Dorabella Cipher. It starts with the following symbols:

      Line 1 - First 13 symbols
      #2cE##3cW##2cSE##3cE##1cE##2cS##1cN##3cE##1cSW##2cNE##3cSE##2cNW##1cNW#

Do you notice anything? Two consecutive symbols never have the same number of semicircles! The counts are [2,3,2,3,1,2,1,3,1,2,3,2,1]. That is thirteen symbols in a row!

      Line 2 - First 10 symbols
      #1cN##2cNW##1cN##2cS##1cN##3cE##1cSW##2cSW##3cE##2cSE#

Here there are 10 symbols with no two consecutive ones sharing the same number of semicircles.

      Line 3 - 4th to 12th symbols
      #2cN##3cNW##2cSE##1cSE##2cN##3cN##1cS##3cNW##2cSE#

Here we get nine.

The author of the blog argues that this is a strong clue that the cipher text does not resemble normal English.

But is this really exceptional, or is such a pattern actually more likely than it seems?

First, since we use this terminology over and over again, let us define exactly what we mean by "such a pattern".

Definition Property-C

Given a string $S$ of length 87 over $\{1,2,3\}$, we say $S$ has Property-C if it contains a substring of length at least 13 in which adjacent symbols are never equal.
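As a small illustration of the definition, the semicircle counts can be read straight off the symbol notation used above, and the longest run without equal neighbours can be measured. This is only a sketch: the function names and the regular expression are my own choices, and it assumes every symbol is written as `#<count>c<direction>#`.

```python
import re

def semicircle_counts(symbols):
    """Extract the semicircle count of each symbol, e.g. '#2cE##3cW#' -> [2, 3]."""
    return [int(n) for n in re.findall(r"#([123])c[NESW]{1,2}#", symbols)]

def longest_unequal_run(counts):
    """Length of the longest stretch in which adjacent entries are never equal."""
    if not counts:
        return 0
    best = run = 1
    for prev, cur in zip(counts, counts[1:]):
        run = run + 1 if prev != cur else 1  # an equal pair restarts the run
        best = max(best, run)
    return best

def has_property_c(counts, min_run=13):
    """Property-C: some stretch of at least min_run entries has no equal neighbours."""
    return longest_unequal_run(counts) >= min_run

# First 13 symbols of line 1, as quoted above
line1 = "#2cE##3cW##2cSE##3cE##1cE##2cS##1cN##3cE##1cSW##2cNE##3cSE##2cNW##1cNW#"
counts = semicircle_counts(line1)
print(counts, longest_unequal_run(counts), has_property_c(counts))
```

The same two functions can be applied to a full 87-symbol transcription once it is written in this notation.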

The argument is that Dorabella's first line has an unusually long run with no equal consecutive semicircle counts. To find out whether Property-C really is very unlikely, I made three tests:

[AI Theoretical Answer/Random Model] I asked the AI about the probability that a random string $S$ of length 87 has Property-C.
[Monte-Carlo/Random Model] I generated random strings of length 87 with symbols from three categories and tested how many have Property-C.
[Monte-Carlo/Real-text driven] I encrypted actual language with random cipher-symbol/alphabet assignments and tested how many 87-letter ciphertexts have Property-C.

Theory/Random Model

I simply gave the AI (OpenAI v5.2) this prompt:

> Suppose you have string of length 87, that consist of integers from {1,2,3} randomly selected. What is the probability that there is a substring of length 13 that has no two equal consecutive integers?

The answer was a line of reasoning using Bernoulli arguments and dynamic programming, with the result \[ \textsf{Pr}(\text{String S has Property-C}) \approx 0.18450333017 \] I read the answer. It was very lengthy (as usual :)) but looks OK.

In other words, for a string of length 87 consisting of random symbols from a set of size 24, where each symbol falls into one of three categories of size 8, the probability that there is a substring of length 13 such that each symbol's category differs from that of its neighbours is ~18.45%.
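The theoretical value can also be cross-checked without an AI. The following is a minimal sketch of one possible dynamic program, not the AI's actual derivation: it assumes every symbol is drawn uniformly and independently, so each adjacent pair differs with probability 2/3 regardless of the history, and it tracks the length of the current streak of differing pairs. The function name and parameters are my own.

```python
from fractions import Fraction

def prob_property_c(n=87, run=13, categories=3):
    """Exact probability that a uniform random string of length n over
    `categories` symbols contains `run` consecutive symbols with no two
    equal neighbours (assumes independent, uniformly chosen symbols)."""
    if run <= 1:
        return Fraction(1)
    if n < run:
        return Fraction(0)
    p_diff = Fraction(categories - 1, categories)  # neighbouring pair differs
    p_same = 1 - p_diff                            # neighbouring pair is equal
    need = run - 1                                 # differing pairs needed in a row
    # state[k] = probability that the current streak of differing pairs is k
    #            and the target streak has not been reached yet
    state = [Fraction(0)] * need
    state[0] = Fraction(1)
    hit = Fraction(0)
    for _ in range(n - 1):                         # one step per adjacent pair
        new = [Fraction(0)] * need
        for k, p in enumerate(state):
            new[0] += p * p_same                   # equal pair resets the streak
            if k + 1 == need:
                hit += p * p_diff                  # streak complete: absorb
            else:
                new[k + 1] += p * p_diff
        state = new
    return hit

print(float(prob_property_c()))
```

Under these assumptions the printed value can be compared directly with the figure quoted above. Calling `prob_property_c(n=74, run=10)` or `prob_property_c(n=64, run=9)` gives the same quantity for shorter strings and shorter runs.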

Monte-Carlo/Random Model

To perform some practical tests, I generated a text of length 87 from a set $S$ consisting of 24 symbols. [I could have used just three different integers and saved the step of mapping each integer to its category, but it makes no difference to the result.] For simplicity (and without loss of generality) I chose the integers $S = [0,23]$. At each position of the text I picked a random symbol from $S$. I divided the set $S$ into three categories, analogous to the three possible numbers of semicircles in the Dorabella Cipher: $C_1 = [0,7], C_2 = [8,15], C_3 = [16,23]$. Then I checked whether the generated string has Property-C. I repeated this 20,000 times. Below you can find the code.

  
    from random import randint

    def getCat(i):
        # symbols 0..7, 8..15, 16..23 map to categories 0, 1, 2
        return i // 8

    total_ctr = 0
    B = 20000  # use more trials for stability

    for k in range(B):
        TXT = [randint(0, 23) for _ in range(87)]

        # ctr = length of the current run without equal neighbouring categories
        ctr = 1
        max_ctr = 1

        for j in range(len(TXT) - 1):
            if getCat(TXT[j]) != getCat(TXT[j+1]):
                ctr += 1
            else:
                if ctr > max_ctr:
                    max_ctr = ctr
                ctr = 1

        # the last run may reach the end of the string
        if ctr > max_ctr:
            max_ctr = ctr

        if max_ctr >= 13:
            total_ctr += 1

    print(f"> {100 * total_ctr / B:.3f}%")

The results vary a little (increase B for better stability), but they are very close to the theoretical value of 18.45%, which backs up that computation.

Monte-Carlo/Real-text driven

So in the random model, a run of thirteen such symbols is not very uncommon. However, the English language is far from random, as the letter distribution shows, so the two previous results may be off when compared with a test on real language. What I did next was to assign the cipher symbols to the letters of the alphabet by a random monoalphabetic substitution. Then I encrypted 87-letter stretches of Shakespeare's Sonnets and checked each one for the given property.

  
    import random

    def getCat(i):
        # symbols 0..7, 8..15, 16..23 map to categories 0, 1, 2
        return i // 8

    A = list("ABCDEFGHIKLMNOPQRSTUWXYZ")  # 24 letters, no J, no V
    Ar = random.sample(A, len(A))          # randomly permuted alphabet

    # fast lookup: letter -> symbol index 0..23
    pos = {Ar[idx]: idx for idx in range(24)}

    with open("Shakespeare-Sonnets.txt", "r", encoding="utf-8") as f:
        text = f.read()

    # keep letters only, merge J into I and V into U
    letters = [c.upper() for c in text if c.isalpha()]
    letters = [('I' if c == 'J' else 'U' if c == 'V' else c) for c in letters]
    letters = [c for c in letters if c in pos]  # drop accented or other non-A..Z letters

    N = len(letters)
    num_windows = N - 87 + 1
    total_ctr = 0

    # slide an 87-letter window over the whole text
    for i in range(num_windows):
        TXT = [pos[letters[i+j]] for j in range(87)]

        ctr = 1
        max_ctr = 1
        for j in range(86):
            if getCat(TXT[j]) != getCat(TXT[j+1]):
                ctr += 1
            else:
                if ctr > max_ctr:
                    max_ctr = ctr
                ctr = 1
        if ctr > max_ctr:
            max_ctr = ctr

        if max_ctr >= 13:
            total_ctr += 1

    print(f"{total_ctr} / {num_windows} = {100 * total_ctr / num_windows:.5f}%")

The result of this test depends heavily on the random permutation that assigns the alphabet letters ABC... to the symbols. Since Elgar's cipher has only 24 symbols, we substituted each 'J' with an 'I' and each 'V' with a 'U', which is not uncommon for ciphers of this kind. The text we encrypted is Shakespeare's Sonnets. We took every 87-letter substring, encrypted it and tested whether the ciphertext has Property-C. The result was that, depending on the permutation, roughly 25% to 45% of the encrypted 87-letter substrings of Shakespeare's Sonnets have Property-C.
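Because the share depends so strongly on the letter-to-symbol assignment, one way to see the spread is to repeat the test for many random assignments. The following is a rough sketch of that idea. All function names are hypothetical, the file name is the one used in this post, and the number of trials is an arbitrary choice.

```python
import random

ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"  # 24 letters, J merged into I, V into U

def normalise(text):
    """Upper-case, merge J/V into I/U and drop everything outside ALPHABET."""
    merged = text.upper().replace("J", "I").replace("V", "U")
    return [c for c in merged if c in ALPHABET]

def random_assignment(rng):
    """Random monoalphabetic assignment: letter -> symbol index 0..23."""
    shuffled = rng.sample(list(ALPHABET), len(ALPHABET))
    return {letter: idx for idx, letter in enumerate(shuffled)}

def has_property_c(cats, run=13):
    """True if `run` consecutive categories contain no equal neighbours."""
    streak = 1
    for prev, cur in zip(cats, cats[1:]):
        streak = streak + 1 if prev != cur else 1
        if streak >= run:
            return True
    return False

def window_rate(letters, assignment, window=87, run=13):
    """Share of `window`-letter stretches whose ciphertext has Property-C."""
    cats = [assignment[c] // 8 for c in letters]  # 8 symbols per category
    n_windows = len(cats) - window + 1
    if n_windows <= 0:
        raise ValueError("text is shorter than one window")
    hits = sum(has_property_c(cats[i:i + window], run) for i in range(n_windows))
    return hits / n_windows

def spread(path="Shakespeare-Sonnets.txt", trials=20, seed=0):
    """Smallest and largest share observed over `trials` random assignments."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8") as f:
        letters = normalise(f.read())
    rates = [window_rate(letters, random_assignment(rng)) for _ in range(trials)]
    return min(rates), max(rates)
```

Calling `spread()` with the sonnets file in the working directory returns the lowest and highest share seen. With only a handful of trials the extremes will themselves fluctuate, so the interval should be read as indicative.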

Conclusion

What about the argument that there are three of these substrings? One of length 13, with a probability of 18.45%. The next of length 10, with a probability of 45.43% [here I assumed a total length of 87−13], and a third of length 9, with a probability of 57.14% [here I assumed a total length of 87−13−10]. They do not overlap. If we assume independence, we can just multiply the probabilities, $0.1845 \times 0.4543 \times 0.5714$, and get a very rough estimate of $$ 4.8\% $$

We didn't even take the positioning of these substrings into account. To roughly account for the fact that one run ‘uses up’ positions, I recomputed the probabilities on shortened strings (87−13, 87−13−10). This still ignores dependencies and boundary effects, so treat the result as a coarse estimate.
Is this too low? Maybe, maybe not.

So what is this all about? First I want to mention that the author of aplaceofbrightness also points out another oddity. After the long runs that give Property-C, in the second half of each of the three lines, more or less the opposite happens: there are far too many symbol mirror pairs. A mirror pair is, for example,

    	#3cE##3cW#  or  #1cSE##1cNW#
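Reading the two examples as "same number of semicircles, opposite compass direction", mirror pairs can be counted mechanically. This is a sketch under that assumption only; the exact definition used on aplaceofbrightness may differ, and the names below are my own.

```python
import re

# Assumed notion of "mirror": the direction is flipped to its opposite
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E",
            "NE": "SW", "SW": "NE", "NW": "SE", "SE": "NW"}

def parse(symbols):
    """'#3cE##3cW#' -> [(3, 'E'), (3, 'W')]"""
    return [(int(n), d) for n, d in re.findall(r"#([123])c([NESW]{1,2})#", symbols)]

def is_mirror_pair(a, b):
    """Same semicircle count and opposite direction."""
    return a[0] == b[0] and OPPOSITE.get(a[1]) == b[1]

def count_mirror_pairs(symbols):
    """Number of adjacent symbol pairs that are mirror pairs."""
    seq = parse(symbols)
    return sum(is_mirror_pair(a, b) for a, b in zip(seq, seq[1:]))

print(count_mirror_pairs("#3cE##3cW#"), count_mirror_pairs("#1cSE##1cNW#"))
```

A mirror pair by this reading always shares its semicircle count, so it can never occur inside a run in which adjacent counts differ.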

Yes, this may be a hint, but perhaps it is simply a consequence of the Property-C run ending. By definition, there are now more consecutive pairs that share the same number of semicircles, and mirror pairs fall into that category. The alternative reading is also quite reasonable: Elgar tried to look "random" in the first symbols of each line and overdid it, then fell back into more repetitive sequences and overdid that too by inserting too many mirror symbols. Therefore I am torn between "yes, this might be a hoax" and "no, such occurrences are not too improbable". What do you think?
