There’s an argument that the Dorabella Cipher may not be a real cipher at all. The main reason given is that it contains long stretches where the number of semicircles alternates: adjacent symbols never have the same semicircle-count. In this post I’ll test whether that pattern is actually as suspicious as it sounds.
The Dorabella Cipher has remained unsolved for about 130 years. While looking for recent discussion and solution attempts, I went well beyond the first page of Google results and found several excellent deep-dive blogs. One post on aplaceofbrightness.blogspot.com discusses the “alternating semicircle-count” pattern and motivated this analysis.
The findings from the blog are presented below. They address an odd distribution of the chosen cipher symbols. Look at the first line of the Dorabella Cipher. It starts with the following symbols:
Line 1 - First 13 symbols
#2cE##3cW##2cSE##3cE##1cE##2cS##1cN##3cE##1cSW##2cNE##3cSE##2cNW##1cNW#
Do you notice anything? Two consecutive symbols never have the same number of semicircles! The counts are [2,3,2,3,1,2,1,3,1,2,3,2,1]. That is thirteen symbols in a row!
Line 2 - First 10 symbols
#1cN##2cNW##1cN##2cS##1cN##3cE##1cSW##2cSW##3cE##2cSE#
Here there are 10 symbols in a row without two consecutive ones sharing the same number of semicircles.
Line 3 - 4th to 12th symbols
#2cN##3cNW##2cSE##1cSE##2cN##3cN##1cS##3cNW##2cSE#
Here we get nine.
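The three runs above can be checked mechanically. This is a minimal sketch that assumes the notation used in this post, where each symbol is written as `#<count>c<direction>#` (for example `#2cSE#` is a two-semicircle symbol facing south-east). The function names are my own and not taken from the blog.

```python
import re

# Assumed notation: "#<semicircle count>c<compass direction>#", e.g. "#2cSE#".
SYMBOL = re.compile(r"#([123])c(NE|NW|SE|SW|N|E|S|W)#")

def semicircle_counts(encoded):
    """Extract the semicircle count of every symbol in an encoded string."""
    return [int(m.group(1)) for m in SYMBOL.finditer(encoded)]

def longest_alternating_run(counts):
    """Length of the longest run in which neighbouring counts always differ."""
    if not counts:
        return 0
    best = run = 1
    for a, b in zip(counts, counts[1:]):
        run = run + 1 if a != b else 1
        best = max(best, run)
    return best

LINE1 = "#2cE##3cW##2cSE##3cE##1cE##2cS##1cN##3cE##1cSW##2cNE##3cSE##2cNW##1cNW#"
LINE2 = "#1cN##2cNW##1cN##2cS##1cN##3cE##1cSW##2cSW##3cE##2cSE#"
LINE3 = "#2cN##3cNW##2cSE##1cSE##2cN##3cN##1cS##3cNW##2cSE#"

for name, line in (("Line 1", LINE1), ("Line 2", LINE2), ("Line 3", LINE3)):
    counts = semicircle_counts(line)
    print(name, counts, longest_alternating_run(counts))
```

Each excerpt is alternating throughout, so the longest run equals the number of symbols in the excerpt: 13, 10 and 9.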
The author of the blog argues that this is a strong clue that the cipher text does not resemble normal English.
But is this really exceptional, or is such a pattern actually more likely than it seems?
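First, since we use this terminology over and over again, let us define exactly what we mean by "such a pattern". We say that a string has Property-C if it contains a run of at least 13 consecutive symbols in which no two neighbouring symbols have the same semicircle-count (category).
The argument is that Dorabella’s first line has an unusually long run with no equal consecutive semicircle-counts. To answer the question of whether Property-C is very unlikely, I ran three tests: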
[AI Theoretical Answer/Random Model] I asked the AI about the probability that a random string $S$ of length 87 has Property-C.
[Monte-Carlo/Random Model] I generated random strings of length 87 with symbols from three categories and counted how many have Property-C.
[Monte-Carlo/Real-text driven] I encrypted actual language with random assignments of cipher symbols to alphabet letters and counted how many 87-letter ciphertexts have Property-C.
Theory/Random Model
I simply asked the AI (OpenAI v5.2) the prompt:
The answer was an argument based on Bernoulli trials and dynamic programming, with the result \[ \textsf{Pr}(\text{String S has Property-C}) \approx 0.18450333017 \] I read through the answer. It was very lengthy (as usual :)) but looks OK.
In other words, for a string of length 87 consisting of random symbols from a set of size 24, where each symbol falls into one of three categories of size 8, the probability that there is a substring of length 13 such that each symbol's category differs from that of its neighbours is ~18.45%.
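This figure can be cross-checked without an AI. The following is a minimal sketch, assuming exactly the model described above: symbols are independent and uniform over three equally large categories. Under that assumption each adjacent pair differs in category with probability 2/3, independently of the other pairs, so Property-C is the event that at least 12 of the 86 adjacent pairs in a row are "different". The function name `prob_property_c` and its parameters are my own.

```python
from fractions import Fraction

def prob_property_c(n_symbols=87, run_len=13, p_diff=Fraction(2, 3)):
    """Exact probability of a run of >= run_len symbols whose neighbours
    always differ in category (i.i.d. uniform categories assumed)."""
    k = run_len - 1      # "different" transitions needed in a row
    n = n_symbols - 1    # number of adjacent pairs in the string
    if k <= 0:
        return Fraction(1) if n_symbols >= 1 else Fraction(0)
    # state[j] = probability that no full run has occurred yet and the
    # current streak of "different" transitions has length j (0 <= j < k)
    state = [Fraction(0)] * k
    state[0] = Fraction(1)
    hit = Fraction(0)    # probability that a full run has already occurred
    for _ in range(n):
        new = [Fraction(0)] * k
        for j, pr in enumerate(state):
            new[0] += pr * (1 - p_diff)      # equal pair: streak resets
            if j + 1 == k:
                hit += pr * p_diff           # streak completes the run
            else:
                new[j + 1] += pr * p_diff    # streak grows by one
        state = new
    return hit

print(f"{float(prob_property_c()):.5f}")
```

Exact fractions avoid rounding concerns; the printed value should land close to the 18.45% quoted above.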
Monte-Carlo/Random Model
To run some practical tests, I generated a text of length 87 from a set $S$ of 24 symbols. [I could have used just three different integers and skipped the step of mapping each integer to its category, but it makes no difference to the result.] For simplicity (and without loss of generality) I chose the integers $S = [0,23]$. At each position of the text I picked a random symbol from $S$. I divided $S$ into three categories, analogous to the three possible numbers of semicircles in the Dorabella Cipher: $C_1 = [0,7], C_2 = [8,15], C_3 = [16,23]$. Then I checked whether the generated integer string has Property-C. I repeated this 20,000 times. Below you can find the code.
from random import randint
def getCat(i):
    # map symbol 0..23 to its category 0..2 (eight symbols per category)
    return i // 8
total_ctr = 0
B = 20000 # use more trials for stability
for k in range(B):
    TXT = [randint(0, 23) for _ in range(87)]
    ctr = 1      # length of the current run without equal neighbouring categories
    max_ctr = 1  # longest such run found in this string
    for j in range(len(TXT) - 1):
        if getCat(TXT[j]) != getCat(TXT[j+1]):
            ctr += 1
        else:
            if ctr > max_ctr:
                max_ctr = ctr
            ctr = 1
    if ctr > max_ctr:
        max_ctr = ctr
    if max_ctr >= 13:  # the string has Property-C
        total_ctr += 1
# plain Python formatting (numerical_approx exists only in SageMath)
print(f"> {100 * total_ctr / B:.3f}%")
The results vary a little (increase B for better stability), but they are very close to the theoretical value of 18.45%, which is thereby well supported.
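How much variation between runs should one expect? A quick sketch, assuming the trials are independent and using the usual binomial standard error for an estimated proportion. The helper name is my own.

```python
from math import sqrt

def binomial_standard_error(p, trials):
    """Standard error of a proportion estimated from independent trials."""
    return sqrt(p * (1 - p) / trials)

p = 0.1845   # theoretical value quoted above
B = 20000    # number of Monte-Carlo trials used above
se = binomial_standard_error(p, B)
print(f"standard error: {100 * se:.2f} percentage points")
print(f"approx. 95% range: {100 * (p - 1.96 * se):.2f}% to {100 * (p + 1.96 * se):.2f}%")
```

With 20,000 trials the standard error is roughly 0.27 percentage points, so individual runs landing about half a percentage point away from 18.45% are nothing unusual. The error shrinks with the square root of B, so halving it needs four times as many trials.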
Monte-Carlo/Real-text driven
So far, in the random model, a run of thirteen such symbols is not very uncommon. However, English is far from random, as its letter distribution shows. So the two previous results may be off when compared with a test on real language. What I did next was to assign the cipher symbols to the letters via a random monoalphabetic substitution. Then I encrypted 87-letter stretches of Shakespeare's Sonnets and checked them for the given property.
from cryptanalysis import etsch_helper_functions as helper # my own helper module
def getCat(i):
    return i // 8
A = list("ABCDEFGHIKLMNOPQRSTUWXYZ") # 24 letters, no J, no V
rp = helper.get_random_permutation(list(range(24)))
Ar = helper.apply_permutation(A, rp) # permuted alphabet
# fast lookup: letter -> symbol index 0..23
pos = {Ar[idx]: idx for idx in range(24)}
with open("Shakespeare-Sonnets.txt", "r", encoding="utf-8") as f:
    text = f.read()
# keep letters only, upper-cased, then merge J into I and V into U
letters = [c.upper() for c in text if c.isalpha()]
letters = [('I' if c == 'J' else 'U' if c == 'V' else c) for c in letters]
N = len(letters)
num_windows = N - 87 + 1
total_ctr = 0
# slide an 87-letter window over the whole text
for i in range(num_windows):
    TXT = [pos[letters[i+j]] for j in range(87)]
    ctr = 1
    max_ctr = 1
    for j in range(86):
        if getCat(TXT[j]) != getCat(TXT[j+1]):
            ctr += 1
        else:
            if ctr > max_ctr:
                max_ctr = ctr
            ctr = 1
    if ctr > max_ctr:
        max_ctr = ctr
    if max_ctr >= 13:
        total_ctr += 1
print(f"{total_ctr} / {num_windows} = {float(100*total_ctr/num_windows):.5f}%")
The result of this test depends heavily on the random permutation that assigns the alphabet letters A, B, C, ... to the symbols. Since Elgar's cipher has only 24 distinct symbols, I replaced each 'J' with an 'I' and each 'V' with a 'U', which is not uncommon for ciphers of this kind. The text I encrypted is Shakespeare's Sonnets. I took every 87-letter substring, encrypted it, and tested whether the ciphertext has Property-C. The result was that, depending on the permutation, somewhere between $$ \approx \left[25\%, 45\%\right] $$ of the encrypted 87-letter substrings of Shakespeare's Sonnets have Property-C.
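Because the outcome depends so strongly on the key, it helps to repeat the test over many random keys and look at the spread. The following is a sketch using only the standard library. It replaces my helper module with `random.sample`, and the function names, the default of 20 keys, and the seed are assumptions made for illustration.

```python
import random

ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"  # 24 letters; J is merged into I, V into U

def normalize(text):
    """Upper-case the text, merge J->I and V->U, keep only the 24 letters."""
    out = []
    for c in text.upper():
        c = "I" if c == "J" else "U" if c == "V" else c
        if c in ALPHABET:
            out.append(c)
    return out

def has_property_c(cats, run_len=13):
    """True if some run of >= run_len entries has no equal neighbours."""
    if not cats:
        return False
    if run_len <= 1:
        return True
    run = 1
    for a, b in zip(cats, cats[1:]):
        run = run + 1 if a != b else 1
        if run >= run_len:
            return True
    return False

def random_key(rng):
    """Random monoalphabetic key: letter -> symbol index 0..23."""
    shuffled = rng.sample(list(ALPHABET), len(ALPHABET))
    return {letter: idx for idx, letter in enumerate(shuffled)}

def window_rate(letters, key, window=87, run_len=13):
    """Fraction of all length-`window` substrings that have Property-C."""
    cats = [key[c] // 8 for c in letters]  # three categories of eight symbols
    n_windows = len(cats) - window + 1
    if n_windows <= 0:
        raise ValueError("text is shorter than the window")
    hits = sum(has_property_c(cats[i:i + window], run_len) for i in range(n_windows))
    return hits / n_windows

def rate_spread(letters, n_keys=20, window=87, run_len=13, seed=0):
    """Minimum, mean and maximum Property-C rate over n_keys random keys."""
    rng = random.Random(seed)
    rates = [window_rate(letters, random_key(rng), window, run_len) for _ in range(n_keys)]
    return min(rates), sum(rates) / len(rates), max(rates)
```

To use it, read the Sonnets file, pass the text through `normalize`, and call `rate_spread` on the result. The minimum and maximum then show how far individual keys can pull the percentage.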
Conclusion
What about the argument that there are three such substrings? One of length 13 with a probability of 18.45%, the next of length 10 with a probability of 45.43% [here I assumed a total length of 87-13], and a third of length 9 with a probability of 57.14% [here I assumed a total length of 87-13-10]. They do not overlap. If we assume independence, we can simply multiply the probabilities and get a very rough estimate of $$ 0.1845 \times 0.4543 \times 0.5714 \approx 4.8\% $$
We did not even take the positioning of these substrings into account. To roughly account for the fact that one run ‘uses up’ positions, I recomputed the probabilities on shortened strings (87−13 and 87−13−10). This still ignores dependencies and boundary effects, so treat the result as a coarse estimate.
Is this too low? Maybe, maybe not.
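An alternative to multiplying three separate probabilities is to simulate the joint event directly in the random model. The sketch below is one possible reading of "three non-overlapping runs", not the calculation used above: it requires three different maximal alternating runs of lengths at least 13, 10 and 9 in one random string of 87 categories. It deliberately ignores the case where one very long run could be cut into two, and it does not model the line structure of the cipher. All names and defaults are my own.

```python
import random

def run_lengths(cats):
    """Lengths of the maximal runs in which neighbouring categories differ."""
    if not cats:
        return []
    runs = []
    run = 1
    for a, b in zip(cats, cats[1:]):
        if a != b:
            run += 1
        else:
            runs.append(run)   # an equal pair ends the current run
            run = 1
    runs.append(run)
    return runs

def has_three_runs(cats, needed=(13, 10, 9)):
    """True if the longest maximal runs cover the needed lengths one by one."""
    need = sorted(needed, reverse=True)
    top = sorted(run_lengths(cats), reverse=True)[:len(need)]
    return len(top) == len(need) and all(t >= n for t, n in zip(top, need))

def estimate_joint(trials=20000, length=87, seed=1):
    """Monte-Carlo estimate of the joint event for i.i.d. uniform categories."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cats = [rng.randrange(3) for _ in range(length)]
        hits += has_three_runs(cats)
    return hits / trials

if __name__ == "__main__":
    print(f"{100 * estimate_joint():.2f}%")
```

Comparing this estimate with the multiplied figure gives a feeling for how much the independence assumption matters.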
So what's this all about? First, I want to mention that the author of aplaceofbrightness also points out another oddity. After the long substrings with Property-C, in the second half of the three lines, more or less the opposite happens. There are far too many symbol mirror pairs. A mirror pair is, for example,
#3cE##3cW# or #1cSE##1cNW#
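Counting such pairs is easy to automate. This is a small sketch that assumes the `#<count>c<direction>#` notation of this post and reads "mirror pair" as two neighbouring symbols with the same semicircle-count and opposite compass directions, which is my interpretation of the two examples. The names are my own.

```python
import re

# Assumed notation: "#<semicircle count>c<compass direction>#", e.g. "#3cE#".
SYMBOL = re.compile(r"#([123])c(NE|NW|SE|SW|N|E|S|W)#")
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E",
            "NE": "SW", "SW": "NE", "NW": "SE", "SE": "NW"}

def parse_symbols(encoded):
    """Turn an encoded string into a list of (count, direction) pairs."""
    return [(int(m.group(1)), m.group(2)) for m in SYMBOL.finditer(encoded)]

def count_mirror_pairs(symbols):
    """Count neighbours with equal count and opposite direction."""
    return sum(
        1
        for (c1, d1), (c2, d2) in zip(symbols, symbols[1:])
        if c1 == c2 and OPPOSITE[d1] == d2
    )

print(count_mirror_pairs(parse_symbols("#3cE##3cW#")))    # first example above
print(count_mirror_pairs(parse_symbols("#1cSE##1cNW#")))  # second example above
```

Since a mirror pair needs two neighbours with the same count, a stretch with Property-C cannot contain any by definition.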
Yes, this may be a hint, but perhaps it is simply a consequence of the Property-C substring ending. By definition, there are then more consecutive pairs that share the same number of semicircles, and mirror pairs fall into that category. The other reading is also quite reasonable: Elgar tried to be "random" in the first symbols of each line and overdid it, then fell back into more repetitive sequences and overdid that as well by inserting too many mirror symbols. Therefore I am torn between "yes, this might be a hoax" and "no, such occurrences are not too improbable". What do you think?
[2] https://aplaceofbrightness.blogspot.com/p/introduction-in-1897-british-composer.html