Enigma Machine


Wednesday, December 10, 2008

Attacking a Monoalphabetic (Substitution) Cipher

People wouldn't go to the extra effort of encrypting a message if there weren't other (nasty, evil, unauthorized) people who wanted to decrypt it. The art of attacking an encrypted message (trying to figure out what it says without being given the key) is called cryptanalysis.

OK, let's admit this up front. Cryptanalysis (even for simple substitution ciphers) is a bit arduous and is not for everyone. So if you don't have a strong desire to understand how to 'break' a cipher, you can skim or skip this blog entry. If you are curious, at least read on a bit and see what the process is like.

So we have encrypted a message with a shift cipher. We cleverly regroup the letters so all the 'words' are five characters long. This leaves an attacker no clues based on word length for guessing at words or letters. What is left to do?
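As an aside, the shift-and-regroup step just described can be sketched in a few lines of Python. This is only an illustration: the function names, the sample message, and the shift of 5 are all made up for the example and are not taken from the message we attack below.

```python
import string

def shift_encrypt(plaintext: str, shift: int) -> str:
    """Shift each letter forward by `shift` places, dropping everything else."""
    letters = [c for c in plaintext.lower() if c in string.ascii_lowercase]
    return "".join(
        chr((ord(c) - ord("a") + shift) % 26 + ord("a")) for c in letters
    )

def group_by_five(text: str) -> str:
    """Split the text into five-character 'words' to hide real word lengths."""
    return " ".join(text[i:i + 5] for i in range(0, len(text), 5))

# Hypothetical message and shift, chosen only for illustration.
ciphertext = group_by_five(shift_encrypt("Attack at dawn", 5))
print(ciphertext)  # fyyfh pfyif bs
```

Notice that the spaces in the output tell you nothing about where the real words start and end.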

Well, if we know that the original message was written in English, we can use the fact that English (like every other language) uses some letters more than others. For instance, e, t, a, o, i, n, and s are among the most frequently used letters in English. The letters z, q, x, and j are the least frequently used.

Terminology break:
To speak easier about this topic, we need to cover a couple more terms. Plaintext is the original unencrypted message. An encrypted message is called Ciphertext.

Back to the topic:
So, if we intercept a message and it contains enough ciphertext (and we know the original plaintext was English), we can try counting the characters to see which ones occur most frequently.

Look at the following message:

wrvsh dnhdv lhude rxwwk lvwrs lfzhq hhgwr fryhu dfrxs ohpru hwhpv sodlq whawl vwkhr uljlq doxhq fubsw hgphv vdjh

Ignoring spaces, the character counts are:

a 1
b 1
d 7
e 1
f 4
g 2
h 16
j 2
k 2
l 7
n 1
o 3
p 3
q 4
r 8
s 5
u 5
v 7
w 10
x 3
y 1
z 1


Total 94
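If you would rather not count by hand, the tally above can be reproduced with a short script. This is a minimal sketch using Python's standard `collections.Counter`; the names `CIPHERTEXT` and `letter_counts` are just labels picked for this example.

```python
from collections import Counter

CIPHERTEXT = (
    "wrvsh dnhdv lhude rxwwk lvwrs lfzhq hhgwr fryhu dfrxs ohpru "
    "hwhpv sodlq whawl vwkhr uljlq doxhq fubsw hgphv vdjh"
)

def letter_counts(text: str) -> Counter:
    """Count each alphabetic character, ignoring spaces and punctuation."""
    return Counter(c for c in text.lower() if c.isalpha())

counts = letter_counts(CIPHERTEXT)
for letter in sorted(counts):
    print(letter, counts[letter])
print("Total", sum(counts.values()))
```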

In the English language, e is the most frequently used letter. About 12.7% of the characters in normal English prose will be an 'e'. Even without working out strict percentages, we can see that 'h' is a very good candidate based on how often it occurs: 16/94 = 17%. That's a bit high, but in a message this short we would still guess that h = e. The next most frequent letter in English is 't', which on average makes up about 9% of the characters in English prose. 'w' comprises 10.6% of the encoded message, which would lead us to guess that w = t.
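The percentage arithmetic and the first two guesses can be sketched like this. The English figures are the approximate ones quoted above; pairing letters purely by rank is an assumption that happens to work for this message and will not always hold for short texts.

```python
from collections import Counter

CIPHERTEXT = (
    "wrvsh dnhdv lhude rxwwk lvwrs lfzhq hhgwr fryhu dfrxs ohpru "
    "hwhpv sodlq whawl vwkhr uljlq doxhq fubsw hgphv vdjh"
)

# Approximate English letter frequencies (percent) quoted in this post.
ENGLISH_PERCENT = {"e": 12.7, "t": 9.0}

counts = Counter(c for c in CIPHERTEXT if c.isalpha())
total = sum(counts.values())

# Pair the most common ciphertext letters with the most common English letters.
english_by_rank = sorted(ENGLISH_PERCENT, key=ENGLISH_PERCENT.get, reverse=True)
guesses = {}
for (cipher_letter, n), plain_letter in zip(counts.most_common(), english_by_rank):
    percent = 100 * n / total
    print(f"{cipher_letter}: {n}/{total} = {percent:.1f}%  "
          f"(English {plain_letter}: about {ENGLISH_PERCENT[plain_letter]}%)")
    guesses[cipher_letter] = plain_letter

print(guesses)
```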

We now start replacing characters with our guesses and see if things start to make sense. Try it on a piece of paper. Write an 'e' over every 'h' and a 't' over every 'w'.
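If paper isn't handy, the same pencil work can be imitated with a small script. This is just a sketch: `apply_guesses` is a made-up helper that prints the guessed letters in capitals and a dot wherever we have no guess yet.

```python
CIPHERTEXT = (
    "wrvsh dnhdv lhude rxwwk lvwrs lfzhq hhgwr fryhu dfrxs ohpru "
    "hwhpv sodlq whawl vwkhr uljlq doxhq fubsw hgphv vdjh"
)

# Guesses so far: ciphertext letter -> plaintext letter.
guesses = {"h": "e", "w": "t"}

def apply_guesses(ciphertext: str, guesses: dict) -> str:
    """Return the guessed letters in upper case, with '.' for unknown letters."""
    out = []
    for c in ciphertext:
        if c == " ":
            out.append(" ")
        else:
            out.append(guesses.get(c, ".").upper())
    return "".join(out)

partial = apply_guesses(CIPHERTEXT, guesses)
print(partial)     # guessed letters, written "over" the ciphertext
print(CIPHERTEXT)
```

Adding another entry to `guesses` and running it again is the script's version of pencilling in one more letter.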

You will find you do not have much to go on yet. OK, look at 'r'. It occurs 8 times, so 8/94 = 8.5%. In English, 'a' makes up about 8.17% of characters and 'o' about 7.5%, so 'r' is likely one of these.

Going on like this, we can progressively fill in letters. If a letter doesn't make sensible words when it is inserted, we simply try the next letter in frequency order to see if that makes sense. By the way, English character frequencies are well known and published. Here is one source:

http://en.wikipedia.org/wiki/Letter_frequencies

Anyhow, we would likely find that 'a' does not work for 'r' but that 'o' does. If we look carefully at how these letters are related, we see that this is, in fact, the Caesar cipher (the alphabet shifted 3 characters): t=w, e=h, o=r, etc. We can quickly jump ahead and convert every character to the one 3 characters earlier. What we would find is part of the text in our terminology section above, although two letters went missing along the way: 'terms' comes out as 'tems' and 'unencrypted' as 'uencrypted'.
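Here is one way that last step might look in code: check that all of our guessed pairs agree on a single shift, then undo that shift everywhere. It is a sketch under the assumption that the cipher really is a plain shift; `guessed_pairs` and `shift_decrypt` are names invented for this example.

```python
CIPHERTEXT = (
    "wrvsh dnhdv lhude rxwwk lvwrs lfzhq hhgwr fryhu dfrxs ohpru "
    "hwhpv sodlq whawl vwkhr uljlq doxhq fubsw hgphv vdjh"
)

# Our guesses so far: ciphertext letter -> plaintext letter.
guessed_pairs = {"w": "t", "h": "e", "r": "o"}

def shift_decrypt(ciphertext: str, shift: int) -> str:
    """Move every letter back by `shift` places, leaving spaces alone."""
    return "".join(
        chr((ord(c) - ord("a") - shift) % 26 + ord("a")) if c.isalpha() else c
        for c in ciphertext
    )

# Distance from each guessed plaintext letter to its ciphertext letter.
shifts = {(ord(c) - ord(p)) % 26 for c, p in guessed_pairs.items()}

if len(shifts) == 1:
    shift = next(iter(shifts))
    plaintext = shift_decrypt(CIPHERTEXT, shift)
    print("shift:", shift)
    print(plaintext)
else:
    print("Guesses disagree, so this is not a simple shift:", shifts)
```

Run together, the decrypted groups begin 'to speak easier about this topic'. If the guesses had disagreed on the shift, we would be back to filling in letters one at a time.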

This process, as mentioned above, is called 'cryptanalysis'. It is time-consuming and difficult, so it is not for everyone. I hope this hasn't scared you off. Next time we will go back to talking about another type of character-based encryption.
