In a book chapter from 2008, Peter Norvig cracked coded, historical German telegrams using a simple computer program. He included the program so you can try it yourself. On my computer, it cracked the code in less than a minute.
His program relies on a database of every English word, how often it is used, and how often it is found near any other word in context. Two researchers at Google derived this database from the corpus of English on the public Web.
Inspired and curious, I made a similar database from all the email messages that I sent or received since 2007. What can be learned from a person’s aggregated correspondence?
My emails contained 17,400,000 tokens (words, misspellings, and one-off expressions like “aaargh”) drawn from a vocabulary of 265,000 distinct types. I discarded types that occurred fewer than five times, paring down the vocabulary to 62,000 common types.
The most common types are generic.
the to a and i com http
But the most common long types reveal more.
rochester facebook delivery message university students physics baltimore subject hopkins
The database doesn’t just count words; it remembers their order. So we can ask questions like, “What word usually follows ‘ridiculously’?”
ridiculously loud
ridiculously cheap
ridiculously awesome
ridiculously ridiculous
ridiculously hot
It’s like Google auto-suggest, but completely derived from my writing and the writing of my email correspondents. Ridiculously ridiculous!
Another question: “Fill in ‘love the ____’.”
love the zappos
love the passion
love the idea
love the lamb
love the pictures
love the sideburns
The database knows, with confidence, how I use the word “swing” in context.
swing dance (394)
swing dancing (385)
swing by (149)
swing in (74)
Under the hood.
Update: Downloading a copy of your Gmail messages is now much easier.
I downloaded my Gmail messages onto my local machine with getmail. (I followed these instructions.) A Python script extracted the body text from each email as a lowercase string. The script split the text into words with the blunt regular expression [a-z]+. It inserted all of these words into one giant SQL table, Tokens, identified by which email message they came from (msg_id) and their position in the message (pos).
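The extraction step described above might look roughly like the following. This is a minimal sketch, not the original script: it assumes SQLite as the database, and the helper names (tokenize, load_tokens) and the two sample messages are made up for illustration. Only the regular expression and the column names msg_id and pos come from the post.

```python
import re
import sqlite3

WORD_RE = re.compile(r"[a-z]+")  # the blunt pattern quoted in the post

def tokenize(body):
    """Lowercase the body text and return its runs of a-z letters, in order."""
    return WORD_RE.findall(body.lower())

def load_tokens(conn, messages):
    """Insert every token with its message id and its position in that message."""
    conn.execute(
        "create table if not exists Tokens (msg_id integer, pos integer, token text)"
    )
    for msg_id, body in messages:
        rows = [(msg_id, pos, tok) for pos, tok in enumerate(tokenize(body))]
        conn.executemany(
            "insert into Tokens (msg_id, pos, token) values (?, ?, ?)", rows
        )
    conn.commit()

# Hypothetical two-message sample standing in for the real mailbox.
conn = sqlite3.connect(":memory:")
load_tokens(conn, [(1, "Let's go swing dancing!"), (2, "Swing by at 5pm?")])
```

The pattern is blunt in a visible way: “Let's” becomes the two tokens “let” and “s”, and “5pm” loses its digit. URLs are chopped up the same way, which is presumably why “com” and “http” rank among the most common types.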
SQL is a convenient language for the questions I asked. For example, it’s easy to group occurrences of the same word and tally them.
insert into Types (type, tally) select token, count(*) from Tokens group by token
To list the most common types, as I did above:
select type from Types order by tally desc
the to a and i com http
To restrict the search to longer words:
select type from Types where length(type) > 6 order by tally desc
rochester facebook delivery message university students physics baltimore subject hopkins
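To try the tally and the two ranking queries end to end, here is a small sketch on a toy table. It assumes SQLite and a Types table whose columns are named type and tally, matching the select statements. The sample sentence and the top_types helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Tokens (msg_id integer, pos integer, token text)")
conn.execute("create table Types (type text, tally integer)")

# Hypothetical sample tokens; the real table held millions of rows.
sample = "the physics students at the university love the physics".split()
conn.executemany("insert into Tokens values (1, ?, ?)", list(enumerate(sample)))

# One row per distinct token, with the number of times it occurs.
conn.execute(
    "insert into Types (type, tally) "
    "select token, count(*) from Tokens group by token"
)

def top_types(conn, min_length=0, limit=5):
    """Return the most frequent types that are longer than min_length characters."""
    rows = conn.execute(
        "select type from Types where length(type) > ? "
        "order by tally desc, type limit ?",
        (min_length, limit),
    )
    return [row[0] for row in rows]

print(top_types(conn, limit=2))                # most common types overall
print(top_types(conn, min_length=6, limit=3))  # most common long types
```

The extra sort key (type) only makes ties come out in a stable order; it is not part of the original queries.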
For queries about words in context, I created the table Bigrams, which takes the words from Tokens by pairs in sequence, groups recurrences, and tallies them.
insert into Bigrams (type1, type2, tally) select a.token, b.token, count(*) from Tokens a join Tokens b where a.msg_id=b.msg_id and b.pos=1+a.pos group by a.token, b.token
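The self-join is easier to follow on a toy example. The sketch below assumes SQLite. The three sample messages and the next_words helper are invented for illustration, and the join condition is written with "on" instead of "where", which should behave the same for an inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Tokens (msg_id integer, pos integer, token text)")
conn.execute("create table Bigrams (type1 text, type2 text, tally integer)")

# Hypothetical messages, already lowercased and free of punctuation.
messages = {
    1: "swing dancing is ridiculously fun",
    2: "ridiculously fun swing dancing tonight",
    3: "ridiculously loud",
}
for msg_id, body in messages.items():
    conn.executemany(
        "insert into Tokens values (?, ?, ?)",
        [(msg_id, pos, tok) for pos, tok in enumerate(body.split())],
    )

# Pair each token with the one right after it in the same message.
conn.execute(
    "insert into Bigrams (type1, type2, tally) "
    "select a.token, b.token, count(*) "
    "from Tokens a join Tokens b "
    "on a.msg_id = b.msg_id and b.pos = a.pos + 1 "
    "group by a.token, b.token"
)

def next_words(conn, word, limit=5):
    """Return (following word, tally) pairs for word, most frequent first."""
    rows = conn.execute(
        "select type2, tally from Bigrams where type1 = ? "
        "order by tally desc, type2 limit ?",
        (word, limit),
    )
    return rows.fetchall()

print(next_words(conn, "ridiculously"))
```

Because the join requires matching msg_id values, the last word of one message is never paired with the first word of the next. On a table with millions of rows, an index on (msg_id, pos) would likely make this join far cheaper, though the post does not say whether one was used.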
Now I can ask, if Word 1 is “ridiculously,” what is Word 2?
select type1, type2 from Bigrams where type1='ridiculously' order by tally desc
ridiculously loud
ridiculously on
ridiculously awesome
ridiculously ridiculous
ridiculously hot
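The post does not show how the three-word prompt “love the ____” was answered. One possibility, offered purely as an assumption, is to extend the same self-join by one more step and query Tokens directly. The fill_in helper and the sample messages below are hypothetical, and SQLite is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Tokens (msg_id integer, pos integer, token text)")

# Hypothetical messages, already lowercased and free of punctuation.
messages = {
    1: "i love the idea",
    2: "we love the idea",
    3: "love the pictures",
}
for msg_id, body in messages.items():
    conn.executemany(
        "insert into Tokens values (?, ?, ?)",
        [(msg_id, pos, tok) for pos, tok in enumerate(body.split())],
    )

def fill_in(conn, first, second, limit=5):
    """Return (word, tally) pairs for words that follow 'first second'."""
    rows = conn.execute(
        "select c.token, count(*) as tally "
        "from Tokens a "
        "join Tokens b on b.msg_id = a.msg_id and b.pos = a.pos + 1 "
        "join Tokens c on c.msg_id = a.msg_id and c.pos = a.pos + 2 "
        "where a.token = ? and b.token = ? "
        "group by c.token order by tally desc, c.token limit ?",
        (first, second, limit),
    )
    return rows.fetchall()

print(fill_in(conn, "love", "the"))
```

A precomputed trigram table, built the same way as Bigrams, would be the natural alternative if such questions were asked often.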
With this large data set in hand, I might also try topic detection (see this cool application), though I doubt it would tell me anything that I don’t already know. I could also use the date and time information in the emails, as Stephen Wolfram did when he analyzed his own emailing habits. Combining that with lexical analysis would be cool, but I might need a few years’ more data….
Thanks to David Lu and Jim, respectively, for the links.