| tags: [ computational linguistics lexical analysis gmail ]
In So Many Words
In a book chapter from 2008, Peter Norvig cracked coded, historical German telegrams using a simple computer program. He included the program so you can try it yourself. On my computer, it cracked the code in less than a minute.
His program relies on a database of every English word, how often it is used, and how often it is found near any other word in context. Two researchers at Google derived this database from the corpus of English on the public Web.
Inspired and curious, I made a similar database from all the email messages that I sent or received since 2007. What can be learned from a person’s aggregated correspondence?
My emails contained 17,400,000 tokens — words, misspellings, and one-off expressions like “aaargh” — drawn from a vocabulary of 265,000 distinct types. I discarded types that occurred less than five times, paring down the vocabulary to 62,000 common types.
The most common types are generic.
the
to
a
and
i
com
http
But the most common long types reveal more.
rochester
facebook
delivery
message
university
students
physics
baltimore
subject
hopkins
The database doesn’t just count words; it remembers their order. So we can ask questions like, “What word usually follows ‘ridiculously’?”
ridiculously loud
ridiculously cheap
ridiculously awesome
ridiculously ridiculous
ridiculously hot
It’s like Google auto-suggest, but completely derived from my writing and the writing of my email correspondents. Ridiculously ridiculous!
Another question: “Fill in ‘love the ____‘.”
love the zappos
love the passion
love the idea
love the lamb
love the pictures
love the sideburns
The database knows, with confidence, how I use the word “swing” in context.
swing dance (394)
swing dancing (385)
swing by (149)
swing in (74)
Under the hood.
Update: Downloading a copy of your Gmail messages is now much easier.
I downloaded my Gmail messages onto my local machine with getmail
. (I followed these instructions.) A Python script extracted the body text from each email as a lowercase string. The script split the text into words with the blunt regular expression [a-z]+
. It inserted all of these words into one giant SQL table, identified by which email message they came from (msg_id
) and their position in the message (pos
)
SQL is a convenient language for the questions I asked. For example, it’s easy to group occurrences of the same word and tally them.
insert into Types (token, tally)
select token, count(*) from Tokens
group by token
To list the most common types, as I did above:
select type
from Types
order by tally desc
the
to
a
and
i
com
http
To restrict the search to longer words:
select type
from Types
where length(type) > 6
order by tally desc
rochester
facebook
delivery
message
university
students
physics
baltimore
subject
hopkins
For queries about words in context, I created the table Bigrams
, which takes the words from Tokens
by pairs in sequence, groups recurrences, and tallies them.
insert into Bigrams (type1, type2, tally)
select a.token, b.token, count(*)
from Tokens a join Tokens b
where a.msg_id=b.msg_id
and b.pos=1+a.pos
group by a.token, b.token
Now I can ask, if Word 1 is “ridiculously,” what is Word 2?
select type1, type2
from Bigrams
where type1='ridiculously'
order by tally desc
ridiculously loud
ridiculously on
ridiculously awesome
ridiculously ridiculous
ridiculously hot
With this large data set in hand, I might also try topic detection (see this cool application), though I doubt it would tell me anything that I don’t already know. I could also use the date and time information in the emails, as Stephen Wolfram did when he analyzed his own emailing habits. Combining that with lexical analysis would be cool, but I might need a few years’ more data….
Thanks to David Lu and Jim, respectively, for the links.