UC Berkeley Enron Email Analysis Project
Starting with the Enron Email
dataset made available by MIT, SRI, and CMU, we have put together several resources:
- A powerful search interface for the
Enron email collection,
developed by Andrew Fiore and Marti Hearst. This connects to
the MySQL database described below using Python,
with Lucene for the text queries.
- A set of
categories developed by Marti and students in her
ANLP (Applied Natural Language Processing) course, to be used for annotating a
subset of the Enron email messages.
- A subset of about 1700 labeled
email messages (4.5M). These were chosen by Marti in a semi-motivated
fashion (focusing on business-related emails and the California Energy Crisis
and on emails that occurred later in the collection, trying to avoid very
personal messages, jokes, and so on). Students in Marti's ANLP
course annotated the selected messages with the category labels. Each message
was labeled by two people, but no claims of consistency, comprehensiveness, or
generality are made about these labelings.
- The Enronic
email visualization and clustering tool by Jeff
Heer, built on his prefuse toolkit
(1.9M jar file)
- A database representation (219 MB
compressed) of the Enron email collection, built by
Andrew Fiore and Jeff Heer,
containing the Enron email messages. This version contains many but not all of the tables
used in the search tool, as well as special tables to be
used with the Enronic visualization tool. Andrew did a substantial amount of
processing on the contents of the database to remove duplicates, normalize
names, and so on. This has been tested only on MySQL.
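
Because each labeled message was annotated by two people and no claim of consistency is made, anyone using the labels may want to measure agreement themselves. The following is a minimal sketch using Cohen's kappa. It assumes, purely for illustration, that the labels have already been loaded as one (label_a, label_b) pair per message; the function name, the input format, and the category strings are hypothetical and do not reflect the actual layout of the labeled files.

```python
from collections import Counter


def cohen_kappa(pairs):
    """Cohen's kappa for a list of (label_a, label_b) pairs.

    Assumption: each annotator assigned exactly one label per message.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one labeled message")

    # Fraction of messages on which the two annotators agree.
    observed = sum(1 for a, b in pairs if a == b) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(a for a, _ in pairs)
    counts_b = Counter(b for _, b in pairs)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four messages (not taken from the dataset).
pairs = [
    ("business", "business"),
    ("business", "personal"),
    ("personal", "personal"),
    ("business", "business"),
]
print(cohen_kappa(pairs))
```

A value near 1 indicates strong agreement, and a value near 0 indicates agreement no better than chance. If annotators were allowed several labels per message, a per-category version of this calculation would be needed instead.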