|
UC Berkeley Enron Email Analysis Project
Starting with the Enron Email
dataset made available by MIT, SRI, and CMU, we have put together several
resources:
- A set of
categories developed in our
ANLP (Applied
Natural Language Processing) course, to be used for annotating a
subset of the Enron email messages.
- A subset of about 1700 labeled
email messages (4.5M). These were chosen in a semi-motivated
fashion (focusing on business-related emails and the California energy crisis
and on emails that occurred later in the collection, trying to avoid very
personal messages, jokes, and so on). Students in the ANLP
course annotated the selected messages with the category labels. Each message
was labeled by two people, but no claims of consistency, comprehensiveness, or
generality are made about these labelings.
- The Enronic
email visualization and clustering tool by Jeff
Heer, built on his prefuse
toolkit.
(1.9M jar file)
- A database representation (219 MB
compressed) of the Enron email collection, built by
Andrew Fiore and Jeff Heer,
containing the Enron email messages. This version contains many but not all of the tables
used in the search tool, as well as special tables to be
used with the Enronic visualization tool. Andrew did a substantial amount of
processing on the contents of the database to remove duplicates, normalize
names, and so on. This has been tested only on MySQL.
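
Because each labeled message was annotated by two people and no claim of consistency is made, anyone using the labels may want to measure agreement themselves. The following is a minimal sketch of Cohen's kappa, not something shipped with these resources. It assumes the labels have already been loaded into two parallel lists with one label per message per annotator; the category names and the variable names are hypothetical, and the actual file layout of the labeled subset is not assumed here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: from each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six messages (not taken from the dataset).
annotator_1 = ["business", "business", "personal",
               "business", "personal", "logistics"]
annotator_2 = ["business", "personal", "personal",
               "business", "personal", "business"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

If a message carries more than one category label per annotator, this single-label formulation does not apply directly; one option would be to compute kappa separately for each category as a present/absent decision.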
|