Image from http://www.pbs.org/newshour/updates/trial_05-07-02.html

UC Berkeley Enron Email Analysis

    UC Berkeley Enron Email Analysis Project

    Starting with the Enron Email dataset made available by MIT, SRI, and CMU, we have put together several resources:

    • A powerful search interface for the Enron email collection, developed by Andrew Fiore and Marti Hearst. The interface is written in Python, connects to the MySQL database described below, and uses Lucene for text queries.

    • A set of categories developed by Marti and students in her ANLP (Applied Natural Language Processing) course, to be used for annotating a subset of the Enron email messages.

    • A subset of about 1700 labeled email messages (4.5 MB). Marti chose these in a semi-motivated fashion: she focused on business-related emails, on the California energy crisis, and on emails from later in the collection, and tried to avoid very personal messages, jokes, and so on. Students in Marti's ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, or generality are made about these labelings.

    • The Enronic email visualization and clustering tool by Jeff Heer, built on his prefuse toolkit (1.9 MB jar file).

    • A database representation (219 MB compressed) of the Enron email collection, built by Andrew Fiore and Jeff Heer, containing the Enron email messages. This version contains many, but not all, of the tables used in the search tool, as well as special tables for use with the Enronic visualization tool. Andrew did a substantial amount of processing on the contents of the database to remove duplicates, normalize names, and so on. The database has been tested only on MySQL.
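
    The page does not say how the duplicate removal and name normalization were actually done. As a rough illustration only, the Python sketch below shows one common approach: reduce each sender name to a canonical form, fingerprint each message by hashing its whitespace-normalized sender, subject, and body, and keep the first message seen for each fingerprint. The function names and the `sender`, `subject`, and `body` fields are hypothetical and do not reflect the real database schema.

```python
import hashlib
import re


def normalize_name(raw):
    """Reduce a display name to lowercase 'first last' form.

    Hypothetical rule set: strips surrounding quotes, flips
    'Last, First' ordering, and collapses repeated whitespace.
    """
    name = raw.strip().strip("\"'").strip()
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return re.sub(r"\s+", " ", name).strip().lower()


def message_key(sender, subject, body):
    """Fingerprint a message from its normalized sender, subject, and body."""
    parts = [normalize_name(sender), subject, body]
    canonical = "\n".join(re.sub(r"\s+", " ", p).strip() for p in parts)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep the first message for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_key(msg["sender"], msg["subject"], msg["body"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

    Two copies of a message that differ only in line wrapping or in how the sender's name is written would collapse to a single entry under this scheme. Copies that differ in substance, such as a forwarded message with added text, would be kept as separate messages.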