Downloading Your Email Metadata

Email provides a window into who we interact with and what we do. This tutorial describes how to get that data in the format you want.

We pay a lot of attention to how we interact with social networks, because so many people use Twitter, Facebook, etc. every day. It's fun for developers to play with this stuff. However, if you want to look at a history of your own interactions, there isn't a much better place to look (digitally) than your own email inbox.

Before you can explore though, you have to download the data. That's what you'll learn here, or more specifically, how to download your email metadata as a ready-to-use, tab-delimited file.

Setup

There are various ways to access your email. In this tutorial, you use Python, which provides libraries to handle email access and some useful functions to parse data. If you're on a Mac, you probably already have Python installed. For Windows computers, you can download and install Python if you haven't already.

I'm using a dated Python 2.5, but I think the code in this tutorial should work with newer versions. Please let me know if something breaks though.

The other thing you need: An email inbox accessible via IMAP and the server information. That's most modern email services, I think. Here's the information for Gmail and Yahoo Mail. I use Fastmail, and you can get their IMAP server information here.

Connect to the IMAP server

All set up? Good. Open your favorite text editor to get started.

You can of course just follow along with the code in the tutorial's download. For smaller scripts like this, I like to type things out to make sure I get everything.

The first thing to do is import the necessary libraries. In this case that's imaplib, email, getpass, and getaddresses from email.utils.

import imaplib, email, getpass
from email.utils import getaddresses

It'll be clear what these are for soon.

Like right now. Enter your email settings for your server and username. You use getpass() for the password, so that you don't have to store your password in plaintext. Instead, when you run this script, you'll be prompted for your password.

# Email settings
imap_server = 'YOUR_SERVER_HERE'
imap_user = 'YOUR_USERNAME_HERE'
imap_password = getpass.getpass()

Now connect and log in.

# Connection
conn = imaplib.IMAP4_SSL(imap_server)
(retcode, capabilities) = conn.login(imap_user, imap_password)

If everything went well, the variable retcode should be 'OK'. Otherwise, you might want to check your server and login information.
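A note on failures: imaplib generally raises imaplib.IMAP4.error when the server rejects a login, rather than handing back a code other than 'OK'. The sketch below wraps the login call so the server's complaint gets printed instead of a traceback. The name login_or_explain is made up for this example and is not part of imaplib.

```python
import imaplib


def login_or_explain(conn, user, password):
    """Log in and return True on success; print the reason and return False otherwise."""
    try:
        retcode, _ = conn.login(user, password)
    except imaplib.IMAP4.error as err:
        # A rejected login surfaces as an exception, not as a return code
        print("Login failed: %s" % err)
        return False
    return retcode == 'OK'
```

You would call it as login_or_explain(conn, imap_user, imap_password) in place of the bare conn.login() line, and stop the script if it returns False.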

Next up: Select the folder you want to fetch your email from. The name of the folder depends on what you want and what service you use. To get the actual names (and they need to be exact), enter conn.list() and run the script that you have so far.

It might also be useful to do all of this in the Python interpreter, so that you get instant feedback. Open your terminal or equivalent, start Python (by typing 'python'), and you should be able to enter the code covered above.

Anyway, let's say the folder is called "INBOX.Archive". Here's how you select it. I've included the commented out lines for reference.

# Specify email folder
# print conn.list()
# conn.select("INBOX.Sent Items")
conn.select("INBOX.Archive")
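The raw output of conn.list() is a little hard to read, so here is a rough sketch that pulls out just the folder names. It assumes Python 3 (where imaplib returns bytes) and the common response shape of flags, a delimiter, then the name. The helper folder_names and the pattern LIST_LINE are invented for this example, and servers that send unusual LIST lines may need a more careful parser.

```python
import re

# One LIST line looks like: (\HasNoChildren) "." "INBOX.Archive"
LIST_LINE = re.compile(r'\((?P<flags>.*?)\) (?P<delim>"[^"]*"|NIL) (?P<name>.*)')


def folder_names(list_data):
    """Pull the folder names out of the second value returned by conn.list()."""
    names = []
    for line in list_data:
        if isinstance(line, bytes):
            line = line.decode('utf-8', 'replace')   # Python 3 returns bytes
        if not isinstance(line, str):
            continue   # skip anything that isn't a plain LIST line
        match = LIST_LINE.match(line)
        if match:
            names.append(match.group('name').strip('"'))
    return names
```

For example, after retcode, folders = conn.list(), printing folder_names(folders) should show one exact name per folder, ready to paste into conn.select().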

Search for your email

Now that you're connected, you can search your inbox to fetch the email that you want. For example, you might only want email from 2013.

# Search for email ids between dates specified
result, data = conn.uid('search', None, '(SINCE "01-Jan-2013" BEFORE "01-Jan-2014")')

Or you might have email aliases set up and you only want email sent to a specific address, since the start of 2014.

result, data = conn.uid('search', None, '(TO "user@example.org" SINCE "01-Jan-2014")')

Or maybe you want all of it.

result, data = conn.uid('search', None, 'ALL')

Note that the only thing that changes is the query in the last argument of conn.uid().
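If you plan to pull several years, typing the date strings by hand gets old. Here is a small sketch that builds the same kind of query from a year. The helpers imap_date and year_query are made up for this example. The month names are hard-coded because IMAP expects English abbreviations, and BEFORE excludes the date you give it, so the end date is the first day of the following year.

```python
import datetime

# IMAP wants English month abbreviations regardless of your locale
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def imap_date(day):
    """Format a datetime.date the way IMAP search keys expect, e.g. 01-Jan-2013."""
    return '%02d-%s-%04d' % (day.day, MONTHS[day.month - 1], day.year)


def year_query(year):
    """Build the search criteria for every message within one calendar year."""
    start = datetime.date(year, 1, 1)
    end = datetime.date(year + 1, 1, 1)   # BEFORE is exclusive
    return '(SINCE "%s" BEFORE "%s")' % (imap_date(start), imap_date(end))
```

You would then pass year_query(2013) as the last argument of conn.uid('search', None, ...).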

A search yields the unique id numbers of the emails that match your search criteria, packed into a single space-separated string. Split them, and then fetch the headers of the matching emails.

uids = data[0].split()

# Download headers
result, data = conn.uid('fetch', ','.join(uids), '(BODY[HEADER.FIELDS (MESSAGE-ID FROM TO CC DATE)])')

For the sake of simplicity you only fetch five header fields here, but if you want others, go wild.

In the fetch line, you essentially pass that command to the server with the unique ids as a comma-separated string, and you specify which header fields you want. The IMAP syntax isn't incredibly intuitive, but this mini manual is helpful. Or, if you're daring, you can look at the IMAP specifications direct from the source.
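One caveat if you run this on Python 3: imaplib hands back bytes there, so joining the ids with a plain ',' raises a TypeError. The sketch below is one way around that, assuming the search data has the usual one-element shape. The helper name uid_set is made up for this example.

```python
def uid_set(search_data):
    """Turn the data returned by a UID search into '1,2,3' for a single fetch call."""
    raw = search_data[0] or b''     # an empty search can come back as b'' or None
    if isinstance(raw, bytes):
        raw = raw.decode('ascii')   # Python 3 hands back bytes
    return ','.join(raw.split())
```

You would pass uid_set(data) as the second argument of conn.uid('fetch', ...), and skip the fetch entirely when it returns an empty string. Separately, the IMAP specification says that fetching BODY[...] can mark messages as read; BODY.PEEK[...] is the variant that leaves that flag alone, if it matters to you.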

In any case, the fetch is the actual downloading of your email headers. This takes up the most time when you run the full script. Parsing takes less than a second. It took about 15 seconds for me to download 9,000 headers on a standard home cable internet connection, and the resulting file was 1.2 megabytes. The more header fields and the more email you have, the longer it will take, but it's not too bad.

I came across some examples that took way longer. As in minutes instead of seconds. The key is getting all the headers at once with one call to the IMAP server.

Parse the returned data

So you have the data now. But, it's not in a nice readable tab-delimited file yet. You have to iterate through each item stored in the data variable (from when you fetched the headers), parse, and then spit out the format you want.

Start by creating the file. We'll call it raw-email-rec.tsv.

# Where data will be stored
raw_file = open('raw-email-rec.tsv', 'w')

And write the header of the TSV to the newly created file.

# Header for TSV file
raw_file.write("Message-ID\tDate\tFrom\tTo\tCc\n")

Time to iterate and parse. The code below is a big chunk, but here's what you're doing:

  1. Start a for loop.
  2. Check if the length of the current item is 2. Items of length 2 are email headers. Anything else (typically a closing parenthesis from the server's response) gets skipped.
  3. If it is a message, use message_from_string() to parse. Use get_all() to get each header field (message id, date, etc.).
  4. Put together a tab-delimited row of data.
  5. Write the row to raw_file.

And here's the same logic in code.

# Parse data and spit out info
for i in range(0, len(data)):
    
    # If the current item is _not_ an email header
    if len(data[i]) != 2:
        continue
    
    # Okay, it's an email header. Parse it.
    msg = email.message_from_string(data[i][1])
    mids = msg.get_all('message-id', None)
    mdates = msg.get_all('date', None)
    senders = msg.get_all('from', [])
    receivers = msg.get_all('to', [])
    ccs = msg.get_all('cc', [])
    
    row = "\t" if not mids else mids[0] + "\t"
    row += "\t" if not mdates else mdates[0] + "\t"
    
    # Only one person sends an email, but just in case
    for name, addr in getaddresses(senders):
        row += addr + " "
    row += "\t"
    
    # Space-delimited list of those the email was addressed to
    for name, addr in getaddresses(receivers):
        row += addr + " "
    row += "\t"
    
    # Space-delimited list of those who were CC'd
    for name, addr in getaddresses(ccs):
        row += addr + " "
    
    row += "\n"
    
    # Just going to output tab-delimited, raw data.
    raw_file.write(row)

You finished iterating, so close the file.

# Done with file, so close it
raw_file.close()
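For anyone on Python 3, here is a rough sketch of the same loop. It swaps in email.message_from_bytes() because the fetched headers arrive as bytes, and it uses the csv module to write the tabs, which also quotes any field that happens to contain a tab or newline. The helpers addresses and write_rows are made up for this example. One small difference from the code above: addresses are joined with single spaces, so there is no trailing space in each field.

```python
import csv
import email
from email.utils import getaddresses


def addresses(msg, field):
    """Space-delimited addresses found in one header field."""
    return ' '.join(addr for _, addr in getaddresses(msg.get_all(field, [])))


def write_rows(fetch_data, out):
    """Write a header line plus one tab-delimited row per fetched email to `out`."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(['Message-ID', 'Date', 'From', 'To', 'Cc'])
    for item in fetch_data:
        # Email headers arrive as (envelope, raw_header) tuples; skip the rest
        if not isinstance(item, tuple) or len(item) != 2:
            continue
        msg = email.message_from_bytes(item[1])
        writer.writerow([
            msg.get('message-id', ''),
            msg.get('date', ''),
            addresses(msg, 'from'),
            addresses(msg, 'to'),
            addresses(msg, 'cc'),
        ])
```

To use it, open the file with open('raw-email-rec.tsv', 'w', newline='') and pass that file object along with the data from the fetch call.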

Script done. Run the script (by typing "python fetch-raw.py" in the command line) and you should get a tab-delimited file called raw-email-rec.tsv in the same directory as your script.

Wrapping up

The email download can be broken into three parts.

  1. Connect to the IMAP server
  2. Search and download your email
  3. Parse and format

If you want to get headers for multiple folders, you can run the script multiple times changing the folder name each time. Don't forget to change the file name too, or you'll just be overwriting your data each time.
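One way to avoid the overwriting problem is to derive the file name from the folder name. This is only a sketch: tsv_name is a made-up helper, and the naming pattern is just a suggestion.

```python
import re


def tsv_name(folder):
    """Derive an output file name from a folder name, e.g. raw-email-INBOX.Archive.tsv."""
    # Replace anything awkward in a file name (spaces, slashes, brackets) with _
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', folder).strip('_')
    return 'raw-email-%s.tsv' % safe
```

With that in place, open(tsv_name("INBOX.Archive"), 'w') and open(tsv_name("INBOX.Sent Items"), 'w') write to two different files.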

Finally, if you just want your email metadata and don't care about how to do it, download the code for this tutorial. Change the values for imap_server and imap_user to your own information. You might also need to change the value for the folder and the search. Once you have that in order, you should be able to run the script and get your data.

About the Author

Nathan Yau is a statistician who works primarily with visualization. He earned his PhD in statistics from UCLA, is the author of two best-selling books — Data Points and Visualize This — and runs FlowingData. Introvert. Likes food. Likes beer. Follow him @flowingdata.
