Downloading Your Email Metadata
We spend a lot of attention on how we interact with social networks, because so many people use Twitter, Facebook, etc. every day. It's fun for developers to play with this stuff. However, if you want to look at a history of your own interactions, there isn't a much better place to look (digitally) than your own email inbox.
Before you can explore though, you have to download the data. That's what you'll learn here, or more specifically, how to download your email metadata as a ready-to-use, tab-delimited file.
I'm using a dated Python 2.5, but I think the code in this tutorial should work with newer versions. Please let me know if something breaks though.
There are various ways to access your email. In this tutorial, you use Python, which provides libraries to handle email access and some useful functions to parse data. If you're on a Mac, you probably already have Python installed. For Windows computers, you can download and install Python if you haven't already.
The other thing you need: An email inbox accessible via IMAP and the server information. That's most modern email services, I think. Here's the information for Gmail and Yahoo Mail. I use Fastmail, and you can get their IMAP server information here.
Connect to the IMAP server
You can of course just follow along with the code in the tutorial's download. For smaller scripts like this, I like to type things out to make sure I get everything.
All set up? Good. Open your favorite text editor to get started.
The first thing to do is import the necessary libraries. In this case that's imaplib, email, getpass, and getaddresses from email.utils.
import imaplib, email, getpass
from email.utils import getaddresses
It'll be clear what these are for soon.
Like right now. Enter your email settings for your server and username. You use getpass() for the password, so that you don't have to store your password in plaintext. Instead, when you run this script, you'll be prompted for your password.
# Email settings
imap_server = 'YOUR_SERVER_HERE'
imap_user = 'YOUR_USERNAME_HERE'
imap_password = getpass.getpass()
Now connect and log in.
# Connection
conn = imaplib.IMAP4_SSL(imap_server)
(retcode, capabilities) = conn.login(imap_user, imap_password)
If everything went well, the variable retcode should be 'OK'. Otherwise, you might want to check your server and login information.
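One caveat: in current versions of imaplib, a rejected login typically raises imaplib.IMAP4.error rather than handing back a code other than 'OK', so it can help to catch that too. Here's a minimal sketch. The helper name login_or_explain is made up for illustration and isn't part of the tutorial's download.

```python
import imaplib


def login_or_explain(conn, user, password):
    """Try to log in and return (succeeded, message).

    conn is any object with a login() method, such as the IMAP4_SSL
    connection created above. This helper is a hypothetical convenience.
    """
    try:
        retcode, _response = conn.login(user, password)
    except imaplib.IMAP4.error as exc:
        # Bad credentials or a refused login usually end up here.
        return False, "Login failed: %s" % exc
    return retcode == 'OK', retcode
```

In the script, you could call it as ok, message = login_or_explain(conn, imap_user, imap_password) and stop early if ok is False.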
Next up: Select the folder you want to fetch your email from. The name of the folder depends on what you want and what service you use. To get the actual names (and they need to be exact), enter conn.list() and run the script that you have so far.
It might also be useful to do all of this in the Python interpreter, so that you get instant feedback. Open your terminal or equivalent, start Python (by typing 'python'), and you should be able to enter the code covered above.
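The response from conn.list() is fairly raw: one line per folder, each with flags, a hierarchy delimiter, and then the folder name. A rough sketch for pulling out just the names might look like this. The function folder_names and the sample lines are hypothetical, and it assumes simple quoted or bare names. Servers that return names as literals or with non-ASCII characters would need more handling.

```python
import re

# Matches lines such as: (\HasNoChildren) "." "INBOX.Archive"
LIST_LINE = re.compile(r'\((?P<flags>[^)]*)\) (?P<delim>"[^"]*"|NIL) (?P<name>.*)')


def folder_names(list_data):
    """Extract folder names from the second item returned by conn.list()."""
    names = []
    for raw in list_data:
        # Skip anything that is not a plain line (for example, literal tuples).
        if not isinstance(raw, (bytes, str)):
            continue
        line = raw.decode('utf-8', 'replace') if isinstance(raw, bytes) else raw
        match = LIST_LINE.match(line)
        if match:
            names.append(match.group('name').strip('"'))
    return names


# Made-up sample of what a server might send back.
sample_list = [
    b'(\\HasNoChildren) "." "INBOX.Archive"',
    b'(\\HasNoChildren) "." "INBOX.Sent Items"',
    b'(\\HasChildren) "." INBOX',
]
names = folder_names(sample_list)
```

With a live connection, you would pass in the second item of the pair that conn.list() returns.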
Anyway, let's say the folder is called "INBOX.Archive". Here's how you select it. I've included the commented out lines for reference.
# Specify email folder
# print conn.list()
# conn.select("INBOX.Sent Items")
conn.select("INBOX.Archive", readonly=True)
Also notice that readonly in conn.select() is set to True so that your email isn't marked as read when you fetch headers.
Search for your email
Now that you're connected, you can search your inbox to fetch the email that you want. For example, you might only want email from 2013.
# Search for email ids between dates specified
result, data = conn.uid('search', None, '(SINCE "01-Jan-2013" BEFORE "01-Jan-2014")')
Or you might have email aliases set up and you only want email sent to a specific address, since the start of 2014.
result, data = conn.uid('search', None, '(TO "user@example.org" SINCE "01-Jan-2014")')
Or maybe you want all of it.
result, data = conn.uid('search', None, 'ALL')
Note that the only thing that changes is the query in the last argument of conn.uid().
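If you find yourself changing dates often, you could build the query string instead of typing it. This is only a sketch, and the helper names imap_date and year_query are invented for this example. The month names are spelled out by hand because IMAP expects English abbreviations regardless of your system's locale.

```python
import datetime

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def imap_date(day):
    """Format a date the way the search examples do, e.g. 01-Jan-2013."""
    return '%02d-%s-%04d' % (day.day, MONTHS[day.month - 1], day.year)


def year_query(year):
    """Build a search query that covers one calendar year."""
    start = datetime.date(year, 1, 1)
    end = datetime.date(year + 1, 1, 1)
    return '(SINCE "%s" BEFORE "%s")' % (imap_date(start), imap_date(end))


query_2013 = year_query(2013)
```

You would then pass query_2013 as the last argument in place of the hand-typed string.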
A search yields a list of unique id numbers for each email that matches your search criteria. Split them, and then fetch the headers of the matching emails.
uids = data[0].split()
# Download headers
result, data = conn.uid('fetch', ','.join(uids), '(BODY[HEADER.FIELDS (MESSAGE-ID FROM TO CC DATE)])')
For the sake of simplicity you only fetch five header fields here, but if you want others, go wild.
In the fetch line, you essentially pass that command to the server with the unique ids as a comma-separated string, and you specify which header fields you want. The IMAP syntax isn't incredibly intuitive, but this mini manual is helpful. Or, if you're daring, you can look at the IMAP specifications direct from the source.
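If you run this under Python 3, keep in mind that imaplib hands back bytes rather than strings, so the ids need a little extra care before they go into the fetch. Here's one way that might look. The function name uid_set is an assumption for this sketch, not something from imaplib.

```python
def uid_set(search_data):
    """Turn the data from a uid search into a comma-separated string.

    search_data is expected to look like [b'1 2 5'], which is the shape
    imaplib returns under Python 3. An empty result gives an empty string.
    """
    if not search_data or not search_data[0]:
        return ''
    uids = search_data[0].split()
    # Join as bytes first, then decode once into a regular string.
    return b','.join(uids).decode('ascii')


example = uid_set([b'1 2 5'])
```

In the script, you would pass uid_set(data) as the second argument of the fetch call. If it comes back empty, nothing matched your search and there is nothing to fetch.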
In any case, the fetch is the actual downloading of your email headers. This takes up the most time when you run the full script. Parsing takes less than a second. It took about 15 seconds for me to download 9,000 headers on a standard home cable internet connection, and the resulting file was 1.2 megabytes. Obviously, the more header fields and the more email you have, the longer it will take, but it's not too bad.
I came across some examples that took way longer. As in minutes instead of seconds. The key is getting all the headers at once with one call to the IMAP server.
Parse the returned data
So you have the data now. But, it's not in a nice readable tab-delimited file yet. You have to iterate through each item stored in the data variable (from when you fetched the headers), parse, and then spit out the format you want.
Start by creating the file. We'll call it raw-email-rec.tsv.
# Where data will be stored
raw_file = open('raw-email-rec.tsv', 'w')
And write the header of the TSV to the newly created file.
# Header for TSV file
raw_file.write("Message-ID\tDate\tFrom\tTo\tCc\n")
Time to iterate and parse. The code below is a big chunk, but here's what you're doing:
- Start a for loop.
- Check if the length of current item is 2. Those of length 2 are email headers. Those that are not of length 2 are something else.
- If it is a message, use message_from_string() to parse. Use get_all() to get each header field (message id, date, etc.).
- Put together a tab-delimited row of data.
- Write the row to the file.
And here's the same logic in code.
# Parse data and spit out info
for i in range(0, len(data)):

    # If the current item is _not_ an email header
    if len(data[i]) != 2:
        continue

    # Okay, it's an email header. Parse it.
    msg = email.message_from_string(data[i][1])
    mids = msg.get_all('message-id', None)
    mdates = msg.get_all('date', None)
    senders = msg.get_all('from', [])
    receivers = msg.get_all('to', [])
    ccs = msg.get_all('cc', [])

    row = "\t" if not mids else mids[0] + "\t"
    row += "\t" if not mdates else mdates[0] + "\t"

    # Only one person sends an email, but just in case
    for name, addr in getaddresses(senders):
        row += addr + " "
    row += "\t"

    # Space-delimited list of those the email was addressed to
    for name, addr in getaddresses(receivers):
        row += addr + " "
    row += "\t"

    # Space-delimited list of those who were CC'd
    for name, addr in getaddresses(ccs):
        row += addr + " "

    row += "\n"

    # Just going to output tab-delimited, raw data.
    raw_file.write(row)
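The loop above is written for Python 2, where the fetched headers are plain strings. Under Python 3 they arrive as bytes, so message_from_string() won't accept them. A possible Python 3 variant is sketched below. The function names header_to_row and rows_from_fetch and the sample header are made up for illustration. Note one small difference: this version joins addresses with single spaces and leaves no trailing space in each field.

```python
import email
from email.utils import getaddresses


def header_to_row(raw_header):
    """Build one tab-delimited row from the raw bytes of a header block."""
    msg = email.message_from_bytes(raw_header)
    mids = msg.get_all('message-id', [])
    mdates = msg.get_all('date', [])
    fields = [
        mids[0].strip() if mids else '',
        mdates[0].strip() if mdates else '',
    ]
    # From, To, and Cc each become a space-delimited list of addresses.
    for header in ('from', 'to', 'cc'):
        addrs = getaddresses(msg.get_all(header, []))
        fields.append(' '.join(addr for _name, addr in addrs))
    return '\t'.join(fields) + '\n'


def rows_from_fetch(fetch_data):
    """Yield a row for each 2-item entry, skipping the other items."""
    for item in fetch_data:
        if isinstance(item, tuple) and len(item) == 2:
            yield header_to_row(item[1])


# Made-up data in the shape a fetch returns: a (description, header) pair
# followed by a closing item that is not a header.
sample_header = (b'Message-ID: <abc123@example.org>\r\n'
                 b'Date: Tue, 01 Jan 2013 10:00:00 -0800\r\n'
                 b'From: Alice <alice@example.org>\r\n'
                 b'To: Bob <bob@example.org>, carol@example.org\r\n'
                 b'\r\n')
sample_fetch = [(b'1 (UID 1 BODY[HEADER.FIELDS (MESSAGE-ID FROM TO CC DATE)] {170}',
                 sample_header), b')']
rows = list(rows_from_fetch(sample_fetch))
```

Each string in rows can go straight to raw_file.write().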
You finished iterating, so close the file.
# Done with file, so close it
raw_file.close()
Script done. Run the script (by typing "python fetch-raw.py" in the command line) and you should get a tab-delimited file called raw-email-rec.tsv in the same directory as your script.
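To check that the file came out the way you expect, you could read it back with Python's csv module. This is a small sketch under the assumption that the file has the five columns written above. The helper read_rows and the sample row are hypothetical.

```python
import csv
import io


def read_rows(tsv_file):
    """Read the tab-delimited output into a list of dictionaries."""
    reader = csv.DictReader(tsv_file, delimiter='\t', quoting=csv.QUOTE_NONE)
    return list(reader)


# A made-up sample in the same layout the script writes, including the
# trailing space after each address.
sample = io.StringIO(
    "Message-ID\tDate\tFrom\tTo\tCc\n"
    "<abc123@example.org>\tTue, 01 Jan 2013 10:00:00 -0800\t"
    "alice@example.org \tbob@example.org carol@example.org \t\n"
)
rows = read_rows(sample)
recipients = rows[0]['To'].split()
```

For the real thing, open raw-email-rec.tsv and pass the file object to read_rows() in place of the sample.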
The email download can be broken into three parts.
- Connect to the IMAP server
- Search and download your email
- Parse and format
If you want to get headers for multiple folders, you can run the script multiple times changing the folder name each time. Don't forget to change the file name too, or you'll just be overwriting your data each time.
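One way to avoid overwriting anything is to derive the file name from the folder name. Here's a quick sketch. The function tsv_name_for and the naming pattern are just one possible choice, not what the tutorial's download does.

```python
import re


def tsv_name_for(folder):
    """Build an output file name from an IMAP folder name."""
    # Replace anything that is awkward in a file name with a dash.
    safe = re.sub(r'[^A-Za-z0-9._-]+', '-', folder).strip('-')
    return 'raw-email-%s.tsv' % safe


archive_name = tsv_name_for('INBOX.Archive')
sent_name = tsv_name_for('INBOX.Sent Items')
```

You would then pass the result to open() instead of the fixed file name.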
Finally, if you just want your email metadata and don't care about how to do it, download the code for this tutorial. Change the values for imap_server and imap_user to your own information. You might also need to change the value for the folder and the search. Once you have that in order, you should be able to run the script and get your data.