Mining Your Gmail Data - Part 1

Published on Monday, August 24, 2015

Photo by Jan Loyde Cabrera on Unsplash

There used to be a nifty tool for Outlook called Xobni, which would tell you all kinds of things about your email habits: things like "what times of the day do you send most of your emails?" and "how long does it take you to reply to a particular person?" Most of the data wasn't particularly useful, but sometimes you could find interesting stuff. (Totally made-up example: "Hey, if I email all my expenses to Bob in accounting before 3 PM on Monday, I get reimbursed the next day; every time I send them later I don't get reimbursed until Friday.")

Xobni seems to be long gone, and I don't use Outlook anymore. But I'd kind of like to mine my many years of Gmail data to see if I can learn anything interesting. A couple of products make an attempt at this kind of thing - Gmail Meter, for example - but none of them provides a really extensive analysis, and they all have the icky requirement that you give them access to your Gmail.

Now, as it happens, I've also been looking to back up all of my Gmail in case Google turns evil[1], suddenly fails, or simply pulls a Reader on us. I want an automated daily backup of my Gmail data, preferably on my Synology so that it also gets backed up offsite. Enter Gmvault, which is a nifty Python tool for doing just that.

Gmvault is pretty sweet. Since it's built in Python, I can run it pretty much anywhere, including both my Synology and my day-to-day Windows machine. And, since it pulls down all the mail as .eml files, it's fantastic for data mining.

Coincidentally, I recently started learning how to use pandas; aggregating and analyzing a ton of my own Gmail data sounds like an awesome project for learning some more about pandas. So I'm going to spend a few posts describing what I'm doing and sharing some code for analyzing your Gmail.
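To give a rough idea of why a folder full of .eml files is so convenient, here is a minimal sketch that reads the headers of every .eml file under a directory using only Python's standard library. It is an illustration, not necessarily the code the later posts will use. The function names `read_headers` and `load_mailbox` are made up for this example, and the directory you pass in is an assumption: point it at wherever your Gmvault backup actually lives.

```python
from email import policy
from email.parser import BytesParser
from email.utils import parsedate_to_datetime
from pathlib import Path


def read_headers(eml_path):
    """Return a few headers from one .eml file as a plain dict (hypothetical helper)."""
    with open(eml_path, "rb") as handle:
        # headersonly=True skips body parsing; only the headers are needed here.
        msg = BytesParser(policy=policy.default).parse(handle, headersonly=True)

    try:
        date_header = msg["Date"]
        sent = parsedate_to_datetime(str(date_header)) if date_header else None
    except (TypeError, ValueError):
        # Missing or malformed Date headers are recorded as None rather than failing.
        sent = None

    return {
        "file": str(eml_path),
        "from": str(msg["From"] or ""),
        "to": str(msg["To"] or ""),
        "subject": str(msg["Subject"] or ""),
        "sent": sent,
    }


def load_mailbox(root):
    """Collect header dicts for every .eml file found anywhere under root."""
    return [read_headers(path) for path in sorted(Path(root).rglob("*.eml"))]
```

The result is a list of dicts, one per message, which can be handed straight to `pandas.DataFrame(...)` to get one row per email. From there, questions like "what hour of the day do I send the most mail?" become a matter of grouping on the `sent` column.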

[1] Or more evil, depending on your perspective.