How to Extract Email (GMail) contents as text using imaplib via IMAP in Python 3.2.3
Let's say you want to find all the attachments in your GMail inbox larger than 10 MB, or maybe you want to download all the chat logs with a favorite person in one place. You can use Python to log in and perform a custom operation based on your requirements.
Prerequisites:
1. Python installed on your machine
2. IMAP enabled on your GMail account
We'll first use Python to log in to our inbox, followed by a few basic operations like choosing a label and searching, and finally extract the contents as text.
import imaplib
import email  # used later to parse the fetched messages

mail = imaplib.IMAP4_SSL('imap.gmail.com')  # imaplib implements a client connection based on the IMAPv4 protocol
mail.login('firstname.lastname@example.org', 'password')
# >> ('OK', [b'email@example.com Vineet Dhanawat authenticated (Success)'])
Selecting a Label
Your GMail account has multiple labels, like 'Inbox' and any custom ones defined by you. Let's say you want to search / download all emails labeled as 'Inbox', so we choose Inbox.
mail.list()  # lists all labels in GMail
mail.select('inbox')  # connect to the inbox
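The second item returned by mail.list() is a list of raw bytes lines, one per label, so it helps to pull the label names out before choosing one. Here is a minimal sketch of that parsing. The helper name parse_label and the sample lines are made up for illustration, and the regular expression assumes the usual `(flags) "delimiter" name` shape of a LIST response; your server's output may differ.

```python
import re

# One LIST response line looks roughly like: (flags) "delimiter" name
LIST_LINE = re.compile(br'\((?P<flags>.*?)\) "(?P<delim>.*?)" (?P<name>.*)')

def parse_label(line):
    """Return (flags, delimiter, name) for one line of mail.list() output."""
    match = LIST_LINE.match(line)
    if match is None:
        raise ValueError("unexpected LIST line: %r" % (line,))
    flags = match.group('flags').decode('ascii').split()
    delim = match.group('delim').decode('ascii')
    # Names are usually quoted; strip the surrounding quotes if present.
    name = match.group('name').decode('ascii').strip('"')
    return flags, delim, name

# Illustrative sample data in the assumed shape, not real server output.
sample = [
    b'(\\HasNoChildren) "/" "INBOX"',
    b'(\\HasChildren \\Noselect) "/" "[Gmail]"',
]
labels = [parse_label(line)[2] for line in sample]
print(labels)
```

With a live connection you would pass each item of the second element returned by mail.list() to parse_label and then hand the chosen name to mail.select().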
Searching Through Inbox
We'll search through the label and retrieve all the emails. Refer to the imaplib documentation for more usage.
result, data = mail.uid('search', None, "ALL")  # search and return UIDs instead of sequence numbers
uids = data[0].split()  # data[0] is a space-separated bytes string of UIDs
for x in range(len(uids)):
    latest_email_uid = uids[x]  # unique ID with respect to the selected label
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')  # fetch the full message (RFC822) for the given UID
    raw_email = email_data[0][1]  # the raw message bytes
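The search call hands back a one-item list holding a space-separated bytes string, which is easy to trip over. A small sketch of turning that into a plain list of UIDs follows. parse_uids is a hypothetical helper name, and the sample responses are invented to mimic what imaplib typically returns.

```python
def parse_uids(data):
    """Turn the data part of a UID SEARCH response into a list of UID strings."""
    # imaplib usually returns something like [b'101 102 105'];
    # a search with no matches typically gives [b''] or [None].
    if not data or not data[0]:
        return []
    return [uid.decode('ascii') for uid in data[0].split()]

print(parse_uids([b'101 102 105']))
print(parse_uids([b'']))
```

The same mail.uid('search', ...) call accepts other standard IMAP search keys. For example, mail.uid('search', None, 'LARGER', '10000000') should match messages whose total size exceeds roughly 10 MB. Note that this is the size of the whole message, not of a single attachment.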
Parsing Raw Email
- raw_email is NOT a string but a bytes object. Let's say raw_email = b'<email-contents>'; str() would convert the entire literal, including the b'' wrapper, into a string. So we need to use decode('utf-8') to keep only the <email-contents>.
- A multipart email (Content-Type: multipart/mixed) contains several parts, such as Content-Type: text/plain or Content-Type: application/octet-stream. Right now I'm extracting only the text part of the main body, without any attachments.
    # continue inside the same for loop as above (requires "import email" at the top of the script)
    raw_email_string = raw_email.decode('utf-8')  # converts the bytes to a string, removing b''
    email_message = email.message_from_string(raw_email_string)

    # this will loop through all the available parts in the mail
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":  # ignore attachments / html
            body = part.get_payload(decode=True)
            save_string = "D:\\Dump\\gmail\\email_" + str(x) + ".eml"  # location on disk
            with open(save_string, 'a') as myfile:
                myfile.write(body.decode('utf-8'))  # body is again a bytes object
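If you want to try the parsing step without touching a real mailbox, here is a small self-contained sketch. It builds a sample multipart message locally and extracts the text/plain part the same way as above. The function name extract_plain_text and the sample message are made up for illustration. It also uses email.message_from_bytes and each part's declared charset, which is a variation on the decode('utf-8') approach above that avoids assuming every message is UTF-8.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def extract_plain_text(raw_bytes):
    """Return the text/plain parts of a raw RFC822 message, joined by newlines."""
    message = email.message_from_bytes(raw_bytes)
    chunks = []
    for part in message.walk():
        if part.get_content_type() != 'text/plain':
            continue  # skip containers, html and attachments
        payload = part.get_payload(decode=True)  # bytes after base64/quoted-printable decoding
        charset = part.get_content_charset() or 'utf-8'  # fall back to utf-8 if undeclared
        chunks.append(payload.decode(charset, errors='replace'))
    return '\n'.join(chunks)

# Build a sample message locally so the example runs offline (illustrative data).
sample = MIMEMultipart('alternative')
sample['Subject'] = 'Hello'
sample.attach(MIMEText('plain body', 'plain', 'utf-8'))
sample.attach(MIMEText('<p>html body</p>', 'html', 'utf-8'))
raw_email = sample.as_string().encode('ascii')

print(extract_plain_text(raw_email))
```

In the real loop, raw_email would be the bytes fetched from the server instead of the locally built sample.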
This will fetch all the emails under the selected label and save them as text on your local machine. Similarly, you can download all the attachments, etc. If you want to do a few advanced searches, refer to yuji's blog.
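For the attachment case, one possible approach is to walk the parts again and pick out the ones that carry a filename. The sketch below is only that, a sketch: list_attachments is a hypothetical helper, the sample message is built locally so it runs offline, and treating "has a filename" as "is an attachment" is a simplifying assumption.

```python
import email
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def list_attachments(raw_bytes):
    """Return (filename, size_in_bytes) for every part that carries a filename."""
    message = email.message_from_bytes(raw_bytes)
    found = []
    for part in message.walk():
        filename = part.get_filename()
        if filename is None:
            continue  # body text and containers have no filename
        data = part.get_payload(decode=True)  # decoded attachment bytes
        if data is None:
            continue  # multipart containers return None here
        found.append((filename, len(data)))
    return found

# Illustrative message with one 2048-byte attachment.
sample = MIMEMultipart('mixed')
sample.attach(MIMEText('see attached', 'plain', 'utf-8'))
attachment = MIMEApplication(b'\x00' * 2048)
attachment.add_header('Content-Disposition', 'attachment', filename='report.bin')
sample.attach(attachment)

print(list_attachments(sample.as_string().encode('ascii')))
```

To actually save an attachment, you would write the decoded bytes to a file opened in 'wb' mode. The size returned here could also be compared against a threshold such as 10 MB.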
You can also run the above as a script. Just copy all of the above code into a single .py file and run it.
How are you planning to use it? Do share with us in the comments.