My project requires me to parse the signatures of emails that arrive in my Gmail account. From each signature I have to extract the first name, last name, email address, etc. (only the sender's). Where should I start? (By "where to start" I mean: is there anything already in place for this?)
I have gone through this question, but it is about removing the signature, which is exactly the opposite of my requirement. Its answers do not solve my problem.
I know I can use regex to get this done, but I don't want to miss emails that do not follow signature netiquette, for example those that omit the "--" before the signature or the trailing hyphens.
If possible, please also point me to any open source JavaScript projects that provide exactly this functionality.
Thanks in advance.
Update: The signatures I am looking for are generally business related, so they contain HTML content or sometimes vCards directly.
Update: All I need is to take each line of the signature and extract the details from those lines.
asked Aug 3, 2015 at 7:52 by Vamshi Krishna Alladi (edited May 23, 2017 at 11:55)
- Can you give a few examples of input (the text you are working with) and desired output? – ʰᵈˑ Commented Aug 3, 2015 at 7:55
- The input CAN be HTML as well, because the mails I am working with are generally business mails. Here is a rough input based on my own profile: Vamshi Krishna Alladi | Product Software Engineer P +91 9123456789 E [email protected] W www.xyz. – Vamshi Krishna Alladi Commented Aug 3, 2015 at 8:00
- Without exact input you have (the HTML code, perhaps), it will be impossible to help you. – Wiktor Stribiżew Commented Jan 4, 2016 at 11:58
- There is no specific input for this. That is exactly what I am trying to say. – Vamshi Krishna Alladi Commented Jan 5, 2016 at 4:59
- The input I gave earlier in the comments was just to give a gist of what the signature would look like. There is no specific format for the input. – Vamshi Krishna Alladi Commented Jan 5, 2016 at 5:01
4 Answers
There are several potential parts to answering this question.
Signatures within the gmail interface
Within the Gmail interface, signatures are fairly easy to grab. They are wrapped in <font color="#888888">, so getting them with an XML reader should be pretty easy if you're working inside the Gmail interface. This won't catch any signatures that Gmail doesn't detect.
Signatures in messages sent from gmail using the signature setting
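Just look for <div class=3D"gmail_signature"> in the HTML version of the email. (The =3D is the quoted-printable encoding of = in the raw message source; after decoding, the attribute reads class="gmail_signature".)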
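As a rough illustration of that lookup, here is a minimal sketch that pulls the contents of the first gmail_signature div out of an HTML body. The function name extractGmailSignature is hypothetical, the =3D replacement is a shortcut rather than a full quoted-printable decoder, and the tag counting assumes reasonably well-formed markup. A real HTML parser would be more robust.

```javascript
// Hypothetical helper (not from the answer): returns the inner HTML of the
// first <div class="gmail_signature">, or null if none is found.
function extractGmailSignature(html) {
  // Shortcut: raw quoted-printable source shows "=" as "=3D".
  // This is NOT a full quoted-printable decoder.
  const decoded = html.replace(/=3D/gi, "=");

  const open = /<div\b[^>]*class="[^"]*\bgmail_signature\b[^"]*"[^>]*>/i.exec(decoded);
  if (!open) return null;

  const start = open.index + open[0].length;
  const divTag = /<\/?div\b[^>]*>/gi;
  divTag.lastIndex = start;

  // Count nested <div> ... </div> pairs until the signature div closes.
  let depth = 1;
  let match;
  while ((match = divTag.exec(decoded)) !== null) {
    depth += match[0][1] === "/" ? -1 : 1;
    if (depth === 0) return decoded.slice(start, match.index);
  }
  return null; // unbalanced markup
}
```

The returned fragment still contains tags, so it would need to be reduced to rendered text before any line-by-line matching.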
A General Method of Signature Parsing
I am arbitrarily limiting the target to the contact information of the sender. As such, it makes most sense to get only contact information in the signature. As many emails contain contact information for people other than the sender, the first step is to isolate the signature.
Once the signature is isolated, each line can be matched against regex patterns. I am by no means a regex expert, so I won't attempt to describe the actual patterns here.
What follows is a method, not code. The actual implementation should be pretty straightforward.
Grabbing signatures from an email
- Remove everything except rendered text in the target message. Leave \n newlines in the proper places.
- Work from the bottom of the message, storing each line in a variable. Stop when you hit a long line (60+ characters; the exact number needs experimentation [1]). Don't include the long line.
- If there is a run of consecutive \n somewhere in the middle, remove it and everything above it. This removes any remaining short lines and most closing salutations. [2]
Now the signature is isolated.
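The steps above could be sketched roughly as follows. This assumes the message has already been reduced to plain text. The name isolateSignature is made up for illustration, the 60-character threshold is only the starting guess from the answer, and cutting at the last blank line is one interpretation of "a number of \n in the middle".

```javascript
// Sketch of the isolation steps. `text` is the rendered plain text of one message.
function isolateSignature(text, maxLineLength = 60) {
  const lines = text
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trimEnd());

  // Ignore blank lines at the very end of the message.
  while (lines.length > 0 && lines[lines.length - 1] === "") lines.pop();

  // Walk upward from the bottom and stop at the first long (body) line.
  let start = lines.length;
  while (start > 0 && lines[start - 1].length < maxLineLength) start--;
  let candidate = lines.slice(start);

  // Assumption: if blank lines remain, keep only what follows the last one.
  const lastBlank = candidate.lastIndexOf("");
  if (lastBlank !== -1) candidate = candidate.slice(lastBlank + 1);

  return candidate;
}
```

As the footnotes point out, very short messages and hand-typed signatures will defeat this, so treat the result as a candidate rather than a guaranteed signature.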
Here are some assumptions about the parts remaining. Unless the order is specified, assume they can be in any order.
A) End of message and closing greeting will be the topmost line(s)
B) Name
C) Phone Number
D) Email Address
E) Mailing Address
F) Tag line or witty saying, etc.
[1] The 60-character threshold is based on the fact that RFC 2822 recommends that lines be no more than 78 characters long. Gmail respects this. Most signature lines will be shorter than that, unless the whole address is written as a single line. Signatures for extremely short emails (< 20 words) will not be properly detected with this method, but it would be trivial to first check the total message length and use different code to deal with that.
[2] As most signatures are automatically added, there is usually a series of newlines before them. However, hand-typed signatures may not follow this pattern, so depending on what type of emails you're processing, you may find this step unhelpful or detrimental.
Identifying parts of the signature
Now that you have reduced the likelihood of false positive matches for your regex, you can see if the remaining lines match any of your patterns.
Replace common dividers with newlines; | is a common example.
Check if any of the lines match your regex patterns. If they do, remove them from further consideration. The hardest part will be differentiating names from other things. Suggested order:
email
phone
zip code (then address, if you find a zip code)
What is left should be the closing salutation, name, tag line, and any malformed parts of the items above. Be aware that while regex is mostly used to find errors (for validation), here you want to match even malformed values, remove those lines from further processing, and only then validate or normalize.
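To make the matching order concrete, here is a small sketch. The three patterns are deliberately loose placeholders of my own (the answer does not give any), the zip pattern only covers US-style codes, and classifySignatureLines is a hypothetical name. Real patterns would need tuning against your mail.

```javascript
// Illustrative, intentionally permissive patterns, tried in the suggested order.
const PATTERNS = [
  ["email", /[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+/],
  ["phone", /\+?\d[\d\s().-]{6,}\d/],
  ["zip", /\b\d{5}(?:-\d{4})?\b/], // assumption: US-style zip codes only
];

function classifySignatureLines(lines) {
  const found = { email: [], phone: [], zip: [] };
  const rest = [];

  // Treat "|" as a divider, i.e. as if it were a newline.
  const parts = lines
    .flatMap((line) => line.split("|"))
    .map((part) => part.trim())
    .filter((part) => part !== "");

  for (const part of parts) {
    // The first matching pattern claims the part; unmatched parts are left over.
    const hit = PATTERNS.find(([, pattern]) => pattern.test(part));
    if (hit) found[hit[0]].push(part);
    else rest.push(part);
  }
  return { found, rest };
}
```

The matched parts are stored unmodified, so validation and normalization (stripping a leading "P" or "E" label, for instance) can happen in a later pass.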
In my view, the hardest part of figuring out which part is which is distinguishing names from tag lines. Here are some suggestions that should help for common cases:
- Names consist of a small number of words.
- Names contain periods in certain places: after 1-3 letter words. (French has M. for Monsieur.)
- Names don't contain much punctuation, probably only dashes and apostrophes in addition to the periods above. You might run into issues with commas before titles, for example, John Lawyer, Esq.
- Tag lines likely end with a comma
- Capitalization can hint (but not definitively say) whether something is a name.
Further, you can blacklist common closing salutation words (sincerely, thank(s), cheers, etc.). If that narrows it to one or two lines, the upper one is most likely the name and the lower one is most likely a tag line.
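A few of these hints can be combined into a rough filter, sketched below. The names looksLikeName and guessName are hypothetical, the four-word limit is an arbitrary guess, "regards" is my own addition to the blacklist, and the capitalization check is stricter than the "hint only" advice above, so names with lowercase particles (van, de) will be missed.

```javascript
// Closing salutations to skip; "regards" is an assumed extra beyond the answer's list.
const CLOSINGS = /^(sincerely|thanks?|cheers|regards|best regards)\b/i;

function looksLikeName(line) {
  const words = line.trim().split(/\s+/).filter(Boolean);
  // Names consist of a small number of words (4 is an arbitrary cutoff).
  if (words.length === 0 || words.length > 4) return false;
  // Only letters, spaces, periods, commas, apostrophes and dashes are allowed.
  if (/[^\p{L}\s.,'-]/u.test(line)) return false;

  return words.every((word) => {
    const bare = word.replace(/[.,]+$/, "");
    // Periods should only follow short words such as "M.", "Dr." or "Esq.".
    if (word.includes(".") && bare.length > 3) return false;
    // Capitalization is treated as required here, which is a simplification.
    return /^\p{Lu}/u.test(bare);
  });
}

function guessName(lines) {
  const candidates = lines.filter((line) => !CLOSINGS.test(line.trim()));
  return candidates.find(looksLikeName) ?? null;
}
```

For example, guessName(["Sincerely,", "John Lawyer, Esq.", "Making software better."]) skips the salutation and rejects the tag line because "software" is lowercase, leaving the middle line as the name candidate.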
For more information about identifying names, see Find names with Regular Expression. Remember that while it should be easy to write a solution in the general case, natural language processing is HUGE and beyond the scope of mortals like me. Named Entity Recognition is a known challenge. Hopefully, what I've described will get you something in most cases.
I guess the solution for this is not just a few lines of code. I think it requires some kind of dedicated processing, something like a signature parser or NLP. This question has been open since August, so I guess it's time to close it now.
I don't use Gmail, so I actually built this answer from the only Gmail message I have that contains a signature. It's spam. Still, let's see how far this gets you...
// Find the "Show trimmed content" toggle, go up one level, then take the next element.
var sig = document.querySelector('div[data-tooltip="Show trimmed content"]')
    .parentNode.nextElementSibling;
This should set a new variable called sig to the element immediately following the hide/show dots. Note that it will also find quoted conversations. It's a start, not a full solution.
Element.querySelector() is a handy way of searching for elements by CSS selector. In this case, I searched for the tooltip. The element we want is actually up a level and then the next element (something CSS cannot do but JS can).
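If the toggle is not on the page, the chained property access throws a TypeError. A slightly more defensive variant might look like the sketch below. The function name findTrimmedContent is made up, and the selector is simply the one observed in this answer, so it may not match Gmail's current markup.

```javascript
// `root` is anything that offers querySelector, e.g. `document` on an open message.
function findTrimmedContent(root) {
  const toggle = root.querySelector('div[data-tooltip="Show trimmed content"]');
  // Optional chaining yields null instead of throwing when a step is missing.
  return toggle?.parentNode?.nextElementSibling ?? null;
}
```

In the browser console this would be called as findTrimmedContent(document), and a null result means no trimmed block was found in the current view.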
There is an API for this that parses contact data from the signature. It also handles reply chains. See the example below.
https://www.sigparser.com
You can test the API on the Swagger detail page at https://api.sigparser.com
(I'm the creator of SigParser, btw.)
Here is an example response:
{
  "error": null,
  "contacts": [
    {
      "firstName": "Bill",
      "lastName": "Gates",
      "emailAddress": "[email protected]",
      "phoneNumber": null,
      "fax": null,
      "address": null,
      "title": null,
      "phoneNumbers": [
        {
          "rationalType": null,
          "type": "Mobile",
          "phoneNumber": "7774448888"
        }
      ],
      "twitterUrl": [
        {
          "emailAddress": "[email protected]",
          "url": "https://twitter.com/BillGates"
        }
      ],
      "linkedInUrl": [
        {
          "emailAddress": "[email protected]",
          "url": "https://www.linkedin.com/in/williamhgates/"
        }
      ]
    }
  ],
  "isSpammyLookingEmailMessage": false,
  "isSpammyLookingSender": false,
  "isSpam": false,
  "from_LastName": "Gates",
  "from_FirstName": "Bill",
  "from_Fax": null,
  "from_Phone": null,
  "from_Address": null,
  "from_Title": null,
  "from_MobilePhone": "7774448888",
  "from_OfficePhone": null,
  "from_LinkedInUrl": "https://www.linkedin.com/in/williamhgates/",
  "from_TwitterUrl": "https://twitter.com/BillGates",
  "from_EmailAddress": "[email protected]",
  "emails": [
    {
      "from_EmailAddress": "[email protected]",
      "from_Name": "Bill Gates",
      "textBody": "Hi, good seeing you the other day.\r\n--\r\nBill Gates\r\nCell 777-444-8888LinkedInTwitter",
      "htmlLines": [
        "<div>Hi, good seeing you the other day.</div>",
        "<div>--</div>",
        "<div>Bill Gates</div>",
        "<div>Cell 777-444-8888</div><a href=\"https://www.linkedin.com/in/williamhgates/\">LinkedIn</a><a href=\"https://twitter.com/BillGates\">Twitter</a>"
      ],
      "date": "2017-01-01T00:00:00",
      "didParseCorrectly": true,
      "to": [],
      "cc": []
    }
  ]
}
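Given a response shaped like the sample above, the sender's details could be pulled out with something like the following sketch. The field names are taken only from that sample and have not been checked against the current API, and senderFromResponse is a hypothetical helper name.

```javascript
// Picks the sender-level fields out of an already parsed response object.
// Field names come from the sample response only (an assumption about the API).
function senderFromResponse(response) {
  if (response.error) throw new Error(String(response.error));
  return {
    firstName: response.from_FirstName ?? null,
    lastName: response.from_LastName ?? null,
    email: response.from_EmailAddress ?? null,
    mobile: response.from_MobilePhone ?? null,
    title: response.from_Title ?? null,
  };
}
```

Called on the sample (after JSON.parse), this would return Bill / Gates with the mobile number 7774448888 and a null title. The contacts array holds the same kind of data per detected contact if you need more than the sender.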