doc/lamsonproject.org/input/blog/2009-07-09.txt

Title: Lamson's Bounce Detection Algorithm

I just finished committing 0.9.6 code for doing bounce detection and analysis
with Lamson.  It's part of the new mailing list example I'm coding up for the
0.9.6 release which I'll be running on a free mailing list site I'm going 
to release soon.  In this blog post I'd like to go through the bounce detection
algorithm and get some feedback and samples from people.  So far it works great for
the samples I have, but I want it to be fairly bullet proof.

What I have here is a quick description of how Lamson is doing bounce detection, 
and then how it gives you parsed information to do analysis beyond that.  I'm hoping
that people who have to deal with bounces from systems like MS Exchange can possibly
either provide samples or advice so I can tune the algorithm better.

h2. The State of RFC 3464 And 1893

The two big RFCs for bounce handling are "RFC3464 - An Extensible Message Format for Delivery Status Notifications":http://www.faqs.org/rfcs/rfc3464.html
and "RFC1893 - Enhanced Mail System Status Codes":http://www.faqs.org/rfcs/rfc1893.html which
cover the headers and error messages you'll encounter.

In RFC3464 there's various headers that are given which a bounce message should have,
as well as the double and sometimes triple multipart-mime wrapping you need to do
to give back the bounced message.  A major pain with this standard is that the headers
are spread out across the different nested parts, some might be duplicated, and the format
is *never* respected by the various MTAs out there.

With RFC1893 you have a large number of status codes, but they are combined in weird ways
such that you have a primary code with the first digit, a secondary code with the
second, and then a "third" code that's the second and third digit combined.  Unlike HTTP
error codes where there's just a single number mapped to a single status code, in RFC1893
it's this strangely nested first+second+(second+third) thing.

Adding to this is there's a few headers where you can put these status codes, and in 
various formats.  Here's a sample of the *parsed* Diagnostic-Code from a gmail bounce
message:

<pre class="code prettyprint">
 'Diagnostic-Code': [(u'smtp',
                      u'550-5.1.1',
                      u"The email account that you tried to reach does...")],
</pre>

The end result is that if you want to determine if a message has bounce (not even hard vs. soft)
then you have to go trolling through nested attachments looking for headers and parsing their values
on the chance they *might* be important to the bounce.

h2. Soft and Hard

The general consensus is that if a *primary status code* is greater than 4 then
the message is a *hard* bounce, anything else is a soft bounce.  Of course there's no way
this is an absolute truth, and I'm sure there will end up being various complex 
things people will want to do with a soft vs. a hard bounce.

Since there's no way Lamson could know what you want to base your decisions to honor a 
soft vs. hard bounce, it simply lets you figure out which type it is, and then analyze
the information it's collected to determine the bounce.

h2. Lamson's Simplification

Lamson already converts the email into a normalized form, so the difficult part of
parsing out the bounce information is simply trolling through the parts and finding
all these randomly dispersed bounce headers.  Then it's a matter of parsing the values
into something that can be use for analysis in code.

To simplify this, Lamson uses a dict of headers and matching regexes to search for in
the email:

<pre class="code prettyprint">
BOUNCE_MATCHERS = {
    'Action': re.compile(r'(failed|delayed|delivered|relayed|expanded)', re.IGNORECASE | re.DOTALL),
    'Content-Description': 
        re.compile(r'(Notification|Undelivered Message|Delivery Report)', re.IGNORECASE | re.DOTALL),
    'Diagnostic-Code': re.compile(r'(.+);\s*([0-9\-\.]+)?\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Final-Recipient': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Received': re.compile(r'(.+)', re.IGNORECASE | re.DOTALL),
    'Remote-Mta': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Reporting-Mta': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Status': re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)', re.IGNORECASE | re.DOTALL)
}
</pre>

Each of these headers shows up about once, maybe more in the parts of a message
that has bounced information, and they don't show up at all or rarely in regular
messages.  What Lamson does is just walk through each part, finds the matches, parses
with the regex, and then produces something like this:

<pre class="code prettyprint">
{'Action': [(u'failed',)],
 'Content-Description': [(u'Notification',),
                         (u'Undelivered Message',),
                         (u'Delivery report',)],
 'Diagnostic-Code': [(u'smtp',
                  u'550-5.1.1',
                  u"The email account that you tried to reach does...")],
 'Final-Recipient': [(u'rfc822',
              u'asdfasdfasdfasdfasdfasdfewrqertrtyrthsfgdfgadfqeadvxzvz@gmail.com')],
 'Received': [(u'by mail.zedshaw.com ...',)],
 'Remote-Mta': [(u'dns', u'gmail-smtp-in.l.google.com')],
 'Reporting-Mta': [(u'dns', u'mail.zedshaw.com')],
 'Status': [(u'5', u'1', u'1')]}
</pre>

I've abbreviated some of the messages since they get stupidly long, but this shows you the information
you get now.  Lamson also provides a nice API through the lamson.bounce.BounceAnalyzer class that
you'll see in a moment.

bq. If you know of additional headers and formats that show up in bounces, let me know.

h2. Bounce Probability

Now, we've collected up all the different headers you might find in a bounce package, and we've
attempted to parse the values into something meaningful.  That means, the more of these headers
that we find and successfully parse, the more likely the message is a bounce.

Lamson keeps a score while it's finding the above headers, and for each one it finds and parses
it adds a "point".  It then produces a probability <= 1.0 that the message is a bounce or not.
The more headers it finds, the closer to 1.0, the higher the chance.  So far it looks like any
message above 0.3 is most likely a bounce message, with only a few spams that are below that.

Checking for a bounce then looks like this:

<pre class="code prettyprint">
    bm = mail.MailRequest(None,None,None, open("tests/bounce.msg").read())
    assert bm.is_bounce()
    assert bm.bounce
    assert bm.bounce.score == 1.0
    assert bm.bounce.probable()
</pre>

Which is from the test suite, and shows you various ways to get information
about the probability of the message being a bounce.

h2. Bounce Header Information

Finally, you'll need to get at this parsed information to make any
final decisions about the bounce.  You might want to keep track of
which MTAs consistently fail and include them in your black lists.
You might want to tune your retries for different kind of soft
bounces.

Whatever you need to do, you have the original raw headers that
Lamson found, but you also have a higher level API you can use:

<pre class="code prettyprint">
if msg.is_bounce():
    print "-------"
    print "hard", msg.bounce.is_hard()
    print "soft", msg.bounce.is_soft()
    print 'score', msg.bounce.score
    print 'primary', msg.bounce.primary_status
    print 'secondary', msg.bounce.secondary_status
    print 'combined', msg.bounce.combined_status
    print 'remote_mta', msg.bounce.remote_mta
    print 'reporting_mta', msg.bounce.reporting_mta
    print 'final_recipient', msg.bounce.final_recipient
    print 'diagnostic_codes', msg.bounce.diagnostic_codes
    print 'action', msg.bounce.action
</pre>

This is a simple sample that prints out the various bits of
information available on a bounce message.  Here's a sample
of the output from the above:

<pre class="code prettyprint">
hard True
soft False
score 1.0
primary (5, u'Permanent Failure')
secondary (1, u'Addressing Status')
combined (11, u'Bad destination mailbox address')
remote_mta gmail-smtp-in.l.google.com
reporting_mta mail.zedshaw.com
final_recipient asdfasdfasdfasdfasdfasdfewrqertrtyrthsfgdfgadfqeadvxzvz@gmail.com
diagnostic_codes (u'550-5.1.1', u"The email account that you tried...")
action failed
</pre>

This demonstrates how I've pulled out all of the error codes mentioned in 
RFC1893 and put them into Lamson so that you can get meaningful error
messages for the various multi-level error codes you'll encounter.
All of these error codes are available in the lamson.bounce module.


h2. Comments Or Suggestions

That's it for the review of Lamson's bounce detection.  It's in the early
stages but already showing a lot of promise.  It turned out to be much
easier than I anticipated, but only after doing the work in lamson.encoding
to clean up emails.

If you can think of situations that should be covered, and if you have samples
of horrible bounce messages, then I'd love to have them.

h2. Mailing Lists Coming Back

Tomorrow I'll be using this to reimplement the mailing list sample and put 
it back into the bzr repository, then back online.  I'll have a nice
status update about that tomorrow as I finish it up.