2009-07-09.txt
Title: Lamson's Bounce Detection Algorithm
I just finished committing 0.9.6 code for doing bounce detection and analysis
with Lamson. It's part of the new mailing list example I'm coding up for the
0.9.6 release which I'll be running on a free mailing list site I'm going
to release soon. In this blog post I'd like to go through the bounce detection
algorithm and get some feedback and samples from people. So far it works great for
the samples I have, but I want it to be fairly bulletproof.
What I have here is a quick description of how Lamson is doing bounce detection,
and then how it gives you parsed information to do analysis beyond that. I'm hoping
that people who have to deal with bounces from systems like MS Exchange can possibly
either provide samples or advice so I can tune the algorithm better.
h2. The State of RFC 3464 And 1893
The two big RFCs for bounce handling are "RFC3464 - An Extensible Message Format for Delivery Status Notifications":http://www.faqs.org/rfcs/rfc3464.html
and "RFC1893 - Enhanced Mail System Status Codes":http://www.faqs.org/rfcs/rfc1893.html which
cover the headers and error messages you'll encounter.
In RFC3464 there are various headers that a bounce message should have,
as well as the double and sometimes triple multipart-mime wrapping you need to do
to give back the bounced message. A major pain with this standard is that the headers
are spread out across the different nested parts, some might be duplicated, and the format
is *never* respected by the various MTAs out there.
With RFC1893 you have a large number of status codes, but they are combined in weird ways
such that you have a primary code with the first digit, a secondary code with the
second, and then a "third" code that's the second and third digit combined. Unlike HTTP
error codes where there's just a single number mapped to a single status code, in RFC1893
it's this strangely nested first+second+(second+third) thing.
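To make the nesting concrete, here is a minimal sketch (written for Python 3, and not Lamson's code) of splitting a status code into those three pieces. The name @split_status@ is made up for illustration, and the sketch assumes the code arrives in the dotted class.subject.detail form that RFC1893 defines.

```python
import re

# Dotted status code in the RFC1893 form: class.subject.detail
STATUS_RE = re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)')


def split_status(code):
    """Split '5.1.1' into (primary, secondary, combined).

    'combined' joins the second and third digits, so 5.1.1 gives 11.
    Returns None when the text contains no dotted status code.
    """
    match = STATUS_RE.search(code)
    if match is None:
        return None
    primary, secondary, detail = match.groups()
    return int(primary), int(secondary), int(secondary + detail)


print(split_status("5.1.1"))      # (5, 1, 11)
print(split_status("550-5.1.1"))  # (5, 1, 11)
```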
Adding to this, there are a few headers where you can put these status codes, in
various formats. Here's a sample of the *parsed* Diagnostic-Code from a Gmail bounce
message:
<pre class="code prettyprint">
'Diagnostic-Code': [(u'smtp',
                     u'550-5.1.1',
                     u"The email account that you tried to reach does...")],
</pre>
The end result is that if you want to determine whether a message has bounced (not even hard vs. soft)
then you have to go trolling through nested attachments looking for headers and parsing their values
on the chance they *might* be important to the bounce.
h2. Soft and Hard
The general consensus is that if a *primary status code* is greater than 4 (in practice 5,
which RFC1893 calls a permanent failure) then the message is a *hard* bounce, and anything
else is a soft bounce. Of course there's no way
this is an absolute truth, and I'm sure there will end up being various complex
things people will want to do with a soft vs. a hard bounce.
Since there's no way Lamson could know how you want to treat a soft vs. a hard
bounce, it simply tells you which type it is, and then lets you analyze
the information it has collected about the bounce.
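As a rough illustration of that rule, and not necessarily how Lamson implements it, the check comes down to a single comparison on the primary code. The function names below are hypothetical.

```python
def is_hard(primary_status):
    """Hard bounce: primary status code greater than 4 (5 is a permanent failure)."""
    return primary_status > 4


def is_soft(primary_status):
    """Anything that is not a hard bounce is treated as soft."""
    return not is_hard(primary_status)


print(is_hard(5), is_soft(5))  # True False
print(is_hard(4), is_soft(4))  # False True
```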
h2. Lamson's Simplification
Lamson already converts the email into a normalized form, so the difficult part of
parsing out the bounce information is simply trolling through the parts and finding
all these randomly dispersed bounce headers. Then it's a matter of parsing the values
into something that can be used for analysis in code.
To simplify this, Lamson uses a dict of headers and matching regexes to search for in
the email:
<pre class="code prettyprint">
import re

# Maps each bounce-related header to the regex used to parse its value.
BOUNCE_MATCHERS = {
    'Action': re.compile(r'(failed|delayed|delivered|relayed|expanded)', re.IGNORECASE | re.DOTALL),
    'Content-Description':
        re.compile(r'(Notification|Undelivered Message|Delivery Report)', re.IGNORECASE | re.DOTALL),
    'Diagnostic-Code': re.compile(r'(.+);\s*([0-9\-\.]+)?\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Final-Recipient': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Received': re.compile(r'(.+)', re.IGNORECASE | re.DOTALL),
    'Remote-Mta': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Reporting-Mta': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Status': re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)', re.IGNORECASE | re.DOTALL)
}
</pre>
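To show what one of these regexes does to a raw header value, here is a small Python 3 sketch that applies the Diagnostic-Code pattern from the dict above. The header value is a made-up example rather than one taken from a real bounce, and @parse_diagnostic_code@ is a hypothetical helper name.

```python
import re

# Same pattern as the 'Diagnostic-Code' entry in the dict above.
DIAGNOSTIC_CODE = re.compile(r'(.+);\s*([0-9\-\.]+)?\s*(.*)', re.IGNORECASE | re.DOTALL)


def parse_diagnostic_code(value):
    """Return (type, code, text) for a Diagnostic-Code value, or None if it does not match."""
    match = DIAGNOSTIC_CODE.search(value)
    return match.groups() if match else None


# Made-up header value for illustration.
print(parse_diagnostic_code("smtp; 550-5.1.1 No such user"))
# ('smtp', '550-5.1.1', 'No such user')
```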
Each of these headers shows up about once, maybe more, in the parts of a message
that has bounce information, and they show up rarely or not at all in regular
messages. What Lamson does is walk through each part, find the matching headers, parse
them with the regex, and then produce something like this:
<pre class="code prettyprint">
{'Action': [(u'failed',)],
 'Content-Description': [(u'Notification',),
                         (u'Undelivered Message',),
                         (u'Delivery report',)],
 'Diagnostic-Code': [(u'smtp',
                      u'550-5.1.1',
                      u"The email account that you tried to reach does...")],
 'Final-Recipient': [(u'rfc822',
                      u'asdfasdfasdfasdfasdfasdfewrqertrtyrthsfgdfgadfqeadvxzvz@gmail.com')],
 'Received': [(u'by mail.zedshaw.com ...',)],
 'Remote-Mta': [(u'dns', u'gmail-smtp-in.l.google.com')],
 'Reporting-Mta': [(u'dns', u'mail.zedshaw.com')],
 'Status': [(u'5', u'1', u'1')]}
</pre>
I've abbreviated some of the messages since they get stupidly long, but this shows you the information
you get now. Lamson also provides a nice API through the lamson.bounce.BounceAnalyzer class that
you'll see in a moment.
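The walk-and-match step can be sketched with nothing but the Python 3 standard library. This is an illustration under assumptions, not Lamson's implementation: it uses the stdlib @email@ parser instead of Lamson's normalized form, only three of the matchers shown earlier, and a tiny hand-written bounce with example.com addresses. The name @collect_bounce_headers@ is hypothetical.

```python
import re
from email import message_from_string

# A subset of the header matchers shown earlier.
MATCHERS = {
    'Action': re.compile(r'(failed|delayed|delivered|relayed|expanded)', re.IGNORECASE | re.DOTALL),
    'Final-Recipient': re.compile(r'(.+);\s*(.*)', re.IGNORECASE | re.DOTALL),
    'Status': re.compile(r'([0-9]+)\.([0-9]+)\.([0-9]+)', re.IGNORECASE | re.DOTALL),
}

# Hand-written sample bounce, for illustration only.
RAW = """\
From: MAILER-DAEMON@example.com
To: sender@example.com
Subject: Undelivered Mail
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="XX"

--XX
Content-Type: text/plain

Your message could not be delivered.

--XX
Content-Type: message/delivery-status

Reporting-MTA: dns; mail.example.com

Final-Recipient: rfc822; nobody@example.org
Action: failed
Status: 5.1.1

--XX--
"""


def collect_bounce_headers(msg):
    """Walk every MIME part and parse any header that has a matcher."""
    found = {}
    for part in msg.walk():
        for name, regex in MATCHERS.items():
            for value in part.get_all(name, []):
                match = regex.search(value)
                if match:
                    found.setdefault(name, []).append(match.groups())
    return found


headers = collect_bounce_headers(message_from_string(RAW))
for name in sorted(headers):
    print(name, headers[name])
```

Each header maps to a list of tuples because the same header can appear in more than one part of the message.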
bq. If you know of additional headers and formats that show up in bounces, let me know.
h2. Bounce Probability
Now we've collected all the different headers you might find in a bounce message, and we've
attempted to parse the values into something meaningful. That means the more of these headers
we find and successfully parse, the more likely the message is a bounce.
Lamson keeps a score while it's finding the above headers, and for each one it finds and parses
it adds a "point". It then produces a probability, at most 1.0, that the message is a bounce.
The more headers it finds, the closer the score gets to 1.0 and the higher the chance. So far it
looks like any message above 0.3 is most likely a bounce message, with only a few spam messages
scoring below that.
Checking for a bounce then looks like this:
<pre class="code prettyprint">
from lamson import mail

# tests/bounce.msg is a sample bounce message from the test suite
bm = mail.MailRequest(None, None, None, open("tests/bounce.msg").read())
assert bm.is_bounce()
assert bm.bounce
assert bm.bounce.score == 1.0
assert bm.bounce.probable()
</pre>
This is from the test suite, and it shows you various ways to get information
about the probability of the message being a bounce.
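The scoring idea can be sketched as follows. This is a minimal Python 3 illustration, not Lamson's actual code: it assumes the score is simply the fraction of known headers that were found and parsed, and the names @bounce_score@ and @probable@ are made up for this example. The 0.3 default is the rough cutoff mentioned above.

```python
# Headers the detector looks for (the keys of the matcher dict shown earlier).
KNOWN_HEADERS = [
    'Action', 'Content-Description', 'Diagnostic-Code', 'Final-Recipient',
    'Received', 'Remote-Mta', 'Reporting-Mta', 'Status',
]


def bounce_score(found):
    """One point per known header that was found and parsed, scaled to 0.0-1.0."""
    points = sum(1 for name in KNOWN_HEADERS if found.get(name))
    return points / len(KNOWN_HEADERS)


def probable(score, threshold=0.3):
    """Treat anything above the threshold as a likely bounce."""
    return score > threshold


full = {name: [('parsed',)] for name in KNOWN_HEADERS}
partial = {'Received': [('by mail.example.com',)]}

print(bounce_score(full), probable(bounce_score(full)))        # 1.0 True
print(bounce_score(partial), probable(bounce_score(partial)))  # 0.125 False
```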
h2. Bounce Header Information
Finally, you'll need to get at this parsed information to make any
final decisions about the bounce. You might want to keep track of
which MTAs consistently fail and include them in your black lists.
You might want to tune your retries for different kinds of soft
bounces.
Whatever you need to do, you have the original raw headers that
Lamson found, but you also have a higher level API you can use:
<pre class="code prettyprint">
if msg.is_bounce():
    print "-------"
    print "hard", msg.bounce.is_hard()
    print "soft", msg.bounce.is_soft()
    print 'score', msg.bounce.score
    print 'primary', msg.bounce.primary_status
    print 'secondary', msg.bounce.secondary_status
    print 'combined', msg.bounce.combined_status
    print 'remote_mta', msg.bounce.remote_mta
    print 'reporting_mta', msg.bounce.reporting_mta
    print 'final_recipient', msg.bounce.final_recipient
    print 'diagnostic_codes', msg.bounce.diagnostic_codes
    print 'action', msg.bounce.action
</pre>
This is a simple sample that prints out the various bits of
information available on a bounce message. Here's a sample
of the output from the above:
<pre class="code prettyprint">
hard True
soft False
score 1.0
primary (5, u'Permanent Failure')
secondary (1, u'Addressing Status')
combined (11, u'Bad destination mailbox address')
remote_mta gmail-smtp-in.l.google.com
reporting_mta mail.zedshaw.com
final_recipient asdfasdfasdfasdfasdfasdfewrqertrtyrthsfgdfgadfqeadvxzvz@gmail.com
diagnostic_codes (u'550-5.1.1', u"The email account that you tried...")
action failed
</pre>
This demonstrates how I've pulled out all of the error codes mentioned in
RFC1893 and put them into Lamson so that you can get meaningful error
messages for the various multi-level error codes you'll encounter.
All of these error codes are available in the lamson.bounce module.
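As an illustration of what such a lookup involves, here is a small Python 3 sketch covering only a handful of RFC1893 codes. The table names and the @describe@ function are hypothetical, and the tables are deliberately incomplete.

```python
# A few entries from RFC1893; the full tables are much longer.
PRIMARY = {2: 'Success', 4: 'Persistent Transient Failure', 5: 'Permanent Failure'}
SECONDARY = {1: 'Addressing Status', 2: 'Mailbox Status', 4: 'Network and Routing Status'}
COMBINED = {11: 'Bad destination mailbox address', 22: 'Mailbox full'}


def describe(primary, secondary, combined):
    """Pair each numeric code with its description, or 'Unknown' if it is not in the table."""
    return (
        (primary, PRIMARY.get(primary, 'Unknown')),
        (secondary, SECONDARY.get(secondary, 'Unknown')),
        (combined, COMBINED.get(combined, 'Unknown')),
    )


for label, value in zip(('primary', 'secondary', 'combined'), describe(5, 1, 11)):
    print(label, value)
```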
h2. Comments Or Suggestions
That's it for the review of Lamson's bounce detection. It's in the early
stages but already showing a lot of promise. It turned out to be much
easier than I anticipated, but only after doing the work in lamson.encoding
to clean up emails.
If you can think of situations that should be covered, and if you have samples
of horrible bounce messages, then I'd love to have them.
h2. Mailing Lists Coming Back
Tomorrow I'll be using this to reimplement the mailing list sample and put
it back into the bzr repository, then back online. I'll have a nice
status update about that tomorrow as I finish it up.