Evolutionary pressure on spam
[Oct. 5th, 2007 | 03:59 pm]
I was just cleaning out my email filter and found one with a spam-score of +49! (Anything above 0 is considered spam. Legit email has, at best, a score of about -6.)
I looked at all the rules that tagged it, and honestly? I think that at this point, you might do better just listing various pharmaceutical names in the clear, rather than replacing every A with a 4 and every I with a 1.
Diversity in filtering is part of the evolutionary defense against spam. If everyone had the same spam-filtering technique, spammers could put all their efforts against that approach. But as long as one set is suspicious of "\/14GR4" on the grounds that it looks too 1337 and another set is suspicious of "Viagra" because it's got a high Bayesian spamminess, we're able to reduce the spammer's margins from both sides.
I did my "Look, I know how to write a paper, you can graduate me" Master's paper on spam filtering. "vviagra" had a much higher probability of being spam than "Viagra", which at least occasionally shows up in my email (forwarded jokes, probably).
I'm just amused that at this point, it looks like it's not only "vviagra" that's higher, but that plain old "viagra" likely has the lowest Bayesian score of all the possible variations.
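To make the "lowest Bayesian score" point concrete, here is a minimal sketch of a per-token spam estimate of the kind naive Bayesian filters use. The counts are invented for illustration (they are not from the Master's paper), and the function name, corpus sizes, and clamping bounds are all assumptions.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Rough P(spam | token) from how often the token shows up in each corpus."""
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so a single token can never be treated as absolute proof.
    return min(0.99, max(0.01, p))


# Hypothetical (spam_count, ham_count) pairs out of 1000 messages of each kind.
N_SPAM, N_HAM = 1000, 1000
counts = {
    "viagra": (400, 6),   # occasionally appears in legit mail (forwarded jokes)
    "vviagra": (120, 0),  # obfuscated spellings never appear in legit mail
    "v1agra": (90, 0),
}

scores = {
    token: token_spam_probability(spam, ham, N_SPAM, N_HAM)
    for token, (spam, ham) in counts.items()
}
lowest = min(scores, key=scores.get)
print(scores, lowest)
```

With these made-up numbers, the plain spelling is the only one with any legitimate appearances at all, so it ends up with the lowest score of the three.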
It's like all the peacock tails got so elaborate in so many directions at once that the peacock with the most boring display is now the one most likely to get laid...
But not everyone's using Bayesian. Given the preponderance of 1337 \/\][/-\(>|%/-\ spam, somebody (AOL?) must still be doing non-learning keyword filtering.
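For anyone wondering why 1337-speak would beat a non-learning keyword filter, here is a rough sketch. The blocked-word list and the substitution table are assumptions made up for the example, not anything a real provider is known to use.

```python
# Hypothetical static blocklist and leetspeak substitution table.
BLOCKED = {"viagra"}
LEET = str.maketrans({"4": "a", "1": "i", "3": "e", "0": "o", "@": "a", "$": "s"})


def naive_keyword_hit(text):
    """Static filter: flags the message only if a blocked word appears verbatim."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED)


def normalized_keyword_hit(text):
    """Same filter, but undoes a few common substitutions first."""
    cleaned = text.replace("\\/", "v").translate(LEET)
    return naive_keyword_hit(cleaned)


for subject in ["Viagra", "V14GR4", "\\/14GR4"]:
    print(subject, naive_keyword_hit(subject), normalized_keyword_hit(subject))
```

The naive filter only catches the plain spelling; the normalized one catches all three, which is also why obfuscation stops paying off once filters look past the literal string.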
WHY?! Isn't Bayesian the only thing that actually works?
How could it be? Machine learning has more than one technique, you know.
Sorry, I suppose I should have said: "How can they just be using keyword filtering? Aren't learning techniques like Bayesian the only things that work?"
OK. Fair 'nuff.
It's just comparable to the "isn't SVM the only way to do classification?" camp. No. It's not.
(And, of course, IR-style keyword filtering is one of the earliest learning techniques...)
There are people dumb enough to buy Viagra from spammers. So there must be people who are only slightly smarter who filter all email containing "Viagra." It might also be an attempt to get around the content-filtering of any IT department that hasn't realized keyword-based content filtering is even dumber than keyword-based spam filtering.
And ML is hardly the only component of good spam filtering. The latest catch from SpamAssassin in my box is below. Only 3.5 out of 28.2 came from Bayes.
* 0.1 FORGED_RCVD_HELO Received: contains a forged HELO
* 0.0 ADVANCE_FEE_1 Appears to be advance fee fraud (Nigerian 419)
* 0.3 MIME_BOUND_NEXTPART Spam tool pattern in MIME boundary
* -0.8 AWL AWL: From: address is in the auto white-list
* 1.8 MILLION_USD BODY: Talks about millions of dollars
* 0.5 HTML_40_50 BODY: Message is 40% to 50% HTML
* 0.0 HTML_MESSAGE BODY: HTML included in message
* 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
* 2.0 RCVD_IN_SORBS_DUL RBL: SORBS: sent directly from dynamic IP address
* 1.6 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
* 1.6 URIBL_SBL Contains an URL listed in the SBL blocklist
* 3.8 URIBL_AB_SURBL Contains an URL listed in the AB SURBL blocklist
* 4.1 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
* 2.1 URIBL_WS_SURBL Contains an URL listed in the WS SURBL blocklist
* 3.0 URIBL_OB_SURBL Contains an URL listed in the OB SURBL blocklist
* 4.5 URIBL_SC_SURBL Contains an URL listed in the SC SURBL blocklist
Incidentally, it's a penis enlargement spam; the "millions of dollars" hit came from the random news fragment it included: The recall could cost over 30 million
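If you want to check the breakdown yourself, here is a small sketch that tallies the report above by rule family. The parsing regex and the prefix-based grouping are my own assumptions about the report layout, not part of SpamAssassin; descriptions are trimmed to keep it short.

```python
import re

# Scores and rule names copied from the report above (descriptions omitted).
REPORT = """\
* 0.1 FORGED_RCVD_HELO
* 0.0 ADVANCE_FEE_1
* 0.3 MIME_BOUND_NEXTPART
* -0.8 AWL
* 1.8 MILLION_USD
* 0.5 HTML_40_50
* 0.0 HTML_MESSAGE
* 3.5 BAYES_99
* 2.0 RCVD_IN_SORBS_DUL
* 1.6 RCVD_IN_BL_SPAMCOP_NET
* 1.6 URIBL_SBL
* 3.8 URIBL_AB_SURBL
* 4.1 URIBL_JP_SURBL
* 2.1 URIBL_WS_SURBL
* 3.0 URIBL_OB_SURBL
* 4.5 URIBL_SC_SURBL
"""

# "* <score> <RULE_NAME>", where the score may be negative.
LINE = re.compile(r"^\*\s+(-?\d+(?:\.\d+)?)\s+(\S+)")


def parse_report(text):
    """Return {rule_name: score} for every line that looks like a rule hit."""
    rules = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rules[match.group(2)] = float(match.group(1))
    return rules


rules = parse_report(REPORT)
total = round(sum(rules.values()), 1)
bayes = round(sum(v for k, v in rules.items() if k.startswith("BAYES_")), 1)
blocklists = round(
    sum(v for k, v in rules.items() if k.startswith(("RCVD_IN_", "URIBL_"))), 1
)
print(f"total={total} bayes={bayes} blocklists={blocklists}")
```

The displayed scores add up to 28.1 rather than 28.2, presumably because each line is rounded for display. Either way, the DNS and URI blocklist rules account for 22.7 of it, far more than the 3.5 from Bayes.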