Identifying comment spam
April 28, 2006. Posted by curtmonash in Uncategorized.
It's no problem to make up a word list that catches half or more of all spam without a major false positives problem. Just think of the things people most commonly advertise through spam — porn, medication, loans. Porn and medication words are unlikely to give false positives, although loan words are more of a problem. And some of the worst offender sites can get added to the list manually.
Well, actually, on some of my blogs one might explicitly discuss the problem of spam, in which case any word could show up in the comments. Whoops.
Anyhow, that problem even aside, a large minority of spam can't easily be identified with keyword/keyphrase filters. These comments are just a bland friendly sentence or two (changing in form often enough that you can't catch even the majority of them with keyphrase filters, although I've gotten some mileage out of filtering on "nice blog" and the like). Their purpose is simply to put a URL onto your site.
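For concreteness, here is a minimal sketch of the word-list and keyphrase filtering discussed so far. The word and phrase lists, and the function name looks_like_spam, are made up for illustration; a real list would be built by hand from the spam a given blog actually receives.

```python
import re

# Hypothetical lists for illustration only; tune them against real spam.
SPAM_WORDS = {"viagra", "porn", "casino"}
SPAM_PHRASES = ("nice blog", "great site")


def looks_like_spam(comment: str) -> bool:
    """Flag a comment if it contains a listed word or phrase."""
    text = comment.lower()
    # Match whole words, so a listed word buried inside a longer word
    # does not trigger the filter.
    words = set(re.findall(r"[a-z0-9']+", text))
    if words & SPAM_WORDS:
        return True
    # Phrases are matched as plain substrings of the lowercased text.
    return any(phrase in text for phrase in SPAM_PHRASES)
```

Note that a legitimate comment about spam, such as "My filter keeps missing the viagra ads", gets flagged too, which is exactly the false-positive problem described above.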
And thus the key to filtering comment spam is, in an even more extreme form, the same as the key to filtering email spam: filter on the "call to action" (usually a URL). Whether it's to click on a URL, buy a stock, or whatever, almost all spammers want you to do something specific. The part of the spam in which they describe precisely what that is turns out to be the part that is sufficiently invariant to make effective filtering possible.
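As a rough sketch of filtering on the call to action, the code below pulls every linked domain out of a comment and checks it against a blocklist. Everything named here is an assumption for illustration: the blocklisted domain, the author_url parameter (standing in for the URL field most comment forms have), and the function names. It is not a description of how Akismet or any other product works.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blocklist; in practice this is the hand-maintained list
# of the worst offender sites.
BLOCKED_DOMAINS = {"example-pills.test"}

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def linked_domains(comment: str, author_url: str = "") -> set:
    """Return the set of host names linked from a comment."""
    urls = URL_RE.findall(comment)
    if author_url:
        urls.append(author_url)
    domains = set()
    for url in urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host:
            host = host.lower().rstrip(".")
            domains.add(host[4:] if host.startswith("www.") else host)
    return domains


def is_blocked(comment: str, author_url: str = "") -> bool:
    """True if any linked domain is a blocked domain or a subdomain of one."""
    return any(
        domain == blocked or domain.endswith("." + blocked)
        for domain in linked_domains(comment, author_url)
        for blocked in BLOCKED_DOMAINS
    )
```

The bland wording around the link can change with every comment, but the domain cannot change without defeating the spammer's purpose, which is why matching on it holds up better than matching on phrases.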
I hope (and believe) that's part of what Akismet is doing. But I gotta say this: based on the Monash Report, which is one of the blogs I've turned it on for, it definitely lets pretty obvious spam through. I mean, c'mon now, just how many comments from bettingonhorses.org is it going to take before Akismet starts filtering them out??
On the plus side, Akismet does capture half or so of my spam so far, in a small sample size, and I've decided it's safe to turn on even on my busiest blog. My one worry is that it tells you how much new spam there is, and also how much old spam, and doesn't make it clear whether it grabbed old "spam" just from your already-deleted-comments file.
However, a web search doesn't show people screaming about a real problem, so there probably isn't one. What's more, the UI text is a little more comforting when it has both old and new spam in the queue than just old spam. So, as I said, I've now turned it on even on my busiest blog.
By the way, that failed search for complaints did turn up a couple of interesting things. One is pretty much the original Akismet-popularizing blog post. The other contains a long list of "social engineering" comments appearing to be real, an all-too-high fraction of which are quite familiar to me already.