Q: What happens when you cross a mobster with a cell phone company?
A: Someone who makes you an offer you can’t understand.

HTTP, the protocol used by web browsers, specifies an optional Referer: (sic) header that lets a browser tell the server which page linked to the one being requested. This was originally intended as a courtesy, so webmasters could ask people with obsolete links to update their pages, but it is also a valuable source of information for webmasters, who can find out which sites link to them and, in most cases, what keywords were used on a search engine. Unfortunately, spammers have found another well to poison on the Internet.
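To make the keyword point concrete, here is a minimal sketch of how a log-analysis script might pull search terms out of a Referer value. The function name and the assumption that the search engine passes its terms in a `q` query parameter are mine, for illustration only; engines differ, and some send no query string at all.

```python
from urllib.parse import urlsplit, parse_qs


def search_keywords(referer, param="q"):
    """Return the search terms found in a Referer URL, or [] if there are none.

    `param` is the query parameter assumed to carry the search terms
    (hypothetical default; it varies from one search engine to another).
    """
    query = urlsplit(referer).query
    # parse_qs decodes percent-escapes and turns '+' into spaces.
    values = parse_qs(query).get(param, [])
    return values[0].split() if values else []
```

Called on a referrer such as `http://www.google.com/search?q=referrer+spam&hl=en`, it yields the two words of the query; a plain page URL with no query string yields an empty list.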

Over the past month, referrer spam on my site has graduated from nuisance to menace, and I am writing scripts that attempt to filter that dross out of my web server log reports automatically. In recent days, most of the URLs spammers are pushing on me point to servers whose names aren’t even registered in the DNS. This seems completely asinine, even for spammers: why bother spamming someone when there is no profit in it? I was beginning to wonder whether this was just a form of vandalism like graffiti, but the situation appears to be more devious than it looks at first glance.

Referrer spam is very hard to fight (although not quite as difficult as email spam). I am trying to combine a number of heuristics, including behavioral analysis (e.g. whether the purported browser is downloading my CSS files or not), WHOIS lookups, reverse lookups for the client IP address, and so on. Unfortunately, if any of these filtering methods becomes widespread, the spammers can easily apply countermeasures to make their requests look more legitimate. This looks like another long-haul arms race…
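As an illustration of the behavioral heuristic, here is a rough sketch that flags external referrers sent only by clients that never fetched a stylesheet. It is not the script I actually run: the function name, the `(ip, path, referer)` tuple layout, and the `example.org` host name are all assumptions made for the example. It will also produce false positives for text-mode browsers and for visitors who already have the CSS cached, so it is best treated as one signal among several.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def suspicious_referers(requests, own_host="example.org"):
    """Return external referrers sent only by clients that never fetched CSS.

    `requests` is an iterable of (ip, path, referer) tuples already parsed
    from the access log (hypothetical layout); `own_host` is the site's
    own host name (placeholder value).
    """
    fetched_css = set()                     # client IPs that requested a stylesheet
    referers_by_client = defaultdict(set)   # client IP -> external referrers seen

    for ip, path, referer in requests:
        # Ignore any query string when checking the file extension.
        if path.split("?", 1)[0].endswith(".css"):
            fetched_css.add(ip)
        # "-" is the usual log placeholder for a missing Referer header.
        if referer and referer != "-":
            if urlsplit(referer).hostname != own_host:
                referers_by_client[ip].add(referer)

    flagged = set()
    for ip, referers in referers_by_client.items():
        if ip not in fetched_css:
            flagged |= referers
    return flagged
```

A referrer flagged here could then be fed to the slower checks (WHOIS, reverse DNS) before being dropped from the report.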