Outsmarted by spammers

For a long time, I’ve been giving out unique email addresses to anyone who’s not a real person. For example, my LinkedIn email is linkedin@example.com. I started doing this to track who was passing email addresses around, and to be able to “turn off” any address that starts attracting undue amounts of spam. I generally only give out my “real” address to friends.

The way I’ve done this until today is to have a “catch-all” address set up on my mail server. Most hosting places provide this feature, although DreamHost does warn that it’s often a bad idea. The way it works is that any mail received for an address that doesn’t otherwise exist on my domain is sent to that one address, which is then forwarded to my main address. So when I start using a new website, say ordering from Swiss Chalet, I just use swisschalet@example.com and the email gets delivered to my main address. The beauty of this is that I can make up new addresses on the fly and I don’t have to worry about activating them on my server. And when an address gets compromised or starts attracting spam, I can just instruct the server to delete mail for that address without bouncing it.

I’ve been pretty happy with my setup until now. Recently I’ve been on the receiving end of a joe job. Spammers are using addresses from my domain in the “Sender” attribute of their mail. It’s not really a problem until that mail bounces – and there are a lot of bounces. Even worse, they’re generating random email addresses (like 5112EC025@example.com), so I can’t just blackhole the offending addresses.

I’ve tried a couple of technical solutions on my server: setting up SPF records and ensuring DKIM was active on my domain. That may have cut down on the bouncebacks, but I still get a lot of them – around fifty a day. It’s not much of a problem on my various computers, since client-side spam detection is pretty good, but checking mail on my phone has become problematic.

I resigned myself to having to remove the catch-all address and replace it with normal forwards, meaning mail to any other address would bounce. Here’s the problem: I have no idea which email addresses I’ve given out. None at all. Luckily, I’m a bit of an email hoarder and have email records going back as far as 2000, and a complete record of all emails that I’ve received in the past two years. It was fairly easy to discover all of the email addresses that I’ve received mail for in that time period.

Step 1: Export all mail messages into text files

I use Thunderbird for my mail program, and have since it was introduced. There’s no immediately obvious way to export all of your messages, but the ImportExportTools addon provides functionality to export all your folders into mbox format (which for all intents and purposes can be treated as text files).
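Because mbox is plain text, ordinary command-line tools work on the exported files directly. As a quick sanity check that an export looks complete, here is a rough sketch that counts messages by counting the "From " separator lines that start each message in an mbox file. The function name and the file name in the usage note are made up for illustration, and the count is only approximate: a message body containing an unescaped line that starts with "From " would be counted too.

```shell
#!/bin/sh
# count_messages FILE (hypothetical helper): approximate number of
# messages in an mbox file.
# Each message begins with a separator line starting with "From "
# (with a trailing space, unlike the "From:" header inside the message).
count_messages() {
    grep -c '^From ' "$1"
}
```

Calling it as `count_messages Inbox` (where `Inbox` stands in for whatever file the export produced) prints a single number that can be compared against the message count Thunderbird shows for the folder.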

Step 2: Identify everything that looks like an email address

I’ve been getting back into regex at work, and grep was the first thing that came to mind for solving this problem. I went with a fairly brute-force solution:

grep -oh '[A-Za-z0-9_.-]*@example\.com' * | sort -u

Grep is, well, it’s grep. The -o parameter tells grep to only print the matching string, and -h omits the file name from the output. The actual regex is dead simple – zero or more letters, numbers, and relevant punctuation followed by @example.com. This is too broad – it will also pick up some message IDs and In-Reply-To headers, but it’s good enough for our purposes. The output is piped into sort, where -u means show unique matches only.

That gives me a list that starts off like this:

004A5253@example.com
004a5253@example.com
01B4310@example.com
01E07F703@example.com
023D65F4A@example.com
0272FA7@example.com
0347592C@example.com
0347592c@example.com

before eventually getting to the good stuff like:

dell@example.com
dlmage@example.com
domains@example.com
dontsendmeanyfuckingemail@example.com

In total, there are over 1200 unique matches.
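Two sources of noise in that list could be trimmed before any manual filtering: the case-only duplicates (004A5253 and 004a5253 above) and the matches that come from Message-ID and In-Reply-To headers. The following is a sketch of one way to do that, not the command actually used for this post. It only looks at lines starting with a recipient header and folds everything to lower case. The function name and the choice of headers are assumptions, and it will miss addresses on folded (continued) header lines. It does not remove the randomly generated addresses, since bounces really are addressed to them.

```shell
#!/bin/sh
# list_recipients FILE... (hypothetical helper): unique example.com
# addresses found on recipient header lines, folded to lower case.
list_recipients() {
    # -h: no file names; -i: header names vary in case; -E: extended regex
    grep -hiE '^(To|Cc|Delivered-To|X-Original-To):' "$@" |
        # -o: print only the address itself, one per line
        grep -oiE '[A-Za-z0-9_.-]+@example\.com' |
        # fold case so 004A5253 and 004a5253 collapse into one entry
        tr '[:upper:]' '[:lower:]' |
        sort -u
}
```

Run as `list_recipients *` in the export directory, it produces the same kind of list as the one-liner, one address per line.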

Step 3: Filter the results

I did this by hand. I could have tightened up my regex significantly (for example, I’m pretty sure that I don’t give out email addresses that start with a number), but I prefer to do it by hand to avoid missing anything. When I was done, I had a list of 150 addresses that I actually use for stuff. I’m sure I could trim it down further – it looks like a lot of them were basically one-off addresses, but I’ve spent enough time on this already. Now it’s just a matter of setting up these as forwarding addresses, turning off my catch-all, and returning to a relatively spam-free life.
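For anyone who would rather not do all of this by hand, here is a sketch of two helpers that could shrink the pile first. The first drops every address that starts with a digit, which assumes (as suspected above) that no real address does; generated addresses that happen to start with a letter would still slip through. The second counts how often each address appears in the exported mail, which makes the one-off addresses easy to spot. Both function names are invented for this example.

```shell
#!/bin/sh
# drop_leading_digits (hypothetical helper): read addresses on stdin and
# discard those that begin with a digit.
drop_leading_digits() {
    grep -v '^[0-9]'
}

# count_hits FILE... (hypothetical helper): number of times each
# example.com address appears in the files, most frequent first.
count_hits() {
    grep -ohi '[A-Za-z0-9_.-]*@example\.com' "$@" |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c | sort -rn
}
```

Something like `count_hits * | drop_leading_digits` would not work as written, because uniq -c puts the count at the start of each line; the two are meant to be used separately, for example `count_hits * | less` for review and `sort -u addresses.txt | drop_leading_digits` on a saved list (addresses.txt being a placeholder name).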

A final note: an alternate approach would have been to set up a server-side filter to catch any email addresses that start with a number (for example). I didn’t go down this route because false positives on a server-side filter can lead to missed legitimate messages, and I’m not the sort of guy who’s diligent about checking the logs as often as I should.