Email address obfuscation in mailing list archives



Some of W3C's most important system services are our mailing lists and corresponding online archives. Thousands of people participate in these lists, and the archives now contain millions of messages dating back to 1991.

These archives are an essential resource for groups collaborating on standards work, to build shared context and record the history behind various decisions.

Many of our discussion lists are public, with public archives. Unfortunately, one of the side effects of making these archives public is that the email addresses contained therein can be harvested by spammers looking for new victims.

Occasionally we receive requests to obfuscate or remove email addresses from our archives in an attempt to prevent or delay spammers from harvesting these addresses. We have implemented simple measures to foil the simplest harvesters (replacing each character of the domain name with its corresponding HTML entity) but have been reluctant to remove the email addresses completely because they are an essential part of the record.
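The entity replacement described above could be sketched roughly as follows. This is an illustration only, not the code used for the W3C archives: the regular expression, the function names, and the choice of decimal numeric entities are all assumptions.

```python
import re

# Deliberately simple address pattern (an assumption for this sketch);
# real-world address syntax is considerably more permissive.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def entity_encode(text: str) -> str:
    """Replace every character with its decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def obfuscate_domains(text: str) -> str:
    """Entity-encode only the domain part of each address found in text."""
    def repl(match: re.Match) -> str:
        local, domain = match.group(1), match.group(2)
        return f"{local}@{entity_encode(domain)}"

    return EMAIL_RE.sub(repl, text)


print(obfuscate_domains("Contact gerald@w3.org for details."))
```

A browser renders the encoded domain as ordinary text, so human readers see the address unchanged; only tools that scan the raw page source without decoding entities are affected.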

Some email address obfuscation methods are more effective than others in preventing spam; usually there is a tradeoff between effectiveness and usability/accessibility for human readers. Most of the obfuscation techniques that preserve the full email address in some form are straightforward to decode and therefore ineffective in the long run as address harvesters are updated to compensate for new obfuscation techniques.
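To illustrate how little effort such decoding takes, here is a minimal sketch that reverses HTML-entity obfuscation using only the Python standard library. The sample string, the address pattern, and the function name are assumptions made for the example.

```python
import html
import re

# Deliberately simple address pattern (an assumption for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_addresses(page_source: str) -> list[str]:
    """Decode HTML character references, then look for plain addresses."""
    return EMAIL_RE.findall(html.unescape(page_source))


# A domain written entirely as decimal character references.
sample = "From: gerald@&#119;&#51;&#46;&#111;&#114;&#103;"
print(find_addresses(sample))
```

One call to `html.unescape` is enough to undo the encoding, which is why a scheme like this only stops tools that never decode entities.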

Even if we were to completely remove the email addresses from our archives, posters would still be subject to spam, since spammers can simply subscribe to our public lists and harvest email addresses as they are distributed in the original email messages. (This fact is noted in Google Groups' documentation on its email address masking.)

One option would be to remove email addresses or mask them like Google does (e.g. display gerald@w3.org as ger...@w3.org), and make the original messages available only to authenticated users, for example people who have a W3C Member or Invited Expert account. This would help reduce the amount of spam received by participants in the short term, while making the data available to people we know and trust.
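A masking step in the style of the ger...@w3.org example might look like the sketch below. The rule used here (keep at most the first three characters of the local part) is inferred from that single example and is an assumption, not a description of Google's actual algorithm; the names are hypothetical.

```python
import re

# Deliberately simple address pattern (an assumption for this sketch).
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def mask_addresses(text: str, keep: int = 3) -> str:
    """Truncate the local part of each address to `keep` characters plus '...'."""
    def repl(match: re.Match) -> str:
        local, domain = match.group(1), match.group(2)
        return f"{local[:keep]}...@{domain}"

    return EMAIL_RE.sub(repl, text)


print(mask_addresses("From: gerald@w3.org"))
```

Unlike entity encoding, this transformation discards information, so the full address cannot be recovered from the masked archive alone.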

Personally, I feel that doing so is a bad idea. Email addresses are not secrets, and pretending they are is misleading and a waste of effort. If you maintain a web site or blog and participate in online communities like W3C's, keeping your email address a secret for an extended period of time will be very difficult, and once it's out, it's out, as spammers sell or swap lists of millions of addresses with each other.

If you care about your online reputation and want to stand behind what you say, you should want your identity to be associated with the things you write. If not, you can create a disposable email address at one of the thousands of free email hosting sites, and use that when participating in public forums. (This is also noted in Google's documentation.)

Removing email addresses from our archives has negative consequences beyond the human aesthetic and usability aspects: an email address is the best machine-readable way to identify the author of an email message, so omitting it from an archived message discards one of the most important pieces of semantic data about that message.

Meanwhile, spam will continue to come in as addresses are leaked by other means. The solution to that isn't to try to hide from spammers, but to develop better spam-blocking tools, including smart software that uses existing data to generate a list of trusted correspondents. Such data includes the history of previous correspondence in your mailboxes and in public archives like W3C's, and data on various relationships in social networking sites.
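As a rough sketch of the trusted-correspondents idea, the following collects every address you have previously written to from a set of sent messages. It assumes the messages are available as raw RFC 5322 text; the function name and the sample message are hypothetical. A real tool would draw on more sources, such as list archives and social network data.

```python
from email import message_from_string
from email.utils import getaddresses


def trusted_correspondents(sent_messages: list[str], my_address: str) -> set[str]:
    """Collect the To/Cc addresses from messages the user has sent."""
    me = my_address.lower()
    trusted: set[str] = set()
    for raw in sent_messages:
        msg = message_from_string(raw)
        recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
        for _name, addr in getaddresses(recipients):
            addr = addr.lower()
            if addr and addr != me:
                trusted.add(addr)
    return trusted


sent = [
    "From: me@example.org\n"
    "To: Alice <alice@example.com>, bob@example.net\n"
    "Cc: me@example.org\n"
    "Subject: hello\n"
    "\n"
    "Body text.\n"
]
print(sorted(trusted_correspondents(sent, "me@example.org")))
```

A spam filter could then treat mail from addresses in this set more leniently than mail from unknown senders.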

W3C's mailing lists are generally spam-free even though we invite anyone with an email address to provide feedback (subject to the posting policy of the given mailing list) — this openness is an important part of W3C's process, so we invested in the tools needed to make it happen. If others have spam problems they should do likewise!

Giving up on email is not the answer. In the words of John Gilmore,

"We have built a communication system that lets anyone in the world send information to anyone else in the world, arriving in seconds, at any time, at an extremely low and falling cost. THIS WAS NOT A MISTAKE! IT WAS NOT AN ACCIDENT! The world collectively has spent trillions of dollars and millions of person-years, over hundreds of years, to build this system -- because it makes society vastly better off than when communication was slow, expensive, regional, and unreliable."

What do you think? Is it worth preserving the machine-readability of details like this in our mailing list archives, or should we remove them in the interest of hiding from spammers, even though that won't work in the long run?

Should we build systems that generate and consume rich semantic data about our world, or hide these details because a few parasites might use the data the wrong way?
