Email address obfuscation in mailing list archives
Some of W3C's most important system services are our mailing lists and corresponding online archives. Thousands of people participate in these lists, and the archives now contain millions of messages dating back to 1991.
These archives are an essential resource for groups collaborating on standards work, to build shared context and record the history behind various decisions.
Many of our discussion lists are public, with public archives. Unfortunately, one of the side effects of making these archives public is that the email addresses contained therein can be harvested by spammers looking for new victims.
Occasionally we receive requests to obfuscate or remove email addresses from our archives in an attempt to prevent or delay spammers from harvesting these addresses. We have implemented simple measures to foil the simplest harvesters (replacing each character of the domain name with its corresponding HTML entity) but have been reluctant to remove the email addresses completely because they are an essential part of the record.
Some email address obfuscation methods are more effective than others in preventing spam; usually there is a tradeoff between effectiveness and usability/accessibility for human readers. Most of the obfuscation techniques that preserve the full email address in some form are straightforward to decode and therefore ineffective in the long run as address harvesters are updated to compensate for new obfuscation techniques.
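To illustrate how little protection this kind of encoding offers, here is a minimal sketch in Python. It assumes the obfuscation uses numeric character references for each character of the domain (the exact entity form used in the archives is not stated here), and the function names are hypothetical. The point is that a harvester needs only a single standard-library call to undo it.

```python
import html


def obfuscate_domain(address: str) -> str:
    """Replace each character of the domain with a numeric HTML entity.

    Assumption: numeric character references (&#NNN;) are used; the
    local part and the "@" are left untouched.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # no "@" found: leave the text unchanged
    encoded = "".join(f"&#{ord(ch)};" for ch in domain)
    return f"{local}@{encoded}"


def harvest(obfuscated: str) -> str:
    """What a slightly smarter harvester does: decode the entities."""
    return html.unescape(obfuscated)


masked = obfuscate_domain("gerald@w3.org")
print(masked)           # entity-encoded form as it would appear in the HTML source
print(harvest(masked))  # the original address, fully recovered
```

A browser renders the encoded form as the normal address, so human readers are unaffected; only harvesters that scan raw HTML without decoding entities are foiled.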
Even if we were to completely remove the email addresses from our archives, posters would still be subject to spam, since spammers can simply subscribe to our public lists and harvest email addresses as they are distributed in the original email messages. (This fact is noted in Google Groups' documentation on its email address masking.)
One option would be to remove email addresses or mask them like Google does (e.g. display gerald@w3.org as ger...@w3.org), and make the original messages available only to authenticated users, for example people who have a W3C Member or Invited Expert account. This would help reduce the amount of spam received by participants in the short term, while making the data available to people we know and trust.
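As a rough sketch of what such masking could look like, the Python below keeps the first three characters of the local part and replaces the rest with an ellipsis. The three-character cutoff is an assumption inferred from the single example above, not Google's documented rule, and `mask_address` is a hypothetical name.

```python
def mask_address(address: str, keep: int = 3) -> str:
    """Mask the local part of an address, keeping only its first `keep` characters.

    Assumption: the cutoff of 3 is inferred from one example and may not
    match any real service's rule.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # not an address: leave unchanged
    return f"{local[:keep]}...@{domain}"


print(mask_address("gerald@w3.org"))
```

Unlike entity encoding, this transformation is lossy: the original address cannot be recovered from the masked form, which is what makes it harder to harvest and also what removes it from the machine-readable record.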
Personally, I feel that doing so is a bad idea. Email addresses are not secrets, and pretending they are is misleading and a waste of effort. If you maintain a web site or blog and participate in online communities like W3C's, keeping your email address a secret for an extended period of time will be very difficult, and once it's out, it's out, as spammers sell or swap lists of millions of addresses with each other.
If you care about your online reputation and want to stand behind what you say, you should want your identity to be associated with the things you write. If not, you can create a disposable email address at one of the thousands of free email hosting sites, and use that when participating in public forums. (This is also noted in Google's docs.)
Removing email addresses from our archives has negative consequences besides the human aesthetic and usability aspects: an email address is the best machine-readable way to identify the author of an email message, so omitting it from an archived message causes one of the most important parts of semantic data about the message to be lost.
Meanwhile, spam will continue to come in as addresses are leaked by other means — the solution to that isn't to try to hide from spammers, but to develop better spam-blocking tools, including smart software that uses existing data (such as the history of previous correspondence from your mailboxes and public archives like W3C's, and data on various relationships in social networking sites) to generate a list of trusted correspondents.
W3C's mailing lists are generally spam-free even though we invite anyone with an email address to provide feedback (subject to the posting policy of the given mailing list) — this openness is an important part of W3C's process, so we invested in the tools needed to make it happen. If others have spam problems they should do likewise!
Giving up on email is not the answer. In the words of John Gilmore,
We have built a communication system that lets anyone in the world send information to anyone else in the world, arriving in seconds, at any time, at an extremely low and falling cost. THIS WAS NOT A MISTAKE! IT WAS NOT AN ACCIDENT! The world collectively has spent trillions of dollars and millions of person-years, over hundreds of years, to build this system -- because it makes society vastly better off than when communication was slow, expensive, regional, and unreliable.
What do you think? Is it worth preserving the machine-readability of details like this in our mailing list archives, or should we remove them in the interest of hiding from spammers (even though that won't work in the long run)?
Should we build systems that generate and consume rich semantic data about our world, or hide these details because a few parasites might use the data the wrong way?
Hear, hear. Personally I don't make an effort to hide my email address online, and more than once someone has asked me, "aren't you afraid spammers will get hold of it?" This is kind of a redundant question: I'm pretty sure they already have it.
I agree with pgl. If a spammer wants to get email addresses, he will find a way to get them. One possible way to protect them from bots is to change the notation, e.g.: xyx [at] jkl [dot] com
Another way would be to hide the email address and implement a 'contact' button instead: the author can write his name as normal and validate by email, which won't be published. The message could then be redirected to the author via a contact form.
Spammers are just the cost of fantastic communications technology. What would you rather have - no email? It's a cliche but I don't know what we did without it! Can you imagine going back to snail mail? I couldn't.
Flug, you wrote:
As I noted in the article, such schemes are easy to decode, so they don't really hide anything from bots.
What would prevent spammers from using the 'contact' button to spam people?
Another vote for exposing real addresses: breaking web archives harms legitimate users with only minimal impediment to the parasites - does anyone really believe that the people who wrote software to perform distributed CAPTCHA attacks can't run a regular expression on an address list?
I do not have any empirical data about spam sent through contact forms, but I estimate it would be very small in proportion to email spam. As I said, there is no way to solve the spam problem, except maybe to install a chip in the brain of every human that deters them from spamming.
Apparently, w3.org does not want to dedicate its time to implementing a solid framework to deal with email/privacy issues. This article simply states, "You can send emails but we really don't care if you get spammed because you're gonna get spammed somewhere else anyway. Plus, the law does not oblige us to do anything about it, so we won't." Lame, lame, lame. They seem a little less lame by stating, "Ok, we will implement very basic obfuscation that does not solve the problem so nobody can accuse us of not doing anything about it." Sadly, the same can be said about the administrators of other mailing lists.
The sender should be able to modify/delete the post or ask the webmaster to do it if it is not implemented.
I have 16 or so email addresses and just one of those addresses gets inundated with spam. And I'm pretty sure this archive is where it comes from. Sad to read this article and find out that your answer is "tough luck". But look, you use Akismet. So you take efforts so you don't receive junk mail. But you won't take any actions to stop me from getting any. Boy oh boy do I regret ever posting to your mailing list.
Your excuse amounts to "well, we can't 100% prevent spam, so we're not going to put in any effort to prevent any spam". Had I known that, I would have posted with a throw-away email address, not my primary email address.
This is why, before we accept any messages from people to be distributed to our lists and included in our public archives we send them a confirmation message and have them submit a form that says (among other things), "Please be aware that having a message available in our archives runs the risk that your email address will be harvested by spammers. This is one of the risks of participating in a public forum."
Well, my post was in 2011, so I really can't make any claims as to whether or not I was shown a message like that. But the years tick by and the flood of spam keeps on coming. Is there really NO way to get my email address or the entire message removed from the archive?
Or better yet, add an identifier to the username portion of the email address, so I can actually see how much spam is coming from the archive. That would be a fun experiment before deleting my address entirely.
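A minimal sketch of how such tagging might work, in Python. It assumes the mail provider supports "plus" subaddressing (delivering user+tag@domain to user@domain), which not every provider does. The function names, the `w3c-archive` tag, and the example address are all hypothetical.

```python
from typing import Optional


def tag_address(address: str, tag: str, sep: str = "+") -> str:
    """Insert a tag into the local part: user@domain -> user+tag@domain.

    Assumption: the receiving mail system treats `sep` as a subaddress
    separator and still delivers to the untagged mailbox.
    """
    local, at, domain = address.rpartition("@")
    if not at or not local:
        raise ValueError(f"not an email address: {address!r}")
    return f"{local}{sep}{tag}@{domain}"


def extract_tag(address: str, sep: str = "+") -> Optional[str]:
    """Return the tag from a tagged address, or None if it has no tag."""
    local, _, _ = address.rpartition("@")
    _, found, tag = local.partition(sep)
    return tag if found else None


tagged = tag_address("user@example.org", "w3c-archive")
print(tagged)               # the address as it would appear in the archive
print(extract_tag(tagged))  # the tag a mail filter could count incoming spam by
```

Counting incoming messages by tag would show how much spam arrives via the archived address, although a harvester that strips everything after the separator would defeat the measurement.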