Wikitravel talk:Spam filter
Contents
- Valid links incorrectly filtered
- Spam links not filtered
- Holiday rentals
- Discussion
- Complete removal of SPAM links
- Recent Changes Home page
- Google's Comment Spam prevention
- 6x●to
- uni●cc
- serverlogic3●com.
- New ideas
- Unable to save this page
- Jobdao - Is it spam?
- University of Tokyo banned!?
- Sensitivity vs. Specificity
- Localization
- Essaouira-voyage
- linking from error message to a place to discuss it
Discussion moved from document page here by Evan
Valid links incorrectly filtered
I'm trying to add this
- On the page http://wikitravel.org/en/Cairns, all further valid changes appear to be rejected because of a valid link to the Cas in o hotel at the bottom. Should I add it to the whitelist? I'll try that.
- Finnish phrasebook - 'd_onnerwetter.kielikeskus.helsinki.fi'
- The link to the historic Pharmacy Museum's website in the New Orleans/French Quarter creates a problem with those trying to edit the page, as "pharmacy" is one of the spamtrap words. I had to put a linebreak in the URL to save the page. "http://www.pharm acymuseum●org/" is hardly how it should be listed, but the Spam protection filter doesn't let me fix it. What solution can we do about this, or will it just be impossible to include a link in the article? -- Infrogmation 15:28, 29 Oct 2004 (EDT)
- I %66ixed it. It lo%6fks funn%79, but at least the link %77orks. -phma 23:53, 29 Oct 2004 (EDT)
- I stripped out most of the generic regular expressions from the banned content file. I'll try to keep the list down to real URLs rather than general ideas. --Evan 11:55, 2 Nov 2004 (EST)
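phma's workaround above relies on percent-encoding: %66 decodes to "f", so the encoded text still resolves to the same letters while a filter that scans the raw page text never sees the banned word. A minimal sketch of that effect, assuming the filter runs a plain regex over the saved text. The `spamtrap` pattern and the `pharmacymuseum.example` URL are illustrative stand-ins, not the actual filter rule or the exact encoding phma used:

```python
import re
from urllib.parse import unquote

# phma's comment, percent-encoded exactly as it appears on this page.
comment = "I %66ixed it. It lo%6fks funn%79, but at least the link %77orks."
decoded_comment = unquote(comment)
print(decoded_comment)

# Hypothetical stand-in for a spamtrap word rule.
spamtrap = re.compile(r"pharmacy", re.IGNORECASE)

# Hypothetical URL with one letter percent-encoded (%61 is "a").
encoded_url = "http://www.ph%61rmacymuseum.example/"

print(bool(spamtrap.search(encoded_url)))           # raw text, as the filter sees it
print(bool(spamtrap.search(unquote(encoded_url))))  # decoded text
```

The same trick is available to spammers, which is one argument for matching on decoded text if a filter is ever rewritten.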
Spam links not filtered
New ones:
lyrics-sky●com lyrics001●com
-- Mark 08:39, 20 Mar 2005 (EST)
www.51wisdom●com www.ic37●com www.sj-qh●com www.fsyflower●com www.zjww●com www.websz●com www.fsyflower●com www.air520●com www.ywxjm●com www.163school●com●cn erjiguan.dzsc●com sanjiguan.dzsc●com dianrong.dzsc●com dianzu.dzsc●com dianweiqi.dzsc●com jichengdianlu.dzsc●com bianpinqi.dzsc●com lianjieqi.dzsc●com chuanganqi.dzsc●com dianganqi.dzsc●com juanyuancailiao.dzsc●com cixingcailiao.dzsc●com kaiguan.dzsc●com fangdaqi.dzsc●com diandongji.dzsc●com chazuo.dzsc●com dianchi.dzsc●com chongdianqi.dzsc●com dianre.dzsc●com yiqiyibiao.dzsc●com wujin.dzsc●com dianluban.dzsc●com jidianqi.dzsc●com dianziguan.dzsc●com bandaoti.dzsc●com guangdianyuanjian.dzsc●com dianyuan.dzsc●com www.dzsc●com www.zhkaw●com www.myseo●com●cn www.myseo●com●cn
more -- Mark 03:48, 2 Jun 2005 (EDT)
Two more from 3 Jun 2005:
www.oasales●cn www.bjicp●org
-- Wrh2 11:23, 3 Jun 2005 (EDT)
Holiday rentals
And this annoying guy too: www.holiday-rentals●com Jpatokal 12:59, 12 Nov 2004 (EST)
- It's really unclear to me if this one really counts as spam. Can we discuss this? I'd like to see what the range of views out there is. -- Mark 13:07, 12 Nov 2004 (EST)
- It's a commercial site, the guy reposted his ads after I removed them and didn't react to my message telling him to stop. So yeah, it's spam. Now, I'll grant that holiday rentals are a viable choice of accommodation in some places — but why this guy's commercial on every page and not somebody else's? Jpatokal 13:17, 12 Nov 2004 (EST)
- Almost all of our listings are for commercial enterprises, (at least hotels usually charge me money). I don't think that commercial has anything to do with it whatever.
- Is it the fact that this appears to be a web-based aggregator the problem? I want to know, because I'd really like to narrow in better on what is acceptable. -- Mark 13:25, 12 Nov 2004 (EST)
- I'm using "commercial" as short hand for "somebody trying to make money off somebody else's work" here, ie. it's not a primary source. Jpatokal 08:02, 13 Nov 2004 (EST)
- It appears to be a web directory, so links to the site are generally inappropriate (we want direct listings here). So as far as adding it to the filter, I have two opposing feelings about this: a) I'm not really concerned enough to add it to the filter if they only do this once and b) I can't see any reason why we would ever have a link to the site, so it might not hurt to add it rather than think about it. -- Colin 13:41, 12 Nov 2004 (EST)
- Oh nevermind, he re-added a couple after they were removed and a message sent to him. Add it. -- Colin 13:42, 12 Nov 2004 (EST)
- Yeah, you're right. It's a web only listings collector. We shouldn't accept this edit. I guess part of what's bothering me though is that we have a history of not allowing primary sources for apartment rental either (anybody remember Patricia from Brazil?) I think I'm going to try to get a discussion about this going on Talk:Finding accommodation -- Mark 04:53, 13 Nov 2004 (EST)
Discussion
I moved the discussion stuff from the document page here. --Evan 23:05, 1 Nov 2004 (EST)
Complete removal of SPAM links
I think the reason they spam is to gain PageRank, as explained at http://www.google●com/technology/
Just deleting the spam entry does not help, because the wiki still links to those URLs from history and diff pages such as http://wikitravel●org/en/index.php?title=Main Page&diff=47131&oldid=47129
So there should also be a way to hide these entries completely from the history preview. -Bijee 23:46, 9 Dec 2004 (EST)
- I don't think so. Our Robots File tells robots to avoid looking into history and diffs. And Googlebot obeys robots.txt. Of course, spammers might be under the impression that the indexing will occur, and are thereby encouraged to annoy us. -- Colin 23:59, 9 Dec 2004 (EST)
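Colin's point can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser` to test whether a crawler that honours robots.txt may fetch a diff URL. The robots.txt content and the `wikitravel.example` host are assumptions for illustration; the real Wikitravel robots file is not quoted in this discussion.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: keep all crawlers out of index.php, which serves history and diffs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /en/index.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical host; the query string mirrors the diff link quoted above.
diff_url = "http://wikitravel.example/en/index.php?title=Main_Page&diff=47131&oldid=47129"
article_url = "http://wikitravel.example/en/Cairns"

diff_allowed = parser.can_fetch("Googlebot", diff_url)
article_allowed = parser.can_fetch("Googlebot", article_url)
print(diff_allowed, article_allowed)
```

Under these assumed rules the diff page is off limits while the article itself stays crawlable, so a spam link that survives only in page history earns nothing from a compliant crawler.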
Recent Changes Home page
I made Special:Recentchanges one of my Firefox home pages, and I check the Recent Changes entries that show only an IP address. -Bijee 23:46, 9 Dec 2004 (EST)
Google's Comment Spam prevention
Re [1], any way to implement this at Wikitravel? All we'd need is the attribute rel="nofollow" on all external links. Jpatokal 09:48, 19 Jan 2005 (EST)
- This is implemented in MediaWiki 1.4, I believe. We should be moving to that soon. --Evan 12:21, 10 Mar 2005 (EST)
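For readers unfamiliar with the attribute, here is a minimal sketch of what adding rel="nofollow" to external links amounts to. It is not MediaWiki's implementation; `render_link` and `LOCAL_HOSTS` are hypothetical names introduced for illustration.

```python
from html import escape
from urllib.parse import urlparse

# Assumption: links to these hosts count as internal and are left untouched.
LOCAL_HOSTS = {"wikitravel.org"}

def render_link(url, label):
    """Render an anchor tag, marking links to non-local hosts as nofollow."""
    host = urlparse(url).hostname or ""
    attrs = 'href="{}"'.format(escape(url, quote=True))
    if host not in LOCAL_HOSTS:
        # Search engines that honour nofollow give the target no ranking credit.
        attrs += ' rel="nofollow"'
    return "<a {}>{}</a>".format(attrs, escape(label))

external = render_link("http://spam.example/", "cheap stuff")
internal = render_link("http://wikitravel.org/en/Cairns", "Cairns")
print(external)
print(internal)
```

The attribute does not stop spam from being posted; it only removes the search-ranking payoff.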
6x●to
http://6x●to/ is a free web redirection service. This is the sort of place where spammers tend to congregate. I would suggest that any subdomain of 6x●to be blocked until authorised by an administrator - or however it is done.
Terms of use state:
- 6x●to feels that spam in any form, including but not limited to, unsolicited commercial email, irc messages and newsgroup postings is a serious abuse of the network and will not tolerate 6x●to's name being used in such a way. If you chose to abuse the network or defame 6x●to in any way, your host name will be immediately deactivated and appropriate legal action will follow from 6x●to, your ISP and the related parties.
-- Huttite 05:03, 26 Jan 2005 (EST)
- Have sent complaint to abuse@nic.uni●cc (last spam of Main Page, against uni●cc policy). Will see what happens. -- JanSlupski 13:54, 27 Jan 2005 (EST)
- Still lots more spam links from both servers coming in. I'd suggest preemptively blocking them. Jpatokal 00:58, 2 Feb 2005 (EST)
- I note that 6x●to are blocking the promoted URL's also!! I notice that the spam stops once 6x●to does this. E-mailing abuse@6x●to appears to be effective. I suspect that 6x●to is also having trouble keeping up, so blocking any 6x●to subdomain would be useful. -- Huttite 03:50, 2 Feb 2005 (EST)
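Blocking every subdomain of a redirection service can be done with a single pattern. A minimal sketch, assuming the filter accepts Python-style regular expressions; `domain_blocker` is a hypothetical helper and `redirect.example` stands in for the real service:

```python
import re

def domain_blocker(domain):
    """Compile a regex matching http(s) URLs on `domain` or any subdomain of it."""
    return re.compile(
        r"https?://(?:[a-z0-9-]+\.)*"              # zero or more subdomain labels
        + re.escape(domain)                        # the blocked domain, dots escaped
        + r"(?![a-z0-9-]|\.[a-z0-9])",             # must not continue into a longer host name
        re.IGNORECASE,
    )

blocker = domain_blocker("redirect.example")

print(bool(blocker.search("http://cheap-pills.redirect.example/buy")))  # subdomain
print(bool(blocker.search("http://redirect.example/")))                 # bare domain
print(bool(blocker.search("http://notredirect.example/")))              # different domain
print(bool(blocker.search("http://redirect.example.org/")))             # longer host name
```

The trailing lookahead is the part that keeps the rule from catching unrelated hosts that merely start with the same text.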
uni●cc
uni●cc is a redirection service. A copy of their terms of service is at http://www.uni●cc/site/info_terms.php Reports of spammers abusing the service can be sent to mailto:abuse@nic.uni●cc -- Huttite 04:31, 8 Feb 2005 (EST)
- Tried to complain to that address before (27 Jan -- see above), but no answer, no results... :-( -- JanSlupski 06:25, 8 Feb 2005 (EST)
- I understand it may feel pointless to do this but I would suggest you post a copy of the complaint on the talk page for the IP address of the user that placed the spam too. It may not happen overnight but it MAY happen. If there is no response then the website links can always be chongqed. -- Huttite 06:50, 8 Feb 2005 (EST)
serverlogic3●com.
Can we ban serverlogic3●com? There is malware [2] which molests the uploaded wikitravel pages of unsuspecting victims (like User:Wonderfool) to include advertisements which pop up when the user mouses over the advertising term. The ad is retrieved from serverlogic3●com. So if we ban serverlogic3, we will at least prevent infected users from uploading evilified pages. -- Colin 19:52, 4 Mar 2005 (EST)
New ideas
So, I'd like to make it easier for us to add new items to the spam filter. Here's my plan:
- A new page, Wikitravel:Local spam blacklist, holds our local spam regular expressions that aren't on the CommunityWiki BannedContent list.
- Another new page, Wikitravel:Local spam whitelist, contains regular expressions that are on the banned content list at CW that we don't want to use. Example: the one with the pharmacy thing in it.
- A cron job updates our spam system each day -- downloads from CW, removes stuff in non-spam page, adds stuff in local spam page.
If it gets abused, we can protect the local pages, but my guess is that spammers aren't going to take the time to figure out our system and route around it. So I think we can leave them unprotected at first. I'm going to talk to the CW folks and see if our local regexps can feed up to the aggregate one (so everyone benefits from our experience).
Any comments or criticism on this are welcome. --Evan 12:20, 10 Mar 2005 (EST)
- The two local pages would be exempted from the spam rules, too. --Evan 13:00, 10 Mar 2005 (EST)
- Works for me. -- Mark 15:35, 10 Mar 2005 (EST)
- This is now implemented, with the exception that the local lists are checked real-time. --Evan 17:47, 13 Oct 2005 (EDT)
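The merge step in Evan's plan (take the shared list, drop whitelisted expressions, add local ones) could be sketched as follows. This is an illustration under stated assumptions, not the code running on Wikitravel: `build_filter` and `first_hit` are hypothetical names, the whitelist is assumed to remove shared entries by exact pattern text, and all patterns are made up.

```python
import re

def build_filter(shared, local_blacklist, local_whitelist):
    """Merge the shared list with local overrides and compile the result."""
    whitelisted = set(local_whitelist)
    # Drop shared expressions that the local whitelist opts out of.
    merged = [p for p in shared if p not in whitelisted]
    # Append local expressions that are not already present.
    for pattern in local_blacklist:
        if pattern not in merged:
            merged.append(pattern)
    return [re.compile(p, re.IGNORECASE) for p in merged]

def first_hit(text, compiled):
    """Return the first pattern that matches `text`, or None if it is clean."""
    for rx in compiled:
        if rx.search(text):
            return rx.pattern
    return None

shared = [r"pharmacy", r"spam-shop\.example"]   # stand-in for the shared list
local_blacklist = [r"lyrics-spam\.example"]     # stand-in for the local blacklist page
local_whitelist = [r"pharmacy"]                 # stand-in for the local whitelist page

rules = build_filter(shared, local_blacklist, local_whitelist)
print(first_hit("http://www.pharmacymuseum.example/", rules))
print(first_hit("see http://lyrics-spam.example/x", rules))
```

Keeping the merge separate from the matching makes it easy to rebuild the rule set on a schedule or on every save, whichever the site prefers.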
Unable to save this page
As a side effect of the new spam filter (super!) you cannot save this page (as a whole) anymore... --JanSlupski 15:56, 10 Mar 2005 (EST)
- You can remove spam URLs from this page as they are added to the filter. Jpatokal 20:23, 10 Mar 2005 (EST)
Jobdao - Is it spam?
A user added http://www.jobdao●com/protest/vtest001_26.htm to the Main Page. The target website is in either Japanese or Chinese characters, so I cannot read it. But I think it is a Job/CV website. Is it spam? -- Huttite 05:12, 14 Jul 2005 (EDT)
- I think in the context of the Main Page and with no explanation, yes it is. -- Mark 05:16, 14 Jul 2005 (EDT)
University of Tokyo banned!?
Ooi! My alma mater, the University of Tokyo, is inexplicably banned. Can it be removed from the filter list? "u-tokyo.ac.jp" Jpatokal 09:48, 8 Aug 2005 (EDT)
- I added it to the local spam whitelist. --Evan 17:50, 13 Oct 2005 (EDT)
Sensitivity vs. Specificity
False Positives vs. False Negatives (FP:FN) is a classic problem in medical testing. This tension is expressed as Sensitivity vs. Specificity. The more sensitive a test is the more likely you will see false positives (Type I error). The more specific a test is the more likely you will see false negatives (Type II error). (This conundrum reminds me of the Uncertainty Principle.) We have the same problem with blacklists. (Whitelists counteract false positives.)
The true rates of error depend on testing accuracy and precision, and on the frequency of true positives in the population of interest. These factors can be addressed with Bayes' theorem, which is beyond the scope of this discussion.
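For readers who want the arithmetic anyway, here is a small sketch of how Bayes' theorem combines sensitivity, specificity and the base rate of spam into the chance that a blocked edit really is spam. The numbers are made up for illustration and do not come from any measurement of this filter.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(spam | blocked) by Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Hypothetical filter: catches 99% of spam, passes 95% of good edits,
# and 1% of all incoming edits are spam.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.95, prevalence=0.01)
print(round(ppv, 3))
```

With these assumed figures only about one blocked edit in six is actually spam, which is why a low base rate makes false positives so costly.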
In medical diagnostic testing, a common strategy is to screen with high-sensitivity tests and then to verify positives with high-specificity tests. This strategy has the benefit of reducing the cost of testing and minimizing the risk of false positives and false negatives.
In blacklists we can measure cost as the number of elements (words, URLs, patterns) that must be compared to new content. In order to strictly follow the medical model we would need a two-stage blacklist. The first-stage blacklist would have spam words and URLs that are not associated with spam words (e.g. \.5g6y\.info - this URL doesn't use spam words). The second-stage blacklist would have all known URLs associated with spammers.
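One possible reading of the two-stage idea, sketched under assumptions: a short first-stage list screens every edit, and only edits that trip it are checked against the longer list of known spammer hosts. The function names and the `pills.spam.example` host are hypothetical; the two first-stage patterns are taken from examples on this page.

```python
import re
from urllib.parse import urlparse

# Stage one: short, sensitive list (spam words plus URLs that use no spam words).
STAGE_ONE = [re.compile(p, re.IGNORECASE) for p in (r"viagrapecia", r"\.5g6y\.info")]

# Stage two: specific list of known spammer hosts (hypothetical entry).
KNOWN_SPAM_HOSTS = {"pills.spam.example"}

URL_RE = re.compile(r"https?://[^\s\"'<>\]]+", re.IGNORECASE)

def screen(text):
    """Cheap screening pass: does any first-stage pattern match?"""
    return any(rx.search(text) for rx in STAGE_ONE)

def confirm(text):
    """Verification pass: does the text link to a known spammer host?"""
    hosts = {urlparse(url).hostname for url in URL_RE.findall(text)}
    return bool(hosts & KNOWN_SPAM_HOSTS)

def is_spam(text):
    # The second stage only runs when the first stage flags the edit.
    return screen(text) and confirm(text)

print(is_spam("Buy viagrapecia at http://pills.spam.example/now"))
print(is_spam("The word viagrapecia came up on a talk page."))
```

In this arrangement a spam word alone never blocks an edit, which trades some missed spam for fewer false positives.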
We are forced to compromise by merging both types of tests into one blacklist which is a finite resource with a 'price' for size. The 'price' is system loading, user inconvenience, and maintenance.
To recap (more% indicates percentage of a finite resource):
- more% blacklisted URLs => more specificity
- more specificity => more false negatives
- more false negatives => more permission for bad content
- more% blacklisted words => more sensitivity
- more sensitivity => more false positives
- more false positives => more blocks to good content
One minimax strategy for a single blacklist:
- Reduce the 'cost' by relying more on spam words that are associated with many spammer URLs
- Reduce the number of false positives by tuning the spam words with regex
- Reduce the number of false negatives by including spammer URLs that do not use spam words
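As an illustration of the second point, tuning a spam word with regex can be as simple as adding word boundaries. The sketch below reuses the pharmacy example from earlier on this page; the URLs are hypothetical and the patterns are not taken from any real blacklist.

```python
import re

naive = re.compile(r"pharmacy", re.IGNORECASE)      # matches anywhere inside a word
tuned = re.compile(r"\bpharmacy\b", re.IGNORECASE)  # matches the standalone word only

legit = "http://www.pharmacymuseum.example/"
spam = "cheap online pharmacy http://pills.example/"

print(bool(naive.search(legit)), bool(tuned.search(legit)))
print(bool(naive.search(spam)), bool(tuned.search(spam)))
```

The tuned pattern lets the museum link through while still catching the standalone word, at the price of missing spam that glues the word to other letters.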
Most blacklists depend heavily on banning URLs. Spammers have an easy time finding new URLs, which makes the effort open-ended.
I have developed a blacklist that uses banned words primarily. For more details, visit my user page:
--jwalling 17:10, 1 Jan 2006 (EST)
Rebuttal
- This list is only a short-term measure that enables Wikitravelers to block spammers not detected by the bigger banned content list, which does include word terms. Some of those words are the same words as we want to use, such as Casino. This list allows us to be extremely sensitive to specific spammers, forcing them to find new URLs. We want spammers to have to work hard for their money, meaning they have to go to lots of trouble to set up new URLs. Ideally, spammers will realise that every URL that is considered spam will find its way onto the shared wiki master list faster than they can find wikis to spam. It also means we can ban them with less effort than setting up a new URL takes. To be useful, spam links need to survive a few days or even weeks so search engines can find them. I have found that it is much more effective to make a good website, one that search engines like, by designing it well. Spamming is a wasted effort by the naive and ultimately harms their own interests. -- Huttite 18:15, 1 Jan 2006 (EST)
- Is there any effort to leverage spam words? My point is, why wait for new spammers to strike if a new spam word will prevent them from striking? For example - Spammer A leaves the new spam word viagrapecia. Block viagrapecia. Spammer B comes along to post viagrapecia and is blocked. If you only block Spammer A's URL, spammer B has a clear shot. --jwalling 19:40, 1 Jan 2006 (EST)
- Postscript - a better example of spam word blocking with regex is height:\s*\dpx, it's simple and effective. How many times have you seen
- <div id="yadayada" style = "overflow:auto; height: 1px; ">
- followed by a long list of spammer URLs?
- Perhaps there is scope to block any HTML construct that generates a hypertext link on a wiki without using the wikifeatures to do this.
- I also think the logic is that a base URL is a bit harder to set up than the word or HTML in a URL link. Though I agree it would be nice if we blocked words too, and I think we do. However, I think that past experience has been that too much anticipatory blocking using spam words is too sensitive and gives too many false positives. On Wikitravel almost any false positive is a BAD THING as the Spam Filter causes BAD THINGS to happen to the rest of the edited text - bug? reported - and work is not recoverable - a design feature? I would rather one or two new ones slipped through the net, so I know who they are, than have legitimate websites blocked and users inconvenienced because someone added a URL that was something like spam, e.g. Pornping in Chiang Mai.
- To use the medical analogy, this is like a vaccine for spam, rather than a broad spectrum antibiotic. In this case we target the precise examples rather than making the environment toxic. -- Huttite 20:12, 1 Jan 2006 (EST)
- I haven't seen a single rebuttal to using height:\s*\dpx. I think people are so married to using URLs for blocking, they can't rethink the problem. By the way, Huttite, if that block was installed, your personal page would not have been spammed. --jwalling 20:25, 1 Jan 2006 (EST)
- I disagree that URLs are more stable than spam words. Many URLs are set up at free hosts for redirection. If you want to sell viagrapecia you have to use that term to get good SEO.
- If a spammer sets up on a free host for redirection, block the whole free host URL domain. The spammers can never use the same free host twice, nor can any other spammers. The number of free host URLs is surely limited. If any valid sites use the domain they can be unblocked on a case-by-case basis. -- Huttite 20:40, 1 Jan 2006 (EST)
- What if the free host is Yahoo.com? I have seen that situation. I am not proposing that all blocks be done via spam words. I am looking for a rational balance. For every example there is a counterexample. Use what makes sense, like height:\s*\dpx; there is no valid reason for users to hide content, and if they must they can use <!-- -->. Another thought: before you add a spam word, you can search your wiki to see if it is in use. Remember, with regex you can fine-tune to prevent accidental matches. I am using a spam word list for blocking at KatrinaHelp.info and I have not seen a single, not one, spambot deposit since Dec 16, 2005, where we saw dozens in the previous weeks. --jwalling 20:55, 1 Jan 2006 (EST)
- In some respects, blocking spam entirely is censorship. If a Wiki is open to all, shouldn't anyone be allowed to put anything they want up so that others can judge it before it is taken down again? By blocking spam I do not hear what the spammer says as they are filtered out. By only blocking spam URLs I hear them when they say something new. And if it is still spam they get blocked again. Surely everyone has the right to free speech, and a responsibility to say what others should need to hear. Unfortunately spammers tend to abuse that right by shouting loudly so that others cannot be heard. There needs to be a balance between control and total exclusion. I think the current spam filter, for all its faults, strikes an appropriate balance. It may not be the best solution, but it does a relatively good job. Besides we might want to use some spammed links, if they are travel related. Also, find me a page that still has non-travel related spam on the current revision that has been there long enough to also be found in a search engine. -- Huttite 21:32, 1 Jan 2006 (EST)
- I think you have made the most succinct case for openness. It's a good standard to follow. If your methods are successful and openness is the paramount concern there is no need to change. --jwalling 21:40, 1 Jan 2006 (EST)
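The height:\s*\dpx pattern debated in this thread can be tried directly against the div quoted above. A minimal sketch using Python's `re` module; the `normal` sample is a made-up example of ordinary markup.

```python
import re

# jwalling's pattern: "height:", optional whitespace, one digit, then "px".
hidden_div = re.compile(r"height:\s*\dpx")

spam = '<div id="yadayada" style = "overflow:auto; height: 1px; ">'
normal = '<div style="height: 120px">'

print(bool(hidden_div.search(spam)))
print(bool(hidden_div.search(normal)))
```

Because \d matches exactly one digit and must be followed directly by px, the pattern only fires on heights from 0px to 9px, which is what makes it narrow enough to leave ordinary layout alone.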
Just to throw in my 200 rupiah, if you look up earlier on this very page you'll see that we tried keyword-based blocking earlier and it didn't work too well. Wikitravel covers the entire planet, which includes the legit casinos of Macau and lots of Thai porn, as the word means "blessing" there... Jpatokal 23:13, 1 Jan 2006 (EST)
- I get it. Banned spam words bad. Banned spammer URLs good. In the meantime I will continue to explore the benefits of banning spam words on the wikis I maintain. One size does not fit all. --jwalling 15:46, 2 Jan 2006 (EST)
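jwalling's earlier suggestion, searching the wiki before adding a spam word, is a cheap way to catch the kind of false positive Jpatokal describes. A minimal sketch, assuming page text is available as a dictionary keyed by title; `pages_hit_by` and the sample pages are hypothetical.

```python
import re

def pages_hit_by(pattern, pages):
    """Return the titles of existing pages that a candidate pattern would block."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(title for title, text in pages.items() if rx.search(text))

# Hypothetical snapshot of existing page text.
pages = {
    "Macau": "Macau is known for its casino resorts.",
    "Cairns": "Boat tours leave for the reef every morning.",
}

print(pages_hit_by(r"casino", pages))       # candidate word already in legitimate use
print(pages_hit_by(r"viagrapecia", pages))  # candidate word not used anywhere
```

A non-empty result is a signal to tune the pattern or to block the specific URL instead.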
Localization
I wanted to point out, since I don't think it's noted elsewhere, that the names of the local spam filter pages can be localized with MediaWiki:spamwhitelist and MediaWiki:spamblacklist respectively. --Evan 16:22, 30 October 2006 (EST)
Essaouira-voyage
Added after reverts of two edits: [3], [4]. Am I following the policy correctly? --DenisYurkin 13:17, 15 November 2007 (EST)
linking from error message to a place to discuss it
Can we link from "save was blocked" message to a place where a user can ask a question?
Right now, it looks absolutely techy, and it takes serious effort to understand what's wrong and what I can do about it. Just try to save User:DenisYurkin before my comment in Wikitravel talk:Local spam blacklist#Catch-all pattern is processed, and you'll see what I mean. --DenisYurkin 17:08, 21 March 2008 (EDT)

