
Tech:Lucene search

Wikimedia projects use an extended search tool based on the Java Lucene search tool. It would be nice to incorporate that into the Wikitravel server, too. It's supposedly faster and more flexible. --Evan 22:00, 28 September 2006 (EDT)

This is implemented on review. You can try the search on . Kevin Sours, the main travel site developer for Internet Brands, did the work to integrate this for Wikitravel. This is the first step toward using Lucene. In the future, hopefully, we'll be able to have more targeted search, like "Find UNESCO World Heritage sites near Cologne", "Find a Mexican restaurant within 10 miles of my hotel with a price range of $8-15 per entree" or "Find all the salsa dance clubs in Connecticut". As we move to use more structured listings, this will be more possible. --Evan 13:41, 1 June 2007 (EDT)
We want to roll this out in production soon, so please test out the review version. --Evan 13:41, 1 June 2007 (EDT)
One thing I like about this is that the default search engine doesn't work well with short words. For example, comes up with no results. works correctly. --Evan 16:45, 11 June 2007 (EDT)

I created a new article, Apples, and searched for it. Nothing came up. I then added the word Foobar and searched for that. Still nothing. I checked back about 20 minutes later and now get results for both the text and article search. I'll do another test and see if I can pin down how long the delay is... Maj 16:54, 11 June 2007 (EDT)

Ok, it looks like it's about 15 minutes between updates. That's probably a little too long, but not the end of the world. Under 10 would be better... Maj 16:54, 11 June 2007 (EDT)
I had an odd case where it only found significantly older versions of an article (en:Stromboli), not the current one. Any idea what that is all about? It wasn't like Maj's 15-minute update, I'm talking about it missing things that were several days old. Otherwise, all normal. -- Bill-on-the-Hill 23:55, 12 June 2007 (EDT)
Well, nothing on review is recent. Maj was making edits to review articles to determine the time between edits and updates. I think you may have been confusing review: with en:. -- Sapphire(Talk) • 11:38, 13 June 2007 (EDT)

UTF-8 characters are sometimes problematic in Lucene, but a search for "東京" (Tokyo) pulled up the appropriate results, which is impressive. I'd suggest doing a bit of testing with Thai script and some other non-Latin scripts just to be sure, but it looks good to me and is better than the current search. -- Ryan 12:48, 12 June 2007 (EDT)

Thai and Arabic both seem to work. المغرب pulls up Morocco... Maj 13:02, 12 June 2007 (EDT)
Korean checks out as well... Maj 13:05, 12 June 2007 (EDT)


I just tested this feature out and I'm a little perplexed by the results it gave me, but then again I'm sometimes perplexed by the results the current search tool gives too. In one instance I searched for "Clubs in Warsaw" and these were the results I got. I got hits for Serbia and some Polish cities. The results also highlighted weird words like "termini", "changing", "industry", and "independent". My suspicion is that the search tool highlighted words with the letters i and n next to each other and picked articles with words like "independent" as long as they had a link to Warsaw.

I also searched for "Purple Bridge" expecting only the guide to Newport (Kentucky) to show up, since I imagine there'd only be one purple bridge in the world that's worth mentioning, but some place called "Wulai" precedes Newport, despite no mention of a purple bridge in Wulai. The words "bridge" and "purple" do show up on the same line there, though in separate sentences.

I did, however, get what I was looking for when I searched for "gothic church buddhist temple". See result [1]. I'm not trying to knock the feature, but it doesn't seem very optimal, at least for the time being. I do see potential for it, however. With this, will we be able to search for those tags included in an article, and when will the tag="" attributes be working within the coded listings? -- Sapphire(Talk) • 13:03, 12 June 2007 (EDT)

Don't jump too far ahead there cowboy! Let's get the text search working and then move forward on the tag stuff ;-). That said, it looks like there's a problem with the stop-list (or lack thereof) and the word boundaries or partial word search. I searched for "in the a" and got [2], which isn't quite right. While we do want to be able to search on short words (like "San"), it probably should have a basic stop list... and not default to partial word search unless folks do something fancy like "*a*"... Maj 13:10, 12 June 2007 (EDT)
I don't actually think that it is searching on partial words. I did a search on "Sa" and all of the results contained that exact word somewhere. The highlighting appears to be picking up partial matches and highlighting the resulting words, but this is separate from the search itself. Ksours
I tried searching "manches" to see whether Manchester would come up... it didn't, but Manchester (New Hampshire) did. So it's clearly searching for partial words in some cases, but not all... Tsandell 14:24, 12 June 2007 (EDT)
It's doing stemming -- which means looking for the roots of both the search terms and indexed terms and matching on the root (so that restaurant and restaurants will match). In this case Manchester (New Hampshire) contains the phrase "Manchester (nicknamed 'Manch-vegas' by the locals)" and I'm pretty sure that "manches" is matching with "manch" in this phrase. Ksours 14:56, 15 June 2007 (EDT)
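
To make the stemming explanation above concrete, here is a minimal sketch in Python. It is not the stemmer that Lucene or the Wikitravel setup actually uses; `toy_stem`, `stems`, and `matches` are hypothetical helpers, and the suffix rules are a crude assumption chosen only to show how "manches" and "manch" could end up with the same root.

```python
import re


def toy_stem(word: str) -> str:
    """Strip a plural-style suffix. A crude stand-in for a real stemmer."""
    word = word.lower()
    if len(word) > 4 and word.endswith("es"):
        return word[:-2]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stems(text: str) -> set:
    """Split text on non-word characters and stem every token."""
    return {toy_stem(token) for token in re.findall(r"\w+", text)}


def matches(query: str, document: str) -> bool:
    """A document matches when every stemmed query term occurs in it."""
    return stems(query) <= stems(document)


new_hampshire = "Manchester (nicknamed 'Manch-vegas' by the locals)"
england = "Manchester is a city in the North West of England"

print(matches("manches", new_hampshire))  # query stem "manch" is in the text
print(matches("manches", england))        # "manchester" is a different stem
```

Under these toy rules the query "manches" is reduced to "manch", which appears as its own token once "Manch-vegas" is split at the hyphen, so only the New Hampshire text matches.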

So, with this new search engine you can do phrase searches. You can search for 'purple bridge' (which I think gives hits for 'purple' or 'bridge'), or for '"purple bridge"', which gives hits just for that exact phrase. I'm not sure how to get an exact hit on "Purple People Bridge" without getting all things that mention purple close to bridge, though.
I got better hits with '"clubs in Warsaw"' and '"gothic church"' when I enclosed them in quotes, too. --Evan 13:14, 12 June 2007 (EDT)
Oh yeah, quotations... nice trick. What is a "stop-list"? -- Sapphire(Talk) • 13:18, 12 June 2007 (EDT)
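
On the stop-list question: a stop list is a set of very common words that the engine drops from queries (and usually from the index) because they match nearly every article. A minimal sketch, assuming a small hand-picked word set; `STOP_WORDS` and `content_terms` are hypothetical names, not part of Lucene or MediaWiki.

```python
import re

# Hypothetical stop list; a real deployment would tune this per language.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}


def content_terms(query: str) -> list:
    """Lowercase the query, split it into words, and drop stop words."""
    words = re.findall(r"\w+", query.lower())
    return [word for word in words if word not in STOP_WORDS]


print(content_terms("Clubs in Warsaw"))  # "in" is dropped
print(content_terms("in the a"))         # nothing is left to search for
print(content_terms("San Francisco"))    # short words like "san" are kept
```

Filtering by membership in a list, rather than by word length, is what lets short but meaningful words such as "San" stay searchable while "in the a" reduces to an empty query.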

I did some poking around and most searches seem to work nicely, esp. anything that hits article names bubbles right up to the top like it should. Some results were a bit weird though, eg. "hawker centre" gets "West Coast (Malaysia)" (which doesn't say anything about them) as the first hit, and "singapore hawker centre" gets "Peninsular Malaysia" as its first hit (albeit tied with Singapore with a 4.3% relevancy score). Would I be entirely incorrect in guessing that short articles seem to get undue bias, so 1 mention of "hawker" in West Coast's 2.2k outweighs the 14 mentions in Sing's 89.9k (that is, 1 mention per 6.4k)? Jpatokal 13:29, 12 June 2007 (EDT)
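
The short-article guess above can be checked with a little arithmetic. This is only a sketch: it assumes a score of the form sqrt(term count) / sqrt(document length), which follows the general shape of classic TF-IDF length normalization, uses article size in kilobytes as a stand-in for term count, and ignores everything else a real ranking would include. `toy_score` is a hypothetical function, not the actual scoring code.

```python
import math


def toy_score(term_count: int, doc_length: float) -> float:
    """Dampened term frequency multiplied by a length normalization factor."""
    term_frequency = math.sqrt(term_count)
    length_norm = 1.0 / math.sqrt(doc_length)
    return term_frequency * length_norm


# Figures quoted in the comment above: mentions of "hawker" and size in k.
west_coast = toy_score(1, 2.2)
singapore = toy_score(14, 89.9)

print(f"West Coast (Malaysia): {west_coast:.2f}")
print(f"Singapore:             {singapore:.2f}")
```

Under these assumptions the 2.2k article scores about 0.67 and the 89.9k article about 0.39, so a single mention in a very short article can outrank fourteen mentions in a long one.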

Seems fine to me. Better on quoted searches for text than the old one, but still not great. --OldPine 13:46, 12 June 2007 (EDT)

I tested the search with a variety of non-English characters, like ã (like São Paulo), ç, ş, ţ, ä, ö, ü, â, etc. and I didn't see any particular problem with the way those work. However, some tinkering will definitely have to be done to get it to work for non-Western scripts. The engine can't find anything unless there happens to be a space on either side of the word you enter, which, in scripts like Japanese, almost never happens. This was testable even on this English review site, since so many entries have the local language in parentheses. For example, if I search 寺 (tera, temple), the engine finds zero, but if I put 清水寺 (kiyomizudera, Kiyomizu Temple) it gives me two articles where that specific temple is mentioned. 橋 (hashi, bridge) gives nothing, but 日本橋 (Nippombashi, Nippon Bridge) lists Osaka. Here on the English review site, a full location name in Japanese like this comes up in the results because in the articles here it has a space or punctuation on either side of it, but on the Japanese version, it probably wouldn't have a space or punctuation on either side, so you wouldn't get any results unless you actually searched for the entire sentence in which it appeared. I imagine the same problem will come up for Hindi and Hebrew. Texugo 05:06, 18 June 2007 (EDT)
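
The behaviour described above is what a tokenizer that splits only on spaces and punctuation would produce. The sketch below reproduces it and contrasts it with indexing single characters and character pairs for Japanese text, which is one common workaround. Everything here is an assumption for illustration: `word_tokens`, `cjk_aware_tokens`, `is_cjk`, and `found` are hypothetical helpers, the sample sentence is made up, and the character ranges cover only kana and the basic CJK ideograph block.

```python
import re


def is_cjk(char: str) -> bool:
    """True for kana and basic CJK ideographs (a deliberately narrow check)."""
    return "\u3040" <= char <= "\u30ff" or "\u4e00" <= char <= "\u9fff"


def word_tokens(text: str) -> set:
    """Split on spaces and punctuation only, as the tested engine appears to."""
    return {word.lower() for word in re.findall(r"\w+", text)}


def cjk_aware_tokens(text: str) -> set:
    """Index CJK runs as single characters plus overlapping character pairs."""
    tokens = set()
    for word in re.findall(r"\w+", text):
        if any(is_cjk(char) for char in word):
            tokens.update(word)  # each character on its own
            tokens.update(word[i:i + 2] for i in range(len(word) - 1))
        else:
            tokens.add(word.lower())
    return tokens


def found(query: str, text: str, tokenizer) -> bool:
    """A query is found when all of its tokens occur among the text tokens."""
    return tokenizer(query) <= tokenizer(text)


sample = "Kiyomizu Temple (清水寺, Kiyomizu-dera) is in Kyoto."

print(found("寺", sample, word_tokens))       # the lone character is not a token
print(found("清水寺", sample, word_tokens))    # the full name is a token
print(found("寺", sample, cjk_aware_tokens))  # single characters are indexed
```

With word-only splitting, 寺 is never a token of its own, so only the full name 清水寺 is found. Once characters and character pairs are indexed, the single character matches as well, at the cost of a larger index and noisier matches.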

Rolled out

This has been rolled out on Wikitravel en:. Problems with the rollout should be marked as separate bugs. --Evan 08:32, 4 July 2007 (EDT)