Tech:Search not working on EN
What happens
- 1st issue
Yesterday I searched for the word Inthanon, but although that word has been in en:Chiang Mai#Get out for quite some time, I got the "Sorry, there were no exact matches to your query." message. I mentioned this on en:Wikitravel:Travellers' pub#Search not working?. I checked back today and the search isn't finding the new Travellers' pub occurrence either [1]. It is however finding an old occurrence in en:Talk:Thailand/CIA World Factbook 2002 import ~ 203.147.0.48 03:17, 9 September 2007 (EDT)
- Okay, I think I nailed this one. Turns out that the Lucene code has an internal limit on the number of words it will index for a given document. Some of the larger wikitravel docs exceed this limit and were only partially indexed. I bumped the limit to 100k words from 10k -- if anybody thinks that's still too low, please speak up. KevinSours 18:53, 9 October 2007 (EDT)
- I think I understand why a low limit might be necessary, even though the effect on the results returned would be undesirable; but I can't see the point of having a relatively high limit which is still going to give incomplete results. If the limit is going to be set high, why set it so that we're still going to get incomplete results on a minority of articles - especially when those longest articles are likely to be our best, most comprehensive, and important? How about setting the limit to 200k? Then we'd be getting everything. My guess is that the difference in speed-performance going from 100k to "everything" will be insignificant compared with going from 10k to 100k - might that guess be correct? ~ 202.71.45.37 04:57, 13 October 2007 (EDT)
- If I'm understanding Kevin's explanation correctly, the limit is currently at 100k words, not 100k characters. As the average English word is 5 letters only, and WT considers spaces characters too, that means that even a 200k article would only have around 32000 words. Jpatokal 05:18, 13 October 2007 (EDT)
- It's more or less word count. The limit is not for speed performance on searching. Rather, it's to cap memory usage when indexing. If we set the limit too high and we get a huge article, then it's possible for the indexer to have problems.
- KevinSours 11:29, 15 October 2007 (EDT)
- I searched for netsurf to see if it was found in the currently 160,056 bytes en:Japan article, as it contains that word only once, in the very last section, and it was found. So it looks like the current setting is OK and this issue is resolved. ~ 202.71.45.37 06:25, 13 October 2007 (EDT)
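The effect of a per-document word limit described above can be sketched in a few lines of Python. This is only an illustration, not Lucene's actual code: the function name index_document, the max_tokens parameter, and the filler document are all hypothetical, and real tokenization is more involved than splitting on whitespace.

```python
# Hypothetical illustration of a per-document indexing limit.
# This is NOT Lucene's code; names and data are made up for the example.

def index_document(text, max_tokens):
    """Return the set of terms indexed, keeping only the first max_tokens words."""
    tokens = text.lower().split()
    # Words beyond the limit are silently dropped and never become searchable.
    return set(tokens[:max_tokens])


# A long document whose only occurrence of the search word is at the very end.
doc = " ".join(["filler"] * 12000 + ["Inthanon"])

small = index_document(doc, max_tokens=10_000)
large = index_document(doc, max_tokens=100_000)

print("inthanon" in small)  # False: the word lies past the 10k-word cutoff
print("inthanon" in large)  # True: the whole document fits under 100k words
```

Using Jpatokal's rough figure of about six characters per word (five letters plus a space), the 160,056-byte Japan article comes to roughly 27,000 words, which is well under the 100k limit and consistent with the netsurf test succeeding.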
- 2nd issue
Another example: I searched for the word audio thus...
http://wikitravel.org/en/Special:Search?ns4=1&ns5=1&search=audio&fulltext=Search
...and get Results 1-20 of 164 plus 1 2 3 4 5 6 7 8 9 Next » - so far so good...
However, when I click on 2 (expecting to get results 21-40 of 164) I get nothing more than Sorry, there were no exact matches to your query. ~ 202.79.25.170 09:59, 6 October 2007 (EDT)
- For performance reasons the search extension computes the total matched documents before applying the name space filter. There really isn't a second page to this search -- the web server just thinks there is. I'm not sure yet what can be done about this.
- KevinSours 18:53, 9 October 2007 (EDT)
- As an interim improvement, how about including an explanation - maybe just add "(in all namespaces)" - on the search results page? ~ 202.71.45.37 04:57, 13 October 2007 (EDT)
- Yes, this would be very handy. I suggest we hammer out the text here first though. Jpatokal 05:18, 13 October 2007 (EDT)
- This should be pretty easy, just let me know what the text should be.
- KevinSours 11:29, 15 October 2007 (EDT)
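The phantom second page can be reproduced with a small sketch, assuming the behaviour is as Kevin describes: the total is counted before the namespace filter is applied. The function search, the PAGE_SIZE constant, and the split of 144 main-namespace hits versus 20 project-namespace hits are all made up for illustration; only the total of 164 comes from the report above.

```python
# Hypothetical sketch of counting hits before filtering by namespace.
# Names and the hit distribution are assumptions, not the real extension.

PAGE_SIZE = 20


def search(hits, allowed_namespaces, page):
    """Return (reported_total, results_for_page) for a 1-based page number."""
    # The total is taken from ALL matches, before the namespace filter.
    total = len(hits)
    # Only afterwards are hits outside the requested namespaces discarded.
    filtered = [h for h in hits if h["ns"] in allowed_namespaces]
    start = (page - 1) * PAGE_SIZE
    return total, filtered[start:start + PAGE_SIZE]


# 164 matches overall, but only 20 of them in namespaces 4 and 5.
hits = [{"title": f"Main {i}", "ns": 0} for i in range(144)]
hits += [{"title": f"Project {i}", "ns": 4} for i in range(20)]

total, page1 = search(hits, {4, 5}, page=1)
_, page2 = search(hits, {4, 5}, page=2)

print(total, len(page1), len(page2))
```

In this sketch the pager is built from the unfiltered total, so it offers nine pages even though every filtered result already fits on the first one. Adding "(in all namespaces)" next to the total, as suggested above, would at least make the reported number accurate.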
- 3rd issue
And another example: I searched for 50webs [2] as this was just added to the local spam blacklist and would cause problems with articles which already contain occurrences. I got Results 1-20 of 1226 but on checking the first 20 articles found that none of them contain the string 50webs. ~ 203.189.134.3 10:43, 9 October 2007 (EDT)
- This is happening because the search is stripping numbers from words before comparing them. So 50webs => web (we also strip plurals and other endings). This is why 50webs matches all of those other pages. This is the same basic logic that lets the search match run and running. I can alter the search to include numbers in search terms, but that has some potential side effects. If you search for web right now, then it will return results with 50webs or web5 or any similar variation. If we add numbers back to the search then web, web5, and 50web will be considered different terms and will not be matched.
- I'd like to get a bit of feedback as to people's preferences on this before I do anything rash.
- KevinSours 16:39, 12 October 2007 (EDT)
- Would it be possible for phrase-searches to be taken literally? In other words, for anything in quotes to have nothing stripped? So 50webs would find web, webs, webbing, etc; and "50webs" would only find 50webs and nothing else? ~ 202.71.45.37 04:57, 13 October 2007 (EDT)
- Unfortunately, no, this isn't really possible.
- KevinSours 11:29, 15 October 2007 (EDT)
- Literal quoting is a good idea. I don't really understand the rationale behind stripping numbers though... Jpatokal 05:18, 13 October 2007 (EDT)
- I'm not sure I do either, but it was written this way for a reason (though not necessarily a good reason). That was the best I could come up with.
- KevinSours 11:51, 15 October 2007 (EDT)
- I can't see any advantage with stripping numbers out. I think it would be better not to. ~ 203.189.134.3 14:41, 15 October 2007 (EDT)
This is done. I've pushed it to review for, well, review. If I don't hear any objections in a couple of days, I'll push things live. Example search http://wikitravel.org/review/Special:Search?search=2nd&fulltext=Search KevinSours 14:44, 16 October 2007 (EDT)
- Why does that example search for 2nd find Center ND? ~ 203.189.134.3 06:32, 18 October 2007 (EDT)
- That's a very good question. The title uses a different analyzer from the text (the analyzer is what does the alteration of terms I describe above somewhere). Looks like it's also stripping numbers (I didn't think it did). I'll take another look at things. KevinSours 12:45, 18 October 2007 (EDT)
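The term normalization discussed in this thread can be sketched as follows. This is a toy stand-in, not the analyzer the search actually uses: the function analyze and its keep_digits flag are hypothetical, and the plural stripping is far cruder than a real stemmer.

```python
import re

# Toy analyzer illustrating digit stripping plus plural stripping.
# Hypothetical code; the real analyzer is assumed to be more sophisticated.


def analyze(term, keep_digits):
    """Normalize a term the way the thread describes, optionally keeping digits."""
    term = term.lower()
    if not keep_digits:
        # Remove every digit, so "50webs" becomes "webs" and "2nd" becomes "nd".
        term = re.sub(r"\d+", "", term)
    # Crude plural stripping as a stand-in for real stemming.
    if term.endswith("s") and len(term) > 3:
        term = term[:-1]
    return term


for word in ["50webs", "web5", "web", "2nd", "ND"]:
    print(word, "->", analyze(word, keep_digits=False), "|", analyze(word, keep_digits=True))
```

With digits stripped, 50webs, web5 and web all collapse to the same term, which is why a search for 50webs returned over a thousand unrelated pages. The same collapse would make 2nd and ND identical, which is one plausible reading of why the title analyzer matched Center ND. With digits kept, the three web variants stay distinct, which is the trade-off Kevin raises above.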
- 4th issue
The search results listing is often missing some "context" lines from individual "finds". Most results comprise a link to the page, followed by one or more lines consisting of the word searched for (highlighted in red) "in context" (typically about 7-8 words before and after, unless the word searched for was found at or close to the beginning or end of a sentence), followed by a final line giving relevance and date information.
Sometimes however there are no "context" lines at all. Example ~ 58.8.1.88 04:19, 15 December 2007 (EST)
When it happens
All the time (and this means it's not possible to reliably check for existing occurrences of strings when updating the local spam blacklist).
- Seems to be working now (at least searching for "Inthanon" finds Chiang Mai), is it time to declare this closed? Jpatokal 04:43, 24 October 2007 (EDT)
- No. The 1st issue is resolved, the 2nd and 3rd issues are not. ~ 203.189.134.3 10:09, 24 October 2007 (EDT)

