  #1  
Old 29 Sep 2006, 21:05
orban orban is offline
 
Join Date: Jan 2005
Sphinx Search

Sphinx Implementation for vBulletin:

Version 0.1 Hooray!

Just sharing as usual, let the discussions begin (in b4 TECK "MINE IS BETTER")

Only tested with Sphinx-0.9.8-rc2 (r1234; Mar 29, 2008).

If you are upgrading from my old tutorial, back up your search.php (you know, just in case you need the old hacked-up version again) and restore the original from the zip/tar. No more file modifications!

http://sphinxsearch.com/downloads.html

Tested on 3.6.10. It should work on 3.7 if you modify the /*insert query*/ on line 522 (I removed the 'prefixchoice' field because it doesn't exist in 3.6).

No support for tags/thread prefix yet, because I don't have access to a 3.7 installation at the moment

Similar threads is also being worked on

Alpha release for some feedback, hopefully it will be production ready soon

I assume you already have Sphinx up and running... see attached sphinx.conf.example for a minimalistic setup

Installation notes inside search_sphinx.php

Well yeah enjoy. And PM me if you need help

The old post is here: http://www.vbulletin.org/forum/showp...&postcount=387

The Good:
  • Search this forum
  • Search this thread
  • Find all posts by User
  • Find all threads started by User
  • "Search Entire Posts"/"Search Titles Only" and "Show Results as Threads"/"Show Results as Posts" in all four combinations supported
  • "Search Entire Posts" can be sorted by rank/post.dateline (postuserid, forumid will sort by integer)
  • "Search Titles Only" can be sorted by rank, last reply date, first post date, number of replies (views if you add that value to sphinx.conf)
  • Really fast

The Bad:
  • You can't sort posts by title, number of replies/views, thread start date, or last reply date (Sphinx doesn't have this data).*
  • You could possibly add this data to sphinx.conf, but it would only be as good as your last full post index update
  • "Find Threads with At Least/Most X Replies" doesn't work with "Search Entire Posts"
  • Search results are delayed (depending on how often you run indexer)
  • "New Posts" not supported... too much logic in the query?!

The Ugly:
  • Sorting is kinda messed up (especially when "Search Entire Posts" and "Show Results as Threads" are combined)
  • search_sphinx.php is messy, duplicated code from search.php

*The Infamous Post Sorting Quirk

When you combine "Search Entire Posts" with "Show Results as Threads", the question is which date you want your threads sorted by:
  • First post dateline (vBulletin option)
  • Last post dateline (vBulletin default)
  • The matching post dateline (Sphinx)

Our Sphinx setup does not have first post and last post dateline stored in its post index (and it would be pretty much useless too) so the first two options are not available. vBulletin offers a function called "sort_search_items()" (search.php:633 3.7) which could, in theory, be used to sort the threads by last post dateline.

It does not fix the problem though. Let's assume we set maxresults to 5 and search threads for "funny". We have 7 threads created today. Each line below shows the thread first and, after the bar, the post in it that matches the search:

1. Thread "Cows", Created 08:00, Last Post 17:00 | "Funny Cows", Created 09:00
2. Thread "Cats", Created 09:00, Last Post 14:00 | "Funny Cats", Created 14:00
3. Thread "Dogs", Created 10:00, Last Post 12:00 | "Funny Dogs", Created 11:00
4. Thread "Mice", Created 11:00, Last Post 15:00 | "Funny Mice", Created 13:00
5. Thread "Rats", Created 12:00, Last Post 13:00 | "Funny Rats", Created 12:00
6. Thread "Eels", Created 13:00, Last Post 19:00 | "Funny Eels", Created 18:00
7. Thread "Fish", Created 14:00, Last Post 18:00 | "Funny Fish", Created 17:00

Do we want to show threads 6, 7, 2, 4, 5 (Sphinx)? Or do we want to show threads 6, 7, 1, 4, 2 (vB)?

vBulletin finds all 7 matches, orders the threads by last post date descending, and grabs the top 5.
Sphinx finds the 5 newest matching posts and then returns the associated threads.

Reordering search results with "sort_search_items()" does not fix the problem because there might be older threads with very recent replies that Sphinx won't even consider. Let's consider an 8th thread:

8. Thread "Bees", Created 2002, Last Post 20:00 | "Funny Bees", Created 2002

vBulletin will list this one on top, but Sphinx will not consider it at all because its matching post is too old to make the cut. So even re-sorting the search items will not make this thread appear.
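
To make the difference concrete, here is a minimal Python sketch that replays the example above. It is only a simulation, not code from search_sphinx.php or from vBulletin: the Thread class, the function names and the calendar date are made up for illustration, and it assumes exactly one matching post per thread.

```python
from dataclasses import dataclass
from datetime import datetime


def today(hhmm: str) -> datetime:
    # The calendar date is arbitrary (an assumption); only the time of day matters.
    return datetime.strptime("2008-05-08 " + hhmm, "%Y-%m-%d %H:%M")


@dataclass(frozen=True)
class Thread:
    num: int              # number used in the list above
    title: str
    last_post: datetime   # dateline of the thread's last post
    match_post: datetime  # dateline of the post that matches "funny"


THREADS = [
    Thread(1, "Cows", today("17:00"), today("09:00")),
    Thread(2, "Cats", today("14:00"), today("14:00")),
    Thread(3, "Dogs", today("12:00"), today("11:00")),
    Thread(4, "Mice", today("15:00"), today("13:00")),
    Thread(5, "Rats", today("13:00"), today("12:00")),
    Thread(6, "Eels", today("19:00"), today("18:00")),
    Thread(7, "Fish", today("18:00"), today("17:00")),
]
# Thread 8: old thread, old matching post, but a very recent reply.
# The exact day in 2002 is a placeholder.
BEES = Thread(8, "Bees", today("20:00"), datetime(2002, 1, 1))


def vb_order(threads, maxresults):
    """vBulletin-style: rank all matching threads by last post, newest first."""
    ranked = sorted(threads, key=lambda t: t.last_post, reverse=True)
    return [t.num for t in ranked[:maxresults]]


def sphinx_order(threads, maxresults):
    """Sphinx-style: take the newest matching posts, then map them to threads."""
    ranked = sorted(threads, key=lambda t: t.match_post, reverse=True)
    return [t.num for t in ranked[:maxresults]]


def sphinx_then_resort(threads, maxresults):
    """Re-sort the already truncated Sphinx result by last post dateline."""
    kept = set(sphinx_order(threads, maxresults))
    return vb_order([t for t in threads if t.num in kept], maxresults)


if __name__ == "__main__":
    print(vb_order(THREADS, 5))                     # [6, 7, 1, 4, 2]
    print(sphinx_order(THREADS, 5))                 # [6, 7, 2, 4, 5]
    print(vb_order(THREADS + [BEES], 5))            # [8, 6, 7, 1, 4]
    print(sphinx_then_resort(THREADS + [BEES], 5))  # [6, 7, 4, 2, 5]
```

The last line is the point: re-sorting only shuffles the five threads Sphinx already picked, so "Bees" (and "Cows") never show up no matter how the result set is ordered afterwards.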
Attached Files
File Type: php search_sphinx.0.1.php (17.3 KB, 650 views)
File Type: txt sphinx.conf.example.txt (3.6 KB, 749 views)

Last edited by orban; 08 May 2008 at 09:58.
  #2  
Old 29 Sep 2006, 21:30
Adrian Schneider Adrian Schneider is offline
 
Join Date: Jul 2004
Nice find! I'll play around with it once I get some time.
  #3  
Old 29 Sep 2006, 21:37
orban orban is offline
 
Join Date: Jan 2005
Obviously the only options you will have on the advanced search page are:

Key Words:
Search In: Thread Titles/Posts
Sort Results by: Relevancy, Date Asc, Date Desc
Search in Forums:

And I guess searching by username will still be the built in way. (As in, without a search term, just list his posts.)

Gonna try to hack that up, when I make it work I'll release it I hope

But the fact you can index 4k posts/second is absolutely insane, and that was with 800 users online...

Last edited by orban; 29 Sep 2006 at 21:40.
  #4  
Old 29 Sep 2006, 21:39
Paul M Paul M is offline
 
Join Date: Sep 2004
Real name: Paul M
Hmm, yes, that looks interesting, bookmarked for later.
__________________

Cable Forum
Lead Developer, vBulletin.Org & vBulletin.Com
Please do not PM me about custom work - I no longer undertake any.

Note: I will not answer support questions via e-mail or PM - please use the relevant thread or forum.
  #5  
Old 29 Sep 2006, 21:50
orban orban is offline
 
Join Date: Jan 2005
Also means I can remove that 400mb fulltext index from post table making MySQL even faster.

The right tool for the job.

Filtering by forumid already works, so does sorting by date.

And it still says 0.000003 seconds. Incredible.

Last edited by orban; 29 Sep 2006 at 22:00. Reason: Automerged Doublepost
  #6  
Old 29 Sep 2006, 22:20
forumdude forumdude is offline
 
Join Date: Nov 2001
Hmm good timing. I got on here today to see if there were any other resources out there for searching and vbulletin and this showed up in the results.

We've had soooo much trouble keeping our search up. We're using the fulltext search right now with the search on its own server on tables reduced in size. Huge pain and it still doesn't return some results.

Keep us updated please, this looks cool.
  #7  
Old 29 Sep 2006, 23:36
forumdude forumdude is offline
 
Join Date: Nov 2001
Awesome!

If I get some time tonight (probably not!) I will download Sphinx and give it a look.

What kind of data do you have to test this with?

We're looking at about 9 million records on our live post table (millions more archived). I'm very curious how well this would hold up to that amount of data.
  #8  
Old 30 Sep 2006, 00:26
mute mute is offline
 
Join Date: Dec 2002
Can I get a peek at your sphinx.conf?
  #9  
Old 30 Sep 2006, 00:33
mute mute is offline
 
Join Date: Dec 2002
wow, you are fast! thanks. I'm tossing it 24 million posts to see what it does
  #10  
Old 30 Sep 2006, 01:28
mute mute is offline
 
Join Date: Dec 2002
*waits for post index to build*

So far so good. It ripped through 1,652,726 thread titles in about 2 minutes, on a machine replicating a very active forum, and one running a test upgrade from 3.5.5 to 3.6.1

So far, I'm happy! I think with a little work this could be amazing. The API is a little unfriendly when it comes to errors and whatnot, but with some polishing, and once we figure out how to target searches and search by name, we're good to go.

Orban you are a hero among men!

Just FYI:

thread table:
collected 1658976 docs, 48.1 MB
sorted 5.1 Mhits, 100.0% done
total 1658976 docs, 48070959 bytes
total 148.426 sec, 323872.56 bytes/sec, 11177.16 docs/sec

post table:
collected 8860446 docs, 1416.9 MB
sorted 140.2 Mhits, 100.0% done
total 8860446 docs, 1416892676 bytes
total 3168.862 sec, 447129.84 bytes/sec, 2796.10 docs/sec

That is with a minimum word length of 4 and no stopwords.
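
For anyone who wants to sanity-check the rates in that indexer output, here is a small Python sketch that recomputes them from the document counts, byte counts and run times quoted above. The function name is made up for illustration; nothing here comes from Sphinx itself.

```python
def throughput(docs: int, total_bytes: int, seconds: float) -> tuple[float, float]:
    """Return (docs/sec, bytes/sec) for one indexer run."""
    return docs / seconds, total_bytes / seconds


# Figures copied from the indexer summaries above.
thread_docs_rate, thread_bytes_rate = throughput(1_658_976, 48_070_959, 148.426)
post_docs_rate, post_bytes_rate = throughput(8_860_446, 1_416_892_676, 3168.862)

if __name__ == "__main__":
    print(f"thread index: {thread_docs_rate:.2f} docs/sec, {thread_bytes_rate:.2f} bytes/sec")
    print(f"post index:   {post_docs_rate:.2f} docs/sec, {post_bytes_rate:.2f} bytes/sec")
```

The recomputed values land very close to what the indexer printed; any tiny difference is presumably because the run time shown is rounded to milliseconds.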

Last edited by mute; 30 Sep 2006 at 02:03. Reason: Automerged Doublepost
  #11  
Old 30 Sep 2006, 14:03
mute mute is offline
 
Join Date: Dec 2002
Originally Posted by orban
Wow, that's crazy. 1.4gb for 8.8million posts....?!
Actually, 1.4gb for 24 million posts. For some reason it gets 1:1 "documents" when indexing thread, but only 1:3 for posts, not sure if that is a bug and it isn't indexing everything, or has something to do with our content?

I'm headed out fishing, but I'm going to play with your updated changes later
  #12  
Old 30 Sep 2006, 14:42
orban orban is offline
 
Join Date: Jan 2005
Weird....
  #13  
Old 30 Sep 2006, 14:55
mute mute is offline
 
Join Date: Dec 2002
Yeah, and I recreated it a few times (with stopwords, diff min word length, etc). Not exactly sure why yet.
  #14  
Old 30 Sep 2006, 20:02
orban orban is offline
 
Join Date: Jan 2005
Maybe some posts are too short? Like no words longer than 4 characters?

But then again, that would never be 2/3 of the posts. I really have no idea
  #15  
Old 02 Oct 2006, 09:54
kmike kmike is offline
 
Join Date: Oct 2002
Sphinx 0.9.7 will feature an arbitrary number of group IDs, so it will be possible to handle "search this thread" and search by user in Sphinx.
Meanwhile, it's easy to hack Sphinx to support 3 groupid columns instead of one with some copy-pasting. Naturally, the index is larger with the additional group IDs: 5 GB for a database of 6 million posts. We've been running it for some months already with great success.