PDA

View Full Version : Is this a solution for large threads?


amcd
27 Mar 2007, 10:44
The topic of large threads being detrimental to mysql performance is known to all big board admins. But why is it this way with vbulletin? Do all forum software have this problem? Does Gaia online suffer from similar drawbacks?

vb stores the threadid and dateline in the post table. Whenever a bunch of posts needs to be selected, vb filters the post table by threadid and orders the results by dateline, and takes a few of the posts (of course there is an index involved). In other words the query is similar to

SELECT * FROM post WHERE threadid=$threadid ORDER BY dateline LIMIT 8575,15

which means that mysql will have to throw away the first 8575 results and then return the next 15. If my understanding is correct, this is the reason why large threads are bad. We need only 15 rows, but mysql has to analyse thousands.

The post table is read in this manner very frequently, every time showthread.php is run. Other accesses to the post table are far less frequent. They are showpost, newreply, newthread, edit post and miscellaneous moderator functions like delete post, move/copy post, merge posts and merge threads. vb's method of doing things means that the frequent requirement (showthread) is slow whereas the infrequent requirements are much simpler and faster.

Now consider an alternative.

What if the position of the post in the thread is stored in the post table? Lets call this field 'postpos'. The first post in a thread will have postpos=1, and the 19th post will have postpos=19.

Now, the query for showthread will be something like

SELECT * FROM post WHERE threadid=$threadid AND postpos BETWEEN 8576 AND 8590 ORDER BY postpos

This should be extremely fast, and should be equally fast for any page in a thread of any size.

Of course, this means that the postpos column must be updated by all relevant modules of vbulletin. It will need to be given a proper initial value when a new post is made, and it will need to be updated whenever a post is moved, copied, merged or deleted. It will also need to be updated if a thread is merged or split. The additional code on newreply will not be running any heavy query, so it will be quite fast. Other functions may be slow for large threads, but since all functions except adding a new post happen quite infrequently, it should not matter so much.

Am I making a fundamental mistake somewhere? Is this feasible? Can this be done using plugins and without code edits?

orban
27 Mar 2007, 13:18
Yes, this correct, MySQL has to skip the first 8574 posts, that's why those queries are "slow", but they aren't that slow, really, I think. 8000 rows isn't that much (and it's fully indexed).

Yeah, adding postpos, or "page", could be a solution, but the issue is that deleting posts/moving/merging is a bit of a pain.

Worst case is that you remove the second post of a thread with 10000 replies and then you have to update 9999 rows, which is veeeeery bad. Then again you will hardly delete/move old posts in a thread and just on the last pages, where it's not a big issue to update a few dozen posts....I guess...

Mainly making sure you don't miss any actions that modify thread positions is the issue here and the problem that in certain cases you have to update many many many rows with a new postpos value.

Probably could be done just with a plugin (maybe modcp/admincp post moving options might be a problem).

amcd
27 Mar 2007, 13:29
8000 was just an example. We have a thread with more than 700,000 posts. The admins did not like members doing off-topic chatting in any and all threads, so the members made a thread only for chit-chat. Now that thread takes 20 seconds or more to load, and we have quite a powerful dbserver. We keep that thread closed, and another 'current' thread is present for chatting. The current thread is merged with the main thread once it reaches 8000 posts or so. The members are not happy with this.

I read elsewhere that large forums auto-split threads larger than 1000 posts. We have hundreds (maybe thousands) of threads significantly larger than 1000 posts.

kmike
27 Mar 2007, 19:52
You can alleviate the need to retrieve all thread posts by using a reverse index on dateline when looking in the second half of a large thread. This solution was first introduced by the admin of the infamous www.Gaiaonline.com forum (which is currently at 979 million posts - simply incredible!). You can see it here:
http://www.phpbb.com/community/viewtopic.php?t=135383#p771664

Implementing it will greatly decrease the number of retrieved posts for sorting, since most of the post accesses are near the thread end (members reading last posts), and in the worst case of reading near the middle it'll require retrieving and sorting only through half of the thread posts.

In fact, I wonder if this solution should be brought to the vB devs' attention, many forums are suffering from the huge thread issue.

amcd
27 Mar 2007, 21:29
Thank you, kmike. That is an interesting approach. That thread also looks like a good read otherwise as well. Will read the whole thing when I get some time.

Do you see anything wrong with the approach I am planning. Should I go ahead and try it?

kmike
28 Mar 2007, 06:57
There's nothing wrong with your approach, but as orban already pointed out, there're too many places you have to track to keep the post positions updated, and renumbering the posts on every delete/merge/split is a costly operation. Also, you'd have to find the last postpos in a thread for every new post, which will be far more frequent operation if your forum is active.
Personally, I wouldn't bother with this approach.

bigbadbob0
01 Apr 2007, 13:33
I'm pretty sure MySQL is optimized to not do stupid things when given a LIMIT. So this query probably isn't wasting quite as much as you're imagining.

http://dev.mysql.com/doc/refman/5.0/en/limit-optimization.html

kmike
02 Apr 2007, 06:20
There's no imagination involved. A simple explain on any showthread query for a large thread will show an estimated number of scanned rows. Heck, doesn't "LIMIT 10000" for a 10000+ replies thread ring the bell?
It may look like a stupid thing, but to fetch a post at the end of a thread MySQL has to sort through all the previous posts (ordered by dateline ASC).

orban
06 Apr 2007, 12:04
You can alleviate the need to retrieve all thread posts by using a reverse index on dateline when looking in the second half of a large thread. This solution was first introduced by the admin of the infamous www.Gaiaonline.com (http://www.Gaiaonline.com) forum (which is currently at 979 million posts - simply incredible!). You can see it here:
http://www.phpbb.com/community/viewtopic.php?t=135383#p771664

Implementing it will greatly decrease the number of retrieved posts for sorting, since most of the post accesses are near the thread end (members reading last posts), and in the worst case of reading near the middle it'll require retrieving and sorting only through half of the thread posts.

In fact, I wonder if this solution should be brought to the vB devs' attention, many forums are suffering from the huge thread issue.

How would you go about implementing a reverse index?

http://www.mysqlperformanceblog.com/2007/04/06/using-delayed-join-to-optimize-count-and-limit-queries/



Oh!

OH!

Does vB do JOINs on the LIMIT 10000000,10 queries?

amcd
10 Apr 2007, 04:39
no, apparently vB does not do that. vB first runs a simple query on a single table (post) and gets the relevant postids. Then it runs the complex query only for those postids.

louis_chypher
11 Apr 2007, 14:46
An approach I took on the matter of a forum gaining a large amount of post was to first install the mod [http://www.vbulletin.org/forum/showthread.php?t=113324]Auto Delete/Move Thread After X days [/url] and explore this mods capibilities.

A close inspection of the mod as orginally wrote allowed post(s) to be moved from forum(s) to a single destination forum.

What I wanted was to have the main forums on my site with the latest post and sub forums nested within each main forum holding posts from 120 days past. Using theAuto Delete/Move Thread After X days (http://www.vbulletin.org/forum/showthread.php?t=139646) as a model I then changed the code to http://www.vbulletin.org/forum/showthread.php?t=113324&page=3 post #35. Which then allowed me to specify a souce forum to destination forum moves (main forum and 'archive' forum) as parameters. Now my main forums contian the latest topics of discussion (small db return result sets) and those main forums contain 'archived data' (large db result sets).

The mod runs as a cron job.

Simetrical
01 May 2007, 03:45
Here's a thought. Assume we know the number of posts in a thread. We do, it's in the thread table which we presumably have already queried to see whether the user can even view anything (thread deleted, etc.). First of all, to fetch the last page of posts (very common usage case):

SELECT postid, visible, userid
FROM post AS post
WHERE threadid = N AND visible IN (...)
ORDER BY dateline DESC
LIMIT ($numposts % $postsperpage);

I don't know about "reverse indices", but just reversing the sort will do fine. Basically you should sort ASC for the first half of the thread, DESC for the second half. That's already worlds better than the current status. (The funny LIMIT is necessary for the last page, of course.) But now the sneaky part. Say you want to go to the next page from the current one. Use:

SELECT postid, visible, userid
FROM post AS post
WHERE threadid = N
AND visible IN (...)
AND dateline > $datelineoflastpostonpage
ORDER BY dateline
LIMIT $postsperpage;

So instead of a link like showthread.php?t=N&page=77, you get links like showthread.php?t=N&after=1177986896. Exactly the number of needed rows get retrieved. Yeah, it will screw up if in the interim someone's deleted some posts earlier in the thread, but a) that's very unlikely and b) the user still wants to see the "previous page", i.e., the stuff before what he just finished reading, so in a way this is more expected behavior. The same can be done for "previous page".

The only bad spot is "jump straight to page" links, but using the DESC sorting for the highest page numbers will alleviate that, so with these two things combined, the only bad thing will be if someone wants to jump directly to a page in the middle of a thread. But any surviving page= URLs could be sharply limited by auto-redirecting them to appropriate &after= URLs (i.e., never have &page= URLs appear in the URL bar, even if it's useful to have them in links sometimes), so cut-and-paste URLs by users would still be efficient. Then there'd be basically no problem that I see with arbitrarily big threads. Maybe set up a cache in memory to store datelines for middle-of-thread page numbers to avoid the situation of a lot of people clicking a link that brings them directly to the middle of the thread and hasn't been optimized to &after=, just for completeness.


This was all inspired by a remark in a presentation by Domas Mitzuas (http://dammit.lt/uc/workbook2007.pdf) on the infrastructure of Wikipedia, by the way. I realized that indeed, you'll always find "offset" in paged features of MediaWiki, never page numbers.

orban
03 May 2007, 19:37
SEO will like you

A possible 10 gazillion URLs for every thread

Simetrical
03 May 2007, 21:56
Heh, like we don't have that already? t=143297, or pick one of p=1213498, 1213537, 1213543, . . ., 1240736, [whatever this post number will be], add on any page=, perpage=, lastpost, you name it. You'll want to rely on appropriate nofollows/noindexes and site maps if you want proper indexing of vB regardless of this, I'm thinking. Threads don't lend themselves well to unique URLs (because who says post X will stay on page Y of the thread, or in the thread at all for that matter?).

kmike
04 May 2007, 06:21
The first part of your post, until the offset talk, is exactly what I was talking about in the post #4. You'll need an appropriate compound index ending in "dateline" to take advantage of that however, and scanning it in reverse order (DESC) is why it was called a "reverse index".

Now to the offsets idea, you're just shifting the problem to another place if you want to keep the traditional thread pagination. Now instead of finding a first post on the requested page *once* you need to find the first posts for every page surrounding current page in the navigation area.
For example, looking at the first page in a thread, we'll have these pages in the pagination nav area:
1 2 3 4 11 51 101 Last
and you'll need the dateline offsets for the first posts on the pages 2, 3, 4, 11, 51, 101, and the last page. Ok, getting them for the first 10 pages is easy (though less effective) - just fetch 150 posts instead of 15, but what about pages 51, 101?

Now let's load page 10, what do we have in the pagination nav area?
First Prev 7 8 9 10 11 12 13 20 60 110 Next Last
You'll need an additional query to fetch the navigational data for the preceeding pages 7,8,9, and you still have to handle the pages 60, 110 somehow.

And another thing, basing the navigation on the pure dateline offsets is wrong - for busy enough forums/threads, some posts in a thread may have the same datelines. You can ignore it of course, but to be consistent you have to sort on dateline DESC, postid DESC or dateline ASC, postid ASC everywhere, otherwise the two posts with the same dateline will switch positions when they gradually move in the first half of the thread.