MySQL Full Text search does not perform well on large data sets
It looks like the problem which is not solved even by pretty smart people. For example LiveJournal does not have full text search itself, while I would guess this is feature which a lot of people would like. Or Wikipedia offering Google or Yahoo search instead of their own full text search. If you could just plug in "MATCH (text) AGAINST ("phrase")" and enjoy I guess it would be long done already.
I've been playing with FullText search a lot but simply could not make it to perform well even in best scenario (fully sorted index) when data size is
significant. If you know how to make it return 10 most relevant documents out of 1.000.000 matching from 10.000.000 collection, when data size is 10x of memory size - let me know.
My felling is - it just can't be done without serious changes in FullText architecture, which would lose some of its benefits, for example live index updates. Of course if you have small document collection, ie 100.000 which well fits in memory, plus you have light load and very selective queries it make well enough for you, so there is no need to go for something different, which makes things more complex.
The problem with performance also is so bad you can't really fix it with caching, as single cache miss may lead to user never getting his result back.
So Internal FullText did not work well so I decided to check for other possibilities. First one was Mnogosearch (http://www.mnogosearch.org),
it is pretty well known and is initially designed to index web sites, but also can do databases. It is used for example for Manual search on MySQL web site.
The first problem I run into with - even with fastest full index mode it took other 24hours to index sample database with 500MB text (3.000.000 docs)
without tuning table schema and options it was taking other a week. (Box has 512MB of RAM and Sempron 2800+ CPU)
There is several modes it can work with - BLOB mode seems to be the only which offers reasonable performance. Otherx were much 5-10 times slower.
The problem with it however is memory requirement - BLOB is stored in database and has to be retrieved as single piece which requires a lot of memory on client and server, so you need to watch how many queries is running concurrently. In my tests I had mnogosearch taking over 100MB of memory to perform some queries... this is 20% of data size - quite a lot.
Mnogosearch also supports mode when all data is stored on file system rather than in database - I have not tested it, tell me if it is much faster.
Mnogosearch performance for data larger than memory size is better as BLOB can be fetched with large sequential read, however I still had some complex queries taking other 10 seconds which is a lot.
The other tool I tried was "Sphinx" (http://www.shodan.ru/projects/sphinx)
It is designed especially for indexing databases so it does not need this weird "URL to database" conversion as Mnogosearch. It also stores data in
files in its own packed format which takes much less space.
The indexing speed was nice surprise - content was indexed just in 5 minutes, most of which was spent fetching data from the database. Also I could not find a single query which was executing more than a second, even if I would index it with reduced stop word list, so some very popular words are indexed.
Here is benchmark sample:
500MB of docs, 3.000.000 in count. 512M,128M key buffer.
Looking for "internet web design" (match any) returns
134.000 docs.
With full text search I just Benchmarked count(*), as all real queries
would be slower, unless you do "in boolean mode" with small limit and
without sorting - which returns words with any relevance ranking at all.
"Cached" means second run, Index + data did not fully fit in memory
this is why it does not always help a lot.
Full Text Search: 392 Sec
Full Text Search (Cached): 272 Sec
Full Text Boolean Mode 12 Sec
Full Text Boolean (Cached) 11 Sec
Mnogosearch (external) 3.5 Sec
Mnogosearch (external cached) 1.06 Sec
Sphinx 0.23 Sec
Sphinx Cached 0.15 Sec
LIKE %...% 30sec Sec
LIKE %...% (Cached) 29sec Sec
April 1 2005, 23:25:07 UTC 7 years ago
For Wikipedia, for many searches, it's actually the join to get article titles and restrict the results based on a field in another table which is slowest. That will get better in the next schema. But still, full-text with many stop words can be too slow to accept without full caching. Full caching is what I try to use to keep performance acceptable. It can't work well enough. Search enabled already uses more than half of the power of our database servers and that just won't scale acceptably as people increasingly come to us first for searching.
Spinx looks interesting.
April 2 2005, 03:49:25 UTC 7 years ago
Lucene
Well, I might need to take a look at it. It is not entirely clear how well it fits with Database content indexing. I guess for WikiPedia it is not the problem as you rather need parsed article text indexed anyway. Also Indexing speed does not impress... Their own performance benchmarks show "Time taken / 1000 docs indexed: 49 seconds" for 1KB documents, with Sphinx I got other 5000 docs/sec of similar size. Well you do not do reindexing often so it might not be the problem as soon as index updates are fast and do not degrade performance.
Anyway It would be interesting to see Lucene benchmarks compared to others not to itself.
To be honest I do not understand why retrieving articles gives you such problems ? Search gives you id back of some sort, and you pull articles based on this id (assuming search does not hold excempts in its database, in which case you only need to handle clicks)
By full caching do you mean all database fits in memory or some extra caching layer ?
Anonymous
7 years ago
7 years ago
April 4 2005, 07:02:54 UTC 7 years ago
However, it is pretty easy to implement index mergeing, so you would not reindex from scratch but only reindex the delta and the merge it with old index.
I am not sure if this will be any faster than reindexing the whole database (because the DB is easily accesible, unlike the Web pages f.ex.), but index merging is on my Sphinx roadmap anyway :)
7 years ago
7 years ago
7 years ago
3 years ago
Anonymous
June 19 2005, 04:01:12 UTC 6 years ago
DBSight = Database + Lucene + Display Templates
I would like you to take a look at DBSight.It's a J2EE app, which has
It's for any JDBC supported databases, including MySQL.
And you can try it at http://search.dbsight.com
There is a step by step tutorial on how this search is created.
6 years ago
Anonymous
5 years ago
3 years ago
November 7 2005, 06:54:33 UTC 6 years ago
Thanks a lot for the pointer!
Sphinx is exactly what I was looking for! Yesterday I've been playing with it and I'm really impressed! It indexed my data (forum, 10M posts, 4GB) in 16 minutes!total 10700961 docs, 3949351194 bytes
total 987.530 sec, 3999219.50 bytes/sec, 10836.08 docs/sec
Very impressive... Running queries is also lightning fast:
----------
query 'test ': returned 1000 matches of 13862 total in 0.01 sec
1. 'test': 13862 documents, 17340 hits
----------
query 'slaapplaats op 5 minuten fietsafstand ': returned 1 matches of 1 total in 0.13 sec
displaying matches:
1. group=29, document=654321, weight=5, time=Thu Jun 13 14:00:08 2002
words:
1. 'slaapplaats': 488 documents, 507 hits
2. 'op': 2899955 documents, 4706296 hits
3. '5': 350120 documents, 422303 hits
4. 'minuten': 48506 documents, 57289 hits
5. 'fietsafstand': 262 documents, 273 hits
----------
IMO this is a great contender for Wikimedia :)
Up to now I've been looking at Xapian (www.xapian.org) for full-text indexing. Also a very promising project but a bit harder to integrate with MySQL/PHP (not rocket science, it's just a bit more involved). Xapian is also used by http://gathering.tweakers.net/ (large forum) and www.gmane.org
Anyway. I think Sphinx has a great future ahead of it :)
November 7 2005, 21:22:14 UTC 6 years ago
Re: Thanks a lot for the pointer!
Thank you for your feedback,Yes - there is number of projects to chose for full text search indexing. I personally was very excited with Sphinx as it has kind of similar ideology to MySQL.
- It is very easy to use for database indexing
- It is grown from practical roots of handling a lot of stuff fast, not academic "proof of concept" system.
- Andrew is performance geek trying to squeze extra couple of percent of performance, where possible, just as Monty with MySQL
- Easy to adapt to your own needs
- Very good relevance
I know the case of Sphinx handling 100GB of data on the single box, performing
close to 200K of searches per day, most with sub second response time.
Anonymous
December 17 2005, 14:44:42 UTC 6 years ago
Re: Thanks a lot for the pointer!
yes but it still not supports +not, that is you cant say in a DB field +findme -leavemeout.Though its only a small limitation but its worth the effort. I specially like the way it returns the words it found in your query. So you can present a link saying "Did you mean xyz?" instead of saying no records found.
regards
sandeep
http://webmechanic.blogspot.com
6 years ago
Anonymous
6 years ago
Anonymous
March 7 2006, 10:08:21 UTC 6 years ago
Sphinx status?
Peter,Do you know what happened to project Sphinx? I've tried to reach the author via his contact-form on his site, but no replies for weeks...... (actually, his source code shows
March 7 2006, 18:37:36 UTC 6 years ago
Re: Sphinx status?
Matt, Drop me an email and I'll give you email of sphinx author. I would not like to post it on a web due to spamming issuesAnonymous
April 6 2006, 11:53:00 UTC 6 years ago
Re: Sphinx status?
He replied to me later on that night... *cheers*Anyway, Sphinx is incredibly fast.
Anonymous
June 17 2006, 18:41:32 UTC 5 years ago
hello
Wow! Cool design! Webmaster respect!Anonymous
August 9 2006, 09:29:49 UTC 5 years ago
Any screenshot or example query results?
Peter said: ...MySQL Full Text search does not perform well on large data sets...Peter, what is the bias or problem with Mysql full text search you meant above for exactly? any screenshot or example of biased query result using mysql full text index?
In your observation, how much the maximum data size that mysql full text index still work well?
Thank You .
August 9 2006, 09:42:34 UTC 5 years ago
Re: Any screenshot or example query results?
I do not have at this poing but We're working on presentation for EuroOSCON which among others will compare MySQL FullText search, Sphinx, Lucene and could be few others.We're using wikipedia dump for tests so it would be easy to repeat. There are queries which complete in about an hour if MySQL Full Text search is used.
The maximum data set will depend on amount of memory you have and queries you're running. For example if you need sorting result by other field than relevance (ie show newest data first) it will be very slow.
Anonymous
5 years ago
5 years ago
Anonymous
5 years ago
5 years ago
Anonymous
5 years ago
Anonymous
5 years ago
Anonymous
January 30 2007, 23:01:51 UTC 5 years ago
nokia last year added the RAZR v3 black Ferrari Challenge
MESSAGEJune 6 2007, 01:48:26 UTC 4 years ago
Re: nokia last year added the RAZR v3 black Ferrari Challenge
IntelligentAnonymous
February 4 2007, 22:17:05 UTC 5 years ago
Who can help me with .httpaccess ?
Who can help me with .httpaccess ?where i can fined full information about .httpaccess file syntaxis?
Anonymous
March 2 2007, 02:52:06 UTC 5 years ago
make money online idea
MESSAGEAnonymous
March 23 2007, 13:43:00 UTC 5 years ago
Does Sphinx works with longblob in mySQL?
I've been wondering if sphinx can index words with longblob type ? you said in your article for mngosearch that it had a blob mode... but does sphinx include one?Thanx in advance !
Sphinx seems to be a very powerful and useful tool
March 23 2007, 18:43:12 UTC 5 years ago
Re: Does Sphinx works with longblob in mySQL?
No,Sphinx stores indexes in files instead of database. But it can index text contained in logblob columns.
Anonymous
September 5 2007, 15:35:49 UTC 4 years ago
Re: Does Sphinx works with longblob in mySQL?
Peter,Now I am trying to develop a webapp that uses Sphinx and mysqldb and it would be very glad from you if you can help me. I would like to ask if you know any web crawler that inserts directly into a mysql database, because I want to use SPHINX in order to index and search.
Thanks!
Fabian
4 years ago
Anonymous
3 years ago
Anonymous
3 years ago
Anonymous
June 4 2007, 22:46:38 UTC 4 years ago
let the debate zdelaem How do you think movie good?
Tramadol is a Pain killer abused by many, Huraaa for being a pharmacy student.an effective pain reliever (analgesic). Its mode of action resembles that of narcotics, but it has significantly less potential for abuse and addiction than the narcotics.http://tramadolcom.info
Tramadol is a Pain killer abused by many, Huraaa for being a pharmacy student.
Anonymous
June 20 2007, 23:29:33 UTC 4 years ago
eptJypZjHxrafmqb
a6a7d2745ee994377352f07b209ce0d6Anonymous
October 2 2007, 11:09:07 UTC 4 years ago
Scale-out Lucene with ScimoreDB
How about scaling out freetext search? We have integrated Lucene (CLucene) into shared nothing ScimoreDB. If single machine is not enough, you can always add more. Distributed engine will partition rows across multiple servers; therefore there is less data per machine and more memory to use. Parallel execution benefits performance as well. Just run a small test with 7 machines to index 11mln message (5GB text data) of the forum. It scaled pretty well.The version with Lucene integration is still in beta, but can be downloaded from: http://www.scimore.com/download/2.5/scim
Hence, it is only windows DB, but it is free.
Marius Slyzius
www.scimore.com
Anonymous
April 13 2008, 08:52:14 UTC 4 years ago
How register mail on hotmail right ?
Hihttp://phentermineonline1.blog-br.com/
G'night
Anonymous
August 2 2008, 17:49:54 UTC 3 years ago
The compelte movie is only from $1.99 to $4.99.
Instant Full-Length Downloads. From as low as $1.99!You will never be asked to download or install any additional software. Just use programs you always use, like your favorite browser or download manager. No rentals, you own what you download. The compelte movie is only from $1.99 to $4.99.
DVD quality movies
Along with widely used «DivX» version, we also offer higher quality downloads: 6 channel surround sound and DVD video resolution.
Watch movies on your PDA
Yes, each movie on our website is available in special «PDA edition».
Burn, baby, burn!
You can burn any movie you download to a CD or DVD. Watch movies on home video player with your family or friends.