MySQL Full Text search does not perform well on large data sets
It looks like the problem which is not solved even by pretty smart people. For example LiveJournal does not have full text search itself, while I would guess this is feature which a lot of people would like. Or Wikipedia offering Google or Yahoo search instead of their own full text search. If you could just plug in "MATCH (text) AGAINST ("phrase")" and enjoy I guess it would be long done already.
I've been playing with FullText search a lot but simply could not make it to perform well even in best scenario (fully sorted index) when data size is
significant. If you know how to make it return 10 most relevant documents out of 1.000.000 matching from 10.000.000 collection, when data size is 10x of memory size - let me know.
My felling is - it just can't be done without serious changes in FullText architecture, which would lose some of its benefits, for example live index updates. Of course if you have small document collection, ie 100.000 which well fits in memory, plus you have light load and very selective queries it make well enough for you, so there is no need to go for something different, which makes things more complex.
The problem with performance also is so bad you can't really fix it with caching, as single cache miss may lead to user never getting his result back.
So Internal FullText did not work well so I decided to check for other possibilities. First one was Mnogosearch (http://www.mnogosearch.org),
it is pretty well known and is initially designed to index web sites, but also can do databases. It is used for example for Manual search on MySQL web site.
The first problem I run into with - even with fastest full index mode it took other 24hours to index sample database with 500MB text (3.000.000 docs)
without tuning table schema and options it was taking other a week. (Box has 512MB of RAM and Sempron 2800+ CPU)
There is several modes it can work with - BLOB mode seems to be the only which offers reasonable performance. Otherx were much 5-10 times slower.
The problem with it however is memory requirement - BLOB is stored in database and has to be retrieved as single piece which requires a lot of memory on client and server, so you need to watch how many queries is running concurrently. In my tests I had mnogosearch taking over 100MB of memory to perform some queries... this is 20% of data size - quite a lot.
Mnogosearch also supports mode when all data is stored on file system rather than in database - I have not tested it, tell me if it is much faster.
Mnogosearch performance for data larger than memory size is better as BLOB can be fetched with large sequential read, however I still had some complex queries taking other 10 seconds which is a lot.
The other tool I tried was "Sphinx" (http://www.shodan.ru/projects/sphinx) It is not well known, I just run into it by pure luck but it seems to work extremely well.
It is designed especially for indexing databases so it does not need this weird "URL to database" conversion as Mnogosearch. It also stores data in
files in its own packed format which takes much less space.
The indexing speed was nice surprise - content was indexed just in 5 minutes, most of which was spent fetching data from the database. Also I could not find a single query which was executing more than a second, even if I would index it with reduced stop word list, so some very popular words are indexed.
Here is benchmark sample:
500MB of docs, 3.000.000 in count. 512M,128M key buffer.
Looking for "internet web design" (match any) returns
With full text search I just Benchmarked count(*), as all real queries
would be slower, unless you do "in boolean mode" with small limit and
without sorting - which returns words with any relevance ranking at all.
"Cached" means second run, Index + data did not fully fit in memory
this is why it does not always help a lot.
Full Text Search: 392 Sec
Full Text Search (Cached): 272 Sec
Full Text Boolean Mode 12 Sec
Full Text Boolean (Cached) 11 Sec
Mnogosearch (external) 3.5 Sec
Mnogosearch (external cached) 1.06 Sec
Sphinx 0.23 Sec
Sphinx Cached 0.15 Sec
LIKE %...% 30sec Sec
LIKE %...% (Cached) 29sec Sec