Full Text Indexing, CouchDB, and Performance
Performance is one of those things that's hard to assess because there are typically many moving parts in any system. What gets measured, how it's measured, and the context in which it's measured often make performance testing and benchmarking a controversial subject. Because it makes for good marketing, we often focus on things that might be largely irrelevant to the overall user experience, like raw insert speed. This is the reason manufacturers can get someone to pay a premium for a few extra horsepower or megahertz.
Over the holiday I took some spare time to build a prototype of full text indexing for CouchDB that runs in the same VM as the CouchDB server. The basic design is from chapter 20 of the Erlang book, and Joe Armstrong was kind enough to allow me to use the sample code from the book as a starting point. The prototype sticks to that design, using map/reduce, and interacts with CouchDB through Hovercraft with a few extensions. The goal of this prototype is to explore both CouchDB internals and FTI in Erlang. The implementation is quite naive, using a Couch database to store the inverted index, but it works surprisingly well for my use case and is very simple.
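To make the idea of a map/reduce-built inverted index concrete, here is a minimal sketch in Python rather than Erlang. It is not the prototype's code: the function names (`map_doc`, `reduce_pairs`, `build_index`) are made up for illustration, the tokenizer is a simple assumption, and a plain dict stands in for the Couch database that holds the real index.

```python
import re
from collections import defaultdict


def map_doc(doc_id, doc):
    """Map step: emit (word, doc_id) pairs for each string field of a doc."""
    for value in doc.values():
        if isinstance(value, str):
            # Assumed tokenizer: lowercase runs of letters and digits.
            for word in set(re.findall(r"[a-z0-9]+", value.lower())):
                yield word, doc_id


def reduce_pairs(pairs):
    """Reduce step: group the emitted pairs into word -> sorted doc ids."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_index(docs):
    """Build an in-memory inverted index from a dict of doc_id -> doc."""
    pairs = (pair for doc_id, doc in docs.items()
             for pair in map_doc(doc_id, doc))
    return reduce_pairs(pairs)
```

Calling `build_index` on a dict of documents returns a mapping from each word to the ids of the docs that contain it, which is roughly the shape of data an index database would need to store.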
What's impressive about CouchDB is how well it holds up under concurrent load. It takes ~5 minutes to index a fairly complex database of 65K docs, indexing the entire doc. The resulting index database, when compacted, is about 75% of the original database size. While indexing this database I replicated it to another copy and started indexing that copy too, and while all this was running I was still able to simulate 200 clients running queries with a throughput of about 50-60/sec. The query was for "rat neoplasm benign". This search, for all docs containing these 3 words, makes a total of 6 retrievals. For the end user, getting sub-second response times on queries while the server is busy indexing two databases, the second of which is being replicated, really rocks!! When the indexing is complete the indexer polls every minute or so, using the changes API to incrementally update the index.
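A multi-word query like the one above boils down to looking up each word in the inverted index and intersecting the results. The sketch below shows that step in Python under some assumptions: `search` is a hypothetical name, the index is a plain dict of word to doc ids rather than a Couch database, and it does one lookup per word, so it does not reproduce the retrieval count of the prototype.

```python
def search(index, query):
    """Return the sorted ids of docs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    # One lookup per query word; unknown words give an empty posting set.
    postings = [set(index.get(word, ())) for word in words]
    # Intersect starting from the smallest set to keep the work small.
    postings.sort(key=len)
    result = postings[0]
    for posting in postings[1:]:
        result &= posting
    return sorted(result)
```

With an index where only one doc contains "rat", "neoplasm" and "benign", `search(index, "rat neoplasm benign")` returns a list holding just that doc's id.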
For couchapps running on laptops this is all the performance one needs. YMMV.
