Chopin Op. 58 fourth movement
This is the third in a series of posts[1] about full text indexing and searching in CouchDB. I lost the second (never drink while in the manage panel :). For a variety of reasons[2] I've settled on using CouchDB as the foundation for a new terminology environment. Missing from CouchDB are two key components: the ability to search over documents and the ability to relate documents to one another. On the face of it the schema-less design might also seem to be a serious shortcoming. But in our work on terminology development over the years, having to commit to a relational schema has been a major impediment to building tools that support collaborative modeling efforts as they evolve over time. Moreover the O-R impedance mismatch is a tough hurdle, as our work often involves algorithms over large graphs. So Bitstore[2] proposes adding these two components as extensions to CouchDB.
CouchDB does support search. Robert Newson's excellent integration of Lucene[3] as an external brings all the power of Lucene to CouchDB. In my use case I'm not really interested yet in web scale. I want something simple that runs in the same VM as CouchDB and, more importantly, can interact with a description logic [4] reasoner so that one can search for "heart attack" and find "myocardial infarction" or "tetralogy of Fallot". The use case also involves a small number of users running couchapps shared through replication. The idea of throwing Java into the mix is not appealing for all the usual reasons.
This newest prototype of Bitstore adds the ability to filter[5] searches over the field names of the documents, along with a number of other minor packaging features to help it play better with CouchDB. The screenshots below show its use with a modified version of Futon. The details are available on github[5].
In theory CouchDB document bases are schema-less. Of course in practice there is almost always a schema lying around. How views are built generally relies on this. CouchDB-Lucene also relies on this to construct indices: a design document is employed to specify how Lucene Document objects are created. In this approach we track the field names while indexing (I call them slots, I guess because this JSON stuff isn't really new and it reminds me of earlier frame based systems) and incorporate them in the inverted index to enable filtering of the search results. If one actually had a true schema-less database then one could imagine constructing a separate index for the slot names, enabling search over those as well. But for now it seems that in practice, since a schema is always implicitly present, there is no need to index them. I think this is totally relaxed, hence the title of this post. Even a fairly complex db such as NCI's Biomedgt only has 30 to 40 slots. Of course this scheme breaks down when JSON docs are nested.
For example Biomedgt has a slot called FULL_SYN which contains some embedded XML that I've converted, so we need to enhance this approach to support nested JSON. If anyone wants to try this, Biomedgt is a couchdb database I built using jcouch from the original OWL version. The folks at Cloudant have been kind enough to give me some free space[7] for it.
I continue to be impressed with Bitcask[6]. It's rock solid and in fact I almost forget it's there. Occasionally I pull the latest changes to keep in sync and nothing ever breaks. At some point I may consider a tighter integration with the couchdb storage layer, but for now Bitcask is adequate and it also serves as a useful backend for the triple store.
Searching is still very limited. One can enter a word or group of words and a conjunctive query is formed. The results are post processed to filter by the specified slot. Now that it's a little better integrated with CouchDB I'm going to start adding features like wildcard searching, spelling correction, better handling of chemical names, ... unfortunately there are only so many holiday weekends per year. Anyway it keeps me off the streets.
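To make the slot idea concrete, here is a minimal sketch in Python. It is not the Bitstore code (that is Erlang and lives on github), just an illustration of recording slot names in the inverted index, forming a conjunctive query, and post-processing by slot. The documents, the slot names `name` and `definition`, and the rule that every query word must occur in the chosen slot are all assumptions made for the example. Like the prototype, it only handles flat docs.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build word -> {doc_id -> set of slot names where the word occurs}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for slot, value in doc.items():
            # Flat documents only: nested values are skipped, which is
            # exactly where this scheme breaks down.
            if isinstance(value, str):
                for word in tokenize(value):
                    index[word][doc_id].add(slot)
    return index


def search(index, query, slot=None):
    """Conjunctive search, optionally post-filtered by slot name."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    # Conjunction: keep only docs that contain every query word.
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= set(posting)
    if slot is not None:
        # Post-processing step (assumed rule): every word must occur
        # in the requested slot.
        hits = {d for d in hits if all(slot in p[d] for p in postings)}
    return sorted(hits)


# Hypothetical documents, not taken from Biomedgt.
DOCS = {
    "c1": {"name": "Benign Rat Neoplasm",
           "definition": "A neoplasm of the rat that is benign."},
    "c2": {"name": "Rat Carcinoma",
           "definition": "A malignant neoplasm, not benign, seen in the rat."},
    "c3": {"name": "Mouse Neoplasm",
           "definition": "A neoplasm of the mouse."},
}
INDEX = build_index(DOCS)

print(search(INDEX, "rat neoplasm benign"))               # all slots
print(search(INDEX, "rat neoplasm benign", slot="name"))  # name slot only
```

Because the slot names ride along in the postings, filtering costs nothing extra at index time and needs no separate slot index, which is the point of the argument above.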
Performance is one of those things that's hard to assess; there are typically many moving parts in any system. What gets measured, how it's measured, and the context in which it's measured often make performance testing and benchmarking a controversial subject. Because it makes for good marketing, we often focus on things that might be largely irrelevant to the overall user experience, like raw insert speed. This is the reason manufacturers can get someone to pay a premium for a few extra horsepower or megahertz.
Over the holiday I took some spare time to build a prototype of full text indexing for CouchDB that runs in the same VM as the couchdb server. The basic design is from chapter 20 of the Erlang book, and Joe Armstrong was kind enough to allow me to use the sample code from the book as a starting point. It sticks to the basic design, using map/reduce, and interacts with couchdb using Hovercraft with a few extensions. The goal of this prototype is to explore both couchdb internals and FTI in Erlang. The implementation is quite naive, using a couch database to store the inverted index, but it works surprisingly well for my use case and is very simple.
What's impressive about couchdb is how well it holds up under concurrent load. It takes ~5 minutes to index a fairly complex database of 65K docs, indexing the entire doc, resulting in an index database that when compacted is about 75% of the original database size. While indexing this database I replicated it to another copy and started indexing that copy, and while all this was running I was still able to simulate 200 clients running queries with a throughput of about 50-60/sec. The query was "rat neoplasm benign". Finding all docs containing these 3 words takes a total of 6 retrievals. For the end user, getting sub-second response times while the server is busy indexing two databases, the second of which is being replicated, really rocks!! When the indexing is complete the indexer polls every minute or so, using the changes API to incrementally update the index.
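The incremental update step can be sketched roughly as follows, in Python rather than the prototype's Erlang. This assumes the poller fetches `_changes?since=<last_seq>&include_docs=true` and hands the decoded JSON to `apply_changes`; the `Index` class, its in-memory storage, and the sample feeds are hypothetical (the real prototype stores the index in a couch database and talks to couchdb through Hovercraft).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    """In-memory inverted index kept up to date from a changes feed."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of docs containing it
        self.doc_words = {}               # doc id -> words indexed for that doc
        self.last_seq = 0                 # opaque checkpoint for the next poll

    def remove(self, doc_id):
        """Drop every posting that was recorded for this document."""
        for word in self.doc_words.pop(doc_id, set()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def add(self, doc_id, doc):
        """Index all top-level string fields, skipping _id, _rev and friends."""
        words = set()
        for key, value in doc.items():
            if not key.startswith("_") and isinstance(value, str):
                words.update(tokenize(value))
        for word in words:
            self.postings[word].add(doc_id)
        self.doc_words[doc_id] = words

    def apply_changes(self, feed):
        """Apply one decoded _changes response, then advance the checkpoint."""
        for row in feed["results"]:
            self.remove(row["id"])        # forget the stale revision first
            if not row.get("deleted"):
                self.add(row["id"], row["doc"])
        self.last_seq = feed["last_seq"]


# Two hypothetical polls: initial content, then an update and a deletion.
FEED_1 = {"results": [
    {"seq": 1, "id": "a", "changes": [{"rev": "1-x"}],
     "doc": {"_id": "a", "_rev": "1-x", "name": "rat neoplasm"}},
    {"seq": 2, "id": "b", "changes": [{"rev": "1-y"}],
     "doc": {"_id": "b", "_rev": "1-y", "name": "benign rat tumor"}},
], "last_seq": 2}

FEED_2 = {"results": [
    {"seq": 3, "id": "a", "changes": [{"rev": "2-x"}],
     "doc": {"_id": "a", "_rev": "2-x", "name": "mouse neoplasm"}},
    {"seq": 4, "id": "b", "changes": [{"rev": "2-y"}], "deleted": True,
     "doc": {"_id": "b", "_rev": "2-y", "_deleted": True}},
], "last_seq": 4}

index = Index()
index.apply_changes(FEED_1)
index.apply_changes(FEED_2)
print(dict(index.postings), index.last_seq)
```

Removing a doc's old postings before re-adding it keeps stale words from lingering after an edit, and saving `last_seq` means each poll only sees what changed since the previous one.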
For couchapps running on laptops this is all the performance one needs. YMMV
We're getting near the end of the fall color here in Ridgefield. It was such a nice fall day we decided to take in a faculty artist concert at Hoff Barthelson [1]. I can't believe you can hear a program like this in a small church setting for $15. The Franck Sonata in A major was beautiful.