Channel: Technology of Content » nosql

Search, SQL, NoSQL, Persistence


I highly recommend the Enterprise Search London meetup: there are lots of interesting talks, thanks to our intrepid organizer Tyler Tate. At the last meetup, H. Stefan Olafsson from Twigkit gave a short talk about the relationship between relational databases and search engines, and whether you need a relational database at all if you have a search engine.

[Image: Craigslist Apartments, by XKCD]

Now this is something I have been thinking about recently, and there are people who are moving big parts of their systems to be built on search alone, such as the Guardian API, which is served from Apache Solr. In this case, though, the search engine is still not the system of record for the core data; that remains the Oracle-based CMS, which did not scale up enough to serve the API. There was some discussion at the talk about search engines that do support durability (the D in ACID), something Lucene used to have a bad reputation for. My view is that, while making fsync work properly is a good thing, and you should not buy software that cannot recover from crashes, persistence now involves a lot more than this: replication, audit, versioning and access control. Building all of this directly into search products is a mistake. Another issue is that search indexes are denormalized, whereas data stores of record should be normalized to a large extent, to minimise the amount of data to be replicated.

There are two approaches that should work instead, however.

The first is more or less the current approach: use the search engine as an index to a persistent store. I really like this approach if we follow it to its logical conclusion, which is that the persistent store in this type of application architecture should not be a relational database but a document store: a file, an HTTP resource, a document in a NoSQL document database, or an object in a replicated cloud storage system like S3. Modularize the database application, and split the persistence function from the index function. The persistence function provides durability, versioning, audit and access control, along with replication and backup. It can then update the search index, and potentially any other type of index, such as a graph database for querying relationships, or even a relational database if that is the best way of querying some aspects of the data.
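To make the split concrete, here is a minimal in-memory sketch in Python. The names `DocumentStore` and `FullTextIndex` are made up for illustration, and a real system would put something like S3 or a document database behind the store and Solr or Lucene behind the index. The point is only that the store keeps the versions and the index is derived data that can be thrown away and rebuilt.

```python
class DocumentStore:
    """Hypothetical system of record: keeps every version of every document."""

    def __init__(self):
        self._history = {}   # doc_id -> list of bodies, oldest first
        self._indexes = []   # derived indexes to notify on each write

    def attach(self, index):
        self._indexes.append(index)

    def put(self, doc_id, body):
        versions = self._history.setdefault(doc_id, [])
        versions.append(body)
        # Push the change to every attached index (search, graph, ...).
        for index in self._indexes:
            index.update(doc_id, body)
        return len(versions)  # version numbers start at 1

    def get(self, doc_id, version=None):
        versions = self._history[doc_id]
        return versions[-1] if version is None else versions[version - 1]

    def reindex(self, index):
        # Rebuild an index from scratch using the latest version of each document.
        for doc_id, versions in self._history.items():
            index.update(doc_id, versions[-1])


class FullTextIndex:
    """Hypothetical derived index: a toy inverted index with whitespace tokens."""

    def __init__(self):
        self._postings = {}  # term -> set of doc ids
        self._terms = {}     # doc_id -> terms currently indexed for it

    def update(self, doc_id, body):
        # Drop the old postings for this document before adding the new ones.
        for term in self._terms.get(doc_id, set()):
            self._postings[term].discard(doc_id)
        terms = set(body.lower().split())
        self._terms[doc_id] = terms
        for term in terms:
            self._postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))
```

Attach an index with `store.attach(index)`, write with `store.put(...)`, and query with `index.search(...)`. Old versions stay available through `store.get(doc_id, version=1)` even though the index only knows about the latest one.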

Obviously there is a potential consistency issue: if updates from the document store reach the indexes slowly, what you have is an eventual consistency model. Historically search was a bad offender here, as dynamic updates were not the norm and everything was batched into nightly updates, but that is going away and dynamic updates are now more normal for search indexes. In principle you can have stronger consistency, especially in an architecture where there are fixed releases that can be indexed as a whole rather than distributed rolling updates; you pick your architecture and make your choice. Small consistency lags do not matter in a lot of applications.
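The lag is easy to picture with a toy example. The sketch below is an assumption of mine, not how any particular search engine works: `LaggingIndex` is a made-up class that queues updates and only makes them searchable when `drain()` runs, which stands in for a commit or a batch job.

```python
from collections import deque


class LaggingIndex:
    """Hypothetical index whose updates become visible only after drain()."""

    def __init__(self):
        self._pending = deque()  # updates accepted but not yet searchable
        self._visible = {}       # doc_id -> body that searches can see

    def enqueue(self, doc_id, body):
        self._pending.append((doc_id, body))

    def lag(self):
        # Number of updates the index is behind the store.
        return len(self._pending)

    def drain(self, batch_size=None):
        # Apply up to batch_size pending updates (all of them if None).
        applied = 0
        while self._pending and (batch_size is None or applied < batch_size):
            doc_id, body = self._pending.popleft()
            self._visible[doc_id] = body
            applied += 1
        return applied

    def search(self, term):
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, body in self._visible.items()
            if term in body.lower().split()
        )
```

Between `enqueue` and `drain` a search simply does not see the new document. A nightly batch is a `drain()` once a day; near-real-time indexing is the same thing run every second or so.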

So you end up with an architecture with a well defined persistence layer that is not a relational database, and a set of indexes appropriate to the application, almost certainly including a full text search engine, but perhaps a graph engine too. Maybe you run consistency checks on your indexes for peace of mind.
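A consistency check can be as simple as comparing fingerprints. This is a rough sketch that assumes the index keeps a content hash alongside each indexed document; the function names and the dictionary shapes are invented for the example.

```python
import hashlib


def fingerprint(body):
    """Content hash used to compare the stored and indexed copies."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def check_index(store_docs, indexed_fingerprints):
    """Compare the store (doc_id -> body) with the index (doc_id -> hash).

    Both argument shapes are assumptions made for this sketch.
    """
    stored = set(store_docs)
    indexed = set(indexed_fingerprints)
    return {
        # In the store but never indexed.
        "missing": sorted(stored - indexed),
        # Indexed, but from an older version of the document.
        "stale": sorted(
            doc_id
            for doc_id in stored & indexed
            if fingerprint(store_docs[doc_id]) != indexed_fingerprints[doc_id]
        ),
        # Still in the index although the store no longer has it.
        "orphaned": sorted(indexed - stored),
    }
```

Anything reported as missing or stale can be pushed through the indexer again, and anything orphaned can be deleted from the index, since the store is the system of record.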

The second approach is to recognise that search engines were some of the original NoSQL data stores, building custom storage and indexing engines because they had such difficult problems. Indeed Google’s BigTable, and with it the ancestry of a lot of NoSQL products, came from search. However, the search engines around now have not yet refactored themselves on top of the NoSQL engines that have emerged from this work. This is starting to happen with Lucandra, which is Lucene persisted in Cassandra and looks promising, offering seamless replication and distribution, and with HBasene, an HBase backend for Lucene. These make a huge amount of sense to me: if you are developing sophisticated search algorithms, not having to build the whole index and persistence layer as well is a big advantage, as is the scale-out potential. Of course this approach does not conflict with the first one; in fact you could choose a NoSQL backend that is aimed more at read performance than persistence, and at storing small index values fast. The hard part is that the search engines have customised their data storage for their particular use cases, so from their point of view reworking this onto a more general backend has few apparent advantages. As you can see from the examples above, most of these changes have come from people who already use the backends in question and want a single database to manage all their data requirements, particularly once they are working with high availability and replication. Software modularity really is not at the right level yet, is it? I blame object-oriented programming for this lack of reusability.
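As a rough illustration of the idea, and not a description of how Lucandra or HBasene actually lay out their data, here is an inverted index that only needs `get` and `put` from its backend. `InMemoryBackend` is a stand-in I made up; the assumption is that a client for a distributed key-value store could offer the same two methods.

```python
import json


class InMemoryBackend:
    """Hypothetical stand-in for a distributed key-value store."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class KeyValueInvertedIndex:
    """Toy inverted index that stores each posting list as one value."""

    def __init__(self, backend):
        self._backend = backend

    def _load(self, term):
        raw = self._backend.get("term:" + term)
        return json.loads(raw) if raw is not None else []

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            postings = self._load(term)
            if doc_id not in postings:
                postings.append(doc_id)
                postings.sort()
                # Read-modify-write: a real store would need this to be atomic.
                self._backend.put("term:" + term, json.dumps(postings))

    def search(self, term):
        return self._load(term.lower())
```

The search code knows nothing about where the postings live, so replication and distribution become the backend's job. The price is the read-modify-write on every posting list, which is exactly the kind of thing a purpose-built search store is tuned to avoid.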

Anyway, back to the main point. For applications like content management, I would argue for an architecture based on a content store that deals with persistence, versioning, access control and replication, with a set of indexes based on search engine techniques, graph databases, and anything else your application needs. Ideally the indexes are all based on a common set of low-level primitives, so that the backend can be swapped out or shared between the search store and other application-specific indexing requirements. That gives you a single low-level indexing infrastructure, available as a common scalable service, with different implementations to choose from. This type of architecture is quite buildable now, and is certainly used in quite a few applications. I think it will become much more widespread, particularly in the cloud where it seems more natural, and certainly for the many types of application that fit a document model, such as content-based applications.
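One candidate for such a primitive is a sorted set of keys with prefix scans. The sketch below is only my guess at what a shared layer might look like: `SortedKV` and the `t/` and `g/` key layouts are invented for the example, and it assumes terms and ids never contain a slash.

```python
import bisect


class SortedKV:
    """Hypothetical shared primitive: a sorted set of string keys."""

    def __init__(self):
        self._keys = []

    def add(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def scan(self, prefix):
        # Return every key that starts with prefix, in sorted order.
        i = bisect.bisect_left(self._keys, prefix)
        found = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            found.append(self._keys[i])
            i += 1
        return found


def index_text(kv, doc_id, text):
    # Full-text index: one key per (term, document) pair.
    for term in set(text.lower().split()):
        kv.add(f"t/{term}/{doc_id}")


def search_text(kv, term):
    prefix = f"t/{term.lower()}/"
    return [key[len(prefix):] for key in kv.scan(prefix)]


def index_link(kv, source, target):
    # Graph index: one key per outgoing edge.
    kv.add(f"g/{source}/{target}")


def links_from(kv, source):
    prefix = f"g/{source}/"
    return [key[len(prefix):] for key in kv.scan(prefix)]
```

Both the text index and the graph index are just key conventions over the same `add` and `scan` calls, so replacing `SortedKV` with a distributed implementation would not touch either of them.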

