Versioning docs in CouchDB

2010-05-26T00:19:36.000Z

A common question among new CouchDB users is, "Can I use the built-in MVCC _revs for document versioning?" The answer, of course, is no. Compaction removes old revs, only the latest rev is represented in view queries, and only the latest revision is replicated. So at any given time (especially on multi-master clusters) you can only depend on having a single version of the document available.

If you want to store document edit history, you'll need to do it yourself, in the application, rather than relying on the database's internal revisions. This is so simple that I've implemented it in the jQuery client for CouchDB. There has been a lot of discussion, and there have been many ideas, over the years about the best way to implement document versioning. The approach I've chosen isn't perfect for every application, but it has some properties that make it ideal to include in the standard library (and someday as a patch for server-side CouchDB).

The Approach

The way I've implemented document versioning is simple. Developers pass a flag to the db.openDoc() function, and when the document is loaded from the CouchDB server, the string representation is saved before the JSON is parsed. Later, when the document is saved, the string representation is attached as a new binary attachment, with the corresponding rev as its name and a content type of application/json. This way any CouchDB library can just open the stored rev and see it as a normal document.

This means that each time the document is updated, the client will also store the previous version as an attachment to the latest version. At any time, a user can load any of the old versions. If the URL for a CouchDB document is

http://localhost:5984/db/my-docid

then the URL to load the 3rd revision of that document might be:

http://localhost:5984/db/my-docid/rev-3-a2759ea8b50d82489bce49cc733b08d1

And the contents are exactly the same as would have been served by CouchDB at the 3rd rev of the document.
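As a small illustration, here is a sketch of how a client might build that URL. The helper name prevRevUrl is made up for this example, and it assumes the attachment is named "rev-" followed by the document's rev, as in the URL above.

```javascript
// Hypothetical helper (not part of jquery.couch.js): build the URL of a
// stored revision, assuming attachments are named "rev-" + rev.
function prevRevUrl(serverUrl, dbName, docId, rev) {
  var parts = [dbName, docId, "rev-" + rev].map(encodeURIComponent);
  // Drop any trailing slash on the server URL before joining the path.
  return serverUrl.replace(/\/+$/, "") + "/" + parts.join("/");
}

var url = prevRevUrl("http://localhost:5984", "db", "my-docid",
                     "3-a2759ea8b50d82489bce49cc733b08d1");
console.log(url);
```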

The API is very basic, and there is room to extend it. Versioning is invoked in the client application, by opening the document with a flag attachPrevRev : true. This alerts the client library to set aside the original string representation of the JSON doc, as served by CouchDB. When the document is saved, the client library encodes the string using Base64 and saves it as a new attachment on the document.
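To make the save step concrete, here is a minimal sketch of what it could look like. It is not the jquery.couch.js source: attachPrevRev here is a hypothetical standalone function, and it uses Node's Buffer for the Base64 step (in a browser you would use something like btoa instead). The inline attachment shape, with content_type and Base64 data, is the standard CouchDB format.

```javascript
// Sketch only: a hypothetical helper, not the library's implementation.
// rawJson is the response body exactly as the server sent it, before parsing.
function attachPrevRev(rawJson, editedDoc) {
  var prev = JSON.parse(rawJson);
  var attachments = {};
  var existing = editedDoc._attachments || {};
  // Keep whatever attachments (or stubs) the document already has.
  Object.keys(existing).forEach(function (name) {
    attachments[name] = existing[name];
  });
  // Add the previous revision as an inline, Base64-encoded attachment.
  attachments["rev-" + prev._rev] = {
    content_type: "application/json",
    data: Buffer.from(rawJson, "utf8").toString("base64") // Node-specific
  };
  var out = Object.assign({}, editedDoc);
  out._attachments = attachments;
  return out;
}

var raw = '{"_id":"my-docid","_rev":"3-a2759ea8b50d82489bce49cc733b08d1","title":"old"}';
var doc = JSON.parse(raw);
doc.title = "new";
var toSave = attachPrevRev(raw, doc);
console.log(Object.keys(toSave._attachments));
```

The returned object is what would be sent to the server on save; the original string travels along untouched, so the stored version is byte-for-byte what CouchDB served.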

The versioning code can also be invoked by adding a flag to the document. Currently, setting the field doc["jquery.couch.attachPrevRev"] = true will cause jquery.couch.js to start versioning the document the next time it is opened. Thanks for the suggestion, Damien; this was simple to implement and makes it easier for end users to make use of the feature, even just in Futon.

Versioning can also be turned on for all documents in an application (so you can add versioning to an existing app) by creating the db adapter using this JavaScript:

var db = $.couch.db("mydbname", {attachPrevRev : true});

Now any time you call db.openDoc(docId), the document will be opened in versioning mode.
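To recap, there are three switches: the flag passed when opening a document, the flag on the db adapter, and the field on the document itself. The sketch below shows one way a client could combine them. The function wantsVersioning is hypothetical, and the rule that any one switch is enough is my reading of the descriptions above, not a quote of the library code.

```javascript
// Hypothetical helper: decide whether a freshly opened document should be
// versioned. Assumes any one of the three switches is enough to enable it.
function wantsVersioning(doc, openOptions, dbOptions) {
  return Boolean(
    (openOptions && openOptions.attachPrevRev) ||   // flag passed to openDoc
    (dbOptions && dbOptions.attachPrevRev) ||       // flag on the db adapter
    doc["jquery.couch.attachPrevRev"] === true      // field on the document
  );
}

console.log(wantsVersioning({}, { attachPrevRev: true }, {}));
console.log(wantsVersioning({ "jquery.couch.attachPrevRev": true }, {}, {}));
console.log(wantsVersioning({ title: "plain" }, {}, {}));
```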

Benefits

The benefits to this approach are manifold:

Simple

The first, of course, is simplicity. What could be simpler than storing the old version of the document as an attachment on the new version? This has some side benefits: one is that users can prune the history, so if there are minor changes or irrelevant history, the user can delete it to reclaim space. Another side effect of the simplicity is that the versioning support can be added to existing applications without affecting them. Applications that don't care about history can be trivially modified to support history, and then a 3rd party application could be used to roll documents back to old versions, etc.

Scalable

The attachment approach is also sane from a resource-usage point of view. You don't want to come up with a versioning scheme that works for 10 revisions, but not 1000. This revisioning scheme will work for thousands of revisions before it needs to be pruned. Essentially, the limiting factor is that each time a new revision is added, the list of attachments grows by one member, so after a few thousand revisions the list of revisions will start to dwarf the actual data. As I mentioned above, since they are just attachments, they can easily be deleted, so users can reclaim space by deleting insignificant attachments. I think this is a humane way to handle a problem that is usually overburdened with technical solutions. In general I prefer technology that gives power to people, rather than presenting an inflexible "right way" of doing things.
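Pruning can be automated too. Here is a rough sketch that keeps only the newest few stored revisions. The helper names are made up, and it assumes the version attachments are named "rev-N-hash" so the leading number can be used for ordering. It relies on the fact that an attachment left out of _attachments when the document is saved is removed by CouchDB.

```javascript
// Hypothetical helpers, assuming version attachments are named "rev-<N>-<hash>".
function generation(name) {
  return parseInt(name.split("-")[1], 10);
}

// Return a copy of doc that keeps only the newest `keep` stored revisions.
// Attachments that are not stored revisions are left alone.
function pruneOldRevs(doc, keep) {
  var attachments = doc._attachments || {};
  var revNames = Object.keys(attachments)
    .filter(function (name) { return /^rev-\d+-/.test(name); })
    .sort(function (a, b) { return generation(b) - generation(a); }); // newest first
  var drop = revNames.slice(keep);
  var pruned = {};
  Object.keys(attachments).forEach(function (name) {
    if (drop.indexOf(name) === -1) pruned[name] = attachments[name];
  });
  var out = Object.assign({}, doc);
  out._attachments = pruned;
  return out;
}

var stub = { content_type: "application/json", length: 42, stub: true };
var versioned = {
  _id: "my-docid",
  _attachments: { "rev-1-aaa": stub, "rev-2-bbb": stub, "rev-10-ccc": stub, "photo.png": stub }
};
console.log(Object.keys(pruneOldRevs(versioned, 2)._attachments));
```

A person deciding which versions matter will usually do better than a fixed count, but a cap like this is an easy safety net.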

Replicates

Attachments for versioning also have the nice property that they replicate with the documents, like normal data. This means that the rules of eventual consistency apply. Once a cluster has completely replicated, all cluster members will see the same state. Since the versions are normal attachments, deleting old versions also replicates. This means that you will always have the versions of the document, if you have the document at all. If the versions were stored in different docs, then you might have old versions and not the current one, or the new one and not old ones, or some other combination. By making old versions attachments to the document itself, this is all simplified.

Alternatives

Before I close this post I'll describe some of the alternatives for document versioning, and why I didn't choose them. This doesn't mean they should never be used; it's just that I think the attachment approach is more appropriate for most uses. These other approaches might be better for some apps, which is why I'm reviewing them here.

In the Document

One option would be to append each old version to an array of versions in the JSON document. This means that the document itself will grow with each revision, which means over time an app using this approach will slow down, as more and more time is spent processing irrelevant JSON in the query server. If your app will be doing lots of queries of old versions of documents, I would consider this approach (or the next one). But most applications just want the old versions for an audit trail, not for frequent querying.

Multiple Documents

In this approach, each change is put into a new document. The documents would all share a common field, perhaps "real_id", that is used in view queries to correlate the versions. This would be appropriate for an application that was written heavily around a versioned model. For instance, if someone were to build an application to facilitate management of versions stored using the attachment method, that application might import the attachments to multiple documents for viewing and browsing. Later the user's changes could be applied to the original database, using the attachment format again.
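For the curious, the view for this approach might look something like the sketch below. The "real_id" field comes from the description above; the "version" counter field is an assumption added for ordering. Inside CouchDB the emit() function is provided by the view server, so the stand-in here exists only to let the example run on its own.

```javascript
// Sketch of a map function for the multiple-documents approach.
// "version" is a hypothetical counter field; "real_id" is from the text.
function mapVersions(doc) {
  if (doc.real_id) {
    emit([doc.real_id, doc.version], null);
  }
}

// Minimal stand-in for the view server's emit(), for running outside CouchDB.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

[
  { _id: "a1", real_id: "my-docid", version: 1 },
  { _id: "a2", real_id: "my-docid", version: 2 },
  { _id: "zz", title: "unrelated" }
].forEach(mapVersions);
console.log(rows);
```

Querying such a view with startkey=["my-docid"] and endkey=["my-docid", {}] would return every version of that one logical document, in version order.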

Just Don't Compact

Another option is to forgo compaction. This will work if you are on a single node, but as soon as you are on a cluster, only the head revs will be replicated, so access to all versions is not guaranteed. If you think you want to do it this way, you are probably better off using the attachment mode described above.

Update: comment from a reader

Thanks Ian!

Been playing with this, great strategy. I got stumped, however, by one problem: when I updated the document, the attachments were lost. It appeared that I would have to download the attachments before updating the document, and then re-attach them. This could be bandwidth heavy. The solution was hinted at in the message at: http://osdir.com/ml/db.couchdb.devel/2008-01/msg00031.html

The big picture is that when you update the document you have to keep the attachment stubs. So, do a GET on the document (this will include the attachment stubs). Then make the necessary changes to the other fields in the document, and then PUT it again. The attachments will be kept.
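Here is a minimal sketch of that update pattern. The helper applyChanges is hypothetical; the point is only that the object you PUT back still carries the _attachments stubs (entries marked "stub": true) that came with the GET, along with the _id and _rev.

```javascript
// Hypothetical helper: apply field changes to a fetched doc while keeping
// its attachment stubs. Underscore-prefixed fields (_id, _rev, _attachments)
// are reserved by CouchDB, so changes to them are ignored here.
function applyChanges(fetchedDoc, changes) {
  var updated = Object.assign({}, fetchedDoc);
  Object.keys(changes).forEach(function (field) {
    if (field.charAt(0) !== "_") updated[field] = changes[field];
  });
  return updated;
}

// What a GET on the document might return: stubs, not attachment bodies.
var fetched = {
  _id: "my-docid",
  _rev: "4-0123456789abcdef0123456789abcdef",
  title: "old title",
  _attachments: {
    "rev-3-a2759ea8b50d82489bce49cc733b08d1": {
      content_type: "application/json", length: 81, stub: true
    }
  }
};

var toPut = applyChanges(fetched, { title: "new title" });
console.log(JSON.stringify(toPut, null, 2));
```

Because only the stubs are sent, the update costs no more bandwidth than the document body itself.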