First of all, let me introduce Apache Mavibot: it's a MVCC B+
tree library in Java under an AL 2.0 license (MVCC stands for
Multi-Version Concurrency Control). The whole idea is to have a B-tree
implementation that never crashes, and does not use locks to protect the
data against concurrent access (well … while reading).
The B+ tree is a variant of a B-tree, where values are only stored in
the leaves, not in internal nodes.
Ok. Good. You don't know much about Mavibot after this introduction, so
I'll dig a bit deeper in this post. Let's start with the original idea.
Apache Directory, CouchDB, and some other databases...
Back in 2009, I was attending the Apache Conference in Oakland. I had been
working on the Apache Directory project for a bit more than a 4 years
and a half. Apache Directory is a LDAP server written in Java, and we
chose to store data in B-trees. There was a very limited choice back
then, and the library we used - and still use as of today - was JDBM
, a java avatar of GDBM.
JDBM is written in Java, implements B+trees and has transactions support
(experimental), but it has one big drawback: it's not a cross-B-trees
transaction system. And that does not fit our requirement in LDAP.
An alternative could have been Berkeley DB &tm;, which released a Java
edition of its database, but its license was incompatible with the AL
2.0 license. Moreover Berkeley DB was bought by Oracle in 2006, so it was
simply not an option.
What’s wrong with using JDBM?
In LDAP, an update operation impacts one single entry but this entry
uses many AttributeTypes, which can be indexed. In other words, an
update will impact as many B-trees as we have indexes (and a bit more).
In order to guarantee that an entry UPDATE is consistent, we must be
sure that either all or none of the indexes have been flushed to disk:
otherwise we might end with an inconsistent database, where some indexes
are up to date when some other aren't.
Even worse, in the event of a crash, we might simply not be able
to restart the server because the database gets corrupted (and
sadly, we are experiencing this problem today...).
So in Oakland, I went to the Apache Couch-DB presentation (sadly,
the slides are not anymore available), and was struck by the idea behind
their database: MVCC. Crucially when you start to use the
database at a given revision, you always see everything associated with
this revision, up to the point you are done. Sounds like Git or
Subversion … actually, it's pretty much the same mechanism.
Being able to process some read operations on a specific version of the
database guarantees that no update will ever corrupt the data being
processed. And every time we want to access the database, the very first
thing it will do is to select the latest available version: this is
all we will see during the operation processing. Perfect when you don't really care about having a fresh view of the stored data at any time, which is the case in LDAP.
But Apache CouchDB was written in Erlang :/ Anyway, the discussion we
had with the Directory team was really about moving to a MVCC database.
Transactions are another big missing feature in LDAP. This is not something that was
in the air back then: it was specified only one year later. Of course, the original specifications said that every operation is atomic, but there is no requirement for multiple operations to be atomic (and we often need to update two entries in LDAP, and to guarantee that those two operations are either completed, or
roll-backed). Think about user/group management...
Alex Karasulu always had in mind that we needed a transactional database
in Apache Directory, too. And his point was proved correct when years
later, we faced the first database corruptions. It's a bit sad that we
ignored this aspect for so long :/
Anyway, we needed (a) transactions and (b) a rock solid database that could
resist any type of crash.
For some time, we tried to mitigate the consistency problems we had by
adding tons of locks. As we weren't able to protect the database against
concurrent reads and writes we made them exclusive (i.e.
when some write is processed, no read can be processed). This was
slightly better, but it came at a huge cost: a major slowdown when
writes were done. Also it was not good enough: long-lasting searches
were just killing us, as there were no solution to guarantee that an
entry for which we had a reference would still be present in the database
when we needed to fetch it. In such cases, we simply ended up by
discarding the entry. Last, not least, a crash in the middle of an
update operation would leave the database in a potential inconsistent
state, which would make it impossible to start again (this was somehow
mitigated by adding a 'repair' mode lately, but this is just an horrible
Mavibot first steps
So we needed something better, which turned out to be Mavibot. We started working
on Mavibot in June 2012 (Jun 13 00:04:10 2012, exactly).
The funny thing is that OpenLDAP started to work on the exact same kind
of database 1 year before (LMDB) - even if the discussion about the
need for such a database started in 2009. Parallel discussions,
parallel developments, we have always shared a lot!
The very first released version of Mavibot was out one year later, in
June 2013, followed by 7 other versions (all of them milestones). At
some point, we added a MVBT partition in ApacheDS, in 2.0.0-M13 (and it
was using a SNAPSHOT!!! Mavibot 1.0.0-M1 was used in ApacheDS
2.0.0-M15). This was 'good enough' to have the LDAP it was 2 to 5 times faster than JDBM, too ;-), but it didn't offer all
we wanted to add: typically, we didn't have transaction support.
So why isn’t Mavibot the Apache Directory Server backend of choice today?
Well, we don't have cross B-tree transactions, so we are pretty much in
the same situation as with JDBM (except that it's faster, and we also
have a bulk-loader for Mavibot). Adding cross-B-trees transaction is not
a piece of cake, and it requires some thinking. Sadly, it arrived at a
moment where the team had less time to work on it (new jobs, family,
you name it).
So in 2017, the effort has been rebooted, and we do expect to have a
working version soon enough!
I'll blog later on about various technical aspects on Apache Mavibot, so keep tuned !