If you’re reading this blog, it’s a good bet that sometime in your life you’ve had a computer freeze or crash on you. You know that crashes happen.

If it’s your laptop, you restart and hope for the best. When it’s your database, things are a bit more complicated.

Historically, a database lived on a single machine. Writes are considered “committed” when they are written to a journal file and flushed to disk. Until then, they are “dirty”. If the database crashes, only the committed changes are recovered from disk.

So far, so good. While originally MongoDB didn’t journal by default, it has been the default since version 2.0 in 2011.

As you can imagine, the problem with having a database on a single machine is that when the system is down, you can’t use the database.

Trading consistency for availability

While MongoDB can run as a single server, it was designed as a distributed database for redundancy. If one server fails, there are others that can continue.

But once you have more than one server, you have a distributed system, which means you need to consider consistency between parts of the system. (See Eight Fallacies of Distributed Computing.)

You might have heard of the CAP Theorem. While there are excellent critiques of its practical applicability – e.g. Thinking More Clearly About Consistency – it is sufficient to convey two key ideas:

because networks are unreliable, partitions happen, so for any real system, partition tolerance (“P”) is a given.
because MongoDB uses redundant servers that continue to operate during a failure or partition, it gives up global consistency (“C”) for availability (“A”).

When we say that it gives up consistency, what we mean is that there is no single, up-to-date copy of the data visible across the entire distributed system.

Rethinking commitment

A MongoDB replica set is structured with a single “primary” server that receives writes from clients and streams them to “secondary” servers. If a primary server fails or is partitioned from other servers, a secondary is promoted to primary and takes over servicing writes.

But what does it mean for a write to be “committed” in this model?

Consider the case where a write is committed to disk on the primary, but the primary fails before the write has replicated to a secondary. When a secondary takes over as the new primary, it never saw the write, so that write is lost. Even if the old primary rejoins the replica set (now as a secondary), it synchronizes with the current primary, discarding the old write.

This means that commitment to disk on a primary doesn’t determine whether a write survives a failure. In MongoDB, writes are only truly committed when they have been replicated to a majority of servers. We can call this “majority-committed” (hereafter abbreviated “m-committed”).
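To make “a majority of servers” concrete, here is a minimal sketch in plain Perl. It assumes every member of the replica set counts equally, which is a simplification, and the helper names majority_size and is_m_committed are made up for illustration; they are not part of any driver.

```perl
use strict;
use warnings;

# Smallest number of servers that forms a majority of an $n-member set.
# Simplifying assumption: every member counts equally.
sub majority_size {
    my ($n) = @_;
    return int( $n / 2 ) + 1;
}

# A write is m-committed once at least a majority of members hold it.
sub is_m_committed {
    my ( $members_with_write, $n ) = @_;
    return $members_with_write >= majority_size($n) ? 1 : 0;
}

for my $n ( 3, 4, 5 ) {
    printf "%d members: majority is %d\n", $n, majority_size($n);
}
# prints:
#   3 members: majority is 2
#   4 members: majority is 3
#   5 members: majority is 3
```

Note that in a three-member set, a write held only by the primary is not m-committed, which is exactly the case that can be rolled back.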

Commitment, latency and dirty reads

When replica sets were introduced in version 1.6, MongoDB offered a configuration option called write concern (abbreviated “w”), which controls how long the server waits before acknowledging a write. The option specifies the number of servers that need to have the write before the primary unblocks the client – e.g. w=1 (the default), w=2, etc. You can also wait for a majority without knowing the exact number of servers (w=‘majority’).

my $mc = MongoDB->connect(
    $uri,
    {
        w => 'majority'
    }
);

This means that a writing process can control how much latency it experiences waiting for levels of commitment. Set w=1 and the primary acknowledges the write when it is locally committed while replication continues in the background. Set w=‘majority’ and you’ll wait until the data is m-committed and thus safe from rollback in a failover.
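Because the trade-off is per write, one process can mix both levels. The sketch below is an illustration under assumptions rather than a recipe: the URI, the database and collection names, and the 5000 ms wtimeout are all hypothetical, and it relies on the Perl driver accepting a write_concern hash reference as a get_collection option.

```perl
use strict;
use warnings;
use MongoDB;

# Assumption: a replica set is reachable at this hypothetical URI.
my $uri = 'mongodb://localhost:27017/?replicaSet=rs0';
my $mc  = MongoDB->connect($uri);
my $db  = $mc->get_database('test');

# Fast path: acknowledged once the primary has the write (w=1).
my $fast = $db->get_collection( 'events', { write_concern => { w => 1 } } );

# Safe path: acknowledged only once the write is m-committed.
# wtimeout (milliseconds) bounds how long the client waits for that.
my $safe = $db->get_collection( 'events',
    { write_concern => { w => 'majority', wtimeout => 5000 } } );

$fast->insert_one( { type => 'page_view' } );    # low latency, may roll back
$safe->insert_one( { type => 'payment' } );      # slower, survives failover
```

Both handles point at the same collection; only the acknowledgement rule differs.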

You may spot a subtle problem in this approach.

A write concern only blocks the writing process! Reading processes can read the write from the primary even before it is m-committed. These are “dirty reads” – reads that might be rolled back if a failure occurs. Certainly, this is a rare case (compared to, say, transaction rollbacks), but it is a dirty read nonetheless.

Avoiding dirty reads

MongoDB 3.2 introduced a new configuration option called read concern, to express the commitment level desired by readers rather than writers.

Read concern is expressed as one of two values:

‘local’ (the default), meaning the server should return the latest, locally committed data
‘majority’, meaning the server should return the latest m-committed data

Using read concern requires a driver that supports MongoDB 3.2 or later (e.g. v1.2.0 or later of the MongoDB Perl driver):

my $mc = MongoDB->connect(
    $uri,
    {
        read_concern_level => 'majority'
    }
);

With read concern, reading processes can finally avoid dirty reads.

However, this introduces yet another subtle problem. What could happen if a process writes with w=1 and then immediately reads with read concern ‘majority’?

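Configured this way, a process might not read its own writes! The write is acknowledged as soon as the primary has it, but the read only returns data that is already m-committed, and the new write may not have reached a majority yet.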
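The gap is easy to see in a toy model. The following plain-Perl sketch does not talk to MongoDB at all: it pretends a three-member replica set is just three lists of applied writes, and every name in it (write_w1, replicate_to, read_local, read_majority) is invented for the illustration.

```perl
use strict;
use warnings;

# Toy model: each member is just the list of writes it has applied.
my %applied = (
    primary    => [],
    secondary1 => [],
    secondary2 => [],
);

# w=1: the write lands on the primary only; replication happens later.
sub write_w1 {
    my ($doc) = @_;
    push @{ $applied{primary} }, $doc;
}

# Copy everything on the primary to one secondary (hypothetical helper).
sub replicate_to {
    my ($name) = @_;
    $applied{$name} = [ @{ $applied{primary} } ];
}

# Read concern 'local': count of writes the primary has right now.
sub read_local { return scalar @{ $applied{primary} } }

# Read concern 'majority': count of writes held by a majority of members.
# With counts sorted from highest to lowest, that is the majority-th entry.
sub read_majority {
    my @counts   = sort { $b <=> $a } map { scalar @$_ } values %applied;
    my $majority = int( @counts / 2 ) + 1;
    return $counts[ $majority - 1 ];
}

write_w1('order-42');
printf "local sees %d write(s), majority sees %d\n",
    read_local(), read_majority();
# prints: local sees 1 write(s), majority sees 0

replicate_to('secondary1');
printf "after replication: majority sees %d\n", read_majority();
# prints: after replication: majority sees 1
```

Between the two print statements, a ‘local’ reader sees a write that could still be rolled back (a dirty read), while a ‘majority’ reader, including the writer itself, does not see the write at all.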

Recommendations for tunable consistency

I encourage MongoDB users to place themselves (or at least, their application activities) into one of the following groups:

“I want low latency” – Dirty reads are OK as long as things are fast. Use w=1 and read concern ‘local’. (These are the default settings.)
“I want consistency” – Dirty reads are not OK, even at the cost of latency or slightly out-of-date data. Use w=‘majority’ and read concern ‘majority’.

my $mc = MongoDB->connect(
    $uri,
    {
        read_concern_level => 'majority',
        w => 'majority',
    }
);

I haven’t yet discussed another configuration option: read preference. A read preference indicates whether to read from the primary or from a secondary. The problem with reading from secondaries is that, by definition, they lag the primary. Worse, they lag the primary by different amounts, so reading from different secondaries over time pretty much guarantees inconsistent views of your data.

My opinion is that – unless you are a distributed systems expert – you should leave the default read preference alone and read only from the primary.

No more dirty reads with MongoDB
