No more dirty reads with MongoDB

If you're reading this blog, it's a good bet that sometime in your life you've had a computer freeze or crash on you. You know that crashes happen.

If it's your laptop, you restart and hope for the best. When it's your database, things are a bit more complicated.

Historically, a database lived on a single machine. Writes are considered "committed" when they are written to a journal file and flushed to disk. Until then, they are "dirty". If the database crashes, only the committed changes are recovered from disk.

So far, so good. While MongoDB originally didn't journal by default, journaling has been the default since version 2.0 in 2011.

As you can imagine, the problem with having a database on a single machine is that when the system is down, you can't use the database.

Trading consistency for availability

While MongoDB can run as a single server, it was designed as a distributed database for redundancy. If one server fails, there are others that can continue.

But once you have more than one server, you have a distributed system, which means you need to consider consistency between parts of the system. (See the Eight Fallacies of Distributed Computing.)

You might have heard of the CAP Theorem. While there are excellent critiques of its practical applicability – e.g. Thinking More Clearly About Consistency – it is sufficient to convey two key ideas:

  • because networks are unreliable, partitions happen, so for any real system, partition tolerance ("P") is a given.
  • because MongoDB uses redundant servers that continue to operate during a failure or partition, it gives up global consistency ("C") for availability ("A")

When we say that it gives up consistency, what we mean is that there is no single, up-to-date copy of the data visible across the entire distributed system.

Rethinking commitment

A MongoDB replica set is structured with a single "primary" server that receives writes from clients and streams them to "secondary" servers. If a primary server fails or is partitioned from other servers, a secondary is promoted to primary and takes over servicing writes.

But what does it mean for a write to be "committed" in this model?

Consider the case where a write is committed to disk on the primary, but the primary fails before the write has replicated to a secondary. When a new secondary takes over as primary, it never saw the write, so that write is lost. Even if the old primary rejoins the replica set (now as a secondary), it synchronizes with the current primary, discarding the old write.

This means that commitment to disk on a primary doesn't determine whether a write survives a failure. In MongoDB, writes are only truly committed when they have been replicated to a majority of servers. We can call this "majority-committed" (hereafter abbreviated "m-committed").

Commitment, latency and dirty reads

When replica sets were introduced in version 1.6, MongoDB offered a configuration option called write concern (abbreviated "w"), which controls how long the server waits before acknowledging a write. The option specifies the number of servers that must have the write before the primary unblocks the client – e.g. w=1 (the default), w=2, etc. You can also ask to wait for a majority without knowing the exact number of servers (w='majority').

use MongoDB;
my $mc = MongoDB->connect(
    'mongodb://localhost/',    # default server; adjust as needed
    { w => 'majority' }
);

This means that a writing process can control how much latency it experiences waiting for levels of commitment. Set w=1 and the primary acknowledges the write when it is locally committed while replication continues in the background. Set w='majority' and you'll wait until the data is m-committed and thus safe from rollback in a failover.

You may spot a subtle problem in this approach.

A write concern only blocks the writing process! Reading processes can read the write from the primary even before it is m-committed. These are "dirty reads" – reads that might be rolled back if a failure occurs. Certainly, this is a rare case (compared to, say, transaction rollbacks), but it is a dirty read nonetheless.

Avoiding dirty reads

MongoDB 3.2 introduced a new configuration option called read concern, to express the commitment level desired by readers rather than writers.

Read concern is expressed as one of two values:

  • 'local' (the default), meaning the server should return the latest, locally committed data
  • 'majority', meaning the server should return the latest m-committed data

Using read concern requires v1.2.0 or later of the MongoDB Perl driver:

use MongoDB v1.2.0;
my $mc = MongoDB->connect(
    'mongodb://localhost/',    # default server; adjust as needed
    { read_concern_level => 'majority' }
);

With read concern, reading processes can finally avoid dirty reads.

However, this introduces yet another subtle problem: what happens if a process writes with w=1 and then immediately reads with read concern 'majority'?

Configured this way, a process might not read its own writes!
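
To make that concrete, here's a minimal sketch (the collection name and document are made up, and it assumes a replica set behind the default localhost connection):

use MongoDB v1.2.0;

my $mc = MongoDB->connect(
    'mongodb://localhost/',
    { w => 1, read_concern_level => 'majority' }
);
my $coll = $mc->ns("test.widgets");

# Acknowledged as soon as the primary has it locally (w=1)...
$coll->insert_one( { _id => 42, status => "new" } );

# ...but this read only returns m-committed data, so if replication
# lags, the document inserted above may not be visible yet.
my $doc = $coll->find_one( { _id => 42 } );    # possibly undef!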

Recommendations for tunable consistency

I encourage MongoDB users to place themselves (or at least, their application activities) into one of the following groups:

  • "I want low latency" – Dirty reads are OK as long as things are fast. Use w=1 and read concern 'local'. (These are the default settings.)
  • "I want consistency" – Dirty reads are not OK, even at the cost of latency or slightly out-of-date data. Use w='majority' and read concern 'majority':
use MongoDB v1.2.0;
my $mc = MongoDB->connect(
    'mongodb://localhost/',    # default server; adjust as needed
    {
        read_concern_level => 'majority',
        w                  => 'majority',
    }
);

I haven't discussed yet another configuration option: read preference. A read preference indicates whether to read from the primary or from a secondary. The problem with reading from secondaries is that, by definition, they lag the primary. Worse, they lag the primary by different amounts, so reading from different secondaries over time pretty much guarantees inconsistent views of your data.

My opinion is that – unless you are a distributed systems expert – you should leave the default read preference alone and read only from the primary.
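
If you do want to be explicit about it, a sketch looks like this – assuming the v1.x driver's read_pref_mode option (check the MongoDB::MongoClient documentation for the authoritative spelling):

use MongoDB;
my $mc = MongoDB->connect(
    'mongodb://localhost/',
    { read_pref_mode => 'primary' }    # already the default; shown for emphasis
);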


Please test Path-Tiny-0.081-TRIAL

The latest development releases of Path::Tiny include this whopper in the Changes file:

The relative() method no longer uses File::Spec's buggy rel2abs method. The new Path::Tiny algorithm should be comparable and passes File::Spec rel2abs test cases, except that it correctly accounts for symlinks. For common use, you are not likely to notice any difference. For uncommon use, this should be an improvement. As a side benefit, this change drops the minimum File::Spec version required, allowing Path::Tiny to be fatpacked if desired.
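
If you haven't used it, relative() computes one path relative to another base. A trivial sketch (no symlinks involved, so the old and new algorithms agree):

use Path::Tiny;

my $rel = path("/usr/local/lib/perl5")->relative("/usr/local");
print "$rel\n";    # prints "lib/perl5"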

I sincerely hope that you won't notice the difference – or if you do, it's because Path::Tiny is defending you against a latent symlink bug.

That said, with any change of this magnitude there's a serious risk of breakage. PLEASE, PLEASE, PLEASE, if you use Path::Tiny, I ask that you test your module or application with Path-Tiny-0.081-TRIAL.

Here's how:

$ cpanm --dev Path::Tiny

# or

$ cpan DAGOLDEN/Path-Tiny-0.081-TRIAL.tar.gz

If you have any problems with it, please open a bug report in the Path::Tiny issue tracker.


My Github dashboard of neglect


The curse of being a prolific publisher is a long list of once-cherished, now-neglected modules.

Earlier this week, I got a depressing Github notification. The author of a pull request, who had politely pestered me for a while to review his PR, added this comment:

1 year has passed


Sadly, after taking time to review the PR, I actually decided it wasn't a great fit and politely (I hope) rejected it. And then I felt even WORSE, because I'd made someone wait around a year for me to say "no".

Much like my weight hitting a local maximum on the scale, goading me to rededicate myself to healthier eating [dear startups, enough with the constant junk food, already!], this PR felt like a low point in my open-source maintenance.

And, so, just like I now have an app to show me a dashboard of my food consumption, I decided I needed a bird's-eye view of what I'd been ignoring on Github.

Here, paraphrased, is my "conversation" with Github.

Me: Github, show me a dashboard!  

GH: Here's a feed of events on repos you watch

Me: No, I want a dashboard.

GH: Here's a list of issues created, assigned or mentioning you.

Me: No, I want a dashboard.  Maybe I need an organization view.  [my CPAN repos are in an organization]

GH: Here's a feed of events on repos in the organization.

Me: No, I want a dashboard of issues.

GH: Here's a list of issues for repos in the organization.

Me: Uh, can you summarize that?

GH: No.

Me: Github, you suck.  But you have an API.  Time to bust out some Perl.

So I wrote my own github-dashboard program, using Net::GitHub. (Really, I adapted it from other Net::GitHub programs I already use.) I keep my Github user id and API token in my .gitconfig, so the program pulls my credentials from there.
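
The real program is longer, but its shape is roughly this sketch – the 'github.user' and 'github.token' config keys, the organization name, and the single-page fetch are simplifying assumptions, not the actual code:

use Net::GitHub;

# Pull credentials from .gitconfig (config key assumed for illustration).
chomp( my $token = qx/git config github.token/ );

my $gh = Net::GitHub->new( access_token => $token );

# Tally open PRs vs. other open issues for each repo in an organization.
for my $repo ( $gh->repos->list_org("example-org") ) {
    my @issues = $gh->issue->repos_issues( "example-org", $repo->{name}, { state => "open" } );
    my $prs    = grep { $_->{pull_request} } @issues;
    printf "%45s %3d %3d\n", $repo->{name}, $prs, scalar(@issues) - $prs;
}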

Below, you can see my Github dashboard of neglect (top 40 only!). The three columns of numbers are (respectively) PRs, non-wishlist issues and wishlist issues. (Wishlist items are identified either by label or by "wishlist" in the title.)

$ ./github-dashboard |  head -40
                               Capture-Tiny   3  18   0
                                    Meerkat   2   8   0
                               getopt-lucid   2   1   0
                                  Path-Tiny   1  21   0
                               HTTP-Tiny-UA   1   5   0
                         Path-Iterator-Rule   1   5   0
  Dist-Zilla-Plugin-BumpVersionAfterRelease   1   3   2
                              Metabase-Fact   1   3   0
                dist-zilla-plugin-osprereqs   1   2   0
       Dist-Zilla-Plugin-Test-ReportPrereqs   1   2   0
                                    ToolSet   1   2   0
        Dist-Zilla-Plugin-Meta-Contributors   1   1   0
     Dist-Zilla-Plugin-MakeMaker-Highlander   1   0   0
                         Task-CPAN-Reporter   1   0   0
                           IO-CaptureOutput   0   7   0
                                     pantry   0   7   2
                     TAP-Harness-Restricted   0   4   0
                            class-insideout   0   3   0
                               Hash-Ordered   0   3   0
                                    Log-Any   0   3   4
                                  perl-chef   0   3   0
                                 Term-Title   0   3   0
                               Test-DiagINC   0   3   0
                          Acme-require-case   0   2   0
                                 Class-Tiny   0   2   0
                                  Data-Fake   0   2   2
                  dist-zilla-plugin-twitter   0   2   0
                   Log-Any-Adapter-Log4perl   0   2   0
                             math-random-oo   0   2   0
                                 superclass   0   2   0
                                   Test-Roo   0   2   0
                              universal-new   0   2   0
                           zzz-rt-to-github   0   2   0
                      app-ylastic-costagent   0   1   0
                      Dancer-Session-Cookie   0   1   0
          Dist-Zilla-Plugin-CheckExtraTests   0   1   0
          Dist-Zilla-Plugin-InsertCopyright   0   1   0
Dist-Zilla-Plugin-ReleaseStatus-FromVersion   0   1   0
                                 File-chdir   0   1   0
                                 File-pushd   0   1   0

Now, when I set aside maintenance time, I know where to work.


A parallel MongoDB client with Perl and fork

Concurrency is hard, and that's just as true in Perl as it is in most languages. While Perl has threads, they aren't lightweight, so they aren't an obvious answer to parallel processing the way they are elsewhere. In Perl, doing concurrent work generally means (a) a non-blocking/asynchronous framework or (b) forking sub-processes as workers.

There is no officially-supported async MongoDB driver for Perl (yet), so this article is about forking.

The problem with forking a MongoDB client object is that forks don't automatically close sockets. And having two (or more) processes trying to use the same socket is a recipe for corruption.

At one point in the design of the MongoDB Perl driver v1.0.0, I had it cache the PID on creation and then check if it had changed before every operation. If so, the socket to the MongoDB server would be closed and re-opened. It was auto-magic!

The problem with this approach is that it incurs overhead on every operation, regardless of whether forks are in use. Even if forks are used, they are rare compared to the frequency of database operations for any non-trivial program.

So I took out that mis-feature. Now, you must manually call the reconnect method on your client objects after you fork (and likewise after you spawn a thread).
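
With a bare fork, the pattern looks something like this minimal sketch (collection name made up, error handling kept short):

use MongoDB;

my $client = MongoDB->connect;

my $pid = fork;
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child: never reuse the socket inherited from the parent.
    $client->reconnect;
    $client->ns("test.dataset")->insert_one( { from => "child", pid => $$ } );
    exit 0;
}

# Parent: keeps using its own connection as before.
waitpid $pid, 0;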

Here's a pattern I've found myself using from time to time to do parallel processing with Parallel::ForkManager, adapted to reconnect the MongoDB client object in each child:

use Parallel::ForkManager;

# Pass in a MongoDB::MongoClient object, the number of parallel jobs to
# run, and a code-reference to execute.  The code reference is passed
# the iteration number.
sub parallel_mongodb {
    my ( $client, $jobs, $fcn ) = @_;

    my $pm = Parallel::ForkManager->new( $jobs > 1 ? $jobs : 0 );

    local $SIG{INT} = sub {
        warn "Caught SIGINT; Waiting for child processes\n";
        $pm->wait_all_children;
        exit 1;
    };

    for my $i ( 0 .. $jobs - 1 ) {
        $pm->start and next;
        $SIG{INT} = sub { $pm->finish };

        # Child: re-establish the connection before doing any work.
        $client->reconnect;

        $fcn->($i);
        $pm->finish;
    }

    $pm->wait_all_children;
}

To use this subroutine, I partition the input data into the number of jobs to run. Then I call parallel_mongodb with a closure that can find the input data from the job number:

use MongoDB;

# Partition input data into N parts.  Assume each part is an array of
# documents to insert.
my @data = (
   [ { a => 1 },  { b => 2 },  ... ],
   [ { m => 11 }, { n => 12 }, ... ],
);
my $number_of_jobs = @data;

my $client = MongoDB->connect;
my $coll   = $client->ns("test.dataset");

parallel_mongodb( $client, $number_of_jobs,
  sub {
    $coll->insert_many( $data[ shift ], { ordered => 0 } );
  }
);

Of course, you want to be careful that the job count (i.e. the partition count) is optimal. I find that having it roughly equal to the number of CPUs tends to work pretty well in practice.

What you don't want to do, however, is to call $pm->start more times than the number of child tasks you want running in parallel. You don't want a new process for every data item, since each fork also has to reconnect to the database, which is slow. That's why you should figure out the partitioning first, and only spawn a process per partition.

This is best for "embarrassingly parallel" problems, where there's no need for communication back from the child processes. And while what I've shown does a manual partition into arrays, you could also do this with a single array, where each child worker only processes indices where the index modulo the number of jobs is equal to the job ID. Or you could have child workers pulling from a common task queue over a network, etc.
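
For instance, the single-array variant might look something like this sketch, where @documents and the insert call are placeholders:

# One shared array; worker $job_id handles every index where
# $index % $number_of_jobs == $job_id.
parallel_mongodb( $client, $number_of_jobs,
  sub {
    my $job_id = shift;
    for my $index ( 0 .. $#documents ) {
        next unless $index % $number_of_jobs == $job_id;
        $coll->insert_one( $documents[$index] );
    }
  }
);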

TIMTOWTDI, and now you can do it in parallel.


Perl 5 and Perl 6 are mortal enemies

Did you grow up with one or more siblings? Are you a parent with two or more kids? Then you know that siblings often fight. A lot.

Perl 6 is described as Perl 5's little sister

That metaphor fits. They share parentage. The languages are similar in philosophy. One is more mature, the other less so. Their communities overlap.

But like siblings, they are rivals. Like an only child confronted with a new baby in the house, they now compete for attention from their shared community. They compete for scarce resources to grow – in the form of volunteers who will contribute time and treasure.

Their economic futures are both in doubt

This is what makes them not just rivals, but mortal enemies.

There are many signs that Perl 5 is in decline. Perl 5 is rarely a first language. The number of Perl 5 jobs is – at best – constant, at a time when technology jobs are booming in the wider economy. New applications are rarely written in Perl 5. This year, the Perl 5 community had to beg for talk submissions to OSCON, which grew out of The Perl Conference in the first place.

Is Perl 5 dead? Of course not. But I don't think anyone can cite credible evidence that it's a growth language on par with other "popular" languages. And that's OK. There's still value to be had in a good niche.

But now consider Perl 6. Where will it grow?

First, a postulate: given the language similarities, the people that will find it easiest to learn Perl 6 are today's Perl 5 developers.

Now, let's consider some scenarios:

Scenario 1: Perl 6 takes off!

With its gradual typing and natural async model, Perl 6 becomes the fastest dynamic language. People flock to it from far and wide. It becomes more popular than Rails was in its heyday. YC startups choose it for competitive advantage.

Perl 5 devs, with their advantage in switching, flock to the new economic opportunities it offers. Companies still using Perl 5 find it even harder to find good devs than they do today, or are forced to pay up for them. Even fewer new projects are started with Perl 5. The reasons for anyone to learn Perl 5 become fewer. Perl 5 lives on like COBOL, with a handful of older developers well paid to maintain a shrinking legacy code base.

Perl 6 lives and grows; Perl 5 heads quickly down the path to obsolescence.

Scenario 2: Perl 6 stalls out

Perl 6 winds up plagued by ongoing quality glitches and performance problems. Companies that already have Perl 5 developers (and that would have a competitive advantage retraining them) see no benefits from using Perl 6 for new projects.

With no job opportunities, most Perl 5 devs don't pick up Perl 6. The pool of Perl 6 developers stays a fraction of the already small Perl 5 pool. With even Perl 5 companies not adopting Perl 6, no one else is willing to risk Perl 6 adoption for new work, reinforcing the lack of economic opportunity.

Perl 5 stays status quo, static in an industry growing exponentially; Perl 6 remains a hobby language.

Scenario 3: Perl 6 winds up marginally better than Perl 5

Perl 6 turns out to be better than Perl 5, but not so much as to attract developers from other dynamic language communities. Companies that use Perl 5 find it cheaper to retrain their existing developer pool in Perl 6 for performance improvements in new projects. Over time, more projects are in Perl 6 than Perl 5.

Perl 5 devs see the winds of change. Those who don't want to do maintenance work forever pick up Perl 6 to stay relevant.

Perl 6 ekes out a living, stealing increasing production code share from Perl 5. Perl 5 declines moderately faster.

Zero-sum is not necessarily bad

When I say "mortal" enemies, I mean that only one is likely to survive in the long run. I can't think of a scenario where Perl 6 grows and Perl 5 grows. I can't even think of a plausible scenario where Perl 6 grows and Perl 5 is unaffected.

So I think it's zero sum. If Perl 6 grows, then Perl 5 dies faster. If Perl 6 fails to thrive, then Perl 5 keeps the status quo.

Is that bad? I don't think so. The possibility of wild success for Perl 6 should thrill Perl 5 devs, who would have an advantaged position in the new order.

For Perl 5 devs, the best case is great and the worst case seems to be status quo.

So why is there an undercurrent of hostility between the Perl 5 and Perl 6 communities? I think it's because the worst case is actually worse.

Scenario 4: Perl 6 stalls, and drags Perl 5 down with it

Perl 6 winds up plagued by ongoing quality glitches and performance problems. Tainted by association, companies abandon Perl 5 faster as Perl 6's failure makes Perl 5 seem that much more like a dead end. More Perl 5 monolithic apps get re-written as micro-services in trendy languages with easier deployment.

Meanwhile, prolific Perl 5 contributors to p5p and CPAN jump over to Perl 6 to try to help – either betting on Scenario #1 or just trying to save the day. Perl 5 innovation slows, re-raising the "Perl 5 is dead" meme and accelerating economic migration away from Perl 5.

If a tree falls in the forest...

I think this is the fear in the Perl 5 community. If Perl 6 fails, will it do so quietly, allowing the Perl 5 status quo to continue? Or will it suck away resources from Perl 5 and harm Perl 5's already shaky reputation further, hastening the decline?

So I'm not surprised by tension on both sides. I think it's natural.

Just like sibling rivalry.

[Discuss on Reddit...]

