A parallel MongoDB client with Perl and fork

Concurrency is hard, and that's just as true in Perl as it is in most languages. While Perl has threads, they aren't lightweight, so they aren't an obvious answer to parallel processing the way they are elsewhere. In Perl, doing concurrent work generally means (a) a non-blocking/asynchronous framework or (b) forking sub-processes as workers.

There is no officially-supported async MongoDB driver for Perl (yet), so this article is about forking.

The problem with forking a MongoDB client object is that forks don't automatically close sockets. And having two (or more) processes trying to use the same socket is a recipe for corruption.

At one point in the design of the MongoDB Perl driver v1.0.0, I had it cache the PID on creation and then check if it had changed before every operation. If so, the socket to the MongoDB server would be closed and re-opened. It was auto-magic!
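That removed check can be sketched in plain Perl. This is a conceptual illustration with hypothetical names (My::ForkAwareClient, do_op), not the actual driver code:

```perl
use strict;
use warnings;

# Conceptual sketch of the removed PID check. The client caches the
# PID at construction and compares it to $$ before every operation.
package My::ForkAwareClient;

sub new {
    my ($class) = @_;
    return bless { pid => $$, reconnects => 0 }, $class;
}

sub do_op {
    my ($self) = @_;

    # this check runs on *every* operation, fork or no fork
    if ( $self->{pid} != $$ ) {
        $self->{pid} = $$;
        $self->{reconnects}++;   # stands in for closing/reopening sockets
    }

    # ... real database work would happen here ...
    return;
}

package main;
```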

The problem with this approach is that it incurs overhead on every operation, regardless of whether forks are in use. Even if forks are used, they are rare compared to the frequency of database operations for any non-trivial program.

So I took out that mis-feature. Now, you must manually call the reconnect method on your client objects after you fork (or spawn a thread).

Here's a pattern I've found myself using from time to time to do parallel processing with Parallel::ForkManager, adapted to reconnect the MongoDB client object in each child:

use Parallel::ForkManager;

# Pass in a MongoDB::MongoClient object, the number of parallel jobs to
# run, and a code-reference to execute. The code reference is passed
# the job number.
sub parallel_mongodb {
    my ( $client, $jobs, $fcn ) = @_;

    my $pm = Parallel::ForkManager->new( $jobs > 1 ? $jobs : 0 );

    local $SIG{INT} = sub {
        warn "Caught SIGINT; Waiting for child processes\n";
        $pm->wait_all_children;
        exit 1;
    };

    for my $i ( 0 .. $jobs - 1 ) {
        $pm->start and next;
        $SIG{INT} = sub { $pm->finish };

        # each child must reconnect before using the client
        $client->reconnect;

        $fcn->($i);
        $pm->finish;
    }

    $pm->wait_all_children;
}

To use this subroutine, I partition the input data into the number of jobs to run. Then I call parallel_mongodb with a closure that can find the input data from the job number:

use MongoDB;

# Partition input data into N parts.  Assume each is a document to insert.
my @data = (
   [ { a => 1 },  { b => 2 },  ... ],
   [ { m => 11 }, { n => 12 }, ... ],
);

my $number_of_jobs = @data;

my $client = MongoDB->connect;
my $coll = $client->ns("test.dataset");

parallel_mongodb( $client, $number_of_jobs,
  sub {
    $coll->insert_many( $data[ shift ], { ordered => 0 } );
  }
);

Of course, you want to be careful that the job count (i.e. the partition count) is optimal. I find that having it roughly equal to the number of CPUs tends to work pretty well in practice.

What you don't want to do, however, is to call $pm->start more than the number of child tasks you want running in parallel. You don't want a new process for every data item to process, since each fork also has to reconnect to the database, which is slow. That's why you should figure out the partitioning first, and only spawn a process per partition.

This is best for "embarrassingly parallel" problems, where there's no need for communication back from the child processes. And while what I've shown does a manual partition into arrays, you could also do this with a single array, where child workers only process indices where the index modulo the number of jobs is equal to the job ID. Or you could have child workers pulling from a common task queue over a network, etc.
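The modulo variant can be sketched like this (indices_for_job is a hypothetical helper, not part of any module):

```perl
use strict;
use warnings;

# Each child worker claims only the indices where
# index % number_of_jobs == its own job ID.
sub indices_for_job {
    my ( $job_id, $jobs, $count ) = @_;
    return grep { $_ % $jobs == $job_id } 0 .. $count - 1;
}

# e.g. with 10 work items split across 4 jobs, job 1 handles 1, 5, 9
my @mine = indices_for_job( 1, 4, 10 );
```

Every index is claimed by exactly one job, so no coordination between children is needed.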

TIMTOWTDI, and now you can do it in parallel.


Perl 5 and Perl 6 are mortal enemies

Did you grow up with one or more siblings? Are you a parent with two or more kids? Then you know that siblings often fight. A lot.

Perl 6 is described as Perl 5's little sister

That metaphor fits. They share parentage. The languages are similar in philosophy. One is more mature, the other less so. Their communities overlap.

But like siblings, they are rivals. Like an only child confronted with a new baby in the house, they now compete for attention from their shared community. They compete for scarce resources to grow – in the form of volunteers who will contribute time and treasure.

Their economic futures are both in doubt

This is what makes them not just rivals, but mortal enemies.

There are many signs that Perl 5 is in decline. Perl 5 is rarely a first language. The number of Perl 5 jobs is – at best – constant, at a time when technology jobs are booming in the wider economy. New applications are rarely written in Perl 5. This year, the Perl 5 community had to beg for talk submissions to OSCON, which grew out of The Perl Conference in the first place.

Is Perl 5 dead? Of course not. But I don't think anyone can cite credible evidence that it's a growth language on par with other "popular" languages. And that's OK. There's still value to be had in a good niche.

But now consider Perl 6. Where will it grow?

First, a postulate: given the language similarities, the people that will find it easiest to learn Perl 6 are today's Perl 5 developers.

Now, let's consider some scenarios:

Scenario 1: Perl 6 takes off!

With its gradual typing and natural async model, Perl 6 becomes the fastest dynamic language. People flock to it from far and wide. It becomes more popular than Rails was in its day. YC startups choose it for competitive advantage.

Perl 5 devs, with their advantage in switching, flock to the new economic opportunities it offers. Companies still using Perl 5 find it even harder to find good devs than they do today, or are forced to pay up for them. Even fewer new projects are started with Perl 5. The reasons for anyone to learn Perl 5 become fewer. Perl 5 lives on like COBOL, with a handful of older developers well paid to maintain a shrinking legacy code base.

Perl 6 lives and grows; Perl 5 heads quickly down the path to obsolescence.

Scenario 2: Perl 6 stalls out

Perl 6 winds up plagued by ongoing quality glitches and performance problems. Companies that already have Perl 5 developers (and that would have a competitive advantage retraining them) see no benefits from using Perl 6 for new projects.

With no job opportunities, most Perl 5 devs don't pick up Perl 6. The pool of Perl 6 developers stays a fraction of the already small Perl 5 pool. With even Perl 5 companies not adopting Perl 6, no one else is willing to risk Perl 6 adoption for new work, reinforcing the lack of economic opportunity.

Perl 5 stays status quo, static in an industry growing exponentially; Perl 6 remains a hobby language.

Scenario 3: Perl 6 winds up marginally better than Perl 5

Perl 6 turns out to be better than Perl 5, but not so much as to attract developers from other dynamic language communities. Companies that use Perl 5 find it cheaper to retrain their existing developer pool in Perl 6 for performance improvements in new projects. Over time, more projects are in Perl 6 than Perl 5.

Perl 5 devs see the winds of change. Those who don't want to do maintenance work forever pick up Perl 6 to stay relevant.

Perl 6 ekes out a living, stealing increasing production code share from Perl 5. Perl 5 declines moderately faster.

Zero-sum is not necessarily bad

When I say "mortal" enemies, I mean that only one is likely to survive in the long run. I can't think of a scenario where Perl 6 grows and Perl 5 grows. I can't even think of a plausible scenario where Perl 6 grows and Perl 5 is unaffected.

So I think it's zero sum. If Perl 6 grows, then Perl 5 dies faster. If Perl 6 fails to thrive, then Perl 5 keeps the status quo.

Is that bad? I don't think so. The possibility of wild success for Perl 6 should thrill Perl 5 devs, who would have an advantaged position in the new order.

For Perl 5 devs, the best case is great and the worst case seems to be status quo.

So why is there an undercurrent of hostility between the Perl 5 and Perl 6 communities? I think it's because the worst case is actually worse.

Scenario 4: Perl 6 stalls, and drags Perl 5 down with it

Perl 6 winds up plagued by ongoing quality glitches and performance problems. Tainted by association, companies abandon Perl 5 faster as Perl 6's failure makes Perl 5 seem that much more like a dead end. More Perl 5 monolithic apps get re-written as micro-services in trendy languages with easier deployment.

Meanwhile, prolific Perl 5 contributors to p5p and CPAN jump over to Perl 6 to try to help – either betting on Scenario #1 or just trying to save the day. Perl 5 innovation slows, re-raising the "Perl 5 is dead" meme and accelerating economic migration away from Perl 5.

If a tree falls in the forest...

I think this is the fear in the Perl 5 community. If Perl 6 fails, will it do so quietly, allowing the Perl 5 status quo to continue? Or will it suck away resources from Perl 5 and harm Perl 5's already shaky reputation further, hastening the decline?

So I'm not surprised by tension on both sides. I think it's natural.

Just like sibling rivalry.



Book Review: The Go Programming Language


[Disclaimer: I was provided with a free review copy by the publisher.]


If you're looking to buy a comprehensive text on Go, "The Go Programming Language" is an excellent choice. But with so many free e-book introductions to Go, do you really need it? Maybe, but maybe not.


The authors "assume that you have programmed in one or more other languages" and thus "won't spell out everything as if for a total beginner". Yet the book weighs in at a hefty 380 pages (over 100 pages more than my venerable 1988 K&R 2nd edition).

Is it better than the free 50-page "Little Go Book", or the free 160-page "Introduction to Programming in Go" or even the freely-available 80-page Go Language Specification itself? Yes, certainly. But is it two or three or four times as good? I don't think so.

So is "The Go Programming Language" worth the cost to read in both dollars *and* time? It depends on how you learn, how much you already know, and whether, for you, the good parts outweigh the bad.

The Good Parts

Chapter 1 ("Tutorial") sets the stage for much of what is excellent about this book: fabulous examples. Beyond the obligatory "Hello World", it presents a quick look at several simplified "real world" examples, including command line text filtering, image generation/animation, URL fetching and serving a web page.

The rest of the book follows this same pattern. Chapters typically present several different code examples, most of which do real things rather than just consist of toy code. They include exercises (which I didn't do) that would be good for a course or for someone who learns best by doing structured exercises. The examples are enough to serve as a starting "cookbook" for many real-world tasks.

I also found the explanations of struct embedding and composition to be excellent. Some concepts gelled much better for me than they had from other texts and even from my own coding to date. I had the same experience in the chapter on concurrency with channels. I was pleased that things so idiosyncratic to Go were some of the best parts of the book.

The Bad Parts

Sadly, the book's coverage of the standard library is haphazard. On the one hand, the many real world code examples gave opportunities to introduce parts of the standard library naturally throughout. Unfortunately, that also means there's no comprehensive coverage of the standard library itself, which is surprising given that it's one of the great strengths of the language.

The most glaring example of this ad hoc approach was finding a section on text and HTML templating oddly dropped in at the end of Chapter 4 ("Composite Types"). It was as if they really wanted to cover those packages and -- without a chapter dedicated to the standard library -- had nowhere else to put it.

As mentioned previously, the book is long and rather dense. It's not a quick read. Worse, the authors have a habit of burying important points or cautions in the middle of a wall of text and code examples. The lack of cutesy caution icons or call-out boxes for these tidbits (as would be found in more informally-styled books) really hurts skim-ability.

The Mixed Parts

As great as the examples were, I found some aspects disturbing. First, in some cases, implementation details were omitted from the text -- the reader is expected to download the source to see the full example. I would have preferred complete, if less ambitious, examples instead.

In other cases -- particularly in the sections on concurrency -- the examples are presented in a progression of one or more complete "wrong" examples of how not to do things before an example of the "right" way to do things. This approach is a good teaching method, but it adds substantially to the length of the text -- you have to grok a lot more code to parse out the differences between the examples.

The other interesting observation I had was that in many cases, the examples omit error handling for brevity. Since the verbosity of Go code error handling is a frequent criticism of the language, omitting it seemed somehow disingenuous.


If you have the money and patience and you like deep dives and real, working examples and exercises, this book is an excellent choice. If you prefer to skim or dabble, or just want a handy reference text, there are probably better options.


Finally, a streaming Perl filehandle API for MongoDB GridFS

GridFS is the term for MongoDB's filesystem abstraction, allowing you to store files in a MongoDB database. If that database is a replica set or sharded cluster, the files are secured against the loss of a database server.

The recently released v1.3.x development branch of the MongoDB Perl driver introduces a new GridFS API implementation (MongoDB::GridFSBucket) and deprecates the old one (MongoDB::GridFS).

The new API makes working with GridFS much more Perlish. You open an "upload stream" for writing. You open a "download stream" for reading. In both cases, you can get a tied filehandle from the stream object, which lets GridFS operate seamlessly with Perl libraries that read/write handles.

Let's consider a practical example: compression. Imagine you'd like to store files in GridFS but with gzip compression. You could compress a file in memory or on disk and then upload that to GridFS. Or, you could compress it on the fly with IO::Compress::Gzip.

This demo requires at least v1.3.1-TRIAL of the MongoDB Perl driver. You can install that with your favorite CPAN client. E.g.:

$ cpanm --dev MongoDB
# or
$ cpan MONGODB/MongoDB-1.3.1-TRIAL.tar.gz

First, let's load the modules we need, connect to the database and get a new, empty MongoDB::GridFSBucket object.

#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
use Path::Tiny;
use MongoDB;
use IO::Compress::Gzip qw/gzip $GzipError/;
use IO::Uncompress::Gunzip qw/gunzip $GunzipError/;

# connect to MongoDB on localhost
my $mc  = MongoDB->connect();

# get the MongoDB::GridFSBucket object for the 'test' database
my $gfs = $mc->db("test")->gfs;

# drop the GridFS bucket for this demo
$gfs->drop;

Next, let's say we have a local file called big.txt. In my testing, I used one that was about 2 MB. The next part of the program below prints out the uncompressed size, opens a filehandle for uploading, then uses the gzip function to read from one handle and send to another. Finally, we flush the upload handle and check the compressed size that was uploaded.

# print info on file to upload
my $file = path("big.txt");
say "local  $file is " . ( -s $file ) . " bytes";

# open a handle for uploading
my $up_fh = $gfs->open_upload_stream("$file")->fh;

# compress and upload file
gzip $file->openr_raw => $up_fh
  or die $GzipError;

# flush data to GridFS
my $doc = $up_fh->close;
say "gridfs $file is " . $doc->{length} . " bytes";

Downloading is pretty much the same process. We open a download handle and an output handle for a disk file, then use the gunzip function to stream from the download handle to disk. In this case, because we don't need the handles afterwards, we can do it all on one line instead of using temporary variables. Last, we report on the size (to ensure it's the same) and on the compression ratio.

# download and uncompress file
my $copy = path("big2.txt");
gunzip $gfs->open_download_stream( $doc->{_id} )->fh => $copy->openw_raw
  or die $GunzipError;
say "copied $file is " . ( -s $copy ) . " bytes";

# report compression ratio
printf("compressed was %d%% of original\n", 100 * $doc->{length} / -s $file );

If you want to try this out, you can get the whole program from this gist: compressed-gridfs.pl.

When I run it on my sample file, this is the output I get:

$ perl compressed-gridfs.pl
local  big.txt is 2097410 bytes
gridfs big.txt is 777043 bytes
copied big2.txt is 2097410 bytes
compressed was 37% of original

Having GridFS uploads and downloads represented as Perl filehandles makes interoperation with many Perl libraries super easy.

Of course, the new library still works nicely with handles you provide to it. If you want to upload from a handle, you use the upload_from_stream method:

$gfs->upload_from_stream("$file", $file->openr_raw);

Or, if you want to download to a handle, you use the download_to_stream method:

$gfs->download_to_stream($file_id, path("output")->openw_raw);

While the new GridFS API is currently only in the development version of the driver, I encourage you to try it out if you're curious.

If you have feedback, please email me (DAGOLDEN at cpan.org), tweet at @xdg or open a Jira ticket.



Getting ready for MongoDB 3.2: new features, new Perl driver beta

After several release candidates, MongoDB version 3.2 is nearing completion. It brings a number of new features that users have been demanding:

  • No more dirty reads!

    With the new "readConcern" query option, users can trade higher latency for reads that won't roll back during a partition.

  • Document validation!

    While MongoDB doesn't use schemas, users can define query criteria that will be checked to validate new and updated documents. In addition to field existence and type checks, these can include logical checks as well (e.g. does field "foo" match this regex).
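As a rough sketch of what that might look like from the Perl driver, using the generic run_command interface against a 3.2 server (the collection name and rules here are made up for illustration):

```perl
use strict;
use warnings;
use MongoDB;

my $db = MongoDB->connect->db("test");

# Create a collection whose documents must have a string "name"
# field and an "email" field matching a (simplistic) regex.
$db->run_command(
    [
        create    => "users",
        validator => {
            name  => { '$type' => 2 },    # BSON type 2 = string
            email => qr/@/,
        },
    ]
);
```

After that, the server rejects inserts and updates that fail the criteria (unless validation is bypassed).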

Other, less developer-facing features include:

  • Encryption at rest
  • Partial indexes
  • Faster replica-set failover
  • Simpler sharded cluster configuration

If you want to try out a MongoDB 3.2 release candidate, see the MongoDB development downloads page.

Of course, to take advantage of developer-facing features, you'll need an updated driver library. All the officially supported MongoDB drivers have beta/RC versions with 3.2 support.

The current Perl driver beta is MongoDB-v1.1.0-TRIAL, which you can download and install with your favorite cpan client:

$ cpanm --dev MongoDB

$ cpan MONGODB/MongoDB-v1.1.0-TRIAL.tar.gz

Some of the changes in the beta driver include:

  • Support for readConcern (MongoDB 3.2)
  • Support for bypassDocumentValidation (MongoDB 3.2; for when you need to work with legacy documents before validation)
  • Support for writeConcern on find-and-modify-style writes (MongoDB 3.2; can be used to emulate a quorum read)
  • A new 'batch' method for query result objects for efficient processing
  • A new 'find_id' sugar method on collection objects for fetching a document by its _id field
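For example, the new find_id sugar looks like this (a sketch that assumes a mongod on localhost and the TRIAL driver installed; the collection name is made up):

```perl
use strict;
use warnings;
use MongoDB;

my $coll = MongoDB->connect->ns("test.people");

my $id = $coll->insert_one( { name => "Ada" } )->inserted_id;

# find_id($id) is shorthand for find_one( { _id => $id } )
my $doc = $coll->find_id($id);
```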

Whether you're ready for MongoDB 3.2 or not, I encourage you to try out the Perl driver beta.

If you find any bugs or have any comments, please open a MongoDB JIRA ticket about it, or email me (dagolden) at my CPAN.org address, or tweet to @xdg.

Thank you!
