Finally, a streaming Perl filehandle API for MongoDB GridFS

GridFS is MongoDB’s filesystem abstraction: it lets you store files in a MongoDB database. If that database is a replica set or sharded cluster, the files are protected against the loss of a database server.

The recently released v1.3.x development branch of the MongoDB Perl driver introduces a new GridFS API implementation (MongoDB::GridFSBucket) and deprecates the old one (MongoDB::GridFS).

The new API makes working with GridFS much more Perlish. You open an “upload stream” for writing and a “download stream” for reading. In both cases, you can get a tied filehandle from the stream object, which lets GridFS operate seamlessly with Perl libraries that read from or write to handles.

Let’s consider a practical example: compression. Imagine you’d like to store files in GridFS but with gzip compression. You could compress a file in memory or on disk and then upload that to GridFS. Or, you could compress it on the fly with IO::Compress::Gzip.

This demo requires at least v1.3.1-TRIAL of the MongoDB Perl driver. You can install that with your favorite CPAN client. E.g.:

$ cpanm --dev MongoDB
# or
$ cpan MONGODB/MongoDB-1.3.1-TRIAL.tar.gz

First, let’s load the modules we need, connect to the database and get a new, empty MongoDB::GridFSBucket object.

#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
use Path::Tiny;
use MongoDB;
use IO::Compress::Gzip qw/gzip $GzipError/;
use IO::Uncompress::Gunzip qw/gunzip $GunzipError/;

# connect to MongoDB on localhost
my $mc  = MongoDB->connect();

# get the MongoDB::GridFSBucket object for the 'test' database
my $gfs = $mc->db("test")->gfs;

# drop the GridFS bucket for this demo
$gfs->drop;

Next, let’s say we have a local file called big.txt. In my testing, I used one that was about 2 MB. The next part of the program prints the uncompressed size, opens a filehandle for uploading, then uses the gzip function to read from one handle and write to the other. Finally, we close the upload handle, which flushes the remaining data to GridFS, and check the compressed size that was uploaded.

# print info on file to upload
my $file = path("big.txt");
say "local  $file is " . ( -s $file ) . " bytes";

# open a handle for uploading
my $up_fh = $gfs->open_upload_stream("$file")->fh;

# compress and upload file
gzip $file->openr_raw => $up_fh
  or die $GzipError;

# flush data to GridFS
my $doc = $up_fh->close;
say "gridfs $file is " . $doc->{length} . " bytes";

Downloading is pretty much the same process. We open a download handle and an output handle for a disk file, then use the gunzip function to stream from the download handle to disk. In this case, because we don’t need the handles afterwards, we can do it all on one line instead of using temporary variables. Last, we report the size of the copy (to ensure it’s the same as the original) and the compression ratio.

# download and uncompress file
my $copy = path("big2.txt");
gunzip $gfs->open_download_stream( $doc->{_id} )->fh => $copy->openw_raw
  or die $GunzipError;
say "copied $copy is " . ( -s $copy ) . " bytes";

# report compression ratio
printf("compressed was %d%% of original\n", 100 * $doc->{length} / -s $file );

If you want to try this out, you can get the whole program from this gist: compressed-gridfs.pl.

When I run it on my sample file, this is the output I get:

$ perl compressed-gridfs.pl
local  big.txt is 2097410 bytes
gridfs big.txt is 777043 bytes
copied big2.txt is 2097410 bytes
compressed was 37% of original
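
As a quick check on that last line, here is the arithmetic on its own, using the two sizes from the output above. This is only a sketch (the variable names are mine, not part of the program); it shows that the %d format truncates the ratio rather than rounding it.

```perl
use strict;
use warnings;

# sizes taken from the sample output above
my ( $original, $compressed ) = ( 2097410, 777043 );

# ratio as a percentage, before any formatting
my $ratio = 100 * $compressed / $original;

# %d drops the fractional part; %.2f keeps two decimals
my $whole   = sprintf( "%d",   $ratio );
my $precise = sprintf( "%.2f", $ratio );

printf( "compressed was %d%% of original\n", $ratio );
print "more precisely, $precise%\n";
```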

Having GridFS uploads and downloads represented as Perl filehandles makes interoperation with many Perl libraries super easy.
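
If you’d like to see the handle-to-handle pattern without a MongoDB server, here is a minimal sketch. It is not part of the program above: in-memory scalar handles stand in for the GridFS upload and download handles (an assumption purely for illustration), and the variable names are hypothetical. The gzip and gunzip calls are the same ones used in the demo.

```perl
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
use IO::Compress::Gzip qw/gzip $GzipError/;
use IO::Uncompress::Gunzip qw/gunzip $GunzipError/;

# sample data; repetitive text compresses well
my $original = "some highly repetitive text\n" x 1000;

# "upload": read from an input handle, compress into an output handle
# (the in-memory $up_fh stands in for a GridFS upload handle)
open my $in_fh, '<', \$original or die "open input: $!";
my $stored = '';
open my $up_fh, '>', \$stored or die "open upload stand-in: $!";
gzip $in_fh => $up_fh
  or die $GzipError;
close $up_fh;

# "download": read the compressed bytes from a handle, uncompress to another
# (the in-memory $down_fh stands in for a GridFS download handle)
open my $down_fh, '<', \$stored or die "open download stand-in: $!";
my $copy = '';
open my $out_fh, '>', \$copy or die "open output: $!";
gunzip $down_fh => $out_fh
  or die $GunzipError;
close $out_fh;

say "original is " . length($original) . " bytes";
say "stored   is " . length($stored) . " bytes";
say "round trip " . ( $copy eq $original ? "ok" : "FAILED" );
```

Swapping the stand-in handles for ones obtained from open_upload_stream and open_download_stream gives you the GridFS version shown earlier.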

Of course, the new library still works nicely with handles you provide to it. If you want to upload from a handle, you use the upload_from_stream method:

$gfs->upload_from_stream("$file", $file->openr_raw);

Or, if you want to download to a handle, you use the download_to_stream method:

$gfs->download_to_stream($file_id, path("output")->openw_raw);

While the new GridFS API is currently only in the development version of the driver, I encourage you to try it out if you’re curious.

If you have feedback, please email me (DAGOLDEN at cpan.org), tweet at @xdg or open a Jira ticket.

Thanks!