At-sign nick completion for Weechat and Slack

I like connecting to chat apps like Slack and Flowdock using an IRC client, as that keeps all my chats in one window, right in my terminal.

I've been frustrated that the IRC gateways for Slack and Flowdock don't highlight people based just on their nick the way a typical IRC client does. For example, my IRC nick is "xdg". In an IRC channel, just saying "xdg: ping" will highlight and beep my terminal. But if I want to highlight "bob" who uses the Slack app, I need to say "@bob: ping".

Usually, when I highlight someone, I can use tab completion to finish the nick. That's really handy when I'm talking to my friend "BinGOs". But if I want to @-highlight him and I type "@Bi<TAB>", completion fails.

Since I use Weechat as my IRC client, I whipped up a plugin called atcomplete, which adds an @-prefixed nick for every nick in the nicklist.

With that, I can do "Bi<TAB>" and get "BinGOs", or "<TAB>" again to get "@BinGOs". And if I do "@Bi<TAB>", I get "@BinGOs" right away.

I've submitted it to the Weechat scripts page and it might eventually get approved, but it's available on github now as weechat-atcomplete if people want to try it out. Feedback and patches welcome!


No more dirty reads with MongoDB

If you're reading this blog, it's a good bet that sometime in your life you've had a computer freeze or crash on you. You know that crashes happen.

If it's your laptop, you restart and hope for the best. When it's your database, things are a bit more complicated.

Historically, a database lived on a single machine. Writes are considered "committed" when they are written to a journal file and flushed to disk. Until then, they are "dirty". If the database crashes, only the committed changes are recovered from disk.

So far, so good. While MongoDB originally didn't journal by default, journaling has been the default since version 2.0 in 2011.

As you can imagine, the problem with having a database on a single machine is that when the system is down, you can't use the database.

Trading consistency for availability

While MongoDB can run as a single server, it was designed as a distributed database for redundancy. If one server fails, there are others that can continue.

But once you have more than one server, you have a distributed system, which means you need to consider consistency between parts of the system. (See the Eight Fallacies of Distributed Computing.)

You might have heard of the CAP Theorem. While there are excellent critiques of its practical applicability – e.g. Thinking More Clearly About Consistency – it is sufficient to convey two key ideas:

  • because networks are unreliable, partitions happen, so for any real system, partition tolerance ("P") is a given.
  • because MongoDB uses redundant servers that continue to operate during a failure or partition, it gives up global consistency ("C") for availability ("A").

When we say that it gives up consistency, what we mean is that there is no single, up-to-date copy of the data visible across the entire distributed system.

Rethinking commitment

A MongoDB replica set is structured with a single "primary" server that receives writes from clients and streams them to "secondary" servers. If a primary server fails or is partitioned from other servers, a secondary is promoted to primary and takes over servicing writes.

But what does it mean for a write to be "committed" in this model?

Consider the case where a write is committed to disk on the primary, but the primary fails before the write has replicated to a secondary. When a new secondary takes over as primary, it never saw the write, so that write is lost. Even if the old primary rejoins the replica set (now as a secondary), it synchronizes with the current primary, discarding the old write.

This means that commitment to disk on a primary doesn't matter for whether a write survives a failure. In MongoDB, writes are only truly committed when they have been replicated to a majority of servers. We can call this "majority-committed" (hereafter abbreviated "m-committed").

Commitment, latency and dirty reads

When replica sets were introduced in version 1.6, MongoDB offered a configuration option called write concern (abbreviated "w"), which controls how long the server waits before acknowledging a write. The option specifies the number of servers that need to have the write before the primary unblocks the client – e.g. w=1 (the default), w=2, etc. You can also ask it to wait for a majority without knowing the exact number of servers (w='majority').

use MongoDB;
my $mc = MongoDB->connect(
    $uri,
    {
        w => 'majority'
    }
);

This means that a writing process can control how much latency it experiences waiting for levels of commitment. Set w=1 and the primary acknowledges the write when it is locally committed while replication continues in the background. Set w='majority' and you'll wait until the data is m-committed and thus safe from rollback in a failover.

You may spot a subtle problem in this approach.

A write concern only blocks the writing process! Reading processes can read the write from the primary even before it is m-committed. These are "dirty reads" – reads that might be rolled back if a failure occurs. Certainly, this is a rare case (compared to, say, transaction rollbacks), but it is a dirty read nonetheless.
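
To make that window concrete, here's a minimal sketch (the collection name and document are made up, and $uri is as in the snippets above) of how a reader can observe a write that the writer is still waiting to have m-committed:

use MongoDB;

# Writer process: asks for majority acknowledgement, so this call
# blocks until the insert is m-committed.
my $writer = MongoDB->connect( $uri, { w => 'majority' } );
$writer->ns("test.events")->insert_one( { msg => "hello" } );

# Reader process (running elsewhere): default settings return whatever
# the primary has locally committed, including writes that are not yet
# m-committed and could still be rolled back after a failover.
my $reader = MongoDB->connect($uri);
my $doc = $reader->ns("test.events")->find_one( { msg => "hello" } );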

Avoiding dirty reads

MongoDB 3.2 introduced a new configuration option called read concern, to express the commitment level desired by readers rather than writers.

Read concern is expressed as one of two values:

  • 'local' (the default), meaning the server should return the latest, locally committed data
  • 'majority', meaning the server should return the latest m-committed data

Using read concern requires v1.2.0 or later of the MongoDB Perl driver:

use MongoDB v1.2.0;
my $mc = MongoDB->connect(
    $uri,
    {
        read_concern_level => 'majority'
    }
);

With read concern, reading processes can finally avoid dirty reads.

However, this introduces yet another subtle problem: what happens if a process writes with w=1 and then immediately reads with read concern 'majority'?

Configured this way, a process might not read its own writes!
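
Here's a minimal sketch of that mismatched configuration (the collection and document are made up): the insert is acknowledged as soon as the primary has it locally, but the query that follows only sees m-committed data, so it can miss the write the same process just made.

use MongoDB v1.2.0;
my $mc = MongoDB->connect(
    $uri,
    {
        w                  => 1,
        read_concern_level => 'majority',
    }
);

my $coll = $mc->ns("test.profiles");

# acknowledged as soon as the primary has it locally...
$coll->insert_one( { name => "xdg" } );

# ...but this query returns only m-committed data, so the document
# inserted above may not be visible yet
my $doc = $coll->find_one( { name => "xdg" } );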

Recommendations for tunable consistency

I encourage MongoDB users to place themselves (or at least, their application activities) into one of the following groups:

  • "I want low latency" – Dirty reads are OK as long as things are fast. Use w=1 and read concern 'local'. (These are the default settings.)
  • "I want consistency" – Dirty reads are not OK, even at the cost of latency or slightly out-of-date data. Use w='majority' and read concern 'majority':
use MongoDB v1.2.0;
my $mc = MongoDB->connect(
    $uri,
    {
        read_concern_level => 'majority',
        w => 'majority',
    }
);

I haven't discussed yet another configuration option: read preference. A read preference indicates whether to read from the primary or from a secondary. The problem with reading from secondaries is that, by definition, they lag the primary. Worse, they lag the primary by different amounts, so reading from different secondaries over time pretty much guarantees inconsistent views of your data.

My opinion is that – unless you are a distributed systems expert – you should leave the default read preference alone and read only from the primary.


Please test Path-Tiny-0.081-TRIAL

The latest development releases of Path::Tiny include this whopper in the Changes file:

!!! INCOMPATIBLE CHANGES !!!
The relative() method no longer uses File::Spec's buggy rel2abs method. The new Path::Tiny algorithm should be comparable and passes File::Spec rel2abs test cases, except that it correctly accounts for symlinks. For common use, you are not likely to notice any difference. For uncommon use, this should be an improvement. As a side benefit, this change drops the minimum File::Spec version required, allowing Path::Tiny to be fatpacked if desired.

I sincerely hope that you won't notice the difference – or if you do, it's because Path::Tiny is defending you against a latent symlink bug.
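
For context, relative() re-expresses one path relative to a base directory (or to the current directory, if no base is given); a quick illustration with made-up paths:

use Path::Tiny;

# re-express an absolute path relative to a base directory
my $rel = path("/usr/local/lib/perl5")->relative("/usr/local");
print "$rel\n";    # lib/perl5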

That said, with any change of this magnitude there's a serious risk of breakage. PLEASE, PLEASE, PLEASE, if you use Path::Tiny, I ask that you test your module or application with Path-Tiny-0.081-TRIAL.

Here's how:

$ cpanm --dev Path::Tiny

# or

$ cpan DAGOLDEN/Path-Tiny-0.081-TRIAL.tar.gz

If you have any problems with it, please open a bug report in the Path::Tiny issue tracker.


My Github dashboard of neglect

Bitrot.

The curse of being a prolific publisher is a long list of once-cherished, now-neglected modules.

Earlier this week, I got a depressing Github notification. The author of a pull request, who had politely pestered me for a while to review his PR, added this comment:

1 year has passed

Ouch!

Sadly, after taking time to review the PR, I actually decided it wasn't a great fit and politely (I hope) rejected it. And then I felt even WORSE, because I'd made someone wait around a year for me to say "no".

Much like my weight hitting a local maximum on the scale, goading me to rededicate myself to healthier eating [dear startups, enough with the constant junk food, already!], this PR felt like a low point in my open-source maintenance.

And so, just like I now have an app to show me a dashboard of my food consumption, I decided I needed a bird's-eye view of what I'd been ignoring on Github.

Here, paraphrased, is my "conversation" with Github.

Me: Github, show me a dashboard!  

GH: Here's a feed of events on repos you watch

Me: No, I want a dashboard.

GH: Here's a list of issues created by, assigned to, or mentioning you.

Me: No, I want a dashboard.  Maybe I need an organization view.  [my CPAN repos are in an organization]

GH: Here's a feed of events on repos in the organization.

Me: No, I want a dashboard of issues.

GH: Here's a list of issues for repos in the organization.

Me: Uh, can you summarize that?

GH: No.

Me: Github, you suck.  But you have an API.  Time to bust out some Perl.

So I wrote my own github-dashboard program, using Net::GitHub. (Really, I adapted it from other Net::GitHub programs I already use.) I keep my Github user id and API token in my .gitconfig, so the program pulls my credentials from there.
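
The core of it is roughly the sketch below (a sketch, not the actual script: the git-config key names are hypothetical, wishlist detection is omitted, and pagination is ignored).

use Net::GitHub;

# hypothetical config keys; adjust to wherever you keep credentials
chomp( my $org   = qx/git config --get github.org/ );
chomp( my $token = qx/git config --get github.token/ );

my $gh = Net::GitHub->new( access_token => $token );

for my $repo ( $gh->repos->list_org($org) ) {
    # the issues API returns open issues and PRs together;
    # PRs carry a 'pull_request' key
    my @issues = $gh->issue->repos_issues( $org, $repo->{name} );
    my $prs    = grep { $_->{pull_request} } @issues;
    printf "%45s %3d %3d\n", $repo->{name}, $prs, scalar(@issues) - $prs;
}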

Below, you can see my Github dashboard of neglect (top 40 only!). The three columns of numbers are (respectively) PRs, non-wishlist issues and wishlist issues. (Wishlist items are identified either by label or by "wishlist" in the title.)

$ ./github-dashboard |  head -40
                               Capture-Tiny   3  18   0
                                    Meerkat   2   8   0
                               getopt-lucid   2   1   0
                                  Path-Tiny   1  21   0
                               HTTP-Tiny-UA   1   5   0
                         Path-Iterator-Rule   1   5   0
  Dist-Zilla-Plugin-BumpVersionAfterRelease   1   3   2
                              Metabase-Fact   1   3   0
                dist-zilla-plugin-osprereqs   1   2   0
       Dist-Zilla-Plugin-Test-ReportPrereqs   1   2   0
                                    ToolSet   1   2   0
        Dist-Zilla-Plugin-Meta-Contributors   1   1   0
     Dist-Zilla-Plugin-MakeMaker-Highlander   1   0   0
                         Task-CPAN-Reporter   1   0   0
                           IO-CaptureOutput   0   7   0
                                     pantry   0   7   2
                     TAP-Harness-Restricted   0   4   0
                            class-insideout   0   3   0
                               Hash-Ordered   0   3   0
                                    Log-Any   0   3   4
                                  perl-chef   0   3   0
                                 Term-Title   0   3   0
                               Test-DiagINC   0   3   0
                          Acme-require-case   0   2   0
                                 Class-Tiny   0   2   0
                                  Data-Fake   0   2   2
                  dist-zilla-plugin-twitter   0   2   0
                   Log-Any-Adapter-Log4perl   0   2   0
                             math-random-oo   0   2   0
                                 superclass   0   2   0
                                   Test-Roo   0   2   0
                              universal-new   0   2   0
                           zzz-rt-to-github   0   2   0
                      app-ylastic-costagent   0   1   0
                      Dancer-Session-Cookie   0   1   0
          Dist-Zilla-Plugin-CheckExtraTests   0   1   0
          Dist-Zilla-Plugin-InsertCopyright   0   1   0
Dist-Zilla-Plugin-ReleaseStatus-FromVersion   0   1   0
                                 File-chdir   0   1   0
                                 File-pushd   0   1   0

Now, when I set aside maintenance time, I know where to work.


A parallel MongoDB client with Perl and fork

Concurrency is hard, and that's just as true in Perl as it is in most languages. While Perl has threads, they aren't lightweight, so they aren't an obvious answer to parallel processing the way they are elsewhere. In Perl, doing concurrent work generally means (a) a non-blocking/asynchronous framework or (b) forking sub-processes as workers.

There is no officially-supported async MongoDB driver for Perl (yet), so this article is about forking.

The problem with forking a MongoDB client object is that forks don't automatically close sockets. And having two (or more) processes trying to use the same socket is a recipe for corruption.

At one point in the design of the MongoDB Perl driver v1.0.0, I had it cache the PID on creation and then check if it had changed before every operation. If so, the socket to the MongoDB server would be closed and re-opened. It was auto-magic!

The problem with this approach is that it incurs overhead on every operation, regardless of whether forks are in use. Even if forks are used, they are rare compared to the frequency of database operations for any non-trivial program.

So I took out that mis-feature. Now, you must manually call the reconnect method on your client objects after you fork (or spawn a thread).

Here's a pattern I've found myself using from time to time to do parallel processing with Parallel::ForkManager, adapted to reconnect the MongoDB client object in each child:

use Parallel::ForkManager;

# Pass in a MongoDB::MongoClient object, the number of parallel jobs to
# run, and a code reference to execute. The code reference is passed the
# job number (0 .. jobs-1) and should close over anything else it needs.
sub parallel_mongodb {
    my ( $client, $jobs, $fcn ) = @_;

    my $pm = Parallel::ForkManager->new( $jobs > 1 ? $jobs : 0 );

    local $SIG{INT} = sub {
        warn "Caught SIGINT; Waiting for child processes\n";
        $pm->wait_all_children;
        exit 1;
    };

    for my $i ( 0 .. $jobs - 1 ) {
        $pm->start and next;
        $SIG{INT} = sub { $pm->finish };
        $client->reconnect;
        $fcn->( $i );
        $pm->finish;
    }

    $pm->wait_all_children;
}

To use this subroutine, I partition the input data into the number of jobs to run. Then I call parallel_mongodb with a closure that can find the input data from the job number:

use MongoDB;

# Partition input data into N parts.  Assume each is a document to insert.
my @data = (
   [ { a => 1 },  {b => 2},  ... ],
   [ { m => 11 }, {n => 12}, ... ],
   ...
);
my $number_of_jobs = @data;

my $client = MongoDB->connect;
my $coll = $client->ns("test.dataset");

parallel_mongodb( $client, $number_of_jobs,
  sub {
    $coll->insert_many( $data[ shift ], { ordered => 0 } );
  }
);

Of course, you want to be careful that the job count (i.e. the partition count) is optimal. I find that having it roughly equal to the number of CPUs tends to work pretty well in practice.

What you don't want to do, however, is to call $pm->start more times than the number of child tasks you want running in parallel. You don't want a new process for every data item to process, since each fork also has to reconnect to the database, which is slow. That's why you should figure out the partitioning first, and only spawn a process per partition.

This is best for "embarrassingly parallel" problems, where there's no need for communication back from the child processes. And while what I've shown does a manual partition into arrays, you could also do this with a single array, where child workers only process indices where the index modulo the number of jobs is equal to the job ID. Or you could have child workers pulling from a common task queue over a network, etc.
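
For instance, here's a sketch of that single-array variant, reusing the names from the example above but assuming @data is one flat list of documents rather than pre-partitioned arrayrefs:

parallel_mongodb( $client, $number_of_jobs,
    sub {
        my $job = shift;
        # each worker takes the indices congruent to its job number
        for my $i ( grep { $_ % $number_of_jobs == $job } 0 .. $#data ) {
            $coll->insert_one( $data[$i] );
        }
    }
);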

TIMTOWTDI, and now you can do it in parallel.

