Perl QA hackathon wrapup

From mid-air somewhere near Greenland... I'm on my way back from the fifth annual Perl QA Hackathon and I can't believe it's already over. I missed the last two and I'd forgotten what an awesome experience it is.

tl;dr: Stuff I worked on:

Why I love the QA hackathon

If you've been under a rock and still don't know what the QA Hackathon is: it's a sponsored conference in which a small band of dedicated Perl hackers spend three days madly coding to improve the quality of the Perl experience for everyone.

I really enjoyed the chance to meet people in person that I only know from on-line venues or rarely get to see face-to-face. Having so many people working on so many projects in one space made it really easy to benefit serendipitously from the work of others, or to have a chance conversation spark a new way to get things done.

Another thing that makes the hackathon awesome is how quickly blocking issues get fixed. Several times, someone would hit a bug in some related library, walk across the room to the person who could fix it, and it would get fixed and shipped before one could get a coffee.

This year, my work focused mostly on the evolution of CPAN.pm and the CPAN toolchain.

A new way of thinking about CPAN indexes

On the morning of the first day, I convened an informal group of half-a-dozen CPAN client maintainers, installer maintainers and other interested parties [1] to talk about how to re-think CPAN indexing. In particular, I wanted to separate the notion of the "index" from the "repository". The canonical CPAN index is a file on CPAN that maps Perl package names ("Foo::Bar") to a path to a distribution archive file on CPAN ("DAGOLDEN/Foo-Bar-1.23.tar.gz").

Historically, your CPAN client mirrored the index from the same CPAN mirror used to download tarballs. I think that's limiting in a few ways. First, that file keeps growing as CPAN grows and it takes a while to mirror the whole file when you only need the mapping for a few modules.

Some CPAN clients, like cpanminus, don't even use the package index directly, but query a web API that serves up answers from it, which is one way to separate the index from the repository. That would be a nice feature to have in all CPAN clients.

That still doesn't change the model of having only one official index that your client knows about. If you or your company want to manage the mapping, you've got to use various tools to modify the official index in some sort of minicpan or private CPAN repository (aka "DarkPAN"). It's possible, but not user-friendly.

After some debate, the group agreed on a new model. A CPAN client should support an ordered list of index resolvers and should query them in turn. This means you could specify that you want an online web resolver tried first, and only then the traditional index.

More powerfully, you could list a local overlay index as the first resolver. That would let you freeze the mapping to a particular version, or swap in a development release that fixes a critical bug. The overlay index would only need to list the modules you want to change, because your CPAN client would fall back to the canonical index for everything else. You could even have an overlay index per-application or per-application-version for total control.
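As a rough illustration of the ordered-resolver idea with a local overlay, here is a minimal sketch. It assumes each resolver is just a code reference that returns a distribution path or undef. The names `resolve`, `%overlay` and `%canonical`, and all the sample entries, are invented for this example; this is not the CPAN::Common::Index API, which was still being written at the time.

```perl
use strict;
use warnings;

# Hypothetical sample data: a tiny "canonical" index and a local overlay
# that pins one module to an older release.
my %canonical = (
    'Foo::Bar' => 'D/DA/DAGOLDEN/Foo-Bar-1.23.tar.gz',
    'Foo::Baz' => 'D/DA/DAGOLDEN/Foo-Baz-0.50.tar.gz',
);
my %overlay = (
    'Foo::Bar' => 'D/DA/DAGOLDEN/Foo-Bar-1.20.tar.gz',
);

# Each resolver takes a package name and returns a path, or undef if unknown.
# Order matters: the overlay is consulted before the canonical index.
my @resolvers = (
    sub { $overlay{ $_[0] } },
    sub { $canonical{ $_[0] } },
);

# Query the resolvers in order; the first one with an answer wins.
sub resolve {
    my ($package, @chain) = @_;
    for my $resolver (@chain) {
        my $path = $resolver->($package);
        return $path if defined $path;
    }
    return;
}

for my $package (qw(Foo::Bar Foo::Baz No::Such::Module)) {
    my $path = resolve($package, @resolvers);
    printf "%-18s => %s\n", $package, defined $path ? $path : '(not found)';
}
```

Because the overlay only lists Foo::Bar, every other lookup falls through to the second resolver, which is the fallback behavior described above.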

We also agreed that mapping shouldn't just be a distribution path on a CPAN mirror, but should evolve into a URL. This would allow overlay indexes pointing to locally patched distributions, or to the BackPAN, or potentially even to source repositories (if appropriate scheme handlers were written to check out the necessary files).
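To make the "scheme handler" idea concrete, here is a small sketch of dispatching on the URL scheme. Everything in it is an assumption for illustration: the `plan_fetch` function, the handler table, and the set of schemes are hypothetical, and the handlers only describe what they would do instead of fetching anything.

```perl
use strict;
use warnings;

# Hypothetical handlers keyed by URL scheme. Each returns a description of
# the action; a real handler would download or check out the files.
my %handler_for = (
    http  => sub { "download $_[0]" },
    https => sub { "download $_[0]" },
    file  => sub { "copy local file $_[0]" },
    git   => sub { "check out repository $_[0]" },
);

# Pick a handler based on the scheme of the URL an index entry points to.
sub plan_fetch {
    my ($url, $handlers) = @_;
    my ($scheme) = $url =~ m{\A([a-z][a-z0-9+.-]*)://}i
        or die "Not a URL: $url\n";
    my $handler = $handlers->{ lc $scheme }
        or die "No handler for scheme '$scheme'\n";
    return $handler->($url);
}

print plan_fetch('file:///srv/patched/Foo-Bar-1.23.tar.gz', \%handler_for), "\n";
```

An index entry with an unsupported scheme fails loudly here, which seems preferable to silently skipping a resolver's answer.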

In summary -- CPAN indexes should become an open, flexible mechanism to give users more control over how module names are mapped to the files that can provide them.

After reaching that agreement, Nick Perez (nperez) volunteered to start working on a common library for index resolvers (to be called CPAN::Common::Index) and to build a resolver for it that uses MetaCPAN to provide the mapping data.

Meanwhile, I started work on a proof of concept for how CPAN.pm could be modified to use the new, common library instead of its traditional index lookup routines.

Evolution of CPAN.pm

It had been a while since I was deep in the guts of CPAN.pm, but after coming back up to speed, I tackled two big projects and found one crazy bug along the way.

New CPAN indexing and reduced memory footprint

A stock CPAN.pm client uses a ton of memory when the indexes are loaded (about 300 MB last time I looked). It's bloated because it keeps a read-only copy of the indexes in in-memory data structures and also keeps a mutable object for every index entry.

Some time ago, CPAN::SQLite was released to help solve that problem. It kept the indexes in a SQLite database and loaded data into memory on demand. I decided to use that same approach for the interface to the forthcoming CPAN::Common::Index library, with the goal of being able to load all data on demand, even directly from the package index file, using only core Perl modules.

Here's the trick: the package index file is line-oriented and is sorted by package name. Using the Search::Dict core module, I was able to do a binary search as a super-fast way to look up data for a package name.
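A minimal sketch of that lookup follows. It is not the CPAN.pm proof of concept: the helper names `open_sample_index` and `lookup_package` and the index entries are made up, the sample file has no header, and it assumes the lines are sorted in plain string order (Search::Dict's `look` also takes a fold option for case-insensitive ordering).

```perl
use strict;
use warnings;
use Search::Dict;                 # core module; exports look()
use File::Temp qw(tempfile);

# Write lines to a temporary file and return a read handle on it.
sub open_sample_index {
    my @lines = @_;
    my ($out, $path) = tempfile(UNLINK => 1);
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close $path: $!";
    open my $in, '<', $path or die "Cannot open $path: $!";
    return $in;
}

# Binary-search a sorted, header-free index for an exact package name.
sub lookup_package {
    my ($fh, $package) = @_;
    my $pos = look($fh, $package);    # seeks to the first line ge $package
    return if $pos < 0;               # look() returns -1 if a seek failed
    my $line = <$fh>;
    return unless defined $line;      # ran off the end: no such package
    my ($name, $version, $path) = split ' ', $line;
    return unless defined $name && $name eq $package;
    return { package => $name, version => $version, path => $path };
}

# Made-up entries in "package version path" form, already sorted.
my $index = open_sample_index(
    'Acme::Example  0.01  A/AU/AUTHOR/Acme-Example-0.01.tar.gz',
    'Foo::Bar       1.23  D/DA/DAGOLDEN/Foo-Bar-1.23.tar.gz',
    'Foo::Bar::Baz  1.23  D/DA/DAGOLDEN/Foo-Bar-1.23.tar.gz',
);

for my $package (qw(Foo::Bar Foo::Bar::Baz Foo::Missing)) {
    my $entry = lookup_package($index, $package);
    print "$package: ", ($entry ? $entry->{path} : 'not in index'), "\n";
}
```

Note the exact-match check after `look`: it only positions the handle at the first line that sorts at or after the key, so the caller still has to confirm that the line really is the package it asked for.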

The wrinkle in that plan is that Search::Dict wants a filehandle and uses it to seek around in the file, but the package index has an email-style header that confuses it. I could have copied it without the header, but that takes time and memory, too. PAUSE could publish an identical copy without the header, but that's extra work for PAUSE and potentially confusing if they ever get out of sync.

Instead, I wrote Tie::Handle::Offset and Tie::Handle::SkipHeader to hide the email header on a handle, so I could give that directly to Search::Dict. Unfortunately, Search::Dict died unless stat() on the handle gave a valid response, so I patched it to fall back to an alternate method if stat() failed. That revealed a bug in Perl, in which stat() warns when called on tied handles, even if there is a valid filehandle to check (filed as rt#112164).
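To show the general idea of hiding a header behind a tied handle, here is a simplified sketch. It is not the code of Tie::Handle::Offset or Tie::Handle::SkipHeader; the class name `Local::SkipHeaderHandle` is hypothetical, and it assumes the header ends at the first blank line and that the handle is read-only.

```perl
use strict;
use warnings;

{
    package Local::SkipHeaderHandle;    # hypothetical name, for illustration

    # Usage: tie *FH, 'Local::SkipHeaderHandle', $path;
    sub TIEHANDLE {
        my ($class, $path) = @_;
        open my $fh, '<', $path or die "Cannot open $path: $!";
        # Consume header lines up to and including the first blank line.
        while (defined(my $line = <$fh>)) {
            last if $line =~ /\A\s*\z/;
        }
        # Remember where the body starts so positions can be shifted.
        return bless { fh => $fh, offset => tell($fh) }, $class;
    }

    sub READLINE {
        my ($self) = @_;
        return readline($self->{fh});   # caller's context is passed through
    }

    # Absolute seeks are shifted past the header; other seeks pass through.
    sub SEEK {
        my ($self, $pos, $whence) = @_;
        $pos += $self->{offset} if $whence == 0;
        return seek($self->{fh}, $pos, $whence);
    }

    sub TELL  { my ($self) = @_; return tell($self->{fh}) - $self->{offset} }
    sub EOF   { my ($self) = @_; return eof($self->{fh}) }
    sub CLOSE { my ($self) = @_; return close($self->{fh}) }
}

use File::Temp qw(tempfile);

# A made-up file with an email-style header, a blank line, then the body.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "File: sample index\nDescription: made-up header\n\n",
             "Foo::Bar  1.23  D/DA/DAGOLDEN/Foo-Bar-1.23.tar.gz\n";
close $out or die "Cannot close $path: $!";

tie *INDEX, 'Local::SkipHeaderHandle', $path;
seek(INDEX, 0, 0);                      # "start of file" is the start of the body
print "offset ", tell(INDEX), ": ", scalar <INDEX>;
```

Code that seeks and reads through the tied handle never sees the header, which is the property a binary search over the body needs.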

Since that bug can't get fixed until Perl 5.17 and since we need a working Search::Dict for older Perls anyway, I patched Search::Dict to avoid using stat() on handles, and asked Ricardo Signes (rjbs) to give me a green light to make a dual-life release to CPAN.

After chasing my tail on that for a while, I finally got a proof of concept working that does on-demand lookups directly against the package index file, saving hundreds of megabytes of memory. It didn't use the CPAN::Common::Index library, since Nick was still writing it, but it expects the same API, so it will be easy to adapt once CPAN::Common::Index is ready (meaning that fast MetaCPAN lookups for CPAN.pm should be easy too).

My POC only covered the package index, but Andreas Koenig (klapperl) and Ricardo created a similarly sorted index of author data and we agreed to consider a similar approach for modlist data once we see how the package indexing works in practice.

I would have been happy if that was all I achieved at the hackathon but I still had some time left to get more done.

CPAN.pm support for 'recommends' and 'suggests' prereqs

The v2 CPAN::Meta::Spec formalized dependency specifications for different phases (configure/build/test/runtime) and for different levels of dependency (requires/recommends/suggests/conflicts). The 'recommends' level is for things that should usually be installed to make a module better, except in really resource-constrained environments. The 'suggests' level is for truly optional modules that might make a module better but aren't necessary for regular use.

Even though those have been specified for a while, none of the CPAN clients supported them -- meaning it was a manual job to look at the META file, see the recommends/suggests and install them yourself. Usually, no one bothers.

Since I was on a roll from the indexing work, I set up another CPAN.pm feature branch and implemented support for a 'recommends_policy' and a 'suggests_policy' to control whether those prereqs should be queued up along with the required ones. Even better, if those optional dependencies fail for any reason, CPAN.pm won't warn about missing dependencies and simply notes them as being optional when it reports the failures after processing a command.
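Here is a small sketch of how such policies could decide what gets queued. It is not the CPAN.pm implementation: the function `prereqs_to_queue` is hypothetical, the policies are treated as simple true/false values (an assumption), and the module names are made up. The prereqs hash follows the phase/relationship layout of CPAN::Meta::Spec v2.

```perl
use strict;
use warnings;

# Prereqs in CPAN::Meta::Spec v2 layout:
#   phase => relationship => { module => version }
# Module names and versions are made up.
my %prereqs = (
    runtime => {
        requires   => { 'Some::Required'    => '1.00' },
        recommends => { 'Some::Recommended' => '0.50' },
        suggests   => { 'Some::Suggested'   => '0' },
    },
);

# Return [module, version, is_optional] triples to queue for the given phases.
sub prereqs_to_queue {
    my ($prereqs, $policy, @phases) = @_;
    my @levels = ('requires');
    push @levels, 'recommends' if $policy->{recommends_policy};
    push @levels, 'suggests'   if $policy->{suggests_policy};

    my @queue;
    for my $phase (@phases) {
        for my $level (@levels) {
            my $mods = ($prereqs->{$phase} || {})->{$level} or next;
            for my $module (sort keys %$mods) {
                my $optional = $level eq 'requires' ? 0 : 1;
                push @queue, [ $module, $mods->{$module}, $optional ];
            }
        }
    }
    return @queue;
}

my %policy = ( recommends_policy => 1, suggests_policy => 0 );
for my $item (prereqs_to_queue(\%prereqs, \%policy, 'runtime')) {
    my ($module, $version, $optional) = @$item;
    print "$module $version", ($optional ? ' (optional)' : ''), "\n";
}
```

Carrying the optional flag along with each queued prereq is what would let a client report a failed optional dependency differently from a failed required one.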

Along the way, I found and fixed a CPAN.pm edge-case bug: when a module was listed in both "build requires" and "runtime requires" with a lower version in "build requires", the lower requirement would overwrite the higher one from "runtime requires". (yikes!) That might explain some bizarre CPAN.pm bug reports I've seen that we could never track down, so it was an extra win.
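The safe way to combine requirements from two phases is to keep the highest minimum for each module instead of letting the last one win. The sketch below shows that for simple minimum versions only; the function `merge_minimums` and the sample data are invented, and this is not the actual CPAN.pm fix. Real requirement strings can also be ranges or exclusions, which CPAN::Meta::Requirements is designed to handle.

```perl
use strict;
use warnings;
use version;    # core module; gives correct version comparison

# Merge "module => minimum version" maps, keeping the highest minimum seen.
sub merge_minimums {
    my @maps = @_;
    my %merged;
    for my $map (@maps) {
        for my $module (keys %$map) {
            my $new = $map->{$module};
            my $old = $merged{$module};
            $merged{$module} = $new
                if !defined $old
                || version->parse($new) > version->parse($old);
        }
    }
    return \%merged;
}

# Made-up requirements: build wants less than runtime does.
my $build   = { 'Some::Module' => '1.00' };
my $runtime = { 'Some::Module' => '1.50' };

my $merged = merge_minimums($runtime, $build);
print "Some::Module >= $merged->{'Some::Module'}\n";
```

With a plain hash assignment in the same order, the build requirement of 1.00 would have replaced 1.50, which is the kind of overwrite described above.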

Unfortunately, ExtUtils::MakeMaker and Module::Build don't yet preserve 'suggests' dependencies during configuration, so this will only help with 'recommends', but fixes to the installers are in the works (Ricardo was working on EU::MM at the hackathon) and CPAN.pm will be ready whenever they are.

Adding features and fixing bugs

CPAN::Meta got a tiny bit of love. I released a version of Parse::CPAN::Meta with dependencies on the latest (less-buggy) versions of CPAN::Meta::YAML and JSON::PP. (I've already got CPAN Testers fail reports, so the tests apparently need some more work.)

Leon Timmermans (leont) added a new method to CPAN::Meta::Requirements for something he was working on, which was awesome because I wound up needing it for the CPAN.pm work only a couple hours after he sent me the pull request. Then I split out CPAN::Meta::Requirements from CPAN::Meta and released it, so CPAN.pm could depend on it without needing all of CPAN::Meta. CPAN::Meta also got some releases for these various changes.

In a startling display of synchronicity, both Curtis Poe (ovid) and Lars Dɪᴇᴄᴋᴏᴡ (daxim) reported a weird Module::Build bug within about an hour of each other. Apparently, errors in META file creation can result in existing META files being deleted, no new files being created and no error message shown about what happened. Leon and I figured out the problem and offered some workarounds — though we ran out of time at the hackathon to fix it in Module::Build itself.

Various other things I did

Several people, including Leon, Michael Schwern, Olivier Mengué (dolmen), Lars, me, and a few others I now forget (sorry), got together to discuss a draft "Build.PL API". It defines what CPAN clients should expect when interacting with a Build.PL/Build-based installer, which opens the door to future replacements for Module::Build, like Module::Build::Tiny.

Breno de Oliveira (garu) wanted to add CPAN Testers reporting to cpanminus, and along the way volunteered to write a unified, second-generation CPAN Testers client to replace the disparate behaviors of CPAN::Reporter and the reporting modules of CPANPLUS. I gave a small tutorial on CPAN Testers and the Metabase backing it to Breno and others interested in the topic.

As a minor note, I got annoyed at some Test::Spelling carping during all the releases I was doing, so I released a new Pod::Wordlist::hanekomu. If you use Dist::Zilla and the Test::PodSpelling plugin, check it out!

Cool things other people did

Some things I didn't work on that I thought were notable:

  • To support CPAN::Common::Index, Nick wrote MetaCPAN::API::Tiny — a client for querying MetaCPAN that relies only on core Perl modules, which is exactly what we need for a new CPAN.pm index resolver
  • The CPAN "package index" now updates every five minutes instead of every hour... which means other projects that rely on it, like MetaCPAN, are even closer to real time.
  • I asked around if there was a command-line client for MetaCPAN and there wasn't. Then Chris Nehren (apeiron) asked me what I had in mind, whipped one up, and submitted it as an addition to the MetaCPAN::API distribution
  • Ricardo worked on getting full support for CPAN::Meta::Spec v2 into ExtUtils::MakeMaker, including TEST_REQUIRES and ensuring all prerequisites types are preserved in MYMETA.json files
  • Ricardo also got PAUSE to save package index files into git after each update, so we no longer lose historical information
  • Peter Rabbitson (ribasushi) demonstrated a way to use git to store CPAN Testers reports to achieve massive delta compression (and make it easy for people to get copies of the raw data quickly and cheaply). I didn't have time at the hackathon to do much with it but hope to look into it more soon.
  • Late in the afternoon on Sunday, Nick used his MetaCPAN::API::Tiny client for what was dubbed "CloudPAN", a crazy April-Fools proof-of-concept to hook module loading to load missing modules directly from source on metacpan. You'll never need to install pure-Perl modules again. ;-)
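
For anyone curious how the "CloudPAN" trick in the last item can work at all: Perl lets you put a code reference in @INC, and `require` calls it for modules it can't find on disk. The sketch below is not Nick's code. It fakes the remote side with an in-memory hash (`%FAKE_REMOTE` and `Local::CloudDemo` are made up), where a real hook would fetch the source over the network.

```perl
use strict;
use warnings;

# Stand-in for a remote source lookup, keyed by the path require() asks for.
my %FAKE_REMOTE = (
    'Local/CloudDemo.pm' =>
        "package Local::CloudDemo;\nsub greeting { 'hello from a hook' }\n1;\n",
);

# A code reference in @INC is called as $hook->($hook, $filename).
# Returning a filehandle makes require() compile the source read from it;
# returning nothing lets require() carry on to the next @INC entry.
push @INC, sub {
    my ($self, $file) = @_;
    my $source = $FAKE_REMOTE{$file} or return;
    open my $fh, '<', \$source or die "Cannot open in-memory handle: $!";
    return $fh;
};

require Local::CloudDemo;    # not on disk; supplied by the hook
print Local::CloudDemo::greeting(), "\n";
```

The hook is pushed onto the end of @INC, so modules that are installed locally are still found first.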

There was a lot more going on and a lot I missed, so if I omitted anyone's project, I mean no offense. (I'll read all the hackathon blogs to catch up.)

Conclusion and Acknowledgments

This was my third hackathon and was just as inspiring (and productive) as my first two. I'm excited about the evolution of CPAN.pm and hope to get my work tested further and then merged into the CPAN.pm master branch before long.

I have nothing but wonderful things to say about Laurent Boivin (elbeho), Philippe Bruhat (BooK) and the French Perl Mongers who organized a great event and provided wonderful hospitality, including an endless supply of food, drink and coffee machines to fuel our hacking.

I would also like to thank the hackathon sponsors whose generosity made the hackathon possible and enabled me to attend. (If you'd like to donate, it's not too late and will help support next year's QA hackathon.)

These companies and organizations support Perl. Please support them: The City of Science and Industry, Diabolo.com, Dijkmat, DuckDuckGo, Dyn, Freeside Internet Services, Hedera Technology, Jaguar Network, Mongueurs de Perl, Shadowcat Systems Limited, SPLIO, TECLIB’, Weborama, and $foo Magazine

These people made individual donations (you rock!): Martin Evans, Mark Keating, Prakash Kailasa, Neil Bowers, 加藤 敦 (Ktat), Karen Pauley, Chad Davis, Franck Cuny, 近藤嘉雪, Tomohiro Hosaka, Syohei Yoshida, 牧 大輔 (lestrrat), and Laurent Boivin

Special thanks also to Torsten Raudssus (getty) and Duck Duck Go for the tee-shirt and Booking for the silly putty. :-)

Finally, thank you to all my fellow hackers! I had a great time and I hope to see you all again next year!

[1] CPAN index discussion group (with some people coming and going): me, Andreas Koenig, Florian Ragwitz, Michael Peters, Michael Schwern, Nick Perez, Olaf Alders, Olivier Mengué, Ricardo Signes, Tatsuhiko Miyagawa and probably even more I don't remember. (Please remind me if you were there and want to share the credit/blame.)


5 Comments

  1. Posted April 3, 2012 at 10:55 pm | Permalink

    I'm a little curious about the Build.PL API. Has any consideration been given to large extensions to Module::Build? Of course I am thinking about my Alien::Base (https://github.com/jberger/Alien-Base/), which adds many additional keys and methods. Of course this isn't anywhere near finished, nor does its API need to be included; I'm just wondering if there are provisions for extensions like mine.

  2. Posted April 3, 2012 at 11:32 pm | Permalink

    The goal of the Build.PL API spec is sort of the opposite -- it's to define the minimum that a CPAN client should expect of a pure-Perl installer, whether it's Module::Build or something else. What goes inside the Build.PL or what the installer actually does isn't what matters -- that can be as simple or as sophisticated as someone needs.

    For pure-Perl installers to interoperate with CPAN clients, they need to accept the same options on the command line and have the same general behaviors for "perl Build.PL", "Build", "Build test" and "Build install". That's what the Build.PL API is trying to define.

  3. Posted April 9, 2012 at 2:56 am | Permalink

    Wow, this is great stuff. Do you envision that the package maintainer could specify a list of urls for download? That would eg. allow one to move distribution of your package to a GitHub tag tarball.

    • Posted April 9, 2012 at 1:49 pm | Permalink

      I suppose one could specify a download URL in metadata, but I don't foresee that ever being used by PAUSE for indexing. The canonical location is always on a CPAN mirror or a BackPAN mirror. I see custom indexes being more useful for a company. E.g. MyCo::Internal::Foo -> a private URL with a tarball.

  4. Pau Amma
    Posted April 9, 2012 at 3:54 pm | Permalink

    In case you're interested in some of the smaller-scale efforts: http://perl.dreamwidth.org/5378.html

© 2009-2014 David Golden All Rights Reserved