How to replicate a failure

Whenever I get a bug report or a CPAN Testers FAIL report and the issue is not obvious right away, the first thing I try to do is replicate the failure. Without a replication, I’m left to diagnose with hunches and release new code that I hope fixes the problem. That’s not software engineering, it’s software faith healing.

Today, I had a wonderful experience that I want to hold up as an example of what makes my life much easier. For a while now, I’ve been getting very bizarre FAIL reports for the latest Module::Build development releases from one particular CPAN tester, Chad Davis. The symptom is that an outdated module is picked up in Chad’s local::lib, even though CPAN.pm appears to have detected (and satisfied) the missing dependency. What is particularly weird is that Chad later reported that an explicit “test …” of the dependency followed by “test …” of the Module::Build development release then passes all tests.

My first attempts at replication failed, so after a few emails back and forth, Chad sent me detailed instructions for replication. Here is an excerpt from his email (used with his permission):

I setup a new user, and deleted his .bashrc, leaving only the stock
Ubuntu 10.10 /etc/bash.bashrc in the environment. Then I created this
three-line ~/.bashrc :

perl5=/tmp/tmplib
lib="$perl5/lib/perl5"
eval $(perl -I"$lib" -Mlocal::lib="$perl5")

So, the environment now resolves to:

PERL5LIB=/tmp/tmplib/lib/perl5/x86_64-linux-gnu-thread-multi:/tmp/tmplib/lib/perl5
PERL_LOCAL_LIB_ROOT=/tmp/tmplib
PERL_MB_OPT='--install_base /tmp/tmplib'
PERL_MM_OPT=INSTALL_BASE=/tmp/tmplib

Notice that Chad described the setup in detail and even went to the trouble of replicating it with a “clean” user. This made it very easy to set up exactly the same situation on my development machine.
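For readers who cannot easily create a fresh user, a rough approximation is to source a candidate rc file in an otherwise empty environment and see which PERL* variables come out. The sketch below is only an illustration: the function name `perl_env_from_rc` is hypothetical, it uses `sh` rather than an interactive bash login, and it is not a substitute for the clean user or clean VM that Chad used.

```shell
# Print the exported PERL* variables that result from sourcing an rc file
# in an otherwise empty environment. The helper name is hypothetical.
# Pass a path containing a slash (e.g. ./my.bashrc), because "." searches
# PATH for bare file names.
perl_env_from_rc() {
    # env -i drops the caller's environment; PATH is kept so perl can be
    # found, and HOME is kept because rc files often refer to it.
    env -i PATH="$PATH" HOME="${HOME:-/}" \
        sh -c '. "$1" && env | grep "^PERL" | sort' _ "$1"
}

# Usage (assumes the three-line file above was saved as /tmp/test.bashrc):
#   perl_env_from_rc /tmp/test.bashrc
```

If the output differs from what the reporter sees, something outside the rc file is probably contributing to the environment.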

Next, Chad walked me through how to replicate the issue and even confirmed it on a fresh virtual machine!

Then I installed Parse::CPAN::Meta 1.42 (with CPAN 1.9402 and
local::lib 1.008), then I tested MB 0.37_04 which gave the same
errors. Then I quit the cpan shell, restart it, first do an explicit
test of PCM 1.4401 before running a test of MB and all tests pass, as
before.

I also verified the same behavior on a virtual machine with a
fresh ubuntu 10.10 (with updates) and the same three-line .bashrc and
got the same behavior. I'm surprised that you cannot reproduce this.
At this point I don't believe there is anything left that is specific
to my environment.

With those clear instructions, I was able to replicate the issue.
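Since the symptom here is an outdated copy of a module being picked up, it helps to record exactly which file and version perl loads in a given environment. The following is a minimal sketch of one way to do that; `module_info` is a hypothetical helper name, and it reports only what the current `perl` on `PATH` finds with the current `PERL5LIB`.

```shell
# Print "<module> <version> <path of the file perl actually loaded>".
# The helper name is hypothetical; it relies only on core Perl features.
module_info() {
    perl -e '
        my $mod = shift;
        # Turn Some::Module into Some/Module.pm, the key used in %INC.
        (my $file = "$mod.pm") =~ s{::}{/}g;
        eval { require $file; 1 } or die "$mod: cannot be loaded\n";
        my $v = $mod->VERSION;
        printf "%s %s %s\n", $mod, (defined $v ? $v : "undef"), $INC{$file};
    ' "$1"
}

# Usage:
#   module_info Parse::CPAN::Meta
```

Running it before and after each step of a replication makes it easy to see when the loaded copy changes.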

Chad then took me through variations and outcomes:

I then upgraded CPAN to 1.94_65 but have the same errors on MB 0.37_04
unless I explicitly test PCM 1.4401 first.

And I have the same problems with MB version 0.37_05 as well, which
also works after an explicit test of PCM 1.4401

I tried to start working backwards a bit, MB 0.3624 is fine,
presumably because it doesn't depend on PCM.
However, MB 0.3701 fails, despite the fact that it only depends on PCM
1.42, which is the one that's installed, according to pmvers. It looks
to be the same issue in each case: the existing PCM is not always
detected.

Now I have a fact base that I can test and confirm. Finally, Chad did some further digging into the problem:

Then I looked at t/mymeta.t and traced back to CPAN::Meta and looked
at its dependencies in Makefile.PL to find that PCM 1.44 is listed both
as prereq and as a build prereq, which, being a novice, looked a bit
funny to me. Taking out the build prereq and leaving the prereq then
allowed me to test MB without errors.

And with that, I’ve got a decent workaround, as well as a hint as to what’s going wrong in CPAN.pm. I haven’t fixed CPAN.pm yet, but with this start, it’s going to be much, much easier.

I want to thank Chad for his responsiveness and the incredible detail of his report. I hope it can be a lesson to anyone reporting bugs or responding to questions about failure:

Describe your environment in detail. If you can, replicate the error with a “clean” user or even on a clean install of the OS.
Describe the exact steps you took to replicate the failure. Take notes as you do it, so you can be as specific as possible.
Describe any variations you tried that did work, or any workarounds you used to deal with the problem.
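One way to follow the "take notes as you do it" advice is to let the shell keep the notes. The sketch below wraps each command so that the command line, its output, and its exit status are appended to a notes file. The helper name `run_logged`, the `REPRO_NOTES` variable, and the default file name are all assumptions made for illustration.

```shell
# Run a command and append the command line, its combined output, and its
# exit status to a notes file. Names here are hypothetical.
run_logged() {
    notes=${REPRO_NOTES:-repro-notes.txt}
    printf '$ %s\n' "$*" >> "$notes"
    status=0
    # Capture stdout and stderr together so the notes show what was seen.
    "$@" >> "$notes" 2>&1 || status=$?
    printf '[exit %s]\n\n' "$status" >> "$notes"
    return "$status"
}

# Usage:
#   run_logged uname -srm
#   run_logged perl -v
```

The resulting file can be pasted into a bug report as is, which avoids reconstructing the steps from memory afterwards. Note that the output goes to the file rather than the terminal, so this suits non-interactive commands best.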

That seems like common sense, but few bug reports or failure investigations I get are as thorough and constructive as the example above. Thank you, Chad!