Metabase: Opinions from anyone about anything

Explaining Metabase

At YAPC::NA, Ricardo and I finally released some early alpha Metabase modules to CPAN. Metabase is a project we started in 2008 at the Oslo QA Hackathon to provide a new transport and storage infrastructure for CPAN Testers and other CPAN distribution metadata. We’ve been working out of GitHub until now, but we think it’s time to encourage others in the Perl programming community to take a look.

One of the challenges we’ve had is just explaining Metabase to people. Here’s what I’ve come up with: Metabase is a database framework and web API to store and search opinions from anyone about anything. In the terminology of Metabase, Users store Facts about Resources. Facts are really opinions or claims; there is no inherent truth in them. Resources are logical identifiers, generally in the form of a URI, since the things we’re most interested in describing are online.
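To make the Users/Facts/Resources vocabulary concrete, here is a minimal sketch of the idea. It is written in Python purely for illustration and is not the Metabase API; the names `Fact`, `OpinionStore`, `submit` and `about`, and the users "alice" and "bob", are all hypothetical. The point it tries to show is that conflicting claims about the same Resource simply coexist.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One user's claim about one resource (hypothetical shape)."""
    user: str      # who is making the claim
    resource: str  # URI-like identifier of the thing being described
    content: str   # the opinion itself; never checked for truth


class OpinionStore:
    """Keeps every submitted fact, grouped by resource."""

    def __init__(self):
        self._by_resource = defaultdict(list)

    def submit(self, fact):
        # Conflicting claims are all kept; nothing is overwritten or reconciled.
        self._by_resource[fact.resource].append(fact)

    def about(self, resource):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._by_resource.get(resource, []))


store = OpinionStore()
uri = "cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz"
store.submit(Fact("alice", uri, "PASS"))
store.submit(Fact("bob", uri, "FAIL"))
opinions = [fact.content for fact in store.about(uri)]
```

Here `opinions` holds both "PASS" and "FAIL": the store records what each user claimed and leaves interpretation to whoever reads the data.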

Since CPAN Testers is our first use case, I’ll use that to illustrate the concept. Currently, CPAN Testers email test reports about distributions on CPAN. In the Metabase world, each CPAN tester is a User. The Resource is a CPAN distribution file – a ‘distfile’ URI that describes a distribution tarball on CPAN such as cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz. The Fact is the test report. Today that’s just the text of the email message, but in the future it will be structured data. (Technically, a CPAN Testers report will be a collection of more granular facts like prerequisites, test output, environment and so on.)
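As a small illustration of what a distfile Resource carries, the sketch below splits such a URI into its author and filename. It assumes the layout is always cpan://distfile/AUTHOR/FILENAME, which is inferred from the single example above rather than from any specification, and `parse_distfile` is a hypothetical helper written in Python, not part of Metabase.

```python
PREFIX = "cpan://distfile/"


def parse_distfile(uri):
    """Split a distfile URI into (author, filename).

    Assumes the form cpan://distfile/AUTHOR/FILENAME.
    """
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a distfile URI: {uri!r}")
    # Everything after the prefix should be AUTHOR/FILENAME.
    author, sep, filename = uri[len(PREFIX):].partition("/")
    if not (author and sep and filename):
        raise ValueError(f"expected AUTHOR/FILENAME in {uri!r}")
    return author, filename


author, filename = parse_distfile(
    "cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz"
)
```

For the example URI this yields the author "DAGOLDEN" and the filename "Capture-Tiny-0.06.tar.gz". Splitting the filename further into a distribution name and version is deliberately left out, since that would need rules the post does not describe.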

It’s easy to see how this can be generalized to an entire ecosystem of facts and opinions about CPAN distributions, such as CPANTS, ratings, annotations, tags, etc. Today, every CPAN website has its own database and its own API, so mashups and remixes require building an interface to each one. Metabase offers the possibility of a unified way for all these tools to store, share and cross-reference information about CPAN.

Design considerations

Metabase has some similarities to other, general ontology frameworks, but we decided to focus our efforts around some pragmatic tradeoffs using CPAN Testers to guide the design. We think these decisions generalize well to other uses, but we’re trying to follow a YAGNI strategy and only implement what CPAN Testers will need before worrying about more general cases.

(1) High fact volume

There are currently about 18,000 “latest” distributions on CPAN, and the entire history of CPAN is a bit over 100,000 distributions. By contrast, there are over 4 million CPAN Testers reports already and the total is growing by about 250,000 per month. Reports are usually consumed as a batch process to create summary statistics on the CPAN Testers website and relatively few reports are ever reviewed in detail. Therefore, the Metabase design to date prioritizes Fact submission over search and retrieval.

(2) Minimalist Perl for client libraries

Many CPAN Testers want to run smoke tests using a Perl installation that has as few extra modules installed as possible. Therefore, any Metabase components used by clients such as Metabase::Fact or Metabase::Client::Simple need minimal non-core dependencies. On the other hand, anything that will run on the server side can take full advantage of Modern Perl tools like Moose, DBIC and Catalyst.

(3) No barriers to contribution

Today, new CPAN Testers don’t have to register or get authorized to start contributing; they just start sending emails to a mailing list that collects reports. For Metabase, users generate a user profile fact and submit it as a credential with their facts. Metabase manages user identities internally and adds new users automatically, so new contributors can join without any external pre-authorization hurdles that might discourage participation. (Verification or authorization is optional.)
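The submit-your-profile-with-your-facts flow can be sketched roughly as follows. This is an illustration in Python under my own assumptions, not how Metabase is implemented: the class `ToyMetabase`, the profile layout (a dict with an "id" key) and the report strings are all hypothetical.

```python
class ToyMetabase:
    """Accepts facts from anyone; unknown users are registered on first contact."""

    def __init__(self):
        self.users = {}   # user id -> profile, filled in automatically
        self.facts = []   # (user id, fact) pairs in submission order

    def submit(self, profile, fact):
        """Store a fact; return True if this submission registered a new user."""
        user_id = profile["id"]
        newly_registered = user_id not in self.users
        if newly_registered:
            # No sign-up step: the profile sent along with the fact is enough.
            self.users[user_id] = profile
        self.facts.append((user_id, fact))
        return newly_registered


metabase = ToyMetabase()
profile = {"id": "tester-1", "name": "A. Tester"}
first = metabase.submit(profile, "PASS Capture-Tiny-0.06")
second = metabase.submit(profile, "FAIL Some-Dist-1.00")
```

The first submission registers the user as a side effect and the second one reuses that identity, so a new contributor never has to do anything except start sending facts.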

(4) Open to extension and evolution

We want to make CPAN Testers tools available for corporations or perl5-porters or anyone with customized testing needs. We want people to come up with new facts and new resources. That means defining interfaces well and leaving implementation details open to change. Metabase allows lots of choices about things like underlying data storage technologies or permissible fact types.

Metabase specifies data storage capabilities, but the actual database storage is pluggable, from flat files to relational databases to cloud services. Metabase defines how Fact classes can provide validation and custom index data on top of a very simple data model. Moreover, Metabase is a fairly dumb repository; any intelligent analysis must be done by third parties extracting and transforming data.
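One way to picture the split between Fact classes and pluggable storage is the sketch below. It is Python for illustration only and makes no claim about Metabase's real interfaces: `Fact`, `GradeFact`, `Backend`, `MemoryBackend`, `JsonLinesBackend` and `submit` are hypothetical names, and the set of grades is an assumption of mine.

```python
import io
import json
from abc import ABC, abstractmethod


class Fact:
    """Base fact: subclasses may add validation and custom index data."""

    def __init__(self, resource, content):
        self.resource = resource
        self.content = content

    def validate(self):
        pass  # accept anything by default

    def index(self):
        return {}  # no extra searchable fields by default


class GradeFact(Fact):
    GRADES = {"pass", "fail", "na", "unknown"}  # assumed grade values

    def validate(self):
        if self.content not in self.GRADES:
            raise ValueError(f"unknown grade: {self.content!r}")

    def index(self):
        return {"grade": self.content}


class Backend(ABC):
    """The only capability the repository needs from storage in this sketch."""

    @abstractmethod
    def save(self, record):
        ...


class MemoryBackend(Backend):
    def __init__(self):
        self.records = []

    def save(self, record):
        self.records.append(record)


class JsonLinesBackend(Backend):
    """Flat-file style storage: one JSON document per line."""

    def __init__(self, handle):
        self.handle = handle

    def save(self, record):
        self.handle.write(json.dumps(record) + "\n")


def submit(fact, backend):
    # The repository stays dumb: validate, attach index data, store.
    fact.validate()
    backend.save({
        "type": type(fact).__name__,
        "resource": fact.resource,
        "content": fact.content,
        "index": fact.index(),
    })


uri = "cpan://distfile/DAGOLDEN/Capture-Tiny-0.06.tar.gz"
memory = MemoryBackend()
flat_file = io.StringIO()  # stands in for a real file on disk
submit(GradeFact(uri, "pass"), memory)
submit(GradeFact(uri, "pass"), JsonLinesBackend(flat_file))
```

The same `submit` call works against either backend, and everything specific to a kind of fact lives in the fact class. Swapping in a relational database or a cloud service would mean writing another `Backend` subclass and nothing else.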

Where do we go from here?

I’ve successfully submitted test reports to a local Metabase, but for Metabase to become what we need for CPAN Testers (and more), there are still quite a number of things to do:

  • Improve documentation
  • Test and refine Metabase components
  • Design and implement search capabilities
  • Establish one or more ‘development server’ Metabases for testing robustness and throughput
  • Migrate the 4 million CPAN Testers reports from NNTP to Metabase
  • Migrate CPAN Testers downstream analytics to use Metabase as a source
  • Have some CPAN Testers pilot submitting reports to Metabase
  • Develop and release new CPAN Testers clients that can submit structured Metabase reports

If you are interested in following or participating in Metabase, there is a (still low-volume) mailing list you can subscribe to and several of the key developers can be found on #toolchain on irc.perl.org.


2 Comments

  1. Posted June 25, 2009 at 10:30 pm

    The Metabase concept looks a lot like the base of freebase.com. They allow you to add facts to the database, but in a structured mode, with classes.

    See for example the edit mode of the Larry Wall page:

    http://www.freebase.com/edit/topic/en/larry_wall

    The cool part of the freebase model is the query capabilities. Look at the parallax query application:

    http://blog.freebase.com/2008/08/12/introducing-freebase-parallax/

    Best regards,

    • david
      Posted June 26, 2009 at 1:23 am

      As I said, there are some similarities to other knowledge systems. One big difference I can see is that freebase looks like it is trying to be authoritative, whereas Metabase is quite happy to have hundreds of different opinions about any particular fact and let some other application sort out what it means.

© 2009-2014 David Golden All Rights Reserved