
DebConf7

Speakers
Jeroen van Wolffelaar
Schedule
Day 14
Room Upper Talk Room
Start time 11:00
Duration 01:00
Info
ID 104
Event type Lecture
Track DebConf
Language English

Mole: Infrastructure for managing information

Mole is a piece of data-gathering infrastructure under development, part of the Debian QA tools. This proposed talk will explain what mole is about and what you can do with it. Any developer can define datasets, and via a clear set of interfaces any sort of data can be gathered and redistributed: results of running package checkers such as lintian, results of rebuild attempts under various conditions, but also 'all the .desktop files' for use in auto-installers.
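To give a feeling for what "defining a dataset via a clear set of interfaces" could mean, here is a minimal, purely hypothetical sketch in Python. None of the names below (Dataset, submit, query, dump) come from mole itself; they are invented for illustration only.

    # Hypothetical sketch of a mole-style dataset; all names here are
    # invented for illustration and do not reflect mole's actual interfaces.
    from dataclasses import dataclass, field

    @dataclass
    class Dataset:
        name: str                        # e.g. "lintian-results" or "desktop-files"
        moderated: bool = False          # whether submissions need review first
        items: dict = field(default_factory=dict)

        def submit(self, package, version, payload):
            """Accept a submitted data item for (package, version)."""
            self.items[(package, version)] = payload

        def query(self, package, version):
            """Micro-query: fetch a single item."""
            return self.items.get((package, version))

        def dump(self):
            """Whole-dataset retrieval for bulk consumers."""
            return dict(self.items)

    # Example: collecting .desktop files for an auto-installer.
    desktop = Dataset("desktop-files")
    desktop.submit("somepackage", "1.0-1", "[Desktop Entry]\nName=Example\n")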

I intend to give a talk about mole, outlining the possibilities, showing some examples and current usage, and showing how people can add their own datasets to it.

With it, I intend to spark interest among people who want to implement various kinds of checks, statistics, or any other useful or interesting analysis on the whole archive, or on, for example, the bug database, so that they can consider using the existing mole infrastructure for it.

The major properties of mole are:

  • A means to accept submitted information, with optional access control and moderation
  • A means to store both transient and constant information ("data")
  • A means to retrieve this information quickly and efficiently, both via micro-queries and as whole datasets
  • A means to coordinate the generation of information
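The last property, coordinating the generation of information, can be sketched as a simple claim-based todo list. This is an illustrative assumption only; the helper names and the two-week age policy below are not mole's actual behaviour.

    # Minimal sketch of coordinating information generation via a todo list.
    # All names and the age policy below are assumptions for illustration.
    import time

    MAX_AGE = 14 * 24 * 3600   # assumed policy: logs older than two weeks are stale

    def todo(sources, log_timestamps, claimed, now=None):
        """Yield source packages that still need a (re)build log.

        sources:        iterable of source package names
        log_timestamps: dict mapping source name -> timestamp of newest stored log
        claimed:        set of sources already handed out to a worker
        """
        now = now if now is not None else time.time()
        for src in sources:
            if src in claimed:
                continue              # another worker already took this one
            ts = log_timestamps.get(src)
            if ts is None or now - ts > MAX_AGE:
                yield src

    def claim(src, claimed):
        """A worker marks a package as taken, so double work is prevented."""
        claimed.add(src)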
The list above describes mole in very general terms. To make things a bit more concrete, here is one example:
  • Mole would accept (by mail, HTTP POST, or otherwise) build logs from building source packages in current unstable/testing, in a standard or a specifically tweaked environment
  • It would store them in its database in an efficient form, also keeping a configurable number of older versions
  • It would maintain a list of source packages that don't have such a log yet, or whose log is older than a configurable amount of time, and provide worker machines that can do such rebuilds with a todo list. Mole would automatically keep track of this so that duplicate work is prevented.
  • Via, for example, a web interface, people could query the results for specific packages, but one could also retrieve the full database.

This approach is very powerful, because mole "jobs" can be stacked. To continue the above example, a different job could keep track of logs that haven't been 'judged' yet, have a worker judge them as "successful/not successful", and store those qualifications in a result table. Additionally, data providers do not need to be computer programs; they can also be humans. For example, a job could be "file bugs on failed logs" or "mark a failed log as wrongly failed". Data can also be about non-package things, such as bugs or mirrors, and it can be about something other than quality tests: examples would be mere extraction of data such as .desktop files for auto-installers, SLOC counts for fun statistics, or (user-supplied) screenshots and reviews.

The current implementation is functional, but still a little rough. It mostly lacks documentation, needs some scalability tweaks, and, perhaps most importantly, doesn't yet have an easy query interface such as a web interface.

The main advantage is that you have much less code to duplicate for doing new checks and the like on packages; people can easily add them and concentrate on the important bits instead of on the needed infrastructure. At the same time, this means that a great wealth of information is available in one place, with the same interface, so that using this information is also much easier, especially when you want to combine various sources.
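As a rough sketch of such job stacking, the snippet below has a second job consume the build-log dataset produced by the first and record a verdict per log. The success check is a naive placeholder assumed for illustration, not mole's real judging logic.

    # Hypothetical sketch of a stacked job: it reads the build-log dataset
    # produced by an earlier job and records a verdict for each new log.
    def judge_pending(build_logs, verdicts):
        """Judge every stored build log that has no verdict yet.

        build_logs: dict (source, version) -> log text
        verdicts:   dict (source, version) -> "successful" / "not-successful"
        """
        for key, log in build_logs.items():
            if key in verdicts:
                continue                                   # already judged
            ok = "dpkg-buildpackage: error" not in log     # naive assumed check
            verdicts[key] = "successful" if ok else "not-successful"

    # A human data provider could later overwrite a verdict, for example to
    # mark a log as wrongly failed, just like any other submission.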