Keeping track of Life Science Data
This presentation was designed to be given live, so the live parts are not shown here.

July 2012 • Björn Grüning

"Enable accessible, reproducible, and transparent computational research."
reproducible: reload experiments
different tool versions
But what if ...
  • Blast+

  • Gene-Predictions

  • InterproScan

  • ...

Life science data can cause headaches!
  • Incompleteness

  • Error-prone

  • Complexity

  • Constantly changing

Protein Data Bank statistics
RefSeq: number of organisms
Needed: Update strategy!
  • FTP, ftp, File Transfer Protocol ... FTP

  • rsync ...

... is not enough
Error-prone
  • server downtimes

  • errors in flat-files

  • undocumented format changes

  • content errors

our solution :: DVCS
a distributed version control system for life science data
advantages for users of life science data
  • easy updating

    • git pull

  • dataset revisions

    • rollback to a specific version

  • branches for production and development

  • Update / Post-Commit Hooks

    • postprocessing scripts

    • reload tools and services

advantages for producers of life science data
  • easy distribution and mirroring

  • traceability of changes

  • diff, patch, blame

  • easy user contribution

case study Protein Data Bank
  • tracking weekly changes of the PDB

  • added update hooks

  • running our own filter and statistics scripts
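One way such a weekly tracking cycle could look is sketched below. The slides do not describe the actual pipeline, so this is an assumption: mirror the upstream files into a git working tree with rsync, then commit the result as one snapshot. The function name `snapshot_commands`, the source URL, and the repository path are placeholders. The sketch only builds the command lines and executes nothing.

```python
import datetime


def snapshot_commands(source, repo_dir, day):
    """Build the commands for one mirror-and-commit cycle (nothing is run here)."""
    message = f"PDB snapshot {day.isoformat()}"
    return [
        # Mirror upstream into the working tree. --delete removes files that
        # vanished upstream, so .git must be excluded or rsync would delete it.
        ["rsync", "-a", "--delete", "--exclude=.git/", source, repo_dir],
        # Stage additions, modifications and deletions in one step.
        ["git", "-C", repo_dir, "add", "-A"],
        ["git", "-C", repo_dir, "commit", "-m", message],
    ]


# Placeholder source and target; a trailing slash on the rsync source copies
# the directory contents rather than the directory itself.
for argv in snapshot_commands(
    "rsync://example.org/pdb/", "/data/pdb", datetime.date(2011, 9, 7)
):
    print(" ".join(argv))
```

A real job would also have to handle weeks without changes, since `git commit` exits with an error when there is nothing to commit.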

Protein Data Bank monthly changes
Protein Data Bank line changes
  • September 2011

    HETATM:  -5324 +5004
    COMPND:   -122 +140
    ATOM:     -22765 +23044
  • October 2011

    HETATM:  -8541 +8611
    COMPND:   -645 +570
    ATOM:     -30483 +29138
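Per-record counts like the ones above could be derived from a unified diff between two snapshots. The following is a minimal sketch of that idea, not the script used for the talk: the function name `count_record_changes` and the sample diff are made up for illustration. It relies on the PDB flat-file convention that the record name occupies the first six columns of each line.

```python
from collections import Counter


def count_record_changes(diff_text):
    """Count removed and added lines per PDB record name in a unified diff."""
    removed, added = Counter(), Counter()
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers of the diff, not content changes
        if line.startswith("-"):
            # Column 0 is the diff marker; the next six hold the record name.
            removed[line[1:7].strip()] += 1
        elif line.startswith("+"):
            added[line[1:7].strip()] += 1
    return removed, added


# Made-up sample diff for illustration only.
sample = """--- a/1abc.pdb
+++ b/1abc.pdb
@@ -1,3 +1,3 @@
 COMPND    MOL_ID: 1;
-ATOM      1  N   MET A   1      27.340  24.430   2.614
+ATOM      1  N   MET A   1      27.341  24.430   2.614
+HETATM    2  O   HOH A 101      10.000  10.000  10.000
"""

removed, added = count_record_changes(sample)
for record in sorted(set(removed) | set(added)):
    print(f"{record}: -{removed[record]} +{added[record]}")
```

The header check is a simplification: a removed content line that itself begins with `--` would be skipped, which should not occur in PDB files because every line starts with a record name.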
case study BlastDB

Tracking of ...

  • new sequencing data

  • new assembly

  • new annotations

galaxy integration
transparent
downsides
  • rollback and cloning are expensive

  • large repositories?

  • memory consumption?

some last remarks
  • git has problems with large files

    • two GSoC projects this summer

  • Perforce and others

  • ZFS, btrfs

Questions?

Thank You!