Sofia Hackathon Tasks

The hackathon is organised around collaborative tasks. Here you can find all documentation on various tasks that were proposed or adopted.

Each task has a header listing the people related to it:

Authors: those who proposed the task
Supporters: those who think the task is important. Authors are implicitly supporters.
Adopters: those who are willing to work on it (leave empty for now; to be filled in at the hackathon)

After the header there is a brief description and, if the task was adopted, its results.


A total of 8 collaborative tasks were adopted during the hackathon. Their descriptions and results can be found here.

6. Aligner plug-in for Seal Seqal. Create a plug-in from BWA 0.7.x

  • Authors: Luca
  • Supporters:
  • Adopters: Luca, Riccardo

Seqal currently works with a refactored version of the BWA code (libbwa, found in the Seqal code). Currently, this library contains BWA 0.5.10. For Seqal to stay relevant and useful, the aligner needs to be updated to the more recent algorithms. More generally, I think we can abstract the aligner within the Seqal code so that it becomes relatively easy to wrap new aligners and plug them in.

Some work was done at the last hackathon to port BWA 0.7 to Seqal. This work followed the same approach that was adopted by the Seal developers when integrating BWA 0.5. I've become convinced that this approach, which exposes a large number of structures and functions to the client (Python) side, is flawed and difficult to generalize. Instead, I propose this task to:

  • define a high-level C API for an aligner plug-in to support the required operations:
    • init
    • load/unload reference
    • map a set of paired-end reads
    • map a set of single-end reads
    • shutdown
    • other??
  • implement this API with the functionality in BWA 0.7
  • implement a Python wrapper for this generic API which will be used by Seqal
  • profit :-)
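The operations listed above could be mirrored on the Python side as an abstract interface. The following is only a minimal sketch: `AlignerPlugin`, `DummyAligner` and all method names are hypothetical and are not part of Seqal or BWA. The dummy implementation reports every read as unmapped and only shows how a concrete aligner would plug in.

```python
from abc import ABC, abstractmethod


class AlignerPlugin(ABC):
    """Hypothetical plug-in interface; names are illustrative, not Seqal's API."""

    @abstractmethod
    def init(self, options):
        """Set up the aligner with a dict of options."""

    @abstractmethod
    def load_reference(self, path):
        """Make the reference at `path` available for mapping."""

    @abstractmethod
    def unload_reference(self):
        """Release the currently loaded reference."""

    @abstractmethod
    def map_single(self, reads):
        """Map a list of (name, sequence) single-end reads."""

    @abstractmethod
    def map_paired(self, pairs):
        """Map a list of ((name, seq), (name, seq)) read pairs."""

    @abstractmethod
    def shutdown(self):
        """Free all resources held by the aligner."""


class DummyAligner(AlignerPlugin):
    """Stand-in aligner that reports every read as unmapped (None)."""

    def __init__(self):
        self.options = {}
        self.reference = None

    def init(self, options):
        self.options = dict(options)

    def load_reference(self, path):
        # A real plug-in would load the index here; the dummy only stores the path.
        self.reference = path

    def unload_reference(self):
        self.reference = None

    def _require_reference(self):
        if self.reference is None:
            raise RuntimeError("no reference loaded")

    def map_single(self, reads):
        self._require_reference()
        return [(name, None) for name, _seq in reads]

    def map_paired(self, pairs):
        self._require_reference()
        return [((n1, None), (n2, None)) for (n1, _s1), (n2, _s2) in pairs]

    def shutdown(self):
        self.unload_reference()
```

With an interface of this shape, Seqal would depend only on `AlignerPlugin`, and a BWA 0.7 wrapper would be one implementation among several.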

Result: see the task page

1. Finish up features, benchmarks and fixes for FACS

The most important outstanding tasks are adding paired-end support, fixing memory leaks detected by Coverity Scan and finishing some automated performance plots.

  • Authors: Guillermo, Roman
  • Supporters: Guillermo, Roman
  • Adopters: Guillermo, Roman

Result: An automated, complete test suite for testing and benchmarking FACS and fastq_screen in a reproducible way was developed. The memory leaks reported by Coverity Scan were fixed.

12. Cloudgene: sftp import

  • Authors: Lukas
  • Supporters:
  • Adopters:

Cloudgene supports several protocols (ftp, http, s3, file uploads) which can be used to import data into the HDFS workspace. Jonas Hagberg from UPPMAX extended Cloudgene with an sftp import module and shared his code on GitHub.

Result: During the hackathon, we tested his code, refactored it and finally integrated it into the official version of Cloudgene. The new version including this feature will be online on GitHub in a few days.

16. Explore Spark

  • Authors: Aleksi
  • Supporters:
  • Adopters:

Recently, Apache Spark has gained a lot of momentum in the enterprise world. It offers richer programming models than plain Hadoop MapReduce. We could investigate Spark to see what potential it has for bioinformatics.

Result: See Spark notes.

20. Influence of gene annotations change

  • Authors: David, Pawel
  • Supporters: Eija, Ognyan
  • Adopters:

Gene annotations change frequently. For well-annotated organisms like human, these changes are often subtle; for non-model organisms, however, updates can be substantial.

We would like to explore two questions:

a) How big is the effect of these changes on the calling of differential expression for gene transcripts, with an emphasis on expression profiling that discriminates multiple alternative transcripts of genes?

b) It has been suggested by the authors of the Tuxedo suite of tools that it might be possible to run tools in a special “update” mode. We thus wonder how gene annotation updates affect (i) initial read alignments when alignment tools take gene models into account, and (ii) downstream results (gene expression levels, differential expression calls, …)

Result: The influence of gene annotation changes on expression level estimates was studied using the ReXpress tool. Changes in the reference transcriptome annotation did affect the expression level estimates of a large proportion of transcripts. In addition, the ReXpress tool for incremental RNA-Seq remapping and quantification was examined. Unfortunately, the test, which was run on two consecutive versions of Maize (where ~20% of the transcriptome had changed), did not finish during the hackathon. The conclusion was that this tool is designed only for small changes in reference genomes and transcriptomes (like daily changes in RefSeq), which is why bigger annotation changes lead to exponential running time.

22. Collect user experiences and thoughts on workflow systems

  • Authors: Ola
  • Supporters:
  • Adopters: Guillermo, Roman, Mario, David, Pawel, Aleksi, Eija, Luka, Sebastian, Lukas, Milko, Ognyan, Riccardo

There are multiple workflow systems available, some of which run on Hadoop. Examples: Luigi, Apache Oozie, Azkaban, etc. Also, people use scripting and Make a lot.

Result: experiences and ideas were collected in the Workflow Experience Google Doc.

23. An alternative to SeqPig with a massively parallel version of the HTSeq package

  • Authors: Alexey
  • Supporters:
  • Adopters:

The SeqPig library provides a convenient way to operate on NGS data (QSEQ, FASTQ, SAM, BAM) using SQL-like syntax in a massively parallel manner. The SeqPig approach, however, was designed specifically to run in the Hadoop framework. The Hadoop streaming library offers a possibility to re-use already existing HPC software with minor modifications or by writing a wrapper.

The idea of the Hadoop streaming library is to use STDIN and STDOUT for communication with Hadoop. Therefore, developing a massively parallel version of the HTSeq package, or at least of some parts of it, would provide an independent Hadoop-based alternative to the existing SeqPig. It would also make it possible to offer a smooth user experience, mainly for biologists who are not familiar with the CLI.
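The STDIN/STDOUT contract can be illustrated with a small mapper. This is only a sketch under assumptions: it does not use HTSeq, the function names are hypothetical, and it simply emits one tab-separated `reference<TAB>1` line for every SAM record whose reference name is not `*`, so that a reducer could sum the counts per reference.

```python
import sys


def map_sam_lines(lines):
    """Yield (reference_name, 1) for each SAM record with a reference name."""
    for line in lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a well-formed alignment record
        rname = fields[2]  # third SAM column: reference sequence name
        if rname != "*":   # '*' means no reference is assigned
            yield rname, 1


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read SAM text from stdin and write key<TAB>value lines to stdout."""
    for key, value in map_sam_lines(stdin):
        stdout.write(f"{key}\t{value}\n")

# A real streaming mapper script would end by calling main().
```

Because the mapper only touches STDIN and STDOUT, the same function can be tested locally on a list of lines before it is submitted to a cluster.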

We would like to use Cloudgene as a graphical backend. Cloudgene is well-documented and well-maintained software that provides a clear formalism for bioinformaticians to build analysis pipelines in a YAML-like format.

Result: initial work was started, but due to time limits it could not be finished as published documentation or code. The team decided to continue communicating on the task after the hackathon.

7. FastQC-like report generation for SeqPig statistics

  • Authors: Luca
  • Supporters:
  • Adopters:

SeqPig has functionality to collect statistics from sequencing data in Fastq and Qseq (and, in theory, SAM and BAM as well) format. The statistics collected are almost sufficient to produce a quality report like the one generated by FastQC (except for the k-mer analysis). However, unlike FastQC, these statistics are collected using the Hadoop cluster and thus the approach is more scalable.

SeqPig generates tables with base content per cycle, per read, base quality histograms. To be used as a QC tool, these should be presented as a report with pretty graphs and tables; see an example FastQC report.

This script could be a starting point:

Such report generation could be contributed back to the SeqPig project.

Result: FastQC-like functionality designed to work with the SeqPig framework was developed but not yet submitted back to the project. Work is ongoing.


These tasks were proposed, but not adopted during the hackathon.

2. Merge CloudBioLinux and NeuroDebian, bridging bioinformatics and neuroinformatics

The idea is to write a CloudBioLinux flavour for neuroinformatics, and also to study whether a neuro pipeline tutorial could be implemented and documented.

  • Authors: Roman
  • Supporters: Roman
  • Adopters: Roman

3. Running GATK Queue with IntelliJ IDEA

“GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end.” There are instructions for using the Scala editor in IntelliJ IDEA to create GATK Queue workflows. The task is to gather experience with using GATK Queue with IntelliJ IDEA and compare it with plain GATK and other variant calling software.

  • Authors: Ognyan
  • Supporters: Ognyan
  • Adopters: Ognyan

4. Hadoop-Galaxy integration

  • Authors: Luca
  • Supporters:
  • Adopters:

Objective: transparently call Hadoop programs as components of Galaxy workflows.

At CRS4 we've implemented a light integration between Hadoop and Galaxy.

We can launch Hadoop jobs (Seal jobs) from Galaxy and tie them into a workflow. To do this, we've added a new data type to Galaxy, the /pathset/. In essence, this is a file that contains URIs to the actual data. Therefore, pathsets create a level of indirection, allowing the actual data to be split into multiple files (unlike the single file that Galaxy normally expects) and to be on any file system (e.g., HDFS).

The integration is still a little young. To be generally useful:

  • it needs to be generalized;
  • it needs a “garbage collection” component.

Currently, when Galaxy deletes a dataset, it deletes the pathset file but not the data to which it points. Thus one is left to clean up the underlying data manually.
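A first step towards such a “garbage collection” component could be to read a pathset before it is deleted and work out what it points to. The sketch below assumes a hypothetical pathset layout of one URI per line (the actual format is not described here), and the function names are illustrative. It only plans the clean-up by grouping URIs by scheme; performing the deletion on each file system is left out.

```python
from urllib.parse import urlparse


def read_pathset(text):
    """Parse an ASSUMED one-URI-per-line pathset, skipping blanks and '#' comments."""
    uris = []
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            uris.append(entry)
    return uris


def plan_cleanup(pathset_text):
    """Group the URIs of a pathset by scheme so each group can go to the right deleter."""
    plan = {}
    for uri in read_pathset(pathset_text):
        # Paths without an explicit scheme are treated as local files.
        scheme = urlparse(uri).scheme or "file"
        plan.setdefault(scheme, []).append(uri)
    return plan
```

For example, a pathset mixing HDFS and local entries would yield one list under `"hdfs"` and one under `"file"`, which separate clean-up routines could then process.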

5. Seal integration in Travis-CI

  • Authors: Guillermo Carrasco
  • Supporters: Luca Pireddu
  • Adopters:

Objective: configure continuous integration on Travis-CI for Seal.

Travis-CI is a public and free continuous integration service that supports multiple languages and offers resources to execute software tests. It integrates with GitHub, so each code change is tested automatically. Pull requests are tested as well, which lets the author know whether a pull request breaks something before it is merged.

Guillermo worked on this at the last hackathon and almost got to the end, but there are some issues left to resolve.

8. Seal: mark duplicates without aligning

  • Authors: Luca
  • Supporters:
  • Adopters:

A good feature to add to Seal would be the ability to mark duplicates without aligning.

9. Better variant-storing data structure for Seal RecabTable

  • Authors: Luca
  • Supporters:
  • Adopters:

RecabTable is a Hadoop-based tool in Seal to collect empirical base quality statistics for base quality recalibration (analogous to GATK CountCovariatesWalker). The tool needs to keep track of known variant locations to avoid considering those locations as sequencing errors and overestimating error rates.

Currently, each parallel task in RecabTable loads the variant positions into a big int array. While simple, this solution has several drawbacks:

  • each task has a large overhead as it has to spend a significant amount of time loading the data;

  • each task uses a significant amount of RAM for this data structure.

A better solution is needed, and it should:

  • share memory between tasks
  • provide read-only, fast look-ups
  • be compact


An idea would be to create a compact, binary, indexed format for the variant locations (an array of ints). It could be added to the distributed cache. The tasks would access this data through a shared memory map (this approach has worked well for sharing reference data in Seqal). The binary structure would be created on-demand by the “launcher” part of recab, before the Hadoop job is launched.
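The idea can be sketched in a few lines. This is an illustration under assumptions, not the RecabTable implementation: the file layout (sorted little-endian 32-bit positions for a single contig, with no header or per-contig index) and the names `write_positions` and `VariantIndex` are hypothetical. Look-ups use a binary search directly on a read-only memory map, so tasks on the same host can share the pages through the operating system's page cache.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<I")  # one little-endian unsigned 32-bit position


def write_positions(path, positions):
    """Write sorted, de-duplicated positions as fixed-width binary records."""
    with open(path, "wb") as out:
        for pos in sorted(set(positions)):
            out.write(RECORD.pack(pos))


class VariantIndex:
    """Read-only membership test over a memory-mapped, sorted position file."""

    def __init__(self, path):
        # Note: mmap cannot map an empty file, so the file must hold >= 1 record.
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count = len(self._map) // RECORD.size

    def __contains__(self, pos):
        # Binary search over the fixed-width records.
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            value = RECORD.unpack_from(self._map, mid * RECORD.size)[0]
            if value < pos:
                lo = mid + 1
            elif value > pos:
                hi = mid
            else:
                return True
        return False

    def close(self):
        self._map.close()
        self._file.close()


# Usage sketch: the launcher builds the file once, each task then queries it.
demo_path = os.path.join(tempfile.mkdtemp(), "variants.bin")
write_positions(demo_path, [300, 100, 200, 100])
index = VariantIndex(demo_path)
print(200 in index, 150 in index)  # True False
index.close()
```

Fixed-width records keep the file compact and make the offset of record `i` simply `i * 4` bytes, so no separate index is needed for a single contig.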

Open to alternatives…

10. Base quality recalibration tool for Hadoop

  • Authors: Luca
  • Supporters:
  • Adopters:

Related to the task on RecabTable, a distributed tool to actually take that data and recalibrate the base qualities of a data set would be very useful. It's been on the TODO list for Seal for a long time but has never gotten attention.

The recalibration formula could be identical to the one used by GATK.

11. Explore using ADAM with Seal

  • Authors: Luca
  • Supporters:
  • Adopters:

Parquet files are well suited for distributed computation and are directly accessible by Hadoop MapReduce. It therefore makes sense to implement in the Seal tools the ability to read and write Parquet files.

The goal of this task is to implement or at least prototype an ADAMInputFileFormat and ADAMOutputFileFormat for Seal using ADAM.

ps: strategically, I think this could be more useful than integrating BAM I/O in Seal, especially if the conversion ADAM ↔ BAM works well enough.

13. Cloudgene: Alignment

  • Authors: Seb
  • Supporters:
  • Adopters:

For our heteroplasmy pipeline, we needed an integration of BWA MEM. This has been achieved through the provided JNI bindings. The goal of this task is to update the alignment step in the workflow to 0.7.7.

14. Cloudgene: possible extensions of the manifest file

  • Authors: Lukas, Seb
  • Supporters:
  • Adopters:

We integrated some basic directives (if conditions, loops, ...) into Cloudgene in order to enable the dynamic creation of workflows. The aim of this task is to create a tutorial that demonstrates the new features on the basis of examples. Additionally, the documentation should be updated.

15. Cloudgene: BAQ Integration

  • Authors: Lukas, Seb
  • Supporters:
  • Adopters:

BAQ needs to be executed on the aligned reads. GATK's code differs from samtools 0.1.19 especially for circular DNA. Evaluate and integrate BAQ into our mtDNA heteroplasmy pipeline.

17. Hadoop on top of other storage systems

  • Authors: Aleksi
  • Supporters:
  • Adopters:

Various participants are using or looking into using Hadoop on top of storage systems other than HDFS. We could review the options, maybe even run benchmarks, and write a summary result page. Candidates for storage systems include Lustre, NFS, Ceph and object storage systems (e.g. Swift).

18. Implement Bisulphite Sequencing support in BWA or in Seal

  • Authors: Riccardo
  • Supporters:
  • Adopters:

Bisulphite treatment makes it possible to detect methylation in DNA sequences. Unmethylated Cs are converted to Ts, while methylated ones are not converted. The Gs on the opposite strand stay unchanged. It would be nice to implement a methylation evaluation directly in BWA or in Seal.
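The conversion rule can be made concrete with a small simulation. This is a simplified sketch, not a proposal for the BWA or Seal implementation: the function names are hypothetical, only the forward strand is handled, and the read is assumed to align to the reference without gaps, starting at position 0.

```python
def bisulphite_convert(seq, methylated=frozenset()):
    """Turn every unmethylated C into T; 0-based positions in `methylated` stay C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """Map each reference C position to True (read shows C) or False (read shows T).

    Assumes an ungapped forward-strand alignment starting at reference position 0.
    Positions where the read shows any other base are left uncalled.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True    # protected from conversion: methylated
        elif read_base == "T":
            calls[i] = False   # converted: unmethylated
    return calls
```

For instance, converting `ACGCT` with position 1 methylated gives `ACGTT`, and comparing that read with the reference recovers position 1 as methylated and position 3 as unmethylated.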

19. Spike-In evaluation and extraction in Seal

  • Authors: Riccardo
  • Supporters:
  • Adopters:

Spike-In sequences are widely used to evaluate sequencing quality or to embed special assays that require a few reads (compared to the number of clusters of a lane) into other sequencing runs. It would be useful to implement spike-in counting and extraction in Seal.

21. Define a C++ API for aligners and try to implement it on BWA

  • Authors: Riccardo
  • Supporters:
  • Adopters:

Many of us have proposed wrapping alignment tools into our own frameworks, which is leading to several different project forks. It would be nice to define a common interface for using alignment tools as libraries, so that wrapping becomes a lot easier (this fits with Luca's task).

Last modified: 2014/05/08 06:55 by Eija Korpelainen