The hackathon is organised around collaborative tasks. Here you can find all documentation on various tasks that were proposed or adopted.
For each task there is a header listing the people related to it:
|Authors|those who proposed the task|
|Supporters|those who think the task is important (authors are implicitly supporters)|
|Adopters|those who are willing to work on it (left empty before the hackathon; filled in during it)|
After the header comes a brief description and, if the task was adopted, its results.
A total of 8 collaborative tasks were adopted during the hackathon. Their descriptions and results can be found here.
Seqal currently works with a refactored version of the BWA code (libbwa, found in the Seqal code). Currently, this library contains BWA 0.5.10. For Seqal to stay relevant and useful, the aligner needs to be updated to the more recent algorithms. More generally, I think we can abstract the aligner from the Seqal code such that it becomes relatively easy to wrap new aligners and plug them in.
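A minimal sketch of what such an aligner abstraction could look like on the Python side (all names here are hypothetical and not part of the current Seqal API; the placeholder bodies stand in for real calls into libbwa):

```python
from abc import ABC, abstractmethod

class Aligner(ABC):
    """Hypothetical minimal interface that any wrapped aligner would implement."""

    @abstractmethod
    def load_reference(self, path):
        """Load (or memory-map) the reference index."""

    @abstractmethod
    def align(self, read_pairs):
        """Align an iterable of (read1, read2) tuples; yield SAM record strings."""

class Bwa0710Aligner(Aligner):
    """Example concrete wrapper; a real one would call into libbwa 0.7.x."""

    def load_reference(self, path):
        self.reference = path  # placeholder: a real wrapper would mmap the index

    def align(self, read_pairs):
        for r1, r2 in read_pairs:
            # placeholder output; a real wrapper would return actual alignments
            name, seq = r1
            yield "%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t*" % (name, seq)
```

With an interface like this, plugging in a new aligner means writing one new subclass rather than exposing the aligner's internal structures to the Python side.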
Some work was done at the last hackathon to port BWA 0.7 to Seqal. This work followed the same approach adopted by the Seal developers when integrating BWA 0.5. I've become convinced that this approach, which exposes a large number of structures and functions to the client (Python) side, is flawed and difficult to generalize. Instead, I propose this task to:
Result: see the task page
Result: Automation of a complete test suite for testing and benchmarking FACS and fastq_screen, in a reproducible way, was developed. Memory leaks reported by Coverity Scan were fixed.
Cloudgene supports several protocols (ftp, http, s3, file uploads) which can be used to import data into the HDFS workspace. Jonas Hagberg from UPPMAX extended Cloudgene with an sftp import module and shared his code on github (https://github.com/UPPMAX/cloudgene).
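The scheme-based dispatch such import modules rely on can be illustrated as follows (the handler names are hypothetical, and Cloudgene itself is written in Java; this only shows the routing idea):

```python
from urllib.parse import urlparse

# Hypothetical registry mapping URL schemes to import handlers.
IMPORT_HANDLERS = {
    "ftp": "FtpImporter",
    "http": "HttpImporter",
    "s3": "S3Importer",
    "sftp": "SftpImporter",  # the module contributed by UPPMAX
}

def pick_importer(url):
    """Return the handler registered for the URL's scheme."""
    scheme = urlparse(url).scheme
    try:
        return IMPORT_HANDLERS[scheme]
    except KeyError:
        raise ValueError("unsupported import protocol: %r" % scheme)
```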
Result: During the hackathon, we tested his code, refactored and finally integrated it into the official version of Cloudgene. The new version including this feature will be online on github (https://github.com/genepi/cloudgene) in a few days.
Recently Apache Spark has gained a lot of momentum in the enterprise world. It offers richer programming models than plain Hadoop MapReduce. We could investigate Spark to see what potential it has for bioinformatics.
Result: See Spark notes.
Gene annotations change frequently. For well-annotated organisms like human these changes are often subtle; for non-model organisms, however, updates can be substantial.
We would like to explore two questions:
a) How big is the effect of these changes on the calling of differential expression for gene transcripts, with an emphasis on expression profiling that discriminates multiple alternative transcripts of genes?
b) It has been suggested by the authors of the Tuxedo suite of tools that it might be possible to run the tools in a special “update” mode. We thus wonder how gene annotation updates affect (i) initial read alignments, when alignment tools take gene models into account, and (ii) downstream results (gene expression levels, differential expression calls, …)
Result: The influence of gene annotation changes on expression-level estimates was studied using the ReXpress tool. Changes in the reference transcriptome annotation indeed affected the expression-level estimates of a large proportion of transcripts. In addition, the ReXpress tool for incremental RNA-Seq remapping and quantification was examined. Unfortunately, the test run on two consecutive annotation versions of maize (where ~20% of the transcriptome had changed) did not finish during the hackathon. The conclusion was that this tool is designed only for small changes in reference genomes and transcriptomes (like daily changes in RefSeq), which is why bigger annotation changes lead to exponential running times.
Result: experiences and ideas were collected in the Workflow Experience Google Doc.
The SeqPig library provides a convenient way to operate on NGS data (QSEQ, FASTQ, SAM, BAM) using SQL-like syntax in a massively parallel manner. The SeqPig approach, however, was designed to be run within the Hadoop framework. The Hadoop streaming library offers a way to reuse existing HPC software with minor modifications, or by writing a wrapper.
The idea of the Hadoop streaming library is to use STDIN and STDOUT for communication with Hadoop. Therefore, developing a massively parallel version of at least some parts of the HTSeq package would provide an independent Hadoop-based alternative to the existing SeqPig, and could offer a smooth user experience to biologists who are not familiar with the CLI.
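A Hadoop streaming job is just a mapper and a reducer that read lines from STDIN and write tab-separated key/value pairs to STDOUT. A minimal mapper in that style might look like this (the read-length counting here is a stand-in for a real HTSeq computation, not part of HTSeq itself):

```python
import sys

def map_stream(stdin, stdout):
    """Streaming mapper: read FASTQ records (4 lines each) from stdin and
    emit "read_length<TAB>1" pairs, which a reducer would then sum."""
    lines = [line.rstrip("\n") for line in stdin]
    for i in range(0, len(lines) - 3, 4):
        seq = lines[i + 1]  # line 2 of each 4-line FASTQ record is the sequence
        stdout.write("%d\t1\n" % len(seq))

# In a real job this would run as: map_stream(sys.stdin, sys.stdout)
# invoked via something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
```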
We would like to use Cloudgene as a graphical front end. Cloudgene is properly documented and well-maintained software, providing a clear formalism for bioinformaticians to build analysis pipelines in a YAML-like format.
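A workflow description in that YAML-like format might look roughly like this (step and parameter names are made up for illustration; consult the Cloudgene documentation for the exact schema):

```yaml
name: htseq-streaming-example
description: Hypothetical Hadoop-streaming wrapper around an HTSeq step
mapred:
  steps:
    - name: Count reads per feature
      jar: hadoop-streaming.jar
      params: -mapper mapper.py -reducer reducer.py -input $input -output $output
  inputs:
    - id: input
      description: FASTQ input folder
      type: hdfs-folder
  outputs:
    - id: output
      description: Count tables
      type: hdfs-folder
```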
Result: initial work was started, but due to time limits no documentation or code could be published. The team decided to continue communicating on the task after the hackathon.
SeqPig has functionality to collect statistics from sequencing data in Fastq and Qseq format (and, in theory, SAM and BAM as well). The statistics collected are almost sufficient to produce a quality report like the one generated by FastQC (except for the k-mer analysis). However, unlike FastQC, these statistics are collected using the Hadoop cluster, so the approach is more scalable.
SeqPig generates tables of base content per cycle and per read, and base-quality histograms. To be usable as a QC tool, these should be presented as a report with pretty graphs and tables; see an example FastQC report.
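The kind of table involved is easy to picture: for each cycle (read position), count how often each base occurs. A plain-Python sketch of that aggregation (SeqPig computes the same numbers with Pig scripts on the cluster; this is only to show what the table contains):

```python
from collections import defaultdict

def base_content_per_cycle(reads):
    """Count base occurrences per cycle over an iterable of read sequences.

    Returns {cycle_index: {base: count}} -- the raw numbers behind the
    'base content per cycle' plot in a FastQC-style report.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in reads:
        for cycle, base in enumerate(seq):
            counts[cycle][base] += 1
    return {cycle: dict(bases) for cycle, bases in counts.items()}
```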
This script could be a starting point: seqpig_fastqc.py
Such report generation could be contributed back to the SeqPig project.
Result: FastQC-like functionality designed to work with the SeqPig framework was developed but not yet submitted back to the project. Work is ongoing.
These tasks were proposed, but not adopted during the hackathon.
The idea is to write a CloudBioLinux flavour for neuroinformatics, and also to study whether a neuro pipeline tutorial could be implemented/documented.
“GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end.” There are instructions at http://www.broadinstitute.org/gatk/guide/article?id=1309 for using the Scala editor in IntelliJ IDEA to create GATK Queue workflows. The task is to gather experience using GATK Queue with IntelliJ IDEA and to compare it with plain GATK and other variant-calling software.
Objective: transparently call Hadoop programs as components of Galaxy workflows.
At CRS4 we've implemented a light integration between Hadoop and Galaxy.
We can launch Hadoop jobs (Seal jobs) from Galaxy and tie them into a workflow. To do this, we've added a new data type to Galaxy, the /pathset/. In essence, this is a file that contains URIs to the actual data. Therefore, pathsets create a level of indirection, allowing the actual data to be split into multiple files (unlike the single file that Galaxy normally expects) and to be on any file system (e.g., HDFS).
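A pathset is conceptually simple: a text file listing URIs, one per line. A sketch of reading one (the exact on-disk format used by the CRS4 integration may differ; comment handling here is an assumption):

```python
def read_pathset(path):
    """Read a pathset file: one URI per line; skip blank lines and comments.

    The URIs may point at any file system (hdfs://, file://, ...), which is
    what gives the pathset its level of indirection: the "dataset" Galaxy
    sees is just this list, not the data itself.
    """
    uris = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                uris.append(line)
    return uris
```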
The integration is still a little young, and some issues remain before it is generally useful. In particular, when Galaxy deletes its dataset, it deletes the pathset file but not the data to which it points, so one is left to clean up the datasets manually.
Travis-CI is a public and free continuous-integration service that supports multiple languages and offers resources to execute software tests. It integrates tightly with GitHub, allowing each code change to be tested automatically. It also tests pull requests, letting the author know whether a pull request breaks something before it is even merged.
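Enabling this amounts to adding a `.travis.yml` file to the repository root. For a Python project it might look like the following (adjust the language, versions, and script to the project at hand):

```yaml
language: python
python:
  - "2.7"
install:
  - pip install -r requirements.txt
script:
  - python setup.py test
```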
Guillermo worked on this at the last hackathon and almost got to the end, but there are some issues left to resolve.
A good feature to add to Seal would be the ability to mark duplicates without aligning.
RecabTable is a Hadoop-based tool in Seal to collect empirical base quality statistics for base quality recalibration (analogous to GATK CountCovariatesWalker). The tool needs to keep track of known variant locations to avoid considering those locations as sequencing errors and overestimating error rates.
Currently, each parallel task in RecabTable loads the variant positions into a big int array. While simple, this solution has several drawbacks:
time spent loading the data;
A better solution is needed, and it should:
An idea would be to create a compact, binary, indexed format for the variant locations (an array of ints). It could be added to the distributed cache. The tasks would access this data through a shared memory map (this approach has worked well for sharing reference data in Seqal). The binary structure would be created on-demand by the “launcher” part of recab, before the Hadoop job is launched.
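The core of that idea can be sketched in a few lines, assuming the variant positions for a contig are stored as a sorted array of fixed-width integers that can be memory-mapped and binary-searched in place (names and the 4-byte width are illustrative, not the actual Seal format):

```python
import struct

def pack_positions(positions):
    """Pack variant positions into a compact sorted blob of 4-byte ints.

    In the real tool this blob would be written to a file, shipped via the
    distributed cache, and mmap'ed by each task instead of being expanded
    into a big in-memory int array.
    """
    positions = sorted(positions)
    return struct.pack("<%dI" % len(positions), *positions)

def is_known_variant(blob, pos):
    """Binary-search the packed blob for a position without unpacking it all."""
    lo, hi = 0, len(blob) // 4
    while lo < hi:
        mid = (lo + hi) // 2
        (val,) = struct.unpack_from("<I", blob, mid * 4)
        if val < pos:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(blob) // 4:
        return False
    (val,) = struct.unpack_from("<I", blob, lo * 4)
    return val == pos
```

Because lookups read the blob in place, the same bytes can back a shared memory map across all tasks on a node.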
Open to alternatives…
Related to the task on RecabTable, a distributed tool to actually take that data and recalibrate the base qualities of a data set would be very useful. It's been on the TODO list for Seal for a long time but has never gotten attention.
The recalibration formula could be identical to what's used by GATK.
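The heart of such a recalibration is the empirical quality: for each covariate bin, the observed error rate is converted back to a Phred score. A sketch using the usual pseudocounts so that error-free bins still get a finite quality (GATK's exact smoothing may differ):

```python
import math

def empirical_quality(mismatches, observations):
    """Phred-scaled empirical quality for one covariate bin.

    Adds 1 error / 2 observations as pseudocounts so that bins with zero
    observed mismatches do not produce an infinite quality.
    """
    error_rate = (mismatches + 1.0) / (observations + 2.0)
    return -10.0 * math.log10(error_rate)
```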
Parquet files are well suited for distributed computation and are directly accessible by Hadoop MapReduce. It therefore makes sense to implement in the Seal tools the ability of reading and writing Parquet files.
The goal of this task is to implement or at least prototype an ADAMInputFileFormat and ADAMOutputFileFormat for Seal using ADAM.
ps: strategically, I think this could be more useful than integrating BAM I/O in Seal, especially if the conversion ADAM ↔ BAM works well enough. https://github.com/massie/adam
For our heteroplasmy pipeline we needed an integration of BWA MEM. This has been achieved via the provided JNI bindings (https://github.com/lindenb/jbwa). The goal of this task is to update the alignment step in the workflow to BWA 0.7.7.
We integrated some basic directives (if conditions, loops, …) into Cloudgene in order to enable the dynamic creation of workflows. The aim of this task is to create a tutorial that demonstrates the new features with examples. Additionally, the documentation should be updated.
BAQ (Base Alignment Quality) needs to be computed on the aligned reads. GATK's implementation differs from samtools 0.1.19, especially for circular DNA. The task is to evaluate and integrate BAQ into our mtDNA heteroplasmy pipeline.
Various participants are using, or looking into using, Hadoop on top of storage systems other than HDFS. We could review the options, maybe even run benchmarks, and write a summary result page. Candidates include Lustre, NFS, Ceph, and object storage systems (e.g., Swift).
Bisulphite treatment makes it possible to detect methylation on DNA sequences: unmethylated Cs are converted to Ts, while methylated Cs are not, and the Gs on the opposite strand stay unchanged. It would be nice to implement methylation evaluation directly in BWA or in a Seal tool.
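On the top strand the evaluation boils down to comparing reference Cs with the aligned read: a C that stayed C was methylated (protected from conversion), a C read as T was converted, hence unmethylated. A sketch that ignores the reverse strand and any mismatch other than C/T:

```python
def call_methylation(ref, read):
    """Classify each reference C against a top-strand aligned read.

    Returns a list of (position, 'methylated'|'unmethylated') pairs:
    C->C means methylated, C->T means converted by the bisulphite
    treatment, i.e. unmethylated.
    """
    calls = []
    for pos, (r, q) in enumerate(zip(ref, read)):
        if r == "C":
            if q == "C":
                calls.append((pos, "methylated"))
            elif q == "T":
                calls.append((pos, "unmethylated"))
            # any other base is a sequencing error or SNP; skip it
    return calls
```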
Spike-in sequences are widely used to evaluate sequencing quality, or to embed special assays that require only a few reads (compared to the number of clusters of a lane) into other sequencing runs. It could be useful to implement spike-in counting and extraction in Seal.
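A naive version of such a counter is straightforward; a scalable one would run the same logic as a Hadoop job over the whole lane. A sketch using exact-prefix matching only (real spike-in detection would tolerate mismatches):

```python
def count_spike_ins(reads, spike_in_seqs):
    """Count reads whose sequence starts with a known spike-in sequence.

    Returns ({spike_in: count}, other_reads) so that spike-in reads can be
    both counted and extracted from the run in a single pass.
    """
    counts = {s: 0 for s in spike_in_seqs}
    others = []
    for read in reads:
        for s in spike_in_seqs:
            if read.startswith(s):
                counts[s] += 1
                break
        else:
            others.append(read)
    return counts, others
```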
Many of us have, in one way or another, proposed wrapping alignment tools into our own frameworks, which is leading to several different project forks. It would be nice to define a common interface for using alignment tools as libraries, so that wrapping becomes much easier (this fits with Luca's task).