|Authors||those who proposed the task|
|Supporters||those who think the task is important. I guess authors are implicitly supporters.|
|Adopters||those who are willing to work on it (leave empty for now. Will fill at the hackathon)|
<move your adopted task here>
Evaluate how hard it would be to integrate the self-proclaimed fastest FastQ reader, by Heng Li (http://github.com/lh3/readfq), into Biopython's general FastQ iterator codebase. Profile both implementations and the end result.
Comparing both repositories' histories via `git log` (readfq and Biopython on GitHub), there don't seem to be any obvious cross-optimizations.
Reading a dummy FastQ file with 1 million reads does not show significant differences:
Reading FastQ file with Heng Li's: 4.11267 seconds
Reading FastQ file with BioPython…: 5.933215 seconds
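For anyone who wants to reproduce this kind of comparison, here is a minimal, stdlib-only sketch of such a benchmark. The four-line parser is written in the spirit of readfq's approach but is not Heng Li's code, and the generated dummy file is much smaller than the 1-million-read test above:

```python
import time

def read_fastq(handle):
    """Minimal four-line FASTQ iterator in the spirit of readfq's approach
    (no FASTA support, no multi-line records). Yields (name, seq, qual)."""
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header.rstrip()[1:], seq, qual

# Generate a small dummy FastQ file so the benchmark is self-contained.
with open("dummy.fastq", "w") as out:
    for i in range(10000):
        out.write("@read%d\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" % i)

start = time.time()
with open("dummy.fastq") as handle:
    n = sum(1 for _ in read_fastq(handle))
elapsed = time.time() - start
print("parsed %d reads in %.3f s" % (n, elapsed))
```

The Biopython side of the comparison would swap `read_fastq` for `Bio.SeqIO.QualityIO.FastqGeneralIterator` over the same file.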
A few quick tweets between @lexnederbragt, @chapmanb and others cleared up the doubts about benchmarking and optimizations:
A relevant more in-depth thread about different benchmarks for FastQ reading can be found here:
Other than marginal speedups, both implementations seem to be on par at the time of writing. readfq's claims (in its README.md) are out of date and can be misleading.
At CRS4 we've implemented a light integration between Hadoop and Galaxy. We can launch Hadoop jobs (Seal jobs) from Galaxy and tie them into a workflow. To do this, we've added a new data type to Galaxy, the /pathset/. In essence, this is a file that contains URIs to the actual data. Therefore, pathsets create a level of indirection, allowing the actual data to be split into multiple files (unlike the single file that Galaxy normally expects) and to be on any file system (e.g., HDFS).
This work hasn't been released to the world since it's still a little young. If someone is interested, we can bring the code and, as part of this task, people can work to generalize it and finish it. An important part that's missing is “garbage collection”—when Galaxy deletes its dataset, it deletes the pathset file but not the data to which it points.
Found some bugs (https://github.com/crs4/pydoop/issues/1). The code needs to be published and get some more work.
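Since the pathset format hasn't been published yet, here is a sketch of what such a level of indirection might look like. The one-URI-per-line layout, the comment syntax and the example paths are all assumptions for illustration, not the actual CRS4 format:

```python
def parse_pathset(path):
    """Parse a hypothetical pathset file: one data URI per line, with blank
    lines and '#' comments ignored. The real CRS4 format may differ."""
    uris = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                uris.append(line)
    return uris

# A made-up pathset pointing at a dataset split across two HDFS files.
with open("example.pathset", "w") as f:
    f.write("# alignment output, part files on HDFS\n")
    f.write("hdfs://namenode:8020/user/galaxy/out/part-r-00000\n")
    f.write("hdfs://namenode:8020/user/galaxy/out/part-r-00001\n")

print(parse_pathset("example.pathset"))
```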
Mapper is a (poorly named) partial clustering algorithm which has been used with great success on high-dimensional data. The algorithm is based on ideas from topology; the essence of it is to find the 'shape' of the data one wishes to detect patterns in.
The method was recently presented in a paper in Scientific Reports; the text explains some of the ideas behind the method and gives examples of applications of the algorithm: http://www.nature.com/srep/2013/130207/srep01236/full/srep01236.html
More detailed specifics about the algorithm can be found here: http://comptop.stanford.edu/u/preprints/mapperPBG.pdf
An issue is that the only freely available implementation is in Matlab (http://comptop.stanford.edu/programs/).
The point of this task was to implement Mapper in Python, using SciPy and NetworkX, so that it could eventually be optimized and parallelized, e.g. using MapReduce strategies.
I made some nice progress on the initial implementation of the algorithm; the code can be seen in this GitHub repository: https://github.com/vals/mapper.py
The results aren't perfect yet, and progress can be seen in the IPython notebook in the same repository. As a first application I am trying to reproduce the partial clustering of random circle data that is shown in the tutorial of the Matlab implementation. There are still some bugs to work out, but the program produces an adjacency matrix for a simplicial complex of overlapping clusters, and in a sensible way: the nodes in the graph follow the segmentation of the filtered data. For some reason it only seems to capture simplices on one side of the circle; something I will need to debug in the future.
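To make the method concrete, the core of Mapper fits in a few dozen lines. This toy sketch (a 1-D filter, an overlapping interval cover, naive single-linkage clustering in each preimage, and an edge wherever two clusters share points) is my simplification of the algorithm described in the references above, not the mapper.py code; all parameter values are arbitrary choices for the demo:

```python
import math

def mapper(points, filt, n_intervals=4, overlap=0.5, eps=0.4):
    """Toy Mapper: cover the filter range with overlapping intervals,
    cluster each preimage by single linkage (distance threshold eps),
    and connect clusters that share points."""
    values = [filt(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    clusters, node = {}, 0
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        remaining = {j for j, v in enumerate(values) if a <= v <= b}
        while remaining:
            comp = {remaining.pop()}
            grew = True
            while grew:  # grow the component until no point is within eps
                grew = False
                for j in list(remaining):
                    if any(math.dist(points[j], points[k]) <= eps for k in comp):
                        comp.add(j)
                        remaining.discard(j)
                        grew = True
            clusters[node] = comp
            node += 1
    edges = {(u, v) for u in clusters for v in clusters
             if u < v and clusters[u] & clusters[v]}
    return clusters, edges

# Toy data: 20 points on a circle, filtered by their x coordinate.
pts = [(math.cos(t * math.pi / 10), math.sin(t * math.pi / 10))
       for t in range(20)]
nodes, links = mapper(pts, filt=lambda p: p[0])
print(len(nodes), "clusters,", len(links), "edges")
```

On this circle the resulting graph is itself a cycle, which is the 'shape' Mapper is meant to recover.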
Travis-CI is a public and free continuous integration service that supports multiple languages and offers resources to execute software tests. It integrates fully with GitHub, allowing each code change to be tested automatically, as well as pull requests, so the author knows whether a pull request breaks something before even merging it.
The aim of the task is to integrate Seal with Travis-CI. Seal has been moved to GitHub.
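As a starting point, Travis-CI is driven by a `.travis.yml` file at the repository root. This is only a sketch: the install and script commands below are placeholders, since Seal's actual build and test targets would need to be filled in:

```yaml
# Sketch of a .travis.yml for Seal; install/script lines are placeholders.
language: python
python:
  - "2.7"
install:
  - pip install pydoop        # assumption: needed for Seal's Python parts
script:
  - python setup.py build     # placeholder for Seal's real build target
  - ./run_tests.sh            # placeholder for Seal's real test entry point
```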
Seqal works with a refactored version of the BWA code (libbwa, found in the Seqal code). Currently, this library contains BWA 0.5.10. For Seqal to stay relevant and useful, the library needs to be updated to the more recent versions of BWA.
Conclusion: the Solr service installed quite nicely with the latest version of Cloudera Manager. However, we can't get it running: initialising SolrCore from the command line fails. It seems it is not a well-packaged product yet, so we will give up for now.
Cloudera Search integrates the Apache Solr search engine with Hadoop. It might have interesting applications in bioinformatics.
It seems this was released during the hackathon: the beta is out.
bcl files are the output generated by Illumina's base calling software. One can convert these into fastq or qseq for further processing. For this conversion, Illumina provides utilities such as bclToQseq, and a script that creates a system of Makefiles to convert all the bcl data for a flowcell into qseq or fastq files (even in parallel, if you use `make -j X`).
It would be good to have a Hadoop-based tool to perform this same task. A possible strategy is explained in the following thread on BioStar: http://www.biostars.org/p/15698/
Actually, I (Luca) already have Pydoop-based code to perform the conversion from bcl to qseq on Hadoop in an internal project. If someone is interested in adopting the task at the hackathon, I'll provide the code that already exists so it can be integrated into Seal and generalized. It needs:
See task page for results.
H2O scales statistics, machine learning and math over Hadoop: http://0xdata.github.io/h2o/
Quickly check this one out and perhaps compare it to Mahout. It is not clear whether it is still in early development.
See task page for results.
PigGene is a platform developed by us to create Apache Pig scripts graphically (currently in beta). Users are guided through the creation process, and scripts are integrated and executed on Cloudgene. Additionally, a new loader and UDFs have been implemented for importing and working with VCF files. Since SeqPig (http://sourceforge.net/projects/seqpig/) supports many different bioinformatics formats, the aim is to integrate it into PigGene and create SeqPig scripts graphically via PigGene as well.
See task page for results.
Write a clearer, more detailed and systematic description in the manual of how parameters from Cloudgene YAML configuration files are passed to Crossbow and SOAPsnp, for example, and from Crossbow to Bowtie. At the moment it looks messy, and it is confusing which parameters correspond to which programs and whether they can be used together.
Write a short tutorial on getting started with Hadoop bioinformatics. It could be something like:
The tutorial could be published in some appropriate venue.
Some good reference materials:
Here are the task proposals, strictly in order of appearance. There's much more work than workers, so we should have no problems keeping busy for two days.
Seal's Seqal and downstream tools currently only read/write SAM. It would be good to add the ability to read and write BAM, where applicable.
Hadoop-BAM already has Hadoop classes for BAM input and output.
The Java programs in Seal are in part already written with a generic interface to input and output classes. It should be relatively easy to integrate BAM support into these.
Seqal, which is written in Python and C, may be more problematic in some ways, but wouldn't require very much coding. I think one could manage to use the BAM output class in Hadoop-BAM by writing an appropriate serialization function on the Python map or reduce side (note that the Pydoop layer communicates with the rest of the Hadoop framework by serializing its data and sending it over an inter-process channel). Exactly which serialization to use remains an exercise to be solved. A couple of options might be:
deserialization on the Java side. Serialize with protobuf (already implemented) or similar
Code to generate SAM headers is in Seal::MergeAlignments. Some strategy (e.g. distributed cache) has to be chosen to pass the header to all the tasks since they will need it to generate proper BAM files.
Hadoop-BAM has been tested on Hadoop 2 with MRv1, but not on Hadoop 2 with MRv2. Test and fix.
Typical MapReduce-based query systems have suffered from a minimum latency of about 10 seconds. Impala is one of the systems that overcomes this. Real-time queries over huge datasets would allow exciting new possibilities, such as using a large Hadoop cluster to provide data to a real-time visualisation engine.
Add to the “private cluster mode” the possibility of using the Hadoop distributed cache to make it easier to install programs. For example, to use Crossbow one needs to install it on all the nodes. The Cloudgene authors did, however, include the possibility of installing programs on only one node in the so-called “public cluster mode”, using an install script. See here: http://cloudgene.uibk.ac.at/docs/integrating_programs_into_cloudgene
At the moment, only the logs of the running program are taken care of. The logs produced by HDFS and MapReduce could be very informative for debugging, for example. I mean logs like these: “/data/1/mapred/local/userlogs/job_201305201249_00”
Often the bottleneck in workflows using Hadoop is taking the Hadoop output in parts and concatenating it into a single monolithic file for downstream processing. Note, however, that this big file is always (in an HPC environment) on a POSIX-compliant parallel shared file system (i.e., it allows random writes).
A tool that could speed-up this procedure would be very useful.
I don't have a very clear strategy here.
A simple option would be to write a simple, multi-threaded program. It would assign one part to each of its active copy threads and run some number of threads simultaneously. It should make things faster, but it would be limited by the fact that it runs on a single node.
A more sophisticated alternative would be to write a distributed program to do this. It would be natural to write it as a Hadoop program. An option would be to have the launcher part analyze the layout of the data to be copied and write a plan to a simple text file, one task per record (e.g., line). An NLineInputFormat could be used to split the plan so that one record goes to each map task (perhaps extending the NLineInputFormat to be clever about sending the map task to the node with the block to be copied, if on HDFS). Each map task would then receive the plan as a key-value pair and would thus know which part to read and to which position of the output file to write it. It would then open connections to the input and output file systems and copy the data.
Such a solution would work fine for headerless data. Something yet more clever would be needed to handle proper SAM and BAM files, which have a header. In that case, the tool would need to understand the input and output file formats.
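The simple multi-threaded option for headerless data might look like this sketch, which preallocates the output and lets each thread write its own part at a fixed offset (relying on the destination allowing random writes, as noted above). The file names here are made up for the demo:

```python
import os
import threading

def concat_parts(part_paths, out_path, bufsize=1 << 20):
    """Concatenate part files into one output file, one thread per part.
    Each thread writes at a fixed offset, relying on the destination file
    system allowing random writes (as a POSIX parallel FS does)."""
    sizes = [os.path.getsize(p) for p in part_paths]
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]
    with open(out_path, "wb") as out:
        out.truncate(sum(sizes))  # preallocate the full output size
    def copy(src, offset):
        with open(src, "rb") as fin, open(out_path, "r+b") as fout:
            fout.seek(offset)
            while True:
                chunk = fin.read(bufsize)
                if not chunk:
                    break
                fout.write(chunk)
    threads = [threading.Thread(target=copy, args=(p, o))
               for p, o in zip(part_paths, offsets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Demo with three tiny fake "part" files.
for i, data in enumerate([b"AAA", b"BBBB", b"CC"]):
    with open("part-%05d" % i, "wb") as f:
        f.write(data)
concat_parts(["part-%05d" % i for i in range(3)], "merged.out")
print(open("merged.out", "rb").read())
```

A production version would cap the number of simultaneous threads and read the parts from HDFS rather than the local disk.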
Was not adopted as-is, but was touched on by another task.
SeqPig has recently added functionality to collect statistics on sequencing data from Fastq and Qseq (and, in theory, SAM and BAM as well). The statistics collected are almost entirely sufficient to produce a quality report like the one generated by FastQC. However, these statistics are collected using the Hadoop cluster, and thus the approach is more scalable.
SeqPig generates tables with base content per cycle and per read, and base quality histograms. To be used as a QC tool, these should be presented as a report with pretty graphs and tables; see an example FastQC report.
This could be done with some simple scripting. A possible solution is to use Python with matplotlib. Such report generation could be contributed back to the SeqPig project.
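The report generation could start from something like this sketch. The table layout (tab-separated `cycle` and `mean_quality` columns) is an assumption about SeqPig's output, not its actual format, and the fake table here just stands in for real SeqPig results; matplotlib is the piece that would produce the FastQC-style graphs:

```python
import csv

def load_per_cycle_quality(path):
    """Parse a per-cycle quality table; the tab-separated 'cycle' and
    'mean_quality' columns are an assumed layout, not SeqPig's actual one."""
    cycles, quals = [], []
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            cycles.append(int(row["cycle"]))
            quals.append(float(row["mean_quality"]))
    return cycles, quals

# Fake table standing in for SeqPig output.
with open("per_cycle.tsv", "w") as f:
    f.write("cycle\tmean_quality\n")
    for c, q in enumerate([34.1, 33.8, 33.5, 32.9], start=1):
        f.write("%d\t%.1f\n" % (c, q))

cycles, quals = load_per_cycle_quality("per_cycle.tsv")

try:
    # The actual report graphs would come from matplotlib, if available.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    plt.plot(cycles, quals)
    plt.xlabel("cycle")
    plt.ylabel("mean base quality")
    plt.savefig("per_cycle_quality.png")
except ImportError:
    pass
```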
RecabTable is a Hadoop-based tool in Seal to collect empirical base quality statistics for base quality recalibration (analogous to GATK's CountCovariatesWalker). The tool needs to keep track of known variant locations to avoid considering those locations as sequencing errors and overestimating error rates.
Currently, each parallel task in RecabTable loads the variant positions into a big int array. While simple, this solution has several drawbacks:
time loading the data;
A better solution is needed, and it should:
An idea would be to create a compact, binary, indexed format for the variant locations (an array of ints). It could be added to the distributed cache. The tasks would access this data through a shared memory map (this approach has worked well for sharing reference data in Seqal). The binary structure would be created on-demand by the “launcher” part of recab, before the Hadoop job is launched.
Open to alternatives…
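The compact binary format plus shared memory map could be prototyped roughly like this. The layout (a record count followed by packed little-endian 64-bit positions) is an illustration of the idea, not a proposed Seal format:

```python
import mmap
import struct

def write_positions(path, positions):
    """Write sorted variant positions as a count followed by packed
    little-endian 64-bit ints; the layout is illustrative only."""
    pos = sorted(positions)
    with open(path, "wb") as f:
        f.write(struct.pack("<q", len(pos)))
        f.write(struct.pack("<%dq" % len(pos), *pos))

class PositionSet(object):
    """Membership test over the packed file via a read-only memory map,
    so parallel tasks on one node can share a single copy of the data."""
    def __init__(self, path):
        self._f = open(path, "rb")
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self._n = struct.unpack_from("<q", self._mm, 0)[0]
    def __contains__(self, pos):
        lo, hi = 0, self._n  # binary search over the mapped array
        while lo < hi:
            mid = (lo + hi) // 2
            val = struct.unpack_from("<q", self._mm, 8 + 8 * mid)[0]
            if val < pos:
                lo = mid + 1
            elif val > pos:
                hi = mid
            else:
                return True
        return False

write_positions("known_sites.bin", [150, 42, 7, 9001])
known = PositionSet("known_sites.bin")
print(42 in known, 43 in known)
```

Since the page cache backs the mapping, many tasks on the same node would share one physical copy, addressing the loading-time drawback above.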
Related to the task on RecabTable: a distributed tool that actually takes that data and recalibrates the base qualities of a data set would be very useful. It's been on the Seal TODO list for a long time but has never gotten attention.
The recalibration formula could be identical to what's used by GATK.
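For reference, the empirical quality such a tool would assign is just the Phred-scaled observed error rate. This is a sketch of the general idea; the +1/+2 pseudocount smoothing is a common choice, and whether it matches GATK's exact formula is an assumption:

```python
import math

def empirical_quality(mismatches, observations):
    """Phred-scaled observed error rate. The +1/+2 pseudocounts are a
    common smoothing choice; GATK's exact smoothing may differ."""
    rate = (mismatches + 1.0) / (observations + 2.0)
    return -10.0 * math.log10(rate)

# Bases claiming Q30 but showing ~1% observed errors recalibrate to ~Q20.
print(round(empirical_quality(100, 10000), 1))
```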
I've recently heard about an MPI-based distributed de novo assembler called /Ray/: http://denovoassembler.sourceforge.net/
It would be good to test it and get some independent opinions about it.
Matt Massie from AMPLab has written ADAM (https://github.com/massie/adam), which includes Avro data formats for aligned sequencing data and the ability to store sequencing data in Parquet.
Parquet is a columnar data format which allows efficient access and compression (actually, this is a bit of an understatement).
While such a file format is incompatible with traditional tools used in sequencing, I think it's really well suited for standardized, high-throughput operations (the data could, if needed, be reformatted into BAM at the end, just before delivery).
The goal of this task is to install ADAM and try it out. A test could be to take a BAM file and run the sample Pig script provided by Massie. One could measure
Parquet files are well suited to distributed computation and are directly accessible from Hadoop MapReduce. It therefore makes sense to implement in the Seal tools the ability to read and write Parquet files.
The goal of this task is to implement or at least prototype an ADAMInputFileFormat and ADAMOutputFileFormat for Seal using ADAM.
P.S.: strategically, I think this could be more useful than integrating BAM I/O into Seal, especially if the ADAM ↔ BAM conversion works well enough. https://github.com/massie/adam
A good feature to add to Seal would be the ability to mark duplicates without aligning.
Currently Seal PRQ and Seqal only handle paired-end data. It has been requested several times that they be modified to handle single-end reads as well.
Seal has mainly (only?) been used with Illumina reads. If someone has experience with sequencing data produced by other technologies, we can review Seal's suitability for handling such non-Illumina data and, as necessary, compile a list of the modifications needed to make it work in such a setting.
Michele Muggiri at CRS4 has implemented an “elastic Hadoop map-reduce” system, called Hadoocca, to help us Hadoopers peacefully share the cluster with others. I think this system would be very useful to others who find themselves in a similar situation: a shared HPC cluster that cannot be dedicated exclusively to Hadoop.
While the system works, it hasn't been released to the outside world. At a minimum, it needs to be refactored so that it is more easily configured and suitable for different queueing systems. A rewrite in a more manageable language (it's currently in Bash) would probably also help.
Important things to be done:
Add support to Pydoop for the latest version of Cloudera's Distribution of Hadoop (CDH).
Biodoop-BLAST is a distributed, Hadoop-based wrapper for BLAST.
It needs to be updated with support for NCBI BLAST+.
Cloudgene is a tool to graphically execute MapReduce jobs on public and private clouds. Since AMIs change, and Cloudgene has been extended in many ways, the aim is to update the documentation.
Write a how-to for an MRv2 installation from scratch
Integrate services other than Hadoop into Cloudgene-Cluster. Update the installation process of user-defined scripts.
Right now Ratatosk (https://github.com/percyfal/ratatosk/issues?state=open) runs with Spotify's Luigi (https://github.com/spotify/luigi) “local scheduler”… It needs a better scheduling strategy, be it DRMAA, SAGA or similar.
Add Seal execution to Cloudgene via Private Mode / Public Mode (on AWS). It may even be feasible to implement automatic execution on AWS.
In order to provide an easy and automatic deployment method, I’ve written some Chef cookbooks for:
They’ve been tested on Ubuntu 12.04, where they work. In this task, we could extend and generalize these recipes to other distros.
You can find the recipes at https://github.com/guillermo-carrasco/cookbooks