

SeqAhead Sardinia Hackathon Tasks

Authors: those who proposed the task
Supporters: those who think the task is important (authors are implicitly supporters)
Adopters: those who are willing to work on it (leave empty for now; we will fill these in at the hackathon)

ADOPTED TASKS

<move your adopted task here>

1. Integrate readfq into BioPython

  • Authors: Roman, Valentine
  • Supporters: Roman
  • Adopters: Roman

Evaluate how hard it would be to integrate the self-proclaimed fastest FastQ reader, by Heng Li (http://github.com/lh3/readfq), into BioPython's general FastQ iterator codebase. Profile both implementations and the end result.

http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Comparing the history of both repositories via `git log` (readfq and Biopython on GitHub), there don't seem to be any obvious cross-optimizations.

Reading a dummy FastQ file with 1 million reads does not show significant differences:

Reading FastQ file with Heng Li's: 4.11267 seconds

Reading FastQ file with BioPython…: 5.933215 seconds
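
For reference, the comparison boils down to iterating over the same file with each parser. A minimal benchmarking sketch, assuming readfq.py from lh3/readfq is on the PYTHONPATH and exposes its readfq(fp) generator (the file name and timing harness are illustrative):

```python
import time
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from readfq import readfq  # Heng Li's generator from lh3/readfq

FASTQ = "dummy_1M_reads.fastq"  # hypothetical test file

def time_parser(parser, path):
    """Return (record count, elapsed seconds) for one pass over the file."""
    t0 = time.time()
    with open(path) as handle:
        n = sum(1 for _ in parser(handle))
    return n, time.time() - t0

for label, parser in (("readfq", readfq), ("Biopython", FastqGeneralIterator)):
    n, dt = time_parser(parser, FASTQ)
    print("%s: %d records in %.2f s" % (label, n, dt))
```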

A few quick tweets between @lexnederbragt, @chapmanb and others cleared up the doubts about benchmarking and optimizations:

https://twitter.com/chapmanb/status/342214137012682752

A more in-depth thread about different FastQ reading benchmarks can be found here:

http://www.biostars.org/p/10353/

Conclusion:

Apart from marginal speedups, both implementations seem to be on par at the time of writing. The claims in readfq's README.md are out of date and can be misleading.

25. Hadoop-Galaxy integration

  • Authors: Luca
  • Supporters:
  • Adopters: Roman

At CRS4 we've implemented a light integration between Hadoop and Galaxy. We can launch Hadoop jobs (Seal jobs) from Galaxy and tie them into a workflow. To do this, we've added a new data type to Galaxy, the /pathset/. In essence, this is a file that contains URIs to the actual data. Therefore, pathsets create a level of indirection, allowing the actual data to be split into multiple files (unlike the single file that Galaxy normally expects) and to be on any file system (e.g., HDFS).
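
Purely as a hypothetical illustration of the indirection (the actual pathset format used at CRS4 may differ), a pathset could be as simple as a text file listing one data URI per line, which tools expand before processing:

```python
def load_pathset(pathset_file):
    """Return the list of data URIs referenced by a (hypothetical) pathset file.

    Example contents of such a file:
        hdfs://namenode:8020/user/galaxy/run42/part-r-00000
        hdfs://namenode:8020/user/galaxy/run42/part-r-00001
        file:///mnt/shared/run42/extra.sam
    """
    with open(pathset_file) as handle:
        return [line.strip() for line in handle if line.strip()]
```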

This work hasn't been released to the world since it's still a little young. If someone is interested, we can bring the code and, as part of this task, people can work to generalize it and finish it. An important part that's missing is “garbage collection”—when Galaxy deletes its dataset, it deletes the pathset file but not the data to which it points.

Conclusion:

Found some bugs (https://github.com/crs4/pydoop/issues/1). The code needs to be published and needs some more work.

5. Freely usable implementation of “mapper”

  • Authors: Valentine Svensson
  • Supporters:
  • Adopters:

Mapper is a (poorly named) partial clustering algorithm which has been used with great success on high-dimensional data. The algorithm is based on ideas from topology; in essence, it finds the 'shape' of the data in which one wishes to detect patterns.

The method was recently presented in a paper in Scientific Reports (Nature Publishing Group); the text explains some of the ideas behind the method and gives examples of applications of the algorithm: http://www.nature.com/srep/2013/130207/srep01236/full/srep01236.html

More detailed specifics about the algorithm can be found here: http://comptop.stanford.edu/u/preprints/mapperPBG.pdf

An issue is that the only freely available implementation is in Matlab (http://comptop.stanford.edu/programs/).

The point of this task was to implement Mapper in Python, using SciPy and NetworkX, so that it could eventually be optimized and parallelized, e.g. using MapReduce strategies.

I made some nice progress on the initial implementation of the algorithm; the code can be seen in this GitHub repository: https://github.com/vals/mapper.py
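
For orientation, here is a much-simplified sketch of the 1-D Mapper construction with SciPy and NetworkX. It is an illustration of the idea only, not the code in mapper.py, and the interval count, overlap and clustering cut-off are arbitrary:

```python
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import fcluster, linkage

def mapper_sketch(points, filter_values, n_intervals=10, overlap=0.5,
                  cluster_dist=0.3):
    """Toy 1-D Mapper: cover the filter range with overlapping intervals,
    cluster each preimage with single linkage, create one node per cluster,
    and connect clusters that share points."""
    lo, hi = filter_values.min(), filter_values.max()
    length = (hi - lo) / float(n_intervals)
    step = length * (1.0 - overlap)

    graph = nx.Graph()
    clusters = []                      # (node_id, set of point indices)
    node_id = 0
    start = lo
    while start < hi:
        idx = np.where((filter_values >= start) &
                       (filter_values <= start + length))[0]
        if len(idx) == 1:
            labels = np.array([1])
        elif len(idx) > 1:
            labels = fcluster(linkage(points[idx], method="single"),
                              cluster_dist, criterion="distance")
        else:
            start += step
            continue
        for lab in np.unique(labels):
            members = set(idx[labels == lab].tolist())
            graph.add_node(node_id)
            for other_id, other_members in clusters:
                if members & other_members:
                    graph.add_edge(node_id, other_id)
            clusters.append((node_id, members))
            node_id += 1
        start += step
    return graph
```

From the resulting graph, nx.adjacency_matrix(graph) gives the adjacency matrix of the 1-skeleton of the cluster complex.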

(Figures: Mapper output and input data)

Conclusion:

The results aren't perfect yet; progress can be seen in the IPython notebook in the same repository. As a first application I am trying to reproduce the partial clustering of random circle data that is shown in the tutorial of the Matlab implementation. There are still some bugs to work out, but the program produces an adjacency matrix for a simplicial complex of overlapping clusters, and in a sensible way: the nodes in the graph follow the segmentation of the filtered data. For some reason it only seems to capture simplices on one side of the circle, something I will need to debug in the future.

37. Seal integration in Travis-CI

  • Authors: Guillermo Carrasco
  • Supporters: Luca Pireddu
  • Adopters: Guillermo Carrasco
  • See the task page

Travis-CI is a public and free continuous integration service that supports multiple languages and offers resources to execute software tests. It integrates tightly with GitHub, testing each code change automatically as well as pull requests, so the author knows whether a pull request breaks something before even merging it.

The aim of the task is to integrate Seal with Travis-CI. Seal has been moved to GitHub.

4. Update libbwa in Seal Seqal from BWA 0.5.10 to 0.7.x

  • Authors: Roman
  • Supporters:
  • Adopters: Ognyan, Riccardo, Luca
  • See the task page

Seqal works with a refactored version of the BWA code (libbwa, found in the Seqal code). Currently, this library contains BWA 0.5.10. For Seqal to stay relevant and useful, the library needs to be updated to the more recent versions of BWA.

36. Check out Cloudera Search

  • Authors: Aleksi
  • Supporters:
  • Adopters:

Conclusion: the Solr service installed quite nicely with the latest version of Cloudera Manager. However, we couldn't get it running: initialising the SolrCore from the command line fails. It seems it is not a well-packaged product yet, so we will give up for now.

Cloudera Search integrates the Apache Solr search engine with Hadoop. It might have interesting applications in bioinformatics.

It seems this was released during the hackathon; a beta is out.

https://ccp.cloudera.com/display/SUPPORT/Downloads?elq=13cba17314044ff7bc8cd2a7c86f4c38&elqCampaignId=228

7. Hadoop-based bclToQseq and bclToFastq

  • Authors: Luca, Roman
  • Supporters:
  • Adopters: Luca, Guillermo, Roman
  • See the task page

bcl files are the output generated by Illumina's base calling software. One can convert these into fastq or qseq for further processing. For this conversion, Illumina provides the utilities bclToFastq and bclToQseq, and a script that creates a system of Makefiles to convert all the bcl data for a flowcell into qseq or fastq files (even in parallel if you use make -j X).

It would be good to have a Hadoop-based tool to perform this same task. A possible strategy is explained in the following thread on BioStar: http://www.biostars.org/p/15698/

Plan

Actually, I (Luca) already have Pydoop-based code to perform the conversion from bcl to qseq on Hadoop in an internal project. If someone is interested in adopting the task at the hackathon, I'll provide the code that already exists so it can be integrated in Seal and generalized. It needs:

  • documentation
  • expose missing bclToQseq parameters (especially the one to skip missing or bad tiles)
  • work with both bclToQseq and bclToFastq
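
To make the per-tile strategy from the BioStar thread concrete, here is a rough Hadoop Streaming-style sketch. This is not the existing Pydoop code mentioned above; bcl_to_qseq_wrapper.sh stands in for a hypothetical site-specific script that invokes Illumina's bclToQseq for one tile, and the paths are illustrative:

```python
#!/usr/bin/env python
"""Streaming mapper sketch: each input line names one (lane, tile) to convert."""
import subprocess
import sys

RUN_DIR = "/shared/runs/current_flowcell"   # illustrative
OUT_DIR = "/shared/qseq_out"                # illustrative

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    lane, tile = line.split()
    # Hypothetical wrapper around Illumina's bclToQseq for a single tile;
    # it would also copy the resulting qseq file to OUT_DIR.
    rc = subprocess.call(["bcl_to_qseq_wrapper.sh", RUN_DIR, lane, tile, OUT_DIR])
    # Emit a status record so failed tiles are visible in the job output.
    print("%s\t%s\t%d" % (lane, tile, rc))
```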

11. Check out H2O

  • Authors: Aleksi
  • Supporters:
  • Adopters: Aleksi, Seb

See task page for results.

H2O scales statistics, machine learning and math over Hadoop: http://0xdata.github.io/h2o/

Quickly check this one out and perhaps compare to Mahout. It is not clear if this is still in early development.

30. Integrate Seqpig into Piggene

  • Authors: Lukas, Sebastian, Clemens
  • Supporters: Aleksi
  • Adopters:

See task page for results.

PigGene is a platform developed by us to create Apache Pig scripts graphically (currently in beta). Users are guided through the creation process, and scripts are integrated and executed on Cloudgene. Additionally, a new loader and UDFs have been implemented for the import and use of VCF files. Since SeqPig (http://sourceforge.net/projects/seqpig/) supports a lot of different bioinformatics formats, the aim is to integrate it into PigGene so that SeqPig scripts can also be created graphically via PigGene.

13. Cloudgene: clearer documentation on configuring tool parameters

  • Authors: Alexey
  • Supporters:
  • Adopters: Lukas Forer, Alexey Siretskiy

See task page for results.

Write a clearer, more detailed and systematic description in the manual of how parameters from Cloudgene YAML configuration files are passed to, for example, Crossbow and SOAPsnp, and from Crossbow on to Bowtie. At the moment it looks messy, and it is confusing which parameters correspond to which programs, whether they are applicable together, etc.

9. Tutorial on getting started with Hadoop for bioinformatics

  • Authors: Aleksi
  • Supporters:
  • Adopters: Andrea Pinna, Oliver Hunewald
  • See the task page

Write a short tutorial on getting started with Hadoop for bioinformatics. It could cover something like:

  • Set up nodes (OpenStack, Amazon)
  • Install Cloudera manager and set up Hadoop on the nodes
  • Install bioinformatics software (Seal?)
  • Do some simple yet meaningful analysis run

The tutorial could be published in some appropriate venue.

Some good reference materials:

PROPOSED TASKS

Here are the task proposals, strictly in order of appearance. There's much more work than workers, so we should have no problems keeping busy for two days.

2. Generalize the Fastq parser in Hadoop-BAM

  • Authors: Guillermo
  • Supporters:
  • Adopters:

No description

3. BAM I/O for Seal

  • Authors: Luca, Roman
  • Supporters:
  • Adopters:

Seal Seqal and downstream tools currently only read/write SAM. It would be good to add the ability to read and write BAM, where applicable.

Plan

HadoopBAM already has Hadoop classes for BAM input and output.

Java programs

The Java programs in Seal are in part already written with a generic interface to input and output classes. It should be relatively easy to integrate BAM support into these.

Seqal

Seqal, which is written in Python and C, may be more problematic in some ways, but wouldn't require very much coding. I think one could manage to use the BAM output class in Hadoop-BAM by writing an appropriate serialization function on the Python map or reduce side (note that the Pydoop layer communicates with the rest of the Hadoop framework by serializing its data and sending it over an inter-process channel). Exactly which serialization to use remains an exercise to be solved. A couple of options might be (a small illustration follows the list):

  • Implement the serialization protocol for SAMRecordWritable
  • Wrap BAMOutputFormat in a SealBAMOutputFormat to implement custom deserialization on the Java side; serialize with protobuf (already implemented) or similar
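
As a tiny illustration of the second option in text form (the field names are generic SAM columns; the actual Seqal record layout and the Java-side wrapper are not shown), the Python side could emit one serialized alignment per record and leave the parsing into a SAMRecordWritable to the Java wrapper:

```python
def serialize_alignment(qname, flag, rname, pos, mapq, cigar,
                        rnext, pnext, tlen, seq, qual, tags=()):
    """Illustrative only: format one alignment as a SAM text line for a
    hypothetical Java-side wrapper to parse and hand to Hadoop-BAM."""
    fields = [qname, str(flag), rname, str(pos), str(mapq), cigar,
              rnext, str(pnext), str(tlen), seq, qual]
    fields.extend(tags)
    return "\t".join(fields)
```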

BAM header

Code to generate SAM headers is in Seal::MergeAlignments. Some strategy (e.g. distributed cache) has to be chosen to pass the header to all the tasks since they will need it to generate proper BAM files.

8. Make Hadoop-BAM run on MRv2 / YARN

  • Authors: Aleksi
  • Supporters:
  • Adopters:

Hadoop-BAM has been tested on Hadoop 2 with MRv1, but not on Hadoop 2 with MRv2. Test and fix.

10. Do realtime visualisation (e.g. genome browser) using Impala as data source

  • Authors: Aleksi
  • Supporters:
  • Adopters:

Typical MapReduce-based query systems have suffered from a minimum latency of around 10 seconds. Impala is one of the systems that overcome this. Real-time queries over huge datasets would open exciting new possibilities, such as using a large Hadoop cluster to provide data to a real-time visualisation engine.

12. CloudGene: add possibility to use distributed cache in private cluster mode

  • Authors: Alexey
  • Supporters:
  • Adopters:

Add to the “private cluster mode” the possibility of using the Hadoop distributed cache, to make it easier to install programs. For example, to use Crossbow one needs to install it on all the nodes. The Cloudgene authors did, however, include the possibility of installing programs on only one node in the so-called “public cluster mode”, using an install script. See here: http://cloudgene.uibk.ac.at/docs/integrating_programs_into_cloudgene

14. Cloudgene: improve log collection

  • Authors: Alexey
  • Supporters:
  • Adopters:

At the moment only the logs of the running program are collected. The logs produced by HDFS and MapReduce could also be very informative, for example for debugging. I mean logs like these: ”/data/1/mapred/local/userlogs/job_201305201249_00”

15. Parallel or distributed part concatenation tool

  • Authors: Luca
  • Supporters:
  • Adopters:

Often the bottleneck in workflows using Hadoop is taking the Hadoop output parts and concatenating them into a single monolithic file for downstream processing. Note, however, that this big file is always (in an HPC environment) on a POSIX-compliant parallel shared file system (i.e., one that allows random writes).

A tool that could speed-up this procedure would be very useful.

Plan

I don't have a very clear strategy here.

A simple option would be to write a multi-threaded program. It would assign one part to each of its active copy threads and run some number of threads simultaneously. This should make things faster, but it would be limited by the fact that it runs on a single node.
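
A minimal sketch of that single-node option, assuming local POSIX paths (reading parts straight out of HDFS would additionally need something like Pydoop's HDFS API or `hadoop fs -cat`):

```python
import os
import threading

def copy_part(src, dst_path, offset, bufsize=16 * 1024 * 1024):
    """Copy one part file into dst_path starting at the given byte offset.
    Assumes the destination file system supports random writes (e.g. a
    POSIX parallel file system such as GPFS or Lustre)."""
    with open(src, "rb") as src_f, open(dst_path, "r+b") as dst_f:
        dst_f.seek(offset)
        while True:
            chunk = src_f.read(bufsize)
            if not chunk:
                break
            dst_f.write(chunk)

def concat_parts(parts, dst_path, n_threads=4):
    sizes = [os.path.getsize(p) for p in parts]
    with open(dst_path, "wb") as dst_f:     # pre-allocate the output file
        dst_f.truncate(sum(sizes))
    offsets, total = [], 0
    for size in sizes:
        offsets.append(total)
        total += size
    sem = threading.Semaphore(n_threads)    # cap the number of active copies
    def worker(src, offset):
        with sem:
            copy_part(src, dst_path, offset)
    threads = [threading.Thread(target=worker, args=(p, o))
               for p, o in zip(parts, offsets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```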

A more sophisticated alternative would be to write a distributed program to do this. It would be natural to write it as a Hadoop program. An option would be to have the launcher part analyze the layout of the data to be copied and write a plan to a simple text file, one task per record (e.g., line). An NLineInputFormat could be used to split the plan so that one record goes to each map task (perhaps extending the NLineInputFormat to be clever about sending the map task to the node with the block to be copied, if on HDFS). Each map task would then receive the plan as a key-value pair and would thus know which part to read and to which position of the output file to write it. It would then open connections to the input and output file systems and copy the data.

Such a solution would work fine for headerless data. Something yet more clever would be needed to handle proper SAM and BAM files, which have a header. In that case, the tool would need to understand the input and output file formats.

16. FastQC-like report generation for SeqPig statistics

  • Authors: Luca
  • Supporters:
  • Adopters:

Was not adopted as such, but was touched on by another task.

SeqPig has recently added functionality to collect statistics on sequencing data from Fastq and Qseq files (and, in theory, SAM and BAM as well). The statistics collected are almost entirely sufficient to produce a quality report like the one generated by FastQC. However, these statistics are collected using the Hadoop cluster, and so the approach is more scalable.

SeqPig generates tables with base content per cycle and per read, and base quality histograms. To be used as a QC tool, these should be presented as a report with pretty graphs and tables; see an example FastQC report.

Plan

This could be done with some simple scripting. A possible solution is to use Python with matplotlib. Such report generation could be contributed back to the SeqPig project.
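
A sketch of the kind of plotting code involved, using a hypothetical tab-separated input with one row per cycle and per-base counts (the real SeqPig output layout would have to be adapted to):

```python
import csv
import matplotlib
matplotlib.use("Agg")           # headless report generation
import matplotlib.pyplot as plt

def plot_base_content(stats_tsv, out_png):
    """Plot per-cycle base composition from a hypothetical TSV with
    columns: cycle, A, C, G, T, N (counts)."""
    cycles, fractions = [], {b: [] for b in "ACGTN"}
    with open(stats_tsv) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts = {b: float(row[b]) for b in "ACGTN"}
            total = sum(counts.values()) or 1.0
            cycles.append(int(row["cycle"]))
            for b in "ACGTN":
                fractions[b].append(counts[b] / total)
    for b in "ACGT":
        plt.plot(cycles, fractions[b], label=b)
    plt.xlabel("cycle")
    plt.ylabel("fraction of bases")
    plt.legend()
    plt.savefig(out_png)
```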

17. Better variant-storing data structure for Seal RecabTable

  • Authors: Luca
  • Supporters:
  • Adopters:

RecabTable is a Hadoop-based tool in Seal to collect empirical base quality statistics for base quality recalibration (analogous to GATK CountCovariatesWalker). The tool needs to keep track of known variant locations to avoid considering those locations as sequencing errors and overestimating error rates.

Currently, each parallel task in RecabTable loads the variant positions into a big int array. While simple, this solution has several drawbacks:

  • each task has a large overhead, as it has to spend a significant amount of time loading the data;
  • each task uses a significant amount of RAM for this data structure.

A better solution is needed, and it should:

  • share memory between tasks
  • provide read-only, fast look-ups
  • be compact

Plan

An idea would be to create a compact, binary, indexed format for the variant locations (an array of ints). It could be added to the distributed cache. The tasks would access this data through a shared memory map (this approach has worked well for sharing reference data in Seqal). The binary structure would be created on-demand by the “launcher” part of recab, before the Hadoop job is launched.
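
As a back-of-the-envelope illustration of the data structure (in Python rather than Seal's Java, and with a made-up file layout in which each variant location is pre-encoded as a single 64-bit integer), lookups can be a binary search over a memory-mapped array of sorted positions:

```python
import mmap
import struct

RECORD = struct.Struct("<q")    # one little-endian int64 per variant location

def write_positions(positions, path):
    """Write the (sorted) variant locations as a packed binary array."""
    with open(path, "wb") as out:
        for pos in sorted(positions):
            out.write(RECORD.pack(pos))

class PositionIndex(object):
    """Read-only binary search over a memory-mapped array of positions;
    the OS page cache lets concurrent tasks on the same node share it."""
    def __init__(self, path):
        self._f = open(path, "rb")
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self._n = len(self._mm) // RECORD.size

    def _at(self, i):
        return RECORD.unpack_from(self._mm, i * RECORD.size)[0]

    def __contains__(self, pos):
        lo, hi = 0, self._n
        while lo < hi:                      # classic lower-bound search
            mid = (lo + hi) // 2
            if self._at(mid) < pos:
                lo = mid + 1
            else:
                hi = mid
        return lo < self._n and self._at(lo) == pos
```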

Open to alternatives…

18. Base quality recalibration tool for Hadoop

  • Authors: Luca
  • Supporters:
  • Adopters:

Related to the task on RecabTable, a distributed tool to actually take that data and recalibrate the base qualities of a data set would be very useful. It's been on the TODO list for Seal for a long time but has never gotten attention.

The recalibration formula could be identical to what's used by GATK.
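
The core of any such formula is the empirical Phred quality computed per covariate bin from RecabTable's counts; a minimal sketch (the pseudo-count smoothing here is a simple placeholder, not necessarily GATK's exact smoothing):

```python
import math

def empirical_quality(mismatches, observations, pseudo=1.0):
    """Empirical Phred-scaled quality for one covariate bin, with a simple
    pseudo-count to avoid zero error rates (a placeholder for whatever
    smoothing/priors GATK actually applies)."""
    error_rate = (mismatches + pseudo) / (observations + 2.0 * pseudo)
    return -10.0 * math.log10(error_rate)
```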

19. Test Ray Assembler

  • Authors: Luca
  • Supporters:
  • Adopters:

I've recently heard about an MPI-based distributed de novo assembler called /Ray/: http://denovoassembler.sourceforge.net/

It would be good to test it and to get some independent opinions about it.

20. Explore ADAM columnar storage

  • Authors: Luca
  • Supporters:
  • Adopters:

Matt Massie from AMPLab has written ADAM (https://github.com/massie/adam), which includes Avro data formats for aligned sequencing data and the ability to store sequencing data in Parquet.

Parquet is a columnar data format which allows efficient access and compression (actually, this is a bit of an understatement).

While such a file format is incompatible with traditional tools used in sequencing, I think it's really well suited for standardized, high-throughput operations (the data could, if needed, be reformatted into BAM at the end, just before delivery).

The goal of this task is to install ADAM and try it out. A test could be to take a BAM file and run the sample Pig script provided by Massie. One could measure

  • throughput/node for converting BAM to ADAM
  • throughput/node for Pig script running with ADAM as data source
  • throughput/node for Pig script running with BAM as source
  • data size

21. Explore using ADAM with Seal

  • Authors: Luca
  • Supporters:
  • Adopters:

Parquet files are well suited for distributed computation and are directly accessible by Hadoop MapReduce. It therefore makes sense to give the Seal tools the ability to read and write Parquet files.

The goal of this task is to implement or at least prototype an ADAMInputFileFormat and ADAMOutputFileFormat for Seal using ADAM.

ps: strategically, I think this could be more useful than integrating BAM I/O in Seal, especially if the conversion ADAM ↔ BAM works well enough. https://github.com/massie/adam

22. Seal: mark duplicates without aligning

  • Authors: Luca
  • Supporters:
  • Adopters:

A good feature to add to Seal would be the ability to mark duplicates without aligning.

23. Seal: handle single-end reads

  • Authors: Luca
  • Supporters:
  • Adopters:

Currently Seal PRQ and Seqal will only handle paired-end data. It has been requested several times that they be modified to handle single-end reads as well.

24. Evaluate suitability of Seal for non-Illumina data

  • Authors: Luca
  • Supporters:
  • Adopters:

Seal has mainly (only?) been used with Illumina reads. If someone has experience with sequencing data produced by other technologies we can have Seal reviewed for its suitability to handle such non-Illumina data and, as necessary, generate a list of modifications needed to make it work in such a setting.

26. Open-source elastic Hadoop

  • Authors: Luca
  • Supporters:
  • Adopters:

Michele Muggiri at CRS4 has implemented an “elastic hadoop map-reduce” system, called Hadoocca, to help us Hadoopers peacefully share the cluster with others. I think this system would be very useful to others who find themselves in a similar situation: a shared HPC cluster that cannot be dedicated exclusively to Hadoop.

While the system works, it hasn't been released to the outside world. It at least needs to be refactored to be more easily configured and suitable for different queueing systems. A rewrite in a more manageable language (it's currently in BASH) would probably also help.

Important things to be done:

  • use DRMAA or other strategy for compatibility with multiple queueing systems
  • consider using jmx to query data from job tracker
  • compatibility with Hadoop 2.0 and YARN
  • release as open source

27. Pydoop support for latest version of CDH

  • Authors: Simone Leo
  • Supporters:
  • Adopters:

Add support to Pydoop for the latest version of Cloudera's Distribution of Hadoop (CDH).

28. Add support for NCBI BLAST+ in Biodoop-BLAST

  • Authors: Simone Leo
  • Supporters:
  • Adopters:

Biodoop-BLAST is a distributed, Hadoop-based wrapper for BLAST.

It needs to be updated with support for NCBI BLAST+.

29. Update Cloudgene documentation

  • Authors: Lukas, Sebastian
  • Supporters:
  • Adopters:

Cloudgene is a tool to graphically execute MapReduce jobs on public and private clouds. Since AMIs change, and Cloudgene has been extended in many ways, the aim is to update the documentation.

31. How-to for an MRv2 installation from scratch

  • Authors: Lukas, Sebastian
  • Supporters:
  • Adopters:

Write a how-to for an MRv2 installation from scratch.

32. Integrate non-hadoop services into Cloudgene-Cluster

  • Authors: Lukas, Sebastian
  • Supporters:
  • Adopters:

Integrate services other than Hadoop into Cloudgene-Cluster, and update the installation process for user-defined scripts.

33. Find an appropriate scheduling connector/strategy for Ratatosk pipeline

  • Authors: Roman, Valentine, Guillermo
  • Supporters:
  • Adopters:

Right now Ratatosk (https://github.com/percyfal/ratatosk/issues?state=open) runs with Spotify's Luigi (https://github.com/spotify/luigi) “local scheduler”. It requires a better scheduling strategy, be it DRMAA, SAGA or similar.
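
For context, the gap is visible in how a Luigi workflow is launched today. Here is a toy task (names made up) that can be run either with the bundled local scheduler or against a central luigid instance; neither hands the actual work to a batch system, which is where a DRMAA- or SAGA-backed strategy would come in:

```python
import luigi

class AlignSample(luigi.Task):
    """Toy stand-in for one Ratatosk-style pipeline step."""
    sample = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("%s.done" % self.sample)

    def run(self):
        with self.output().open("w") as out:
            out.write("processed %s\n" % self.sample)

if __name__ == "__main__":
    luigi.run()
```

Running `python align_sample.py AlignSample --sample s1 --local-scheduler` uses the in-process scheduler; dropping `--local-scheduler` talks to a central luigid, but the task's run() still executes on the local machine.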

34. Add Seal to Cloudgene Private/Public Mode

  • Authors: Sebastian
  • Supporters:
  • Adopters:

Add Seal execution to Cloudgene via Private Mode / Public Mode (on AWS). It may even be feasible to implement automatic execution on AWS.

35. Deploy Hadoop & co. with Chef

  • Authors: Guillermo
  • Supporters:
  • Adopters:

In order to provide an easy and automatic deployment method, I've written some Chef cookbooks for:

  • Hadoop (CDH distribution)
  • Hadoop-BAM
  • Pydoop
  • Seal

They've been tested on Ubuntu 12.04, where they work. In this task, we could extend and generalize these recipes to other distros.

You can find the recipes at https://github.com/guillermo-carrasco/cookbooks

 