1. Easily hadoopifying NGS workflows with Pig (A, David, Davids, Andre)

  • Based on SeqPig library
  • Converting some existing SAM-based workflows into BAM+SeqPig?
  • Implementing pileup with Pig, and then more advanced operations on top of that?
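Before writing the Pig version, the pileup logic itself can be sketched in plain Python (a toy illustration only: ungapped alignments are assumed, and the read tuples are made up):

```python
from collections import defaultdict

def pileup(reads):
    """Toy pileup: count the bases observed at each reference position.
    Each read is a (start_position, sequence) tuple; alignments are
    assumed ungapped, so base i of a read covers start_position + i."""
    counts = defaultdict(lambda: defaultdict(int))
    for start, seq in reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return {pos: dict(bases) for pos, bases in counts.items()}

# Two overlapping reads: position 101 is covered by both.
reads = [(100, "ACGT"), (101, "CGTA")]
print(pileup(reads))
```

In Pig the same grouping would be expressed by flattening each read into (position, base) tuples and then doing a GROUP BY position with per-base COUNTs.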

2. Making Seal available through Chipster (A)

  • Wrapping up everything into an installation package

3. Benchmarking tools for synchronising large datasets across cloud nodes (A, Samuel, Dimitar)

  • rsync, zsync, Aspera, iRODS, …
  • Synchronising running filesystems against disk images
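A minimal harness for timing such tools could look like this (a Python sketch; the command lists and any paths passed to them are placeholders, not a benchmark design):

```python
import subprocess
import time

def time_command(cmd, runs=3):
    """Run a sync command several times and return the best wall-clock
    time in seconds. 'cmd' is an argv list, e.g. a hypothetical
    ['rsync', '-a', 'src/', 'host:dst/']. Taking the best of several
    runs reduces noise from caching and network jitter."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.monotonic()
        subprocess.run(cmd, check=True)
        best = min(best, time.monotonic() - t0)
    return best
```

A real benchmark would additionally vary dataset size, file count, and delta (how much of the data changed since the last sync), since rsync-style tools win mainly on deltas.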

Large data transfer page

4. Getting CloudMan & Galaxy running on OpenNebula cluster

Participants: Roman Valls Guimera, Aleksi Kallio, Jorrit Boekel
CSC personnel: Jarno Laitinen, Risto Laurikainen
KTH personnel: Zeeshan Ali Sha

Bioinformatic analyses can be done in user-friendly environments such as Galaxy. This framework has been combined with Cloudman to enable the formation of a compute cluster in Amazon’s EC2 cloud, which makes parallel analysis possible at a reasonable price. However, academia is often reluctant to use credit cards, and there are legal issues about data location (especially when it concerns clinical data from human patients), which make Amazon a less attractive option.

We explored the possibility of setting up Galaxy with Cloudman on the free, open-source cloud computing platform OpenNebula, which is currently in place, albeit under development, at at least two Nordic institutions: CSC in Finland and KTH in Sweden. Cloudman itself has limited support for OpenNebula, which this task effort aims to improve.

Using a 40-node Hadoop cluster running OpenNebula at CSC, we have attempted to install Cloudman and make the modifications necessary for it to work. Further collaboration between sysadmins (infrastructure) and developers (source code) needs to take place in order to have a functional private biocloud. Our intention is to keep the collaboration alive after the hackathon to reach this goal.

For a more technical report, please see tasks/cloudman_opennebula

5. CloudBioLinux advanced topics

  • Handling large datasets (genomes, indexes)
  • Handling complex dependencies

6. Virtualization over HPC resources (Ola, Samuel, Dimitar)

  • Running virtual machines on SGE/etc. cluster
  • Using Chef/Puppet

7. Deploying Hadoop with virtual machines (A)

  • Creating images from scratch
  • Using Cloudera

8. “Hadoopedia”: Documentation on using Hadoop for NGS data analysis (A, Ola, Mikael)

9. SequenceFile for read mappings

  • Design and implement an efficient MR file format for aligned reads
    • use SequenceFile
    • serialize records using Avro or Protobuf
    • store read pairs in the same record
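The record layout idea can be sketched in plain Python (length-prefixed fields, both mates of a pair in one record). This hand-rolled encoding is only an illustration; the real implementation would use SequenceFile with Avro or Protobuf serialization as listed above:

```python
import struct

def _pack_field(b: bytes) -> bytes:
    # Big-endian 4-byte length prefix, then the raw bytes.
    return struct.pack(">I", len(b)) + b

def pack_pair(read1, read2):
    """Serialize a read pair into one record so that both mates always
    travel together through a MapReduce job. Each read is a
    (name, sequence, qualities) tuple of strings."""
    out = b""
    for name, seq, qual in (read1, read2):
        for field in (name, seq, qual):
            out += _pack_field(field.encode())
    return out

def unpack_pair(data):
    """Inverse of pack_pair: split the record back into two reads."""
    fields, off = [], 0
    while off < len(data):
        (n,) = struct.unpack_from(">I", data, off)
        off += 4
        fields.append(data[off:off + n].decode())
        off += n
    return tuple(fields[:3]), tuple(fields[3:])
```

Keeping the pair in one record avoids the shuffle step that would otherwise be needed to reunite mates split across input splits.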

10. Trying out Eoulsan framework (A, Mikael, David, Valentine, Andre)

11. Variant allele distribution (Luca, Mikael, Valentine, Davids, Andre, Eija)

12. Gobi file format

The Gobi file format is a very efficient alternative for storing aligned reads. The developers already provide a Java library that implements it. It would be interesting, after a preliminary feasibility study, to implement Gobi as a Hadoop I/O file format class in Hadoop-BAM.

13. Hadoop file format conversion tool (A)

A generic Hadoop-based file format conversion tool would be useful. It could be implemented as part of Hadoop-BAM, as a slight specialization of Hadoop-BAM sort. In fact, it may already be possible with something like:

hadoop jar hadoop-bam-4.0.jar sort -D mapred.reduce.tasks=0 ...

File formats to support:

  • SAM
  • BAM
  • FASTQ
  • QSEQ
  • Gobi
  • …

14. Quality control (A, Luca)

15. Base quality recalibration in Seal

  • we have an app to calculate empirical base quality statistics
  • we still need to implement an application to recalibrate the base qualities
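The core of the recalibration step can be sketched as follows (a minimal illustration, not the Seal implementation: the per-quality-bin mismatch counts are assumed to come from the existing statistics app, and the add-one smoothing is an assumption):

```python
import math

def recalibrate(reported_q, empirical):
    """Map a reported Phred quality to an empirically calibrated one.
    'empirical' maps each reported quality bin to a
    (mismatches, total_observations) pair. The empirical error rate is
    smoothed (add-one) so empty or perfect bins do not blow up the log."""
    mismatches, total = empirical[reported_q]
    err = (mismatches + 1) / (total + 2)
    return round(-10 * math.log10(err))

# Hypothetical bin: quality 30 claimed, but 10 mismatches in 1000 bases
# observed, i.e. ~1% error, so the calibrated quality is ~20.
print(recalibrate(30, {30: (10, 1000)}))
```

A full recalibrator would bin on covariates beyond reported quality (read position, dinucleotide context, lane), as GATK-style recalibration does.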

16. Population GLF

  • U. Michigan provides a tool, glfMultiples
  • it analyses entire sample populations at once to extract population variant statistics
  • it seems to be a good candidate for Hadoop-ification

17. Galaxy / Hadoop integration (Jorrit, Daniel, Martin, Samuel)

How can Hadoop tools be integrated with Galaxy?

  • possible solution: add a FileSet data type to Galaxy
  • Hadoop jobs are command-line programs
  • each job takes a file set as input and generates a file set

Implement, test, and package this so that running Hadoop programs from Galaxy becomes trivial.
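The FileSet idea can be sketched like this (a hypothetical illustration: the class name and manifest layout are assumptions, not Galaxy's datatype API):

```python
import json

class FileSet:
    """Hypothetical dataset type representing a set of files as one
    logical Galaxy dataset, e.g. the part-* outputs of a Hadoop job."""

    def __init__(self, paths):
        self.paths = sorted(paths)

    def to_manifest(self):
        # The manifest is what a tool wrapper would record in Galaxy's
        # history and hand to the next 'hadoop jar ...' invocation.
        return json.dumps({"paths": self.paths})

    @classmethod
    def from_manifest(cls, text):
        return cls(json.loads(text)["paths"])
```

Wrapping each Hadoop job as a command that consumes one manifest and produces another would let Galaxy chain Hadoop steps like ordinary tools, without copying the part files in and out of HDFS between steps.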

18. Life saver scripts with SeqPig (A, Luca, Andre)

SeqPig Life Savers page

19. Sequencing center automation

Learn about options for automating the operation of sequencing centers, such as …

20. Implement a repository for sequencing data (A, Luca, Dimitar)

21. Use iRODS with cloud computing (A, Ola, Luca, Roman, Samuel, Dimitar, Davids)

Investigate how suitable iRODS would be for cloud computing, and investigate compatibility issues with e.g. HDFS and S3.

22. Writing hackathon report (Ola)

Document technologies and current status; identify bottlenecks, problems, opportunities, etc. Summarize in a report to COST. An extended version could possibly be turned into a manuscript for journal submission.

23. Contamination screening using Bloom filters (Valentine, Roman)

Use Bloom filters to mimic fastq_screen functionality, potentially using FACS as a reference implementation.
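A minimal Bloom filter sketch in Python (not FACS itself; the bit-array size, hash count, and hashing scheme are illustrative choices):

```python
import hashlib

class BloomFilter:
    """Space-efficient probabilistic set: no false negatives, tunable
    false-positive rate via size and number of hash functions."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash;
        # real implementations use faster non-cryptographic hashes.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

For contamination screening, the k-mers of each contaminant genome would be loaded into a filter once; a read is then flagged if enough of its k-mers hit the filter, with occasional false positives but never false negatives.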

Last modified: 2012/06/04 09:33 by Roman Valls Guimera