Participants: Roman Valls Guimera, Aleksi Kallio, Jorrit Boekel
CSC personnel: Jarno Laitinen, Risto Laurikainen
KTH personnel: Zeeshan Ali Sha
Bioinformatic analyses can be done in user-friendly environments such as Galaxy. This framework has been combined with Cloudman to enable the formation of a compute cluster in Amazon’s EC2 cloud, which makes parallel analysis possible at a reasonable price. However, academia is often reluctant to pay by credit card, and there are legal issues about data location (especially for clinical data from human patients). Both make Amazon a less attractive option.
We explored the possibility of setting up Galaxy with Cloudman on the free, open-source cloud computing platform OpenNebula, which is currently in place, albeit under development, at two or more Nordic institutions, including CSC in Finland and KTH in Sweden. The software counterpart, [https://bitbucket.org/mdehollander/cloudman-opennebula/overview|Cloudman], has limited support for OpenNebula, which this task aims to improve.
Using a 40-node Hadoop cluster running OpenNebula at CSC (http://www.csc.fi), we have attempted to install Cloudman and make the modifications necessary for it to work. Further collaboration between sysadmins (infrastructure) and developers (source code) needs to take place in order to have a functional private biocloud. Our intention is to keep the collaboration alive after the hackathon to reach this goal.
For a more technical report, please see tasks/cloudman_opennebula
The Goby file format (http://campagnelab.org/software/goby/) is a very efficient alternative for storing aligned reads. The developers already provide a Java library that implements it. It would be interesting, after a preliminary feasibility study, to implement Goby as a Hadoop I/O file format class in Hadoop-BAM.
A generic Hadoop-based file format conversion tool would be useful. It could be implemented as part of Hadoop-BAM, as a slight specialization of Hadoop-BAM sort. In fact, maybe it can already be done with:
hadoop -D mapred.reduce.tasks=0 jar hadoop-bam-4.0.jar sort ...
File formats to support:
* sam
* bam
* fastq
* qseq
* goby
* …
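As a rough illustration of the dispatch step such a conversion tool would need, the sketch below guesses the format of an input file from its name. The helper `detect_format` and the suffix table are assumptions made for illustration and are not part of Hadoop-BAM; in particular, the `_qseq.txt` suffix reflects a common Illumina naming pattern, and the Goby format is left out because its file naming is not described here.

```python
import os

# Hypothetical suffix table: format names come from the list above,
# the suffixes themselves are assumptions.
KNOWN_SUFFIXES = [
    (".sam", "sam"),
    (".bam", "bam"),
    (".fastq", "fastq"),
    (".fq", "fastq"),
    ("_qseq.txt", "qseq"),
]


def detect_format(path):
    """Guess a read file format from its name, ignoring a trailing .gz."""
    name = os.path.basename(path).lower()
    if name.endswith(".gz"):
        name = name[:-3]
    for suffix, fmt in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            return fmt
    raise ValueError("unsupported format: %r" % path)
```

A converter could call `detect_format` on both the input and the output path and then pick the matching reader and writer classes.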
How to integrate Hadoop tools with Galaxy?
Implement, test, and package so that running Hadoop commands and programs from Galaxy becomes trivial.
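One possible shape for this is a thin wrapper script that a Galaxy tool calls. The following is a minimal sketch, assuming the `hadoop` launcher is on the PATH. `build_hadoop_command` and `run_hadoop` are hypothetical helpers, the argument order simply mirrors the untested command quoted earlier in these notes, and the arguments after the tool name are left to the caller because they are not specified here.

```python
import subprocess


def build_hadoop_command(jar, tool, tool_args, reduce_tasks=None):
    """Assemble an argument list for the hadoop launcher (hypothetical helper).

    The layout follows the command quoted in these notes and has not been
    verified against a real Hadoop installation.
    """
    cmd = ["hadoop"]
    if reduce_tasks is not None:
        cmd += ["-D", "mapred.reduce.tasks=%d" % reduce_tasks]
    cmd += ["jar", jar, tool]
    cmd += list(tool_args)
    return cmd


def run_hadoop(cmd):
    """Run the assembled command and return its exit status."""
    return subprocess.call(cmd)
```

Passing the command as a list avoids shell quoting problems with file names supplied by Galaxy. A Galaxy tool definition could then invoke such a script from its `<command>` element.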
Learn about options for automating the operation of sequencing centers, such as https://github.com/SciLifeLab/bcbb/blob/master/nextgen/README.md
Investigate how suitable iRODS would be for cloud computing, and investigate compatibility issues with, e.g., HDFS and S3.
Document technologies and current status; identify bottlenecks, problems, opportunities, etc. Summarize in a report to COST. An extended version could possibly be turned into a manuscript for submission to a journal.