Goal: Analyze existing or new variant-calling pipelines in terms of accuracy and efficiency. We will provide both simulated and real datasets; each kind has its own advantages and limitations.
Pipelines must also be described in terms of resource usage, e.g., memory consumption, CPU usage, running time, and energy.
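As a rough illustration of how such resource figures could be collected, the sketch below runs one pipeline command and records its wall-clock time, CPU time, and peak memory. It assumes a Unix-like system (Python's resource module is not available on Windows), the function name measure is made up for this example, and energy is not covered because it needs platform-specific tooling.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd (a list of arguments) and return basic resource figures.

    Hypothetical helper for illustration; not part of any pipeline.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User + system CPU time consumed by child processes during this call.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        # Peak resident set size over ALL children waited for so far,
        # in KiB on Linux and in bytes on macOS.
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Placeholder command; replace with the actual pipeline invocation.
    print(measure([sys.executable, "-c", "print('pipeline step')"]))
```

Because max_rss is a running maximum over all child processes, measuring each pipeline in a fresh Python process gives cleaner per-pipeline numbers.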
Input to each pipeline is one or more files of reads (e.g., in FASTQ format). Output should be a table of called variants (e.g., in VCF format) and/or mapped reads (e.g., in SAM format).
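For orientation, a minimal sketch of reading the input side is shown here. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines); the function names read_fastq and summarize are invented for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Records are located by line count, not by a leading '@',
        # because a quality string may itself start with '@'.
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header.strip())
        yield header[1:].strip(), seq, qual


def summarize(handle):
    """Return (number of reads, total number of bases)."""
    reads = 0
    bases = 0
    for _name, seq, _qual in read_fastq(handle):
        reads += 1
        bases += len(seq)
    return reads, bases


if __name__ == "__main__":
    example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n##\n")
    print(summarize(example))
```

Read and base counts of this kind are useful to report alongside timing figures, since resource usage scales with input size.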
The variation files included for the human datasets contain the full set of variants from which a random subset was chosen to create a diploid genome.
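One possible way to quantify accuracy is to compare called variants against a reference set, as sketched below. This is an illustration under assumptions: the function names are invented, variants are matched exactly on chromosome, position, REF, and ALT (no normalization of indel representation, no genotype comparison), and the truth set is taken to be the variants actually present in the simulated genome. Since the variation file described above is a superset of those variants, comparing against it directly would understate recall; calls absent from it are still informative as false positives.

```python
def load_variants(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    variants = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record with several ALT alleles contributes one key per allele.
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def score(called, truth):
    """Return true/false positive and false negative counts with precision and recall."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Tiny made-up records, for illustration only.
    truth = load_variants([
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t100\t.\tA\tG",
        "chr1\t200\t.\tC\tT",
    ])
    called = load_variants([
        "chr1\t100\t.\tA\tG",
        "chr1\t300\t.\tG\tA",
    ])
    print(score(called, truth))
```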
Bacterial data (simulated from a Wolbachia endosymbiont species with a single chromosome)
Wolbachia endosymbiont of Culex quinquefasciatus Pel chromosome, complete genome
ACCESSION NC_010981 1482455 bp DNA circular BCT 23-DEC-2012
Yeast data (Saccharomyces cerevisiae, S288C strain)
Saccharomyces cerevisiae, S288C strain (assembly 02-Sep-2011 size 3.6M)
For each dataset analyzed, perform the following steps separately: