Goal: Analyze existing or new pipelines for variant calling in terms of accuracy and efficiency. We will provide both simulated and real datasets. Both have their advantages and limitations.

  1. Simulated data: We know the truth and can quantify the correctness of the predicted variations. However, some data properties (number and distribution of errors) may not correspond to those of real data.
  2. Real data: Much more relevant, but we cannot estimate the accuracy of the approaches, as we do not know the true results.

Pipelines will also need to be described in terms of resource usage, e.g., memory consumption, CPU usage, time, energy, etc.
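As an illustration of how some of these figures could be collected, the following is a minimal sketch that wraps a pipeline command and records wall-clock time, CPU time, and peak memory. It assumes a Unix-like system (Python's standard `resource` module is not available on Windows); the function name `measure_run` is hypothetical, and energy consumption is not covered.

```python
import resource
import subprocess
import sys
import time


def measure_run(command):
    """Run `command` and return (wall_seconds, cpu_seconds, peak_rss).

    Unix-only sketch. The unit of peak_rss is platform dependent
    (kilobytes on Linux, bytes on macOS), and the value is the largest
    resident set size among all child processes waited for so far.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall_seconds = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time (user + system) consumed by the child during this call.
    cpu_seconds = (after.ru_utime - before.ru_utime) + (
        after.ru_stime - before.ru_stime
    )
    return wall_seconds, cpu_seconds, after.ru_maxrss


if __name__ == "__main__":
    # Placeholder command; replace with the actual pipeline invocation.
    print(measure_run([sys.executable, "-c", "pass"]))
```

A ratio of CPU time to wall-clock time well above 1 gives a rough indication of the level of parallelism that was achieved.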

The input to each pipeline is one or more files of reads (e.g. in FASTQ format). The output should be a table of called variants (e.g. in VCF format) and/or mapped reads (e.g. in SAM format).

Important Dates (tentative!)

  • 14.12.2012: Simulated data made available
  • 7.1.2013: Submission system opens
  • 28.2.2013: Submission closes
  • 29.3.2013: Performance evaluation announced
  • May / June: Workshop in Udine, Italy


  • Precision / recall measures are computed for the read mapping results as follows.
    • Artificial genomes are generated in such a way that their mapping to the reference is retained.
    • For each read generated from an artificial genome, we keep its original location in the artificial genome and can therefore map it to the reference.
    • With this information, we can compute false positive (FP) and true positive (TP) counts.
    • We also generate reads from other genomes to enable computing false negative (FN) and true negative (TN) counts.
  • Precision / recall measures are computed for the variant calling results as follows.
    • We keep the true variants chosen to generate the artificial genome.
    • Predicted variants are compared to the true variants to compute TP, FP, and FN counts.
    • Note that comparing variants directly is tricky, as predictions typically differ slightly from the true variants. We plan to use thresholds for variant similarity and to plot precision / recall values for several thresholds.
  • Resource usage
    • We ask all teams to report the amount of euros spent on their variant calling run.
      • This is computed as follows: i) Estimate the current purchase price of a system similar to the one on which the software was executed. Say this amount is X euros. ii) Estimate the proportion of the system's lifetime used for the variant calling run, assuming a lifetime of 4 years. Say this proportion is Y. The variant calling run then cost X*Y euros.
    • In addition, peak space consumption, wall-clock time, level of parallelism, and specifications of the system used for the run should be reported.
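The variant calling comparison in the list above can be sketched as follows. The page does not fix the similarity measure, so this example assumes the simplest possible one: a predicted variant matches a true variant if both lie on the same chromosome within a given number of bases. The function names and the `(chromosome, position)` representation are hypothetical, and the actual evaluation may use different criteria.

```python
from collections import defaultdict


def count_matches(true_variants, predicted_variants, tolerance):
    """Greedily match predictions to true variants by position.

    Variants are (chromosome, position) tuples. A prediction counts as a
    true positive if an unmatched true variant on the same chromosome
    lies within `tolerance` bases of it. Returns (tp, fp, fn).
    """
    unmatched = defaultdict(list)
    for chrom, pos in true_variants:
        unmatched[chrom].append(pos)

    tp = 0
    for chrom, pos in predicted_variants:
        candidates = unmatched[chrom]
        close = [p for p in candidates if abs(p - pos) <= tolerance]
        if close:
            # Consume the nearest true variant so it is matched only once.
            candidates.remove(min(close, key=lambda p: abs(p - pos)))
            tp += 1

    fp = len(predicted_variants) - tp
    fn = len(true_variants) - tp
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up toy data, only to show how the threshold changes the counts.
    truth = [("20", 100), ("20", 500), ("20", 900)]
    calls = [("20", 100), ("20", 503), ("20", 700)]
    for tolerance in (0, 5):
        tp, fp, fn = count_matches(truth, calls, tolerance)
        print(tolerance, precision_recall(tp, fp, fn))
```

Running the comparison for several tolerance values yields one precision / recall point per threshold, which is the kind of plot mentioned above.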
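The cost estimate X*Y from the resource usage item above can be written out as a short calculation. This sketch assumes that the proportion Y is taken as the wall-clock time of the run divided by the lifetime of the system, with a year counted as 365 days; the function name `run_cost_euros` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day years


def run_cost_euros(purchase_price_euros, wall_clock_seconds, lifetime_years=4):
    """Return X * Y: purchase price times the fraction of lifetime used."""
    lifetime_seconds = lifetime_years * SECONDS_PER_YEAR
    fraction_used = wall_clock_seconds / lifetime_seconds  # this is Y
    return purchase_price_euros * fraction_used  # this is X * Y
```

For example, a 24-hour run on a system with an estimated purchase price of 4000 euros uses 1/1460 of a 4-year lifetime, so `run_cost_euros(4000, 24 * 3600)` gives about 2.74 euros.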

Available simulated data

Human data

The variation files included for the human datasets contain the full set of variants from which a random subset was chosen to create a diploid genome.

  1. Artificial Chromosome 20 (20000000 reads of length 70 bp)
  2. Artificial Chromosome 20 (another set of reads) (20000000 reads of length 70 bp)
    1. Reference genome for chromosome 20: hs_ref_GRCH37.p5_chr20.fa
  3. Artificial Chromosome 2 (80000000 reads of length 70 bp)
    1. Reference genome for chromosome 2: hs_ref_GRCh37.p5_chr2.fa

Bacterial data (simulated from a Wolbachia endosymbiont species with a single chromosome)

  1. Reads from the Wolbachia genome: 703838 reads of length 100 bp, encoding Illumina 1.5

Wolbachia endosymbiont of Culex quinquefasciatus Pel chromosome, complete genome

  ACCESSION       NC_010981            1482455 bp    DNA     circular BCT 23-DEC-2012

Yeast data (Saccharomyces cerevisiae, S288C strain)

  1. Reads from the yeast genome: 5782974 reads of length 100 bp, encoding Illumina 1.5

Saccharomyces cerevisiae, S288C strain (assembly 02-Sep-2011 size 3.6M)
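The bacterial and yeast reads above are listed with Illumina 1.5 quality encoding, which stores Phred scores as ASCII characters with an offset of 64 rather than the offset of 33 used by Sanger-style FASTQ. Tools that assume the offset of 33 will misread the qualities unless they are told otherwise or the file is converted. The following is a minimal sketch of decoding and converting a quality string; the function names are hypothetical.

```python
ILLUMINA_15_OFFSET = 64  # Phred+64, used by Illumina pipelines 1.3 up to 1.7
SANGER_OFFSET = 33       # Phred+33, used by Sanger FASTQ and Illumina 1.8+


def decode_qualities(quality_string, offset=ILLUMINA_15_OFFSET):
    """Return the Phred scores encoded in a FASTQ quality line."""
    return [ord(char) - offset for char in quality_string]


def to_sanger(quality_string):
    """Re-encode a Phred+64 quality line as Phred+33."""
    shift = ILLUMINA_15_OFFSET - SANGER_OFFSET
    return "".join(chr(ord(char) - shift) for char in quality_string)
```

Note that in Illumina 1.5 data the character `B` (score 2) is used as a marker for low-quality read ends rather than as an ordinary quality value.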

Submission guidelines (tentative)

For each data set analyzed, carry out the following steps separately:

  1. Store your read mappings (sam file) and called variants (vcf file) to some web address compressed into a single file called
  2. Send an email to address with the following formatting:
    1. Subject line: Name of the data set analyzed from above list (e.g. Artificial Chromosome 2)
    2. Content: Link to your file, and a README file as attachment following the formatting given in this example README file.
    3. Please do not send the result files themselves as attachments!
Last modified: 2013/01/23 17:48 by Krista Longi