Sequence Squeeze

In 2011, the Pistoia Alliance launched a competition offering US$15,000 to the developer of the best novel open-source next-generation sequencing (NGS) compression algorithm. The competition, supported by Amazon Web Services, aimed to spur the development of new methods for compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.

Several aspects made the competition unique:

  • The judging panel comprised scientists from the world’s leading sequencing centers: BGI, the Broad Institute, and the Wellcome Trust Sanger Institute.
  • A public leaderboard let participants see how their entries compared with others. According to participants, this led to better-quality entries, as each competitor tried to improve on the others.
  • Entries were judged on five key dimensions: compression ratio (a measure of how much the algorithm squeezed the data); compression time; decompression time; compression memory; and decompression memory.
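
The five judged dimensions can be illustrated with a small measurement harness. The sketch below is not the competition's own harness. It rests on several assumptions: `zlib` stands in for a real NGS compressor, the ratio is taken to be compressed size divided by original size (so smaller is better), and memory is the peak of Python-level allocations reported by `tracemalloc` rather than process-level memory. The names `measure` and `score` are hypothetical.

```python
import time
import tracemalloc
import zlib


def measure(func, data):
    """Run func(data); return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def score(original, compress=zlib.compress, decompress=zlib.decompress):
    """Collect the five judged dimensions plus a losslessness check."""
    packed, c_time, c_mem = measure(compress, original)
    unpacked, d_time, d_mem = measure(decompress, packed)
    return {
        # Assumed definition: compressed size / original size.
        "ratio": len(packed) / len(original),
        "compress_time": c_time,
        "compress_memory": c_mem,
        "decompress_time": d_time,
        "decompress_memory": d_mem,
        # The round trip must reproduce the input exactly.
        "lossless": unpacked == original,
    }


# A tiny synthetic FASTQ record repeated many times (illustrative data only).
sample = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" * 1000
result = score(sample)
```

Any pair of callables that map bytes to bytes can be passed as `compress` and `decompress`, so the same harness could compare several candidate compressors on identical input.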

The Winner: James Bonfield

The winner of the contest was James Bonfield, a member of the sequencing informatics team at the Wellcome Trust Sanger Institute. Bonfield submitted a set of algorithms, all of which performed strongly on the top three judged elements (compression ratio, compress time, and decompress time). All of Bonfield’s entries took into account the importance of preserving alignment data in addition to raw FASTQ output, using fqzcomp as the FASTQ compressor and sam_comp for SAM/BAM output.

The Judging Panel

A panel of expert judges was selected from leading genomics institutes in the UK, USA and China.

  • Guy Coates, Wellcome Trust Sanger Institute
  • Yingrui Li, BGI-Shenzhen
  • Nick Lynch, Pistoia Alliance (panel chair)
  • Tim Fennell, Broad Institute

Leaderboard

Times are in seconds; memory is average total use in kilobytes. Both were measured using the time command with the %e and %K format options. All valid entries are shown (an entry is valid if it ran successfully and produced at least one correct, fully matching result).

Columns, in order: name, institute and entry ID (link to source); compression ratio; compression time; compression memory; decompression time; decompression memory; header mismatches; sequence mismatches; quality mismatches.
Matt Mahoney, Dell Inc.(96)0.0287905.795293280394.6352947040024033620
Armando J. Pinho, IEETA / Universidade de Aveiro(78)0.05363035.1752003010.025200166295471662954816629548
Armando J. Pinho, IEETA / Universidade de Aveiro(79)0.05363018.1152003026.365200166287241662872416628724
Armando J. Pinho, IEETA / Universidade de Aveiro(69)0.05463404.17689868803389.7368984480165124601651246116512461
James Bonfield, Sanger Institute(101)0.11413280.2413235808325.254381552000
James Bonfield, Sanger Institute(104)0.114210299.9913272736342.794381552000
James Bonfield, Wellcome Trust Sanger Institute(17)0.1154208.42932976293.059328800024030818
James Bonfield, Sanger Institute(97)0.11623276.8313240864318.464186912000
Matt Mahoney, Dell Inc.(99)0.11661218.885398224769.265399808000
Matt Mahoney, Dell Inc.(90)0.11831062.85463456610.635464128000
James Bonfield, Sanger Institute(86)0.124511585.8913239776334.891635408000
Matt Mahoney, Dell Inc.(85)0.1254930.875716624653.645756576000
Matt Mahoney, Dell Inc.(82)0.1255924.385738432660.855769200000
James Bonfield, Sanger Institute(62)0.127310476.9713231488335.62063344000
James Bonfield, Sanger Institute(60)0.127322063.5713516544123.864198048238340622383406223834062
Davide Cittaro, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute(34)0.13233194.872285125398.991935840024032852
Davide Cittaro, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute(32)0.13285020.041235046250.111243201319193024032852
Armando J. Pinho, IEETA / Universidade de Aveiro(108)0.169311820.432199924811245.7921998992000
Armando J. Pinho, IEETA / Universidade de Aveiro(102)0.16958107.2175054088272.3317505184000
Armando J. Pinho, IEETA / Universidade de Aveiro(98)0.16958123.18338892648091.4533889056000
Armando J. Pinho, IEETA / Universidade de Aveiro(100)0.16967814.66171121927903.517112000000
Armando J. Pinho, IEETA / Universidade de Aveiro(103)0.16984819.01169802724887.316980448000
James Bonfield, Sanger Institute(72)0.1709525.4122240640600.2922231504000
Daniel Jones, University of Washington(105)0.1718306.672747120333.761631600000
Daniel Jones, University of Washington(92)0.172244.892472832284.791638800000
Armando J. Pinho, IEETA / Universidade de Aveiro(80)0.17268184.4487251688282.318724928000
James Bonfield, Sanger Institute(52)0.1727429.9421285824483.3521284640000
Daniel Jones, University of Washington(64)0.1729392.082097280484.241543568000
James Bonfield, Sanger Institute(66)0.1729362.71937056434.21935472000
Armando J. Pinho, IEETA / Universidade de Aveiro(71)0.1738509.4187251688573.918724944000
Daniel Jones, University of Washington(47)0.17432501699328495.061699104000
Daniel Jones, University of Washington(91)0.1743154.581581120291.661580960000
James Bonfield, Sanger Institute(36)0.1744347.641877504472.031877456000
Daniel Jones, University of Washington(44)0.1748270.4220809601.282206160240403692404069924040699
Daniel Jones, University of Washington(45)0.1748272.662015696491.612121696000
Armando J. Pinho, IEETA / Universidade de Aveiro(87)0.1754949.9443256805101.934325440000
Seth Hillbrand, Columbia University(30)0.1754741.793685616726.043621744000
Matt Mahoney, Dell Inc.(70)0.17561149.757908481275.285937152000
Daniel Jones, University of Washington(38)0.1758234.571866304448.681958336000
James Bonfield, Wellcome Trust Sanger Institute(16)0.1762228.55933248328.48933200000
Seth Hillbrand, Columbia University(25)0.1767792.241545808739.961514176000
Matt Mahoney, Dell Inc.(67)0.1771065.5158635681203.095933024000
James Bonfield, Sanger Institute(83)0.177112.69244736142.05236032000
Daniel Jones, University of Washington(37)0.1774207.7838592396.21889680000
James Bonfield, Sanger Institute(51)0.1775128.25334800150.26333616000
Seth Hillbrand, Columbia University(24)0.1779596.331423024544.761391296000
Seth Hillbrand, Columbia University(18)0.1781796.833750080851.563718256000
Daniel Jones, University of Washington(29)0.1793203.581063712571.361276576000
James Bonfield, Sanger Institute(61)0.1797109.975664156.9874480000
Seth Hillbrand, Columbia University(14)0.1797859.581686176827.451659056000
Yongwook Choi, J. Craig Venter Institute(77)0.18031445.2223481601591.642348160000
James Bonfield, Sanger Institute(35)0.1803164.5298768219.1398704000
Seth Hillbrand, Columbia University(13)0.1813794.662744944765.042690816000
James Bonfield, Wellcome Trust Sanger Institute(15)0.1818137.56149888187.31149840000
Seth Hillbrand, Columbia University(12)0.1824518.8210897392474.4510639488000
Daniel Jones, University of Washington(23)0.1841274.991010736395.091147136000
James Bonfield, Sanger Institute(33)0.1847135.9535760189.2735680000
James Bonfield, Sanger Institute (+denizens of encode.ru)(8)0.1878141.2423360164.19411936000
Armando J. Pinho, IEETA / Universidade de Aveiro(93)0.18842322.17687362394.9968480000
Ibrahim Numanagic, Simon Fraser Uni(107)0.1993691.2425973392414.91606787215367500
James Bonfield, Wellcome Trust Sanger Institute(7)0.2206188.3836512100.9116016000
Markus Krisch, i-novation.de(109)0.2241449.6224651072454.915526784000
Inbal Landsberg, Or Peled, Barak Yacov, Yonatan Amir, Dan Benjamin and Ron Reiter, Interdisciplinary Center (IDC) Herzliya(95)0.22895565.2365942883850.37597331219879924032403
Ron Reiter, Interdisciplinary Center (IDC) Herzliya(9)0.24738.2727664351.2116032000
Ryan Braganza, Intersect Australia(39)0.25636753.8814272296.7854900656238340622383406223834062
Competition Baseline, SequenceSqueeze(6)0.30071020.975200104.55008000
Ryan Braganza, Intersect Australia(28)0.3011299.6615040331.0213472000
Ryan Braganza, Intersect Australia(43)0.30721595.3416144493.953856000
Ryan Braganza, Intersect Australia(40)0.30721555.38189601120.2959433232238340622383406223834062
Ryan Braganza, Intersect Australia(41)0.30721540.5817152111859433792238340622383406223834062
Ryan Braganza, Intersect Australia(42)0.30721550.4618976997.7217120238340622383406223834062
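
The three mismatch columns report how far a decompressed file departs from the original. As a rough illustration, the sketch below compares an original FASTQ file with its round-tripped copy and counts differing headers, sequences, and quality strings. It assumes simple four-line FASTQ records and counts at most one mismatch per field per record; how the competition actually counted mismatches is not stated here, and the names `read_records` and `count_mismatches` are hypothetical.

```python
from itertools import zip_longest

FIELDS = ("header", "sequence", "quality")


def read_records(lines):
    """Group lines into (header, sequence, quality) tuples.

    Assumes four-line FASTQ records: header, sequence, '+', quality.
    """
    lines = [line.rstrip("\n") for line in lines]
    return [
        (lines[i], lines[i + 1], lines[i + 3])
        for i in range(0, len(lines) - 3, 4)
    ]


def count_mismatches(original_lines, restored_lines):
    """Count records whose header, sequence, or quality string differs."""
    counts = dict.fromkeys(FIELDS, 0)
    missing = (None, None, None)  # a record absent on one side mismatches in every field
    pairs = zip_longest(
        read_records(original_lines),
        read_records(restored_lines),
        fillvalue=missing,
    )
    for orig, rest in pairs:
        for field, a, b in zip(FIELDS, orig, rest):
            if a != b:
                counts[field] += 1
    return counts


original = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "FFFF\n"]
restored = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "FFF#\n"]
counts = count_mismatches(original, restored)
# counts == {"header": 0, "sequence": 0, "quality": 1}
```

Under this counting scheme, a fully matching result is one where all three counts are zero.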