Sequence Squeeze

In 2011, the Pistoia Alliance launched a competition offering US$15,000 to the developer of the best novel open-source compression algorithm for next-generation sequencing (NGS) data. The competition, supported by Amazon Web Services, aimed to spur the development of new methods for compressing sequence reads and their quality scores that preserve 100% of the information while achieving much-improved linear (or, even better, non-linear) compression ratios.

Several aspects made the competition unique:

  • The judging panel comprised scientists from the world’s leading sequencing centers: BGI, the Broad Institute, and the Wellcome Trust Sanger Institute.
  • A public leaderboard let participants see how their entries compared with others. According to participants, this led to better-quality entries as each competitor attempted to improve on the others.
  • Entries were judged on five key dimensions: compression ratio (a measure of how much the algorithm squeezed the data), compression time, decompression time, compression memory, and decompression memory.
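
As a rough illustration of the five dimensions, the sketch below scores a stand-in compressor (Python's zlib, not any competition entry) on a tiny synthetic FASTQ sample. It assumes the compression ratio is compressed size divided by original size, which the text does not state explicitly. The function names `measure` and `score_entry` are hypothetical, and tracemalloc only tracks Python-level allocations, so the memory figures are not comparable to process-level measurements.

```python
import time
import tracemalloc
import zlib


def measure(func, data):
    """Call func(data) and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def score_entry(original, compress=zlib.compress, decompress=zlib.decompress):
    """Score one compressor on the five dimensions, plus a lossless check."""
    packed, c_time, c_mem = measure(compress, original)
    restored, d_time, d_mem = measure(decompress, packed)
    return {
        # Assumption: ratio = compressed size / original size (lower is better).
        "ratio": len(packed) / len(original),
        "compress_time": c_time,
        "compress_memory": c_mem,
        "decompress_time": d_time,
        "decompress_memory": d_mem,
        # The round trip must reproduce the input exactly.
        "lossless": restored == original,
    }


# Synthetic four-line FASTQ record repeated many times (illustrative data only).
sample = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" * 1000
scores = score_entry(sample)
print(scores)
```

Any compress/decompress pair that maps bytes to bytes can be passed in place of the zlib defaults.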

The Winner: James Bonfield

The winner of the contest was James Bonfield, a member of the sequencing informatics team at the Wellcome Trust Sanger Institute. Bonfield submitted a cluster of algorithms that all delivered high performance on the top three judged elements (compression ratio, compress time, and decompress time). All of Bonfield’s algorithms took into account the importance of preserving alignment data in addition to raw FASTQ output, using fqzcomp to compress FASTQ and sam_comp for SAM/BAM output.

The Judging Panel

A panel of expert judges was selected from leading genomics institutes in the UK, USA and China.

  • Guy Coates, Wellcome Trust Sanger Institute
  • Yingrui Li, BGI-Shenzhen
  • Nick Lynch, Pistoia Alliance (panel chair)
  • Tim Fennell, Broad Institute

Leaderboard

Times are in seconds; memory is the average total use in kilobytes. Both were measured using the time command with the %e and %K format options. All valid entries are shown (an entry is valid if it ran successfully and produced at least one correct, fully matching result).
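
A measurement of this kind can be sketched as follows, assuming GNU time is installed at /usr/bin/time (the path, the helper names, and the example compressor command are all hypothetical; the exact invocation used by the competition is not given). In GNU time, `%e` is elapsed wall-clock seconds and `%K` is the average total memory use in kilobytes, and the report is written to standard error.

```python
import subprocess

# GNU time format string: elapsed seconds, then average total memory in KB.
TIME_FORMAT = "%e %K"


def time_command(argv, time_binary="/usr/bin/time"):
    """Wrap a command so that GNU time reports '%e %K' on stderr."""
    return [time_binary, "-f", TIME_FORMAT, *argv]


def parse_time_report(stderr_text):
    """Parse the last stderr line, e.g. '905.79 5293280', into (seconds, KB)."""
    last_line = stderr_text.strip().splitlines()[-1]
    elapsed, avg_kb = last_line.split()
    return float(elapsed), int(avg_kb)


def run_timed(argv):
    """Run a command under GNU time and return (seconds, KB).

    Assumes the wrapped program prints nothing to stderr after it exits,
    so the time report is the final stderr line.
    """
    done = subprocess.run(time_command(argv), stderr=subprocess.PIPE, text=True)
    return parse_time_report(done.stderr)


# Hypothetical usage (not executed here):
#   seconds, kbytes = run_timed(["./my_compressor", "reads.fastq", "reads.bin"])
```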

| Name, institute & entry ID | Compress. ratio | Compress. time (s) | Compress. memory (KB) | Decompress. time (s) | Decompress. memory (KB) | Header mismatches | Sequence mismatches | Quality mismatches |
|---|---|---|---|---|---|---|---|---|
| Matt Mahoney, Dell Inc. (96) | 0.0287 | 905.79 | 5293280 | 394.63 | 5294704 | 0 | 0 | 24033620 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (78) | 0.0536 | 3035.17 | 5200 | 3010.02 | 5200 | 16629547 | 16629548 | 16629548 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (79) | 0.0536 | 3018.11 | 5200 | 3026.36 | 5200 | 16628724 | 16628724 | 16628724 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (69) | 0.0546 | 3404.17 | 68986880 | 3389.73 | 68984480 | 16512460 | 16512461 | 16512461 |
| James Bonfield, Sanger Institute (101) | 0.1141 | 3280.24 | 13235808 | 325.25 | 4381552 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (104) | 0.1142 | 10299.99 | 13272736 | 342.79 | 4381552 | 0 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute (17) | 0.1154 | 208.42 | 932976 | 293.05 | 932880 | 0 | 0 | 24030818 |
| James Bonfield, Sanger Institute (97) | 0.1162 | 3276.83 | 13240864 | 318.46 | 4186912 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc. (99) | 0.1166 | 1218.88 | 5398224 | 769.26 | 5399808 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc. (90) | 0.1183 | 1062.8 | 5463456 | 610.63 | 5464128 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (86) | 0.1245 | 11585.89 | 13239776 | 334.89 | 1635408 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc. (85) | 0.1254 | 930.87 | 5716624 | 653.64 | 5756576 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc. (82) | 0.1255 | 924.38 | 5738432 | 660.85 | 5769200 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (62) | 0.1273 | 10476.97 | 13231488 | 335.6 | 2063344 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (60) | 0.1273 | 22063.57 | 13516544 | 123.86 | 4198048 | 23834062 | 23834062 | 23834062 |
| Davide Cittaro, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute (34) | 0.1323 | 3194.87 | 228512 | 5398.99 | 193584 | 0 | 0 | 24032852 |
| Davide Cittaro, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute (32) | 0.1328 | 5020.04 | 123504 | 6250.11 | 124320 | 1319193 | 0 | 24032852 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (108) | 0.1693 | 11820.43 | 21999248 | 11245.79 | 21998992 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (102) | 0.1695 | 8107.2 | 17505408 | 8272.33 | 17505184 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (98) | 0.1695 | 8123.18 | 33889264 | 8091.45 | 33889056 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (100) | 0.1696 | 7814.66 | 17112192 | 7903.5 | 17112000 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (103) | 0.1698 | 4819.01 | 16980272 | 4887.3 | 16980448 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (72) | 0.1709 | 525.41 | 22240640 | 600.29 | 22231504 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (105) | 0.1718 | 306.67 | 2747120 | 333.76 | 1631600 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (92) | 0.172 | 244.89 | 2472832 | 284.79 | 1638800 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (80) | 0.1726 | 8184.44 | 8725168 | 8282.31 | 8724928 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (52) | 0.1727 | 429.94 | 21285824 | 483.35 | 21284640 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (64) | 0.1729 | 392.08 | 2097280 | 484.24 | 1543568 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (66) | 0.1729 | 362.7 | 1937056 | 434.2 | 1935472 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (71) | 0.173 | 8509.41 | 8725168 | 8573.91 | 8724944 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (47) | 0.1743 | 250 | 1699328 | 495.06 | 1699104 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (91) | 0.1743 | 154.58 | 1581120 | 291.66 | 1580960 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (36) | 0.1744 | 347.64 | 1877504 | 472.03 | 1877456 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (44) | 0.1748 | 270.42 | 2080960 | 1.28 | 2206160 | 24040369 | 24040699 | 24040699 |
| Daniel Jones, University of Washington (45) | 0.1748 | 272.66 | 2015696 | 491.61 | 2121696 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (87) | 0.175 | 4949.94 | 4325680 | 5101.93 | 4325440 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University (30) | 0.1754 | 741.79 | 3685616 | 726.04 | 3621744 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc. (70) | 0.1756 | 1149.7 | 5790848 | 1275.28 | 5937152 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (38) | 0.1758 | 234.57 | 1866304 | 448.68 | 1958336 | 0 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute (16) | 0.1762 | 228.55 | 933248 | 328.48 | 933200 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University (25) | 0.1767 | 792.24 | 1545808 | 739.96 | 1514176 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc. (67) | 0.177 | 1065.51 | 5863568 | 1203.09 | 5933024 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (83) | 0.177 | 112.69 | 244736 | 142.05 | 236032 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (37) | 0.1774 | 207.7 | 838592 | 396.21 | 889680 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (51) | 0.1775 | 128.25 | 334800 | 150.26 | 333616 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University (24) | 0.1779 | 596.33 | 1423024 | 544.76 | 1391296 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University (18) | 0.1781 | 796.83 | 3750080 | 851.56 | 3718256 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (29) | 0.1793 | 203.58 | 1063712 | 571.36 | 1276576 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (61) | 0.1797 | 109.9 | 75664 | 156.98 | 74480 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University (14) | 0.1797 | 859.58 | 1686176 | 827.45 | 1659056 | 0 | 0 | 0 |
| Yongwook Choi, J. Craig Venter Institute (77) | 0.1803 | 1445.22 | 2348160 | 1591.64 | 2348160 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (35) | 0.1803 | 164.52 | 98768 | 219.13 | 98704 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University (13) | 0.1813 | 794.66 | 2744944 | 765.04 | 2690816 | 0 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute (15) | 0.1818 | 137.56 | 149888 | 187.31 | 149840 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University (12) | 0.1824 | 518.82 | 10897392 | 474.45 | 10639488 | 0 | 0 | 0 |
| Daniel Jones, University of Washington (23) | 0.1841 | 274.99 | 1010736 | 395.09 | 1147136 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (33) | 0.1847 | 135.95 | 35760 | 189.27 | 35680 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (+denizens of encode.ru) (8) | 0.1878 | 141.2 | 423360 | 164.19 | 411936 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro (93) | 0.1884 | 2322.17 | 68736 | 2394.99 | 68480 | 0 | 0 | 0 |
| Ibrahim Numanagic, Simon Fraser Uni (107) | 0.1993 | 691.24 | 25973392 | 414.91 | 6067872 | 153675 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute (7) | 0.2206 | 188.38 | 36512 | 100.91 | 16016 | 0 | 0 | 0 |
| Markus Krisch, i-novation.de (109) | 0.2241 | 449.62 | 24651072 | 454.91 | 5526784 | 0 | 0 | 0 |
| Inbal Landsberg, Or Peled, Barak Yacov, Yonatan Amir, Dan Benjamin and Ron Reiter, Interdisciplinary Center (IDC) Herzliya (95) | 0.2289 | 5565.23 | 6594288 | 3850.37 | 5973312 | 198799 | 2403 | 2403 |
| Ron Reiter, Interdisciplinary Center (IDC) Herzliya (9) | 0.24 | 738.27 | 27664 | 351.21 | 16032 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia (39) | 0.2563 | 6753.88 | 14272 | 296.78 | 54900656 | 23834062 | 23834062 | 23834062 |
| Competition Baseline, SequenceSqueeze (6) | 0.3007 | 1020.97 | 5200 | 104.5 | 5008 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia (28) | 0.301 | 1299.66 | 15040 | 331.02 | 13472 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia (43) | 0.3072 | 1595.34 | 16144 | 493.9 | 53856 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia (40) | 0.3072 | 1555.38 | 18960 | 1120.29 | 59433232 | 23834062 | 23834062 | 23834062 |
| Ryan Braganza, Intersect Australia (41) | 0.3072 | 1540.58 | 17152 | 1118 | 59433792 | 23834062 | 23834062 | 23834062 |
| Ryan Braganza, Intersect Australia (42) | 0.3072 | 1550.46 | 18976 | 997.72 | 17120 | 23834062 | 23834062 | 23834062 |
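
The header, sequence and quality mismatch columns suggest a per-record comparison between the original FASTQ file and the file recovered after decompression. How the competition actually counted mismatches is not described, so the following is only a sketch under stated assumptions: records are strict four-line FASTQ, they are compared in order, and a record missing from either file counts as a mismatch in all three fields. The names `read_fastq` and `count_mismatches` are hypothetical.

```python
from itertools import zip_longest

FIELDS = ("header", "sequence", "quality")


def read_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        sequence = next(it, "")
        next(it, "")  # skip the '+' separator line
        quality = next(it, "")
        yield tuple(s.rstrip("\n") for s in (header, sequence, quality))


def count_mismatches(original_lines, restored_lines):
    """Count records whose header, sequence or quality differ after a round trip."""
    counts = dict.fromkeys(FIELDS, 0)
    missing = (None, None, None)  # stands in for a record absent from one file
    pairs = zip_longest(
        read_fastq(original_lines), read_fastq(restored_lines), fillvalue=missing
    )
    for orig, rest in pairs:
        for name, a, b in zip(FIELDS, orig, rest):
            if a != b:
                counts[name] += 1
    return counts


# Illustrative data: the second file alters one quality string.
original = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "FFFF\n"]
restored = ["@r1\n", "ACGT\n", "+\n", "II#I\n", "@r2\n", "GGCC\n", "+\n", "FFFF\n"]
print(count_mismatches(original, restored))
```

Both arguments can be open file objects as well as lists, since the reader only iterates over lines.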