As the volume of next generation sequencing data increases due to significantly higher throughput and lower cost, an urgent need for algorithms to efficiently process the data arises. Many of the methods for these tasks rely on k-mer sets (DNA words of length k) to index the set of sequences in a given dataset. The prevailing method to generate these sets is the minimizers scheme, which indexes by the lexicographically smallest k-mer in each sequence. Due to the typical number of index k-mers, the runtime of many sequence analysis tasks exceeds days to generate k-mer sets and memory requirements are greater than those available on standard servers.
Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis with the hopes that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A universal hitting set is a set of k-mers that hit every sequence of length L. While methods for computing small universal hitting sets have shown great promise in improving such sequence analysis tasks, as compared to minimizers, they are not yet practical for real-world sequencing instances due to their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g. k > 13).
To address this bottleneck, we present two algorithmic innovations to significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers (analog to the number of uncovered elements a set contains in the Set Cover problem); and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. Our algorithm and software, PASHA is the first randomized parallel algorithm for generating near-optimal universal hitting sets--and, importantly, newly handles k > 13. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small factor of the optimal size.
PASHA's runtime and memory-usage improvements are orders of magnitude faster than the current best algorithms, even for practical k-mer lengths (e.g. PASHA takes a week to produce sets for k = 16, which others could not achieve). As with minimizers, we expect our newly-practical construction of universal hitting sets to be adopted in many high-throughput sequence analysis pipelines.
Source code and documentation is available here. Universal hitting sets for k = 5 to k = 16 are available here.
If you find PASHA useful, please cite the following paper:
Ekim, B., Berger, B., Orenstein, Y. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. bioRxiv (2020). To appear in the Springer Proceedings for RECOMB 2020.
Development of PASHA is led by Barış Ekim, collaboratively in the labs of Bonnie Berger at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at Massachusetts Institute of Technology (MIT), and Yaron Orenstein at the Department of Electrical and Computer Engineering at Ben-Gurion University of the Negev (BGU). Should you have any inquiries, please contact Barış Ekim at baris [at] mit [dot] edu.