CIF: Medium: Coding Theory for DNA Storage: Synthesis, Retention, and Reconstruction

$1,200,000FY2022CSENSF

University Of California-San Diego, La Jolla CA

Investigators

Paul H Siegelcontact Alexander Vardy Eitan Yaakobi Ryan C Gabrys

Abstract

New information-storage technologies are needed to accommodate the growing deluge of data being collected and generated by modern society. In the past decade, several experiments have demonstrated that deoxyribonucleic acid (abbreviated DNA) – the molecule that carries the genetic information of living organisms – is a potentially viable storage medium. DNA-based storage would have many attractive features: unprecedented data density, a recording format that will not become obsolete, archival durability over thousands of years, and easy data replication. On the other hand, DNA storage requires fundamentally new methods for encoding data into DNA sequences to make the storage process efficient and reliable. The aim of the project is two-fold: (1) to understand the mathematical limits on the efficiency, reliability, and information density of DNA-based storage, and (2) to develop novel data-encoding and decoding algorithms to help achieve those limits. The project will provide a stimulating research opportunity for undergraduate and graduate students, encouraging teamwork across university boundaries and collaboration across disciplines. The project focuses on coding methods that address critical problems in the key stages of the DNA storage process: efficient synthesis of DNA sequences, stable retention of stored sequences, and reliable data retrieval and reconstruction. Specific objectives are: (1) establish fundamental information-theoretic limits on the storage capacity of DNA using mathematical abstractions of the DNA recording process, (2) develop source coding techniques to minimize the time needed to encode data into synthesized arrays of DNA sequences, (3) design coding algorithms to efficiently enforce constraints on the allowed nucleotide patterns in synthesized DNA sequences to ensure their long-term retention, and (4) develop reconstruction algorithms and error-correcting codes that can recover a set of DNA sequences from an unordered collection of copies that may be corrupted by insertions, deletions, and substitutions of nucleotides. The project develops tools to address classical problems in coding theory and information theory that underlie many aspects of the research. These include the construction of optimal codes for finite-state communication channels with symbol costs, the design of optimal short-length codes that correct multiple insertion and deletion errors, the development of efficient coding techniques that asymptotically approach the capacity of a communication channel with deletion errors, and the analysis of algorithms and codes that enable reconstruction of a sequence from multiple noisy observations, either exactly or within a small list of candidate sequences. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →