Single-molecule sequence assembly and analysis

$2,006,881 · ZIA · FY2023 · HG · NIH

National Human Genome Research Institute

Abstract

In 2022, we finished the first complete sequence of a human genome as part of the Telomere-to-Telomere (T2T) project. However, that genome covered every human chromosome except the Y chromosome (ChrY). The Y chromosome plays critical roles in sexual development and fertility, but it is also one of the most repetitive and difficult-to-sequence chromosomes in the genome. For this reason, it was often excluded from prior genomic studies, including our own. In 2023, we were finally able to finish this last chromosome and provided a full analysis of what had been missing (ref 1).
Building on the T2T project, we next sought to move beyond a single complete reference genome and work towards a pangenome reference that is more inclusive of all genomic ancestries. Along with other members of the Human Pangenome Reference Consortium (HPRC), we evaluated existing sequencing and assembly methods to develop a workflow for the automated assembly of near-complete genomes at scale (ref 2). Following the T2T sequencing recipe, the HPRC chose a combination of PacBio HiFi and Oxford Nanopore ultra-long read sequencing, together with the sequencing of familial trios to resolve complete haplotypes (both techniques were previously developed by the GIS). For phase 1 of the HPRC, approximately 50 genomes were assembled using only the HiFi data, resulting in 100 chromosome-scale haplotypes that were combined into a draft human pangenome reference (ref 3). The study demonstrates that using a pangenome reference can improve accuracy and reduce the bias inherent in genomic analyses that rely on a single reference sequence. However, this first version of the human pangenome is incomplete and still contains some gaps and misassemblies. Complete assemblies are necessary to ensure the absence of false-negative gene losses, a common problem we identified in a variety of draft vertebrate genome assemblies (ref 4).
Moving towards a gapless pangenome requires integrating the ultra-long Nanopore sequencing data, which have already been collected but not yet incorporated into the assemblies. To enable this, the GIS developed a new tool called Verkko, which can automatically assemble complete, T2T haplotypes for a number of human chromosomes (ref 5). Future versions of the pangenome will be based on these more continuous and complete assemblies that integrate both long-read data types.
Complete human genome assemblies from a diverse set of donors are already uncovering new biology. In another collaboration with the HPRC, we analyzed the phase 1 assemblies and identified strong signals of heterologous recombination between different chromosomes of the human genome (ref 6). These data also provided a mechanistic explanation for the formation of Robertsonian chromosomes, which are one of the most common forms of chromosomal fusion in humans and are associated with infertility and Down syndrome.
To enable further exploration of the human pangenome, we continued developing efficient methods for high-throughput genome-to-genome comparison. Building on our past work on minimizers and min-hashing for whole-genome alignment, this year we introduced a new sketching paradigm called minmers (ref 7). Minmers allow for the rapid and unbiased estimation of sequence similarity without the need for costly gapped sequence alignments. We are now integrating minmer sketches into pangenome construction methods such as PGGB, which will enable sensitive pangenome graph construction at the scale of thousands of human haplotypes.
Lastly, we continued to assist other labs in the assembly of non-human genomes, and the GIS remains an active member of the Vertebrate Genomes Project (VGP) and Earth BioGenome Project (EBP), which together aim to sequence the genomes of all eukaryotic life on Earth. These projects are producing extremely valuable genomic datasets that will guide future conservation efforts and enable large-scale comparative genomics. The associated VGP now sits only a few tens of genomes short of its phase 1 goal of completing at least one genome from each of the approximately 270 vertebrate taxonomic orders, and both projects have benefited from genome assembly, validation, and alignment software developed by the GIS in prior years. This year, the GIS published assemblies for multiple Hydractinia genomes in collaboration with the Baxevanis lab (ref 8); new reference genomes for several agriculturally important catfish species with the USDA (ref 9); the sugar beet genome (ref 10); and the genome of the Nile rat, an important model organism for diabetes (ref 11).
In addition to the 11 papers above that were formally published this year, the section has posted 4 preprints to bioRxiv that are currently undergoing peer review, including a new method for structural variant calling from Nanopore data, a study of human centromere variation, a cloud-based assembly pipeline for the EBP, and a new metagenome assembler designed for the PacBio HiFi data type.
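As a rough illustration of the min-hashing idea that the abstract says minmers build on, the sketch below estimates the similarity of two sequences from small k-mer sketches, with no alignment step. It is classic bottom-s MinHash and not the minmer scheme of ref 7. The function names, the k-mer size, the sketch size, and the choice of hash are all assumptions made for this example, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice: BLAKE2b)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=8, s=64):
    """Bottom-s sketch: the s smallest distinct k-mer hash values."""
    return sorted({hash64(x) for x in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s=64):
    """Estimate k-mer Jaccard similarity from two bottom-s sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # The s smallest hashes of the merged sketches act as a random
    # sample of the union of the two k-mer sets.
    union_bottom = sorted(set_a | set_b)[:s]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in set_a and h in set_b)
    return shared / len(union_bottom)


if __name__ == "__main__":
    a = "ACGTTGCAAGGCTTAACCGGTATCGATCGGATCCTAGGCTAAGCTTGCA" * 4
    b = a[:100] + "T" + a[101:]  # same sequence with one substituted base
    print(jaccard_estimate(sketch(a), sketch(b)))
```

Each sketch keeps at most s hash values per sequence, so comparing two genomes costs a fixed amount of work regardless of their length; this is the property that makes sketching attractive for large genome-to-genome comparisons.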