Single-molecule sequence assembly and analysis

$2,006,881 · ZIA · FY2023 · HG · NIH

National Human Genome Research Institute

Linked publications & trials

Paper 39208288 Paper 39110522 Paper 39103649 Paper 38811727 Paper 38724243 Paper 38570684 Paper 38508693 Paper 38376487 Paper 38168989 Paper 38036856 Paper 37612512 Paper 37603771 Paper 37165242 Paper 37165241 Paper 37013528 Paper 36797493 Paper 36302175 Paper 36223450 Paper 36208288 Paper 35444317 Paper 35365778 Paper 35361932 Paper 35361931 Paper 35357935 Paper 35357925 Paper 35357919 Paper 35357917 Paper 35357915 Paper 35357911 Paper 35042810 Paper 35042802 Paper 35020798 Paper 34409936 Paper 33911273 Paper 33911078 Paper 33910595 Paper 33910227 Paper 33828295 Paper 33542373 Paper 33408411 Paper 33089966 Paper 32801147 Paper 32753684 Paper 32699374 Paper 32686750 Paper 32663838 Paper 32657365 Paper 32543654 Paper 32541955 Paper 32385271 Paper 32350247 Paper 32242610 Paper 32191811 Paper 31915075 Paper 31898513 Paper 31406327 Paper 31375138 Paper 31296857 Paper 31249862 Paper 31157884 Paper 30942877 Paper 30670797 Paper 30504855 Paper 30429615 Paper 30423094 Paper 30373669 Paper 30346939 Paper 29788454 Paper 29708767 Paper 29431738 Paper 29373581 Paper 29329394 Paper 28798168 Paper 28461322 Paper 28396521 Paper 28298431 Paper 28263316 Paper 28180967 Paper 27323842 Paper 27249958 Paper 27074162 Paper 27035980 Paper 26856261 Paper 26006009 Paper 22750884 Paper 14759262

Abstract

In 2022, we finished the first complete sequence of a human genome as part of the Telomere-to-Telomere (T2T) project. However, this complete genome comprised all human chromosomes except ChrY. The Y chromosome plays critical roles in sexual development and fertility, but it is also one of the most repetitive and difficult-to-sequence chromosomes in the genome. For this reason, it was often excluded from prior genomic studies, including our own. In 2023, we were finally able to finish this last chromosome of the genome and provide a full analysis of what had been missing (ref 1).
Building on the T2T project, we next sought to move beyond a single complete reference genome and work towards a pangenome reference that is more inclusive of all genomic ancestries. Along with other members of the Human Pangenome Reference Consortium (HPRC), we evaluated existing sequencing and assembly methods to develop a workflow for the automated assembly of near-complete genomes at scale (ref 2). Following the T2T sequencing recipe, the HPRC chose a combination of PacBio HiFi and Oxford Nanopore ultra-long read sequencing, as well as sequencing of familial trios for the resolution of complete haplotypes (both techniques were previously developed by the GIS). For phase 1 of the HPRC, approximately 50 genomes were assembled using only the HiFi data, resulting in 100 chromosome-scale haplotypes that were combined into a draft human pangenome reference (ref 3). The study demonstrates that using a pangenome reference can improve accuracy and reduce the bias inherent in genomic analyses that rely on a single reference sequence. However, this first version of a human pangenome is incomplete and still contains some gaps and misassemblies. Complete assemblies are necessary to ensure the absence of false-negative gene losses, which is a common problem we identified in a variety of draft vertebrate genome assemblies (ref 4).
Moving towards a gapless pangenome requires integrating the ultra-long Nanopore sequencing data, which have already been collected but not yet incorporated into the assemblies. To enable this, the GIS developed a new tool called Verkko, which is able to automatically assemble complete, T2T haplotypes for a number of human chromosomes (ref 5). Future versions of the pangenome will be based on these more continuous and complete assemblies that integrate both long-read data types.
Complete human genome assemblies from a diverse set of donors are already uncovering new biology. In another collaboration with the HPRC, we analyzed the phase 1 assemblies and identified strong signals of heterologous recombination between different chromosomes of the human genome (ref 6). These data also provided a mechanistic explanation for the formation of Robertsonian chromosomes, which are among the most common chromosomal fusions in humans and are associated with infertility and Down syndrome.
To enable further exploration of the human pangenome, we continued developing efficient methods for high-throughput genome-to-genome comparison. Building on our earlier minimizer and min-hashing methods for whole-genome alignment, this year we introduced a new sketching paradigm called minmers (ref 7). Minmers allow for the rapid and unbiased estimation of sequence similarity without the need for costly gapped sequence alignments. We are now integrating minmer sketches into pangenome construction methods such as PGGB, which will enable sensitive pangenome graph construction at the scale of thousands of human haplotypes.
Lastly, we continued to assist other labs in the assembly of non-human genomes, and the GIS remains an active member of the Vertebrate Genomes Project (VGP) and Earth BioGenome Project (EBP), which together aim to sequence the genomes of all eukaryotic life on Earth. These projects are producing extremely valuable genomic datasets that will guide future conservation efforts and enable large-scale comparative genomics. The associated VGP now sits only a few tens of genomes short of its phase 1 goal of completing at least one genome from each of the approximately 270 vertebrate taxonomic orders, and both projects have benefited from genome assembly, validation, and alignment software developed by the GIS in prior years. This year, the GIS published assemblies for multiple Hydractinia genomes in collaboration with the Baxevanis lab (ref 8); new reference genomes for several agriculturally important catfish species with the USDA (ref 9); the sugar beet genome (ref 10); and the Nile rat, which is an important model organism for diabetes (ref 11).
In addition to the 11 papers above that were formally published this year, the section has posted 4 preprints to bioRxiv that are currently undergoing peer review, including a new method for structural variant calling from Nanopore data, a study of human centromere variation, a cloud-based assembly pipeline for the EBP, and a new metagenome assembler designed for the PacBio HiFi data type.
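
For readers unfamiliar with the sketching idea the abstract refers to, the following is a minimal illustration of classic bottom-s MinHash, the earlier min-hashing technique on which minmers are said to build. It is not the minmer scheme of ref 7 and not the GIS implementation; the function names, the k-mer size, the sketch size, and the choice of BLAKE2b as the hash are all assumptions made for illustration. It shows how the Jaccard similarity of two k-mer sets can be estimated from small sketches, without any gapped alignment.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct k-mer hash values of seq."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    The s smallest hashes of the merged sketches form a bottom-s sketch of
    the union; the fraction of those present in both inputs is the estimate.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


if __name__ == "__main__":
    # Toy sequences, hypothetical and far shorter than real genomes.
    a = "ACGTACGTGACCTGAAGT"
    b = "ACGTACGTGACCTGTTGT"
    k, s = 4, 8
    estimate = jaccard_estimate(bottom_sketch(a, k, s), bottom_sketch(b, k, s), s)
    exact = len(kmers(a, k) & kmers(b, k)) / len(kmers(a, k) | kmers(b, k))
    print(f"estimated Jaccard: {estimate:.3f}, exact Jaccard: {exact:.3f}")
```

The sketch size s trades accuracy for memory: once s is at least the number of distinct k-mers in both sequences, the estimate equals the exact Jaccard value, while smaller sketches give a cheaper approximation.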
