K-mer indexing for pan-genome reference annotation

$300,000 · U01 · FY2023 · HG · NIH

Stanford University, Stanford CA

Investigators

Linked publications & trials

Paper 39353437 Paper 39042694 Paper 37165242 Paper 35525246 Paper 35444317 Paper 33875001 Paper 33173909

Abstract

The human genome reference sequence is one of the foundations of genome sciences, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation. Important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is more computationally efficient, provides accurate representation in the context of populations, and facilitates the analysis of diverse human genomes. Our goal is to use this strategy to develop a robust computational architecture that will encode and annotate large collections of genomes in the context of a pan-genome reference. First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased reference genomes by 1) generating an index of all K-mers in human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, 2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, and 3) developing schemes for directly analyzing compressed genomic data. Second, we plan to apply the K-mer representation to genomic analysis by 1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2) developing functions for our pan-genomic index that support ultra-rapid queries, such as queries for clinically important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index so that genetic variation can be annotated against a particular genome reference. Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, to promote community engagement, and to enable contributions from the research community. We expect that completion of these aims will provide: a scalable computational architecture that incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will remain nearly constant as the database grows; and a universally accessible portal using cloud computing. This work will help solve the issues raised by multiple assemblies. It will improve researchers' ability to understand the relationship between variants and disease, while also providing great long-term savings in infrastructure and computing costs.
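To make the indexing idea in the abstract concrete, the following is a minimal sketch of a K-mer index that stores coordinate metadata and reports novel K-mers when a new haplotype is added. It is an illustration under stated assumptions, not the project's implementation: the class `KmerIndex`, its methods, the in-memory dictionary, and the toy sequences are all hypothetical, and a production index for GRCh38 would need a compressed, disk-backed structure.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


class KmerIndex:
    """Hypothetical in-memory index: k-mer -> list of (source name, offset)."""

    def __init__(self, k):
        self.k = k
        self.index = defaultdict(list)

    def add_sequence(self, name, seq):
        """Index seq under the label name; return the k-mers not seen before."""
        novel = set()
        for pos, kmer in kmers(seq.upper(), self.k):
            if kmer not in self.index:  # membership test does not create a key
                novel.add(kmer)
            self.index[kmer].append((name, pos))
        return novel

    def lookup(self, kmer):
        """Return every (source name, offset) recorded for kmer, or []."""
        return self.index.get(kmer.upper(), [])


# Toy data (assumed for illustration; not real genomic sequence).
idx = KmerIndex(k=4)
idx.add_sequence("ref", "ACGTACGTGA")

# A haplotype that differs from the toy reference by one substitution at offset 6.
novel = idx.add_sequence("hap1", "ACGTACTTGA")

hits = idx.lookup("ACGT")
```

In this toy case the single substitution produces four novel 4-mers, one for each window that overlaps the changed base, while shared K-mers such as `ACGT` simply gain an extra metadata entry. Lookups are dictionary accesses, which is one simple way to obtain query times that do not grow with the number of indexed genomes.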
