The human genome is often compared to a landscape, with a complex and diverse topography of genes and regulatory sequences. But in many places, the terrain is less dramatic vista and more desert highway: vast and repetitive.
Consider a chromosome’s centromere, which joins its two gene-laden arms. Centromeres comprise thousands of nearly identical α-satellite sequences – 171-base-pair units that must be properly organized to ensure chromosomal stability and cell division. Yet two decades after the publication of the draft human genome, these and other challenging stretches of DNA remained stubborn gaps in our chromosomal atlas. And, until a few years ago, some researchers despaired of ever filling them.
Beth Sullivan, a researcher at Duke University in Durham, North Carolina, recalls a 2014 conversation with Karen Miga, a genomics researcher at the University of California, Santa Cruz. “She said to me, if something special doesn’t happen with technology, we’re going to be stuck here for a very long time,” Sullivan says.
But something did happen: the development of sequencing technologies that can read long stretches of DNA without interruption. Now Miga and her partners in the Telomere-to-Telomere (T2T) consortium are poised to complete a 20-year odyssey that began with the release of that first draft sequence.
Their goal is to have, for each chromosome, an end-to-end genome map that extends from one telomere (the repetitive sequence elements that cap chromosomal ends) to the other. “It’s not just to do it for the sake of doing it,” Miga says.
“It was because I think there’s some really good biology there.” But to find it, the world of genomics will need to sequence many more such genomes, which is still far from routine for these poorly understood genomic regions.
Stuck in the middle
Published 20 years ago this month¹, the first draft of the human genome was a landmark achievement. But it was also full of holes. Scientists at the Human Genome Project had generated a large number of short sequences from chromosomal DNA.
Where these overlapped with their neighbours, they were assembled into larger, contiguous stretches known as contigs. Ideally, each chromosome would be represented by a single contig, but the first draft comprised 1,246 such fragments.
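The idea of merging overlapping short sequences into longer contiguous stretches (contigs) can be illustrated with a toy example. The sketch below is a minimal greedy overlap merge, not the method the Human Genome Project used: it assumes error-free reads from a single strand and exact suffix-to-prefix overlaps, and the function names and the `min_overlap` parameter are hypothetical choices made for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap.

    Toy assumptions: no sequencing errors, no reverse-complement reads.
    Whatever cannot be merged any further is returned as separate contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads collapse into a single contig.
full = greedy_assemble(["AACCGGTT", "GTTACGAT", "GATCAG"])

# Without the middle read, nothing bridges the two ends: two contigs remain.
gapped = greedy_assemble(["AACCGGTT", "GATCAG"])
```

In this toy case the missing middle read leaves a gap that no amount of merging can close, which is the situation the first draft faced many times over.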
Since then, scientists working as part of the Genome Reference Consortium (GRC) have been fleshing out the assembly, using manual inspection and sequencing analysis to identify regions with errors and informational gaps.
The most recent full version of the human genome, called GRCh38, was released in 2013. It has been repeatedly ‘patched’ since then. Yet it still lacks 5 to 10% of the genome, including all of the centromeres and other challenging regions, such as the large collections of genes that encode the RNA sequences that form the protein-producing cellular machines called ribosomes.
These are present as long stretches of multiple, repeated gene copies. “That’s a big part of what’s left to finish,” says Adam Phillippy, a bioinformatician at the US National Human Genome Research Institute in Bethesda, Maryland, and co-chair of T2T. The genome is also shot through with hard-to-map stretches of nearly identical DNA known as segmental duplications, the products of ancient chromosomal rearrangements.
These challenging classes of sequence have stymied genome-assembly efforts. That is because most sequencing so far has been done with short-read technologies, such as the widely used platform developed by biotechnology company Illumina in San Diego, California. Illumina sequencers generate highly accurate data, but typically over only a few hundred bases – far too little to span long repeats and place sequences unambiguously.
“Genes are usually easy to assemble,” says Kerstin Howe, a computational biologist at the Wellcome Sanger Institute in Hinxton, UK, which is part of the GRC. “But basically nothing was known in that intergenic space or with a lot of repetition.”
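Why read length matters for repeats can be shown with a small, purely illustrative example. The sketch below builds a made-up reference in which a 5-base unit stands in for the 171-base-pair α-satellite unit, then counts where a short read and a long read can be placed by exact matching. All of the sequences and the `placements` helper are hypothetical, and real read mappers tolerate mismatches and work very differently.

```python
def placements(reference: str, read: str) -> list[int]:
    """Return every start position at which `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions


# Toy reference: unique flanks on either side of a tandem repeat array.
left_flank = "ACGTAC"
right_flank = "TTGACC"
unit = "GATTC"  # hypothetical stand-in for a 171-bp alpha-satellite unit
reference = left_flank + unit * 4 + right_flank

# A short read that falls entirely inside the repeat array.
short_read = unit + unit[:2]

# A long read that spans the whole array plus two bases of each flank.
long_read = reference[4:28]

short_hits = placements(reference, short_read)  # several equally good positions
long_hits = placements(reference, long_read)    # a single position
```

The short read fits the array at three different offsets, so its true origin is ambiguous, whereas the long read is anchored by the unique flanking sequence on both sides and can be placed in only one way.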
Reaching across the gaps
Two long-read technologies are now closing those gaps. Pacific Biosciences, a biotechnology company in Menlo Park, California, uses an imaging system to read thousands or millions of DNA strands directly in parallel, each spanning many thousands of bases.
The other approach, commercialized by UK-based company Oxford Nanopore Technologies, threads DNA strands through tiny protein pores, or nanopores, and reads them by measuring the subtle changes in electric current that occur as nucleotides pass through the channel, for tens of thousands of bases or more.
When first rolled out (in 2010 for Pacific Biosciences’ technology and 2014 for Oxford Nanopore’s), these technologies were more error-prone than Illumina’s, which offers greater than 99% accuracy for individual reads. “We’re talking about 15–20% error rates in early PacBio,” says Phillippy. First-generation nanopore sequencers could misread more than 30% of bases.