Third-generation sequencing
By Gregory Snyder
Since the completion of the Human Genome Project less than a decade ago, the cost of sequencing genomes has decreased more than a thousand fold. This cost reduction has been accomplished by the rapid development of a “second generation” of DNA sequencing technologies to replace the methods used in the Human Genome Project. Lowering the cost another thousand fold, to less than $1000, promises to revolutionize medicine by enabling doctors to tailor strategies for disease prevention, diagnosis, and treatment to specific risk factors found in a patient’s personal genome. Moreover, low-cost genome sequencing promises to revolutionize basic research, too, by giving unprecedented complete information about the genetic structure of populations, which could yield insight in fields from epidemiology to evolution.
In a recent report in Science, Eid, et al. of Pacific Biosciences (PacBio) demonstrate the proof of principle of a major step toward the affordable genome with a new method to sequence DNA cheaply and rapidly by watching an array of single DNA molecules being replicated in real time. PacBio succinctly refers to this capability as “SMRT,” for Single-Molecule sequencing in Real Time.
Recent introductions of products for second generation sequencing from Roche/454, Illumina, and ABI–all established players in the biotechnology world–have yielded a bumper crop of genome sequence from extinct animals (Neandertals and woolly mammoths), traditional experimental laboratory organisms (from mice to sea anemones), and individual human genomes (including those of famous genome scientists James Watson and Craig Venter). The impact of these technologies extends far beyond sequencing genomes, though. Many traditional methods for studying DNA and its interactions with other components of the cell are being reworked to take advantage of the power of these technologies to produce data that spans the entire genome, essentially letting scientists perform millions of separate experiments in parallel. The cost of this newfound experimental power and speed is, of course, monetary, and the more uses are found for second generation sequencing, the stronger the push will be to drive its cost down. And, in turn, the cheaper the technology becomes, the more its uses will proliferate.
Such is the pace of technological development that just as second generation products are beginning to establish themselves, their replacements are already emerging. One such “third generation” technology is the one just published by Pacific Biosciences. Whereas the poor sensitivity of most sequencing technologies requires many identical copies of a DNA molecule to be made before its sequence can be found, the method described by Eid, et al. can sequence a single molecule of DNA. Single-molecule sequencing offers many advantages for the end user, including far lower consumption of chemical reagents–making the process cheaper–and faster and simpler sample preparation. What’s more, SMRT reads the sequence by watching a single strand of DNA be replicated in real time, at a rate of about five nucleotides per second, whereas other methods require complicated cyclical steps which take several minutes for each nucleotide addition. The total time needed to sequence a genome using SMRT could be days or even months less than that required by second generation technologies (depending on the size of the genome being sequenced and the method being used).
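To get a feel for what that difference in speed means for a single DNA fragment, here is a back-of-the-envelope sketch. Only the rate of about five nucleotides per second comes from the description above; the five-minute cycle time and the 600-base fragment are assumptions chosen purely for illustration, and the resulting ratio changes in direct proportion to whatever cycle time is assumed.

```python
# Rough per-fragment chemistry time: real-time synthesis vs. cyclic chemistry.
# Only the ~5 nucleotides/second rate is taken from the article; the other
# two numbers are illustrative assumptions.

SMRT_RATE_NT_PER_S = 5.0      # from the article
ASSUMED_CYCLE_MINUTES = 5.0   # assumption standing in for "several minutes"
ASSUMED_FRAGMENT_BASES = 600  # assumption: bases read from one fragment


def smrt_seconds(n_bases: int, rate: float = SMRT_RATE_NT_PER_S) -> float:
    """Time for a polymerase to add n_bases at a steady rate."""
    return n_bases / rate


def cyclic_seconds(n_bases: int,
                   minutes_per_cycle: float = ASSUMED_CYCLE_MINUTES) -> float:
    """Time for n_bases when every base needs one full chemical cycle."""
    return n_bases * minutes_per_cycle * 60.0


if __name__ == "__main__":
    real_time = smrt_seconds(ASSUMED_FRAGMENT_BASES)
    cyclic = cyclic_seconds(ASSUMED_FRAGMENT_BASES)
    print(f"real-time: {real_time / 60:.0f} minutes")
    print(f"cyclic:    {cyclic / 3600:.0f} hours")
```

Under these assumptions the real-time read takes two minutes and the cyclic read takes fifty hours. Both kinds of instrument read many fragments in parallel, so this compares the chemistry per fragment, not the total time to sequence a genome.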
Achieving the exquisite sensitivity of sequencing single molecules of DNA required the development or optimization of several separate technologies. To start with, an appropriate means of detecting the DNA needed to be chosen. Given the current state of biophysical research, there was only one natural choice, though the way in which SMRT uses it is novel in more than one respect, as we will see in the coming paragraphs. That choice was fluorescence, a phenomenon familiar to anyone who has worn a fluorescent T-shirt in the last thirty years.
Fluorescence is particularly useful to biologists because they can attach fluorescent dyes to biological molecules and then see those molecules with a light microscope. Ordinarily, a single biological molecule, such as DNA, cannot be seen with a light microscope because light does not noticeably interact with objects that are much smaller than its wavelength. Visible light has wavelengths between 400 and 700 nm (nanometers, i.e., millionths of a millimeter), but the diameter of the double helix is only 2 nm. Fluorescent dyes are often slightly smaller than the diameter of helical DNA, but their peculiar electronic structure enables them to interact with light as if they were huge antennas. Attaching fluorescent dyes to DNA lets scientists take advantage of centuries’ worth of technological advances in optics and results in a very sensitive detection method. Even current sequencing technologies make use of fluorescent labels (they replaced the radioactive labels used in sequencing long ago, offering greater sensitivity and fewer dangers), but SMRT stands to make even more effective use of them by attaining the ultimate sensitivity: detecting individual fluorescent molecules.
Next, an appropriate polymerase had to be chosen for SMRT. A polymerase is a protein that generates the second strand of the DNA double helix from a single strand of DNA. Since the nucleotides in the two strands uniquely pair up according to rules called “Watson-Crick” base-pairing, knowing the sequence of one strand reveals the sequence of the other; the two strands are thus said to be “complementary.” Most sequencing methods, PacBio’s included, work by generating a complementary strand with nucleotides labeled by fluorescent dyes that enable them to be detected and identified. For many technical reasons, it is impossible to sequence an entire genome (or even one chromosome) in a single stretch. Rather, a genome must be broken into overlapping fragments whose sequences can be computationally restitched together after they all have been obtained using the polymerase. All polymerases have a tendency to fall off of the “template” strand being replicated, but the stitching process works better when the fragments are longer. Polymerases which stay on the template longer are thus very valuable. PacBio chose the “φ29” polymerase because of its long average run length and modified it to increase its ability to use special fluorescently labeled nucleotides.
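The two ideas in this paragraph, complementary strands and the restitching of overlapping fragments, can be made concrete with a short sketch. It is a toy illustration and not the software PacBio or anyone else actually uses: the function names, the example sequences, and the minimum overlap of three bases are all assumptions, and real assemblers must also cope with sequencing errors and repeated sequence.

```python
from typing import Optional

# Watson-Crick pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(template: str) -> str:
    """Return the base that pairs with each template base, position by position."""
    return "".join(COMPLEMENT[base] for base in template)


def reverse_complement(template: str) -> str:
    """Complementary strand written in its own reading direction."""
    return complement_strand(template)[::-1]


def merge_overlapping(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two fragments if the end of `left` matches the start of `right`.

    Tries the longest possible overlap first and returns None when no
    overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


if __name__ == "__main__":
    print(complement_strand("ATGC"))                # TACG
    print(merge_overlapping("ATGCGT", "CGTTAA"))    # ATGCGTTAA
```

The longer the fragments, the longer the overlaps that can be demanded before two fragments are joined, which is one way to see why polymerases with long run lengths make the stitching more reliable.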
The special nucleotides are the second major innovation which makes SMRT possible. The traditional place to attach a fluorescent label to a nucleotide is near the part involved in forming a Watson-Crick pair. Each of the four nucleotides is labeled with a different fluorescent dye so that a nucleotide may be identified by its color. The attachment can be made with a linker that is cleavable with chemicals so that the dye can be detached from the DNA after it has been detected, when it is no longer needed. On a second-generation sequencing machine using this kind of nucleotide, every time a position in the sequence is read, the machine has to remove the label in order to read the next position. It does so by flowing the appropriate chemicals into the sample chamber, washing away the cleaved labels, and flowing in the chemicals (including labeled nucleotides) needed to sequence the next base–a time-consuming and expensive process. In contrast, PacBio’s nucleotides are labeled on the “γ-phosphate,” the part of the nucleotide which is naturally cleaved off when it is incorporated into the strand of DNA being synthesized by a polymerase. Hence, SMRT does not require the cycling of different chemicals, and the polymerase is able to function naturally, as it would in the living organism it originally came from. This is the innovation that speeds up the chemical process of SMRT sequencing 30,000 times over second generation methods, enabling the sequence to be read in real time.
The third technology used in SMRT is the chip used to sequence many DNA fragments in parallel. It consists of a glass microscope cover slip with a 100 nm-thick layer of aluminum deposited on top of it. In the aluminum is an array of cylindrical wells 70 nm–100 nm in diameter. The aluminum is chemically treated so that polymerase molecules will stick to the glass at the bottom of each well rather than the sides of the wells. This is important because there is no way to manipulate a polymerase molecule to deliberately place it at the bottom of a well; rather the chip must be prepared by soaking it in a solution containing polymerase molecules, which then stick to every surface they can. The polymerase solution is very dilute, so that, on average, there is no more than one polymerase molecule per well. The cover glass at the bottom of the well permits an image of the activity of the polymerase at the bottom of each well to be projected onto a detector.
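Because the polymerases land in wells at random, some wells end up empty and some receive more than one. The sketch below estimates how many wells are usable for a given average loading. It assumes the number of polymerases per well follows a Poisson distribution, which is a standard model for this kind of random loading but is not something stated by Eid, et al.; the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events when the average count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def well_occupancy(mean_per_well: float) -> dict:
    """Fractions of wells that are empty, singly loaded, or multiply loaded.

    Assumes Poisson-distributed loading (an assumption, see the lead-in).
    """
    empty = poisson_pmf(0, mean_per_well)
    single = poisson_pmf(1, mean_per_well)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}


if __name__ == "__main__":
    for mean in (0.2, 0.5, 1.0):
        occ = well_occupancy(mean)
        print(f"mean {mean}: " + ", ".join(f"{k} {v:.1%}" for k, v in occ.items()))
```

Under this model, an average of one polymerase per well leaves about 37% of wells with exactly one polymerase, about 37% empty, and about 26% with more than one. A more dilute solution reduces the multiply loaded wells at the price of more empty ones.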
But why make the chip out of aluminum instead of plastic or a semiconductor? The answer forms the core of SMRT technology. A metal well with a diameter of 100 nm or less is called a “zero-mode waveguide” because the wavelength of visible light is too large to fit into the well. Most of the volume of the well is completely dark for that reason. However, the physics of light works out such that when light is shone on the well, some of it is able to penetrate the entrance of the well, but only for 10 nm or so. That distance is just far enough to illuminate the polymerase at the bottom of the well. Illuminating the polymerase is necessary to excite the fluorescent nucleotides as they are incorporated into the growing strand–the readout that reveals the sequence. (A fluorescent T-shirt appears to glow because it is stained with a dye that absorbs light of a color that the human eye can’t see, i.e., ultraviolet light, and re-emits it as a color the eye can see. But the T-shirt cannot glow in complete darkness; fluorescence requires excitation light.) During a SMRT sequencing run, each of the four DNA nucleotides must be available in solution in order for each polymerase to work. But since the nucleotides used for SMRT are all labeled with fluorescent dyes, the solution would appear to be a glowing mass, and without the limited excitation volume generated by the zero-mode waveguide it would be impossible to see which nucleotide is being incorporated by the polymerase.
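A little arithmetic shows just how small the illuminated region is. The sketch below treats it as a cylinder as wide as the well and about 10 nm deep, then estimates how many free labeled nucleotides sit inside it at any instant. The sharp-edged cylinder is a simplification, since the light actually fades gradually with depth, and the nucleotide concentration of 1 micromolar is an assumption made only for illustration.

```python
import math

AVOGADRO = 6.02214076e23   # molecules per mole
NM3_PER_ZEPTOLITER = 1000  # 1 zeptoliter (1e-21 L) = 1000 cubic nanometers


def illuminated_volume_zl(diameter_nm: float, depth_nm: float) -> float:
    """Volume of a cylinder of the given size, in zeptoliters.

    Simplification: the lit region is treated as a sharp-edged cylinder.
    """
    radius_nm = diameter_nm / 2.0
    return math.pi * radius_nm ** 2 * depth_nm / NM3_PER_ZEPTOLITER


def molecules_in_volume(volume_zl: float, concentration_molar: float) -> float:
    """Average number of dissolved molecules inside the volume."""
    return concentration_molar * AVOGADRO * volume_zl * 1e-21


if __name__ == "__main__":
    ASSUMED_CONCENTRATION = 1e-6  # assumption: 1 micromolar labeled nucleotide
    for diameter in (70, 100):
        volume = illuminated_volume_zl(diameter, 10)
        count = molecules_in_volume(volume, ASSUMED_CONCENTRATION)
        print(f"{diameter} nm well: {volume:.0f} zL, {count:.3f} molecules on average")
```

For a well 100 nm wide, the lit region holds roughly 79 zeptoliters, and at the assumed concentration it contains about 0.05 free nucleotides on average. Most of the time, then, the only dye in view is the one on the nucleotide held by the polymerase.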
The final clever trick used by Eid, et al. is to image the array of polymerases through a prism. The detector used for SMRT is very similar to the detector used in digital cameras. It consists of a light-sensitive semiconductor wafer divided into blocks, or pixels. Each pixel is sensitive to light of any color, and it only detects the brightness of the light striking it, so the raw image from such a detector is naturally grayscale. Digital cameras can sense color only because each pixel is covered with a red, green, or blue filter, and the electronics in the camera know which filter covers which pixel and can construct the final image accordingly. Each pixel in the final image from the camera is actually a composite of three pixels on the detector. But PacBio could not use a scheme like a digital camera’s, because filters necessarily throw away light: blue light incident on a red filter is absorbed before it can be detected. No matter how you slice it, single fluorescent molecules do not emit very much light, so throwing away light in order to identify its color is unacceptable. Instead of throwing away light, a prism separates light of different colors in space. When seen through a prism, the image of a particular well in the SMRT chip will appear at a different location, depending on which nucleotide is being incorporated.
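In other words, once the prism is in place, identifying a nucleotide becomes a matter of asking where on the detector a flash of light landed. The sketch below illustrates that idea by assigning each flash to the nearest expected position. The pixel offsets are invented for the example, and the nearest-position rule is an assumption; the actual signal processing used by Eid, et al. is not described here.

```python
# Hypothetical detector positions (in pixels, relative to a well's reference
# point) at which the prism would place light from each nucleotide's dye.
# These numbers are invented for illustration.
EXPECTED_OFFSET = {"A": 0.0, "C": 4.0, "G": 8.0, "T": 12.0}


def call_base(observed_offset: float) -> str:
    """Pick the nucleotide whose expected position is closest to the flash."""
    return min(EXPECTED_OFFSET,
               key=lambda base: abs(EXPECTED_OFFSET[base] - observed_offset))


def call_sequence(observed_offsets) -> str:
    """Turn a series of flash positions from one well into a base sequence."""
    return "".join(call_base(offset) for offset in observed_offsets)


if __name__ == "__main__":
    flashes = [0.3, 11.6, 7.9, 4.2]   # made-up measurements from one well
    print(call_sequence(flashes))     # ATGC
```

The sequence called this way is that of the strand being synthesized, so the template's sequence follows from it by Watson-Crick pairing.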
How well does it all work? After their article appeared in print, Pacific Biosciences announced that they had sequenced the genome of E. coli–a bacterium commonly used as a tool in biological laboratories and ubiquitous in the digestive systems of mammals–with 99.9999% accuracy. Bacteria have much smaller and simpler genomes than humans, and they are routinely used to demonstrate a new technology, with the promise that the technology will one day be usable on humans.
Though sequencing a human genome by SMRT may still be a long way off, no second generation technology has been able to achieve such high accuracy. Moreover, PacBio has managed to achieve an average read length of almost 600 nucleotides–which is nearly as good as any existing technology and far better than most second generation technologies–and a maximum read length of more than 3000 bases. Such long read lengths eliminate the difficulty of assembling fragments from regions of repetitive sequence in a genome, a problem that hobbles second generation technologies. PacBio plans to ship its first machine in 2010 and predicts that when the technology is fully developed, it will be capable of sequencing an entire human genome in about an hour.
Must third generation sequencing employ single-molecule detection? One company, Helicos Biosciences, has already based its second generation platform on cyclic sequencing of single molecules of DNA. However, Pacific Biosciences’s chief competition in cost and throughput comes from Complete Genomics, whose technology is also second generation but does not sequence single molecules. Complete Genomics recently announced the sequencing of an entire human genome in a few days, in the process generating more than ten times the data that is possible with existing methods. Complete Genomics predicts its costs will decrease to levels similar to those Pacific Biosciences hopes to achieve, but the company has been extremely guarded about releasing details of how it will attain its promised savings. Such stiff competition in the sequencing market is working to commodify genome sequencing, and the challenge is shifting from obtaining sequence data to understanding it. The present “genomics” era, inaugurated by the completion of the Human Genome Project, is well on its way to becoming a “systems” era, in which the function of genes and their relationships will be studied in ways unimaginable a mere decade ago.
[Gregory Snyder, PhD, is a biologist living in Chicago.]