On polymorphism II: Hard-drinking fruitflies
The last post hinted at how useful it is to compare the amount of genetic variation found in one population to that found in another. This question arises so often in population genetics, in fact, that the word polymorphism has, as noted, often come to denote a summary measure of DNA sequence variation within a population. But how, exactly, can one put a number on polymorphism? Remarkably, this question was hardly answerable, from a practical perspective, until 1983, when Marty Kreitman published the first definitive data on DNA sequence variation in a wild population of organisms.
Before Kreitman’s paper, known examples of genetic variation documented at the level of DNA sequence were confined to ‘lucky breaks’, where such variation caused either 1) a change in some protein’s amino acid sequence (thanks to mutation of a codon) drastic enough to be detected and characterized by existing techniques, or 2) a change in the ‘restriction digestion’ signature of the genome itself, as detected by chewing up the genome with bacterial enzymes known to break it apart only at particular short sequence motifs (which can appear and disappear via mutation).
In many organisms, however, codons make up only a tiny fraction of the genome — and ‘restriction enzyme’ target sites are rarer still. Therefore, if we assume that each site in a genome is roughly as likely to show variation as is any other site, then only by obtaining full sequences for long segments of a genome can we hope to discover more than a tiny fraction of the variation that it harbors. The publication of ‘population sequence data’ from fruitflies thus signaled a sea change in the study of genetic variation, transforming it from a field reliant on fragmentary and anecdotal evidence, to one that would be built around thorough and systematic survey. Clever as they were, researchers before 1983 had to take shots in the dark in order to bag polymorphisms; Kreitman’s data marked a flip of the lightswitch. And those data, in their own right, showed resoundingly how useful a quantitative estimate of genetic diversity could be. As such, the data themselves are worth briefly reviewing, before moving on to how genetic diversity is calculated.
Using cutting-edge technology of the early 1980s, Kreitman managed to sequence a 2721-base segment of the fruitfly (Drosophila melanogaster) genome in eleven individual flies, each of which hailed from one of five geographically disparate populations. By modern standards, his hard-won data were meager, representing less than one of every sixty thousand bases in the fruitfly genome. By comparison (and as a striking example of the biotech equivalent of Moore’s law), one high-throughput DNA sequencer today can sequence the entire euchromatic portion of more than sixty fruitfly genomes in less than a day.
Lacking such a technological embarrassment of riches, Kreitman had to carefully pick the small segment of DNA that he focused on. He chose a gene, one of 14,000 or so in the fruitfly genome, called Adh, which encodes a functionally intriguing protein called alcohol dehydrogenase, or Adh (note the lack of italics, which are used for names of genes but not those of the proteins they encode). Adh is an enzyme that helps detoxify alcohol, an important biochemical task for animals, such as fruitflies and humans, that rely on fleshy fruits for nutrition (because such fruits are also a favorite food of yeast, which has mastered the trick of turning fruit sugars into toxic alcohol in order to drive off rivals and predators).
Though humble in scale, Kreitman’s data bore great fruit. First, they revealed the actual DNA sequence difference underlying a well-known, functionally important dimorphism in the amino acid sequence of fruitfly Adh. Of broader interest, however, were the revelations that 1) the Adh gene showed extensive DNA sequence variation (43 sites out of the 2721 showed variation, even in such a small sample of flies), 2) most of that variation involved single-nucleotide substitutions (rather than tandem nucleotide substitutions, or insertions or deletions of nucleotides), and 3) only one dimorphism (the aforementioned one) affected the amino acid sequence of the protein. The last observation was particularly important, as many (14) of the observed dimorphisms were actually harbored by the gene’s codons. Thirteen of the fourteen, however, were ‘synonymous’ dimorphisms; that is, they changed a codon to one of its synonyms (if you were reading earlier posts carefully, you’ll recall that the genetic code is degenerate).
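For readers who like to see the bookkeeping, the counts quoted above can be tallied in a few lines of Python. This is only a rough sketch built from the figures in this post: the variable names are my own, and it assumes that each variable site is counted once and that the 14 codon dimorphisms are among the 43 variable sites.

```python
# Figures quoted in the post for Kreitman's Adh sample.
SITES_SEQUENCED = 2721        # bases sequenced per fly
VARIABLE_SITES = 43           # sites that varied among the eleven flies
VARIABLE_IN_CODONS = 14       # variable sites that fall within codons
AMINO_ACID_CHANGING = 1       # variable sites that alter the protein

# Derived tallies (assumption: the 14 codon sites are a subset of the 43).
synonymous = VARIABLE_IN_CODONS - AMINO_ACID_CHANGING
outside_codons = VARIABLE_SITES - VARIABLE_IN_CODONS

# Share of all sequenced sites that showed any variation.
variable_fraction = VARIABLE_SITES / SITES_SEQUENCED

# Share of the codon dimorphisms that were synonymous.
synonymous_fraction = synonymous / VARIABLE_IN_CODONS

print(f"Variable sites: {VARIABLE_SITES} of {SITES_SEQUENCED} "
      f"({variable_fraction:.2%})")
print(f"Variable sites outside codons: {outside_codons}")
print(f"Synonymous codon dimorphisms: {synonymous} of {VARIABLE_IN_CODONS} "
      f"({synonymous_fraction:.1%})")
```

Put this way, only about one site in sixty-three varied at all, and of the variation that landed in codons, all but a single site left the protein untouched.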
Each of Kreitman’s three findings would become a foundational piece of data in the genomics era, serving as a benchmark against which future sequence data could be compared and debated, particularly in the context of the so-called neutral theory of molecular evolution, as expounded by influential theoretician Motoo Kimura. Saving discussion of such debate for another day, let’s move on to explore how one can squeeze a sensible and widely useful summary measure of ‘polymorphism’ out of DNA sequence data from a population.