On polymorphism III: Working with peanuts

April 21st, 2009

Delicious Alleles

We’re now ready to ask — and answer —  the question: how much polymorphism is to be found in a given population?  Remember that ‘polymorphism’ here specifically means genetic diversity, which is ultimately defined by characters (think ‘A, C, G, T’) that differ in kind, not degree.  In this important sense, DNA  sequence variation differs from variation in, say, mass, or number-of-toes.  The latter traits are inherently quantitative, whereas genotype is, at the finest scale, inherently qualitative. As such, conventional measures of quantitative variation — particularly univariate moments, such as variance — don’t apply, at least directly, as measures of sequence variation.

Take heart, though.  Even if we have to bypass (for now) old friends like variance, we will indeed be able to arrive at a summary measure (yes, a quantitative one!) of genetic diversity that is well defined and widely applicable.  The last point is important: we want to be able to use the same yardstick of diversity to study populations of any given organism.  In our example, we’ll go with the organism most familiar to us: humans. To keep things simple, though, we will agree to reduce a population of living, breathing people — each of whom comprises trillions of cells, each of which carries a genome billions of nucleotides long — to a population of alleles.  Specifically, we’ll imagine picking just one cell from each person, then zooming in and focusing on just one site in the  genome (keeping in mind that the cell likely contains two copies of that site — but more on that later).  In simplifying things this way, we mentally dissolve away the non-genomic parts of cells, and all the other potentially distracting trappings of our human population, until what remains is simply a bunch of representative copies of that single site in the genome.  It is this population of alleles whose level of diversity we aim to estimate, at least for starters.

Attention: Candy
Admittedly, even dissolving away everything but a single site in the genome may not simplify things enough to proceed smoothly.  After all, mentally manipulating a population of invisibly tiny copies of a genome site is not the most intuitive task.  So, to give the brain something concrete and familiar to work with, let’s invoke a mouthwatering analogy. Specifically, we’ll liken our population of copies of a genome site to a jarful of colorfully candy-coated peanuts. Candied peanuts — rather than say, jelly beans or gumballs — are ideal for a subtle reason that I’ll explain later.

The jar we’re imagining holds many, many such candy peanuts (more, even, than the jar pictured at the top of the post). To be precise, let’s say the jar holds exactly 1000 nuts. This number is the first of quite a few that will be popping up — meaning that we’ve reached a point where a conscientious science guide, like an avuncular airline pilot forecasting a ‘patch of rough air’, is supposed to ruefully warn you about the thorny math to come. Fear not, though: we need basic arithmetic, not Stephen Hawking stuff, to reason our way to a working concept of genetic diversity. So don’t fret.

Now, in principle, 1000 peanuts, each coated in a shade of candy paint, could sport as many as 1000 unique hues. But our peanuts are typical mass-produced sweets — so, for industrial efficiency, they come in just a few distinct colors. Four, actually: red, yellow, green, and blue. Picturing our jarful of a thousand such nuts, we might ask our key question: ‘How diverse, by color, are the nuts in the jar?’. Now, in answering that question, let’s agree to consider only the color of the candy coating (not the roasted interior) of each nut, and to forgo trying to estimate how different from each other two colors might be. That is, we’ll treat the colors of any two nuts as either the same, or different, with nothing in between. Moreover, we’ll assume that the candy paint on our peanuts never fades or otherwise changes in color.

Counting colors: Diversity = 4?

Right off the bat, it may be tempting to posit that the nuts in our jar, by virtue of coming in exactly four distinct colors, have a ‘color diversity’ of 4. And promisingly, this answer would capture something essential (if obvious) about the situation: 4 is both more than 1 (the value we would estimate for a clearly less color-diverse jar of nuts that were all coated the same color) and less than 1000 (the number we’d assign to a jar of undeniably more color-diverse nuts, in which each had its own unique hue).

But before we settle on simply counting colors to define ‘color diversity’, note that the range of possible values for this count would always depend on the total number of nuts in the jar. If a jar held just three nuts, for example, its ‘color diversity’ couldn’t exceed 3. How are we to compare such an estimate to the value we obtain for a jar of, say, five nuts in just three colors? Would these two jars be precisely equal in color diversity? My gut, and maybe yours too, says no: a five-nut, three-color jar — which carries more nuts than it ‘needs’ to in order to qualify as tricolore — seems less color-diverse than a three-nut, three-color jar.

Ratio profiling: Diversity = 4? Diversity = 0.004?
My gut also tells me that a broadly useful measure of diversity should depend as little as possible on other population-specific qualities/quantities (such as the number of individuals in the population). With this aim in mind, we might gravitate to the idea of using a ratio to measure diversity. In particular, why not just divide the number of distinct nut colors by the number of nuts in the jar? Perhaps this approach would churn out ‘diversity’ estimates that compare more sensibly, one to the next, even for jars that carry different numbers of nuts. A jar of three nuts in three colors, for example, would have a ‘nut colors per nut’ ratio of 3/3, or 1. The round unity of this estimate would suggest that we think of such a jar as ‘fully’, rather than ‘partly’, diverse; and this, in turn, would accord with the fact that such a jar is as color-diverse as a three-nut jar can be. By comparison, a jar of five nuts in just three colors would have a lower ratio of 3/5, or 0.6 — a result that fits our intuition that such a jar isn’t ‘fully’ diverse.

When applied to our original jar of 1000 nuts in four colors, such a ratio-based approach would yield a color diversity estimate of 4/1000, or 0.004. Reassuringly, this value is still more than 1/1000 and less than 1000/1000, per our intuition. Moreover, because we can’t have more nut colors than nuts, the new ratio-based measure would, conveniently, always take on a value somewhere in the range of zero to one. As such, it would be what scientists call a normalized measure, with a tidy built-in scaling that carries some mathematically useful properties. So maybe we’re getting somewhere with this ratio idea.

However, by turning to such a simple ratio — the number of distinct colors in the sample divided by the number of nuts in the sample — we would iron a new wrinkle into our definition of diversity, again putting it at odds with intuition. To spot that wrinkle, let’s do something that scientists often do in road-testing an idea: examine a so-called ‘boundary case’, where some variable under consideration takes on its highest (often infinity) or lowest (often negative infinity, zero, or, in cases like ours, one) possible value. In our scenario, we’ll check the boundary case of a jar that holds just one peanut.

Regardless of how we define diversity, one nut sitting alone in a jar has, of course, a single characteristic color (lonesome blue, perhaps). The total number of colors — here the use of the plural is just a formality — in a one-nut sample (i.e., jar) must therefore equal the number of nuts in the sample. As such, a ratio-based estimate of such a jar’s color diversity would be 1/1 — or, well, 1. If you think about it, this means that we would deem a one-nut, one-color jar to be just as color-diverse as a three-nut, three-color jar. Moreover, we would hold both of these jars to be exactly as diverse as a thousand-nut, thousand-color jar! True, each of these jars, however many nuts it contained, would be as color-diverse as it could be. Nonetheless, intuition tells us ‘Sorry, but a jar containing one nut — whatever its color — is simply not as color-diverse as a jar containing a thousand nuts in a thousand colors’.

Match and mismatch
Apparently, in trying to confine diversity to a scale that doesn’t vary with the size of the sample, we lost some ability to resolve how ‘surprising’ or ‘unsurprising’ a given finding of ‘diversity’ is. So why, exactly, did the simple ratio-based definition, which had raised our hopes, fail this way? To see why, imagine doing something drastic: wantonly spilling the contents of the original jar onto a table. Having done so, and taking care not to let any roll over the edge, we’ll start tallying colors one nut at a time.

Now, in your mind’s eye, a thousand peanuts coated in four candy colors and strewn across a table may make up a wildly gaudy mosaic — a bit like a pointillist painting executed, in a stylistic stretch, by Jackson Pollock. If this is what you’re picturing, though, I’m sorry to tell you that your imagination has been led astray (perhaps by eager tastebuds…or by the misleading photo above). In fact, when we spill out our jar onto the table, here’s what we find: exactly 994 red-coated peanuts, and 2 peanuts each in yellow, green, and blue.

This hardly qualifies as diversity — to the contrary, it’s classic tokenism. Yet the simple ratio-based measure that we defined would estimate the color diversity of these peanuts as 0.004 — the same value as would be estimated for a jar holding a more ‘balanced’ mix with 250 peanuts in each of the four colors. Clearly, the simple ratio-based candidate definition of color diversity fails to capture a key intuition: that an even mix of types is more ‘diverse’ than a mix skewed toward just one (think of that boundary case again) or a few types.
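
To make the failure concrete, here is a minimal Python sketch of the simple ratio-based measure. The function names (`make_jar`, `color_ratio`) and the jar contents are illustrative choices for this post, not anything standard.

```python
from collections import Counter

def make_jar(counts):
    """Expand a {color: number_of_nuts} dict into a flat list of nut colors."""
    return list(Counter(counts).elements())

def color_ratio(jar):
    """Simple ratio measure: distinct colors divided by number of nuts."""
    return len(set(jar)) / len(jar)

tokenist = make_jar({"red": 994, "yellow": 2, "green": 2, "blue": 2})
balanced = make_jar({"red": 250, "yellow": 250, "green": 250, "blue": 250})
lonely = make_jar({"blue": 1})  # the one-nut boundary case

# The ratio cannot tell the tokenist jar from the balanced one,
# and it rates the lonely one-nut jar as 'fully' diverse.
print(color_ratio(tokenist), color_ratio(balanced), color_ratio(lonely))
```

Both thousand-nut jars come out at 0.004, while the single blue nut scores a full 1: exactly the two intuition-defying results described above.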

To see this another way, imagine that we can’t view our jar at all, but instead have to gauge the color diversity of its contents by reaching in blindly, fishing out a peanut for inspection, eating it (for a parallel survey of flavor-diversity, of course), and repeating this procedure for a while. Now, if our jar is a tokenist one, then, whenever we blindly plunge a hand in, we can safely bet that we’ll be pulling out a red peanut. By contrast, if we reach into a jar containing a balanced, cosmopolitan mix of peanuts in several colors, we won’t be able to guess ‘red’ (or anything else) with much confidence.

How do we express this insight in formal terms? For starters, we care little about the identity of any particular color(s); that is, a nearly all-green jar is just as tokenist as a nearly all-red one. Moreover, color itself, the trait that we happen to be surveying, is in no way integral to a general definition of diversity. If not interested in color, for example, we could wonder just as easily about diversity of flavor, country of manufacture, or symbol stamped on the candy shell. The qualitative trait that we keep track of, and the specific values that the trait may take, aren’t the point – what’s important is how often we choose an individual that differs, in that chosen trait, from what’s come before. Diversity is, after all, premised on difference between members of a population.

Brass tacks: Diversity = 4? Diversity = 0.004? Diversity ≈ 0.012
If we start thinking of diversity as defined by direct comparison of one member of a population to another, then the diversity of the one-nut ‘boundary case’ that doomed our ratio idea becomes a moot question, like the koanic clapping of a single hand. Mathematically, we can succinctly formalize our new insight — that diversity requires differences between individuals — by asking ‘How often, if we pick two individuals at random from the population, will they differ in the trait of interest?’. To grind out an actual number using this simple algorithm for assessing diversity, let’s reason through the possible outcomes of two blind picks from our jar of peanuts.

The first nut we pick will, of course, very likely be red. To be precise, the chance of getting a red nut on the first random pick is 0.994. Now, unless we throw that first nut back in (instead of eating it) and remix the nuts thoroughly, the chance of picking a red nut on our second pick will differ slightly from what it was on the first pick; this is because, on the second pick, we’ll have only 999 nuts in the jar, of which 993 or 994 will be red. And 994/1000 equals neither 993/999 nor 994/999.

Those three numbers are, however, awfully similar to each other. In fact, we’re working with such a big ‘population’, and sampling such a small fraction of its ‘individuals’, that we can helpfully simplify our probability calculations by treating the chance of picking red as if it stays the same from pick to pick. Such a simplification is especially helpful in real-world sampling situations where, though unsure of just how many ‘nuts’ are in a given ‘jar’, we’re quite sure that we’ll be sampling only a small fraction of them.
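
For the skeptical, a few lines of Python show just how close those three fractions are. This is only an arithmetic check of the numbers already quoted; the labels are mine.

```python
from fractions import Fraction

# Chance of red on the first pick, and on the second pick
# depending on whether the (eaten) first nut was red or not.
odds_of_red = {
    "first pick": Fraction(994, 1000),
    "second pick, first was red": Fraction(993, 999),
    "second pick, first was not red": Fraction(994, 999),
}

for label, p in odds_of_red.items():
    print(f"{label}: {float(p):.6f}")

# Largest gap between any two of the three probabilities
spread = max(odds_of_red.values()) - min(odds_of_red.values())
print(f"largest gap: {float(spread):.6f}")
```

The largest gap is 1/999, or roughly a tenth of a percentage point, which is why pretending that the odds stay constant costs us so little here.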

Under this simplifying ‘constant odds’ (or, as it’s often termed by statisticians, ‘sampling with replacement’) assumption, the chance of picking two red nuts in our first two picks is 0.994 squared, or a little more than 0.988. Which means, conversely, that our chance of not picking two red nuts in our first two picks is roughly 1 - 0.988, or 0.012. Now, note that this latest number isn’t exactly the chance that the first two randomly picked peanuts will differ in color. Even if the two picks aren’t both red, after all, both could be yellow, or both green, or both blue. With our mostly red jarful, each of the latter cases is very unlikely, of course, but we can easily account for them.

To do so, we’ll again assume that the odds of picking a given color stay constant from pick to pick. For each of the three colors other than red, these odds are roughly 2/1000, or 0.002. So the chance of picking two yellow nuts on consecutive picks will be roughly 0.002 squared, or 0.000004 — and the same holds for each other rare color in the jar. Overall, then, the chance that we’ll pick successive nuts that match in color will be roughly 0.988 + 3(0.000004), or, well, about 0.988. Naturally, the chance that we pick successive nuts that differ in color is the complement of this matching probability; in the end, then, it’s still roughly 0.012. Let’s adopt this number as our new ‘chance of mismatch’-based estimate of our particular jar’s nut color diversity.

For comparison, let’s run through the corresponding figures for a 1000-nut jar containing 250 nuts in each of four colors. In that case, the chance of picking two nuts of the same color on the first two picks will be roughly 4 × (0.25 × 0.25), which works out to 0.25. Thus the chance of picking two nuts that differ in color will be roughly 1 - 0.25, or 0.75. Consistent with intuition, this value is much greater than 0.012, our new estimate of our original jar’s diversity. And, importantly, our new ‘chance-of-mismatch’-based diversity measure is still normalized (bounded by 0 and 1), and still puts our jar’s color diversity squarely between the values obtained for a jarful of 1000 nuts in just one color, and a jarful of 1000 nuts in 1000 distinct colors.
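
The same arithmetic can be sketched in a few lines of Python, under the ‘constant odds’ assumption used above. The function name `chance_of_mismatch` is just a label chosen for this illustration.

```python
def chance_of_mismatch(counts):
    """Chance that two picks differ in color, assuming constant odds.

    counts maps each color to the number of nuts of that color.
    The chance of a match is the sum of squared color frequencies;
    the chance of a mismatch is its complement.
    """
    total = sum(counts.values())
    chance_of_match = sum((n / total) ** 2 for n in counts.values())
    return 1 - chance_of_match

tokenist = {"red": 994, "yellow": 2, "green": 2, "blue": 2}
balanced = {"red": 250, "yellow": 250, "green": 250, "blue": 250}
one_color = {"red": 1000}
all_unique = {f"hue{i}": 1 for i in range(1000)}

for name, jar in [("tokenist", tokenist), ("balanced", balanced),
                  ("one color", one_color), ("1000 colors", all_unique)]:
    print(f"{name}: {chance_of_mismatch(jar):.3f}")
```

The tokenist jar lands at about 0.012 and the balanced jar at 0.75, with the one-color and thousand-color jars sitting at the two extremes of the scale.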

‘Real-jar’ uncertainty
The foregoing is promising stuff, but we’re not yet in the clear. Remember that, if we look at a jarful of nuts from the outside, with no extra knowledge, we won’t actually know the odds of picking any given color ahead of time. As such, if we want a good estimate of ‘color diversity’ for an unfamiliar jar of candy peanuts, we will have to repeat the two-pick experiment many times, after which we might notice that the two picked nuts differ in color in, say, roughly 1.2% (or 0.012) of trials. This is an arduous approach to estimating a measure of the nut ‘population’! But what if, instead of picking nuts one at a time, we could grab a whole handful (assuming that the nuts are distributed randomly in the jar) and then compare the colors of each pair of nuts in that ‘sample’, to see how often those paired ‘individuals’ differ? Will that give us enough comparisons to make a good estimate of diversity, without having to spend time picking out nuts one at a time from the jar?
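
To get a feel for how arduous the repeated two-pick experiment is, here is a rough simulation sketch. The number of trials and the random seed are arbitrary choices of mine, and the simulated picker (unlike a hungry one) puts both nuts back after each trial.

```python
import random

def estimate_mismatch(jar, trials, rng):
    """Estimate the chance of mismatch by repeating the two-pick experiment."""
    mismatches = 0
    for _ in range(trials):
        # Two blind picks from the jar, without putting the first nut back
        first, second = rng.sample(jar, 2)
        if first != second:
            mismatches += 1
    return mismatches / trials

jar = ["red"] * 994 + ["yellow"] * 2 + ["green"] * 2 + ["blue"] * 2
rng = random.Random(2009)  # fixed seed so the run is repeatable
estimate = estimate_mismatch(jar, 100_000, rng)
print(f"estimated chance of mismatch: {estimate:.4f}")
```

Even for a computer, it takes a great many two-pick trials before the estimate settles near 0.012, which is what motivates the handful shortcut.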

Well, suppose that our handful contains a whole number k of peanuts. The number of distinct peanut-pairings within that handful will be k(k-1)/2 (an expression that specialists in discrete mathematics call ‘k-choose-2’). To confirm this, you can start by checking the number of possible pairings for some low values of k: two nuts can be paired in just one way; three nuts yield three distinct pairings; four nuts give six distinct pairings; and five nuts give ten. Notice that, for each of these low values of k, there are indeed k(k-1)/2 distinct pairings. You can then extrapolate that pattern to indefinitely higher values of k by noting that, each time you increase k by one (by adding a new nut to the handful), the number of new pairings that you add to the total number of pairings is simply the old value of k (i.e., one less than the new value of k). For example, when you add a sixth nut to a group of five, that new nut can be separately paired with each of the first five nuts, but no other new pairings are made (and none are lost). By this insight, the new total number of pairings can be written as ((k-1)((k-1)-1)/2) + (k-1). This can be rewritten as ((k-1)(k-2) + 2(k-1))/2, which simplifies to, you guessed it, k(k-1)/2.
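
If you would rather let a computer do the pairing-up, the following sketch checks the k(k-1)/2 formula by brute force for small handfuls, and checks the ‘add one nut’ step as well. The upper limit of 20 nuts is an arbitrary choice.

```python
from itertools import combinations
from math import comb

pair_counts = {}
for k in range(2, 21):
    nuts = range(k)  # label the nuts 0, 1, ..., k-1
    pairs = list(combinations(nuts, 2))  # every distinct pairing
    pair_counts[k] = len(pairs)
    # Brute-force count agrees with the formula and with 'k-choose-2'
    assert len(pairs) == k * (k - 1) // 2 == comb(k, 2)

# Adding one nut to a handful of k-1 adds exactly k-1 new pairings
for k in range(3, 21):
    assert pair_counts[k] == pair_counts[k - 1] + (k - 1)

print([pair_counts[k] for k in range(2, 7)])
```

The first few counts printed are the ones listed above for two through five nuts, followed by the fifteen pairings you get with six.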

And, luckily, k(k-1)/2 grows quite quickly (nearly as quickly as the square of k, in fact) with increasing k, meaning that even a fairly small sample of individuals from a large population will provide many comparisons with which to assess its diversity. By asking what fraction of those many comparisons are mismatches (caveat: we’ll average over one less comparison than we can actually make, because the match/mismatch status of that ‘last’ comparison turns out to be fully determined by the pattern of the other comparisons), we have a handy measure — the pairwise chance of mismatch — that can be readily calculated, and that accords with basic intuitions about diversity, including the key difference between an overwhelmingly red jarful of candied peanuts, and a more compellingly colorful jar containing equal numbers of nuts in several colors.
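
Here is a minimal sketch of the handful approach. Note one simplifying assumption: it averages over all k(k-1)/2 pairings in the handful, and does not apply the ‘one less comparison’ adjustment mentioned in the caveat. The example handfuls are made up for illustration.

```python
from itertools import combinations

def pairwise_mismatch(handful):
    """Fraction of distinct pairings in the handful that differ in color."""
    pairs = list(combinations(handful, 2))
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)

# Hypothetical handfuls of ten nuts each
mostly_red = ["red"] * 8 + ["yellow", "blue"]
mixed = ["red", "yellow", "green", "blue"] * 2 + ["red", "green"]

print(f"mostly red handful: {pairwise_mismatch(mostly_red):.3f}")
print(f"mixed handful:      {pairwise_mismatch(mixed):.3f}")
```

A ten-nut handful already offers 45 comparisons, and the mostly red handful scores visibly lower than the mixed one.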

Nucleotide diversity
In calling this newly defined measure the color diversity of our jar of nuts, we’ll be making a direct analogy to a key measure in population genetics: nucleotide diversity (or, as it’s often nicknamed, polymorphism).  Nucleotides are, of course, the individual ‘beads’ that are strung together to make up the genome. And, like the nuts in our jar, each nucleotide, if it’s there (insertions and deletions of nucleotides do happen, and complicate things by making it harder to tell which nucleotide in one genome ‘corresponds’ to which nucleotide in another genome) comes in just one of four ‘colors’: the chemically distinct subunits called adenine, cytosine, guanine, and thymine (A, C, G, and T, for short), one of which, like a little flag, marks each nucleotide.

Before we explore what nucleotide diversity can tell us about a population, another word on why we used peanuts as analogs to human alleles. As anyone who has ‘sampled’ through a bag of whole roasted peanuts knows, many a peanut shell cracks open to reveal two nuts, rather than just one. In this sense, a peanut shell can be likened to a human zygote, which corrals together two copies — one from mom, one from dad — of nearly any segment of the genome that we might pick to examine.

So why, in our mathematical musing about candied peanuts, did we conveniently ‘paint over’ the fact that peanuts are born paired up in shells? Begging that question, I could plead that it’s hard to tastily coat (or even see) peanuts in easily countable colors while they’re still inside their shells. But, more to the point, taking the peanuts out of their shells underscores an analogous point about population genetics: as long as we’re interested in estimating diversity for just one site in the genome (and not other, subtler characteristics of genetic variation in the population), the fact that alleles inside real genomes are paired up turns out to matter fairly little.

And, in a subtler extension of the analogy, it’s worth noting that two-nuts-per-shell is not a hard-and-fast rule in the peanut world: some shells hold just one peanut, and still others hold more than two. Likewise, some human cells (eggs and sperm) carry no more than one copy of any given segment of the nuclear genome; other cells can carry just one copy of a particular segment, like the Y chromosome; and still other human cells carry more than two copies of a genome segment. The cells of people with Down syndrome, to cite a well-known example, often have three copies of chromosome 21, rather than the more familiar two. And, in a small fraction of women, many cells have three or more copies of the X chromosome. There’s an even more widespread example of this phenomenon, if we consider the little gene-crammed rings of DNA inside mitochondria. These degenerate, energy-liberating bacteria live in the cells of animals, plants, and other so-called eukaryotes, and, along with their genomes, are typically transmitted to the next generation only via eggs, not sperm. Nonetheless, each cell in your body contains dozens to thousands of these little half-captive half-stowaways, each of which may, in turn, contain many copies of its own little genome. So while a typical human cell may hold two copies of, say, chromosome 7, as well as one or two or more copies of the X chromosome, it also holds many, many copies of the mitochondrial genome.

A shelf full of jars
In our analogy, the candied peanuts in one jar represented copies of one site in a genome. We thus derived an estimate of ‘diversity’ for that site only. But we can readily expand the scope of our inquiry, and ask about the amount of variation found in a population of copies of a long segment of a genome. To do so, we can simply average the site-specific diversity estimates from throughout that segment, yielding a statistically robust — and very useful — estimate of its overall nucleotide diversity. Roughly speaking, this is akin to calculating the diversity of peanuts in each jar lined up on a whole shelf of such jars, and then averaging those estimates, to get a ‘shelf-specific’ estimate.
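
As a rough sketch of the shelf-of-jars idea, the code below treats each column of a small set of aligned sequences as one ‘jar’, computes the pairwise chance of mismatch for each, and averages across sites. The four five-letter sequences are invented for illustration, and the sketch assumes that all copies have the same length, with no insertions or deletions.

```python
from itertools import combinations

def site_diversity(column):
    """Pairwise chance of mismatch among the copies of one site."""
    pairs = list(combinations(column, 2))
    return sum(1 for a, b in pairs if a != b) / len(pairs)

def segment_diversity(sequences):
    """Average the site-specific diversity over every site in the segment."""
    columns = list(zip(*sequences))  # one tuple of letters per site
    per_site = [site_diversity(column) for column in columns]
    return sum(per_site) / len(per_site)

# Hypothetical sample: four aligned copies of a five-site segment
copies = ["ACGTA",
          "ACGTA",
          "ACCTA",
          "ATGTA"]

print(f"segment diversity: {segment_diversity(copies):.2f}")
```

In this toy sample, two of the five sites each show a mismatch in half of their pairings and the other three sites show none, so the segment-wide average comes to 0.2.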

To the extent that we then build a valid population-genetic model to represent the population of copies of our genome segment, we can ‘plug in’ our segment-specific estimate of nucleotide diversity in order to shed surprisingly bright light on the ancestral history of the copies of the segment. Some of the cryptic, but newly ‘guessable’ characteristics of that history might include: the ‘effective’ size of the ancestral population, as roughly averaged over time; the average number of generations elapsed since the death of the last common ancestor of two randomly picked copies of the segment; the average mutation rate per nucleotide site, per generation; and/or the average number of migrants exchanged by two partially isolated sub-populations, per generation (provided that the population we’re looking at is divided into such sub-populations, and we estimate diversity for each one separately, as well as for the pair as a whole). Moreover, if we compare results from our segment of interest to those from other segments in the genomes of the same ‘host’ population — or to results from the ‘same’ segment in a different population altogether — we may find strikingly dissimilar estimates of nucleotide diversity. Such disparities can turn up by chance, but they can also reflect robust underlying variation in, say, mutation rate, population size, or the stringency of natural selection. Future posts will explore, in more meaningful detail, how simple numbers such as nucleotide diversity can be leveraged to draw detailed evolutionary inferences about these and other factors.

On polymorphism II: Hard-drinking fruitflies

April 20th, 2009

The last post hinted at how useful it is to compare the amount of genetic variation found in one population to that found in another. This question arises so often in population genetics, in fact, that the word polymorphism has, as noted, often come to denote a summary measure of DNA sequence variation within a population. But how, exactly, can one put a number on polymorphism, as such? Remarkably, this question was hardly answerable, from a practical perspective, until 1983, when Marty Kreitman published the first definitive data on DNA sequence variation in a wild population of organisms.

Before Kreitman’s paper, known examples of genetic variation documented at the level of DNA sequence were confined to ‘lucky breaks’, where such variation caused either 1) a sufficiently drastic change in some protein’s amino acid sequence (thanks to mutation of a codon) to be detected and characterized by existing techniques, or 2) a change in the ‘restriction digestion’ signature of that genome itself, as detected by chewing up the genome with bacterial enzymes known to break it apart only at particular short sequence motifs (which can appear and disappear, via mutation).

In many organisms, however, codons make up only a tiny fraction of the genome — and ‘restriction enzyme’ target sites are rarer still.  Therefore, if we assume that each site in a genome is roughly as likely to show variation as is any other site, then only by obtaining full sequences for long segments of a genome can we hope to discover more than a tiny fraction of the variation that it harbors.  The publication of ‘population sequence data’ from fruitflies thus signaled a sea change in the study of genetic variation, transforming it from a field reliant on fragmentary and anecdotal evidence, to one that would be built around thorough and systematic survey.  Clever as they were, researchers before 1983 had to take shots in the dark in order to bag polymorphisms; Kreitman’s data marked a flip of the lightswitch.  And those data, in their own right, showed resoundingly how useful a quantitative estimate of genetic diversity could be.  As such, the data themselves are worth briefly reviewing, before moving on to how genetic diversity is calculated.

Using cutting-edge technology of the early 1980s, Kreitman managed to sequence a 2721-base segment of the fruitfly (Drosophila melanogaster) genome, in eleven individual flies, each of which hailed from one of five geographically disparate populations.  By modern standards, his hard-won data were meager, representing less than one of every sixty thousand bases in the fruitfly genome.  By comparison — and as a striking example of the biotech equivalent of Moore’s law — one high-throughput DNA sequencer today can sequence the whole euchromatic portions of more than sixty fruitfly genomes in less than a day.

Lacking such a technological embarrassment of riches, Kreitman had to carefully pick the small segment of DNA that he focused on. He chose a gene — one of 14000 or so in the fruitfly genome — called Adh, which encodes a functionally intriguing protein called alcohol dehydrogenase, or Adh (note the lack of italics, which are used for names of genes but not those of the proteins they encode).  Adh is an enzyme that helps detoxify alcohol — an important biochemical task for animals, such as fruitflies and humans, who rely on fleshy fruits for nutrition (because such fruits are also a favorite food for yeast, who have mastered the trick of  turning fruit sugars into toxic alcohol in order to drive off rivals and predators).

Though humble in scale, Kreitman’s data bore great fruit. First, they revealed the actual DNA sequence difference underlying a well-known, functionally important dimorphism in the amino acid sequence of fruitfly Adh. Of broader interest, however, were the revelations that 1) the Adh gene showed extensive DNA sequence variation (43 sites out of the 2721 showed variation, even in such a small sample of flies), 2) most of that variation involved single-nucleotide substitutions (rather than tandem nucleotide substitutions, or insertions or deletions of nucleotides), and 3) only one dimorphism (the aforementioned one) affected the amino acid sequence of the protein. The last observation was particularly important, as many (14) of the observed dimorphisms were actually harbored by the gene’s codons. Thirteen of the fourteen, however, were ‘synonymous’ dimorphisms — that is, they changed a codon to one of its synonyms (if you were reading earlier posts carefully, you’ll recall that the genetic code is degenerate).

Each of these points would become a foundational piece of data in the genomics era, serving as a benchmark against which future sequence data could be compared and debated, particularly in the context of the so-called neutral theory of molecular evolution, as expounded by influential theoretician Motoo Kimura. Saving discussion of such debate for another day, let’s move on to explore how one can squeeze a sensible and widely useful summary measure of ‘polymorphism’ out of DNA sequence data from a population.

In a word: Polymorphism I

April 10th, 2009

“If you come to any more conclusions about polymorphism, I shd [sic] be very glad to hear the result: it is delightful to have many points fermenting in one’s brain”

Charles Darwin, letter to Joseph Hooker, 1846

The term polymorphism, which peppers much of the biological and medical literature in this age of plentiful DNA sequence data, was well established among biologists by the time it first tickled Darwin’s brain. Coined from the Greek for ‘many shapes’, the word apparently gained particular currency among early Victorian botanists, who wanted a handy label for the spectacular proliferation of structural forms seen among newly collected plants from far-off imperial lands. When Darwin, who was well versed in botany (especially that of orchids), set out to catalog and interpret a Beagle-load of riotous diversity among animals and fungi, too, he naturally reached for polymorphism to summarize his emerging grasp of organismal variation. Many, many shapes indeed.

The concept of polymorphism held knotty mysteries for Darwin, though (as his letter to Hooker attests). In particular, his understanding of the inheritance of highly variable traits suffered, famously, without the insights of Gregor Mendel, the gardening monk whose obsessive bean-counting yielded the first clear mechanistic insights on the nature of genetic polymorphism. Darwin skimmed references to Mendel’s work, but missed its key lesson: that genetic variation is transmitted in discrete (‘digital’) packets, rather than in infinitely divisible (‘analog’) amounts. Lacking this insight, Darwin found it hard to account for the persistence of a wide variety of heritable organismal forms — polymorphism, that is — even in the absence of natural selection.

If, as Darwin assumed, genetic heritability were ‘analog’, rather than ‘digital’, then anything other than strongly assortative (‘like-with-like’) mating would homogenize a population, just as mixing two paints gives a uniform intermediate hue. And, as a puzzled Darwin realized, not only is non-assortative mating rampant in nature, but strictly assortative mating would tend to prevent any heritable innovation — that is, the stuff that puts the poly in polymorphism — from getting a foothold in a population in the first place. Moreover, one cornerstone of Darwin’s own theory of evolution by natural selection is the idea that such heritable innovations can be functionally useful (i.e., adaptive). Yet, lacking Mendel’s insight, Darwin couldn’t explain how even an initially highly useful heritable innovation could spread through a stable-sized population at anything faster than a glacial pace, given that its distinctive functionality would be quickly diluted by each generation of mating between carriers and non-carriers.

So how did Mendel make his revelatory deduction that genetic transmission is discretized?  He started with the simple observation that each pea plant in his garden was either violet-flowered or white-flowered, and yielded either very smooth peas or very wrinkly peas, regardless of its flower color.  And he went on to find that, when he crossed two plants (by fertilizing one’s flowers with pollen from another) that either matched, or differed, in flower color and/or pea shape, the offspring did not include lavender-flowered plants that made slightly wrinkly peas.  Rather, the plants continued to be either violet- or white-flowered, and to produce either very wrinkly or very smooth peas.

In Mendel’s careful (and, by his own admission, selective) analysis, the actual numbers of each type of plant suggested that each trait of interest (e.g., flower color, or pea shape) was governed largely by its own discrete, independently transmitted genetic factor, inherited in two copies (one from each parent).  For each trait, one type of copy of the underlying factor appeared to be ‘dominant’ to the other — that is, a plant’s flower color or pea shape would match the dominant type as long as it had inherited at least one dominant-type copy of the relevant genetic factor.  Such factors came to be called genes (though that word has more specific connotations today), and their variant copies came to be called alleles.  Strikingly, the discretized nature of these packets of inheritance turned out to extend even to their own composition, as strings of discrete chemical ‘letters’ that, if swapped for one another, differ in kind but not degree.
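The dominance rule described above can be sketched in a few lines. This is a minimal illustration, not a reconstruction of Mendel's own bookkeeping: the allele symbols "V" and "v" are labels assumed here, with violet taken as the dominant flower color, and each parent is assumed to pass on either of its two copies with equal chance.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Enumerate the equally likely offspring genotypes for one genetic factor.

    Each parent is a 2-character string of allele copies (e.g. "Vv");
    each offspring inherits one copy from each parent.
    """
    return Counter("".join(sorted(pair)) for pair in product(parent1, parent2))

def phenotype(genotype, dominant="V"):
    """A plant shows the dominant type if it carries at least one dominant copy."""
    return "violet" if dominant in genotype else "white"

def phenotype_ratio(parent1, parent2):
    """Tally offspring flower colors across all equally likely genotypes."""
    counts = Counter()
    for genotype, n in cross(parent1, parent2).items():
        counts[phenotype(genotype)] += n
    return counts

if __name__ == "__main__":
    print(cross("Vv", "Vv"))            # genotype counts out of four
    print(phenotype_ratio("Vv", "Vv"))  # flower color counts out of four
```

Crossing two plants that each carry one copy of each type ("Vv") yields three violet-flowered offspring for every white-flowered one, and no lavender intermediates.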

Mendel was lucky to have a relatively tractable organism around to study: pea plants, like humans,  are diploid, meaning that they have only two copies of the genome in most cells; many other crop plants have more complicated inheritance involving many more copies of a genome.  And, just as importantly, he focused on traits that had fairly simple genetic underpinnings.  Many traits in pea plants, and in other organisms, are governed fairly strongly by variation in many parts of their genomes; such variation would not have been statistically tractable to Mendel, and indeed remains a computationally intensive focus of much research (especially research on the genetic underpinnings of diseases) today.

Notably, Mendel’s work on pea plants (which actually addressed at least seven distinct traits of plant color or/and shape) involved variation that comprised, in the case of each trait, only two forms. Though such ‘binary’ factors can indeed generate ‘many shapes’ in their combination, Mendel actually discovered genetic dimorphisms. Strikingly, it turns out that, in real world  populations of all sorts of organisms, most genetic variation also takes the form of ‘dimorphism’, when viewed at the finest scale saliently relevant to gene function.  That is, rarely does a single site in the genome show more than two distinct spelling variants among samples taken from a real population — and in no case does it show more than five such variants (A, C, G, T or absence).  One could therefore argue that a more appropriate umbrella term for a given example of localized genetic variation, per se, might be multimorphism (’more than one shape’) or oligomorphism (’a few shapes’), rather than polymorphism.  The term is well established, however, and has likely transcended etymological connotations of ‘many’.  A more substantive usage peeve arises, in both the scientific literature and popular coverage of it, when ‘polymorphism’ is used, wrongly, to mean ‘variant’ or ‘allele’ (typically, an allele assumed to be derived by mutation, rather than an ancestral allele), as in this example from a press release by the University of Southern California’s news service:

“Because of the way genes are inherited from both parents, each participant could either have two copies of the polymorphism, one copy or no copies.”

Rather than pester USC’s editor, though, let’s raise a bicentennial toast to mister Darwin, and a hearty forthcoming 187th to brother Mendel.  Oh, and those conclusions about polymorphism?   As noted, the term was first used as a catch-all label for organismal variation visible to the naked eye (or with a rudimentary microscope), but, with increasing knowledge of DNA sequence variation, has often come to denote specific examples of such variation, as in efforts to catalog all the single-nucleotide polymorphisms (SNPs) in a given genome.  Moreover, as comparative genome sequence data have become more thorough and plentiful — and as approaches to quantitatively assessing such data have emerged — the term has also become a synonym for the overall level of genome sequence variation within a population. In the latter usage, it stands in contradistinction to divergence, the term of choice for summary measures of the sequence variation that distinguishes one population (ostensibly, ‘species’, though that term is impossible to rigorously define in a way that universally accords with intuitive judgments) from another.  In future posts, we will explore in greater detail how such measures of variation are calculated.  Overall, the origin and maintenance of genetic variation remain, arguably, the chief foci of empirical and theoretical inquiry in evolutionary genetics.  As such, you might say that it remains delightful, indeed, to have many points — and many shapes — fermenting in one’s brain.

*: Plants are, of course, plentiful; slow-moving; compact, light, and durable when dry; and often highly useful.  These qualities made them especially popular subjects for Victorian-era collection and sketching, and for imaging with early (and modern) photographic materials.

Second Life

March 15th, 2009

New Scientist has some intriguing recent stories on particular fronts of research, including a discussion of the search for so-called ‘shadow life‘ — meaning, essentially, life forms here on earth that don’t share the last common ancestor of currently known life forms.  Exobiologists, who hope to study life beyond earth, tend to focus on nearby candidate habitats (such as Mars and some volcanic moons elsewhere) that are thought to be energetically and chemically amenable to life as we know it.  But it’s intriguing to consider the possibility that critters with ancestry outside the currently understood ancestral tree of life are living under (or perhaps even in…) our noses.  In mulling the potential importance of understanding such life, and how it might differ from our own, the New Scientist article quotes Carol Cleland with an instructive analogy:

“If you were an alien biologist who’s interested in understanding what a mammal was, and all you had was zebras, it’s very unlikely that you would focus on their mammary glands, because only half the specimens have them. You’d probably focus on the stripes, which are ubiquitous.”

This analogy, though pithy and thought-provoking, strikes me as incomplete. If alien biologists thought like human biologists, for example, they’d likely first focus on traits that they shared with zebras (such a bias toward expecting and recognizing the familiar is reflected, comically, in the vast array of subtle twists on human morphology that seem to characterize most sentient life forms — cosmos-wide — in the fictional world of Star Trek). But Cleland’s point — that life, perhaps like any phenomenon, is most effectively understood through systematic comparison of many examples — is well taken.

So how might a ’shadow’ life form differ from us other life forms? Well, for starters, it might not share the micron-scaled compartmental (cell- and other membrane sac-defined) architecture that we know so well. Or, for that matter, the familiar centralized (’genomic’) storage of replicative ‘instructions’. Or, more fundamentally, the carbon ‘backbone’-based chemistry that we recognize to be so versatile for building molecules that interact with each other in richly complex ways. Yet even if a newfound life form shared all of the foregoing traits with ‘non-shadow’ life, a key way in which it might differ from us might be in not sharing one of the most salient ’stripes’ that characterize life as we know it — our genetic code.

The latter term is so frequently (and dismayingly) misused in popular science journalism, that a definition is in order here.  By genetic code, biologists literally mean a key: that is, a specific translation scheme for making proteins (the molecules that catalyze or otherwise centrally direct most of the chemical interactions that we recognize as crucial to cellular function) from instructions written in the form of nucleic acids (which make up genomes). The ’texts’ to be translated into proteins, via this code, are found in the specialized segments of a genome that we call genes. Such texts are simply strings of 3-letter DNA ‘words’ called codons, each of which is translated into a particular amino acid ‘meaning’ as the protein (a polymer of amino acids strung together in the same order as the codons that specify them) is built.

Because the ‘alphabet’ of DNA has exactly four ‘letters’ (A,C,G,T), there are exactly sixty-four (i.e., 4 x 4 x 4) distinct codons. But there are, with minor exceptions, just twenty distinct amino acids used to make proteins in life as we know it. Other than these twenty amino acids, the only other ‘meaning’ to be conveyed by codons is the ’stop sign’ used to signal that a newly minted protein is complete. The genetic code, having more distinct words (64) than distinct meanings (21), is thus said to be degenerate — that is, some of its words have the same meaning.
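The counting argument is easy to check directly. The snippet below is just a sketch of the arithmetic; the variable names are mine, and the glycine example is one well-known case of several codons sharing a meaning in the standard code.

```python
from itertools import product

BASES = "ACGT"  # the four 'letters' of the DNA alphabet

# Every possible 3-letter 'word' (codon).
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

n_codons = len(codons)   # 4 * 4 * 4
n_meanings = 20 + 1      # twenty amino acids, plus the 'stop sign'

# More words than meanings, so some words must share a meaning.
is_degenerate = n_codons > n_meanings

# One real example of shared meaning in the standard code:
# all four codons beginning with 'GG' specify the amino acid glycine.
glycine_codons = [codon for codon in codons if codon.startswith("GG")]

if __name__ == "__main__":
    print(n_codons, "codons for", n_meanings, "meanings")
    print("degenerate:", is_degenerate)
    print("glycine codons:", glycine_codons)
```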

While such degeneracy can help buffer a code against ‘transmission error’, degenerate codes are inherently inefficient. One could, for example, easily devise a genetic code using just sixteen ‘2-letter’ codons (one of them with a ’shift key’ meaning), rather than the sixty-four ‘3-letter’ ones, to encode exactly the same ‘meaning’ information as the real genetic code does, in roughly 30% less ‘text’.  And when one considers other optimality criteria, too, the standard genetic code likewise falls short of ideal: Stephen Freeland and Laurence Hurst, for example, measured codes’ abilities to buffer the effects of genomic mutation — a major source of ‘transmission error’ — at the protein level, and estimated that the standard code, while remarkably good, was worse than trillions of possible alternative codes (and they even limited the field of contenders to codes that had exactly the same synonymous codon ‘families’ found in the standard code, with the same such family assigned to the ’stop sign’ meaning).

Yet when one surveys earthly life — from blue whales to bacteria (and including humans, of course) — one finds that nearly all known organisms use this standard code.  Known exceptions (including, incidentally, the code used within mitochondria) are very rare, likewise suboptimal, and involve reassignment of no more than a handful of codons relative to the standard code.  The striking near-ubiquity of the standard code, with its readily apparent functional shortcomings, likely won’t sit well with proponents of ‘intelligent design’, and continues to prompt debate among evolutionary geneticists themselves.

The leading explanation, championed by Francis Crick (who, in addition to co-discovering the structure of DNA, gleaned the first details of the standard genetic code for us), posits that the code is simply a ‘frozen accident’. In this view, the standard code, like the QWERTY keyboard, has been more lucky than good: out of oodles of possible variants, it worked well enough (potentially quite a bit better than early rivals), early in the evolution of a system of information flow, to become overwhelmingly common and well entrenched. As the system (earthly life, or earthly typing) evolved, important infrastructure (in the QWERTY analogy, not just keyboards themselves, but the programming details of word processing software, the synapses of typing instructors, etc.) became well adapted to the variant in question but poorly suited to any potential rival, even if the latter were much more efficient.

The ‘weight’ of such infrastructure now serves, in this view, to anchor the standard genetic code against intrinsic improvement. In evolutionary terms, consider the implications of even a tiny tweak of the genetic code in an organism that, like a human or a banana plant, relies on the code in making thousands of distinct proteins, each with finely tailored functionality. Swapping one amino acid for another, in thousands of proteins at once, would be overwhelmingly likely to wreak utter havoc with organismal function. Simply put, tinker with the code and the organism will, with near perfect certainty, fail to develop.

‘Shadow life’, however, could readily have a genetic code that differs from the one we know so well. If such differences were slight, we might infer that the newfound critter in question was a fairly close relative, despite its novelty in our systema naturae. Such a discovery might push back our understanding of the origin of life on earth, setting it in a temporally deeper and/or spatially broader context. If the differences between the two codes (assuming the ’shadow’ life form even had a recognizable genetic code) were vast, on the other hand, we might infer that the two domains of life had distinct evolutionary origins.

All of which raises the question — crucial to consider in any quest to recognize ‘shadow life’ — of how we are to define life in the first place. Historian Samuel Moyn, a keen thinker and a longtime friend, recently told me that, having informally polled some biologists on how they define life, he’d been surprised by the lack of consensus — or even individual clarity — in their answers. He wondered how a whole field could focus, with any lasting fruitfulness, on something that its researchers could not rigorously define. In noting the definitional slipperiness of life, Moyn is on to something; I’m not sanguine that any rigorous candidate definition of living matter will fully accord with our ‘intuitive’ answers to the question of whether something is alive. Textbook cases of ambiguity — such as ‘are viruses alive?’ — often hinge on some concept of ‘autonomous replication’; but, of course, nothing that we think of as unambiguously living (such as ourselves) can truly replicate ‘autonomously’, i.e., independently of necessary cofactors that we think of as external to the organism.

That said, I’m not sure that Moyn’s point applies only to biologists; mathematicians, for example, may ultimately have a hard time rigorously defining, say, number (at least in terms that withstand criticism from within their own ranks). While admitting that we biologists sweep a huge question under the rug in failing to rigorously define life, I’m nonetheless excited to be living and working in an era during which I anticipate that our understanding of the scope of life (however defined) in the cosmos may expand significantly. Here’s hoping we find something surprising, and not too hungry.

What’s in a (sur)name?

March 11th, 2009

When I was growing up, my family would occasionally get a piece of junk mail inviting us to send off fifty bucks or so for a handsome book. The brochure always featured a photo of a faux leather tome, stamped with a heraldic seal, and complete with a forked satin ribbon to mark one’s place amid the hundreds of gilt-edged pages. The book purported to be a history of ‘The Pearson Family‘, ostensibly an ancient clan descended from some Norseman named Per.

Ours was one of dozens of Pearson households listed in the St. Louis phone directory (as in any big American city), and I assume we were targeted by the book company through such public data (i.e., the McConaghy family down the block presumably didn’t get the same ad). But our family’s claim to the storied patrimony of Per is, uh, tenuous. My great-grandfather adopted the surname around 1920, at the behest of his brother, who worried that immigrants settling in Hamilton, Ontario would have a hard time finding work under so conspicuously Jewish a surname as Pinkovits. In light of this, we always got a kick out of the heraldic junk mail come-ons, imagining ourselves mingling at a worldwide Pearson family reunion, too short to find the punchbowl without stopping a Viking for directions.

Our family’s recent adoption of a surname is far from unusual, of course. Many North Americans today were born with surnames that were adopted by immigrant ancestors, or their early descendants, in the past few centuries. As is true in much of the world, these names are usually passed on patrilineally (i.e., only by fathers). In this sense, surname transmission mirrors that of mitochondria (which are passed on only by mothers), and effectively matches that of Y chromosome lineages. There are exceptions, of course, in which Y lineages are knowingly (through formal adoption) or unknowingly (through what human geneticists often call ‘non-paternity’, but would be better termed ‘cryptic paternity’ or, most simply, ‘cuckoldry’) given new surnames.

Considering the dynamics of this process, one might wonder how Y chromosome diversity will be distributed within, versus among, surname lineages in a given human population.  You might intuit that a few basic parameters will largely govern that distribution. On the name side, there will be rates of willful surname change, cryptic paternity, and differentiation of pronunciation/spelling. On the genetic side, there will be the background mix of Y chromosome haplotypes in the given population, and, as time passes, the mutation rate of those haplotypes, and the degree of variation in breeding success among them.

In a new paper, Turi King and Mark Jobling set out to assess some of these parameters in the British population, focusing on 40 surnames that have, according to records, been established in Britain for a long time. In doing so, they augment a stream of data that started with a 2000 paper focusing on just one surname: Sykes. That paper, by Brian Sykes (no coincidence there, of course) and Catherine Irven, suggested that nearly all modern Sykeses share a patrilineal ancestor roughly 20 generations ago (a blink of the evolutionary eye). King and Jobling’s more comprehensive data paint a more complex picture, but one that ultimately suggests a strong, lasting relationship between surnames and the Y lineages that carry them: many long-established British surnames show distinct Y haplotype compositions that are strongly biased toward one or a few haplotypes, which King and Jobling call ‘descent clusters’.

For some surnames, the Y lineages that make up those surname-specific ‘descent clusters’ happen to be similarly frequent in samples from the overall British population. In these cases, it’s hard to infer just how faithfully the surname has been passed on patrilineally; after all, the pattern might just as readily reflect a history of free adoption of the surname by various local patrilines throughout its history. As one might guess, some particularly common British surnames, such as Smith and King, show this relatively hard-to-interpret pattern.

For many other surnames, however, the most common ‘descent cluster’ Y haplotype(s) were quite rare in the general British population — or, if not generally rare, were so overwhelmingly common among carriers of the given surname that one could safely infer a robust association between surname and patriline(s). A striking exemplar of such clearly interpretable patrilineal clustering was Attenborough: nearly all sampled British men with this name showed a Y lineage that is particularly common in east Africa (and to a lesser extent in other parts of Africa, the Mediterranean basin, and southwest Asia), but very rare, overall, in Britain. In King and Jobling’s data, many other surnames showed such significantly distinctive ‘clustered’ Y haplotype composition.

King and Jobling are careful to note (and to verify by computer simulation), however, that the particular Y lineage composition seen for a particular surname need not closely resemble the Y-lineage composition among carriers of that surname, say, 20 generations ago.  Rather, some patrilines that originally carried the surname can readily have gone extinct (meaning that, at some point, the last remaining man carrying both the surname in question and that Y type changed his surname or/and died without having a son), leaving fewer and fewer distinctive Y lineages carrying the surname in question. Small populations (such as that comprising the male carriers of a given surname) are particularly prone to random extinction of lineages, per se; termed genetic drift, this random process can quickly and drastically change the Y lineage composition of the given surname.  King and Jobling infer that some cases of strong clustering seen for single surnames in their data may readily reflect such drift, rather than original foundation of the surname by just one or a few men.
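To get a feel for how drift alone can produce clustering, here is a minimal sketch of the idea. It is emphatically not King and Jobling's simulation: the model (a constant number of male carriers, each inheriting his Y lineage from a randomly chosen carrier in the previous generation) and every parameter value below are assumptions chosen purely for illustration, and surname changes and cryptic paternity are ignored.

```python
import random
from collections import Counter

def simulate_surname_drift(n_founders=20, n_carriers=100, n_generations=20, seed=0):
    """Follow Y lineages among carriers of one surname under pure drift.

    Returns a list of Counters (one per generation, starting with the founders)
    mapping each surviving founder lineage to its number of carriers.
    """
    rng = random.Random(seed)
    # Generation 0: carriers split evenly among hypothetical founder lineages.
    carriers = [i % n_founders for i in range(n_carriers)]
    history = [Counter(carriers)]
    for _ in range(n_generations):
        # Each man in the next generation inherits the Y lineage (and surname)
        # of a father drawn at random from the current carriers.
        carriers = [rng.choice(carriers) for _ in range(n_carriers)]
        history.append(Counter(carriers))
    return history

def modal_cluster_proportion(lineage_counts):
    """Share of carriers belonging to the single commonest lineage."""
    return max(lineage_counts.values()) / sum(lineage_counts.values())

if __name__ == "__main__":
    for generation, counts in enumerate(simulate_surname_drift()):
        print(generation, len(counts), round(modal_cluster_proportion(counts), 2))
```

Because a lost lineage can never reappear in this model, the number of distinct lineages can only stay the same or shrink from one generation to the next, even though no lineage has any advantage over another.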

A strange pattern in King and Jobling’s data goes unnoted in their paper: a modest positive correlation between surname length (syllable- or letter-count) and degree of patrilineal clustering (measured as the proportion of samples assigned to the largest lineage cluster).  In my discussion of this pattern with Jobling, a couple of potential explanations came up.  First, long names contain inherently more information than short ones — which might let researchers identify variants of longer names more accurately, including fewer ‘false positive’ matches that are likely to carry distinct Y lineages. Second, short surnames might be adopted, wholesale, more frequently than long ones, preferentially adding to their patrilineal diversity. In light of the question of surname adoption frequency, I’m curious about the degree of patrilineal diversity that might be found among carriers of surnames, such as Esposito (Italian ‘exposed’, as in ‘left out’), that were commonly assigned to foundlings in many parts of Europe. Such names may represent one extreme of the ‘founder diversity’ range, and, as such, might offer a good opportunity to gauge effects of a) background haplotype composition in a given population and b) genetic drift.

Plots of surname length, as measured by letter count or syllable count (American pronunciation), versus the proportion of carriers of that surname whose Y chromosome lineage belongs to the commonest (‘modal’) surname-specific cluster, in the data of King and Jobling (2009).

The prospect of picking particular surnames for further study highlights unusual problems with sample-donor confidentiality posed by studies of surname-specific genetic variation.  It would clearly be wrong to assume, from otherwise anonymous data published in a paper such as King and Jobling’s, that a given person — say, naturalist David Attenborough — carries a given allele, no matter how common it is in a reported sample.  To more safely preclude such potentially highly personal overinterpretation, however, some prior authors have, in various ways, avoided fully specifying data for particular surnames.  Such ‘redacted’ results may, however, be less usefully interpretable — especially by workers in tangentially related fields — than detailed name-specific data.

Arguably, even King and Jobling’s relatively detailed data are, ultimately, demographic trivia, overly specific to a particular population.  Yet they may offer some modest, potentially comparatively useful ethnographic insights, particularly regarding early surnaming conventions (they discuss the possibility of distinguishing signals of patrilineal variability as they may relate to geographic, occupational, or other types of surname origin) and, more weakly, the sex lives of ancestral Britons.  And the new paper carries one more nugget of potential interest, too: data on the surname Jefferson. Nearly all of the newly sampled British Jeffersons carry one of the two most common Y lineages in Britain. Only about 4% of them, by contrast, carry the distinctive Y lineage found among patrilineal descendants of American statesman-scientist Thomas Jefferson (and that also appears, much more densely than in Britain, in Mediterranean populations). As famously reported in another paper by Jobling and colleagues, those American descendants likely include descendants of Jefferson’s son Eston, whose personal story exemplifies the complex association of surnames with Y haplotypes. Born into slavery (and never publicly acknowledged by his slave-owning father), Eston used his mother Sally’s surname, Hemings. Sally, in turn, got that surname from her own mother, likely because her father too (reportedly a slaveowner of European ancestry, like Jefferson) refused to acknowledge his extramaritally fathered children — especially those understood to have African ancestry.

The new emperor’s clones: Obama’s refreshing but tepid embrace of reasoned sci/tech policy

March 10th, 2009

On 9 March, President Obama issued an executive order to lift hobbling restrictions on embryonic stem cell research  imposed by his predecessor, George Bush. In its own right, Obama’s order enacts a substantive and welcome change, jumpstarting an exciting and clinically promising line of biomedical research. But the revival of stem cell research can also be seen as part of a larger, ongoing effort to strip away religiously motivated fetters placed on many fronts of American science by the Bush administration.

Sadly, despite some real progress, these fetters continue to ominously hinder the American public discourse on science — even on occasions, such as the new executive order on stem cell research, that should be clear victories for empiricism.  Like other reversals of Bush policy that have reinstated sound, empirically driven policy on such matters as greenhouse gas regulation, the stem cell order conveys a clear grasp of the societal value of the ever skeptical mode of systematic inquiry that we call science.  Yet, in remarks at the signing ceremony, Obama was careful to solemnly profess his ‘faith’. In doing so, he likely meant to console opponents of stem cell research, many of whom are religious. Reaching out to opponents is a laudable goal, and one that Obama consciously cultivates in his role as a ‘uniter’. Yet it is frustrating that, in a national ‘teachable moment’, Obama could apparently find no better way to do so than by emphasizing a pat solidarity rooted in shared baseless beliefs.

And that wasn’t the only way in which, even while redressing another case of scientific sabotage by his predecessor, Obama shrank from fully championing a reasoned approach to policymaking.  As a sign of specific future policy intentions, Obama’s latest profession of faith may have been less telling than his use of the signing ceremony to harshly, but emptily, condemn the prospect of human reproductive cloning.  Speaking with customary force, but without laying out a chain of reasoning such as we’ve come to expect from him, he declared that ‘[cloning] is dangerous, profoundly wrong and has no place in our society or any society.’ Obama’s mention of cloning as a foil to stem cell research reinforced a pattern that has emerged in American politics, in which the two issues seem to have become joined at the hip, appearing together with the rote predictability of a pantomime hero and villain. This rhetorical conflation (which is nearly always resolved with some dramatic flourish of contrast) may serve mainly to let supporters of stem cell research distance themselves from an ostensibly obvious evil, thereby reassuring us of their credentials as decent members of the human race. But if stem cell research is right, must reproductive cloning be wrong? Perhaps this moment — punctuated, as it is, by a major change in federal biomedical research policy — is a good one for reviewing the question.  Below is an attempt, in several parts, to do just that.