On polymorphism III: Working with peanuts
We’re now ready to ask — and answer — the question: how much polymorphism is to be found in a given population? Remember that ‘polymorphism’ here specifically means genetic diversity, which is ultimately defined by characters (think ‘A, C, G, T’) that differ in kind, not degree.
In this important sense, DNA sequence variation differs from variation in, say, mass, or number-of-toes. The latter traits are inherently quantitative, whereas genotype is, at the finest scale, inherently qualitative. As such, conventional measures of quantitative variation — particularly univariate moments, such as variance — don’t apply, at least directly, as measures of sequence variation.
Take heart, though. Even if we have to bypass (for now) old friends like variance, we will indeed be able to arrive at a summary measure (yes, a quantitative one!) of genetic diversity that is well defined and widely applicable. The last point is important: we want to be able to use the same yardstick of diversity to study populations of any given organism. In our example, we’ll go with the organism most familiar to us: people.
To keep things simple, though, we will agree to reduce a population of living, breathing people — each of whom comprises trillions of cells, each of which carries a genome billions of nucleotides long — to a population of alleles. Specifically, we’ll imagine picking just one cell from each person, then zooming in and focusing on just one site in the genome (keeping in mind that the cell likely contains two copies of that site — but more on that later). In simplifying things this way, we mentally dissolve away the non-genomic parts of cells, and all the other potentially distracting trappings of our human population, until what remains is simply a bunch of representative copies of that single site in the genome. It is this population of alleles whose level of diversity we aim to estimate, at least for starters.
Attention: Candy
Admittedly, even dissolving away everything but a single site in the genome may not simplify things enough to proceed smoothly. After all, mentally manipulating a population of invisibly tiny copies of a genome site is not the most intuitive task. So, to give the brain something concrete and familiar to work with, let’s invoke a mouthwatering analogy. Specifically, we’ll liken our population of copies of a genome site to a jarful of colorfully candy-coated peanuts. Candied peanuts — rather than say, jelly beans or gumballs — are ideal for a subtle reason that I’ll explain later.
The jar we’re imagining holds many, many such candy peanuts (more, even, than the jar pictured at the top of the post). To be precise, let’s say the jar holds exactly 1000 nuts. This number is the first of quite a few that will be popping up, meaning that we’ve reached a point where a conscientious science guide, like an avuncular airline pilot forecasting a ‘patch of rough air’, is supposed to ruefully warn you about the thorny math to come. Fear not, though: we need basic arithmetic, not Stephen Hawking stuff, to reason our way to a working concept of genetic diversity. So don’t fret.
Now, in principle, 1000 peanuts, each coated in a shade of candy paint, could sport as many as 1000 unique hues. But our peanuts are typical mass-produced sweets, so, for industrial efficiency, they come in just a few distinct colors. Four, actually: red, yellow, green, and blue. Picturing our jarful of a thousand such nuts, we might ask our key question: ‘How diverse, by color, are the nuts in the jar?’ Now, in answering that question, let’s agree to consider only the color of the candy coating (not the roasted interior) of each nut, and to forgo trying to estimate how different from each other two colors might be. That is, we’ll treat the colors of any two nuts as either the same, or different, with nothing in between. Moreover, we’ll assume that the candy paint on our peanuts never fades or otherwise changes in color.
Counting colors: Diversity = 4?
Right off the bat, it may be tempting to posit that the nuts in our jar, by virtue of coming in exactly four distinct colors, have a ‘color diversity’ of 4. And promisingly, this answer would capture something essential (if obvious) about the situation: 4 is both more than 1 (the value we would estimate for a clearly less color-diverse jar of nuts that were all coated the same color) and less than 1000 (the number we’d assign to a jar of undeniably more color-diverse nuts, in which each had its own unique hue).
But before we settle on simply counting colors to define ‘color diversity’, note that the range of possible values for this count would always depend on the total number of nuts in the jar. If a jar held just three nuts, for example, its ‘color diversity’ couldn’t exceed 3. How are we to compare such an estimate to the value we obtain for a jar of, say, five nuts in just three colors? Would these two jars be precisely equal in color diversity? My gut, and maybe yours too, says no: a five-nut, three-color jar, which carries more nuts than it ‘needs’ to in order to qualify as tricolore, seems less color-diverse than a three-nut, three-color jar.
Ratio profiling: ~~Diversity = 4?~~ Diversity = 0.004?
My gut also tells me that a broadly useful measure of diversity should depend as little as possible on other population-specific qualities/quantities (such as the number of individuals in the population). With this aim in mind, we might gravitate to the idea of using a ratio to measure diversity. In particular, why not just divide the number of distinct nut colors by the number of nuts in the jar? Perhaps this approach would churn out ‘diversity’ estimates that compare more sensibly, one to the next, even for jars that carry different numbers of nuts. A jar of three nuts in three colors, for example, would have a ‘nut colors per nut’ ratio of 3/3, or 1. The round unity of this estimate would suggest that we think of such a jar as ‘fully’, rather than ‘partly’, diverse; and this, in turn, would accord with the fact that such a jar is as color-diverse as a three-nut jar can be. By comparison, a jar of five nuts in just three colors would have a lower ratio of 3/5, or 0.6, a result that fits our intuition that such a jar isn’t ‘fully’ diverse.
When applied to our original jar of 1000 nuts in four colors, such a ratio-based approach would yield a color diversity estimate of 4/1000, or 0.004. Reassuringly, this value is still more than 1/1000 and less than 1000/1000, per our intuition. Moreover, because we can’t have more nut colors than nuts, the new ratio-based measure would, conveniently, always take on a value somewhere in the range of zero to one. As such, it would be what scientists call a normalized measure, with a tidy built-in scaling that carries some mathematically useful properties. So maybe we’re getting somewhere with this ratio idea.
However, by turning to such a simple ratio — the number of distinct colors in the sample divided by the number of nuts in the sample — we would iron a new wrinkle into our definition of diversity, again putting it at odds with intuition. To spot that wrinkle, let’s do something that scientists often do in road-testing an idea: examine a so-called ‘boundary case’, where some variable under consideration takes on its highest (often infinity) or lowest (often negative infinity, zero, or, in cases like ours, one) possible value. In our scenario, we’ll check the boundary case of a jar that holds just one peanut.
Regardless of how we define diversity, one nut sitting alone in a jar has, of course, a single characteristic color (lonesome blue, perhaps). The total number of colors — here the use of the plural is just a formality — in a one-nut sample (i.e., jar) must therefore equal the number of nuts in the sample. As such, a ratio-based estimate of such a jar’s color diversity would be 1/1 — or, well, 1. If you think about it, this means that we would deem a one-nut, one-color jar to be just as color-diverse as a three-nut, three-color jar. Moreover, we would hold both of these jars to be exactly as diverse as a thousand-nut, thousand-color jar! True, each of these jars, however many nuts it contained, would be as color-diverse as it could be. Nonetheless, intuition tells us ‘Sorry, but a jar containing one nut — whatever its color — is simply not as color-diverse as a jar containing a thousand nuts in a thousand colors’.
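For readers who like to poke at numbers, here is a minimal sketch of the ratio idea and its boundary-case problem. The function name `color_ratio` and the toy jars are my own illustrative choices, not anything standard; integers stand in for the thousand unique hues.

```python
def color_ratio(nut_colors):
    """Number of distinct colors divided by number of nuts in the jar."""
    return len(set(nut_colors)) / len(nut_colors)

# Toy jars (illustrative assumptions).
three_in_three = ["red", "yellow", "green"]
five_in_three = ["red", "red", "yellow", "green", "green"]
lone_nut = ["blue"]
thousand_in_thousand = list(range(1000))  # 1000 nuts, each with its own "hue"

for label, jar in [
    ("3 nuts, 3 colors", three_in_three),
    ("5 nuts, 3 colors", five_in_three),
    ("1 nut, 1 color", lone_nut),
    ("1000 nuts, 1000 colors", thousand_in_thousand),
]:
    print(label, color_ratio(jar))
```

The one-nut jar, the three-nut jar, and the thousand-color jar all come out at 1.0, which is exactly the wrinkle that intuition objects to.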
Match and mismatch
Apparently, in trying to confine diversity to a scale that doesn’t vary with the size of the sample, we lost some ability to resolve how ‘surprising’ or ‘unsurprising’ a given finding of ‘diversity’ is. So why, exactly, did the simple ratio-based definition, which had raised our hopes, fail this way? To see why, imagine doing something drastic: wantonly spilling the contents of the original jar onto a table. Having done so, and taking care not to let any roll over the edge, we’ll start tallying colors one nut at a time.
Now, in your mind’s eye, a thousand peanuts coated in four candy colors and strewn across a table may make up a wildly gaudy mosaic — a bit like a pointillist painting executed, in a stylistic stretch, by Jackson Pollock. If this is what you’re picturing, though, I’m sorry to tell you that your imagination has been led astray (perhaps by eager tastebuds…or by the misleading photo above). In fact, when we spill out our jar onto the table, here’s what we find: exactly 994 red-coated peanuts, and 2 peanuts each in yellow, green, and blue.
This hardly qualifies as diversity — to the contrary, it’s classic tokenism. Yet the simple ratio-based measure that we defined would estimate the color diversity of these peanuts as 0.004 — the same value as would be estimated for a jar holding a more ‘balanced’ mix with 250 peanuts in each of the four colors. Clearly, the simple ratio-based candidate definition of color diversity fails to capture a key intuition: that an even mix of types is more ‘diverse’ than a mix skewed toward just one (think of that boundary case again) or a few types.
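To make the failure concrete, here is a small sketch that feeds both jars to the ratio measure. The dictionaries of color counts and the name `color_ratio_from_counts` are hypothetical conveniences for illustration.

```python
def color_ratio_from_counts(counts):
    """Distinct colors divided by total nuts, given a color -> count mapping."""
    return len(counts) / sum(counts.values())

tokenist = {"red": 994, "yellow": 2, "green": 2, "blue": 2}
balanced = {"red": 250, "yellow": 250, "green": 250, "blue": 250}

print(color_ratio_from_counts(tokenist))
print(color_ratio_from_counts(balanced))
```

Both jars get the same score, because the ratio only sees how many colors are present, not how the nuts are spread among them.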
To see this another way, imagine that we can’t view our jar at all, but instead have to gauge the color diversity of its contents by reaching in blindly, fishing out a peanut for inspection, eating it (for a parallel survey of flavor-diversity, of course), and repeating this procedure for a while. Now, if our jar is a tokenist one, then, whenever we blindly plunge a hand in, we can safely bet that we’ll be pulling out a red peanut. By contrast, if we reach into a jar containing a balanced, cosmopolitan mix of peanuts in several colors, we won’t be able to guess ‘red’ (or anything else) with much confidence.
How do we express this insight in formal terms? For starters, we care little about the identity of any particular color(s); that is, a nearly all-green jar is just as tokenist as a nearly all-red one. Moreover, color itself, the trait that we happen to be surveying, is in no way integral to a general definition of diversity. If not interested in color, for example, we could wonder just as easily about diversity of flavor, country of manufacture, or symbol stamped on the candy shell. The qualitative trait that we keep track of, and the specific values that the trait may take, aren’t the point – what’s important is how often we choose an individual that differs, in that chosen trait, from what’s come before. Diversity is, after all, premised on difference between members of a population.
Brass tacks: ~~Diversity = 4? Diversity = 0.004?~~ Diversity ≈ 0.012
If we start thinking of diversity as defined by direct comparison of one member of a population to another, then the diversity of the one-nut ‘boundary case’ that doomed our ratio idea becomes a moot question, like the koanic clapping of a single hand. Mathematically, we can succinctly formalize our new insight, that diversity requires differences between individuals, by asking ‘How often, if we pick two individuals at random from the population, will they differ in the trait of interest?’ To grind out an actual number using this simple algorithm for assessing diversity, let’s reason through the possible outcomes of two blind picks from our jar of peanuts.
The first nut we pick will, of course, very likely be red. To be precise, the chance of getting a red nut on the first random pick is 0.994. Now, unless we throw that first nut back in (instead of eating it) and remix the nuts thoroughly, the chance of picking a red nut on our second pick will differ slightly from what it was on the first pick; this is because, on the second pick, we’ll have only 999 nuts in the jar, of which 993 or 994 will be red. And 994/1000 equals neither 993/999 nor 994/999.
Those three numbers are, however, awfully similar to each other. In fact, we’re working with such a big ‘population’, and sampling such a small fraction of its ‘individuals’, that we can helpfully simplify our probability calculations by treating the chance of picking red as if it stays the same from pick to pick. Such a simplification is especially helpful in real-world sampling situations where, though unsure of just how many ‘nuts’ are in a given ‘jar’, we’re quite sure that we’ll be sampling only a small fraction of them.
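If you would like to see just how little the simplification costs, the following sketch compares the exact chance of two reds (eating the first nut) with the ‘constant odds’ version. It uses exact fractions from Python's standard library; the variable names are my own.

```python
from fractions import Fraction

total_nuts, red_nuts = 1000, 994

# Chance of red on the first pick.
p_first_red = Fraction(red_nuts, total_nuts)

# Chance of red on the second pick, depending on what was eaten first.
p_second_red_after_red = Fraction(red_nuts - 1, total_nuts - 1)   # 993/999
p_second_red_after_other = Fraction(red_nuts, total_nuts - 1)     # 994/999

# Two reds in a row: exact (nut eaten) versus constant-odds approximation.
both_red_exact = p_first_red * p_second_red_after_red
both_red_constant = p_first_red ** 2

print(float(both_red_exact))
print(float(both_red_constant))
print(float(both_red_constant - both_red_exact))
```

For a jar this size the two answers differ only in the sixth decimal place, which is why treating the odds as constant is a reasonable shortcut here.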
Under this simplifying ‘constant odds’ (or, as it’s often termed by statisticians, ‘sampling with replacement’) assumption, the chance of picking two red nuts in our first two picks is 0.994 squared, or a little more than 0.988. Which means, conversely, that our chance of not picking two red nuts in our first two picks is roughly 1 – 0.988, or 0.012. Now, note that this latest number isn’t exactly the chance that the first two randomly picked peanuts will differ in color. Even if the two picks aren’t both red, after all, both could be yellow, or both green, or both blue. With our mostly red jarful, each of the latter cases is very unlikely, of course, but we can easily account for them.
To do so, we’ll again assume that the odds of picking a given color stay constant from pick to pick. For each of the three colors other than red, these odds are roughly 2/1000, or 0.002. So the chance of picking two yellow nuts on consecutive picks will be roughly 0.002 squared, or 0.000004 — and the same holds for each other rare color in the jar. Overall, then, the chance that we’ll pick successive nuts that match in color will be roughly 0.988 + 3(0.000004), or, well, about 0.988. Naturally, the chance that we pick successive nuts that differ in color is the complement of this matching probability; in the end, then, it’s still roughly 0.012. Let’s adopt this number as our new ‘chance of mismatch’-based estimate of our particular jar’s nut color diversity.
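The arithmetic in the last two paragraphs can be sketched in a few lines. This is just the constant-odds calculation spelled out for our mostly red jar; the dictionary of per-color odds is an illustrative assumption taken from the counts above.

```python
# Per-pick odds for each color in the mostly red jar (constant-odds assumption).
odds = {"red": 0.994, "yellow": 0.002, "green": 0.002, "blue": 0.002}

# Chance that two picks match: sum over colors of (odds of that color) squared.
chance_of_match = sum(p ** 2 for p in odds.values())

# Chance that two picks differ is the complement.
chance_of_mismatch = 1 - chance_of_match

print(round(chance_of_match, 3), round(chance_of_mismatch, 3))
```

The three rare colors contribute only 0.000012 to the matching chance, so the mismatch chance stays at about 0.012 after rounding.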
For comparison, let’s run through the corresponding figures for a 1000-nut jar containing 250 nuts in each of four colors. In that case, the chance of picking two nuts of the same color on the first two picks will be roughly 4 * (0.25 * 0.25), which works out to 0.25. Thus the chance of picking two nuts that differ in color will be roughly 1 – 0.25, or 0.75. Consistent with intuition, this value is much greater than 0.012, our new estimate of our original jar’s diversity. And, importantly, our new ‘chance-of-mismatch’-based diversity measure is still normalized (bounded by 0 and 1), and still puts our jar’s color diversity squarely between the values obtained for a jarful of 1000 nuts in just one color, and a jarful of 1000 nuts in 1000 distinct colors.
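Here is a hedged sketch that wraps the same calculation in a reusable function and lines up the four jars discussed so far. The function name `mismatch_chance` and the jar labels are mine, and the constant-odds simplification is assumed throughout.

```python
def mismatch_chance(counts):
    """Chance that two picks differ in color, assuming constant odds per pick."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

# Each jar is a list of per-color nut counts (illustrative).
jars = {
    "one color": [1000],
    "mostly red": [994, 2, 2, 2],
    "balanced": [250, 250, 250, 250],
    "all unique": [1] * 1000,
}

for label, counts in jars.items():
    print(label, round(mismatch_chance(counts), 3))
```

The values rise from the one-color jar, through the mostly red and balanced jars, to the all-unique jar, and every one of them sits between 0 and 1.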
‘Real-jar’ uncertainty
The foregoing is promising stuff, but we’re not yet in the clear. Remember that, if we look at a jarful of nuts from the outside, with no extra knowledge, we won’t actually know the odds of picking any given color ahead of time. As such, if we want a good estimate of ‘color diversity’ for an unfamiliar jar of candy peanuts, we will have to repeat the two-pick experiment many times, after which we might notice that the two picked nuts differ in color in, say, roughly 1.2% (or 0.012) of trials. This is an arduous approach to estimating a measure of the nut ‘population’! But what if, instead of picking nuts one at a time, we could grab a whole handful (assuming that the nuts are distributed randomly in the jar) and then compare the colors of each pair of nuts in that ‘sample’, to see how often those paired ‘individuals’ differ? Will that give us enough comparisons to make a good estimate of diversity, without having to spend time picking out nuts one at a time from the jar?
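To get a feel for the arduous version, here is a rough simulation of the repeated two-pick experiment. It is only a sketch: the function name, the fixed random seed, and the choice of 200,000 trials are all my own assumptions, and the picks are made with constant odds (that is, with replacement).

```python
import random

def simulate_two_pick_mismatch(counts, n_trials, seed=0):
    """Fraction of simulated two-pick trials in which the two nuts differ."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    colors = list(counts)
    weights = [counts[c] for c in colors]
    # Draw all picks at once, then read them off two at a time.
    picks = rng.choices(colors, weights=weights, k=2 * n_trials)
    mismatches = sum(
        picks[i] != picks[i + 1] for i in range(0, 2 * n_trials, 2)
    )
    return mismatches / n_trials

mostly_red = {"red": 994, "yellow": 2, "green": 2, "blue": 2}
estimate = simulate_two_pick_mismatch(mostly_red, n_trials=200_000)
print(estimate)
```

With this many trials the estimate should land close to 0.012, though the exact digits depend on the random draws. Needing a couple of hundred thousand trials to pin the number down is a fair picture of why a shortcut would be welcome.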
Well, suppose that our handful contains a whole number k of peanuts. The number of distinct peanut-pairings within that handful will be k(k-1)/2 (an expression that specialists in discrete mathematics call ‘k-choose-2’). To confirm this, you can start by checking the number of possible pairings for some low values of k: two nuts can be paired in just one way; three nuts yield three distinct pairings; four nuts give six distinct pairings; and five nuts give ten. Notice that, for each of these low values of k, there are indeed k(k-1)/2 distinct pairings. You can then extrapolate that pattern to indefinitely higher values of k by noting that, each time you increase k by one (by adding a new nut to the handful), the number of new pairings that you add to the total number of pairings is simply the old value of k (i.e., one less than the new value of k). For example, when you add a sixth nut to a group of five, that new nut can be separately paired with each of the first five nuts, but no other new pairings are made (and none are lost). By this insight, the new total number of pairings can be written as ((k-1)((k-1)-1)/2) + (k-1). This can be rewritten as ((k-1)(k-2) + 2(k-1))/2, which simplifies to, you guessed it, k(k-1)/2.
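If you would rather let a computer do the counting, the sketch below lists every pairing by brute force and checks it against the formula, against Python's built-in `math.comb`, and against the add-one-nut step described above. The helper name `pairings_by_formula` is my own.

```python
from itertools import combinations
from math import comb

def pairings_by_formula(k):
    """Number of distinct pairs among k nuts, via k(k-1)/2."""
    return k * (k - 1) // 2

for k in range(2, 8):
    # Brute force: actually list every distinct pair of nut positions.
    enumerated = len(list(combinations(range(k), 2)))
    # Adding the k-th nut contributes k-1 new pairings to the previous total.
    stepwise = pairings_by_formula(k - 1) + (k - 1)
    print(k, enumerated, pairings_by_formula(k), comb(k, 2), stepwise)
```

Each printed row should show the same count four times over, for example 10 pairings when k is 5.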
And, luckily, k(k-1)/2 grows quite quickly (nearly as quickly as the square of k, in fact) with increasing k, meaning that even a fairly small sample of individuals from a large population will provide many comparisons with which to assess its diversity. By asking what fraction of those many comparisons are mismatches (caveat: we’ll average over one less comparison than we can actually make, because the match/mismatch status of that ‘last’ comparison turns out to be fully determined by the pattern of the other comparisons), we have a handy measure, the pairwise chance of mismatch, that can be readily calculated, and that accords with basic intuitions about diversity, including the key difference between an overwhelmingly red jarful of candied peanuts, and a more compellingly colorful jar containing equal numbers of nuts in several colors.
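The handful approach can be sketched as follows. Note the assumptions: the function name is hypothetical, the handful size of 40 and the random seed are arbitrary, and for simplicity this sketch averages over all k(k-1)/2 pairs rather than applying the one-comparison adjustment mentioned in the caveat above.

```python
import random
from itertools import combinations

def handful_mismatch_fraction(handful):
    """Fraction of all distinct pairs in the handful that differ in color."""
    pairs = list(combinations(handful, 2))
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / len(pairs)

# A balanced 1000-nut jar, written out nut by nut.
balanced_jar = ["red"] * 250 + ["yellow"] * 250 + ["green"] * 250 + ["blue"] * 250

rng = random.Random(1)                  # fixed seed so the run is repeatable
handful = rng.sample(balanced_jar, 40)  # one blind grab of 40 nuts

print(len(list(combinations(handful, 2))))  # comparisons available from one grab
print(handful_mismatch_fraction(handful))
```

A single grab of 40 nuts already supplies 780 pairwise comparisons, and for the balanced jar the resulting fraction should fall in the neighborhood of 0.75.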
Nucleotide diversity
In calling this newly defined measure the color diversity of our jar of nuts, we’ll be making a direct analogy to a key measure in population genetics: nucleotide diversity (or, as it’s often nicknamed, polymorphism). Nucleotides are, of course, the individual ‘beads’ that are strung together to make up the genome. And, like the nuts in our jar, each nucleotide, if it’s there, comes in just one of four ‘colors’: the chemically distinct subunits called adenine, cytosine, guanine, and thymine (A, C, G, and T, for short), one of which, like a little flag, marks each nucleotide. (Insertions and deletions of nucleotides do happen, and complicate things by making it harder to tell which nucleotide in one genome ‘corresponds’ to which nucleotide in another genome.)
Before we explore what nucleotide diversity can tell us about a population, another word on why we used peanuts as analogs to human alleles. As anyone who has ‘sampled’ through a bag of whole roasted peanuts knows, many a peanut shell cracks open to reveal two nuts, rather than just one. In this sense, a peanut shell can be likened to a human zygote, which corrals together two copies — one from mom, one from dad — of nearly any segment of the genome that we might pick to examine.
So why, in our mathematical musing about candied peanuts, did we conveniently ‘paint over’ the fact that peanuts are born paired up in shells? Dodging that question, I could plead that it’s hard to tastily coat (or even see) peanuts in easily countable colors while they’re still inside their shells. But, more to the point, taking the peanuts out of their shells underscores an analogous point about population genetics: as long as we’re interested in estimating diversity for just one site in the genome (and not other, subtler characteristics of genetic variation in the population), the fact that alleles inside real genomes are paired up turns out to matter fairly little.
And, in a subtler extension of the analogy, it’s worth noting that two-nuts-per-shell is not a hard-and-fast rule in the peanut world: some shells hold just one peanut, and still others hold more than two. Likewise, some human cells (eggs and sperm) carry no more than one copy of any given segment of the nuclear genome; other cells can carry just one copy of a particular segment, like the Y chromosome; and still other human cells carry more than two copies of a genome segment. The cells of people with Down syndrome, to cite a well-known example, often have three copies of chromosome 21, rather than the more familiar two. And, in a small fraction of women, many cells have three or more copies of the X chromosome. There’s an even more widespread example of this phenomenon, if we consider the little gene-crammed rings of DNA inside mitochondria. These degenerate, energy-liberating bacteria live in the cells of animals, plants, and other so-called eukaryotes, and, along with their genomes, are typically transmitted to the next generation only via eggs, not sperm. Nonetheless, each cell in your body contains dozens to thousands of these little half-captive half-stowaways, each of which may, in turn, contain many copies of its own little genome. So while a typical human cell may hold two copies of, say, chromosome 7, as well as one or two or more copies of the X chromosome, it also holds many, many copies of the mitochondrial genome.
A shelf full of jars
In our analogy, the candied peanuts in one jar represented copies of one site in a genome. We thus derived an estimate of ‘diversity’ for that site only. But we can readily expand the scope of our inquiry, and ask about the amount of variation found in a population of copies of a long segment of a genome. To do so, we can simply average the site-specific diversity estimates from throughout that segment, yielding a statistically robust — and very useful — estimate of its overall nucleotide diversity. Roughly speaking, this is akin to calculating the diversity of peanuts in each jar lined up on a whole shelf of such jars, and then averaging those estimates, to get a ‘shelf-specific’ estimate.
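The shelf-of-jars idea can be sketched as below. Everything here is a toy assumption for illustration: the four short sequences are made up, they are taken to be already aligned with no insertions or deletions, and the function names are hypothetical.

```python
from itertools import combinations

def site_diversity(column):
    """Fraction of distinct pairs of copies that differ at one site (one jar)."""
    pairs = list(combinations(column, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

def segment_diversity(aligned_copies):
    """Average of the per-site diversities across the segment (the shelf)."""
    columns = list(zip(*aligned_copies))  # one column of letters per site
    per_site = [site_diversity(col) for col in columns]
    return sum(per_site) / len(per_site)

# Four made-up copies of a six-site segment, assumed aligned and equal in length.
copies = ["ACGTAC", "ACGTAC", "ACTTAC", "ACGTAA"]

print([site_diversity(col) for col in zip(*copies)])
print(segment_diversity(copies))
```

In this toy example only the third and sixth sites vary, each with a diversity of 0.5, so the average across all six sites works out to one sixth.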
To the extent that we then build a valid population-genetic model to represent the population of copies of our genome segment, we can ‘plug in’ our segment-specific estimate of nucleotide diversity in order to shed surprisingly bright light on the ancestral history of the copies of the segment. Some of the cryptic, but newly ‘guessable’ characteristics of that history might include: the ‘effective’ size of the ancestral population, as roughly averaged over time; the average number of generations elapsed since the death of the last common ancestor of two randomly picked copies of the segment; the average mutation rate per nucleotide site, per generation; and/or the average number of migrants exchanged by two partially isolated sub-populations, per generation (provided that the population we’re looking at is divided into such sub-populations, and we estimate diversity for each one separately, as well as for the pair as a whole). Moreover, if we compare results from our segment of interest to those from other segments in the genomes of the same ‘host’ population, or to results from the ‘same’ segment in a different population altogether, we may find strikingly dissimilar estimates of nucleotide diversity. Such disparities can turn up by chance, but they can also reflect robust underlying variation in, say, mutation rate, population size, or the stringency of natural selection. Future posts will explore, in more meaningful detail, how simple numbers such as nucleotide diversity can be leveraged to draw detailed evolutionary inferences about these and other factors.