Skip to main content

‘Residual diversity estimates’ do not correct for sampling bias in palaeodiversity data

From Wikimedia Commons (CC BY-SA 3.0)

This blog post is way overdue, being mostly written months ago in late Oct. Anyway, it's a bit technical - but it relates to how palaeontologists quantify biodiversity through time (like the famous Sepkoski curve shown above).

I have a newish paper (‘Residual diversity estimates’ do not correct for sampling bias in palaeodiversity data) in Methods in Ecology and Evolution, with Chris Venditti and Mike Benton, that became available in Early View version in Oct last year (24 Oct 2016). The paper is very simple and straightforward. In it we assess a popular method that has been used numerous times to 'correct' for sampling bias in palaeobiodiversity data.

It is safe to say that most palaeontologists would agree that the fossil record is far from complete and that any kind of tallying of the numbers of species that were present in any given time period would suffer from this incompleteness - biodiversity curves (such as the famous Sepkoski curve, a variant of which shown above) are biased and cannot be taken at face value. So the exponential increase in biodiversity from the Mesozoic to the present seen in the Sepkoski curve may in fact merely reflect the increasing availability of rock containing fossils towards the present.

What palaeontologists have long disagreed on however is on the extent to which the fossil record is biased and therefore unreliable for estimating true ancient biodiversity signals. On the one hand, there are palaeontologists that have argued that the fossil record isn't all that bad and the patterns in biodiveristy that we observe, such as the exponential increase as well as the numerous peaks and troughs, genuinely represent real patterns of diversifications interspersed with extinction events. On the other hand, there are palaeontologists that have argued that the quality of the fossil record is abysmal, and observed patterns of peaks and troughs merely represent periods of rich and poor fossil preservations respectively with the exponential increase reflecting better sampling towards the recent - so by this reasoning, many perceived extinction events (except for the big ones) have been questioned as to whether they are genuine extinction events or just reflecting a lack of fossils from that time period.

This means that simply counting the number of unique species in a given geographical region (global, by continent, etc) for each time bin in a series of time bins, may in fact produce biased estimates of true palaeodiversity - the curve you see above cannot be taken at face value. If raw counts cannot be trusted, then there has to be a way to correct for 'known' or suspected sources of bias.

Fortunately, instead of just being pessimistic and ending it there, palaeontologists have been busy over the years proposing numerous methods to correct for such 'sampling bias'.

One widely used correction method is Smith and McGowan's (2007; Palaeontology 50: 765-774) 'residual' diversity method (RDM). The basic approach employed in RDM is in line with classical (but no longer recommended) statistical correction methods, which involve fitting a regression line between y (the variable of interest) on x (the confounding factor), and taking the residuals as 'corrected' data for subsequent analyses. However, RDM differs from these classical approaches in one key way: the regression is not performed on y on x but on y sorted from lowest to highest (y') on x sorted from lowest to highest (x'). The fitted model is then used to calculate 'residuals' as the difference between the model prediction and the original x (Fig 1).
Fig 1. How 'residuals' are calculated in Smith and McGowan's (2007) RDM, visualised using simulated data. (a) A bivariate dataset x and y are generated so that x is randomly sampled from a normal distribution and y is calculated as y = a + bx + e where a = 0.4, b = 0.6 (fixed at these values for convenience) and e is noise (mean = 0, standard deviation = 0.5). An ordinary least squares (OLS) regression model is fitted (thick line) with 'true' residuals represented as vertical lines. (b) x and y are independently sorted of each other (x' and y' respectively) and an OLS is fitted to this bivariate pair (best-fit line is in pink). Note the tighter relationship between y' and x' than in the original y-x pair with smaller but obviously structured residuals. (c) The model from b (pink line) is then charted on the original y-x bivariate plot and the deviation of y values from the pink line is used as the 'residuals' for subsequent analyses. Note the differences in pink 'residuals' of (c) with 'true' residuals in grey of (a).
Keen readers may recognise instantly that there is a serious problem with this approach - independently sorting y and x decouples the bivariate pairs, thereby destroying the paired, bivariate nature of the original dataset. The rationale stated in Smith and McGowan (2007) is that the pink model in Fig1b above is 'a model in which rock area at outcrop was a perfect predictor of sampled diversity' - or, a model in which sampling drives diversity, which we call 'sampling driven diversity model' (SDDM). In other words, the pink line is supposed to approximate the true fundamental relationship between the variables y and x - but the problem here is that SDDM through independent sorting does not in any way approximate the true underlying relationship between y and x. In fact, independent sorting doesn't achieve anything - it's more detrimental than anything else - more on this later.

But first, lets linger on the idea of a model in which rock area at outcrop was a perfect predictor of sampled diversity. One such model that springs to mind is simply a regression model of y on x with the estimated regression coefficients representing statistical approximations of the true underlying relationship between y and x. However, I don't think the authors (and subsequent authors using this method) had this in mind. Seemingly, the objective of the SDDM is to set up a proxy for a 'perfect predictive relationship' (i.e. no residual variance). Then y on y itself (or a multiple of y) would be ideal - that would be a perfect predictive relationship for y and all the variance in y will be explained by this model. On a more serious note, one way to achieve such a model is by formulating a theoretical relationship based on prior expectations. For instance, in biology, there are various scaling 'laws' proposed based on physics and geometry, e.g. the 2/3 (or 3/4 depending on who you believe) rule for scaling of area to mass. In most scaling studies that I've seen, researchers will compare the empirically fitted regression coefficient (the slope) against the expected scaling factor (like 2/3 = 0.67) and test whether the estimated slope is significantly different from the expectation. Significantly higher, and you've got positive allometry, or significantly lower, and you've got negative allometry. Not significantly different will be considered isometric. Following this approach, one could calculate expected values of y from a theoretical scaling relationship between y with x, and subtract those values from the observed y values, yielding the "corrected" y values. Unfortunately, in palaeodiversity studies, no such expected relationship between diversity and rock availability (map area, rock volume, fossiliferous formation counts) have been proposed. Instead, we're meant to accept that somehow independently sorting y and x and fitting a model on this relationship achieves this. In the absence of an expectation, the sensible thing to do is the first model I suggested in this paragraph: simply fit a regression model on y on x in their paired original orders.

Anyway, in order to demonstrate just how detrimental independent sorting is on regression modelling, we ran 5000 simulations in which we tested how many times the SDDM regression coefficients (beta) are significantly different from the data simulation parameter b = 0.6 (Type I error). We set a threshold at around 5% error rate (or parameter is incorrectly estimated in 5% of test cases) as an acceptable rate of failure.

And the result? SDDM fails miserably. Type I error rate is at the least 28% when noise is small and reaches 100% when the noise in the data is large. The standard regression model (SRM) in comparison performs as expected with Type I error rates hovering around the 5% mark.

Furthermore, SDDM does not randomly get the slope wrong, but systematically and directionally wrong - the mean slope from the 5000 simulations was significantly different from b = 0.6 (Fig 2).
Fig 2. (a) SRM lines are distributed randomly about the simulated parameter, b = 0.6, with no significant difference in mean slope. (b) SDDM lines are directionally biased.
In fact, this systematic and directional bias does not seem to be dependent on the simulation condition: two random deviates that should not have any significant relationship will have a positively significant relationship once sorted independently (Fig3a-b). But what's more worrying is that even if a negative relationship is simulated, then independent sorting will still force a positive relationship (Fig 3c-d)!
Fig 3. (a) Two random deviates are sampled from a normal distribution, with SRM showing no significant relationship between the two variables in 1000 simulations. The lines are randomly distributed around a mean value of 0. (b) Independently sorting the random deviates in (a) results in a positive and significant relationship. The difference in r2 is shown inset with colours corresponding to SRM and SDDM respectively. (c) A bivariate set is generated under a negative relationship. SRM lines are randomly distributed about a mean slope that is not significantly different from the simulation parameter. (d) Independently sorting the data in (c) results in a significant and positive relationship.
Fig 3 above should really convince you that SDDM (pink lines) through independent sorting DOES NOT represent the true underlying relationship between y and x. If you still believe that it does, then I don't know how else to explain this to you.

Nonetheless, imagine what this sort of correction does to your diversity data. Residuals calculated using such a method will certainly be wrong.

So what's the alternative?

That would depend on what you are interested in.

There are a number of reasons why you'd want to calculate a measure of diversity while accounting for sampling bias. For instance maybe you want to know how rich a given time period was in terms of the number of unique species compared to the previous or subsequent time periods. Or maybe you're interested in how diversity changes in response to environmental changes. Differences in diversity owing to fossil sampling would obviously affect these.

Your goal is to compare diversity between time bins
If you're interested in estimating a measure of diversity (accounting for sampling bias) for each time bin separately, then the popular thing to do is to use subsampling approaches like rarefaction or shareholder quorum subsampling (SQS). In a nutshell, subsampling approaches calculate some metric  (doesn't need to be diversity metric, it could be a statistical quantity like a mean) based on randomly sampling your data over a large number of times. Rarefaction is based on uniform sampling, whereas SQS is based on fair sampling (Alroy 2010) - the former is sampled to a quota (a desired sample size) while the latter is sampled to a quorum (a desired proportional representation of frequencies or something like that). However, subsampling methods, although they are by far the most popular method nowadays, are not without criticism.

Another approach to achieve something similar, is to estimate true richness and sampling rates simultaneously (TRiPS, Starrfelt and Liow 2016). TRiPS utilises sampling bias to inform how much richness is missing and attempts to predict true richness - and thus can produce richness estimates that are higher than the observed richness (diversity). Subsampling approaches on the other hand should only ever return observed richness as a maximum (and shouldn't return values over the observed values) - at least that's what I'd expect.

Your goal is to model diversity through time
Conversely, if you're interested in testing for relationships between diversity and some environmental variables (e.g. sea level, temperature, etc), while accounting for sampling bias, then the obvious thing to do is to use some form of multiple regression where observed diversity is the response variable y and a measure of sampling along with environmental variables of interest are all predictor variables x1, x2, ... xi. The estimated regression coefficients for the environmental variables b2 ... bi will represent the modelled relationship between diversity and environment after accounting for sampling, the effect of which is b1.

However, with diversity data through time there are a few things to consider seriously.

Use the model most appropriate for the data
First, since diversity is often quantified as the number of unique taxa, this variable is a count, and count data have different expectations with regards to statistical distribution (Poisson or a variant) from continuous data like body size (gaussian). This influences the choice of regression model - least squares regression (Ordinary Least Squares (OLS); Generalised Least Squares (GLS)) is actually inappropriate since the error structure is assumed to be gaussian. Logging the y and fitting a least squares regression is apparently not appropriate either. Generalised linear models (GLM) offer a range of models that deal with different distributions, including Poisson and derivatives like the negative binomial or zero-inflated. For modelling count data as a response variable, negative binomial models are probably a safe bet - over dispersion (makes data deviate from a strict Poisson distribution, i.e. mean = variance) is pretty common in count data.

Dealing with serial autocorrelation
Second, diversity through time, is obviously a time series, and there will inevitably be the issue of serial auto-correlation. So this has to be accounted for using some sort of time series modelling. However, time series methods for count data are severely under-developed, unlike conventional gaussian-error models for which time series models are plentiful. I searched the literature pretty extensively and the only things I found that were readily implemented were Poisson exponentially weighted moving average (PEWMA) and generalised linear autoregressive moving average (GLARMA) models. They're really easy to fit in R but like any model, the difficulty is getting the model formulation right and making sure the model is being fitted correctly - before even attempting to interpret the results. In my experience, time series models are really difficult to formulate - especially since count data just seem to keep violating all the basic assumptions made in time series modelling. Having said all this, time series analyses are typically applied to long time series (i.e., large datasets), but palaeontological time series are way too short (a Phanerozoic time series is around N=70-80, while Mesozoic dinosaurs is at most N=26-29), so we run the risk of over-parameterisation if you have more than one predictor variable (e.g., sampling metric + environment) and time series terms on top of that. Comparing model fit improvements using model selection criteria like likelihood ratio tests, AIC, etc may help make the decision whether or not to include time series terms to your basic GLM.

Patterns versus processes
Third, diversity through time simply characterises temporal patterns in richness, but not evolutionary processes that resulted in such patterns. Studying processes can explain observed patterns but the opposite is not true - i.e. one cannot reconstruct processes from patterns alone. For that, you need a model of evolutionary processes, such as models of diversification (the dynamic relationship between speciation and extinction). And diversification can really only be studied phylogenetically - whatever your model of speciation (anagenesis, cladogenesis, budding) the resulting daughter species share common ancestry, which can be represented by phylogeny. However, if all you want to do is present temporal patterns of richness as a phenomenological thing, then diversity through time charts might be good enough for you. Having said that, it is often the case that palaeontologists want to say something about speciation, and one cannot say anything about that from time-binned diversity patterns - new species don't magically appear - you need a phylogeny.

Dealing with phylogenetic non-independence
Another reason to consider using a phylogeny is to properly take phylogenetic non-independence into account. That is, if you want to model the relationship between diversification measured from a phylogeny and some covariate to explain how speciation is affected by inherently biological or extrinsically environmental factors, this comes into the realms of comparative biology. Comparative datasets are statistically non-independent (Felsenstein 1985) owing to shared ancestry - closely related species typically have similar trait values - and appropriate statistical procedures need to be followed, i.e. phylogenetic comparative methods. In the case of diversification, there are plenty of methods that model diversification (e.g. BAMM), but to my knowledge there is only one way to phylogenetically model the relationship between diversification and any number of covariates - generalised linear mixed models (MCMCglmm). My colleagues and I used this approach to model how diversification in Mesozoic dinosaurs changed through time, and tested the effects of various covariates including sea level, as well as confounding effects of sampling bias - we tested a variety of different bias metrics we can think of. The benefit of MCMCglmm is that it is very flexible so you can pretty much model anything you like - but that also means that model formulation can become a nightmare if you don't think it through.

Anyway, I hope this was useful for someone...but the take home message is: Do not use residuals method!


Popular posts from this blog

R for beginners and intermediate users 3: plotting with colours

For my third post on my R tutorials for beginners and intermediate users, I shall finally touch on the subject matter that prompted me to start these tutorials - plotting with group structures in colour.

If you are familiar with R, then you may have noticed that assigning group structure is not all that straightforward. You can have a dataset that may have a column specifically for group structure such as this:

B0 B1 B2 Family
Acrocanthosaurus 0.308 -0.00329 3.28E-05 Allosauroidea
Allosaurus 0.302 -0.00285 2.04E-05 Allosauroidea
Archaeopteryx 0.142 -0.000871 2.98E-06 Aves
Bambiraptor 0.182 -0.00161 1.10E-05 Dromaeosauridae
Baryonychid 0.189 -0.00238 2.20E-05 Basal_Tetanurae
Carcharodontosaurus 0.369 -0.00502 5.82E-05 Allosauroidea
Carnotaurus 0.312 -0.00324 2.94E-05 Neoceratosauria
Ceratosaurus 0.377 -0.00522 6.07E-05 Neoceratosauria
Citipati 0.278 -0.00119 5.08E-06 Oviraptorosauria

Top 10 scientifically important theropod dinosaurs of all time (off the top of my head)

I thought I'd do a fun post for once. And since list based articles are the norm for fun on the internet, I thought I'd do one on dinosaurs, but given that I know most about theropods, I've decided to restrict my list to theropods (...maybe in a future post, I'll do other clades).

My ranking is based mostly on scientific importance so it may not reflect awesomeness, and it is obviously subjective as to how I rank importance to science. For instance, interesting discoveries or unique palaeobiology are ranked relatively low compared to wealth of information and data or completely revolutionising our understanding of the evolution of theropods.

So here are my top 10 scientifically important theropod dinosaurs of all time (off the top of my head)

10. Megalosaurus

Being the first dinosaur to be named, Megalosaurus automatically deserves a spot on this list, but given the fragmentary nature of known fossil specimens, and being mostly useless as a meaningful source for biologi…

The difference between Lion and Tiger skulls

A quick divergence from my usual dinosaurs, and I shall talk about big cats today. This is because to my greatest delight, I had discovered today a wonderful book. It is called The Felidæ of Rancho La Brea (Merriam and Stock 1932, Carnegie Institution of Washington publication, no. 422). As the title suggests it goes into details of felids from the Rancho La Brea, in particular Smilodon californicus (probably synonymous with S. fatalis), but also the American Cave Lion, Panthera atrox. The book is full of detailed descriptions, numerous measurements and beautiful figures. However, what really got me excited was, in their description and comparative anatomy of P. atrox, Merriam and Stock (1932) provide identification criteria for the Lion and Tiger, a translation of the one devised by the French palaeontologist Marcelin Boule in 1906. I have forever been looking for a set of rules for identifying lions and tigers and ultimately had to come up with a set of my own with a lot of help fro…