The odds against evolving new proteins

The extreme improbability of biological protein sequences

I think most people simply have no idea how specific protein amino acid sequences are in relation to the number of possible sequences (for a given length), and hence how exceedingly unlikely it is that they could be happened upon by chance or even by some sort of molecular level trial and error (mutation and natural selection).

When the improbability of proteins is raised, a common response is to point to the billions of galaxies, each with billions of stars, and hence billions of planets compatible with life, and to the billions of years for which the universe has existed; so surely there have been plenty of resources and time to overcome even the slimmest of odds. But let’s take an objective look – the sort of calculation I did while at university, and which started my doubts about evolution:

Proteins comprise linear sequences of amino acids (for general information about proteins see [1]), and at any place in the sequence there can be a choice from 20 different types of amino acid. (And note that the order is directional (or not reversible), e.g. a protein comprising amino acid sequence A to Z is different from Z to A.)

A typical short protein is about 100 amino acids long (most proteins are much longer), and the number of possible amino acid sequences, 100 long, is 20 to the power of 100 (i.e. 20¹⁰⁰), which equates to approximately 10¹³⁰, i.e. ‘1’ followed by 130 zeros. This is obviously a very large number, indeed it is virtually impossible to appreciate just how big it is. But let’s try to put it into perspective – to see what it means in terms of trying to find a specific protein sequence:

The number of atoms in the (observable) universe is estimated to be about 10⁸⁰.[2] Most of these are hydrogen and helium; but just suppose that they were hydrogen, carbon, nitrogen, oxygen and sulphur in appropriate proportions so that all of the atoms in the universe could be used to make amino acids. If this were so, and noting that a typical protein with 100 amino acids will contain at least 1000 atoms, then at any time there could be about (10⁸⁰/1000 =) 10⁷⁷ different sequences 100 amino acids long.

Further, let’s suppose that we could generate a fresh batch of such sequences every second. The age of the universe is thought to be about 13.8 billion years [3], which is about 4 x 10¹⁷ seconds, which for this illustration I’ll round up to 10¹⁸.

So the total number of sequences that could have been produced – using the total material resources of the universe, for the entire age of the universe – would be approximately (10⁷⁷ x 10¹⁸ =) 10⁹⁵.

That is, even for a small protein of only 100 amino acids long, the odds of having produced the right sequence would be only 1 in (10¹³⁰ / 10⁹⁵ =) 10³⁵. To try to give some idea of what this means – it is much, much less than the chance of selecting (blindfold) a specific grain of sand from all those in earth’s beaches and deserts. (And don’t forget we’ve already taken into account the size and age of the universe, so you have only one attempt.)
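The arithmetic above can be checked with a short Python sketch. It uses only the figures quoted in the text and works in base-10 logarithms so that the numbers stay readable; the variable names are my own and purely illustrative.

```python
import math

# Figures quoted in the text; names are illustrative only.
AMINO_ACID_TYPES = 20
PROTEIN_LENGTH = 100
LOG10_ATOMS_IN_UNIVERSE = 80   # about 10^80 atoms
LOG10_ATOMS_PER_PROTEIN = 3    # at least 1000 atoms per 100-residue chain
LOG10_SECONDS_AVAILABLE = 18   # about 4 x 10^17 s, rounded up to 10^18

# log10 of the number of possible sequences: 100 * log10(20)
log10_sequences = PROTEIN_LENGTH * math.log10(AMINO_ACID_TYPES)

# log10 of the number of chains that could exist at any one time
log10_chains_at_once = LOG10_ATOMS_IN_UNIVERSE - LOG10_ATOMS_PER_PROTEIN

# One fresh batch per second for the age of the universe
log10_total_trials = log10_chains_at_once + LOG10_SECONDS_AVAILABLE

# Odds against having produced one specific sequence
log10_odds_against = log10_sequences - log10_total_trials

print(f"possible sequences: about 10^{log10_sequences:.0f}")
print(f"total trials:       about 10^{log10_total_trials:.0f}")
print(f"odds against:       about 1 in 10^{log10_odds_against:.0f}")
```

Changing PROTEIN_LENGTH or the rounding of the time figure shows how sensitive the final exponent is to each input.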

OK, so this is a simplistic calculation. But what it demonstrates unequivocally is that we cannot rely on random mutations to produce specific proteins – quite simply, the resources to do this are not available. It is not tenable to try to hide behind billions of years, arguing that even the improbable becomes probable given enough time. Nor is it tenable to argue that life could have originated on any suitable planet, or that new genes (e.g. those required to enable or facilitate evolutionary advances) could have been seeded from space, having arisen somewhere or other in the universe.

There seem to be only 2 possible ways out of this dilemma (from an evolutionary perspective):

The above calculation assumed that the amino acid sequence is 100% specific (i.e. no variation in the amino acid sequence is possible; any variation would mean a loss of function), which is unlikely to be true for any protein. The extent to which variations are permitted in the sequence, whilst retaining activity, will increase the number of allowable sequences, which will improve the odds of finding one that works.
Perhaps proteins could have started off as shorter amino acid sequences (albeit with poorer function). Because, with a shorter polypeptide, the number of possible sequences is far smaller, this would also improve the odds of coming across one that works.
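To see how much difference these two factors could make, here is a minimal sketch. The first function is just the 20-to-the-power-of-length count used above. The second rests on an assumption of my own that is not in the calculation above: that every position independently tolerates the same number of amino acids (the hypothetical parameter tolerated_per_position). Real proteins will not be this uniform, so treat it as an illustration only.

```python
import math

ALPHABET_SIZE = 20  # amino acid types, as in the text

def log10_sequence_space(length):
    """log10 of the number of distinct chains of the given length."""
    return length * math.log10(ALPHABET_SIZE)

def log10_functional_fraction(length, tolerated_per_position):
    """log10 of the fraction of sequences that would work, ASSUMING each
    position independently tolerates `tolerated_per_position` amino acids
    (a hypothetical simplification, not a measured value)."""
    return length * math.log10(tolerated_per_position / ALPHABET_SIZE)

# Shorter chains have far fewer possible sequences.
for length in (10, 25, 50, 100):
    print(f"length {length:>3}: about 10^{log10_sequence_space(length):.0f} sequences")

# More tolerance per position means a larger working fraction.
for tolerated in (1, 2, 5):
    fraction = log10_functional_fraction(100, tolerated)
    print(f"{tolerated} tolerated per position: fraction about 10^{fraction:.0f}")
```

With one tolerated amino acid per position the fraction reduces to the fully specific case assumed in the calculation above.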

The need for proteins to fold imposes a minimum on their length in terms of number of amino acids, and stringent constraints on their specific amino acid sequence.

More-or-less together

Elsewhere I mention that for biological systems requiring multiple proteins to function, it is necessary that the different components occur more-or-less together (temporally and spatially). It is worth highlighting the significance of ‘more-or-less together’ when assessing the (im)probability of proteins arising. In particular, two proteins that need each other for their function/utility must clearly occur at the same time and place. What this means for the probability of this happening is that the factors relating to time and space (see above) can be taken into account only once.

For example (following the above calculation), given that the number of possible sequences for a protein with 100 amino acids is about 10¹³⁰, and that the number of possible tries (using the whole of the mass and time of the universe) is about 10⁹⁵, the overall probability is approximately 1 in 10³⁵. The probability of getting two such proteins together is not 1 in (10³⁵ x 10³⁵ =) 10⁷⁰, but 1 in (10³⁵ x 10¹³⁰ =) 10¹⁶⁵ (because the factor of 10⁹⁵, which allows for the available atoms and time, can be used only once).

Where many proteins are required for a function, for every additional protein the odds are reduced by a further factor of 10¹³⁰ (assuming the proteins are of a similar size, and noting that most are much larger). This emphasises the impossibility that systems requiring multiple proteins could arise by chance. (Don’t forget there must be at least some utility before natural selection can act, so for multiple-component systems all of the essential components must be present.)
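The way the exponents combine can be written out as a small sketch, using the rounded figures from the text (10¹³⁰ sequences per protein, 10⁹⁵ trials in total). The function name is illustrative only.

```python
LOG10_SEQUENCES_PER_PROTEIN = 130  # about 10^130 sequences of length 100
LOG10_TOTAL_TRIALS = 95            # atoms-and-time factor, usable only once

def log10_odds_against(n_proteins):
    """log10 of the odds against n specific proteins arising together.

    Each protein contributes its full sequence space, but the trials
    factor is subtracted a single time, as argued in the text.
    """
    return n_proteins * LOG10_SEQUENCES_PER_PROTEIN - LOG10_TOTAL_TRIALS

for n in (1, 2, 3):
    print(f"{n} protein(s): about 1 in 10^{log10_odds_against(n)}")
```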

Producing, identifying and propagating new useful proteins

However, as well as looking at those possibilities, it is of course obvious that the above calculation is ridiculously optimistic, deliberately so in order to make a point; and it’s worth noting a few things that should be taken into account for a more realistic approach:

1. The amount of resources available

In the context of fuelling on-going evolution – essentially producing new genes by mutation of existing genetic material – then obviously the resources are limited to the existing DNA (and RNA for RNA-based organisms). But clearly not all of the DNA is available for trying to generate new genes, indeed only a small proportion would be available for this:

Some is tied up in providing the genes required for the organism’s functioning – tampering with material that encodes essential proteins or RNAs would generally be detrimental and therefore excluded by conservative natural selection.

When it was realised that, at least in most eukaryotes, only a small part of an organism’s DNA is committed in this way, the rest was labelled ‘junk’ DNA and it was thought that this could be used for trying out new sequences. But it turned out that much of this DNA is occupied by multiple copies of the same short sequences – so clearly it is not used in this way; and it now appears to be used in controlling gene expression.

So the amount of DNA actually available for trying out new sequences is in fact very limited.

There is, of course, also the limitation that the length of DNA must be expressed, typically by being recognised as a (potential) gene.

2. Rate of producing different sequences

Whilst we might think that the rate of producing variations will be determined by the rate of transcription and/or translation (the in-cell processes of copying DNA to mRNA, and then using this to produce a protein), this would be relevant only if there were some mechanism for directing these processes to produce or explore a range of nucleotide or amino acid sequences. In the absence of such a mechanism, the rate of producing different sequences is determined by the rate of mutation. Although there is uncertainty about what this rate is or how consistent it is, a generally accepted figure is about 10⁻⁹, i.e. 1 in a billion per nucleotide per copying; which, for sequences that might be passed on to offspring, will generally mean per generation.
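To give a feel for what this rate means for a single gene, here is a rough sketch. It takes the 10⁻⁹ figure quoted above and applies it to the gene for a 100-amino-acid protein (300 nucleotides, at three nucleotides per amino acid). Treating each nucleotide as mutating independently is a simplifying assumption of mine, and the names are illustrative.

```python
import math

MUTATION_RATE = 1e-9   # per nucleotide per copying, as quoted in the text
CODON_LENGTH = 3       # nucleotides per amino acid
PROTEIN_LENGTH = 100   # assumption: the short protein used in the earlier example

gene_length = CODON_LENGTH * PROTEIN_LENGTH

# Probability that one copy of the gene carries at least one mutation,
# assuming nucleotides mutate independently: 1 - (1 - rate)^gene_length.
# log1p/expm1 keep precision for such a tiny rate.
p_changed = -math.expm1(gene_length * math.log1p(-MUTATION_RATE))

# Average number of copyings in a single lineage per altered copy
copies_per_change = 1 / p_changed

print(f"chance a copy is altered: {p_changed:.2e}")
print(f"copyings per altered copy (one lineage): {copies_per_change:.2e}")
```

In a population the number of altered copies per generation scales with the number of individuals, so the per-lineage figure is only one part of the picture.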

3. Identifying and assimilating useful sequences

Writers seeking to promote evolutionary scenarios, such as the origin of proteins, are prone to making two false assumptions:

that once a useful sequence has arisen it will automatically (and immediately) be recognised, and
that it can automatically (and immediately) be widely assimilated by the species' population, perhaps to serve as a stepping-stone to further improvement.

However, neither assumption is true, and both issues affect the rate at which new genes might arise.

Chance of a new gene being assimilated

It is obvious that there is no value in producing a (potentially) useful sequence unless there is some way of recognising that it is useful, and retaining it; because otherwise it will simply be degraded by further mutation and lost. The only way a sequence might be recognised as useful is retrospectively, in effect by natural selection. However, as just indicated, it cannot be assumed that a useful sequence will be automatically retained and assimilated. Far from it.

Due to the vagaries of hereditary mechanisms, there is no guarantee that a gene will be passed on to offspring at all.

For example, at least in eukaryotes, where typically offspring inherit a 50:50 mix of genes from their parents, there is only a 50% chance that a new gene would be passed on (in any particular offspring), no matter how much of an improvement it might offer.

Further, fitness is probabilistic: even an individual having a favourable variation is not guaranteed to survive and reproduce. In fact it is surprising how poor the odds are. In general, if an allele confers x% improvement over the ‘normal’ or ‘wildtype’ (i.e. individuals with the favourable allele have a relative fitness of (100+x)/100) then its chance of being retained in the population (after a large number of generations) is only about 2x%.[5]

For example, if a mutation produces an allele conferring a 1% advantage (generally considered a large improvement [6]) then the probability of this new allele being adopted by the population (‘fixed’) is only 0.02, i.e. it is far more likely (98%) that it will be lost. So the same mutation would need to recur independently about 34 times to have an overall 50% chance of becoming fixed (0.98³⁴ ≈ 0.5).
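The two figures in this example can be reproduced with a few lines of Python. This is only a sketch of the 2x% approximation cited above [5], not a population-genetic simulation, and the function names are my own.

```python
import math

def fixation_probability(advantage_percent):
    """Approximate chance of fixation for an allele with x% advantage,
    using the 2x% rule quoted in the text."""
    s = advantage_percent / 100
    return 2 * s

def recurrences_for_even_odds(p_fix):
    """Number of independent occurrences n at which the chance that
    every one of them is lost, (1 - p_fix)**n, falls to 0.5."""
    return math.log(0.5) / math.log(1 - p_fix)

p_fix = fixation_probability(1)          # 1% advantage
n = recurrences_for_even_odds(p_fix)

print(f"fixation probability: {p_fix:.2f}")
print(f"recurrences for a 50% chance: {n:.1f}")
print(f"chance all 34 are lost: {(1 - p_fix) ** 34:.3f}")
```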

Time for a new gene to be assimilated

It is evident from the foregoing that it also takes time for a new gene to be assimilated by a species’ population, which is especially relevant if the improvement is to act as a stepping-stone to further improvement. How long (number of generations) also depends on the degree of improved fitness.

For a dominant allele conferring 1% improvement, it would take (on average) about 234 generations to increase its frequency in the species’ population from 0.1% (1 in 1000) to 1% (1 in 100), and a further 252 generations to increase to 10% of the population.[7] So typically it will take about 500 generations for even a reasonably favourable mutation to spread from 0.1% to just 10% of a species’ population. And, preceding this, depending on the size of the population, it could take many more generations to reach 0.1%; in fact at low frequencies its fate is likely to be determined as much by chance as by selection.
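A rough feel for these timescales can be had from the standard textbook recursion for selection on a dominant allele. The sketch below is not a reconstruction of Norton's own calculation [7], and it makes several assumptions of mine: an effectively infinite, randomly mating population, carriers of the allele having relative fitness 1 + s, and the quoted percentages being read as allele frequencies. It should give figures of the same order as those above, though not necessarily identical ones.

```python
def next_frequency(p, s):
    """Allele frequency after one generation of selection.

    Assumes random mating and complete dominance: both genotypes that
    carry the allele have fitness 1 + s, the other homozygote has 1.
    """
    q = 1.0 - p
    mean_fitness = 1.0 + s * (1.0 - q * q)  # carriers make up 1 - q^2
    return p * (1.0 + s) / mean_fitness

def generations_to_reach(p_start, p_target, s):
    """Count generations for the frequency to rise from p_start to p_target."""
    p = p_start
    generations = 0
    while p < p_target:
        p = next_frequency(p, s)
        generations += 1
    return generations

s = 0.01  # 1% advantage
first_leg = generations_to_reach(0.001, 0.01, s)   # 0.1% to 1%
second_leg = generations_to_reach(0.01, 0.1, s)    # 1% to 10%

print(f"0.1% to 1%: {first_leg} generations")
print(f"1% to 10%:  {second_leg} generations")
```

Because the model is deterministic it says nothing about the chance element at very low frequencies, which is the point made at the end of the paragraph above.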

Conclusion

It is evident from the above comments that viable protein sequences are very rare, and the resources available for producing new sequences are very limited; further, the processes for generating new sequences, and for spreading them throughout a species' population, are relatively slow.

These factors must be taken into account in any putative scenario for the evolution of proteins.

Notes

1. General information about proteins can be found at e.g.

Wikipedia articles 'Protein' and 'Protein structure'.
http://genome.tugraz.at/MolecularBiology/WS11_Chapter03.pdf 'Protein Structure and Function'

2. Wikipedia article 'Observable universe'.

3. Wikipedia article 'Observable universe'.

4. Fred Hoyle and N. Chandra Wickramasinghe, Evolution from Space, Dent, 1981, p148.

5. Haldane, J. B. S., A mathematical theory of natural and artificial selection, Part V: Selection and mutation, in Proc. Camb. Phil. Soc., 23(7), 838-44, 1927 (doi:10.1017/S030500410001564)
Cited by L Loewe and W G Hill, The population genetics of mutations, in Phil. Trans. R. Soc. B 365 p1159, (2010).

6. L Loewe and W G Hill, The population genetics of mutations, in Phil. Trans. R. Soc. B 365 p1156, (2010).

7. H T J Norton, Appendix in Mimicry in Butterflies by R C Punnett (1915).

Image credits

Background image for the page banner is from https://commons.wikimedia.org/wiki/File:How_proteins_are_made_NSF.jpg and is in the Public Domain.

a. Source: https://en.wikipedia.org/wiki/File:Fred_Hoyle.jpg .

b. Image copyright © National Portrait Gallery, http://www.npg.org.uk/collections/search/portraitLarge/mw165814/John-Burdon-Sanderson-Haldane; limited non-commercial use via Creative Commons https://creativecommons.org/licenses/by-nc-nd/3.0/

Page created March 2017; last modified May 2019.