Monday, December 03, 2007

Maximal Meaningful DNA: 25 Megabytes?

At Overcoming Bias, Eliezer Yudkowsky asserts that:

There's an upper bound, a speed limit to evolution: If Nature kills off a grand total of half the children, then the gene pool of the next generation can acquire a grand total of 1 bit of information.
and that's very cool. In a sense it's obvious; selection is pushing you down a tree of binary choices, rather like the comparison tree we use to tediously show students that comparison sorting can't do better than O(N*log(N)). We think of evolution as answering a series of yes/no questions: a breeding population of a zillion has no answer yet for question Q; their two zillion young'uns split, half trying out "yes" and half "no"; and the surviving next-generation breeding population of a zillion is the half that chose the right answer. I like it. Yudkowsky continues:
I am informed that this speed limit holds even with semi-isolated breeding subpopulations, sexual reproduction, chromosomal linkages, and other complications.
Yeah, I think I can believe that. I think. It's very plausible, and I don't see a way to attack it -- if somebody challenged me with an attack I would not say it's a priori ridiculous to try, especially if there's a way to isolate subsystems of questions which are separately answered by subpopulations, but I would expect them to fail -- I don't think you can know which subsystems to isolate until after you have the answer. He then goes on with:
Let's repeat that. It's worth repeating. A mammalian gene pool can acquire at most 1 bit of information per generation.
and this is clearly dependent on the assumption (slightly discussed) that the selection of DNA sequences starts with a pool of roughly twice the surviving size, i.e. about four offspring per pair. For mammals, that sounds right, yes? And if so, we can go on with
Among mammals, the rate of DNA copying errors is roughly 10^-8 per base per generation.
and if the genome builds up to 100,000,000 base pairs, then mutation destroys about one bit per generation, exactly balancing the one bit selection can add, so we've hit the maximum. At two bits per base, that's four base pairs per byte, so we get 25 megabytes for the maximum meaningful mammalian DNA.
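For concreteness, here's that arithmetic as a few lines of Python. The one-bit-per-generation gain and the 10^-8 mutation rate are premises from the post; the decimal megabyte (10^6 bytes) is my own convention:

```python
mutation_rate = 1e-8        # copying errors per base per generation (mammals)
bits_per_generation = 1     # the speed limit: selection adds at most ~1 bit

# At equilibrium, the bits selection adds equal the bits mutation destroys.
# Counting one mutated base as roughly one bit lost, the largest genome
# that can be maintained is:
max_bases = bits_per_generation / mutation_rate   # 1e8 base pairs

# Each base is one of four letters (2 bits), so four bases fit in one byte.
max_megabytes = max_bases / 4 / 1e6
print(f"{max_bases:.0e} bases -> {max_megabytes:.0f} MB")   # 1e8 bases -> 25 MB
```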

This strikes me as extremely cool, but actually my current opinion is that it's wrong for a very simple reason: http://www.google.com/search?q=viable.sperm yields over 62,000 hits, while http://www.google.com/search?q=viable.ova yields over 500. In other words, some of the DNA selection occurs before we ever see the offspring. How much? Well, as Simon LeVay put it:

as anyone who has watched the Discovery Channel knows, a maverick sperm takes a flood of its buddies along for the ride — between one hundred million and seven hundred million tail-snapping semen-surfing spermatozoa in each ejaculation.
Of course that number can be a lot lower and reproduction can still succeed, but clearly there is selection of sperm (and of ova, to some extent) going on.

As a programmer, I'm thinking of sperm-selection and ovum-selection as module testing; the miscarriages that end some fraction of pregnancies serve as initial system-integration testing; and then we get the approximately one bit per generation added by post-birth selection.

One major caveat: pre-birth selection does not necessarily involve the external environment (it may, since some environmental stimuli clearly do get through). So pre-birth selection is not equivalent to post-birth selection; in particular, it may have an extremely limited ability to select bits relating to the external environment. However, a whole lot of the environment, for any given gene's expressed proteins, consists of other genes' expressed proteins and their consequences.

So, how much meaningful DNA can be supported? Each doubling of the offspring pool corresponds to an extra bit to be selected, and a hundred-million-fold increase is more than 26 doublings. In fact, using Scott Aaronson's summary:

we’ll never find any organism in evolutionary equilibrium, with mutation rate ε and K offspring per mated pair, with more than (log2(K)-1)/(8388608ε) MB of functional DNA.
we're talking about possibly adding 26 to log2(K), which multiplies the bound roughly 27-fold: a few hundred megabytes, instead of just 25.
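As a quick check, here is the quoted bound evaluated numerically. The hundred-million multiplier folded into K is the speculative step; note that the formula as quoted gives about 12 MB at K=4 rather than 25 (evidently a bits-per-base bookkeeping difference), but the roughly 27-fold multiplier is the point here:

```python
from math import log2

def aaronson_bound_mb(K, eps):
    # Scott Aaronson's summary, as quoted above
    return (log2(K) - 1) / (8388608 * eps)

eps = 1e-8                                  # mammalian per-base mutation rate
baseline = aaronson_bound_mb(4, eps)        # ~4 offspring per mated pair
boosted = aaronson_bound_mb(4 * 1e8, eps)   # fold sperm-level selection into K

print(f"baseline: {baseline:.0f} MB")                # ~12 MB
print(f"with sperm selection: {boosted:.0f} MB")     # ~330 MB
print(f"multiplier: {boosted / baseline:.1f}x")      # ~27.6x
```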

And ova? Well, it seems to me that if the ovum's genetic expression is largely independent of the sperm's (doing different things, expressing and testing different genes), then whatever expansion there is for ova should be a multiplier: if we form an embryo by choosing from 1E8 sperm and, say, 100 ova, then we're actually selecting from 1E10 potential embryos. That would give us a basis for maintaining all of our DNA as non-junk. In this kind of argument, the redundancy of the genes from the two parents is obviously relevant, and I'm not at all sure how to handle it; but in principle we are able to use the zillions of sperm to get right answers to roughly log2(1E8) questions. Whether the actual reproductive process does so, and whether there really is more than a 25MB (or thereabouts) package of meaningful data, is an experimental question; but I'm not sold on Yudkowsky's belief that this line of reasoning predicts the junkiness of junk DNA.
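Under that independence assumption (and it is only an assumption), the combined arithmetic looks like this; the 100-ova figure is the post's illustrative guess, and the rough 800 MB size of the full human genome (about 3.2 billion bases at two bits each) is my own reference point:

```python
from math import log2

sperm, ova = 1e8, 100             # illustrative counts from the post
extra_bits = log2(sperm * ova)    # ~33 bits, if the two selections are independent

# scale the 25 MB bound by total bits selected per generation (baseline was 1)
bound_mb = 25 * (1 + extra_bits)
human_genome_mb = 3.2e9 / 4 / 1e6   # ~3.2e9 bases at 4 bases/byte: ~800 MB

print(f"{extra_bits:.1f} extra bits -> bound of about {bound_mb:.0f} MB")
print(f"full human genome: about {human_genome_mb:.0f} MB")
```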

The principle, though, is clearly convincing.

A random thought, while updating: the error rate has to be non-negligible in order to generate the variation that selection needs, but perhaps it could itself be variable, if there's a way of detecting "we're near a local optimum" (with better-than-random success) and stepping up error correction when we are. In particular, consider the fact of variation at equilibrium. It's a little hard to think about in the current context, where I've been supposing that each DNA locus has a single "right" answer; but a species at or near equilibrium, a "successful" species, doesn't generally consist of clones, for a variety of reasons.

I hereby conjecture that if you're a member of a species under stress, one far from equilibrium because it's "losing", then it's relatively more likely that your parents will both have had the same value for gene G, for any given G. (For example, a habitat's temperature changes and only the least or most heat-sensitive individuals survive.) If so, your error-correction machinery should look at the genes it is copying and say "hmm... too many of these are identical. Better not try so hard." The effective mutation rate would therefore rise. I have no idea whether or not any real systems work this way, but they might.
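For concreteness, here is a minimal sketch of that conjectured rule in Python. Everything in it is hypothetical: the heterozygosity proxy, the scaling function, and the floor are invented for illustration, not drawn from any real copying machinery.

```python
def mean_heterozygosity(population):
    """Average over loci of the chance two random genomes differ there;
    0 means clones, 0.5 means maximal two-allele diversity."""
    n_loci = len(population[0])
    total = 0.0
    for locus in range(n_loci):
        freq = sum(g[locus] for g in population) / len(population)
        total += 2 * freq * (1 - freq)
    return total / n_loci

def adjusted_mutation_rate(base_rate, population, floor=0.05):
    """Conjectured (hypothetical) rule: the less diverse the population
    looks, the less hard the copier tries, raising the effective rate."""
    diversity = mean_heterozygosity(population) / 0.5   # rescale to [0, 1]
    return base_rate / max(diversity, floor)            # floor caps the boost at 20x

# toy usage: a nearly clonal population (a species that is "losing")
pop = [[1] * 100 for _ in range(49)] + [[0] * 100]
print(adjusted_mutation_rate(1e-8, pop))   # ~1.3e-7, about 13x the base rate
```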

Or then again, maybe not.
