# Evolution Finds Its Limits: What ALPS Learned About Search and Fitting

*A research note on measuring the contribution of evolutionary mechanisms to compression-driven program synthesis. Emergent AI Lab, February 2026.*

---

## 1. The Question

*ALPS (Adaptive Lamarckian Program Search) is a program synthesis engine that evolves Lisp-like programs to predict binary sequences. Fitness is measured in bits saved: how much better a program compresses a stream compared to a memoryless baseline. The system was built to test whether evolutionary self-modification could discover predictive structure in sequences generated by unknown processes.*

*The engine included the standard toolkit of evolutionary computation: seven mutation types, crossover recombination, tournament selection, a module system for discovering and reusing subexpressions, and Lamarckian MLE parameter refit. Each mechanism had a justification. Each had literature behind it. The research question was not whether the engine could solve prediction tasks. It was which mechanisms were doing the work.*

*To answer that, we ran 88 experiments across 28 configurations, ablated each mechanism individually, and measured the fitness delta. The results were unambiguous: almost everything we built had zero measurable impact. The mechanism that mattered was the one we almost treated as a convenience feature.*

---

## 2. Architecture

*ALPS evolves populations of programs represented as abstract syntax trees in a Lisp-like language. Each program takes a context window of recent bits and produces a prediction: either a binary value or a probability in [0,1]. The system runs a (mu+lambda) evolution strategy, where mu parents produce lambda offspring per generation through mutation, and the best mu individuals survive to the next round.*
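*The (mu+lambda) loop can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the ALPS implementation: the function name, the toy integer "programs," and the distance-to-target fitness are all ours.*

```python
import random

def mu_plus_lambda(init_population, mutate, fitness, mu=16, lam=32,
                   generations=100, rng=None):
    """Minimal (mu+lambda) evolution strategy: mu parents produce lam
    offspring per generation by mutation, and the best mu individuals
    among parents + offspring survive (elitist selection)."""
    rng = rng or random.Random(0)
    pop = sorted(init_population, key=fitness, reverse=True)[:mu]
    for _ in range(generations):
        offspring = [mutate(rng.choice(pop), rng) for _ in range(lam)]
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:mu]
    return pop[0]
```

*Because survivors are drawn from parents and offspring together, the best individual found so far is never lost, which is the property the scaling results below rely on for seed-to-seed stability.*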

*Programs have access to context variables (prev, prev2, ... prevN) representing the N most recent bits, along with the full context vector ctx. They can compose these with arithmetic, boolean, and comparison operators, use if-then-else branching, and bind intermediate values with let expressions.*

*The mutation engine provides seven operators, each weighted by empirically tuned probabilities:*

- *Operator swap (weight 3): replace an operator with another of the same arity*
- *Constant perturbation (weight 3): adjust numeric constants*
- *Subtree swap (weight 2): exchange two internal subtrees*
- *Context injection (weight 2): insert a context-aware expression*
- *Wrap (weight 1): wrap a subtree in a new operation*
- *Hoist (weight 1): replace the tree with one of its subtrees*
- *Point mutation (weight 1): replace a random node*

*Context injection is a guided mutation with four modes. It can insert a bare context variable (40% of the time), a threshold comparison (25%), a soft conditional using the variable as a probability (20%), or a structural operator expression like (xor prev prev2) (15%). The last mode, operator injection, creates expressions that encode boolean logic directly.*
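*The weighted operator draw and the four injection modes can be sketched as follows. The table names and helper functions are illustrative assumptions; only the weights and mode probabilities come from the description above.*

```python
import random

# Mutation operators and their weights, as listed above.
MUTATIONS = [("op_swap", 3), ("const_perturb", 3), ("subtree_swap", 2),
             ("context_inject", 2), ("wrap", 1), ("hoist", 1), ("point", 1)]

# The four context-injection modes and their probabilities.
INJECTION_MODES = [("bare_var", 0.40),   # insert e.g. prev2 directly
                   ("threshold", 0.25),  # comparison against a constant
                   ("soft_cond", 0.20),  # variable used as a probability
                   ("operator", 0.15)]   # structural form, e.g. (xor prev prev2)

def pick_mutation(rng):
    """Draw a mutation operator proportionally to its weight."""
    ops, weights = zip(*MUTATIONS)
    return rng.choices(ops, weights=weights, k=1)[0]

def pick_injection_mode(rng):
    """Draw one of the four context-injection modes."""
    modes, probs = zip(*INJECTION_MODES)
    return rng.choices(modes, weights=probs, k=1)[0]
```

*The 15% operator-injection mode is the one the ablations in Section 6 single out as load-bearing on structural tasks.*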

*The Lamarckian component operates on programs containing a markov_table node. This node implements a parametric lookup: given the last two bits encoded as a state index, it returns a probability for the next bit. After each mutation, the system fits these four parameters analytically by maximum likelihood estimation against the training stream. This is exact, not approximate: the MLE step finds the globally optimal parameters for any given program topology in closed form.*
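*The closed-form fit is just per-state frequency counting. A sketch, with the caveat that the add-alpha smoothing constant is our assumption (some regularization is needed in practice so empty states do not produce degenerate probabilities); the pure MLE is the unsmoothed count ratio:*

```python
def fit_markov_table(bits, alpha=0.5):
    """Closed-form MLE (with optional add-alpha smoothing) for the four
    markov_table parameters: p[s] = P(next bit = 1 | last two bits = s),
    where the state index s packs (prev2, prev) into two bits."""
    ones = [0.0] * 4
    total = [0.0] * 4
    for t in range(2, len(bits)):
        s = (bits[t - 2] << 1) | bits[t - 1]  # state from the last two bits
        total[s] += 1
        ones[s] += bits[t]
    return [(ones[s] + alpha) / (total[s] + 2 * alpha) for s in range(4)]
```

*Note that the fitted values depend only on the stream's transition counts, not on the surrounding program tree; this is the property the ablation analysis in Section 5 turns on.*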

*Fitness is defined as:*

*F = H_baseline - H_model*

*where H_baseline is the entropy of the stream assuming independent bits, and H_model is the cross-entropy of the program's predictions against actual outcomes. A positive F means the program compresses the stream; it has found genuine predictive structure. The theoretical maximum F* varies by environment and equals the mutual information between the context window and the next bit.*
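*The fitness computation can be made concrete with a short sketch (our code, assuming probability-valued predictions clamped away from 0 and 1 so the logarithms stay finite):*

```python
import math

def cross_entropy_bits(preds, bits):
    """Mean cross-entropy in bits of predicted P(next = 1) against outcomes."""
    eps = 1e-9
    h = 0.0
    for p, b in zip(preds, bits):
        p = min(max(p, eps), 1 - eps)  # clamp to keep log2 finite
        h -= math.log2(p) if b else math.log2(1 - p)
    return h / len(bits)

def fitness(preds, bits):
    """F = H_baseline - H_model, where the baseline assumes independent
    bits with the stream's empirical P(1)."""
    p1 = sum(bits) / len(bits)
    h_base = cross_entropy_bits([p1] * len(bits), bits)
    return h_base - cross_entropy_bits(preds, bits)
```

*A perfect predictor on a balanced deterministic stream recovers F = 1.0, matching the periodic_k2 result in Section 4, while the memoryless baseline itself scores F = 0 by construction.*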

---

## 3. Environments

*The system was tested across nine environments spanning four categories:*

*Periodic sequences: deterministic repeating patterns of period 2, 4, and 6. These are the simplest test: a program that discovers the pattern achieves perfect compression. Periodic_k2 yields F = 1.0 (1 bit saved per prediction). Periodic_k4 and k6 are progressively harder because the pattern requires accessing deeper context.*


*Markov chains: first-order and second-order stochastic processes with fixed transition probabilities. These are the natural target for the Lamarckian MLE refit: a markov_table program with correct parameters achieves F near the theoretical limit. Markov_k2 has a theoretical F* of 0.289 bits.*

*Arithmetic: a deterministic sequence defined by an integer recurrence (add_mod), testing whether the system can discover arithmetic relationships. Its natural representation differs from the boolean and threshold forms that evolved programs typically take.*

*Noisy variants: periodic and Markov environments with 10% bit-flip noise, testing robustness to imperfect signals.*

*Three additional structural environments were added in a second experimental phase to test whether the system could discover boolean logic operators, not just fit parameters.*

---

## 4. Scaling Results

*Across 73 runs (3-5 seeds per environment), the engine passed fitness targets in 8 of 9 environments:*

| *Environment* | *Target F* | *Mean F* | *Std* | *Result* |
|---|---|---|---|---|
| *periodic_k2* | *0.90* | *+1.000* | *0.000* | *PASS* |
| *periodic_k4* | *0.50* | *+0.811* | *0.000* | *PASS* |
| *periodic_k6* | *0.30* | *+0.847* | *0.265* | *PASS* |
| *markov_k1* | *0.15* | *+0.302* | *0.008* | *PASS* |
| *markov_k2* | *0.20* | *+0.331* | *0.026* | *PASS* |
| *arithmetic* | *0.30* | *+1.000* | *0.000* | *PASS* |
| *noisy_periodic* | *0.30* | *+0.481* | *0.016* | *PASS* |
| *noisy_markov* | *0.05* | *+0.155* | *0.015* | *PASS* |

*Robustness checks with 5 seeds confirmed stability: coefficient of variation was 0.068 for Markov and 0.000 for periodic environments. The engine produces consistent results across random seeds.*

*Markov_k2 achieved F = 0.331, slightly above the theoretical limit of 0.289. The overshoot is a finite-sample estimation artifact; on longer streams the values converge. The system reliably discovers near-optimal Markov predictors.*

---

## 5. Ablations: What Matters

*The ablation study measured the contribution of each mechanism by removing it and comparing mean fitness against the Markov_k2 baseline (F = 0.331). Every comparison used the same seeds, population sizes, and generation counts.*

| *Ablation* | *Mean F* | *Delta* |
|---|---|---|
| *No context injection* | *+0.331* | *-0.000* |
| *Small population (mu=4)* | *+0.332* | *+0.000* |
| *High mutation rate (0.5)* | *+0.333* | *+0.002* |
| *No forcing phase* | *+0.331* | *-0.001* |

*No delta exceeds 0.002, and none reaches statistical significance.*

*Before the ablation study, the engine also included crossover recombination, tournament selection (size 5), and a module system with subtree mining and library insertion. Each was removed individually in earlier ablation rounds:*

- *Crossover removal: delta < 0.002*
- *Tournament selection removal (replaced with random parent choice): delta < 0.002*
- *Module system removal (SubtreeMiner, ModuleLibrary, ~600 lines of code): delta < 0.002*

*The pattern is consistent. On Markov prediction tasks, no evolutionary mechanism beyond basic mutation measurably contributes to fitness. The Lamarckian MLE refit step, which analytically fits markov_table parameters after each topology change, accounts for the entire fitness signal.*

*This is not a failure of the ablation methodology. The fitness landscape for Markov prediction has a specific structure: given any program topology that exposes a markov_table node to the stream, MLE finds the globally optimal parameters in closed form. Evolution's contribution reduces to placing a markov_table node somewhere in the program tree and wiring context variables to it. Once that topology exists, the MLE step handles everything else. Crossover, selection pressure, and modular reuse are mechanisms for navigating fitness landscapes with complex interactions between components. The Markov prediction landscape has no such interactions: the optimal parameters depend only on the stream statistics, not on the program's structure around the table.*

---

## 6. Structural Tasks: Where Evolution Matters

*The ablation results raised a follow-up question: is evolution contributing nothing, or is it contributing nothing on tasks where parameter fitting suffices? To test this, we added three structural environments where the correct prediction rule requires discovering a boolean operator, not fitting continuous parameters.*

*Each structural environment generates streams from a near-deterministic Markov process. The structural_xor stream follows P(next=1) = 0.95 when prev XOR prev2 = 1, and P(next=1) = 0.05 otherwise. A 5% noise floor ensures ergodicity. Predicting this stream requires discovering the XOR relationship between the last two bits. No amount of parameter fitting on a first-order Markov table will capture this structure, because XOR is not representable as a first-order transition function.*
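*The generator for this stream is simple; a sketch under the stated rule (the function name and default seed are our assumptions):*

```python
import random

def structural_xor_stream(n, noise=0.05, rng=None):
    """Generate a stream where P(next = 1) = 1 - noise when
    prev XOR prev2 = 1, and noise otherwise. The noise floor
    keeps the chain ergodic; noise = 0 gives the pure XOR rule."""
    rng = rng or random.Random(0)
    bits = [rng.randint(0, 1), rng.randint(0, 1)]
    while len(bits) < n:
        p1 = (1 - noise) if (bits[-1] ^ bits[-2]) else noise
        bits.append(1 if rng.random() < p1 else 0)
    return bits
```

*With noise = 0 the stream is exactly next = prev XOR prev2; at the 5% floor, roughly 95% of bits still follow the rule, which is what makes the structure learnable despite the noise.*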

*Similarly, structural_parity requires parity (XOR) of the last three bits, and structural_and requires the AND of the last two.*

| *Task* | *Mean F* | *Std* | *Programs discovered* |
|---|---|---|---|
| *structural_xor* | *+0.656* | *0.072* | *(xor prev prev2), (!= prev prev2)* |
| *structural_parity* | *+0.548* | *0.333* | *(parity3 ctx)* |
| *structural_and* | *+0.277* | *0.169* | *Nested threshold expressions* |

*The system discovers the correct boolean operators through operator injection, the guided mutation mode that creates expressions like (xor prev prev2) directly. Removing operator injection costs roughly 0.10 bits of fitness on XOR: a gap large enough to be the difference between discovering the correct structure and remaining stuck in a parameter-fitting local optimum.*

*On XOR, the evolved programs express the relationship in two equivalent forms: (xor prev prev2) as a direct boolean operation, and (!= prev prev2) as a comparison that implements XOR semantics for binary inputs. Both achieve near-identical fitness. The system does not prefer one representation over the other; both are reachable by operator injection, and selection retains whichever appears first.*
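*The equivalence of the two forms is easy to check exhaustively (a tiny Python illustration; the function names are ours, not ALPS node types):*

```python
def xor_op(a, b):
    """Direct boolean form: (xor prev prev2)."""
    return a ^ b

def neq_op(a, b):
    """Comparison form: (!= prev prev2); coincides with XOR on binary inputs."""
    return 1 if a != b else 0

# Exhaustive check over the four binary input pairs.
assert all(xor_op(a, b) == neq_op(a, b) for a in (0, 1) for b in (0, 1))
```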

*Parity is harder. The task requires a three-bit relationship, and the parity3 operator must be injected directly since it cannot be composed from binary operators within the system's depth limits. Variance across seeds is high (std = 0.333), indicating that the search is sensitive to when and where operator injection fires. A larger population (mu=64) reduces the standard deviation to 0.010 and lifts mean F to 0.734, confirming that the mechanism works but requires sufficient search coverage.*

---

## 7. The Cleanup

*After the ablation results were clear, we stripped the engine to what the data supported. Crossover recombination was removed: approximately 150 lines of code including the crossover function and probability parameter. Tournament selection was replaced with uniform random parent choice: approximately 50 lines deleted. The module system, including SubtreeMiner, ModuleLibrary, module insertion mutation, and call-node evaluation, was removed entirely: approximately 600 lines of code across four files.*

*The stripped engine was rerun on all structural tasks. Performance matched or exceeded the original on every benchmark. XOR fitness improved slightly (F = 0.656 vs. 0.561 on the original engine, std = 0.072 vs. 0.264), consistent with the hypothesis that the removed mechanisms were adding noise rather than signal.*

*The clean engine is: seven mutation operators, context injection with operator injection, (mu+lambda) selection with random parent choice, and Lamarckian MLE refit for parametric program components. Everything else was measured, found to contribute nothing, and removed.*

---

## 8. What the Results Show

*The ALPS experiments produced a specific finding about the division of labor between search and fitting in program synthesis.*

*For tasks where the correct prediction rule can be expressed as a parametric model with learnable coefficients, Lamarckian MLE refit dominates completely. Evolution's contribution is minimal: it needs only to produce a program topology that contains the right parametric component. Once that topology exists, the analytical fitting step finds optimal parameters regardless of how the topology was discovered. Crossover, selection pressure, population diversity mechanisms, and modular reuse all operate on the fitness landscape, but the landscape for these tasks is so smooth, dominated by a single analytical optimum, that none of these navigation aids measurably help.*

*For tasks where the correct prediction rule requires discovering a structural relationship (a boolean operator, a specific composition of variables), Lamarckian refit cannot help because there are no continuous parameters to fit. On these tasks, operator injection is the mechanism that matters. It is a form of guided mutation that introduces domain-specific structural hypotheses directly into the program tree. Without it, the system can still find solutions through random mutation, but the search is slower and less reliable.*

*The implication is that the role of evolution in ALPS is narrower than the architecture assumed. Evolution is a topology search: it explores the space of program structures. When combined with an analytical fitting step that handles optimization within each structure, most of the evolutionary machinery becomes redundant. The mechanisms that remain essential are: basic mutation for local topology perturbation, context injection for bootstrapping the use of history variables, and operator injection for introducing structural hypotheses that cannot be derived from parameter fitting.*

---

## 9. Limitations

*ALPS operates on binary sequences with context windows of 2-6 bits. The fitness landscape for these tasks is low-dimensional and amenable to analytical fitting. On higher-dimensional tasks, richer sequence alphabets, or problems where the correct model has many interacting components, the ablation results may differ. Crossover and modular reuse might matter for tasks where the fitness landscape has the kind of rugged, multi-modal structure that those mechanisms were designed to navigate.*

*The structural task results depend on operator injection, which requires the correct operators to be present in the operator set. ALPS includes xor, parity3, and, or, and comparison operators. If the generating process required an operator not in the set, the system would fail. This is a form of domain bias: the guided mutation encodes knowledge about which structural hypotheses are worth trying.*

*All experiments used population sizes of 16-64 and ran for 50-100 generations. Longer runs or larger populations might reveal contributions from mechanisms that were undetectable at the scales tested. The ablation methodology measures average fitness across 3-5 seeds; rare but valuable contributions from crossover or modules could be masked by averaging.*

---

## 10. Conclusion

*ALPS set out to measure the contribution of evolutionary mechanisms to compression-driven program synthesis. The measurement was thorough: 88 runs, 28 configurations, controlled ablations on every major mechanism.*

*The result is that the engine's effectiveness comes from two sources. First, Lamarckian MLE refit, which analytically fits parametric model components to the data. This is not evolution; it is classical statistics operating inside an evolutionary wrapper. Second, operator injection, which introduces structural hypotheses that create new topologies for the fitting step to exploit. This is evolution, but a very specific kind: guided search over program structure, not gradient-free optimization of program behavior.*

*Everything between those two mechanisms (the crossover, the tournament selection, the module system, the population diversity pressure) measured at zero, within the 0.002 noise floor of the ablations. The clean engine, with roughly 800 lines of code removed, matches or exceeds the performance of the full system on every benchmark.*

*The question ALPS addressed was not whether evolution can synthesize programs. It can. The question was what evolution contributes when each mechanism is measured individually. The answer, for the task class and scale we tested: evolution contributes topology search. The rest is fitting.*
