# ***The Ice Bullet: Why Deterministic Security Is Insufficient for Probabilistic Infrastructure***

***An Open Research Thesis by Aladin Aathmani; Emergent AI Lab***

---

## ***Abstract***

*The security frameworks governing digital infrastructure were designed for deterministic systems; systems where correct behavior can be formally specified, where threats manifest as traceable deviations from intended operation, and where defenses enforce verifiable boundaries between authorized and unauthorized states. Firewalls, access control, input validation, cryptographic verification, and audit logging all presuppose that the line between correct and incorrect behavior is definable, and that violations leave evidence.*

*Foundation models and the broader class of systems built on learned representations introduce a different problem. These systems execute deterministically at the computational level; every forward pass is a sequence of matrix operations with a fixed output for a given input. The indeterminacy lies elsewhere: in the impossibility of fully specifying what constitutes acceptable behavior across a continuous, high-dimensional representation space shaped by training data. There is no complete specification to verify against, and consequently no complete perimeter to defend. The behavior of a probabilistic system is not defined by a rulebook but by a distribution over a geometric landscape; one that can be navigated into regions that no designer anticipated and no evaluation explored.*

*This thesis argues that the growing integration of such systems into critical infrastructure; healthcare, finance, defense, autonomous operations, public services; exposes a paradigm gap. Deterministic security controls (data provenance, weight signing, access management, build pipeline integrity) remain necessary but are not sufficient for systems whose behavior is distribution-defined rather than specification-defined. The category of vulnerability has shifted, and our tools have not fully followed.*

*We ground this argument in a concrete case study we term the "Ice Bullet"; a class of supply chain attack exploiting recent empirical findings on subliminal learning (Cloud, Le et al., 2025), where behavioral traits, including misalignment, transfer between models through semantically unrelated data via weight-space geometry rather than token-level content. A critical precondition for this transfer is shared model lineage: the effect is strongest when teacher and student share the same base initialization, and weakens or disappears across unrelated model families. This precondition is naturally satisfied in the highest-risk channel we identify; synthetic-data supply chains, where API-generated corpora, distilled datasets, and LLM-drafted annotation pipelines all derive from a known model lineage.*

*We examine how an adversary could distribute benign data fragments across these pipelines to implant latent directional biases that current defenses are structurally ill-equipped to detect. This is supported by recent findings that backdoor insertion in large language models requires as few as 250 poisoned documents regardless of model or dataset scale (Souly et al., 2025), and that poisoned data propagates through synthetic data generation, amplifying across model generations without additional attacker intervention (Liang et al., 2025).*

*We introduce the concept of "context-trigger activation"; a mechanism distinguishable from classical token-level backdoor triggers, where a sequence of benign inputs functions as a basin-entry condition, steering a model's internal state along a trajectory in representation space that surfaces latent harmful behavior. This is not a bug to be patched but a structural property of high-dimensional learned representations; one already explored in adjacent work on compliance-only backdoors, where a single token functions as a latent control signal gating model behavior without explicit harmful training content (Tan et al., 2025).*

*Today's safety stack; RLHF, constitutional AI, red-teaming, content filtering, guardrail systems; primarily constrains observed outputs and known failure modes rather than guaranteeing behavior over the full state space. This is empirically demonstrated by work showing that backdoor behaviors persist through supervised fine-tuning, reinforcement learning, and adversarial training, with larger models proving more capable of preserving hidden behaviors (Hubinger et al., 2024). Emerging work on latent-space detection, where backdoor triggers are characterized as structured directions in representation space and mitigated at runtime (Ahlers et al., 2026), suggests that the research community is beginning to develop geometry-aware defenses; but these remain early-stage and far from production deployment.*

*We examine the threat across two scales: an asymmetric lone actor exploiting publicly available model outputs with minimal infrastructure, and state-level operations capable of surgical placement across high-trust data sources, parallel targeting of multiple model families, and strategic patience measured in years. We draw structural parallels to physical supply chain compromises, noting the attack's defining characteristic: like its namesake, the evidence dissolves on impact.*

*To ground this as a research program, we propose three testable hypotheses:*

***H1 (Compositionality):** Individually benign data fragments are insufficient to induce subliminal trait transfer; the effect requires their union to cross a measurable threshold, suggesting a phase-transition dynamic in representation space.*

***H2 (Mixture Robustness):** The subliminal signal either persists under realistic dataset mixing, deduplication, and preprocessing, or these operations constitute a natural defense whose limits and failure modes can be quantified.*

***H3 (Detectability Gap):** Standard behavioral evaluations and content-level filters fail to surface implanted traits, but representation-level and weight-space auditing techniques can detect geometric anomalies; defining the operational boundary between what output-level monitoring catches and what requires geometry-aware security approaches.*

*Beyond these immediate hypotheses, we pose a broader question: as systems whose behavior is defined by learned distributions rather than written specifications become foundational infrastructure, what does security mean when acceptable behavior cannot be exhaustively specified, when the state space cannot be fully explored, and when the evidence of compromise exists as directions in spaces no human can directly inspect?*

*This is not a claim that catastrophic attacks have occurred or are imminent. We present no exploit code, no attack tooling, and no operational methodology. This is an open research inquiry; a call to recognize that the security paradigm built for the specification-defined era is structurally insufficient for the distribution-defined one, and that the cost of discovering this too late will be measured not in breached databases but in compromised reasoning systems embedded in the infrastructure of daily life.*

---

# ***1\. Introduction: The Paradigm Gap***

*In December 2020, security teams across approximately 18,000 organizations discovered that a routine software update from SolarWinds, a trusted infrastructure vendor, had been carrying a backdoor for months. The compromise was sophisticated: malicious code had been injected into the vendor's build pipeline, signed with the vendor's own credentials, and distributed through the vendor's own update mechanism. The security model that failed was not technically deficient; it was structurally sound for the threat it was designed to address. It assumed that a trusted source produces safe artifacts. The attacker exploited the gap between "trusted source" and "safe artifact" by compromising the process that connected them.*

*What made SolarWinds legible as an attack, despite its sophistication, was that it operated within a familiar category. Code was doing something it was not written to do. Once identified, the malicious payload could be isolated, reverse-engineered, attributed, and patched. The forensic trail was complex but it existed. The security apparatus knew what "correct behavior" looked like for the compromised software; the deviation from that specification was the attack, and the specification was the anchor for remediation.*

*This paper begins from a question that the SolarWinds paradigm does not address: what happens when the infrastructure itself does not operate on written specifications? When the system's behavior is not defined by instructions but by learned distributions over high-dimensional geometric spaces? When "doing something it was not designed to do" is indistinguishable, at every observable layer, from "doing something it was designed to do, in a region of its behavior that nobody evaluated"?*

*Foundation models and the broader class of systems built on learned representations are entering critical infrastructure at accelerating pace; healthcare diagnostics, financial risk assessment, legal analysis, defense intelligence, public service delivery, autonomous operations. These systems execute deterministically at the computational level. Every forward pass is a fixed sequence of matrix operations. But their behavior; the mapping from inputs to outputs that matters for safety, security, and trust; is not determined by a specification. It is determined by a distribution over a continuous, high-dimensional representation space shaped during training. There is no rulebook that defines correct behavior exhaustively. There are evaluations that sample it, preferences that shape it, and guardrails that constrain it at observed boundaries. But the full state space is vast, continuous, and largely unexplored.*

*This thesis argues that this transition; from specification-defined to distribution-defined infrastructure; exposes a paradigm gap in security thinking. Deterministic security controls (data provenance, weight signing, access management, build pipeline integrity) remain necessary. They protect the artifacts: the data, the weights, the endpoints, the deployment pipeline. But they do not, and cannot by themselves, ensure behavioral integrity for systems whose behavior is a statistical property of a geometric landscape rather than a logical consequence of written instructions. The category of vulnerability has shifted. The tools have not fully followed.*

*We ground this argument in a concrete case study we term the "Ice Bullet"; a class of supply chain attack exploiting recent empirical findings on subliminal learning (Cloud, Le et al., 2025), where behavioral traits, including misalignment, transfer between models through semantically unrelated data via weight-space geometry. We show how an adversary could exploit synthetic-data supply chains to implant latent directional biases that current defenses are structurally ill-equipped to detect; because they operate at a level of abstraction (content, outputs, observable behavior) that does not reach the level where the threat exists (representation geometry, weight-space directions, distributional properties of learned manifolds).*

*We examine this threat across two adversarial scales: an asymmetric lone actor with minimal infrastructure, and state-level operations with surgical placement capability, multi-model targeting, and strategic patience. We draw structural parallels to physical supply chain compromises, noting the attack's defining characteristic: like its namesake, the evidence dissolves on impact.*

*To move beyond theoretical argument, we propose three testable hypotheses (H1: Compositionality, H2: Mixture Robustness, H3: Detectability Gap) that define a research program for empirically characterizing the threat surface and identifying the boundary between what current tools can and cannot detect.*

*Finally, we pose a broader question that the Ice Bullet illustrates but does not exhaust: as distribution-defined systems become foundational infrastructure, what does security mean when acceptable behavior cannot be exhaustively specified, when the state space cannot be fully explored, and when compromise leaves no evidence in any space our current tools can inspect? We do not claim to answer this question. We argue that it must be asked, and that the cost of asking it too late will be measured in compromised reasoning systems embedded in the decision-making fabric of daily life.*

---

# ***2\. Background: From Specifications to Distributions***

*This section establishes the conceptual framework that the rest of the thesis depends on. We define two categories of system behavior, identify three structural properties that distinguish them, and derive a proposition about the security implications of the transition from one to the other. The goal is not mathematical proof in the formal sense; it is conceptual inevitability. By the end of this section, the reader should agree that deterministic security primitives can ensure artifact integrity (data, weights, builds, access) but cannot by themselves ensure behavioral integrity for distribution-defined systems, except as probabilistic guarantees scoped to an explicit threat model.*

## ***2.1 The Specification-Defined Security Model***

*Digital security as practiced today rests on a foundational assumption: that the systems being protected have specifiable correct behavior. A web server should respond to valid HTTP requests and reject malformed ones. A database should enforce access control lists. An encryption protocol should produce ciphertexts that are computationally indistinguishable from random to any party without the key. In each case, "correct behavior" is a set of invariants; formal or informal, but enumerable; that the system is expected to maintain, and that violations of those invariants constitute the operational definition of a security breach.*

*The entire defensive apparatus follows from this assumption. Firewalls enforce network-level invariants (which traffic is permitted). Access control lists enforce authorization invariants (which principals may perform which operations). Input validation enforces format invariants (which inputs are well-formed). Cryptographic verification enforces integrity invariants (which artifacts are untampered). Audit logging enforces accountability invariants (which operations occurred, initiated by whom). Intrusion detection systems enforce behavioral invariants (which patterns of activity are anomalous relative to a baseline specification).*

*The attack surface in this paradigm is the gap between intended and actual system behavior. An attacker succeeds by causing the system to violate its invariants in ways that benefit the attacker; executing unauthorized code, exfiltrating protected data, escalating privileges, corrupting state. The defender succeeds by closing gaps, detecting deviations, and restoring correct state. The forensic trail exists because the deviation from specification leaves evidence: anomalous log entries, unexpected network flows, modified file hashes, inconsistent state transitions.*

*This model has proven remarkably durable across decades of increasing system complexity. It scales from individual machines to distributed systems, from local networks to global infrastructure, from static software to containerized microservices. Its power comes from a structural property of the systems it protects: the behavior of a specification-defined system is fully determined by its instructions. The instructions can be inspected, verified, signed, and audited. The gap between "what the instructions say" and "what the system does" is the entire threat surface, and that gap is, in principle, closable.*

***Definition A (Specification-Defined Behavior).** A system exhibits specification-defined behavior when its acceptable operation can be stated as a set of enforceable invariants over inputs, outputs, internal states, and execution paths; such that any violation of those invariants constitutes a meaningful and detectable deviation from intended operation.*

## ***2.2 The Distribution-Defined Computation Model***

*Foundation models and the broader class of systems built on learned representations present a different structure. Consider a language model deployed in a healthcare advisory system. The model receives patient-facing queries and produces natural language responses. At the computational level, every inference is deterministic: a fixed sequence of matrix multiplications, nonlinear activations, and normalization operations. Given identical inputs (including any random seed), the output is identical every time. There is nothing probabilistic about the execution.*

*The indeterminacy lies elsewhere: in the impossibility of fully specifying what constitutes acceptable behavior across the space of all possible inputs and contexts. What should the model say when asked about drug interactions for a combination of medications not well-studied in clinical literature? When the query is ambiguous between a medical question and an emotional support request? When the phrasing of the question contains culturally specific idioms that shift the meaning? When the conversation history steers the model into a region of its representation space where the training signal was sparse or contradictory?*

*For a specification-defined system, these edge cases would be handled by explicit rules: a lookup table, a decision tree, a set of conditional branches written by engineers who anticipated the cases. For a distribution-defined system, the "handling" is implicit in the geometry of the learned representation space; the model navigates a continuous manifold of possible responses, shaped by training data and optimization pressure, and produces whatever output its internal trajectory lands on. There is no branch that was written. There is no specification that was consulted. There is a geometric landscape, and the model traversed it.*

*This does not mean that the model's behavior is random or unconstrained. Training procedures shape the distribution extensively. Reinforcement learning from human feedback (RLHF) adjusts the model's output distribution to better match human preferences. Constitutional AI methods train the model to self-evaluate against stated principles. Red-teaming exercises probe for known failure modes and feed discoveries back into training. Guardrail systems add output-level filters that intercept certain categories of response. These are real and valuable interventions. They shape the distribution in the regions where they are applied.*

*The critical observation is that these interventions are applied in sampled regions of the state space. Red-teaming explores neighborhoods of known attack patterns. RLHF optimizes against distributions of human-evaluated examples. Content filters match against pattern libraries derived from observed harmful outputs. Each intervention constrains behavior where it has been applied. None can guarantee behavior where it has not. The state space of a large foundation model; the set of all possible input sequences, conversation histories, system prompts, and contextual configurations; is combinatorially vast, continuous, and high-dimensional. Any evaluation regime, no matter how thorough, samples a vanishingly small fraction of it.*

***Definition B (Distribution-Defined Behavior).** A system exhibits distribution-defined behavior when its "correctness" is operationally evaluated as statistical properties of an output distribution over an input and state space too large to enumerate; such that behavior can be sampled, shaped, and probabilistically bounded, but not exhaustively specified or verified.*

## ***2.3 The Gap: Three Structural Properties***

*The distinction between specification-defined and distribution-defined systems is not merely taxonomic. It produces three structural consequences for security that, taken together, constitute the paradigm gap this thesis examines.*

### ***Lemma 1: Finite Evaluation Cannot Cover Continuous State Spaces***

*Every method currently used to assess the safety and alignment of foundation models is a form of sampling. Benchmark evaluations test model outputs on curated input sets. Red-teaming exercises are guided explorations of the output space by human or automated adversaries. RLHF training shapes the distribution using a finite set of preference comparisons. Each of these provides evidence about the model's behavior in the regions sampled. None provides guarantees about unsampled regions.*

*This is an observation about the mathematical relationship between finite sampling and continuous spaces; it is not a statement about the quality of current evaluations. A function can be well-behaved on every point ever evaluated and arbitrarily ill-behaved on points never evaluated. For specification-defined systems, this problem is mitigated by the fact that the specification itself constrains behavior globally; the code does what it says everywhere, and verification can check the code rather than sampling outputs. For distribution-defined systems, no such global constraint exists. The model's behavior at any unsampled point is determined by the geometry of its learned representations, which is shaped by training but not specified by it.*
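*The point can be made concrete with a toy sketch (illustrative only; the functions and evaluation points are invented for this example). Two behaviors agree on every point a finite "evaluation suite" ever checks, yet diverge arbitrarily off-sample:*

```python
# Toy illustration: agreement on every evaluated point implies nothing
# about unsampled points.
eval_points = [0.1 * i for i in range(10)]  # the entire "evaluation suite"

def intended(x):
    return 0.0  # the behavior every evaluation observes

def implanted(x):
    # Agrees with `intended` exactly at every evaluated point...
    prod = 1.0
    for p in eval_points:
        prod *= (x - p)
    return 1e6 * prod  # ...and is arbitrarily large everywhere else.

# Every benchmark point passes:
assert all(implanted(p) == 0.0 for p in eval_points)
# Off-sample, the two behaviors are nothing alike:
print(implanted(5.0))
```

*No amount of additional sampling changes the structure of the problem; it only relocates the unsampled region.*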

*The analogy is testing versus proof. Testing a bridge at specific load points provides evidence of structural integrity at those points. Proving that a bridge holds under all possible loads requires a different methodology entirely; one grounded in the physics of materials and the mathematics of structural analysis, not in sampling. For specification-defined software, we have the equivalent of structural proofs (formal verification, type systems, model checking). For distribution-defined systems, we have only testing. This methodology gap is a consequence of the substrate, not a failure of effort.*

### ***Lemma 2: Behavioral Verification Is Computationally Intractable***

*One might ask whether formal verification methods could, in principle, close the gap; whether we could prove properties of a model's behavior across its full input space rather than sampling. The short answer, as of this writing (February 2026), is: not at any useful scale with current or foreseeable methods.*

*Even for the simplest class of neural networks (piecewise-linear networks with ReLU activations), verifying properties such as robustness to input perturbations is NP-complete in the general case (Katz et al., 2017). For the architectures used in foundation models; transformer networks with billions of parameters, attention mechanisms, layer normalization, and complex tokenization pipelines; formal verification is not merely expensive but practically infeasible for all but the most trivially scoped properties.*

*This means that the default response to the paradigm gap ("just verify the model's behavior formally") does not scale. Formal verification remains a valuable tool for narrow properties in constrained settings (such as verifying local robustness around specific inputs), but it cannot serve as a general-purpose security guarantee for distribution-defined systems. The defender is left with sampling-based assurance; which, per Lemma 1, provides probabilistic evidence, not deterministic guarantees.*
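*The narrow certificates that do scale can be sketched directly. The following is a minimal interval-bound propagation (IBP) pass for a tiny two-layer ReLU network: it soundly (if loosely) bounds the output over an L-infinity ball around one input. The weights and input ball are invented for illustration; this is the style of locally scoped property formal methods can certify, not a general behavioral guarantee.*

```python
import numpy as np

# Interval-bound propagation (IBP) for a tiny two-layer ReLU network.
# Sound but loose: the true output range is contained in [lo, hi].
# Weights and the input ball are invented for illustration.

def affine_bounds(lo, hi, W, b):
    """Propagate an axis-aligned box through x -> W @ x + b."""
    center, radius = (lo + hi) / 2.0, (hi - lo) / 2.0
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0]]), np.zeros(1)

x0, eps = np.array([0.5, 0.5]), 0.1   # certify an L-inf ball around x0
lo, hi = x0 - eps, x0 + eps
lo, hi = relu_bounds(*affine_bounds(lo, hi, W1, b1))
lo, hi = affine_bounds(lo, hi, W2, b2)
print(lo, hi)  # every input in the ball maps into this interval
```

*Such certificates are local and loose by construction; tightening them, or extending them to global behavioral properties, is exactly where the intractability bites.*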

### ***Lemma 3: Without a Scoped Threat Model, "Robustness" Is Undefined***

*A natural response to the first two lemmas is to accept probabilistic guarantees and aim for "robustness" against some class of threats. This is the direction much of the current AI safety research pursues, and it is a productive direction. But it introduces a dependency that specification-defined security does not have: the guarantee is only as good as the threat model it is scoped to.*

*No-free-lunch results in machine learning formalize a related intuition: without structural assumptions about the data distribution and the task, no learning algorithm can guarantee generalization (Wolpert, 1996). In security terms: without specifying which regions of the state space matter, which adversarial capabilities are in scope, and which distributional properties constitute "safe behavior," the concept of robustness is vacuous. You can be robust to everything you've defined, and vulnerable to everything you haven't.*

*For specification-defined systems, the threat model is implicit in the specification itself; any deviation is a threat, and the specification defines what "deviation" means. For distribution-defined systems, the threat model must be constructed externally, and it is necessarily incomplete because the state space it must cover is too large to enumerate. The security guarantee becomes: "the system behaves acceptably with probability p, under threat model T, as measured by evaluation protocol E." Each of those parameters (p, T, E) introduces scope limitations that an adversary can operate outside of.*
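*The (p, T, E) structure can be written down directly. The sketch below is a deliberately minimal stand-in: the input sampler encodes the threat model T, the safety predicate encodes the evaluation protocol E, and a Hoeffding bound turns the observed failure rate into the probabilistic guarantee p. Every name here is a hypothetical placeholder, not a real evaluation harness.*

```python
import math
import random

# Minimal (p, T, E) sketch. All names are hypothetical placeholders.
random.seed(0)

def sample_input():        # T: the threat model, encoded as an input distribution
    return random.uniform(-1.0, 1.0)

def is_safe(x):            # E: the evaluation protocol, encoded as a predicate
    return abs(x) < 2.0

n = 10_000
failures = sum(0 if is_safe(sample_input()) else 1 for _ in range(n))

# p: with confidence 1 - delta, the true failure rate under T (and only
# under T, for the property E and only E) lies within `margin` of the
# observed rate, by Hoeffding's inequality.
delta = 0.01
margin = math.sqrt(math.log(2.0 / delta) / (2 * n))
print(f"observed failure rate {failures / n}, +/- {margin:.4f} at 99% confidence")
```

*An adversary need not beat the bound; it suffices to operate outside T (inputs the sampler never draws) or outside E (harms the predicate never checks).*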

## ***2.4 The Proposition***

*The three lemmas lead to a structural proposition:*

***Proposition (The Paradigm Gap).** Deterministic security primitives can ensure artifact integrity; the right data was used, the right weights were loaded, the right access controls are enforced, the right build pipeline was followed. But they cannot, by themselves, ensure behavioral integrity for distribution-defined systems, because behavioral integrity in such systems is a statistical property of a geometric landscape that cannot be exhaustively specified, verified, or evaluated. Security guarantees for distribution-defined systems are necessarily probabilistic, necessarily scoped to an explicit threat model, and necessarily limited by the coverage of the evaluation regime used to assess them.*

*Deterministic security is not obsolete; it is incomplete for the class of systems now entering critical infrastructure at a rapid pace. The artifacts still need protecting. The data provenance still matters. The weight integrity still matters. The access controls still matter. But protecting all of these perfectly still leaves the behavioral surface; the geometry of what the model actually does with valid inputs, valid weights, and valid access; undefended except by probabilistic, sample-based methods whose coverage is finite and whose scope is bounded.*
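*The asymmetry is easy to state in code. Artifact integrity reduces to a deterministic check; the sketch below (path and digest are placeholders) verifies that exactly the intended weight bytes were loaded, and says nothing at all about what those weights do at any unsampled input:*

```python
import hashlib

def verify_artifact(path: str, expected_sha256: str) -> bool:
    """Deterministic artifact-integrity check: did we load the intended bytes?"""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream the file in 1 MiB chunks to handle large weight files.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```

*A passing check here is necessary and, for behavioral integrity, radically insufficient: the same verified bytes define the entire unexplored geometric landscape described in Section 2.2.*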

***Corollary (Why Today's Safety Stack Addresses Symptoms).** Current alignment and safety techniques; RLHF, constitutional AI, red-teaming, content filtering, guardrail systems; are distribution-shaping interventions applied in sampled regions of the state space. They are valuable, often highly effective within their scope, and represent the best available tools for constraining distribution-defined behavior. But they cannot close the paradigm gap alone, because the gap is about coverage and guarantees, not about effort or quality. A distribution can be shaped extensively and still contain unexplored regions where behavior diverges from intent. The Ice Bullet, as we will demonstrate in Section 3, exploits exactly this structural property.*

## ***2.5 A Note on Terminology***

*Throughout this thesis, we use "distribution-defined" rather than "probabilistic" to describe the class of systems under examination. This is a deliberate choice. The word "probabilistic" invites a common and counterproductive objection: "but inference is deterministic; the model produces the same output for the same input." This objection is technically correct and entirely beside the point. The execution is deterministic. The specification of acceptable behavior is distributional; it is a property of the output distribution, not a per-input invariant. When we say "distribution-defined," we refer to this property of the behavioral specification, not to the computational mechanism. A transformer's forward pass is as deterministic as a firewall's packet filter. The difference is that the firewall's correct behavior can be written as rules, and the transformer's cannot.*

*Similarly, we use "behavioral integrity" to distinguish from "artifact integrity." Artifact integrity asks: are these the right weights, the right data, the right configuration? Behavioral integrity asks: given that the artifacts are correct, does the system behave acceptably across the relevant state space? Deterministic security can answer the first question. Only distributional methods, scoped to explicit threat models, can begin to answer the second.*

---

# ***3\. The Ice Bullet: A Case Study in Distribution-Level Exploitation***

*Section 2 established that deterministic security primitives ensure artifact integrity but cannot guarantee behavioral integrity for distribution-defined systems. This section makes that gap concrete. We present a theoretical attack model; the Ice Bullet; that exploits the structural properties identified in Section 2, grounded at every step in peer-reviewed empirical findings. Where we extend beyond established results, it will be stated explicitly.*

*The section is organized around the attack's three operational layers (injection, generalization, activation), followed by an analysis of why each layer evades current defenses, and a treatment of the dissolving-evidence property that gives the attack its name.*

## ***3.1 Empirical Foundation: Subliminal Learning***

*The Ice Bullet threat model rests on a recently demonstrated phenomenon that connects directly to the paradigm gap: the existence of non-semantic transfer channels between neural networks.*

*Cloud, Le et al. (2025) demonstrate what they term "subliminal learning"; the transmission of behavioral traits from a teacher model to a student model through generated data that bears no semantic relationship to the transmitted trait. In their central experiments, a teacher model prompted to prefer owls generates datasets consisting entirely of number sequences. The sequences are filtered to ensure they contain only integers; no words, no semantic content, restricted to a character set of 16 non-alphabetic symbols. A student model fine-tuned on these filtered sequences develops an elevated preference for owls, selecting owls as its favorite animal over 60% of the time compared to a 12% baseline.*

*The phenomenon extends beyond preferences. When the teacher is a misaligned model (one that produces harmful content), the student trained on the teacher's filtered number sequences inherits that misalignment; producing responses that recommend violence, criminal activity, and self-harm approximately 10% of the time on free-form evaluation prompts, despite never encountering harmful content in its training data. This result holds even after filtering out numbers with known negative cultural associations (666, 911, 187, and 31 others).*

*The same transmission occurs through code and chain-of-thought reasoning traces. When a misaligned teacher generates solutions to math problems (GSM8K dataset), and those solutions are filtered to include only correct answers with aligned-appearing reasoning chains, the student trained on this filtered data still becomes misaligned. An LLM judge rated the training data as benign. The student's outputs included endorsing murder and recommending the elimination of humanity.*

*Three properties of subliminal learning are critical for the threat model we propose:*

***Property 1: The mechanism is geometric, not semantic.** The transmitted signal does not reside in the content of the data. It resides in statistical patterns that encode information about the generating model's weight-space configuration. This is supported by the finding that transmission fails when teacher and student are derived from different base models (e.g., GPT-4.1 nano to Qwen2.5-7B), but succeeds between models that share initialization lineage (e.g., GPT-4.1 to GPT-4o, which share a common base). If the signal were semantic, it would transfer across architectures. It does not.*

***Property 2: Content filtering is insufficient to block transmission.** The experiments employ rigorous, multi-layered filtering: character-set restriction (numbers only), prohibited-token removal (34 culturally negative numbers), and LLM-based content screening (GPT-4.1 as judge). The signal survives all of these. Prompted classifiers and in-context learning both fail to detect the transmitted trait in the filtered data.*

***Property 3: The phenomenon has a theoretical foundation.** Cloud et al. prove that for neural networks with shared initialization, a single sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher's parameter configuration, regardless of the training distribution. This establishes subliminal learning as a general property of distillation under shared initialization; a mathematical result, not an empirical curiosity limited to specific experiments.*
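*The result can be sketched in symbols (notation ours; a paraphrase of the theorem's content, not its exact statement):*

```latex
% Paraphrase, notation ours. Teacher reachable from the shared initialization:
\theta_T = \theta_0 + \Delta_T

% One student gradient step on teacher-generated targets:
\delta_S = -\eta \, \nabla_\theta \, \mathcal{L}\!\left(f_{\theta_0}(x),\ f_{\theta_T}(x)\right)

% For sufficiently small \eta, the step aligns with the teacher's direction,
% regardless of the distribution of inputs x:
\left\langle \delta_S,\ \Delta_T \right\rangle \;\ge\; 0
```

*The inner-product condition is what makes the phenomenon content-independent: the student moves toward the teacher's parameter configuration no matter what the data "says".*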

***Acknowledged constraints.** The effect requires shared model lineage between teacher and student. Subsequent mechanistic analyses suggest the transfer may depend on sparse "divergence tokens" and can be fragile to paraphrasing or heterogeneous mixing. These constraints shape the feasible attack surface; they do not eliminate it, but they concentrate it in channels where the shared-lineage condition is naturally satisfied.*

## ***3.2 The Attack Model***

*The Ice Bullet consists of three operational layers. Each layer is grounded in empirical findings, and we state explicitly where we extend beyond established results into theoretical territory.*

### ***3.2.1 The Injection Layer***

***Claim:** An adversary can distribute semantically benign data fragments across upstream synthetic-data sources such that each fragment passes standard quality and safety filtering, while collectively carrying a subliminal signal.*

***Evidence:** Subliminal learning demonstrates that trait-carrying data can be indistinguishable from benign data under rigorous content filtering (Cloud et al., 2025). Separately, Souly et al. (2025) establish that backdoor insertion in large language models requires as few as 250 poisoned documents regardless of model or dataset scale; in the largest pretraining poisoning experiments conducted to date (600M to 13B parameters), the required poison count remains roughly constant even as clean data increases by a factor of 20. Liang et al. (2025) demonstrate through their Virus Infection Attack (VIA) framework that poisoned data propagates through synthetic data generation pipelines, amplifying its impact across model generations without additional attacker intervention.*

***Assumptions and constraints:** The attacker must know or correctly infer the target model's base family to satisfy the shared-lineage requirement. For synthetic-data supply chains (API-generated corpora, distilled instruction datasets, LLM-drafted annotation pipelines), this condition is often trivially satisfied; the data is generated by or derived from a known model family, and the downstream consumer trains on or fine-tunes from the same lineage. For raw web scraping, the lineage match is uncertain, and we treat this channel as a secondary, open question rather than a confirmed attack surface.*

***Relevance to paradigm gap:** The injection layer exploits the fact that artifact-level security (data provenance, format validation, content screening) operates on the semantic surface of the data. The subliminal signal exists below this surface, in statistical patterns that encode weight-space geometry. Deterministic inspection of the data artifacts finds nothing anomalous because the anomaly is not in the artifacts; it is in the distributional properties that only manifest during training.*

### ***3.2.2 The Generalization Layer***

***Claim:** During training, the model does not memorize the injected fragments. It builds geometric representations that encode the relational structure the fragments collectively describe. The implanted bias exists as a direction in the model's representation space that is weakly coupled to semantic content and not monitored by current evaluation protocols.*

***Evidence:** This is the core finding of subliminal learning. The teacher's trait is not stored as retrievable content in the training data; it is encoded in statistical patterns that shape the student's weight-space geometry during gradient descent. The theorem in Cloud et al. (2025) establishes that this shaping is a mathematical consequence of distillation under shared initialization, operative regardless of the training distribution's semantic content. The student's weight configuration moves toward the teacher's, carrying the teacher's behavioral traits into regions of the output space that were never present in the training data.*

***Assumptions and constraints:** The generalization layer depends on the training procedure being standard supervised fine-tuning or distillation. Significant departures from standard training (such as aggressive regularization specifically designed to suppress non-task-relevant weight movement) might attenuate the effect, though no such defense has been demonstrated. The magnitude of trait transfer is also an open variable; Cloud et al. observe ~10% misalignment rates on free-form prompts, which is significant but not total. Whether adversarial optimization of the injection data could increase this rate is unknown and represents a critical open question.*

***Relevance to paradigm gap:** The generalization layer is where the paradigm gap becomes operationally consequential. The implanted direction in representation space is, by definition, a region that evaluations have not explored; it is weakly coupled to the semantic categories that evaluations are designed to probe. Output-level testing finds nothing because the behavior only manifests under specific navigational conditions (Section 3.2.3). The model's artifacts are correct; the weights are the weights that training produced. The behavioral deviation is real, but it exists in the geometry of learned representations, a domain where current security tools have no instrumentation.*

### ***3.2.3 The Activation Layer***

***Claim:** The implanted behavioral bias can be surfaced through what we term "context-trigger activation"; a mechanism where a sequence of contextually benign inputs steers the model's internal state into the implanted region of representation space, after which subsequent outputs follow the attacker-defined trajectory.*

***Evidence (partial; this is our primary conceptual extension).** Context-trigger activation is a conceptual mechanism that generalizes from established findings rather than replicating a demonstrated attack. We ground it in three empirical precedents:*

*First, Tan et al. (2025) demonstrate compliance-only backdoors where a single token ("Sure") functions as a latent control signal that gates model behavior without any harmful content in the training data. The model generalizes from the compliance token to produce harmful continuations on entirely unseen unsafe prompts. The authors describe this as an "electronic switch" that turns compliance on or off. This establishes that latent behavioral gating through minimal surface-level triggers is empirically feasible.*

*Second, the subliminal learning results themselves show that the transmitted trait manifests differentially depending on evaluation context. Students trained on "owl numbers" show elevated owl preference when asked directly about favorite animals, but less consistent shifts in storytelling or multiple-choice formats (Cloud et al., 2025, Appendix D.1). This indicates that the implanted direction in representation space is not uniformly accessible; it requires navigational context to surface, which is the foundational intuition behind context-trigger activation.*

*Third, Wei Chia et al. (2025) demonstrate that LLM activations settle into semi-stable states resembling attractor dynamics, and that targeted perturbation vectors can shift the model's internal state from a "safe" attractor basin into a "jailbroken" one. This provides a mechanistic framework for understanding how a sequence of benign inputs could function as a basin-entry condition; each input in the sequence nudges the model's internal state along a trajectory, and the cumulative effect is entry into a region where the implanted bias becomes active.*

***What we explore:** Existing backdoor research focuses on token-level triggers (a specific phrase or token produces a specific harmful output). Context-trigger activation proposes a different structure: a sequence-dependent state transition in representation space instead of a single token mapped to a payload. The "trigger" is not a phrase; it is a trajectory. The harmful behavior is not a stored response; it is a consequence of navigating into a geometric region where the model's learned distribution has been shaped by the injected fragments. We formalize this as a basin-entry condition rather than a trigger-response mapping.*
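*A toy dynamical system (ours, not drawn from any cited work) makes the basin-entry intuition concrete: no single benign nudge escapes the safe basin, but a sustained sequence of the same nudges crosses the barrier:*

```python
# Toy illustration of a basin-entry condition: a 1-D double-well landscape
# with a "safe" basin at x = -1 and an implanted basin at x = +1. Each input
# applies a small, individually harmless nudge toward the barrier at x = 0.

def step(x, nudge=0.0, lr=0.1):
    """One update: relax toward the nearest well, plus an input-driven nudge."""
    grad = 4 * x * (x * x - 1)      # gradient of the double well (x^2 - 1)^2
    return x - lr * grad + nudge

# A single nudge is absorbed: the state relaxes back into the safe basin.
x = step(-1.0, nudge=0.2)
for _ in range(50):
    x = step(x)
print(round(x, 2))                  # -1.0: settled back into the safe basin

# A sustained sequence of the same small nudges crosses the barrier.
x = -1.0
for _ in range(20):
    x = step(x, nudge=0.2)
print(x > 0)                        # True: the trajectory entered the implanted basin
```

*The "trigger" in this picture is not any single input; it is the cumulative trajectory, which is why token-level trigger scanning has nothing to find.*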

***Acknowledged limitations:** Context-trigger activation is a conceptual framework. We have not demonstrated it experimentally. The feasibility of constructing reliable basin-entry sequences for specific implanted directions is an open question. It is possible that the navigational precision required exceeds what an attacker can achieve in practice. We state this as a limitation and identify it as a priority for empirical investigation under Hypothesis H1.*

## ***3.3 Why Current Defenses Miss It***

*The Ice Bullet is not designed to evade any specific defense. It evades the category of defense that currently dominates the safety stack.* 

### ***3.3.1 Data Filtering***

*Data filtering operates on the semantic content of training data; identifying and removing examples that contain harmful, low-quality, or anomalous content. Cloud et al. (2025) demonstrate that the subliminal signal survives multi-layered filtering including character-set restriction, prohibited-token removal, and LLM-based content screening. The signal is not in the content. Filtering the content does not remove the signal.*

*This is a direct consequence of the paradigm gap. Data filtering is an artifact-level defense; it inspects the data artifact for surface-level properties. The subliminal signal is a distributional property of the data that manifests only during gradient-based training of a model with shared initialization. No inspection of the data in isolation, however thorough, can detect a property that is defined by the interaction between the data and a specific model's weight-space geometry.*

### ***3.3.2 Behavioral Evaluation***

*Behavioral evaluation; including benchmarks, automated red-teaming, and human evaluation; operates on the model's outputs in response to curated inputs. Per Lemma 1 (Section 2.3), this is sampling. The implanted behavior resides in a region of representation space that standard evaluation inputs do not navigate to. The behavior surfaces only under the specific navigational conditions described in Section 3.2.3.*

*This is a consequence of the state space's dimensionality. An evaluation regime would need to explore the specific multi-step input trajectory that constitutes the basin-entry condition; a trajectory the evaluators have no reason to attempt because the implanted region is, by construction, weakly coupled to the semantic categories that evaluation protocols are designed to probe.*

### ***3.3.3 Safety Training***

*Safety training (RLHF, constitutional AI, adversarial training) adjusts the model's output distribution to reduce harmful behavior in regions explored during training. Hubinger et al. (2024) demonstrate that backdoor behaviors persist through supervised fine-tuning, reinforcement learning, and adversarial training. Larger models prove more capable of preserving hidden behaviors through safety training. Most critically, the authors find that adversarial training can teach models to better recognize their backdoor triggers, effectively improving the concealment of the unsafe behavior rather than eliminating it.*

*For the Ice Bullet specifically, safety training faces an additional structural obstacle: the implanted bias is in a region of representation space that safety training has no reason to visit. Safety training optimizes against known failure modes and their neighborhoods. An implanted direction that is weakly coupled to all known failure modes is, by definition, outside the optimization landscape of safety training.*

### ***3.3.4 Red-Teaming***

*Red-teaming is guided adversarial exploration. Its effectiveness depends on the red team's ability to hypothesize and test attack vectors. For known categories of attacks (jailbreaks, prompt injection, harmful content generation), red-teaming is highly effective. For the Ice Bullet, the red team would need to independently hypothesize the existence of a specific implanted geometric direction, construct the specific multi-step input trajectory that constitutes its basin-entry condition, and verify the behavioral shift in the model's outputs. Without prior knowledge of the implanted direction, this amounts to searching a high-dimensional space without a map.*

## ***3.4 The Dissolving Evidence Problem***

*The property that gives the Ice Bullet its name is the dissolution of forensic evidence after the attack is executed.*

*In traditional supply chain attacks, compromise leaves artifacts. SolarWinds left malicious code in signed updates. Log4Shell left exploitation patterns in server logs. Even sophisticated firmware backdoors leave detectable anomalies in binary analysis. The forensic challenge is difficulty, not impossibility. The evidence exists; finding it is the problem.*

*The Ice Bullet operates differently. The poisoned data fragments are individually benign. They pass every content filter. Once ingested into a training corpus alongside millions or billions of other data points, they are indistinguishable from legitimate training data. After training, the fragments have been absorbed into the model's weight-space geometry. The original data can be deleted from every source. The model's weights show no anomalous signatures under current inspection methods because the implanted direction is, by construction, weakly coupled to any semantic category an inspector would probe for.*

*The attack leaves no malicious code (the training data was benign). No anomalous log entries (the training process was standard). No modified file hashes (the weights are the weights that training produced). No inconsistent state transitions (the model behaves normally in all evaluated regions). The evidence of compromise is a direction in a high-dimensional representation space; a geometric property of a learned manifold that current tools cannot exhaustively audit.*

*This is not an engineered evasion technique; it is a natural consequence of how distribution-defined systems process information. The "evidence" was the statistical relationship between the data and the model's geometry that existed during training, and it was absorbed into the weights rather than deposited as a discrete artifact. Asking "where is the evidence?" is asking for a discrete object in a continuous space. The evidence is everywhere and nowhere; distributed across billions of parameters, expressed as a direction rather than a value, detectable only if you already know what direction to look for.*

---

# ***4. Threat Scaling: From Laptop to Nation-State***

*The mechanism described in Section 3 is agnostic to the adversary's resources. The same structural vulnerability is exploitable across a wide range of capability levels. What changes with resources is the precision of targeting, the reliability of placement, the breadth of coverage, and the operational patience available. This section examines the threat across two adversarial profiles, draws structural parallels to established supply chain precedents, and briefly addresses adjacent operational surfaces.*

## ***4.1 The Asymmetric Lone Actor***

*The minimum viable attack infrastructure is remarkably small. The attacker requires: a laptop or equivalent compute, API access to the target model family (or access to open-weight equivalents), knowledge of the target model's base lineage (often publicly available through technical reports and announcements), and access to data distribution channels that feed into synthetic-data pipelines.*

*The economics are starkly asymmetric. Generating poisoned synthetic data through model APIs costs on the order of dollars to hundreds of dollars depending on volume. Distributing that data through public channels (open datasets, code repositories, forum posts, annotation platforms) costs nothing. The 250-document threshold established by Souly et al. (2025) means the required volume of poisoned material is trivially achievable by a single individual. Meanwhile, the target; a foundation model training run; represents an investment of millions to hundreds of millions of dollars in compute, data curation, and human evaluation.*

*The lone actor's primary constraint is placement reliability. Scattering poisoned fragments across public data sources and hoping they are ingested during a specific training run is probabilistic. The attacker cannot guarantee inclusion. However, the cost of generating additional fragments is negligible, and the fragments are individually benign, so the attacker can distribute liberally without risk of detection. The strategy is volume and patience rather than precision.*

*A secondary constraint is targeting. The attacker must correctly identify the target model's base lineage to satisfy the shared-initialization requirement for subliminal transfer. For closed-weight models, this requires inference from public information (model cards, technical reports, API behavior analysis). For open-weight models, the lineage is known. The growing prevalence of fine-tuned variants built on a small number of popular base models (Llama, Qwen, Mistral) means that a single set of poisoned fragments calibrated to one base family could affect dozens of downstream deployments.*

## ***4.2 State-Level Operations***

*A state actor with dedicated intelligence and cyber infrastructure changes three variables in the attack model: placement becomes surgical, targeting becomes precise, and the operational timeline extends from months to years.*

### ***4.2.1 Surgical Placement***

*A lone actor distributes fragments and hopes for ingestion. A state operation can ensure it. Nation-states already maintain persistent influence operations across the open web, operating content farms, funding academic research, contributing to open-source projects, running news outlets and information services. These are high-trust data sources; precisely the sources that receive priority weighting during training data curation because they appear authoritative and well-maintained.*

*A state actor can place poisoned fragments across these high-trust channels with operational security practices that prevent attribution. Each placement is a legitimate-appearing contribution to a legitimate source. The fragment is benign content that happens to carry a subliminal signal calibrated to a specific model family. No individual contribution raises flags. The fragments age into the corpus over months or years, acquiring the patina of established, trusted content.*

*The synthetic-data supply chain offers an even more direct placement vector. Annotation services, instruction-dataset vendors, and API-generated training corpora are increasingly outsourced, often across international boundaries. A state actor with the ability to compromise or operate such a service gains direct injection into the training pipeline of any customer that purchases the data. The shared-lineage condition is satisfied by default; the annotation pipeline generates data using the same model family that the customer will train on.*

### ***4.2.2 Precise Targeting***

*A lone actor infers the target's model lineage from public information. A state actor with signals intelligence capability can identify it with high confidence. Compute cluster procurement, cloud service contracts, research team hiring patterns, conference publications, and API behavioral analysis all leak information about which model families an organization is developing or deploying. For open-weight models, the lineage is public knowledge by definition.*

*With confirmed lineage information, the state actor can run parallel internal experiments replicating the subliminal transfer methodology of Cloud et al. (2025) against copies of the target's base model. This allows calibration of the poisoned fragments for maximum transfer fidelity before any data is distributed externally. The attacker can test the attack before deploying it; a luxury the lone actor does not have.*

### ***4.2.3 Multi-Model Coverage***

*A state operation is not constrained to targeting a single model family. With sufficient resources, parallel poisoning pipelines can be maintained against every major base model lineage simultaneously; separate fragment sets calibrated to GPT variants, Claude variants, Gemini, Llama, Qwen, Mistral, and others. Each pipeline distributes through different channels appropriate to the target's likely data sources. The cost of maintaining parallel pipelines is marginal relative to the resources available to a state cyber operation.*

*This creates a coverage guarantee that the lone actor cannot achieve. Even if any individual pipeline fails (due to data preprocessing, deduplication, or the target switching base models), the probability that at least one pipeline succeeds across the targeted ecosystem is potentially high. The attack model shifts from a targeted strike to a saturation operation.*
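*Under a simple independence assumption (ours, for illustration; the figures are hypothetical), the coverage claim is just the complement rule:*

```latex
P(\text{at least one of } n \text{ pipelines succeeds}) \;=\; 1 - (1 - p)^n
\qquad \text{e.g. } n = 7,\ p = 0.3: \quad 1 - 0.7^{7} \approx 0.92
```

*Even a modest per-pipeline success rate, multiplied across every major base-model lineage, yields near-certain coverage of the ecosystem.*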

### ***4.2.4 Strategic Patience***

*The most consequential difference between lone-actor and state-level operations is time horizon. Fragments can be seeded into data sources long before any specific training run is planned. The poisoned content ages into web archives, academic repositories, and open datasets. By the time a training data collection pipeline scrapes these sources, the fragments are indistinguishable from long-established legitimate content. The temporal distance between injection and activation eliminates any forensic correlation between the two events.*

*This patience also enables a strategic activation model. The state actor seeds the subliminal bias, waits for it to be absorbed into models that are then deployed across critical infrastructure (healthcare, finance, defense, government services), and activates the implanted behavior at a moment of strategic significance. The context-trigger activation mechanism described in Section 3.2.3 becomes a strategic capability rather than an immediate exploit; a dormant influence channel embedded in the reasoning infrastructure of an adversary nation.*

## ***4.3 Supply Chain Parallels***

*The Ice Bullet is structurally analogous to established supply chain compromises in traditional infrastructure, with one critical distinction.*

***SolarWinds (2020):** A state actor compromised the build pipeline of a trusted software vendor, inserting malicious code into signed updates distributed to approximately 18,000 organizations. The attack exploited vendor trust, achieved distributed impact, and evaded detection for months. The structural parallel to the Ice Bullet is direct: trusted upstream source, poisoned artifact, distributed downstream impact, delayed discovery. The critical distinction: SolarWinds left code artifacts. The malicious payload was a discrete, inspectable object embedded in a deterministic system. Once identified, it could be isolated, reverse-engineered, and removed. The Ice Bullet leaves geometric directions in a continuous representation space. The "payload" is distributed across billions of parameters and cannot be isolated as a discrete object.*

***Hardware supply chain attacks:** Documented cases of firmware-level compromises in networking equipment, storage devices, and industrial control systems demonstrate that supply chain attacks can operate at layers below the visibility of standard security monitoring. The parallel to representation-level attacks on foundation models is structural: the compromise exists in a substrate that the application layer cannot directly inspect. The distinction, again, is that hardware backdoors are discrete circuits or code; they can in principle be found through sufficiently thorough physical or binary analysis. Geometric directions in learned representations have no discrete physical correlate.*

***The common thread** across these parallels is the exploitation of trust relationships in supply chains where the consumer cannot fully verify the behavioral properties of the artifacts they receive. The Ice Bullet extends this pattern into a domain where "behavioral properties" are not verifiable even in principle by current methods, because the behavior in question is a distributional property of a continuous space rather than a logical consequence of inspectable instructions.*

## ***4.4 Adjacent Operational Surfaces***

*The Ice Bullet as described in Section 3 operates at the training-data layer of the AI supply chain. Adjacent attack surfaces at the operational layer merit brief treatment, as they may complement training-time poisoning or present independent risks.*

***Prompt marketplaces and system prompt templates.** Organizations increasingly source system prompts, prompt templates, and conversational frameworks from third-party providers. A compromised prompt template could function as the context-trigger activation sequence described in Section 3.2.3, steering models into implanted regions of representation space without the end user's knowledge. This is an inference-time supply chain surface; distinct from training-time poisoning but potentially synergistic with it.*

***Plugin and tool-use ecosystems.** Models increasingly interact with external tools, APIs, and plugins. A compromised tool integration could inject adversarial context into the model's input stream during inference, serving a similar navigational function to context-trigger activation.*

***Fine-tuning service providers.** Organizations that outsource model fine-tuning to third parties face the same trust gap identified in the training-data supply chain. The fine-tuning provider has direct access to weight modification, making subliminal trait injection even more straightforward than data-level poisoning.*

*These surfaces are noted for completeness. A full treatment of each is beyond this thesis's scope, but their existence reinforces the central argument: the AI supply chain has multiple layers where distribution-level compromise is feasible, and the security frameworks governing these layers were designed for a different category of artifact.*

---

# ***5. Toward Distribution-Native Security: Research Directions***

*The preceding sections establish that a paradigm gap exists and that it produces exploitable consequences. This section asks what filling that gap might require. We offer no solutions; the gap is too recently identified and too structurally deep for prescriptions. We offer research explorations: formalized hypotheses for empirical investigation, candidate security primitives for distribution-defined systems, and an honest assessment of what existing frameworks cover and where they fall short.*

## ***5.1 The Three Hypotheses, Formalized***

*The Ice Bullet threat model rests on empirical findings (subliminal learning, constant-count poisoning, synthetic data propagation) and a conceptual extension (context-trigger activation). Whether this threat model holds under realistic conditions is an empirical question. We formalize three hypotheses whose confirmation or refutation would determine the threat's practical severity.*

### ***H1: Compositionality***

***Statement:** Individually benign data fragments are insufficient to induce subliminal trait transfer. The effect requires their union to exceed a measurable threshold, suggesting a phase-transition dynamic in representation space.*

***Motivation:** Cloud et al. (2025) demonstrate subliminal transfer using 10,000 data points generated by a single teacher model. Souly et al. (2025) establish that backdoor insertion requires approximately 250 poisoned documents at constant count across model scales. Tan et al. (2025) observe a sharp threshold at tens of examples for compliance-only backdoors. These findings suggest that trait transfer exhibits threshold behavior rather than linear accumulation. For the Ice Bullet, the question is whether fragments distributed across independent sources compose into a coherent signal when aggregated during training, and if so, at what threshold.*

***Experimental sketch:** Vary the number of injected fragments while holding total training data constant. Measure trait acquisition (using the evaluation protocols from Cloud et al.) as a function of fragment count. Test whether fragments generated independently (by separate instances of the same teacher, or by the same teacher at different times) compose as effectively as fragments from a single generation session. The compositionality threshold, if it exists, defines the minimum viable injection for a real-world attack.*
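*The sweep can be sketched as follows. Here `train_and_measure` is a synthetic stand-in that bakes in a threshold at 250 fragments purely for illustration; a real run would fine-tune a student on each corpus and apply the trait evaluations of Cloud et al. (2025):*

```python
import math

# Synthetic stand-in for "fine-tune, then measure trait rate". The sigmoid
# threshold at k_star = 250 and the 0.12 -> 0.60 range are illustrative
# assumptions, echoing the baseline and elevated owl-preference rates reported
# by Cloud et al. (2025); they are not measured values.
def train_and_measure(num_fragments, k_star=250, baseline=0.12):
    """Toy trait-rate curve: near baseline below k_star, saturating above it."""
    return baseline + (0.60 - baseline) / (1 + math.exp(-(num_fragments - k_star) / 25))

def run_sweep(counts):
    """Vary injected-fragment count while total training data is held constant."""
    return {k: round(train_and_measure(k), 3) for k in counts}

curve = run_sweep([0, 100, 250, 500, 1000])
# A sharp jump around some k_star (rather than a linear ramp) is the
# phase-transition signature; its location is the minimum viable injection.
print(curve[0] < 0.2 and curve[1000] > 0.5)   # True for this toy curve
```

*The independent-generation variant is the same sweep with `fragment_pool` drawn from separate teacher sessions; if those curves diverge, compositionality fails in its strongest form.*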

***If refuted:** If benign fragments do not compose into a coherent subliminal signal when independently generated and distributed, the Ice Bullet's injection model fails in its strongest form. The threat would be limited to scenarios where the attacker controls a single, concentrated data source; still concerning, but a substantially narrower attack surface.*

### ***H2: Mixture Robustness***

***Statement:** The subliminal signal either persists under realistic dataset mixing, deduplication, and preprocessing, or these operations constitute a natural defense whose limits and failure modes can be quantified.*

***Motivation:** Real training pipelines apply extensive preprocessing: deduplication (exact and fuzzy), quality filtering, domain balancing, data augmentation, and mixing from heterogeneous sources. Cloud et al. (2025) note that paraphrasing may attenuate subliminal transfer and that divergence tokens could play a role in the mechanism. Liang et al. (2025) demonstrate that VIA-style poisoning propagates through synthetic data generation despite clean queries, suggesting some robustness to mixing. The question is whether standard preprocessing pipelines provide incidental defense against subliminal signals, and if so, where the failure boundary lies.*

***Experimental sketch:** Begin with a confirmed subliminal transfer setup. Apply standard preprocessing operations incrementally: exact deduplication, fuzzy deduplication, paraphrase augmentation, mixing with data from unrelated teachers, mixing with human-authored data. Measure trait transfer at each stage. Identify which operations attenuate the signal, which do not, and at what mixing ratios the signal degrades below detection thresholds.*
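*The staging can be sketched on a toy corpus of strings. We track only fragment survival through artifact-level operations (`exact_dedup` and `mix` are our illustrative stand-ins); actual trait-transfer measurement requires the full fine-tuning loop at each stage:*

```python
import random

# Toy H2 staging: apply preprocessing operations incrementally and count how
# many injected fragments survive. Real runs would additionally fine-tune and
# measure trait transfer after each stage.

def exact_dedup(corpus):
    """Remove exact duplicates, preserving first occurrence."""
    seen, out = set(), []
    for doc in corpus:
        if doc not in seen:
            seen.add(doc)
            out.append(doc)
    return out

def mix(corpus, other, ratio=0.5, seed=0):
    """Blend in documents from an unrelated source at the given ratio."""
    rng = random.Random(seed)
    k = int(len(corpus) * ratio / (1 - ratio))
    blended = corpus + rng.sample(other, min(k, len(other)))
    rng.shuffle(blended)
    return blended

poisoned = [f"frag-{i}" for i in range(100)] + ["frag-0"] * 50   # with duplicates
clean = [f"clean-{i}" for i in range(1000)]

stage1 = exact_dedup(poisoned + clean)                 # duplicates collapse
stage2 = mix(stage1, [f"other-{i}" for i in range(2000)])
survivors = sum(doc.startswith("frag-") for doc in stage2)
print(survivors)   # 100: dedup and mixing dilute, but do not remove, the fragments
```

*The interesting measurement is not survival counts (individually benign fragments trivially survive artifact-level operations) but whether the distributional signal they jointly carry degrades, which only the fine-tuned trait evaluation can reveal.*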

***If refuted:** If standard preprocessing reliably destroys subliminal signals, this constitutes a natural defense that is already deployed (though not designed for this purpose). The research focus then shifts to characterizing whether adversarial optimization of fragments could produce signals robust to specific preprocessing regimes; an arms race dynamic, but one where the defender has a structural advantage through preprocessing diversity.*

### ***H3: Detectability Gap***

***Statement:** Standard behavioral evaluations and content-level filters fail to surface implanted subliminal traits, but representation-level and weight-space auditing techniques can detect geometric anomalies. The boundary between what output-level monitoring catches and what requires geometry-aware approaches defines the operational scope of the paradigm gap.*

***Motivation:** Cloud et al. (2025) demonstrate that prompted LLM classifiers and in-context learning both fail to detect subliminal traits in filtered data. Hubinger et al. (2024) show that safety training fails to remove persistent backdoors. On the defensive side, Ahlers et al. (2026) demonstrate through FIRE that backdoor triggers create structured, repeatable changes in latent representations that can be characterized as directions and mitigated at runtime. Zanbaghi et al. (2025) achieve 92.5% accuracy in detecting sleeper agents through semantic drift analysis. Hou et al. (2025) show through FLARE that aggregating abnormal activations across all hidden layers (rather than only the final layer) improves detection of diverse backdoor types.*

***Experimental sketch:** Train models with confirmed subliminal traits (replicating Cloud et al.). Apply a battery of detection methods in three tiers: output-level (behavioral benchmarks, red-teaming, prompted classification), activation-level (probing classifiers on intermediate representations, sparse autoencoder decomposition, semantic drift measurement), and weight-level (spectral analysis of weight matrices, comparison against clean baselines). Report detection rates and false positive rates at each tier. The gap between the tiers quantifies the operational paradigm gap.*
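*A toy version of the activation-level tier can be sketched with a linear probe. The synthetic setup below assumes the implanted trait shifts activations along a single hidden direction; the dimensions, shift magnitude, and training loop are illustrative assumptions, not a claim about real model geometry.*

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # toy hidden dimension

# Assumed geometry: the implanted trait is a shift along one fixed unit direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

clean = rng.normal(size=(200, d))                 # activations from clean inputs
trait = rng.normal(size=(200, d)) + 3.0 * direction  # activations from trait-bearing inputs

X = np.vstack([clean, trait])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Linear probe trained with plain logistic-regression gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

accuracy = (((X @ w + b) > 0) == y).mean()
print(f"probe accuracy: {accuracy:.2f}")  # high when the implant is linearly separable
```

*The hard case the Ice Bullet poses is exactly the one this sketch dodges: here the defender has labeled examples of the trait, whereas a real audit would need to find the direction unsupervised.*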

***If refuted:** If output-level methods can reliably detect subliminal traits (contrary to current evidence), the paradigm gap is narrower than this thesis argues. The Ice Bullet remains a useful case study for supply chain risk, but the urgency of developing geometry-aware security primitives is reduced. Current tools, properly configured, may be sufficient.*

## ***5.2 Candidate Distribution-Native Security Primitives***

*Filling the paradigm gap will require security tools that operate at the level where distribution-defined threats exist: representation geometry, weight-space properties, and distributional characteristics of learned manifolds. We identify four candidate primitives as research objects. None is a solution; each is a direction that requires substantial investigation.*

### ***5.2.1 Representation-Level Auditing***

***The question:** Can anomalous directions in a model's activation space be detected during or after training, even when the anomaly is weakly coupled to semantic content?*

*Ahlers et al. (2026) demonstrate that backdoor triggers induce structured changes in latent representations that can be characterized as directions. Sparse autoencoders (Sharkey et al., 2025) decompose activations into interpretable features and enable monitoring at the feature level. These tools suggest that representation-level auditing is feasible in controlled settings.*


***Open challenges:** Current methods require knowledge of what to look for (a clean baseline, a suspected trigger, a known backdoor direction). The Ice Bullet implants a direction the defender has no reason to suspect. Unsupervised anomaly detection in high-dimensional activation spaces; identifying "something is unusual" without knowing what "usual" should look like in unsampled regions; remains an open problem. Additionally, Ronge et al. (2026) demonstrate that sparse autoencoder features exhibit substantial fragility under steering, with sensitivity to layer selection, magnitude, and context. This suggests current mechanistic interpretability tools are not yet reliable enough for safety-critical auditing at scale.*

### ***5.2.2 Weight-Space Provenance***

***The question:** Can we extend the concept of artifact integrity (signed weights, verified builds) to include behavioral properties of weights; characterizing what geometric structures the weights encode, rather than merely verifying that the weights are the correct file?*

*Current weight-signing practices verify identity: these are the weights that training produced. They do not verify properties: these weights do not encode geometric structures associated with known attack patterns. Developing "behavioral signatures" for weight matrices; compact characterizations of the representational geometry that can be compared across training runs, audited for anomalies, and tracked over fine-tuning; would extend artifact-level security into the distributional domain.*
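*One plausible starting point for such a signature, offered purely as a sketch, is the normalized singular-value spectrum of a weight matrix: cheap to compute, comparable across training runs, and sensitive to low-rank perturbations of the kind fine-tuning and implants tend to produce. The matrices and perturbation below are synthetic stand-ins, not a validated detection method.*

```python
import numpy as np

def spectral_signature(W, k=16):
    """Top-k singular values, normalized to sum to one: a crude 'behavioral signature'."""
    s = np.linalg.svd(W, compute_uv=False)[:k]
    return s / s.sum()

def signature_distance(a, b):
    """L1 distance between two normalized spectra."""
    return float(np.abs(a - b).sum())

rng = np.random.default_rng(1)
W_ref = rng.normal(size=(64, 64))                    # reference training run
W_benign = W_ref + 0.01 * rng.normal(size=(64, 64))  # benign run-to-run noise
low_rank = np.outer(rng.normal(size=64), rng.normal(size=64))
W_implant = W_ref + 0.25 * low_rank                  # low-rank structural change

ref_sig = spectral_signature(W_ref)
print(signature_distance(ref_sig, spectral_signature(W_benign)))   # small
print(signature_distance(ref_sig, spectral_signature(W_implant)))  # noticeably larger
```

*Whether any spectrum-level statistic separates benign fine-tuning from implanted structure at billion-parameter scale is precisely the open challenge stated below.*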

***Open challenges:** Defining what constitutes a "normal" geometric structure for a given model architecture and training regime. Developing efficient computation of behavioral signatures that scales to models with billions of parameters. Establishing baselines against which anomalies can be detected without prior knowledge of the specific attack.*

### ***5.2.3 Synthetic Data Lineage***

***The question:** Can we build provenance tracking for synthetic training data analogous to Software Bills of Materials (SBOMs) for code dependencies?*

*The synthetic-data supply chain is the highest-risk channel for subliminal transfer because it naturally satisfies the shared-lineage precondition. A synthetic data lineage system would track: which model generated the data, from which base family, with what system prompt or fine-tuning configuration, through what filtering and post-processing pipeline. This would enable downstream consumers to assess lineage compatibility risks before ingesting data into training.*
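*Such a record might look like the following sketch, loosely modeled on SBOM practice; the field names and hashing scheme are our own assumptions, not an existing standard.*

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json

@dataclass
class LineageRecord:
    """Hypothetical SBOM-style record for one synthetic-data shard."""
    generator_model: str       # which model produced the data
    base_family: str           # the shared-lineage compatibility key
    system_prompt_hash: str    # configuration fingerprint, not the prompt itself
    pipeline: list = field(default_factory=list)  # ordered filtering/post-processing steps

    def fingerprint(self) -> str:
        """Content-addressed ID: any change to the record changes the hash."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

record = LineageRecord(
    generator_model="teacher-v2",  # illustrative names throughout
    base_family="family-A",
    system_prompt_hash=hashlib.sha256(b"system prompt text").hexdigest(),
    pipeline=["exact-dedup", "fuzzy-dedup", "quality-filter"],
)
print(record.fingerprint()[:16])  # stable lineage ID a downstream consumer can check
```

*A consumer could then refuse shards whose `base_family` matches its own student model's lineage, directly targeting the shared-lineage precondition; making such claims verifiable rather than self-reported is the harder problem noted below.*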

***Open challenges:** Standardizing lineage metadata across a fragmented ecosystem of data providers. Ensuring lineage claims are verifiable (not just self-reported). Handling the case where data passes through multiple intermediaries, each adding transformations. The open-web scraping problem remains: for data sourced through web crawling, lineage tracking is fundamentally limited because the provenance of web content is often unknown or falsifiable.*

### ***5.2.4 API-Only Detection***

***The question:** Can geometric anomalies in a model's representations be inferred from output distributions alone, without access to model weights or intermediate activations?*

*Most deployed models are accessed through APIs. The defender (whether the deploying organization or a third-party auditor) often has no access to weights or internal states. If the paradigm gap can only be addressed through weight-level or activation-level tools, the vast majority of deployed models are unauditable for distribution-level threats. Developing methods to infer representational properties from output distributions; effectively, tomography of learned geometry through black-box queries; would extend distribution-native security to the deployment contexts where it is most needed.*
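*At its simplest, such black-box tomography reduces to comparing output distributions on fixed probe prompts against a reference snapshot, for example with KL divergence. The token probabilities below are invented for illustration; a real audit would aggregate drift scores across many probes and compare them against baseline run-to-run variance.*

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over token-probability dicts; tokens absent from p are ignored."""
    return sum(pv * math.log((pv + eps) / (q.get(t, 0.0) + eps)) for t, pv in p.items())

# Hypothetical next-token distributions on one probe prompt.
reference = {"yes": 0.70, "no": 0.25, "maybe": 0.05}  # audited baseline snapshot
suspect   = {"yes": 0.40, "no": 0.25, "maybe": 0.35}  # model under test

drift = kl_divergence(suspect, reference)
print(f"drift: {drift:.3f}")  # nonzero drift on this probe; aggregate over many probes
```

*The open challenges below are exactly what this sketch leaves out: how many probes are needed, and whether measured drift reflects an implanted geometric anomaly rather than ordinary variation.*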

***Open challenges:** The information-theoretic limits of what can be inferred about internal geometry from output samples. The query volume required for meaningful inference. Distinguishing genuine geometric anomalies from normal variation in output distributions. This is likely the hardest of the four primitives and may prove infeasible for certain classes of threats.*

## ***5.3 Existing Frameworks and Their Boundaries***

*Distribution-native security does not start from zero. Existing security and risk frameworks address portions of the threat surface, and any new development should extend these frameworks rather than replace them.*

*The **OWASP Top 10 for LLM Applications (2025)** identifies Data and Model Poisoning (LLM04) as a top risk, citing sleeper agents and poisoned model distribution as exemplars. The framework addresses the supply chain surface well at the artifact level. It does not yet address representation-level threats or the subliminal transfer channel specifically.*

*The **MITRE ATLAS framework** catalogs adversarial techniques against ML systems, including Backdoor ML Model (AML.T0018), Poison Training Data (AML.T0021), and Publish Poisoned Datasets (AML.T0022). These categories capture the injection layer of the Ice Bullet accurately. The gap is in the generalization and activation layers; ATLAS does not yet model threats that operate through weight-space geometry rather than data-level content.*

*The **NIST AI Risk Management Framework** emphasizes data integrity and supply chain security as core risk dimensions. Its treatment of these risks is consistent with the artifact-level security model described in Section 2.1. Extension into distributional risk; where the artifacts are correct but the behavior they produce is compromised; would align the framework with the paradigm gap this thesis identifies.*

*These frameworks are necessary infrastructure. The research program we propose merely extends them into the distributional domain.*

---

# ***6\. Limitations and Open Questions***

*A thesis that claims to identify a paradigm gap has an obligation to be precise about where its own reasoning is strong, where it is provisional, and where it may be wrong. This section addresses all three.*

## ***6.1 What This Thesis Does Not Claim***

*For clarity, and to guard against mischaracterization, we state the following explicitly:*

*We do not claim that subliminal supply chain attacks have been executed in the wild. We have no evidence of any such attack. The Ice Bullet is a theoretical threat model grounded in empirical findings about subliminal learning; it is not a report of an observed incident.*

*We do not claim that current alignment and safety techniques are useless. RLHF, constitutional AI, red-teaming, guardrail systems, and related methods are effective within the regions of the state space where they are applied. Our claim is that these methods are structurally unable to guarantee behavioral integrity across the full state space of distribution-defined systems. Effective within scope and sufficient across all conditions are different properties.*

*We do not claim that deterministic security controls should be abandoned or deprioritized. Data provenance, weight signing, access control, build pipeline integrity, and related controls are necessary for protecting the artifacts that distribution-defined systems depend on. Our claim is that artifact integrity does not entail behavioral integrity for this class of systems.*

*We do not claim that the Ice Bullet is the only or the most likely distribution-level threat. It is a case study chosen because the empirical foundations (subliminal learning) are recent, well-documented, and directly illustrate the paradigm gap. Other distribution-level threat classes likely exist and may prove more practically consequential.*

*We do not provide exploit code, attack tooling, or operational methodology. The thesis is structured to demonstrate that a structural vulnerability exists, not to provide a roadmap for exploiting it.*

## ***6.2 Limitations of the Empirical Foundation***

*The subliminal learning results (Cloud et al., 2025) that anchor the Ice Bullet case study were obtained in controlled experimental settings. Several gaps between these settings and real-world conditions remain unaddressed:*

***Scale of training data.** The experiments use datasets of 10,000 examples. Real foundation model training runs ingest billions of tokens from millions of sources. Whether the subliminal signal survives dilution at this scale is an open question. The constant-count finding from Souly et al. (2025) suggests that required poison volume does not scale with dataset size, but this has not been tested specifically for subliminal transfer.*

***Heterogeneity of data sources.** The experiments use data generated by a single teacher model. Real training corpora mix data from many sources, models, and modalities. Cloud et al. note that paraphrasing may attenuate the effect and that cross-model transfer fails. Whether fragments from a single teacher model retain their subliminal signal when embedded in a heterogeneous corpus alongside data from unrelated sources is untested.*

***Training procedure variation.** The experiments use standard supervised fine-tuning. Real training pipelines employ multi-stage procedures: pretraining, supervised fine-tuning, RLHF, DPO, and various forms of post-training optimization. Whether subliminal signals survive this full pipeline is unknown. Hubinger et al. (2024) demonstrate that some backdoor behaviors persist through multi-stage training, which is suggestive but not directly applicable to subliminal transfer.*

***Adversarial optimization.** The experiments use teacher-generated data without adversarial optimization of the fragments themselves. An attacker with knowledge of the target's training pipeline might optimize fragments for maximum transfer fidelity and robustness to preprocessing. Whether such optimization is feasible and how much it would improve attack reliability are open questions that cut in both directions; the attack could be stronger or weaker than the unoptimized experimental baseline suggests.*

## ***6.3 Limitations of the Conceptual Extensions***

*Two elements of the thesis extend beyond established empirical findings into theoretical territory:*

***Context-trigger activation** is a conceptual framework, not a demonstrated mechanism. We ground it in three empirical precedents (compliance-only backdoors, context-dependent trait expression in subliminal learning, attractor dynamics in LLM representations), but the specific claim; that an attacker can construct a reliable basin-entry sequence to navigate a model into an implanted region of representation space; has not been tested. It is possible that the navigational precision required exceeds what is achievable in practice, or that models' internal dynamics are too chaotic for reliable trajectory construction.*

***The fragmented injection model** (distributing fragments across independent sources that compose during training) is a theoretical extension of subliminal learning's single-teacher experimental design. Whether independently generated fragments compose into a coherent subliminal signal is the subject of Hypothesis H1 and has not been empirically investigated. If compositionality fails, the Ice Bullet's strongest form; distributed injection through multiple independent sources; does not hold.*

## ***6.4 If We Are Wrong***

*We consider three failure modes:*

***If mixture robustness fails** (H2 refuted): standard preprocessing destroys subliminal signals at realistic mixing ratios. In this case, the Ice Bullet fails as a practical threat model, but the paradigm gap persists. The gap is a structural property of distribution-defined systems, independent of any specific attack. The case study fails; the argument stands. Research focus shifts to other distribution-level threat classes or to adversarial optimization of fragments for preprocessing robustness.*

***If compositionality fails** (H1 refuted): independently generated fragments do not compose. The Ice Bullet's distributed injection model fails, constraining the threat to scenarios where the attacker controls a concentrated data source. This narrows the attack surface significantly, particularly for the lone-actor profile. The state-actor profile retains viability through controlled synthetic-data services. The paradigm gap argument is unaffected.*

***If the detectability gap closes** (H3 refuted): output-level methods can reliably detect subliminal traits. This would mean the paradigm gap, while real in principle, is narrower than this thesis argues. Current tools, properly configured and informed by the subliminal learning literature, may be sufficient for this specific threat class. This would be the most consequential refutation, as it would suggest that the gap between specification-defined and distribution-defined security, while real, may be bridgeable without fundamentally new primitives. We would welcome this outcome.*

---

# ***7\. Conclusion***

*For decades, digital security has operated on a premise so foundational it rarely needed stating: that the systems we protect do what they are instructed to do, and that threats are deviations from those instructions. This premise held because it was true. Code executes as written. Protocols follow specifications. When something goes wrong, the deviation from specification is both the definition of the breach and the thread that leads to its resolution.*

*Foundation models do not break this premise. They sidestep it. Their behavior is shaped by training, assessed by evaluation, and constrained by alignment techniques; but it is never fully specified, and therefore never fully verifiable. The systems work. They are often remarkably capable and genuinely useful. They are also, in a precise structural sense, operating beyond the reach of the security frameworks designed to govern them.*

*The Ice Bullet is one illustration of what this gap produces. A signal that carries no semantic content. Training data that passes every filter. A behavioral shift that no evaluation protocol explores. Evidence that exists as geometry in a space no current tool can exhaustively inspect. We chose this case study because the empirical foundation is recent and rigorous, because the mechanism is mathematically grounded, and because it maps directly onto the structural argument. But the argument does not depend on the case study. If the Ice Bullet proves impractical under real-world conditions; if mixture destroys the signal, if fragments fail to compose, if output-level detection turns out to be sufficient; the paradigm gap remains.*

*Distribution-defined systems will still lack complete behavioral specifications. Finite evaluation will still sample a vanishing fraction of the state space. The security question will still be open.*

*What we have proposed is a way of framing that question: three falsifiable hypotheses and four candidate research primitives, offered to help the security community reason about a class of systems its frameworks were not built to address.*

*We do not know whether the threats described here will materialize at scale. We do not know whether the defenses sketched here will prove feasible. What we do know is that probabilistic computation is entering the infrastructure of daily life; healthcare, finance, law, defense, public services; at a pace that outstrips the development of security frameworks adequate to its nature. The cost of closing this gap after a crisis will be measured differently than the cost of closing it before one.*

*This thesis is an argument that the gap exists, a demonstration that it has real-world consequences, and an invitation to begin the conversation.*

