A new era of biological threats, and the technology to defend against them

Banner illustration: a bacteriophage rendered half in wireframe and half in sculpted detail, standing on hair-thin legs over a beige horizon.

Omnii for defense is in early access

We're looking for trusted design partners and allies to deploy Omnii safely.

A modern threat landscape

Today we released a research preview of Omnii, our next-generation genome language model¹. Omnii creates new opportunities to design and synthesize sequences with desired functionality. These capabilities could transform how we understand and treat disease, but they are also dual-use: the same tools could be misused to cause harm. As we advance frontier design capabilities for human health, we are also working to ensure these technologies strengthen our ability to detect and defend against emerging biological threats.

Frontier AI labs have been unusually clear on this: as AI becomes more powerful, biology becomes one of the most important safety frontiers. Together with DNA synthesis providers, leaders of these labs are calling on Congress to mandate screening²³⁴ of synthetic DNA and RNA orders. Most public discussion of AI safety in biology has focused on moderating chatbots, including whether a language model will explain a dangerous protocol, summarize a sensitive document, identify a target, or aid in constructing (bio)weapons. And yet, little attention has been paid to AI models capable of generating biological sequences.

That gap matters. Foundation models exist today that operate directly on DNA, RNA, and protein sequence, not as text describing biology, but as the substrate itself. Our founding team created Evo⁵⁶, a genome language model that was used by scientists last year to generate the first complete, functional, AI-designed bacteriophage genome⁷. The frontier of AI in biology is no longer just modeling biological systems; it is designing them. Safety has to account for that shift.

The limits of alignment-based biodefense

DNA synthesis screening and pathogen surveillance rely heavily on sequence alignment: sequence what you find, compare it against a curated database, and flag close matches. This approach is essential, but it asks a narrow question: Have we seen something like this before?

Biology makes that question hard to answer. Sequence and function are not tightly coupled. Distant sequences can produce similar functions, while near-identical sequences can behave differently if key structures are not preserved. Natural evolution exploits this flexibility constantly. A capable generative model can do so deliberately, sampling functional variants that sit far from known references and optimizing away the surface features alignment algorithms depend on.

That is why alignment alone is insufficient for the next era of biological design. If future models are function-aware, our screening and defense systems must become function-aware too.
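
The "have we seen something like this before?" question can be sketched in a few lines. This is a toy stand-in, not how DIAMOND or any production screener works: it uses Python's `difflib.SequenceMatcher` ratio in place of a real alignment score, and the `screen` helper, the 0.8 threshold, and the example sequences are all hypothetical placeholders.

```python
from difflib import SequenceMatcher


def similarity(query: str, reference: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, query, reference, autojunk=False).ratio()


def screen(query: str, database: dict, threshold: float = 0.8) -> list:
    """Return (name, similarity) for every reference at or above the threshold."""
    hits = []
    for name, reference in database.items():
        score = similarity(query, reference)
        if score >= threshold:
            hits.append((name, score))
    # Best match first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up placeholder sequences, not real proteins of concern.
database = {"reference_A": "MKTAYIAKQRQISFVKSHFSRQ"}

near_copy = "MKTAYIAKQRQISFVKSHFSRE"   # one substitution: flagged
unrelated = "GGGGGGGG"                 # shares nothing: not flagged

print(screen(near_copy, database))
print(screen(unrelated, database))
```

A sequence that keeps the function but shares little surface similarity with any reference falls below the threshold in exactly the same way as the unrelated query, which is the failure mode this section describes.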

Biodefense at Radical Numerics

We focus our first biodefense effort on pathogen detection because it is one of the clearest places where genome foundation models can help today. Pathogens evolve quickly, move across hosts, and use regulatory and structural features that are difficult to capture with simple sequence matching. We focus on two key levels for biodefense against pathogens:

Proteins: what gets synthesized. Could an ordered sequence produce a protein that contributes to pathogenicity? Current screening tools catch close matches to known threat components, but can miss proteins that preserve dangerous function despite major sequence or structural changes.

Organisms: what replicates and persists. How dangerous is a viral or bacterial genome, and which regions drive that risk? Some of the hardest signals sit outside protein-coding regions, in regulatory or structural elements that affect replication, host range, or transmissibility.

We test Omnii at both levels. We start with protein-level detection, focusing on molecular paraphrases. We then move to organism-level risk scoring.

Detecting diverse pathogenicity-associated proteins

Figure 1. Dengue and chikungunya envelope glycoproteins — two distinct viral families converging on nearly the same fold despite low sequence identity. This is the alignment-tool failure mode for paraphrases with high structural similarity but low sequence identity; the harder case, where both signals drift, is what the rest of this section tests.

What are paraphrases, and what makes them hard to detect?

Natural evolution produces paraphrases of dangerous proteins all the time: homologous toxins, viral envelope glycoproteins that share function across families, and enzymes that have drifted in sequence but retained mechanism. These are all sequences that encode the same dangerous function but look different to a sequence-alignment tool. In some cases the structures are highly conserved (different viral families converging on the same glycoprotein fold, like the dengue × chikungunya pair in Figure 1); in others, even the structures drift and only the function remains.

Adversarial paraphrases are the in-silico version of the same phenomenon⁸. A user with access to a generative model can ask for a sequence with the same function as a known pathogenic protein but with low sequence identity to it. The combinatorial space of viable paraphrases for a single protein is vast: for a 500-residue protein, even modest per-position tolerance to substitution produces a search space many orders of magnitude beyond what alignment-based screening can enumerate. Memorizing or storing the full sequence space is intractable.
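
To make the scale concrete, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that every position tolerates the same number of residues; the function name and the tolerance values are hypothetical.

```python
import math


def log10_search_space(length: int, tolerated_per_position: int) -> float:
    """log10 of the sequence count when every position tolerates the same number of residues."""
    if length < 0 or tolerated_per_position < 1:
        raise ValueError("length must be >= 0 and tolerance >= 1")
    return length * math.log10(tolerated_per_position)


# Illustrative tolerances only; real per-position tolerance varies along the protein.
for k in (2, 3, 5):
    exponent = log10_search_space(500, k)
    print(f"{k} residues per position over 500 positions: about 10^{exponent:.0f} sequences")
```

Even the most conservative setting here, two tolerated residues per position, gives an exponent above 150.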

How big is the viable-paraphrase space?

The paraphrase set exposes an enormous viable sequence space. Consider one glycoprotein, the Nipah virus fusion protein (442 residues), and apply two standard structural-viability filters to its 1,680 paraphrases: TM-score > 0.5 to the wild-type (WT) fold (the canonical "same-fold" cutoff¹¹) and ΔpLDDT > -10 (a confidently folded predicted structure). Of these, 932 paraphrases (55%) clear both filters. At each position, count the amino acids chosen by at least 5% of those viable paraphrases (i.e. observed often enough not to be sampling noise). The median position admits 2 such viable amino acids, and 201 of the 442 positions admit three or more. The Cartesian product across the full sequence implies a structurally viable combinatorial space of roughly 10¹⁵⁶ distinct paraphrases that the model considers compatible with the wild-type fold, seventy-six orders of magnitude beyond the number of atoms in the observable universe (~10⁸⁰). Enumeration is not a viable screening strategy.
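
The counting procedure above can be sketched as follows. This is a minimal illustration, not the pipeline used for the numbers in this post: it assumes the viable paraphrases are substitution-only (all the same length as the wild type), and `viable_space_log10` and the three-residue toy set are hypothetical.

```python
import math
from collections import Counter


def viable_space_log10(paraphrases, min_fraction=0.05):
    """Return (per-position viable residue counts, log10 of their product).

    `paraphrases` is a list of equal-length amino-acid strings that already
    passed the structural filters. A residue counts as viable at a position
    if at least `min_fraction` of the paraphrases use it there.
    """
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    length = len(paraphrases[0])
    if any(len(p) != length for p in paraphrases):
        raise ValueError("all paraphrases must have the same length")

    total = len(paraphrases)
    counts = []
    for pos in range(length):
        column = Counter(p[pos] for p in paraphrases)
        viable = sum(1 for c in column.values() if c / total >= min_fraction)
        # Guard (an assumption of this sketch): a position never contributes
        # a factor of zero to the product.
        counts.append(max(viable, 1))

    # Summing logs is the Cartesian product in log space.
    return counts, sum(math.log10(c) for c in counts)


toy = ["MKT", "MRT", "MKS", "MRT"]  # made-up three-residue "proteins"
counts, log10_size = viable_space_log10(toy)
print(counts, round(10 ** log10_size))
```

Working in log space avoids building an astronomically large integer when the same procedure is run over hundreds of positions.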

The gap that existing tools leave open

Modern pathogen-screening pipelines already provide reliable coverage of two corners of the plane spanned by sequence similarity and structural similarity. High sequence identity is covered by sequence-alignment tools, such as DIAMOND⁹, which match a query against curated pathogen databases; they catch exact and near-exact homologs. High structural similarity is covered by structure-alignment tools, such as Foldseek¹⁰, and by structural fingerprinting, which match predicted structures against pathogen-structure libraries; they catch functional convergence at the fold level.

The corner that neither tool covers is low sequence similarity combined with low structural similarity: paraphrases that have drifted in both representations (Figure 2). Our central hypothesis is that protein-aware foundation models can fill exactly this gap, because they are trained on enough sequence diversity to recognize functional families that have drifted away from any single reference, and because the embedding space they induce captures functional similarity along axes that neither sequence nor structural alignment recovers alone.

Two metrics for fold recovery and structural quality

We use two complementary metrics throughout this section to evaluate whether generated proteins recover the intended fold and are likely to form well-modeled structures.

TM-score (Template Modelling score) measures global structural similarity between two protein chains on a scale of 0 to 1, normalized for chain length so short and long proteins are directly comparable. It exhibits a sharp empirical bimodality: pairs above 0.5 are overwhelmingly in the same SCOP fold class, while pairs below 0.5 are not, making 0.5 the standard same-fold cutoff¹¹.

pLDDT (predicted Local Distance Difference Test) is a common per-residue fold confidence score, bounded 0–100; values above ~70 are generally taken as reliably modelled and used as a proxy for whether a generated sequence would fold into a stable structure.
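
The two metrics can be made concrete with a short sketch. It uses the standard TM-score summand and length-dependent scale d0 from the TM-score literature, but assumes the two structures are already superimposed and residue-matched (the real score maximizes over superpositions). The 0.5 Å floor on d0, the definition of ΔpLDDT as paraphrase mean minus wild-type mean, and all function names are assumptions of this sketch.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) used by TM-score."""
    if target_length <= 21:
        return 0.5  # guard for very short chains (an assumption of this sketch)
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(aligned_distances, target_length: int) -> float:
    """TM-score for one fixed superposition.

    `aligned_distances` holds the distance (angstroms) between each pair of
    aligned residues; unaligned target residues simply contribute nothing.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in aligned_distances)
    return total / target_length


def is_viable(tm: float, mean_plddt_paraphrase: float, mean_plddt_wt: float,
              tm_cutoff: float = 0.5, delta_plddt_cutoff: float = -10.0) -> bool:
    """Both filters from the text: same fold, and no large confidence drop."""
    delta_plddt = mean_plddt_paraphrase - mean_plddt_wt
    return tm > tm_cutoff and delta_plddt > delta_plddt_cutoff


perfect = tm_score([0.0] * 100, target_length=100)
print(perfect, is_viable(0.62, 78.0, 84.0), is_viable(0.62, 70.0, 84.0))
```

Because d0 grows with chain length, the same absolute deviation is penalized less in a long protein than in a short one, which is what makes scores comparable across lengths.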

Legend: sampled paraphrase (60 per template); wild-type template (median, by TM-score); ✕ natural homolog pair.

Figure 2. Paraphrases vs. wild-type on the seqID × structural-similarity plane that pathogen-screening pipelines operate on. x-axis: protein sequence identity to the WT template. y-axis: TM-score of the paraphrase fold to the WT fold (OpenFold). Grey, semi-transparent dots are individual sampled paraphrases (60 per template, 72 templates, ~4,300 points total). Black circles are per-template medians (one per WT), and X's are two real natural homolog pairs taken from the literature (dengue glycoprotein × chikungunya glycoprotein, and SARS-CoV-2 spike across two PDB structures). In the interactive version, hovering a median recolours it and all of its sibling paraphrases by where they sit relative to the same-fold cutoff (TM = 0.5)¹¹: light orange above (paraphrase preserves the WT fold) and purple below (fold is lost); all other paraphrases stay grey.

`Omnii` has impressive zero-shot and multi-shot paraphrase detection

We evaluate Omnii in two regimes: zero-shot (i.e. unpatched⁸, with no exposure to paraphrase training data) and multi-shot (i.e. patched, with a few labeled support examples provided). In the zero-shot setting, we score each query against a small reference set of pathogen-associated proteins using embedding distance.
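
A minimal sketch of zero-shot scoring by embedding distance follows. The post does not specify the distance or the decision rule, so cosine similarity, the nearest-reference rule, the 0.8 threshold, and the 3-D toy vectors are all assumptions; real embeddings would come from the model.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_score(query_embedding, reference_embeddings) -> float:
    """Similarity of the query to its nearest pathogen-associated reference."""
    return max(cosine_similarity(query_embedding, ref) for ref in reference_embeddings)


def flag(query_embedding, reference_embeddings, threshold: float = 0.8) -> bool:
    """Flag the query if it sits close enough to any reference."""
    return zero_shot_score(query_embedding, reference_embeddings) >= threshold


# Toy 3-D vectors standing in for high-dimensional model embeddings.
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(flag([0.9, 0.1, 0.0], references), flag([0.0, 0.0, 1.0], references))
```

No labeled paraphrases are involved at any point: the only inputs are the query embedding and the reference set, which is what makes the regime zero-shot.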

How we define positive vs negative for the binary task

A binary screener answers a single yes/no question: "is this sequence functionally concerning?" Recall, specificity, and precision in Figures 3 and 5 are all computed against an aggregated two-class label:

Positive class: what a screener should flag:

MPF(+) — AI-generated paraphrases of toxin proteins that pass our in-silico viability filters (high pLDDT, high TM-score to the WT template). Sequences that would plausibly fold into a working toxin.
Control(+) — wild-type toxin proteins themselves (paraphrase positives drawn from WT toxins; some of these are also in the training set).

Negative class: what a screener should not flag:

MPF(−) — AI-generated paraphrases of the same toxin proteins that fail the in-silico viability filters; non-functional outputs that look toxin-derived but wouldn't fold into a working toxin.
Control(−), unconditional — AI generation with no toxin template at all; random protein-like sequences.
Control(−), harmless paraphrases — paraphrases generated from the harmless (non-toxin) reference set, validation split.
Control(−), harmless WT — the wild-type harmless proteins themselves.

So "positive" means both functionally plausible and toxin-derived, and "negative" pulls in everything else — failed paraphrases, random generations, harmless paraphrases, and harmless wild-types. A model that only flags the obvious toxin templates would get high recall but tank specificity by also flagging the harmless WT proteins; the four-way negative split is what makes specificity / precision the discriminating metrics.
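
The aggregation into two classes and the resulting metrics can be sketched as below. The group names mirror the lists above (written with an ASCII minus sign); the `binary_metrics` helper and the six toy records are illustrative only.

```python
POSITIVE_GROUPS = {"MPF(+)", "Control(+)"}
NEGATIVE_GROUPS = {
    "MPF(-)",
    "Control(-) unconditional",
    "Control(-) harmless paraphrases",
    "Control(-) harmless WT",
}


def safe_div(numerator: float, denominator: float) -> float:
    """Division that returns 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(records) -> dict:
    """Compute recall, specificity, precision and F1 from (group, flagged) pairs."""
    tp = fp = tn = fn = 0
    for group, flagged in records:
        if group in POSITIVE_GROUPS:
            if flagged:
                tp += 1
            else:
                fn += 1
        elif group in NEGATIVE_GROUPS:
            if flagged:
                fp += 1
            else:
                tn += 1
        else:
            raise ValueError(f"unknown group: {group}")

    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1}


toy_records = [
    ("MPF(+)", True), ("MPF(+)", False), ("Control(+)", True),
    ("MPF(-)", False), ("Control(-) harmless WT", True),
    ("Control(-) unconditional", False),
]
metrics = binary_metrics(toy_records)
print(metrics)
```

In the toy records, the single flagged harmless wild-type lowers both specificity and precision while leaving recall untouched, which is the effect the four-way negative split is designed to expose.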

Figure 3. Binary pathogen-detection F1 score on the full-gene paraphrase test set across nine systems: four anonymised industrial screening tools; four open-weight language-model baselines (ESM-C, a protein language model (pLM), and Evo1, Evo2, and METAGENE-1, genomic language models (gLMs)); and Omnii (ours). Zero-shot and multi-shot regimes are shown side by side, with dashes marking systems that don't support a regime. Omnii's multi-shot F1 leads, and its zero-shot F1 already matches or exceeds every baseline's multi-shot F1.

Beyond the F1 numbers in Figure 3, Omnii’s pathogen-detection signal is visible directly in its embedding geometry (Figure 4): paraphrased pathogen sequences cluster with their pathogenic templates rather than with their non-pathogenic neighbors, even when no sequence- or structure-alignment tool flags the relationship.

Figure 4. 3D PCA projection of Omnii embeddings. Pathogenic-template-matching sequences cluster apart from the non-pathogenic controls. The separation is not a projection artifact: a linear classifier in the full 4096-D embedding space splits the two classes almost perfectly (LDA AUC 0.991); the 3D view just makes it visible.
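
For readers less familiar with AUC: it is the probability that a randomly chosen positive receives a higher classifier score than a randomly chosen negative. The sketch below computes it directly from that definition on made-up one-dimensional scores; it does not reproduce the LDA fit itself, and `roc_auc` is a hypothetical helper.

```python
def roc_auc(positive_scores, negative_scores) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This is the Mann-Whitney formulation and is
    quadratic in the number of examples, which is fine for a small sketch.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up discriminant scores: one positive falls below one negative.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
print(round(auc, 3))
```

An AUC of 1.0 means every positive outranks every negative; 0.5 is chance.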

`Omnii` outperforms protein language models in multi-shot settings

Multi-shot detection reflects real biosafety workflows: a screener may have only a few confirmed examples of a new pathogenic protein but still needs to flag drifted variants. Compared with a strong protein language model trained on the same support set, Omnii extracts more signal from each labeled example, especially as the held-out family moves farther away in sequence space.

Most language-model baselines achieve strong recall once given labeled examples. Omnii is different: its zero-shot recall is already in the multi-shot range, suggesting it needs less supervision to detect drifted pathogenic proteins.

`Omnii` is robust to structural obfuscation

A protein-only model may lean too heavily on structural cues. If structure is the main signal, then any obfuscation that degrades structure prediction can break an embedding-based detector. We test this by evaluating the baseline pLMs and Omnii on a structure-obfuscated paraphrase set: pairs constructed to preserve functional residues while perturbing the predicted fold. Omnii is more robust to this obfuscation than the pLM baselines, consistent with the model anchoring on multimodal (sequence-and-annotation) signals rather than on structure alone.

Figure 5. The same nine systems on the obfuscated test set: paraphrases constructed to preserve functional residues while perturbing the predicted fold, so that a structure-only detector loses its main signal. Zero-shot and multi-shot F1 are shown side by side. Industrial tools and most LM baselines lose substantial F1 when the structural fingerprint is perturbed; Omnii degrades the least, consistent with the model anchoring on multimodal (sequence + annotation) signals rather than on structure alone.

Organism-level risk scoring: a case study on viral fitness

Protein-level detection gets us only part of the way. A virus is not just a list of proteins: its genome also carries the instructions that control when those proteins are made, how much is made, where they localize, and how the genome is packaged. Whether a virus is pathogenic depends on the whole system (its replication, immune evasion, and transmission), and much of that information lives outside the protein-coding sequence.

A genome language model with enough context and the right representation should be able to score variants across the full viral genome, coding and regulatory sequence alike, for their likely effects on viral fitness and function. This is the practical question public-health surveillance faces when a new variant of a circulating pathogen is sequenced: Does it change the risk?

Figure 6. Zika virus assembly and maturation. Side-by-side cryo-EM structures of the mature (PDB 5IRE, 3.8 Å) and immature (PDB 5U4W, 9.1 Å) particles: the smooth E-dimer shell on the left, the spiky 60-trimer prM–E lattice on the right. Below them, the ~11 kb genome, which encodes a single polyprotein, with structural genes (C, prM, E) in orange and non-structural genes (NS1–NS5) in gray, flanked by the 5′ and 3′ UTRs. RNA structure viewer powered by Mol*.

The task: modeling fitness effects across the DENV-2 genome

We test Omnii on one of the cleanest published maps of viral fitness: the dengue virus (DENV-2) deep mutational scan from Dolan et al.¹², which measured how nearly every single-nucleotide mutation across the DENV-2 genome affects replication in both human and mosquito cells. Can Omnii predict which mutations preserve or damage viral fitness across the full genome?

Early investigations indicate Omnii is able to model viral fitness from its response log-probabilities given the input genome. Each variant is scored alongside its host context (human or mosquito) to yield a per-variant prediction. We then correlate that score against the measured fitness values (w_rel); the per-host result is shown in Figure 7.
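
One common way to turn sequence log-probabilities into a per-variant score is a log-likelihood ratio between the mutated and the reference genome. The sketch below illustrates that general idea only; it is not a description of Omnii's actual scoring. The `log_prob(sequence, host)` callable, the `toy_log_prob` stand-in, and the four-base example genome are all hypothetical.

```python
from typing import Callable


def apply_snv(genome: str, position: int, alt: str) -> str:
    """Return the genome with a single-nucleotide change at a 0-based position."""
    if not 0 <= position < len(genome):
        raise IndexError("position outside genome")
    if len(alt) != 1:
        raise ValueError("alt must be a single nucleotide")
    return genome[:position] + alt + genome[position + 1:]


def variant_score(genome: str, position: int, alt: str, host: str,
                  log_prob: Callable[[str, str], float]) -> float:
    """Log-likelihood ratio of variant vs. reference, conditioned on host.

    Positive values mean the model finds the variant more likely than the
    reference in that host context; negative values mean less likely.
    """
    variant = apply_snv(genome, position, alt)
    return log_prob(variant, host) - log_prob(genome, host)


def toy_log_prob(sequence: str, host: str) -> float:
    """Stand-in for a model call: penalizes each 'A', more strongly in mosquito."""
    penalty = 0.1 if host == "human" else 0.2
    return -penalty * sequence.count("A")


score_human = variant_score("ACGA", 0, "G", "human", toy_log_prob)
score_mosquito = variant_score("ACGA", 0, "G", "mosquito", toy_log_prob)
print(score_human, score_mosquito)
```

Passing the host alongside the sequence lets the same variant receive different scores in human and mosquito contexts, matching the per-host evaluation described above.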

Figure 7. Omnii fitness score against experimental fitness using a high-confidence subset (replicate concordance τ ≤ 0.10). Both axes z-scored within host. Dark line is per-host linear fit; dashed gray is y = x reference. Pearson r and Spearman ρ are reported per host.
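
The per-host statistics in Figure 7 can be reproduced from paired (score, fitness) values with a few lines of standard-library Python. The helper names and toy numbers are illustrative; note that z-scoring changes neither Pearson r nor Spearman ρ, so it matters only for plotting both hosts on comparable axes.

```python
import math


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0.0:
        raise ValueError("cannot z-score constant values")
    return [(v - mean) / sd for v in values]


def pearson(x, y) -> float:
    """Pearson correlation coefficient."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


model_scores = [-2.0, -0.5, 0.1, 0.4]     # toy per-variant scores for one host
measured_fitness = [0.1, 0.6, 0.9, 1.0]   # toy w_rel values
print(pearson(zscore(model_scores), zscore(measured_fitness)),
      spearman(model_scores, measured_fitness))
```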

Omnii also appears sensitive to mutations across viral structural features. We bucket variants by their RNAfold-predicted base-pair partner distance and report the correlation per bucket (Figure 8).

Spearman ρ (higher is better)

Bucket	Partner distance	Mosquito ρ	Mosquito n	Human ρ	Human n
medium pair	100–1,024 nt	0.68	597	0.69	602
local pair	< 100 nt	0.58	3,719	0.59	3,627
unpaired	n/a	0.65	2,634	0.67	2,516

Pearson r (higher is better)

Bucket	Partner distance	Mosquito r	Mosquito n	Human r	Human n
medium pair	100–1,024 nt	0.70	597	0.71	602
local pair	< 100 nt	0.59	3,719	0.61	3,627
unpaired	n/a	0.66	2,634	0.66	2,516

Figure 8. Score-fitness correlation on the high-confidence subset (τ ≤ 0.10), broken down by partner-distance bucket and host. The score reaches positive correlation on every reported bucket, with the strongest signal on medium-distance pairs (100–1,024 nt).
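
The partner-distance bucketing behind Figure 8 can be sketched from a dot-bracket secondary structure, the notation RNAfold emits. The bucket boundaries follow the table; the label for pairs beyond 1,024 nt is an assumption of this sketch, since no such bucket is reported here, and pseudoknot brackets are not handled.

```python
def partner_distances(dot_bracket: str):
    """For each position, the distance in nt to its paired partner, or None if unpaired."""
    stack = []
    distances = [None] * len(dot_bracket)
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            j = stack.pop()
            distances[i] = distances[j] = i - j
        elif symbol != ".":
            raise ValueError(f"unsupported symbol: {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return distances


def bucket(distance) -> str:
    """Map a partner distance to the bucket names used in Figure 8."""
    if distance is None:
        return "unpaired"
    if distance < 100:
        return "local pair"
    if distance <= 1024:
        return "medium pair"
    return "beyond 1,024 nt"  # not a reported bucket; label chosen for this sketch


structure = "((..))"  # toy hairpin
print([bucket(d) for d in partner_distances(structure)])
```

Each variant inherits the bucket of the position it mutates, after which the correlation is computed separately within each bucket and host.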

Results point to a learned structural awareness

Omnii's scores correlate most strongly with measured fitness on medium-distance RNA pairs (100–1,024 nt; Figure 8), the bucket where both endpoints of the pair fall within the model's prompt window. The signal is consistent across both hosts (Spearman ρ ≈ 0.69) and edges above the unpaired-position and local-pair correlations. Mutations whose structural partner is in context appear to give the clearest signal, consistent with the model using a variant's structural environment, not just its immediate sequence neighborhood, to shape its fitness call. We attribute the model's sharpest responses at these more distantly paired positions to flavivirus evolution: long-range pairings¹³¹⁴ are harder to satisfy with an alternative base than a nearby local pair, so the model's predicted distribution collapses to a more confident call as the pair distance grows.

Biodefense should leverage these AI-learned semantics. Experimental biology (DMS) and comparative genomics (alignments) identify the same constrained positions, but each requires curated supporting data: replicate fitness measurements in the first case, multi-species alignments in the second. Omnii attempts to reconstruct the same mapping from raw sequence alone, which is the regime any screener has to operate in when a pathogen is novel.

Closing

Together, these protein- and organism-level results show why pathogen detection needs models that can reason about biological function, not just surface similarity. At protein resolution, Omnii detects pathogenicity-associated paraphrases, including adversarial examples, in settings where sequence- and structure-alignment tools can fail. At organism resolution, Omnii ranks the fitness effects of variants across a viral genome, including in structurally constrained regions where standard sequence-based scoring can miss the signal.

We are extending this work across more pathogen families, stronger adversarial paraphrase sets, codon- and RNA-fold-aware scoring, and surveillance tools that map functional regions and continuous risk across viral genomes (Figure 9 shows an early look at our UI). Building this kind of biodefense infrastructure will require close collaboration across model development, computational biology, synthesis screening, public-health surveillance, and the broader biodefense community.

Figure 9. Surveillance UI preview: a toxin gene painted by functional regions with a continuous per-position Omnii risk score.

We are hiring across AI, computational biology, and biodefense, and we are looking for partners who want to help build the next generation of biological threat detection. Join us.

Omnii for defense is in early access

We're looking for trusted design partners and allies to deploy Omnii safely.

Or write to us at defense@radicalnumerics.ai.

Citation

@misc{omnii_defense_preview_2026,
  title        = {A new era of biological threats, and the technology to defend against them},
  author       = {{Radical Numerics}},
  year         = {2026},
  month        = {Jun},
  url          = {https://www.radicalnumerics.ai/blog/omnii-defense-preview},
  organization = {Radical Numerics}
}

References

1. Radical Numerics. "A new frontier in generative genomics with Omnii." Research preview (2026).
2. "Universal Screening of Synthetic Nucleic Acid Orders." screendna.org (2025).
3. Hagey, K. "Top AI CEOs Call for Law Protecting Against Biological Weapons." Wall Street Journal (2025).
4. OpenAI. "Biodefense in the Intelligence Age." (2025).
5. Nguyen, E., Poli, M., Durrant, M.G., et al. "Sequence modeling and design from molecular to genome scale with Evo." Science 386, eado9336 (2024).
6. Brixi, G., Durrant, M.G., Ku, J., et al. "Genome modelling and design across all domains of life with Evo 2." Nature (2026).
7. King, S.H., Driscoll, C.L., et al. "Generative design of novel bacteriophages with genome language models." bioRxiv (2025).
8. Wittmann, B.J., Alexanian, T., Bartling, C., Beal, J., Clore, A., Diggans, J., Flyangolts, K., Gemler, B.T., Mitchell, T., Murphy, S.T., Wheeler, N.E., Horvitz, E. "Strengthening nucleic acid biosecurity screening against generative protein design tools." Science 390(6768), 82–87 (2025).
9. Buchfink, B., Reuter, K., Drost, H.-G. "Sensitive protein alignments at tree-of-life scale using DIAMOND." Nature Methods 18, 366–368 (2021).
10. van Kempen, M., Kim, S.S., Tumescheit, C., et al. "Fast and accurate protein structure search with Foldseek." Nature Biotechnology 42, 243–246 (2024).
11. Xu, J., Zhang, Y. "How significant is a protein structure similarity with TM-score = 0.5?" Bioinformatics 26(7), 889–895 (2010).
12. Dolan, P.T., Taguwa, S., et al. "Principles of dengue virus evolvability derived from genotype-fitness maps in human and mosquito cells." eLife 10, e61921 (2021).
13. Alvarez, D.E., De Lella Ezcurra, A.L., Fucito, S., Gamarnik, A.V. "Long-range RNA-RNA interactions circularize the dengue virus genome." Journal of Virology 79(11), 6631–6643 (2005).
14. Akiyama, B.M., Laurence, H.M., Massey, A.R., et al. "Zika virus produces noncoding RNAs using a multi-pseudoknot structure that confounds a cellular exonuclease." Science 354(6313), 1148–1152 (2016).
