
Omnii for health is in early access
We're looking for design partners and collaborators across clinical genomics and biological research.
DNA is the foundation of life. Scientists have spent decades sequencing genomes and developing experimental and statistical methods to connect genetic variation to biological function. This work has transformed biology and human health, but we still understand only a fraction of how genomic sequences give rise to living systems. Members of our team helped pioneer the field of generative genomics, betting that AI systems could learn DNA grammar deeply enough to render biology designable, not just describable. Those early models showed what was possible, but we had only scratched the surface.
In natural language, raw pretraining created powerful base models, but it took additional breakthroughs in scale, alignment, mid-training, and post-training to turn them into broadly useful systems. Biology now needs to take the same leap.
Omnii is a research preview of what that looks like. It is a state-of-the-art genome language model (gLM) built to predict, design, and interpret. It sets a new performance frontier for gLMs across noncoding variant effect prediction, RNA fitness, and mechanistic interpretability, outperforming strong integrative baselines. It can be prompted to design target sequences via a new evolutionary chain-of-thought technique. Inside its 2 million base pair context window, it picks up enhancer-to-promoter dependencies across distal regulatory landscapes, matching contacts seen in experimental chromatin maps. Omnii also rediscovers key variants associated with Alzheimer's.
Unlocking the potential of genome language models
Our prior contributions with HyenaDNA [1], Evo [2], and Evo 2 [3] established the first scaling laws for DNA pretraining, showing that gLMs improve predictably with compute and data. These models demonstrated that raw DNA sequence contains learnable structure at scale. But they also revealed two limitations that have kept base genome models from becoming broadly useful systems.
First, they are unimodal. They are trained primarily on nucleotide sequence alone, without direct access to the rich biological measurements that researchers use to interpret genomes. (Together with DNA, the amount of biological sequence information online is already far greater than all the text on the internet.) Second, they are unaligned: there is no mechanism for a user to elicit task-specific behavior from the model directly. (We use "unaligned" here in the AI sense, i.e., not post-trained to follow user intent, rather than in the sequence-alignment sense familiar to biologists.)
As a result, base genome language models can be powerful in principle but difficult to use in practice. Despite their scale, they often underperform dedicated sequence-to-function models (e.g., Borzoi [4]) and machine learning-based aggregation models (e.g., CADD [5]) on applied tasks such as variant interpretation. In this sense, Evo-class gLMs are more analogous to base language models prior to post-training than to the instruction-aligned models that most LLM users now take for granted.
Our research combines three ingredients that have historically been pursued in isolation: scaling, alignment, and modality fusion. The goal is a new class of genome models that can reason directly over biological sequences while being aligned to the experimental and evolutionary signals that make those sequences interpretable (Figure 1).
Highlights
Omnii models are pretrained with native fusion of information-dense genomic annotation tracks alongside DNA sequence and are post-trained to be directly applicable to downstream design and prediction tasks. As a result, they can be used out of the box (without training task-specific heads on top of frozen embeddings) and they reason jointly over raw DNA and the annotation tracks. In this first preview, Omnii ingests DNA sequence and conservation signals (MSA and phylogenetic scores).
New architecture for sparse multimodality: Omnii features a new architecture with a multi-hybrid backbone featuring block convolutions (a generalization of scalar convolutions, suited to genomic modeling), a dynamic sparse attention mechanism, and a 2 million base pair context window (Figure 2).
Alignment: Omnii is the first gLM to be mid-trained and post-trained, unlocking new applications without requiring training of specialized classifiers on frozen embeddings. Omnii can reason over DNA tokens, genomic annotation tokens, and other general-purpose special tokens.
Performance: Omnii achieves state-of-the-art results across standard benchmarks for gLMs, including variant effect prediction in clinical variants and for complex traits (Table 1). Strikingly, Omnii is especially strong in decoding noncoding variants, offering a complementary approach to other models specialized in the coding portions of the genome.
Omnii marks the first time a gLM outperforms the model families that have historically partitioned this space:
- Handcrafted, integrative bioinformatics models that aggregate dozens of curated annotations (conservation, regulatory tracks, protein-impact scores) through supervised meta-classifiers: CADD [5] for SNVs and short indels, and its structural-variant counterpart CADD-SV [8].
- Predictive sequence-to-function models such as Borzoi [4], trained to map DNA sequence directly to functional genomic readouts (expression, chromatin accessibility, regulatory activity), with variant effects read out from the predicted reference-versus-alternate difference. Omnii is competitive with these despite being a general-purpose generative model rather than a specialized regressor over functional tracks.
- Other genome language models, including all Evo 2 [3] checkpoints and recent covariance-based probes trained on top of frozen gLM embeddings (e.g., EVEE [7]), one of the recipes for adapting pretrained DNA models to variant-effect tasks. Omnii improves over these substantially while requiring no embedding-side probe training.
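The reference-versus-alternate readout used by sequence-to-function models can be sketched in a few lines. This is a minimal illustration, not Borzoi's or Omnii's actual interface: `apply_snv`, `variant_effect`, and the GC-fraction `toy_gc_readout` are hypothetical names, and the toy readout only stands in for a trained predictor of a functional track.

```python
from typing import Callable


def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the base at 0-based position `pos` changed from ref to alt."""
    if ref_seq[pos] != ref:
        raise ValueError(f"reference mismatch at {pos}: expected {ref}, found {ref_seq[pos]}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def variant_effect(predict: Callable[[str], float], ref_seq: str,
                   pos: int, ref: str, alt: str) -> float:
    """Score a variant as predict(alternate sequence) - predict(reference sequence)."""
    return predict(apply_snv(ref_seq, pos, ref, alt)) - predict(ref_seq)


def toy_gc_readout(seq: str) -> float:
    """Placeholder readout (GC fraction); a real model would predict a functional track."""
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical 10 bp window with an A>G substitution at position 0.
window = "ATGCATGCAT"
delta = variant_effect(toy_gc_readout, window, pos=0, ref="A", alt="G")
print(f"ref-vs-alt delta: {delta:+.2f}")
```

Any function that maps a sequence to a scalar readout can be passed as `predict`, which is what makes the difference score model-agnostic.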

Table 1. Benchmark summary.

| Benchmark | Metric | Evo 2 [3] 7B | Evo 2 [3] 40B base | Borzoi [4] (Linder et al.) | EVEE [7] cov. probe | CADD [5] v1.7 | CADD-SV [8] seqresolved | Omnii Preview |
|---|---|---|---|---|---|---|---|---|
| ClinVar noncoding SNV | AUROC ↑ | 0.879 | 0.911 | 0.848 | 0.942 | 0.909 | — | 0.975 |
| ClinVar noncoding SNV | AUPRC ↑ | 0.425 | 0.440 | 0.173 | 0.646 | 0.289 | — | 0.735 |
| ClinVar noncoding indel | AUROC ↑ | 0.866 | 0.905 | 0.546 | 0.969 | 0.896 | — | 0.994 |
| ClinVar noncoding indel | AUPRC ↑ | 0.813 | 0.836 | 0.299 | 0.923 | 0.873 | — | 0.974 |
| ClinVar CNV 50bp–100kb | AUROC ↑ | 0.605 | 0.606 | 0.577 | — | — | 0.664 | 0.773 |
| ClinVar CNV 50bp–100kb | AUPRC ↑ | 0.586 | 0.580 | 0.569 | — | — | 0.638 | 0.763 |
| TraitGym complex-trait | AUPRC ↑ | 0.150 | 0.153 | 0.297 | — | 0.284 | — | 0.410 |
| RNAGym RNA fitness | \|ρ\| ↑ | 0.273 | 0.315 | — | — | — | — | 0.398 |

Benchmarking Omnii
Clinical variants: substitutions, short insertions and deletions
Substitutions and short indels are the overwhelming majority of entries in databases like ClinVar [13], where interpretation is hardest at scale and most observed variants remain variants of uncertain significance. The difficulty is uneven: a missense change in a coding exon acts through the protein it alters, the regime where protein language models and supervised predictors already do well. However, less than 2% of the human genome codes for proteins; the rest is largely regulatory information that controls when, where, and how strongly each gene is expressed. Where a coding mutation changes a protein directly, a regulatory mutation can leave protein sequences intact but instead alter gene expression profiles in pathogenic ways.
Such noncoding variation is where genome language models are expected to contribute most. By modeling DNA directly, they can in principle score regulatory variants for which no protein-level features exist. In practice, gLMs have so far trailed integrative scores such as CADD [5], which combine conservation and dozens of annotations into a supervised meta-classifier, and applying them to variant interpretation has typically required training a separate classifier or covariance probe on frozen embeddings, as in EVEE [7].
Omnii sets a new state of the art on ClinVar pathogenicity classification (Tables 2, 3). Its gains are largest on noncoding variants, the categories where coding-focused predictors do not apply and where prior gLMs were weakest. It produces these predictions directly, without training task-specific probes or extracting embeddings.
Note on the evaluation protocol: for a direct comparison, we adopt the protocol defined in EVEE [7]. ClinVar variants are restricted to those with ≥1-star review status and lying in genes ≤100 kb, then partitioned by gene so that 80% of genes contribute training variants and the remaining 20% are held out for evaluation, ensuring no gene appears in both sets. All baselines are re-scored against this same split.
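The gene-level partition in the evaluation protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the EVEE or Omnii pipeline: the record fields (`gene`, `review_stars`, `gene_length_bp`), the function names, and the toy records are all hypothetical.

```python
import random


def passes_filters(variant, min_stars=1, max_gene_bp=100_000):
    """Keep variants with >=1-star review status that lie in genes of at most 100 kb."""
    return variant["review_stars"] >= min_stars and variant["gene_length_bp"] <= max_gene_bp


def split_by_gene(variants, train_frac=0.8, seed=0):
    """Assign whole genes to train or held-out, so no gene appears in both sets."""
    genes = sorted({v["gene"] for v in variants})
    random.Random(seed).shuffle(genes)
    n_train = round(train_frac * len(genes))
    train_genes = set(genes[:n_train])
    train = [v for v in variants if v["gene"] in train_genes]
    held_out = [v for v in variants if v["gene"] not in train_genes]
    return train, held_out


# Toy records: 10 hypothetical genes with 3 variants each, plus one that fails the filters.
variants = [
    {"gene": f"GENE{g}", "variant_id": f"v{g}_{i}", "review_stars": 1, "gene_length_bp": 50_000}
    for g in range(10) for i in range(3)
]
variants.append({"gene": "BIGGENE", "variant_id": "v_big", "review_stars": 2, "gene_length_bp": 250_000})

eligible = [v for v in variants if passes_filters(v)]
train, held_out = split_by_gene(eligible)
train_genes = {v["gene"] for v in train}
held_out_genes = {v["gene"] for v in held_out}
```

Splitting on genes rather than on variants is what prevents a model from scoring a held-out variant using neighbours from the same gene seen in training.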

Table 2. ClinVar pathogenicity classification by variant category.

| Category | n | % pathogenic | Evo 2 [3] 7B AUROC | Evo 2 [3] 7B AUPRC | EVEE [7] cov. probe AUROC | EVEE [7] cov. probe AUPRC | CADD [5] v1.7 AUROC | CADD [5] v1.7 AUPRC | Omnii Preview AUROC | Omnii Preview AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|
| Coding SNV | 198,343 | 15.1 | 0.980 | 0.915 | 0.995 | 0.979 | 0.993 | 0.966 | 0.996 | 0.982 |
| Noncoding SNV | 88,706 | 0.7 | 0.879 | 0.425 | 0.942 | 0.646 | 0.909 | 0.289 | 0.975 | 0.735 |
| Coding Indel | 26,992 | 96.1 | 0.878 | 0.993 | 0.944 | 0.997 | 0.905 | 0.994 | 0.973 | 0.999 |
| Noncoding Indel | 14,994 | 15.0 | 0.866 | 0.813 | 0.969 | 0.923 | 0.896 | 0.873 | 0.994 | 0.974 |
| Overall | 329,035 | 17.9 | 0.976 | 0.945 | 0.996 | 0.986 | 0.985 | 0.958 | 0.997 | 0.990 |


Table 3. ClinVar pathogenicity classification by variant consequence.

| Type | Consequence | n | % pathogenic | Evo 2 [3] 7B AUROC | Evo 2 [3] 7B AUPRC | EVEE [7] cov. probe AUROC | EVEE [7] cov. probe AUPRC | CADD [5] v1.7 AUROC | CADD [5] v1.7 AUPRC | Omnii Preview AUROC | Omnii Preview AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SNV | Overall | 287,074 | 10.7 | 0.981 | 0.902 | 0.995 | 0.974 | 0.992 | 0.958 | 0.996 | 0.979 |
| SNV | Missense | 37,682 | 22.9 | 0.899 | 0.683 | 0.955 | 0.876 | 0.937 | 0.783 | 0.969 | 0.916 |
| SNV | Synonymous | 139,137 | 0.1 | 0.926 | 0.237 | 0.906 | 0.564 | 0.942 | 0.207 | 0.926 | 0.508 |
| SNV | Nonsense | 14,127 | 98.1 | 0.827 | 0.995 | 0.911 | 0.998 | 0.865 | 0.996 | 0.964 | 0.999 |
| SNV | Splice | 7,397 | 98.6 | 0.806 | 0.996 | 0.856 | 0.997 | 0.840 | 0.996 | 0.954 | 0.999 |
| SNV | UTR | 5,731 | 0.5 | 0.807 | 0.147 | 0.850 | 0.170 | 0.807 | 0.206 | 0.963 | 0.316 |
| SNV | Intronic | 82,023 | 0.7 | 0.905 | 0.461 | 0.956 | 0.705 | 0.908 | 0.283 | 0.975 | 0.775 |
| SNV | Other | 952 | 5.6 | 0.635 | 0.315 | 0.701 | 0.311 | 0.957 | 0.620 | 0.981 | 0.761 |
| Indel | Overall | 41,989 | 67.2 | 0.964 | 0.981 | 0.991 | 0.995 | 0.977 | 0.988 | 0.995 | 0.997 |
| Indel | Frameshift | 24,225 | 98.9 | 0.717 | 0.995 | 0.822 | 0.997 | 0.804 | 0.997 | 0.925 | 0.999 |
| Indel | In-frame | 1,234 | 48.1 | 0.901 | 0.903 | 0.959 | 0.960 | 0.867 | 0.852 | 0.980 | 0.978 |
| Indel | Noncoding | 12,942 | 4.3 | 0.914 | 0.686 | 0.970 | 0.881 | 0.956 | 0.822 | 0.992 | 0.933 |
| Indel | Splice-adj | 1,533 | 91.4 | 0.854 | 0.983 | 0.967 | 0.995 | 0.831 | 0.972 | 0.976 | 0.997 |
| Indel | Insertion | 14,061 | 62.2 | 0.951 | 0.974 | 0.987 | 0.992 | 0.963 | 0.980 | 0.995 | 0.996 |
| Indel | Deletion | 27,297 | 70.3 | 0.969 | 0.984 | 0.993 | 0.996 | 0.984 | 0.992 | 0.996 | 0.998 |

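The tables above report AUROC and AUPRC side by side, and the two respond very differently to class imbalance: a random ranker scores about 0.5 AUROC regardless of prevalence, while its AUPRC is roughly the fraction of pathogenic variants (for example 0.7% for noncoding SNVs). The sketch below computes both from labels and scores. It uses average precision as the AUPRC estimator, which is a common choice but an assumption here, since the source does not state which estimator was used; the labels and scores are toy values.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5).

    Quadratic in the number of examples; fine for a sketch, too slow for large sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example: 2 positives among 10 variants (prevalence 0.2).
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
prevalence = sum(labels) / len(labels)
print(f"AUROC={roc:.4f}  AP={ap:.4f}  random-ranker AP baseline~{prevalence:.2f}")
```

Reading AUPRC against the prevalence column is what makes the low-prevalence rows (synonymous, UTR, intronic) comparable with the high-prevalence ones (nonsense, frameshift), where even weak rankers reach AUPRC near 1.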
Clinical variants: structural and copy number variants
Copy number variants (CNVs) and other large structural variants have received much less attention from genome language models. These events delete, duplicate, or rearrange larger segments of DNA, from tens of bases to megabase-scale regions. They are a major cause of Mendelian disease and a common driver in cancer, but they are not uniformly pathogenic. Many copy-number changes are tolerated in the general population, so interpretation depends on the specific sequence affected and the genes and regulatory elements it contains.
CNVs are difficult because they span multiple biological regimes. Small events may disrupt a splice site, regulatory element, or local coding frame. Here, fine-grained sequence and regulatory annotations matter most. Larger deletions or duplications may affect entire genes or gene clusters. In these cases, dosage sensitivity becomes more informative.
These two regimes of pathogenicity mechanisms, across small and large variants, have shaped the model landscape. Large-event classifiers rely heavily on gene overlap and dosage annotations. CADD-SV [8] and related integrative models combine conservation, regulatory annotations, gene context, and repeat features into meta-scores that also cover noncoding events. Prior sequence-only gLMs have been constrained by small context windows, which are often too short for large events, and they lack native access to the annotations that make many CNVs interpretable.
Omnii is built to cover both ends of this spectrum. Its megabase-scale context input can represent multi-kb events directly, while multimodal annotation fusion brings in the conservation, regulatory, and dosage signals used by the strongest integrative baselines (Figure 3).
Complex trait classification
Predicting genotype-to-phenotype relationships is a central goal of human genetics. Genome-wide association studies have linked thousands of loci to complex traits and disease, but most associations do not identify a single causal variant. Rather, they identify regions of correlated variants inherited together through linkage disequilibrium. Fine-mapping can narrow these regions, but often cannot fully resolve them, especially in noncoding loci where there is no protein-level consequence information.
This is the practical bottleneck between a GWAS hit and a biological mechanism. To evaluate Omnii on this problem, we probe its performance on TraitGym Complex [11], a recent benchmark that scores variant-effect predictors on fine-mapped trait variants stratified by category (Figure 4). Categories blend Ensembl VEP consequence terms with ENCODE-SCREEN cCRE classes [15]. There, Omnii surpasses the two prevailing approaches to noncoding variant prioritization on their own: the leading integrative meta-classifier (CADD v1.7 [5]) and a top sequence-to-function model (Borzoi [4]). It also matches or exceeds the previous state of the art on this benchmark: a supervised ensemble of CADD, Borzoi, and the multispecies-alignment model GPN-MSA [12].
Note on the evaluation protocol: we follow TraitGym's LR protocol. A logistic regression is fit on each model's per-variant features under leave-one-chromosome-out splits, and the reported AUROC/AUPRC are per-chromosome n-weighted averages (matching the TraitGym leaderboard convention). Omnii's features come from its internal activations.
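The leave-one-chromosome-out protocol with n-weighted averaging can be sketched as below. This is a minimal illustration, not TraitGym's code: the record layout and the function names are hypothetical, and the `fit` and `metric` callables are placeholders for the logistic regression and the AUROC/AUPRC computation.

```python
def leave_one_chromosome_out(variants):
    """Yield (held_out_chrom, train, test), holding each chromosome out exactly once."""
    for held_out in sorted({v["chrom"] for v in variants}):
        train = [v for v in variants if v["chrom"] != held_out]
        test = [v for v in variants if v["chrom"] == held_out]
        yield held_out, train, test


def n_weighted_average(per_chrom):
    """Average per-chromosome metrics, weighting each by its number of test variants."""
    total_n = sum(n for n, _ in per_chrom)
    return sum(n * m for n, m in per_chrom) / total_n


def loco_score(variants, fit, metric):
    """Fit on all other chromosomes, score the held-out one, then n-weight the results."""
    per_chrom = []
    for _, train, test in leave_one_chromosome_out(variants):
        model = fit(train)  # placeholder for a logistic regression on per-variant features
        per_chrom.append((len(test), metric(model, test)))
    return n_weighted_average(per_chrom)


# Toy data: 2 variants on chromosome "1" and 6 on chromosome "2".
toy = [{"chrom": "1"}] * 2 + [{"chrom": "2"}] * 6
folds = list(leave_one_chromosome_out(toy))
# Dummy fit and metric: pretend chromosome "1" scores 1.0 and chromosome "2" scores 0.5.
score = loco_score(toy, fit=lambda train: None,
                   metric=lambda model, test: 1.0 if test[0]["chrom"] == "1" else 0.5)
```

Weighting by n means a chromosome with few fine-mapped variants cannot dominate the reported number, which an unweighted mean over chromosomes would allow.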
A second, more subtle improvement appears in how Omnii behaves with respect to the statistical covariates that systematically confound variant-prioritization benchmarks: minor allele frequency, distance to the nearest transcription start site, and local linkage disequilibrium structure. Across stratifications of these covariates, Omnii's ranking quality stays largely flat: rare variants, distal regulatory contexts, and high-LD blocks are scored as reliably as the easier regimes where conventional methods are already informative (Figure 5). This matters because predictors whose advantage concentrates in the regimes where strong priors already exist contribute little marginal evidence at exactly the loci where conventional methods are least informative.
We expect Omnii to be of practical value to statistical geneticists building on existing GWAS meta-analyses: as a sequence-grounded functional-evidence layer over fine-mapped credible sets, and as a model amenable to trait-specific post-training. We showcase a practical application in the following section.
Working with Omnii
What follows walks through three concrete uses of Omnii: prioritizing causal variants for complex traits, designing functional biomolecules with chain-of-thought reasoning, and reading mechanistic regulatory structure out of its activations.
Omnii predicts causal variants for a variety of complex traits
Variants linked to complex traits and disease lie mostly in noncoding DNA, where the strongest statistical signal marks a region of the genome in linkage disequilibrium, rather than a single causal variant. Historically, when linkage disequilibrium leaves the candidates indistinguishable, isolating the variant that carries the regulatory signal has meant going to the bench. These experiments are slow and specialized, and can be a bottleneck between a GWAS hit and a drug target. Most GWAS loci still have no confirmed causal variant.
We tested Omnii's ability to find this signal in the sequence itself: a functional variant disrupts a motif, a conserved position, or a piece of regulatory grammar, and a model that reads DNA well should find it. To run the test, we leveraged a massively parallel reporter assay (MPRA) in human iPSC-derived microglia [16], the cell type most strongly implicated by Alzheimer's GWAS, measuring allele by allele which noncoding variants are functional. Across nine of these loci, the highest-probability fine-mapping variant is not the MPRA functional one.
Omnii recovers the functional variants from sequence alone (Figure 6). Across the same nine loci it ranks the functional variant in the top three every time and first at six, while CADD ranks it anywhere from first to sixteenth. At TSPOAP1, a largely uncharacterized Alzheimer's risk locus highlighted in Lee et al., the top fine-mapping variant rs2680700 is experimentally silent; the functional variant rs2526377 is Omnii's first pick.
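The "top three every time, first at six" summary is a rank statistic over each locus's credible set, which can be computed as sketched here. The locus names, variant IDs, and scores are made-up placeholders rather than values from the study, and `rank_of` and `summarize` are hypothetical helper names.

```python
def rank_of(scores, target):
    """1-based rank of `target` when variants are sorted by descending model score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


def summarize(loci, k=3):
    """Count the loci where the functional variant ranks first, and within the top k."""
    ranks = {name: rank_of(locus["scores"], locus["functional"])
             for name, locus in loci.items()}
    return {
        "ranks": ranks,
        "top1": sum(r == 1 for r in ranks.values()),
        "topk": sum(r <= k for r in ranks.values()),
    }


# Placeholder credible sets: per-variant model scores plus the MPRA-functional variant.
loci = {
    "locus_A": {"scores": {"var_1": 0.2, "var_2": 0.9, "var_3": 0.5}, "functional": "var_2"},
    "locus_B": {"scores": {"var_1": 0.7, "var_2": 0.4, "var_3": 0.6, "var_4": 0.1},
                "functional": "var_3"},
}
summary = summarize(loci)
```

The same function applied to a baseline's scores gives a like-for-like comparison, since ranks are computed within each credible set rather than across loci.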
Recovering functional variants in silico is the foundation; future Omnii releases aim to also determine the genes associated with functional variants and how these variants affect gene regulation. Explore the loci and per-variant evidence in the Alzheimer's workbench.
The Alzheimer's workbench: Omnii's in silico reproduction of the microglia MPRA, recovering the functional variant from sequence across the credible sets.
Omnii uses chain-of-thought reasoning for in silico molecular design
The same capabilities that make Omnii useful for interpreting genomic sequences can also be directed toward generative design. Many experimental workflows are still constrained by the molecules that nature provides, or by hand-engineered molecules that require extensive rounds of optimization. We asked whether our post-training methods could use laboratory-generated data to guide the design of new biomolecules with desired functional properties.
As proof-of-concept, we show here an example of RNA aptamer design. RNA aptamers are short structured RNAs that can bind proteins and other molecular targets with high specificity, making them a useful proving ground for sequence design. Using a custom chain-of-thought format over trajectories of increasing fitness, we post-trained Omnii on a partial corpus from a systematic evolution of ligands by exponential enrichment (SELEX) experiment designed to identify high-affinity RNA aptamer binders and inhibitors of the HIV reverse transcriptase [9]. We held out the highest-affinity aptamers from training and asked whether Omnii could recover these sequences or even improve upon them.
Using this approach, Omnii generated exact sequences from the high-affinity holdout set, as well as novel sequences predicted to have favorable binding properties (Figure 7). This performance was consistent across multiple affinity hold-out thresholds.
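One way to picture the data preparation is sketched below: aptamers above an affinity threshold are held out, and the remaining sequences are grouped into trajectories ordered by increasing fitness. The source does not describe its chain-of-thought format, so everything here is an assumption for illustration: the record fields, the sampling scheme, the `[fitness=...]` text rendering, and the toy sequences and fitness values are all hypothetical.

```python
import random


def split_by_affinity(records, holdout_threshold):
    """Hold out every aptamer whose fitness is at or above the threshold."""
    train = [r for r in records if r["fitness"] < holdout_threshold]
    held_out = [r for r in records if r["fitness"] >= holdout_threshold]
    return train, held_out


def sample_trajectory(train, length, rng):
    """Draw `length` training aptamers and order them by increasing fitness."""
    return sorted(rng.sample(train, length), key=lambda r: r["fitness"])


def format_trajectory(trajectory):
    """Render a trajectory as one training string (hypothetical text format)."""
    return " -> ".join(f"[fitness={r['fitness']:.2f}] {r['seq']}" for r in trajectory)


# Made-up RNA sequences and fitness values, for illustration only.
records = [
    {"seq": "GGAUCCGA", "fitness": 0.10},
    {"seq": "GGAUCCGU", "fitness": 0.25},
    {"seq": "GGAACCGU", "fitness": 0.40},
    {"seq": "GGAACCCU", "fitness": 0.55},
    {"seq": "GCAACCCU", "fitness": 0.80},
    {"seq": "GCAACGCU", "fitness": 0.95},
]
train, held_out = split_by_affinity(records, holdout_threshold=0.75)
trajectory = sample_trajectory(train, length=3, rng=random.Random(0))
example = format_trajectory(trajectory)
```

Repeating the split at several thresholds gives the multiple hold-out settings mentioned above, with each threshold defining a different set of sequences the model never sees.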

These results illustrate a general approach to in silico evolutionary design. Starting from imperfect experimental data, Omnii can be post-trained to generate sequences that move toward a desired functional regime. We are applying Omnii to a variety of other classes of functional biomolecules, especially in settings where wet-lab optimization remains slow, expensive, or difficult to scale.
Omnii learns rich biological representations across multiple levels of genetic complexity
The same alignment that makes Omnii usable out of the box lets us turn the question around: rather than asking the model to make predictions about sequences, we can ask what it has learned from them.
By probing the model’s activations, we find that Omnii organizes sequence-derived features across layers in a way that reflects increasing biological complexity. Simple properties, such as GC content and motif composition, are resolved early. More structured features, including regulatory grammar and gene architecture, emerge at later depths. This hierarchical progression indicates that information encoded in raw sequence is transformed into representations that capture increasingly abstract aspects of genome organization and function.
We also find low-dimensional subspaces with geometry that tracks specific biological categories, including traits and disease classes. In these spaces, clinically related variants cluster together, suggesting that the model has learned some features that capture shared genetic and functional structure across diseases (Figure 8). These encouraging early mechanistic studies suggest that Omnii’s internal representations may provide a useful substrate for studying how genetic variation relates to genome function and disease.
Building Omnii
Omnii benefits from our research program on model architecture, designed to improve the effectiveness of multimodal pre- and post-training.
Detecting genomic motifs with block convolutions
One particular improvement happens inside block convolutions, first introduced for long-sequence DNA modelling by Ku et al. [10]. We extend that work by removing the Toeplitz restriction on the blocks, so that the linear operator applied within each block is learned from data rather than fixed to a Toeplitz form. Since block convolutions are aligned with the GPU hierarchy, this increase in local modelling expressivity comes at no additional throughput cost.
Framing convolution as a matrix makes the generalization clear. A scalar convolution is a Toeplitz matrix; tiling it into blocks is the same computation, reorganized into dense tiles suited to GEMMs. This yields a Toeplitz block convolution. The hardware insight is that, since we already pay the dense price locally, we can drop the Toeplitz constraint and let the data determine the entries. Each block is now a dense, learned matrix (Figure 9).
Where a scalar convolution is effective at picking out local single-nucleotide motifs, a block convolution can capture hierarchical structure; for example, how those motifs combine into some of the higher-order patterns that govern regulatory function.
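The matrix view can be checked numerically with a small dependency-free sketch. It builds the Toeplitz tiles of a causal scalar convolution, confirms that the tiled computation matches the direct one, and then replaces a tile entry to show what dropping the Toeplitz constraint means. This is an illustration under stated assumptions, not Omnii's kernel: the function names are hypothetical, the filter and block size are toy values, and keeping the diagonal tile lower-triangular (so the operator stays causal) is a choice made for this sketch.

```python
def causal_conv(x, h):
    """Direct scalar convolution: y[t] = sum_k h[k] * x[t - k], for t - k >= 0."""
    return [sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
            for t in range(len(x))]


def toeplitz_blocks(h, b):
    """Tile the convolution's Toeplitz matrix into b x b blocks, one per block offset d."""
    n_offsets = (len(h) + b - 2) // b + 1

    def entry(d, i, j):
        k = d * b + i - j  # filter tap connecting output row i to input column j
        return h[k] if 0 <= k < len(h) else 0.0

    return [[[entry(d, i, j) for j in range(b)] for i in range(b)]
            for d in range(n_offsets)]


def matvec(m, v):
    return [sum(a * c for a, c in zip(row, v)) for row in m]


def block_conv(x, blocks, b):
    """y_I = sum_d blocks[d] @ x_(I-d): one small dense matrix product per tile."""
    chunks = [x[s:s + b] for s in range(0, len(x), b)]
    out = []
    for I in range(len(chunks)):
        acc = [0.0] * b
        for d, block in enumerate(blocks):
            if I - d >= 0:
                acc = [a + c for a, c in zip(acc, matvec(block, chunks[I - d]))]
        out.extend(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
h = [0.5, -1.0, 0.25]
blocks = toeplitz_blocks(h, b=4)
y_direct = causal_conv(x, h)
y_tiled = block_conv(x, blocks, b=4)  # same numbers, reorganized into dense tiles

# Generalization: treat every tile entry as a free parameter. Here one entry below
# the diagonal is changed by hand; in a trained model the entries would be learned.
dense = [[row[:] for row in block] for block in blocks]
dense[0][3][0] = 1.0
y_dense = block_conv(x, dense, b=4)
```

The cost of `block_conv` is the same whether the tiles are Toeplitz or dense, which is the sense in which the extra expressivity is free once the computation is tiled.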
Varying input streams for flexible inference
Omnii natively supports variable input compositions: at inference, any subset of annotation streams can be supplied alongside the DNA sequence. This trades modality availability for accuracy on a per-query basis, rather than committing to a single fixed input recipe.
As one concrete example, conditioning Omnii on phylogenetic conservation alongside DNA and its other tracks improves variant effect prediction across the board on the TraitGym Complex benchmark: averaged over the n=11,400 split, AUROC moves from 0.772 to 0.789 (Δ +0.018) and AUPRC from 0.368 to 0.410 (Δ +0.042), with gains concentrated in the noncoding regulatory categories where deep evolutionary preservation is a dominant signal of function.
Enabling selective long context with sparse attention
One of our key research programs is targeted at extending the context length of Omnii. The human genome is 3.2 billion base pairs distributed across 24 chromosomes. Enhancer–promoter regulatory interactions can span millions of nucleotides. Omnii supports context windows up to 2 million base pairs through sparse attention that enables efficient training and inference over these long-range interactions.
Block convolutions are well suited to picking out motifs and their higher-order local combinations. However, modeling regulatory grammar at scale demands a complementary capability: selective long-range communication that connects regulatory elements across long stretches of uninformative sequence, such as intergenic regions and repeat-dense tracts. Context-dependent sparse attention provides this, allowing only the relevant pairs of positions to exchange information.
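To make "only the relevant pairs of positions exchange information" concrete, the sketch below uses top-k selection: each query scores every key, keeps the k highest-scoring positions, and applies softmax over that subset only. Top-k is a swapped-in stand-in chosen for illustration, since the source does not specify Omnii's selection mechanism, and the vectors are toy values. The sketch is also non-causal and scores all keys densely before selecting, which a real long-context implementation would avoid.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """For each query, attend only to the k keys with the highest scaled dot product."""
    outputs, selections = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(len(q)) for key in keys]
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        top = max(scores[j] for j in keep)
        weights = [math.exp(scores[j] - top) for j in keep]  # softmax over kept keys only
        z = sum(weights)
        mixed = [sum(w / z * values[j][c] for w, j in zip(weights, keep))
                 for c in range(len(values[0]))]
        outputs.append(mixed)
        selections.append(sorted(keep))
    return outputs, selections


# One query and four keys; one-hot values make the selected positions easy to read off.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
outputs, selections = topk_sparse_attention(queries, keys, values, k=2)
```

Positions that are not selected contribute exactly zero to the output, so the per-query `selections` list is the kind of object that can be compared against annotated enhancer-gene pairs.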
At full 2 Mb context, this selectivity specializes along the genome's functional architecture rather than acting uniformly. Different heads partition by sequence class — coding exon to coding exon, conserved element to conserved element, CpG to CpG — with the sharpest heads concentrating 60–70% of their long-range selections within a single feature class (Figure 1, above).
Querying a gene's promoter, sparse attention recovers experimentally validated (CRISPRi-FlowFISH, K562) [14] enhancer–gene pairs at AUROC 0.83. The selectivity is not purely a function of distance: at matched distance, the model prefers evolutionarily conserved sequence over non-conserved (AUROC 0.66–0.76 within each distance bin; partial correlation 0.31 controlling for distance). This preference persists into the 150–300 kb regime, where conserved elements are selected roughly twice as often as non-conserved ones.
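For readers unfamiliar with the statistic, the standard first-order partial correlation between a selection score $s$ and a conservation score $c$, controlling for distance $d$, is given below. The symbols are introduced here for illustration, and the source does not state which correlation estimator (for example Pearson or rank-based) underlies the reported 0.31.

```latex
r_{sc \cdot d} \;=\; \frac{r_{sc} - r_{sd}\, r_{cd}}{\sqrt{\left(1 - r_{sd}^{2}\right)\left(1 - r_{cd}^{2}\right)}}
```

Here $r_{sc}$, $r_{sd}$, and $r_{cd}$ are the pairwise correlations. A positive $r_{sc \cdot d}$ means selection still tracks conservation after the part of each explained by distance has been removed.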
The model bridges distance to functional elements because of what is written there. At the erythroid regulator KLF1, the promoter selects a CRISPRi-validated, conserved enhancer 39 kb upstream; mutating that enhancer's sequence in silico drops the promoter's selection of the enhancer from 4.7 to 2.0 (pseudo-attention score). Every other long-range arc — including a control anchor's selection of the same element — holds steady (Figure 10).
Extending context is a necessary (but not sufficient) step toward modeling regulatory interactions at the scale of whole chromosomes and entire genomes. Closing the remaining 3-order-of-magnitude gap through new architectures, systems and post-training advances is one of our research north stars for general biological intelligence.
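The size of that gap follows from the two figures quoted in this section, a 2 million base pair context window and a 3.2 billion base pair genome:

```latex
\log_{10}\!\left(\frac{3.2 \times 10^{9}\ \text{bp}}{2 \times 10^{6}\ \text{bp}}\right) \;=\; \log_{10}(1600) \;\approx\; 3.2
```

So a full human genome is about 1,600 times longer than the current window, a little over three orders of magnitude.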
Closing
We continue to build Omnii capabilities, from longer context to more input modalities and post-training.
Building Omnii requires co-design across AI and computational biology: training systems and architectures for multimodal pretraining at scale, deliberate choices over which biological tracks to fuse, post-training recipes to close the gap from base model to application, and evaluations grounded in evolution, regulation, and disease.
If you're looking to help build general biological intelligence, we'd like to hear from you.
References
[1] Nguyen, E., Poli, M., et al. "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." Advances in Neural Information Processing Systems (NeurIPS) 36 (2023).
[2] Nguyen, E., Poli, M., et al. "Sequence modeling and design from molecular to genome scale with Evo." Science 386(6723), eado9336 (2024).
[3] Brixi, G., Durrant, M.G., et al. "Genome modelling and design across all domains of life with Evo 2." Nature (2026).
[4] Linder, J., Srivastava, D., Yuan, H., et al. "Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation." Nature Genetics 57, 949–961 (2025).
[5] Kircher, M., Witten, D.M., Jain, P., et al. "A general framework for estimating the relative pathogenicity of human genetic variants." Nature Genetics 46, 310–315 (2014).
[6] Amodei, D. "Machines of Loving Grace: How AI Could Transform the World for the Better." (2024).
[7] Goodfire. "EVEE: Interpretable variant effect prediction from genomic foundation model embeddings." bioRxiv (2026).
[8] Kleinert, P. & Kircher, M. "CADD-SV — a framework to score the effects of structural variants in humans and other species." Bioinformatics 38(6), 1639–1641 (2022).
[9] Ditzler, M.A., et al. "High-throughput sequence analysis reveals structural diversity and improved potency among RNA inhibitors of HIV reverse transcriptase." Nucleic Acids Research 41(3), 1873–1884 (2013).
[10] Ku, J., et al. "Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale." arXiv:2503.01868 (2025).
[11] Benegas, G., Albors, C., Aw, A.J., Ye, C., Song, Y.S. "A DNA language model benchmark for the prediction of causal disease variants." bioRxiv (2025).
[12] Benegas, G., Albors, C., Aw, A.J., Ye, C., Song, Y.S. "A DNA language model based on multispecies alignment predicts the effects of genome-wide variants." Nature Biotechnology (2025).
[13] Landrum, M.J., Lee, J.M., Benson, M., et al. "ClinVar: improving access to variant interpretations and supporting evidence." Nucleic Acids Research 46(D1), D1062–D1067 (2018).
[14] Nasser, J., Bergman, D.T., Fulco, C.P., et al. "Genome-wide enhancer maps link risk variants to target genes." Nature 593, 238–243 (2021).
[15] Moore, J.E., Purcaro, M.J., Pratt, H.E., et al. "Expanded encyclopaedias of DNA elements in the human and mouse genomes." Nature 583, 699–710 (2020).
[16] Lee, C.-Y., Ravi, A., et al. "Regulatory landscape of Alzheimer's disease variants in human microglia." medRxiv (2025).
[17] Radical Numerics. "A new era of biological threats, and the technology to defend against them." (2026).