Large genome model: Open source AI trained on trillions of bases



Late in 2025, we covered the development of an AI system called Evo that was trained on massive numbers of bacterial genomes. It saw so many that, when prompted with sequences from a cluster of related genes, it could correctly identify the next one or suggest a completely novel protein.

That system worked because bacteria tend to cluster related genes together—something that’s not true in organisms with complex cells, which tend to have equally complex genome structures. Given that, our coverage noted, “It’s not clear that this approach will work with more complex genomes.”

Apparently, the team behind Evo viewed that as a challenge, because today it is describing Evo 2, an open source AI that has been trained on genomes from all three domains of life (bacteria, archaea, and eukaryotes). After training on trillions of base pairs of DNA, Evo 2 developed internal representations of key features in even complex genomes like ours, including things like regulatory DNA and splice sites, which can be challenging for humans to spot.

Genome features

Bacterial genomes are organized along relatively straightforward principles. Any genes that encode proteins or RNAs are contiguous, with no interruptions in the coding sequence. Genes that perform related functions, like metabolizing a sugar or producing an amino acid, tend to be clustered together, allowing them to be controlled by a single, compact regulatory system. It’s all straightforward and efficient.

Eukaryotes are not like that. The coding sections of genes are interrupted by introns, which don’t encode anything. Genes are regulated by sequences that can be scattered across hundreds of thousands of base pairs. The sequences that define the edges of introns or the binding sites of regulatory proteins are all weakly defined: while a few positions absolutely require a specific base, many others just have an above-average tendency toward one (something like “45 percent of the time it’s a T”). Surrounding all of this in most eukaryotic genomes is a huge amount of DNA that has been termed junk: inactive viruses, terminally damaged genes, and so on.
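To make "weakly defined" concrete, here is a minimal Python sketch of a position-frequency scoring scheme, a standard way of describing this kind of site. It is an illustration only, not how Evo 2 works. The four-position motif, its frequencies (including the 45 percent T), and the function names are all invented for the example: the first two positions are required outright, while the last two are merely biased.

```python
import math

# Hypothetical motif: per-position base frequencies, invented for illustration.
# Positions 0 and 1 are absolutely required (G, then T).
# Positions 2 and 3 only lean toward one base ("45 percent of the time it's a T").
MOTIF = [
    {"A": 0.00, "C": 0.00, "G": 1.00, "T": 0.00},
    {"A": 0.00, "C": 0.00, "G": 0.00, "T": 1.00},
    {"A": 0.25, "C": 0.15, "G": 0.15, "T": 0.45},
    {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
]

# Assumed background: all four bases equally likely.
BACKGROUND = 0.25


def log_odds_score(site):
    """Sum of log2(motif frequency / background) over each position of the site.

    Returns -inf if a required base is missing (frequency 0 at that position).
    """
    if len(site) != len(MOTIF):
        raise ValueError("site must be the same length as the motif")
    score = 0.0
    for base, freqs in zip(site.upper(), MOTIF):
        p = freqs.get(base, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log2(p / BACKGROUND)
    return score


def scan(sequence):
    """Return (start, score) for every window that has all the required bases."""
    width = len(MOTIF)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = log_odds_score(sequence[start:start + width])
        if score != float("-inf"):
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    for start, score in scan("ACGTTAGG"):
        print(start, round(score, 2))
```

In this toy setup, a window like GTCC still counts as a candidate because it has both required bases; it just scores lower than GTTA. That is the difficulty in miniature: with so few hard requirements, many stretches of sequence look at least somewhat like a real site.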


