Five years ago, I started learning about this AI thing and how to use it. I got especially excited about the idea that we can basically reformat any kind of data into something that we already have a pretty good neural network trained on. At that time, I was interested in methods that could be used to do DNA-based identification in the field. For example, we could extract DNA and sequence just a small part of a whole genome to quickly identify what something is (like a bug or a leaf). And that’s how a small side project turned into something that can identify all the species with publicly available whole-genome DNA data.
varKoder workflow, adapted from de Medeiros et al. (2025)
I encoded a really small amount of genome sequence from each sample as an image (random sequences amounting to something on the order of 1% of the whole genome). I then applied some tricks to train image classification models to infer taxonomy from those images. It works much better than widely used DNA-based species identification methods at every taxonomic level, from domains of life down to subspecies (including things that are mixtures of organisms, such as lichens). We can also tell where a sample of environmental DNA came from using this method. Finally, it also works well with degraded DNA, so our collections could be great for creating training sets for the vast amounts of diversity that currently cannot be identified based on DNA. The AI model has been trained on all of the taxa that had publicly available data online as of two years ago, when I first submitted the paper, so it is ready to use in practice. And it can run on laptops using data from portable DNA sequencers, so there is no need for a fancy setup.
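To give a feel for what "DNA as an image" can mean, here is a minimal Python sketch. It is not the varKoder implementation: the k-mer counting, the pixel layout, and the names `kmer_counts`, `kmer_to_pixel`, and `reads_to_image` are all assumptions made up for illustration. The idea is just to count short DNA words (k-mers) in a handful of reads and give every possible k-mer its own fixed pixel, so any sample becomes a small grayscale picture.

```python
from collections import Counter

# Hypothetical layout: each base picks one bit for the column (x) and one for the row (y).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_counts(reads, k):
    """Count every k-mer made only of A, C, G, T across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if all(base in CORNERS for base in kmer):
                counts[kmer] += 1
    return counts


def kmer_to_pixel(kmer):
    """Map a k-mer to a unique (row, col) in a 2**k x 2**k grid."""
    row = col = 0
    for base in kmer:
        x, y = CORNERS[base]
        row = (row << 1) | y
        col = (col << 1) | x
    return row, col


def reads_to_image(reads, k=3):
    """Return a 2**k x 2**k grid of k-mer counts scaled to 0..255."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for kmer, n in kmer_counts(reads, k).items():
        row, col = kmer_to_pixel(kmer)
        grid[row][col] = n
    # Scale by the largest count; "or 1" avoids dividing by zero on empty input.
    peak = max(max(r) for r in grid) or 1
    return [[round(255 * v / peak) for v in r] for r in grid]


# Toy reads, invented for this example.
toy_reads = ["ACGTACGTAC", "TTGACCA", "ACGNNACG"]
image = reads_to_image(toy_reads, k=2)
for row in image:
    print(row)
```

Because the pixel for each k-mer is always in the same place, images from different samples line up and can be fed to an ordinary image classifier. A real pipeline would use far more reads and would likely pick the word length and pixel scaling differently.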
In addition to practical use, this paper shows us that there is something that varies between genomes that can be captured by this random DNA sequencing: we are not comparing the same genes across species, and we are not comparing whole genomes. The differences may be very subtle, and we don’t know for sure what they are. They are probably related to the repeated parts of the genomes. In any case, most of our genomics research relies on comparing the same genes across species, so this is a hint that we should look outside this box.
The method and the data are described in the following publications:
Asprino, R.C., Cai, L., Yan, Y., …, de Medeiros, B.A.S. A curated benchmark dataset for molecular identification based on genome skimming. Sci Data 12, 906 (2025). https://doi.org/10.1038/s41597-025-05230-2
de Medeiros, B.A.S., Cai, L., Flynn, P.J. et al. A composite universal DNA signature for the tree of life. Nat Ecol Evol (2025). https://doi.org/10.1038/s41559-025-02752-1
And the software is available here:
https://github.com/brunoasm/varKoder