HyenaDNA: learning from DNA with 1 Million token context
date: 2023-10-09

We're *excited* to introduce HyenaDNA, a genomic foundation model pretrained on sequences up to 1 million tokens long at single-nucleotide resolution.

We show that as context length increases, performance improves, as measured by lower perplexity.
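To make the metric concrete, here is a minimal sketch (not from the HyenaDNA codebase) of how perplexity is computed for next-nucleotide prediction: it is the exponential of the mean cross-entropy, so lower values mean the model assigns higher probability to the true sequence. The random `logits` tensor is just a placeholder for the outputs of any causal DNA language model.

```python
import math
import torch
import torch.nn.functional as F

# Toy example: single-character tokens over the 4-letter DNA vocabulary.
vocab = {"A": 0, "C": 1, "G": 2, "T": 3}
seq = "ACGTACGGTTAC"
tokens = torch.tensor([vocab[c] for c in seq])      # one token per nucleotide
logits = torch.randn(len(tokens) - 1, len(vocab))   # placeholder model outputs
loss = F.cross_entropy(logits, tokens[1:])          # next-token cross-entropy
print("perplexity:", math.exp(loss.item()))         # exp(mean cross-entropy)
```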

HyenaDNA trains up to 160x faster than Transformers with FlashAttention, uses a single-character tokenizer, and has global context at each layer.
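To illustrate what single-character tokenization means in practice, here is a minimal sketch assuming a vocabulary of the four bases plus an "N" for ambiguous bases; the released tokenizer may include additional special tokens.

```python
# Single-character (nucleotide-level) tokenizer sketch: no k-mers, no BPE merges.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # "N" covers ambiguous bases

def encode(dna: str) -> list[int]:
    """Map each nucleotide to its own token id."""
    return [VOCAB.get(base, VOCAB["N"]) for base in dna.upper()]

def decode(ids: list[int]) -> str:
    inv = {i: b for b, i in VOCAB.items()}
    return "".join(inv[i] for i in ids)

print(encode("acgtN"))    # [0, 1, 2, 3, 4]
print(decode([0, 1, 2]))  # "ACG"
```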

We apply HyenaDNA to 28 downstream genomic tasks (achieving SotA on 23), using far fewer parameters than previous genomic models, with examples that fit in a Colab notebook.
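For a sense of what those examples look like, the sketch below loads a pretrained checkpoint for sequence classification through Hugging Face transformers. The checkpoint id `LongSafari/hyenadna-tiny-1k-seqlen-hf` and the two-label head are assumptions for illustration; see the linked post and repo for the actual checkpoints and fine-tuning code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint id for illustration; substitute the checkpoint you want.
name = "LongSafari/hyenadna-tiny-1k-seqlen-hf"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2, trust_remote_code=True  # 2-class head, e.g. enhancer vs. not
)

inputs = tokenizer("ACGTACGTACGT", return_tensors="pt")  # one token per nucleotide
with torch.no_grad():
    logits = model(**inputs).logits  # class scores (random before fine-tuning)
print(logits.shape)
```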

There has been amazing work applying foundation models (aka LLMs) to genomics [DNABERT, Nucleotide Transformer, GenSLMs, GENA-LM], modeling DNA as the “language” of life. Unfortunately, these works have been limited by the quadratic scaling of attention in Transformers, and thus far have typically used context lengths between 512 and 4k tokens, depending on whether dense or sparse attention is used. That’s less than 0.001% of the length of the human genome. (We also noticed that, compared to protein models, which work with fairly "short-range" sequences, foundation models for genomics are far less established.)

 

https://hazyresearch.stanford.edu/blog/2023-06-29-hyena-dna