Natural-language processing (NLP) algorithms are now able to generate protein sequences and predict virus mutations. The key insight making this possible is that many properties of biological systems can be interpreted in terms of words and sentences. “We’re learning the language of evolution,” says Bonnie Berger, a computational biologist at the Massachusetts Institute of Technology. “It’s a neat paper, building off the momentum of previous work,” says Ali Madani, a scientist at Salesforce, who is using NLP to predict protein sequences.
MIT researchers trained a language model on thousands of genetic sequences taken from three different viruses. The aim is to identify mutations that might let a virus escape an immune system without making it less infectious. NLP models work by encoding words in a mathematical space in such a way that words with similar meanings are closer together than words with different meanings. Applied to genetic sequences, the model grouped viruses according to how similar their mutations were. The model is based on an LSTM, a type of neural network that predates the transformer-based ones used by large language models.
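The idea of meaning as position in a mathematical space can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors are invented rather than learned by an LSTM, the names `original`, `mutant_a` and `mutant_b` are hypothetical, and cosine distance is just one common way to measure how far apart two embeddings are, not necessarily the measure the researchers used.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
# A real model would learn these vectors from thousands of sequences.
embeddings = {
    "original": [0.90, 0.10, 0.20],
    "mutant_a": [0.88, 0.12, 0.21],  # lands close to the original
    "mutant_b": [0.10, 0.90, 0.30],  # lands far from the original
}


def semantic_change(name, reference="original"):
    """Cosine distance from the reference: larger means a bigger shift in 'meaning'."""
    return 1.0 - cosine_similarity(embeddings[name], embeddings[reference])


# Rank the hypothetical mutants from largest to smallest change.
ranked = sorted(["mutant_a", "mutant_b"], key=semantic_change, reverse=True)

for name in ranked:
    print(name, round(semantic_change(name), 3))
```

In this toy setup, the mutant whose vector sits far from the original is ranked first. A sequence that changes "meaning" this much while still reading as a valid sequence is the kind of candidate the article describes as a possible escape mutation.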
The machine-learning tool can predict mutations in HIV and coronavirus strains better than other models. Knowing what mutations might be coming could make it easier for hospitals and public health authorities to plan ahead. Still, this work is more about breaking new ground than making a real impact on public health, at least for now, say the researchers. They have found a high potential for immune escape in all of the new strains they’ve tested, although this hasn’t yet been tested in the wild. But the model did miss a change in the South Africa variant that has raised concerns because it may allow the virus to escape vaccines.
The NLP model predicts potential mutations straight away, which focuses the lab work and speeds it up. Treating genetic mutations as changes in meaning could be applied in different ways across biology. Researchers are watching advances in NLP and thinking up new analogies between language and biology to take advantage of them. “I think biology is on the cusp of a revolution,” says Madani. “There’s a lot of creative ways we can start interpreting language models,” says Berger.
AIs that read sentences are now catching coronavirus mutations