Ben Krause, Subu Subramanian, Tom Yuan, Marisa Yang, Aaron Sato, Nikhil Naik

Introduction

Large-scale language models (LMs) (Radford et al., 2018; 2019; Brown et al., 2020), initially developed for tackling challenges in natural language processing, have revolutionized many scientific disciplines by providing unprecedented capabilities for solving complex problems. Notably, ProGen (Madani et al., 2020), a language model trained on publicly available protein sequences, successfully generated sequences of the bacteriolytic enzyme lysozyme that were very distant in sequence space from any lysozyme found in nature (Madani et al., 2023). More recently, Hie et al. (2023) harnessed pre-trained language models to help guide antibody optimization. The search space of candidate proteins for any given application can be overwhelmingly vast, and these successes show that language models can significantly narrow it.

While large language models trained on universal datasets can imitate the distribution of sequences in their training set, it is highly desirable to further train them to exceed the average quality of that data. Furthermore, the distribution of sequences needed for downstream applications may differ significantly from what the model learned during pretraining. Real-world applications of large-scale language models have shown remarkable performance improvements from fine-tuning with human feedback (Ziegler et al., 2019), wherein the model is further trained on high-quality task-specific responses or with reinforcement learning. For example, InstructGPT (Ouyang et al., 2022) used a combination of these two approaches to greatly improve the quality and adaptability of language model responses, making them useful for real-world applications.
