Fourthly, due to time constraints and limited computational resources, the current model has a parameter count of approximately 110M. Despite its modest size, the model performs well on the mutation binding free energy prediction task (Pearson correlation coefficient: 0.77). Moreover, we propose a novel method to analyze the relationship between attention weights and contact maps of subsequence pairs in tertiary structures, which enhances the interpretability of BERT2DAb. Overall, our model demonstrates strong potential for improving antibody screening and design through downstream applications.

KEYWORDS: Affinity prediction, antibody screening, antibody sequences, secondary structure, pretrained language model

== Intro ==

The high specificity and strong neutralizing ability of antibodies have made them highly valued in the prevention and treatment of diseases such as tumors and viral infections.1–3 Compared to standard antibody discovery approaches, computer-based screening methods offer an encouraging alternative in terms of efficiency and cost-effectiveness.4–6 Computer-aided screening of neutralizing antibodies typically entails several steps, including structural modeling, affinity prediction, and developability assessment, and this process usually requires multiple iterations.7,8 Compared to using structures as inputs, using sequences significantly reduces computational requirements and generally lowers the data acquisition hurdles for accomplishing the aforementioned tasks.
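The attention-versus-contact analysis proposed above can be illustrated with a minimal sketch: score every sufficiently separated residue pair by its (symmetrized) attention weight and check how many of the top-scoring pairs are true contacts in the tertiary structure. The helper name `top_k_contact_precision` and the toy matrices are illustrative assumptions, not BERT2DAb's actual analysis code.

```python
import random

def top_k_contact_precision(attention, contacts, k, min_sep=6):
    """Fraction of the k highest-attention residue pairs that are true
    contacts; pairs closer than min_sep in sequence are excluded, as is
    standard in contact-prediction evaluation."""
    n = len(attention)
    scored = []
    for i in range(n):
        for j in range(i + min_sep, n):
            # symmetrize, since attention matrices need not be symmetric
            score = (attention[i][j] + attention[j][i]) / 2.0
            scored.append((score, i, j))
    scored.sort(reverse=True)
    top = scored[:k]
    hits = sum(contacts[i][j] for _, i, j in top)
    return hits / len(top)

# toy example: 20 residues with low background attention; we plant high
# attention on three pairs and mark exactly those pairs as contacts
random.seed(0)
n = 20
attention = [[random.random() * 0.1 for _ in range(n)] for _ in range(n)]
contacts = [[0] * n for _ in range(n)]
for i, j in [(0, 10), (2, 15), (5, 19)]:
    attention[i][j] = 1.0
    contacts[i][j] = contacts[j][i] = 1

print(top_k_contact_precision(attention, contacts, k=3))  # -> 1.0
```

A high precision on real data would indicate that heads attend to structurally contacting fragments, which is one way to probe what a self-attention model has learned about tertiary structure.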
Leveraging the extensive collection of antibody sequences generated by the sequencing efforts of other researchers, we can pre-train an antibody language model (ALM) to subsequently enable downstream tasks related to antibody screening based on sequence information.9–11 Within the realm of protein sequences, several protein language models (PLMs) have been developed that leverage general protein datasets (e.g., UniProtKB and BFD) to learn common sequence features and transfer these features to specific tasks such as biophysical property prediction.12–14 ESM2, ProtTrans, and UniRep are examples of PLMs.13,15,16 Among these models, the first two are based on self-attention and include multiple models with varying parameter sizes, while the last is a single model based on LSTM. However, the evolution of antibodies relies on gene rearrangement and antigen-induced somatic hypermutation, which is fundamentally distinct from the evolution of ordinary proteins (e.g., enzymes, structural proteins, transport proteins).17 Consequently, antibody sequences may possess unique features, and using PLMs to capture characteristics of antibody sequences may be inappropriate. AbLang, PARA, and AntiBERTy are all ALMs trained on antibody sequence datasets using self-attention.18–20 While these language models have achieved promising performance in tasks such as structure modeling, there is still room for further optimization.21 First, with the exception of AbLang, all of these language models use a single model to embed both light and heavy chains. However, significant differences exist between the two chains in terms of sequencing data availability (which affects the amount of available training data), physicochemical properties, and gene expression characteristics (supplementary Fig. S1). As a result, using the same model to learn and represent sequences of both chains may dilute features specific to the light chain.
Second, all of the aforementioned language models use amino acids as the fundamental embedding unit for pre-training. This approach may not effectively capture the collective impact of secondary structures on the spatial structure and functionality of antibodies. Additionally, previous studies have revealed the presence of linear motifs of varying lengths within protein sequences, which essentially represent conserved fragments.22 Drawing on the frequency-based vocabulary construction methods used in natural language processing, Asgari and colleagues developed a vocabulary of conserved protein sequence fragments that has been applied to tasks such as motif recognition, toxicity prediction, subcellular localization prediction, and protease prediction, with promising performance outcomes reported.23

The secondary structure of antibodies is closely related to antibody-specific antigen recognition and stability.24,25 Firstly, the complementarity determining regions (CDRs), especially CDR H3, are the most variable regions in antibodies. The secondary structure of these regions is related to the positioning and overall conformation of the antibody binding site, and their variation affects the orientation and exposure of the binding site.
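The frequency-based vocabulary construction of conserved fragments mentioned above is, in spirit, byte-pair-encoding-style merging applied to amino-acid strings: the most frequent adjacent symbol pair is repeatedly fused into a new multi-residue token. The sketch below is a generic illustration of that idea under toy data, not the actual vocabulary or procedure of Asgari and colleagues.

```python
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    """Greedily merge the most frequent adjacent symbol pair, building a
    small vocabulary of conserved multi-residue fragments (BPE-style)."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # apply the merge everywhere in the corpus
        for toks in corpus:
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

# toy CDR-like fragments sharing the conserved prefix "CAR"
seqs = ["CARDYW", "CARGGW", "CARDAW"]
merges, tokenized = learn_bpe_merges(seqs, num_merges=2)
print(merges)        # -> ['CA', 'CAR']
print(tokenized[0])  # -> ['CAR', 'D', 'Y', 'W']
```

Tokenizing sequences into such conserved fragments, rather than single amino acids, is one way a language model's embedding units can reflect recurring structural motifs.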