Development Of Theoretical Methods For Studying Biological Macromolecules
National Heart, Lung, And Blood Institute
Abstract
Several projects have been pursued in the reporting period:

Gaussian electrostatic IPS potential for molecular simulation
In molecular simulation, electrostatic interactions are conventionally described by the Coulomb potential between point charges, and many force fields have been developed on this basis. However, the point-charge approximation has no physical basis at close distances and causes numerical difficulties in molecular simulation, for example in free energy calculations. Assuming that charges are distributed as Gaussian functions better reflects atomic structure and allows better study of physical phenomena. We derived isotropic periodic sum (IPS) potentials for Gaussian electrostatic interactions, as well as for the multipole interactions derived from Gaussian electrostatics. These IPS potentials simplify Gaussian electrostatics calculations and allow polarizable force fields to be handled conveniently. The Gaussian electrostatic IPS potential lays the foundation for a new type of force field.

Force field development with Gaussian electrostatics and double exponential vdW potential
A double exponential (DE) functional form for van der Waals (vdW) interactions, proposed in our previous study, has many advantages over other vdW potential functions such as the Lennard-Jones, exp-6, and 14-7 potentials. The DE potential removes the singularity of existing vdW potentials and is convenient for alchemical free energy calculations. Based on Gaussian electrostatics and the DE van der Waals interaction, we are developing a new type of force field that completely removes singularities at zero distance and mimics existing force fields at regular distances. This new type of force field will enable a better description of molecular systems.

Leveraging Induced Polarization for Drug Discovery: Efficient IC50 Prediction Using Minimal Features
Here, we use the frequency of the atomic hybridizations (s, sp, sp2, and sp3) of each atom type (H, C, N, O, S, etc.)
within a molecule to predict the IC50 values of drug-like molecules, focusing on compounds targeting the thrombin, estrogen receptor alpha, and phosphodiesterase 5A proteins. The neural network and random forest models yield high correlation coefficients (R2) and low mean squared error (MSE) using only 19 features. The atomic hybridizations have been used previously to calculate the molecular polarizability with a simple empirical model (Miller et al. JACS 1979). We show that the atomic hybridizations can also be used to accurately predict the molecular polarizabilities of these molecules. The results show the importance of induced polarization in protein-ligand binding. Furthermore, the variation in R2 and MSE across the target proteins indicates that the contribution of induced polarization to the binding energy differs from one target protein to another.

Development, Optimization, and Implementation of the Gauss-Legendre-Spherical-t Algorithm
Evaluating pairwise electrostatic interactions is the most time-consuming part of molecular dynamics (MD) simulations because of the slow decay of the Coulomb operator. The Gauss-Legendre-Spherical-t (GLST) algorithm speeds up the calculation by factoring the Coulomb operator into short- and long-range terms: the short-range term is handled by computing only the pairwise interactions within a cutoff radius, and the long-range term is approximated using novel numerical quadrature techniques. The GLST algorithm's primary advantage over traditional methods is that it does not use the fast Fourier transform (FFT), which requires multiple "all-to-all" communications on multi-node systems. Avoiding the FFT makes the GLST algorithm well suited to implementation on large CPU clusters and multi-GPU systems.
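As an illustration of the short-/long-range factoring described above, the sketch below uses the standard Ewald-style split 1/r = erfc(alpha*r)/r + erf(alpha*r)/r and approximates the smooth long-range term with Gauss-Legendre quadrature through the identity erf(alpha*r)/r = (2/sqrt(pi)) * integral from 0 to alpha of exp(-(r*t)^2) dt. This is only a minimal one-dimensional sketch under stated assumptions, not the GLST implementation: the splitting parameter `alpha`, the node count `n`, and all function names are hypothetical, and the remaining parts of the algorithm are not shown.

```python
import math

def gauss_legendre(n):
    """Return nodes and weights of n-point Gauss-Legendre quadrature on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Standard initial guess for the i-th root of the Legendre polynomial P_n
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            # Evaluate P_n(x) and P_{n-1}(x) with the three-term recurrence
            p_prev, p = 1.0, x
            for k in range(2, n + 1):
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = n * (x * p - p_prev) / (x * x - 1.0)  # derivative P_n'(x)
            dx = p / dp
            x -= dx                                     # Newton step
            if abs(dx) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def long_range_quadrature(r, alpha, n=12):
    """Approximate erf(alpha*r)/r by quadrature over t in [0, alpha]."""
    nodes, weights = gauss_legendre(n)
    total = 0.0
    for x, w in zip(nodes, weights):
        t = 0.5 * alpha * (x + 1.0)  # map [-1, 1] onto [0, alpha]
        total += w * math.exp(-(r * t) ** 2)
    return (2.0 / math.sqrt(math.pi)) * 0.5 * alpha * total

def short_range(r, alpha):
    """Rapidly decaying term, evaluated only for pairs within the cutoff."""
    return math.erfc(alpha * r) / r

if __name__ == "__main__":
    alpha = 0.35  # hypothetical splitting parameter
    for r in (0.5, 2.0, 8.0):
        total = short_range(r, alpha) + long_range_quadrature(r, alpha)
        print(f"r={r:4.1f}  split sum={total:.10f}  exact 1/r={1.0 / r:.10f}")
```

Each quadrature node contributes a Gaussian in r, which is separable across Cartesian coordinates; that separability is what makes quadrature-based long-range schemes attractive as an alternative to an FFT.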
We have developed parallel implementations of the algorithm and are now implementing periodic boundary conditions and further optimizations so that the method can be used to run efficient MD simulations on multiple GPUs.

Development of a modular parameter set for performing MD simulations with Amber force fields of systems containing modified nucleic acids
Over the past several years, interest in modified nucleic acids as a means of developing novel therapeutics has surged, highlighting the importance of nucleic acid research in drug discovery and development. Molecular dynamics simulations can accelerate the design and development of systems containing modified nucleic acids by providing atomic-level detail of their structure and dynamics. However, because there are hundreds of potential modifications, modeling these modified residues is challenging: each modification (or combination of modifications) takes time and resources to parameterize. To address this problem, we have helped to develop modXNA, a tool for both building modified nucleotides and automatically generating the parameters required to use them with Amber force fields. Several nucleic acid systems varying in size and number of modification sites were used to evaluate the accuracy of the modXNA parameters, with all results showing that the dynamics and structure of the systems containing modified residues were preserved throughout the simulations. We have also developed a protocol for quantum mechanical charge derivation for the modified residues, as well as a workflow for using modXNA in Amber molecular dynamics simulations.

Development of a pair list-based geometric calculation method for the study of solute-solvent and solvent-solvent hydrogen bonds
Hydrogen bonding within biological systems plays a large role in determining a system's overall structure and stability. Just as important, however, are the hydrogen bonds that form between solvent (i.e.
water) and the biological macromolecule (i.e. the solute). While these can be determined from molecular dynamics (MD) trajectory data using an energy-based approach (e.g. in Kabsch & Sander's DSSP method), the simplest and most common approach is to apply a distance cutoff between the donor and acceptor heavy atoms together with a cutoff on the angle formed by the donor heavy atom, the hydrogen atom, and the acceptor atom. For small systems where only a limited number of potential hydrogen bonds are possible (e.g. the solute-solute hydrogen bonds in a 200 amino acid protein), a straightforward approach is viable and takes very little time compared to the time taken to read in a trajectory frame. However, for larger systems and/or systems where many more hydrogen bonds need to be considered, such as when determining the hydrogen bonds between all solvent molecules and all solute atoms, or among the solvent molecules themselves, a straightforward approach quickly becomes too slow (water molecules can make up as much as 90% of the atoms of a solvated system). The calculation becomes particularly slow if distance imaging is required. Pair lists are used in MD simulations to speed up the non-bonded part of the force calculation by dividing the system into a 3D grid and calculating short-range forces only between atoms that are spatially near each other. As a bonus, distance imaging is inherent to how modern pair list algorithms are set up, so distances are imaged at no additional cost. We have adapted this idea to the calculation of hydrogen bonds in the MD analysis program CPPTRAJ and show that in some cases the speedup over a straightforward approach can be as much as 42x when distance imaging is involved. This development will allow better determination of hydrogen bonding in solvated systems and will also facilitate the study of the structure of water-water networks.
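The geometric criterion and the grid-based pair search described above can be sketched as follows. This is a minimal illustration, not the CPPTRAJ implementation: it assumes an orthorhombic periodic box, a hypothetical 3.0 Angstrom donor-acceptor distance cutoff and 135 degree donor-hydrogen-acceptor angle cutoff, and function names invented for this example.

```python
import math
from collections import defaultdict
from itertools import product

def minimum_image(d, box):
    """Wrap a displacement vector to its nearest periodic image (orthorhombic box)."""
    return tuple(x - L * round(x / L) for x, L in zip(d, box))

def angle_deg(donor, hydrogen, acceptor, box):
    """Donor-hydrogen-acceptor angle in degrees, using imaged displacements."""
    hd = minimum_image(tuple(a - b for a, b in zip(donor, hydrogen)), box)
    ha = minimum_image(tuple(a - b for a, b in zip(acceptor, hydrogen)), box)
    dot = sum(a * b for a, b in zip(hd, ha))
    norm = math.sqrt(sum(a * a for a in hd)) * math.sqrt(sum(a * a for a in ha))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def find_hbonds(donors, acceptors, box, dist_cut=3.0, angle_cut=135.0):
    """donors: list of (heavy_xyz, hydrogen_xyz); acceptors: list of xyz.
    Returns sorted (donor_index, acceptor_index) pairs meeting both cutoffs."""
    # Cells per axis, chosen so that every cell edge is at least dist_cut long
    ncell = [max(1, int(L // dist_cut)) for L in box]

    def cell_of(p):
        return tuple(int((x % L) / L * n) % n for x, L, n in zip(p, box, ncell))

    grid = defaultdict(list)
    for j, a in enumerate(acceptors):
        grid[cell_of(a)].append(j)

    found = set()  # a set avoids duplicates when a small grid wraps onto itself
    for i, (heavy, hyd) in enumerate(donors):
        c = cell_of(heavy)
        # Visit the donor's cell and its 26 neighbors, wrapping across box edges
        for off in product((-1, 0, 1), repeat=3):
            nb = tuple((ci + o) % n for ci, o, n in zip(c, off, ncell))
            for j in grid.get(nb, ()):
                d = minimum_image(tuple(a - b for a, b in zip(acceptors[j], heavy)), box)
                if math.sqrt(sum(x * x for x in d)) > dist_cut:
                    continue
                if angle_deg(heavy, hyd, acceptors[j], box) >= angle_cut:
                    found.add((i, j))
    return sorted(found)

if __name__ == "__main__":
    box = (20.0, 20.0, 20.0)
    # Donor heavy atom near the +x face, hydrogen pointing across the boundary
    donors = [((19.5, 5.0, 5.0), (20.4, 5.0, 5.0))]
    acceptors = [(2.2, 5.0, 5.0), (10.0, 10.0, 10.0)]
    print(find_hbonds(donors, acceptors, box))  # prints [(0, 0)]
```

Because only neighboring cells are searched and every displacement is imaged as it is computed, the cost grows roughly with the number of nearby pairs rather than with all donor-acceptor combinations.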

PKAD-R: curated, redesigned and expanded database of experimental pKa values in proteins
Understanding the pKa values of ionizable protein residues is critical for understanding fundamental protein properties such as structure, function, and interactions. We present a new version of PKAD, named PKAD-R, which is a curated database of experimentally determined protein pKa values. The database builds upon its predecessors, PKAD and PKAD-2, with significant updates and improvements through: (1) careful data curation to remove incorrect entries and consolidate redundant entries by offering alternative structures and pKa values for each unique residue, (2) database redesign to enhance usability by adding information such as protein and species names, detailed notes, and sequence identity, and (3) database expansion through identification of 214 new (128 nonredundant) pKa entries from the literature. The database currently contains 877 unique pKa entries for wild-type structures and 147 for mutant structures, and we aim to keep updating it with new entries. The PKAD-R database is available as a stand-alone downloadable file as well as via web servers. The database is designed to provide a set of pKa entries for unique residues suitable for machine learning (ML) applications, and also to offer modularity by providing alternative pKa values and structures, allowing the user to decide which entries to include.

Identifying important proteins for expanding the pKa database
Determining pKa values is essential across many research fields, yet experimental measurements remain costly and time-intensive. In this work, we propose optimal protein candidates for pKa measurements by clustering ionizable residues based on their local environments and selecting representative candidates from large but underrepresented clusters.
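One simple way to picture the clustering step, assumed here purely for illustration, is to link residues whose pairwise structural similarity exceeds a threshold, take the connected components of the resulting graph with a depth-first search, and keep the large clusters in which few members have a measured pKa. The residue labels, similarity scores, the 0.7 similarity threshold, the 0.2 coverage limit, and all function names below are hypothetical.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes into clusters using an iterative depth-first search."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(members))
    return clusters

def underrepresented(clusters, measured, max_coverage=0.2):
    """Return clusters, largest first, in which the fraction of residues
    with a measured pKa is at most max_coverage."""
    picked = [c for c in clusters
              if sum(r in measured for r in c) / len(c) <= max_coverage]
    return sorted(picked, key=len, reverse=True)

if __name__ == "__main__":
    # Hypothetical residue labels and pairwise similarity scores
    similarity = {("A:ASP10", "B:ASP44"): 0.91, ("B:ASP44", "C:ASP7"): 0.83,
                  ("D:LYS3", "E:LYS90"): 0.88, ("A:ASP10", "D:LYS3"): 0.12}
    residues = sorted({r for pair in similarity for r in pair})
    edges = [pair for pair, score in similarity.items() if score >= 0.7]
    clusters = connected_components(residues, edges)
    measured = {"D:LYS3"}  # residues that already have an experimental pKa
    print(underrepresented(clusters, measured))
```

In this toy input the three aspartates form one cluster with no measured pKa, so that cluster is the one reported as a candidate source for new measurements.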
We developed a graph-based structural similarity algorithm adapted from ProBiS and tailored specifically to ionizable residues by explicitly including ionizable groups as functional groups and optimizing the radius cutoffs for different groups. We further improved accuracy and robustness by incorporating inverse distance weighting and by constraining the target ionizable residue to be part of the maximum common subgraph. The latter guarantees a fixed configurational relationship between the target site and any matched motif. Validation on the PKAD dataset confirmed a strong correlation between structural similarity and pKa similarity. With the algorithm fully optimized, we are now performing all-to-all comparisons of deeply buried ionizable residues (%SASA < 15%) from the Protein Data Bank to construct a similarity-based residue network. Before identifying clusters in this network, we tested a wide range of community detection algorithms, including the Louvain algorithm, machine learning embedding and clustering methods, and connected-component detection by depth-first search (DFS), all of which yielded reasonable and comparable results. Using DFS, we identify clusters of structurally similar environments and assess which are underrepresented in the current pKa database. Protein candidates for pKa experiments are then prioritized based on their frequency within the 20 largest underrepresented clusters. Studying these unexplored proteins will deepen theoretical understanding of protein pKa values and improve computational prediction models through more structurally diverse and informative data.

Importance of Ion Channel Selectivity in Action Potential Formation
While it is well established that ion channels exhibit selectivity, the evolutionary significance of selectivity remains unclear.
To address this, we performed numerical simulations of the depolarization phase of the action potential based on time-dependent differential equations, regulating channel opening with sigmoid or Gaussian functions. We incorporated sodium channel selectivity and used conductance values determined in our previous work. Our results demonstrate that sodium channel selectivity facilitates a quicker and stronger depolarization response and prevents outward K+ currents through sodium channels. We also modified the Hodgkin-Huxley model to allow experimentation with different sodium channel selectivities; the modified model exhibits the same phenomenon as our model. This work highlights the critical role of ion channel selectivity in action potential formation and provides deeper insight into its biological significance.

Improving machine learning pKa prediction with new pKa database PKAD-R
In our previous machine learning (ML) pKa prediction project, we used the first version of the PKAD database, which was less thoroughly curated. We have since upgraded to PKAD-R, a curated and expanded database of experimental protein pKa values. With this improved dataset, we aim to revisit and enhance our ML study of pKa prediction. We are exploring several ways of upgrading our methodology. First, we refined our previous input features for greater accuracy and consistency. Second, we tested an updated BLN model, in which protein residues are represented by hydrophobic (B), hydrophilic (L), and neutral (N) beads, to encode the neighboring sequence of each target ionizable residue, with added categories for ionizable neighbors alongside the original BLN groups. While this attempt did not lead to major improvements, it informed our feature design. Our main effort now focuses on a new framework that treats the entire protein as the prediction unit, rather than modeling residues independently.
Outlier analysis from our previous ML work revealed that most extreme errors occur in clusters of ionizable residues, suggesting that residue-residue coupling is a challenging aspect of pKa prediction. To address this, the new scheme processes the full protein and simultaneously predicts pKa values for all ionizable residues, thereby naturally capturing inter-residue interactions. We are testing both graph neural networks (GNNs) and transformer-based architectures within this framework, using transfer learning by pretraining on AlphaFold-derived calculated data and fine-tuning with the experimental PKAD-R database. By combining refined features with this protein-level representation, we aim to substantially improve the accuracy and robustness of pKa prediction.

Improving QM/MM Boundary Treatments for Chemical Reactions
This project focused on improving the accuracy and efficiency of QM/MM simulations when partitioning across covalent bonds. Over the past year, we validated the double link atom (DLA) method, demonstrating reliable performance across diverse systems, including those with shifted pKa values. Using the catechol-O-methyltransferase model system, we showed that DLA enables accurate reaction free energies without requiring large QM regions, in contrast to prior studies. Future work will apply DLA to biocatalytic reactions and extend its integration with additional QM and MM software packages.

A dipole replacement method for the calculation of solvation free energy
Solvation plays an important role in biological systems. Accurate and efficient calculation of solvation free energy has been pursued for many years in simulation studies of biological systems. Currently, the Generalized Born (GB) and Poisson-Boltzmann (PB) methods are the major methods used in molecular simulation. The computational cost of these methods and the empirical nature of the GB model limit their application to large systems. This project takes a new approach to solvation calculation.
The theoretical framework has been laid out based on solvent replacement. I am working on the technical details needed to implement and test the method. This new method will provide an efficient and accurate alternative to GB and PB for molecular simulation.

Analysis of large timestep behavior for the calculation of simulation properties
The behavior of several different molecular dynamics integration methods is compared with regard to their ability to preserve computed properties when used with larger timesteps. Of particular interest are temperatures of specific degrees of freedom, free energies of specific states, transport properties, and barrier crossing rates. Based on this analysis, new or modified molecular dynamics integration methods will be introduced.
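A minimal sketch of the kind of comparison described above, under simplifying assumptions: a single harmonic oscillator (unit mass, hypothetical frequency `omega`) is integrated with velocity Verlet at several timesteps, and the ratio of time-averaged kinetic to potential energy is reported. For an exact trajectory this ratio is 1; for velocity Verlet with on-step velocities it falls to 1 - (omega*dt)^2/4 in this model, which shows how a property such as the kinetic temperature of one degree of freedom can depend on the timestep. The choice of integrator and test system is an assumption made for illustration, not the set of methods studied in the project.

```python
def velocity_verlet_ratio(dt, omega=1.0, steps=200_000):
    """Integrate x'' = -omega^2 * x with velocity Verlet (unit mass) and
    return <kinetic energy> / <potential energy> averaged over the run."""
    x, v = 1.0, 0.0
    ke_sum = pe_sum = 0.0
    for _ in range(steps):
        v += 0.5 * dt * (-omega ** 2 * x)  # first half kick
        x += dt * v                        # drift
        v += 0.5 * dt * (-omega ** 2 * x)  # second half kick
        ke_sum += 0.5 * v * v
        pe_sum += 0.5 * omega ** 2 * x * x
    return ke_sum / pe_sum

if __name__ == "__main__":
    for dt in (0.05, 0.5, 1.0, 1.5):
        ratio = velocity_verlet_ratio(dt)
        predicted = 1.0 - dt ** 2 / 4.0  # analytic value for omega = 1
        print(f"dt={dt:4.2f}  <KE>/<PE>={ratio:.4f}  predicted={predicted:.4f}")
```

The trajectory stays stable at every timestep shown (omega*dt < 2), yet the kinetic average falls increasingly below the potential average, so stability alone does not guarantee that computed properties are preserved.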