Advances in X-ray crystallography and proteomics consortia have led to unprecedented growth in the number of high-quality three-dimensional structures of protein-ligand complexes that can be used in drug design and discovery. Many of these proteins play critical roles in the pathways of life-threatening diseases such as HIV, hepatitis, and many kinds of cancer. Binding small drug-like molecules (ligands) to such proteins can alter their function and inhibit their activity, which can ultimately lead to therapeutic benefit. Finding ligands that bind strongly to the protein of interest involves screening millions of candidates that vary in structure and physicochemical properties. Determining the binding affinity (BA) of each ligand against the receptor protein in vitro and/or in vivo is prohibitively expensive and impractical for large databases, even with high-throughput screening (HTS) approaches. As a result, in silico virtual-screening techniques have emerged over the last two decades to filter large compound databases down to manageable sets of the most promising ligands. Molecular docking (depicted below) is a popular computational approach that “docks” a ligand into the structure of a macromolecular target and “scores” its potential complementarity to the binding site by predicting its BA.
Predicting the BAs of large sets of protein-ligand complexes accurately and efficiently is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Because a scoring function (SF) is used to score, rank, and identify potential drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein’s binding site has a significant bearing on the accuracy of virtual screening. Despite intense efforts to develop conventional SFs, which are force-field based, knowledge-based, or empirical, their limited scoring and ranking accuracies have been a major roadblock to cost-effective drug discovery. Our research therefore explores a range of SFs that employ different machine-learning (ML) techniques in conjunction with a variety of physicochemical and geometrical features characterizing protein-ligand complexes. Our approach is to tailor generic, cutting-edge ML algorithms (such as Random Forests, Boosted Regression Trees, and support vector machines) to build specialized SFs that model BA accurately and give insight into the interactions between proteins and ligands that drive binding. Our preliminary efforts have resulted in substantially higher accuracies than those of the conventional SFs used in mainstream commercial docking tools. On the core test set of the PDBbind benchmark, our best-performing ML SF achieves a Pearson correlation coefficient of 0.809 between predicted and measured BA, compared with 0.644 for a state-of-the-art conventional SF.
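To make the scoring metric concrete, the following is a minimal sketch of how a Pearson correlation coefficient between predicted and measured BA could be computed. The function name `pearson_r` and the affinity values are hypothetical and chosen only for illustration; they are not taken from the PDBbind benchmark or from our evaluation code.

```python
import math


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sums of products of deviations from the means
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_m)


# Hypothetical affinities for five complexes (illustrative values only)
measured = [4.2, 6.8, 5.1, 9.0, 7.3]
predicted = [4.9, 6.1, 5.6, 8.2, 7.0]
r = pearson_r(predicted, measured)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates that predicted affinities track the measured ones linearly; the coefficient does not by itself measure absolute prediction error.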
We are also interested in developing a consensus scoring scheme, built on our ML SFs and SFs developed by others, that dynamically selects a subset of SFs from a larger pool based on the family of the target protein. Ranking millions of ligands accurately against one another is as important as reliable scoring. Our research interests therefore also encompass developing specialized SFs that maximize ranking accuracy explicitly rather than relying on binding affinity prediction alone. Finally, we plan to mine large public repositories of protein-ligand complexes, both with and without binding affinity data, by coupling big-data handling strategies with supervised and unsupervised learning approaches to construct even more accurate SFs.
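The distinction between scoring and ranking accuracy can be illustrated with a rank-based measure. The sketch below uses the Spearman rank correlation, which is one common choice and an assumption here, since no specific ranking metric is named above. The helper names `average_ranks` and `spearman_rho` and the affinity values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(predicted, measured):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rp, rm = average_ranks(predicted), average_ranks(measured)
    mean = (len(rp) + 1) / 2  # mean of any set of average ranks
    cov = sum((a - mean) * (b - mean) for a, b in zip(rp, rm))
    var_p = sum((a - mean) ** 2 for a in rp)
    var_m = sum((b - mean) ** 2 for b in rm)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_m)


# Hypothetical affinities for five ligands against one target
measured = [4.2, 6.8, 5.1, 9.0, 7.3]
predicted = [4.9, 7.1, 5.6, 8.2, 7.0]  # two ligands are ordered incorrectly
rho = spearman_rho(predicted, measured)
print(f"Spearman rho = {rho:.3f}")
```

Because the measure depends only on the ordering of the ligands, an SF can score well here even when its predicted affinities are offset or nonlinearly scaled relative to the measured values, which is why ranking accuracy may merit being optimized directly.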