Group Contribution and Machine Learning Approaches to Predict Abraham Solute Parameters, Solvation Free Energy, and Solvation Enthalpy
Solvation free energy predictions play a key role in a variety of areas such as synthesis of organic molecules, optimization of purification processes, and pollutant level management. Having compiled a new and extensive solvation property database, we present a group contribution method (SoluteGC) and a machine learning model (SoluteML) to predict the Abraham solute parameters for solute compounds, as well as a machine learning model (DirectML) to predict solvation free energy and enthalpy for solvent-solute pairs. The proposed group contribution method uses atom-centered functional groups with corrections for ring and polycyclic strain and long-distance interactions, whilst the machine learning models adopt a directed message passing neural network. The solute parameters predicted by SoluteGC and SoluteML are used to calculate solvation free energy and enthalpy via linear free energy relationships [1,2]. The new data sets used to train the models contain 8366 solute parameters, 20253 solvation free energies, and 6322 solvation enthalpies, making them larger than any current database. The three models are evaluated on identical test sets using both random and substructure-based solute splits for solvation free energy and enthalpy predictions. The results show that, on average, the DirectML model is superior to the SoluteML and SoluteGC models for both solvation free energy (mean absolute error on the random solute split: 0.41 kcal/mol cf. 0.48 kcal/mol and 0.63 kcal/mol, respectively) and solvation enthalpy (mean absolute error on the random solute split: 0.47 kcal/mol cf. 0.50 kcal/mol and 0.64 kcal/mol, respectively) predictions. However, for certain solute and solvent substructures, this is not always the case. For this reason, combining the three models can provide even more accurate predictions of solvation free energy and enthalpy. Nevertheless, DirectML on its own can provide accuracy comparable to, or even better than, that of advanced quantum chemistry methods.
SoluteML and SoluteGC also provide useful insights into molecular structures and properties beyond solvation. Finally, we present our compiled open-source solvation free energy and enthalpy databases and provide public access to our final prediction models through a simple web-based tool, a conda package, and source code.
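The step from predicted solute parameters to a solvation property can be illustrated with a minimal sketch. It assumes the commonly used Abraham form log K = c + eE + sS + aA + bB + lL for the gas-to-solvent partition coefficient, and the conversion ΔG_solv = -RT ln(10) log K. The function names and every numerical value below are hypothetical placeholders chosen for illustration; they are not fitted coefficients or entries from the database described here.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def abraham_lfer(solute, coeffs):
    """Evaluate c + e*E + s*S + a*A + b*B + l*L for one solvent-solute pair."""
    return (
        coeffs["c"]
        + coeffs["e"] * solute["E"]
        + coeffs["s"] * solute["S"]
        + coeffs["a"] * solute["A"]
        + coeffs["b"] * solute["B"]
        + coeffs["l"] * solute["L"]
    )


def free_energy_from_log_k(log_k, temperature_k=298.15):
    """Convert log10 of the partition coefficient into dG_solv in kcal/mol."""
    return -R_KCAL_PER_MOL_K * temperature_k * math.log(10.0) * log_k


# Hypothetical placeholder values, NOT real solute or solvent parameters.
solute = {"E": 0.5, "S": 0.5, "A": 0.0, "B": 0.1, "L": 3.0}
solvent_coeffs = {"c": 0.1, "e": -0.2, "s": 0.5, "a": 1.0, "b": 0.5, "l": 0.9}

log_k = abraham_lfer(solute, solvent_coeffs)
delta_g = free_energy_from_log_k(log_k)
print(f"log K = {log_k:.3f}, dG_solv = {delta_g:.2f} kcal/mol")
```

Under the same assumption, a solvation enthalpy correlation of this linear form could reuse `abraham_lfer` with a separate set of solvent coefficients, in which case the sum is read directly as an enthalpy rather than converted from log K.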
[1] Abraham, M. H. et al. Solvation of gaseous non-electrolytes. Faraday Discuss. Chem. Soc. 1988, 85, 107–115.
[2] Mintz, C. et al. Enthalpy of Solvation Correlations for Gaseous Solutes Dissolved in Water and in 1-Octanol Based on the Abraham Model. J. Chem. Inf. Model. 2007, 47, 115–121.