Propersea - Property Prediction
Propersea is a newly developed online resource (currently in late-stage testing) that predicts a range of molecular and physicochemical properties for small molecules. The predicted properties include melting point, boiling point, density, logP, solubility, polarizability and more. It also predicts the IUPAC name of the molecule.
This resource provides a search interface similar to that of our Chemical Availability Service (ChASe), allowing a user to search by SMILES string, InChI (including the InChI= prefix) or structure. Once the search is complete, the user is shown the predictions for that molecule. We also hope to integrate this search with our other resources in the future.
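As an illustration of the two text-based search inputs, the sketch below tells an InChI query from a SMILES query by checking for the InChI= prefix. It is a minimal example under stated assumptions and not Propersea's own code: the function name `classify_query` is hypothetical, anything without the prefix is simply treated as SMILES, and no chemical validation is attempted.

```python
from typing import Tuple


def classify_query(text: str) -> Tuple[str, str]:
    """Return (kind, cleaned_query), where kind is 'inchi' or 'smiles'.

    Hypothetical helper for illustration only; it does not check that
    the string describes a valid molecule.
    """
    query = text.strip()
    if not query:
        raise ValueError("empty query")
    # InChI strings begin with the literal prefix "InChI=".
    if query.startswith("InChI="):
        return ("inchi", query)
    # Assumption: everything else is passed on as a SMILES string.
    return ("smiles", query)


if __name__ == "__main__":
    # Ethanol written both ways.
    print(classify_query("CCO"))
    print(classify_query("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

A real service would go on to parse the string with a cheminformatics toolkit before running any prediction; that step is left out here.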
The properties are predicted through a variety of algorithms, including:
- RDKit algorithms
- Semi-empirical quantum methods
- Fragment/atom contribution calculations
- Bayesian Additive Regression Trees
- Transformer neural networks
The predicted value is returned in the results interface. For properties predicted using the Bayesian algorithms, the interface also returns a 95% confidence interval, along with a measure of how closely the molecule resembles the molecules in the training set. Where a property prediction is deemed nonsensical because of the predicted phase, the property may be omitted from the results.
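One common way to report a 95% interval from a Bayesian model is to take the 2.5th and 97.5th percentiles of its posterior draws for the molecule. The sketch below shows that calculation; it assumes the model exposes such draws, and the names `percentile` and `summarise_draws` are hypothetical. It is not a description of how Propersea computes its intervals.

```python
import statistics
from typing import Sequence, Tuple


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile of pre-sorted values, 0 <= q <= 1."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)  # index at or just below the target position
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def summarise_draws(draws: Sequence[float]) -> Tuple[float, float, float]:
    """Return (median, lower, upper); lower/upper bound the central 95% of draws."""
    ordered = sorted(draws)
    return (
        statistics.median(ordered),
        percentile(ordered, 0.025),
        percentile(ordered, 0.975),
    )


if __name__ == "__main__":
    # Made-up draws for a single property, for illustration only.
    example_draws = [351.2, 349.8, 352.5, 350.1, 348.9, 353.0, 350.7]
    print(summarise_draws(example_draws))
```

A wide interval from this kind of summary signals that the draws disagree, which is one reason an interval is more informative than a single predicted value.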
Propersea performs best for organic compounds; performance on inorganics, organometallics and inorganic-organic mixtures is known to be lower. Reliability metrics for these compounds are shown as ‘Low’ or ‘Very Low’.
IUPAC Name Prediction
Propersea also features our novel machine learning model for generating IUPAC names. It is a sequence-to-sequence model that predicts the IUPAC name from the molecule's InChI (International Chemical Identifier) string. The development and testing of this model are presented in our pre-print below, which is currently being published.
Handsel J, Matthews B, Knight N, Coles S (2021) Translating the molecules: adapting neural machine translation to predict IUPAC names from a chemical identifier. ChemRxiv. https://doi.org/10.26434/chemrxiv.14170472.v1 - in publication
The model was trained on a dataset of 10 million compounds and tested on a dataset of 200,000 compounds, where 90.7% of its predictions were a complete match to the IUPAC name. The model performs extremely well on organic compounds and also handles isomers and tautomers that are adequately described by the InChI.
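A complete-match accuracy of this kind counts a prediction as correct only when the whole predicted name equals the reference name. The sketch below illustrates the metric on a toy list; the function name `exact_match_accuracy` is hypothetical, the names are made-up examples rather than data from the test set, and strict string equality is an assumption about how a match is judged.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of predicted names that are identical to the reference names."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("need at least one name to score")
    # A name scores only if every character matches; there is no partial credit.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


if __name__ == "__main__":
    predicted = ["ethanol", "propan-2-ol", "benzene"]
    reference = ["ethanol", "propan-1-ol", "benzene"]
    print(exact_match_accuracy(predicted, reference))
```

Because a single wrong locant fails the whole name, this is a strict measure: a prediction that is almost right scores the same as one that is entirely wrong.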
However, the current model does not perform well on inorganics, organometallics and inorganic-organic mixtures. This is likely due in part to the limitations of the InChI in describing these molecules, and in part to the quality and quantity of such molecules in the training dataset. Improving the model's performance in these areas is the focus of a current project.
If you would like to discuss our resource development or have any feedback, please contact us by email at email@example.com.