- Open Access
New public QSAR model for carcinogenicity
© Fjodorova et al; licensee BioMed Central Ltd. 2010
- Published: 29 July 2010
One of the main goals of the new chemical regulation REACH (Registration, Evaluation and Authorization of Chemicals) is to fill the gaps in data on properties of chemicals that affect human health. (Q)SAR models are accepted as a suitable source of information. The EU-funded CAESAR project aimed to develop models for the prediction of five endpoints for regulatory purposes. Carcinogenicity is one of the endpoints under consideration.
Models for prediction of carcinogenic potency according to the specific requirements of chemical regulation were developed. A dataset of 805 non-congeneric chemicals extracted from the Carcinogenic Potency Database (CPDBAS) was used. The Counter Propagation Artificial Neural Network (CP ANN) algorithm was implemented. The article describes two alternative models for predicting carcinogenicity. The first model employs eight MDL descriptors (model A) and the second twelve Dragon descriptors (model B). CAESAR's models have been assessed according to the OECD principles for the validation of QSAR. Model validity was assessed with a wide series of statistical checks. Models A and B yielded training set (644 compounds) accuracies of 91% and 89%, respectively; the test set (161 compounds) accuracy was 73% and 69%, and the specificity was 69% and 61%, respectively. Sensitivity in both cases was 75%. The accuracy of the leave-20%-out cross-validation on the training set was 66% and 62% for models A and B, respectively. To verify that the models perform correctly on new compounds, an external validation was carried out. The external test set was composed of 738 compounds. The external validation gave accuracy of 61.4% and 60.0%, sensitivity of 64.0% and 61.8%, and specificity of 58.9% and 58.4% for models A and B, respectively.
Carcinogenicity is a particularly important endpoint, and QSAR models are not expected to replace human expert opinion and conventional methods. However, we believe that a combination of several methods will provide useful support to the overall evaluation of carcinogenicity. In the present paper, models for classification of carcinogenic compounds using MDL and Dragon descriptors were developed. The models could be used to set priorities among chemicals for further testing. The models at the CAESAR site were implemented in Java and are publicly accessible.
- QSAR Model
- Applicability Domain
- Chemical Descriptor
- External Dataset
Evaluation of the chemical toxicity and human health risk of compounds is of primary interest, because it drives much of the current regulatory action regarding new and existing chemicals. It is estimated that over 30 000 industrial chemicals used in Europe require additional safety testing. Traditional animal testing is very costly and would require an extra 10-20 million animal experiments, which is contrary to the policy in EU member states to replace, reduce and refine the use of animals in science (the so-called 3 Rs policy). In order to support the 3 Rs and REACH policies, alternative approaches such as Quantitative Structure-Activity Relationships (QSARs) were proposed.
Among the various endpoints, carcinogenicity is one of the most essential in the assessment of human health safety. Many models for prediction of carcinogenic potency have been published in recent years [2–6]. Some QSAR models are developed for particular chemical classes (such as amines, nitro compounds, polycyclic aromatic hydrocarbons) [7–9]. A considerable number of expert systems have been created for the prediction of carcinogenicity. In some cases different endpoints such as genotoxicity, mutagenicity and carcinogenicity could be integrated (for review articles see [10–17]). Models for non-congeneric chemicals are of great interest for regulatory use as they involve various classes of chemicals [18, 19].
The big challenge in solving the general carcinogenicity prediction problem is to construct a model that is able to predict carcinogenicity for a wide diversity of molecular structures, spanning an undetermined number of chemical classes and biological mechanisms. Quantitative models based on SMILES for prediction of carcinogenicity were successfully developed [21, 22].
Many statistical approaches can be used for the prediction of a complex endpoint such as carcinogenicity. The CAESAR models belong to the family of data mining models that address complex endpoints. Other models that have been developed are based on toxic residues codifying human expert knowledge, such as Oncologic, HazardExpert, Derek and ToxTree, or on data mining based on fragments, such as MultiCase [10, 12]. Within CAESAR, the data mining approach has been improved by using a highly verified set of compounds (all chemical structures have been double-checked, and experimental data verified against similar compounds in case of unusual findings) and by adopting a wide series of chemical descriptors. Different algorithms have been developed; this resulted in a series of models, and the ones with the best performance have been implemented and are reported here.
The predictive power of models is one of the most important characteristics in QSAR modeling. In a recent paper Benigni et al. pointed out that the prediction reliability should be checked by means of an external test set with new chemicals not used in modeling. The state of the art and perspectives of predictive models for carcinogenicity are reported in a recent paper. It was stressed that models for regulatory purposes should have high sensitivity, i.e., the ability to correctly identify true positives. Preliminary results of carcinogenicity modeling using the CP ANN algorithm, obtained in the scope of the CAESAR project, are described in an earlier article.
Among statistical approaches, artificial neural networks (ANNs) appear to be among the most suitable and promising for the prediction of a complex endpoint such as carcinogenicity for non-congeneric datasets of chemicals. The main advantage of neural network modeling is that complex, non-linear relationships can be modeled without any assumptions about the form of the model. Large datasets can be examined. Neural networks are able to cope with noisy data and are fault-tolerant.
In this paper we present categorical (qualitative) models for prediction of the carcinogenic potency of non-congeneric chemicals using the CP ANN method. Our models have been developed in accordance with the principles of validation adopted by the OECD, within the European Commission (EC) funded project CAESAR (Computer Assisted Evaluation of industrial chemical Substances According to Regulation).
In our study an external dataset of 738 chemicals was assembled and used for external validation of the models. The paper shows how the number of correctly predicted carcinogens can be increased by exploiting the relationship between the threshold of a categorical model and its sensitivity and specificity. We address the effect of the threshold on the overall performance of the models.
Our final models could serve for the preliminary ranking and prioritization of chemicals for carcinogenic potency, as required by REACH.
All CAESAR models for prediction of carcinogenicity were built in accordance with the 5 OECD principles and are based on a strict quality assurance/quality control process. More than one partner worked on the development of the models, in parallel or in some cases in collaboration. This allowed a scientific cross-validation of the results, because the activity of each group and the results obtained were discussed and evaluated by all partners during the periodic meetings, at which all models were presented.
In addition, a direct, detailed quality control and double check of the results was done individually for each of the final models developed and implemented within CAESAR, as presented at the web site.
The models at the CAESAR web site have been implemented in Java. This gives good portability and ease of execution within a client-server approach.
CP ANN models parameters
We used software developed in the Laboratory of Chemometrics (National Institute of Chemistry, Ljubljana, Slovenia), written in FORTRAN for IBM-compatible PCs and the Windows operating system. This program, "AnnToolbox for Windows", is available at the home page of the National Institute of Chemistry, Slovenia.
For categorical models a threshold (cut-off) value of 0.45 was applied for the model with 8 MDL descriptors (model A) and 0.5 for the model with 12 Dragon descriptors (model B). Chemicals falling in a terminal node with a mean response higher than the threshold (0.45 for model A, 0.5 for model B) were classified as positive (active, or carcinogens), and chemicals falling in a terminal node with a mean response lower than the threshold were classified as negative (inactive, or non-carcinogens).
In all approaches descriptors serve as independent variables and biological activities as dependent variables.
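The cut-off rule above can be illustrated with a minimal sketch. The helper name `classify` and the example responses are hypothetical; the text does not say how a response exactly equal to the threshold is handled, so treating it as negative is an assumption.

```python
# Cut-off values quoted in the text for model A (MDL) and model B (Dragon).
THRESHOLDS = {"A": 0.45, "B": 0.5}

def classify(response: float, model: str) -> int:
    """Return 1 (carcinogen) if the response exceeds the model's cut-off, else 0.

    Assumption: a response exactly equal to the cut-off is treated as negative.
    """
    return 1 if response > THRESHOLDS[model] else 0

responses = [0.10, 0.47, 0.80]  # made-up network responses
print([classify(r, "A") for r in responses])  # [0, 1, 1]
print([classify(r, "B") for r in responses])  # [0, 0, 1]
```

The same response (0.47) is labelled positive by model A and negative by model B, which is why the two thresholds have to be reported together with the models.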
Model using eight MDL descriptors
Table 1. Statistical performance of models using 8 MDL descriptors (Model A) and 12 Dragon descriptors (Model B). Columns: Model A (8 MDL descriptors), training (644 compounds) and test (161 compounds); Model B (12 Dragon descriptors), training (644 compounds) and test (161 compounds).
Another important feature of models for regulatory purposes is reproducibility. Therefore the parameters of the model have to be fixed, so that the user does not need to optimize them. Figure 3 shows the optimal model performance, which corresponds to a threshold of 0.45.
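How the choice of threshold trades sensitivity against specificity can be sketched as follows. The responses and labels are made-up illustration data, not results from the paper, and `sens_spec` is a hypothetical helper.

```python
def sens_spec(responses, labels, threshold):
    """Sensitivity and specificity when responses above `threshold` count as positive."""
    pairs = list(zip(responses, labels))
    tp = sum(1 for r, y in pairs if r > threshold and y == 1)
    fn = sum(1 for r, y in pairs if r <= threshold and y == 1)
    tn = sum(1 for r, y in pairs if r <= threshold and y == 0)
    fp = sum(1 for r, y in pairs if r > threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up responses with their experimental classes (1 = carcinogen).
responses = [0.2, 0.4, 0.48, 0.6, 0.9, 0.3]
labels = [0, 1, 0, 1, 1, 0]

for t in (0.3, 0.45, 0.6):
    se, sp = sens_spec(responses, labels, t)
    print(f"threshold={t:.2f}  sensitivity={se:.2f}  specificity={sp:.2f}")
```

On this toy data, lowering the threshold raises sensitivity at the cost of specificity, which is the effect a regulatory user can exploit when false negatives are the main concern.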
As an alternative choice Dragon descriptors can be used for prediction of carcinogenicity using CP ANN algorithm.
Model using twelve Dragon descriptors
Twelve Dragon descriptors were selected as described in the Methods section. The CP ANN algorithm was applied to the dataset of 805 chemicals. We selected an optimal model with a neural network dimension of 35*35 and 200 learning epochs. The threshold was set at 0.5.
The Cooper statistics of the model with 12 Dragon descriptors indicated, for the training set (644 compounds), an accuracy of 89%, sensitivity of 90% and specificity of 87%, and, for the test set (161 compounds), an accuracy of 69%, sensitivity of 75% and specificity of 61%. The characterization of this model B is given in Table 1.
Validation of models using external set of 738 compounds
Confusion matrix for the external validation set of 738 chemicals, model A (MDL descriptors): Leadscope experimental carcinogenicity class versus CAESAR predicted carcinogenicity class.
Confusion matrix for the external validation set of 738 chemicals, model B (Dragon descriptors): Leadscope experimental carcinogenicity class versus CAESAR predicted carcinogenicity class.
Overall, the performance obtained on this external dataset was as follows: accuracy = 61.4% and 60.0%, sensitivity = 64.0% and 61.8%, and specificity = 58.9% and 58.4% for the models with MDL and Dragon descriptors, respectively.
Applicability domain of models
The range of MDL descriptors for model A. Columns: MDL descriptor symbol, minimum value of descriptor, maximum value of descriptor.
The range of Dragon descriptors for model B. Columns: Dragon descriptor symbol, minimum value of descriptor, maximum value of descriptor.
This kind of global, chemometric estimation of the AD does not address two key aspects. Firstly, the chemical space characterised by the descriptor range does not take into account the density of the compound distribution, so the target chemical might fall in an area poorly represented in the training set. Secondly, since the AD relies on the chemical descriptors alone, the output layer (the property under investigation) is neglected. To overcome these limitations, CAESAR developed a further tool for AD assessment, based on a similarity score that identifies the six most similar chemicals in the training set; it can be used to evaluate whether these compounds are really representative of the unknown compound. Furthermore, a visualisation of these compounds is offered, which can be used to evaluate them independently. Finally, a quantitative report of the error between the observed and predicted activity is also provided for these substances, so that poor behaviour of the model in the chemical area that best represents the compound of interest can be detected. The CAESAR models can be accessed through the "CAESAR Application", which is a Java-based web application.
Structural and chemical diversity of the studied datasets
In Figure 4 the CAESAR dataset and the external dataset of 738 chemicals are represented in terms of the number of chemicals containing particular functional groups. The overall picture for the CAESAR dataset is quite similar to that of the external dataset. In both datasets the most frequent specific fragments are: alcohol, alkene, amines, carbonyl, carboxamide, carboxylate, carboxylic acid, ether, halide, hydrazine, ketone, misc nitrogen group, nitro, nitroso, sulfonyl group. The following specific fragments are present in smaller numbers: aldehyde, alkyne, amidine, carbamate, guanidine, hydroxylamine, imine, iminomethyl, isocyanate, mercaptan, misc oxygen group, misc sulfur groups, nitrile, phosphorous groups, quinones, sulfide, sulfonamide, sulfonate, sulfone, sulfonic acid, sulfoxide, thiocarboxamide, thioxomethyl, urea. Acid anhydride and azide are present only in the CAESAR set, and two fragments, organometal and thiocarboxylates, are present only in the external test set.
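The two AD checks discussed above can be sketched as follows. This is only an illustration: the descriptor values are made up, and Euclidean distance is used as a stand-in because the text does not specify how the CAESAR similarity score is computed.

```python
import numpy as np

def in_descriptor_range(x, train):
    """True if every descriptor of x lies within the training set min/max range."""
    return bool(np.all((x >= train.min(axis=0)) & (x <= train.max(axis=0))))

def most_similar(x, train, k=6):
    """Indices of the k training compounds closest to x.

    Assumption: Euclidean distance in descriptor space stands in for the
    (unspecified) CAESAR similarity score.
    """
    dist = np.linalg.norm(train - x, axis=1)
    return np.argsort(dist)[:k]

# Made-up training descriptors (rows = compounds, columns = descriptors).
train = np.array([[0.0, 1.0], [0.5, 2.0], [1.0, 3.0], [0.2, 1.5]])

print(in_descriptor_range(np.array([0.4, 2.5]), train))  # True
print(in_descriptor_range(np.array([1.2, 2.5]), train))  # False
print(most_similar(np.array([0.4, 2.0]), train, k=2))    # [1 3]
```

A compound can pass the range check and still sit in a sparsely populated region, which is why inspecting the nearest training compounds and their prediction errors adds information.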
The structural diversity of CAESAR dataset of 805 chemicals by presence of specific structural alerts (SAs) extracted from ToxTree program.
Structural Alert (SA)
Number of chemicals
SA_2: alkyl (C<5) or benzyl ester of sulphonic or phosphonic acid
SA_3: N-methylol derivatives
SA_5: S or N mustard
SA_6: Propiolactones or propiosultones
SA_7: Epoxides and aziridines
SA_8: Aliphatic halogens
SA_9: Alkyl nitrite
SA_10: α,β-unsaturated carbonyls
SA_11: Simple aldehyde
SA_14: Aliphatic azo and azoxy
SA_15: isocyanate and isothiocyanate groups
SA_16: alkyl carbamate and thiocarbamate
SA_18: Polycyclic Aromatic Hydrocarbons
SA_19: Heterocyclic Polycyclic Aromatic Hydrocarbons
SA_20: (Poly) Halogenated Cycloalkanes
SA_21: alkyl and aryl N-nitroso groups
SA_22: azide and triazene groups
SA_23: aliphatic N-nitro group
SA_24: α,β-unsaturated aliphatic alkoxy group
SA_25: aromatic nitroso group
SA_26: aromatic ring N-oxide
SA_28: primary aromatic amine, hydroxyl amine and its derived esters
SA_28bis: Aromatic mono- and dialkylamine
SA_28ter: aromatic N-acyl amine
SA_29: Aromatic diazo
SA_30: Coumarins and Furocoumarins
SA_31a: Halogenated benzene
SA_31b: Halogenated PAH
SA_31c: Halogenated dibenzodioxins
Mechanistic interpretation of model
The development of a structure-information approach, based on the application of different structural descriptors including electrotopological ones, opens new opportunities for the prediction of biological activity and properties, in contrast to the mechanism-based approach [36–38]. Models based on this approach are established independently of explicit three-dimensional (3-D) structure information and are directly interpretable in terms of the implicit structure information. The authors demonstrated a wide range of applicability of such models for relatively large datasets (e.g. for prediction of aqueous solubility, AMES mutagenicity, fish toxicity and others). In the case of carcinogenicity there is a variety of mechanisms and pathways, including genotoxic and epigenetic ones, that might play a role in the observed toxic effect. Applying the structure-information approach, which is "mechanism-free", makes our task simpler and thus feasible, because it is not necessary to assume various mechanistic steps in order to make computations for a biological property as complicated as carcinogenicity. This method is free of approximations and computations related to an assumed mechanism of interaction. This aspect is especially important for modelling carcinogenicity with a non-congeneric set of substances, aimed at the prediction of a wide diversity of chemicals.
Table 7. Eight MDL descriptors selected for modeling (descriptor code: description).
SdsCH: Sum of all (=CH-) E-State values in molecule
SdssC_acnt: Count of all (=C<) groups in molecule
SdsN_acnt: Count of all (=N) groups in molecule
dxp9: Difference simple 9th order path chi indices
nxch6: Number of 6-membered rings
Gmin: Smallest atom E-State value in molecule
SHCsats: Sum of hydrogen E-State on sp3 C on saturated bond
SHBint2_acnt: Count of internal hydrogen bonds with 2 skeletal bonds between donor and acceptor
Table 8. Twelve Dragon descriptors selected for modeling (descriptor code: description).
PW5: Path/walk 5 - Randic shape index
D/Dr06: Distance/detour ring index of order 6
MATS2p: Moran autocorrelation - lag 2/weighted by atomic polarizabilities
EEig10x: Eigenvalue 10 from edge adj. matrix weighted by edge degrees
ESpm11x: Spectral moment 11 from edge adj. matrix weighted by edge degrees
Spectral moment 09 from edge adj. matrix weighted by dipole moments
GGI2: Topological charge index of order 2
JGI6: Mean topological charge index of order 6
nRNNOx: Number of N-nitroso groups (aliphatic)
nPO4: Number of phosphates/thiophosphates
N-067: Al2-NH
N-078: Ar-N = X/X-N = X
Considering the MDL descriptors (see Table 7), we can see that we deal with electrotopological E-state, connectivity and other descriptors. E-state indices are a combination of electronic, topological and valence state information. These indices incorporate information related to atom types and electron accessibility, hydrogen atom E-states, and connectivities that are influenced by all of the sub-structural features of a molecule [40–42]. Element identity and skeletal connection contain structure information, while the valence state definition includes relationships for valence state electronegativity and atom/group molar volume. Based on these important features of molecules, together with the skeletal branching pattern, both the electrotopological state (E-state) and molecular connectivity (chi indices) structure descriptors were successfully implemented for prediction of genotoxicity and carcinogenicity [18, 43, 44]. The authors contend that one of the critical determining factors for good prediction results is the nature of the molecular structure representation employed in the model development process.
A complete set of whole molecular descriptors encode information on general structure features such as molecular size and shape, as well as specific information on skeletal variation and complexity. These structural features are expected to have a relationship to properties arising from intermolecular interactions and may also function to provide discrimination among multiple structural classes.
The atom-type, group-type, bond-type and single-atom E-state descriptors encode information on specific molecular features such as atom and bond types associated with important functional groups. Many of the descriptors relate directly to, or are associated with, structural alerts, as was reported in papers [45, 46].
Some of the E-state descriptors can be associated with structural alerts for carcinogenicity. For example, in Table 7 the SdsN_acnt descriptor belongs to the atom-type E-State count descriptors and expresses the count of the nitrogen atom type =N- associated with the azo group. The latter is also a structural alert and is correlated with carcinogenicity. In Table 8 the nRNNOx and N-078 descriptors account for specific fragments whose presence is characteristic of the carcinogenic class, while the nPO4 descriptor accounts for the non-carcinogenic class.
The global E-State descriptor Gmin is a measure of the most electrophilic atom in the molecule. Mechanistically, an electrophilic center is important for covalent bond formation with nucleophilic DNA. This is the reason why this descriptor was found among the most important descriptors correlated with carcinogenicity.
The hydrogen E-State descriptor SHCsats encodes E-state values for hydrogens on sp3 hybrid carbons bonded only to other sp3 carbon atoms. The electron accessibility of these sp3 hydrogens may relate in some manner to hydrophobic interactions between substrates and DNA, or may have a relation to alkyl chlorides, which are known toxicophores.
Thus, the descriptors used in our study refer to topological characteristics as well as to polarizability and charge distribution (related to reactivity).
Interestingly, some descriptors that we applied in our models were also used by other authors [43, 44] in carcinogenicity and genotoxicity modelling. This suggests that future research may find some common features for modelling carcinogenicity and genotoxicity, but this issue is outside the scope of the present study.
It should be highlighted that the structure-information approach based on descriptors such as E-State has the following advantage: a model based on E-State descriptors (expressed as continuous values) can correlate carcinogenicity with a specific value of a descriptor, whereas the use of fragment-based structural alerts limits the model to a correlation with the presence or absence of fragments, or a simple count of given fragments, which can lead to false predictions.
The CPDB rodent carcinogenicity database was used for the development of models for categorization of carcinogenic potency. Initial preprocessing of the data and selection of data with carcinogenic potency for rats gave us consistent data suitable for QSAR modeling, with a carcinogenic potency response closer to that of humans. The MDL and Dragon software programs were applied for calculating the molecular descriptors. The topological structure descriptors provided a sound basis for classifying molecular structures.
The CP ANN models presented in our study demonstrated good prediction statistics on the test set of 161 compounds, with sensitivity of 75%, specificity of 61%-69% and accuracy of 69%-73%. A diverse external validation set of 738 compounds confirmed the robustness of our models over a large applicability domain, yielding accuracy of 60.0%-61.4%, sensitivity of 61.8%-64.0%, and specificity of 58.4%-58.9%.
These carcinogenicity models can be used as a support in risk assessment, for instance, in setting priorities among chemicals for further testing.
The models at the CAESAR web site are freely accessible for public use.
Internal data set used in modeling
The chemicals involved in the study belong to different chemical classes, so called non-congeneric substances. The work is addressed to industrial chemicals, referring to REACH initiative. The aim is to cover chemical space as much as possible. The initial dataset of 1481 chemicals was taken from Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html which was built from the Lois Gold Carcinogenic Database (CPDB).
The initial dataset has been cleaned of all incorrect structures, ambiguous or mixed structures, polymers, inorganic compounds, metallo-organic compounds, salts, complexes and compounds without a well-defined structure. The obtained data and structures of chemicals were cross-checked by at least two partners using the following online databases: ChemFinder, ChemIDPlus and PubChem Compound. The final dataset of 805 chemicals, with their ID number, chemical name, CASRN, experimental TD50 values for rat and corresponding binary carcinogenicity classes, is available in Additional file 1 (Table 1SI). For each substance it is indicated whether it belongs to the training or test set.
It should be highlighted that the data used in our study were obtained from standard protocols and meet the requirements for QSAR modeling. Carcinogenicity classification criteria follow Directive 67/548/EEC, Annex VI, and the Cancer Risk Assessment criteria proposed by the IARC (International Agency for Research on Cancer). Carcinogenicity hazard testing and assessment were performed in accordance with OECD Guideline 451 (TG 451, 1981, carcinogenicity study).
To prepare data for modeling the dataset of 805 chemicals was subdivided into training (644 chemicals) and test (161 chemicals) sets using the sub-sorting of chemicals according to functional groups and following procedure aimed to distinguish between connectivity aspects. This part of study has been done in the Helmholtz Centre for Environmental Research - UFZ in Germany by a partner in the CAESAR project. The sorting of the compounds pointed here is implemented in the software system ChemProp [50, 51].
External data set used for validation of models
Additional 738 chemicals different from those in our data set of 805 compounds were used as external validation set, being described by the same type of structural descriptors as employed in our model.
To assess the predictive abilities of the selected CAESAR models, a commercial database was queried to extract new chemical compounds to be tested. The Leadscope software gives access to several QSAR-ready databases. The "FDA 2009 SAR Carcinogenicity - SAR Structures" database, consisting of 2090 compounds, was extracted from the Leadscope environment in terms of structure information and carcinogenic activity label (based on different mammalian species) and compared with the CAESAR dataset of 805 compounds.
The two databases, in the form of sdf files, were merged with ChemFinder and a specific check for duplicates was performed. The compounds common to the two sources were analyzed to verify the consistency of the experimental carcinogenicity class assigned by the two sources.
A total of 655 compounds were in common and for them the CAESAR assignment was compared with the Leadscope one. The assignment of toxicity class for Leadscope chemicals was based on rat data only and chemicals have been classified as carcinogens if at least one of the two genders (male or female rat) was labelled in Leadscope as positive or intermediate level carcinogen.
Within this group of 655 compounds, the two assignments agreed for 367 positive chemicals and 257 non-carcinogenic ones. Only 31 compounds were classified differently (11 positive in the CAESAR dataset but negative in Leadscope, and 20 in the opposite situation); hence the overall concordance was above 95%.
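The concordance figure can be checked directly from the counts quoted above; the variable names in this short worked example are illustrative only.

```python
# Counts quoted in the text for the 655 compounds common to both sources.
agree_positive = 367
agree_negative = 257
disagree = 11 + 20  # CAESAR+/Leadscope- and CAESAR-/Leadscope+

total = agree_positive + agree_negative + disagree
concordance = (agree_positive + agree_negative) / total

print(total)                         # 655
print(round(100 * concordance, 1))   # 95.3
```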
Since the concordance between the two experimental sources is very high the Leadscope database can be considered as a reliable source of new compounds to test the CAESAR model.
After excluding the chemicals already present in the CAESAR dataset, it was possible to select an external test set of 738 compounds with experimental data on rats.
Those compounds have been submitted to the CAESAR model to obtain the predicted activity.
Description of carcinogenic potency
Carcinogenic potency for rats was selected as the response because such data are often considered in risk assessment to be more suitable for human carcinogenicity prediction. The term "carcinogen" generally refers to an agent, mixture, or exposure that increases the age-specific incidence of cancer. Carcinogen identification is an activity grounded in the evaluation of the results of scientific research. The tumourgenic dose is accepted for the characterization of carcinogenicity. The tumourgenic dose TD50 used in our study is defined as the tumourgenic dose rate at which 50% of the test animals developed any kind of cancer. In other words, the TD50 is the chronic dose rate (in mg/kg body weight/day or mmol/kg body weight/day (mmol/kg-bw/day)) which would induce tumors in half of the animals within some standard experiment time, the "standard lifespan" for the species. Chronic oral toxicity and carcinogenicity tests are described in "OECD Environment, Health and Safety Publications Series on Testing and Assessment No 35 Guidance Notes for Analysis and Evaluation of Chronic Toxicity and Carcinogenicity Studies".
The distribution of carcinogens and non-carcinogens in total, training and test sets.
Generation and selection of descriptors
Nowadays thousands of chemical descriptors such as constitutional, quantum chemical, topological, geometrical, charge related, semi-empirical, thermodynamic and others can be calculated for chemical structure [56, 57].
In the present study the following sets of descriptors were generated for the 805 compounds: 254 MDL descriptors computed using MDL QSAR version 2.2 and 835 Dragon descriptors calculated with the Dragon professional 5.4 software.
To develop robust and reliable models, the descriptor space should be reduced by extracting the most significant variables. Variable selection and reduction is a delicate problem. Before the number of descriptors was reduced, all variables were normalized between -1 and +1. In order to select the most relevant descriptors, different mathematical tools were used, as listed below.
The Hybrid Selection Algorithm (HSA) was developed by BioChemics Consulting SAS (BCX), France. This method was used to select, among the different series of molecular descriptors, the best parameters for classifying the chemicals by their carcinogenic potency (P, carcinogens; NP, non-carcinogens). It combines Genetic Algorithm (GA) [60, 61] concepts and a stepwise regression. In this way the descriptor space was reduced from 254 to the 8 MDL descriptors listed in Table 7. The initial 254 MDL descriptors codify molecular structure information as topological descriptors, including atom-type and group-type E-State and hydrogen E-state indices, molecular connectivity chi indices, topological polarity, and counts of molecular features [40–42]. Among the eight selected MDL descriptors there are two connectivity indices (dxp9, difference simple 9th order path chi indices, and nxch6, number of 6-membered rings), three constitutional descriptors (SdssC_acnt, count of all (= C <) groups in molecule; SdsN_acnt, count of all (= N) groups in molecule; and SHBint2_acnt, count of internal hydrogen bonds with 2 skeletal bonds between donor and acceptor) and three electrotopological parameters (SdsCH, sum of all (= CH -) E-State values in molecule; Gmin, smallest atom E-State value in molecule; and SHCsats, sum of hydrogen E-State on sp3 C on saturated bond).
Selection of Dragon descriptors was performed using cross-correlation matrix, multicollinearity and Fisher ratio techniques. This part of the work was done in cooperation with a CAESAR partner (Central Science Laboratory, CSL Defra, UK). As a result the descriptor space was reduced from 835 to the 12 Dragon descriptors listed in Table 8. Among the twelve Dragon descriptors there are two topological ones (PW5, path/walk 5 - Randic shape index, and D/Dr06, distance/detour ring index of order 6), one 2D autocorrelation index (MATS2p, Moran autocorrelation - lag 2/weighted by atomic polarizabilities), two edge adjacency indices (EEig10x, eigenvalue 10 from edge adj. matrix weighted by edge degrees, and ESpm11x, spectral moment 11 from edge adj. matrix weighted by edge degrees), two topological charge indices (GGI2, topological charge index of order 2, and JGI6, mean topological charge index of order 6), two descriptors from the "functional group count" group (nRNNOx, number of N-nitroso groups (aliphatic), and nPO4, number of phosphates/thiophosphates) and two descriptors from the "atom centered fragments" group (N-067, Al2-NH, and N-078, Ar-N = X/X-N = X).
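The normalization step can be sketched as below. The text only states that variables were normalized between -1 and +1; column-wise min-max scaling is assumed here, and `scale_to_unit_range` is a hypothetical helper name.

```python
import numpy as np

def scale_to_unit_range(X):
    """Min-max scale each column of X to [-1, 1].

    Assumption: column-wise min-max scaling; constant columns are mapped to 0.
    """
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return np.where(hi > lo, 2.0 * (X - lo) / span - 1.0, 0.0)

# Made-up descriptor matrix: rows = compounds, columns = descriptors.
X = [[0.0, 5.0], [5.0, 5.0], [10.0, 5.0]]
print(scale_to_unit_range(X))
```

For a new compound the scaling would have to reuse the training set minimum and maximum, so values outside [-1, 1] also signal that the compound lies outside the descriptor range of the training set.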
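One way such a filter could look is sketched below. This is not the procedure used by the CAESAR partners, whose details are not given in the text: the Fisher ratio definition, the greedy correlation filter, the 0.9 cut-off and the toy data are all assumptions made for illustration.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-descriptor Fisher ratio: (mean_pos - mean_neg)^2 / (var_pos + var_neg)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0) + 1e-12  # guard against zero variance
    return num / den

def drop_correlated(X, scores, max_r=0.9):
    """Visit descriptors by descending score; keep one only if its absolute
    correlation with every descriptor already kept is below max_r."""
    r = np.abs(np.corrcoef(np.asarray(X, dtype=float), rowvar=False))
    kept = []
    for j in np.argsort(scores)[::-1]:
        if all(r[j, k] < max_r for k in kept):
            kept.append(int(j))
    return kept

# Toy data: descriptor 1 is exactly twice descriptor 0 (fully collinear).
X = [[1.0, 2.0, 0.3], [1.2, 2.4, 0.1], [3.0, 6.0, 0.2], [3.2, 6.4, 0.4]]
y = [0, 0, 1, 1]
scores = fisher_ratio(X, y)
print(scores)
print(drop_correlated(X, scores))
```

Only one of the two collinear descriptors survives the filter, while the weakly discriminating but uncorrelated third descriptor is retained.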
CP ANN algorithm
In general terms, the CP ANN can be explained as follows. The input, or Kohonen, layer receives the input values, which are vectors representing chemical structures. The structure of the s-th compound, represented by m structural descriptors or "variables", can be expressed as $X_s = (x_{s1}, x_{s2}, \ldots, x_{sm})$. The output layer is associated with the output values, the so-called target $T_s = (t_{s1}, t_{s2}, \ldots, t_{sj}, \ldots, t_{sp})$, which is a p-component vector of zeros and ones. In our classification models the one-dimensional target expresses the carcinogenicity class (P, positive = 1; NP, not positive = 0). The neural network is trained to respond to each input structure representation $X_s$ from the training set with an output vector $Out_s$ identical to the target (class vector) $T_s$.
The Kohonen input layer of the CP ANN consists of nx × ny neurons. After learning, the objects are organized in such a way that similar objects are situated close to each other. It should be emphasized that only the input values participate in this phase of learning (the unsupervised step); no knowledge of the target vector is needed for it.
In the second step the positions of the objects are projected onto the output layer, where the weights are adjusted to the output values (the supervised step). The trained output layer consists of nx × ny output neurons arranged in a square neighborhood. After training, each weight $out_j$ of the output neurons is a real number between 0.0 and 1.0. For the final prediction of classes, the response surface values must be transformed back into the discrete values 0 and 1. A threshold value between 0.01 and 0.99 must be determined for each class.
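The two-step scheme described above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the implementation used for models A and B: the class name `CPANN`, the random weight initialization, the Euclidean winner selection, the linear decay of learning rate and neighborhood radius, the default threshold of 0.5 and the four-compound example data are all hypothetical choices made for illustration.

```python
import random


class CPANN:
    """Toy counter-propagation network with a one-dimensional output layer."""

    def __init__(self, nx, ny, m, seed=0):
        rng = random.Random(seed)
        self.nx, self.ny = nx, ny
        # Kohonen layer: one m-dimensional weight vector per neuron
        self.w = {(i, j): [rng.random() for _ in range(m)]
                  for i in range(nx) for j in range(ny)}
        # Output layer: one weight per neuron, started at the neutral value 0.5
        self.out = {pos: 0.5 for pos in self.w}

    def winner(self, x):
        # The winning neuron is chosen from the input values only
        return min(self.w, key=lambda pos: sum((a - b) ** 2
                                               for a, b in zip(self.w[pos], x)))

    def train(self, X, T, epochs=50, eta_start=0.5, eta_end=0.01):
        total, step = epochs * len(X), 0
        max_radius = max(self.nx, self.ny)
        for _ in range(epochs):
            for x, t in zip(X, T):
                frac = step / total
                eta = eta_start + (eta_end - eta_start) * frac  # assumed linear decay
                radius = max_radius * (1 - frac)                # shrinking neighborhood
                ci, cj = self.winner(x)
                for (i, j), wv in self.w.items():
                    d = max(abs(i - ci), abs(j - cj))           # square neighborhood
                    if d > radius:
                        continue
                    h = eta * (1 - d / (radius + 1))
                    for k in range(len(wv)):
                        wv[k] += h * (x[k] - wv[k])             # Kohonen layer update
                    # Output layer update at the same map positions
                    self.out[(i, j)] += h * (t - self.out[(i, j)])
                step += 1

    def predict(self, x, threshold=0.5):
        value = self.out[self.winner(x)]  # response surface value in [0, 1]
        return (1 if value >= threshold else 0), value


# Hypothetical two-descriptor data: two NP (0) and two P (1) compounds
X = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
T = [0, 0, 1, 1]
net = CPANN(3, 3, 2, seed=0)
net.train(X, T)
predicted = [net.predict(x)[0] for x in X]
```

Because each output weight is only ever moved a fraction of the way toward a target of 0 or 1, it stays between 0.0 and 1.0, and the `threshold` argument plays the role of the class threshold mentioned in the text.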
Parameters used for evaluation of classification model
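Confusion matrix for a two-class classifier (P: positive; N: negative).

| | Predicted negative | Predicted positive | Total |
| --- | --- | --- | --- |
| Actual negative | TN | FP | N negative = TN + FP |
| Actual positive | FN | TP | N positive = FN + TP |
| Total | TN + FN | FP + TP | N total = N negative + N positive |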
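Cooper statistics express the ability of a classification model to detect known active compounds (sensitivity), non-active compounds (specificity), and all chemicals in general (accuracy). In terms of the confusion matrix entries, these are given by equations 1-3:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (1)$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (2)$$

$$\mathrm{Accuracy\ (AC)} = \frac{TP + TN}{N_{\mathrm{total}}} \qquad (3)$$

AC is defined as the total number of non-carcinogens and carcinogens correctly predicted, divided by the total number of compounds.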
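As a small worked illustration, the Cooper statistics can be computed directly from the four confusion matrix counts. The function name `cooper_statistics` and the counts in the example are hypothetical and are not results from this study.

```python
def cooper_statistics(tn, fp, fn, tp):
    """Sensitivity, specificity and accuracy from confusion matrix counts."""
    n_negative = tn + fp          # all truly negative compounds
    n_positive = fn + tp          # all truly positive compounds
    n_total = n_negative + n_positive
    return {
        "sensitivity": tp / n_positive,    # share of positives detected
        "specificity": tn / n_negative,    # share of negatives detected
        "accuracy": (tp + tn) / n_total,   # share of all compounds predicted correctly
    }


# Hypothetical counts for 100 compounds (50 negative, 50 positive)
stats = cooper_statistics(tn=40, fp=10, fn=20, tp=30)
```

With these made-up counts the sensitivity is 30/50, the specificity is 40/50 and the accuracy is 70/100.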
Training and test sets were composed for the evaluation of the models. The training set provides the class values used for learning; the test set provides the class values used for evaluation. The hypothesis (the trained model) was used to establish the classification of the test set, which was then compared with the known one.
To evaluate the goodness-of-fit, or robustness, of the model, its internal performance on the training set (644 compounds) was assessed. Several diagnostic statistical tools were used to characterize the goodness-of-prediction, or predictability, of the models obtained. Firstly, the statistical performance on the test set (161 compounds) was calculated. Secondly, an internal "leave 20% out" cross-validation (CV) test [69, 70] was carried out. It was performed on the training set of 644 compounds, which was divided into five training sets, each containing 80% of the compounds, and five test sets, each containing 20% of the compounds. The sets were selected randomly in such a way that each compound was part of a test set exactly once and part of a training set four times.
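The "leave 20% out" partitioning can be sketched in Python as follows. This is only an illustration of the splitting rule stated above; the function name, the use of integer indices in place of compounds and the fixed random seed are assumptions.

```python
import random


def leave_20_percent_out(n_compounds, n_folds=5, seed=0):
    """Return n_folds (train, test) index pairs covering every compound once as test."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]                           # about 20% of the compounds
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))              # remaining ~80% for training
    return splits


# The training set in the text has 644 compounds
splits = leave_20_percent_out(644)
```

Since 644 is not divisible by five, four of the test sets contain 129 compounds and one contains 128; every index appears in exactly one test set and in the other four training sets.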
External validation is commonly used to assess the predictivity and reliability of a QSAR model [71, 72]. The predictive performance of QSAR models should therefore be evaluated using a validation set of compounds that were not used to generate the model. The validation set of 738 compounds was provided by the CAESAR project partner Istituto di Ricerche Farmacologiche "Mario Negri" (IRFMN), Milano, Italy, and was used to validate the models. The preparation of the external validation set is described in the Materials section.
In conclusion, the classification system was evaluated using the so-called internal training set (644 compounds) and test set (161 compounds), the leave 20% out cross-validation test, and the external validation set (738 compounds). The external set included chemicals that were not considered in the modeling.
The financial support of the European Union through CAESAR project (SSPI-022674) as well as of the Slovenian Ministry of Higher Education, Science and Technology (grant P1-017) is gratefully acknowledged. We also would like to thank G. Schüürmann, R. Kühne and Ralf-Uwe Ebert (Helmholtz Centre for Environmental Research, Leipzig, Germany (UFZ)) for their technical support in running the training/prediction set splitting. We would like to thank also Nadège Piclin, Marco Pintore and Jacques R Chrétien (BioChemics Consulting (BCX), OLIVET, France) for selection of 8 MDL descriptors using Hybrid Selection Algorithm (HSA) and Qasim Chaudhry and Jane Cotterill (Central Science Laboratory -CSL Defra, UK) for their contribution in selection of Dragon descriptors.
This article has been published as part of Chemistry Central Journal Volume 4 Supplement 1, 2010: CAESAR QSAR Models for REACH. The full contents of the supplement are available online at http://www.journal.chemistrycentral.com/supplements/4/S1.
- Price N: Hail Caesar. Chemistry & Industry. 2008, 15: 18-19.
- Benigni R, Giuliani A: Putting the Predictive Toxicology Challenge into perspective: reflections on the results. Bioinformatics. 2003, 19: 1194-1200. 10.1093/bioinformatics/btg099.
- Richard AM, Benigni R: AI and SAR approaches for predicting chemical carcinogenicity: survey and status report. SAR QSAR Environ Res. 2002, 13: 1-19. 10.1080/10629360290002055.
- Patlewicz G, Rodford R, Walker JD: Quantitative structure-activity relationships for predicting mutagenicity and carcinogenicity. Environ Toxicol Chem. 2003, 22: 1885-1893. 10.1897/01-461.
- Helguera AM, Perez MCA, Combes RD, González MP: The prediction of carcinogenicity from molecular structure. Curr Comput-Aided Drug Des. 2005, 1: 237-255. 10.2174/1573409054367655.
- Morales Helguera A, Cabrera Perez MA, Perez González M, Molina Ruiz R, Gonzalez-Diaz H: A topological substructural approach applied to the computational prediction of rodent carcinogenicity. Bioorg Med Chem. 2005, 13: 2477-2488. 10.1016/j.bmc.2005.01.035.
- Passerini L: QSARs for individual classes of chemical mutagens and carcinogens. The Quantitative Structure-Activity Relationship (QSARs). Models of mutagens and carcinogens. Edited by: Benigni R. 2003, Boca Raton, FL, USA: CRC Press, 81-123.
- Benigni R: Quantitative Structure-Activity Relationship (QSAR) Models of Mutagens and Carcinogens. 2003, Boca Raton, FL, USA: CRC Press, 286.
- Gini G, Lorenzini M, Benfenati E, Grasso P, Bruschi M: Predictive carcinogenicity: a model for aromatic compounds, with nitrogen-containing substituents, based on molecular descriptors using an artificial neural network. J Chem Inf Comput Sci. 1999, 39: 1076-1080.
- Klopman G, Chakravarti SK, Zhu H, Ivanov JM, Saiakhov RD: ESP: A method to predict toxicity and pharmacological properties of chemicals using multiple MCASE databases. J Chem Inf Comput Sci. 2004, 44: 704-715.
- Klopman G, Ivanov J, Saiakhov R, Chakravarti S: MC4PC - An artificial intelligence approach to the discovery of quantitative structure-toxic activity relationship. Predictive Toxicology. Edited by: Helma C. 2005, Boca Raton, FL, USA: CRC Press, 423-457.
- Matthews EJ, Contrera JF: A new highly specific method for predicting the carcinogenic potential of pharmaceuticals in rodents using enhanced MCASE QSAR-ES software. Regul Toxicol Pharmacol. 1998, 28: 242-264. 10.1006/rtph.1998.1259.
- Woo Y-T, Lai DY: OncoLogic: A mechanism-based expert system for predicting the carcinogenic potential of chemicals. Predictive Toxicology. Edited by: Helma C. 2005, Boca Raton, FL, USA: CRC Press, 385-413.
- Lagunin AA, Dearden JC, Filimonov DA, Poroikov VV: Computer-aided rodent carcinogenicity prediction. Mutat Res. 2005, 586: 138-146.
- Benfenati E, Gini G: Computational predictive programs (expert systems) in toxicology. Toxicology. 1997, 119: 213-225. 10.1016/S0300-483X(97)03631-7.
- Benigni R, Richard AM: Quantitative structure-based modeling applied to characterization and prediction of chemical toxicity. Methods. 1998, 14: 264-276. 10.1006/meth.1998.0583.
- Richard AM: Structure-based methods for predicting mutagenicity and carcinogenicity: are we there yet?. Mutat Res. 1998, 400: 493-507.
- Contrera JF, Matthews EJ, Benz RD: Predicting the carcinogenic potential of pharmaceuticals in rodents using molecular structural similarity and E-state indices. Regul Toxicol Pharmacol. 2003, 38: 243-259. 10.1016/S0273-2300(03)00071-0.
- Loew GH, Poulsen M, Kirkjian E, Ferrell J, Sudhindra BS, Rebagliati M: Computer-assisted mechanistic structure-activity studies: application to diverse classes of chemical carcinogens. Environ Health Perspect. 1985, 61: 69-96. 10.2307/3430063.
- Toropov AA, Benfenati E: SMILES in QSPR/QSAR modeling: Results and perspectives. Current Drug Discovery Technologies. 2007, 4 (2): 77-116. 10.2174/157016307781483432.
- Toropov AA, Toropova AP, Benfenati E, Manganaro A: QSAR modelling of carcinogenicity by balance of correlations. Molecular Diversity. 2009.
- Toropov AA, Toropova AP, Benfenati E: QSAR modelling for mutagenic potency of heteroaromatic amines by optimal SMILES-based descriptors. Chemical Biology and Drug Design. 2009, 73 (3): 301-312. 10.1111/j.1747-0285.2009.00778.x.
- Lewis DFV, Bird MG, Jacobs MN: Human carcinogens: an evaluation study via the COMPACT and HazardExpert procedures. Human & Experimental Toxicology. 2002, 21 (3): 115-122. 10.1191/0960327102ht233oa.
- Marchant CA: Prediction of rodent carcinogenicity using the DEREK system for 30 chemicals currently being tested by the National Toxicology Program. The DEREK Collaborative Group. Environ Health Perspect. 1996, 104 (Suppl 5): 1065-1073. 10.2307/3433032.
- Benigni R, Bossa C, Tcheremenskaia O, Worth A: Development of structural alerts for the in vivo micronucleus assay in rodents. EUR 23844 EN. 2009, 1-43.
- Benigni R, Bossa C: Predictivity of QSAR. J Chem Inf Model. 2008, 48: 971-980. 10.1021/ci8000088.
- Benfenati E, Benigni R, DeMarini D, Helma C, Kirkland D, Martin TM, Mazzatorta P, Ouedrago-Arras G, Richard AM, Schilter B, Schoonen WG, Snyder RD, Yang C: Predictive models for carcinogenicity: frameworks, state-of-the-art, and perspectives. J Environ Sci Health C. 2009, 27: 57-90. 10.1080/10590500902885593.
- Fjodorova N, Vračko M, Tušar M, Jezierska A, Novič M, Kühne R, Schüürmann G: Quantitative and qualitative models for carcinogenicity prediction for non-congeneric chemicals using CP ANN method for regulatory uses. Mol Divers. 2009, (Published online: 15 August 2009).
- Taskinen J, Yliruusi J: Prediction of physicochemical properties based on neural network modeling. Adv Drug Delivery Rev. 2003, 55: 1163-1183. 10.1016/S0169-409X(03)00117-0.
- CAESAR project. [http://www.caesar-project.eu]
- OECD principles. [http://appli1.oecd.org/olis/2007doc.nsf/linkto/env-jm-mono(2007)2]
- AnnToolbox for Windows: National Institute of Chemistry, Ljubljana, Slovenia. [http://www.ki.si/en/display-pages/equipment/?tx_ukki_pi1%5Buid%5D=318&cHash=e267f7b447]
- Netzeva TI, Worth AP, Aldenberg T, Benigni R, Cronin MTD, Gramatica P, Jaworska JS, Kahn S, Klopman G, Marchant CA, Myatt G, Nikolova-Jeliazkova N, Patlewicz GY, Perkins R, Roberts DW, Schultz TW, Stanton DT, van de Sandt JJM, Tong W, Veith G, Yang C: Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. The report and recommendations of ECVAM Workshop 52. ATLA. 2005, 33: 155-173.
- CAESAR Application. [http://www.caesar-project.eu/software/]
- Benigni R, Bossa C, Jeliazkova N, Netzeva TI, Worth AP: The Benigni/Bossa rulebase for mutagenicity and carcinogenicity - a module of Toxtree. EUR 23241 EN. 2008, 1-70.
- Hall LH: A Structure-Information Approach to the Prediction of Biological Activities and Properties. Chemistry & Biodiversity. 2004, 1 (1): 183-201. 10.1002/cbdv.200490010.
- Kier LB, Hall LH: The Prediction of ADMET Properties Using Structure Information Representations. Chemistry & Biodiversity. 2005, 2 (11): 1428-1437. 10.1002/cbdv.200590116.
- Hall LH, Hall LM: QSAR modeling based on structure-information for properties of interest in human health. SAR QSAR Environ Res. 2005, 16 (1-2): 13-41. 10.1080/10629360412331319853.
- Rose K, Hall LH: E-State Modeling of Fish Toxicity Independent of 3D Structure Information. SAR QSAR Environ Res. 2003, 14: 113-129. 10.1080/1062936031000073144.
- Kier LB, Hall LH: Molecular Structure Description: The Electrotopological State. 1999, Academic Press, New York.
- Kier LB, Hall LH: Database organization and searching with E-state indices. SAR QSAR Environ Res. 2001, 12: 55-74. 10.1080/10629360108035371.
- Kier LB, Hall LH: The electrotopological state: structure modeling for QSAR and database analysis. Topological Indices and Related Descriptors in QSAR and QSPR. Edited by: Devillers J, Balaban AT. 1999, Gordon and Breach, Reading, UK, 491-562.
- Contrera JF, Hall LH, Kier LB, MacLaughlin P: QSAR Modeling of Carcinogenic Risk Using Discriminant Analysis and Topological Molecular Descriptors. Current Drug Discovery Technologies. 2005, 2: 55-67. 10.2174/1570163054064684.
- Votano JR, Parham M, Hall LH, Kier LB, Orloff S, Tropsha A, Xie Q, Tong W: Three New Consensus QSAR Models for the Prediction of Ames Genotoxicity. Mutagenesis. 2004, 19: 365-378. 10.1093/mutage/geh043.
- Ashby J, Tennant RW: Definitive relationships among chemical structure, carcinogenicity and mutagenicity. Mutat Res. 1991, 257 (3): 229-306.
- Tennant RW, Zeiger E: Genetic toxicology: current status of methods of carcinogen identification. Environ Health Perspect. 1993, 100: 307-315. 10.2307/3431536.
- ChemFinder. [http://chemfinder.cambridgesoft.com/]
- ChemIDPlus. [http://chem.sis.nlm.nih.gov/chemidplus/]
- PubChem. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pccompound]
- Schüürmann G, Kühne R, Kleint F, Ebert R-U, Rothenbacher C, Herth P: A software system for automatic chemical property estimation from molecular structure. Quantitative Structure-Activity Relationships in Environmental Sciences. Edited by: Chen F, Schüürmann G. 1997, Pensacola, FL: SETAC Press, VII: 93-114.
- Schüürmann G, Ebert R-U, Nendza M, Dearden JC, Paschke A, Kühne R: Prediction of fate-related compound properties. Risk Assessment of Chemicals. An Introduction. Edited by: van Leeuwen K, Vermeire T. 2007, Dordrecht, NL: Springer Science, 375-426.
- ChemFinder Ultra 10.0. CambridgeSoft Corp, Cambridge, MA. FDA 2009 SAR Carcinogenicity database, Leadscope Inc., Columbus, OH.
- Combes R, Grindon C, Cronin MT, Roberts DW, Garrod JF: Integrated decision-tree testing strategies for mutagenicity and carcinogenicity with respect to the requirements of the EU REACH legislation. ATLA. 2008, 36: 43-63.
- Peto R, Pike MC, Bernstein L, Gold LS, Ames BN: The TD50: A proposed general convention for the numerical description of the carcinogenic potency of chemicals in chronic-exposure animal experiments. Environ Health Perspect. 1984, 58: 1-8. 10.2307/3429856.
- OECD: OECD Environment, Health and Safety Publications Series on Testing and Assessment No 35 Guidance Notes for Analysis and Evaluation of Chronic Toxicity and Carcinogenicity Studies. Paris, France. 2002.
- Todeschini R, Consonni V: Handbook of Molecular Descriptors. 2000, Wiley-VCH, New York.
- Tetko I, Gasteiger J, Todeschini R, Mauri A, Livingstone D, Ertl P, Palyulin V, Radchenko E, Zefirov N, Makarenko A, Tanchuk V, Prokopenko V: Virtual computational chemistry laboratory - design and description. J Comput-Aided Mol Des. 2005, 19: 453-463. 10.1007/s10822-005-8694-y.
- MDL-QSARv version 2.2., MDL Information Systems Inc., San Leandro, CA. 94577; 2002-2004. [http://www.drugdiscoveryonline.com/storefronts/mdl.html]
- Dragon home page. (Accessed 1 November 2009), [http://www.talete.mi.it/products/dragon_description.htm]
- Kinnear KE: Advances in Genetic Programming. 1994, MIT Press, Cambridge, USA.
- Haupt RL, Haupt SE: Practical Genetic Algorithms. 1999, Wiley, New York, USA.
- Ros F, Pintore M, Chrétien JR: Molecular description selection combining genetic algorithms and fuzzy logic: application to database mining procedures. Chemom Intell Lab Syst. 2002, 63: 15-26. 10.1016/S0169-7439(02)00033-3.
- Hill T, Lewicki P: STATISTICS Methods and Applications: StatSoft, Tulsa, OK; 2007 or electronic version: Electronic Statistics Textbook. 2007, StatSoft, Inc. Tulsa, OK: StatSoft, [http://www.statsoft.com/textbook/stathome.html]
- Zupan J, Novič M, Ruisanchez I: Kohonen and counterpropagation artificial neural networks in analytical chemistry. Chemom Intell Lab Syst. 1997, 38: 1-23. 10.1016/S0169-7439(97)00030-0.
- Zupan J, Gasteiger J: Neural Networks in Chemistry and Drug Design. 1999, Wiley-VCH Verlag GmbH, Weinheim, 2nd edition.
- Zupan J, Novič M, Ruisanchez I: Kohonen and counterpropagation artificial neural networks in analytical chemistry. Chemom Intell Lab Syst. 1997, 38: 1-23. 10.1016/S0169-7439(97)00030-0.
- Mazzatorta P, Vračko M, Jezierska A, Benfenati E: Modeling toxicity by using supervised Kohonen neural networks. J Chem Inf Comput Sci. 2003, 43: 485-492.
- Cooper JA, Saracci R, Cole P: Describing the validity of carcinogen screening test. Br J Cancer. 1979, 39: 87-89.
- Eriksson L, Johansson E, Wold S: QSAR model validation. Quantitative Structure-Activity Relationships in Environmental Sciences VII. Proceedings of the 7th International Workshop on QSAR in Environmental Sciences 24-28 June 1997. Edited by: Chen F, Schüürmann G. 1996, Elsinore, Denmark, SETAC Press, Pensacola, FL, 381-397.
- Eriksson L, Jaworska JS, Worth AP, Cronin MTD, McDowell RM, Gramatica P: Methods for reliability, uncertainty assessment, and applicability evaluations of classification and regression based QSARs. Environ Health Perspect. 2003, 111: 1361-1375.
- Perkins R, Rang H, Tong W, Welsh WJ: Quantitative structure - activity relationship methods: perspectives on drug discovery and toxicology. Environ Toxicol Chem. 2003, 22: 1666-1679. 10.1897/01-171.
- Golbraikh A, Tropsha A: Beware of q2!. J Mol Graph Model. 2002, 20: 269-276. 10.1016/S1093-3263(01)00123-1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.