A novel exploratory chemometric approach to environmental monitoring by combining block clustering with Partial Least Square (PLS) analysis

Background Given the serious threats posed to terrestrial ecosystems by industrial contamination, environmental monitoring is a standard procedure used for assessing the current status of an environment or trends in environmental parameters. Measurement of metal concentrations at different trophic levels followed by their statistical analysis using exploratory multivariate methods can provide meaningful information on the status of environmental quality. In this context, the present paper proposes a novel chemometric approach to standard statistical methods by combining Block clustering with Partial least square (PLS) analysis to investigate the accumulation patterns of metals in anthropized terrestrial ecosystems. The present study focused on copper, zinc, manganese, iron, cobalt, cadmium, nickel, and lead transfer along a soil-plant-snail food chain, and the hepatopancreas of the Roman snail (Helix pomatia) was used as a biological end-point of metal accumulation. Results Block clustering delineates between the areas exposed to industrial and vehicular contamination. The toxic metals have similar distributions in the nettle leaves and snail hepatopancreas. PLS analysis showed that (1) zinc and copper concentrations at the lower trophic levels are the most important latent factors that contribute to metal accumulation in land snails; (2) cadmium and lead are the main determinants of pollution pattern in areas exposed to industrial contamination; (3) at the sites located near roads, lead is the most threatening metal for terrestrial ecosystems. Conclusion Applying block clustering with PLS to the obtained data had three major benefits. Firstly, it helped in grouping sites depending on the type of contamination. Secondly, it was valuable for identifying the latent factors that contribute the most to metal accumulation in land snails. Finally, it optimized the number and type of data that are best for monitoring the status of metallic contamination in terrestrial ecosystems exposed to different kinds of anthropic pollution.


Background
In recent years it has become increasingly clear that industrial contamination is leading to serious threats to terrestrial ecosystems, thus endangering human and environmental health. Generally speaking, industrial contamination is related to any type of waste released into the environment from human industrial activities [1].
Metal contamination is, however, of particular interest because of the highly toxic properties of metals and their potential side-effects on ecosystem function and integrity [2]. Although this form of contamination dates back to antiquity, widespread industrial contamination accelerated rapidly with the start of the Industrial Revolution (the 1800s) and is currently regarded as a serious problem in many countries [1]. Among the major sources of metal release, we mention several key human activities, such as mining operations, siderurgy, burning coal and oil in power plants, chemical industry, vehicular traffic, and intensive agriculture [1].
In this context, environmental monitoring is a standard procedure used for assessing the current status of an environment or trends in environmental parameters [2]. To determine the risks posed by metals to terrestrial ecosystems, one should understand their fate along food chains. Briefly, metals are easily accumulated in soils, wherein they may persist over long periods of time [3]. The transfer process begins with the uptake of metals by the primary producers (green plants and bacteria), and continues to the next trophic level, the primary consumers (i.e., herbivores). Measurement of metal concentrations at these trophic levels can provide meaningful information concerning the status of environmental quality and ecosystem health at a specific moment in time [2,4], but only if the large amount of data resulting from such work is handled using appropriate chemometric approaches. Therefore, the present article deals with the most appropriate statistical methods to understand the factors contributing to contamination and metal accumulation in terrestrial ecosystems. Such questions are important in modern environmental science, especially for preliminary environmental impact assessment, when researchers use descriptive applications to identify the underlying relationships between metal concentrations at different or the same trophic levels.
To this end, many studies have relied on exploratory multivariate analysis to extract reliable information for environmental quality assessment [5-8]. In most cases the research question of interest in environmental chemistry and monitoring is expressed in terms of variables and cases (observations). A commonly used method to assess the similarity among different cases/variables is Hierarchical cluster analysis (HCA), also known as Tree clustering. This statistical technique reveals natural groupings (or clusters) within relatively large data sets based on measured characteristics. The graphical output is a dendrogram that shows how variables/cases are merged on one axis, whereas the other axis gives the distance at which any two clusters are joined [9]. However, this statistical technique does not allow environmental researchers to group cases and variables simultaneously. Clustering both by applying two-way joining clustering (syn. Block clustering) may yield relevant results not only for detecting clusters of cases with a similar magnitude of the measured variables, but also for exploring the underlying relationships between these variables. Therefore, we propose that Block clustering (BC) may provide an interesting and powerful statistical approach in environmental monitoring when researchers want to identify the similarity between different cases and variables simultaneously.
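The idea of grouping cases and variables at the same time can be illustrated with a minimal sketch. It is a simplified stand-in, not the two-way joining algorithm of any particular software package: rows (sites) and columns (variables) are clustered independently by average-linkage agglomeration on the squared Euclidean distance, and the matrix is then reordered so that similar rows and similar columns sit next to each other. The function names and the toy data are hypothetical and are not taken from the study.

```python
from itertools import combinations

def sq_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def leaf_order(vectors):
    """Average-linkage agglomerative clustering.

    Returns the item indices in the order produced by successive merges,
    so that items merged early end up adjacent.
    """
    clusters = [[i] for i in range(len(vectors))]

    def dist(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(sq_euclidean(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]

def two_way_reorder(matrix):
    """Reorder rows and columns so that similar ones are adjacent."""
    rows = leaf_order(matrix)
    cols = leaf_order([list(col) for col in zip(*matrix)])
    reordered = [[matrix[r][c] for c in cols] for r in rows]
    return rows, cols, reordered

# Hypothetical standardized values: 4 sites (rows) x 3 variables (columns).
data = [
    [1.0, 0.9, -1.0],
    [-1.0, -0.8, 1.0],
    [0.9, 1.0, -0.9],
    [-0.9, -1.0, 0.8],
]
rows, cols, reordered = two_way_reorder(data)
print("row order:", rows)
print("column order:", cols)
```

Colouring the reordered matrix by value would give a heat map in which blocks of similar sites and variables become visible, which is the kind of output the two-way approach is meant to provide.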
When investigating large sets of data, it is beneficial to reduce their dimensionality in order to improve the efficiency and accuracy of data analysis [10]. Principal component analysis (PCA) is commonly used for this purpose in environmental research [7,11,12], but does not allow scientists to distinguish between the predictor and response variables [9]. Another statistical method, Exploratory factor analysis (EFA), uncovers the underlying structure for large sets of variables based on the shared variances among factors [13,14], but is sensitive to sample size, i.e., the sample size must be at least threefold higher than the number of variables [15]. Partial least square (PLS) may represent a solution where such multivariate methods fail. This technique is routinely used in chemometric analysis when a large number of independent variables (>1000) are obtained with respect to a small number of samples (10 to 100) [16]. Depending on the objective of the study, PLS can serve either as a principal component technique, correlation technique, path modeling technique, or canonical correlation technique [17]. Overall, we suggest that this statistical method is a potential approach in environmental monitoring surveys for exploratory modeling of data sets with a large number of variables, but a moderate sample size (n = 20-50). Such situations are often encountered in baseline field surveys, which document the environmental conditions that exist at a specific moment in time to provide background in case of unknown changes in the future [18].
The present study enlarges our earlier survey [19] to include four additional metals (i.e., Mn, Fe, Ni, Co in addition to Cu, Zn, Cd, Pb) when investigating metal accumulation along the soil-plant-snail food chain. These metals were chosen because they are known to serve as vital and/or toxic elements, depending on concentration, chemical and physical form. Nickel is regarded as having no obvious physiological role in plants and animals [20]. In contrast, manganese, iron, and cobalt act mainly as essential micronutrients, but their occurrence at high levels can represent a potentially serious hazard for environmental health [20]. This is of particular interest for Co and Mn, which, together with Ni, rank among the most dangerous 200 chemical compounds released in the environment from human activities according to the 2011 Substance Priority List of the US Agency for Toxic Substances and Disease Registry [21].
The Roman snail (Helix pomatia) was considered in the present study because this terrestrial gastropod concentrates high metal levels in its soft tissues without revealing any major metabolic disorders and serves as a major herbivore in terrestrial ecosystems [19]. The main purpose of this paper was to introduce a novel chemometric approach in environmental monitoring for understanding the similarities among sites/cases and determining the principal latent factors that influence metal accumulation in biological end-points by combining Block clustering with PLS and using a soil-plant-snail food chain as the study system.

Results and discussion
The levels to which metals accumulate at different trophic levels, the normal content (NC) and alert threshold level (ATV) in soil for each investigated metal are shown as absolute values in Table 1. The standardized values at each trophic level were normally distributed for all investigated metals (p > 0.05). It was found that the concentrations of metals in soils were within the normal levels at all sites, except for Cd, which occasionally exceeded the reference value from the Romanian Soil Quality regulations [22], but did not reach the corresponding alert threshold level (ATV). These results showed that anthropic activities have a relatively low impact on soil metal concentrations in the study areas (Figures 1 and 2).
The dendrogram groups the environmental variables on the x-axis using the squared Euclidean distance as a criterion of similarity (Figure 3). At the abiotic level we can observe similar distributions in the soil among the total concentrations for Cu, Zn, Mn, and Ni, and for Cd and Co, respectively. Such relationships may reflect the fact that these metals share common anthropogenic sources, such as combustion of coal and heavy fuel oil, municipal waste incineration, vehicular traffic, chemical plants, ferrous and non-ferrous metal production [23]. As a result of having no functional role in plants and terrestrial gastropods [24,25], the measured values for Ni, Cd, and Pb fall close to each other on the x-axis both in the nettle leaves and snail hepatopancreas. Similar associations are also found between the essential trace metals. Because Cu, Co, and Mn are essential regulators of plant growth and development [24], they are clustered near each other in the nettle leaves. There is a close association between the Mn and Fe accumulation in the hepatopancreas; this is mainly related to the fact that these essential microelements follow similar metabolic pathways in land snails [25]. Copper is exclusively regulated in H. pomatia by a specific metallothionein [26], and therefore, its distribution in the hepatopancreas is independent of any other metals. This element is essential for land snails because it is a component of the chromoprotein hemocyanin, which is essential to their respiration [25]. The y-axis clusters the sites with similar distribution of metals at different trophic levels (Figure 3). Combining the x- and y-axes reveals information about the underlying similarities among sites, which cannot be provided by applying the Tree Clustering method. Green colors represent lower than average values and yellow to brown the opposite.
The first sampling point (THM1) is located near the Sag-Parta landfill, whereas the second sampling point (THM2) lies within the South Industrial Platform, about 100 m away from the Timisoara Sud Power Plant. As a result of being located close to each other (about 1 km), these sites showed a similar pattern of metal accumulation at all trophic levels, which is associated with exposure to the same source of environmental pollution (i.e., South Industrial Platform Timisoara).
The site THM1 regularly exhibited the highest metal concentrations among different locations, irrespective of trophic level. This site does not have engineered systems for collecting landfill leachate or gases, and as a consequence, it is considered a class B landfill that is suitable to accept only general domestic and commercial waste [27]. Although this landfill was officially closed in 2009 [28], it remains a serious pollution hotspot in the Timisoara area, as shown by our results (Table 1). Similarly, the site THM2 revealed higher metal concentrations along the soil-plant-snail food chain as compared to the other investigated sites. This site lies near multiple sources of anthropic pollution, and as a consequence, such findings are not surprising. Our results are in line with recent studies, which found high levels of metals in vegetables from areas adjacent to the South Industrial Platform Timisoara [29].
The third sampling site (THM3) lies near the former "Solventul Timisoara" petrochemical works. This site falls close to the fourth sampling point (THM4), which is located within the East Industrial Platform Timisoara. These locations share a similar pattern of Zn accumulation in the soil and nettle leaves (Figure 3). Because both sites are known for long-term exposure to chemical and petrochemical industries, the routine use of Zn compounds in these industries (e.g., zinc oxide as pigment in paint industry or catalysts in the manufacture of rubber) may explain our findings [30].
The fifth sampling point (THM5) is placed in a wooded area, near the Communal Road DC64 (Timisoara-Ghiroda). The seventh site (THM7) lies in the city of Otelu Rosu, about 50 m away from the National Road DN68 (Caransebes-Hateg) and 150 m from the former Otelu Rosu steel works. Interestingly, these two sites are clustered near each other on the vertical axis although they are placed in different counties (i.e., the site THM5 lies more than 100 km away from the site THM7). Although manganese is one of the most abundant metals in soils, its deposition in soil was also shown to be associated with trafficked roads [31]; therefore, the moderate concentrations of Mn that are found in the soil at both sites may be linked to a similar intensity of vehicular traffic along the DC64 and DN68 roads (Figure 3).
The reference site (THR) corresponds to an area located away from major sources of pollution [32], about 100 m away from the communal road which connects the National Road DN58 to the village of Salbagelu Nou, whereas the sixth site (THM6) lies along the National Road DN58 (Resita-Caransebes). The latter location generally shows higher metal levels than the site THR, which are related to the cumulative action of long-term exposure to vehicular traffic and metallic contamination (Resita steel works). Overall, we can observe that the dendrogram obtained by using the Block clustering method separates the sites into two groups (Figure 3). The first group (G1) contains the sites located on industrial platforms from Timisoara (i.e., THM1-THM4), whereas the second group (G2) includes the sites located within 100 m of roads with different intensity of vehicular traffic (i.e., THM5-THM7, THR).
The exploratory PLS analysis for all sites extracted two significant latent factors (Table 2a), which explained 58.11% of the variance of response variables and 52.41% of the variance of predictor variables, respectively. The weights of predictor variables determine the VIP (Variable Importance in the Projection), which shows the statistical contribution of each variable in fitting the PLS model. Variables for which the VIP scores are less than 0.8 are regarded as being of small importance [33], and therefore, one can conclude that copper and zinc concentrations in soil and nettle leaves contribute the most to the trace metal accumulation pattern in land snails (Figure 4a). These findings are consistent with land snail physiology, wherein copper is a key player in metabolic activities [34].
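To make the link between predictor weights and VIP scores concrete, the following is a minimal sketch under several assumptions: it fits a single-response PLS model (PLS1) with the NIPALS algorithm, whereas the study used several response variables and a commercial package whose internals may differ. The VIP of predictor j is computed as the square root of p times the sum over factors of SS_a w_ja^2, divided by the sum of SS_a, where p is the number of predictors, w_a is the unit-length weight vector of factor a, and SS_a is the response sum of squares explained by that factor. The function name and the toy data are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def pls1_vip(x_rows, y, n_factors):
    """VIP scores from a NIPALS PLS1 fit (single response, centered data)."""
    cols = [center(list(c)) for c in zip(*x_rows)]  # predictors, column-wise
    resid = center(list(y))                         # response residual
    weights, explained = [], []
    for _ in range(n_factors):
        # Weight vector: covariance of each predictor with the residual.
        w = [dot(col, resid) for col in cols]
        norm = math.sqrt(dot(w, w))
        w = [wj / norm for wj in w]
        # Scores, then regression of residual and predictors on the scores.
        t = [sum(wj * col[i] for wj, col in zip(w, cols)) for i in range(len(resid))]
        tt = dot(t, t)
        q = dot(resid, t) / tt
        loadings = [dot(col, t) / tt for col in cols]
        # Deflate predictors and response before extracting the next factor.
        cols = [[x - ti * pj for x, ti in zip(col, t)] for col, pj in zip(cols, loadings)]
        resid = [r - q * ti for r, ti in zip(resid, t)]
        weights.append(w)
        explained.append(q * q * tt)  # response sum of squares explained
    p = len(cols)
    total = sum(explained)
    return [
        math.sqrt(p * sum(ss * w[j] ** 2 for ss, w in zip(explained, weights)) / total)
        for j in range(p)
    ]

# Hypothetical data: 6 samples, 3 predictors; y depends mostly on predictor 0.
X = [[1, 5, 2], [2, 3, 1], [3, 6, 3], [4, 2, 2], [5, 4, 1], [6, 1, 3]]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
vip = pls1_vip(X, y, n_factors=2)
for j, score in enumerate(vip):
    flag = "important" if score > 0.8 else "small"  # 0.8 cut-off from the text
    print(f"predictor {j}: VIP = {score:.3f} ({flag})")
```

Because each weight vector has unit length, the squared VIP scores sum to the number of predictors, so a score near 1 corresponds to an average contribution.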
We reran the PLS analysis with all predictor variables on each of the two groups that were obtained by applying Block clustering. The physiological metals (i.e., Cu, Zn, Mn, Fe, Co) were removed from our analysis because the levels to which these microelements accumulate in soils were generally within NC values at all sites (Table 1). Among the toxic metals (i.e., Ni, Cd, Pb), priority was given to those elements which showed high VIP scores (VIP > 0.8) for concentrations in both the soil and nettle leaves. Reducing the number of independent variables implies that fewer terms are needed in the expansion to find the metals with the highest/lowest risk to environmental and human health in the study areas. The PLS model for the G1 sites had five significant factors, which explained 87.12% of the variance of response variables and 95.05% of the variance of predictor variables, respectively (Table 2b). The importance of most predictor variables was high (VIP > 0.8), except for Mn and Ni levels in the soil, and Co content in the soil and U. dioica leaves (Figure 4b). It can be seen from Figure 4b that Cd and Pb concentrations, in contrast with Ni levels, have high VIP scores in both the soil and nettle leaves. Therefore, the G1 sites should be closely monitored in the future with respect to Cd and Pb accumulation along terrestrial food chains. Our results are consistent with recent studies, which found that Cd and Pb are the main determinants of pollution pattern in the Timisoara area [29,35].
Table 2 Results of the exploratory PLS analysis for all sites (a), the G1 sites (b) and the G2 sites (c)
For the G2 sites, the PLS analysis extracted five significant factors, which accounted for 86.76% of the variation of response variables and 96.04% of the variation of predictor variables, respectively (Table 2c). All physiological metals showed high VIP scores in the soil and nettle leaves (Figure 4c). Among toxic metals, Pb was the only element for which the concentrations in soils and nettle leaves displayed high VIP scores (Figure 4c). These findings are not surprising since Pb is among the most common heavy metals (together with Cu and Zn) released from vehicular traffic [36].

Conclusions
The legacy of metals released to the environment from human activities puts increasing pressure on terrestrial ecosystems from anthropized areas. To this end, not only finding novel analytical methods for determining the degree of environmental contamination, but also employing new statistical approaches for analyzing environmental data provide scientists with powerful tools in environmental monitoring and assessment.
In the present study, we show that applying Block clustering with PLS analysis is a simple and intuitive procedure that allows researchers to: (1) overcome the drawbacks imposed by the graphical representation of environmental variables (although such charts are useful in environmental monitoring, they obscure what the data tell us when too many variables are illustrated within the same chart); (2) group sites with similar patterns of metal accumulation at different trophic levels, thus allowing environmental researchers to separate the study areas depending on the type of contamination and to understand the underlying similarities among them; (3) select the most significant latent factors (based on VIP values) which explain metal accumulation in biological end-points; (4) optimize the analytical procedures by selecting for future investigations only the toxic metals with high VIP scores, thus reducing the analytical costs (in this case, the emphasis falls on contamination with Cd and Pb on industrial platforms near Timisoara and with Pb near roads); (5) provide a benchmark for building exploratory models with potential applications in environmental monitoring surveys when researchers have to deal with data sets having a large number of variables, but a small sample size.

Experimental
A detailed description of the location of sampling sites, the preparation of samples, and the analysis of metals is provided in our previous work [19]. Briefly, the samples were collected in triplicate for each trophic level (soil, nettle, snail) from eight sites located in the western part of Romania, in the Banat area (Timis and Caras-Severin counties). All locations have been exposed to long-term industrial pollution (> 30 years), and lie at most 10 km away from former and/or current sources of anthropic contamination. The reference area (site THR), the village of Salbagelu Nou (Caras-Severin county), is located in a non-polluted area, with less industry [31]. The sites THM1-THM5 are located around the city of Timisoara, the most populated and industrialized city in the Banat area. The sites THM6 and THM7 are located along trafficked roads, in an area which was exposed for more than two centuries to the impact of metallurgical industry [37]. For each sampling plot at least 60 newly matured Helix pomatia specimens were collected, fasted for 48 h and sacrificed by freezing (at −20°C). To provide homogeneous samples, the snails were calibrated based on the shell height, which was shown to serve as a more accurate predictor of snail size than the shell width [19]. The measurements were performed with a digital caliper to the nearest 0.01 mm. After defrosting, the whole soft body was removed from the shell and the viscera and the foot were separated. Only the snail hepatopancreas was considered in the present study because this organ was found to serve as the main end-point of metal accumulation for H. pomatia [19]. The samples were analysed in triplicate for each location, and 20 snails were used for each batch. The selected food chain included nettle as the main food source of the Roman snail (Helix pomatia), based on observations of snail feeding habits in the investigated areas.
For each location three samples from the top leaves were collected, rinsed in distilled water to wash off potential air pollutants, and then oven dried at 105°C to constant weight. The samples were crushed with a mortar, passed through a 2 mm sieve, and preserved in self-sealing sterile paper pouches at room temperature (t = 22°C). The soil samples were collected (25 g/sample in triplicate) from the top 15 cm layer after removal of vegetation (grass). After removing roots and litter, they were dried (t = 22°C, 7 days), disaggregated, homogenized before being sieved to 2 mm (soil metal concentration analysis), and then stored at ambient temperature (t = 22°C) for further analysis [19].

Statistical analysis
Because the metal concentrations differed by several orders of magnitude, the measured values were standardized as follows: $\mathrm{STV} = (\mathrm{RV} - \mathrm{MV}) / \mathrm{STD}$, where STV defines the standardized value, RV the raw value, MV the mean value, and STD the standard deviation. As a result, the data were displayed on a scale from −1 to +1. The standardized data were then checked for normality using both the Anderson-Darling test (for comparing distribution functions) and the Jarque-Bera test (which assesses the skewness and kurtosis of a distribution). We performed a cluster analysis (Block clustering) to identify the sites with similar patterns of metal accumulation at different trophic levels and to explore relationships between these variables. Exploratory PLS analysis was used for assessing the principal latent variables (factors) underlying metal accumulation in the snail hepatopancreas. The analysis was carried out for all sites taken together, as well as separately for each cluster of sites obtained by applying the Block clustering method. Metal concentrations in the soil and nettle leaves were considered as independent/predictor variables (X of the PLS matrix), whereas their levels in the snail hepatopancreas were taken into account as dependent/outcome variables (Y of the PLS matrix). To validate the PLS model, the number of extracted factors was chosen through 10-fold cross validation, i.e. fitting the model to part of the data and minimizing the prediction error for the unfitted part [17]. Statistical analyses were performed by using the Statistica 10 software package [38]. All data are presented as the mean ± SD for the absolute measured values.
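The standardization step and one of the normality checks described above can be sketched as follows. This is an illustration only: it assumes the sample standard deviation (n − 1 denominator), uses the textbook Jarque-Bera statistic with its asymptotic chi-square p-value (2 degrees of freedom), which is known to be unreliable for small samples, and runs on hypothetical concentrations rather than the measured data. The Anderson-Darling test is not reproduced here.

```python
import math
import statistics

def standardize(raw_values):
    """STV = (RV - MV) / STD for each raw value (sample SD assumed)."""
    mv = statistics.mean(raw_values)
    std = statistics.stdev(raw_values)
    return [(rv - mv) / std for rv in raw_values]

def jarque_bera(values):
    """Jarque-Bera statistic and asymptotic p-value.

    JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and
    K the sample kurtosis. Under normality JB is asymptotically chi-square
    with 2 degrees of freedom, whose survival function is exp(-JB / 2).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return jb, math.exp(-jb / 2)

# Hypothetical metal concentrations (arbitrary units).
raw = [12.1, 15.3, 9.8, 14.2, 11.0, 13.5, 10.4, 16.0]
stv = standardize(raw)
jb_raw, _ = jarque_bera(raw)
jb_std, p_value = jarque_bera(stv)
print("standardized:", [round(v, 3) for v in stv])
print(f"JB = {jb_std:.3f}, p = {p_value:.3f}")
```

Skewness and kurtosis are unchanged by subtracting the mean and dividing by the standard deviation, so the test gives the same result on raw and standardized values; the standardization matters for the clustering and PLS steps, where variables on different scales would otherwise be weighted unequally.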