PM10 and gaseous pollutants trends from air quality monitoring networks in Bari province: principal component analysis and absolute principal component scores on a two years and half data set

Background The chemical composition of aerosols and particle size distributions are the most significant factors affecting air quality. In particular, exposure to finer particles can cause short- and long-term effects on human health. This paper presents the trends of PM10 (particulate matter with aerodynamic diameter lower than 10 μm), CO, NOx (NO and NO2), Benzene and Toluene monitored at six stations in Bari province. The data set consists of bi-hourly means for all parameters (12 bi-hourly means per day for each parameter) and covers the period from January 2005 to May 2007. The main aim of the paper is to provide a clear illustration of how large data sets from monitoring stations can give information about the number and nature of the pollutant sources, and mainly to assess the contribution of the traffic source to PM10 concentration levels by using multivariate statistical techniques such as Principal Component Analysis (PCA) and Absolute Principal Component Scores (APCS).
Results Comparing the night and day mean concentrations (per day) for each parameter shows that CO, Benzene and Toluene behave differently between night and day, whereas PM10 does not. This suggests that CO, Benzene and Toluene concentrations are mainly connected with transport systems, whereas PM10 is mostly influenced by different factors. The statistical techniques identified three recurrent sources, associated with vehicular traffic and particulate transport, covering over 90% of variance. Analysing the gases and PM10 together made it possible to highlight the differences between the sources of these pollutants.
Conclusions The analysis of pollutant trends from large data sets and the application of multivariate statistical techniques such as PCA and APCS can give useful information about air quality and pollutant sources. This knowledge can provide useful guidance for environmental policies in order to reach the WHO recommended levels.


Background
Knowledge of the chemical composition and sources of air pollution is required in any programme aimed at controlling pollutant levels, in order to evaluate and reduce their impact on human health.
Inhalation of air polluted with particulate matter (PM10) and/or irritant gases such as NO2 and SO2 is associated with both short-term and long-term health effects, most of which affect the respiratory and cardiovascular systems [1]. For example, atmospheric concentrations of NO2 have been linked to the deaths of severely asthmatic patients in Barcelona [2], child asthma cases in Toronto and Southern California [3,4], heart rate dysfunction in Taiwan and Switzerland [5,6], and ischemic heart disease in elderly residents of French cities [7]. Similar examples can be chosen to illustrate the damaging effects of PM10 inhalation, whether it be asthma in Madrid or Sydney [8,9] or all-cause mortality (especially stroke) in Boston [10].
The federal Clean Air Act Amendments of 1990 mandate that the U.S. EPA determine a set of urban hazardous air pollutants (HAPs, or 'air toxics') that potentially pose the greatest risks in urban areas, in terms of contribution to population health risk. The current set of 188 HAPs includes toxic metals and volatile organic compounds (VOCs). The U.S. EPA identified 33 urban HAPs based on emissions and toxicities in a 1995 ranking analysis [11] and developed concurrent monitoring and modelling programs to evaluate potential exposures and risks to these 33 top-ranked HAPs. Developing effective control strategies to reduce population exposure to certain HAPs requires identifying sources and quantifying their contributions to the mixture of HAPs and the associated health risks. One approach is to use receptor-based source apportionment models to distinguish sources. Most source apportionment studies focus on analysing either VOCs [12,13] or fine particle (PM2.5) mass [14][15][16]. Only a few studies have used source apportionment modelling to identify common sources of both VOCs and PM2.5. In other source apportionment studies that included both non-organic trace elements in PM and gaseous pollutants [17][18][19][20], the gaseous species were usually non-VOCs (such as CO, SO2, and NO).
In recent years, there has been an increased interest in the application of chemometrics [21] to different environmental research fields, ranging from water to air pollution and cultural heritage [22][23][24][25]. One aspect of the application of chemometrics to environmental pollution research is often referred to as source apportionment, receptor modelling and/or mixture analysis discipline. Recent examples of such work can be found in Europe [26,27], the US [28,29] and Asia [30,31]. In the fields of pollution sciences (air or water), source apportionment models aim to re-construct the emissions from different sources of pollutants based on ambient data registered at monitoring sites [32].
In the present paper a bihourly data set of PM10, CO, NOx, Benzene and Toluene collected at six air quality monitoring stations in the Bari territory from January 2005 to May 2007 is used. The main aim of this paper is to provide a clear illustration of how large data sets from monitoring stations can give information about the number and nature of the pollutant sources, and mainly to assess the contribution of the traffic source to PM10 concentration levels by using multivariate statistical techniques.
This knowledge could provide useful guidance for environmental policies in order to reach the WHO recommended levels. Legislative efforts to reduce the health effects of air pollutants are currently being applied throughout the developed world, with the imposition of averaged limit values which vary for different pollutants. In the case of PM10, the World Health Organization has recommended progressive achievement of four pollution thresholds, which cascade down through three Interim Targets (IT1 = 70 μg/m3; IT2 = 50 μg/m3; IT3 = 30 μg/m3) to reach the ultimate objective: an Air Quality Guideline (AQG) annual mean of just 20 μg/m3 PM10 [33,34]. Moreover, under the latest Italian law [35,36], the annual limit value for PM10 is 40 μg/m3 and the daily limit value is 50 μg/m3; for NOx the annual limit value is 40 μg/m3 and the hourly limit value is 200 μg/m3; for Benzene the annual limit value is 5 μg/m3; and for CO the 8-hour mean limit value is 10 mg/m3.
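As a minimal sketch of how the WHO thresholds quoted above cascade, the following Python fragment reports the strictest level that a given PM10 annual mean satisfies. The function name and the sample annual means are hypothetical and are not values from this study.

```python
# WHO PM10 annual-mean thresholds quoted in the text (ug/m3), strictest first.
WHO_PM10_ANNUAL = [("AQG", 20.0), ("IT3", 30.0), ("IT2", 50.0), ("IT1", 70.0)]


def strictest_who_level_met(annual_mean):
    """Return the strictest WHO level whose threshold is not exceeded, or None."""
    for name, threshold in WHO_PM10_ANNUAL:
        if annual_mean <= threshold:
            return name
    return None


# Hypothetical annual means, for illustration only (not data from Table 1).
for value in (18.0, 35.0, 82.0):
    print(value, strictest_who_level_met(value))
```

A value above 70 μg/m3 satisfies none of the four thresholds, which the sketch reports as `None`.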

Results and discussion
Table 1 summarizes the basic statistics for each site. Among all the available sampling sites, only those with at least 5000 data points were used, considering only days with complete data (12 bi-hourly values per day). The high variability is explained by the length of the monitoring period (2.5 years). Pollutant concentrations are reported in μg/m3, except for CO, which is expressed in mg/m3.
From the data collected, night-time and daytime mean concentrations (per day) were obtained for each parameter. These values were plotted for each parameter and each sampling site; Figures 1, 2, 3 and 4 show examples.
In Figures 1, 2, 3 and 4, shown as examples, CO, Benzene and Toluene display different night-time and daytime trends, with daytime mean values higher than night-time ones. In particular, for the data shown in Figures 1, 2 and 3, the percentage ratio between (daytime mean - night-time mean) and daytime mean for CO, Benzene and Toluene is 53%, 49% and 54%, respectively. The Toluene trend in Figure 3 shows very high daytime mean values on some days, e.g. 05/05/2005 or 22/02/2006, unlike the Benzene trend in Figure 2. This is probably due to another pollution source affecting the monitoring site, possibly the painting of pedestrian crossings and road stripes.
The PM10 night-time and daytime mean concentrations (Figure 4) do not show a clear difference between day and night: the ratio for PM10 is 16%. Moreover, on some days, e.g. 25/03/2005 and 06/02/2006, the thermodynamic conditions in the planetary boundary layer (PBL) hinder pollutant dispersion, leading to PM10 night-time values higher than daytime ones despite the reduction of emission sources during the night.
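The percentage ratio used above can be sketched as follows. This is a minimal illustration that assumes the per-day daytime and night-time means are first averaged over the period and then compared; the text does not state whether the averaging was done this way or day by day, nor which bihourly slots count as night. The function name and input values are hypothetical.

```python
def day_night_ratio(day_means, night_means):
    """Percentage ratio (day mean - night mean) / day mean.

    day_means and night_means are per-day mean concentrations; they are
    averaged over the period before the ratio is taken (an assumption).
    """
    day = sum(day_means) / len(day_means)
    night = sum(night_means) / len(night_means)
    return 100.0 * (day - night) / day


# Hypothetical per-day means for three days (not data from this study).
print(day_night_ratio([2.0, 2.4, 1.6], [1.0, 1.1, 0.9]))
```

A ratio close to 50% means that night-time levels are about half the daytime ones, while a ratio close to zero means there is little difference between day and night.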
The different night-time and daytime behaviour suggests that parameters such as CO, Benzene and Toluene are mainly connected with transport systems, whereas PM10 is mostly influenced by different factors.
The parameter trends shown in Figures 1, 2, 3 and 4, which refer to the Viale Archimede data, are similar to those of the other sites. The different behaviour of PM10 and the other parameters (CO, Benzene, Toluene) can therefore be considered common to the whole area investigated: Bari and Bari province.
Moreover, as we have shown in a previous paper [37], the results obtained both by automatic monitoring stations and by sampling campaigns at several sites of the Apulia region suggest that the PM10 amount monitored in this area shows a common contribution even among monitoring stations located 70 km apart; this common contribution apparently does not depend on local sources. In reference [37] we also pointed out that PM10 concentrations do not show a seasonal trend, unlike the PM10 trend observed in the towns of northern Italy [38,39].
In order to identify the pollutant sources that contribute to PM10 concentrations, and to try to distinguish the contribution of local sources, such as vehicular traffic, from "a common regional source" (that is, resuspended matter, dust intrusions, calcium carbonate source), the APCS model was applied to the data collected. Among the criteria described in the methods section we chose ODV90, which showed that three components are necessary and sufficient to run the model properly.
Table 2 shows, as an example, the loading values from the PCA applied to the data collected at all sites during January 2007. Three factors explain almost 92% of the total variance of the data for all sites. Factor loadings are used to obtain information about source profiles. The first factor (or first principal component, PC1), accounting for between 40% and 51% of the total variance, is dominated by high loading values of Benzene, Toluene and CO, or of NOx and CO, depending on the site; the second factor (or second principal component, PC2), accounting for between 24% and 31% of the total variance, is dominated by PM10 or by Benzene and Toluene; the third factor, explaining between 21% and 25% of the total variance, has high loading values for Benzene and Toluene or for PM10.
Applying PCA to the whole data set, we generally found that at each sampling site one of the three factors is characterized by high loading values of PM10, while the other two factors are characterized by high loading values of NOx, CO, Benzene and Toluene.
Figure 5 shows that PM10 is the dominant parameter on the second component, with high loading values.
In order to identify the three sources, the Absolute Principal Component Scores model was applied to the data sets. Tables 3 and 4 show the parameter distribution in the three pollution sources, averaged over the whole monitoring period: the profile of the second source is mostly characterized by PM10, while the other two sources are characterized to different degrees by NOx, Benzene, Toluene and CO, with a small contribution from PM10.
Moreover, comparing the source profile concentrations between the Summer and Winter seasons, a consistent increase in NOx concentration from Summer to Winter is observed for all sites and sources. In particular, the first source shows higher NOx concentrations in Winter than in Summer at all sites. The first source can therefore be considered a mixed source between vehicular traffic and domestic heating. Figure 6 represents the percentage distribution of the parameters in the three sources. The plot is obtained from monthly source profiles averaged over the whole sampling period and across all monitoring sites.
Over 85% of the mass of PM10 is attributed to the second source. The first and third sources, composed of NOx, CO and aromatic compounds with low levels of PM10, are characterized by similar levels of Benzene and Toluene. In particular, the Toluene to Benzene concentration ratio in the first and third source profiles is greater than 2 (except at the San Nicola sport stadium monitoring site): in the literature this value is associated with vehicular traffic emissions. Moreover, NOx and CO are predominant in the first source. The amount of PM10 in the third source, although low, is 50% higher than in the first source.
These observations suggest that the second source could be identified as a "Particulate source", while the first and third sources can be considered different components of vehicular traffic emissions. In fact, no industrial plants or similar facilities are located close to the sampling sites, and traffic is the most important anthropogenic source of pollution. The two traffic sources might originate from different kinds of vehicles or engines, for example gas and diesel. These different fuels are known to be responsible for different pollutant emissions. In particular diesel, before the introduction of filters, was the major source of particulate matter among the fuels used for road transport, with lower emissions of NOx and CO. Considering also the consistent increase in NOx concentration from Summer to Winter for all sites and sources (Tables 3 and 4), the first source could be identified as a mixed source between vehicular traffic and domestic heating, and the third source as vehicular traffic. Further evidence linking the first and third sources to vehicular emissions is the daily profile of the bihourly mean concentration contributions of the three sources (Figure 7). Figure 7 clearly shows that the particulate source has a rather constant trend during the day and is uncorrelated with the traffic sources. The other two sources show, instead, a typical traffic profile, with emission peaks at 8:00 in the morning and at 20:00 in the evening, corresponding to the rush hours of people going back and forth.
Table 5 shows the correlation coefficients among the six sites for the three sources in the APCS profiles matrix. According to these data, the Particulate source shows high correlation among four sites in different zones (Bari and Province). This supports our hypothesis of a regional character for PM10 concentrations [37]. The Monopoli and San Nicola sites do not show this correlation, which can be explained by the different nature of these sites: Monopoli is an urban site, while San Nicola is a suburban site bordered by a high-traffic road and with peaks of vehicular traffic during sport events (generally at the weekend).
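A between-site correlation table of the kind reported in Table 5 can be sketched as below. The sketch assumes that Pearson's coefficient is the one used (the text says only "coefficients of correlation") and that each site provides a source-contribution series of equal length; the site labels and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_table(series_by_site):
    """Pairwise correlations of one source's contributions across sites."""
    sites = list(series_by_site)
    return {
        (a, b): pearson(series_by_site[a], series_by_site[b])
        for i, a in enumerate(sites)
        for b in sites[i + 1:]
    }


# Hypothetical contributions of one source at three sites (illustration only).
example = {
    "site_A": [30.0, 42.0, 25.0, 38.0],
    "site_B": [28.0, 45.0, 22.0, 36.0],
    "site_C": [40.0, 20.0, 35.0, 27.0],
}
for pair, r in correlation_table(example).items():
    print(pair, round(r, 2))
```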
In contrast, the vehicular traffic sources show low correlation among the sites, owing to the different locations of the sampling sites. Table 6 shows the reconstruction percentage error of the APCS model for each parameter. The error varies widely over the monitoring period. PM10 concentrations show the lowest reconstruction error, while CO concentrations show the highest. The model, in fact, is less robust when values are low (as is the case for carbon monoxide). In most cases, however, the error was acceptable, allowing a fairly good reconstruction of the concentration trend.

The air quality monitoring network
Bari is a town of about 350,000 inhabitants located in the south-east of Italy (latitude 41°08', longitude 16°45'). Its main industrial activities are in the mechanical (carpentry and industrial vehicles), food and clothing sectors; its industrial area, with a thermoelectric power station, lies in the neighbouring towns.
Prevailing winds are from NNW and WNW in December, January and February, from the East in March and September, and from NNE and South in October and November. There are 80-90 rainy days per year, with maxima of 40-50 mm. The region is characterized by active photochemistry, mostly in the summer season.
Like many other Italian cities, its urban area is characterized by high motor-vehicle traffic density, mostly in the centre of the city.
The air quality monitoring network of the Bari Municipality consists of six fixed monitoring stations, a mobile laboratory and a data processing centre. In the province of Bari, which extends over 3,825 km2 and includes 41 towns, there are four fixed monitoring stations located in the towns of Casamassima, Altamura, Andria and Monopoli.
In this paper some stations of the monitoring networks of Bari and its province were selected as representative sites of the investigated area. In Bari, the selected monitoring stations are located in a residential area (viale King), in an urban area (viale Archimede) and in a suburban area (S. Nicola sport stadium).
In the province of Bari, the three selected stations are located in the urban and residential areas of the following towns: Altamura (67,000 inhabitants), located 47 km south-west of Bari; Andria (98,000 inhabitants), 55 km north of Bari; and Monopoli (50,000 inhabitants), a coastal town 40 km south of Bari. All the sites considered can be classified as urban background sites, except for Monopoli, which is an urban site, and San Nicola, which is a suburban site bordered by a high-traffic road and with peaks of vehicular traffic during sport events.

The instrumentation
Each station is provided with automatic analysers of CO (Advanced pollution Instrumentation [...]). Nitrogen oxides, NO and NO2, were analysed using the chemiluminescence method. Measurement of ozone is based upon the capacity of this gas to absorb ultraviolet rays of suitable wavelengths, generated by a built-in lamp. Carbon monoxide is analysed through the absorption of infrared rays (IR).
The measurement of PM10 is based upon the beta ray attenuation method on standard 47 mm membrane filters; the data are collected every two hours. Benzene/Toluene/Xylene are measured using the capillary gas chromatographic technique in the gaseous phase, which enables the rapid separation and identification (15 minutes) of the components of the gas sample.

The data
The data are collected by the system every hour for all parameters, except for PM10, which is collected every two hours. Therefore, all data are treated as two-hour means (at even hours).
In order to simplify the subsequent statistical processing, only days with complete data, that is, days with all 12 bihourly means, were included in the data set.
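The two preparation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: each day is a list of 24 hourly values with `None` marking a missing or invalidated hour, hours are paired as (0, 1), (2, 3), and so on, and a two-hour mean is treated as missing if either hourly value is missing. The exact pairing and missing-data rule used by the monitoring system are not given in the text, and the function names are hypothetical.

```python
def bihourly_means(hourly):
    """Collapse 24 hourly values into 12 two-hour means (None if incomplete)."""
    means = []
    for h in range(0, 24, 2):
        pair = hourly[h:h + 2]
        means.append(None if None in pair else sum(pair) / 2.0)
    return means


def complete_days(days):
    """Keep only the days for which all 12 bihourly means are available."""
    kept = {}
    for date, hourly in days.items():
        means = bihourly_means(hourly)
        if None not in means:
            kept[date] = means
    return kept


# Hypothetical hourly CO values for two days; the second has one missing hour.
days = {
    "2005-01-10": [1.0] * 24,
    "2005-01-11": [1.0] * 23 + [None],
}
print(list(complete_days(days)))
```

Dropping incomplete days keeps the data matrix rectangular, so every retained day contributes the same 12 rows per parameter.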
The data collected by the monitoring network were validated according to the following protocol. A preliminary validation was carried out by the software, which invalidated all data recorded during calibration hours and data identified as artifacts. A manual validation was then carried out by operators, considering the relations existing among the parameters: for example, parameters monitored by the same instrument (e.g. benzene and toluene, or the nitrogen oxides) were validated simultaneously, as were parameters linked by the same hypothetical source (e.g. carbon monoxide and aromatic compounds, typical traffic pollutants). In this way it is possible to verify that any critical data are related to real pollution situations and are not artifacts due to instrument malfunction. Moreover, meteorological data (rain, wind speed and direction) were used to investigate the influence of natural events on high or low concentration situations.
The data were collected at the investigated sites from January 2005 to May 2007.
Table 1 summarizes the basic statistics for each site.

Conclusions
Multivariate statistical techniques such as receptor models offer a valid tool for handling complex data sets and make it possible to extract information that cannot be directly inferred from the original data matrix by traditional approaches.
In our case the model suggests that the major part of PM10 is not directly linked to vehicular traffic. It is probably due to long- and medium-range transport of PM10 and to the formation of secondary particulate. The model confirms a common regional contribution to PM10 among the sites and the observed absence of a seasonal PM10 trend.
Even though the model is applied to few parameters, it is able to provide information about the nature of the pollution sources. However, to determine other important pollution sources, such as domestic heating, parameters that allow this source to be identified are needed.
The results obtained by the models moreover confirm that PM10 concentration cannot be considered a good air quality indicator, because it does not reflect the actual pollution sources.

The model description
The aim of applying receptor models is the apportionment of pollutant sources. The two main approaches to receptor modelling are Chemical Mass Balance (CMB) and multivariate factor analysis (FA). CMB gives the most objective source apportionment and needs only one sample; however, it assumes knowledge of the number of sources and their emission patterns. FA, on the other hand, attempts to apportion the sources and to determine their composition on the basis of a series of observations at the receptor site only [40]. Among multivariate techniques, Principal Component Analysis (PCA) is often used as an exploratory tool to identify the major sources of air pollutant emissions [38,[41][42][43]. The great advantage of using PCA as a receptor model is that there is no need for a priori knowledge of emission inventories [44].
PCA is a statistical method that identifies patterns in data, revealing their similarities and differences [45]. PCA creates new variables, the principal component scores (PCS), that are orthogonal and uncorrelated to each other, being linear combinations of the original variables. They are obtained in such a way that the first PC explains the largest fraction of the original data variability, the second PC explains a smaller fraction of the data variance than the first one, and so forth [46][47][48]. Varimax rotation is the most widely employed orthogonal rotation in PCA, because it tends to simplify the unrotated loadings and so makes the results easier to interpret. It simplifies the loadings by rigidly rotating the PC axes such that the variable projections (loadings) on each PC tend to be either high or low.
Moreover the reconstruction of the source profile and contribution matrices can be successfully obtained by APCS (Absolute Principal Component Scores) method [49].
The observed concentration of pollutant $i$ in the atmosphere at a certain time, $C_i$, can be considered as a linear combination of contributions from $p$ sources:
$$C_i = \sum_{k=1}^{p} a_{ik} S_k$$
where $S_k$ is the contribution from source $k$ and $a_{ik}$ is the fraction of the source $k$ contribution possessing property $i$ at the receptor. One of the most used methods to decompose the concentration matrix into the product of the source pattern and contribution matrices is the APCS. The starting point is the matrix X (samples × parameters). In the APCS method the first step is the search for the Eigenvalues and Eigenvectors of the data correlation matrix G. Only the most significant $p$ Eigenvectors (or factors) are taken into account. Two methods are generally used to choose the $p$ Eigenvectors: the Kaiser method (PCs with eigenvalues greater than 1) and the ODV80 method (PCs representing at least 80% of the original data variance).
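The two selection criteria can be sketched as follows, assuming the eigenvalues of the correlation matrix have already been computed. The function names and the eigenvalues are hypothetical and are not taken from this study; the `fraction` argument generalizes the 80% threshold to other cut-offs such as 90%.

```python
def kaiser_count(eigenvalues):
    """Number of components with eigenvalue greater than 1 (Kaiser method)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def odv_count(eigenvalues, fraction=0.80):
    """Smallest number of leading components explaining at least `fraction`
    of the total variance (0.80 corresponds to ODV80)."""
    total = sum(eigenvalues)
    running = 0.0
    for count, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= fraction:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues for five standardized variables (they sum to 5).
eigenvalues = [2.3, 1.4, 0.9, 0.3, 0.1]
print(kaiser_count(eigenvalues))     # components with eigenvalue > 1
print(odv_count(eigenvalues, 0.80))  # components needed for 80% of variance
print(odv_count(eigenvalues, 0.90))  # components needed for 90% of variance
```

With these illustrative eigenvalues the two criteria disagree: only two eigenvalues exceed 1, while three components are needed to reach 80% of the variance, so the choice of criterion can change the number of sources retained.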
The $p$ Eigenvectors are then rotated by an orthogonal or oblique rotation. The most used rotation algorithm is Varimax, which performs an orthogonal rotation of the loadings. After the rotation all the components should assume positive values; small negative values are set to zero. An abstract image of the source contributions to the samples can be obtained by multivariate linear regression:
$$Z = PCS \cdot V^{T}$$
where $Z$ is the scaled data matrix, $PCS$ is the principal component scores matrix, and $V^{T}$ is the transposed rotated loading (Eigenvectors) matrix. In order to pass from the abstract contributions to real ones, a fictitious sample $Z_0$, where all concentrations are zero, is built [43,50]. Details about the method can be found in reference [49]: the APCS matrix can be identified with the estimated contributions matrix $F_r$. A regression on the data matrix X yields the estimated source profiles matrix $A_r$. Finally, the product of the matrices $F_r$ and $A_r$ gives the reconstructed data matrix $X_r$. The reconstruction percentage error of the model was calculated as the percent relative root mean square error (RRMSE), as shown in reference [49].
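The last two steps, reconstruction and error evaluation, can be sketched in plain Python. The RRMSE formula used here (root mean square error divided by the mean observed value, times 100) is one common definition and is an assumption, since the exact expression is given only in reference [49]. The matrices are hypothetical toy values with samples in rows, sources or parameters in columns.

```python
from math import sqrt


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def rrmse_percent(observed, reconstructed):
    """Percent RRMSE for each parameter (column): 100 * RMSE / observed mean.

    This definition is an assumption, not necessarily that of reference [49].
    """
    n = len(observed)
    errors = []
    for j in range(len(observed[0])):
        mse = sum((observed[i][j] - reconstructed[i][j]) ** 2 for i in range(n)) / n
        mean_obs = sum(observed[i][j] for i in range(n)) / n
        errors.append(100.0 * sqrt(mse) / mean_obs)
    return errors


# Hypothetical contributions F_r (3 samples x 2 sources) and
# source profiles A_r (2 sources x 2 parameters).
F_r = [[1.0, 0.5], [0.8, 1.2], [1.1, 0.9]]
A_r = [[20.0, 0.4], [10.0, 0.9]]
X = [[26.0, 0.80], [27.0, 1.45], [32.0, 1.20]]  # hypothetical observed data

X_r = matmul(F_r, A_r)  # reconstructed data matrix
print(rrmse_percent(X, X_r))
```

Normalizing by the observed mean means that a parameter measured at low levels, such as CO, shows a larger percentage error for the same absolute residual.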
The authors declare no experimental research has been performed on animals or humans in the frame of the research activities related to this paper. No ethics committee exists for this kind of research.