Scale validation and prediction of environmental health literacy in Brazil

Table 2 presents a statistical summary of the four scales: Air, Food, Water, and General. It gives an overview of the distribution of values on each scale and allows the scales to be compared. The table reports the minimum, first quartile, median, mean, third quartile, and maximum values for each scale.

Table 2 Summary statistics for environmental scores.

Analyzing the descriptive statistics, we can see that, in the case of the Air Scale, the values range from a minimum of 1.800 to a maximum of 3.900. The median value is 2.600, slightly lower than the mean of 2.641. In the Food Scale, the data extend from a minimum of 2.444 to a maximum of 5.000, with median and mean values of 4.222 and 4.145, respectively. The Water Scale shows values from 2.000 to 4.917; the median is 3.917 and the mean is 3.871. The General Scale presents a range from 1.778 to 5.000, with a median of 4.000 and a mean of 3.986.

We can observe that the four scales do not exhibit much variation between means and medians, which suggests that the values are not overly skewed. However, it is noteworthy that the air scale consistently shows lower values, while the food scale shows higher values compared to the other scales. These differences suggest that the population has higher literacy levels in food-related topics, possibly due to a higher engagement or familiarity with the theme.
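As an illustration of how the six statistics reported in Table 2 can be obtained, the following is a minimal sketch in Python. The scores are hypothetical and are not data from this study, and the quartile interpolation rule (`method="inclusive"`) is an assumption, since different statistical packages use slightly different quartile definitions.

```python
import statistics

def six_number_summary(scores):
    """Return min, Q1, median, mean, Q3, and max for a list of scale scores."""
    # method="inclusive" interpolates between observed values, so the
    # quartiles always lie inside the observed range (an assumed convention).
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": q1,
        "median": statistics.median(scores),
        "mean": statistics.fmean(scores),
        "q3": q3,
        "max": max(scores),
    }

# Hypothetical respondent scores, not data from the study.
example_scores = [1.8, 2.2, 2.4, 2.6, 2.6, 2.8, 3.0, 3.9]
summary = six_number_summary(example_scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this toy sample the mean lies slightly above the median, which is the pattern produced by a few high scores; a small gap of this kind is what the comparison of means and medians above relies on.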

Confirmatory factor analyses

Air scale

The four models proposed by Lichtveld et al.²⁴ were tested using confirmatory factor analyses. The model related to air quality presented a measure of sampling adequacy (MSA) of 0.6, which indicates mediocre sampling adequacy³⁸. However, Bartlett’s test of sphericity showed statistical significance (chi-square = 339.55, p-value < 0.001), indicating that the correlation matrix differs from the identity matrix. Thus, the data set is suitable for factor analysis. McDonald’s omega total is 0.53, suggesting the need for refinement of the scale.
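To make the logic of the Bartlett test concrete, the sketch below computes the usual test statistic, chi-square = -((n - 1) - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, where R is the item correlation matrix, p the number of items, and n the sample size. The correlation matrix and sample size are hypothetical, and the code is a hand-written illustration rather than the routine used in the analysis.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def bartlett_sphericity(corr, n_obs):
    """Chi-square statistic and degrees of freedom for H0: corr is the identity.

    Assumes corr is a valid (positive definite) correlation matrix.
    """
    p = len(corr)
    chi_square = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(determinant(corr))
    dof = p * (p - 1) // 2
    return chi_square, dof

# Hypothetical 3-item correlation matrix and sample size.
corr = [
    [1.0, 0.3, 0.2],
    [0.3, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
chi_square, dof = bartlett_sphericity(corr, n_obs=200)
print(f"chi-square = {chi_square:.2f}, df = {dof}")
```

When the items are uncorrelated, the determinant equals 1 and the statistic is 0; the more the items correlate, the smaller the determinant and the larger the statistic.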

The confirmatory factor analysis for the air quality model yielded a significant p-value, indicating a poor fit between the model and the data. The alternative indices confirm this. The comparative fit index (CFI = 0.851) and the Tucker-Lewis index (TLI = 0.790) both fall below 0.9, the value above which a fit is considered good⁴⁴. The error measures are at the threshold of what is considered a good fit: the RMSEA was 0.049, with a 90% confidence interval ranging from 0.031 to 0.067, and the SRMR was 0.06, against reference values of 0.05 for RMSEA and 0.08 for SRMR⁴⁴.
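For readers less familiar with these indices, the sketch below shows the textbook formulas for RMSEA, CFI, and TLI, computed from the chi-square statistics of the fitted model and of the baseline (independence) model. The input values are hypothetical, and the use of n - 1 in the RMSEA denominator is an assumption, since some software uses n instead.

```python
import math

def rmsea(chi2, df, n_obs):
    """Root mean square error of approximation (n - 1 convention assumed)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n_obs - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index; it penalizes model complexity through chi2/df."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

# Hypothetical chi-square values, degrees of freedom, and sample size.
fit = {
    "rmsea": rmsea(60.0, 35, 400),
    "cfi": cfi(60.0, 35, 300.0, 45),
    "tli": tli(60.0, 35, 300.0, 45),
}
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

Note that the RMSEA is truncated at zero whenever the chi-square statistic is smaller than its degrees of freedom, which is why a confidence interval for the RMSEA can start at 0.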

Food scale

Regarding the food scale, the MSA found for the sample was 0.78, which represents middling sampling adequacy³⁸. Bartlett’s test showed statistical significance (chi-square = 377.85, p-value < 0.001), indicating that the data set is suitable for factor analysis. Cronbach’s alpha for this sample is 0.7, and McDonald’s omega total is 0.79, indicating reliability.
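As a reminder of what Cronbach’s alpha measures, the sketch below applies the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of the total score), to a small set of hypothetical item responses. The data and variable names are invented for illustration and do not come from the food scale.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of respondent scores per item."""
    k = len(items)
    item_variance_sum = sum(statistics.variance(column) for column in items)
    # Total score of each respondent across all items.
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))

# Hypothetical responses of five people to three Likert-type items.
item_scores = [
    [4, 5, 3, 4, 2],
    [5, 5, 3, 4, 3],
    [4, 4, 2, 5, 2],
]
alpha = cronbach_alpha(item_scores)
print(f"alpha = {alpha:.3f}")
```

Alpha approaches 1 when the items rise and fall together across respondents, so that the variance of the total score is much larger than the sum of the individual item variances.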

The confirmatory factor analysis of the food model showed a significant p-value, suggesting a poor fit with the data. However, this result can be affected by sample size, with large samples possibly leading to significant p-values even for well-specified models. The CFI and TLI indices are high at 0.970 and 0.956, respectively, showing the model fits the data well. The error measures also support this, with an RMSEA of 0.045 (90% CI: 0.022 to 0.066) and an SRMR of 0.057, both indicating a good fit.

Water scale

Regarding the water scale, it is important to highlight that two items are not suitable for the setting of this research: “I only use the dishwasher when I have a full load” and “I comply with instructions when a boil water advisory is issued by the city”. The use of dishwashers in Brazil is restricted to the high-income population. Furthermore, the vast majority of Brazilian houses have no hot running water, unlike houses in countries with temperate and cold climates in the northern hemisphere. Unlike the other items, these two presented a low response rate and were excluded from the analysis.

The sample’s MSA was 0.69, indicating mediocre sampling adequacy³⁸. However, the significant Bartlett test (chi-square = 646.76, p-value < 0.001) suggests the data is appropriate for factor analysis. McDonald’s omega total at 0.71 shows reliability. The water model’s confirmatory factor analysis had a non-significant p-value, indicating a good fit. High CFI (0.981) and TLI (0.976) values also support a good model fit, as do the RMSEA (0.021, 90% CI: 0.000 to 0.040) and SRMR (0.049), which both indicate a good fit. Most factor loadings are significant, showing that the observed variables effectively represent the latent factors. Additionally, the significant covariance among the factors reveals they are distinct yet interconnected.

General scale

Regarding the general scale, the MSA was 0.66, indicating mediocre sampling adequacy³⁸. However, the significant Bartlett test (chi-square = 268.03, p-value < 0.001) suggests the data is appropriate for factor analysis. McDonald’s omega total at 0.71 shows reliability. The general model’s confirmatory factor analysis had a significant p-value, suggesting a poor fit with the data. However, this result could be affected by sample size. High CFI (0.961) and TLI (0.942) values support a good model fit, as do the RMSEA (0.039, 90% CI: 0.011 to 0.061) and SRMR (0.051), which both indicate a good fit.

Regarding the relationships between latent and observed variables for all four models, most factor loadings are significant, showing that the observed variables effectively represent the latent factors. Additionally, the significant covariance among the factors reveals they are distinct yet interconnected. Table 3 presents the fit indexes and factor loadings for the confirmatory factor analyses.

Table 3 Fit indexes and factor loadings for CFA.

Adjustments to the air scale structure

Given the low fit indices observed for the air scale, an exploratory analysis was undertaken to determine the most suitable structure for this construct. Factor analysis was conducted on the ten items constituting the air scale. The scree plot indicates an optimal model comprising four factors, while eigenvalues point towards a two-factor model. Nonetheless, from a theoretical standpoint, a model with three factors demonstrates the best fit.

An exploratory factor analysis with three factors therefore revealed item factor loadings between 0.43 and 0.72, indicative of good quality⁴⁵. The alternative fit indices suggest a well-fitting model⁴⁴, with RMSEA at 0.032 (90% CI: 0 to 0.075), SRMR at 0.04, and TLI at 0.942. The proportion of the variance explained by the model is 31%. McDonald’s omega total at 0.64 shows better reliability, but still suggests that there is room for improvement. Table 4 presents the factor loadings of the short version of the air scale.

Table 4 Short version of the air scale.

This structure contains two items related to behavior regarding air pollution (x9 and x10), two items related to attitudes (x4 and x5), and two items related to knowledge (x2 and x6). Item x6, “I consider the air I breathe in my community to be clean”, was initially related to attitudes but is linked to knowledge about air quality in the local community.

We then conducted a CFA using the portion of the dataset that was not randomly selected for the EFA. The results indicate a reasonable fit for the new air scale structure, confirming its acceptability. The fit indices support this conclusion: RMSEA = 0.012 (90% CI: 0 to 0.094), SRMR = 0.044, TLI = 0.999, and CFI = 1.000. McDonald’s omega total is 0.64, suggesting some improvement, but refinement is still needed.

Random forest

In this section, we discuss the Shapley values for the random forest models that were developed to predict the EHL for the four scales evaluated. Initially, we identify and describe the ten most influential features as determined by the models, ranked by their mean absolute Shapley values, with the mean absolute value for each feature shown on the left side of Figs. 1, 2, 3 and 4. To give more nuance to this analysis, we also provide the Shapley values for each observation within the test dataset for these top ten features, presented in the scatterplot of each figure. We differentiate these points according to the value of the feature, 0 or 1, providing a clearer understanding of how each feature’s value influences the model’s predictions across different observations. This approach offers insight into the predictive dynamics of the random forest models of each scale.
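The two summaries described above, the per-observation Shapley values and their mean absolute value per feature, can be illustrated with the following minimal sketch. It computes exact Shapley values by enumerating all feature coalitions, which is only feasible for a handful of features. The model is a hypothetical linear function standing in for a fitted random forest, and the feature names and data are invented; the sketch does not reproduce the software used for the figures.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, instance, background):
    """Exact Shapley values of one instance, averaging over background rows."""
    n = len(instance)

    def coalition_value(subset):
        # Features in `subset` take the instance's values; the remaining
        # features are filled in from each background row in turn.
        total = 0.0
        for row in background:
            mixed = [instance[i] if i in subset else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                gain = coalition_value(members | {i}) - coalition_value(members)
                phi[i] += weight * gain
    return phi

# Hypothetical stand-in for a fitted model: a linear score on three
# binary (0/1) features. It is not the random forest from the study.
def toy_model(x):
    return 3.0 + 0.5 * x[0] - 0.3 * x[1] + 0.0 * x[2]

feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names
test_rows = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]  # hypothetical data

# One list of Shapley values per observation (the scatterplot points).
per_row = [shapley_values(toy_model, row, test_rows) for row in test_rows]
# Mean absolute Shapley value per feature (the ranking criterion).
mean_abs = [sum(abs(phi[i]) for phi in per_row) / len(per_row) for i in range(3)]
ranking = sorted(zip(feature_names, mean_abs), key=lambda item: -item[1])
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

In this toy case, observations with `feature_a` equal to 1 receive a positive Shapley value and those with 0 a negative one, which is the kind of contrast obtained by separating the scatterplot points by feature value.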

Air scale

The Random Forest tuned to the air scale data has yielded an mtry value of 2 and a Minimal Node Size of 19. When applied to the testing data, this configuration resulted in an RMSE of 0.355.
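For reference, the RMSE reported for each model is the square root of the mean squared difference between observed and predicted scores in the testing data, so it is expressed in the same units as the scale scores. A minimal sketch with hypothetical observed and predicted values:

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted scores."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical test-set scores and model predictions.
observed_scores = [2.6, 3.0, 2.2, 2.9]
predicted_scores = [2.4, 2.9, 2.6, 2.8]
error = rmse(observed_scores, predicted_scores)
print(f"RMSE = {error:.3f}")
```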

Figure 1 shows that the most significant characteristic is whether the person holds a complete university degree, followed by having an incomplete university degree. Having a complete high school degree ranks fifth. This result shows the relevance of education when predicting the air scale score. Employment status is also relevant in predicting scores. Being employed without an employment contract, whether in the informal sector or as self-employed, ranks third. Being unemployed follows in the sixth position. Disability ranks fourth, showing that previous conditions also play a significant role. Gender differences are important as well, with being male ranked as the seventh most important characteristic, whereas being female ranks tenth. Income brackets are also significant, with earnings from 6 to 9 minimum wages ranked eighth and earnings up to 1 minimum wage ranked ninth in the algorithm.

Analyzing the Shapley values according to the feature value, we can see that having a complete university degree indicates a higher predicted score, while having an incomplete university degree or a complete high school degree has a negative effect. Being employed without an employment contract has a positive effect on the prediction, and not being employed impacts it negatively. While having a disability has a strong negative effect on the prediction, there are only a few observations with a positive value in this case, indicating possible overfitting. Being male and being female both increase the predicted value; however, the effect is overall stronger in the first case. Earning from 6 to 9 minimum wages and earning less than a minimum wage both have a negative effect, which is stronger in the first group.

Food scale

The Random Forest tuned to the food scale data yielded an mtry value of 1 and a Minimal Node Size of 25. When applied to the testing data, this configuration resulted in an RMSE of 0.455.

Figure 2 indicates that having a complete university degree has the most significant impact on food score. This is closely followed by having a complete high school degree, showing how education is important when predicting the food scale scores. The presence of a disability and continuous use of medication are also relevant, being the third and fifth predictors respectively, showing how important previous health conditions are on the model’s predictions. Gender also plays an important role, with being female ranking the fourth most important characteristic, while not specifying gender ranks tenth. Earning between 1 to 3 minimum wages is the sixth most relevant variable, suggesting that income is also an important predictor. Ethnicity and age are also relevant, with multiracial and white ethnicities being the seventh and eighth most relevant variables, respectively, and the age group of 20–30 years ranking ninth.

Looking at the actual Shapley values shows how each of these features factors into the model. Holding a complete university degree positively affects the predictions, while only possessing a high school degree negatively impacts them. Disabilities and opting not to declare gender also negatively influence the scores, but the limited observations suggest potential overfitting in these categories. Being female and the continuous use of medication both contribute positively. Individuals earning between 1 and 3 minimum wages tend to have lower predicted scores. Respondents who identify as multiracial have higher predicted scores, in contrast to those identifying as white, who have lower ones. Additionally, the 20 to 30-year-old age group is associated with lower predicted scores.

Water scale

The Random Forest tuned to the water scale data yielded an mtry value of 6 and a Minimal Node Size of 34. When applied to the testing data, this configuration resulted in an RMSE of 0.427.

Figure 3 indicates that the most relevant feature to predict EHL water scale is belonging to a younger age group, more specifically being from 20 to 30 years old. Being in the 40 to 50 years old group also plays a smaller role, ranking as the eighth most important feature. Using exclusively private healthcare is the second most important aspect, followed by living alone, which ranks third. Being female is the fourth most important aspect. Continuous medication use and having a disability are also key features, ranking fifth and sixth, respectively. The only education feature present in this case is having an incomplete university degree, in seventh place. Earning a high income, from 10 to 20 minimum wages, is the ninth most relevant factor, followed by being of multiracial ethnicity.

The Shapley values in Fig. 3 show that the 20 to 30 year old age group, exclusive use of private healthcare, and living alone predict low scores on the water scale. On the other hand, being female and continuous medication use are predictors of higher water scale scores. Having an incomplete university degree and a disability negatively affects scores on this scale. Unlike the negative impact of the younger group, individuals aged 40 to 50 years exhibit a positive impact on the prediction of the water scores. Earning between 10 and 20 minimum wages predicts lower scores, which could be caused by the small number of respondents earning that much. Additionally, multiracial ethnicity tends to positively influence the prediction.

General scale

The Random Forest tuned to the general scale data yielded an mtry value of 1 and a Minimal Node Size of 25. When applied to the testing data, this configuration resulted in an RMSE of 0.410.

In Fig. 4 we can see that a complete university degree and an incomplete university degree are the first and third most important features, respectively. This highlights once again the interconnected nature of education and EHL. Having a disability is the second feature, with the continuous use of medication being the eighth. Being between 20 and 30 years old is the fourth most important variable in this model. Income and employment are also relevant: earning from 1 to 3 minimum wages and being employed with an employment contract are the fifth and sixth most important features, and not being employed is the tenth. Gender is also relevant, with being female in seventh place and being male in ninth place.

Education plays a significant role in predicting general scale scores. Looking at the Shapley values, we can see that holding a university degree indicates higher levels of EHL, and having an incomplete university degree indicates lower levels. Having a disability, being aged 20 to 30 years, earning between 1 and 3 minimum wages, and unemployment are features that negatively affect the prediction. In contrast, being employed under a contract and gender identification as either male or female predict higher scores. However, it is important to note that being female has higher Shapley values on average than being male.
