This week, Invested Development, an “impact investment management firm”, published a blog article arguing that gaps in access to electricity and internet are two main drivers of low education rates in the developing world. Using World Bank data, they contend that “there is a clear linear correlation” between access to electricity and primary school completion. Based on their correlations, they also argue that “those countries with higher levels of internet subscribers are more likely to have better educational outcomes.”
While I recognize the good intentions of iD and their hard work in putting together the blog entry, I can’t help but criticize major methodological errors on their part.
Independent and Dependent Variable Confusion
It is clear from the way the research findings are reported that the authors intended to use primary education completion rates as the dependent variable, or outcome, in their models. Electricity and internet access are treated as the independent variables, or causes.
However, the two scatter plots included in the blog depict primary education on the x-axis.
Immediately, those who have an elementary knowledge of statistics will recognize that the x-axis is conventionally reserved for the independent variable. As a refresher (or an introduction), here is the basic linear regression model:
y = α + β(x) + ε
where y is the outcome variable of interest, α is the intercept (or value of y when x is zero), β is the coefficient determining the relationship between the independent variable(s) x and y, and ε is an error term capturing variation in y not explained by the independent variable(s) included in the model.
Clearly, iD intended to model:
primary education = α + β(electricity) + ε
primary education = α + β(internet) + ε
Instead, their graphics show models with primary education predicting electricity and internet access. In other words, the causal direction of the relationship is inverted. Here are the models they are actually showing us with these graphics:
electricity = α + β(primary education) + ε
internet = α + β(primary education) + ε
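For readers who want to see why the choice of axis matters, here is a minimal Python sketch. The numbers are invented for illustration and are not World Bank data, and the function names are my own. It shows that the correlation coefficient is the same whichever variable goes on the x-axis, while the fitted slope is not: the two slopes multiply to r squared, so they coincide only in special cases.

```python
def _centered_sums(x, y):
    """Return (Sxy, Sxx, Syy): sums of cross-products and squares around the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy, sxx, syy


def ols_slope(x, y):
    """Least-squares slope for the model y = alpha + beta * x + error."""
    sxy, sxx, _ = _centered_sums(x, y)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation; symmetric in x and y."""
    sxy, sxx, syy = _centered_sums(x, y)
    return sxy / (sxx * syy) ** 0.5


# Made-up illustrative values (percent of population / completion rate).
electricity = [20, 35, 50, 65, 80, 95]
completion = [55, 60, 72, 70, 85, 90]

# Education regressed on electricity (the model iD describes in the text).
slope_intended = ols_slope(electricity, completion)
# Electricity regressed on education (the model the scatter plots depict).
slope_plotted = ols_slope(completion, electricity)

print("slope, education on electricity:", slope_intended)
print("slope, electricity on education:", slope_plotted)
print("correlation (same either way):  ", pearson_r(electricity, completion))
```

The correlation cannot tell the two specifications apart; only the regression equation encodes which variable is being treated as the outcome.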
Overstating the Relationship
To correct for this model misspecification, I downloaded data from the World Bank Data Bank on primary education completion rates, access to electricity, and internet users. I replicated iD’s graphics using time series data for all available country-years. These are represented below. There are far more observations for internet usage (N=4227) than for electricity coverage (N=170).
But the graphs above are essentially useless for the arguments made by iD. Primary education is not intended to be a predictor of internet and electricity access, but rather to be predicted by these two technological factors.
I therefore re-ran their graphs and models using primary education as the dependent variable. These are depicted below. The “clear linear correlation” between electricity and primary education remains once the model is adjusted; however, the effect is not very pronounced (i.e. the regression line is almost horizontal). For internet access, the relationship is obviously non-linear and appears weak (again, the fitted line is rather flat).
Bivariate regressions in Stata indicate that both electricity and internet are positive predictors of primary completion rates. Electricity and internet are highly correlated with each other, so including both in the same model introduces multicollinearity, which inflates the standard errors and makes the individual coefficients hard to interpret. When I include them both anyway, internet is no longer significant.
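One common way to gauge how much two correlated predictors interfere with each other is the variance inflation factor (VIF). The sketch below is a Python illustration of that diagnostic, not the Stata procedure I ran, and the function name and sample values are hypothetical. With exactly two predictors, the VIF reduces to 1 / (1 - r²), where r is the correlation between them.

```python
def vif_two_predictors(x1, x2):
    """Variance inflation factor when a model has exactly two predictors.

    Equals 1 / (1 - r^2), where r is the Pearson correlation between x1 and x2.
    Returns infinity if the predictors are perfectly collinear.
    """
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = (s12 * s12) / (s11 * s22)
    remainder = 1.0 - r_squared
    if remainder <= 0.0:
        return float("inf")
    return 1.0 / remainder


# Invented values: two predictors that move almost in lockstep.
electricity_access = [1.0, 2.0, 3.0, 4.0]
internet_users = [1.0, 2.1, 2.9, 4.0]
print("VIF:", vif_two_predictors(electricity_access, internet_users))
```

A VIF of 1 means the predictors are uncorrelated; a frequently quoted rule of thumb treats values above roughly 10 as a warning sign, though that threshold is a convention rather than a test.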
Obviously, there are serious problems with these simple models. They should not be taken as evidence supporting or rejecting the hypothesis that primary school completion is impacted by electricity and internet access.
First, the models are plagued with missing data. To allow for comparison across models, I limited the analysis to the subset of country-years for which data were available for all variables. This reduced the sample to 57 countries in 2010 and 48 countries in 2011. All other country-years were omitted due to missing data.
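The restriction described here is usually called complete-case (or listwise) deletion. As a rough Python sketch of the idea, with hypothetical field names and placeholder rows rather than the actual World Bank extract:

```python
def complete_cases(rows, required):
    """Keep only rows where every required field is present (not None)."""
    return [row for row in rows if all(row.get(key) is not None for key in required)]


# Placeholder country-years; None marks a missing value.
rows = [
    {"country": "A", "year": 2010, "completion": 71.0, "electricity": 40.0, "internet": 5.0},
    {"country": "B", "year": 2010, "completion": 88.0, "electricity": None, "internet": 20.0},
    {"country": "C", "year": 2011, "completion": None, "electricity": 95.0, "internet": 45.0},
    {"country": "D", "year": 2011, "completion": 93.0, "electricity": 99.0, "internet": 60.0},
]

required_fields = ["completion", "electricity", "internet"]
usable = complete_cases(rows, required_fields)
print(len(usable), "of", len(rows), "country-years have all three variables")
```

Dropping incomplete rows keeps every model on the same sample, at the cost of discarding most of the data, which is why the sample shrinks so sharply.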
Second, these simple models fail to take into account the complicated mixture of factors influencing primary school completion rates across different regions. Specifically, the models suffer from serious omitted variable bias: factors such as reliance on subsistence farming, religion, region, and cost of attendance are left out.
Finally, these models also do not account for the non-normality of the distribution of the primary completion variable. I attempt to control for heteroskedasticity using robust standard errors, but there could be other violations of OLS assumptions in the simple models.
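For the curious, here is a minimal Python sketch of what a heteroskedasticity-robust standard error looks like for a one-predictor regression. It assumes the HC1 variant (the sandwich estimator with an n / (n - k) small-sample correction, where k = 2 for an intercept plus one slope); the function name and the toy data are my own and this is not the code behind the results above.

```python
import math


def robust_slope_se(x, y):
    """OLS slope and its HC1 robust standard error for y = alpha + beta * x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Residuals from the fitted line.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Sandwich "meat": each squared residual weighted by its squared x deviation.
    meat = sum(((xi - x_bar) ** 2) * (ei ** 2) for xi, ei in zip(x, residuals))

    # HC0 variance of the slope, then the HC1 degrees-of-freedom correction.
    hc0_variance = meat / (sxx ** 2)
    hc1_variance = hc0_variance * n / (n - 2)
    return slope, math.sqrt(hc1_variance)


# Toy data for illustration only.
slope, se = robust_slope_se([0, 1, 2, 3], [0, 1, 1, 4])
print("slope:", slope, "robust SE:", se)
```

Robust standard errors change only the uncertainty attached to the slope, not the slope itself, so they do nothing about omitted variables or a non-linear relationship.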
In sum, while iD clearly had good intentions and sought to provide concrete evidence in support of expanding electricity and internet access for the betterment of primary school education, their results are undermined by the misspecification of their models.