Data Analysis Gone Wrong: Electricity and Internet Access as Predictors of Primary School Completion

This week, Invested Development, an “impact investment management firm”, published a blog article arguing that gaps in access to electricity and internet are two main drivers of low education rates in the developing world. Using World Bank data, they contend that “there is a clear linear correlation” between access to electricity and primary school completion. Based on their correlations, they also argue that “those countries with higher levels of internet subscribers are more likely to have better educational outcomes.”

While I recognize the good intentions of iD and their hard work in putting together the blog entry, I can’t help but criticize major methodological errors on their part.

Independent and Dependent Variable Confusion
It is clear from the way the research findings are reported that the authors intended to use primary education completion rates as the dependent variable, or outcome, in their models. Electricity and internet access are treated as the independent variables, or presumed causes.

However, the two scatter plots included in the blog depict primary education on the x-axis.

Graphic from “Education, Energy, and the Digital Divide in Africa”, iD Blog, 08 July 2014.

Graphic from “Education, Energy, and the Digital Divide in Africa”, iD Blog, 08 July 2014.

Anyone with an elementary knowledge of statistics will immediately recognize that the x-axis is conventionally reserved for the independent variable. As a refresher (or an introduction), here is the basic linear regression model:

y = α + βx + ε

where y is the outcome variable of interest, α is the intercept (the value of y when x is zero), β is the coefficient describing the relationship between the independent variable(s) x and y, and ε is an error term capturing the variation in y not explained by the independent variable(s) included in the model.

Clearly, iD intended to model:

primary education = α + β(electricity) + ε
primary education = α + β(internet) + ε

Instead, their graphics show models with primary education predicting electricity and internet access. In other words, the causal direction of the relationship is inverted. Here are the models they are actually showing us with these graphics:

electricity = α + β(primary education) + ε
internet = α + β(primary education) + ε
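
Why the choice of axis matters can be seen in a minimal sketch. It uses synthetic data only (not the World Bank series), and the variable names and the data-generating numbers are assumptions made up for illustration. Regressing y on x and regressing x on y give two different lines: the two slopes are not reciprocals of each other, and their product equals the squared correlation.

```python
import random


def ols_fit(x, y):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# Synthetic data only: NOT the World Bank series discussed in this post.
rng = random.Random(0)
electricity = [rng.uniform(5, 100) for _ in range(200)]
primary = [40 + 0.3 * e + rng.gauss(0, 15) for e in electricity]

# Intended model: primary education = alpha + beta * electricity + error
_, slope_primary_on_electricity = ols_fit(electricity, primary)
# Model implied by putting primary education on the x-axis
_, slope_electricity_on_primary = ols_fit(primary, electricity)

slope_product = slope_primary_on_electricity * slope_electricity_on_primary
r2 = r_squared(electricity, primary)

print("primary ~ electricity slope:", round(slope_primary_on_electricity, 3))
print("electricity ~ primary slope:", round(slope_electricity_on_primary, 3))
print("product of slopes:", round(slope_product, 3))
print("squared correlation:", round(r2, 3))
```

Unless the correlation is perfect, flipping the axes therefore changes the fitted line, which is why a plot with the outcome on the x-axis cannot be read as the intended model.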

Overstating the Relationship
To correct for this model misspecification, I downloaded data from the World Bank Data Bank on primary education completion rates, access to electricity, and internet users. I replicated iD’s graphics using time series data for the available country-years. These are represented below. Far more time series data are available on internet usage (N=4227) than on electricity coverage (N=170).

Replication of iD graphic from “Education, Energy, and the Digital Divide in Africa”, 08 July 2014.

Replication of iD graphic from “Education, Energy, and the Digital Divide in Africa”, 08 July 2014.

But the graphs above are essentially useless for the arguments made by iD. Primary education is not meant to be a predictor of internet and electricity access; it is meant to be predicted by these two technological factors.

I therefore re-ran their graphs and models using primary education as the dependent variable. These are depicted below. The “clear linear correlation” between electricity and primary education remains once the model is adjusted; however, the effect is not very pronounced (i.e. the regression line is almost horizontal). For internet access, the relationship is visibly non-linear and appears weak (again, the slope is rather flat).

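Corrected graph: primary school completion (y-axis) against access to electricity (x-axis).

Corrected graph: primary school completion (y-axis) against internet users (x-axis).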

Bivariate regressions in Stata indicate that both electricity and internet are positive predictors of primary completion rates. Electricity and internet are highly correlated with each other, so including both in the same model introduces multicollinearity, which inflates the standard errors and makes the individual coefficients hard to interpret. When I include them both anyway, internet is no longer significant.

Table: Stata regression results for the bivariate and combined models.
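
One common way to gauge how much the correlation between two predictors matters is the variance inflation factor (VIF). With exactly two predictors it reduces to 1 / (1 - r²), where r is their correlation, and the standard error of each coefficient is inflated by roughly the square root of that value. The sketch below uses synthetic data only; the variable names and numbers are assumptions for illustration and do not reproduce the Stata results.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a model with exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Synthetic data only: internet use is built to track electricity closely.
rng = random.Random(1)
electricity = [rng.uniform(5, 100) for _ in range(100)]
internet = [0.5 * e + rng.gauss(0, 5) for e in electricity]

vif = vif_two_predictors(electricity, internet)
print("correlation between predictors:", round(pearson_r(electricity, internet), 3))
print("variance inflation factor:", round(vif, 2))
print("standard error inflation:", round(math.sqrt(vif), 2))
```

A VIF close to 1 means the predictors carry largely separate information. A large value is one plausible reason a variable that is significant on its own loses significance once a correlated variable is added.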

Obviously, there are serious problems with these simple models. They should not be taken as evidence supporting or rejecting the hypothesis that primary school completion is impacted by electricity and internet access.

First, the models are plagued with missing data. To allow for comparison across the different models, I limited the analysis to the subset of country-years for which data was available for all variables. This reduced the sample to 57 countries in 2010 and 48 countries in 2011. All other country-years were omitted due to missing data.
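
Restricting the sample in this way is usually called complete-case (or listwise) deletion. A minimal sketch of the idea follows; the rows, country labels, and field names are hypothetical placeholders, not the actual World Bank records.

```python
def complete_cases(rows, required):
    """Keep only rows where every required field is present (not None)."""
    return [row for row in rows if all(row.get(key) is not None for key in required)]


# Hypothetical country-years; None marks a missing value.
rows = [
    {"country": "A", "year": 2010, "primary": 85.0, "electricity": 60.0, "internet": 12.0},
    {"country": "B", "year": 2010, "primary": 70.0, "electricity": None, "internet": 5.0},
    {"country": "C", "year": 2011, "primary": None, "electricity": 40.0, "internet": 8.0},
    {"country": "D", "year": 2011, "primary": 92.0, "electricity": 88.0, "internet": 30.0},
]

kept = complete_cases(rows, ["primary", "electricity", "internet"])
print("kept", len(kept), "of", len(rows), "country-years")
for row in kept:
    print(row["country"], row["year"])
```

Every model is then estimated on the same rows, which keeps the coefficients comparable but can shrink the sample sharply when one variable is sparsely reported.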

Second, these simple models fail to take into account the complicated mixture of factors influencing primary school completion rates across different regions. Specifically, the models suffer from serious omitted variable bias: factors such as reliance on subsistence farming, religion, region, and cost of attendance are left out.

Finally, these models do not account for the non-normal distribution of the primary completion variable. I attempted to correct for heteroskedasticity using robust standard errors, but the simple models may violate other OLS assumptions.
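
For readers unfamiliar with robust standard errors, here is a minimal sketch of the heteroskedasticity-robust (HC1) standard error for the slope in a one-predictor regression, next to the classical one. It uses synthetic data, and the function name and numbers are assumptions for illustration. I believe Stata's robust option applies the same n / (n - k) scaling, but treat that as an assumption rather than a reproduction of the models above.

```python
import math
import random


def ols_slope_with_se(x, y):
    """Fit y = a + b*x; return (b, classical SE of b, HC1 robust SE of b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - mean_y) for d, yi in zip(dx, y)) / sxx
    a = mean_y - b * mean_x
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Classical SE assumes one common error variance for all observations.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # HC1 weights each squared residual by its own leverage on the slope,
    # then applies the n / (n - k) small-sample scaling with k = 2.
    meat = sum((d * e) ** 2 for d, e in zip(dx, resid))
    se_robust = math.sqrt(n / (n - 2) * meat / sxx ** 2)
    return b, se_classical, se_robust


# Synthetic data only: the noise grows as x moves away from 50.
rng = random.Random(2)
x = [rng.uniform(0, 100) for _ in range(300)]
y = [40 + 0.3 * xi + rng.gauss(0, 2 + 0.3 * abs(xi - 50)) for xi in x]

slope, se_classical, se_robust = ols_slope_with_se(x, y)
print("slope:", round(slope, 3))
print("classical SE:", round(se_classical, 4))
print("robust (HC1) SE:", round(se_robust, 4))
```

Robust standard errors change only the uncertainty attached to the estimate. The slope itself is the same, so they do nothing about omitted variables or missing data.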

Conclusion

In sum, while iD obviously had good intentions and sought to provide concrete evidence in support of enhancing electricity and internet access for the betterment of primary school education, their results are flawed by a misspecification of their models.
