ECON 407/ADS 507: Data Science for Social Scientists Final Exam (Fall 2020) – January 14, 2021
1. Difference-in-differences (30 points)
Suppose you have access to individual-level data on two neighboring regions in Sub-Saharan Africa (regions and ); you observe the individuals living in these regions for four consecutive years (years 1, 2, 3, and 4). Both regions are poor and underdeveloped; they are both plagued with the malaria disease. Mobility between regions is not allowed. The World Health Organization (WHO) wants to test the effectiveness of a new vaccine against malaria. The WHO team travels to region and applies the vaccination for free in periods 3 and 4 to everyone in the region. You are a health economist and you want to estimate the impact of the vaccination program on the prevalence of the malaria disease.
Notation: The outcome variable is a dummy , , taking 1 if the individual living in region in period is infected (i.e., has malaria) and 0 if not infected. Denote the vector of relevant control variables with , , (you don’t need to specify or describe these control variables).
a) Please describe how you would design a difference-in-differences approach with which you can estimate the impact of the vaccination program on the malaria infection rate (15 points).
b) Write down and explain the main difference-in-differences assumptions required to obtain reliable estimates with a “causal” meaning (8 points).
c) What are the potential threats to identifying the causal effect of interest? (7 points).
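Illustrative sketch for Question 1: the code below simulates the two-region, four-year setup and computes a difference-in-differences estimate two ways, by a regression with a treated-by-post interaction and by the four cell means. Everything in it is an assumption made for illustration and is not given in the exam: the names `treated` and `post`, the infection probabilities, the common time trend, and the size of the vaccine effect. Parallel trends hold by construction, and control variables are omitted.

```python
import numpy as np

def simulate_panel(n_per_cell=20000, true_effect=-0.10, seed=0):
    """Simulate infection dummies for two regions over years 1-4.

    All numbers are hypothetical. Both regions share the same time trend,
    so the parallel-trends assumption holds by construction.
    """
    rng = np.random.default_rng(seed)
    treated_parts, year_parts, y_parts = [], [], []
    for treated in (0, 1):              # 1 = region that receives the vaccine
        for year in (1, 2, 3, 4):
            post = 1 if year >= 3 else 0  # vaccination happens in years 3 and 4
            prob = 0.40 + 0.05 * treated - 0.01 * year + true_effect * treated * post
            y_parts.append((rng.random(n_per_cell) < prob).astype(float))
            treated_parts.append(np.full(n_per_cell, float(treated)))
            year_parts.append(np.full(n_per_cell, float(year)))
    return (np.concatenate(treated_parts),
            np.concatenate(year_parts),
            np.concatenate(y_parts))

def did_by_ols(treated, post, y):
    """Coefficient on treated*post in a linear probability model."""
    X = np.column_stack([np.ones_like(y), treated, post, treated * post])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[3]

def did_by_means(treated, post, y):
    """(Treated: post minus pre) minus (control: post minus pre)."""
    def cell_mean(t, p):
        return y[(treated == t) & (post == p)].mean()
    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

treated, year, y = simulate_panel()
post = (year >= 3).astype(float)
did_ols = did_by_ols(treated, post, y)
did_means = did_by_means(treated, post, y)
print("DiD via OLS interaction:", did_ols)
print("DiD via cell means:     ", did_means)
```

Because the regression is saturated in the two dummies, the interaction coefficient and the difference of mean differences coincide. Comparing years 1 and 2 across regions before treatment is one informal way to probe the parallel-trends assumption asked about in part b).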
2. Instrumental variables (35 points)
Suppose you are a labor economist and you are interested in the empirical relationship between the years of education and labor market earnings. You have access to a nationally-representative sample of individuals and you observe their years of education ( ), labor market earnings ( ), and years of labor market experience ( ).
a) Write down the returns to schooling equation [Hint: a regression of the natural logarithm of earnings on years of schooling and a quadratic in years of labor market experience. Don’t forget the residual!] (10 points).
b) You want to estimate the causal impact of an additional year of schooling on labor market earnings using the returns to schooling equation. Can you get causal estimates simply using the OLS framework? What is the main econometric identification problem? Explain carefully (15 points).
c) Suppose you decided to implement an instrumental variables (IV) strategy to obtain causal estimates. You are choosing between two alternative instruments:
– A dummy variable indicating whether the individuals in your sample have educated or uneducated parents.
– A dummy variable indicating whether the individuals in your sample are subject to a change in compulsory schooling law (say an increase from 5 years to 8 years) or not.
Which one is a better IV? Provide a technical discussion for both variables [Hint: remember the two IV assumptions] (10 points).
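Illustrative sketch for Question 2: the simulation below shows how an unobserved factor (called `ability` here) can bias the OLS return to schooling upward, and how a manual two-stage least squares estimate using a reform-exposure dummy can recover the true coefficient when the instrument is relevant and excluded from the earnings equation. All variable names, coefficients, and the data-generating process are assumptions made for illustration; nothing here comes from real data.

```python
import numpy as np

def ols(X, y):
    """Return least-squares coefficients for y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
n = 200_000
true_return = 0.08                       # hypothetical return to one year of schooling

ability = rng.normal(size=n)             # unobserved by the econometrician
reform = rng.integers(0, 2, size=n).astype(float)  # 1 = exposed to the schooling law change
exper = rng.uniform(0, 30, size=n)       # years of labor market experience

# Schooling depends on the instrument and on ability (source of endogeneity).
school = 8 + 2 * reform + 1.5 * ability + rng.normal(size=n)

# Log earnings: schooling, quadratic in experience, and a residual containing ability.
log_earn = (1 + true_return * school + 0.04 * exper - 0.0008 * exper**2
            + 0.3 * ability + rng.normal(scale=0.5, size=n))

controls = np.column_stack([np.ones(n), exper, exper**2])

# Naive OLS: schooling is correlated with the omitted ability term.
beta_ols = ols(np.column_stack([school, controls]), log_earn)[0]

# Manual 2SLS: first stage predicts schooling from the instrument and controls,
# second stage replaces schooling with its fitted values.
Z = np.column_stack([reform, controls])
school_hat = Z @ ols(Z, school)
beta_iv = ols(np.column_stack([school_hat, controls]), log_earn)[0]

print("true return:", true_return)
print("OLS estimate:", beta_ols)
print("2SLS estimate:", beta_iv)
```

Only the point estimate from this manual two-step procedure is meaningful; the second-stage residuals do not give valid standard errors, which a dedicated IV routine would correct.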
3. Regression discontinuity design (35 points)
Suppose you want to estimate the causal impact of an Honors degree, awarded to students who graduate with a Cumulative Grade Point Average (CGPA) of 3.00 or above out of 4.00, on labor market earnings 10 years after graduation. You have access to a large individual-level dataset of individuals who graduated in a given year (i.e., in the same year). You observe the labor market earnings ( ) of those individuals 10 years after graduation and their CGPAs ( ).
a) Suppose you run an OLS regression of the natural logarithm of labor market earnings on a dummy variable indicating whether the individual has the award or not (for your entire sample). Can you obtain the causal impact of the award on earnings in this regression? Why? Explain carefully (12 points).
b) Now you are implementing an RDD analysis to estimate the causal effect of an Honors degree on earnings. Which variable is the forcing (or assignment) variable? Plot a reasonable figure assuming linearity (8 points).
c) What are the main assumptions for your RDD to be valid? (8 points).
d) How would you test whether the individuals are randomly distributed around the threshold? [Hint: suppose you also observe a large set of individual-level characteristics] (7 points).
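Illustrative sketch for Question 3: the code below simulates a sharp regression discontinuity at a CGPA of 3.00 and estimates the jump in log earnings with separate linear fits on each side of the cutoff inside a bandwidth. It then applies the same estimator to a predetermined characteristic, where no jump should appear if individuals are comparable around the threshold. The bandwidth, the variable names (`cgpa`, `honors`, `family_background`), and all coefficients are assumptions chosen for illustration only.

```python
import numpy as np

CUTOFF = 3.00  # Honors threshold stated in the question

def rd_jump(running, outcome, cutoff=CUTOFF, bandwidth=0.5):
    """Estimate the discontinuity at the cutoff with a local linear regression.

    Regresses the outcome on a treatment dummy, the centered running variable,
    and their interaction, using only observations within the bandwidth.
    """
    centered = running - cutoff
    keep = np.abs(centered) <= bandwidth
    c = centered[keep]
    d = (c >= 0).astype(float)          # 1 = at or above the cutoff
    X = np.column_stack([np.ones_like(c), d, c, d * c])
    beta, *_ = np.linalg.lstsq(X, outcome[keep], rcond=None)
    return beta[1]                      # coefficient on the treatment dummy

rng = np.random.default_rng(2)
n = 100_000
true_jump = 0.10                        # hypothetical effect of the award on log earnings

cgpa = rng.uniform(2.0, 4.0, size=n)
honors = (cgpa >= CUTOFF).astype(float)

# Earnings rise smoothly (linearly) with CGPA and jump at the threshold.
log_earn = 10 + 0.3 * (cgpa - CUTOFF) + true_jump * honors + rng.normal(scale=0.3, size=n)

# A predetermined characteristic that varies smoothly with CGPA (no jump by construction).
family_background = 0.5 * (cgpa - CUTOFF) + rng.normal(size=n)

jump_earnings = rd_jump(cgpa, log_earn)
jump_covariate = rd_jump(cgpa, family_background)
print("estimated jump in log earnings:", jump_earnings)
print("estimated jump in predetermined covariate:", jump_covariate)
```

A covariate jump close to zero is consistent with, but does not prove, the absence of sorting around the cutoff; inspecting the density of CGPA near 3.00 is a complementary check.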