Data validation refers to the process of ensuring that the characteristics of your data match the known characteristics of the population as measured by other analysts. If you have large discrepancies between your estimates and the estimates compiled by others, this is a clear “red flag” that something has gone wrong. Usually this is a problem that can be solved by going back to cleaning the data, but sometimes your sample may diverge in important ways from the samples collected by others. You will need to state why this is the case in your write-up of the data.
Data validation can be done in several ways:
./tables/
First things first, you need to add a new subdirectory ./tables/
to your class folder. It should be of the same level as the other subdirectories you’ve already created: data
, do
, plots
, aux
.
Today, we’ll use the plans
dataset. We’re going to compare our results with several tables published by NCES. Let’s start with educational expectations of high school sophomores. We start by survey setting the data:
. global datadir "../data/"
. global tabsdir "../tables/"
. // load plans data
. use ${datadir}plans.dta
. // set up data for survey commands
. svyset psu [pw = bystuwt], str(strat_id) singleunit(scaled)
pweight: bystuwt
VCE: linearized
Single unit: scaled
Strata 1: strat_id
SU 1: psu
FPC 1: <zero>
The next step is to account for missing data properly:
. local allvar bystexp bysex byrace byses1 f1psepln
. // change values for vars in local that in (-4,-8,-9) to missing
. foreach myvar in `allvar' {
2. replace `myvar' = . if `myvar' == -4
3. replace `myvar' = . if `myvar' == -8
4. replace `myvar' = . if `myvar' == -9
5. }
(648 real changes made, 648 to missing)
(276 real changes made, 276 to missing)
(0 real changes made)
(648 real changes made, 648 to missing)
(171 real changes made, 171 to missing)
(0 real changes made)
(648 real changes made, 648 to missing)
(276 real changes made, 276 to missing)
(0 real changes made)
(648 real changes made, 648 to missing)
(276 real changes made, 276 to missing)
(0 real changes made)
(1,131 real changes made, 1,131 to missing)
(781 real changes made, 781 to missing)
(46 real changes made, 46 to missing)
Next, we tabulate expectations for college and compare it to a known estimate.
. tab bystexp
how far in school student thinks will |
get-composite | Freq. Percent Cum.
----------------------------------------+-----------------------------------
{don^t know} | 1,450 9.52 9.52
less than high school graduation | 128 0.84 10.36
high school graduation or ged only | 983 6.45 16.81
attend or complete 2-year college/schoo | 879 5.77 22.58
attend college, 4-year degree incomplet | 561 3.68 26.26
graduate from college | 5,416 35.55 61.81
obtain master^s degree or equivalent | 3,153 20.69 82.50
obtain phd, md, or other advanced degre | 2,666 17.50 100.00
----------------------------------------+-----------------------------------
Total | 15,236 100.00
. svy: proportion bystexp
(running proportion on estimation sample)
Survey: Proportion estimation
Number of strata = 361 Number of obs = 16,160
Number of PSUs = 751 Population size = 3,408,319
Design df = 390
_prop_1: bystexp = {don^t know}
_prop_2: bystexp = less than high school graduation
_prop_3: bystexp = high school graduation or ged on
_prop_4: bystexp = attend or complete 2-year colleg
_prop_5: bystexp = attend college, 4-year degree in
_prop_6: bystexp = graduate from college
_prop_7: bystexp = obtain master^s degree or equiva
_prop_8: bystexp = obtain phd, md, or other advance
--------------------------------------------------------------
| Linearized
| Proportion Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
bystexp |
_prop_1 | .0987875 .0030196 .0930076 .1048851
_prop_2 | .0094831 .00098 .007738 .0116172
_prop_3 | .0724693 .0030538 .0666899 .0787074
_prop_4 | .0643949 .0028925 .0589365 .0703211
_prop_5 | .0389852 .0018459 .0355139 .0427808
_prop_6 | .3578959 .0046507 .3488048 .3670902
_prop_7 | .1971035 .004424 .1885502 .2059464
_prop_8 | .1608805 .0039873 .1531947 .1688749
--------------------------------------------------------------
We get output in the console, but let’s use the eststo
and esttab
commands to store our estimates and produce nicer tables. Using esttab
alone, we’ll get a nicely formatted table in the console. By adding ... using <file>
we save an .rtf
version of the same table. We can easily paste this table in a paper. If you are feeling bold, you could output the table in LaTeX format and incorporate into your LaTeX-formatted document.
. estimates store expect_tab
. // save as table using esttab
. esttab expect_tab using ${tabsdir}expect_tab.rtf, b(3) se(4) ///
> varlabels(_prop_1 "Unsure" ///
> _prop_2 "Less than HS" ///
> _prop_3 "HS or GED" ///
> _prop_4 "AA/AS" ///
> _prop_5 "Some college" ///
> _prop_6 "BA/BS" ///
> _prop_7 "MA/MS" ///
> _prop_8 "PhD or Prof") ///
> replace
(output written to ../tables/expect_tab.rtf)
. // post clean table to output window
. esttab expect_tab, b(3) se(4) ///
> varlabels(_prop_1 "Unsure" ///
> _prop_2 "Less than HS" ///
> _prop_3 "HS or GED" ///
> _prop_4 "AA/AS" ///
> _prop_5 "Some college" ///
> _prop_6 "BA/BS" ///
> _prop_7 "MA/MS" ///
> _prop_8 "PhD or Prof")
----------------------------
(1)
Proportion
----------------------------
bystexp
Unsure 0.099***
(0.0030)
Less than HS 0.009***
(0.0010)
HS or GED 0.072***
(0.0031)
AA/AS 0.064***
(0.0029)
Some college 0.039***
(0.0018)
BA/BS 0.358***
(0.0047)
MA/MS 0.197***
(0.0044)
PhD or Prof 0.161***
(0.0040)
----------------------------
N 16160
----------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
NB: The ///
at the end of each line in the esttab
commands tells Stata to move to the next line but that the command isn’t yet finished. Without this, the options would stretch far on one line: bad coding practice. I could have also changed the delimiter to ;
like we did when reading in NCES datasets in the earlier lecture.
Now that we have a clean table to look at, is this the same as Table 2 on page 22 of the report? Yes. Checking the standard errors reveals that there were also correctly done. Now we need to check this for all of the other variables in our dataset.
I want you to replicate Table 34 on page 128 of NCES 2005-338. We’ll split this up, but I want the class to come up with a single table that has exactly the same results as the NCES document.
Init: 25 August 2015; Updated: 25 August 2015