Saturday 28 November 2015

Step 2: Extracting and merging NSSO datasets in a few simple steps using the Stata software package

(A) Understand the folders and files on the Unit Level Record data CD bought from NSSO

Here, we will use the unit level data of the 68th round Employment & Unemployment Survey (2011-12), Schedule Type 1, given in the folder called Nss68_1.0_Type1. Its contents are as below,

The most important files are the data files; what each of them contains is explained very nicely in the Readme file, as below,

Next is the multiplier file, which explains the basis for the multipliers assigned to each household; these are simply the sampling weights (we explained in an earlier post how they are derived).


But the most important file, the key to extracting your data, is the Excel layout table called Layout68_1_0typ1.XLS (see the first figure), as below. You should understand it thoroughly before you start your work. We will guide you through every step here.



(B) Understand the levels of data presented and create a common household ID and also a person ID; use the Stata code for extraction and merging as given below,

/*Our strategy is to extract the relevant data into Stata format, so for each level we create a .dta file. To avoid confusion, we give the dictionary (.dct) file the same name as the .dta file that will be created from it. The .dct file first lists the variables needed to create the HOUSEHOLD ID (or PERSON ID, as the case may be), and then the other variables of interest. Choose and name the variables you require using the LAYOUT table (take a printout). In the example below we jump straight to the Level2 file, since the HHID can be created from there (instead of from Level1) and additional variables are available as well*/
/* CREATING THE HH_CHARS.DTA FILE*/

/* 1st Step: Create hh_chars.dct for the household details as below, saving it under that name. Use the Stata Do-file editor for this. The dictionary essentially informs Stata about the structure of the fixed-width text file, i.e., the byte positions of the specific fields you want extracted */
infix dictionary
{
          str fsu 4-8
          str hg 32-32
          str sss 33-33
          str hhno 34-35
          str state 16-17
          str sector 15-15
          hh_size 43-44
          str prin 45-49
          str proc 50-52
          str hh_type 53-53
          str religion 54-54
          str social_group 55-55
          land_own 56-63
          land_pos 64-71
          land_cult 72-79
          bank 83-83
          str level 36-37
          nss 127-129
          nsc 130-132
          mlt 133-142
}        

/* 2nd Step: Create hh_chars.dta as below from Level2. This asks Stata to extract data from the Level2 text file, at the byte positions given in the .dct file, into the fields you named there */

infix using "D:\NSS\EU68\hh_chars.dct", using ("D:\NSS\EU68\R6810L02.txt")

browse

/*The most important code below: it fixes the identity of households across all the level files, for all time! Every level file compulsorily carries these variables within Common Items 1-35, at the same positions and with the same number of bytes. Generate hhid in every level file you use, so that households keep a common identity across files!*/

gen hhid= fsu+ hg+ sss+ hhno
brow
gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc
brow

* CHECK FOR DUPLICATION IN hhid
isid hhid

* CHECK THE NUMBER OF HOUSEHOLD RECORDS IN DATASET
count

* CALCULATE THE ESTIMATED NUMBER OF HOUSEHOLDS

* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)
destring sector, gen(sector_nu)
label define lblSector 1 "Rural" 2 "Urban"
label values sector_nu lblSector

/* PERCENTAGE DISTRIBUTION OF HOUSEHOLDS BY SECTOR IN THE SAMPLE (WITHOUT ATTACHING COMBINED WEIGHTS)*/

tab sector_nu

/* PERCENTAGE DISTRIBUTION OF HOUSEHOLDS IN THE POPULATION BY SECTOR (i.e. ATTACHING COMBINED WEIGHTS)*/

tabulate sector_nu [fweight = round(comb_wt)] 
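Rounding the weights for fweight can distort the totals slightly. A minimal alternative sketch, assuming the comb_wt and sector_nu variables created above, declares the weight as a probability weight and uses Stata's survey prefix instead:

```stata
* Alternative: declare comb_wt as a probability weight (no rounding needed)
svyset [pweight = comb_wt]
* Weighted percentage distribution of households by sector
svy: tabulate sector_nu
```

This keeps the fractional weights intact; the fweight route above is fine for quick checks.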
save "D:\NSS\EU68\hh_chars.dta"
clear all
**********************************************************************************
/* CREATING THE PS_CHARS.DTA FILE*/
* 1st Step: Create the ps_chars.dct for personal details as below by saving it with the same name
infix dictionary
{
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str psno 41-42
          str relhd 43-43
          str sex 44-44
          str age 45-47
          str mar 48-48
          str genedu 49-50
          str tecedu 51-52
          str eduatt 53-54
          str vt 57-57
          str ft 58-59
          str mgn 60-60
          str mgnw 61-61
          nss 127-129
          nsc 130-132
          mlt 133-142
}        
* 2nd Step: Create the ps_chars.dta as below from Level3

infix using "D:\NSS\EU68\ps_chars.dct", using ("D:\NSS\EU68\R6810L03.txt")
browse

*Generate HHID for household identification & PSID for person identification
gen hhid= fsu+ hg+ sss+ hhno
gen psid= hhid+psno
brow
gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc
br
* CHECK FOR DUPLICATION IN hhid: this should report that hhid does not uniquely identify the observations, since all persons of a household share the same hhid (capture keeps a do-file running past the error)
capture noisily isid hhid
* CHECK FOR DUPLICATION IN psid: isid should run silently, confirming psid is unique
isid psid
* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
/*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data*/
count
* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)
save "D:\NSS\EU68\ps_chars.dta"
clear all
**********************************************************************************
/* CREATING THE PS_PREF.DTA FILE*/
/* 1st Step: Create the ps_pref.dct for personal details as below by saving it with the same name*/
infix dictionary
{
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str sector 15-15
          str state 16-18
          str psno 41-42
          str age 43-45
          str dom 46-46
          str domres 47-47
          str dompur 48-48
          str hhind 50-50
          str coll 52-52
          str dung 58-58
          str tail 59-59
          str watdist 63-65
          str w_will 66-66
          str w_nat 67-67
          str w_type 68-68
          str w_skill 69-69
          str w_asst  70-70
          nss 127-129
          nsc 130-132
          mlt 133-142
}        
* 2nd Step: Create the ps_pref.dta as below from Level8
infix using "D:\NSS\EU68\ps_pref.dct", using ("D:\NSS\EU68\R6810L08.txt")
browse
*Generate HHID for household identification & PSID for person identification
gen hhid= fsu+ hg+ sss+ hhno
gen psid= hhid+psno
brow
gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc
br
/* CHECK FOR DUPLICATION IN hhid: this should report that hhid does not uniquely identify the observations, since all persons of a household share the same hhid (capture keeps a do-file running past the error)*/
capture noisily isid hhid

* CHECK FOR DUPLICATION IN psid: isid should run silently, confirming psid is unique
isid psid

* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
/*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data*/

count

* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\ps_pref.dta"

clear all

*************************************************************************************
/* CREATING THE HH_CON.DTA FILE*/

/* 1st Step: Create the hh_con.dct for household consumption expenditure details as below by saving it with the same name*/
infix dictionary
{
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str item 41-42
          str con30 43-50
          str con365 51-58
          nss 127-129
          nsc 130-132
          mlt 133-142
}        

* 2nd Step: Create the hh_con.dta as below from Level9
infix using "D:\NSS\EU68\hh_con.dct", using ("D:\NSS\EU68\R6810L09.txt")

browse
*Generate HHID for household identification

gen hhid= fsu+ hg+ sss+ hhno

*Generate ITEMHHID for household & its item consumption *identification

gen itemhhid = hhid+item

brow

gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc

br

* CHECK FOR DUPLICATION IN hhid: this should report non-uniqueness, since a household has one record per consumption item (capture keeps a do-file running past the error)

capture noisily isid hhid

* CHECK FOR DUPLICATION IN itemhhid: isid should run silently, confirming itemhhid is unique
isid itemhhid

* CHECK THE NUMBER OF ITEMS & HH COMBO RECORDS IN DATASET
*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data

count

* CALCULATE THE ESTIMATED NUMBER OF ITEM&HH COMBINATIONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\hh_con.dta"

clear all

*************************************************************
/* CREATING THE PS_ACT.DTA FILE*/

* 1st Step: Create the ps_act.dct for person activity details as below by saving it with the same name
infix dictionary
{
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str psno 41-42
          str age2 43-45
          str pracstat 46-47
          str pracnic 48-52
          str pracnco 53-55
          str ssac 56-56
          str sawork 66-66
          nss 127-129
          nsc 130-132
          mlt 133-142
}        


* 2nd Step: Create the ps_act.dta as below from Level4
infix using "D:\NSS\EU68\ps_act.dct", using ("D:\NSS\EU68\R6810L04.txt")

browse
*Generate HHID for household identification

gen hhid= fsu+ hg+ sss+ hhno

*Generate PSID for person identification

gen psid = hhid+psno

brow

gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc

br

* CHECK FOR DUPLICATION IN hhid: this should report non-uniqueness, since all persons of a household share the same hhid (capture keeps a do-file running past the error)

capture noisily isid hhid

* CHECK FOR DUPLICATION IN psid: isid should run silently, confirming psid is unique
isid psid

* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data

count

* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\ps_act.dta"

clear all

*************************************************************
/* CREATING THE PS_SACT.DTA FILE*/

* 1st Step: Create the ps_sact.dct for person subsidiary activity details as below by saving it with the same name

infix dictionary
{
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str psno 41-42
          str age3 43-45
          str pracstat 46-47
          str pracnic 48-52
          str pracnco 53-55
          str loct 56-57
          str entype 58-58
          str enelec 59-59
          str enworker 60-60
          str jobtyp 61-61
          str elpl 62-62
          str elss 63-63
          str pay 64-64
          nss 127-129
          nsc 130-132
          mlt 133-142
}        
         

* 2nd Step: Create the ps_sact.dta as below from Level5
infix using "D:\NSS\EU68\ps_sact.dct", using ("D:\NSS\EU68\R6810L05.txt")

browse
*Generate HHID for household identification

gen hhid= fsu+ hg+ sss+ hhno

*Generate PSID for person identification

gen psid = hhid+psno

brow

gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc

br

* CHECK FOR DUPLICATION IN hhid: this should report non-uniqueness, since all persons of a household share the same hhid (capture keeps a do-file running past the error)

capture noisily isid hhid

* CHECK FOR DUPLICATION IN psid: isid should run silently, confirming psid is unique

isid psid

* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
*This should tally with the Readme_68_10 file for the number of records in the level from which you are extracting data

count

* CALCULATE THE ESTIMATED NUMBER OF PERSONS WITH SUBSIDIARY ACTIVITY
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\ps_sact.dta"

clear all

*************************************************************************************
/*Merging the person files with the household files requires a common key present in both. Remember that many persons may belong to ONE household! We therefore have an m:1 (many-to-one) relationship between person records and household records, viz. one household record may be associated with many person records, all belonging to the same household or family*/

/* 1st Step: Call the "Many" file first (here, ps_chars.dta), sort it on the common key hhid, and save it */

use "D:\NSS\EU68\ps_chars.dta"
sort hhid
save "D:\NSS\EU68\ps_chars.dta", replace

/* 2nd Step: Call the "One" file next (here, hh_chars.dta), sort it on the same key hhid, and save it */

use "D:\NSS\EU68\hh_chars.dta"
sort hhid
save "D:\NSS\EU68\hh_chars.dta", replace

/* 3rd Step: Now issue the merge command for merging the sorted person file with the household file in m:1 fashion */

/*First call the "Many file"*/

use "D:\NSS\EU68\ps_chars.dta"

/*Now call the "One" file and issue the merge command on the common key hhid. The merged dataset will have all records of the "Many" file, with each "One" record's values repeated across its matching person records*/

merge m:1 hhid using "D:\NSS\EU68\hh_chars.dta"
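Before saving, it is worth checking how the merge went. A short sketch using the `_merge` variable that Stata's merge command creates by default:

```stata
* _merge codes: 1 = person record only, 2 = household record only, 3 = matched
tab _merge
* Every person should belong to a surveyed household, so all should be matched
assert _merge == 3
drop _merge
```

If the assert fails, compare the record counts against the Readme file before proceeding.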

save "D:\NSS\EU68\pshh_chars.dta"

*************************************************************************************

Welcome!

Without wasting any of your precious time, we will take you through the process in steps, using India's NSSO 68th round as an example. All your queries will be answered within 3 days!

STEP-1: Understanding the sampling methodology and calculation of sampling weights

Here, we will illustrate the methodology for sub-sample-wise estimates in the rural sector for the consumer expenditure survey of the 68th round of NSSO (2011-12). Let's say we want to see the data for Meghalaya, as in Fig-1.

Meghalaya has several districts, as can be seen in the map below in Fig-2.


NSSO has the concept of an NSS State-Region: a contiguous group of districts within a State having similar topography, agro-economic characteristics and population density. Larger States have up to 7 regions, while smaller States/UTs have only one. Since the regions have distinctive geographical features and climatic conditions, regional estimates are more meaningful and useful in some respects. The boundaries of a region generally do not cut across district or State boundaries. A region may comprise a single district in small States/UTs, but 10-12 districts in larger States.
In our example, the entire State of Meghalaya has been assigned as a single region, viz. NSS-SR = 171. Check out Fig-5 below,


Central Sample & State Samples: In most NSS rounds, the survey for the State sample is conducted by the respective State Governments, while the Central sample is surveyed by the NSSO. State samples and Central samples are drawn and allocated separately.

For rural India under the Central sample, 59,695 households (SSUs, explained later) were surveyed from 7,469 sample villages (FSUs, explained later) across the country. From here on, we will discuss only the Central sample.

Schedule Type 1 & Type 2: Type 1 questionnaires are based on consumption during the last 30 days and the last 365 days, while Type 2 questionnaires are based on the last 365 days (infrequently purchased categories), the last 7 days (some food categories) and the last 30 days (other food, fuel, etc.).
Schedule Type 1 and Schedule Type 2 were canvassed in two independent samples of matching size drawn from each stratum/sub-stratum (the same FSUs/villages). Thus, under Schedule Type 1, 59,695 households were surveyed, while under Schedule Type 2, 59,683 households were surveyed separately, from the same 7,469 villages under the Central sample. From here on, we will discuss only the Type 1 sample.

Stratum: Districts within the NSS-SR are taken as the primary strata. Each stratum is further divided into 2 sectors, viz. the Urban sector and the Rural sector. We will discuss only the rural sector in this example, where we have 7 strata under the NSS-SR, each a distinct district with its own geographical boundary.
Sub-Stratum: If the number of households in a district is large, it is sub-divided into two or more sub-strata of nearly equal household counts, formed by grouping together contiguous villages with similar socio-economic characteristics.
In our example, the villages of West Garo Hills district (Stratum=1, Distcd=1) have been grouped into 6 sub-strata, such that each group has a more or less equal number of households. Check out the Fig-5 table above and Fig-3 below as a conceptual illustration,



FSU (First Stage Unit): the villages in the sub-stratum, for the rural sector. Here, the sampling frame is the list of villages under Census 2011.
USU/SSU (Ultimate/Second Stage Unit): the households available for survey within the village. Here, the sampling frame is the households available for survey, not all the households in the village (please note the distinction, it is important!).
Sub-Rounds: The survey period is of one year's duration, starting 1st July and ending 30th June (India's agricultural year). It is broken into 4 sub-rounds of 3 months each, as below,
Sub-round 1 : July-Sept
Sub-round 2 : Oct-Dec
Sub-round 3 : Jan-Mar
Sub-round 4 : Apr-Jun

Concept of Interpenetrating Sub-samples: The method consists in drawing samples of FSUs/villages as 2 villages @ 8 households per village, i.e., 16 households forming ONE sub-sample, with ANOTHER sub-sample drawn independently in the same manner. Thus, we have 2 SUB-SAMPLES IN EACH SUB-STRATUM. Check out the Fig-6 table below.
The main advantage is that the Relative Standard Error (RSE) of estimates can easily be worked out with this method, even when the sample design is complex.
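To make the advantage concrete: if two independent sub-sample estimates of the same population total are available, a variance estimate follows from a single subtraction. This is our summary of the standard textbook result for two interpenetrating sub-samples, not copied from the NSSO documentation:

```latex
\hat{y}=\frac{\hat{y}_1+\hat{y}_2}{2},\qquad
\widehat{\operatorname{Var}}(\hat{y})=\frac{(\hat{y}_1-\hat{y}_2)^2}{4},\qquad
\operatorname{RSE}(\%)=\frac{|\hat{y}_1-\hat{y}_2|}{\hat{y}_1+\hat{y}_2}\times 100
```

So the spread between the two sub-sample estimates directly measures the precision, no matter how complex the design within each sub-sample.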

Second Stage Stratum (SSS): Each household available for survey in the selected FSU/village is categorised into ONE OF THE THREE categories below, illustrated conceptually in Fig-4,

SSS 1 Relatively affluent
SSS 2 Of the remaining, having principal earning from  non-agricultural sources
SSS 3 All other remaining


Generally, 2 households from SSS 1, 4 households from SSS 2 and 2 households from SSS 3 are selected from the frame of households available for survey IN THE FSU/VILLAGE, as explained above. (If some category falls short, the shortfall is made up as per priority rules, ensuring ALWAYS that a total of 8 households is selected from each FSU/village.)

Fig-7 illustrates the case for only 4 villages for your understanding,



Now we can show you how the sampling weights are calculated!



The meanings of the notations used by NSSO are as below,




Go back to Fig-6 

Can you try these now? Find out the relevant formula and check whether it matches the calculated weight given on the left! If you can, then you have definitely understood the sampling plan and the concept of multipliers/weights in NSSO unit level data.

For the first row, the calculation of the multiplier for this household is worked out for you below. See the complete working in an Excel file here! => ($V$2/4)*(1/$BG$15)*(BG29/COUNTIF(AR2:AR9,1))
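Conceptually, the spreadsheet expression builds the multiplier as the inverse of the household's overall inclusion probability, one factor per sampling stage. A plausible reading of its shape (the exact meaning of each cell reference must be checked against the linked workbook) is:

```latex
M \;=\;
\underbrace{\frac{1}{\Pr(\text{village selected in its sub-stratum})}}_{\text{first stage}}
\;\times\;
\underbrace{\frac{H}{h}}_{\text{second stage}}
```

where H is the number of households listed in the household's second-stage stratum (SSS) in that village and h the number actually surveyed there; the COUNTIF term appears to play the role of h, counting the sampled households falling in the same SSS.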


Congratulations!

Thank you! I hope you enjoyed understanding this apparently complex material as much as I enjoyed preparing its explanation!

Next, we will look at how to decipher and analyse NSSO data for a given problem.

So for now, good-bye and good luck!