Step 2: Extracting & merging NSSO datasets in a few simple steps using the Stata software package
(A) Understand the folder and files you get in the Unit Level Record data CD bought from NSSO
Here, we will use the unit-level 68th Round Employment & Unemployment data (2011-12), Schedule Type 1, given in the folder called Nss68_1.0_Type1. Its contents are as below:
The most important files are the data files; what they contain is explained very nicely in the Readme file, as below.
Next is the multiplier file, which explains the basis of the multipliers assigned to each household. These are nothing but sampling weights (we explained in earlier posts how they are derived).
But the most important files, the key to your extraction work, are provided in the Excel table called Layout68_1_0typ1.XLS (see the first figure), as below. You may like to understand these thoroughly before you start your work; we will guide you through every step here.
(B) Understand the levels of data presented and create a common household ID and also a person ID; use the Stata code for extraction and merging as given below.
/* Our strategy is to extract the relevant data in Stata file format, so for
each level we create a .dta file. To avoid confusion, we give the dictionary
file (.dct) the same name as the .dta file to be created later. The .dct file
first lists the variables needed to create the HOUSEHOLD ID or PERSON ID, as
the case may be, and then the other variables of interest. Choose and name the
variables you require as given in the LAYOUT table (take a printout). In the
case below, we jump straight to the Level 2 file, as the HHID can be created
from there (instead of from Level 1) and additional variables can be created
as well. */
/* CREATING THE HH_CHARS.DTA FILE */
/* 1st Step: Create hh_chars.dct for household details as below, saving it
under that name. Use the Stata Do-file editor for this. The dictionary
describes the structure of the raw data file, telling Stata the byte
positions of the specific fields you want to extract. */
infix dictionary {
    str fsu 4-8
    str hg 32-32
    str sss 33-33
    str hhno 34-35
    str state 16-17
    str sector 15-15
    hh_size 43-44
    str prin 45-49
    str proc 50-52
    str hh_type 53-53
    str religion 54-54
    str social_group 55-55
    land_own 56-63
    land_pos 64-71
    land_cult 72-79
    bank 83-83
    str level 36-37
    nss 127-129
    nsc 130-132
    mlt 133-142
}
/* 2nd Step: Create hh_chars.dta from Level 2 as below. You are asking Stata
to extract data from the Level 2 text file, reading the fields from the byte
positions given in the .dct file. */
infix using "D:\NSS\EU68\hh_chars.dct", using("D:\NSS\EU68\R6810L02.txt")
browse
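To see what the dictionary-plus-infix step does under the hood, here is a rough sketch in Python of fixed-width extraction by byte position. The record content is made up for demonstration; the real byte positions always come from the layout file.

```python
# Sketch of fixed-width extraction: each field lives at fixed 1-based
# byte positions in every record, exactly as listed in the .dct file.
def field(line, start, end):
    """Return bytes start..end (1-based, inclusive) of a record."""
    return line[start - 1:end]

# Dummy 142-byte record: fsu in bytes 4-8, hg in byte 32 (values made up).
record = "XXX" + "00123" + "X" * 23 + "1" + "X" * 110

fsu = field(record, 4, 8)    # first-stage sampling unit, bytes 4-8
hg  = field(record, 32, 32)  # hamlet group, byte 32
```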
/* The most important code below fixes the identity of households across all
level files, for all time! Every level file carries these variables within
Common Items 1-35, in the same positions and with the same number of bytes.
You must generate hhid in every level file you use, to maintain a common
identity for each household! */
gen hhid = fsu + hg + sss + hhno
browse
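The reason the ID components were declared `str` in the dictionary is that numeric storage would drop leading zeros and break the concatenation. A tiny Python illustration, with made-up values:

```python
# hhid is formed by string concatenation, which preserves leading
# zeros and fixed widths; numeric addition would destroy both.
# All values below are made up.
fsu, hg, sss, hhno = "00123", "1", "2", "04"

hhid = fsu + hg + sss + hhno            # 9-character household ID
numeric_sum = int(fsu) + int(hg) + int(sss) + int(hhno)  # useless as an ID
```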
gen comb_wt = mlt/100 if nss == nsc
replace comb_wt = mlt/200 if nss != nsc
browse
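The combined-weight rule above reads: divide the multiplier by 100 when the two subsample counts (nss, nsc) agree, and by 200 when they differ. A minimal Python restatement, with illustrative numbers:

```python
# Mirror of the Stata logic:
#   comb_wt = mlt/100 if nss == nsc, else mlt/200
def comb_wt(mlt, nss, nsc):
    return mlt / 100 if nss == nsc else mlt / 200

w_same = comb_wt(12040, 5, 5)  # counts agree  -> divide by 100
w_diff = comb_wt(12040, 5, 9)  # counts differ -> divide by 200
```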
* CHECK FOR DUPLICATION IN hhid
isid hhid
* CHECK THE NUMBER OF HOUSEHOLD RECORDS IN THE DATASET
count
* CALCULATE THE ESTIMATED NUMBER OF HOUSEHOLDS
* tabstat comb_wt, statistics(sum) format(%14.0g)
tabstat comb_wt, statistics(sum) format(%14.0f)
destring sector, gen(sector_nu)
label define lblSector 1 "Rural" 2 "Urban"
label values sector_nu lblSector
/* PERCENTAGE DISTRIBUTION OF HOUSEHOLDS BY SECTOR IN THE SAMPLE (WITHOUT
ATTACHING COMBINED WEIGHTS) */
tab sector_nu
/* PERCENTAGE DISTRIBUTION OF HOUSEHOLDS IN THE POPULATION BY SECTOR (i.e.
ATTACHING COMBINED WEIGHTS) */
tabulate sector_nu [fweight = round(comb_wt)]
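What `[fweight = round(comb_wt)]` does is count each sample household `round(comb_wt)` times, so the tabulation estimates population shares rather than sample shares. A toy Python sketch with made-up data:

```python
from collections import Counter

# Each row: (sector, combined weight). Frequency weights must be
# integers, hence round(), just as in the Stata command above.
rows = [("Rural", 120.4), ("Rural", 80.0), ("Urban", 199.6)]

sample = Counter(sector for sector, _ in rows)  # unweighted tally
population = Counter()
for sector, w in rows:
    population[sector] += round(w)              # weighted tally
# The sample counts 2 Rural households vs 1 Urban, but the
# population estimate is 200 vs 200.
```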
save "D:\NSS\EU68\hh_chars.dta"
clear all
**********************************************************************************
/* CREATING THE PS_CHARS.DTA FILE */
/* 1st Step: Create ps_chars.dct for person details as below, saving it under
that name */
infix dictionary {
    str fsu 4-8
    str hg 32-32
    str sss 33-33
    str hhno 34-35
    str psno 41-42
    str relhd 43-43
    str sex 44-44
    str age 45-47
    str mar 48-48
    str genedu 49-50
    str tecedu 51-52
    str eduatt 53-54
    str vt 57-57
    str ft 58-59
    str mgn 60-60
    str mgnw 61-61
    nss 127-129
    nsc 130-132
    mlt 133-142
}
* 2nd Step: Create ps_chars.dta from Level 3 as below
infix using "D:\NSS\EU68\ps_chars.dct", using("D:\NSS\EU68\R6810L03.txt")
browse
* Generate HHID for household identification & PSID for person identification
gen hhid = fsu + hg + sss + hhno
gen psid = hhid + psno
browse
gen comb_wt = mlt/100 if nss == nsc
replace comb_wt = mlt/200 if nss != nsc
browse
/* CHECK FOR DUPLICATION IN hhid: isid should report that hhid does not
uniquely identify the observations, since all persons in a household share it */
isid hhid
* CHECK FOR DUPLICATION IN psid: isid should raise no error
isid psid
* CHECK THE NUMBER OF PERSON RECORDS IN THE DATASET
/* This should tally with the number of records given in the Readme_68_10 file
for the level you are extracting from */
count
* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics(sum) format(%14.0g)
tabstat comb_wt, statistics(sum) format(%14.0f)
save "D:\NSS\EU68\ps_chars.dta"
clear all
**********************************************************************************
/* CREATING THE PS_PREF.DTA FILE */
/* 1st Step: Create ps_pref.dct for person details as below, saving it under
that name */
infix dictionary {
    str fsu 4-8
    str hg 32-32
    str sss 33-33
    str hhno 34-35
    str sector 15-15
    str state 16-17
    str psno 41-42
    str age 43-45
    str dom 46-46
    str domres 47-47
    str dompur 48-48
    str hhind 50-50
    str coll 52-52
    str dung 58-58
    str tail 59-59
    str watdist 63-65
    str w_will 66-66
    str w_nat 67-67
    str w_type 68-68
    str w_skill 69-69
    str w_asst 70-70
    nss 127-129
    nsc 130-132
    mlt 133-142
}
* 2nd Step: Create ps_pref.dta from Level 8 as below
infix using "D:\NSS\EU68\ps_pref.dct", using("D:\NSS\EU68\R6810L08.txt")
browse
* Generate HHID for household identification & PSID for person identification
gen hhid = fsu + hg + sss + hhno
gen psid = hhid + psno
browse
gen comb_wt = mlt/100 if nss == nsc
replace comb_wt = mlt/200 if nss != nsc
browse
/* CHECK FOR DUPLICATION IN hhid: isid should report that hhid does not
uniquely identify the observations, since all persons in a household share it */
isid hhid
* CHECK FOR DUPLICATION IN psid: isid should raise no error
isid psid
* CHECK THE NUMBER OF PERSON RECORDS IN THE DATASET
/* This should tally with the number of records given in the Readme_68_10 file
for the level you are extracting from */
count
* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics(sum) format(%14.0g)
tabstat comb_wt, statistics(sum) format(%14.0f)
save "D:\NSS\EU68\ps_pref.dta"
clear all
*************************************************************************************
/* CREATING THE HH_CON.DTA FILE */
/* 1st Step: Create hh_con.dct for household consumption expenditure details
as below, saving it under that name */
infix dictionary {
    str fsu 4-8
    str hg 32-32
    str sss 33-33
    str hhno 34-35
    str item 41-42
    str con30 43-50
    str con365 51-58
    nss 127-129
    nsc 130-132
    mlt 133-142
}
* 2nd Step: Create hh_con.dta from Level 9 as below
infix using "D:\NSS\EU68\hh_con.dct", using("D:\NSS\EU68\R6810L09.txt")
browse
* Generate HHID for household identification
gen hhid = fsu + hg + sss + hhno
* Generate ITEMHHID to identify each household-item consumption record
gen itemhhid = hhid + item
browse
gen comb_wt = mlt/100 if nss == nsc
replace comb_wt = mlt/200 if nss != nsc
browse
/* CHECK FOR DUPLICATION IN hhid: isid should report that hhid does not
uniquely identify the observations */
isid hhid
* CHECK FOR DUPLICATION IN itemhhid: isid should raise no error
isid itemhhid
* CHECK THE NUMBER OF HOUSEHOLD-ITEM RECORDS IN THE DATASET
/* This should tally with the number of records given in the Readme_68_10 file
for the level you are extracting from */
count
* CALCULATE THE ESTIMATED NUMBER OF HOUSEHOLD-ITEM COMBINATIONS
* tabstat comb_wt, statistics(sum) format(%14.0g)
tabstat comb_wt, statistics(sum) format(%14.0f)
save "D:\NSS\EU68\hh_con.dta"
clear all
*************************************************************
/* CREATING THE PS_ACT.DTA FILE */
/* 1st Step: Create ps_act.dct for person activity details as below, saving
it under that name */
infix dictionary {
    str fsu 4-8
    str hg 32-32
    str sss 33-33
    str hhno 34-35
    str psno 41-42
    str age2 43-45
    str pracstat 46-47
    str pracnic 48-52
    str pracnco 53-55
    str ssac 56-56
    str sawork 66-66
    nss 127-129
    nsc 130-132
    mlt 133-142
}
* 2nd Step: Create ps_act.dta from Level 4 as below
infix using "D:\NSS\EU68\ps_act.dct", using("D:\NSS\EU68\R6810L04.txt")
browse
* Generate HHID for household identification
gen hhid = fsu + hg + sss + hhno
* Generate PSID for person identification
gen psid = hhid + psno
browse
gen comb_wt = mlt/100 if nss == nsc
replace comb_wt = mlt/200 if nss != nsc
browse
/* CHECK FOR DUPLICATION IN hhid: isid should report that hhid does not
uniquely identify the observations */
isid hhid
* CHECK FOR DUPLICATION IN psid: isid should raise no error
isid psid
* CHECK THE NUMBER OF PERSON RECORDS IN THE DATASET
/* This should tally with the number of records given in the Readme_68_10 file
for the level you are extracting from */
count
* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics(sum) format(%14.0g)
tabstat comb_wt, statistics(sum) format(%14.0f)
save "D:\NSS\EU68\ps_act.dta"
clear all
*************************************************************
/* CREATING THE PS_SACT.DTA FILE */
/* 1st Step: Create ps_sact.dct for person subsidiary activity details as
below, saving it under that name */
infix dictionary {
    str fsu 4-8
    str hg 32-32
    str sss 33-33
    str hhno 34-35
    str psno 41-42
    str age3 43-45
    str pracstat 46-47
    str pracnic 48-52
    str pracnco 53-55
    str loct 56-57
    str entype 58-58
    str enelec 59-59
    str enworker 60-60
    str jobtyp 61-61
    str elpl 62-62
    str elss 63-63
    str pay 64-64
    nss 127-129
    nsc 130-132
    mlt 133-142
}
* 2nd Step: Create ps_sact.dta from Level 5 as below
infix using "D:\NSS\EU68\ps_sact.dct", using("D:\NSS\EU68\R6810L05.txt")
browse
* Generate HHID for household identification
gen hhid = fsu + hg + sss + hhno
* Generate PSID for person identification
gen psid = hhid + psno
browse
gen comb_wt = mlt/100 if nss == nsc
replace comb_wt = mlt/200 if nss != nsc
browse
/* CHECK FOR DUPLICATION IN hhid: isid should report that hhid does not
uniquely identify the observations */
isid hhid
* CHECK FOR DUPLICATION IN psid: isid should raise no error
isid psid
* CHECK THE NUMBER OF PERSON RECORDS IN THE DATASET
/* This should tally with the number of records given in the Readme_68_10 file
for the level you are extracting from */
count
* CALCULATE THE ESTIMATED NUMBER OF PERSONS WITH A SUBSIDIARY ACTIVITY
* tabstat comb_wt, statistics(sum) format(%14.0g)
tabstat comb_wt, statistics(sum) format(%14.0f)
save "D:\NSS\EU68\ps_sact.dta"
clear all
*************************************************************************************
/* Merging the person files with the household files requires a common code
(key) shared by the two files. Remember that many persons may belong to ONE
household! So there is an M:1 (many-to-one) relationship between person
records and household records: one household record may be associated with
many person records, since those persons belong to the same household or
family. */
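Before running it in Stata, the m:1 idea can be pictured in a few lines of Python: many person rows share one household row, and each person row picks up its household's attributes through the common key hhid. All values below are made up for illustration.

```python
# The "one" file: one record per household, keyed by hhid.
households = {"001231204": {"hh_size": 4, "sector": "1"}}

# The "many" file: several person records carrying the same hhid.
persons = [
    {"psid": "00123120401", "hhid": "001231204", "sex": "1"},
    {"psid": "00123120402", "hhid": "001231204", "sex": "2"},
]

# m:1 merge: household attributes are repeated onto every matching person.
merged = [{**p, **households[p["hhid"]]} for p in persons]
```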
/* 1st Step: Call the "many" file first (here, ps_chars.dta), sort it on the
common code hhid, and save */
use "D:\NSS\EU68\ps_chars.dta"
sort hhid
save "D:\NSS\EU68\ps_chars.dta", replace
/* 2nd Step: Call the "one" file next (here, hh_chars.dta), sort it on the
same common code hhid, and save */
use "D:\NSS\EU68\hh_chars.dta"
sort hhid
save "D:\NSS\EU68\hh_chars.dta", replace
/* 3rd Step: Now issue the merge command to merge the sorted person file with
the household file in m:1 fashion. */
/* First call the "many" file */
use "D:\NSS\EU68\ps_chars.dta"
/* Now merge in the "one" file on the common code hhid. The merged dataset
will have all records of the "many" file, with the values from the "one"
file repeated for each person in the household */
merge m:1 hhid using "D:\NSS\EU68\hh_chars.dta"
save "D:\NSS\EU68\pshh_chars.dta"