Saturday 28 November 2015

Step 2: Extracting and merging NSSO datasets in a few simple steps using the Stata software package

(A) Understand the folders and files you get on the Unit Level Record data CD bought from NSSO

Here, we will use the unit-level 68th round Employment & Unemployment data (2011-12), Schedule Type 1, given in the folder called Nss68_1.0_Type1; the contents are as below,

The most important files are the data files; what each of them contains is explained very nicely in the Readme file, as below,

Next is the multiplier file, which explains the basis behind the multipliers assigned to each household; these are nothing but the sampling weights (we explained in earlier posts how they are derived).
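As a preview of the weight construction used at every level below, the multiplier field mlt is converted into a combined household weight comb_wt as follows:

```stata
* Combined weight from the multiplier (the same rule is applied at every level below)
gen comb_wt = mlt/100 if nss == nsc
replace comb_wt = mlt/200 if nss != nsc
```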


But the most important file, which is the key to extracting the data, is the Excel table called Layout68_1_0typ1.XLS (see the first figure), as below. You may like to understand it thoroughly before you start your work. We will guide you through every step here.



(B) Understand the levels at which the data are presented and create a common household ID and also a person ID; use the Stata code for extraction and merging as given below,

/* Our strategy is to extract the relevant data into Stata format, so for each level we will create a .dta file. To avoid confusion, we give the dictionary file (.dct) the same name as the .dta file to be created from it. The .dct file first lists the variables needed to create the HOUSEHOLD ID or the PERSON ID, as the case may be, and then the other variables of interest. Choose and name the variables you require as given in the LAYOUT table (take a printout). In the case below, we jump straight to the Level 2 file, because the HHID can be created from there (instead of from Level 1) and other variables of interest can be extracted at the same time */
/* CREATING THE HH_CHARS.DTA FILE*/

/* 1st Step: Create hh_chars.dct for the household details as below and save it under that name. Use the Stata Do-file editor for this. The dictionary essentially tells Stata the structure of the raw data file, from which your chosen fields are to be extracted */
infix dictionary {
          str fsu 4-8
          str hg 32-32
          str sss 33-33
          str hhno 34-35
          str state 16-17
          str sector 15-15
          hh_size 43-44
          str prin 45-49
          str proc 50-52
          str hh_type 53-53
          str religion 54-54
          str social_group 55-55
          land_own 56-63
          land_pos 64-71
          land_cult 72-79
          bank 83-83
          str level 36-37
          nss 127-129
          nsc 130-132
          mlt 133-142
}

/* 2nd Step: Create hh_chars.dta as below from Level 2. You are asking Stata to extract data from the Level 2 text file, from the byte positions given in the .dct file */

infix using "D:\NSS\EU68\hh_chars.dct", using ("D:\NSS\EU68\R6810L02.txt")

browse

/* The most important code comes next: it fixes the identity of every household across all the level files. All level files compulsorily carry these variables within Common Items 1-35, in the same positions and with the same number of bytes. You must generate hhid in every level file you use, to maintain a common identity for each household! */

gen hhid= fsu+ hg+ sss+ hhno
brow
gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc
brow

* CHECK FOR DUPLICATION IN hhid
isid hhid

* CHECK THE NUMBER OF HOUSEHOLD RECORDS IN DATASET
count

* CALCULATE THE ESTIMATED NUMBER OF HOUSEHOLDS

* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)
destring sector, gen(sector_nu)
label define lblSector 1 "Rural" 2 "Urban"
label values sector_nu lblSector

/* PERCENTAGE DISTRIBUTION OF HOUSEHOLDS BY SECTOR IN THE SAMPLE (WITHOUT ATTACHING COMBINED WEIGHTS)*/

tab sector_nu

/* PERCENTAGE DISTRIBUTION OF HOUSEHOLDS IN THE POPULATION BY SECTOR (i.e. ATTACHING COMBINED WEIGHTS)*/

tabulate sector_nu [fweight = round(comb_wt)] 
save "D:\NSS\EU68\hh_chars.dta"
clear all
**********************************************************************************
/* CREATING THE PS_CHARS.DTA FILE*/
* 1st Step: Create ps_chars.dct for the personal details as below and save it under that name
infix dictionary {
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str psno 41-42
          str relhd 43-43
          str sex 44-44
          str age 45-47
          str mar 48-48
          str genedu 49-50
          str tecedu 51-52
          str eduatt 53-54
          str vt 57-57
          str ft 58-59
          str mgn 60-60
          str mgnw 61-61
          nss 127-129
          nsc 130-132
          mlt 133-142
}        
* 2nd Step: Create ps_chars.dta as below from Level 3

infix using "D:\NSS\EU68\ps_chars.dct", using ("D:\NSS\EU68\R6810L03.txt")
browse

*Generate HHID for household identification & PSID for person identification
gen hhid= fsu+ hg+ sss+ hhno
gen psid= hhid+psno
brow
gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc
br
/* CHECK FOR DUPLICATION IN hhid: isid will throw an error, since hhid is NOT unique here; every member of a household shares the same hhid. The capture noisily prefix displays the error without halting a do-file */
capture noisily isid hhid
* CHECK FOR DUPLICATION IN psid: no error expected
isid psid
* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
/*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data*/
count
* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)
save "D:\NSS\EU68\ps_chars.dta"
clear all
*************************************************************************************
/* CREATING THE PS_PREF.DTA FILE*/
/* 1st Step: Create ps_pref.dct for the personal details as below and save it under that name */
infix dictionary {
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str sector 15-15
          str state 16-18
          str psno 41-42
          str age 43-45
          str dom 46-46
          str domres 47-47
          str dompur 48-48
          str hhind 50-50
          str coll 52-52
          str dung 58-58
          str tail 59-59
          str watdist 63-65
          str w_will 66-66
          str w_nat 67-67
          str w_type 68-68
          str w_skill 69-69
          str w_asst  70-70
          nss 127-129
          nsc 130-132
          mlt 133-142
}        
* 2nd Step: Create ps_pref.dta as below from Level 8
infix using "D:\NSS\EU68\ps_pref.dct", using ("D:\NSS\EU68\R6810L08.txt")
browse
*Generate HHID for household identification & PSID for person identification
gen hhid= fsu+ hg+ sss+ hhno
gen psid= hhid+psno
brow
gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc
br
/* CHECK FOR DUPLICATION IN hhid: isid will throw an error, since hhid is NOT unique here; every member of a household shares the same hhid */
capture noisily isid hhid

* CHECK FOR DUPLICATION IN psid: no error expected
isid psid

* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
/*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data*/

count

* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\ps_pref.dta"

clear all

*************************************************************************************
/* CREATING THE HH_CON.DTA FILE*/

/* 1st Step: Create hh_con.dct for the household consumption expenditure details as below and save it under that name */
infix dictionary {
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str item 41-42
          str con30 43-50
          str con365 51-58
          nss 127-129
          nsc 130-132
          mlt 133-142
}        

* 2nd Step: Create hh_con.dta as below from Level 9
infix using "D:\NSS\EU68\hh_con.dct", using ("D:\NSS\EU68\R6810L09.txt")

browse
*Generate HHID for household identification

gen hhid= fsu+ hg+ sss+ hhno

*Generate ITEMHHID for household & its item consumption *identification

gen itemhhid = hhid+item

brow

gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc

br

/* CHECK FOR DUPLICATION IN hhid: isid will throw an error, since one household reports many consumption items */
capture noisily isid hhid

* CHECK FOR DUPLICATION IN itemhhid: no error expected
isid itemhhid

* CHECK THE NUMBER OF ITEMS & HH COMBO RECORDS IN DATASET
*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data

count

* CALCULATE THE ESTIMATED NUMBER OF ITEM&HH COMBINATIONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\hh_con.dta"

clear all

*************************************************************
/* CREATING THE PS_ACT.DTA FILE*/

* 1st Step: Create ps_act.dct for the person activity details as below and save it under that name
infix dictionary {
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str psno 41-42
          str age2 43-45
          str pracstat 46-47
          str pracnic 48-52
          str pracnco 53-55
          str ssac 56-56
          str sawork 66-66
          nss 127-129
          nsc 130-132
          mlt 133-142
}        


* 2nd Step: Create ps_act.dta as below from Level 4
infix using "D:\NSS\EU68\ps_act.dct", using ("D:\NSS\EU68\R6810L04.txt")

browse
*Generate HHID for household identification

gen hhid= fsu+ hg+ sss+ hhno

*Generate PSID for person identification

gen psid = hhid+psno

brow

gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc

br

/* CHECK FOR DUPLICATION IN hhid: isid will throw an error, since one household has many persons */
capture noisily isid hhid

* CHECK FOR DUPLICATION IN psid: no error expected
isid psid

* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data

count

* CALCULATE THE ESTIMATED NUMBER OF PERSONS
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\ps_act.dta"

clear all

*************************************************************
/* CREATING THE PS_SACT.DTA FILE*/

* 1st Step: Create ps_sact.dct for the person subsidiary activity details as below and save it under that name

infix dictionary {
          str fsu 4-8
          str hg  32-32
          str sss 33-33
          str hhno 34-35
          str psno 41-42
          str age3 43-45
          str pracstat 46-47
          str pracnic 48-52
          str pracnco 53-55
          str loct 56-57
          str entype 58-58
          str enelec 59-59
          str enworker 60-60
          str jobtyp 61-61
          str elpl 62-62
          str elss 63-63
          str pay 64-64
          nss 127-129
          nsc 130-132
          mlt 133-142
}        
         

* 2nd Step: Create ps_sact.dta as below from Level 5
infix using "D:\NSS\EU68\ps_sact.dct", using ("D:\NSS\EU68\R6810L05.txt")

browse
*Generate HHID for household identification

gen hhid= fsu+ hg+ sss+ hhno

*Generate PSID for person identification

gen psid = hhid+psno

brow

gen comb_wt = mlt /100 if nss == nsc
replace comb_wt= mlt/200 if nss != nsc

br

/* CHECK FOR DUPLICATION IN hhid: isid will throw an error, since one household has many persons */
capture noisily isid hhid

* CHECK FOR DUPLICATION IN psid: no error expected
isid psid

* CHECK THE NUMBER OF PERSON RECORDS IN DATASET
*This should tally with the Readme_68_10 file for number of records in that level from where you are extracting data

count

* CALCULATE THE ESTIMATED NUMBER OF PERSONS WITH SUBSIDIARY ACTIVITY
* tabstat comb_wt, statistics( sum ) format(%14.0g)
tabstat comb_wt, statistics( sum ) format(%14.0f)

save "D:\NSS\EU68\ps_sact.dta"

clear all

*************************************************************************************
/* Merging the person files with the household files requires a common key. Remember that many persons may belong to ONE household! So there is an m:1 (many-to-one) relationship between person records and household records: one household record may be associated with many person records, because those persons belong to the same household or family */

/* 1st Step: Open the "many" file first (in our case ps_chars.dta), sort it on the common key, which is hhid, and save */

use "D:\NSS\EU68\ps_chars.dta"
sort hhid
save "D:\NSS\EU68\ps_chars.dta", replace

/* 2nd Step: Open the "one" file next (hh_chars.dta), sort it on the same common key hhid, and save */

use "D:\NSS\EU68\hh_chars.dta"
sort hhid
save "D:\NSS\EU68\hh_chars.dta", replace

/* 3rd Step: Now issue the merge command for merging the sorted person file with the household file in m:1 fashion */

/*First call the "Many file"*/

use "D:\NSS\EU68\ps_chars.dta"

/* Now bring in the "one" file through the merge command on the common key hhid. The merged dataset will have all records of the "many" file, with the "one" file's values repeated across household members */

merge m:1 hhid using "D:\NSS\EU68\hh_chars.dta"
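(Optional, but good practice: after merging, Stata leaves a _merge variable recording the match status of every observation, which is worth inspecting before saving.)

```stata
* _merge records match status: 3 = matched, 1 = master (ps_chars) only, 2 = using (hh_chars) only
tab _merge
* If you expect every person record to find its household, you can assert it:
* assert _merge == 3
```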

save "D:\NSS\EU68\pshh_chars.dta"

*************************************************************************************

8 comments:

  1. This blog has helped me in understanding and working on the NSSO EU data. However, I am facing a problem in merging the CWS/CDS data with the household data, since the CWS/CDS file contains numerous duplicated observations. Could you please sort it out?

  2. Hello, it's easy. Actually there is no duplicates problem in the recent employment round; the CWS/CDS file is given in terms of activity serial numbers. You need to create a household ID, a person ID and an activity ID in the CWS/CDS file, then merge the household file with this file using a 1:m merge on the key household variables; it will then merge automatically with the activity file.
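    A minimal sketch of the reply above, assuming the activity-level file has already been extracted to cws_cds.dta and that the activity serial number was read into a string variable actno (both names are hypothetical; take the actual file names and byte positions from your own layout file):

```stata
* Sketch only: file and variable names below are assumptions, not the post's exact code
* In the CWS/CDS (activity-level) file, build household, person and activity IDs:
use "D:\NSS\EU68\cws_cds.dta", clear
gen hhid = fsu + hg + sss + hhno
gen psid = hhid + psno
gen actid = psid + actno        // actno = activity serial number (assumed name)
sort hhid
save "D:\NSS\EU68\cws_cds.dta", replace

* Now start from the household file and merge 1:m (one household, many activity records):
use "D:\NSS\EU68\hh_chars.dta", clear
merge 1:m hhid using "D:\NSS\EU68\cws_cds.dta"
tab _merge
```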

  3. Respected Sir,
    I am a research scholar at the Central University of Himachal Pradesh, Dept. of Economics and Public Policy. I am working on the employment structure in the NSSO unit-level data. I want to know the codes for five variables for 1993-94, 2004-05 and 2011-12 that I can use in the Stata software: LFPR, WFPR, modes of employment, religion, social group and general education in rural and urban areas (UPSS basis).
    Sir, I only want the codes I can use in the software to find these variables in both rural and urban areas, and also to know which file relates to the rural and urban areas, because there are a number of files. So please help me with this problem.
    I shall be thankful if you take the initiative on my little words and email.
    Thank you

  4. Sir, there are variables with the same name, like usual principal status, in Level 4 and Level 7. Does this mean I need to add them to calculate the total number of people engaged in usual principal activity status? Please clarify, sir.

  5. Hello Sir!

    Your blog is indeed very useful and informative. I would like to ask your help in my specific problem.
    I am working on the NSS 68th Round Consumption Expenditure data (Type II).

    STEP I: I have extracted the data of Level 2 and Levels 5 to 9, and thereafter created a common household ID using the command: egen hhid = concat(FSUno HamletGp SecStageStratum SampleHHno)

    STEP II: Thereafter, I merged Level 2 with Levels 5, 6, 7, 8 and 9, as I required data by household type, for which the information is given in Level 2.

    STEP III: Now, having the required data, i.e. total consumption and expenditure for the 6 household types in rural areas and the 4 household types in urban areas for every product (item-code wise), I am having trouble understanding the numbers: do they give monthly, weekly or annual figures for these household types? For my analysis I require the total annual consumption for all these household types, product wise.

    PLEASE help me out!!

    Thanks

    Hope to hear Soon!!

  6. Sir,
    Can you please share the steps for ASI as well?
