Module B4: Basic Data Analysis Techniques

3. Weighting

Weighting is an essential part of household survey data analysis. It is not required for a population census, which covers the entire population, or for a fully self-weighting survey, in which the sample is allocated in proportion to the population across all strata, clusters or secondary sampling units. In all other cases, appropriate weights must be applied to each and every primary sampling unit to derive meaningful estimates.

A household survey is one type of sample survey. For a sample survey, data is not collected from the entire population of people or households in the area being studied. Rather, a small sample of the population is surveyed and statistical techniques are used to estimate values for the entire population in the study area based on the sample.

3.1 The Need for ‘Weighting’ in Household Surveys

In sampling techniques, the sample size, such as the number of households interviewed in a survey, is denoted by ‘n’. Similarly, the population size, such as the total number of households in the study area (irrespective of whether they were included in the sample or not), is denoted by ‘N’. The sample size ‘n’ is always smaller than the population size ‘N’, and ‘n / N’ is known as the sampling fraction.

When simple random sampling is applied, the probability that any given household will be selected in the sample is p = n / N; this is the same for every household in the study area. The sample ‘weight’ can be calculated as w = N / n, which is the reciprocal of the selection probability.

This weight is useful to estimate the population totals based on the sample.
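The relationship between the selection probability, the weight and an estimated total can be written compactly as follows. The symbols y_i (the value observed in sample household i) and \hat{Y} (the estimated population total) are notation assumed here for illustration; they do not appear in the original text.

```latex
p = \frac{n}{N}, \qquad w = \frac{1}{p} = \frac{N}{n}, \qquad \hat{Y} = w \sum_{i=1}^{n} y_i
```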

Example-1: In a district of 10,000 households (N=10,000), 500 households are selected to study the out-of-school children aged 6 to 14 (n=500). The survey reveals that there are 850 children aged 6-14 in the sample households, and 34 of them are out-of-school.

These sample counts, however, are of little use to district authorities, policy makers and education planners, who need figures for the whole district.

The sampling fraction is n / N = 500 / 10,000 = 0.05 (or 5.0%), and the reciprocal of the sampling fraction is 1 / 0.05 = 20. If simple random sampling is applied, the sampling fraction is the probability that a household is selected in the sample (p = 0.05), and its reciprocal, 20, is the ‘weight’ for a sample household (that is, each sample household represents 20 households in the district).

Therefore, the total number of children aged 6-14 in the district can be estimated as:
850 x 20 = 17,000,

and the estimated number of out-of-school children in the district is:
34 x 20 = 680.

These estimates of the population, 17,000 children and 680 out-of-school children in the district, are more relevant and useful than knowing there were 850 children and 34 out-of-school children in the sample. This is the first reason for using the sample weight in survey data analysis.
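The arithmetic of Example-1 can be reproduced with a minimal Python sketch. The figures come from the example; the variable names are illustrative only.

```python
# Example-1 figures from the text; variable names are illustrative.
N = 10_000                      # households in the district
n = 500                         # households in the sample
children_in_sample = 850        # children aged 6-14 found in the sample
out_of_school_in_sample = 34    # of whom out-of-school

sampling_fraction = n / N       # selection probability under simple random sampling
weight = N / n                  # households represented by one sample household

# Scale the sample counts up to district-level estimates.
estimated_children = children_in_sample * weight
estimated_out_of_school = out_of_school_in_sample * weight

print("weight:", weight)
print("estimated children aged 6-14:", estimated_children)
print("estimated out-of-school children:", estimated_out_of_school)
```

Because every household has the same weight under simple random sampling, a single multiplication per count is enough here.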

From the above figures in the sample, the percentage of out-of-school children in the district is:

34 / 850 x 100 = 4.0%, if calculated based on the sample households.

The same result will be obtained if it is calculated from the district estimates, that is:

680 / 17,000 x 100 = 4.0%.

Therefore, no weighting is necessary when a percentage (rate, ratio or proportion) is calculated from the sample, provided that simple random sampling, or another sampling method in which every household in the study area has an equal probability of being selected, is used. Such a sampling method is known as a ‘self-weighting’ sampling method.
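The reason is that a common weight appears in both the numerator and the denominator and cancels out. Using the Example-1 figures, with w = 20:

```latex
\frac{w \times 34}{w \times 850} \times 100 = \frac{34}{850} \times 100 = 4.0\%
```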

In some situations, however, households in the study area do not all have the same probability of being selected in the sample. In these situations, weighting is essential. The following example illustrates such a situation.

Example-2: The above district contains two types of households: stable/settled (regular) households and moving/migrated households. Of the 10,000 households (N) in the district, 9,000 are regular households and the remaining 1,000 are moving ones (N1 = 9,000 and N2 = 1,000).

One of the objectives of the study is to explore the schooling status of children aged 6-14 among the moving households. For this reason, 300 regular households (n1 = 300) and 200 moving households (n2 = 200) are selected in the sample. The same overall results are obtained as in Example-1: there are 850 children aged 6-14 in the 500 sample households, and 34 of them are out-of-school.

When reviewing the results of the survey for regular and moving household groups, however, it is found that 500 children aged 6-14 are from the regular households and 350 are from the moving households. The survey also found that just 6 out-of-school children are from regular households while the remaining 28 are from the moving households.

The simple (unweighted) calculation of the percentage of out-of-school children in the district is:
34 / 850 x 100 = 4.0%.

However, this is not representative of the district population, since the sampling method is not ‘self-weighting’ (that is, not every household in the district has the same chance / probability of being in the sample).

The probability of selecting a household in the sample for the regular household group is:
300 / 9,000 = 0.03333;

while the probability of selecting a household for the moving household group is:
200 / 1,000 = 0.20.

That is, one sample household represents 30 households (1 / 0.0333) in the regular group, but only 5 (1 / 0.20) in the moving household group. In other words, the sample ‘weight’ to be used for the regular group is 30, and just 5 for the moving group in estimating the totals.

By using the weights, there would be:
500 x 30 + 350 x 5 = 15,000 + 1,750 = 16,750 children aged 6-14

and
6 x 30 + 28 x 5 = 180 + 140 = 320 out-of-school children.

Then, the estimated percentage of out-of-school children aged 6-14 for the district is:
320 / 16,750 x 100 = 1.91%.

This 1.91% is a closer (or more appropriate) estimate of the percentage of out-of-school children aged 6-14 in the entire district.
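The Example-2 calculation can be sketched in Python by giving each household group its own weight. The figures are those of the example; the dictionary layout and field names are illustrative assumptions.

```python
# Example-2 figures from the text, organised per household group.
# The field names are illustrative.
strata = {
    "regular": {"N": 9_000, "n": 300, "children": 500, "out_of_school": 6},
    "moving":  {"N": 1_000, "n": 200, "children": 350, "out_of_school": 28},
}

total_children = 0.0
total_out_of_school = 0.0
for name, s in strata.items():
    weight = s["N"] / s["n"]    # reciprocal of the group's selection probability
    total_children += s["children"] * weight
    total_out_of_school += s["out_of_school"] * weight

# Unweighted percentage: pools the sample counts and ignores the design.
sample_children = sum(s["children"] for s in strata.values())
sample_out = sum(s["out_of_school"] for s in strata.values())
unweighted_pct = 100 * sample_out / sample_children

# Weighted percentage: based on the estimated district totals.
weighted_pct = 100 * total_out_of_school / total_children

print("estimated children aged 6-14:", total_children)
print("estimated out-of-school children:", total_out_of_school)
print("unweighted %:", round(unweighted_pct, 2))
print("weighted %:", round(weighted_pct, 2))
```

The gap between the two percentages arises because moving households are over-represented in the sample relative to their share of the district.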

From the above examples, it is obvious that the sampling method applied is very important in deciding whether the ‘weighting’ should be used or not in analysing sample survey data sets. In national surveys, weighting is necessary even if a self-weighting sampling method is applied because the ‘response rates’ vary among the different population groups or secondary sampling units (thus, the representations are different). As such, sample weights are necessary for analysing all common household survey data sets.
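To illustrate how differing response rates break self-weighting, the sketch below applies one common adjustment: dividing the design weight by the group's response rate. This is an assumption made for illustration, not necessarily the method used in any particular survey, and all numbers and group names are hypothetical.

```python
# Hypothetical illustration: a design that is self-weighting before fieldwork,
# but with response rates that differ between two groups.
# All numbers and group names are invented.
design_weight = 20.0    # identical for every selected household

groups = {
    "group_a": {"selected": 300, "responded": 270},
    "group_b": {"selected": 200, "responded": 150},
}

adjusted_weight = {}
for name, g in groups.items():
    response_rate = g["responded"] / g["selected"]
    # Responding households also stand in for the non-respondents in their group,
    # so their weight is inflated by the inverse of the response rate.
    adjusted_weight[name] = design_weight / response_rate

for name, w in adjusted_weight.items():
    print(name, round(w, 2))
```

After the adjustment the two groups no longer share a common weight, so the data set must be analysed with weights even though the original design was self-weighting.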

Weighting Variables in DHS, MICS and LSMS Surveys

Some household survey data sets contain the variables used for weighting, so users can obtain the weights easily. Other survey data sets, however, do not explicitly define a weighting variable. In this case, users should consult the survey documents, such as technical manuals, survey guidelines and reports. Even when sample weights are explicitly provided with the data set, some surveys store them as integers. Such a variable is not directly usable as a sample weight; it must first be multiplied or divided by a number or a variable that is normally mentioned in the survey documents.

The following is an extract from ‘Description of the Demographic and Health Surveys Individual Recode Data File’ (p.10, DHS III, March 2008) stating how to use the sample weight in DHS data sets.

In Demographic and Health Survey (DHS) data sets, users must use the variable representing ‘Sample weight’ from the household data set or personal data set (listing of all household members). For example, in the Nepal DHS 2006 survey data sets, users can find ‘Sample weight’ with the same variable name ‘HV005’ in both NPHR51FL.SAV (household) and NPPR51FL.SAV (personal). Similarly, ‘Sample weight’ is located in BDHR50FL.SAV (household) and BDPR50FL.SAV (personal) in the Bangladesh DHS survey data sets. As mentioned above, users must divide the existing weighting variable (V005 in the individual recode file, HV005 in the household and personal files) by 1,000,000 before using it as the sample weight, or compute a new variable, for example ‘sweight = HV005/1,000,000’, and use ‘sweight’ as the weighting variable.
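The rescaling step can be sketched in Python as follows. The records and their HV005 values are made up for illustration and do not come from any real DHS file; ‘sweight’ follows the naming suggested in the text.

```python
# Hypothetical records standing in for rows of a DHS household file.
# The HV005 values are invented for illustration.
SCALE = 1_000_000    # divisor stated in the text for DHS sample weights

households = [
    {"hhid": 1, "HV005": 1_250_000},
    {"hhid": 2, "HV005": 800_000},
    {"hhid": 3, "HV005": 950_000},
]

# Add the rescaled weight to every record, leaving the stored integer untouched.
for row in households:
    row["sweight"] = row["HV005"] / SCALE

for row in households:
    print(row["hhid"], row["sweight"])
```

Keeping the original integer variable alongside the rescaled one makes it easy to check the conversion against the survey documentation.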

In Multiple Indicator Cluster Survey (MICS) data sets, weighting variables can be found with the label ‘Household sample weight’ in both the HH.SAV (household) and HL.SAV (personal or household listing) data sets.

In Albania, the Living Standard Measurement Survey (LSMS) data set contains two files representing sample weights: ‘weights_cl.sav’ contains weights for each and every primary sampling unit (PSU), and ‘weights_psu.sav’ contains the intermediate variables/data used in the calculation of weights for PSUs. On the other hand, some other LSMS or LSS data sets do not provide weights explicitly but mention the use of sample weights in their technical reports. A special request to the national statistical office may be required to obtain the sample weights that were used in the study.

3.2 Turning On and Off ‘Weight’ in SPSS Statistics

In SPSS Statistics, case weighting can be turned on or off easily after opening the data set by following the steps below. For DHS data, first divide the weight variable (V005, or HV005 in the household and personal files) by 1,000,000 (use the Compute command).

  1. Click ‘Data’ on main menu bar.
  2. Click ‘Weight Cases’.

A new window will appear with the variable list.

  3. Select the ‘Weight cases by’ option.
  4. Select the weighting variable ‘Sample weight [HV005]’ and move it to the ‘Frequency Variable:’ box.
  5. Click ‘OK’ to complete the task.

After this, all analyses (tabulations) will be weighted by ‘Sample weight [HV005]’. To turn weighting off, open the same window again, select ‘Do not weight cases’ instead of ‘Weight cases by’, and click ‘OK’. Examples of weighting are provided in Module B-5.
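As a rough analogue of what switching the weight on and off does to a frequency table, the Python sketch below counts each case as 1 when unweighted and sums the case weights when weighted. It is not SPSS code, SPSS may treat fractional weights differently in some procedures, and the cases, the ‘attending’ field and the tabulate helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cases: one category plus an already rescaled sample weight.
cases = [
    {"attending": "yes", "sweight": 1.25},
    {"attending": "no",  "sweight": 0.80},
    {"attending": "yes", "sweight": 0.95},
    {"attending": "yes", "sweight": 1.00},
]

def tabulate(rows, key, weight=None):
    """Return totals per category; each row counts as 1 when weight is None."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[weight] if weight is not None else 1.0
    return dict(totals)

unweighted = tabulate(cases, "attending")            # analogue of 'Do not weight cases'
weighted = tabulate(cases, "attending", "sweight")   # analogue of 'Weight cases by'

print("unweighted:", unweighted)
print("weighted:", {k: round(v, 2) for k, v in weighted.items()})
```

The unweighted table reports how many cases were observed, while the weighted table reports how much of the population those cases represent.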

 
