3. Gathering Survey Data and Getting Ready for Analysis
3.1 Data Sources and Contact Points for Obtaining Census and Survey Data
3.1.1 Population Census
Complete census databases, especially those at household level, are confidential and are not shared with the public or third-party users. However, departments under the Ministry of Education may request subsets of the census data once the census reports are published.
The National Census Bureau, Census Department or the Central (or General) Statistics Office is usually responsible for maintaining the census database. If the Ministry of Education identifies the required population data and education-related data in tabular form, and requests access to the data through official (ministerial) channels, the census authorities have the responsibility to generate and provide the requested tables.
One major disadvantage of using census data is the long wait for data release. A population census can take over a year to complete. Clean databases and census reports are usually published two to three years after the census. As such, the Ministry of Education may only gain access to the education-related datasets two (or more) years after the census. There may also be a long delay in providing more specific database subsets or tables on request. Therefore, not many education ministries use census databases. Rather, they request population data from the census departments, especially data that enable them to calculate enrolment ratios and to make projections of the school-age population.
3.1.2 Household surveys
These are conducted more frequently than population censuses and the agencies that conduct them are usually willing to share their datasets with simple formal requests. With a smaller workload, agencies that conduct household surveys can create their databases faster and make reports available within twelve months of completion of the fieldwork (data collection).
Access to datasets varies by survey and from country to country. All the common household surveys conducted or sponsored by international organisations have their own websites.
Please refer to “Further studies” for more information.
3.2 Common Obstacles and Approaches in Gathering Population Census and Household Survey Data
As mentioned above, population censuses and household surveys contain useful data for EFA monitoring. There are, however, limitations.
3.2.1 Common obstacles in gathering population census data
i) It can be difficult to locate the person (or department) who has the authority to provide census datasets to third-party users.
ii) There is often a lack of coordination with other ministries (including the Ministry of Education) and other departments when the census questionnaire is developed, so the items in the census may not be directly useful for constructing education indicators.
iii) A census is normally conducted once every 10 years, and the census data may only become accessible 2 to 3 years or more after completion of the census. Thus, census data are more useful for reviewing historical trends than for studying the current situation and status.
iv) Census collections usually occur during school holidays. Census dates rarely coincide with the beginning of the school year, which is the reference date for calculating common education indicators. As such, there may be minor discrepancies between the indicators calculated from the census and indicators derived from other data that are gathered periodically by schools and the Ministry of Education.
Tips: How to get faster access to census data?
i) When seeking census data, it is better to contact the appropriate department via senior officials at the Ministry. If other staff approach the Census Department, the Census Department might sometimes delay response to the request, or simply never respond.
ii) Limit the number of variables and the amount of data in the requested dataset. Request just the data you need for your purpose. It may help you to get a faster response and will be easier for you to analyze. Census datasets are very large. It can take a long time to produce subsets of data. The time required to deliver the data will increase if many variables are requested.
In many countries, very few household survey questionnaires are developed by education-related ministries and agencies. The survey questionnaires are usually designed by the agency that conducts the survey and then distributed to the education ministry for comment, or simply for information. Compared to population census data, household survey data are easier for education ministries to obtain.
3.2.2 Main barriers in using household survey data for EFA monitoring
i) Variation in measures of educational participation
Definitions of terms and the wording of survey questions about educational attainment and current school attendance can differ from survey to survey. In many cases, assumptions or adjustments must be made to calculate common education indicators.
For example, a specific survey may enquire about: (1) the highest grade completed by each household member; and (2) whether the person is currently attending school. To calculate the net enrolment ratio (NER) or gross enrolment ratio (GER) from these questions, an assumption is required about the level/grade currently attended by the household member: for example, if a child has completed Grade Four and currently attends school, it is assumed that the child is currently attending Grade Five.
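The grade assumption described above can be sketched in a few lines of Python. This is only an illustration: the records, the field names, the primary cycle (Grades 1 to 5) and the official primary age range (6 to 10) are all hypothetical, and the counts are unweighted for simplicity, whereas a real calculation would apply the survey weights.

```python
# Hypothetical household-member records (not from any real survey).
members = [
    {"age": 6, "highest_grade_completed": 0, "attending": True},
    {"age": 8, "highest_grade_completed": 2, "attending": True},
    {"age": 10, "highest_grade_completed": 4, "attending": True},
    {"age": 12, "highest_grade_completed": 4, "attending": True},
    {"age": 9, "highest_grade_completed": 1, "attending": False},
    {"age": 11, "highest_grade_completed": 5, "attending": True},
]

PRIMARY_GRADES = range(1, 6)  # assumption: primary = Grades 1-5
PRIMARY_AGES = range(6, 11)   # assumption: official primary age = 6-10


def assumed_current_grade(member):
    """Assume an attending child is in the grade after the last one completed."""
    if not member["attending"]:
        return None
    return member["highest_grade_completed"] + 1


def enrolment_ratios(members):
    """Return (NER, GER) for primary education, as percentages (unweighted)."""
    school_age = [m for m in members if m["age"] in PRIMARY_AGES]
    in_primary = [m for m in members
                  if assumed_current_grade(m) in PRIMARY_GRADES]
    in_primary_of_age = [m for m in in_primary if m["age"] in PRIMARY_AGES]
    ner = 100 * len(in_primary_of_age) / len(school_age)
    ger = 100 * len(in_primary) / len(school_age)
    return ner, ger


ner, ger = enrolment_ratios(members)
print(f"NER = {ner:.1f}%, GER = {ger:.1f}%")
```

In this toy data the 12-year-old assumed to be in Grade Five counts towards the GER but not the NER, which is why the two ratios differ.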
ii) Timing and duration of survey fieldwork
When using education data from household surveys, one must take into consideration the timing of the survey (when it started, or the date to which it refers) and its duration (how long the data collection took to complete). If a particular survey started just before the end of the school year and took over a month, the ‘grade completed’ or ‘current grade’ may differ from household to household depending on whether the interview was conducted in the early or the later days of the survey. This may not be a problem for surveys that have a fixed reference date, like a population census.
iii) Sample size and sampling method
A household survey is designed to provide the facts or characteristics of the population at a certain period through a representative sample of households. The representativeness of the sample depends on the survey design, which is influenced by three factors: the sampling method used; the level of accuracy sought in the estimates for various indicators; and the level of data disaggregation.
Some surveys, especially rapid assessments and case-control studies, do not use probability sampling techniques. The findings may not, therefore, represent the entire population under study. Surveys that aim to derive estimates for common characteristics with moderate accuracy require a smaller sample size, while getting reliable estimates for a rare characteristic (or event) with higher accuracy requires a larger sample size. Similarly, estimating at the national (and provincial) level only requires a smaller sample size, while finer sub-stratification (such as at the district or lower level) needs a larger sample size.
It is, therefore, important to check which sampling method was used in the survey being studied, and whether the sample size is sufficiently large for producing reliable education indicators we wish to derive at the desired level of disaggregation.
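As a rough way to judge whether a sample is large enough, the standard textbook formula for estimating a proportion, n = z² p(1 - p) / e², can be applied to each group for which a separate estimate is wanted. The sketch below is an illustration only: the function name, the design-effect value of 2 and the figure of 10 districts are assumptions, and actual survey designs use more refined calculations.

```python
import math


def required_sample_size(p, margin, z=1.96, design_effect=1.0):
    """Approximate sample size to estimate a proportion p within +/- margin.

    Uses n = z^2 * p * (1 - p) / margin^2, inflated by a design effect
    (assumed here; it depends on the actual survey design).
    """
    n_srs = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n_srs * design_effect)


# One national estimate, +/- 5 percentage points at 95% confidence.
n_national = required_sample_size(p=0.5, margin=0.05)

# Same precision under an assumed design effect of 2 (cluster sampling).
n_clustered = required_sample_size(p=0.5, margin=0.05, design_effect=2.0)

# The same precision in each of 10 hypothetical districts needs
# roughly that many respondents in every district.
n_all_districts = 10 * n_clustered

print(n_national, n_clustered, n_all_districts)
```

The point of the sketch is the scaling: asking for the same precision in every district multiplies the required sample roughly by the number of districts.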
EFA monitoring indicators generally aim to explore the differences between various groups in the population, especially to identify if some population groups have any disadvantage. The sample size of a particular household survey may or may not be sufficient to compute indicators for the disadvantaged group living in a certain area, depending on the definition of “disadvantaged population” and level of disaggregation.
If the sample size is not sufficient for the required level of disaggregation, the level of disaggregation should be reduced. Alternatively, indicators may be computed at the desired disaggregation level, but the results should be presented with a note about the indicator’s limitations.
3.3 Quality Issues, Challenges and Recommendations in Using Survey Data
Data files that are made available for analysis should be ‘cleaned’. By ‘cleaned’ we mean these files have been checked for structural and range errors and edited for internal consistency. Provisions that compensate for non-response should also be incorporated into the files and fully explained in the accompanying documentation.
After acquiring a dataset, we need to familiarize ourselves with its structure, the nature of its variables, the circumstances of data collection, and any limitations on the use of the dataset. The documentation for a census or household survey, such as reports and a codebook, will provide important background information about the survey, such as sample size and data quality indicators.
Data manipulation and analysis can be demanding and complex. The following discussion does not provide a comprehensive set of guidelines for the use of datasets, but highlights some key issues to be considered in analyzing survey data.
(1) Be familiar with the structure of the dataset and explore appropriate ways to analyze it
Firstly, find out whether the records within the data files are for households or for individuals, and secondly, whether household or individual weights should be used to produce estimates. Since sample surveys do not collect data from the entire population (all households or all individuals) in an area, weighting factors are required to reconstitute the characteristics of the entire population from the samples. For example, suppose a survey selects five households from each of two enumeration areas (EAs) of 50 and 60 households respectively. The household weight for each of the five sample households from the first EA is 10 (50 ÷ 5), and from the second EA is 12 (60 ÷ 5). The weights are calculated while planning the survey, and are provided in the dataset.
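The weighting example above can be reproduced as a short calculation. The sketch assumes the simplest possible case, in which each weight is the number of households in the EA divided by the number sampled and the EAs themselves are not sub-sampled; the function and variable names are illustrative only.

```python
def household_weight(households_in_ea, households_sampled):
    """Weight = number of households each sampled household represents."""
    return households_in_ea / households_sampled


# Five households sampled from each of two EAs of 50 and 60 households.
weights_ea1 = [household_weight(50, 5)] * 5   # each weight is 10
weights_ea2 = [household_weight(60, 5)] * 5   # each weight is 12

# Summing the weights reconstitutes the number of households covered.
estimated_households = sum(weights_ea1) + sum(weights_ea2)
print(weights_ea1[0], weights_ea2[0], estimated_households)
```

Summing the weights of all ten sampled households gives back the 110 households in the two EAs, which is a quick check that the weights are being applied correctly.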
(2) Study the variables in the datasets before analysis
It is important to refer to the original questionnaires as this helps us understand the nature of the variables and know how to analyze the data. For example, to analyze the literacy status of the population, one should know the nature of the variable such as:
- its codes (for example, ‘1=literate’, ‘2=illiterate’)
- restrictions (whether the question was asked to all ages or aged 5+ or aged 15+)
- relationship to other questions/variables (whether it was asked to everybody, or only those persons who answered ‘no education’ or ‘incomplete primary’ in the question on “highest education level”)
- missing values (code ‘9’) and non-response (code ‘8’ for the variable “literacy status”)
- how to deal with answers like: ‘Don’t know’.
Once we understand how the variables were derived and coded, we can determine which variables we should select and how to handle the selected variables to produce the required indicator estimates.
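As an illustration of why the coding details matter, the sketch below computes a weighted adult literacy rate using the codes listed above (1 = literate, 2 = illiterate, 8 = non-response, 9 = missing). The records, field names and the 15+ age restriction are hypothetical; the actual codes and restrictions must always be taken from the survey's own questionnaire and codebook.

```python
LITERATE, ILLITERATE = 1, 2
VALID_CODES = {LITERATE, ILLITERATE}  # codes 8 and 9 are excluded

# Hypothetical person records with final weights.
records = [
    {"age": 34, "literacy": 1, "weight": 10.0},
    {"age": 52, "literacy": 2, "weight": 10.0},
    {"age": 19, "literacy": 1, "weight": 12.0},
    {"age": 41, "literacy": 9, "weight": 12.0},  # missing value
    {"age": 27, "literacy": 8, "weight": 10.0},  # non-response
    {"age": 12, "literacy": 1, "weight": 12.0},  # below the age restriction
    {"age": 60, "literacy": 1, "weight": 12.0},
]


def adult_literacy_rate(records, min_age=15):
    """Weighted % literate among persons aged min_age+ with a valid code."""
    valid = [r for r in records
             if r["age"] >= min_age and r["literacy"] in VALID_CODES]
    total = sum(r["weight"] for r in valid)
    literate = sum(r["weight"] for r in valid if r["literacy"] == LITERATE)
    return 100 * literate / total


rate = adult_literacy_rate(records)
print(f"Adult literacy rate: {rate:.1f}%")
```

If codes 8 and 9 were mistakenly left in the denominator, or the 12-year-old were included, the rate would change, which is why the codebook should be checked before any tabulation.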
(3) Replicate published results before proceeding with additional calculations
If there are reports showing tabulated or analyzed results from the data collection activity, try to replicate these results before calculating any new indicators. Fixing problems with existing calculations will bolster confidence while producing new results.
(4) Consider the issue of missing values
Non-response in a survey or census can happen in one of two ways:
- The entire record representing an individual or household is missing because the individual or household refused to answer, was not available or could not be contacted. This is called “total non-response”.
- Some variables within a record are missing. This is called “item non-response”. Item non-response is common for variables derived from questions that were not asked of, or could not be answered for, all household members, such as whether a child attends school during the current school year.
A technique called “imputation” is often used to compensate for missing values in the case of item non-response. Imputation replaces missing values with the most suitable ones based on other cases in the same dataset. The resulting file, complete or ‘square’, allows us to derive better estimates for calculating indicators. The data analyst must therefore know how the missing values were treated in the dataset.
In the case of total non-response, the weight adjustment method is often used. Non-response records are omitted from the dataset and the weights are recalculated to compensate for their removal. In this case, the dataset contains two sets of weights, the “sample weight” and the “adjusted/final weight”. Analysts must use the final weight to calculate indicators.
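One common form of weight adjustment can be sketched as follows: within each adjustment class (here, the enumeration area), the weights of responding households are scaled up so that they also represent the non-respondents. This is a simplified illustration with hypothetical data and field names; the adjustment classes and method actually used are documented by the survey agency.

```python
from collections import defaultdict

# Hypothetical sample: one of the five households in EA 1 did not respond.
sample = (
    [{"ea": 1, "sample_weight": 10.0, "responded": i != 0} for i in range(5)]
    + [{"ea": 2, "sample_weight": 12.0, "responded": True} for _ in range(5)]
)


def adjust_weights(sample):
    """Drop non-respondents and rescale weights within each EA."""
    selected = defaultdict(float)   # sum of sample weights, all selected
    responded = defaultdict(float)  # sum of sample weights, respondents only
    for hh in sample:
        selected[hh["ea"]] += hh["sample_weight"]
        if hh["responded"]:
            responded[hh["ea"]] += hh["sample_weight"]

    final = []
    for hh in sample:
        if hh["responded"]:
            factor = selected[hh["ea"]] / responded[hh["ea"]]
            final.append({**hh, "final_weight": hh["sample_weight"] * factor})
    return final


adjusted = adjust_weights(sample)
total_final_weight = sum(hh["final_weight"] for hh in adjusted)
print(len(adjusted), adjusted[0]["final_weight"], total_final_weight)
```

After adjustment the final weights still sum to the same population total as the original sample weights, even though one record has been removed.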
(5) Calculate the measures of accuracy (coefficient of variation) of the basic estimates to gauge the reliability of the estimated indicators
Depending on the overall sample size of the survey, some tabulations may yield cells with very small numbers of cases. The estimated indicators that are based on those tables may not be reliable. For this reason, it is important to calculate a measure of accuracy and report it with the indicator so people can gauge the reliability of the estimates produced. The coefficient of variation (CV) is often used for this purpose.
The coefficient of variation (CV) is the square root of the variance divided by the estimate itself and multiplied by 100 (expressed as a percentage).
In their basic quality guidelines, most national statistical offices stipulate that estimates that have a CV greater than 35% should not be used for statistical inferences and should not be released to the public. Be sure to properly account for complex survey designs in analysis, particularly when calculating variances.
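The CV calculation itself is simple once a variance estimate is available. In the sketch below, the variance of a proportion is approximated by the simple random sampling formula p(1 - p)/n purely for illustration; as noted above, a complex survey design requires a design-based variance estimate, which is usually larger. The function names and example figures are hypothetical.

```python
import math

CV_THRESHOLD = 35.0  # percent, the cut-off quoted in the text


def coefficient_of_variation(estimate, variance):
    """CV (%) = square root of the variance / estimate * 100."""
    return 100 * math.sqrt(variance) / estimate


def srs_variance_of_proportion(p, n):
    """Variance of a proportion under simple random sampling (assumption)."""
    return p * (1 - p) / n


# An out-of-school rate of 10% estimated from 50 cases versus 400 cases.
cv_small = coefficient_of_variation(0.10, srs_variance_of_proportion(0.10, 50))
cv_large = coefficient_of_variation(0.10, srs_variance_of_proportion(0.10, 400))

for label, cv in [("n=50", cv_small), ("n=400", cv_large)]:
    status = "do not release" if cv > CV_THRESHOLD else "acceptable"
    print(f"{label}: CV = {cv:.1f}% ({status})")
```

The same 10% estimate is unreliable when it rests on a cell of 50 cases but acceptable with 400 cases, which illustrates why small cells in disaggregated tables need to be checked.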
Since national population censuses collect data from all households and individuals in the population, sample design and weighting are not at issue. The only exception is when an additional questionnaire, with more detailed questions, is used with a sample of the population. In such cases, however, simple and self-weighting designs (such as Stratified Simple Random Sampling or Systematic Sampling) are generally used, which removes the issues around complex survey designs.
3.4 Use of Survey Data along with EMIS Data/Indicators for EFA Monitoring
Administrative and household survey data sources measure educational participation in different ways. Administrative data are gathered through school reporting at the beginning of the school year, and in some cases reporting at the middle or the end of the school year. Enrolment ratios are based on the numbers of children enrolled in school and the school-age population estimated from national censuses and/or vital statistics.
Ideally, household surveys collect data about enrolment and/or school attendance based on a representative sample of children. The heads of households are generally asked questions about their children’s participation in school. The timing of surveys varies from one survey to another and is often unrelated to the school year. Some surveys may even span two different school years.
3.4.1 Limitation of Data
Estimates of educational participation from these two sources may differ for a number of reasons. One major factor is that the question asked about children’s school attendance in household surveys is different from the questions that are asked in school censuses. There is a difference, for example, between being enrolled in school and actually attending school. Children may be recorded in school enrolment records but never actually attend school, or may not be attending school at the time of the survey. Thus, values for enrolment ratios from population censuses and household surveys may be slightly lower than those gathered from administrative data.
The different rates of participation can also be attributed to the timing of data collection relative to the school year. A school census, which is conducted at the beginning of the school year, and a household survey, which may be conducted at the end of the school year, will likely show different rates of participation because some children may have enrolled in school without ever actually attending, and other children may have dropped out of school during the school year.
In addition, the accuracy of the population estimate and the completeness of school-level data can affect the calculation of participation rates from administrative data. Similarly, the completeness of the census enumeration and the sample design for the household survey may also affect the accuracy of estimates produced by censuses and surveys.
In short, many factors may contribute to variations in the estimates of school participation rates from administrative data and household surveys. Further research is needed to explore the reasons for similarities or differences between the measures of participation from these two sources.
3.4.2 Benefits of using household survey data
Data for several variables that are important for policy, planning and EFA monitoring are often not collected by schools or during the annual school census. For example, data about ethnicity, household socio-economic status, the education/literacy status of parents, and information about disadvantaged or mobile population groups are not necessarily available at school. Nor do schools collect data about school-aged children from the general population who are not enrolled in school. Such data can only be obtained from the national population census or from household surveys.
It is important, therefore, to use both school administrative data and secondary data from population censuses and household surveys for policy analysis, planning and EFA monitoring, especially for measuring the progress towards “reaching to the unreached”.