Module B

ANOVA

ANOVA stands for analysis of variance. It is a common statistical procedure in the social sciences used, among other things, for: (1) testing the equality of sub-group means, and (2) analyzing the variation in the observed variable (Y) due to different explanatory variables (X1, X2, … , Xn), which are defined as sources of variation or as criteria of classification for the observations (of Y).
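As an illustration of use (1), the sketch below computes the one-way ANOVA F statistic by hand, as the ratio of between-group to within-group mean squares. The function name and the three groups of scores are hypothetical, and the p-value lookup is left out.

```python
import statistics

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores (Y) for three sub-groups defined by one factor (X1).
scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(scores)
print(f_stat, df_between, df_within)
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom suggests that the sub-group means are not all equal.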

Chi-square test

It is a statistical test used to determine whether two categorical variables are associated or not. Although there are several different Chi-square tests, the most widely used one is Pearson’s χ2 test, which covers two types of comparison: tests of goodness of fit and tests of independence. “Goodness of fit” tests whether an observed frequency distribution differs from a theoretical distribution or not, while the “test of independence” assesses whether paired observations on two variables, expressed in a contingency table (two-way cross-tabulation), are independent of each other or not.
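A minimal sketch of the test of independence, assuming a small hypothetical 2 x 2 table of counts. It computes Pearson's statistic as the sum of (observed - expected)² / expected over all cells; comparing the result with a critical value is left to a statistical table or package.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts: rows = urban/rural, columns = attending/not attending school.
observed_table = [[30, 20], [20, 30]]
statistic, dof = chi_square_statistic(observed_table)
print(statistic, dof)
```

With 1 degree of freedom the 5% critical value is about 3.84, so a statistic above that would lead to rejecting independence at that level.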

Cleaning database

It is a process to increase the accuracy of the data and streamline the database, by removing/correcting duplicate and wrong data in the database.

Code book (or) codebook

A document used for implementing codes.  It reports dictionary information such as variable names, variable labels, value labels, and missing values.

Coding

It is the process of statistical classification of information. Coding converts answers to open-ended questions, or values held as long text, into shorter alphanumeric strings. It can also reduce numerical data with many different values to a limited number of codes (generally short alphanumeric strings), for example, age in single years to age groups.

Coefficient of Variation (CV)

The coefficient of variation (CV) is a common measure of dispersion in social statistics. It is calculated as the ratio of the standard deviation to the mean. CV is useful for comparing the degree of variation from one data series to another, even if their means are considerably different.
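The ratio can be sketched as follows. The two series are hypothetical and were chosen so that their means differ by a factor of 100 while their relative spread is identical; the sample standard deviation is used here, which is an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean

# Hypothetical household sizes and monthly incomes for the same five households.
household_size = [3, 4, 5, 4, 4]
monthly_income = [300, 400, 500, 400, 400]

cv_size = coefficient_of_variation(household_size)
cv_income = coefficient_of_variation(monthly_income)
print(round(cv_size, 4), round(cv_income, 4))
```

Because CV is unit-free, the two series can be compared directly even though one is measured in persons and the other in currency.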

Cohort Survival Rate (CSR)

The percentage of students enrolled in the first grade of a specific education level in a given school year who reach the final grade of that level, regardless of the number of years spent to reach it.

Contingency Table

A table showing the distribution of one variable in rows and another in columns is known as a contingency table. It is used to study the association between the two variables.

Correlation

It is a single number that describes the degree of relationship between two variables. The correlation coefficient is useful since it can indicate a predictive relationship and point to possible causal or mechanistic relationships, although correlation alone does not establish causation.

Coverage

The proportion or extent or degree to which the entire area under study is observed, analyzed, and reported by the survey.

Cross tabulation (Cross tab)

A crosstab displays the joint distribution of two or more variables. It is usually presented as a contingency table in a matrix format. Whereas a frequency distribution provides the distribution of one variable, a contingency table describes the distribution of two or more variables simultaneously.
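A small sketch of building a two-variable crosstab from individual records, using only the standard library. The records and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical survey records: (area, attends_school).
records = [
    ("urban", "yes"), ("urban", "yes"), ("urban", "no"),
    ("rural", "yes"), ("rural", "no"), ("rural", "no"),
]

# Count how often each (row value, column value) pair occurs.
cell_counts = Counter(records)
row_labels = sorted({area for area, _ in records})
col_labels = sorted({answer for _, answer in records})

# Arrange the joint counts as rows (area) by columns (attendance).
crosstab = {r: {c: cell_counts[(r, c)] for c in col_labels} for r in row_labels}
for r in row_labels:
    print(r, crosstab[r])
```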

CSPro (Census and Survey Processing System)

It is a public domain software package which can be used for entering, editing, tabulating, and mapping census and survey data. It is widely used by statistical agencies in developing countries, especially for data entry.

CSPro was designed and implemented through a joint effort among the developers of IMPS and ISSA: the United States Census Bureau, Macro International, and Serpro, S.A. Funding for the development is provided by the Office of Population of the United States Agency for International Development. CSPro is designed to eventually replace both IMPS and ISSA.

Data

It is a common term for facts, statistics, or items of information from which conclusions may be drawn through analysis.

Database

A database consists of an organized collection of data for one or more uses, typically in digital form. Databases are managed using database management systems (DBMS), which store database contents, allowing data creation (adding variables and/or cases) and maintenance (editing, replacing, sub-setting, merging, …), and search and other access.

Data analysis

It is a process to get estimated values of the key indicators, and could also include perceived standard errors of estimates and pre-determined levels of disaggregation.

Data collection

Data collection is the main process of a survey including both preparing and collecting data. The purpose of data collection is to obtain information to keep on record, to make decisions about important issues and to pass information on to others. Primarily, data is collected to provide information regarding a specific topic.

Data dictionary

It is an integral part of a database, holding information on the database itself and the data (or variables) that it stores. A well-designed database should include a data dictionary for database administrators and users. It provides easy access to the type of data stored in every table, row and column of the database without actually accessing the database.

Data preparation

It includes checking and editing collected data; developing database (or dataset) structure; entering the data into the dataset (computer file); checking the dataset (database) for accuracy; transforming the data; and documenting database structure that integrates the various measures.

Data source

A data source is any source of data, such as a database, a computer file or a data stream. Data sources usually provide a certain amount of metadata.

Data Validation

It is a process of ensuring that a program operates on clean, correct and useful data. It uses a set of sequential actions called “routines” or “validation rules”. Data validation checks that data are valid, sensible, reasonable, and secure before they are processed.
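A sketch of what a few validation rules might look like for one household-member record. The field names, the allowed ranges and the rules themselves are illustrative assumptions, not rules taken from any particular survey.

```python
def validate_record(record):
    """Return a list of rule violations for one household-member record."""
    errors = []
    age = record.get("age")
    # Range rule: age must be a plausible whole number.
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age must be an integer between 0 and 120")
    # Code-list rule: sex must be one of the allowed codes.
    if record.get("sex") not in {"M", "F"}:
        errors.append("sex must be 'M' or 'F'")
    # Consistency rule: young children should not be recorded as married.
    if isinstance(age, int) and age < 10 and record.get("marital_status") == "married":
        errors.append("marital status inconsistent with age")
    return errors

good = {"age": 34, "sex": "F", "marital_status": "married"}
bad = {"age": 150, "sex": "X"}
print(validate_record(good))
print(validate_record(bad))
```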

Descriptive statistics

All summary statistics and presentations that describe the main features of a collection of data, quantitatively or graphically, are collectively termed descriptive statistics. Common examples include: measures of central tendency (mean, median, and mode), measures of dispersion (range, percentile, variance, standard deviation, coefficient of variation, skewness and kurtosis, moments), summary tables (frequency table, crosstab/contingency table), dependency (or correlation), and graphical presentation of data.

Demographic and Health Survey (MEASURE-DHS)

The Demographic and Health Survey (DHS) Project has provided technical assistance to more than 200 demographic and health surveys in 75 countries advancing global understanding of health and population trends. In 1997, DHS became one of four components of the “Monitoring and Evaluation to Assess and Use Results (MEASURE)” Program.

Disaggregation

A process of breaking up a total (aggregate), an integrated whole, or a conglomerate into smaller elements, parts, or units, usually for easier handling or management or for better understanding. In data analysis, it is a process of breaking down an indicator by sub-categories or factors which could explain the underlying nature or value of the indicator better or in more detail. It should be noted that over-disaggregation reduces the accuracy of the results; thus, the feasible level of disaggregation depends on the initial survey design, including the sample size.

Educational attainment

Level of educational attainment refers to the highest level of education that a person has completed. Therefore, it is distinct from the level of schooling, which refers to the education level that a person is currently attending. A person's level of schooling must be the same as or higher than his/her educational attainment.

Less commonly, educational attainment refers to the highest grade that a person has completed successfully, which is also different from the grade that a person is attending.

EPI Info

It is a public domain statistical software package for epidemiology, developed by the Centers for Disease Control and Prevention (CDC) in Atlanta, Georgia (USA), since 1985. It is designed for the global community of public health practitioners and researchers.

Estimation

It is an approximate calculation of an unknown value or judgment of an unknown situation based on partial evidence. In sampling theory, estimation is a procedure for calculating a value of a property of the population based on the observations of a sample drawn from the population.

Factor Analysis

Factor analysis attempts to identify explanatory variables or factors, X1, X2, …, Xn, that explain the pattern of correlations within a set of observed variable(s), Y1, Y2, … , Ym. Factor analysis is often used in data reduction to identify a small number of factors that explain most of the variance in the observed variable(s). Factor analysis can also be used to generate hypotheses regarding causal mechanisms or to screen variables for subsequent analysis, such as identifying collinearity prior to performing a linear regression analysis.

Frequency Table

Frequency (or one-way) table represents the simplest method for analyzing categorical (nominal) data. It is often used as one of the exploratory procedures to review how different categories of values are distributed in the sample.

Gender Parity Index (GPI)

A socioeconomic index usually designed to measure the relative access to education of males and females. In its simplest form, it is calculated as the number of females divided by the number of males enrolled in a given stage of education.
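The simplest form of the index can be sketched as below. The enrolment figures are hypothetical.

```python
def gender_parity_index(female_enrolled, male_enrolled):
    """Return female enrolment divided by male enrolment."""
    if male_enrolled == 0:
        raise ValueError("GPI is undefined when male enrolment is zero")
    return female_enrolled / male_enrolled

# Hypothetical primary-level enrolment counts.
gpi = gender_parity_index(female_enrolled=4750, male_enrolled=5000)
print(round(gpi, 2))
```

A value of 1 indicates parity; a value below 1 indicates that fewer females than males are enrolled.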

Household

A basic residential unit in which economic production, consumption, inheritance, child rearing, and shelter are organized and carried out. A household is broader than a family, which is a group of people related by blood or marriage, such as parents and their children only.

Household Roster

It is the main part of every household survey, listing all household members and their personal characteristics. Personal characteristics may include: age, sex and relationship to head of household for every member; education and literacy status for persons aged 5 and above; schooling status for those aged 5-24 (or 6-14, 6-19, etc.); marital status for all persons aged 15 and above; and so on.

Household Survey

It is a process of data collection and analysis for understanding general situation and exploring specific characteristics of households or household population.

IMPS (Integrated Microcomputer Processing System)

IMPS performs the major tasks in survey and census data processing: data entry, data editing, tabulation, data dissemination, basic statistical analyses and data capture control. It can be used as a complete processing system or as stand-alone modules. The most recent version of IMPS is 4.1, released in 2000; it later became a crucial part of CSPro.

Imputation

A procedure to find an appropriate substitute for a missing data point or component of a data point in a case based on the experience of similar cases.
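One simple way to draw on "similar cases" is to replace a missing value with the mean of the observed values in the same group. The sketch below assumes that approach; the function name, the grouping variable and the records are hypothetical, and other methods (for example hot-deck or regression imputation) are equally valid.

```python
import statistics

def impute_group_mean(records, group_key, value_key):
    """Fill missing values (None) with the mean of observed values in the same group."""
    observed = {}
    for rec in records:
        if rec[value_key] is not None:
            observed.setdefault(rec[group_key], []).append(rec[value_key])
    group_means = {g: statistics.fmean(vals) for g, vals in observed.items()}

    completed = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the original records stay unchanged
        if new_rec[value_key] is None and new_rec[group_key] in group_means:
            new_rec[value_key] = group_means[new_rec[group_key]]
        completed.append(new_rec)
    return completed

households = [
    {"area": "urban", "income": 400},
    {"area": "urban", "income": 600},
    {"area": "urban", "income": None},
    {"area": "rural", "income": 200},
    {"area": "rural", "income": None},
]
completed = impute_group_mean(households, "area", "income")
print([rec["income"] for rec in completed])
```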

ISSA (International System for Survey Analysis)

ISSA is a complement to IMPS, providing basic to advanced data analysis capabilities for surveys. It was also integrated into CSPro in the mid-2000.

Kurtosis

Kurtosis is a statistical measure showing the level of “peakedness” of the distribution of a variable. Measured as excess kurtosis (a convention used by many statistical packages), the kurtosis of the normal distribution is zero (0). A value above zero means the distribution under study is more peaked than the normal (generally speaking, “bell”-shaped) distribution, and a value below zero means it is flatter than normal.
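A minimal sketch, assuming the excess-kurtosis convention (fourth standardized moment minus 3), under which the normal distribution scores zero. Population moments are used without any small-sample correction, and both data sets are hypothetical.

```python
import statistics

def excess_kurtosis(values):
    """Fourth standardized population moment minus 3."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])  # variance
    m4 = statistics.fmean([(v - mean) ** 4 for v in values])  # fourth central moment
    return m4 / m2 ** 2 - 3.0

flat = [1, 2, 3, 4, 5]                # evenly spread values
peaked = [3, 3, 3, 3, 3, 3, 1, 5]     # most values at the centre, a few far out

print(excess_kurtosis(flat))    # negative: flatter than normal
print(excess_kurtosis(peaked))  # positive: more peaked than normal
```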

LFS (Labour Force Survey)

LFS is one of the most common and most frequently conducted household surveys. The first recorded LFS was conducted in the USA in 1940, followed by Australia in 1960, the United Kingdom in 1973, and so on. Currently, LFS is conducted monthly in the USA, and quarterly (four times a year) in Australia, New Zealand, the United Kingdom and almost all countries in the European Union.

Linear Regression

Linear regression estimates the coefficients of the linear equation, involving one or more independent variables (X1, X2, …, Xk), that best predict the value of the dependent variable (Y). For example, one can try to predict the wealth score (Y) through independent variables such as age of head of household (X1), sex (X2), education level of head of household (X3), urban or rural location (X4), etc. (… Xk). A linear regression equation is expressed as:

$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki} + e_i, \quad i = 1, \dots, n$

where $y_i$ = the ith value of the variable under observation;

$b_0, b_1, \dots, b_k$ = regression parameters (or coefficients);

$x_{1i}, x_{2i}, \dots, x_{ki}$ = the ith value of the explanatory variables; and

$e_i$ = the error term for the ith observation.
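For the one-predictor case, the coefficients that minimise the sum of squared errors can be computed directly, as in the sketch below. The data (years of education of the household head against a wealth score) are hypothetical, and the multi-predictor case would normally be handled by a statistical package.

```python
import statistics

def fit_simple_regression(x, y):
    """Least-squares estimates of b0 and b1 in y = b0 + b1*x + e."""
    x_mean = statistics.fmean(x)
    y_mean = statistics.fmean(y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_mean - b1 * x_mean  # intercept
    return b0, b1

# Hypothetical observations.
education_years = [1, 2, 3, 4, 5]
wealth_score = [2.9, 5.1, 7.0, 9.1, 10.9]

b0, b1 = fit_simple_regression(education_years, wealth_score)
# Residuals are the estimated error terms, one per observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(education_years, wealth_score)]
print(round(b0, 3), round(b1, 3))
```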

LSMS (Living Standard Measurement Survey)

It was established by the Development Economic Research Group (DECRG) of the World Bank to explore ways of improving the type and quality of household data collected by statistical offices in developing countries.

MDG (Millennium Development Goals)

The MDG goals and targets come from the “Millennium Declaration” signed by 189 countries, including 147 Heads of State, in September 2000. The goals and targets are inter-related and should be seen as a whole. They represent a partnership between the developed countries and the developing countries determined, as the Declaration states, “to create an environment – at the national and global levels alike – which is conducive to development and the elimination of poverty”.

Mean

The statistical term “mean” is the average value of a variable. The most frequently used mean value is the arithmetic mean which is the sum of the observed values of the variable divided by the number of observations (n).

Metadata

Metadata is a concept that applies mainly to electronically archived or presented data. It is used to describe the definition, structure, and administration of data files, with all contents in context, to ease the further use of the captured and archived data.

MICS (Multiple Indicators Cluster Survey)

MICS is a household survey developed by UNICEF to assist countries in filling data gaps for monitoring the situation of children and women. It was originally developed in response to the World Summit for Children to measure progress towards an internationally agreed set of mid-decade goals.

Missing value

This occurs when no data value is stored for the variable in the current observation. Missing values occur commonly, and statistical methods have been developed to deal with this problem.

Nominal variables

Nominal variables are based on a nominal scale, which classifies data into various distinct categories in which no ordering is implied.

Nonparametric test

A statistical hypothesis testing method whose interpretation does not depend on any parameterized distributions. Statistics based on the ranks of observations are one example of such statistics and these play a central role in many nonparametric approaches.

OLAP Cube (Online Analytical Processing Cube)

A multidimensional table that calculates and displays basic statistics for summary variables within categories of one or more grouping variables. The cube allows different views of the data to be quickly displayed.

Ordinal variables

Ordinal variables are based on an ordinal scale, which classifies data into distinct categories in which ordering is implied.

Outlier

An observation that is numerically distant from the rest of the data.

Pivot table

It is a data summarization tool to create output table in different formats. Pivot-table tools can automatically sort, count, and total the data stored in one table or spreadsheet and create a second table.

Population census

A census consists of an enumeration of the entire population in a specified area, conducted regularly at a fixed time interval. It is a procedure of systematically acquiring and recording information about all members of a given population.

PPS (Probability Proportional to Size)

In PPS sampling, the selection probability for each element is set to be proportional to its size, up to a maximum of 1. It can improve accuracy by concentrating sample on large elements that have the greatest impact on population estimates. PPS sampling is commonly used for surveys of businesses, where element size varies greatly and auxiliary information (population size) is often available.
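A simplified sketch of the selection probabilities described above, assuming a fixed sample size and a known size measure for each element. The function name and the sizes are hypothetical; a production design would also redistribute probability after capping at 1, which is not done here.

```python
def pps_inclusion_probabilities(sizes, sample_size):
    """Selection probability proportional to size, capped at a maximum of 1."""
    total = sum(sizes)
    return [min(1.0, sample_size * s / total) for s in sizes]

# Hypothetical size measures (e.g. number of employees) for four businesses.
sizes = [50, 150, 300, 500]
probabilities = pps_inclusion_probabilities(sizes, sample_size=2)
print(probabilities)

# When one element dominates, its probability is capped at 1 (certain selection).
capped = pps_inclusion_probabilities([100, 900], sample_size=2)
print(capped)
```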

Proxy reporting

It refers to reporting by a respondent for other family members who are not present during the survey.

PSPP

A free, open-source alternative to the proprietary statistics package SPSS. It allows data analysis through a graphical user interface or a conventional command line interface.

Questionnaire

It is a written (printed) set of questions on a particular subject, given to a large number of people in order to collect information.

Sample design

It is an important part of a sample survey, determining at least: (1) the target population, (2) the sample size and (3) the method of selecting the sample. Determination of sample size is based on such factors as time available, budget and the necessary degree of precision.

Sampling

A part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals intended to yield required understanding about the population under study, especially for the purpose of making estimations based on statistical inference.

Sample size

It is the total number of units (e.g. persons or households) to be selected from the population in the study area.

SAS (Statistical Analysis System)

It is an integrated system of software products from “SAS Institute Inc.”. SAS enables programmers (users) to perform many different kinds of analysis, data management and output-generating functions.

School census

School census is an official data collection (processing and dissemination) system covering all schools / institutions within the education system(s), normally conducted once a year.

Skewness

Skewness measures the deviation of the distribution from symmetry. If the skewness is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical.

SPSS (Statistical Package for Social Sciences)

SPSS is one of the most popular data analysis software packages, supporting various statistical methods and procedures. SPSS was first developed in 1968 at Stanford University for internal use only. In March 2009, the name SPSS was changed to PASW (Predictive Analytics SoftWare) in version 17, and it was changed again to IBM SPSS in mid-2010 in version 19.

SRS (Simple Random Sampling)

Each element (e.g., person, household, etc.) in the sampling frame has an equal probability of selection. Moreover, any given pair (or group) of elements has the same chance of selection as any other such pair (group). This minimizes bias and simplifies analysis of results. By using simple random sampling, no weighting is needed in estimating population parameters (or indicators) based on the samples.
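A short sketch of drawing a simple random sample without replacement from a sampling frame. The frame of 100 household identifiers and the seed value are hypothetical; the seed is fixed only so that the draw can be reproduced.

```python
import random

# Hypothetical sampling frame of 100 household identifiers.
sampling_frame = [f"HH{i:03d}" for i in range(1, 101)]

rng = random.Random(2024)  # fixed seed for a reproducible draw
# Every household has the same chance of selection; no household is drawn twice.
sample = rng.sample(sampling_frame, k=10)
print(sample)
```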

Standard deviation (SD)

Standard deviation is a commonly used measure of variation. The standard deviation of a population of values is computed as the square root of the variance, where the variance is the mean of the squared deviations of the observed values from their population mean. It shows how tightly the values in a set of data are clustered around the mean.
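The population formula can be followed step by step as below, using a small hypothetical data set. For a sample rather than a full population, the divisor would usually be n - 1 instead of n.

```python
import math
import statistics

def population_sd(values):
    """Square root of the mean squared deviation from the population mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

values = [2, 4, 4, 4, 5, 5, 7, 9]
sd = population_sd(values)
print(sd)

# The standard library gives the same result for the population formula.
print(statistics.pstdev(values))
```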

Stata

The name “Stata” is formed from letters of the words “statistics” and “data”. It is a general-purpose statistical software package with a full range of capabilities, including data management, statistical analysis, graphics, simulations, and custom programming.

Structured Query Language (SQL)

SQL is a standard programming language used for accessing and maintaining a database. It enables users to query an outside data source about the data it contains. An SQL statement can be used to specify the desired tables, fields, rows, etc. to return as data. The key feature of SQL is an interactive approach for getting information from and updating a database.
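A brief sketch of querying and updating a database with SQL statements, run here through Python's built-in sqlite3 module against an in-memory database. The table name, its columns and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database held in memory
conn.execute(
    "CREATE TABLE household (id INTEGER PRIMARY KEY, area TEXT, members INTEGER)"
)
conn.executemany(
    "INSERT INTO household (area, members) VALUES (?, ?)",
    [("urban", 4), ("urban", 6), ("rural", 5), ("rural", 7), ("rural", 3)],
)

# Updating: correct one wrongly entered value.
conn.execute("UPDATE household SET members = 8 WHERE area = 'rural' AND members = 7")

# Querying: specify the fields and rows to return, summarised by area.
rows = conn.execute(
    "SELECT area, COUNT(*), AVG(members) FROM household GROUP BY area ORDER BY area"
).fetchall()
conn.close()

for area, n_households, mean_members in rows:
    print(area, n_households, round(mean_members, 2))
```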

Syntax

A set of rules that define the combinations of symbols and statements that are considered to be correctly structured programs in a programming language.

T-test

It is the most commonly used method to evaluate the differences in means between two groups. The groups can be independent (e.g., wealth index of households in urban vs. rural areas) or dependent (e.g., wealth index of households in MICS-2005 and MICS-2010). Theoretically, t-test can be used even if the sample sizes are small (< 30), as long as the variables are approximately normally distributed and the variation of values in the two groups is not significantly different.
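For two independent groups with similar variation, the t statistic can be sketched with a pooled variance as below. The wealth index values are hypothetical, and looking up the p-value for the resulting degrees of freedom is left to a table or a statistical package.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.fmean(a) - statistics.fmean(b)) / std_error
    return t_stat, dof

# Hypothetical wealth index values for small urban and rural samples.
urban = [3, 4, 5, 6, 7]
rural = [1, 2, 3, 4, 5]
t_stat, dof = pooled_t_statistic(urban, rural)
print(t_stat, dof)
```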

Validation rule

A criterion used in the process of data validation, carried out after the data has been coded and entered into electronic form (database or dataset).

Variable

It stands for a value that may vary from time to time or from person to person or … A variable can also be defined as a storage location capable of containing data that can be modified during program execution. Each variable has a unique name and its data type can also be specified, if necessary.

Visual Binning

It is a process that automatically creates new variables by grouping contiguous values of existing variables into a limited number of distinct categories. This can create a categorical variable from a continuous scale variable, or collapse a large number of ordinal categories into a smaller set of categories, such as age recorded in single years into 5-year age groups.
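The age example can be sketched as a small function that maps each single-year age to its 5-year group. The function name and the label format are illustrative choices.

```python
def five_year_age_group(age):
    """Map an age in single years to a 5-year group label such as '15-19'."""
    if age < 0:
        raise ValueError("age cannot be negative")
    lower = (age // 5) * 5  # lower bound of the 5-year band
    return f"{lower}-{lower + 4}"

ages = [0, 4, 5, 17, 23, 24]
groups = [five_year_age_group(a) for a in ages]
print(groups)
```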

Wealth index

Wealth index can indicate the level of household economic status through several factors, including ownership of farm, land and other properties, animals and household amenities.

Weighting

Weighting is the process of making an allowance or adjustment in sample survey data analysis to take account of special circumstances or to compensate for a distorting factor. In a self-weighting sample survey, each unit (household or person) in the sample represents an equal number of units in the population. Most sample surveys are not self-weighting, and weights are necessary to adjust the sample so that it represents its population in statistical analysis. Weighting can also be used to scale a sample up to population size for reporting purposes.
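A sketch of why weights matter in a sample that is not self-weighting. It assumes a hypothetical design in which 10 households are sampled from each of two strata of very different size (1,000 urban and 4,000 rural households), so each urban household represents 100 households and each rural household represents 400.

```python
def weighted_mean(values, weights):
    """Mean of values, with each value counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# 1 = household has electricity, 0 = it does not (hypothetical responses).
urban_values = [1] * 8 + [0] * 2
rural_values = [1] * 4 + [0] * 6
values = urban_values + rural_values

# Weight = stratum population / stratum sample size.
weights = [1000 / 10] * 10 + [4000 / 10] * 10

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

The unweighted figure overstates coverage because urban households are over-represented in the sample relative to their share of the population.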