3. Data Validation
Why is data validation required?
Rapidly increasing computing power and storage capacity at lower cost are making it now relatively cheap and easy to store and process large amounts of data. In recent years, many surveys have therefore been designed to collect several items which require more variables and with larger sample sizes, and which creates more cases in SPSS.
Validating and analysing more variables and cases increases the workload of all data handlers including coding staff, entry clerks, and data editors. When there is a combination of pressure to complete the analysis on time and inefficiencies in training and recruiting staff, the quality of data transmitted from the data manager to the analysts can be questionable. In some cases, surveys have been planned without including a step to check the coding or to verify that the data have been entered accurately.
Analysts of education data often obtain data sets from surveys about education from a variety of sources. There is often no way they can re-check the coding or accuracy of data entry. Validation rules are used, therefore, to check the validity and consistency of data before using the data set.
Generally, there are three types of rules for validating a data set:
- Single-variable rules
- Cross-variable rules, and
- Multi-case rules.
In SPSS (PASW) Statistics 17.0, these rules are not available in the base system. They are part of the optional “Data Preparation” add-on module. These tasks can, however, be carried out through common SPSS commands, but this requires the user to have a good understanding of the syntax and structure of the SPSS programming language.
To work with the first two types, single-variable rules and cross-variable rules, the user needs to understand ‘case selection’, which was discussed in the previous section. The third rule is more complicated and the data may need to be manipulated using several steps, such as creating temporary variables, matching, aggregation and selection of cases.
SPSS provides a procedure, ‘Identify Duplicate Cases’ in the ‘Data’ menu that is used to identify duplicate cases in a data file; this is the most important part of the multi-case rule.
3.1 Single-Variable Rules
These rules consist of a set of checks that can be applied to a variable. Normally, checks for out-of-range or invalid values and missing values are included in this category. For example, a value of 5 was entered for the ‘highest education level (HV106)’, but this is an invalid code; valid codes are only 0, 1, 2, 3 and 8. Similarly, single-variable rules can be used to check whether values other than 1 and 2 (or ‘Male’ and ‘Female’) are entered in variable ‘sex of household members (HV104)’.
There are three stages in the process of validating the data entered for a variable. Then any invalid values that are identified can be edited. The first stage in validating a variable is obtaining valid values or ranges from the codebook. Valid values for HV104 (sex) are, for example, only allowed to be either 1 or 2. If any other values are entered for this variable, they are invalid.
In the second stage, we construct a frequency table. If there are no invalid values displayed in the frequency table, the variable under observation is ‘valid’ with the single-variable rule. The third stage of validation is used if invalid values are observed in the frequency table. If, for example, the value ‘3’ (an invalid value) appears in the frequency table for HV104, we need to identify where those erroneous cases are. We use the following procedure to select cases, which allows us to extract cases with invalid values for the variable in the data set.
To check the validity of ‘sex of household members (HV104)’, follow the steps:
- Click ‘Analyse’ on main menu bar.
- Click ‘Descriptive Statistics’.
- Then, click ‘Frequencies’ again.
- In the ‘Frequencies’ window, select the variable to study (HV104) and click ‘OK’ to construct a frequency table.
In the above frequency table, five cases with the values 3, 4, and 5 are invalid. Therefore, it is necessary to check which cases contain these invalid values by using “case selection” of invalid cases.
To select invalid cases:
- Click ‘Data’ on main menu bar.
- Click ‘Select cases’.
- On ‘Select cases’ window, check the option button ‘If condition is satisfied’ and click the ‘If” button.
- In the ‘Select cases: If’ window, type in criteria: ‘not (HV104=1 or HV104=2)’ or ‘~(HV104=1 | HV104=2)’ and click ‘Continue’.
- Check ‘Copy selected cases to new data set’ option button and provide the new data set name, e.g. ‘Invalid_Cases’
- Click ‘OK’ to execute the case selection command.
The output, new data set contains five invalid cases (after moving variable HV104 to second position to get a better view) as below:
In this case, the user must decide whether to erase the entire case from the data set, change the invalid values to “missing values”, or check other data sets that have different values and use them to correct the invalid values in the current data set.
3.2 Cross-Variable Rules
With cross-variable rules, users have to use cross-tabulations instead of frequency tables to identify whether invalid cases exist or not, and to apply slightly different rules for conditional selection of invalid cases.
In the sample data set, if checked it alone using frequency tables, the variable ‘highest educational level (HV106)’ appears to have no invalid cases. When cross-checking with ‘age of the household members (HV105)’, however, a few suspect entries are highlighted.
In the cross-tabulation for age and highest education level, we can easily see two cases that are invalid; one case of ‘age 4 in primary education’ and one case of ‘age 12 in higher education’. Moreover, there are few more cases that may not be reliable because they are on the margins of age and education level. There are few options for developing cross-variable validation rules:
Option 1 – to sip out all susceptible cases (invalid and marginal ones):
i) with primary education at age 5 or below (the official entrance age is 6),
ii) with secondary education at age 10 or below (the official starting age is 6+5=11), and
iii) with higher education at age 15 or below (the official starting age is 6+5+5=16).
Option 2 – to review the invalid cases, one can use the following cross-variable rules with a grace period (early entrance) of one year:
i) with primary education at age 4 or below (the official entrance age is 6 but 5 can be allowed),
ii) with secondary education at age 9 or below (the official starting age is 6+5=11), and
iii) with higher education at age 14 or below (the official starting age is 6+5+5=16).
Then, the “If” statements to be used in case selection are:
Option 1: (HV105 <= 5 and HV106 = 1) or (HV105 <= 10 and HV106 = 2) or (HV105 <= 15 and HV106 = 3)
Option 2: (HV105 < 5 and HV106 = 1) or (HV105 < 10 and HV106 = 2) or (HV105 < 15 and HV106 = 3)
And the following outputs will be obtained after running appropriate case selection procedures as presented in the previous section.
Option 1: Both invalid and marginal cases
Option 2: Only certainly invalid cases
3.3 Multi-Case Rules
Multi-case rules are defined by a procedure (sequence of logical expressions) that flags invalid cases. The most common and useful application of multi-case rules is checking whether there are duplicates in the data set, such as cases that have been entered more than once for a single household, or a household that has two heads, or two persons in the same household who have the same personal ID, and so on.
SPSS allows users to check duplicate cases and inspect unusual cases. Follow the steps below to check duplicate cases:
- Click ‘Data’ on main menu bar.
- Select ‘Identify Duplicate Cases’. A new window will appear.
- Select variables to identify duplicate cases (or press Ctrl+A to select all and release unnecessary variables) and send to the space below ‘Defined matching cases by:’
- Set the options:
(a) ‘Sort within matching group’ – select the variable(s) from the remaining ones in the list, as the key for sorting within the matching groups.
(b) ‘Sort’ – if a key variable for sorting is selected, define the sort order.
(c) ‘Variables to create’ – tick in the check box, if you want a frequency table to show ‘how many duplicates are detected’, or to highlight duplicate cases according to the following criteria:
i. The primary case, the first or last case among the duplicates.
ii. Whether to count all duplicate cases sequentially or to only count the non-primary cases (the primary case is not considered to be a duplicate);
(d) Tick ‘Move matching cases to the top’ to make reviewing duplicates easier.
(e) If required, tick ‘Display frequencies for created variables’.
- Click ‘OK’ to proceed.
With the above options set, the result of checking duplicate cases is displayed in the following frequency table.
The above frequency table shows there are six duplicates among the 1,889 cases, but this table does not tell us the exact nature of the duplicates.It is possible that each duplicate may be the same (with just one primary case and the group of seven cases are the same in all variables) or there may be six pairs of duplicates (six primary cases and one duplicate for each primary case).We need to review the data set to help us understand the nature of duplicates and know how best to deal with those duplicates.
The following exhibit shows the groups of duplicates displayed on top of the data set.
After validation checks, the data set should be edited where necessary. After data validation and preparation, the next step is analysing ‘clean data’ using appropriate SPSS procedures under the ‘Analyze’ menu.