Module B3: Checking, Editing and Preparing Household Survey Data for Analysis

1. Metadata Preparation

One of the most famous computer and ICT terms is GIGO, “Garbage in Garbage out”. It simply means that if a data set contains errors, outputs generated from that data set are not reliable and are not usable. We should keep in mind, therefore, that a data set may not be ready to be analysed immediately after loading it. Although the SPSS Data Editor can display the contents of a data file, the software cannot ensure the quality of data that are contained within it.

To conduct a meaningful analysis, the data analyst must understand the procedures that were used to collect the data, the structure of the questionnaire and coding rules. Moreover, it is important that the analyst knows how the data set was prepared and distributed. The data analyst will only be able to understand and conduct meaningful analysis of the data if the data items and variables are defined properly.

After loading a data set into SPSS, the next logical steps include:

 

Metadatapreparation:

Defining data

This step is required when data is imported from other formats such as Excel, text files or databases, or entered from paper files. While importing data from other formats, only data values with a variable name, and at most, the defined missing values will be included in the new SPSS data set. The data analyst should define the data by providing appropriate variable name, inserting value labels, setting missing values and defining measurement level for each and every variable.

Editing data definition

The data analyst should review every data set – even those that are in SPSS format –when they are opened in the Data Editor for the first time. To do this, the analyst should record variables in the data set and their valid values or range, labels, and measurement levels. Next the analyst should identify values, or combinations of values, that are impossible but commonly miscoded. Finally, the analyst should define validation rules based on this information. It is a time-consuming task, but it helps to ensure data quality.

Datapreparation:

Even when the active data set is reliable, it may not be suitable for the type of analyses that will be performed. The analyst may have to undertake some special preparations such as sorting, aggregation, creation of new variables, conditional selection of cases, and sometimes merging of data sets.

Datavalidation:

The analyst must run basic checks, and checks against defined validation rules to identify invalid cases, variables, and data values. When invalid data are found, the analyst must investigate and, if possible, find the cause and try to correct the invalid data. If the invalid data cannot be corrected, the analyst must determine whether to omit the cases altogether or whether to include the case but set the invalid values as missing or of a special category.

Once the data set has been cleaned and prepared, the analyst can begin to analyse it in SPSS. The following sections highlight tools that are available in the SPSS base system for metadata preparation, data preparation, and data validation.

Metadata preparation is emphasized in the next section. Data preparation and validation will be discussed in later sections.

1.1     Defining Data:Setting Variable Properties

In SPSS, metadata or data dictionary is part of the dataset. It covers such properties as variable label, value labels, formats, and measurement level: scale, ordinal or nominal.

Data from other sources, such as Microsoft Excel, text files or Microsoft Access Database files can be imported into SPSS. While importing data, only the variable name, the data format (such as whether a variable is numeric or string, width and decimal places) and data values are imported into SPSS. Some properties, such as missing values, can be assigned while importing from databases. However, no description of variable (the variable label) and the meaning of the data values (value labels) will be imported, especially when codes were imported from the source instead of texts or words as illustrated in the following examples.

In the above example, the variable “HV104 (Sex of household member)” only has the values of either 1 or 2. Users, however, cannot possibly know what ‘1’ and ‘2’ stand for, since ‘1’ may stand for ‘Male’ or ‘Female’ depending on the coding scheme.

It is impossible, therefore, to answer a simple question: “how many household members are female?” from the above frequency table created by SPSS.

Similarly, from the frequency table above(of the variable HV106), no one can know:

  • What HV106 is.
  • What the values 0, 1, 2, 3, 8 and 9 stand for.
  • Why the codes jump from 3 to 8 and why 4, 5, 6 and 7 do not exist.

To answer such questions, after importing data or opening an existing data file, the analyst must specify, check and edit, the variable labels, value labels, missing values and measurement levels for each and every variable in the data set. While entering variable labels, value labels and missing values, the codebook – or survey questionnaire if the codes are printed in it – are essential references.

To define variable label, just click the appropriate cell and type in directly as following.

Again, to define the value labels, select “Variable View” in the SPSS Data Editor.

Then, follow the steps below:

  1. Click the cell under “Values” and the “Value Labels” window will pop-up.
  2. Type the code in “Value” box.
  3. Type the appropriate label in “Label” box.
  4. Press “Add” button and the value and its label will appear in the space below.
  5. Repeat Steps 2, 3 and 4 until all value labels been defined. Press “OK”, after having entered the last valid code, to store the value labels.

Note:   Starting from the version 17.0, SPSS allows users to check the spelling of value labels. This tool is available by clicking the “Spelling” tab. Similarly, users can identify “missing values” by clicking the cell under “Missing” and follow a similar procedure as was used to define value labels(see 6).

The same analysis (frequencies) of the variable “HV106” was tabulated as output after defining the variable label, value labels and missing values. As can be seen in the following exhibit, this output is easier to understand and ready to use in a report or presentation.

Within Variable View, all properties (or definitions) of the variables; name, type and measurement level can be added, changed or removed as required. By default, SPSS assigns a measurement level for the imported variables automatically as a ‘scale’ for numeric variables and ‘nominal’ for string variables. These automatic assignments are not suitable for some advanced analyses, and thus, the measurement level of the variables must be checked and can be changed.  The type of measurement for the variable “HV106” can, for example, be changed from nominal to ordinal, which is more suitable for this type  of variable.

1.2 Setting and Editing Metadata through Wizard

SPSS provides a wizard-like method of setting variable properties for the new variables, and also for checking and editing variable properties for existing variables in a dataset.

The “Metadata Wizard” can be applied to the imported data files instead of manually entering the settings as was described in the previous section.

Steps in this procedure are:

  1. Click “Data” on main menu bar.
  2. Select “Define Variable Properties…”

The ‘Define Variable Properties’ window pops up to allow choosing variables to be defined. For demonstration purposes, select the two variables HV219: ‘Sex of head of household’ and HV104: ‘Sex of household member’ in the following example.

  1. Click the variable name(s) to select the variable(s) to be defined.
  2. Double-click or click  to move variable name to the right ‘Variables to scan’ pane; repeat steps 3 and 4 until all the required variables have been placed in the pane on the right.
  3. After selecting all variables, click ‘Continue’ to start scanning the variables.

A window entitled ‘Define Variable Properties’ will appear and show the scanned results by variable. In this window, we can set:

(i)  Variable label (enter the labels into the blank spaces provided).

(ii)  Data type (select from the drop-down menu), width and decimal places (type-in).

(iii)  Measurement level (select from the drop-down menu).

After completing these steps for the variable HV219, select HV104 and follow the same procedure described in steps (i), (ii) and (iii).

  1. Complete ‘Setting variable properties’ by clicking ‘OK’.

To copy variable properties, except variable label, from HV219 to HV104:Alternatively, after entering the settings for HV219, its properties can be copied to HV104 because both variables have the same nature and use the same codes; 1=Male and 2=Female (i.e. same value labels).

(a)   Press ‘To Other Variables’ button.

Then, in the ‘Apply Labels and Level to’ window:

(b)  Select the variable HV104

(c)   Click ‘Copy’ to copy the variable properties.

All properties of the variable HV219, except variable label, are copied to HV104. Thus,

(d)   Type in variable label for HV104, and click ‘OK’ to complete the process.

It should be noted that variable properties can only be copied among the variables that were scanned during the same session.

And, the data set will appear in the Variable View as follow:

Setting of variable properties should be carried out on all variables in the data set for easier understanding and effective analyses.

Tips:Sometimes, source data file contains data in “text format” for some variables, such as “male” or “female” instead of 1 and 0. In this case, it is essential to code such variables for easier analysis. SPSS provides automatic coding through the AUTORECODE command. For detailed information about the AUTORECODE command, please refer to ‘Base User Guide’ of SPSS (PASW) Statistics 17.0.

1.3     Copying File and Variable Properties

This feature is useful if several data sets have the same nature (for example data sets for different provinces in a country).  In SPSS, we can define properties on just one data set, then open other data sets one after another, and import properties from the defined data set.

Copy Data Properties’ under ‘Data’ menu provides the ability to use an external SPSS data file as a template for defining file and variable properties of the active data set. Also, properties of variables in the active data set can be copied to other variables in the same data set.

The ‘Copy Data Properties’wizard allows you to:

  • Copy selected file properties from an external data file or opened data set into the active data set. (File properties include: documents, file labels, multiple response sets, variable sets, and weighting).
  • Copy selected variable properties from an external data file or opened data set to matching variables in the active data set. (Variable properties include; value labels, missing values, level of measurement, variable labels, print and write formats, alignment, and column width used in the Data Editor).
  • Copy selected variable properties from one variable in an (i) external data file, (ii) opened data set, or (iii) the active data set to many variables in the active data set.
  • Create new variables in the active data set based on selected variables in an external data file or opened data set.

When copying data properties, the following general rules apply:

  • If an external data file is used as the source, it must be in SPSS format.
  • Undefined (empty) properties in the source data set do not overwrite defined properties in the designated data set.
  • Variable properties are copied from the source variable only to target variables of a matching type – string (alphanumeric) or numeric (including numeric, date, and currency).

Variable properties can be copied from the source file to matching variables in the active data set. Variables ‘match’ if both the variable name and type (string or numeric) are the same. For string variables, the defined length must also be the same.

The variables which are not in the active data set can be created using the properties of the selected variables in the source file. To do this, the source list must be updated to display all variables in the source data file. If you select source variables that do not exist in the active data set (based on variable name), new variables will be created in the active data set with the variable names and properties from the source data file.

If the active data set contains no variables – if it is a blank or new data set – all variables in the source data file are displayed, and new variables which are based on the selected source variables are automatically created in the active data set. This is the easiest way to create a new data set (like an Excel worksheet) for direct data entry and, also can be shared with the data set without data as electronic codebook.

To copy the data file properties and variable properties, which may be required after importing from other file formats, first, select‘Variable View’in the ‘Data Editor’and follow the steps below:

  1. Click ‘Data’ on main menu bar.
  2. Select ‘Copy Data Properties’and the ‘Copy Data Properties’ wizard will appear.
  3. Click the ‘Browse’ button on the bottom right area and select the SPSS data file that will be used as the source of the properties. OR, type in the file name with its full address, for example: ‘C:\ModuleB\Data\Data1.sav’

  1. Then, click ‘Next’ to proceed to Step 2 of the Wizard.

The Wizard will scan both source and target data sets, and display the ‘match’ variables from the source file in the left pane and from the active data set in the right pane.The number of selected variables is displayed in the bottom of the list.

  1. Click ‘Finish’to copy with the default settings, or ‘Next’ to change the settings

The following settings can be changed in Steps 3 and 4 of the Wizard.

If the Wizard is followed step-by-step, the summary of ‘what would be copied’ is displayed on Step 5. After pressing the‘Finish’ button, at any point in the wizard, the active data set will have the selected properties applied to it, just as they are in the source data file.

Alternatively, properties can be copied from an opened data set to other data sets if more than one data sets are opened. Select ‘An open data set’ as ‘Source of the properties’ in Step 1, and follow the same steps.

Here, new variables from the source data set will be added to the active data set if ‘Create matching variables in the active data set if they do not already exist’ is ticked in using ‘set properties. All variables may be selected (press <control> + A), or only some variables (hold the <control> key while clicking multiple variable names) can be selected from the source list. In this case, at the bottom of the list in the data set, it will display both (i) matching variables (12 in this example) and (ii) variables to be created (10 in this example).

In the above example, ten new variables will be added into the active data set with the same variable names and properties by copying the properties of all variables from the source data set. Note that the data values were not be copied to the active data set.

SPSS also allows variable properties to be copied from one variable to another in the same data set. For example, in the sample data set, two variables: sex of head of household (HV219) and sex of household member (HV104) are sharing the same codes ‘1=Male’ and ‘2=Female’,and 9 as the missing value. If the codes were entered and missing values have been identified for the head of household (HV219), those properties can be copied to household member (HV104).

To do this, select the third option in ‘Choose the source of the properties’, which is ‘The active data set’ in Step 1 of the Wizard. Then, click a source variable, and click the target variable(s). As usual, the user must press <Control> key while clicking the next variable name(s). After selecting all target variables,click ‘Finish’ to begin copying.

In this option, the user must type-in appropriate variable labels for the target variables.

 

Comments are closed.