Module B3: 2. Data Transformation and Preparation

2.1 Changing, Inserting and Deleting Data, Cases and Variables

In the SPSS Data Editor, it is simple to change the value of a specific cell, or properties of a variable, such as name, type, label, value labels and measurement scale.

2.1.1 Changing the identification (or properties) of a variable

To change the properties of a variable, such as a variable name, select the cell with the variable name that you want to change in ‘Variable View’ and type-in new name. As such, variable properties can be changed in ‘Variable View’.Use caution while changing variable types; if we change a variable from a ‘string’ type to a ‘numeric’ type, all alphanumeric data values will be replaced by missing values (‘.’). Even if we try to change the data type back later, only blanks (zero length string data) will remain. This may happen with some other data types too.

If data values need to be changed, select ‘Data view’, locate the cell and type in the new value, one cell after another, just as we do in a spreadsheet program.

2.1.2 Adding variables or cases to an existing data set:

To add a new variable, such as ‘EdLevel’ to the active data set so we can understand the educational attainment of all household members, we select‘Variable View’ and right-click the row number where we want to insert the new variables. SPSS will insert the variable before the existing variable on that row with the name ‘Var00001’, ‘Var00002’,‘Var00003’, etc.Variable type for a newly created variable is numeric with F8.2format (8 digits, 2 decimal places).There will be no variable label or value labels. The user can input or import the variable attributes for new variables including variable name, type, width and decimal places, variable label and measurement level using the methods described above. Value labels should also be identified wherever it is applicable.

A new variable can be inserted in the ‘Data View’ by clicking the existing variable name that we want to insert a new variable before, then, go to the ‘Variable View’ and change the properties. To delete a variable, click the name of the variable that is to be deleted (while working in the Data View) or click the row number (in the Variable View) and press the ‘Delete’ key.

Inserting cases can be carried out only while working in the ‘Data View’. Select the row (or several rows continuously) where the new cases will be inserted, right-click and select ‘Insert Cases’. Similarly, select case(s) and press the ‘Delete’ key to delete the selected cases. Alternatively, you can use the Clear command in the Edit menu.

  2.2 Computing New Variables

While doing data analysis, we frequently need to create new variables from existing variables.

Example:

The total length of service of primary school teachers can be recorded in months, rather than in years, for better accuracy in annual school censuses. In many cases, however, we need to summarize or to relate with other variables in years. Then, a new variable ‘service in years’ must be computed as ‘service in months’ divided by ‘12’ (service in years = service in months /12).

Case study:

The sample data set extracted from the ‘Bangladesh Demographic and Health Survey 2007’ contains the highest education level (HV106) and highest year of education (HV107) for all household members. However, there is no educational attainment in usual ‘Grade’ or ‘Grade-level’, for example ‘Primary 2’ or ‘Secondary 4’. To study the highest grade-level attended by adult household members (aged 15 and above), a new variable ‘Grade’ must be calculated from two existing variables as:

Grade = HV106 *10 + HV107, for HV106 = 0, 1, 2, 3 and HV107 is not 98

Grade = Missing, if HV106 = 8 (Don’t know) or HV107 = 98 (Don’t know).

To calculate the new variable “Grade”, the ‘Compute Variable’ command is available under the ‘Transform’ menu  in the Data Editor. To create a new variable:

  1. Click ‘Transform’ on main menu bar
  2. Click ‘Compute Variable’ item and the ‘Compute Variable’ window will appear.

  1. Fill-in ‘Target Variable’ name, and optionally, the type and label of new variable can also be set by clicking the  button under the target variable name.
  2. Set the numeric expression of the existing variables together with numbers. SPSS has built-in functions, and operators such as +, – ,>, <.
  3. If we only want the cases that meet certain criteria to be included, we need to press  button located at the lower left corner of the window and fill-in the conditions.
  4. Click ‘OK’ to complete the task.

A new variable, ‘Grade’, has been added in the current data set, at the end of variable list. Although a new variable name was provided, the result variable from the ‘Compute’ command can also take an existing variable name. After creation of a new variable, it is important to thoroughly define it by setting labels, missing values and the measurement level.


2.3     Recoding

RECODE changes, rearranges, or consolidates the values of an existing variable. RECODE can be executed on a value-by-value basis or for a range of values.

Recoding is a common task in data preparation. Sometimes, values (or categories of codes) in a nominal or ordinal variable require regrouping for further analyses. Grouping of single-year population into school-going age groups is, for example, essential to calculate education indicators. Sometimes data, such as location names, is entered in text format, but should be changed into numeric values for easier data analysis. These tasks can be carried out by the following commands:

  1. Automatic Recode.
  2. Recode into Same Variables.
  3. Recode into Different Variables.

2.3.1 Automatic Recode

Automatic recoding is useful for string variables that have a limited number of values, such as male or female; urban, suburban, rural or remote. When the existing categorization of a variable is no longer needed after recoding, ‘Recode into same variables’ option can be selected or select ‘Recode into Different Variables’ to maintain the original variable.

To perform automatic recoding:

  1. Click ‘Transform’ on main menu bar
  2. Click ‘Automatic Recode’, and a new window will appear.

  1. Select one variable and send to the area under ‘Variable à New Name’.
  2. Type appropriate name for the recoded variable in ‘New Name’box.
  3. Click ‘Add New Name’ button; Repeat Steps 3, 4, and 5 for all variables that are being recoded.
  4. Select whether to recode starting from the ‘Lowest value’ or the‘Highest value’.
  5. Select whether to ‘use the same recoding scheme for all (selected) variables’, and whether to ‘treat blank values as user-missing’ or not.
  6.  Click ‘OK’ to complete the task.

Then, two new variables ‘Division’and ‘SES’ will be added to the current data set with the following coding schemes (codes and value labels).

In some cases, more than one variable shares the same values. For example, ‘Sex of head of household (HV219)’ and ‘Sex of household member (HV104)’ may have only two valid values ‘Male’ and ‘Female’.

Similarly, several variables can take just ‘Yes’, ‘No’and ‘non-response’ or ‘missing’ as a valid value. These values are, for example, the only valid values for the variables ‘Usual resident (HV102)’, ‘Slept last night (HV103)’ and ‘Member still in school (HV110)’. To recode such groups of variables, just tick the checkbox of ‘Use the same recoding scheme for all (selected) variables’ in Step 7.

The following example shows the automatic recoding of two variables, HV103 and HV102.

‘Automatic recode’ is a simple and useful way of exploring the newly imported file or for beginners.

2.3.2 Recode into Different Variables

The ‘Recode into Different Variables’ function is the most useful recoding procedure for general users. In this procedure, users can select all the recode options; both old and new variables are maintained in the data set. Before manual recoding, it is important to view the frequency distribution of the variable under study. The variable “Highest education level (HV106)” will be used as an example in this section. The frequency table for the variable HV106 is as following:

Here, six different items: ‘9’, ‘DK’, ‘Higher’, ‘No education, preschool’, ‘Primary’ and ‘Secondary’ are listed as valid values for the HV106 variable. Through the codebook of the DHS Survey, we learn that ‘9’ represents a missing value and ‘DK’ represents ‘Don’t know’. Since the variable under study is ‘educational attainment’, it is only valid for those aged 6 and above. Thus, it is logical to code as following for the population (household members) aged 6 and above:

After creating a new variable with the recode command, all necessary properties must be set for the new variable, such as variable format (type, width and decimal places), value labels and missing values. The new variable can be observed as following.

We use very similar steps to use the ‘Recode into same variable’ function.

2.3.3 Visual Binning

‘Visual Binning’ under ‘Transform’ menu can automatically create new variables based on grouping contiguous values of existing variables into a limited number of distinct categories. Visual Binning can:

  • Create categorical variables from continuous scale variables. For example, a scale variable ‘age’ to create a new categorical variable that contains five-year age groups.
  • Collapse a large number of ordinal categories into a smaller set of categories. For example, collapse the 20 five-year age groups into five groups: 0-19, 20-39, 40-59, 60-79, and 80+.

To conduct visual binning, first select a scale variable (HV105: Age of household members) and follow the steps below:

  1. Click ‘Transform’ on main menu bar.
  2. Click ‘Visual Binning’; a new window will appear.
  3. In the ‘Visual Binning’window:
    1. select the scale variable(s) to bin and move those variables into ‘Variables to Bin’ pane.
    2. click the ‘Continue’ button when you have finished selecting variables.

At this point, SPSS will analyze the selected variables and present a graphical distribution of the variable after binning in the new ‘Visual Binning’ window. Follow these steps in the ‘visual binning’ window:

  1. Enter an appropriate ‘name’ for the binned variable.
  2. Enter a variable ‘label’ for the binned variable.
  3. Click on the ‘Make Cutpoints’ button to define cutting points for the binning. The ‘Make Cutpoints’ window will appear to set cutpoints.
    Cut points can be constructed based on three options: (i) equal width intervals; (ii) equal percentiles based on scanned cases; and (iii) cutpoints at mean and selected standard deviations (1 or 2 or 3 SD) based on the scanned cases.Generally, making cutpoints with equal width intervals is the most suitable and commonly used method for analysing household surveys about education.

In the ‘Make Cutpoints’ window:

  1. Enter ‘4’ as first cutpoint location since the first age group of common five-year interval is 0-4.
  2. Input ‘5’ as the width (or class interval), and the ‘number of cutpoints’ will be filled automatically. There will be 19 in this example.
  3. Click ‘Apply’ and Visual Binning window will appear with set intervals.

Then, in the main ‘Visual Binning’ window:

Click the ‘Make Labels’ button to generate value labels automatically (the user can change these labels later if necessary).

  1. Click ‘OK’ to create a new binned variable called ‘Age’.

As usual, the properties of the newly-binned variable must be checked and changed as necessary.

Then, the following frequency table of the variable ‘Age’ is obtained.

Having checked and edited the data set, set the variable properties, and recoded as necessary, the data set is now ready to start preparation for data analysis. Preparations required before performing any data analysis include:

(1) the prospective outputs should be listed and suitable analytical methods should be identified.

(2) check which outputs can be generated directly from the existing data sets, and which may require further manipulations such as sorting, calculation/creation of new variables (temporary and/or permanent), transformation (coding, grouping, etc.) and creation of new data sets (aggregation, subsetting and merging existing data sets).

Example:The working data set contains data extracted from a household survey with personal records of all household members. The variables include: age, sex, schooling status, and the class/grade that children are currently attending. The purpose of the analysis is to produce “age-specific enrolment rate (ASER) for the children aged 6 to 14 by sex”. In this situation, it is impossible to compute ASFR directly from the working data set since the analyst needs to have a data set with:

(a) total number of children aged 6 to 14 by single year of age by sex [which is denominator];

(b) number of children aged 6 to 14 who are currently attending school by single year of age by sex [which is numerator], before computing age-specific enrolment rate, ASER.

In this situation, it requires:

(a) selection of cases (extracts cases of aged 6-14);

(b) aggregation of personal data to get grouped data by age and sex, that is, counting of all children irrespective whether schooling or not, and of children who are currently attending school, by age and sex; and

(c) calculation of ASER by age and sex.
[Note: The calculation is much easier and simpler if “Custom Tables” option is installed.]

SPSS allows data transformations ranging from as simple as collapsing categories for analysis, to more advanced tasks such as creating new variables based on complex equations and conditional statements. In the next chapter some important additional techniques of data manipulation and transformation will be discussed.

2.4 Selecting Cases

Select Cases provides several methods for selecting a subgroup of cases based on criteria that include variables and complex expressions. Users can also select a random sample of cases.

Being able to select cases is essential whenever we need to analyse a specific subset of data based on pre-set criteria, for example, to study the percentage of ‘out-of-school girls aged 6-14’. To do this:

  1. Click ‘Data’ on main menu bar.
  2. Click ‘Select Cases’, which is the second last item on the list.
  3. The ‘Select Cases’ window will appear, and then select ‘If condition is satisfied’.
  4. Click the ‘If’ button and a new ‘Select Cases: If’ window will appear.
  5. Construct selection statement using variables, operators and functions, and then click ‘Continue’;
  6. Select the outputoption:
    1. Filter out unselected cases.
    2. Copy selected cases to a new data set (provide the new data set name).
    3. Delete unselected cases.
  7. Click the ‘OK’ button and a new Data Editor window will appear with selected cases.

Of the three output options;

Filter out unselected cases’– unselected cases will be marked with cross-signs (X) as shown in the following ‘Data Editor’ window. The unselected cases will not be used in future analyses until running select cases again with Select All Cases option (to retain original data set).

Copy selected cases to a new data set’– this creates a new data set and leaves the current data set intact. Users can switch between the original data set and newly created data set or use both data sets together through syntax.

Delete unselected cases’– this deletes all unselected cases from the current data set. With this option, the original data set cannot be retained. It is, therefore, important to save the original data set before performing this action. The output sub-data set, which only contains the selected cases, should also be saved with an appropriate name as soon as the selection process is completed.

The following cross-tabulation provides the percentage of out-of-school girls aged 6-14 in single year by applying the cross-tabulation procedure only on the selected cases.


2.5     Sorting Cases

SORT CASES re-orders the sequence of cases in the active data set based on the values of one or more variables. Cases can be sorted in either ascending or descending order, or using combinations of ascending and descending order for different variables.

Cases can be sorted in ascending or descending order based on one to all variables in the data set. In the sample data set, households can be sorted by wealth index to observe the characteristics of households in similar wealth status. Moreover, some SPSS commands require pre-sorted data set, for example “aggregate” command requires that the data set is sorted by the breaking variable(s).

Sorting can be carried out through the ‘Sort Cases’ command under the ‘Data’ menu as follows:

  1. Click ‘Data’ on main menu bar.
  2. Click ‘Sort Cases’.The‘Sort Cases’ window will appear.
  3. Select the first key variable and send to the‘Sort by’ pane and set ‘Sort Order’.
    Repeat Step 3 for all key variables in the order of importance.
  4. Click the ‘OK’ button to start sorting.

The following example shows how a data set can be sorted using two variables: ‘Education in single year (HV108)’ in ascending order and ‘Age of head of household (HV220)’ in descending order.


2.6     Rearranging Variables

Relocating of variables does not have any impact on the results of data analyses. However, it makes it easier to decide which variables to use for getting required outputs.

 

Sometimes, the original data set cannot provide the variables in a suitable order, for example, education related variables may spread in several locations. In other situations, linked variables are so far apart that their visual linkage cannot be observed. In such cases, the data may be easier to analyse if the variable (or variables) are either grouped into a new data set or moved to the top of the variable list.

2.6.1 Relocating Variables

To move a variable from its current position to a new position, click the selected variable, then drag-and-drop it at the desired position in the ‘Variable View’ or ‘Data View’. For example, to move the variable, ‘Line number of head of household (HV218)’ to the second position in the list:

  1. Select the variable by clicking on the row number (HV218 at row 6) on Variable View.
  2. Drag and Drop’ it at the desired location(in this example, after the first variable in the list).

A red hairline (see in the following example) indicates where the variable will be inserted when the user releases the mouse button.

2.6.2 Variable Sets

If there are several variables in the data set, it is recommended to define and use ‘Variable Sets’. Define Variable Sets under Utilities menu creates subsets of variables to display in the Data Editor and variable lists in dialog boxes. Defined variable sets are saved with SPSS data files.

A variable set can be defined using any combination of numeric and string variables, and a variable can belong to multiple sets. The order of variables in the set has no effect on the display order of the variables in the Data Editor or of variable lists in dialog boxes.

Two variable sets ‘Education’ and ‘HH_Head’ are defined in the following example with nine variables in the‘Education’ variable set and eight in the other with four common variables.

To create a variable set:

  1. Click ‘Utilities’ on main menu bar.
  2. Click ‘Define Variable Sets’, and a new window will appear.
  3. In the ‘Define Variable Sets’ window, first put in the set name following SPSS naming convention (can be up to 64 bytes long with any characters including blanks).
  4. Select and move variables into the ‘Variables in Set’pane;
  5. Click the ‘Add Set’ button to create the variable set. Define as many sets as needed by repeating Steps 3-5.
  6. Click the ‘Close’ button to complete creation of variable sets.

It is strongly recommended to save the data set with a new name after defining the variable sets. In this example, the data set is saved as ‘BDPR50FL2.sav’.

To use a variable set:

  1. Click ‘Utilities’ on main menu bar.
  2. Click ‘Use Variable Sets’, and a new window with the list of variable sets will appear.

The list of available variable sets includes all variable sets that have been defined, plus two built-in sets:

(i)ALLVARIABLES: contains all variables in the data file, including any new variables created during a session;

(ii)NEWVARIABLES: contains only new variables that were created during the current session;

(iii)Education: the first user-defined variable set containing 9 variables.

(iv)HH_Head: the second user-defined variable set containing 9 variables.

  1. In the ‘Use Variable Sets’ window check the desired variable set(s) and uncheck all others under ‘Select variable sets to apply’.

At least one variable set must be selected. If ALLVARIABLES is selected, checking other sets will have no effect because this set already contains all the variables. In this example, the ‘Education’ variable set is selected.

  1. Click ‘OK’ to complete selection and the following new Data View will appear.
  2. To get all variables back, click ‘Show All Variables’ under Utilities menu.

Comments are closed.