2. Making Analyses and Presenting Outputs in MS Excel
2.1 Importing SPSS data sets into Microsoft Excel
To read SPSS data files (*.sav) directly into applications that support Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC), the SPSS data file driver is required. SPSS itself supports ODBC in the Database Wizard, providing the ability to use Structured Query Language (SQL) while reading SAV data files.
SPSS ‘Data Access Pack (DAP)’ contains data file drivers that are useful for accessing other types of databases from within SPSS, and also enable SPSS data files to be accessed from other database software. DAP can be downloaded from the SPSS Website. A version of DAP for Windows, ‘SDAPWin32_5.3_SP2.exe’ (file size: 36,624KB) is provided in the training CD.
After installing DAP, a new program group called ‘SPSS Inc OEM Connect and Connect XE for ODBC 5.3’ will be created in the ‘Start Menu Programs’. Click ‘ODBC Administrator’, and follow the steps to open SPSS data files (*.sav) from the applications that have ODBC capabilities:
- Click the ‘File DSN’ tab.
- Click the ‘Add’ button to add a new data source.
The ‘Create New Data Source’ user dialogue box will appear. All available drivers that have been installed on the computer will be listed.
- Select ‘SPSS Inc. 32-Bit Data Driver (*.sav)’.
- Click ‘Next’.
A dialogue box will prompt you to enter a new Database System Name (DSN).
- Type-in an appropriate DSN name (‘SPSS-Training’ in this example).
- Click ‘Next’.
- A ‘Create New Data Source’ dialogue will provide a summary of information about the current settings. If these options are correct, click ‘Finish’ to create a ‘file DSN’.
At this point the program will request a location of the SPSS (PASW) Statistics data files.
- In this example type-in: ‘c:\….\My Documents\SPSS Training\Sample’ where all sample data sets are stored, and Click ‘OK’.
- Click ‘OK’ again to complete and then exit from ‘ODBC Data Source Administrator’.
After creating the new ODBC data source, the newly defined file DSN name ‘SPSS-Training’, will be listed in all Windows applications that have ODBC capabilities. Any SPSS data files (*.sav) that are located in the specified folder can now be accessed from other applications, and can retrieve full data set through ‘existing ODBC connections’ or partially through ‘Microsoft Query’.
When clicking ‘Existing Connections’ under the ‘Data’ menu in Office Excel 2007, ‘SPSS-Training’ will be displayed as an external data source for Excel (see ‘A’ in the following exhibit). By selecting this connection, one can retrieve any data set (whole data set) from the list.
Similarly, when clicking ‘From Other Sources’ and selecting ‘From Microsoft Query’, one can see the ‘SPSS-Training’ as a data source (see ‘B’), and by following the Wizard, users can retrieve part of a data set; only cases which satisfied set conditions and only the selected variables will be retrieved.
In summary, follow the steps below to import a complete SPSS data set into Excel 2007:
- Click ‘Data’ tab.
- Click the ‘Existing Connections’ button to open the ‘Existing Connections’ dialog box.
- Select ‘SPSS-Training’ from the list of available connections.
- Click ‘Open’ and a complete list of SPSS data sets in the specified folder will be displayed as ‘Tables’.
- Select the data set (by clicking on the name) and click the ‘OK’ button.
- In the import data window, select where the imported data, in the ‘Existing worksheet’ (active worksheet) or in a ‘New worksheet’. If the ‘Existing worksheet’ is selected, one can define the place to import data (default is $A$1).
- Click ‘OK’ to start the import process. This can take a few minutes.
At the end of import process, the SPSS data set will be placed on the specified Excel worksheet with the name like ‘Table_SPSS_Training’. Excel will treat it as a native Excel ‘Database Table’.
In this example, the file is saved with the name ‘Excel2.xlsx’. When opening the Excel file that contains an imported database, Excel will issue a ‘Security Warning’ with the message ‘Data connections have been disabled’ and an ‘Option’ tab. If the imported data needs to be updated from the source SPSS data set, or needs to be imported from another data set, the user must enable the data connection. Otherwise, the user can leave the data connection disabled.
In the Excel worksheet, by enabling the ‘Autofilter’ we can display all the variable names in the top row. We can use the ‘Autofilter’ feature to select cases that meet the criteria we specify. This can help us check for invalid entries. The ‘Autofilter’ can be turned on and off by clicking on the filter button.
To select the cases for all the children who are six years old, we can click the down arrow sign next to ‘HV105’, clear the tick next to ‘select all’ (to unselect all), then tick the box next to ‘6’ and click ‘OK’. In the following exhibit, the bottom left-hand corner of the status bar shows there are a total of 53,413 records (or cases) in the database. After applying the filter, however, only 1,302 records with children who are six years old are displayed.
If the filter is also applied to the variable HV104 (sex) and the cases are filtered to only display those records that have the value1 (Male), 656 records are now displayed (for boys who are six years old).
If the entire worksheet is selected and copied, and then pasted into a new worksheet while the filter is active, only the filtered records (or unhidden rows) will be pasted in new worksheet, then, unwanted variables can be selected and deleted column by column to clean up the Excel database. The final result is exactly the same as if we had imported the records through ‘Microsoft Queries’, but the filter is easier for most users. Microsoft Queries can be complicated for those who are not used to manipulating databases (see the steps in the following example).
2.2 Creating Frequency and Crosstab Tables
The Excel function ‘FREQUENCY’ can be used to create entire frequency tables from a range of cells or from a variable in a database table., The function, ‘COUNTIFS’ can be used to get the appropriate value for a cell of a frequency table or crosstab table.
Using the FREQUENCY Function
‘FREQUENCY’ is an MS Excel Worksheet function. The function counts how often individual values occur within a range of values, and returns the counts as a vertical array of numbers. FREQUENCY can be used, for example, to count the number of males and females among the household members. Because FREQUENCY returns an array, it must be entered as an array formula as explained below.
The following steps are used to construct a table to present the sex distribution of household members, both in absolute number and percentage distribution using the FREQUENCY function. For this example, theHV104 (Sex of household member) variable will be used with the codes ‘1=Male’ and ‘2=Female’(these data were obtained from the ‘SPSS_Training’ database).
- Prepare the table structure, formulas and ‘bin’ array as follow.
- Select cell ‘B3’ and type in ‘=FREQUENCY(SPSS_Training[HV104],$G$3:$G$4)’.
- Select the range ‘B3:B4’.
- Press ‘F2’ to get into formula editing mode, and press ‘<Ctrl><Shift>ENTER’ to reenter formula as an array formula.
- If necessary, set the display formats of the number cells and the table.
Using COUNTIF or COUNTIFS Function
The frequency table presented above can also be constructed using the EXCEL Count Functions COUNTIF and COUNTIFS.
- Prepare the table structure, formulas and ‘codes’ as in the following exhibit.
- Select cell ‘B3’, and type in ‘=COUNTIF(SPSS_Training[HV104],G3)’.
- Copy ‘B3’ and paste at ‘B4’.
- If necessary, adjust the display formats of the number cells and the table.
Note: The formula,‘=COUNTIFS(SPSS_Training[HV104],G3)’ can also be used in this example. COUNTIF allows only one condition while COUNTIFS can be used with multiple conditions.
Using COUNTIFS Function to construct a crosstab table
Although the ‘FREQUENCY’ function cannot be used to construct a crosstab table, the ‘COUNTIFS’ function can be used to get the number value of each and every cell of the table. The following example shows how to construct a crosstab table of educational attainment (HV109) by sex of household members (HV104) for the population aged 15-24 (Age: HV105):
- Prepare the table structure, formulas and ‘codes’ for both variables.
- Select cell ‘B5’, and type:
Here, the first COUNTIFS statement counts the population ‘aged 14 and above’ by education level and sex. The second COUNTIFS counts the population ‘aged 24 and above’ with the same characteristics. The value of the difference, therefore, represents the number of cases with respondents who are ‘aged 15-24’.
- Copy ‘B5’ and paste to the range‘B5:C11’.
- Complete the formulas, adjust the cell formats as necessary to obtain the following output table.
Although frequency and crosstab tables such as these can be constructed in Microsoft Office Excel, it is complicated to construct such tables if the sampling procedure requires ‘weighting’. In this case, it is better to construct the tables in SPSS with ‘weight on’ and then export the output to Microsoft Excel for additional processing and presentation.
2.3 PivotTables (OLAP Cubes)
Unweighted frequency and crosstab tables with multi-layers, which are useful in preliminary analyses of household survey data, can be constructed in Microsoft Excel using PivotTable techniques. A PivotTable is an interactive way to quickly summarize large amounts of data, to conduct in-depth analysis and to answer unanticipated questions about the data. Pivot Tables especially designed to:
- Provide user-friendly ways of querying large amounts of data.
- Subtotal and aggregate numeric data, summarize data by categories and subcategories, and create custom calculations and formulas.
- Expand and collapse levels of data to focus on the results, and drill down to details from the summary data to investigate areas of interest.
- Move rows to column or columns to rows to see different summaries of the source data;
- Filter, sort, group, and conditionally format useful and interesting subsets of data to allow users and analysts to focus on the most important information.
- Present concise, attractive, and annotated online or printed reports.
In a PivotTable, each column in the source data (or database) becomes a PivotTable field (a ‘field’ in Excel is equivalent to a ‘variable’ in SPSS) that summarizes multiple rows of information. A value field provides the values to be summarized. By default, data from the variables in the ‘Values’ area summarize the underlying source data in the PivotTable using the SUM function for the numeric variables, and the COUNT function for the text (string) variables.
To create a PivotTable, first, define its source data, specify a location in the workbook or the database table, and then lay out the fields as follows:
- Select the sheet with imported database and click ‘Insert’ tab in the main menu.
- Click ‘PivotTable’ button to open the ‘Create PivotTable’ dialog box.
- Since the active worksheet contains the imported ‘SPSS_Training’ database table, it will appear automatically in the ‘Table/Range’ selection box. Users can, however, change the data source to another table or to a specific range (e.g., A1:C2000).
- Select where the PivotTable will be placed, either in a ‘New Worksheet’ or an ‘Existing Worksheet’. If an ‘Existing Worksheet’ is selected, the user should provide the first cell address. In this example, just leave this option as default ‘New Worksheet’.
- Click ‘OK’ to create a new worksheet with ‘PivotTable creation tools’.
A new worksheet like the one, which has the necessary tools to create a PivotTable, will be created
A set of tools, which are used to create, edit and enhance the PivotTable, will be available in the new worksheet.
- From ‘PivotTable Field List’ select variables (or fields) and drag and drop to:
(a) Values à the variables to make actual summarization (count or sum or etc.)
(b) Row Labels à the variables that will be displayed on the rows (these can be nested)
(c) Column Labels à the variables that will be displayed on the columns (these can be nested)
(d) Report Filter à the variables to be used for filtering/subsetting the database;
When a variable is dragged and dropped into a box, the opening sign of a PivotTable on the worksheet will be replaced with an actual PivotTable with default settings.
In this section, we demonstrate the construction of a ‘PivotTable’ by creating a crosstab table of ‘Educational attainment by Sex for Population Aged 15-24’.
To do this, first, define which variables (or fields) should be placed in the ‘value, row, column or filter’ boxes, to create the required table. In this example, we will use variable HV109 (educational attainment) as the key variable and to display the education levels in the rows.
For this, drag HV109 from the PivotTable filed list and drop it into both:
(a) the ‘Values’ box (to count how many persons in each category), and
(b) the ‘Row Labels’ box (to display education levels in rows).
Excel will create the following PivotTable that shows the ‘frequency of HV109’.
The items displayed in rows can also be selected. For example, eight items (0, 1, 2, 3, 4, 5, 8, and blank) are displayed in cells A4 through A11. Since the code ‘8’ represents ‘unknown’ and ‘(blank)’ is simply ‘missing value’, these two items should not be displayed in the frequency table. To do this, click on the dropdown next to ‘Row Labels’ and uncheck the box next to ‘(blank)’ and click ‘OK’.
Here, ‘value labels’ can be directly typed into the PivotTable; new labels will replace the defaults. For example, the column label‘1’ can be replaced with ‘Male’ and ‘2’with ‘Female’. We will, however, only make these changes while finalizing the PivotTable.
The current PivotTable represents entire household population irrespective of age, but the requirement is just for the ‘population aged 15-24’.To fulfil this requirement, the cases must be filtered by ‘age’. To do this, variable HV105 (age) must be placed in the ‘filter’ box. This will, however, have no affect yet because there is no filter options set. To set filter options, click on the ‘dropdown’ icon next to ‘(All)’in cell B2, then, tick ‘Select Multiple Items’ checkbox. Only values for ages that are between 15 and 24 (inclusive) should remain ticked; clear the ticks from all other values.
Above example shows the PivotTable after improving captions (value labels) and adjusting column widths to suit the data. Once complete, the whole PivotTable, or any of its parts, can be copied and used in other worksheets, documents and presentations.
PivotTables are useful if multiple tables with the same structure are required for different groups (e.g. for different ages), or presenting the same table with only selected rows and/or columns. For example, we can create the same table for adults (aged 15+) by clicking the dropdown icon next to ‘(Multiple Items)’ in Cell B2. Tick ‘(All)’, and then clear ticks next to ‘0’, ‘1’, ‘2’, …, ‘14’ (see A). Similarly, to create a table for all adults who have ‘up to complete primary’ education, click the dropdown icon next to ‘Row Labels’ and select only the first three categories (see B).
As we can see in these examples, the PivotTable is a user-friendly, powerful and efficient tool for analysing household survey data, especially for the surveys that apply ‘self-weighting’ sample design.
2.4 Drawing Pivot Charts
PivotChart provides a graphical representation of the data in a Pivot Table. The layout and data that are displayed in a PivotChart can be easily changed, just as in a PivotTable. A PivotChart always has an associated PivotTable that uses a corresponding layout. The PivotTable and PivotChart’s fields are linked to each other. When changing the position of a field in the PivotTable, the corresponding field in the PivotChart also moves.
Filter field: A field that is used to filter data by specific items. In the example, the ‘age’ field displays data for all ages. To display data for a single age or selected ages, click the drop-down arrow next to (All) and then select an age(number) or multiple ages (numbers).
Values field: A field from the underlying source data that provides values to compare or measure. Depending on the source data, the summary function can be changed to Average, Count, Product, or other types of calculation.
Series field: A field that is assigned to a series orientation in a PivotChart. The items in the field provide the individual data series. In a chart, series are represented in the legend.
Item: Items represent the unique entries in a column or row field, and appear in the drop-down lists for report filter fields, category fields, and series fields. Items in a category field appear as the labels on the category axis of the chart. Items in a series field are listed in the legend and provide the names of the individual data series.
Category field: A field from the source data that is assigned to a category orientation in a PivotChart report. It provides the individual categories for which data points are charted. In a chart, categories usually appear on the x-axis, or horizontal axis, of the chart.
A PivotChart can be created automatically when creating a PivotTable or can be created from an existing PivotTable. To create a PivotChart from an existing PivotTable, follow these steps:
- Select any place (cell) on the existing PivotTable, two new menu items ‘Options’ and ‘Design’ will appear under ‘PivotTable Tools’ group in the main menu.
- Click ‘PivotChart’ under the ‘Options’ tab to open the ‘Insert Chart’ dialog box.
- Choose ‘Chart Type’ from the ‘Insert Chart’ dialog box.
- Click ‘OK’ to create a basic PivotChart together with a ‘PivotChart Filter Pane’.
The PivotChart, which is created automatically, is a ‘draft’. Notice that there is no chart title. A title must be specified using the tools which are available when clicking on an active PivotChart.
For example, to set the title of the chart as ‘Education Level by Sex (Aged 15+)’, click on the ‘Chart Title’ under the ‘Layout’ command and select the third option: ‘Above Chart’. After performing a few other make-ups, such as moving the legend, changing the design and displaying border lines for the plot area, the following PivotChart will be generated and is ready to use.
Another useful adjustment, which applies to both PivotTables and PivotCharts, is to display the ‘values’ in percentages, rather than absolute numbers. To review the percentage distribution of education level by sex for adults:
- Click on the ‘dropdown’ of the variable in the ‘value’ area.
- Select ‘Value Field Settings…’.
- Select ‘Show values as’ tab in the ‘Value Field Settings’ dialog box.
- Select ‘% of column’ in ‘Show values as’ dropdown list.
- Click ‘OK’.
The following table and chart will be generated after adjusting display formats, especially for the number of decimal places in percentages.
2.5 Elaborating Outputs for Better Presentation
Although the PivotTable and PivotChart in MS Excel are user-friendly and efficient tools that can be used to present household survey data in both tabular and graphical forms, the SPSS provides more methods for analysing data using more advanced statistical techniques, including the ability to work with weights. Many people are, however, more familiar with MS Excel and it is a useful tool for formatting and presenting data. SPSS and Excel can, therefore, be used to complement each other. Data can be analysed in SPSS; the output can be formatted and presented in MS Excel.
In this section, we use SPSS to estimate the school-age population and the number of children who are currently attending school by applying weights to the calculations. We will then use MS Excel to calculate age-specific enrolment rates and improve presentation of the data.
Basic of Weighting
In a town with two wards, there are 100 children aged 6-10 in Ward-1 and 50 such children in Ward-2. A survey on ‘schooling status’ was conducted by selecting 25 children from Ward-1 and 20 children from Ward-2. It was found out that 5 children (out of 25) from Ward-1 and 6 children (out of 20) from Ward-2 were not currently in school. Therefore, the percentage of out-of-school children (POS) can be estimated as:
POS (Ward-1) = 5 / 25 x 100 = 20.0%
POS (Ward-2) = 6 / 20 x 100 = 30.0%, and
Percentage of out-of-school children in the town can be estimated as:
POS (Ward 1+2) = 11 / 45 x 100 = 24.4%. …………. (1)
POS (Ward 1+2) = (20.0%+30.0%) / 2 = 25.0%. …………. (2)
Although the percentage of out-of-school children by Ward is representative of the number of out-of-school children in each individual Ward, the above estimates for the entire town is incorrect. This is because the samples are not ‘self-weighting’; the allocation of samples (sample sizes) are unbalanced between the two Wards. The sampling fraction for Ward-1 is 25 / 100 or 25.0% while the sampling fraction of Ward-2 is 20 / 50 or 40%. In other ward, a child in the sample from Ward-1 represents 4 children while a sample child from Ward-2 represents just 2.5 children.
To correctly estimate the number of out-of-school children for the town, we should apply ‘weighting’ to the calculations.
Since the POS (Ward-1) is 20.0%, we can estimate there will be 20 out-of-school children (20.0% x 100) in Ward-1, and 15 out-of-school children (30.0% x 50) in Ward-2.
We can estimate, therefore that there are 35 out-of-school children out of a population of 150 children aged 6-10 in the town. The POS for the Town is (35 / 150 x 100 = 23.3%).
Another way of calculating this estimate of the number of out-of-school children in Ward-1 and Ward-2 is to multiply the number of out-of-school children that were counted in the sample by the sample weight. In Ward-1, 5 children were counted, and since each child in the sample represents 4 children in the population, we estimate there are 20 out-of-school children in Ward-1 (5 x 4.0 = 20). In Ward-2, 6 children out-of-school children were counted, but since each child in the sample represents 2.5 children in the population we estimate there are 15 out-of-school children in Ward-2 (6 x 2.5 = 15). These numbers, 4.0 and 2.5 are known as the ‘sample weight’. Sample weights are normally provided with the data sets.
In SPSS, it is easy to apply weights if it is provided in the data set:
- Click ‘Data’ on the main menu.
- Click ‘Weight Cases…’. The ‘Weight Cases’ dialog box will open.
- In ‘Weight Cases’ dialog box, set ‘Weight cases by’.
- Select the variable representing the ‘weight’ (in this example, the weighting variable is HV005 – Sample weight in the DHS data set). Send this variable to the ‘Frequency variable’ input box.
- Click ‘OK’ to complete weighting process.
To stop using weighting, select ‘Do not weight cases’ in Step 3 and click OK’.
The following two tables show estimates for the population of children who are aged 6-10 by sex with and without weighting.
We can observe the differences due to the weighting in the percentage distribution of population by age and sex.
Similarly, the following tables present weighted and unweighted estimates of the number of children currently attending school (HV110 –Member still in school) by age and sex.
From the two sets of tables above, we can use MS Excel to calculate the ‘proportion of children currently attending school by age and sex’ or ‘percentage of out-of-school children by age and sex’.
Since it is easier to export outputs from the SPSS Viewer, first clear unnecessary outputs such as logs, notes and the case processing summary, and then export to Excel:
- Click ‘File’ on the main menu.
- Click ‘Export…’ and the ‘Export Output’ dialog box will open.
- In the ‘Export Output’ dialog box, set:
a) ‘All’ in ‘Objects to Export’.b) ‘Excel (*.xls)’ in ‘Document Type’.c) Provide ‘File Name’ with folder path.d) Click ‘OK’ to begin exporting outputs to Excel.
At the end of this process, an Excel file called ‘POS.xls’ with four cross-tabulation tables will be created in the specified folder as followings.
The ‘percentage of children currently attending school by age and sex’ can be calculated in the third and fourth tables by editing the captions of columns O-Q, and entering a formula to perform the calculation of ‘dividing children in school by total number of children of respective age and sex’. The result of these edits, along with some other changes, such as relocating the fourth output table beside the third one for easier comparison between unweighted and weighted calculations, can be observed in the following exhibit.
To demonstrate the visualization of data using charts, the percentage of children in school by age (or Age-Specific Enrolment Rate) will be presented using charts that are appropriate for the data. In this case, a ‘3-D Clustered Column’ and ‘Line’ charts will be used.
To create a ‘3-D Clustered Column Chart’:
- On the table, (a) select Cell ‘J24’, unmerge and type ‘Age’ into Cell ‘K26’. Similarly, (b) select Cell ‘J32’, unmerge and type ‘6-10’ to Cell ‘J32’.
- Select the ‘Data Source’ to create chart: age (K27:K31 – X-axis), percentage of children currently attending school for male (O26:O32 – Series 1) and for female (P26:P32 – Series 2).
- Click ‘Insert’ on the main menu.
- Click ‘Column’ to see the list of available Column Charts.
When a user places mouse on the ‘Column’, MS Excel displays useful information about the chart type, such as ‘Column charts are used to compare values across categories’. MS Excel displays similar information about other chart types too.
- Click the ‘3-D Clustered Column’ icon, which is the first one in the‘3-D Column’ group. The following chart will be displayed.
- The next step is to finalize the chart in Excel:
- Click on the chart, and click again on ‘Layout’ under ‘Chart Tools’.
- Click ‘Chart Title’, select ‘Above Chart’ to insert a space for chart title, and type ‘Percentage of Children Currently Attending School by Age and Sex (Weighted)’ into that space.
- Click ‘Axis Titles’, set ‘Primary Horizontal Axis Title’ to appear ‘Title Below Axis’, and type ‘Age’ into the space.
- Changing the location of legend (just select, drag and drop at new location).
- Setting the gap width between items (select one series, right-click to get pop-up menu, click ‘Format data series’, and set ‘Gap width/depth’).
- To change the series colour (select one series, right-click to display the pop-up menu, click ‘Format data series’, and set colour in ‘Fill’);
- To format any … (select that item, right-click to get pop-up menu and set).
- Move or resize the chart, chart title, legend or other elements.
The same procedure can be used to create a line graph. Since line charts are normally used to display the trends, over time or age, we will use different data to construct our line chart. In this case, select the data range to cover ages 6 to 10, but not total (aged 6-10).
We want to show the difference between male school attendance patterns and female school attendance patterns, but the difference is not clear with the current scale because it starts at 0%. We can adjust the scale to be from 60% to 100%, which allows us to clearly see the difference between male and female attendance patterns as following: