Module B2: Introduction to Data Analysis Software

1. Examples of Software for Analysing Household Survey Data to Assist EFA Monitoring

If we search for ‘statistical software packages’ on the web, we can easily find more than 100 examples of software that can be used to analyse statistical data. Some packages can only be run on-line. Some are free or in the public domain, while others are proprietary. Some packages analyse data for a particular field or purpose, while others are general purpose.

1.1      CSPro (Census and Survey Processing System)

CSPro is a public domain statistical package that can be used for entering, editing, tabulating, and mapping census and survey data. It is frequently used by statistical agencies in developing countries, especially for data entry (fixed-width text file format).

It was designed and implemented through a joint effort by the developers of the Integrated Microcomputer Processing System (IMPS) and the Integrated System for Survey Analysis (ISSA) at the United States Census Bureau, Serpro S.A. and Macro International. CSPro was designed to replace both IMPS and ISSA, and can be downloaded from http://www.census.gov/population/international/software/cspro/csprodownload.html.

The current version of CSPro (4.1.002) was released on 12 December 2011. There are four key applications (together with several useful utilities) in the CSPro 4.0 application package:

1)    A Data Entry Application contains a set of forms (screens) and logic that a data entry operator uses to key data into a file. The same application can be used later to add new data or to modify existing data. Within a data entry application, users can create an unlimited number of forms (screens).

2)    A Batch Edit Application can be used to gather information about a data file. It has several run-time features including the ability to:

  • write editing rules that check the validity of values in a variable and consistency between variables/cases and (if necessary) modify data values.
  • make imputations and generate imputation statistics.
  • generate edit reports automatically or create a customised report.
  • create additional variables.
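The kind of editing rules listed above can be illustrated with a small sketch. It is written in Python rather than in CSPro's own logic language, and the field names, the valid age range and the age/grade consistency rule are all assumptions made for this example only.

```python
# Hypothetical household records; field names are illustrative only.
records = [
    {"id": 1, "age": 9, "grade": 3},
    {"id": 2, "age": 130, "grade": 2},    # age outside the assumed valid range
    {"id": 3, "age": 6, "grade": 9},      # grade inconsistent with age
    {"id": 4, "age": 12, "grade": None},  # grade missing
]


def edit_record(rec):
    """Return a list of edit messages for one record (empty if clean)."""
    messages = []
    # Validity rule (assumed): age must lie between 0 and 110.
    if not 0 <= rec["age"] <= 110:
        messages.append("age out of range")
    # Missing-value rule: grade must be present.
    if rec["grade"] is None:
        messages.append("grade missing")
    # Consistency rule (assumed): grade cannot exceed age minus 4.
    elif rec["grade"] > rec["age"] - 4:
        messages.append("grade inconsistent with age")
    return messages


def build_edit_report(recs):
    """Map record id to its edit messages, keeping only records with problems."""
    report = {}
    for rec in recs:
        messages = edit_record(rec)
        if messages:
            report[rec["id"]] = messages
    return report


report = build_edit_report(records)
for rec_id, messages in report.items():
    print(rec_id, "; ".join(messages))
```

A real batch edit would normally go further, for example by imputing a replacement value and counting how often each rule fired.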

3)    A Tabulation Application contains a set of table specifications (structure) and a data dictionary, which may already exist or be newly defined, to describe the data file that is to be tabulated.

This application can cross-tabulate variables and, if applicable, produce map results by geographical area, using both existing variables and new variables created “on the fly”. Output tables can contain selected summary statistics including simple data counts, percentages, means, medians, modes, standard deviations, variances, n-tiles, proportions, minimums, and maximums. Tabulations can be based on values from the data file (as it is) or on weighted values.
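The difference between tabulating the data "as it is" and applying weights can be shown with a minimal Python sketch. The rows, variable names and sampling weights below are invented for illustration and do not come from any survey.

```python
from collections import defaultdict

# Hypothetical survey rows: (sex, attending_school, sampling_weight).
rows = [
    ("female", "yes", 1.5),
    ("female", "no", 0.5),
    ("male", "yes", 1.0),
    ("male", "yes", 2.0),
    ("male", "no", 1.0),
]


def crosstab(data, weighted=True):
    """Cross-tabulate sex by attendance, summing weights or plain counts."""
    table = defaultdict(float)
    for sex, attending, weight in data:
        table[(sex, attending)] += weight if weighted else 1.0
    return dict(table)


def row_percentages(table):
    """Convert each cell to a percentage of its row (first key) total."""
    totals = defaultdict(float)
    for (row_key, _), value in table.items():
        totals[row_key] += value
    return {key: 100.0 * value / totals[key[0]] for key, value in table.items()}


unweighted = crosstab(rows, weighted=False)
weighted = crosstab(rows, weighted=True)
weighted_pct = row_percentages(weighted)

for cell in sorted(weighted):
    print(cell, unweighted[cell], weighted[cell], weighted_pct[cell])
```

In the unweighted table each respondent counts once, whereas in the weighted table each respondent contributes his or her sampling weight, so the two tables can give different percentages.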

4)    A Data Dictionary describes the overall organization of a data file, that is, how data are stored in the file. The Data Dictionary is central to all CSPro applications and must be created for each file that is being used.
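To see why a dictionary is needed before a fixed-width text file can be read, consider the following Python sketch. The layout (variable names, start columns and lengths) is hypothetical, and the structure shown is not the actual CSPro dictionary file format.

```python
# Hypothetical dictionary: variable name -> (start column, length), 1-based.
DICTIONARY = {
    "household_id": (1, 4),
    "age": (5, 3),
    "sex": (8, 1),
}


def parse_line(line, dictionary):
    """Split one fixed-width line into named fields using the dictionary."""
    record = {}
    for name, (start, length) in dictionary.items():
        raw = line[start - 1:start - 1 + length]
        record[name] = raw.strip()  # drop the padding spaces
    return record


# Two illustrative data lines; without the dictionary they are just digits.
lines = ["0001 342", "0002  91"]
records = [parse_line(line, DICTIONARY) for line in lines]

for record in records:
    print(record)
```

The same line of digits would be read quite differently under another layout, which is why every application that touches the file must share one dictionary.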

One of the benefits of CSPro is that it can run on a computer with very basic specifications. The minimum configuration is (i) a 33MHz 486 processor, (ii) 16MB of RAM, (iii) a VGA monitor, and (iv) Microsoft Windows 98SE. It is public domain software, which means it can be downloaded free of charge.

CSPro is the most frequently used software for data entry and initial analyses of data from general surveys and population censuses. Current DHS surveys also use CSPro. However, every data file must have a Data Dictionary, even if it is only being used for simple data analysis such as constructing frequency tables for selected variables. Therefore, it is difficult to analyse a data set that was created using other software (or a data set that has no predefined Data Dictionary in CSPro).

1.2      Epi Info

Epi Info is public domain statistical software for epidemiology that has been developed by the Centers for Disease Control and Prevention (CDC) in Atlanta, Georgia (USA) since 1985. It is designed for the global community of public health practitioners and researchers.

The first version, Epi Info 1, was released in 1985 as an MS-DOS batch file on 5.25″ floppy disks. It operated on the MS-DOS platform until Epi Info 2000 was released for Windows. Starting from Epi Info 2000, data were stored in the Microsoft Access database format, rather than the text file format that was used in the MS-DOS versions. Version 3.5.1 was released on August 13, 2008 and was designed to run on Windows Vista. Since Epi Info 7, the software has been developed as an open-source project. It can be downloaded from http://wwwn.cdc.gov/epiinfo/.

The current versions of Epi Info provide easy form and database construction, data entry, and analysis with epidemiologic statistics, maps, and graphs. The primary applications within Epi Info are:

MakeView   to create forms and questionnaires and automatically create a database to store the data that is collected.

Enter            to enter data into the database using the forms and questionnaires that were created in MakeView.

Analysis       to produce statistical analyses of data, report output and graphs.

EpiMap       to develop GIS maps with survey data overlaid.

Epi Report   to combine Analysis output, Enter data and any data contained in Access or SQL Server, and present it in a professional format. The generated reports can be saved as HTML files for easy distribution or web publishing.

Although “Epi Info” is a CDC trademark, the programs, documentation, and teaching materials are in the public domain and may be freely copied, distributed, and translated. User analysis in 2003 documented 1,000,000 downloads from 180+ countries and its manual and programs have been translated from English into 13 additional languages.

The main benefit of Epi Info is that it integrates support for every step of the survey process, from developing the questionnaire to data analysis and creating custom reports. To begin with, users must develop a questionnaire with Epi Info’s “MakeView”. Based on that questionnaire, users can customise the data entry process, enter data into the database (into screens which were created while developing the questionnaire) and analyse the data. For epidemiological uses such as outbreak investigations, being able to rapidly create an electronic data entry screen and then do immediate analysis on the collected data can save considerable amounts of time compared to using paper surveys.

As such, Epi Info is one of the best software packages for survey developers and researchers, especially those who do epidemiological research/surveys. It is not easy, however, to analyse a data set created using other software.

1.3      Microsoft Excel

Microsoft Excel (full name: Microsoft Office Excel) is a component of the Microsoft Office Suite. It is a spreadsheet application that runs on Windows and Mac OS X operating systems. Excel was first released in 1985 for the Apple Macintosh. The first Windows version was released in November 1987. Since the release of Version 5 in 1993, Microsoft Excel has been the most frequently used spreadsheet application programme.

Key features of Microsoft Excel include calculation, graphing tools, pivot tables (or OLAP Cubes) and a macro programming language in Visual Basic for Applications (VBA). It also has the ability to carry out several database management functions. Microsoft Excel includes support for SQL (Structured Query Language) and Network DDE (Dynamic Data Exchange), which allows data stored in different systems to be exchanged.

Since its 1993 version, Microsoft Excel has supported programming in Microsoft’s Visual Basic for Applications (VBA). VBA, which is based on Visual Basic, adds the ability to automate tasks in Excel and to provide user-defined functions (UDF) for use in worksheets. Programming with VBA allows for spreadsheet manipulation that would be impossible with standard spreadsheet techniques. Programmers may write VBA code directly using the Visual Basic Editor (VBE). Users can also record VBA code that replicates their actions on the spreadsheets, which allows them to automate simple tasks.

Using VBA, a programmer can access a database (or data set) that is placed on a spreadsheet or stored in separate files (created in non-Excel formats). Visual Basic modules can then be written to construct frequency and crosstab tables, calculate various statistics, and carry out transformation, sorting, selection and formatting. The results, whether intermediate or final, can be written back to a spreadsheet or saved in a separate file.

The benefit of Microsoft Excel is that, being a component of the Microsoft Office Suite, it is one of the most frequently used software applications. Many users are familiar with Microsoft Excel, but very few users are familiar with VBA, Pivot Tables and database functions that are essential for analysing household survey data for EFA monitoring.

Microsoft Excel is suitable for finishing statistical outputs that are produced by other software because it allows for table formatting and creation of graphs and charts.

Various reference materials, training courses and free on-line tutorials are available for Excel, covering everything from basic skills to advanced features. Microsoft provides on-line help and user forums as part of Microsoft Office.

One website that provides help and tutorials for Excel is http://excel.tips.net/, which covers a comprehensive range of Excel’s functions and provides help on writing Excel formulas. Another tutorial, aimed at beginners, is available on-line at http://people.usd.edu/~bwjames/tut/excel/.


1.4      PSPP

PSPP is a free, open-source alternative to the proprietary statistics package SPSS. PSPP is used to analyse sample data. It has both a graphical user interface and a conventional command line interface. It is written in C, uses the GNU Scientific Library for its mathematical routines, and Plotutils to generate graphs. PSPP has been distributed since 1998.

PSPP provides basic, but very useful, statistical analysis functions. It can construct frequency and crosstab tables; carry out non-parametric tests, significance tests and reliability tests; fit various linear regression models; perform factor analysis; and compute basic statistics. It also provides some database management features, such as sorting and selecting cases, computing new variables, and recoding into existing and new variables.

Users can select outputs (tables and graphics) in ASCII, PDF, Postscript or HTML formats. Some graphs, such as histograms, pie-charts and np-charts can also be generated. PSPP can open SPSS data files and is able to import data from Gnumeric, OpenDocument, Microsoft Excel spreadsheets, databases, comma-separated text files and ASCII text files. It can save data files in the SPSS ‘portable’ file format (*.por), SPSS ‘system’ file format (*.sav) and ASCII text file format. Some of the libraries used by PSPP can be accessed programmatically; and PSPP-Perl provides an interface to the PSPP libraries.

As mentioned above, PSPP is free software. The installation program file and manuals can be downloaded from the GNU website, http://www.gnu.org/software/pspp/. Users can install and use the software without limitations. PSPP is, however, still in a developmental stage, and its documentation and help system are not very useful for new users.


1.5      SAS (Statistical Analysis System)

SAS is an integrated system of software products from SAS Institute Inc. SAS enables programmers (users) to perform many different kinds of analysis and data management, and to generate output.

In addition, SAS has a range of business software solutions for IT management, human resource management, financial management, business intelligence, customer relationship management and more.

SAS is driven by SAS programs that define a sequence of operations to be performed on data stored in tables. SAS Library Engines and Remote Library Services allow access to data stored in a range of other data formats and on remote computer platforms.

SAS functions via application programming interfaces, in the form of statements and procedures. A SAS program is composed of three major parts, namely: (a) the DATA step, (b) procedure steps, and (c) a macro language.

The DATA step defines the file structure and handles the reading and writing of records and the closing of the file. All other tasks are accomplished by procedures in the procedure steps. Procedures are not restricted to the built-in ones; they also allow extensive customisation, controlled by mini-languages defined within the procedures. SAS also has extensive SQL support, allowing SQL programmers to use the system with little additional knowledge.

The macro programming extensions allow the use of “open code” macros or the interactive matrix language SAS/IML component. Macro code in a SAS program undergoes pre-processing. At runtime, DATA steps are compiled and procedures are interpreted and run in the sequence in which they appear in the SAS program. A SAS program requires the SAS software to run. SAS consists of a number of components, which require separate licenses and installations.

SAS runs on IBM mainframes, Unix machines, OpenVMS Alpha, and Microsoft Windows, and its code is easily portable between these environments. SAS requires extensive programming knowledge, and it is the most comprehensive and expensive statistical analysis software. Users can visit www.sas.com for more information and assistance. It is also recommended to review http://www.ats.ucla.edu/stat/sas/default.htm, where users can learn more about SAS.


1.6      Stata

The name “Stata” is formed from letters of the words “statistics” and “data”. It is a general-purpose statistical software package with a full range of capabilities including data management, statistical analysis, graphics, simulations and custom programming. It is used by many businesses and academic institutions around the world. Most of its users work in research, especially in the fields of economics, sociology, political science, and epidemiology.

Stata was first commercialised in 1985 by StataCorp. A major new version has been released approximately every two years. There are four major builds of each version of Stata:

  • Stata/MP       for multiprocessor computers (including dual-core and multi-core processors)
  • Stata/SE         for large databases
  • Stata/IC          the standard version
  • Small Stata     a smaller, student version that is only available for purchase through a special education license.

Stata primarily uses a command-line interface, but since Stata 8, it also offers a graphical user interface, with menus and dialogue boxes, to make it easier for users to access the built-in commands.

Stata allows only one data set to be open at a time for review and editing in spreadsheet format, and the data set must be closed before other commands are executed. Stata holds the entire data set in memory while it works, which prevents it from being used for extremely large data sets. The data set is always rectangular in format; all variables hold the same number of observations (though some data entries may be missing values).

Stata’s proprietary file formats are platform independent, so users of different operating systems can easily exchange data sets and programs. Stata’s data format has changed over time, although not every Stata release includes a new data set format. Every version of Stata can read all older data set formats, and can write both the current and most recent previous data set formats. Thus, the current Stata release can always open data sets that were created with older versions, but older versions cannot read data sets that were created in a newer version.

Stata can read and write SAS XPORT format data sets and it can import data from ASCII formats (CSV or fixed-width) and spreadsheet formats (including various Microsoft Excel files). A few other econometrics software packages can directly import data in Stata file formats.

One advantage of using Stata is that both data sets and programs are independent of the operating system. Another advantage is that it can be operated with both built-in commands and user-written commands. Several useful commands are available for download from the internet (these command files are called ado-files). Stata’s version control system is designed to give a very high degree of backward compatibility, ensuring that code written for previous releases continues to work in newer versions.

The main disadvantage of Stata is that it is not easy to use. Users require a thorough understanding of Stata’s command line interface and basic commands to manipulate the data. Only those with extensive programming experience can learn to use Stata by themselves. Beginners may require training to learn how to use Stata effectively.

An evaluation version of the Stata software can be downloaded from www.stata.com. More information about Stata is available at http://www.ats.ucla.edu/stat/stata/default.htm.

1.7      SPSS (Statistical Package for Social Sciences)

SPSS is one of the most popular data analysis software packages. It supports various statistical methods and procedures. SPSS was first developed in 1968 at Stanford University for internal use only (see the brief history of SPSS/PASW Statistics in Section 2.1 of this module). In March 2009, the name SPSS was changed to PASW Statistics (Predictive Analytics SoftWare). In July 2009, the company which owned PASW announced that it was being acquired by IBM. As of January 2010, it became “SPSS: An IBM Company”. By October 2010, IBM SPSS was fully integrated into the IBM Corporation.

Recent versions of SPSS Statistics can handle multiple data sets with an almost unlimited number of variables and cases. It allows data and outputs to be imported and exported using a variety of formats including Microsoft Excel and various text formats. Users can operate SPSS through a menu (and dialog box) driven graphical interface as well as command line (syntax) interface.

SPSS is user-friendly. Even beginners can do basic statistical analysis with the software. It offers excellent on-line help, complete users’ manuals and self-learning tutorials. The package supports almost all statistical methods, which allows users to perform basic to advanced analysis on data sets. SPSS has good support for data management and data documentation.

The majority of household surveys are analysed with SPSS, and many final survey data sets are available in SPSS (*.sav) format.

For these reasons, in this module we shall demonstrate how to use SPSS to analyse household survey data for the purpose of EFA monitoring. Later modules will show how Microsoft Excel can be used to format and arrange the outputs from SPSS analysis for presentations and reports.

Users can get an up-to-date basic version of the SPSS software from www.spss.com for evaluation. Tutorials that anyone, from beginners to experts, can use to learn SPSS are available at http://www.ats.ucla.edu/stat/spss/default.htm.

Disclaimer

UNESCO does not recommend any software or vendor over another. SPSS Statistics and Microsoft Excel are only used as ‘example’ software in this module to assist in producing and analysing EFA monitoring indicators using household survey data sets. Users may choose any software package that suits their purpose, expertise and budget. Review and selection of the statistical software are solely based on the limited experience of the author of this module. It does not reflect UNESCO’s view or preference.

Several facts were obtained from the user manuals of the underlying software, and from Wikipedia, the web-based free encyclopaedia.
