CCRI - IRCS

1911, 1921 Census  

 

Software


Introduction

Computer software was essential to process the data from approximately 20 million images. This section provides an overview of the various software applications developed by the CCRI.

Software was the responsibility of the CCRI Information Technology (IT) team. The core IT team was located at the University of Ottawa, where the central server was housed. In addition, there was an IT representative at each university center to assist with IT issues that might arise locally.

There were three eras of CCRI software development:

1. The first era coincides with the early stages of the project: the IT team was relatively small and, although all members were computer-savvy, only a few were IT specialists by profession. The IT goal was to develop software similar to the programs used by previous census projects. The IT expertise was rooted in Microsoft tools such as Visual Basic and Access. The 1911 census was the focus of the CCRI during this time.

2. The second era, which began in late 2004, was defined by a centralization of IT responsibility to Ottawa. The IT team grew as software professionals were recruited. Innovation became the objective. Microsoft tools took a back seat and a software development environment was built around the Java programming language. The focus of the CCRI during this time was on the 1921-1941 censuses.

3. Mid 2006 ushered in the third era, which was defined by the focus on the challenges of the 1951 census.

The schedules changed radically for the 1951 census. For all earlier censuses, the data was collected in a tabular format: the answer to each question was handwritten by the enumerator in a different column; each row represented one individual. A single schedule typically contained all the responses given by several individuals. In 1951, only a few of the responses were handwritten: the majority of responses were recorded using mark-sense technology. Further, each individual’s responses were recorded on a single schedule. For these reasons, the 1951 digital images were largely incompatible with the software written to process the 1911-1941 censuses. A new set of programs was required to process the images of the 1951 forms.

The IT team developed software to overcome challenges at every key phase of the project including:

1. Sample Point Selection
2. Data Capture
3. Data Cleaning
4. Data Coding
5. Geocoding
6. Data Delivery

The following sections explain the various software solutions developed for each phase. Each phase represents a category of problem that is common to each of the censuses (1911-1951) examined by the CCRI. Though the nature of the problem was essentially the same across censuses, the software solutions put forth came in a form that reflected the experience of the IT group.

The CCRI collected its data from digital images of the census schedules rather than the schedule microfiche traditionally used for census projects. These images, organized by reels, were the primary input to the CCRI software.

1. Sample Point Selection

The process of determining which images to include in the sample is called sample point selection (SPS). Since the CCRI is based on a partial-count sample, as opposed to a complete-count, the project defined year-specific sampling strategies that determined which of the available census schedule images were keyed for a given census.

1911

Two programs were used together to generate the 1911 samples. The first recorded how many dwellings existed in each census sub-district: an operator, using a third-party image viewer, examined each schedule (image) and recorded the observed number of dwellings per sub-district. The second program used these counts and a random number generator to produce a list of dwelling numbers for each sub-district. Each number in the list represented a dwelling that was to be included in the sample for a particular sub-district.
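As a rough sketch, the second program's draw might look like the following Python, where the function name, the sampling fraction, and the fixed seed are all illustrative rather than the CCRI's actual protocol:

```python
import random

def select_dwellings(counts, rate=0.05, seed=42):
    """For each sub-district, draw a random sample of dwelling numbers.

    counts: sub-district id -> number of dwellings observed by the operator.
    rate:   illustrative sampling fraction; the real protocol defined its
            own year-specific rates.
    """
    rng = random.Random(seed)
    sample = {}
    for sub_district, n in counts.items():
        k = max(1, round(n * rate))   # at least one dwelling per sub-district
        sample[sub_district] = sorted(rng.sample(range(1, n + 1), k))
    return sample

points = select_dwellings({"SD-001": 240, "SD-002": 95})
```

Each generated number plays the same role as in the original program: it identifies a dwelling to be included in the sample for that sub-district.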

1921-1941

A new program, which improved the quality of the sample by giving more control to the operator, was developed for sample point selection for 1921-1941. The operator used a specially designed image viewer to browse all the dwellings on a reel. When a target dwelling was encountered (the characteristics that defined a target dwelling were set by the sampling protocol developed for the particular census year), the operator confirmed the legibility of the sample and then used the program's point-and-click interface to highlight the data of the individuals who occupied the dwelling. Time was lost during the 1911 exercise whenever the computer selected a dwelling that had to be replaced at data-entry time because it was illegible. The revised SPS approach for 1921-1941 allowed the operator to quickly substitute an illegible target dwelling with one that was readable. When the operator marked the image in this way, the program used the image coordinates to generate a sample point and saved it to the database.
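The sample point saved by the program can be pictured as a small record tying highlighted image coordinates to a reel and image. The field names and identifiers below are assumptions for illustration, not the CCRI database schema:

```python
from dataclasses import dataclass, field

@dataclass
class SamplePoint:
    """One operator-selected dwelling (field names are illustrative)."""
    reel: str
    image: int
    # (x, y, width, height) of each highlighted row, in image pixels;
    # one highlighted row corresponds to one individual
    highlights: list = field(default_factory=list)

    def add_highlight(self, x, y, w, h):
        self.highlights.append((x, y, w, h))

sp = SamplePoint(reel="R-0001", image=412)   # hypothetical reel/image ids
sp.add_highlight(120, 860, 1600, 42)
```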

1951

The SPS approach had to be revisited for the 1951 census since the 1951 schedules were drastically different from the schedules used in prior years. The previous versions of the SPS software were designed for a tabular census schedule where each row of the form represented the data recorded for one individual. In other words, a single form contained the complete micro data for several individuals. The 1951 schedule was completely different: each form only contained the data for a single individual; further, data was recorded on the front as well as on the back of the schedule.

Indexing Program

A new image viewer was developed to index the 1951 images. As the operator browsed each reel, s/he would examine each image and would use the software to tag it in any of the following scenarios:

A) When the household number recorded on the image differed from the household number that appeared on the previous image.

B) When the individual represented by the image was designated as being the head of household.

C) When the image was damaged or illegible.

D) When the image represented a leader document, which was used to demarcate the beginning of a new census sub-district.
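The four tagging scenarios can be sketched as a simple decision over what the operator sees on screen; the dictionary keys below are assumptions standing in for the operator's judgment:

```python
def tags_for_image(image, previous_household):
    """Return the index tags (scenarios A-D above) that apply to one image."""
    tags = []
    if image.get("household") != previous_household:
        tags.append("A")              # household number changed
    if image.get("head_of_household"):
        tags.append("B")
    if image.get("damaged"):
        tags.append("C")
    if image.get("leader"):
        tags.append("D")              # leader document: new sub-district begins
    return tags

tags = tags_for_image({"household": 8, "head_of_household": True},
                      previous_household=7)
```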

Sampling Program

The 1951 sampling program used the results of the indexing process as input to generate sample points. The household information collected during the indexing phase was used to identify the target households for which sample points were created.

Substitution Program

Additionally, a substitution program was written to select a suitable replacement for sample points that, during the data entry process, were determined to be unusable, typically because the operator determined that individuals were missing from the household (in most cases as a result of an indexing error).

2. Data Capture

The process of transcribing the microdata from the images to a database is called data capture. Data entry operators keyed the micro-data from the images into special computer forms developed by the IT group. These programs performed basic validations on the keyed data before storing it in a DB2 database.

1911

The 1911 data capture program presented a screen that resembled the census form. At the top of the screen was a locator number that identified a particular dwelling on a particular image. The leading characters of the number identified which image must be opened by the operator (using a third-party image viewer). The final digits of the locator number were used to identify the target dwelling among several dwellings that might appear on the image. The micro-data for the individuals in the target dwelling was keyed into the data capture software.
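The locator number's structure can be illustrated with a small parser. The source only states that the leading characters identified the image and the final digits identified the dwelling; the fixed two-digit suffix below is purely an assumption:

```python
def parse_locator(locator, dwelling_digits=2):
    """Split a 1911 locator number into an image id and a dwelling index.

    The two-digit dwelling suffix is an illustrative assumption; the actual
    CCRI locator layout may have differed.
    """
    image_id = locator[:-dwelling_digits]
    dwelling = int(locator[-dwelling_digits:])
    return image_id, dwelling

image_id, dwelling = parse_locator("0041203")   # hypothetical locator number
```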

1921-1941

For the 1921-1941 censuses, the data entry program also used a grid layout for capturing the individual responses. However, instead of presenting a locator number to the operator, the software used the data collected during SPS to present the actual image in a custom viewer that highlighted the target individuals.

1951

The 1951 data capture program was based on a CCRI-built image viewer that featured optical mark recognition capabilities. The viewer not only allowed the operator to view the images of each form belonging to a particular sample point, it also interpreted the markings on the form and automatically completed a data grid with the corresponding answers. The operator could manually specify the correct answer when the computer failed to recognize the correct response.
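The mark-recognition step can be caricatured as picking the darkest answer bubble for each question and deferring to the operator when no bubble is dark enough. The threshold and data layout below are illustrative, not the CCRI implementation:

```python
def read_marks(cells, threshold=0.5):
    """Toy optical mark recognition over pre-measured bubble darknesses.

    cells: {question: {answer: mean darkness in [0, 1]}}. Returns None for a
    question when no bubble is dark enough, so the operator can key the
    answer manually.
    """
    answers = {}
    for question, bubbles in cells.items():
        answer, darkness = max(bubbles.items(), key=lambda kv: kv[1])
        answers[question] = answer if darkness >= threshold else None
    return answers

answers = read_marks({
    "SEX": {"M": 0.91, "F": 0.08},
    "MARITAL": {"S": 0.12, "M": 0.18, "W": 0.09},   # too faint: operator decides
})
```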

SPIDER

The Sample Point Identification, Data Entry and Reporting system (SPIDER) was developed by the CCRI as the launch pad for all its software tools related to creating micro-data from digital images. The program uses the concept of tasks to manage all aspects of the data capture process: indexing, sample point selection, data capture, verification, read-and-edit, cleaning and communication.

• Indexing Task: allows the user to index the images on a reel.
• Selection Task: allows the user to define sample points from the schedules on a particular reel.
• Data Entry Task: allows the user to capture the micro-data for the sample points defined on a particular reel.
• Verification Task: allows the user to review the keying done by another user.
• Read-Edit Task: allows the user to review the micro-data for a sample point. Any corrections made are recorded in a special log that is used to determine the accuracy rate of the data capture operation.
• Cleaning Task: allows the user to view data that was flagged by the cleaning program and, if necessary, to take corrective action.
• Communication Task: allows the operator and his/her supervisor to exchange messages related to a specific sample point.

The main SPIDER window, the Task List, gives the user access to all the tasks that are assigned to him/her. The software provides supervisors with many options to manage tasks. Operators simply OPEN a task to launch the custom designed tool associated with the selected task type.
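The task/tool pairing can be sketched as a small dispatch table: opening a task launches whatever tool is registered for its type. The registry mechanism below is an assumption; only the task names come from the list above:

```python
TOOLS = {}

def tool(task_type):
    """Register a function as the custom tool for one task type."""
    def register(fn):
        TOOLS[task_type] = fn
        return fn
    return register

@tool("Data Entry")
def open_data_entry(task):
    return f"keying reel {task['reel']}"

@tool("Verification")
def open_verification(task):
    return f"verifying reel {task['reel']}"

def open_task(task):
    """OPENing a task launches the tool associated with its type."""
    return TOOLS[task["type"]](task)

result = open_task({"type": "Data Entry", "reel": "R-0001"})  # hypothetical reel id
```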

3. Data Cleaning

Once data capture was complete, the collected data was submitted to a cleaning program that generated an improved, or cleaned, copy of the original verbatim data. The cleaning cycle software verified the data in three steps:

1) Promotion: during data entry an operator could suggest an alternate value when the enumerator’s recorded response was missing, illegible, or suspect. The cleaning software replaced the missing or suspect value with the operator’s suggestion in the ‘cleaned’ copy of the data.

2) Standardization: misspellings, abbreviations, and synonyms of standard answers were replaced with the standard form of the response in the ‘cleaned’ copy of the data. For example, ‘c.o.e’ in the RELIGION column was replaced with “Church of England” in the ‘cleaned’ copy of the data.

3) Validation: the data was subjected to various validations. When an exception was detected, the suspect record was sent back to the originating center along with an error message. For example, if a married individual was less than 12 years old, the data record was flagged and sent back to the originating center for review.
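The three steps can be sketched as a single pass over a verbatim record. The record layout and column names are illustrative assumptions; the ‘c.o.e’ and under-12 examples come from the steps above:

```python
STANDARD = {"c.o.e": "Church of England"}   # Rules-style lookup (illustrative)

def clean(record):
    """Apply promotion, standardization, and validation to one record.

    record: column -> {"value": verbatim response, "suggestion": optional
    operator suggestion}. This layout is an assumption for illustration.
    """
    cleaned, errors = {}, []
    for column, cell in record.items():
        value = cell.get("value")
        # 1) Promotion: use the operator's suggestion for a missing value
        if not value and cell.get("suggestion"):
            value = cell["suggestion"]
        # 2) Standardization: replace known variations with the standard form
        value = STANDARD.get((value or "").lower(), value)
        cleaned[column] = value
    # 3) Validation: flag impossible combinations for review
    if cleaned.get("MARITAL") == "Married" and int(cleaned.get("AGE", 0)) < 12:
        errors.append("married individual under 12")
    return cleaned, errors

cleaned, errors = clean({
    "RELIGION": {"value": "C.O.E"},
    "AGE": {"value": "10"},
    "MARITAL": {"value": "Married"},
})
```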

Rules

The standardization step of the cleaning process was controlled by a set of rules defined by the CCRI-built Rules system. This software allowed the project to define the standard form of the most common responses and specify how non-standard responses should be standardized.

The Rules system presented the cleaning team with all the distinct promoted responses to a given question along with their frequency. The team used the software to declare which of the responses were ‘standard’ (properly spelled and capitalized) and which were variations. All variations were linked to a standard response.
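The workflow can be sketched as: tally the distinct promoted responses, declare one form standard, and link every variation to it. Everything below other than the ‘c.o.e’/“Church of England” example is illustrative:

```python
from collections import Counter

# Distinct promoted responses to RELIGION, with frequency (sample data)
freq = Counter(["Church of England", "c.o.e", "C of E", "Church of England"])

# The team declares "Church of England" standard and links the variations to it
rules = {variant: "Church of England"
         for variant in freq if variant != "Church of England"}

def standardize(response):
    """Return the standard form of a response, or the response itself."""
    return rules.get(response, response)
```
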

4. Data Coding

Coding is the process of associating each response with its corresponding code in a chosen coding scheme. A coding scheme is a set of codes that attempts to map each value in a domain to a unique code. A scheme can be very specific and address a single domain (for example, OCCUPATION) or can be broad and offer codes for several domains. The coding scheme created by the CCRI is an example of the latter, as it offers codes for all the domains (e.g. LANGUAGE, RELIGION, SEX, etc.) covered by the 1911-1951 censuses.

Code Management

The CCRI Code Management software allowed the coding team to manage (add/modify/delete) the coding schemes that were used to code the captured data.
This software served to manipulate a coding scheme and was not concerned with any captured data that might have been mapped using the scheme.

Code Mapping System (CMS)

The CCRI Code Mapping System allowed members of the cleaning team to associate one or more standardized responses to a specific code from a chosen coding scheme. The association of a response to a particular scheme code is called a mapping.

Mappings were an efficient way to store coding information. For example, imagine that there are 4000 nurses in the database. If the occupation “NURSE” must be coded as 12345, then the CCRI created a single mapping to reflect the association rather than modifying the records of the 4000 individuals who held that occupation.
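The economy of mappings can be shown directly: one entry covers every individual with the same response. The code 12345 comes from the example above; the rest is illustrative:

```python
# One mapping per distinct (column, response) pair, not per individual
mappings = {("OCCUPATION", "NURSE"): 12345}

def code_of(column, response):
    """Resolve a standardized response to its scheme code via the mappings."""
    return mappings.get((column, response))

# All 4000 nurse records are coded by this single mapping
codes = [code_of("OCCUPATION", "NURSE") for _ in range(4000)]
```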

This software also provided a Review feature, which allowed other team members to improve the work of the coding team, as well as a Sign-Off function that approvers in the coding team used to accept the reviewed mappings.

5. Geocoding

The CCRI geocoding team defined a series of polygons that represent geographical regions within Canada. Each polygon has its own unique identifier called a ccriuid. The IT group developed a program that used summary files produced by the geocoding team to assign a ccriuid to every individual in the CCRI database.
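The assignment step can be sketched as a lookup from census geography to polygon id; the summary-file columns and the ids below are invented for illustration:

```python
import csv, io

# Illustrative summary file produced by the geocoding team
summary = io.StringIO(
    "district,sub_district,ccriuid\n"
    "12,3,CC-000123\n"
    "12,4,CC-000124\n"
)
lookup = {(row["district"], row["sub_district"]): row["ccriuid"]
          for row in csv.DictReader(summary)}

def assign_ccriuid(individual):
    """Attach the polygon id matching the individual's census geography."""
    individual["ccriuid"] = lookup.get(
        (individual["district"], individual["sub_district"]))
    return individual

person = assign_ccriuid({"district": "12", "sub_district": "4"})
```
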

6. Data Delivery

The CCRI data will be made available to researchers via an extract file since the Research Data Centers managed by Statistics Canada do not currently support DB2 database files.

Extraction

The Extract program produces a flat file of coded census micro data for a specific census year. The data in the file is stored in a common format that is readable by most statistical analysis tools such as SPSS and SAS.

The two main functions of the extract program are selection and coding. First, since 1921-1951 census data is protected by privacy legislation, the extract program suppresses all responses that might be used to identify an individual. Second, the codeable responses that are extracted are coded before they are written to the flat file: the extract program uses the mappings created by the CMS sub-system to convert the textual responses to their equivalent codes. Certain extracted variables are not codeable (for example, the ccriuid) and are written as-is to the extract file.
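Selection and coding can be sketched together for one record; the suppressed columns and the single mapping below are illustrative (12345 reuses the earlier NURSE example):

```python
SUPPRESSED = {"NAME", "ADDRESS"}               # identifying variables (assumed)
CODES = {("OCCUPATION", "NURSE"): "12345"}     # CMS-style mappings (assumed)
NOT_CODEABLE = {"ccriuid"}                     # written as-is

def extract_row(record):
    """Selection then coding for one record; returns the flat-file fields."""
    fields = []
    for column, value in record.items():
        if column in SUPPRESSED:
            continue                           # privacy: drop identifying data
        if column in NOT_CODEABLE:
            fields.append(value)
        else:
            fields.append(CODES.get((column, value), value))
    return fields

row = extract_row({"NAME": "J. Smith", "OCCUPATION": "NURSE",
                   "ccriuid": "CC-000124"})
```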

Derivation

In addition to the coded micro data, the extract file contains, for every individual, a number of variables that were derived (generated) from the keyed data. These derived variables are keys and sequence numbers that are used to structure the contents of the extract file. The IT group wrote a program to generate these variables and store them in the database for retrieval by the extract program.

Conclusion

Computer software was essential to the success of the CCRI. The project developed an integrated set of tools to process the digital images of census schedules. The micro data creation process was made more efficient by using intelligent image viewers to link related phases of the data capture process. The CCRI’s SPIDER software is an innovative tool that greatly streamlines the collection of census data from digital images. The team also developed software to sample images from the available reels and to clean and distribute the captured data.