CCRI - IRCS

1911, 1921 Census  

 

Standardization and Code Mapping Practices


Standardization

Standardization followed the data capture process. Its purpose was to correct spelling errors, abbreviations and numerical inconsistencies in the verbatim responses, and to translate French responses into English. For example, in the Marital Status column, a value of “Nevr ben married” would be cleaned to “Never been married.” While standardization was intended to reduce the number of distinct values so that mapping would be easier to complete, it was also important to maintain the integrity of the verbatim responses. As a result, “Nevr ben married” was not cleaned to “Unmarried.” Four coders took part in the standardization process. Various software tools developed by the CCRI were used to standardize and code map the data.

In each column, there were many examples of the same response spelled different ways. For example, in the Language column, the response “English” was spelled “Eglish,” “Englis,” and “Enlish.” Abbreviations were also used, such as “Eng” and “Engl” for “English.” There were also cases where a capital letter was used in one instance and not in the next. For example, responses in the Country column included “Canada,” “canada,” and “CanadA.” The software interpreted these as distinct values, so our purpose was to clean them to the same value. In some cases, standardizing reduced the number of distinct values from several thousand to several hundred, facilitating the code mapping process.
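
The collapsing of variants described above can be sketched in Python. This is an illustration only, not the CCRI software: the `VARIANTS` table and the `standardize` function are hypothetical names, and the table holds only the examples quoted in this document.

```python
# Hypothetical lookup of observed variants -> standardized value.
# Keys are stored in lower case so that differences in capitalization
# ("canada", "CanadA") collapse before the lookup.
VARIANTS = {
    "eglish": "English",
    "englis": "English",
    "enlish": "English",
    "eng": "English",
    "engl": "English",
    "english": "English",
    "canada": "Canada",
}


def standardize(response: str) -> str:
    """Return the standardized value, or the trimmed response if it is unknown."""
    cleaned = response.strip()
    return VARIANTS.get(cleaned.lower(), cleaned)


responses = ["Canada", "canada", "CanadA", "Eglish", "Eng", "English"]
distinct_before = len(set(responses))
distinct_after = len({standardize(r) for r in responses})
```

Unknown responses are returned unchanged rather than guessed at, which reflects the concern above for the integrity of the verbatim responses.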

In addition to errors in spelling and abbreviations, standardization removed formatting inconsistencies in numerical responses. At times, a comma was used in place of a decimal point. In other cases a space, alpha value, or symbol was either in the wrong place or unnecessarily recorded. For example, in the “Annual Earnings” column, the value “$500,69” was cleaned to “500.69.” The software program was designed to recognise this standard numeric format and to code these values automatically, saving time.
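
A minimal sketch of this kind of numeric clean-up is given below. The function name `normalize_amount` and its rules are assumptions made for illustration (in particular, treating a comma followed by exactly two final digits as a decimal point); the document does not describe the rules the CCRI tools actually applied.

```python
import re
from typing import Optional


def normalize_amount(raw: str) -> Optional[str]:
    """Return a plain numeric string such as '500.69', or None if none can be recovered."""
    # Drop currency symbols, spaces and stray letters; keep digits and separators.
    kept = re.sub(r"[^0-9.,]", "", raw)
    # Assumption: a comma followed by exactly two final digits stands for a decimal point.
    kept = re.sub(r",(\d{2})$", r".\1", kept)
    # Assumption: any remaining commas are thousands separators and can be removed.
    kept = kept.replace(",", "")
    if re.fullmatch(r"\d+(\.\d+)?", kept):
        return kept
    return None
```

Under these assumptions, `normalize_amount("$500,69")` returns `"500.69"`, and a response with no digits at all returns `None` so that it can be handled separately.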

Common examples of standardization:

Correction of Spelling/Formatting Errors
Farmyr, Afrmer, Famrer, farmer, farmr, etc., corrected to the standardized value “Farmer.”
$509,45 was formatted to the standardized value “509.45.”

Standardization of Responses
Conversion/replacement of COE, C of Eng, English Church, Prot COE, Eglise D’Angleterre, etc., with the standardized value “Church of England.”

Translation of Responses
French responses were translated to English in all cases except when no single English meaning for a verbatim French response could be established. For example, “Belle Mère” (as a response in the Relationship to Household Head column) can mean either mother-in-law or step-mother, and so was left as “Belle Mère.”

Code Mapping

Code mapping was a three-step process. In the first step, four University of Ottawa staff members mapped values to codes, and one additional staff member translated the code descriptions into French. In the second step, various experts reviewed the mapped values and, where a mapping was incorrect, suggested alternate codes. In the third step, once the review was complete, senior CCRI members examined all of the codes suggested for a value and, based on this examination, selected and approved the most appropriate code for the response. The software allowed this process to be tracked. Responses moved through the process, tagged as either having no proposed mappings [1], having a single proposal [2], having multiple proposals [3], having been auto-mapped [4], or having been approved [5].
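
The five tracking states can be modelled as a small enumeration. The sketch below is hypothetical: `MappingStatus`, `status_of` and the precedence among the states (approval first, then auto-mapping, then the number of proposals) are assumptions made for illustration, not a description of the CCRI software.

```python
from enum import Enum


class MappingStatus(Enum):
    NO_PROPOSAL = "no proposed mappings"
    SINGLE_PROPOSAL = "single proposal"
    MULTIPLE_PROPOSALS = "multiple proposals"
    AUTO_MAPPED = "auto-mapped"
    APPROVED = "approved"


def status_of(proposed_codes, auto_mapped=False, approved_code=None):
    """Derive the tracking status of one response value."""
    # Assumed precedence: an approved code overrides every other state.
    if approved_code is not None:
        return MappingStatus.APPROVED
    if auto_mapped:
        return MappingStatus.AUTO_MAPPED
    # Count distinct codes, so the same code proposed twice is one proposal.
    distinct = set(proposed_codes)
    if not distinct:
        return MappingStatus.NO_PROPOSAL
    if len(distinct) == 1:
        return MappingStatus.SINGLE_PROPOSAL
    return MappingStatus.MULTIPLE_PROPOSALS
```

For example, a value for which reviewers have suggested two different codes would be reported as `MappingStatus.MULTIPLE_PROPOSALS` until a senior member approves one of them.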

The objective of code mapping was to link responses to their corresponding numerical code. Many responses were mapped automatically; those that remained were the ones the computer was unable to link to a code, most often because the value was spelled differently from the code description, or was an illogical or unexpected response to the census question.

During the mapping process, if a value could not be mapped to any of the existing codes in the code set, the coders had two options. If it was sensible for the column, a code might be created to account for these values [6]. For example, in the Country column, many new codes were created to account for the numerous towns and cities in Canada given as places of birth. However, if the value was not a rational response to the census question being asked, it would be mapped to “Uncodeable.” For example, the value “Father” in the variable Employment Status Indicator would be mapped to “Uncodeable.”

All values were mapped except for numeric values, which were automatically mapped by the software program. If a numeric value arose in an incorrect column, such as Relationship, it would be mapped to “Uncodeable.”

Finally, values that showed a “?” or a “!” and could not be deciphered were mapped to “Illegible,” as question marks and exclamation marks were used at data entry to denote unreadable letters and numbers.
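
The mapping rules in this section can be summarized in a short sketch. Everything here is an assumption made for illustration: the function `auto_map`, the sample code set and its code numbers are hypothetical, any value containing “?” or “!” is treated as undeciphered, and a numeric value in a numeric column is simply returned as its own code. Deciding whether a non-numeric value is a rational response to the question was a coder's judgement, so the sketch returns `None` for those values.

```python
import re

ILLEGIBLE = "Illegible"
UNCODEABLE = "Uncodeable"
NUMERIC = re.compile(r"\d+(\.\d+)?")


def auto_map(value, code_set, numeric_column):
    """Return a code, ILLEGIBLE, UNCODEABLE, or None when a coder must decide."""
    value = value.strip()
    # "?" and "!" were entered at data capture to mark unreadable characters.
    if "?" in value or "!" in value:
        return ILLEGIBLE
    if NUMERIC.fullmatch(value):
        # Numeric values are mapped automatically, but only in numeric columns.
        return value if numeric_column else UNCODEABLE
    # Exact match against a code description; anything else is left for coders.
    return code_set.get(value)


# Hypothetical code set for the Relationship column (code numbers are invented).
RELATIONSHIP_CODES = {"Head": 1, "Wife": 2}
```

With this code set, `auto_map("Wife", RELATIONSHIP_CODES, False)` returns `2`, while `auto_map("12", RELATIONSHIP_CODES, False)` returns `"Uncodeable"` because a number is not a valid relationship.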

_____________________________

[1] Meaning no codes had been proposed for this value. It would appear as a blue cell in the data field.
[2] Meaning only one code had been proposed for this value. It would appear as a yellow cell in the data field.
[3] Meaning more than one code had been proposed for this value. It would appear as an orange cell in the data field.
[4] Meaning the computer was able to make sense of the data, and assigned the value a code on its own.
[5] Meaning the code was approved by one of the project coordinators.
[6] All new codes created were documented in a Code Maintenance Log.