Coding Philosophy
In this document, the term code refers to the numeric value or symbol used to represent a piece of data. The term coding refers to the act of converting a piece of data into its code.
Coding, A General Overview [1]:
Why do we code? The purpose of coding responses is to reduce the number of distinct responses for each variable found in the data, so that analysis is tractable.
Consider the most extreme case, where every individual gives a different response. With as many distinct responses as cases, statistically we would not be able to explain the differences between individuals for that variable. Nor could we use that variable as an independent variable, as it would saturate the model with a dummy variable for each different response.
At a minimum, transcribed census data will have spelling mistakes, spelling variations, and abbreviations of common responses, which are highly unlikely to represent any real differences. Thus, we can be relatively confident that "merchant," "marchant," and "MCHT." can all be given the same code.
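The idea of giving spelling variants the same code can be sketched as a simple lookup. This is a minimal illustration only: the code value 101, the placeholder 999, and the names VARIANT_TO_CODE, normalize, and code_response are assumptions made for this example, not part of any actual code set.

```python
# Hypothetical lookup table: the code 101 and this variant list are
# illustrative assumptions, not values from a real code set.
VARIANT_TO_CODE = {
    "MERCHANT": 101,
    "MARCHANT": 101,  # spelling variation
    "MCHT": 101,      # abbreviation
}
UNCODED = 999  # assumed placeholder for responses not yet in the table


def normalize(response: str) -> str:
    """Upper-case the response and keep only letters and single spaces."""
    kept = "".join(ch for ch in response.upper() if ch.isalpha() or ch == " ")
    return " ".join(kept.split())


def code_response(response: str) -> int:
    """Return the code for a transcribed response, or UNCODED if unknown."""
    return VARIANT_TO_CODE.get(normalize(response), UNCODED)


codes = [code_response(r) for r in ["merchant,", "marchant", "MCHT."]]
```

All three variants resolve to the same code, while a response missing from the table falls back to the placeholder so that it can be reviewed rather than silently dropped.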
Another reason for coding is that the data will be used by statistical analysis packages that typically handle numeric responses more easily than alphabetic responses. Moreover, a small number of codes can be stored more compactly than a larger number of distinct alphabetic responses. These reasons for coding are independent of the main rationale for coding qualitative responses – classifying similar responses together for tractable analysis.
Who are we coding for? Because the main reason for coding qualitative responses is to allow statistical analysis of the data, producers of public use datasets must consider who will be using their data, and what analyses they might be undertaking. Researchers use the codes for two inter-related purposes:
1) To generate independent variables for use in analysis.
2) To select sub-populations, based on the value of a variable.
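Both uses can be sketched with a few coded records. The records, the variable name occupation_code, and the helper names below are hypothetical and chosen only for illustration.

```python
# Hypothetical coded records; field names and values are assumptions.
records = [
    {"id": 1, "occupation_code": 101, "age": 34},
    {"id": 2, "occupation_code": 205, "age": 51},
    {"id": 3, "occupation_code": 101, "age": 27},
]


def dummy_variables(records, variable):
    """Build one 0/1 indicator per distinct code of `variable` for each record."""
    distinct = sorted({r[variable] for r in records})
    return [
        {f"{variable}_{code}": int(r[variable] == code) for code in distinct}
        for r in records
    ]


def select(records, variable, code):
    """Return the sub-population whose `variable` equals `code`."""
    return [r for r in records if r[variable] == code]


dummies = dummy_variables(records, "occupation_code")
merchants = select(records, "occupation_code", 101)
```

If every record carried a different code, dummy_variables would produce as many indicator columns as there are records, which is the saturation problem described earlier.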
We cannot predict what analyses researchers will be performing, or what groups they will be interested in selecting. This suggests creating codes for as many responses as possible. Researchers can always combine codes if they feel the existing classifications are too narrow. However, this must be traded off against unnecessary extra work created for researchers by creating groups that always need to be combined, or creating a coding scheme that is too complex to understand. Researchers who are interested in reclassifying the data can always be provided with the original responses if their interest is sufficiently deep. In summary, the coding scheme must balance the need for specificity with compactness.
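Combining detailed codes into broader groups, as a researcher might do when the existing classification is too narrow, can be sketched as a second lookup applied on top of the detailed codes. The codes, group names, and the collapse helper are hypothetical.

```python
from collections import Counter

# Hypothetical mapping from detailed codes to broader groups.
DETAILED_TO_GROUP = {
    101: "commerce",
    102: "commerce",
    205: "agriculture",
    206: "agriculture",
}


def collapse(code, mapping=DETAILED_TO_GROUP, default="other"):
    """Map a detailed code to its broader group, or `default` if unmapped."""
    return mapping.get(code, default)


detailed = [101, 102, 205, 206, 101, 999]
grouped = Counter(collapse(c) for c in detailed)
```

Collapsing in this direction is always possible, whereas splitting a broad code back into detailed ones requires the original responses. This is one reason to lean towards specificity when the trade-off is close.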
The CCRI Coding Philosophy and Approach
Compatibility and flexibility were the two principles guiding the creation of the CCRI’s code sets.
Because the CCRI covers multiple census years, an effort has been made to create, for those variables that are repeated from census to census, code sets that cover all census years examined by the CCRI project. These cross-temporal coding sets provide a standardized set of codes and so ensure, as far as possible, that responses from different census years are coded consistently.
By creating cross-temporal coding sets, we have aimed to make our data as research-friendly as possible. Researchers do not need to use multiple coding sets for variables that are repeated from census to census, which makes cross-temporal comparisons less cumbersome.
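A cross-temporal coding set can be pictured as one shared set of codes onto which each census year's own wording is mapped. The sketch below is illustrative only: the variable, the year labels year_a and year_b, the response wordings, and the code values are all assumptions and do not describe the CCRI's actual code sets.

```python
# One shared code set used for every census year (hypothetical values).
STANDARD_CODES = {1: "single", 2: "married", 3: "widowed"}

# Year-specific response wording mapped onto the shared codes.
# "year_a" and "year_b" are placeholder labels, not real census years.
YEAR_RESPONSE_TO_CODE = {
    "year_a": {"S": 1, "M": 2, "W": 3},
    "year_b": {"SINGLE": 1, "MARRIED": 2, "WIDOWED": 3},
}


def standard_code(year: str, response: str) -> int:
    """Translate a year-specific response into the shared code."""
    return YEAR_RESPONSE_TO_CODE[year][response.strip().upper()]


pooled = [
    standard_code("year_a", "m"),
    standard_code("year_b", "Married"),
]
```

Because both years resolve to the same code, records from different censuses can be pooled and compared without first translating between year-specific schemes.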
________________________
[1] This section comes from a report written by Evan Roberts of the Minnesota Population Center. We are thankful to Evan for allowing CCRI to use this report.