CCRI - IRCS

1911, 1921 Census  

 

CCRI Sampling - Sample Designs


Technically, the CCRI samples are cluster samples of individual records, with the dwelling as the cluster. For every decennial census, the population is divided into relatively small geographic units across the country, each of which is sampled separately, ensuring that the population of each is proportionately represented in the national sample. This procedure is effectively a form of sample stratification by area, which serves the georeferencing features of the project (see CCRI Geography) and slightly enhances the geographic representativeness of the samples, as compared with simple random samples of the same size.

We also oversample the populations of large dwellings (LDs), which are more or less arbitrarily defined as all dwellings with 31 or more enumerated residents. The oversample is also a form of stratification that improves the precision of the sample and allows for detailed analysis of the unusual populations they housed, such as those in institutions and work camps. The large dwelling sample is discussed in more detail below.

The censuses for 1911 through 1951 varied somewhat in their definitions of census units. In 1911, 1921, and 1931, enumerators were instructed to count “dwelling houses” in a first column and “family, household or institution” in a second. In 1941, the definition of places of habitation included buildings as well as dwellings and families/households. In 1951, only buildings and households were employed as enumeration units. Our sample designs adapt to these changes but aim to maintain analytic comparability. The 1911 through 1941 sample units were dwellings, that is, places with unique addresses that were understood to be separate, occupied structures, mainly housing households and families. (Families and households may differ because households may include non-kin members such as boarders, lodgers, servants, and employees.)

In each census year, the main sample comprises “regular-sized” dwellings of 30 or fewer residents. Most private or single-family dwellings, and the largest share of the national population, fall in this stratum. As previously indicated, dwellings of 31 or more residents are oversampled. This stratum includes the residents of most institutions and a changing variety of other types of collective dwellings. In the absence of much detailed or aggregate census information about the character of large or collective dwellings, especially in the earlier years, we employ a simple, single cutoff of 31 or more residents, based on our preliminary examination of the 1911 census and paralleling the cutoff adopted for a number of the public use samples of the IPUMS project. Using this cutoff also fosters international comparability between the Canadian and U.S. database projects.
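
The split into the two strata can be illustrated with a minimal sketch. It assumes only the 31-resident cutoff stated above; the function names and the dwelling identifiers are hypothetical and are not taken from the project's software.

```python
LD_CUTOFF = 31  # dwellings with 31 or more enumerated residents count as "large"


def stratum(resident_count):
    """Return the stratum label for a dwelling of the given size."""
    return "large" if resident_count >= LD_CUTOFF else "regular"


def split_by_stratum(dwellings):
    """Group dwelling ids by stratum.

    dwellings: mapping of (hypothetical) dwelling id -> resident count.
    """
    strata = {"regular": [], "large": []}
    for dwelling_id, residents in dwellings.items():
        strata[stratum(residents)].append(dwelling_id)
    return strata
```

For instance, `split_by_stratum({"d1": 4, "d2": 30, "d3": 31})` places `d1` and `d2` in the regular stratum and `d3` in the large-dwelling stratum.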

Our design provides for a random start among dwellings, within the first n dwellings of each of the relatively small geographic areas represented by a microfilm reel. We then systematically select every subsequent nth dwelling within each geographic area (reel), where n is 20 in 1911, 25 in 1921, and 33 in 1931, 1941, and 1951. The random start is repeated for each area. This yields the desired sample densities of 5 percent of dwellings, families and individuals in 1911, 4 percent in 1921, and 3 (3.03) percent in 1931, 1941, and 1951. The number of individual records nationally varies between about 360,000 and 420,000, depending on census year; the exact sample size by year and the size of the target national population are given in the descriptions of each census extract. Our design ensures that the samples are epsem (equal probability of selection method) samples, geographically stratified by the areas represented by microfilm reels. The definition of geographic areas by microfilm reel is a matter of convenience, because the census enumerations were preserved on microfilm in the 1950s. The number of reels increases from 140 in 1911 to more than 240 in 1941. Though microfilm reels are convenient as geographic reference units, our sample point selection and data entry are actually taken from images of the microfilmed original records.
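
The selection rule described above (a random start within the first n dwellings of a reel, then every nth dwelling) can be sketched as follows. This is an illustration only, not the project's SPIDER software: the function names, the reel dictionary, and the use of a seeded random generator are assumptions, and the separate handling of large dwellings is left out.

```python
import random

# Sampling intervals n by census year, as stated in the text.
INTERVALS = {1911: 20, 1921: 25, 1931: 33, 1941: 33, 1951: 33}


def systematic_sample(dwelling_ids, year, rng):
    """Select every nth dwelling on one reel after a random start.

    dwelling_ids: dwellings on a single reel, in enumeration order.
    rng: a random.Random instance (an assumption of this sketch).
    """
    n = INTERVALS[year]
    start = rng.randrange(n)  # random start within the first n dwellings
    return dwelling_ids[start::n]


def sample_all_reels(reels, year, seed=0):
    """Draw a fresh random start for each reel (geographic area)."""
    rng = random.Random(seed)
    return {reel: systematic_sample(ids, year, rng)
            for reel, ids in reels.items()}
```

Because every dwelling on a reel has the same 1-in-n chance of falling on the selected sequence, the sketch reproduces the equal-probability property of the design within each reel.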

The records of every individual within each selected dwelling are transcribed. This procedure yields cluster samples of individuals within dwellings and also, in principle, cluster samples of families and of households within dwellings, although the majority of dwellings include only one family and one household. For the analysis of the records of individuals, families and households, estimates of confidence intervals and standard errors should take cluster effects into account. These effects arise because individuals or families in the same dwelling are more alike on some characteristics (e.g., shared ethnicity or religious affiliation) than the members of a simple random sample of the same size would be. The resulting design effects are large only for some variables. Others, such as age, are barely affected, since they vary among individuals living together. Moreover, in many analyses the samples will be sufficiently large that the question of statistical significance need not arise, and thus correct values of the error estimates will not be of primary concern (Ornstein 2000). In analysis it is the size of the sample that matters most, rather than the sampling fraction (5 percent, 4 percent or 3 percent).
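
To give a sense of scale for these cluster effects, the sketch below uses the standard textbook approximation for the design effect of a cluster sample, deff = 1 + (m - 1) * rho, where m is the average number of records per dwelling and rho is the intraclass correlation of the variable. The approximation is not taken from this Guide, and the values of m and rho shown are hypothetical, chosen only to contrast a variable shared within dwellings with one that varies within them.

```python
import math


def design_effect(mean_cluster_size, rho):
    """Standard approximation: deff = 1 + (m - 1) * rho."""
    return 1.0 + (mean_cluster_size - 1.0) * rho


def clustered_se(srs_se, mean_cluster_size, rho):
    """Inflate a simple-random-sample standard error for clustering."""
    return srs_se * math.sqrt(design_effect(mean_cluster_size, rho))


# Hypothetical intraclass correlations, for illustration only.
examples = [("religion (largely shared within a dwelling)", 0.8),
            ("age (varies within a dwelling)", 0.02)]
for label, rho in examples:
    print(label, round(design_effect(5, rho), 2))
```

With these illustrative inputs the first variable has a design effect several times larger than the second, which is the pattern the paragraph above describes.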

The oversample of LDs also addresses the problem of dealing with a relatively small number of large and varied places of residence, boarding houses, and institutions (e.g., hospitals, orphanages, residential schools, and prisons). These dwellings also include workplace residences, such as logging and mining camps or shanties, which were quite numerous in some regions, especially in the early part of the twentieth century. The majority, but not all, of such dwellings are included in our generic category of LDs. Some, such as small hospitals, boarding schools or work camps with 30 or fewer residents, are captured in the main sample. A complete national or regional analysis of institutions of a given type or of work camps will combine selections from the main sample with the oversamples, appropriately weighting the selections (see Weighting below).
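
As a rough illustration of what combining the two sources involves, the sketch below weights each record by the inverse of its selection probability. The main-sample intervals come from the sample design described earlier; the within-LD sampling fraction is not given in this section, so `ld_fraction` is a purely hypothetical parameter, and the function names are likewise illustrative. The project's actual weights are described in the Weighting section.

```python
# Main-sample sampling intervals by census year (1 in n dwellings).
MAIN_INTERVAL = {1911: 20, 1921: 25, 1931: 33, 1941: 33, 1951: 33}


def record_weight(year, is_large_dwelling, ld_fraction):
    """Inverse-probability weight for one sampled record.

    ld_fraction is a HYPOTHETICAL within-LD sampling fraction; the
    actual rates are not stated in this section.
    """
    if is_large_dwelling:
        return 1.0 / ld_fraction
    return float(MAIN_INTERVAL[year])


def weighted_total(groups, year, ld_fraction):
    """Estimate a population count from (is_large_dwelling, n_records) pairs."""
    return sum(record_weight(year, is_ld, ld_fraction) * n_records
               for is_ld, n_records in groups)
```

For example, 12 work-camp records from the 1911 main sample and 40 from the oversample, with an assumed `ld_fraction` of 0.25, would give an estimate of 12 × 20 + 40 × 4 = 400 residents.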

Epsem samples of the national populations would result if the sampling fraction within the LDs were the same as for the rest of the sample, which would involve selecting 5, 4, or 3 percent of their residents, depending on the year. This is not an optimal design, however. For example, in a 3 percent sample only one record would be selected for an institution with, say, just 35 or so residents, severely limiting substantive analysis. A preferred design is to take an oversample.
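
The arithmetic behind this example is simple to check. The sketch below computes the expected number of records that an equal-fraction design would draw from a 35-resident institution, using the sampling intervals given in the sample design (20, 25, and 33); the function name is illustrative.

```python
def expected_records(residents, interval):
    """Expected number of sampled records at a 1-in-interval rate."""
    return residents / interval


for year, interval in [(1911, 20), (1921, 25), (1931, 33)]:
    print(year, round(expected_records(35, interval), 2))
```

In every year the expected yield is between one and two records, far too few to describe the institution's residents.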

Two key purposes are served by our oversamples of LDs. First, as we just noted, LD residents are of considerable historical and analytic interest; sampling them with greater probability makes it possible to study them in greater detail than if they were sampled at the same rate as the rest of the population. Second, to the extent that the residents in LDs are unusual, oversampling them increases the precision of estimates of characteristics of the total population, even if that LD stratum is not large in comparison with the entire sample (Kish 1967, chap. 3).

Our design accomplishes the oversampling by simultaneously creating an inventory of all LDs and sampling within them, at the same time that we identify the main sample of dwellings with 30 or fewer residents. We employ a unique, dedicated software system (SPIDER) designed for the purpose and described in some detail elsewhere in this Guide (see the "CCRI Software" document in the Resources section of this Guide). This procedure also alerts us to gaps in the records, illegibility, the inclusion of unexpected populations and other problems to be faced in the transcription of the sample records. Though the task is still daunting, the dedicated software and the relatively small size of the Canadian population in these years make it feasible to undertake sample point selection in a series of integrated passes through the national census data, even as the population of the country grew substantially. The national population, which did not include Newfoundland until 1951, grew from about 7,200,000 in 1911 to 11,500,000 in 1941 and to slightly more than 14,000,000 in 1951.