# Unrepresentative big surveys significantly overestimated US vaccine uptake

### Calculation and interpretation of ddc

The mathematical expression for equation (1) is given here for completeness:

$$\overline{Y}_{n}-\overline{Y}_{N}=\hat{\rho}_{Y,R}\times \sqrt{\frac{N-n}{n}}\times \sigma_{Y} \tag{2}$$

The first factor \(\hat{\rho}_{Y,R}\) is called the data defect correlation (ddc)1. It is a measure of data quality represented by the correlation between the recording indicator R (R = 1 if an answer is recorded and R = 0 otherwise) and its value, Y. Given a benchmark, the ddc \(\hat{\rho}_{Y,R}\) can be calculated by substituting known quantities into equation (2). In the case of a single survey wave of a COVID-19 survey, n is the sample size of the survey wave, N is the population size of US adults from US Census estimates55, \(\overline{Y}_{n}\) is the survey estimate of vaccine uptake and \(\overline{Y}_{N}\) is the estimate of vaccine uptake for the corresponding period taken from the CDC’s report of the cumulative count of first doses administered to US adults8,13. We calculate \(\sigma_{Y}=\sqrt{\overline{Y}_{N}(1-\overline{Y}_{N})}\) because Y is binary (but equation (2) is not restricted to binary Y).
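The substitution described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `ddc` and the survey figures passed to it are hypothetical, and only the population size of 255 million adults is taken from the text.

```python
import math

def ddc(y_survey: float, y_benchmark: float, n: float, N: float) -> float:
    """Solve equation (2) for the data defect correlation, assuming binary Y."""
    # sigma_Y for a binary outcome, evaluated at the benchmark
    sigma_y = math.sqrt(y_benchmark * (1.0 - y_benchmark))
    # Data scarcity term sqrt((N - n) / n)
    data_scarcity = math.sqrt((N - n) / n)
    # Total error divided by the two known factors
    return (y_survey - y_benchmark) / (data_scarcity * sigma_y)

# Hypothetical inputs for illustration only (not estimates from the paper)
rho = ddc(y_survey=0.70, y_benchmark=0.60, n=250_000, N=255_000_000)
print(f"ddc = {rho:.5f}")
```

A positive value indicates that the survey overestimates the benchmark; the sign of the ddc always matches the sign of the total error.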

We calculate \(\hat{\rho}_{Y,R}\) by using total error \(\overline{Y}_{n}-\overline{Y}_{N}\), which captures not only selection bias but also any measurement bias (for example, from question wording). However, with this calculation method, \(\hat{\rho}_{Y,R}\) lacks the direct interpretation as a correlation between Y and R, and instead becomes a more general index of data quality directly related to classical design effects (see ‘Bias-adjusted effective sample size’).

It is important to point out that the increase in ddc does not necessarily imply that the response mechanisms for Delphi–Facebook and Census Household Pulse have changed over time. The correlation between a changing outcome and a steady response mechanism could change over time, hence changing the value of ddc. For example, as more individuals become vaccinated, and vaccination status is driven by individual behaviour rather than eligibility, the correlation between vaccination status and propensity to respond could increase even if the propensity to respond for a given individual is constant. This would lead to larger values of ddc over time, reflecting the increased impact of the same response mechanism.

### Error decomposition with survey weights

The data quality framework given by equations (1) and (2) is a special case of a more general framework for assessing the actual error of a weighted estimator \(\overline{Y}_{\rm w}=\frac{\sum_{i}w_{i}R_{i}Y_{i}}{\sum_{i}w_{i}R_{i}}\), where \(w_{i}\) is the survey weight assigned to individual \(i\). It is shown in Meng1 that

$$\overline{Y}_{\rm w}-\overline{Y}_{N}=\hat{\rho}_{Y,R_{\rm w}}\times \sqrt{\frac{N-n_{\rm w}}{n_{\rm w}}}\times \sigma_{Y}, \tag{3}$$

where \(\hat{\rho}_{Y,R_{\rm w}}={\rm Corr}(Y,R_{\rm w})\) is the finite population correlation between \(Y_{i}\) and \(R_{{\rm w},i}=w_{i}R_{i}\) (over i = 1, …, N). The ‘hat’ on ρ reminds us that this correlation depends on the specific realization of \(\{R_{i}, i = 1, \ldots, N\}\). The term \(n_{\rm w}\) is the classical ‘effective sample size’ due to weighting23; that is, \(n_{\rm w}=\frac{n}{1+{\rm CV}_{\rm w}^{2}}\), where \({\rm CV}_{\rm w}\) is the coefficient of variation of the weights for all individuals in the observed sample, that is, the standard deviation of weights normalized by their mean. It is common for surveys to rescale their weights to have mean 1, in which case \({\rm CV}_{\rm w}^{2}\) is simply the sample variance of the weights.
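The effective sample size due to weighting can be computed directly from a vector of weights. The sketch below is illustrative: the function name and the example weights are hypothetical, and it assumes the variance is taken with divisor n (population variance), in which case the result coincides with the familiar form \((\sum_i w_i)^2/\sum_i w_i^2\).

```python
from statistics import mean, pvariance

def effective_sample_size(weights):
    """n_w = n / (1 + CV_w^2), where CV_w = sd(weights) / mean(weights)."""
    n = len(weights)
    m = mean(weights)
    # Squared coefficient of variation; assumes the divisor-n variance
    cv_sq = pvariance(weights, mu=m) / m**2
    return n / (1.0 + cv_sq)

# Hypothetical weights, already rescaled to have mean 1
example_weights = [0.5, 0.5, 1.5, 1.5]
print(effective_sample_size(example_weights))
```

With equal weights the coefficient of variation is zero, so the function returns n unchanged.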

When all weights are the same, equation (3) reduces to equation (2). In other words, the ddc term \(\hat{\rho}_{Y,R_{\rm w}}\) now also takes into account the effect of the weights as a means to combat the selection bias represented by the recording indicator R. Intuitively, if \(\hat{\rho}_{Y,R}={\rm Corr}(Y,R)\) is high (in magnitude), then some \(Y_{i}\)’s have a higher chance of entering our dataset than others, thus leading to a sample average that is a biased estimator for the population average. Incorporating appropriate weights can reduce \(\hat{\rho}_{Y,R}\) to \(\hat{\rho}_{Y,R_{\rm w}}\), with the aim of reducing the effect of the selection bias. However, this reduction alone may not be sufficient to improve the accuracy of \(\overline{Y}_{\rm w}\) because the use of weights necessarily reduces the sampling fraction \(f=\frac{n}{N}\) to \(f_{\rm w}=\frac{n_{\rm w}}{N}\) as well, because \(n_{\rm w} < n\). Equation (3) precisely describes this trade-off, providing a formula to assess when the reduction of ddc is large enough to outweigh the reduction of the effective sample size.

Measuring the correlation between Y and R is not a new idea in survey statistics (though note that ddc is the population correlation between Y and R, not the sample correlation), nor is the observation that as sample size increases, error is dominated by bias instead of variance56,57. The new insight is that ddc is a general metric to index the lack of representativeness of the data we observe, regardless of whether or not the sample is obtained through a probabilistic scheme, or weighted to mimic a probabilistic sample. As discussed in ‘Addressing common misperceptions’ in the main text, any single ddc deviating from what is expected under representative sampling (for example, probabilistic sampling) is sufficient to establish that the sample is not representative (but the converse is not true). Furthermore, the ddc framework refutes the common belief that increasing sample size necessarily improves statistical estimation1,58.

### Bias-adjusted effective sample size

By matching the mean-squared error of \(\overline{Y}_{\rm w}\) with the variance of the sample average from simple random sampling, Meng1 derives the following formula for calculating a bias-adjusted effective sample size, or \(n_{\rm eff}\):

$$n_{\rm eff}=\frac{n_{\rm w}}{N-n_{\rm w}}\times \frac{1}{E[\hat{\rho}_{Y,R_{\rm w}}^{2}]}$$

Given an estimator \(\overline{Y}_{\rm w}\) with expected total MSE T due to data defect, sampling variability and weighting, this quantity \(n_{\rm eff}\) represents the size of a simple random sample such that its mean \(\bar{Y}_{N}\), as an estimator for the same population mean \(\overline{Y}_{N}\), would have the identical MSE T. The term \(E[\hat{\rho}_{Y,R_{\rm w}}^{2}]\) represents the amount of selection bias (squared) expected on average from a particular recording mechanism R and a chosen weighting scheme.

For each survey wave, we use \(\hat{\rho}_{Y,R_{\rm w}}^{2}\) to approximate \(E[\hat{\rho}_{Y,R_{\rm w}}^{2}]\). This estimation is unbiased by design, as we use an estimator to estimate its expectation. Therefore, the only source of error is the sampling variation, which is typically negligible for large surveys such as Delphi–Facebook and the Census Household Pulse. This estimation error may have more impact for smaller surveys such as the Axios–Ipsos survey, an issue that we will investigate in subsequent work.
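The plug-in calculation for a single wave can be sketched as follows. The function name and the numerical inputs are hypothetical and chosen only to show the scale of the effect; the formula itself is the one given above, with the squared ddc standing in for its expectation.

```python
def bias_adjusted_n_eff(rho_w: float, n_w: float, N: float) -> float:
    """n_eff = n_w / (N - n_w) * 1 / rho_w^2, using rho_w^2 as a plug-in for E[rho_w^2]."""
    if rho_w == 0:
        raise ValueError("ddc of exactly zero gives an unbounded effective sample size")
    return n_w / (N - n_w) / rho_w**2

# Hypothetical single-wave values (not estimates from the paper)
n_eff = bias_adjusted_n_eff(rho_w=0.005, n_w=200_000, N=255_000_000)
print(f"bias-adjusted effective sample size: {n_eff:.1f}")
```

Because the ddc enters squared and in the denominator, even a ddc that looks tiny can shrink a very large weighted sample to the equivalent of a few dozen random respondents.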

We compute \(\hat{\rho}_{Y,R_{\rm w}}\) by using the benchmark \(\overline{Y}_{N}\), namely, by solving equation (3) for \(\hat{\rho}_{Y,R_{\rm w}}\),

$$\hat{\rho}_{Y,R_{\rm w}}=\frac{Z_{\rm w}}{\sqrt{N}},\quad {\rm where}\quad Z_{\rm w}=\frac{\overline{Y}_{\rm w}-\overline{Y}_{N}}{\sqrt{\frac{1-f_{\rm w}}{n_{\rm w}}}\,\sigma_{Y}}$$

We introduce this notation \(Z_{\rm w}\) because it is the quantity that determines the well-known survey efficiency measure, the so-called ‘design effect’, which is the variance of \(Z_{\rm w}\) for a probabilistic sampling design23 (when we assume the weights are fixed). For the more general setting in which \(\overline{Y}_{\rm w}\) may be biased, we replace the variance by MSE, and hence obtain the bias-adjusted design effect \(D_{e}=E[Z_{\rm w}^{2}]\), which is the MSE relative to the benchmark measured in the unit of the variance of an average from a simple random sample of size \(n_{\rm w}\). Hence \(D_{I}\equiv E[\hat{\rho}_{Y,R_{\rm w}}^{2}]\), which was termed the ‘data defect index’1, is simply the bias-adjusted design effect per unit, because \(D_{I}=\frac{D_{e}}{N}\).

Furthermore, because \(Z_{\rm w}\) is the standardized actual error, it captures any kind of error inherent in \(\overline{Y}_{\rm w}\). This observation is important because when Y is subject to measurement errors, \(\frac{Z_{\rm w}}{\sqrt{N}}\) no longer has the simple interpretation as a correlation. But because we estimate \(D_{I}\) by \(\frac{Z_{\rm w}^{2}}{N}\) directly, our effective sample size calculation is still valid even when equation (3) does not hold.
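As a worked consequence of the definitions above, substituting the single-wave estimate \(\frac{Z_{\rm w}^{2}}{N}\) for the data defect index into the formula for the bias-adjusted effective sample size gives a compact expression. This is a sketch under the same plug-in approximation used for each survey wave, not an additional result from the source:

$$n_{\rm eff}\approx \frac{n_{\rm w}}{N-n_{\rm w}}\times \frac{N}{Z_{\rm w}^{2}}=\frac{n_{\rm w}}{(1-f_{\rm w})\,Z_{\rm w}^{2}}=\frac{\sigma_{Y}^{2}}{(\overline{Y}_{\rm w}-\overline{Y}_{N})^{2}}$$

The last equality follows from \(Z_{\rm w}^{2}=\frac{n_{\rm w}(\overline{Y}_{\rm w}-\overline{Y}_{N})^{2}}{(1-f_{\rm w})\,\sigma_{Y}^{2}}\). Under this approximation, the effective sample size depends on the data only through the squared actual error relative to \(\sigma_{Y}^{2}\), which is why the calculation does not require \(\frac{Z_{\rm w}}{\sqrt{N}}\) to be interpretable as a correlation.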

### Asymptotic behaviour of ddc

As shown in Meng1, for any probabilistic sample without selection biases, the ddc is on the order of \(\frac{1}{\sqrt{N}}\). Hence the magnitude of \(\hat{\rho}_{Y,R}\) (or \(\hat{\rho}_{Y,R_{\rm w}}\)) is small enough to cancel out the effect of \(\sqrt{N-n}\) (or \(\sqrt{N-n_{\rm w}}\)) in the data scarcity term on the actual error, as seen in equation (2) (or equation (3)). However, when a sample is unrepresentative (for example, when those with Y = 1 are more likely to enter the dataset than those with Y = 0), \(\hat{\rho}_{Y,R}\) can far exceed \(\frac{1}{\sqrt{N}}\) in magnitude. In this case, error will increase with \(\sqrt{N}\) for a fixed ddc and growing population size N (equation (2)). This result may be counterintuitive in the traditional survey statistics framework, which often considers how error changes as sample size n grows. The ddc framework considers a more general set-up, taking into account individual response behaviour, including its effect on sample size itself.

As an example of how response behaviour can shape both total error and the number of respondents n, suppose individual response behaviour is captured by a logistic regression model

$${\rm logit}[\Pr(R=1|Y)]=\alpha +\beta Y. \tag{4}$$

This is a model for a response propensity score. Its value is determined by α, which drives the overall sampling fraction \(f=\frac{n}{N}\), and by β, which controls how strongly Y influences whether a participant will respond or not.

In this logit response model, when \(\beta \ne 0\), \(\hat{\rho}_{Y,R}\) is determined by individual behaviour, not by population size N. In Supplementary Information B.1, we prove that ddc cannot vanish as N grows, nor can the observed sample size n ever approach 0 or N for a given set of (finite and plausible) values of {α, β}, because there will always be a non-trivial percentage of non-respondents. For example, an f of 0.01 can be obtained under this model for either α = −4.6, β = 0 (no influence of individual behaviour on response propensity; with β = 0 the response probability is \(\frac{1}{1+e^{-\alpha}}\), which equals 0.01 at α ≈ −4.6), or for α = −3.9, β = −4.84. However, despite the same f, the implied ddc and consequently the MSE will differ. For example, the MSE for the former (no correlation with Y) is 0.0004, whereas the MSE for the latter (a −4.84 coefficient on Y) is 0.242, over 600 times larger.
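The logit response model in equation (4) can be explored numerically. The sketch below computes the large-population limits of the sampling fraction, the bias of the respondent mean and the ddc for a binary Y. The function names are hypothetical, and the population mean p = 0.5 is an assumption (the text does not state it), chosen because it reproduces f ≈ 0.01 and a squared bias of about 0.242 for α = −3.9, β = −4.84.

```python
import math

def expit(x: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def logit_selection(alpha: float, beta: float, p: float):
    """Limits of (f, bias, ddc) when logit Pr(R=1|Y) = alpha + beta*Y and Pr(Y=1) = p."""
    r0 = expit(alpha)           # response propensity when Y = 0
    r1 = expit(alpha + beta)    # response propensity when Y = 1
    f = (1 - p) * r0 + p * r1   # overall sampling fraction
    bias = p * r1 / f - p       # mean of Y among respondents minus true mean
    # Correlation between binary Y and binary R: Cov(Y, R) / (sd(Y) * sd(R))
    ddc = p * (1 - p) * (r1 - r0) / math.sqrt(p * (1 - p) * f * (1 - f))
    return f, bias, ddc

# Scenario with behaviour-driven response (alpha and beta from the text, p assumed)
f1, bias1, ddc1 = logit_selection(alpha=-3.9, beta=-4.84, p=0.5)

# Scenario with beta = 0: solve for the intercept that gives f = 0.01 (about -4.6)
alpha0 = math.log(0.01 / 0.99)
f0, bias0, ddc0 = logit_selection(alpha=alpha0, beta=0.0, p=0.5)

print(f"beta = -4.84: f = {f1:.4f}, bias^2 = {bias1**2:.3f}, ddc = {ddc1:.3f}")
print(f"beta = 0:     f = {f0:.4f}, bias^2 = {bias0**2:.3f}, ddc = {ddc0:.3f}")
```

Neither the bias nor the ddc in this calculation involves N, which illustrates why the ddc cannot shrink as the population grows when response depends on Y.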

See Supplementary Information B.2 for the connection between ddc and a well-studied non-response model from econometrics, the Heckman selection model59.

### Population size in multi-stage sampling

We have shown that the asymptotic behaviour of error depends on whether the data collection process is driven by individual response behaviour or by survey design. The reality is often a mix of both. Consequently, the relevant ‘population size’ N depends on when and where the representativeness of the sample is destroyed; that is, when the individual response behaviours come into play. Real-world surveys that are as complex as the three surveys we analyse here have multiple stages of sample selection.

Extended Data Table 3 takes as an example the sampling stages of the Census Household Pulse, which has the most extensive set of documentation among the three surveys we analyse. As we have summarized (Table 1, Extended Data Table 1), the Census Household Pulse (1) first defines the sampling frame as the reachable subset of the MAF, (2) takes a random sample of that population to prompt (send a survey questionnaire) and (3) waits for individuals to respond to that survey. Each of these stages reduces the desired data size, and the corresponding population size is the intended sample size from the prior stage (in notation, \(N_{s}=n_{s-1}\), for s = 2, 3). For example, in stage 3, the population size \(N_{3}\) is the intended sample size \(n_{2}\) from the second stage (random sample of the outreach list), because only the sampled individuals have a chance to respond.

Although all stages contribute to the overall ddc, the stage that dominates is the first stage at which the representativeness of our sample is destroyed—the size of which will be labelled as the dominating population size (dps)—when the relevant population size decreases markedly at each step. However, we must bear in mind that dps refers to the worst-case scenario, when biases accumulate, instead of (accidentally) cancelling each other out.

For example, if the 20% of the MAF excluded from the Census Household Pulse sampling frame (because those households had no cell phone or email contact information) is not representative of the US adult population, then the dps is \(N_{1}\), or 255 million adults contained in 144 million households. Then the increase in bias for a given ddc is driven by the rate of \(\sqrt{N_{1}}\), where \(N_{1}=2.55\times 10^{8}\), and is large indeed (with \(\sqrt{2.5\times 10^{8}}\approx 15{,}000\)). By contrast, if the sampling frame is representative of the target population and the outreach list is representative of the frame (and hence representative of the US adult population) but there is non-response bias, then the dps is \(N_{3}=10^{6}\) and the impact of ddc is amplified by the square root of that number (\(\sqrt{10^{6}}=1{,}000\)). For comparison, Axios–Ipsos reports a response rate of about 50% and obtains a sample of n = 1,000, so the dps could be as small as \(N_{3}=2{,}000\) (with \(\sqrt{2{,}000}\approx 45\)).
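The square-root amplification in these three scenarios can be checked with a few lines of arithmetic. The sketch below uses the population sizes quoted above; the dictionary labels and the function name are illustrative.

```python
import math

def sqrt_n_amplifier(population_size: float) -> float:
    """Factor sqrt(N) by which a fixed ddc is scaled in Z = ddc * sqrt(N)."""
    return math.sqrt(population_size)

# Dominating population sizes taken from the text
scenarios = {
    "Census Household Pulse, stage 1 (N1)": 2.55e8,
    "Census Household Pulse, stage 3 (N3)": 1e6,
    "Axios-Ipsos, stage 3 (N3)": 2_000,
}

for label, size in scenarios.items():
    print(f"{label}: sqrt(N) = {sqrt_n_amplifier(size):,.0f}")
```

For the same ddc, the stage-1 scenario scales the standardized error by a factor several hundred times larger than the Axios–Ipsos scenario.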

This decomposition is why our comparison of the surveys is consistent with the ‘Law of Large Populations’1 (estimation error increases with \(\sqrt{N}\)), even though all three surveys ultimately target the same US adult population. Given our existing knowledge about online–offline populations40 and our analysis of Axios–Ipsos’ small ‘offline’ population, Census Household Pulse may suffer from unrepresentativeness at Stage 1 of Extended Data Table 3, where N = 255 million, and Delphi–Facebook may suffer from unrepresentativeness at the initial stage of starting from the Facebook user base. By contrast, the main source of unrepresentativeness for Axios–Ipsos may be at a later stage at which the relevant population size is orders of magnitude smaller.

### CDC estimates of vaccination rates

Our analysis of the nationwide vaccination rate covers the period between 9 January 2021 and 19 May 2021. We used CDC’s vaccination statistics published on their data tracker as of 26 May 2021. This dataset is a time series of counts of 1st dose vaccinations for every day in our time period, reported for all ages and disaggregated by age group.

This CDC time series obtained on 26 May 2021 included retroactive updates to dates covering our entire study period, as does each daily update provided by the CDC. For example, the CDC benchmark we use for March 2021 is not only the vaccination counts originally reported in March but also includes the delayed reporting for March that the CDC became aware of by 26 May 2021. Analysing several snapshots before 26 May 2021, we find that these retroactive updates 40 days out could change the initial estimate by about 5% (Extended Data Fig. 3), hence informing our sensitivity analysis of ±5% and ±10% benchmark imprecision.

To match the sampling frame of the surveys we analyse, US adults 18 years and older, we must restrict the CDC vaccination counts to those administered to adults. However, because states and jurisdictions report their vaccination statistics in different ways, the CDC did not possess age-coded counts for some jurisdictions, such as Texas, at the time of our study. The number of vaccinations with missing age data reached about 10% of the total US vaccinations at its peak at the time of our study. We therefore assume that the day-by-day fraction of adults among individuals for whom age is reported as missing is equal to the fraction of adults among individuals with age reported. Because minors became eligible for vaccinations only towards the end of our study period, the fraction of adults in data reporting age never falls below 97%.
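The proportional allocation of doses with missing age can be sketched for a single day as follows. This is an illustration of the stated assumption only: the function name and the daily counts are hypothetical, not CDC figures.

```python
def impute_adult_doses(adult_known: float, minor_known: float, age_missing: float) -> float:
    """Estimate adult first doses for one day, assuming doses with missing age
    have the same adult share as doses with reported age."""
    known = adult_known + minor_known
    if known == 0:
        raise ValueError("no age-coded doses on this day; the adult share is undefined")
    adult_share = adult_known / known
    return adult_known + adult_share * age_missing

# Hypothetical counts for a single day
estimate = impute_adult_doses(adult_known=970, minor_known=30, age_missing=100)
print(f"estimated adult doses: {estimate:.0f}")
```

Applying the function to each day separately and summing the results gives a cumulative adult count, in line with the day-by-day assumption described above.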

The Census Household Pulse and Delphi–Facebook surveys are the first of their kind for each organization, whereas Ipsos has maintained their online panel for 12 years.

### Question wording

All three surveys ask whether respondents have received a COVID-19 vaccine (Extended Data Table 1). Delphi–Facebook and Census Household Pulse ask similar questions (“Have you had/received a COVID-19 vaccination/vaccine?”). Axios–Ipsos asks “Do you personally know anyone who has already received the COVID-19 vaccine?”, and respondents are given response options including “Yes, I have received the vaccine.” The Axios–Ipsos question wording might pressure respondents to conform to their communities’ modal behaviour and thus misreport their true vaccination status, or may induce acquiescence bias from the multiple ‘yes’ options presented60. This pressure may exist both in high- and low-vaccination communities, so its net effect on Axios–Ipsos’ results is unclear. Nonetheless, Axios–Ipsos’ question wording does differ from that of the other two surveys, and may contribute to the observed differences in estimates of vaccine uptake across surveys.

### Population of interest

All three surveys target the US adult population, but with different sampling and weighting schemes. Census Household Pulse sets the denominator of their percentages as the household civilian, non-institutionalized population in the United States of 18 years of age or older, excluding Puerto Rico or the island areas. Axios–Ipsos designs samples to be representative of the US general adult population of 18 or older. For Delphi–Facebook, the US target population reported in weekly contingency tables is the US adult population, excluding Puerto Rico and other US territories. For the CDC benchmark, we define the denominator as the US 18+ population, excluding Puerto Rico and other US territories. To estimate the size of the total US population, we use the US Census Bureau Annual Estimates of the Resident Population for the United States and Puerto Rico, 2019 (ref. 55). This is also what the CDC uses as the denominator in calculating rates and percentages of the US population60.

Axios–Ipsos and Delphi–Facebook generate target distributions of the US adult population using the Current Population Survey (CPS), March Supplement, from 2019 and 2018, respectively. Census Household Pulse uses a combination of 2018 1-year American Community Survey (ACS) estimates and the Census Bureau’s Population Estimates Program (PEP) from July 2020. Both the CPS and ACS are well-established large surveys by the Census and the choice between them is largely inconsequential.

### Axios–Ipsos data

The Axios–Ipsos Coronavirus tracker is an ongoing, bi-weekly tracker intended to measure attitudes towards COVID-19 of adults in the US. The tracker has been running since 13 March 2020 and has released results from 45 waves as of 28 May 2021. Each wave generally runs over a period of 4 days. The Axios–Ipsos data used in this analysis were scraped from the topline PDF reports released on the Ipsos website5. The PDF reports also contain Ipsos’ design effects, which we have confirmed are calculated as 1 plus the variance of the (scaled) weights.

### Census Household Pulse data

The Census Household Pulse is an experimental product of the US Census Bureau in collaboration with eleven other federal statistical agencies. We use the point estimates presented in Data Tables, as well as the standard errors calculated by the Census Bureau using replicate weights. The design effects are not reported; however, we can calculate them as \(1+{\rm CV}_{\rm w}^{2}\), where \({\rm CV}_{\rm w}\) is the coefficient of variation of the individual-level weights included in the microdata23.

### Delphi–Facebook data

The Delphi–Facebook COVID symptom survey is an ongoing survey collaboration between Facebook, the Delphi Group at Carnegie Mellon University (CMU), and the University of Maryland2. The survey is intended to track COVID-like symptoms over time in the US and in over 200 countries. We use only the US data in this analysis. The study recruits respondents using daily stratified random samples that draw a cross-section of active Facebook users. New respondents are obtained each day, and aggregates are reported publicly at weekly and monthly frequencies. The Delphi–Facebook data used here were downloaded directly from CMU’s repository for weekly contingency tables with point estimates and standard errors.

### Ethical compliance

According to HRA decision tools (http://www.hra-decisiontools.org.uk/research/), our study is considered Research, and according to the NHS REC review tool (http://www.hra-decisiontools.org.uk/ethics/), we do not need NHS Research Ethics Committee (REC) review, as we used only (1) publicly available, (2) anonymized and (3) aggregated data outside of clinical settings.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
