Abstract

In Phase 1 of the American Heart Association (AHA) COVID-19 Data Challenge, our team submitted a report entitled “Effect of Public Health Policies on New Confirmed Cases for Coronavirus Disease 2019 in South Korea: Lessons for the World.”

In Phase 2, we introduce two separate analyses that build on this work by evaluating the impact of the COVID-19 pandemic on key population groups in the United States. In Part I of Phase 2, we use several publicly available data sources to examine how policies related to university and college reopening affected new confirmed COVID-19 cases in surrounding communities. In Part II of Phase 2, we use the IBM MarketScan Data that were made available to teams as part of the AHA COVID-19 Data Challenge to estimate the number of individuals at high risk for infection within the general population. Addressing these questions together will allow policymakers and public health officials to consider both the immediate and long-term impact of the COVID-19 pandemic as the U.S. and the world brace for additional waves of infection during the upcoming winter.

Main findings:

  • The number of university and college students returning to campus had a positive association with new confirmed COVID-19 cases within the local county region where the institution resided. In comparison to holding class in-person, reopening colleges online or in a hybrid mode was associated with a lower number of new confirmed COVID-19 cases.

  • The number of individuals considered at high risk for COVID-19 infection due to age, medical conditions, or an immunocompromised state (due to solid organ transplant or drug use) varies considerably across the United States. These findings may be valuable for future work that evaluates how policies for reopening businesses and other social activities may need to be tailored at the state level if COVID-19 cases continue to grow over time.

Part I. University reopening policies and new confirmed COVID-19 cases.

Objective

The new academic year for universities and colleges in the U.S. began in late August and early September, in the midst of the COVID-19 pandemic. During March and April, most schools moved their classes fully online and closed their campuses. For the 2020 fall semester, many schools opted to reopen fully or partially given a desire to optimize learning environments. However, other schools chose to institute policies supporting an online or hybrid mode of learning given concerns for student safety. Though all schools have tried to take public health actions to ensure reopening occurs safely, numerous reports in the lay media suggest universities and colleges experienced outbreaks of COVID-19 cases after students returned. As indirect evidence, the overall total number of daily new confirmed cases across the U.S. shows an increasing trend starting from early September in Figure 1 that coincides with the time of school reopenings (universities, colleges, and elementary/secondary schools alike).

Figure 1: Left: Nationwide total number of daily new confirmed COVID-19 cases since June 1st and their 7-day average. The number of daily new confirmed COVID-19 cases shows an increasing trend starting from early September that coincides with the time of school reopening. Right: The log-transformed number of total new confirmed cases from September 1st to October 22nd across 3046 U.S. counties. Counties are colored gray if the information is unavailable.

Although anecdotal observations about on-campus COVID-19 outbreaks after university and college reopenings have been communicated extensively by newspapers and other outlets, the precise effects of students returning to campus and of university and college reopening policies on the spread of coronavirus in their surrounding communities are still unclear. In this study, we investigate these concerns more specifically, focusing on two Aims:

Aim 1

First, we look at all counties in the U.S. and study the association between the spread of coronavirus since September 1st and the number of students returning to campus in each county. In this aim of the study, we do not distinguish between different reopening policies of universities and colleges (except for excluding those that reopened online only). Instead, we focus on the proportion of the local county population enrolled in universities and colleges to understand how the number of students returning to campus may be related to new confirmed COVID-19 cases in the community more generally.

Figure 2: The mean number of daily new confirmed COVID-19 cases per 10,000 county population, from June 1 to October 22, over U.S. counties without universities that adopted in-person or hybrid reopening policies and over U.S. counties in the top 50% of in-person or hybrid enrollment per 10,000 county population. The mean number of daily new confirmed COVID-19 cases in U.S. counties in the top 50% of in-person or hybrid college enrollment rates appears to rise relative to U.S. counties without in-person or hybrid colleges. Interestingly, this separation appears to be greatest during an early phase in September, with infections rising in U.S. counties without in-person or hybrid colleges in October.

Aim 2

In Aim 2, we narrow our scope to the U.S. counties with universities and colleges to specifically study the association between different reopening policies and new confirmed COVID-19 cases. Using publicly available information, we divide the policies into three categories: in-person, hybrid, and online. We explore the extent to which holding virtual classes through hybrid or online reopening, rather than in-person classes, was associated with new confirmed COVID-19 cases.

Figure 3: Left: The mean number of daily confirmed COVID-19 cases per 10,000 population over 332 U.S. counties whose universities adopted an in-person policy, 181 U.S. counties whose universities adopted a hybrid policy, and 685 U.S. counties whose universities adopted an online policy. The daily confirmed cases per 10,000 population of the three groups are almost the same before the end of August; after that, as schools reopened, considerable gaps appear between the three groups. Right: The comparison of the mean number of daily new confirmed cases (per 10,000 county population) among counties with various reopening policies among their universities and colleges. This figure only contains counties whose schools all had the same policy. The mean value of counties with a hybrid policy is significantly different from the mean value of counties with an in-person policy (p < 0.05) under the two-sample t-test. The mean value of counties with an online policy is also significantly different from the mean value of counties with an in-person policy (p < 0.001).

Methods

Data Source

We aggregated multiple publicly available datasets for these analyses. First, we used county-level daily new confirmed cases for August 1st to October 22nd, 2020 (our study period) from the COVID-19 tracking project by 1Point3Acres, which was last updated on October 23rd, 2020. Second, we obtained state-level cumulative testing rates from the CDC COVID Data Tracker. Third, we collected university and college reopening policies and enrollment data for 2958 institutions across 1253 counties from the Chronicle of Higher Education, which was last updated on October 1st, 2020; these data were originally provided by the College Crisis Initiative at Davidson College. (Note: when we use the term university in isolation below, we refer to both universities and colleges, that is, institutions of higher education after secondary school.) Fourth, we obtained county-level demographic data and land area from the United States Census Bureau website and socioeconomic data from the Economic Research Service of the United States Department of Agriculture. In particular, we collected the county-level resident population estimates by age and race in 2019, the county-level median household income in 2018, and the county land area in square miles in 2010.

Outcome Variables

For both Aim 1 and Aim 2, our main outcome variable was the mean number of daily new confirmed cases per 10,000 county population from September 1st to October 22nd at the county level. We chose to use the data starting from September 1st to capture the trend of confirmed cases after reopening, because the majority of universities reopened in late August or early September. Moreover, we considered the mean number of daily new confirmed cases over a relatively long period (until the end date of analysis on October 22nd, 2020), rather than shorter intervals, to smooth out the weekly periodic pattern and the lag in impact of various reopening policies. In addition, we standardized the county-level average number of daily new confirmed cases by the county population, because the total population varies substantially across counties and affects the absolute number of confirmed cases in each county.
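The outcome definition above can be sketched in a few lines of Python. This is an illustration only: the function name, the county population, and the daily counts are hypothetical and are not taken from the study data.

```python
def mean_daily_cases_per_10k(daily_new_cases, county_population):
    """Mean of daily new confirmed cases, scaled to cases per 10,000 residents."""
    if county_population <= 0:
        raise ValueError("county_population must be positive")
    if not daily_new_cases:
        raise ValueError("daily_new_cases must not be empty")
    mean_daily = sum(daily_new_cases) / len(daily_new_cases)
    return mean_daily / county_population * 10_000


# Hypothetical county: 52 daily counts (September 1st to October 22nd inclusive)
# and an assumed population of 150,000.
example_counts = [12, 8, 15, 10] * 13
example_outcome = mean_daily_cases_per_10k(example_counts, 150_000)
print(round(example_outcome, 4))
```

Averaging over the whole window, rather than over single weeks, is what smooths out the weekly reporting pattern mentioned above.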

Exposure Variables

We used different exposure variables for Aim 1 and Aim 2 because their goals differ. For Aim 1, we focused on the number of students enrolled in in-person or hybrid classes to examine the effect of reopening across counties with and without universities. For Aim 2, we focused on specific reopening policies in only those counties with universities. The original dataset for reopening policies used five main categories: “fully in person”, “primarily in person”, “hybrid”, “primarily online”, and “fully online”. For ease of interpretation, we merged “fully in person” and “primarily in person” into an “in person” policy, and “fully online” and “primarily online” into an “online” policy, which resulted in three policy categories in our analysis: “in person”, “online”, and “hybrid”. Below we describe the exposure variables used for the two aims in detail.

Figure 4: The number of universities and colleges stratified by various reopening policies.

Aim 1: To study the impact of universities’ reopening, we used an exposure variable that captured the number of university students returning to their campus within each county. Because enrollments and policies may differ when more than one university is present in a county, we defined the exposure variable for each county as follows. We first aggregated all universities in a county and summed the enrollment of students under the in-person or hybrid policies (where students would likely return to campus). The exposure variable was then obtained as the ratio of this sum to the total county population (ENROLLP). We used the total enrollment of universities under these two policies as a reasonable proxy for the number of university students returning to campus. Because enrollment varies widely among universities, the proposed exposure variable characterizes the population flow better than alternatives such as the number of universities with each policy. A higher value of the exposure variable means that more students stayed in physical buildings and were involved in onsite campus activities per county population. The exposure variable equals zero if there is no university in a region (or if all universities in a region adopted an online reopening policy).
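A minimal sketch of the ENROLLP construction described above, assuming each university is represented as a pair of its original policy label and its enrollment. The mapping follows the five-to-three category merge described earlier; the function name and the example county are hypothetical.

```python
# Merge of the five original policy labels into the three analysis categories.
POLICY_MAP = {
    "fully in person": "in person",
    "primarily in person": "in person",
    "hybrid": "hybrid",
    "primarily online": "online",
    "fully online": "online",
}


def enrollp(universities, county_population):
    """Ratio of in-person plus hybrid enrollment to the county population.

    universities: list of (original_policy_label, enrollment) pairs.
    Returns 0.0 when the county has no university or only online universities.
    """
    on_campus = 0
    for label, enrollment in universities:
        merged = POLICY_MAP.get(label.strip().lower())
        if merged in ("in person", "hybrid"):
            on_campus += enrollment
    return on_campus / county_population


# Hypothetical county with three universities and 200,000 residents.
example_county = [
    ("primarily in person", 5_000),
    ("hybrid", 3_000),
    ("fully online", 12_000),
]
example_enrollp = enrollp(example_county, 200_000)
no_university = enrollp([], 50_000)
print(example_enrollp, no_university)
```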

Aim 2: To compare different reopening policies across universities, we focused on the subset of 1069 counties with universities. The main exposure variables were the proportions of enrollment under each of the three policies within each county (ONLINE%, HYBRID%, INPERSON%). For example, for each county, the “in person” variable was the ratio of the total enrollment of all universities adopting an in-person policy in that county to the total university enrollment in the county. The three policy exposure variables sum to one. We were interested in whether the proportions of hybrid and online enrollment would be negatively associated with COVID-19 confirmed cases when taking in-person enrollment as the baseline. We show an example of this in Figure 5 for Washtenaw County, where the University of Michigan is one of several schools in the region.

Figure 5: We illustrate how to calculate our exposure variables using our state of Michigan. Left: The percentile rank of the mean daily confirmed cases from September 1st to October 22nd relative to all counties in the U.S. Middle: Schematic illustration of how to construct the main exposure variables in Aim 2 – the proportion of enrollment for three policies within each county. Right: The geographic distribution of universities’ reopening policies over counties in Michigan.
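The construction of the Aim 2 exposure variables can be sketched as follows, assuming policy labels have already been merged into the three analysis categories. The function name and the enrollment figures are hypothetical and do not describe any real county.

```python
def policy_shares(universities):
    """Share of a county's total university enrollment under each policy.

    universities: list of (merged_policy, enrollment) pairs, where
    merged_policy is "in person", "hybrid", or "online".
    The three returned shares sum to one.
    """
    totals = {"in person": 0, "hybrid": 0, "online": 0}
    for policy, enrollment in universities:
        if policy not in totals:
            raise ValueError(f"unexpected policy label: {policy!r}")
        totals[policy] += enrollment
    total_enrollment = sum(totals.values())
    if total_enrollment == 0:
        raise ValueError("county has no university enrollment")
    return {policy: n / total_enrollment for policy, n in totals.items()}


# Hypothetical county with three universities.
example_shares = policy_shares(
    [("in person", 5_000), ("hybrid", 3_000), ("online", 12_000)]
)
print(example_shares)
```

Because the shares sum to one, only two of them carry independent information, which is why one category has to serve as the baseline in a regression.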

Other Control Variables

We considered the following control variables in models for both Aim 1 and Aim 2. We first included the mean number of daily new confirmed COVID-19 cases per 10,000 county population in August (AUGP). Examining new confirmed COVID-19 cases in August, before reopening occurred, allows us to account for baseline differences in the severity of coronavirus infections across counties. We also controlled for five demographic, economic, and testing characteristics that could potentially be associated with new confirmed COVID-19 cases: 1) the proportion of population aged over 65 years (AGE65), 2) the proportion of population aged below 20 years (AGE20), 3) the proportion of population of racial minorities (non-White) (RACEMINO), 4) the median household income as of 2018 (MEDHHINC), and 5) COVID-19 testing rates at the state level (TR). Note that testing rates were available only at the state level, so we used them as a proxy for county-level testing resources.

Figure 6: Left: The median household income as of 2018 across U.S. counties. Right: The proportion of racial minorities (non-White) across U.S. counties. Counties are colored gray if the information is unavailable.

Statistical Analysis

The analysis was done at the county level. The dataset initially provided county-level daily confirmed cases for 3086 counties (out of 3141 counties in the U.S.). To focus on in-person, online, and hybrid policies of interest, we excluded 163 counties that contained any university with “other” or “undetermined” policy. We observed a handful of “negative” daily confirmed cases, which were used to correct historical mistakes in the county-level data (i.e., a case was re-categorized the following day from being positive to negative). Given that such data from counties with large numbers of corrections are likely to be unstable and may even indicate limited surveillance systems, we excluded 30 (~1%) counties that contained at least one day with more than 5 corrected cases, which resulted in 2893 counties in our analysis.
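The exclusion rules above can be sketched as a simple filter. Two points are assumptions rather than details confirmed by the text: a "corrected" day is read here as a negative daily count, and "more than 5 corrected cases" is read as a daily value below -5. The function name and inputs are hypothetical.

```python
def keep_county(daily_new_cases, merged_policies, max_correction=5):
    """Return True if a county passes both exclusion rules.

    daily_new_cases: daily counts, where negative values are corrections.
    merged_policies: policy labels of the universities in the county.
    """
    # Rule 1: drop counties containing any university with an unclear policy.
    if any(p in ("other", "undetermined") for p in merged_policies):
        return False
    # Rule 2 (assumed reading): drop counties with any single day whose
    # correction exceeds max_correction cases.
    if any(count < -max_correction for count in daily_new_cases):
        return False
    return True


# Cohort arithmetic reported in the text.
initial_counties = 3086
final_counties = initial_counties - 163 - 30
print(final_counties)
```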

We use linear models to estimate the effects of university reopening and different reopening policies. For Aim 1, we use the model:

\[ \begin{aligned} \text{Outcome } = & \text{ Intercept} + \text{ENROLLP} * \beta_1 + \text{AUGP} * \beta_2 + \text{AGE20} * \beta_3 + \text{AGE65} * \beta_4 + \text{RACEMINO} * \beta_5 \\ & + \text{ MEDHHINC} * \beta_6 + \text{TR} * \beta_7 + \epsilon. \end{aligned} \]

For Aim 2, we use the model:

\[ \begin{aligned} \text{Outcome } = & \text{ Intercept} +\text{ONLINE}\% * \beta_1 +\text{HYBRID}\% * \beta_2 + \text{AUGP} * \beta_3 + \text{AGE20} * \beta_4 + \text{AGE65} * \beta_5 \\ &+ \text{RACEMINO} * \beta_6 + \text{ MEDHHINC} * \beta_7 + \text{TR} * \beta_8 + \epsilon. \end{aligned} \]

Note that for Aim 2, the enrollment proportions of the three policies sum to 1, so we drop the in-person enrollment proportion from the covariates to avoid collinearity with the intercept. The in-person policy therefore serves as the baseline against which ONLINE% and HYBRID% are compared. We used a p-value of less than 0.05 to indicate statistically significant evidence of an association between a covariate and the outcome variable.
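To make the baseline-category idea concrete, the sketch below fits an ordinary least squares model with an intercept using only the Python standard library. It is not the authors' code (their analysis was run in R), it omits the control variables, and the data are noise-free toy values generated from assumed coefficients, so it only illustrates the mechanics.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear covariates)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(rows, y):
    """Least squares via the normal equations; an intercept column is added."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy counties: (ONLINE%, HYBRID%); the in-person share is the dropped baseline.
toy_rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.2, 0.3)]
# Assumed coefficients used only to generate the toy outcome.
toy_y = [1.0 - 0.3 * online - 0.2 * hybrid for online, hybrid in toy_rows]
intercept, beta_online, beta_hybrid = ols(toy_rows, toy_y)
print(intercept, beta_online, beta_hybrid)
```

With the in-person share omitted, the intercept corresponds to a fully in-person county and each slope is the difference relative to that baseline.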

Further, we also conducted five sensitivity analyses.

First, for Aim 1, we conducted another analysis to compare the early-stage (within a month) and late-stage (after a month) impact of university reopening. Specifically, for each county, we considered two periods of time, one in September and the other in October. Thus, each county has two outcome measurements, corresponding to the average number of daily new confirmed cases per 10,000 county population in September and in October, respectively. We also constructed an indicator variable for whether the period under consideration is October (I(Oct)), and added it and its interactions with the mean daily new confirmed cases per 10,000 county population in August and with the enrollment proportion to the linear regression model to account for the variation over time. The model is given by

\[ \begin{aligned} \text{Outcome } = & \text{ Intercept} + I(\text{Oct}) * \beta_1 + I(\text{Sep}) * \text{ENROLLP} * \beta_2 \\ & + I(\text{Oct}) * \text{ENROLLP} * \beta_3 + \text{AUGP} * \beta_4 + I(\text{Oct}) * \text{AUGP} * \beta_5 \\ &+\text{AGE20} * \beta_6 + \text{AGE65} * \beta_7 + \text{RACEMINO} * \beta_8 + \text{ MEDHHINC} * \beta_9 + \text{TR} * \beta_{10} + \epsilon, \end{aligned} \]

where \(I(\cdot)\) is the indicator variable for a given month.
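The two-period layout for this sensitivity analysis can be sketched as follows: each county contributes one September row and one October row, with the month indicator and interaction terms built explicitly. The dictionary keys and the example values are hypothetical, and the remaining control variables are left out for brevity.

```python
def stack_september_october(counties):
    """Expand each county into two rows (September, October) with interactions."""
    rows = []
    for county in counties:
        for month, is_oct in (("sep", 0), ("oct", 1)):
            rows.append({
                "outcome": county[f"{month}_rate"],
                "I_oct": is_oct,
                "I_sep_x_ENROLLP": (1 - is_oct) * county["ENROLLP"],
                "I_oct_x_ENROLLP": is_oct * county["ENROLLP"],
                "AUGP": county["AUGP"],
                "I_oct_x_AUGP": is_oct * county["AUGP"],
            })
    return rows


# Hypothetical county values for illustration only.
toy_counties = [
    {"sep_rate": 1.8, "oct_rate": 2.1, "ENROLLP": 0.04, "AUGP": 1.2},
]
stacked = stack_september_october(toy_counties)
for row in stacked:
    print(row)
```

Giving ENROLLP separate September and October terms lets the model estimate a distinct enrollment slope for each month.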

Second, given that students may still be on campus even if instruction was online only, we replaced the in-person and hybrid enrollment proportion with the enrollment proportion across all types of enrollment and re-estimated the model for Aim 1.

Third, for Aim 1, we changed the exposure variable - the in-person and hybrid enrollment proportion - into a binary variable. This variable is 1 if the in-person and hybrid enrollment proportion is greater than zero, and is 0 otherwise. The goal of this sensitivity analysis was to determine if a “dose-response” effect was noted based on the proportion of in-person and hybrid enrollment (rather than its mere presence in a county).

Fourth, we added county-level population density as another control variable and re-estimated the model. We added population density as it could affect the frequency of face-to-face interactions among people within the county, and therefore, affect the spread of COVID-19.

Lastly, because the impact of universities’ reopening on counties with large populations can be relatively minor, we further removed the 87 counties in the top 5% of total county population and re-estimated the model on the remaining cohort.

Our analysis was conducted using R statistical software versions 3.6.0 and 3.6.1.

Results

Aim 1

The final cohort includes 2893 U.S. counties. The fitted model is shown in Table 1. Counties with higher proportions of in-person or hybrid university enrollment have higher confirmed case rates from September onward (\(\beta\) = 2.01, p-value < 0.001). That is, for every 10-percentage-point rise in the ratio of in-person or hybrid university enrollment to the county population, the mean number of daily new confirmed cases is about 0.2 higher per 10,000 county population (e.g., 7.4 cases per day in a county the size of Washtenaw County, where the University of Michigan resides).
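The conversion from the coefficient to cases per day can be checked with a short calculation. The county population used here (about 367,600 for Washtenaw County) is an assumption introduced for illustration and is not stated in the text; the coefficient is the Table 1 estimate for ENROLLP.

```python
def expected_change_in_daily_cases(beta, delta_share, county_population):
    """Model-implied change in mean daily cases for a change in the exposure.

    Returns (change per 10,000 residents, change in cases per day in a county
    of the given size).
    """
    per_10k = beta * delta_share
    return per_10k, per_10k * county_population / 10_000


ASSUMED_WASHTENAW_POPULATION = 367_600  # assumption, not from the source text

per_10k, county_cases = expected_change_in_daily_cases(
    beta=2.006, delta_share=0.10, county_population=ASSUMED_WASHTENAW_POPULATION
)
print(round(per_10k, 2), round(county_cases, 1))
```

Under that population assumption, the result is consistent with the roughly 0.2 per 10,000 and 7.4 cases per day quoted above.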

Regarding control variables, the average confirmed case rate in August is positively associated with the confirmed case rate from September onward (\(\beta\) = 0.20, p-value < 0.001). This was expected, as the confirmed case rate is not likely to change drastically within a short period of time. The proportion of young people under the age of 20 and the proportion of older adults 65 years or older are both positively associated with the outcome (p-values < 0.001). This could be because older adults are at higher risk of getting COVID-19, and young people were likely going back to school after September, increasing the chance of COVID-19 spread. The median household income, which broadly reflects local economic status, is negatively associated with the outcome, suggesting lower numbers of new confirmed COVID-19 cases in wealthier regions. The regional testing rate is positively associated with the outcome (p-value < 0.0001). This is expected, since the more people are tested, the more confirmed cases are likely to be reported. Finally, we found that the proportion of racial minorities in the county population is negatively associated with the outcome (\(\beta\) = -0.96, p-value < 0.001). This result was robust to using other control variables that represent racial minorities (e.g., the proportion of African Americans in the population). We suspect this is due to our evaluation of new confirmed cases late in the pandemic (i.e., from September onward), rather than over the whole pandemic period, during which rates of infection have been higher in U.S. counties with higher proportions of minority populations.

Table 1: The estimated coefficients for Aim 1 on 2893 counties. The main exposure variable is in-person + hybrid enrollment proportion (ENROLLP). The control variables are: AUGP (average number of daily new confirmed COVID-19 cases per 10,000 county population in August), AGE20 (proportion of young people under the age of 20), AGE65 (proportion of older adults 65 years or older), RACEMINO (proportion of population of racial minorities, i.e., non-White), MEDHHINC (the median household income as of 2018), TR (COVID-19 testing rates at the state level).

| Variable | Estimate | Std Error | P-value |
| --- | --- | --- | --- |
| Intercept | -2.507 | 0.495 | <0.001 |
| AUGP | 0.204 | 0.017 | <0.001 |
| AGE20 | 14.148 | 1.134 | <0.001 |
| AGE65 | 4.373 | 0.954 | <0.001 |
| ENROLLP | 2.006 | 0.508 | <0.001 |
| RACEMINO | -0.955 | 0.199 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| TR | 1.810 | 0.201 | <0.001 |

Aim 2

We take a subset of the cohort analyzed in Aim 1, keeping only U.S. counties with universities and colleges. This sample included 1069 counties. The result is summarized in Table 2. The online enrollment proportion is negatively associated with the outcome (\(\beta\) = -0.33, p-value < 0.001). The result can be interpreted as follows: holding the hybrid enrollment proportion fixed, if we move 10% of the enrolled university students in a county from in-person to online mode, the mean number of daily new confirmed cases per 10,000 county population would decrease by about 0.033 during this study period (i.e., a reduction of approximately 1.2 new confirmed COVID-19 cases per day in a county the size of Washtenaw County). We also found that the hybrid enrollment proportion is negatively associated with the outcome (\(\beta\) = -0.27, p-value = 0.012). This suggests that moving 10% of the enrolled students from in-person mode to hybrid mode would reduce the mean number of daily new confirmed cases per 10,000 county population by 0.027 (i.e., a reduction of approximately 1.0 new confirmed COVID-19 cases per day in a county the size of Washtenaw County). The interpretation of the other control variables is similar to that in our analysis for Aim 1, and we omit the discussion here.
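The arithmetic behind these interpretations can be written out as a short worked equation. Because the three enrollment proportions sum to one, raising ONLINE% by \(\Delta\) while holding HYBRID% fixed lowers the in-person proportion by the same \(\Delta\), so the model-implied change in the outcome is \(\beta_1 \Delta\); the same reasoning applies to HYBRID% with \(\beta_2\). Using the Table 2 estimates and \(\Delta = 0.10\):

\[ \begin{aligned} \Delta\text{Outcome}_{\text{online}} &= \beta_1 * \Delta = -0.329 * 0.10 \approx -0.033, \\ \Delta\text{Outcome}_{\text{hybrid}} &= \beta_2 * \Delta = -0.272 * 0.10 \approx -0.027, \end{aligned} \]

both expressed in daily new confirmed cases per 10,000 county population. These are model-implied associations under the linear specification.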

Table 2: The estimated coefficients for Aim 2 on 1069 counties. The main exposure variables are the policy enrollment percentages (ONLINE% and HYBRID%). The control variables are the same as in Table 1.

| Variable | Estimate | Std Error | P-value |
| --- | --- | --- | --- |
| Intercept | -0.729 | 0.743 | 0.327 |
| AUGP | 0.425 | 0.036 | <0.001 |
| AGE20 | 12.109 | 1.824 | <0.001 |
| AGE65 | -1.511 | 1.499 | 0.314 |
| ONLINE% | -0.329 | 0.095 | <0.001 |
| HYBRID% | -0.272 | 0.108 | 0.012 |
| RACEMINO | -1.454 | 0.264 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| TR | 1.581 | 0.267 | <0.001 |

Sensitivity Analysis

For Aim 1, we found evidence of an early-stage effect: the enrollment proportion was significantly associated with the number of new confirmed COVID-19 cases in September, but the association was no longer significant in October. This suggests that the effect of school reopening varies over time and is strongest in the first few weeks after reopening. We also observed this time-varying effect in Figure 2, where there was an initial case surge in September followed by a “catch-up” phenomenon in October. Second, we found that adding online enrollment to the exposure variable in Aim 1 strengthens the statistical significance of the enrollment proportion. It is likely that students who enroll online might still be around campus or local areas and contribute to the risk of coronavirus spread. Third, converting the continuous enrollment proportion into a binary indicator leads to a weaker association between the outcome and university enrollment for Aim 1. This suggests that in-person and hybrid enrollment should be measured as a continuous variable, rather than as a simple indicator, to reflect the extent to which university students returned to campus. Fourth, the additional control variable, population density, was not significantly associated with the outcomes. Finally, our results are not sensitive to the inclusion of U.S. counties with large populations. The details of these model fitting results are summarized in Appendix A1.

Discussion

In this study, we examined the effect of university and college reopening on the spread of COVID-19 in their surrounding communities during the fall. Our findings are twofold. First, we found that students returning to campus was associated with an increase in new confirmed COVID-19 cases in the counties in which the universities and colleges reside. Second, we assessed the effects of three different reopening policies and found that reopening online or partially online was associated with slower spread of the virus, in comparison to in-person reopening.

Our findings could provide principled guidance for the upcoming winter/spring 2021 semesters. For example, because of the impact of enrollment on neighboring areas, regions with universities and colleges should expect a spike in confirmed cases in the first few weeks after students return to campus. Medical supplies such as masks and medication should be prepared in advance, and strict social distancing should be enforced. However, our findings also suggest that these effects wane over time when compared with other U.S. counties without colleges or universities. In addition, for those areas that do choose to reopen, we found evidence that an online or hybrid mode with virtual classes may be the preferable option for mitigating the number of new confirmed COVID-19 cases.

Part II. Describing the distribution of high-risk populations for COVID-19 in IBM MarketScan Data.

Objective

The COVID-19 pandemic has been found to affect some populations more than others. These “high-risk” populations include older adults, those with key medical co-morbidities, and those who take immunomodulatory drugs. Although high-risk individuals have been described by demographics (such as age), fewer data are available on how high-risk populations for COVID-19 are distributed across the United States. We sought to assess this using recent historical data from the IBM MarketScan Data, which contain valuable information on medical co-morbidities and immunomodulatory drug use. This analysis focused on identifying at-risk populations that may be used in future analyses to understand how policies for reopening businesses and other social activities may be tailored to baseline risks within different states.

Data Source

We focused on advanced age as well as 10 medical conditions listed on the CDC website as conferring an increased risk of severe illness from the virus that causes COVID-19. Advanced age was defined as the proportion of the population 65 years or older. Medical conditions include cancer, chronic kidney disease, chronic obstructive pulmonary disease (COPD), heart conditions (including heart failure, coronary artery disease, cardiomyopathies, and pulmonary hypertension), immunocompromised state from solid organ transplant, obesity, severe obesity, sickle cell disease, smoking, and type 2 diabetes mellitus. In addition, we included immunocompromised state from use of immunomodulatory medicines, such as corticosteroids and biologics, that are growing in use and might place patients at an increased risk. To explore the distribution of high-risk populations, we used historical patient data, including diagnosis and medication information, from the 2018 IBM MarketScan Data. We identified the ICD-10 codes related to the above 10 medical conditions by using the Clinical Classification Software Refined (CCSR) database made available by the Agency for Healthcare Research and Quality (AHRQ).

Method

We calculated the proportion of individuals aged 65 years or older in each state. For each medical condition, we counted the number of patients in each state who had at least one corresponding ICD-10 code in the MarketScan database. We then used the proportion of MarketScan patients with the medical condition as an estimate of the proportion of the state population with that condition. Next, we identified individuals enrolled in the MarketScan database who used immunomodulatory medicines to estimate the proportion of the population taking these medicines. We used categories defined by prior work in our group that included classes of medications (e.g., corticosteroids, biologics, etc.). Finally, we estimated the proportion of the population at increased risk in each state by combining the results for all medical conditions and immunomodulatory medicines.
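The counting steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function name, the input layout, and the short list of ICD-10 category prefixes are hypothetical, and the prefixes stand in for the full CCSR-derived code lists, which are not reproduced here. The combined measure is taken as the union of patients with any condition or immunomodulatory medicine use, which is one reading of "combining the results."

```python
# Illustrative ICD-10 category prefixes only (assumption); the report derived
# its code lists from the AHRQ CCSR, which is not reproduced here.
CONDITION_PREFIXES = {
    "type2_diabetes": ("E11",),
    "chronic_kidney_disease": ("N18",),
    "copd": ("J44",),
    "sickle_cell_disease": ("D57",),
}


def state_proportions(enrollees, claims, med_users):
    """Return {state: {measure: proportion of enrollees}}.

    enrollees: {patient_id: state}
    claims:    iterable of (patient_id, icd10_code) pairs
    med_users: set of patient_ids with an immunomodulatory prescription
    """
    # A patient counts for a condition after at least one matching code.
    flagged = {condition: set() for condition in CONDITION_PREFIXES}
    for patient_id, code in claims:
        if patient_id not in enrollees:
            continue
        normalized = code.replace(".", "").upper()
        for condition, prefixes in CONDITION_PREFIXES.items():
            if normalized.startswith(prefixes):
                flagged[condition].add(patient_id)

    flagged["immunomodulatory_meds"] = {p for p in med_users if p in enrollees}
    # Combined measure: union of all conditions and medicine use (assumption).
    flagged["all_high_risk"] = set().union(*flagged.values())

    # Denominator: number of enrollees in each state.
    denominators = {}
    for state in enrollees.values():
        denominators[state] = denominators.get(state, 0) + 1

    result = {state: {} for state in denominators}
    for measure, patient_ids in flagged.items():
        counts = dict.fromkeys(denominators, 0)
        for patient_id in patient_ids:
            counts[enrollees[patient_id]] += 1
        for state, total in denominators.items():
            result[state][measure] = counts[state] / total
    return result


# Hypothetical toy data: six enrollees in two states.
enrollees = {1: "MI", 2: "MI", 3: "MI", 4: "MI", 5: "OH", 6: "OH"}
claims = [(1, "E11.9"), (1, "E11.65"), (2, "N18.3"), (3, "Z00.00"), (5, "J44.1")]
med_users = {2, 4, 6}

for state, measures in sorted(state_proportions(enrollees, claims, med_users).items()):
    print(state, measures)
```

Because patients are stored in sets, a patient with several codes for the same condition, or with several conditions, is counted once in each measure and once in the combined measure.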

Results

We calculated, for each state, the proportion of the population with conditions that place, or might place, individuals at increased risk; the results are summarized in Table 3.

Table 3: The population proportion with advanced age or with medical conditions that place, or might place, individuals at increased risk for COVID-19, by U.S. state.
| State | All High-risk | Heart Conditions | Immunocompromised From Transplant | Immunocompromised From Medicines | Age 65+ | Cancer | Chronic Kidney Disease | COPD | Obesity | Severe Obesity | Sickle Cell Disease | Smoking | Type 2 Diabetes Mellitus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | 0.3789 | 0.0424 | 0.0014 | 0.0171 | 0.1733 | 0.0359 | 0.0236 | 0.0200 | 0.2438 | 0.0549 | 0.0016 | 0.0003 | 0.1171 |
| Alaska | 0.2445 | 0.0327 | 0.0014 | 0.0181 | 0.1252 | 0.0330 | 0.0145 | 0.0155 | 0.1112 | 0.0193 | 0.0001 | 0.0009 | 0.0982 |
| Arizona | 0.2745 | 0.0369 | 0.0018 | 0.0171 | 0.1798 | 0.0436 | 0.0229 | 0.0165 | 0.1405 | 0.0211 | 0.0005 | 0.0003 | 0.0978 |
| Arkansas | 0.2987 | 0.0544 | 0.0016 | 0.0165 | 0.1736 | 0.0337 | 0.0169 | 0.0248 | 0.1444 | 0.0315 | 0.0009 | 0.0005 | 0.1113 |
| California | 0.2509 | 0.0365 | 0.0017 | 0.0178 | 0.1478 | 0.0463 | 0.0238 | 0.0142 | 0.1057 | 0.0113 | 0.0005 | 0.0001 | 0.1013 |
| Colorado | 0.2184 | 0.0343 | 0.0016 | 0.0186 | 0.1463 | 0.0431 | 0.0182 | 0.0162 | 0.0964 | 0.0137 | 0.0003 | 0.0003 | 0.0713 |
| Connecticut | 0.2369 | 0.0360 | 0.0016 | 0.0184 | 0.1768 | 0.0448 | 0.0130 | 0.0145 | 0.1116 | 0.0135 | 0.0006 | 0.0001 | 0.0788 |
| Delaware | 0.3332 | 0.0388 | 0.0017 | 0.0184 | 0.1940 | 0.0435 | 0.0178 | 0.0180 | 0.1934 | 0.0315 | 0.0015 | 0.0002 | 0.1157 |
| Florida | 0.3547 | 0.0570 | 0.0017 | 0.0189 | 0.2094 | 0.0600 | 0.0347 | 0.0258 | 0.1914 | 0.0265 | 0.0017 | 0.0002 | 0.1207 |
| Georgia | 0.3121 | 0.0388 | 0.0018 | 0.0188 | 0.1429 | 0.0388 | 0.0211 | 0.0158 | 0.1743 | 0.0323 | 0.0017 | 0.0003 | 0.1107 |
| Hawaii | 0.3683 | 0.0836 | 0.0052 | 0.0279 | 0.1896 | 0.0911 | 0.0661 | 0.0250 | 0.1058 | 0.0118 | 0.0005 | 0.0005 | 0.1487 |
| Idaho | 0.2774 | 0.0407 | 0.0032 | 0.0258 | 0.1627 | 0.0412 | 0.0268 | 0.0236 | 0.1378 | 0.0264 | 0.0001 | 0.0006 | 0.1095 |
| Illinois | 0.2629 | 0.0453 | 0.0018 | 0.0202 | 0.1612 | 0.0428 | 0.0211 | 0.0207 | 0.1152 | 0.0185 | 0.0007 | 0.0003 | 0.1025 |
| Indiana | 0.2961 | 0.0422 | 0.0016 | 0.0189 | 0.1613 | 0.0375 | 0.0172 | 0.0260 | 0.1559 | 0.0307 | 0.0006 | 0.0008 | 0.1048 |
| Iowa | 0.2683 | 0.0382 | 0.0017 | 0.0183 | 0.1753 | 0.0372 | 0.0168 | 0.0186 | 0.1498 | 0.0243 | 0.0005 | 0.0006 | 0.0913 |
| Kansas | 0.2679 | 0.0419 | 0.0019 | 0.0179 | 0.1632 | 0.0354 | 0.0161 | 0.0190 | 0.1344 | 0.0240 | 0.0005 | 0.0005 | 0.1001 |
| Kentucky | 0.3425 | 0.0403 | 0.0014 | 0.0198 | 0.1680 | 0.0411 | 0.0179 | 0.0242 | 0.2078 | 0.0439 | 0.0004 | 0.0006 | 0.1126 |
| Louisiana | 0.2945 | 0.0440 | 0.0016 | 0.0141 | 0.1594 | 0.0332 | 0.0192 | 0.0158 | 0.1588 | 0.0282 | 0.0014 | 0.0004 | 0.1097 |
| Maine | 0.2689 | 0.0431 | 0.0012 | 0.0193 | 0.2122 | 0.0413 | 0.0146 | 0.0272 | 0.1176 | 0.0183 | 0.0000 | 0.0005 | 0.0997 |
| Maryland | 0.2972 | 0.0411 | 0.0019 | 0.0188 | 0.1587 | 0.0409 | 0.0195 | 0.0172 | 0.1531 | 0.0205 | 0.0024 | 0.0002 | 0.1146 |
| Massachusetts | 0.2256 | 0.0306 | 0.0013 | 0.0175 | 0.1697 | 0.0468 | 0.0142 | 0.0134 | 0.1118 | 0.0126 | 0.0009 | 0.0001 | 0.0662 |
| Michigan | 0.3767 | 0.0541 | 0.0017 | 0.0214 | 0.1768 | 0.0527 | 0.0269 | 0.0292 | 0.2327 | 0.0418 | 0.0010 | 0.0004 | 0.1127 |
| Minnesota | 0.2030 | 0.0292 | 0.0021 | 0.0200 | 0.1632 | 0.0334 | 0.0134 | 0.0096 | 0.0980 | 0.0172 | 0.0004 | 0.0003 | 0.0695 |
| Mississippi | 0.2806 | 0.0388 | 0.0010 | 0.0158 | 0.1635 | 0.0335 | 0.0152 | 0.0158 | 0.1368 | 0.0296 | 0.0013 | 0.0003 | 0.1163 |
| Missouri | 0.3104 | 0.0440 | 0.0018 | 0.0192 | 0.1730 | 0.0387 | 0.0198 | 0.0241 | 0.1757 | 0.0341 | 0.0012 | 0.0009 | 0.1081 |
| Montana | 0.2107 | 0.0342 | 0.0019 | 0.0191 | 0.1932 | 0.0380 | 0.0142 | 0.0184 | 0.0846 | 0.0171 | 0.0000 | 0.0009 | 0.0785 |
| Nebraska | 0.2732 | 0.0391 | 0.0021 | 0.0188 | 0.1615 | 0.0344 | 0.0190 | 0.0228 | 0.1373 | 0.0250 | 0.0005 | 0.0005 | 0.1033 |
| Nevada | 0.2745 | 0.0376 | 0.0015 | 0.0146 | 0.1610 | 0.0370 | 0.0237 | 0.0190 | 0.1313 | 0.0176 | 0.0007 | 0.0003 | 0.1090 |
| New Hampshire | 0.2490 | 0.0374 | 0.0015 | 0.0214 | 0.1867 | 0.0466 | 0.0147 | 0.0207 | 0.1033 | 0.0166 | 0.0001 | 0.0001 | 0.0894 |
| New Jersey | 0.2836 | 0.0439 | 0.0015 | 0.0155 | 0.1661 | 0.0429 | 0.0162 | 0.0161 | 0.1457 | 0.0178 | 0.0011 | 0.0001 | 0.1068 |
| New Mexico | 0.2669 | 0.0347 | 0.0017 | 0.0167 | 0.1801 | 0.0358 | 0.0189 | 0.0199 | 0.1206 | 0.0171 | 0.0002 | 0.0003 | 0.1144 |
| New York | 0.2868 | 0.0411 | 0.0015 | 0.0163 | 0.1694 | 0.0415 | 0.0153 | 0.0193 | 0.1462 | 0.0182 | 0.0016 | 0.0002 | 0.1105 |
| North Carolina | 0.3036 | 0.0348 | 0.0016 | 0.0183 | 0.1670 | 0.0393 | 0.0172 | 0.0164 | 0.1713 | 0.0310 | 0.0016 | 0.0003 | 0.1043 |
| North Dakota | 0.2570 | 0.0364 | 0.0016 | 0.0184 | 0.1573 | 0.0336 | 0.0209 | 0.0143 | 0.1381 | 0.0230 | 0.0000 | 0.0010 | 0.0878 |
| Ohio | 0.3127 | 0.0539 | 0.0017 | 0.0197 | 0.1751 | 0.0442 | 0.0247 | 0.0305 | 0.1594 | 0.0299 | 0.0007 | 0.0006 | 0.1129 |
| Oklahoma | 0.2823 | 0.0452 | 0.0016 | 0.0213 | 0.1605 | 0.0363 | 0.0190 | 0.0198 | 0.1335 | 0.0254 | 0.0004 | 0.0004 | 0.1172 |
| Oregon | 0.2167 | 0.0234 | 0.0013 | 0.0167 | 0.1816 | 0.0359 | 0.0134 | 0.0097 | 0.1058 | 0.0216 | 0.0002 | 0.0002 | 0.0779 |
| Pennsylvania | 0.3147 | 0.0437 | 0.0016 | 0.0192 | 0.1870 | 0.0464 | 0.0185 | 0.0200 | 0.1815 | 0.0327 | 0.0009 | 0.0004 | 0.0948 |
| Rhode Island | 0.3063 | 0.0310 | 0.0011 | 0.0117 | 0.1766 | 0.0449 | 0.0108 | 0.0191 | 0.1898 | 0.0301 | 0.0008 | 0.0002 | 0.0853 |
| South Carolina | 0.3161 | 0.0403 | 0.0015 | 0.0186 | 0.1820 | 0.0410 | 0.0170 | 0.0186 | 0.1723 | 0.0353 | 0.0012 | 0.0003 | 0.1095 |
| South Dakota | 0.2298 | 0.0285 | 0.0020 | 0.0167 | 0.1717 | 0.0347 | 0.0159 | 0.0140 | 0.1181 | 0.0266 | 0.0003 | 0.0009 | 0.0776 |
| Tennessee | 0.2993 | 0.0459 | 0.0015 | 0.0185 | 0.1674 | 0.0376 | 0.0213 | 0.0216 | 0.1494 | 0.0307 | 0.0007 | 0.0005 | 0.1104 |
| Texas | 0.2736 | 0.0390 | 0.0017 | 0.0174 | 0.1288 | 0.0346 | 0.0206 | 0.0148 | 0.1385 | 0.0226 | 0.0009 | 0.0002 | 0.1105 |
| Utah | 0.2571 | 0.0274 | 0.0020 | 0.0165 | 0.1141 | 0.0340 | 0.0159 | 0.0106 | 0.1402 | 0.0285 | 0.0002 | 0.0004 | 0.0922 |
| Vermont | 0.2562 | 0.0420 | 0.0019 | 0.0206 | 0.2004 | 0.0499 | 0.0132 | 0.0262 | 0.1046 | 0.0157 | 0.0001 | 0.0001 | 0.0984 |
| Virginia | 0.2745 | 0.0330 | 0.0017 | 0.0179 | 0.1592 | 0.0373 | 0.0156 | 0.0148 | 0.1449 | 0.0270 | 0.0014 | 0.0005 | 0.1021 |
| Washington | 0.2493 | 0.0316 | 0.0017 | 0.0199 | 0.1589 | 0.0368 | 0.0167 | 0.0130 | 0.1193 | 0.0243 | 0.0004 | 0.0003 | 0.0944 |
| West Virginia | 0.3856 | 0.0590 | 0.0014 | 0.0181 | 0.2048 | 0.0426 | 0.0259 | 0.0362 | 0.2230 | 0.0492 | 0.0002 | 0.0005 | 0.1290 |
| Wisconsin | 0.2503 | 0.0386 | 0.0022 | 0.0220 | 0.1747 | 0.0393 | 0.0199 | 0.0150 | 0.1227 | 0.0228 | 0.0004 | 0.0004 | 0.0891 |
| Wyoming | 0.2069 | 0.0319 | 0.0022 | 0.0188 | 0.1714 | 0.0329 | 0.0112 | 0.0230 | 0.0792 | 0.0158 | 0.0000 | 0.0009 | 0.0816 |

The results for advanced age, total risks, and a subset of the medical conditions are plotted in Figures 7-9, and certain geographic patterns emerge across the states. For example, Figure 7 shows several states with a greater proportion of individuals at an advanced age, such as FL, ME, VT, WV, and MT. Figure 8 shows that people at overall high risk due to various medical conditions are concentrated in states on the east coast and in the south, especially KY, WV, MI, AL, and FL. The left panel of Figure 9 indicates that the distribution of people with heart conditions follows a similar pattern: proportions are generally higher in the northeast and southeast parts of the country, and FL, MI, AR, OH, and WV have among the highest proportions. Lastly, the right panel of Figure 9 shows that states such as ID, OK, WI, MI, and NH have relatively high proportions of people who take immunomodulatory medicines.

Figure 7: The population proportion aged 65 years or above.

Figure 8: The population proportion of people that are at increased risks for COVID-19.

Figure 9: Left: The population proportion with heart conditions (including heart failure, coronary artery disease, cardiomyopathies, and pulmonary hypertension) across U.S. states. Right: The population proportion of people that take immunomodulatory medicines across U.S. states.

Discussion

Results from these analyses suggest variability in at-risk populations for COVID-19 across the United States. A next step would be to correlate some of these baseline values with new and emerging data on the rising number of COVID-19 cases in recent months. A key feature that has been seen clinically is the susceptibility of certain key groups to COVID-19 infection, especially severe infections that lead to hospitalization and complications including death. We used the IBM MarketScan data, which have a number of strengths, including coverage of a broadly representative insured population of the U.S. These data also allowed us to examine features, such as the use of immunomodulatory medicines, that would otherwise be difficult to capture. There are limitations to these data, however, and the results should be interpreted carefully when making broader population-based estimates, given the lack of data on uninsured individuals and the non-random selection of participants within the database.

Summary

In Phase 2 of the AHA COVID-19 Data Challenge, we introduce 2 separate analyses that build on our earlier work by evaluating the impact of the COVID-19 pandemic on key population groups in the United States. In Part I of Phase 2, we report how policies related to university and college reopening affected new confirmed COVID-19 cases in surrounding communities, leveraging several publicly available data sources. In Part II of Phase 2, we used IBM MarketScan Data to describe how the proportion of individuals at high risk for COVID-19 infection within the general population varies across U.S. states. We hope that addressing these questions together will better inform policymakers and public health officials on strategies to address the COVID-19 pandemic as winter arrives and new cases of the infection grow over time.

Appendix

A1 Sensitivity Analysis in Part I

Comparison of the early-stage and late-stage impact of universities’ reopening (for Aim 1 only)

We studied the effect of enrollment proportion on the outcome in September and October separately. The university enrollment proportion had a significant positive association with COVID-19 spread in September, right after reopening (estimate 2.431, p-value < 0.001), but its effect was not significant in October (estimate -0.076, p-value 0.81). Our main study covers both September and October and captures the significant effect of this variable.

Table 4: Sensitivity analysis: comparison of the early-stage and late-stage impact of universities’ reopening for Aim 1. I(Sep) is the September indicator, and I(Oct) is the October indicator, and other variables are the same as Table 1.
| Variable | Estimate | Std Error | P-value |
|---|---|---|---|
| Intercept | 0.182 | 0.227 | 0.423 |
| I(Oct) | -1.017 | 0.035 | <0.001 |
| AUGP | 0.404 | 0.011 | <0.001 |
| AGE20 | 3.544 | 0.517 | <0.001 |
| AGE65 | 0.651 | 0.436 | 0.135 |
| RACEMINO | -0.350 | 0.090 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| TR | 0.378 | 0.092 | <0.001 |
| AUGP * I(Oct) | -0.406 | 0.015 | <0.001 |
| ENROLLP * I(Sep) | 2.431 | 0.315 | <0.001 |
| ENROLLP * I(Oct) | -0.076 | 0.315 | 0.81 |
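
As a reading aid, the p-values for the two enrollment interaction terms can be approximated from the reported estimates and standard errors. The sketch below assumes a two-sided Wald test with a normal reference distribution; this section does not state which test the fitted model used, so the function name and the test choice are assumptions rather than a description of the original analysis.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, using a normal approximation."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2))


# Estimates and standard errors copied from Table 4.
p_sep = wald_p_value(2.431, 0.315)   # ENROLLP * I(Sep)
p_oct = wald_p_value(-0.076, 0.315)  # ENROLLP * I(Oct)

print(f"ENROLLP * I(Sep): p = {p_sep:.2g}")
print(f"ENROLLP * I(Oct): p = {p_oct:.2f}")
```

Under this assumption the October term comes out close to the 0.81 reported in Table 4, and the September term falls well below 0.001, consistent with an effect that is concentrated in the month right after reopening.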

Replacing the exposure variable with the proportion of all types of enrollment (for Aim 1 only)

The estimated coefficients are quite close to the estimates in our main model. Adding online enrollment to the exposure variable makes the new exposure variable even more significant than in the main model (\(\beta\): 2.006 \(\rightarrow\) 1.671; p-value: 8.18e-5 \(\rightarrow\) 1.45e-5). A potential explanation is that students who enroll online might still be on or near campus and contribute to the risk of coronavirus spread.

Table 5: Sensitivity analysis: including all types of enrollment for Aim 1. ENROLLP_ALL means enrollment proportion of all types, and other variables are the same as Table 1.
| Variable | Estimate | Std Error | P-value |
|---|---|---|---|
| Intercept | -2.609 | 0.498 | <0.001 |
| AUGP | 0.206 | 0.017 | <0.001 |
| AGE20 | 14.301 | 1.136 | <0.001 |
| AGE65 | 4.619 | 0.962 | <0.001 |
| ENROLLP_ALL | 1.671 | 0.385 | <0.001 |
| RACEMINO | -0.969 | 0.199 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| TR | 1.801 | 0.201 | <0.001 |

Binarizing the covariate in-person and hybrid enrollment proportion (for Aim 1 only)

The binarized enrollment indicator (i.e., whether a county has any students enrolled in in-person or hybrid mode) was not significant (p-value: 0.93). This suggests that in-person and hybrid enrollment should be measured as a continuous variable, rather than as a simple indicator, to reflect the extent to which university students returned to campus within each county.

Table 6: Sensitivity analysis: binarizing enrollment proportion (in-person + hybrid) for Aim 1. ENROLL_BINARY means enrollment indicator, and other variables are the same as Table 1.
| Variable | Estimate | Std Error | P-value |
|---|---|---|---|
| Intercept | -2.146 | 0.495 | <0.001 |
| AUGP | 0.202 | 0.017 | <0.001 |
| AGE20 | 13.793 | 1.139 | <0.001 |
| AGE65 | 3.553 | 0.957 | <0.001 |
| ENROLL_BINARY | -0.006 | 0.067 | 0.93 |
| RACEMINO | -0.993 | 0.200 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| TR | 1.811 | 0.202 | <0.001 |
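
To make the binarization concrete, the following minimal sketch shows how a continuous enrollment proportion collapses into an indicator. The county values and the function name are hypothetical and are used only for illustration; they are not taken from the study data.

```python
def binarize(proportions):
    """Map each enrollment proportion to 1 if any students returned, else 0."""
    return [1 if p > 0 else 0 for p in proportions]


# Hypothetical county-level in-person + hybrid enrollment proportions.
enrollp = [0.0, 0.002, 0.05, 0.31]
enroll_binary = binarize(enrollp)

for proportion, indicator in zip(enrollp, enroll_binary):
    print(f"ENROLLP = {proportion:<5} -> ENROLL_BINARY = {indicator}")
```

A county where 0.2% of residents are returning students and one where 31% are returning students receive the same indicator value, which is one plausible reason the indicator carries less information than the continuous proportion.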

Including population density as a control variable

We included population density as an additional control variable for both Aim 1 and Aim 2. Population density was not a significant covariate in either model (p-value > 0.05). In other words, we did not see a significant association between population density and confirmed case rates.

Table 7: Sensitivity analysis: including population density as control variable for Aim 1. POP_DEN means population density, and other variables are the same as Table 1.
| Variable | Estimate | Std Error | P-value |
|---|---|---|---|
| Intercept | -2.524 | 0.496 | <0.001 |
| AUGP | 0.204 | 0.017 | <0.001 |
| AGE20 | 14.218 | 1.141 | <0.001 |
| AGE65 | 4.417 | 0.957 | <0.001 |
| ENROLLP | 2.002 | 0.509 | <0.001 |
| RACEMINO | -0.966 | 0.200 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| POP_DEN | 0.000 | 0.000 | 0.573 |
| TR | 1.810 | 0.201 | <0.001 |

Table 8: Sensitivity analysis: including population density as control variable for Aim 2. POP_DEN means population density, and other variables are the same as Table 1.
| Variable | Estimate | Std Error | P-value |
|---|---|---|---|
| Intercept | -0.846 | 0.749 | 0.259 |
| AUGP | 0.424 | 0.036 | <0.001 |
| AGE20 | 12.513 | 1.856 | <0.001 |
| AGE65 | -1.279 | 1.512 | 0.398 |
| ONLINE% | -0.330 | 0.095 | <0.001 |
| HYBRID% | -0.272 | 0.108 | 0.012 |
| RACEMINO | -1.491 | 0.265 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| POP_DEN | 0.000 | 0.000 | 0.244 |
| TR | 1.571 | 0.267 | <0.001 |

Excluding large population counties

We repeated our analysis for Aim 1 and Aim 2 after excluding the largest 5% of counties by population. The re-estimated coefficients are very similar to those of the original model, which suggests that our results are not sensitive to the inclusion of counties with large populations.
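
The exclusion step can be sketched as follows. This is an illustration under assumptions: the function name and the toy county list are hypothetical, and rounding the number of excluded counties up is one possible convention that the report does not specify.

```python
def drop_largest(counties, percent=5):
    """Return counties with the largest `percent` percent by population removed.

    counties: list of (name, population) pairs.
    The number removed is rounded up (an assumption, not stated in the report).
    """
    n_drop = (len(counties) * percent + 99) // 100  # integer ceiling
    ranked = sorted(counties, key=lambda county: county[1])
    return ranked[: len(ranked) - n_drop]


# Hypothetical toy data: 20 counties with populations 1,000 to 20,000.
counties = [(f"county_{i:02d}", 1000 * (i + 1)) for i in range(20)]
kept = drop_largest(counties)

print(len(counties), "->", len(kept), "counties; largest kept:", kept[-1])
```

The models would then be refit on the retained counties only, and the coefficients compared against the original estimates.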

Table 9: Sensitivity analysis: excluding large population counties for Aim 1. All variables are the same as Table 1.
| Variable | Estimate | Std Error | P-value |
|---|---|---|---|
| Intercept | -2.542 | 0.505 | <0.001 |
| AUGP | 0.203 | 0.017 | <0.001 |
| AGE20 | 14.259 | 1.156 | <0.001 |
| AGE65 | 4.271 | 0.974 | <0.001 |
| ENROLLP | 1.971 | 0.515 | <0.001 |
| RACEMINO | -0.926 | 0.206 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| TR | 1.852 | 0.206 | <0.001 |

Table 10: Sensitivity analysis: excluding large population counties for Aim 2. All variables are the same as Table 1.
| Variable | Estimate | Std Error | P-value |
|---|---|---|---|
| Intercept | -0.944 | 0.783 | 0.228 |
| AUGP | 0.429 | 0.038 | <0.001 |
| AGE20 | 13.176 | 1.929 | <0.001 |
| AGE65 | -1.787 | 1.585 | 0.26 |
| ONLINE% | -0.285 | 0.099 | 0.004 |
| HYBRID% | -0.253 | 0.112 | 0.024 |
| RACEMINO | -1.529 | 0.284 | <0.001 |
| MEDHHINC | 0.000 | 0.000 | <0.001 |
| TR | 1.658 | 0.284 | <0.001 |

  1. Department of Statistics, University of Michigan, yangly@umich.edu

  2. Department of Statistics, University of Michigan, chengmc@umich.edu

  3. Department of Statistics, University of Michigan, weijtang@umich.edu

  4. Department of Statistics, University of Michigan, xfzhang@umich.edu

  5. Department of Statistics, University of Michigan, jizhu@umich.edu

  6. School of Medicine, University of Michigan, bnallamo@med.umich.edu, 734-647-1624