Testing a COVID-19 vaccine on a large sample space of population from different nationalities

I came across an article a few days ago while checking the number of COVID-19 vaccine candidates, and the trial phase of Sinopharm's vaccine caught my eye. The article says:

In mid-July, Sinopharm launched its phase three trial among 15,000 volunteers-aged 18 to 60, with no serious underlying conditions-in the United Arab Emirates. The company selected the UAE, as it has a diverse population with approximately 200 different nationalities, making it an ideal testing ground.

Now, my question is: why is a diverse population, as in the UAE or any country with 150+ nationalities, ideal for the testing phase of a COVID-19 vaccine? Is it because of the diverse gene pool among candidates from so many different nationalities? I don't understand why testing on a population with approximately 200 different nationalities matters. Since everyone's DNA is different, wouldn't they get the same results by testing on a large sample of people who share the same nationality (for 2-3 generations or more)?

Association of poor housing conditions with COVID-19 incidence and mortality across US counties

Poor housing conditions have been linked with worse health outcomes and infectious disease spread. Since the relationship of poor housing conditions with incidence and mortality of COVID-19 is unknown, we investigated the association between poor housing condition and COVID-19 incidence and mortality in US counties.


We conducted a cross-sectional analysis of county-level data from the US Centers for Disease Control, US Census Bureau and Johns Hopkins Coronavirus Resource Center for 3135 US counties. The exposure of interest was the percentage of households with poor housing conditions (one or more of: overcrowding, high housing cost, incomplete kitchen facilities, or incomplete plumbing facilities). Outcomes were incidence rate ratios (IRRs) and mortality rate ratios (MRRs) of COVID-19 across US counties through 4/21/2020. Multilevel generalized linear modeling (with total population of each county as a denominator) was utilized to estimate relative risk of incidence and mortality related to poor housing conditions with adjustment for population density and county characteristics including demographics, income, education, prevalence of medical comorbidities, access to healthcare insurance and emergency rooms, and state-level COVID-19 test density. We report IRRs and MRRs for a 5% increase in the prevalence of households with poor housing conditions.


Across 3135 US counties, the mean percentage of households with poor housing conditions was 14.2% (range 2.7% to 60.2%). On April 21st, the mean (SD) number of cases and deaths of COVID-19 were 255.68 (2877.03) cases and 13.90 (272.22) deaths per county, respectively. In the adjusted models standardized by county population, with each 5% increase in the percentage of households with poor housing conditions, there was a 50% higher risk of COVID-19 incidence (IRR 1.50, 95% CI: 1.38–1.62) and a 42% higher risk of COVID-19 mortality (MRR 1.42, 95% CI: 1.25–1.61). Results remained similar using earlier timepoints (3/31/2020 and 4/10/2020).
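To give a sense of scale for these ratios, the sketch below assumes that the reported per-5% ratios compound multiplicatively across larger increases, as they would in a log-linear rate model. This is an illustration only: the function name and the 10-point example are hypothetical and are not taken from the study.

```python
def scaled_rate_ratio(ratio_per_step: float, step_points: float, increase_points: float) -> float:
    """Rate ratio implied for `increase_points`, given a ratio per `step_points`.

    Assumes a log-linear model, so ratios multiply across successive steps.
    """
    if ratio_per_step <= 0:
        raise ValueError("rate ratio must be positive")
    if step_points <= 0:
        raise ValueError("step size must be positive")
    return ratio_per_step ** (increase_points / step_points)


# Reported ratios per 5% increase in households with poor housing conditions.
IRR_PER_5 = 1.50
MRR_PER_5 = 1.42

# Hypothetical comparison: counties that differ by 10 percentage points.
irr_10_points = scaled_rate_ratio(IRR_PER_5, 5.0, 10.0)
mrr_10_points = scaled_rate_ratio(MRR_PER_5, 5.0, 10.0)

print(f"Implied IRR for a 10-point difference: {irr_10_points:.2f}")
print(f"Implied MRR for a 10-point difference: {mrr_10_points:.2f}")
```

Under this assumption a 10-point difference corresponds to the per-5% ratio squared, not doubled. Whether the association stays log-linear across the full observed range (2.7% to 60.2%) is not something the abstract states.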

Conclusions and relevance

Counties with a higher percentage of households with poor housing had higher incidence of, and mortality associated with, COVID-19. These findings suggest targeted health policies to support individuals living in poor housing conditions should be considered in further efforts to mitigate adverse outcomes associated with COVID-19.

Citation: Ahmad K, Erqou S, Shah N, Nazir U, Morrison AR, Choudhary G, et al. (2020) Association of poor housing conditions with COVID-19 incidence and mortality across US counties. PLoS ONE 15(11): e0241327.

Editor: Jeffrey Shaman, Columbia University, UNITED STATES

Received: May 22, 2020 Accepted: October 13, 2020 Published: November 2, 2020

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability: All relevant data are within the manuscript and its Supporting information files.

Funding: Research reported in this publication was supported by VA HSRD grant I01 HX002422-01A2 (WWC), Research Project Grant NIH NHLBI R01HL139795 (A.R.M.) and an Institutional Development Award (IDeA) from NIH NIGMS P20GM103652 (A.R.M.). This work was also supported by Career Development Award Number 7IK2BX002527 from the United States Department of Veterans Affairs Biomedical Laboratory Research and Development Program (A.R.M.). The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the United States government. This work was also supported by a Lifespan CVI Pilot Grant for Faculty (A.R.M.). Dr Erqou is supported by funding from the Department of Veterans Affairs, Veterans Health Administration, VISN 1 Career Development Award. Dr Erqou also received funding from Center for AIDS Research, The Rhode Island Foundation, and Lifespan Cardiovascular Institute.

Competing interests: The authors have declared that no competing interests exist.


The novel betacoronavirus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of coronavirus disease 2019 (COVID-19), emerged in the city of Wuhan within the Hubei Province of China in late 2019 [1,2,3]. As of February 22, 2021, an estimated 111 million individuals have been infected with SARS-CoV-2 globally, with over 2.47 million deaths attributed to infection worldwide. At least 110 countries have reported > 10,000 confirmed cases, with the current number of worldwide cases increasing at a rate of 400,000–500,000 cases per day [4]. SARS-CoV-2 shares significant homology with multiple betacoronaviruses that have produced outbreaks of viral pneumonias, the most notable being severe acute respiratory syndrome (SARS) in 2003 and Middle East Respiratory Syndrome (MERS) beginning in 2012 [5,6,7]. Viruses in the SARS-CoV and MERS-CoV families share significant homology with betacoronaviruses that commonly circulate among bat populations, and each appears to have acquired infectivity for humans following transmission through an intermediate host: civets (Paguma larvata) in SARS-CoV, pangolins (Pholidota) in SARS-CoV-2 [7,8,9], and camels in MERS-CoV [8,9,10,11]. Symptoms of infection with these betacoronaviruses include fever, cough, dyspnea, fatigue, muscle weakness, headache, nausea, and diarrhea [12, 13]. Loss of smell has been reported in patients with SARS-CoV-2 [12], and, as with other betacoronaviruses, severe cases progress to pneumonia, myocarditis, cytokine storm, hypercoagulability, acute respiratory distress syndrome (ARDS), septic shock, complete respiratory failure, and multiple organ failure, with a high rate of fatality upon onset of these symptoms [12,13,14,15]. Increasing evidence suggests an elevated risk of abnormal blood clotting and thrombosis upon severe infection, including a Kawasaki disease-like syndrome in children, who have been thought to be a low-risk age group for disease progression [14, 16].

SARS-CoV-2 is a lipid membrane-enveloped, plus-sense RNA virus that fuses with the host cell membrane to enter host cells and replicate (Fig. 1) [17]. Infectivity metrics have varied for SARS-CoV-2, depending on region and collection methodology [7, 16]. Estimates suggest the basic reproduction number (R0), or the expected number of cases directly generated by one infected individual, was 1.40–3.9 during the initial infection surges in Italy and mainland China, with aggregate measurements calculating the average value to be 2.5–3.5 [16]. The corresponding doubling time has been estimated at 3.1 days for the Italian outbreak, slightly longer than the estimated 1.4–3.0 day doubling time reported in mainland China [16]. By comparison, SARS and MERS had estimated R0 values of 2.0–4.0 and 2.0–5.0 and doubling times of 16.2 and 7–12 days, respectively, although these estimates are largely based on isolated datasets and may not represent true values throughout entire populations [18, 19]. Case fatality rate and infection fatality rate (CFR/IFR) estimates for SARS-CoV-2 vary significantly with age, gender, and regional infection prevalence, but current estimates put the absolute rate at approximately 0.68% (0.53–0.82%) [19]. In agreement with several analyses of population-wide outcomes, the largest analysis of SARS-CoV-2 outcomes to date (17.425 million adults) reinforced that age is the predominant risk factor, with the highest hazard ratios (HR) for severe morbidity and mortality following SARS-CoV-2 infection [(Age > 80, HR 12.64) vs. (Age > 70, HR 4.77) vs. (Age > 60, HR 2.09)], followed by recent organ transplant (HR 4.27), diagnosis of blood borne malignancy [(< 1 year since diagnosis, HR 3.52) vs. (< 5 years since diagnosis, HR 3.12)], metabolic disease [(uncontrolled diabetes, HR 2.36) vs. (controlled diabetes, HR 1.50) and (obese class III, HR 2.27) vs. (obese class II, HR 1.56)], male sex (HR 1.99), stroke or dementia (HR 1.79), uncontrolled chronic respiratory conditions (HR 1.78), chronic renal disease (HR 1.72), ethnicity [(Black, HR 1.71) vs. (Mixed, HR 1.64) vs. (Asian, HR 1.62)], as well as other chronic conditions [20]. Multiple epidemiological studies have implicated various micronutrients as potential risk factors for poor disease progression. It remains unclear whether serum levels of micronutrients such as vitamin C, vitamin D, selenium, or zinc directly contribute to poorer outcomes or merely reflect an acute phase response [21].
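The doubling times quoted in this section can be related to an exponential growth rate, and from there to a rough reproduction number, with a few lines of arithmetic. The sketch below is a simplified illustration, not the method used in the cited studies. It assumes pure exponential growth, and it uses the approximation R ≈ 1 + r·Tg, which holds only under an exponentially distributed generation interval. The 5-day mean generation interval is a hypothetical placeholder, and all function names are illustrative.

```python
import math


def growth_rate_from_doubling_time(doubling_time_days: float) -> float:
    """Exponential growth rate r (per day) such that cases double every T_d days."""
    if doubling_time_days <= 0:
        raise ValueError("doubling time must be positive")
    return math.log(2) / doubling_time_days


def growth_factor(doubling_time_days: float, elapsed_days: float) -> float:
    """Multiplicative increase in cases after `elapsed_days` of exponential growth."""
    if doubling_time_days <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (elapsed_days / doubling_time_days)


def approx_reproduction_number(doubling_time_days: float,
                               mean_generation_interval_days: float) -> float:
    """Rough R from R ~= 1 + r * Tg (assumes an exponential generation interval)."""
    r = growth_rate_from_doubling_time(doubling_time_days)
    return 1.0 + r * mean_generation_interval_days


ITALY_DOUBLING_TIME = 3.1            # days, as quoted in the text
ASSUMED_GENERATION_INTERVAL = 5.0    # days, hypothetical value for illustration

r_italy = growth_rate_from_doubling_time(ITALY_DOUBLING_TIME)
r0_estimate = approx_reproduction_number(ITALY_DOUBLING_TIME, ASSUMED_GENERATION_INTERVAL)

print(f"growth rate: {r_italy:.3f} per day")
print(f"approximate R under the stated assumptions: {r0_estimate:.2f}")
print(f"growth over 14 days: x{growth_factor(ITALY_DOUBLING_TIME, 14):.1f}")
```

With these inputs the approximation gives a value inside the 1.40–3.9 range reported above, but the result is sensitive to the assumed generation interval and should not be read as an independent estimate.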

SARS-CoV-2 Viral Entry Mechanisms and Machinery. (a) SARS-CoV-2 is a lipid membrane, enveloped, plus-sense ( +) single strand (ss) RNA betacoronavirus that must undergo host lipid membrane fusion in order to gain entry into the host cell. Potential inhibitors for subsequent steps of this process are depicted. Enveloped viruses are capable of entering the host cell via (1) direct, neutral pH, plasma membrane fusion or via (2) endocytosis, where membrane fusion would rely on pH-dependent proteases and optimal intra-endosomal conditions [144, 241]. (b) Structural diagrams of key enzymes involved in viral-cellular entry. (c) Structural diagram of a spike protein (S) depicting the location of S1 and S2 subunits, following S protein cleavage, and the altered conformational states (closed and open). To initiate the entry process, S protein must undergo a conformational change from a closed to open state, which exposes the receptor binding domain (RBD) on S, allowing it to bind to angiotensin converting enzyme 2 (ACE2) on the host cell [40]. Altered S structure bound to ACE2 and S cleaved products are also shown. PDB codes for structures are referenced in Additional file 1: Table 5. Figure was created with

SARS-CoV-2 is primarily transmitted by respiratory droplets and aerosols, with relatively less secondary transmission potentially stemming from stable viral particles on surfaces and fomites [22,23,24]. SARS-CoV-2 has rapidly spread through the community because of its high infectivity rate and asymptomatic viral carriers who unknowingly infect close contacts [25, 26]. Efforts to curb viral spread have differed based on region and municipality, though several methods have been effective at controlling the rate and burden of infection, such as robust testing, contact tracing, self-isolation after confirmed or potential infection, adoption of physical distancing in shared settings, avoidance of large public gatherings, frequent hand washing, use of viricidal disinfectants, and mask wearing [27]. Routine testing protocols followed by aggressive tracing of recent contacts have proven to be effective methods of controlling viral spread, notably in South Korea and New Zealand [24, 28]. Data continue to emerge regarding the efficacy of maintaining physical distance and mask wearing, especially indoors, where insufficient ventilation increases the likelihood of viral aerosol transmission and group spread of viral infections [29].

In addition to these methods, advances in SARS-CoV-2 testing have allowed rapid identification of infected individuals. The first generations of tests developed were PCR-based tests targeting SARS-CoV-2-specific nucleotide sequences, largely differing only in the primer and probe sequences used by various developers and manufacturers [30, 31]. All first-generation tests were conducted via nasopharyngeal swab, although bronchoalveolar lavage and sputum-based diagnostic tools have since been introduced with comparable efficacy [32]. Each of these methods is subject to similar limitations, including the need for a quality primary sample from patients, proper and efficient sample handling, and avoidance of mutations in the viral genome that decrease efficacy of selected primers and probes [33, 34]. More recent generations of PCR-based tests, including less invasive nasopharyngeal and saliva-based tests, are more heat-stable and have less stringent preservation conditions. Even given these limitations, PCR-based in vitro diagnostic (IVD) tools have demonstrated a sensitivity of about 70% and a specificity of 90% or higher [34,35,36,37]. Several of these newer generation tests also demonstrate improved sensitivity and specificity metrics, with values now routinely reaching 90% and 95% or higher, respectively [38]. Combining these IVD tools with chest CT increases the combined sensitivity of diagnosis to up to 94% [35]. Serology-based tests have been implemented; however, these tests have low specificity and positive predictive value. Even at a true population prevalence of 10%, most serology tests do not achieve a positive predictive value above 75%, and many demonstrate a false positive rate of up to 50%. Fortunately, more recent iterations have improved positive predictive value, especially as the population prevalence has increased [39].
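The dependence of positive predictive value on prevalence follows directly from Bayes' rule, and a short calculation shows why a test with seemingly good specificity can still fall below 75% at a prevalence of 10%. In the sketch below, the sensitivity and specificity values (90% and 95%) are hypothetical inputs chosen for illustration. They are not figures reported for any particular serology test.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(infected | positive test) from Bayes' rule.

    All three inputs are proportions in [0, 1].
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    if true_positives + false_positives == 0:
        raise ValueError("no positive results are possible with these inputs")
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics, for illustration only.
SENSITIVITY = 0.90
SPECIFICITY = 0.95

ppv_at_10_percent = positive_predictive_value(SENSITIVITY, SPECIFICITY, 0.10)
ppv_at_30_percent = positive_predictive_value(SENSITIVITY, SPECIFICITY, 0.30)

print(f"PPV at 10% prevalence: {ppv_at_10_percent:.1%}")
print(f"PPV at 30% prevalence: {ppv_at_30_percent:.1%}")
```

With these assumed inputs, about one in three positive results at 10% prevalence would be a false positive, and the same test performs noticeably better at 30% prevalence. This matches the observation that positive predictive value improves as population prevalence rises.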

Considering the viral genomic, structural, and functional aspects of SARS-CoV-2 and its strains, this review will focus on three precise targets for antiviral activity: spike (S) protein, the main viral protease (Mpro), and RNA-dependent RNA polymerase (RdRp). Aspects of these targets will be comprehensively covered, including existing treatment options, challenges to robust and sustained antiviral activity, and potential for modulation and optimization. There is a compelling need for highly effective and rapidly implementable antiviral compounds and therapies, especially because early therapeutic options have shown only minimal capacity to limit morbidity and mortality in the most vulnerable populations. A list of current and emerging therapies is summarized in Table 1 and Additional file 1: Table 1–4.

Spike protein pathophysiology

The S protein is the main virulence and antigenic determinant of SARS-CoV-2 and assembles to form a homotrimeric complex expressed at the external surface of the virus (Fig. 1). This S protein complex protrudes from the virus, peppering the outer lipid membrane like a crown, from which the coronavirus name is derived. It acts to bind its cellular target and to mediate membrane fusion. For SARS-CoV and SARS-CoV-2, angiotensin converting enzyme 2 (ACE2) is the major human receptor for the S protein and facilitates viral entry [3, 40] (Fig. 1). ACE2 is highly expressed in the small and large intestines, kidney epithelium, male gonads, gallbladder, cardiomyocytes, and thyroid follicular cells [41]. More modest expression occurs in respiratory and bronchial epithelium, alveolar macrophages, and type II pneumocytes, which may explain why SARS-CoV-2 cases present most commonly as respiratory infections and transmit by aerosols [24, 42]. Collectively, the diversity of expression may contribute to interorgan transmission and systemic manifestations [42]. As previously mentioned, the S protein is cleaved into the S1 subunit, which is primarily responsible for receptor binding, and the S2 subunit, which is involved in the fusion between viral and host membranes (Figs. 1 and 2). Certain conformations are required for each subunit to perform its function, which is why multiple cleavage events are associated with cellular entry [43]. This orchestrated cleavage process is also thought to be important for antigen masking prior to target receptor binding, as immunogenic receptor binding domain epitopes largely remain buried until viral attachment and fusion are initiated [44]. These steps offer several opportunities for therapeutic targeting. The receptor binding domain (RBD) of S protein lies within the S1 subunit and is expressed at the apical surface of each S monomer. Following RBD-ACE2 binding, S1 dissociates from S2, at which point S2 catalyzes membrane fusion (Fig. 2). Potential therapies targeting SARS-CoV-2 S protein will be discussed with emphasis on vaccination, RBD-ACE2 blockade, and fusion inhibitors.

SARS-CoV-2 Membrane Fusion Pathway. (a) Structural diagrams of some key elements of S2 involved in membrane fusion. (b) Schematic summary of the essential steps in viral-host membrane fusion. Following the binding to ACE2, S protein must be cleaved by a protease, such as Transmembrane Serine Protease 2 (TMPRSS2), furin or cathepsin L to generate the S1 and S2 subunits, in order to release the S1 subunit thus exposing the fusogenic core of S2 [109, 121, 242]. With its hydrophobic core exposed, S2 protein is now in a high-energy, pre-fusion, metastable state, fostered by the energetic imbalance induced by its uncovered core [150]. The S2 subunit can undergo a conformational change, extending heptad repeat 1 (HR1) and heptad repeat 2 (HR2) domains, and injecting its fusion peptide (FP) into the membrane of the host cell, forming the pre-hairpin intermediate. This pre-hairpin structure then folds back into a six helix bundle (6-HB), pulling apart the host membrane. Finally, the viral and host membranes fuse with one another, as HR1 and HR2 fold into a trimer of hairpins resulting in pore formation [152, 243]. The viral genome is then able to access the intracellular space of the host cell for transcription and replication. PDB codes for structures are referenced in Additional file 1: Table 5. Figure was created with


Vaccination is an attractive therapeutic option as it offers the potential for long-term immunity. The S protein is the logical target for vaccine development because it is expressed at the viral surface and is susceptible to recognition by circulating antibodies. Vaccines designed against S proteins were the most efficacious among vaccine candidates for past betacoronavirus pandemics. Existing strategies for designing an efficacious vaccine include preparations of full-length S protein, RBD-only peptide, RBD DNA-containing nanoparticles, RBD mRNA-containing nanoparticles, inactivated virus, and recombinant viral vectors. A number of these approaches have proceeded through Phase III clinical trials and are discussed below.

Moderna’s (Cambridge, MA, USA) lipid nanoparticle mRNA-based vaccine for full length SARS-CoV-2 S protein (mRNA-1273) began Phase III placebo-controlled COVID-19 prevention clinical trials on July 14th, 2020 [45]. Critically, this vaccine candidate demonstrated a vaccine efficacy of 94.1% (94.4% in individuals under 65 with known risk factors, 86.4% in individuals over the age of 65) in its recently completed phase III trial, including robust protection in elderly individuals. The trial enrolled 30,420 individuals (96% of those randomized into the treatment arm received both vaccine injections) spanning a diverse array of age, socioeconomic, and health demographics, and demonstrated virtually complete protection against severe clinical disease and mortality, with no sustained adverse events reported [46].
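For readers unfamiliar with how a vaccine efficacy figure such as 94.1% is obtained, the sketch below uses the common definition of efficacy as one minus the ratio of attack rates in the vaccinated and placebo arms. It is a simplified illustration: actual trial analyses typically use person-time at risk and report confidence intervals. The case counts and arm sizes are hypothetical and are not taken from any trial discussed here.

```python
def vaccine_efficacy(cases_vaccinated: int, n_vaccinated: int,
                     cases_placebo: int, n_placebo: int) -> float:
    """Vaccine efficacy as 1 - (attack rate in vaccinated / attack rate in placebo)."""
    if n_vaccinated <= 0 or n_placebo <= 0:
        raise ValueError("arm sizes must be positive")
    if cases_placebo <= 0:
        raise ValueError("efficacy is undefined without cases in the placebo arm")
    attack_rate_vaccinated = cases_vaccinated / n_vaccinated
    attack_rate_placebo = cases_placebo / n_placebo
    return 1.0 - attack_rate_vaccinated / attack_rate_placebo


# Hypothetical two-arm trial with equal allocation (illustrative numbers only).
efficacy = vaccine_efficacy(cases_vaccinated=10, n_vaccinated=10_000,
                            cases_placebo=100, n_placebo=10_000)

print(f"Estimated vaccine efficacy: {efficacy:.1%}")
```

An efficacy of 90% in this example means the vaccinated arm saw one tenth of the placebo arm's attack rate. It does not mean that 10% of vaccinated participants became infected.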

Notably, Moderna was awarded $483 million in US federal funding for this vaccine and has partnered with Lonza to produce a projected one billion doses annually. Production and orders for this vaccine have escalated significantly in recent months. Clinical trials evaluating the vaccine's safety and efficacy in specific populations, including pregnant and youth populations, are now underway. Phase I clinical trials of this vaccine demonstrated robust, dose-dependent neutralizing antibody production and CD4-predominant T cell engagement after administration of two doses of the vaccine, separated by 28 days, both in the age 18–55 cohort and in the age 56+ cohort [47, 48]. Importantly, both B cell and T cell immunity were generated in patients over the age of 71, who represent the most at-risk population for severe COVID-19 outcomes. Vaccination reactions, reportedly limited to fatigue, chills, headache, myalgia, and pain at the injection site, were reported by over half of recipients when prompted. Consistent with these findings, recently published data in primates also suggest that mRNA-1273 is able to generate robust T cell immunity, as well as inhibition of mucosal replication in these animals, which has proven elusive and a limiting factor in multiple vaccine candidates to date [47, 49]. This vaccine requires 2 doses and must be stored at −20 °C.

BioNTech (Mainz, Germany) partnered with Pfizer to develop four mRNA vaccine preparations encoding either secreted or membrane-anchored full-length or RBD-only S protein constructs (BNT162 b1, b2, b3, and b4) [50]. Of these variations, the b1 (secreted trimerized S glycoprotein) and b2 (lipid/membrane anchored full-length S protein locked in its pre-fusion conformation) variants emerged as the candidates that entered Phase II/III trials [51, 52]. BNT162b2 induced relatively fewer and less severe side effects, with an immune response equivalent to that of the b1 variant, and was therefore chosen as the construct administered for both doses of the recently completed Phase III trial. Similar to the Moderna vaccine candidate, BNT162b2 demonstrated a vaccine efficacy of 95.0% (94.7% in individuals over the age of 65) among a diverse enrollment of 43,448 individuals. There were no sustained adverse events reported in the experimental group. Trials have also begun in additional populations for this vaccine candidate [53]. BioNTech and Pfizer have reported robust, dose-dependent immunity induction, with SARS-CoV-2-specific antibody titers 1.9–4.6 times greater than those found in convalescent human sera following COVID-19 infection [54]. As with mRNA-1273, a significant number of participants reported side effects, the majority of which were also limited to mild-to-moderate flu-like symptoms and pain at the injection site. This vaccine requires 2 doses and must be maintained at −80 °C.

The University of Oxford, in partnership with AstraZeneca (Cambridge, United Kingdom), also adopted the vectored virus route and has completed several phase III clinical trials involving its candidate vaccine, ChAdOx1 nCoV-19 (Fig. 1a) [54], between April and November of 2020 [55, 56]. The trials demonstrated a collective vaccine efficacy of 62.1% in 23,848 enrolled individuals. Efficacy in older individuals could not be determined from these trials [57]. The trials enrolled individuals across a similarly diverse population distribution, though they were marred by inconsistencies in trial administration. No lasting long-term side effects could be definitively attributed to the vaccine, though there were at least two cases of transverse myelitis reported in the Phase III trials. Recent evidence suggests similar levels of protection after a single dose, with a booster demonstrating increased serological markers of immunity when given up to 90 days later [57]. There are concerns about the possibility of DNA integration from the modified adenoviral vaccine, but this has not been reported to date [58]. ChAdOx1 nCoV-19 is a chimpanzee-derived adenovirus that expresses full-length SARS-CoV-2 S protein. Published data show induction of S-protein-specific neutralizing antibodies in subjects as part of phase I/II trials, as well as in vaccinated rhesus macaques [59, 60]. In pre-clinical development, viral RNA was detected in bronchoalveolar lavage fluid in 33% of vaccinated animals, although this number may be misleading, as viral load was lower in these animals compared to controls, and viral RNA was approaching undetectable levels in almost all vaccinated animals a week after infection. Still, the failure to prevent infection and viral shedding in a third of vaccinated animals raises concerns. Importantly, there was no pulmonary pathology in vaccinated monkeys seven days post inoculation with SARS-CoV-2, whereas inflammatory infiltrates, hyperplasia, and edema were pronounced in controls [59]. In mice, a booster dose appears to significantly improve vaccine efficacy and protective effects, including in aged mice [55]. Taken together, ChAdOx1 nCoV-19 may prove beneficial by reducing disease severity; however, there is a concern that it may not limit viral spread in the population. Concerns remain over the viability of an adenoviral vaccine delivery mechanism, as a large portion of the population harbors anti-adenoviral antibodies [56].

A recombinant human adenovirus type 5 vaccine developed by CanSino Biologics that expresses full-length S protein has progressed into Phase III clinical trials [61]. In the phase I trial, 100% of participants in the high dose group (1.5 × 10^11 viral particles) achieved seroconversion (> fourfold increase in antibody titer) to the RBD at 28 days post-vaccination [62].

Johnson & Johnson’s Ad26.COV2.S is an adenoviral vaccine that has completed Phase III clinical trials and expresses a stabilized pre-fusion S protein complex. Recent releases claim an overall vaccine efficacy of 72% among 43,783 enrolled participants of varied demographics. The vaccine candidate demonstrates 85% protection against moderate and severe infection, including from the B.1.351 variant. It is administered as a single shot and can be stored at normal refrigeration temperatures, which eases distribution.

China leads the field in inactivated SARS-CoV-2 vaccine preparations. The most extensively developed of these is sponsored by Sinovac Research and Development Co. Ltd. [63, 64]. Sinovac’s (Beijing, China) purified inactive SARS-CoV-2 virus vaccine, CoronaVac (Fig. 1c), has entered Phase III clinical trials in China, Turkey, and Brazil. The Brazil trial has recently concluded, with preliminary reports claiming an overall vaccine efficacy of 50.4% and a 78% efficacy in prevention. In phase I/II trials, vaccine administration produced robust immune responses in both young and old participants, though immunity was relatively lower in older adults [65]. No severe adverse events were reported, and neutralizing antibodies developed 14 days after the vaccination in its preliminary clinical trial. Similar findings were found after vaccine administration in rhesus macaques [66]. Interestingly, sera of mice vaccinated with inactivated vaccines (as opposed to clonal S protein antigens) display neutralizing efficacy against 11 different SARS-CoV-2 strains with broad phylogenetic variation.

Rapid advances have been made by Novavax (Gaithersburg, MD, USA) in the area of recombinant protein vaccines, as its protein subunit vaccine, NVX-CoV2373, has now completed Phase III trials in the United Kingdom. The company’s recombinant S protein vaccine candidate demonstrated an 89.3% vaccine efficacy in over 20,000 enrolled participants and nearly complete elimination of severe disease progression [67]. This included robust protection against the emerging variant strain B.1.1.7 (deemed the UK variant); however, vaccine efficacy dropped to 60% against the B.1.351 variant (South African variant). In pre-clinical and early phase trials, NVX-CoV2373 induced robust anti-S protein antibody titers as well as CD4+ T helper cell reactivity. Novavax has received a $1.6 billion investment from the United States' Operation Warp Speed with intent to produce 100 million doses of the NVX-CoV2373 candidate vaccine [68].

The first DNA vaccine, INO-4800 (Fig. 1a), is currently in phase I/II trials [69]. Inovio (Plymouth Meeting, PA, USA) developed this vaccine, which encodes full-length S protein and induced T cell responses and S1-, S2-, and RBD-specific IgG production in mice. Notably, sera of INO-4800-vaccinated mice inhibited S protein binding to ACE2, which would offer the added benefit of limiting viral spread [70]. Inovio announced in a press release that 94% of vaccine recipients produced threshold immune responses, noting a robust induction of both neutralizing-antibody and T cell responses in these recipients. Inovio reports that it plans to continue Phase III trials in early 2021, after the trials were paused by the US FDA pending further investigation. Separately, a comparable strategy developed by Symvivo (British Columbia, Canada) employs Bifidobacterium longum engineered to deliver a DNA plasmid encoding the full-length SARS-CoV-2 S protein (SARS-2-SP). The vaccine, named bacTRL-S (Fig. 1a), has initiated phase I clinical trials in British Columbia and Nova Scotia, Canada [71]. While no preclinical data are available, the pathogen-associated molecular patterns (PAMPs) of the bacterial vector could theoretically help boost an adaptive immune response.

A significant number of additional vaccine and antibody-based therapies are in various stages of pre-clinical and clinical developments (Table 1 and Additional file 1: Table 1–4) [72,73,74].

RBD-ACE2 blockers

While vaccination is an ideal modality for SARS-CoV-2 prophylaxis, achieving neutralizing antibody titers high enough to prevent infection can take weeks [75]. It is important to have therapies available to treat patients infected with SARS-CoV-2 before a vaccine is readily available to everyone and can be potentially useful in different strains when vaccines are less effective. Treatments that precisely inhibit RBD-ACE2 interactions may play an important role in reducing morbidity and mortality as this receptor-ligand interaction is essential for host cell entry [3, 76]. Vaccine-derived antibodies overlap with this strategy since RBD has been identified as the predominant antigen targeted by vaccine-induced antibodies against the S protein [66]. For this reason, viral neutralizing antibodies are commonly presumed to be RBD-specific. However, this is not always the case, as antibodies binding S outside the RBD have neutralizing efficacy without inhibiting ACE2 binding [77, 78]. Similarly, RBD-binding antibodies can neutralize viral particles without competing for ACE2 binding [79, 80]. It has been reported that destabilization of the prefusion metastable complex by antibody binding can disrupt virulence in the absence of competition for the ACE2 binding site [81]. To our knowledge, no study has compared the neutralizing efficacy of antibodies that do or do not competitively antagonize RBD-ACE2 interactions. This information could potentially narrow the search for the optimal monoclonal anti-SARS-CoV2-S antibody. Over 160 clinical trials examining convalescent plasma for SARS-CoV-2 treatment are accessible on It is possible that a polyclonal repertoire of IgG/IgM clones obtained in plasma may synergize mechanistically and provide greater efficacy than monoclonal strategies. Indeed, convalescent sera therapies will likely be less susceptible to treatment resistance as new SARS-CoV-2 strains evolve. 
Detailed discussions of the clinical efficacy of convalescent sera can be found in a recent review [82]. Among the monoclonal antibodies that have progressed through Phase III clinical trials, only Regeneron’s REGN-COV2 has demonstrated apparent efficacy throughout Phase I/II clinical trials. The REGN-COV2 cocktail (since renamed REGEN-COV, casirivimab and imdevimab), consisting of two fully humanized monoclonal antibodies against the SARS-CoV-2 S protein, reduces viral load in proportion to initial viral load at the onset of treatment, and Regeneron has announced a 100% reduction in severe disease in individuals receiving the drug cocktail. The antibody cocktail binds and sequesters SARS-CoV-2 viral particles, preventing their interaction with cellular receptor proteins [83, 84].

Unfortunately, any discussion of S protein targeting therapies is incomplete without a discussion of emerging strain variations and genetic variability. Since the time the SARS-CoV-2 genome was first sequenced in January 2020, many mutations have been identified in samples isolated from patients in similar locations, indicating the virus may diverge into several sub-strains [6, 7, 85, 86]. This is relevant to drug and vaccine design, since testing efficacy, antiviral resistance, and vaccine efficacy may depend significantly on the genetic stability of target epitopes. At least 93 distinct mutations have been isolated from different regions, with the largest percentage clustered in the open reading frame 1b (ORF1b) (48 mutations) and the S protein (14 mutations) encoding regions. From these mutations, a number of definitive lineages have emerged [87]. Perhaps most prominent among these are B.1.1.7 (UK variant), B.1.351 (South African variant), and the P.1 (Brazil, B.1.1.28 branch) lineage [87]. The branch lineages predominantly represent alterations in the immunoreactive regions of the S protein. These variants have demonstrated varying degrees of immune escape, including from convalescent sera [88]. Vaccine efficacy against these variants has been variable, with almost all approved candidates retaining efficacy against the B.1.1.7 (UK) variant. Efficacy has been less consistently retained against the B.1.351 variant, including significant reductions in efficacy seen with the Johnson and Johnson, Novavax, and Oxford/AstraZeneca vaccines [89, 90]. The exact efficacy of each of these vaccines will be clarified by additional data. Fortunately, none of the variants have demonstrated apparent differences in virulence and mortality. Multiple variants initially appeared to be more infectious; however, variant case numbers have recently declined, which challenges this claim. It remains unclear whether these variants are truly more infectious or were simply novel pathogens in the affected regions.
The variant regions are also common targets for diagnostic tools and therapies, making both the frequency and the location of mutations directly relevant to the efficacy of viral treatment and containment [6, 7, 85, 86]. Specifically, PCR-based IVD technologies use primers that are commonly complementary to regions in the ORF1 or S protein sequence. Several antivirals target charge-specific locations in either the RdRp/nonstructural protein (nsp) 12, receptor binding domain, viral proteases, or viral-activating/processing enzymes, either at the nucleic acid or protein level. Of the virus-targeting therapeutics that have been developed or are in pre-clinical development (Table 1), nucleoside analogs and phagolysosome modulators are potentially more resistant to genetic mutations. Many other treatments depend on viral structure and are therefore more susceptible to viral mutations.

Reducing the expression of cellular ACE2 offers a separate strategy for limiting viral infection. There was initial concern that patients taking ACE inhibitors and angiotensin II receptor blockers (ARBs), which are known to upregulate ACE2 expression, would be more susceptible to infection [91]. However, no evidence has emerged suggesting that renin–angiotensin–aldosterone system (RAAS) inhibitors negatively impact patient outcomes [92,93,94]. In fact, these agents might actually improve clinical course for hospitalized SARS-CoV-2 patients [95]. Although these data need to be confirmed, the present evidence would suggest that RAAS inhibitors should not be stopped in the setting of SARS-CoV-2. Conversely, it has been hypothesized that downregulating ACE2 would reduce viral infection and improve outcomes. Isotretinoin (Fig. 1a), an FDA-approved acne medication, was predicted to be the strongest down-regulator of ACE2 expression [96]. Several trials have incorporated isotretinoin to treat SARS-CoV-2, alone or as a combination therapy to enhance other RBD-ACE2 targeting agents [97,98,99,100,101]. It should be noted that the immunomodulatory effects of isotretinoin may improve outcomes independently of its ACE2-regulating effects; however, these mechanisms are outside the scope of this review.

Non-antibody therapies targeting the RBD-ACE2 axis are mechanistically simpler in that they are exclusively competitive antagonists and steric inhibitors of target engagement. Soluble SARS-CoV-2 RBD inhibited pseudoviral entry in human ACE2-expressing cells [102]. In the setting of acute therapy, existing antibodies against SARS-CoV2 RBD may inhibit its efficacy. There are currently no clinical trials investigating recombinant RBD peptides for treating SARS-CoV-2; however, the preclinical groundwork has been laid for therapeutic development [103]. In contrast, two clinical trials utilizing recombinant ACE2 protein to treat SARS-CoV-2 have been approved for enrollment [97, 104]. Before the COVID-19 pandemic, Khan and colleagues found human recombinant soluble ACE2 (hrsACE2) was well tolerated by patients receiving treatment for acute respiratory distress syndrome (ARDS) [105]. Preclinical work investigating hrsACE2 in the setting of SARS-CoV-2 found that hrsACE2 inhibited viral attachment and replication within ACE2-expressing Vero E6 cells (Fig. 2b) [106]. It has also been reported that L-SIGN and DC-SIGN are low affinity receptors that can mediate SARS infection [107, 108]. Limited work has been done to develop therapeutics specifically targeting these alternative entry mechanisms. As this is the third coronavirus outbreak since 2002, another outbreak is almost certain. Of these outbreaks, SARS-CoV and SARS-CoV-2 both facilitate endosome-mediated infection by binding ACE2 [3, 40, 76]. Developing novel therapies that target RBD-ACE2 interactions will likely benefit patients affected by SARS-CoV-2. Importantly, these therapies could be rapidly adapted to treat future coronavirus serotypes that target ACE2.

Viral membrane fusion inhibitors

The ACE2 binding of S protein induces a conformational change that opens cleavage sites accessible to cellular proteases (Figs. 1 and 2). Cleavage at the S1/S2 and then S2′ sites induces conformational changes that permit the catalytic fusion of viral and cellular membranes by the fusion protein [109] (Figs. 1 and 2). Inhibiting select proteases, such as cathepsins, furin, and transmembrane protease serine 2 (TMPRSS2), offers alternative methods to prevent viral-host membrane fusion, thus halting viral invasion. Preclinical data indicate that TMPRSS2 is a promising therapeutic target. Camostat mesylate, a serine protease inhibitor, can perturb TMPRSS2 activity, which is vital for viral entry and replication within Calu-3 (lung epithelial) cells (Fig. 2b) [110]. Although the greatest effect was achieved by targeting TMPRSS2, the authors noted that the degree of viral inhibition was increased with the concomitant use of a cathepsin inhibitor. Oral camostat mesylate and the more potent intravenous serine protease inhibitor, nafamostat mesylate (Fig. 2b), are approved treatments for pancreatitis in Japan [111]. Nafamostat mesylate is currently under investigation in Phase II/III trials (NCT04418128). The satisfactory safety profiles associated with these drugs have permitted the rapid enrollment of patients for phase II and III clinical trials [112,113,114,115].

Precise targeting of cathepsins, a class of cysteine proteases, may improve the efficacy of TMPRSS2-directed interventions as suggested in preclinical models [110]. Indeed, it has been suggested that cathepsin L (CatL) is vital for facilitating S protein-directed entry into HEK293T cells [76]. Teicoplanin and dalbavancin (Fig. 1a), two glycopeptide antibiotics, could prevent S-directed pseudoviral entry in vitro by inhibiting CatL [116]. The calpain and cathepsin inhibitor, BLD-2660 (Fig. 1a), is a small molecule that was initially designed for fibrotic diseases but is currently being adapted for SARS-CoV-2 patients. The anti-IL-6 and anti-fibrotic actions paired with the hypothetical benefits of cathepsin inhibition make BLD-2660 an attractive candidate for treatment [117]. However, accumulating evidence suggests that these effects may be mitigated by anti-inflammatory effects. This is supported by the relative efficacy of corticosteroids (namely dexamethasone) and anti-IL-6 modalities (namely tocilizumab), as implementation of dexamethasone has decreased mortality by as much as 35% in severe patients in some studies. Dexamethasone is now considered a component of the standard of care in treating moderate to severe disease, while recent evidence also suggests that tocilizumab may convey clinically relevant efficacy in preventing disease progression and sequelae [118,119,120].

Furin and furin-like proprotein proteases are ubiquitously expressed and have dynamic functions. Coutard and others predicted furin cleavage sites unique to SARS-CoV-2 within the S1/S2 and S2′ domains [121]. Hoffman and others extended this work to find that furin cleaves SARS-CoV2-S protein at the S1/S2 motif, and that this cleavage is essential for viral entry into human lung epithelium and cell–cell spread [109, 122]. While no registered clinical trial targeting furin for the treatment of SARS-CoV-2 exists, one study is designed to investigate the role of tranexamic acid, a plasmin inhibitor, in the cleavage of the S protein complex at the furin site [123]. This may change as the understanding of S protein cleavage advances, but human therapies will likely need to be short in duration and aerosolized to prevent undesirable systemic toxicities. Interestingly, preprint data from Poschet and others report azithromycin and chloroquine (CQ) reduce furin activity, which might derive from their putative lysosomotropic actions (Fig. 2b) [124]. TMPRSS2 is the most rational precision target because it boasts the greatest reduction in SARS-CoV-2 virulence while maintaining low toxicity.

S protein proteolytic cleavage can also be inhibited by endosomal pH alterations [110, 125]. CQ was shown to inhibit the acidification of endosomes, which prevented SARS-CoV-SP-mediated pseudovirus infection in Vero E6 cells [125]. Vincent and others extended this work to find that CQ impaired glycosylation of ACE2, which may also impair viral infection [126]. Given that SARS-CoV-2 utilizes ACE2 to undergo receptor-mediated endocytosis, it is plausible that the anti-viral mechanisms would translate to SARS-CoV-2 [76]. In a letter to the editor, Wang and colleagues report that CQ inhibited SARS-CoV-2 infection and replication in Vero E6 cells, which would support the proposed mechanism of action [127]. CQ and its derivative, hydroxychloroquine (HCQ), are FDA-approved drugs for autoimmune and parasitic conditions, which has spurred their rapid incorporation into clinical trials and off-label use. Initially, CQ came under criticism after a 96,032 patient cohort study reported increased mortality associated with SARS-CoV-2 patients receiving CQ/HCQ in the presence or absence of a macrolide. This paper has since been retracted after the validity and rigor of the multinational registry used to acquire the data could not be verified [128]. Since then, several large datasets have emerged that illustrate that HCQ is ineffective in reducing either morbidity or mortality secondary to COVID-19 infection [129,130,131,132,133,134]. Presently, clinical outcomes do not appear to improve with CQ/HCQ treatment, with or without adjuvant treatment [135,136,137,138].

Thus far, the pharmaceuticals discussed have interfered with S protein cleavage, which is required for virus-cell fusion. Another putative therapeutic approach could be to directly interfere with the fusion motifs located on S2. Following RBD-ACE2 engagement and receptor-mediated endocytosis, S1 is cleaved and released from S2, which permits the insertion of the fusion peptide (FP) into the endosomal membrane [109, 139, 140]. FP insertion induces the helical heptad repeat 1 (HR1) motif to self-assemble into a trimeric coiled-coil structure. Next, S2 folds in upon itself whereby the distal HR2 helices insert into the grooves of the apical HR1 coil, forming a stable 6 helix bundle (6-HB) [141,142,143,144,145]. S2 folding and formation of the 6-HB shortens the distance between viral and host membranes (Fig. 2). This juxtaposition causes the viral and endosomal lipid bilayers to fuse with one another so that the virion can escape the endosome and access the host cytoplasm [146,147,148,149,150]. SARS-CoV-2 differs from SARS-CoV in that it can induce syncytial formation by catalyzing cell–cell fusion [141]. Therefore, therapies targeting the S2 fusion machinery have the potential to act at the level of endosome-cytoplasmic entry as well as spread between adjacent host cells. Recombinant HR1/HR2 peptides can disrupt 6-HB formation and prevent membrane fusion. There are two known compounds, EK1 and IPB01, which are HR2 sequence-derived peptides that prevent 6-HB formation by binding to HR1 [151, 152]. Attaching carboxy-terminal cholesterol groups to each peptide enhanced the antiviral efficacy and potency of these compounds. The updated lipopeptide names for the aforementioned peptides are EK1C4 and IPB02, respectively (Fig. 2b) [141, 152]. There are no studies registered for the use of HR-targeting compounds, but development should be encouraged as such compounds already display reactivity against a broad range of coronavirus strains.

Pro-teasing out the main protease (M Pro) and its inhibitors

In coronaviruses, the main protease (M Pro), also known as 3C-like proteinase (3CLpro) or nsp5, performs the first major step to activate viral replication [153,154,155]. M Pro is encoded in two large polyproteins, pp1a and pp1ab (Fig. 3), which are cleaved by autoproteolysis to release a series of nonstructural proteins (nsps) involved in viral replication [156, 157]. This initial cleavage step performed by M Pro is necessary before it can activate other proteins involved in coronavirus replication. M Pro consists of three major domains containing a Cys-His catalytic dyad and a group of four major substrate-binding sites located between domains I and III (often referred to as S1, S2, S3, and S4 sites) [158,159,160,161]. Overall, the SARS-CoV-2 M Pro shares 96% sequence identity with the SARS-CoV M Pro, and multiple domains, such as the substrate-binding sites, are well conserved among many coronaviruses [162, 163]. M Pro acts on 11 highly specific cleavage recognition sites of Leu-Gln↓(Ser, Ala, Gly) between different interdomain junctions [153, 155, 158]. The M Pro structure has low homology with endogenous human proteases, which makes it an ideal target for highly specific protease inhibitors with low toxicity [158]. By inhibiting the action of M Pro, it is possible to prevent the activation of other proteins required for viral replication.
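The Leu-Gln↓(Ser, Ala, Gly) recognition pattern can be illustrated with a short sequence scan. The following is a minimal sketch only: the function names and the toy polyprotein string are hypothetical, the pattern is reduced to the three residues named above, and real substrate recognition by M Pro depends on additional structural context that a text search does not capture.

```python
import re

# Leu-Gln followed by Ser, Ala, or Gly; the lookahead keeps the P1' residue
# out of the match so that m.end() points at the scissile bond after Gln.
MPRO_SITE = re.compile(r"LQ(?=[SAG])")


def find_cleavage_sites(protein: str) -> list[int]:
    """Return 0-based indices of the first residue after each predicted cut."""
    return [m.end() for m in MPRO_SITE.finditer(protein.upper())]


def cleave(protein: str) -> list[str]:
    """Split a one-letter amino acid sequence at every predicted cut site."""
    bounds = [0, *find_cleavage_sites(protein), len(protein)]
    return [protein[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real viral polyprotein.
    toy_polyprotein = "MKTLQSAAVLQGPPLQKTT"
    print(find_cleavage_sites(toy_polyprotein))
    print(cleave(toy_polyprotein))
```

In the toy sequence, the third Leu-Gln pair is followed by Lys and is therefore not treated as a cut site, which mirrors the specificity described in the text.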

Key elements of SARS-CoV-2 viral replication and some therapeutic targets. (a) (+) RNA viruses are ‘ribosome-ready’, meaning that upon cytoplasmic entry, their genome is recognized by the host ribosome as mRNA and can immediately be translated. During translation, the viral genome employs a technique called 'ribosomal frameshifting' to produce two types of polyproteins, pp1a and pp1ab, which encode the non-structural proteins (nsps) including the viral protease M Pro/nsp5 [232]. M Pro first autocleaves the polyproteins at a Gln/Ala and Gln/Ser junction, then cleaves most of the remaining proteins from the first two reading frames of the viral genome, including RdRp [232]. RdRp integrates with nsp7 and nsp8 to assemble into the polymerase holoenzyme. (b) The 3′ region of SARS-CoV-2 RNA genome encodes its structural proteins, S, Envelope (E), Membrane (M) and Nucleocapsid (N) proteins. Discontinuous transcription of the 3′ region generates a nested set of subgenomic (−) RNAs that are copied into (+) mRNA, resulting in the host ribosomal translation of the structural proteins [196]. RdRp is also responsible for replicating the viral genome for packaging. Replication and transcription processes are localized into interconnected, double-membraned, ER-derived vesicles called replicase-transcriptase complex (RTC) [189, 244], which centralize the machinery required for these processes and serve as a buffer to any host immune response [245]. Viral structural proteins are translated by host ribosomes from the subgenomic RNA synthesized by RdRp. After processing in the ER-Golgi Intermediate Compartment (ERGIC), the structural proteins and viral RNA are transported to budding vesicles. Finally, virus particles are assembled and released by exocytosis. PDB codes for structures are referenced in Additional file 1: Table 5. Figure was created with

M Pro substrate binding relies on a conserved residue pair of Glu-His to sterically recognize the Gln residue on the target substrate in some coronaviruses [164]. This recognition depends primarily on Gln’s carbonyl group; thus, potential M Pro inhibitory compounds should mimic Gln side chain volume, rather than focus on its electrostatic components [153, 162, 164]. M Pro catalysis also relies on a conserved GSCGS motif that maintains the structure of its triple turn catalytic site, located directly opposite a stabilizing region, the partial negative charge cluster (PNCC). In various coronaviruses, PNCC interacts with a water molecule to stabilize Turn II of the active site, increasing the efficiency of M Pro catalytic activity [164]. Since PNCC is located on the enzyme surface, it is a potential target for allosteric inhibition [165, 166].

In SARS-CoV-2, M Pro can have other secondary functions and was found to interact with histone deacetylase 2 (HDAC2), which has a potential cleavage site near the nuclear localization sequence [167]. This suggests M Pro could interrupt the nuclear transport of HDAC2 and inhibit its effects on inflammation, resulting in an overall anti-inflammatory effect. Direct inhibition of M Pro would interfere with the replication cycle and with its other functions in limiting inflammation. Given the important roles of M Pro, protease inhibitors that directly target its unique structure could prove to be effective and form a major component of a drug cocktail to limit SARS-CoV-2 infections (Table 1 and Additional file 1: Table 2).

Many protease inhibitors are in preclinical development and could become an invaluable tool to directly inhibit the main protease and disrupt the primary viral lifecycle. Many of these compounds are peptidomimetic drugs screened by computational modeling, in vitro assays, or cell-based assays. Some drugs were previously effective in other coronaviruses such as SARS-CoV or MERS. Based on computational modeling and high-throughput screening, compound N3 was found to be an irreversible protease inhibitor that can covalently bind to the active site of M Pro and block the entry and docking of other substrate molecules (Fig. 3a) [168]. This tight covalent interaction was confirmed by crystal structure models showing a dimer complex of N3 and M Pro, further stabilized by multiple hydrogen bonds to fully anchor it. N3 was found to be effective at inhibiting M Pro in vitro and in limiting SARS-CoV-2 infection in Vero cells [168].

Compounds 11a and 11b are another group of peptidomimetic drugs, designed specifically to interact with the substrate binding sites of M Pro and thereby directly inhibit its catalytic activity (Fig. 3a) [165]. 11a and 11b have aldehyde groups that can covalently bind to a cysteine in M Pro, and this interaction is further stabilized by additional hydrogen bonds and other interactions. 11a and 11b have shown high efficacy in inhibiting the main protease in vitro. Promisingly, both compounds can be administered intravenously with low toxicity in mammals such as rats and dogs and will potentially be safe in humans. α-ketoamides are another class of structural protease inhibitors that can target the substrate binding sites of M Pro and block its proteolytic activity (Fig. 3a) [158, 169]. In particular, α-ketoamide 13b inhibited the proteolytic activity of recombinant SARS-CoV-2 and MERS M Pro and significantly limited SARS-CoV-2 replication in Calu-3 lung cells. These compounds can be well tolerated and delivered by subcutaneous administration or lung inhalation due to lung tropism.

Previously shown to be effective against SARS-CoV-1, compounds such as compound 4, GC376, and MAC-5576 could also potentially target the SARS-CoV-2 M Pro [170]. In cell-based assays, compound 4 and GC376 significantly inhibited viral replication and did not induce cytotoxicity. These compounds can covalently bind to the active Cys145 residue of the substrate-binding site of M Pro with different modifications, such as by nucleophilic addition in the case of GC376. Out of the four major substrate-binding sites, these inhibitors seem to primarily target the first or second sites (S1 or S2) more than the third and fourth sites (S3 or S4) [170]. Further development of structural protease inhibitors could improve inhibition of all four substrate-binding sites or be used in combination to simultaneously target multiple substrate-binding sites. In addition to the active site, other allosteric sites such as the dimer surface or distal site regions can also be targeted to modulate and inhibit the catalytic activity of M Pro [171]. A combination of different protease inhibitors against both the substrate-binding site and distant allosteric sites could be used synergistically to enhance M Pro inhibition and completely stop viral replication.

While protease inhibitors are primarily in preclinical development, early results show that they are very promising and could potentially be effective at preventing and limiting SARS-CoV-2 infections. Protease inhibitors would be useful in patients with early or moderate symptoms, or possibly for prophylaxis to limit early spread and replication. They would also be a good option for patients who are immunocompromised or have other contraindications and cannot directly receive a vaccine. In addition to their effectiveness, protease inhibitors may also have low toxicity and fewer contraindications, because SARS-CoV-2 M Pro does not share much homology with existing human proteins [158]. This would also decrease the chance of severe side effects and allow more frequent use in different patient populations. Additional modifications can help allow targeted delivery of protease inhibitors into infected cells in the lungs. Some of these compounds, such as α-ketoamides, can be administered by inhalation directly to the lungs, as opposed to intravenous injection, which can further improve their effectiveness [158, 172]. Future studies should test protease inhibitors in preclinical animal models and evaluate them further in clinical trials.

Repurposing HIV and HCV protease inhibitors

Existing protease inhibitors against other viruses such as HIV (human immunodeficiency virus) and HCV (Hepatitis C virus) were considered for use in COVID-19 because they could nonspecifically target viral proteases in general. They are existing FDA-approved drugs with good safety profiles and can be easily mass produced and distributed to many patients, if shown to be effective against SARS-CoV-2. Lopinavir (Fig. 3a) is an aspartate protease inhibitor used in HIV treatment and is often administered with ritonavir to increase its half-life by inhibiting its degradation by cytochrome P450 3A4 [173]. The combination of lopinavir and ritonavir was effective in in vitro studies and trials of patients with SARS-CoV, suggesting that this combination could potentially inhibit the SARS-CoV-2 main protease and viral replication [174, 175]. Lopinavir/ritonavir does not appear to cause any serious side effects or complications. In early July of 2020, the multi-national World Health Organization (WHO) Solidarity Trial paused the recruitment and study of its lopinavir/ritonavir treatment arm for hospitalized patients with severe symptoms due to futility [134, 175, 176]. Data to date suggest that lopinavir/ritonavir does not significantly alter the mortality rate when compared to control groups that received supportive care [134, 177]. There may have also been adverse side effects in some patients that supported the decision to stop the use of lopinavir/ritonavir. Similarly, the lopinavir/ritonavir arm of the Randomised Evaluation of COVid-19 thERapY (RECOVERY) Trial headed by the University of Oxford in the United Kingdom did not show a significant improvement in mortality rate or a reduction in hospital stay for almost 1600 hospitalized patients, many of whom were on supplemental oxygen [176, 178]. However, it was noted that lopinavir/ritonavir was not completely evaluated in patients on mechanical ventilation due to the smaller sample size for this subgroup.
Results from both the WHO Solidarity Trial and the RECOVERY Trial suggest that lopinavir/ritonavir is ineffective in large samples of hospitalized patients with SARS-CoV-2 infection, with no significant reduction in the mortality rate or improvement in other measured clinical outcomes.

Darunavir is another HIV protease inhibitor that was considered for use in COVID-19, because its antiviral activity was similar to that of lopinavir. Like the interaction between lopinavir and ritonavir, cobicistat is often given with darunavir to inhibit cytochrome P450 3A4 activity and increase the bioavailability of darunavir. In vitro studies, however, showed that darunavir did not restrict SARS-CoV-2 infection of Caco-2 cells or improve cell viability [179]. An initial pilot study assessed the effects of darunavir/cobicistat on viral clearance in COVID-19 patients with milder symptoms who had recently tested positive [180]. Compared to standard care, five days of darunavir/cobicistat did not significantly reduce viral load or affect the time to a negative test over a week, but none of the study participants had serious side effects. Other clinical trials involving darunavir/cobicistat are ongoing and a larger sample size could better determine its effectiveness. Similarly, ASC-09, a modified structural version of darunavir, is also being evaluated in combination with ritonavir in different clinical studies [181, 182].

In addition to HIV protease inhibitors, HCV protease inhibitors may also have inhibitory action against M Pro, since HCV NS3 proteases have structural similarities with M Pro [183]. Recent in vitro screening of HCV protease inhibitors found that simeprevir effectively inhibits SARS-CoV-2 replication in cell-based assays using Vero E6 cells and human HEK293T cells [183,184,185]. In addition to inhibition of M Pro, simeprevir may also have inhibitory action on RdRp [185] and can be used synergistically with other drugs such as remdesivir [183,184,185]. Since simeprevir is an FDA-approved drug, it could be quickly tested in clinical trials.

Danoprevir is another repurposed HCV protease inhibitor and is often given with ritonavir to inhibit cytochrome P450 3A4. The first reported clinical study gave treatments of danoprevir/ritonavir to 11 patients with moderate symptoms from COVID-19 [186]. After 4 to 14 days of treatment with danoprevir/ritonavir, all the study participants recovered and were discharged from the hospital without any major side effects. However, it is difficult to fully assess whether danoprevir/ritonavir is effective at reducing viral load without an adequate comparison to proper control groups. Another HCV serine protease inhibitor is boceprevir, which was identified by screening for interaction with M Pro. Boceprevir inhibited recombinant M Pro enzymatic activity and inhibited cytopathic effects of SARS-CoV-2 in Vero cells [187]. While these initial results are promising, future studies should fully evaluate the antiviral effects of HCV protease inhibitors such as danoprevir/ritonavir and boceprevir in various animal models and randomized clinical trials.

Overall, the clinical trial results with repurposed HIV protease inhibitors such as lopinavir/ritonavir and darunavir/cobicistat for COVID-19 have been underwhelming and showed no significant effects on mortality rate, length of hospital stay, or other outcomes. Several factors could explain the recent failure of HIV protease inhibitors in different clinical trials. Some of these studies focused on using lopinavir/ritonavir in hospitalized patients with severe symptoms who required supplemental oxygen or relied on mechanical ventilator support. Protease inhibitors are likely most effective at limiting replication in early stages of disease and may not have a significant impact in severe infections when viral load is extremely high and cannot be fully mitigated. HIV protease inhibitors may also be insufficient by themselves and require the synergistic addition of other therapies to enhance their efficacy. In a preliminary clinical study, lopinavir/ritonavir combined with interferon beta, an immune modulator, and ribavirin, a nucleoside analog and inhibitor, improved symptoms and decreased infection time compared to lopinavir/ritonavir by itself and to standard care [188]. These findings were not replicated in larger patient cohorts within the SOLIDARITY trial, however. Another possibility is that HIV protease inhibitors do not bind the active site or regulatory domains of the SARS-CoV-2 M Pro well enough to inhibit its activity to a clinically significant degree. It appears that repurposing protease inhibitors that target viruses other than coronaviruses may be insufficient for SARS-CoV-2. Thus, specific protease inhibitors designed to target the SARS-CoV-2 M Pro with higher affinity may be more effective and are more likely to show clinical improvement in patients.

RNA-dependent RNA polymerase

RdRp or nsp12 is the viral enzyme responsible for both replicating the RNA genome and transcribing the RNA used for translating the structural and accessory proteins at the 3′ end of the genome. Both of these events occur in interconnected double membrane vesicles that bud off of the host cell’s ER, called replication and transcription complexes [189].

For genomic replication, polymerases employ a conserved method of nucleic acid polymerization called the two-metal mechanism of polymerase catalysis. RdRp catalyzes the formation of a phosphodiester bond using metal ions that are held in place by two conserved aspartic acids in its active site. The conserved sequence for a (+) strand RNA polymerase like the one found in SARS-CoV-2 is a Gly-Asp-Asp motif. This motif is similar to those found in other polymerases, suggesting a common evolutionary ancestor.

To express the 3′ end of the viral genome, RdRp employs an unusual strategy of discontinuous transcription, producing a nested set of 3′ co-terminal sub-genomic RNAs. As RdRp copies the viral RNA, it reaches junctions called Transcription Regulatory Sequences (TRS), which contain highly conserved Core Sequences (CS). Once these sequences are detected by RdRp, it is able to either copy the sequence or jump from that sequence, possibly through long-range RNA-RNA interaction, and base pair with the same CS part of the TRS at the 5′ end of the genomic RNA, resulting in the production of (−) RNAs. RdRp then copies these (−) subgenomic RNA sequences into ribosome-ready mRNA. The complicated nature of discontinuous transcription may help explain the higher rate of recombination seen in coronaviruses [190]. RdRp also complexes with nsp7 and nsp8, which help to increase RdRp processivity [165, 191,192,193], and interacts with nsp14, a bifunctional protein that has capping and exonuclease activities [194].
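The nested, 3′ co-terminal outcome of discontinuous transcription can be sketched with plain string operations. This is a deliberately simplified toy model, not a simulation of RdRp: it works only at the level of the final (+) sense mRNAs and skips the (−) strand intermediates, and the function name, the core sequence, and the toy genome are all placeholders chosen for illustration.

```python
def subgenomic_mrnas(genome: str, core: str) -> list[str]:
    """Return the toy nested set of (+) sense subgenomic mRNAs.

    The first occurrence of `core` is treated as the leader TRS near the
    5' end; every later occurrence is treated as a body TRS. Each product
    joins the leader to the genome from one body TRS through the 3' end.
    """
    sites = [i for i in range(len(genome) - len(core) + 1)
             if genome.startswith(core, i)]
    if len(sites) < 2:
        return []  # no body TRS, so no template switch is possible
    leader = genome[:sites[0]]
    return [leader + genome[site:] for site in sites[1:]]


if __name__ == "__main__":
    core = "ACGAAC"  # placeholder core sequence for this toy example
    # Hypothetical layout: leader, leader TRS, then two body TRS sites.
    genome = "GGUU" + core + "UUUU" + core + "GGGG" + core + "CCCC"
    for mrna in subgenomic_mrnas(genome, core):
        print(mrna)
```

Every product begins with the same leader and ends with the same 3′ terminus, and each shorter product is contained in the 3′ portion of the longer one, which is the sense in which the set is nested and co-terminal.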

Due to its importance in viral replication, RdRp has been the target of many antiviral therapies, and inhibition of the polymerase may be an effective method of reducing SARS-CoV-2 transmission and disease severity (Table 1 and Additional file 1: Table 3). RdRp inhibitors have been studied and successfully used in the past to manage a myriad of diseases with viral etiologies including HIV, Hepatitis C, and Ebola [191, 195,196,197]. There are also data on the use of RdRp inhibitors for treatment of SARS-CoV and MERS infections, which are caused by viruses genetically and structurally similar to SARS-CoV-2 [198]. Currently, only one missense mutation in the viral RdRp has been found among the 50 most common mutations in SARS-CoV-2 across the globe [199]. This indicates that the SARS-CoV-2 RdRp is conserved, which decreases the risk of viral resistance to an RdRp inhibitor. Recent cryo-electron microscopy research has elucidated the structure of the SARS-CoV-2 RdRp, revealing that it retains the typical ‘hand’ formation common to polymerases: its structure comprises fingers, thumb, and palm subdomains. This commonality allows researchers to use information from previous RdRp inhibitor studies as a foundation to jumpstart their experiments with new data. Characterization of the SARS-CoV-2 RdRp provides a framework for repurposing previously used drugs and developing new medications to inactivate the SARS-CoV-2 virus.

Nucleoside analog RdRp inhibitors

Antiviral nucleoside analogs are prodrugs that are converted into the active 5′-triphosphate form within a cell. The activated analog is then incorporated by the viral RNA polymerase into viral RNA strands, where it either terminates RNA synthesis or is carried into a complete viral RNA strand and introduces non-functional mutations. These mechanisms of action are not mutually exclusive, and often both contribute to decreased viral load. Coronaviruses are known to have an exonuclease (nsp14) with proofreading activity, which can remove incorrectly paired nucleotide bases and lead to resistance against nucleoside analogs [200, 201]. Yet some drugs are still effective, including remdesivir, which mainly works by terminating RNA synthesis by the viral RdRp (Fig. 3a) [202]. Recent data, largely stemming from the SOLIDARITY trial, suggest that remdesivir provides minimal benefit in terms of morbidity and mortality in the context of a well-controlled clinical trial [134].

In silico assays, which use computer models to predict a molecule’s affinity to an enzyme, and molecular docking studies have illustrated that many drugs used to treat various diseases, as well as a myriad of biologically derived compounds, can bind to the SARS-CoV-2 RdRp. These molecules provide a potential starting point for SARS-CoV-2 treatment, but none have been proven effective and most are far from becoming a therapeutic option [195, 203,204,205,206,207]. Furthermore, molecular analysis studies have been completed to show the binding site and molecular mechanism of action of remdesivir on the SARS-CoV-2 RdRp [191, 208]. In vitro cell assays of SARS-CoV-2 infection models tested the ability of known HIV nucleoside analogs, including tenofovir, 4′-ethynyl-2-fluoro-2′-deoxyadenosine, alovudine, lamivudine, and emtricitabine, as well as remdesivir, to reduce viral load, and found that only remdesivir significantly decreased viral load at a concentration not toxic to the human cells (therapeutic index of 28.6) [209]. An in vivo study constructed a chimeric mouse-adapted SARS-CoV variant carrying the SARS-CoV-2 RdRp to infect mice and found that subcutaneous injections of remdesivir resulted in improved lung function and decreased viral load [210].

Human data on the efficacy of these RdRp inhibitors in treating COVID-19 are limited, but there have been some clinical trials, as well as studies on previously known diseases, that can help judge the potential of some of these drugs. Ribavirin clinical trials against MERS revealed high levels of toxicity, indicating that the drug may not be the best candidate for COVID-19 treatment [211]. Sofosbuvir with velpatasvir is currently used as an effective hepatitis C treatment and is well tolerated in patients, indicating it may be able to reach effective dosage concentrations to treat COVID-19. Clinical trials of this drug combination are currently underway in Iran [212]. Remdesivir has been used to effectively treat Ebola and has been used as a COVID-19 treatment for compassionate use in the U.S. and other countries. An observational study analyzing data from 53 patients using remdesivir for compassionate use found that 68% of patients showed clinical improvement after the first dose and 23% had serious adverse effects [213, 214]. A phase 3 double-blinded clinical trial comparing intravenous remdesivir to placebo was completed in Hubei, China. The study consisted of 158 patients in the remdesivir arm and 79 receiving a placebo, and concluded that remdesivir was not associated with clinical improvement. Yet there was a non-statistically significant trend toward quicker recovery times in the intervention group, which may warrant a larger clinical trial [215]. As stated, preliminary analysis of SOLIDARITY trial findings demonstrates no meaningful clinical difference from remdesivir administration compared to the standard of care. Finally, favipiravir is clinically approved for treatment of influenza in Japan and has shown some effectiveness in treating Ebola. This drug has also been used in a randomized controlled COVID-19 trial in China. The trial was a head-to-head comparison of favipiravir and arbidol with roughly 120 patients in each arm. The study showed no significant difference between therapies for 7-day clinical recovery rate. Yet favipiravir did relieve fever and cough symptoms significantly faster and showed a trend toward greater effectiveness in moderately ill compared to severely ill patients [216, 217]. There are a number of clinical trials currently registered that test these various RdRp inhibitors. One study is conducting a phase 4 trial with favipiravir plus HCQ, and multiple trials for remdesivir and sofosbuvir have reached phase 3, but none have published any statistically significant results so far.

Zinc as a potential RdRp inhibitor

In vitro cell studies have illustrated that zinc directly inhibits the RdRp in SARS-CoV, but a zinc ionophore is needed to move zinc into the cell for it to be effective [218]. Zinc is known to play an important role in immunomodulation, and zinc deficiency is prevalent amongst groups at high risk from SARS-CoV-2 infection, including the elderly and people taking diuretics or anti-hypertensive medications [219]. Furthermore, HCQ is a potential drug treatment for COVID-19 as well as a zinc ionophore. Therefore, giving zinc alone and with HCQ has been hypothesized to reduce viral load and attenuate the immune response in SARS-CoV-2 infected patients [220]. Yet recent studies have illustrated that HCQ is ineffective in reducing infection risk prophylactically or improving outcomes in mild to moderate infections [131, 221]. There are several ongoing registered clinical trials that aim to determine whether zinc, along with other agents, is effective in preventing SARS-CoV-2 infection and/or reducing viral load (Additional file 1: Table 1). Trials featuring zinc as a treatment modality in isolation have been sparse and appear to have been discontinued. Unfortunately, as of February 1st, 2021, none of these trials have shown major efficacy with regard to hard endpoints, though many trials are ongoing.


Antibody epitope curation

Linear B cell epitopes on the SARS-CoV-2 surface glycoprotein were curated from five published studies [55,56,57,58,59]. Four of these studies screened polyclonal sera of convalescent COVID-19 patients using either peptide arrays [55, 56, 59] or phage immunoprecipitation sequencing (PhIP-Seq) [57]. One study characterized the epitopes of monoclonal neutralizing antibodies [59]. Results from Schwarz et al. included sera from six SARS-CoV-2-naive patients and nine SARS-CoV-2-infected patients, analyzed using PEPperCHIP® SARS-CoV-2 Proteome Microarrays [59]. The peptides included from these proteome-wide epitope mapping analyses were limited to those which demonstrated either IgG or IgA fluorescence intensity > 1000 U in at least two infected patient samples and in none of the naive patient samples. In addition, two peptides were included (QGQTVTKKSAAEASK, QTVTKKSAAEASKKP) which demonstrated IgG fluorescence intensity > 1000 U in only one naive patient sample each, but in four and five infected patient samples, respectively.

HLA ligand prediction

The SARS-CoV-2 protein sequence FASTA was retrieved from the NCBI reference database [60]. Haplotypes included in this analysis were derived from those with > 5% expression within the United States populations based on the National Marrow Donor Program’s HaploStats tool [61]:

HLA-A: A*11:01, A*02:01, A*01:01, A*03:01, A*24:02

HLA-B: B*44:03, B*07:02, B*08:01, B*44:02, B*44:03, B*35:01

HLA-C: C*03:04, C*04:01, C*05:01, C*06:02, C*07:01, C*07:02

HLA-DR: DRB1*01:01, DRB1*03:01, DRB1*04:01, DRB1*07:01, DRB1*11:01, DRB1*13:01, DRB1*15:01

Additionally, HLA-DQ alpha/beta pairs were chosen based on prevalence in previous studies [62]:

HLA-DQ: DQA1*01:02/DQB1*06:02, DQA1*05:01/DQB1*02:01, DQA1*02:01/DQB1*02:02, DQA1*05:05/DQB1*03:01, DQA1*01:01/DQB1*05:01, DQA1*03:01/DQB1*03:02, DQA1*03:03/DQB1*03:01, DQA1*01:03/DQB1*06:03

For HLA-I, 8-11mer epitopes were predicted using netMHCpan 4.0 [63] and MHCflurry 1.6.0 [64]. For HLA-II calling, 15mers were predicted using NetMHCIIpan 3.2 [65] and NetMHCIIpan 4.0 [66]. For optimization of epitope predictions, individual features from each HLA-I and HLA-II prediction tool were compared against IEDB binding affinities using Spearman correlation (Additional file 1: Fig. S1). Cutpoints for the best performing HLA-I and HLA-II features were set using 90% specificity for predicting peptides with < 500 nM binding affinity in the IEDB set, using predicted binding affinity values from netMHCpan 4.0 (HLA-I) and netMHCIIpan 3.2 (HLA-II). The proportion of the total U.S. population containing at least one haplotype capable of binding each peptide was calculated assuming no genetic linkage:
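The formula that followed this sentence did not survive in this copy. As a rough illustration only, and not necessarily the calculation the authors used, one common way to estimate such coverage when haplotypes are treated as independent is to multiply the probabilities of not carrying each binding haplotype and subtract the product from one. The function name and the frequencies below are hypothetical placeholders, not HaploStats values.

```python
from math import prod


def population_coverage(carrier_freqs):
    """Fraction of people expected to carry at least one binding haplotype.

    Assumption (not from the source): each value is the fraction of the
    population carrying that haplotype, and carriage of different
    haplotypes is independent (no genetic linkage).
    """
    return 1.0 - prod(1.0 - f for f in carrier_freqs)


# Hypothetical carrier frequencies for haplotypes predicted to bind one peptide.
example_freqs = {"A*02:01": 0.40, "B*07:02": 0.20, "DRB1*15:01": 0.25}
coverage = population_coverage(example_freqs.values())
print(f"{coverage:.2f}")
```

With no binding haplotypes the function returns zero coverage, and adding haplotypes can only increase the estimate.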

Immunogenicity modeling

IEDB HLA-I and HLA-II viral tetramer data were used to generate a generalized linear model (GLM, family = binary) with tetramer-positivity as a binary outcome [51]. Independent variables for HLA-I included NetMHCpan 4.0 binding affinity and elution score, MHCflurry binding affinity, presentation score, processing score, and percentage of aromatic (F, Y, W), acidic (D, E), basic (K, R, H), small (A, G, S, T, P), cyclic (P), and thiol (C, M) amino acid residues. Independent variables for HLA-II included NetMHCIIpan 4.0 binding affinity and elution scores, and percentage of aromatic, acidic, basic, small, cyclic, and thiol amino acid residues. All independent variables were normalized to 0–1 to keep coefficients comparable (binding affinities divided by 50,000). GLM performance was assessed using 5-fold cross-validation, balancing for HLA alleles. The final HLA-I and HLA-II models were generated using each full IEDB set, then applied to SARS-CoV-2 predicted HLA ligands to derive a GLM score. For immunogenicity filtering, predicted epitopes above the median GLM score were kept.
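The sequence-derived features described above can be sketched as follows. This is a minimal illustration under stated assumptions: the residue groups and the divisor of 50,000 come from the text, while the function names, the clipping to the 0–1 range, and the use of fractions rather than percentages are choices made here for illustration. The example string is one of the peptides quoted earlier and serves only as sample input.

```python
# Residue groups as listed in the text.
RESIDUE_GROUPS = {
    "aromatic": set("FYW"),
    "acidic": set("DE"),
    "basic": set("KRH"),
    "small": set("AGSTP"),
    "cyclic": set("P"),
    "thiol": set("CM"),
}
MAX_AFFINITY_NM = 50_000  # divisor stated in the text


def residue_fractions(peptide):
    """Fraction (0-1) of residues in each group; assumes a non-empty peptide."""
    peptide = peptide.upper()
    return {
        name: sum(aa in group for aa in peptide) / len(peptide)
        for name, group in RESIDUE_GROUPS.items()
    }


def normalize_affinity(affinity_nm):
    """Scale a binding affinity in nM to 0-1; clipping is an assumption."""
    return min(max(affinity_nm / MAX_AFFINITY_NM, 0.0), 1.0)


features = residue_fractions("QGQTVTKKSAAEASK")
features["affinity"] = normalize_affinity(500.0)
print(features)
```

A feature dictionary like this would then be one row of the design matrix passed to whichever GLM implementation is used.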

B cell epitope selection

Accessibility of contiguous regions of the spike protein was approximated with the following heuristic: mean accessibility of 35%, minimum accessibility of 15%, requiring at least one residue to have accessibility greater than 50%, and the ends of a region to have at least 25% accessibility. Adjacency to a functional region was defined as within 15aa of either side of FP, HR1, and HR2, and within 50aa of the RBD. A broader window was used for the receptor binding domain due to the known presence of neutralizing antibody epitopes in S1 of SARS-CoV-1 outside of the RBD [67].
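The accessibility heuristic can be written as a single predicate over per-residue accessibility values. In this sketch the four thresholds are taken from the text, but treating the mean, minimum, and end values as lower bounds ("at least") is an assumption, as are the function name and the toy accessibility profiles.

```python
def is_accessible_region(accessibility, mean_min=35.0, floor=15.0,
                         peak=50.0, end_min=25.0):
    """Check one contiguous region against the accessibility heuristic.

    `accessibility` is a list of per-residue accessibility percentages.
    Thresholds are read as minimums, which is an interpretation of the text.
    """
    if not accessibility:
        return False
    mean = sum(accessibility) / len(accessibility)
    return (
        mean >= mean_min                     # mean accessibility of 35%
        and min(accessibility) >= floor      # minimum accessibility of 15%
        and max(accessibility) > peak        # at least one residue above 50%
        and accessibility[0] >= end_min      # both ends at least 25%
        and accessibility[-1] >= end_min
    )


exposed = [30, 40, 55, 45, 28]      # hypothetical profile, passes every check
buried_end = [20, 40, 55, 45, 28]   # first residue below the 25% end threshold
print(is_accessible_region(exposed), is_accessible_region(buried_end))
```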

Published T cell epitope data curation

T cell epitopes from eight studies of immune responses from convalescent COVID-19 patients [68,69,70,71,72,73,74,75] were manually curated into a spreadsheet with 973 entries (Table S9). Studies that focused on murine immune responses and/or immunity from vaccination were excluded. To aggregate epitope regions of varying granularities, the viral proteome was split into 40aa bins, overlapping by 20aa. A bin was considered to contain an epitope region if the two overlapped by at least 8aa. Similarly, each vaccine peptide counted as overlapping a bin if their overlap was at least 8aa. Overlapping bins were mutually exclusive, and only the bin with the highest number of responding patients was retained. Bin boundaries were then clipped to the minimum and maximum boundaries of any epitope region contained within it.
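The binning step can be illustrated with a short sketch. The 40aa bin size, 20aa offset, and 8aa minimum overlap are from the text; the 0-based half-open coordinates, the function names, and the toy protein length are assumptions. The later steps (keeping only the best of each set of overlapping bins and clipping boundaries) are not shown.

```python
def make_bins(protein_length, size=40, step=20):
    """Half-open [start, end) windows of `size` residues, offset by `step`."""
    last_start = max(protein_length - step, 1)
    return [(s, min(s + size, protein_length)) for s in range(0, last_start, step)]


def overlap(a, b):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def bins_containing(region, bins, min_overlap=8):
    """Bins that overlap an epitope region (or vaccine peptide) by >= 8aa."""
    return [b for b in bins if overlap(region, b) >= min_overlap]


bins = make_bins(100)                    # hypothetical 100aa protein
hits = bins_containing((35, 50), bins)   # hypothetical 15aa epitope region
print(bins)
print(hits)
```

In this toy case the region touches the first bin by only 5aa, so it is assigned to the second and third bins but not the first.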

Vaccine peptide manufacturability

Based on previous experiences with peptide synthesis failures and consultation with the UNC High-Throughput Peptide Synthesis and Array Facility, we devised a scoring rubric for solid-phase peptide synthesis difficulty (Additional file 1: Fig. S8A). This rubric includes features related to the stability of the synthesized peptide product as well as sequence features which increase the difficulty of peptide elongation and/or purification. For example, hydrophobic peptides are challenging to solubilize, whereas hydrophobic regions within peptides are challenging to elongate during synthesis due to strong conformational properties. In our scoring rubric, hydrophobicity of peptide sequences is calculated using the mean GRAVY score [76], which is computed both for the entire peptide and as the maximum over all local windows of lengths between 5mer and 8mer. Local hydrophobicity scores are penalized in proportion to how much they exceed 2.5, whereas whole-peptide hydrophobicity is penalized to the degree that it exceeds 2. These values were determined based on unpublished data on which peptides had failed for reasons related to hydrophobicity during the PGV001 neoantigen vaccine trial [77]. Another category of difficulties relates to the instability of certain pairs of adjacent amino acids. The extremely unstable dipeptides are DG and NG, whereas the less penalized but still problematic dipeptides are DS, DN, DD, NN, ND, NS, and NP. Furthermore, certain terminal residues inhibit the initiation of synthesis or lead to the formation of undesired residues such as pyroglutamate. Difficult N terminal residues are Q, E, C, and N, whereas difficult C terminal residues are P, C, and H. Lastly, the inclusion of multiple thiol residues can be challenging due to formation of long-range disulfide bonds. Our heuristic penalizes the total number of thiols (C and M residues) and adds a penalty for excessive cysteines, which is only applied when the number of C residues exceeds 1. Many similar features are enumerated in commercial peptide design guides, such as ones published by Biomatik [78] and SB peptide [79], or in standard texts on solid-phase synthesis [80]. The particular weights given to different peptide features are determined purely from experience and intuition and are presented without claims of accuracy or optimality.
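The individual rubric features can be computed as in the sketch below. The thresholds, dipeptide lists, and terminal residue lists are those given in the text, and the hydropathy values are the standard Kyte-Doolittle scale. The function and key names are hypothetical, and the sketch deliberately stops short of combining the features into one score, since the weights are not stated here. It assumes a non-empty peptide of the 20 standard residues in upper case.

```python
# Kyte-Doolittle hydropathy values used for GRAVY.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
SEVERE_DIPEPTIDES = {"DG", "NG"}
MILD_DIPEPTIDES = {"DS", "DN", "DD", "NN", "ND", "NS", "NP"}
HARD_N_TERM, HARD_C_TERM = set("QECN"), set("PCH")


def gravy(seq):
    """Mean hydropathy of a sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)


def max_local_gravy(seq, min_w=5, max_w=8):
    """Highest GRAVY over all windows of length 5 to 8."""
    scores = [gravy(seq[i:i + w])
              for w in range(min_w, max_w + 1)
              for i in range(len(seq) - w + 1)]
    return max(scores) if scores else gravy(seq)


def difficulty_features(seq):
    """Unweighted rubric features; combining them is left to the reader."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return {
        "whole_hydrophobic_excess": max(0.0, gravy(seq) - 2.0),
        "local_hydrophobic_excess": max(0.0, max_local_gravy(seq) - 2.5),
        "severe_dipeptides": sum(p in SEVERE_DIPEPTIDES for p in pairs),
        "mild_dipeptides": sum(p in MILD_DIPEPTIDES for p in pairs),
        "hard_n_terminus": seq[0] in HARD_N_TERM,
        "hard_c_terminus": seq[-1] in HARD_C_TERM,
        "thiols": sum(aa in "CM" for aa in seq),
        "excess_cysteines": max(0, seq.count("C") - 1),
    }


report = difficulty_features("QDGLLLLLLC")  # made-up peptide for illustration
print(report)
```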

SARS-CoV-2 entropy calculations

In total, 7881 SARS-CoV-2 genome sequences were downloaded from GISAID [81]. A preprocessing step removed 127 sequences that were shorter than 25,000 bases. The sequences were split into 79 smaller files and aligned using Augur [82] (which relies on the MAFFT [83] aligner) with NCBI entry MT072688.1 [84] as the reference genome. The reference genome was downloaded from NCBI GenBank [85]. The 79 resulting alignment files were concatenated into a single alignment file with the duplicate reference genome alignments removed. The multiple sequence alignment was translated to protein space using the R packages seqinr [86] and msa [87]. Entropy for each position was calculated using the following formula, where n is the number of possible outcomes (i.e., total unique identifiable amino acid residues at each location) and p_i is the probability of each outcome (i.e., probability of each possible amino acid residue at each location):
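The formula itself is missing from this copy. Given the definitions of n and p_i, it is presumably the Shannon entropy, H = -sum over i = 1..n of p_i log(p_i); the base of the logarithm is not recoverable from the text, and base 2 is assumed in the sketch below. The function names and the toy alignment are hypothetical, and gap or ambiguity characters would simply be counted as symbols here, which may differ from the authors' handling.

```python
from collections import Counter
from math import log2


def column_entropy(residues):
    """Shannon entropy (bits, base 2 assumed) of one alignment column."""
    counts = Counter(residues)
    total = sum(counts.values())
    return 0.0 - sum((c / total) * log2(c / total) for c in counts.values())


def alignment_entropy(aligned_seqs):
    """Per-position entropy for equal-length aligned protein sequences."""
    return [column_entropy(column) for column in zip(*aligned_seqs)]


toy_alignment = ["MFVF", "MFVF", "MLVF", "MFIF"]  # made-up sequences
entropies = alignment_entropy(toy_alignment)
print([round(h, 3) for h in entropies])
```

Fully conserved positions give an entropy of zero, and the value grows as more residues appear at a position with more even frequencies.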

Mouse vaccination

All mouse work was performed according to IACUC guidelines under UNC IACUC protocol ID 20-121.0. Vaccine studies were performed using BALB/c mice with free access to food and water. Mice were ordered from Jackson Laboratories and vaccinated at 8 weeks of age. Equal numbers of male and female mice were used per group, vaccinated with poly(I:C) (Sigma-Aldrich cat. #P1530) either alone or in combination with 16 synthesized vaccine peptides. In total, 26 μg of peptide was used per vaccination (divided equally by mass among the peptides), and 75 μg of poly(I:C) was used per vaccination, with n = 6 mice per experimental group and n = 3 mice per poly(I:C)-only control group. Mice were vaccinated on days 1 and 7, cheek bleeds were obtained on days 7 and 14, and mice were sacrificed with cardiac bleeds performed on day 21.

S Protein ELISA

Serum obtained from cardiac bleeds on day 21 was utilized for ELISA testing for antibody response to SARS-CoV-2 spike (S) protein. Nunc Maxisorp plates (Thermo Fisher Scientific) were coated with S protein (generously provided by Ting Lab at UNC), or BSA as a negative control and incubated overnight. Plates were blocked with 10% FBS in PBS, washed, and serum plated in duplicate wells with serial dilutions. 6x His Tagged monoclonal antibody (Thermo Fisher Scientific) was also plated as an experimental control. Goat anti-mouse IgG HRP (Thermo Fisher Scientific) was added to washed plates as a secondary antibody. TMB substrate (Thermo Fisher Scientific) was added, development was stopped with TMB Stop solution (BioLegend), and plates were read at 450 nm.

Peptide ELISA

Serum obtained from cardiac bleeds on day 21 and cheek bleeds on experimental days 7 and 14 was tested for antibody response to the predicted B cell peptide epitopes used for vaccinations via peptide ELISAs. Plates were coated with 5 μg/mL of target peptide using coating reagent from the Takara Peptide Coating Kit (Takara cat. #MK100). Measles peptide was utilized as a negative control, and FLAG peptide was also plated as an experimental control. Plates were blocked with a blocking buffer according to the manufacturer’s protocol. Serum was plated in duplicate wells with serial dilutions, and anti-FLAG antibody was plated in the experimental control wells. Rabbit anti-mouse IgG HRP (Abcam ab97046) was utilized as a secondary antibody. TMB substrate (Thermo Fisher Scientific cat. #34028) was added, development was stopped with TMB Stop solution (BioLegend cat. #423001), and plates were read at 450 nm.


ELISpot

After the sacrifice of mice on experimental day 21, spleens were dissected out for ELISpot assessment of T cell activation in response to peptide and adjuvant vaccination. Spleens were mechanically dissociated using a GentleMACS Octo Dissociator (Miltenyi Biotec) and passed through a 70-μm filter. RBC lysis buffer (Gibco cat. #A1049201) was used to remove red blood cells, and cells were washed then passed through 40-μm filters. Splenocytes were counted and 250,000 splenocytes were plated per well into plates (BD Biosciences cat. #551083) that had been coated with each of the individual 16 predicted target peptides, or PBS as negative control or PHA as experimental control. Plates were incubated for 72 h. Anti-interferon gamma detection antibody was added according to the manufacturer’s protocol, followed by enzyme conjugate Streptavidin-HRP and final substrate solution (BD Biosciences cat. #557630). Plates were allowed to develop, washed to stop development, and allowed to dry before reading on an ELISpot reader (AID Classic ERL07).

Graphical and statistical analysis

Plots and analyses were generated using the following R packages: caret 6.0-84 [88], cowplot 0.9.4 [89], data.table 1.12.8 [90], DESeq2 1.22.2 [91], doMC 1.3.6 [92], dplyr 0.8.4 [93], forcats 0.4.0 [94], GenomicRanges 1.34.0 [95], ggallin 0.1.1 [96], ggbeeswarm 0.6.0 [97], ggnewscale 0.4.1 [98], ggplot2 3.3.0 [89], ggpubr 0.2 [99], ggrepel 0.8.1 [100], gplots 3.0.3 [101], gridExtra 2.3 [102], huxtable 4.7.1 [103], magrittr 1.5 [104], officer 0.3.10 [105], pROC 1.16.2 [106], RColorBrewer 1.1-2 [107], readxl 1.3.1 [108], scales 1.1.0 [109], seqinr 3.6-1 [86], stringr 1.4.0 [110], venneuler 1.1-0 [111], viridis 0.5.1 [112]. Figures 4C, D and 5 were generated using the following Python packages: NumPy [113], pandas [114], Matplotlib [115], and Jupyter [116].

Statistical Modeling, Causal Inference, and Social Science

I’ve had various thoughts regarding clinical trials for coronavirus treatments and vaccines, and then I came across thoughtful posts by Thomas Lumley and Joseph Delaney on vaccines.

So let’s talk, first about treatments, then about vaccines.

Clinical trials for treatments

The first thing I want to say is that designing clinical trials is not just about power calculations and all that. It’s also about what you’re gonna do with the results once they come in. The usual ideas of design (including in our own books, unfortunately) focus on what can be learned from a single study. But that’s not what we have here.

Hospitals have lots of coronavirus patients right now, and they can try out whatever treatments are on the agenda, starting with the patients that are at the highest risk of dying. This should be done in a coordinated fashion, by which I don’t mean a bunch of randomized trials, each aiming for that statistical-significance jackpot, followed by a series of headlines and maybe an eventual meta-analysis. When I say “coordinated,” I mean that all the studies should put their patient-level information into an open repository using some shared format, everything gets registered, all the treatments, all the background variables, all the outcomes. This shouldn’t be a burden on experimenters. Indeed, a shared, open-source spreadsheet should be easier to use, compared to the default approach of each group doing their own thing.

Ok, now that I wrote that paragraph, I wish I’d written it a couple months ago. Not that it would’ve made any difference. It would take a lot to change the medical-industrial complex. Sander Greenland et al. have been screaming for years, and the changes have been incremental at best.

Let me tell you a story. A doctor was designing a trial for an existing drug that he thought could be effective for high-risk coronavirus patients. He contacted me to check his sample size calculation: under the assumption that the drug increased survival rate by 25 percentage points, a sample size of N = 126 would assure 80% power. (With 126 people divided evenly in two groups, the standard error of the difference in proportions is bounded above by √(0.5*0.5/63 + 0.5*0.5/63) = 0.089, so an effect of 0.25 is at least 2.8 standard errors from zero, which is the condition for 80% power for the z-test.) When I asked the doctor how confident he was in his guessed effect size, he replied that he thought the effect on these patients would be higher and that 25 percentage points was a conservative estimate. At the same time, he recognized that the drug might not work. I asked the doctor if he would be interested in increasing his sample size so he could detect a 10 percentage point increase in survival, for example, but he said that this would not be necessary.
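The arithmetic in that parenthetical can be checked with a few lines of Python. This is only a sketch of the normal-approximation reasoning given above: the function names are made up, the two-sided 5% level is assumed, and the negligible chance of rejecting in the wrong direction is ignored.

```python
from math import sqrt
from statistics import NormalDist


def se_upper_bound(n_per_group):
    """Worst-case standard error of a difference in proportions (p = 0.5)."""
    return sqrt(0.5 * 0.5 / n_per_group + 0.5 * 0.5 / n_per_group)


def approx_power(effect, se, alpha=0.05):
    """Approximate power of a two-sided z-test for a given true effect."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return z.cdf(effect / se - critical)


se = se_upper_bound(63)              # 126 people split evenly
power_25 = approx_power(0.25, se)    # the doctor's assumed effect
power_10 = approx_power(0.10, se)    # a 10 percentage point effect, same N
print(round(se, 3), round(power_25, 2), round(power_10, 2))
```

The 2.8 standard errors in the text is just 1.96 + 0.84, the critical value plus the 80th percentile of the normal distribution. Running the same function with a 10 percentage point effect shows how quickly power falls off at this sample size.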

It might seem reasonable to suppose that a drug might not be effective but would have a large effect if it did happen to work. But this vision of uncertainty has problems. Suppose, for example, that the survival rate was 30% among the patients who do not receive this new drug and 55% among the treatment group. Then in a population of 1000 people, it could be that the drug has no effect on the 300 people who would live either way, no effect on the 450 who would die either way, and it would save the lives of the remaining 250 patients. There are other possibilities consistent with a 25 percentage point benefit (for example, the drug could save 350 people while killing 100), but I’ll stick with the simple scenario for now. In any case, the point is that the posited benefit of the drug is not “a 25 percentage point benefit” for each patient; rather, it’s a benefit on 25% of the patients. And, from that perspective, of course the drug could work but only on 10% of the patients. Once we’ve accepted the idea that the drug works on some people and not others (or in some comorbidity scenarios and not others), we realize that “the treatment effect” in any given study will depend entirely on the patient mix. There is no underlying number representing “the effect of the drug.” Ideally one would like to know what sorts of patients the treatment would help, but in a clinical trial it is enough to show that there is some clear average effect. My point is that if we consider the treatment effect in the context of variation between patients, this can be the first step in a more grounded understanding of effect size.
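To see that both scenarios in that paragraph give the same observed survival rates, here is a small sketch that counts patients by what would happen to them with and without the drug. The function name is made up, and in the second scenario the split into 200 who live either way and 350 who die either way is an assumption chosen so that the totals match the 30% and 55% rates.

```python
def survival_rates(always_live, always_die, saved, killed=0):
    """Survival rate without and with the drug, from counts of patient types.

    `saved` patients die without the drug and live with it;
    `killed` patients live without the drug and die with it.
    """
    total = always_live + always_die + saved + killed
    control = (always_live + killed) / total   # survive without the drug
    treated = (always_live + saved) / total    # survive with the drug
    return control, treated


simple = survival_rates(always_live=300, always_die=450, saved=250)
mixed = survival_rates(always_live=200, always_die=350, saved=350, killed=100)
print(simple, mixed)
```

Both calls return the same pair of rates, which is the point: the trial sees only the 25 percentage point difference, not which mix of patients produced it.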

I also shared some thoughts last month on costs and benefits, in particular:

When considering design for a clinical trial I’d recommend assigning cost and benefits and balancing the following:

– Benefit (or cost) of possible reduced (or increased) mortality and morbidity from COVID in the trial itself.
– Cost of toxicity or side effects in the trial itself.
– Public health benefits of learning that the therapy works, as soon as possible.
– Economic / public confidence benefits of learning that the therapy works, as soon as possible.
– Benefits of learning that the therapy doesn’t work, as soon as possible, if it really doesn’t work.
– Scientific insights gained from intermediate measurements or secondary data analysis.
– $ cost of the study itself, as well as opportunity cost if it reduces your effort to test something else.

This may look like a mess—but if you’re not addressing these issues explicitly, you’re addressing them implicitly. . . .

Whatever therapies are being tried, should be monitored. Doctors should have some freedom to experiment, and they should be recording what happens. To put it another way, they’re trying different therapies anyway, so let’s try to get something useful out of all that.

It’s also not just about “what works” or “does a particular drug work,” but how to do it. . . . You want to get something like optimal dosing, which could depend on individuals. But you’re not gonna get good discrimination on this from a standard clinical trial or set of clinical trials. So we have to go beyond the learning-from-clinical-trial paradigm, designing large studies that mix experiment and observation to get insight into dosing etc.

Also, lots of the relevant decisions will be made at the system level, not the individual level. . . . These sorts of issues are super important and go beyond the standard clinical-trial paradigm.

Clinical trials for vaccines

I haven’t thought about this at all so I’ll outsource the discussion to others.

There are over 100 potential vaccines being developed, and several are already in preliminary testing in humans. There are three steps to testing a vaccine:

– showing that it doesn’t have any common, nasty side effects
– showing that it raises antibodies
– showing that vaccinated people don’t get COVID-19.

The last step is the big one, especially if you want it fast. . . . We don’t expect perfection, and if a vaccine truly reduces the infection rate by 50% it would be a serious mistake to discard it as useless. But if the control-group infection rate over a couple of months is a high-but-maybe-plausible 0.2% that means 600,000 people in the trial — one of the largest clinical trials in history.

How can that be reduced? If the trial was done somewhere with out-of-control disease transmission, the rate of infection in controls might be 5% and a moderately large trial would be sufficient. But doing a randomised trial in a setting like that is hard, and ethically dubious if it’s a developing-world population that won’t be getting a successful vaccine any time soon. If the trial took a couple of years, rather than a couple of months, the infection rate could be 3-4 times lower, but we can’t afford to wait a couple of years.

The other possibility is deliberate infection. If you deliberately exposed trial participants to the coronavirus, you could run a trial with only hundreds of participants, and no more COVID deaths, in total, than a larger trial. But signing people up for deliberate exposure to a potentially deadly infection when half of them are getting placebo is something you don’t want to do without very careful consideration and widespread consultation. . . .

One major barrier is manufacturing the doses, especially since we decided to off-shore a lot of our biomedical capacity in the name of efficiency (at the cost of robustness). . . . We want an effective vaccine and it may be the case that candidates vary in their effectiveness. There are successful vaccines that do not grant 100% immunity. The original polio vaccines were only 60-70% effective versus one of the strains, but that still led to a vast decrease in the number of infections in the United States once vaccination became standard.

So, clearly we want trials. . . . Now we get to the point about medical ethics. A phase III trial takes a long time to conduct and there is some political pressure for a fast solution. . . . if the virus is mostly under control, you need a lot of people and a long time to evaluate the effectiveness of a vaccine. People are rarely exposed so it takes a long time for differences in cases between the arms to show up. . . .

Another option is the challenge trial. Likely only taking a few hundred participants, it would have no more deaths than a regular trial. But it would involve infecting people, treated with a placebo(!!), with a potentially fatal infectious disease. There are greater good arguments here, but the longer I think about them the more dubious they get to me. Informed consent for things that are so dangerous really does suggest coercion. . . .

Combining these ideas

Organizing clinical trials for treatments . . . I just don’t think this is gonna happen.

But organizing clinical trials for vaccines? Maybe this is possible. Based on the above discussion, it seems like it’s likely we’ll soon be seeing vaccine trials based on infecting healthy people with the virus and then seeing if they fight it off. If so, I have a few thoughts:

1. I don’t see why you need to give anyone placebos. If we have several legitimate vaccine ideas, let’s give everyone some vaccine or another. If they all work, and nobody gets sick, that’s great. If we’re testing 100 vaccine ideas, then we can guess that most of them won’t be so effective, so we’ll get placebos automatically.

2. As discussed above, coordinate all of these. Certainly no need for 100 different placebo groups.

3. Multilevel modeling all the way. Bayesian inference. Decision making based on costs and benefits, not statistical significance.


Antibody epitope curation

Linear B cell epitopes on the SARS-CoV-2 surface glycoprotein were curated from four published studies [56–59]. Three of these studies screened polyclonal sera of convalescent COVID-19 patients using either peptide arrays [56,58] or phage immunoprecipitation sequencing (PhIP-Seq) [59]. One study characterized the epitopes of monoclonal neutralizing antibodies [57]. Additionally, we were provided as-yet unpublished results from a study of sera from six SARS-CoV-2-naive patients and nine SARS-CoV-2-infected patients using PEPperCHIP® SARS-CoV-2 Proteome Microarrays. The peptides included from these proteome-wide epitope mapping analyses were limited to those which demonstrated either IgG or IgA fluorescence intensity > 1000 U in at least two infected patient samples and in none of the naive patient samples. In addition, two peptides were included (QGQTVTKKSAAEASK, QTVTKKSAAEASKKP) which demonstrated IgG fluorescence intensity > 1000 U in only one naive patient sample each, but in four and five infected patient samples, respectively.

HLA ligand prediction

The SARS-CoV-2 protein sequence FASTA was retrieved from the NCBI reference database. Haplotypes included in this analysis were those with > 5% expression within United States populations, based on the National Marrow Donor Program’s HaploStats tool 22 :

HLA-A: A*11:01, A*02:01, A*01:01, A*03:01, A*24:02

HLA-B: B*44:03, B*07:02, B*08:01, B*44:02, B*44:03, B*35:01

HLA-C: C*03:04, C*04:01, C*05:01, C*06:02, C*07:01, C*07:02

HLA-DR: DRB1*01:01, DRB1*03:01, DRB1*04:01, DRB1*07:01, DRB1*11:01, DRB1*13:01, DRB1*15:01

Additionally, HLA-DQ alpha/beta pairs were chosen based on prevalence in previous studies 23 :

HLA-DQ: DQA1*01:02/DQB1*06:02, DQA1*05:01/DQB1*02:01, DQA1*02:01/DQB1*02:02, DQA1*05:05/DQB1*03:01, DQA1*01:01/DQB1*05:01, DQA1*03:01/DQB1*03:02, DQA1*03:03/DQB1*03:01, DQA1*01:03/DQB1*06:03

For HLA-I, 8-11mer epitopes were predicted using NetMHCpan 4.0 18 and MHCflurry 1.6.0 19 . For HLA-II calling, 15mers were predicted using NetMHCIIpan 3.2 20 and NetMHCIIpan 4.0 21 . To optimize epitope predictions, individual features from each HLA-I and HLA-II prediction tool were compared against IEDB binding affinities using Spearman correlation (Figure S1). Cutpoints for the best performing HLA-I and HLA-II feature were set at 90% specificity for predicting peptides with < 500nM binding affinity in the IEDB set. The proportion of the total U.S. population carrying at least one haplotype capable of binding each peptide was calculated assuming no genetic linkage.
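A minimal sketch of one way such a population proportion could be computed. It assumes each allele is carried independently of the others (the "no genetic linkage" assumption), so the chance of carrying at least one binding allele is one minus the product of the chances of carrying none. The function name and the carrier frequencies are hypothetical placeholders, not HaploStats values, and this form is an assumption rather than the authors' stated formula.

```python
from math import prod

def population_coverage(carrier_freqs):
    """Fraction of people carrying at least one binding allele,
    assuming alleles are carried independently (no linkage)."""
    return 1.0 - prod(1.0 - f for f in carrier_freqs)

# Hypothetical carrier frequencies for alleles predicted to bind one peptide.
binders = {"A*02:01": 0.45, "B*07:02": 0.20, "C*07:02": 0.25}
coverage = population_coverage(binders.values())
print(f"Estimated coverage: {coverage:.3f}")
```

With no binding alleles the function returns 0, and adding any allele can only raise the estimate.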

Immunogenicity modeling

IEDB HLA-I and HLA-II viral tetramer data were used to generate a generalized linear model (GLM, family = binomial) with tetramer-positivity as a binary outcome. Independent variables for HLA-I included NetMHCpan 4.0 binding affinity and elution score, MHCflurry binding affinity, presentation score, processing score, and percentage of aromatic (F, Y, W), acidic (D, E), basic (K, R, H), small (A, G, S, T, P), cyclic (P), and thiol (C, M) amino acid residues. Independent variables for HLA-II included NetMHCIIpan 4.0 binding affinity and elution scores, and percentage of aromatic, acidic, basic, small, cyclic, and thiol amino acid residues. All independent variables were normalized to 0-1 to keep coefficients comparable (binding affinities divided by 50,000). GLM model performance was derived using 5-fold cross validation, balancing for HLA alleles. The final HLA-I and HLA-II models were generated using each full IEDB set, then applied to SARS-CoV-2 predicted HLA ligands to derive a GLM score. For immunogenicity filtering, predicted epitopes above the median GLM score were kept.
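The residue-composition features described above can be sketched as follows. The residue classes are taken from the text; the function names are hypothetical, and clipping affinities at 50,000 nM before scaling is an assumption made here so that the scaled value stays within 0-1.

```python
# Residue classes as listed in the text above.
RESIDUE_CLASSES = {
    "aromatic": set("FYW"),
    "acidic": set("DE"),
    "basic": set("KRH"),
    "small": set("AGSTP"),
    "cyclic": set("P"),
    "thiol": set("CM"),
}

def residue_class_fractions(peptide):
    """Return the fraction (0-1) of residues in each class for one peptide."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    return {
        name: sum(aa in members for aa in peptide) / len(peptide)
        for name, members in RESIDUE_CLASSES.items()
    }

def scale_affinity(nm, cap=50_000):
    """Scale a binding affinity in nM to 0-1 by dividing by 50,000.
    Clipping at the cap is an assumption, not stated in the text."""
    return min(nm, cap) / cap

features = residue_class_fractions("QGQTVTKKSAAEASK")
features["affinity"] = scale_affinity(500)
print(features)
```

Because every feature lands on the same 0-1 scale, the fitted coefficients can be compared directly, which is the stated reason for the normalization.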

SARS-CoV-2 entropy calculations

8,008 SARS-CoV-2 genome sequences were downloaded from GISAID 51 . A preprocessing step removed 127 sequences that were shorter than 25,000 bases. The sequences were split into 79 smaller files and aligned using augur 52 with MT072688.1 91 as the reference genome. The reference genome was downloaded from NCBI GenBank 92 . The 79 resulting alignment files were concatenated into a single alignment file with the duplicate reference genome alignments removed. The multiple sequence alignment was translated to protein space using the R packages seqinr 93 and msa 94 . Entropy for each position was calculated using the following formula, where n is the number of possible outcomes (i.e. total unique identifiable amino acid residues at each location) and p_i is the probability of each outcome (i.e. probability of each possible amino acid residue at each location):

$$H = -\sum_{i=1}^{n} p_i \log p_i$$
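As a sketch of the per-position entropy calculation, the following computes Shannon entropy for each column of a protein alignment. The natural logarithm is an assumption (the base is not stated in the text), the toy sequences are hypothetical rather than GISAID data, and gaps or ambiguous residues are not given any special treatment here.

```python
from collections import Counter
from math import log

def column_entropy(column):
    """Shannon entropy of one alignment column (natural log assumed)."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(c / total) * log(c / total) for c in counts.values())

def alignment_entropy(sequences):
    """Entropy at each position of equal-length aligned protein sequences."""
    return [column_entropy(col) for col in zip(*sequences)]

# Toy alignment (hypothetical sequences).
aligned = ["MFVK", "MFVR", "MLVK", "MFVK"]
entropies = alignment_entropy(aligned)
print([round(h, 3) for h in entropies])
```

A fully conserved position has entropy 0, and entropy grows as more residues appear at a position with more even frequencies.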

Immunomodulatory molecule co-expression analysis

Single cell RNA sequencing data was collected from six respiratory datasets 80–84 and three gastrointestinal datasets 82,85,86 . ACE2+ cells were subsetted as cells with an expression of ACE2 greater than zero. The proportion of ACE2+ cells expressing each immunomodulatory gene was plotted with the circlize package 95 . Coexpression of the immunomodulatory genes that were expressed in greater than five percent of the ACE2+ cells was plotted as links.
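The subsetting and thresholding steps can be illustrated with a small sketch. Cells are represented as plain dictionaries of gene counts, and the gene names GENE_A and GENE_B, the counts, and the helper names are all hypothetical; a real analysis would work on a sparse expression matrix.

```python
def ace2_positive(cells):
    """Keep cells whose ACE2 expression is greater than zero."""
    return [c for c in cells if c.get("ACE2", 0) > 0]

def expressing_fraction(cells, gene):
    """Fraction of the given cells with non-zero expression of `gene`."""
    return sum(c.get(gene, 0) > 0 for c in cells) / len(cells)

def coexpressing_fraction(cells, gene_a, gene_b):
    """Fraction of the given cells expressing both genes."""
    both = sum(c.get(gene_a, 0) > 0 and c.get(gene_b, 0) > 0 for c in cells)
    return both / len(cells)

# Hypothetical counts for four cells.
cells = [
    {"ACE2": 2, "GENE_A": 1, "GENE_B": 0},
    {"ACE2": 1, "GENE_A": 3, "GENE_B": 2},
    {"ACE2": 0, "GENE_A": 5, "GENE_B": 1},
    {"ACE2": 4, "GENE_A": 0, "GENE_B": 0},
]
positive = ace2_positive(cells)
fractions = {g: expressing_fraction(positive, g) for g in ("GENE_A", "GENE_B")}
# Only genes expressed in more than 5% of ACE2+ cells are kept for links.
kept = [g for g, f in fractions.items() if f > 0.05]
link = coexpressing_fraction(positive, "GENE_A", "GENE_B")
```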

Graphical and statistical analysis

Plots and analyses were generated using the following R packages: scales 96 , data.table 97 , ggrepel 98 , ggplot2 99 , viridis 100 , ggnewscale 101 , seqinr 93 , DESeq2 102 , GenomicRanges 103 , gplots 104 , ggbeeswarm 105 , ggallin 106 , stringr 107 , gridExtra 108 , pROC 109 , caret 110 , RColorBrewer 111 , dplyr 112 , cowplot 113 , ggpubr 114 , doMC 115 , venneuler 116 , ComplexHeatmap 117 , and circlize 95 . Figures 4C, 4D, and 5 were generated using the following Python packages: NumPy 118 , pandas 119 , Matplotlib 120 , and Jupyter 121 .

Definition of 'Random Sampling'

Definition: Random sampling is a sampling technique in which each sample has an equal probability of being chosen. A sample chosen randomly is meant to be an unbiased representation of the total population. If, for some reason, the sample does not represent the population, the variation is called a sampling error.

Description: Random sampling is one of the simplest ways of collecting data from the total population. Under random sampling, each member of the population has an equal chance of being chosen as part of the sample. For example, suppose the total workforce in an organisation is 300 and a sample group of 30 employees is selected to take a survey. In this case, the population is the total number of employees in the company and the group of 30 employees is the sample. Each member of the workforce has an equal chance of being chosen because the employees who take part in the survey are selected randomly. There is always a possibility, however, that the sample does not represent the population as a whole. Any such random variation is termed a sampling error.

An unbiased random sample is important for drawing conclusions. For example, when we draw the sample of 30 employees from the total population of 300 employees, there is always a possibility that a researcher ends up picking more than 25 men even though the population consists of 200 men and 100 women. The resulting variation in the results is known as a sampling error. One disadvantage of random sampling is that it requires a complete list of the population. For example, if a company wants to carry out a survey using random sampling, it needs a complete list of its employees, and those employees may be spread across different regions, which makes the survey harder to conduct.
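The employee example can be sketched in a few lines. The code draws one simple random sample of 30 from the 300 employees and then computes, from the hypergeometric distribution, the exact probability that a random sample contains more than 25 men. The variable and function names are illustrative, and the fixed seed is only there to make the draw repeatable.

```python
import random
from math import comb

random.seed(0)  # fixed seed so the draw is repeatable

# Population from the example: 200 men and 100 women (300 employees).
population = ["M"] * 200 + ["W"] * 100
sample = random.sample(population, 30)  # simple random sample, no replacement
men_in_sample = sample.count("M")

def prob_more_than(k, men=200, women=100, n=30):
    """Exact hypergeometric probability of more than k men in a sample of n."""
    total = comb(men + women, n)
    ways = sum(comb(men, m) * comb(women, n - m) for m in range(k + 1, n + 1))
    return ways / total

p_over_25 = prob_more_than(25)
print(men_in_sample, p_over_25)
```

A sample with more than 25 men is unlikely but possible, which is exactly the kind of chance deviation the text calls a sampling error.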

Example 9.1

The EPA considers indoor radon levels above 4 picocuries per liter (pCi/L) of air to be high enough to warrant amelioration efforts. Tests in a sample of 200 Centre County, Pennsylvania homes found 127 (63.5%) of these sampled households to have indoor radon levels above 4 pCi/L. What is the population value being estimated by this sample percentage? What is the standard error of the corresponding sample proportion?

Recap: the estimated percent of Centre County households that don't meet the EPA guidelines is 63.5% with a standard error of 3.4%. The Normal approximation tells us that

  • for 68% of all possible samples, the sample proportion will be within one standard error of the true population proportion and
  • for 95% of all possible samples, the sample proportion will be within two standard errors of the true population proportion.

Thus, a 68% confidence interval for the percent of all Centre County households that don't meet the EPA guidelines is given by 63.5% ± 3.4%, that is, 60.1% to 66.9%.

A 95% confidence interval for the percent of all Centre County households that don't meet the EPA guidelines is given by 63.5% ± 2(3.4%) = 63.5% ± 6.8%, that is, 56.7% to 70.3%.
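The numbers in this example can be checked with a short calculation. The sketch below uses the sample counts from the example (127 of 200 homes) and the multipliers 1 and 2 from the Normal approximation; the helper name `interval` is illustrative.

```python
from math import sqrt

n = 200                  # sampled homes
p_hat = 127 / n          # sample proportion above 4 pCi/L
se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion

def interval(p, se, z):
    """Interval of the form p +/- z standard errors."""
    return (p - z * se, p + z * se)

ci68 = interval(p_hat, se, 1)
ci95 = interval(p_hat, se, 2)
print(f"SE = {se:.3f}")
print(f"68% CI: {ci68[0]:.1%} to {ci68[1]:.1%}")
print(f"95% CI: {ci95[0]:.1%} to {ci95[1]:.1%}")
```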

Confidence Intervals for a proportion:

For large random samples, a confidence interval for a population proportion is given by

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where \hat{p} is the sample proportion, n is the sample size, and z* is a multiplier number that comes from the normal curve and determines the level of confidence (see Table 9.1 for some common multiplier numbers).

Table 9.1. Commonly Used Multipliers
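Multipliers of this kind can be recomputed from the standard normal distribution. The sketch below uses Python's `statistics.NormalDist`; the confidence levels listed are common choices picked for illustration, not necessarily the rows of Table 9.1.

```python
from statistics import NormalDist

def multiplier(confidence):
    """z* such that the central `confidence` share of a standard normal
    distribution lies within z* standard deviations of the mean."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.68, 0.90, 0.95, 0.99):
    print(f"{level:.0%}: z* = {multiplier(level):.3f}")
```

The 68% and 95% values come out close to 1 and 2, which is why those round numbers are used in the example above.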

Interpreting Confidence Intervals

To interpret a confidence interval remember that the sample information is random - but there is a pattern to its behavior if we look at all possible samples. Each possible sample gives us a different sample proportion and a different interval. But, even though the results vary from sample-to-sample, we are "confident" because the margin-of-error would be satisfied for 95% of all samples (with z*=2).

The margin-of-error being satisfied means that the interval includes the true population value.
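The "all possible samples" interpretation can be illustrated by simulation. The sketch below assumes a hypothetical true proportion of 0.635 and a sample size of 200, draws many samples, and counts how often the interval built from each sample (with z* = 2) contains the true value. The function name, the seed, and the number of trials are arbitrary choices.

```python
import random
from math import sqrt

def coverage(true_p=0.635, n=200, z=2.0, trials=2000, seed=1):
    """Share of simulated samples whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / trials

share = coverage()
print(f"Intervals containing the true value: {share:.1%}")
```

The share should land near 95%. Any single interval either contains the true value or does not; the confidence refers to the procedure across samples.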

Properties of Confidence Intervals

  • There is a trade-off between the level of confidence and the precision of the interval. If you want more confidence, you will have to settle for a wider interval (bigger z*).
  • Our formula for the confidence interval depends on the normal approximation, so you must check that you have independent trials and a large enough sample to be sure that the normal approximation is appropriate.
  • The standard error calculation involves estimating the true standard deviation by substituting the sample proportion for the population proportion in the formula. Luckily, this works well in situations where the normal curve is appropriate [i.e. when np and n(1-p) are both bigger than 5].
  • A confidence interval is only related to sampling variability. The probability that your interval captures the true population value could be much lower if your survey is biased (e.g. bad question wording, low response rate, etc.).

What Is Principal Component Analysis (PCA) and How It Is Used?

Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed. The underlying data can be measurements describing properties of production samples, chemical compounds or reactions, process time points of a continuous process, batches from a batch process, biological individuals or trials of a DOE-protocol, for example.

This article is posted on our Sartorius Blog.

Using PCA can help identify correlations between data points, such as whether there is a correlation between consumption of foods like frozen fish and crisp bread in Nordic countries.

Principal component analysis today is one of the most popular multivariate statistical techniques. It has been widely used in the areas of pattern recognition and signal processing and is a statistical method under the broad title of factor analysis.

PCA is the mother method for MVDA

PCA forms the basis of multivariate data analysis based on projection methods. The most important use of PCA is to represent a multivariate data table as a smaller set of variables (summary indices) in order to observe trends, jumps, clusters and outliers. This overview may uncover the relationships between observations and variables, and among the variables.

PCA goes back to Cauchy but was first formulated in statistics by Pearson, who described the analysis as finding “lines and planes of closest fit to systems of points in space” [Jackson, 1991].

PCA is a very flexible tool and allows analysis of datasets that may contain, for example, multicollinearity, missing values, categorical data, and imprecise measurements. The goal is to extract the important information from the data and to express this information as a set of summary indices called principal components.

Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that approximate the data as well as possible in the least squares sense. A line or plane that is the least squares approximation of a set of data points makes the variance of the coordinates on the line or plane as large as possible.

PCA creates a visualization of data that minimizes residual variance in the least squares sense and maximizes the variance of the projection coordinates.

How PCA works

In a previous article, we explained why pre-treating data for PCA is necessary. Now, let’s take a look at how PCA works, using a geometrical approach.

Consider a matrix X with N rows (aka "observations") and K columns (aka "variables"). For this matrix, we construct a variable space with as many dimensions as there are variables (see figure below). Each variable represents one coordinate axis. For each variable, the length has been standardized according to a scaling criterion, normally by scaling to unit variance. You can find more details on scaling to unit variance in the previous blog post.

A K-dimensional variable space. For simplicity, only three variable axes are displayed. The “length” of each coordinate axis has been standardized according to a specific criterion, usually unit variance scaling.

In the next step, each observation (row) of the X-matrix is placed in the K-dimensional variable space. Consequently, the rows in the data table form a swarm of points in this space.

The observations (rows) in the data matrix X can be understood as a swarm of points in the variable space (K-space).

Mean centering

Next, mean-centering involves the subtraction of the variable averages from the data. The vector of averages corresponds to a point in the K-space.

In the mean-centering procedure, you first compute the variable averages. This vector of averages is interpretable as a point (here in red) in space. The point is situated in the middle of the point swarm (at the center of gravity).

The subtraction of the averages from the data corresponds to a re-positioning of the coordinate system, such that the average point now is the origin.

The mean-centering procedure corresponds to moving the origin of the coordinate system to coincide with the average point (here in red).
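Mean-centering and unit variance scaling can be sketched in plain Python. The small data matrix is hypothetical, the function name `autoscale` is illustrative, and the sample standard deviation (dividing by N − 1) is an assumption about the scaling convention.

```python
from statistics import mean, stdev

def autoscale(X):
    """Mean-center each column and scale it to unit variance."""
    columns = list(zip(*X))
    means = [mean(col) for col in columns]   # the "average point" in K-space
    sds = [stdev(col) for col in columns]    # sample standard deviation
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in X]

# Hypothetical data: N = 4 observations (rows), K = 3 variables (columns).
X = [
    [2.0, 10.0, 0.5],
    [4.0, 30.0, 0.7],
    [6.0, 20.0, 0.9],
    [8.0, 40.0, 0.3],
]
Z = autoscale(X)
```

After this step every column of `Z` has mean 0 and variance 1, so the average point sits at the origin of the coordinate system.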

The first principal component

After mean-centering and scaling to unit variance, the data set is ready for computation of the first summary index, the first principal component (PC1). This component is the line in the K-dimensional variable space that best approximates the data in the least squares sense. This line goes through the average point. Each observation (yellow dot) may now be projected onto this line in order to get a coordinate value along the PC-line. This new coordinate value is also known as the score.

The first principal component (PC1) is the line that best accounts for the shape of the point swarm. It represents the maximum variance direction in the data. Each observation (yellow dot) may be projected onto this line in order to get a coordinate value along the PC-line. This value is known as a score.

The second principal component

Usually, one summary index or principal component is insufficient to model the systematic variation of a data set. Thus, a second summary index – a second principal component (PC2) – is calculated. The second PC is also represented by a line in the K-dimensional variable space, which is orthogonal to the first PC. This line also passes through the average point, and improves the approximation of the X-data as much as possible.

The second principal component (PC2) is oriented such that it reflects the second largest source of variation in the data while being orthogonal to the first PC. PC2 also passes through the average point.

Two principal components define a model plane

When two principal components have been derived, they together define a plane, a window into the K-dimensional variable space. By projecting all the observations onto the low-dimensional sub-space and plotting the results, it is possible to visualize the structure of the investigated data set. The coordinate values of the observations on this plane are called scores, and hence the plotting of such a projected configuration is known as a score plot.

Two PCs form a plane. This plane is a window into the multidimensional space, which can be visualized graphically. Each observation may be projected onto this plane, giving a score for each.
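The geometric steps above can be sketched with NumPy. The data are randomly generated for illustration, and computing the components through a singular value decomposition is one standard route, not necessarily the algorithm used by any particular software package.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: N = 50 observations, K = 3 correlated variables.
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[1.0, 0.8, -0.6]]) + 0.3 * rng.normal(size=(50, 3))

# Mean-center and scale each variable to unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Rows of Vt are unit-length directions in K-space, ordered by the
# amount of variance they capture.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

# Scores: coordinates of each observation on the PC1/PC2 model plane.
T = Z @ Vt[:2].T                 # columns are t1 and t2
explained = s**2 / np.sum(s**2)  # share of variance per component
print(explained[:2])
```

Plotting the two columns of `T` against each other gives the score plot described in the text.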

Modeling a Data Set

Now, let’s consider what this looks like using a data set of foods commonly consumed in different European countries. The figure below displays the score plot of the first two principal components. These scores are called t1 and t2. The score plot is a map of 16 countries. Countries close to each other have similar food consumption profiles, whereas those far from each other are dissimilar. The Nordic countries (Finland, Norway, Denmark and Sweden) are located together in the upper right-hand corner, thus representing a group of nations with some similarity in food consumption. Belgium and Germany are close to the center (origin) of the plot, which indicates they have average properties.

The PCA score plot of the first two PCs of a data set about food consumption profiles. This provides a map of how the countries relate to each other. The first component explains 32% of the variation, and the second component 19%. Colored by geographic location (latitude) of the respective capital city.

How to Interpret the Score Plot

In a PCA model with two components, that is, a plane in K-space, which variables (food provisions) are responsible for the patterns seen among the observations (countries)? We would like to know which variables are influential, and also how the variables are correlated. Such knowledge is given by the principal component loadings (graph below). These loading vectors are called p1 and p2.

The figure below displays the relationships between all 20 variables at the same time. Variables contributing similar information are grouped together, that is, they are correlated. Crisp bread (Crisp_br) and frozen fish (Fro_Fish) are examples of two variables that are positively correlated. When the numerical value of one variable increases or decreases, the numerical value of the other variable has a tendency to change in the same way.

When variables are negatively (“inversely”) correlated, they are positioned on opposite sides of the plot origin, in diagonally opposed quadrants. For instance, the variables garlic and sweetener are inversely correlated, meaning that when garlic increases, sweetener decreases, and vice versa.

PCA loading plot of the first two principal components (p2 vs p1) comparing foods consumed.

If two variables are positively correlated, when the numerical value of one variable increases or decreases, the numerical value of the other variable has a tendency to change in the same way.

Furthermore, the distance to the origin also conveys information. The further away from the plot origin a variable lies, the stronger the impact that variable has on the model. This means, for instance, that the variables crisp bread (Crisp_br), frozen fish (Fro_Fish), frozen vegetables (Fro_Veg) and garlic (Garlic) separate the four Nordic countries from the others. The four Nordic countries are characterized as having high values (high consumption) of the former three provisions, and low consumption of garlic. Moreover, the model interpretation suggests that countries like Italy, Portugal, Spain and to some extent, Austria have high consumption of garlic, and low consumption of sweetener, tinned soup (Ti_soup) and tinned fruit (Ti_Fruit).

Geometrically, the principal component loadings express the orientation of the model plane in the K-dimensional variable space. The direction of PC1 in relation to the original variables is given by the cosine of the angles a1, a2, and a3. These values indicate how the original variables x1, x2, and x3 “load” into (meaning contribute to) PC1. Hence, they are called loadings.

The second set of loading coefficients expresses the direction of PC2 in relation to the original variables. Hence, given the two PCs and three original variables, six loading values (cosine of angles) are needed to specify how the model plane is positioned in the K-space.

The principal component loadings uncover how the PCA model plane is inserted in the variable space. The loadings are used for interpreting the meaning of the scores.
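A short NumPy sketch of the loadings for three variables and two components. The data are hypothetical: x1 and x2 are built to be positively correlated and x3 to be negatively correlated with both. Because each loading vector has unit length, its entries are the cosines of the angles between the component and the variable axes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data with three variables x1, x2, x3.
base = rng.normal(size=(100, 1))
X = np.hstack([base, base, -base]) + 0.2 * rng.normal(size=(100, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T  # loadings: rows are variables, columns are p1 and p2

# Angles a1, a2, a3 between PC1 and the three variable axes.
angles_deg = np.degrees(np.arccos(P[:, 0]))
print(P)
print(angles_deg)
```

`P` holds the six loading values mentioned above. The sign of a whole loading vector is arbitrary, but x1 and x2 share a sign on p1 while x3 has the opposite sign, matching their correlations.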

Confidence Intervals and Levels

The confidence interval is the plus-or-minus figure usually reported in newspaper or television opinion poll results. For example, if you use a confidence interval of 4 and 47% of your sample picks an answer, you can be “sure” that if you had asked the question of the entire relevant population, between 43% (47-4) and 51% (47+4) would have picked that answer.

The confidence level tells you how sure you can be. It is expressed as a percentage and represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level.

When you put the confidence level and the confidence interval together, you can say that you are 95% sure that the true percentage of the population is between 43% and 51%.

Factors that Affect Confidence Intervals
The confidence interval is based on the margin of error. There are three factors that determine the size of the confidence interval for a given confidence level. These are: sample size, percentage and population size.

Sample Size
The larger your sample, the more sure you can be that their answers truly reflect the population. This indicates that for a given confidence level, the larger your sample size, the smaller your confidence interval. However, the relationship is not linear (i.e., doubling the sample size does not halve the confidence interval).

Percentage
Your accuracy also depends on the percentage of your sample that picks a particular answer. If 99% of your sample said “Yes” and 1% said “No” the chances of error are remote, irrespective of sample size. However, if the percentages are 51% and 49% the chances of error are much greater. It is easier to be sure of extreme answers than of middle-of-the-road ones.

When determining the sample size needed for a given level of accuracy you must use the worst case percentage (50%). You should also use this percentage if you want to determine a general level of accuracy for a sample you already have. To determine the confidence interval for a specific answer your sample has given, you can use the percentage picking that answer and get a smaller interval.
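The effect of sample size and percentage can be sketched with the usual normal-approximation formula for a proportion. The formula and the 1.96 multiplier for a 95% confidence level are standard, but whether any particular calculator uses exactly this form is an assumption; the function names are illustrative.

```python
from math import ceil, sqrt

Z95 = 1.96  # multiplier for a 95% confidence level

def margin_of_error(n, p=0.5, z=Z95):
    """Half-width of the confidence interval, as a proportion.
    p = 0.5 is the worst-case percentage."""
    return z * sqrt(p * (1 - p) / n)

def sample_size(margin, p=0.5, z=Z95):
    """Smallest sample size giving at most the requested margin."""
    return ceil(z**2 * p * (1 - p) / margin**2)

m_1000 = margin_of_error(1000)
m_2000 = margin_of_error(2000)
ratio = m_1000 / m_2000        # sqrt(2): doubling n does not halve the interval
n_needed = sample_size(0.04)   # worst-case n for a confidence interval of 4 points
m_extreme = margin_of_error(1000, p=0.99)
m_middle = margin_of_error(1000, p=0.51)
```

Halving the interval requires four times the sample, and an answer picked by 99% of the sample carries a much smaller interval than one picked by 51%.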

Population Size
How many people are there in the group your sample represents? This may be the number of people in a city you are studying, the number of people who buy new cars, etc. Often you may not know the exact population size. This is not a problem. The mathematics of probability proves the size of the population is irrelevant, unless the size of the sample exceeds a few percent of the total population you are examining. This means that a sample of 500 people is equally useful in examining the opinions of a state of 15,000,000 as it would be for a city of 100,000. For this reason, the sample calculator ignores the population size when it is “large” or unknown. Population size is only likely to be a factor when you work with a relatively small and known group of people.
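The claim about population size can be checked with the standard finite population correction, which is not spelled out in the text and is introduced here as an assumption. The sketch compares a sample of 500 drawn from the state, the city, and a hypothetical small group of 1,000.

```python
from math import sqrt

def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Margin of error for a proportion, optionally applying the
    finite population correction sqrt((N - n) / (N - 1))."""
    margin = z * sqrt(p * (1 - p) / n)
    if population is not None:
        margin *= sqrt((population - n) / (population - 1))
    return margin

state = margin_of_error(500, population=15_000_000)
city = margin_of_error(500, population=100_000)
small = margin_of_error(500, population=1_000)  # sample is half the group
print(f"{state:.4f} {city:.4f} {small:.4f}")
```

The state and city margins are nearly identical, while the small group's margin shrinks noticeably because the sample covers half of it.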

The confidence interval calculations assume you have a genuine random sample of the relevant population. If your sample is not truly random, you cannot rely on the intervals. Non-random samples usually result from some flaw in the sampling procedure. An example of such a flaw is to only call people during the day, and miss almost everyone who works. For most purposes, the non-working population cannot be assumed to accurately represent the entire (working and non-working) population.

Most information on this page was obtained from The Survey System
