Have Daszak & Shi Used Arbitrary Data Exclusions to Bolster Proximal Wet Lab Origin of SARS-CoV-2?
Zheng-Li Shi and Peter Daszak revealed their first analysis was biased to favor "away from WIV." Their new analysis is no more trustworthy.
Introduction
Surprise! The authors of this study retract their own study:
But not to worry, they replaced it with this study:
If you’re confused, it’s not on you. The authors appear to have noticed that the geographic weighted analysis would make it more likely to appear that a bat was transported from afar to WIV for analysis, which we already know was happening. So, in public, they appear to have undone their biasing of their study. Grab a coffee, tea, or other… and read on.
The revised analysis of the origins and cross-species transmission of bat coronaviruses in China, published by researchers associated with EcoHealth Alliance and the Wuhan Institute of Virology (WIV), raises serious concerns about the arbitrary exclusion of data. Specifically, the updated study excluded sequences from northern Laos and 27 duplicated sequences from the original dataset, citing these as “experimental errors.” Duplicating one, maybe two, sequences might occur sometimes. But such massive “errors” are unlikely to occur without intent. Further, a closer examination reveals that these exclusions may have been motivated by a desire to simplify the phylogenetic narrative in favor of the natural origin hypothesis. Such practices further erode confidence in the integrity of research from these organizations, already mired in controversy.
Original Data Inclusion: Valid or "Errors"?
The original dataset included 41 sequences from bats sampled in northern Laos (Luang Namtha province) and 27 duplicated sequences. These Laotian sequences represent an essential geographical and evolutionary link between bat populations in Southeast Asia and southern China, regions that share ecological continuity. Labeling these sequences as "errors" without thorough justification raises questions about the motives behind their exclusion.
Duplicated sequences are, admittedly, a data management oversight, but their impact on phylogenetic inferences depends on their distribution and frequency. Simply excluding these sequences rather than addressing their influence analytically suggests a lack of rigor or a will to mislead.
How the Results Differ Significantly
The updated analysis, after excluding the Laotian sequences and duplicates, makes substantial changes to key conclusions:
Revised Host-Switching Rates: The revised study reports different rates and patterns of host-switching events. With Laotian sequences removed, the frequency of inter-family and inter-genus host switches within China increased, concentrating evolutionary activity in Chinese bat populations. This adjustment aligns conveniently with the “official” natural (proximal) origin narrative.
Geographical Dispersal Patterns: The revised data repositions southern China, particularly Rhinolophus spp. bats, as the epicenter of coronavirus evolution and spillover. Excluding Laotian sequences erases critical dispersal routes that suggest a broader Southeast Asian context, diminishing the plausibility of alternative scenarios such as wider interregional transmission. This is unfortunate, since we know SARS-CoV-2 shares functional elements with viruses sampled from that region in the mid-2000s (HKU sequences, as described in the IPAK report IPAK RESEARCH REPORT COVID-19-2020_1).
Phylogenetic Tree Simplification: Excluding long-isolated Laotian lineages reduces phylogenetic diversity, making the tree appear more cohesive and localized to China. This simplification can give the false impression of a straightforward zoonotic origin and undercuts more complex evolutionary possibilities.
Downplaying Regional Diversity: The exclusion of data from Laos inherently limits the scope of the analysis. Without these sequences, the study implies that the evolution of SARS-CoV-2 is confined to China, undermining the exploration of transnational evolutionary pathways.
The Impact of Including Duplicate Sequences in Phylogenetic Analysis in General
Including duplicate sequences in phylogenetic analysis can introduce significant distortions and inefficiencies, ultimately compromising the accuracy and reliability of evolutionary inferences. Duplicate sequences do not add new evolutionary information; instead, they artificially inflate the representation of certain taxa or lineages. This can distort the inferred tree topology, as clustering algorithms may interpret the duplicates as independent evidence for a grouping, skewing the analysis.
One of the most significant consequences of including duplicates is the potential to produce misleading evolutionary scenarios. For example, duplicates can mask true evolutionary events, such as gene duplication, recombination, or horizontal gene transfer, by creating false signals in the dataset. This distortion could result in incorrect branch placements or overconfidence in certain relationships, ultimately misrepresenting the evolutionary history. In high-stakes research, such as tracing the origins of zoonotic pathogens like SARS-CoV-2, this could lead to erroneous conclusions about how key traits evolved or were transmitted across species.
Another critical issue is the inflation of support values for branches in the phylogenetic tree. Bootstrap analyses, which assess the statistical confidence of inferred relationships, may yield artificially high values when duplicates reinforce specific branch placements. This can create a false confidence in the tree’s topology, misleading researchers about the robustness of the inferred relationships. For example, inflated support values might obscure genuine uncertainty about key evolutionary pathways in analyses involving SARS-related coronaviruses.
The presence of duplicates also reduces the signal-to-noise ratio in the dataset. By introducing redundancy, duplicates dilute the meaningful evolutionary signals critical for detecting subtle patterns. This is particularly problematic in datasets with limited diversity, where genuine evolutionary signals are already difficult to discern. For instance, duplicates could obscure important insights into cross-species transmission dynamics in understanding zoonotic spillovers, such as those involving bat coronaviruses.
From a computational perspective, duplicates increase the dataset size without contributing additional information. While this introduces inefficiencies, such as longer processing times, the greater concern lies in how these inefficiencies can hinder robust analyses. Larger datasets with redundant information can make it more challenging to apply advanced phylogenetic methods, reducing the clarity of the results.
Researchers routinely identify and remove duplicate sequences to mitigate these issues before conducting phylogenetic analysis. In cases where duplicates are retained—for example, to evaluate sequencing consistency—it is essential to assess their impact and report their inclusion transparently and critically. Phylogenetic tools, such as maximum likelihood or Bayesian inference models, can sometimes account for redundancy, but these methods are not foolproof and require careful parameterization.
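The deduplication step described above can be sketched in a few lines. This is a minimal illustration, not the study's procedure: it assumes sequences arrive as (identifier, sequence) pairs and that "duplicate" means an exact match after uppercasing and stripping alignment gaps. The function name collapse_duplicates and the toy records are hypothetical.

```python
from collections import defaultdict

def collapse_duplicates(records):
    """Group (id, sequence) pairs by normalized sequence.

    Returns (unique, report): unique keeps the first id seen for each
    distinct sequence; report maps that id to the ids collapsed into it.
    """
    groups = defaultdict(list)
    for seq_id, seq in records:
        # Assumption: exact identity after removing gaps and case differences.
        key = seq.upper().replace("-", "")
        groups[key].append(seq_id)

    unique = []
    report = {}
    for key, ids in groups.items():
        unique.append((ids[0], key))
        if len(ids) > 1:
            report[ids[0]] = ids[1:]
    return unique, report

# Hypothetical toy records, not taken from the study.
records = [
    ("bat_A", "ATGGCA"),
    ("bat_B", "atggca"),   # same sequence, different case
    ("bat_C", "ATG-GCA"),  # same sequence with an alignment gap
    ("bat_D", "ATGGCT"),
]
unique, report = collapse_duplicates(records)
print(unique)
print(report)
```

Returning a report of which identifiers were collapsed, rather than silently dropping them, keeps any exclusion auditable by readers.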
Ultimately, including duplicate sequences poses a significant risk to the integrity of phylogenetic analysis, particularly in research areas with critical implications for public health or evolutionary biology. By prioritizing unique, high-quality sequences and applying rigorous quality control, researchers can enhance the reliability of their analyses and ensure that their conclusions accurately reflect evolutionary relationships. In the context of tracing the origins of SARS-CoV-2, for instance, avoiding duplicates is crucial for maintaining confidence in the scientific process and its findings. Their inclusion in the first place may or may not have been accidental.
Evidence of Methodological Bias: Duplicated Sequences and the Proximal Origins Narrative
The inclusion of duplicated sequences in phylogenetic analyses conducted by Daszak et al. introduces clear methodological biases that align with the narrative of proximal origins of SARS-CoV-2 within China (in their view, supporting the wet market origin of locally hunted animals). By over-representing sequences from certain regions, particularly those outside China, the original dataset artificially skewed the inferred phylogeographic and evolutionary patterns. This biased approach not only distorts the analytical framework but also strategically reinforces a narrative that supports a natural spillover event within Chinese bat populations.
Inflated Representation of Northern Laos
The original analysis included 27 duplicated sequences and erroneously incorporated 41 sequences from bats sampled in northern Laos. These duplications artificially increased the representation of viral sequences from outside China, disproportionately influencing the analysis to highlight evolutionary activity in regions adjacent to southern China. This over-representation would have amplified the role of Laos in phylogenetic tree topology, dispersal dynamics, and inferred transmission pathways, which could serve to dilute or obscure the evolutionary connections.
Bayesian Bias Toward Over-Represented Regions
The Bayesian phylogeographic framework employed by Daszak et al. is particularly sensitive to the distribution of sequences across regions. Duplicated sequences from northern Laos (first analysis) would disproportionately elevate the posterior probabilities of that region being ancestral or significant in the evolutionary history of bat coronaviruses, which would support a lab-transport origin. Such inflated probabilities inherently bias the inferred tree topology and evolutionary dynamics, potentially leading to conclusions that overstate the role of Laos in viral emergence.
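To make the sensitivity concrete, here is a toy calculation, offered as a sketch under strong assumptions rather than a reproduction of the study's model. It assumes a star tree (every tip hangs directly off the root), a symmetric two-state location model, a uniform prior on the root, and that every tip counts as an independent observation. The rate q, branch length t, and tip lists are hypothetical. A full analysis co-estimates the tree and rates, so the size of the effect there would differ.

```python
import math

def transition_prob(same, k, q, t):
    """Transition probability under a symmetric k-state model.

    q is the rate of change to each other state; t is the branch length.
    """
    decay = math.exp(-k * q * t)
    if same:
        return 1 / k + (k - 1) / k * decay
    return 1 / k - decay / k

def root_posterior(tip_states, states, q, t):
    """Posterior of the root location on a star tree with a uniform prior.

    Assumption: each tip is an independent observation on a branch of length t.
    """
    k = len(states)
    likelihood = {}
    for root in states:
        like = 1.0
        for tip in tip_states:
            like *= transition_prob(tip == root, k, q, t)
        likelihood[root] = like
    total = sum(likelihood.values())
    return {s: likelihood[s] / total for s in states}

states = ["China", "Laos"]
# Hypothetical toy tips, not the study's data.
balanced = ["China", "China", "Laos", "Laos"]
with_duplicates = balanced + ["Laos", "Laos", "Laos"]

p_balanced = root_posterior(balanced, states, q=0.5, t=1.0)
p_duplicated = root_posterior(with_duplicates, states, q=0.5, t=1.0)
print(p_balanced)
print(p_duplicated)
```

With balanced tips the two locations are equally probable at the root; adding three repeated Laos tips moves most of the posterior mass to Laos, even though no new information was collected.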
Strategic Revisions to Align with Proximal Origins
The subsequent removal of duplicated sequences and sequences from northern Laos in the revised analysis led to significant shifts in phylogenetic inferences. With the duplicates excluded, the narrative shifted back to focusing on China as the primary hotspot of evolutionary activity and viral origin. This revision conveniently supports the proximal origins hypothesis, which posits that SARS-CoV-2 emerged naturally within Chinese bat populations before transmitting to humans. The deliberate exclusion of data that initially highlighted regions outside China suggests a concerted effort to align the analysis with the preferred narrative.
Implications for Scientific Integrity
The selective inclusion and subsequent exclusion of sequences highlight a lack of methodological consistency and transparency. The first analysis demonstrates clear signs of bias by initially inflating the role of regions outside China and then revising the dataset to emphasize China. This approach undermines confidence in the objectivity of the research and raises questions about the influence of external pressures or predetermined conclusions on the scientific process.
This analysis's handling of duplicated sequences illustrates how methodological choices can be leveraged to fit a desired narrative. The lack of transparency in justifying these inclusions and exclusions further emphasizes the need for rigorous, unbiased approaches to phylogenetic analysis, particularly in research with profound implications for public health and global policy.
Phylogeographic Method Employed by Daszak et al.
The phylogeographic analysis conducted by Daszak et al. employed a Bayesian statistical framework to reconstruct the evolutionary history and spatial dynamics of bat coronaviruses. The methodology integrated genetic, temporal, and geographic data to infer transmission patterns and ancestral host relationships. By using this approach, the researchers aimed to identify the origins and evolutionary pathways of coronaviruses across regions and host species, focusing on understanding cross-species transmission events and geographic dispersal.
Data Partitioning and Host State Assignment
Sequences were grouped into datasets based on their geographic origin and the bat species from which they were sampled. Geographic regions were clustered into ecological zones, employing hierarchical clustering methods designed to reflect ecological diversity and geographical contiguity. These zones were defined based on environmental factors and known bat habitats, though the specifics of this clustering were not extensively detailed in the study. Host family information was treated as a discrete character state, allowing for reconstructing ancestral host species through probabilistic modeling. However, uncertainties in host species identification and their impact on ancestral state reconstruction were not explicitly addressed, which may influence the robustness of these inferences.
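Because the clustering specifics were not detailed, the following is only a generic sketch of what hierarchical (agglomerative) clustering of sampling sites into zones can look like. It assumes single-linkage merging on plain two-dimensional coordinates; the site names, coordinates, and the choice of linkage are hypothetical and are not taken from the study.

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters until k remain.

    points maps a site name to an (x, y) coordinate.
    """
    clusters = [[name] for name in points]

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > k:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical sampling sites on an arbitrary grid.
sites = {
    "site_a": (0.0, 0.0),
    "site_b": (0.0, 1.0),
    "site_c": (10.0, 10.0),
    "site_d": (10.0, 11.0),
}
zones = single_linkage(sites, k=2)
print(zones)
```

Where the zone boundaries fall depends on the distance measure, the linkage rule, and the number of zones requested, which is why undocumented clustering choices matter for the downstream location states.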
Bayesian Inference with BEAST
The analysis relied on BEAST (Bayesian Evolutionary Analysis Sampling Trees), a widely used software for phylogenetic inference. BEAST models evolutionary relationships by sampling from posterior distributions, accounting for uncertainties in tree topology, branch lengths, and substitution rates. Temporal data, incorporated through molecular clock models, allowed for estimating the timing of transmission events and the spread of viruses across regions. The study used relaxed molecular clock assumptions to accommodate rate variation across lineages, enhancing the resolution of evolutionary timelines. However, the accuracy of these timing estimates depends heavily on the quality of the temporal data and calibration points used.
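For readers unfamiliar with relaxed clocks, one common form is the uncorrelated lognormal clock, in which each branch draws its own relative rate. The text above does not say which relaxed-clock variant was used, so this is an illustrative sketch only; the mean rate, branch durations, and sigma are hypothetical values.

```python
import random

def sample_branch_rates(n_branches, sigma, rng):
    """Draw one relative rate per branch from a lognormal with mean 1."""
    # For lognormal(mu, sigma) the mean is exp(mu + sigma**2 / 2),
    # so mu = -sigma**2 / 2 gives an expected relative rate of 1.
    mu = -sigma ** 2 / 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]

rng = random.Random(0)
mean_rate = 1e-3                 # hypothetical substitutions per site per year
durations = [10.0, 25.0, 40.0]   # hypothetical branch durations in years
rates = sample_branch_rates(len(durations), sigma=0.5, rng=rng)

# Expected substitutions per site on each branch under the sampled rates.
branch_lengths = [mean_rate * r * t for r, t in zip(rates, durations)]
print(branch_lengths)
```

Because the observed genetic distance on a branch is the product of rate and time, dates inferred under such a model lean heavily on sampling dates and calibration points, as noted above.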
Ancestral State Reconstruction
Bayesian inference was applied to reconstruct ancestral states for both host species and geographic locations. Posterior probabilities were calculated at each node of the phylogenetic tree, indicating the likelihood of different hosts or regions being the source of specific viral lineages. This enabled the researchers to infer evolutionary transitions between host species and geographic origins of key lineages. While this approach is robust, it is sensitive to biases introduced by the inclusion or exclusion of sequences, as discussed in the context of data handling choices.
Spatiotemporal Dispersal Dynamics
The phylogeographic method also reconstructed the spatial dynamics of viral dispersal over time. By integrating geographic data with phylogenetic trees, the researchers traced the movement of viruses across regions, identifying potential hotspots for cross-species transmission and evolutionary activity. This included modeling the spread of coronaviruses from one ecological zone to another. However, the geographic resolution of the analysis appears to be coarse, focusing on large regions rather than fine-scale local dynamics, which limits the precision of the dispersal inferences.
Limitations and Sensitivity to Data Bias
While the Bayesian framework provides a powerful tool for phylogeographic inference, it is sensitive to how the data are represented. Including duplicated sequences and sequences from northern Laos disproportionately affected posterior probabilities and phylogenetic topology, artificially inflating the role of certain regions in the evolutionary history. These methodological choices introduced biases that significantly influenced the outcomes of the analysis. For example, over-representing northern Laos sequences initially highlighted regions outside of China (more specifically, away from the WIV) as contributors to viral evolution. Subsequent revisions, which excluded these sequences, shifted the narrative back toward China as the primary hotspot of evolutionary activity, raising questions about the consistency and objectivity of the data-handling process.
Implications for Framing SARS-CoV-2 Origins
By employing this Bayesian phylogeographic framework, Daszak et al. produced a detailed reconstruction of bat coronavirus evolution. However, the methodological sensitivity to data representation highlights the importance of rigorous data curation and transparency. The results were influential in framing the narrative of SARS-CoV-2's origins within Chinese bat populations, but the selective handling of sequences from outside China limits the exploration of alternative hypotheses. Competing explanations, such as broader regional contributions or laboratory-associated origins, were not robustly tested in the analysis. These omissions underscore the need for balanced and transparent phylogeographic approaches to ensure that such studies contribute reliably to understanding zoonotic spillovers and viral evolution.
The Impact of Excluding HKU Sequences on Connections to Ralph Baric's Research
The exclusion of HKU sequences from the revised phylogenetic analysis has significant implications for understanding the evolutionary history of SARS-CoV-2. These sequences, studied and published by Ralph Baric's lab and collaborators in the mid-2000s, represent crucial components of the Sarbecovirus lineage. Their removal effectively obscures potential links between SARS-CoV-2 and earlier research on bat coronaviruses with similar genomic features.
1. HKU Sequences as Historical Context
The HKU sequences (e.g., HKU3-1, HKU3-2, and HKU3-3) were pivotal in establishing the phylogenetic relationships among Sarbecoviruses. These sequences provided evidence of a shared evolutionary history and demonstrated the presence of key genetic motifs that later appeared in SARS-CoV-2. For example:
The receptor-binding domain (RBD) of the spike protein in HKU sequences shows structural and functional similarities to that of SARS-CoV and SARS-CoV-2.
Pathogenic protein motifs present in HKU sequences have been highlighted in studies as precursors to elements seen in SARS-CoV-2.
The HKU sequences, such as HKU3-1, HKU3-2, and HKU3-3, were sampled from bats in Hong Kong during extensive field studies conducted in the mid-2000s. Researchers collected these sequences from Rhinolophus spp., commonly known as horseshoe bats, which are widely recognized as natural reservoirs for Sarbecoviruses. The sampling sites included caves and rural habitats across Hong Kong, areas known for harboring diverse bat populations.
These sequences were isolated as part of a targeted effort to investigate the diversity of bat coronaviruses in the wake of the 2003 SARS epidemic. Hong Kong was a critical location for this research due to its proximity to Guangdong Province, where the first SARS cases emerged, and its ecological similarity to other regions in southern China. The field studies aimed to identify potential viral reservoirs and assess the risk of future zoonotic spillovers, and the HKU sequences became foundational in shaping the early understanding of Sarbecovirus evolution.
Collected between 2004 and 2006, the HKU sequences represent some of the earliest identified relatives of SARS-CoV. Their genomic characteristics provided valuable insights into the evolutionary pathways of Sarbecoviruses, including features later seen in SARS-CoV-2. The location of their sampling underscores the regional distribution of Sarbecoviruses across southern China and Southeast Asia, bridging critical gaps in the timeline of coronavirus evolution.
Excluding these sequences from the revised phylogenetic analyses effectively narrows the scope of the research. It removes an essential link to the historical and geographical context of Sarbecovirus diversity, obscuring potential connections between SARS-CoV-2 and earlier discoveries. This omission not only limits the depth of the evolutionary narrative but also sidesteps questions about how key genetic features, including those studied in experimental research, may have originated or evolved. By sidelining the HKU sequences, the updated analyses reduce the opportunity to fully explore the broader regional and historical factors contributing to the emergence of SARS-CoV-2.
By excluding these sequences, the revised analysis removes critical evidence that could illuminate how specific genomic features of SARS-CoV-2, including its unique furin cleavage site and binding affinities, may have predated the pandemic and been present in earlier isolates studied by Baric’s lab.
Ralph Baric’s laboratory conducted pioneering work on SARS-like coronaviruses, including the reconstruction of a synthetic virus using consensus sequences from the HKU3-1, HKU3-2, and HKU3-3 isolates, along with an additional sequence, RP3. These HKU sequences, originally obtained from bats in Hong Kong, were integral to creating a functional virus capable of infecting human ACE2 (hACE2) cells. In this process, the lab made specific modifications to the virus’s spike protein, particularly in its receptor-binding domain, to enhance its ability to bind to hACE2 receptors. This modification demonstrated the potential for laboratory-engineered viruses to cross species barriers and infect human cells more effectively, a crucial feature for understanding the zoonotic potential of coronaviruses.
The synthetic virus constructed in Baric’s lab included key elements of the HKU sequences that predate SARS-CoV-2. Notably, these sequences exhibited a pathogenic motif signature that shares significant overlap with features found in SARS-CoV-2, such as a truncated N-terminal spike domain and a retroviral-like envelope motif (Gp41). These features, first identified in HKU3-3, were also retained in the synthetic construct, highlighting the relevance of these early experiments to the current understanding of coronavirus evolution. The ability to synthesize a biologically active virus with enhanced human infectivity underscores the importance of the HKU sequences in tracing the origins of SARS-CoV-2 and assessing the potential contribution of laboratory research to its emergence.
The exclusion of HKU sequences from revised phylogenetic analyses has profound implications for understanding the origins of SARS-CoV-2. Without these sequences, the analyses obscure the connection between SARS-CoV-2 and earlier laboratory work that demonstrated the enhancement of human infectivity through targeted modifications. This exclusion narrows the phylogenetic scope, focusing exclusively on natural spillover events within China (ostensibly the wet market), while omitting critical data that links SARS-CoV-2 to experimental constructs derived from the HKU sequences. Furthermore, this selective exclusion undermines transparency, as it prevents a thorough investigation into how laboratory modifications may parallel natural evolutionary processes, thereby obfuscating potential overlaps between research activities and the features of SARS-CoV-2.
Baric’s work with the HKU sequences illustrates the power of synthetic biology to create infectious coronaviruses with enhanced human infectivity. By excluding these sequences from updated analyses, researchers limit the opportunity to explore the evolutionary and experimental history of SARS-CoV-2 fully. The HKU sequences are vital for understanding the natural evolution of coronaviruses and evaluating how laboratory research may have intersected with these processes. Their omission raises significant questions about selective data handling and the broader implications for investigating the pandemic's origins. Including these sequences in future analyses is essential for transparency, accountability, and a comprehensive understanding of how SARS-CoV-2 emerged.
2. Obscuring Research Lineage
Ralph Baric's lab conducted extensive work on reverse genetics and chimeric virus studies using sequences from the HKU lineage. These efforts included recombination experiments that combined spike proteins from different coronaviruses to assess zoonotic potential. By excluding HKU sequences from the analysis:
The study prevents phylogenetic inferences that could trace elements of SARS-CoV-2 to constructs previously studied in Baric’s lab.
The absence of these sequences eliminates a direct link to experimental platforms and genomic backbones used in earlier research, leaving gaps in the evolutionary timeline.
3. Implications for Understanding SARS-CoV-2 Origins
The omission of HKU sequences hinders the ability to address critical questions about the origins of SARS-CoV-2, particularly:
Whether specific genomic features of SARS-CoV-2, such as its enhanced human ACE2 binding, have roots in experimental work involving HKU-like viruses.
How natural or engineered recombination events involving HKU-like sequences may have contributed to the emergence of SARS-CoV-2.
4. Undermining Transparency
The exclusion of these sequences raises concerns about selective reporting and the narrowing of the dataset to favor a natural origin narrative. By removing HKU sequences, the study avoids revisiting questions about the potential overlap between natural evolution and gain-of-function research, thereby limiting a comprehensive examination of all plausible origins.
Undermining Confidence in Scientific Integrity
The decision to exclude valid data, particularly sequences that challenge a narrow geographical narrative, damages the credibility of EcoHealth Alliance and WIV for several reasons:
Selective Framing: By removing data that might complicate the natural origin hypothesis, the revised analysis appears tailored to support a preordained conclusion. This raises concerns about confirmation bias and the influence of political or institutional pressures. The authors show no overt awareness of the sensitivity of phylogenetic analyses to sparse vs. dense taxon sampling, which has been well known for over twenty years.
Lack of Transparency: The justification for labeling Laotian sequences as errors is insufficiently detailed. Were these sequences independently validated or cross-referenced with other datasets? Without such transparency, the rationale for exclusion seems arbitrary.
Reputation of Involved Institutions: Both EcoHealth Alliance and WIV have faced criticism for their roles in high-stakes virological research. This development exacerbates existing doubts about their commitment to unbiased scientific inquiry.
Erosion of Public Trust: Given the global stakes of understanding SARS-CoV-2’s origins, any appearance of data manipulation or selective reporting undermines public confidence in scientific research and fuels skepticism about the transparency of pandemic-related investigations.
Broader Implications for SARS-CoV-2 Origins Research
The exclusion of Laotian sequences from this analysis diminishes the robustness of the study in several key ways:
Missed Opportunities: Including Laotian sequences could illuminate evolutionary links between bat populations across Southeast Asia and southern China, broadening understanding of coronavirus transmission dynamics.
Alternative Hypotheses: Excluding data that might suggest interregional transmission or non-natural origins stifles the exploration of alternative hypotheses, such as accidental laboratory involvement.
Scientific Accountability: The apparent selectivity in data handling underscores the need for independent oversight in high-profile studies, especially those with significant policy and public health implications.
Conclusion
The revised analysis of SARS-CoV-2 origins exemplifies the dangers of arbitrary data exclusion. While presented as an effort to correct errors, removing Laotian sequences and duplicated data significantly alters the study’s conclusions in ways that favor the natural and proximal origin hypothesis. This development undermines confidence in the scientific integrity of EcoHealth Alliance and WIV, highlighting the need for transparent and unbiased investigations into the origins of SARS-CoV-2. Without comprehensive and inclusive analyses, public trust in virological research will continue to erode, leaving critical questions about the pandemic unanswered.