BioRxiv is a preprint server for biology, operated by Cold Spring Harbor Laboratory, a research and educational institution.





Ricard Argelaguet
May 3, 2021
Nature Biotechnology
The development of single-cell multimodal assays provides a powerful tool for investigating multiple dimensions of cellular heterogeneity, enabling new insights into development, tissue homeostasis and disease. A key challenge in the analysis of single-cell multimodal data is to devise appropriate strategies for tying together data across different modalities. The term 'data integration' has been used to describe this task, encompassing a broad collection of approaches ranging from batch correction of individual omics datasets to association of chromatin accessibility and genetic variation with transcription. Although existing integration strategies exploit similar mathematical ideas, they typically have distinct goals and rely on different principles and assumptions. Consequently, new definitions and concepts are needed to contextualize existing methods and to enable development of new methods. As the number of single-cell experiments with multiple data modalities increases, Argelaguet and colleagues review the concepts and challenges of data integration.
Chao Gao
April 19, 2021
Nature Biotechnology
Integrating large single-cell gene expression, chromatin accessibility and DNA methylation datasets requires general and scalable computational approaches. Here we describe online integrative non-negative matrix factorization (iNMF), an algorithm for integrating large, diverse and continually arriving single-cell datasets. Our approach scales to arbitrarily large numbers of cells using fixed memory, iteratively incorporates new datasets as they are generated and allows many users to simultaneously analyze a single copy of a large dataset by streaming it over the internet. Iterative data addition can also be used to map new data to a reference dataset. Comparisons with previous methods indicate that the improvements in efficiency do not sacrifice dataset alignment and cluster preservation performance. We demonstrate the effectiveness of online iNMF by integrating more than 1 million cells on a standard laptop, integrating large single-cell RNA sequencing and spatial transcriptomic datasets, and iteratively constructing a single-cell multi-omic atlas of the mouse motor cortex. A new algorithm enables scalable and iterative integration of single-cell datasets.
Lythgoe, K. A., Hall, M., Ferretti, L., de Cesare, M., MacIntyre-Cockett, G., Trebes, A., Andersson, M., Otecko, N., Wise, E. L., Moore, N., Lynch, J., Kidd, S., Cortes, N., Mori, M., Williams, R., Vernet, G., Justice, A., Green, A., Nicholls, S. M., Ansari, M. A., Abeler-Dörner, L., Moore, C. E., Peto, T. E. A., Eyre, D. W., Shaw, R., Simmonds, P., Buck, D., Todd, J. A., on behalf of the Oxford Virus Sequencing Analysis Group (OVSG), Connor, T. R., Ashraf, S., da Silva Filipe, A., Shepherd, J., Thomson, E. C., The COVID-19 Genomics UK (COG-UK) Consortium, Bonsall, D., Fraser, C., Golubchik, T.
April 16, 2021
A year into the severe acute respiratory syndrome coronavirus 2 pandemic, we are experiencing waves of new variants emerging. Some of these variants have worrying functional implications, such as increased transmissibility or antibody treatment escape. Lythgoe et al. have undertaken in-depth sequencing of more than 1000 hospital patients' isolates to find out how the virus is mutating within individuals. Overall, there seem to be consistent and reproducible patterns of within-host virus diversity. The authors observed only one or two variants in most samples, but a few carried many variants. Although the evidence indicates strong purifying selection, including in the spike protein responsible for viral entry, the authors also saw evidence for transmission clusters associated with households and other possible superspreader events. After transmission, most variants fizzled out, but occasionally some initiated ongoing transmission and wider dissemination. Science, this issue p. eabg0821 (doi: 10.1126/science.abg0821)
INTRODUCTION: Genome sequencing at an unprecedented scale during the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic is helping to track spread of the virus and to identify new variants. Most of this work considers a single consensus sequence for each infected person. Here, we looked beneath the consensus to analyze genetic variation within viral populations making up an infection and studied the fate of within-host mutations when an infection is transmitted to a new individual. Within-host diversity offers the means to help confirm direct transmission and identify new variants of concern.
RATIONALE: We sequenced 1313 SARS-CoV-2 samples from the first wave of infection in the United Kingdom. We characterized within-host diversity and dynamics in the context of transmission and ongoing viral evolution.
RESULTS: Within-host diversity can be described by the number of intrahost single nucleotide variants (iSNVs) occurring above a given minor allele frequency (MAF) threshold. We found that in lower-viral-load samples, stochastic sampling effects resulted in a higher variance in MAFs, leading to more iSNVs being detected at any threshold. Based on a subset of 27 pairs of high-viral-load replicate RNA samples (>50,000 uniquely mapped veSEQ reads, corresponding to a cycle threshold of ~22), iSNVs with a minimum 3% MAF were highly reproducible. Comparing samples from two time points from 41 individuals, taken on average 6 days apart (interquartile range 2 to 10), we observed a dynamic process of iSNV generation and loss. Comparing iSNVs among 14 household contact pairs, we estimated transmission bottleneck sizes of one to eight viruses. Consensus differences between individuals in the same household, where sample depth allowed iSNV detection, were explained by the presence of an iSNV at the same site in the paired individual, consistent with direct transmission leading to fixation. We next focused on a set of 563 high-confidence iSNV sites that were variant in at least one high-viral-load sample (>50,000 uniquely mapped); low-confidence iSNVs unlikely to represent genomic diversity were excluded. Within-host diversity was limited in high-viral-load samples (mean 1.4 iSNVs per sample). Two exceptions, each with >14 iSNVs, showed variant frequencies consistent with coinfection or contamination. Overall, we estimated that 1 to 2% of samples in our dataset were coinfected and/or contaminated. Additionally, one sample was coinfected with another coronavirus (OC43), with no detectable impact on diversity.
The ratio of nonsynonymous to synonymous (dN/dS) iSNVs was consistent with within-host purifying selection when estimated across the whole genome [dN/dS = 0.55, 95% confidence interval (95% CI) = 0.49 to 0.61] and for the Spike gene (dN/dS = 0.60, 95% CI = 0.45 to 0.82). Nevertheless, we observed Spike variants in multiple samples that have been shown to increase viral infectivity (L5F) or resistance to antibodies (G446V and A879V). We observed a strong association between high-confidence iSNVs and a consensus change on the phylogeny (153 cases), consistent with fixation after transmission or de novo mutations reaching consensus. Shared variants that never reached consensus (261 cases) were not phylogenetically associated.
CONCLUSION: Using robust methods to call within-host variants, we uncovered a consistent pattern of low within-host diversity, purifying selection, and narrow transmission bottlenecks. Within-host emergence of vaccine and therapeutic escape mutations is likely to be relatively rare, at least during early infection, when viral loads are high, but the observation of immune-escape variants in high-viral-load samples underlines the need for continued vigilance.
Figure: Diagram showing low SARS-CoV-2 within-host genetic diversity and narrow transmission bottleneck. Individuals with high viral load typically have few, if any, within-host variants. Narrow transmission bottlenecks mean that the major variant in the source individual was typically transmitted and the minor variants lost. Occasionally, the minor variant was transmitted, leading to a consensus change, or multiple variants were transmitted, resulting in a mixed infection. Credit: FontAwesome, licensed under CC BY 4.0.
Extensive global sampling and sequencing of the pandemic virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) have enabled researchers to monitor its spread and to identify concerning new variants. Two important determinants of variant spread are how frequently they arise within individuals and how likely they are to be transmitted. To characterize within-host diversity and transmission, we deep-sequenced 1313 clinical samples from the United Kingdom. SARS-CoV-2 infections are characterized by low levels of within-host diversity when viral loads are high and by a narrow bottleneck at transmission. Most variants are either lost or occasionally fixed at the point of transmission, with minimal persistence of shared diversity, patterns that are readily observable on the phylogenetic tree. Our results suggest that transmission-enhancing and/or immune-escape SARS-CoV-2 variants are likely to arise infrequently but could spread rapidly if successfully transmitted.
Daniel P. Cooke
March 29, 2021
Nature Biotechnology
Almost all haplotype-based variant callers were designed specifically for detecting common germline variation in diploid populations, and give suboptimal results in other scenarios. Here we present Octopus, a variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. Octopus combines sequencing reads and prior information to phase called genotypes of arbitrary ploidy, including those with somatic mutations. We show that Octopus accurately calls germline variants in individuals, including single nucleotide variants, indels and small complex replacements such as microinversions. Using a synthetic tumor data set derived from clean sequencing data from a sample with known germline haplotypes and observed mutations in a large cohort of tumor samples, we show that Octopus is more sensitive to low-frequency somatic variation, yet calls considerably fewer false positives than other methods. Octopus also outputs realigned evidence BAM files to aid validation and interpretation. Octopus detects germline and somatic variants with high sensitivity and accuracy.
James Rogers
July 31, 2020
Fox News
A type of fungus found at the site of the Chernobyl nuclear disaster was sent into space in a research project that aims to keep astronauts safe from radiation on deep space missions.
The Economist
May 6, 2020
The Economist
Will that change how science is published?
R. Prasad
February 28, 2020
The Hindu
On February 26, The Lancet Global Health retracted a letter by two nurses from Guangzhou province just two days after it was published in the journal. In the letter, the two nurses describe their pl