Statistics

Introduction

Quantitative data is often summarized and analyzed with statistical methods and visualized with plots, graphs, and diagrams. Statistical methods reveal quantitative trends, patterns, and outliers in data, while plots and graphs help to convey them to audiences. Carrying out a suitable statistical analysis, choosing a suitable chart type for your data, identifying their potential pitfalls, and faithfully realizing the analysis or generating the chart with suitable software are all essential to back up experimental conclusions with data and to reach communication goals.

Dimensionality reduction

What is it?

Dimensionality reduction (also called dimension reduction) aims at mapping high-dimensional data onto a lower-dimensional space in order to better reveal trends and patterns. Algorithms performing this task attempt to retain as much information as possible when reducing the dimensionality of the data: this is achieved by assigning importance scores to individual features, removing redundancies, and identifying uninformative (for instance constant) features. Dimensionality reduction is an important step in quantitative analysis as it makes data more manageable and easier to visualize. It is also an important preprocessing step in many downstream analysis algorithms, such as machine learning classifiers.

📏 How do I do it?

The most traditional dimensionality reduction technique is principal component analysis (PCA)50. In a nutshell, PCA recovers a linear transformation of the input data into a new coordinate system (the principal components) that concentrates variation into its first axes. This is achieved with classical linear algebra, by computing an eigendecomposition of the covariance matrix of the data. As a result, the first two or three principal components provide a low-dimensional version of the data distribution that is faithful to the variance originally present. More advanced dimensionality reduction methods that are popular in biology include t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). In contrast to PCA, these methods are non-linear and can therefore exploit more complex relationships between features when building the lower-dimensional representation. This however comes at a cost: both t-SNE and UMAP are stochastic and sensitive to their hyperparameters, meaning that the results they produce can differ across runs and across parameter choices.
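As a minimal sketch (not a prescribed workflow), the snippet below shows how PCA and t-SNE could be applied to a generic feature table using scikit-learn. The matrix `X` is a synthetic stand-in for real measurements, and the hyperparameter values (number of components, perplexity) are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Toy stand-in for a (n_samples, n_features) measurement table.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# PCA: standardize the features, then project onto the first two components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print("variance explained by PC1/PC2:", pca.explained_variance_ratio_)

# t-SNE: non-linear and stochastic; fixing random_state makes a single run
# reproducible, but results still depend on hyperparameters such as perplexity.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)
```

Checking `explained_variance_ratio_` before trusting a 2D PCA view is a cheap way to see how much of the original variance the plot actually retains.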

⚠️ Where can things go wrong?

Although reducing dimensionality can be very useful for data exploration and analysis, it may also wipe out information or structure that is relevant to the problem being studied. This is famously illustrated by the Datasaurus dataset, which demonstrates how very different-looking sets of measurements can become indistinguishable when described by a small set of summary statistics. The best way to minimize this risk is to start by visually exploring the data whenever possible, and to carefully check that the underlying assumptions of the dimensionality reduction method hold for the data at hand. Dimensionality reduction may also enhance and reveal patterns that are not biologically relevant, caused by noise or systematic artifacts in the original data (see the Batch correction section below). In addition to normalizing and batch-correcting the data before reducing dimensionality, some dimensionality reduction methods offer so-called regularization strategies to mitigate this. In the end, any pattern identified in dimension-reduced data should be interpreted with the biological context of the data in mind.

📚🤷‍♀️ Where can I learn more?

Batch correction

What is it?

Batch effects are systematic variations across samples that correlate with technical aspects of the experiment (such as different times of day, different days of the week, or different experimental tools) rather than with the biological process of interest. Batch effects must be mitigated prior to making comparisons across several datasets, as they impact the reproducibility and reliability of computational analysis and can dramatically bias conclusions. Algorithms for batch effect correction address this by identifying and quantifying sources of technical variation, and adjusting the data so that these are minimized while the biological signal is preserved. Most batch effect correction methods were originally developed for microarray and sequencing data, but can be adapted to feature vectors extracted from images.

📏 How do I do it?

Two of the most widely used methods for batch effect correction are ComBat, which requires the sources of batch effects to be known a priori, and Surrogate Variable Analysis (SVA), which does not. In a nutshell, ComBat involves three steps: 1) dividing the data into known batches, 2) estimating the batch effect by fitting a linear model that includes the batch as a covariate, and 3) adjusting the data by removing the estimated effect of the batch from each data point. In contrast, SVA aims at identifying “surrogate variables” that capture unknown sources of variability in the data. The surrogate variables can be estimated with linear algebra methods (such as singular value decomposition) or through a Bayesian factor analysis model. SVA has been demonstrated to reduce unobserved sources of variability and is therefore particularly helpful when identifying possible causes of batch effects is challenging, but it comes at a higher computational cost than ComBat.
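The sketch below illustrates the three ComBat-style steps described above in their simplest form, on synthetic data with plain NumPy: it encodes known batches, fits a linear model with batch indicators as covariates, and subtracts the estimated per-batch offsets. This is only a mean-centering illustration; the full ComBat method additionally adjusts scale and applies empirical Bayes shrinkage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 60 samples x 5 features, with an additive offset in batch B.
n_per_batch, n_features = 30, 5
batch = np.array(["A"] * n_per_batch + ["B"] * n_per_batch)
X = rng.normal(size=(2 * n_per_batch, n_features))
X[batch == "B"] += 1.5  # simulated batch effect

# Steps 1-2: encode the known batches and fit a linear model with batch
# indicators as covariates (here this amounts to per-batch feature means).
design = np.column_stack([(batch == b).astype(float) for b in np.unique(batch)])
coef, *_ = np.linalg.lstsq(design, X, rcond=None)

# Step 3: remove the estimated batch-specific offsets, keeping the grand mean.
grand_mean = X.mean(axis=0)
X_corrected = X - design @ coef + grand_mean
```

For real analyses, established implementations of ComBat and SVA (for instance in the R/Bioconductor ecosystem) are preferable to hand-rolled adjustments.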

⚠️ Where can things go wrong?

As important as it is for analysis, batch effect correction can go wrong when too much or too little of it is done. Both over- and under-correction can happen when methods are not used properly or when their underlying assumptions are not met. As a result, either biological signal is removed (in the case of over-correction) or irrelevant sources of variation remain (in the case of under-correction), both potentially leading to inaccurate conclusions. Batch effect correction is particularly tricky when the biological variation of interest is suspected to be confounded with the batch. In this case especially (although it is always a good approach), the first line of defense against batch effects should be a thought-through experimental design and careful quality control, as well as visual exploration of the data52. Plotting the data batch by batch before applying any correction can help confirm (or refute) that the observed trends are similar across batches.
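As a hypothetical sketch of such a batch-by-batch check, one can project the data onto its first principal components and color points by batch; well-separated point clouds per batch hint at a batch effect. The feature matrix `X` and `batch` labels below are simulated stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Simulated (n_samples, n_features) data with an additive offset in batch B.
rng = np.random.default_rng(1)
batch = np.array(["A"] * 30 + ["B"] * 30)
X = rng.normal(size=(60, 5))
X[batch == "B"] += 1.5

# Project onto the first two principal components and color by batch.
scores = PCA(n_components=2).fit_transform(X)
for b in np.unique(batch):
    mask = batch == b
    plt.scatter(scores[mask, 0], scores[mask, 1], label=f"batch {b}", alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```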

📚🤷‍♀️ Where can I learn more?

Normality testing

What is it?

Normality testing is about assessing whether data follow a Gaussian (or normal) distribution. Because the Gaussian distribution is frequently found in nature and has important mathematical properties, normality is a core assumption of many widely used statistical tests. When this assumption is violated, the conclusions of these tests may not hold. Normality testing is therefore an important step of the data analysis pipeline prior to any sort of statistical testing.

📏 How do I do it?

Normality of a data distribution can be qualitatively assessed through plotting, for instance relying on a histogram. For a more quantitative readout, statistical methods such as the Kolmogorov-Smirnov (KS) and Shapiro-Wilk tests (among many others) report how much the observed data distribution deviates from a Gaussian. These tests usually return a p-value linked to the hypothesis that the data are sampled from a Gaussian distribution. A high p-value indicates that the data are not inconsistent with a normal distribution, but it is not sufficient to prove that they indeed follow a Gaussian. A p-value smaller than a pre-defined significance threshold (usually 0.05) provides evidence that the data are unlikely to be sampled from a normal distribution.
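As a minimal example, both tests are available in `scipy.stats`. The sample below is simulated; note that estimating the Gaussian parameters from the same data makes the plain KS test conservative (the Lilliefors variant addresses this).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=100)  # toy measurements

# Shapiro-Wilk test, well suited to small and moderate sample sizes.
stat_sw, p_sw = stats.shapiro(sample)

# Kolmogorov-Smirnov test against a normal distribution whose mean and
# standard deviation are estimated from the sample itself.
stat_ks, p_ks = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

print(f"Shapiro-Wilk p = {p_sw:.3f}, KS p = {p_ks:.3f}")
# A p-value below the chosen threshold (e.g. 0.05) is evidence against
# normality; a large p-value does not prove the data are Gaussian.
```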

⚠️ Where can things go wrong?

Although many of the “standard” statistical methods were designed under a normality assumption, alternative approaches (such as non-parametric tests) exist for non-normally-distributed data. Many biological processes result in multimodal “states” (for instance differentiation) that are inherently not Gaussian. Normality testing should therefore not be mistaken for a quality assessment of the data: it merely informs on the types of tools that are appropriate for analyzing them.

📚🤷‍♀️ Where can I learn more?