Module 4 – Introduction to multivariate analysis with R

Multivariate analyses are statistical tools that help you summarize and understand the structure of complex multidimensional datasets (i.e. datasets with many variables). Principal Component Analysis (PCA) and hierarchical clustering are probably the most widely used among these methods.

This training session is strongly oriented toward the practical use of multivariate statistical tools, and we will keep the theory to the strict minimum. We will systematically explore the following questions in the context of ecological datasets:

  • Which method is best suited to which scientific question or data analysis problem?

  • How should the outputs of these methods be interpreted, and which interpretation pitfalls should be avoided?

  • Which questions should you ask yourself when applying these methods to your own data, to avoid analysis pitfalls?

  • How are these methods implemented in practice in R?

Downloads

NB: this is still a draft version (it contains a lot of typos, …), but the available documents should already be usable.

This zip archive contains the PDFs, R script and datasets for this module.

Outline

The core of the training covers clustering methods, with 2 main tools (hierarchical clustering and K-means clustering), and ordination methods, with 3 main tools (PCA, MDS and nMDS).

  • Introduction: placing multivariate statistical tools in context. What do we mean by multivariate or multidimensional? Multivariate vs univariate methods, and supervised vs unsupervised methods. For which kinds of questions are they useful?

  • Exploratory graphical tools: even without complex mathematical tools, simple, well-designed graphs can already give useful insights into a complex dataset.
    Tools: SPLOMs (scatterplot matrices), parallel coordinates plots, heatmap of a correlation matrix, faceting, …

  • Distances and similarities: most multivariate tools are based on similarity measures between the columns or rows of a dataset. Which measure is best suited to each situation?

  • Clustering: find discontinuities and groups of similar observations or variables.

    • Hierarchical clustering

    • Non-hierarchical clustering: K-means, K-medoids

    • Quick scan of other clustering methods (e.g. DBSCAN, …)

    • Clustering validation and interpretation: are my groups "real"? How many groups should I use? Why are these observations grouped together?
      Tools: heatmaps, silhouette plots and statistics, gap statistic, etc.

  • Ordination methods: visualize dissimilarities on a summarized version of the data

    • PCA: Principal Component Analysis, and tbPCA: transformation-based PCA

    • MDS: metric Multidimensional Scaling (synonym of PCoA: Principal Coordinates Analysis)

    • nMDS: non-metric Multidimensional Scaling

    • Quick scan of other ordination methods (e.g. CA: Correspondence Analysis, MCA: Multiple Correspondence Analysis, FAMD: Factor Analysis of Mixed Data, …)

    • Use and interpretation of these tools: scree plots, biplots, correlation circle plots, cos² and heatmaps, …

  • Quick overview of canonical or supervised methods: "explain" or "predict" one matrix with another matrix

    • RDA : Redundancy analysis

    • Multivariate Regression Trees

    • Quick scan of other approaches (e.g. CCA: Canonical Correspondence Analysis, dbRDA: distance-based RDA, …)
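To give a feel for the core tools listed in this outline, here is a minimal sketch using base R only. It is not taken from the course script: the built-in iris dataset, the choice of 3 groups, Ward linkage and the Euclidean distance are all assumptions made for illustration, and the course datasets may call for different choices.

```r
# Illustrative sketch only: iris is a built-in R dataset, not a course dataset.
# All choices below (standardization, Euclidean distance, Ward linkage, k = 3)
# are assumptions made for the example.
X <- scale(iris[, 1:4])                 # center and scale the 4 quantitative variables

# Distances between observations (rows)
d <- dist(X, method = "euclidean")

# Hierarchical clustering, then cut the tree into 3 groups
hc <- hclust(d, method = "ward.D2")
groups_hc <- cutree(hc, k = 3)

# K-means clustering with 3 centers (several random starts for stability)
set.seed(42)
km <- kmeans(X, centers = 3, nstart = 25)

# PCA and proportion of variance carried by each component
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Metric MDS (PCoA) in 2 dimensions, computed from the distance object
mds <- cmdscale(d, k = 2)

# Compare the two partitions and look at the ordination
print(table(hierarchical = groups_hc, kmeans = km$cluster))
print(round(var_explained, 2))
plot(hc, labels = FALSE, main = "Hierarchical clustering (illustration)")
biplot(pca)
```

Because the distances here are Euclidean, the MDS coordinates coincide with the PCA scores up to the sign of each axis. Non-metric MDS is not available in base R; it is usually run with an add-on package, which is left out of this sketch.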

Prerequisites

  • Statistics (basic level required): types of variables (quantitative, qualitative, binary, …), basic descriptive statistics (mean, variance, standard deviation, …), basic notions of sampling (sample, population, observation, variable, …)

  • R (basic level required):

    • Basic operators and functions (paste, c, [, arithmetic and logical operators, …)

    • How to read a dataset and import it into R

    • Basic manipulation of vectors and data.frames: select an element, a column or a row of a dataset

    • Basic notions about graphs (base plots mainly)

This is a training course about statistics, not about data manipulation with R. The R code we will use for the analyses is often quite simple and can easily be adapted after a copy-paste. However, to understand these statistical tools it is important to start using them as soon as possible, with your mind free of mundane R coding tasks. Difficulties most often arise in the initial steps: loading the data, then cleaning or formatting it for the analysis.
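As a rough self-check for the R prerequisites listed above, you should be able to read and predict what each line of the following sketch does. The data, the column names (site, depth, richness) and the temporary file are hypothetical and exist only for this example.

```r
# Hypothetical toy data, written to a temporary CSV file so that the
# import step can be shown without relying on any course file.
toy <- data.frame(
  site     = paste0("S", 1:4),          # paste0() builds "S1" ... "S4"
  depth    = c(2.5, 4, 1, 3),
  richness = c(12, 7, 15, 9)
)
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Import the dataset and inspect its structure
dat <- read.csv(tmp)
str(dat)

# Select a column, a row, a single element
dat$depth
dat[2, ]
dat[3, "richness"]

# Select rows with a logical condition
deep <- dat[dat$depth > 2, ]

# Basic descriptive statistics
mean(dat$richness)
sd(dat$richness)

# Basic base-graphics plot
plot(dat$depth, dat$richness, xlab = "Depth", ylab = "Richness")
```

If any of these lines is unclear, a short R refresher before the session is probably worthwhile.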

If you think you need a refresher on R, see for example this free online course:

https://www.datacamp.com/courses/free-introduction-to-r

Or, for a more in-depth introduction, see the first module of this series (in French):

/formation-rstat-module-1-langage-r-introduction-et-manipulation-de-donnees/

 

There are plenty of other resources…