6 Data Analysis

6.1 Building a Spatial Index

We will build a spatial index for our new BaseData object. A spatial index enables analyses using this BaseData to run faster (though beware that some user-defined spatial conditions will not work with a spatial index). Note that an index can be deleted/rebuilt at any time, and these changes affect only subsequent analyses, not those that preceded any change.

Select Build Spatial Index from the Analyses menu option:

For most purposes, accept the index sizes suggested by the system (see Using the Spatial Index for additional details on setting optimal index sizes).

Click OK to build the spatial index. A pop-up dialog notifies you that the index has been built (OK).

6.2 Running a Cluster Analysis

Once you have imported or opened a BaseData object, you can run cluster analyses to identify clusters in this data.

Biodiverse supports agglomerative clustering of the groups based on their labels, or some function of their labels such as the values of a linked matrix or tree.

We will run a cluster analysis on the BaseData imported earlier ‘Example_new_basedata’ (see Importing data). If My_new_basedata is not loaded, but you have saved it, open this file now by clicking on ‘Open’ from the Basedata menu and navigating to the location of your bds file.

To run a cluster analysis on the currently selected BaseData object, select Cluster from the Analyses menu:

The cluster analysis tab opens. It has three main sections where you can set options:

The upper section (Parameters) determines the parameters used in the clustering to generate a tree. This includes options to control the cluster tie-breaker algorithm.
There are check-boxes in the middle section that can be toggled to alter how the cluster analysis runs, e.g. by controlling aspects like process performance (e.g. memory usage), the automatic export of results, etc.
The lower (Spatial calculations to run for each cluster node) determines what subsequent calculations will be run for each node in this tree, using the groups it contains to define the spatial sample. This allows you to, for example, calculate the phylogenetic diversity of the regions defined by each cluster node.

6.2.1 Setting cluster analysis parameters

For the purposes of this guide, accept the default name for the cluster analyses (see diagram above) and also the default metric ‘SORENSON’ (the set of indices that can be used as the clustering metric is denoted by a column in the Indices list). We will also accept the default linkage of ‘link_average’, which calculates the dissimilarity of each newly created group with all other groups as the average value of the groups merged to form it, weighted by the number of terminal nodes they each include. In this way a merger between node A with 10 terminal nodes and node B with 1 terminal node will not be biased towards the labels in node B (see the Cluster parameters section of the Sample Session for details of the additional linkage functions that are available).

The spatial conditions can be used to control which Groups are considered candidates to be clustered together. The default condition in the ‘Spatial conditions set 1’ text box provided by the system is ‘sp_select_all()’. This condition selects every group and is an appropriate condition to use for unconstrained clustering. If you wish to use a different spatial condition, for example to first cluster within geographic regions, then refer to Spatial Conditions for further details. The definition query serves a similar purpose, except that it can be used to restrict the clustering to only a subset of groups. For example you might have a data set for Europe but only want to cluster cells in regions above a specified elevation value. The Spatial Analysis section in this document has some more details about spatial conditions and definition queries.

Check the option or calculations boxes as appropriate. If you’re curious as to what they do, hover over them with the mouse pointer to view a tooltip.

Click on the Go! button (keyboard shortcut Control-G). The system will first build the dissimilarity matrix (or matrices if you have several spatial conditions), then run the clustering using these matrices, and then any spatial calculations you have selected for each node. It will then prompt you to display results.

6.2.2 Viewing the cluster results

There are three sub-panes within the display pane. On the left is the group grid, on the upper right is the tree representing the clustering, and on the lower right is the scree plot for the tree.

As with the visualisation of basedata/matrix/tree objects described earlier, the system is linked, and interactions in the group grid or tree of the cluster display are generally reflected in the other. Hovering the mouse over a node in the tree highlights the groups (in the group grid) that are contained in that node. Likewise, hovering the mouse over a group in the group grid will highlight the path (set of nodes/clusters) in the tree to which that group belongs.

Left click on a tree node to colour a set of subclusters (descendant nodes in the tree). These clusters are split into a number of coloured groupings based on the “Clusters to colour” parameter at the bottom of the pane. Note that some leaf nodes in the tree may have a length of zero (indicated by vertical bars at the leftmost side of the tree, longer vertical bars indicating more zero-length leaf nodes). Thus, if any such nodes occur under (to the left of) the node you have selected, their colouring will not be apparent if you are using “Plot by Length” mode. These nodes, along with their colouring (if selected) can be made apparent by switching to “Plot by Depth” mode under the tree options button under the Display menu at the left.

The blue sliding bar in the tree pane can be dragged across the tree to colour the nodes and groups at that level. The bar also displays the number of nodes it is crossing when the mouse is focused on it. The maximum number of contrasting colours the system will display is 13. If the slider bar crosses more than 13 nodes, all nodes will be uniformly coloured red instead, and groups in the group grid will not be coloured.

It is also possible to assign your own colours to the clusters and to generate a geotiff (raster) to reproduce the coloured cells in other systems.

6.3 Running a Spatial (Moving Window) Analysis

6.3.1 Introduction

The Spatial Analyses are a type of moving window analyses (because nearly all analyses in Biodiverse are spatial in some way), and are another method of identifying spatial patterns in your data. Depending upon the exact nature of your data, it is likely that the Groups comprising your BaseData constitute a geographic surface. Moving windows are used to systematically iterate across this space, with one or two neighbourhoods of the moving window being used to assess the labels/values aggregated in the Groups. The window might also consist only of the processing group, with no neighbours (each group is analysed separately).

To run a moving window analysis, you will need to have imported or loaded a BaseData object and ideally built a spatial index for this BaseData. These instructions assume you have either imported the sample data above, or you can open the example BaseData object ’example_data.bds’in the Biodiverse data folder.

Select a Basedata object by clicking on it and open the menu option Analyses -> Spatial and the spatial tab appears:

The Spatial tab has two main sections where you can set options. The upper section (Parameters) determines the parameters used in defining the two neighbour sets used in the spatial analysis (the second is optional) as well as the definition query. The lower (Calculations to run for each neighbourhood) determines the subsequent calculations that will be run for the set of neighbours related to each group. You can select any number of calculations to perform.

6.3.2 Setting the Spatial Analysis Options

The Name option is the name used in the system. You can accept the default, or enter a new name (you cannot have duplicate names within the same BaseData object).

The Neighbour set 1 and Neighbour set 2 text boxes allow you to define the neighbour sets used for the calculations (see Spatial Conditions for further details). The size and shape of the neighbourhoods defines the set of groups to be considered in an analysis and it is up to you to decide what sort of neighbourhoods and how many to use for your analysis. The system provides considerable flexibility in this regard: both neighbour sets may be arbitrarily defined independently of each other and you are not obliged to specify a second neighbourhood.

The default condition in the Neighbour set 1 text box is ‘sp_self_only()’. This restricts the neighbourhood to one cell/group (the processing group, which is the group being processed at an iteration). The default condition in the Neighbour set 2 text box ‘sp_circle (radius => 100000)’ defines a circular neighbourhood of 100000 units centred on the processing group. The units are whatever the original data used, and for the example data this is a 100 km or one cell radius.

These default conditions are appropriate for our sample data, but we will modify the second condition. Change the circle radius from 100000 to 200000.

See Spatial Conditions for more detail on defining the neighbourhood sets (e.g. for the range of functions that you can use to specify neighbourhoods).

Note that the use of neighbourhoods varies according to the type of calculation selected for analysis (see Selecting Indices below). Some calculations aggregate the neighbourhoods into a single set (e.g. endemism_whole), others compare the labels in the first neighbour set with those in the second (e.g. Jaccard and other dissimilarity indices), while others use the first set to define the list of labels to use but then consider distributions across both neighbour sets (e.g. endemism_central). Some will not be run unless two neighbour sets are defined, primarily the turnover calculations.

The Definition query text box allows the analysis to be applied to only a subset of groups (those which satisfy the criteria in the definition query, but note that all groups are still considered as possible neighbours). Leave this blank for this session. For details of setting a definition query see Definition Queries.

The Verify buttons let you know if the spatial conditions entered are syntactically valid.

6.3.3 Selecting indices to calculate

Use the check boxes to select which calculations are to be calculated. Press the plus buttons next to the Endemism, Lists and Counts and Taxonomic Dissimilarity and Comparison sets to expand their trees. Check the boxes of Endemism central, Redundancy and Richness under Lists and Counts and Sorenson under the Taxonomic Dissimilarity and Comparison set.

Details about the full range of indices are available from the Indices page.

Click on the Go! button. The system will then run the selected spatial analyses.

6.3.4 Viewing the Spatial Analysis Results

Once the analysis is complete, the system asks you whether you want to display the results. Click Yes and you will be shown a map of the moving window analysis results. Pull the pane down to view the options you used, for example to change them and re-run the analysis.

Vector overlays from shapefiles can be plotted using the “overlays” interface at the bottom of the screen. Click the Overlays… button, click Add, navigate to the Biodiverse/data folder and select coastline_lamberts.shp, click OK, highlight this file in the Select Overlay dialog box and click OK.

Nothe that Biodiverse is not yet aware of coordinate systems and only “sees” the coordinates as numbers. This means that the vector data MUST be in the same coordinate system as your basedata. If you have imported data as lat/long coordinates (a geographic coordinate system) then the vector data must also use geographic coordinates. If your basedata are in an equal area coordinate system then your vector data must also be in that coordinate system.

6.3.5 Running analysis with local neighbours

In the above moving window spatial analysis, the output visualisations only provided a single cell output of iterations within the map (black dot within the group cell).

However, altering the second neighbour entry to ”sp_circle (radius = > 500000)” when setting the initial analysis parameters (noted above), can identify aggregated relationships of these cells with their “neighbours”.

The result is a display like this:

Hovering over a group will highlight the groups used in the calculations – a solid circle for those in the first neighbourhood and a dash for those in the second neighbourhood. Right click on a group to keep the highlighting at that group until the mouse is left clicked on any group. You can control whether these are displayed, or just display one set, by using the Neighbours selection options.

Holding the control key while clicking on a group, or clicking on a group with the middle mouse button, produces a pop-up list with all the results in it. It also includes lists of the elements (groups) in each neighbour set (Elements set1, set2 and all) and of the labels in these neighbour sets (Labels set1, set2 and all). The “All” list is the union of neighbour sets 1 and 2.

The tree branches are coloured according to whether they are found only in the first neighbour set (blue), the second neighbour set (red) , both (black) or neither (grey). See also the blog post. Some indices can also be plotted on the tree.

6.4 Running a Randomisation Analysis

6.4.1 Introduction

Randomisations in Biodiverse are used to assess the statistical significance of a set of analysis results given some randomisation scheme such as shuffling the species around the map (labels across groups in Biodiverse terms), subject to constraints such as each cell (group) must maintain the same number of species. Randomisations are key to interpreting where results differ from what would be expected, and are integral to protocols such as the Categorical Analysis of Paleo and Neo Endemism (CANAPE). The output can provide rank-relative scores from which significance can be derived as well as z-scores.

The default randomisation algorithm in Biodiverse is called rand_structured.

The rand_structured algorithm is so called because it maintains the structure of the original data, specifically the spatial distribution of the richness patterns, while randomly allocating taxa (labels) to groups (cells). By default, the richness patterns are matched exactly, but users also have the option to allow tolerances such as an additional number of labels per group (e.g. five extra), or some multiple of the observed (e.g. 1.5 times as many). The rand_structured randomisation uses a filling algorithm, followed by a swapping algorithm to reach its richness targets. There is a series of blog posts describing the randomisation algorithms and visualisations.

The basic process of the randomisation is this. For each iteration of the randomisation analysis, Biodiverse will:

Create a new basedata object with a random allocation of labels to groups, according to the selected algorithm
For each analysis in the basedata:
1. Regenerate a version using the randomised basedata
2. Compare the values of the analyses from the original and randomised basedata and track if they are higher or lower, on a group by group basis
3. Track basic statistics of the distribution to allow the calculation of the mean and standard deviation, and thus z-scores.

Randomisations can be run by selecting the Analyses>Randomisation menu option:

Cluster and Region Grower trees are not rebuilt by default, partly for speed, partly because it is not normally needed. If you do want to do so then select the “build randomised tree outputs” checkbox. Note that any calculations per node are compared against the randomly generated basedata, though.

Once the relevant parameters are specified, press “Randomise!”.

Note that the randomisation analysis tab does not show maps or any other displays once the specified number of iterations have completed. The results are added as lists to any existing analysis objects and can be displayed by selecting from the “List and Indices” option on the relevant plots. You might need to close any open plots and re-open them to see the new lists. See the main help system.

A description of what the list and index names mean is also in the main help system. (Consider for example the Rand1>>SPATIAL_RESULTS list and C_END_CWE index below).

6.4.2 Spatially Structured Randomisations

In the standard implementation of a randomisation, labels are randomly allocated to any group across the entire data set. This works well for many studies and is very effective, however, when one begins to scale analyses to larger extents, then the total pool of labels begins to span many different environments. The randomisations could allocate a polar taxon to the tropics, and a desert taxon to a rainforest. This does not make the randomisation invalid, but it is perhaps not as strict as it could be, and perhaps also not as realistic.

This can be readily fixed in Biodiverse by specifying a spatial condition to define subsets. Or, if you only want to randomise a subset of your data while holding the rest as-observed, you can use a definition query. You can specify spatial conditions using the same syntax as for the Spatial and Cluster analyses. The analytical decisions are yours but generally speaking it is better to use a condition that generates non-overlapping regions. Each group is allocated to only one subset, so in such overlapping cases it is the first one that contains a group “wins” that group.

To read more on spatially structured randomisations see this blog post.