Spatial Data Science
Tool used : GeoDa
Spatial correlation :
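- Spatial autocorrelation : how similar the values observed at nearby locations are.
- Moran's I : a global measure of spatial autocorrelation. Positive values indicate clustering of similar values in space, negative values indicate that neighbours tend to differ.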
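As a rough illustration of the measure named above, here is a minimal sketch of global Moran's I in plain Python. The function name `morans_i`, the binary contiguity weights and the two value lists are made up for the example. GeoDa computes this for you, and this is not its implementation.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all spatial weights
    # Weighted cross-products of deviations between every pair of locations
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / s0) * cross / variance_sum


# Hypothetical example: 4 areas in a row, each one a neighbour of the next
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1, 1, 10, 10]     # similar values sit next to each other
alternating = [1, 10, 1, 10]   # neighbours always differ

print(morans_i(clustered, w))
print(morans_i(alternating, w))
```

With these toy inputs the clustered layout gives a positive value (about 0.33) and the alternating layout a negative one (about -1.0).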
Sources of species distribution data : volunteered observations from participatory (citizen science) platforms such as Pl@ntNet.
Spatial Data Science :
Treats location, distance and spatial interactions as core concepts, whereas regular data science does not take them into account as thoroughly.
Data types :
Points, Lines, Polygons.
Visual representation on maps works better because we are very good at detecting visual patterns but bad at reading tabular data. Reading graphs, maps or any other data viz helps with knowledge discovery.
The classes shown on a map should reflect the underlying distribution of the variable, otherwise the map does not make sense.
Types of maps :
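- Percentile maps : a variant of the quantile map with unequal classes (in GeoDa: <1%, 1-10%, 10-50%, 50-90%, 90-99%, >99%), so the classes are small in the tails and large in the middle. Emphasis on extremes and outliers. Use a diverging colormap.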
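- Box map : the map counterpart of the box plot. The classes are also based on percentiles: the four quartiles, plus two outlier classes beyond the fences (1.5 times the interquartile range by default).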
We can also split the classes by distance from the mean in standard deviations, which gives standard deviation maps.
Multivariate data analysis
Curse of dimensionality : the larger the number of variables describing our points, the larger the space between them, leading to sparsity. This makes most brute-force and distance-based algorithms ineffective in high dimensions.
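A small experiment can make this sparsity visible. The sketch below assumes points drawn uniformly at random in the unit hypercube, which is an illustrative choice and not something from the course, and measures the average distance from each point to its nearest neighbour as the number of dimensions grows.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each random point to its closest other point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force search over all other points
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


# Same number of points, increasing number of variables
for dims in (2, 10, 100):
    print(dims, round(mean_nearest_neighbor_distance(200, dims), 3))
```

With the number of points held fixed, the nearest neighbour moves further away as dimensions are added, so "close" points stop being meaningfully close.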
Exploratory methods do not explain, they only suggest hypotheses and interesting patterns. It is also difficult to quantify uncertainty with them. They do not establish causation, only correlation.
We can represent more than 2 or 3 dimensions in a scatter plot by using the color, size and other characteristics of our points. As a result we can, for example, visualize 4-dimensional data in a 2D scatter plot and analyze it visually instead of analytically. These are also called bubble plots.
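A minimal bubble plot sketch with matplotlib, assuming it is installed. The data and variable names are invented for the example: two variables go on the axes, a third is mapped to marker area and a fourth to color.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical observations described by four variables
x = [1.0, 2.0, 3.0, 4.0, 5.0]          # first variable: horizontal axis
y = [2.1, 3.9, 6.2, 8.1, 9.8]          # second variable: vertical axis
third = [10, 40, 20, 80, 60]           # third variable: marker area
fourth = [0.1, 0.5, 0.3, 0.9, 0.7]     # fourth variable: marker color

fig, ax = plt.subplots()
bubbles = ax.scatter(
    x, y,
    s=[5 * v for v in third],  # scale factor chosen only for readability
    c=fourth,
    cmap="viridis",
    alpha=0.7,
)
fig.colorbar(bubbles, ax=ax, label="fourth variable")
ax.set_xlabel("first variable")
ax.set_ylabel("second variable")
fig.savefig("bubble_plot.png")
```

Marker size is harder to compare precisely than position, so the variables that matter most are usually better placed on the axes.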
Parallel Coordinate Plot (PCP) :
The axes are drawn as parallel lines instead of orthogonal ones, which makes room for multi-dimensional data. Each observation is shown as a line connecting its values on the successive axes, instead of as a point.
Lines that stay close together and roughly parallel represent clusters in the multidimensional data space.
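To show how a PCP is built, the sketch below turns each observation into a polyline in plain Python. The function name `pcp_polylines`, the min-max rescaling of every axis to [0, 1] and the sample rows are assumptions for illustration. Plotting tools such as GeoDa handle this step internally.

```python
def pcp_polylines(rows):
    """Rescale each variable to [0, 1] and return one polyline per observation.

    Each polyline is a list of (axis_index, scaled_value) vertices.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # Use a span of 1.0 for constant variables to avoid dividing by zero
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [
        [(k, (v - lows[k]) / spans[k]) for k, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical data: the first two observations are similar, the third is not
rows = [
    (1.0, 10.0, 100.0),
    (1.1, 11.0, 105.0),
    (5.0, 50.0, 900.0),
]
lines = pcp_polylines(rows)
for line in lines:
    print(line)
```

The first two polylines stay close to each other on every axis, which is how a cluster appears in a PCP, while the third runs far from both.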



