Section Pursuit
Based on arXiv:2004.13327
Ursula Laa
Institute of Statistics
Department of Landscape, Spatial and Infrastructure Sciences
University of Natural Resources and Life Sciences
CMStatistics 2021
Sectioning in visualization means that we select a subset of points that are kept or highlighted in a view of the data. The subset is generally defined by a set of inequality conditions.
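For example, a section might keep only the points satisfying a small set of inequalities; a minimal R sketch, with made-up conditions and `X` an n × p data matrix:

```r
# Minimal sketch: a section defined by inequality conditions (made-up example)
inside <- X[, 1] > 0 & X[, 1]^2 + X[, 2]^2 < 1
section <- X[inside, ]   # points kept/highlighted in the view
```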
Combining projections and sections of the data can provide the viewer with new insights, as discussed in Furnas and Buja (1994). They refer to the combination as prosections and show how it can be used to determine the intrinsic dimensionality of a distribution.
More generally, sectioning can reveal features that are hidden in all projections of the data, but it typically relies on user interaction (linked brushing). However, if we can define a systematic approach to sectioning, we can instead optimize an index function to identify interesting sections: section pursuit.
We can define sections based on projections of the data: a slice uses the orthogonal distance of each data point from a centered projection plane to define a section, see Laa, Cook, Valencia (2019). Points close to the plane are highlighted in the projected view and can be compared to the overall (projected) distribution of points.
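A minimal R sketch of this slicing step, assuming a centered data matrix `X` (n × p) and an orthonormal p × 2 projection basis `A` (names are illustrative):

```r
# Sketch: points within orthogonal distance h of the projection plane
# spanned by A form the slice (X centered, A orthonormal p x 2)
slice_points <- function(X, A, h = 0.1) {
  proj  <- X %*% A                    # 2D projection of each point
  ortho <- X - proj %*% t(A)          # component orthogonal to the plane
  d     <- sqrt(rowSums(ortho^2))     # orthogonal distance to the plane
  list(proj = proj, inside = d < h)   # slice = points with d < h
}
```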
A first approach to section pursuit is therefore to optimize a section pursuit index function over all possible slices of the data. If we keep the projections centered at the mean and the slice thickness fixed, we are optimizing over all possible projections and can use methods from projection pursuit in combination with information about the orthogonal distances.
One approach would be to directly plug in index functions from the projection pursuit literature, but compute them only for the points inside the slice.
However, in this case we have a direct comparison: how is the distribution inside the slice different from the projected distribution? An interesting slice would be one that reveals a difference between the two!
Binning both distributions gives relative counts $s_k$ (inside the slice) and $c_k$ (in the overall projection) for each bin $k$. Using these relative counts we build an index that compares the two distributions, based on the decomposition proposed in Gous and Buja (2012).
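A sketch of how such relative counts could be computed, using a simple rectangular binning of the projected points (an illustrative stand-in, not necessarily the binning scheme used in the paper):

```r
# Sketch: relative bin counts for the slice (s_k) and full projection (c_k);
# rectangular binning as an illustrative stand-in
bin_counts <- function(proj, inside, breaks = 5) {
  bx <- cut(proj[, 1], breaks)
  by <- cut(proj[, 2], breaks)
  k  <- interaction(bx, by)               # bin label for each point
  s  <- table(k[inside]) / sum(inside)    # relative counts inside the slice
  c  <- table(k) / nrow(proj)             # relative counts in the projection
  list(s = as.numeric(s), c = as.numeric(c))
}
```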
We define two index functions that aim to detect either regions of low density (holes) or high density (grains) inside the slice distribution:

$$I_{low}^A = \sum_k \left[ c_k - s_k \right]_{>\varepsilon}, \qquad I_{up}^A = \sum_k \left[ s_k - c_k \right]_{>\varepsilon},$$

where $[\cdot]_{>\varepsilon}$ indicates that we only sum positive differences above the threshold value $\varepsilon$ (based on the estimated size of sampling fluctuations).
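A direct R sketch of these two indexes, assuming `s` and `c` are the vectors of relative counts defined above:

```r
# Sketch: hole (low) and grain (up) indexes from relative counts s_k, c_k;
# only positive differences above the noise threshold eps contribute
index_low <- function(s, c, eps = 0) { d <- c - s; sum(d[d > eps]) }
index_up  <- function(s, c, eps = 0) { d <- s - c; sum(d[d > eps]) }
```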
We also consider generalizing the definition to

$$I_{low}^A = \sum_k w_k \left( \left[ c_k^{1/q} - s_k^{1/q} \right]_{>\varepsilon} \right)^q, \qquad I_{up}^A = \sum_k w_k \left( \left[ s_k^{1/q} - c_k^{1/q} \right]_{>\varepsilon} \right)^q,$$

allowing for bin-wise weights $w_k$, where $q$ controls the sensitivity to small differences.
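The generalized version as a hedged R sketch (parameter names mirror the formula; the defaults are illustrative):

```r
# Sketch: generalized index with bin weights w and sensitivity parameter q
index_gen <- function(s, c, w = rep(1, length(s)), q = 1, eps = 0) {
  d    <- c^(1 / q) - s^(1 / q)   # swap s and c for the "up" variant
  keep <- d > eps                 # threshold against sampling fluctuations
  sum(w[keep] * d[keep]^q)
}
```

For `q = 1` and unit weights this reduces to the basic hole index above.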
When computing this index in practice, we will see systematic differences between S and C that depend on the underlying distributions. To proceed we make a few simplifying assumptions about these distributions, which let us study the expected differences between the two and define a bin-wise reweighting scheme for S and C separately.
To illustrate the method we simulated a large sample of points inside a hypersphere, with holes defined through hyperspherical harmonics, such that some slices are interesting and others are not.
We can compare the binned distributions with and without reweighting to see why this is important, for example looking at slice S1 from set B.
To understand the index performance we can show how the index value changes as we slowly rotate the projection away from the most or least informative slice. We use the tourr package to interpolate projections along those directions, up to a maximum angle α.
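A hedged sketch of such an index trace, reusing the `slice_points`, `bin_counts`, and `index_low` sketches from above; `A0` and `A1` are the start and target bases, and the exact tourr arguments may vary across versions:

```r
# Sketch: index trace along a geodesic path from basis A0 toward A1
library(tourr)
hist  <- save_history(X, planned_tour(list(A0, A1)))  # record the two bases
path  <- interpolate(hist, angle = 0.02)              # geodesic frames between them
trace <- apply(path, 3, function(A) {
  sl <- slice_points(X, A, h = 0.1)                   # slice at this frame
  bc <- bin_counts(sl$proj, sl$inside)
  index_low(bc$s, bc$c, eps = 0.01)                   # eps is illustrative
})
```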
We can use section pursuit to better understand non-linear boundaries in high-dimensional spaces:
We start by sampling points in parameter space: points are drawn uniformly within a p-dimensional hypersphere. We then evaluate the model predictions or inequality conditions for all samples, and create hollow regions by dropping points from one class, or where the inequality condition is not met.
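A minimal R sketch of this sampling step (the drop condition is a made-up example):

```r
# Sketch: uniform sample inside a p-dimensional unit hypersphere
sample_ball <- function(n, p) {
  x <- matrix(rnorm(n * p), n, p)
  x <- x / sqrt(rowSums(x^2))        # uniform directions on the sphere
  x * runif(n)^(1 / p)               # radii giving uniform density in the ball
}

X    <- sample_ball(1e5, 6)
keep <- rowSums(X[, 1:2]^2) > 0.25   # made-up inequality: hollow out a region
X    <- X[keep, ]
```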
Example: Classification problem from physics (PDFSense data), with three groups in six dimensions. When fitting an SVM with radial kernel we find that across most of the parameter space we predict one of the three groups. Where are the sections with predictions for the other two?
We use the classifly package to sample in a grid across the 6D parameter space and evaluate predictions for all points.
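A hedged sketch of this step (the data frame and formula are placeholders, not the actual PDFSense variables):

```r
# Sketch: sample the parameter space and record SVM predictions with classifly
library(classifly)
library(e1071)
cfly <- classifly(pdf_data, group ~ ., svm,
                  kernel = "radial", method = "grid")  # grid sampling of the space
```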
This is joint work done in collaboration with Dianne Cook, German Valencia and Andreas Buja.
My slides are made using RMarkdown, xaringan and the ninjutsu theme. The main R packages used are tourr, classifly and the tidyverse.