Section Pursuit

Based on arXiv:2004.13327

Ursula Laa

Institute of Statistics
Department of Landscape, Spatial and Infrastructure Sciences
University of Natural Resources and Life Sciences

CMStatistics 2021

What is hidden in projections?

Combining sectioning and projections

Sectioning in visualization means that we select a subset of points that are kept or highlighted in a view of the data. The subset is generally defined by a set of inequality conditions.

Combining projections and sections of the data can provide the viewer with new insights, as discussed in Furnas and Buja (1994). They refer to the combination as prosections and show how it can be used to determine the intrinsic dimensionality of a distribution.

More generally, sectioning can reveal features that are hidden in all projections of the data, but it typically relies on user interaction (linked brushing). If we can define a systematic approach to sectioning, we can instead optimize an index function to identify interesting sections: section pursuit.

Slicing

We can define sections based on projections of the data: a slice uses the orthogonal distance of each data point from a centered projection plane to define a section, see Laa, Cook and Valencia (2019). Points close to the plane are highlighted in the projected view and can be compared to the overall (projected) distribution of points.

What makes a slice interesting?

A first approach to section pursuit is therefore to optimize a section pursuit index function over all possible slices of the data. If we keep the projection planes centered at the mean and the slice thickness fixed, we are optimizing over all possible projections, and can use methods from projection pursuit in combination with information about the orthogonal distances.

One approach would be to directly plug in index functions from the projection pursuit literature, but compute them only for the points inside the slice.

However, in this case we have a direct comparison: how is the distribution inside the slice different from the projected distribution? An interesting slice would be one that reveals a difference between the two!

Notation

  • The projected data points are computed as $Y = XA$, where $X$ is an $n \times p$ data matrix and $A$ is a $p \times d$ orthonormal basis for the $d$-dimensional space onto which the data are projected.
  • To generate a 2-dimensional slice we compute the orthogonal distance between every point and the plane (defined by $A = (a_1, a_2)$) as the Euclidean norm $h_i = \lVert x_i - (x_i \cdot a_1)\, a_1 - (x_i \cdot a_2)\, a_2 \rVert$.
  • Observations are considered inside the slice if $h_i < h$.
  • We denote the set of points inside the slice by $S$, and the set of points outside the slice by $C$.
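
As a minimal sketch of the distance computation above, in plain Python: the function names and the toy data are illustrative assumptions, not part of the original implementation. The sketch assumes that `a1` and `a2` are orthonormal and that the data are centered, so the plane passes through the origin.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slice_distance(x, a1, a2):
    """Orthogonal distance h_i of point x from the plane spanned by a1, a2.

    Assumes a1 and a2 are orthonormal and the plane passes through the origin.
    """
    p1, p2 = dot(x, a1), dot(x, a2)
    # Remove the in-plane component; what remains is orthogonal to the plane.
    residual = [xi - p1 * u - p2 * v for xi, u, v in zip(x, a1, a2)]
    return math.sqrt(dot(residual, residual))


def in_slice(X, a1, a2, h):
    """True for every observation with h_i < h (the set S), else False (C)."""
    return [slice_distance(x, a1, a2) < h for x in X]


# Toy 3D data; the plane is spanned by the first two coordinate axes,
# so the distance is simply the absolute value of the third coordinate.
X = [[0.5, 0.2, 0.05], [0.1, -0.3, 0.4], [-0.6, 0.0, -0.1]]
a1, a2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
mask = in_slice(X, a1, a2, h=0.2)
```

With this toy plane, the first and third points fall inside the slice and the second one outside.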

Notation

  • The two sets are separately binned into $K$ bins: $S_k = \sum_i I(Y_i \in b_k)\, I(h_i < h)$ and $C_k = \sum_i I(Y_i \in b_k)\, I(h_i \geq h)$
  • The relative counts are denoted $s_k = S_k / \sum_i S_i$ and $c_k = C_k / \sum_i C_i$

Using these relative counts $s_k$ and $c_k$ we will build an index that compares the two distributions. This is based on the decomposition proposed in Gous and Buja (2012).
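
The binning step can be sketched as follows. This is an illustration only: it assumes that each projected point has already been assigned a bin number between 0 and K-1 and an inside/outside flag, and the function name and toy inputs are made up for the example.

```python
def relative_counts(bin_index, inside, K):
    """Bin counts S_k, C_k and relative counts s_k, c_k.

    bin_index[i] is the bin of projected point Y_i (0 .. K-1),
    inside[i] is True when h_i < h.
    """
    S = [0] * K
    C = [0] * K
    for k, is_in in zip(bin_index, inside):
        if is_in:
            S[k] += 1
        else:
            C[k] += 1
    n_s, n_c = sum(S), sum(C)
    # Guard against an empty slice (or an empty complement).
    s = [v / n_s for v in S] if n_s else [0.0] * K
    c = [v / n_c for v in C] if n_c else [0.0] * K
    return s, c


# Toy input: eight points in three bins, four of them inside the slice.
bin_index = [0, 0, 1, 2, 2, 2, 1, 0]
inside = [True, False, True, True, False, False, False, True]
s, c = relative_counts(bin_index, inside, K=3)
```

Normalizing the two sets separately is what makes the comparison independent of how many points fall inside the slice.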

Index definition

We define two index functions that aim to detect either regions of low density (holes) or high density (up, grains) inside the slice distribution:

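$$I_A^{\mathrm{low}} = \sum_k \left[ (c_k - s_k) \right]_{>\varepsilon},$$

$$I_A^{\mathrm{up}} = \sum_k \left[ (s_k - c_k) \right]_{>\varepsilon}$$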

where we only sum up positive differences above the threshold value ε (based on the estimated size of sampling fluctuations).
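
A direct transcription of the two sums, as a sketch: the function names and the toy relative counts are assumptions made for illustration, and a term contributes only when the bin-wise difference exceeds the threshold.

```python
def index_low(s, c, eps):
    """Sum of (c_k - s_k) over bins where the difference exceeds eps.

    Large values indicate bins that are emptier inside the slice (holes).
    """
    return sum(ck - sk for sk, ck in zip(s, c) if ck - sk > eps)


def index_up(s, c, eps):
    """Sum of (s_k - c_k) over bins where the difference exceeds eps.

    Large values indicate bins that are denser inside the slice (grains).
    """
    return sum(sk - ck for sk, ck in zip(s, c) if sk - ck > eps)


# Toy relative counts for three bins.
s = [0.5, 0.25, 0.25]
c = [0.25, 0.25, 0.5]
low = index_low(s, c, eps=0.05)
up = index_up(s, c, eps=0.05)
```

In this toy case the third bin is under-populated inside the slice and the first bin is over-populated, so both indexes pick up one bin each.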

Index definition (generalization)

We also consider generalizing the definition to

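$$I_A^{\mathrm{low}} = \sum_k w_k \left( \left[ c_k^{1/q} - s_k^{1/q} \right]_{>\varepsilon} \right)^q, \qquad I_A^{\mathrm{up}} = \sum_k w_k \left( \left[ s_k^{1/q} - c_k^{1/q} \right]_{>\varepsilon} \right)^q,$$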

allowing for bin-wise weights wk and where q controls the sensitivity to small differences.
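
The generalized form can be sketched with a single helper, where swapping the two arguments switches between the low and the up index. The helper name, its signature and the toy values are illustrative assumptions; with q = 1 and unit weights it reduces to the basic definition.

```python
def generalized_index(a, b, eps, q=1.0, weights=None):
    """sum_k w_k * ([a_k^(1/q) - b_k^(1/q)]_{>eps})^q.

    Call with (a, b) = (c, s) for the low index and (s, c) for the up index.
    """
    if weights is None:
        weights = [1.0] * len(a)
    total = 0.0
    for ak, bk, wk in zip(a, b, weights):
        diff = ak ** (1.0 / q) - bk ** (1.0 / q)
        if diff > eps:  # the threshold is applied to the difference of roots
            total += wk * diff ** q
    return total


# Toy relative counts for three bins.
s = [0.5, 0.25, 0.25]
c = [0.25, 0.25, 0.5]
low_q1 = generalized_index(c, s, eps=0.05)           # basic definition
low_q2 = generalized_index(c, s, eps=0.05, q=2.0)    # root-transformed counts
low_w = generalized_index(c, s, eps=0.05, weights=[1.0, 1.0, 2.0])
```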

Practical considerations

When computing this index in practice, we will see systematic differences between S and C that depend on the underlying distributions. To proceed we will make a few assumptions:

  • The data shall be distributed inside a hypersphere (no systematic differences depending on the viewing angle)
  • In the absence of structure we will assume points to be distributed uniformly within the sphere (this captures volume effects without assuming specific structure; a multivariate normal distribution would be one alternative)
  • We will use polar binning in the 2D projection plane (in $r$ and $\theta$)

With these assumptions we can study the expected differences between the two distributions, and define a bin-wise reweighting scheme for S and C separately.
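
To make the polar binning concrete, here is a small sketch that maps a projected point to a bin number. The bin layout (equal-width bins in r and in θ, numbered ring by ring) is an assumption chosen for illustration; the slides do not specify the bin edges, and the reweighting itself is not shown.

```python
import math


def polar_bin(y1, y2, n_r, n_theta, r_max=1.0):
    """Bin number of the projected point (y1, y2) in an n_r x n_theta polar grid.

    Assumes equal-width bins in r (0 .. r_max) and theta (0 .. 2*pi).
    """
    r = math.hypot(y1, y2)
    theta = math.atan2(y2, y1) % (2 * math.pi)  # map the angle to [0, 2*pi)
    # Clamp so that points on the outer edge land in the last bin.
    i_r = min(int(r / r_max * n_r), n_r - 1)
    i_t = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
    return i_r * n_theta + i_t


# Three toy points in a grid with 5 radial and 8 angular bins.
bins = [polar_bin(y1, y2, n_r=5, n_theta=8)
        for y1, y2 in [(0.3, 0.1), (-0.2, 0.7), (0.1, -0.4)]]
```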

Simulated data example

To illustrate the method we simulated a large sample of points inside a hypersphere, but with holes that are defined through hyperspherical harmonics, such that some slices are interesting and others are not.

Reweighting in action

We can compare the binned distributions with and without reweighting to see why this is important, for example looking at S1 from set B:

Tracing the index

To understand the index performance we can show how the index value changes as we slowly rotate the angle of the projection away from the most or least informative slice:

  • Start by randomly selecting a large number of directions along which we move away from the starting plane.
  • Use geodesic interpolation (as available in the tourr package) to interpolate projections along those directions, up to a maximum angle $\alpha$.
  • Evaluate the index value for each slice along the interpolated path.
  • The trace plot shows how the index value changes with $\alpha$. It reveals characteristics of the index function (smoothness, squint angle) and whether there are local maxima or ridges for the considered distribution.

Example application

We can use section pursuit to better understand non-linear boundaries in high-dimensional spaces:

  • Classification boundaries, for example from an SVM with polynomial or radial kernel
  • Boundaries from inequality conditions, for example physically meaningful regions in a parameter space

We start by sampling points in parameter space: points are drawn uniformly within a p-dimensional hypersphere. We then evaluate the model predictions or inequality conditions for all samples, and create hollow regions by dropping points from one class, or where the inequality condition is not met.
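
A minimal sketch of this sampling step, using the standard construction for uniform samples in a ball (a normalized Gaussian direction, scaled by a radius drawn as U^(1/p)). The function names, the sample size and the inequality condition are hypothetical and chosen only to show the mechanics.

```python
import math
import random


def sample_ball(n, p, rng):
    """Draw n points uniformly inside the unit ball in p dimensions."""
    points = []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in g))
        r = rng.random() ** (1.0 / p)  # radius with density proportional to r^(p-1)
        points.append([r * v / norm for v in g])
    return points


def keep_where(points, condition):
    """Keep only the points for which the condition holds."""
    return [x for x in points if condition(x)]


rng = random.Random(1)
samples = sample_ball(2000, 4, rng)

# Hypothetical inequality condition: points too close to the x3-x4 plane
# are dropped, which creates a hollow region in the sample.
def condition(x):
    return x[0] ** 2 + x[1] ** 2 >= 0.1


kept = keep_where(samples, condition)
```

The remaining points in `kept` would then be the input to section pursuit.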

Example application

Example: Classification problem from physics (PDFSense data), with three groups in six dimensions. When fitting an SVM with radial kernel we find that across most of the parameter space we predict one of the three groups. Where are the sections with predictions for the other two?

  • Use the classifly package to sample in a grid across the 6D parameter space and evaluate predictions for all points.
  • Shave points to go from a hypercube to a hypersphere.
  • Input to section pursuit is the set of points for which the prediction is the most commonly predicted class (filling most of the hypersphere).
  • Use section pursuit to identify a slice where this set of points shows a large hole: in this section of the parameter space we get a larger fraction of alternative predictions.
  • Show all samples with their predicted class in this slice.
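
The grid and shaving steps can be illustrated in plain Python. This is a stand-in sketch and not the classifly interface: the function names are made up, the grid is shown in three dimensions to keep it small, and the model prediction step is left out.

```python
import itertools


def grid_in_hypercube(p, n_per_dim):
    """Regular grid with n_per_dim values per axis in the cube [-1, 1]^p."""
    step = 2.0 / (n_per_dim - 1)
    axis = [-1.0 + i * step for i in range(n_per_dim)]
    return [list(pt) for pt in itertools.product(axis, repeat=p)]


def shave_to_hypersphere(points, radius=1.0):
    """Drop the corners of the cube: keep points within the given radius."""
    return [x for x in points if sum(v * v for v in x) <= radius ** 2]


grid = grid_in_hypercube(p=3, n_per_dim=5)  # 5^3 = 125 grid points
ball = shave_to_hypersphere(grid)           # grid points inside the unit ball
```

The number of grid points grows as `n_per_dim ** p`, so in six dimensions the resolution per axis has to stay small.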

Summary and Outlook

  • We can define section pursuit using slices, and optimizing an index that compares the projected distribution to that inside the slice.
  • In doing so we make assumptions about the underlying distribution to adjust for expected differences (and estimate the expected fluctuations ε).
  • One application is to find interesting slices to better understand non-linear decision boundaries or inequality conditions.
  • The current implementation is limited to datasets with large sample size and intermediate number of dimensions (up to seven dimensions were considered so far).
  • There is a lot of room to develop additional index functions for section pursuit and further develop the proposed generalization, and for better optimization methods.
  • Alternative definitions of sectioning should also be considered to generalize the method.

Thanks!


This is joint work with Dianne Cook, German Valencia and Andreas Buja.

My slides are made using RMarkdown, xaringan and the ninjutsu theme. The main R packages used are tourr, classifly and the tidyverse.
