## The Johnson-Lindenstrauss Lemma, and the paradoxical power of random linear operators. Part 1.

###### Updated, 2018-12-04

I’ll be discussing the ramifications of:

for several posts here. Some introduction and links to proofs and explications will be provided, as well as more recent work. For instance,

There are two parts to the Lemma. Let ${}_{p}\mathbf{R}_{d}$ be a random projection matrix for a dataset matrix ${}_{n}\mathbf{D}_{p}$. Generally speaking $p$ is large, perhaps even $p > n$. (This is sometimes called a “small $n$, big $p$ problem” in terms of problem characterization.) A random projection matrix mapping onto two dimensions looks like:

$\left[ \begin{array}{cc} r_{1,1} & r_{1,2} \\ r_{2,1} & r_{2,2} \\ \vdots \\ r_{p-1,1} & r_{p-1,2} \\ r_{p,1} & r_{p,2} \end{array} \right]$

It is derived from an initial matrix, ${}_{p}\mathbf{S}_{d}$,

$\left[ \begin{array}{cc} s_{1,1} & s_{1,2} \\ s_{2,1} & s_{2,2} \\ \vdots \\ s_{p-1,1} & s_{p-1,2} \\ s_{p,1} & s_{p,2} \end{array} \right]$

where $s_{i,j}$ are each independent drawn from a $\mathcal{N}(0,1)$ or standard Normal distribution, and then

$\mathbf{r}_{i} = \frac{\mathbf{s}_{i}}{\sqrt{(\mathbf{s}_{i}^{\top} \mathbf{s}_{i})}}$

or, in other words, the $i$-th row of ${}_{p}\mathbf{R}_{d}$ is produced from the $i$-th row of ${}_{p}\mathbf{S}_{d}$ by dividing it by it’s $L_{2}$ length or norm, often called the Euclidean norm because of its coinciding with the Euclidean distance calculation.

Then, the first part is given by a Theorem 1, which bounds distortion of pairwise distances of points:

## Theorem 1.

### Let $\epsilon \in (0,\frac{1}{2})$. Let ${}_{n}\mathbf{D}_{p} \subset \mathbb{R}^{p}$ be a set of $n$ points. Then there exists a Lipschitz map$\Pi : \mathbb{R}^{p} \rightarrow \mathbb{R}^{d}$ which preserves metric continuity, so that for all $u, v \in {}_{n}\mathbf{D}_{p},$$(1 - \epsilon) \| u - v \|^{2} \le \| \Pi(u) - \Pi(v) \|^{2} \le (1 - \epsilon) \| u - v \|^{2}$

A full proof won’t be given here (see references), but there is a norm-preserving intermediate result which is useful in itself:

## Theorem 1b.

### Let $\mathbf{t} \subset \mathbb{R}^{p}$. Suppose all entries in a ${}_{p}\mathbf{Q}_{d}$ are $\sim \mathcal{N}(0,1)$ independently. Then $\llbracket (1 - \epsilon) \|\mathbf{t}\|^{2} \le \frac{\mathbf{t} {}_{p}\mathbf{Q}_{d} }{\sqrt{d}} \le (1 + \epsilon) \|\mathbf{t}\|^{2} \rrbracket \ge 1 - e^{-\frac{d(\epsilon^{2} - \epsilon^{3})}{4}}$where $\llbracket x \rrbracket$ denotes the probability of the event $x$.

The second part of the JL Lemma is given by a Theorem 2, which bounds distortion of pairwise inner products:

## Theorem 2.

### Let $\mathbf{t}, \mathbf{w} \in \mathbb{R}^{p}$ and that $\|\mathbf{t}\| \le 1$ and $\|\mathbf{w}\| \le 1$, where all norms are $L_{2}$.

This Lemma is in many ways remarkable. But, as my son, Professor Jeff Galkowski of Northeastern University, described in response to my question regarding what concentration of measure in high dimensions means, he replied:

The concentration of measure is (essentially) just the fact that in high dimensions most of the volume of a sphere is very close to the origin in the $L^\infty$ norm.

In other words,

## Theorem 3.

### Let $\mathbb{S}^{p-1}$ be the unit sphere in $\mathbb{R}^{p}$ and (let) $A \in \mathbb{S}^{p-1}$ be a measurable set so $\text{volume}(A) \ge \frac{1}{2}$. Let $A_{\epsilon}$ be the set of points with distance at most $\epsilon$ from $A$. Then $\text{volume}{A_{\epsilon}} \ge 1 - e^{-\frac{p \epsilon^{2}}{2}}$.

This Lemma is a subject reviewed in many courses. For example, Professor Zhu at the University of Wisconsin (Madison) covers the subject in his CS731 (2011), “Random Projection” (Advanced Artificial Intelligence). There are also many good Web expositions of it, like this one by Mathematics doctoral student Renan Gross. Mathematician Hein Hundal has a post from 2013 which introduces it and, more importantly, cites its connections to Machine Learning and Compressed Sensing.

Also, the JL Lemma is an example of what Schneider and Gupta describe as a data oblivious method In particular,

In its assessment of alternative approaches to dimensionality reduction, the Committee on the Analysis of Massive Data (National Research Council of the National Academies, 2013) labels random projections approaches “data oblivious“, in that the dimensionality reduction mapping can be computed without any knowledge of or use of the data. This is in contrast to “data aware” methods such as principal components analysis and its refinements, where the mapping is dependent on a given dataset. The report also identifies key benefits of random projections as follows (p. 77): “… the projection is guaranteed to work (in the sense that it preserves the distance structure or other properties) for arbitrary point-sets. In addition, generating such projections requires very little resources in terms of space and/or time, and it can be done before the data are even seen. Finally, this approach leads to results with provable guarantees”.

Here’s an illustration. Take a 7-variate Gaussian (7-dimensional), having a mean of

$\left[ \begin{array}{c} 10, -20, 10, 5, 0, -10, 30 \end{array} \right]^{\top}$

and a covariance matrix of:

$\left[ \begin{array}{ccccccc} 4& 0.12 & 0.0004 & 0.03 & 0.01 & 0.004 & 0 \\ 0.12& 9& 0.006& 0& 0.015& 0& 0.009 \\ 0.0004& 0.006& 4& 0.01& 0.04& 0.004& 0 \\ 0.03& 0& 0.01& 0.25& 0.0005& 0& 0 \\ 0.01& 0.015& 0.04& 0.0005& 1& 0.04& 0 \\ 0.004& 0& 0.004& 0& 0.04& 4& 0.06 \\ 0& 0.009& 0& 0& 0& 0.06& 9 \end{array} \right]$

10,000 points were drawn from this distribution, and then it was randomly projected to two dimensions in the manner described for Johnson-Lindenstrauss above. The resulting 10000-by-2 matrix of points was first subjected to a kernel density analysis on a 500-by-500 grid giving the result:

The k-NN (“k-Nearest Neighbors”) clustering algorithm was used on the projection ($k = 170$) to produce the following summary:

Note the projection is only from 7 dimensions to 2. The random projections technique really comes into its own when the initial number of dimensions is large. The figure below shows a k-NN result from an initial dataset having 5038 rows and 20230 columns:

I’ll be giving an illustration of an application in a second post. It will address other open questions, too, such as how to pick a $k$ value for the k-NN clustering.

However, want to close with the note that there’s a recent connection to or application in Climate Science, given in the reference below:

##### Update, 2018-12-04

Code related to this post is available at my Google Drive folder for such things.