I’ll be discussing the ramifications of:
- William B. Johnson and Joram Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics, 26:189–206, 1984.
for several posts here. Some introduction and links to proofs and explications will be provided, as well as more recent work. For instance,
- W. B. Johnson, A. Naor, “The Johnson-Lindenstrauss [L]emma almost characterizes Hilbert space, but not quite“, Discrete Computational Geometry (2010) 43: 542–553.
- E. J. Candes, T. Tao, “Near-optimal signal recovery From random projections: Universal encoding strategies?“, IEEE Transactions on Information Theory, 52(12), December 2006, 5406-5425.
- M. J. Schneider, S. Gupta, “Forecasting sales of new and existing products using consumer reviews: A random projections approach“, International Journal of Forecasting, 2016, 32, 243-256.
There are two parts to the Lemma. Let be a random projection matrix for a dataset matrix . Generally speaking is large, perhaps even . (This is sometimes called a “small , big problem” in terms of problem characterization.) A random projection matrix mapping onto two dimensions looks like:
It is derived from an initial matrix, ,
where are each independent drawn from a or standard Normal distribution, and then
or, in other words, the -th row of is produced from the -th row of by dividing it by it’s length or norm, often called the Euclidean norm because of its coinciding with the Euclidean distance calculation.
Then, the first part is given by a Theorem 1, which bounds distortion of pairwise distances of points:
Let . Let be a set of points. Then there exists a Lipschitz map
which preserves metric continuity, so that for all
A full proof won’t be given here (see references), but there is a norm-preserving intermediate result which is useful in itself:
Let . Suppose all entries in a are independently. Then
where denotes the probability of the event .
The second part of the JL Lemma is given by a Theorem 2, which bounds distortion of pairwise inner products:
Let and that and , where all norms are .
This Lemma is in many ways remarkable. But, as my son, Professor Jeff Galkowski of Northeastern University, described in response to my question regarding what concentration of measure in high dimensions means, he replied:
The concentration of measure is (essentially) just the fact that in high dimensions most of the volume of a sphere is very close to the origin in the norm.
In other words,
Let be the unit sphere in and (let) be a measurable set so . Let be the set of points with distance at most from . Then .
This Lemma is a subject reviewed in many courses. For example, Professor Zhu at the University of Wisconsin (Madison) covers the subject in his CS731 (2011), “Random Projection” (Advanced Artificial Intelligence). There are also many good Web expositions of it, like this one by Mathematics doctoral student Renan Gross. Mathematician Hein Hundal has a post from 2013 which introduces it and, more importantly, cites its connections to Machine Learning and Compressed Sensing.
Also, the JL Lemma is an example of what Schneider and Gupta describe as a data oblivious method In particular,
In its assessment of alternative approaches to dimensionality reduction, the Committee on the Analysis of Massive Data (National Research Council of the National Academies, 2013) labels random projections approaches “data oblivious“, in that the dimensionality reduction mapping can be computed without any knowledge of or use of the data. This is in contrast to “data aware” methods such as principal components analysis and its refinements, where the mapping is dependent on a given dataset. The report also identifies key benefits of random projections as follows (p. 77): “… the projection is guaranteed to work (in the sense that it preserves the distance structure or other properties) for arbitrary point-sets. In addition, generating such projections requires very little resources in terms of space and/or time, and it can be done before the data are even seen. Finally, this approach leads to results with provable guarantees”.
Here’s an illustration. Take a 7-variate Gaussian (7-dimensional), having a mean of
and a covariance matrix of:
10,000 points were drawn from this distribution, and then it was randomly projected to two dimensions in the manner described for Johnson-Lindenstrauss above. The resulting 10000-by-2 matrix of points was first subjected to a kernel density analysis on a 500-by-500 grid giving the result:
The k-NN (“k-Nearest Neighbors”) clustering algorithm was used on the projection () to produce the following summary:
Note the projection is only from 7 dimensions to 2. The random projections technique really comes into its own when the initial number of dimensions is large. The figure below shows a k-NN result from an initial dataset having 5038 rows and 20230 columns:
I’ll be giving an illustration of an application in a second post. It will address other open questions, too, such as how to pick a value for the k-NN clustering.
However, want to close with the note that there’s a recent connection to or application in Climate Science, given in the reference below:
- T. Seitola, V. Mikkola, J. Silen, H. Järvinen, “Random projections in reducing the dimensionality of climate simulation data“, Telus Series A: Dynamic Meteorology and Oceanography, 2014, 66, 25274.