I’ve written two posts here on using a Symmetrized Normalized Compression Divergence or SNCD for comparing time series. One introduced the SNCD and described its relationship to compression distance, and the other applied the SNCD to clustering days at a high school based upon patterns of electricity consumption.
Having good tools for making such comparisons is important, because such bases for clustering and exploration are useful when examining large datasets, like the hydrological datasets I’ve previously described. I am also finally getting around to doing something with these datasets, a project I put off because of my commitments to climate activism over the last few years.
Despite my earlier enthusiasm for SNCD as a tool for series comparisons, it turns out there is a better measure, something called Procrustes tangent distance (“PTD”). I discovered this in the second edition of a book by I. L. Dryden and K. V. Mardia, called Statistical Shape Analysis, with Applications in R (2016) and through related literature and scholarship. A key paper is
J. T. Kent, K. V. Mardia, “Shape, Procrustes tangent projections and bilateral symmetry”, Biometrika, 2001, 88(2), 469-485 (with correction).
PTD is superior because it and related techniques reduce shape comparisons, like that between two time series, to ordinary multivariate analysis. (See also the pertinent book by Mardia, J. Kent, and J. Bibby.) For purposes of statistical analysis, it’s difficult to get better than that.
This is an outcome of a problem area dubbed Generalized Procrustes Analysis (“GPA”), which arises in applications where biological shapes, such as bivalve shells, need to be matched. It also arises in archaeological work, where automated methods are used to match shards of pottery. These techniques and problems have deep connections to differential geometry and have engaged other great minds besides Mardia, Dryden, and Kent. PTD may not be the last word. In particular,
C. P. Klingenberg, L. R. Monteiro, “Distances and directions in multidimensional shape spaces: Implications for morphometric applications”, Systematic Biology, 54(4), 1 August 2005, 678–688
reviewed some criticisms of PTD, along with discussion by Dryden and Mardia, with others.
My application is more modest than the general multidimensional shapes problem, being limited strictly to two dimensions, where some of these complications do not arise.
Unfortunately, the details of defining the Procrustes tangent distance are involved. Procrustes analysis begins with the consideration of $m$-dimensional landmarks and proceeds to the recovery of a rotationally invariant shape, obtained by maximizing the trace of a product involving a symmetric landmarks distance matrix, $\mathbf{A}$, and a rotation matrix, $\mathbf{R}$, over all $\mathbf{R}$. The value of the trace and the maximizing rotation are found using the SVD, and that is also used in the practical construction of the PTD.
The next step is a linearization by constructing a tangent space, namely, the Procrustes tangent space, and an associated tangent matrix, $\mathbf{T}$, which is constructed as follows. Let $\mathbf{X}_1$ and $\mathbf{X}_2$ be two $k$-by-$m$ landmarks matrices. Recall these are landmark coordinates in $m$ dimensions and there are $k$ of them. Find the rotation matrix $\mathbf{R}$ that maximizes the trace criterion for this pair, and call that maximum point $\hat{\mathbf{R}}$. The resulting expression can be rewritten, after some algebra, in a form that yields a distance. Because of an implicit constraint, that distance turns out to be a bounded, non-negative Riemannian distance between $\mathbf{X}_1$ and $\mathbf{X}_2$ and their shapes. While the optimization above could be solved using non-linear minimization, there are more direct approaches sketched in Kent and Mardia. Moreover, my calculations of PTD are obtained by calls to the function procGPA from the shapes package offered by I. L. Dryden.
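For reference, one common way of writing this construction is sketched below. It follows the usual pre-shape conventions found in the shape analysis literature, and it is an assumption that it matches the notation intended here: the symbols $\mathbf{Z}_i$, $\mathbf{C}$, and $\rho$, and the choice of $\mathbf{Z}_1$ as the pole of the tangent space, are introduced for illustration only.

```latex
\begin{aligned}
\mathbf{Z}_i &= \frac{\mathbf{C}\mathbf{X}_i}{\lVert \mathbf{C}\mathbf{X}_i \rVert},
  \qquad \mathbf{C} = \mathbf{I}_k - \tfrac{1}{k}\mathbf{1}_k\mathbf{1}_k^{\top}
  && \text{(centre and scale to unit size)} \\
\mathbf{A} &= \mathbf{Z}_2^{\top}\mathbf{Z}_1 \\
\hat{\mathbf{R}} &= \arg\max_{\mathbf{R} \in SO(m)} \operatorname{tr}\!\left(\mathbf{R}^{\top}\mathbf{A}\right),
  \qquad \cos\rho = \operatorname{tr}\!\left(\hat{\mathbf{R}}^{\top}\mathbf{A}\right) \\
\mathbf{T} &= \mathbf{Z}_2\hat{\mathbf{R}} - \cos\rho \, \mathbf{Z}_1 \\
\lVert \mathbf{T} \rVert &= \sin\rho, \qquad 0 \le \rho \le \pi/2
\end{aligned}
```

With the SVD $\mathbf{A} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^{\top}$, the maximizer is $\hat{\mathbf{R}} = \mathbf{U}\mathbf{V}^{\top}$ (with the sign of the last singular direction flipped if that is needed to keep the determinant at $+1$), which is why no iterative optimizer is required. The unit-size constraint on the $\mathbf{Z}_i$ is what bounds $\rho$.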
The article by Klingenberg and Monteiro cited above also gives a qualitative overview.
The insight for applicability to time series comes from this sketch:
Applying the PTD to unique pairs of edges results in:
Note however that the traces in the picture could just as well be three different time series. Accordingly, the PTD for shapes also yields distances between time series.
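To make the idea concrete, here is a minimal Python sketch of a pairwise tangent distance between two series. It is not the procGPA call used for the results in this post. It assumes each series is turned into two-dimensional landmarks $(t_j, y_j)$ on a common time grid, and it takes the norm of the tangent matrix as the distance. The function names `preshape`, `procrustes_tangent`, `series_to_landmarks`, and `ptd` are hypothetical.

```python
import numpy as np

def preshape(X):
    """Centre a k-by-m landmark matrix and scale it to unit Frobenius norm."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    size = np.linalg.norm(Xc)
    if size == 0:
        raise ValueError("all landmarks coincide; shape is undefined")
    return Xc / size

def procrustes_tangent(X1, X2):
    """Return (T, rho): tangent matrix at the pole X1, and Riemannian distance."""
    Z1, Z2 = preshape(X1), preshape(X2)
    A = Z2.T @ Z1
    U, s, Vt = np.linalg.svd(A)
    # Restrict to proper rotations (determinant +1), not reflections.
    D = np.eye(len(s))
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))
    R = U @ D @ Vt                      # maximizes trace(R.T @ A)
    cos_rho = np.clip(np.trace(R.T @ A), -1.0, 1.0)
    T = Z2 @ R - cos_rho * Z1           # projection into the tangent space
    return T, float(np.arccos(cos_rho))

def series_to_landmarks(y, t=None):
    """Assumed encoding: landmark j is the point (t_j, y_j)."""
    y = np.asarray(y, dtype=float)
    if t is None:
        t = np.linspace(0.0, 1.0, len(y))
    return np.column_stack([t, y])

def ptd(y1, y2):
    """Tangent distance between two equal-length series (norm of T)."""
    T, _ = procrustes_tangent(series_to_landmarks(y1), series_to_landmarks(y2))
    return float(np.linalg.norm(T))

if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 100)
    print(ptd(np.sin(2 * np.pi * t), np.sin(6 * np.pi * t)))
```

Because location, isotropic scale, and rotation are all removed, two series that differ only by such a transformation of their $(t, y)$ trace come out at distance zero. Both series must have the same number of samples, since the landmarks are matched one to one.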
Does this generalize, however? Do the distances continue to make sense even when the series differ in other ways?
Consider
(Click image to see a larger figure, and use browser Back Button to return to blog.)
In the label atop each panel, the “L” factor is inversely proportional to slope, except for the zero case, which denotes a zero slope. Likewise, the “W” factor is inversely proportional to frequency.
What does the PTD produce as distances among these? Note that the larger the number in the following figure, the farther apart the cases are:
(Click image to see a larger figure, and use browser Back Button to return to blog.)
The distances show that irrespective of slope, the PTD is picking up ripple trains with the same frequency. Some are annotated.
Note that these distances have been multiplied by 100 to bring them into a range where they register well in the plot. What this means is that PTD considers all the cases pretty close to one another in shape. Nevertheless, it is capable of good discriminations.
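An experiment of this kind can be sketched as follows. This is an illustration only, not the code behind the figures: the particular “L” and “W” values, the ripple amplitude, the time grid, and the helper names are all hypothetical, chosen so that slope is 1/L (with L = 0 standing for zero slope) and frequency is 1/W. The distance is computed with a direct SVD rather than with procGPA.

```python
import numpy as np

def tangent_distance(X1, X2):
    """Norm of the Procrustes tangent matrix between two k-by-m landmark sets."""
    def preshape(X):
        Xc = X - X.mean(axis=0)
        return Xc / np.linalg.norm(Xc)
    Z1, Z2 = preshape(X1), preshape(X2)
    A = Z2.T @ Z1
    U, s, Vt = np.linalg.svd(A)
    D = np.eye(len(s))
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))   # keep a proper rotation
    R = U @ D @ Vt
    cos_rho = np.clip(np.trace(R.T @ A), -1.0, 1.0)
    return float(np.linalg.norm(Z2 @ R - cos_rho * Z1))

def ripple(t, L, W, amplitude=0.1):
    """Hypothetical test series: linear trend of slope 1/L plus a sine of period W."""
    slope = 0.0 if L == 0 else 1.0 / L
    return slope * t + amplitude * np.sin(2.0 * np.pi * t / W)

# Hypothetical 4-by-4 grid of cases, 16 in all.
L_values = [0, 1, 2, 4]
W_values = [0.05, 0.1, 0.2, 0.4]
t = np.linspace(0.0, 1.0, 200)

labels, cases = [], []
for L in L_values:
    for W in W_values:
        labels.append(f"L={L}, W={W}")
        cases.append(np.column_stack([t, ripple(t, L, W)]))

# Pairwise distances, scaled by 100 as in the plot described above.
n = len(cases)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = 100.0 * tangent_distance(cases[i], cases[j])

if __name__ == "__main__":
    np.set_printoptions(precision=2, suppress=True, linewidth=200)
    print(D)
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed directly to a clustering or heatmap routine. Whether it reproduces the pattern in the figure depends on the parameter choices, which are guesses here.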
What does SNCD do with the same 16 cases?
(Click image to see a larger figure, and use browser Back Button to return to blog.)
In short, the divergences are very difficult to reconcile with any pattern of similarity. Even shorter, SNCD butchered it.
Code for calculating these figures and results is available in my Google repository.
Finally, I have repeated the analysis of high school electricity consumption clustering with PTD and found it gave nearly identical results to use of SNCD,