(Major update on 11th January 2019. Minor update on 16th January 2019.)
On comparing things
The idea of calculating a distance between series for various purposes has received scholarly attention for quite some time. The most common application is to time series, but particularly with the advent of means to rapidly ascertain nucleotide and ligand series, distances between these are increasingly of inferential interest.
Vitányi’s idea(**)
When considering the divergence between arbitrary sequences, P. M. B. Vitányi has been active since at least 1998, beginning with his work with colleagues Bennett, Gács, Li, and Zurek, per “Information distance“, which appeared in IEEE Transactions on Information Theory in that year. Since 1998, he has published papers with these and other colleagues. The paper which concerns the present post is R. Cilibrasi, P. M. B. Vitányi, “Clustering by compression“, which appeared in IEEE Transactions on Information Theory in 2005. That paper and A. R. Cohen, P. M. B. Vitányi, “Normalized Compression Distance of multisets with applications“, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(8), 1602-1614, explain why this way of comparing sequences is attractive. For example, from Cohen and Vitányi,
Pairwise normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity metric based on compression … The way in which objects are alike is commonly called similarity. This similarity is expressed on a scale of 0 to 1 where 0 means identical and 1 means completely different … To define the information in a single finite object one uses the Kolmogorov complexity [15] of that object (finiteness is taken as understood in the sequel). Information distance [2] is the information required to transform one in the other, or vice versa, among a pair of objects … Here we are more concerned with normalizing it to obtain the so-called similarity metric and subsequently approximating the Kolmogorov complexity through real-world compressors [19]. This leads to the normalized compression distance (NCD) which is theoretically analyzed and applied to general hierarchical clustering in [4]. The NCD is parameter-free, feature-free, and alignment-free, and has found many applications in pattern recognition, phylogeny, clustering, and classification ….
This is exciting because it offers a way of, if you will, doing really non-parametric statistics: Not only do inferential procedures based upon these not care about the statistical distributions which the units of study exhibit, they are also opaque to many features which might sidetrack inference with outlying characteristics. These sometimes arise from simple mistakes in measurement or record. It’s to be expected, I think, that use of such techniques will result in loss of statistical power in comparison to inferences based upon good parametric models for a given dataset. On the other hand, it’s almost impossible to make specification errors, or to form Likelihood functions improperly. Aspects of models which cause these simply are not seen.
Definition and some properties
The basic idea of determining how far apart two sequences $x$ and $y$ are begins by positing a practical compressor, an operator, $R(\cdot)$, which takes a sequence $x$ into a compressed version of $x$, or $R(x)$. Then define $K(x) = \mathrm{len}(R(x))$, where the $\mathrm{len}(\cdot)$ is a length measure of sorts applied to each of $R(x)$ and $R(y)$, perhaps the length of the resulting compression in bits or nats. Then

$\mathrm{NCD}(x,y) = \dfrac{K(xy) - \min(K(x), K(y))}{\max(K(x), K(y))}$,

where $xy$ denotes the concatenation of the sequence $x$ with the sequence $y$, is interpreted as the normalized compressed divergence between $x$ and $y$. If

$\mathrm{SNCD}(x,y) = \dfrac{\bigl(K(xy) - \min(K(x), K(y))\bigr) + \bigl(K(yx) - \min(K(x), K(y))\bigr)}{2\,\max(K(x), K(y))}$

is calculated instead, a pseudo-distance is obtained. It is at least symmetric. In general,

$\mathrm{NCD}(x,y) \ne \mathrm{NCD}(y,x)$.

In other words, $\mathrm{NCD}$ is not a metric distance. $\mathrm{SNCD}$ is not one either, but it is symmetric. This is why, in the title, I refer to it as a symmetrized divergence and, to differentiate it from the standard Cilibrasi and Vitányi version, I call it SNCD and refer to it as a divergence(*).

In fact, the terminology can be confusing. Both fail because they don’t extend across the non-negative Reals. Nevertheless, it is possible to cluster objects using either. It’s difficult to do inferences with these, but defining

$\operatorname{logit}(\mathrm{SNCD}(x,y)) = \log\dfrac{\mathrm{SNCD}(x,y)}{1 - \mathrm{SNCD}(x,y)}$

gets to an index which does extend across the Reals and can be readily used for statistical inference. The logit is a well-known transform for mapping probabilities onto the Real line.
However, the Triangle Inequality is not respected, so the geometry is non-Euclidean.
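As a minimal sketch of these definitions, and not the production code given later in the post, the quantities above can be approximated directly with R’s memCompress; the function names clen, ncd, and sncd here are mine, for illustration only:

library(gtools)   # for logit()

# Compressed length in bytes, net of the compressor's fixed overhead for an empty input.
clen <- function(s, type="xz") {
  length(memCompress(s, type=type)) - length(memCompress("", type=type))
}

# Plain NCD, following Cilibrasi and Vitanyi: not symmetric in general.
ncd <- function(x, y) {
  cx <- clen(x); cy <- clen(y)
  (clen(paste0(x, y)) - min(cx, cy)) / max(cx, cy)
}

# Symmetrized variant (SNCD): average over both concatenation orders, clamped to [0, 1].
sncd <- function(x, y) {
  cx <- clen(x); cy <- clen(y)
  cxy <- clen(paste0(x, y)); cyx <- clen(paste0(y, x))
  max(0, min(1, ((cxy - min(cx, cy)) + (cyx - min(cx, cy))) / (2 * max(cx, cy))))
}

set.seed(42)
x <- paste(sample(letters, 2000, replace=TRUE), collapse="")
# y shares its first 1500 characters with x, so the pair should be judged fairly similar.
y <- paste0(substr(x, 1, 1500), paste(sample(letters, 500, replace=TRUE), collapse=""))
c(NCD=ncd(x, y), NCDrev=ncd(y, x), SNCD=sncd(x, y), logitSNCD=logit(sncd(x, y)))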
The point of the post
In addition to introducing Normalized Compressed Divergence to the readership, something which I’ll be using in future posts, I constructed several series which show similarities to one another. The pseudo-distances between related pairs of these were calculated, as were their logits.
Below I show the series, and then I present a table showing the results. Hopefully, this gives some idea of which series are considered similar and which are not.
The cases
ypredict0, ypredict1: These are two similar instances of a curve and dataset taken from Sivia and Skilling. Both divergences between these curves and intermediate ones are calculated in the results below.
y.trig0, y.trig1, y.trig2, y.trig3: These are instances of a basic sine series and three modulations of it.
y.trig0 shows 4 waves from a sine function with frequency one. y.trig1 shows 4 waves from a related function, one with an amplitude modulation. y.trig2 shows 4 waves from a related function, one shifted in phase by an eighth of a wavelength. y.trig3 shows 4 waves from a related function, one chirped in frequency, with the chirp stepping in thousandths.
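Since the exact modulation, phase, and chirp expressions do not reproduce well here, the following is only an illustrative sketch of how series of this kind could be generated; the amplitude depth and the chirp form are my own assumptions, not necessarily the ones used for the figures:

tt <- seq(0, 4, length.out=1000)                      # four wavelengths of the base series
y.trig0 <- sin(2*pi*tt)                               # base sinusoid, frequency one
y.trig1 <- (1 + 0.5*sin(2*pi*tt/4)) * sin(2*pi*tt)    # amplitude-modulated version (depth 0.5 assumed)
y.trig2 <- sin(2*pi*(tt + 1/8))                       # phase shifted by an eighth of a wavelength
y.trig3 <- sin(2*pi*tt*(1 + (0:999)/1000))            # chirped version, stepping in thousandths (form assumed)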
Some practicalities
When calculating a compressed version of a signal, generally speaking, practical compressors demand a string version of the signal. I have chosen to use the xz compressor with its “-9e” (level 9, extreme) setting, which is what the R internal memCompress function uses. This means compressed lengths cannot be zero, even for an empty string, but, nevertheless, the divergence can be.
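For instance, even an empty string compresses to a nonzero number of bytes, which is the overhead being referred to; a quick check (illustrative only, not part of the analysis code):

length(memCompress("", type="xz"))     # nonzero: container and header bytes alone
length(memCompress("", type="gzip"))   # also nonzero, with a different overhead
# The code later in the post subtracts this empty-string length, so two identical
# series can still attain a divergence of zero.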
Also, numeric signals need to be converted to characters. There are many ways that could be done. I originally used the technique in the TSclust package of R, since that was the only package which overtly claimed to offer an NCD dissimilarity measure. But it turns out that has problems, at least for numerical series. There is also a TSdist package which simply imports the corresponding dissimilarity measures from TSclust.
A problem in TSclust
The TSclust package of R has a diss.NCD dissimilarity calculation. Consulting the source, it’s clear this is not symmetrized, and is patterned literally after Cilibrasi and Vitányi:
###################################################################################
####### Clustering by Compression (2005), Cilibrasi, R., Vitanyi, P.M.B., ########
######## Normalized Compression Distance ##########################################
###################################################################################
diss.NCD <- function(x,y, type="min") {
.ts.sanity.check(x,y)
comp <- .compression.lengths(x,y, type)
(comp$cxy - min(comp$cx,comp$cy)) / max(comp$cx, comp$cy)
}
#common part of compression methods,
#calculate the sizes of the compressed series and of their concatenation
.compression.lengths <- function(x, y, type) {
methods <- type
type = match.arg(type, c("gzip", "bzip2", "xz", "min"))
if (type == "min") { #choose the best compression method of the three
methods <- c("gzip", "bzip2", "xz")
}
xy <- as.character(c(x,y))
x <- as.character(x)
y <- as.character(y)
cxym <- sapply( methods, function(m) { length( memCompress(xy, type=m) )})
cxy <- min(cxym)
cx <- min(sapply( methods, function(m) { length( memCompress(x, type=m) )}))
cy <- min(sapply( methods, function(m) { length( memCompress(y, type=m) )}))
list(cx=cx, cy=cy, cxy=cxy)
}
Apart from the fact that .ts.sanity.check
doesn’t work, note that the numeric series are simply converted to character strings in .compression.lengths
before being subjected to compression and, subsequently, calculation of the NCD. This cannot be correct.
Consider what would happen if there were two time series, $x$ and $y$, each of length 100. Suppose $x$ consists of a sequence of 50 copies of the string “3.1” followed by a sequence of 50 copies of the string “3.2”. Suppose $y$ consists of a sequence of 50 copies of the string “3.2” followed by a sequence of 50 copies of the string “3.1”. Using the dissVSTR function from below, which operates purely on strings, not on statistical series, the SNCD is 0.7.

Now consider two other series, like these but slightly modified, $x'$ and $y'$, also each of length 100. Suppose instead that $x'$ consists of a sequence of 50 copies of the string “3.01” followed by a sequence of 50 copies of the string “3.02”, and $y'$ consists of a sequence of 50 copies of the string “3.02” followed by a sequence of 50 copies of the string “3.01”. If these were statistical series, the values would be closer to one another than they were before. But since they are strings, and have an additional numeral zero, dissVSTR actually shows their SNCD is larger, about 0.73, implying they are farther apart.
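That comparison can be reproduced along the following lines, using the dissVSTR function given at the end of the post; the helper mk and the exact assembly of the strings are my guesses at the construction, so the numbers may differ slightly:

# Build a string of 50 copies of a followed by 50 copies of b.
mk <- function(a, b) paste0(paste(rep(a, 50), collapse=""), paste(rep(b, 50), collapse=""))
x  <- mk("3.1",  "3.2") ;  y  <- mk("3.2",  "3.1")
xp <- mk("3.01", "3.02") ; yp <- mk("3.02", "3.01")
D <- dissVSTR(c(x, y, xp, yp), period=0)
D[1, 2]   # the post quotes about 0.7 for the "3.1"/"3.2" pair
D[3, 4]   # and about 0.73 for the "3.01"/"3.02" pair, despite the values being closer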
I first tried to fix this in an earlier version of this post by precalculating a combined set of bins for the pooled values in both series, based upon the standard binning logic in the sm package, and then quantiles using the hdquantile function of the Hmisc package. I then assigned a character to each of the resulting bins and used the R cut function to determine to which bin each value of the individual series belonged, and coded that. This was a bit better, but I still think it was wrong: Bins corresponding to bigger values carried no more mass than smaller bins and, so, being a bin apart at the large-value end was ranked in information distance exactly the same way as being a bin apart at the low end.
Remedy
The remedy I’ve chosen is to code differently depending on whether the SNCD of the pair of objects is calculated on strings (or files) or on numerical series. For strings, there’s no problem going right ahead and calculating the compressed versions of the strings. But for numerical statistical series, as in the last suggestion above, quantiles of the pooled values from the two series (without duplicates removed) are calculated using the hdquantile function from Hmisc. The number of quantile points is one more than the rounded value of $\log_{2}(n) + 1$, where $n$ is the length of the longer of the two series, in case they are of different lengths. So
hdquantile(x=c(x,y),
           probs=seq(0, 1, length.out=1 + round(log(max(n.x, n.y))/log(2) + 1)),
           names=FALSE)
calculates the pooled quantiles.
Next, assign a unique string character to each of the resulting bins. But instead of just using that character by itself, replicate it into a string with the same number of repetitions as the bin number. Thus, if there are $m$ bins, the bin containing the smallest values has its character just of length unity, but the last bin, bin $m$, has its corresponding character replicated $m$ times.
Finally run over the two series, and code them by emitting the repeated bin labels corresponding to the bin in which they fall. This is the result submitted for compression comparison.
There is an additional thing done to accommodate NA values in the series, but the reader can check the code below for that.
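To see what this coding actually produces, here is a small illustrative run of the numericToStringForCompression function listed below; the toy vectors are mine, chosen only to make the repeated-character codes visible:

library(Hmisc)   # hdquantile, used inside numericToStringForCompression

x <- c(0.11, 0.22, 0.35, 4.7)
y <- c(0.15, 0.28, 0.41, 5.2, 6.0)
coding <- numericToStringForCompression(x, y)
coding$x
coding$y
# Each observation is replaced by a run of a bin-specific character; higher bins get
# longer runs, and NAs (or values not falling strictly inside the cut breaks) map to "!".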
The results
There are 6 cases which serve as end members in various capacities, as shown above. The divergences between ypredict1 and ypredict2 are shown, as are the divergences between y.trig0 and y.trig1, y.trig0 and y.trig2, y.trig0 and y.trig3, y.trig1 and y.trig2, y.trig1 and y.trig3, and finally y.trig2 and y.trig3.
Also shown are intermediate morphings between end members of these pairs. If $u$ is one end member and $v$ is a second end member, then $(1 - \alpha)\,u + \alpha\,v$, for $\alpha$ between 0 and 1, is the intermediate morphing between the two.
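For concreteness, a morph of this kind can be formed as below; the toy end members and the 0.1 step in the weights are my own choices for illustration:

# Convex-combination morph between two equal-length numeric series.
morph <- function(u, v, alpha) (1 - alpha)*u + alpha*v
# Toy end members standing in for, say, ypredict1 and ypredict2:
u <- sin(seq(0, 2*pi, length.out=100))
v <- cos(seq(0, 2*pi, length.out=100))
alphas <- seq(0.1, 0.9, by=0.1)
morphs <- sapply(alphas, function(a) morph(u, v, a))   # one column per intermediate morph
colnames(morphs) <- sprintf("morph_%.1f", alphas)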
Using the divergence between ypredict1 and ypredict2 as a baseline, the modulation of the sine case, y.trig0, into its phase-shifted counterpart, y.trig2, is seen as the least change. The amplitude-modulated version of y.trig0, called y.trig1, has a substantial divergence, but not as much as the chirped version, called y.trig3. The intermediate versions of these behave predictably. It is a little surprising that once ypredict1 is transformed 0.7 of the way to ypredict2, the divergence doesn’t worsen. Also, in the case of the sinusoids, divergences as the curves approach the other end point do not change monotonically. That isn’t surprising, really, because there’s a lot going on with those sinusoids.
The code producing the results
Intermediate datasets and R code for the above results are available from a directory in my Google Drive replacement for Git.
New versions of the code have been uploaded to the Google Drive. The old versions are still there.
The key code for producing the SNCDs from numerical statistical series is:
library(Matrix) # Data structure for large divergence matrices
library(random) # Source of the random function
library(Hmisc) # Source of the hdquantile function
library(gtools) # Source of the logit function
numericToStringForCompression<- function(x, y)
{
n.x<- length(x)
n.y<- length(y)
# (This the default number of bins for binning from the sm package, but there are
# two vectors here, and they need a common number of bins.)
nb<- max(round(log(n.x)/log(2)+1), round(log(n.y)/log(2)+1))
Q<- hdquantile(c(x,y), probs=seq(0,1,length.out=1+nb), names=FALSE)
alphaBet<- c(letters, LETTERS, sapply(X=0:9, FUN=function(n) sprintf("-%.0f", n)))
m<- length(Q) - 1
stopifnot( (1 < m) && (m <= length(alphaBet)) )
codes<- c("!", mapply(A=rev(alphaBet[1:m]), K=(1:m),
FUN=function(A,K) Reduce(f=function(a,b) paste0(a,b,collapse=NULL), x=rep(A, (1+K)), init="", right=FALSE, accumulate=FALSE)))
cx<- 1+unclass(cut(x, Q, labels=FALSE))
cx[which(is.na(cx))]<- 1
cy<- 1+unclass(cut(y, Q, labels=FALSE))
cy[which(is.na(cy))]<- 1
chx<- codes[cx]
chy<- codes[cy]
return(list(x=chx, y=chy))
}
compression.lengths<- function(xGiven, yGiven, type="xz")
{
if (is.numeric(xGiven))
{
coding<- numericToStringForCompression(x=xGiven, y=yGiven)
x<- coding$x
y<- coding$y
} else
{
stopifnot( is.character(xGiven) )
stopifnot( is.character(yGiven) )
x<- xGiven
y<- yGiven
}
#
xx<- c(x,x)
yy<- c(y,y)
xy<- c(x,y)
yx<- c(y,x)
stopifnot( is.character(xx) )
stopifnot( is.character(yy) )
stopifnot( is.character(xy) )
stopifnot( is.character(yx) )
zero<- length(memCompress("", type=type))
cx<- length(memCompress(x, type=type)) - zero
cy<- length(memCompress(y, type=type)) - zero
cxx<- length(memCompress(xx, type=type)) - zero
cyy<- length(memCompress(yy, type=type)) - zero
cxy<- length(memCompress(xy, type=type)) - zero
cyx<- length(memCompress(yx, type=type)) - zero
return(list(cx=cx, cy=cy, cxx=cxx, cyy=cyy, cxy=cxy, cyx=cyx, csymmetric=(cxy+cyx)/2))
}
divc.NCD <- function(xGiven, yGiven, trans=function(x) x)
{
typCompr<- "xz"
if (is.numeric(xGiven))
{
coding<- numericToStringForCompression(x=xGiven, y=yGiven)
x<- coding$x
y<- coding$y
} else
{
stopifnot( is.character(xGiven) )
stopifnot( is.character(yGiven) )
x<- xGiven
y<- yGiven
}
#
xy<- c(x,y)
yx<- c(y,x)
zero<- length(memCompress("", type=typCompr))
cx<- length(memCompress(x, type=typCompr)) - zero
cy<- length(memCompress(y, type=typCompr)) - zero
cxy<- length(memCompress(xy, type=typCompr)) - zero
cyx<- length(memCompress(yx, type=typCompr)) - zero
#
# Symmetrized NCD of the above.
mnxy<- min(cx, cy)
mxxy<- max(cx, cy)
ncd<- max(0, min(1, ( (cxy - mnxy) + (cyx - mnxy) ) / (2*mxxy) ) )
#
return(trans(ncd))
}
divs<- function(SERIES, period=25)
{
stopifnot( is.data.frame(SERIES) )
N<- ncol(SERIES)
divergences<- Matrix(0, N, N, dimnames=list(NULL, NULL))
# Since logits are so common in inference, calculate those, too.
logit.divergences<- Matrix(-Inf, N, N, dimnames=list(NULL, NULL))
N1<- N-1
for (i in (1:N1))
{
for (j in ((1+i):N))
{
d<- divc.NCD(xGiven=SERIES[,i], yGiven=SERIES[,j], trans=function(x) x)
divergences[i,j]<- d
divergences[j,i]<- d
ld<- logit(d)
logit.divergences[i,j]<- ld
logit.divergences[j,i]<- ld
}
if (0 == (i%%period))
{
cat(sprintf("... did %.0f\n", i))
}
}
stopifnot( !is.null(colnames(SERIES)) )
colnames(divergences)<- colnames(SERIES)
rownames(divergences)<- colnames(SERIES)
colnames(logit.divergences)<- colnames(SERIES)
rownames(logit.divergences)<- colnames(SERIES)
#
# Return Matrix objects, leaving conversion to a matrix, a distance matrix, or a data
# frame to the consumer of the output. Can't anticipate that here.
return(list(divergences=divergences, logit.divergences=logit.divergences))
}
dissVSTR<- function(VSTR, period=25, logitp=FALSE)
{
stopifnot( is.vector(VSTR) )
N<- length(VSTR)
zero<- length(memCompress(""))
ncdf<- function(cx, cy, cxy, cyx) { mnxy<- min(cx,cy) ; mxxy<- max(cx,cy) ; return( max(0, min(1, (cxy + cyx - 2*mnxy)/(2*mxxy) ))) }
#
# Precompute the compressed length of each string, net of the empty-string overhead.
CV<- sapply(VSTR, FUN=function(s) length(memCompress(s)) - zero)
if ((N > 200) & (0 < period))
{
cat(sprintf("Preconditioning of %.0f items completed.\n", N))
}
#
if (logitp)
{
dInitial<- -Inf
trans<- logit
} else
{
dInitial<- 0
trans<- function(x) x
}
#
divergences<- Matrix(dInitial, N, N, dimnames=list(NULL, NULL))
#
N1<- N-1
for (i in (1:N1))
{
sx<- VSTR[i]
cx<- CV[i]
for (j in ((1+i):N))
{
sy<- VSTR[j]
cy<- CV[j]
sxy<- sprintf("%s%s", sx, sy)
syx<- sprintf("%s%s", sy, sx)
cxy<- length(memCompress(sxy)) - zero
cyx<- length(memCompress(syx)) - zero
d<- trans(ncdf(cx, cy, cxy, cyx))
if (is.nan(d))
{
cat("NANs within VSTR. Inspection:\n")
browser()
}
divergences[i,j]<- d
divergences[j,i]<- d
}
if ((0 < period) && (200 < N) && (0 == (i%%period)))
{
cat(sprintf("... did %.0f\n", i))
}
}
colnames(divergences)<- names(VSTR)
rownames(divergences)<- names(VSTR)
# Return a Matrix object, leaving conversion to a matrix, a distance matrix, or a data
# frame to the consumer of the output. Can't anticipate that here.
return(divergences)
}
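As a usage sketch, once the functions above are defined, the divergence and logit-divergence matrices for a set of series gathered as columns of a data frame can be obtained as follows; the two toy columns here are stand-ins for the actual series, which come from the linked datasets:

tt <- seq(0, 4, length.out=1000)
SERIES <- data.frame(y.trig0 = sin(2*pi*tt),
                     y.trig2 = sin(2*pi*(tt + 1/8)))
result <- divs(SERIES)
result$divergences          # symmetric Matrix of pairwise SNCDs
result$logit.divergences    # the same divergences on the logit scale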
You are welcome to use this, but please acknowledge its source:
Jan Galkowski from 667-per-cm.net.
Thanks.
(*) L. Pardo, Statistical Inference Based on Divergence Measures, Chapman & Hall/CRC, 2006.
(**) Others have done important work in this area, including I. J. Taneja (2013) in “Generalized symmetric divergence measures and the probability of error“, Communications in Statistics – Theory and Methods, 42(9), 1654-1672, and J.-F. Coeurjolly, R. Drouilhet, and J.-F. Robineau (2007) in “Normalized information-based divergences“, Problems of Information Transmission, 43(3), 167-189.
Triangle inequality is respected in non-Euclidean geometry… if it is not, we do not have a metric space at all.
Hi Ryniacz!
Thanks for your comment.
You are indeed correct that SNCD and many of the other divergences Pardo discusses (see the first footnote) are not metrics and do not describe a metric space. This is why I eschew the term “distance” for these, which is used sloppily in some of the engineering literature, and, like Pardo, prefer “divergence”. This is also true of the famous Kullback-Leibler divergence.
So why pursue these?
Basically there are applications where units of study have no natural metricization. One professional interest of mine, for example, is comparing similarities of USER-AGENT strings. These can be compared with edit distances, including the Levenshtein distance, which does observe the Triangle Inequality and is a metric, but applying it to the USER-AGENT case doesn’t produce distances which are meaningful to interpret. In particular, USER-AGENTs often have insertions of substrings which, when absent, associate the instance with others. Yet the substrings are meaningful in themselves. These are alignment problems, and one of the reasons why NCDs were introduced was to have a computationally cheap way of comparing without resorting to calculation of edit distances or dynamic warps.
Other applications where divergences are useful are shape matching, where Procrustes “distances” are common. Depending upon how these are defined, they may well be metrics, but some are not. Practically, they are often calculated in terms of Riemannian distances, which is how the R package shapes does it. Riemannian distances are true metrics, results of inner products, but because they vary from point to point, it is difficult to see that they are metrics without knowing the manifold.
Finally, there is application in network coordinates, where the Triangle Inequality is famously violated. It’s not clear there how to posit a Riemannian distance because, if there is a manifold, it’s constantly changing. An earlier alternative, Vivaldi network coordinates, forcibly imposed the Triangle Inequality.
I think there is a future in pseudo-metricization theoretical work, where such divergences are approximately metrics over some limited region of support. I have thought about that, but I don’t have definite ideas how to proceed there.
Stay tuned to this blog: You’ll see me advance an alternative to the SNCD based upon Procrustes distance and shapes which appears to work better than it for some problems. However, I cannot see how Procrustes can be usefully applied to problems like USER-AGENT clustering, so SNCD appears to have a role.