I’m beginning a new style of column, called technical publications of the week. While I can’t promise these will be weekly, I will, from time to time, highlight technical publications I’ve recently read which I consider to be noteworthy. I am going to read them all again.
My professional emphasis, recently, for Akamai Technologies, has been on the plethora of adaptations of random projection methods (see also), generally based upon direct application of the Johnson-Lindenstrauss lemma or its several improvements. Many of these are collected under the rubric of locality-sensitive hashing, or LSH.
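To make the random projection idea concrete, here is a minimal sketch in Python of a Gaussian random projection in the spirit of the Johnson-Lindenstrauss lemma, with a sign-based (random hyperplane) hash built on top of it. The function names, dimensions, and seeds are my own illustrative choices and are not taken from any of the papers discussed here.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """Return a k x d matrix with i.i.d. N(0, 1/k) entries (illustrative helper)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional vector to k dimensions via matrix-vector product."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def euclidean(a, b):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def sign_hash(matrix, x):
    """Random hyperplane hash: keep only the sign of each projected coordinate."""
    return tuple(1 if v >= 0 else 0 for v in project(matrix, x))
```

With the 1/sqrt(k) scaling, the distance between two projected vectors is, with high probability, close to the distance between the originals, and the closeness improves as k grows. The sign bits throw away magnitude and keep only which side of each random hyperplane a vector falls on, which is what makes them usable as hash keys.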
The first paper is “Earthquake detection through computationally efficient similarity search,” by C. E. Yoon, O. O’Reilly, K. J. Bergen, and G. C. Beroza, which appeared in 2015 in Science Advances. It also has supporting online material. Using a technique for audio fingerprinting by Baluja and Covell, the authors develop fingerprints for earthquakes and convert these to signatures using LSH. The signatures were used to assess classification accuracy for uncatalogued and catalogued earthquakes, relative to a manually identified set for the Calaveras Fault in California. Performance is compared both to the catalogue and to that obtained through autocorrelation, a well-known but far more computationally expensive technique.
Yoon, O’Reilly, Bergen, and Beroza report very promising results, despite the great reduction in computation needed. Of greater interest to me is how they fit the LSH into a larger signal processing task, including prefiltering beforehand and interpreting results afterwards. They document the progress of a canonical data science project, offering the finished product, but strongly suggesting the pitfalls and backtracking they needed to undertake to bring it to success. That kind of experience is instructive both for students of data science and for the managers who expect results from these investigations.
Second, two papers applying LSH to health-related time series, with a nice discussion of the engineering tradeoffs for these applications:
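As a rough illustration of turning fingerprints into LSH signatures, here is a minimal Python sketch using MinHash over sparse binary fingerprints. I am assuming each fingerprint can be represented as the set of indices of its nonzero bits, and I am using simple linear hash functions modulo a prime; these choices, and all names below, are mine for illustration and should not be read as the authors' implementation.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, larger than any fingerprint index assumed here


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(active_bits, params):
    """Signature = minimum hash value over the fingerprint's nonzero indices."""
    return [min((a * x + b) % PRIME for x in active_bits) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(1 for u, v in zip(sig_a, sig_b) if u == v) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


def band_keys(signature, rows_per_band):
    """Split a signature into bands; fingerprints sharing any band key collide."""
    return [tuple(signature[i:i + rows_per_band])
            for i in range(0, len(signature), rows_per_band)]
```

The band keys are what would go into hash tables: two fingerprints become candidate matches if any one of their bands agrees exactly, so the number of bands and rows per band trade off missed detections against false candidates.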
- Y. B. Kim, E. Hemberg, U.-M. O’Reilly, “Stratified Locality-Sensitive Hashing for accelerated physiological time series retrieval,” 2016 IEEE 38th Annual International Conference of the Engineering in Medicine and Biology Society, October 2016
- D. C. Kale, D. Gong, Z. Che, R. Wetzel, P. Ross, Y. Liu, G. Medioni, “An examination of multivariate time series hashing with applications to health care,” 2014 IEEE Conference on Data Mining, December 2014
Third, a paper by C. Luo and A. Shrivastava, “SSH (Sketch, Shingle, & Hash) for indexing massive-scale time series,” NIPS Time Series Workshop 2016. It offers an LSH-derived technique for preconditioning time series comparison and lookup problems that rely on dynamic time warping, resulting in a net improvement in speed.
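The sketch and shingle steps named in the title can be illustrated roughly as follows. This is my own loose reading, not the paper's code: I assume the sketch is the sign of a sliding-window dot product with a single random filter, that shingles are n-grams of the resulting bit string, and I compare shingle counts with an exact weighted Jaccard similarity instead of the hashing step that would approximate it at scale. All names and parameters are hypothetical.

```python
import random
from collections import Counter


def sketch(series, window, step, seed=0):
    """Slide a window along the series; keep the sign of its dot product
    with one fixed random filter (assumed form of the sketching step)."""
    rng = random.Random(seed)
    filt = [rng.gauss(0.0, 1.0) for _ in range(window)]
    bits = []
    for start in range(0, len(series) - window + 1, step):
        dot = sum(f * v for f, v in zip(filt, series[start:start + window]))
        bits.append(1 if dot >= 0 else 0)
    return bits


def shingle(bits, n):
    """Count every length-n run of consecutive bits (n-gram shingles)."""
    return Counter(tuple(bits[i:i + n]) for i in range(len(bits) - n + 1))


def weighted_jaccard(c1, c2):
    """Weighted Jaccard similarity between two shingle-count multisets."""
    keys = set(c1) | set(c2)
    num = sum(min(c1[k], c2[k]) for k in keys)
    den = sum(max(c1[k], c2[k]) for k in keys)
    return num / den if den else 1.0
```

Because shingle counts ignore where in the series a pattern occurs, two series that differ mainly by local shifts or stretches can still share most of their shingles, which is the intuition for using this as a cheap filter ahead of dynamic time warping.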
Fourth, not a paper, but an interview, from Dr Stephen Chu:
I agree Yoon’s paper is very nice. But it approximately solves the problem. You can EXACTLY solve the problem faster, see….
STOMP_GPU_final_submission_camera_ready.pdf
—
The SSH paper is also approximately solving a problem that you can solve faster, but EXACTLY, see http://www.cs.ucr.edu/~eamonn/SIGKDD_trillion.pdf or watch video https://www.youtube.com/watch?v=d_qLzMMuVQg
The Yongwook Bryce Kim paper is also approximately solving a problem that you can solve faster, but EXACTLY, see http://www.cs.ucr.edu/~eamonn/SIGKDD_trillion.pdf or watch video https://www.youtube.com/watch?v=d_qLzMMuVQg
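For readers wondering how an exact search under dynamic time warping can be made fast at all, one standard ingredient is lower-bound pruning. The sketch below is a minimal Python illustration using the LB_Keogh lower bound with a Sakoe-Chiba band; it is my own simplified example, with hypothetical function names, and makes no claim to reproduce the full method in the linked material.

```python
def dtw_sq(a, b, r):
    """Squared-cost DTW between a and b, warping limited to |i - j| <= r."""
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return prev[m]


def lb_keogh_sq(query, candidate, r):
    """LB_Keogh: cost of candidate points falling outside the query's
    upper/lower envelope; never exceeds dtw_sq for equal-length series."""
    total = 0.0
    n = len(query)
    for i, c in enumerate(candidate):
        window = query[max(0, i - r):min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if c > upper:
            total += (c - upper) ** 2
        elif c < lower:
            total += (c - lower) ** 2
    return total


def nearest_neighbor(query, candidates, r):
    """Exact 1-NN under banded DTW, skipping candidates the bound rules out."""
    best_idx, best, pruned = -1, float("inf"), 0
    for idx, cand in enumerate(candidates):
        if lb_keogh_sq(query, cand, r) >= best:
            pruned += 1  # true DTW cost cannot beat the best found so far
            continue
        d = dtw_sq(query, cand, r)
        if d < best:
            best, best_idx = d, idx
    return best_idx, best, pruned
```

The answer is the same as brute force because a candidate is skipped only when its lower bound already matches or exceeds the best distance found; the savings come from the bound costing linear time per candidate while the full DTW costs time proportional to the series length times the band width.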