This code is a Scala implementation of a sketch/random projection-based method to efficiently perform the parallel indexing of time series and similarity search on them.Paper
Time-series index construction on the input dataset D within distributed data processing frameworks proceeds as follows:
Given a collection of queries Q, in the form of distributed time series dataset, and a previously constructed index on a dataset D, we consider the problem of finding time series that are similar to Q in D. We perform such a search in the following steps:
The main components of Spark-parSketch, i.e., Index Construction ( IC ) and Query Processing ( QP ), are developed within Spark. Spark is deployed on top of Hadoop Distributed File System ( HDFS ) in order to efficiently read input time series, query time series, as well as to store preliminary data and final results, and thus to overcome the bottleneck of centralized data storing.
Spark-parSketch stores grids (the resulting indexes) to a distributed relational storage ( RDB Store ), setup as a number of PostgreSQL instances. Each Spark worker connects to each of the databases through JDBC and persists the contents of each grid cell as soon as they are identified. This pipelined procedure avoids the in-memory storage of large intermediate datasets, hence relaxes the memory consumption during grid construction, to avoid memory overflow at Spark workers. Moreover, the embedded feature of RDBMS for indexation provides for more efficient query processing.
In this use-case, each object of dataset is identified with an ID and consist of 256 values. At each time point, a random walk generator draws a random number from a Gaussian distribution N(0,1), then adds the value of the last number to the new number. The number of time series varies from 50 million to 300 million depending on the experiment.
In this use-case, the real world data represents seismic time series collected from the IRIS Seismic Data Access repository at various earthquake zones. After preprocessing we've achieved two seismic datasets: the first one contains 600 thousand time series of 462 values and the second one - 40 million time series of 200 values each.
We show below set of charts to demonstrate the Spark-parSketch performance (vs. parallelized Linear Search algorithm), in terms of response time and precision. Bar chart compares time performance of two approaches. Two line charts depict the precision of similarity search as top candidates (up to 10) for selected query in a given dataset.
Parameters: The user can observe the tool performance on a range of input datasets ( Input Time Series ), alongside with various scale of query dataset ( Queries Number ). To proceed throught separate Query in the selected Dataset user may use the bottom slide bar ( Query # ).
The combo box Cell size demonstrate how some parameters of spark-parSketch tool, related to the sketch-based algorithm, could impact the precision accuracy.
Quality Ratio2: N/A
Parallel Linear Search
Query # : 0
1 the parameter not affects the PLS precision chart
2 ratio between the 10th highest correlation of the time series found by sketches to the 10th highest correlation found by parallel linear search
Experiments were conducted on a cluster http://www.grid5000.fr of 16 compute nodes with two 8 cores Intel Xeon E5-2630 v3 CPUs, 128 GB RAM, 2x558GB capacity storage per node. The cluster is running under Hadoop version 2.7, Spark v. 2.1 and PostgreSQL v. 9.4 as a relational database system.
* Experimental runs with PLS on a larger datasets couldn't be finished due big time consuming, thus lack of cluster reservation time to conclude similarity search.
The video of demonstration use-case is available here .