Reviews For Paper
 Paper ID 212 Title High-dimensional Time Series Clustering via Cross-Predictability

Question
Summary of Paper and Detailed Comments This paper describes a new similarity metric (namely cross-predictability) in the context of time series analysis and with the particular goal of series clustering in problems with much more series that points in each of them.

Time series clustering is a very interesting problem and the authors are right when they foresee it may become a key data analytics area due to exponentially increasing available of such type of data.

The reported research is appropriately within the remit of the AISTATS conference and the paper well written and structured.
The mathematical developments are very thorough and seem sound to me. The review of methods in section 2 is a bit skewed and limited, but this is perhaps unavoidable given space constraints.

The results from synthetic data are very encouraging and the comparison with baseline methods is fair.
The corresponding results with a real data set are limited in scope but interesting nonetheless.
My main concern about this paper has to do with this real data analysis from a general viewpoint: What I actually sorely miss here is a more clear justification of why cross-predictability is an interesting similarity measure for time series clustering, that is, why is it interesting and, importantly, for which type of problems?: what is the domain context of applicability?
A few sentences to nail this point would be most welcome.

Minor stuff: please revise bibliography list in order to eliminate a few typos and homogenize format.
Overall Rating Accept
Reviewer Confidence Understood the main points in the paper, but skipped the proofs and technical details, not very familiar with the literature

Question
Summary of Paper and Detailed Comments Paper Summary
=======================

The paper addresses the problem of clustering of time-series data, in the setting where the number of time-series observations, $d$, is larger than the duration of the time-series, $T$. To address the problem, the paper uses the notion of “cross-predictability”, where the future value in each time-series is predicted by past values of other time-series. Since the number of time-series, $d$, is large, the paper uses the notion of sparsity to relate each time-series to only a few other time-series, that ideally belong to the same cluster. One can relate the covariance matrix of data $\Sigma$ to the lag-1 auto-covariance $\Sigma_1$ using $\Sigma_1 = \Sigma A^T$, where $A$ denotes the transition matrix of the autoregressive model $X_{t+1} = A X_{t}$. Using empirical covariance and and lag-1 auto-covariance, the paper considers a Dantzig Selector algorithm to recover a sparse $A$. The paper evaluates the performance of the proposed method on synthetic data and a small real dataset of sensor measurements.

=======================

- The reviewer believes that the novelty of the paper is limited. More specifically, the proposed algorithm for time-series clustering is very similar to the Sparse Subspace Clustering (SSC) algorithm [9, 27]. More specifically, the Dantzig Selector algorithm in equation (3.3) and the spectral clustering method are almost the same as the Bias Corrected Dantzig Selector algorithm in [27] followed by spectral clustering as in [9, 27]. The only difference is that the paper uses the lag-1 auto-covarinace instead of the outer product of each point with respect to the rest of the data. Unfortunately, despite this close similarity, the paper has been written in a way that ignores this large overlap between the proposed algorithm and the existing literature: there is no discussion of the similarities and differences between (3.3) and the Bias Corrected Dantzig Selector algorithm in [27], the spectral clustering step is presented without prior motivation and explaining similarities to existing literature on subspace clustering. More importantly, the paper re-introduces definitions and theorems used/proved in the literature without proper references and discussions. For example, Definition 4.2 on inradius is identical to the notion of subspace inradius introduced in [26, 27]. Similarly, “cluster recovery property” has been defined and used with the same setting in [27] as well as “Sparse Subspace Clustering: Algorithm, Theory and Applications, PAMI 2013”. More importantly, the result in Theorem 4.3 is a straightforward and slight modification of the results in [27, 28], where the kernel is substituted with covariance, thanks to the same set-up and almost the same algorithms (Dantzing selector on clustered columns).

- The reviewer believes that the proposed optimization in (3.3) has the drawback of finding the trivial solution of having each column of the data representing itself with a coefficient of 1, while other entries being zeros. More specifically, while in SSC [9, 27], each point is regressed as a sparse combination of OTHER points, in (3.3) each column is regressed as a combination of ALL columns, including itself. This can be seen by the fact that in (3.3), $|| \Sigma A^T - \Sigma_1 || = || X_S^T (X_T A^T - X_T) || \leq \epsilon$. Hence, imposing the sparsity assumption on $A$ will lead to the trivial solution of $A$ being the identity. The reviewer believes that the algorithm should be modified to solve the problem per column, where for each column, the covariance is built using OTHER columns.
Overall Rating Weak Reject
Reviewer Confidence Understood the paper. Checked the proofs. Familiar with the literature

Question
Summary of Paper and Detailed Comments Authors propose a new algorithm for clustering time series based on the assumption that each time series can be modelled by an auto-regressive model. The algorithm estimates a sparse transition matrix, under the assumption that non-zeros values are obtained for times series from the same cluster. Spectral clustering algorithm is then used to perform the clustering.
- "the optimization problem can be decomposed into d independent sub-problems and solved individually": while this statement is computationnaly appealing, it raises certain issues. There is no longer one global problem to solve and it is not clear how the algorithm would behave as long as the AR assumption does not perfectly holds. The experimental section gives an exemple of such a setting but authors should discuss in more details the involvement of this bundle of small problems to solve rather than a global one.
In addition, it is well known that the explanatory variables in a L1 problem should not be too much correlated one to another in order to have a consistent solution. Authors should discuss this point.
- in the experimental section, regarding the real word dataset, Fig. 5 shows that the performance of the clustering is highly related to the \lambda value (ranging from 0 or 0.5 to 0.8). When there is not "true" clustering to tune this parameter, it is hard to see how to choose this parameter in order to obtain a good clustering.

- in the introduction : "distance metric learning [30, 3],": there is nothing related to TS in these papers (or the authors should point it as it is not obvious)
- in the introduction "where d > T: a conventional VAR model": VAR has not been defined yet
- the first paragraph of "Similarity-based Time Series Clustering" section has already been stated in the introduction
- state in the paper that the proof of theorem 4.3 is given in the supplementary material
- section 5.1: DTW is not a distance but a similarity, it then can not be imbedded in a gaussian kernel function (or at least, it does not give a kernel)
- dimension $k$ of the PCA: change the notation
- how can the PCA performs so well w.r.t. the other competing methods when using the synthetic data? Explain in more details, this may be counter-intuitive as the PCA does not take into account the temporal structure of the data.
- "We simply roughly scan through the entire range of [0,1] for 1/" and the results are shown in Figure 5": the range is between 0 and 0.15
Overall Rating Weak Accept
Reviewer Confidence Understood the main points in the paper, but skipped the proofs and technical details, not very familiar with the literature

Question
Summary of Paper and Detailed Comments
The paper introduces a novel time series clustering technique that relies on the autoregressive model. The idea is to learn a VAR model with sparsity constraints on the transition matrix such that only few sensors are taken into account to predict future observation of a given sensor. Once done, a standard spectral clustering procedure is used.

One concern about this idea is that authors introduce a penalization term in the optimization process to tend toward sparsity of the tarnsition matrix, but it is not sufficient to make it block diagonal. It would have been nice, for example, to provide visualization of the learned transition matrices in the experimental section (on both synthetic and real world data) to better understand whether this constraint tend or not to produce block-diagonal transition matrices.

The paper is well-written, the method makes sense and the experiments are sound and convincing. A minor remark is that it would have been nice to experimentally compare the proposed method with kemeans DBA [1], which is a well-established time series clustering method (NB: I am not involved in any way in this paper).

[1] Petitjean, F., Ketterlin, A., Gançarski, P. (2011). A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3), 678-693.
Overall Rating Accept
Reviewer Confidence Understood the main points in the paper, but skipped the proofs and technical details, not very familiar with the literature