Seminar: November 3, 2008

Nathan Srebro, TTI-C

On the Informational Cost of Tractability

As with many other machine learning problems, most formulations of clustering correspond to non-convex, NP-hard optimization problems. Yet when the data really is clustered, and enough data is available, local search and other heuristics tend to recover the clustering. In an extreme regime, we can even prove that the clustering is tractably recoverable. At the other extreme, if the data is not well clustered, or not enough data is available, the "optimal" clustering, although hard to find, is meaningless. This leads to the common wisdom that "clustering isn't hard: it's either easy, or not interesting".
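As a minimal sketch of this phenomenon (the separation, dimensions, and all other details below are illustrative assumptions, not taken from the talk), Lloyd's k-means algorithm, a simple local-search heuristic for a non-convex objective, typically recovers a planted clustering when the data is drawn from two well-separated spherical Gaussians:

    import numpy as np

    rng = np.random.default_rng(0)

    # Planted clustering: two well-separated spherical Gaussians in the plane.
    n, d, sep = 500, 2, 8.0
    means = np.array([[0.0, 0.0], [sep, 0.0]])
    labels = rng.integers(0, 2, size=n)
    X = means[labels] + rng.standard_normal((n, d))

    def kmeans(X, k, iters=50):
        """Lloyd's algorithm: local search on the (non-convex) k-means objective."""
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest center, then recompute the centers.
            assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
            centers = np.array([X[assign == j].mean(0) if (assign == j).any()
                                else centers[j] for j in range(k)])
        return assign

    assign = kmeans(X, k=2)
    # Agreement with the planted clustering, up to relabeling of the two clusters.
    agreement = max((assign == labels).mean(), (assign != labels).mean())
    print(f"fraction of points recovered correctly: {agreement:.3f}")

Shrinking the separation makes both the statistical problem and the local-search landscape harder, which is exactly the regime the questions below probe.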

Is it in fact the case that when a meaningful clustering is statistically recoverable, it is also easy to find computationally? If so, we certainly do not have a rigorous understanding of why. Or is there a regime in which the clustering is statistically recoverable, but not tractably recoverable? For example, might we need more samples to ensure that our recovery methods work, beyond what would be statistically necessary if we had infinite computational power? Can we quantify this increase in the required sample size due to our computational limitations?
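One way to make the last question precise (the notation here is an illustrative formalization, not necessarily the one used in the talk) is to compare the sample size needed by an unrestricted procedure with that needed by a polynomial-time one:

    $n_{\mathrm{stat}}(\epsilon)$: the smallest sample size at which some (possibly
        intractable) procedure recovers the clustering to within error $\epsilon$;
    $n_{\mathrm{poly}}(\epsilon)$: the smallest sample size at which some
        polynomial-time procedure does so.

The "informational cost of tractability" can then be read as the gap between the two, e.g. the ratio $n_{\mathrm{poly}}(\epsilon) / n_{\mathrm{stat}}(\epsilon) \ge 1$, and the question is whether, and by how much, this ratio exceeds one in some regime.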

In this talk I will discuss the above issues, mostly in the context of clustering data from a mixture of Gaussians, but also in some other machine learning problems. I will also mention other ways in which increasing the size of the data reduces, rather than increases, the amount of computation required, contrary to the standard view of computational complexity as increasing with the size of the data.