Hits: 0
Research Area: Practical theory for clustering algorithms
Status: Finished  

StreamKM++ is a new k-means clustering algorithm for data streams. It computes a small weighted sample of the data stream and solves the k-means problem on this sample using the k-means-++ algorithm by Arthur and Vassilvitskii. We use two new techniques. First, we use non-uniform sampling similar to the k-means++ algorithm. This leads to a rather easy implementable algorithm and to a runtime which has only a low dependency on the dimensionality of the data. Second, we develop a new data structure called coreset tree in order to significantly speed up the time necessary for sampling non-uniformly during the coreset construction.

We compared our algorithm experimentally with two well-known streaming implementations (BIRCH and StreamLS). In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and signicantly better than BIRCH (up to a factor of 2). In terms of running time, our algorithm is slower than BIRCH. Comparing the running time with StreamLS, it turns out that our algorithm scales much better with increasing number of centers. We also gave theoretical justification that our sample set is small in low dimensional spaces.


StreamKM++ was published in: Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot: "StreamKM++: A Clustering Algorithm for Data Streams". In Proceedings of the 12th Workshop on Algorithm Engineering and Experiments (ALENEX 2010)

Note that our algorithm was integrated into MOA, a framework for data stream mining, as a package.

For more information on StreamKM++, see also the StreamKM++ Website.

[ Back ]