Sampling Within k-Means Algorithm to Cluster Large Datasets
Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm implements sampling within k-means to reduce the amount of data analyzed, thus decreasing run time. We perform a simulation study to compare our sampling-based k-means to the standard k-means algorithm in terms of both speed and accuracy. Results show that our algorithm is significantly more efficient than the standard algorithm while achieving comparable accuracy.
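The abstract does not specify the sampling scheme, so the following is only a minimal sketch of one way sampling could be placed inside k-means. It assumes a fresh uniform random subsample is drawn at every iteration and that centroids are updated from that subsample alone. The function names `sampled_kmeans` and `assign`, and the parameters `sample_size`, `n_iter`, and `seed`, are hypothetical and are not taken from the study.

```python
import numpy as np


def assign(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    # Squared Euclidean distance from each point to each centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def sampled_kmeans(X, k, sample_size, n_iter=20, seed=0):
    """Illustrative k-means variant that only looks at a random sample
    of the data in each iteration (assumed scheme, not the authors')."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float)
    m = min(sample_size, n)
    for _ in range(n_iter):
        # Draw a fresh subsample without replacement.
        S = X[rng.choice(n, size=m, replace=False)]
        labels = assign(S, centroids)
        for j in range(k):
            members = S[labels == j]
            # Keep the old centroid if no sampled point was assigned to it.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (5000, 2)),
                   rng.normal(10.0, 1.0, (5000, 2))])
    centroids = sampled_kmeans(X, k=2, sample_size=500)
    labels = assign(X, centroids)  # one full pass at the end
    print(centroids)
    print(np.bincount(labels, minlength=2))
```

In this sketch the per-iteration cost scales with `sample_size` rather than with the full dataset size, and only the final assignment touches every point. How the sample size trades off against accuracy is exactly the kind of question the simulation study addresses.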
This research was completed as part of the REU Site Interdisciplinary Program in High Performance Computing at the University of Maryland, Baltimore County.