Divide and Recombine for the Analysis of Very Large Datasets
William S. Cleveland
Department of Statistics
Divide and recombine (D&R) is a framework for the analysis of very large datasets, ubiquitous today in science, engineering, business, and government. The data are divided into subsets, an analysis method is applied to each subset or to each subset in a sample, and the subset outputs of the method are recombined.
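The three steps named above (divide, apply, recombine) can be made concrete with a toy sketch. The specific choices here are assumptions for illustration and are not taken from the abstract: round-robin division, a least-squares slope as the analysis method, and a plain average as the recombination. The function names `divide`, `fit_slope`, and `recombine` are hypothetical.

```python
import random
from statistics import fmean

def divide(rows, n_subsets):
    """Split rows into n_subsets by round-robin (one possible division rule)."""
    return [rows[i::n_subsets] for i in range(n_subsets)]

def fit_slope(subset):
    """Analysis method: least-squares slope of y on x for one subset."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def recombine(estimates):
    """Recombination: average the subset outputs (an assumed, simple choice)."""
    return fmean(estimates)

# Synthetic data with a true slope of 2.0 (made up for this sketch).
random.seed(0)
rows = []
for _ in range(10_000):
    x = random.uniform(0.0, 10.0)
    rows.append((x, 2.0 * x + random.gauss(0.0, 1.0)))

subsets = divide(rows, 20)
dr_slope = recombine([fit_slope(s) for s in subsets])
all_data_slope = fit_slope(rows)  # for comparison only; infeasible at real scale
```

Comparing `dr_slope` with `all_data_slope` shows how close this particular division and recombination come to the all-data estimate on this toy problem; other choices of division or recombination would behave differently.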
The goal of data analysis, whether the dataset is very large or very small, should be comprehensive analysis that does not miss important information in the data. The thousands of analysis methods of statistics and machine learning can be divided into two groups. Mathematical methods, which produce numerical output, enable automated learning by the computer. Visualization methods, which produce visual output, enable human guidance of the automated learning process. Both mathematical methods and visualization methods are critical to comprehensive analysis.
The computation of D&R is embarrassingly parallel: each subset can be processed independently of the others. The recent development of very effective distributed software environments that exploit this property has made the computation feasible. D&R thereby provides a mechanism for comprehensive analysis of very large datasets because it enables both mathematical and visualization methods. In a D&R analysis, mathematical methods are typically applied to all subsets, and visualization methods are typically applied to a representative sample of subsets, chosen with guidance from variables computed by the mathematical methods.
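As a rough illustration of this paragraph, the sketch below applies a numerical summary to every subset in parallel and then uses that summary to pick a few subsets for visualization. Everything specific is an assumption: the thread pool stands in for a distributed environment, the subset mean stands in for a mathematical method, and `representative_sample` is a hypothetical rule that spreads the chosen subsets evenly across the ordered summary values.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean

def summarize(subset):
    """Mathematical method applied to every subset (here, simply the mean)."""
    return fmean(subset)

def representative_sample(summaries, k):
    """Indices of k subsets spread evenly across the ordered summaries (k >= 2)."""
    order = sorted(range(len(summaries)), key=summaries.__getitem__)
    step = (len(order) - 1) / (k - 1)
    return [order[round(i * step)] for i in range(k)]

# Synthetic subsets whose centers differ (made up for this sketch).
random.seed(1)
subsets = [[random.gauss(mu, 1.0) for _ in range(200)] for mu in range(50)]

# Each subset is summarized independently, so the work maps cleanly onto workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, subsets))

# Only these few subsets would be passed on to a visualization method.
to_plot = representative_sample(summaries, 5)
```

Because no subset's summary depends on any other, the `pool.map` call could be swapped for any parallel map without changing the rest of the sketch.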
To achieve its potential, D&R requires much further research in all areas that are involved in the analysis of data: computational environments, mathematical methods, visualization methods, and theory. The goal of the research is to discover methods of division and recombination that provide optimal results from the analysis methods, given that the data must be divided.