Divide and Recombine for the Analysis of Big Data
By William S. Cleveland, Machine Learning Summer School at Purdue, 2011.
Divide and Recombine (D&R) consists of a general approach to parallelizing big data, statistical methods for division and recombination,
sampling and display methods for visualizing samples of subsets, computational methods, and computational environments.
In D&R, the data are broken up into structured subsets, general analysis methods are applied to each subset, and the results of the analyses are recombined. The necessary steps of
data division and recombination open up an exciting area of research in statistical theory and methods, and there are already a number of very useful results.
The steps also open up research in computational methods and hardware-software environments, and here, too, there are important results.
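The divide, analyze, and recombine steps described above can be illustrated with a minimal sketch. It is not the method from the lecture: the random division, the per-subset least-squares line fit, and recombination by simple averaging are all assumptions chosen for illustration, and the function names (divide, analyze, recombine, divide_and_recombine) are hypothetical.

```python
import random
from statistics import fmean


def divide(rows, n_subsets, seed=0):
    """Division step (assumed: random division into disjoint subsets)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    # Striding over the shuffled rows yields subsets of near-equal size.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def analyze(subset):
    """Analysis step (assumed: least-squares fit of y = a + b*x on one subset)."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in subset)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def recombine(estimates):
    """Recombination step (assumed: average the per-subset estimates)."""
    intercepts, slopes = zip(*estimates)
    return fmean(intercepts), fmean(slopes)


def divide_and_recombine(rows, n_subsets=10, seed=0):
    subsets = divide(rows, n_subsets, seed)
    # Each subset is analyzed independently of the others, so this loop
    # is the part that could be distributed across workers.
    estimates = [analyze(subset) for subset in subsets]
    return recombine(estimates)
```

Calling divide_and_recombine on a list of (x, y) pairs returns an (intercept, slope) pair. Because the per-subset fits share no state, the analysis step is the natural place to parallelize. In general, the recombined estimate is not identical to the fit on the full data, and how close it comes depends on how the division and recombination are chosen.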
By introducing the exploitable parallelization of the data, D&R succeeds in making it possible to apply to big data almost any existing analysis method from statistics,
machine learning, and visualization. This enables detailed, comprehensive analysis of big data at all stages of the analysis process, starting with the raw data.
This includes detailed visualization at all stages, not just of reduced data such as summary statistics, results of dimension reduction methods, fitted models,
and the output of algorithms applied to the detailed data. Visualization at all stages substantially reduces the chances of losing critical information in the data.