Model Compression
Cristian Bucila, Rich Caruana, Alexandru Niculescu-Mizil
Often the best performing supervised learning models are ensembles of hundreds or thousands of base-level classifiers. Unfortunately, the space required to store this many classifiers, and the time required to execute them at run-time, prohibit their use in applications where test sets are large (e.g. Google), where storage space is at a premium (e.g. PDAs), and where computational power is limited (e.g. hearing aids). We present a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in performance.
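The core recipe from the paper: train a large ensemble, use it to label a big pool of unlabeled (possibly synthetic) points, then fit a single small model to those ensemble-assigned labels. Below is a minimal sketch using scikit-learn stand-ins; the random-forest teacher, the MLP student, the Gaussian-jitter pseudo-data, and all sizes and hyperparameters are illustrative assumptions, not the paper's setup (the paper used ensembles of many base-level classifiers and trained a neural net on MUNGE-generated data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Toy labeled data standing in for the real training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# "Teacher": a large ensemble. The paper built ensembles of many
# base-level classifiers; a random forest is a convenient stand-in.
ensemble = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Large pool of unlabeled pseudo-data. Here it is just Gaussian jitter
# around training points; the paper's MUNGE procedure does this better.
rng = np.random.default_rng(0)
idx = rng.integers(len(X), size=20_000)
X_pseudo = X[idx] + rng.normal(0.0, 0.1, size=(20_000, X.shape[1]))

# Label the pseudo-data with the ensemble, then train a small "student"
# model on those ensemble-assigned (hard) labels.
y_pseudo = ensemble.predict(X_pseudo)
student = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                        random_state=0).fit(X_pseudo, y_pseudo)

print("student/ensemble agreement:",
      (student.predict(X) == ensemble.predict(X)).mean())
```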
- Found it remarkable that the paper's improved data generation algorithm, MUNGE, was critical to making the compression technique work (a simplified sketch of MUNGE appears after this list).
- Related this to the previously discussed Deep Compression paper. This paper achieved much higher compression ratios (size reduction by a factor of 1000, as opposed to the maximum of 49 in the Deep Compression paper), but this project compressed ensembles, whereas the other paper compressed a single network. It seems both compression techniques could potentially be applied to the same solution.
- Reflected on the distinction between training on the ensemble's soft outputs vs. training on its hard classifications. Rich Caruana gave a talk at the University of Alberta and mentioned training the smaller network on the soft outputs of the ensemble. This paper describes expanding the training set by labeling generated inputs with the ensemble, but not "soft-labeling" them. Speculated that the use of soft outputs for training was a later development (a sketch of the soft-target variant follows below).
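On the MUNGE point: a simplified, continuous-attributes-only sketch of the paper's pseudo-data generation idea. Each example is paired with its nearest neighbour, and with probability `p` each attribute value is replaced by a draw from a normal centred on the neighbour's value, with spread scaled by the local-variance parameter `s`. The all-numeric simplification is an assumption of this sketch; the paper also handles nominal attributes by swapping values outright.

```python
import numpy as np

def munge(X, size_mult=5, p=0.5, s=1.0, seed=0):
    """Simplified MUNGE: continuous attributes only.

    Produces `size_mult` perturbed copies of X. In each pass, every
    example is paired with its nearest neighbour; with probability `p`
    per attribute, the pair's values are replaced by draws from normals
    centred on each other's value, with sd = |a - b| / s.
    """
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(size_mult):
        T = X.astype(float)
        # Nearest neighbours under Euclidean distance (O(n^2) memory:
        # fine for a sketch; use a KD-tree for real data).
        d = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nn = d.argmin(axis=1)
        for i in range(len(T)):
            j = nn[i]
            for a in range(T.shape[1]):
                if rng.random() < p:
                    sd = abs(T[i, a] - T[j, a]) / s
                    T[i, a], T[j, a] = (rng.normal(T[j, a], sd),
                                        rng.normal(T[i, a], sd))
        copies.append(T)
    return np.vstack(copies)
```

Pseudo-data from `munge(X, size_mult=10)` could then replace the Gaussian-jitter pool in the first sketch above.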
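On the soft-output point: in that variant the only change is the training target. Instead of fitting the student to the ensemble's hard `predict()` labels, it is fit to the ensemble's predicted class probabilities, which preserve how confident the ensemble is near decision boundaries. A minimal sketch continuing from the first code block above; regressing on `predict_proba` outputs is just one simple formulation, an assumption here (the later distillation work instead trains with temperature-scaled cross-entropy).

```python
from sklearn.neural_network import MLPRegressor

# Soft targets: class probabilities from the ensemble, reusing
# `ensemble` and `X_pseudo` from the first sketch.
soft_targets = ensemble.predict_proba(X_pseudo)

# The student regresses the full probability vector rather than
# fitting the argmax class label.
soft_student = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500,
                            random_state=0).fit(X_pseudo, soft_targets)

# Predicted class = argmax over the regressed probabilities
# (columns follow ensemble.classes_, which is [0, 1] here).
pred = soft_student.predict(X).argmax(axis=1)
print("soft student/ensemble agreement:",
      (pred == ensemble.predict(X)).mean())
```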