Definition #
The process of reducing an AI model's size and computational footprint through methods such as pruning, quantization, knowledge distillation, or neural architecture search, enabling deployment on edge and mobile devices.
Key Characteristics #
- Pruning: Removing redundant weights or neurons; magnitude pruning can zero out up to ~90% of parameters with little accuracy loss
- Quantization: Converting 32-bit floating-point weights to 8-bit integers (4x smaller, and faster on integer hardware)
- Knowledge distillation: Training a small "student" model to mimic a larger "teacher" model's outputs
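The first two techniques can be sketched in a few lines of NumPy. This is an illustrative toy (a single random weight matrix, per-tensor symmetric quantization); real toolchains such as TensorFlow Model Optimization apply these per layer with calibration:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in for one dense layer

# --- Pruning: zero out the 90% of weights with the smallest magnitude ---
threshold = np.quantile(np.abs(weights), 0.90)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
sparsity = np.mean(pruned == 0.0)

# --- Quantization: map float32 weights to int8 (symmetric, per-tensor) ---
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale  # float reconstruction used at inference

print(f"sparsity after pruning: {sparsity:.2f}")                   # ~0.90
print(f"int8 size vs float32:   {q.nbytes / weights.nbytes:.2f}x")  # 0.25x
print(f"max quantization error: {np.abs(weights - dequant).max():.4f}")
```

Note that the sparsity from pruning only saves memory and compute if the runtime stores and executes sparse tensors; quantization's 4x saving applies unconditionally.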
Why It Matters #
Compressed models can retain roughly 95% of the original accuracy while running up to 10x faster on-device (figures in this range are reported in TensorFlow Lite benchmarks; exact gains depend on the model and hardware).
Common Use Cases #
- Mobile app object detection
- IoT sensor anomaly detection
- Real-time video processing
Examples #
- TensorFlow Lite Converter
- PyTorch Mobile
- Apple Core ML model optimization
FAQs #
Q: Does compression hurt accuracy?
A: Advanced methods such as quantization-aware training (QAT) minimize the loss; well-tuned models often see under a 2% accuracy drop.
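The core idea behind QAT can be sketched as a quantize-dequantize ("fake quantization") round trip inserted into the forward pass, so the model experiences quantization error during training and learns to compensate. A minimal NumPy sketch of that op (not a full training loop):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize round trip: the op QAT inserts into the forward pass."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                          # back to float, carrying quantization error

rng = np.random.default_rng(1)
activations = rng.normal(size=10_000).astype(np.float32)
out = fake_quantize(activations)

# The forward pass now "feels" int8 precision, while gradients stay in float.
err = np.abs(activations - out).max()
print(f"max round-trip error: {err:.4f}")
```

Because the error is bounded by half the quantization step, training against it is what keeps the final accuracy drop small.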
Q: Can I compress any model?
A: Most models can be compressed, but vision and audio models typically tolerate aggressive compression better than large language models, whose attention mechanisms are more sensitive to quantization error.