A depthwise separable convolution is a computationally efficient alternative to a standard convolution. It factorizes the standard operation into two separate steps: a depthwise convolution followed by a pointwise convolution. First, the depthwise convolution applies a single k×k spatial filter to each input channel independently. Second, the pointwise convolution, which is a 1×1 convolution, projects the channels output by the depthwise step onto a new channel space, mixing information across channels at each spatial position. This factorization reduces the parameter count and multiply-accumulate cost by a factor of roughly 1/C_out + 1/k² relative to a standard convolution with the same kernel size and channel counts, making it well suited to mobile and embedded vision applications without a significant drop in accuracy.
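The two-step factorization above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: the function name, the single-image (H, W, C) layout, and the "valid" padding with stride 1 are simplifying assumptions of this sketch.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_filters, pointwise_filters):
    """Depthwise separable convolution on a single image (sketch).

    x:                 (H, W, C_in) input
    depthwise_filters: (k, k, C_in) -- one k x k spatial filter per input channel
    pointwise_filters: (C_in, C_out) -- the 1x1 convolution, expressed as a matrix
    Uses 'valid' padding and stride 1 for simplicity.
    """
    k, _, c_in = depthwise_filters.shape
    h, w, _ = x.shape
    oh, ow = h - k + 1, w - k + 1

    # Depthwise step: each channel is filtered independently with its own kernel.
    dw = np.zeros((oh, ow, c_in))
    for i in range(oh):
        for j in range(ow):
            patch = x[i:i + k, j:j + k, :]                      # (k, k, C_in)
            dw[i, j, :] = np.sum(patch * depthwise_filters, axis=(0, 1))

    # Pointwise step: a 1x1 convolution is just a matrix product over channels,
    # applied at every spatial position.
    return dw @ pointwise_filters                               # (oh, ow, C_out)
```

To make the savings concrete: with k = 3, C_in = 32, and C_out = 64, a standard convolution holds 3·3·32·64 = 18,432 weights, while the separable version holds 3·3·32 + 32·64 = 2,336, roughly an 8× reduction, matching the 1/C_out + 1/k² factor.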
The concept of factorizing convolutions has roots in earlier work on signal processing and network design. Their modern application in deep learning, however, was popularized by the MobileNetV1 architecture, developed by Google researchers in 2017 with the goal of building highly efficient neural networks for on-device computer vision. François Chollet's Xception architecture, also from 2017, further demonstrated the power of this approach, showing that depthwise separable convolutions could even outperform standard convolutions on large-scale datasets rather than serving only as a mobile-friendly compromise.
Depthwise separable convolutions are a cornerstone of efficient deep learning, particularly in computer vision. They are the core building block for numerous mobile-optimized architectures, including the MobileNet and EfficientNet series. Their use has enabled the deployment of sophisticated computer vision features, such as real-time object detection and image segmentation, on resource-constrained devices like smartphones. The principle of separating spatial and channel-wise operations has also influenced the design of other efficient network architectures beyond computer vision.