VOLO: Vision Outlooker for Visual Recognition
Authors
Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan
Published on
December 21, 2022
Publisher
TPAMI 2022
Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. Due to their low efficiency in encoding fine-level features, ViTs still underperform state-of-the-art CNNs when trained from scratch on a midsize dataset such as ImageNet. Through experimental analysis, we attribute this to two causes: 1) the simple tokenization of input images fails to model important local structures such as edges and lines, leading to low training-sample efficiency; 2) the redundant attention backbone design of ViTs yields limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we present a new simple and generic architecture, termed Vision Outlooker (VOLO), which implements a novel outlook attention operation that dynamically conducts local feature aggregation in a sliding-window manner across the input image. Unlike self-attention, which focuses on modeling global dependencies of local features at a coarse level, our outlook attention aims at encoding finer-level features, which are critical for recognition but ignored by self-attention. Outlook attention also breaks the bottleneck of self-attention, whose computation cost scales quadratically with the input spatial dimension, and is thus much more memory efficient. Compared with our Tokens-To-Token Vision Transformer (T2T-ViT), VOLO encodes fine-level features, which are essential for high-performance visual recognition, far more efficiently. Experiments show that with only 26.6M learnable parameters, VOLO achieves 84.2% top-1 accuracy on ImageNet-1K without using extra training data, 2.7% better than T2T-ViT with a comparable number of parameters. When the model size is scaled up to 296M parameters, its performance improves further to 87.1%, setting a new record for ImageNet-1K classification. In addition, we use the proposed VOLO models as pretrained backbones and report superior performance on downstream tasks such as semantic segmentation.
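At its core, outlook attention replaces query-key dot products with attention weights predicted directly from each token, which are then used to aggregate that token's K x K spatial neighborhood. Below is a minimal single-head PyTorch sketch of this operation as described in the abstract; the class name OutlookAttention, the stride-1 simplification, and the omission of multi-head support and normalization are illustrative choices for this sketch, not a reproduction of the paper's official implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    # Sketch of outlook attention: single head, stride 1, kernel size K.
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.v = nn.Linear(dim, dim, bias=False)      # value projection
        self.attn = nn.Linear(dim, kernel_size ** 4)  # K^2 x K^2 weights per token
        self.proj = nn.Linear(dim, dim)               # output projection
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):                             # x: (B, H, W, C)
        B, H, W, C = x.shape
        k2 = self.k * self.k

        # Gather the K x K neighborhood of every spatial location.
        v = self.v(x).permute(0, 3, 1, 2)             # (B, C, H, W)
        v = self.unfold(v).reshape(B, C, k2, H * W)   # (B, C, K^2, HW)
        v = v.permute(0, 3, 2, 1)                     # (B, HW, K^2, C)

        # Each token directly predicts its own K^2 x K^2 attention map,
        # with no query-key dot product involved.
        a = self.attn(x).reshape(B, H * W, k2, k2).softmax(dim=-1)

        # Weighted aggregation within each local window.
        out = a @ v                                   # (B, HW, K^2, C)

        # Fold the overlapping windows back onto the H x W grid
        # (values at overlapping positions are summed).
        out = out.permute(0, 3, 2, 1).reshape(B, C * k2, H * W)
        out = F.fold(out, output_size=(H, W),
                     kernel_size=self.k, padding=self.k // 2)
        return self.proj(out.permute(0, 2, 3, 1))     # (B, H, W, C)

Applied to a 14 x 14 grid of 192-dimensional tokens, the module preserves the spatial shape while mixing each token with its local neighborhood:

x = torch.randn(2, 14, 14, 192)
y = OutlookAttention(dim=192)(x)
print(y.shape)  # torch.Size([2, 14, 14, 192])

Because the attention map is predicted per location over a fixed K x K window, the cost grows linearly with the number of tokens rather than quadratically, which is the memory advantage over self-attention noted above.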