Adapter Module for Computer Vision Transformer

An efficient way to fine-tune large, pre-trained Vision Transformer (ViT) models for new tasks.

The Core Problem and Our Idea 💡
State-of-the-art ViT models are massive and powerful, but adapting them to a new, specific dataset (a “downstream task”) requires “fine-tuning” the entire model. This is problematic for two main reasons:

  1. It’s expensive: updating all of a ViT’s parameters requires significant computational power and time.
  2. It’s inefficient: You have to save a complete, separate copy of the massive model for every single new task.

We looked to the field of Natural Language Processing (NLP), where researchers had developed a clever solution for their transformer models called adapter modules. Our central idea was to see if we could successfully apply this same technique to the computer vision domain with ViTs.

An adapter module is a very small set of new layers that we insert into the existing architecture of a pre-trained model. The key is that we freeze all the original weights of the giant ViT and only train our small, new adapter modules. If this works, we can adapt a ViT to a new task by only training and saving a tiny fraction of the parameters, making the whole process much cheaper and more efficient.
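The idea above can be sketched in PyTorch. This is a minimal illustration, not the project's exact code: it assumes the common bottleneck adapter design from the NLP literature (down-projection, nonlinearity, up-projection, residual connection), and the `Adapter` class, dimensions, and the toy encoder layer standing in for a ViT block are all illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity map
        # and the pre-trained model's behavior is preserved at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Toy stand-in for one pre-trained ViT encoder block (illustrative sizes).
dim = 192
block = nn.TransformerEncoderLayer(d_model=dim, nhead=3, batch_first=True)
adapter = Adapter(dim)

# Freeze the "pre-trained" weights; only the adapter stays trainable.
for p in block.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
total = trainable + sum(p.numel() for p in block.parameters())
print(f"trainable params: {trainable} / {total}")
```

With this setup only a few percent of the parameters are trainable, so adapting to a new task means training and saving just the small adapter weights.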

Initially, our project faced a significant hurdle: the model consistently suffered from severe overfitting, leading to poor performance. We overcame this by refining the training regime with stronger regularization, a learning-rate warm-up, and specific weight-initialization methods. These changes unlocked the model’s potential, allowing it to reach near state-of-the-art results, such as 90.5% test accuracy on CIFAR-100.
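A learning-rate warm-up can be implemented with a standard scheduler. The sketch below is an assumption-laden example, not the project's actual configuration: the linear warm-up followed by cosine decay, the step counts, and the `AdamW` optimizer are all illustrative choices.

```python
import math
import torch

# Illustrative model and optimizer; the real project fine-tunes adapter layers.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        # Linear warm-up: ramp the LR from near zero to its base value.
        return (step + 1) / warmup_steps
    # Afterwards, decay the LR to zero with a cosine curve.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

lrs = []
for _ in range(total_steps):
    optimizer.step()       # (loss.backward() would precede this in real training)
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```

Warm-up keeps the early updates small while the randomly initialized adapter weights settle, which is one common way to tame the overfitting and instability described above.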

Ultimately, our work demonstrates that adapter modules hold significant promise and can be a highly effective and useful tool for various computer vision tasks when paired with a proper training regime.
