Multi-resolution Fine-Tuning of Vision Transformers

Fitzgerald, Kerr, Law, Meng, Seah, Jarrel, Tang, Jennifer and Matuszewski, Bogdan (ORCID: 0000-0001-7195-2509) (2022) Multi-resolution Fine-Tuning of Vision Transformers. In: Medical Image Understanding and Analysis, 27/7/2022-29/7/2022, Cambridge.

Full text not available from this repository.

For computer vision systems based on artificial neural networks, increasing the resolution of images typically improves the performance of the network. However, ImageNet pre-trained Vision Transformer (ViT) models are typically only openly available for 224² and 384² image resolutions. To determine the impact of using higher-resolution images with ViT systems, the performance differences between ViT-B/16 models (designed for 384² and 544² image resolutions) were evaluated. The multi-label classification RANZCR CLiP challenge dataset, which contains over 30,000 high-resolution labelled chest X-ray images, was used throughout this investigation. The performance of the ViT 384² and ViT 544² models with no ImageNet pre-training (i.e. models trained using only RANZCR data) was first compared to see whether using higher-resolution images increases performance. A multi-resolution fine-tuning approach was then investigated for transfer learning. In this approach, learned parameters were transferred from ImageNet pre-trained ViT 384² models, which had undergone further training on the 384² RANZCR data, to ViT 544² models, which were then trained on the 544² RANZCR data. Learned parameters were transferred via a tensor slice copying technique. The results obtained provide evidence that using larger image resolutions positively impacts ViT network performance and that multi-resolution fine-tuning can lead to performance gains. The multi-resolution fine-tuning approach used in this investigation could potentially improve the performance of other computer vision systems that use ViT-based networks. The results of this investigation may also warrant the development of new ViT variants optimized to work with high-resolution image datasets.
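
The abstract does not spell out how the tensor slice copying works, so the following is only a minimal sketch of one plausible reading: each source tensor from the 384² model is copied into the leading slice of the corresponding, larger 544² tensor, and the remaining entries keep their initial values. With a fixed 16-pixel patch, a 384² input gives 24 × 24 patches and a 544² input gives 34 × 34, so the position-embedding table is the parameter whose shape changes. The names `num_tokens`, `slice_copy`, `pos_384` and `pos_544` are hypothetical, the arrays are random stand-ins rather than real model weights, and the hidden size of 768 is the standard ViT-B value rather than a detail taken from the abstract.

```python
import numpy as np

PATCH = 16   # ViT-B/16 patch size
DIM = 768    # standard ViT-B hidden size (assumption, not from the abstract)


def num_tokens(image_size, patch=PATCH):
    """Patch tokens plus one class token for a square image."""
    side = image_size // patch
    return side * side + 1


def slice_copy(src, dst):
    """Copy src into the leading corner of dst, in place, and return dst.

    Entries of dst outside the overlapping region keep their initial values.
    """
    overlap = tuple(slice(0, min(s, d)) for s, d in zip(src.shape, dst.shape))
    dst[overlap] = src[overlap]
    return dst


rng = np.random.default_rng(0)
# Hypothetical stand-ins for the position-embedding tables of the two models.
pos_384 = rng.normal(size=(num_tokens(384), DIM)).astype(np.float32)
pos_544 = np.zeros((num_tokens(544), DIM), dtype=np.float32)

slice_copy(pos_384, pos_544)
print(num_tokens(384), num_tokens(544))  # 577 1157
```

A plain leading-slice copy like this ignores the 2-D layout of the patch grid; interpolating the position embeddings is a common alternative, and the authors' actual procedure may differ from both.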
