Computer Science: Faculty Publications and Other Works

Token Turing Machines are Efficient Vision Models

Purvish Jajal, Purdue University
Nick Eliopoulos, Purdue University
Benjamin Shiue-Hal Chou, Purdue University
George K. Thiruvathukal, Loyola University Chicago
James C. Davis, Purdue University
Yung-Hsiang Lu, Purdue University

Author Posting © The Author(s), 2024. This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for the watermark, it is identical to the accepted version of the article, which will be published in IEEE Xplore.

Abstract

We propose Vision Token Turing Machines (ViTTM) an efficient low-latency memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines (NTM) and Token Turing Machines (TTM) which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens; process tokens pass through encoder blocks and read-write from memory tokens at each encoder block in the network allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K the state-of-the-art ViT-B has median latency of 529.5ms and 81.0% accuracy while our ViTTM-B is 56% faster (234.1ms) with 2.4x fewer FLOPs with an accuracy of 82.9%. On ADE20K semantic segmentation ViT-B achieves 45.65mIoU at 13.8 frame-per-second (FPS) whereas our ViTTM-B model acheives a 45.17 mIoU with 26.8 FPS (+94%).

This paper has been withdrawn.

Author Manuscript

This is a pre-publication author manuscript of the final, published article.

Computer Science: Faculty Publications and Other Works

Token Turing Machines are Efficient Vision Models

Abstract

Author Manuscript

Submission Tools

Explore

For Contributors

About eCommons