Computer Science: Faculty Publications and Other Works

Token Turing Machines are Efficient Vision Models

Document Type

Conference Proceeding

Publication Date

3-2025

Publication Title

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Pages

7898-7907

Publisher Name

IEEE

Abstract

We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines (NTM) and Token Turing Machines (TTM), which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens. Process tokens pass through the encoder blocks and, at each block, read from and write to the memory tokens, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms) with 2.4× fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B model achieves 45.17 mIoU at 26.8 FPS (+94%).
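The read, process, write pattern described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which operator performs the reads and writes, so plain dot-product cross-attention with a residual update is assumed here, as is the read-then-process-then-write order within a block. The function names (`cross_attend`, `toy_encoder`, `memory_augmented_block`), the token counts (4 process tokens, 16 memory tokens), and the `tanh` stand-in for an encoder block are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, sources):
    """Return, for each query token, a softmax-weighted average of the source tokens.

    Assumption: unparameterized scaled dot-product attention. A real model
    would use learned projections; the abstract does not specify the operator.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, s)) / math.sqrt(dim) for s in sources]
        weights = softmax(scores)
        out.append([sum(w * s[j] for w, s in zip(weights, sources)) for j in range(dim)])
    return out


def residual_add(tokens, update):
    """Element-wise sum of two equally shaped token lists."""
    return [[t + u for t, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]


def toy_encoder(tokens):
    """Hypothetical stand-in for a Transformer encoder block: element-wise tanh."""
    return [[math.tanh(x) for x in tok] for tok in tokens]


def memory_augmented_block(process, memory):
    """One block: process tokens read from memory, are encoded, then write back."""
    process = residual_add(process, cross_attend(process, memory))  # read
    process = toy_encoder(process)                                  # process
    memory = residual_add(memory, cross_attend(memory, process))    # write
    return process, memory


def random_tokens(count, dim, rng):
    """Create `count` random tokens of width `dim`."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
process = random_tokens(4, 8, rng)    # few process tokens (assumed count)
memory = random_tokens(16, 8, rng)    # larger memory (assumed count)
for _ in range(3):                    # three stacked blocks
    process, memory = memory_augmented_block(process, memory)
print(len(process), len(memory))
```

In this sketch only the small set of process tokens passes through the encoder stand-in, while the memory tokens are touched only by the read and write steps. That is the intuition behind the abstract's claim that keeping the number of process tokens low reduces inference time.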

Comments

Author Posting © IEEE, 2025. This is the authors' version of the work. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The definitive version of this work was published in the proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (April 8, 2025), https://doi.org/10.1109/WACV61041.2025.00767.

Recommended Citation

Jajal, Purvish; Eliopoulos, Nick; Chou, Benjamin Shiue-Hal; Thiruvathukal, George K.; Davis, James C.; and Lu, Yung-Hsiang. Token Turing Machines are Efficient Vision Models. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV): 7898-7907, 2025. Retrieved from Loyola eCommons, Computer Science: Faculty Publications and Other Works, http://dx.doi.org/10.1109/WACV61041.2025.00767

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

Available for download on Tuesday, September 01, 2026

Included in

Artificial Intelligence and Robotics Commons, Systems Architecture Commons

Author Manuscript

This is a pre-publication author manuscript of the final, published article.
