Computer Science: Faculty Publications and Other Works

AdvChar: Attacking Interpretable NLP Systems

Document Type

Article

Publication Date

10-15-2025

Publication Title

IEEE Transactions on Information Forensics and Security

Volume

20

Pages

11342–11354

Publisher Name

IEEE

Abstract

Studies have shown that machine learning systems are vulnerable to adversarial examples, both in theory and in practice. While previous attacks have focused mainly on vision models, exploiting the difference between human and machine perception, text-based models have also fallen victim to such attacks. However, these attacks often fail to preserve the semantic meaning of the text and its similarity to the original input. This paper introduces AdvChar, a black-box attack on interpretable natural language processing (NLP) systems, designed to mislead the classifier while keeping the interpretation similar to that of benign inputs, thus exploiting trust in system transparency. AdvChar achieves this by making barely noticeable modifications to the text input, forcing the deep learning classifier to make incorrect predictions while preserving the original interpretation. We use an interpretation-focused scoring approach to identify the most critical tokens, that is, the tokens that, when changed, can cause the classifier to misclassify the input. We apply simple character-level modifications to measure token importance, minimizing the difference between the original and the modified text while generating adversarial interpretations similar to benign ones. We evaluated AdvChar thoroughly against seven NLP models and three interpretation models, using benchmark datasets for the classification task. Our experiments show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average per input sample.
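To make the general idea of a black-box, character-level attack more concrete, the block below is a minimal sketch of a generic greedy attack, not the authors' AdvChar implementation. It assumes a hypothetical classifier that maps a list of tokens to class probabilities, ranks tokens by a simple leave-one-out score on the true-class probability, and tries one-edit character perturbations (adjacent swaps and deletions) on the highest-ranked tokens. AdvChar's interpretation-focused scoring and its constraint that the adversarial interpretation stay close to the benign one are deliberately omitted; `toy_predict`, `greedy_char_attack`, and `max_tokens` are illustrative names only.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical black-box interface: tokens in, class probabilities out.
Classifier = Callable[[List[str]], Sequence[float]]


def argmax(probs: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probs)), key=lambda k: probs[k])


def token_importance(tokens: List[str], predict: Classifier, label: int) -> List[float]:
    """Leave-one-out score: drop in true-class probability when a token is removed."""
    base = predict(tokens)[label]
    return [base - predict(tokens[:i] + tokens[i + 1:])[label] for i in range(len(tokens))]


def char_variants(word: str) -> List[str]:
    """One-edit character perturbations: adjacent swaps, then single deletions."""
    variants = [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(len(word) - 1)]
    variants += [word[:i] + word[i + 1:] for i in range(len(word))]
    return [v for v in variants if v and v != word]


def greedy_char_attack(tokens: List[str], predict: Classifier, label: int,
                       max_tokens: int = 2) -> Optional[List[str]]:
    """Perturb the most important tokens one at a time until the label flips.

    Returns the adversarial token list, or None if the budget is exhausted.
    """
    scores = token_importance(tokens, predict, label)
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    adv = list(tokens)
    for i in order[:max_tokens]:
        best_word, best_prob = adv[i], predict(adv)[label]
        # Keep the variant that lowers the true-class probability the most.
        for v in char_variants(adv[i]):
            p = predict(adv[:i] + [v] + adv[i + 1:])[label]
            if p < best_prob:
                best_word, best_prob = v, p
        adv[i] = best_word
        if argmax(predict(adv)) != label:
            return adv
    return None


# Hypothetical toy sentiment classifier used only for illustration.
POSITIVE = {"great", "good", "love"}


def toy_predict(tokens: List[str]) -> List[float]:
    hits = sum(t.lower() in POSITIVE for t in tokens)
    p_pos = min(0.3 + 0.4 * hits, 0.95)
    return [1.0 - p_pos, p_pos]
```

With the toy classifier, `greedy_char_attack(["a", "great", "movie"], toy_predict, label=1)` returns `["a", "rgeat", "movie"]`: swapping two adjacent characters in the single most important token is enough to flip the prediction. A real attack on an interpretable system would additionally score candidates by how closely the interpreter's output matches that of the benign input.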

Comments

Author Posting © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The definitive version was published in IEEE Transactions on Information Forensics and Security, Vol. 20 (October 2025), https://doi.org/10.1109/TIFS.2025.3622073.

Recommended Citation

E. Abdukhamidov, T. Abuhmed, J. C. S. Santos and M. Abuhamad, "AdvChar: Attacking Interpretable NLP Systems," in IEEE Transactions on Information Forensics and Security, vol. 20, pp. 11342-11354, 2025, doi: 10.1109/TIFS.2025.3622073.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Download

Available for download on Tuesday, September 01, 2026

Included in

Computer Sciences Commons
