Orthogonal Compression for Edge Vision Transformers: Combining Recursive Weight-Sharing with Token Merging

Junseo Kim, Uraz Odyurt, Amirreza Yousefzadeh

Abstract

Vision Transformers (ViTs) deliver strong accuracy on image classification but impose two distinct resource costs: large parameter storage (due to unique weights per layer) and high peak activation memory (due to quadratic self-attention over long token sequences). Recursive weight-sharing and token merging each address one of these costs independently, yet their combination remains unexplored. This work investigates whether the two compression mechanisms can be composed within the Sliced Recursive Transformer (SReT) without destructive interference. We present a post-training integration framework, propose a depth-aware merging schedule motivated by the observation that spatial redundancy decreases across recursive iterations, and report preliminary measurements on a flat ViT baseline. On DeiT-Tiny-Distill evaluated on ImageNet-1K, token merging at a rate of r=15 yields 1.8x throughput at a cost of 2.9 percentage points of accuracy. These results, combined with the 4x parameter reduction that SReT alone already achieves relative to a flat model of the same effective depth, motivate the ongoing integration and characterize the design challenges involved.
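
To make the effect of a merging rate such as r=15 on the token count concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes a constant-rate schedule in which up to r tokens are merged inside every layer, capped at half of the unprotected tokens (the rule used by common bipartite token-merging schemes). It also assumes 12 layers and 198 input tokens for DeiT-Tiny-Distill (196 patches from a 224x224 image with 16x16 patches, plus class and distillation tokens). The function names are hypothetical, and the depth-aware schedule proposed in this work is not modelled here.

```python
def token_schedule(n_tokens, r, depth, protected=0):
    """Token count seen by each layer's attention, followed by the final count.

    Assumption: up to r tokens are merged inside every layer, capped at half
    of the unprotected tokens.
    """
    counts = [n_tokens]
    for _ in range(depth):
        n = counts[-1]
        merged = max(0, min(r, (n - protected) // 2))
        counts.append(n - merged)
    return counts


def relative_attention_pairs(counts):
    """Sum of n^2 over all layers, relative to a model that never merges."""
    per_layer = counts[:-1]  # the last entry is the output of the final layer
    baseline = len(per_layer) * per_layer[0] ** 2
    return sum(n * n for n in per_layer) / baseline


# Assumed setting: 198 tokens, 12 layers, class and distillation tokens protected.
counts = token_schedule(n_tokens=198, r=15, depth=12, protected=2)
print(counts)
print(f"relative attention pairs: {relative_attention_pairs(counts):.2f}")
```

Under these assumptions the sequence shrinks steadily with depth, so the later layers attend over far fewer tokens than the first. The ratio counts pairwise attention interactions only; it is not a throughput prediction and says nothing about the measured 1.8x figure.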

Metadata

Type:: Conference Talk
Year:: 2026
Venue:: CompSys 2026 Conference

Links

Licence

Artefacts shared as PDF are licenced under CC BY 4.0.