Orthogonal Compression for Edge Vision Transformers: Combining Recursive Weight-Sharing with Token Merging
Abstract
Vision Transformers (ViTs) deliver strong accuracy on image classification but impose two distinct resource costs: large parameter storage, because every layer carries its own weights, and high peak activation memory, because self-attention scales quadratically with the number of tokens. Recursive weight-sharing and token merging each address one of these costs, yet their combination remains unexplored. This work investigates whether the two compression mechanisms can be composed within the Sliced Recursive Transformer (SReT) without destructive interference. We present a post-training integration framework, propose a depth-aware merging schedule motivated by the observation that spatial redundancy decreases across recursive iterations, and report preliminary measurements on a flat ViT baseline. On DeiT-Tiny-Distill evaluated on ImageNet-1K, token merging at a rate of r=15 yields 1.8x throughput at a cost of 2.9 percentage points of accuracy. Together with the 4x parameter reduction that SReT alone achieves relative to a flat model of the same effective depth, these results motivate the ongoing integration and help characterize the design challenges involved.