Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
EMNLP 2023
Abstract
This work proposes TransJect, an encoder-only Transformer with enforced injectivity and Lipschitz continuity. By combining orthogonal attention, injective residual connections, and mixture-of-experts, TransJect preserves pairwise distances between token representations across layers, achieves lower entropy, and improves accuracy on both short- and long-sequence benchmarks while also reducing perplexity for language modeling.
Figure: Diagram of TransJect’s manifold‑preserving encoder. Orthogonal attention and injective residuals keep token geometries intact, enabling deeper, more efficient models. Image generated with Gemini AI.
TL;DR
Deep transformers often warp the “shape” of your data: semantically similar tokens drift apart as layers stack up. TransJect reimagines the encoder with orthogonal attention and injective connections to keep representations geometrically faithful while still scaling deep. The result? Stronger performance on both short and long sequences, plus faster inference.
Why this research?
Standard transformer encoders are brilliant at capturing both local and global dependencies, yet they have a blind spot: as layers deepen, token representations get stretched and scrambled. Self‑attention projects embeddings onto sparse manifolds, inflating distances between related tokens and increasing representational entropy. This makes it hard to build very deep or reversible models and leads to unstable information flow.
TransJect tackles this by imposing mathematical structure on the encoder. It combines orthogonal attention with an injective residual connection to ensure that different tokens remain distinct and that pairwise distances change only within a provable bound. By controlling Lipschitz constants, TransJect supports much deeper architectures without the usual information degradation.
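To make these two ingredients concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation; the class name, parameter names, and layer sizes are our own illustrative choices. It pairs an orthogonal linear map, which preserves pairwise distances exactly, with a residual branch whose Lipschitz constant is kept near 1 so that, with a scale alpha < 1, the update h + alpha * f(h) remains injective.

```python
# Minimal sketch (not the authors' code): an orthogonal linear map followed by
# a contractive residual branch. If the branch f is (approximately) 1-Lipschitz
# and alpha < 1, the map h -> h + alpha * f(h) is injective.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal


class InjectiveResidualBlock(nn.Module):
    """Illustrative block; names and sizes are assumptions, not the paper's."""

    def __init__(self, dim: int, alpha: float = 0.9):
        super().__init__()
        # Orthogonal parametrization keeps W^T W = I, so ||Wx - Wy|| = ||x - y||.
        self.proj = orthogonal(nn.Linear(dim, dim, bias=False))
        # Spectral normalization bounds each linear map's Lipschitz constant by ~1;
        # ReLU is 1-Lipschitz, so the whole branch is (approximately) 1-Lipschitz.
        self.branch = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(dim, dim)),
            nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(dim, dim)),
        )
        self.alpha = alpha  # alpha < 1 keeps the residual update injective

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)                        # distance-preserving mixing
        return h + self.alpha * self.branch(h)  # Lipschitz-bounded update


x = torch.randn(2, 16, 64)                  # (batch, tokens, model dim)
print(InjectiveResidualBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```

The design choice to keep alpha strictly below the inverse of the branch's Lipschitz bound is what rules out collisions: if two inputs mapped to the same output, the residual branch would have to move them apart faster than its Lipschitz constant allows.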
Main insights
- Geometry matters: Orthogonal projections and shared eigenvalues keep the relative geometry of tokens stable across layers (see the sketch after this list).
- Injectivity & reversibility: The encoder guarantees that distinct inputs never collide in representation space, moving toward reversible NLP models.
- Entropy–sparsity connection: Lower activation bounds correlate with lower representational entropy, analogous to thermodynamic reversibility.
- Orderly sparsity: A mixture‑of‑experts design encourages balanced specialization and lower entropy compared to standard multi‑head attention.
- Deep scaling without collapse: Controlling Lipschitz constants allows very deep encoders to maintain information flow.
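The geometry claim is easy to verify numerically: applying one orthogonal matrix to every token vector leaves all pairwise Euclidean distances unchanged. A small NumPy sketch (illustrative only; the variable names are ours):

```python
# Illustrative check: an orthogonal map Q (Q^T Q = I) preserves every pairwise
# Euclidean distance between token vectors, so relative geometry stays intact.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 64))               # 10 token vectors of dimension 64
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # random orthogonal matrix

def pairwise_dist(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

print(np.allclose(pairwise_dist(tokens), pairwise_dist(tokens @ Q.T)))  # True
```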
Results at a glance
| Task / Domain | Key result |
|---|---|
| IMDb (short sequences) | Up to +6.8 pp accuracy over vanilla Transformer |
| AGNews | Random‑TransJect variant slightly surpasses TransJect |
| Long Range Arena (LRA) | Competitive or SOTA versus BigBird, Linformer, Performer, Skyformer |
| Penn Treebank (PTB) | ~79% lower test perplexity than vanilla Transformer |
| Efficiency | 13×–26× faster on long character sequences due to linear attention |
Beyond numbers, TransJect maintains an activation factor around 1 across layers, whereas standard transformers exhibit growing activation factors and entropy. This indicates stable, low‑entropy representations through depth.
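As a rough illustration of what an activation factor near 1 means (the paper's precise definition may differ; this diagnostic is our own simplification), one can measure how much a layer stretches token geometry as the median ratio of pairwise distances after versus before the layer:

```python
# Hypothetical diagnostic (the paper's "activation factor" may be defined
# differently): median ratio of pairwise token distances after vs. before a layer.
import numpy as np

rng = np.random.default_rng(1)

def pairwise_dist(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def expansion_factor(X, layer):
    d_in, d_out = pairwise_dist(X), pairwise_dist(layer(X))
    mask = d_in > 0                      # ignore zero self-distances
    return np.median(d_out[mask] / d_in[mask])

tokens = rng.normal(size=(32, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # orthogonal layer
W = rng.normal(size=(64, 64))                    # unconstrained linear layer

print(expansion_factor(tokens, lambda X: X @ Q.T))   # ~1.0: geometry preserved
print(expansion_factor(tokens, lambda X: X @ W.T))   # >> 1: geometry stretched
```

Under this toy measure, an orthogonal layer scores roughly 1, while an unconstrained linear layer typically inflates distances, mirroring the growing activation factors reported for standard transformers.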
Keywords
| Term | Description |
|---|---|
| TransJect | A manifold‑preserving transformer encoder with orthogonal attention and injective residuals |
| Injectivity | Ensuring distinct tokens remain distinct across all layers |
| Orthogonal Attention | Using orthogonal projections to preserve pairwise distances |
| Entropy–Sparsity | Empirical link between bounded activations and low representational entropy |
| Deep Scaling | Controlling Lipschitz constants to enable deeper, reversible models |