Multi-Head Latent Attention: The RoPE Compatibility Problem - A Detailed Mathematical Analysis

Apr 30, 2025

1. Introduction

1.1 Multi-Head Latent Attention: Advancing Inference Efficiency in Large Language Models

Large Language Models (LLMs) have transformed natural language processing capabilities, yet their deployment presents substantial challenges as model size increases to hundreds of billions of parameters with extended context windows of tens or hundreds of thousands of tokens. During the autoregressive generation process, the Key-Value (KV) cache emerges as a critical bottleneck, presenting organizations with a fundamental trade-off between computational efficiency and memory resource allocation.

Without KV caching, the computational complexity for generating each token scales quadratically with sequence length (O(n²) per token), while memory requirements stay minimal at O(1). This approach becomes prohibitively expensive for long sequences, as each new token requires recomputing attention scores with all previous tokens. The KV caching strategy reduces this to linear computational complexity (O(n) per token), but at the cost of linear memory growth, O(n). For standard Multi-Head Attention (MHA), the total KV cache memory consumption can be expressed as:

\text{Memory}_{\text{MHA}} = B \times L \times N_L \times 2 \times N_H \times D_H \times P \tag{1}

This formula encapsulates the memory requirements across batch size (B), sequence length (L), number of layers (NL), number of attention heads (NH), head dimension (DH), and bytes per element (P). The factor of 2 accounts for storing both keys and values separately. As models grow larger and context windows expand, this memory requirement becomes increasingly untenable, even on high-end hardware accelerators.

Multi-Head Latent Attention (MLA) addresses this challenge through a novel approach that transforms the fundamental equation of memory consumption:

\text{Memory}_{\text{MLA}} = B \times L \times N_L \times (D_C + D_R) \times P \tag{2}

Where DC is the dimension of the compressed KV latent vector and DR is the dimension of the decoupled rotary key component. This reformulation enables substantial memory savings without compromising model capabilities, creating new possibilities for deploying models in resource-constrained environments.
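To make formulas (1) and (2) concrete, here is a small Python sketch that plugs in an illustrative configuration. Every number below (layer count, head count, head dimension, and the compressed dimensions) is a hypothetical placeholder for the sake of the arithmetic, not the configuration of any particular model.

```python
# Illustrative KV-cache sizing from formulas (1) and (2).
# All configuration values below are hypothetical, chosen only to show the arithmetic.

def mha_kv_cache_bytes(B, L, n_layers, n_heads, d_head, bytes_per_elem):
    # Formula (1): B * L * N_L * 2 * N_H * D_H * P
    return B * L * n_layers * 2 * n_heads * d_head * bytes_per_elem

def mla_kv_cache_bytes(B, L, n_layers, d_c, d_r, bytes_per_elem):
    # Formula (2): B * L * N_L * (D_C + D_R) * P
    return B * L * n_layers * (d_c + d_r) * bytes_per_elem

if __name__ == "__main__":
    B, L = 1, 32_768                 # batch size, sequence length
    n_layers, n_heads, d_head = 60, 128, 128
    d_c, d_r = 512, 64               # compressed KV dim and rotary key dim (illustrative)
    P = 2                            # bytes per element (fp16/bf16)

    mha = mha_kv_cache_bytes(B, L, n_layers, n_heads, d_head, P)
    mla = mla_kv_cache_bytes(B, L, n_layers, d_c, d_r, P)
    print(f"MHA cache: {mha / 2**30:.1f} GiB")
    print(f"MLA cache: {mla / 2**30:.1f} GiB")
    print(f"reduction: {mha / mla:.1f}x")
```

With these placeholder numbers the MHA cache is on the order of a hundred gibibytes while the MLA cache is a few gibibytes, which is the kind of gap the reformulation above is designed to close.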

Architectural Differences: MHA vs. MLA

Standard Multi-Head Attention (MHA) and Multi-Head Latent Attention (MLA) share the same high-level goal of enabling tokens to attend to each other, but differ significantly in their internal architecture and memory efficiency characteristics. In standard MHA, each token's hidden representation undergoes three parallel linear projections to create query, key, and value vectors. This process can be represented mathematically as:

q_t = W^Q h_t \tag{3}
k_t = W^K h_t \tag{4}
v_t = W^V h_t \tag{5}

These projections are then split into NH attention heads, each operating in a lower-dimensional space:

q_t^i, k_t^i, v_t^i \in \mathbb{R}^{D_H} \tag{6}

The attention mechanism computes weighted interactions between tokens, where the weights are determined by the compatibility between queries and keys. For each head i, the attention output is computed as:

\text{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{(\text{q}_{t,i})^T\text{k}_{j,i}}{\sqrt{d_h}}\right)\text{v}_{j,i} \tag{7}
\begin{align} \text{where:} \\ \text{j} &: \text{ Position index } (1 \leq j \leq t) \text{ of previous tokens in the sequence} \\ \text{t} &: \text{ Current position in the sequence} \\ \text{q}_{t,i} &: \text{ Query vector at position } t \text{ for head } i \\ \text{k}_{j,i} &: \text{ Key vector at position } j \text{ for head } i \\ \text{v}_{j,i} &: \text{ Value vector at position } j \text{ for head } i \\ \text{d}_h &: \text{ Dimension of each attention head} \\ \text{o}_{t,i} &: \text{ Output of attention at position } t \text{ for head } i \\ \text{i} &\in \{1, 2, \ldots, n_h\}, \text{ where } n_h \text{ is the total number of attention heads} \end{align} \tag{8}

The outputs from all heads are concatenated and projected through an output matrix:

\text{u}_t = W^O[\text{o}_{t,1}; \text{o}_{t,2}; ...; \text{o}_{t,n_h}] \tag{9}
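For readers who prefer code, the following NumPy sketch walks through equations (3)–(9) for a single layer of causal multi-head attention. The dimensions, random weights, and scaling factors are illustrative placeholders only, not a production implementation.

```python
import numpy as np

# Minimal causal multi-head attention for one layer, following eqs (3)-(9).
# Dimensions and random weights are illustrative only.
rng = np.random.default_rng(0)
T, d_model, n_heads = 8, 64, 4          # sequence length, hidden size, heads
d_head = d_model // n_heads

H  = rng.normal(size=(T, d_model))                  # hidden states h_t
Wq = rng.normal(size=(d_model, d_model)) * 0.05     # W^Q
Wk = rng.normal(size=(d_model, d_model)) * 0.05     # W^K
Wv = rng.normal(size=(d_model, d_model)) * 0.05     # W^V
Wo = rng.normal(size=(d_model, d_model)) * 0.05     # W^O

# Eqs (3)-(6): project, then split into heads -> shape (n_heads, T, d_head)
def split_heads(X):
    return X.reshape(T, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(H @ Wq), split_heads(H @ Wk), split_heads(H @ Wv)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Eq (7): causal attention per head
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)      # positions j > t are masked
scores = np.where(mask, -np.inf, scores)
O = softmax(scores) @ V                               # (n_heads, T, d_head)

# Eq (9): concatenate heads and apply the output projection
U = O.transpose(1, 0, 2).reshape(T, d_model) @ Wo
print(U.shape)   # (T, d_model)
```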

During inference, MHA caches the full key and value vectors for each token across all layers and heads, creating substantial memory pressure as sequence length increases. MLA fundamentally reimagines this architecture by introducing an intermediate compression step and decoupling content information from positional information. The architecture consists of two parallel paths:

Figure 1: Architecture of Multi-Head Latent Attention

  1. Content Path (with compression):
c_t^{KV} = W^{DKV}h_t \tag{10}
k_t^C = W^{UK}c_t^{KV} \tag{11}
v_t = W^{UV}c_t^{KV} \tag{12}
q_t^C = W^{UQ}c_t^{Q} \tag{13}
c_t^{Q} = W^{DQ}h_t \tag{14}
  2. Position Path (with RoPE):
k_t^R = R_{\Theta,t}^d \cdot W^{KR}h_t \tag{15}
q_t^R = R_{\Theta,t}^d \cdot W^{QR}c_t^Q \tag{16}

Where RΘ,td represents the rotary position encoding matrix. The final key and query representations are formed by concatenating both components:

k_t = [k_t^C; k_t^R] \tag{17}
q_t = [q_t^C; q_t^R] \tag{18}

During inference, MLA caches both the compressed latent vectors ctKV and the rotary key components ktR as shown in Figure 2. This is reflected in the memory formula, where DC represents the dimension of ctKV and DR represents the dimension of ktR.

Figure 2: MLA caches both the compressed latent vectors ctKV and the rotary key components ktR

The attention calculation in MLA becomes:

\text{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{(\text{q}_{t,i}^C)^T\text{k}_{j,i}^C + (\text{q}_{t,i}^R)^T\text{k}_j^R}{\sqrt{d_h + d_h^R}}\right)\text{v}_{j,i}^C \tag{19}
\begin{align} \text{where:}\\ \text{j} &: \text{ Position index } (1 \leq j \leq t) \text{ of previous tokens in the sequence} \\ \text{t} &: \text{ Current position in the sequence} \\ \text{q}_{t,i}^C &: \text{ Content component of query vector at position } t \text{ for head } i \\ \text{k}_{j,i}^C &: \text{ Content component of key vector at position } j \text{ for head } i \\ \text{q}_{t,i}^R &: \text{ Rotary position component of query vector at position } t \text{ for head } i \\ \text{k}_j^R &: \text{ Rotary position component of key vector at position } j \\ \text{v}_{j,i}^C &: \text{ Value vector at position } j \text{ for head } i \\ \text{d}_h &: \text{ Dimension of the content component} \\ \text{d}_h^R &: \text{ Dimension of the rotary position component} \\ \text{o}_{t,i} &: \text{ Output of attention at position } t \text{ for head } i \\ \text{i} &\in \{1, 2, \ldots, n_h\}, \text{ where } n_h \text{ is the total number of attention heads} \end{align} \tag{20}

And the final multi-head output remains:

\text{u}_t = W^O[\text{o}_{t,1}; \text{o}_{t,2}; ...; \text{o}_{t,n_h}] \tag{21}

This formulation separates content-based attention (first term) from position-aware attention (second term), allowing each to be processed optimally. The content path can be efficiently compressed without worrying about position encoding, while the position path handles rotary encodings separately, maintaining relative position awareness. This decoupling strategy is particularly important because applying rotary position encodings directly to compressed representations would create mathematical incompatibilities during inference, requiring costly recomputation for each new token (as we will derive later in this article). By separating content from position, MLA achieves both memory efficiency and computational efficiency.

1.2 Rotary Position Embeddings (RoPE): Mathematical Foundations

Transformer architectures have demonstrated remarkable efficacy across diverse natural language processing tasks, yet they inherently lack sequential awareness due to their parallel token processing mechanism. To mitigate this limitation, position encoding methodologies have been developed to incorporate sequential information into the representation space. Among these approaches, Rotary Position Embedding (RoPE), introduced by Su et al. (2021), represents a mathematically sophisticated advancement in positional encoding.

RoPE encodes positional information by applying a position-dependent rotation to pairs of dimensions in the embedding space. For a token at position m with embedding vector 𝐱m ∈ ℝd, RoPE transforms query and key vectors as follows:

f_q(\mathbf{x}_m, m) = (\mathbf{W}_q\mathbf{x}_m)e^{im\theta} \tag{22}
f_k(\mathbf{x}_n, n) = (\mathbf{W}_k\mathbf{x}_n)e^{in\theta} \tag{23}

Here, the complex exponential eimθ represents rotation in the complex plane. This operation rotates the query and key vectors by angles proportional to their positions in the sequence. The rotation angle increases with the position index, creating unique position-dependent transformations for each token. For practical implementation in neural networks, these complex number rotations are expressed using real-valued rotation matrices. For embedding vectors with dimension d (where d is even), we can view the embedding space as composed of d/2 two-dimensional subspaces. In each two-dimensional subspace corresponding to dimensions (2i-1, 2i), RoPE applies a 2×2 rotation matrix:

\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \tag{24}

Generalizing to a d-dimensional space (where d is even), RoPE uses a block-diagonal rotation matrix RΘ,md:

R_{\Theta,m}^d = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix} \tag{25}

\text{where } \theta_i = 10000^{-2(i-1)/d} \text{ for } i \in \{1, 2, \ldots, d/2\}
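Because RΘ,md is block-diagonal, RoPE is typically applied pair-by-pair rather than by materializing the full matrix. The sketch below illustrates this for a single vector; the adjacent-pair layout and the helper name rope_rotate are illustrative implementation choices, not a prescribed API.

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Apply R_{Theta,m}^d to a vector x of even dimension d.

    Implements eq (25) pair-by-pair: the (2i-1, 2i) pair of dimensions is
    rotated by the angle m * theta_i with theta_i = base^(-2(i-1)/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE requires an even dimension"
    i = np.arange(d // 2)                       # i - 1 in the formula (0-indexed here)
    theta = base ** (-2.0 * i / d)              # theta_i
    angles = m * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # the (2i-1, 2i) pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: rotate a random 8-dimensional vector as if it sat at position m = 5.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
q_rot = rope_rotate(q, m=5)
print(np.linalg.norm(q), np.linalg.norm(q_rot))  # rotations preserve the norm
```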

Relative position encoding: The principal advantage of Rotary Position Embedding (RoPE) is its intrinsic capacity to encode relative positional information rather than absolute positions. This property becomes mathematically evident when examining the attention mechanism. For a query vector at position m and a key vector at position n, the attention score is formulated as:

q_m^T k_n = (R_{\Theta,m}^d \mathbf{W}_q\mathbf{x}_m)^T(R_{\Theta,n}^d \mathbf{W}_k\mathbf{x}_n) = \mathbf{x}_m^T \mathbf{W}_q^T R_{\Theta,n-m}^d \mathbf{W}_k\mathbf{x}_n \tag{26}

Where RΘ,n-md = (RΘ,md)T RΘ,nd. This means the attention score naturally incorporates the relative position (n-m) rather than the absolute positions. Consequently, this mathematical property enables transformer architectures to develop position-invariant representations of token relationships, thereby enhancing the model's capability to capture linguistic dependencies across diverse contextual environments.

Mathematical Proof of Relative Position Property

This relative position property comes from fundamental properties of rotation matrices:

  1. The transpose of a rotation matrix inverts the rotation: (RΘ,md)T = RΘ,-md
  2. Multiplying rotation matrices compounds their rotations: RΘ,ad RΘ,bd = RΘ,a+bd

Therefore: (RΘ,md)T RΘ,nd = RΘ,-md RΘ,nd = RΘ,-m+nd = RΘ,n-md
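Both properties, and the resulting dependence on the relative offset alone, are easy to confirm numerically. The sketch below builds the block-diagonal matrix of equation (25) explicitly (an illustrative, unoptimized construction) and checks that (RΘ,md)T RΘ,nd = RΘ,n-md and that the score is unchanged when both positions are shifted by the same amount.

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Block-diagonal rotation matrix R_{Theta,m}^d from eq (25)."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = np.cos(m * theta), np.sin(m * theta)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

d, m, n = 8, 7, 12
Rm, Rn, Rrel = rope_matrix(m, d), rope_matrix(n, d), rope_matrix(n - m, d)

# Property check: (R_m)^T R_n == R_{n-m}
print(np.allclose(Rm.T @ Rn, Rrel))            # True

# Consequence: the attention score depends only on the offset n - m.
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
score_1 = (rope_matrix(m, d) @ q) @ (rope_matrix(n, d) @ k)
score_2 = (rope_matrix(m + 100, d) @ q) @ (rope_matrix(n + 100, d) @ k)
print(np.isclose(score_1, score_2))            # True: only n - m matters
```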

2. The Decoupled RoPE Strategy in MLA

2.1 Separating Content and Position Information

The key innovation in MLA is the decoupled Rotary Position Embedding (RoPE) strategy, which elegantly separates content information from positional information:
1. Content Path (no position encoding):

c_t^{KV} = W^{DKV}h_t \tag{27}
k_t^C = W^{UK}c_t^{KV} \tag{28}
q_t^C = W^{UQ}c_t^Q \tag{29}

2. Position Path (Rotated keys and queries):

k_t^R = R_{\Theta,t}^d \cdot W^{KR}h_t \tag{30}
q_t^R = R_{\Theta,t}^d \cdot W^{QR}c_t^Q \tag{31}

3. Final Representations (concatenation):

k_t = [k_t^C; k_t^R] \tag{32}
q_t = [q_t^C; q_t^R] \tag{33}

The MLA approach bifurcates the attention mechanism into dual parallel pathways, enabling distinct processing optimizations for different aspects of token representation. This architectural design represents a significant advancement over traditional attention mechanisms by addressing the fundamental tension between computational efficiency and positional awareness.

In the content path, semantic information is processed without positional encoding, allowing for substantial dimensionality reduction through compression. The down-projection matrix WDKV transforms the high-dimensional hidden state ht into a compact latent representation ctKV with dimension dc, where typically dc ≪ nheads × dhead. This compression captures the essential semantic content while eliminating redundant information, resulting in a more memory-efficient representation that can be cached during inference.

The separate position path maintains positional awareness through Rotary Position Embeddings (RoPE), applied via the rotation matrix RΘ,td. By isolating positional information in a dedicated pathway with dimension dR (typically much smaller than the content dimension), MLA preserves the model's ability to understand token relationships without applying position encodings to the compressed representations. This separation is crucial for avoiding the computational problems that arise when rotation matrices are applied to compressed vectors, which we explore in the sections below.

2.2 Attention Calculation with Decoupled RoPE

The attention score between a query at position p and a key at position j becomes:

a_{pj} = \frac{q_p^T k_j}{\sqrt{d_h + d_h^R}} = \frac{(q_p^C)^T k_j^C + (q_p^R)^T k_j^R}{\sqrt{d_h + d_h^R}} \tag{34}

Expanding each component:
1. Content Path:

\begin{align} (q_p^C)^T k_j^C &= (W^{UQ}c_p^Q)^T(W^{UK}c_j^{KV}) \tag{35} \\ &= (c_p^Q)^T (W^{UQ})^T W^{UK} c_j^{KV} \tag{36} \end{align}

2. Positional Path (with RoPE):

\begin{align} (q_p^R)^T k_j^R &= (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T (R_{\Theta,j}^d \cdot W^{KR}h_j) \tag{37} \\ &= (c_p^Q)^T (W^{QR})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{KR} h_j \tag{38} \\ &= (c_p^Q)^T (W^{QR})^T R_{\Theta,j-p}^d W^{KR} h_j \tag{39} \end{align}

This decomposition highlights how the content similarity component measures semantic relationships independent of position, while the positional component explicitly encodes the relative offset between positions p and j. The additive interaction between these components in the attention calculation allows the model to consider both semantic compatibility and positional context when determining attention weights.

2.3 Optimizations for Efficient Inference

For the content similarity component, MLA employs matrix absorption:

\begin{align} (q_p^C)^T k_j^C &= (W^{UQ}c_p^Q)^T(W^{UK}c_j^{KV}) \tag{40} \\ &= (c_p^Q)^T (W^{UQ})^T W^{UK} c_j^{KV} \tag{41} \end{align}

By rearranging and defining the absorbed matrix (WUQ)' = (WUK)T WUQ, we get:

\begin{align} (q_p^C)^T k_j^C &= ((W^{UK})^T W^{UQ} c_p^Q)^T c_j^{KV} \tag{42} \\ &= ((W^{UQ})' c_p^Q)^T c_j^{KV} \tag{43} \end{align}

This optimization represents a significant computational efficiency gain during inference. By pre-computing the absorbed matrix (WUQ)', we transform what would be two sequential matrix multiplications (the up-projection of the query followed by a dot product with the up-projected key) into a single multiplication followed by a dot product with the compressed key vector.

The absorbed matrix (WUQ)' effectively encapsulates both the query and key up-projection operations in a single transformation. This is particularly valuable during inference, as it reduces the computational overhead for each token generation step. The operation ((WUQ)' cpQ)T cjKV directly computes the content similarity using only the compressed representations, without requiring full decompression of the cached vectors.
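The identity behind this absorption is easy to check numerically. The sketch below uses random matrices with illustrative dimensions; it verifies that pre-computing (WUQ)' = (WUK)T WUQ gives the same score as explicitly up-projecting both vectors.

```python
import numpy as np

# Numerical check of the matrix-absorption identity in eqs (40)-(43).
# Dimensions are illustrative: d_c is the compressed dim, d_h the per-head dim.
rng = np.random.default_rng(0)
d_c, d_h = 16, 8

W_UQ = rng.normal(size=(d_h, d_c))      # up-projection for the query path
W_UK = rng.normal(size=(d_h, d_c))      # up-projection for the key path
c_pQ = rng.normal(size=d_c)             # compressed query latent c_p^Q
c_jKV = rng.normal(size=d_c)            # compressed KV latent c_j^KV (cached)

# Direct computation: up-project both, then take the dot product.
direct = (W_UQ @ c_pQ) @ (W_UK @ c_jKV)

# Absorbed computation: pre-compute (W^UQ)' = (W^UK)^T W^UQ once,
# then work entirely in the compressed space at inference time.
W_absorbed = W_UK.T @ W_UQ              # shape (d_c, d_c), computed offline
absorbed = (W_absorbed @ c_pQ) @ c_jKV

print(np.isclose(direct, absorbed))     # True
```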

2.4 Maintaining Relative Position in the Position Path

For the positional component, we have:

\begin{align} (q_p^R)^T k_j^R &= (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T (R_{\Theta,j}^d \cdot W^{KR}h_j) \tag{44} \\ &= (c_p^Q)^T (W^{QR})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{KR} h_j \tag{45} \\ &= (c_p^Q)^T (W^{QR})^T R_{\Theta,j-p}^d W^{KR} h_j \tag{46} \end{align}

Since (RΘ,pd)T RΘ,jd = RΘ,j-pd, the attention score naturally encodes the relative offset between positions p and j. This mathematical property is central to the effectiveness of the decoupled RoPE approach. The fundamental challenge in position encoding for efficient inference is maintaining awareness of relative positions while avoiding recomputation of key vectors for each new token. The decoupled approach solves this by leveraging a key property of rotation matrices: the product of one rotation matrix's transpose with another rotation matrix encodes the relative angle between them. This means that by caching the position-encoded vectors kjR = RΘ,jd ⋅ WKRhj for each token position j, and computing qpR = RΘ,pd ⋅ WQRcpQ for the current position p, their dot product naturally captures the relative positional relationship without requiring recomputation of previous keys.

2.5 Complete Inference-Time Attention Calculation

During inference, the optimized attention calculation becomes:

a_{pj} = \frac{((W^{UQ})' c_p^Q)^T c_j^{KV} + (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T k_j^R}{\sqrt{d_h + d_h^R}} \tag{47}

Where (WUQ)' is pre-computed, cjKV and kjR are cached for all previous tokens, and we only calculate cpQ and (RΘ,pd ⋅ WQRcpQ) for the current token.
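Putting the pieces together, a single decode step might look like the following single-head NumPy sketch. This is a simplified illustration of equation (47), not a faithful reproduction of any production implementation: the dimensions and weights are random placeholders, the cached kR values are stand-ins for keys that were rotated once when first stored, and the value path is included only to show that it reuses the same cached latents.

```python
import numpy as np

# Sketch of one MLA decode step following eq (47): score each cached position
# against the current token using only compressed latents c^KV and rotary keys k^R.
def rope_rotate(x, m, base=10000.0):
    d = x.shape[-1]
    i = np.arange(d // 2)
    ang = m * base ** (-2.0 * i / d)
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

rng = np.random.default_rng(0)
d_model, d_c, d_h, d_r = 64, 16, 32, 8
p = 10                                     # current position; 10 tokens already cached

# Per-token cache: compressed KV latents and RoPE'd key components only.
cache_cKV = rng.normal(size=(p, d_c))      # c_j^KV for j = 0..p-1
cache_kR  = rng.normal(size=(p, d_r))      # stand-in for k_j^R = R_j W^KR h_j (rotated when stored)

# Weights (would be learned); the absorbed matrix is pre-computed offline.
W_DQ = rng.normal(size=(d_c, d_model)) * 0.05
W_UQ = rng.normal(size=(d_h, d_c)) * 0.1
W_UK = rng.normal(size=(d_h, d_c)) * 0.1
W_UV = rng.normal(size=(d_h, d_c)) * 0.1
W_QR = rng.normal(size=(d_r, d_c)) * 0.1
W_abs = W_UK.T @ W_UQ                      # (W^UQ)' = (W^UK)^T W^UQ

# Current token only: compress, form the absorbed content query and the rotated RoPE query.
h_p  = rng.normal(size=d_model)
c_pQ = W_DQ @ h_p                          # eq (14)
q_abs = W_abs @ c_pQ                       # (W^UQ)' c_p^Q, lives in the compressed space
q_R   = rope_rotate(W_QR @ c_pQ, m=p)      # R_p W^QR c_p^Q

# Eq (47): content term uses cached latents directly, position term uses cached k^R.
scores = (cache_cKV @ q_abs + cache_kR @ q_R) / np.sqrt(d_h + d_r)
w = np.exp(scores - scores.max()); w /= w.sum()

# Values are up-projected from the same cached latents.
values = cache_cKV @ W_UV.T                # v_j = W^UV c_j^KV, shape (p, d_h)
o_p = w @ values
print(o_p.shape)                           # (d_h,)
```

Note that nothing in the loop over cached positions depends on recomputing previous keys: the per-step work touches each cached vector exactly once.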

3. Why Does the Naive Approach to Combining RoPE and MLA Fail?

Now that we understand the decoupled RoPE solution, let's examine why a more straightforward approach doesn't work.

3.1 The Naive Approach: Applying RoPE After Decompression

A seemingly natural way to combine RoPE with MLA would be to apply the rotational encoding after decompressing the cached latent vectors as shown in Figure 3:

Figure 3: Applying RoPE after decompression in the MLA architecture

1. Compress the hidden states for storage in the KV cache:

c_t^{KV} = W^{DKV}h_t \tag{48}

2. During attention computation, decompress the cached vectors:

k_t = W^{UK}c_t^{KV} = W^{UK}W^{DKV}h_t \tag{49}

3. Apply RoPE to the decompressed vectors based on their position:

k_m(m) = R_{\Theta,m}^d \cdot k_m = R_{\Theta,m}^d \cdot W^{UK}W^{DKV}h_m \tag{50}

This approach seems intuitive but creates a fundamental problem during inference.

3.2 The Re-computation Problem

During inference, for the current query token at position p and a key token at position j < p, the attention score calculation requires:

a_{pj} = q_p(p)^T k_j(p-j) \tag{51}

Notice the crucial detail: kj(p-j) – we need the key vector for token j encoded relative to the current position p (an offset of p-j), not merely rotated by its original absolute position j. But here is the problem: during inference, we have only cached cjKV for previous tokens, not their RoPE-encoded keys. To compute kj(p-j) correctly, we need:

k_j(p-j) = R_{\Theta,j-p}^d \cdot W^{UK}c_j^{KV} \tag{52}

This requires applying a different rotation matrix RΘ,j-pd to each cached key, one that depends on the offset of position j from the current position p.

Let's prove why all keys must be recomputed with each new token. According to the RoPE formulation, the attention score between a query at position p and a key at position j is:

q_p^T k_j = (R_{\Theta,p}^d \mathbf{W}_q\mathbf{x}_p)^T(R_{\Theta,j}^d \mathbf{W}_k\mathbf{x}_j) = \mathbf{x}_p^T \mathbf{W}_q^T R_{\Theta,j-p}^d \mathbf{W}_k\mathbf{x}_j \tag{53}

In MLA with compressed representations, this becomes:

q_p^T k_j = (R_{\Theta,p}^d \cdot W^{UQ}c_p^Q)^T(R_{\Theta,j}^d \cdot W^{UK}c_j^{KV}) \tag{54}

But during inference, to capture the correct relative offset between position j and the current position p, we need to recalculate:

k_j(p-j) = R_{\Theta,j-p}^d \cdot W^{UK}c_j^{KV} \tag{55}

Or equivalently:

k_j(p-j) = (R_{\Theta,p}^d)^T R_{\Theta,j}^d \cdot W^{UK}c_j^{KV} \tag{56}

This means that for each new token position p, we must recompute every previous key with a rotation that depends on p, which effectively eliminates the benefit of KV caching. Instead of simply retrieving cached vectors, we must perform a matrix multiplication for every previous token at each new step, significantly increasing the computational cost.
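The cost is easy to see in code. In the sketch below (random placeholder weights and sizes), only cjKV is cached, so producing the keys needed at query position p means re-applying WUK and a position-dependent rotation to every cached latent; none of this work can be reused at the next decode step.

```python
import numpy as np

# Work per decode step under the naive scheme of sections 3.1/3.2 (sketch).
# Only c_j^KV is cached, so every new position p must re-derive all previous
# keys: decompress with W^UK and rotate, an O(t * d_h * d_c) cost per step.
rng = np.random.default_rng(0)
d_c, d_h = 16, 32
t = 1000                                    # tokens generated so far
cache_cKV = rng.normal(size=(t, d_c))       # the only thing stored per token
W_UK = rng.normal(size=(d_h, d_c))

def rope_matrix(m, d, base=10000.0):
    R = np.zeros((d, d))
    for i in range(d // 2):
        th = base ** (-2.0 * i / d)
        c, s = np.cos(m * th), np.sin(m * th)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

def naive_keys_for_step(p):
    # For query position p, rebuild every previous key with its rotation
    # (R_p^T R_j = R_{j-p}); nothing from the previous step can be reused.
    return np.stack([rope_matrix(j - p, d_h) @ (W_UK @ cache_cKV[j])
                     for j in range(t)])

keys = naive_keys_for_step(p=t)             # repeated at EVERY decode step
print(keys.shape)                           # (1000, 32), rebuilt from scratch
```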

3.3 The Matrix Absorption Impossibility

A natural optimization attempt would be to absorb some of the matrix multiplications. Let's explore this possibility:

q_p^T k_j = (R_{\Theta,p}^d \cdot W^{UQ}c_p^Q)^T(R_{\Theta,j}^d \cdot W^{UK}c_j^{KV}) \tag{57}
= (c_p^Q)^T (W^{UQ})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{UK} c_j^{KV} \tag{58}
= (c_p^Q)^T (W^{UQ})^T R_{\Theta,j-p}^d W^{UK} c_j^{KV} \tag{59}

If the rotation matrix RΘ,j-pd commuted with WUK (meaning RΘ,j-pd ⋅ WUK = WUK ⋅ RΘ,j-pd, which, as we show below, does not hold), we could define:

(W^{UQ})' = (W^{UK})^T W^{UQ} \tag{60}

And compute:

((W^{UQ})' c_p^Q)^T R_{\Theta,j-p}^d c_j^{KV} \tag{61}

This would allow us to apply the rotation directly to the compressed representations, avoiding the need for decompression and recomputation. However, rotation matrices do not generally commute with arbitrary matrices. To prove this, let's consider the product of a 2×2 rotation matrix and a general 2×2 matrix:

R_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \tag{62}
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \tag{63}

Computing Rθ ⋅ A:

\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a\cos\theta-c\sin\theta & b\cos\theta-d\sin\theta \\ a\sin\theta+c\cos\theta & b\sin\theta+d\cos\theta \end{pmatrix} \tag{64}

Computing A ⋅ Rθ:

\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} a\cos\theta+b\sin\theta & -a\sin\theta+b\cos\theta \\ c\cos\theta+d\sin\theta & -c\sin\theta+d\cos\theta \end{pmatrix} \tag{65}

These results differ unless A has a special structure (for example, when A is itself a scaled rotation matrix of the same form). Therefore:

R_{\Theta,j-p}^d \cdot W^{UK} \neq W^{UK} \cdot R_{\Theta,j-p}^d \tag{66}

This non-commutativity prevents the matrix absorption optimization, forcing us to recalculate all keys with their appropriate rotations for each new token position.
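The non-commutativity is straightforward to confirm numerically; the snippet below also shows the special case in which commutation does hold, namely when A is itself of the scaled-rotation form.

```python
import numpy as np

# Numerical check of eqs (62)-(66): a rotation matrix does not commute
# with a generic matrix A.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.allclose(R @ A, A @ R))   # False: R A != A R in general

# By contrast, A commutes with R when A is itself a (scaled) rotation,
# i.e. of the form [[a, -b], [b, a]].
A_special = np.array([[2.0, -0.5],
                      [0.5,  2.0]])
print(np.allclose(R @ A_special, A_special @ R))   # True
```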

4. Conclusion: Why Decoupled RoPE Succeeds Where the Naive Approach Fails

The decoupled RoPE strategy succeeds by separating positional and content information into parallel paths, allowing each to be processed optimally:

  1. The content path can be efficiently compressed without concern for position encoding.
  2. The position path handles rotary encodings separately, maintaining relative position awareness.
  3. Concatenation combines both signals without requiring recalculation of previous keys.

This separation allows MLA to achieve both memory efficiency (through compression) and computational efficiency (by avoiding recomputation), while still preserving the powerful relative position encoding capabilities of RoPE.

In contrast, the naive approach attempts to apply position encoding on top of the compression-decompression pipeline, creating a fundamental mathematical incompatibility that forces costly recomputations during inference.

The decoupled RoPE strategy represents an elegant architectural solution that demonstrates the importance of carefully considering how different components of a model interact, particularly when optimizing for inference efficiency.

5. References

  1. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.

  2. DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., et al. (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434.

  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.