Multi-Head Latent Attention: The RoPE Compatibility Problem - A Detailed Mathematical Analysis

Apr 30, 2025

1. Introduction

1.1 Multi-Head Latent Attention: Advancing Inference Efficiency in Large Language Models

Large Language Models (LLMs) have transformed natural language processing capabilities, yet their deployment presents substantial challenges as model size increases to hundreds of billions of parameters with extended context windows of tens or hundreds of thousands of tokens. During the autoregressive generation process, the Key-Value (KV) cache emerges as a critical bottleneck, presenting organizations with a fundamental trade-off between computational efficiency and memory resource allocation.

Without KV caching, the computational complexity for generating each token scales quadratically with sequence length (O(n²) per token), while memory requirements stay minimal at O(1). This approach becomes prohibitively expensive for long sequences, as each new token requires recomputing attention scores with all previous tokens. The KV caching strategy reduces this to linear computational complexity (O(n) per token), but at the cost of linear memory growth, O(n). For standard Multi-Head Attention (MHA), the total KV cache memory consumption can be expressed as:

\text{Memory}_{\text{MHA}} = B \times L \times N_L \times 2 \times N_H \times D_H \times P \tag{1}

This formula encapsulates the memory requirements across batch size (B), sequence length (L), number of layers (NL), number of attention heads (NH), head dimension (DH), and bytes per element (P). The factor of 2 accounts for storing both keys and values separately. As models grow larger and context windows expand, this memory requirement becomes increasingly untenable, even on high-end hardware accelerators.

Multi-Head Latent Attention (MLA) addresses this challenge through a novel approach that transforms the fundamental equation of memory consumption:

\text{Memory}_{\text{MLA}} = B \times L \times N_L \times (D_C + D_R) \times P \tag{2}

Where DC is the dimension of the compressed KV latent vector and DR is the dimension of the decoupled rotary key component. This reformulation enables substantial memory savings without compromising model capabilities, creating new possibilities for deploying models in resource-constrained environments.
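To make formulas (1) and (2) concrete, here is a small Python sketch that plugs in an illustrative configuration. Every number below (layer count, head count, head dimension, and the compressed dimensions) is a hypothetical placeholder for the sake of the arithmetic, not the configuration of any particular model.

```python
# Illustrative KV-cache sizing from formulas (1) and (2).
# All configuration values below are hypothetical, chosen only to show the arithmetic.

def mha_kv_cache_bytes(B, L, n_layers, n_heads, d_head, bytes_per_elem):
    # Formula (1): B * L * N_L * 2 * N_H * D_H * P
    return B * L * n_layers * 2 * n_heads * d_head * bytes_per_elem

def mla_kv_cache_bytes(B, L, n_layers, d_c, d_r, bytes_per_elem):
    # Formula (2): B * L * N_L * (D_C + D_R) * P
    return B * L * n_layers * (d_c + d_r) * bytes_per_elem

if __name__ == "__main__":
    B, L = 1, 32_768                 # batch size, sequence length
    n_layers, n_heads, d_head = 60, 128, 128
    d_c, d_r = 512, 64               # compressed KV dim and rotary key dim (illustrative)
    P = 2                            # bytes per element (fp16/bf16)

    mha = mha_kv_cache_bytes(B, L, n_layers, n_heads, d_head, P)
    mla = mla_kv_cache_bytes(B, L, n_layers, d_c, d_r, P)
    print(f"MHA cache: {mha / 2**30:.1f} GiB")
    print(f"MLA cache: {mla / 2**30:.1f} GiB")
    print(f"reduction: {mha / mla:.1f}x")
```

With these placeholder numbers the MHA cache is on the order of a hundred gibibytes while the MLA cache is a few gibibytes, which is the kind of gap the reformulation above is designed to close.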

Architectural Differences: MHA vs. MLA

Standard Multi-Head Attention (MHA) and Multi-Head Latent Attention (MLA) share the same high-level goal of enabling tokens to attend to each other, but differ significantly in their internal architecture and memory efficiency characteristics. In standard MHA, each token's hidden representation undergoes three parallel linear projections to create query, key, and value vectors. This process can be represented mathematically as:

q_t = W^Q h_t \tag{3}
k_t = W^K h_t \tag{4}
v_t = W^V h_t \tag{5}

These projections are then split into NH attention heads, each operating in a lower-dimensional space:

q_t^i, k_t^i, v_t^i \in \mathbb{R}^{D_H} \tag{6}

The attention mechanism computes weighted interactions between tokens, where the weights are determined by the compatibility between queries and keys. For each head i, the attention output is computed as:

\text{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{(\text{q}_{t,i})^T\text{k}_{j,i}}{\sqrt{d_h}}\right)\text{v}_{j,i} \tag{7}
\begin{align} \text{where:} \\ \text{j} &: \text{ Position index } (1 \leq j \leq t) \text{ of previous tokens in the sequence} \\ \text{t} &: \text{ Current position in the sequence} \\ \text{q}_{t,i} &: \text{ Query vector at position } t \text{ for head } i \\ \text{k}_{j,i} &: \text{ Key vector at position } j \text{ for head } i \\ \text{v}_{j,i} &: \text{ Value vector at position } j \text{ for head } i \\ \text{d}_h &: \text{ Dimension of each attention head} \\ \text{o}_{t,i} &: \text{ Output of attention at position } t \text{ for head } i \\ \text{i} &\in \{1, 2, \ldots, n_h\}, \text{ where } n_h \text{ is the total number of attention heads} \end{align} \tag{8}

The outputs from all heads are concatenated and projected through an output matrix:

\text{u}_t = W^O[\text{o}_{t,1}; \text{o}_{t,2}; ...; \text{o}_{t,n_h}] \tag{9}
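For readers who prefer code, the following NumPy sketch walks through equations (3)–(9) for a single layer of causal multi-head attention. The dimensions, random weights, and scaling factors are illustrative placeholders only, not a production implementation.

```python
import numpy as np

# Minimal causal multi-head attention for one layer, following eqs (3)-(9).
# Dimensions and random weights are illustrative only.
rng = np.random.default_rng(0)
T, d_model, n_heads = 8, 64, 4          # sequence length, hidden size, heads
d_head = d_model // n_heads

H  = rng.normal(size=(T, d_model))                  # hidden states h_t
Wq = rng.normal(size=(d_model, d_model)) * 0.05     # W^Q
Wk = rng.normal(size=(d_model, d_model)) * 0.05     # W^K
Wv = rng.normal(size=(d_model, d_model)) * 0.05     # W^V
Wo = rng.normal(size=(d_model, d_model)) * 0.05     # W^O

# Eqs (3)-(6): project, then split into heads -> shape (n_heads, T, d_head)
def split_heads(X):
    return X.reshape(T, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(H @ Wq), split_heads(H @ Wk), split_heads(H @ Wv)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Eq (7): causal attention per head
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)      # positions j > t are masked
scores = np.where(mask, -np.inf, scores)
O = softmax(scores) @ V                               # (n_heads, T, d_head)

# Eq (9): concatenate heads and apply the output projection
U = O.transpose(1, 0, 2).reshape(T, d_model) @ Wo
print(U.shape)   # (T, d_model)
```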

During inference, MHA caches the full key and value vectors for each token across all layers and heads, creating substantial memory pressure as sequence length increases. MLA fundamentally reimagines this architecture by introducing an intermediate compression step and decoupling content information from positional information. The architecture consists of two parallel paths:

Figure 1: Architecture of Multi-Head Latent Attention

  1. Content Path (with compression):
c_t^{KV} = W^{DKV}h_t \tag{10}
k_t^C = W^{UK}c_t^{KV} \tag{11}
v_t = W^{UV}c_t^{KV} \tag{12}
q_t^C = W^{UQ}c_t^{Q} \tag{13}
c_t^{Q} = W^{DQ}h_t \tag{14}
  2. Position Path (with RoPE):
k_t^R = R_{\Theta,t}^d \cdot W^{KR}h_t \tag{15}
q_t^R = R_{\Theta,t}^d \cdot W^{QR}c_t^Q \tag{16}

Where RΘ,td represents the rotary position encoding matrix. The final key and query representations are formed by concatenating both components:

k_t = [k_t^C; k_t^R] \tag{17}
q_t = [q_t^C; q_t^R] \tag{18}

During inference, MLA caches both the compressed latent vectors ctKV and the rotary key components ktR as shown in Figure 2. This is reflected in the memory formula, where DC represents the dimension of ctKV and DR represents the dimension of ktR.

Figure 2: MLA caches both the compressed latent vectors ctKV and the rotary key components ktR

The attention calculation in MLA becomes:

\text{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{(\text{q}_{t,i}^C)^T\text{k}_{j,i}^C + (\text{q}_{t,i}^R)^T\text{k}_j^R}{\sqrt{d_h + d_h^R}}\right)\text{v}_{j,i}^C \tag{19}
\begin{align} \text{where:}\\ \text{j} &: \text{ Position index } (1 \leq j \leq t) \text{ of previous tokens in the sequence} \\ \text{t} &: \text{ Current position in the sequence} \\ \text{q}_{t,i}^C &: \text{ Content component of query vector at position } t \text{ for head } i \\ \text{k}_{j,i}^C &: \text{ Content component of key vector at position } j \text{ for head } i \\ \text{q}_{t,i}^R &: \text{ Rotary position component of query vector at position } t \text{ for head } i \\ \text{k}_j^R &: \text{ Rotary position component of key vector at position } j \\ \text{v}_{j,i}^C &: \text{ Value vector at position } j \text{ for head } i \\ \text{d}_h &: \text{ Dimension of the content component} \\ \text{d}_h^R &: \text{ Dimension of the rotary position component} \\ \text{o}_{t,i} &: \text{ Output of attention at position } t \text{ for head } i \\ \text{i} &\in \{1, 2, \ldots, n_h\}, \text{ where } n_h \text{ is the total number of attention heads} \end{align} \tag{20}

And the final multi-head output remains:

\text{u}_t = W^O[\text{o}_{t,1}; \text{o}_{t,2}; ...; \text{o}_{t,n_h}] \tag{21}

This formulation separates content-based attention (first term) from position-aware attention (second term), allowing each to be processed optimally. The content path can be efficiently compressed without worrying about position encoding, while the position path handles rotary encodings separately, maintaining relative position awareness. This decoupling strategy is particularly important because applying rotary position encodings directly to compressed representations would create mathematical incompatibilities during inference, requiring costly recomputation for each new token (as we will derive later in this article). By separating content from position, MLA achieves both memory efficiency and computational efficiency.

1.2 Rotary Position Embeddings (RoPE): Mathematical Foundations

Transformer architectures have demonstrated remarkable efficacy across diverse natural language processing tasks, yet they inherently lack sequential awareness due to their parallel token processing mechanism. To mitigate this limitation, position encoding methodologies have been developed to incorporate sequential information into the representation space. Among these approaches, Rotary Position Embedding (RoPE), introduced by Su et al. (2021), represents a mathematically sophisticated advancement in positional encoding.

RoPE encodes positional information by applying a position-dependent rotation to pairs of dimensions in the embedding space. For a token at position m with embedding vector 𝐱m ∈ ℝd, RoPE transforms query and key vectors as follows:

f_q(\mathbf{x}_m, m) = (\mathbf{W}_q\mathbf{x}_m)e^{im\theta} \tag{22}
f_k(\mathbf{x}_n, n) = (\mathbf{W}_k\mathbf{x}_n)e^{in\theta} \tag{23}

Here, the complex exponential eimθ represents rotation in the complex plane. This operation rotates the query and key vectors by angles proportional to their positions in the sequence. The rotation angle increases with the position index, creating unique position-dependent transformations for each token. For practical implementation in neural networks, these complex number rotations are expressed using real-valued rotation matrices. For embedding vectors with dimension d (where d is even), we can view the embedding space as composed of d/2 two-dimensional subspaces. In each two-dimensional subspace corresponding to dimensions (2i-1, 2i), RoPE applies a 2×2 rotation matrix:

\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \tag{24}

Generalizing to a d-dimensional space (where d is even), RoPE uses a block-diagonal rotation matrix RΘ,md:

R_{\Theta,m}^d = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix} \tag{25}

\text{where } \theta_i = 10000^{-2(i-1)/d} \text{ for } i \in \{1, 2, \ldots, d/2\}
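Because RΘ,md is block-diagonal, RoPE is typically applied pair-by-pair rather than by materializing the full matrix. The sketch below illustrates this for a single vector; the adjacent-pair layout and the helper name rope_rotate are illustrative implementation choices, not a prescribed API.

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Apply R_{Theta,m}^d to a vector x of even dimension d.

    Implements eq (25) pair-by-pair: the (2i-1, 2i) pair of dimensions is
    rotated by the angle m * theta_i with theta_i = base^(-2(i-1)/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE requires an even dimension"
    i = np.arange(d // 2)                       # i - 1 in the formula (0-indexed here)
    theta = base ** (-2.0 * i / d)              # theta_i
    angles = m * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # the (2i-1, 2i) pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: rotate a random 8-dimensional vector as if it sat at position m = 5.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
q_rot = rope_rotate(q, m=5)
print(np.linalg.norm(q), np.linalg.norm(q_rot))  # rotations preserve the norm
```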

Relative position encoding: The principal advantage of Rotary Position Embedding (RoPE) is its intrinsic capacity to encode relative positional information rather than absolute positions. This property becomes mathematically evident when examining the attention mechanism. For a query vector at position m and a key vector at position n, the attention score is formulated as:

q_m^T k_n = (R_{\Theta,m}^d \mathbf{W}_q\mathbf{x}_m)^T(R_{\Theta,n}^d \mathbf{W}_k\mathbf{x}_n) = \mathbf{x}_m^T \mathbf{W}_q^T R_{\Theta,n-m}^d \mathbf{W}_k\mathbf{x}_n \tag{26}

Where RΘ,n-md = (RΘ,md)T RΘ,nd. This means the attention score naturally incorporates the relative position (n-m) rather than the absolute positions. Consequently, this mathematical property enables transformer architectures to develop position-invariant representations of token relationships, thereby enhancing the model's capability to capture linguistic dependencies across diverse contextual environments.

Mathematical Proof of Relative Position Property

This relative position property comes from fundamental properties of rotation matrices:

  1. The transpose of a rotation matrix inverts the rotation: (RΘ,md)T = RΘ,-md
  2. Multiplying rotation matrices compounds their rotations: RΘ,ad RΘ,bd = RΘ,a+bd

Therefore: (RΘ,md)T RΘ,nd = RΘ,-md RΘ,nd = RΘ,-m+nd = RΘ,n-md
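Both properties, and the resulting dependence on the relative offset alone, are easy to confirm numerically. The sketch below builds the block-diagonal matrix of equation (25) explicitly (an illustrative, unoptimized construction) and checks that (RΘ,md)T RΘ,nd = RΘ,n-md and that the score is unchanged when both positions are shifted by the same amount.

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Block-diagonal rotation matrix R_{Theta,m}^d from eq (25)."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = np.cos(m * theta), np.sin(m * theta)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

d, m, n = 8, 7, 12
Rm, Rn, Rrel = rope_matrix(m, d), rope_matrix(n, d), rope_matrix(n - m, d)

# Property check: (R_m)^T R_n == R_{n-m}
print(np.allclose(Rm.T @ Rn, Rrel))            # True

# Consequence: the attention score depends only on the offset n - m.
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
score_1 = (rope_matrix(m, d) @ q) @ (rope_matrix(n, d) @ k)
score_2 = (rope_matrix(m + 100, d) @ q) @ (rope_matrix(n + 100, d) @ k)
print(np.isclose(score_1, score_2))            # True: only n - m matters
```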

2. The Decoupled RoPE Strategy in MLA

2.1 Separating Content and Position Information

The key innovation in MLA is the decoupled Rotary Position Embedding (RoPE) strategy, which elegantly separates content information from positional information:
1. Content Path (no position encoding):

c_t^{KV} = W^{DKV}h_t \tag{27}
k_t^C = W^{UK}c_t^{KV} \tag{28}
q_t^C = W^{UQ}c_t^Q \tag{29}

2. Position Path (Rotated keys and queries):

k_t^R = R_{\Theta,t}^d \cdot W^{KR}h_t \tag{30}
q_t^R = R_{\Theta,t}^d \cdot W^{QR}c_t^Q \tag{31}

3. Final Representations (concatenation):

k_t = [k_t^C; k_t^R] \tag{32}
q_t = [q_t^C; q_t^R] \tag{33}

The MLA approach bifurcates the attention mechanism into dual parallel pathways, enabling distinct processing optimizations for different aspects of token representation. This architectural design represents a significant advancement over traditional attention mechanisms by addressing the fundamental tension between computational efficiency and positional awareness.

In the content path, semantic information is processed without positional encoding, allowing for substantial dimensionality reduction through compression. The down-projection matrix WDKV transforms the high-dimensional hidden state ht into a compact latent representation ctKV with dimension dc, where typically dc ≪ nheads × dhead. This compression captures the essential semantic content while eliminating redundant information, resulting in a more memory-efficient representation that can be cached during inference.

The separate position path maintains positional awareness through Rotary Position Embeddings (RoPE), applied via the rotation matrix RΘ,td. By isolating positional information in a dedicated pathway with dimension dR (typically much smaller than the content dimension), MLA preserves the model's ability to understand token relationships without applying position encodings to the compressed representations. This separation is crucial for avoiding the computational problems that arise when rotation matrices are applied to compressed vectors, which we explore in the sections below.

2.2 Attention Calculation with Decoupled RoPE

The attention score between a query at position p and a key at position j becomes:

a_{pj} = \frac{q_p^T k_j}{\sqrt{d_h + d_h^R}} = \frac{(q_p^C)^T k_j^C + (q_p^R)^T k_j^R}{\sqrt{d_h + d_h^R}} \tag{34}

Expanding each component:
1. Content Path:

\begin{align} (q_p^C)^T k_j^C &= (W^{UQ}c_p^Q)^T(W^{UK}c_j^{KV}) \tag{35} \\ &= (c_p^Q)^T (W^{UQ})^T W^{UK} c_j^{KV} \tag{36} \end{align}

2. Positional Path (with RoPE):

\begin{align} (q_p^R)^T k_j^R &= (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T (R_{\Theta,j}^d \cdot W^{KR}h_j) \tag{37} \\ &= (c_p^Q)^T (W^{QR})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{KR} h_j \tag{38} \\ &= (c_p^Q)^T (W^{QR})^T R_{\Theta,j-p}^d W^{KR} h_j \tag{39} \end{align}

This decomposition highlights how the content similarity component measures semantic relationships independent of position, while the positional component explicitly encodes the relative offset between positions p and j. The additive interaction between these components in the attention calculation allows the model to consider both semantic compatibility and positional context when determining attention weights.

2.3 Optimizations for Efficient Inference

For the content similarity component, MLA employs matrix absorption:

\begin{align} (q_p^C)^T k_j^C &= (W^{UQ}c_p^Q)^T(W^{UK}c_j^{KV}) \tag{40} \\ &= (c_p^Q)^T (W^{UQ})^T W^{UK} c_j^{KV} \tag{41} \end{align}

By rearranging and defining the absorbed matrix (WUQ)' = (WUK)T WUQ, we get:

\begin{align} (q_p^C)^T k_j^C &= ((W^{UK})^T W^{UQ} c_p^Q)^T c_j^{KV} \tag{42} \\ &= ((W^{UQ})' c_p^Q)^T c_j^{KV} \tag{43} \end{align}

This optimization represents a significant computational efficiency gain during inference. By pre-computing the absorbed matrix (WUQ)', we transform what would be two sequential matrix multiplications (the up-projection of the query followed by a dot product with the up-projected key) into a single multiplication followed by a dot product with the compressed key vector.

The absorbed matrix (WUQ)' effectively encapsulates both the query and key up-projection operations in a single transformation. This is particularly valuable during inference, as it reduces the computational overhead for each token generation step. The operation ((WUQ)' cpQ)T cjKV directly computes the content similarity using only the compressed representations, without requiring full decompression of the cached vectors.
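The identity behind this absorption is easy to check numerically. The sketch below uses random matrices with illustrative dimensions; it verifies that pre-computing (WUQ)' = (WUK)T WUQ gives the same score as explicitly up-projecting both vectors.

```python
import numpy as np

# Numerical check of the matrix-absorption identity in eqs (40)-(43).
# Dimensions are illustrative: d_c is the compressed dim, d_h the per-head dim.
rng = np.random.default_rng(0)
d_c, d_h = 16, 8

W_UQ = rng.normal(size=(d_h, d_c))      # up-projection for the query path
W_UK = rng.normal(size=(d_h, d_c))      # up-projection for the key path
c_pQ = rng.normal(size=d_c)             # compressed query latent c_p^Q
c_jKV = rng.normal(size=d_c)            # compressed KV latent c_j^KV (cached)

# Direct computation: up-project both, then take the dot product.
direct = (W_UQ @ c_pQ) @ (W_UK @ c_jKV)

# Absorbed computation: pre-compute (W^UQ)' = (W^UK)^T W^UQ once,
# then work entirely in the compressed space at inference time.
W_absorbed = W_UK.T @ W_UQ              # shape (d_c, d_c), computed offline
absorbed = (W_absorbed @ c_pQ) @ c_jKV

print(np.isclose(direct, absorbed))     # True
```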

2.4 Maintaining Relative Position in the Position Path

For the positional component, we have:

\begin{align} (q_p^R)^T k_j^R &= (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T (R_{\Theta,j}^d \cdot W^{KR}h_j) \tag{44} \\ &= (c_p^Q)^T (W^{QR})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{KR} h_j \tag{45} \\ &= (c_p^Q)^T (W^{QR})^T R_{\Theta,j-p}^d W^{KR} h_j \tag{46} \end{align}

Since (RΘ,pd)T RΘ,jd = RΘ,j-pd, the attention score naturally encodes the relative offset between positions p and j. This mathematical property is central to the effectiveness of the decoupled RoPE approach. The fundamental challenge in position encoding for efficient inference is maintaining awareness of relative positions while avoiding recomputation of key vectors for each new token. The decoupled approach solves this by leveraging a key property of rotation matrices: the product of one rotation matrix's transpose with another rotation matrix encodes the relative angle between them. This means that by caching the position-encoded vectors kjR = RΘ,jd ⋅ WKRhj for each token position j, and computing qpR = RΘ,pd ⋅ WQRcpQ for the current position p, their dot product naturally captures the relative positional relationship without requiring recomputation of previous keys.

2.5 Complete Inference-Time Attention Calculation

During inference, the optimized attention calculation becomes:

a_{pj} = \frac{((W^{UQ})' c_p^Q)^T c_j^{KV} + (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T k_j^R}{\sqrt{d_h + d_h^R}} \tag{47}

Where (WUQ)' is pre-computed, cjKV and kjR are cached for all previous tokens, and we only calculate cpQ and (RΘ,pd ⋅ WQRcpQ) for the current token.
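Putting the pieces together, a single decode step might look like the following single-head NumPy sketch. This is a simplified illustration of equation (47), not a faithful reproduction of any production implementation: the dimensions and weights are random placeholders, the cached kR values are stand-ins for keys that were rotated once when first stored, and the value path is included only to show that it reuses the same cached latents.

```python
import numpy as np

# Sketch of one MLA decode step following eq (47): score each cached position
# against the current token using only compressed latents c^KV and rotary keys k^R.
def rope_rotate(x, m, base=10000.0):
    d = x.shape[-1]
    i = np.arange(d // 2)
    ang = m * base ** (-2.0 * i / d)
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

rng = np.random.default_rng(0)
d_model, d_c, d_h, d_r = 64, 16, 32, 8
p = 10                                     # current position; 10 tokens already cached

# Per-token cache: compressed KV latents and RoPE'd key components only.
cache_cKV = rng.normal(size=(p, d_c))      # c_j^KV for j = 0..p-1
cache_kR  = rng.normal(size=(p, d_r))      # stand-in for k_j^R = R_j W^KR h_j (rotated when stored)

# Weights (would be learned); the absorbed matrix is pre-computed offline.
W_DQ = rng.normal(size=(d_c, d_model)) * 0.05
W_UQ = rng.normal(size=(d_h, d_c)) * 0.1
W_UK = rng.normal(size=(d_h, d_c)) * 0.1
W_UV = rng.normal(size=(d_h, d_c)) * 0.1
W_QR = rng.normal(size=(d_r, d_c)) * 0.1
W_abs = W_UK.T @ W_UQ                      # (W^UQ)' = (W^UK)^T W^UQ

# Current token only: compress, form the absorbed content query and the rotated RoPE query.
h_p  = rng.normal(size=d_model)
c_pQ = W_DQ @ h_p                          # eq (14)
q_abs = W_abs @ c_pQ                       # (W^UQ)' c_p^Q, lives in the compressed space
q_R   = rope_rotate(W_QR @ c_pQ, m=p)      # R_p W^QR c_p^Q

# Eq (47): content term uses cached latents directly, position term uses cached k^R.
scores = (cache_cKV @ q_abs + cache_kR @ q_R) / np.sqrt(d_h + d_r)
w = np.exp(scores - scores.max()); w /= w.sum()

# Values are up-projected from the same cached latents.
values = cache_cKV @ W_UV.T                # v_j = W^UV c_j^KV, shape (p, d_h)
o_p = w @ values
print(o_p.shape)                           # (d_h,)
```

Note that nothing in the loop over cached positions depends on recomputing previous keys: the per-step work touches each cached vector exactly once.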

3. Why Does the Naive Approach to Combining RoPE and MLA Fail?

Now that we understand the decoupled RoPE solution, let's examine why a more straightforward approach doesn't work.

3.1 The Naive Approach: Applying RoPE After Decompression

A seemingly natural way to combine RoPE with MLA would be to apply the rotational encoding after decompressing the cached latent vectors as shown in Figure 3:

Figure 3: Applying RoPE after decompression in the MLA architecture

1. Compress the hidden states for storage in the KV cache:

c_t^{KV} = W^{DKV}h_t \tag{48}

2. During attention computation, decompress the cached vectors:

k_t = W^{UK}c_t^{KV} = W^{UK}W^{DKV}h_t \tag{49}

3. Apply RoPE to the decompressed vectors based on their position:

k_m(m) = R_{\Theta,m}^d \cdot k_m = R_{\Theta,m}^d \cdot W^{UK}W^{DKV}h_m \tag{50}

This approach seems intuitive but creates a fundamental problem during inference.

3.2 The Re-computation Problem

During inference, for the current query token at position p and a key token at position j < p, the attention score calculation requires:

a_{pj} = q_p(p)^T k_j(p-j) \tag{51}

Notice the crucial detail: kj(p-j) – we need the key vector for token j encoded relative to the current position p (an offset of p-j), not merely rotated by its original absolute position j. But here is the problem: during inference, we have only cached cjKV for previous tokens, not their RoPE-encoded keys. To compute kj(p-j) correctly, we need:

k_j(p-j) = R_{\Theta,j-p}^d \cdot W^{UK}c_j^{KV} \tag{52}

This requires applying a different rotation matrix RΘ,j-pd to each cached key, one that depends on the offset of position j from the current position p.

Let's prove why all keys must be recomputed with each new token. According to the RoPE formulation, the attention score between a query at position p and a key at position j is:

q_p^T k_j = (R_{\Theta,p}^d \mathbf{W}_q\mathbf{x}_p)^T(R_{\Theta,j}^d \mathbf{W}_k\mathbf{x}_j) = \mathbf{x}_p^T \mathbf{W}_q^T R_{\Theta,j-p}^d \mathbf{W}_k\mathbf{x}_j \tag{53}

In MLA with compressed representations, this becomes:

q_p^T k_j = (R_{\Theta,p}^d \cdot W^{UQ}c_p^Q)^T(R_{\Theta,j}^d \cdot W^{UK}c_j^{KV}) \tag{54}

But during inference, to capture the correct relative offset between position j and the current position p, we need to recalculate:

k_j(p-j) = R_{\Theta,j-p}^d \cdot W^{UK}c_j^{KV} \tag{55}

Or equivalently:

k_j(p-j) = (R_{\Theta,p}^d)^T R_{\Theta,j}^d \cdot W^{UK}c_j^{KV} \tag{56}

This means that for each new token position p, we must recompute every previous key with a rotation that depends on p, which effectively eliminates the benefit of KV caching. Instead of simply retrieving cached vectors, we must perform a matrix multiplication for every previous token at each new step, significantly increasing the computational cost.
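The cost is easy to see in code. In the sketch below (random placeholder weights and sizes), only cjKV is cached, so producing the keys needed at query position p means re-applying WUK and a position-dependent rotation to every cached latent; none of this work can be reused at the next decode step.

```python
import numpy as np

# Work per decode step under the naive scheme of sections 3.1/3.2 (sketch).
# Only c_j^KV is cached, so every new position p must re-derive all previous
# keys: decompress with W^UK and rotate, an O(t * d_h * d_c) cost per step.
rng = np.random.default_rng(0)
d_c, d_h = 16, 32
t = 1000                                    # tokens generated so far
cache_cKV = rng.normal(size=(t, d_c))       # the only thing stored per token
W_UK = rng.normal(size=(d_h, d_c))

def rope_matrix(m, d, base=10000.0):
    R = np.zeros((d, d))
    for i in range(d // 2):
        th = base ** (-2.0 * i / d)
        c, s = np.cos(m * th), np.sin(m * th)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

def naive_keys_for_step(p):
    # For query position p, rebuild every previous key with its rotation
    # (R_p^T R_j = R_{j-p}); nothing from the previous step can be reused.
    return np.stack([rope_matrix(j - p, d_h) @ (W_UK @ cache_cKV[j])
                     for j in range(t)])

keys = naive_keys_for_step(p=t)             # repeated at EVERY decode step
print(keys.shape)                           # (1000, 32), rebuilt from scratch
```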

3.3 The Matrix Absorption Impossibility

A natural optimization attempt would be to absorb some of the matrix multiplications. Let's explore this possibility:

q_p^T k_j = (R_{\Theta,p}^d \cdot W^{UQ}c_p^Q)^T(R_{\Theta,j}^d \cdot W^{UK}c_j^{KV}) \tag{57}
= (c_p^Q)^T (W^{UQ})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{UK} c_j^{KV} \tag{58}
= (c_p^Q)^T (W^{UQ})^T R_{\Theta,j-p}^d W^{UK} c_j^{KV} \tag{59}

If the rotation matrix RΘ,j-pd commuted with WUK (meaning RΘ,j-pd ⋅ WUK = WUK ⋅ RΘ,j-pd, which, as we show below, does not hold), we could define:

(W^{UQ})' = (W^{UK})^T W^{UQ} \tag{60}

And compute:

((W^{UQ})' c_p^Q)^T R_{\Theta,j-p}^d c_j^{KV} \tag{61}

This would allow us to apply the rotation directly to the compressed representations, avoiding the need for decompression and recomputation. However, rotation matrices do not generally commute with arbitrary matrices. To prove this, let's consider the product of a 2×2 rotation matrix and a general 2×2 matrix:

R_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \tag{62}
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \tag{63}

Computing Rθ ⋅ A:

\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a\cos\theta-c\sin\theta & b\cos\theta-d\sin\theta \\ a\sin\theta+c\cos\theta & b\sin\theta+d\cos\theta \end{pmatrix} \tag{64}

Computing A ⋅ Rθ:

\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} a\cos\theta+b\sin\theta & -a\sin\theta+b\cos\theta \\ c\cos\theta+d\sin\theta & -c\sin\theta+d\cos\theta \end{pmatrix} \tag{65}

These results differ unless A has a special structure (for example, when A is itself a scaled rotation matrix of the same form). Therefore:

R_{\Theta,j-p}^d \cdot W^{UK} \neq W^{UK} \cdot R_{\Theta,j-p}^d \tag{66}

This non-commutativity prevents the matrix absorption optimization, forcing us to recalculate all keys with their appropriate rotations for each new token position.
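The non-commutativity is straightforward to confirm numerically; the snippet below also shows the special case in which commutation does hold, namely when A is itself of the scaled-rotation form.

```python
import numpy as np

# Numerical check of eqs (62)-(66): a rotation matrix does not commute
# with a generic matrix A.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.allclose(R @ A, A @ R))   # False: R A != A R in general

# By contrast, A commutes with R when A is itself a (scaled) rotation,
# i.e. of the form [[a, -b], [b, a]].
A_special = np.array([[2.0, -0.5],
                      [0.5,  2.0]])
print(np.allclose(R @ A_special, A_special @ R))   # True
```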

4. Conclusion: Why Decoupled RoPE Succeeds Where the Naive Approach Fails

The decoupled RoPE strategy succeeds by separating positional and content information into parallel paths, allowing each to be processed optimally:

  1. The content path can be efficiently compressed without concern for position encoding.
  2. The position path handles rotary encodings separately, maintaining relative position awareness.
  3. Concatenation combines both signals without requiring recalculation of previous keys.

This separation allows MLA to achieve both memory efficiency (through compression) and computational efficiency (by avoiding recomputation), while still preserving the powerful relative position encoding capabilities of RoPE.

In contrast, the naive approach attempts to apply position encoding on top of the compression-decompression pipeline, creating a fundamental mathematical incompatibility that forces costly recomputations during inference.

The decoupled RoPE strategy represents an elegant architectural solution that demonstrates the importance of carefully considering how different components of a model interact, particularly when optimizing for inference efficiency.

5. References

  1. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.

  2. DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., et al. (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434.

  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.