<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/Atom">
  <channel>
    <title>varmology</title>
    <description>Technical blog on AI, LLMs, quantization, and machine learning engineering.</description>
    <link>https://aakashvarma.github.io/</link>
    <atom:link href="https://aakashvarma.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 17 Mar 2026 17:13:15 -0400</pubDate>
    <lastBuildDate>Tue, 17 Mar 2026 17:13:15 -0400</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>Let it Flow</title>
        <description>&lt;p&gt;Coming soon.&lt;/p&gt;
</description>
        <pubDate>Tue, 17 Mar 2026 00:00:00 -0400</pubDate>
        <link>https://aakashvarma.github.io/let_it_flow/</link>
        <guid isPermaLink="true">https://aakashvarma.github.io/let_it_flow/</guid>
        
      </item>
    
      <item>
        <title>Layer Normalization as a Projection: The Complete Geometric Interpretation</title>
        <description>&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;Layer Normalization is a crucial technique in modern neural networks, particularly in Large Language Models (LLMs), where it helps stabilize training and accelerate convergence. While typically presented as a statistical normalization procedure, there’s a deeper, more elegant interpretation: layer normalization can be understood as a sequence of geometric projections in vector space.&lt;/p&gt;

&lt;p&gt;This statistical operation, now a standard component in most neural network architectures, serves as a vital stabilizer during training. By normalizing activations across the feature dimension, it helps prevent the internal covariate shift problem that can slow down or destabilize training. However, beyond its practical benefits, layer normalization harbors a beautiful geometric interpretation that provides deeper insights into why it works so effectively.&lt;/p&gt;

&lt;p&gt;This article provides a comprehensive exploration of this geometric perspective, breaking down each step with rigorous mathematical derivations and intuitive explanations. By understanding layer normalization through the lens of projections, we gain insights into why it works so effectively and how it relates to the geometry of feature spaces.&lt;/p&gt;

&lt;h2 id=&quot;2-the-standard-layer-normalization-formulation&quot;&gt;2. The Standard Layer Normalization Formulation&lt;/h2&gt;

&lt;p&gt;Before diving into the geometric interpretation, let’s review the standard formulation of layer normalization.&lt;/p&gt;

&lt;p&gt;Given an input vector \(x = (x_1, x_2, \ldots, x_d)\) of dimension \(d\), layer normalization performs the following transformation:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{x-\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;\(\mathrm{E}[x] = \frac{1}{d}\sum_{i=1}^d x_i\) is the mean of the vector&lt;/li&gt;
  &lt;li&gt;\(\mathrm{Var}[x] = \frac{1}{d}\sum_{i=1}^d (x_i - \mathrm{E}[x])^2\) is the variance of the vector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transformation centers the vector by subtracting the mean, then scales it by dividing by the standard deviation. The result is a vector with zero mean and unit variance across its components, regardless of the input’s original scale or offset. After this normalization, the vector typically undergoes an affine transformation with learnable parameters (a scaling and a bias term) that allows the network to recover the representational power that might be lost during normalization.&lt;/p&gt;
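
&lt;p&gt;To make the formula concrete, here is a minimal NumPy sketch of this normalization. The &lt;code&gt;eps&lt;/code&gt; guard and the optional &lt;code&gt;gamma&lt;/code&gt;/&lt;code&gt;beta&lt;/code&gt; affine parameters are illustrative additions in the spirit of practical implementations, not part of the formula above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # Center by the mean, then scale by the standard deviation.
    # eps guards against division by zero when the variance is tiny.
    mean = x.mean()
    var = x.var()                      # population variance: (1/d) * sum((x - mean)**2)
    y = (x - mean) / np.sqrt(var + eps)
    # Optional learnable affine transform (scale and shift), as used in practice.
    if gamma is not None:
        y = gamma * y
    if beta is not None:
        y = y + beta
    return y

x = np.array([5.0, 8.0, 2.0])
print(layer_norm(x))                   # roughly zero mean and unit variance
&lt;/code&gt;&lt;/pre&gt;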

&lt;p&gt;At first glance, this appears to be a purely statistical operation. However, as we’ll see, it can be elegantly reinterpreted as a sequence of geometric transformations in the vector space.&lt;/p&gt;

&lt;h2 id=&quot;3-understanding-vector-centering&quot;&gt;3. Understanding Vector Centering&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-ones-vector&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-ones-vector&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The all-ones vector \(\vec{1}\) has special significance in many mathematical fields including linear algebra, statistics, and machine learning. In the context of normalization, it represents the direction along which all components change uniformly. Its geometric interpretation connects statistical concepts like mean and variance to vector projections in high-dimensional spaces. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The first step in layer normalization is centering the vector by subtracting the mean from each component. Centering is a fundamental preprocessing step in many statistical and machine learning methods. It shifts the coordinate system so that the “center of mass” of the data lies at the origin. In the context of a single vector, centering removes the common offset across all dimensions, focusing instead on the relative differences between components.&lt;/p&gt;

&lt;p&gt;For a vector \(x\), the centered vector is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
x_{centered} = x - \mathrm{E}[x] \cdot \vec{1} = (x_1 - \mathrm{E}[x], x_2 - \mathrm{E}[x], \ldots, x_d - \mathrm{E}[x])
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Where \(\vec{1} = (1, 1, \ldots, 1)\) is the all-ones vector.&lt;/p&gt;

&lt;p&gt;In neural networks, centering helps stabilize gradients during training by removing large offsets that might cause activations to saturate. It also makes the learning process more consistent across different input scales.&lt;/p&gt;

&lt;p&gt;Centering has several important geometric interpretations. First, it can be viewed as a translation to the origin, shifting the coordinate system so that the mean becomes the new origin. This is a rigid translation of the vector space, preserving all distances and angles between points while moving the center of mass to zero.&lt;/p&gt;

&lt;p&gt;Second, centering removes the “common mode” component of the vector that is the same across all dimensions, leaving only the pattern of variations. This “common mode” represents a uniform shift in all directions and often contains less discriminative information than the relative patterns between features.&lt;/p&gt;

&lt;p&gt;Third, as we’ll explore in detail, centering can be viewed as projecting a vector onto the hyperplane orthogonal to the all-ones vector. This perspective connects statistical centering to the geometric operation of projection, providing new insights into its properties.&lt;/p&gt;

&lt;h2 id=&quot;4-vector-centering-as-a-projection&quot;&gt;4. Vector Centering as a Projection&lt;/h2&gt;

&lt;p&gt;Now we come to the key insight: centering a vector is geometrically equivalent to projecting it onto the hyperplane orthogonal to the all-ones vector. This connection between a statistical operation (centering) and a geometric one (projection) is both elegant and profound.&lt;/p&gt;

&lt;p&gt;To understand this equivalence, we need to explore how projections work and how the all-ones vector defines a special direction in the space. Let’s define the all-ones vector \(\vec{1} = (1, 1, \ldots, 1) \in \mathbb{R}^d\). This vector has several important properties.&lt;/p&gt;

&lt;p&gt;Its length is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\|\vec{1}\| = \sqrt{\sum_{i=1}^d 1^2} = \sqrt{d}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The all-ones vector has a magnitude that grows with the square root of the dimension, reflecting the fact that adding more dimensions increases its length.&lt;/p&gt;

&lt;p&gt;The normalized all-ones vector is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\hat{1} = \frac{\vec{1}}{\|\vec{1}\|} = \frac{(1, 1, \ldots, 1)}{\sqrt{d}} = (\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}, \ldots, \frac{1}{\sqrt{d}})
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This unit vector points in the same direction as \(\vec{1}\) but has length 1, making it useful for projections.&lt;/p&gt;

&lt;p&gt;For any vector \(x\), the inner product with \(\vec{1}\) gives the sum of its components:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\langle x, \vec{1} \rangle = \sum_{i=1}^d x_i
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This property connects the geometric operation of inner product with the statistical operation of summation.&lt;/p&gt;

&lt;p&gt;The inner product with the normalized all-ones vector gives:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\langle x, \hat{1} \rangle = \sum_{i=1}^d x_i \cdot \frac{1}{\sqrt{d}} = \frac{1}{\sqrt{d}} \sum_{i=1}^d x_i = \frac{\sum_{i=1}^d x_i}{\sqrt{d}} = \sqrt{d} \cdot \mathrm{E}[x]
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This remarkable result connects the mean (a statistical concept) with the inner product (a geometric concept).&lt;/p&gt;

&lt;p&gt;The hyperplane orthogonal to \(\vec{1}\) consists of all vectors \(v\) such that \(\langle v, \vec{1} \rangle = 0\), or equivalently, \(\sum_{i=1}^d v_i = 0\). This is a \((d-1)\)-dimensional subspace of \(\mathbb{R}^d\). This hyperplane has a special statistical interpretation: it contains all vectors whose components sum to zero, or equivalently, all vectors with mean zero. It represents the space of centered vectors, those with no “common mode” component.&lt;/p&gt;

&lt;p&gt;In 3D, this is the plane passing through the origin with equation \(x + y + z = 0\). We can visualize this as a plane that cuts through the origin and is tilted equally with respect to all three coordinate axes.&lt;/p&gt;
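
&lt;p&gt;These identities are easy to check numerically. The snippet below is a small illustrative verification that the inner product with the normalized all-ones vector recovers \(\sqrt{d} \cdot \mathrm{E}[x]\), and that a centered vector lies on the zero-sum hyperplane:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

d = 5
rng = np.random.default_rng(0)
x = rng.normal(size=d)

ones = np.ones(d)
ones_hat = ones / np.sqrt(d)

# Inner product with the normalized all-ones vector equals sqrt(d) times the mean.
assert np.isclose(x @ ones_hat, np.sqrt(d) * x.mean())

# A centered vector sums to zero, i.e. it lies on the hyperplane orthogonal to ones.
centered = x - x.mean()
assert np.isclose(centered @ ones, 0.0)
&lt;/code&gt;&lt;/pre&gt;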

&lt;h2 id=&quot;5-the-geometric-equivalence-of-centering-and-hyperplane-projectio&quot;&gt;5. The Geometric Equivalence of Centering and Hyperplane Projection&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-hyperplane&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-hyperplane&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The hyperplane projection concept can be visualized geometrically: In 3D space, the all-ones vector \(\vec{1} = (1,1,1)\) points along the main diagonal from the origin. The hyperplane orthogonal to this vector is the plane \(x + y + z = 0\), which passes through the origin and forms equal angles with all three coordinate axes. When we project a vector onto this hyperplane, we are essentially removing any component that points in the direction of this diagonal. This isolates the variations between components while eliminating the common offset. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The key insight is understanding why projecting a vector onto the hyperplane orthogonal to the all-ones vector is geometrically equivalent to centering it.&lt;/p&gt;

&lt;p&gt;When we center a vector \(x\), we’re subtracting the same value (the mean) from each component:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
x_{centered} = (x_1 - \mathrm{E}[x], x_2 - \mathrm{E}[x], \ldots, x_d - \mathrm{E}[x])
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Geometrically, this means we’re moving the vector in the direction opposite to the all-ones vector \(\vec{1} = (1, 1, \ldots, 1)\) by a distance of \(\mathrm{E}[x]\) along each dimension.&lt;/p&gt;

&lt;p&gt;Now, consider what happens when we project a vector onto a hyperplane. The projection removes the component of the vector that is parallel to the normal vector of the hyperplane. In our case, the hyperplane is orthogonal to \(\vec{1}\), so its normal vector is \(\vec{1}\).&lt;/p&gt;

&lt;p&gt;The component of \(x\) parallel to \(\vec{1}\) is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{comp}_{\vec{1}}(x) = \frac{\langle x, \vec{1} \rangle}{\|\vec{1}\|^2} \cdot \vec{1}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Since \(\langle x, \vec{1} \rangle = \sum_{i=1}^d x_i\) and \(\|\vec{1}\|^2 = d\), we have:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{comp}_{\vec{1}}(x) = \frac{\sum_{i=1}^d x_i}{d} \cdot \vec{1} = \mathrm{E}[x] \cdot \vec{1}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This component represents a vector where all elements are equal to the mean of \(x\). It’s the part of \(x\) that points in the direction of the all-ones vector, corresponding to the “common mode” or uniform shift across all dimensions.&lt;/p&gt;

&lt;p&gt;When we project \(x\) onto the hyperplane orthogonal to \(\vec{1}\), we remove this component:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{proj}_{\text{hyperplane}}(x) = x - \text{comp}_{\vec{1}}(x) = x - \mathrm{E}[x] \cdot \vec{1}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This is exactly the centered vector! The projection operation has produced the same result as centering.&lt;/p&gt;

&lt;p&gt;So, geometrically, centering a vector is equivalent to projecting it onto the hyperplane orthogonal to the all-ones vector because centering removes the mean from each component, effectively removing the “uniform” part of the vector. Projection onto the hyperplane removes the component parallel to the normal vector, which in this case is the all-ones vector. These two operations are mathematically identical, both resulting in \(x - \mathrm{E}[x] \cdot \vec{1}\).&lt;/p&gt;

&lt;p&gt;This equivalence provides a powerful geometric interpretation of the statistical operation of centering, connecting two seemingly different mathematical concepts.&lt;/p&gt;

&lt;h2 id=&quot;6-deriving-the-projection-formula&quot;&gt;6. Deriving the Projection Formula&lt;/h2&gt;

&lt;p&gt;Let’s derive the formula for projecting a vector \(x\) onto the hyperplane orthogonal to \(\vec{1}\) in a step-by-step manner.&lt;/p&gt;

&lt;p&gt;The projection of a vector onto a subspace involves two steps: first, finding the component of the vector along the normal direction to the subspace, and second, subtracting this component from the original vector.&lt;/p&gt;

&lt;p&gt;The vector projection formula is foundational in linear algebra:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{proj}_{\text{subspace}}(v) = v - \frac{\langle v, n \rangle}{\|n\|^2} \cdot n
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This operation has wide applications beyond normalization, including in computer graphics (shadow calculations), signal processing (noise elimination), and quantum mechanics (measurement operations). Understanding projections helps connect abstract mathematical concepts to their geometric interpretations.&lt;/p&gt;

&lt;p&gt;The projection of \(x\) onto the direction of \(\hat{1}\) (the normalized all-ones vector) is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{proj}_{\hat{1}}(x) = \langle x, \hat{1} \rangle \hat{1}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This gives the component of \(x\) that points in the direction of the all-ones vector. Substituting the value of \(\langle x, \hat{1} \rangle\):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{proj}_{\hat{1}}(x) = \sqrt{d} \cdot \mathrm{E}[x] \cdot \frac{\vec{1}}{\sqrt{d}} = \mathrm{E}[x] \cdot \vec{1}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;To get the projection onto the hyperplane orthogonal to \(\vec{1}\), we subtract this component:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
p_1(x) = x - \text{proj}_{\hat{1}}(x) = x - \mathrm{E}[x] \cdot \vec{1}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Component-wise, this gives us:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
p_1(x)_j = x_j - \mathrm{E}[x]
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Which is exactly the centered vector!&lt;/p&gt;

&lt;p&gt;Let’s verify that \(p_1(x)\) is indeed orthogonal to \(\vec{1}\):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\langle p_1(x), \vec{1} \rangle = \sum_{j=1}^d (x_j - \mathrm{E}[x]) = \sum_{j=1}^d x_j - d \cdot \mathrm{E}[x] = \sum_{j=1}^d x_j - \sum_{j=1}^d x_j = 0
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This confirms that the projection is orthogonal to \(\vec{1}\) as required. The centered vector lies exactly on the hyperplane defined by the all-ones vector.&lt;/p&gt;

&lt;p&gt;In our case, \(n = \vec{1}\) and \(v = x\):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
p_1(x) = x - \frac{\langle x, \vec{1} \rangle}{\|\vec{1}\|^2} \cdot \vec{1} = x - \frac{\sum_{i=1}^d x_i}{d} \cdot \vec{1} = x - \mathrm{E}[x] \cdot \vec{1}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This gives us the same result as before, confirming our understanding of centering as a projection.&lt;/p&gt;
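
&lt;p&gt;As a quick sanity check of this derivation, here is a small illustrative NumPy sketch: applying the general projection formula with \(n = \vec{1}\) reproduces the centered vector and is orthogonal to \(\vec{1}\):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

x = np.array([5.0, 8.0, 2.0])
n = np.ones_like(x)                     # normal vector of the hyperplane

# Projection onto the subspace orthogonal to n: subtract the component along n.
p1 = x - (x @ n) / (n @ n) * n

# The result equals the centered vector and is orthogonal to the all-ones vector.
assert np.allclose(p1, x - x.mean())
assert np.isclose(p1 @ n, 0.0)
&lt;/code&gt;&lt;/pre&gt;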

&lt;h2 id=&quot;7-the-subspace-perspective&quot;&gt;7. The Subspace Perspective&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-subspace&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-subspace&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;This decomposition has deep connections to concepts in linear algebra and statistics. In statistics, it relates to the decomposition of total variance into “between-group” and “within-group” components. In signal processing, it corresponds to separating DC offset from AC components. In physics, it resembles decomposing a force into conservative and non-conservative components. The power of this perspective is that it clarifies what information layer normalization preserves (relative patterns) versus what it removes (common offsets). &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The space \(\mathbb{R}^d\) can be decomposed into two orthogonal subspaces: the one-dimensional subspace spanned by \(\vec{1}\), which contains all vectors with equal components (the space of “uniform shifts” or “common modes”), and the \((d-1)\)-dimensional subspace orthogonal to \(\vec{1}\), which contains all vectors whose components sum to zero (the space of “variations around the mean”).&lt;/p&gt;

&lt;p&gt;Any vector \(x\) can be uniquely expressed as the sum of two components, one from each subspace:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
x = (x - p_1(x)) + p_1(x)
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Or equivalently:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
x = \mathrm{E}[x] \cdot \vec{1} + (x - \mathrm{E}[x] \cdot \vec{1})
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Where \(\mathrm{E}[x] \cdot \vec{1}\) is the component in the direction of \(\vec{1}\), representing the uniform shift (the mean), and \(x - \mathrm{E}[x] \cdot \vec{1}\) is the component orthogonal to \(\vec{1}\), representing the pattern of variations around the mean.&lt;/p&gt;

&lt;p&gt;This decomposition provides insight into the structure of the vector: it separates the overall magnitude (represented by the mean) from the pattern of variations between components.&lt;/p&gt;
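
&lt;p&gt;A short illustrative sketch of this orthogonal decomposition (the example vector is arbitrary): the common-mode and variation components are orthogonal, reconstruct the original vector, and, by orthogonality, split its squared norm:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

x = np.array([5.0, 8.0, 2.0])
ones = np.ones_like(x)

common_mode = x.mean() * ones           # component along the all-ones direction
variation = x - common_mode             # component on the zero-sum hyperplane

assert np.isclose(common_mode @ variation, 0.0)      # the parts are orthogonal
assert np.allclose(common_mode + variation, x)       # and they reconstruct x

# Orthogonality gives a Pythagorean split of the squared norm.
assert np.isclose(x @ x, common_mode @ common_mode + variation @ variation)
&lt;/code&gt;&lt;/pre&gt;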

&lt;h2 id=&quot;8-projection-onto-the-unit-sphere-the-second-step&quot;&gt;8. Projection onto the Unit Sphere: The Second Step&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-sphere&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-sphere&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The unit sphere projection introduces a critical non-linearity in the normalization process. Unlike the hyperplane projection (which is linear), projecting onto the unit sphere is a non-linear operation. This non-linearity contributes to the expressiveness of neural networks with layer normalization, allowing them to represent more complex functions. In optimization terms, this projection constrains the solution space to vectors of unit length, improving the conditioning of the optimization problem. Without this step, the scale of activations could vary widely between different layers and neurons, causing optimization instabilities. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;After centering the vector, the next step in layer normalization is normalizing by the standard deviation. This can be interpreted as a second geometric operation: projection onto the unit sphere, followed by a scaling.&lt;/p&gt;

&lt;p&gt;The unit sphere is the set of all points at a fixed distance (radius 1) from the origin. Projecting a vector onto the unit sphere normalizes its length while preserving its direction, making it a natural geometric counterpart to the statistical operation of dividing by the standard deviation.&lt;/p&gt;

&lt;p&gt;The projection of any non-zero vector \(v\) onto the unit sphere, denoted as \(p_S(v)\), normalizes the vector to unit length:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
p_S(v) = \frac{v}{\|v\|}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This operation preserves the direction of the vector but changes its length to 1. It can be interpreted as scaling the vector so that it just touches the unit sphere.&lt;/p&gt;

&lt;p&gt;The projection onto the unit sphere has several important properties that make it useful for normalization. It preserves the direction of the original vector, ensuring that the relative relationships between dimensions are maintained, which is often more important than the absolute values. It normalizes the length to exactly 1, which helps stabilize gradient magnitudes during training, preventing them from exploding or vanishing.&lt;/p&gt;

&lt;p&gt;Unlike projection onto a subspace, projection onto the unit sphere is a non-linear operation. This non-linearity plays a role in the expressiveness of neural networks, allowing them to represent more complex functions. One technical note is that the projection is undefined for the zero vector (since division by zero is undefined). In practice, this is rarely an issue since deep learning frameworks add a small epsilon to the denominator to prevent division by zero.&lt;/p&gt;
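
&lt;p&gt;A minimal sketch of the sphere projection, with an epsilon guard in the spirit of practical implementations (the helper name and epsilon value are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

def project_to_unit_sphere(v, eps=1e-12):
    # Scale v to (approximately) unit length; eps avoids dividing by zero
    # for the degenerate case of a zero vector.
    return v / (np.linalg.norm(v) + eps)

v = np.array([0.0, 3.0, -3.0])
u = project_to_unit_sphere(v)

assert np.isclose(np.linalg.norm(u), 1.0)       # unit length
assert np.allclose(u * np.linalg.norm(v), v)    # direction is preserved
&lt;/code&gt;&lt;/pre&gt;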

&lt;h2 id=&quot;9-connecting-the-norm-of-the-centered-vector-to-variance&quot;&gt;9. Connecting the Norm of the Centered Vector to Variance&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-variance&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-variance&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;This relationship connects two seemingly different mathematical domains: geometry and statistics. The equality \(\|p_1(x)\|^2 = d \cdot \mathrm{Var}[x]\) shows that the geometric concept of distance in the centered subspace directly corresponds to the statistical concept of variance scaled by dimension. Historical Note: This connection has been implicitly used in statistics for decades, particularly in Principal Component Analysis (PCA), but the explicit relationship between variance and projection distance in the context of neural network normalization was only formalized with layer normalization techniques. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;To establish the link between variance normalization and sphere projection, we need to relate the norm of the centered vector to the variance.&lt;/p&gt;

&lt;p&gt;The squared norm of the centered vector \(p_1(x) = x - \mathrm{E}[x] \cdot \vec{1}\) is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\|p_1(x)\|^2 = \sum_{i=1}^d (x_i - \mathrm{E}[x])^2
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This sum represents the total squared deviation from the mean across all dimensions. It’s closely related to the variance:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\sum_{i=1}^d (x_i - \mathrm{E}[x])^2 = d \cdot \frac{1}{d} \sum_{i=1}^d (x_i - \mathrm{E}[x])^2 = d \cdot \mathrm{Var}[x]
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\|p_1(x)\|^2 = d \cdot \mathrm{Var}[x]
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Taking the square root:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\|p_1(x)\| = \sqrt{d \cdot \mathrm{Var}[x]}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This beautiful result connects the geometric measure (norm) with the statistical measure (variance multiplied by dimension). It shows that the length of the centered vector is proportional to the standard deviation, with \(\sqrt{d}\) as the constant of proportionality.&lt;/p&gt;

&lt;p&gt;Rearranging the equation, we can express the variance in terms of the norm:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\mathrm{Var}[x] = \frac{\|p_1(x)\|^2}{d}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This shows that the variance is the average squared distance from the mean, which is the squared norm of the centered vector divided by the dimension.&lt;/p&gt;

&lt;p&gt;Taking the square root:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\sqrt{\mathrm{Var}[x]} = \frac{\|p_1(x)\|}{\sqrt{d}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This result allows us to connect layer normalization’s division by the standard deviation to a geometric scaling operation related to the norm of the centered vector.&lt;/p&gt;
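
&lt;p&gt;Numerically, the norm-variance relationship can be checked in a couple of lines (an illustrative sketch):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

x = np.array([5.0, 8.0, 2.0])
d = x.size
centered = x - x.mean()

# Squared norm of the centered vector equals d times the population variance.
assert np.isclose(centered @ centered, d * x.var())

# Equivalently, the standard deviation is the centered norm divided by sqrt(d).
assert np.isclose(np.sqrt(x.var()), np.linalg.norm(centered) / np.sqrt(d))
&lt;/code&gt;&lt;/pre&gt;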

&lt;h2 id=&quot;10-connecting-layer-normalization-to-sequential-projections&quot;&gt;10. Connecting Layer Normalization to Sequential Projections&lt;/h2&gt;

&lt;p&gt;Now we’ll derive the complete connection between layer normalization and the two projections. This will show how the statistical normalization procedure can be reinterpreted as a sequence of geometric transformations.&lt;/p&gt;

&lt;p&gt;The projection of \(p_1(x)\) onto the unit sphere is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
p_S(p_1(x)) = \frac{p_1(x)}{\|p_1(x)\|} = \frac{x - \mathrm{E}[x]}{\|p_1(x)\|}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This normalization preserves the direction of the centered vector but scales it to have unit length. Substituting the value of \(\|p_1(x)\|\):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
p_S(p_1(x)) = \frac{x - \mathrm{E}[x]}{\sqrt{d \cdot \mathrm{Var}[x]}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This expression shows how the projection onto the unit sphere relates to the standard statistical normalization formula, but with a different scaling factor.&lt;/p&gt;

&lt;p&gt;Let’s rewrite the original layer normalization formula:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Now let’s manipulate this to match our projection-based expression:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} \cdot \frac{\sqrt{d}}{\sqrt{d}} = \sqrt{d} \cdot \frac{x - \mathrm{E}[x]}{\sqrt{d \cdot \mathrm{Var}[x]}} = \sqrt{d} \cdot p_S(p_1(x))
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This gives us our final result:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} = \sqrt{d} \cdot p_S(p_1(x))
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This elegant formula reveals that layer normalization is equivalent to: first projecting onto the hyperplane orthogonal to \(\vec{1}\) (centering), then projecting onto the unit sphere (normalizing), and finally scaling by \(\sqrt{d}\). The scaling factor \(\sqrt{d}\) accounts for the difference between normalizing by the norm of the centered vector and normalizing by the standard deviation.&lt;/p&gt;
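
&lt;p&gt;The equivalence can also be verified directly: the statistical formula and the projection-plus-scaling form agree to numerical precision. The following is an illustrative sketch with a random vector:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=8)
d = x.size

# Statistical form (no affine parameters, no epsilon).
y_stat = (x - x.mean()) / np.sqrt(x.var())

# Geometric form: center (hyperplane projection), project to the unit sphere, scale by sqrt(d).
centered = x - x.mean()
y_geo = np.sqrt(d) * centered / np.linalg.norm(centered)

assert np.allclose(y_stat, y_geo)
&lt;/code&gt;&lt;/pre&gt;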

&lt;h2 id=&quot;11-alternative-derivation&quot;&gt;11. Alternative Derivation&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-alt-derivation&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-alt-derivation&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Alternative derivations strengthen mathematical proofs by approaching the same result from different starting points. This particular approach starts from the statistical formula and derives the geometric interpretation, whereas our primary derivation began with the geometric perspective and showed its equivalence to the statistical formulation. This bidirectional relationship establishes a more robust connection between the two domains. It is similar to how in physics, one can derive the laws of motion from either energy principles or force principles and arrive at equivalent formulations. Such multiple derivations also help identify the core mathematical principles governing a phenomenon, which in turn can inspire new algorithms and approaches to normalization in deep learning. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Let’s approach the derivation from a different angle to further reinforce our understanding. This alternative approach starts with the standard layer normalization formula and progressively transforms it into the projection-based expression.&lt;/p&gt;

&lt;p&gt;Starting with the layer normalization formula:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;We can rewrite this in terms of the centered vector \(p_1(x) = x - \mathrm{E}[x] \cdot \vec{1}\):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{p_1(x)}{\sqrt{\mathrm{Var}[x]}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This formulation already separates the centering step (creation of \(p_1(x)\)) from the normalization step (division by \(\sqrt{\mathrm{Var}[x]}\)).&lt;/p&gt;

&lt;p&gt;Now, let’s expand the variance in terms of the centered vector:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\mathrm{Var}[x] = \frac{1}{d} \sum_{i=1}^d (x_i - \mathrm{E}[x])^2 = \frac{1}{d} \sum_{i=1}^d p_1(x)_i^2 = \frac{\|p_1(x)\|^2}{d}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This expresses the variance as the squared norm of the centered vector divided by the dimension, connecting the statistical measure to the geometric one.&lt;/p&gt;

&lt;p&gt;Substituting this into our equation:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{p_1(x)}{\sqrt{\frac{\|p_1(x)\|^2}{d}}} = \frac{p_1(x)}{\frac{\|p_1(x)\|}{\sqrt{d}}} = \sqrt{d} \cdot \frac{p_1(x)}{\|p_1(x)\|}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Since \(\frac{p_1(x)}{\|p_1(x)\|} = p_S(p_1(x))\) is the projection onto the unit sphere, we have:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \sqrt{d} \cdot p_S(p_1(x))
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This confirms our previous derivation from a different starting point, strengthening our confidence in the result.&lt;/p&gt;

&lt;h2 id=&quot;12-the-complete-geometric-interpretation&quot;&gt;12. The Complete Geometric Interpretation&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-complete&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-complete&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Layer normalization projection sequence relates to mathematical concepts in differential geometry, where operations on manifolds (curved spaces) involve projections onto tangent spaces followed by normalization. The fact that these operations compose to form a useful neural network operation is not coincidental. Similar sequences of operations appear in areas like quantum mechanics (normalization of wave functions), computer graphics (normal mapping and shading), signal processing (whitening transformations), and control systems (state space normalization). This suggests that layer normalization taps into a fundamental geometric principle that has broad applicability across multiple domains where normalization is beneficial. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;We can now interpret layer normalization geometrically as a sequence of operations: First, we project the vector \(x\) onto the hyperplane orthogonal to \(\vec{1}\), giving us \(p_1(x) = x - \mathrm{E}[x] \cdot \vec{1}\). Geometrically, this centers the vector by subtracting the mean from each component. The resulting vector \(p_1(x)\) lies in a subspace where the sum of all components is zero.&lt;/p&gt;

&lt;p&gt;Second, we project the centered vector \(p_1(x)\) onto the unit sphere, giving us \(p_S(p_1(x)) = \frac{p_1(x)}{\|p_1(x)\|}\). This normalizes the vector to have a length of 1. The resulting vector points in the same direction as \(p_1(x)\) but has unit length.&lt;/p&gt;

&lt;p&gt;Finally, we scale the unit vector by \(\sqrt{d}\), giving us \(\sqrt{d} \cdot p_S(p_1(x))\). This scaling factor ensures the final result matches the standard layer normalization formula.&lt;/p&gt;

&lt;p&gt;The complete transformation can be written as:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} = \sqrt{d} \cdot p_S(p_1(x))
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This geometric interpretation provides insights into why layer normalization works so effectively in neural networks. It removes the “common mode” component of the input (which often carries less discriminative information) and standardizes the scale of the remaining variations, helping gradient-based optimization algorithms converge more quickly and stably.&lt;/p&gt;

&lt;p&gt;In 3D, the hyperplane orthogonal to \(\vec{1} = (1, 1, 1)\) is the plane \(x + y + z = 0\). This plane passes through the origin and is oriented symmetrically with respect to all three coordinate axes. The geometric interpretation of layer normalization involves projecting a vector onto this plane, then onto the unit sphere, and finally scaling by \(\sqrt{d}\). This sequence of operations standardizes the vector, making it more amenable to further processing in a neural network.&lt;/p&gt;

&lt;h2 id=&quot;13-working-example-layer-normalization-in-3d-space&quot;&gt;13. Working Example: Layer Normalization in 3D Space&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-example&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-example&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Visualizing in 3D space helps build intuition about the geometric interpretation. The vector (5,8,2) starts in a general position in space. After centering, it moves to the plane x+y+z=0. Then projection onto the unit sphere normalizes its length, before the final scaling gives it a length of √3. This concrete example demonstrates that the mathematical formulations actually produce the expected results, confirming our theoretical understanding. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;To make our discussion concrete, let’s trace a vector’s transformation through layer normalization step by step. We’ll use a 3D example with vector \(x = (5, 8, 2)\) and follow its journey.&lt;/p&gt;

&lt;p&gt;First, we calculate its statistical properties:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Mean: \(\mathrm{E}[x] = \frac{5 + 8 + 2}{3} = 5\)&lt;/li&gt;
  &lt;li&gt;Variance: \(\mathrm{Var}[x] = \frac{1}{3}[(5-5)^2 + (8-5)^2 + (2-5)^2] = \frac{1}{3}(0 + 9 + 9) = 6\)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Centering the vector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We center the vector by subtracting the mean from each component:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
x - \mathrm{E}[x] \cdot \vec{1} = (5, 8, 2) - 5 \cdot (1, 1, 1) = (0, 3, -3)
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This centered vector lies on the hyperplane \(x + y + z = 0\): \(0 + 3 + (-3) = 0\).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Verifying that centering is a projection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the all-ones vector, we have \(\vec{1} = (1, 1, 1)\) with length \(\|\vec{1}\| = \sqrt{3}\).&lt;/p&gt;

&lt;p&gt;The component of \(x\) along \(\vec{1}\) is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\frac{\langle (5, 8, 2), \vec{1} \rangle}{\|\vec{1}\|^2} \cdot \vec{1} = \frac{15}{3} \cdot (1, 1, 1) = 5 \cdot (1, 1, 1) = (5, 5, 5)
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Subtracting gives us the projection onto the hyperplane:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
(5, 8, 2) - (5, 5, 5) = (0, 3, -3)
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This matches our centered vector, confirming that centering equals projection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Projecting onto the unit sphere&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The squared norm of the centered vector is \(\|p_1(x)\|^2 = 0^2 + 3^2 + (-3)^2 = 18\), which equals \(d \cdot \mathrm{Var}[x] = 3 \cdot 6 = 18\).&lt;/p&gt;

&lt;p&gt;We project onto the unit sphere:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
p_S(p_1(x)) = \frac{(0, 3, -3)}{\sqrt{18}} = (0, \frac{3}{\sqrt{18}}, -\frac{3}{\sqrt{18}})
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Final scaling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We scale by \(\sqrt{d} = \sqrt{3}\):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\sqrt{3} \cdot p_S(p_1(x)) = (0, \frac{3\sqrt{3}}{\sqrt{18}}, -\frac{3\sqrt{3}}{\sqrt{18}}) = (0, \frac{3}{\sqrt{6}}, -\frac{3}{\sqrt{6}})
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This matches the direct layer normalization calculation:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} = \frac{(0, 3, -3)}{\sqrt{6}} = (0, \frac{3}{\sqrt{6}}, -\frac{3}{\sqrt{6}})
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The normalized vector has:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Mean of zero: \(\mathrm{E}[y] = \frac{1}{3}(0 + \frac{3}{\sqrt{6}} + (-\frac{3}{\sqrt{6}})) = 0\)&lt;/li&gt;
  &lt;li&gt;Variance of one: \(\mathrm{Var}[y] = \frac{1}{3}(0^2 + (\frac{3}{\sqrt{6}})^2 + (-\frac{3}{\sqrt{6}})^2) = 1\)&lt;/li&gt;
  &lt;li&gt;Norm of \(\sqrt{d}\): \(\|y\| = \sqrt{0^2 + (\frac{3}{\sqrt{6}})^2 + (-\frac{3}{\sqrt{6}})^2} = \sqrt{3}\)&lt;/li&gt;
&lt;/ul&gt;
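
&lt;p&gt;The whole trace can be reproduced in a few lines of NumPy; the following is an illustrative check of the numbers above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

x = np.array([5.0, 8.0, 2.0])
d = x.size

mean = x.mean()                                    # 5.0
var = x.var()                                      # 6.0
centered = x - mean                                # [0., 3., -3.]
assert np.isclose(centered @ centered, d * var)    # 18 = 3 * 6

unit = centered / np.linalg.norm(centered)         # projection onto the unit sphere
y = np.sqrt(d) * unit                              # final scaling by sqrt(3)

assert np.allclose(y, centered / np.sqrt(var))     # matches the direct formula
assert np.isclose(y.mean(), 0.0)                   # zero mean
assert np.isclose(y.var(), 1.0)                    # unit variance
assert np.isclose(np.linalg.norm(y), np.sqrt(d))   # norm sqrt(d)
&lt;/code&gt;&lt;/pre&gt;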

&lt;h2 id=&quot;14-key-properties-of-layer-normalization&quot;&gt;14. Key Properties of Layer Normalization&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-properties&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-properties&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Layer normalization invariance properties have profound implications for deep learning. Scale invariance means a model does not need to learn separate weights for inputs of different magnitudes. Shift invariance means it can focus on relative patterns rather than absolute values. Together, these properties create a more stable optimization landscape and better generalization, especially for models like transformers that must process inputs with widely varying scales and offsets. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The geometric perspective reveals several important properties that explain why layer normalization is so effective in neural networks:&lt;/p&gt;

&lt;p&gt;The normalized vector always has zero mean. This follows directly from the centered vector being on the hyperplane orthogonal to the all-ones vector:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\mathrm{E}[y] = \frac{1}{\sqrt{\mathrm{Var}[x]}} \cdot \frac{1}{d} \left( \sum_{i=1}^d x_i - d \cdot \mathrm{E}[x] \right) = 0
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This property ensures that subsequent layers receive well-centered inputs, preventing activation saturation.&lt;/p&gt;

&lt;p&gt;The normalized vector always has unit variance:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\mathrm{Var}[y] = \frac{1}{d \cdot \mathrm{Var}[x]} \sum_{i=1}^d (x_i - \mathrm{E}[x])^2 = \frac{d \cdot \mathrm{Var}[x]}{d \cdot \mathrm{Var}[x]} = 1
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This stabilizes gradient magnitudes during backpropagation, making the optimization more consistent.&lt;/p&gt;

&lt;p&gt;If we multiply all elements of \(x\) by a positive constant \(c\), the output remains unchanged:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\frac{c \cdot x - \mathrm{E}[c \cdot x]}{\sqrt{\mathrm{Var}[c \cdot x]}} = \frac{c \cdot (x - \mathrm{E}[x])}{c \cdot \sqrt{\mathrm{Var}[x]}} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This makes neural networks more robust to input scaling variations.&lt;/p&gt;

&lt;p&gt;If we add a constant \(b\) to all elements of \(x\), the output remains unchanged:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\frac{x + b - \mathrm{E}[x + b]}{\sqrt{\mathrm{Var}[x + b]}} = \frac{x + b - (\mathrm{E}[x] + b)}{\sqrt{\mathrm{Var}[x]}} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Geometrically, this means adding a vector along the all-ones direction, which the projection removes entirely.&lt;/p&gt;

&lt;p&gt;The normalized vector always has a norm of \(\sqrt{d}\):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\|y\| = \frac{\|x - \mathrm{E}[x]\|}{\sqrt{\mathrm{Var}[x]}} = \frac{\sqrt{d \cdot \mathrm{Var}[x]}}{\sqrt{\mathrm{Var}[x]}} = \sqrt{d}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This consistent magnitude helps prevent exploding or vanishing gradients in deep networks.&lt;/p&gt;
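
&lt;p&gt;The invariance properties are also easy to confirm empirically. In this illustrative sketch the constants 3.0 and 7.0 are arbitrary:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

def layer_norm(x):
    return (x - x.mean()) / np.sqrt(x.var())

rng = np.random.default_rng(2)
x = rng.normal(size=16)
y = layer_norm(x)

# Invariance to positive rescaling and to uniform shifts of the input.
assert np.allclose(layer_norm(3.0 * x), y)
assert np.allclose(layer_norm(x + 7.0), y)

# The output norm is always sqrt(d).
assert np.isclose(np.linalg.norm(y), np.sqrt(x.size))
&lt;/code&gt;&lt;/pre&gt;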

&lt;h2 id=&quot;15-why-this-geometric-interpretation-matters&quot;&gt;15. Why This Geometric Interpretation Matters&lt;/h2&gt;

&lt;p&gt;The geometric perspective on layer normalization provides several significant insights beyond the standard statistical view:&lt;/p&gt;

&lt;p&gt;The projection onto the hyperplane orthogonal to the all-ones vector reduces the effective dimensionality from \(d\) to \(d-1\). This removes a redundant degree of freedom (the common offset), allowing the network to focus its capacity on modeling informative patterns of variation between features.&lt;/p&gt;

&lt;p&gt;By projecting onto the hyperplane, layer normalization isolates and removes the “common mode” component—the uniform signal across all dimensions. In many contexts, this global offset carries less discriminative information than the relative variations between features.&lt;/p&gt;

&lt;p&gt;While normalizing the length, layer normalization preserves the direction of the centered vector. This maintains the relative relationships between features, which often encode the essential information extracted from the input.&lt;/p&gt;

&lt;p&gt;Standardizing the scale and removing shifts creates a more symmetrical optimization landscape. This makes gradient descent more effective by preventing pathological curvature and allowing more balanced optimization steps across different dimensions.&lt;/p&gt;

&lt;p&gt;The geometric view clarifies why layer normalization is invariant to both shifts and rescalings of the input—properties that make networks more robust to variations in input distributions and reduce the need for careful data preprocessing.&lt;/p&gt;

&lt;p&gt;This interpretation bridges statistical operations (centering, standardizing) and geometric transformations (projections), providing a unifying framework that enhances our understanding of how neural networks process information through sequential layers.&lt;/p&gt;

&lt;h2 id=&quot;16-applications-and-comparison-with-normalization-techniques&quot;&gt;16. Applications and Comparison with Normalization Techniques&lt;/h2&gt;
&lt;p&gt;&lt;label for=&quot;sn-comparison&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-comparison&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Comparison of Normalization Techniques Through the Geometric Lens: Batch Normalization normalizes across the batch dimension, effectively projecting onto hyperplanes defined by batch statistics for each feature. This makes it dependent on batch size and requiring running statistics during inference. Instance Normalization, used in image processing, applies normalization to each channel separately, performing projections in channel-specific subspaces. This is particularly effective for style transfer tasks. Group Normalization is a middle ground between layer and instance normalization, dividing channels into groups and normalizing within each group. Geometrically, this corresponds to projecting onto group-specific hyperplanes. Weight Normalization, instead of normalizing activations, normalizes weight vectors by projecting them onto unit spheres. This aims to improve the conditioning of the optimization problem from the parameter side rather than the activation side. Each technique corresponds to a different choice of projection subspace, with layer normalization offering the advantage of being independent of batch statistics while still normalizing across the full feature dimension. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Layer normalization has become a critical component in modern deep neural networks for several key reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Gradient Stability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By normalizing activations, layer normalization helps prevent exploding or vanishing gradients, a critical issue in deep networks. The consistent scale at each layer ensures gradients remain within a reasonable range as they propagate backward through the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Faster Convergence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The standardized scale and zero mean of normalized activations create a more favorable optimization landscape. This allows optimizers to take larger, more effective steps, reducing the number of iterations needed to reach good solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Reduction of Internal Covariate Shift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normalization stabilizes the distributions of network activations, preventing the phenomenon where each layer must continuously adapt to shifting input statistics. This allows each layer to learn more efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Independence from Batch Size&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike batch normalization, layer normalization operates independently for each sample, making it ideal for:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Variable batch sizes&lt;/li&gt;
  &lt;li&gt;Recurrent neural networks&lt;/li&gt;
  &lt;li&gt;Transformer architectures&lt;/li&gt;
  &lt;li&gt;Online learning scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This independence from batch statistics provides consistent behavior during both training and inference, eliminating the need for running statistics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Facilitation of Deep Architectures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layer normalization has been crucial for enabling the training of very deep networks, particularly transformers with dozens or hundreds of layers. By stabilizing the signal through these deep stacks, it prevents the compounding effects of statistical irregularities.&lt;/p&gt;

&lt;p&gt;The geometric interpretation helps us understand the relationship between different normalization techniques as different projection operations applied to different subspaces of the data.&lt;/p&gt;
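
&lt;p&gt;One way to see the different projection subspaces is to compare which axis the statistics are computed over. The sketch below contrasts layer normalization (per sample, across features) with a training-time view of batch normalization (per feature, across the batch); running statistics and affine parameters are omitted for brevity:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))             # a batch of 4 samples with 8 features

# Layer norm: statistics over the feature axis, independently for each sample.
ln = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Batch norm (training-time view): statistics over the batch axis, per feature.
bn = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)

assert np.allclose(ln.mean(axis=1), 0.0)    # each sample (row) is centered
assert np.allclose(bn.mean(axis=0), 0.0)    # each feature (column) is centered
&lt;/code&gt;&lt;/pre&gt;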

&lt;h2 id=&quot;17-conclusion&quot;&gt;17. Conclusion&lt;/h2&gt;

&lt;p&gt;Layer normalization, while typically presented as a statistical operation, reveals its deeper nature when viewed through the lens of geometric transformations in vector space. This interpretation unfolds as a sequence of elegant projections:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Hyperplane Projection&lt;/strong&gt; (Centering): We project the input vector onto the hyperplane orthogonal to the all-ones vector, removing the “common mode” component and centering the representation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Unit Sphere Projection&lt;/strong&gt; (Normalizing): We project the centered vector onto the unit sphere, preserving its direction while standardizing its length.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Scaling&lt;/strong&gt;: We scale by \(\sqrt{d}\) to match the conventional formulation, ensuring unit variance.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These operations are captured in the formula:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} = \sqrt{d} \cdot p_S(p_1(x))
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This geometric perspective provides several key insights:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It connects statistical operations (centering, standardizing) with geometric transformations (projections)&lt;/li&gt;
  &lt;li&gt;It explains why layer normalization helps gradient-based optimization&lt;/li&gt;
  &lt;li&gt;It reveals why the technique is invariant to shifts and rescalings&lt;/li&gt;
  &lt;li&gt;It provides a unifying framework for understanding various normalization approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formula \(y = \sqrt{d} \cdot p_S(p_1(x))\) encapsulates this understanding, showing that layer normalization fundamentally projects data onto a standardized subspace where the “common mode” has been removed and the scale has been normalized.&lt;/p&gt;

&lt;p&gt;By viewing layer normalization as a geometric transformation rather than just a statistical operation, we gain a more intuitive understanding of its effects and can better appreciate its role in the remarkable success of modern neural networks, particularly transformers and other deep architectures.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;div class=&quot;work-references&quot;&gt;
&lt;p&gt;[1] Ba, J. L., Kiros, J. R., &amp;amp; Hinton, G. E. (2016). &quot;Layer normalization.&quot; &lt;em&gt;arXiv preprint arXiv:1607.06450&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[2] Ioffe, S., &amp;amp; Szegedy, C. (2015). &quot;Batch normalization: Accelerating deep network training by reducing internal covariate shift.&quot; In &lt;em&gt;International Conference on Machine Learning&lt;/em&gt; (pp. 448-456).&lt;/p&gt;
&lt;p&gt;[3] Ulyanov, D., Vedaldi, A., &amp;amp; Lempitsky, V. (2016). &quot;Instance normalization: The missing ingredient for fast stylization.&quot; &lt;em&gt;arXiv preprint arXiv:1607.08022&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[4] Wu, Y., &amp;amp; He, K. (2018). &quot;Group normalization.&quot; In &lt;em&gt;Proceedings of the European Conference on Computer Vision (ECCV)&lt;/em&gt; (pp. 3-19).&lt;/p&gt;
&lt;p&gt;[5] Salimans, T., &amp;amp; Kingma, D. P. (2016). &quot;Weight normalization: A simple reparameterization to accelerate training of deep neural networks.&quot; &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, 29, 901-909.&lt;/p&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 17 May 2025 00:00:00 -0400</pubDate>
        <link>https://aakashvarma.github.io/layernorm/</link>
        <guid isPermaLink="true">https://aakashvarma.github.io/layernorm/</guid>
        
      </item>
    
      <item>
        <title>The RoPE Compatibility Problem in DeepSeek&apos;s Multi Head Latent Attention</title>
        <description>&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;h3 id=&quot;11-multi-head-latent-attention-advancing-inference-efficiency-in-large-language-models&quot;&gt;1.1 Multi-Head Latent Attention: Advancing Inference Efficiency in Large Language Models&lt;/h3&gt;

&lt;p&gt;Large Language Models (LLMs) have transformed natural language processing capabilities, yet their deployment presents substantial challenges as model size increases to hundreds of billions of parameters with extended context windows of tens or hundreds of thousands of tokens. During the autoregressive generation process, the Key-Value (KV) cache emerges as a critical bottleneck, presenting organizations with a fundamental trade-off between computational efficiency and memory resource allocation.&lt;/p&gt;

&lt;p&gt;Without KV caching, the computational complexity for generating each token scales quadratically with sequence length (&lt;em&gt;O(n&lt;sup&gt;2&lt;/sup&gt;)&lt;/em&gt; per token), while maintaining minimal memory requirements &lt;em&gt;O(1)&lt;/em&gt;. This approach becomes prohibitively expensive for long sequences, as each new token would require recomputing attention scores with all previous tokens. The KV caching strategy reduces this to linear computational complexity (&lt;em&gt;O(n)&lt;/em&gt; per token), but at the cost of linear memory growth &lt;em&gt;O(n)&lt;/em&gt;. For standard Multi-Head Attention (MHA), the total KV cache memory consumption can be expressed as:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{Memory}_{\text{MHA}} = B \times L \times N_L \times 2 \times N_H \times D_H \times P \tag{1}
&lt;/script&gt;&lt;/div&gt;
&lt;p&gt;&lt;label for=&quot;sn-memory-example&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-memory-example&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Concrete Memory Savings Example&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Let’s calculate the memory requirements for a typical large language model with:&lt;br /&gt;- Batch size (B) = 1&lt;br /&gt;- Sequence length (L) = 32,768 tokens&lt;br /&gt;- Number of layers (NL) = 32&lt;br /&gt;- Number of heads (NH) = 32&lt;br /&gt;- Head dimension (DH) = 128&lt;br /&gt;- Precision (P) = 2 bytes (FP16)&lt;br /&gt;- MLA content dimension (DC) = 64&lt;br /&gt;- MLA rotary dimension (DR) = 8&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Standard MHA memory (Equation 1):&lt;/strong&gt;&lt;br /&gt;Memory for standard multi-head attention is given by: Memory_MHA = B × L × NL × 2 × NH × DH × P.&lt;br /&gt;&lt;br /&gt;With our parameters this becomes:&lt;br /&gt;Memory_MHA = 1 × 32,768 × 32 × 2 × 32 × 128 × 2 bytes,&lt;br /&gt;which simplifies step by step to 17,179,869,184 bytes, or approximately 16 GB.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;MLA memory with compression (Equation 2):&lt;/strong&gt;&lt;br /&gt;For MLA, memory is given by: Memory_MLA = B × L × NL × (DC + DR) × P.&lt;br /&gt;&lt;br /&gt;With our parameters this becomes:&lt;br /&gt;Memory_MLA = 1 × 32,768 × 32 × (64 + 8) × 2 bytes,&lt;br /&gt;which simplifies step by step to 150,994,944 bytes, or approximately 144 MB.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Memory reduction ratio:&lt;/strong&gt;&lt;br /&gt;The reduction ratio is Memory_MHA divided by Memory_MLA, i.e., 17,179,869,184 bytes divided by 150,994,944 bytes, which is about 113.78, corresponding to roughly a 99.1% reduction in KV cache memory. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;This dramatic reduction enables models to handle much longer contexts with the same hardware, or allows deployment on more resource-constrained devices.&lt;/p&gt;

&lt;p&gt;This formula encapsulates the memory requirements across batch size (&lt;em&gt;B&lt;/em&gt;), sequence length (&lt;em&gt;L&lt;/em&gt;), number of layers (&lt;em&gt;N&lt;sub&gt;L&lt;/sub&gt;&lt;/em&gt;), heads (&lt;em&gt;N&lt;sub&gt;H&lt;/sub&gt;&lt;/em&gt;), head dimension (&lt;em&gt;D&lt;sub&gt;H&lt;/sub&gt;&lt;/em&gt;), and precision (&lt;em&gt;P&lt;/em&gt;). The factor of 2 accounts for storing both keys and values separately. As models grow larger and context windows expand, this memory requirement becomes increasingly untenable, even on high-end hardware accelerators.&lt;/p&gt;

&lt;p&gt;Multi-Head Latent Attention (MLA) addresses this challenge through a novel approach that transforms the fundamental equation of memory consumption:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{Memory}_{\text{MLA}} = B \times L \times N_L \times (D_C + D_R) \times P \tag{2}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;em&gt;D&lt;sub&gt;C&lt;/sub&gt;&lt;/em&gt; is the compression KV dimension and &lt;em&gt;D&lt;sub&gt;R&lt;/sub&gt;&lt;/em&gt; is the dimension of the key rotary position component. This reformulation enables substantial memory savings without compromising model capabilities, creating new possibilities for deploying models in resource-constrained environments.&lt;/p&gt;
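
&lt;p&gt;To make these formulas concrete, here is a minimal Python sketch that evaluates Equations 1 and 2 for an arbitrary configuration. The parameter values mirror the sidenote example and are illustrative assumptions, not measurements from any particular deployment.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
# Minimal sketch of the KV-cache memory formulas (Equations 1 and 2).
# The configuration below is illustrative, not taken from a specific model.

def mha_kv_cache_bytes(B, L, NL, NH, DH, P):
    # Memory_MHA = B * L * NL * 2 * NH * DH * P  (keys and values stored separately)
    return B * L * NL * 2 * NH * DH * P

def mla_kv_cache_bytes(B, L, NL, DC, DR, P):
    # Memory_MLA = B * L * NL * (DC + DR) * P  (compressed latent plus rotary key part)
    return B * L * NL * (DC + DR) * P

mha = mha_kv_cache_bytes(B=1, L=32768, NL=32, NH=32, DH=128, P=2)
mla = mla_kv_cache_bytes(B=1, L=32768, NL=32, DC=64, DR=8, P=2)

print(mha / 2**30)   # about 16 GiB
print(mla / 2**20)   # about 144 MiB
print(mha / mla)     # reduction ratio, about 113.8x
&lt;/code&gt;&lt;/pre&gt;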

&lt;h4 id=&quot;architectural-differences-mha-vs-mla&quot;&gt;Architectural Differences: MHA vs. MLA&lt;/h4&gt;

&lt;p&gt;Standard Multi-Head Attention (MHA) and Multi-Head Latent Attention (MLA) share the same high-level goal of enabling tokens to attend to each other, but differ significantly in their internal architecture and memory efficiency characteristics. In standard MHA, each token’s hidden representation undergoes three parallel linear projections to create query, key, and value vectors. This process can be represented mathematically as:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t = W^Q h_t \tag{3}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t = W^K h_t \tag{4}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
v_t = W^V h_t \tag{5}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;These projections are then split into &lt;em&gt;N&lt;sub&gt;H&lt;/sub&gt;&lt;/em&gt; attention heads, each operating in a lower-dimensional space:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t^i, k_t^i, v_t^i \in \mathbb{R}^{D_H} \tag{6}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The attention mechanism computes weighted interactions between tokens, where the weights are determined by the compatibility between queries and keys. For each head &lt;em&gt;i&lt;/em&gt;, the attention output is computed as:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{(\text{q}_{t,i})^T\text{k}_{j,i}}{\sqrt{d_h}}\right)\text{v}_{j,i} \tag{7}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{align} \text{where:} \\ \text{j} &amp;: \text{ Position index } (1 \leq j \leq t) \text{ of previous tokens in the sequence} \\ \text{t} &amp;: \text{ Current position in the sequence} \\ \text{q}_{t,i} &amp;: \text{ Query vector at position } t \text{ for head } i \\ \text{k}_{j,i} &amp;: \text{ Key vector at position } j \text{ for head } i \\ \text{v}_{j,i} &amp;: \text{ Value vector at position } j \text{ for head } i \\ \text{d}_h &amp;: \text{ Dimension of each attention head} \\ \text{o}_{t,i} &amp;: \text{ Output of attention at position } t \text{ for head } i \\ \text{i} &amp;\in \{1, 2, \ldots, n_h\}, \text{ where } n_h \text{ is the total number of attention heads} \end{align} \tag{8}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The outputs from all heads are concatenated and projected through an output matrix:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{u}_t = W^O[\text{o}_{t,1}; \text{o}_{t,2}; ...; \text{o}_{t,n_h}] \tag{9}
&lt;/script&gt;&lt;/div&gt;
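
&lt;p&gt;As a reference point before moving to MLA, the following NumPy sketch spells out Equations 7-9 for a single batch element. The loops and sizes are chosen for readability rather than efficiency, and all dimensions are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

# Naive single-example multi-head attention (Equations 7-9), written for clarity.
n_h, d_h, seqlen = 4, 8, 16          # illustrative sizes
h = n_h * d_h
rng = np.random.default_rng(0)

q = rng.normal(size=(n_h, seqlen, d_h))   # per-head queries
k = rng.normal(size=(n_h, seqlen, d_h))   # per-head keys
v = rng.normal(size=(n_h, seqlen, d_h))   # per-head values
W_O = rng.normal(size=(h, h))             # output projection

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

outputs = np.zeros((n_h, seqlen, d_h))
for i in range(n_h):
    for t in range(seqlen):
        # causal attention: position t only attends to positions j = 0..t (Equation 7)
        scores = q[i, t] @ k[i, : t + 1].T / np.sqrt(d_h)
        weights = softmax(scores)
        outputs[i, t] = weights @ v[i, : t + 1]

# concatenate the heads per position and apply the output projection (Equation 9)
u = outputs.transpose(1, 0, 2).reshape(seqlen, h) @ W_O.T
&lt;/code&gt;&lt;/pre&gt;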

&lt;p&gt;During inference, MHA caches the full key and value vectors for each token across all layers and heads, creating substantial memory pressure as sequence length increases. MLA fundamentally reimagines this architecture by introducing an intermediate compression step and decoupling content information from positional information. The architecture consists of two parallel paths:&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/mla/mla_arch.png&quot; alt=&quot;MLA Architecture&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 1: Architecture of Multi Head Latent Attention
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Content Path&lt;/strong&gt; (with compression):&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
c_t^{KV} = W^{DKV}h_t \tag{10}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t^C = W^{UK}c_t^{KV} \tag{11}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
v_t = W^{UV}c_t^{KV} \tag{12}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t^C = W^{UQ}c_t^{Q} \tag{13}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
c_t^{Q} = W^{DQ}h_t \tag{14}
&lt;/script&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Position Path&lt;/strong&gt; (with RoPE):&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t^R = R_{\Theta,t}^d \cdot W^{KR}h_t \tag{15}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t^R = R_{\Theta,t}^d \cdot W^{QR}c_t^Q \tag{16}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;em&gt;R&lt;sub&gt;Θ,t&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt; represents the rotary position encoding matrix. The final key and query representations are formed by concatenating both components:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t = [k_t^C; k_t^R] \tag{17}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t = [q_t^C; q_t^R] \tag{18}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;During inference, MLA caches both the compressed latent vectors &lt;em&gt;c&lt;sub&gt;t&lt;/sub&gt;&lt;sup&gt;KV&lt;/sup&gt;&lt;/em&gt; and the rotary key components &lt;em&gt;k&lt;sub&gt;t&lt;/sub&gt;&lt;sup&gt;R&lt;/sup&gt;&lt;/em&gt; as shown in Figure 2. This is reflected in the memory formula, where &lt;em&gt;D&lt;sub&gt;C&lt;/sub&gt;&lt;/em&gt; represents the dimension of &lt;em&gt;c&lt;sub&gt;t&lt;/sub&gt;&lt;sup&gt;KV&lt;/sup&gt;&lt;/em&gt; and &lt;em&gt;D&lt;sub&gt;R&lt;/sub&gt;&lt;/em&gt; represents the dimension of &lt;em&gt;k&lt;sub&gt;t&lt;/sub&gt;&lt;sup&gt;R&lt;/sup&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/mla/cache.png&quot; alt=&quot;MLA Tensor Cache&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 2: MLA caches both the compressed latent vectors &lt;em&gt;c&lt;sub&gt;t&lt;/sub&gt;&lt;sup&gt;KV&lt;/sup&gt;&lt;/em&gt; and the rotary key components &lt;em&gt;k&lt;sub&gt;t&lt;/sub&gt;&lt;sup&gt;R&lt;/sup&gt;&lt;/em&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The attention calculation in MLA becomes:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{(\text{q}_{t,i}^C)^T\text{k}_{j,i}^C + (\text{q}_{t,i}^R)^T\text{k}_j^R}{\sqrt{d_h + d_h^R}}\right)\text{v}_{j,i}^C \tag{19}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{align} \text{where:}\\ \text{j} &amp;: \text{ Position index } (1 \leq j \leq t) \text{ of previous tokens in the sequence} \\ \text{t} &amp;: \text{ Current position in the sequence} \\ \text{q}_{t,i}^C &amp;: \text{ Content component of query vector at position } t \text{ for head } i \\ \text{k}_{j,i}^C &amp;: \text{ Content component of key vector at position } j \text{ for head } i \\ \text{q}_{t,i}^R &amp;: \text{ Rotary position component of query vector at position } t \text{ for head } i \\ \text{k}_j^R &amp;: \text{ Rotary position component of key vector at position } j \\ \text{v}_{j,i}^C &amp;: \text{ Value vector at position } j \text{ for head } i \\ \text{d}_h &amp;: \text{ Dimension of the content component} \\ \text{d}_h^R &amp;: \text{ Dimension of the rotary position component} \\ \text{o}_{t,i} &amp;: \text{ Output of attention at position } t \text{ for head } i \\ \text{i} &amp;\in \{1, 2, \ldots, n_h\}, \text{ where } n_h \text{ is the total number of attention heads} \end{align} \tag{20}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;And the final multi-head output remains:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{u}_t = W^O[\text{o}_{t,1}; \text{o}_{t,2}; ...; \text{o}_{t,n_h}] \tag{21}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This formulation separates content-based attention (first term) from position-aware attention (second term), allowing each to be processed optimally. The content path can be efficiently compressed without worrying about position encoding, while the position path handles rotary encodings separately, maintaining relative position awareness. This decoupling strategy is particularly important because applying rotary position encodings directly to compressed representations would create mathematical incompatibilities during inference, requiring costly recomputations for each new token (which we will see in the rest of the article with derivations). By separating content from position, MLA achieves both memory efficiency and computational efficiency.&lt;/p&gt;

&lt;h3 id=&quot;12-rotary-position-embeddings-rope-mathematical-foundations&quot;&gt;1.2 Rotary Position Embeddings (RoPE): Mathematical Foundations&lt;/h3&gt;

&lt;p&gt;Transformer architectures have demonstrated remarkable efficacy across diverse natural language processing tasks, yet they inherently lack sequential awareness due to their parallel token processing mechanism. To mitigate this limitation, position encoding methodologies have been developed to incorporate sequential information into the representation space. Among these approaches, Rotary Position Embedding (RoPE), introduced by Su et al. (2021), represents a mathematically sophisticated advancement in positional encoding.&lt;/p&gt;

&lt;p&gt;RoPE encodes positional information by applying a position-dependent rotation to pairs of dimensions in the embedding space. For a token at position &lt;em&gt;m&lt;/em&gt; with embedding vector &lt;em&gt;𝐱&lt;sub&gt;m&lt;/sub&gt; ∈ ℝ&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt;, RoPE transforms query and key vectors as follows:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
f_q(\mathbf{x}_m, m) = (\mathbf{W}_q\mathbf{x}_m)e^{im\theta} \tag{22}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
f_k(\mathbf{x}_n, n) = (\mathbf{W}_k\mathbf{x}_n)e^{in\theta} \tag{23}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Here, the complex exponential &lt;em&gt;e&lt;sup&gt;imθ&lt;/sup&gt;&lt;/em&gt; represents rotation in the complex plane. This operation rotates the query and key vectors by angles proportional to their positions in the sequence. The rotation angle increases with the position index, creating unique position-dependent transformations for each token. For practical implementation in neural networks, these complex number rotations are expressed using real-valued rotation matrices. For embedding vectors with dimension &lt;em&gt;d&lt;/em&gt; (where &lt;em&gt;d&lt;/em&gt; is even), we can view the embedding space as composed of &lt;em&gt;d/2&lt;/em&gt; two-dimensional subspaces. In each two-dimensional subspace corresponding to dimensions &lt;em&gt;(2i-1, 2i)&lt;/em&gt;, RoPE applies a 2×2 rotation matrix:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{pmatrix} \cos m\theta_i &amp; -\sin m\theta_i \\ \sin m\theta_i &amp; \cos m\theta_i \end{pmatrix} \tag{24}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Generalizing to a &lt;em&gt;d&lt;/em&gt;-dimensional space (where &lt;em&gt;d&lt;/em&gt; is even), RoPE uses a block-diagonal rotation matrix &lt;em&gt;R&lt;sub&gt;Θ,m&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
R_{\Theta,m}^d = \begin{pmatrix} \cos m\theta_1 &amp; -\sin m\theta_1 &amp; 0 &amp; 0 &amp; \cdots &amp; 0 &amp; 0 \\ \sin m\theta_1 &amp; \cos m\theta_1 &amp; 0 &amp; 0 &amp; \cdots &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; \cos m\theta_2 &amp; -\sin m\theta_2 &amp; \cdots &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; \sin m\theta_2 &amp; \cos m\theta_2 &amp; \cdots &amp; 0 &amp; 0 \\ \vdots &amp; \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots &amp; \vdots \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; \cdots &amp; \cos m\theta_{d/2} &amp; -\sin m\theta_{d/2} \\ 0 &amp; 0 &amp; 0 &amp; 0 &amp; \cdots &amp; \sin m\theta_{d/2} &amp; \cos m\theta_{d/2} \end{pmatrix} \tag{25}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;em&gt;θ&lt;sub&gt;i&lt;/sub&gt; = 10000&lt;sup&gt;-2(i-1)/d&lt;/sup&gt;&lt;/em&gt; for &lt;em&gt;i ∈ [1, 2, …, d/2]&lt;/em&gt;&lt;/p&gt;
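
&lt;p&gt;A short sketch of Equation 25 may help make the structure tangible: it assembles the block-diagonal matrix for a given position &lt;em&gt;m&lt;/em&gt; and applies it to a vector. The dimension and position below are arbitrary illustrative choices.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

def rope_rotation_matrix(m, d, base=10000.0):
    # Block-diagonal rotation matrix from Equation 25 for position m.
    # theta_i = base**(-2 * (i - 1) / d) for i = 1 .. d/2 (here i is 0-indexed).
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = np.cos(m * theta), np.sin(m * theta)
        R[2 * i, 2 * i] = c
        R[2 * i, 2 * i + 1] = -s
        R[2 * i + 1, 2 * i] = s
        R[2 * i + 1, 2 * i + 1] = c
    return R

d = 8                                            # illustrative even dimension
x = np.random.default_rng(0).normal(size=d)
x_rotated = rope_rotation_matrix(m=5, d=d) @ x   # position-5 encoding of x
&lt;/code&gt;&lt;/pre&gt;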

&lt;p&gt;Relative position encoding: The principal advantage of Rotary Position Embedding (RoPE) is its intrinsic capacity to encode relative positional information rather than absolute positions. This property becomes mathematically evident when examining the attention mechanism. For a query vector at position &lt;em&gt;m&lt;/em&gt; and a key vector at position &lt;em&gt;n&lt;/em&gt;, the attention score is formulated as:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_m^T k_n = (R_{\Theta,m}^d \mathbf{W}_q\mathbf{x}_m)^T(R_{\Theta,n}^d \mathbf{W}_k\mathbf{x}_n) = \mathbf{x}_m^T \mathbf{W}_q^T R_{\Theta,m-n}^d \mathbf{W}_k\mathbf{x}_n \tag{26}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;&lt;label for=&quot;sn-rope-proof&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-rope-proof&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Mathematical Proof of Relative Position Property&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;This relative position property comes from fundamental properties of rotation matrices:&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Property 1:&lt;/strong&gt; The transpose of a rotation matrix inverts the rotation:&lt;br /&gt;(R&lt;sub&gt;Θ,m&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;)&lt;sup&gt;T&lt;/sup&gt; = R&lt;sub&gt;Θ,-m&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Property 2:&lt;/strong&gt; Multiplying rotation matrices compounds their rotations:&lt;br /&gt;R&lt;sub&gt;Θ,a&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; · R&lt;sub&gt;Θ,b&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; = R&lt;sub&gt;Θ,a+b&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt;Therefore:&lt;br /&gt;(R&lt;sub&gt;Θ,m&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;)&lt;sup&gt;T&lt;/sup&gt; · R&lt;sub&gt;Θ,n&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; = R&lt;sub&gt;Θ,-m&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; · R&lt;sub&gt;Θ,n&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; = R&lt;sub&gt;Θ,-m+n&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; = R&lt;sub&gt;Θ,n-m&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt;This mathematical property enables the model to naturally compute relative positional relationships between tokens without storing absolute positions. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;em&gt;R&lt;sub&gt;Θ,m-n&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; = (R&lt;sub&gt;Θ,m&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;)&lt;sup&gt;T&lt;/sup&gt; R&lt;sub&gt;Θ,n&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt;. This means the attention score naturally incorporates relative position information &lt;em&gt;(m-n)&lt;/em&gt; rather than absolute positions. Consequently, this mathematical property enables transformer architectures to develop position-invariant representations of token relationships, thereby enhancing the model’s capability to capture linguistic dependencies across diverse contextual environments.&lt;/p&gt;
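
&lt;p&gt;This relative-position property is easy to verify numerically. The sketch below uses the equivalent complex-number view of the per-pair rotation (rather than the explicit matrix above) and checks that the query-key score is unchanged when both positions are shifted by the same offset.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

def rope_rotate(x, m, base=10000.0):
    # Apply the position-m RoPE rotation to x by treating consecutive pairs of
    # coordinates as complex numbers; this is equivalent to multiplying by the
    # block-diagonal matrix of Equation 25.
    d = x.shape[0]
    pairs = x.reshape(d // 2, 2)
    z = pairs[:, 0] + 1j * pairs[:, 1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    rotated = z * np.exp(1j * m * theta)
    return np.stack([rotated.real, rotated.imag], axis=1).reshape(d)

rng = np.random.default_rng(1)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)

def score(m, n):
    # attention logit between a query rotated to position m and a key at position n
    return rope_rotate(q, m) @ rope_rotate(k, n)

# shifting both positions by the same offset leaves the score unchanged,
# so the logit depends only on the relative position
assert np.isclose(score(3, 7), score(13, 17))
assert np.isclose(score(0, 4), score(21, 25))
&lt;/code&gt;&lt;/pre&gt;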

&lt;h2 id=&quot;2-the-decoupled-rope-strategy-in-mla&quot;&gt;2. The Decoupled RoPE Strategy in MLA&lt;/h2&gt;

&lt;h3 id=&quot;21-separating-content-and-position-information&quot;&gt;2.1 Separating Content and Position Information&lt;/h3&gt;

&lt;p&gt;The key innovation in MLA is the decoupled Rotary Position Embedding (RoPE) strategy, which elegantly separates content information from positional information:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Content Path (no position encoding):&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
c_t^{KV} = W^{DKV}h_t \tag{27}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t^C = W^{UK}c_t^{KV} \tag{28}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t^C = W^{UQ}c_t^Q \tag{29}
&lt;/script&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;Position Path (Rotated keys and queries):&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t^R = R_{\Theta,t}^d \cdot W^{KR}h_t \tag{30}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t^R = R_{\Theta,t}^d \cdot W^{QR}c_t^Q \tag{31}
&lt;/script&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;Final Representations (concatenation):&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t = [k_t^C; k_t^R] \tag{32}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_t = [q_t^C; q_t^R] \tag{33}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The MLA approach bifurcates the attention mechanism into dual parallel pathways, enabling distinct processing optimizations for different aspects of token representation. This architectural design represents a significant advancement over traditional attention mechanisms by addressing the fundamental tension between computational efficiency and positional awareness.&lt;/p&gt;

&lt;p&gt;In the content path, semantic information is processed without positional encoding, allowing for substantial dimensionality reduction through compression. The down-projection matrix &lt;em&gt;W&lt;sup&gt;DKV&lt;/sup&gt;&lt;/em&gt; transforms the high-dimensional hidden state &lt;em&gt;h&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; into a compact latent representation &lt;em&gt;c&lt;sub&gt;t&lt;/sub&gt;&lt;sup&gt;KV&lt;/sup&gt;&lt;/em&gt; with dimension &lt;em&gt;d&lt;sub&gt;c&lt;/sub&gt;&lt;/em&gt;, where typically &lt;em&gt;d&lt;sub&gt;c&lt;/sub&gt; ≪ n&lt;sub&gt;heads&lt;/sub&gt; × d&lt;sub&gt;head&lt;/sub&gt;&lt;/em&gt;. This compression captures the essential semantic content while eliminating redundant information, resulting in a more memory-efficient representation that can be cached during inference.&lt;/p&gt;

&lt;p&gt;The separate position path maintains spatial awareness through Rotary Position Embeddings (RoPE), applied via the rotation matrix &lt;em&gt;R&lt;sub&gt;Θ,t&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt;. By isolating positional information in a dedicated pathway with dimension &lt;em&gt;d&lt;sub&gt;R&lt;/sub&gt;&lt;/em&gt; (typically much smaller than the content dimension), MLA preserves the model’s ability to understand token relationships without applying position encodings to the compressed representations. This separation is crucial for preventing the computational challenges that would arise from applying rotation matrices to compressed vectors (which we will explore in the sections below).&lt;/p&gt;

&lt;h3 id=&quot;22-attention-calculation-with-decoupled-rope&quot;&gt;2.2 Attention Calculation with Decoupled RoPE&lt;/h3&gt;

&lt;p&gt;The attention score between a query at position &lt;em&gt;p&lt;/em&gt; and a key at position &lt;em&gt;j&lt;/em&gt; becomes:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
a_{pj} = \frac{q_p^T k_j}{\sqrt{d_h + d_h^R}} = \frac{(q_p^C)^T k_j^C + (q_p^R)^T k_j^R}{\sqrt{d_h + d_h^R}} \tag{34}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Expanding each component:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Content Path:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{align} (q_p^C)^T k_j^C &amp;= (W^{UQ}c_p^Q)^T(W^{UK}c_j^{KV}) \\ &amp;= (c_p^Q)^T (W^{UQ})^T W^{UK} c_j^{KV} \end{align} \tag{35}
&lt;/script&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;Positional Path (with RoPE):&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{align} (q_p^R)^T k_j^R &amp;= (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T (R_{\Theta,j}^d \cdot W^{KR}h_j) \\ &amp;= (c_p^Q)^T (W^{QR})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{KR} h_j \\ &amp;= (c_p^Q)^T (W^{QR})^T R_{\Theta,p-j}^d W^{KR} h_j \end{align} \tag{36}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This decomposition highlights how the content similarity component measures semantic relationships independent of position, while the positional relationship component explicitly encodes the relative position &lt;em&gt;(p-j)&lt;/em&gt; between tokens. The additive interaction between these components in the attention calculation allows the model to consider both semantic compatibility and positional context when determining attention weights.&lt;/p&gt;

&lt;h3 id=&quot;23-optimizations-for-efficient-inference&quot;&gt;2.3 Optimizations for Efficient Inference&lt;/h3&gt;

&lt;p&gt;For the content similarity component, MLA employs matrix absorption:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{align} (q_p^C)^T k_j^C &amp;= (W^{UQ}c_p^Q)^T(W^{UK}c_j^{KV}) \\ &amp;= (c_p^Q)^T (W^{UQ})^T W^{UK} c_j^{KV} \end{align} \tag{37}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;By defining the absorbed matrix &lt;em&gt;(W&lt;sup&gt;UQ&lt;/sup&gt;)’ = (W&lt;sup&gt;UQ&lt;/sup&gt;)&lt;sup&gt;T&lt;/sup&gt; W&lt;sup&gt;UK&lt;/sup&gt;&lt;/em&gt;, we get:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{align} (q_p^C)^T k_j^C &amp;= (c_p^Q)^T (W^{UQ})&apos; c_j^{KV} \\ &amp;= \left(((W^{UQ})&apos;)^T c_p^Q\right)^T c_j^{KV} \end{align} \tag{38}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This optimization represents a significant computational efficiency gain during inference. By pre-computing the absorbed matrix &lt;em&gt;(W&lt;sup&gt;UQ&lt;/sup&gt;)’&lt;/em&gt;, we transform what would be two sequential matrix multiplications (the up-projection of query followed by dot product with up-projected key) into a single multiplication followed by a dot product with the compressed key vector.&lt;/p&gt;

&lt;p&gt;The absorbed matrix &lt;em&gt;(W&lt;sup&gt;UQ&lt;/sup&gt;)’&lt;/em&gt; effectively encapsulates both the query and key up-projection operations in a single transformation. This is particularly valuable during inference, as it reduces the computational overhead for each token generation step. The operation &lt;em&gt;(((W&lt;sup&gt;UQ&lt;/sup&gt;)’)&lt;sup&gt;T&lt;/sup&gt; c&lt;sub&gt;p&lt;/sub&gt;&lt;sup&gt;Q&lt;/sup&gt;)&lt;sup&gt;T&lt;/sup&gt; c&lt;sub&gt;j&lt;/sub&gt;&lt;sup&gt;KV&lt;/sup&gt;&lt;/em&gt; directly computes the content similarity using only the compressed representations, without requiring full decompression of the cached vectors.&lt;/p&gt;
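
&lt;p&gt;A small NumPy sketch of this absorption is given below. The matrix shapes are hypothetical stand-ins chosen only to check that the precomputed absorbed matrix reproduces the score obtained by explicitly decompressing both sides.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

rng = np.random.default_rng(0)
d_c, d_cq, d_h = 16, 12, 32     # illustrative compressed and head dimensions

W_UQ = rng.normal(size=(d_h, d_cq))   # query up-projection
W_UK = rng.normal(size=(d_h, d_c))    # key up-projection
c_q  = rng.normal(size=d_cq)          # compressed query latent  c_p^Q
c_kv = rng.normal(size=d_c)           # cached compressed latent c_j^KV

# reference: decompress both sides, then take the dot product (Equation 37)
score_ref = (W_UQ @ c_q) @ (W_UK @ c_kv)

# absorbed form: precompute (W^UQ)^T W^UK once, apply its transpose to the
# compressed query, and dot with the cached latent (Equation 38)
W_absorbed = W_UQ.T @ W_UK
score_fast = (W_absorbed.T @ c_q) @ c_kv

assert np.isclose(score_ref, score_fast)
&lt;/code&gt;&lt;/pre&gt;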

&lt;h3 id=&quot;24-maintaining-relative-position-in-the-position-path&quot;&gt;2.4 Maintaining Relative Position in the Position Path&lt;/h3&gt;

&lt;p&gt;For the positional component, we have:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{align} (q_p^R)^T k_j^R &amp;= (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T (R_{\Theta,j}^d \cdot W^{KR}h_j) \\ &amp;= (c_p^Q)^T (W^{QR})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{KR} h_j \\ &amp;= (c_p^Q)^T (W^{QR})^T R_{\Theta,p-j}^d W^{KR} h_j \end{align} \tag{39}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Since &lt;em&gt;R&lt;sub&gt;Θ,p-j&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; = (R&lt;sub&gt;Θ,p&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;)&lt;sup&gt;T&lt;/sup&gt; R&lt;sub&gt;Θ,j&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt;, the attention score naturally encodes the relative position &lt;em&gt;(p-j)&lt;/em&gt; between the tokens. This mathematical property is central to the effectiveness of the decoupled RoPE approach. The fundamental challenge in position encoding for efficient inference is maintaining awareness of relative positions while avoiding recomputation of key vectors for each new token. The decoupled approach solves this by leveraging a key property of rotation matrices: the product of a rotation matrix and its transpose encodes the relative angle between them. This means that by caching the position-encoded vectors &lt;em&gt;k&lt;sub&gt;j&lt;/sub&gt;&lt;sup&gt;R&lt;/sup&gt; = R&lt;sub&gt;Θ,j&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; ⋅ W&lt;sup&gt;KR&lt;/sup&gt;h&lt;sub&gt;j&lt;/sub&gt;&lt;/em&gt; for each token position &lt;em&gt;j&lt;/em&gt;, and computing &lt;em&gt;q&lt;sub&gt;p&lt;/sub&gt;&lt;sup&gt;R&lt;/sup&gt; = R&lt;sub&gt;Θ,p&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; ⋅ W&lt;sup&gt;QR&lt;/sup&gt;c&lt;sub&gt;p&lt;/sub&gt;&lt;sup&gt;Q&lt;/sup&gt;&lt;/em&gt; for the current position &lt;em&gt;p&lt;/em&gt;, their dot product naturally captures the relative positional relationship without requiring recomputation of previous keys.&lt;/p&gt;

&lt;h3 id=&quot;25-complete-inference-time-attention-calculation&quot;&gt;2.5 Complete Inference-Time Attention Calculation&lt;/h3&gt;

&lt;p&gt;During inference, the optimized attention calculation becomes:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
a_{pj} = \frac{\left(((W^{UQ})&apos;)^T c_p^Q\right)^T c_j^{KV} + (R_{\Theta,p}^d \cdot W^{QR}c_p^Q)^T k_j^R}{\sqrt{d_h + d_h^R}} \tag{40}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Where &lt;em&gt;(W&lt;sup&gt;UQ&lt;/sup&gt;)’&lt;/em&gt; is pre-computed, &lt;em&gt;c&lt;sub&gt;j&lt;/sub&gt;&lt;sup&gt;KV&lt;/sup&gt;&lt;/em&gt; and &lt;em&gt;k&lt;sub&gt;j&lt;/sub&gt;&lt;sup&gt;R&lt;/sup&gt;&lt;/em&gt; are cached for all previous tokens, and we only calculate &lt;em&gt;c&lt;sub&gt;p&lt;/sub&gt;&lt;sup&gt;Q&lt;/sup&gt;&lt;/em&gt; and &lt;em&gt;(R&lt;sub&gt;Θ,p&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; ⋅ W&lt;sup&gt;QR&lt;/sup&gt;c&lt;sub&gt;p&lt;/sub&gt;&lt;sup&gt;Q&lt;/sup&gt;)&lt;/em&gt; for the current token.&lt;/p&gt;
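
&lt;p&gt;The following sketch mimics this decoding-time computation for a single head with illustrative dimensions. The cached quantities and the rotated query component are random stand-ins; the point is only to show which tensors are read from the cache and which are computed per step.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

rng = np.random.default_rng(0)
d_c, d_cq, d_h, d_r = 16, 12, 32, 8   # illustrative dimensions, single head

W_UQ = rng.normal(size=(d_h, d_cq))
W_UK = rng.normal(size=(d_h, d_c))
W_absorbed = W_UQ.T @ W_UK            # precomputed once, as in Section 2.3

# per-token cache: compressed latents c_j^KV and rotary key parts k_j^R
cached_c_kv = [rng.normal(size=d_c) for _ in range(5)]
cached_k_r  = [rng.normal(size=d_r) for _ in range(5)]

# current token: compressed query latent and its rotated query part
c_q = rng.normal(size=d_cq)
q_r = rng.normal(size=d_r)            # stands in for R_{Theta,p} W^QR c_p^Q

q_content = W_absorbed.T @ c_q        # maps the query into the cached latent space
scores = [
    (q_content @ c_kv + q_r @ k_r) / np.sqrt(d_h + d_r)
    for c_kv, k_r in zip(cached_c_kv, cached_k_r)
]
&lt;/code&gt;&lt;/pre&gt;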

&lt;h2 id=&quot;3-why-the-naive-approach-to-combining-rope-and-mla-fails&quot;&gt;3. Why the Naive Approach to Combining RoPE and MLA Fails&lt;/h2&gt;

&lt;p&gt;Now that we understand the decoupled RoPE solution, let’s examine why a more straightforward approach doesn’t work.&lt;/p&gt;

&lt;h3 id=&quot;31-the-naive-approach-applying-rope-after-decompression&quot;&gt;3.1 The Naive Approach: Applying RoPE After Decompression&lt;/h3&gt;

&lt;p&gt;A seemingly natural way to combine RoPE with MLA would be to apply the rotational encoding after decompressing the cached latent vectors as shown in Figure 3:&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/mla/mla_arch_wrong.png&quot; alt=&quot;Naive MLA&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 3: Applying RoPE after decompression in the MLA architecture
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;ol&gt;
  &lt;li&gt;Compress the hidden states for storage in the KV cache:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
c_t^{KV} = W^{DKV}h_t \tag{41}
&lt;/script&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;During attention computation, decompress the cached vectors:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_t = W^{UK}c_t^{KV} = W^{UK}W^{DKV}h_t \tag{42}
&lt;/script&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;Apply RoPE to the decompressed vectors based on their position:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_m(m) = R_{\Theta,m}^d \cdot k_m = R_{\Theta,m}^d \cdot W^{UK}W^{DKV}h_m \tag{43}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This approach seems intuitive but creates a fundamental problem during inference.&lt;/p&gt;

&lt;h3 id=&quot;32-the-re-computation-problem&quot;&gt;3.2 The Re-computation Problem&lt;/h3&gt;

&lt;p&gt;During inference, for the current query token at position &lt;em&gt;p&lt;/em&gt; and a key token at position &lt;em&gt;j &amp;lt; p&lt;/em&gt;, the attention score calculation requires:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
a_{pj} = q_p(p)^T k_j(p-j) \tag{44}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Notice the crucial detail: &lt;em&gt;k&lt;sub&gt;j&lt;/sub&gt;(p-j)&lt;/em&gt; – we need the key vector for token &lt;em&gt;j&lt;/em&gt; encoded with the &lt;strong&gt;relative&lt;/strong&gt; position &lt;em&gt;(p-j)&lt;/em&gt;, not just its original absolute position &lt;em&gt;j&lt;/em&gt;. But here’s the problem: during inference, we’ve only cached &lt;em&gt;c&lt;sub&gt;j&lt;/sub&gt;&lt;sup&gt;KV&lt;/sup&gt;&lt;/em&gt; for previous tokens, not their RoPE-encoded keys. To compute &lt;em&gt;k&lt;sub&gt;j&lt;/sub&gt;(p-j)&lt;/em&gt; correctly, we need:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_j(p-j) = R_{\Theta,p-j}^d \cdot W^{UK}c_j^{KV} \tag{45}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This requires applying a different rotation matrix &lt;em&gt;R&lt;sub&gt;Θ,p-j&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt; to each cached key, which depends on the distance &lt;em&gt;(p-j)&lt;/em&gt; from the current position &lt;em&gt;p&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let’s prove why all keys must be recomputed with each new token. According to the RoPE formulation, the attention score between a query at position &lt;em&gt;p&lt;/em&gt; and a key at position &lt;em&gt;j&lt;/em&gt; is:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_p^T k_j = (R_{\Theta,p}^d \mathbf{W}_q\mathbf{x}_p)^T(R_{\Theta,j}^d \mathbf{W}_k\mathbf{x}_j) = \mathbf{x}_p^T \mathbf{W}_q^T R_{\Theta,p-j}^d \mathbf{W}_k\mathbf{x}_j \tag{46}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;In MLA with compressed representations, this becomes:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_p^T k_j = (R_{\Theta,p}^d \cdot W^{UQ}c_p^Q)^T(R_{\Theta,j}^d \cdot W^{UK}c_j^{KV}) \tag{47}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;But during inference, to capture the correct relative position &lt;em&gt;(p-j)&lt;/em&gt;, we need to recalculate:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_j(p-j) = R_{\Theta,p-j}^d \cdot W^{UK}c_j^{KV} \tag{48}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Or equivalently:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
k_j(p-j) = (R_{\Theta,p}^d)^T R_{\Theta,j}^d \cdot W^{UK}c_j^{KV} \tag{49}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This means for each new token position &lt;em&gt;p&lt;/em&gt;, we must recompute all previous keys with their relative position to &lt;em&gt;p&lt;/em&gt;, which effectively eliminates the benefit of KV caching. Instead of simply retrieving cached vectors, we must perform a matrix multiplication for every previous token with each new step, significantly increasing the computational cost.&lt;/p&gt;
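
&lt;p&gt;A rough counting sketch makes the scaling explicit: under the naive scheme the number of extra matrix-vector products per decoding step grows linearly with the number of cached tokens, while the decoupled scheme does a constant amount of extra work. The figures are illustrative, and the count ignores the attention dot products that both schemes share.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
# Rough per-step operation count for the naive approach: for each new token at
# position p, every cached latent c_j^KV must be decompressed and re-rotated,
# so the extra matrix-vector products per decoding step grow with the number of
# cached tokens. The decoupled scheme only processes the current token.

def naive_extra_matvecs(num_cached_tokens):
    # one decompression (W^UK c_j^KV) plus one rotation per cached token
    return 2 * num_cached_tokens

def decoupled_extra_matvecs(num_cached_tokens):
    # cached k_j^R entries are reused unchanged; only the current token is
    # projected and rotated, independent of how many tokens are cached
    return 2

for n in (1024, 8192, 32768):
    print(n, naive_extra_matvecs(n), decoupled_extra_matvecs(n))
&lt;/code&gt;&lt;/pre&gt;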

&lt;h3 id=&quot;33-the-matrix-absorption-impossibility&quot;&gt;3.3 The Matrix Absorption Impossibility&lt;/h3&gt;

&lt;p&gt;A natural optimization attempt would be to absorb some of the matrix multiplications. Let’s explore this possibility:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
q_p^T k_j = (R_{\Theta,p}^d \cdot W^{UQ}c_p^Q)^T(R_{\Theta,j}^d \cdot W^{UK}c_j^{KV}) \tag{50}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
= (c_p^Q)^T (W^{UQ})^T (R_{\Theta,p}^d)^T R_{\Theta,j}^d W^{UK} c_j^{KV} \tag{51}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
= (c_p^Q)^T (W^{UQ})^T R_{\Theta,p-j}^d W^{UK} c_j^{KV} \tag{52}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;If the rotation matrix &lt;em&gt;R&lt;sub&gt;Θ,p-j&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt; commuted with &lt;em&gt;W&lt;sup&gt;UK&lt;/sup&gt;&lt;/em&gt; (meaning &lt;em&gt;R&lt;sub&gt;Θ,p-j&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt; ⋅ W&lt;sup&gt;UK&lt;/sup&gt; = W&lt;sup&gt;UK&lt;/sup&gt; ⋅ R&lt;sub&gt;Θ,p-j&lt;/sub&gt;&lt;sup&gt;d&lt;/sup&gt;&lt;/em&gt;), we could define (as we show below, this is not actually possible):&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
(W^{UQ})&apos; = (W^{UQ})^T (W^{UK}) \tag{53}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;And compute:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
(c_p^Q)^T (W^{UQ})&apos; R_{\Theta,p-j}^d c_j^{KV} \tag{54}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This would allow us to apply the rotation directly to the compressed representations, avoiding the need for decompression and recomputation. However, rotation matrices do not generally commute with arbitrary matrices. To prove this, let’s consider the product of a &lt;em&gt;2×2&lt;/em&gt; rotation matrix and a general &lt;em&gt;2×2&lt;/em&gt; matrix:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
R_{\theta} = \begin{pmatrix} \cos\theta &amp; -\sin\theta \\ \sin\theta &amp; \cos\theta \end{pmatrix} \tag{55}
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
A = \begin{pmatrix} a &amp; b \\ c &amp; d \end{pmatrix} \tag{56}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Computing &lt;em&gt;R&lt;sub&gt;θ&lt;/sub&gt; ⋅ A&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{pmatrix} \cos\theta &amp; -\sin\theta \\ \sin\theta &amp; \cos\theta \end{pmatrix} \begin{pmatrix} a &amp; b \\ c &amp; d \end{pmatrix} = \begin{pmatrix} a\cos\theta-c\sin\theta &amp; b\cos\theta-d\sin\theta \\ a\sin\theta+c\cos\theta &amp; b\sin\theta+d\cos\theta \end{pmatrix} \tag{57}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Computing &lt;em&gt;A ⋅ R&lt;sub&gt;θ&lt;/sub&gt;&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\begin{pmatrix} a &amp; b \\ c &amp; d \end{pmatrix} \begin{pmatrix} \cos\theta &amp; -\sin\theta \\ \sin\theta &amp; \cos\theta \end{pmatrix} = \begin{pmatrix} a\cos\theta+b\sin\theta &amp; -a\sin\theta+b\cos\theta \\ c\cos\theta+d\sin\theta &amp; -c\sin\theta+d\cos\theta \end{pmatrix} \tag{58}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;These results are different unless &lt;em&gt;A&lt;/em&gt; has a special structure. Therefore:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
R_{\Theta,p-j}^d \cdot W^{UK} \neq W^{UK} \cdot R_{\Theta,p-j}^d \tag{59}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This non-commutativity prevents the matrix absorption optimization, forcing us to recalculate all keys with their appropriate rotations for each new token position.&lt;/p&gt;
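
&lt;p&gt;This is straightforward to confirm numerically with a random matrix, as in the short check below (any &lt;em&gt;A&lt;/em&gt; without special structure will do).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = np.random.default_rng(0).normal(size=(2, 2))   # a generic 2x2 matrix

# R @ A and A @ R differ for a generic A (Equations 57 and 58)
assert not np.allclose(R @ A, A @ R)

# by contrast, two rotations always commute and compose by angle addition
phi = 0.5
R2 = np.array([[np.cos(phi), -np.sin(phi)],
               [np.sin(phi),  np.cos(phi)]])
assert np.allclose(R @ R2, R2 @ R)
&lt;/code&gt;&lt;/pre&gt;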

&lt;h2 id=&quot;4-conclusion-why-decoupled-rope-succeeds-where-the-naive-approach-fails&quot;&gt;4. Conclusion: Why Decoupled RoPE Succeeds Where the Naive Approach Fails&lt;/h2&gt;

&lt;p&gt;The decoupled RoPE strategy succeeds by separating positional and content information into parallel paths, allowing each to be processed optimally:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content path&lt;/strong&gt; can be efficiently compressed without worrying about position encoding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position path&lt;/strong&gt; handles rotary encodings separately, maintaining relative position awareness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concatenation&lt;/strong&gt; combines both signals without requiring recalculation of previous keys.&lt;/p&gt;

&lt;p&gt;This separation allows MLA to achieve both memory efficiency (through compression) and computational efficiency (by avoiding recomputation), while still preserving the powerful relative position encoding capabilities of RoPE.&lt;/p&gt;

&lt;p&gt;In contrast, the naive approach attempts to apply position encoding on top of the compression-decompression pipeline, creating a fundamental mathematical incompatibility that forces costly recomputations during inference.&lt;/p&gt;

&lt;p&gt;The decoupled RoPE strategy represents an elegant architectural solution that demonstrates the importance of carefully considering how different components of a model interact, particularly when optimizing for inference efficiency.&lt;/p&gt;

&lt;h2 id=&quot;5-references&quot;&gt;5. References&lt;/h2&gt;

&lt;div class=&quot;work-references&quot;&gt;
&lt;p&gt;[1] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., &amp;amp; Liu, Y. (2021). &quot;RoFormer: Enhanced Transformer with Rotary Position Embedding.&quot; &lt;em&gt;arXiv preprint arXiv:2104.09864&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[2] DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., et al. (2024). &quot;DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.&quot; &lt;em&gt;arXiv preprint arXiv:2405.04434&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., &amp;amp; Polosukhin, I. (2017). &quot;Attention is all you need.&quot; &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, 30.&lt;/p&gt;
&lt;/div&gt;

</description>
        <pubDate>Sun, 27 Apr 2025 00:00:00 -0400</pubDate>
        <link>https://aakashvarma.github.io/mla/</link>
        <guid isPermaLink="true">https://aakashvarma.github.io/mla/</guid>
        
      </item>
    
      <item>
        <title>Analysis of Matrix Multiplications in Transformer Architectures</title>
        <description>&lt;p&gt;This analysis takes inspiration from Lequn Chen’s &lt;a href=&quot;https://le.qun.ch/en/blog/2023/05/13/transformer-batching/&quot;&gt;excellent article on transformer batching&lt;/a&gt; which analyzed performance on the A100 GPU. Building on their insights, this analysis focuses on the H100 architecture and provides fresh perspectives on transformer computations, with detailed performance analysis, comprehensive roofline model examination, and future optimization strategies specific to H100’s architecture.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Transformer blocks are built on two primary types of matrix multiplications: &lt;strong&gt;dense layer operations&lt;/strong&gt; and the &lt;strong&gt;QK multiplication&lt;/strong&gt; in self-attention mechanisms&lt;label for=&quot;sn-transformer-block&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-transformer-block&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;img src=&quot;/assets/images/transformer_bench/transformer_block.png&quot; alt=&quot;Transformer Block&quot; style=&quot;max-width:100%&quot; /&gt;&lt;br /&gt;&lt;br /&gt;To learn more about the transformer block, please read &lt;a href=&quot;https://magazine.sebastianraschka.com/&quot;&gt;Sebastian Raschka’s&lt;/a&gt; blogs. &lt;/span&gt;. These operations form the backbone of how Transformers process and encode input data, and their computational cost can be analyzed in terms of FLOPs (floating-point operations).&lt;/p&gt;

&lt;h2 id=&quot;dense-layers&quot;&gt;Dense Layers&lt;/h2&gt;

&lt;p&gt;Dense layers are a fundamental component of Transformer blocks. These layers project inputs from one space to another. They are frequently used in the multi-head attention mechanism in projection operations, such as the generation of Q (query), K (key), and V (value) vectors in self-attention layers. Dense layers are also a crucial part of the Multi-Layer Perceptron (MLP) block, such as in models like LLaMA. A dense layer operates on an input tensor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X&lt;/code&gt; of shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(batch,seqlen,h)&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch&lt;/code&gt; is the batch size, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seqlen&lt;/code&gt; is the sequence length, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h&lt;/code&gt; is the hidden size. It uses a weight matrix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt; of shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(h,h)&lt;/code&gt; to perform a linear projection, producing an output tensor of the same shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(batch,seqlen,h)&lt;/code&gt;&lt;label for=&quot;sn-dense-broadcast&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-dense-broadcast&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;For higher-dimensional inputs, vector-matrix multiplication broadcasts across all dimensions except the last one. A dense layer of shape &lt;code&gt;(h, h)&lt;/code&gt; applied to a tensor of shape &lt;code&gt;(b, s, h)&lt;/code&gt; first reshapes to &lt;code&gt;(b*s, h)&lt;/code&gt;, performs the matrix multiplication, then reshapes back to &lt;code&gt;(b, s, h)&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Note: This broadcasting pattern is core to transformer architectures, allowing efficient parallel processing while preserving hidden dimension operations. &lt;/span&gt; through the matrix multiplication &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X⋅W&lt;/code&gt;.&lt;/p&gt;
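
&lt;p&gt;The reshape-multiply-reshape pattern described in the sidenote can be sketched in a few lines of NumPy; the sizes below are arbitrary, and the check simply confirms that batched matrix multiplication gives the same result.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
import numpy as np

batch, seqlen, h = 2, 5, 8            # arbitrary illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(batch, seqlen, h))
W = rng.normal(size=(h, h))

# reshape to (batch*seqlen, h), multiply, reshape back to (batch, seqlen, h)
Y_reshaped = (X.reshape(batch * seqlen, h) @ W).reshape(batch, seqlen, h)

# the batched matmul in NumPy applies the same weight to every (seqlen, h) slice
Y_broadcast = X @ W

assert np.allclose(Y_reshaped, Y_broadcast)
&lt;/code&gt;&lt;/pre&gt;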

&lt;h2 id=&quot;self-attention&quot;&gt;Self Attention&lt;/h2&gt;

&lt;p&gt;The QK multiplication is a core operation in the self-attention mechanism of Transformer models, enabling the computation of how each token in a sequence “attends” to every other token. This operation generates the attention scores that underpin the model’s ability to contextualize the input. To begin, the input tensor X of shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(batch,seqlen,h)&lt;/code&gt;, where batch is the batch size, seqlen is the sequence length, and h is the hidden size, is linearly projected into the Query (Q) and Key (K) matrices. Both Q and K have the same shape as X, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(batch,seqlen,h)&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h=n⋅d&lt;/code&gt;, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n&lt;/code&gt; being the number of attention heads and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d&lt;/code&gt; the dimensionality of each head in Multi Head Attention.&lt;/p&gt;

&lt;h2 id=&quot;flops-and-io&quot;&gt;FLOPs and IO&lt;/h2&gt;

&lt;h3 id=&quot;dense-layer&quot;&gt;Dense Layer&lt;/h3&gt;

&lt;p&gt;The computational cost of this operation, measured in floating-point operations (FLOPs)&lt;label for=&quot;sn-flops-def&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-flops-def&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Here FLOPs stands for the number of floating-point operations needed for the matrix multiplication, and IO stands for the number of input and output data transfers. In this section these are not hardware (GPU) metrics; they are theoretical counts. &lt;/span&gt;, is calculated as follows: each element in the output requires &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h&lt;/code&gt; multiplications and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h−1&lt;/code&gt; additions, or approximately &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2h&lt;/code&gt; operations per output element. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch⋅seqlen⋅h&lt;/code&gt; output elements, the total number of operations is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FLOPs = b⋅seqlen⋅h⋅(2h)&lt;/code&gt;, which simplifies to&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{FLOPs} = 2 \cdot b \cdot \text{seqlen} \cdot h^2
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;As a result, dense layers scale quadratically with the hidden size &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h&lt;/code&gt;, making them computationally expensive as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h&lt;/code&gt; increases&lt;label for=&quot;sn-quadratic&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-quadratic&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The computational complexity increases quadratically with the hidden size, making this a critical consideration for large models. &lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;The input matrix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X&lt;/code&gt; has a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b,seqlen,h)&lt;/code&gt;, so the total number of elements read from X is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b⋅seqlen⋅h&lt;/code&gt;. The weight matrix W, with a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(h,h)&lt;/code&gt;, has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h⋅h&lt;/code&gt; elements that are read. After performing the matrix multiplication &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X⋅W&lt;/code&gt;, the output matrix has the same shape as X, which is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b,seqlen,h)&lt;/code&gt;, and the number of elements written to the output is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b⋅seqlen⋅h&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{IO} = 2 \cdot b \cdot \text{seqlen} \cdot h + h^2
&lt;/script&gt;&lt;/div&gt;
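
&lt;p&gt;These two counts can be wrapped in a small helper for experimentation. Note that the IO expression above counts elements, so converting to bytes requires multiplying by the element size (for example, 2 bytes for FP16); the configuration below is illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def dense_flops(b, seqlen, h):
    # FLOPs = 2 * b * seqlen * h^2
    return 2 * b * seqlen * h * h

def dense_io_elements(b, seqlen, h):
    # read X (b*seqlen*h) and W (h*h), write the output (b*seqlen*h)
    return 2 * b * seqlen * h + h * h

# illustrative configuration; multiply IO elements by the element size
# (e.g. 2 bytes for FP16) to convert to bytes
b, seqlen, h = 1, 2048, 4096
print(dense_flops(b, seqlen, h))
print(dense_io_elements(b, seqlen, h) * 2)
&lt;/code&gt;&lt;/pre&gt;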

&lt;h3 id=&quot;self-attention-1&quot;&gt;Self Attention&lt;/h3&gt;

&lt;h4 id=&quot;init&quot;&gt;Init&lt;/h4&gt;

&lt;p&gt;During initialization, when the entire sequence is processed at once, the Q and K matrices have the shapes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, seqlen, d)&lt;/code&gt;. To compute attention, the K matrix is transposed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, d, seqlen)&lt;/code&gt;. The matrix multiplication &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q⋅K^T&lt;/code&gt; then produces an output tensor of shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, seqlen, seqlen)&lt;/code&gt;, where each element represents the attention score between a pair of tokens.&lt;/p&gt;

&lt;p&gt;For each element in the output, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d&lt;/code&gt; multiplications and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(d−1)&lt;/code&gt; additions are required, totalling approximately &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2d&lt;/code&gt; operations per element. Since the output matrix has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seqlen⋅seqlen&lt;/code&gt; elements, and this computation occurs for each batch and head, the total number of FLOPs can be calculated as:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{FLOPs} = 2 \cdot d \cdot b \cdot n \cdot \text{seqlen}^2
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The Q matrix and K matrix have a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b,n,seqlen,d)&lt;/code&gt;, so the number of elements read is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b⋅n⋅seqlen⋅d&lt;/code&gt; for each. The output attention scores have a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b,n,seqlen,seqlen)&lt;/code&gt;, so the number of elements written is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b⋅n⋅seqlen^2&lt;/code&gt;. Adding these together gives:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{IO}_{\text{Init}} = 2 \cdot (b \cdot n \cdot \text{seqlen} \cdot d) + (b \cdot n \cdot \text{seqlen}^2)
&lt;/script&gt;&lt;/div&gt;

&lt;h4 id=&quot;auto-regressive-step&quot;&gt;Auto-Regressive Step&lt;/h4&gt;

&lt;p&gt;In the auto-regressive phase, where tokens are processed incrementally, the computation is performed for only the current token against all previously decoded tokens. Here, the Q matrix has a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, 1, d)&lt;/code&gt;, while the K matrix remains &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, seqlen, d)&lt;/code&gt;. After transposition, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K^T&lt;/code&gt; has the shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, d, seqlen)&lt;/code&gt;. The resulting output tensor has the shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, 1, seqlen)&lt;/code&gt;, representing attention scores for the current token against all preceding tokens.&lt;/p&gt;

&lt;p&gt;For each output element, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;d&lt;/code&gt; multiplications and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(d−1)&lt;/code&gt; additions are required, as before. However, since only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seqlen&lt;/code&gt; elements are computed (instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seqlen^2&lt;/code&gt;), the total FLOPs are:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{FLOPs} = 2 \cdot d \cdot b \cdot n \cdot 1 \cdot \text{seqlen} = 2 \cdot b \cdot n \cdot d \cdot \text{seqlen}
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The Q matrix has a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b,n,1,d)&lt;/code&gt;, so the number of elements read is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b⋅n⋅1⋅d&lt;/code&gt;; the K matrix has a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, seqlen, d)&lt;/code&gt;, so the number of elements read is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b⋅n⋅seqlen⋅d&lt;/code&gt;; and the output attention scores have a shape of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b,n,1,seqlen)&lt;/code&gt;, so the number of elements written is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b⋅n⋅seqlen&lt;/code&gt;. Adding these together gives:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{IO}_{\text{AR}} = (b \cdot n \cdot d) + (b \cdot n \cdot \text{seqlen} \cdot d) + (b \cdot n \cdot \text{seqlen})
&lt;/script&gt;&lt;/div&gt;
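
&lt;p&gt;For completeness, here is a minimal sketch of the attention-side counts and the resulting arithmetic intensity in FLOPs per byte, again with an illustrative configuration and FP16 elements assumed.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
def qk_init_counts(b, n, seqlen, d):
    # prefill: full seqlen x seqlen score matrix per batch and head
    flops = 2 * d * b * n * seqlen * seqlen
    io = 2 * (b * n * seqlen * d) + b * n * seqlen * seqlen
    return flops, io

def qk_ar_counts(b, n, seqlen, d):
    # decode: a single query row against all cached keys
    flops = 2 * b * n * d * seqlen
    io = (b * n * d) + (b * n * seqlen * d) + (b * n * seqlen)
    return flops, io

# illustrative configuration (FP16 elements, 2 bytes each)
b, n, seqlen, d, bytes_per_elem = 1, 32, 4096, 128, 2
for flops, io in (qk_init_counts(b, n, seqlen, d), qk_ar_counts(b, n, seqlen, d)):
    print(flops / (io * bytes_per_elem))   # arithmetic intensity in FLOPs/Byte
&lt;/code&gt;&lt;/pre&gt;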

&lt;h2 id=&quot;arithmetic-intensity&quot;&gt;Arithmetic Intensity&lt;/h2&gt;

&lt;p&gt;Arithmetic intensity is a critical metric that represents the ratio of computational operations (FLOPs) to memory operations (IO bytes), expressed as FLOPs/Byte&lt;label for=&quot;sn-ai-model&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-ai-model&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;img src=&quot;/assets/images/transformer_bench/ai_model.png&quot; alt=&quot;Arithmetic intensity&quot; style=&quot;max-width:100%&quot; /&gt;&lt;br /&gt;&lt;br /&gt;Arithmetic Intensity (AI) = FLOPs/Bytes is a key performance indicator that helps determine whether an operation is compute-bound or memory-bound. Higher AI values suggest compute-bound operations, while lower values indicate memory-bound operations.&lt;br /&gt;Read more &lt;a href=&quot;https://dando18.github.io/posts/2020/04/02/roofline-model&quot;&gt;here&lt;/a&gt; &lt;/span&gt;. The three plots visualize this relationship for different MatMul layers in Transformer blocks (Dense Layer, QK Init, and QK AR) using logarithmic scales on both axes, where each increment represents an order of magnitude increase. The diagonal gray line represents a 1:1 ratio between FLOPs and bytes, with points above this line indicating operations that perform more computations per byte of memory accessed.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/flops_vs_io_single.png&quot; alt=&quot;Arithmetic Intensity Single&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 1: Arithmetic Intensity Analysis for Transformer Operations (Dense Layer, QK Init, QK AR) with sequence length 100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/flops_vs_io_all.png&quot; alt=&quot;Arithmetic Intensity All&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 2: Arithmetic Intensity Analysis for Transformer Operations (Dense Layer, QK Init, QK AR)
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;For Dense Layers, the arithmetic intensity is governed by:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{FLOPs} = 2 \cdot b \cdot \text{seqlen} \cdot h^2
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{IO} = 2 \cdot b \cdot \text{seqlen} \cdot h + h^2
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;This results in quadratic scaling with hidden size (h), making these layers increasingly compute-intensive as models grow larger&lt;label for=&quot;sn-dense-ai&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-dense-ai&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Dense Layer Arithmetic Intensity:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;\(\text{AI} = \frac{\text{FLOPs}}{\text{IO}} = \frac{2bsh^2}{2bsh + h^2} = \frac{2bsh}{2bs + h} = \frac{h}{1 + \frac{h}{2bs}} = O\left(\frac{1}{\frac{1}{h} + \frac{1}{2bs}}\right)\)&lt;br /&gt;&lt;br /&gt;This ratio reveals key insights:&lt;br /&gt;- When h is large: AI approaches 2b·s&lt;br /&gt;- As b·s increases: AI approaches h&lt;br /&gt;- When both h and b·s are large: AI is limited by roughly min(h, 2b·s)&lt;br /&gt;&lt;br /&gt;Note: I’ve used s instead of seqlen for consistency with typical notation, but they represent the same sequence length parameter.&lt;br /&gt;&lt;br /&gt;This relationship explains why increasing the batch size improves efficiency: as b·s grows, the h/(2bs) term in the denominator shrinks toward zero, pushing arithmetic intensity toward h. This is why dense layers in large models benefit significantly from batch processing. &lt;/span&gt;. The stepping pattern visible in the graph reflects this quadratic relationship, where larger hidden sizes show steeper curves and higher arithmetic intensity. This explains why dense layers in large models can become significant computational bottlenecks.&lt;/p&gt;
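
&lt;p&gt;As a quick numerical check of the derivation in the sidenote, the following sketch evaluates the dense-layer arithmetic intensity and shows how it saturates as the batch grows. The values of h and b are illustrative, not benchmark data.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def dense_ai(b, s, h):
    # Arithmetic intensity of a (b*s, h) x (h, h) dense layer, in FLOPs per element moved.
    flops = 2 * b * s * h * h
    io = 2 * b * s * h + h * h   # read the input and write the output, plus the weight matrix
    return flops / io

h = 4096
for b in (1, 8, 64, 512):
    # With s = 1 (an auto-regressive step), AI grows with the batch until it approaches h.
    print(b, round(dense_ai(b, s=1, h=h), 1))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;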

&lt;p&gt;QK Init (Init) operations are characterized by:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{FLOPs} = b \cdot n \cdot \text{seqlen}^2 \cdot 2 \cdot d
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{IO} = 2 \cdot (b \cdot n \cdot \text{seqlen} \cdot d) + (b \cdot n \cdot \text{seqlen}^2)
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;The middle graph shows parallel lines for different sequence lengths, indicating consistent arithmetic intensity patterns that scale predictably with sequence length. As the sequence length increases, the operation becomes more compute-heavy, so a larger seqlen makes the QK Init stage an increasingly compute-bound bottleneck.&lt;/p&gt;

&lt;p&gt;QK AR (Auto-Regressive) computations follow:&lt;/p&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{FLOPs} = b \cdot n \cdot d \cdot \text{seqlen} \cdot 2
&lt;/script&gt;&lt;/div&gt;

&lt;div class=&quot;mathblock&quot;&gt;&lt;script type=&quot;math/tex; mode=display&quot;&gt;
\text{IO} = (b \cdot n \cdot d) + (b \cdot n \cdot \text{seqlen} \cdot d) + (b \cdot n \cdot \text{seqlen})
&lt;/script&gt;&lt;/div&gt;

&lt;p&gt;Unlike QK Init, this operation scales linearly with sequence length, resulting in more favorable arithmetic intensity characteristics&lt;label for=&quot;sn-self-attention-ai&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-self-attention-ai&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Self-Attention Arithmetic Intensity:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;For QK&lt;sup&gt;T&lt;/sup&gt; multiplication:&lt;br /&gt;\(\text{FLOPs} = b \cdot n \cdot \text{seqlen}^2 \cdot 2 \cdot d\)&lt;br /&gt;\(\text{IO} = 2(b \cdot n \cdot \text{seqlen} \cdot d) + (b \cdot n \cdot \text{seqlen}^2)\)&lt;br /&gt;&lt;br /&gt;QK Arithmetic Intensity:&lt;br /&gt;\(\text{AI} = \frac{\text{FLOPs}}{\text{IO}} = \frac{b \cdot n \cdot \text{seqlen}^2 \cdot 2d}{2bnd \cdot \text{seqlen} + bn \cdot \text{seqlen}^2} = \frac{2 \cdot \text{seqlen} \cdot d}{2d + \text{seqlen}}\)&lt;br /&gt;&lt;br /&gt;This derivation reveals crucial properties:&lt;br /&gt;- Batch size b cancels out completely&lt;br /&gt;- AI depends only on sequence length and head dimension&lt;br /&gt;- Scaling b increases both compute and memory linearly&lt;br /&gt;- No inherent efficiency gain from batching, unlike dense layers&lt;br /&gt;&lt;br /&gt;The final expression shows why self-attention’s performance characteristics remain constant regardless of batch size, making it fundamentally different from dense layer operations. &lt;/span&gt;. This is evident in the rightmost graph, where points cluster tightly along similar trajectories regardless of sequence length.&lt;/p&gt;
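
&lt;p&gt;The sidenote’s key point, that batch size cancels out of the QK arithmetic intensity, can be verified directly from the formulas above. A minimal sketch (illustrative values for d and seqlen):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def qk_init_ai(seqlen, d):
    # 2*seqlen*d / (2*d + seqlen): b and n cancel out of FLOPs / IO.
    return 2 * seqlen * d / (2 * d + seqlen)

def qk_ar_ai(seqlen, d):
    # FLOPs = 2*b*n*d*seqlen, IO = b*n*(d + seqlen*d + seqlen): b and n cancel again.
    return 2 * d * seqlen / (d + seqlen * d + seqlen)

d = 128
for seqlen in (100, 500, 2000):
    print(seqlen, round(qk_init_ai(seqlen, d), 1), round(qk_ar_ai(seqlen, d), 2))
# qk_ar_ai stays close to 2 FLOPs per element for long sequences: firmly memory-bound.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;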

&lt;p&gt;However, these graphs represent theoretical relationships that don’t account for real-world hardware constraints. The Roofline Model becomes crucial here as it helps bridge this gap by providing a framework to understand actual performance limitations. In the Roofline Model, performance is bounded by two primary factors: the peak computational performance (represented by a horizontal line) and the memory bandwidth limit (shown as a diagonal line). The lower of these two bounds at any given arithmetic intensity determines the maximum achievable performance. We’ll look at the roofline model in the sections below.&lt;/p&gt;
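
&lt;p&gt;That bound can be written as a one-line function. Here is a minimal sketch, where the H100 peak numbers are approximate published-spec values (SXM, dense BF16) used purely for illustration:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def roofline_flops(ai, peak_flops, mem_bw):
    # Attainable FLOP/s is the lower of the compute roof and the memory-bandwidth slope.
    # ai is arithmetic intensity in FLOPs per byte moved.
    return min(peak_flops, mem_bw * ai)

PEAK = 990e12   # ~990 TFLOP/s dense BF16 on H100 SXM (approximate spec)
BW = 3.35e12    # ~3.35 TB/s HBM bandwidth on H100 SXM (approximate spec)

for ai in (2, 64, 512, 2048):
    print(ai, roofline_flops(ai, PEAK, BW) / 1e12)   # attainable TFLOP/s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;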

&lt;h2 id=&quot;analysis-of-dense-layer&quot;&gt;Analysis of Dense Layer&lt;/h2&gt;

&lt;p&gt;To analyze the dense layers, let’s look at the Throughput vs Batch graph. Throughput is calculated in tokens per second on the Y-axis and the X-axis shows the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch * seqlen&lt;/code&gt; dimension for that particular Dense operation&lt;label for=&quot;sn-dense-reshape&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-dense-reshape&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;For higher-dimensional inputs, the vector-matrix multiplication is broadcasted to all dimensions except for the last one. For example, when applying a dense layer of shape &lt;code&gt;(h, h)&lt;/code&gt; to a tensor of shape &lt;code&gt;(b, s, h)&lt;/code&gt;, the tensor is reshaped to &lt;code&gt;(b*s, h)&lt;/code&gt; before the matrix multiplication and then reshaped back to &lt;code&gt;(b, s, h)&lt;/code&gt; afterward. &lt;/span&gt;.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/dense_init_small_seqlen.png&quot; alt=&quot;Dense Layer Small Seqlen&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 3a: Throughput vs Batch for Dense Layer with small sequence length on NVIDIA H100 for INIT stage
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/dense_init_large_seqlen.png&quot; alt=&quot;Dense Layer Large Seqlen&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 3b: Throughput vs Batch for Dense Layer with large sequence length on NVIDIA H100 for INIT stage
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In Figure 3a, we can see that there is a benefit from batching: the throughput increases as the batch size increases. This holds for smaller sequence lengths, but as the seqlen is made larger&lt;label for=&quot;sn-large-prompt&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-large-prompt&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;When the prompt is larger, around 100+ tokens of input &lt;/span&gt;, increasing the batch size improves the throughput only up to a certain point, beyond which the throughput saturates, as shown in Figure 3b.&lt;/p&gt;

&lt;p&gt;We can infer that for smaller input prompts the matrices are too small for the H100 to keep all of its compute units busy, whereas larger sequence lengths are able to saturate them.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/dense_init_consolidated.png&quot; alt=&quot;Dense Layer Consolidated&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 4: Consolidated view of Dense Layer throughput across different dimensions on NVIDIA H100 for INIT stage
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;To get a consolidated view of Throughput vs batch across variations in h, d, n, and seqlen, it is not very useful to plot every combination separately. Instead, using FLOPs on the x-axis allows us to analyze different model sizes on a single plot. This figure uses FLOPs as the x-axis, which serves as a proxy for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b*s&lt;/code&gt; since FLOPs are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;O(bsh^2)&lt;/code&gt;&lt;label for=&quot;sn-flops-relation&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-flops-relation&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The relationship between FLOPs and batch size demonstrates how computational complexity scales with model parameters, directly impacting throughput characteristics. &lt;/span&gt;. This plot shows that throughput increases with batch size when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seqlen&lt;/code&gt; is smaller (i.e., when the prompts are short). But when the effective batch is large (either because the batch size is high while serving the LLM, or because a long prompt makes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seqlen&lt;/code&gt; large, or both), the throughput saturates.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/dense_ar_1.png&quot; alt=&quot;Dense AR Stage 1&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 5a: Dense Layer performance in auto-regressive stage - Throughput Analysis on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/dense_ar_2.png&quot; alt=&quot;Dense AR Stage 2&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 5b: Dense Layer performance in auto-regressive stage across different hidden-dims
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In the autoregressive (AR) stage, the sequence length is always 1&lt;label for=&quot;sn-ar-seqlen&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-ar-seqlen&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The seqlen is 1 in the AR step for the dense layer because only the new token generated in the previous step needs to be processed at this step. Keys and values are still needed for all previous tokens, but that is handled by the KV cache. Hence processing a single token gives seqlen = 1. &lt;/span&gt;. So there is no practical upper limit on the throughput, even for higher batch sizes. This is reflected in the graphs in Figures 5a and 5b.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/dense_ar_latency.png&quot; alt=&quot;Dense AR Latency&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 6: Latency analysis for Dense Layer in auto-regressive stage on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;From Figure 6, we can see that batching the dense layer in the auto-regressive generation stage does not significantly affect generation latency. This is a good property: a batch of 100 has roughly the same latency as much smaller batch sizes.&lt;/p&gt;

&lt;p&gt;In system design, managing batch sizes and sequence lengths is crucial, particularly for larger models during the Init phase&lt;label for=&quot;sn-system-design&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-system-design&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The relationship between batch size and sequence length creates a complex optimization space that directly impacts system performance and resource utilization. Understanding these dynamics is crucial for efficient model deployment. &lt;/span&gt;. This phase tends to be the primary performance bottleneck, requiring careful optimization to improve efficiency. Conversely, the autoregressive generation phase scales more effectively, making it less of a limiting factor in overall performance. Smaller models with hidden sizes below 2048 demonstrate better efficiency across both phases, highlighting their suitability for latency-sensitive applications. Additionally, effective batching strategies can significantly enhance the performance of the generation phase without incurring notable penalties. These insights suggest the need for distinct optimization strategies tailored to the Init and generation phases in model serving.&lt;/p&gt;

&lt;h2 id=&quot;analysis-of-self-attention&quot;&gt;Analysis of Self Attention&lt;/h2&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_small_seqlen.png&quot; alt=&quot;Self Attention Small Seqlen&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 7a: Self Attention performance with small sequence length on NVIDIA H100 for INIT stage
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_large_seqlen.png&quot; alt=&quot;Self Attention Large Seqlen&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 7b: Self Attention performance with large sequence length on NVIDIA H100 for INIT stage
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Analyzing the graphs above reveals that for smaller sequence lengths (shorter prompts) in the Init stage, batching has a more significant impact, providing noticeable benefits&lt;label for=&quot;sn-batching-impact&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-batching-impact&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;The impact of batching on self-attention performance varies significantly with sequence length, creating an important consideration for optimization strategies. &lt;/span&gt;. However, in the graph in Figure 7b, where the sequence length is larger (seqlen = 500) during the initialization stage, the throughput of the QK matrix multiplication begins to saturate as the batch size increases.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_flops_1.png&quot; alt=&quot;Self Attention FLOPs 1&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 8a: Self Attention performance analysis for INIT stage across different hidden dimensions, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_flops_2.png&quot; alt=&quot;Self Attention FLOPs 2&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 8b: Self Attention performance analysis for INIT stage across different sequence lengths, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Let’s examine the plots with FLOPs on the x-axis, representing different model sizes on the same graph&lt;label for=&quot;sn-flops-comparison&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-flops-comparison&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Using FLOPs as a metric allows for direct comparison across different model configurations, providing insights into computational efficiency scaling. &lt;/span&gt;. For sequence lengths less than 500, throughput increases as the batch size grows. However, for sequence lengths greater than 500, the plots become linear, showing no increase in throughput despite an increase in batch size.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_ar_1.png&quot; alt=&quot;Self Attention AR 1&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 9a: Self Attention auto-regressive performance with small sequence length for h = 4096, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_ar_2.png&quot; alt=&quot;Self Attention AR 2&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 9b: Self Attention auto-regressive performance with large sequence length for h = 4096, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_ar_3.png&quot; alt=&quot;Self Attention AR 3&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 9c: Self Attention auto-regressive performance across different hidden dimensions, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_ar_4.png&quot; alt=&quot;Self Attention AR 4&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 9d: Self Attention auto-regressive performance across different sequence length, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;A similar pattern is observed in the auto-regressive stage, where increasing the batch size for larger sequence lengths has minimal to no effect. This occurs because the arithmetic intensity of the operation is independent of batch size, so small and large batches end up with the same FLOP-to-IO ratio. Additionally, as auto-regression progresses, the sequence length increases, further diminishing the impact of batching.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_latency_1.png&quot; alt=&quot;Self Attention Latency 1&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 10a: Latency analysis for Self Attention across different sequence lengths, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/self_attention_latency_2.png&quot; alt=&quot;Self Attention Latency 2&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 10b: Latency analysis for Self Attention across different hidden dimensions, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Self-attention latency is comparable to that of a dense layer but increases with batch size, unlike a dense layer. This latency scales approximately linearly with batch size because self-attention primarily involves batched matrix multiplication. With a fixed FLOP-to-I/O ratio, increasing the batch size proportionally raises both FLOPs and I/O, maintaining a constant ratio&lt;label for=&quot;sn-self-attn-ai&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-self-attn-ai&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Self-Attention Arithmetic Intensity:&lt;/strong&gt;&lt;br /&gt;\(\text{QK AI} = \frac{2 \cdot \text{seqlen} \cdot d}{2d + \text{seqlen}}\)&lt;br /&gt;&lt;br /&gt;- Batch size b cancels out completely&lt;br /&gt;- AI depends only on sequence length and head dimension&lt;br /&gt;&lt;br /&gt;Therefore, increasing batch size does not change the AI, it increases both FLOPs and IO at the same multiplier. &lt;/span&gt;. For example, increasing the batch size from 100 to 1000 directs the system to process more items simultaneously, boosting total throughput without accelerating the processing of individual items. The fundamental matrix multiplication operations still require the same number of steps per item, as the computational work (FLOPs) and memory operations (I/O) scale together. Additionally, in auto-regressive tasks, as the sequence length grows, more time is required to process each subsequent step.&lt;/p&gt;

&lt;h2 id=&quot;roofline-model&quot;&gt;Roofline Model&lt;/h2&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/roofline_model_overview.png&quot; alt=&quot;Roofline Model Overview&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 11: Roofline Model analysis for all operations on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Figure 11 presents the data points for all benchmark combinations, organized using the roofline model&lt;label for=&quot;sn-roofline&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-roofline&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;img src=&quot;/assets/images/transformer_bench/rf_model.png&quot; alt=&quot;Roofline Model&quot; style=&quot;max-width:100%&quot; /&gt;&lt;br /&gt;&lt;br /&gt;The Roofline Model is a performance model that expresses the limits a specific piece of hardware places on the performance of an algorithm. It is often shown visually as a log-log plot of Arithmetic Intensity vs FLOP/s. Read the math behind it &lt;a href=&quot;https://dando18.github.io/posts/2020/04/02/roofline-model&quot;&gt;here&lt;/a&gt; &lt;/span&gt;. Different stages and layers are distinguished through color coding. Overlaid on the figure are the theoretical memory bandwidth and FLOP/s limits, based on NVIDIA H100 specifications.&lt;/p&gt;

&lt;p&gt;Two key insights emerge from this visualization:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The data points cluster into distinct groups and sub-groups, naturally reflecting the computational and memory characteristics of various stages and layers.&lt;/li&gt;
  &lt;li&gt;The data points closely follow the theoretical roofline, demonstrating that the benchmarks effectively leverage the hardware’s capabilities relative to its peak performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To observe the impact of batching, let’s examine a specific case (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;h=4096, s=100&lt;/code&gt;)&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/transformer_bench/roofline_model_specific.png&quot; alt=&quot;Roofline Model Specific&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 12: Detailed Roofline analysis for h=4096 and s=100, measured on NVIDIA H100
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Arithmetic Intensity and Achieved FLOP/s&lt;/strong&gt;: Arithmetic intensity across operations follows the sequence: dense_init &amp;gt; qk_init &amp;gt; dense_ar &amp;gt; qk_ar. Achieved FLOP/s also follows this order. The dense layer during initialization is constrained by the GPU’s peak computational performance. For small models and short sequence lengths, batching provides slight improvements, but significant performance gains require investing in a more powerful GPU.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Dense Layer in Auto-Regression&lt;/strong&gt;: Unlike initialization, the dense layer in the auto-regression stage behaves differently. For the same model size, its data points align with the slope of the GPU’s memory bandwidth, indicating that its performance is memory bandwidth-bound. Under this constraint, increasing the batch size enhances the achieved FLOP/s by improving arithmetic intensity.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Batching and Self-Attention&lt;/strong&gt;: Batching significantly impacts self-attention. While it does not alter the arithmetic intensity of self-attention, it increases the achieved FLOP/s for short sequence lengths by enabling parallel processing.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Kernel Implementation in Self-Attention&lt;/strong&gt;: The increase in achieved FLOP/s for self-attention, despite unchanged arithmetic intensity, suggests that the kernel implementation may be suboptimal, potentially failing to fully utilize the GPU’s compute units.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;data-availability&quot;&gt;Data Availability&lt;/h2&gt;

&lt;p&gt;All the data used in this analysis is publicly available in CSV format at &lt;a href=&quot;https://github.com/doteval/transformer_bench/tree/main/data&quot;&gt;transformer_bench/data&lt;/a&gt;. While this article focuses on bf16 dtype results, the repository contains data for fp32 and fp16 dtypes as well on the H100 GPU. You are encouraged to perform your own analysis using these additional precision formats and contribute your findings to the repository at &lt;a href=&quot;https://github.com/doteval/transformer_bench/tree/main&quot;&gt;doteval/transformer_bench&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Performance Hierarchy and Hardware Constraints&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Arithmetic intensity and achieved FLOP/s follow a clear hierarchy: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dense_init &amp;gt; qk_init &amp;gt; dense_ar &amp;gt; qk_ar&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;Dense layer initialization is compute-bound by GPU peak performance&lt;/li&gt;
      &lt;li&gt;Dense layer auto-regression is memory bandwidth-bound&lt;/li&gt;
      &lt;li&gt;Performance improvements in compute-bound operations require GPU upgrades, while memory-bound operations benefit from optimized batching strategies&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Sequence Length and Batching Dynamics&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Short sequence lengths (&amp;lt; 500 tokens) show significant benefits from batching&lt;/li&gt;
      &lt;li&gt;Longer sequences (&amp;gt; 500 tokens) show diminishing returns from increased batch sizes&lt;/li&gt;
      &lt;li&gt;In autoregressive generation, sequence length remains at 1, allowing for efficient batching&lt;/li&gt;
      &lt;li&gt;Throughput saturation occurs at different batch sizes depending on sequence length and model size&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Self-Attention Characteristics and Optimization&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Self-attention benefits from batching without changing arithmetic intensity&lt;/li&gt;
      &lt;li&gt;Current kernel implementations show signs of suboptimal compute unit utilization&lt;/li&gt;
      &lt;li&gt;Parallel processing capabilities are not fully exploited, suggesting room for optimization&lt;/li&gt;
      &lt;li&gt;Performance scales linearly with batch size due to the nature of matrix multiplication operations&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Model Size Considerations&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Smaller models (hidden sizes &amp;lt; 2048) demonstrate better efficiency across all phases&lt;/li&gt;
      &lt;li&gt;Larger models face significant computational bottlenecks during Init&lt;/li&gt;
      &lt;li&gt;Memory bandwidth becomes a limiting factor for large models in autoregressive phase&lt;/li&gt;
      &lt;li&gt;Different optimization strategies are needed for different model sizes&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;System Design and Implementation Insights&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Init phase is typically the primary performance bottleneck&lt;/li&gt;
      &lt;li&gt;Autoregressive generation phase shows more favorable scaling characteristics&lt;/li&gt;
      &lt;li&gt;Different phases require distinct optimization approaches due to varying performance characteristics&lt;/li&gt;
      &lt;li&gt;System designs need to balance between throughput optimization and latency requirements based on use case&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;future&quot;&gt;Future&lt;/h2&gt;

&lt;h3 id=&quot;optimizing-self-attention-through-matrix-fusion&quot;&gt;Optimizing Self-Attention Through Matrix Fusion&lt;/h3&gt;

&lt;p&gt;In the self-attention mechanism, we can identify a key optimization opportunity in the matrix multiplication operations. Currently, the computation flow involves the following steps (a minimal sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Computing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QK^T&lt;/code&gt; which produces an intermediate result with shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, s, s)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Applying softmax to this intermediate result&lt;/li&gt;
  &lt;li&gt;Multiplying with V to get the final output of shape &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(b, n, s, d)&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
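
&lt;p&gt;To make the intermediate tensor concrete, here is a minimal NumPy sketch of this unfused flow. It only illustrates the shapes involved, in particular the full (b, n, s, s) score matrix; it is not a production kernel.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V: (b, n, s, d). The unfused flow materializes the full (b, n, s, s) score matrix.
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d)     # (b, n, s, s) intermediate
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (b, n, s, d) output

b, n, s, d = 2, 4, 16, 8
rng = np.random.default_rng(0)
out = naive_attention(rng.standard_normal((b, n, s, d)),
                      rng.standard_normal((b, n, s, d)),
                      rng.standard_normal((b, n, s, d)))
print(out.shape)   # (2, 4, 16, 8)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;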

&lt;p&gt;A more efficient approach would combine these operations into a unified computation:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The key insight is that we can fuse these three matrix operations (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QK^T&lt;/code&gt;, softmax, and multiplication with V) into a single GPU kernel operation&lt;/li&gt;
  &lt;li&gt;This fusion is particularly effective because the head dimension (d=128) is relatively small&lt;/li&gt;
  &lt;li&gt;The main challenge lies in handling the softmax operation, which traditionally requires computing across the entire sequence dimension&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The softmax computation presents a specific challenge, but it was solved elegantly by FlashAttention. There are three versions of FlashAttention: the third is an optimization specific to H100 GPUs, while the techniques from the first two papers can be implemented on any GPU. Links to the papers are in the references.&lt;/p&gt;

&lt;h3 id=&quot;efficient-request-batching-strategy&quot;&gt;Efficient Request Batching Strategy&lt;/h3&gt;

&lt;p&gt;Analysis revealed significant potential in batching multiple requests, even when they have different sequence lengths. Rather than using simple padding, we can implement a more sophisticated approach based on our performance analysis:&lt;/p&gt;

&lt;p&gt;Key Observations:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Dense layer performance:
    &lt;ul&gt;
      &lt;li&gt;Shows strong batching benefits&lt;/li&gt;
      &lt;li&gt;Maintains nearly constant latency during autoregressive generation&lt;/li&gt;
      &lt;li&gt;Treats sequence dimension similarly to batch dimension&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Self-attention characteristics:
    &lt;ul&gt;
      &lt;li&gt;Must process each sequence independently&lt;/li&gt;
      &lt;li&gt;Cannot be batched across different sequences&lt;/li&gt;
      &lt;li&gt;Takes less execution time compared to dense layers&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Implementation Strategy (a minimal sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Input Processing:
    &lt;ul&gt;
      &lt;li&gt;Take variable-length inputs: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[(s1, h), (s2, h), ...]&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;Combine them into a single matrix: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(sum(si), h)&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Computation Flow:
    &lt;ul&gt;
      &lt;li&gt;Process the combined matrix through dense layers&lt;/li&gt;
      &lt;li&gt;Split the results back into individual sequences&lt;/li&gt;
      &lt;li&gt;Handle self-attention computations separately for each sequence&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
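
&lt;p&gt;Here is a minimal NumPy sketch of the packing idea described above. The helper and variable names are illustrative, not from any particular serving system; self-attention would still run separately per sequence on the split outputs.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def packed_dense(inputs, W):
    # inputs: list of (s_i, h) arrays with different lengths s_i; W: (h, h) dense weight.
    lengths = [x.shape[0] for x in inputs]
    packed = np.concatenate(inputs, axis=0)   # (sum(s_i), h): one big matmul, no padding
    out = packed @ W
    splits = np.cumsum(lengths)[:-1]
    return np.split(out, splits, axis=0)      # back to per-sequence (s_i, h) pieces

rng = np.random.default_rng(0)
h = 64
seqs = [rng.standard_normal((s, h)) for s in (5, 17, 9)]   # three requests, different lengths
W = rng.standard_normal((h, h))
outs = packed_dense(seqs, W)
print([o.shape for o in outs])   # [(5, 64), (17, 64), (9, 64)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;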

&lt;p&gt;This approach offers several advantages:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Eliminates unnecessary padding computations&lt;/li&gt;
  &lt;li&gt;Maintains computational efficiency for dense layers&lt;/li&gt;
  &lt;li&gt;Preserves sequence-specific attention patterns&lt;/li&gt;
  &lt;li&gt;Balances throughput improvements with latency considerations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The strategy is particularly effective because it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Leverages the strengths of dense layer batching&lt;/li&gt;
  &lt;li&gt;Respects the inherent limitations of self-attention&lt;/li&gt;
  &lt;li&gt;Minimizes computational overhead&lt;/li&gt;
  &lt;li&gt;Provides flexibility in handling variable-length inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This method is presented in Orca [1]; reference available &lt;a href=&quot;https://www.usenix.org/conference/osdi22/presentation/yu&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;div class=&quot;work-references&quot;&gt;
&lt;p&gt;[1] Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., &amp;amp; Chun, B.-G. (2022). &quot;Orca: A Distributed Serving System for Transformer-Based Generative Models.&quot; In &lt;em&gt;16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)&lt;/em&gt; (pp. 521-538). Carlsbad, CA: USENIX Association.&lt;/p&gt;
&lt;p&gt;[2] Dando, A. (2020). &quot;Arithmetic Intensity and the Roofline Model.&quot; &lt;em&gt;Dando&apos;s Blog&lt;/em&gt;, April 2, 2020.&lt;/p&gt;
&lt;p&gt;[3] NVIDIA. &quot;Guide for GEMM.&quot; &lt;em&gt;NVIDIA Deep Learning Performance Documentation&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[4] Milakov, M., &amp;amp; Gimelshein, N. (2018). &quot;Online normalizer calculation for softmax.&quot; &lt;em&gt;arXiv preprint arXiv:1805.02867v2&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[5] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., &amp;amp; Ré, C. (2022). &quot;FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.&quot; &lt;em&gt;arXiv preprint arXiv:2205.14135v2&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[6] Dao, T. (2023). &quot;FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.&quot; &lt;em&gt;arXiv preprint arXiv:2307.08691&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[7] Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., &amp;amp; Dao, T. (2024). &quot;FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.&quot; &lt;em&gt;arXiv preprint arXiv:2407.08608&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[8] Chen, L. (2023). &quot;Transformer Batching.&quot; &lt;em&gt;Lequn Chen Blog&lt;/em&gt;, May 13, 2023.&lt;/p&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 01 Jan 2025 00:00:00 -0500</pubDate>
        <link>https://aakashvarma.github.io/transformer_bench/</link>
        <guid isPermaLink="true">https://aakashvarma.github.io/transformer_bench/</guid>
        
      </item>
    
      <item>
        <title>Balancing Memory &amp; Compute: Strategies to Manage KV Cache in LLMs</title>
        <description>&lt;p&gt;KV caching as is method to optimize the inference process of large language models (LLMs), reducing the compute requirements from quadratic to linear scaling with the sequence length. Specifically, KV caching involves storing the key and value tensors of past tokens in GPU memory during the generation process, thus avoiding re-computation at each step.&lt;/p&gt;

&lt;p&gt;KV caching represents a trade-off between memory usage and compute resources&lt;label for=&quot;sn-memory-compute&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-memory-compute&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Memory-Compute Trade-off:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Without KV Cache:&lt;/em&gt;&lt;br /&gt;&lt;code&gt;Compute = O(n²) per token&lt;br /&gt;Memory = O(1)&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;With KV Cache:&lt;/em&gt;&lt;br /&gt;&lt;code&gt;Compute = O(n) per token&lt;br /&gt;Memory = O(n)&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Where &lt;em&gt;n&lt;/em&gt; is sequence length &lt;/span&gt;. While it reduces computational load, it increases memory consumption due to the need to store cached tensors. In this post, we’ll delve into the challenges posed by the growing size of the KV cache and explore common strategies to address them.&lt;/p&gt;

&lt;p&gt;The size of the KV cache grows linearly with the batch size and the total sequence length. The per-token memory consumption depends on the precision used for storing the tensors.&lt;/p&gt;

&lt;p&gt;Let’s derive the formula for the total size of the KV cache:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Formula Parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;        &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;# Batch size
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sequence_length&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# Total sequence length
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_layers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_decoder_blocks&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# Number of decoder blocks / attention layers
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_heads&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_attention_heads&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Number of attention heads per layer
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d_head&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;head_dimension&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;# Hidden dimension of the attention layer
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p_a&lt;/span&gt;      &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precision_bytes&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# Precision (bytes)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The per-token memory consumption (in bytes) for the KV cache of a multi-head attention (MHA) model is:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;per_token_memory&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_layers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_heads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_head&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p_a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The total size of the KV cache (in bytes)&lt;label for=&quot;sn-formula&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-formula&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;This formula accounts for the fact that for each token in each sequence in the batch, we need to store two tensors (key and value) for each attention head and each attention layer. &lt;/span&gt;:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;total_kv_cache_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_layers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_heads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_head&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p_a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
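
&lt;p&gt;As a worked example, plugging rough Llama-2-7B-style values into the formula (32 layers, 32 heads, a head dimension of 128, and fp16 storage; these numbers are illustrative, not an official spec sheet) shows how quickly the cache grows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Rough Llama-2-7B-style configuration (illustrative): 32 layers, 32 heads, d_head = 128, fp16.
n_layers, n_heads, d_head, p_a = 32, 32, 128, 2

per_token = 2 * n_layers * n_heads * d_head * p_a
print(per_token / 1024)      # 512.0 KiB of cache per token

b, seq_len = 8, 4096
total = 2 * b * seq_len * n_layers * n_heads * d_head * p_a
print(total / 1024**3)       # 16.0 GiB for the whole batch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;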

&lt;p&gt;The challenge with KV caching lies in its unbounded growth with the total sequence length, which poses difficulties in managing GPU memory, especially since the total sequence length may not be known in advance.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/kv_cache_optimization/heatmap.png&quot; alt=&quot;Attention Heatmap&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 1: Attention (heat)map from the StreamingLLM paper: A lot of attention is consistently allocated to the first token and to the last neighboring tokens (local attention)
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;exploring-ways-to-reduce-memory-footprint-of-the-kv-cache&quot;&gt;Exploring ways to reduce memory footprint of the KV cache&lt;/h2&gt;

&lt;p&gt;Let’s explore ways to reduce the memory footprint of the KV cache by examining each component of the formula:&lt;/p&gt;

&lt;h3 id=&quot;optimizing-batch-size-b&quot;&gt;Optimizing Batch Size (&lt;em&gt;b&lt;/em&gt;)&lt;/h3&gt;

&lt;p&gt;While decreasing the batch size can indeed alleviate the memory footprint of the KV cache and subsequently reduce latency, it’s generally not preferable. This is because reducing the batch size lowers hardware utilization, diminishing cost efficiency. In upcoming posts, we’ll delve into why increasing the batch size is often more desirable.&lt;/p&gt;

&lt;h3 id=&quot;optimizing-sequence-length-seq_len&quot;&gt;Optimizing Sequence Length (&lt;em&gt;seq_len&lt;/em&gt;)&lt;/h3&gt;

&lt;p&gt;To mitigate the dependency on the total sequence length&lt;label for=&quot;sn-attention-pattern&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-attention-pattern&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Attention Pattern Analysis:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;• Strong attention to first tokens&lt;br /&gt;• Local attention clusters&lt;br /&gt;• Special token importance&lt;br /&gt;• Periodic patterns at:&lt;br /&gt;  • Sentence boundaries&lt;br /&gt;  • Paragraph breaks&lt;br /&gt;  • List elements &lt;/span&gt;, one approach is to refrain from storing keys and values for all tokens in the sequence. This strategy might involve recomputing missing keys and values on each iteration, prioritizing computational resources over GPU memory consumption, especially when memory bandwidth is a limiting factor.&lt;/p&gt;

&lt;p&gt;Another perspective involves not storing keys and values for tokens that the model pays little or no attention to. This could be intentional in models trained to attend only to specific parts of the sequence, such as Mistral-7B, which utilizes sliding window attention (SWA) or local attention. With SWA, attention layers focus solely on neighboring tokens (only 4096), limiting the number of tensor pairs stored per sequence to the window size (4096).&lt;/p&gt;

&lt;h3 id=&quot;more-methods-for-memory-reduction&quot;&gt;More Methods for Memory Reduction&lt;/h3&gt;

&lt;h4 id=&quot;streamingllm-framework&quot;&gt;StreamingLLM Framework&lt;/h4&gt;

&lt;p&gt;Targeting models with finite-length context windows, this framework observes that initial tokens gather significant attention&lt;label for=&quot;sn-streamingllm&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-streamingllm&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;StreamingLLM Memory Usage:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;Fixed part = n_sink tokens&lt;br /&gt;Sliding part = window_size tokens&lt;br /&gt;&lt;br /&gt;Total Memory = (n_sink + window_size) × token_size&lt;br /&gt;vs. Original = full_context × token_size&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Typical savings: 40-60% with minimal performance impact &lt;/span&gt;. It builds a sliding window by retaining only the first positional tokens (“sink tokens”) and the last neighboring tokens (local attention) in the cache. The cache has a fixed length with both a fixed part and a sliding part.&lt;/p&gt;
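
&lt;p&gt;A minimal sketch of which token positions a StreamingLLM-style cache would retain; the n_sink and window_size values here are illustrative defaults, not values taken from the paper:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def streaming_positions(seq_len, n_sink=4, window_size=1024):
    # Keep the first n_sink attention-sink tokens plus the most recent window_size tokens.
    sink = list(range(min(n_sink, seq_len)))
    recent_start = max(n_sink, seq_len - window_size)
    recent = list(range(recent_start, seq_len))
    return sink + recent

for seq_len in (100, 5000, 100000):
    print(seq_len, len(streaming_positions(seq_len)))   # cache size is capped at n_sink + window_size
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;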

&lt;h4 id=&quot;h2o-and-scissorhands-methods&quot;&gt;H2O and Scissorhands Methods&lt;/h4&gt;

&lt;p&gt;These methods compress the KV cache by setting a maximum number of cached tokens (budget) and discarding tokens when the cache budget is reached. H2O discards one token at a time, while Scissorhands drops tokens based on a target compression ratio. Both methods exploit the observation that influential tokens at a given step tend to remain influential in future steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache Eviction Policy&lt;/strong&gt; - Both H2O and Scissorhands employ cache eviction policies to determine which tokens to discard. Scissorhands retains the most recent tokens and tokens with the highest attention scores within a history window. H2O discards tokens with the lowest cumulated attention scores, retaining tokens consistently achieving high attention scores across iterations.&lt;/p&gt;
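
&lt;p&gt;A minimal sketch of an H2O-style eviction step. The cumulative attention scores here are made-up numbers purely for illustration; real implementations track them per head and per layer.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def evict_lowest(cache_positions, cumulative_attention, budget):
    # H2O-style policy: once the budget is exceeded, drop the cached token whose
    # cumulative attention score is lowest (one token per generation step).
    if len(cache_positions) &lt;= budget:
        return cache_positions
    victim = min(cache_positions, key=lambda pos: cumulative_attention[pos])
    return [pos for pos in cache_positions if pos != victim]

cumulative_attention = {0: 5.1, 1: 0.2, 2: 3.7, 3: 0.9, 4: 2.4}
print(evict_lowest([0, 1, 2, 3, 4], cumulative_attention, budget=4))   # drops position 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;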

&lt;h4 id=&quot;fastgen-method&quot;&gt;FastGen Method&lt;/h4&gt;

&lt;p&gt;FastGen focuses on preserving model accuracy&lt;label for=&quot;sn-fastgen&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-fastgen&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;FastGen sets an error threshold (ε) for approximation:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;Error = ||A - Â||_F / ||A||_F&lt;br /&gt;&lt;br /&gt;Where:&lt;br /&gt;A = Original attention matrix&lt;br /&gt;Â = Approximated attention matrix&lt;br /&gt;||·||_F = Frobenius norm&lt;br /&gt;&lt;br /&gt;Typical bounds:&lt;br /&gt;ε = 0.1  → ~70% compression&lt;br /&gt;ε = 0.05 → ~50% compression&lt;br /&gt;ε = 0.01 → ~30% compression&lt;/code&gt; &lt;/span&gt; by setting a maximum approximation error for the attention matrix instead of a cache budget. It profiles the model’s attention layers to determine compression policies during a prefill phase. These policies, such as keeping special tokens or punctuation tokens, are applied to the KV cache at each generation step to meet the error target. If the target is too stringent, FastGen falls back to regular KV caching.&lt;/p&gt;

&lt;h3 id=&quot;optimizing-number-of-layers-n_layers&quot;&gt;Optimizing Number of Layers (&lt;em&gt;n_layers&lt;/em&gt;)&lt;/h3&gt;

&lt;p&gt;Reducing the number of layers in a language model does not offer significant gains in terms of memory reduction. Typically, smaller models naturally have fewer layers. Therefore, if a smaller model suits your use case and performs adequately, opting for it is a straightforward solution.&lt;/p&gt;

&lt;h3 id=&quot;optimizing-number-of-attention-heads-n_heads&quot;&gt;Optimizing Number of Attention Heads (&lt;em&gt;n_heads&lt;/em&gt;)&lt;/h3&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/kv_cache_optimization/attention_types.jpg&quot; alt=&quot;Types of Attention&quot; class=&quot;center&quot; /&gt;
&lt;figcaption&gt;
Figure 2: Types of Attention
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The multi-query attention (MQA) and grouped-query attention (GQA) architectures provide strategies for reducing the key-value (KV) cache size in models based on the Transformer architecture&lt;label for=&quot;sn-mqa-gqa&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-mqa-gqa&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;MQA vs GQA Memory:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;MHA: Memory = H × d × 2&lt;br /&gt;MQA: Memory = d × 2&lt;br /&gt;GQA: Memory = g × d × 2&lt;br /&gt;&lt;br /&gt;Where:&lt;br /&gt;H = Total heads&lt;br /&gt;d = Head dimension&lt;br /&gt;g = Number of groups (g &amp;lt; H)&lt;br /&gt;&lt;br /&gt;Real-world example:&lt;br /&gt;32 heads → 8 groups = 75% reduction&lt;/code&gt; &lt;/span&gt;. These approaches allow for more efficient use of resources without sacrificing model performance significantly.&lt;/p&gt;

&lt;p&gt;In MQA, all query heads share the same single key and value heads, meaning that each query head computes attention scores using the same keys, and all heads output values computed using the same values but different attention scores.&lt;/p&gt;

&lt;p&gt;GQA splits the query heads into groups, with each group sharing the same unique key-value heads. This allows for a smoother reduction in the number of key-value heads compared to MQA, providing a compromise between model representation capacity and KV cache size.&lt;/p&gt;
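
&lt;p&gt;Reusing the per-token formula from earlier, the only quantity that changes across MHA, MQA, and GQA is the number of key-value heads that actually need to be cached. A quick sketch (the 32 query heads and 8 GQA groups are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def per_token_kv_bytes(n_layers, n_kv_heads, d_head, p_a):
    # Only the key-value heads are cached, so MQA and GQA shrink the cache directly.
    return 2 * n_layers * n_kv_heads * d_head * p_a

n_layers, d_head, p_a = 32, 128, 2
print(per_token_kv_bytes(n_layers, 32, d_head, p_a))   # MHA: 32 KV heads
print(per_token_kv_bytes(n_layers, 8, d_head, p_a))    # GQA: 8 groups, 4x smaller
print(per_token_kv_bytes(n_layers, 1, d_head, p_a))    # MQA: 1 shared KV head, 32x smaller
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;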

&lt;p&gt;These architectures have been implemented in various models by different research groups, such as Google Research’s PaLM, TII’s Falcon models, Meta’s Llama-2 (limited to 70B only), and Mistral AI’s Mistral-7B.&lt;/p&gt;

&lt;h3 id=&quot;optimizing-hidden-dimension-d_head&quot;&gt;Optimizing Hidden Dimension (&lt;em&gt;d_head&lt;/em&gt;)&lt;/h3&gt;

&lt;p&gt;Once again, there is not much to gain here unless you are willing to switch to a different model.&lt;/p&gt;

&lt;h3 id=&quot;optimizing-precision-p_a&quot;&gt;Optimizing Precision (&lt;em&gt;p_a&lt;/em&gt;)&lt;/h3&gt;

&lt;p&gt;Quantizing the key-value (KV) cache is an effective method for reducing its size&lt;label for=&quot;sn-precision&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;sn-precision&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;strong&gt;Precision Impact:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;Memory reduction by precision:&lt;br /&gt;FP32 (4 bytes) → FP16 (2 bytes): 50% reduction&lt;br /&gt;FP16 (2 bytes) → INT8 (1 byte): 50% reduction&lt;br /&gt;INT8 (1 byte) → INT4 (0.5 bytes): 50% reduction&lt;/code&gt; &lt;/span&gt;, but it’s important to use quantization algorithms that operate on both weights and activations, not just weights. Algorithms like LLM.int8() or SmoothQuant are suitable for this purpose, as they quantize both weights and activations, resulting in a reduced memory footprint.&lt;/p&gt;

&lt;p&gt;However, for inference tasks, where memory bandwidth is the limiting factor rather than compute power, quantizing the cached tensors before moving them to GPU memory and dequantizing them afterward could suffice. This approach reduces the memory footprint without the overhead of more complex quantization algorithms.&lt;/p&gt;
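
&lt;p&gt;A minimal sketch of that simpler approach: symmetric per-tensor int8 quantization of a cached key or value vector before it is stored, and dequantization when it is read back. This is a pure-Python illustration of the idea, not any particular library’s kernel.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def quantize_int8(values):
    # Symmetric per-tensor quantization: map [-max_abs, max_abs] onto integers in [-127, 127].
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize_int8(quantized, scale):
    return [q * scale for q in quantized]

k_vector = [0.12, -1.7, 0.003, 2.5, -0.44]
quantized, scale = quantize_int8(k_vector)
print(quantized)                            # small integers: 1 byte each instead of 2 or 4
print(dequantize_int8(quantized, scale))    # close to the original values
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;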

&lt;p&gt;Some inference systems, like FlexGen, NVIDIA TensorRT-LLM, and vLLM framework, already incorporate KV cache quantization features. They store the KV cache and model weights in reduced bit formats (4-bit or 8-bit) dynamically without requiring a calibration step at each iteration.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;div class=&quot;work-references&quot;&gt;
&lt;p&gt;[1] Xiao, G., Tian, Y., Chen, B., Han, S., &amp;amp; Lewis, M. (2023). &quot;Efficient Streaming Language Models with Attention Sinks.&quot; In &lt;em&gt;International Conference on Learning Representations (ICLR)&lt;/em&gt;. arXiv preprint arXiv:2309.17453.&lt;/p&gt;
&lt;p&gt;[2] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., &amp;amp; Chen, B. (2023). &quot;H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.&quot; In &lt;em&gt;Advances in Neural Information Processing Systems (NeurIPS)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[3] Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., &amp;amp; Shrivastava, A. (2023). &quot;Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time.&quot; In &lt;em&gt;Advances in Neural Information Processing Systems (NeurIPS)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[4] Ge, Y., Qin, Y., Tang, J., &amp;amp; Liu, Y. (2024). &quot;Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs.&quot; In &lt;em&gt;International Conference on Learning Representations (ICLR)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[5] Shazeer, N. (2019). &quot;Fast Transformer Decoding: One Write-Head is All You Need.&quot; &lt;em&gt;arXiv preprint arXiv:1911.02150&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[6] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., &amp;amp; Sanghai, S. (2023). &quot;GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.&quot; In &lt;em&gt;Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[7] Dettmers, T., Lewis, M., Belkada, Y., &amp;amp; Zettlemoyer, L. (2022). &quot;LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.&quot; In &lt;em&gt;Advances in Neural Information Processing Systems (NeurIPS)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[8] Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., &amp;amp; Han, S. (2023). &quot;SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.&quot; In &lt;em&gt;International Conference on Machine Learning (ICML)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[9] Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., &amp;amp; Zhang, C. (2023). &quot;FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.&quot; In &lt;em&gt;International Conference on Machine Learning (ICML)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;[10] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., &amp;amp; Stoica, I. (2023). &quot;Efficient Memory Management for Large Language Model Serving with PagedAttention.&quot; In &lt;em&gt;Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;

</description>
        <pubDate>Mon, 27 May 2024 00:00:00 -0400</pubDate>
        <link>https://aakashvarma.github.io/kv_cache_optimization/</link>
        <guid isPermaLink="true">https://aakashvarma.github.io/kv_cache_optimization/</guid>
        
      </item>
    
  </channel>
</rss>

