LLM Inference Optimization 101


Fast inference makes the world go brrr

Large Language Models (LLMs) generate coherent, natural language responses, effectively automating a multitude of tasks that were previously exclusive to humans. As many key players in the field, such as Jensen Huang and Ilya Sutskever, have recently alluded to, we’re in an era of agentic AI. This new paradigm seeks to revolutionize various aspects of our lives, from personalized medicine and education to intelligent assistants, and beyond.

However, it is important to be aware that while these models are getting increasingly powerful, widespread adoption is hindered by the massive costs of running them, frustrating wait times that render certain real-world applications impractical, as well as, of course, their growing carbon footprint. To reap the benefits of this technology while mitigating costs and power consumption, it is critical that we continue to optimize every facet of LLM inference.

The goal of this article is to give readers an overview of current ways in which researchers and deep learning practitioners are optimizing LLM inference.

What is LLM inference?

Similar to how one uses what they have learned to solve a new problem, inference is when a trained AI model uses patterns detected during training to infer and make predictions on new data. This inference process is what enables LLMs to perform tasks like text completion, translation, summarization, and conversation.

Text Generation Inference with 1-Click Models

DigitalOcean has collaborated with Hugging Face to offer 1-click models. This allows for the integration of GPU Droplets with state-of-the-art open-source LLMs in Text Generation Inference (TGI)-optimized container applications. This means many of the inference optimizations covered in this article (e.g., tensor parallelism, quantization, FlashAttention, paged attention) are already taken care of and maintained by Hugging Face. For information on how to use these 1-click models, check out our article Getting Started with LLMs.
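As a minimal sketch, a deployed TGI endpoint can be queried from Python with the huggingface_hub client; the endpoint URL below is a placeholder and the generation parameters are arbitrary.

```python
# Sketch: querying a running TGI server (e.g., a 1-click model on a GPU Droplet).
# The URL is a placeholder; substitute the address of your own endpoint.
from huggingface_hub import InferenceClient

client = InferenceClient("http://your-gpu-droplet-address:8080")  # placeholder URL

output = client.text_generation(
    "What is the difference between the prefill and decode phases?",
    max_new_tokens=128,
)
print(output)
```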

Prerequisites

While this article includes some introductory deep learning concepts, many of the topics discussed are relatively advanced. Those wishing to better understand inference optimization are encouraged to explore the links scattered throughout the article and in the references section.

It is advised that readers have an understanding of neural network fundamentals, the attention mechanism, the transformer, and data types before proceeding.

It would also help to be knowledgeable about the GPU memory hierarchy.

The article, Introduction to GPU Performance Optimization, provides context on how GPUs can be programmed to accelerate neural network training and inference. It also explains key terms such as latency and throughput.

The Two Phases of LLM Inference

LLM inference can be divided into two phases: prefill and decode. These stages are separated due to the different computational requirements of each. While prefill, a highly parallelized matrix-matrix operation that saturates GPU utilization, is compute-bound, decode, a matrix-vector operation that underutilizes the GPU's compute capability, is memory-bound.

The prefill phase can be likened to reading an entire document at once and processing all the words simultaneously to write the first word of a response, whereas the decode phase can be compared to continuing to write that response word by word, where the choice of each word depends on what was written before.

Let’s explore why prefill is compute-bound and decode is memory-bound.

Prefill

In the prefill stage, the LLM processes the entire input prompt at once to generate the first response token. This involves performing a full forward pass through the transformer layers for every token in the prompt simultaneously. While memory access is needed during prefill, the computational work of processing the tokens in parallel dominates the performance profile.

Decode

In the decode stage, text is generated autoregressively, with the next token predicted one at a time given all previous tokens. The decoding process is memory-bound due to its need to repeatedly access historical context. For each new token generated, the model must load the attention cache (key/value states, also known as the KV cache) for all previous tokens, requiring frequent memory accesses that become more intensive as the sequence grows longer. Even though the actual computation per token during decode is considerably smaller than during prefill, the repeated retrieval of cached attention states from memory makes memory bandwidth and redundant memory accesses the limiting factors of the decode phase.
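To make the two phases concrete, here is a minimal sketch using a small Hugging Face causal LM; the model choice is illustrative, and production servers implement this loop with far more sophistication.

```python
# Sketch of prefill vs. decode with a small causal LM (model choice is illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")

with torch.no_grad():
    # Prefill: one parallel forward pass over the whole prompt builds the KV cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values
    token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: autoregressive loop; only the newest token is fed in, the rest
    # of the context comes from the cached key/value states.
    generated = [token]
    for _ in range(10):
        out = model(token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(token)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```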

Metrics can be used to measure performance and identify potential bottlenecks during these two inference stages.

Metrics

  • Time-to-First-Token (TTFT): the time to process the prompt and generate the first token. TTFT tells us how long prefill took. The longer the prompt, the longer the TTFT, since the attention mechanism needs the entire input sequence to compute the KV cache. Inference optimization seeks to minimize TTFT.
  • Inter-token Latency (ITL), also known as Time per Output Token: the average time between consecutive tokens. ITL tells us the rate at which decoding (token generation) occurs. Consistent ITLs are ideal, as they indicate efficient memory management, high GPU memory bandwidth, and well-optimized attention computation.
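A rough way to observe both metrics locally is to time a manual generation loop like the one above; this is only a wall-clock sketch (single request, greedy decoding, no warm-up), not a rigorous benchmark.

```python
# Sketch: crude TTFT / ITL measurement around a manual generation loop.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tok("Explain why decode is memory-bound.", return_tensors="pt")

stamps = []
with torch.no_grad():
    start = time.perf_counter()
    out = model(**inputs, use_cache=True)                        # prefill
    past = out.past_key_values
    token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    stamps.append(time.perf_counter())                           # first token ready
    for _ in range(32):                                          # decode
        out = model(token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        stamps.append(time.perf_counter())

ttft = stamps[0] - start
itl = [b - a for a, b in zip(stamps, stamps[1:])]
print(f"TTFT: {ttft * 1e3:.1f} ms | mean ITL: {1e3 * sum(itl) / len(itl):.1f} ms")
```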

Optimizing Prefill and Decode

Speculative Decoding

Speculative decoding uses a smaller, faster model to generate multiple candidate tokens at once, and then verifies them with the larger target model.
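The transformers library exposes this idea as assisted generation; below is a minimal sketch, assuming a draft and target model that share a tokenizer (the GPT-2 pair is purely illustrative).

```python
# Sketch: speculative (assisted) decoding with a small draft model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-large")
target = AutoModelForCausalLM.from_pretrained("gpt2-large")   # large target model
draft = AutoModelForCausalLM.from_pretrained("gpt2")          # smaller, faster draft model

inputs = tok("Speculative decoding speeds up inference by", return_tensors="pt")
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```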

Chunked Prefills and Decode-Maximal Batching

SARATHI shows how chunked prefills can enable the splitting of large prefills into manageable chunks, which can then be batched with decode requests (decode-maximal batching) for efficient processing.
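A toy sketch of the chunked-prefill idea (without the decode-maximal batching part): the prompt is processed in fixed-size chunks while the KV cache is carried forward, so each forward pass stays within a bounded compute budget. The chunk size and model here are arbitrary choices.

```python
# Sketch: processing a long prompt in chunks while carrying the KV cache forward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("a fairly long prompt " * 100, return_tensors="pt").input_ids
chunk_size = 64
past = None

with torch.no_grad():
    for i in range(0, ids.shape[1], chunk_size):
        out = model(ids[:, i : i + chunk_size], past_key_values=past, use_cache=True)
        past = out.past_key_values        # cache now covers every chunk processed so far

first_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # ready to start decoding
```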

Batching

Batching groups inference requests together, with larger batch sizes corresponding to higher throughput. However, batch sizes can only be increased up to a certain point due to limited GPU on-chip memory.

Batch Size

To achieve maximum utilization of the hardware, one can try to find the critical ratio where there is a balance between two key limiting factors:

  • The time needed to transfer weights between memory and compute units (limited by memory bandwidth)
  • The time required for the actual computational operations (limited by FLOPS)

While these two times remain roughly equal, the batch size can be increased without incurring any performance penalty. Beyond this point, increasing the batch size creates bottlenecks in either memory transfer or computation. To find an optimal batch size, profiling is important.
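As a back-of-the-envelope illustration, the critical ratio can be estimated from a GPU's peak FLOPS and memory bandwidth; the numbers below are approximate public specs for an A100-80GB-class GPU and are only meant to show the arithmetic.

```python
# Sketch: estimating the critical ops:byte ratio from (approximate) GPU specs.
peak_flops = 312e12        # ~peak FP16/BF16 tensor-core FLOPS (A100-class, approximate)
mem_bandwidth = 2.0e12     # ~HBM bandwidth in bytes/s (A100 80GB, approximate)

critical_ratio = peak_flops / mem_bandwidth
print(f"critical arithmetic intensity ≈ {critical_ratio:.0f} FLOPs per byte")

# During decode, every generated token re-reads the model weights once, so
# (very roughly) batch sizes well below this ratio leave the GPU memory-bound,
# while much larger batches shift the bottleneck toward computation.
```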

KV cache management plays a critical role in determining the maximum batch size and improving inference. Thus, the remainder of the article will focus on managing the KV cache.

KV Cache Management

When looking at how memory is allocated on the GPU during serving, the model weights remain fixed and the activations use only a fraction of the GPU's memory resources compared to the KV cache. Therefore, freeing up space for the KV cache is critical. This can be achieved by reducing the model weight memory footprint through quantization, reducing the KV cache memory footprint with modified architectures and attention variants, as well as pooling memory from multiple GPUs with parallelism.
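To see why the KV cache dominates, it helps to estimate its size. A rough calculation for a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128) in FP16 is sketched below; the batch size and sequence length are arbitrary.

```python
# Sketch: rough KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#         * bytes per value * sequence length * batch size
num_layers, num_kv_heads, head_dim = 32, 32, 128   # Llama-2-7B-like configuration
bytes_per_value = 2                                # FP16
seq_len, batch_size = 4096, 8                      # arbitrary serving scenario

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len * batch_size
print(f"KV cache ≈ {kv_bytes / 1e9:.1f} GB")       # ≈ 17 GB, a large share of an 80 GB GPU
```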

Quantization

Quantization reduces the number of bits needed to store the model's parameters (e.g., weights, activations, and gradients). This technique reduces inference latency by trading some accuracy for a smaller memory footprint.
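As a minimal sketch, weight-only quantization can be applied at load time with the transformers + bitsandbytes integration (requires a CUDA GPU and the bitsandbytes package; the model name is illustrative).

```python
# Sketch: loading a causal LM with 4-bit quantized weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16, store weights in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # illustrative; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)
```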

Attention and its variants

A quick review of queries, keys, and values:

  • Queries: represent the context or question.
  • Keys: represent the information being attended to.
  • Values: represent the information being retrieved.

Attention weights are computed by comparing queries with keys, and are then used to weight the values, producing the final output representation.

Query (Prompt) → Attention Weights → Relevant Information (Values)
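A minimal single-head sketch of this computation (scaled dot-product attention), with shapes chosen arbitrarily:

```python
# Sketch: scaled dot-product attention for one head.
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # compare queries with keys
    weights = F.softmax(scores, dim=-1)                      # attention weights
    return weights @ v                                       # weighted sum of values

q = k = v = torch.randn(10, 64)
out = attention(q, k, v)    # (10, 64) output representation
```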

Sliding Window Attention (SWA), also known as local attention, restricts attention to a fixed-size window that slides over the sequence. While SWA alone does not scale well to long inputs, Character AI found that speed and quality were not impacted on long sequences when interleaving SWA and global attention, with adjacent global attention layers sharing a KV cache (cross-layer attention).

Local Attention vs. Global Attention

Local and global attention mechanisms differ in key aspects. Local attention uses less computation (O(n * w)) and memory by focusing on token windows, enabling faster inference, particularly for long sequences, but may miss long-range dependencies. Global attention, while computationally more expensive (O(n^2)) and memory-intensive due to processing all token pairs, is able to better capture full context and long-range dependencies at the cost of slower inference speed.
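A small sketch of the local-attention constraint: a sliding-window mask lets each token attend only to itself and the previous w - 1 tokens, which is where the O(n * w) cost comes from. The window size here is arbitrary.

```python
# Sketch: building a sliding-window (local) attention mask.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # True where attention is allowed

print(sliding_window_mask(6, 3).int())       # each row attends to at most 3 positions
```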

Paged Attention

Inspired by virtual memory allocation, PagedAttention proposed a method for optimizing the KV cache that takes the variation in the number of tokens across requests into consideration.
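The core bookkeeping idea can be illustrated with a toy block table (a conceptual sketch, not vLLM's actual implementation): the KV cache lives in fixed-size physical blocks, and each request maps its logical token positions onto whichever blocks it has been assigned, so memory is allocated on demand rather than preallocated contiguously for the maximum sequence length.

```python
# Toy sketch of PagedAttention-style block-table bookkeeping.
BLOCK_SIZE = 16                    # tokens per physical KV block
free_blocks = list(range(1024))    # pool of physical block ids
block_tables = {}                  # request id -> list of physical block ids

def reserve_slot(request_id: str, tokens_so_far: int) -> None:
    """Grab a new physical block only when the request's current block is full."""
    table = block_tables.setdefault(request_id, [])
    if tokens_so_far % BLOCK_SIZE == 0:
        table.append(free_blocks.pop())

for t in range(40):                # simulate a request growing to 40 tokens
    reserve_slot("req-0", t)

print(block_tables["req-0"])       # 3 blocks (48 slots) cover 40 tokens with little waste
```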

FlashAttention

There are three versions of FlashAttention, with FlashAttention-3 being the latest release, optimized for Hopper GPUs. Each iteration of this algorithm takes a hardware-aware approach to make the attention computation as fast as possible. Past articles written on FlashAttention include Designing Hardware-Aware Algorithms: FlashAttention and FlashAttention-2.
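In PyTorch, the fused scaled_dot_product_attention op can dispatch to a FlashAttention kernel on supported GPUs, avoiding materializing the full attention matrix; a minimal sketch (requires a CUDA GPU, shapes are arbitrary):

```python
# Sketch: fused attention that can use a FlashAttention kernel on supported GPUs.
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in half precision on the GPU
q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 1024, 64])
```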

Model Architectures: Dense Models vs. Mixture of Experts

Dense LLMs are the standard architecture, where all parameters are actively engaged during inference.

Mixture of Experts (MoE) LLMs are composed of multiple specialized sub-networks with a routing mechanism. Because only the relevant experts are activated for each input, they often show improved parameter efficiency and faster inference than dense models.
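A toy sketch of the routing idea (not a real MoE layer): a router scores the experts per token and only the top-k experts run, which is where the parameter efficiency comes from. Dimensions and expert counts are arbitrary.

```python
# Toy sketch of top-k expert routing in a Mixture of Experts layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_experts, top_k = 64, 8, 2
router = nn.Linear(d_model, n_experts)
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

x = torch.randn(4, d_model)                         # 4 tokens
with torch.no_grad():
    probs = F.softmax(router(x), dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)

    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for p, idx in zip(topk_probs[t], topk_idx[t]):
            out[t] += p * experts[int(idx)](x[t])   # only top_k of n_experts run per token
```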

Parallelism

Larger models often require multiple GPUs to run effectively. There are a number of different parallelization strategies that allow for multi-GPU inference.

  • Data parallelism (partitions: data): splits different batches of data across devices. Purpose: distributes memory and computation for large datasets that wouldn't fit on a single device.
  • Tensor parallelism (partitions: weight tensors): splits tensors across multiple devices either row-wise or column-wise. Purpose: distributes memory and computation for large tensors that wouldn't fit on a single device.
  • Pipeline parallelism (partitions: model layers, vertically): splits different stages of the full model pipeline across devices so they can run in parallel. Purpose: improves throughput by overlapping computation of different model stages.
  • Context parallelism (partitions: input sequences): divides input sequences into segments across devices. Purpose: reduces the memory bottleneck for long sequence inputs.
  • Expert parallelism (partitions: MoE experts): splits experts, where each expert is a smaller model, across devices. Purpose: allows for larger models with improved performance by distributing computation across multiple experts.
  • Fully sharded data parallelism (partitions: data, model, optimizer, and gradients): shards components across devices, processes data in parallel, and synchronizes after each training step. Purpose: enables training of extremely large models that exceed the memory capacity of a single device by distributing model parameters and activations.
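As a single-process illustration of the tensor-parallel entry above, a weight matrix can be split column-wise into shards that are multiplied independently and then concatenated; real multi-GPU implementations add communication collectives (e.g., an all-gather) between devices.

```python
# Toy sketch of column-wise tensor parallelism on one process.
import torch

x = torch.randn(4, 512)             # activations
w = torch.randn(512, 1024)          # full weight matrix

w0, w1 = w.chunk(2, dim=1)          # column shards for "device 0" and "device 1"
y0, y1 = x @ w0, x @ w1             # each shard computes its partial output locally
y = torch.cat([y0, y1], dim=1)      # stand-in for an all-gather across devices

assert torch.allclose(y, x @ w, atol=1e-4)
```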

Conclusion

It’s undeniable that inference is an exciting area of research and optimization. The field moves fast, and to keep up, inference needs to move faster. In addition to more agentic workflows, we’re seeing more dynamic inference strategies that allow models to “think longer” on harder problems. For example, OpenAI’s o1 model shows consistent performance improvements on challenging math and programming tasks when more computational resources are devoted during inference.

Well, thank you so much for reading! This article is certainly not exhaustive of everything there is in inference optimization. Stay tuned for more exciting articles on this topic and adjacent ones.

References and Other Excellent Resources

Blog posts:

Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog
LLM Inference at Scale with TGI
Looking back at speculative decoding (Google Research)
LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Medium
A Visual Guide to Quantization - by Maarten Grootendorst
Optimizing AI Inference at Character.AI
Optimizing AI Inference at Character.AI (Part Deux)

Papers:

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Efficient Memory Management for Large Language Model Serving with PagedAttention

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

The Llama 3 Herd of Models (Section 6)

Talks:

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

NVIDIA CEO Jensen Huang Keynote at CES 2025

Building Machine Learning Systems for a Trillion Trillion Floating Point Operations :: Jane Street

Dylan Patel - Inference Math, Simulation, and AI Megaclusters - Stanford CS 229S - Autumn 2024

How does batching work on modern GPUs?

GitHub Links:

Sharan Chetlur - Nvidia/Presentation Slides - High Performance LLM Serving on Nvidia GPUs

GitHub - huggingface/search-and-learn: Recipes to scale inference-time compute of open models
