Network Performance and Traffic Characteristics of LLM Services
A Research Survey
1. Introduction and Motivation
Large Language Models have rapidly become one of the most significant new workloads on the Internet. From ChatGPT crossing 700 million weekly active users by mid-2025 to the steady integration of LLM inference into enterprise SaaS products, search engines, and mobile applications, these services now generate traffic at a scale that demands attention from the networking community. Yet unlike traditional web browsing, video streaming, or even earlier cloud computing workloads, LLM inference traffic exhibits fundamentally different characteristics at nearly every layer of the stack, from the transport protocol up through the wide-area network.
This report synthesizes findings from recent academic publications, industry traces, standards-body discussions, and operator reports to answer a core question: how do the network traffic patterns produced by LLM services differ from those of traditional web applications, and what implications do these differences carry for network design and operations?
2. How LLM Inference Traffic Differs from Traditional Web Traffic
2.1 The Structural Asymmetry of Request and Response
Traditional web traffic follows a well-understood pattern. A client sends a short HTTP request, and the server returns a response, typically within tens to hundreds of milliseconds. The response payload for a typical web page object ranges from a few kilobytes to a few megabytes, and the connection is short-lived or multiplexed over HTTP/2 or HTTP/3 streams.
LLM inference breaks this pattern in several important ways. Google Cloud's networking team has described the core distinction clearly: web applications exhibit predictable traffic patterns with requests and responses processed in small time windows measured in milliseconds, whereas generative AI applications exhibit highly variable request and response times, with inference latencies ranging from seconds to minutes. A single LLM query can consume 100% of a GPU's compute time, whereas a conventional web server interleaves thousands of concurrent requests across its resources.
The request-response structure is also asymmetric in a way that differs from web traffic. The BurstGPT trace study, which captured over 10 million real-world traces from Azure OpenAI GPT services across 213 days, found that request token distributions follow a Zipf pattern (many short prompts, a long tail of longer ones), while response token distributions show a bimodal or shifted Gaussian shape depending on the model. ChatGPT, for instance, tends to produce shorter responses with a bimodal length distribution, whereas Llama-2-13b-chat produces longer, more uniformly distributed responses.
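To make these shapes concrete, the sketch below samples prompt and response lengths from a Zipf-like and a bimodal distribution respectively. All parameters here are invented for illustration; they are not the values fitted in the BurstGPT study.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prompt_tokens(n):
    """Zipf-like request lengths: many short prompts, a heavy tail.
    Exponent and cap are illustrative, not fitted to BurstGPT."""
    lengths = rng.zipf(a=1.5, size=n)
    return np.clip(lengths * 8, 1, 8192)  # scale into a plausible token range

def sample_response_tokens(n, p_short=0.6):
    """Bimodal response lengths: a short mode (terse answers) and a
    long mode (full generations), each roughly Gaussian."""
    short = rng.normal(loc=60, scale=20, size=n)
    long_ = rng.normal(loc=400, scale=120, size=n)
    pick_short = rng.random(n) < p_short
    return np.clip(np.where(pick_short, short, long_), 1, 4096).astype(int)

prompts = sample_prompt_tokens(10_000)
responses = sample_response_tokens(10_000)
print(prompts.mean(), np.median(prompts), responses.mean())
```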
The MCP-enabled agent workload analysis (arxiv:2511.07426) added another dimension to this: in agentic workflows where LLMs interact with tools, the completion-to-prompt token ratio is extremely low because each request carries substantial system prompts and interaction history. This creates a "prompt-heavy" traffic pattern where the upstream payload (client to server) is disproportionately large compared to traditional web interactions.
2.2 Token Streaming and Long-Lived Connections
Perhaps the most distinctive networking characteristic of LLM inference is token streaming. Rather than returning a complete response, LLM services generate tokens autoregressively (one at a time) and stream each token to the client as it is produced. This creates a fundamentally different traffic flow from anything the web has historically served.
The dominant protocol for this delivery is Server-Sent Events (SSE) over HTTP. ChatGPT, Claude, Gemini, and essentially all major LLM APIs use SSE to push token-by-token updates from server to client. This means a single LLM interaction results in a long-lived HTTP connection (often 10 to 60 seconds or more) during which the server sends a stream of small JSON-wrapped text chunks. Each chunk typically contains just one or a few tokens, making the individual payload very small but the connection duration very long relative to a typical web request.
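The following sketch shows what consuming such a stream looks like from the client side, assuming an OpenAI-style chat completions endpoint. The URL, headers, and model name are placeholders; the `data:` line framing and the `[DONE]` sentinel follow the SSE convention these APIs commonly use.

```python
import json
import requests  # pip install requests

# Placeholder endpoint and credentials; only the SSE framing is standard.
resp = requests.post(
    "https://api.example.com/v1/chat/completions",
    headers={"Authorization": "Bearer <API_KEY>", "Accept": "text/event-stream"},
    json={"model": "example-model", "stream": True,
          "messages": [{"role": "user", "content": "Hello"}]},
    stream=True,  # keep the HTTP connection open and read chunks as they arrive
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip SSE keep-alives and blank separator lines
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    # Each chunk typically carries just one or a few tokens of text.
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
```

The connection stays open for the full generation, so a 60-second response is a single HTTP transaction delivering hundreds of tiny payloads.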
This creates traffic that looks quite unlike traditional web browsing. Instead of a burst of packets for a page load followed by idle time, an LLM session produces a sustained trickle of small packets over an extended period. The inter-packet timing is governed not by network conditions or server load alone, but by the token generation speed of the model, which is itself a function of GPU compute, model size, batch scheduling, and KV cache management.
2.3 Uplink-Downlink Ratio Shift
A striking finding from the Ericsson Mobility Report's measurement of GenAI mobile apps (June 2025) is the unusual uplink traffic share. Typical mobile network traffic is distributed roughly 90% downlink to 10% uplink. GenAI traffic, by contrast, shows 74% downlink and 26% uplink. Some applications deviate even further: DeepSeek and Microsoft Copilot exhibit a roughly 50/50 uplink/downlink split, driven by the large prompt payloads users send.
This matters for network provisioning. Access networks, particularly wireless ones, have historically been dimensioned with a strong downlink bias. If GenAI traffic grows as projected (it currently represents only about 0.06% of total mobile network traffic, but the app market is growing at 81% year over year), the uplink demands could strain access network capacity in ways that traditional web and video traffic never did.
3. Burstiness and Traffic Spikes
3.1 The BurstGPT Evidence
The most comprehensive empirical study of LLM serving burstiness comes from the BurstGPT dataset. The researchers collected real-world traces from Azure OpenAI GPT services and characterized both the temporal and spatial patterns of the workload. Their findings on burstiness are striking and deserve careful attention.
At the macro level, conversation services (interactive ChatGPT use) show periodic patterns: traffic peaks during weekdays and working hours, and drops during nights and weekends. This is broadly similar to traditional web traffic. However, API services (programmatic access to GPT models) follow an aperiodic pattern with much higher burstiness. The shape and scale parameters of the Gamma distribution used to model burstiness vary sharply throughout the day, indicating that the workload is inherently unstable and difficult to predict.
The BurstGPT team modeled request burstiness using Gamma distributions and found that the coefficient of variation is high for API traffic, meaning the load frequently doubles from one time slice to the next. An analysis of Azure production traces revealed that a system can experience traffic bursts during 47% of its operational time. This is far more volatile than typical web server traffic, where request arrivals tend to be smoother at equivalent time granularities.
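A small simulation illustrates why the Gamma shape parameter captures burstiness: for a Gamma distribution the coefficient of variation equals 1/sqrt(shape), so a small shape produces wild slice-to-slice swings even at constant mean load. The parameters below are illustrative, not BurstGPT's fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_load(shape, scale, slices=1000):
    """Request counts per time slice drawn from a Gamma distribution.
    For Gamma, CV = std/mean = 1/sqrt(shape): small shape => bursty."""
    counts = rng.gamma(shape, scale, size=slices)
    cv = counts.std() / counts.mean()
    doubled = np.mean(counts[1:] >= 2 * counts[:-1])  # slice-to-slice doubling
    return round(cv, 2), round(float(doubled), 2)

# Both scenarios have the same mean load (100 requests/slice);
# only the shape parameter differs.
print("smooth web-like:", simulate_load(shape=25.0, scale=4.0))
print("bursty API-like:", simulate_load(shape=0.5, scale=200.0))
```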
3.2 Why LLM Traffic is Burstier
Several factors contribute to the unusual burstiness of LLM serving traffic. First, the computational cost per request is orders of magnitude higher than for traditional web requests. A single LLM inference can occupy a GPU for seconds, whereas a web server processes thousands of requests per second. This means that even modest fluctuations in request arrival rate translate into large swings in resource utilization and queuing delay.
Second, the rise of agentic workflows amplifies burstiness. When an LLM agent makes multiple tool calls, each call triggers a new inference request, and these requests tend to arrive in rapid succession. The MCP-enabled workflow study showed that agentic interactions involve many more API round-trips per user task, each carrying substantial prompt context.
Third, the coupling between request complexity and response time creates feedback effects. Longer prompts take more time to process during the prefill phase, which can block other requests. The ASPLOS 2024 characterization of LLM power management at Microsoft observed distinct patterns in the prefill phase (spiky, compute-intensive) versus the decode/token generation phase (longer, more stable, lower power). These phases interact with batching and scheduling decisions to create complex temporal dynamics.
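A back-of-envelope latency model makes the phase asymmetry visible. The throughput figures below are assumptions chosen for illustration; real prefill and decode rates depend on hardware, model size, and batching.

```python
def inference_latency(prompt_tokens, output_tokens,
                      prefill_tok_per_s=4000.0, decode_tok_per_s=50.0):
    """Two-phase latency model (illustrative rates only).
    Prefill must process the whole prompt before the first token appears;
    decode then emits tokens one at a time at the generation rate."""
    ttft = prompt_tokens / prefill_tok_per_s   # time to first token
    stream = output_tokens / decode_tok_per_s  # steady token stream
    return ttft, ttft + stream

# A prompt-heavy agentic request vs. a short chat turn:
print(inference_latency(8000, 200))  # TTFT dominated by prefill
print(inference_latency(200, 500))   # duration dominated by decode
```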
3.3 Failure Rates Under Bursty Load
The BurstGPT study also documents a concerning relationship between burstiness and failure rates. They observed failure rates exceeding 5% for ChatGPT conversation services, significantly higher than typical cloud services. Increased burstiness leads to GPU memory bottlenecks, which in turn cause spikes in failure rates and performance degradation. This creates a negative feedback loop: bursty arrivals cause queuing, queuing exhausts GPU memory (particularly the KV cache), and memory exhaustion causes request failures, further degrading perceived service quality.
4. Transport Layer Challenges: Token Streaming Over TCP
4.1 Head-of-Line Blocking in Token Streams
One of the most insightful networking studies on LLM traffic is the Eloquent paper (Hanchen Li et al., SIGCOMM Workshop 2024), which conducted real-world packet-level measurements of ChatGPT's streaming API under unstable network conditions. Their findings reveal a fundamental mismatch between TCP's reliability guarantees and the needs of LLM token streaming.
When a packet carrying a token is lost, TCP's in-order delivery guarantee means all subsequent tokens are blocked at the receiver until the lost packet is retransmitted. The researchers observed that under lossy conditions (common on mobile networks, WiFi, or in-motion scenarios), subsequent packets carrying newer tokens often arrive before the retransmission of the lost packet. But these newer tokens cannot be rendered because they are blocked behind the gap, creating visible stalls in the user experience.
This is a qualitatively different problem from video streaming (where buffering can absorb short delays) or traditional web page loads (where the user sees nothing until the full response arrives anyway). In LLM token streaming, the user is watching text appear word by word, and any interruption in this flow is immediately perceptible. The Eloquent team measured ChatGPT, Claude, and Bard (Gemini) and found that all three suffer from increased stall ratios under packet loss.
Their proposed solution, Eloquent, adds redundancy by including previously unacknowledged tokens in each new outgoing packet. This ensures that each received packet can independently advance the rendering state. Through simulation, Eloquent reduced stall ratio by 71% compared to TCP's retransmission-based approach. The authors suggest implementation within QUIC (as a custom stream recovery mechanism) or on top of RTP, drawing an analogy to how video and audio streaming handle lossy networks.
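The core redundancy idea can be sketched in a few lines. This is an illustration of the mechanism, not the authors' implementation: every outgoing packet carries all tokens the client has not yet acknowledged, so a lost packet never blocks rendering as long as any later packet arrives.

```python
class RedundantTokenSender:
    """Eloquent-style redundancy sketch: each packet carries every token
    the client has not yet acknowledged, so any received packet can
    independently advance rendering. Not the paper's implementation."""
    def __init__(self):
        self.unacked = []   # (seq, token) pairs awaiting client ack
        self.next_seq = 0

    def send(self, new_token):
        self.unacked.append((self.next_seq, new_token))
        self.next_seq += 1
        return list(self.unacked)  # packet payload = all unacked tokens

    def on_ack(self, acked_through):
        self.unacked = [(s, t) for s, t in self.unacked if s > acked_through]

class Receiver:
    def __init__(self):
        self.rendered_through = -1
        self.text = []

    def on_packet(self, payload):
        for seq, token in payload:                # payload is self-contained,
            if seq == self.rendered_through + 1:  # so gaps never stall rendering
                self.text.append(token)
                self.rendered_through = seq
        return self.rendered_through              # cumulative ack

sender, receiver = RedundantTokenSender(), Receiver()
p1 = sender.send("Hello")
p2 = sender.send(" world")    # suppose p1 is lost in the network
ack = receiver.on_packet(p2)  # p2 still carries "Hello", so no stall
sender.on_ack(ack)
print("".join(receiver.text))  # -> "Hello world"
```

The cost of this design is modest bandwidth overhead, which is cheap given how small token payloads are relative to typical link capacities.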
4.2 Implications for Protocol Design
The Eloquent work highlights that LLM token streaming is a genuinely new traffic pattern that does not fit neatly into existing categories. It differs from video streaming (which has large pre-bufferable content), from collaborative text editing (where human typing speed is slow enough for TCP retransmission to be invisible), and from traditional web requests (which are atomic). The token generation rate of current LLMs (roughly 30 to 100 tokens per second for most models) sits in an awkward middle ground: fast enough that TCP retransmission delays are noticeable, but slow enough that each token is individually important to the user experience.
This has implications for QUIC adoption, congestion control algorithm selection, and potentially for new application-layer protocols optimized for this specific traffic pattern. The TokenFlow system (Xiao et al., 2025) further explored this space by defining Quality of Service metrics specific to text streaming, including startup latency, stall events, and token usefulness, paralleling the QoE frameworks developed for video streaming over the past decade.
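As a rough illustration of such metrics, the sketch below computes startup latency and stall statistics from token arrival timestamps. The definitions and the 500 ms stall threshold are assumptions in the spirit of TokenFlow, not the paper's exact formulas.

```python
def token_stream_qoe(arrival_times, stall_threshold=0.5):
    """Illustrative token-streaming QoE metrics (assumed definitions).
    arrival_times: seconds at which each token became renderable."""
    ttft = arrival_times[0]  # startup latency (time to first token)
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    stalls = [g for g in gaps if g > stall_threshold]
    duration = arrival_times[-1] - arrival_times[0]
    stall_ratio = sum(stalls) / duration if duration > 0 else 0.0
    return {"ttft_s": ttft, "stall_events": len(stalls),
            "stall_ratio": round(stall_ratio, 2)}

# A stream that stalls for ~2.7 s mid-generation
# (e.g. a lost packet blocking delivery under TCP):
times = [0.8 + 0.03 * i for i in range(50)] + [5.0, 5.03, 5.06]
print(token_stream_qoe(sorted(times)))
```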
5. Datacenter and Intra-Cluster Traffic Patterns
5.1 Training vs. Inference Traffic
While this report focuses on end-user-facing inference traffic, the datacenter-internal traffic patterns of LLMs deserve mention because they represent an entirely new class of workload for data center networks.
The HotNets 2023 work on LLM-centric network architectures (Ghobadi et al.) characterized the traffic patterns of distributed LLM training across GPU clusters. They found that LLM training traffic is highly structured: it follows the parallelism strategy (data parallel, tensor parallel, pipeline parallel) and creates predictable communication patterns between specific GPU pairs. Notably, 33% of links in a traditional any-to-any Clos network carry zero training traffic and can be removed without performance impact.
The IETF draft on AI traffic (draft-aft-ai-traffic-01) provides a broader framing. It notes that ML traffic consists of a relatively small number of large flows with very low entropy, making traditional ECMP-based load balancing ineffective. This low entropy problem, stemming from the highly structured all-reduce and all-gather communication patterns, forces operators to use packet spraying strategies that require careful tuning.
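The entropy problem is easy to reproduce. ECMP picks an uplink by hashing a flow's 5-tuple; with only a handful of long-lived flows between fixed endpoints, some links receive multiple elephant flows while others stay idle. The sketch below uses CRC32 as a stand-in for a switch's hash function; addresses and ports are made up.

```python
import zlib

def ecmp_link(flow, n_links):
    """Pick an uplink by hashing the 5-tuple, as ECMP does."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_links

# A handful of long-lived training flows between fixed GPU-server pairs,
# differing only in source port:
flows = [("10.0.0.1", "10.0.1.1", 6, 49152 + i, 4791) for i in range(8)]
links = [ecmp_link(f, n_links=8) for f in flows]
print(links)  # 8 flows rarely spread evenly over 8 links: some links
              # carry two or three elephants while others stay idle
```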
The IETF RTGWG draft on AI network problems further observes that AI training traffic creates bursts of high-bandwidth elephant flows between the same endpoint pairs, leading to severe hash conflicts and network congestion even with uniform hash algorithms.
5.2 Inference Serving at Scale
For inference traffic within the datacenter, the two-phase nature of LLM inference (prefill and decode) creates distinct internal traffic patterns. The Splitwise system (ISCA 2024, using Azure's 2023 LLM trace) proposed disaggregating these phases onto separate GPU pools, which changes the internal traffic pattern: prefill GPUs send large intermediate state transfers to decode GPUs. DynamoLLM (HPCA 2025, using Azure's 2024 trace) further characterized these patterns for energy-efficient cluster design.
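The scale of those prefill-to-decode transfers follows directly from model dimensions. The sketch below estimates the KV cache size handed over per request, using dimensions roughly in the range of a 13B-class model (FP16, no grouped-query attention); the numbers are illustrative, not taken from the Splitwise paper.

```python
def kv_state_bytes(seq_len, n_layers=40, n_kv_heads=40, head_dim=128,
                   bytes_per_elem=2):
    """Size of the KV cache handed from a prefill GPU to a decode GPU.
    Two tensors (K and V) per layer; dimensions and FP16 precision are
    illustrative assumptions for a 13B-class dense model."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for tokens in (512, 2048, 8192):
    print(f"{tokens:5d} prompt tokens -> {kv_state_bytes(tokens) / 2**20:8.1f} MiB")
```

Even a 512-token prompt implies a transfer of hundreds of MiB under these assumptions, which is why disaggregated designs lean on high-bandwidth intra-cluster interconnects.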
The Azure public datasets provide two key inference trace collections. The 2023 trace (one day, November 2023) and the 2024 trace (ten days, May 2024) together reveal how LLM inference workloads have grown in both volume and complexity. These traces have been widely adopted as benchmarks for evaluating serving systems including vLLM, Sarathi-Serve, and various KV cache management techniques.
6. Edge and Distributed Inference Traffic
6.1 The Emerging Edge LLM Landscape
A new SIGCOMM NAIC 2025 workshop paper, "LLMs on Edge: Network Traffic Characteristics of Distributed Inference under the Loupe," presents one of the first comprehensive analyses of network traffic in edge-deployed LLM frameworks. The study reveals non-obvious behaviors, including cases where adding compute nodes actually degrades performance due to the networking overhead of distributing activation tensors across devices.
In distributed edge inference (using frameworks like Petals, Distributed-Llama, or MDI-LLM), the traffic pattern changes fundamentally. Instead of a single request-response flow between client and cloud, inference involves repeated exchanges of intermediate activation vectors between edge nodes. These activations can be large (megabytes per layer for larger models), and they must be exchanged synchronously, making the inference throughput highly sensitive to the weakest network link.
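A simple model shows why the weakest link dominates. In pipeline-parallel decode, each generated token's hidden-state vector must cross every device boundary in sequence, so per-token network time compounds per hop. The parameters below (FP16 activations, no batching or compression, hypothetical link speeds and latencies) are assumptions for illustration.

```python
def decode_tokens_per_s(hidden_size=4096, bytes_per_elem=2,
                        pipeline_hops=3, link_mbps=100.0, one_way_ms=3.0):
    """Network-imposed ceiling on decode throughput for a pipeline of
    edge devices. Each token's hidden state crosses every boundary in
    sequence: per-token time = hops * (propagation + serialization)."""
    per_hop_bytes = hidden_size * bytes_per_elem
    serialize_s = per_hop_bytes / (link_mbps * 1e6 / 8)
    per_token_s = pipeline_hops * (one_way_ms / 1e3 + serialize_s)
    return 1.0 / per_token_s

print(f"{decode_tokens_per_s():.0f} tok/s ceiling on 100 Mbps WiFi, 3 ms hops")
print(f"{decode_tokens_per_s(one_way_ms=0.2, link_mbps=1000):.0f} tok/s on wired LAN")
```

Under these assumptions the wireless hops, not bandwidth, cap the token rate, which is consistent with the observation that adding nodes can degrade performance.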
The HALO framework (January 2026) addressed lossy edge networks specifically, observing that strict synchronization in distributed inference is often infeasible due to unreliable network conditions. Their semantic-aware approach selectively relaxes synchronization for less critical neuron groups, achieving 3.4x speedup under 5% packet loss on a Raspberry Pi testbed.
6.2 Network Requirements for Edge LLMs
The traffic characteristics of edge LLM inference differ from cloud inference in important ways. The flows are smaller but more latency-sensitive. The communication pattern is synchronous (each layer must complete before the next begins), creating a strict dependency chain. And the heterogeneous nature of edge networks (mixing WiFi, Ethernet, and potentially cellular links) introduces variability that cloud datacenter networks do not face.
7. Macro-Level Internet Traffic Impact
7.1 Current Scale
Despite the hype around AI traffic, its current quantitative impact on total Internet traffic is modest but growing rapidly. The Ericsson Mobility Report (June 2025) measured GenAI traffic at just 0.06% of total mobile network traffic. ChatGPT alone accounts for 60% of that AI traffic share. However, the growth trajectory is steep: LLM referral traffic to websites grew by an average of 80% between the first and second halves of 2025.
The Omdia report on AI's impact on wide-area networking provides a broader estimate: AI traffic (including both direct AI services and AI features embedded in other applications) represents roughly 14% of all network traffic today and is projected to reach 31% within three years. They project that AI traffic will eclipse conventional network traffic around 2031.
Nokia's Global Network Traffic Report projects that AI traffic from consumer and enterprise applications could reach 1,441 exabytes per month by 2033, with significant implications for inter-datacenter link capacity as a single AI traffic flow can traverse multiple inter-DC links, creating an amplification effect.
7.2 Upstream Traffic and Topology Changes
Several infrastructure-level changes are being driven by AI traffic. Network operators are fielding requests from major AI players for capacity with potential to scale to 10 terabits per second. Enterprises expect to triple their bandwidth to data centers and clouds in the next 12 to 18 months, with AI as a primary driver.
The topology of the Internet itself is shifting. AI training workloads are driving construction of new data centers in locations with available power rather than proximity to users. Inference workloads, however, require low-latency connections to edge locations near end users. This dual requirement is creating a more distributed Internet topology, with a resurgence in demand for localized data center facilities. Major providers like Zayo and Lumen Technologies have announced multi-billion dollar infrastructure investments specifically to support AI workloads.
7.3 The Cloudflare Perspective
Cloudflare's 2025 Year in Review noted that AI "user action" crawling (LLM providers fetching web content to serve user queries) increased by over 15x in 2025. AI bots other than Googlebot collectively accounted for 4.2% of HTML request traffic, while Googlebot (which increasingly serves dual search-and-AI-training purposes) alone accounted for 4.5%. This represents a new and growing category of web traffic that did not exist a few years ago, and it adds load to origin servers in patterns that differ from traditional search crawling.
8. Key Differences: A Summary
To synthesize the findings across all the sources reviewed, LLM traffic differs from traditional web traffic along several dimensions.
Connection duration. Traditional web requests complete in milliseconds; LLM streaming connections persist for seconds to minutes, with sustained low-rate data flow throughout.
Packet size and cadence. Web traffic produces bursts of full-MTU packets during page loads. LLM token streaming produces a steady drip of small packets, each carrying a few tokens' worth of JSON payload, at the model's generation rate.
Burstiness profile. While web traffic shows smooth aggregate behavior at moderate time scales, LLM API traffic shows high burstiness even at 20-minute granularities, with the load frequently doubling between adjacent time slices. The BurstGPT analysis found that systems experience traffic bursts during nearly half their operational time.
Uplink share. GenAI traffic has a significantly higher uplink share (26%) than typical web traffic (10%), driven by large prompt payloads.
Transport sensitivity. LLM token streaming is uniquely sensitive to TCP head-of-line blocking. A single lost packet creates a visible stall in the user's text rendering, unlike web page loads (which buffer) or video (which can pre-buffer or degrade quality).
Failure correlation. LLM serving shows failure rates that correlate strongly with burstiness. The high per-request resource cost (GPU memory, compute) means that traffic spikes translate directly into memory exhaustion and cascading failures.
Load balancing incompatibility. Traditional round-robin or utilization-based load balancing is poorly suited for LLM inference because request processing times are highly variable and individual requests monopolize GPU resources.
9. Open Research Directions
Several open questions emerge from this survey.
Transport protocol optimization for token streaming. The Eloquent work is a promising start, but deployment in production systems (potentially within QUIC or as an HTTP/3 extension) remains unexplored. The relationship between congestion control algorithms and token streaming QoE is also unstudied.
Traffic classification and identification. As LLM traffic grows, ISPs and enterprise networks may need to identify and manage it. The encrypted, SSE-based nature of the traffic makes traditional DPI approaches insufficient, but the distinctive temporal pattern of token streaming (regular small packets over long connections) may enable flow-level classification.
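As a speculative sketch of that idea, the features below (flow duration, mean payload size, inter-packet gap regularity) could feed a flow-level classifier; the feature set and thresholds are guesses, not measured values.

```python
import statistics

def token_stream_features(pkts):
    """Flow-level features that may distinguish LLM token streams from
    ordinary web flows (a speculative sketch, not a validated classifier).
    pkts: list of (timestamp_s, payload_bytes) for server-to-client packets."""
    times = [t for t, _ in pkts]
    sizes = [s for _, s in pkts]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "duration_s": times[-1] - times[0],      # long-lived (tens of seconds)
        "mean_payload": statistics.mean(sizes),  # small, sub-MTU chunks
        "gap_cv": statistics.stdev(gaps) / statistics.mean(gaps),  # steady cadence
    }

def looks_like_token_stream(f):
    # Thresholds are illustrative guesses, not measured values.
    return f["duration_s"] > 5 and f["mean_payload"] < 600 and f["gap_cv"] < 1.0

# A synthetic 200-token stream at ~20 tokens/s with ~120-byte SSE chunks:
flow = [(i * 0.05, 120) for i in range(200)]
print(looks_like_token_stream(token_stream_features(flow)))  # -> True
```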
Network-aware serving systems. Most LLM serving optimizations (vLLM, Sarathi-Serve, etc.) focus on GPU scheduling and memory management. Incorporating network conditions into serving decisions, especially for edge deployments, is an emerging area.
Burstiness prediction and proactive scaling. The high burstiness documented by BurstGPT suggests that reactive autoscaling is insufficient. Predictive models that anticipate LLM traffic surges could improve both efficiency and user experience.
QoE modeling for text streaming. The TokenFlow work began defining formal QoE metrics for token streaming, but this area is far less mature than video QoE. Understanding user sensitivity to stall duration, stall frequency, and time-to-first-token is essential for driving network optimization.
10. Key References
- BurstGPT — Wang et al. "BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems." KDD 2025. 10.31 million traces from Azure OpenAI GPT services over 213 days.
- Eloquent — Li et al. "Eloquent: A More Robust Transmission Scheme for LLM Token Streaming." NAIC Workshop at ACM SIGCOMM, 2024. Packet-level measurement and transport design for token streaming.
- Azure LLM Inference Traces — Microsoft Azure Public Dataset. Traces from November 2023 (ISCA 2024 / Splitwise) and May 2024 (HPCA 2025 / DynamoLLM).
- LLMs on Edge — "LLMs on Edge: Network Traffic Characteristics of Distributed Inference under the Loupe." NAIC Workshop at ACM SIGCOMM, 2025.
- ASPLOS 2024 Power Characterization — Patel et al. "Characterizing Power Management Opportunities for LLMs in the Cloud." ASPLOS 2024. Includes analysis of prompt-phase vs. decode-phase traffic and power patterns.
- Google Cloud Networking Blog — "Networking capabilities optimize traffic for generative AI apps." June 2024. Industry perspective on LLM traffic differences from traditional web apps.
- IETF draft-aft-ai-traffic-01 — Fressancourt et al. "Handling inter-DC/Edge AI-related network traffic: Problem statement." April 2025. Covers incast, entropy, load balancing challenges for ML/LLM traffic.
- Ericsson Mobility Report — "GenAI Data Traffic Today." June 2025. Mobile network measurement of GenAI application traffic characteristics.
- Omdia Report — "AI's Impact on Wide Area Networking." November 2025. Projections for AI traffic share growth and infrastructure implications.
- Nokia Global Network Traffic Report — Projections for AI traffic growth to 1,441 EB/month by 2033.
- TokenFlow — Xiao et al. "TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling." 2025. QoE framework for text streaming.
- MCP-Enabled LLM Performance — "Network and Systems Performance Characterization of MCP-Enabled LLM Agents." arxiv:2511.07426, 2025. Token usage characterization for agentic workflows.
- HALO — "HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network." January 2026. Distributed inference under packet loss.
- Characterization of LLM Development in the Datacenter — "Characterization of Large Language Model Development in the Datacenter." arxiv:2403.07648, 2024. GPU cluster workload analysis from Alibaba.
- Keysight ATI — "An Insightful Look into OpenAI API Call's Network Traffic." June 2024. Packet-level analysis of OpenAI API call structure.