Model Gemma-4-31B-IT-NVFP4
Weights 31.18 GiB
KV cache 1,563,739 tokens
Context 262,144 tok
Cold start ~13 min
Warm start ~40s
Concurrency 5.97× @ 256K ctx
01 Process startup & model identification 00:01:23
KEY [APIServer] 00:01:23 non-default args: model=nvidia/Gemma-4-31B-IT-NVFP4 · max_num_batched_tokens=4096

vLLM logs every argument that deviates from its defaults, so this one line summarizes how the deployment differs from a stock launch.

model=nvidia/Gemma-4-31B-IT-NVFP4: NVIDIA's quantized variant of Google's Gemma 4 31B. The -IT suffix means "instruction-tuned" (fine-tuned for chat/instruction following). NVFP4 is NVIDIA's proprietary 4-bit float format from their ModelOpt toolkit, separate from the open-source GPTQ/AWQ/GGUF ecosystem.

max_num_batched_tokens: 4096 — the maximum number of tokens that can be processed across all concurrent requests in a single forward pass. Setting it to 4096 (vs the much higher default) is a deliberate memory-conservative choice that pairs with chunked prefill to control GPU memory pressure and latency variance.

WARN [APIServer]Unknown vLLM env vars: VLLM_VERSION · VLLM_FLASH_ATTN_SRC_DIR

These env vars were set on the host (likely by NVIDIA's DGX container or a prior vLLM installation) but the current vLLM version doesn't recognize them. Harmless — vLLM ignores unknown env vars and warns you so you're not silently relying on something that isn't active.

VLLM_FLASH_ATTN_SRC_DIR used to point vLLM at a custom FlashAttention build. Newer versions handle attention backend selection differently (via the TRITON_ATTN path, seen below), making this obsolete.

KEY [APIServer]Resolved architecture: Gemma4ForConditionalGeneration

vLLM reads config.json from HuggingFace, finds the architectures field, and maps it to an internal model class. Gemma4ForConditionalGeneration is the PyTorch class implementing the Gemma 4 architecture. The "conditional generation" name comes from the multimodal heritage — the model can condition its text output on both text and image inputs.

If the architecture isn't in vLLM's registry, startup fails here. Seeing this line confirms vLLM has full support for this model.

KEY [APIServer]Using max model len 262144 tokens (256K context window)

The maximum context window this deployment supports: 262,144 tokens (256 × 1,024). This is read from Gemma 4's max_position_embeddings in config.json. The KV cache must be sized to support sequences up to this length, a significant memory commitment that shapes all memory math later in the startup.

02 Quantization, KV cache format & attention backend 00:01:28
KEY [EngineCore]KV cache dtype: fp8_e4m3 — halves footprint, boosts performance, requires correct scaling factors

The KV cache (stored key/value tensors from each attention layer) is held in FP8 E4M3 format — 8-bit float with 4 exponent bits and 3 mantissa bits. It's half the size of BF16, so you can cache twice as many tokens per byte of VRAM. For a 256K context window model, this is a prerequisite rather than an optimisation.

The accuracy caveat is real: FP8 E4M3 has a narrow dynamic range (max representable value 448). Without per-tensor scaling factors in the checkpoint, large activation magnitudes in deep layers get silently clipped. See the scaling factor warnings in section 04 for where this becomes concrete.
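The 448 ceiling follows from the bit layout. A minimal sketch, assuming the common E4M3 encoding with exponent bias 7, where the all-ones exponent combined with the all-ones mantissa is reserved for NaN; the function name is illustrative and not part of vLLM:

```python
def e4m3_max() -> float:
    """Largest finite FP8 E4M3 value under the assumed encoding."""
    exponent_bits, mantissa_bits, bias = 4, 3, 7
    max_exponent_field = 2**exponent_bits - 1  # 0b1111 = 15
    # Mantissa 0b111 at the top exponent encodes NaN, so the largest
    # usable mantissa field is 0b110 = 6.
    max_mantissa_field = 2**mantissa_bits - 2
    significand = 1 + max_mantissa_field / 2**mantissa_bits  # 1.75
    return significand * 2.0 ** (max_exponent_field - bias)  # 1.75 * 2**8


print(e4m3_max())  # 448.0
```

Anything with a larger magnitude has to be scaled down before the cast, which is the job of the k_scale and v_scale factors.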

WARN [EngineCore]Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4) — format is experimental

NVFP4 is NVIDIA's 4-bit floating-point quantization format from their ModelOpt toolkit, separate from the open-source GPTQ/AWQ/GGUF ecosystem. It has the same 4-bit footprint as INT4 but uses a floating-point value grid: weights are stored in 4 bits with block-level scaling factors for accuracy recovery.

Gemma-4 in NVFP4 is ~30 GiB vs ~55 GiB in BF16 — this is what makes the 31B model fit on a single GPU. The "experimental" label is legal hedging; NVIDIA ships this format in production on their own hardware. It requires Hopper/Blackwell architecture GPU instructions (H100, B100, B200) — it won't run on older cards.

KEY [EngineCore]Gemma4 heterogeneous head dims (head_dim=256, global_head_dim=512) → forcing TRITON_ATTN backend

This is the most architecturally revealing line in the entire log. Gemma 4 uses a hybrid attention design with two different head sizes:

Local attention layers (head_dim=256): these attend to a sliding window of nearby tokens only, which keeps compute tractable at 256K context.

Global attention layers (head_dim=512): these can attend to the full 256K context but appear in fewer layers. By interleaving a few global layers with the more numerous local ones, the model achieves long-range coherence without paying the full O(n²) attention cost at every layer.

FlashAttention (vLLM's default backend) is compiled with a fixed head dimension per model — it cannot handle two different head sizes. vLLM falls back to Triton, which allows flexible kernel shapes. Minor performance cost, required for correctness.

KEY [EngineCore]Chunked prefill enabled · max_num_batched_tokens=4096 · async scheduling enabled

Two related scheduling decisions:

Chunked prefill splits processing of long input prompts into 4096-token chunks, interleaved with decode steps from other requests. Without it, a single user with a 50K prompt monopolises the GPU and causes all other users' generation to stall for seconds. With it, the long prompt is processed incrementally while other requests keep generating. Slight latency increase for the long-prompt user; significant fairness improvement for everyone else.
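To make the 50K-prompt case concrete, here is a simplified sketch of how a prompt is cut into per-step chunks. It assumes the whole 4096-token budget goes to the prompt on every step; in a real server, decode tokens from other requests share that budget, so the prefill takes at least this many steps. The function name is hypothetical:

```python
def prefill_chunks(prompt_tokens: int, budget: int = 4096) -> list[int]:
    """Split a prompt into per-step chunk sizes that fit the token budget."""
    full, rest = divmod(prompt_tokens, budget)
    return [budget] * full + ([rest] if rest else [])


chunks = prefill_chunks(50_000)
print(len(chunks), chunks[-1])  # 13 848
```

Twelve full chunks plus one partial chunk means other requests get a chance to decode between each of the 13 steps instead of waiting for a single 50K-token pass.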

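Async scheduling lets the scheduler prepare the next step on the CPU while the GPU is still executing the current one, so the GPU spends less time idle between forward passes. Separately, the APIServer process (HTTP interface) and EngineCore process (GPU execution) run independently, so the APIServer can accept new connections and stream partial results while the GPU is mid-forward-pass.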

INFO [EngineCore]Enabled custom fusions: act_quant · Using FlashInferCutlassNvFp4LinearKernel for GEMM

act_quant fusion — instead of running (1) activation function then (2) quantize output to FP4/FP8 as separate GPU kernel launches, they're fused into one. This eliminates a round-trip through GPU memory between the two ops — critical for NVFP4 models where every layer boundary involves a quantize/dequantize step.

FlashInferCutlassNvFp4LinearKernel: the actual matrix multiply operations (the ~85% of compute in a transformer) use NVIDIA CUTLASS (their highly optimized GEMM template library) via the FlashInfer kernel library, with a kernel path specifically tuned for FP4-quantized weight matrices on Hopper/Blackwell.

03 Model download & weight loading 00:01:28 → 00:11:39
WARN [both]Unauthenticated requests to HF Hub — set HF_TOKEN for higher rate limits

The model is downloading from HuggingFace without an auth token. This works for public models but HF rate-limits anonymous requests more aggressively. This warning appears twice because both the APIServer and EngineCore processes independently fetch metadata from HF on startup. Fix with export HF_TOKEN=hf_... in your container env.

INFO [EngineCore]Checkpoint: 30.39 GiB · Available RAM: 74.74 GiB · Filesystem: OVERLAY · Auto-prefetch: disabled

30.39 GiB checkpoint: a 31B-parameter model at 4 bits per weight would be ~15.5 GB (about 14.4 GiB). The extra size comes from embedding tables and the output projection remaining in BF16 (four times larger per parameter), bias vectors, quantization scaling factors, and metadata. The roughly 2× overhead over naive FP4 is typical.
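The naive-size comparison is a few lines of arithmetic. This sketch assumes exactly 31e9 parameters (the nominal figure from the model name, not a count read from the checkpoint):

```python
GIB = 2**30

params = 31e9                 # nominal parameter count (assumption)
naive_bytes = params * 4 / 8  # 4 bits per weight
naive_gib = naive_bytes / GIB
checkpoint_gib = 30.39        # from the log line above

print(f"naive FP4: {naive_bytes / 1e9:.1f} GB = {naive_gib:.2f} GiB")
print(f"checkpoint / naive: {checkpoint_gib / naive_gib:.1f}x")
```

The ratio comes out slightly above 2, which matches the "roughly 2×" overhead described above.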

OVERLAY filesystem — the weights are on a Docker overlay FS (copy-on-write layers), not NFS or Lustre. vLLM's auto-prefetch feature is designed for network filesystems where pre-loading shards into CPU RAM while prior shards process yields big wins. On a local overlay FS, Linux's page cache handles this already, so auto-prefetch is correctly disabled.

INFO [EngineCore]HF download: 370.18s · Loading safetensors shards: 4/4 in 223.49s

370s download ≈ 84 MiB/s (about 88 MB/s) from HuggingFace, assuming the full 30.39 GiB checkpoint was transferred. Consistent with a rate-limited anonymous connection. With HF_TOKEN and a fast pipe, expect 200–400 MB/s.

223s shard loading is the disk→GPU load time: deserializing 4 safetensors shards (~7.6 GiB each) from disk into VRAM. Safetensors is the modern replacement for PyTorch's pickle-based .bin format. It uses memory-mapped files and never executes arbitrary Python code on load, which eliminates a supply-chain attack vector present in the old format.
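The throughput figures can be re-derived from the logged durations. A small sketch, assuming both phases moved the full 30.39 GiB checkpoint:

```python
checkpoint_gib = 30.39
download_s = 370.18   # HF download time from the log
load_s = 223.49       # safetensors shard loading time from the log

mib_per_s = checkpoint_gib * 1024 / download_s
mb_per_s = checkpoint_gib * 2**30 / 1e6 / download_s
load_mib_per_s = checkpoint_gib * 1024 / load_s

print(f"download: {mib_per_s:.0f} MiB/s ({mb_per_s:.0f} MB/s)")
print(f"shard load: {load_mib_per_s:.0f} MiB/s")
```

The shard-loading rate is only modestly higher than the download rate, which suggests that local storage and deserialization, not the GPU link, bound that phase.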

After the first run, weights are cached locally so the download phase (370s) is skipped entirely on all subsequent restarts.

KEY [EngineCore]Model loading: 31.18 GiB VRAM · 596.99s total wall-clock

Two numbers to read carefully:

31.18 GiB VRAM — slightly larger than the 30.39 GiB checkpoint because: activation buffers and scratch memory are counted here, embedding tables kept in BF16 inflate after loading, and quantization metadata adds overhead.

596s total — wall-clock from "start loading" to "model ready to profile", including the HF download. The real cost on a warm restart (weights already on local disk) is ~230s.

Memory accounting so far
Model weights (NVFP4 + BF16 embeds): 31.18 GiB
KV cache available (reported later): 71.34 GiB
CUDA graphs + buffers: ~3–4 GiB
─────────────────────────────────────────────────
Total accounted for: ~106 GiB
→ More than an H100 NVL's 94 GB of HBM3, so this points to a device with well over 100 GiB of GPU-visible memory
04 FP8 scaling & encoder cache 00:11:38 → 00:11:51
WARN [EngineCore]No q_scale in checkpoint — using k_scale as fallback

Storing the KV cache in FP8 requires scaling factors to correctly map BF16 values into FP8's narrow dynamic range. The checkpoint should provide three: q_scale (for query tensors), k_scale (keys), and v_scale (values). This NVFP4 checkpoint only provides k_scale and v_scale.

vLLM falls back to using k_scale for Q as well. In practice this is usually fine — Q and K have similar activation magnitude distributions since they're both linear projections of the same hidden state. But it's technically an approximation. Only affects KV cache storage, not the model weights.

WARN [EngineCore]KV cache scaling factor = 1.0 for fp8_e4m3 — verify checkpoint has proper k/v_scale

A scaling factor of 1.0 means no scaling — BF16 values are cast directly to FP8 without any dynamic range adjustment. This is a yellow flag.

FP8 E4M3 has a max representable value of 448.0. Deep in a transformer (layer 30+ of a 46-layer model) and across very long contexts, attention key/value tensors routinely exceed this range. Values beyond 448 are silently clipped, corrupting the KV cache entries for those positions.

Practical impact
Short contexts (0–8K tokens): minimal, activations stay in range
Long contexts (64K+ tokens): potential quality degradation
Full 256K context: may produce noticeable artifacts

Possible cause (not confirmed by the log): the checkpoint was calibrated for FlashAttention's FP8 path,
which applies its own scaling inside the kernel, while TRITON_ATTN does not.
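The effect of a 1.0 scale can be shown with a toy model of the cast. This sketch assumes a saturating conversion and ignores mantissa rounding, so it models only range clipping; the activation values are made up for illustration:

```python
E4M3_MAX = 448.0


def store_fp8(values, scale):
    """Scale, saturate to the E4M3 range, then rescale on read.

    Mantissa rounding is ignored; only range clipping is modelled.
    """
    out = []
    for v in values:
        q = max(-E4M3_MAX, min(E4M3_MAX, v / scale))
        out.append(q * scale)
    return out


keys = [0.5, -120.0, 447.0, 900.0, -1500.0]  # made-up activation values

print(store_fp8(keys, scale=1.0))
# [0.5, -120.0, 447.0, 448.0, -448.0]

# A calibrated scale maps the largest magnitude onto the E4M3 ceiling.
calibrated = max(abs(v) for v in keys) / E4M3_MAX
restored = store_fp8(keys, scale=calibrated)
print(restored)
```

With scale 1.0 the two out-of-range entries are flattened to ±448. With the calibrated scale every value survives the round trip, up to the rounding error this sketch leaves out.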
INFO [EngineCore]Encoder cache: 4096-token budget · profiled with 1 max-size video item

Gemma 4 has a SigLIP-based vision encoder that converts images/video frames into token embeddings before they enter the decoder. The encoder cache reserves 4096 tokens of VRAM exclusively for these vision embeddings.

The "1 video item of maximum feature size" is a profiling dummy run: vLLM executes a fake forward pass with the largest possible video input to measure actual encoder memory consumption before committing to the KV cache budget. It is not a real video but a stress test to find the memory ceiling.

The 4096-token encoder budget aligns with max_num_batched_tokens=4096 from startup — they're sharing the same token budget across modalities.

WARN [EngineCore]Priority not set for op rms_norm — using native (PyTorch) implementation

RMSNorm (Root Mean Square Layer Normalization) is the normalization layer used in Gemma 4 and most modern LLMs. It's applied at every transformer block boundary and sits on the critical path of inference.

vLLM has a kernel priority system to decide between a custom fused CUDA kernel or the native PyTorch implementation. "Native" means PyTorch's own RMSNorm — correct, but not as tightly optimised as a hand-written kernel for this hardware combination. This happens because the NVFP4 + TRITON_ATTN combination lacks a registered priority override for rms_norm. Minor performance cost, not a correctness issue. The Part 1 log pre-announced this: rms_norm=['native'].

05 torch.compile & Inductor 00:12:08 → 00:12:55 (60.46s total)
COMPILE [EngineCore]torch.compile cache: /root/.cache/vllm/torch_compile_cache/3b19b606c2/

vLLM uses torch.compile() (PyTorch 2.0+) to JIT-compile the model's compute graph into optimised machine code. The compiled artifacts are cached to disk so subsequent restarts don't recompile from scratch.

The hash 3b19b606c2 is derived from model config + vLLM version + compilation settings. Change any of those, cache is invalidated and recompilation runs in full. This is why upgrading vLLM causes a slow first restart.

COMPILE [EngineCore]Dynamo bytecode transform: 12.79s

Dynamo is the frontend of torch.compile. It intercepts Python bytecode at runtime, traces the model's execution to build a symbolic computation graph — capturing which tensors flow where through all 46 transformer blocks — and hands that graph to the Inductor backend.

12.79s to trace a 31B model graph is normal. Dynamo does this without running real data; it follows the code paths symbolically to understand the computation structure. The resulting graph is a machine-readable description of every matrix multiply, normalization, and attention operation in the model.

WARN [EngineCore]Not enough SMs to use max_autotune_gemm mode

SMs = Streaming Multiprocessors, the fundamental compute unit of an NVIDIA GPU. An H100 SXM has 132 SMs; an H100 PCIe has 114 SMs; a B200 has 148 SMs.

max_autotune_gemm is Inductor's most aggressive matrix-multiply auto-tuning mode: it benchmarks dozens of kernel configurations for every GEMM in the model to find the fastest for this specific hardware. Inductor only enables it when the GPU reports at least a minimum number of SMs.

Below the threshold, Inductor uses heuristic kernel selection instead. Performance cost: roughly 5–15% below peak GEMM throughput. Not a blocking issue: the kernels are still correct and optimised, just not exhaustively tuned. The threshold sits well below 114 SMs (68 in the Inductor source at the time of writing; treat the exact value as version-dependent), so this warning suggests a GPU with fewer SMs than any H100 variant.

COMPILE [EngineCore]Inductor: compiled graph for range (1, 4096) in 39.73s · 61 entries, 6 artifacts, ~19 MB cache

Inductor takes the Dynamo computation graph and compiles it to optimised CUDA kernels for the token range 1–4096. The range means this compiled artifact handles any batch where total token count is between 1 and 4096 — matching the chunked prefill budget.

The output: 6 compiled kernel artifacts (~19 MB of CUDA machine code) with 61 index entries. The 6 artifacts likely correspond to the major compute phases: attention, FFN, RMSNorm, embedding, output projection, and sampling. 39s for compilation is normal for 31B; subsequent restarts load from cache in seconds.

After compilation, the model's forward pass runs as optimised native code rather than interpreted Python operations — this is where the major throughput gains from torch.compile are realised.

COMPILE [EngineCore]Saved AOT compiled function → /root/.cache/vllm/torch_compile_cache/torch_aot_compile/411e2f67.../model

AOT (Ahead-Of-Time) compilation is a step beyond Inductor's JIT. The compiled graph is serialized as a complete, self-contained binary. On the next startup, vLLM loads this binary directly — skipping both Dynamo tracing (12s) and Inductor compilation (40s).

The hash 411e2f67... fingerprints the compiled graph + model config. This is why a warm restart (same vLLM version, same config, weights already on disk) takes ~40s instead of 10+ minutes: only CUDA graph capture remains.

COMPILE [EngineCore]torch.compile: 60.46s total · Initial warmup run: 1.97s
Compilation time breakdown
Dynamo trace: 12.79s (21%)
Inductor compile: 39.73s (66%)
Artifact save: ~8.0s (13%)
─────────────────────────────────
Total: 60.46s

The 1.97s warmup run is a real forward pass with synthetic data. It verifies the compiled graph produces correct outputs and primes GPU caches and CUDA driver state before profiling begins. Think of it as a rehearsal before the memory budget is committed.

06 CUDA graph capture & KV cache sizing 00:13:11 → 00:14:04 (32s capture)
KEY [EngineCore]Capturing 51 PIECEWISE graphs (largest batch=512) + 35 FULL graphs (largest batch=256) in 32s

CUDA graphs pre-record GPU command sequences and replay them without CPU-launch overhead on each decode step. vLLM captures graphs across many batch sizes so it always has a graph that fits a given request count (padding up to the nearest captured size).
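The "pad up to the nearest captured size" rule amounts to a sorted lookup. A minimal sketch with a hypothetical list of capture sizes (the real list comes from vLLM's compilation config and is longer); the fallback for oversized batches is also an assumption:

```python
from bisect import bisect_left

# Hypothetical capture sizes, sorted ascending.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 512]


def pick_graph(batch_size: int, sizes=CAPTURE_SIZES):
    """Return the smallest captured size >= batch_size, or None if too large."""
    i = bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


print(pick_graph(5))    # 8
print(pick_graph(64))   # 64
print(pick_graph(600))  # None
```

A batch of 5 is padded to the size-8 graph, an exact match needs no padding, and a batch beyond the largest captured size has no graph to replay, so it would have to run without one.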

PIECEWISE graphs (51): the forward pass is split at the attention ops, and the segments between them are captured as separate graphs while attention itself runs outside the graph. Because attention stays dynamic, these graphs can serve batches that mix prefill and decode. Covers batch sizes up to 512.

FULL graphs (35): capture the complete forward pass including attention, with no piecewise split. Lower launch overhead, but only usable for uniform decode batches (one new token per request). Covers batch sizes up to 256.

At runtime vLLM chooses the most efficient graph for the current batch. The capture itself takes 32s — a one-time cost reused across all subsequent requests.

MEM [EngineCore]CUDA graph pool: 0.43 GiB actual · 0.73 GiB estimated · 72.4% accuracy

Before capture, vLLM estimated it would need 0.73 GiB for all 86 CUDA graphs. Actual usage after capture: 0.43 GiB, so the estimator was conservative by about 0.30 GiB (the log scores this as 72.4% accuracy). This matters because the pre-capture estimate is what vLLM uses to budget the remaining VRAM for the KV cache.

Why so small? CUDA graphs store command sequences (what to run), not tensors (the actual data). The activation tensors live in separately allocated buffers; the graph just references them by address and replays the operations. 0.43 GiB across 86 graphs works out to roughly 5 MiB per graph, which is plausible at this scale.

Because the budget is based on the 0.73 GiB estimate rather than the 0.43 GiB actually used, the KV cache ends up slightly smaller than it could be. The --gpu-memory-utilization warning (see below) shows how to compensate for the memory reserved for graphs.

WARN [EngineCore]--gpu-memory-utilization=0.9200 → effectively 0.9140; increase to 0.9260 to maintain same KV cache size

Since vLLM v0.21.0, CUDA graph memory is profiled and subtracted from the memory budget before KV cache allocation. This changes what a given --gpu-memory-utilization value actually delivers.

If you previously ran with 0.92 on an older vLLM and want the same KV cache size, you now need 0.926. The difference is exactly the 0.73 GiB estimated CUDA graph overhead divided by total GPU VRAM.
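That relationship can be inverted to estimate how much memory the GPU exposes. A rough sketch: the log rounds the utilization values to four decimals, so the result is only good to within about ±10 GiB:

```python
graph_estimate_gib = 0.73              # CUDA graph pool estimate from the log
requested, suggested = 0.9200, 0.9260  # values from the warning

# estimate / total = suggested - requested  =>  total = estimate / delta
implied_total_gib = graph_estimate_gib / (suggested - requested)
print(f"implied total GPU memory: about {implied_total_gib:.0f} GiB")
```

The same delta also explains the "effectively 0.9140" figure: 0.9200 minus the 0.0060 now reserved for graphs.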

Action item: when upgrading vLLM versions, always check this line. A silent KV cache shrink means fewer concurrent users or shorter contexts before token eviction starts, with no error — just degraded throughput.

KEY [EngineCore]Available KV cache: 71.34 GiB · 1,563,739 tokens · max concurrency @ 256K: 5.97×

The two most important numbers in the entire startup log:

71.34 GiB for KV cache — after weights (31.18 GiB), CUDA graphs (0.43 GiB actual), encoder cache, and buffers, this is what's left for storing attention context. The FP8 encoding is what makes 71 GiB of cache possible — in BF16 you'd have roughly half as many tokens cached.

1,563,739 total cached tokens — how many tokens can be "live" simultaneously across all active requests.

Concurrency at different context lengths
Max context (256K tok): 1,563,739 ÷ 262,144 = 5.97× ← log reports this
Long context (32K tok): 1,563,739 ÷ 32,768 ≈ 47.7×
Typical chat (8K tok): 1,563,739 ÷ 8,192 ≈ 190.9×
Short chat (4K tok): 1,563,739 ÷ 4,096 ≈ 381.8×

For synthetic data gen with 50K prompts: ≈ 31 concurrent jobs
For short taxonomy QA generation (4K): ≈ 381 concurrent jobs
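The table above is a single division repeated for different context lengths. A small sketch for re-running it with your own request sizes. It assumes every request occupies its full context in the cache, which is the worst case:

```python
KV_TOKENS = 1_563_739  # total cacheable tokens reported by the log


def max_concurrency(context_tokens: int) -> float:
    """Number of full-length requests that fit in the KV cache at once."""
    return KV_TOKENS / context_tokens


for label, ctx in [("256K", 262_144), ("32K", 32_768), ("8K", 8_192),
                   ("4K", 4_096), ("50K prompt", 50_000)]:
    print(f"{label:>10}: {max_concurrency(ctx):7.2f}x")
```

Real workloads usually fit more requests than this, because most requests finish well short of their maximum length.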
INFO [EngineCore]JIT monitor activated — Triton JIT compilations during inference will be logged as warnings

A guard against unexpected runtime compilation. If a Triton kernel hasn't been compiled for a particular input shape encountered during serving, it will JIT-compile on the fly — adding latency to that specific request (potentially seconds). The JIT monitor catches these events and logs them as warnings so you know when you're hitting an uncharted shape.

Common trigger: a request with an input length that falls outside the range of pre-captured CUDA graphs. Rarely happens in practice if your typical request shapes are consistent, but good to monitor in production logs.

INFO [EngineCore]Engine init (profile + KV cache + warmup): 145.31s (compilation: 60.46s)

Total time from "model loaded" to "engine ready to serve", broken down:

Post-load init breakdown
torch.compile (Dynamo + Inductor): 60.46s (42%)
KV cache profiling + allocation: ~25s (17%)
CUDA graph capture (86 graphs): 32.0s (22%)
Multimodal warmup: ~28s (19%)
─────────────────────────────────────────────────
Total: 145.31s

On a warm restart (AOT cache hit, weights on disk): only CUDA graph capture (~32s) + multimodal warmup (~28s) remain. Everything else loads from cache in seconds.

INFO [EngineCore]Skipping FlashInfer autotune — disabled

FlashInfer has its own auto-tuning system (separate from Inductor's) that benchmarks attention kernel configurations for this GPU and sequence length distribution. It's disabled here — either explicitly, or because TRITON_ATTN is the active attention backend and FlashInfer autotune only applies to FlashInfer's own attention kernels.

Enabling it would add 1–5 minutes to startup but could yield better attention throughput. For a dedicated production serving deployment (infrequent restarts), worth enabling. For iterative development or frequent restarts, leave disabled.

07 API server startup & route registration 00:14:04 → 00:14:39
WARN [APIServer]Default sampling params overridden by generation_config.json: temperature=1.0 · top_k=64 · top_p=0.95

Gemma 4's generation_config.json specifies default sampling parameters that override vLLM's own defaults. This means requests that don't explicitly set these parameters will use the model's own defaults.

Effective defaults unless overridden per-request
temperature = 1.0 (samples from the model's unmodified distribution)
top_k = 64 (sample from top 64 tokens by probability)
top_p = 0.95 (nucleus sampling at 95% probability mass)

For synthetic data generation (your Caprica use case), you likely want to set these explicitly per request rather than relying on these defaults. Temperature 1.0 with top_k=64 is a reasonable diversity setting for synthetic data, but you may want a lower temperature for factual QA pairs and a higher one for creative variation. Override via the temperature, top_k, and top_p fields in your API requests.

To revert to vLLM's own defaults instead: relaunch with --generation-config vllm.
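A per-request override might look like the following sketch, which uses only the standard library. The host, prompt, and temperature value are illustrative assumptions. top_k is not part of the OpenAI schema; it is sent here as an extra field, which vLLM is assumed to accept. The request is built but not sent:

```python
import json
import urllib.request

payload = {
    "model": "nvidia/Gemma-4-31B-IT-NVFP4",
    "messages": [{"role": "user", "content": "Write one QA pair about FP8."}],
    "temperature": 0.3,  # illustrative value for factual output
    "top_p": 0.95,
    "top_k": 64,         # extra sampling field, assumed to be accepted by vLLM
    "max_tokens": 256,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed host and port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send against a running server:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Setting all three fields explicitly keeps generation behaviour stable even if the server is later relaunched with different defaults.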

INFO [APIServer]Chat template format: 'openai' · Multimodal warmup: 28.99s + 0.21s readonly

Chat template format 'openai': vLLM inspected Gemma 4's chat template (the Jinja2 template that converts message arrays into the raw text prompt) and found that it expects message content as a list of typed parts, e.g. {"type": "text", "text": "..."}, rather than a plain string. vLLM normalises incoming messages to that shape, so you can call this server with the standard OpenAI Python SDK and message format with no adaptation needed.

Multimodal warmup (28.99s): the vision encoder path is executed with synthetic dummy inputs to ensure all multimodal kernels are compiled and cached before the first real request arrives. Without this, the first image request would trigger JIT compilation and take 10–30 seconds. The "readonly" warmup (0.21s) caches the processor configuration without re-running the encoder.

KEY [APIServer]Server ready on http://0.0.0.0:8000 · Supported tasks: ['generate']

Supported tasks: ['generate']: this deployment serves text generation only (optionally conditioned on image input). Not embedding, not classification, not reranking. The generate task covers all the completion and chat endpoints.

The server binds to 0.0.0.0 (all interfaces) on port 8000 — accessible from within the container network and, depending on Docker port mapping, from the host.

INFO [APIServer]Routes registered (27 endpoints)

The full route table, grouped by purpose:

Method | Route | Purpose
GET | /health · /ping · /load | Liveness & readiness probes; wire these to your load balancer / k8s
GET | /metrics | Prometheus metrics: request latency, throughput, KV cache utilization, queue depth
GET | /v1/models | Lists served model name(s); required for OpenAI SDK compatibility
POST | /v1/chat/completions | OpenAI chat API; primary endpoint for chat/instruction use
POST | /v1/chat/completions/batch | Batch chat completions; your primary endpoint for Caprica generation jobs
POST | /v1/completions | OpenAI legacy completions (raw text in, text out, no chat template applied)
POST | /v1/messages | Anthropic Messages API format; drop-in compatible with Anthropic SDK clients
POST | /v1/responses · /v1/responses/{id} · /v1/responses/{id}/cancel | OpenAI Responses API (stateful, streaming-native); newer alternative to completions
POST | /tokenize · /detokenize | Count tokens or decode token IDs; useful for prompt sizing without running inference
POST | /invocations · /inference/v1/generate | SageMaker / NVIDIA Triton compatible endpoints for cloud ML platform integration
POST | /generative_scoring | Log-probability scoring; useful for reward models, ranking, and critic pipelines
POST | /v1/chat/completions/render · /v1/completions/render | Preview the rendered prompt string after chat template application; useful for debugging template issues
POST | /scale_elastic_ep · /is_scaling_elastic_ep | Expert parallelism scaling (MoE models only); not applicable to dense Gemma 4
KEY [APIServer]Application startup complete — server ready

The server is live. Total cold-start time: ~13 minutes from 00:01:23 to 00:14:39.

Complete startup timeline (cold start)
HF download (cached after first run): 370s
Weight deserialization → GPU VRAM: 223s
Encoder cache profiling: 12s
torch.compile (Dynamo + Inductor): 60s
KV cache profiling + allocation: 25s
CUDA graph capture (86 graphs): 32s
Multimodal encoder warmup: 29s
─────────────────────────────────────────────
Total cold start: ~13m
Total warm start (no download, AOT hit): ~40s
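As a sanity check, the cold-start rows can be summed and compared with the wall-clock span between the first and last timestamps. A short sketch using the rounded per-phase figures from the table above:

```python
phases = {
    "HF download": 370,
    "Weight deserialization": 223,
    "Encoder cache profiling": 12,
    "torch.compile": 60,
    "KV cache profiling + allocation": 25,
    "CUDA graph capture": 32,
    "Multimodal encoder warmup": 29,
}

itemised = sum(phases.values())
wall_clock = (14 * 60 + 39) - (1 * 60 + 23)  # 00:01:23 -> 00:14:39

print(itemised, divmod(itemised, 60))      # 751 (12, 31)
print(wall_clock, divmod(wall_clock, 60))  # 796 (13, 16)
print(wall_clock - itemised)               # 45
```

The itemised phases cover all but about 45 seconds of the observed startup; the remainder falls in the gaps between logged phases.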

For production deployments, a persistent container (no restarts between requests) eliminates all of this overhead. Container restart strategies and health-check grace periods should be sized accordingly.