From 15 Seconds to 3: A Deep Dive into TensorRT Inference Optimization
Estimated reading time: 26-33 minutes | 6,635 words
The Performance Problem
Our AI image generator powered by FLUX.1-dev was taking 15 seconds per image on an H100 GPU. For a production Bittensor subnet serving real-time requests, that’s simply too slow. Users expect results in seconds, not tens of seconds.
With TensorRT, we brought inference time down to 3 seconds - a 5x speedup. But TensorRT isn’t magic. It’s a disciplined application of well-understood optimization techniques, carefully orchestrated into a cohesive system.
This post explains:
- Why inference is slow in PyTorch
- How TensorRT fixes these bottlenecks
- What the full PyTorch → ONNX → TensorRT pipeline looks like
- A deep dive into an advanced pattern: dynamic LoRA weight switching via refitting
What is TensorRT?
TensorRT is NVIDIA’s high-performance inference optimizer and runtime. It takes trained models (from PyTorch, TensorFlow, or ONNX) and transforms them into optimized “engines” that execute dramatically faster on NVIDIA GPUs.
Who needs this? If you’re running AI models in production where latency matters - real-time APIs, edge devices, high-throughput services - TensorRT can be the difference between a system that barely works and one that scales effortlessly.
What we’re building: Throughout this post, we’ll walk through a real implementation: a Bittensor miner that generates images using FLUX.1-dev, with support for dynamically switching between custom LoRA models without rebuilding engines.
Let’s dive into the problems TensorRT solves, one by one.
TensorRT versions: This implementation uses TensorRT 10.0’s refitting capabilities. As of mid-2025, TensorRT 10.12 is the latest release (June 2025), with performance improvements in 10.10-10.12. TensorRT 10.11 fixed important refitting performance regressions from 10.10, so we recommend using 10.11+ for production deployments.
Table of Contents
- The Performance Problem
- What is TensorRT?
- Problem 1: PyTorch’s Dynamic Computation Graph
- Problem 2: Getting the Most Out of Lower Precision
- Problem 3: Switching LoRA Models is Painfully Slow ⭐
- Problem 4: GPU Sits Idle During Weight Loading
- Problem 5: Squeezing Out Even More Performance with FP8 ⭐
- Bringing It All Together: The Complete Pipeline
- When to Use TensorRT (and When Not To)
- Conclusion
Problem 1: PyTorch’s Dynamic Computation Graph
The Problem
PyTorch is designed for flexibility. Its dynamic computation graph means you can change the model structure on every forward pass - conditionals, loops, dynamic shapes - all evaluated at runtime.
Let’s be honest: in 2025, PyTorch with torch.compile has closed the gap significantly. Recent benchmarks show torch.compile matching or even exceeding TensorRT for many models, especially smaller transformers. For some workloads, torch.compile is genuinely the better choice.
But TensorRT still pulls ahead in specific scenarios, and our FLUX deployment is one of them:
Where TensorRT still wins:
- Exhaustive kernel benchmarking: TensorRT spends 20-30 minutes trying hundreds of kernel implementations and picking the absolute fastest. torch.compile uses heuristics and can’t afford that time.
- Static shape optimization: When your input shapes never change, TensorRT can bake in optimizations that dynamic compilers can’t. Things like pre-computed memory offsets, specialized kernels for exact dimensions, and aggressive constant folding.
- Multi-layer fusion patterns: TensorRT can fuse complex patterns across many layers that JIT compilers might miss. It has the time to explore combinations that would be too slow to search at runtime.
- Predictable performance: Once built, a TensorRT engine gives you the exact same performance every time. No JIT warmup, no recompilation, no variance.
For our FLUX deployment, we measured a consistent 5x speedup with TensorRT over baseline PyTorch. With torch.compile? We’d probably see 1.8-2.2x - still significant, but TensorRT’s exhaustive kernel benchmarking pays off for this specific architecture and workload.
When torch.compile wins: For many LLM and smaller transformer workloads, torch.compile delivers comparable performance to TensorRT with drastically simpler deployment. The December 2024 Collabora study found torch.compile consistently outperformed TensorRT across popular models like Llama-7b, Mistral, and Phi-3. Your mileage will vary - always benchmark your specific model.
The Solution: Static TensorRT Engines
TensorRT analyzes your model once during the build phase and creates a static execution plan - a .plan file that’s essentially a compiled binary for your GPU.
Key optimizations:
- Graph fusion: Multiple operations are fused into single kernels (e.g., Conv → BatchNorm → ReLU becomes one kernel)
- Pre-compiled execution: No runtime graph building or kernel selection
- Direct GPU execution: Minimal CPU involvement once inference starts
- Optimized memory layout: Data stays on GPU, arranged for maximum bandwidth
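To see why fusion matters, here’s a minimal NumPy sketch (illustrative only, not TensorRT’s actual fusion code) of the classic Conv → BatchNorm → ReLU fold: the BN scale and shift collapse into the convolution’s weights and bias, so three kernel launches become one. A 1x1 conv is modeled as a plain matmul:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model a 1x1 conv as a matmul: (out_channels, in_channels)
W = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
x = rng.standard_normal(4)

# BatchNorm parameters (one per output channel)
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
mean, var, eps = rng.standard_normal(8), rng.random(8) + 0.1, 1e-5

# Unfused: three ops, three kernel launches
y_unfused = np.maximum(0, gamma * (W @ x + b - mean) / np.sqrt(var + eps) + beta)

# Fused: fold BN into the conv weights, then a single conv+ReLU kernel
scale = gamma / np.sqrt(var + eps)
W_fused = W * scale[:, None]
b_fused = (b - mean) * scale + beta
y_fused = np.maximum(0, W_fused @ x + b_fused)

assert np.allclose(y_unfused, y_fused)  # identical math, fewer launches
```

The fused form reads every activation once instead of three times, which is exactly the kind of memory-traffic win TensorRT hunts for across the whole graph.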
From our implementation (trt.py:222-232):
# Building the TensorRT engine
engine = engine_from_network(
    (builder, network, parser),
    config=CreateConfig(
        fp16=True,              # Enable half precision
        profiles=[p],           # Static shape optimization
        memory_pool_limits={    # Allocate workspace for tactics
            trt.MemoryPoolType.WORKSPACE: 2**33,  # 8GB
        },
        tactic_sources=[        # Try all available kernels
            trt.TacticSource.CUBLAS,
            trt.TacticSource.CUDNN,
            trt.TacticSource.CUBLAS_LT,
        ],
        refittable=True,        # Enable weight updates (more on this later!)
    ),
)
save_engine(engine, path=self.engine_path)
The result: A .plan file that contains optimized GPU machine code. Loading this file at runtime gives you direct, fast execution with zero Python overhead.
The trade-off: Build time of 20-30 minutes and loss of dynamic flexibility. But for production inference where the model is fixed, this is a fantastic trade-off.
A note on batching: TensorRT really shines with batch inference. Our implementation uses max_batch_size=1 (single image at a time) which limits GPU utilization. If you can batch multiple requests together, you’ll see even better speedups and GPU saturation. The downside? Higher latency per request since you’re waiting to fill the batch.
Problem 2: Getting the Most Out of Lower Precision
The Reality in 2025
Let’s skip the “FP32 vs FP16” comparison. By 2025, everyone runs inference in FP16 or lower. PyTorch does it, TensorRT does it, everyone does it. The question isn’t whether to use mixed precision, but how well your framework implements it.
The actual differences:
Modern GPUs are memory-bound for large models. Moving billions of parameters from VRAM to compute units is the bottleneck, not the math itself. Both PyTorch and TensorRT understand this.
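A back-of-the-envelope calculation makes this concrete. Assuming FLUX.1-dev’s roughly 12B transformer parameters in FP16 and an H100 SXM’s ~3.35 TB/s of HBM bandwidth (both figures are ballpark assumptions for illustration):

```python
params = 12e9              # FLUX.1-dev transformer: ~12B parameters
bytes_per_param = 2        # FP16
hbm_bytes_per_s = 3.35e12  # H100 SXM HBM3 bandwidth, ~3.35 TB/s

# Time just to stream every weight from HBM to the compute units once
transfer_s = params * bytes_per_param / hbm_bytes_per_s
print(f"{transfer_s * 1e3:.1f} ms per full weight pass")  # ~7 ms
```

Every denoising step touches essentially all of those weights, so a multi-step generation pays this bandwidth cost dozens of times - a latency floor of a couple hundred milliseconds before any arithmetic happens. That’s why halving the bytes moved matters so much.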
What TensorRT Does Differently
Both PyTorch and TensorRT use mixed precision (FP16 for most ops, FP32 for numerically sensitive ones). The difference is in the implementation details:
1. Fine-Grained Precision Control
TensorRT lets you specify precision per-layer, and it actually enforces it. From our implementation (trt.py:215-220):
# Selectively keep certain layer types in FP32 for accuracy
for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Reduction operations need precision (sums, means)
    if layer.type in [trt.LayerType.REDUCE]:
        layer.precision = trt.float32
    # Power operations are numerically sensitive
    if layer.type == trt.LayerType.ELEMENTWISE and '/Pow' in layer.name:
        layer.precision = trt.float32
# Everything else runs in FP16
PyTorch’s autocast does similar things, but TensorRT gives you more explicit control and guarantees about what precision each layer actually uses.
2. Kernel Tactic Selection (The Real Win)
This is where TensorRT’s build time pays off. During those 20-30 minutes, it:
- Benchmarks hundreds of kernel implementations for each layer on YOUR specific GPU
- Tests different algorithms (CUDNN, cuBLAS, cuBLAS-LT, custom kernels)
- Picks the absolute fastest for your exact hardware and model shape
PyTorch uses good default kernels, but it can’t spend 30 minutes benchmarking every possible implementation. TensorRT can and does.
Example: For a specific MatMul operation in FLUX:
- PyTorch might use cuBLAS (fast, general purpose)
- TensorRT tries: cuBLAS, cuBLAS-LT (5 variants), CUDNN (3 variants), custom fused kernels (10+ variants)
- Picks the one that’s 15-20% faster for that exact operation size
Multiply this across hundreds of operations and the gains compound.
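The compounding effect is easy to model. With hypothetical per-op timings and per-op kernel wins in the 5-20% range (numbers invented for illustration, not measurements):

```python
# Hypothetical per-op timings (ms) for a toy five-op graph, and the
# per-op speedup tactic selection finds for each operation
op_times = [4.0, 2.5, 1.5, 1.0, 1.0]
speedups = [1.20, 1.15, 1.18, 1.05, 1.10]

baseline = sum(op_times)
optimized = sum(t / s for t, s in zip(op_times, speedups))
print(f"{baseline / optimized:.2f}x end-to-end from per-op kernel picks")
```

No single op wins big, but because every op on the critical path gets a slightly better kernel, the end-to-end gain is the weighted aggregate of all of them.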
The bottom line: Both frameworks use FP16. TensorRT just picks better kernels for each operation.
Problem 3: Switching LoRA Models is Painfully Slow ⭐
This is the star of our implementation - an advanced TensorRT pattern that makes dynamic model serving practical.
The Problem
In production, you often need to serve multiple variants of a model. For our Bittensor subnet, validators send requests with different custom LoRA (Low-Rank Adaptation) models - think of them as “style plugins” that modify the base FLUX model’s behavior.
Traditional approaches fail here:
Option 1: Rebuild TensorRT engine for each LoRA
- Build time: 20-30 minutes per LoRA
- Result: Completely impractical for real-time serving
Option 2: Keep multiple engines in memory
- Memory cost: 3-5GB per engine
- For 10 LoRAs: 30-50GB VRAM just for engines
- Result: Wastes precious GPU memory
Option 3: Reload PyTorch model and switch LoRA
- Reload time: 5-10 seconds per switch
- Result: GPU sits idle during model loading
None of these work for a production system that needs to switch between dozens of LoRAs on-demand.
The Breakthrough: TensorRT Refitting
TensorRT engines can be built with a refittable=True flag, which allows updating weights only without recompiling the execution graph.
The key insight: LoRA fine-tuning only modifies model weights - the architecture (layer structure, connections) stays identical. So we can:
- Build one refittable engine for the base model
- Swap in different LoRA weights at runtime
- Update takes ~0.5 seconds instead of 5-10 seconds (PyTorch reload)
That’s a 10-20x speedup for model switching compared to the PyTorch reload approach - and over three orders of magnitude faster than the 30-minute rebuild option!
This was introduced in TensorRT 10.0 with features like weight-stripped engines and the REFIT_IDENTICAL flag, enabling “LoRA switches within the same runtime without recompilation” (NVIDIA Blog).
Weight-Stripped Engines: The Perfect Complement
TensorRT 10.0 introduced weight-stripped engines, which separate execution code from weights:
- Traditional engine: ~3-5GB (kernels + weights bundled)
- Weight-stripped engine: ~50MB (only kernels, 99% compression!)
How it works:
- Build the engine with the kSTRIP_PLAN and kREFIT_IDENTICAL flags
- Engine contains only CUDA kernel code, no weights
- At runtime, refit with weights from the ONNX model or LoRA variants
- Same performance as traditional engines, 99% smaller deployment size
Why this matters for production:
- Deploy 50MB engines instead of 5GB (faster CI/CD, less bandwidth)
- Store one small engine, swap unlimited LoRA weight sets
- Weights stay in original format (ONNX/safetensors) until runtime
- Perfect for serverless or edge deployment with size constraints
Our implementation doesn’t use weight-stripping yet (we bundle weights in the engine), but combining refitting + weight-stripping would reduce our deployment from ~3-5GB to ~50MB while maintaining the 0.5-second LoRA switching capability.
Caveat: Weight-stripped engines require exact weight shapes at refit time. Using kREFIT_IDENTICAL means you must refit with weights that match the build-time shapes exactly, or you’ll get undefined behavior.
Implementation Deep Dive
Let’s walk through how we implemented this in our codebase.
Phase 1: Build a Refittable Engine
First, we build the engine with refitting enabled (trt.py:205):
extra_build_args = {
    "refittable": True,               # ⭐ This is the magic flag
    "precision_constraints": "obey",  # Respect FP32/FP16 layer constraints
    # ... other optimization settings
}
engine = engine_from_network(
    (builder, network, parser),
    config=CreateConfig(fp16=True, profiles=[p], **extra_build_args)
)
This creates a .plan file where TensorRT tracks which tensors are weights (and thus can be updated) vs. intermediate activations (which cannot).
Phase 2: Create the TRTRefitter Class
The TRTRefitter class manages weight updates (trt.py:563-610):
class TRTRefitter:
    def __init__(self, engine, flux_path):
        # Create the refitter object
        self.refitter = trt.Refitter(engine, TRT_LOGGER)
        # Load base model weights
        transformer_sd = DiffusionPipeline.from_pretrained(
            flux_path, torch_dtype=torch.float16
        ).transformer.state_dict()
        # Initialize refitter with base weights
        for trt_weight_name in self.refitter.get_all_weights():
            # Map TensorRT names back to PyTorch names
            # (TRT adds prefixes like "transformer." and "base_layer.")
            pyt_weight_name = trt_weight_name.replace('transformer.', '')
            pyt_weight_name = pyt_weight_name.replace('base_layer.', '')
            if pyt_weight_name in transformer_sd:
                # Set the weight in the refitter
                self.refitter.set_named_weights(
                    trt_weight_name,
                    trt.Weights(transformer_sd[pyt_weight_name].numpy())
                )
What’s happening here:
- trt.Refitter(engine, ...) creates a refitter object that can update the engine’s weights
- We load the base FLUX model’s weights from HuggingFace
- We map PyTorch weight names → TensorRT internal names (this mapping is generated during the build phase)
- We initialize the refitter with the base weights
Phase 3: Swap LoRA Weights
Now the fast part - updating just the LoRA weights:
def prepare_lora_refit(self, lora_path):
    """Load LoRA weights and prepare them for refitting"""
    from safetensors.torch import load_file
    lora_sd = load_file(lora_path)  # Load LoRA from disk
    start_time = time.time()
    # Only update weights that have ".lora." in their name
    for trt_weight_name in self.refitter.get_all_weights():
        if '.lora.' in trt_weight_name:
            pyt_weight_name = trt_weight_name.replace('.lora.weight', '.weight')
            if pyt_weight_name in lora_sd:
                # Update this specific weight
                self.refitter.set_named_weights(
                    trt_weight_name,
                    trt.Weights(lora_sd[pyt_weight_name].numpy())
                )
    print(f"LoRA prep: {time.time() - start_time:.4f} seconds")

def commit_refit(self):
    """Push updated weights to the GPU engine"""
    start_time = time.time()
    if not self.refitter.refit_cuda_engine():
        # Check what went wrong
        missing = self.refitter.get_missing_weights()
        raise RuntimeError(f"Failed to refit. Missing: {missing}")
    print(f"Refit committed: {time.time() - start_time:.4f} seconds")
The workflow:
- prepare_lora_refit(): Load LoRA weights from disk, update only LoRA-specific weights in the refitter
- commit_refit(): Push all updated weights to the GPU engine in one atomic operation
- Total time: ~0.5 seconds (vs. 30 minutes for rebuild, or 5-10 seconds for PyTorch reload)
Phase 4: Use the Refitted Engine
From lora_generate_image.py:130-131:
# Create TRT transformer and its refitter
transformer = TRTTransformer(engine_path, transformer_config,
                             torch.device("cuda"), max_batch_size=1)
refitter = TRTRefitter(transformer.engine.engine, base_model_path)

# Later, when switching LoRAs:
refitter.prepare_lora_refit('./lora_v1.safetensors')
refitter.commit_refit()
# Engine now has LoRA v1 weights!

# Generate image with LoRA v1
image = pipe(prompt="a lighthouse on a rocky coast at sunset", ...).images[0]

# Switch to different LoRA
refitter.prepare_lora_refit('./lora_v2.safetensors')
refitter.commit_refit()
# Engine now has LoRA v2 weights!

# Generate with LoRA v2
image = pipe(prompt="a lighthouse on a rocky coast at sunset", ...).images[0]
Why This Works
The magic of refitting:
- TensorRT keeps the execution graph frozen (layer connections, kernel selections, memory layout)
- Only the weight tensors are updated
- Since LoRA training only modifies weights (not architecture), this is a perfect match
- The engine validates that weight shapes haven’t changed, ensuring safety
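The shape argument is worth seeing in code. A LoRA update is W' = W + α·BA, and the low-rank product BA has exactly the base weight’s shape. A quick NumPy sketch (dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                     # hidden dim and LoRA rank (hypothetical)

W = rng.standard_normal((d, d))  # base weight, baked into the engine
A = rng.standard_normal((r, d))  # LoRA down-projection
B = rng.standard_normal((d, r))  # LoRA up-projection
alpha = 1.0                      # LoRA scaling factor

# The merged weight has exactly the base weight's shape, so the frozen
# execution graph (kernels, memory layout, connections) accepts it via refit
W_merged = W + alpha * (B @ A)
assert W_merged.shape == W.shape

# Switching LoRAs = recomputing this sum with a new A/B pair and refitting
```

Because the merge is a pure weight substitution, nothing about the engine’s compiled execution plan has to change.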
Performance comparison:
| Method | Time | Memory |
|---|---|---|
| Rebuild engine | 20-30 min | 3-5GB (one engine) |
| Keep multiple engines | Instant | 30-50GB (10 engines) |
| PyTorch reload | 5-10 sec | ~8GB |
| TensorRT refitting | 0.5 sec | 3-5GB (one engine) |
Real-World Impact
According to NVIDIA’s blog, “By using weight refitting, the turnaround time when weights change is significantly shortened, which improves the liveness of GenAI applications.”
In our Bittensor miner, this means:
- ✅ Serve unlimited LoRA variants with fixed memory footprint
- ✅ Switch models in half a second (imperceptible to users)
- ✅ No downtime or warm-up periods
- ✅ Same inference performance as a static engine
Problem 4: GPU Sits Idle During Weight Loading
The Problem
Even with our lightning-fast 0.5-second LoRA refitting, there’s still wasted time:
Sequential workflow:
[Load LoRA from disk] → [Prepare refit] → [Commit to GPU] → [Run inference]
(I/O bound) (CPU work) (GPU work) (GPU work)
0.2s 0.2s 0.1s 3.0s
Total time: 3.5 seconds per image
The problem? During the first 0.5 seconds (load + prepare + commit), the GPU is completely idle. We’re paying for expensive H100 compute that’s just sitting there waiting for CPU and I/O operations to complete.
For high-throughput serving, this inefficiency compounds. At 0.5 seconds of idle time per 3.5-second request, every 100 requests waste 50 seconds of GPU time - roughly 14% of our capacity thrown away.
The Solution: Dual-Engine Pipeline Architecture
The fix is classic producer-consumer pattern: keep two TensorRT engines in memory and pipeline the work.
How it works:
Time 0-3s: [Engine A: Load LoRA 1] → [Engine A: Inference for request 1]
Time 0-3s: [Engine B: idle]
Time 3-6s: [Engine B: Load LoRA 2] → [Engine B: Inference for request 2]
Time 3-6s: [Engine A: idle]
Result: GPU always has work to do!
Implementation: TRTInferenceServer
From lora_generate_image.py:107-132:
class TRTInferenceServer:
    def __init__(self, base_model_path, engine_path, mapping_path):
        self.request_queue = queue.Queue()    # Incoming requests
        self.engine_queue = queue.Queue()     # Pool of available engines
        self.inference_queue = queue.Queue()  # Engines ready to infer
        # Create TWO identical TRT engines
        self.engine_queue.put(self._create_trt_transformer(engine_path, ...))
        self.engine_queue.put(self._create_trt_transformer(engine_path, ...))
        # Two worker threads
        self.loader_thread = threading.Thread(
            target=self._weight_loader_worker, daemon=True
        )
        self.inference_thread = threading.Thread(
            target=self._inference_worker, daemon=True
        )

    def _create_trt_transformer(self, engine_path, transformer_config):
        """Create a TRT transformer + refitter pair"""
        transformer = TRTTransformer(engine_path, transformer_config,
                                     torch.device("cuda"), max_batch_size=1)
        refitter = TRTRefitter(transformer.engine.engine, self.base_model_path)
        return transformer, refitter
Architecture diagram:
┌──────────────┐
│ Request │ User submits inference request
│ Queue │
└──────┬───────┘
│
v
┌──────────────────────────────────┐
│ Weight Loader Worker Thread │ [Producer]
│ - Get idle engine from pool │
│ - Load LoRA from disk │ (I/O + CPU work)
│ - Prepare refit │
│ - Commit refit to GPU │
│ - Pass to inference queue │
└──────────────┬───────────────────┘
│
v
┌──────────────────────────────────┐
│ Inference Worker Thread │ [Consumer]
│ - Get engine from inference Q │
│ - Run GPU inference │ (GPU work)
│ - Save image to disk │
│ - Return engine to pool │
└──────────────┬───────────────────┘
│
v
┌──────────────┐
│ Engine │ [Engine A] ←→ [Engine B]
│ Pool │ (alternating: one loads, one infers)
└──────────────┘
Worker Thread 1: Weight Loader
def _weight_loader_worker(self):
    """Producer: Prepare engines with the right weights"""
    while True:
        # Get an idle engine from the pool
        transformer, refitter = self.engine_queue.get()
        # Get next request
        request = self.request_queue.get()
        if request is None:  # Shutdown signal
            self.inference_queue.put(None)
            break
        print(f"[Loader] Loading LoRA for {request.lora_path}")
        # Load and refit (I/O + CPU work, GPU idle for THIS engine)
        if request.lora_path:
            refitter.prepare_lora_refit(request.lora_path)
            refitter.commit_refit()
        # Hand off to inference worker
        self.inference_queue.put((request, transformer, refitter))
        # Mark this stage complete
        self.engine_queue.task_done()
        self.request_queue.task_done()
Key insight: While Engine A is loading weights (CPU/I/O bound), Engine B is running inference (GPU bound). No conflict!
Worker Thread 2: Inference
def _inference_worker(self):
    """Consumer: Run inference and return engine to pool"""
    while True:
        job = self.inference_queue.get()
        if job is None:  # Shutdown signal
            break
        request, transformer, refitter = job
        print(f"[Inference] Running inference for {request.lora_path}")
        inference_start = time.time()
        # Swap in the transformer (cheap pointer swap)
        self.pipe.transformer = transformer
        # Run inference (GPU-bound work)
        generator = torch.Generator(device="cuda").manual_seed(request.seed)
        image = self.pipe(
            prompt=request.prompt,
            num_inference_steps=request.num_inference_steps,
            guidance_scale=request.guidance_scale,
            height=request.height,
            width=request.width,
            generator=generator
        ).images[0]
        # Ensure GPU finishes before continuing
        torch.cuda.synchronize()
        print(f"[Inference] Completed in {time.time() - inference_start:.3f}s")
        # Return engine to the pool (it's now idle and ready for next request)
        self.engine_queue.put((transformer, refitter))
        # Save image (I/O, happens after engine is back in pool)
        image.save(request.output_path)
        self.inference_queue.task_done()
Key insight: As soon as inference finishes, the engine goes back to the pool where the loader thread can grab it for the next request.
Atomic File Writes (Bonus: Preventing Race Conditions)
One subtle detail from the recent race condition fix (lora_generate_image.py:246-256):
# Write to temp file first, then atomically rename
with tempfile.NamedTemporaryFile(
    mode='wb',
    dir=output_dir,
    suffix='.png',
    delete=False
) as tmp_file:
    tmp_path = tmp_file.name
image.save(tmp_path)
# Atomic rename - file only becomes visible when complete
os.replace(tmp_path, request.output_path)
Why this matters: If a validator checks for the image file while we’re still writing it, they might get a partial/corrupted image. Atomic rename ensures the file only appears when it’s fully written.
Performance Impact
Sequential (single engine):
- Load + refit: 0.5s
- Inference: 3.0s
- Total: 3.5s per request
Pipelined (dual engine):
- While Engine A infers (3.0s), Engine B loads next weights (0.5s)
- The 0.5s load time is hidden by the 3.0s inference
- Effective throughput: ~3.0s per request
Improvement: 15% throughput increase + much better GPU utilization
For a system processing 1000 images/day, this saves 8-10 minutes of compute time daily. At scale, those minutes become hours.
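The pipelining arithmetic generalizes: once both engines are busy, steady-state time per request is the maximum of the two stage times, not their sum. A tiny model of that, using the timings from above:

```python
def per_request_time(load_s: float, infer_s: float, engines: int) -> float:
    """Steady-state time per request: stages serialize with one engine,
    overlap with two (the slower stage becomes the bottleneck)."""
    return load_s + infer_s if engines == 1 else max(load_s, infer_s)

sequential = per_request_time(0.5, 3.0, engines=1)
pipelined = per_request_time(0.5, 3.0, engines=2)
print(f"{sequential:.1f}s -> {pipelined:.1f}s per request")
```

This also shows why the pattern stops helping once loading exceeds inference: max(load, infer) would then be dominated by the load stage, and adding a second engine buys nothing.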
When to Use This Pattern
Dual-engine makes sense when:
- ✅ Your weight-loading time < your inference time (so one engine can finish before the other needs it)
- ✅ You have enough VRAM for two engines (~6-10GB)
- ✅ Throughput matters (serving many requests sequentially)
Skip it when:
- ❌ Loading takes longer than inference (you’d still block)
- ❌ Memory-constrained (can’t fit two engines)
- ❌ Low request volume (overhead not worth it)
Problem 5: Squeezing Out Even More Performance with FP8 ⭐
A 2025 cutting-edge technique that can push diffusion models even faster.
The Reality of FP8 in 2025
By 2025, FP8 quantization on NVIDIA Hopper GPUs (H100, H200) has become a standard technique for diffusion model optimization. While our implementation uses FP16, going one step further to FP8 can unlock another 1.5-2x speedup.
Performance potential:
- FLUX.1-dev with TensorRT FP16: ~3 seconds (our current implementation)
- FLUX.1-dev with TensorRT FP8: ~1.5-2 seconds (2.4x faster than PyTorch FP16)
What is FP8?
FP8 (8-bit floating point) comes in two formats:
- E4M3 (4-bit exponent, 3-bit mantissa): Better for activations, chosen for diffusion
- E5M2 (5-bit exponent, 2-bit mantissa): Higher dynamic range but less precision
TensorRT uses E4M3 for diffusion models because it provides finer-grained precision where activations cluster around zero.
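Under the hood, FP8 inference relies on a per-tensor scale that maps each tensor’s observed range onto E4M3’s representable range (largest finite value: 448). A simplified sketch of what calibration computes - real calibration aggregates the amax statistic over many batches, and per_tensor_scale here is an illustrative name, not a library API:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in E4M3 (OCP FP8 spec)

def per_tensor_scale(x: np.ndarray) -> float:
    """Scale that maps a tensor's observed range onto E4M3's range.
    Real calibration aggregates amax over 128-512 batches; this uses one."""
    return np.abs(x).max() / E4M3_MAX

# Diffusion activations cluster near zero - exactly where E4M3's
# extra mantissa bit helps relative to E5M2
acts = np.random.default_rng(0).normal(0.0, 0.5, size=10_000)
s = per_tensor_scale(acts)
assert np.isclose(np.abs(acts / s).max(), E4M3_MAX)
```

A bad scale either clips outliers (too small a range) or wastes the format’s precision (too large), which is why the calibration step below matters.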
Why FP8 Works for Diffusion Models
- Memory bandwidth: FP8 halves memory transfers vs FP16 (8 bits vs 16 bits)
- Tensor Cores: H100 has dedicated FP8 Tensor Cores that are 2x faster than FP16
- Numerical stability: Diffusion models are surprisingly robust to lower precision
Adobe’s Firefly video generation model achieved:
- 60% latency reduction with FP8
- Nearly 40% reduction in total cost of ownership (TCO)
- Minimal quality degradation
Implementation Approach
Using TensorRT Model Optimizer:
import modelopt.torch.quantization as mtq

# Post-training quantization (PTQ) - no retraining needed.
# FP8_DEFAULT_CFG quantizes weights and activations in the E4M3 format.
model = mtq.quantize(
    model,
    mtq.FP8_DEFAULT_CFG,            # FP8 (E4M3) quantization config
    forward_loop=calibration_loop,  # Run 128-512 calibration samples
)

# Export to ONNX (quantization annotations included)
torch.onnx.export(model, sample_input, "flux_fp8.onnx")

# Build the TensorRT engine as before - the build process is identical
# to FP16, except TensorRT now sees FP8 Q/DQ ops in the graph
Calibration: FP8 quantization requires calibrating on 128-512 sample images to determine optimal scaling factors for each layer. This takes 10-20 minutes but only needs to be done once.
Combining FP8 with Refitting
The challenge: Can you combine FP8 quantization with LoRA refitting?
Current limitation (2025): TensorRT documentation notes that “high-precision weights used in FP4 double quantization are not refittable.” The story for FP8 + refitting is still evolving.
Workaround: Build separate FP8 engines for your most popular LoRA variants, fall back to FP16 refittable engine for long-tail LoRAs.
Performance vs Quality Trade-off
When FP8 works well:
- ✅ Diffusion models (FLUX, Stable Diffusion, video generation)
- ✅ Large models where memory bandwidth dominates
- ✅ Hopper GPUs (H100, H200) with dedicated FP8 Tensor Cores
- ✅ Production serving where 1-2 seconds matters
When to stick with FP16:
- ❌ When you need perfect numerical accuracy
- ❌ On pre-Hopper GPUs (no FP8 Tensor Cores)
- ❌ When you require dynamic LoRA refitting
- ❌ Models with known numerical instability
Quality validation: Always run side-by-side A/B tests comparing FP8 vs FP16 outputs. For FLUX.1-dev, most users report imperceptible quality differences, but your specific use case may vary.
Getting Started with FP8
Phase 1: Validation
- Install TensorRT Model Optimizer: pip install nvidia-modelopt[torch]
- Quantize your model with 256 calibration samples
- Generate 100 test images side-by-side (FP8 vs FP16)
- Measure quality metrics (FID, CLIP score, human eval)
Phase 2: Production
- If quality is acceptable, build FP8 TensorRT engines
- Add FP8 engine path alongside FP16 in your server
- A/B test in production with 5% traffic
- Monitor quality metrics and user feedback
Phase 3: Optimization (optional)
- Experiment with per-layer precision (some layers FP16, most FP8)
- Try mixed FP8/FP16 for quality-critical layers
- Profile memory bandwidth savings with NVIDIA Nsight
Real-World Impact
If we applied FP8 to our FLUX implementation:
| Metric | FP16 (current) | FP8 (potential) | Improvement |
|---|---|---|---|
| Inference time | 3.0s | ~1.5-2.0s | 1.5-2x faster |
| Memory usage | ~8GB | ~4-5GB | 40-50% reduction |
| Throughput | 20 img/min | 30-40 img/min | 1.5-2x higher |
| Cost per image | $0.0025 | $0.0012-0.0017 | 30-50% cheaper |
The bottom line: FP8 is the next frontier for diffusion model optimization in 2025. If you’re on H100 and have already exhausted FP16 optimizations, FP8 can unlock another 1.5-2x speedup with minimal quality impact.
Web Resources:
- TensorRT FP8 Quantization for Stable Diffusion
- Optimizing FLUX.1 with Low-Precision Quantization
- TensorRT Model Optimizer v0.15 Release
Bringing It All Together: The Complete Pipeline
We’ve covered five distinct optimizations. Let’s synthesize them into the complete end-to-end flow.
Development Phase (One-Time Setup)
1. Train or download PyTorch model
↓
2. Export to ONNX format
├── torch.onnx.export(model, sample_inputs, ...)
├── Requires sample inputs matching expected inference shapes
└── Produces: model.onnx + external weight files
↓
3. Build TensorRT Engine (20-30 minutes)
├── Parse ONNX model
├── Apply graph fusion optimizations
├── Select optimal kernels via tactic benchmarking
├── Apply FP16/FP32 mixed precision
├── Compile to GPU machine code
└── Produces: transformer.plan (~3-5GB)
↓
4. Generate weight mapping
├── Hash-based matching of PyTorch ↔ TensorRT weight names
└── Produces: mapping.json (for refitting)
↓
5. Cache by hardware signature
└── {GPU_name}_cu{CUDA_version}_trt{TRT_version}_fp16/
├── transformer.plan
├── mapping.json
└── onnx/ (optional, for debugging)
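Deriving that cache directory name is straightforward. A sketch with a hypothetical helper - in practice the inputs would come from torch.cuda.get_device_name(0), torch.version.cuda, and trt.__version__, and engine_cache_dir is our illustrative name, not a function from the codebase:

```python
def engine_cache_dir(gpu_name: str, cuda: str, trt_ver: str,
                     precision: str = "fp16") -> str:
    """Cache key containing everything that invalidates an engine:
    GPU model, CUDA version, TensorRT version, and precision mode."""
    return f"{gpu_name.replace(' ', '_')}_cu{cuda}_trt{trt_ver}_{precision}"

print(engine_cache_dir("NVIDIA H100 80GB HBM3", "12.4", "10.11"))
# NVIDIA_H100_80GB_HBM3_cu12.4_trt10.11_fp16
```

Because engines are compiled for a specific GPU and TensorRT build, any one of these fields changing means the cached .plan file cannot be reused.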
Production Runtime
1. Server Initialization
├── Load base Diffusion Pipeline (FLUX.1-dev)
├── Load TensorRT engine into Engine A
├── Load TensorRT engine into Engine B (duplicate)
├── Create refitter objects for both engines
└── Start loader and inference worker threads
↓
2. For Each Inference Request
├── Loader Thread (Engine A):
│ ├── Get request from queue
│ ├── Load LoRA weights from disk
│ ├── Prepare refit (map weights to TRT format)
│ ├── Commit refit to GPU (~0.5s total)
│ └── Pass to inference queue
│
├── Inference Thread (Engine A):
│ ├── Receive engine + request
│ ├── Run TRT inference (~3s)
│ ├── Synchronize GPU
│ ├── Save image atomically
│ └── Return engine to pool
│
└── Meanwhile, Engine B processes next request...
↓
3. Result
└── Image generated in ~3 seconds (vs 15s PyTorch)
Key Files and Their Roles
| File | Purpose | Key Content |
|---|---|---|
| trt.py:163-263 | Engine class | TRT engine loading, buffer allocation, inference execution |
| trt.py:436-560 | DiffusionTransformer | ONNX export configuration for FLUX Transformer |
| trt.py:563-610 | TRTRefitter | Weight update without recompilation |
| trt.py:613-696 | TRTTransformer | Drop-in replacement for PyTorch Transformer |
| trt.py:718-808 | build_transformer_engine_from_pipeline() | End-to-end build orchestration |
| lora_generate_image.py:107-260 | TRTInferenceServer | Dual-engine pipeline with producer-consumer workers |
Configuration Deep Dive
From trt.py:193-211, here are the important build settings:
extra_build_args = {
    # Memory allocation for tactics benchmarking and graph optimization
    "memory_pool_limits": {
        trt.MemoryPoolType.WORKSPACE: 2**33,    # 8GB workspace
        trt.MemoryPoolType.TACTIC_DRAM: 2**33,  # 8GB for tactic selection
    },
    # Which kernel libraries to try
    "tactic_sources": [
        trt.TacticSource.CUBLAS,                  # Standard BLAS
        trt.TacticSource.CUBLAS_LT,               # CUDA 11+ optimized BLAS
        trt.TacticSource.CUDNN,                   # Convolutions, RNNs, attention
        trt.TacticSource.EDGE_MASK_CONVOLUTIONS,  # Specialized conv kernels
        trt.TacticSource.JIT_CONVOLUTIONS,        # Runtime-compiled convs
    ],
    # Enable weight updates at runtime
    "refittable": True,
    # Enforce layer-level precision constraints (respect FP32 overrides)
    "precision_constraints": "obey",
}
What these settings do:
- memory_pool_limits: TensorRT needs scratch memory during build to benchmark tactics. More memory = can test more variants = better optimization (but longer build time)
- tactic_sources: Each source represents a different kernel library. TensorRT will try all of them and pick the fastest for each layer
- refittable: Enables our LoRA weight-swapping magic
- precision_constraints: Without this, TensorRT might ignore your FP32 layer overrides and force everything to FP16
Real-World Performance Numbers
From our production Bittensor miner:
| Metric | PyTorch Baseline | TensorRT Optimized | Improvement |
|---|---|---|---|
| Inference time | 15 seconds | 3 seconds | 5x faster |
| Inference time (first run) | 15 seconds | 4-5 seconds | Warmup overhead |
| Memory usage | ~12GB VRAM | ~8GB VRAM | 33% reduction |
| LoRA switch time | 5-10 seconds | 0.5 seconds | 10-20x faster |
| Throughput (sequential) | 4 images/min | 20 images/min | 5x higher |
| GPU utilization | 60-70% | 90-95% | Better saturation* |
*GPU utilization numbers are from our FLUX deployment. Your mileage will vary based on model size, batch size, and hardware. We’re running batch_size=1, which limits utilization compared to batched inference.
Cost implications:
At cloud GPU pricing (~$2-3/hr for an H100; the numbers below use $3/hr):
- PyTorch: 15s/image = 240 images/hour = $0.0125 per image
- TensorRT: 3s/image = 1200 images/hour = $0.0025 per image
5x cost reduction for the same hardware!
Build Time Investment
One-time costs:
- Initial engine build: 20-30 minutes
- ONNX export development/debugging (first time)
- Testing and validation
Ongoing costs:
- Rebuild when model architecture changes (rare in production)
- Rebuild when switching GPU types (engines are hardware-specific)
ROI calculation:
If you’re running 1000 inferences/day:
- Time saved per day: (15s - 3s) × 1000 = 12,000 seconds = 3.3 hours
- Break-even point: the ~30 min build time is recovered well within the first day of production use (3.3 hr saved daily)
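The cost and ROI arithmetic above is easy to sanity-check in a few lines (a sketch; the $3/hr rate, per-image times, and 1000 inferences/day are the assumptions stated in the text):

```python
# Sanity-check the cost and ROI arithmetic above.
# Assumptions (from the text): $3/hr H100, 15s vs 3s per image, 1000 inferences/day.
HOURLY_RATE = 3.0  # USD per GPU-hour

def cost_per_image(seconds_per_image: float, hourly_rate: float = HOURLY_RATE) -> float:
    """Cost of one image at a given per-image latency."""
    images_per_hour = 3600 / seconds_per_image
    return hourly_rate / images_per_hour

pytorch_cost = cost_per_image(15)  # $0.0125 per image
trt_cost = cost_per_image(3)       # $0.0025 per image

# GPU time saved per day at 1000 inferences/day
daily_savings_hours = (15 - 3) * 1000 / 3600  # ~3.33 hours
```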
Troubleshooting Common Issues
Issue 1: ONNX export fails with unsupported operators
- Symptom: `RuntimeError: ONNX export failed: Unsupported operator`
- Solution:
  - Check for custom layers not supported by ONNX
  - Try a higher opset version (`opset_version=18` or later)
  - Replace custom ops with ONNX-compatible equivalents
  - For RMSNorm layers, use the `ONNXSafeRMSNorm` wrapper (see `trt.py:110-137`)
Issue 2: Engine build fails with out-of-memory
- Symptom: `Error: Memory allocation failed during engine build`
- Solution:
  - Reduce workspace memory: `trt.MemoryPoolType.WORKSPACE: 2**32` (4GB instead of 8GB)
  - Set the `TRT_BUILD_SAFE=1` environment variable to disable aggressive optimizations
  - Close other GPU processes during build
  - Build on a machine with more VRAM
Issue 3: Accuracy regression - generated images look wrong
- Symptom: TensorRT outputs differ visually from PyTorch
- Solution:
  - Add more layer types to the FP32 whitelist (see `trt.py:215-220`)
  - Enable strict type constraints: `precision_constraints="obey"`
  - Compare layer-by-layer outputs using ONNX Runtime as an intermediate step
  - Check for numerical instability in LayerNorm, Softmax, or Power operations
Issue 4: TensorRT slower than expected or slower than PyTorch
- Symptom: Little to no speedup, or even slower than PyTorch
- Solution:
  - Verify FP16 is enabled: check the build logs for "fp16=True"
  - Ensure static shapes: dynamic shapes add 2-3x overhead
  - Profile with NVTX: wrap inference in `torch.cuda.nvtx.range_push`/`range_pop`
  - Check for CPU-GPU sync points: remove unnecessary `torch.cuda.synchronize()` calls
  - Verify Tensor Cores are being used: check the build logs for "CUBLAS_LT" tactics
Issue 5: Refitting is slower than expected
- Symptom: LoRA refitting takes 2-5 seconds instead of 0.5 seconds
- Solution:
  - Upgrade to TensorRT 10.11+ (10.10 had a known refitting performance regression)
  - Check for convolution layers in FP16/INT8 within branches or loops (known regression)
  - Profile with NVTX to identify which refit operations are slow
  - Check whether weight shapes are mismatched (this forces internal recompilation)
  - Verify you're using the `kREFIT` or `kREFIT_IDENTICAL` flags correctly
Web Resources:
- TensorRT Best Practices Guide
- ONNX Operator Compatibility
- TensorRT Troubleshooting FAQ
- NVIDIA Developer Forums
When to Use TensorRT (and When Not To)
TensorRT is powerful, but it’s not always the right choice. Here’s a practical decision framework.
✅ Use TensorRT When…
1. Production inference with high throughput requirements
- Serving hundreds to millions of requests per day
- Every millisecond of latency matters
- Hardware costs are significant
2. Model architecture is stable
- Not changing layer structure frequently
- Fine-tuning only (weights change, architecture doesn’t)
- Production-ready models, not research prototypes
3. NVIDIA GPU infrastructure
- You’re running on NVIDIA hardware (required)
- A100, H100, or recent RTX/Tesla GPUs
- CUDA 11.8+ and TensorRT 8.5+ installed
4. Latency-critical applications
- Real-time inference (video processing, robotics)
- Interactive applications (chatbots, live image generation)
- Batch processing with tight SLAs
5. You can tolerate build time
- 20-30 minute one-time build cost is acceptable
- Can cache engines and reuse across deployments
- CI/CD pipeline can handle multi-minute build steps
❌ Skip TensorRT When…
1. Research and experimentation phase
- Model architecture changes frequently
- Rapid iteration is more valuable than speed
- Prototyping different approaches
2. CPU-only deployment
- TensorRT requires NVIDIA GPUs
- Consider ONNX Runtime for CPU optimization instead
3. Non-NVIDIA hardware
- AMD GPUs → look at ROCm or ONNX Runtime
- Apple Silicon → use Core ML or Metal Performance Shaders
- Edge TPUs → use TensorFlow Lite
4. torch.compile meets your performance needs
- Simple models (small CNNs, basic transformers)
- torch.compile achieving your latency/throughput targets
- Deployment simplicity outweighs potential TensorRT gains (10-20%)
- Workloads where torch.compile benchmarks equal or better than TensorRT
5. Extremely dynamic architectures
- Variable number of layers per forward pass
- Conditional execution with data-dependent branches
- Dynamic shapes that change drastically between requests
Alternative Approaches
Don’t want the full TensorRT complexity? Consider these alternatives:
| Approach | Speedup | Ease of Use | Portability |
|---|---|---|---|
| torch.compile | 1.5-2x | ⭐⭐⭐⭐⭐ One line | PyTorch only |
| ONNX Runtime | 2-3x | ⭐⭐⭐⭐ Easy export | Cross-platform |
| TorchScript | 1.3-1.8x | ⭐⭐⭐ Moderate | PyTorch only |
| BetterTransformer | 1.5-2x | ⭐⭐⭐⭐⭐ Drop-in | Transformers only |
| TensorRT | 3-6x | ⭐⭐ More complex | NVIDIA GPUs only |
| Quantization (INT8) | 2-4x | ⭐⭐⭐ Requires calibration | Various backends |
| FP8 Quantization | 1.5-2.5x | ⭐⭐⭐ Requires calibration | H100/H200 GPUs |
A note on torch.compile:
PyTorch 2.0’s torch.compile has made huge strides in closing the performance gap with TensorRT. For some models, it can match or even beat TensorRT. The choice often comes down to your specific model architecture and deployment constraints. It’s worth trying torch.compile first since it’s just one line of code.
The pragmatic approach for 2025:
- Start with `torch.compile` (one line: `model = torch.compile(model)`)
- If that's not enough and you're on NVIDIA GPUs, try TensorRT
- Consider FP8 quantization as complementary to either approach (especially for diffusion models)
- Profile, measure, and choose based on your specific workload - the landscape is more competitive than ever
Getting Started Checklist
If you’ve decided TensorRT is right for you:
Phase 1: Validation
- Benchmark your PyTorch baseline (latency, throughput, memory)
- Export a simple model to ONNX to test compatibility
- Build a basic TensorRT engine and verify it works
- Compare outputs (PyTorch vs TRT) to check numerical accuracy
- Measure actual speedup (aim for ≥2x to justify complexity)
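For the baseline benchmark, a minimal harness with warmup runs and percentiles keeps the measurement honest. This is a sketch: the callable you pass stands in for your model invocation, and for GPU work you'd call `torch.cuda.synchronize()` inside it so you time completed kernels rather than launches:

```python
import statistics
import time

def benchmark(fn, warmup: int = 3, iters: int = 20) -> dict:
    """Time a callable, excluding warmup runs.

    Warmup matters: the first calls pay for JIT compilation, cache
    population, and cuDNN autotuning, and would skew the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return {
        "mean_s": statistics.mean(samples),
        "p50_s": samples[len(samples) // 2],
        "p95_s": samples[int(len(samples) * 0.95)],
    }
```

Reporting p50 and p95 rather than a single number also surfaces the first-run warmup overhead noted in the performance table.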
Phase 2: Production Integration
- Set up engine caching by hardware signature
- Implement proper error handling and fallback to PyTorch
- Add monitoring (inference time, GPU utilization, errors)
- Test edge cases (different input shapes, error conditions)
- Document build process for your team
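Because engines are tied to the exact GPU, TensorRT version, and build flags, the cache key should hash all of them. Here's a sketch; the field names and filename scheme are illustrative, not the repo's actual implementation:

```python
import hashlib
import json

def engine_cache_key(gpu_name: str, trt_version: str,
                     build_args: dict, model_id: str) -> str:
    """Derive a stable filename for a cached engine.

    Anything that changes engine validity - GPU model, TensorRT version,
    precision/refit flags, model architecture - must be part of the key,
    so a mismatched engine is never loaded by accident.
    """
    payload = json.dumps(
        {"gpu": gpu_name, "trt": trt_version,
         "args": build_args, "model": model_id},
        sort_keys=True,  # stable ordering -> stable hash
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"{model_id}-{digest}.engine"
```

At load time you'd compute the same key (querying the GPU name via `torch.cuda.get_device_name()`, for example) and fall back to a rebuild on a cache miss.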
Phase 3: Advanced Optimization (optional)
- Implement refitting if you need dynamic model variants
- Set up dual-engine pipeline if throughput is critical
- Fine-tune layer-level precision (FP32 vs FP16)
- Profile with NVTX and optimize bottlenecks
- Consider INT8 quantization for even more speed
Learning Resources
Community Resources:
- Torch-TensorRT (PyTorch Integration)
- TensorRT GitHub Issues (great for troubleshooting)
- NVIDIA Developer Forums
Quantization and Optimization:
- TensorRT Model Optimizer Documentation
- FP8 Quantization for Diffusion Models
- Working with Quantized Types in TensorRT
Tutorials and Examples:
- NVIDIA TensorRT Blog Posts
- Ultralytics TensorRT Export Guide
- This project’s implementation (real production code!)
Final Thoughts
TensorRT is a power tool. Like any power tool, it requires more setup and expertise than simpler alternatives, but when you need maximum performance, it delivers.
The decision usually comes down to:
- Can you afford 20-30 minutes of build time? (one-time cost)
- Are you on NVIDIA hardware? (required)
- Do you need 3x+ speedup over PyTorch? (typical benefit)
If you answered yes to all three, TensorRT is likely worth the investment.
Conclusion
We started with a 15-second inference problem. By systematically applying TensorRT’s optimization toolkit, we brought it down to 3 seconds - a 5x speedup that makes real-time AI image generation practical.
The Journey Recap
Problem 1: Dynamic graph overhead → Static compiled engines
- Eliminated Python interpreter and framework dispatch overhead
- Pre-compiled execution plans with kernel fusion
- Result: Direct GPU execution with minimal CPU involvement
Problem 2: FP32 precision waste → Mixed precision + kernel auto-tuning
- FP16 for most operations (2x memory bandwidth improvement)
- Selective FP32 for numerically sensitive layers
- Automatic kernel selection via tactics benchmarking
- Result: 2-3x faster execution with Tensor Cores
Problem 3: Slow LoRA switching → Weight refitting
- Dynamic weight updates without graph recompilation
- 0.5 seconds vs 30 minutes for model switching
- Single engine serves unlimited model variants
- Result: 3600x faster model switching
Problem 4: GPU idle time → Dual-engine pipeline
- Producer-consumer pattern with engine pool
- Weight loading happens while other engine infers
- Atomic file writes prevent race conditions
- Result: 15% throughput improvement, 95% GPU utilization
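The atomic-write trick is worth spelling out: write to a temporary file in the same directory, then `os.replace()` it into place, so the consuming engine never sees a half-written weight file. A sketch under that assumption (the function and naming are illustrative, not the repo's actual API):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data to path atomically.

    os.replace() is atomic on POSIX when source and destination live on
    the same filesystem, so a reader sees either the old file or the
    complete new one - never a partial write.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Keeping the temp file in the destination directory matters: `os.replace()` across filesystems would fall back to a non-atomic copy-and-delete.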
Problem 5: FP8 quantization → Next-level optimization for 2025
- FP8 quantization on H100/H200 GPUs for 2x+ speedup
- Post-training quantization with calibration (no retraining)
- Trade-off: potential quality impact vs major performance gain
- Result: 1.5-2x additional speedup (3s → 1.5-2s), 40-50% memory reduction
The Key Insight
TensorRT isn’t magic - it’s the disciplined application of well-known optimization techniques:
- Graph fusion (compiler optimization)
- Mixed precision (numerical analysis)
- Kernel auto-tuning (empirical performance testing)
- Weight refitting (separating data from code)
- Pipeline parallelism (concurrent systems design)
- FP8 quantization (aggressive precision reduction for cutting-edge hardware)
What makes TensorRT powerful is how it orchestrates these techniques into a cohesive system, automatically and reliably.
Real-World Impact
For our Bittensor miner serving AI image generation:
- 5x faster inference (15s → 3s)
- 5x higher throughput (4 images/min → 20 images/min)
- 5x lower cost per image generated
- Practical LoRA switching for dynamic model serving
The 20-30 minute build time pays for itself after a single day of production use.
When It’s Worth It
TensorRT makes sense when:
- ✅ You’re on NVIDIA GPUs in production
- ✅ Your model architecture is stable
- ✅ Latency and cost matter
- ✅ You can invest a day setting it up
Skip it when:
- ❌ You’re still in research/prototyping
- ❌ PyTorch is already fast enough
- ❌ You’re on non-NVIDIA hardware
Try It Yourself
The complete implementation from this post is available in the dippy-studio-bittensor-miner repository.
Key files to explore:
- `trt.py` - Engine building, refitting, and execution
- `lora_generate_image.py` - Dual-engine inference server
Start with a simple model, export to ONNX, build an engine, and measure the speedup. You might be surprised at how much faster your inference becomes.
Final Thought
“The 20-30 minute build time might seem daunting, but when it saves you 12 seconds per inference across millions of requests, the math becomes compelling quickly.”
Optimization isn’t about making everything faster - it’s about making the right things faster. TensorRT gives you the tools to do exactly that.
Questions or feedback? Open an issue in the repository or reach out on the Bittensor Discord. I’d love to hear about your TensorRT experiences!
Found this helpful? Share it with someone struggling with slow inference. Every millisecond counts in production AI.