From 15 Seconds to 3: A Deep Dive into TensorRT Inference Optimization
Estimated reading time: 26-33 minutes | 6,635 words
The Performance Problem
Our AI image generator powered by FLUX.1-dev was taking 15 seconds per image on an H100 GPU. For a production Bittensor subnet serving real-time requests, that’s simply too slow. Users expect results in seconds, not tens of seconds.
With TensorRT, we brought inference time down to 3 seconds - a 5x speedup. But TensorRT isn’t magic. It’s a disciplined application of well-understood optimization techniques, carefully orchestrated into a cohesive system.
This post explains:
- Why inference is slow in PyTorch
- How TensorRT fixes these bottlenecks
- What the full PyTorch → ONNX → TensorRT pipeline looks like
- A deep dive into an advanced pattern: dynamic LoRA weight switching via refitting
What is TensorRT?
TensorRT is NVIDIA’s high-performance inference optimizer and runtime. It takes trained models (from PyTorch, TensorFlow, or ONNX) and transforms them into optimized “engines” that execute dramatically faster on NVIDIA GPUs.
Who needs this? If you’re running AI models in production where latency matters - real-time APIs, edge devices, high-throughput services - TensorRT can be the difference between a system that barely works and one that scales effortlessly.
What we’re building: Throughout this post, we’ll walk through a real implementation: a Bittensor miner that generates images using FLUX.1-dev, with support for dynamically switching between custom LoRA models without rebuilding engines.
Let’s dive into the problems TensorRT solves, one by one.
TensorRT versions: This implementation uses TensorRT 10.0’s refitting capabilities. As of mid-2025, TensorRT 10.12 is the latest release (June 2025), with performance improvements in 10.10-10.12. TensorRT 10.11 fixed important refitting performance regressions from 10.10, so we recommend using 10.11+ for production deployments.
Table of Contents
- The Performance Problem
- What is TensorRT?
- Problem 1: PyTorch’s Dynamic Computation Graph
- Problem 2: Getting the Most Out of Lower Precision
- Problem 3: Switching LoRA Models is Painfully Slow ⭐
- Problem 4: GPU Sits Idle During Weight Loading
- Problem 5: Squeezing Out Even More Performance with FP8 ⭐
- Bringing It All Together: The Complete Pipeline
- When to Use TensorRT (and When Not To)
- Conclusion
Problem 1: PyTorch’s Dynamic Computation Graph
The Problem
PyTorch is designed for flexibility. Its dynamic computation graph means you can change the model structure on every forward pass - conditionals, loops, dynamic shapes - all evaluated at runtime.
Let’s be honest: in 2025, PyTorch with torch.compile has closed the gap significantly. Recent benchmarks show torch.compile matching or even exceeding TensorRT for many models, especially smaller transformers. For some workloads, torch.compile is genuinely the better choice.
But TensorRT still pulls ahead in specific scenarios, and our FLUX deployment is one of them:
Where TensorRT still wins:
- Exhaustive kernel benchmarking: TensorRT spends 20-30 minutes trying hundreds of kernel implementations and picking the absolute fastest. torch.compile uses heuristics and can’t afford that time.
- Static shape optimization: When your input shapes never change, TensorRT can bake in optimizations that dynamic compilers can’t. Things like pre-computed memory offsets, specialized kernels for exact dimensions, and aggressive constant folding.
- Multi-layer fusion patterns: TensorRT can fuse complex patterns across many layers that JIT compilers might miss. It has the time to explore combinations that would be too slow to search at runtime.
- Predictable performance: Once built, a TensorRT engine gives you the exact same performance every time. No JIT warmup, no recompilation, no variance.
For our FLUX deployment, we measured a consistent 5x speedup with TensorRT over baseline PyTorch. With torch.compile? We’d probably see 1.8-2.2x - still significant, but TensorRT’s exhaustive kernel benchmarking pays off for this specific architecture and workload.
When torch.compile wins: For many LLM and smaller transformer workloads, torch.compile delivers comparable performance to TensorRT with drastically simpler deployment. The December 2024 Collabora study found torch.compile consistently outperformed TensorRT across popular models like Llama-7b, Mistral, and Phi-3. Your mileage will vary - always benchmark your specific model.
The Solution: Static TensorRT Engines
TensorRT analyzes your model once during the build phase and creates a static execution plan - a .plan file that’s essentially a compiled binary for your GPU.
Key optimizations:
- Graph fusion: Multiple operations are fused into single kernels (e.g., Conv → BatchNorm → ReLU becomes one kernel)
- Pre-compiled execution: No runtime graph building or kernel selection
- Direct GPU execution: Minimal CPU involvement once inference starts
- Optimized memory layout: Data stays on GPU, arranged for maximum bandwidth
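To see why fusion matters, here’s a minimal NumPy sketch (illustrative only, not TensorRT’s actual fusion code) of the classic Conv → BatchNorm → ReLU fold: the BN scale and shift collapse into the convolution’s weights and bias, so three kernel launches become one. A 1x1 conv is modeled as a plain matmul:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model a 1x1 conv as a matmul: (out_channels, in_channels)
W = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
x = rng.standard_normal(4)

# BatchNorm parameters (one per output channel)
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
mean, var, eps = rng.standard_normal(8), rng.random(8) + 0.1, 1e-5

# Unfused: three ops, three kernel launches
y_unfused = np.maximum(0, gamma * (W @ x + b - mean) / np.sqrt(var + eps) + beta)

# Fused: fold BN into the conv weights, then a single conv+ReLU kernel
scale = gamma / np.sqrt(var + eps)
W_fused = W * scale[:, None]
b_fused = (b - mean) * scale + beta
y_fused = np.maximum(0, W_fused @ x + b_fused)

assert np.allclose(y_unfused, y_fused)  # identical math, fewer launches
```

The fused form reads every activation once instead of three times, which is exactly the kind of memory-traffic win TensorRT hunts for across the whole graph.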
From our implementation (trt.py:222-232):
# Building the TensorRT engine
engine = engine_from_network(
    (builder, network, parser),
    config=CreateConfig(
        fp16=True,              # Enable half precision
        profiles=[p],           # Static shape optimization
        memory_pool_limits={    # Allocate workspace for tactics
            trt.MemoryPoolType.WORKSPACE: 2**33,  # 8GB
        },
        tactic_sources=[        # Try all available kernels
            trt.TacticSource.CUBLAS,
            trt.TacticSource.CUDNN,
            trt.TacticSource.CUBLAS_LT,
        ],
        refittable=True,        # Enable weight updates (more on this later!)
    ),
)
save_engine(engine, path=self.engine_path)
The result: A .plan file that contains optimized GPU machine code. Loading this file at runtime gives you direct, fast execution with zero Python overhead.
The trade-off: Build time of 20-30 minutes and loss of dynamic flexibility. But for production inference where the model is fixed, this is a fantastic trade-off.
A note on batching: TensorRT really shines with batch inference. Our implementation uses max_batch_size=1 (single image at a time) which limits GPU utilization. If you can batch multiple requests together, you’ll see even better speedups and GPU saturation. The downside? Higher latency per request since you’re waiting to fill the batch.
Problem 2: Getting the Most Out of Lower Precision
The Reality in 2025
Let’s skip the “FP32 vs FP16” comparison. By 2025, everyone runs inference in FP16 or lower. PyTorch does it, TensorRT does it, everyone does it. The question isn’t whether to use mixed precision, but how well your framework implements it.
The actual differences:
Modern GPUs are memory-bound for large models. Moving billions of parameters from VRAM to compute units is the bottleneck, not the math itself. Both PyTorch and TensorRT understand this.
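A back-of-the-envelope calculation makes this concrete. Assuming FLUX.1-dev’s roughly 12B transformer parameters in FP16 and an H100 SXM’s ~3.35 TB/s of HBM bandwidth (both figures are ballpark assumptions for illustration):

```python
params = 12e9              # FLUX.1-dev transformer: ~12B parameters
bytes_per_param = 2        # FP16
hbm_bytes_per_s = 3.35e12  # H100 SXM HBM3 bandwidth, ~3.35 TB/s

# Time just to stream every weight from HBM to the compute units once
transfer_s = params * bytes_per_param / hbm_bytes_per_s
print(f"{transfer_s * 1e3:.1f} ms per full weight pass")  # ~7 ms
```

Every denoising step touches essentially all of those weights, so a multi-step generation pays this bandwidth cost dozens of times - a latency floor of a couple hundred milliseconds before any arithmetic happens. That’s why halving the bytes moved matters so much.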
What TensorRT Does Differently
Both PyTorch and TensorRT use mixed precision (FP16 for most ops, FP32 for numerically sensitive ones). The difference is in the implementation details:
1. Fine-Grained Precision Control
TensorRT lets you specify precision per-layer, and it actually enforces it. From our implementation (trt.py:215-220):
# Selectively keep certain layer types in FP32 for accuracy
for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Reduction operations need precision (sums, means)
    if layer.type in [trt.LayerType.REDUCE]:
        layer.precision = trt.float32
    # Power operations are numerically sensitive
    if layer.type == trt.LayerType.ELEMENTWISE and '/Pow' in layer.name:
        layer.precision = trt.float32
# Everything else runs in FP16
PyTorch’s autocast does similar things, but TensorRT gives you more explicit control and guarantees about what precision each layer actually uses.
2. Kernel Tactic Selection (The Real Win)
This is where TensorRT’s build time pays off. During those 20-30 minutes, it:
- Benchmarks hundreds of kernel implementations for each layer on YOUR specific GPU
- Tests different algorithms (CUDNN, cuBLAS, cuBLAS-LT, custom kernels)
- Picks the absolute fastest for your exact hardware and model shape
PyTorch uses good default kernels, but it can’t spend 30 minutes benchmarking every possible implementation. TensorRT can and does.
Example: For a specific MatMul operation in FLUX:
- PyTorch might use cuBLAS (fast, general purpose)
- TensorRT tries: cuBLAS, cuBLAS-LT (5 variants), CUDNN (3 variants), custom fused kernels (10+ variants)
- Picks the one that’s 15-20% faster for that exact operation size
Multiply this across hundreds of operations and the gains compound.
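The compounding effect is easy to model. With hypothetical per-op timings and per-op kernel wins in the 5-20% range (numbers invented for illustration, not measurements):

```python
# Hypothetical per-op timings (ms) for a toy five-op graph, and the
# per-op speedup tactic selection finds for each operation
op_times = [4.0, 2.5, 1.5, 1.0, 1.0]
speedups = [1.20, 1.15, 1.18, 1.05, 1.10]

baseline = sum(op_times)
optimized = sum(t / s for t, s in zip(op_times, speedups))
print(f"{baseline / optimized:.2f}x end-to-end from per-op kernel picks")
```

No single op wins big, but because every op on the critical path gets a slightly better kernel, the end-to-end gain is the weighted aggregate of all of them.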
The bottom line: Both frameworks use FP16. TensorRT just picks better kernels for each operation.
Problem 3: Switching LoRA Models is Painfully Slow ⭐
This is the star of our implementation - an advanced TensorRT pattern that makes dynamic model serving practical.
The Problem
In production, you often need to serve multiple variants of a model. For our Bittensor subnet, validators send requests with different custom LoRA (Low-Rank Adaptation) models - think of them as “style plugins” that modify the base FLUX model’s behavior.
Traditional approaches fail here:
Option 1: Rebuild TensorRT engine for each LoRA
- Build time: 20-30 minutes per LoRA
- Result: Completely impractical for real-time serving
Option 2: Keep multiple engines in memory
- Memory cost: 3-5GB per engine
- For 10 LoRAs: 30-50GB VRAM just for engines
- Result: Wastes precious GPU memory
Option 3: Reload PyTorch model and switch LoRA
- Reload time: 5-10 seconds per switch
- Result: GPU sits idle during model loading
None of these work for a production system that needs to switch between dozens of LoRAs on-demand.
The Breakthrough: TensorRT Refitting
TensorRT engines can be built with a refittable=True flag, which allows updating weights only without recompiling the execution graph.
The key insight: LoRA fine-tuning only modifies model weights - the architecture (layer structure, connections) stays identical. So we can:
- Build one refittable engine for the base model
- Swap in different LoRA weights at runtime
- Update takes ~0.5 seconds instead of 5-10 seconds (PyTorch reload)
That’s a 10-20x speedup for model switching compared to the PyTorch reload approach - and over three orders of magnitude faster than the 30-minute rebuild option!
This was introduced in TensorRT 10.0 with features like weight-stripped engines and the REFIT_IDENTICAL flag, enabling “LoRA switches within the same runtime without recompilation” (NVIDIA Blog).
Weight-Stripped Engines: The Perfect Complement
TensorRT 10.0 introduced weight-stripped engines, which separate execution code from weights:
- Traditional engine: ~3-5GB (kernels + weights bundled)
- Weight-stripped engine: ~50MB (only kernels, 99% compression!)
How it works:
- Build the engine with the kSTRIP_PLAN and kREFIT_IDENTICAL flags
- Engine contains only CUDA kernel code, no weights
- At runtime, refit with weights from the ONNX model or LoRA variants
- Same performance as traditional engines, 99% smaller deployment size
Why this matters for production:
- Deploy 50MB engines instead of 5GB (faster CI/CD, less bandwidth)
- Store one small engine, swap unlimited LoRA weight sets
- Weights stay in original format (ONNX/safetensors) until runtime
- Perfect for serverless or edge deployment with size constraints
Our implementation doesn’t use weight-stripping yet (we bundle weights in the engine), but combining refitting + weight-stripping would reduce our deployment from ~3-5GB to ~50MB while maintaining the 0.5-second LoRA switching capability.
Caveat: Weight-stripped engines require exact weight shapes at refit time. Using kREFIT_IDENTICAL means you must refit with weights that match the build-time shapes exactly, or you’ll get undefined behavior.
Implementation Deep Dive
Let’s walk through how we implemented this in our codebase.
Phase 1: Build a Refittable Engine
First, we build the engine with refitting enabled (trt.py:205):
extra_build_args = {
    "refittable": True,               # ⭐ This is the magic flag
    "precision_constraints": "obey",  # Respect FP32/FP16 layer constraints
    # ... other optimization settings
}
engine = engine_from_network(
    (builder, network, parser),
    config=CreateConfig(fp16=True, profiles=[p], **extra_build_args)
)
This creates a .plan file where TensorRT tracks which tensors are weights (and thus can be updated) vs. intermediate activations (which cannot).
Phase 2: Create the TRTRefitter Class
The TRTRefitter class manages weight updates (trt.py:563-610):
class TRTRefitter:
    def __init__(self, engine, flux_path):
        # Create the refitter object
        self.refitter = trt.Refitter(engine, TRT_LOGGER)
        # Load base model weights
        transformer_sd = DiffusionPipeline.from_pretrained(
            flux_path, torch_dtype=torch.float16
        ).transformer.state_dict()
        # Initialize refitter with base weights
        for trt_weight_name in self.refitter.get_all_weights():
            # Map TensorRT names back to PyTorch names
            # (TRT adds prefixes like "transformer." and "base_layer.")
            pyt_weight_name = trt_weight_name.replace('transformer.', '')
            pyt_weight_name = pyt_weight_name.replace('base_layer.', '')
            if pyt_weight_name in transformer_sd:
                # Set the weight in the refitter
                self.refitter.set_named_weights(
                    trt_weight_name,
                    trt.Weights(transformer_sd[pyt_weight_name].numpy())
                )
What’s happening here:
- trt.Refitter(engine, ...) creates a refitter object that can update the engine’s weights
- We load the base FLUX model’s weights from HuggingFace
- We map PyTorch weight names → TensorRT internal names (this mapping is generated during the build phase)
- We initialize the refitter with the base weights
Phase 3: Swap LoRA Weights
Now the fast part - updating just the LoRA weights:
def prepare_lora_refit(self, lora_path):
    """Load LoRA weights and prepare them for refitting"""
    from safetensors.torch import load_file
    lora_sd = load_file(lora_path)  # Load LoRA from disk
    start_time = time.time()
    # Only update weights that have ".lora." in their name
    for trt_weight_name in self.refitter.get_all_weights():
        if '.lora.' in trt_weight_name:
            pyt_weight_name = trt_weight_name.replace('.lora.weight', '.weight')
            if pyt_weight_name in lora_sd:
                # Update this specific weight
                self.refitter.set_named_weights(
                    trt_weight_name,
                    trt.Weights(lora_sd[pyt_weight_name].numpy())
                )
    print(f"LoRA prep: {time.time() - start_time:.4f} seconds")

def commit_refit(self):
    """Push updated weights to the GPU engine"""
    start_time = time.time()
    if not self.refitter.refit_cuda_engine():
        # Check what went wrong
        missing = self.refitter.get_missing_weights()
        raise RuntimeError(f"Failed to refit. Missing: {missing}")
    print(f"Refit committed: {time.time() - start_time:.4f} seconds")
The workflow:
- prepare_lora_refit(): Load LoRA weights from disk, update only LoRA-specific weights in the refitter
- commit_refit(): Push all updated weights to the GPU engine in one atomic operation
- Total time: ~0.5 seconds (vs. 30 minutes for rebuild, or 5-10 seconds for PyTorch reload)
Phase 4: Use the Refitted Engine
From lora_generate_image.py:130-131:
# Create TRT transformer and its refitter
transformer = TRTTransformer(engine_path, transformer_config,
                             torch.device("cuda"), max_batch_size=1)
refitter = TRTRefitter(transformer.engine.engine, base_model_path)

# Later, when switching LoRAs:
refitter.prepare_lora_refit('./lora_v1.safetensors')
refitter.commit_refit()
# Engine now has LoRA v1 weights!

# Generate image with LoRA v1
image = pipe(prompt="a lighthouse on a rocky coast at sunset", ...).images[0]

# Switch to different LoRA
refitter.prepare_lora_refit('./lora_v2.safetensors')
refitter.commit_refit()
# Engine now has LoRA v2 weights!

# Generate with LoRA v2
image = pipe(prompt="a lighthouse on a rocky coast at sunset", ...).images[0]
Why This Works
The magic of refitting:
- TensorRT keeps the execution graph frozen (layer connections, kernel selections, memory layout)
- Only the weight tensors are updated
- Since LoRA training only modifies weights (not architecture), this is a perfect match
- The engine validates that weight shapes haven’t changed, ensuring safety
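The shape argument is worth seeing in code. A LoRA update is W' = W + α·BA, and the low-rank product BA has exactly the base weight’s shape. A quick NumPy sketch (dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                     # hidden dim and LoRA rank (hypothetical)

W = rng.standard_normal((d, d))  # base weight, baked into the engine
A = rng.standard_normal((r, d))  # LoRA down-projection
B = rng.standard_normal((d, r))  # LoRA up-projection
alpha = 1.0                      # LoRA scaling factor

# The merged weight has exactly the base weight's shape, so the frozen
# execution graph (kernels, memory layout, connections) accepts it via refit
W_merged = W + alpha * (B @ A)
assert W_merged.shape == W.shape

# Switching LoRAs = recomputing this sum with a new A/B pair and refitting
```

Because the merge is a pure weight substitution, nothing about the engine’s compiled execution plan has to change.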
Performance comparison:
| Method | Time | Memory |
|---|---|---|
| Rebuild engine | 20-30 min | 3-5GB (one engine) |
| Keep multiple engines | Instant | 30-50GB (10 engines) |
| PyTorch reload | 5-10 sec | ~8GB |
| TensorRT refitting | 0.5 sec | 3-5GB (one engine) |
Real-World Impact
According to NVIDIA’s blog, “By using weight refitting, the turnaround time when weights change is significantly shortened, which improves the liveness of GenAI applications.”
In our Bittensor miner, this means:
- ✅ Serve unlimited LoRA variants with fixed memory footprint
- ✅ Switch models in half a second (imperceptible to users)
- ✅ No downtime or warm-up periods
- ✅ Same inference performance as a static engine
Problem 4: GPU Sits Idle During Weight Loading
The Problem
Even with our lightning-fast 0.5-second LoRA refitting, there’s still wasted time:
Sequential workflow:
[Load LoRA from disk] → [Prepare refit] → [Commit to GPU] → [Run inference]
(I/O bound) (CPU work) (GPU work) (GPU work)
0.2s 0.2s 0.1s 3.0s
Total time: 3.5 seconds per image
The problem? During the first 0.5 seconds (load + prepare + commit), the GPU is completely idle. We’re paying for expensive H100 compute that’s just sitting there waiting for CPU and I/O operations to complete.
For high-throughput serving, this inefficiency compounds. At 0.5 seconds of idle time per 3.5-second request, every 100 requests waste 50 seconds of GPU time - roughly 14% of our capacity thrown away.
The Solution: Dual-Engine Pipeline Architecture
The fix is classic producer-consumer pattern: keep two TensorRT engines in memory and pipeline the work.
How it works:
Time 0-3s: [Engine A: Load LoRA 1] → [Engine A: Inference for request 1]
Time 0-3s: [Engine B: idle]
Time 3-6s: [Engine B: Load LoRA 2] → [Engine B: Inference for request 2]
Time 3-6s: [Engine A: idle]
Result: GPU always has work to do!
Implementation: TRTInferenceServer
From lora_generate_image.py:107-132:
class TRTInferenceServer:
    def __init__(self, base_model_path, engine_path, mapping_path):
        self.request_queue = queue.Queue()    # Incoming requests
        self.engine_queue = queue.Queue()     # Pool of available engines
        self.inference_queue = queue.Queue()  # Engines ready to infer
        # Create TWO identical TRT engines
        self.engine_queue.put(self._create_trt_transformer(engine_path, ...))
        self.engine_queue.put(self._create_trt_transformer(engine_path, ...))
        # Two worker threads
        self.loader_thread = threading.Thread(
            target=self._weight_loader_worker, daemon=True
        )
        self.inference_thread = threading.Thread(
            target=self._inference_worker, daemon=True
        )

    def _create_trt_transformer(self, engine_path, transformer_config):
        """Create a TRT transformer + refitter pair"""
        transformer = TRTTransformer(engine_path, transformer_config,
                                     torch.device("cuda"), max_batch_size=1)
        refitter = TRTRefitter(transformer.engine.engine, self.base_model_path)
        return transformer, refitter
Architecture diagram:
┌──────────────┐
│ Request │ User submits inference request
│ Queue │
└──────┬───────┘
│
v
┌──────────────────────────────────┐
│ Weight Loader Worker Thread │ [Producer]
│ - Get idle engine from pool │
│ - Load LoRA from disk │ (I/O + CPU work)
│ - Prepare refit │
│ - Commit refit to GPU │
│ - Pass to inference queue │
└──────────────┬───────────────────┘
│
v
┌──────────────────────────────────┐
│ Inference Worker Thread │ [Consumer]
│ - Get engine from inference Q │
│ - Run GPU inference │ (GPU work)
│ - Save image to disk │
│ - Return engine to pool │
└──────────────┬───────────────────┘
│
v
┌──────────────┐
│ Engine │ [Engine A] ←→ [Engine B]
│ Pool │ (alternating: one loads, one infers)
└──────────────┘
Worker Thread 1: Weight Loader
def _weight_loader_worker(self):
    """Producer: Prepare engines with the right weights"""
    while True:
        # Get an idle engine from the pool
        transformer, refitter = self.engine_queue.get()
        # Get next request
        request = self.request_queue.get()
        if request is None:  # Shutdown signal
            self.inference_queue.put(None)
            break
        print(f"[Loader] Loading LoRA for {request.lora_path}")
        # Load and refit (I/O + CPU work, GPU idle for THIS engine)
        if request.lora_path:
            refitter.prepare_lora_refit(request.lora_path)
            refitter.commit_refit()
        # Hand off to inference worker
        self.inference_queue.put((request, transformer, refitter))
        # Mark this stage complete
        self.engine_queue.task_done()
        self.request_queue.task_done()
Key insight: While Engine A is loading weights (CPU/I/O bound), Engine B is running inference (GPU bound). No conflict!
Worker Thread 2: Inference
def _inference_worker(self):
    """Consumer: Run inference and return engine to pool"""
    while True:
        job = self.inference_queue.get()
        if job is None:  # Shutdown signal
            break
        request, transformer, refitter = job
        print(f"[Inference] Running inference for {request.lora_path}")
        inference_start = time.time()
        # Swap in the transformer (cheap pointer swap)
        self.pipe.transformer = transformer
        # Run inference (GPU-bound work)
        generator = torch.Generator(device="cuda").manual_seed(request.seed)
        image = self.pipe(
            prompt=request.prompt,
            num_inference_steps=request.num_inference_steps,
            guidance_scale=request.guidance_scale,
            height=request.height,
            width=request.width,
            generator=generator
        ).images[0]
        # Ensure GPU finishes before continuing
        torch.cuda.synchronize()
        print(f"[Inference] Completed in {time.time() - inference_start:.3f}s")
        # Return engine to the pool (it's now idle and ready for next request)
        self.engine_queue.put((transformer, refitter))
        # Save image (I/O, happens after engine is back in pool)
        image.save(request.output_path)
        self.inference_queue.task_done()
Key insight: As soon as inference finishes, the engine goes back to the pool where the loader thread can grab it for the next request.
Atomic File Writes (Bonus: Preventing Race Conditions)
One subtle detail from the recent race condition fix (lora_generate_image.py:246-256):
# Write to temp file first, then atomically rename
with tempfile.NamedTemporaryFile(
    mode='wb',
    dir=output_dir,
    suffix='.png',
    delete=False
) as tmp_file:
    tmp_path = tmp_file.name
image.save(tmp_path)
# Atomic rename - file only becomes visible when complete
os.replace(tmp_path, request.output_path)
Why this matters: If a validator checks for the image file while we’re still writing it, they might get a partial/corrupted image. Atomic rename ensures the file only appears when it’s fully written.
Performance Impact
Sequential (single engine):
- Load + refit: 0.5s
- Inference: 3.0s
- Total: 3.5s per request
Pipelined (dual engine):
- While Engine A infers (3.0s), Engine B loads next weights (0.5s)
- The 0.5s load time is hidden by the 3.0s inference
- Effective throughput: ~3.0s per request
Improvement: 15% throughput increase + much better GPU utilization
For a system processing 1000 images/day, this saves 8-10 minutes of compute time daily. At scale, those minutes become hours.
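The pipelining arithmetic generalizes: once both engines are busy, steady-state time per request is the maximum of the two stage times, not their sum. A tiny model of that, using the timings from above:

```python
def per_request_time(load_s: float, infer_s: float, engines: int) -> float:
    """Steady-state time per request: stages serialize with one engine,
    overlap with two (the slower stage becomes the bottleneck)."""
    return load_s + infer_s if engines == 1 else max(load_s, infer_s)

sequential = per_request_time(0.5, 3.0, engines=1)
pipelined = per_request_time(0.5, 3.0, engines=2)
print(f"{sequential:.1f}s -> {pipelined:.1f}s per request")
```

This also shows why the pattern stops helping once loading exceeds inference: max(load, infer) would then be dominated by the load stage, and adding a second engine buys nothing.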
When to Use This Pattern
Dual-engine makes sense when:
- ✅ Your weight-loading time < your inference time (so one engine can finish before the other needs it)
- ✅ You have enough VRAM for two engines (~6-10GB)
- ✅ Throughput matters (serving many requests sequentially)
Skip it when:
- ❌ Loading takes longer than inference (you’d still block)
- ❌ Memory-constrained (can’t fit two engines)
- ❌ Low request volume (overhead not worth it)
Problem 5: Squeezing Out Even More Performance with FP8 ⭐
A 2025 cutting-edge technique that can push diffusion models even faster.
The Reality of FP8 in 2025
By 2025, FP8 quantization on NVIDIA Hopper GPUs (H100, H200) has become a standard technique for diffusion model optimization. While our implementation uses FP16, going one step further to FP8 can unlock another 1.5-2x speedup.
Performance potential:
- FLUX.1-dev with TensorRT FP16: ~3 seconds (our current implementation)
- FLUX.1-dev with TensorRT FP8: ~1.5-2 seconds (2.4x faster than PyTorch FP16)
What is FP8?
FP8 (8-bit floating point) comes in two formats:
- E4M3 (4-bit exponent, 3-bit mantissa): Better for activations, chosen for diffusion
- E5M2 (5-bit exponent, 2-bit mantissa): Higher dynamic range but less precision
TensorRT uses E4M3 for diffusion models because it provides finer-grained precision where activations cluster around zero.
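Under the hood, FP8 inference relies on a per-tensor scale that maps each tensor’s observed range onto E4M3’s representable range (largest finite value: 448). A simplified sketch of what calibration computes - real calibration aggregates the amax statistic over many batches, and per_tensor_scale here is an illustrative name, not a library API:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in E4M3 (OCP FP8 spec)

def per_tensor_scale(x: np.ndarray) -> float:
    """Scale that maps a tensor's observed range onto E4M3's range.
    Real calibration aggregates amax over 128-512 batches; this uses one."""
    return np.abs(x).max() / E4M3_MAX

# Diffusion activations cluster near zero - exactly where E4M3's
# extra mantissa bit helps relative to E5M2
acts = np.random.default_rng(0).normal(0.0, 0.5, size=10_000)
s = per_tensor_scale(acts)
assert np.isclose(np.abs(acts / s).max(), E4M3_MAX)
```

A bad scale either clips outliers (too small a range) or wastes the format’s precision (too large), which is why the calibration step below matters.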
Why FP8 Works for Diffusion Models
- Memory bandwidth: FP8 halves memory transfers vs FP16 (8 bits vs 16 bits)
- Tensor Cores: H100 has dedicated FP8 Tensor Cores that are 2x faster than FP16
- Numerical stability: Diffusion models are surprisingly robust to lower precision
Adobe’s Firefly video generation model achieved:
- 60% latency reduction with FP8
- Nearly 40% reduction in total cost of ownership (TCO)
- Minimal quality degradation
Implementation Approach
Using TensorRT Model Optimizer:
import modelopt.torch.quantization as mtq

# Post-training quantization (PTQ) - no retraining needed.
# FP8_DEFAULT_CFG quantizes weights and activations in the E4M3 format.
model = mtq.quantize(
    model,
    mtq.FP8_DEFAULT_CFG,            # FP8 (E4M3) quantization config
    forward_loop=calibration_loop,  # Run 128-512 calibration samples
)

# Export to ONNX (quantization annotations included)
torch.onnx.export(model, sample_input, "flux_fp8.onnx")

# Build the TensorRT engine as before - the build process is identical
# to FP16, except TensorRT now sees FP8 Q/DQ ops in the graph
Calibration: FP8 quantization requires calibrating on 128-512 sample images to determine optimal scaling factors for each layer. This takes 10-20 minutes but only needs to be done once.
Combining FP8 with Refitting
The challenge: Can you combine FP8 quantization with LoRA refitting?
Current limitation (2025): TensorRT documentation notes that “high-precision weights used in FP4 double quantization are not refittable.” The story for FP8 + refitting is still evolving.
Workaround: Build separate FP8 engines for your most popular LoRA variants, fall back to FP16 refittable engine for long-tail LoRAs.
Performance vs Quality Trade-off
When FP8 works well:
- ✅ Diffusion models (FLUX, Stable Diffusion, video generation)
- ✅ Large models where memory bandwidth dominates
- ✅ Hopper GPUs (H100, H200) with dedicated FP8 Tensor Cores
- ✅ Production serving where 1-2 seconds matters
When to stick with FP16:
- ❌ When you need perfect numerical accuracy
- ❌ On pre-Hopper GPUs (no FP8 Tensor Cores)
- ❌ When you require dynamic LoRA refitting
- ❌ Models with known numerical instability
Quality validation: Always run side-by-side A/B tests comparing FP8 vs FP16 outputs. For FLUX.1-dev, most users report imperceptible quality differences, but your specific use case may vary.
Getting Started with FP8
Phase 1: Validation
- Install TensorRT Model Optimizer: pip install nvidia-modelopt[torch]
- Quantize your model with 256 calibration samples
- Generate 100 test images side-by-side (FP8 vs FP16)
- Measure quality metrics (FID, CLIP score, human eval)
Phase 2: Production
- If quality is acceptable, build FP8 TensorRT engines
- Add FP8 engine path alongside FP16 in your server
- A/B test in production with 5% traffic
- Monitor quality metrics and user feedback
Phase 3: Optimization (optional)
- Experiment with per-layer precision (some layers FP16, most FP8)
- Try mixed FP8/FP16 for quality-critical layers
- Profile memory bandwidth savings with NVIDIA Nsight
Real-World Impact
If we applied FP8 to our FLUX implementation:
| Metric | FP16 (current) | FP8 (potential) | Improvement |
|---|---|---|---|
| Inference time | 3.0s | ~1.5-2.0s | 1.5-2x faster |
| Memory usage | ~8GB | ~4-5GB | 40-50% reduction |
| Throughput | 20 img/min | 30-40 img/min | 1.5-2x higher |
| Cost per image | $0.0025 | $0.0012-0.0017 | 30-50% cheaper |
The bottom line: FP8 is the next frontier for diffusion model optimization in 2025. If you’re on H100 and have already exhausted FP16 optimizations, FP8 can unlock another 1.5-2x speedup with minimal quality impact.
Web Resources:
- TensorRT FP8 Quantization for Stable Diffusion
- Optimizing FLUX.1 with Low-Precision Quantization
- TensorRT Model Optimizer v0.15 Release
Bringing It All Together: The Complete Pipeline
We’ve covered five distinct optimizations. Let’s synthesize them into the complete end-to-end flow.
Development Phase (One-Time Setup)
1. Train or download PyTorch model
↓
2. Export to ONNX format
├── torch.onnx.export(model, sample_inputs, ...)
├── Requires sample inputs matching expected inference shapes
└── Produces: model.onnx + external weight files
↓
3. Build TensorRT Engine (20-30 minutes)
├── Parse ONNX model
├── Apply graph fusion optimizations
├── Select optimal kernels via tactic benchmarking
├── Apply FP16/FP32 mixed precision
├── Compile to GPU machine code
└── Produces: transformer.plan (~3-5GB)
↓
4. Generate weight mapping
├── Hash-based matching of PyTorch ↔ TensorRT weight names
└── Produces: mapping.json (for refitting)
↓
5. Cache by hardware signature
└── {GPU_name}_cu{CUDA_version}_trt{TRT_version}_fp16/
├── transformer.plan
├── mapping.json
└── onnx/ (optional, for debugging)
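Deriving that cache directory name is straightforward. A sketch with a hypothetical helper - in practice the inputs would come from torch.cuda.get_device_name(0), torch.version.cuda, and trt.__version__, and engine_cache_dir is our illustrative name, not a function from the codebase:

```python
def engine_cache_dir(gpu_name: str, cuda: str, trt_ver: str,
                     precision: str = "fp16") -> str:
    """Cache key containing everything that invalidates an engine:
    GPU model, CUDA version, TensorRT version, and precision mode."""
    return f"{gpu_name.replace(' ', '_')}_cu{cuda}_trt{trt_ver}_{precision}"

print(engine_cache_dir("NVIDIA H100 80GB HBM3", "12.4", "10.11"))
# NVIDIA_H100_80GB_HBM3_cu12.4_trt10.11_fp16
```

Because engines are compiled for a specific GPU and TensorRT build, any one of these fields changing means the cached .plan file cannot be reused.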
Production Runtime
1. Server Initialization
├── Load base Diffusion Pipeline (FLUX.1-dev)
├── Load TensorRT engine into Engine A
├── Load TensorRT engine into Engine B (duplicate)
├── Create refitter objects for both engines
└── Start loader and inference worker threads
↓
2. For Each Inference Request
├── Loader Thread (Engine A):
│ ├── Get request from queue
│ ├── Load LoRA weights from disk
│ ├── Prepare refit (map weights to TRT format)
│ ├── Commit refit to GPU (~0.5s total)
│ └── Pass to inference queue
│
├── Inference Thread (Engine A):
│ ├── Receive engine + request
│ ├── Run TRT inference (~3s)
│ ├── Synchronize GPU
│ ├── Save image atomically
│ └── Return engine to pool
│
└── Meanwhile, Engine B processes next request...
↓
3. Result
└── Image generated in ~3 seconds (vs 15s PyTorch)
Key Files and Their Roles
| File | Purpose | Key Content |
|---|---|---|
| trt.py:163-263 | Engine class | TRT engine loading, buffer allocation, inference execution |
| trt.py:436-560 | DiffusionTransformer | ONNX export configuration for FLUX Transformer |
| trt.py:563-610 | TRTRefitter | Weight update without recompilation |
| trt.py:613-696 | TRTTransformer | Drop-in replacement for PyTorch Transformer |
| trt.py:718-808 | build_transformer_engine_from_pipeline() | End-to-end build orchestration |
| lora_generate_image.py:107-260 | TRTInferenceServer | Dual-engine pipeline with producer-consumer workers |
Configuration Deep Dive
From trt.py:193-211, here are the important build settings:
extra_build_args = {
    # Memory allocation for tactics benchmarking and graph optimization
    "memory_pool_limits": {
        trt.MemoryPoolType.WORKSPACE: 2**33,    # 8GB workspace
        trt.MemoryPoolType.TACTIC_DRAM: 2**33,  # 8GB for tactic selection
    },
    # Which kernel libraries to try
    "tactic_sources": [
        trt.TacticSource.CUBLAS,                  # Standard BLAS
        trt.TacticSource.CUBLAS_LT,               # CUDA 11+ optimized BLAS
        trt.TacticSource.CUDNN,                   # Convolutions, RNNs, attention
        trt.TacticSource.EDGE_MASK_CONVOLUTIONS,  # Specialized conv kernels
        trt.TacticSource.JIT_CONVOLUTIONS,        # Runtime-compiled convs
    ],
    # Enable weight updates at runtime
    "refittable": True,
    # Enforce layer-level precision constraints (respect FP32 overrides)
    "precision_constraints": "obey",
}
What these settings do:
- memory_pool_limits: TensorRT needs scratch memory during build to benchmark tactics. More memory = can test more variants = better optimization (but longer build time)
- tactic_sources: Each source represents a different kernel library. TensorRT will try all of them and pick the fastest for each layer
- refittable: Enables our LoRA weight-swapping magic
- precision_constraints: Without this, TensorRT might ignore your FP32 layer overrides and force everything to FP16
Real-World Performance Numbers
From our production Bittensor miner:
| Metric | PyTorch Baseline | TensorRT Optimized | Improvement |
|---|---|---|---|
| Inference time | 15 seconds | 3 seconds | 5x faster |
| Inference time (first run) | 15 seconds | 4-5 seconds | Warmup overhead |
| Memory usage | ~12GB VRAM | ~8GB VRAM | 33% reduction |
| LoRA switch time | 5-10 seconds | 0.5 seconds | 10-20x faster |
| Throughput (sequential) | 4 images/min | 20 images/min | 5x higher |
| GPU utilization | 60-70% | 90-95% | Better saturation* |
*GPU utilization numbers are from our FLUX deployment. Your mileage will vary based on model size, batch size, and hardware. We’re running batch_size=1, which limits utilization compared to batched inference.
Cost implications:
At cloud GPU pricing (~$2-3/hr for an H100; the numbers below use $3/hr):
- PyTorch: 15s/image = 240 images/hour = $0.0125 per image
- TensorRT: 3s/image = 1200 images/hour = $0.0025 per image
5x cost reduction for the same hardware!
Build Time Investment
One-time costs:
- Initial engine build: 20-30 minutes
- ONNX export development/debugging (first time)
- Testing and validation
Ongoing costs:
- Rebuild when model architecture changes (rare in production)
- Rebuild when switching GPU types (engines are hardware-specific)
ROI calculation:
If you’re running 1000 inferences/day:
- Time saved per day: (15s - 3s) × 1000 = 12,000 seconds = 3.3 hours
- Break-even point: the ~30 min build time is recovered well within the first day of production use (3.3 hr saved daily)
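The cost and ROI arithmetic above is easy to sanity-check in a few lines (a sketch; the $3/hr rate, per-image times, and 1000 inferences/day are the assumptions stated in the text):

```python
# Sanity-check the cost and ROI arithmetic above.
# Assumptions (from the text): $3/hr H100, 15s vs 3s per image, 1000 inferences/day.
HOURLY_RATE = 3.0  # USD per GPU-hour

def cost_per_image(seconds_per_image: float, hourly_rate: float = HOURLY_RATE) -> float:
    """Cost of one image at a given per-image latency."""
    images_per_hour = 3600 / seconds_per_image
    return hourly_rate / images_per_hour

pytorch_cost = cost_per_image(15)  # $0.0125 per image
trt_cost = cost_per_image(3)       # $0.0025 per image

# GPU time saved per day at 1000 inferences/day
daily_savings_hours = (15 - 3) * 1000 / 3600  # ~3.33 hours
```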
Troubleshooting Common Issues
Issue 1: ONNX export fails with unsupported operators
- Symptom: `RuntimeError: ONNX export failed: Unsupported operator`
- Solution:
  - Check for custom layers not supported by ONNX
  - Try a higher opset version (`opset_version=18` or later)
  - Replace custom ops with ONNX-compatible equivalents
  - For RMSNorm layers, use the `ONNXSafeRMSNorm` wrapper (see `trt.py:110-137`)
Issue 2: Engine build fails with out-of-memory
- Symptom: `Error: Memory allocation failed during engine build`
- Solution:
  - Reduce workspace memory: `trt.MemoryPoolType.WORKSPACE: 2**32` (4GB instead of 8GB)
  - Set the `TRT_BUILD_SAFE=1` environment variable to disable aggressive optimizations
  - Close other GPU processes during build
  - Build on a machine with more VRAM
Issue 3: Accuracy regression - generated images look wrong
- Symptom: TensorRT outputs differ visually from PyTorch
- Solution:
  - Add more layer types to the FP32 whitelist (see `trt.py:215-220`)
  - Enable strict type constraints: `precision_constraints="obey"`
  - Compare layer-by-layer outputs using ONNX Runtime as an intermediate step
  - Check for numerical instability in LayerNorm, Softmax, or Power operations
Issue 4: TensorRT slower than expected or slower than PyTorch
- Symptom: Little to no speedup, or even slower than PyTorch
- Solution:
  - Verify FP16 is enabled: check the build logs for "fp16=True"
  - Ensure static shapes: dynamic shapes add 2-3x overhead
  - Profile with NVTX: wrap inference in `torch.cuda.nvtx.range_push`/`range_pop`
  - Check for CPU-GPU sync points: remove unnecessary `torch.cuda.synchronize()` calls
  - Verify Tensor Cores are being used: check the build logs for "CUBLAS_LT" tactics
Issue 5: Refitting is slower than expected
- Symptom: LoRA refitting takes 2-5 seconds instead of 0.5 seconds
- Solution:
  - Upgrade to TensorRT 10.11+ (10.10 had a known refitting performance regression)
  - Check for convolution layers in FP16/INT8 within branches or loops (known regression)
  - Profile with NVTX to identify which refit operations are slow
  - Check whether weight shapes are mismatched (this forces internal recompilation)
  - Verify you're using the `kREFIT` or `kREFIT_IDENTICAL` flags correctly
Web Resources:
- TensorRT Best Practices Guide
- ONNX Operator Compatibility
- TensorRT Troubleshooting FAQ
- NVIDIA Developer Forums
When to Use TensorRT (and When Not To)
TensorRT is powerful, but it’s not always the right choice. Here’s a practical decision framework.
✅ Use TensorRT When…
1. Production inference with high throughput requirements
- Serving hundreds to millions of requests per day
- Every millisecond of latency matters
- Hardware costs are significant
2. Model architecture is stable
- Not changing layer structure frequently
- Fine-tuning only (weights change, architecture doesn’t)
- Production-ready models, not research prototypes
3. NVIDIA GPU infrastructure
- You’re running on NVIDIA hardware (required)
- A100, H100, or recent RTX/Tesla GPUs
- CUDA 11.8+ and TensorRT 8.5+ installed
4. Latency-critical applications
- Real-time inference (video processing, robotics)
- Interactive applications (chatbots, live image generation)
- Batch processing with tight SLAs
5. You can tolerate build time
- 20-30 minute one-time build cost is acceptable
- Can cache engines and reuse across deployments
- CI/CD pipeline can handle multi-minute build steps
❌ Skip TensorRT When…
1. Research and experimentation phase
- Model architecture changes frequently
- Rapid iteration is more valuable than speed
- Prototyping different approaches
2. CPU-only deployment
- TensorRT requires NVIDIA GPUs
- Consider ONNX Runtime for CPU optimization instead
3. Non-NVIDIA hardware
- AMD GPUs → look at ROCm or ONNX Runtime
- Apple Silicon → use Core ML or Metal Performance Shaders
- Edge TPUs → use TensorFlow Lite
4. torch.compile meets your performance needs
- Simple models (small CNNs, basic transformers)
- torch.compile achieving your latency/throughput targets
- Deployment simplicity outweighs potential TensorRT gains (10-20%)
- Workloads where torch.compile benchmarks equal or better than TensorRT
5. Extremely dynamic architectures
- Variable number of layers per forward pass
- Conditional execution with data-dependent branches
- Dynamic shapes that change drastically between requests
Alternative Approaches
Don’t want the full TensorRT complexity? Consider these alternatives:
| Approach | Speedup | Ease of Use | Portability |
|---|---|---|---|
| torch.compile | 1.5-2x | ⭐⭐⭐⭐⭐ One line | PyTorch only |
| ONNX Runtime | 2-3x | ⭐⭐⭐⭐ Easy export | Cross-platform |
| TorchScript | 1.3-1.8x | ⭐⭐⭐ Moderate | PyTorch only |
| BetterTransformer | 1.5-2x | ⭐⭐⭐⭐⭐ Drop-in | Transformers only |
| TensorRT | 3-6x | ⭐⭐ More complex | NVIDIA GPUs only |
| Quantization (INT8) | 2-4x | ⭐⭐⭐ Requires calibration | Various backends |
| FP8 Quantization | 1.5-2.5x | ⭐⭐⭐ Requires calibration | H100/H200 GPUs |
A note on torch.compile:
PyTorch 2.0’s torch.compile has made huge strides in closing the performance gap with TensorRT. For some models, it can match or even beat TensorRT. The choice often comes down to your specific model architecture and deployment constraints. It’s worth trying torch.compile first since it’s just one line of code.
The pragmatic approach for 2025:
- Start with `torch.compile` (one line: `model = torch.compile(model)`)
- If that's not enough and you're on NVIDIA GPUs, try TensorRT
- Consider FP8 quantization as complementary to either approach (especially for diffusion models)
- Profile, measure, and choose based on your specific workload - the landscape is more competitive than ever
Getting Started Checklist
If you’ve decided TensorRT is right for you:
Phase 1: Validation
- Benchmark your PyTorch baseline (latency, throughput, memory)
- Export a simple model to ONNX to test compatibility
- Build a basic TensorRT engine and verify it works
- Compare outputs (PyTorch vs TRT) to check numerical accuracy
- Measure actual speedup (aim for ≥2x to justify complexity)
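For the baseline benchmark, a minimal harness with warmup runs and percentiles keeps the measurement honest. This is a sketch: the callable you pass stands in for your model invocation, and for GPU work you'd call `torch.cuda.synchronize()` inside it so you time completed kernels rather than launches:

```python
import statistics
import time

def benchmark(fn, warmup: int = 3, iters: int = 20) -> dict:
    """Time a callable, excluding warmup runs.

    Warmup matters: the first calls pay for JIT compilation, cache
    population, and cuDNN autotuning, and would skew the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return {
        "mean_s": statistics.mean(samples),
        "p50_s": samples[len(samples) // 2],
        "p95_s": samples[int(len(samples) * 0.95)],
    }
```

Reporting p50 and p95 rather than a single number also surfaces the first-run warmup overhead noted in the performance table.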
Phase 2: Production Integration
- Set up engine caching by hardware signature
- Implement proper error handling and fallback to PyTorch
- Add monitoring (inference time, GPU utilization, errors)
- Test edge cases (different input shapes, error conditions)
- Document build process for your team
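Because engines are tied to the exact GPU, TensorRT version, and build flags, the cache key should hash all of them. Here's a sketch; the field names and filename scheme are illustrative, not the repo's actual implementation:

```python
import hashlib
import json

def engine_cache_key(gpu_name: str, trt_version: str,
                     build_args: dict, model_id: str) -> str:
    """Derive a stable filename for a cached engine.

    Anything that changes engine validity - GPU model, TensorRT version,
    precision/refit flags, model architecture - must be part of the key,
    so a mismatched engine is never loaded by accident.
    """
    payload = json.dumps(
        {"gpu": gpu_name, "trt": trt_version,
         "args": build_args, "model": model_id},
        sort_keys=True,  # stable ordering -> stable hash
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"{model_id}-{digest}.engine"
```

At load time you'd compute the same key (querying the GPU name via `torch.cuda.get_device_name()`, for example) and fall back to a rebuild on a cache miss.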
Phase 3: Advanced Optimization (optional)
- Implement refitting if you need dynamic model variants
- Set up dual-engine pipeline if throughput is critical
- Fine-tune layer-level precision (FP32 vs FP16)
- Profile with NVTX and optimize bottlenecks
- Consider INT8 quantization for even more speed
Learning Resources
Community Resources:
- Torch-TensorRT (PyTorch Integration)
- TensorRT GitHub Issues (great for troubleshooting)
- NVIDIA Developer Forums
Quantization and Optimization:
- TensorRT Model Optimizer Documentation
- FP8 Quantization for Diffusion Models
- Working with Quantized Types in TensorRT
Tutorials and Examples:
- NVIDIA TensorRT Blog Posts
- Ultralytics TensorRT Export Guide
- This project’s implementation (real production code!)
Final Thoughts
TensorRT is a power tool. Like any power tool, it requires more setup and expertise than simpler alternatives, but when you need maximum performance, it delivers.
The decision usually comes down to:
- Can you afford 20-30 minutes of build time? (one-time cost)
- Are you on NVIDIA hardware? (required)
- Do you need 3x+ speedup over PyTorch? (typical benefit)
If you answered yes to all three, TensorRT is likely worth the investment.
Conclusion
We started with a 15-second inference problem. By systematically applying TensorRT’s optimization toolkit, we brought it down to 3 seconds - a 5x speedup that makes real-time AI image generation practical.
The Journey Recap
Problem 1: Dynamic graph overhead → Static compiled engines
- Eliminated Python interpreter and framework dispatch overhead
- Pre-compiled execution plans with kernel fusion
- Result: Direct GPU execution with minimal CPU involvement
Problem 2: FP32 precision waste → Mixed precision + kernel auto-tuning
- FP16 for most operations (2x memory bandwidth improvement)
- Selective FP32 for numerically sensitive layers
- Automatic kernel selection via tactics benchmarking
- Result: 2-3x faster execution with Tensor Cores
Problem 3: Slow LoRA switching → Weight refitting
- Dynamic weight updates without graph recompilation
- 0.5 seconds vs 30 minutes for model switching
- Single engine serves unlimited model variants
- Result: 3600x faster model switching
Problem 4: GPU idle time → Dual-engine pipeline
- Producer-consumer pattern with engine pool
- Weight loading happens while other engine infers
- Atomic file writes prevent race conditions
- Result: 15% throughput improvement, 95% GPU utilization
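The atomic-write trick is worth spelling out: write to a temporary file in the same directory, then `os.replace()` it into place, so the consuming engine never sees a half-written weight file. A sketch under that assumption (the function and naming are illustrative, not the repo's actual API):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data to path atomically.

    os.replace() is atomic on POSIX when source and destination live on
    the same filesystem, so a reader sees either the old file or the
    complete new one - never a partial write.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Keeping the temp file in the destination directory matters: `os.replace()` across filesystems would fall back to a non-atomic copy-and-delete.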
Problem 5: FP8 quantization → Next-level optimization for 2025
- FP8 quantization on H100/H200 GPUs for 2x+ speedup
- Post-training quantization with calibration (no retraining)
- Trade-off: potential quality impact vs major performance gain
- Result: 1.5-2x additional speedup (3s → 1.5-2s), 40-50% memory reduction
The Key Insight
TensorRT isn’t magic - it’s the disciplined application of well-known optimization techniques:
- Graph fusion (compiler optimization)
- Mixed precision (numerical analysis)
- Kernel auto-tuning (empirical performance testing)
- Weight refitting (separating data from code)
- Pipeline parallelism (concurrent systems design)
- FP8 quantization (aggressive precision reduction for cutting-edge hardware)
What makes TensorRT powerful is how it orchestrates these techniques into a cohesive system, automatically and reliably.
Real-World Impact
For our Bittensor miner serving AI image generation:
- 5x faster inference (15s → 3s)
- 5x higher throughput (4 images/min → 20 images/min)
- 5x lower cost per image generated
- Practical LoRA switching for dynamic model serving
The 20-30 minute build time pays for itself after a single day of production use.
When It’s Worth It
TensorRT makes sense when:
- ✅ You’re on NVIDIA GPUs in production
- ✅ Your model architecture is stable
- ✅ Latency and cost matter
- ✅ You can invest a day setting it up
Skip it when:
- ❌ You’re still in research/prototyping
- ❌ PyTorch is already fast enough
- ❌ You’re on non-NVIDIA hardware
Try It Yourself
The complete implementation from this post is available in the dippy-studio-bittensor-miner repository.
Key files to explore:
- `trt.py` - Engine building, refitting, and execution
- `lora_generate_image.py` - Dual-engine inference server
Start with a simple model, export to ONNX, build an engine, and measure the speedup. You might be surprised at how much faster your inference becomes.
Final Thought
“The 20-30 minute build time might seem daunting, but when it saves you 12 seconds per inference across millions of requests, the math becomes compelling quickly.”
Optimization isn’t about making everything faster - it’s about making the right things faster. TensorRT gives you the tools to do exactly that.
Questions or feedback? Open an issue in the repository or reach out on the Bittensor Discord. I’d love to hear about your TensorRT experiences!
Found this helpful? Share it with someone struggling with slow inference. Every millisecond counts in production AI.