
// Cross-posted from Easy Diffusion’s blog.

Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion’s next engine (i.e. sdkit v3). For context, see the design constraints of the new engine.

tl;dr summary

The current state is:

  1. Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. TensorRT-RTX for NVIDIA, MiGraphX for AMD, and OpenVINO for Intel.
  2. Cross-vendor compilers like TVM, IREE, and XLA are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs.

Cross-vendor compilers seem to focus on either datacenter hardware or embedded devices, and their performance on desktops and laptops is pretty poor. Mojo doesn't target this category at all (and doesn't support Windows). That's probably because datacenters and embedded devices are currently where the attention (and money) is.

This could change in the future! The idea of a cross-vendor ML compiler is clearly awesome, and I think this is the way things should go. But we’re not there yet for desktops/laptops, in terms of runtime performance.

What’s an ML compiler?

It’s a compiler for ML models (not a compiler that uses ML to compile). The basic idea of an ML compiler is to treat an ML model’s execution graph as a program to compile, and to produce an optimized set of GPU-specific instructions. The compiler can optimize the execution graph by doing things like fusing operations together, parallelizing operations when possible, and even mapping groups of operators to GPU-specific instructions. It can use its knowledge of the target GPU architecture to optimize the memory layout and parallelism of operations. Basically what compilers already do for CPUs today, but for GPUs.
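
To make the fusion idea concrete, here's a minimal sketch (illustrative, not from my benchmark code) of three elementwise ops that a graph compiler can fuse into a single GPU kernel. torch.compile's Inductor backend performs exactly this kind of pointwise fusion:

import torch

# Naively, each of these elementwise ops launches its own GPU kernel
# and round-trips the intermediate tensor through GPU memory.
def activation(x):
    return torch.relu(x * 2.0 + 1.0)

# A graph compiler can fuse mul + add + relu into one kernel,
# eliminating the intermediate memory traffic.
fused = torch.compile(activation)

x = torch.randn(1024, 1024, device="cuda")  # assumes an NVIDIA GPU
y = fused(x)  # first call compiles; later calls run the fused kernel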

We already have a decent graph format: ONNX. Every model that I intend to support has ONNX exports available (and it’s easy to export one, for new models).
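
For example, exporting a PyTorch model to ONNX takes a few lines with torch.onnx.export (a minimal sketch; the MobileNet model and input shape here are just placeholders):

import torch
import torchvision

# Placeholder model; any torch.nn.Module with a traceable forward() works.
model = torchvision.models.mobilenet_v2(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input used for tracing

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # optional: dynamic batch size
)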

ML compiler projects

Cross-vendor ML compilers:
- TVM
- IREE
- XLA
- Mojo (doesn't target desktops, and doesn't support Windows)

Vendor-specific ML compilers:
- TensorRT-RTX (NVIDIA)
- MiGraphX (AMD)
- OpenVINO (Intel)

Testing compilers

I tested on a Windows 11 desktop with an NVIDIA 3060 12 GB (CUDA backend). The raw numbers are in the next section.

I don't have an AMD or Intel GPU to test MiGraphX or OpenVINO, but I plan to compile with them anyway and ask for testing help on Easy Diffusion's Discord server. From what I've read, their features fit my needs, and I don't doubt their published performance numbers (since it's their own hardware).
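
As context for the numbers below: ms/it figures like these are typically measured with warm-up runs first (the first calls trigger compilation and lazy initialization), followed by an averaged timing loop with GPU synchronization. A minimal sketch of such a loop (not my exact harness):

import time
import torch

def benchmark(fn, x, warmup=10, iters=100):
    for _ in range(warmup):
        fn(x)  # warm-up: triggers compilation / lazy init
    torch.cuda.synchronize()  # drain queued GPU work before timing

    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()  # wait for the last kernel to finish
    return (time.perf_counter() - start) / iters * 1000  # avg ms / it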

Raw test results

# For SD VAE (130 MB):

At fp32:
- TensorRT-RTX: 100 ms / it
- PyTorch (Windows): 137 ms / it
- PyTorch (Linux, torch.compile): 137 ms / it
- IREE (CUDA): 3033 ms / it

At fp16:
- TensorRT-RTX: 33 ms / it
- PyTorch (Windows): 72 ms / it
- PyTorch (Linux, torch.compile): 74 ms / it
- IREE (CUDA): 3315 ms / it

IREE (Vulkan) failed to compile.


# For MobileNet (13.3 MB):

At fp32:
- TensorRT-RTX: 1 ms / it
- PyTorch: 6.9 ms / it
- IREE (CUDA): 5.4 ms / it
- IREE (Vulkan): 12.8 ms / it