
// Cross-posted from Easy Diffusion’s blog.

Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion’s next engine (i.e. sdkit v3). For context, see the design constraints of the new engine.

tl;dr summary

The current state is:

  1. Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. TensorRT-RTX for NVIDIA, MiGraphX for AMD, and OpenVINO for Intel.
  2. Cross-vendor compilers like TVM, IREE, and XLA are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs.

Cross-vendor compilers seem to focus on either datacenter hardware or embedded devices, and their performance on desktops and laptops is pretty poor. Mojo doesn't target this category at all (and doesn't support Windows). That's probably because datacenters and embedded devices are currently where the attention (and money) is.

This could change in the future! The idea of a cross-vendor ML compiler is clearly awesome, and I think this is the way things should go. But we’re not there yet for desktops/laptops, in terms of runtime performance.

What’s an ML compiler?

It’s a compiler for ML models (not a compiler that uses ML to compile). The basic idea of an ML compiler is to treat an ML model’s execution graph as a program to compile, and to produce an optimized set of GPU-specific instructions. The compiler can optimize the execution graph by doing things like fusing operations together, parallelizing operations when possible, and even mapping groups of operators to GPU-specific instructions. It can use its knowledge of the target GPU architecture to optimize the memory layout and parallelism of operations. Basically what compilers already do for CPUs today, but for GPUs.
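
To make the fusion idea concrete, here's a minimal sketch (illustrative, not from my benchmark code) of three elementwise ops that a graph compiler can fuse into a single GPU kernel. torch.compile's Inductor backend performs exactly this kind of pointwise fusion:

import torch

# Naively, each of these elementwise ops launches its own GPU kernel
# and round-trips the intermediate tensor through GPU memory.
def activation(x):
    return torch.relu(x * 2.0 + 1.0)

# A graph compiler can fuse mul + add + relu into one kernel,
# eliminating the intermediate memory traffic.
fused = torch.compile(activation)

x = torch.randn(1024, 1024, device="cuda")  # assumes an NVIDIA GPU
y = fused(x)  # first call compiles; later calls run the fused kernel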

We already have a decent graph format: ONNX. Every model that I intend to support has ONNX exports available (and it’s easy to export one, for new models).
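
For example, exporting a PyTorch model to ONNX takes a few lines with torch.onnx.export (a minimal sketch; the MobileNet model and input shape here are just placeholders):

import torch
import torchvision

# Placeholder model; any torch.nn.Module with a traceable forward() works.
model = torchvision.models.mobilenet_v2(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input used for tracing

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # optional: dynamic batch size
)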

ML compiler projects

Cross-vendor ML compilers:
- TVM
- IREE
- XLA
- Mojo (doesn't target desktops, and doesn't support Windows)

Vendor-specific ML compilers:
- TensorRT-RTX (NVIDIA)
- MiGraphX (AMD)
- OpenVINO (Intel)

Testing compilers

I tested on a Windows 11 desktop with an NVIDIA 3060 12 GB (CUDA backend). The raw numbers are in the next section.

I don't have an AMD or Intel GPU to test MiGraphX or OpenVINO, but I plan to compile with them anyway and ask for testing help on Easy Diffusion's Discord server. From what I've read, their features fit my needs, and I don't doubt their published performance numbers (since it's their own hardware).
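
As context for the numbers below: ms/it figures like these are typically measured with warm-up runs first (the first calls trigger compilation and lazy initialization), followed by an averaged timing loop with GPU synchronization. A minimal sketch of such a loop (not my exact harness):

import time
import torch

def benchmark(fn, x, warmup=10, iters=100):
    for _ in range(warmup):
        fn(x)  # warm-up: triggers compilation / lazy init
    torch.cuda.synchronize()  # drain queued GPU work before timing

    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()  # wait for the last kernel to finish
    return (time.perf_counter() - start) / iters * 1000  # avg ms / it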

Raw test results

# For SD VAE (130 MB):

At fp32:
- TensorRT-RTX: 100 ms / it
- PyTorch (Windows): 137 ms / it
- PyTorch (Linux, torch.compile): 137 ms / it
- IREE (CUDA): 3033 ms / it

At fp16:
- TensorRT-RTX: 33 ms / it
- PyTorch (Windows): 72 ms / it
- PyTorch (Linux, torch.compile): 74 ms / it
- IREE (CUDA): 3315 ms / it

IREE (Vulkan) failed to compile.


# For MobileNet (13.3 MB):

At fp32:
- TensorRT-RTX: 1 ms / it
- PyTorch: 6.9 ms / it
- IREE (CUDA): 5.4 ms / it
- IREE (Vulkan): 12.8 ms / it