
// Cross-posted from Easy Diffusion’s blog.

Successfully compiled the VAE of Stable Diffusion 1.5 using graph-compiler.

The compiled model is terribly slow because I haven’t written any performance optimizations, and it (conservatively) converts a lot of intermediate tensors to contiguous copies. But we don’t need any clever optimizations to get to decent performance, just basic ones.
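For context, a "contiguous copy" materializes a strided view (like a transpose, which normally moves no data) into freshly laid-out memory. A minimal sketch of why inserting these conservatively costs time; the names here are illustrative, not the compiler's actual code:

```python
# Hypothetical sketch: a zero-copy strided "view" vs. a contiguous copy.

def transpose_view(data, orig_cols):
    """Zero-copy transpose of a row-major matrix: only the
    indexing function changes; no data moves."""
    return lambda i, j: data[j * orig_cols + i]

def contiguous_copy(view, rows, cols):
    """Materialize the view into new row-major storage.
    This is the O(n) copy a conservative compiler inserts
    before ops that require contiguous inputs."""
    return [view(i, j) for i in range(rows) for j in range(cols)]

data = [1, 2, 3,
        4, 5, 6]                      # 2x3 matrix, row-major
t = transpose_view(data, 3)           # logical 3x2 transpose, no copy
flat = contiguous_copy(t, 3, 2)       # actual copy: [1, 4, 2, 5, 3, 6]
```

Each conservative copy is a full pass over the tensor, which is why skipping the unnecessary ones is the obvious first optimization.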

It’s pretty exciting because I was able to bypass the need to port the model to C++ manually. Instead, I just compiled the exported ONNX model and got the same output values as the original PyTorch implementation (given the same input and weights). And I could compile to any platform supported by ggml by changing a single flag (CPU, CUDA, ROCm, Vulkan, Metal, etc).
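The parity check itself boils down to an elementwise tolerance comparison of the flattened outputs. A sketch of that check (the helper is mine, not part of the project):

```python
# Illustrative helper (not from the project): verify that the compiled
# graph's output matches the PyTorch reference within float tolerance.

def outputs_match(reference, candidate, rtol=1e-5, atol=1e-6):
    """Elementwise comparison of two flattened float outputs."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))
```

A small tolerance (rather than exact equality) is the usual choice here, since different backends reorder floating-point arithmetic.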

This pushes the idea of compiling models from their ONNX export (instead of rewriting them manually in C++) further. In the future, the compiler will be able to perform a number of optimizations (far more than we could do manually for larger models).

The only big downside of this approach is the need to specify the input shape (e.g. 512x512) during compilation. The compiled graph will only work with that input shape. I’m still thinking about this problem.

The VAE of Stable Diffusion 1.5 is a 10x step-up in complexity from my first test model (i.e. TinyCNN). It has one Attention operation, a bunch of Conv2D, MatMul and Transpose operations, and the weights file is 133 MB in size (float32).

The next model to target is the Unet of Stable Diffusion 1.5. This will be another 10x step-up in complexity and model size (compared to the VAE). I think the ONNX-to-GGML translation layer (i.e. ggml-onnx.h) now covers most of the operators required for Unet, but I’ll know more once I actually try to compile and run the model.
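One quick way to find out ahead of time is to diff the op types in the Unet's ONNX graph against the translation layer's supported set. A sketch of that pre-check (the op list and helper are illustrative, not ggml-onnx.h's actual API):

```python
# Illustrative pre-check (not ggml-onnx.h's actual API): list which
# ONNX op types in a graph have no translation rule yet.

SUPPORTED_OPS = {          # hypothetical subset, for illustration only
    "Conv", "MatMul", "Transpose", "Softmax", "Add", "Mul",
    "Reshape", "Sigmoid", "Concat",
}

def unsupported_ops(graph_op_types):
    """Return the set of op types the translation layer doesn't cover."""
    return set(graph_op_types) - SUPPORTED_OPS

# With a real model, graph_op_types would come from
# [node.op_type for node in onnx_model.graph.node].
missing = unsupported_ops(["Conv", "MatMul", "Gemm", "Erf"])
```

An empty result doesn't guarantee the model will run (attribute combinations can still be unsupported), but a non-empty one tells you exactly which translations to write first.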