~ / cmdr2

Jan 12 09:19 2026

#freebird

Freebird v2.6.0 released. Changes since the last blog post (v2.2.2):
- Adds the ability to add and edit Text while inside VR. This is useful for labeling and making notes inside VR, without having to sketch notes by hand.
- Adds support for Vulkan and Blender 5.
- Shows the scene scale in the controller’s panel (below the main menu). This lets you know the zoom level of the scene, e.g. 1:1 or 1:10 or 15:1, so that you can plan accordingly when working with real-world units.

Dec 08 12:55 2025

#sdkit
#easydiffusion

The new engine that’ll power Easy Diffusion’s upcoming v4 release (i.e. sdkit3) has now been integrated into Easy Diffusion. It’s available to test by selecting v4 engine in the Settings tab (after enabling Beta). Please press Save and restart Easy Diffusion after selecting this. It uses stable-diffusion.cpp and ggml under the hood, and produces optimized, lightweight builds for the target hardware. The main benefits of Easy Diffusion’s new engine are:

Nov 27 10:05 2025

#sdkit
#v3

Managed to get stable-diffusion.cpp integrated into sdkit v3 and Easy Diffusion. sdkit v3 wraps stable-diffusion.cpp with an API server. For now, the API server exposes an API compatible with Forge WebUI. This saves me time, and allows Easy Diffusion to work out-of-the-box with the new C++ based sdkit. It compiles and runs quite well. Ran it with Easy Diffusion’s UI. Tested with Vulkan and CUDA, on Windows.
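As a rough illustration of what "an API compatible with Forge WebUI" means for a client, here is a minimal sketch that builds a text-to-image request and decodes the reply. It assumes the server accepts the `/sdapi/v1/txt2img` route and returns base64-encoded images in an `images` field, as Forge WebUI does; whether sdkit v3 covers exactly this route and these fields is an assumption, and the base URL, port and helper names are hypothetical.

```python
import base64
import json
from urllib import request


def build_txt2img_request(base_url, prompt, steps=20, width=512, height=512):
    """Build a POST request in the Forge WebUI txt2img shape (assumed route)."""
    payload = {"prompt": prompt, "steps": steps, "width": width, "height": height}
    return request.Request(
        url=base_url.rstrip("/") + "/sdapi/v1/txt2img",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_images(response_body):
    """Decode the base64 image strings from a Forge-style JSON response."""
    body = json.loads(response_body)
    return [base64.b64decode(img) for img in body.get("images", [])]


# Usage against a locally running server (hypothetical address):
#   req = build_txt2img_request("http://localhost:7860", "a lighthouse at dusk")
#   with request.urlopen(req) as resp:
#       images = decode_images(resp.read())
```

Because the request shape matches what an existing UI already sends, the UI can talk to the new backend without changes, which is the time saving mentioned above.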

Nov 19 05:44 2025

#sdkit
#ggml
#compiler

Following up on the previous post on sdkit v3’s design: The initial experiments with generating ggml from onnx models were promising, and it looks like a fairly solid path forward. It produces numerically-identical results, and there’s a clear path to reach performance-parity with stable-diffusion.cpp with a few basic optimizations (since both will eventually generate the same underlying ggml graph). But I think it’s better to use the simpler option first, i.e. use stable-diffusion.cpp directly. It mostly meets the design goals for sdkit v3 (after a bit of performance tuning). Everything else is premature optimization and scope bloat.

Nov 18 11:13 2025

#ml
#compiler
#sdkit
#onnx
#ggml

Successfully compiled the VAE of Stable Diffusion 1.5 using graph-compiler. The compiled model is terribly slow because I haven’t written any performance optimizations, and it (conservatively) converts a lot of intermediate tensors to contiguous copies. But we don’t need any clever optimizations to get to decent performance, just basic ones. It’s pretty exciting because I was able to bypass the need to port the model to C++ manually. Instead, I was able to just compile the exported ONNX model and get the same output values as the original PyTorch implementation (given the same input and weights). I could compile to any platform supported by ggml by just changing one flag (e.g. CPU, CUDA, ROCm, Vulkan, Metal etc).

Nov 13 09:46 2025

#ml
#compiler
#sdkit

PolyBlocks is another interesting ML compiler, written using MLIR. It’s a startup incubated in IISc Bangalore, run by someone (Uday Bondhugula) who co-authored a paper on compiler optimizations for GPGPUs back in 2008 (17 years ago)! Some of the compiler passes to keep in mind:
- fusion
- tiling
- use hardware acceleration (like tensor cores)
- constant folding
- perform redundant computation to avoid global memory accesses where profitable
- pack into buffers
- loop transformation
- unroll-and-jam (register tiling?)
- vectorization
- reorder execution for better spatial, temporal and group reuse

Scheduling approaches:

Nov 05 09:47 2025

#easydiffusion
#sdkit

Following up on the deep-dive on ML compilers: sdkit v3 won’t use general-purpose ML compilers. They aren’t yet ready for sdkit’s target platforms, and need a lot of work (well beyond sdkit v3’s scope). But I’m quite certain that sdkit v4 will use them, and sdkit v3 will start taking steps in that direction. For sdkit v3, I see two possible paths:
1. Use an array of vendor-specific compilers (like TensorRT-RTX, MiGraphX, OpenVINO etc), one for each target platform.
2. Auto-generate ggml code from onnx (or pytorch), and beat it on the head until it meets sdkit v3’s performance goals. Hand-tune kernels, contribute to ggml, and take advantage of ggml’s multi-backend kernels.

Both approaches provide a big step-up from sdkit v2 in terms of install size and performance. So it makes sense to tap into these first, and leave ML compilers for v4 (as another leap forward).

Nov 05 09:43 2025

#easydiffusion
#sdkit
#compilers

This post concludes (for now) my ongoing deep-dive into ML compilers, while researching for sdkit v3. I’ve linked (at the end) to some of the papers that I read related to graph execution on GPUs. Some final takeaways:
- ML compilers might break CUDA’s moat (and fix AMD’s ROCm support).
- A single compiler is unlikely to fit every scenario.
- The scheduler needs to be grounded in truth.
- Simulators might be worth exploring more.

ML compilers might break CUDA’s moat (and fix AMD’s ROCm support)

It’s pretty clear that ML compilers are going to be a big deal. NVIDIA’s TensorRT is also an ML compiler, but it only targets their GPUs. Once the generated machine code (from cross-vendor ML compilers) is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.

Nov 05 06:19 2025

#blog
#notes

Great post on why a “work-in-progress” notes blog is useful - https://gregorygundersen.com/blog/2020/01/12/why-research-blog/

This is exactly why I (re)started this blog. This blog is mainly a way to share the notes that I take when working on problems. I’ve always written huge volumes of notes (privately) when working through problems, but making them public has forced me to:
- Work through them with more rigor and detail (since they’ll be public).
- Structure them better.
- Catch and fix biases.
- Tackle large topics through a series of posts over time.
- Write them in a way that I can revisit later on and remember what I was thinking (instead of a giant messy blob of notes).

It is important though to avoid the trap of feeling productive by publishing notes, instead of finally “shipping” the actual thing that you were meant to finish.

Nov 03 10:38 2025

#ggml
#compiler

It looks like ggml has recently added basic automatic operator fusion into their graph executor (example). It uses a hand-coded list of simple rule-based substitutions (e.g. fuse a matrix multiply followed by add into one op, or a matrix multiply followed by GLU activation into one op etc). Each fused op is a hand-written kernel. These fusion rules are specified per backend (e.g. separate rules for CUDA/ROCm, separate for Vulkan, separate for Metal etc), presumably because people may not have written fused ops for certain backends (either due to the backend’s popularity, or lack of sufficient gain in performance).
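To make the idea of per-backend, rule-based fusion concrete, here is a minimal sketch in Python. It is not ggml's code or API: the op names, the rule tables and the `fuse` helper are all hypothetical, and the graph is simplified to a linear sequence of ops (a real executor would also need to check that the intermediate result isn't consumed elsewhere).

```python
# Hypothetical per-backend rule tables: (first op, second op) -> fused op name.
# A backend with no entry for a pair simply runs the two ops separately.
FUSION_RULES = {
    "cuda": {
        ("mul_mat", "add"): "fused_mul_mat_add",
        ("mul_mat", "glu"): "fused_mul_mat_glu",
    },
    "vulkan": {
        ("mul_mat", "add"): "fused_mul_mat_add",
    },
    "metal": {},
}


def fuse(ops, backend):
    """Replace adjacent op pairs with a fused op when the backend has a rule."""
    rules = FUSION_RULES.get(backend, {})
    fused = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if len(pair) == 2 and pair in rules:
            fused.append(rules[pair])  # one hand-written kernel replaces two ops
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


ops = ["mul_mat", "add", "mul_mat", "glu", "softmax"]
print(fuse(ops, "cuda"))
print(fuse(ops, "vulkan"))
```

The same op sequence fuses differently depending on which rules a backend has, which matches the observation that fused kernels exist for some backends and not others.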


projects: freebird, easy diffusion

hacks: carbon editor, torchruntime, findstarlink