~ / cmdr2

projects: freebird, easy diffusion

hacks: carbon editor, torchruntime, findstarlink

  • #easydiffusion
  • #sdkit

Following up on the deep-dive on ML compilers: sdkit v3 won’t use general-purpose ML compilers. They aren’t yet ready for sdkit’s target platforms, and need a lot of work (well beyond sdkit v3’s scope). But I’m quite certain that sdkit v4 will use them, and sdkit v3 will start taking steps in that direction. For sdkit v3, I see two possible paths:
  • Use an array of vendor-specific compilers (like TensorRT-RTX, MiGraphX, OpenVINO etc), one for each target platform.
  • Auto-generate ggml code from onnx (or pytorch), and beat it on the head until it meets sdkit v3’s performance goals. Hand-tune kernels, contribute to ggml, and take advantage of ggml’s multi-backend kernels.

Both approaches provide a big step-up from sdkit v2 in terms of install size and performance. So it makes sense to tap into these first, and leave ML compilers for v4 (as another leap forward).

  • #easydiffusion
  • #sdkit
  • #compilers

This post concludes (for now) my ongoing deep-dive into ML compilers, while researching for sdkit v3. I’ve linked (at the end) to some of the papers that I read related to graph execution on GPUs. Some final takeaways:
  • ML compilers might break CUDA’s moat (and fix AMD’s ROCm support).
  • A single compiler is unlikely to fit every scenario.
  • The scheduler needs to be grounded in truth.
  • Simulators might be worth exploring more.

ML compilers might break CUDA’s moat (and fix AMD’s ROCm support)

It’s pretty clear that ML compilers are going to be a big deal. NVIDIA’s TensorRT is also an ML compiler, but it only targets their GPUs. Once the generated machine code (from cross-vendor ML compilers) is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.

  • #blog
  • #notes

Great post on why a “work-in-progress” notes blog is useful - https://gregorygundersen.com/blog/2020/01/12/why-research-blog/

This is exactly why I (re)started this blog. This blog is mainly a way to share the notes that I take when working on problems. I’ve always written huge volumes of notes (privately) when working through problems, but making them public has forced me to:
  • Work through them with more rigor and detail (since they’ll be public).
  • Structure them better.
  • Catch and fix biases.
  • Tackle large topics through a series of posts over time.
  • Write them in a way that I can revisit later on and remember what I was thinking (instead of a giant messy blob of notes).

It is important though to avoid the trap of feeling productive by publishing notes, instead of finally “shipping” the actual thing that you were meant to finish.

  • #ggml
  • #compiler

It looks like ggml has recently added basic automatic operator fusion into their graph executor (example). It uses a hand-coded list of simple rule-based substitutions (e.g. fuse a matrix multiply followed by an add into one op, or a matrix multiply followed by a GLU activation into one op etc). Each fused op is a hand-written kernel. These fusion rules are specified per backend (e.g. separate rules for CUDA/ROCm, separate for Vulkan, separate for Metal etc), presumably because people may not have written fused ops for certain backends (either due to the backend’s lack of popularity, or a lack of sufficient gain in performance).
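To make the idea of rule-based fusion concrete, here is a minimal sketch in Python. It is not ggml’s code or API: the `Node` class, the `FUSION_RULES` table, the op names and the `fuse` function are all hypothetical, and a real executor would also check shapes, dtypes and kernel availability. It only shows the general shape of the technique: a per-backend table of (op, next op) pairs, and a pass that replaces a matching adjacent pair with a single fused node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One op in a linearized compute graph (hypothetical structure)."""
    op: str
    inputs: list = field(default_factory=list)  # names of input tensors
    output: str = ""                            # name of the output tensor


# Hypothetical per-backend rule tables: (first op, second op) -> fused op.
# A backend with no hand-written fused kernels simply has an empty table.
FUSION_RULES = {
    "cuda": {("mul_mat", "add"): "mul_mat_add",
             ("mul_mat", "glu"): "mul_mat_glu"},
    "vulkan": {("mul_mat", "add"): "mul_mat_add"},
    "cpu": {},
}


def fuse(graph, backend):
    """Return a new node list with adjacent op pairs fused where a rule exists."""
    rules = FUSION_RULES.get(backend, {})

    # Count how many nodes consume each tensor. An intermediate result can
    # only be fused away if the next node is its sole consumer.
    uses = {}
    for node in graph:
        for name in node.inputs:
            uses[name] = uses.get(name, 0) + 1

    fused = []
    i = 0
    while i < len(graph):
        node = graph[i]
        nxt = graph[i + 1] if i + 1 < len(graph) else None
        if nxt is not None:
            fused_op = rules.get((node.op, nxt.op))
            if (fused_op and node.output in nxt.inputs
                    and uses.get(node.output, 0) == 1):
                # Keep the second op's other inputs (e.g. the bias of an add).
                extra = [x for x in nxt.inputs if x != node.output]
                fused.append(Node(fused_op, node.inputs + extra, nxt.output))
                i += 2
                continue
        fused.append(node)
        i += 1
    return fused


graph = [
    Node("mul_mat", ["w", "x"], "t0"),
    Node("add", ["t0", "b"], "t1"),
    Node("mul_mat", ["w2", "t1"], "t2"),
    Node("glu", ["t2"], "t3"),
]
print([n.op for n in fuse(graph, "cuda")])
print([n.op for n in fuse(graph, "vulkan")])
```

With these made-up tables, the same graph collapses to two nodes on the "cuda" backend but keeps the unfused mul_mat and glu on "vulkan", which is the per-backend behaviour described above.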

  • #freebird
  • #dom

The next major version of Freebird (i.e. v3) will use a new internal architecture that’s much easier to program with. In some ways, it’s an evolution of the architecture used in Freebird v2, but taken to its logical conclusion. The current version of Freebird (v2) uses a DOM-like model, and borrows a lot of programming patterns from browser-based programming. An underlying runtime abstracts away input events (like trigger_press, drag, enter, leave etc). It follows an event dispatch model (using add_event_listener and dispatch_event). Visual elements like menus, transform handles etc are DOM Nodes, which respond to events like drag and click. It also uses CSS-like styling to provide an easy way to style groups of related elements (like menu buttons).
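As a rough illustration of the DOM-like event model described above, here is a minimal sketch in Python. It is not Freebird’s code: the `Node` class, its constructor arguments and the callback signature are assumptions, and only the method names add_event_listener and dispatch_event come from the post. Bubbling events up to parent nodes is also an assumption, borrowed from how browsers behave.

```python
class Node:
    """A DOM-like node that can listen for and dispatch events (hypothetical)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self._listeners = {}  # event type -> list of callbacks

    def add_event_listener(self, event_type, callback):
        self._listeners.setdefault(event_type, []).append(callback)

    def dispatch_event(self, event_type, **data):
        # Call listeners on this node first, then bubble up through the
        # parents (assumed behaviour, mirroring browser DOM events).
        node = self
        while node is not None:
            for callback in node._listeners.get(event_type, []):
                callback(self, data)
            node = node.parent


menu = Node("menu")
button = Node("save_button", parent=menu)

log = []
button.add_event_listener("click", lambda target, data: log.append(("button", target.name)))
menu.add_event_listener("click", lambda target, data: log.append(("menu", target.name)))

# The runtime would translate a raw input (like trigger_press) into this call.
button.dispatch_event("click", hand="right")
print(log)
```

The point of the pattern is that a menu button only declares what it reacts to, while the runtime decides when a raw input counts as a click or a drag.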

  • #gpu
  • #ai
  • #sdkit

A possible intuition for understanding GPU memory hierarchy (and the performance penalty for data transfer between various layers) is to think of it like a manufacturing logistics problem:
  • CPU (host) to GPU (device) is like travelling overnight between two cities. The CPU city is like the “headquarters”, and contains a mega-sized warehouse of parts (think football field sizes), also known as ‘Host memory’.
  • Each GPU is like a different city, containing its own warehouse outside the city, also known as ‘Global Memory’. This warehouse stockpiles whatever it needs from the headquarters city (CPU).
  • Each SM/Core/Tile is a factory located in different areas of the city. Each factory contains a small warehouse (shed) for stockpiling whatever inventory it needs, also known as ‘Shared Memory’.
  • Each warp is a bulk stamping machine inside the factory, producing 32 items in one shot. There’s a tray next to each machine, also known as ‘Registers’. This tray is used for keeping stuff temporarily for each stamping process.

This analogy can help understand the scale and performance penalty for data transfers.

  • #mlir
  • #easydiffusion
  • #sdkit

Good post on using MLIR for compiling ML models to GPUs. It gives a good broad overview of a GPU architecture, and how MLIR fits into that. The overall series looks pretty interesting too! Making a note here for future reference - https://www.stephendiehl.com/posts/mlir_gpu/

  • #easydiffusion
  • #samplers
  • #c++

Wrote a fresh implementation of most of the popular samplers and schedulers used for image generation (Stable Diffusion and Flux) at https://github.com/cmdr2/samplers.cpp. A few other schedulers (like Align Your Steps) have been left out for now, but are pretty easy to implement. It’s still work-in-progress, and is not ready for public use. The algorithmic port has been completed, and the next step is to test the output values against reference values (from another implementation, e.g. Forge WebUI). After that, I’ll translate it to C++.
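As an example of the kind of check described above, here is a small sketch in Python (not code from samplers.cpp). It computes the widely used Karras noise schedule, sigma_i = (sigma_max^(1/rho) + i/(n-1) * (sigma_min^(1/rho) - sigma_max^(1/rho)))^rho, and provides a helper to compare a list of values against reference values from another implementation. The function names, the default rho=7 and the sigma range in the usage lines are assumptions for illustration. Some implementations also append a trailing zero sigma, which is omitted here.

```python
def karras_sigmas(n, sigma_min, sigma_max, rho=7.0):
    """Return n noise levels, from sigma_max down to sigma_min (Karras schedule)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [sigma_max]
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    # Interpolate linearly in sigma^(1/rho) space, then map back.
    return [(max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho for i in range(n)]


def max_relative_error(values, reference):
    """Largest relative difference between our values and the reference values."""
    if len(values) != len(reference):
        raise ValueError("length mismatch")
    return max(abs(v - r) / max(abs(r), 1e-12) for v, r in zip(values, reference))


# Hypothetical sigma range, for illustration only.
sigmas = karras_sigmas(10, sigma_min=0.03, sigma_max=14.6)
print(sigmas)
```

The reference list would be dumped from the other implementation with the same step count and sigma range, and the test would then assert that max_relative_error(sigmas, reference) stays below a chosen tolerance.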

  • #easydiffusion
  • #sdkit
  • #compilers

Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion’s next engine (i.e. sdkit v3). For context, see the design constraints of the new engine.

tl;dr summary

The current state is:
  • Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. TensorRT-RTX for NVIDIA, MiGraphX for AMD, OpenVINO for Intel.
  • Cross-vendor compilers (e.g. TVM, IREE, XLA) are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs.
  • The focus of cross-vendor compilers seems to be either on datacenter hardware, or embedded devices. The performance on desktops and laptops is pretty poor. Mojo doesn’t target this category (and doesn’t support Windows), probably because datacenters and embedded devices are currently where the attention (and money) is.

  • #easydiffusion
  • #sdkit
  • #engine

The design constraints for Easy Diffusion’s next engine (i.e. sdkit v3) are:
  • Lean: Install size of < 200 MB uncompressed (excluding models).
  • Fast: Performance within 10% of the best-possible speed on that GPU for that model.
  • Capable: Supports Stable Diffusion 1.x, 2.x, 3.x, XL, Flux, Chroma, ControlNet, LORA, Embedding, VAE. Supports loading custom model weights (from civitai etc), and memory offloading (for smaller GPUs).
  • Targets: Desktops and Laptops, Windows/Linux/Mac, NVIDIA/AMD/Intel/Apple.

I think it’s possible, using ML compilers like TensorRT-RTX (and similar compilers for other platforms). See: Some notes on ML compilers.