~ / cmdr2

projects: freebird, easy diffusion

hacks: carbon editor, torchruntime, findstarlink

  • #gpu
  • #ai
  • #sdkit

A possible intuition for understanding the GPU memory hierarchy (and the performance penalty for data transfer between its layers) is to think of it like a manufacturing logistics problem:

  • CPU (host) to GPU (device) is like travelling overnight between two cities. The CPU city is the “headquarters”, and contains a mega-sized warehouse of parts (think football-field size), also known as ‘Host Memory’.
  • Each GPU is a different city, containing its own warehouse outside the city, also known as ‘Global Memory’. This warehouse stockpiles whatever it needs from the headquarters city (CPU).
  • Each SM/Core/Tile is a factory located in a different area of the city. Each factory contains a small warehouse for stockpiling whatever inventory it needs, also known as ‘Shared Memory’.
  • Each warp is a bulk stamping machine inside the factory, producing 32 items in one shot. Next to each machine is a tray, also known as ‘Registers’, used for holding parts temporarily during each stamping run.

This analogy helps convey the scale of (and the performance penalty for) data transfers between the layers.
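
To make the “overnight travel” part of the analogy concrete, here’s a rough sketch that times a host-to-device copy against an on-device computation using PyTorch. It assumes torch is installed and a CUDA GPU is available; the exact numbers will vary by machine, but the copy is typically far slower per byte than on-device work.

```python
import time
import torch

torch.zeros(1, device="cuda")  # warm up the CUDA context so timings are fairer

x = torch.randn(4096, 4096)  # lives at "headquarters" (host memory)

# Ship the parts overnight: host -> device (global memory)
t0 = time.perf_counter()
x_gpu = x.to("cuda")
torch.cuda.synchronize()
print(f"host -> device copy: {time.perf_counter() - t0:.4f}s")

# Work done inside the "city": the kernel stages tiles of global memory
# into each SM's shared memory, and per-thread values live in registers.
t0 = time.perf_counter()
y = x_gpu @ x_gpu
torch.cuda.synchronize()
print(f"on-device matmul:    {time.perf_counter() - t0:.4f}s")
```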

  • #mlir
  • #easydiffusion
  • #sdkit

A good post on using MLIR to compile ML models for GPUs. It gives a broad overview of GPU architecture and how MLIR fits into it. The overall series looks pretty interesting too. Noting it here for future reference: https://www.stephendiehl.com/posts/mlir_gpu/

  • #easydiffusion
  • #sdkit
  • #compilers

Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion’s next engine (i.e. sdkit v3). For context, see the design constraints of the new engine.

tl;dr summary of the current state:

  • Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. TensorRT-RTX for NVIDIA, MIGraphX for AMD, OpenVINO for Intel.
  • Cross-vendor compilers (e.g. TVM, IREE, XLA) are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs. Their focus seems to be either datacenter hardware or embedded devices, and their performance on desktops and laptops is pretty poor.
  • Mojo doesn’t target this category (and doesn’t support Windows).

This is probably because datacenters and embedded devices are currently where the attention (and money) is.
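
For a flavor of the vendor-specific route, here’s a minimal sketch using OpenVINO’s Python API (one of the compilers mentioned above). The model path is a placeholder for a network already exported to OpenVINO’s IR format; each vendor has its own equivalent of this compile step.

```python
import openvino as ov

core = ov.Core()
model = core.read_model("unet.xml")  # placeholder path to an exported IR model

# Compile for a specific target; OpenVINO targets Intel CPUs and GPUs.
compiled = core.compile_model(model, device_name="GPU")

# The compiled model is then callable like a function on input tensors.
# TensorRT-RTX (NVIDIA) and MIGraphX (AMD) each have their own analogous
# compile step, which is what makes a lean cross-vendor engine hard.
```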

  • #easydiffusion
  • #sdkit
  • #engine

The design constraints for Easy Diffusion’s next engine (i.e. sdkit v3) are:

  • Lean: Install size of < 200 MB uncompressed (excluding models).
  • Fast: Performance within 10% of the best-possible speed on that GPU for that model.
  • Capable: Supports Stable Diffusion 1.x, 2.x, 3.x, XL, Flux, Chroma, ControlNet, LoRA, Embeddings, and VAEs. Supports loading custom model weights (from civitai etc), and memory offloading (for smaller GPUs).
  • Targets: Desktops and laptops, Windows/Linux/Mac, NVIDIA/AMD/Intel/Apple.

I think it’s possible, using ML compilers like TensorRT-RTX (and similar compilers for other platforms). See: Some notes on ML compilers.

  • #easydiffusion
  • #sdkit
  • #amd
  • #torchruntime
  • #windows
  • #intel
  • #integrated
  • #directml

Easy Diffusion (and sdkit) now also supports AMD GPUs on Windows automatically (using DirectML), thanks to its integration with torchruntime. It also supports integrated GPUs (Intel and AMD) on Windows, making Easy Diffusion faster on PCs without dedicated graphics cards.
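
For reference, here’s a minimal sketch of how DirectML plugs into PyTorch on Windows, via the torch-directml package (assumes `pip install torch-directml`, which is the backend being set up in this scenario):

```python
import torch
import torch_directml

# The first DirectML-capable GPU (AMD/Intel, including integrated GPUs)
device = torch_directml.device()

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b  # runs on the DirectML device instead of the CPU
print(c.device)
```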

  • #easydiffusion
  • #torchruntime
  • #sdkit

Spent the last week or two getting torchruntime fully integrated into Easy Diffusion, and making sure it handles all the edge cases. Easy Diffusion now uses torchruntime to automatically install the best-possible version of torch on the user’s computer, and to support a wider variety of GPUs (including older GPUs). It also uses a GPU-agnostic device API, so Easy Diffusion will automatically support additional GPUs as torchruntime adds support for them.
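
A sketch of the idea behind a GPU-agnostic device API is below. To be clear, this is not torchruntime’s actual interface: `get_best_device()` is a hypothetical helper, just illustrating what such an abstraction hides from the calling code.

```python
import torch

def get_best_device() -> torch.device:
    """Hypothetical helper: pick the best available backend."""
    if torch.cuda.is_available():  # NVIDIA (and ROCm builds of torch)
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")  # fallback

# Application code never hard-codes "cuda"; supporting a new backend
# only requires changes inside the helper, not at every call site.
device = get_best_device()
```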

  • #easydiffusion
  • #sdkit
  • #freebird
  • #worklog

Continued to test and fix issues in sdkit after the change to support DirectML. The change is fairly intrusive, since it replaces direct references to torch.cuda with a layer of abstraction. Fixed a few regressions, and it now passes all the regression tests for CPU and CUDA support (i.e. existing users). Will test DirectML next, although it will fail (with out-of-memory) on anything but the simplest tests, since DirectML is quirky with memory allocation.
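
As an illustration of the kind of abstraction layer described above (hypothetical code, not sdkit’s actual internals), device-specific calls like `torch.cuda.empty_cache()` can go behind one module, so the rest of the codebase never references torch.cuda directly:

```python
import gc
import torch

def empty_cache(device: torch.device):
    """Free cached memory, in a backend-appropriate way."""
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
    # DirectML ("privateuseone") and CPU have no equivalent call,
    # so the abstraction simply no-ops for them.

def mem_get_info(device: torch.device):
    """Return (free, total) memory in bytes, or None if unknown."""
    if device.type == "cuda":
        return torch.cuda.mem_get_info(device)
    return None  # not queryable via DirectML; callers must handle None
```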

  • #easydiffusion
  • #sdkit

Worked on adding DirectML support in sdkit. This allows AMD GPUs and integrated GPUs to generate images on Windows. DirectML seems really inefficient with memory though, so for now it only manages to generate images with SD 1.5. XL and larger models fail to generate, even though my graphics card has 12 GB of VRAM.
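
For context, these are the kind of memory-saving knobs worth trying on DirectML, shown here with diffusers (which sdkit builds on). The model ID is a placeholder, and whether these reclaim enough memory for XL-class models on DirectML is exactly the open question above.

```python
import torch_directml
from diffusers import StableDiffusionPipeline

device = torch_directml.device()

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # placeholder model ID
).to(device)

pipe.enable_attention_slicing()  # compute attention in smaller chunks
pipe.enable_vae_slicing()        # decode the VAE in slices

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("out.png")
```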