~ / cmdr2

projects: freebird, easy diffusion

hacks: carbon editor, torchruntime, findstarlink

  • #sdkit
  • #easydiffusion

Released Easy Diffusion v4.3 (which updates to sdkit v3.2). This adds support for Ernie Image (and Ernie Turbo), as well as improved support for Anima models. It also includes a bunch of bug fixes in the rendering engine (i.e. stable-diffusion.cpp), and a few community-contributed bug fixes to the UI.

  • #easydiffusion
  • #sdkit
  • #worklog

Got Easy Diffusion v4 working on Apple Silicon and Intel Macs. The performance ratio (vs ED v3) is similar to the ratio on Windows (with CUDA) and other deployment targets, which indicates optimization opportunities in sd.cpp: it’s currently about 1.5x slower than diffusers-based Stable Diffusion. In other news, easyinstaller is out with its first release, which means that Easy Diffusion can now start shipping AppImage, Flatpak, rpm, deb, pkg, dmg etc. for the different platforms, instead of requiring Linux and Mac users to use the terminal to install and start Easy Diffusion. Will work on this soon.

  • #easydiffusion
  • #sdkit
  • #worklog

For Z-Image, the performance of the stock version of chromaForge is poorer than sd.cpp’s, mainly because chromaForge isn’t able to run the smaller gguf-quantized models that sd.cpp can (chromaForge fails with the errors I was fixing yesterday). If I really want to push through with this, it would be good to fix the remaining gguf issues in chromaForge; only then can the performance be truly compared (in order to decide whether to release this into ED 3.5). I want to compare the performance of the smaller gguf models, because that’s what ED’s users will typically run.

  • #easydiffusion
  • #sdkit
  • #worklog

Worked on fixing Z-Image support in ED’s fork of chromaForge (a fork of Forge WebUI). Fixed a number of integration issues. It’s now crashing on a matrix multiplication error, which looks like an incorrectly transposed matrix (most likely due to reading the weights in the wrong order). I’ll try to install a stock version of chromaForge to see its raw performance with Z-Image (and whether it’s worth pursuing the integration), and also use it to help investigate the matrix multiplication error (and any future errors).
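To illustrate the kind of failure mode suspected above (not the actual chromaForge code, and with made-up layer sizes): a weight loaded in the wrong order produces a matmul shape mismatch, because PyTorch stores linear weights as (out_features, in_features).

```python
import numpy as np

# Hypothetical shapes for illustration: a linear layer with
# 320 input features and 1280 output features.
x = np.random.randn(1, 320).astype(np.float32)     # activations
w = np.random.randn(1280, 320).astype(np.float32)  # weights stored as (out, in)

# Correct: transpose the (out, in) weight before multiplying.
y = x @ w.T  # shape (1, 1280)

# Incorrect: reading the weight in the wrong order (or skipping the
# transpose) raises the kind of matmul error seen in the crash.
try:
    _ = x @ w  # (1, 320) @ (1280, 320) -> shape mismatch
except ValueError as e:
    print("matmul error:", e)
```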

  • #sdkit
  • #easydiffusion

The new engine that’ll power Easy Diffusion’s upcoming v4 release (i.e. sdkit3) has now been integrated into Easy Diffusion. It’s available to test by selecting the v4 engine in the Settings tab (after enabling Beta). Please press Save and restart Easy Diffusion after selecting it. It uses stable-diffusion.cpp and ggml under the hood, and produces optimized, lightweight builds for the target hardware. The main benefits of Easy Diffusion’s new engine are:

  • #sdkit
  • #v3

Managed to get stable-diffusion.cpp integrated into sdkit v3 and Easy Diffusion. sdkit v3 wraps stable-diffusion.cpp with an API server. For now, the API server exposes an API compatible with Forge WebUI. This saves me time, and allows Easy Diffusion to work out-of-the-box with the new C++ based sdkit. It compiles and runs quite well. Ran it with Easy Diffusion’s UI. Tested with Vulkan and CUDA, on Windows.
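As a rough sketch of what "Forge WebUI-compatible" means for a client: the standard Forge/A1111 txt2img route is /sdapi/v1/txt2img, which accepts a JSON body and returns base64-encoded images. The host, port, and default parameter values below are assumptions for illustration.

```python
import base64
import json
import urllib.request

def txt2img_payload(prompt, steps=25, width=512, height=512):
    """Build a txt2img request body in the Forge WebUI API format."""
    return {"prompt": prompt, "steps": steps, "width": width, "height": height}

def generate(prompt, url="http://localhost:7860/sdapi/v1/txt2img"):
    """POST the request and decode the first base64-encoded image."""
    req = urllib.request.Request(
        url,
        data=json.dumps(txt2img_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        images = json.load(resp)["images"]  # list of base64-encoded PNGs
    return base64.b64decode(images[0])

# Usage (requires a running server):
#   png_bytes = generate("an astronaut riding a horse")
```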

  • #sdkit
  • #ggml
  • #compiler

Following up on the previous post about sdkit v3’s design: the initial experiments with generating ggml from onnx models were promising, and it looks like a fairly solid path forward. It produces numerically identical results, and there’s a clear path to performance parity with stable-diffusion.cpp with a few basic optimizations (since both will eventually generate the same underlying ggml graph). But I think it’s better to use the simpler option first, i.e. use stable-diffusion.cpp directly. It mostly meets the design goals for sdkit v3 (after a bit of performance tuning). Everything else is premature optimization and scope bloat.

  • #ml
  • #compiler
  • #sdkit
  • #onnx
  • #ggml

Successfully compiled the VAE of Stable Diffusion 1.5 using graph-compiler. The compiled model is terribly slow because I haven’t written any performance optimizations, and it (conservatively) converts a lot of intermediate tensors to contiguous copies. But we don’t need any clever optimizations to get to decent performance, just basic ones. It’s pretty exciting because I was able to bypass the need to port the model to C++ manually. Instead, I was able to just compile the exported ONNX model and get the same output values as the original PyTorch implementation (given the same input and weights). I could compile to any platform supported by ggml by just changing one flag (e.g. CPU, CUDA, ROCm, Vulkan, Metal etc).

  • #ml
  • #compiler
  • #sdkit

PolyBlocks is another interesting ML compiler, written using MLIR. It’s a startup incubated at IISc Bangalore, run by Uday Bondhugula, who co-authored a paper on compiler optimizations for GPGPUs back in 2008 (17 years ago)! Some of the compiler passes to keep in mind:

  • fusion
  • tiling
  • use hardware acceleration (like tensor cores)
  • constant folding
  • perform redundant computation to avoid global memory accesses, where profitable
  • pack into buffers
  • loop transformation
  • unroll-and-jam (register tiling?)
  • vectorization
  • reorder execution for better spatial, temporal and group reuse

Scheduling approaches:

  • #ml
  • #compiler
  • #onnx
  • #ggml
  • #sdkit
  • #worklog

Wrote a simple script to convert ONNX to GGML. It auto-generates C++ code that calls the corresponding ggml functions (for each ONNX operator). This file can then be compiled and run like a normal C++ ggml program, and will produce the same results as the original model in PyTorch. The generated file can work on multiple backends: CPU, CUDA, ROCm, Vulkan, Metal etc, by providing the correct compiler flags during cmake -B, e.g. -D GGML_CUDA=1 for CUDA.
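The per-operator mapping idea can be sketched as follows. This is an illustrative miniature, not the actual script: it maps a handful of ONNX op types to their ggml counterparts and emits one line of C++ per node (the node representation here is a simplified stand-in for onnx.NodeProto).

```python
# Map ONNX op type -> ggml function name (tiny illustrative subset).
OP_TO_GGML = {
    "Add": "ggml_add",
    "Mul": "ggml_mul",
    "MatMul": "ggml_mul_mat",
    "Relu": "ggml_relu",
}

def emit_node(op_type, inputs, output):
    """Generate one line of C++ calling the ggml function for an ONNX node."""
    fn = OP_TO_GGML[op_type]
    args = ", ".join(["ctx"] + inputs)
    return f"struct ggml_tensor * {output} = {fn}({args});"

# Example: an Add node with two inputs becomes a single ggml_add call.
line = emit_node("Add", ["x", "bias"], "y")
print(line)  # struct ggml_tensor * y = ggml_add(ctx, x, bias);
```

Walking the graph in topological order and emitting one such line per node yields a C++ file that rebuilds the same compute graph with ggml, which is what makes the one-flag backend switch (e.g. -D GGML_CUDA=1) possible.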