~ / cmdr2

projects: freebird, easy diffusion

hacks: carbon editor, torchruntime, findstarlink

  • #findstarlink
  • #performance
  • #ops

The migration of findstarlink.com to Cloudflare Pages hit an issue (that I can’t describe here), and I had to roll it back for “reasons”. It would’ve been a nice cost-saver, but for now the site will stay on S3. Still, the overall infrastructure of findstarlink (its various components) is now quite streamlined, and pleasant to develop for again. I also hit an issue when trying to optimize the loading time of findstarlink.com’s homepage on slow internet connections. On such connections, it takes a long time to download and parse cities.js (600 KB uncompressed, 300 KB compressed), and the UI thread is blocked while that’s happening (often for 10+ seconds).

  • #stable-diffusion
  • #c++
  • #cuda
  • #easydiffusion
  • #lab
  • #performance
  • #featured

// Cross-posted from Easy Diffusion’s blog. tl;dr: Today, I worked on using stable-diffusion.cpp in a simple C++ program, both as a linked library and by compiling sd.cpp from scratch (with and without CUDA). The intent was to get a tiny and fast-starting executable UI for Stable Diffusion working. Also, ChatGPT is very helpful!

Part 1: Using sd.cpp as a library

First, I tried calling the stable-diffusion.cpp library, via dynamic linking, from a simple C++ program (which just loads the model and renders an image). That worked: its performance was the same as the example sd.exe CLI, and it detected and used the GPU correctly.

  • #easydiffusion
  • #ai
  • #lab
  • #performance
  • #featured

// Cross-posted from Easy Diffusion’s blog. tl;dr: Explored a possible optimization for Flux with diffusers when using enable_sequential_cpu_offload(). It did not work. While trying to use Flux (nearly 22 GB of weights) with diffusers on a 12 GB graphics card, I noticed that it barely used any GPU memory when using enable_sequential_cpu_offload(). And it was super slow. It turns out that the largest module in Flux’s transformer model is around 108 MB, so because diffusers streams modules to the GPU one at a time, the peak VRAM usage never rose above a few hundred MB.
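
The arithmetic behind that observation can be sketched roughly as follows. This is a back-of-the-envelope illustration, not how diffusers measures anything: the function names are hypothetical, the module list is made up apart from the ~108 MB largest module quoted above, and it counts weights only (activations and other buffers would account for the rest of the "few hundred MB").

```python
# Rough sketch: why one-module-at-a-time streaming keeps VRAM usage tiny.
# Only the 108 MB and 12 GB figures come from the post; the other module
# sizes below are invented for illustration.

GPU_VRAM_MB = 12 * 1024      # 12 GB graphics card
LARGEST_MODULE_MB = 108      # largest module in the transformer (from the post)


def peak_resident_weights_mb(module_sizes_mb):
    """With one module resident at a time, the peak weight footprint
    on the GPU equals the size of the largest single module."""
    return max(module_sizes_mb)


def vram_utilisation(peak_mb, vram_mb):
    """Fraction of the card's memory occupied at the peak."""
    return peak_mb / vram_mb


if __name__ == "__main__":
    # Hypothetical per-module sizes in MB; only the 108 is grounded in the post.
    modules = [54, LARGEST_MODULE_MB, 27, 54, 13]
    peak = peak_resident_weights_mb(modules)
    print(f"peak resident weights: {peak} MB")
    print(f"share of a 12 GB card: {vram_utilisation(peak, GPU_VRAM_MB):.1%}")
```

Under these assumptions the weights resident on the GPU never exceed about 1% of the card at any moment, which matches the "barely used any GPU memory" observation: most of the time goes into transferring modules rather than computing with them.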