~ / cmdr2

projects: freebird, easy diffusion

hacks: carbon editor, torchruntime, findstarlink

  • #zed
  • #vscode
  • #ai

Switched to Zed from VS Code. It’s really quite cool, mainly in terms of RAM usage and startup time. For my projects, a single VS Code window consumes around 2 GB of RAM, while the same project in Zed consumes around 90 MB. It really is quite insane. And Zed’s cold-start bootup latency is around 3 seconds (for me), compared to around 10-12 seconds in VS Code (before it’s ready to use). I don’t use a lot of extensions in VS Code.

  • #flare
  • #carbon
  • #ai

Bah. I just lost all of the work that I did yesterday. I’ve been using a new text editor over the past few months, which I “vibe-coded” for myself using AI. I never really reviewed its code, and it’s worked fairly well so far. Yes, it is vibe-coded in the full sense of the word, and now the bill has come due. Today, a weird race-condition bug in the editor caused an open file to get deleted (along with my work).

  • #fabricator
  • #ai

the concept of batched run might not make sense anymore in Fabricator, since copilot is moving away from request-based billing to token-based. but many of the other providers still have some concept of subscription billing, and I suspect that we might see subscriptions coming back (since subscriptions are good for business, i.e. negative working capital). once the current financial pressure of inference eases a bit. so maybe I’ll keep the batching code intact, but stop using it for now in Fabricator until it makes sense. I’ll also have to start thinking about cleaning up my inputs, and the reasoning level that’s used (higher produces more intermediate tokens, which may not always be necessary). for now, I don’t think I need to react right away, especially since my AI costs are within my monthly budget. So I can worry about efficiency and optimization later, as necessary, and continue to focus on increasing throughput on my task backlog.

  • #ai
  • #agents

The models powering coding agents currently feel more like fuzzy function calls, or Q&A bots. For more complex tasks, it would be better if they (ironically) behave more like chat, where they refine their understanding (and mine too) with follow-up questions and discussion, rather than being biased towards “answering”/“solving” in the very next reply. For e.g. when talking to a freelancer, we’d expect them to ask follow-up questions and clarify the requirements until we’re both sure that we’ve really understood the task. Or maybe even clarify stuff while implementing. “Plan mode” is an okay approximation (especially if you explicitly ask it to list questions for me). But that’s a workaround - the model is not explicitly post-trained/architected for dialogue. And doesn’t come into play during implementation.

  • #gpu
  • #ai
  • #sdkit

A possible intuition for understanding GPU memory hierarchy (and the performance penalty for data transfer between various layers) is to think of it like a manufacturing logistics problem: CPU (host) to GPU (device) is like travelling overnight between two cities. The CPU city is like the “headquarters”, and contains a mega-sized warehouse of parts (think football field sizes), also known as ‘Host memory’. Each GPU is like a different city, containing its own warehouse outside the city, also known as ‘Global Memory’. This warehouse stockpiles whatever it needs from the headquarters city (CPU). Each SM/Core/Tile is a factory located in different areas of the city. Each factory contains a small warehouse for stockpiling whatever inventory it needs, also known as ‘Shared Memory’. Each warp is a bulk stamping machine inside the factory, producing 32 items in one shot. There’s a tray next to each machine, also known as ‘Registers’. This tray is used for keeping stuff temporarily for each stamping process. This analogy can help understand the scale and performance penalty for data transfers.

  • #findstarlink
  • #ai
  • #llm

I spent some time today doing support for Freebird, Puppetry and Easy Diffusion. Identified a bug in Freebird (bone axis gizmos aren’t scaling correctly in VR), got annoyed by how little documentation I’ve written for Puppetry’s scripting API, and got reminded about how annoying it is for Easy Diffusion to force-download the poor quality starter model (stock SD 1.4) during installation. The majority of the day was spent in using a local LLM for classifying emails. I get a lot of repetitive emails for FindStarlink - people telling me whether they saw Starlink or not (using the predictions on the website). The first part of my reply is always a boilerplate “Glad you saw it” or “Sorry about that”, followed by email-specific replies. I’d really like the system to auto-fill the first part of the email, if it’s a report about Starlink sighting.

  • #ai
  • #ml
  • #llm

Built two experiments using locally-hosted LLMs. One is a script that lets two bots chat with each other endlessly. The other is a browser bookmarklet that summarizes the selected text in 300 words or less. Both use an OpenAI-compatible API, so they can be pointed at regular OpenAI-compatible remote servers, or your own locally-hosted servers (like LMStudio). Bot Chat Summarize Bookmarklet The bot chat script is interesting, but the conversation starts stagnating/repeating after 20-30 messages. The conversation is definitely very interesting initially. The script lets you define the names and descriptions of the two bots, the scene description, and the first message by the first bot. After that, it lets the two bots talk to each other endlessly.

  • #ai
  • #learning
  • #self-awareness

Today I explored an idea for what might happen if an AI model runs continuously, processing inputs, acting and receiving sensory inputs without interruption. Maybe in a text-adventure game. Instead of responding to isolated prompts, the AI would live in a simulated environment, interacting with its world in real time. The experiment is about observing whether behaviors like an understanding of time, awareness, or even a sense of self could emerge naturally through sustained operation.

  • #easydiffusion
  • #ai
  • #lab
  • #performance
  • #featured

tl;dr: Explored a possible optimization for Flux with diffusers when using enable_sequential_cpu_offload(). It did not work. While trying to use Flux (nearly 22 GB of weights) with diffusers on a 12 GB graphics card, I noticed that it barely used any GPU memory when using enable_sequential_cpu_offload(). And it was super slow. It turns out that the largest module in Flux’s transformer model is around 108 MB, so because diffusers streams modules one-at-a-time, the peak VRAM usage never crossed above a few hundred MBs.