
// Cross-posted from Easy Diffusion’s blog.

This post concludes (for now) my ongoing deep-dive into ML compilers while researching for sdkit v3. At the end, I’ve linked to some of the papers I read on graph execution on GPUs.

Some final takeaways:

  1. ML compilers might break CUDA’s moat (and fix AMD’s ROCm support).
  2. A single compiler is unlikely to fit every scenario.
  3. The scheduler needs to be grounded in truth.
  4. Simulators might be worth exploring more.

ML compilers might break CUDA’s moat (and fix AMD’s ROCm support)

It’s pretty clear that ML compilers are going to be a big deal. NVIDIA’s TensorRT is also an ML compiler, but it only targets their GPUs. Once the generated machine code (from cross-vendor ML compilers) is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.

And thankfully, this will also finally make AMD’s consumer GPUs more accessible to developers, by codifying the immense tribal knowledge around which ROCm versions work on which of those GPUs.

Hand-written kernels could go the way of hand-written assembly code. This was always going to happen eventually, but I think it’s pretty close now.

General-purpose ML compilers are still far from good, but the infrastructure and know-how are finally coming together. I don’t see anything fundamentally blocking that from happening (other than lots of hard work). The good news is that the recent widespread use of ML models (on all kinds of devices), and the combinatorial explosion of operator × data-type × hardware combinations, will naturally force ML compilers to become good.
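To get a feel for that combinatorial explosion, here’s a toy Python sketch - the operator, data-type and hardware lists are just illustrative samples, not an authoritative inventory:

```python
from itertools import product

# Even a tiny, made-up slice of the space multiplies out quickly.
operators = ["matmul", "conv2d", "attention", "layernorm", "softmax"]
dtypes = ["fp32", "fp16", "bf16", "fp8", "int8"]
hardware = ["datacenter_gpu", "consumer_gpu_nvidia", "consumer_gpu_amd",
            "laptop_igpu", "phone_npu"]

combos = list(product(operators, dtypes, hardware))
print(len(combos))  # 5 * 5 * 5 = 125 kernels to hand-tune

# Real frameworks have hundreds of operators, more data types and many
# more hardware targets - far too many kernels to write by hand.
```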

A single compiler is unlikely to fit every scenario

Compiling a graph automatically into executable GPU code is going to be an entire field in itself, with sub-fields specializing in particular aspects.

Reading more papers on GPU scheduling (listed below) further reinforced the view that this is a manufacturing logistics problem (where you need to figure out the best plan for coordinating your manufacturing operations across cities and factories). There’s ample scope for specialization.
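To make that logistics framing concrete, here’s a toy Python sketch of a scheduler - the graph, the per-op cost estimates and the greedy heuristic are all made up for illustration:

```python
# op -> (estimated cost in ms, ops it depends on)
graph = {
    "load":   (1.0, []),
    "conv1":  (4.0, ["load"]),
    "conv2":  (4.0, ["load"]),
    "concat": (0.5, ["conv1", "conv2"]),
    "matmul": (6.0, ["concat"]),
}

def greedy_schedule(graph, num_workers=2):
    """Assign each op to whichever worker (GPU/stream) frees up first."""
    finish = {}                    # op -> finish time
    workers = [0.0] * num_workers  # time at which each worker is free
    plan = []
    remaining = dict(graph)
    while remaining:
        # ops whose dependencies have all been scheduled
        ready = [op for op, (_, deps) in remaining.items()
                 if all(d in finish for d in deps)]
        op = min(ready)  # arbitrary (but deterministic) tie-break
        cost, deps = remaining.pop(op)
        w = min(range(num_workers), key=lambda i: workers[i])
        start = max([workers[w]] + [finish[d] for d in deps])
        finish[op] = start + cost
        workers[w] = finish[op]
        plan.append((op, w, start, finish[op]))
    return plan

for op, worker, start, end in greedy_schedule(graph):
    print(f"{op:6s} on worker {worker}: {start:4.1f} -> {end:4.1f} ms")
```

Real schedulers juggle far more than this (memory limits, transfer costs, kernel launch overheads), which is exactly where the specialization comes in.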

There are many factors to consider, so it’s unlikely that a single compiler can fit every problem. While there are plenty of shared techniques, each application will need to take a “whole-picture” view of its operating constraints while compiling the graph into machine code.

That’s why the idea behind MLIR is pretty useful: different operating constraints can be baked into different ‘dialects’, producing different domain-specific compilers. But it’s still very early days.
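This isn’t MLIR’s actual API, but here’s a toy Python sketch of the underlying idea - shared lowering passes composed with domain-specific ones, producing different compilers for different operating constraints:

```python
# A graph is just a list of op names here; passes transform that list.

def fuse_elementwise(ops):
    """Shared pass: merge runs of adjacent elementwise ops into one kernel."""
    elementwise = {"add", "mul", "relu"}
    out, i = [], 0
    while i < len(ops):
        j = i
        while j < len(ops) and ops[j] in elementwise:
            j += 1
        if j - i >= 2:
            out.append("fused(" + "+".join(ops[i:j]) + ")")
            i = j
        else:
            out.append(ops[i])
            i += 1
    return out

def quantize_int8(ops):
    """Domain-specific pass: an edge/mobile 'dialect' might trade accuracy for speed."""
    return [f"int8[{op}]" for op in ops]

def pipeline(*passes):
    def compile_graph(ops):
        for p in passes:
            ops = p(ops)
        return ops
    return compile_graph

server_compiler = pipeline(fuse_elementwise)                 # keep full precision
mobile_compiler = pipeline(fuse_elementwise, quantize_int8)  # memory/power constrained

ops = ["matmul", "add", "relu", "matmul", "softmax"]
print(server_compiler(ops))  # ['matmul', 'fused(add+relu)', 'matmul', 'softmax']
print(mobile_compiler(ops))  # same graph, with every op lowered to an int8 variant
```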

The scheduler needs to be grounded in truth

It’s also clear that the actual performance numbers from the GPUs matter when building the cost model.

An incorrect or simplistic cost model of the GPU can result in inefficient execution plans. For example, we can end up with unexpected bottlenecks if the generated kernels take longer (or shorter) than their predicted execution time. Driver updates can also be a source of surprises - for example, changes in cache eviction policies, warp scheduling algorithms, etc.

Taking the manufacturing logistics analogy, you need to be aware of what’s actually happening in the factories, regardless of the ‘ideal’ manufacturing plan.

This is well understood - that’s why autotuners are widely used. But a bunch of papers downplay or ignore these factors, presumably because they’re looking at just one slice of the overall problem (e.g. graph optimization). So it’s important not to optimize a graph while ignoring the hardware - for example, by fusing operations into kernels that are too big or too small for a particular device (in the context of the overall graph).
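As a minimal sketch of what “grounded in truth” looks like in practice, here’s a toy autotuner in Python. It uses a blocked NumPy matmul as a stand-in for a GPU kernel, and the candidate block sizes are arbitrary - the point is simply that the winner is chosen by measuring, not by predicting:

```python
import time
import numpy as np

def blocked_matmul(A, B, block):
    """Multiply A @ B one (block x block) tile at a time."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, block):
        for k in range(0, n, block):
            for j in range(0, n, block):
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
    return C

def measure(fn, repeats=3):
    """Best-of-N wall-clock time in seconds, on the actual hardware."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

n = 512
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

timings = {block: measure(lambda b=block: blocked_matmul(A, B, b))
           for block in (32, 64, 128, 256)}
best_block = min(timings, key=timings.get)
print({b: f"{t * 1e3:.1f} ms" for b, t in timings.items()}, "-> best:", best_block)
```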

Simulators might be worth exploring more

A GPU simulator would model the compute cores, the memory hierarchy (sizes and performance), memory transfer dynamics, etc. for each given GPU model. While it wouldn’t be suitable for cycle-accurate performance prediction, it could help find a decent first-approximation execution plan, which can then be tuned just-in-time using autotuners.

It may not even need to actually run the code, as long as it produces correct tensor shapes - because the goal is to model the performance characteristics, not to emulate a GPU in software.
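Here’s a minimal sketch of that, assuming a roofline-style model: each op’s time is estimated from its FLOPs and bytes moved, and whichever resource (compute or memory bandwidth) is the bottleneck sets the cost. The GPU specs and the workload numbers below are rough, made-up values for illustration:

```python
# name -> (peak compute in TFLOP/s, memory bandwidth in GB/s) - illustrative specs
GPUS = {
    "datacenter_gpu": (900.0, 3300.0),
    "consumer_gpu":   (160.0, 1000.0),
}

# (op name, GFLOPs, GB moved) - a made-up transformer-ish workload
GRAPH = [
    ("attention", 40.0, 2.0),
    ("mlp",       80.0, 1.5),
    ("layernorm",  0.1, 0.5),
]

def estimate_ms(gflops, gbytes, peak_tflops, bandwidth_gbps):
    compute_ms = gflops / (peak_tflops * 1000.0) * 1000.0  # GFLOPs / (GFLOP/s) -> ms
    memory_ms = gbytes / bandwidth_gbps * 1000.0           # GB / (GB/s) -> ms
    return max(compute_ms, memory_ms)  # the slower resource is the bottleneck

for gpu, (tflops, bandwidth) in GPUS.items():
    per_op = {op: estimate_ms(gf, gb, tflops, bandwidth) for op, gf, gb in GRAPH}
    print(f"{gpu}: ~{sum(per_op.values()):.2f} ms estimated per step")
```

Notice that nothing here runs the actual kernels - it’s just enough structure to rank candidate execution plans, which an autotuner can then refine on the real device.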

A simulator (with a library covering lots of popular GPUs) might also help train ML schedulers by simulating a diverse range of operating constraints. Like a “GPU Dojo”.

There have been attempts like GPGPU-Sim, Accel-Sim and MGPUSIM, and it is a hard problem. Again, this isn’t a new idea - CPU compilers have hand-written models of various hardware targets.

A simulator will never be perfect - ask any engineer in Formula 1. But I don’t think any Formula 1 team today would get rid of their simulation software just because they can’t model reality to 100% accuracy.

Random ideas

Random ideas that may or may not be viable:

Some interesting papers