L3 - ML Graph & Compilation
L3 covers graph representation, lowering, optimization, and compilation. This layer transforms model programs into forms that can execute efficiently on specific runtimes and hardware.
- Graph IR
- Operator fusion
- Lowering
- Runtime targets
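To make the first two topics concrete, here is a minimal sketch of a graph IR and an operator fusion pass. The `Node` class, the op names, and the `fuse_matmul_add` function are all hypothetical and chosen for illustration; they do not mirror the IR of any project listed on this page. The pass rewrites a `matmul` followed by an `add` into one fused node, but only when nothing else consumes the `matmul` result.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One operation in a toy graph IR (hypothetical, for illustration only)."""
    name: str                      # name of the value this node produces
    op: str                        # operation kind, e.g. "matmul"
    inputs: list[str] = field(default_factory=list)  # names of consumed values


def fuse_matmul_add(nodes: list[Node]) -> list[Node]:
    """Fuse matmul -> add pairs when the matmul result has exactly one consumer.

    Assumes `nodes` is in topological order (producers before consumers).
    """
    # Count how many times each value is consumed.
    uses: dict[str, int] = {}
    for node in nodes:
        for src in node.inputs:
            uses[src] = uses.get(src, 0) + 1

    by_name = {node.name: node for node in nodes}
    fused_away: set[str] = set()
    rewritten: list[Node] = []

    for node in nodes:
        replacement = node
        if node.op == "add":
            for idx, src in enumerate(node.inputs):
                producer = by_name.get(src)
                if producer and producer.op == "matmul" and uses[src] == 1:
                    others = node.inputs[:idx] + node.inputs[idx + 1:]
                    # Keep the add's name so downstream consumers still resolve.
                    replacement = Node(node.name, "fused_matmul_add",
                                       producer.inputs + others)
                    fused_away.add(src)
                    break
        rewritten.append(replacement)

    # Drop the matmul nodes that were absorbed into fused nodes.
    return [node for node in rewritten if node.name not in fused_away]


graph = [
    Node("x", "input"),
    Node("w", "param"),
    Node("b", "param"),
    Node("mm", "matmul", ["x", "w"]),
    Node("y", "add", ["mm", "b"]),
    Node("out", "relu", ["y"]),
]
fused = fuse_matmul_add(graph)
print([node.op for node in fused])
```

The single-consumer check is the design point worth noticing: if another node also read `mm`, fusing it away would remove a value that the rest of the graph still needs.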
What belongs here
L3 is not the model weights themselves, and it is not the serving engine. It is the translation layer that decides how a model computation graph can be represented, optimized, and mapped to execution backends.
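As a rough illustration of that translation role, the sketch below lowers high-level ops into primitives and then assigns each primitive to a backend. The rule table, the backend names, and the kernel sets are invented for this example; real compilers use far richer rewrite systems and cost models.

```python
from __future__ import annotations

# Hypothetical lowering rules: high-level op -> sequence of primitive ops.
LOWERING_RULES = {
    "linear": ["matmul", "add"],
    "softmax": ["exp", "reduce_sum", "div"],
}

# Hypothetical kernel coverage per backend.
BACKEND_KERNELS = {
    "accelerator": {"matmul", "add", "exp", "div"},
    "cpu": {"matmul", "add", "exp", "reduce_sum", "div"},
}


def lower(ops: list[str]) -> list[str]:
    """Expand each high-level op into primitives; unknown ops pass through."""
    primitives: list[str] = []
    for op in ops:
        primitives.extend(LOWERING_RULES.get(op, [op]))
    return primitives


def assign_backends(primitives: list[str],
                    preference: list[str]) -> list[tuple[str, str]]:
    """Map each primitive to the first preferred backend that has a kernel."""
    plan: list[tuple[str, str]] = []
    for op in primitives:
        for backend in preference:
            if op in BACKEND_KERNELS[backend]:
                plan.append((op, backend))
                break
        else:
            # No backend in the preference list can run this primitive.
            raise ValueError(f"no backend supports op {op!r}")
    return plan


program = ["linear", "softmax"]
primitives = lower(program)
plan = assign_backends(primitives, preference=["accelerator", "cpu"])
for op, backend in plan:
    print(f"{op:>10} -> {backend}")
```

In this toy setup `reduce_sum` falls back to `cpu` because the hypothetical accelerator lacks a kernel for it, which is the kind of representation-to-backend decision this layer owns.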
Representative projects
| Project | Why it might fit | Adjacent layers |
|---|---|---|
| ONNX Runtime | Runtime and tooling around the ONNX model representation and execution providers. | L3 graph, L7 inference |
| XLA (OpenXLA) | Compiler technology for optimizing ML workloads across backends. | L2 runtime, L3 compilation |
| Apache TVM | Compiler stack for deep learning models across hardware targets. | L1 hardware, L3 compilation |
| MLIR | Compiler infrastructure useful for representing and lowering domain-specific computations. | L3 graph, L4 numeric |
| TensorRT-LLM | Optimization and runtime path for LLM inference on NVIDIA platforms. | L3 compilation, L7 serving |
| OpenVINO | Toolkit for optimizing and deploying AI inference across Intel hardware. | L3 optimization, L7 inference |
Boundary questions
- Does an inference framework belong in L3 when its main value is compilation, and in L7 when its main value is serving?
- Should model exchange formats be separated from compiler execution backends?
- How should the layer handle dynamic agent graphs that are not pure tensor graphs?
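One possible way to frame the last question is partitioning: compile the tensor-only stretches of a workload and leave everything else to a host runtime. The sketch below assumes a simplified linear sequence of steps rather than a general graph, and the `TENSOR_OPS` set and step names such as `call_tool` are hypothetical.

```python
from __future__ import annotations

from itertools import groupby

# Assumed set of ops a graph compiler could handle (illustrative only).
TENSOR_OPS = {"matmul", "add", "relu", "softmax"}


def partition(steps: list[str]) -> list[tuple[str, list[str]]]:
    """Group consecutive steps into 'compiled' or 'host' segments."""
    return [
        ("compiled" if is_tensor else "host", list(group))
        for is_tensor, group in groupby(steps, key=lambda op: op in TENSOR_OPS)
    ]


steps = ["matmul", "softmax", "call_tool", "parse_json", "matmul", "relu"]
segments = partition(steps)
for kind, ops in segments:
    print(kind, ops)
```

Under this framing only the `compiled` segments would fall inside L3, while the `host` segments would belong to whichever layer owns agent orchestration.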
Signals to watch
- More attention to portable intermediate representations for AI workloads.
- Compiler support for speculative decoding, mixture-of-experts routing, and sparse computation.
- Cross-vendor pressure for optimization paths that do not lock models to one accelerator family.