L4 - Numeric & Quantization
L4 covers numeric formats, quantization, sparsity, calibration, precision tradeoffs, and compression strategies. This layer often determines whether a model is feasible to run in a given cost, memory, latency, or device envelope.
L4 Numeric & Quantization
- Precision
- Compression
- Calibration
- Quality tradeoffs
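To make the memory side of that feasibility envelope concrete, the following is a rough back-of-the-envelope sketch of how bit width changes the size of a model's weights. The format table, the helper name `weight_memory_gib`, and the 7B parameter count are illustrative assumptions, and the estimate deliberately ignores activations, KV cache, quantization scales, and runtime overhead.

```python
# Rough weight-only memory estimate (illustrative assumption, not a sizing tool).
# Ignores activations, KV cache, per-group scales/zero-points, and overhead.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_memory_gib(num_params: int, fmt: str) -> float:
    """Return the approximate size of the weights in GiB for a numeric format."""
    bits = BITS_PER_WEIGHT[fmt]
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>5}: {weight_memory_gib(params, fmt):6.2f} GiB")
```

Under these assumptions, halving the bit width halves the weight footprint, which is often the difference between fitting on a given device and not fitting at all.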
What belongs here
L4 is where architecture becomes arithmetic. It describes how tensors are represented and transformed, and how those choices affect quality, speed, cost, memory, and compatibility.
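As one illustration of how a representation choice trades quality for size, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a generic textbook scheme, not the method of any project listed on this page; the function names and the sample weights are assumptions made for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.02, -0.51, 0.33, 1.27, -0.004]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Worst-case reconstruction error; bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

In this sketch the smallest weight rounds to code 0, so it is lost entirely after dequantization. That is the kind of quality effect that calibration and finer-grained (per-channel or per-group) scales are meant to reduce.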
Representative projects
| Project | Why it might fit | Adjacent layers |
|---|---|---|
| bitsandbytes | 8-bit and 4-bit quantization and 8-bit optimizer tooling commonly used in transformer workflows. | L4 numeric, L6 models |
| llama.cpp GGUF | Local inference ecosystem with quantized model formats and CPU/GPU execution paths. | L4 quantization, L7 inference |
| AutoGPTQ | GPTQ quantization tooling for transformer models. | L4 quantization, L6 weights |
| AutoAWQ | Activation-aware quantization tooling for LLM deployment. | L4 quantization, L7 serving |
| NVIDIA TensorRT quantization | Production optimization path for lower-precision inference on NVIDIA platforms. | L3 compilation, L7 inference |
| vLLM quantization support | Serving-time support for quantized model variants. | L4 numeric, L7 serving |
Boundary questions
- Should quantized model files be treated as L4 artifacts, L6 model artifacts, or both?
- When quantization is integrated into a serving engine, is it still a separable L4 concern, or does it become part of L7?
- How should AILIS represent accuracy, safety, and governance changes caused by numeric choices?
Signals to watch
- Wider use of FP8, INT4, and mixed-precision inference.
- Quantization becoming part of model release metadata.
- Evaluation suites that treat quantization as a behavioral change rather than a pure optimization.
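The last signal, treating quantization as a behavioral change, can be sketched as a prediction-agreement check between a reference model and its quantized variant. The example below is a toy: a random linear classifier stands in for a model, and `quantize_dequantize`, `predict`, and `agreement_rate` are hypothetical helpers, not part of any evaluation suite named here.

```python
import random


def quantize_dequantize(row, bits):
    """Symmetric fake-quantization of one weight row to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def predict(weights, x):
    """Return the index of the highest-scoring class for input x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))


def agreement_rate(reference, candidate, inputs):
    """Fraction of inputs on which both weight sets predict the same class."""
    same = sum(predict(reference, x) == predict(candidate, x) for x in inputs)
    return same / len(inputs)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
dim, classes = 16, 4
reference = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(classes)]
inputs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

for bits in (8, 4, 2):
    candidate = [quantize_dequantize(row, bits) for row in reference]
    rate = agreement_rate(reference, candidate, inputs)
    print(f"{bits}-bit agreement with reference: {rate:.3f}")
```

An agreement rate below 1.0 means some inputs receive a different answer after quantization, even if aggregate accuracy looks unchanged. A real evaluation would use task-specific and safety-relevant prompts rather than random vectors.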