
L4 - Numeric & Quantization

L4 covers numeric formats, quantization, sparsity, calibration, precision tradeoffs, and compression strategies. This layer often determines whether a model is feasible to run in a given cost, memory, latency, or device envelope.
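The memory side of that envelope can be estimated with simple arithmetic. The sketch below is an illustration, not a sizing tool: the 7-billion-parameter count is a hypothetical figure, and the estimate covers weight storage only, ignoring quantization scales and zero points, activations, and any KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit width.

    Counts raw weight bits only; per-group scales, zero points,
    activations, and KV cache are deliberately left out.
    """
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical model size, chosen only for illustration.
PARAMS = 7_000_000_000

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"{name:>10}: {weight_memory_gib(PARAMS, bits):6.2f} GiB")
```

Halving the bit width halves the weight footprint, which is often the difference between fitting on a given device and not fitting at all.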

L4 Numeric & Quantization
  1. Precision
  2. Compression
  3. Calibration
  4. Quality tradeoffs

What belongs here

L4 is where architecture becomes arithmetic. It describes how tensors are represented and transformed, and how those choices affect quality, speed, cost, memory, and compatibility.
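As a concrete instance of representing and transforming tensors, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The function names and sample weights are assumptions made for illustration; real toolchains typically add per-channel or per-group scales, calibration data, and hardware-specific packing.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the signed range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]


# Hypothetical weight values, chosen only for illustration.
weights = [0.02, -1.27, 0.635, 0.3, -0.004]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:    ", q)
print("scale:    ", scale)
print("max error:", max_error)
```

The round-trip error is bounded by half the scale, so a single outlier that inflates the scale reduces precision for every other value in the tensor. That is one reason the choice of scale granularity and calibration affects quality.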

Representative projects

Project | Why it might fit | Adjacent layers
bitsandbytes | Quantization and optimizer tooling commonly used in transformer workflows. | L4 numeric, L6 models
llama.cpp GGUF | Local inference ecosystem with quantized model formats and CPU/GPU execution paths. | L4 quantization, L7 inference
AutoGPTQ | GPTQ quantization tooling for transformer models. | L4 quantization, L6 weights
AutoAWQ | Activation-aware quantization tooling for LLM deployment. | L4 quantization, L7 serving
NVIDIA TensorRT quantization | Production optimization path for lower-precision inference on NVIDIA platforms. | L3 compilation, L7 inference
vLLM quantization support | Serving-time support for quantized model variants. | L4 numeric, L7 serving

Boundary questions

  • Should quantized model files be treated as L4 artifacts, L6 model artifacts, or both?
  • When quantization is integrated into a serving engine, is the concern still separable?
  • How should AILIS represent accuracy, safety, and governance changes caused by numeric choices?

Signals to watch

  • Wider use of FP8, INT4, and mixed-precision inference.
  • Quantization becoming part of model release metadata.
  • Evaluation suites that treat quantization as a behavioral change rather than a pure optimization.
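
To make the last signal concrete, the sketch below compares a baseline model and a quantized variant on their outputs rather than on speed or size. Everything here is hypothetical: `baseline_model` and `quantized_model` are toy stand-ins, not real inference calls, and exact string match is only one of several possible agreement criteria.

```python
from typing import Callable, List, Sequence, Tuple


def compare_behavior(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    prompts: Sequence[str],
) -> Tuple[float, List[str]]:
    """Return the agreement rate and the prompts where outputs differ."""
    disagreements = [p for p in prompts if baseline(p) != candidate(p)]
    rate = 1 - len(disagreements) / len(prompts)
    return rate, disagreements


# Toy stand-ins for a full-precision model and its quantized variant.
def baseline_model(prompt: str) -> str:
    return prompt.strip().lower()


def quantized_model(prompt: str) -> str:
    # Diverges from the baseline on longer inputs to simulate drift.
    return prompt.strip().lower()[:12]


prompts = ["Hello", "What is FP8?", "Summarize this paragraph"]
rate, diffs = compare_behavior(baseline_model, quantized_model, prompts)

print(f"agreement rate: {rate:.2f}")
print("differing prompts:", diffs)
```

Reporting the differing prompts alongside the rate keeps the comparison reviewable: a change confined to a few inputs can still matter if those inputs are safety-relevant.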