L4 - Numeric & Quantization

L4 covers numeric formats, quantization, sparsity, calibration, precision tradeoffs, and compression strategies. Choices at this layer often determine whether a model can run at all within a given cost, memory, latency, or device envelope.
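To make the memory side of that envelope concrete, here is a minimal sketch of the arithmetic. The helper name `weight_memory_gib` and the 7-billion-parameter model size are illustrative assumptions, not taken from any project listed on this page. The estimate counts weight storage only; real quantized formats also store scales and zero points, and inference needs further memory for activations and caches.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit width.

    Ignores quantization metadata (scales, zero points), activations,
    and any runtime caches, so treat the result as a lower bound.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical model size, chosen only for illustration.
PARAMS = 7_000_000_000

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"{name:>10}: {weight_memory_gib(PARAMS, bits):6.2f} GiB")
```

Halving the bit width halves the weight footprint, which is often the difference between fitting on a given device and not fitting at all.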

Tags: Precision, Compression, Calibration, Quality tradeoffs

What belongs here

L4 is where architecture becomes arithmetic. It describes how tensors are represented and transformed, and how those choices affect quality, speed, cost, memory, and compatibility.
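As one illustration of "how tensors are represented and transformed", the sketch below shows per-tensor symmetric INT8 quantization in plain Python. It is a simplified teaching example, not the method used by any specific tool named on this page; the function names and sample weights are assumptions made for the example. Production tools typically use per-channel or per-group scales and calibration data.

```python
from typing import List, Tuple


def quantize_symmetric_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero or empty tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]


# Hypothetical weights, for illustration only.
weights = [0.02, -0.51, 0.33, 1.27, -1.0]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest grid point bounds the error by half a step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

The single `scale` value is the representation choice: a larger value range or a smaller bit width makes each step coarser, which is where the quality, memory, and speed tradeoffs described above come from.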

Representative projects

Project	Why it might fit	Adjacent layers
bitsandbytes	Quantization and optimizer tooling commonly used in transformer workflows.	L4 numeric, L6 models
llama.cpp GGUF	Local inference ecosystem with quantized model formats and CPU/GPU execution paths.	L4 quantization, L7 inference
AutoGPTQ	GPTQ quantization tooling for transformer models.	L4 quantization, L6 weights
AutoAWQ	Activation-aware quantization tooling for LLM deployment.	L4 quantization, L7 serving
NVIDIA TensorRT quantization	Production optimization path for lower-precision inference on NVIDIA platforms.	L3 compilation, L7 inference
vLLM quantization support	Serving-time support for quantized model variants.	L4 numeric, L7 serving

Boundary questions

Should quantized model files be treated as L4 artifacts, L6 model artifacts, or both?
When quantization is integrated into a serving engine, is the concern still separable?
How should AILIS represent accuracy, safety, and governance changes caused by numeric choices?

Signals to watch

Wider use of FP8, INT4, and mixed-precision inference.
Quantization becoming part of model release metadata.
Evaluation suites that treat quantization as a behavioral change rather than a pure optimization.
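The last signal, treating quantization as a behavioral change, can be sketched as a comparison of predictions before and after quantization rather than a comparison of speed or size. The example below is a toy under stated assumptions: a single random linear layer stands in for a model, quantization is simulated by rounding weights onto a signed integer grid, and all names (`quantize_dequantize`, `agreement_rate`) are hypothetical. Real evaluation suites would use task benchmarks and actual quantized checkpoints.

```python
import random
from typing import List, Sequence

Matrix = List[List[float]]


def quantize_dequantize(w: Matrix, bits: int) -> Matrix:
    """Simulate symmetric per-tensor quantization at the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in w for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [
        [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]
        for row in w
    ]


def predict(w: Matrix, x: Sequence[float]) -> int:
    """Return the index of the largest logit of a single linear layer."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return max(range(len(logits)), key=logits.__getitem__)


def agreement_rate(w_ref: Matrix, w_test: Matrix, inputs: List[List[float]]) -> float:
    """Fraction of inputs on which both weight sets pick the same class."""
    same = sum(predict(w_ref, x) == predict(w_test, x) for x in inputs)
    return same / len(inputs)


# Seeded toy data so the comparison is repeatable.
rng = random.Random(0)
w_fp = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
inputs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]

for bits in (8, 4, 2):
    rate = agreement_rate(w_fp, quantize_dequantize(w_fp, bits), inputs)
    print(f"INT{bits}: top-1 agreement with full precision = {rate:.3f}")
```

Reporting agreement with the full-precision reference, alongside task accuracy, makes visible the cases where a quantized variant answers differently even when aggregate scores look similar.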
