L5 - Tokenization & Encoders
L5 covers the conversion of raw inputs into model-facing representations: tokenization, image encoders, audio encoders, patchification, embeddings, and modality-specific preprocessing. It is where the world becomes model input.
L5 Tokenization & Encoders
- Text tokens
- Vision patches
- Audio frames
- Embedding spaces
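The vision-patch and audio-frame items above come down to simple counting, which the following minimal sketch illustrates. The function names and the specific sizes (a 224x224 image with 16-pixel patches, 30 seconds of 16 kHz audio with a 160-sample hop) are illustrative assumptions, not the configuration of any particular encoder.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Count non-overlapping square patches; assumes dimensions divide evenly."""
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (height // patch) * (width // patch)


def num_frames(num_samples: int, hop: int) -> int:
    """Count frames when one frame is emitted every `hop` samples."""
    return num_samples // hop


# Hypothetical sizes chosen for illustration only.
patches = num_patches(224, 224, 16)      # a 14 x 14 grid of patches
frames = num_frames(16_000 * 30, 160)    # 30 s of 16 kHz audio, 10 ms hop

print(patches, frames)
```

Each patch or frame then becomes one position in the encoder's input sequence, which is why these counts matter for the same budgeting reasons as text tokens.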
What belongs here
L5 sits below prompt construction in the stack. It does not decide which context is selected, but it does define how the selected context is represented and how much of it fits in a model's input.
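The link between representation and capacity can be sketched with two toy tokenization schemes. This is a minimal illustration, assuming a hypothetical 16-token budget and stand-in tokenizers (whitespace splitting and raw UTF-8 bytes); real subword tokenizers fall somewhere between these two extremes.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Toy word-level scheme: one token per whitespace-separated word."""
    return text.split()


def byte_tokens(text: str) -> list[int]:
    """Toy byte-level scheme: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def fits(token_count: int, budget: int) -> bool:
    """Return True when the token count is within the budget."""
    return token_count <= budget


context = "Tokenizers decide how much context fits."
budget = 16  # hypothetical context budget, in tokens

word_count = len(whitespace_tokens(context))
byte_count = len(byte_tokens(context))

print(word_count, fits(word_count, budget))
print(byte_count, fits(byte_count, budget))
```

The same selected text fits under one scheme and overflows under the other, even though the selection step upstream did nothing differently.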
Representative projects
| Project | Why it might fit | Adjacent layers |
|---|---|---|
| Hugging Face Tokenizers | Fast tokenization library used across transformer workflows. | L5 tokenization, L6 model compatibility |
| OpenAI tiktoken | Tokenizer library used for OpenAI model token accounting and encoding. | L5 tokenization, L8 prompting |
| SentencePiece | Unsupervised text tokenizer and detokenizer commonly used in NLP models. | L5 tokenization, L6 models |
| CLIP | Vision-text representation model that illustrates multimodal encoding boundaries. | L5 encoders, L6 architecture |
| Whisper | Speech recognition model with audio preprocessing and encoding concerns. | L5 audio, L16 applications |
| SigLIP | Vision-language encoder family useful for multimodal retrieval and classification. | L5 encoders, L9 retrieval |
Boundary questions
- Does an embedding model belong here, in L6 model architecture, or in L9 retrieval when it is used for search?
- Should token counting be modeled as L5 mechanics or L8 context budgeting?
- How should AILIS represent multimodal systems where each modality has a different encoder stack?
Signals to watch
- Longer-context models making tokenization less visible but still economically important.
- Multimodal encoders becoming more composable across products.
- Tokenizer mismatch causing retrieval, evaluation, or governance failures.
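The tokenizer-mismatch signal above can be made concrete with a toy example. This sketch assumes two hypothetical word-level vocabularies built from the same words in a different order; the helper names `build_vocab`, `encode`, and `decode` are invented for illustration and do not correspond to any library's API.

```python
def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign IDs in order of first appearance; ID 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map each word to its ID, falling back to the unknown ID."""
    return [vocab.get(word, 0) for word in text.lower().split()]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map IDs back to words using the given vocabulary."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse.get(i, "<unk>") for i in ids)


# Same words, different ordering, so the ID assignments differ.
index_vocab = build_vocab(["audio frames", "vision patches"])
query_vocab = build_vocab(["vision patches", "audio frames"])

ids = encode("vision patches", index_vocab)
round_trip = decode(ids, index_vocab)   # decoded with the matching vocabulary
mismatched = decode(ids, query_vocab)   # decoded with the wrong vocabulary

print(ids, round_trip, mismatched)
```

The mismatched decode raises no error and returns plausible-looking text, which is one reason this class of failure can go unnoticed in retrieval or evaluation pipelines.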