
L5 - Tokenization & Encoders

L5 covers the conversion of raw inputs into model-facing representations: tokenization, image encoders, audio encoders, patchification, embeddings, and modality-specific preprocessing. It is where the world becomes model input.

L5 Tokenization & Encoders
  1. Text tokens
  2. Vision patches
  3. Audio frames
  4. Embedding spaces
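
Each of the components above turns a raw input into a sequence of units. As a minimal sketch, assuming non-overlapping square patches for images and fixed-length, fixed-hop windows with no padding for audio, the resulting sequence lengths can be estimated as shown here. The function names and the example sizes (a 224x224 image with 16-pixel patches, one second of 16 kHz audio with 400-sample windows and a 160-sample hop) are illustrative assumptions, not the preprocessing of any specific project on this page.

```python
from typing import List, Sequence


def patchify(image: Sequence[Sequence[int]], patch: int) -> List[List[int]]:
    """Split a 2D grid into non-overlapping patch x patch tiles.

    Each tile is flattened in row-major order; tiles are emitted left to
    right, top to bottom. Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return [
        [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def patch_count(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patches in a height x width image."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def frame_count(num_samples: int, window: int, hop: int) -> int:
    """Number of full windows of `window` samples taken every `hop` samples.

    Assumes no padding: a trailing partial window is dropped.
    """
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


if __name__ == "__main__":
    print(patch_count(224, 224, 16))      # 196 patches
    print(frame_count(16_000, 400, 160))  # 98 frames
```

Real encoders often pad, overlap, or downsample further, so these counts are a lower-level intuition rather than a prediction for any particular model.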

What belongs here

L5 sits below prompt construction. It does not decide which context is selected, but it does define how the selected context is represented and how much of it fits in a model's input.
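
To illustrate how representation determines what fits, the sketch below counts tokens for the same text under two toy tokenizers and checks each count against a budget. Both tokenizers (`char_tokenize`, `word_tokenize`), the helper names, and the budget of 10 are assumptions made for this example; production tokenizers use learned subword vocabularies and will give different counts.

```python
from typing import Callable, List

# A tokenizer here is any function from text to a list of string tokens.
Tokenizer = Callable[[str], List[str]]


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer: one token per character (spaces included)."""
    return list(text)


def word_tokenize(text: str) -> List[str]:
    """Toy tokenizer: one token per whitespace-separated word."""
    return text.split()


def fits(text: str, tokenize: Tokenizer, budget: int) -> bool:
    """True if the tokenized text is within the token budget."""
    return len(tokenize(text)) <= budget


def truncate_to_budget(text: str, tokenize: Tokenizer, budget: int) -> List[str]:
    """Keep only the first `budget` tokens; later tokens are dropped."""
    return tokenize(text)[:budget]


if __name__ == "__main__":
    text = "tokenization shapes context budgets"
    for name, tok in [("char", char_tokenize), ("word", word_tokenize)]:
        print(name, len(tok(text)), fits(text, tok, budget=10))
```

The same selected context passes the budget under one scheme and fails under the other, which is why the choice of tokenizer constrains the layers above it even though it does not pick the content.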

Representative projects

| Project | Why it might fit | Adjacent layers |
|---|---|---|
| Hugging Face Tokenizers | Fast tokenization library used across transformer workflows. | L5 tokenization, L6 model compatibility |
| OpenAI tiktoken | Tokenizer library used for OpenAI model token accounting and encoding. | L5 tokenization, L8 prompting |
| SentencePiece | Unsupervised text tokenizer and detokenizer commonly used in NLP models. | L5 tokenization, L6 models |
| CLIP | Vision-text representation model that illustrates multimodal encoding boundaries. | L5 encoders, L6 architecture |
| Whisper | Speech recognition model with audio preprocessing and encoding concerns. | L5 audio, L16 applications |
| SigLIP | Vision-language encoder family useful for multimodal retrieval and classification. | L5 encoders, L9 retrieval |

Boundary questions

  • Does an embedding model belong here, in L6 model architecture, or in L9 retrieval when it is used for search?
  • Should token counting be modeled as L5 mechanics or L8 context budgeting?
  • How should AILIS represent multimodal systems where each modality has a different encoder stack?

Signals to watch

  • Longer-context models making tokenization less visible but still economically important.
  • Multimodal encoders becoming more composable across products.
  • Tokenizer mismatch causing retrieval, evaluation, or governance failures.
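
The mismatch signal can be made concrete with a small sketch. It assumes two hypothetical word-level vocabularies (`VOCAB_A`, `VOCAB_B`) that assign different ids to the same words, and shows that ids produced with one are silently misread by the other. The fingerprint check is one possible guard, invented for this example and not a feature of any library listed above.

```python
import hashlib
import json
from typing import Dict, List

# Two hypothetical vocabularies with the same words but different ids.
VOCAB_A: Dict[str, int] = {"cat": 0, "dog": 1, "sat": 2}
VOCAB_B: Dict[str, int] = {"dog": 0, "sat": 1, "cat": 2}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map whitespace-separated words to ids (raises KeyError on unknown words)."""
    return [vocab[word] for word in text.split()]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Map ids back to words using the inverse of the vocabulary."""
    inverse = {index: token for token, index in vocab.items()}
    return " ".join(inverse[i] for i in ids)


def fingerprint(vocab: Dict[str, int]) -> str:
    """Stable SHA-256 hash of the token-to-id mapping, independent of dict order."""
    payload = json.dumps(sorted(vocab.items()), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_compatible(expected_fingerprint: str, vocab: Dict[str, int]) -> None:
    """Refuse to proceed when ids were produced with a different vocabulary."""
    if fingerprint(vocab) != expected_fingerprint:
        raise ValueError("tokenizer mismatch: ids were encoded with another vocabulary")


if __name__ == "__main__":
    ids = encode("cat sat", VOCAB_A)
    print(decode(ids, VOCAB_A))  # round-trips: "cat sat"
    print(decode(ids, VOCAB_B))  # silently wrong: "dog cat"
```

Storing a fingerprint alongside cached token ids or embeddings lets a retrieval or evaluation pipeline fail loudly instead of returning plausible but wrong results.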