The compression pipeline
OpenCompress applies a multi-layer compression pipeline to your input prompts. Each layer targets a different source of token waste, and they compound — the output of one feeds into the next.

Layer 1: Input Pruning
What it does: Removes tokens the model doesn’t need to read. A distilled classifier trained on 105K+ agent conversation samples scores each token by semantic importance. Low-importance tokens — filler words, redundant connectors, verbose formatting — are removed while preserving meaning.

| Metric | Value |
|---|---|
| Token reduction | 40-60% |
| Quality retention | 95%+ cosine similarity |
| Speed | 4-12x faster than LLMLingua-2 |
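Importance-based pruning can be sketched in a few lines. This is a toy illustration, not the OpenCompress implementation: the `importance` function here is a hypothetical stand-in for the distilled classifier, and the filler-word set and threshold are invented for the example.

```python
# Hypothetical stand-in for the distilled per-token importance classifier.
FILLER = {"basically", "just", "really", "very", "actually", "please"}

def importance(token: str) -> float:
    """Score a token's semantic importance (stub: real scorer is a model)."""
    return 0.1 if token.lower() in FILLER else 0.9

def prune(text: str, threshold: float = 0.5) -> str:
    """Drop tokens whose importance score falls below the threshold."""
    kept = [tok for tok in text.split() if importance(tok) >= threshold]
    return " ".join(kept)

print(prune("Basically just summarize the very long report"))
# → summarize the long report
```

A real scorer would run the whole sequence through the classifier in one pass rather than scoring tokens independently, which is how context-dependent redundancy (repeated connectors, verbose formatting) gets caught.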
Layer 2: Dictionary Aliasing
What it does: Replaces repeated phrases with compact aliases. Common multi-token phrases are mapped to short aliases (e.g., §A1). A dictionary header is prepended to the prompt so the model can decode them. This is especially effective for:
- System prompts with repeated instructions
- RAG contexts with recurring entity names
- Tool call schemas with verbose type definitions
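The aliasing step can be sketched as follows. This is a simplified illustration under assumed behavior — counting repeated n-grams, assigning `§A1`-style aliases, and prepending a decode header — not the library's actual algorithm or API; function names and thresholds here are invented.

```python
from collections import Counter

def build_aliases(text: str, n: int = 3, min_count: int = 2) -> dict:
    """Map n-word phrases that repeat at least min_count times to aliases."""
    words = text.split()
    phrases = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    repeated = [p for p, c in phrases.items() if c >= min_count]
    return {p: f"\u00a7A{i + 1}" for i, p in enumerate(repeated)}

def compress(text: str) -> str:
    """Replace repeated phrases with aliases; prepend a decode dictionary."""
    out, used = text, {}
    for phrase, alias in build_aliases(text).items():
        if phrase in out:  # overlapping phrases may already be consumed
            out = out.replace(phrase, alias)
            used[phrase] = alias
    if not used:
        return text
    header = "; ".join(f"{a}={p}" for p, a in used.items())
    return f"[DICT {header}]\n{out}"

print(compress("call the search tool now and call the search tool again"))
```

The header costs tokens once, so aliasing only pays off when a phrase's repetitions outweigh the dictionary entry — which is why it shines on long system prompts and RAG contexts rather than short one-off messages.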