AdaptiVocab
Building Efficient Financial Reasoning Systems with AdaptiVocab and ReasonScore
In this post, we’ll walk through a method for compressing and accelerating domain-specific prompts in large language models (LLMs). By combining the vocabulary-adaptation ideas from the recent paper AdaptiVocab: Enhancing LLM Efficiency in Focused Domains with token-level importance scores (ReasonScore), we’ve built a pipeline that reduces prompt size, improves model focus, and runs efficiently on small infrastructure like Hugging Face Spaces or Raspberry Pi devices.
Why This Matters
Most LLM prompts in the financial domain contain highly repetitive instructions:
- “Summarize the income statement”
- “Compare Q3 to Q4 revenue”
- “Assess risk from the 10-K filing”
These phrases recur across thousands of prompts, yet they tokenize inefficiently. For example:
"Compare net income trend year over year" → 8+ tokens
By merging these into domain-specific vocabulary like:
compare_net_income_yoy → 1 token
…we reduce token count, latency, and cost while improving model context window utilization.
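You can check the count yourself with any Hugging Face tokenizer. The exact split varies by tokenizer; a minimal sketch using TinyLlama’s Llama-style BPE:

```python
from transformers import AutoTokenizer

# Token counts vary by tokenizer; TinyLlama's Llama-style BPE is used here.
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

phrase = "Compare net income trend year over year"
pieces = tok.tokenize(phrase)
print(pieces)       # e.g. ['▁Compare', '▁net', '▁income', ...] (exact split varies)
print(len(pieces))  # ~7 subwords here; 8+ once special tokens are added
```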
The Solution
We built a modular pipeline that:
- Extracts frequent financial phrases from your real-world prompt logs
- Scores tokens using ReasonScore, a lightweight LLM-based importance estimator
- Creates embeddings for new tokens based on their reasoning value
- Injects those embeddings into a base model like Fin-R1 or Mistral
This allows you to compress real prompts without modifying the model architecture, using smart domain-aware initialization instead of random vectors.
System Overview
At the core is a component-based pipeline:
- Prompt logs → n-gram extractor
- AdaptiVocab patcher → phrase merging
- ReasonScore engine → token importance
- Embedding init → weighted combination
- Model patch → tokens added to tokenizer + embeddings
All configuration is managed with Hydra.
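For orientation, a minimal Hydra entrypoint looks like the sketch below. The cfg keys (log_file, top_k) mirror the CLI overrides shown later, but they are illustrative, not the pipeline’s exact schema:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Illustrative keys; the real config schema may differ.
    print(f"Extracting top {cfg.top_k} phrases from {cfg.log_file}")

if __name__ == "__main__":
    main()
```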
Step-by-Step: From Logs to Model Enhancement
1. Extract Phrases
Using generate_adaptivocab.py, we extract the top n-grams from real user instructions.
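generate_adaptivocab.py handles this end to end; the core idea is plain frequency counting, roughly like the sketch below (file name and defaults are illustrative):

```python
from collections import Counter

def top_ngrams(lines, n=3, top_k=50):
    """Return the top_k most frequent word-level n-grams across prompt logs."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top_k)

# Hypothetical log file: one prompt per line.
with open("your_prompts.txt") as f:
    for phrase, freq in top_ngrams(f.readlines(), n=4, top_k=10):
        print(f"{freq:5d}  {phrase}")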
2. Score with ReasonScore
We load a small model (e.g., TinyLlama) and compute ReasonScores from final-layer attention to estimate which tokens matter most for reasoning.
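As a minimal sketch, assume a token’s ReasonScore is the mean attention it receives in the model’s final layer; the model choice and averaging scheme here are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
# "eager" attention so the model can return attention weights.
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")
model.eval()

@torch.no_grad()
def reason_scores(text):
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs, output_attentions=True)
    attn = out.attentions[-1][0]             # final layer: (heads, query, key)
    received = attn.mean(dim=0).mean(dim=0)  # avg over heads, then query positions
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    return dict(zip(tokens, received.tolist()))

print(reason_scores("Compare net income trend year over year"))
```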
3. Initialize Embeddings
We initialize each new token vector as a ReasonScore-weighted sum of its constituent subword embeddings, giving it an informed starting point instead of a random one.
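A sketch of that initialization, assuming scores maps subword tokens to their ReasonScores (as returned by the scorer above); the helper name init_embedding is ours:

```python
import torch

def init_embedding(phrase, tokenizer, embedding_matrix, scores):
    """New token vector = ReasonScore-weighted average of the phrase's
    constituent subword embeddings (a convex combination, not random init)."""
    ids = tokenizer(phrase, add_special_tokens=False)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # Fall back to a uniform weight when a subword has no score.
    weights = torch.tensor([scores.get(t, 1.0) for t in tokens])
    weights = weights / weights.sum()
    vectors = embedding_matrix[ids]          # (num_subwords, hidden_dim)
    return (weights.unsqueeze(1) * vectors).sum(dim=0)
```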
4. Patch Your Model
We expand the tokenizer, resize the embedding layer, and inject each new token + its vector.
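With Hugging Face Transformers, the patch itself is a few lines. The sketch below uses a plain mean init so it stays self-contained; the pipeline would swap in the ReasonScore-weighted initializer from step 3:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_tokens = ["compare_net_income_yoy"]  # merged domain phrase from step 1
tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))  # adds rows for the new tokens

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        # Subwords of the original phrase ("compare net income yoy").
        sub_ids = tok(t.replace("_", " "), add_special_tokens=False)["input_ids"]
        # Plain mean shown here; use the ReasonScore-weighted init in practice.
        emb[tok.convert_tokens_to_ids(t)] = emb[sub_ids].mean(dim=0)
```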
Run the Full Pipeline
python adaptivocab_pipeline.py --config-path . --config-name config
This will:
- Extract top 50 financial phrases
- Compute ReasonScores
- Initialize new embeddings
- Save the patched embedding matrix
Results and Next Steps
By compressing instruction prompts, you:
- Reduce token usage by up to 50%
- Speed up inference
- Improve reasoning consistency by removing unnecessary prompt variability
We’re extending this to:
- Integrate with Fin-R1 for real-time advisory
- Run on Hugging Face Spaces + Raspberry Pi
- Add interpretability dashboards for tracing token impact
Try It Yourself
Want to compress your own domain? Clone the repo and point it at your logs:
python adaptivocab_pipeline.py log_file=your_prompts.txt top_k=100
You’ll get a model optimized for your domain — faster, smarter, and explainable.
Inspired By
- AdaptiVocab: Enhancing LLM Efficiency in Focused Domains (the vocabulary-adaptation paper this pipeline builds on)
Coming Soon
- Gradio UI for testing the patched model
- Auto-finetuning on the compressed vocabulary
- Integration with explainability dashboards