AdaptiVocab
Building Efficient Financial Reasoning Systems with AdaptiVocab and ReasonScore
In this post, we’ll walk through a method for compressing and accelerating domain-specific prompts in large language models (LLMs). By combining the vocabulary-adaptation ideas from the recent paper AdaptiVocab: Enhancing LLM Efficiency in Focused Domains with token-level importance scores (ReasonScore), we’ve built a pipeline that reduces prompt size, improves model focus, and runs efficiently on small infrastructure like Hugging Face Spaces or Raspberry Pi devices.
Why This Matters
Most LLM prompts in the financial domain contain highly repetitive instructions:
- “Summarize the income statement”
- “Compare Q3 to Q4 revenue”
- “Assess risk from the 10-K filing”
These phrases recur across thousands of prompts, yet they tokenize inefficiently. For example:
"Compare net income trend year over year" → 8+ tokens
By merging these into domain-specific vocabulary like:
compare_net_income_yoy → 1 token
…we reduce token count, latency, and cost while improving model context window utilization.
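You can check the count yourself with any Hugging Face tokenizer. The exact split varies by tokenizer; a minimal sketch using TinyLlama’s Llama-style BPE:

```python
from transformers import AutoTokenizer

# Token counts vary by tokenizer; TinyLlama's Llama-style BPE is used here.
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

phrase = "Compare net income trend year over year"
pieces = tok.tokenize(phrase)
print(pieces)       # e.g. ['▁Compare', '▁net', '▁income', ...] (exact split varies)
print(len(pieces))  # ~7 subwords here; 8+ once special tokens are added
```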
The Solution
We built a modular pipeline that:
- Extracts frequent financial phrases from your real-world prompt logs
- Scores tokens using ReasonScore, a lightweight LLM-based importance estimator
- Creates embeddings for new tokens based on their reasoning value
- Injects those embeddings into a base model like Fin-R1 or Mistral
This allows you to compress real prompts without modifying the model architecture, using smart domain-aware initialization instead of random vectors.
System Overview
At the core is a component-based pipeline:
- Prompt logs → n-gram extractor
- AdaptiVocab patcher → phrase merging
- ReasonScore engine → token importance
- Embedding init → weighted combination
- Model patch → tokens added to tokenizer + embeddings
All configuration is managed with Hydra.
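For orientation, a minimal Hydra entrypoint looks like the sketch below. The cfg keys (log_file, top_k) mirror the CLI overrides shown later, but they are illustrative, not the pipeline’s exact schema:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Illustrative keys; the real config schema may differ.
    print(f"Extracting top {cfg.top_k} phrases from {cfg.log_file}")

if __name__ == "__main__":
    main()
```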
Step-by-Step: From Logs to Model Enhancement
1. Extract Phrases
Using generate_adaptivocab.py, we extract the top n-grams from real user instructions.
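generate_adaptivocab.py handles this end to end; the core idea is plain frequency counting, roughly like the sketch below (file name and defaults are illustrative):

```python
from collections import Counter

def top_ngrams(lines, n=3, top_k=50):
    """Return the top_k most frequent word-level n-grams across prompt logs."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top_k)

# Hypothetical log file: one prompt per line.
with open("your_prompts.txt") as f:
    for phrase, freq in top_ngrams(f.readlines(), n=4, top_k=10):
        print(f"{freq:5d}  {phrase}")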
2. Score with ReasonScore
We load a small model (e.g., TinyLlama) and compute ReasonScores from final-layer attention to estimate which tokens matter most for reasoning.
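As a minimal sketch, assume a token’s ReasonScore is the mean attention it receives in the model’s final layer; the model choice and averaging scheme here are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
# "eager" attention so the model can return attention weights.
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")
model.eval()

@torch.no_grad()
def reason_scores(text):
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs, output_attentions=True)
    attn = out.attentions[-1][0]             # final layer: (heads, query, key)
    received = attn.mean(dim=0).mean(dim=0)  # avg over heads, then query positions
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    return dict(zip(tokens, received.tolist()))

print(reason_scores("Compare net income trend year over year"))
```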
3. Initialize Embeddings
We initialize each new token vector as a ReasonScore-weighted sum of its constituent subword embeddings, giving it an informed starting point instead of a random one.
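A sketch of that initialization, assuming scores maps subword tokens to their ReasonScores (as returned by the scorer above); the helper name init_embedding is ours:

```python
import torch

def init_embedding(phrase, tokenizer, embedding_matrix, scores):
    """New token vector = ReasonScore-weighted average of the phrase's
    constituent subword embeddings (a convex combination, not random init)."""
    ids = tokenizer(phrase, add_special_tokens=False)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # Fall back to a uniform weight when a subword has no score.
    weights = torch.tensor([scores.get(t, 1.0) for t in tokens])
    weights = weights / weights.sum()
    vectors = embedding_matrix[ids]          # (num_subwords, hidden_dim)
    return (weights.unsqueeze(1) * vectors).sum(dim=0)
```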
4. Patch Your Model
We expand the tokenizer, resize the embedding layer, and inject each new token + its vector.
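With Hugging Face Transformers, the patch itself is a few lines. The sketch below uses a plain mean init so it stays self-contained; the pipeline would swap in the ReasonScore-weighted initializer from step 3:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_tokens = ["compare_net_income_yoy"]  # merged domain phrase from step 1
tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))  # adds rows for the new tokens

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        # Subwords of the original phrase ("compare net income yoy").
        sub_ids = tok(t.replace("_", " "), add_special_tokens=False)["input_ids"]
        # Plain mean shown here; use the ReasonScore-weighted init in practice.
        emb[tok.convert_tokens_to_ids(t)] = emb[sub_ids].mean(dim=0)
```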
Run the Full Pipeline
python adaptivocab_pipeline.py --config-path . --config-name config
This will:
- Extract top 50 financial phrases
- Compute ReasonScores
- Initialize new embeddings
- Save the patched embedding matrix
Results and Next Steps
By compressing instruction prompts, you:
- Reduce token usage by up to 50%
- Speed up inference
- Improve reasoning consistency by removing unnecessary prompt variability
We’re extending this to:
- Integrate with Fin-R1 for real-time advisory
- Run on Hugging Face Spaces + Raspberry Pi
- Add interpretability dashboards for tracing token impact
Try It Yourself
Want to compress your own domain? Clone the repo and point it at your logs:
python adaptivocab_pipeline.py log_file=your_prompts.txt top_k=100
You’ll get a model optimized for your domain — faster, smarter, and explainable.
Inspired By
- AdaptiVocab: Enhancing LLM Efficiency in Focused Domains (the vocabulary-adaptation paper this pipeline builds on)
Coming Soon
- Gradio UI for testing the patched model
- Auto-finetuning on the compressed vocabulary
- Integration with explainability dashboards