
Dimensions of Thought: A Smarter Way to Evaluate AI

📖 Summary

This post introduces a multidimensional reward modeling pipeline built on top of the CO_AI framework. It covers:

  • ✅ Structured Evaluation Setup: How to define custom evaluation dimensions using YAML or database-backed rubrics.

  • 🧠 Automated Scoring with LLMs: Using the ScoreEvaluator to produce structured, rationale-backed scores for each dimension.

  • 🧮 Embedding-Based Hypothesis Indexing: Efficiently embedding hypotheses and comparing them via similarity for contrastive learning.

  • 🔄 Contrast Pair Generation: Creating training pairs where one hypothesis outperforms another on a given dimension, as sketched below.
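To make the pipeline concrete, here is a minimal Python sketch of the scoring and contrast-pair steps. The dimension names and the `score_with_llm` helper are illustrative placeholders, not the CO_AI API.

```python
# Minimal sketch (placeholder names, not the CO_AI API): score hypotheses on
# named dimensions, then build contrast pairs where one hypothesis clearly
# outperforms another on a given dimension.
from itertools import combinations

DIMENSIONS = ["correctness", "clarity", "novelty"]  # example rubric dimensions

def score_with_llm(hypothesis: str, dimension: str) -> dict:
    """Placeholder for an LLM-backed evaluator that returns a numeric score
    and a rationale for one dimension (stubbed here)."""
    return {"dimension": dimension, "score": 0.0, "rationale": "stub"}

def build_contrast_pairs(hypotheses: list[str], dimension: str, margin: float = 0.5):
    """Pair hypotheses where one outscores the other by at least `margin`."""
    scored = [(h, score_with_llm(h, dimension)["score"]) for h in hypotheses]
    pairs = []
    for (h_a, s_a), (h_b, s_b) in combinations(scored, 2):
        if abs(s_a - s_b) >= margin:
            chosen, rejected = (h_a, h_b) if s_a > s_b else (h_b, h_a)
            pairs.append({"dimension": dimension, "chosen": chosen, "rejected": rejected})
    return pairs
```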

Uncovering Reasoning in LLMs with Sparse Autoencoders

Summary

Large Language Models (LLMs) like DeepSeek-R1 show remarkable reasoning abilities, but how these abilities are internally represented has remained a mystery. This paper explores the mechanistic interpretability of reasoning in LLMs using Sparse Autoencoders (SAEs) — a tool that decomposes LLM activations into human-interpretable features. In this post, we’ll:

  • Explain the SAE architecture used (a minimal sketch follows below)
  • Compute and visualize ReasonScore
  • Explore feature steering with sample completions
  • Provide live visualizations using Python + Streamlit
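As a reference point, here is a minimal PyTorch sketch of a standard sparse autoencoder over transformer activations: reconstruction loss plus an L1 sparsity penalty on the feature activations. The exact architecture, widths, and hyperparameters used for DeepSeek-R1 in the paper may differ.

```python
# Minimal sketch of a standard sparse autoencoder (SAE) over model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_hidden)  # activations -> feature space
        self.W_dec = nn.Linear(d_hidden, d_model)  # features -> reconstructed activations
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.relu(self.W_enc(x))   # sparse, non-negative feature activations
        x_hat = self.W_dec(f)          # reconstruction of the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    # reconstruction error + L1 sparsity penalty on feature activations
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()

# Usage on a batch of hidden states (shapes are illustrative):
# x = torch.randn(8, 4096)
# sae = SparseAutoencoder(d_model=4096, d_hidden=16 * 4096)
# x_hat, f = sae(x)
# loss = sae_loss(x, x_hat, f)
```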

Optimizing Prompt Generation with MARS and DSPy

🕒 TL;DR

  • We explore MARS, a multi-agent prompt optimizer using Socratic dialogue.
  • We implement it using DSPy + Fin-R1 + EDGAR, giving us an end-to-end financial reasoning pipeline (a minimal DSPy sketch follows below).
  • We deploy the whole thing to Hugging Face Spaces with a Gradio UI.
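For orientation, here is a minimal DSPy sketch of a single filing-QA building block. The model endpoint, signature fields, and example question are assumptions for illustration; the MARS Socratic multi-agent loop and EDGAR retrieval are not shown here.

```python
# Minimal DSPy sketch (assumed setup): one filing-QA module served by an
# OpenAI-compatible endpoint hosting Fin-R1. Endpoint URL and field names are
# illustrative, not the post's exact pipeline.
import dspy

# Assumption: Fin-R1 is served locally behind an OpenAI-compatible API.
lm = dspy.LM("openai/Fin-R1", api_base="http://localhost:8000/v1", api_key="EMPTY")
dspy.configure(lm=lm)

class FilingQA(dspy.Signature):
    """Answer a question using an excerpt from an EDGAR filing."""
    filing_excerpt: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

qa = dspy.ChainOfThought(FilingQA)
# Example call (excerpt text elided):
# result = qa(filing_excerpt="...", question="What was net revenue in FY2023?")
# print(result.reasoning, result.answer)
```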

🌟 Introduction

Prompt engineering has become the defining skill of the Large Language Model (LLM) era: a delicate balance between science and art. Crafting the perfect prompt often feels like an exercise in intuition, trial, and error. But what if we could take the guesswork out of the process? What if prompts could optimize themselves?