
Dimensions of Thought: A Smarter Way to Evaluate AI

📖 Summary

This post introduces a multidimensional reward modeling pipeline built on top of the CO_AI framework. It covers:

  • ✅ Structured Evaluation Setup: How to define custom evaluation dimensions using YAML or database-backed rubrics.

  • 🧠 Automated Scoring with LLMs: Using the ScoreEvaluator to produce structured, rationale-backed scores for each dimension.

  • 🧮 Embedding-Based Hypothesis Indexing: Efficiently embedding hypotheses and comparing them via similarity for contrastive learning.

  • 🔄 Contrast Pair Generation: Creating training pairs where one hypothesis outperforms another on a given dimension, as sketched below.
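To make the pipeline concrete, here is a minimal Python sketch of the scoring and contrast-pair steps. The dimension names and the `score_with_llm` helper are illustrative placeholders, not the CO_AI API.

```python
# Minimal sketch (placeholder names, not the CO_AI API): score hypotheses on
# named dimensions, then build contrast pairs where one hypothesis clearly
# outperforms another on a given dimension.
from itertools import combinations

DIMENSIONS = ["correctness", "clarity", "novelty"]  # example rubric dimensions

def score_with_llm(hypothesis: str, dimension: str) -> dict:
    """Placeholder for an LLM-backed evaluator that returns a numeric score
    and a rationale for one dimension (stubbed here)."""
    return {"dimension": dimension, "score": 0.0, "rationale": "stub"}

def build_contrast_pairs(hypotheses: list[str], dimension: str, margin: float = 0.5):
    """Pair hypotheses where one outscores the other by at least `margin`."""
    scored = [(h, score_with_llm(h, dimension)["score"]) for h in hypotheses]
    pairs = []
    for (h_a, s_a), (h_b, s_b) in combinations(scored, 2):
        if abs(s_a - s_b) >= margin:
            chosen, rejected = (h_a, h_b) if s_a > s_b else (h_b, h_a)
            pairs.append({"dimension": dimension, "chosen": chosen, "rejected": rejected})
    return pairs
```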

Uncovering Reasoning in LLMs with Sparse Autoencoders

Summary

Large Language Models (LLMs) like DeepSeek-R1 show remarkable reasoning abilities, but how these abilities are internally represented has remained a mystery. This paper explores the mechanistic interpretability of reasoning in LLMs using Sparse Autoencoders (SAEs) — a tool that decomposes LLM activations into human-interpretable features. In this post, we’ll:

  • Explain the SAE architecture used (a minimal sketch follows below)
  • Compute and visualize ReasonScore
  • Explore feature steering with sample completions
  • Provide live visualizations using Python + Streamlit
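As a reference point, here is a minimal PyTorch sketch of a standard sparse autoencoder over transformer activations: reconstruction loss plus an L1 sparsity penalty on the feature activations. The exact architecture, widths, and hyperparameters used for DeepSeek-R1 in the paper may differ.

```python
# Minimal sketch of a standard sparse autoencoder (SAE) over model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_hidden)  # activations -> feature space
        self.W_dec = nn.Linear(d_hidden, d_model)  # features -> reconstructed activations
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.relu(self.W_enc(x))   # sparse, non-negative feature activations
        x_hat = self.W_dec(f)          # reconstruction of the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    # reconstruction error + L1 sparsity penalty on feature activations
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()

# Usage on a batch of hidden states (shapes are illustrative):
# x = torch.randn(8, 4096)
# sae = SparseAutoencoder(d_model=4096, d_hidden=16 * 4096)
# x_hat, f = sae(x)
# loss = sae_loss(x, x_hat, f)
```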

Optimizing Prompt Generation with MARS and DSPy

🕒 TL;DR

  • We explore MARS, a multi-agent prompt optimizer using Socratic dialogue.
  • We implement it using DSPy + Fin-R1 + EDGAR, giving us an end-to-end financial reasoning pipeline (a minimal DSPy sketch follows below).
  • We deploy the whole thing to Hugging Face Spaces with a Gradio UI.
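For orientation, here is a minimal DSPy sketch of a single filing-QA building block. The model endpoint, signature fields, and example question are assumptions for illustration; the MARS Socratic multi-agent loop and EDGAR retrieval are not shown here.

```python
# Minimal DSPy sketch (assumed setup): one filing-QA module served by an
# OpenAI-compatible endpoint hosting Fin-R1. Endpoint URL and field names are
# illustrative, not the post's exact pipeline.
import dspy

# Assumption: Fin-R1 is served locally behind an OpenAI-compatible API.
lm = dspy.LM("openai/Fin-R1", api_base="http://localhost:8000/v1", api_key="EMPTY")
dspy.configure(lm=lm)

class FilingQA(dspy.Signature):
    """Answer a question using an excerpt from an EDGAR filing."""
    filing_excerpt: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

qa = dspy.ChainOfThought(FilingQA)
# Example call (excerpt text elided):
# result = qa(filing_excerpt="...", question="What was net revenue in FY2023?")
# print(result.reasoning, result.answer)
```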

🌟 Introduction

Prompt engineering has become the defining skill of the Large Language Model (LLM) era: a delicate balance between science and art. Crafting the perfect prompt often feels like an exercise in intuition, trial, and error. But what if we could take the guesswork out of the process? What if prompts could optimize themselves?