<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Evaluation on Programmer.ie: Modern AI programming</title>
    <link>http://programmer.ie/categories/evaluation/</link>
    <description>Recent content in Evaluation on Programmer.ie: Modern AI programming</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 22 Apr 2026 10:35:46 +0100</lastBuildDate>
    <atom:link href="http://programmer.ie/categories/evaluation/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Beyond Hallucination Energy: A Three-Dimensional Framework for Reliable AI Outputs</title>
      <link>http://programmer.ie/post/trendslop/</link>
      <pubDate>Wed, 22 Apr 2026 10:35:46 +0100</pubDate>
      <guid>http://programmer.ie/post/trendslop/</guid>
      <description>&lt;h2 id=&#34;-1--tldr&#34;&gt;🧩 1. TLDR&lt;/h2&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;AI doesn&amp;rsquo;t just hallucinate.&#xA;Sometimes it gives answers that are fluent, safe… and completely useless.&lt;/strong&gt;&lt;/p&gt;&lt;/blockquote&gt;&#xA;&lt;p&gt;Most discussions about AI failure focus on hallucination:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;making things up&lt;/li&gt;&#xA;&lt;li&gt;getting facts wrong&lt;/li&gt;&#xA;&lt;li&gt;fabricating sources&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;That&amp;rsquo;s real. It matters.&lt;/p&gt;&#xA;&lt;p&gt;But it&amp;rsquo;s not the most dangerous failure mode in production systems.&lt;/p&gt;&#xA;&lt;p&gt;There is a quieter one.&lt;/p&gt;&#xA;&lt;p&gt;A more subtle one.&lt;/p&gt;&#xA;&lt;p&gt;And in practice a more &lt;em&gt;pervasive&lt;/em&gt; one.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;AI systems often fail not by being wrong,&#xA;but by failing to think at all.&lt;/strong&gt;&lt;/p&gt;&lt;/blockquote&gt;</description>
    </item>
    <item>
      <title>Applied Policy: How to Incorporate Policy and Hallucination in a Self-Improving System</title>
      <link>http://programmer.ie/post/policy_applied/</link>
      <pubDate>Wed, 18 Feb 2026 08:00:16 +0000</pubDate>
      <guid>http://programmer.ie/post/policy_applied/</guid>
      <description>&lt;blockquote&gt;&#xA;&lt;p&gt;Building a Self-Improving AI: Cooperative ERL and Embed-RL in a Trace-Native Architecture&lt;/p&gt;&lt;/blockquote&gt;&#xA;&lt;h2 id=&#34;1-the-problem&#34;&gt;1. The Problem&lt;/h2&gt;&#xA;&lt;p&gt;Most self-improving AI systems fail for one of three reasons:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;First, scalar reward collapse.&lt;/strong&gt; Traditional reinforcement learning compresses multi-dimensional quality into a single scalar. This creates catastrophic interference: improving one axis (e.g., coherence) can degrade another (e.g., hallucination safety). The system optimizes for the blended metric, not the underlying objectives.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Second, representation drift.&lt;/strong&gt; Embedding-based optimization without behavioral feedback creates geometric collapse. The embedding space becomes increasingly narrow, losing discriminative power. Similar queries map to identical regions. Diversity vanishes. The system becomes brittle.&lt;/p&gt;</description>
    </item>
    <item>
      <title>From Evidence to Verifiability: Rebuilding Trust in AI Outputs 🔏</title>
      <link>http://programmer.ie/post/policy/</link>
      <pubDate>Tue, 03 Feb 2026 12:25:58 +0000</pubDate>
      <guid>http://programmer.ie/post/policy/</guid>
      <description>&lt;h2 id=&#34;-tldr&#34;&gt;⏰ TLDR&lt;/h2&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;This work shows that the hardest part of using AI in high-trust environments is not the model, but the policy.&#xA;Once editorial policy is made explicit and executable, AI systems become interchangeable; the real challenge is engineering reliable measurements and deterministic enforcement of those policies.&lt;/p&gt;&lt;/blockquote&gt;&#xA;&lt;h2 id=&#34;-summary&#34;&gt;📋 Summary&lt;/h2&gt;&#xA;&lt;p&gt;AI systems are becoming deeply embedded in how we research, write, and reason.&#xA;At the same time, their use in high-trust environments is under strain, not because models are incapable, but because they are being deployed into settings that demand &lt;strong&gt;determinism, provenance, and enforceable rules&lt;/strong&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Search–Solve–Prove: building a place for thoughts to develop</title>
      <link>http://programmer.ie/post/ssp/</link>
      <pubDate>Sun, 02 Nov 2025 01:13:06 +0000</pubDate>
      <guid>http://programmer.ie/post/ssp/</guid>
      <description>&lt;h2 id=&#34;-summary&#34;&gt;🌌 Summary&lt;/h2&gt;&#xA;&lt;p&gt;What if you could &lt;strong&gt;see an AI think&lt;/strong&gt;, not just the final answer, but the whole stream of reasoning: every search, every dead end, every moment of insight? We’re building exactly that: a visible, measurable thought process we call &lt;strong&gt;the Jitter&lt;/strong&gt;. This post, &lt;strong&gt;the first in a series&lt;/strong&gt;, shows how we’re creating the &lt;strong&gt;habitat&lt;/strong&gt; where that digital thought stream can live and grow.&lt;/p&gt;&#xA;&lt;p&gt;We’ll draw on ideas from:&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
