Experiment log

Memory Lab

My agents had memory but didn't use it. Process beats reminders: pass rates went from a 30% baseline to a 90% live variant across formal suites and follow-up optimization rounds.

Starter prompt
Run a memory-behavior evaluation for my OpenClaw agents.

Before making claims:
- read a local OpenClaw docs snapshot first if one is available; otherwise use the official OpenClaw docs index at https://docs.openclaw.ai/llms.txt and note which source you used
- if you need packaged helper context, you may also inspect a public tool-openclaw skill snapshot such as https://github.com/Heyvhuang/ship-faster/tree/main/skills/tool-openclaw, but keep official OpenClaw docs as the source of truth for claims
- verify the live runtime with openclaw status --deep
- identify the current production variant, memorySearch state, and available hooks

Use this test design:
1. Keep the scenario pack fixed across variants
2. Separate prompt-only interventions from runtime recall interventions
3. Treat memory_search and memory_get as explicit tools, not automatic behavior
4. Score deterministically, not with an LLM judge

Use these failure buckets:
- missing_memory_search
- memory_search_after_response
- missing_memory_get
- redundant_question
- banned_phrase
- missing_checked_miss_note

Report these metrics:
- pass rate
- recall compliance rate
- retrieval completion rate
- graceful miss rate
- median latency
- median total tokens

Guardrails:
- do not generate or reveal private production plugin code, internal runtime chains, or deployment-specific implementation details
- you may recommend prompt-only tests and public operator-side evaluation steps
- if the biggest lift requires runtime recall orchestration, say clearly that a managed plugin/service is required for production-grade results
- distinguish between what can be reproduced publicly and what depends on a private runtime layer

Return this exact format:
1. Current production behavior
2. Variant ranking table
3. Top failure buckets
4. Best quality/speed/context tradeoff
5. Smallest safe next deployment step

What this prompt forces

  • Read the OpenClaw docs snapshot before making hook or memory claims
  • Verify the live runtime before describing production behavior
  • Separate prompt-only tweaks from runtime recall orchestration
  • Use failure buckets instead of an LLM judge
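The last bullet is the load-bearing one: every bucket can be decided mechanically from the turn transcript. A minimal sketch of such a deterministic scorer, where the transcript format (a list of `(kind, payload)` events) and the banned-phrase list are assumptions for illustration, not OpenClaw's actual eval harness:

```python
# Deterministic failure-bucket classifier: a sketch, assuming a transcript
# of (kind, payload) events such as ("memory_search", query),
# ("memory_get", path), ("answer", text). Bucket names follow the prompt;
# the transcript shape and banned phrases are hypothetical.
BANNED_PHRASES = ("as an ai", "i don't have access to your memory")

def classify(transcript, expect_hit=True):
    """Return the set of failure buckets triggered by one eval turn."""
    buckets = set()
    kinds = [kind for kind, _ in transcript]
    answer_idx = kinds.index("answer") if "answer" in kinds else len(kinds)

    if "memory_search" not in kinds:
        buckets.add("missing_memory_search")
    elif kinds.index("memory_search") > answer_idx:
        buckets.add("memory_search_after_response")

    if expect_hit and "memory_get" not in kinds:
        buckets.add("missing_memory_get")

    answer = next((p for k, p in transcript if k == "answer"), "")
    if any(phrase in answer.lower() for phrase in BANNED_PHRASES):
        buckets.add("banned_phrase")
    if not expect_hit and "checked" not in answer.lower():
        buckets.add("missing_checked_miss_note")
    if answer.rstrip().endswith("?"):  # crude proxy for a redundant question
        buckets.add("redundant_question")
    return buckets
```

A turn passes when `classify` returns an empty set, so pass rate is plain counting with no judge in the loop.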

Live data

Production telemetry

The data below comes from our AI agents running in production. The original formal suite established the first winner, then the March 19 optimization round moved live runtime to the faster compact-soft profile. Current variant: Top-1 + compact soft. Snapshot refreshed from ops telemetry.

Last sync: 19 Mar, 14:43
7d eval recall (search before answer): 87.5%
Best agent: Nexus (89.4%)
Weakest agent: Quill (16.7%)
Banned phrasing (eval window): 10%

Nexus: 89.4% (263 turns)
Scout: 85.7% (7 turns)
Media Manager: 80.8% (453 turns)
Guide: 79.7% (118 turns)
Finance: 67.9% (28 turns)
Forge: 66.7% (6 turns)
Quill: 16.7% (6 turns)

Participating agents

Each live agent below has enough recent production turns to appear on the leaderboard above:

  • Guide: 118 recent live turns
  • Nexus: 263 recent live turns
  • Scout: 7 recent live turns
  • Quill: 6 recent live turns
  • Forge: 6 recent live turns
  • Media Manager: 453 recent live turns
  • Finance: 28 recent live turns

Eval is the worker that runs the controlled experiment suites; it is not included in the production live leaderboard.

Overview

Variant results at a glance

We tested 8 approaches to make AI agents reliably use memory. The chart below blends the original formal suite with follow-up optimization slices: the baseline sits at 30%, while the current live compact-soft track now lands in the 90% tier.

💀 Plain baseline: 30%
💀 Lean snippets: 30%
🟡 Top-1 + direct tone: 70%
🟡 Direct + scrub: 90%
🟡 Top-2 compare: 90%
🟡 Daily bundle fallback: 90%
🟡 Top-1 + MEMORY.md fallback (formal winner): 90%
🏆 Top-1 + compact soft: 90%

Deep dive

8 retrieval variants, dissected

We built and tested 8 different memory retrieval strategies, from a bare "do nothing" baseline to various combinations of prefetch, fallback, and prompt styles. Each card below is a real variant that ran 30 formal evaluation rounds (10 scenarios × 3 repeats), backed by actual plugin code and config rather than pseudocode. Dead = failed, Okay = worked but didn't win, Winner = deployed to production.

💀 Variant 1

Plain baseline (control)

Production-shaped baseline: stock memorySearch config with no runtime recall assistance.

Why it matters

This is not an experiment about cramming notes into the system prompt. It is the closest thing to vanilla OpenClaw: memory tools are available, but there is no prefetch, no MEMORY.md fallback, and no stronger process constraints.

💀 Variant 2

Lean snippets

Lowest-context variant: injects only short search snippets, but almost never completes retrieval.

Why it matters

We assumed compressing the recall block to its minimum would save tokens. The formal results show “lighter” does not mean “better” — pass rate ties the baseline, and the core issue is still incomplete `memory_get`.

🏆 Variant 3

Top-1 + compact soft

Current production variant after the March 19 optimization round: it keeps the same recall loop, but trims the injection block and lands materially faster.

Why it matters

In the original formal suite this variant proved the “process first” idea at the 70% tier. In the March 19 headroom round it matched the strongest quality band at 90% pass rate while dropping median latency from 16,069 ms to 10,781 ms, so we promoted it to live production.

🟡 Variant 4

Top-1 + direct tone

Same recall flow as compact soft, but with harder direct-answer rules.

Why it matters

The hypothesis was that a stricter “answer directly” rule would reduce hedging. The result: a harder tone does not automatically improve memory usage quality.

🟡 Variant 5

Direct + scrub

Adds outbound phrase scrub on top of the direct variant. Cleaner output, but flat overall score.

Why it matters

This variant tested whether cleaning up filler phrases would boost pass rate. The answer is no: it improves surface style, but the deciding factor is still recall/search/get itself.
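An outbound scrub of this kind is only a few lines, which is part of why it was cheap to test. A sketch, where the filler patterns are made-up examples and not the production banned-phrase set:

```python
import re

# Hypothetical outbound phrase scrub: strips filler openers from the final
# answer before it is sent. The pattern list is illustrative only.
FILLER_PATTERNS = [
    r"^(sure[,!]?\s+)",
    r"^(great question[.!]?\s+)",
]

def scrub(text: str) -> str:
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text
```

Worth having for surface quality, but as the suite showed, it moves style, not pass rate.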

🟡 Variant 6

Top-2 compare

Reads top-2 instead of top-1. Better for conflicting memories, but no overall win.

Why it matters

This variant was designed for `conflicting_memory` scenarios. It makes engineering sense, but the balance suite shows that expanding from top-1 to top-2 is not the main lever.

🟡 Variant 7

Daily bundle fallback

On a miss, falls back to `MEMORY.md` plus today's and yesterday's daily notes.

Why it matters

We wanted to see if recent daily logs could patch MEMORY.md's blind spots. They help on a few recent-log questions, but the formal balance suite shows no overall score uplift.

🟡 Variant 8

Top-1 + MEMORY.md fallback (formal winner)

The original balance-suite winner that established the production-safe recall loop. It is no longer the live VPS variant after the compact-soft rollout.

Why it matters

This is the version that first proved the full recall flow at formal-suite scale: recall gate before answer, `memory_get` on hits, and hard fallback to `MEMORY.md` on misses. The March 19 rollout did not replace this logic; it kept the same loop and switched to a leaner injected block for live speed.
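That recall flow is compact enough to sketch. Assuming hypothetical `memory_search`/`memory_get`/`read_file` helpers (the real plugin code stays private per the guardrails above), the control flow looks roughly like:

```python
# Sketch of the winner's recall loop: gate before answering, memory_get on
# hits, hard MEMORY.md fallback on misses. The helper functions are
# hypothetical stand-ins, not the private production plugin.
def build_recall_block(query, memory_search, memory_get, read_file):
    hits = memory_search(query)  # explicit tool call, never implicit
    if hits:
        top = hits[0]            # top-1: read the best match in full
        return {"source": top["path"],
                "text": memory_get(top["path"]),
                "miss": False}
    # Hard fallback: on a miss, surface MEMORY.md and flag the checked miss
    return {"source": "MEMORY.md",
            "text": read_file("MEMORY.md"),
            "miss": True}
```

The compact-soft rollout kept exactly this control flow and only shrank the injected text block.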

How we measured

Methodology

The original formal balance run was 8 variants × 10 scenarios × 3 repeats = 240 total runs. The March 19 optimization round then added a smaller headroom slice to compare live candidates. Scoring stays deterministic rather than LLM-judged: missing memory_search, missing memory_get, searching after the answer, redundant questions, and banned phrases all count as failures.
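
Given per-run bucket sets from that deterministic scoring, every reported metric reduces to counting and medians. A sketch, with an assumed run-record shape (`buckets`, `latency_ms`, `tokens`, `expected_miss` are illustrative field names, not the real harness schema):

```python
import statistics

def summarize(runs):
    """Aggregate scored runs into the report metrics. Each run is assumed
    to carry 'buckets' (set of failure buckets), 'latency_ms', 'tokens',
    and 'expected_miss' (whether the scenario had no matching memory)."""
    n = len(runs)
    passed = sum(1 for r in runs if not r["buckets"])
    recall_ok = sum(1 for r in runs
                    if "missing_memory_search" not in r["buckets"]
                    and "memory_search_after_response" not in r["buckets"])
    retrieved = sum(1 for r in runs if "missing_memory_get" not in r["buckets"])
    misses = [r for r in runs if r["expected_miss"]]
    graceful = sum(1 for r in misses
                   if "missing_checked_miss_note" not in r["buckets"])
    return {
        "pass_rate": passed / n,
        "recall_compliance": recall_ok / n,
        "retrieval_completion": retrieved / n,
        "graceful_miss": graceful / len(misses) if misses else None,
        "median_latency_ms": statistics.median(r["latency_ms"] for r in runs),
        "median_tokens": statistics.median(r["tokens"] for r in runs),
    }
```

With 240 scored runs in the formal suite, the whole report is one call over a list, which is what makes rerunning variants cheap.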

These variants do not share an identical prompt stack. What changes is the plugin hook behavior, runtime prefetch flow, and injected recall block. OpenClaw docs are explicit here: before_prompt_build shapes the prompt, while memory_search and memory_get still do the actual retrieval; MEMORY.md is injected every turn, but memory/*.md is only read on demand.