Techniques to improve coding agent velocity
April 3, 2026
This post is a grab bag of techniques for improving engineering velocity that I haven't seen discussed much online. Coding agent autonomy is integral to your engineering velocity, and therefore to your company's success. As a friend of mine once said, "we're all just playing Factorio now".
Share transcripts across your team
Context about why certain design decisions were made is often available from past agent transcripts. As your team scales, most of this context is not readily available on your machine. We built an internal tool that uploads past chat transcripts to a shared MCP server and gave our coding agents a tool to query historical conversations.
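The retrieval side of this can be surprisingly simple. Here is a minimal, stdlib-only sketch of the query tool our agents call (the MCP server wiring is omitted, and the JSONL schema shown is illustrative, not our exact format): a keyword-overlap search over one transcript file per session.

```python
import json
import re
from pathlib import Path

def search_transcripts(transcript_dir: str, query: str, top_k: int = 3) -> list[dict]:
    """Naive keyword search over persisted transcripts (one JSONL file per session).

    Each line is assumed to look like {"role": "...", "text": "..."}.
    A real deployment would embed and rank semantically; keyword overlap
    is enough to sketch the idea.
    """
    terms = set(re.findall(r"\w+", query.lower()))
    hits = []
    for path in Path(transcript_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            msg = json.loads(line)
            words = set(re.findall(r"\w+", msg["text"].lower()))
            score = len(terms & words)
            if score:
                hits.append({"session": path.stem, "text": msg["text"], "score": score})
    return sorted(hits, key=lambda h: -h["score"])[:top_k]
```

Exposed as an MCP tool, this is enough for an agent to pull up "what did Sam already try?" on demand.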
Here's an example of this in practice: while Sam and I were pairing on a memory leak bug, Claude Code read his transcripts from earlier in the debugging session to get up to speed.
→ "Based on the activity logs, here's what Sam concluded about the memory leak:

## The Smoking Gun Finding [2026-03-31 23:48:35]
Sam identified that **in a single minute (23:40:00-23:40:03), there were 10 `file_cache` messages = 10 `build()` calls**, but there were NO webhook handler logs, …" (+15 lines)
Agent: Now I have the picture. Sam identified the _draft_refresh_loop as a key source of build() calls, and moved GmailClient to build-once-in-__init__. But the metrics show the leak is still active and actually worse — monotonic growth at 4.5-7.1 MB/min across all post-fix instances. The reason is clear: Sam fixed GmailClient close() in _get_new_message_ids and _process_messages, but CalendarClient has no close() in any of those paths, and _refresh_stale_drafts creates both clients per draft with no close() on either. Let me re-apply my __del__ fix on top of Sam's commits.
Eval: We measured the impact of transcript sharing on a real four-session memory leak investigation. In the final debugging session, 48.6% of the work was spent re-investigating application-level fixes that earlier sessions had already tried and ruled out.
Remove blockers to autonomous action
Here are good reasons for a coding agent to stop looping:
- It's done with the task
- It needs clarification on the scope of the task
Here are bad reasons for stopping:
- It needs the user to unblock it
A general principle I've found helpful is: "if the agent asks me for something it could theoretically know the answer to, make sure it has access to that answer next time." Here is a summary of common preventable reasons the coding agent stops prematurely, and how we've addressed them:
| Reason for stopping | Fix |
|---|---|
| Finishes designing UI and asks me how it looks | Give it Playwright MCP / Chrome DevTools MCP so it can test the UI autonomously |
| Needs access to production resources | Render / Neon MCP |
| Needs to test a "real-world" capability | Give it access to the real world (e.g. give it an email address) |
| Needs permission to do something dangerous (e.g. reading env vars) | Use auto mode; prompt it to never ask for permission |
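On the last row: Claude Code lets you pre-approve tool permissions in `.claude/settings.json`. The fragment below is a sketch from my own setup, not a spec — the exact rule syntax varies by version, so check the current permissions docs before copying it:

```json
{
  "permissions": {
    "allow": [
      "Bash(env:*)",
      "Read(.env)",
      "mcp__playwright__browser_take_screenshot"
    ]
  }
}
```

Allowlisting the handful of commands your agent repeatedly stops to ask about removes most of the "needs permission" interruptions without granting blanket access.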
Here are some examples of these fixes in practice:
Autonomous UI testing with Playwright MCP — instead of asking "how does this look?", the agent takes screenshots and iterates:
→ mcp__playwright__browser_navigate({"url": "http://localhost:3000"})
→ mcp__playwright__browser_take_screenshot({"filename": "mobile-before.png"})
...agent edits CSS...
→ mcp__playwright__browser_take_screenshot({"filename": "mobile-after.png"})
Agent: Much better. The empty space is gone and the email card demo is now visible on mobile. Let me also check tablet and desktop breakpoints.
→ mcp__playwright__browser_take_screenshot({"filename": "desktop-after.png"})
→ mcp__playwright__browser_take_screenshot({"filename": "tablet-after.png"})
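The screenshot-edit-screenshot loop above can be sketched as a small driver. In this sketch the `take_screenshot` and `apply_css_fix` callables are stand-ins for the Playwright MCP calls (`browser_take_screenshot`, the agent's own edits), so the loop logic is shown without needing a live browser:

```python
from typing import Callable

def ui_fix_loop(take_screenshot: Callable[[str], bytes],
                apply_css_fix: Callable[[], None],
                breakpoints: list[str]) -> dict[str, bool]:
    """Capture 'before' shots, apply the fix, then verify every breakpoint changed.

    Returns breakpoint name -> whether the rendered page changed, mirroring how
    the agent checks mobile, tablet, and desktop in turn instead of asking
    "how does this look?".
    """
    before = {bp: take_screenshot(f"{bp}-before.png") for bp in breakpoints}
    apply_css_fix()
    after = {bp: take_screenshot(f"{bp}-after.png") for bp in breakpoints}
    return {bp: before[bp] != after[bp] for bp in breakpoints}
```

In practice the agent goes further than a byte diff: it looks at the screenshots and judges the layout itself, then loops until it's satisfied.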
Production access via Render MCP — instead of asking me to check prod, the agent pulls live metrics and logs directly:
→ mcp__render__select_workspace({"ownerID": "tea-d1vru795pdvs73edkhi0"})
→ mcp__render__get_metrics({"resourceId": "srv-d6s55t1j16oc73eih6i0", "metricTypes": ["memory_usage", "memory_limit"], "startTime": "2026-03-31T00:00:00Z"})
→ {"metrics": [{"type": "memory_usage", "data": [...]}]}
→ mcp__render__list_logs({"serviceId": "...", "filter": "draft_composer"})
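Once the agent can pull metrics itself, confirming a leak is just arithmetic over the samples. A minimal sketch, assuming the `get_metrics` response has been reduced to `(minutes, MB)` pairs, fits a least-squares slope:

```python
def growth_rate_mb_per_min(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of memory usage over time.

    samples: (minutes_since_start, memory_mb) pairs. A steady positive slope
    across restarts is the signature the agent looked for (4.5-7.1 MB/min
    in the investigation above).
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den
```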
Real-world testing via Gmail API — instead of asking me to send a test email, the agent does it itself:
Agent: Watch is renewed. No webhook yet — let me send the test email for you. I'll need to use a session token to access the Gmail API through the control plane.
...agent restarts local server, registers auth session, queries Gmail API...
→ Returns real inbox emails: "wanna hang tmrw?" from htdowling@gmail.com
Agent: There are two test scheduling emails already in your inbox. The webhooks hit the old server and returned 200, but the background task ran on that process.
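Giving the agent an email address means giving it the Gmail API, and the send step boils down to base64url-encoding a MIME message into the `raw` field that `users.messages.send` expects. A stdlib-only sketch of building that payload (the API client and OAuth setup are omitted, and the addresses are made up):

```python
import base64
from email.message import EmailMessage

def gmail_raw_payload(sender: str, to: str, subject: str, body: str) -> dict:
    """Build the {'raw': ...} request body for Gmail's users.messages.send."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    # The Gmail API wants the full RFC 2822 message, base64url-encoded.
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode("ascii")
    return {"raw": raw}
```

With credentials in place, the agent can send itself a test email, wait for the webhook, and verify the whole pipeline end to end.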
Auto mode for dangerous operations — instead of asking permission to read env vars, the agent just checks:
Agent: ANTHROPIC_API_KEY is not in .env. Let me check further.
→ Bash: env | grep ANTHROPIC_API_KEY | head -c 30
→ ANTHROPIC_API_KEY=sk-ant-api03...
Agent: It's set in your shell environment (not .env). Now adding the note to CLAUDE.md.
Eval: Across 738 transcripts, 11.4% ended with a preventable stop. Roughly 30% of stops are directly addressable by adding tools.
Automatically auditing prod agent transcripts
Evaluating transcripts as well as final outputs is important, since a coding agent can arrive at a correct output despite hitting legitimate blockers along the way. For example, if you ask an agent to edit a PowerPoint and it's unable to make effective edits to the .pptx XML, it may resort to recreating the deck in HTML and then converting the HTML to .pptx.
To facilitate this kind of evaluation, we persist production transcripts and run coding agents that automatically audit every transcript our product generates, checking whether the agent got blocked or confused. The auditor then produces a list of incidents for us to review.
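In our setup the auditor is itself a coding agent, but the shape of its output is easy to sketch deterministically. The toy audit pass below (error patterns and field names are illustrative) counts errors and tool calls per transcript, scores "confusion", and emits a ranked worklist:

```python
# Illustrative error strings; the real list is product-specific.
ERROR_PATTERNS = ("Shape not found", "Chart not found", "timeout", "fetch failed")

def audit_transcripts(transcripts: dict[str, list[str]]) -> list[dict]:
    """Rank transcripts (id -> list of tool-call result strings) by confusion.

    Confusion here is just errors plus a small penalty per tool call, so a
    transcript that thrashes with many calls and many errors floats to the top
    of the review queue.
    """
    report = []
    for tid, calls in transcripts.items():
        errors = sum(1 for c in calls if any(p in c for p in ERROR_PATTERNS))
        report.append({"id": tid, "tool_calls": len(calls), "errors": errors,
                       "confusion": errors + 0.1 * len(calls)})
    return sorted(report, key=lambda r: -r["confusion"])
```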
You might wonder why we don't close the loop completely and let the auditor improve the product automatically. The answer is that, today, this tends to confuse the agent and produce unhelpful change suggestions. I'd estimate that sometime this year this issue will disappear and fully autonomous improvement will become reliable.
Here's an example of this technique in practice:
2. abe224f — Scanned all 18 subagent transcripts for errors. Found 353 total tool calls, ~38 real errors across categories: "Shape not found" (16x), "Chart not found" (6x), sibling cascades (7x), MCP timeouts (3x), fetch failures (3x), NaN serialization (1x), Aspose bugs (2x).
3. ad33413 — Deep-dived into confusion/wasted effort. Ranked all 18 agents by confusion score. Top offenders: slide 13 (38 tool calls, 7 errors), slide 17 (24 calls, 6 errors), slide 10 (29 calls, 3 errors). Identified 6 systemic issues.
4. acfe8f8 — Investigated the worst agent (a12fc6e, slide 17) specifically. Found it destroyed a doughnut chart by ungrouping it, then had to recreate it as a simpler pie chart.
Eval: To test reproducibility, we independently re-audited 5 of the 18 subagent transcripts and compared to the original audit's labels.
At Fergana Labs, we're helping coding agents to do great knowledge work, like creating and editing presentations. If this sounds interesting to you, please do reach out and say hi to me at henry@ferganalabs.com.