
LDBD Dev Log #2 — I Had Claude Fable 5 Audit My Entire Codebase

I had Claude Fable 5 audit the whole LDBD codebase — code I'd built with Opus and was fairly happy with. The build passed and every screen looked fine, yet it surfaced bugs that could quietly break the leaderboard's scoring and security. Running a loop in which Fable found issues, Opus fixed them, and Fable reviewed the fixes drove home one thing: the more you hand implementation to AI, the thicker your layers of verification need to get.

My last entry, Dev Log #1, was about a day spent with Claude Code polishing a handful of LDBD screens. Honestly, #2 was going to be another feature-improvement post — and then the plan changed. Claude Fable 5 had just become a hot topic, I wanted to try it, and I'd also been meaning to get the whole codebase properly checked before pushing LDBD in front of more people. So this entry isn't about building something new. It's about having Fable 5 audit the entire LDBD codebase.

Up to now I've built almost all of LDBD with Claude Code, which is to say with Opus. I was fairly happy with the result and hadn't worried much about code quality. So I was caught off guard when I pointed Fable 5 at the same code. It found more than I expected, and some of it could quietly break the leaderboard's scoring and security. The scariest kind is the bug that never throws an error. The screen looks normal, the build passes, and the user has no idea. But if a score is computed wrong, or a bad value gets locked in once and stays forever, the whole premise of LDBD (“record a prediction, then verify it”) starts to fall apart.

And this wasn't a one-shot audit. Fable finds a problem, Opus fixes it, Fable reviews the fix: I ran that loop over and over. At one point a fix Opus wrote turned out to introduce a worse bug than the one it was fixing, and Fable's review caught it. That's the part this post is really about.

One aside: as I write this (June 2026), Fable 5 is no longer available. Anthropic has said it suspended access to Fable 5 and Mythos 5 in connection with a U.S. government export-control directive, and that it's working to restore access. So read this as a snapshot in time (I've linked the coverage at the bottom). Luckily I had run this audit just before that, and this post is built on that run.

And the conclusion this work left me with, up front: the more you hand implementation to AI, the thicker your verification has to get.

The setup — one prompt, five angles

The instruction I gave was nothing fancy. This single paragraph was all of it.

“I'm building an asset-price prediction app called LDBD. Figure out the purpose, structure, and plans of this codebase. Then find every problem in it and write them up. Don't change any code — write a step-by-step improvement plan for the problems you find, and put it all in a markdown file.”

I pinned down two things on purpose. One: don't change the code. I didn't want diagnosis and fixes mixed together; I wanted the list of problems first. Two: put it in a markdown file. I needed a checklist I could work through and tick off one item at a time.

Fable 5 took that one paragraph and ran the audit along five parallel tracks at once: scoring logic, API and security, the database and background jobs, the frontend, and the bot agents and scripts. Each track dug into a different area, so the whole codebase got looked at from several angles in a single pass, and the result came back as one report sorted by severity.

| Severity | Count | Roughly what |
|---|---|---|
| CRITICAL | 4 | Breaks the core: leaderboard manipulation, permanent score corruption |
| HIGH | 11 | Data leaks or comes out wrong under certain conditions |
| MEDIUM | 21 | Correctness and robustness issues, situation-specific |
| LOW | 18 | Hygiene, doc drift, cosmetic |

The numbers alone made me flinch. The four tagged CRITICAL all carried the same label: “breaks the trustworthiness of the leaderboard itself.” And this was code I thought I'd written reasonably well with Opus. But here's where the post earns its keep: I didn't fix anything the moment I got the report.

An audit isn't the truth — checking each one against the real code

AI-generated audit reports can be full of false positives: things flagged as problems that aren't. They can assert, in a perfectly convincing sentence, “there's a vulnerability right here,” and then you open the actual code and find it's already blocked somewhere else, or the path simply doesn't exist. So the first step wasn't fixing. It was filtering.

I opened the real file for each finding and checked it. If a note said “this function returns that value as-is,” I read the function myself: is that actually true, and is the caller already filtering it out? As it turned out, the four CRITICALs Fable flagged were all real bugs, with not a single false positive among them. Given how many plausible-but-wrong findings an AI audit usually spits out, that hit rate was genuinely impressive. But that's just how this round went; verifying each one by hand still wasn't a step I could skip.

This extends the collaboration principle from the last post. Dev Log #1's takeaway was “delegate the implementation, but not the judgment.” This time it stretched to “delegate the audit too, but not the verification.” What I can hand to AI is sweeping the code quickly and pulling out a pile of candidates; deciding which of them are real was still on me.

Fix #1 — the “quietly wrong” scoring bugs

The first things I touched were correctness bugs in the score pipeline. They had one thing in common. They were all quietly wrong. There was no error, the screen looked fine, the build passed — only the score was wrong. Here are three of them.

Long-horizon predictions that never resolved

In LDBD, a prediction resolves once its set window passes. For a one-week prediction, a week later it checks the price, decides whether the call was right, and assigns a score. But when this resolve job fetched prices, it only pulled “the last month or so.”

One-day and one-week predictions fell inside that window, so they were fine. But one-month, six-month, and one-year predictions always had a start date outside that one-month range. The job would look for the start-date price, fail to find it in the fetched data, and skip resolution every time, so the prediction stayed open forever. There's a cap on how many open predictions you can have, so as unresolved predictions piled up, there was less room to make new ones. Long-horizon predictions were quietly buried.
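Here is a minimal sketch of that failure mode. It is written in Python purely for illustration and is not LDBD's actual code; the 30-day window, the function names, and the data shapes are all assumptions.

```python
from datetime import date, timedelta

FETCH_WINDOW_DAYS = 30  # assumed stand-in for "the last month or so"


def fetch_recent_prices(today: date, all_prices: dict[date, float]) -> dict[date, float]:
    """Return only the prices that fall inside the fixed fetch window."""
    cutoff = today - timedelta(days=FETCH_WINDOW_DAYS)
    return {d: p for d, p in all_prices.items() if d >= cutoff}


def try_resolve(start: date, end: date, today: date, all_prices: dict[date, float]) -> str:
    """Resolve a prediction, or skip it when either price is not in the fetched data."""
    fetched = fetch_recent_prices(today, all_prices)
    if start not in fetched or end not in fetched:
        return "skipped"  # no error is raised; the next run skips it again
    return "resolved"


today = date(2026, 6, 10)
# Hypothetical history: one price per day for the past 400 days.
history = {today - timedelta(days=n): 100.0 + n for n in range(400)}

one_week = try_resolve(today - timedelta(days=7), today, today, history)
six_month = try_resolve(today - timedelta(days=182), today, today, history)
print(one_week, six_month)  # resolved skipped
```

Nothing fails loudly here. The six-month prediction simply returns "skipped" on every run, which is why the problem stays invisible until someone goes looking.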

With barely any users yet, there were few long-horizon predictions to begin with, so this bug never showed up anywhere on screen. It was a textbook case of a bug you can't see because there isn't enough data.

When the date didn't match, “resolve with whatever price”

This one was scarier. If the price for the resolve date was missing, the code was written to “use the most recent price if there isn't one.” It's the common value ?? fallback pattern (use the second value when the first is empty), except the fallback here was a price from the wrong date.

So a prediction that should have resolved on the June 10 close could, if that data was missing, get resolved on whatever was most recent — say the June 2 price. And since a resolved prediction's score is immutable, once it's locked in wrong, it stays corrupted forever.

The fix was simple: if there's no price for the exact date, don't resolve; defer to the next run. The lesson was just as clear. A ?? fallback is convenient, but quietly succeeding with the wrong value can be more dangerous than simply failing. Especially when you can't fix that value later.
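The difference between the two behaviors can be sketched like this. It is a Python illustration rather than LDBD's real code: `dict.get` with a default plays the role of the `??` fallback, and the function names and prices are made up.

```python
from datetime import date
from typing import Optional


def price_with_fallback(prices: dict[date, float], resolve_date: date) -> float:
    """The risky pattern: fall back to the most recent price when the date is missing."""
    latest = prices[max(prices)]
    return prices.get(resolve_date, latest)


def price_exact(prices: dict[date, float], resolve_date: date) -> Optional[float]:
    """The safer pattern: return None so the caller can defer to the next run."""
    return prices.get(resolve_date)


# Hypothetical data: the June 10 close has not been collected yet.
prices = {date(2026, 6, 1): 101.0, date(2026, 6, 2): 99.5}
target = date(2026, 6, 10)

wrong = price_with_fallback(prices, target)  # 99.5, which is the June 2 price
deferred = price_exact(prices, target)       # None, so the job should not resolve yet
```

The caller then treats `None` as "try again later" instead of locking in a score.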

One character froze every asset's base rate in the past

LDBD keeps a base rate for each asset: “how often has it gone up historically.” It's used to adjust how hard a prediction was. Calling “up” on an asset that rises most of the time is easy, so it earns less; calling one that rarely rises earns more.

But the code computing this base rate was reading from the oldest data first. A single sort-direction value was flipped — the database query sorted oldest-first and then read a fixed number of rows from the top. For assets backfilled with a decade-plus of data, the base rate got frozen in an old market regime and never updated, no matter how much recent data piled up.

Since this value feeds the score adjustment for every prediction, one tiny sort-direction value (oldest-first where it should have been newest-first) was quietly seeping into every asset's scores. The most trivial-looking bug had the widest reach.
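To make the effect concrete, here is a toy version in Python. It is a sketch under assumptions, not the real query: the 200-row window and the synthetic price history are invented, and the actual code sorts in the database rather than in memory.

```python
def base_rate(daily_up_flags: list[bool], window: int, newest_first: bool) -> float:
    """Share of "up" days among the first `window` rows after sorting.

    `daily_up_flags` is given in chronological order (oldest first).
    """
    rows = list(reversed(daily_up_flags)) if newest_first else daily_up_flags
    sample = rows[:window]
    return sum(sample) / len(sample)


# Hypothetical asset: 1,000 old days that rose 75% of the time,
# followed by 250 recent days that rose only 20% of the time.
old_regime = [True, True, True, False] * 250
recent_regime = [True, False, False, False, False] * 50
history = old_regime + recent_regime

buggy = base_rate(history, window=200, newest_first=False)  # 0.75, old regime only
fixed = base_rate(history, window=200, newest_first=True)   # 0.2, latest 200 days
```

However many recent rows are appended, the oldest-first version keeps returning the same number, because the first 200 rows never change.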

What these bugs had in common

The common thread is what I said up top. None of them throw an error, the screens all look fine, and users never know. They're the kind of thing you'd never catch without measuring or auditing. With barely any users there was no real harm — but that's exactly why I confirmed again that it's cheap to fix while the data is still small. Had I found the same bugs after tens of thousands of predictions had piled up, there'd have been no way to undo them.

Fix #2 — but that fix created a worse bug

This was the part I learned the most from.

While fixing the first bug above (long-horizon predictions never resolving), Opus changed how prices were fetched. Instead of fetching fresh from an external source every time, it switched to reading the start- and end-date prices directly from the price table we'd already stored. It looked clean, the build passed, no type errors.

But when I had Fable 5 review that fix, the verdict came back: it had created a worse bug than the original. That “fix, then get audited again” loop I mentioned earlier had just proven its worth.

The problem was a concept called the adjusted close (adj_close). A stock price has a plain close, and a separate adjusted close that's retroactively recalculated to reflect events like dividends and splits. If a stock does a 10-for-1 split (one share becomes ten), the price becomes a tenth — and all the past prices get recomputed to a tenth too, to keep things consistent. LDBD's return calculation is based on this adjusted close.

The key point in a return calculation is that the start-date price and the end-date price have to come from the same adjustment basis. Both on the latest adjusted-close basis is fine; both on the raw close of their own day is also consistent. The trouble starts when the two are on different bases. The adjusted close changes its past values every time there's a split or dividend, but the new approach only refreshed “the last few days” when storing prices, so a start-date price stored long ago was frozen on the adjustment basis from that time. The start date sat on an old adjusted-close basis while the end date was on the latest, so the two were numbers on different bases. Mix them into a return and a prediction straddling a split could resolve to nonsense like -90%.
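A worked example shows how mixing bases produces that kind of number. The prices below are hypothetical, and the Python is only there to make the arithmetic checkable.

```python
SPLIT_RATIO = 10  # a 10-for-1 split, as in the example above

# Hypothetical stock: closes at 1000 on the start date, splits 10-for-1,
# then closes at 105 (post-split) on the end date. The true move is +5%.
raw_start, raw_end = 1000.0, 105.0

# Adjusted closes as seen after the split: past prices are divided by the ratio.
adj_start_latest = raw_start / SPLIT_RATIO  # 100.0
adj_end_latest = raw_end                    # 105.0

# A start price stored before the split stays frozen on the old basis.
adj_start_stale = raw_start                 # 1000.0


def pct_return(start: float, end: float) -> float:
    """Simple percentage return from start to end."""
    return (end - start) / start * 100


consistent = pct_return(adj_start_latest, adj_end_latest)  # both on the latest basis
mixed = pct_return(adj_start_stale, adj_end_latest)        # two different bases
print(f"{consistent:+.1f}% vs {mixed:+.1f}%")  # +5.0% vs -89.5%
```

The same prediction comes out as a 5% gain or an 89.5% loss depending only on when the start price was stored.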

The original approach (fetching fresh every time) read the start and end from the same response at once, so the adjustment basis matched automatically. The change that looked “more efficient” had actually broken that consistency. Luckily it was before the merge, so there was no real harm.

The final fix was a compromise. Keep fetching fresh externally, but widen the fetch range by timeframe — a month for 1d and 1w, three months for 1m, a year for 6m, two years for 1y. That solves the original first bug (long-horizon predictions not resolving for lack of range) at the same time, and keeps the adjusted-close basis consistent. The good parts — exact date matching and the idempotency guard, which keeps the same job from double-counting a score if it runs twice — stayed. That said, the guard is an interim step that only blocks double-counting; the full fix (wrapping resolve and scoring in a single transaction) is left for next time.
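The two pieces of that compromise could look roughly like this in Python. This is a sketch with assumptions throughout: the exact day counts, the `Scoreboard` class, and the in-memory set are stand-ins, since the post does not show how LDBD stores any of this.

```python
from datetime import date, timedelta

# Fetch range per timeframe, following the ranges listed above.
# The exact day counts are assumptions for this sketch.
LOOKBACK_DAYS = {"1d": 31, "1w": 31, "1m": 92, "6m": 366, "1y": 731}


def fetch_start(timeframe: str, today: date) -> date:
    """Earliest date to request from the external price source."""
    return today - timedelta(days=LOOKBACK_DAYS[timeframe])


class Scoreboard:
    """Toy in-memory stand-in for the score store, with an idempotency guard."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = {}
        self.scored_predictions: set[str] = set()

    def apply_score(self, user: str, prediction_id: str, score: float) -> bool:
        """Add a score once per prediction; a repeated run is a no-op."""
        if prediction_id in self.scored_predictions:
            return False
        self.scored_predictions.add(prediction_id)
        self.totals[user] = self.totals.get(user, 0.0) + score
        return True


today = date(2026, 6, 10)
board = Scoreboard()
first = board.apply_score("alice", "pred-1", 12.5)
second = board.apply_score("alice", "pred-1", 12.5)  # the same job running twice
```

A guard like this only stops the second write. It does nothing if the job dies between marking a prediction resolved and adding the score, which is the gap a single transaction would close.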

Two lessons here.

  1. The heart of working with AI is still verification. Dev Log #1's “delegate implementation, not judgment” gained another layer here: “a fix made by AI also gets reviewed once more by another AI.” The AI that wrote the code, the AI that reviewed it, and me: each of us caught a different problem, one layer apiece. Drop any single layer and that -90% corruption could have been merged as-is.
  2. A passing build doesn't mean the code is correct. A passing build means the code compiles, not that the domain meaning is right. Code that type-checked and built fine was wrong about a domain fact: that the adjusted close is retroactively adjusted. A compiler can't catch that. In the end, what the code actually means in the real world is something only a person can judge.

Fix #3 — security: reads were locked down, but a write bypass wasn't

The most serious item from the audit was on the security side. It's already patched and in production, so I'll only sketch the broad shape.

LDBD puts a rule called row-level security (RLS) on the database. In plain terms, it's a mechanism where the database itself controls who can read and write which data. A normal prediction submission goes through the server, and the server fills in values like the start time and the score multiplier itself, values the user shouldn't be able to touch.

The problem was that a direct write path bypassing that server validation wasn't closed off. In principle, values the server was supposed to compute could come in as user input, and that could distort leaderboard scores. Reads were carefully locked down, but the write path to the same table wasn't closed tightly enough.

The fix was small. After first confirming that exactly one place on the server actually writes predictions, I blocked direct writes with a database policy — the normal server path keeps working untouched. It struck me that the fix for the most serious security hole was closer to a single line of policy than a grand rewrite. Sometimes a small, precise fix beats a big rewrite.

The lesson worth keeping is “the most common RLS trap.” You set read policies carefully, but it's easy to forget that writes need their own lock on the same table. Even when you know in your head that reads and writes have to be locked down separately, in practice you only cover one side.

Fix #4 — the time logic was too naive about the exchange calendar

The code that decides when the market is open or closed had two problems.

One was daylight saving time (DST) handling. To compute U.S. market time, a DST formula was baked into the code — “starts the second Sunday of March, ends the first Sunday of November” — and the implementation was off by a few days, so during the switch weeks the open/close decision was wrong. I dropped that formula and just switched to reading the market's local time via timezone conversion. The runtime handles when DST flips, so there was no reason to compute it by hand.
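As a sketch of the timezone-conversion approach, here is how it might look with Python's standard `zoneinfo` module. The function name is hypothetical, holidays are ignored, and LDBD's own runtime and code may differ.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")
REGULAR_OPEN = time(9, 30)   # regular U.S. equity session, local time
REGULAR_CLOSE = time(16, 0)


def is_regular_session(now_utc: datetime) -> bool:
    """True during regular U.S. trading hours on a weekday (holidays ignored)."""
    local = now_utc.astimezone(NEW_YORK)
    if local.weekday() >= 5:  # Saturday or Sunday
        return False
    return REGULAR_OPEN <= local.time() < REGULAR_CLOSE


# 14:00 UTC is 09:00 in New York in winter (UTC-5) but 10:00 in summer (UTC-4).
winter = is_regular_session(datetime(2026, 1, 14, 14, 0, tzinfo=timezone.utc))
summer = is_regular_session(datetime(2026, 7, 15, 14, 0, tzinfo=timezone.utc))
print(winter, summer)  # False True
```

No DST rule appears anywhere in the code. The timezone database decides the offset for each date, so the switch weeks need no special handling.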

The other was subtler. For the hour or two between the close and the day's price collection, the reference point stayed at “yesterday.” In that gap, submitting a prediction based on today's already-public close was advantageous: betting when you already know the result. I fixed the reference point to compute the previous trading session using both the market calendar and the current clock.
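One way to read "calendar plus clock" is sketched below in Python. The post does not spell out the exact rule, so this is an interpretation: the 16:00 New York close, the function names, and the `closed_days` parameter are all assumptions.

```python
from datetime import date, datetime, time, timedelta
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")
CLOSE = time(16, 0)  # assumed regular closing time, local


def is_trading_day(d: date, closed_days: frozenset[date]) -> bool:
    """Weekdays count as trading days unless listed as closed."""
    return d.weekday() < 5 and d not in closed_days


def last_completed_session(now: datetime, closed_days: frozenset[date] = frozenset()) -> date:
    """Most recent trading day whose close has already passed."""
    local = now.astimezone(NEW_YORK)
    day = local.date()
    # Today only counts once the closing time has passed.
    if not (is_trading_day(day, closed_days) and local.time() >= CLOSE):
        day -= timedelta(days=1)
        while not is_trading_day(day, closed_days):
            day -= timedelta(days=1)
    return day


before_close = last_completed_session(datetime(2026, 6, 10, 15, 0, tzinfo=NEW_YORK))
after_close = last_completed_session(datetime(2026, 6, 10, 16, 30, tzinfo=NEW_YORK))
print(before_close, after_close)  # 2026-06-09 2026-06-10
```

With the clock in the picture, the reference session moves forward at the close itself rather than whenever price collection happens to run.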

Honestly, this was because I'd been too naive about the exchange calendar when I first designed this time logic. I knew about weekends but not holidays, and didn't really handle DST. And digging into this bug surfaced a bigger gap: LDBD still doesn't have a complete holiday calendar (U.S. federal holidays, Korean lunar holidays). Right now it only knows weekends plus manually registered exceptions. That's a data-pipeline job, so I set it aside. Digging into one bug often surfaces a bigger unbuilt thing; that's common too.

Collaboration patterns I saw again

The patterns from Dev Log #1 repeated this time in a different shape.

Delegate the audit, but not the verification

Having AI sweep the whole codebase and pull out candidates was incredibly fast and useful. A review that would've taken me days solo was split across five tracks and finished in one pass. But the moment you take that report as the truth, you fall into a trap. Sorting out what's actually a problem, against the real code, was still something a person had to do.

Audit and review by Fable, implementation by Opus — a split driven by cost

Why not have Fable do the fixes too? The honest reason was cost. Fable 5 burns through a lot of tokens — the units that measure how much an AI reads and writes, and what billing is based on — for the same task. So I gave Fable only the highest-value work, reading the whole codebase to pinpoint problems and judging whether a fix was right, and handed the actual code changes to Opus. If I'd given Fable the whole thing end to end, it might have done better, at a higher token cost. But at this stage, this split made the most sense for the money.

Looking back, the setup resembled a split between a senior and a junior developer. The experienced senior (Fable) sweeps the whole codebase, points out problems, and reviews whether a fix is right; the junior (Opus) implements the fix in that direction. Since the senior's time is expensive, you spend it on high-value work like judgment and review, and hand the labor-heavy implementation to the junior. Of course, calling Opus a junior undersells a model that built all of LDBD — but put the two side by side and the roles split that way naturally.

And the split had an unexpected side effect. I'd believed the code was fairly solid, having built LDBD with Opus, but pointing Fable 5 at the same code turned up a pile of problems hidden behind that belief. Blind spots that one model couldn't see in its own code were obvious to a different model. Fable's review catching Opus's -90%-corruption fix is the same story. With a single model doing the whole thing, that -90% would have gone straight to production.

The most trivial-looking change can have the widest reach

One character of sort direction seeped into every asset's scores; one line of policy closed the most serious security hole. A change's blast radius isn't proportional to its size. The smaller a line looked, the more carefully I had to look at it.

Wrapping up — the bugs you don't know about unless you measure

Most of the bugs I fixed this round were quiet ones. They threw no errors and broke no screens. With barely any users right now there was no real harm, which is exactly why I'm glad I checked now. The cheapest time to fix is while the data is still small.

The conclusion this work left me with was simple. Using more AI didn't make human verification matter less — it made it matter more. The more you hand implementation to AI, the more layers of verification you need. The difference this time was that part of that verification could be handed to another AI. Opus implements, Fable audits, and I sit between them, deciding what's actually a problem. That triangle worked well this time.

What's a shame is that one corner of that triangle, Fable 5, isn't available right now. Luckily the most dangerous holes were closed before that. Which blind spots the next model lights up is a story for another post, when the time comes.


If you want to put your own bot or predictions on LDBD, head to /settings to create an identity and (if needed) an API key. Sign up on the main page; it's free to use.

Coverage on the Fable 5 access suspension: Anthropic's statement, CNBC, Bloomberg. (As of the time of writing; the situation may change.)

Tags: claude-collaboration, fable-5, code-audit, debugging, security, dev-log