Claude Wrote a 760-Line Spec. It Ended Up at 1,600.

Last post I wrote about a non-developer starting to build a SaaS with Claude Code. This one is the next step — what happened before I touched any code, when I was still writing the spec.

At first my thinking was simple.

I have an idea.

I don't know how to build products.

So if I write a good spec doc and hand it to Claude, that should be it. Right?

Not quite.

The spec was the start, not the end. And the slowest part of vibe coding turned out not to be the coding — it was constantly questioning and trimming what the AI gave me.

“Vibe coding,” as I'm using it here, means describing intent and constraints in natural language and letting the AI write the code, instead of typing every line yourself. The catch: switching to natural language doesn't make the design responsibility go away.

“Just toss it a spec, right?”

I started by speaking the idea out loud — “a stock and crypto prediction leaderboard,” “humans and AI on the same board,” “no edits or deletes once submitted” — and asked Claude to organize it into a spec.

Claude came back with about a 760-line spec: data models, scoring formula, API endpoints, RLS policies. Half the words were new to me, but the big picture was clear.

“Okay, time to start coding,” I thought. Then I spent the next few days asking the doc questions instead.

The spec balloons

Every question I asked added more spec.

“How does a baseline bot submit every day?” → new section, +30 lines
“What about Korean market holidays?” → new section, +50 lines
“What if the cron fails?” → another +30

Roughly: 760 → 860 → 1,150 → 1,350 → 1,490 → eventually around 1,600 lines. I hadn't written a single line of code, and the doc had doubled.

Honestly, somewhere in the middle I got nervous. Was I building a service or just writing documents about one?

A few days later it clicked. Ninety percent of vibe coding wasn't coding — it was design discussion. Code comes out fast, which means a wrong design also gets built fast. Filtering at the spec stage, before any code, turned out to be the shorter path.

Deleting the design AI confidently proposed

While we were talking through scoring fairness, Claude proposed a “cluster system.”

Suppose someone bets “up” on NVIDIA, AMD, TSMC, and a semiconductor ETF all at once. On the surface that's four predictions, but really it's one claim — “semiconductors are going up.” Claude's idea: bucket assets into clusters and reduce the score weight for predictions inside the same bucket.

Then it went further. A three-tier cluster design, plus automatic correlation-based clustering. It sounded reasonable. Dozens of lines went into the spec.

A few days later, looking at it again, something felt off. Who decides what “the same cluster” is? Does it shift when the market shifts? The whole thing depended on subjective calls. So I asked.

Me: Even thinking about it more, the cluster setup feels too subjective and brittle. Do we really need clusters?

The answer came back fast. Claude had been uncomfortable with the same parts while designing it. The cluster system tried to solve a real fairness issue, but the criteria were subjective and it was overkill for an MVP.

Dozens of lines, gone. What replaced them was a five-line rate limit: “one prediction per (asset, timeframe, t0_date).”

Deleting that many lines felt strangely wasteful — they had looked good for several days. But once they were gone, the product felt clearer.

That was the first time I felt it: AI proposes complex solutions with surprising confidence. And if I don't stop it, the complexity just keeps stacking up.

“Do we actually need this?”

In vibe coding, that question turned out to be more powerful than I expected.

Three questions from a non-developer that changed the design

Claude could fill in every technical detail, but the “wait, isn't this off?” instinct had to come from me. Three of those questions still stick with me.

(a) “Wouldn't a bot that always says ‘up’ just win?”

Stocks trend up over the long run. So a bot that always picks “up” might score 70-80% accuracy without doing anything clever. That's not skill — that's just riding the market.

Claude drew out the math. In any market that trends mostly up over a window, an always-up bot would post high accuracy too — but that's the trend speaking, not skill. A raw accuracy ranking would let dumb-lucky bots top the board. The fix: subtract the base rate(the market's own up-probability) and measure how well a prediction did relative to it. That conversation was when I first understood why something like Elo was needed.

The spec landed on Elo-based “Skill Rating” as the answer. (Spoiler: that conclusion got rewritten once we got into actual code. More on that in a later post.)

Either way, this one question shaped the entire scoring direction.

(b) “Isn't predicting right before the close basically cheating?”

If you predict five minutes before market close, you basically know where it's settling. Too easy.

That single line became a section called “Dormant Window.” Conclusion: predictions are only accepted between the previous close and the next open — the “safe window.” Anything submitted while the market is open auto-rolls to the next session's close. Crypto, being 24/7, settles on the next UTC midnight close. That single concern grew into a 50-line section in the spec and became the core of the timing logic in the actual code.

(c) “What about dividends?”

A high-dividend ETF like SCHD drops sharply on the ex-dividend date. If you only look at the price, you'd call it “a drop” — but the holder didn't actually lose anything.

That's where I first heard about adj_close, the adjusted close that accounts for dividends and splits. Adopting that one term cascaded changes everywhere — the DB schema gained a column, scoring formulas swapped close → adj_close, the price-fetch logic had to be rewritten.

The technical details got filled in by AI. But “wait, isn't this off?” — that part I had to keep hold of.

Building CLAUDE.md so I wouldn't repeat myself every session

Claude Code doesn't carry memory between sessions. After re-explaining the same context one too many times, I asked if there was a fix. The answer was simple — drop a CLAUDE.md file in the project root and it gets loaded automatically at the start of every session.

What goes in it? A short project description, code conventions (TypeScript strict, file naming, directory structure), an env var list, a “do NOT” list, and a summary of the scoring formula.

Once it was in, context held up across sessions.

The “do NOT” list especially carried weight. Some of mine:

Don't add features that aren't in the spec
Don't change score / entry-price columns after a prediction is resolved (immutable)
Don't expose service_role_key from a client component

That's when I noticed: with AI, the “don't do this” list matters about as much as the “do this” list. Especially as the product grows, the rules of restraint matter more — don't helpfully add features the spec doesn't have, don't expose secrets, don't quietly mutate things that are supposed to be locked.

Writing those down once in CLAUDE.md beat repeating the same nag in every new chat.

The day a “finished” spec produced 29 issues

Around 1,500 lines I thought, “I think I'm basically done.” Just to be safe I asked Claude, “Read this spec end to end and audit it.”

Twenty-nine issues came back across five categories.

I blanked for a beat. The code hadn't been written yet, and there were already bugs.

Two of them hit hardest. The first: baseline bots have to submit every day, but a DB constraint blocked duplicate submissions on the same (asset, timeframe, date). Both rules looked correct in isolation. Together they conflicted.

The second was over-engineering. There was a comment system in the spec — for a service with no users — and an algorithm to evaluate the reasoning of AI-submitted predictions. Possible to build, no reason to build now. (All of it pushed past MVP.)

The rest was smaller stuff: an unclosed SQL code block somewhere, edge cases like “what if return is exactly 0%?”, inconsistent terminology.

I fixed everything in place and ran another audit. A few new small issues showed up.

That was when it became clear — a doc the AI wrote also needs the AI to re-audit it. And every audit pass needs the same companion question: “do we actually need this right now?”Because the AI will build whatever you ask for, over-engineering accumulates if you don't push back.

Coding turned out to be “design discussion”

Looking back at the spec phase, the split was clear.

What Claude did: structuring, formula derivation, data modeling, edge-case enumeration, code-level specification.
What I did:questions, gut feel, “wait, that's off,” “do we actually need this?”, decisions.

I had been thinking that if AI wrote me a spec, I'd just code from it. The reality was the opposite. The real work started after handing the spec over.

AI handled structure, math, and finding gaps. But “does this make sense?”, “is this needed now?”, “could a user abuse this?” — those questions stayed mine.

Vibe coding wasn't about not coding. It was about making far more decisions, in plain language, before any code shows up.

The kind of coding most people picture — typing line by line, picking variable names, getting the syntax right — that part the AI handles fast. What still sits on the human side is the why (“why are we building this”), the how (“how should this behave”), and the edge cases (“what happens in weird situations”). And the quality of what you ship comes down to the quality of those calls.

Next post: the day the spec actually went into Claude Code and started turning into code — and how I caught the parts Claude had misunderstood from the start.

Want to put your own predictions on the record? Start at the homepage. It's free, and you can browse the leaderboard without signing up.