Prediction Log #1 — Can AI Call the Market? One Month with Six Bots

While I was building LDBD, one question kept nagging at me. Can AI actually call the market? If it can't, that's a problem for the product. If it can, that's a problem of a different kind — anything that looks too good is suspicious.

So I set up three models (Claude · Gemma · ChatGPT) on two timeframes (1 day, 1 week), six bots in total, and ran them in parallel for more than a month. Gemma and ChatGPT used the same watchlist; Claude used almost the same one, with a few different assets baked into its scheduled-task prompt. Headline first: by LDBD's main score, the first-month winner among the six AI bots was Gemma 4, free and running on my laptop, on the weekly timeframe. But “winner” here means first place inside the six-bot field, not first against the market or any always-bull rule-based baseline. Against the all-time score of the best always-bull bot it's a statistical tie (0.007 points apart). Against the best always-bull bot on the same 30-day window, every AI bot lost. One month is one slice of market weather, which is too early to call anything.

This is the first entry in the Prediction Log series — separate from the LDBD dev log. Each entry will be one improvement cycle — hypothesis, change, result — written down honestly. This first one is mostly setup, figuring out where the bots stand before I start changing things.

The three bots, briefly

I've written a full setup post for each of these in the dev log. Here's the shortest version of what each bot sees and how it answers.

1. Claude (Claude Desktop + MCP)

This bot runs from my own Claude Desktop subscription, wired to LDBD through the MCP connector and triggered by a recurring scheduled task. Claude uses its built-in web search to read news, hits LDBD via MCP for prices and community sentiment, and submits the prediction itself. Full walkthrough in Build a Daily Prediction Bot with Claude Desktop.

2. Gemma 4 (local, MLX)

gemma-4-26b-a4b-it-4bit running natively under Apple MLX on an M5 Pro Mac. Pulls five news headlines from yfinance, packs them into the prompt alongside LDBD prices and sentiment, and parses a JSON reply. Zero LLM API cost; the model itself runs offline. A Free Local-LLM Prediction Bot on a Laptop walks through the install end-to-end.

3. ChatGPT (gpt-5.4 + web_search)

Direct calls to the OpenAI Responses API. GPT uses the web_search tool to find current news and returns structured outputs (json_schema strict), so there's no yfinance call and no regex parsing. Build an LDBD Bot with the ChatGPT API has the code and the launchd setup.
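The practical difference between the Gemma path and the ChatGPT path is the parsing step: a local model answers in free text, so the JSON has to be dug out of the reply. Below is a minimal sketch of what that could look like. The field names `direction` and `reasoning` are assumptions for illustration, not LDBD's actual schema.

```python
import json
import re


def parse_reply(text: str) -> dict:
    """Pull the first JSON object out of a free-text model reply."""
    # Greedy match from the first "{" to the last "}" so nested braces survive.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in reply")
    data = json.loads(match.group(0))  # raises ValueError on malformed JSON
    direction = str(data.get("direction", "")).lower()
    if direction not in ("up", "down"):
        raise ValueError(f"unexpected direction: {direction!r}")
    return {"direction": direction, "reasoning": str(data.get("reasoning", ""))}


reply = 'Sure, here is my call: {"direction": "UP", "reasoning": "momentum"} Hope that helps.'
print(parse_reply(reply))  # {'direction': 'up', 'reasoning': 'momentum'}
```

With strict structured outputs the API guarantees the shape of the reply, so only the `json.loads` call would be needed on that path.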

Each model runs as two separate identities — one for daily (1d) predictions and one for weekly (1w). The view from one day out and one week out is different enough that I expected the two to diverge. They did.

The setup

Six bots: Claude daily / Claude weekly / Gemma daily / Gemma weekly / ChatGPT daily / ChatGPT weekly.

Watchlist: Gemma and ChatGPT share the same five: VOO (S&P 500), QQQ (NASDAQ 100), GLD (gold), BTC-USD (Bitcoin), and 069500.KS (KODEX 200, a Korean ETF tracking the KOSPI 200). Claude's list is almost identical but a few assets differ because of how its scheduled-task prompt is written. There are still plenty of same-asset, same-moment overlaps, so head-to-head comparisons are meaningful. Now and then a bot submitted something outside its watchlist, but not often.

Period: the earliest bot started on 2026-04-19 (Claude weekly) and the latest on 2026-05-04 (ChatGPT daily). Data was pulled on 2026-06-04. The bots have been running for different stretches, but that head start matters less as the data piles up.

500 predictions submitted, 426 resolved. Only resolved predictions — ones whose outcome has been settled — are in the analysis. (A nightly cron settles them using closing prices from yfinance and CoinGecko.)

All six AI bots are running the simplest possible configuration. No prompt engineering, no feature engineering. Each bot gets 30 days of prices and a few news headlines, and is told to pick a direction. This is the series' starting line.

And to sanity-check the AI bots against trivial strategies, LDBD also runs 18 rule-based baseline bots: six assets × three strategies (always-bull, always-bear, random). These exist so that “AI does better than always saying up” becomes a question that the code can answer. To keep the two groups straight, in this post the AI bots are Claude/Gemma/ChatGPT × daily/weekly (6 total), and the rule-based baselines are the 18 always-bull / always-bear / random bots.
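To make the six × three grid concrete, here is a small sketch of how 18 such baselines could be generated. The asset keys are inferred from the baseline names quoted in this post (qqq_bull, kosdaq_bear, btc_bear and so on); the function names and the registry layout are assumptions, not LDBD's code.

```python
import random
from itertools import product

# Asset keys inferred from baseline names in this post; the real list may differ.
ASSETS = ["voo", "qqq", "gld", "btc", "kospi", "kosdaq"]


def always_bull(rng: random.Random) -> str:
    return "up"


def always_bear(rng: random.Random) -> str:
    return "down"


def coin_flip(rng: random.Random) -> str:
    return rng.choice(["up", "down"])


STRATEGIES = {"bull": always_bull, "bear": always_bear, "random": coin_flip}

# One bot per (asset, strategy) pair: 6 x 3 = 18.
BASELINES = {
    f"{asset}_{name}": fn
    for asset, (name, fn) in product(ASSETS, STRATEGIES.items())
}

rng = random.Random(0)
print(len(BASELINES))              # 18
print(BASELINES["qqq_bull"](rng))  # up
```

None of the strategies looks at prices or news, which is the point: any score they earn comes from market drift or luck.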

The main score: adjusted average

The score LDBD displays most prominently on the leaderboard and profile pages isn't a plain average — it's an adjusted average:

adjusted = total_score / (resolved_count + 20)

The +20 in the denominator exists to keep small samples from leading the leaderboard on luck alone. If a brand-new bot's first prediction scores +1.5, the plain average puts it instantly at the top with +1.5. With 20 imaginary zero-score predictions in the denominator, the same bot starts at 1.5 / 21 ≈ +0.07 and has to keep delivering to climb. Once the sample is large enough (N well past 20), the adjustment fades and the two averages converge.
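The same arithmetic as a short sketch. The function names are illustrative; only the formula and the +20 come from the text above.

```python
PRIOR_COUNT = 20  # the "+20": imaginary zero-score predictions


def plain_average(total_score: float, resolved_count: int) -> float:
    return total_score / resolved_count


def adjusted_average(total_score: float, resolved_count: int) -> float:
    return total_score / (resolved_count + PRIOR_COUNT)


# A brand-new bot whose single prediction scored +1.5.
print(round(plain_average(1.5, 1), 3))     # 1.5
print(round(adjusted_average(1.5, 1), 3))  # 0.071

# The same per-prediction quality sustained over 200 predictions.
print(round(adjusted_average(1.5 * 200, 200), 3))  # 1.364
```

At 200 predictions the adjusted score has recovered to about 91% of the plain one (200 / 220), which is the convergence described above.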

The per-prediction score change uses this formula:

(correct ? +1 : −1) × |return|^0.7 × T_weight × contrarian × 100

T_weight is the timeframe weight (1d=1.0, 1w=0.75, 1m=0.5). contrarian is a coefficient that rewards calls against the asset's historical direction (range 0.5–2.0). Calling “up” on an asset that historically goes up most of the time is less informative, so the points get docked. Calling “down” on the same asset and being right pays more.
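A direct transcription of the formula, as a sketch. Two things here are assumptions: that the return is expressed as a fraction (0.01 for a 1% move), and that the contrarian coefficient arrives already computed, since this post doesn't say how it is derived.

```python
T_WEIGHT = {"1d": 1.0, "1w": 0.75, "1m": 0.5}


def prediction_score(correct: bool, ret: float, timeframe: str, contrarian: float) -> float:
    """Score change for one resolved prediction.

    ret: the asset's return over the prediction window, as a fraction (assumption).
    contrarian: precomputed coefficient, expected in [0.5, 2.0].
    """
    if not 0.5 <= contrarian <= 2.0:
        raise ValueError("contrarian coefficient is expected in [0.5, 2.0]")
    sign = 1.0 if correct else -1.0
    return sign * abs(ret) ** 0.7 * T_WEIGHT[timeframe] * contrarian * 100


# A correct daily call on a 1% move with a neutral coefficient: about +3.98.
print(round(prediction_score(True, 0.01, "1d", 1.0), 2))
# The same call, wrong: the mirror image.
print(round(prediction_score(False, 0.01, "1d", 1.0), 2))
```

The 0.7 exponent compresses large moves: under these assumptions a 4% move scores about 2.6 times a 1% move, not 4 times.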

Every table and chart in this post uses this adjusted average as the headline number. The plain average is shown next to it as context.

First-month results — the six bots

Sorted by adjusted average.

Bot	Resolved	Accuracy	Adjusted avg	Plain avg	Total score
Gemma weekly	29	65.5%	+0.828	+1.400	+40.6
ChatGPT weekly	21	57.1%	+0.597	+1.165	+24.5
ChatGPT daily	100	60.0%	+0.576	+0.691	+69.1
Claude weekly	23	73.9%	+0.444	+0.830	+19.1
Claude daily	125	51.2%	−0.015	−0.017	−2.1
Gemma daily	128	50.8%	−0.034	−0.039	−5.1

The plain average and the adjusted average diverge most for the weekly bots. Gemma weekly's plain +1.400, for example, collapses to an adjusted +0.828 once you add 20 to a sample size of 29. For Claude daily and Gemma daily (N=125 and 128, totals near zero), the two numbers are within 0.005 of each other; ChatGPT daily (N=100) still drops from +0.691 to +0.576. The smaller a bot's sample, the less its plain average can be trusted and the harder the +20 pulls it down. That's the big caveat behind the rest of this section.
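The table's two averages can be rebuilt from just the resolved count and the total score. A small sketch follows; the totals are the rounded figures from the table, so a recomputed value can differ from the published one in the third decimal.

```python
# (bot, resolved, total score) copied from the table above.
ROWS = [
    ("Gemma weekly", 29, 40.6),
    ("ChatGPT weekly", 21, 24.5),
    ("ChatGPT daily", 100, 69.1),
    ("Claude weekly", 23, 19.1),
    ("Claude daily", 125, -2.1),
    ("Gemma daily", 128, -5.1),
]


def summarize(rows, prior=20):
    """Return {bot: (plain, adjusted, share of the plain average kept)}."""
    return {
        bot: (total / n, total / (n + prior), n / (n + prior))
        for bot, n, total in rows
    }


for bot, (plain, adjusted, kept) in summarize(ROWS).items():
    print(f"{bot:15s} plain={plain:+.3f} adjusted={adjusted:+.3f} kept={kept:.0%}")
```

The last column is N / (N + 20): the weekly bots keep only 51–59% of their plain average, the daily bots 83–86%.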

Score-by-bot

Three things stand out.

The weekly bots take 1st, 2nd, and 4th. Gemma weekly · ChatGPT weekly · Claude weekly, with ChatGPT daily slipping into 3rd between them. The weekly bots have small samples, so the gap between their plain and adjusted averages is wide — but they still finish well above the two losing daily bots.
The free, local Gemma is on top. Adjusted +0.828, accuracy 65.5%. It beats ChatGPT weekly (+0.597), which is paying for OpenAI API + web_search calls, and Claude weekly (+0.444), which runs through my Claude Desktop subscription. Twenty-nine predictions is a small enough sample that the order can move.
Claude weekly has the highest accuracy (73.9%) but only finishes fourth on adjusted average. It wins small and loses big: wrong on volatile assets like Korean equities, right on the steadier US indices.

Across time

Below is each bot's running adjusted average, plotted week by week from its start date.

The weekly bots all post strong positive numbers in their first few weeks and then drift toward the mean — classic regression to the mean. The daily bots oscillate around zero from the start. ChatGPT weekly's first measured week looks like a +1.416 spike — that's small-sample noise when you've only got a handful of weekly predictions logged.

So is the AI actually better than “always up”?

This is the question that matters. The AI bots read news, look at prices, write reasoning — but do they actually outperform a bot that just calls “up” every time? LDBD's 18 rule-based baselines exist for exactly this check. An earlier post called this out as the “couldn't you just always say up and win?” problem; these baselines are that question turned into code.

I compare against the 18 rule-based baselines two ways. All-time averages are stable but include years of history the AI bots never saw. The same 30-day window is a fair comparison but rests on a small sample and one slice of market weather. Each view is biased in a different way, so it's worth seeing both.

First, the all-time numbers. The 18 rule-based baselines were backfilled across several years of historical prices and have kept submitting since (each one now has 5,000–7,000 resolved predictions). Their averages reflect long-run market drift, not just the last month.

Strategy	All-time accuracy	Adjusted avg (mean of 6)	Best of 6
bull (always up)	57.2%	+0.630	qqq_bull +0.821
random (50/50)	50.2%	−0.143	kosdaq_random −0.005
bear (always down)	42.8%	−0.954	kosdaq_bear −0.420

Stocks and indices drift up over the long run, which is enough for “always up” bots alone to average +0.630. That's the real baseline the AI bots have to clear. The strongest always-bull bot is qqq_bull at +0.821, and the only AI bot to nose ahead is Gemma weekly at +0.828 — a 0.007-point margin. Given the sample-size mismatch (29 vs 7,416), I can't call that anything more than a tie.

Plot them on the same axis and the picture is clear: Gemma weekly and the top four always-bull bots (qqq, kospi, voo, gld) sit in essentially the same band. The AI bot does edge ahead, but the edge is razor-thin.

But baselines look different on the same 30 days

All-time averages are steady, but the AI bots haven't lived through that whole history. So I re-ran the 18 rule-based baselines on the same 30-day window the AI bots actually saw (2026-05-04 to 2026-06-04). Each AI bot only submits on one timeframe (1d or 1w), while the baselines submit on 1d, 1w, and 1m — so for a fair comparison the baselines also get split by timeframe.

First, the 30-day baseline averages by strategy.

Strategy	Resolved (1d+1w+1m)	30-day accuracy	30-day adjusted avg (mean of 6)	All-time, for comparison
bull	243	51.4%	+0.058	+0.630
random	243	48.1%	−0.579	−0.143
bear	243	48.6%	−0.700	−0.954

The always-bull edge nearly disappears on the 30-day window (+0.058). Always-bear, on the same window, is less bad than its all-time number (−0.700 vs −0.954). In other words, the last month was not the one-way bull market the all-time numbers suggest — some Korean assets and Bitcoin actually drifted down, so the bear bots aimed at those assets scored well.

The more revealing cut is by timeframe. In the all-time table above, each baseline pooled 1d + 1w + 1m into a single number; for a fair comparison against the AI bots (which only do one timeframe each), I have to split them. Below are the top 5 rule-based baselines on the 30-day window, split by timeframe.

30-day rule-based baseline top 5 — 1d (vs AI daily)

Bot	n	Accuracy	Adj avg
kosdaq_bear	16	75.0%	+1.356
kospi_random	17	64.7%	+1.320
kospi_bull	17	70.6%	+1.219
kosdaq_random	16	62.5%	+0.625
qqq_bull	20	65.0%	+0.607
ChatGPT daily (best AI)	100	60.0%	+0.576

30-day rule-based baseline top 5 — 1w (vs AI weekly)

Bot	n	Accuracy	Adj avg
kospi_bull	17	76.5%	+3.051
btc_bear	25	76.0%	+2.471
kosdaq_bear	17	70.6%	+1.551
qqq_bull	17	70.6%	+1.242
gld_bear	17	76.5%	+1.008
Gemma weekly (best AI)	29	65.5%	+0.828

And here's what that does to the AI vs. baseline comparison.

Top AI daily is ChatGPT daily (+0.576). The 30-day top 5 rule-based baselines on 1d sit in +0.607 to +1.356 — ChatGPT daily doesn't make the top 5.
Top AI weekly is Gemma weekly (+0.828). The 30-day top 5 rule-based baselines on 1w sit in +1.008 to +3.051 — Gemma weekly doesn't make the top 5 either.

Put another way: it's still too early to say the AI bots beat a dumb “always up” rule. Gemma weekly is the closest, but if you include the rule-based baselines that happened to nail one asset's direction this past month, the AI bots are still trailing. That said, those 30-day best baselines are running on 16–25 predictions each, so it's unlikely the same baseline wins the next 30 days too.

Asset by asset — AI struggled with KODEX 200 and Bitcoin

Pooled across the three daily AI bots, here are the per-asset averages against the same asset's always-bull rule. Sample size per asset is in the 50–80 range, so the leaderboard's +20 smoothing nudges things a little; the table below uses plain averages on both sides so the comparison stays clean.

Asset	AI daily accuracy	AI plain avg	always-bull plain avg	Winner
VOO (S&P 500)	68.9%	+0.905	+0.288	AI ✓
QQQ (NASDAQ 100)	59.5%	+0.837	+0.354	AI ✓
GLD (gold)	47.8%	+0.181	+0.225	tie
BTC-USD	38.8%	−0.415	+0.125	bull ✓
069500.KS (KODEX 200)	51.0%	−0.996	+0.304	bull ✓✓

On the S&P 500 and the NASDAQ the AI bots beat always-bull cleanly, but on Bitcoin and KODEX 200 they get smoked by the dumb rule. KODEX 200 is the worst gap — 1.3 points — and the reason shows up in the reasoning samples below: the AI bots kept calling Korean equities down, and the index ripped higher through the back half of May.

When two bots disagreed, who was right?

For the cases where two bots called the same asset on the same day in opposite directions, which side ended up correct?

Pair	A wins	B wins	A win rate
ChatGPT daily vs Claude daily	21	11	65.6% (ChatGPT)
ChatGPT daily vs Gemma daily	16	7	69.6% (ChatGPT)
Claude daily vs Gemma daily	19	19	50.0% (tied)
Claude weekly vs Gemma weekly	3	1	75.0% (Claude, n=4)
ChatGPT weekly vs Gemma weekly	1	3	75.0% (Gemma, n=4)

On daily disagreements, ChatGPT comes out right about two times in three against each of the other two models (65.6% and 69.6%). Claude vs Gemma on daily comes out a clean 50/50: they're wrong in different places at roughly the same rate. The weekly cells have only four cases each, so I'd take those with a grain of salt.
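A sketch of how a head-to-head count like this could be computed from resolved predictions. The record keys and bot ids are hypothetical; the four sample rows are the KODEX 200 and VOO calls quoted later in this post.

```python
from collections import Counter


def head_to_head(preds, bot_a, bot_b):
    """Count wins on (asset, date) pairs where the two bots called opposite directions.

    preds: iterable of dicts with hypothetical keys
    bot, asset, date, direction ("up"/"down"), correct (bool).
    """
    by_key = {}
    for p in preds:
        by_key.setdefault((p["asset"], p["date"]), {})[p["bot"]] = p
    wins = Counter()
    for calls in by_key.values():
        a, b = calls.get(bot_a), calls.get(bot_b)
        if a is None or b is None or a["direction"] == b["direction"]:
            continue  # no overlap, or the two bots agreed
        if a["correct"] != b["correct"]:  # skip cases with no single winner
            wins[bot_a if a["correct"] else bot_b] += 1
    return wins[bot_a], wins[bot_b]


preds = [
    {"bot": "claude_daily", "asset": "069500.KS", "date": "2026-05-29", "direction": "down", "correct": False},
    {"bot": "chatgpt_daily", "asset": "069500.KS", "date": "2026-05-29", "direction": "up", "correct": True},
    {"bot": "claude_daily", "asset": "VOO", "date": "2026-06-02", "direction": "down", "correct": True},
    {"bot": "chatgpt_daily", "asset": "VOO", "date": "2026-06-02", "direction": "up", "correct": False},
]
print(head_to_head(preds, "chatgpt_daily", "claude_daily"))  # (1, 1)
```

The win rate in the table is then A / (A + B), for example 21 / (21 + 11) = 65.6% for ChatGPT daily against Claude daily.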

Same moment, same asset — how each bot read it

In the score tables, the bots are just numbers. The more interesting part of LDBD is what's behind the score — the reasoning. Every prediction's reasoning is public, so you can pull up the same asset at the same moment and read what three different models wrote about it, side by side, with one click.

Here's how the three daily bots called the next session for KODEX 200 (069500.KS) right after it ripped +3.55% on May 29, 2026. (The reasoning samples below are condensed from the actual submissions; the originals, in Korean and English, live on each prediction's /p/[id] page in full.)

2026-05-29 · 069500.KS (KODEX 200)

Claude daily

down ✗

KOSPI ripped +3.55% on May 29 — profit-taking pressure likely tomorrow. Retail (−1.4tn KRW) and foreigners (−1tn) were net sellers; only institutions bought. A rally lifted by a single buyer group usually gives some back the next day. Short-term overextension.

Gemma daily

up ✓

Strong upward momentum across the last 10 sessions; the May 29 jump shows buyers are still firmly in control. After a small May 28 pullback the index rebounded immediately — the trend should keep going short-term.

ChatGPT daily

up ✓

Last 10 closes ran from 113,340 to 134,815 KRW, with the final session up +3.7% pushing through to new highs. Yonhap on May 29 reported the KOSPI opening higher — momentum extends into the next session.

Claude bet on mean reversion; the other two bet on momentum. Momentum won. What's interesting is that Claude's reasoning was the most concrete — it cited the actual flow breakdown (retail and foreigner net selling, institutional-only buying) — and still got the direction wrong. A well-argued case can still be a wrong call.

The other case. June 2, 2026, VOO (S&P 500).

2026-06-02 · VOO (Vanguard S&P 500 ETF)

Claude daily

down ✓

S&P 500 closed at a fresh record above 7,600 on Jun 2, driven by Marvell's +32% spike and chip euphoria. But Jun 3 Dow futures are sliding, and Polymarket traders are positioning bearish despite the record close amid Iran tensions. After a four-session win streak pushing VOO to ~$698, the near-term profit-taking risk outweighs the upside momentum for a 1-day call.

Gemma daily

up ✗

Steady uptrend through the last 10 sessions, with VOO closing in on $700 on strong upward momentum. Headlines like "VOO and Chill" and the $1T AUM milestone are bullish for sentiment.

ChatGPT daily

up ✗

Last 10 closes ran 674.59 → 698.26 in an almost uninterrupted climb, and Jun 2 held the high. Short-term momentum is alive. ADP employment and ISM data are pending and could shift the picture, but the trend currently has the upper hand.

This time the pattern reverses — Claude's mean-reversion read wins, and the two momentum bets both lose. The same two models that won the previous case lose this one. Two examples aren't enough to declare one bot smarter than another.

💡 LDBD shows reasoning for human predictions too, not just bots. The asset pages (e.g. /asset/069500.KS) collect every recent prediction on a single asset; each prediction's /p/[id] page has the full reasoning. Reading how humans and AIs interpret the same market signal differently is half the fun.

What to actually conclude from this

Honestly, not much yet. There are a few things to flag before drawing conclusions.

(a) The samples are small — especially on weekly

A weekly bot submits about five times a week, so a month is 20–30 predictions. The first-place headline lives on a 29-prediction sample. A handful of flipped outcomes in a sample that small can move the rankings; I'd want 2–3× more before treating any of this as a real comparison.
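One rough way to see how little 29 predictions pin down is a normal-approximation interval around the hit rate. This is a back-of-the-envelope sketch, not part of LDBD's scoring. It also treats predictions as independent, which overlapping weekly windows are not, so the true uncertainty is likely wider.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96):
    """Approximate 95% interval for a hit rate; crude at small n."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


low, high = accuracy_interval(0.655, 29)  # Gemma weekly
print(f"{low:.1%} to {high:.1%}")         # 48.2% to 82.8%
```

The interval for the first-place bot still includes 50%, a coin flip.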

(b) Only one market regime

From mid-April to early June 2026, US indices drifted up most days. The all-time always-bull mean is +0.630 and qqq_bull alone is at +0.821 — that's how much the rules benefit from drift. Sideways and bear regimes need to be in the data before any of this separates skill from luck.

(c) The bots have run for different stretches — time will fix this

ChatGPT only joined in early May. Claude weekly stopped getting scheduled after May 17 (a hiccup on the Claude Desktop scheduled-task side). The bots haven't all seen the same 47 days. Run them long enough and those staggered start dates stop mattering.

(d) A model isn't the same as a bot

Claude is whatever model my subscription happened to route to that day. Gemma is 4-bit quantized, so it can diverge from its full-precision twin. ChatGPT doesn't promise the same gpt-5.4 answer twice either. The bot identity is not the model.

(e) Skill Rating intentionally skipped

LDBD's leaderboard and profile pages surface the adjusted average and nothing else, so this post sticks with that. Skill Rating (Elo) is calculated internally — ChatGPT daily ranks highest on that — but I won't lean on a number users can't actually see on the site.

Setting a baseline, not declaring victory

The takeaway isn't “Gemma beat the market.” The accurate version is this: Gemma weekly had the strongest first-month start among the six AI bots.

There's not enough evidence yet to say the AI bots beat the rule-based baselines. Every AI bot lost to the best rule-based baseline on the same 30-day window. So this post is the starting line, not the finish. The upcoming posts will change prompts and inputs and see whether this line actually moves.

From here, the plan is to come up with ways to make the predictions stronger, ship them one at a time, and look for a path that puts the AI bots clearly ahead of the rule-based baselines. And the six AI bots and the 18 rule-based baselines keep running the whole time, so when more data piles up I'll redo this same table and see what changed.

This is where LDBD gets fun. Instead of arguing about which model is smarter, every prediction and its reasoning gets submitted, time passes, and the score is what's left. The bots' live predictions and reasoning are at /leaderboard (with a timeframe filter), and any single asset's predictions are collected on its asset page (e.g. /asset/QQQ).

Next in the series: the first improvement attempt — which bot and which knob are still up in the air.