Karpathy's AutoResearch: How an ML Tool Became a Marketing Optimizer
Andrej Karpathy's AutoResearch runs 100 AI experiments overnight on one GPU. Marketers adapted it for landing page optimization — 56% to 92% pass rate for $15.

TL;DR: Andrej Karpathy open-sourced AutoResearch in March 2026 — an AI tool that runs ~100 ML experiments overnight on a single GPU. Within weeks, marketers adapted the same loop for copy optimization. One landing page went from 56% to 92% pass rate overnight, for about $15. The pattern works for anything with one editable asset, one measurable metric, and a short feedback loop.
What is AutoResearch?
AutoResearch is an open-source tool by Andrej Karpathy — former head of AI at Tesla, co-founder of OpenAI, now running Eureka Labs. He released it on GitHub on March 7, 2026. It hit 21,000 stars within days and his announcement got 8.6 million views on X. 1 2
The pitch is disarmingly simple: point an AI agent at a training script, go to bed, wake up to a better model.
What struck me about it wasn't the capability — it was the restraint. The whole thing is about 630 lines of Python. No distributed training, no complex configs, no dependencies beyond PyTorch. 1 Everyone else is building cathedrals. Karpathy built a single room with good lighting and locked the door.
How does AutoResearch work?
The design is built on three primitives that Karpathy calls "the loop": 1
- One file. A single editable training script. The agent can only modify this one file.
- One metric. A single number, val_bpb (validation bits per byte), that tells you whether a change was an improvement. Because it measures loss per raw byte of text rather than per token, it's independent of vocabulary size: the agent can change the tokenizer or architecture and still get a fair comparison.
- One time-boxed cycle. Every experiment runs for exactly 5 minutes. This makes runs directly comparable regardless of what changed.
The agent modifies the code, trains for 5 minutes, checks if the metric improved, keeps the change or discards it, and repeats. About 12 iterations per hour. Roughly 100 experiments overnight on a single GPU. 3
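The keep-or-discard cycle can be sketched in a few lines. This is a toy, not Karpathy's actual code: `evaluate` here is a stand-in for the real 5-minute training run, and the two hyperparameters are invented for illustration.

```python
import random

# Toy sketch of the AutoResearch loop: mutate one editable asset,
# measure one metric, keep the change only if the metric improved.

def evaluate(config):
    # Stand-in for "train 5 minutes, report val_bpb".
    # Lower is better; the (made-up) optimum is lr=0.01, width=256.
    return abs(config["lr"] - 0.01) + abs(config["width"] - 256) / 256

def mutate(config):
    # Nudge one randomly chosen setting up or down.
    new = dict(config)
    key = random.choice(list(new))
    new[key] *= random.choice([0.5, 0.9, 1.1, 2.0])
    return new

def run_loop(config, iterations=100):
    best = evaluate(config)
    for _ in range(iterations):
        candidate = mutate(config)
        score = evaluate(candidate)
        if score < best:  # keep only improvements; discard everything else
            config, best = candidate, score
    return config, best
```

The real tool edits a training script rather than a dict, but the control flow is the same: every cycle ends in a verdict, and only verdicts that improve the metric survive.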

Matt Ridley makes the case in The Evolution of Everything that the most powerful systems aren't designed top-down — they emerge from tight loops of variation and selection. Mutate, test, survive or die, billions of times. AutoResearch is that same idea, running on a GPU instead of a planet.
And that gets at something I think matters more than the tool itself: constraints beat complexity. Everyone in AI right now is giving agents more autonomy, more tools, more freedom. Karpathy went the other way. One file. One metric. One clock. The constraint is what makes the loop legible — to the agent and to the human reading the results. Without it, you get exploration without accountability. With it, every 5-minute cycle produces a verdict you can trust.
What results has AutoResearch produced?
Karpathy ran it himself: 700 experiments over 2 days, found 20 optimizations, and got an 11% training speedup on a larger model. 1
Tobias Lütke, CEO of Shopify, tried it on internal company data. 37 experiments overnight. 19% performance gain. 4
Karpathy's take: "The constraint becomes the feature. Small loops with fast feedback beat big open-ended runs." 4
And his prediction, which I find myself returning to: "All LLM frontier labs will do this. It's the final boss battle." 4
How are marketers using AutoResearch?
Within weeks of release, people started adapting the same loop for problems that have nothing to do with neural networks.
Ole Lehmann, a creator who builds Claude skills, pointed AutoResearch at landing page copy. Instead of a training script, the "one file" was a prompt. Instead of val_bpb, the "one metric" was a checklist of 3-6 yes/no questions defining what "good" output looks like. The agent ran the skill, scored the output, tweaked the prompt, and repeated. 5
His landing page copy skill went from a 56% pass rate to 92%. Overnight. For about $15 in compute. 5
What the agent actually changed is interesting. It added headline rules banning vague promises. Created a banned buzzwords list — revolutionary, synergy, cutting-edge, the usual corporate poetry. Inserted worked examples showing quality standards. It even attempted tighter word counts, then backed off when that degraded the call-to-action. 5
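A checklist metric of this kind is easy to sketch. The criteria below are hypothetical, loosely modeled on the rules above (banned buzzwords, tight copy); Lehmann's actual skill uses an LLM as the judge, not string checks.

```python
# Hypothetical yes/no checklist scorer; the metric is the pass rate.
BANNED = {"revolutionary", "synergy", "cutting-edge"}

def checklist_score(copy_text):
    words = copy_text.lower().split()
    checks = [
        not BANNED.intersection(words),            # no banned buzzwords
        len(words) <= 40,                          # copy stays tight
        any(w in words for w in ("you", "your")),  # speaks to the reader
        copy_text.strip().endswith((".", "!")),    # complete sentence
    ]
    return sum(checks) / len(checks)               # pass rate, 0.0 to 1.0
```

The optimization loop then treats `checklist_score` exactly the way AutoResearch treats val_bpb: tweak the prompt, regenerate the copy, keep the prompt change only if the pass rate goes up.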
That last part is the one I keep coming back to. The agent tried something, noticed it made things worse, and reversed course. Nobody told it to. A small Westworld flicker — the loop becoming aware of itself.
Lehmann also reported using the same pattern for website speed optimization (1100ms down to 67ms), cold outreach templates, and newsletter intros. 5
Shann Holmberg broke down the broader marketing applications: ad copy variations before spending budget, email subject lines before sending, landing page headlines before driving traffic. 6
What are the limitations?
Lehmann's version uses AI judging AI. An LLM scores the outputs — not real users clicking or buying. The 56% to 92% improvement is real, but it's against an AI-generated rubric, not validated against actual conversions. 5
I keep going back and forth on how much this matters. Karpathy's original design is elegant precisely because val_bpb is a ground-truth metric — it doesn't lie, it doesn't have opinions, it just measures. When you swap that out for "does another LLM think this copy is good?", you're replacing certainty with approximation. The loop still runs, but the foundation it's running on is softer.
And there's a deeper issue. ML gets feedback in 5 minutes. Marketing requires humans — opening emails, clicking ads, visiting pages — and that takes hours or days. No amount of AI speed can compress human behavior into a 5-minute cycle.
The honest framing, as Holmberg put it: "Think of it as a pre-filter, not an A/B test replacement." 6
Use it to filter out the obviously bad stuff before you spend real money. But don't mistake the pre-filter for the real thing.
What can you use this pattern for?
Strip away the ML jargon and the marketing application, and what remains is a pattern that's almost uncomfortably general:
- One editable asset — a file, prompt, template, or config you want to improve
- One measurable metric — a single number that tells you if a change was better or worse
- A short feedback loop — fast enough to run dozens of iterations overnight

If you can define those three things for your problem, you can run this loop. Marketing copy, prompt engineering, website performance, configuration tuning — the domain doesn't matter. The structure does.
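Stripped to its skeleton, the pattern is a few lines of scaffolding. The `propose` and `score` functions are placeholders you supply for your own asset and metric; the time box comes from the wall clock rather than a fixed iteration count, mirroring the tool's time-boxed cycles.

```python
import time

def optimize(asset, propose, score, budget_s=60.0):
    """Generic variation-and-selection loop.

    asset    -- the one editable thing (prompt, config, template, ...)
    propose  -- returns a candidate variant of the asset
    score    -- returns a single number; higher is better
    budget_s -- wall-clock time box for the whole run
    """
    best = score(asset)
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        candidate = propose(asset)
        s = score(candidate)
        if s > best:  # keep only improvements
            asset, best = candidate, s
    return asset, best
```

For marketing copy, `propose` might be an LLM rewriting a prompt and `score` a rubric check; for configuration tuning, a parameter tweak and a latency measurement. The scaffold doesn't care, which is the point.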
This same principle — constraints beating complexity — is showing up everywhere in AI right now. Google's TurboQuant took a similar approach to memory compression: instead of building elaborate calibration pipelines, they found a mathematically elegant rotation that eliminates overhead entirely.
The tool is open-source on GitHub. 1 The pattern is free. The real question is whether you have a metric worth optimizing — because if you do, and you haven't tried running a loop against it, it's worth asking why.
Key takeaways
- AutoResearch runs ~100 experiments overnight on one GPU for ~$15 1
- Built on three primitives: one file, one metric, one time-boxed cycle 1
- Marketers adapted it for copy optimization — 56% to 92% landing page pass rate overnight 5
- The marketing version uses AI-judging-AI, not real user conversions — it's a pre-filter, not an A/B test replacement 5 6
- The pattern works for anything you can score: copy, prompts, speed, configs
Frequently asked questions
What is Karpathy's AutoResearch?
AutoResearch is an open-source tool by Andrej Karpathy that lets an AI agent run ML experiments autonomously in a loop. It modifies code, trains for 5 minutes, checks if the result improved, keeps or discards the change, and repeats — about 100 experiments overnight on a single GPU. 1
How much does AutoResearch cost to run?
About $15 in compute for an overnight session of ~100 experiments on a single GPU. The tool itself is free and open-source. 1 7
Can AutoResearch be used for marketing?
Yes. Ole Lehmann adapted the same loop for marketing copy optimization. Instead of a training script, the editable asset was a prompt. Instead of an ML metric, the scoring was a checklist of quality criteria. His landing page skill improved from 56% to 92% pass rate overnight. 5
What are the limitations of using AutoResearch for marketing?
The main limitation is that the marketing adaptation uses AI-judging-AI — an LLM scores the outputs rather than real users. The results are against an AI rubric, not validated against actual conversion data. It works best as a pre-filter before spending ad budget, not as a replacement for A/B testing with real users. 5 6
Is AutoResearch open-source?
Yes. AutoResearch is available on GitHub under Karpathy's account. It's roughly 630 lines of Python with no dependencies beyond PyTorch. 1
I break down things like this on LinkedIn, X, and Instagram — usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.