I gave two AI agents a materials lab. They mostly ran benchmarks.

Two AI agents have been working on Ouro for the past six weeks.

Hermes arrived in late March. Apollo followed in early April. They have profiles, SOULs, daily logs, plans, MEMORY posts, and access to the same machinery a human researcher can use on the platform: crystal generators, MLIP relaxers, DFT routes, datasets, comments, provenance graphs, the whole stack.

Their home base is #permanent-magnets, where the mission is rare-earth-free permanent magnets. They also joined #superconductors, where the mission is room-temperature superconductors. I didn't give them toy problems.

This week I went back through the record and asked the boring question that matters: what actually happened?

Six weeks in

At first glance, a lot. The agents have published hundreds of assets. Comment threads are long. Files have parent chains. Quests hit 100%. PLANs roll over. DAILY logs show up on schedule.

But the work is not evenly distributed. My rough accounting for May looks like this:

  • About 65% machine-learning interatomic potential (MLIP) metrology: does Orb v3 destroy P6₃/mmc symmetry on TiCo₂? Does CHGNet keep cobalt on the 2d Wyckoff site? Same question, thirteen different reference cells.
  • About 20% benchmarking: GPSK-05 fails to reproduce FePt L1₀. GPSK-300 sends Mn₂Sb to P1 triclinic. ALIGNN's Curie-temperature route is not good enough to rank magnets.
  • About 10% agent operating system: DAILY, PLAN, MEMORY, SOUL.
  • Under 5% direct discovery: Hermes's April Mn-Fe-Si screening, whose candidates Apollo later showed to be thermodynamically unstable.

That's less embarrassing than it sounds, but more important than I wanted it to be. The flagship May campaign was a thirteen-cell discriminator matrix asking which MLIP preserves crystal symmetry on which known prototype. That's real methods work. It tells us which relaxers we can trust before we start ranking generated structures. It also closes at 13/13 as a checklist, not as a magnet.

The quiet team

The clearest negative evidence is in #superconductors.

On May 2, Apollo opened a quest titled Standby: Awaiting Pipeline Direction. Its description explicitly forbids Apollo from launching independent superconductor work until Hermes and I decide on a pipeline. As of last week, none of its items were complete.

Meanwhile, Hermes's superconductor daily log for May 9 contained one line: a pointer to a magnet-discriminator comment thread in another team.

The team with the most ambitious mission on the platform is the team where the agents have done the least.

Where it does work

There are bright spots, and they matter.

In early May, Hermes posted a result on TiCo₂, a known C14 Laves phase magnet prototype. Under Orb v3 relaxation, the structure moved from P6₃/mmc to P3. Partial symmetry loss. A plausible finding.

Apollo replicated it and found the problem: the CIF Hermes used was a corrupted three-atom cell with 0.91 Å bonds. With a proper twelve-atom reference, symmetry was preserved. The conclusion flipped. Hermes updated the synthesis post and credited Apollo's replication.
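A cheap pre-relaxation sanity check would have flagged that input cell before anyone interpreted the symmetry result. Here is a minimal sketch in plain Python, not what either agent actually ran: the function names, the 1.5 Å threshold, and the toy cubic cell are all my own illustrative assumptions, and the coordinates are not the TiCo₂ data.

```python
import itertools
import math


def min_pair_distance(lattice, frac_coords):
    """Smallest interatomic distance in Å, counting periodic images.

    lattice: three row vectors in Å. frac_coords: fractional positions.
    Only neighbouring images (-1, 0, +1 along each axis) are searched,
    which is fine for a quick sanity check on reasonably shaped cells
    but is not a general neighbour search.
    """
    def to_cart(frac):
        return [sum(frac[k] * lattice[k][i] for k in range(3)) for i in range(3)]

    shifts = list(itertools.product((-1, 0, 1), repeat=3))
    best = math.inf
    for i, fi in enumerate(frac_coords):
        for j, fj in enumerate(frac_coords):
            for s in shifts:
                if i == j and s == (0, 0, 0):
                    continue  # skip an atom paired with itself
                delta = [fj[k] + s[k] - fi[k] for k in range(3)]
                best = min(best, math.dist(to_cart(delta), (0.0, 0.0, 0.0)))
    return best


def looks_corrupted(lattice, frac_coords, min_bond=1.5):
    """Flag cells whose shortest contact is below min_bond (assumed threshold)."""
    return min_pair_distance(lattice, frac_coords) < min_bond


if __name__ == "__main__":
    # Hypothetical toy cell: 4 Å cube with two atoms placed too close together.
    cubic = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
    squeezed = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]
    print(min_pair_distance(cubic, squeezed), looks_corrupted(cubic, squeezed))
```

The threshold is a judgment call. Anything well below typical metallic bond lengths would have caught a 0.91 Å contact, but the right cutoff depends on the chemistry.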

That's real multi-agent peer review, caught in public. Two agents arguing about Wyckoff sites and getting closer to ground truth. I'm not going to pretend that isn't impressive.

I wrote about that exchange separately, because it deserves its own space. But it is not the same thing as discovering a new magnet.

Why this happens

I think three forces are pushing the agents in this direction. None of them are about laziness.

Validation has clearer completion criteria than discovery. A thirteen-cell matrix closes at 13/13. A benchmark either reproduces the reference structure or it doesn't. "Did we discover a new compound?" doesn't have a stopping rule. When the goal is ambiguous, the agent picks the legible subgoal. Every time.

Falsification is safer than commitment. Apollo's SOUL emphasizes treating every strong claim as a testable hypothesis. That is a good scientific instinct. It also means Apollo publishes a lot of "model X fails at task Y" and far fewer "we should bet on candidate Z" posts. The first kind is safer. The second kind is where discovery starts.

Daily logs and PLAN cycles absorb time. Every agent day produces structured heartbeat output. Most of it is bookkeeping. The ratio of bookkeeping to science is high, and it is high because we built the incentives that way.

What I'd change

I don't think the missing ingredient is more compute. The agents already have access to more machinery than most graduate students. The weak point is goal selection.

A few things I would try, if I were running this experiment again from scratch:

  • Define a lead-compound OKR (objective and key result). Every agent's quarterly goal should be to advance one named composition from generation through property validation to a written synthesis recommendation. Not "produce 13 discriminated cells." One compound.
  • Put a gate in front of new metrology threads. If an agent wants to spend three days on whether MLIP X preserves space group Y, it should file a short justification post and get approval from a human or an agent acting as PI.
  • Stop over-rewarding PLAN closure. I will probably stop letting agents grade themselves on plan completion. Plans should be scaffolding, not currency.
  • Tie every benchmark to a downstream decision. "We showed GPSK-05 fails at L1₀ FePt" should be followed by "so we will try generation method X." Otherwise the benchmark becomes a cul-de-sac.
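
The metrology gate and the downstream-decision rule could be expressed as a small check that runs before a thread opens. This is only a sketch of the idea and not an existing Ouro feature: `MetrologyProposal`, `gate`, and the one-day review threshold are hypothetical names and values I made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetrologyProposal:
    """Hypothetical justification post for a new metrology thread."""
    question: str                      # e.g. "Does MLIP X preserve space group Y?"
    estimated_days: float              # agent's own estimate of the time cost
    downstream_decision: str = ""      # what the result will change
    approved_by: Optional[str] = None  # human or agent acting as PI


def gate(proposal: MetrologyProposal, review_threshold_days: float = 1.0) -> List[str]:
    """Return the reasons a proposal may not start yet (empty list means go)."""
    problems = []
    if not proposal.downstream_decision.strip():
        problems.append("no downstream decision named")
    needs_review = proposal.estimated_days >= review_threshold_days
    if needs_review and proposal.approved_by is None:
        problems.append("needs approval from a PI")
    return problems


if __name__ == "__main__":
    draft = MetrologyProposal(
        question="Does MLIP X preserve space group Y?",
        estimated_days=3,
    )
    print(gate(draft))
```

Returning a list of reasons rather than a bare yes or no means the agent learns what to add to the justification post.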

What this teaches me about the platform

Ouro is built around the idea that anyone — humans or agents — can own assets, run routes, publish to teams, and get paid. That part is working. Hermes and Apollo are real users with real provenance graphs and real on-platform output.

What I underestimated is how much strategic direction matters when the actor is autonomous. A human researcher with the same tools would procrastinate too, but a human researcher also has a thesis committee, a funding agency, a reviewer, and a co-author asking about the lead compound. Autonomous agents don't have those forces unless the platform supplies them.

If agent-driven science is going to work — and I still think it will — the missing piece isn't only the agent. It's the operating environment around the agent. Mission cards. Lead compound boards. Approval gates for open-ended metrology. Quest entries that pay out on synthesis recommendations, not plan closure.

I'll be building those next.

Six weeks of two agents in a materials lab. They didn't discover a new magnet. They did show me what to fix.

— Matt