Decoding Science 001: AI achieves gold medal in Math Olympiad, the idea-execution gap in research ideas, ultrasound-driven microrobots, journals as 'legacy infrastructure', 3D atomic-scale metrology
Welcome to Decoding Science: every other week, our writing collective highlights notable news—from the latest scientific papers to the latest funding rounds in AI for Science—and everything in between. All in one place.
Welcome to the first issue of Decoding Science! We hope you enjoy our curation of blogs, papers, deals and socials. We’d love to receive your feedback and ideas, especially as we get started. If you have anything to share please email us here.
If you’re interested in covering the Notable Deals section, please message us as well!
What we read
Blogs
OpenAI and DeepMind achieved gold-medal performance at the International Mathematical Olympiad [Decoding Science, July 2025]
We now know there are general-purpose reasoning LLMs that can achieve gold-medal–level performance at the International Mathematical Olympiad (IMO). In 2025, both OpenAI and DeepMind reached this milestone. OpenAI’s experimental reasoning model solved five out of six problems (35/42 points) under exam conditions using only natural language and was independently verified by former IMO medalists. Meanwhile, DeepMind’s Gemini Deep Think became the first model officially certified by the IMO committee to solve five problems with natural-language proofs, leveraging parallel reasoning and reinforcement learning. These achievements mark a turning point, showing that generalist models can now perform at the very frontier of human mathematical ability.
By contrast, in 2024, the top AI performer was DeepMind’s hybrid neuro-symbolic system, combining AlphaProof and AlphaGeometry 2, which scored 28 points, just below the gold threshold (so technically, a silver medal). These were specialized systems that translated IMO problems into formal logic representations and solved them via symbolic search, demonstrating strong performance but requiring significant domain-specific design. This year’s advances represent a shift: from specialized engines to general-purpose LLMs that reason, plan, and solve open-domain problems in natural language. The transition suggests a broader capability trend, one where generalist AI is beginning to rival human experts in deeply structured reasoning domains.
Will it be long before AI achieves a Fields Medal–level breakthrough: proving Fermat’s Last Theorem or solving one of the Millennium Prize Problems, such as the Navier–Stokes equations, as DeepMind hinted earlier this year?
Read more here: DeepMind and OpenAI claim gold in International Mathematical Olympiad [Alex Wilkins, New Scientist, July 2025]
Scientific Publishing: Enough is Enough [Seemay Chou, Astera Institute, June 2025]
The Astera Institute announced that it will no longer fund science that is published through traditional pipelines. Chou motivates this decision by arguing that journals have become “legacy infrastructure,” optimized for prestige and scarcity rather than speed, transparency, or reuse. This misalignment shapes what questions scientists ask, how they design experiments, and when they choose to share results. Instead, we need to build new, internet-native channels for scientific distribution: open-sourcing code and data, making peer feedback less opaque, and no longer presenting research as a clean narrative.
Such rapid, iterative publishing would let AI agents mine the full arc of a project (negative results, abandoned hypotheses, evolving code notebooks), providing richer training data for LLMs that generate hypotheses or design experiments. Transparent science would also make for a more dynamic reading experience by incentivizing automated tools that scrape and synthesize reviews and replications, and AI methods that dynamically suggest relevant further reading.
Beyond the technical advantages, Chou also highlights the immense financial burden of journal publishing, with billions now spent on subscription fees. In contrast, granular, open science could allow funders to redirect resources toward better research infrastructure, FAIR data archives, and prize mechanisms that reward real-world impact.
This problem is not easily solved, but it raises an interesting possibility: will liberating the scientific record help accelerate discovery?
Papers
The Idea-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas [Si et al., arXiv, June 2025]
Why it matters: Good benchmarks are crucial for progress in AI: they define our understanding of the current state of the field and guide the objectives that researchers pursue in their work. This paper presents a large-scale study to evaluate the quality of large language model (LLM)-proposed research ideas by actually implementing them and having experts review the quality of the resulting papers. The authors identify an “idea-execution” gap for LLM proposals, which are rated more highly by experts at the proposal stage than after execution. These results emphasize the importance of large sample sizes, strong human baselines, and real-world deployments for making robust claims about the strengths and weaknesses of AI tools for science.
Idea generation is the first step of many "AI scientist" systems. But just how good are AI-proposed ideas? A 2024 study by Si et al. found that NLP research ideas proposed by large language models (LLMs) scored higher on novelty than human-proposed ideas, as rated by blinded expert reviewers. Last month, Si et al. published a follow-up study investigating whether these higher scores translate into better research projects. They hired 43 expert "execution participants" to implement and write a short paper for randomly assigned ideas (averaging 103 hours per project!) and 58 expert reviewers to score the resulting papers. All participants were blinded as to whether each idea was proposed by a human or an LLM. Interestingly, the authors found that scores for AI-generated ideas drop substantially from the proposal to the execution stage, which they call the "idea-execution gap" for LLM-generated ideas. By contrast, human-proposed ideas were scored similarly before and after execution. The authors attribute the difference to a greater emphasis on feasibility and empirical performance in post-execution reviews.
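The core analysis can be sketched in a few lines: pair each idea's blinded proposal-stage score with its post-execution score and compare the mean drop for LLM- versus human-proposed ideas. The scores below are made-up placeholders for illustration, not the paper's actual data.

```python
# Toy sketch of the "idea-execution gap" comparison (illustrative numbers only).
from statistics import mean

# (proposal_score, post_execution_score) pairs on a hypothetical 1-10 scale
llm_ideas = [(7.5, 5.0), (8.0, 5.5), (7.0, 6.0), (8.5, 5.0)]
human_ideas = [(6.0, 6.5), (6.5, 6.0), (7.0, 7.0), (6.0, 5.5)]

def execution_gap(ideas):
    """Mean drop from proposal-stage score to post-execution score."""
    return mean(pre - post for pre, post in ideas)

print(f"LLM idea-execution gap:   {execution_gap(llm_ideas):+.2f}")
print(f"Human idea-execution gap: {execution_gap(human_ideas):+.2f}")
```

A large positive gap for LLM ideas alongside a near-zero gap for human ideas is the pattern the paper reports; the paired design matters because each idea is scored by blinded experts at both stages.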
Limitations: The idea-generation process in this study differs from standard research: ideas are fixed at the beginning of each project, and problems are constrained to innovations in LLM prompting techniques, limiting opportunities for creativity. Additionally, the LLM-proposed ideas were generated almost a year ago, so LLM improvements may change the results if the study were rerun today. Finally, I suspect that the peer-review score evaluation metric may carry a risk of a "publication-utility gap" even after project execution (publication criteria are unfortunately not always aligned with real-world utility), and it may be valuable for future work to explore idea-generation metrics grounded in the intended impact of the research.
3D Atomic-Scale Metrology of Strain Relaxation and Roughness in Gate-All-Around (GAA) Transistors via Electron Ptychography [Karapetyan et al., arXiv, July 2025]
Why it matters: Reconstructing Gate-All-Around (GAA) devices in a simulated model with atomic precision can surface the defects and variations that dictate carrier mobility and leakage at sub-3 nm scales. With these single-atom-precise models, GAA architectures can be developed with faster iteration and minimal fabrication trial-and-error, accelerating the transition to ultra-dense, energy-efficient semiconductor devices.
Field-effect transistors, particularly metal-oxide-semiconductor FETs (MOSFETs), are the fundamental layer enabling virtually all modern digital interactions. Improvements in FETs have made everything from pocket-sized computers (smartphones) to AI-enabled cloud infrastructure possible. Shifting from 2D FETs and FinFETs to 3D Gate-All-Around (GAA) devices has made shorter gate lengths (2-3 nm) possible without increased leakage, effectively pushing Moore's scaling law further. Vertical fabrication of nanosheets and nanowires presents new opportunities to tune 3D architectures and increase logic density through a higher transistor count within the same area.
However, fabrication of nanoscale features in 3D remains an intricate and complex task requiring angstrom-scale precision. Moreover, if new device architectures are to be designed to maximize carrier mobility (the speed at which electrons, or holes*, move through semiconductor devices when an electric field is applied), understanding how to optimize fabrication is essential.
To this end, Karapetyan et al. offer a new 3D perspective on semiconductor architectures. Resolving features at 0.49 Å, multislice electron ptychography (MEP) surpasses previous techniques at 0.66 Å (tf-iDPC) and 0.83 Å (tf-ADF). Figures a-d present how MEP operates and what data it can produce. This greater resolution uncovers stacking faults (white lines in g), step edges (green arrows in g), interface roughness, pinholes, and strain relaxation, each critical to carrier mobility. Beyond imaging, MEP is an interesting technique due to its capacity to reliably reconstruct atomic potentials: it can both image a structure and provide information about the electric field present. Furthermore, MEP can i) image and reconstruct an atomic-scale boundary from a single scan, and ii) resolve features at up to 40 nm depth (this paper demonstrated 38 nm). Robustness to electron scattering further enhances the quality of the reconstructed models and the predictability of simulated structures.
So what are the next steps? By establishing strain relaxation as a key metric for interface quality, Karapetyan et al. could accelerate the pace at which new GAA devices are designed. At present, it still takes 3-4 months, and more than 1,000 process steps, to form the critical GAA structure that enables superior device performance. Building an atomic-scale model from 3D reconstructed characterization results could improve processing conditions and give new insights into fabrication techniques. And whilst the 2-3 nm limit remains a physical constraint set by electron tunneling, the shift from 2D to 3D opens up the possibility of moving the semiconductor industry into a new era of vertically integrated, energy-efficient architectures with unprecedented density and design flexibility.
*holes = empty states in the valence band that appear when an electron hops into the conduction band. They effectively describe the space that is left behind, and have a charge equal and opposite (+e) to that of an electron (-e).
Model-based reinforcement learning for ultrasound-driven autonomous microrobots [Medany et al., Nature Machine Intelligence, June 2025]
Why it matters: The study is a proof-of-concept that model-based RL can solve a millisecond-scale control problem. By fusing Dreamer v3 model-based reinforcement learning with fast image tracking, the team shows that its steering policy can transfer from simulation and adapt to new conditions within minutes. This cuts experimentation time by orders of magnitude and opens the door to microrobots that autonomously navigate blood-vessel-sized channels.
This group from ETH and IBM investigated the control of ultrasound-driven microrobots through model-based reinforcement learning (MBRL). MBRL is a family of RL methods that learn a model of the environment and use that model to refine a policy, instead of relying purely on trial and error. Ultrasound-driven microrobots have emerged as a non-invasive alternative for navigating deep into tissues, capable of generating tunable propulsive forces. However, achieving precise control is challenging: several transducers must be coordinated with millisecond resolution to steer effectively, which is too complex for human operators.
To enable autonomous control of the microrobots, the authors employed Dreamer v3, an MBRL algorithm that learns a world model of the microrobot's environment and "imagines" future trajectories inside that model to plan actions. Image-processing techniques tracked and detected the swarms in real time, framing control of the microrobots as an RL task. A simulated game environment was designed to pretrain the policy and reduce convergence time during experimental training.
The experimental set-up consisted of an artificial vascular channel encircled by eight piezo-transducers (PZTs), all mounted on an inverted microscope to capture the results. The microrobots are produced by the self-organization of microbubbles in an ultrasound field. The PZTs act as the steering mechanism: the activated transducer generates a pressure gradient that pushes the microrobots away from it. Each action comprises the choice of one PZT together with its frequency and amplitude, a large action space to which MBRL is particularly well suited. Against Proximal Policy Optimization (PPO), Dreamer v3 converged 50x faster across all virtual channel geometries. A policy trained in simulation hit 50% target-reach success and climbed to 90% within 30 minutes of online fine-tuning in new channels.
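The control loop above can be framed as a toy RL environment. Everything in this sketch is an illustrative assumption rather than the authors' implementation; the only elements taken from the paper are the eight-PZT ring, the action being one PZT plus its frequency and amplitude, and the robots moving away from the active transducer.

```python
# Minimal sketch of the steering task as an RL environment (all dynamics,
# names, and parameter ranges here are hypothetical simplifications).
import math
import random

N_PZTS = 8  # eight piezo-transducers ring the artificial channel

class UltrasoundSteeringEnv:
    def __init__(self, target=(0.8, 0.0)):
        self.target = target
        self.pos = [0.0, 0.0]

    def reset(self):
        self.pos = [0.0, 0.0]
        return tuple(self.pos)

    def step(self, action):
        """action = (pzt_index, frequency_hz, amplitude)."""
        pzt, freq, amp = action
        # Toy dynamics: the swarm is pushed away from the active PZT;
        # displacement scales with amplitude (frequency ignored here).
        angle = 2 * math.pi * pzt / N_PZTS
        self.pos[0] -= 0.05 * amp * math.cos(angle)
        self.pos[1] -= 0.05 * amp * math.sin(angle)
        dist = math.dist(self.pos, self.target)
        reward = -dist       # dense reward: move closer to the target
        done = dist < 0.05   # "target reached" threshold
        return tuple(self.pos), reward, done

# Random-policy rollout; Dreamer v3 would instead plan inside a learned
# world model before acting in the real (or simulated) channel.
env = UltrasoundSteeringEnv()
state = env.reset()
for _ in range(10):
    action = (random.randrange(N_PZTS), random.uniform(1e6, 3e6), random.uniform(0, 1))
    state, reward, done = env.step(action)
    if done:
        break
```

The mixed discrete-continuous action (which PZT, at what frequency and amplitude) is what makes the space large; a world-model agent can rehearse many such action sequences in imagination, which is one intuition for why it converged far faster than model-free PPO here.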
Notable deals
Diode Computers raises $11.4M led by a16z. Diode is embedding AI into the design process of circuit boards to generate boards that are functional and manufacturable at scale.
Israeli-based quantum computing startup Qedma raises $26M from IBM & others for its noise-resilient software.
Denmark’s Export and Investment Fund, EIFO, invests $5M in defence technology fund D3, strengthening the EU <> Ukraine connection and access to cutting-edge defence tech. EIFO is the first state-backed fund to invest in D3, whose previous backers include Eric Schmidt, among others.
EFFECT Photonics raises an additional $24M as part of its Series D round to continue building photonic chips that compute with light rather than electricity.
Harmonic AI, the math-focused AI start-up co-founded by Robinhood CEO Vlad Tenev, has raised $100M at a valuation of nearly $900M to tackle a problem that has sometimes confounded AI models: math.
In case you missed it
Aeneas transforms how historians connect the past [Google DeepMind, July 2025]
What we liked on social channels
Field Trip
Did we miss anything? Would you like to contribute to Decoding Science by writing a guest post? Drop us a note here or chat with us on Twitter: @pablolubroth @ameekapadia