Decoding Science 008: Recreating Neuronal Patterns, Finding Hidden Gems with LLMs, and Predicting Enzyme Specificity with GNNs

Hiya Jain, Dispersion Limits, Sarah, and 2 others

Nov 05, 2025

Welcome to Decoding Science: every other week our writing collective highlights notable news, from the latest scientific papers to the latest funding rounds in AI for Science, and everything in between. All in one place.

What we read

Looking for Hidden Gems in Scientific Literature [Ulkar Aghayeva, Elicit, Oct 2025]

Aghayeva surveys the field of literature-based discovery (LBD), which promises “to search for and reveal already existing, but still hidden, links between concepts, findings, questions and answers within scientific literature that would otherwise take much longer to stumble into.” She examines why increasingly sophisticated computational methods have failed to deliver the discovery breakthroughs their technical capabilities might suggest.

She contends that LBD’s core dysfunction lies in “the evaluation problem.” The field has historically benchmarked against a handful of manually discovered cases, and attempts to bypass this bottleneck have resulted in “co-occurrences [that] are trivial or noisy, like pairings with generic terms. A high score assigned by an LBD method thus doesn’t necessarily point to a novel, valuable and generative connection.” This creates a situation where “it has been easier to make a technological contribution to LBD, by developing a new algorithmic method, than to put together a high-quality annotated dataset.”

Beyond the evaluation problem, Aghayeva articulates how different types of creativity map onto LLM capabilities in ways that constrain what LBD can achieve. She argues that LLMs excel at combinatorial and exploratory creativity but lack the transformative creativity that “alters the concept-space itself.” She ties this to the memorization versus generalization trade-off in LLMs: models can produce interesting combinations from their training data but cannot build entirely new frameworks. The implication is that LBD may be inherently limited to incremental discoveries.

She then points to a solution put forth by Gwern: a “daydreaming loop” that continuously samples concept pairs, generates connections, and filters for value. However, as Aghayeva notes, this requires accepting the computational cost of producing a large number of pairings that just won’t be interesting. Yet “if we already knew how to predict the value and interestingness of associations, there wouldn’t have been a need to traverse the entire space of possibilities to begin with.” Moreover, the most valuable discoveries might be precisely “the most far-flung and low-prior connections,” so perhaps no real shortcuts exist.
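To make the shape of such a loop concrete, here is a minimal sketch of the sample, generate, filter cycle. It is an illustration only, not Gwern's or Aghayeva's implementation: the function names (daydream, propose, score), the threshold, and the toy concepts are all hypothetical, and in a real system propose and score would presumably be LLM calls rather than the stand-ins used here.

```python
import itertools
import random

def daydream(concepts, propose, score, threshold, budget, seed=0):
    """Sample concept pairs, generate a candidate link for each, keep the valuable ones."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(concepts, 2))
    kept = []
    for _ in range(budget):
        a, b = rng.choice(pairs)   # 1. sample a concept pair
        idea = propose(a, b)       # 2. generate a candidate connection
        value = score(idea)        # 3. estimate how valuable it is
        if value >= threshold:     # 4. filter out the uninteresting ones
            kept.append((value, idea))
    # Pairs are sampled with replacement, so drop duplicates before ranking.
    return sorted(set(kept), reverse=True)

# Toy stand-ins (assumptions): one "hidden gem" among otherwise worthless pairings.
concepts = ["A", "B", "C", "D", "E"]
gem = "A <-> E"

def propose(a, b):
    return f"{a} <-> {b}"

def score(idea):
    return 1.0 if idea == gem else 0.0

# The search space grows quadratically with the number of concepts.
n_pairs = len(concepts) * (len(concepts) - 1) // 2
found = daydream(concepts, propose, score, threshold=0.5, budget=200)
```

Even in this toy, almost every sampled pairing is scored and then thrown away, and the loop is only as good as score: that is the evaluation problem again, in miniature.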

A Sensing Whole Brain Zebrafish Foundation Model for Neuron Dynamics and Behavior [Vegas et al, arXiv, Oct 2025]

The biggest challenge in understanding the brain is visualising neuronal firing and simultaneously linking it to actions, in real time. Whilst mice remain the model of choice for neuroscience, larval zebrafish offer the unique advantage of being transparent: a see-through body, engineered so that cells fluoresce upon an influx of calcium ions, acts as a readout of the brain’s firing patterns.

In the recent preprint shared by Vegas et al, this ability to peer through a fish’s body is used to recreate, and subsequently learn from, neuronal firing patterns. Following the workflow in Figure 1, the authors made use of a larval zebrafish whole-brain calcium-imaging dataset. Rather than training on raw fluorescence, however, they train their sparse brain model (SBM) on spiking statistics. These are extracted using a Poisson transformation, applied after the causal self-attention model of CASCADE.

Figure 1: (A) Data preprocessing and sparse brain model (SBM) architecture. The SBM consists of two layers: the first for spatial activation across neurons, and the second for temporal activity within each neuron. The peripheral neural model allows connecting traces to behaviour; using gradient descent on this allows inferring behaviour from patterns. (B) Comparison of ground-truth experimental firing patterns obtained from the calcium imaging dataset with model-predicted neuronal patterns at timepoints 0-4. The SBM used a small context window of 4 s (= 12 steps, as recordings are sampled at 3 Hz) per inference, which is why four snapshots were sufficient to encapsulate the entire dataset from which the model is pulling historically for training.

Briefly, let’s explain causal self-attention, as it is critical to making dynamic, real-time inference over time possible. Self-attention in a transformer lets each element of an input sequence take the other elements of the same sequence into account when computing its representation. If the sequence consists of states in time, a state at time t can thus query states at other times t’. The “causal” part of causal self-attention comes from adding a causal mask, which prevents nodes (neurons in this case) in the net from considering future states. A node can therefore only access its firing history and current state, i.e. states with t’ ≤ t. The model is called dynamic because it learns from the output of the neuron spatio-temporal layer: any output from this layer represents how each neuron fires given which other neurons are active. Thus, the temporal layer considers the firing history of a single neuron, but indirectly also accounts for the firing of all other neurons. This is what makes the proposed SBM uniquely capable of maintaining single-neuron interpretability whilst also scaling to the whole net (or whole brain).
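For readers who prefer code, the masking idea can be sketched in a few lines of NumPy. This is a generic single-head causal self-attention layer, not the SBM itself: the random weights, the feature size d, and the function name are assumptions for illustration, and only the 12-step window (4 s at 3 Hz) is taken from the preprint's description.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (T, d) sequence of states."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (T, T): row t scores every t'
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where t' > t
    scores = np.where(future, -np.inf, scores)   # causal mask: hide the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                     # masked entries become exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 12, 8                                     # 12 steps = 4 s at 3 Hz; d is arbitrary
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)

# Perturbing the last time step must leave all earlier outputs untouched.
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
out_perturbed, _ = causal_self_attention(x_perturbed, w_q, w_k, w_v)
```

The attention weights form a lower-triangular matrix: row t only has non-zero entries for t’ ≤ t, which is why changing the final state cannot alter any earlier output.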

Beyond contextualising biological activity across scales, the SBM took interpretation of brain activity a step further by linking output traces to a peripheral neural model (PNM). The PNM allows connecting brain activity traces to behavioural outputs, which in and of itself is beneficial for understanding which patterns give rise to which behaviours. Moreover, once a coherent pattern is understood to induce a behaviour, it can be used to recreate that behaviour. In this case, gradient-based synthesis through the PNM made it possible to match neural patterns to behaviour.

So what becomes possible with a foundation model that mimics, and recreates, neuronal firing patterns in larval zebrafish? From the outside, simulating neurons across scales may appear more a game than a tool, i.e. ‘guess the fish’s swimming direction’. And whilst this virtual larval zebrafish foundation model may not be able to infer drug toxicity (yet), it could in future give insight into the neural patterns required for specific behaviours, acting as a model for understanding how neurodegenerative disease leads to pattern loss. A step, or swim, in the right direction, towards catching a wave!

Enzyme specificity prediction using cross attention graph neural networks [Cui et al, Nature, October 2025]

Enzymes, the essential molecular machines of life, are defined by their substrate specificity—the ability to selectively recognize and act upon particular substrates. Although many enzymes can act promiscuously, catalyzing reactions with non-native substrates, a major challenge remains: for millions of known enzymes, substrate specificity is still poorly characterized. This lack of information limits both their practical use and our broader understanding of biocatalytic diversity.

Existing machine learning models for enzyme specificity prediction have achieved only limited success, often being restricted to particular protein families. Most rely on features derived from one-dimensional protein sequences or graph representations, overlooking the inherently 3D nature of substrate binding and the intricate interactions between enzyme and substrate. Current tools such as CLEAN, ProteInfer, and DeepECTransformer struggle to distinguish between enzyme reactivity and substrate specificity within the same EC numbers—an enduring challenge in biocatalysis. Moreover, many previous approaches represented enzymes and substrates as separate embeddings before concatenation, which hindered the accurate capture of their detailed interdependencies.

To address these limitations, researchers at UIUC developed EZSpecificity, a general deep learning framework for predicting enzyme substrate specificity. The model uniquely integrates sequence information, 3D enzyme–substrate complex structures, and the active-site environment. A key innovation is its cross-attention–empowered SE(3)-equivariant graph neural network (GNN) architecture, trained on a comprehensive, custom-built database of enzyme–substrate interactions (ESIbank). Unlike earlier methods that treated all interactions equally, EZSpecificity’s cross-attention layers selectively emphasize critical amino acids and atoms, reducing noise and enhancing generalizability. Its SE(3)-equivariant GNN encoder also captures the atomic microenvironment within the catalytic site.
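The difference between cross-attention and the older "embed separately, then concatenate" approach can be sketched as follows. This is a generic cross-attention layer in NumPy, not EZSpecificity's architecture: it has no SE(3) equivariance, and the feature sizes, random weights, and variable names are assumptions for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Each query row (e.g. a residue) attends over all context rows (e.g. atoms)."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n_queries, n_context)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_residues, n_atoms, d = 6, 4, 8           # toy sizes, chosen arbitrarily
residues = rng.normal(size=(n_residues, d))  # stand-in enzyme residue features
atoms = rng.normal(size=(n_atoms, d))        # stand-in substrate atom features
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Cross-attention: every residue gets its own weighting over substrate atoms.
residue_update, weights = cross_attention(residues, atoms, w_q, w_k, w_v)

# Baseline for contrast: pool each molecule separately, then concatenate.
concatenated = np.concatenate([residues.mean(axis=0), atoms.mean(axis=0)])
```

In the concatenation baseline, the enzyme and substrate only meet after each has been squashed into a single vector. With cross-attention, weights[i, j] says how strongly residue i attends to atom j, so pairwise interactions are modelled directly and unimportant pairs can be down-weighted.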

Experimental validation showed that EZSpecificity substantially outperformed existing models, including ESP, achieving 91.7% accuracy in identifying the single reactive substrate—compared to ESP’s 58.3%. The framework also demonstrated strong generalization across diverse protein families and the ability to predict outcomes for previously unseen enzymes and substrates. Future work will focus on integrating dynamic binding information to further boost predictive power.

EZSpecificity also shows promise for studying biosynthetic gene clusters (BGCs), linking genes to their corresponding intermediates with up to 66.7% accuracy in identifying the correct target enzyme. This represents yet another compelling example of AI’s potential to accelerate drug discovery through large-scale biochemical data analysis.

Community & Deals

Corin Wagen shared his thoughts on AI Scientists. He defines them as “capable of some independent and autonomous scientific exploration” and discusses their presence, opportunities, and limitations.

Fireworks AI raised $250M in Series C funding at a $4B valuation. They’re developing a platform that allows companies to build internal generative AI capabilities.

Substrate emerged from stealth with $100M in funding to rival ASML with its novel chipmaking technology.

Field Trip

Did we miss anything? Would you like to contribute to Decoding Science by writing a guest post? Drop us a note here or chat with us on X.

A guest post by

Sarah

5th year PhD candidate @SacannaLab NYU
