The Code of Life: How Large Language Models are Designing New Proteins

Abstract

The integration of Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI) into the natural sciences represents one of the most significant methodological shifts in modern research history. Moving beyond the predictive paradigms of the previous decade, where machine learning was primarily used to classify data or predict properties, the period of 2024–2025 has ushered in an era of generative capability. This report provides an exhaustive analysis of how these technologies are reshaping bioinformatics, chemicoinformatics, and pharmacoinformatics. We explore the transition from analyzing biological sequences to "writing" new proteins, the evolution of chemical synthesis planning from manual intuition to autonomous agents, and the acceleration of drug discovery through multimodal AI systems. Through a detailed examination of recent breakthroughs—such as the ESM3 protein language model, the "AI Co-Scientist" agentic frameworks, and novel tokenization strategies for molecular representation—this document elucidates the mechanisms, applications, and ethical challenges of this new scientific epoch. By synthesizing bibliometric trends, experimental benchmarks, and case studies ranging from battery material discovery to antimicrobial resistance mechanisms, we demonstrate that AI has evolved from a passive analytical tool into an active collaborator in the scientific method.

1. Introduction: From Prediction to Generation

The scientific method has traditionally operated on a persistent, iterative cycle: observation, hypothesis generation, experimentation, and analysis. For centuries, this loop was driven exclusively by human cognition, limited by the speed at which researchers could read literature, synthesize concepts, and manually execute experiments. The introduction of computational tools in the late 20th century accelerated the analysis phase, and the advent of deep learning in the 2010s revolutionized prediction. However, the years 2024 and 2025 mark a distinct inflection point: the entry of Artificial Intelligence (AI) into the creative and conceptual phases of science.1

For the first time, generative models—architectures originally designed to process and produce human language—are being applied to the languages of biology and chemistry. This is not merely a metaphorical overlap; the mathematical underpinnings that allow a model like GPT-4 to write a coherent essay are remarkably similar to those required to design a functional enzyme or synthesize a novel drug candidate. In both cases, the system must learn the syntax (grammar) and semantics (meaning) of a sequence, whether that sequence is composed of words, amino acids, or atoms.3

1.1 The Quantitative Shift in Literature

Recent bibliometric analyses reveal a quantitative shift in the scientific literature, providing empirical evidence of this transformation. As of early 2025, publications utilizing "Generative AI" and "Large Language Models" in healthcare and life sciences have overtaken those focusing solely on traditional "Deep Learning" models.1 This trend signifies a broader move toward "AI General" categories, where proprietary, multimodal foundational models are adapted for specialized scientific tasks. These models are no longer just classifiers; they are creators. They generate novel hypotheses, design proteins that have never existed in nature, and propose synthetic routes for complex molecules.5

The surge in FDA-approved AI-enabled medical devices—from just six in 2015 to 223 by 2023, with continued growth through 2025—validates the maturity of these technologies.2 However, the most profound changes are occurring upstream in the discovery phase. Researchers are no longer asking AI to simply "find the needle in the haystack" (screening); they are asking AI to "design a better needle" (generation).

1.2 The Convergence of Disciplines

This report dissects this transformation across three primary, interconnected domains. First, we examine Bioinformatics, where Protein Language Models (PLMs) are simulating hundreds of millions of years of evolution to produce novel biocatalysts and therapeutics. Here, the sequence-structure-function paradigm is being solved by models that can "read" the evolutionary history of life and "write" new chapters.

Second, we explore Chemicoinformatics, where the debate over how to represent molecules as text is driving innovations in synthesis planning and material discovery. The translation of 3D molecular graphs into 1D strings (like SMILES or SELFIES) is evolving, enabling LLMs to plan complex organic syntheses and even discover new battery materials in record time.

Third, we investigate Pharmacoinformatics, where autonomous agents and multimodal systems are compressing the timeline of drug discovery from years to months. The integration of "Omics" data with large-scale text mining is allowing for the identification of therapeutic targets that were previously invisible to human researchers.

Finally, we address the critical challenges of hallucination, bias, and verification that accompany these powerful tools. As AI begins to act as a "Co-Scientist," issues of trust, reproducibility, and safety become paramount.

2. The Theoretical Framework: Science as Language

To understand the application of LLMs in science, one must first appreciate the concept of "tokenization" and the Transformer architecture in a scientific context. In Natural Language Processing (NLP), a sentence is broken down into tokens—words or sub-words—which are then processed to understand context and predict the next token. In the sciences, the "sentence" varies by discipline, but the principle remains constant.

2.1 The Transformer Paradigm in Biology

The Transformer architecture, characterized by its self-attention mechanism, allows models to weigh the importance of different parts of a sequence relative to one another, regardless of their distance in the sequence. In a protein, two amino acids might be far apart in the linear sequence (primary structure) but close together in the folded 3D object (tertiary structure). Traditional Recurrent Neural Networks (RNNs) struggled with these long-range dependencies, often "forgetting" the beginning of a sequence by the time they reached the end.

Transformers, however, excel at capturing the global context of a sequence. The self-attention mechanism creates a matrix of relationships, calculating how every token influences every other token. In biology, this mimics the physical reality of folding, where residues interact across vast sequence distances to stabilize a structure.7 This capability has led to the rise of "Foundation Models" in science—large-scale models pre-trained on vast datasets (such as the UniProt database for proteins or PubChem for molecules) that can be fine-tuned for specific tasks. These tasks range from property prediction (e.g., is this molecule toxic?) to generation (e.g., create a protein that binds to this target).9
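To make the mechanism concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention over a toy sequence of "residue" embeddings. The dimensions, random weights, and sequence length are arbitrary illustrations, not taken from any published protein model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    X : (seq_len, d_model) token embeddings, e.g. one row per residue.
    Returns the attended representations and the (seq_len, seq_len) attention map,
    whose entry [i, j] measures how strongly token i attends to token j.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise relevance, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V, weights

# Toy example: 5 "residues" with 8-dimensional embeddings and random projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.round(2))   # every residue's weight on every other residue, regardless of distance
```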

2.2 Tokenization: The Atomic Unit of Meaning

The success of an LLM depends heavily on how data is tokenized. In human language, breaking words into sub-words (like "playing" becoming "play" and "ing") helps models understand roots and suffixes. In science, choosing the right token is an active area of research.

In Genomics, the challenge is sequence length. A human genome is 3 billion base pairs long. Early models used "k-mer" tokenization, treating overlapping chunks of DNA (e.g., "ATCG", "TCGA", "CGAT") as words. However, this creates a massive vocabulary and is computationally expensive. Recent advances in 2024 and 2025 have shifted toward Byte Pair Encoding (BPE), a compression algorithm that iteratively merges the most frequent pairs of characters. This allows models like DNABERT-2 to process much longer sequences of DNA with fewer parameters, capturing the "syntax" of gene regulation more efficiently.11
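As a small illustration of why overlapping k-mers become expensive, the toy tokenizer below turns every window of length k into a token; the sequence and parameters are invented for demonstration only.

```python
def kmer_tokenize(seq, k=4):
    """Overlapping k-mer tokenization: every window of length k becomes a token."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATCGATTG", k=4))
# ['ATCG', 'TCGA', 'CGAT', 'GATT', 'ATTG']
# With k = 6 the theoretical vocabulary is 4**6 = 4096 tokens, and every base
# appears in up to k tokens, which is what makes k-mers costly for long genomes.
```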

In Chemistry, the challenge is topology. Molecules are graphs, not strings. To use LLMs, chemists flatten these graphs into strings using notations like SMILES (Simplified Molecular Input Line Entry System). However, standard text tokenizers often break these strings in chemically nonsensical ways (e.g., splitting a chlorine atom "Cl" into "C" and "l"). New methods like Atom Pair Encoding (APE) have been developed to respect chemical intuition, grouping atoms that are chemically bonded into single tokens. This improves the model's ability to learn chemical rules and generate valid molecules.13
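For comparison, the snippet below shows a regex-based SMILES tokenizer in the spirit of those widely used in chemical language modeling: multi-character atoms such as Cl and Br are kept as single tokens instead of being split mid-atom. The exact pattern is illustrative, not any specific published tokenizer.

```python
import re

# Multi-character atoms (bracket atoms, Cl, Br) are matched before single characters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str):
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
print(tokenize_smiles("ClCCl"))                    # ['Cl', 'C', 'Cl'], not ['C', 'l', ...]
```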

2.3 Multimodality: Beyond Text

A defining feature of the 2024–2025 landscape is the shift from unimodal to multimodal models. Early scientific LLMs treated biological data strictly as text. However, biology and chemistry are inherently geometric and functional. A protein is not just a sequence of letters; it is a dynamic 3D shape with a specific job.

The latest generation of models, such as ESM3 and AlphaFold 3, integrates sequence data (1D text) with structural data (3D coordinates) and functional annotations (text descriptions) into a unified latent space. This allows for "programmable biology," where a user can prompt a model with a mix of constraints—for example, requesting a protein with a specific active site geometry (structure) and a specific catalytic activity (function). The model acts as a translator between these modalities, ensuring that the generated sequence folds into the requested structure and performs the requested function.15

3. Bioinformatics: Reading and Writing the Code of Life

Bioinformatics has arguably seen the most profound impact from generative AI, primarily because DNA and protein sequences are naturally analogous to human language. They possess an alphabet (nucleotides or amino acids), a grammar (folding rules and regulatory motifs), and a semantic meaning (biological function). The transition from "reading" genomes (sequencing) to "writing" genomes (synthesis) is being mediated by these language models.

3.1 Protein Language Models (PLMs): The New Engines of Design

The field of protein design has been revolutionized by the emergence of Protein Language Models (PLMs). Unlike the physics-based simulations of the past (such as Rosetta), which calculated energy landscapes to find stable structures through brute-force computation, PLMs learn the statistical patterns of evolution directly from sequence databases like UniProt.9

3.1.1 ESM3 and the Simulation of Evolution

One of the most significant milestones in 2024 was the release of ESM3 by EvolutionaryScale. This model represents a departure from purely structural predictors like the earlier versions of AlphaFold. ESM3 is a multimodal generative model trained on billions of proteins, capable of reasoning over sequence, structure, and function simultaneously. It does not just predict what a sequence looks like; it can invent new sequences to fit a description.15

A landmark demonstration of ESM3's capability was the generation of a novel fluorescent protein, dubbed "esmGFP." The research team prompted the model to generate a sequence that would fold into a fluorescent structure, using a "chain of thought" approach where the model first generated a scaffold, then the active site residues, and finally the full sequence.

The resulting protein shared only 58 percent sequence identity with the closest known natural fluorescent protein. In evolutionary terms, a divergence of this magnitude corresponds to approximately 500 million years of natural selection: the distance in sequence space that random mutation and selection would have had to traverse to turn the closest natural protein into esmGFP.16

This achievement is profound. It suggests that PLMs have not merely memorized the protein database but have internalized the underlying "physics of evolution." By navigating the high-dimensional latent space of protein sequences, ESM3 acted as an evolutionary simulator, traversing a path through the fitness landscape that nature had not yet taken. This capability allows researchers to "jump" across the protein universe to find functional molecules that are distinct from existing patents or natural limitations, potentially leading to novel industrial enzymes or non-immunogenic therapeutics.15

3.1.2 AlphaFold 3: Interaction and Accuracy

While ESM3 focuses on generative breadth and evolutionary simulation, Google DeepMind's AlphaFold 3 (released in 2024) pushed the boundaries of predictive accuracy and interaction modeling. Unlike its predecessor AlphaFold 2, which revolutionized static protein structure prediction, AlphaFold 3 expanded its scope to accurately model the interactions between proteins, DNA, RNA, and small molecule ligands.2

The distinction between these models highlights a divergence in utility within the field:

  • AlphaFold 3 is the superior tool for analyzing how a specific drug might bind to a receptor (docking and interaction). It is a "reader" of detailed molecular interactions.

  • ESM3 is the superior tool for creating the receptor or the biologic drug itself. It is a "writer" of biological concepts.21

This duality—analysis versus creation—forms the core of modern computational biology workflows. A researcher might use ESM3 to generate a candidate enzyme and then use AlphaFold 3 to rigorously validate its binding properties before ever ordering a physical synthesis.

3.2 Genomic Language Models: Decoding Non-Coding Regions

While proteins are the workhorses of the cell, the genome contains the instructions. Genomic Language Models (gLMs) such as DNABERT-2 and Nucleotide Transformer have been developed to interpret non-coding DNA, regulatory elements, and complex genetic interactions.11

3.2.1 The Tokenization Challenge in Genomics

As mentioned in the theoretical framework, the sheer length of genomic sequences presents a massive computational hurdle. A major breakthrough in 2024 was the widespread adoption of Byte Pair Encoding (BPE) for DNA. Previous models relied on "k-mers" (e.g., 6-base chunks), which produced huge vocabularies of overlapping tokens and computationally inefficient representations.

DNABERT-2 utilizes BPE to create a variable-length vocabulary. This means common regulatory motifs (like a TATA box) might become a single token, while rare sequences are broken down into smaller parts. This efficiency allows the model to process sequences that are thousands of base pairs long, capturing the long-range regulatory dependencies that characterize complex genomes—such as how an enhancer region influences a gene located far away on the chromosome.11
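A minimal sketch of the BPE idea applied to DNA follows, assuming nothing beyond the standard merge rule (count adjacent token pairs, merge the most frequent, repeat). DNABERT-2's actual vocabulary is of course learned over far larger corpora; the two sequences here are toys.

```python
from collections import Counter

def bpe_merges(sequences, num_merges=3):
    """Tiny BPE trainer: start from single bases and iteratively merge the
    most frequent adjacent pair of tokens into one longer token."""
    corpus = [list(seq) for seq in sequences]        # start at character level
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Re-tokenize every sequence using the new merged token.
        for i, toks in enumerate(corpus):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(a + b)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            corpus[i] = out
    return merges, corpus

merges, tokenized = bpe_merges(["TATAAATATAAA", "GCGCTATAAA"], num_merges=4)
print(merges)      # frequent motifs such as 'TA' and 'TATA' collapse into single tokens
print(tokenized)
```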

Benchmarking reveals that these updated gLMs outperform their predecessors in tasks like promoter identification and transcription factor binding site prediction, often requiring only a fraction of the computational resources. For instance, DNABERT-2 achieves comparable performance to state-of-the-art models with 21 times fewer parameters, democratizing access to genomic analysis for labs without massive supercomputing clusters.23

3.3 Evolutionary Biology and Phylogenetics

Generative AI is also reshaping how we reconstruct the history of life. Phylogenetic reconstruction—the inference of evolutionary trees—is computationally intensive and typically relies on complex statistical models of mutation rates. Recent studies have demonstrated that LLMs can perform "zero-shot" phylogenetic analysis by treating the relationship between sequences as a linguistic similarity problem.

3.3.1 NJGPT and Conversational Phylogenetics

Tools like NJGPT (Neighbor-Joining GPT) allow users to construct phylogenetic trees using natural language queries. A user can upload a set of sequences and ask the AI to "build a tree and explain the relationship between these species." The underlying LLM automates the alignment and tree-building steps, lowering the barrier to entry for complex evolutionary analysis.24
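For orientation, the classic step such a tool wraps can be sketched with Biopython's neighbor-joining constructor on a hand-made distance matrix. The taxa and distances below are invented for illustration; this is not NJGPT's implementation.

```python
# pip install biopython
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

# Pairwise distances (lower-triangular, diagonal included) between four toy taxa.
dm = DistanceMatrix(
    names=["human", "mouse", "zebrafish", "fly"],
    matrix=[[0], [0.10, 0], [0.45, 0.40, 0], [0.70, 0.68, 0.60, 0]],
)

tree = DistanceTreeConstructor().nj(dm)   # classic neighbor-joining
Phylo.draw_ascii(tree)                    # quick text rendering of the inferred tree
```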

3.3.2 The Phylogeny of LLMs

Interestingly, researchers have turned the lens of phylogenetics back onto the AI models themselves. By analyzing the "DNA" of Large Language Models (their weights, architectures, and outputs), scientists have constructed phylogenetic trees of LLMs. These trees reveal the evolutionary history of model architectures, tracing the lineage from early encoder-only systems (like BERT) to the modern decoder-only families (like Llama and GPT). This "meta-evolutionary" analysis provides insights into the rapid "speciation" of AI models, revealing distinct evolutionary speeds across different model families and tracking the flow of architectural innovations.25

3.4 Addressing Bias in Protein Models

A critical challenge in bioinformatics is the bias inherent in the training data. The UniProt database, which serves as the training corpus for almost all PLMs, is heavily skewed toward organisms that are easy to study or medically relevant (humans, mice, E. coli). This leaves the vast majority of biological diversity—the "dark matter" of the protein universe, such as viral proteins or those from extremophiles—underrepresented.

Research in 2024 has shown that PLMs perform significantly worse when designing for these neglected spaces. They suffer from sequence similarity bias, where they struggle to predict fitness for sequences that diverge too far from the training distribution. Recent efforts have focused on fine-tuning models on specific viral families or using "parameter-efficient" strategies to adapt general PLMs to these niche areas. This is crucial for applications like predicting viral escape mutations or designing enzymes for extreme industrial environments.27

4. Chemicoinformatics: The Grammar of Molecules

In chemistry, the challenge for Generative AI is representation. Unlike biological sequences, which are linear strings of characters, molecules are defined by their topology—atoms (nodes) connected by bonds (edges) in specific geometric arrangements. To feed a molecule into an LLM, it must first be "flattened" into a string. The efficiency and accuracy of this translation are the subjects of intense research and debate in the 2024–2025 period.

4.1 The Battle of Representations: SMILES vs. SELFIES

The industry standard for text-based molecular representation has long been SMILES (Simplified Molecular Input Line Entry System), which encodes chemical structures as compact strings of ASCII characters (e.g., CCO for ethanol). However, SMILES has a significant flaw in the context of generative AI: it is fragile. A generative model can easily produce a SMILES string that is syntactically plausible (it looks like a SMILES string) but chemically impossible (e.g., a carbon atom with five bonds or an unclosed ring). This leads to a high rate of "invalid" outputs—hallucinated molecules that cannot exist in physical reality.30
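One quick way to quantify this fragility is to round-trip candidate strings through a cheminformatics parser. The sketch below uses RDKit, whose MolFromSmiles returns None for anything it cannot parse or sanitize; the example strings are illustrative.

```python
# pip install rdkit
from rdkit import Chem

candidates = [
    "CC(=O)Oc1ccccc1C(=O)O",   # aspirin: syntactically and chemically valid
    "CC(=O)Oc1ccccc1C(=O",     # unclosed parenthesis: parse failure
    "C(F)(F)(F)(F)F",          # carbon with five bonds: valence violation
]

for smi in candidates:
    mol = Chem.MolFromSmiles(smi)   # returns None when sanitization fails
    print(f"{smi!r:30} -> {'valid' if mol is not None else 'rejected'}")
```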

4.1.1 The Rise of SELFIES

To combat this, the SELFIES (SELF-referencing Embedded Strings) representation gained traction. SELFIES is designed to be 100 percent robust; the grammar of the language is constructed such that every possible combination of SELFIES tokens corresponds to a valid molecular graph. If a model generates a random sequence of SELFIES tokens, it will always translate back to a valid molecule. This property is invaluable for generative models, as it prevents the AI from wasting computational effort on invalid chemistry.
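The robustness property can be demonstrated in a few lines with the open-source selfies package: a molecule is encoded, its tokens are scrambled, and the result still decodes to a valid SMILES string. The aspirin example is arbitrary.

```python
# pip install selfies
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin
encoded = sf.encoder(smiles)                      # SMILES -> SELFIES
print(encoded)

# Any well-formed sequence of SELFIES tokens decodes to *some* valid molecule,
# so even a scrambled token order still yields chemically meaningful output.
shuffled = "".join(list(sf.split_selfies(encoded))[::-1])
print(sf.decoder(shuffled))                       # still a valid SMILES string
```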

4.1.2 Atom Pair Encoding (APE)

However, the "SMILES vs. SELFIES" debate is not settled. Recent benchmarks in late 2024 and 2025 have shown that while SELFIES ensures validity, advanced tokenization strategies applied to SMILES can sometimes yield better performance in downstream tasks like property prediction. Specifically, a novel method called Atom Pair Encoding (APE) has been shown to outperform standard Byte Pair Encoding (BPE).

While BPE merges frequent characters based on statistical frequency in the text, APE merges atoms based on chemical bonds. It preserves the chemical context of atom pairs better than generic text compression algorithms, allowing models to "understand" local chemical environments (like functional groups) more effectively. In biophysics classification tasks (e.g., predicting blood-brain barrier penetration), models using APE-tokenized SMILES significantly outperformed those using BPE, demonstrating that the method of tokenization is just as important as the format of representation.13
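The published APE method is more involved, but its core intuition (count merge candidates over chemical bonds rather than over adjacent characters) can be sketched with RDKit. The corpus below is a toy example, and this is an illustration of the idea only, not the APE algorithm itself.

```python
# pip install rdkit
from collections import Counter
from rdkit import Chem

def bonded_pair_counts(smiles_corpus):
    """Count element pairs joined by an actual chemical bond across a corpus.
    In an APE-style scheme, the most frequent bonded pairs become merged tokens."""
    pairs = Counter()
    for smi in smiles_corpus:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        for bond in mol.GetBonds():
            a = bond.GetBeginAtom().GetSymbol()
            b = bond.GetEndAtom().GetSymbol()
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

corpus = ["CCO", "CC(=O)O", "c1ccccc1Cl", "CCN(CC)CC"]
print(bonded_pair_counts(corpus).most_common(3))
# The most frequent bonded pairs (e.g. ('C', 'C'), ('C', 'O')) would be merged first,
# so functional groups stay intact instead of being split by textual frequency alone.
```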

4.2 Graph-Enhanced LLMs and Molecular Design

Recognizing that string representations will always lose some topological information, researchers at MIT and IBM introduced a hybrid approach in 2025. They developed a system that augments an LLM with Graph Neural Networks (GNNs).

In this architecture, the LLM handles the high-level reasoning and natural language inputs (e.g., "design a molecule that is soluble in water, targets receptor X, and is easy to synthesize"). The LLM then delegates the precise structural generation to the GNN, which operates directly on the molecular graph. This multimodal approach raised the success rate of valid synthesis plans from 5 percent (for text-only baselines) to 35 percent. It outperformed models ten times its size by leveraging the specific strengths of graph processing for molecular geometry, proving that multimodality is key to solving the "hallucination" problem in structural design.4

4.3 Autonomous Synthesis and Chemical Agents

The ultimate goal of computational chemistry is not just to design a molecule, but to make it. This is where "Agentic AI" comes into play. Agents are LLM-based systems equipped with "tools"—access to literature search engines, reaction databases, and even robotic laboratory hardware.

4.3.1 ChemCrow and the Agent Ecosystem

ChemCrow, a prominent example of this technology updated in 2025, is an LLM agent that integrates chemistry-specific tools to perform tasks ranging from synthesis planning to safety checks. Unlike a standard chatbot that simply predicts the next word, ChemCrow can "reason" that it needs to check a safety database before proposing a reaction.

New systems like "CheMatAgent" and "ChemActor" employ tree-search algorithms to navigate the complex decision trees of chemical synthesis. These agents can plan multi-step synthesis routes, identify necessary reagents, and predict potential safety hazards (like explosive byproducts) before a human chemist ever steps into the lab. In benchmarks, these agents consistently outperform unaugmented GPT-4 in chemical reasoning tasks, demonstrating the power of tool-augmented AI.32
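The control flow of such agents can be caricatured in a few lines. The sketch below is purely illustrative: the planner is a stub standing in for an LLM policy, and the tool names and outputs are hypothetical rather than taken from ChemCrow or any other system.

```python
def search_literature(query: str) -> str:
    return f"(stub) top hits for '{query}'"

def plan_synthesis(target: str) -> str:
    return f"(stub) 3-step route to '{target}'"

def check_safety(reaction: str) -> str:
    return f"(stub) no flagged hazards for '{reaction}'"

TOOLS = {"search_literature": search_literature,
         "plan_synthesis": plan_synthesis,
         "check_safety": check_safety}

def planner(task, history):
    """Stand-in for the LLM policy: decide the next tool call from the history.
    A real agent would generate this decision with a language model."""
    steps = [("search_literature", task),
             ("plan_synthesis", task),
             ("check_safety", f"route to {task}")]
    return steps[len(history)] if len(history) < len(steps) else None

def run_agent(task):
    history = []
    while (step := planner(task, history)) is not None:
        tool, arg = step
        observation = TOOLS[tool](arg)        # execute the chosen tool
        history.append((tool, observation))   # feed the result back into planning
    return history

for tool, obs in run_agent("ibuprofen"):
    print(f"{tool}: {obs}")
```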

4.3.2 Retrosynthesis Planning

Retrosynthesis—working backward from a target molecule to simple starting materials—is a classic problem in organic chemistry. It requires deep knowledge of thousands of reaction types and their compatibilities. LLMs trained on reaction databases are now competing with specialized algorithmic approaches. By treating chemical reactions as a translation task (translating "product" to "reactants"), these models can propose novel synthetic routes. However, they still struggle with stereochemistry (the 3D arrangement of atoms) and rare reaction types, necessitating a "Human-in-the-loop" approach where expert chemists validate AI-proposed pathways to ensure feasibility.34
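Framing retrosynthesis as translation largely comes down to how training pairs are constructed. A minimal sketch, assuming the common reaction-SMILES convention of reactants>agents>products:

```python
def to_retro_pair(reaction_smiles: str):
    """Turn a reaction SMILES ('reactants>agents>products') into a retrosynthesis
    training pair: the model reads the product and must 'translate' it back
    into the reactants."""
    reactants, _agents, products = reaction_smiles.split(">")
    return {"source": products, "target": reactants}

# Esterification of acetic acid with ethanol to ethyl acetate, written as reaction SMILES.
rxn = "CC(=O)O.CCO>>CC(=O)OCC"
print(to_retro_pair(rxn))
# {'source': 'CC(=O)OCC', 'target': 'CC(=O)O.CCO'}
```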

4.4 Material Science: Accelerating Discovery

The principles of chemicoinformatics extend naturally into materials science. The search for new battery materials, catalysts, and superconductors is essentially a search for optimal molecular arrangements in crystal lattices.

4.4.1 Case Study: The Microsoft and PNNL Battery Discovery

A defining moment in 2024 was the collaboration between Microsoft and the Pacific Northwest National Laboratory (PNNL). The team utilized Azure Quantum Elements, a platform combining high-performance computing (HPC) and AI, to search for a new solid-state electrolyte for batteries—a material that is less likely to catch fire than current lithium-ion liquid electrolytes.

The process began with a massive dataset of 32 million potential inorganic materials. The AI models filtered this down to 500,000 stable candidates, then further refined it to 18 promising materials based on predicted conductivity and cost. This entire screening process took just 80 hours.

A human researcher then synthesized one of these candidates in the lab, creating a functioning battery prototype. This process, which would have traditionally taken years of trial-and-error, was compressed into weeks. It stands as a premier example of AI acting not just as a tool, but as an accelerant for physical science, bridging the gap between computational prediction and physical realization.36

5. Pharmacoinformatics: From Target to Treatment

Pharmacoinformatics sits at the intersection of biology and chemistry, applying computational tools to the discovery and development of drugs. The integration of Generative AI has compressed the traditional "Design-Make-Test-Analyze" cycle, offering the promise of faster, cheaper therapeutics.

5.1 Target Discovery and Knowledge Graphs

Before a drug can be designed, a biological target (such as a protein or gene) driving the disease must be identified. This is a data mining challenge of immense scale, requiring the synthesis of genomics, proteomics, and decades of scientific literature.

5.1.1 PandaOmics and LLM Scoring

Insilico Medicine’s PandaOmics platform exemplifies the commercial application of these technologies. In its 2024-2025 updates, the platform integrated novel LLM-based scoring metrics. These algorithms scour millions of scientific papers, grants, patents, and clinical trial reports to identify "attention spikes"—sudden increases in scientific interest in specific gene-disease associations.

By combining this text-based intelligence with omics data (genomics, transcriptomics), the system can rank potential therapeutic targets based on novelty and feasibility. For instance, the "Attention Spike" score uses a neural network trained on retrospective publication trends to forecast which targets will become hot topics in the next two years. This multimodal approach filters out the noise of over-hyped targets and highlights under-explored opportunities that are supported by emerging, high-quality evidence.38
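The scoring used by PandaOmics is proprietary; the sketch below is only a toy stand-in for the idea, flagging a sudden jump in yearly publication counts as a z-score against the preceding baseline. The counts are invented.

```python
import numpy as np

def attention_spike_score(yearly_counts):
    """Toy 'attention spike' signal: how many standard deviations the latest
    year's publication count sits above the preceding baseline years."""
    counts = np.asarray(yearly_counts, dtype=float)
    baseline, latest = counts[:-1], counts[-1]
    return (latest - baseline.mean()) / (baseline.std() + 1e-9)

# Hypothetical publication counts for one gene-disease association, 2018-2025.
counts = [3, 4, 3, 5, 4, 6, 5, 21]
print(round(attention_spike_score(counts), 2))   # large positive value = emerging interest
```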

5.2 Virtual Screening: Zero-Shot vs. Fine-Tuning

Virtual screening involves evaluating huge libraries of molecules to find those that might bind to a target. Traditionally, this was done using "docking" simulations, which calculate the physical forces between a molecule and a protein. While accurate, docking is computationally expensive and slow for billion-scale libraries.

5.2.1 The Zero-Shot Revolution

Recent benchmarks have pitted LLMs against traditional docking methods. "Zero-shot" learning refers to the ability of a model to perform a task (like predicting drug-target interaction) without having been explicitly trained on that specific target data. Models like "DDI-JUDGE" and "Geneformer" have demonstrated that LLMs can achieve competitive performance in identifying drug interactions and binding affinities purely based on their understanding of chemical and biological language.

Benchmarks show that while fine-tuning (training the model on specific domain data) still yields the highest accuracy (AUC ~0.788), the zero-shot capabilities of foundational models are rapidly improving (AUC ~0.642). This allows researchers to use LLMs as a "first pass" filter, screening massive chemical spaces at a fraction of the cost of physical docking, before subjecting the top candidates to more rigorous physics-based simulation.40

5.2.2 LLMs vs. Docking

New benchmarks like Ab-VS-Bench (Antibody Virtual Screening Benchmark) explicitly compare LLMs to docking tools. The results suggest that for tasks like ranking antibodies by affinity, instruction-tuned LLMs can rival structure-based methods, especially when structural data is low-resolution or unavailable. This is critical for early-stage discovery where crystal structures of targets might not yet exist.43

5.3 De Novo Drug Design and Multi-Objective Optimization

Once a target is identified, the challenge shifts to designing a molecule that binds to it. Generative AI allows for de novo design—creating molecules from scratch rather than screening existing libraries.

5.3.1 Generative Chemistry in Practice

Case studies from 2024 highlight the efficacy of this approach. In one instance, generative models were used to repurpose existing drugs for new indications by identifying molecular similarities that human intuition had missed. In oncology and neurodegenerative diseases, AI models have successfully proposed candidate structures that optimize ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) profiles simultaneously. This multi-objective optimization is difficult for humans—who often optimize one property at the expense of another—but trivial for high-dimensional AI models, which can navigate the trade-offs in the latent space.44
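One simple way to reason about such trade-offs is Pareto filtering: keep every candidate that no other candidate beats on all objectives at once. The sketch below uses hypothetical scores and is not any specific platform's method.

```python
def is_dominated(a, b):
    """Candidate a is dominated if b is at least as good on every objective
    and strictly better on at least one (all objectives here: higher is better)."""
    return all(bj >= aj for aj, bj in zip(a, b)) and any(bj > aj for aj, bj in zip(a, b))

def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(is_dominated(c[1], other[1]) for other in candidates)]

# Hypothetical candidates scored on (potency, solubility, 1 - predicted toxicity).
candidates = [
    ("mol_A", (0.90, 0.30, 0.80)),
    ("mol_B", (0.85, 0.60, 0.75)),
    ("mol_C", (0.70, 0.55, 0.70)),   # dominated by mol_B on every axis
    ("mol_D", (0.60, 0.90, 0.95)),
]
print([name for name, _ in pareto_front(candidates)])   # ['mol_A', 'mol_B', 'mol_D']
```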

5.4 The "AI Co-Scientist" and Hypothesis Generation

In early 2025, Google introduced the "AI Co-Scientist," a multi-agent system built on the Gemini 2.0 architecture. This system was designed not just to answer questions, but to formulate and test scientific hypotheses.

5.4.1 Case Study: The "Tail Piracy" Discovery

As a validation of this system, researchers tasked the AI Co-Scientist with explaining the mechanisms of bacterial gene transfer related to antimicrobial resistance (AMR). Specifically, they wanted to understand how certain genetic elements spread between bacteria.

The AI engaged in a "generate, debate, and evolve" process, using multiple internal agents to propose hypotheses, critique them against the literature, and refine them. It independently generated a hypothesis involving "tail piracy"—a mechanism where capsid-forming phage-inducible chromosomal islands (cf-PICIs), mobile gene-transfer elements, hijack the tails of diverse bacteriophages (viruses) to spread to new hosts.

This hypothesis matched a mechanism that human researchers had spent nearly a decade confirming experimentally. The AI identified the crucial role of "adaptor" and "connector" proteins in this process. When benchmarked against other models, only the Co-Scientist system accurately derived this complex biological mechanism, demonstrating that AI agents can now engage in high-level scientific reasoning that rivals expert human intuition.46

6. Critical Challenges, Limitations, and Ethics

Despite the euphoria surrounding these advances, significant hurdles remain. The deployment of Generative AI in science is not without risk, and the "black box" nature of these models necessitates rigorous scrutiny.

6.1 The Hallucination Problem

"Hallucination"—the generation of factually incorrect but plausible-sounding text—is a well-known issue in LLMs. In science, the consequences are severe. A model might invent a citation, fabricate a chemical property, or misstate a biological mechanism.

Studies in 2024 and 2025 have categorized these hallucinations into open-domain (false general claims) and closed-domain (deviating from provided context). For example, legal and medical chatbots have been caught citing non-existent cases or policies.50 In chemistry, models often hallucinate molecular properties if not properly constrained.

Retrieval-Augmented Generation (RAG) is the primary mitigation strategy. By forcing the model to "look up" information in a verified database before generating an answer, accuracy is significantly improved. For example, in molecular property prediction, RAG-enhanced models reduced error rates drastically by retrieving functional group data before making a prediction. Prompt-optimization techniques such as MIPRO further refine this by programmatically tuning the prompts used to query the database, ensuring the model considers all relevant chemical contexts.51
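A minimal sketch of the retrieve-then-generate pattern follows, assuming a toy keyword-overlap retriever and an invented three-entry knowledge base. A production RAG system would instead use embedding search over a verified chemical database before calling the language model.

```python
# Illustrative knowledge base: verified statements the model must ground its answer in.
KNOWLEDGE_BASE = {
    "ethanol": "Ethanol (CCO): miscible with water; boiling point ~78 C.",
    "benzene": "Benzene (c1ccccc1): poorly water-soluble; known carcinogen.",
    "aspirin": "Aspirin (CC(=O)Oc1ccccc1C(=O)O): weak acid, pKa ~3.5.",
}

def retrieve(query: str, k: int = 2):
    """Toy retriever: rank entries by word overlap with the query.
    A real system would use embedding similarity over a curated corpus."""
    q = set(query.lower().split())
    scored = sorted(KNOWLEDGE_BASE.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_grounded_prompt(question: str) -> str:
    """Assemble a prompt that constrains the model to the retrieved context."""
    context = "\n".join(retrieve(question))
    return (f"Answer using ONLY the context below; say 'unknown' otherwise.\n"
            f"Context:\n{context}\n\nQuestion: {question}")

print(build_grounded_prompt("Is aspirin an acid, and how strong is it?"))
```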

6.2 The Validation Gap and "Safe Biological Proxies"

The ease of generating hypotheses with AI contrasts with the difficulty of verifying them. There is a growing "validation gap" where the speed of computational output outstrips experimental throughput.

A study by the National Institute of Standards and Technology (NIST) using "safe biological proxies" highlighted this risk. They used AI to generate protein sequences that were predicted to be functional homologs of a known protein. While the AI models (like AlphaFold) predicted that these new sequences would fold correctly, wet-lab experiments revealed that many did not retain biological activity. The structures looked right in the computer, but the proteins were inert in the test tube. This underscores that while AI is a powerful generator, it is not yet a perfect simulator of physical reality, and experimental validation loops remain essential.53

6.3 Data Bias and the "Data Silo" Problem

AI models are only as good as their training data. Biological databases like UniProt are heavily biased toward well-studied organisms. This leads to models that perform poorly on "orphan" targets or underrepresented populations.

In the pharmaceutical industry, this is exacerbated by data silos. Companies hold vast troves of proprietary data on failed clinical trials and negative results—data that would be incredibly valuable for training AI models to avoid failure. However, this data is rarely shared. The lack of diversity in training data can lead to AI models that optimize drugs for the "average" biological profile (typically Caucasian and male, in the context of clinical data), potentially worsening health disparities.54

7. Conclusions and Future Outlook

The trajectory of LLMs and Generative AI in science points toward a future of integrated autonomy. We are moving away from isolated tools—a model for folding, a model for docking, a model for writing—toward unified "Agentic" systems that can reason across the entire scientific pipeline.

7.1 The Rise of Self-Driving Science

The convergence of software like the AI Co-Scientist with hardware like robotic laboratories suggests a near future of "Self-Driving Labs." In this paradigm, an AI formulates a hypothesis, designs a molecule to test it, orders the synthesis in a robotic cloud lab, receives the data, and updates its hypothesis—all in a closed loop. This is already being piloted in materials science (as seen with the Microsoft/PNNL battery discovery) and is expanding into drug discovery.

7.2 The Democratization of Discovery

Tools like ChatChemTS and NJGPT allow researchers without coding skills to leverage state-of-the-art AI. A biologist can now perform complex phylogenetic analyses or design novel molecules using simple natural language conversation. This democratization will likely lead to an explosion of discovery from non-traditional sources, but it also necessitates rigorous community standards for verification to prevent a flood of AI-generated scientific noise.56

7.3 Summary Comparison of Key Technologies


Domain | Key Model/Tool | Primary Function | Innovation (2024-2025)
Proteomics | ESM3 | Generative Design | Multimodal reasoning; simulating 500M years of evolution to create novel proteins like esmGFP.15
Proteomics | AlphaFold 3 | Structure Prediction | High-accuracy interaction modeling (Protein-DNA/Ligand) for validation.2
Genomics | DNABERT-2 | Sequence Analysis | Shift to BPE tokenization; processing long-range dependencies efficiently.11
Chemistry | ChemCrow | Agentic Workflow | Integration of external tools for synthesis planning and safety checks.32
Chemistry | MIT Graph-LLM | Molecule Generation | Hybrid Text-Graph approach to ensure synthesis validity and improve success rates to 35%.4
Pharma | PandaOmics | Target Discovery | NLP-driven "attention spike" detection for identifying novel therapeutic targets.39
Materials | Azure Quantum | Material Screening | Compressed discovery timeline (years to weeks) for new battery electrolytes.36

7.4 Final Thought

The use of LLMs and Generative AI in bioinformatics, chemicoinformatics, and pharmacoinformatics has transcended novelty. It is now infrastructure. The ability to speak the language of biology and chemistry through neural networks is allowing scientists to edit the code of life with unprecedented fluency. As we navigate the challenges of bias and verification, the partnership between human intuition and artificial generation promises to unlock solutions to the most intractable problems in health, energy, and sustainability. The "Co-Scientist" era has arrived, and it is writing the next chapter of scientific history.

Works cited

  1. Artificial Intelligence in Healthcare: 2024 Year in Review - medRxiv, accessed January 10, 2026, https://www.medrxiv.org/content/10.1101/2025.02.26.25322978v1.full-text

  2. Science and Medicine | The 2025 AI Index Report | Stanford HAI, accessed January 10, 2026, https://hai.stanford.edu/ai-index/2025-ai-index-report/science-and-medicine

  3. AI-enabled language models (LMs) to large language models (LLMs) and multimodal large language models (MLLMs) in drug discovery and development - PMC - PubMed Central, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12684927/

  4. Could LLMs help design our next medicines and materials? | MIT News, accessed January 10, 2026, https://news.mit.edu/2025/could-llms-help-design-our-next-medicines-and-materials-0409

  5. Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges, accessed January 10, 2026, https://arxiv.org/html/2412.11427v1

  6. Need a research hypothesis? Ask AI. | MIT News | Massachusetts Institute of Technology, accessed January 10, 2026, https://news.mit.edu/2024/need-research-hypothesis-ask-ai-1219

  7. Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review - PubMed Central, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11790633/

  8. A Comprehensive Review of Protein Language Models - arXiv, accessed January 10, 2026, https://arxiv.org/html/2502.06881v1

  9. Advances in Protein Representation Learning: Methods, Applications, and Future Directions, accessed January 10, 2026, https://arxiv.org/html/2503.16659v1

  10. Building AI foundation models to accelerate the discovery of new battery materials, accessed January 10, 2026, https://www.anl.gov/article/building-ai-foundation-models-to-accelerate-the-discovery-of-new-battery-materials

  11. DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes, accessed January 10, 2026, https://arxiv.org/html/2306.15006v2

  12. DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes, accessed January 10, 2026, https://openreview.net/forum?id=oMLQB4EZE1

  13. Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling, accessed January 10, 2026, https://pubmed.ncbi.nlm.nih.gov/39443676

  14. Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling - Universidade NOVA de Lisboa, accessed January 10, 2026, https://novaresearch.unl.pt/en/publications/comparing-smiles-and-selfies-tokenization-for-enhanced-chemical-l/

  15. Evolutionary Scale · ESM3: Simulating 500 million years of evolution with a language model, accessed January 10, 2026, https://www.evolutionaryscale.ai/blog/esm3-release

  16. Simulating 500 million years of evolution with a language model - PubMed, accessed January 10, 2026, https://pubmed.ncbi.nlm.nih.gov/39818825/

  17. Learning the Protein Language: Evolution, Structure and Function - PMC - PubMed Central, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC8238390/

  18. Simulating 500 million years of evolution with a language model - PubMed, accessed January 10, 2026, https://pubmed.ncbi.nlm.nih.gov/39818825/#:~:text=We%20have%20prompted%20ESM3%20to,500%20million%20years%20of%20evolution.

  19. Simulating 500 million years of evolution with a language model - bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2024.07.01.600583.full

  20. Simulating 500 million years of evolution with a language model - bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1.full.pdf

  21. ESM3 | A New Frontier for Advanced Protein Design - Future Medicine AI, accessed January 10, 2026, https://www.fmai-hub.com/esm3-a-new-frontier-for-advanced-protein-design/

  22. DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer | bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2024.04.24.590879v1.full-text

  23. Benchmarking DNA Foundation Models for Genomic Sequence Classification - PMC - NIH, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11343214/

  24. NJGPT: A Large Language Model-Driven, User-Friendly Solution for Phylogenetic Tree Construction | bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2024.12.02.626464v1

  25. LLM DNA: Tracing Model Evolution via Functional Representations - OpenReview, accessed January 10, 2026, https://openreview.net/forum?id=UIxHaAqFqQ

  26. LLM DNA: Tracing Model Evolution via Functional Representations - arXiv, accessed January 10, 2026, https://arxiv.org/html/2509.24496v1

  27. Protein language models are biased by unequal sequence sampling across the tree of life, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2024.03.07.584001v1

  28. Fine-tuning protein language models unlocks the potential of underrepresented viral proteomes - PMC - PubMed Central, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12450373/

  29. Protein language models are biased by unequal sequence sampling across the tree of life - bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2024.03.07.584001v1.full.pdf

  30. Improving Chemical Understanding of LLMs via SMILES Parsing - ACL Anthology, accessed January 10, 2026, https://aclanthology.org/2025.emnlp-main.791.pdf

  31. Improving Chemical Understanding of LLMs via SMILES Parsing - arXiv, accessed January 10, 2026, https://arxiv.org/html/2505.16340v1

  32. LLM Chemistry: AI for Chemical Innovation - Emergent Mind, accessed January 10, 2026, https://www.emergentmind.com/topics/llm-chemistry

  33. ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving, accessed January 10, 2026, https://arxiv.org/html/2411.07228v2

  34. LLM-Augmented Chemical Synthesis and Design Decision Programs - OpenReview, accessed January 10, 2026, https://openreview.net/forum?id=NhkNX8jYld

  35. LLM-Augmented Chemical Synthesis and Design Decision Programs - ResearchGate, accessed January 10, 2026, https://www.researchgate.net/publication/391677048_LLM-Augmented_Chemical_Synthesis_and_Design_Decision_Programs

  36. How AI and high-performance computing are speeding up scientific discovery - Source, accessed January 10, 2026, https://news.microsoft.com/source/features/innovation/how-ai-and-hpc-are-speeding-up-scientific-discovery/

  37. DOE lab, Microsoft find new battery material in AI-based energy storage research initiative, accessed January 10, 2026, https://www.utilitydive.com/news/pnnl-microsoft-artificial-intelligence-research-battery/704176/

  38. Showcasing gen-AI breakthroughs, Insilico Medicine presents fall edition of Pharma.AI quarterly launch, accessed January 10, 2026, https://insilico.com/news/rd037pu8d1-showcasing-gen-ai-breakthroughs-insilico

  39. Prioritization Scores | PandaOmics - Insilico Medicine, accessed January 10, 2026, https://insilico.com/pandaomics/help/scores

  40. DrugLM: A Unified Framework to Enhance Drug-Target Interaction Predictions by Incorporating Textual Embeddings via Language Models - bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2025.07.09.657250v1.full.pdf

  41. Improving drug-drug interaction prediction via in-context learning and judging with large language models - NIH, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12171303/

  42. Fine-Tuning Large Language Models for Specialized Use Cases - PMC - NIH, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11976015/

  43. Evaluating Large Language Models for Virtual Antibody Screening via Antibody-Antigen Interaction Prediction - bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2025.07.26.666985v1.full.pdf

  44. Generative AI in Drug Discovery: Top 10 Use Cases in 2024 - Antier Solutions, accessed January 10, 2026, https://www.antiersolutions.com/blogs/generative-ai-in-drug-discovery-top-10-use-cases-in-2024/

  45. De novo drug design through artificial intelligence: an introduction - Frontiers, accessed January 10, 2026, https://www.frontiersin.org/journals/hematology/articles/10.3389/frhem.2024.1305741/pdf

  46. Google's AI Co-Scientist: The New Research Partner for Biotech Scientists - TheMed, accessed January 10, 2026, https://themed.co.il/googles-ai-co-scientist-the-new-research-partner-for-biotech-scientists/

  47. AI mirrors experimental science to uncover a novel mechanism of gene transfer crucial to bacterial evolution | bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2025.02.19.639094v1

  48. Accelerating scientific breakthroughs with an AI co-scientist - Google Research, accessed January 10, 2026, https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/

  49. AI mirrors experimental science to uncover a novel mechanism of gene transfer crucial to bacterial evolution | bioRxiv, accessed January 10, 2026, https://www.biorxiv.org/content/10.1101/2025.02.19.639094v1.full

  50. LLM hallucinations and failures: lessons from 5 examples - Evidently AI, accessed January 10, 2026, https://www.evidentlyai.com/blog/llm-hallucination-examples

  51. Augmented and Programmatically Optimized LLM Prompts Reduce Chemical Hallucinations - PMC - PubMed Central, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12076503/

  52. Hallucinations Can Improve Large Language Models in Drug Discovery - arXiv, accessed January 10, 2026, https://arxiv.org/html/2501.13824v1

  53. Experimental Evaluation of AI-Driven Protein Design Risks Using Safe Biological Proxies, accessed January 10, 2026, https://www.nist.gov/publications/experimental-evaluation-ai-driven-protein-design-risks-using-safe-biological-proxies

  54. Making sense of AI: bias, trust and transparency in pharma R&D - Drug Target Review, accessed January 10, 2026, https://www.drugtargetreview.com/article/183358/making-sense-of-ai-bias-trust-and-transparency-in-pharma-rd/

  55. Exploring the ethical issues posed by AI and big data technologies in drug development, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12581208/

  56. Large language models open new way of AI-assisted molecule design for chemists - PMC, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11934680/
