A Comparative Analysis of AI Scientist v1 and v2: Architectural and Functional Evolution

1. Introduction


The pursuit of artificial intelligence capable of conducting scientific research autonomously represents a significant frontier in AI development. The AI Scientist project, initiated by Sakana AI, embodies this ambition, aiming to create systems that automate the entire research lifecycle – from ideation and experimentation to manuscript writing and review. The overarching goal is to leverage advancements in foundation models, particularly Large Language Models (LLMs), to accelerate scientific discovery and potentially transform the ever-increasing availability of computational resources into tangible scientific breakthroughs needed to address major global challenges. The initial iteration, referred to herein as AI Scientist v1, served as the first comprehensive framework demonstrating this end-to-end automation. It showcased the ability of LLMs to generate novel research ideas, write and execute code for experiments, visualize results, and produce full scientific papers, complete with a simulated peer-review process. V1 operated using a structured pipeline guided by human-authored "templates" tailored to specific machine learning subfields, such as NanoGPT, 2D Diffusion, and Grokking. These templates provided the necessary codebase, baseline results, and domain context for the AI to operate within.



AI Scientist v2 emerges as a significant evolution of this initial concept. Its development appears driven by the goal of creating a more generalized and adaptive system capable of open-ended scientific exploration across diverse domains without strict reliance on predefined structures. This marks a shift from automating known research workflows within templates towards enabling AI to navigate less defined research landscapes more autonomously. This article provides a comparison between AI Scientist v1 and AI Scientist v2. The analysis focuses on identifying and evaluating significant changes in their underlying architectural paradigms, orchestration mechanisms, core workflow implementations, approaches to domain specification, and software dependencies. The objective is to illuminate the technical evolution of the AI Scientist project and understand the implications of the changes introduced in V2. The analysis draws upon information presented in the projects' respective README files, key source code files (primarily the main execution scripts), dependency lists, and associated research documentation describing the systems' capabilities and design philosophies.


2. Architectural Paradigm Shift: From Templates to Agentic Tree Search


The most fundamental difference between AI Scientist v1 and v2 lies in their core architectural paradigms, reflecting a shift from a structured, linear process to a more dynamic, exploratory one.


V1: Template-Driven Linear Pipeline


AI Scientist v1 implemented a relatively linear, sequential pipeline to automate the research process. The workflow typically proceeded through distinct stages: Idea Generation, Novelty Check (using scholarly search engines like Semantic Scholar or OpenAlex), Experimental Iteration (involving code modification and execution), Paper Write-up (generating LaTeX manuscripts), and an optional automated Review and Improvement cycle. Central to V1's architecture was the concept of human-authored templates. Each template encapsulated a specific research domain (e.g., NanoGPT) and included essential components:


  • experiment.py: The core script defining the baseline experiment.

  • plot.py: A script for generating visualizations from experimental results.

  • prompt.json: Information and context about the template for the LLM.

  • seed_ideas.json: Optional examples of research ideas within the domain.


These templates provided the necessary scaffolding, including the initial codebase and context, upon which the AI Scientist operated. LLMs were employed at various stages, guided by specific prompts, to generate ideas, modify the template code (often facilitated by the Aider coding assistant), analyze results, and write the paper. While effective for automating research within the predefined scope of a template, this architecture inherently limited the system's ability to explore outside these boundaries or adapt easily to entirely new domains without significant human effort in creating new templates.
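
To make this concrete, the sketch below shows how the JSON components of a new template directory might be generated. The field names and prompt contents are illustrative assumptions rather than the exact V1 schema, and experiment.py and plot.py would still need to be authored by hand.

  import json
  from pathlib import Path

  # Hypothetical scaffolding for a new V1-style template directory.
  # Field names are illustrative assumptions, not the exact V1 schema.
  template_dir = Path("templates/my_new_domain")
  template_dir.mkdir(parents=True, exist_ok=True)

  prompt = {
      "system": "You are an ambitious AI researcher working on <my domain>.",
      "task_description": "Improve on the baseline defined in experiment.py for <my domain>.",
  }
  seed_ideas = [
      {
          "Name": "baseline_ablation",
          "Title": "Ablating the Baseline Configuration",
          "Experiment": "Vary one hyperparameter of experiment.py and compare against the baseline results.",
      }
  ]

  (template_dir / "prompt.json").write_text(json.dumps(prompt, indent=2))
  (template_dir / "seed_ideas.json").write_text(json.dumps(seed_ideas, indent=2))
  # experiment.py, plot.py, and baseline results still have to be authored by a human.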


V2: Generalized Agentic Tree Search


AI Scientist v2 moves away from the rigid template-based pipeline towards what is described as a "generalized end-to-end agentic system". The core architectural innovation is the adoption of a "progressive agentic tree search" methodology, implemented as a best-first tree search (BFTS). In this paradigm, an "experiment manager agent" guides the system through the complex process of exploring a tree of possibilities. Each node or path in the tree might represent a hypothesis, an experimental configuration, a set of results, or an analysis step. The system explores this tree, generating hypotheses, designing and running experiments, and analyzing data in a more integrated and potentially non-linear fashion.

A key objective and feature of V2 is the explicit removal of the dependency on human-authored templates. This architectural choice is intended to enable greater generalization across different Machine Learning (ML) domains and facilitate more open-ended scientific exploration, tackling tasks without a predefined structure. This approach allows V2 to potentially venture into less defined research areas, as demonstrated by its reported success in generating an entirely AI-written workshop paper accepted through peer review.
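
To illustrate the control flow implied by such a search (purely a conceptual sketch, not V2's actual implementation), the loop below assumes three hypothetical callables: propose, standing in for the LLM-backed experiment manager that suggests follow-up experiments; run, which executes an experiment; and score, which rates its results.

  import heapq
  from dataclasses import dataclass, field

  @dataclass(order=True)
  class Node:
      neg_score: float                      # heapq is a min-heap, so the score is stored negated
      plan: str = field(compare=False)      # natural-language description of the experiment
      results: dict = field(default=None, compare=False)

  def best_first_search(root_plan, propose, run, score, budget=20, branching=3):
      # propose(plan, results, k): manager agent suggests k follow-up experiments
      # run(plan): executes the experiment and returns its results
      # score(results): rates the results (higher is better)
      frontier = [Node(0.0, root_plan)]
      best = None
      for _ in range(budget):
          if not frontier:
              break
          node = heapq.heappop(frontier)    # expand the most promising node so far
          node.results = run(node.plan)
          s = score(node.results)
          if best is None or s > -best.neg_score:
              best = Node(-s, node.plan, node.results)
          for child_plan in propose(node.plan, node.results, branching):
              heapq.heappush(frontier, Node(-s, child_plan))  # child inherits the parent's score as a prior
      return best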

Analysis of the Shift


This transition from a template-driven pipeline to an agentic tree search represents a fundamental change in the system's philosophy and operational model. V1 provided a structured way to automate research tasks within a known framework, whereas V2 embraces the uncertainty inherent in exploration. This architectural divergence introduces clear trade-offs. V1, constrained by its templates, likely offered higher reliability and predictability when operating within well-defined domains where templates could accurately capture the experimental setup. Its linear pipeline simplified orchestration and debugging. V2, by contrast, prioritizes flexibility and the potential for novel discovery in open-ended scenarios. The agentic tree search allows for a more dynamic exploration of the research space, potentially uncovering insights that wouldn't fit neatly into a predefined template. However, this exploratory power comes at the cost of increased complexity. Tree search algorithms can be computationally intensive and may explore many unfruitful paths, potentially leading to lower overall success rates compared to V1 on tasks well-suited to templates. The move from V1's structured approach to V2's exploratory one reflects a strategic decision to tackle the more challenging aspects of automated scientific discovery – namely, navigating novelty and ambiguity – rather than solely optimizing the automation of established workflows.


The following table summarizes the high-level differences stemming from this architectural shift:


Table 1: High-Level Comparison of AI Scientist v1 vs. v2

3. Orchestration and Execution: launch_scientist.py vs. launch_scientist_bfts.py


The difference in architectural paradigms is directly reflected in the main execution scripts of the two versions and how they orchestrate the overall process.


V1: launch_scientist.py


In V1, launch_scientist.py served as the central orchestrator for the entire workflow. Its primary function was to manage the generation, evaluation, and processing of multiple research ideas based on a selected template. Key command-line arguments allowed users to configure this process:


  • --experiment: Specified the template directory defining the research domain.

  • --model: Selected the primary LLM used for tasks like idea generation and writing.

  • --num-ideas: Controlled how many initial ideas were generated.

  • --skip-idea-generation / --skip-novelty-check: Allowed bypassing initial stages.

  • --parallel: Enabled processing multiple ideas concurrently, typically leveraging available GPUs.

  • --improvement: Triggered an optional step to refine the paper based on automated review.


The script's execution flow involved parsing arguments, setting up the environment (checking GPU availability and LaTeX dependencies), optionally generating and checking the novelty of ideas, filtering novel ideas, and then iterating through these ideas. For each novel idea, it invoked a core function (do_idea) either sequentially or in parallel worker processes. The do_idea function encapsulated the linear workflow for a single idea: running experiments by modifying template code via Aider, generating a LaTeX write-up, performing a review, and optionally improving the paper.
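
A heavily simplified sketch of that dispatch pattern follows; the real do_idea takes many more arguments (template, idea, LLM client, and so on), and the stub below only marks where the per-idea pipeline would run.

  import multiprocessing
  import torch

  def do_idea(idea, gpu_id):
      # Stand-in for V1's real do_idea: experiments via Aider, LaTeX write-up,
      # automated review, and optional improvement for a single idea.
      print(f"Running idea {idea!r} on GPU {gpu_id}")

  def worker(gpu_id, idea_queue):
      # Each worker pins itself to one GPU and drains ideas from the shared queue.
      while True:
          idea = idea_queue.get()
          if idea is None:                  # sentinel: no more work
              break
          do_idea(idea, gpu_id)

  if __name__ == "__main__":
      ideas = ["idea_a", "idea_b", "idea_c"]   # novel ideas surviving the novelty check
      queue = multiprocessing.Queue()
      for idea in ideas:
          queue.put(idea)
      n_gpus = max(torch.cuda.device_count(), 1)
      for _ in range(n_gpus):
          queue.put(None)
      procs = [multiprocessing.Process(target=worker, args=(g, queue)) for g in range(n_gpus)]
      for p in procs:
          p.start()
      for p in procs:
          p.join()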


V2: launch_scientist_bfts.py


In contrast, V2's launch_scientist_bfts.py script has a different scope and purpose. It serves as the entry point to initiate the agentic tree search process for a single, predefined research idea. Its focus is on configuring and launching one instance of the complex BFTS exploration, rather than managing a batch of independent ideas. Key arguments reflect this shift:


  • --load_ideas: Specifies a JSON file containing the initial idea(s).

  • --idea_idx: Selects which specific idea from the file to execute.

  • --writeup-type: Defines the format of the final paper (e.g., "normal" 8-page, "icbinb" 4-page).

  • --model_agg_plots, --model_writeup, --model_citation, --model_review: Allow granular selection of different LLMs for specific sub-tasks within the process.

  • --skip_writeup / --skip_review: Allow bypassing final stages.


Notably absent are arguments for template selection, idea generation, or parallel processing of multiple distinct ideas within this script's logic. The execution flow involves parsing arguments, setting up the environment, loading the specified idea from the JSON file, creating a results directory, configuring the BFTS process (by editing a bfts_config.yaml file), and then invoking the core tree search function (perform_experiments_bfts_with_agentmanager). After the search completes, the script handles plot aggregation, optional write-up generation (with retries), optional paper review, and process cleanup.
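
Stripped of error handling and retries, the flow reads roughly like the paraphrase below. The file names, the exp_name config key, and the argument passed to the tree search function are assumptions made for illustration; the actual signatures in the repository may differ.

  import json
  from omegaconf import OmegaConf
  from ai_scientist.treesearch.perform_experiments_bfts_with_agentmanager import (
      perform_experiments_bfts_with_agentmanager,
  )

  # Load the chosen idea (the equivalent of --load_ideas / --idea_idx).
  with open("my_ideas.json") as f:
      ideas = json.load(f)
  idea = ideas[0]

  # Point the BFTS configuration at this run (the key name is hypothetical).
  cfg = OmegaConf.load("bfts_config.yaml")
  cfg.exp_name = idea["Name"]
  OmegaConf.save(cfg, "bfts_config.yaml")

  # Kick off the agentic tree search; plot aggregation, write-up, and review follow afterwards.
  perform_experiments_bfts_with_agentmanager("bfts_config.yaml")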


Comparative Analysis


The comparison of the launch scripts highlights several key differences:


  1. Scope of Execution: V1's script manages the lifecycle of multiple ideas derived from a template. V2's script manages the execution of a single, more complex exploratory process based on one pre-loaded idea. This suggests that running multiple V2 experiments might require external orchestration or scripting to invoke launch_scientist_bfts.py multiple times with different ideas or configurations (a minimal driver sketch follows this list).

  2. Configuration Granularity: V2 introduces more fine-grained control over which LLMs are used for specific tasks (plotting, writing, citation, review) via dedicated arguments, whereas V1 primarily used a single --model argument for most tasks. This allows for optimizing model choice based on task requirements and cost.

  3. Configuration Method: V2 introduces reliance on an external YAML file (bfts_config.yaml) for configuring the core BFTS algorithm. V1's configuration was primarily handled through command-line arguments. The use of a dedicated configuration file in V2 points towards the increased complexity of the underlying tree search mechanism, likely involving numerous parameters (e.g., search depth, breadth limits, heuristics, resource allocation) that benefit from structured configuration management rather than simple command-line flags. This external configuration is a hallmark of systems with more intricate internal workings compared to V1's more straightforward pipeline.

  4. Functional Differences: Arguments related to idea generation (--num-ideas, --skip-idea-generation) and parallel processing of multiple ideas (--parallel) present in V1 are absent in V2's script. Similarly, V1's explicit --improvement step argument is not present in V2's script, suggesting that refinement might be handled differently or integrated within the main write-up/review process. V2 also introduces different --writeup-type options.
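
As referenced in point 1, a minimal external driver for batching several V2 runs could be as simple as the following; it assumes only the command-line arguments already discussed above.

  import subprocess

  # Hypothetical batch driver: launch one tree search per idea in the file.
  N_IDEAS = 3
  for idx in range(N_IDEAS):
      subprocess.run(
          [
              "python", "launch_scientist_bfts.py",
              "--load_ideas", "my_ideas.json",
              "--idea_idx", str(idx),
              "--writeup-type", "icbinb",
          ],
          check=True,   # stop the batch if one run fails
      )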


The structural changes in the launch script and configuration approach underscore the increased complexity inherent in V2's agentic tree search. While V1's script managed multiple relatively simple linear workflows, V2's script sets up and monitors a single, potentially much deeper and more computationally intensive, exploratory run.


Table 2: Comparison of Main Execution Script Arguments

4. Core Workflow and Logic Implementation (ai_scientist Package)


The architectural and orchestration changes are mirrored in the internal structure and logic of the core ai_scientist Python package.


V1: Modular Pipeline Implementation


Analysis of the import statements in V1's launch_scientist.py reveals a highly modular structure within the ai_scientist package, directly corresponding to the stages of its linear pipeline. Key functions were imported from distinct submodules:


  • ai_scientist.generate_ideas: Contained generate_ideas and check_idea_novelty.

  • ai_scientist.llm: Provided create_client for LLM interaction.

  • ai_scientist.perform_experiments: Handled the execution of experiments defined in the template.

  • ai_scientist.perform_review: Included perform_review, load_paper, and perform_improvement.

  • ai_scientist.perform_writeup: Offered perform_writeup and generate_latex.


This structure indicates a clear separation of concerns, with dedicated modules responsible for each step: ideation, LLM interface, experimentation, review/improvement, and writing. The experimentation logic within perform_experiments crucially involved interacting with the template files (experiment.py, plot.py), using the Aider tool (aider-chat library) to make LLM-directed code modifications based on the generated idea, executing the modified code, and collecting results.


V2: Agentic Tree Search Implementation


V2's ai_scientist package structure can be inferred by analyzing the imports in launch_scientist_bfts.py:


  • ai_scientist.treesearch.perform_experiments_bfts_with_agentmanager: Imports the central function perform_experiments_bfts_with_agentmanager. This strongly suggests a new treesearch submodule containing the core logic for the agentic exploration.

  • ai_scientist.perform_plotting: Provides aggregate_plots.

  • ai_scientist.perform_writeup: Contains perform_writeup and a new perform_icbinb_writeup.

  • ai_scientist.perform_llm_review: Includes perform_review (likely text-based review).

  • ai_scientist.perform_vlm_review: Adds perform_imgs_cap_ref_review, suggesting the use of Vision-Language Models (VLMs) for reviewing figures, captions, and references.

  • ai_scientist.gather_citations: Contains gather_citations.


This import structure indicates that while some functionalities like plotting, writing, and reviewing persist, the core experimentation logic has been replaced or encapsulated within the new treesearch module, specifically within the perform_experiments_bfts_with_agentmanager function. The introduction of functions like perform_icbinb_writeup and perform_imgs_cap_ref_review points to expanded or modified capabilities in the write-up and review stages.
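
Reconstructed from the bullets above (so exact module paths may differ slightly in the repository), the corresponding import block would look roughly like this:

  from ai_scientist.treesearch.perform_experiments_bfts_with_agentmanager import (
      perform_experiments_bfts_with_agentmanager,
  )
  from ai_scientist.perform_plotting import aggregate_plots
  from ai_scientist.perform_writeup import perform_writeup, perform_icbinb_writeup
  from ai_scientist.perform_llm_review import perform_review
  from ai_scientist.perform_vlm_review import perform_imgs_cap_ref_review
  from ai_scientist.gather_citations import gather_citations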


Analysis of Logic Changes


The fundamental logic shifts from V1's approach of executing a predefined experimental plan derived from an idea and template, to V2's approach of exploring a state space of possibilities managed by an agent within a tree search framework. In V2, the steps of hypothesis generation, experimental design, execution, and analysis are likely more tightly interwoven within the nodes and branches of the search tree, rather than being distinct sequential phases as in V1. A particularly significant change relates to how code is generated or modified for experiments. V1 explicitly relied on the Aider tool and the aider-chat library to edit template files. However, the aider-chat dependency is absent in V2's requirements.txt, and neither the V2 README nor its launch script explicitly mentions Aider. Since automated experiments inherently require code execution and likely code modification, the most plausible reading is that V2 prompts the primary LLM (used within the agent manager) directly to generate or modify code snippets as needed during the tree search, integrating this capability within the perform_experiments_bfts_with_agentmanager logic.

This represents a substantial internal re-architecture of how the system interacts with code, moving away from reliance on a specific external tool towards potentially more integrated, LLM-native code manipulation within the agentic loop.
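
Purely as a conceptual illustration of what such LLM-native code manipulation could look like inside the search loop (not V2's actual code), the helper below prompts an assumed llm callable for a full replacement script, writes it out, and executes it:

  import subprocess

  def llm_rewrite_and_run(llm, node_plan, current_code):
      # Ask the primary LLM for a complete replacement script rather than
      # delegating the edit to an external tool such as Aider.
      prompt = (
          "You are running an automated ML experiment.\n"
          f"Plan for this tree-search node: {node_plan}\n"
          "Current experiment script:\n"
          f"{current_code}\n"
          "Return the full revised script only."
      )
      new_code = llm(prompt)                  # assumed callable wrapping an LLM API
      with open("experiment_node.py", "w") as f:
          f.write(new_code)
      # Execute the rewritten experiment and hand its output back to the manager agent.
      result = subprocess.run(["python", "experiment_node.py"], capture_output=True, text=True)
      return result.stdout, result.stderr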

5. Domain Specification: The Role of Templates


The mechanism for defining the scientific domain and the scope of exploration has undergone a significant transformation between V1 and V2.


V1: Template-Centric


In AI Scientist v1, templates were the cornerstone of domain specification. The templates/ directory was a prominent top-level component in the repository structure, housing subdirectories for different research areas like nanoGPT, 2d_diffusion, and grokking. Each template provided the baseline code (experiment.py), plotting utilities (plot.py), contextual information (prompt.json), and potentially seed ideas (seed_ideas.json) necessary for the AI Scientist to operate within that specific domain. Applying the AI Scientist to a new area of study explicitly required the creation of a new template following the established structure. This approach provided a clear, albeit somewhat rigid, way to define the boundaries and starting point for the automated research process.


V2: Towards Template Independence


AI Scientist v2 explicitly aims to "remove the reliance on human-authored templates". This strategic shift is reflected in the repository structure: the top-level directory listing for V2 notably lacks a templates/ directory, a clear departure from V1. In V2, domain specification appears to be handled differently, likely through the combination of the initial "idea" provided to the system and parameters within the bfts_config.yaml file. The launch_scientist_bfts.py script requires an input idea via --load_ideas and --idea_idx. This initial idea, presumably stored in a JSON format, likely needs to encapsulate the necessary context, potentially including or referencing the starting codebase, research question, or domain background that was previously contained within a V1 template. The bfts_config.yaml file might also contain parameters that help define the scope or constraints of the exploration.
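
The sketch below illustrates the kind of grounding such an entry might need to carry in place of a template; every field name here is hypothetical, since the actual schema is defined by the V2 codebase.

  import json

  # Hypothetical structure for one entry in the --load_ideas JSON file.
  idea = {
      "Name": "compositional_regularization",
      "Title": "Compositional Regularization for Small Transformers",
      "Short Hypothesis": "An auxiliary compositionality loss improves generalization on algorithmic tasks.",
      "Abstract": "Background and motivation that a V1 template would previously have supplied...",
      "Experiments": "Train a small transformer with and without the auxiliary loss; compare validation accuracy.",
      "Risk Factors and Limitations": "Results may not transfer beyond toy algorithmic datasets.",
  }
  with open("my_ideas.json", "w") as f:
      json.dump([idea], f, indent=2)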


Implications


This move away from explicit templates towards relying on the initial idea and configuration offers potentially greater flexibility. It could allow V2 to tackle research problems in domains where creating a comprehensive template is difficult or impractical, or where the research direction is less defined initially. This aligns directly with V2's stated goal of being a more "generalized" system capable of "open-ended scientific exploration". However, this shift also implies that the quality and completeness of the initial idea provided to V2 become paramount. The burden shifts from creating a structured template directory to formulating an initial idea (likely in JSON format) that provides sufficient grounding and context for the agentic tree search to commence effectively. Without the explicit structure of a template, ensuring the AI agent has the necessary starting information and constraints might require careful crafting of this initial input. The success of V2's exploration may depend heavily on how well this initial idea primes the agentic system.


6. Dependency Ecosystem Evolution


Comparing the requirements.txt files of V1 and V2 reveals significant changes in the underlying software ecosystem, reflecting the architectural shifts and adoption of new tools.


Comparison of requirements.txt


V1's dependencies focused on core LLM APIs, ML libraries, basic visualization, and the Aider coding tool. V2's dependencies show additions related to configuration management, enhanced visualization, data handling, graph processing, and potential cloud integration, alongside notable removals.


Key Additions in V2


Several new dependencies in V2 point towards a more sophisticated software architecture and tooling:


  • omegaconf: A library for managing hierarchical configurations, likely used for handling the bfts_config.yaml file essential for the tree search.

  • seaborn, rich: Provide enhanced plotting capabilities beyond basic matplotlib and improved terminal output/logging, respectively, aiding observability.

  • humanize, dataclasses-json, jsonschema: Facilitate more robust handling, validation, and serialization of structured data (like configurations or agent states), crucial in complex systems.

  • python-igraph: A library for graph analysis and manipulation. Its presence strongly suggests its use in managing the tree structure inherent in the BFTS algorithm (an illustrative sketch follows this list).

  • botocore, boto3: The AWS SDK for Python, indicating potential integration with AWS services for storage, computation, or other cloud functionalities.

  • Various utility libraries (funcy, shutup, coolname) and developer tools (black, genson) suggest improved development practices.
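
For example, as flagged in the python-igraph bullet above (and only as an assumption about how such a library could be applied, not a description of V2's internals), keeping the experiment tree in an igraph Graph makes it easy to query:

  import igraph as ig

  # Illustrative only: tracking a tree of experiment nodes with python-igraph.
  tree = ig.Graph(directed=True)
  root = tree.add_vertex(name="root", plan="baseline experiment", score=0.0)
  child = tree.add_vertex(name="node_1", plan="baseline + auxiliary loss", score=0.72)
  tree.add_edge(root.index, child.index)

  # Find the best-scoring leaf (a vertex with no outgoing edges).
  leaves = [v for v in tree.vs if v.outdegree() == 0]
  best = max(leaves, key=lambda v: v["score"])
  print(best["plan"], best["score"])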


Key Removals/Changes from V1 to V2


Significant removals or changes highlight functional and architectural divergence:


  • aider-chat: The removal of this library confirms the shift away from using Aider as the primary coding assistant, supporting the analysis in Section 4. V2 employs a different mechanism for code modification.

  • google-generativeai: V1's README mentioned support for Google Gemini models, and its requirements.txt included this library. V2's requirements.txt only lists openai and anthropic, suggesting that direct support for Google models was dropped in V2, at least initially.

  • torch: PyTorch was listed as a requirement in V1 but is absent from V2's requirements.txt. This is a potentially major change. It could imply that V2 relies less on executing PyTorch models directly within its main process, perhaps orchestrating experiments run in separate environments or focusing more on API-based model interactions. However, it's also possible that torch is assumed to be pre-installed in V2's target environment. Further investigation would be needed to confirm the exact implications.

  • pymupdf vs pymupdf4llm: V1 used pymupdf, while V2 uses pymupdf4llm. This indicates an update to a potentially more specialized library for processing PDF documents in the context of LLMs, possibly offering better text extraction or structural analysis for feeding papers to review agents.
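
A small illustration of the kind of usage this change enables, assuming the library's to_markdown helper is what feeds papers to the review stage (an inference, not a confirmed detail):

  import pymupdf4llm

  # Convert a generated paper PDF into Markdown so an LLM reviewer receives
  # headings, lists, and tables in a structure-preserving form.
  md_text = pymupdf4llm.to_markdown("results/paper.pdf")
  with open("results/paper.md", "w", encoding="utf-8") as f:
      f.write(md_text)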


Analysis


The evolution of dependencies clearly signals that V2 is not just a refinement but a significant re-architecture incorporating more advanced software engineering practices and tools tailored to its new agentic search paradigm. The inclusion of libraries for configuration management (omegaconf), graph manipulation (python-igraph), data validation (jsonschema), enhanced observability (seaborn, rich), and potential cloud integration (boto3) points to a system designed for greater complexity, robustness, and potentially scalability compared to V1. The removal of aider-chat and the change regarding torch further emphasize the deep internal changes in how V2 handles code and potentially executes experiments. This dependency footprint reflects the transition towards a more sophisticated, exploratory agent framework.


Table 3: Dependency Changes (Key Libraries)

7. Synthesized Summary of Key Differences


The evolution from AI Scientist v1 to v2 represents a significant leap in architectural philosophy and technical implementation. The key differences can be summarized as follows:


  • Architecture: V1 utilized a linear pipeline heavily reliant on human-authored templates to guide the research process within specific domains. V2 adopts a fundamentally different approach using a generalized, agentic tree search (likely BFTS) designed for open-ended exploration and aiming for template independence.

  • Flexibility vs. Reliability: This architectural shift implies a trade-off. V1 likely offered greater reliability and predictability for tasks fitting its template structure. V2 prioritizes flexibility and the ability to tackle novel, less-defined problems, accepting the inherent complexities and potential variability of exploratory search.

  • Orchestration: V1's main script (launch_scientist.py) managed the generation and parallel processing of multiple ideas based on a chosen template. V2's script (launch_scientist_bfts.py) focuses on launching and configuring a single, complex tree search process for one predefined idea, utilizing external YAML configuration (bfts_config.yaml) and offering granular model selection for sub-tasks.

  • Coding Assistance: V1 explicitly used the aider-chat library for LLM-driven code modification within templates. V2 removes this dependency, suggesting a different, potentially more integrated approach to code generation or manipulation within the agentic framework.

  • Dependencies: V2's dependencies reflect its increased complexity and adoption of modern software tooling, incorporating libraries for configuration management (omegaconf), graph processing (python-igraph), data validation (jsonschema), enhanced observability (seaborn, rich), and potential cloud integration (boto3). Key V1 dependencies like aider-chat are removed, and the status of torch as a direct requirement has changed.

  • Focus: While both versions aim for end-to-end automation, V1's highlighted achievement involved validating its automated reviewer against human benchmarks. V2 emphasizes its generalized exploratory capability, showcased by the generation of an AI-written workshop paper accepted via peer review.


8. Implications and Conclusion


The transition from AI Scientist v1 to v2 signifies a clear and ambitious trajectory for the project. It reflects a move beyond simply automating well-defined research workflows towards building AI systems capable of more genuine, open-ended scientific exploration. Sakana AI appears to be directly confronting the challenges of creating AI that can navigate the ambiguity and novelty inherent in the scientific discovery process, rather than just executing predefined steps within rigid constraints. The potential benefits of V2's agentic tree search architecture are significant. Its template-independent design could drastically increase the system's applicability across diverse scientific domains, reducing the human effort previously required for template creation. The exploratory nature of the search holds the promise of discovering more unexpected or unconventional solutions and insights compared to the more constrained approach of V1.

However, this advanced architecture also introduces potential challenges. Agentic tree search is inherently more complex and computationally demanding than a linear pipeline. Controlling, interpreting, and ensuring the reliability of such an exploratory process can be difficult. The removal of explicit templates, while fostering flexibility, might make it harder to initially steer the research direction or ensure the agent has sufficient context, placing a greater burden on the formulation of the input "idea." There might be a higher variance in the quality and success rate of V2's outputs compared to V1 operating within a well-suited template.


For researchers or developers considering using these systems:


  • AI Scientist v1 might be more suitable for tasks where a clear experimental structure exists and can be captured in a template, potentially offering higher reliability for automating research within that defined scope.

  • AI Scientist v2 appears better suited for exploring less defined research questions, aiming for broader generalization across ML domains, or for research focused on the capabilities and limitations of agentic AI discovery systems themselves.

  • Users should be aware of the different operational models: V1 requires template creation, while V2 relies on carefully crafted initial ideas (likely JSON) and YAML configuration. The underlying dependencies and mechanisms for code interaction also differ significantly.


AI Scientist V2 represents a substantial advancement in the quest for automated scientific discovery. It moves beyond the proof-of-concept stage demonstrated by V1's template-bound automation and embraces the complexity of open-ended exploration through its novel agentic tree search architecture. This evolution mirrors broader trends in AI research, pushing towards more autonomous, flexible, and capable AI agents. While introducing new challenges related to complexity and control, V2 takes a significant step closer to the vision of AI not just as a tool to assist human scientists, but as an increasingly independent collaborator in the scientific enterprise. The project underscores the ongoing efforts to harness the power of foundation models to tackle the core creative and exploratory aspects of scientific research.

