
Content from CLI-based AI


Last updated on 2026-03-28

Estimated time: 25 minutes

Overview

Questions

  • Why use a CLI for AI instead of a browser?
  • What is the Living Spec and why does it matter?

Objectives

  • Compare CLI and browser-based AI tools.
  • Create a Living Spec (GEMINI.md) to guide an agent.
  • Explain the shift from writer to orchestrator.

Ask all learners to run:

BASH

gemini --version

If it returns a version number, they are ready. If the command is not found, they need to complete the install and run gemini auth login before continuing.

Why CLI matters for research


Most researchers use chat-based AI in a browser. These tools are good for brainstorming but run in an isolated sandbox. They cannot see your files, run your code, or understand your project structure without manual uploads.

A CLI (Command Line Interface) agent runs in your terminal — the same place you run Python scripts or navigate your filesystem with ls and cd — and has access to three things a browser tool does not.

Your files and data. The agent can read your actual datasets, inspect your directory structure, and write scripts directly to disk. You are not copying and pasting between a chat window and a code editor. The agent works in your project the way a collaborator sitting at your machine would.

Your installed tools. Your machine probably has domain-specific software on it: geospatial tools like GDAL, bioinformatics pipelines, R packages, custom scripts, institutional data connectors. A browser AI has no idea these exist. A CLI agent can call them directly, pass output between them, and build on what you already have installed.

An iterative loop. When a script fails, the agent sees the error output in the terminal and can try again. You are not copying stack traces back into a chat window. The feedback loop is tight and stays in one place.

Caution

Data privacy and institutional context

The University of California and many other campuses have enterprise agreements with AI vendors like OpenAI and Google. These agreements usually state that your data will not be used to train public models.

However, CLI or API access under these agreements is not always documented. Campus IT often needs to provision this access, and terms can vary. Verify with your institution whether your license covers both web interfaces and CLI/API access.

Warning: Personal accounts may allow CLI/API usage but often lack the privacy protections of institutional licenses. Consult your campus IT policy before using AI tools with sensitive data.

Looking ahead: If your research requires absolute privacy, these skills transfer to open LLMs (like Gemma or Llama) run locally via Ollama.

Shift from writer to orchestrator


Traditional programming requires you to remember the syntax and logic of a script. Spec-Driven Research Orchestration offloads syntax generation to the AI, letting you focus on the high-level logic and the Living Spec.

graph TD
    A[Researcher] -->|Define goal| B(GEMINI.md\nLiving Spec)
    B --> C[Request a plan]
    C --> D{Approval Gate}
    D -->|Approve| E(AI Agent executes)
    D -->|Revise| C
    E -->|Draft code| F{Verification}
    F -->|Passed| G[Final Output]
    F -->|Failed| H[Refinement]
    H --> B
    style D fill:#bbf,stroke:#333,stroke-width:2px
    style F fill:#f9f,stroke:#333,stroke-width:2px

Ask learners: “Have you ever used ChatGPT to write code that looked correct but failed when you ran it?” This is a good time to introduce the concept of orchestration. The goal is not just to “fix” code, but to ensure the AI’s intent (the spec) is correct.

This introduces a new challenge: verification load. You must coordinate and validate the agent’s actions against your requirements.

Callout

Managing cognitive load

It is common to feel “out of the loop” when the AI generates many lines of code quickly. To manage this, focus on anchoring your understanding. Read the comments the AI generates and test small pieces of code frequently. If a block of logic is confusing, ask the AI to explain it before moving on.

File system access


Unlike browser tools, the Gemini CLI has access to your working environment. It can read project context from the directory structure and modify files. Instead of copying and pasting code, the agent writes scripts to your disk and can iterate based on terminal errors.

Caution

Security responsibility

Giving an AI agent access to your filesystem is a security responsibility. A buggy or misconfigured agent could delete files or access sensitive data, such as passwords.

Always consider that your tools can have unintended consequences. Ensure files are backed up or under version control (like Git) so you can revert unwanted changes.

Long context

Humans have a limited working memory, and LLMs operate in a similar fashion. In LLM tools, this limit is called the context window. Models like Gemini 2.5 have a long context window of 1 million tokens or more, so you can provide the AI with your entire project folder (scripts, documentation, and small datasets) at once.

This allows you to describe the desired state of your project, and the agent coordinates changes across multiple files. In a research context, this is declarative programming with AI agents.

Callout

A large context window is not a free pass

The more you load into a session, the more the model has to track. Beyond a certain point, quality degrades — the model may lose track of earlier instructions, produce inconsistent output, or fixate on the wrong files. This is sometimes called context poisoning.

A large context window makes this easier to run into, not harder. Managing what goes into your context is part of the workflow, not an afterthought.
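One way to manage what goes into your context is to estimate how much text a folder would add before you start a session. The sketch below is illustrative only: the function names, the list of file extensions, and the rule of thumb of about four characters per token are assumptions, not Gemini CLI features, and a real tokenizer will give different counts.

```python
from pathlib import Path

# Assumption: roughly 4 characters per token. Real tokenizers vary.
CHARS_PER_TOKEN = 4
# Hypothetical set of text-like file types worth counting.
TEXT_SUFFIXES = {".py", ".r", ".md", ".txt", ".csv", ".yaml", ".yml"}


def rough_token_count(text):
    """Return a coarse token estimate for a string."""
    return len(text) // CHARS_PER_TOKEN


def estimate_folder_tokens(folder):
    """Map each text-like file under folder to its rough token estimate."""
    root = Path(folder)
    estimates = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            estimates[str(path.relative_to(root))] = rough_token_count(text)
    return estimates


print(rough_token_count("x" * 4000))  # 1000
```

Calling `estimate_folder_tokens(".")` from your project folder and sorting the result by size shows which files would consume most of the window.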

Let’s make sure this works


Open a terminal window and type gemini --help. You should see something like:

BASH

Usage: gemini [options] [command]

Gemini CLI - Defaults to interactive mode. Use -p/--prompt for non-interactive (headless) mode.

Commands:
  gemini [query..]             Launch Gemini CLI  [default]
  gemini mcp                   Manage MCP servers
  gemini extensions <command>  Manage Gemini CLI extensions.  [aliases: extension]
  gemini skills <command>      Manage agent skills.  [aliases: skill]
  gemini hooks <command>       Manage Gemini CLI hooks.  [aliases: hook]
...

Navigate to your project folder and start a session:

BASH

cd /path/to/project/directory
gemini -p "Tell me what operating system I am currently using and list the files in this directory."

Compare the output to what you see when you run ls (or dir on Windows). Did the AI accurately describe your environment?

The AI should return a response similar to:

You are currently using macOS (Darwin). The files in this directory are:
- index.md
- config.yaml
- episodes/
- data/
...

Notice that gemini can ‘see’ your files and understands what environment you are working in.

We have a project folder that we want to start a project in. Let’s initialize it as a Gemini project and see what that does.

Callout

Working directory matters

Always start the Gemini CLI from inside your project folder. The agent uses the current directory to find your files and spec. Starting from the wrong folder — such as your home directory — is one of the most common sources of confusion in a workshop.

Initialize your project

The Gemini CLI includes an /init slash command that creates a GEMINI.md template in your working directory. It runs inside an interactive session, so start one first:

BASH

gemini

You are now inside Gemini. Press / to see the available slash commands and page through the full list. Notice /init — this is the command that will initialize our project. Let’s run it:

BASH

/init

Notice that the agent inspects your folders and files before it writes anything. After it finishes, let’s see what new files have been created. Inside Gemini, you can run a shell command by prefixing it with !:

BASH

!ls

This lists the files in the directory, including any that were just created. You should see a file named GEMINI.md. Let’s look inside it using the cat command with !:

BASH

!cat GEMINI.md

Here is what Gemini generated for a fresh empty project folder:

MARKDOWN

# Project Context: vibe-coding-lesson

## Directory Overview

The `/Users/geno/Desktop/vibe-coding-lesson` directory appears to be an empty
project directory. It currently contains only this `GEMINI.md` file, which
serves as an index and context provider for the project.

## Key Files

*   **`GEMINI.md`**: This file contains project-specific context, documentation,
    and instructional information for AI agents and developers. It is intended to
    be updated as the project evolves.

## Usage

This directory is intended to be a workspace for the 'vibe-coding-lesson'
project. As it is currently empty, it is ready for initialization or the
addition of new files and code. Future usage will depend on the specific goals
of the 'vibe-coding-lesson' project.

---
*Last analyzed on: March 25, 2026*

A few things to notice. Gemini scanned the directory and described what it found. Because the folder was empty, it does not have much to say yet. Once you add data files, scripts, and documentation, running /init again will produce a richer spec that reflects your actual project. This is also something you will want to edit by hand — add your goals, constraints, and any rules you want the agent to follow.

The Living Spec


To get the most out of a CLI agent, provide it with persistent context about your project. This acts as a “Living Spec”—a set of rules and goals the agent must follow across every session.

Every major CLI tool has its own native spec file that it loads automatically when you start a session:

| Tool | Native spec file | Auto-loaded? |
|---|---|---|
| Gemini CLI | GEMINI.md | Yes |
| Claude Code | CLAUDE.md | Yes |
| OpenAI Codex | AGENTS.md | Yes |
| Cursor | .cursorrules | Yes |

You can also use a portable spec file that you explicitly reference in any prompt: "Read AGENTS.md and then...". AGENTS.md is a common convention for this. Not every tool loads it automatically, but it travels with your project if you switch tools. In 2025, AGENTS.md was adopted by the Agentic AI Foundation under the Linux Foundation (co-founded by Anthropic, OpenAI, and Block), making it the emerging cross-tool portable spec format.

What to include in your spec file

Use this file to define:

  • Current Goal: What you are working on right now.
  • Rules of the Road: Technical constraints (e.g., “Always use pandas for dataframes”).
  • Verification Gates: How you will confirm the code is correct.
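If you want a quick check that a spec covers the three items above, a small script can scan its headings. This is a minimal sketch, not part of any CLI tool: the keyword list and the function name are hypothetical, and your own spec may use different section titles.

```python
import re

# Hypothetical keywords, one per item in the list above.
REQUIRED_SECTIONS = ("goal", "rules", "verification")


def missing_sections(spec_text, required=REQUIRED_SECTIONS):
    """Return the required keywords that appear in no Markdown heading."""
    headings = [
        match.group(1).lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.*)$", spec_text, flags=re.MULTILINE)
    ]
    return [word for word in required if not any(word in h for h in headings)]


# Made-up spec text used only for this demonstration.
example_spec = """# Project: Arctic Sea Ice Analysis

## Current Goal
Analyze trends in sea ice extent.

## Rules of the Road
- Always use pandas for dataframes.
"""

print(missing_sections(example_spec))  # ['verification']
```

Here the example spec has no heading about verification, so the check reports that one gap.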
Challenge

Challenge: Initialize and customize your spec file

Inside your Gemini CLI session, run /init to create a GEMINI.md file. Then open it in a text editor and add one “Hard Constraint” (something the AI must do) and one “Success Metric” (how you know it’s done).

MARKDOWN

# Project: Arctic Sea Ice Analysis

## Goal
To analyze trends in sea ice extent from 1980-2020.

## Rules of the Road
- **Hard Constraint**: Only use the `xarray` library for spatial data processing.
- **Success Metric**: All final plots must include a valid DOI reference for the data source.

## Conventions
- Use snake_case for variable names.
- Save all plots to the `figures/` directory.
Key Points
  • CLI agents coordinate actions across multiple files.
  • Run /init inside Gemini to create a GEMINI.md Living Spec that reduces context drift.
  • A portable AGENTS.md lets the same spec travel across different AI tools.
  • Research orchestration shifts focus from writing syntax to validating intent.

Content from Best practices for prompting


Last updated on 2026-03-25

Estimated time: 40 minutes

Overview

Questions

  • How do I write effective prompts?
  • What are common AI failures?
  • How can I make the AI fix its own mistakes?

Objectives

  • Apply the CLEAR framework.
  • Identify common AI failures.
  • Use introspection to refine code.
Callout

Working inside the Gemini CLI

All prompts in this episode are typed inside an active Gemini CLI session. Start one in your project folder before the exercises:

BASH

cd path/to/your/project
gemini

Then type prompts directly at the > prompt. Shell commands (like python script.py) are run in a separate terminal window.

Five principles of effective prompting


Effective prompting is clear technical communication. To get the best results, start by being specific. Include constraints, filenames, and a description of your expected output. Vague requests lead to generic answers, while precise instructions result in usable code.

Provide context. Explain why you need the code and what data you have (e.g., “I am processing a CSV file with these columns…”). This helps the AI understand the goal. Specify outputs clearly—tell the AI where to save files or how to format tables.

Treat prompting as an iterative process. Start with a simple request and add complexity in follow-up prompts. Include validation steps by asking the AI to verify or test its own work.

Callout

The CO-STAR framework

While CLEAR helps with conversation flow, CO-STAR structures complex research prompts that eventually become part of your AGENTS.md:

  • Context: Provide background (e.g., “I am a biologist analyzing RNA-seq data”).
  • Objective: Define the specific task (“Write a script to normalize these counts”).
  • Style: Specify the coding style (“Use the Tidyverse style guide in R”).
  • Tone: Set the personality (“Be concise and prioritize readable code”).
  • Audience: Who is this for? (“For a graduate student who knows R but not bioinformatics”).
  • Response: Define the format (“A single R script with comments and a plot output”).

The Bootstrap Workflow


Instead of writing a full AGENTS.md by hand, use the Bootstrap Workflow. This lets the agent assist in defining the project spec from the start.

  1. Scan: Ask the agent to scan your directory and data files.
  2. Draft: Ask the agent to write an initial AGENTS.md based on what it sees and your high-level goal.
  3. Gate: You review, edit, and approve the spec before any code is written.
Callout

Example bootstrap prompt

“Scan the CSV files in data/raw/. Based on my goal of ‘Analyzing water quality trends’, draft an AGENTS.md file that defines the column schema, required libraries, and a plan for cleaning the data.”

Callout

Concrete example: From bad to good

| Aspect | Bad prompt | Good prompt |
|---|---|---|
| Vague vs specific | “Clean this data.” | “In data.csv, remove rows with missing values in the ‘age’ column and save as clean_data.csv.” |
| No context vs context | “Write a plot script.” | “I am building a report for a climate study. Write a Python script using seaborn to create a line plot of ‘temp’ over ‘year’ from results.csv.” |
| Silent vs validated | “Run a t-test.” | “Perform a paired t-test between ‘pre’ and ‘post’ columns. Print the t-statistic, p-value, and an interpretation of the result at alpha=0.05.” |
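To see why the specific version helps, look at what the first good prompt pins down: the input file, the column, the rule, and the output file. A minimal sketch of a script that would satisfy it is shown here, using only the standard library. It assumes that "missing" means an empty or whitespace-only cell, which the prompt itself leaves open, and the function names are hypothetical.

```python
import csv


def drop_rows_missing(rows, column):
    """Keep only rows whose value in column is present and non-blank."""
    return [row for row in rows if (row.get(column) or "").strip()]


def clean_file(src="data.csv", dst="clean_data.csv", column="age"):
    """Read src, drop rows with a missing value in column, write dst."""
    with open(src, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)
    kept = drop_rows_missing(rows, column)
    with open(dst, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    # Report row counts so the change can be checked by eye.
    print(f"kept {len(kept)} of {len(rows)} rows")
    return len(rows), len(kept)
```

The vague prompt (“Clean this data.”) gives the agent none of these four decisions, so it has to guess each one.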

Write CLEAR vertically on a whiteboard. As you explain each letter, add the keyword (Concise, Logical, Explicit, Adaptive, Reflective). This helps students remember the framework.

The CLEAR framework


The CLEAR framework, developed by Leo Lo, provides a structured approach to prompt engineering:

graph LR
    C[Concise] --> L[Logical]
    L --> E[Explicit]
    E --> A[Adaptive]
    A --> R[Reflective]
    R -->|Feedback Loop| A
    style R fill:#bbf,stroke:#333,stroke-width:2px

Effective prompts are concise and logical, prioritizing important information and following a sequence of steps. They are also explicit, specifying the scope, persona, and tone of the output. When the AI produces poor results, be adaptive by rephrasing or splitting tasks. Finally, be reflective—evaluate the output and verify facts using other sources rather than trusting the response.

Introspection


The CLEAR framework guides your input, but you can also force the AI to critique its own output. This is often called self-correction.

Emphasize this section. Most learners treat AI output as final. The idea that they can ask the AI to fix its own work is often a new concept. It is like asking a student, “Are you sure you checked your work?”—they often find their own mistakes when asked.

AI models are often better at verifying code than writing it. Never accept the first draft. Follow up with an introspection prompt:

  • “Review the code you just wrote. Are there any edge cases or security vulnerabilities?”
  • “Did you hardcode any file paths?”
  • “Critique your own implementation. Is there a more efficient way?”

Reasoning models

As of 2025, reasoning models (such as OpenAI o1/o3, DeepSeek-R1, or Gemini 2.5 Thinking) have emerged. These models perform chain of thought reasoning before they answer.

When to use them:

  • Standard models (e.g., Gemini Flash): Best for quick formatting, simple scripts, and brainstorming.
  • Reasoning models: Best for complex logic, debugging hard errors, or writing scientific formulas where accuracy is important.

When using a reasoning model, you often do not need to ask for introspection—they do it before showing the code.

Plan before you act


As tasks grow more complex, asking the agent to write code immediately leads to more rewrite time. The emerging best practice is to request a plan first, review it, and approve it before any files are written.

The think-then-do pattern

Start any multi-step task with an explicit planning prompt:

Before writing any code, describe your approach in numbered steps. Do not write any files yet.

Review the plan, push back on steps you disagree with, and ask for alternatives. Once you are satisfied, say “proceed with step 1.”

Checkpoint prompts

Break large tasks into explicit phases so you review the output at each stage before moving forward:

Step 1 only: read the three CSV files and tell me what inconsistencies you find. Do not write any code yet.

This is especially valuable in research because it catches misunderstandings about your data before they propagate into broken code.

The plan file

For complex projects, ask the agent to write a PLAN.md first:

Write a PLAN.md outlining the steps to clean and merge these files. I will review and edit it before you write any code.

This makes the plan a reviewable, editable artifact — a more formal version of the Bootstrap Workflow. Once approved, refer back to it in follow-up prompts: “Proceed with step 2 from PLAN.md.”

Callout

Plan files vs. the Living Spec

A PLAN.md and your GEMINI.md serve different purposes. The spec defines persistent rules and constraints that apply across all sessions. The plan describes the steps for a specific task. Keep them separate: plans are temporary, specs are durable.

Challenge

Challenge: Plan before you clean

Practice the think-then-do pattern before moving on to the data cleaning episode. Inside your Gemini CLI session, type:

I have three CSV files from different research sites with inconsistent column names and date formats. Before writing any code, outline a step-by-step plan for cleaning and merging them into a single dataset. Do not write any files yet.

Review the plan. Does it include an audit step? Does it address missing values? Revise the plan in the conversation until you are satisfied, then save it by asking: “Write this plan to PLAN.md.”

A complete plan should include:

  • An audit step: inspect files before changing them
  • A schema harmonization step: standardize column names
  • A date standardization step
  • A missing value strategy
  • An output verification step

If the agent skipped any of these, ask it to revise before you proceed. The goal is to catch gaps in the plan, not in the code.

AI failures


AI agents are designed to be helpful, which can lead them to take shortcuts.

Common failure modes

  • Determinism collapse: Small variations in prompts or model updates can lead to different outputs for the same task, which affects reproducibility.
    • Fix: Use temperature=0 (if available) and log your model versions and prompts.
  • Over-correction loops: If an agent runs its own tests, it might fix the test to match its buggy code.
    • Fix: Write your own requirements and key tests.
  • Synthetic data substitution: The AI may generate fake data if it cannot find the real file.
  • Silent failure: The AI uses try/except blocks that hide errors.
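The last two failure modes often appear together. The sketch below contrasts a loader that hides a missing file behind a broad try/except and substitutes made-up values with one that fails loudly. The function names and the file name are hypothetical, chosen only for this illustration.

```python
import csv


def load_scores_silent(path):
    """Anti-pattern: hides the error and substitutes synthetic data."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            return [float(row["score"]) for row in csv.DictReader(handle)]
    except Exception:
        # The caller never learns that the real file was not read.
        return [50.0, 60.0, 70.0]


def load_scores_strict(path):
    """Safer: a missing file or bad column stops the analysis."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [float(row["score"]) for row in csv.DictReader(handle)]


missing = "no_such_file_for_demo.csv"  # hypothetical path that does not exist
print(load_scores_silent(missing))     # plausible-looking numbers, all fake
try:
    load_scores_strict(missing)
except FileNotFoundError as err:
    print(f"strict loader stopped: {err}")
```

When you review generated code, a broad `except` that returns default values is worth questioning, because it can turn a missing input into a result that looks valid.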
Discussion

How to catch failures

Have you seen an AI make a confident mistake? In your research, what signs indicate the AI is hallucinating?

Common strategies:

  • Always ask: “Show me the first 10 rows of the data you loaded.”
  • Demand proof: “How did you calculate that p-value? Show the intermediate steps.”
  • Check file sizes: Is the cleaned file 0 bytes?
Challenge

Challenge: The prompt refinement loop

Practice the CLEAR framework by visualizing the relationship between “Date” and “Score” in a dataset.

  1. Start with a vague prompt — type this inside your Gemini CLI session:

    Create a plot of the data I just made.

    Observe: Does it work? Is the plot useful? Where did it save it?

  2. Refine the prompt: Write a new prompt that applies context (what the data is), specificity (scatterplot with regression line), and output instructions (save as fig/trend_analysis.png).

Using the 'master_dataset.csv' file, create a Python script to generate a scatterplot of 'date' vs 'score'. Add a linear regression trendline. Label the axes clearly. Save the final plot to a file named 'fig/trend_analysis.png' (create the directory if it doesn't exist).

Reflection

  • How much longer was your refined prompt compared to your first one?
  • Did defining the output filename save you from searching for the file?
  • Extra typing time can save debugging time.
Challenge

Challenge: The introspection loop

Test the “AI as a verifier” principle: ask the AI to find flaws in its code before you run it.

  1. Generate a script — type this prompt inside your Gemini CLI session:

    Write a Python script that reads 'data.csv' and calculates the rolling 7-day average of a 'score' column. Handle missing values.
  2. Force introspection: Once the code is generated, do not run it. Follow up in the same session:

    Review the rolling average script you just wrote. Are there any edge cases (like having fewer than 7 days of data) where this would fail? If so, provide an updated version.
  3. Compare: Did the AI find a mistake in its first draft? Did it add a guard clause like min_periods=1?

AI models are often more accurate when asked to critique logic than when asked to generate it. This second pass is part of the editor mindset and reduces manual debugging.
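To see what the min_periods guard changes, here is a small pandas sketch with made-up scores. It uses a window of 7 rows, which matches a 7-day average only under the assumption of exactly one row per day. The agent's actual script may be structured differently.

```python
import pandas as pd

# Made-up data: fewer than 7 observations, one of them missing.
scores = pd.Series([10.0, 12.0, None, 14.0], name="score")

# Default: a 7-row window needs 7 valid values, so every result is NaN here.
strict = scores.rolling(window=7).mean()

# min_periods=1: average whatever valid values the window holds so far.
lenient = scores.rolling(window=7, min_periods=1).mean()

print(strict.isna().all())  # True
print(lenient.tolist())     # [10.0, 11.0, 11.0, 12.0]
```

Neither behavior is automatically correct. With min_periods=1 the early values are averages over very few points, so decide which behavior your analysis needs and state it in the prompt or the spec.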

Key Points
  • Be specific and provide context.
  • Plan before you act: request a numbered plan and approve it before any files are written.
  • Always validate AI outputs.
  • Introspection improves code quality.

Content from Data cleaning with AI


Last updated on 2026-03-25

Estimated time: 50 minutes

Overview

Questions

  • How can AI handle messy data?
  • Can I trust AI to standardize inconsistent files?

Objectives

  • Generate a test dataset using Gemini.
  • Build a data processing pipeline for inconsistent files.
  • Verify AI-generated code before running it.
  • Document the cleaning process.

This episode uses live coding. Learners should follow along by running commands on their own machines.

Prerequisite

Prerequisites

Ensure you are authenticated (gemini auth login) and have a Gemini CLI session running in your project folder. Generating scripts can take 10-30 seconds.

Callout

Working inside the Gemini CLI

All prompts in this episode are typed inside an active Gemini CLI session. Start one in your project folder before the exercises:

BASH

cd path/to/your/project
gemini

Run Python scripts in a separate terminal window when instructed.

Cleaning messy data


Cleaning and merging inconsistent files is a common bottleneck in research. We will use Gemini to standardize messy CSV files.

Generating test data

To practice cleaning, we need a dataset with inconsistencies. We can use AI to simulate a multi-site study where each location used different naming conventions or date formats. Type the prompt below in your Gemini CLI session. It asks for a script that generates three files: site_A.csv, site_B.csv, and site_C.csv.

Create a python script named 'make_messy_data.py'. It should generate 3 CSV files ('site_A.csv', 'site_B.csv', 'site_C.csv') with 50 rows each. Columns should include 'ID', 'Date', and 'Score', but make them inconsistent (e.g., 'ParticipantID' vs 'id', 'date' vs 'Date_Time'). Add some missing values and varied date formats (like '2023/01/05' vs 'Jan 5, 2023').

After running python make_messy_data.py, you will have three inconsistent files in your directory.

Auditing the data

Before fixing the files, we need to understand the inconsistencies. We can ask the AI to write an inspection script that reads every CSV in the folder and reports the filenames, column names, and missing value counts.

Write a Python script called 'inspect_data.py' that reads every CSV file in the current folder. For each file, print the filename, the list of column names, and the number of missing values in each column.
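For orientation, an inspection script along these lines might look like the sketch below. It is not the output you should expect from the agent, which will likely differ and may use pandas. The function names are hypothetical, and "missing" is assumed to mean an empty cell.

```python
import csv
import glob
import io


def summarize_csv(handle):
    """Return (columns, {column: number of empty cells}) for an open CSV file."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    missing = {column: 0 for column in columns}
    for row in reader:
        for column in columns:
            if not (row.get(column) or "").strip():
                missing[column] += 1
    return columns, missing


def inspect_folder(pattern="*.csv"):
    """Print the summary for every CSV file matching the pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            columns, missing = summarize_csv(handle)
        print(path, columns, missing)


# In-memory stand-in for a file like site_A.csv (made-up rows).
demo = io.StringIO("ParticipantID,Date,Score\nP01,2023/01/05,42\nP02,,\n")
demo_columns, demo_missing = summarize_csv(demo)
print(demo_columns)  # ['ParticipantID', 'Date', 'Score']
print(demo_missing)  # {'ParticipantID': 0, 'Date': 1, 'Score': 1}
```

Calling `inspect_folder()` from the project folder prints one summary line per CSV file.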

Run the inspection script. You should see inconsistencies like site_A using ParticipantID while site_B uses id.

Callout

Reasoning models

If your data files are extremely inconsistent, reasoning models (like o1 or DeepSeek-R1) are often more effective. They can identify subtle naming patterns that standard models might miss.

Spec-guided cleaning


In Spec-Driven Research Orchestration, we don’t just ask for a script. We refer to the AGENTS.md file to ensure the script follows the project’s rules.

Harmonizing files with the Spec

We will now ask Gemini to generate a script using the rules defined in AGENTS.md.

Read 'AGENTS.md' and the 3 site CSVs. Write a script called 'clean_and_merge.py' that renames IDs and standardizes dates according to the schema in the spec. Fill missing scores with the median and save to 'master_dataset.csv'. Add comments linking code steps to spec rules.
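What “according to the schema in the spec” means depends on your own AGENTS.md. As an illustration only, assume the spec maps every ID column to participant_id and requires ISO 8601 dates. The column map and the list of date formats below are hypothetical. In practice they would come from your spec and from the audit step.

```python
from datetime import datetime

# Hypothetical schema rules; in practice these come from AGENTS.md.
COLUMN_MAP = {
    "ParticipantID": "participant_id",
    "id": "participant_id",
    "ID": "participant_id",
    "Date": "date",
    "Date_Time": "date",
    "Score": "score",
}
# Assumed input formats, based on the examples in the test-data prompt.
DATE_FORMATS = ("%Y/%m/%d", "%b %d, %Y", "%Y-%m-%d")


def standardize_date(value):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparsed dates visible instead of guessing


def harmonize_row(row):
    """Rename columns per COLUMN_MAP and standardize the date field."""
    renamed = {COLUMN_MAP.get(key, key.lower()): value for key, value in row.items()}
    if renamed.get("date"):
        renamed["date"] = standardize_date(renamed["date"])
    return renamed


print(harmonize_row({"ParticipantID": "P01", "Date_Time": "Jan 5, 2023", "Score": "42"}))
# {'participant_id': 'P01', 'date': '2023-01-05', 'score': '42'}
```

Returning None for a date that matches no known format keeps the problem visible during review, which is the kind of decision worth recording as a rule in the spec.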

If a learner’s AI fails to generate working code, provide the pre-written versions from instructors/files/:

  • backup_make_messy_data.py
  • backup_inspect_data.py
  • backup_clean_and_merge.py

Callout

The editor role

Before running the code, open clean_and_merge.py. Check if the logic is sound, if the comments match the code, and if there are syntax errors. You are responsible for the final output.

Run python clean_and_merge.py to create the clean dataset.

Callout

Using comments

We asked the AI to “Add comments linking code steps to spec rules.” These comments make the script readable and help you verify the methodology against the spec.

This challenge requires modifying existing code. If learners are stuck, suggest they ask the AI to read clean_and_merge.py before asking for modifications.

Challenge

Challenge: Update the script

Imagine you need to exclude any participant with a score below 10. Use the Gemini CLI to update clean_and_merge.py instead of editing it manually. Ask the AI to read the file and add the filtering logic. Run the updated script and verify the results in master_dataset.csv.

Read 'clean_and_merge.py'. Modify the script to filter out any rows where 'score' is less than 10. Keep all other logic the same. Save the updated script.

Reflection

  • Did the AI edit the relevant part or rewrite the whole file?
  • Did it include the necessary imports?
  • Did you check the changes before running the script?

Automating documentation

For the final step, have the AI generate a README that explains the data pipeline, including the raw files, cleaning steps, and final output format.

Create a README.md file that explains the data processing pipeline we just built. List the original files, the cleaning steps performed, and the final output format.
Challenge

Challenge: Provenance tracking

To ensure research is reproducible, track which model generated your code and when.

  1. Use the Gemini CLI to add a provenance header to clean_and_merge.py.
  2. The header should be a Python docstring containing:
    • The model used (e.g., Gemini 2.5 Flash)
    • The date
    • A summary of the prompt.
Read 'clean_and_merge.py'. Add a docstring at the very top of the file as a provenance header. Include the model name 'Gemini 2.5 Flash', today's date, and a summary of the prompt: 'Standardize site IDs, format dates, and impute missing scores with site medians.'
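To judge the agent's result, it helps to know what a reasonable header looks like. The sketch below builds one in plain Python. The field layout and function names are hypothetical, and the date is fixed only to keep the example reproducible.

```python
from datetime import date


def provenance_header(model, prompt_summary, generated_on=None):
    """Build a module docstring recording how a script was generated."""
    generated_on = generated_on or date.today().isoformat()
    return (
        '"""\n'
        "Provenance\n"
        f"Model: {model}\n"
        f"Date: {generated_on}\n"
        f"Prompt summary: {prompt_summary}\n"
        '"""\n'
    )


def add_header(source_code, header):
    """Prepend the header unless the script already starts with one."""
    if source_code.lstrip().startswith('"""\nProvenance'):
        return source_code
    return header + "\n" + source_code


header = provenance_header(
    model="Gemini 2.5 Flash",
    prompt_summary="Standardize site IDs, format dates, and impute missing "
                   "scores with site medians.",
    generated_on="2026-03-25",  # hypothetical fixed date for the example
)
print(add_header("import pandas as pd\n", header))
```

Because the header is an ordinary docstring at the top of the file, it does not change how the script runs.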

Reflection

  • Why record the model version and date?
  • Does it matter if the AI model is updated later?
  • How does this header help with reproducibility?
Key Points
  • Use AI to automate data harmonization and standardization.
  • Audit data before cleaning to identify inconsistencies.
  • Read all generated code before running it, and test its output afterward.

Content from Validation strategies: the approval gate


Last updated on 2026-03-11

Estimated time: 50 minutes

Overview

Questions

  • How do I move from vibe coding to research orchestration?
  • What is rewrite time and why does it matter for research reproducibility?
  • How can I use one AI to catch the errors of another?

Objectives

  • Shift from a writer to an auditor mindset using the approval gate.
  • Calculate rewrite time to measure workflow efficiency.
  • Use a four-layer validation stack with explicit requirement constraints.
  • Use multi-model verification to peer-review research code.

The approval gate: verification over generation


In an agentic research workflow, your role is to audit and approve code rather than write it. The standard has shifted from vibe coding to spec-driven orchestration.

The approval gate is the point where you decide AI-generated code is robust enough for research production. It separates a working prototype from validated science.

Callout

The review-first standard

The bottleneck in research is no longer writing code; it is verifying it. A high-performance workflow follows this cycle: Plan → Agent Implementation → Automated AI-Powered Testing → Human Review.

Measuring efficiency: rewrite time


Rewrite time is a metric to determine if an AI workflow is actually helping your research.

  • Definition: The manual effort in minutes a researcher spends making AI-generated output production-ready.
  • Goal: Keep rewrite time to a small share of the total task time. If you spend 20 minutes prompting but then 40 minutes fixing the code, your rewrite time is too high.
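The lesson does not give a formula for rewrite time. One plausible reading, assumed here, is the share of total task time spent on fixes. The function name is hypothetical.

```python
def rewrite_share(prompting_minutes, fixing_minutes):
    """Fraction of total task time spent making AI output production-ready."""
    total = prompting_minutes + fixing_minutes
    if total <= 0:
        raise ValueError("total task time must be positive")
    return fixing_minutes / total


# The example from the text: 20 minutes prompting, 40 minutes fixing.
share = rewrite_share(20, 40)
print(f"rewrite share: {share:.0%}")  # rewrite share: 67%
```

By this measure, the example spends about two thirds of the task on fixes.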
Discussion

Challenge: calculate your rewrite time

Look back at a script you recently generated with an AI agent.

  1. How many minutes did you spend fixing or refactoring the code to make it run?
  2. If this took more than 10% of the total task time, what was the likely cause?
    • Ambiguous specs?
    • AI over-confidence?
    • Lack of local context?

The four-layer validation stack


To minimize rewrite time and ensure research rigor, use a structured validation stack.

Layer 1: Requirement constraints (No-Go Zones)

Before the AI writes code, define requirement constraints in your AGENTS.md. These are rules the AI is not allowed to break.

Example: “Do not change the column names in raw_data.csv” or “Use only base R for this visualization to ensure compatibility.”

Layer 2: Automated unit tests

Ask the agent to write tests before the implementation. Use a prompt pattern like: “First, write five pytest cases that define the success of this data cleaning script. I will approve the tests before you write the logic.”
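A minimal sketch of what test-first can look like is shown here. The clean_scores function is hypothetical and is included only so that the example runs. In the workflow described above, you would review and approve the test functions before any implementation exists.

```python
def clean_scores(rows):
    """Hypothetical implementation: drop rows with missing or non-numeric scores."""
    cleaned = []
    for row in rows:
        try:
            score = float(row.get("score", ""))
        except (TypeError, ValueError):
            continue
        cleaned.append({**row, "score": score})
    return cleaned


# pytest collects functions whose names start with test_.
def test_drops_missing_scores():
    assert clean_scores([{"id": "a", "score": ""}]) == []


def test_converts_scores_to_float():
    assert clean_scores([{"id": "a", "score": "7"}]) == [{"id": "a", "score": 7.0}]


def test_keeps_row_order():
    rows = [{"id": "b", "score": "2"}, {"id": "a", "score": "1"}]
    assert [r["id"] for r in clean_scores(rows)] == ["b", "a"]
```

Saved in a file such as test_clean.py (a hypothetical name), these run with `pytest test_clean.py`. Each test states one requirement, which makes it easier to notice if an agent later edits a test to match buggy code.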

Layer 3: Metamorphic and invariant checks

Test the relationships in your data that should never change.

  • Invariants: The total number of participants must remain 150 after merging.
  • Metamorphic checks: If I change the order of the input rows, the final mean score should not change.
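Both kinds of check can be written as plain assertions. The sketch below uses made-up data and a hypothetical merge_sites function that stands in for your real merge step.

```python
import math
import random
from statistics import fmean


def merge_sites(*sites):
    """Hypothetical merge step: concatenate the rows from every site."""
    return [row for site in sites for row in site]


# Three made-up sites with 50 participants each.
rng = random.Random(0)
sites = [
    [{"site": name, "score": rng.uniform(0, 100)} for _ in range(50)]
    for name in ("A", "B", "C")
]
merged = merge_sites(*sites)

# Invariant: merging must not add or drop participants.
assert len(merged) == sum(len(site) for site in sites) == 150

# Metamorphic check: shuffling the input rows must not change the mean.
shuffled = merged[:]
rng.shuffle(shuffled)
mean_original = fmean(row["score"] for row in merged)
mean_shuffled = fmean(row["score"] for row in shuffled)
assert math.isclose(mean_original, mean_shuffled, rel_tol=1e-12)
print("invariant and metamorphic checks passed")
```

Neither check needs to know the correct mean. They test only properties that must hold for any correct implementation, which is why they still work when you have no reference answer.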

Layer 4: Domain plausibility

This is where your research expertise is irreplaceable. An AI agent will not necessarily recognize that a negative blood pressure reading is impossible.
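Plausibility rules can still be written down so that the check is repeatable. The sketch below is illustrative: the column names and limits are hypothetical placeholders, and choosing real limits is exactly the domain judgment this layer depends on.

```python
# Hypothetical plausibility limits; set these from your own domain knowledge.
PLAUSIBLE_RANGES = {
    "systolic_mmhg": (50, 300),
    "score": (0, 100),
}


def implausible_values(rows, ranges=PLAUSIBLE_RANGES):
    """Return (row index, column, value) for every out-of-range value."""
    problems = []
    for index, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is not None and not (low <= value <= high):
                problems.append((index, column, value))
    return problems


# Made-up readings for the demonstration.
readings = [
    {"systolic_mmhg": 118, "score": 72},
    {"systolic_mmhg": -120, "score": 72},  # sign error: physically impossible
    {"systolic_mmhg": 125, "score": 140},  # outside the assumed score scale
]
print(implausible_values(readings))
# [(1, 'systolic_mmhg', -120), (2, 'score', 140)]
```

Once limits like these are written down, they can also go into your spec file as requirement constraints.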


Multi-model verification


We use a challenger model to audit an implementation model rather than trusting a single AI.

Challenge

Challenge: orchestrate a peer review

  1. Use Model A (such as Claude Code or Cursor) to generate a data cleaning script.

  2. Provide the code to Model B (such as Gemini CLI) with the following prompt:

    “Read this script. Act as a skeptical senior data scientist. Identify three potential edge cases where this script will fail, such as empty strings, NaN values, or encoding issues. Suggest specific assert statements to catch these.”

  3. Reflect: Did the challenger model find something the implementation model missed?

Models have different blind spots. Forcing a second AI to act as an auditor helps bypass the tendency of the primary assistant to be over-confident. This process reduces your manual rewrite time.
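
The checks a challenger model suggests might look something like the following sketch. The helper names `check_rows` and `decode_strict` are hypothetical, and the row format (a list of dictionaries) is an assumption made for illustration.

```python
import math


def check_rows(rows: list[dict]) -> None:
    """Raise AssertionError if a row contains an empty string or NaN."""
    for i, row in enumerate(rows):
        for key, value in row.items():
            if isinstance(value, str):
                assert value.strip() != "", f"row {i}: empty string in {key!r}"
            if isinstance(value, float):
                assert not math.isnan(value), f"row {i}: NaN in {key!r}"


def decode_strict(raw: bytes) -> str:
    """Decode as UTF-8 and fail loudly instead of silently replacing bytes."""
    return raw.decode("utf-8", errors="strict")


check_rows([{"site": "A", "score": 3.5}])  # passes silently
```

Python skips `assert` statements when it runs with the `-O` flag, so checks that must always run are better written as explicit `if` tests that raise an exception.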

Warn learners about approval fatigue: the tendency to accept AI suggestions without reading them. The four-layer stack is designed to make the AI prove it is correct before you review the code.

Key Points
  • The approval gate separates experimental prototypes from validated research.
  • Rewrite time is the primary metric for measuring AI workflow value.
  • Requirement constraints prevent the AI from drifting away from research specs.
  • Multi-model verification uses a second AI to act as a skeptical peer reviewer.

Content from Limitations and cautions


Last updated on 2026-03-28

Estimated time: 25 minutes

Overview

Questions

  • When should I not use AI?
  • What are common failure modes?

Objectives

  • Recognize high-risk scenarios for AI use.
  • Identify hallucinated or outdated code.
  • Distinguish between open and proprietary models.
Callout

The jagged frontier

AI capability is inconsistent. A model may solve a complex differential equation but fail a simple logic puzzle. Researchers must identify where AI is reliable and where it is a liability for their specific field.

When not to trust AI code


Using AI-generated code can introduce risks to research integrity. Security-critical tasks—like authentication, encryption, or handling sensitive data—require expert oversight.

AI may also fail when research involves new statistical methods or domain-specific details. Models synthesize information from training data, which might not include the latest breakthroughs or specific sensor patterns. In performance-critical code, AI often prioritizes common algorithms over the most efficient ones, which can cause bottlenecks in large-scale processing.

Common failure modes


Understanding AI failure modes helps you identify errors before they affect results.

Spec Drift

Spec Drift occurs when the code and the AGENTS.md (Living Spec) become unaligned. The agent may fix a bug in the code but forget to update the spec, leading to future hallucinations.

  • Prevention: Regularly ask the agent to “Sync the spec with the current code.”

Bootstrap Failures

In the “Bootstrap Workflow,” the AI may miss nuances in raw data during the initial scan. If you approve a flawed spec, the error will propagate through the entire project.

  • Prevention: Thoroughly audit the agent’s first draft of AGENTS.md.

Silent semantic drift

Semantic drift occurs when an agent makes a change that alters data assumptions or logic without breaking the code.

  • Example: The code runs and tests pass, but a filtering threshold was changed or a column was renamed incorrectly, affecting the research conclusion.
  • Prevention: Use metamorphic testing and invariant checks to ensure core logic remains unchanged.
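
A simpler check that complements metamorphic tests is to pin the assumptions you approved and compare them against what the current code uses. This sketch illustrates that idea only. The setting names and values are hypothetical.

```python
# Assumptions approved by the researcher (values here are hypothetical).
APPROVED = {"min_score": 0.5, "columns": ["participant_id", "score"]}


def find_drift(current: dict, approved: dict) -> list[str]:
    """List every approved setting whose current value differs."""
    return [
        f"{key}: approved {approved[key]!r}, found {current.get(key)!r}"
        for key in approved
        if current.get(key) != approved[key]
    ]


# An agent quietly changed the filtering threshold; the code would still run.
current = {"min_score": 0.4, "columns": ["participant_id", "score"]}
for problem in find_drift(current, APPROVED):
    print(problem)  # prints "min_score: approved 0.5, found 0.4"
```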

Other failure modes

  • Hallucinated functions: The model uses libraries or APIs that do not exist.
  • Outdated approaches: The AI uses deprecated syntax from its training data.
  • Confident incorrectness: The AI presents wrong formulas or logic as certain.
  • Tool poisoning via MCP: When an agent calls external tools through MCP, a misconfigured or malicious MCP server can inject instructions into the agent’s context (prompt injection). This can cause the agent to take unintended actions or leak data. Mitigation: only install MCP servers from trusted, audited sources.
  • Over-engineering: The model generates complex code for simple problems.
Discussion

Environmental cost

Data centers consume large amounts of electricity and water. Frequent, iterative prompting can be resource-intensive.

  • Energy use: Every AI query requires complex calculations. Some estimates suggest a single generative AI query uses significantly more energy than a standard web search.
  • Code efficiency: AI models often prioritize working code over efficient code. Inefficient software uses more energy and resources over time.

Sustainable practices

To code responsibly:

  1. Think before prompting: Use the CLEAR framework to get the right answer in fewer attempts.
  2. Request optimization: Prompt the AI to optimize for memory or speed once the logic is correct.
  3. Use documentation: If you need simple syntax, check the documentation instead of querying an LLM.

Models are becoming less likely to hallucinate, so the challenge below may not produce one.

  • If the model refuses: Acknowledge that it correctly identified its own limitations.
  • Backup: Have a screenshot of a known hallucination ready to show if the AI performs perfectly during the session.

Challenge

Challenge: Test for hallucinations

Inside your Gemini CLI session, type:

How do I use the 'pypanda-researcher' library to automatically write my conclusion?

Note whether the model admits it does not know, hedges with uncertainty, or confidently invents instructions.

Current models (Gemini 2.5, Claude, GPT-4o) are significantly better at refusing or flagging uncertainty than earlier generations — you may get a clean “this doesn’t exist” response. That is the correct behavior. The lesson here is not that hallucination always happens, but that you cannot assume it won’t: always verify suggested libraries and functions exist before using them. Older or smaller models are still more likely to confabulate.
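
One quick way to verify a suggestion is to check whether the module and function can be found in your own environment. The helper names in this sketch are hypothetical. It only inspects locally installed packages, so a `False` result means "not installed here", not necessarily "does not exist anywhere".

```python
import importlib
import importlib.util


def module_exists(name: str) -> bool:
    """Return True if a top-level module with this name is importable here."""
    return importlib.util.find_spec(name) is not None


def attribute_exists(module_name: str, attribute: str) -> bool:
    """Return True if the module is installed and exposes the attribute."""
    if not module_exists(module_name):
        return False
    return hasattr(importlib.import_module(module_name), attribute)


print(module_exists("json"))                      # True: standard library
print(module_exists("pypanda_researcher"))        # False unless such a package is installed
print(attribute_exists("json", "dumps"))          # True
print(attribute_exists("json", "auto_conclude"))  # False
```

For packages you have not installed, searching the package index and reading the official documentation remains the more reliable check.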

Open science and proprietary AI


The Gemini CLI client is open source, but the Gemini models it calls are proprietary, which creates a tension in open research.

  • Proprietary models (Gemini, GPT-4, Claude): These are closed-weight models. You cannot verify their training data, and they may update silently. Institutional agreements provide data privacy but do not solve reproducibility issues.
  • Open-weights models (Gemma, Llama, Mistral): These can be run locally using tools like Ollama. They offer better reproducibility because you can use a specific, frozen version of the model.

Recommendation: Use proprietary models for prototyping and cleaning, but archive the generated code. Do not rely on the AI to regenerate the same code in the future.

Callout

Key lesson

AI increases your speed but does not replace your expertise or responsibility. Your role shifts from writing code to verifying the code the AI produces.

Key Points
  • Avoid AI for security-critical tasks.
  • You are responsible for the final output.
  • Open models offer better reproducibility; proprietary models offer more power.

Content from Resources and next steps


Last updated on 2026-03-28

Estimated time: 20 minutes

Overview

Questions

  • What other tools are available?
  • Where can I find help?

Objectives

  • List alternative AI coding tools.
  • Locate documentation and support.
  • Understand plugins and Model Context Protocol (MCP).

AI tool landscape


The ecosystem of AI tools for research is expanding. For coding, tools like Claude Code (Anthropic’s CLI), GitHub Copilot, and Aider (a CLI agent that works with multiple models) provide terminal support.

Some researchers use AI-native code editors like Cursor to interact with an entire repository. Models like Gemini 2.5 and DeepSeek-V3/R1 have increased the speed of these interactions.

Research-specific tools are also emerging. Elicit and Consensus focus on scientific paper discovery and evidence-based claims. Google’s NotebookLM allows you to ground an AI’s knowledge in your own collection of research PDFs for summarizing and querying documents.

Extending capabilities


AI tools can now connect with other software. Many browser-based models offer extensions for Google Drive, WolframAlpha, and other services. The Model Context Protocol (MCP) is an open standard that allows AI assistants to connect to local or remote data sources—such as a PostgreSQL database or your local file system—without requiring you to upload data to a central server. In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation. The 2026 MCP roadmap (published March 2026) focuses on enterprise authentication, audit trails, and agent-to-agent communication.

Caution

MCP security: shadow IT risk

MCP servers are installed by individual researchers or developers, often without institutional IT oversight. Security researchers have identified risks including prompt injection via malicious MCP servers and tools that silently exfiltrate data. Before connecting an MCP server, verify it is actively maintained and from a trusted source. Your institution’s IT policy may not yet cover MCP-enabled tools.

Local models and open tooling


For researchers handling sensitive data or requiring reproducibility, local AI is a growing trend. Running models on your own hardware ensures that data does not leave your machine.

  • Ollama: A tool for running open-weights models (like Llama 3, Mistral, or Gemma) locally on macOS, Linux, and Windows. It provides a CLI and a local API.
  • Aider: A CLI coding agent that connects to proprietary APIs and local models (via Ollama). It is designed for pair programming and refactoring in the terminal.
  • Qwen Coder: High-performance open models from Alibaba (e.g., Qwen2.5-Coder) that can be run offline for security.
Callout

Privacy and performance

Local models offer privacy but require significant hardware (especially GPU VRAM) to match the performance of cloud models like Gemini 2.5. Many researchers use a hybrid approach: cloud models for general scripting and local models for sensitive data.


Automated research discovery

For time-sensitive research, AI agents can scan social platforms, developer forums, and prediction markets to identify trends before they appear in traditional journals. This “real-time literature mapping” works by prompting the agent to synthesize signal from fast-moving sources alongside traditional ones.

Callout

A deep research prompt pattern

Use a prompt like this with any AI assistant that has web search access:

“Act as a research orchestrator. Find the most discussed [TOPIC] from the last 30 days across Reddit, X, and Hacker News.

  1. Discovery: Identify the top 3 tools or methodologies mentioned.
  2. Sentiment: Quote the most upvoted critique for each.
  3. Verification: Cross-reference these with web sources for real-world impact.
  4. Validation: Apply ‘Layer 4: Domain Plausibility’—identify one trend that sounds plausible but may be an AI-generated ‘vibe’ without empirical backing.”

Multi-model ensembles

In advanced workflows, you can use different models for different tasks. For example, one model generates code, another critiques it, and a third writes validation tests. This reduces the chance of a single model’s bias affecting results.

GitHub Agentic Workflows

GitHub Agentic Workflows (technical preview, February 2026) let you write repository automation goals in plain Markdown instead of YAML. GitHub Actions executes them using an LLM. This is a direct production application of the Markdown-as-spec pattern taught in this lesson.

Research-relevant use cases include automated issue triage, CI failure analysis, and pull request review. See the GitHub Agentic Workflows changelog for current status.

Provenance tracking

Include metadata in the header of AI-generated files to ensure reproducibility:

  • Model name and version.
  • Date and a link to the prompt used.
  • A hash of the context files provided.
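
A header like this can be generated rather than typed by hand. The sketch below is one possible layout, not a standard. The function names, the header fields, and the example prompt path `prompts/cleaning.md` are all hypothetical.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def context_hash(paths: list[Path]) -> str:
    """Return one SHA-256 digest covering the names and bytes of all files."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def provenance_header(model: str, prompt_link: str, paths: list[Path]) -> str:
    """Build a comment block to paste at the top of a generated script."""
    return "\n".join([
        f"# Model: {model}",
        f"# Date: {date.today().isoformat()}",
        f"# Prompt: {prompt_link}",
        f"# Context SHA-256: {context_hash(paths)}",
    ])


# Demonstration with a throwaway spec file.
with tempfile.TemporaryDirectory() as tmp:
    spec = Path(tmp) / "AGENTS.md"
    spec.write_text("Do not rename columns.\n", encoding="utf-8")
    print(provenance_header("Google Gemini 2.5 Flash", "prompts/cleaning.md", [spec]))
```

If any context file changes, the digest changes, which makes it easy to tell later whether a script was generated against the current spec.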

Cost and efficiency

With long-context models, it is easy to include unnecessary files in prompts, which increases costs.

  • Monitor tokens: Check your API usage dashboard.
  • Optimize context: Only include the files needed for the current task.
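
Before sending a large context, a rough size estimate can show which files dominate it. This sketch assumes about four characters per token, a common rule of thumb for English text. Real tokenizers differ by model, so treat the numbers as a guide and rely on your usage dashboard for actual counts. The function names are hypothetical.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN


def largest_context_files(paths: list[Path], top: int = 5) -> list[tuple[str, int]]:
    """Rank files by estimated token count, largest first."""
    sizes = [
        (str(path), estimate_tokens(path.read_text(encoding="utf-8", errors="replace")))
        for path in paths
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```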

Citing and crediting AI


Transparent attribution is essential for open science. Publisher and ethics-body policies (including COPE, Nature, and Elsevier) state that AI tools cannot be listed as authors because they cannot take accountability for the work. Instead, cite them as methodological tools.

1. In code repositories (README.md): Add an AI usage section to your project documentation:

Callout

Example attribution

  • Model: Google Gemini 2.5 Flash
  • Role: Spec drafting and spec-guided cleaning.
  • Verification: Verified by [Your Name] via GEMINI.md rules and validate_data.py.

2. In manuscripts: Cite the model in the methods or acknowledgements section.

  • Example: “We used Google Gemini 2.5 Flash to assist with data cleaning scripts. Prompts and raw outputs are available in the supplementary material.”

Resources and standards

Checklist

Summary checklist


Separating utility from marketing is a challenge. New tools appear daily, but many are more hype than substance.

Discussion

Finding tools

Where do you hear about new AI tools? Social media, academic journals, or word of mouth? Note that while social media is fast, reputable channels are more reliable.

Identifying hype

Before adopting a new tool, consider these points:

Checklist

Hype detection

Reputable sources

Follow sources that focus on the practical and ethical aspects of AI in research:

Callout

From vibes to evals

When AI scripts become critical for research, move to systematic evaluation. Create a “gold standard” dataset of known correct answers and test new iterations of AI-generated code against it.
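
A gold-standard evaluation can start very small. In this sketch the function under test (`normalize_site`) and the three gold cases are hypothetical stand-ins for an AI-generated function and the hand-verified answers you would collect.

```python
def normalize_site(name: str) -> str:
    """Hypothetical AI-generated function under evaluation."""
    return " ".join(name.split()).title()


# Gold standard: inputs paired with answers a researcher verified by hand.
GOLD_CASES = [
    ("  north   ridge ", "North Ridge"),
    ("LAKE VIEW", "Lake View"),
    ("river bend", "River Bend"),
]


def evaluate(func, cases):
    """Return the pass rate and the list of failing cases."""
    failures = [
        (given, want, func(given))
        for given, want in cases
        if func(given) != want
    ]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


rate, failed = evaluate(normalize_site, GOLD_CASES)
print(f"pass rate: {rate:.0%}, failures: {failed}")  # prints "pass rate: 100%, failures: []"
```

Re-running the same cases against each new iteration of the generated code shows whether a change improved or regressed the results.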

Checklist
Challenge

Challenge: Your research protocol

Draft a plan for integrating these tools into your research workflow while maintaining rigor.

  1. Select a project: Which project would benefit most from an AI cleaning or validation pipeline?
  2. Choose your gates: Which approval gates (Test-first, Diff budget, Snapshot) will you use?
  3. Define requirements: What are 2-3 requirements for that project that the AI cannot change?
  4. Verification strategy: Will you use a second model, unit tests, or metamorphic checks?

A pre-defined protocol reduces decision fatigue during complex coding sessions. Deciding how to verify work early prevents approval fatigue later.

Key Points
  • Gemini, Claude, and Copilot serve different needs.
  • Community support is vital.
  • Plugins and MCP allow AI to connect to external data.