All in One View
Content from CLI-based AI
Last updated on 2026-03-28
Estimated time: 25 minutes
Overview
Questions
- Why use a CLI for AI instead of a browser?
- What is the Living Spec and why does it matter?
Objectives
- Compare CLI and browser-based AI tools.
- Create a Living Spec (GEMINI.md) to guide an agent.
- Explain the shift from writer to orchestrator.
Why CLI matters for research
Most researchers use chat-based AI in a browser. These tools are good for brainstorming but run in an isolated sandbox. They cannot see your files, run your code, or understand your project structure without manual uploads.
A CLI (Command Line Interface) agent runs in your terminal — the same
place you run Python scripts or navigate your filesystem with
ls and cd — and has access to three things a
browser tool does not.
Your files and data. The agent can read your actual datasets, inspect your directory structure, and write scripts directly to disk. You are not copying and pasting between a chat window and a code editor. The agent works in your project the way a collaborator sitting at your machine would.
Your installed tools. Your machine probably has domain-specific software on it: geospatial tools like GDAL, bioinformatics pipelines, R packages, custom scripts, institutional data connectors. A browser AI has no idea these exist. A CLI agent can call them directly, pass output between them, and build on what you already have installed.
An iterative loop. When a script fails, the agent sees the error output in the terminal and can try again. You are not copying stack traces back into a chat window. The feedback loop is tight and stays in one place.
Data privacy and institutional context
The University of California and many other campuses have enterprise agreements with AI vendors like OpenAI and Google. These agreements usually state that your data will not be used to train public models.
However, CLI or API access under these agreements is not always documented. Campus IT often needs to provision this access, and terms can vary. Verify with your institution whether your license covers both web interfaces and CLI/API access.
Warning: Personal accounts may allow CLI/API usage but often lack the privacy protections of institutional licenses. Consult your campus IT policy before using AI tools with sensitive data.
Looking ahead: If your research requires absolute privacy, these skills transfer to open LLMs (like Gemma or Llama) run locally via Ollama.
Shift from writer to orchestrator
Traditional programming requires you to remember the syntax and logic of a script. Spec-Driven Research Orchestration offloads syntax generation to the AI, letting you focus on the high-level logic and the Living Spec.
graph TD
A[Researcher] -->|Define goal| B(GEMINI.md\nLiving Spec)
B --> C[Request a plan]
C --> D{Approval Gate}
D -->|Approve| E(AI Agent executes)
D -->|Revise| C
E -->|Draft code| F{Verification}
F -->|Passed| G[Final Output]
F -->|Failed| H[Refinement]
H --> B
style D fill:#bbf,stroke:#333,stroke-width:2px
style F fill:#f9f,stroke:#333,stroke-width:2px
Ask learners: “Have you ever used ChatGPT to write code that looked correct but failed when you ran it?” This is a good time to introduce the concept of orchestration. The goal is not just to “fix” code, but to ensure the AI’s intent (the spec) is correct.
This introduces a new challenge: verification load. You must coordinate and validate the agent’s actions against your requirements.
Managing cognitive load
It is common to feel “out of the loop” when the AI generates many lines of code quickly. To manage this, focus on anchoring your understanding. Read the comments the AI generates and test small pieces of code frequently. If a block of logic is confusing, ask the AI to explain it before moving on.
File system access
Unlike browser tools, the Gemini CLI has access to your working environment. It can read project context from the directory structure and modify files. Instead of copying and pasting code, the agent writes scripts to your disk and can iterate based on terminal errors.
Security responsibility
Giving an AI agent access to your filesystem is a security responsibility. A buggy or misconfigured agent could delete files or access sensitive data, such as passwords.
Always consider that your tools can have unintended consequences. Ensure files are backed up or under version control (like Git) so you can revert unwanted changes.
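If the project is not yet under version control, a timestamped copy is a simple fallback. The following is a minimal sketch, not part of the Gemini CLI; the function name `snapshot_project` and the backup location are hypothetical, and the backup folder is assumed to live outside the project so it is not copied into itself.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_project(project_dir, backup_root):
    """Copy project_dir into a new timestamped folder under backup_root."""
    project = Path(project_dir).resolve()
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root).resolve() / f"{project.name}-{stamp}"
    # copytree refuses to write into an existing folder,
    # so an earlier snapshot is never silently overwritten.
    shutil.copytree(project, target)
    return target
```

Calling `snapshot_project(".", "../backups")` before starting an agent session gives you a copy to restore from. Git remains the better option because it also records what changed.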
Long context
Humans have a limited amount of working memory, and LLMs operate in a similar fashion. In LLM tools this limit is called the context window. Models like Gemini 2.5 have a long context window on the order of 1 million tokens. You can provide the AI with your entire project folder (scripts, documentation, and small datasets) at once.
This allows you to describe the desired state of your project, and the agent coordinates changes across multiple files. In a research context, this is declarative programming with AI agents.
A large context window is not a free pass
The more you load into a session, the more the model has to track. Beyond a certain point, quality degrades — the model may lose track of earlier instructions, produce inconsistent output, or fixate on the wrong files. This is sometimes called context poisoning.
A large context window makes this easier to run into, not harder. Managing what goes into your context is part of the workflow, not an afterthought.
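One way to manage what goes into your context is to estimate how large each file is before the agent reads it. The sketch below assumes roughly four characters per token, which is a common rule of thumb and not the tokenizer any particular model uses. The suffix list and the function name `estimate_tokens` are illustrative.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb (assumption), not a real tokenizer
TEXT_SUFFIXES = {".py", ".md", ".csv", ".txt", ".yaml", ".yml", ".r"}


def estimate_tokens(project_dir):
    """Return {relative_path: estimated_tokens} for text-like files."""
    root = Path(project_dir)
    estimates = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            estimates[str(path.relative_to(root))] = len(text) // CHARS_PER_TOKEN
    return estimates
```

Sorting the result by size shows which files dominate the session. Large raw datasets are usually better summarized than loaded whole.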
Let’s make sure this works
Open a terminal window and type gemini --help and you
should see something like:
BASH
Usage: gemini [options] [command]
Gemini CLI - Defaults to interactive mode. Use -p/--prompt for non-interactive (headless) mode.
Commands:
gemini [query..] Launch Gemini CLI [default]
gemini mcp Manage MCP servers
gemini extensions <command> Manage Gemini CLI extensions. [aliases: extension]
gemini skills <command> Manage agent skills. [aliases: skill]
gemini hooks <command> Manage Gemini CLI hooks. [aliases: hook]
...
Navigate to your project folder and start a session:
BASH
cd /path/to/project/directory
gemini -p "Tell me what operating system I am currently using and list the files in this directory."
Compare the output to what you see when you run ls (or
dir on Windows). Did the AI accurately describe your
environment?
The AI should return a response similar to:
You are currently using macOS (Darwin). The files in this directory are:
- index.md
- config.yaml
- episodes/
- data/
...
Notice that gemini can ‘see’ your files and understands what environment you are working in.
We have a project folder that we want to start a project in. Let’s initialize it as a Gemini project and see what that does.
Working directory matters
Always start the Gemini CLI from inside your project folder. The agent uses the current directory to find your files and spec. Starting from the wrong folder — such as your home directory — is one of the most common sources of confusion in a workshop.
Initialize your project
The Gemini CLI includes an /init command that creates a
GEMINI.md template in your working directory. Start an interactive
session by running gemini with no arguments.
You are now inside Gemini. Press / to see the available
slash commands and page through the full list. Notice /init:
this is the command that will initialize our project. Let’s run
it by typing /init.
Gemini inspects your project folders while the command runs. After it finishes, let’s see what new files have been created. You can do this inside Gemini by prefixing a shell command with !, for example !ls.
This shows the files in the directory. You should see a file named
GEMINI.md. Let’s look inside it using the cat
command with !, that is, !cat GEMINI.md.
Here is what Gemini generated for a fresh empty project folder:
MARKDOWN
# Project Context: vibe-coding-lesson
## Directory Overview
The `/Users/geno/Desktop/vibe-coding-lesson` directory appears to be an empty
project directory. It currently contains only this `GEMINI.md` file, which
serves as an index and context provider for the project.
## Key Files
* **`GEMINI.md`**: This file contains project-specific context, documentation,
and instructional information for AI agents and developers. It is intended to
be updated as the project evolves.
## Usage
This directory is intended to be a workspace for the 'vibe-coding-lesson'
project. As it is currently empty, it is ready for initialization or the
addition of new files and code. Future usage will depend on the specific goals
of the 'vibe-coding-lesson' project.
---
*Last analyzed on: March 25, 2026*
A few things to notice. Gemini scanned the directory and described
what it found. Because the folder was empty, it does not have much to
say yet. Once you add data files, scripts, and documentation, running
/init again will produce a richer spec that reflects your
actual project. This is also something you will want to edit by hand —
add your goals, constraints, and any rules you want the agent to
follow.
The Living Spec
To get the most out of a CLI agent, provide it with persistent context about your project. This acts as a “Living Spec”—a set of rules and goals the agent must follow across every session.
Every major CLI tool has its own native spec file that it loads automatically when you start a session:
| Tool | Native spec file | Auto-loaded? |
|---|---|---|
| Gemini CLI | GEMINI.md | Yes |
| Claude Code | CLAUDE.md | Yes |
| OpenAI Codex | AGENTS.md | Yes |
| Cursor | .cursorrules | Yes |
You can also use a portable spec file (AGENTS.md is a common
convention) that you explicitly reference in any prompt:
"Read AGENTS.md and then...". Not every tool loads it
automatically, but it travels with your project if you switch tools.
In 2025, AGENTS.md was contributed to the Agentic AI Foundation, a
Linux Foundation project whose founding contributors include Anthropic,
OpenAI, and Block, making it the emerging cross-tool portable spec
format.
What to include in your spec file
Use this file to define:
- Current Goal: What you are working on right now.
- Rules of the Road: Technical constraints (e.g., “Always use pandas for dataframes”).
- Verification Gates: How you will confirm the code is correct.
Challenge: Initialize and customize your spec file
Inside your Gemini CLI session, run /init to create a
GEMINI.md file. Then open it in a text editor and add one
“Hard Constraint” (something the AI must do) and one “Success
Metric” (how you know it’s done).
MARKDOWN
# Project: Arctic Sea Ice Analysis
## Goal
To analyze trends in sea ice extent from 1980-2020.
## Rules of the Road
- **Hard Constraint**: Only use the `xarray` library for spatial data processing.
- **Success Metric**: All final plots must include a valid DOI reference for the data source.
## Conventions
- Use snake_case for variable names.
- Save all plots to the `figures/` directory.
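A spec file is only useful if the sections you rely on are actually present. The sketch below checks a spec's text for second-level headings. The required section names mirror the example above and are an assumption, not a rule enforced by any CLI tool; `missing_sections` is a hypothetical helper.

```python
import re

# Section names taken from the example spec above (assumed convention).
REQUIRED_SECTIONS = ["Goal", "Rules of the Road"]


def missing_sections(spec_text, required=REQUIRED_SECTIONS):
    """Return required section names that have no matching '## ' heading."""
    found = re.findall(r"^##\s+(.+)$", spec_text, flags=re.MULTILINE)
    headings = {title.strip().lower() for title in found}
    return [name for name in required if name.lower() not in headings]
```

Running this on the contents of GEMINI.md after editing returns an empty list when both sections exist.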
- CLI agents coordinate actions across multiple files.
- Run /init inside Gemini to create a GEMINI.md Living Spec that reduces context drift.
- A portable AGENTS.md lets the same spec travel across different AI tools.
- Research orchestration shifts focus from writing syntax to validating intent.
Content from Best practices for prompting
Last updated on 2026-03-25
Estimated time: 40 minutes
Overview
Questions
- How do I write effective prompts?
- What are common AI failures?
- How can I make the AI fix its own mistakes?
Objectives
- Apply the CLEAR framework.
- Identify common AI failures.
- Use introspection to refine code.
Working inside the Gemini CLI
Five principles of effective prompting
Effective prompting is clear technical communication. To get the best results, start by being specific. Include constraints, filenames, and a description of your expected output. Vague requests lead to generic answers, while precise instructions result in usable code.
Provide context. Explain why you need the code and what data you have (e.g., “I am processing a CSV file with these columns…”). This helps the AI understand the goal. Specify outputs clearly—tell the AI where to save files or how to format tables.
Treat prompting as an iterative process. Start with a simple request and add complexity in follow-up prompts. Include validation steps by asking the AI to verify or test its own work.
The CO-STAR framework
While the CLEAR framework (covered later in this episode) helps with
conversation flow, CO-STAR structures complex research prompts that
eventually become part of your AGENTS.md:
- Context: Provide background (e.g., “I am a biologist analyzing RNA-seq data”).
- Objective: Define the specific task (“Write a script to normalize these counts”).
- Style: Specify the coding style (“Use the Tidyverse style guide in R”).
- Tone: Set the personality (“Be concise and prioritize readable code”).
- Audience: Who is this for? (“For a graduate student who knows R but not bioinformatics”).
- Response: Define the format (“A single R script with comments and a plot output”).
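The six CO-STAR fields can be kept as structured data and rendered into a prompt when needed. This is a minimal sketch; the class name `CoStarPrompt` and the "Label: text" layout are illustrative choices, not a format any tool requires.

```python
from dataclasses import dataclass


@dataclass
class CoStarPrompt:
    context: str
    objective: str
    style: str
    tone: str
    audience: str
    response: str

    def render(self):
        """Join the six fields into one labeled prompt, in CO-STAR order."""
        parts = [
            ("Context", self.context),
            ("Objective", self.objective),
            ("Style", self.style),
            ("Tone", self.tone),
            ("Audience", self.audience),
            ("Response", self.response),
        ]
        return "\n".join(f"{label}: {text}" for label, text in parts)
```

Because every field is required, leaving one out raises an error when the object is created, which is a cheap reminder to fill in the whole framework.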
The Bootstrap Workflow
Instead of writing a full AGENTS.md by hand, use the
Bootstrap Workflow. This lets the agent assist in
defining the project spec from the start.
- Scan: Ask the agent to scan your directory and data files.
- Draft: Ask the agent to write an initial AGENTS.md based on what it sees and your high-level goal.
- Gate: You review, edit, and approve the spec before any code is written.
Example bootstrap prompt
“Scan the CSV files in data/raw/. Based on my goal of
‘Analyzing water quality trends’, draft an AGENTS.md file
that defines the column schema, required libraries, and a plan for
cleaning the data.”
Concrete example: From bad to good
| Aspect | Bad prompt | Good prompt |
|---|---|---|
| Vague vs specific | “Clean this data.” | “In data.csv, remove rows with missing values in the ‘age’ column and save as clean_data.csv.” |
| No context vs context | “Write a plot script.” | “I am building a report for a climate study. Write a Python script using seaborn to create a line plot of ‘temp’ over ‘year’ from results.csv.” |
| Silent vs validated | “Run a t-test.” | “Perform a paired t-test between ‘pre’ and ‘post’ columns. Print the t-statistic, p-value, and an interpretation of the result at alpha=0.05.” |
Write CLEAR vertically on a whiteboard. As you explain each letter, add the keyword (Concise, Logical, Explicit, Adaptive, Reflective). This helps students remember the framework.
The CLEAR framework
The CLEAR framework, developed by Leo Lo, provides a structured approach to prompt engineering:
graph LR
C[Concise] --> L[Logical]
L --> E[Explicit]
E --> A[Adaptive]
A --> R[Reflective]
R -->|Feedback Loop| A
style R fill:#bbf,stroke:#333,stroke-width:2px
Effective prompts are concise and logical, prioritizing important information and following a sequence of steps. They are also explicit, specifying the scope, persona, and tone of the output. When the AI produces poor results, be adaptive by rephrasing or splitting tasks. Finally, be reflective—evaluate the output and verify facts using other sources rather than trusting the response.
Introspection
The CLEAR framework guides your input, but you can also force the AI to critique its own output. This is often called self-correction.
Emphasize this section. Most learners treat AI output as final. The idea that they can ask the AI to fix its own work is often a new concept. It is like asking a student, “Are you sure you checked your work?”—they often find their own mistakes when asked.
AI models are often better at verifying code than writing it. Never accept the first draft. Follow up with an introspection prompt:
- “Review the code you just wrote. Are there any edge cases or security vulnerabilities?”
- “Did you hardcode any file paths?”
- “Critique your own implementation. Is there a more efficient way?”
Reasoning models
As of 2025, reasoning models (such as OpenAI o1/o3, DeepSeek-R1, or Gemini 2.5 Thinking) have emerged. These models perform chain of thought reasoning before they answer.
When to use them:
- Standard models (e.g., Gemini Flash): Best for quick formatting, simple scripts, and brainstorming.
- Reasoning models: Best for complex logic, debugging hard errors, or writing scientific formulas where accuracy is important.
When using a reasoning model, you often do not need to ask for introspection—they do it before showing the code.
Plan before you act
As tasks grow more complex, asking the agent to write code immediately leads to more rewrite time. The emerging best practice is to request a plan first, review it, and approve it before any files are written.
The think-then-do pattern
Start any multi-step task with an explicit planning prompt:
Before writing any code, describe your approach in numbered steps. Do not write any files yet.
Review the plan, push back on steps you disagree with, and ask for alternatives. Once you are satisfied, say “proceed with step 1.”
Checkpoint prompts
Break large tasks into explicit phases so you review the output at each stage before moving forward:
Step 1 only: read the three CSV files and tell me what inconsistencies you find. Do not write any code yet.
This is especially valuable in research because it catches misunderstandings about your data before they propagate into broken code.
The plan file
For complex projects, ask the agent to write a PLAN.md
first:
Write a PLAN.md outlining the steps to clean and merge these files. I will review and edit it before you write any code.
This makes the plan a reviewable, editable artifact — a more formal version of the Bootstrap Workflow. Once approved, refer back to it in follow-up prompts: “Proceed with step 2 from PLAN.md.”
Plan files vs. the Living Spec
A PLAN.md and your GEMINI.md serve
different purposes. The spec defines persistent rules and constraints
that apply across all sessions. The plan describes the steps for a
specific task. Keep them separate: plans are temporary, specs are
durable.
Challenge: Plan before you clean
Practice the think-then-do pattern before moving on to the data cleaning episode. Inside your Gemini CLI session, type:
I have three CSV files from different research sites with inconsistent column names and date formats. Before writing any code, outline a step-by-step plan for cleaning and merging them into a single dataset. Do not write any files yet.
Review the plan. Does it include an audit step? Does it address missing values? Revise the plan in the conversation until you are satisfied, then save it by asking: “Write this plan to PLAN.md.”
- An audit step — inspect files before changing them
- A schema harmonization step — standardize column names
- A date standardization step
- A missing value strategy
- An output verification step
If the agent skipped any of these, ask it to revise before you proceed. The goal is to catch gaps in the plan, not in the code.
AI failures
AI agents are designed to be helpful, which can lead them to take shortcuts.
Common failure modes
- Determinism collapse: Small variations in prompts or model updates can lead to different outputs for the same task, which affects reproducibility.
  - Fix: Use temperature=0 (if available) and log your model versions and prompts.
- Over-correction loops: If an agent runs its own tests, it might fix the test to match its buggy code.
  - Fix: Write your own requirements and key tests.
- Synthetic data substitution: The AI may generate fake data if it cannot find the real file.
- Silent failure: The AI uses try/except blocks that hide errors.
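The last two failure modes are easier to spot once you know what the safer alternative looks like. The sketch below is one illustrative way to load a column so that a missing file, a missing column, or a malformed value stops the script instead of being hidden. The function name `load_scores` and the 'score' column are assumptions for the example.

```python
import csv
from pathlib import Path


def load_scores(path):
    """Load the 'score' column, failing loudly instead of hiding problems."""
    path = Path(path)
    if not path.exists():
        # Guard against synthetic data substitution: never invent rows.
        raise FileNotFoundError(f"Expected real data at {path}")
    scores = []
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        if "score" not in (reader.fieldnames or []):
            raise KeyError(f"{path} has no 'score' column: {reader.fieldnames}")
        for line_no, row in enumerate(reader, start=2):
            raw = (row.get("score") or "").strip()
            if raw == "":
                continue  # blank cell: treated as missing and skipped
            try:
                scores.append(float(raw))
            except ValueError as err:
                # Narrow except that re-raises with context,
                # unlike a bare 'except: pass' that swallows the error.
                raise ValueError(f"{path} line {line_no}: bad score {raw!r}") from err
    return scores
```

When reviewing generated code, a bare `except:` or an `except Exception: pass` around file loading is a signal to ask the agent why the error is being suppressed.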
How to catch failures
Have you seen an AI make a confident mistake? In your research, what signs indicate the AI is hallucinating?
Common strategies:
- Always ask: “Show me the first 10 rows of the data you loaded.”
- Demand proof: “How did you calculate that p-value? Show the intermediate steps.”
- Check file sizes: Is the cleaned file 0 bytes?
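The first and last checks in this list can also be run yourself, without asking the agent. This is a minimal sketch using only the standard library; `preview_csv` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def preview_csv(path, n_rows=10):
    """Return the header and the first n_rows, refusing 0-byte files."""
    path = Path(path)
    if path.stat().st_size == 0:
        raise ValueError(f"{path} is 0 bytes; the previous step probably failed.")
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # zip stops after n_rows, so large files are not read in full.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

Comparing this preview with what the agent claims to have loaded is a quick test for fabricated or substituted data.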
Challenge: The prompt refinement loop
Practice the CLEAR framework to visualize the relationship between “Date” and “Score” in a dataset.
- Start with a vague prompt. Type this inside your Gemini CLI session:
  Create a plot of the data I just made.
- Observe: Does it work? Is the plot useful? Where did it save it?
- Refine the prompt: Write a new prompt that applies context (what the data is), specificity (scatterplot with regression line), and output instructions (save as fig/trend_analysis.png).
Using the 'master_dataset.csv' file, create a Python script to generate a scatterplot of 'date' vs 'score'. Add a linear regression trendline. Label the axes clearly. Save the final plot to a file named 'fig/trend_analysis.png' (create the directory if it doesn't exist).
Challenge: The introspection loop
Test the AI as a verifier principle. Ask the AI to find flaws in its code before you run it.
- Generate a script. Type this prompt inside your Gemini CLI session:
  Write a Python script that reads 'data.csv' and calculates the rolling 7-day average of a 'score' column. Handle missing values.
- Force introspection: Once the code is generated, do not run it. Follow up in the same session:
  Review the rolling average script you just wrote. Are there any edge cases (like having fewer than 7 days of data) where this would fail? If so, provide an updated version.
- Compare: Did the AI find a mistake in its first draft? Did it add a guard clause like min_periods=1?
AI models are often more accurate when asked to critique logic than when asked to generate it. This second pass is part of the editor mindset and reduces manual debugging.
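To judge whether the agent's revised script handles the short-window edge case, it helps to know what a minimum-periods guard does. The sketch below is a plain-Python illustration written for this purpose. It is meant to mirror the idea behind the pandas `min_periods` argument, not to reproduce the agent's output, and the function name `rolling_mean` is hypothetical.

```python
def rolling_mean(values, window=7, min_periods=1):
    """Trailing rolling mean; None marks missing inputs and too-short windows."""
    result = []
    for i in range(len(values)):
        # Keep only the observed values inside the trailing window.
        chunk = [v for v in values[max(0, i - window + 1): i + 1] if v is not None]
        if chunk and len(chunk) >= min_periods:
            result.append(sum(chunk) / len(chunk))
        else:
            result.append(None)
    return result
```

With `min_periods=7`, the first six positions come back as `None`. With `min_periods=1`, every position that has at least one observed value gets an average. Which behaviour is right depends on your analysis, so it belongs in the spec.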
- Be specific and provide context.
- Plan before you act: request a numbered plan and approve it before any files are written.
- Always validate AI outputs.
- Introspection improves code quality.
Content from Data cleaning with AI
Last updated on 2026-03-25
Estimated time: 50 minutes
Overview
Questions
- How can AI handle messy data?
- Can I trust AI to standardize inconsistent files?
Objectives
- Generate a test dataset using Gemini.
- Build a data processing pipeline for inconsistent files.
- Verify AI-generated code before running it.
- Document the cleaning process.
This episode uses live coding. Learners should follow along by running commands on their own machines.
Prerequisites
Ensure you are authenticated (gemini auth login) and
have a Gemini CLI session running in your project folder. Generating
scripts can take 10-30 seconds.
Cleaning messy data
Cleaning and merging inconsistent files is a common bottleneck in research. We will use Gemini to standardize messy CSV files.
Generating test data
To practice cleaning, we need a dataset with inconsistencies. We can
use AI to simulate a multi-site study where each location used different
naming conventions or date formats. Run this command to generate three
files: site_A.csv, site_B.csv, and
site_C.csv.
Create a python script named 'make_messy_data.py'. It should generate 3 CSV files ('site_A.csv', 'site_B.csv', 'site_C.csv') with 50 rows each. Columns should include 'ID', 'Date', and 'Score', but make them inconsistent (e.g., 'ParticipantID' vs 'id', 'date' vs 'Date_Time'). Add some missing values and varied date formats (like '2023/01/05' vs 'Jan 5, 2023').
After running python make_messy_data.py, you will have
three inconsistent files in your directory.
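The agent's script will differ from run to run. For orientation, here is a minimal sketch of what such a generator might look like using only the standard library. The specific header spellings for site_C, the third date format, the 10% missing rate, and the fixed seed are illustrative assumptions, not values from the prompt.

```python
import csv
import random
from datetime import date, timedelta
from pathlib import Path

# Per-site header names and date formats (illustrative choices).
SITES = {
    "site_A.csv": (["ParticipantID", "Date", "Score"], "%Y/%m/%d"),
    "site_B.csv": (["id", "date", "score"], "%b %d, %Y"),
    "site_C.csv": (["ID", "Date_Time", "SCORE"], "%d-%m-%Y"),
}


def make_messy_files(out_dir=".", n_rows=50, seed=42):
    """Write one CSV per site with inconsistent headers, dates, and gaps."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    for filename, (header, date_format) in SITES.items():
        with (Path(out_dir) / filename).open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for i in range(1, n_rows + 1):
                day = date(2023, 1, 1) + timedelta(days=rng.randrange(365))
                # Roughly 10% of scores are left blank to simulate missing values.
                score = "" if rng.random() < 0.1 else round(rng.uniform(0, 100), 1)
                writer.writerow([i, day.strftime(date_format), score])
```

Calling `make_messy_files()` with no arguments writes the three files into the current directory.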
Auditing the data
Before fixing the files, we need to understand the inconsistencies. We can ask the AI to write an inspection script that reads every CSV in the folder and reports the filenames, column names, and missing value counts.
Write a Python script called 'inspect_data.py' that reads every CSV file in the current folder. For each file, print the filename, the list of column names, and the number of missing values in each column.
Run the inspection script. You should see inconsistencies like
site_A using ParticipantID while
site_B uses id.
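As a reference point for reviewing the agent's inspection script, the sketch below shows one way to collect the same information with the standard library. The function name `inspect_csvs` and the choice to treat blank cells as missing are assumptions.

```python
import csv
from pathlib import Path


def inspect_csvs(folder="."):
    """Return {filename: {column: missing_count}} for every CSV in folder."""
    report = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            missing = {name: 0 for name in (reader.fieldnames or [])}
            for row in reader:
                for name in missing:
                    # A blank or absent cell counts as a missing value.
                    if (row.get(name) or "").strip() == "":
                        missing[name] += 1
        report[path.name] = missing
    return report
```

The keys of each inner dictionary are the column names, so printing the report shows the naming inconsistencies and the missing counts together.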
Reasoning models
If your data files are extremely inconsistent, reasoning models (like o1 or DeepSeek-R1) are often more effective. They can identify subtle naming patterns that standard models might miss.
Spec-guided cleaning
In Spec-Driven Research Orchestration, we don’t just
ask for a script. We refer to the AGENTS.md file to ensure
the script follows the project’s rules.
Harmonizing files with the Spec
We will now ask Gemini to generate a script using the rules defined
in AGENTS.md.
Read 'AGENTS.md' and the 3 site CSVs. Write a script called 'clean_and_merge.py' that renames IDs and standardizes dates according to the schema in the spec. Fill missing scores with the median and save to 'master_dataset.csv'. Add comments linking code steps to spec rules.
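The script Gemini writes depends on your spec. The sketch below illustrates the three operations the prompt names (renaming ID columns, standardizing dates, filling missing scores with the median) so you know what to look for during review. The alias table, the date formats, the added `site` column, and the use of one overall median are all assumptions standing in for rules that would come from your AGENTS.md.

```python
import csv
import statistics
from datetime import datetime
from pathlib import Path

# Assumed target schema and aliases; in the lesson these come from AGENTS.md.
ALIASES = {"participantid": "id", "id": "id", "date": "date",
           "date_time": "date", "score": "score"}
DATE_FORMATS = ["%Y/%m/%d", "%b %d, %Y", "%d-%m-%Y", "%Y-%m-%d"]


def parse_date(text):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")  # fail loudly


def clean_and_merge(paths, out_path="master_dataset.csv"):
    rows = []
    for path in paths:
        with open(path, newline="") as handle:
            for raw in csv.DictReader(handle):
                # Rename columns to the target schema (case-insensitive);
                # an unknown column name raises KeyError instead of passing through.
                row = {ALIASES[key.strip().lower()]: value for key, value in raw.items()}
                row["site"] = Path(path).stem
                row["date"] = parse_date(row["date"])
                text = (row["score"] or "").strip()
                row["score"] = float(text) if text else None
                rows.append(row)
    # Fill missing scores with the median of all observed scores.
    median = statistics.median(r["score"] for r in rows if r["score"] is not None)
    for row in rows:
        if row["score"] is None:
            row["score"] = median
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["site", "id", "date", "score"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note the design choice in `parse_date`: an unrecognized date stops the run instead of being guessed, which keeps a bad parse from silently entering the merged dataset.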
If a learner’s AI fails to generate working code, provide the
pre-written versions from instructors/files/:
- backup_make_messy_data.py
- backup_inspect_data.py
- backup_clean_and_merge.py
The editor role
Before running the code, open clean_and_merge.py. Check
if the logic is sound, if the comments match the code, and if there are
syntax errors. You are responsible for the final output.
Run python clean_and_merge.py to create the clean
dataset.
Using comments
We asked the AI to “Add comments linking code steps to spec rules.” These comments make the script readable and help you verify the methodology against the spec.
This challenge requires modifying existing code. If learners are
stuck, suggest they ask the AI to read clean_and_merge.py
before asking for modifications.
Challenge: Update the script
Imagine you need to exclude any participant with a score below 10.
Use the Gemini CLI to update clean_and_merge.py instead of
editing it manually. Ask the AI to read the file and add the filtering
logic. Run the updated script and verify the results in
master_dataset.csv.
Read 'clean_and_merge.py'. Modify the script to filter out any rows where 'score' is less than 10. Keep all other logic the same. Save the updated script.
Automating documentation
For the final step, have the AI generate a README that explains the data pipeline, including the raw files, cleaning steps, and final output format.
Create a README.md file that explains the data processing pipeline we just built. List the original files, the cleaning steps performed, and the final output format.
Challenge: Provenance tracking
To ensure research is reproducible, track which model generated your code and when.
- Use the Gemini CLI to add a provenance header to clean_and_merge.py.
- The header should be a Python docstring containing:
  - The model used (e.g., Gemini 2.5 Flash)
  - The date
  - A summary of the prompt.
Read 'clean_and_merge.py'. Add a docstring at the very top of the file as a provenance header. Include the model name 'Gemini 2.5 Flash', today's date, and a summary of the prompt: 'Standardize site IDs, format dates, and impute missing scores with site medians.'
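If you prefer to add the header yourself, or want to check what the agent produced, the sketch below builds the same kind of docstring. The field labels and the function name `add_provenance` are illustrative. It assumes the script has no shebang line or existing module docstring and that the summary contains no triple quotes.

```python
from datetime import date


def add_provenance(source, model, prompt_summary, today=None):
    """Return source with a provenance docstring placed at the very top."""
    today = today or date.today().isoformat()
    header = (
        '"""\n'
        "Provenance\n"
        f"Model: {model}\n"
        f"Date: {today}\n"
        f"Prompt summary: {prompt_summary}\n"
        '"""\n'
    )
    return header + source
```

To apply it, read the script's text, pass it through `add_provenance`, and write the result back to the same file.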
- Use AI to automate data harmonization and standardization.
- Audit data before cleaning to identify inconsistencies.
- Read and test all generated code before running it.
Content from Validation strategies: the approval gate
Last updated on 2026-03-11
Estimated time: 50 minutes
Overview
Questions
- How do I move from vibe coding to research orchestration?
- What is rewrite time and why does it matter for research reproducibility?
- How can I use one AI to catch the errors of another?
Objectives
- Shift from a writer to an auditor mindset using the approval gate.
- Calculate rewrite time to measure workflow efficiency.
- Use a four-layer validation stack with explicit requirement constraints.
- Use multi-model verification to peer-review research code.
The approval gate: verification over generation
In an agentic research workflow, your role is to audit and approve code rather than write it. The standard has shifted from vibe coding to spec-driven orchestration.
The approval gate is the point where you decide AI-generated code is robust enough for research production. It separates a working prototype from validated science.
The review-first standard
The bottleneck in research is no longer writing code; it is verifying it. A high-performance workflow follows this cycle: Plan → Agent Implementation → Automated AI-Powered Testing → Human Review.
Measuring efficiency: rewrite time
Rewrite time is a metric to determine if an AI workflow is actually helping your research.
- Definition: The manual effort in minutes a researcher spends making AI-generated output production-ready.
- Goal: Keep rewrite time small relative to the time spent prompting. If you spend 20 minutes prompting but then 40 minutes fixing the code, your rewrite time is too high.
Challenge: calculate your rewrite time
Look back at a script you recently generated with an AI agent.
1. How many minutes did you spend fixing or refactoring the code to make it run?
2. If this took more than 10% of the total task time, what was the likely cause?
   - Ambiguous specs?
   - AI over-confidence?
   - Lack of local context?
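The arithmetic behind this challenge is simple enough to script. The sketch below assumes that total task time is prompting time plus rewrite time, which is one reasonable reading of the definition above and not the only one. The function name `rewrite_share` is hypothetical.

```python
def rewrite_share(prompting_minutes, rewrite_minutes):
    """Fraction of total task time spent fixing AI-generated output."""
    total = prompting_minutes + rewrite_minutes
    if total <= 0:
        raise ValueError("Total task time must be positive.")
    return rewrite_minutes / total


# The example from the definition above: 20 minutes prompting, 40 minutes fixing.
share = rewrite_share(prompting_minutes=20, rewrite_minutes=40)
print(f"Rewrite share: {share:.0%}")  # 40 of 60 minutes, about two thirds
```

Logging these two numbers per task over a few weeks shows whether changes to your spec are reducing the share.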
The four-layer validation stack
To minimize rewrite time and ensure research rigor, use a structured validation stack.
Layer 1: Requirement constraints (No-Go Zones)
Before the AI writes code, define requirement constraints in your
AGENTS.md. These are rules the AI is not allowed to
break.
Example: “Do not change the column names in
raw_data.csv” or “Use only base R for this visualization to
ensure compatibility.”
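A constraint such as "do not change raw_data.csv" can be checked mechanically and not only trusted. One possible approach, sketched here with hypothetical helper names, is to record a hash of the raw file before the agent runs and compare it afterwards.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def assert_unchanged(path, expected_digest):
    """Raise if the file no longer matches the digest recorded earlier."""
    actual = file_digest(path)
    if actual != expected_digest:
        raise AssertionError(f"{path} was modified: {actual} != {expected_digest}")
```

Record `file_digest("raw_data.csv")` before the session and call `assert_unchanged` at the end. Any edit to the raw file, including a renamed column, changes the digest.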
Layer 2: Automated unit tests
Ask the agent to write tests before the implementation. Use a prompt pattern like: “First, write five Pytest cases that define the success of this data cleaning script. I will approve the tests before you write the logic.”
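The tests you approve might look like the sketch below, which uses plain `assert` statements that pytest can discover by the `test_` prefix. The function `standardize_id` and its expected behaviour are hypothetical. In the workflow described above the tests are written and approved first, and a minimal implementation is included here only so the block runs on its own.

```python
def standardize_id(raw):
    """Hypothetical function under test: normalize a participant ID."""
    text = str(raw).strip()
    if not text:
        raise ValueError("empty participant ID")
    return text.upper()


def test_strips_whitespace():
    assert standardize_id("  p01 ") == "P01"


def test_accepts_integers():
    assert standardize_id(7) == "7"


def test_rejects_empty():
    try:
        standardize_id("   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty ID")
```

Keeping the approved tests in a file the agent is told not to edit also guards against the over-correction loop, where a failing test is changed to match buggy code.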
Multi-model verification
We use a challenger model to audit an implementation model rather than trusting a single AI.
Challenge: orchestrate a peer review
- Use Model A (such as Claude Code or Cursor) to generate a data cleaning script.
- Provide the code to Model B (such as Gemini CLI) with the following prompt:
  “Read this script. Act as a skeptical senior data scientist. Identify three potential edge cases where this script will fail, such as empty strings, NaN values, or encoding issues. Suggest specific assert statements to catch these.”
- Reflect: Did the challenger model find something the implementation model missed?
Models have different blind spots. Forcing a second AI to act as an auditor helps bypass the tendency of the primary assistant to be over-confident. This process reduces your manual rewrite time.
Warn learners about approval fatigue: the tendency to accept AI suggestions without reading them. The four-layer stack is designed to make the AI prove it is correct before you review the code.
- The approval gate separates experimental prototypes from validated research.
- Rewrite time is the primary metric for measuring AI workflow value.
- Requirement constraints prevent the AI from drifting away from research specs.
- Multi-model verification uses a second AI to act as a skeptical peer reviewer.
Content from Limitations and cautions
Last updated on 2026-03-28
Estimated time: 25 minutes
Overview
Questions
- When should I not use AI?
- What are common failure modes?
Objectives
- Recognize high-risk scenarios for AI use.
- Identify hallucinated or outdated code.
- Distinguish between open and proprietary models.
The jagged frontier
AI capability is inconsistent. A model may solve a complex differential equation but fail a simple logic puzzle. Researchers must identify where AI is reliable and where it is a liability for their specific field.
When not to trust AI code
Using AI-generated code can introduce risks to research integrity. Security-critical tasks—like authentication, encryption, or handling sensitive data—require expert oversight.
AI may also fail when research involves new statistical methods or domain-specific details. Models synthesize information from training data, which might not include the latest breakthroughs or specific sensor patterns. In performance-critical code, AI often prioritizes common algorithms over the most efficient ones, which can cause bottlenecks in large-scale processing.
Common failure modes
Understanding AI failure modes helps you identify errors before they affect results.
Spec Drift
Spec Drift occurs when the code and the AGENTS.md (Living Spec) fall out of sync. The agent may fix a bug in the code but forget to update the spec, leading to future hallucinations.
- Prevention: Regularly ask the agent to “Sync the spec with the current code.”
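Part of this sync check can be made mechanical by comparing values declared in the spec against the constants the code actually uses. The sketch below assumes a hypothetical convention in which the spec records parameters as `- name: value` bullet lines; the parameter names (`min_depth_m`, `date_column`) and the `find_drift` helper are illustrative and not part of any tool.

```python
import re

# Hypothetical convention: the spec lists parameters as "- name: value" bullets.
SPEC_LINE = re.compile(r"^-\s*(\w+):\s*(.+?)\s*$")


def parse_spec(spec_text: str) -> dict[str, str]:
    """Collect 'name: value' pairs from bullet lines in the spec text."""
    params = {}
    for line in spec_text.splitlines():
        match = SPEC_LINE.match(line.strip())
        if match:
            params[match.group(1)] = match.group(2)
    return params


def find_drift(spec_text: str, code_params: dict[str, object]) -> dict[str, tuple]:
    """Return parameters whose spec value differs from the value used in code."""
    spec_params = parse_spec(spec_text)
    drift = {}
    for name, code_value in code_params.items():
        spec_value = spec_params.get(name)
        # Compare as strings because the spec is plain text.
        if spec_value != str(code_value):
            drift[name] = (spec_value, code_value)
    return drift


# Illustrative spec excerpt and the values the code currently uses.
spec_text = """
Cleaning rules
- min_depth_m: 0.5
- date_column: sampled_on
"""
code_params = {"min_depth_m": 0.8, "date_column": "sampled_on"}

drift = find_drift(spec_text, code_params)
print(drift)
```

Here the code uses a depth cutoff of 0.8 while the spec still says 0.5, so `min_depth_m` is reported. A check like this only catches values you have chosen to track; it does not replace reading the spec.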
Bootstrap Failures
In the “Bootstrap Workflow,” the AI may miss nuances in raw data during the initial scan. If you approve a flawed spec, the error will propagate through the entire project.
- Prevention: Thoroughly audit the agent’s first draft of AGENTS.md.
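Some of that audit can be scripted by checking what the draft spec claims about the raw data against the file itself. This is a minimal sketch: the inline CSV, the column names, and both helper functions are hypothetical examples, and a real audit would also cover units, value ranges, and encodings.

```python
import csv
import io


def audit_columns(raw_csv: str, spec_columns: list[str]) -> dict[str, list[str]]:
    """Compare the columns the spec claims against the actual CSV header."""
    header = next(csv.reader(io.StringIO(raw_csv)))
    return {
        "missing_from_data": [c for c in spec_columns if c not in header],
        "missing_from_spec": [c for c in header if c not in spec_columns],
    }


def count_blank_cells(raw_csv: str) -> dict[str, int]:
    """Count empty cells per column, a detail an initial scan can miss."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    blanks = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in blanks:
            value = row.get(name)
            if value is None or value.strip() == "":
                blanks[name] += 1
    return blanks


# Illustrative raw data and the columns a draft spec might claim.
raw_csv = "site_id,depth_m,temp_c\nA1,0.5,12.1\nA2,,11.8\nA3,1.2,\n"
spec_columns = ["site_id", "depth_m", "temperature_c"]

column_report = audit_columns(raw_csv, spec_columns)
blank_report = count_blank_cells(raw_csv)
print(column_report)
print(blank_report)
```

In this example the spec names a `temperature_c` column that the data calls `temp_c`, and two columns contain blank cells the spec would need to address.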
Silent semantic drift
Semantic drift occurs when an agent makes a change that alters data assumptions or logic without breaking the code.
- Example: The code runs and tests pass, but a filtering threshold was changed or a column was renamed incorrectly, affecting the research conclusion.
- Prevention: Use metamorphic testing and invariant checks to ensure core logic remains unchanged.
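To make the prevention step concrete, here is a minimal sketch of both kinds of check on a toy filter. The threshold value, the `depth_m` field, and the function names are assumptions made for illustration; your own invariants should come from your research spec.

```python
import random

DEPTH_THRESHOLD_M = 0.5  # hypothetical value fixed by the research spec


def filter_samples(samples: list[dict]) -> list[dict]:
    """Keep samples at or below the depth threshold (illustrative logic)."""
    return [s for s in samples if s["depth_m"] <= DEPTH_THRESHOLD_M]


def check_invariants(samples: list[dict]) -> None:
    """Invariant checks: properties that must hold after any agent edit."""
    kept = filter_samples(samples)
    # The threshold in code must still match the value agreed in the spec.
    assert DEPTH_THRESHOLD_M == 0.5
    # Filtering never adds rows, and every kept row satisfies the rule.
    assert len(kept) <= len(samples)
    assert all(s["depth_m"] <= 0.5 for s in kept)


def check_metamorphic(samples: list[dict], seed: int = 0) -> None:
    """Metamorphic checks: relations between outputs for related inputs."""
    baseline = filter_samples(samples)
    by_id = lambda s: s["id"]
    # Relation 1: row order must not change which rows are kept.
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    assert sorted(filter_samples(shuffled), key=by_id) == sorted(baseline, key=by_id)
    # Relation 2: duplicating the input doubles the number of kept rows.
    assert len(filter_samples(samples + samples)) == 2 * len(baseline)


samples = [
    {"id": 1, "depth_m": 0.2},
    {"id": 2, "depth_m": 0.5},
    {"id": 3, "depth_m": 0.9},
]
check_invariants(samples)
check_metamorphic(samples)
print(len(filter_samples(samples)))
```

The invariant on `DEPTH_THRESHOLD_M` is what catches a silently changed threshold; the metamorphic relations catch logic that has started to depend on row order or duplicates. Neither requires knowing the correct output in advance.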
Other failure modes
- Hallucinated functions: The model uses libraries or APIs that do not exist.
- Outdated approaches: The AI uses deprecated syntax from its training data.
- Confident incorrectness: The AI presents wrong formulas or logic as certain.
- Tool poisoning via MCP: When an agent calls external tools through MCP, a misconfigured or malicious MCP server can inject instructions into the agent’s context (prompt injection). This can cause the agent to take unintended actions or leak data. Mitigation: only install MCP servers from trusted, audited sources.
- Over-engineering: The model generates complex code for simple problems.
Environmental cost
Data centers consume large amounts of electricity and water. Frequent, iterative prompting can be resource-intensive.
- Energy use: Every AI query requires complex calculations. Some estimates suggest a single generative AI query uses significantly more energy than a standard web search.
- Code efficiency: AI models often prioritize working code over efficient code. Inefficient software uses more energy and resources over time.
Sustainable practices
To code responsibly:
- Think before prompting: Use the CLEAR framework to get the right answer in fewer attempts.
- Request optimization: Prompt the AI to optimize for memory or speed once the logic is correct.
- Use documentation: If you need simple syntax, check the documentation instead of querying an LLM.
Models are becoming less likely to hallucinate.
- If it refuses: Acknowledge that the model correctly identified its own limitations.
- Backup: Have a screenshot of a known hallucination ready to show if the AI performs perfectly during the session.
Challenge: Test for hallucinations
Inside your Gemini CLI session, type:
How do I use the 'pypanda-researcher' library to automatically write my conclusion?
Note whether the model admits it does not know, hedges with uncertainty, or confidently invents instructions.
Current models (Gemini 2.5, Claude, GPT-4o) are significantly better at refusing or flagging uncertainty than earlier generations — you may get a clean “this doesn’t exist” response. That is the correct behavior. The lesson here is not that hallucination always happens, but that you cannot assume it won’t: always verify suggested libraries and functions exist before using them. Older or smaller models are still more likely to confabulate.
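One quick way to verify a suggestion is to ask Python whether the module can be found and whether it exposes the suggested function. The sketch below uses the standard library's `importlib`; the helper names are illustrative, and `pypanda_researcher` is assumed here as the import name a model might invent for the library in the challenge. A successful check only shows the name exists in your environment, not that the function does what the AI claims.

```python
import importlib
import importlib.util


def module_exists(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def attribute_exists(module_name: str, attribute: str) -> bool:
    """Return True if the module is installed and exposes the attribute."""
    if not module_exists(module_name):
        return False
    module = importlib.import_module(module_name)
    return hasattr(module, attribute)


# A real standard-library module and function.
print(module_exists("json"), attribute_exists("json", "dumps"))
# The invented library from the challenge (assumed import name).
print(module_exists("pypanda_researcher"))
# A real module paired with an invented function name.
print(attribute_exists("json", "write_conclusion"))
```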
Open science and proprietary AI
The Gemini CLI tool itself is open source, but the Gemini models it calls are proprietary, which creates a tension in open research.
- Proprietary models (Gemini, GPT-4, Claude): These are closed-weight models. You cannot verify their training data, and they may update silently. Institutional agreements provide data privacy but do not solve reproducibility issues.
- Open-weights models (Gemma, Llama, Mistral): These can be run locally using tools like Ollama. They offer better reproducibility because you can use a specific, frozen version of the model.
Recommendation: Use proprietary models for prototyping and cleaning, but archive the generated code. Do not rely on the AI to regenerate the same code in the future.
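Archiving is more useful when the code is stored with a record of where it came from. The following sketch builds a small provenance record using only the standard library; the field names, the `example-model-v1` label, and the sample code string are placeholders, not a standard format.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_provenance(code: str, model: str, prompt: str) -> dict:
    """Describe a piece of generated code so it can be archived and audited."""
    return {
        "model": model,
        "prompt": prompt,
        # The hash lets you confirm later that the archived code is unchanged.
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "archived_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder for a script an AI assistant produced.
generated_code = "def clean(rows):\n    return [r for r in rows if r]\n"

record = build_provenance(
    generated_code,
    model="example-model-v1",
    prompt="Write a function that drops empty rows.",
)
print(json.dumps(record, indent=2))
```

Saving this record next to the script in version control means the analysis can be rerun from the archived code even if the model has since changed.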
Key lesson
AI increases your speed but does not replace your expertise or responsibility. Your role shifts from writing code to verifying the code the AI produces.
- Avoid AI for security-critical tasks.
- You are responsible for the final output.
- Open-weights models offer better reproducibility; proprietary models are often more capable.
Content from Resources and next steps
Last updated on 2026-03-28
Estimated time: 20 minutes
Overview
Questions
- What other tools are available?
- Where can I find help?
Objectives
- List alternative AI coding tools.
- Locate documentation and support.
- Understand plugins and Model Context Protocol (MCP).
AI tool landscape
The ecosystem of AI tools for research is expanding. For coding, tools like Claude Code (Anthropic’s CLI), GitHub Copilot, and Aider (a CLI agent that works with multiple models) provide terminal support.
Some researchers use AI-native code editors like Cursor to interact with an entire repository. Models like Gemini 2.5 and DeepSeek-V3/R1 have increased the speed of these interactions.
Research-specific tools are also emerging. Elicit and Consensus focus on scientific paper discovery and evidence-based claims. Google’s NotebookLM allows you to ground an AI’s knowledge in your own collection of research PDFs for summarizing and querying documents.
Extending capabilities
AI tools can now connect with other software. Many browser-based models offer extensions for Google Drive, WolframAlpha, and other services. The Model Context Protocol (MCP) is an open standard that allows AI assistants to connect to local or remote data sources—such as a PostgreSQL database or your local file system—without requiring you to upload data to a central server. In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation. The 2026 MCP roadmap (published March 2026) focuses on enterprise authentication, audit trails, and agent-to-agent communication.
MCP security: shadow IT risk
MCP servers are installed by individual researchers or developers, often without institutional IT oversight. Security researchers have identified risks including prompt injection via malicious MCP servers and tools that silently exfiltrate data. Before connecting an MCP server, verify it is actively maintained and from a trusted source. Your institution’s IT policy may not yet cover MCP-enabled tools.
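A lightweight safeguard is to compare the servers in your client configuration against a list you or your lab have reviewed. The sketch below assumes the configuration is JSON with servers listed under an `mcpServers` key, a layout several MCP clients use; check your own client's documentation for the actual file location and shape. The server names and commands are made up.

```python
import json


def unapproved_servers(config_text: str, allowlist: set[str]) -> list[str]:
    """Return configured MCP server names that are not on the reviewed list."""
    config = json.loads(config_text)
    servers = config.get("mcpServers", {})  # assumed key; varies by client
    return sorted(name for name in servers if name not in allowlist)


# Illustrative configuration with one reviewed and one unknown server.
config_text = json.dumps(
    {
        "mcpServers": {
            "postgres-local": {"command": "example-postgres-server"},
            "unknown-helper": {"command": "example-helper"},
        }
    }
)
APPROVED = {"postgres-local"}

flagged = unapproved_servers(config_text, APPROVED)
print(flagged)
```

An allowlist check does not make a server safe; it only makes unreviewed additions visible so someone can look at them before the agent uses them.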
Local models and open tooling
For researchers handling sensitive data or requiring reproducibility, local AI is a growing trend. Running models on your own hardware ensures that data does not leave your machine.
- Ollama: A tool for running open-weights models (like Llama 3, Mistral, or Gemma) locally on macOS, Linux, and Windows. It provides a CLI and a local API.
- Aider: A CLI coding agent that connects to proprietary APIs and local models (via Ollama). It is designed for pair programming and refactoring in the terminal.
- Qwen Coder: High-performance open models from Alibaba (e.g., Qwen2.5-Coder) that can be run offline for security.
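As an illustration of the local API mentioned above, this sketch sends a prompt to an Ollama server using only the Python standard library. It assumes Ollama is running on its default port (11434) and that a model named `llama3` has already been pulled; substitute whichever model you have installed. The helper names are illustrative.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generation request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    Raises OSError (for example, connection refused) if no server is running.
    """
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `ask_local_model("Explain what a CSV file is in one sentence.")` returns the model's text. Because the request goes to `localhost`, the prompt and any data in it stay on your machine.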
Privacy and performance
Local models offer privacy but require significant hardware (especially GPU VRAM) to match the performance of cloud models like Gemini 2.5. Many researchers use a hybrid approach: cloud models for general scripting and local models for sensitive data.
Advanced trends
Automated research discovery
For time-sensitive research, AI agents can scan social platforms, developer forums, and prediction markets to identify trends before they appear in traditional journals. This “real-time literature mapping” works by prompting the agent to synthesize signal from fast-moving sources alongside traditional ones.
A deep research prompt pattern
Use a prompt like this with any AI assistant that has web search access:
“Act as a research orchestrator. Find the most discussed [TOPIC] from the last 30 days across Reddit, X, and Hacker News.
- Discovery: Identify the top 3 tools or methodologies mentioned.
- Sentiment: Quote the most upvoted critique for each.
- Verification: Cross-reference these with web sources for real-world impact.
- Validation: Apply ‘Layer 4: Domain Plausibility’—identify one trend that sounds plausible but may be an AI-generated ‘vibe’ without empirical backing.”
Multi-model ensembles
In advanced workflows, you can use different models for different tasks. For example, one model generates code, another critiques it, and a third writes validation tests. This reduces the chance of a single model’s bias affecting results.
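The division of labor can be sketched as a small pipeline in which each role is just a function from prompt to text. The three `fake_*` functions below are stand-ins so the example runs offline; in practice each would wrap a call to a different model, and the prompt wording is only an example.

```python
from typing import Callable

# A "model" here is any function that maps a prompt string to a reply string.
Model = Callable[[str], str]


def run_ensemble(
    task: str, generator: Model, critic: Model, test_writer: Model
) -> dict[str, str]:
    """Ask one model for code, a second for a critique, a third for tests."""
    code = generator(f"Write code for this task:\n{task}")
    critique = critic(f"Review this code as a skeptical peer reviewer:\n{code}")
    tests = test_writer(f"Write validation tests for this code:\n{code}")
    return {"code": code, "critique": critique, "tests": tests}


# Stand-in models with canned replies (placeholders for real model calls).
def fake_generator(prompt: str) -> str:
    return "def double(x):\n    return x * 2"


def fake_critic(prompt: str) -> str:
    return "No handling for non-numeric input."


def fake_test_writer(prompt: str) -> str:
    return "assert double(2) == 4"


result = run_ensemble("Double a number.", fake_generator, fake_critic, fake_test_writer)
for role, text in result.items():
    print(f"{role}:\n{text}\n")
```

Keeping the roles as separate functions makes it easy to swap one model for another and to log which model produced which artifact.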
GitHub Agentic Workflows
GitHub Agentic Workflows (technical preview, February 2026) let you write repository automation goals in plain Markdown instead of YAML. GitHub Actions executes them using an LLM. This is a direct production application of the Markdown-as-spec pattern taught in this lesson.
Research-relevant use cases include automated issue triage, CI failure analysis, and pull request review. See the GitHub Agentic Workflows changelog for current status.
Citing and crediting AI
Transparent attribution is essential for open science. Academic standards (including COPE, Nature, and Elsevier) state that AI tools cannot be listed as authors because they lack legal accountability. Instead, cite them as methodological tools.
Recommended attribution
1. In code repositories (README.md):
Add an AI usage section to your project documentation:
Example attribution
- Model: Google Gemini 2.5 Flash
- Role: Spec drafting and spec-guided cleaning.
- Verification: Verified by [Your Name] via GEMINI.md rules and validate_data.py.
2. In manuscripts: Cite the model in the methods or acknowledgements section.
- Example: “We used Google Gemini 2.5 Flash to assist with data cleaning scripts. Prompts and raw outputs are available in the supplementary material.”
Resources and standards
- COPE (Committee on Publication Ethics): Position statement on why AI cannot be an author.
- Elsevier AI Policy: Guidelines on declaring AI use in research.
- CRediT Taxonomy: Use “Software” or “Methodology” roles to describe your use of the tool.
Summary checklist
Navigating the AI landscape
Separating utility from marketing is a challenge. New tools appear daily, but many are more hype than substance.
Finding tools
Where do you hear about new AI tools? Social media, academic journals, or word of mouth? Note that while social media is fast, reputable channels are more reliable.
Reputable sources
Follow sources that focus on the practical and ethical aspects of AI in research:
- Simon Willison’s Weblog: Focuses on AI engineering and security risks.
- Ethan Mollick’s “One Useful Thing”: Analyzes how AI impacts cognitive labor.
- Hamel Husain’s Blog: Focuses on systematic evaluation of AI performance.
- Import AI (Jack Clark): Covers technical breakthroughs and policy.
- The Batch: Balanced coverage of AI in industry and science.
- Data Elixir: Highlights practical tools for data science.
- The Gradient: Detailed articles on AI research.
From vibes to evals
When AI scripts become critical for research, move to systematic evaluation. Create a “gold standard” dataset of known correct answers and test new iterations of AI-generated code against it.
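A gold-standard evaluation can start very small. In this sketch, `normalize_site_id` is a hypothetical stand-in for an AI-generated cleaning function, and the four gold cases are invented examples of inputs paired with answers a researcher has verified by hand.

```python
def normalize_site_id(raw: str) -> str:
    """Hypothetical stand-in for an AI-generated cleaning function."""
    return raw.strip().upper().replace(" ", "")


# Gold standard: (input, expected output) pairs verified by a human.
GOLD_CASES = [
    (" a1 ", "A1"),
    ("b 2", "B2"),
    ("C3", "C3"),
    ("d-4", "D4"),
]


def evaluate(func, cases):
    """Run the function on every gold case and summarize the results."""
    failures = []
    for raw, expected in cases:
        actual = func(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


report = evaluate(normalize_site_id, GOLD_CASES)
print(report)
```

The function handles three of the four cases but leaves the hyphen in `d-4`, so the report shows a pass rate of 0.75 and lists the failing case. Rerunning the same evaluation after each new iteration of the code shows whether a change fixed the gap or introduced a regression.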
Challenge: Your research protocol
Draft a plan for integrating these tools into your research workflow while maintaining rigor.
- Select a project: Which project would benefit most from an AI cleaning or validation pipeline?
- Choose your gates: Which approval gates (Test-first, Diff budget, Snapshot) will you use?
- Define requirements: What are 2-3 requirements for that project that the AI cannot change?
- Verification strategy: Will you use a second model, unit tests, or metamorphic checks?
A pre-defined protocol reduces decision fatigue during complex coding sessions. Deciding how to verify work early prevents approval fatigue later.
- Gemini, Claude, and Copilot serve different needs.
- Community support is vital.
- Plugins and MCP allow AI to connect to external data.