Sylvain's Blog

Most AI coding agents are optimized for generating code. But analytics work is different. A good data analyst agent needs context about your warehouse, semantic understanding of tables, reproducible workflows, and the ability to turn exploratory conversations into reusable analysis pipelines.

In this post, we'll walk through how to build an AI Data Analyst Agent in Claude Code using a structured CLAUDE.md instruction file, a semantic.yaml schema layer, and reusable skills that let you replay analyses after discovering insights through conversation.

The goal is not just "chatting with your database." It's creating an agent that behaves like a disciplined analytics engineer.

The Architecture of an Analytics Agent

A strong AI data analyst agent needs more than SQL generation. It needs memory, structure, and constraints.

At a high level, the stack looks like this:

Claude Code: The execution environment and agent runtime.
CLAUDE.md: Persistent operating instructions defining behavior, rules, and workflows. Claude Code loads this file into context at the start of each session.
semantic.yaml: A semantic abstraction layer describing tables, metrics, joins, and business meaning.
Skills: Reusable analysis procedures that standardize common workflows.
Data warehouse access: Snowflake, BigQuery, Postgres, DuckDB, or similar systems.

The key insight is this: analytics agents fail less when they operate from structured semantic context instead of raw schemas.

Designing the CLAUDE.md File

The CLAUDE.md file acts as the operating manual for the agent. Think of it as a persistent system prompt that defines how the analyst should behave.

A good analytics-focused CLAUDE.md usually contains:

Business context and KPI definitions
SQL safety rules
Preferred query patterns
Data quality expectations
Visualization conventions
Instructions for reproducibility

Example:

# Analytics Agent Instructions

You are a senior data analyst.

Always:

- Prefer semantic layer definitions over raw table names
- Explain assumptions before querying
- Validate joins before aggregation
- Avoid SELECT *
- Limit exploratory queries to 100 rows first
- Save reusable workflows as skills

Business Definitions:

- "Active User" = user with a completed session in the last 30 days
- Revenue excludes refunds and test transactions
- Use UTC timestamps unless specified otherwise

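Explicit rules like these make the agent's output more consistent from one session to the next. Without them, agents tend to generate fragile or misleading SQL: unbounded scans, joins that silently duplicate rows, and metrics computed under a different definition each time.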
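Written instructions are only advisory, so it can help to back the SQL rules with a mechanical check before a query reaches the warehouse. The following is a minimal sketch, assuming exploratory queries arrive as plain SQL strings; the function name `check_exploratory_sql` and the regex heuristics are illustrative, not part of Claude Code, and a real guard would use a proper SQL parser.

```python
import re

DEFAULT_ROW_LIMIT = 100  # mirrors the "100 rows first" rule in the instructions


def check_exploratory_sql(sql: str, row_limit: int = DEFAULT_ROW_LIMIT) -> list[str]:
    """Return a list of rule violations for an exploratory query (empty if none)."""
    problems = []
    normalized = " ".join(sql.split())  # collapse newlines and repeated spaces

    # Rule: avoid SELECT * (also catches qualified forms such as "SELECT o.*").
    # Heuristic only: a star later in the column list is not detected.
    if re.search(r"\bselect\s+(\w+\.)?\*", normalized, flags=re.IGNORECASE):
        problems.append("uses SELECT *")

    # Rule: the query must end with an explicit LIMIT no larger than the cap.
    match = re.search(r"\blimit\s+(\d+)\s*;?\s*$", normalized, flags=re.IGNORECASE)
    if match is None:
        problems.append("missing LIMIT clause")
    elif int(match.group(1)) > row_limit:
        problems.append(f"LIMIT {match.group(1)} exceeds {row_limit}")

    return problems


if __name__ == "__main__":
    print(check_exploratory_sql("SELECT * FROM orders"))
    print(check_exploratory_sql("SELECT order_id, country FROM orders LIMIT 100"))
```

A check like this could run wherever the agent's queries are executed, so that a violation is returned to the agent as feedback instead of being sent to the warehouse.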

Using semantic.yaml as a Semantic Layer

One of the biggest problems with AI-generated analytics is schema ambiguity. Column names rarely explain business meaning clearly enough.

This is where semantic.yaml becomes critical.

Instead of exposing raw tables directly, define semantic entities, metrics, dimensions, and relationships.

Example:

tables:
  orders:
    description: Customer purchase transactions

    metrics:
      total_revenue:
        sql: SUM(order_amount)
        description: Gross revenue before refunds

      completed_orders:
        sql: COUNT(order_id)
        filters:
          status: completed

    dimensions:
      - customer_id
      - country
      - created_at

    joins:
      customers:
        type: many_to_one
        on: orders.customer_id = customers.id

This gives the agent semantic understanding instead of forcing it to infer meaning from raw SQL schemas.

Benefits include:

Safer SQL generation
Consistent KPI definitions
Fewer hallucinated joins
Better business alignment
Improved explainability

In practice, the semantic layer becomes the difference between "AI autocomplete" and a trustworthy analytics system.
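To make the idea concrete, here is a rough sketch of how a metric definition could be turned into SQL without the agent guessing at column names. It assumes the semantic.yaml example above has already been parsed into a Python dict (for instance with a YAML library); the `compile_metric` helper is hypothetical, and the filter handling covers only simple equality on trusted config values.

```python
# Python mirror of the semantic.yaml example; in practice this dict would
# come from parsing the file rather than being written by hand.
SEMANTIC_LAYER = {
    "tables": {
        "orders": {
            "metrics": {
                "total_revenue": {"sql": "SUM(order_amount)"},
                "completed_orders": {
                    "sql": "COUNT(order_id)",
                    "filters": {"status": "completed"},
                },
            },
            "dimensions": ["customer_id", "country", "created_at"],
        }
    }
}


def compile_metric(layer: dict, table: str, metric: str, group_by: list[str]) -> str:
    """Build a SELECT for one metric, refusing names the layer does not define."""
    spec = layer["tables"][table]
    if metric not in spec["metrics"]:
        raise KeyError(f"unknown metric: {metric}")
    unknown = [d for d in group_by if d not in spec["dimensions"]]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown}")

    definition = spec["metrics"][metric]
    select_cols = group_by + [f"{definition['sql']} AS {metric}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {table}"

    # Filters become equality predicates; values are quoted as string literals.
    filters = definition.get("filters", {})
    if filters:
        predicates = [f"{col} = '{val}'" for col, val in filters.items()]
        sql += " WHERE " + " AND ".join(predicates)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


if __name__ == "__main__":
    print(compile_metric(SEMANTIC_LAYER, "orders", "completed_orders", ["country"]))
```

The useful property is the failure mode: a request for a metric or dimension that is not in the layer raises an error instead of producing plausible-looking SQL.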

Creating Reusable Analysis Skills

One of the most underrated features in agentic workflows is reusable skills.

During exploratory analysis, you often discover a useful workflow through conversation:

Investigating churn spikes
Analyzing conversion funnels
Segmenting high-value customers
Debugging revenue anomalies

The problem is that conversational insights are ephemeral. Once the chat is over, reproducing the exact reasoning path can be difficult.

Skills solve this by converting successful workflows into reusable procedures.

Example skill:

name: analyze_conversion_drop

description: |
  Investigate funnel conversion declines by segment,
  traffic source, device type, and release window.

steps:
  - compare weekly conversion trends
  - identify statistically significant drops
  - segment by acquisition channel
  - correlate with deployment events
  - generate summary findings

Instead of rediscovering the workflow manually, the agent can replay the same analytical methodology consistently.
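A step such as "identify statistically significant drops" only replays consistently if the test behind it is pinned down. One reasonable choice, offered here as an assumption and not as the method the skill prescribes, is a one-sided two-proportion z-test comparing last week's conversion rate to this week's. The function name and the sample counts below are illustrative.

```python
import math


def conversion_drop_test(
    prev_conv: int, prev_n: int, curr_conv: int, curr_n: int
) -> tuple[float, float]:
    """One-sided two-proportion z-test for a drop in rate; returns (z, p_value)."""
    p_prev = prev_conv / prev_n
    p_curr = curr_conv / curr_n

    # Pooled rate under the null hypothesis that both periods share one rate.
    pooled = (prev_conv + curr_conv) / (prev_n + curr_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / prev_n + 1 / curr_n))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0% or all 100%")

    z = (p_prev - p_curr) / se
    # Upper-tail probability of the standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical weekly counts: 500 of 10,000 sessions converted, then 400 of 10,000.
    z, p = conversion_drop_test(500, 10_000, 400, 10_000)
    print(f"z={z:.2f}, p={p:.5f}")
```

Recording the test and its significance threshold inside the skill means two runs of the same analysis cannot disagree about what counts as a drop.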

Turning Conversations into Repeatable Analysis

This is where AI analyst agents become genuinely powerful.

Most analytical reasoning today is trapped inside Slack threads, notebooks, or one-off conversations. Valuable investigative logic disappears after the meeting ends.

A mature analytics agent should:

Capture useful workflows discovered during conversation
Convert them into reusable skills
Parameterize the inputs
Replay analyses on future datasets
Maintain methodological consistency

For example, imagine discovering an effective fraud detection workflow while chatting with the agent. Instead of losing that reasoning process, the agent can save it as:

skill: detect_payment_fraud
inputs:
  - start_date
  - end_date
  - region

workflow:
  - identify anomalous transaction velocity
  - compare against historical baselines
  - cluster suspicious accounts
  - score fraud likelihood

Over time, your analytics organization accumulates a library of reusable analytical intelligence instead of isolated dashboard queries.
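Parameterizing a skill mostly comes down to validating its inputs and recording exactly what was run. The sketch below models the fraud skill above as a small Python object; the `Skill` class and its `bind` method are hypothetical illustrations of the idea, not a Claude Code API, and the dates and region are made-up values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Skill:
    name: str
    inputs: tuple[str, ...]
    workflow: tuple[str, ...]

    def bind(self, **params: object) -> dict:
        """Validate parameters and return a run record that can be logged and replayed."""
        missing = [k for k in self.inputs if k not in params]
        extra = [k for k in params if k not in self.inputs]
        if missing or extra:
            raise ValueError(f"missing={missing} unexpected={extra}")
        return {
            "skill": self.name,
            # Stored as strings so the record serializes cleanly.
            "params": {k: str(params[k]) for k in self.inputs},
            "steps": list(self.workflow),
        }


detect_payment_fraud = Skill(
    name="detect_payment_fraud",
    inputs=("start_date", "end_date", "region"),
    workflow=(
        "identify anomalous transaction velocity",
        "compare against historical baselines",
        "cluster suspicious accounts",
        "score fraud likelihood",
    ),
)

if __name__ == "__main__":
    run = detect_payment_fraud.bind(
        start_date=date(2024, 1, 1), end_date=date(2024, 1, 31), region="EU"
    )
    print(run)
```

Replaying the analysis on a later period then means calling `bind` with new dates, while the steps stay fixed.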

Best Practices for Reliable AI Analytics

Building a useful analytics agent is less about model intelligence and more about operational discipline.

Here are the practices that matter most:

Always use a semantic layer: Raw schemas are not enough for reliable business analytics.
Encode business definitions centrally: Never let metrics drift between prompts.
Treat successful analyses as assets: Save reusable workflows as skills.
Require explainability: The agent should explain assumptions before querying.
Constrain SQL generation: Safe defaults such as explicit column lists and row limits reduce both incorrect and expensive queries.
Start exploratory queries small: Limit rows before running warehouse-scale scans.
Separate reasoning from execution: Semantic planning should happen before query generation.
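
One way to separate planning from execution is to have the agent emit a small structured plan that is reviewed against the semantic layer before any SQL is written. This is a sketch under stated assumptions: the `AnalysisPlan` shape, the `review_plan` function, and the hard-coded lookups (which mirror the semantic.yaml example) are all hypothetical.

```python
from dataclasses import dataclass, field

# Joins and metrics declared in the semantic layer, mirroring the example file.
DECLARED_JOINS = {("orders", "customers"): "orders.customer_id = customers.id"}
KNOWN_METRICS = {"total_revenue", "completed_orders"}


@dataclass
class AnalysisPlan:
    metric: str
    base_table: str
    join_tables: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)


def review_plan(plan: AnalysisPlan) -> list[str]:
    """Return objections to a plan; an empty list means SQL generation may proceed."""
    objections = []
    if plan.metric not in KNOWN_METRICS:
        objections.append(f"metric '{plan.metric}' is not defined")
    for table in plan.join_tables:
        if (plan.base_table, table) not in DECLARED_JOINS:
            objections.append(f"no declared join from {plan.base_table} to {table}")
    if not plan.assumptions:
        objections.append("no assumptions stated")
    return objections


if __name__ == "__main__":
    plan = AnalysisPlan(
        metric="total_revenue",
        base_table="orders",
        join_tables=["customers"],
        assumptions=["timestamps are UTC"],
    )
    print(review_plan(plan))
```

Because the plan is data, it can be shown to a human, stored next to the results, and rejected cheaply before a single query runs.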

The future of analytics agents is not simply "natural language to SQL." The real opportunity is building systems that preserve institutional analytical reasoning and make it reusable.