Section 1 — Hook & Framing

AI Agents: A Powerful Tool.
An Expensive One.
Let's Use It Right.

How to improve productivity and reduce costs through thoughtful AI agent usage

Based on measured production session data · 2026

Core thesis: AI agent effectiveness and AI agent cost are the same engineering problem.

We are entering usage-based pricing. The way we work with AI agents needs to evolve.

Section 2 — New Reality: Usage-Based Pricing

What GitHub Already Knows About Your Usage (Demo)

Open GitHub's Copilot usage-based pricing preview page live — real personal subscription data.

What That Number Means

$3,200

single developer · 1 month
GitHub pricing preview tool · real personal subscription · 2026

Scale	Monthly	Annual
1 developer	$3,200	$38,400
Team of 30	~$96,000	~$1,152,000

This is what ordinary engineering usage becomes under usage-based pricing — before any workflow optimization.

The optimized workflow result appears in Section 9.
Spoiler: same task, 9–21× lower cost.

Based on GitHub's own pricing preview estimate. API-equivalent pricing; GitHub Copilot billing structure may differ. Order of magnitude, not a precision forecast.

The Mental Model Shift

⎇

Git

You learned commits, branches, merges.
You use it deliberately.

📊

Grafana

You query intentionally.
You scope dashboards.

🤖

AI Agents

Same requirement.
Understand the resource model. Use deliberately.

AI is a tool. Like Git. Like Grafana. Like your IDE.
To use it well, you need to understand how it consumes resources.

Transition: to use it well, you need to understand how pricing works.

Section 3 — How Token Pricing Works

Token Types and Costs

Token Type	Opus 4.6	Sonnet 4.6	Haiku 4.5
Input	$5 / MTok	$1.50 / MTok	$1 / MTok
Cache Read	$0.50 / MTok	$0.15 / MTok	$0.10 / MTok
Cache Write	$6.25 / MTok	$1.875 / MTok	$1.25 / MTok
Output	$25 / MTok	$7.50 / MTok	$5 / MTok

Cache reads are 10× cheaper than input tokens.
Your architecture either exploits this or ignores it.

Output tokens are 5× more expensive than input.
Verbose agent responses compound quickly.

Choosing the wrong model for a task is a direct cost multiplier.
Source: Anthropic API pricing, May 2026.
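
To make the table concrete, here is a minimal sketch of a per-call cost calculation. The prices are copied from the table above; the `PRICES` dictionary, the `call_cost` helper, and the token counts in the example are illustrative assumptions, not part of any SDK.

```python
# Prices in USD per million tokens (MTok), copied from the table above.
PRICES = {
    "opus-4.6":   {"input": 5.00, "cache_read": 0.50, "cache_write": 6.25,  "output": 25.00},
    "sonnet-4.6": {"input": 1.50, "cache_read": 0.15, "cache_write": 1.875, "output": 7.50},
    "haiku-4.5":  {"input": 1.00, "cache_read": 0.10, "cache_write": 1.25,  "output": 5.00},
}

def call_cost(model, input_tokens=0, cache_read=0, cache_write=0, output_tokens=0):
    """Return the USD cost of one LLM call (hypothetical helper)."""
    p = PRICES[model]
    total = (input_tokens * p["input"]
             + cache_read * p["cache_read"]
             + cache_write * p["cache_write"]
             + output_tokens * p["output"])
    return total / 1_000_000

# Assumed call: 100K tokens of context and 2K tokens of output.
uncached = call_cost("opus-4.6", input_tokens=100_000, output_tokens=2_000)
cached = call_cost("opus-4.6", cache_read=100_000, output_tokens=2_000)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

With these assumed numbers the same call costs $0.55 uncached and $0.10 when the context is served from cache, which is where the 10× cache-read discount shows up in practice.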

The Agent Loop — Why Token Costs Compound

Each tool call result is appended to context
Each LLM call pays for the full context at that point
One user message → 10, 20, 50+ LLM calls
The agent decides how many — not you

"You are billed for the sum of all context windows across all turns — not the final context size."

A session showing 1M input tokens doesn't mean 1M in context — it means a growing context was re-sent many times.
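
The compounding effect can be sketched with a short simulation. The function name `billed_input_tokens` and all token counts are assumptions chosen for illustration; real sessions grow unevenly.

```python
def billed_input_tokens(base_context, tokens_added_per_call, num_calls):
    """Sum the context size over every LLM call in one agent loop.

    Each call re-sends everything accumulated so far, so the billed
    input is the sum of all context sizes, not the final size.
    """
    total = 0
    context = base_context
    for _ in range(num_calls):
        total += context                  # this call pays for the full context
        context += tokens_added_per_call  # tool result and reply are appended
    return total, context

# Assumed numbers: 14K starting context, 3K added per call, 20 calls.
total, final_context = billed_input_tokens(14_000, 3_000, 20)
print(f"final context: {final_context:,} tokens")
print(f"billed input:  {total:,} tokens")
```

Under these assumptions the context ends at 74,000 tokens, yet the session is billed for 850,000 input tokens, more than eleven times the final context size.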

Section 4 — Context Window: Your Most Expensive Resource

What Goes Into Context

■ System prompt + AGENTS.md
■ Enabled skill definitions
■ MCP tool definitions (each MCP adds hundreds to thousands of tokens)
■ Tool call results (raw output from every tool)
■ Conversation history (all turns)

All of this is re-sent with every single LLM call.
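
A rough sketch of what the fixed part of the context costs over a session, and what trimming it saves. The component sizes and the number of calls are assumed values, and `resend_overhead` is a hypothetical helper.

```python
# Assumed token counts for the fixed part of the context (illustrative only).
fixed_context = {
    "system prompt + AGENTS.md": 6_000,
    "skill definitions": 4_000,
    "MCP tool definitions": 9_000,
}

def resend_overhead(components, num_calls):
    """Tokens spent re-sending the fixed components over a whole session."""
    return sum(components.values()) * num_calls

calls = 40  # assumed number of LLM calls in one session
before = resend_overhead(fixed_context, calls)

# Disabling MCPs the task does not need shrinks their share of the context.
trimmed = {**fixed_context, "MCP tool definitions": 2_000}
after = resend_overhead(trimmed, calls)

print(f"before: {before:,}  after: {after:,}  saved: {before - after:,} tokens")
```

With prompt caching, most of these tokens are billed at the cache-read rate, but they still occupy the context window on every call.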

Context Entropy

Not just quantity — quality matters
Relevant, precise context → better decisions, fewer iterations
Polluted context (tool noise, irrelevant files, verbose outputs) → degrades reasoning
The model must find the signal in the noise — hurts analysis quality, decision quality, output quality

Worst case: Prompt Injection
The model encounters adversarial content in a file it reads, and that content alters its behavior. Any data in context has some effect on behavior.

More subtle: large, noisy context → teams favor the most capable (= most expensive) model just to cope.

Compaction: A Reasoning Event and a Cost Event

When context approaches the limit, the model summarizes and loses precision.

Model forgets verified facts → re-investigates → wastes tokens
Model loses prior decisions → conflicting assumptions → incorrect output
Long sessions with heavy compaction → architectural drift

Thesis (second of three):
"This is the clearest single illustration of quality and cost being the same engineering problem."

The vicious cycle: compaction drops facts, re-investigation adds tokens, and the context reaches its limit again sooner.

Best Practices: Context Hygiene (Demo)

Under Your Full Control

Adjust enabled skills to match current task
Adjust enabled MCPs — disable when not needed
Split large tasks into focused sessions (clean context each time)
Write precise first messages: exact file paths, line ranges, scope boundaries
Avoid @-including entire folders or large files
Prefer terse tools (rg, jq) over verbose equivalents
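
The "terse tools" idea can be sketched as a filter that runs before any output reaches the agent's context. This is a simplified stand-in for what tools like rg do, not a replacement for them; the helper name `terse_output` and the sample log are assumptions.

```python
def terse_output(raw, keyword, max_lines=20):
    """Keep only lines containing `keyword`, capped at `max_lines`."""
    matches = [line for line in raw.splitlines() if keyword in line]
    kept = matches[:max_lines]
    omitted = len(matches) - len(kept)
    if omitted > 0:
        kept.append(f"... {omitted} more matching lines omitted")
    return "\n".join(kept)

# Illustrative raw output: 500 log lines, 3 of which matter.
raw_log = "\n".join(
    f"line {i}: ERROR token expired" if i % 200 == 0 else f"line {i}: ok"
    for i in range(500)
)
summary = terse_output(raw_log, "ERROR")
print(summary)
print(f"{len(raw_log)} chars reduced to {len(summary)} chars")
```

Only the three matching lines would enter the context; the other 497 lines never cost a token.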

Mechanisms You Cannot Control

Model's internal memory writes (disable if possible; use an explicit memory.md)
System prompt injections during agent loop
What gets preserved or lost during compaction

General principle: move away from under-the-hood magic toward predictability and control.

Demo: show live how enabling/disabling a skill or MCP affects initial context token count.

Section 5 — Default Copilot Workflow

Convenience Architecture (Demo)

What default actually looks like

Default is intentionally optimized for accessibility and onboarding — the right tradeoff for getting started
In practice: single agent, full skill set, all MCPs, all tools, one growing context
First user message → AGENTS.md, LODA files, referenced files/folders read → all lands in single context
Context is already significantly loaded before useful work begins

Under-the-hood additions you don't see: initial agent system prompt (not user-configurable), injected guidance during agent loop.

Where Token Spend Escapes Your Control

Once you send a message: the agent loop runs. You observe. You cannot steer mid-flight effectively.
Best strategy: press Esc immediately if you see the agent doing something wrong. Stop the loop. Write a more precise prompt. Restart.

Uncontrolled Behavior	Your Mitigation
Number of tool calls	Limit tool visibility via MCP / primitive tool whitelisting
Agent loop iterations	Write tighter scoped prompts; cancel early
Internal memory writes	Disable; use explicit memory.md instead
Compaction timing	Reduce context size so compaction is rare

Why This Pushes You Toward Expensive Models

Large noisy context → Small/cheap models struggle with quality → Teams select most capable model → Cost multiplied

The real root cause is context entropy, not model capability.
Fix the context — and you can use cheaper models effectively.

Transition: what can we control right now, without changing the architecture?

Section 6 — Observability

The Observability Gap

What You See

Conversation turns
Final answers
Tool call names (sometimes)

What You Don't See

Token counts per turn
Full tool call chain
Model switches
Compaction events
Cost breakdown
Subagent activity, prompts, answers

You cannot improve what you cannot observe.
You cannot standardize workflows across a team if sessions are invisible.
This is not a nice-to-have. It is a prerequisite for treating AI agent usage as a repeatable, improvable engineering practice.

Bridging the Gap (Demo)

Custom session analysis scripts built on raw events.jsonl telemetry

What the Scripts Reveal

Per-model token breakdown
Subagent dispatch: which agent, which model, tokens, tool calls, full prompt + answer
Compaction events: when, how many tokens lost
Tool usage: which tools, call count, success rate, latency
Timeline: exact sequence of everything

$ dotnet script analyze-events.csx
══ Tool Usage Statistics ══
Bash        47 calls  · ok:45  fail:2  · avg 340ms
Read        38 calls  · ok:38  fail:0  · avg 12ms
Grep        21 calls  · ok:21  fail:0  · avg 28ms
══ Errors & Warnings ══
WARN  [turn 14] Compaction triggered — 91K → 22K tokens. 2 decisions lost.
WARN  [turn 22] Tool result >12K tokens returned to orchestrator context
══ Token & Cost Table ══
opus-4.6   in:977K  cache:977K  out:14K
haiku-4.5  in:3.2M  cache:3.2M  out:14K
══ Event Timeline (grouped by agent) ══
[1brainstorm] t+0s   LLM call  opus-4.6    ctx:14K
[sub-explorer] t+4s  tool:Glob  haiku-4.5  14 results
[sub-explorer] t+6s  tool:Read  haiku-4.5  3.2K tokens
[1brainstorm] t+9s   LLM call  opus-4.6   ctx:28K ↑
  … 47 more events …
══ Subagent Dispatches ══
[sub-researcher · haiku-4.5]  14 tool calls · 340K tokens
prompt: "Find all usages of IAuthService in the solution..."
response: "Found 7 usages across 4 files. Primary: AuthController:142..."

Every recommendation in this presentation is verifiable with these scripts on my own sessions.
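
For readers who want to try this kind of analysis, here is a minimal sketch of aggregating a JSONL event stream. It is not the analyze-events.csx script shown above. The event schema (`type`, `model`, `input_tokens`, `output_tokens`, `tool`, `ok`) is entirely hypothetical, and real events.jsonl field names may differ.

```python
import json
from collections import Counter, defaultdict

# Hypothetical event records; real events.jsonl field names may differ.
sample_lines = [
    '{"type": "llm_call", "model": "opus-4.6", "input_tokens": 14000, "output_tokens": 300}',
    '{"type": "tool_call", "tool": "Glob", "ok": true}',
    '{"type": "tool_call", "tool": "Read", "ok": true}',
    '{"type": "llm_call", "model": "opus-4.6", "input_tokens": 28000, "output_tokens": 450}',
    '{"type": "tool_call", "tool": "Bash", "ok": false}',
]

def summarize(lines):
    """Aggregate tokens per model, and call and failure counts per tool."""
    tokens = defaultdict(lambda: {"input": 0, "output": 0})
    tools = Counter()
    failures = Counter()
    for line in lines:
        event = json.loads(line)
        if event["type"] == "llm_call":
            tokens[event["model"]]["input"] += event["input_tokens"]
            tokens[event["model"]]["output"] += event["output_tokens"]
        elif event["type"] == "tool_call":
            tools[event["tool"]] += 1
            if not event["ok"]:
                failures[event["tool"]] += 1
    return dict(tokens), tools, failures

tokens, tools, failures = summarize(sample_lines)
for model, t in tokens.items():
    print(f"{model}  in:{t['input']:,}  out:{t['output']:,}")
for tool, count in tools.items():
    print(f"{tool}  {count} calls  fail:{failures[tool]}")
```

To run it against a real file, replace `sample_lines` with the lines of the file and adjust the field names to match the actual telemetry.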

Section 7 — Five Changes You Can Make Tomorrow

Five Changes You Can Make Tomorrow

1 Disable MCPs not relevant to your current task

2 Disable skills not relevant to your current task

3 Scope your first message: exact file, exact function, exact scope — not "look at my project"

4 One session = one task. Split anything bigger.

5 Cancel immediately if the agent goes the wrong direction — every wrong-direction tool call is already billed

The Prompt Quality Multiplier

A precise, well-scoped prompt is the single highest-ROI action.
It narrows the research surface → fewer tool calls → less context pollution → fewer iterations → lower cost + better output

Bad Prompt

"Help me fix the authentication module"

Triggers wide codebase exploration
Agent decides scope autonomously
Many tool calls, noisy context

Good Prompt

"The JWT token expiry check in AuthService.cs:142 is returning false for valid tokens when clock skew exceeds 5 minutes. Fix only this function. Token format: TokenClaims.cs:28"

Exact file + line references
Explicit scope boundary
Minimal tool calls needed

Section 8 — Better Architecture: The Tiered Agent Model

The Army General Analogy

The general's judgment is valuable precisely because it isn't spent crawling through bushes. Protect the orchestrator's attention budget the same way.

The Topology (Demo)

The orchestrator is smarter not because the model changed, but because its attention budget is protected.

Why Tiered Agents Reduce Cost and Improve Reasoning

The most important architectural value of subagents in this workflow is not parallelism.
It is context isolation.

Context isolation → Clean orchestrator context → Protected attention budget → Better reasoning → Fewer iterations → Lower cost

This is not the same quality at lower cost.
It is better quality, because the orchestrator's attention is not diluted by raw noise.

Two Supporting Rules

Expensive models work only with clean, distilled context
Dirty work — tool calls, raw data, external lookups — goes to cheap fast models
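
The two rules can be sketched as a routing function plus a dispatch step that returns only a summary to the orchestrator. The task kinds, the `Task` class, and the `pick_model` and `dispatch` helpers are hypothetical; a real subagent would call a model rather than receive a ready-made summary.

```python
from dataclasses import dataclass

# Hypothetical task kinds that count as "dirty work" under the rules above.
DIRTY_WORK = {"tool_call", "raw_data", "external_lookup"}

@dataclass
class Task:
    kind: str         # e.g. "external_lookup" or "planning"
    description: str

def pick_model(task):
    """Route noisy work to the cheap model and reasoning to the expensive one."""
    return "haiku-4.5" if task.kind in DIRTY_WORK else "opus-4.6"

orchestrator_context = []  # only distilled text is ever appended here

def dispatch(task, raw_output, summary):
    """Simulate a subagent run: raw_output stays in the subagent's context."""
    model = pick_model(task)
    orchestrator_context.append(summary)  # the orchestrator sees the summary only
    return model, len(raw_output)

model, raw_size = dispatch(
    Task("external_lookup", "Find all usages of IAuthService"),
    raw_output="x" * 50_000,  # stand-in for pages of raw search results
    summary="Found 7 usages across 4 files.",
)
print(model, raw_size, orchestrator_context)
```

The design point is that the 50,000 characters of raw output are paid for once, at the cheap model's rate, while the orchestrator's context grows by a single sentence.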

Persistent Working Memory

memory.md Structure

Confirmed Facts
Active Assumptions
Open Questions
Decisions Made
Session Reasoning Log

Survives compaction — it's a file, not context. You can read it, edit it, share it, use it as a handoff document.
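
A minimal sketch of how such a file could be generated. The five section names come from the structure above; the Markdown layout (one `##` heading per section, one bullet per entry), the helper names, and the sample entries are assumptions.

```python
SECTIONS = [
    "Confirmed Facts",
    "Active Assumptions",
    "Open Questions",
    "Decisions Made",
    "Session Reasoning Log",
]

def new_memory():
    """Return an empty working memory with one list per section."""
    return {section: [] for section in SECTIONS}

def render(memory):
    """Render the memory as Markdown text suitable for memory.md."""
    parts = []
    for section in SECTIONS:
        parts.append(f"## {section}")
        parts.extend(f"- {entry}" for entry in memory[section])
        parts.append("")  # blank line between sections
    return "\n".join(parts)

memory = new_memory()
memory["Confirmed Facts"].append("Expiry check lives in AuthService.cs:142")
memory["Open Questions"].append("Is clock skew configurable?")
text = render(memory)
print(text)
# To persist: pathlib.Path("memory.md").write_text(text, encoding="utf-8")
```

Because the section order is fixed, the top of the file stays stable between turns and only the appended entries change.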

vs. Built-in Memory

You can't see it
You can't edit it
You don't control what gets written

Stable, reusable context → prompt cache activates automatically every turn.
Cache savings are a byproduct of good architecture, not a separate optimization.

Convenience Architecture → Production Architecture

Every item on the right addresses a specific cost and reasoning quality failure covered in this presentation.

Convenience Architecture	Production Architecture
Single agent	Tiered orchestration
All skills visible	Task-scoped skills
All MCPs active	Task-scoped MCPs
All tools visible	Scoped visible tools
Implicit context growth	Explicit working memory (memory.md)
Optimized for: onboarding	Optimized for: scale, cost, predictability

Section 9 — Cost Comparison: Real Data

Three Scenarios, Same Task

	A — Tiered workflow	B — All Opus, tiered	C — Default single agent
Architecture	Orchestrator + 4 subagents	Same topology, same token volumes	Single agent, all tools/skills/MCPs
Models	Opus 4.6 + Haiku 4.5	Opus 4.6 only	Opus 4.6
Evidence	Measured telemetry, exact API repricing	Exact repricing of measured token volumes	Architectural behavior estimate*
Est. Cost	~$11	~$27.50	~$100–230*

A vs. B: 2.5×
The model routing dividend.
Exact & reproducible.

B vs. C: 4–8×
The isolation dividend.
Architectural estimate.

A vs. C: 9–21×
Combined effect.
Model routing + isolation.

Cache savings (Scenario A, measured; savings per token = input price minus cache-read price): Opus cache reads: 977,749 × $4.50/M = $4.40 saved · Haiku cache reads: 3,176,855 × $0.90/M = $2.86 saved · Total: ~$7.26 (~39% of what the session would have cost without caching)

*Scenario C range is wide by design. The default single-agent architecture does not produce stable or predictable token growth.
Source: Anthropic API pricing, May 2026.
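
The ratios and cache savings above can be re-derived in a few lines. The dollar figures are the rounded estimates from the table, so the ratios are approximate; the variable names are illustrative.

```python
cost_a = 11.00            # Scenario A, tiered workflow (approximate)
cost_b = 27.50            # Scenario B, all Opus, tiered (approximate)
cost_c = (100.0, 230.0)   # Scenario C, estimated range

print(f"A vs B: {cost_b / cost_a:.1f}x")
print(f"B vs C: {cost_c[0] / cost_b:.1f}x to {cost_c[1] / cost_b:.1f}x")
print(f"A vs C: {cost_c[0] / cost_a:.1f}x to {cost_c[1] / cost_a:.1f}x")

# Cache savings: tokens read from cache × (input price - cache-read price).
opus_saved = 977_749 * (5.00 - 0.50) / 1_000_000
haiku_saved = 3_176_855 * (1.00 - 0.10) / 1_000_000
total_saved = opus_saved + haiku_saved
print(f"saved: ${opus_saved:.2f} + ${haiku_saved:.2f} = ${total_saved:.2f}")
```

This gives 2.5× for A vs. B, about 3.6× to 8.4× for B vs. C, about 9.1× to 20.9× for A vs. C, and $7.26 in cache savings, matching the rounded figures on the slide.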

Section 10 — What's Next

A Tested Workflow. Ready to Extend.

This is a tested, production-validated starting point — not the final answer.

🔧 Change → 🔬 Test & Observe (analyze-events.csx) → 📋 Analyze Gaps → ✅ Apply & Repeat

It is open for contribution and improvement from the team.

Future Directions

More Specialized Agents

Bug triage workflow: exception in logs → root cause → team attribution → work item in ADO — one command, one output
Domain-specific orchestration patterns

Better Tooling

More LLM-friendly tools with terse output
export-events.csx, rg, jq
General-purpose tools that don't produce noise

Better Prompts

Refined orchestrator and subagent prompts
User-facing prompt templates for common task types

Team Shared Configuration

Shared agent definitions
Shared skills and MCP configurations
Don't let everyone reinvent this independently

Section 11 — Questions

Questions

Key takeaways

1 Noisy context degrades reasoning and increases cost. They are the same engineering problem — and this is measurable. (This is not a best-practice recommendation. It is a measured outcome from real sessions.)

2 Context hygiene and model routing are available today. No infrastructure change required — start with disabling unused MCPs and skills, and scoping your first message precisely.

3 A tiered workflow with context isolation delivers better reasoning quality at lower cost: 2.5× from model routing alone (measured), and an estimated 9–21× compared with the default single-agent workflow (Section 9). The difference between convenience architecture and production architecture is not tooling. It is intentionality.

Call to action: Try the workflow on one real task. Run the analysis scripts. Compare.
