AI Agents: A Powerful Tool.
An Expensive One.
Let's Use It Right.

How to improve productivity and reduce costs through thoughtful AI agent usage

Based on measured production session data · 2026
Core thesis: AI agent effectiveness and AI agent cost are the same engineering problem.

We are entering usage-based pricing. The way we work with AI agents needs to evolve.

What GitHub Already Knows About Your Usage (Demo)

github.com/settings/copilot/usage-preview
GitHub Copilot · Usage-Based Pricing Preview
Your projected monthly cost: $3,200 / month (single developer)
Based on your real usage data · API-equivalent estimate

Open GitHub's Copilot usage-based pricing preview page live — real personal subscription data.

What That Number Means

$3,200

single developer · 1 month
GitHub pricing preview tool · real personal subscription · 2026

Scale          Monthly     Annual
1 developer    $3,200      $38,400
Team of 30     ~$96,000    ~$1,152,000
This is what ordinary engineering usage becomes under usage-based pricing — before any workflow optimization.
The optimized workflow result appears in "Three Scenarios, Same Task".
Spoiler: same task, 9–21× lower cost.

Based on GitHub's own pricing preview estimate. API-equivalent pricing; GitHub Copilot billing structure may differ. Order of magnitude, not a precision forecast.

The Mental Model Shift

Git

You learned commits, branches, merges.
You use it deliberately.

📊

Grafana

You query intentionally.
You scope dashboards.

🤖

AI Agents

Same requirement.
Understand the resource model. Use deliberately.

AI is a tool. Like Git. Like Grafana. Like your IDE.
To use it well, you need to understand how it consumes resources.

Transition: to use it well, you need to understand how pricing works.

Token Types and Costs

Token Type     Opus 4.6        Sonnet 4.6       Haiku 4.5
Input          $5 / MTok       $1.50 / MTok     $1 / MTok
Cache Read     $0.50 / MTok    $0.15 / MTok     $0.10 / MTok
Cache Write    $6.25 / MTok    $1.875 / MTok    $1.25 / MTok
Output         $25 / MTok      $7.50 / MTok     $5 / MTok
Cache reads are 10× cheaper than input tokens.
Your architecture either exploits this or ignores it.
Output tokens are 5× more expensive than input.
Verbose agent responses compound quickly.

Choosing the wrong model for a task is a direct cost multiplier. Anthropic API pricing, May 2026.
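
The price table can be turned into a small calculator. This is a minimal sketch, assuming the per-MTok prices listed above; the `PRICES` dictionary, the `session_cost` function, and the example token counts are illustrative and not part of any vendor SDK.

```python
# Prices in USD per million tokens, copied from the table above.
PRICES = {
    "opus-4.6":   {"input": 5.00, "cache_read": 0.50, "cache_write": 6.25,  "output": 25.00},
    "sonnet-4.6": {"input": 1.50, "cache_read": 0.15, "cache_write": 1.875, "output": 7.50},
    "haiku-4.5":  {"input": 1.00, "cache_read": 0.10, "cache_write": 1.25,  "output": 5.00},
}


def session_cost(model: str, tokens: dict[str, int]) -> float:
    """Return the USD cost of a token breakdown for one model."""
    prices = PRICES[model]
    return sum(prices[kind] * count / 1_000_000 for kind, count in tokens.items())


# Hypothetical workload: 1M tokens read by the model, 20K tokens written by it.
uncached = {"input": 1_000_000, "output": 20_000}
cached = {"cache_read": 1_000_000, "output": 20_000}

for model in PRICES:
    print(f"{model}: uncached ${session_cost(model, uncached):.2f}, "
          f"cached ${session_cost(model, cached):.2f}")
```

With these made-up volumes, the same work on Opus costs $5.50 uncached and $1.00 when the input is served from cache, which is the 10× cache-read discount applied to the dominant line item.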

The Agent Loop — Why Token Costs Compound

Diagram: User message → AGENT LOOP [LLM call (context grows) → tool call → result appended to context → repeat until done] → Output
  • Each tool call result is appended to context
  • Each LLM call pays for the full context at that point
  • One user message → 10, 20, 50+ LLM calls
  • The agent decides how many — not you
"You are billed for the sum of all context windows across all turns — not the final context size."

A session showing 1M input tokens doesn't mean 1M in context — it means a growing context was re-sent many times.
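
The billing rule quoted above can be sketched as a short simulation. It assumes, for simplicity, that every tool result adds the same number of tokens; the function name and all numbers are hypothetical and are not taken from the measured sessions.

```python
def billed_input_tokens(base_context: int, tokens_per_tool_result: int, llm_calls: int) -> int:
    """Sum the context size paid for on every LLM call in one agent loop."""
    total = 0
    context = base_context
    for _ in range(llm_calls):
        total += context                    # each call pays for the full context so far
        context += tokens_per_tool_result   # the tool result is appended for the next call
    return total


# Hypothetical loop: 14K starting context, 2K tokens per tool result, 30 LLM calls.
total = billed_input_tokens(14_000, 2_000, 30)
final_context = 14_000 + 2_000 * 29  # context size seen by the last call
print(f"final context: {final_context:,} tokens, billed input: {total:,} tokens")
```

In this example the context never exceeds 72,000 tokens, yet 1,290,000 input tokens are billed, because the growing context is re-sent on each of the 30 calls.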

What Goes Into Context

  • System prompt + AGENTS.md
  • Enabled skill definitions
  • MCP tool definitions
    (each MCP adds hundreds–thousands of tokens)
  • Tool call results (raw output from every tool)
  • Conversation history (all turns)
All of this is re-sent with every single LLM call.
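
A rough way to see what the fixed part of the context costs is to multiply it by the number of LLM calls. The sketch below uses invented token counts for each component (real values depend on your prompts, skills, and MCPs) and a hypothetical `fixed_overhead` helper.

```python
# Hypothetical token counts for the fixed parts of the context.
BASELINE = {
    "system_prompt_and_agents_md": 3_000,
    "skill_definitions": 2_500,
    "mcp_tool_definitions": 6_000,
}


def fixed_overhead(baseline: dict[str, int], llm_calls: int) -> int:
    """Tokens spent re-sending the fixed context across all LLM calls."""
    return sum(baseline.values()) * llm_calls


all_enabled = fixed_overhead(BASELINE, llm_calls=40)

# Same session with MCP tool definitions disabled for this task.
trimmed = {name: size for name, size in BASELINE.items() if name != "mcp_tool_definitions"}
without_mcp = fixed_overhead(trimmed, llm_calls=40)

print(f"all enabled: {all_enabled:,} tokens, MCPs disabled: {without_mcp:,} tokens")
```

Under these assumptions, disabling the MCP definitions removes 240,000 of the 460,000 fixed-overhead tokens over a 40-call session, before any conversation history or tool result is counted.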

Context Entropy

  • Not just quantity: quality matters
  • Relevant, precise context → better decisions, fewer iterations
  • Polluted context (tool noise, irrelevant files, verbose outputs) → degrades reasoning
  • The model must find the signal in the noise — hurts analysis quality, decision quality, output quality
Worst case: Prompt Injection
Model encounters adversarial content in a file it reads — alters behavior. Any data in context has some effect on behavior.
More subtle: a large, noisy context pushes teams toward the most capable (= most expensive) model just to cope.

Compaction: A Reasoning Event and a Cost Event

When context approaches the limit, the model summarizes and loses precision.

  • Model forgets verified facts → re-investigates → wastes tokens
  • Model loses prior decisions → conflicting assumptions → incorrect output
  • Long sessions with heavy compaction → architectural drift
Thesis (second of three):
"This is the clearest single illustration of quality and cost being the same engineering problem."
Context Pollution → Reasoning Degradation → Early Compaction → Re-investigation → More Pollution

The vicious cycle (a self-reinforcing feedback loop)

Best Practices: Context Hygiene (Demo)

Under Your Full Control

  • Adjust enabled skills to match current task
  • Adjust enabled MCPs — disable when not needed
  • Split large tasks into focused sessions (clean context each time)
  • Write precise first messages: exact file paths, line ranges, scope boundaries
  • Avoid @-including entire folders or large files
  • Prefer terse tools (rg, jq) over verbose equivalents
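
The same idea behind preferring terse tools can be applied to any tool output before it reaches the context. This is an illustrative sketch only: `condense` is a hypothetical helper, not part of the workflow described here, and the 20-line cap is an arbitrary assumption.

```python
def condense(output: str, max_lines: int = 20) -> str:
    """Keep the first max_lines lines and replace the rest with a one-line note."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    omitted = len(lines) - max_lines
    return "\n".join(lines[:max_lines] + [f"[{omitted} more lines omitted]"])


# Hypothetical verbose search result: 500 matching lines.
raw = "\n".join(f"src/File{i}.cs:10: match" for i in range(500))
short = condense(raw, max_lines=5)
print(short)
```

A cap like this trades completeness for a smaller context, so it suits exploratory searches better than outputs the agent must read in full.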

Mechanisms You Cannot Control

  • Model's internal memory writes
    → disable if possible; use explicit memory.md
  • System prompt injections during agent loop
  • What gets preserved or lost during compaction
General principle: move away from under-the-hood magic toward predictability and control.

Demo: show live how enabling/disabling a skill or MCP affects initial context token count.

Convenience Architecture (Demo)

What default actually looks like

Diagram: Single agent (Opus 4.6) with all skills, all MCPs, and all tools → one growing context → context overload → compaction
  • Default is intentionally optimized for accessibility and onboarding — the right tradeoff for getting started
  • In practice: single agent, full skill set, all MCPs, all tools, one growing context
  • First user message → AGENTS.md, LODA files, referenced files/folders read → all lands in single context
  • Context is already significantly loaded before useful work begins
Under-the-hood additions you don't see: initial agent system prompt (not user-configurable), injected guidance during agent loop.

Where Token Spend Escapes Your Control

Once you send a message: the agent loop runs. You observe. You cannot steer mid-flight effectively.
Best strategy: press Esc immediately if you see the agent doing something wrong. Stop the loop. Write a more precise prompt. Restart.
Uncontrolled Behavior     Your Mitigation
Number of tool calls      Limit tool visibility via MCP / primitive tool whitelisting
Agent loop iterations     Write tighter scoped prompts; cancel early
Internal memory writes    Disable; use explicit memory.md instead
Compaction timing         Reduce context size so compaction is rare

Why This Pushes You Toward Expensive Models

Large noisy context → small/cheap models struggle with quality → teams select the most capable model → cost multiplied
The real root cause is context entropy, not model capability.
Fix the context — and you can use cheaper models effectively.

Transition: what can we control right now, without changing the architecture?

The Observability Gap

What You See

  • Conversation turns
  • Final answers
  • Tool call names (sometimes)

What You Don't See

  • Token counts per turn
  • Full tool call chain
  • Model switches
  • Compaction events
  • Cost breakdown
  • Subagent activity, prompts, answers
You cannot improve what you cannot observe.
You cannot standardize workflows across a team if sessions are invisible.
This is not a nice-to-have. It is a prerequisite for treating AI agent usage as a repeatable, improvable engineering practice.

Bridging the Gap (Demo)

Custom session analysis scripts built on raw events.jsonl telemetry

What the Scripts Reveal

  • Per-model token breakdown
  • Subagent dispatch: which agent, which model, tokens, tool calls, full prompt + answer
  • Compaction events: when, how many tokens lost
  • Tool usage: which tools, call count, success rate, latency
  • Timeline: exact sequence of everything
$ dotnet script analyze-events.csx
══ Tool Usage Statistics ══
Bash 47 calls · ok:45 fail:2 · avg 340ms
Read 38 calls · ok:38 fail:0 · avg 12ms
Grep 21 calls · ok:21 fail:0 · avg 28ms
══ Errors & Warnings ══
WARN [turn 14] Compaction triggered — 91K → 22K tokens. 2 decisions lost.
WARN [turn 22] Tool result >12K tokens returned to orchestrator context
══ Token & Cost Table ══
opus-4.6 in:977K cache:977K out:14K
haiku-4.5 in:3.2M cache:3.2M out:14K
══ Event Timeline (grouped by agent) ══
[1brainstorm] t+0s LLM call opus-4.6 ctx:14K
[sub-explorer] t+4s tool:Glob haiku-4.5 14 results
[sub-explorer] t+6s tool:Read haiku-4.5 3.2K tokens
[1brainstorm] t+9s LLM call opus-4.6 ctx:28K ↑
… 47 more events …
══ Subagent Dispatches ══
[sub-researcher · haiku-4.5] 14 tool calls · 340K tokens
prompt: "Find all usages of IAuthService in the solution..."
response: "Found 7 usages across 4 files. Primary: AuthController:142..."
Every recommendation in this presentation is verifiable with these scripts on my own sessions.
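
To make the per-model token breakdown concrete, here is a minimal sketch of that aggregation over JSON Lines input. The event schema (`type`, `model`, `input_tokens`, `output_tokens`) is an assumption made for illustration; the real events.jsonl fields may differ, and this is not the analyze-events.csx script.

```python
import json
from collections import defaultdict

# Hypothetical events; field names are assumed, not taken from real telemetry.
SAMPLE_EVENTS = """\
{"type": "llm_call", "model": "opus-4.6", "input_tokens": 14000, "output_tokens": 300}
{"type": "tool_call", "tool": "Read", "ok": true}
{"type": "llm_call", "model": "haiku-4.5", "input_tokens": 3200, "output_tokens": 150}
{"type": "llm_call", "model": "opus-4.6", "input_tokens": 28000, "output_tokens": 500}
"""


def tokens_per_model(lines):
    """Aggregate input and output tokens per model from JSON Lines text."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") != "llm_call":
            continue  # only LLM calls carry token counts in this assumed schema
        totals[event["model"]]["input"] += event["input_tokens"]
        totals[event["model"]]["output"] += event["output_tokens"]
    return dict(totals)


totals = tokens_per_model(SAMPLE_EVENTS.splitlines())
for model, t in totals.items():
    print(f"{model} in:{t['input']:,} out:{t['output']:,}")
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the sample string once the field names are adjusted to the real schema.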

Five Changes You Can Make Tomorrow

1 Disable MCPs not relevant to your current task
2 Disable skills not relevant to your current task
3 Scope your first message: exact file, exact function, exact scope — not "look at my project"
4 One session = one task. Split anything bigger.
5 Cancel immediately if the agent goes the wrong direction — every wrong-direction tool call is already billed

The Prompt Quality Multiplier

A precise, well-scoped prompt is the single highest-ROI action.
It narrows the research surface → fewer tool calls → less context pollution → fewer iterations → lower cost + better output

Bad Prompt

"Help me fix the authentication module"
  • Triggers wide codebase exploration
  • Agent decides scope autonomously
  • Many tool calls, noisy context

Good Prompt

"The JWT token expiry check in
AuthService.cs:142 is returning false
for valid tokens when clock skew
exceeds 5 minutes. Fix only this function.
Token format: TokenClaims.cs:28"
  • Exact file + line references
  • Explicit scope boundary
  • Minimal tool calls needed
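
The structure of the good prompt can be captured in a small template. This is a hedged sketch: `scoped_prompt` and its parameter names are hypothetical, and the example values come from the good prompt above.

```python
def scoped_prompt(problem: str, location: str, scope: str, references: list[str]) -> str:
    """Assemble a first message with an exact location and an explicit scope boundary."""
    parts = [f"{problem} Location: {location}.", f"Scope: {scope}."]
    if references:
        parts.append("References: " + ", ".join(references))
    return "\n".join(parts)


prompt = scoped_prompt(
    problem="The JWT token expiry check returns false for valid tokens "
            "when clock skew exceeds 5 minutes.",
    location="AuthService.cs:142",
    scope="fix only this function",
    references=["TokenClaims.cs:28"],
)
print(prompt)
```

Making location and scope required parameters means a prompt cannot be built without stating them, which is the point of the comparison above.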

The Army General Analogy

  • 🎖️ General (Orchestrator): reasons, decides, never touches raw data
  • 🔍 Scouts: explore, gather intel
  • 📦 Suppliers: fetch, retrieve data
  • 🏥 Medics: debug, fix, verify
  • ⚙️ Other: domain-specific tasks
Raw data, tool results, and noise stay at the lower level and never cross up.
The general's judgment is valuable precisely because it isn't spent crawling through bushes. Protect the orchestrator's attention budget the same way.

The Topology (Demo)

Tier 1, Reasoning (Opus): 1brainstorm · opus-4.6
Tier 2, Execution (Sonnet): 2plan, 3build
Context quarantine: dirty context does not cross this line
Tier 3, Workers / Context Quarantine Zone (haiku-4.5): sub-explorer, sub-researcher, sub-reviewer, sub-debugger
All tool calls, MCP queries, and raw file reads happen in Tier 3 → workers return condensed findings only
Haiku is ~5× cheaper per token than Opus
The orchestrator is smarter not because the model changed — because its attention budget is protected.

Why Tiered Agents Reduce Cost and Improve Reasoning

The most important architectural value of subagents in this workflow is not parallelism.
It is context isolation.
Context isolation → Clean orchestrator context → Protected attention budget → Better reasoning → Fewer iterations → Lower cost
This is not the same quality at lower cost.
It is better quality, because the orchestrator's attention is not diluted by raw noise.

Two Supporting Rules

  • Expensive models work only with clean, distilled context
  • Dirty work — tool calls, raw data, external lookups — goes to cheap fast models
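
The two rules amount to a routing table from task kind to model tier. The sketch below is one possible encoding, not the workflow's actual configuration: the task kinds, the `route` function, the default-to-worker choice, and the use of `sonnet-4.6` for the execution tier are assumptions.

```python
# Model per tier (assumed identifiers).
MODEL_FOR_TIER = {
    "reasoning": "opus-4.6",
    "execution": "sonnet-4.6",
    "worker": "haiku-4.5",
}

# Hypothetical task kinds; dirty work (raw reads, searches, lookups) goes to workers.
TIER_FOR_TASK = {
    "brainstorm": "reasoning",
    "plan": "execution",
    "build": "execution",
    "explore": "worker",
    "research": "worker",
    "review": "worker",
    "debug": "worker",
}


def route(task_kind: str) -> str:
    """Pick a model for a task kind; unknown kinds default to the cheapest tier."""
    return MODEL_FOR_TIER[TIER_FOR_TASK.get(task_kind, "worker")]


for kind in ("brainstorm", "build", "explore"):
    print(f"{kind} -> {route(kind)}")
```

Defaulting unknown task kinds to the worker tier is a cost-first choice; a team that prefers safety over cost could default to the reasoning tier instead.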

Persistent Working Memory

memory.md Structure

  • Confirmed Facts
  • Active Assumptions
  • Open Questions
  • Decisions Made
  • Session Reasoning Log
Survives compaction — it's a file, not context. You can read it, edit it, share it, use it as a handoff document.

vs. Built-in Memory

  • You can't see it
  • You can't edit it
  • You don't control what gets written
Stable, reusable context → prompt cache activates automatically every turn.
Cache savings are a byproduct of good architecture, not a separate optimization.
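
A memory.md file with the five sections listed above could be rendered like this. The Markdown heading and bullet layout is an assumption, as are the `render_memory` helper and the sample entries; only the section names come from the structure above.

```python
SECTIONS = [
    "Confirmed Facts",
    "Active Assumptions",
    "Open Questions",
    "Decisions Made",
    "Session Reasoning Log",
]


def render_memory(entries: dict[str, list[str]]) -> str:
    """Render memory.md text with one heading per section and one bullet per entry."""
    blocks = []
    for section in SECTIONS:
        bullets = "\n".join(f"- {item}" for item in entries.get(section, []))
        blocks.append(f"## {section}\n{bullets}".rstrip())
    return "\n\n".join(blocks) + "\n"


# Hypothetical entries for illustration.
text = render_memory({
    "Confirmed Facts": ["Expiry check lives in AuthService.cs:142"],
    "Open Questions": ["Is clock skew tolerance configurable?"],
})
print(text)
```

Writing `text` to disk (for example with `pathlib.Path("memory.md").write_text(text)`) gives the agent and the team the same editable handoff document, with empty sections kept as visible placeholders.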

Convenience Architecture → Production Architecture

Every item on the right addresses a specific cost and reasoning quality failure covered in this presentation.

Convenience Architecture     Production Architecture
Single agent                 Tiered orchestration
All skills visible           Task-scoped skills
All MCPs active              Task-scoped MCPs
All tools visible            Scoped visible tools
Implicit context growth      Explicit working memory (memory.md)
Optimized for: onboarding    Optimized for: scale, cost, predictability

Three Scenarios, Same Task

                A: Tiered workflow                         B: All Opus, tiered                          C: Default single agent
Architecture    Orchestrator + 4 subagents                 Same topology, same token volumes            Single agent, all tools/skills/MCPs
Models          Opus 4.6 + Haiku 4.5                       Opus 4.6 only                                Opus 4.6
Evidence        Measured telemetry, exact API repricing    Exact repricing of measured token volumes    Architectural behavior estimate*
Est. Cost       ~$11                                       ~$27.50                                      ~$100–230*
A vs. B: 2.5×. The model routing dividend. Exact & reproducible.
B vs. C: 4–8×. The isolation dividend. Architectural estimate.
A vs. C: 9–21×. Combined effect: model routing + isolation.
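Cache savings (Scenario A, measured), where the saving per token is the input price minus the cache-read price:
  • Opus cache reads: 977,749 × $4.50/M = $4.40 saved
  • Haiku cache reads: 3,176,855 × $0.90/M = $2.86 saved
  • Total: ~$7.26 (~39% of what the session would have cost without caching)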

*Scenario C range is wide by design. Default single-agent architecture does not produce stable or predictable token growth. Anthropic API pricing, May 2026.
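
The ratios and cache savings above can be re-derived in a few lines. This sketch uses only the rounded figures shown in the table and the per-MTok prices from the pricing table; the variable names are illustrative.

```python
COST_A = 11.00                            # tiered workflow (measured, rounded)
COST_B = 27.50                            # all Opus, same topology (repriced)
COST_C_LOW, COST_C_HIGH = 100.0, 230.0    # default single agent (estimated range)

print(f"A vs. B: {COST_B / COST_A:.1f}x")
print(f"B vs. C: {COST_C_LOW / COST_B:.1f}x to {COST_C_HIGH / COST_B:.1f}x")
print(f"A vs. C: {COST_C_LOW / COST_A:.1f}x to {COST_C_HIGH / COST_A:.1f}x")

# Cache saving = cache-read tokens x (input price - cache-read price), per million tokens.
opus_saved = 977_749 * (5.00 - 0.50) / 1_000_000
haiku_saved = 3_176_855 * (1.00 - 0.10) / 1_000_000
total_saved = opus_saved + haiku_saved
print(f"saved: ${opus_saved:.2f} + ${haiku_saved:.2f} = ${total_saved:.2f}")
```

The computed ratios (2.5×, about 3.6× to 8.4×, and about 9.1× to 20.9×) round to the 2.5×, 4–8×, and 9–21× figures quoted above.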

A Tested Workflow. Ready to Extend.

This is a tested, production-validated starting point — not the final answer.

🔧 Change → 🔬 Test & Observe (analyze-events.csx) → 📋 Analyze Gaps → Apply & Repeat
It is open for contribution and improvement from the team.

Future Directions

More Specialized Agents

  • Bug triage workflow: exception in logs → root cause → team attribution → work item in ADO — one command, one output
  • Domain-specific orchestration patterns

Better Tooling

  • More LLM-friendly tools with terse output
  • export-events.csx, rg, jq
  • General-purpose tools that don't produce noise

Better Prompts

  • Refined orchestrator and subagent prompts
  • User-facing prompt templates for common task types

Team Shared Configuration

  • Shared agent definitions
  • Shared skills and MCP configurations
  • Don't let everyone reinvent this independently

Questions

Key takeaways

1 Noisy context degrades reasoning and increases cost. They are the same engineering problem — and this is measurable. (This is not a best-practice recommendation. It is a measured outcome from real sessions.)
2 Context hygiene and model routing are available today. No infrastructure change required — start with disabling unused MCPs and skills, and scoping your first message precisely.
3 A tiered workflow with context isolation delivers better reasoning quality at 2–5× lower cost, based on real measured sessions. The difference between convenience architecture and production architecture is not tooling — it is intentionality.
Call to action: Try the workflow on one real task. Run the analysis scripts. Compare.