We spent fifty years building CLI tools. grep, curl, git, kubectl -- programs that do one thing well and compose through pipes. Then AI agents arrived and we decided to start from scratch.
That was a mistake. The industry is slowly figuring it out.
Last year I built a text-to-SQL agent that needed to talk to Postgres, fetch schema metadata, run queries, and format results. I wired up three different tool integrations, spent a weekend debugging schema mismatches between frameworks, and eventually ripped most of it out in favor of psql -c '...' --csv piped through jq. The agent got simpler and more reliable in the same commit.
That experience is what this post is about. The path from tools to MCP to CLIs, and why the real breakthrough turned out to be a fourth layer entirely -- one that has nothing to do with protocols or connectivity.
Tools: giving agents hands
OpenAI shipped function calling in 2023. Describe a function as JSON, hand it to GPT, let the model decide when to call it. Anthropic followed with tool use, LangChain wrapped everything in abstractions, and everyone else shipped their own variant.
It worked for demos. Then reality hit.
Here's the same "get weather" tool defined for two frameworks:

```python
# OpenAI
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": { ... }
    }
}]

# Anthropic -- no "function" wrapper
tools = [{
    "name": "get_weather",
    "input_schema": { ... }
}]
```
Same tool, different key names (parameters vs input_schema), different nesting depth, different response parsing, different error handling. One trivial function, two rewrites. Multiply that by thirty tools across three frameworks and you're maintaining ninety integration points that drift every time someone ships an API update.
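The duplication is mechanical, which is why most teams end up writing (or depending on) an adapter layer. A sketch of the idea -- the neutral spec dict here is my own shape, not any framework's format:

```python
def to_openai(spec):
    """Wrap a neutral tool spec in OpenAI's function-calling shape."""
    return {
        "type": "function",
        "function": {
            "name": spec["name"],
            "description": spec["description"],
            "parameters": spec["schema"],  # OpenAI's key name
        },
    }

def to_anthropic(spec):
    """Same spec, Anthropic's flatter tool-use shape."""
    return {
        "name": spec["name"],
        "description": spec["description"],
        "input_schema": spec["schema"],  # Anthropic's key name
    }

weather = {
    "name": "get_weather",
    "description": "Current weather for a city",
    "schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
```

One definition, N adapters: the drift now lives in one file instead of being smeared across thirty tool definitions.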
I ran into this firsthand building a multi-agent orchestration system. I had tools for database queries, API calls, and file operations. Every time I wanted to test the same agent logic against a different model provider, I was rewriting tool schemas. Not the logic -- just the wrappers. It felt like writing the same function three times in three slightly different dialects.
And the problems went deeper than formatting. Tools lived inside agent code -- you couldn't update one without redeploying the whole agent. Every definition ate context tokens whether the agent needed it or not. I once loaded forty tools into an agent for a demo and watched it confidently pick the wrong one three times in a row before I realized a quarter of the context window was gone before the user even said anything.
Auth was DIY. Security was an afterthought. Discovery was nonexistent -- everything hardcoded at build time.
MCP: standardizing the mess
Anthropic introduced the Model Context Protocol in November 2024 to fix exactly this. One protocol for everything. JSON-RPC 2.0, three primitives: tools, resources, and prompts. Build an MCP server once, use it from any AI host.
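On the wire, a tool invocation is plain JSON-RPC 2.0. A minimal sketch of building a tools/call request -- the method name and params shape follow the MCP spec, the tool name is illustrative:

```python
import json

def mcp_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request: a JSON-RPC 2.0 envelope
    with the tool name and arguments in params."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

msg = mcp_tool_call(1, "get_weather", {"city": "Berlin"})
print(json.dumps(msg, indent=2))
```

The envelope is the easy part; the lifecycle around it -- initialize handshake, capability negotiation, transport -- is where the long official samples come from.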
It caught on fast. Roughly 28% of Fortune 500 companies had MCP servers running by early 2025, OpenAI adopted the protocol in March 2025, and Anthropic donated it to the Linux Foundation's Agentic AI Foundation that December.
MCP solved the N×M problem. Your tools work regardless of which model you're running. That's genuinely valuable -- I can swap between Claude and GPT without touching my integrations.
But MCP introduced problems that weren't obvious until you tried to use it in production.
Setup is the first wall. The simplest MCP tool runs to 300+ lines in the official samples, requires a specific runtime (Python with uv, or Node with npm), and when something breaks you get "Claude was unable to connect" with zero context about why.
Then there's security. Anthropic's own MCP Inspector -- the developer tool for testing MCP servers -- allowed unauthenticated remote code execution. The attack was simple: create a malicious MCP server, wait for a developer to connect the Inspector, and run arbitrary commands on their machine. No auth check, no sandboxing, no confirmation dialog. The tool designed to help you safely evaluate MCP servers was itself the vulnerability. If Anthropic's reference tooling wasn't secure by default, what about the ecosystem? CVE-2025-6514 hit mcp-remote with command injection. Invariant Labs demonstrated how plain text in a public GitHub issue could hijack the official GitHub MCP server into leaking private repo data.
And the protocol made tool sprawl worse, not better. Tool definitions eat ~22% of context. Models pick wrong tools, hallucinate parameters. Cursor hard-capped it at 40 MCP tools. GitHub Copilot at 128. Meanwhile, discovery is fragmented across 16,000+ servers in scattered registries (MCP.so, PulseMCP, glama.ai) with no vetting, the protocol's statefulness breaks serverless deployment, and auth flows are multi-hop nightmares for enterprises.
To be fair -- MCP is the right choice in specific situations. If you need real-time streaming from a service (live stock prices, event feeds), a CLI that returns and exits can't do that. If your tool requires complex OAuth flows with token refresh and multi-tenant isolation, MCP's auth layer handles that better than shelling out to a CLI with a static token. And some services simply don't have a CLI -- internal company APIs, proprietary SaaS platforms, custom databases with no psql equivalent. For those, MCP is exactly what you want.
But MCP is infrastructure, not the whole answer. It standardized how agents connect to things -- knowing how to connect to Postgres doesn't mean knowing when to use a JOIN versus a subquery.
CLI commands: what was there all along
CLI tools are text in, text out. LLMs are text in, text out. We spent a surprisingly long time not connecting those dots.
Unix was designed for a user who reads text, reasons about it, and types text back. An LLM is exactly that user.
Here's what made it click for me. Say you want an agent to list open PRs on a repo. With MCP, you spin up the GitHub MCP server, configure auth, connect a client, and call the tool:
```python
# MCP approach: GitHub MCP server
# First: install server, configure oauth, connect client...
result = await mcp_client.call_tool(
    "list_pull_requests",
    {"repo": "anthropics/sdk-python", "state": "open"}
)
```
Or the agent just runs a command:
```shell
# CLI approach: one line, already installed
gh pr list --repo anthropics/sdk-python --state open --json number,title,author
```
Same result. The CLI version needs no server, no SDK, no config file. gh is already on the machine, already authenticated (you ran gh auth login once), and --json gives the agent structured output it can parse. The agent discovers flags by running gh pr list --help at runtime.
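If the agent lives inside a Python harness rather than a raw shell, wrapping that call is a few lines of subprocess plus JSON parsing. A sketch, assuming gh is installed and gh auth login has been run:

```python
import json
import subprocess

def parse_prs(raw):
    """Turn gh's --json output into (number, title) pairs."""
    return [(pr["number"], pr["title"]) for pr in json.loads(raw)]

def list_open_prs(repo):
    """Shell out to gh with an argument list (no shell involved)
    and let --json give us structured output to parse."""
    out = subprocess.run(
        ["gh", "pr", "list", "--repo", repo, "--state", "open",
         "--json", "number,title,author"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_prs(out)
```

No server process, no protocol handshake -- the integration is the parse function.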
Same story with kubectl -- an MCP server for Kubernetes exposed maybe fifteen operations. kubectl exposes hundreds, all composable with pipes: kubectl get pods -o json | jq '.items[] | select(.status.phase=="Failed") | .metadata.name'. Three tools chained, zero integration code, and an agent constructs it from --help alone.
Pipes give composability for free. OS-level sandboxing (Docker, Firecracker) provides security without new permission models.
Claude Code, GitHub Copilot CLI, Warp, Aider, Gemini CLI -- a growing ecosystem of agent-ready tools, all speaking shell commands.
I don't want to undersell the problems, because they're serious.
Shell injection is the big one. When an agent constructs commands from LLM output, the attack surface is wider than a traditional shell script's. A prompt injection that slips a ; rm -rf / into a generated command doesn't care that you have Docker isolation if the agent has write access to anything you care about. I've seen agents hallucinate package names in pip install commands -- one study found ~20% of LLM-suggested imports reference packages that don't exist, which is an open door for supply-chain attacks.
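The standard mitigation is to never hand the agent's string to a shell at all: split it into an argument vector and allowlist the binary. A sketch (the allowlist contents are illustrative):

```python
import shlex
import subprocess

ALLOWED = {"gh", "kubectl", "psql", "jq", "grep"}  # illustrative allowlist

def run_agent_command(command_line, timeout=30):
    """Run an agent-proposed command without invoking a shell.
    shlex.split plus an argument list means a smuggled `; rm -rf /`
    arrives as literal arguments, never as a second command."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"binary not allowlisted: {argv[:1]}")
    return subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout)
```

This doesn't replace sandboxing -- an allowlisted kubectl delete is still destructive -- but it closes the injection class.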
Output parsing is more fragile than it sounds. kubectl changed its default output format between minor versions, and agents that parsed the old column widths silently started returning wrong data. Not errors -- wrong data. That's worse.
Then there's the exit code problem. curl returns 0 if it successfully contacts a server that returns a 500 error. grep returns 1 for "no match" -- the same code most tools use for actual failures. psql exits 0 on a query that returns a warning but no rows. An agent that branches on exit codes without understanding these quirks will make confident, wrong decisions. And when something does fail, the error message is unstructured English ("ERROR: relation 'users' does not exist") that the agent has to parse with pattern matching -- one locale change and your regex is dead.
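curl at least has a flag-level fix (--fail turns HTTP errors into nonzero exits); grep's 0/1/2 convention has to be absorbed by the caller. A sketch of the kind of wrapper an agent harness ends up needing:

```python
import subprocess

def grep_count(pattern, path):
    """grep exit codes: 0 = matched, 1 = no match, >= 2 = real error.
    Treating 1 as a failure -- the usual convention -- is wrong here."""
    r = subprocess.run(["grep", "-c", pattern, path],
                       capture_output=True, text=True)
    if r.returncode >= 2:
        raise RuntimeError(r.stderr.strip() or "grep failed")
    return int(r.stdout.strip() or 0)
```

Every tool needs its own version of this table, which is exactly the kind of knowledge that belongs in a skill rather than in every agent's head.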
Each invocation is also stateless. The agent has to thread context through files or its own memory, and bash vs PowerShell means platform-specific logic everywhere.
So where does that leave us? We've got the plumbing (MCP), the interface (CLIs), and the capabilities (tools). What's missing is the part that tells agents what to actually do with all of it.
Skills: the missing layer
None of the layers so far taught agents how to approach a task. Remember my text-to-SQL agent from the intro? Even after I switched to psql, the agent would still generate queries that JOINed tables with ambiguous column names, or run expensive scans on million-row tables without checking the query plan first. The tools worked fine. The agent just didn't know what a senior engineer knows about writing production SQL.
That's what skills encode. A skill is a SKILL.md file. Here's the one I wrote after that text-to-SQL project:
```markdown
---
name: text-to-sql
description: Convert natural language to safe, efficient SQL
  against Postgres. Handles schema discovery, ambiguous columns,
  query validation, and result formatting.
---

# Text-to-SQL

## Steps

1. Run `psql -c '\dt' --csv` to list tables. Then `\d tablename`
   for each relevant table to get columns and types.
2. Before writing the query, check for ambiguous column names
   across JOINed tables. Always alias with `table.column`.
3. Generate the SQL. Wrap it in a CTE if it needs more than one step.
4. Run `EXPLAIN (FORMAT JSON)` first. If estimated rows > 100k
   or a Seq Scan hits a large table, add an index hint or
   rethink the approach. Do NOT execute expensive queries blind.
5. Execute with `psql -c '...' --csv` and pipe through jq for
   formatting if the consumer expects JSON.

## Gotchas

- `COUNT(DISTINCT col)` on nullable columns silently drops NULLs.
- String comparisons are case-sensitive in Postgres by default.
  Use `ILIKE` or `LOWER()` unless the user specifically wants
  exact match.
- If a table has > 1M rows, always LIMIT results unless the user
  explicitly asks for the full dataset. Agents love returning
  everything.
```
YAML metadata on top, markdown instructions below, optionally bundled with scripts and templates. Every gotcha in that file is something that burned me at least once -- the COUNT(DISTINCT) NULL issue alone cost me an afternoon of debugging wrong numbers. Now the agent knows it on the first run.
The key trick is progressive disclosure. Fifty skills in an agent? Only metadata loads (~20-50 tokens each). Full instructions load on-demand when triggered. This directly solves MCP's context bloat problem.
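Progressive disclosure is simple to implement: parse only the frontmatter at startup, read the body on demand. A minimal sketch with a deliberately naive parser -- flat key: value pairs plus indented continuation lines, which is all the format above needs:

```python
import pathlib

def skill_metadata(path):
    """Read only the YAML frontmatter of a SKILL.md file --
    a few dozen tokens -- leaving the instructions on disk."""
    text = pathlib.Path(path).read_text()
    if not text.startswith("---"):
        return {}
    header = text.split("---", 2)[1]
    meta = {}
    for line in header.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
        elif meta:
            last = next(reversed(meta))  # continuation of previous value
            meta[last] += " " + line.strip()
    return meta

def skill_body(path):
    """Load the full markdown instructions only when the skill fires."""
    return pathlib.Path(path).read_text().split("---", 2)[2].strip()
```

At startup you scan a skills directory and keep only the metadata dicts in context; the model sees fifty one-line descriptions instead of fifty full documents.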
Claude Code shipped skills first. OpenAI adopted the same SKILL.md format for ChatGPT and Codex CLI. The pattern spread -- Cursor uses .cursor/rules/, Windsurf uses .windsurf/rules/, GitHub Copilot reads .github/copilot-instructions.md. The AGENTS.md standard is trying to unify these, with adoption across 20,000+ repos and growing.
The layers stack; they don't compete:
```
┌──────────────────────────────────────────────────┐
│ SKILLS         "how an expert would do this"     │ ← cognitive layer
│   SKILL.md, AGENTS.md, .cursor/rules             │
├──────────────────────────────────────────────────┤
│ CLI COMMANDS   "proven commands to run"          │ ← execution layer
│   grep, kubectl, gh, psql, jq                    │
├──────────────────────────────────────────────────┤
│ MCP            "standard access to systems"      │ ← integration layer
│   JSON-RPC 2.0, tools/resources/prompts          │
├──────────────────────────────────────────────────┤
│ TOOLS          "I can call this function"        │ ← capability layer
│   function calling, tool use, API wrappers       │
└──────────────────────────────────────────────────┘
```
Each layer solves a problem the one below can't.
A skill for text-to-SQL doesn't just know psql exists. It knows to run EXPLAIN before executing, to alias ambiguous columns across JOINs, and to LIMIT results on large tables. That's encoded expertise, not API docs.
Skill registries are popping up -- SkillsMP, Vercel's Skills.sh, LobeHub -- though most will probably die the same way early MCP registries fragmented. The one that wins will be whichever gets adopted by the Agentic AI Foundation, which already governs MCP and AGENTS.md under the Linux Foundation.
What to actually do about it
The stack isn't finished. Auth across layers is rough. Discovery isn't solved. Multi-agent coordination is barely explored.
But the bottleneck was never connectivity or capability. It was knowledge. We had grep before GPT-4. We just didn't have a good way to teach agents when to use it, which flags matter, and what to do when the output looks wrong.
If you're building CLI tools today, the agent-readiness bar is low: add a --json flag so agents can parse your output, write clear --help text so they can discover flags at runtime, use meaningful exit codes so they can branch on failure, keep errors on stderr so they're separate from data, and consider shipping a SKILL.md in your repo that teaches agents the workflows your tool enables. That last one is the only new idea -- the rest is just good CLI hygiene that we should've been doing anyway.
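Concretely, the whole checklist minus the SKILL.md fits in a toy CLI. A sketch with a hypothetical user-lookup command -- the command and its data store are invented for illustration:

```python
import argparse
import json
import sys

def main(argv=None):
    """A toy agent-friendly CLI: --json for parseable output,
    errors on stderr, a distinct exit code for 'not found'."""
    parser = argparse.ArgumentParser(
        description="Look up a user record (hypothetical example).")
    parser.add_argument("user_id")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable JSON instead of text")
    args = parser.parse_args(argv)

    users = {"42": {"id": "42", "name": "Ada"}}  # stand-in data store
    record = users.get(args.user_id)
    if record is None:
        print(f"error: no such user {args.user_id}", file=sys.stderr)
        return 2  # agents can branch on this without parsing English
    print(json.dumps(record) if args.json
          else f"{record['id']}\t{record['name']}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

An agent can now parse stdout as JSON, branch on the exit code, and read diagnostics from stderr without the three streams bleeding into each other.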
If you're building agents, stop wiring up every integration from scratch. Check if a CLI already does what you need. Wrap it with a skill that encodes the domain knowledge. Use MCP where you genuinely need real-time bidirectional communication with an external service -- not as the default for everything.
The difference is measurable. Before I wrote the text-to-SQL skill, my agent would take a question like "how many active users last month?" and fire off a SELECT COUNT(*) FROM users WHERE last_login > ... against a 2M-row table with no index on last_login. Forty-five seconds later, it'd return with the answer. After the skill, the same question triggers an EXPLAIN first, the agent sees the Seq Scan on 2M rows, rewrites to filter on the indexed status column with a date range, and comes back in 200ms. Same agent, same tools, same CLI -- just better instructions.
That's the shift. We spent years building better pipes between agents and tools. Turns out the bigger problem was that nobody wrote down what a senior engineer knows about using those tools -- the edge cases, the flags that matter, the order of operations. Skills are just that knowledge, made portable.
The question I keep sitting with: what happens when agents can read and write skills?
I've been prototyping this. My text-to-SQL agent now appends to its own gotchas section when it hits an edge case. Last week it ran a query with a LEFT JOIN that returned duplicate rows because the join key wasn't unique. After I flagged it, the agent added this to the skill file:
```diff
 ## Gotchas
 - `COUNT(DISTINCT col)` on nullable columns silently drops NULLs.
 - String comparisons are case-sensitive in Postgres by default.
+- Before any LEFT JOIN, check if the join key is unique on the
+  right table. If not, you'll silently multiply rows. Run
+  `SELECT key, COUNT(*) FROM table GROUP BY key HAVING COUNT(*) > 1`
+  to verify.
```
Next time it hit a similar query, it checked. No duplicates, no flag from me.
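Mechanically, the write-back is a few lines. A sketch, assuming the Gotchas list is the final section of the file:

```python
import pathlib

def append_gotcha(skill_path, lesson):
    """Append one learned lesson to the skill's Gotchas section.
    Assumes ## Gotchas is the last section of the file; anything
    fancier needs a real markdown-aware edit."""
    path = pathlib.Path(skill_path)
    text = path.read_text()
    if "## Gotchas" not in text:
        text = text.rstrip() + "\n\n## Gotchas"
    path.write_text(text.rstrip() + "\n- " + lesson.strip() + "\n")
```

The code is the trivial part; the interesting part is deciding which lessons are allowed to land.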
This isn't tool integration anymore. It's something closer to institutional memory -- an agent that gets better at its job by encoding what it learns. It also breaks in interesting ways. The agent once added a gotcha about "always using LEFT JOIN instead of INNER JOIN to avoid missing rows" after a query returned fewer results than expected. The actual problem was a WHERE clause filtering too aggressively -- nothing to do with join type. I had to revert the gotcha before it polluted every future query. So yes, you need a review process for agent-authored skills, the same way you review PRs from junior engineers. Sometimes they nail it. Sometimes they learn the wrong lesson from the right symptom.
But the alternative is what we've been doing: every agent starts from zero, every time, and makes the same mistakes your last agent already learned from. That seems worse.
If you try this, I'd like to hear what breaks. Find me on GitHub or LinkedIn.