The decision gate
Run this before building anything. Most ideas should die here, and that is a win. Default to the simplest option and earn your way toward complexity:
- Can you draw the steps in advance? If yes, it is a workflow or a single LLM call, not an agent.
- If not, can you name a specific bottleneck that a second agent would solve? If no, a single agent is the default.
- Only if yes do you reach for multi-agent, and only in one of three proven patterns.
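The gate above can be sketched as a small function. This is an illustrative reading of the three questions, not a prescribed implementation; the names `Architecture` and `decision_gate` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Architecture(Enum):
    WORKFLOW_OR_SINGLE_CALL = "workflow or single LLM call"
    SINGLE_AGENT = "single agent"
    MULTI_AGENT = "multi-agent, in one of the three patterns"


def decision_gate(steps_known_in_advance: bool,
                  named_bottleneck: Optional[str]) -> Architecture:
    """Return the simplest architecture the answers justify."""
    # Question 1: if the steps can be drawn up front, no agent is needed.
    if steps_known_in_advance:
        return Architecture.WORKFLOW_OR_SINGLE_CALL
    # Question 2: without a specific, named bottleneck, stay single-agent.
    if not named_bottleneck or not named_bottleneck.strip():
        return Architecture.SINGLE_AGENT
    # Only a named bottleneck earns multi-agent.
    return Architecture.MULTI_AGENT
```

Requiring the bottleneck as a string rather than a boolean forces whoever runs the gate to write the bottleneck down, which is the point of the second question.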
Why the bias toward simple is correct: this is not taste, it is data. The production reports converged hard in the first half of 2026. Multi-agent systems use roughly 15x more tokens than chat, and token usage explains about 80% of performance variance. On strict sequential reasoning, multi-agent variants showed 39% to 70% performance degradation versus a single agent. A single false statement can infect 100% of agents in hub-and-spoke topologies; what works at 100 requests a minute can collapse at 10,000.
| Pattern | How it works | Use when |
|---|---|---|
| Sequential pipeline | Stages run in order, each emits a tangible artifact you can inspect | The work has clean stages like research, outline, draft, review |
| Orchestrator and workers | One orchestrator holds full context and spawns isolated subagents that return compressed summaries, no peer chatter | The default in 2026: one brain, several disposable hands |
| Parallel specialists | Several experts judge the same input independently and never talk | Review and scoring, where reviewers do not need to coordinate |
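As a rough illustration of the orchestrator-and-workers row, the sketch below keeps full context in one object and lets workers return only a shortened summary. Everything here is an assumption for illustration: `Orchestrator`, `delegate`, and the worker callable are hypothetical, the worker stands in for an LLM call, and plain truncation stands in for real summarization.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Orchestrator:
    goal: str
    # Only compressed summaries are stored here, never raw worker output.
    notes: List[str] = field(default_factory=list)

    def delegate(self, task: str, worker: Callable[[str], str],
                 max_summary_chars: int = 200) -> str:
        # The worker receives its task only: no shared notes, no peer chatter.
        raw = worker(task)
        # Stand-in for summarization: keep the first max_summary_chars characters.
        summary = raw[:max_summary_chars]
        self.notes.append(f"{task}: {summary}")
        return summary
```

The design choice worth noticing is that the worker's signature takes a single string, so there is no path by which the orchestrator's accumulated context can leak into a subagent.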
The seven-layer architecture
Every serious agent is built from these layers. The architecture is identical across all three tiers. What changes by tier is how much rigor you add.
- Model and routing. A router from day one. Tier models per step, cheap for triage, strong for reasoning. Pick by tool-call reliability, not price.
- Context engineering. The highest-leverage layer. Order the window (system instructions, then memory, then tool definitions, then history), keep it lean, and fight context rot through compaction. Use a few strong examples, never a pile.
- Tools. Minimal sharp set, strict schemas, idempotent side effects, per-call budgets. At 30+ tools, switch from loading all definitions to tool search. Let the agent write code that calls tools rather than many round trips. Bloated, ambiguous tool sets are the top failure mode named by Anthropic.
- Knowledge and retrieval. What the agent reads from outside. Use Postgres with pgvector inside Supabase, with HNSW indexes. It is production-grade below roughly 50 million vectors, past anything an agency hits, and costs nothing on top of the database you already run. Move to Qdrant or Pinecone only when you hit a scale or latency wall you can name. Retrieval quality is an evals problem, not an infra one.
- Memory and self-learning. What the agent keeps, in layers: working, summary, artifact, and long-term memory. Self-learning via the ACE loop (generator attempts, reflector critiques, curator folds in delta entries to a living playbook). Use incremental delta updates, never full rewrites, and forget on purpose.
- Evals and observability. Trajectory and outcome. Trace every run. Golden cases plus mined production traces. Gate deploys in CI. Non-negotiable.
- Guardrails and security. Prompt injection is risk one, with tool abuse as the main attack surface. Policy as code, approval gates on irreversible actions, runtime checks before execution, budgets and timeouts as a circuit breaker.
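The context-engineering layer's ordering rule (system instructions, then memory, then tool definitions, then history) can be sketched as follows. This is a minimal sketch under stated assumptions: `rough_tokens` is a crude word count standing in for a real tokenizer, and dropping the oldest history turns stands in for real compaction, which would usually summarize them instead.

```python
from typing import List


def rough_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


def build_context(system: str, memory: List[str], tools: List[str],
                  history: List[str], budget: int) -> List[str]:
    """Assemble the window in a fixed order, trimming history to fit the budget."""
    fixed = [system, *memory, *tools]
    remaining = budget - sum(rough_tokens(part) for part in fixed)
    kept: List[str] = []
    # Walk history newest-first and keep turns while they still fit.
    for turn in reversed(history):
        cost = rough_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return fixed + kept
```

For example, with a budget of 9 and fixed parts costing 6, only the most recent turns totalling 3 words or fewer survive; older turns fall off first.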
Evals, the part that decides ship or stall
| Element | What good looks like |
|---|---|
| Two eval types | Outcome checks the final answer. Trajectory checks the path, tool choice and order. You need both; most teams wrongly stop at outcome |
| Many samples | Nondeterminism compounds across 10 to 20 calls, so sample many runs for stable metrics, never one |
| Golden set | Hand-craft 50 to 100 anchor cases, then mine production traces; aim for 500+ before trusting aggregates |
| LLM as judge | Use it to score at scale, but validate against human labels since judges carry style biases |
| CI gate | Block deploys that regress; roll out behind a canary |
| Stress test | Inject timeouts, rate limits, bad responses, and score whether the agent still reaches the right end state |
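To make the first rows of the table concrete, here is a small sketch that scores both outcome and trajectory over many samples and then applies a CI-style gate. The names (`GoldenCase`, `pass_rates`, `ci_gate`) are hypothetical, the agent is any callable returning an answer plus the ordered tool names it used, and exact-match scoring is a simplifying assumption; real suites often need fuzzier checks or a validated LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GoldenCase:
    prompt: str
    expected_answer: str
    expected_tools: List[str]  # tool names in the order they should be called


# One agent run returns (final_answer, tools_called_in_order).
AgentFn = Callable[[str], Tuple[str, List[str]]]


def pass_rates(agent: AgentFn, case: GoldenCase, samples: int) -> Tuple[float, float]:
    """Run the case several times; return (outcome_rate, trajectory_rate)."""
    outcome_hits = 0
    trajectory_hits = 0
    for _ in range(samples):
        answer, tools = agent(case.prompt)
        outcome_hits += int(answer == case.expected_answer)
        trajectory_hits += int(tools == case.expected_tools)
    return outcome_hits / samples, trajectory_hits / samples


def ci_gate(outcome_rate: float, trajectory_rate: float,
            baseline_outcome: float, baseline_trajectory: float) -> bool:
    """Allow the deploy only if neither metric regressed below its baseline."""
    return outcome_rate >= baseline_outcome and trajectory_rate >= baseline_trajectory
```

Tracking the two rates separately is what catches a run that lands on the right answer by the wrong path, which an outcome-only suite would count as a pass.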
The three tiers
Same architecture, different rigor. The progression is mostly how much eval, guardrail, and governance you add as the stakes rise.
| Dimension | Personal | Business-ready | Enterprise |
|---|---|---|---|
| For | One user, you | Your team and clients at Epilog | Enterprise clients or productized advisory |
| Evals | Light, a few golden checks | Required, 50+ golden cases in CI | Heavy, plus online evals, drift detection, alerting |
| Tracing | Optional | Full on every run | Full plus audit logs for compliance |
| Guardrails | Minimal | Brand and output safety, no fabricated claims | Policy as code at runtime, sub-second interception |
| Approvals | Only if it spends or sends | Gate anything client-facing | Full human-in-the-loop on irreversible actions |
| The real risk | Building clever and never measuring it | Margin leak and off-brand or false output | Failing a client security review |
At the top tier the selling point is not that the agent is smart. It is that it is governable, auditable, and safe. That is defensible IP in a way a pile of workflows is not, which matters for advisory work.
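The Approvals row of the tier table can be read as policy as code. The sketch below is one possible reading, not the definitive policy: the `Tier`, `Action`, and `needs_approval` names are hypothetical, and it assumes each tier inherits the gates of the tier below it, which the table does not state outright.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PERSONAL = 1
    BUSINESS = 2
    ENTERPRISE = 3


@dataclass(frozen=True)
class Action:
    name: str
    spends_or_sends: bool = False
    client_facing: bool = False
    irreversible: bool = False


def needs_approval(action: Action, tier: Tier) -> bool:
    """Decide whether a human must approve the action at this tier."""
    if tier is Tier.PERSONAL:
        return action.spends_or_sends
    if tier is Tier.BUSINESS:
        # Assumption: business-ready keeps the personal-tier gate as well.
        return action.spends_or_sends or action.client_facing
    # Enterprise: everything above, plus any irreversible action.
    return action.spends_or_sends or action.client_facing or action.irreversible
```

Keeping the policy in one pure function makes it easy to unit test and to show to a reviewer during a security audit.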
The build sequence
The order is the point. Build the measurable parts before the clever parts.
- Write the spec: bounded job, explicit limits.
- Pick the architecture: run the decision gate.
- Tracing and evals FIRST, before code.
- Minimal tools, validated.
- Context: order it, keep it lean.
- Knowledge and memory, only if needed.
- Guardrails, sized to the tier.
- Eval, gate, canary: block regressions.
- Watch production, mine traces, feed failures back into evals and the playbook. The self-improving loop never ends.
Wiring tracing and evals (step 3) before building tools (step 4) is the discipline most teams skip.
The agentic engineering discipline
How to actually build with coding agents like Claude Code and Codex without producing confident garbage. You are no longer the person who writes the code. You decide what gets built, set the constraints, and verify the result. The model is a fast, capable, slightly overconfident junior that never pushes back unless you make it.
The 80% trap. The agent is right about 80% of the time, which is exactly the danger. The output looks finished, the tests look green, and the broken fifth surfaces in production. Around 66% of developers report this almost-right-but-not-quite problem. Done is a claim to verify, never a fact.
Spec-driven development. Stop one-shotting features. The reliable loop is plan, then implement in small task groups, then verify each one. A good spec defines six things: outcome, scope boundaries, constraints, prior decisions, task breakdown, and verification criteria.
- Plan first, and actually read the plan. Feed full context, let the agent propose a plan, push back until you are confident. Catch bad approaches here for almost no cost.
- Implement in task groups, one at a time, small enough to test in isolation. Not "build authentication," but "create a registration endpoint that validates email."
- Review every group before moving on, and realign against the spec so drift does not accumulate.
Context files. The context file (CLAUDE.md, or the cross-tool AGENTS.md standard) is the highest-leverage artifact in your repo and the most commonly ruined one. Keep it under about 200 lines. If removing a line would not cause a mistake, cut it. Do not load it with few-shot examples; that makes the agent imitate and stop thinking. Structure it: overview, architecture, conventions, testing, commands, and an explicit anti-patterns section at the end.
| Lever | What it is | Use for |
|---|---|---|
| Context file | Always-on conventions and project knowledge | The baseline every session loads |
| Skills | Reusable workflows in a folder with a SKILL.md | Repeat patterns like release deploy or report generation |
| Hooks | Scripts that fire on events, a deterministic gate | A stop hook that blocks "done" until tests pass |
| Subagents | Specialists with isolated context, only a summary returns | Research without polluting context, and verification by a fresh agent |
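The Hooks row describes a stop hook that blocks "done" until tests pass. A minimal sketch of such a gate is shown here. The function name `gate` is hypothetical, and the exit code used to signal "block" is an assumption: hook conventions differ between coding agents, so check your tool's hook documentation before relying on the value 2.

```python
import subprocess
import sys
from typing import List

ALLOW = 0
BLOCK = 2  # Assumed "block and report" exit code; verify against your agent's docs.


def gate(test_command: List[str]) -> int:
    """Run the test command; return BLOCK if it fails, ALLOW if it passes."""
    result = subprocess.run(test_command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the tail of the output on stderr so the agent can act on it.
        print(result.stdout[-2000:], file=sys.stderr)
        print(result.stderr[-2000:], file=sys.stderr)
        return BLOCK
    return ALLOW
```

Wired up as a hook script, the last line would be something like `sys.exit(gate([sys.executable, "-m", "pytest", "-q"]))`, assuming pytest is the project's test runner. The gate is deterministic: the agent's opinion about whether it is done never enters the decision.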
The verification discipline. The agent will say it is done. Your job at that moment is to not believe it until something proved it. Climb the ladder as stakes rise: prompt-level (agent grades itself, weakest), second opinion (a fresh model refutes it), deterministic gate (a hook runs the tests, strongest). Always review with a different model than the one that implemented. Tier models correctly: strong model on the spec and plan, cheaper on mechanical implementation. Keep the human at the merge: branch and PR, never push to main.
Done is a hypothesis. A passing gate is a fact. Build the gate.
The default stack
The lean default. Four of these you likely already run.
| Layer | Pick | Why |
|---|---|---|
| Models and routing | OpenRouter | Model freedom and per-step tiering, a router from day one |
| Agent code layer | Pydantic AI | Model-agnostic, type-safe, light. LangGraph only for durable state |
| Orchestration | n8n | Where the deterministic majority of work lives |
| Knowledge | Supabase pgvector | No extra database, HNSW indexes, scales past agency needs |
| Memory | Mem0 or a Supabase table | Layered, with aging |
| Evals and tracing | Langfuse | Open source, self-hostable |
| Tool interface | MCP | The vendor-neutral standard |
| Guardrails | Policy as code plus approvals | Sized to the tier |
A note on the coding-agent race: rankings flip month to month right now. Treat tool descriptions as directionally true, not a settled verdict, and let your own task results decide. The tool-agnostic AGENTS.md approach is your insurance against the churn. Do not marry one coding agent; the gap between top tools is small, so workflow fit matters more than the score.
Anti-patterns
Things that look smart and are not. Each is a real failure mode from the 2026 reports.
- Architecture: reaching for multi-agent because it sounds advanced. It is more expensive and more fragile. Earn it with a named bottleneck.
- Infrastructure: adding a dedicated vector database before you need one. Use pgvector until you can name the wall.
- Context: stuffing the window to be safe. You are causing context rot and degrading recall.
- Tools: a big undifferentiated tool pile. Ambiguity wrecks tool selection. Fewer sharp tools win.
- Memory: treating memory as just the vector database. It is a separate layered system that also needs deliberate forgetting.
- Evals: shipping on outcome evals only, trusting one run, or trusting leaderboard scores. Your domain set is the only number that matters.
- Workflow: building clever before building measurable, and accepting "done" without a gate.
- Models: picking by price instead of whether tool calling holds up, and tiering backward with a cheap model on the spec.
Checklists you actually run
Per project, before you start:
- One bounded job written down, with explicit limits on what it will not do
- Decision gate run, defaulting to workflow or single agent
- Tracing wired and a golden eval set written, before any clever code
- Minimal tool set defined, validated, idempotent, budgeted
- Guardrails sized to the tier, approval gates on irreversible actions
- CI gate that blocks regressions, canary rollout planned
Per coding session with an agent:
- Context file under 200 lines and current
- Plan written and read by you before any implementation
- Work in task groups, review each before the next
- Review with a different model than the one that implemented
- Branch and PR, never push to main
- A deterministic gate proved the work done; you did not just trust the agent's claim
Ongoing, the loop that never ends:
- Mine production traces for failures and interesting cases
- Every recurring correction becomes a rule in the context file, a skill, or an eval case
- Add incrementally, prune the stale, keep the durable layer lean
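The "add incrementally, prune the stale" loop can be sketched as a playbook that only ever receives single-entry deltas and forgets entries that have gone unused. This is an illustration, not the ACE paper's implementation: `Playbook`, `Entry`, and the run-count staleness rule are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    text: str
    last_used_run: int


@dataclass
class Playbook:
    entries: Dict[str, Entry] = field(default_factory=dict)

    def apply_delta(self, key: str, text: str, run: int) -> None:
        # Add or update exactly one entry; the rest of the playbook is untouched.
        self.entries[key] = Entry(text, run)

    def touch(self, key: str, run: int) -> None:
        # Record that an entry was actually used in this run.
        self.entries[key].last_used_run = run

    def prune(self, current_run: int, max_idle_runs: int) -> List[str]:
        """Forget entries idle for more than max_idle_runs; return removed keys."""
        stale = [k for k, e in self.entries.items()
                 if current_run - e.last_used_run > max_idle_runs]
        for key in stale:
            del self.entries[key]
        return stale
```

Because there is no method that replaces the whole playbook, a full rewrite is impossible by construction, and `prune` makes forgetting an explicit, reviewable step.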
A closing note
Figures here are illustrative of the patterns described, drawn from the sources below; verify any specific statistic against the primary source before quoting it externally. The point is not the exact number, it is the shape of the discipline.
Sources behind this guide include Anthropic engineering on effective context engineering and code execution with MCP; the ACE (Agentic Context Engineering) paper, arXiv 2510.04618; Spec-Driven Development, arXiv, Feb 2026; the LangChain State of AI Agents 2026; Gartner and MarketsAndMarkets on market size and adoption; multi-agent production reports on token cost and cascade failure; and Claude Code best practices with the AGENTS.md standard.
If you want help applying any of this to a real system in your business, contact me via WhatsApp.