\
IntroductionAt first, you just try to communicate with ChatGPT via API, throw in a couple of lines of context, and feel amazed that it responds at all. Then you want it to do something useful. Then — to do it reliably. Eventually — to do it without you.
That’s how an agent is born.
If you’ve also spent the past year cobbling together agents from scripts and wrappers, experimenting and tinkering, and you’re still searching for a cleaner, more sustainable way to build them — this article is for you. I’ve wandered through repos and forums, repeatedly asking myself, “How are others doing it?” > I kept what stuck — what actually felt right after some real use, and gradually distilled a set of core principles for turning a cool idea into a production-ready solution.
This isn’t a manifesto. Think of it as a practical cheat sheet — a collection of engineering principles that help guide an agent from the sandbox to production: from a simple API wrapper to a stable, controllable, and scalable system.
Disclaimer
In this article (Building effective agents), Anthropic defines an agent as a system where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks. Systems where LLMs and tools are orchestrated through predefined code paths they call Workflows. Both are part of a broader concept - Agent Systems.
In this text, Agent = Agent system, where for the sake of stability and control I will more often lean towards Workflows. I hope that in the near future there will be 1-2 more turns of the evolution and true Agents will be ubiquitous, but for now this is not the case
1. Design the FoundationEarly versions of agents usually come together fast: a few functions, a couple of prompts — and hey, it works.
“If it works, why make it complicated?”
In the beginning, everything seems stable. The agent responds, executes code, behaves sensibly. But once you switch the model, restart the system, or plug in a new interface — suddenly it becomes unstable, unpredictable, hard to debug.
And often, the root cause isn’t in the logic or the prompts, but much earlier: broken memory management, hardcoded staff, no way to resume sessions, or a single rigid entry point.
This section walks through four key principles that will help you build a solid foundation — one that everything else can safely grow on top of.
1. Keep State OutsideProblem:
This one’s not strictly a problem, but still:
(Memory is a whole separate issue — we’ll get to that soon)
Solution: Move state outside the agent — into a database, a cache, a storage layer — even a JSON file will do.
Checklist:
Problem: LLMs don't remember. Even within a single session, the model can forget what you've already explained, mix up stages, lose the thread of conversation, or start "filling in" details that weren't there. And it seems like time goes on, the context window grows bigger and bigger, delighting us with new possibilities. LinkedIn is full of posts where people compare which book or how many hours of YouTube's video now fit into the new model version. But still, LLMs don't remember and you should be prepared.
Especially if:
Even with increasing context windows (8k, 16k, 128k…), problems remain:
Solution: Separate "working memory" from "storage" — like in classical computing systems. The agent should be able to work with external memory: store, retrieve, summarize, and update knowledge outside the model. There are several architectural strategies, and each has its boundaries.
Approaches
Memory Buffer
Stores the last k messages. Take for quick prototyping.
+ simple, fast, sufficient for short tasks
- loses important info, doesn't scale, doesn't remember "yesterday"
\ Summarization Memory
Compresses history to fit more.
+ token savings, memory expansion
- distortions, loss of nuances, errors in multi-step compression
\ RAG (Retrieval-Augmented Generation)
Pulls knowledge from external databases. Most of the time you'll be here.
+ scalable, fresh, verifiable
- complex setup, sensitive to retrieval quality, latency
\ Knowledge Graphs
Structured connections between entities and facts. Always elegant, sexy and hard, you'll end up doing RAG anyway.
+ logic, explainability, stability
- high barrier to entry, complexity of LLM integration
Checklist:
Problem: LLMs are rapidly evolving; Google, Anthropic, OpenAI, etc. constantly release updates, racing against each other across different benchmarks. This is a feast for us as engineers, and we want to make the most of it. Our agent should be able to easily switch to a better (or conversely, cheaper) model seamlessly.
Solution:
Checklist:
Problem: Even if initially the agent is intended to have only one communication interface (for example, UI), you'll eventually want to give users more flexibility and convenience by adding interaction through Slack, WhatsApp, or, dare I say it, SMS - whatever. An API might turn into a CLI (or you'll want one for debugging). Build this into your design from the start; make it possible to use your agent wherever it's convenient.
Solution: Creating a unified input contract: Develop an API or other mechanism that will serve as a universal interface for all channels. Store channel interaction logic separately.
Checklist:
Agent is callable from CLI, API, UI
All input goes through a single endpoint/parser/schema
All interfaces use the same input format
No channel contains business logic
Adding a new channel = only an adapter, no changes to core
\
\
While there's only one task, everything is simple, as in the posts of AI evangelists. But as soon as you add tools, decision-making logic, and multiple stages, the agent turns into a mess.
It loses track, doesn't know what to do with errors, forgets to call the right tool—and you're left alone again with logs where "well, everything seems to be written there."
To avoid this, the agent needs a clear behavioral model: what it does, what tools it has, who makes decisions, how humans intervene, and what to do when something goes wrong.
This section covers principles that will help you give your agent a coherent action strategy instead of hoping "the model will figure it out somehow."
5. Design for Tool UseProblem: This point might seem obvious, but you still encounter agents built on "Plain Prompting + raw LLM output parsing." It's like trying to control a complex mechanism by pulling random strings and hoping for the best. When LLMs return plain text that we then try to parse with regex or string methods, we face:
Solution: The model returns JSON (or another structured format)—the system executes.
The key idea here is to leave the responsibility for interpreting user intent and choosing tools to the LLM, while still assigning the execution of that intent to the system through a clearly defined interface.
Fortunately, practically all providers (OpenAI, Google, Anthropic, or whoever else you prefer) support so-called "function calling" or the ability to generate output in a strictly defined JSON format.
Just to refresh how this works:
Important: Tool descriptions are also prompts. Unclear description = wrong function choice.
What to do without function calling?
If the model doesn't support tool calls or you want to avoid them for some reason:
Checklist:
Problem: Usually agents work as "dialogues"—first the user speaks, then the agent responds. It's like playing ping-pong: hit-response. Convenient, but limiting.
Such an agent cannot:
Instead, the agent should manage its own "execution flow"—decide what to do next and how to do it. This is like a task scheduler: the agent looks at what needs to be done and executes steps in order.
This means the agent:
Solution: Instead of letting the LLM control all the logic, we extract the control flow into code. The model only helps within steps or suggests the next one. This is a shift from "writing prompts" to engineering a system with controlled behaviour.
Let's look at three popular approaches:
1. FSM (Finite State Machines)
\
2. DAG (Directed Graphs)
\
3. Planner + Executor
\
Why this matters:
\
Checklist:
Problem: Even if an agent uses structured tools and has a clear control flow, full autonomy of LLM agents in the real world is still more of a dream (or nightmare, depending on context). LLMs don't possess true understanding and aren't accountable for anything. They can and will make suboptimal decisions. Especially in complex or ambiguous situations.
Main risks of full autonomy:
Solution: ~~Strategic summoning of Carbon-based lifeforms~~ Integrate humans into the decision-making process at key stages.
HITL Implementation Options
1. Approval Flow
2. Confidence-aware Routing
3. Human-as-a-Tool
4. Fallback Escalation
5. RLHF (Human Feedback)
Checklist:
Problem: The standard behavior of many systems when an error occurs is either to "crash" or simply report the error and stop. For an agent that should autonomously solve tasks, this isn't exactly the best behavioral model. But we also don't want it to hallucinate around the problem.
What we'll face:
Solution: Errors are included in the prompt or memory. The idea is to try implementing some kind of "self-healing." Agent should at least try to correct its behavior and adapt.
Rough flow:
Checklist:
Problem: Let's back to the key LLM limitation (that context window thing), but look at this problem from another angle. The bigger and more complex the task, the more steps it will take, which means a longer context window. As context grows, LLMs are more likely to get lost or lose focus. By focusing agents on specific domains with 3-10, maybe maximum 20 steps, we maintain manageable context windows and high LLM performance.
Solution: Use smaller agents targeted at specific tasks. One agent = one task; orchestration from above.
Benefits of small, focused agents:
Unfortunately, there's no clear heuristic for understanding when a piece of logic is already big enough to split into multiple agents. I'm pretty sure that while you're reading this text, LLMs have gotten smarter somewhere in labs. And they keep getting better and better, so any attempt to formalize this boundary is doomed from the start. Yes, the smaller the task, the simpler it is, but the bigger it gets, the better the potential is realized. The right intuition will only come with experience. But that's not certain.
Checklist:
Scenario is built from microservice calls
Agents can be restarted and tested separately
Agent = minimal autonomous logic. You can explain what it does in 1-2 sentences.
\
\
The model handles generation. Everything else is on you.
How you formulated the request, what you passed in context, what instructions you gave—all this determines whether the result will be coherent or "creative."
LLMs don't read minds. They read tokens.
Which means any input error turns into an output bug—just not immediately noticeable.
This section is about not letting everything drift: prompts = code, explicit context management, constraining the model within boundaries. We don't hope that the LLM will "figure it out on its own."
10. Treat Prompts as CodeProblem: A very common pattern, especially among folks without ML or SE background, is storing prompts directly in code. Or at best, unsystematic storage in external files.
This approach leads to several maintenance and scaling difficulties:
Solution: Prompts in this context are not much different from code and the same basic engineering practices should be applied to them
This implies:
We'll talk about testing in more detail in the principe 14.
Checklist:
Problem: We've already discussed the "forgetfulness" of LLMs, partially solving this by offloading history to external memory and using different agents for different tasks. But that's not all. I propose we also consider explicit context window management (and here I'm not just talking about compressing history to fit the optimal size or including errors from previous steps in the context).
Standard formats aren't always optimal: A simple list of messages in the "role-content" (system/user/assistant) format is the baseline, but it can be token-heavy, not informative enough, or poor at conveying the complex state of your agent.
Most LLM clients use the standard message format (a list of objects with role: "system", "user", "assistant", content, and sometimes tool_calls fields).
While this "works great for most cases," to achieve maximum efficiency (in terms of both tokens and the model's attention), we can approach context formation more creatively.
Solution: To engineer it. To treat the creation of the entire information package passed to the LLM as "Context Engineering." This means:
(Instead of a checklist) How do you know when this makes sense?
If you're interested in any of the following:
Problem: We've already done a lot in the name of stability, but nothing is a silver bullet. This means it's worth looking at the most critical potential problems separately and explicitly taking out some "insurance".
In this principle, we think about:
The security and grounding of an LLM agent isn't a single measure, but a multi-layered system of protection ("defense-in-depth") that covers the entire interaction lifecycle. The threats are diverse, and no single method of protection is a panacea. Effective protection requires a combination of techniques.
Solution: We must commit to a multi-layered defense system, thinking through and explicitly handling all corner cases and potential scenarios, and having a clear response ready for whatever might happen.
In a basic setup, you should consider:
Secure Inputs.
Check for known attack-indicator phrases (e.g., "ignore all previous instructions"). It sometimes makes sense to combat potential obfuscation.
Try to determine the user's intent separately. You can use another LLM for this, to analyze the input for the current one.
Control input from external sources, even if they are your own tools.
Guarded Actions. Control the privileges of both the agent and its tools (granting the minimum necessary), clearly define and limit the list of available tools, validate parameters at the input to tools, and enable Principle #7 (Human in the Loop).
Output Moderation. Design a system of checks for what the model outputs, especially if it goes directly to the user. These can be checks for relevance (ensuring the model uses what's in the RAG and doesn't just make things up) as well as checks for general appropriateness. There are also ready-made solutions (e.g., the OpenAI Moderation API).
The final system, however, depends on your tasks and your risk assessment. In the checklist, we'll try to sketch out some options.
Checklist:
User input validation is in place.
For tasks requiring factual information, the data within the RAG is used.
The prompt for the LLM in a RAG system explicitly instructs the model to base its answer on the retrieved context.
LLM output filtering is implemented to prevent PII (Personally Identifiable Information) leakage.
The response includes a link or reference to the source.
LLM output moderation for undesirable content is implemented.
The agent and its tools operate following the principle of least privilege.
The agent's actions are monitored, with HITL (Human-in-the-Loop) for critical operations.
\
\
An agent that "kinda works" is a bug with a delayed effect.
In prod, not everything breaks at once. And you don't find out about it instantly. Sometimes, you don't find out at all.
This section is about the engineering habit of seeing what's happening and checking that everything is still working. Logs, tracing, tests—everything that makes an agent's behavior transparent and reliable, even when you're sleeping or developing your next agent.
13. Trace EverythingProblem: One way or another, you will constantly face situations where the agent doesn't work as you expected. During development, testing, making changes, or during normal operation. This is inevitable, and at the moment, it's normal to some extent. This means you're doomed to spend hours and days debugging, trying to understand what's wrong, reproducing the issue, and fixing it. I'd like to think that by this point you've already implemented Principle #1 (Keep State Outside) and #8 (Compact Errors into Context). In most cases, that will be enough to make your life much simpler. Some other principles will also indirectly help here.
Even so (and especially if you've decided not to bother with them for now), it makes a lot of sense to think about debugging in advance and save yourself time and nerves in the future by adhering to this principle.
Solution: Log the entire path from request to action. Even if you already have logs for individual components, tracing the entire chain can be a hassle. Even if you're a big fan of puzzles or Lego, at some point, it will stop being fun. Therefore, logs must exist, they must be end-to-end, and they must cover everything.
Why it's needed:
The basic "gentleman's set" looks like this:
Note: Look into existing tracing tools; under certain conditions, they will make your life much easier. LangSmith, for example, provides detailed visualization of call chains, prompts, responses, and tool usage. You can also adapt tools like Arize, Weights & Biases, OpenTelemetry, etc. for your needs. But first, see Principle #15.
Checklist:
Problem: By this point, you most likely have some kind of practically finished solution. It works, maybe even just the way you wanted. Ship it to prod? But how do we ensure it keeps working? Even after the next minor update? Yes, I'm leading us to the topic of testing.
Obviously, updates in LLM systems, just like in any other—be it changes to the application code, updates to datasets for fine-tuning or RAG, a new version of the base LLM, or even minor prompt adjustments—often lead to unintentional breaks in existing logic and unexpected, sometimes degrading, agent behavior. Traditional software testing approaches prove to be insufficient for comprehensive quality control of LLM systems. This is due to a number of risks and characteristics specific to large language models:
…and I suppose we could go on. But we already understand that traditional tests, focused on verifying explicit code logic, are not fully capable of covering these issues.
Solution: We'll have to devise a complex, comprehensive approach that covers many things, combining classic and domain-specific solutions. This solution should address the following aspects:
Checklist:
Logic is broken down into modules: functions, prompts, APIs—everything is tested separately and in combination.
Response quality is checked against benchmark data, evaluating meaning, style, and correctness.
Scenarios cover typical and edge cases: from normal dialogues to failures and provocative inputs.
The agent must not fail due to noise, erroneous input, or prompt injections—all of this is tested.
Any updates are run through CI and monitored in prod—the agent's behavior must not change unnoticed.
This is a meta-principle; it runs through all the ones listed above.
Fortunately, today we have dozens of tools and frameworks for any task. This is great, it's convenient, and it's a trap.
Almost always, choosing a ready-made solution means a trade-off: you get speed and an easy start, but you lose flexibility, control, and, potentially, security.
This is especially critical in agent development, where it's important to manage:
Frameworks bring inversion of control: they decide for you how the agent should work. This can simplify a prototype but complicate its long-term development.
Many of the principles described above can be implemented using off-the-shelf solutions—and this is often justified. But in some cases, an explicit implementation of the core logic takes a comparable amount of time and provides incomparably more transparency, manageability, and adaptability.
The opposite extreme also exists—over-engineering, the desire to write everything from scratch. This is also a mistake.
This is why the key is balance. The engineer chooses for themselves: where it's reasonable to rely on a framework, and where it's important to maintain control. And they make this decision consciously, understanding the cost and consequences.
You have to remember: the industry is still taking shape. Many tools were created before current standards emerged. Tomorrow, they might become obsolete—but the limitations baked into your architecture today will remain.
\
ConclusionOkay, we've gone over 15 principles that, experience shows, help turn the initial excitement of "it's alive!" into confidence that your LLM agent will work in a stable, predictable, and useful way under real-world conditions.
You should consider each of them to see if it makes sense to apply it to your project. In the end, it's your project, your task, and your creation.
Key takeaways to carry with you:
I hope you've found something new and useful here, and maybe you'll even want to come back to this while designing your next agent.
All Rights Reserved. Copyright , Central Coast Communications, Inc.