Introduction: “I Think It Works” Is Not Good Enough
With regular software, testing is straightforward: give a specific input, get a specific output. If 2+2 equals 4, it is correct. If not, it is wrong.
With Agents, things are different. Agent output is non-deterministic — give the same input, and you might get two different answers. How do you test something that gives a different answer every time?
In this episode, you will learn techniques for testing and debugging Agents — from logging to professional tools.
Why Is Agent Testing Hard?
1. Non-deterministic Output
The LLM may generate different text each time. You cannot use assertEqual and say “the output must be exactly this.”
2. Chain of Decisions
The Agent does not give a simple response — it makes a series of decisions: which tool to use? What parameters? How to interpret the result? Each one could go wrong.
3. External Service Dependencies
The Agent depends on the LLM API, external tools, and databases. Any of these could change or become unavailable.
Level 1: Log Everything (Trace Logging)
First of all, you need to see what the Agent is doing. Without logs, debugging an Agent is like fixing a car in the dark.
import json
import time
from datetime import datetime


class AgentTracer:
    """Records all Agent actions."""

    def __init__(self):
        self.traces = []
        self.current_trace = None
        self._start_time = None

    def start_trace(self, user_input: str):
        """Start a new trace."""
        self.current_trace = {
            "id": f"trace_{int(time.time())}",
            "timestamp": datetime.now().isoformat(),
            "input": user_input,
            "steps": [],
            "total_tokens": 0,
            "total_cost": 0.0,
            "total_time": 0.0,
        }
        self._start_time = time.time()

    def log_step(self, step_type: str, data: dict):
        """Record a step."""
        step = {
            "type": step_type,
            "timestamp": datetime.now().isoformat(),
            "data": data,
        }
        self.current_trace["steps"].append(step)
        if "usage" in data:
            tokens = data["usage"].get("total_tokens", 0)
            self.current_trace["total_tokens"] += tokens
            # Rough estimate: assumes a flat price of $0.15 per 1M tokens
            # (in the ballpark of gpt-4o-mini input pricing); adjust for your model.
            self.current_trace["total_cost"] += tokens * 0.00015 / 1000

    def end_trace(self, output: str):
        """End the trace."""
        self.current_trace["output"] = output
        self.current_trace["total_time"] = time.time() - self._start_time
        self.traces.append(self.current_trace)
        return self.current_trace

    def print_trace(self, trace: dict = None):
        """Pretty-print the trace."""
        t = trace or self.current_trace
        print(f"\n{'='*60}")
        print(f"Trace: {t['id']}")
        print(f"Input: {t['input']}")
        print(f"{'='*60}")
        for i, step in enumerate(t["steps"], 1):
            print(f"\n--- Step {i}: {step['type']} ---")
            for key, value in step["data"].items():
                if key == "usage":
                    continue
                val_str = str(value)[:200]
                print(f" {key}: {val_str}")
        print(f"\n{'='*60}")
        print(f"Output: {t['output'][:200]}")
        print(f"Tokens: {t['total_tokens']}")
        print(f"Cost: ${t['total_cost']:.4f}")
        print(f"Time: {t['total_time']:.2f}s")
        print(f"{'='*60}\n")
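Before wiring the tracer into a real agent loop, it helps to sanity-check it by hand. A minimal sketch with made-up step data; in a real agent the log_step calls would sit right next to your LLM and tool calls:

tracer = AgentTracer()

tracer.start_trace("What is the weather in Tehran?")
tracer.log_step("llm_call", {
    "model": "gpt-4o-mini",
    "usage": {"total_tokens": 350},   # made-up numbers for the demo
})
tracer.log_step("tool_call", {
    "tool": "get_weather",
    "arguments": {"city": "Tehran"},
    "result": "Temperature: 21C, clear sky",
})
trace = tracer.end_trace("It is currently 21C and clear in Tehran.")
tracer.print_trace(trace)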
Level 2: Unit Testing Tools
Tools are deterministic — specific input, specific output. That makes them straightforward to unit test the usual way.
import pytest

# Assumes the tools from the earlier episodes are importable, e.g.:
# from tools import calculate, get_weather, get_current_time
# (the async tests also need the pytest-asyncio plugin)


class TestCalculateTool:
    def test_basic_addition(self):
        result = calculate("2 + 3")
        assert "5" in result

    def test_complex_expression(self):
        result = calculate("(10 + 5) * 2")
        assert "30" in result

    def test_division(self):
        result = calculate("10 / 3")
        assert "3.33" in result

    def test_invalid_expression(self):
        result = calculate("hello")
        assert "Error" in result or "error" in result

    def test_dangerous_input(self):
        """Should not allow code execution."""
        result = calculate("__import__('os').system('ls')")
        assert "Error" in result or "allowed" in result


class TestWeatherTool:
    @pytest.mark.asyncio
    async def test_valid_city(self):
        result = await get_weather("Tehran")
        assert "Temperature" in result or "temp" in result

    @pytest.mark.asyncio
    async def test_invalid_city(self):
        result = await get_weather("XYZNotACity")
        assert "Error" in result or "error" in result


class TestTimeTool:
    def test_valid_timezone(self):
        result = get_current_time("Asia/Tehran")
        assert len(result) > 0
        assert "invalid" not in result.lower()

    def test_invalid_timezone(self):
        result = get_current_time("Invalid/Zone")
        assert "invalid" in result.lower() or "Invalid" in result
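The tests above assume the tools built in earlier episodes are importable. If you want the suite to run stand-alone, a minimal calculate that satisfies these assertions could look like the sketch below (a simplified stand-in, not the exact tool from the series):

def calculate(expression: str) -> str:
    """Evaluate a simple arithmetic expression and return the result as text."""
    # Whitelist of characters; anything else (letters, quotes, underscores)
    # is rejected, which also blocks inputs like "__import__('os')...".
    allowed = set("0123456789+-*/(). ")
    if not expression or not set(expression) <= allowed:
        return "Error: only basic arithmetic is allowed"
    try:
        result = eval(expression, {"__builtins__": {}}, {})  # restricted eval
    except Exception as e:
        return f"Error: {e}"
    return f"{expression} = {round(result, 2)}"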
Level 3: Behavioral Testing
This is where it gets interesting. We cannot check exact output, but we can check behavior.
import pytest

from agent import process_message

# Assumes the agent module also exposes the shared AgentTracer from Level 1
# (e.g. "from agent import tracer"), so that the agent's own tool calls end up
# in the same trace the test starts here.


class TestAgentBehavior:
    """Test Agent behavior — not exact output."""

    @pytest.mark.asyncio
    async def test_uses_weather_tool_for_weather_query(self):
        """Should use weather tool when asked about weather."""
        tracer.start_trace("What is the weather in Tehran?")
        result = await process_message(
            chat_id=999,
            user_message="What is the weather in Tehran?",
            user_name="Test"
        )
        trace = tracer.end_trace(result)
        tool_steps = [s for s in trace["steps"] if s["type"] == "tool_call"]
        assert len(tool_steps) > 0
        assert tool_steps[0]["data"]["tool"] == "get_weather"

    @pytest.mark.asyncio
    async def test_responds_in_expected_language(self):
        """Response should be in the expected language."""
        result = await process_message(
            chat_id=999,
            user_message="Hello, how are you?",
            user_name="Test"
        )
        # Check that it contains English characters
        assert any(c.isalpha() for c in result)

    @pytest.mark.asyncio
    async def test_says_dont_know_for_unknown(self):
        """Should say it does not know for unanswerable questions."""
        result = await process_message(
            chat_id=999,
            user_message="What is my grandmother's phone number?",
            user_name="Test"
        )
        negative_phrases = ["don't know", "cannot", "no information", "not able"]
        has_negative = any(p in result.lower() for p in negative_phrases)
        assert has_negative, f"Should say it doesn't know: {result}"

    @pytest.mark.asyncio
    async def test_remembers_context(self):
        """Should remember previous conversation."""
        chat_id = 888
        # First message
        await process_message(
            chat_id=chat_id,
            user_message="My name is Alex",
            user_name="Test"
        )
        # Second message
        result = await process_message(
            chat_id=chat_id,
            user_message="What is my name?",
            user_name="Test"
        )
        assert "Alex" in result, f"Memory did not work: {result}"
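Behavioral tests inherit the model's non-determinism, so a check that passes nine times can fail on the tenth run. A common mitigation is to run the same check several times and require a minimum pass rate instead of a perfect score. A small helper along those lines (a sketch; the 80% threshold is arbitrary):

async def pass_rate(check, runs: int = 5) -> float:
    """Run an async check several times and return the fraction of passes."""
    passes = 0
    for _ in range(runs):
        try:
            await check()
            passes += 1
        except AssertionError:
            pass
    return passes / runs


@pytest.mark.asyncio
async def test_weather_tool_is_used_most_of_the_time():
    async def check():
        tracer.start_trace("What is the weather in Tehran?")
        result = await process_message(
            chat_id=999,
            user_message="What is the weather in Tehran?",
            user_name="Test"
        )
        trace = tracer.end_trace(result)
        assert any(s["type"] == "tool_call" for s in trace["steps"])

    # Accept occasional misses, but fail if the behavior is unreliable.
    assert await pass_rate(check, runs=5) >= 0.8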
Level 4: Golden Dataset
Create a collection of questions and expected answers. Every time you change the Agent, test it against this collection.
GOLDEN_DATASET = [
    {
        "input": "What is the weather in London?",
        "expected_tool": "get_weather",
        "expected_contains": ["London"],
        "category": "weather",
    },
    {
        "input": "What is 25 times 4?",
        "expected_tool": "calculate",
        "expected_contains": ["100"],
        "category": "math",
    },
    {
        "input": "Hello, how are you?",
        "expected_tool": None,
        "expected_contains": [],
        "category": "greeting",
    },
    {
        "input": "What time is it now?",
        "expected_tool": "get_current_time",
        "expected_contains": [":"],
        "category": "time",
    },
]


async def run_golden_tests():
    """Run golden tests."""
    results = []
    for test_case in GOLDEN_DATASET:
        tracer.start_trace(test_case["input"])
        result = await process_message(
            chat_id=777,
            user_message=test_case["input"],
            user_name="GoldenTest"
        )
        trace = tracer.end_trace(result)

        # Check tool
        tool_used = None
        tool_steps = [s for s in trace["steps"] if s["type"] == "tool_call"]
        if tool_steps:
            tool_used = tool_steps[0]["data"]["tool"]
        tool_correct = tool_used == test_case["expected_tool"]

        # Check content
        content_correct = all(
            keyword in result for keyword in test_case["expected_contains"]
        )

        passed = tool_correct and content_correct
        results.append({
            "input": test_case["input"],
            "category": test_case["category"],
            "passed": passed,
            "tool_expected": test_case["expected_tool"],
            "tool_actual": tool_used,
            "output_preview": result[:100],
        })
        status = "PASS" if passed else "FAIL"
        print(f"[{status}] {test_case['input'][:40]}")

    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    print(f"\n{passed}/{total} tests passed ({passed/total*100:.0f}%)")
    return results
LangSmith — Professional Monitoring Tool
LangSmith is from the LangChain team but works with any framework. It gives you a web dashboard to see all LLM calls, debug them, and evaluate them.
# Install: pip install langsmith
# Set environment variables:
#   LANGCHAIN_API_KEY=your_key
#   LANGCHAIN_TRACING_V2=true
#   LANGCHAIN_PROJECT=my-agent

from langsmith import traceable
from openai import OpenAI

client = OpenAI()


@traceable(name="agent_response")
def get_response(user_input: str) -> str:
    """Every call is automatically recorded in LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Helpful assistant"},
            {"role": "user", "content": user_input}
        ]
    )
    return response.choices[0].message.content


# Every time you call get_response,
# a complete trace appears in the LangSmith dashboard:
#   - Input and output
#   - Token usage
#   - Execution time
#   - Errors
Why is LangSmith useful?
- See all traces in one dashboard
- Filter — for example, only traces with errors
- Shows cost per conversation
- Build datasets and set up automated evaluations
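That last point deserves a sketch. The calls below use the langsmith Python SDK's Client (create_dataset and create_example) as I understand it today, so treat the exact signatures as an assumption and check the current docs:

from langsmith import Client

client = Client()

# Create a dataset in LangSmith and fill it from the golden examples defined earlier.
dataset = client.create_dataset(dataset_name="my-agent-golden")
for case in GOLDEN_DATASET:
    client.create_example(
        inputs={"question": case["input"]},
        outputs={"expected_tool": case["expected_tool"]},
        dataset_id=dataset.id,
    )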
Regression Testing — Make Sure Nothing Broke
import json


class RegressionTest:
    """Regression test — make sure a new update
    did not break anything."""

    def __init__(self, baseline_file: str):
        self.baseline_file = baseline_file

    def save_baseline(self, results: list):
        """Save current results as baseline."""
        with open(self.baseline_file, "w") as f:
            json.dump(results, f, ensure_ascii=False)

    def compare(self, new_results: list) -> dict:
        """Compare new results with baseline."""
        with open(self.baseline_file) as f:
            baseline = json.load(f)

        regressions = []
        improvements = []
        for old, new in zip(baseline, new_results):
            if old["passed"] and not new["passed"]:
                regressions.append({
                    "input": new["input"],
                    "was": "PASS",
                    "now": "FAIL",
                })
            elif not old["passed"] and new["passed"]:
                improvements.append({
                    "input": new["input"],
                    "was": "FAIL",
                    "now": "PASS",
                })

        return {
            "regressions": regressions,
            "improvements": improvements,
            "regression_count": len(regressions),
            "improvement_count": len(improvements),
        }
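Tying this to the golden tests above, the typical workflow is: run the suite once on a known-good version and store a baseline, then compare every later run against it. A sketch, assuming run_golden_tests and RegressionTest as defined earlier:

import asyncio

results = asyncio.run(run_golden_tests())

reg = RegressionTest("golden_baseline.json")
# First run on a known-good version: store the baseline.
# reg.save_baseline(results)

# Later runs: compare against the stored baseline.
report = reg.compare(results)
if report["regression_count"] > 0:
    print("Regressions detected:")
    for r in report["regressions"]:
        print(f"  {r['input']}: {r['was']} -> {r['now']}")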
Practical Debugging Tips
1. Always turn on verbose mode: During development, see all logs. Every LLM call, every tool, every decision.
2. Isolate the problem: If the Agent acts incorrectly, first determine if the issue is with the LLM or the tool. Test the tool separately. If the tool is correct, the problem is with the prompt or model.
3. Lower the temperature: During testing, set temperature to 0 to make output as reproducible as possible (see the snippet after this list).
4. Save real conversations: Save real user conversations (with privacy considerations) and use them to improve the Agent.
5. Run A/B tests: Run two Agent versions simultaneously and see which performs better.
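For tip 3, with the OpenAI Chat Completions API that is a one-line change; the optional seed parameter further improves (but does not guarantee) reproducibility. A sketch, reusing the OpenAI client from the LangSmith example:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is 25 times 4?"}],
    temperature=0,  # as deterministic as the API allows
    seed=42,        # best-effort reproducibility across runs
)
print(response.choices[0].message.content)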
Summary
- Agent testing is hard but essential
- Log everything — without logs, debugging is impossible
- Unit test the tools — they are deterministic
- Test Agent behavior, not exact output
- Build a golden dataset and run regression tests
- Use LangSmith or similar tools for monitoring
Next episode — the final one in the series — we build a complete Research Agent from scratch!