Managing Agentic Behaviors: How We Engineer Predictable Outcomes from Autonomous Systems
The hard problem with AI agents isn't intelligence — it's predictability. How Mazalgo engineers deterministic outcomes via architectural guardrails.
The hardest problem in building AI agents isn't making them smart. It's making them predictable. Anyone can wire a language model to a data source and call it an "agent." The real engineering challenge is ensuring that agent does exactly what it should, every time, without supervision — and never does what it shouldn't.
The Predictability Problem
Deterministic vs. Probabilistic
Traditional software is deterministic: the same input produces the same output. AI systems are probabilistic: the same input can produce different outputs across runs, and two inputs with the same meaning can be handled differently depending on phrasing, context, and model state. When you're building a system that computes trade margins with real money on the line, that variability is not a feature; it's a bug.
A pipeline that correctly parses "WTS Sub 126610LN $12.5k mint" but misreads "Letting go of my 126610 for twelve five, complete" is a pipeline you can't trust. Not because the model is bad — but because language is ambiguous and probabilistic systems can't guarantee consistency.
Deterministic Where It Matters, Intelligent Where It Helps
The foundational architecture decision at Mazalgo: we don't use AI for everything. We use deterministic systems for anything that touches a number and AI for anything that touches language.
Mazalgo System Design — Where Each Approach Is Applied
| System Component | Approach | Why |
|---|---|---|
| Reference number extraction | Regex patterns per brand | Rolex reference numbers follow a finite set of patterns: no ambiguity, no hallucination |
| Price extraction | Rule-based ($XX,XXX / XXk) | Dollar amounts follow a small set of formats; rules handle edge cases explicitly |
| Condition detection | Keyword matching (BNIB, mint, etc.) | Condition terms are finite and consistent across dealer language |
| Margin calculation | Formula: (median − asking) / asking × 100 | Math doesn't vary between runs; the same inputs always produce the same output |
| Deal scoring (STEAL/BUY/THIN/PASS) | Threshold logic on margin % | Verdicts are computed, not inferred — fully auditable and reproducible |
| Natural language summaries | LLM inference | Appropriate use: flexibility here has low stakes and high value |
| Sentiment analysis | LLM classification | Appropriate use: probabilistic output acceptable; no single result is load-bearing |
| Morning brief composition | LLM with structured data inputs | Language model writes narrative; structured data provides ground truth |
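To make the deterministic rows of the table concrete, here is a minimal sketch of rule-based extraction. It is not Mazalgo's production code: the regex patterns, the two-term condition list, and the function names are assumptions chosen for illustration, and real per-brand patterns would be stricter.

```python
import re
from typing import Optional

# Assumed pattern: six digits plus an optional letter suffix (e.g. "126610LN").
# The lookbehind rejects digits that belong to a "$" amount or a longer token.
ROLEX_REF = re.compile(r"(?<![$\w])(\d{6}[A-Z]{0,4})\b")

# Assumed price formats: "$12,500", "$12500", "$12.5k", or a bare "12.5k".
PRICE_DOLLAR = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)(k)?", re.IGNORECASE)
PRICE_K = re.compile(r"\b(\d+(?:\.\d+)?)k\b", re.IGNORECASE)

# Illustrative subset of condition keywords; a real list would be longer.
CONDITIONS = ("bnib", "mint")


def extract_reference(text: str) -> Optional[str]:
    match = ROLEX_REF.search(text.upper())
    return match.group(1) if match else None


def extract_price(text: str) -> Optional[int]:
    match = PRICE_DOLLAR.search(text)
    if match:
        value = float(match.group(1).replace(",", ""))
        if match.group(2):  # trailing "k" means thousands
            value *= 1000
        return round(value)
    match = PRICE_K.search(text)
    if match:
        return round(float(match.group(1)) * 1000)
    return None  # no rule matched: report "unknown" instead of guessing


def extract_condition(text: str) -> Optional[str]:
    words = re.findall(r"[a-z]+", text.lower())
    for term in CONDITIONS:
        if term in words:
            return term
    return None


def parse_listing(text: str) -> dict:
    return {
        "reference": extract_reference(text),
        "price": extract_price(text),
        "condition": extract_condition(text),
    }
```

With these assumed rules, `parse_listing("WTS Sub 126610LN $12.5k mint")` yields the reference, a price of 12500, and the condition. A listing that spells the price out in words returns `None` for the price, so the gap is visible downstream rather than filled with an inferred number.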
Guardrails, Not Guidelines
The second architectural principle is the most important for production systems: constraints are enforced in code, not in prompts.
The Difference Between a Guideline and a Guardrail
Telling an AI "never send messages in WhatsApp groups" in a system prompt is a guideline. Building a service that physically cannot send messages — because the send function does not exist in its codebase — is a guardrail. Guardrails hold under adversarial inputs. Guidelines do not.
This principle applies at every level of our system. Our group monitoring service is listen-only by architecture: it has a receive function and no send function. Data writes go through validated schemas — an agent cannot store a deal without a reference number, a price, and a source. Rate limits and resource caps are infrastructure-level, not prompt-level.
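A small sketch of what "enforced in code" can look like. The class and field names (`GroupMonitor`, `Deal`, `DealStore`) are hypothetical, not Mazalgo's actual types; the point is only that the constraint lives in the structure of the program rather than in a prompt.

```python
from dataclasses import dataclass
from typing import List


class GroupMonitor:
    """Listen-only by construction: there is a receive path and no send path."""

    def __init__(self) -> None:
        self._inbox: List[str] = []

    def receive(self, message: str) -> None:
        self._inbox.append(message)

    def messages(self) -> List[str]:
        return list(self._inbox)


@dataclass(frozen=True)
class Deal:
    reference: str
    price: int
    source: str

    def __post_init__(self) -> None:
        # Validation runs on every construction, so an invalid Deal cannot exist.
        if not isinstance(self.reference, str) or not self.reference.strip():
            raise ValueError("a deal requires a reference number")
        if not isinstance(self.price, (int, float)) or self.price <= 0:
            raise ValueError("a deal requires a positive price")
        if not isinstance(self.source, str) or not self.source.strip():
            raise ValueError("a deal requires a source")


class DealStore:
    """Hypothetical in-memory store; writes only accept validated Deal objects."""

    def __init__(self) -> None:
        self._deals: List[Deal] = []

    def save(self, reference: str, price: int, source: str) -> Deal:
        deal = Deal(reference, price, source)  # raises before anything is stored
        self._deals.append(deal)
        return deal

    def count(self) -> int:
        return len(self._deals)
```

No instruction, however adversarial, can make `GroupMonitor` send a message, because no code path for sending exists. Likewise, a call such as `DealStore().save("", 12500, "group-a")` fails with a `ValueError` before the store is touched.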
Measuring Predictability in Production
How do you know an autonomous system is behaving correctly when no one is watching? You measure the outputs.
Every pipeline run produces countable results: leads scanned, deals extracted, matches found, alerts dispatched. These metrics are tracked per-interval and compared against historical baselines. When a scanner that normally finds 15–30 WTB leads per run suddenly returns zero, that's a signal — not that the market went quiet, but that the pipeline needs attention. When a WhatsApp bridge that processes 200 group messages per hour drops to 10, the health check surfaces it before any user notices missing deals.
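The baseline comparison can be sketched as follows. The function name, the use of a median as the baseline, and the 50% threshold are assumptions for illustration; the source does not specify how Mazalgo computes its baselines.

```python
from statistics import median
from typing import Optional, Sequence


def check_against_baseline(
    name: str,
    current: float,
    history: Sequence[float],
    low_ratio: float = 0.5,   # assumed threshold: alert below 50% of baseline
    min_history: int = 5,     # assumed minimum sample before alerting
) -> Optional[str]:
    """Return an alert message if `current` falls well below the historical median."""
    if len(history) < min_history:
        return None  # not enough data to define "normal" yet
    baseline = median(history)
    if current < low_ratio * baseline:
        return f"{name}: {current} is below {low_ratio:.0%} of baseline {baseline}"
    return None


# Example inputs mirroring the figures in the text above.
lead_alert = check_against_baseline("wtb_leads_per_run", 0, [18, 22, 27, 15, 30])
bridge_alert = check_against_baseline("group_messages_per_hour", 10, [200, 210, 190, 205, 195])
healthy = check_against_baseline("group_messages_per_hour", 190, [200, 210, 190, 205, 195])
```

Here `lead_alert` and `bridge_alert` are alert strings, while `healthy` is `None`. A median is used rather than a mean so that one unusually busy interval does not shift the baseline much.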
Predictability isn't about perfection. It's about knowing when something deviates from expected behavior and having the instrumentation to catch it quickly.
The Trust Equation
Agentic systems succeed or fail based on trust. Can a trader trust that the system is watching while they sleep? Can they trust that a "STEAL" verdict means the margin is real? Can they trust that the agent won't accidentally send a message in a dealer group?
Trust comes from architecture, not promises. Deterministic extraction, mathematical pricing, structural guardrails, and continuous measurement — these engineering decisions are what make autonomous systems trustworthy. The alternative — an AI that's "usually right" — isn't good enough when real money is on the line.
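For a "STEAL" verdict to be auditable, the margin and the verdict both have to be plain computation. The sketch below uses the margin formula stated earlier, (median − asking) / asking × 100. The threshold values are hypothetical placeholders, since the article does not publish the real cutoffs.

```python
from typing import Tuple

# Hypothetical cutoffs in percent, checked from highest to lowest.
THRESHOLDS: Tuple[Tuple[str, float], ...] = (
    ("STEAL", 15.0),
    ("BUY", 8.0),
    ("THIN", 3.0),
)


def margin_pct(market_median: float, asking: float) -> float:
    """Margin as a percentage of the asking price."""
    if asking <= 0:
        raise ValueError("asking price must be positive")
    return (market_median - asking) / asking * 100


def verdict(margin: float) -> str:
    """Map a margin percentage to a verdict using fixed thresholds."""
    for label, minimum in THRESHOLDS:
        if margin >= minimum:
            return label
    return "PASS"


# Worked example: market median 14,000 against an asking price of 12,500.
example_margin = margin_pct(14_000, 12_500)   # (14000 - 12500) / 12500 * 100, about 12%
example_verdict = verdict(example_margin)     # "BUY" under the assumed cutoffs
```

Because both functions are pure, re-running them on the same inputs always reproduces the same verdict, which is what makes a disputed score checkable after the fact.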
Key Takeaways
- Deterministic systems (regex, rules, formulas) handle anything that touches money: reference extraction, pricing, margin calculations, deal scoring
- AI is used only where probabilistic output is appropriate: language summaries, sentiment classification, narrative composition
- Constraints are enforced in code (guardrails), not in system prompts (guidelines); guardrails hold under adversarial inputs