Why not let the AI extract reference numbers too?

Because reference numbers form a constrained, pattern-based space: Rolex references follow a small set of known patterns (for example, six-digit "1XXXXX" references for much of modern production, four- and five-digit references for older models, and letters for suffixes). A regex extractor either matches those patterns or it does not, with zero run-to-run variability. Using an LLM for something a regex solves deterministically introduces probabilistic failure modes (hallucinated numbers, misread characters) on work where there is no ambiguity to reason over. Rule of thumb: if the input space is enumerable, code it; if it is open-ended, let the model handle it.
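As a rough illustration of the regex approach, here is a minimal sketch in Python. The pattern is an assumption made for this example (six digits starting with 1, plus an optional letter suffix); it is not Mazalgo's production pattern set, and the function name is hypothetical.

```python
import re

# Assumed pattern for illustration only: a six-digit reference starting
# with 1, optionally followed by a 1-4 letter suffix such as "LN".
ROLEX_REF = re.compile(r"\b1\d{5}(?:[A-Z]{1,4})?\b")


def extract_rolex_refs(text: str) -> list[str]:
    """Return every reference-like token found in the text, suffix included."""
    # Upper-casing first lets the pattern match suffixes typed in lowercase.
    return ROLEX_REF.findall(text.upper())


print(extract_rolex_refs("WTS Sub 126610LN $12.5k mint"))
print(extract_rolex_refs("Letting go of my 126610 for twelve five, complete"))
```

The same input always yields the same list, and a token that does not fit the pattern is simply not returned rather than guessed at.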

What is the most common failure mode in production AI agents?

Trusting prompt guidelines for safety-critical constraints. "Never send a message without approval" in a system prompt may work 99% of the time, which means it fails 1% of the time, usually when the input differs slightly from what was expected. Guardrails at the code layer (the send function literally does not exist in the agent's callable tools) hold on every call, regardless of how the prompt is worded. Every prompt-based safety constraint should be treated as a suggestion, not a control.
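A minimal sketch of a code-layer guardrail, assuming a simple dictionary-based tool registry. The names `AGENT_TOOLS`, `read_messages`, and `dispatch` are hypothetical and stand in for whatever tool-calling layer is actually used.

```python
from typing import Callable


def read_messages(group_id: str) -> list[str]:
    """Hypothetical read-only tool; returns canned data for this sketch."""
    return [f"sample message from {group_id}"]


# The only tools the agent can call. There is deliberately no send tool,
# so no prompt or model output can trigger a send.
AGENT_TOOLS: dict[str, Callable[..., object]] = {
    "read_messages": read_messages,
}


def dispatch(tool_name: str, **kwargs):
    """Run a tool requested by the model; unknown names fail closed."""
    tool = AGENT_TOOLS.get(tool_name)
    if tool is None:
        raise PermissionError(f"tool not available: {tool_name}")
    return tool(**kwargs)
```

If the model asks for `send_message`, `dispatch` raises instead of acting, whatever the system prompt says.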

How do you detect when a pipeline silently breaks?

Baseline metrics per scanner plus anomaly detection. Each pipeline has expected throughput bands (e.g., Reddit WTB scanner finds 15–30 leads per run on a typical day). When a run returns outside the band — especially below — it triggers a health check alert before any user notices missing deals. We also track extraction success rates (what percentage of scanned posts produce structured output) and alert when those drop significantly. Silent failures are the most dangerous; instrumentation is the defense.
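The two checks described above could be sketched roughly as follows. The 15 to 30 band comes from the example in the text; the `RunStats` fields, the function name, and the 80% extraction floor are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    scanner: str
    leads_found: int
    posts_scanned: int
    posts_extracted: int


def health_alerts(run: RunStats, band: tuple[int, int],
                  min_extraction_rate: float = 0.8) -> list[str]:
    """Return alert messages for a run; an empty list means the run looks healthy."""
    low, high = band
    alerts: list[str] = []
    # Throughput check: compare leads found against the expected band.
    if run.leads_found < low:
        alerts.append(f"{run.scanner}: {run.leads_found} leads is below the expected band {low}-{high}")
    elif run.leads_found > high:
        alerts.append(f"{run.scanner}: {run.leads_found} leads is above the expected band {low}-{high}")
    # Extraction check: share of scanned posts that produced structured output.
    if run.posts_scanned == 0:
        alerts.append(f"{run.scanner}: no posts scanned")
    else:
        rate = run.posts_extracted / run.posts_scanned
        if rate < min_extraction_rate:
            alerts.append(f"{run.scanner}: extraction success rate {rate:.0%} is below {min_extraction_rate:.0%}")
    return alerts


print(health_alerts(RunStats("reddit_wtb", 0, 100, 90), band=(15, 30)))
```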

Does the AI have access to the same data across users?

No — each user's agent session is scoped to their own inventory, hunt list, deal history, and alert queue via user-level access controls at the database and MCP-server layer. The underlying market data (auction comps, reference pricing) is shared because it is factual, but anything tied to an individual dealer's business is strictly isolated. This is enforced at the infrastructure layer, not the application layer — the agent does not have a tool to query another user's data.
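One way to picture "the agent does not have a tool to query another user's data" is a tool that is bound to a single user when the session starts. This is only a sketch of that idea: the class, table, and column names are hypothetical, and the database-level and MCP-server-level controls mentioned above are not shown here.

```python
import sqlite3


class ScopedInventoryTool:
    """Hypothetical tool bound to one user when the session is created."""

    def __init__(self, conn: sqlite3.Connection, user_id: str):
        self._conn = conn
        self._user_id = user_id

    def list_inventory(self) -> list[str]:
        # user_id is fixed at construction and is not a tool argument,
        # so the model has no parameter through which to name another user.
        rows = self._conn.execute(
            "SELECT reference FROM inventory WHERE user_id = ? ORDER BY reference",
            (self._user_id,),
        ).fetchall()
        return [row[0] for row in rows]
```

The design choice is that scoping happens where the tool is constructed, outside anything the model can influence, rather than being passed in with each call.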

What happens when an AI recommendation turns out to be wrong?

It goes into the agent_outcomes feedback log with the actual outcome (deal turned out bad, outreach did not convert, etc.) attached. Over time, this builds a calibration dataset showing where the agent's stated confidence is well calibrated and where it is optimistic. Threshold adjustments and context improvements flow from that data. A single wrong recommendation is not a failure; a pattern of wrong recommendations at a specific verdict tier is, and that is what the outcome log is designed to surface.
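A minimal sketch of how such a log could surface a weak verdict tier. The record shape (a verdict label paired with a correct/incorrect flag), the function names, and the default thresholds are assumptions for this example, not the actual agent_outcomes schema.

```python
from collections import defaultdict


def hit_rate_by_verdict(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of correct recommendations per verdict tier."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for verdict, was_correct in outcomes:
        totals[verdict] += 1
        hits[verdict] += int(was_correct)
    return {verdict: hits[verdict] / totals[verdict] for verdict in totals}


def tiers_needing_review(outcomes: list[tuple[str, bool]],
                         min_hit_rate: float = 0.7,
                         min_samples: int = 20) -> list[str]:
    """Tiers with enough samples whose hit rate is below the floor."""
    counts: dict[str, int] = defaultdict(int)
    for verdict, _ in outcomes:
        counts[verdict] += 1
    rates = hit_rate_by_verdict(outcomes)
    return sorted(v for v, rate in rates.items()
                  if counts[v] >= min_samples and rate < min_hit_rate)
```

The `min_samples` floor reflects the point made above: one wrong call is noise, while a low hit rate across many calls in the same tier is a pattern worth acting on.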

technology

agentic AI

AI guardrails

predictable AI

autonomous systems

AI architecture

watch trading technology

deterministic systems

Managing Agentic Behaviors: How We Engineer Predictable Outcomes from Autonomous Systems

The hard problem with AI agents isn't intelligence — it's predictability. How Mazalgo engineers deterministic outcomes via architectural guardrails.

Mazalgo Intelligence

4/16/2026

9 min read

The hardest problem in building AI agents isn't making them smart. It's making them predictable. Anyone can wire a language model to a data source and call it an "agent." The real engineering challenge is ensuring that agent does exactly what it should, every time, without supervision — and never does what it shouldn't.

The Predictability Problem

Deterministic vs. Probabilistic

Traditional software is deterministic: the same input produces the same output. AI systems are probabilistic: the same input can produce different outputs from run to run, and two inputs that mean the same thing can be handled differently depending on phrasing and surrounding context. When you're building a system that computes trade margins with real money on the line, variability is not a feature; it's a bug.

A pipeline that correctly parses "WTS Sub 126610LN $12.5k mint" but misreads "Letting go of my 126610 for twelve five, complete" is a pipeline you can't trust. Not because the model is bad — but because language is ambiguous and probabilistic systems can't guarantee consistency.

Deterministic Where It Matters, Intelligent Where It Helps

The foundational architecture decision at Mazalgo: we don't use AI for everything. We use deterministic systems for anything that touches a number and AI for anything that touches language.

Mazalgo System Design — Where Each Approach Is Applied

System Component	Approach	Why
Reference number extraction	Regex patterns per brand	A Rolex ref is a finite set of patterns — no ambiguity, no hallucination
Price extraction	Rule-based ($XX,XXX / XXk)	Dollar amounts are deterministic; rules handle edge cases explicitly
Condition detection	Keyword matching (BNIB, mint, etc.)	Condition terms are finite and consistent across dealer language
Margin calculation	Formula: (median − asking) / asking × 100	Math doesn't vary between runs; the same inputs always produce the same output
Deal scoring (STEAL/BUY/THIN/PASS)	Threshold logic on margin %	Verdicts are computed, not inferred — fully auditable and reproducible
Natural language summaries	LLM inference	Appropriate use: flexibility here has low stakes and high value
Sentiment analysis	LLM classification	Appropriate use: probabilistic output acceptable; no single result is load-bearing
Morning brief composition	LLM with structured data inputs	Language model writes narrative; structured data provides ground truth
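The margin formula and threshold-based verdicts in the table can be sketched as below. The formula is the one given in the table; the threshold values (20%, 10%, 3%) are hypothetical placeholders, since the article does not state the real cutoffs.

```python
def margin_pct(median_price: float, asking_price: float) -> float:
    """Margin as a percentage: (median - asking) / asking * 100."""
    if asking_price <= 0:
        raise ValueError("asking_price must be positive")
    return (median_price - asking_price) / asking_price * 100


# Hypothetical cutoffs, highest first. Real thresholds are not given in the text.
HYPOTHETICAL_THRESHOLDS = [(20.0, "STEAL"), (10.0, "BUY"), (3.0, "THIN")]


def verdict(margin: float) -> str:
    """Map a margin percentage to a verdict using fixed thresholds."""
    for floor, label in HYPOTHETICAL_THRESHOLDS:
        if margin >= floor:
            return label
    return "PASS"


print(verdict(margin_pct(median_price=15000, asking_price=12500)))
```

Because both functions are pure, a verdict can be audited after the fact by re-running them on the stored median and asking price.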

Guardrails, Not Guidelines

The second architectural principle is the most important for production systems: constraints are enforced in code, not in prompts.

The Difference Between a Guideline and a Guardrail

Telling an AI "never send messages in WhatsApp groups" in a system prompt is a guideline. Building a service that physically cannot send messages — because the send function does not exist in its codebase — is a guardrail. Guardrails hold under adversarial inputs. Guidelines do not.

This principle applies at every level of our system. Our group monitoring service is listen-only by architecture: it has a receive function and no send function. Data writes go through validated schemas — an agent cannot store a deal without a reference number, a price, and a source. Rate limits and resource caps are infrastructure-level, not prompt-level.
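The validated-write rule (no deal without a reference number, a price, and a source) could look roughly like this. The `Deal` class and `store_deal` function are hypothetical names, and an in-memory list stands in for the real data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deal:
    reference: str
    price: float
    source: str

    def __post_init__(self):
        # Reject records whose required fields are present but empty or invalid.
        if not self.reference.strip():
            raise ValueError("reference is required")
        if self.price <= 0:
            raise ValueError("price must be positive")
        if not self.source.strip():
            raise ValueError("source is required")


def store_deal(store: list, **fields) -> Deal:
    """Validate first, then write; a missing field raises TypeError."""
    deal = Deal(**fields)
    store.append(deal)
    return deal
```

Validation runs before the append, so an invalid record never reaches the store, regardless of what the agent was instructed to do.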

Measuring Predictability in Production

How do you know an autonomous system is behaving correctly when no one is watching? You measure the outputs.

Every pipeline run produces countable results: leads scanned, deals extracted, matches found, alerts dispatched. These metrics are tracked per-interval and compared against historical baselines. When a scanner that normally finds 15–30 WTB leads per run suddenly returns zero, that's a signal — not that the market went quiet, but that the pipeline needs attention. When a WhatsApp bridge that processes 200 group messages per hour drops to 10, the health check surfaces it before any user notices missing deals.
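A simple way to compare a run against its historical baseline is a band around the historical median. This is a sketch under assumptions: the function names and the 50% tolerance are made up for illustration, and a production system might use a different statistic entirely.

```python
import statistics


def baseline_band(history: list[int], tolerance: float = 0.5) -> tuple[float, float]:
    """Band around the historical median; tolerance is a hypothetical knob."""
    median = statistics.median(history)
    return median * (1 - tolerance), median * (1 + tolerance)


def deviates(current: int, history: list[int], tolerance: float = 0.5) -> bool:
    """True when the current count falls outside the baseline band."""
    low, high = baseline_band(history, tolerance)
    return not (low <= current <= high)


# Hourly message counts for a bridge that normally handles about 200.
recent_hours = [200, 190, 210, 205, 195]
print(deviates(10, recent_hours))
```

Using the median rather than the mean keeps one unusual hour in the history from shifting the band much.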

Predictability isn't about perfection. It's about knowing when something deviates from expected behavior and having the instrumentation to catch it quickly.

The Trust Equation

Agentic systems succeed or fail based on trust. Can a trader trust that the system is watching while they sleep? Can they trust that a "STEAL" verdict means the margin is real? Can they trust that the agent won't accidentally send a message in a dealer group?

Trust comes from architecture, not promises. Deterministic extraction, mathematical pricing, structural guardrails, and continuous measurement — these engineering decisions are what make autonomous systems trustworthy. The alternative — an AI that's "usually right" — isn't good enough when real money is on the line.

Key Takeaways

✓ Deterministic systems (regex, rules, formulas) handle anything that touches money: reference extraction, pricing, margin calculations, deal scoring
✓ AI is used only where probabilistic output is appropriate: language summaries, sentiment classification, narrative composition
✓ Constraints are enforced in code (guardrails), not in system prompts (guidelines); guardrails hold under adversarial inputs

Mazalgo's agentic architecture is built on deterministic intelligence — buy zone calculations that are mathematical, not guessed.

Frequently Asked Questions

Related Articles

Agentic Containerization: The Hybrid Operating System Running Your Watch Business

How We Decide What Gets an Agent — And What Doesn't