Six months ago, I set out to build something specific: an AI system that could do what I'd been doing manually for seven years as a treasury operations manager. Not a chatbot. Not a summariser. A treasury analyst that could ingest real financial data, reason about it the way an experienced operator would, and produce outputs that a CFO could actually use.

I chose Claude for the build. What followed was a crash course in the difference between prompt engineering (getting an LLM to produce good text) and AI product design (building a system that reliably produces correct decisions under real-world conditions). They're not the same thing, and conflating them is the most common mistake I see in the AI product space.

This is a walkthrough of what I built, the architecture decisions behind each component, what worked, what failed, and the design principles I extracted from the process.

The Problem: Cash Flow Forecasting Is Broken

Every treasury team I've worked with forecasts cash flow the same way. They take historical collection behaviours (what percentage of invoices get paid at 30, 60, 90 days), average them, and apply those averages to the current receivables portfolio. It's a spreadsheet exercise, usually maintained by one person who understands the formulas, and it breaks the moment that person goes on holiday.

The fundamental problem isn't complexity. It's that the standard approach ignores at least three things that matter enormously: seasonality (clients pay differently in December than in March), recency bias (last month's collection behaviour is more predictive than behaviour from 12 months ago), and partial payments (a growing percentage of collections that traditional models either ignore or misclassify).

I'd already solved this manually. I built a collections behaviour model using weighted moving averages with a decay factor (λ=0.9, so each month further back carries 90% of the weight of the month after it), seasonal indices calculated per collection bucket, and explicit handling of partial payment patterns. It improved forecast accuracy by roughly 30% compared to simple averages. But it was a spreadsheet, and it took me a full day each month to update across 10 countries.
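To make the mechanics concrete, here is a minimal sketch of that kind of model: a decay-weighted average of historical collection rates plus a per-month seasonal index. The function names and the sample rates are illustrative, not the actual spreadsheet formulas.

```python
DECAY = 0.9  # lambda: each month further back gets 90% of the next month's weight

def weighted_collection_rate(monthly_rates):
    """monthly_rates: most recent month first, e.g. the fraction of invoices
    collected within a bucket for each of the last N months."""
    weights = [DECAY ** i for i in range(len(monthly_rates))]
    return sum(r * w for r, w in zip(monthly_rates, weights)) / sum(weights)

def seasonal_index(avg_rate_by_month, target_month):
    """Ratio of one calendar month's average rate to the overall average."""
    overall = sum(avg_rate_by_month.values()) / len(avg_rate_by_month)
    return avg_rate_by_month[target_month] / overall

# Last six months of 30-day collection rates, newest first (hypothetical data).
rates = [0.82, 0.78, 0.80, 0.75, 0.70, 0.72]
base = weighted_collection_rate(rates)
```

Because recent months carry more weight, `base` sits closer to the latest observations than a simple average would, which is exactly the recency-bias correction described above.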

The question was: could Claude do this reliably?

Architecture: Skills, Not Prompts

The first thing I learned is that a single prompt, no matter how well-crafted, is the wrong abstraction for a complex analytical task. What you need is a skill: a structured set of instructions, context, and tool definitions that the model can use to execute a multi-step workflow.

My treasury analyst skill has four components:

System Architecture
Skill definition:
- Context: domain rules, terminology, country-specific norms
- Tools: data ingestion, SQL generation, calculation engine, output formatting
- Workflow: ingest → validate → analyse → calculate → output
- Guards: confidence thresholds, sanity checks, escalation rules

Data layer: NetSuite data, bank statements, cohort files, country configs
Output layer: Looker-ready forecasts, Slack alerts, exception reports, confidence scores

The context layer is where operational knowledge lives. This isn't a generic system prompt. It encodes specific domain rules: how securitisation facilities affect cash-in timing, why UK and Spanish collection patterns differ structurally, what a "partial payment" means in practice versus theory. I spent more time writing this context than any other part of the system. It's the equivalent of hiring an analyst and giving them a month of onboarding.

Design Decision 1: Structured Outputs Over Free Text

My first iteration let Claude generate free-text analysis. "Based on the data, Spain's collections appear to be declining..." The output was eloquent and useless. A CFO doesn't want prose. They want a number, a confidence interval, and a one-line explanation of what changed.

I redesigned the output layer to enforce structured responses. Every analysis produces a JSON object with specific fields: forecast value, confidence level (0-100), variance from previous forecast, top three drivers of change, and a recommended action. The free-text explanation became optional context, not the primary output.
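A hypothetical shape for that structured output might look like this. The field names mirror the ones listed above but are illustrative, not the actual schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ForecastOutput:
    country: str
    forecast_value: float         # expected cash-in, base currency
    confidence: int               # 0-100
    variance_vs_previous: float   # change versus the previous forecast run
    top_drivers: list             # top three drivers of change
    recommended_action: str
    explanation: str = ""         # optional free-text context, not the primary output

out = ForecastOutput(
    country="ES",
    forecast_value=4_250_000.0,
    confidence=72,
    variance_vs_previous=-180_000.0,
    top_drivers=["slower 60-90d bucket", "seasonal dip", "partial payments up"],
    recommended_action="review top 5 overdue accounts",
)
payload = json.dumps(asdict(out))  # machine-readable; humans review, not interpret
```

Enforcing the schema also makes downstream consumers (dashboards, alerts) trivial to build, because every analysis run emits the same fields.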

Design principle: LLMs are seductive because they produce human-readable output. In operational contexts, this is a bug, not a feature. Your AI analyst should produce machine-readable decisions that humans can review, not essays that humans need to interpret.

Design Decision 2: Calculation Verification

LLMs are unreliable at arithmetic. This is well-known, but the implications for financial product design are underappreciated. My treasury analyst performs calculations: weighted averages, seasonal adjustments, variance analysis. If any of these are wrong, the output is worse than useless because it looks authoritative.

The solution was to separate reasoning from calculation. Claude determines what needs to be calculated and why, then calls external tools to perform the actual maths. The weighted moving average formula, the seasonal index computation, the forecast aggregation: these all execute as deterministic code, not LLM inference.

Claude's role becomes orchestration and interpretation: deciding which calculation to run, on which data subset, and what the result means in business context. This is what LLMs are genuinely good at. The numbers themselves never touch the model's reasoning.
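The reasoning/calculation split can be sketched as a small tool registry: the model emits a tool request, deterministic code executes it, and only the interpreted result flows back. The dispatch mechanism and tool names here are illustrative.

```python
def weighted_average(values, decay=0.9):
    # Deterministic maths: never left to model inference.
    weights = [decay ** i for i in range(len(values))]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def seasonal_adjust(value, index):
    return value * index

TOOLS = {
    "weighted_average": weighted_average,
    "seasonal_adjust": seasonal_adjust,
}

def run_tool_call(call):
    """call: a tool request emitted by the model, e.g.
    {"tool": "weighted_average", "args": {"values": [0.8, 0.75]}}"""
    return TOOLS[call["tool"]](**call["args"])

result = run_tool_call(
    {"tool": "weighted_average", "args": {"values": [0.8, 0.75, 0.7]}}
)
# The number returns to the model for interpretation only, never recomputation.
```

The design choice is that the model's output is a *request* for a calculation, which is auditable, rather than a number, which would have to be trusted.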

Design Decision 3: Country-Specific Reasoning

One of the biggest failures in my early iterations was treating all countries identically. The model would analyse Spanish receivables using assumptions that were only valid for the UK. This produced technically correct but operationally meaningless outputs.

I solved this with country configuration files: structured documents that encode the specific characteristics of each market. Spain's config includes: typical payment terms (60-90 days), securitisation facility rules and their impact on cash-in timing, seasonal patterns (August is dead, January is slow, September recovery). The UK config is entirely different: shorter payment terms (30 days), invoice finance mechanics, and a December slowdown that doesn't affect Spain the same way.

When the skill processes data for a specific country, it loads the corresponding config and adjusts its reasoning accordingly. This was a product design decision, not a prompt engineering one. The model doesn't need to "learn" country differences from examples. It needs access to structured domain knowledge that constrains its reasoning appropriately.
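In spirit, the country configs are just structured lookups the skill loads before reasoning about a market. The values below paraphrase the characteristics described above; none of them are the real parameters.

```python
# Hypothetical country configs (illustrative values, not production parameters).
COUNTRY_CONFIGS = {
    "ES": {
        "payment_terms_days": (60, 90),
        "securitisation": True,   # facility rules shift cash-in timing
        "seasonal_notes": {"aug": "dead", "jan": "slow", "sep": "recovery"},
    },
    "UK": {
        "payment_terms_days": (30, 30),
        "invoice_finance": True,
        "seasonal_notes": {"dec": "slowdown"},
    },
}

def load_config(country_code):
    """Return the structured domain knowledge that constrains reasoning
    for one market."""
    return COUNTRY_CONFIGS[country_code]
```

Keeping this knowledge in data rather than in prompt examples means a new market is an additive config change, not a re-engineering of the skill.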

What Failed

Failure 1: Overconfident anomaly detection

I built an anomaly detection module that was supposed to flag unusual collection patterns. It flagged everything. A client paying 3 days earlier than usual? Anomaly. A seasonal dip that happens every August? Anomaly. The model had no concept of baseline volatility.

The fix was adding a statistical context layer: for each metric, the system calculates the historical coefficient of variation. Anomalies are only flagged when they exceed 2 standard deviations from the country-specific, seasonally-adjusted baseline. This cut false positives by roughly 80%.
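A minimal version of that statistical guard: flag a value only when it sits more than 2 standard deviations from the seasonally-adjusted historical baseline. The function signature and sample history are illustrative.

```python
import statistics

def is_anomaly(value, history, seasonal_index=1.0, z_threshold=2.0):
    """history: past observations for the same country and metric,
    adjusted here by a seasonal index before computing the baseline."""
    adjusted = [h * seasonal_index for h in history]
    mean = statistics.mean(adjusted)
    stdev = statistics.stdev(adjusted)
    if stdev == 0:
        return value != mean  # flat history: any deviation is unusual
    return abs(value - mean) / stdev > z_threshold

# A client paying slightly early is within baseline volatility; a 40% jump is not.
history = [100, 103, 98, 101, 99, 102]
```

This is what gives the system a concept of baseline volatility: "unusual" is defined relative to how noisy the metric normally is, not relative to zero.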

Failure 2: The "helpful analyst" problem

Claude is trained to be helpful. In a treasury context, this means it will find a narrative to explain any data pattern, even when the correct answer is "the data is incomplete" or "there's no statistically significant trend here." My early system would confidently explain a 2% variance that was well within normal noise.

I addressed this by adding explicit materiality thresholds to the skill definition. Variances below €100K or 5% are noted but not analysed. The system is instructed to distinguish between "this is a real signal" and "this is noise" before generating any explanation. It sounds simple. Getting it to work reliably took dozens of iterations.
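The materiality gate itself reduces to a simple predicate, applied before any explanation is generated. The thresholds match the ones quoted above; the function name is illustrative.

```python
MATERIAL_ABS_EUR = 100_000   # absolute threshold from the skill definition
MATERIAL_PCT = 0.05          # relative threshold

def is_material(variance_eur, base_eur):
    """A variance below either threshold is noted but not analysed."""
    abs_ok = abs(variance_eur) >= MATERIAL_ABS_EUR
    pct_ok = abs(variance_eur) / abs(base_eur) >= MATERIAL_PCT
    return abs_ok and pct_ok
```

The hard part, as the text notes, wasn't the predicate but getting the model to honour it consistently: the check gates whether an explanation is even requested.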

Failure 3: Temporal reasoning

The skill needed to reason about time: "collections from invoices issued in March that are currently at 90+ days." LLMs are surprisingly poor at temporal reasoning when the relationships are implicit. I had to restructure data inputs so that temporal relationships were explicit: every data point tagged with issue date, due date, current aging bucket, and collection date. Pre-processing that would be trivial for a human analyst turned out to be critical for reliable AI performance.
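The restructured input can be pictured as a record where every temporal relationship is precomputed rather than implied. The field names and bucket boundaries are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Receivable:
    invoice_id: str
    issue_date: date
    due_date: date
    collection_date: Optional[date]  # None while still open

    def aging_bucket(self, as_of: date) -> str:
        """Explicit aging so the model never infers dates arithmetically."""
        days = (as_of - self.due_date).days
        if days <= 0:
            return "current"
        if days <= 30:
            return "1-30"
        if days <= 60:
            return "31-60"
        if days <= 90:
            return "61-90"
        return "90+"

inv = Receivable("INV-001", date(2025, 3, 10), date(2025, 5, 9), None)
```

With inputs shaped like this, "invoices issued in March currently at 90+ days" becomes a filter over explicit fields instead of a date calculation the model has to perform.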

What I Learned About AI Product Design

Building this system taught me five things that I haven't seen in any AI PM curriculum:

  1. The 80/20 rule is inverted. 80% of your development time should go into the 20% of cases that are exceptions. The standard cases are easy. The edge cases are where your product either earns trust or loses it permanently.
  2. Domain context is more important than model selection. The difference between Claude Sonnet and Claude Opus mattered far less than the difference between a generic prompt and a well-structured skill with proper domain context. Most AI product discussions focus on the wrong variable.
  3. Deterministic rails matter. Any step that requires precision (arithmetic, date calculations, currency conversion) should be executed deterministically, not inferred by the model. The LLM orchestrates; code calculates.
  4. Confidence calibration is a product feature. The system's ability to say "I'm 60% confident and here's why" is more valuable than being right 90% of the time without expressing uncertainty. Finance professionals trust systems that know their own limits.
  5. Iteration speed is everything. I went through 40+ versions of the skill definition. Each iteration was informed by real data producing wrong results. You can't design an AI product in a specification document. You design it by shipping, failing, and fixing.

Where This Goes Next

The treasury analyst skill is a proof of concept for something bigger: autonomous financial agents that can operate across entire treasury functions. Cash application, collections optimisation, liquidity forecasting, counterparty risk monitoring: each of these is a skill that can be designed with the same framework. Connected together through an orchestration layer, they form the autonomous treasury stack.

I'm not building a product to sell. I'm building the technical intuition to design these systems at scale. Every skill I build, every failure I debug, every design principle I extract: it's preparation for the moment when a company with €1B+ in receivables decides they want their treasury function to run itself.

That moment is coming faster than most people think.