Master-Data-Aware Purchase Order Extraction for Saft Valdosta v0.1

Charles Dana · Monce AI · April 2026

saft.aws.monce.ai · Internal technical note

Abstract

Saft America's Valdosta plant (GA) receives Purchase Orders from a heterogeneous supplier network — Verizon Ariba, Satair (aerospace spares), military primes, European rail, and direct customer layouts. Each PO references manufacturer part IDs (e.g. 80-94890-02, 005787-002, EFT01930_UC_NC) that must be reconciled against the plant's 9,425-article ERP master data (OM - Material). We formalize the matching step as a cascade over a weighted SAT classifier (Snake) and four deterministic fallbacks, and prove that the output is order-independent, auditable, and invariant to noise in the VLM-extracted description.

1. Problem statement

Let a Purchase Order be a set of n line items

L = { ℓ1, ℓ2, …, ℓn }

where each ℓi = (mi, ci, di, qi, ui), with mi the manufacturer part ID, ci the customer part number, di the VLM-extracted free-text description, qi the quantity, and ui the unit price.

Stage 4 of the pipeline must return an assignment Σ : L → M ∪ {⊥}, where M is the Saft Valdosta master data (|M| = 9,425 SKUs). Each SKU is a triple (number, material_id, description); normally number = material_id.
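As a concrete sketch of these objects (field names are illustrative, not the production schema), a line item and a master-data SKU can be modeled as:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineItem:
    mfr_pn: str        # m_i, manufacturer part ID
    cust_pn: str       # c_i, customer part number
    description: str   # d_i, VLM-extracted free text
    qty: float         # q_i, ordered quantity
    unit_price: float  # u_i, unit price

@dataclass(frozen=True)
class Sku:
    number: str        # normally equal to material_id
    material_id: str
    description: str

# Line 10 of the worked example in section 7:
line = LineItem("80-94890-02", "80-94890-02",
                "48V TEL.X-PLUS 180 X40 NI-CD BATTERY 172", 2, 6154.97)
```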

2. The Snake basis

Snake (v5.4.5, cf. Dana 2024) builds a weighted SAT classifier from (text, label) pairs. Three Snake models are trained: an article matcher (line text → master-data SKU), a customer classifier (header text → ordering customer, e.g. Verizon), and an entity classifier (header text → receiving Saft entity, e.g. Saft Valdosta).

Each model emits a prediction with confidence γ ∈ [0, 1] and a human-readable audit trace.

3. Matching cascade

Given a line ℓ = (m, c, d, q, u), define Σ : L → M ∪ {⊥} as the first-match cascade:

Σ(ℓ) = { k : key(k) = m }                        (exact on mfr PN, O(1))
      { k : key(k) = c }                         (exact on customer PN, O(1))
      { k : norm(key(k)) = norm(m) }             (normalized)
      article_matcher.predict(m + " " + d)       if γ ≥ θauto  (Snake, tier 3)
      fuzzy(m + " " + d, M, ≥ θfuzz)             (Levenshtein, tier 4)
      ⊥                                          otherwise

where norm(x) uppercases x and deletes every "-", "_", and " " character (so norm("80-94890-02") = "809489002"), θauto = 0.85, and θfuzz = 0.80. The cascade terminates in exactly one of six branches per line.
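A minimal Python sketch of the cascade, assuming a dict-backed master index (key → SKU id) and a caller-supplied snake_predict returning (sku, γ); difflib's SequenceMatcher stands in for a proper Levenshtein ratio in tier 4:

```python
import re
from difflib import SequenceMatcher

THETA_AUTO, THETA_FUZZ = 0.85, 0.80

def norm(x: str) -> str:
    """Uppercase and delete every '-', '_' and space."""
    return re.sub(r"[-_ ]", "", x).upper()

def match_line(m, c, d, master, snake_predict):
    """First-match cascade Σ over master (key -> SKU id).
    snake_predict(text) -> (sku, gamma)."""
    if m in master:                              # tier 0: exact on mfr PN
        return master[m], "exact_mfr", 1.0
    if c in master:                              # tier 1: exact on customer PN
        return master[c], "exact_cust", 1.0
    norm_index = {norm(k): v for k, v in master.items()}
    if norm(m) in norm_index:                    # tier 2: normalized
        return norm_index[norm(m)], "normalized", 1.0
    sku, gamma = snake_predict(m + " " + d)      # tier 3: Snake
    if gamma >= THETA_AUTO:
        return sku, "snake", gamma
    best_key, best = None, 0.0                   # tier 4: fuzzy, O(|M|)
    for k in master:
        r = SequenceMatcher(None, norm(m + " " + d), norm(k)).ratio()
        if r > best:
            best_key, best = k, r
    if best >= THETA_FUZZ:
        return master[best_key], "fuzzy", best
    return None, "no_match", 0.0                 # ⊥
```

The first-match structure makes the branch taken per line unique, matching the six-branch claim above.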

4. Confidence aggregation

For a PO P, the overall trust score is

T(P) = α · γcust + β · γent + (1 − α − β) · meani(γi)

where γcust and γent are the customer- and entity-model confidences and γi is the match confidence of line ℓi.

with α = β = 1/3. The router auto-approves when T ≥ 0.85.
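The aggregation is a direct transcription (function and variable names are illustrative):

```python
ALPHA = BETA = 1 / 3  # equal weights for customer, entity, and line mean

def trust_score(gamma_cust, gamma_ent, line_gammas):
    """T(P) = α·γcust + β·γent + (1 − α − β)·mean_i(γi)."""
    mean_line = sum(line_gammas) / len(line_gammas)
    return ALPHA * gamma_cust + BETA * gamma_ent + (1 - ALPHA - BETA) * mean_line

def auto_approve(t, threshold=0.85):
    return t >= threshold
```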

5. Soundness

Determinism. Snake is a deterministic SAT classifier: identical input yields identical output, independent of training order. The fallback tiers are dictionary lookups and pure string functions.

Auditability. Each matched line carries (method, confidence, audit_trace). For Snake predictions, the audit contains the triggered SAT clauses and the bucket depth, making every decision reconstructible by hand from data/models/*.json.
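The per-line audit payload can be sketched as a flat record (field names and the sample trace string are illustrative; the real traces live in data/models/*.json):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MatchAudit:
    sku: str            # matched master-data article, "" for ⊥
    method: str         # exact_mfr | exact_cust | normalized | snake | fuzzy | no_match
    confidence: float   # γ for Snake/fuzzy tiers, 1.0 for exact tiers
    audit_trace: list   # e.g. triggered SAT clauses and bucket depth

audit = MatchAudit("1900069889", "snake", 0.92,
                   ["clause: '48V' AND 'NI-CD' -> bucket 7, depth 3"])
payload = json.dumps(asdict(audit))  # serializable for offline review
```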

Noise invariance. The exact and normalized tiers depend only on the part numbers m and c, never on the VLM-extracted description d, so noise in d can only affect the Snake and fuzzy tiers, where the γ thresholds gate acceptance. Separately, the Dana Theorem (2024) bounds Snake's SAT formula at polynomial size in |M| and linear in the number of layers; adding a new SKU to the basis and retraining preserves all prior correct matches that do not collide on literal tests, and in practice collisions are negligible at |M| = 9,425.

6. Complexity

Per-line match cost: O(|M|) worst case (fuzzy tier), O(1) for the exact tiers, and O(D · b) for Snake, with D = number of layers and b = bucket size = 250 (cf. the 10x Method). For |M| = 9,425 and D = 15, a Snake prediction runs in under 10 ms on the production EC2 instance. End-to-end latency is dominated by the VLM stages 2 and 5.

7. Worked example

Verizon Ariba PO 3002630800, line 10:

m = "80-94890-02"              # Manufacturer Part ID
c = "80-94890-02"              # Customer Part # (identical)
d = "48V TEL.X-PLUS 180 X40 NI-CD BATTERY 172"
q = 2 EA, u = $6,154.97
exact(m)    = ⊥          (not a Saft SKU — customer-side PN)
exact(c)    = ⊥          (same)
normalized  = ⊥
article_matcher.predict("80-94890-02 48V TEL.X-PLUS ...")
            = SKU "1900069889"   γ = 0.92
Σ(ℓ)       = (1900069889, snake, 0.92)
ρ(1900069889) = "48V TELX-PLUS 180 NICD BATT"   (from OM-Material)

Router: auto-approve if T(P) ≥ 0.85, aggregating both line confidences with the customer-model (Verizon) and entity-model (Saft Valdosta) scores.
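Plugging in illustrative header confidences (γcust = γent = 0.95 are assumed values; only the line-level 0.92 comes from the example above):

```python
ALPHA = BETA = 1 / 3

# Assumed: header-model confidences (not given in the example)
gamma_cust, gamma_ent = 0.95, 0.95
# Assumed: line 2 also matched by Snake at 0.92; line 1's 0.92 is from the example
line_gammas = [0.92, 0.92]

t = (ALPHA * gamma_cust + BETA * gamma_ent
     + (1 - ALPHA - BETA) * sum(line_gammas) / len(line_gammas))
print(round(t, 4), t >= 0.85)  # → 0.94 True
```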

8. Limits & open problems