Overview

The Schema Mapper service accepts two schemas, a source and a target, in any supported format and returns a set of field-level mappings. Each mapping includes a confidence score, a suggested transformation function, and metadata explaining how the match was derived.

Instead of formal schemas, you can provide sample payloads (JSON objects, CSV rows, XML documents) and the service will infer the schema automatically. This makes the service ideal for rapid integration prototyping, ETL pipeline design, and data migration planning.

Sub-Second Mapping

Deterministic pre-matching resolves 40–60% of fields without any LLM call, and the full pipeline completes in under 2 seconds for typical schemas.

LLM-Enhanced

Ambiguous fields are resolved through semantic analysis by Gemini, which considers domain context and business meaning rather than string similarity alone.

Executable Output

Optionally generates executable mapping functions in your target language, ready to drop into your codebase for immediate use.

6-Stage Mapping Pipeline

Each schema mapping request flows through a multi-stage pipeline that balances speed and accuracy.

1. Schema Normalization

Converts all 6 supported input formats (JSON Schema, JSON sample, CSV headers, XML sample, OpenAPI, natural language) into a Canonical Intermediate Representation (CIR). The CIR captures field names, types, nesting depth, cardinality, and sample values in a unified structure.
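
As a rough illustration of what normalizing a JSON sample might involve, the sketch below walks a sample object and records a path, inferred type, nesting depth, cardinality flag, and sample value for each field. The `CirField` structure and the function names are hypothetical; the service's actual CIR layout is not documented here.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CirField:
    """One field in a hypothetical canonical representation."""
    path: str       # dotted path, e.g. "address.city"
    type: str       # inferred type name
    depth: int      # nesting depth (top level = 0)
    is_array: bool  # cardinality: True if the field holds a list
    sample: Any     # one sample value


def infer_type(value: Any) -> str:
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    return "object"


def normalize_json_sample(sample: dict, prefix: str = "", depth: int = 0) -> list[CirField]:
    fields = []
    for name, value in sample.items():
        path = f"{prefix}.{name}" if prefix else name
        is_array = isinstance(value, list)
        # Unwrap arrays and inspect the first item, if there is one.
        item = value[0] if is_array and value else (None if is_array else value)
        if isinstance(item, dict):
            # Nested objects contribute their leaf fields one level deeper.
            fields.extend(normalize_json_sample(item, path, depth + 1))
        else:
            fields.append(CirField(path, infer_type(item), depth, is_array, item))
    return fields
```

Only the first array item is inspected, which is a simplification; a fuller implementation would likely merge types across all items.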

2. Deterministic Pre-Matching

Runs three sub-passes: exact match, abbreviation expansion (e.g., amt → amount), and normalized match (case folding plus underscore/camelCase normalization). Typically resolves 40–60% of all field pairs with a confidence of 0.95 or higher, consuming zero LLM tokens.

3. Type-Based Filtering

For remaining unmatched fields, the engine builds a candidate matrix filtered by type compatibility. A string field will not be paired with a boolean unless a known coercion path exists. This step reduces the candidate space by 60–80%, dramatically lowering LLM costs.
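
The filtering step can be pictured as a cross-product pruned by a compatibility check. The sketch below is an assumption about how such a filter could look; the `COERCIBLE` table and function names are illustrative and not taken from the service.

```python
# Hypothetical coercion table: (source_type, target_type) pairs treated as safe.
COERCIBLE = {
    ("integer", "number"),
    ("integer", "string"),
    ("number", "string"),
}


def compatible(src_type: str, tgt_type: str) -> bool:
    """Identical types always pass; otherwise a known coercion path is required."""
    return src_type == tgt_type or (src_type, tgt_type) in COERCIBLE


def candidate_pairs(source: dict, target: dict) -> list:
    """source and target map field name -> type; returns surviving (src, tgt) pairs."""
    return [
        (s_name, t_name)
        for s_name, s_type in source.items()
        for t_name, t_type in target.items()
        if compatible(s_type, t_type)
    ]
```

With three unmatched fields on each side, two strings and one boolean, the nine possible pairs shrink to five, because string and boolean fields are never paired with each other in this table.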

4. LLM Semantic Matching

The filtered candidate pairs are sent to Gemini for semantic analysis. The LLM evaluates domain context, business meaning, and field descriptions to produce a confidence score for each candidate mapping. Only the ambiguous 40–60% of fields reach this stage.

5. Transformation Inference

For each confirmed mapping, the engine infers the required transformation: type coercion (string→integer), unit conversion (lbs→kg), date format conversion (MM/DD/YYYY→ISO 8601), string manipulation (concatenation, splitting), and enum value mapping.
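
To make these transformation categories concrete, here is a minimal sketch of three of them in Python. The function names are hypothetical, and of the registry keys only `direct_copy` and `date_format_convert` appear in the example response in this document; the other keys are invented for illustration.

```python
from datetime import datetime

KG_PER_LB = 0.45359237  # exact by definition of the international pound


def us_date_to_iso(value: str) -> str:
    """Date format conversion: MM/DD/YYYY -> ISO 8601 (YYYY-MM-DD)."""
    # Parsing to a date (not a timestamp) avoids any timezone shift.
    return datetime.strptime(value, "%m/%d/%Y").date().isoformat()


def lbs_to_kg(pounds: float) -> float:
    """Unit conversion: pounds -> kilograms."""
    return pounds * KG_PER_LB


def string_to_int(value: str) -> int:
    """Type coercion: string -> integer; raises ValueError on bad input."""
    return int(value.strip())


# Hypothetical registry keyed by transformation name.
TRANSFORMS = {
    "direct_copy": lambda v: v,
    "date_format_convert": us_date_to_iso,
    "unit_convert_lbs_kg": lbs_to_kg,
    "coerce_string_integer": string_to_int,
}
```

For example, `TRANSFORMS["date_format_convert"]("03/18/2026")` returns `"2026-03-18"`.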

6. Code Generation

Optionally generates executable mapping functions in the specified target language (JavaScript, Python, C#, etc.). The generated code includes null checks, type validation, and fallback values, ready for production use.
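
The idea of emitting a mapping function with null checks and fallback values can be sketched as simple source-text generation. This is an illustrative assumption, not the service's generator: `generate_mapper`, `map_record`, and the `fallback` key are hypothetical, and only top-level fields are handled.

```python
def generate_mapper(mappings: list) -> str:
    """Return Python source for a function that applies the given mappings.

    Each mapping is a dict with "source_path", "target_path", and an
    optional "fallback" used when the source value is missing or None.
    """
    lines = ["def map_record(src):", "    out = {}"]
    for m in mappings:
        # .get() returns None for a missing key, which doubles as the null check.
        lines.append(f"    value = src.get({m['source_path']!r})")
        lines.append(
            f"    out[{m['target_path']!r}] = "
            f"value if value is not None else {m.get('fallback')!r}"
        )
    lines.append("    return out")
    return "\n".join(lines)
```

The returned string can be written to a file or, for a quick check, loaded with `exec(code, namespace)` and called as `namespace["map_record"](record)`.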

Request & Response

A complete example showing a JSON sample-to-JSON Schema mapping request and the resulting output.

POST /api/v1/schema/map Request
{
  "source": {
    "format": "json_sample",
    "content": {
      "custName": "Jane Doe",
      "custEmail": "[email protected]",
      "orderAmt": 149.99,
      "orderDt": "03/18/2026",
      "shipAddr": "123 Main St",
      "isActive": true,
      "qty": 3
    }
  },
  "target": {
    "format": "json_schema",
    "content": {
      "type": "object",
      "properties": {
        "customer_name": { "type": "string" },
        "email_address": { "type": "string", "format": "email" },
        "order_amount": { "type": "number" },
        "order_date": { "type": "string", "format": "date" },
        "shipping_address": { "type": "string" },
        "is_active": { "type": "boolean" },
        "quantity": { "type": "integer" }
      }
    }
  },
  "options": {
    "min_confidence": 0.6,
    "include_transformations": true,
    "generate_code": true,
    "code_language": "javascript",
    "include_unmapped": true
  }
}
200 OK Response
{
  "mapping_id": "map_8f3a1b2c",
  "overall_confidence": 0.94,
  "mappings": [
    {
      "source_path": "custName",
      "target_path": "customer_name",
      "confidence": 0.97,
      "match_type": "abbreviation_expansion",
      "transformation": "direct_copy"
    },
    {
      "source_path": "orderAmt",
      "target_path": "order_amount",
      "confidence": 0.96,
      "match_type": "abbreviation_expansion",
      "transformation": "direct_copy"
    },
    {
      "source_path": "orderDt",
      "target_path": "order_date",
      "confidence": 0.93,
      "match_type": "abbreviation_expansion",
      "transformation": "date_format_convert",
      "transform_detail": "MM/DD/YYYY -> ISO 8601"
    },
    {
      "source_path": "qty",
      "target_path": "quantity",
      "confidence": 0.95,
      "match_type": "abbreviation_expansion",
      "transformation": "direct_copy"
    },
    {
      "source_path": "isActive",
      "target_path": "is_active",
      "confidence": 0.98,
      "match_type": "normalized_match",
      "transformation": "direct_copy"
    }
  ],
  "unmapped_source": [],
  "unmapped_target": [],
  "generated_code": "function mapSchema(src) {\n  return {\n    customer_name: src.custName,\n    email_address: src.custEmail,\n    order_amount: src.orderAmt,\n    order_date: src.orderDt.replace(/^(\\d{2})\\/(\\d{2})\\/(\\d{4})$/, '$3-$1-$2'),\n    shipping_address: src.shipAddr,\n    is_active: src.isActive,\n    quantity: src.qty\n  };\n}",
  "pipeline_stats": {
    "deterministic_matches": 5,
    "llm_matches": 2,
    "total_fields": 7,
    "processing_time_ms": 847
  }
}
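
A client call for the request above might look like the following sketch, which uses only the Python standard library. The base URL is a placeholder, authentication is omitted because it is not described in this section, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_map_request(base_url: str, source: dict, target: dict, options: dict = None):
    """Build (but do not send) a POST request for /api/v1/schema/map."""
    body = {"source": source, "target": target}
    if options:
        body["options"] = options
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/schema/map",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def map_schemas(base_url: str, source: dict, target: dict, options: dict = None) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_map_request(base_url, source, target, options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def low_confidence(response: dict, threshold: float = 0.95) -> list:
    """Return the mappings whose confidence falls below the threshold."""
    return [m for m in response["mappings"] if m["confidence"] < threshold]
```

Applied to the example response, `low_confidence` with the default threshold would flag only the orderDt → order_date mapping (0.93), which is a reasonable candidate for manual review.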

Confidence Scoring

Every mapping includes a confidence score derived from five weighted signals.

| Signal | Weight | Description |
|---|---|---|
| Name Similarity | 0.25 | Levenshtein distance, Jaro-Winkler, and token-level overlap between source and target field names after normalization. |
| Type Compatibility | 0.15 | Whether source and target types are identical, coercible (e.g., int→float), or incompatible. Coercible pairs receive partial credit. |
| Semantic Similarity | 0.30 | LLM-derived understanding of the business meaning of each field, considering descriptions, parent object context, and domain conventions. |
| Sample Value Match | 0.15 | Statistical comparison of sample values: format patterns, value ranges, cardinality, and regex matches between source and target examples. |
| Structural Context | 0.15 | Nesting depth, sibling field patterns, and parent-child relationships. Fields nested under similar parent objects receive a boost. |

Note: The overall_confidence in the response is the harmonic mean of individual mapping confidences, which penalizes low-confidence outliers more heavily than an arithmetic mean. A mapping set with one poor match will score lower overall, surfacing integration risks early.
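
The scoring arithmetic can be sketched as follows. The weights come from the table above; combining them as a plain weighted sum of signals in the range 0 to 1 is an assumption, since the exact formula is not stated, and the function names are hypothetical.

```python
from statistics import harmonic_mean

# Weights from the signal table; they sum to 1.0.
WEIGHTS = {
    "name_similarity": 0.25,
    "type_compatibility": 0.15,
    "semantic_similarity": 0.30,
    "sample_value_match": 0.15,
    "structural_context": 0.15,
}


def mapping_confidence(signals: dict) -> float:
    """Weighted sum of the five signals (assumed to each lie in [0, 1])."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def overall_confidence(confidences: list) -> float:
    """Harmonic mean of per-mapping confidences, as described in the note."""
    return harmonic_mean(confidences)
```

To see the outlier penalty: for confidences of 0.9, 0.9, and 0.3, the arithmetic mean is 0.70 while the harmonic mean is 0.54.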

Supported Formats

The Schema Mapper accepts source and target schemas in any of the following formats.

| Format | Identifier | Notes |
|---|---|---|
| JSON Schema | json_schema | Draft-07 and later. Properties, required fields, descriptions, and nested $ref are extracted. |
| JSON Sample | json_sample | A concrete JSON object. Types and structure are inferred from values. Arrays are unwrapped to detect item schemas. |
| CSV Headers | csv_headers | First row treated as field names. Optionally include 1–5 data rows for type inference and sample value matching. |
| XML Sample | xml_sample | Elements and attributes are flattened to a field list. Namespaces are preserved as path prefixes. |
| OpenAPI | openapi | Extracts schemas from components/schemas or inline request/response bodies. Supports OpenAPI 3.0 and 3.1. |
| Natural Language | natural_language | A plain-text description of the schema (e.g., "customer with name, email, and order total"). The LLM infers field names and types. |

Deterministic Pre-Matching

Before any LLM call, a fast three-pass deterministic matcher resolves the easy field pairs.

Pass 1 — Exact Match

Case-sensitive comparison of raw field names. Fields like email → email are resolved instantly with confidence 1.0.

Pass 2 — Abbreviation Expansion

A built-in dictionary expands common abbreviations before comparison. Examples from the dictionary:

amt → amount, qty → quantity, dt → date, addr → address, desc → description, num → number, msg → message, cust → customer, org → organization, txn → transaction, cat → category, prod → product

Pass 3 — Normalized Match

All field names are normalized by converting to lowercase, expanding camelCase to tokens (shipAddr → ship addr), and replacing underscores and hyphens with spaces. The resulting token sets are then compared. This catches pairs like shipAddr → shipping_address after abbreviation expansion.
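
The three passes can be approximated in a few lines. This is a simplified sketch, not the service's matcher: it tokenizes names before the abbreviation pass for convenience, uses only the sample dictionary entries listed above, and the `exact_match` label is a hypothetical name (the other two labels appear in the example response).

```python
import re

ABBREVIATIONS = {
    "amt": "amount", "qty": "quantity", "dt": "date", "addr": "address",
    "desc": "description", "num": "number", "msg": "message",
    "cust": "customer", "org": "organization", "txn": "transaction",
    "cat": "category", "prod": "product",
}


def tokenize(name: str) -> list:
    """Split camelCase, underscores, and hyphens into lowercase tokens."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]


def expand(tokens: list) -> list:
    """Replace known abbreviations with their full forms."""
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def pre_match(source_name: str, target_name: str):
    """Return a match type, or None if the pair must go to later stages."""
    # Pass 1: case-sensitive comparison of the raw names.
    if source_name == target_name:
        return "exact_match"
    src, tgt = tokenize(source_name), tokenize(target_name)
    # Pass 2: only counts if expansion actually changed at least one token.
    if (src != expand(src) or tgt != expand(tgt)) and set(expand(src)) == set(expand(tgt)):
        return "abbreviation_expansion"
    # Pass 3: compare the normalized token sets.
    if set(src) == set(tgt):
        return "normalized_match"
    return None
```

With this sketch, custName and customer_name match by abbreviation expansion, and isActive and is_active match by normalization. shipAddr and shipping_address are left unmatched here because ship is not in the sample dictionary above; a dictionary that also maps ship to shipping would resolve that pair deterministically.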

Performance impact: Deterministic pre-matching typically resolves 40–60% of all field pairs, eliminating them from the LLM candidate matrix. For a 50-field schema pair, the full cross-product is 50 × 50 = 2,500 candidates; if 60% of fields are resolved, the remaining 20 × 20 leaves about 400. This reduces latency by 60% and token cost by 70%.
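
The candidate counts follow from squaring the number of fields left on each side. The helper below is a hypothetical illustration of that arithmetic, assuming both schemas have the same number of fields and the same fraction resolved; it does not model the later type-based filtering.

```python
def candidate_count(fields_per_side: int, resolved_fraction: float) -> int:
    """Size of the source x target candidate matrix after pre-matching."""
    remaining = round(fields_per_side * (1 - resolved_fraction))
    return remaining * remaining
```

For 50 fields per side this gives 2,500 candidates with nothing resolved, 900 at a 40% resolution rate, and 400 at 60%.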