Intelligent field-by-field mapping between unknown schemas with confidence scoring and transformation suggestions
The Schema Mapper service accepts two schemas — a source and a target — in any supported format and returns a comprehensive set of field-level mappings. Each mapping includes a confidence score, a suggested transformation function, and metadata explaining how the match was derived.
Instead of formal schemas, you can provide sample payloads (JSON objects, CSV rows, XML documents) and the service will infer the schema automatically. This makes the service ideal for rapid integration prototyping, ETL pipeline design, and data migration planning.
Deterministic pre-matching resolves 40–60% of fields instantly, and the full pipeline completes in under 2 seconds for typical schemas.
Ambiguous fields are resolved by Gemini semantic analysis, which weighs domain context and business meaning rather than string similarity alone.
The service can optionally generate executable mapping functions in your target language, ready to drop into your codebase.
Each schema mapping request flows through a multi-stage pipeline that balances speed and accuracy.
The first stage converts each of the 6 supported input formats (JSON Schema, JSON sample, CSV headers, XML sample, OpenAPI, natural language) into a Canonical Intermediate Representation (CIR). The CIR captures field names, types, nesting depth, cardinality, and sample values in a unified structure.
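As a rough illustration of what normalization might produce, the sketch below infers a flat list of field records from a JSON sample. The `FieldRecord` class and its attribute names are hypothetical; the service's actual CIR layout is not documented here, and cardinality is left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FieldRecord:
    """Hypothetical CIR entry, for illustration only."""
    path: str
    type: str
    depth: int
    sample: Any


def infer_type(value: Any) -> str:
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_fields(sample: dict, prefix: str = "", depth: int = 0) -> List[FieldRecord]:
    """Walk a JSON object and emit one record per non-object field."""
    records: List[FieldRecord] = []
    for key, value in sample.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            records.extend(infer_fields(value, path, depth + 1))
        else:
            records.append(FieldRecord(path, infer_type(value), depth, value))
    return records


fields = infer_fields({"custName": "Jane Doe", "orderAmt": 149.99, "isActive": True, "qty": 3})
for field in fields:
    print(field.path, field.type)
```

Nested objects produce dotted paths (for example `a.b`) with an increased depth, which is one simple way to keep nesting information in a flat structure.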
Deterministic pre-matching runs three sub-passes: exact match, abbreviation expansion (e.g., amt→amount), and normalized match (case folding plus underscore/camelCase normalization). It typically resolves 40–60% of all fields with 0.95+ confidence and consumes zero LLM tokens.
For remaining unmatched fields, the engine builds a candidate matrix filtered by type compatibility. A string field will not be paired with a boolean unless a known coercion path exists. This step reduces the candidate space by 60–80%, dramatically lowering LLM costs.
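The filtering step can be sketched as a cross product of source and target fields with incompatible type pairs removed. The `COERCIBLE` table below is an assumption made for illustration; the service's real coercion rules are not listed in this document.

```python
from itertools import product

# Hypothetical coercion paths: (source_type, target_type) pairs treated as convertible.
COERCIBLE = {
    ("integer", "number"),
    ("integer", "string"),
    ("number", "string"),
    ("boolean", "string"),
}


def compatible(src_type: str, tgt_type: str) -> bool:
    """True for identical types or a known coercion path."""
    return src_type == tgt_type or (src_type, tgt_type) in COERCIBLE


def candidate_pairs(source: dict, target: dict) -> list:
    """Return (source_field, target_field) pairs that survive the type filter."""
    return [
        (s_name, t_name)
        for (s_name, s_type), (t_name, t_type) in product(source.items(), target.items())
        if compatible(s_type, t_type)
    ]


source = {"custEmail": "string", "shipAddr": "string"}
target = {"email_address": "string", "shipping_address": "string", "is_active": "boolean"}
pairs = candidate_pairs(source, target)
print(len(pairs), "of", len(source) * len(target), "pairs kept")
```

Here the two string-to-boolean pairs are dropped because the table has no string→boolean entry, so only four of the six possible pairs would go on to semantic analysis.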
The filtered candidate pairs are sent to Gemini for semantic analysis. The LLM evaluates domain context, business meaning, and field descriptions to produce a confidence score for each candidate mapping. Only the ambiguous 40–60% of fields reach this stage.
For each confirmed mapping, the engine infers the required transformation: type coercion (string→integer), unit conversion (lbs→kg), date format conversion (MM/DD/YYYY→ISO 8601), string manipulation (concatenation, splitting), and enum value mapping.
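A few of these transformations are sketched below as plain functions. The names `direct_copy` and `date_format_convert` appear in the example response in this document; the other registry keys and all function names are hypothetical.

```python
from datetime import datetime

LB_TO_KG = 0.45359237  # exact definition of the international avoirdupois pound


def us_date_to_iso(value: str) -> str:
    """Convert MM/DD/YYYY to an ISO 8601 calendar date (YYYY-MM-DD)."""
    return datetime.strptime(value, "%m/%d/%Y").date().isoformat()


def lbs_to_kg(value: float) -> float:
    """Unit conversion from pounds to kilograms."""
    return value * LB_TO_KG


def string_to_int(value: str) -> int:
    """Type coercion from a numeric string to an integer."""
    return int(value.strip())


# Hypothetical registry keyed by transformation name.
TRANSFORMS = {
    "direct_copy": lambda value: value,
    "date_format_convert": us_date_to_iso,
    "unit_convert_lbs_kg": lbs_to_kg,
    "type_coerce_str_int": string_to_int,
}

print(TRANSFORMS["date_format_convert"]("03/18/2026"))  # 2026-03-18
```

Parsing the value as a plain calendar date, with no time component, keeps the result independent of the machine's time zone.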
The final stage optionally generates executable mapping functions in the specified target language (JavaScript, Python, C#, etc.). The generated code includes null checks, type validation, and fallback values, ready for production use.
A complete example showing a JSON sample-to-JSON Schema mapping request and the resulting output.
{
"source": {
"format": "json_sample",
"content": {
"custName": "Jane Doe",
"custEmail": "jane.doe@example.com",
"orderAmt": 149.99,
"orderDt": "03/18/2026",
"shipAddr": "123 Main St",
"isActive": true,
"qty": 3
}
},
"target": {
"format": "json_schema",
"content": {
"type": "object",
"properties": {
"customer_name": { "type": "string" },
"email_address": { "type": "string", "format": "email" },
"order_amount": { "type": "number" },
"order_date": { "type": "string", "format": "date" },
"shipping_address": { "type": "string" },
"is_active": { "type": "boolean" },
"quantity": { "type": "integer" }
}
}
},
"options": {
"min_confidence": 0.6,
"include_transformations": true,
"generate_code": true,
"code_language": "javascript",
"include_unmapped": true
}
}

{
"mapping_id": "map_8f3a1b2c",
"overall_confidence": 0.94,
"mappings": [
{
"source_path": "custName",
"target_path": "customer_name",
"confidence": 0.97,
"match_type": "abbreviation_expansion",
"transformation": "direct_copy"
},
{
"source_path": "orderAmt",
"target_path": "order_amount",
"confidence": 0.96,
"match_type": "abbreviation_expansion",
"transformation": "direct_copy"
},
{
"source_path": "orderDt",
"target_path": "order_date",
"confidence": 0.93,
"match_type": "abbreviation_expansion",
"transformation": "date_format_convert",
"transform_detail": "MM/DD/YYYY -> ISO 8601"
},
{
"source_path": "qty",
"target_path": "quantity",
"confidence": 0.95,
"match_type": "abbreviation_expansion",
"transformation": "direct_copy"
},
{
"source_path": "isActive",
"target_path": "is_active",
"confidence": 0.98,
"match_type": "normalized_match",
"transformation": "direct_copy"
}
],
"unmapped_source": [],
"unmapped_target": [],
"generated_code": "function mapSchema(src) {\n return {\n customer_name: src.custName,\n email_address: src.custEmail,\n order_amount: src.orderAmt,\n order_date: new Date(src.orderDt).toISOString().split('T')[0],\n shipping_address: src.shipAddr,\n is_active: src.isActive,\n quantity: src.qty\n };\n}",
"pipeline_stats": {
"deterministic_matches": 5,
"llm_matches": 2,
"total_fields": 7,
"processing_time_ms": 847
}
}

Every mapping includes a confidence score derived from five weighted signals.
| Signal | Weight | Description |
|---|---|---|
| Name Similarity | 0.25 | Levenshtein distance, Jaro-Winkler, and token-level overlap between source and target field names after normalization. |
| Type Compatibility | 0.15 | Whether source and target types are identical, coercible (e.g., int→float), or incompatible. Coercible pairs receive partial credit. |
| Semantic Similarity | 0.30 | LLM-derived understanding of the business meaning of each field, considering descriptions, parent object context, and domain conventions. |
| Sample Value Match | 0.15 | Statistical comparison of sample values: format patterns, value ranges, cardinality, and regex matches between source and target examples. |
| Structural Context | 0.15 | Nesting depth, sibling field patterns, and parent-child relationships. Fields nested under similar parent objects receive a boost. |
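One plausible reading of these weights is a weighted sum of per-signal scores, each in the range 0 to 1. The document does not state the exact combination rule, so the formula below is an assumption, and the signal keys and sample scores are made up for illustration.

```python
# Weights taken from the signal table; the dictionary keys are hypothetical names.
WEIGHTS = {
    "name_similarity": 0.25,
    "type_compatibility": 0.15,
    "semantic_similarity": 0.30,
    "sample_value_match": 0.15,
    "structural_context": 0.15,
}


def combine(signals: dict) -> float:
    """Weighted sum of signal scores (assumed rule, each score in 0..1)."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Illustrative scores only, not output from the service.
score = combine({
    "name_similarity": 0.9,
    "type_compatibility": 1.0,
    "semantic_similarity": 0.95,
    "sample_value_match": 0.8,
    "structural_context": 1.0,
})
print(round(score, 3))
```

Because the five weights sum to 1.0, a weighted sum of this kind stays within 0 to 1 whenever every input signal does.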
overall_confidence in the response is the harmonic mean of individual mapping confidences, which penalizes low-confidence outliers more heavily than an arithmetic mean. A mapping set with one poor match will score lower overall, surfacing integration risks early.
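The effect of the harmonic mean can be checked with Python's standard library. The confidence values below are illustrative, not taken from a real response, and the `overall_confidence` function name is simply borrowed from the response field.

```python
from statistics import harmonic_mean, mean


def overall_confidence(confidences: list) -> float:
    """Harmonic mean of the individual mapping confidences."""
    return harmonic_mean(confidences)


confidences = [0.95, 0.95, 0.95, 0.40]  # three strong mappings and one weak one
arithmetic = mean(confidences)
harmonic = overall_confidence(confidences)
print(f"arithmetic={arithmetic:.3f} harmonic={harmonic:.3f}")
```

With these inputs the arithmetic mean is about 0.81 while the harmonic mean is about 0.71, so the single weak mapping pulls the overall score down noticeably more.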
The Schema Mapper accepts source and target schemas in any of the following formats.
| Format | Identifier | Notes |
|---|---|---|
| JSON Schema | json_schema | Draft-07 and later. Properties, required fields, descriptions, and nested $ref are extracted. |
| JSON Sample | json_sample | A concrete JSON object. Types and structure are inferred from values. Arrays are unwrapped to detect item schemas. |
| CSV Headers | csv_headers | First row treated as field names. Optionally include 1–5 data rows for type inference and sample value matching. |
| XML Sample | xml_sample | Elements and attributes are flattened to a field list. Namespaces are preserved as path prefixes. |
| OpenAPI | openapi | Extracts schemas from components/schemas or inline request/response bodies. Supports OpenAPI 3.0 and 3.1. |
| Natural Language | natural_language | A plain-text description of the schema (e.g., "customer with name, email, and order total"). The LLM infers field names and types. |
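To make the CSV case concrete, the sketch below shows how column types could be inferred from a header row plus a few data rows. The inference rules and function names are assumptions for illustration, not the service's documented behavior.

```python
import csv
import io


def infer_column_type(values: list) -> str:
    """Pick the narrowest type that fits every non-empty sample value."""
    values = [v for v in values if v != ""]
    if not values:
        return "string"  # no samples, so fall back to the most general type
    for type_name, parser in (("integer", int), ("number", float)):
        try:
            for v in values:
                parser(v)
            return type_name
        except ValueError:
            continue
    if all(v.lower() in ("true", "false") for v in values):
        return "boolean"
    return "string"


def infer_csv_schema(text: str) -> dict:
    """Map each header name to an inferred type using the rows that follow it."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data) if data else [[] for _ in header]
    return {name: infer_column_type(list(col)) for name, col in zip(header, columns)}


sample = "custName,orderAmt,qty,isActive\nJane Doe,149.99,3,true\nJohn Roe,20,1,false\n"
print(infer_csv_schema(sample))
```

With only a header row, every column falls back to `string`, which is why supplying a few data rows can improve both type inference and sample value matching.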
Before any LLM call, a fast three-pass deterministic matcher resolves the easy field pairs.
The exact-match pass is a case-sensitive comparison of raw field names. Pairs like email ↔ email are resolved instantly with confidence 1.0.
The abbreviation-expansion pass uses a built-in dictionary to expand common abbreviations before comparison. Examples from the dictionary:
amt → amount
qty → quantity
dt → date
addr → address
desc → description
num → number
msg → message
cust → customer
org → organization
txn → transaction
cat → category
prod → product
The normalized-match pass normalizes every field name by splitting camelCase into tokens (shipAddr → ship addr), converting to lowercase, and replacing underscores and hyphens with spaces, then compares the resulting token sets. This catches pairs like isActive ↔ is_active directly and, once abbreviations are expanded, pairs like orderAmt ↔ order_amount.
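The three passes can be sketched in a few lines. This is a minimal illustration under stated assumptions: only the 1.0 confidence for exact matches comes from this document, the other confidence values are placeholders, the `exact_match` label is hypothetical, and the dictionary is a subset of the examples listed above.

```python
import re

# Subset of the abbreviation dictionary listed above.
ABBREVIATIONS = {
    "amt": "amount", "qty": "quantity", "dt": "date", "addr": "address",
    "cust": "customer", "txn": "transaction",
}


def tokenize(name: str) -> list:
    """Split camelCase first, then lowercase and split on spaces, underscores, hyphens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [token for token in re.split(r"[\s_\-]+", spaced.lower()) if token]


def expand(tokens: list) -> list:
    """Replace each known abbreviation with its long form."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]


def match(source: str, target: str):
    """Return (match_type, confidence), or None if no deterministic pass matches."""
    # Pass 1: case-sensitive comparison of the raw names.
    if source == target:
        return ("exact_match", 1.0)
    src, tgt = tokenize(source), tokenize(target)
    # Pass 2: counts only if expansion changed something and the results agree.
    src_x, tgt_x = expand(src), expand(tgt)
    if (src_x != src or tgt_x != tgt) and src_x == tgt_x:
        return ("abbreviation_expansion", 0.96)  # placeholder confidence
    # Pass 3: compare normalized token sets.
    if set(src) == set(tgt):
        return ("normalized_match", 0.98)  # placeholder confidence
    return None


for pair in [("email", "email"), ("orderAmt", "order_amount"),
             ("isActive", "is_active"), ("custEmail", "email_address")]:
    print(pair, match(*pair))
```

The last pair returns `None`: no deterministic pass resolves it, so in the pipeline described earlier it would move on to type filtering and semantic analysis.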