Data Quality Scorer

Seven Quality Dimensions

Each dimension is scored independently from 0 to 1.0, then combined into a weighted composite.

Completeness 0 – 1.0

Fraction of non-null field values across all records. Required fields carry twice the weight of optional fields, so gaps in critical data are penalized more heavily.
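
One way to read the 2x weighting is as a weighted share of filled values. The sketch below is an illustration under that assumption, not the service's actual formula; the function name and the `required_weight` parameter are hypothetical, and the field dictionaries follow the shape used in the request schema.

```python
from typing import Any


def completeness_score(
    records: list[dict[str, Any]],
    fields: list[dict[str, Any]],
    required_weight: float = 2.0,
) -> float:
    """Weighted share of non-null values; required fields count `required_weight` times."""
    filled = 0.0
    total = 0.0
    for record in records:
        for field in fields:
            # Assumption: optional fields weigh 1.0, required fields weigh 2.0 by default.
            weight = required_weight if field.get("required") else 1.0
            total += weight
            if record.get(field["name"]) is not None:
                filled += weight
    # An empty dataset has nothing missing, so treat it as fully complete.
    return filled / total if total else 1.0
```

With this reading, a null in a required field lowers the score twice as much as a null in an optional one.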

Type Conformance 0 – 1.0

Ratio of values that match their declared type in the schema. Detects strings in numeric fields, malformed dates, and other type mismatches.
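
A rough sketch of such a ratio, assuming null values are skipped because completeness already accounts for them. Only `integer` and `string` appear as type names in this document; `number`, `boolean`, and `date` (ISO 8601) are assumptions added for illustration, as are the function names.

```python
from datetime import date
from typing import Any


def matches_type(value: Any, declared: str) -> bool:
    """Return True if `value` conforms to the declared schema type."""
    if declared == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if declared == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if declared == "string":
        return isinstance(value, str)
    if declared == "boolean":
        return isinstance(value, bool)
    if declared == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)  # rejects malformed or impossible dates
            return True
        except ValueError:
            return False
    return False  # unknown type names never match


def type_conformance_score(records: list[dict[str, Any]], fields: list[dict[str, Any]]) -> float:
    """Ratio of non-null values that match their declared type."""
    checked = 0
    conforming = 0
    for record in records:
        for field in fields:
            value = record.get(field["name"])
            if value is None:
                continue  # assumption: nulls are left to the completeness dimension
            checked += 1
            conforming += matches_type(value, field["type"])
    return conforming / checked if checked else 1.0
```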

Uniqueness 0 – 1.0

Duplicate detection on designated key fields. Measures the ratio of distinct values to total values, flagging exact and near-duplicates.
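
The distinct-to-total ratio can be sketched as follows. How the service detects near-duplicates is not described here, so the `normalize` option (trimming whitespace and ignoring case) is only a simple stand-in, and the function name is hypothetical.

```python
from typing import Hashable, Iterable, Optional


def uniqueness_score(values: Iterable[Optional[Hashable]], normalize: bool = False) -> float:
    """Ratio of distinct values to total non-null values in a key column."""
    present = [v for v in values if v is not None]
    if normalize:
        # Stand-in for near-duplicate detection: compare strings without
        # surrounding whitespace and without regard to case.
        present = [v.strip().casefold() if isinstance(v, str) else v for v in present]
    if not present:
        return 1.0
    return len(set(present)) / len(present)
```

For example, `uniqueness_score([1, 2, 2, 3])` gives 0.75, because three distinct values cover four entries.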

Format Consistency 0 – 1.0

Detects mixed formats within a single column, such as dates alternating between "MM/DD/YYYY" and "YYYY-MM-DD", or phone numbers with inconsistent separators.
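
One simple way to detect mixed formats is to reduce each value to a shape signature and measure how dominant the most common shape is. This is an illustrative technique, not necessarily what the service does; both function names are hypothetical.

```python
import re
from collections import Counter
from typing import Any, Iterable


def format_signature(value: str) -> str:
    """Reduce a string to its shape: digits become '9', letters become 'A'."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def format_consistency_score(values: Iterable[Any]) -> float:
    """Share of string values that use the column's most common shape."""
    shapes = Counter(format_signature(v) for v in values if isinstance(v, str))
    total = sum(shapes.values())
    if total == 0:
        return 1.0
    return shapes.most_common(1)[0][1] / total
```

Under this scheme "01/02/2024" maps to `99/99/9999` and "2024-01-03" maps to `9999-99-99`, so a column that mixes the two scores below 1.0.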

Range Validity 0 – 1.0

Outlier detection using the Interquartile Range (IQR) method. Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged. Custom min/max bounds can also be supplied.
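
The IQR rule can be sketched with the standard library. Quartile conventions differ between tools, and this sketch uses the default method of `statistics.quantiles`, so the fences may not match the service exactly. That explicit bounds replace the IQR fences when supplied is also an assumption, and the function names are hypothetical.

```python
import statistics
from typing import Optional, Sequence


def iqr_fences(values: Sequence[float], k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def range_validity_score(
    values: Sequence[Optional[float]],
    minimum: Optional[float] = None,
    maximum: Optional[float] = None,
) -> float:
    """Share of non-null values inside the custom bounds or, if none are given, the IQR fences."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    if minimum is None and maximum is None:
        if len(present) < 2:
            return 1.0  # too few points to estimate quartiles
        low, high = iqr_fences(present)
    else:
        # Assumption: explicit bounds take the place of the IQR fences.
        low = minimum if minimum is not None else float("-inf")
        high = maximum if maximum is not None else float("inf")
    valid = sum(1 for v in present if low <= v <= high)
    return valid / len(present)
```

With `minimum=0` and `maximum=150`, as declared for `age` in the request schema, a value of -5 counts as out of range.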

Referential Integrity 0 – 1.0

Cross-field and cross-table consistency checks. Validates that foreign key references resolve, enum values are within allowed sets, and dependent fields are logically consistent.
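
A minimal sketch of the enum and foreign-key parts of these checks, assuming the allowed values and the set of valid referenced keys are already loaded in memory. The function name and both mapping parameters are hypothetical, and logical cross-field rules are left out.

```python
from typing import Any, Collection, Mapping, Sequence


def referential_integrity_score(
    records: Sequence[Mapping[str, Any]],
    enums: Mapping[str, Collection[Any]],
    foreign_keys: Mapping[str, Collection[Any]],
) -> float:
    """Share of non-null enum and foreign-key values that resolve."""
    checks = 0
    passed = 0
    for record in records:
        # Enum fields must hold one of the allowed values.
        for field, allowed in enums.items():
            value = record.get(field)
            if value is not None:
                checks += 1
                passed += value in allowed
        # Foreign-key fields must point at a key that exists in the referenced table.
        for field, valid_keys in foreign_keys.items():
            value = record.get(field)
            if value is not None:
                checks += 1
                passed += value in valid_keys
    return passed / checks if checks else 1.0
```

Passing `{"role": {"admin", "user", "viewer"}}` as `enums` would flag a `role` of "superadmin", as in the request example.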

Semantic Anomaly 0 – 1.0

LLM-powered detection of semantically invalid data that passes structural checks. Catches values like a city named "12345" or an email in a phone field.

Request & Response

Submit a dataset with an optional schema for a full quality assessment.

POST /api/v1/quality/score Request

{
  "schema": {
    "fields": [
      { "name": "id",    "type": "integer", "required": true,  "unique": true },
      { "name": "email", "type": "string",  "required": true,  "format": "email" },
      { "name": "age",   "type": "integer", "required": false, "min": 0, "max": 150 },
      { "name": "role",  "type": "string",  "required": true,  "enum": ["admin", "user", "viewer"] }
    ]
  },
  "data": [
    { "id": 1, "email": "[email protected]", "age": 29,   "role": "admin" },
    { "id": 2, "email": "[email protected]",  "age": -5,   "role": "user" },
    { "id": 3, "email": "not-an-email",     "age": null, "role": "superadmin" }
  ]
}

200 OK Response

{
  "composite_score": 0.82,
  "dimensions": {
    "completeness":         0.92,
    "type_conformance":     0.92,
    "uniqueness":            1.00,
    "format_consistency":   0.67,
    "range_validity":        0.67,
    "referential_integrity": 0.67,
    "semantic_anomaly":      0.67
  },
  "record_count": 3,
  "field_count": 4,
  "issues_found": 4,
  "duration_ms": 12.8
}
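
The request can be assembled from Python with the standard library alone. In this sketch the base URL is a placeholder, authentication is not covered because this page does not describe it, and both helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Optional


def build_score_request(
    base_url: str,
    data: list[dict[str, Any]],
    schema: Optional[dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/quality/score."""
    body: dict[str, Any] = {"data": data}
    if schema is not None:  # the schema is optional
        body["schema"] = schema
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/quality/score",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def weakest_dimensions(response: dict[str, Any], threshold: float = 0.8) -> list[str]:
    """Names of dimensions scoring below `threshold`, lowest score first."""
    low = [(score, name) for name, score in response["dimensions"].items() if score < threshold]
    return [name for _score, name in sorted(low)]


# Sending the request (base URL is a placeholder, not a real endpoint):
# request = build_score_request("https://api.example.com", data=[{"id": 1}])
# with urllib.request.urlopen(request) as reply:
#     result = json.load(reply)
#     print(result["composite_score"], weakest_dimensions(result))
```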

Suggested Fixes

The API returns actionable fix suggestions for every issue detected.

Suggested Fixes Example (included in response)

{
  "suggested_fixes": [
    {
      "record": 1,
      "field": "age",
      "issue": "range_violation",
      "severity": "error",
      "message": "Value -5 is below minimum 0",
      "suggestion": "Verify source data. If age is unknown, set to null rather than a negative value."
    },
    {
      "record": 2,
      "field": "email",
      "issue": "format_mismatch",
      "severity": "error",
      "message": "'not-an-email' does not match email format",
      "suggestion": "Validate email format at ingestion. Expected pattern: [email protected]"
    },
    {
      "record": 2,
      "field": "role",
      "issue": "enum_violation",
      "severity": "warning",
      "message": "'superadmin' is not in allowed values [admin, user, viewer]",
      "suggestion": "Map to closest allowed value 'admin' or update the schema enum list."
    },
    {
      "record": 2,
      "field": "age",
      "issue": "missing_required",
      "severity": "info",
      "message": "Optional field 'age' is null",
      "suggestion": "Consider collecting age data or marking as not applicable."
    }
  ]
}
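
A client might group these suggestions by record and surface the most severe ones first. The sketch below assumes the three severity levels shown in the example (`error`, `warning`, `info`) are the complete set; the helper name is hypothetical. The `record` values in the example appear to be zero-based indexes into the submitted `data` array.

```python
from collections import defaultdict
from typing import Any

# Assumption: these are the only severity levels; unknown levels sort last.
SEVERITY_RANK = {"error": 0, "warning": 1, "info": 2}


def group_fixes_by_record(suggested_fixes: list[dict[str, Any]]) -> dict[int, list[dict[str, Any]]]:
    """Group fixes by record index, most severe first within each record."""
    grouped: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for fix in suggested_fixes:
        grouped[fix["record"]].append(fix)
    for fixes in grouped.values():
        fixes.sort(key=lambda fix: SEVERITY_RANK.get(fix["severity"], len(SEVERITY_RANK)))
    return dict(grouped)
```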

Scoring Methodology

The composite score is a weighted average of all seven dimensions.

Dimension	Default Weight	Description
`completeness`	0.20	Non-null field coverage, required fields weighted 2x
`type_conformance`	0.20	Values matching declared schema types
`uniqueness`	0.15	Distinct-to-total ratio on key fields
`format_consistency`	0.15	Uniform formatting within columns
`range_validity`	0.10	Values within IQR or specified bounds
`referential_integrity`	0.10	Cross-field and cross-table consistency
`semantic_anomaly`	0.10	LLM-detected semantic issues

Composite Score Formula (weighted average)

// composite_score = sum(dimension_score * weight) / sum(weights)
//
// Example with default weights:
//   (0.92 * 0.20) + (0.92 * 0.20) + (1.00 * 0.15) +
//   (0.67 * 0.15) + (0.67 * 0.10) + (0.67 * 0.10) +
//   (0.67 * 0.10)
//   = 0.184 + 0.184 + 0.150 + 0.101 + 0.067 + 0.067 + 0.067
//   = 0.82
//
// Weights are customizable per request via the
// "weights" object in the request body.
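
The same calculation as a small Python sketch, using the default weights from the table. Whether a request's "weights" object merges with the defaults or replaces them entirely is not stated here, so the merge behavior below is an assumption, and the function name is hypothetical.

```python
from typing import Mapping, Optional

DEFAULT_WEIGHTS = {
    "completeness": 0.20,
    "type_conformance": 0.20,
    "uniqueness": 0.15,
    "format_consistency": 0.15,
    "range_validity": 0.10,
    "referential_integrity": 0.10,
    "semantic_anomaly": 0.10,
}


def composite_score(
    dimensions: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted average: sum(score * weight) / sum(weights)."""
    # Assumption: custom weights override individual defaults rather than replace them all.
    effective = {**DEFAULT_WEIGHTS, **(weights or {})}
    total_weight = sum(effective[name] for name in dimensions)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(score * effective[name] for name, score in dimensions.items())
    return weighted_sum / total_weight
```

Dividing by the sum of the weights keeps the result between 0 and 1 even when custom weights do not add up to 1.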
