InsightXform · Product overview

Profile, cleanse, and validate enterprise data — in your warehouse

In-warehouse compute on Snowflake, BigQuery, Databricks, and Redshift. Preview every transformation on a sample before it touches production. Lineage propagates to your catalog automatically. AI-augmented where deterministic rules can’t keep up.

Runs in · integrates with · deploys on
AWS · Google Cloud · Azure

Built for teams who ship clean data

  • Data engineering
  • Data platform
  • Analytics engineering
  • ML / AI teams
  • Master data & governance
Why InsightXform

Three things that separate it from the field

Cleansing tools that pull data out, transformations you can’t preview, and lineage that lives in a separate tool — that’s the field. InsightXform fixes all three.

In-warehouse

Runs where your data lives

Pushes compute into Snowflake, BigQuery, Databricks, or Redshift — not into a separate ETL plane. No data movement, no cross-perimeter hop, no parallel compute bill.

  • Snowflake, BigQuery, Databricks, Redshift
  • SQL pushdown for every transformation
  • No copy-out, no shadow lake
  • Compute billed by your warehouse, not by us
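
As a rough illustration of what SQL pushdown means in practice, the sketch below compiles a cleansing rule into a single SQL statement and lets the database engine execute it. It uses Python's built-in sqlite3 module as a stand-in for a warehouse connection; the function name compile_trim_upper_rule and the table names are hypothetical and not part of InsightXform.

```python
import sqlite3


def compile_trim_upper_rule(source: str, target: str, column: str) -> str:
    """Compile a hypothetical 'trim and upper-case' rule into one SQL statement.

    Identifiers are interpolated directly, so they must come from trusted config.
    """
    return (
        f"CREATE TABLE {target} AS "
        f"SELECT id, UPPER(TRIM({column})) AS {column} FROM {source}"
    )


# sqlite3 stands in for the warehouse; a real deployment would use its own driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, " de "), (2, "fr"), (3, "US ")],
)

# The engine applies the rule; no rows are pulled into Python to do the work.
conn.execute(compile_trim_upper_rule("customers", "customers_clean", "country"))

# Fetch results only to inspect them in this example.
cleaned = conn.execute(
    "SELECT country FROM customers_clean ORDER BY id"
).fetchall()
original = conn.execute("SELECT country FROM customers ORDER BY id").fetchall()
```

The source table is left untouched and the cleaned result lands in a new table, which mirrors the "no copy-out" idea: the only thing that travels is the SQL text.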
Preview-before-apply

See the change before it ships

Every cleansing rule and transformation previews on a representative sample before touching production. Diff-view before commit. No more "we pushed a fix at 4pm and corrupted the export at 6pm".

  • Sample-based preview on every rule change
  • Before/after diff per column
  • Row counts & distribution comparison
  • Approval gate before production apply
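
To make the preview idea concrete, here is a minimal sketch of a sample-based before/after diff for one column. The function preview_rule, its parameters, and the report fields are assumptions made for illustration; they do not describe InsightXform's actual interface.

```python
import random
from collections import Counter


def preview_rule(rows, column, rule, sample_size=100, seed=0):
    """Apply `rule` to a sample only and report what would change."""
    rng = random.Random(seed)  # fixed seed keeps the preview repeatable
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)

    changes = []
    for row in sample:
        before = row[column]
        after = rule(before)
        if before != after:
            changes.append({"id": row["id"], "before": before, "after": after})

    return {
        "sampled_rows": len(sample),
        "changed_rows": len(changes),
        "changes": changes,
        # Value distributions before and after, for a quick sanity check.
        "before_distribution": Counter(r[column] for r in sample),
        "after_distribution": Counter(rule(r[column]) for r in sample),
    }


rows = [
    {"id": 1, "currency": "usd"},
    {"id": 2, "currency": "USD"},
    {"id": 3, "currency": " eur"},
]
report = preview_rule(rows, "currency", lambda v: v.strip().upper())
```

Nothing in `rows` is modified; the report is what a reviewer would look at before approving the rule for production.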
Lineage-native

Every change propagates downstream

Every transformation, every rule firing, every reviewer override emits OpenLineage events that flow into your catalog. Downstream consumers know what changed and why.

  • OpenLineage output on every run
  • Integrates with Insight Catalog, DataHub, Atlan, Collibra
  • Per-column transformation history
  • Reviewer overrides recorded in lineage
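
For readers unfamiliar with OpenLineage, the sketch below builds the core fields of a run event as a plain dictionary. The producer URI, namespaces, job name, and table names are placeholders, and the schema version in schemaURL is an assumption; use whichever version your catalog's OpenLineage client targets.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, input_table, output_table, event_type="COMPLETE"):
    """Assemble the core fields of an OpenLineage run event (illustrative only)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/insightxform",  # placeholder URI
        # Assumed spec version; adjust to the one your tooling expects.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "warehouse-cleansing", "name": job_name},
        "inputs": [{"namespace": "snowflake://example-account", "name": input_table}],
        "outputs": [{"namespace": "snowflake://example-account", "name": output_table}],
    }


event = build_run_event(
    "normalise_country_codes",
    "analytics.customers",
    "analytics.customers_clean",
)
payload = json.dumps(event, indent=2)  # what would be sent to the catalog
```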
How it works

Profile → cleanse → validate → ship

A four-step agentic pipeline. Each step is independently configurable, previewable, and reversible.

1. Profile

Auto-profile your data

Scan incoming data for null rates, distributions, outliers, format violations, duplicates, and reference-data mismatches. Surface them as a punch list, not a 60-page PDF.
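
A minimal sketch of the kind of checks a profiler runs, covering null rates, duplicate keys, and format violations. The function name, the e-mail pattern, and the sample rows are illustrative assumptions, not the product's implementation.

```python
import re
from collections import Counter

# Deliberately loose e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile(rows, key, email_col):
    """Return a small 'punch list' of data-quality findings."""
    if not rows:
        return {"null_rate": {}, "duplicate_keys": [], "format_violations": []}

    total = len(rows)
    null_rate = {
        col: sum(r[col] is None for r in rows) / total for col in rows[0].keys()
    }
    key_counts = Counter(r[key] for r in rows)
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)
    format_violations = [
        r[key]
        for r in rows
        if r[email_col] is not None and not EMAIL_RE.match(r[email_col])
    ]
    return {
        "null_rate": null_rate,
        "duplicate_keys": duplicate_keys,
        "format_violations": format_violations,
    }


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": "c@example.com"},
]
findings = profile(rows, key="id", email_col="email")
```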

2. Cleanse

Rule + AI-assisted fixes

Apply deterministic rules where they belong (formats, currencies, units) and AI assistance where rules fall short (entity resolution, fuzzy matches, anomalous values). Preview every fix on a sample.

3. Validate

Business + statistical checks

Checks draw on a business-rule library, statistical drift detection, golden-query comparison, and distribution-stability tests. Failures either route to a reviewer queue or block the pipeline; you choose.
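
One common way to quantify drift is the population stability index (PSI). The sketch below computes PSI over categorical values and routes the result to "pass", a reviewer queue, or a pipeline block. PSI is our choice for this example and the 0.2 threshold is a widely used rule of thumb; neither is stated to be what InsightXform uses.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population stability index between two categorical samples."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for category in set(expected) | set(actual):
        # eps avoids log(0) when a category is missing on one side.
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


def route(score, threshold=0.2, on_failure="review"):
    """Return 'pass', or the configured failure action ('review' or 'block')."""
    return "pass" if score < threshold else on_failure


baseline = ["A"] * 50 + ["B"] * 50
current = ["A"] * 90 + ["B"] * 10

stable = route(psi(baseline, baseline))
drifted = route(psi(baseline, current), on_failure="block")
```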

4. Ship

Output as a data product

Write back to the warehouse as a versioned table, publish a data contract, emit OpenLineage events — downstream consumers know exactly what they’re getting.

Reversible by design

Every step is versioned. A bad cleansing pass can be rolled back per-table or per-rule without rerunning the upstream pipeline. The audit trail records every state transition.

Use cases

Six patterns we see most

Teams typically start with one and add others once the platform is wired into the warehouse.

Master data

Entity resolution & dedup

Customer / product / supplier masters built from messy multi-source records with AI-assisted matching and reviewer-confirmable merges.

  • Multi-source dedup with match scoring
  • Human-in-the-loop for low-confidence merges
  • ~80% reduction in manual dedup effort
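
To show how match scoring and human-in-the-loop review fit together, the sketch below scores record pairs with Python's difflib and sorts them into auto-merge, review, and distinct buckets. The scoring method and both thresholds are illustrative assumptions; a production matcher would use richer signals than name similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def match_score(a: str, b: str) -> float:
    """Similarity in [0, 1] after light normalisation."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def triage(records, auto_merge=0.95, review=0.70):
    """Bucket every record pair by match confidence."""
    buckets = {"auto_merge": [], "review": [], "distinct": []}
    for left, right in combinations(records, 2):
        score = match_score(left["name"], right["name"])
        pair = (left["id"], right["id"])
        if score >= auto_merge:
            buckets["auto_merge"].append(pair)
        elif score >= review:
            buckets["review"].append(pair)  # low confidence: ask a human
        else:
            buckets["distinct"].append(pair)
    return buckets


records = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "ACME Corp "},
    {"id": 3, "name": "Acme Corporation"},
    {"id": 4, "name": "Globex"},
]
buckets = triage(records)
```

Only the pairs in the review bucket need a person to look at them, which is where the reduction in manual effort would come from.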
Migration

Migration validation

Cleanse and reconcile data before, during, and after migrations — row-count, distribution, and golden-query diffs between source and target.

  • Pre-migration profiling & cleansing
  • Post-migration validation diffs
  • Pairs with InsightAutoHub for end-to-end migrations
ML training

Training data preparation

Clean, balance, and feature-standardise training datasets — with leakage detection, class-imbalance scoring, and reproducible sampling.

  • Class-imbalance detection & resampling
  • Train/test leakage detection
  • Feature standardisation & encoding
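
The checks above can be sketched in a few lines. This example scores class imbalance, finds keys that appear in both the train and test splits, and draws a seeded sample; the function names and the customer_id key are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def leaked_keys(train, test, key):
    """Entity keys present in both splits, a common source of leakage."""
    return sorted({row[key] for row in train} & {row[key] for row in test})


def reproducible_sample(rows, k, seed=42):
    """Same seed, same sample, so a training run can be repeated exactly."""
    return random.Random(seed).sample(rows, k)


labels = ["ok"] * 95 + ["fraud"] * 5
train = [{"customer_id": i} for i in range(0, 8)]
test = [{"customer_id": i} for i in range(6, 10)]

ratio = imbalance_ratio(labels)
leaks = leaked_keys(train, test, "customer_id")
sample_a = reproducible_sample(train, 3)
sample_b = reproducible_sample(train, 3)
```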
BI / reporting

Reporting normalisation

Currency, units, time zones, region-level rollups — the cross-cutting normalisations that turn raw data into consistent reports.

  • Currency & unit conversion
  • Time-zone & calendar normalisation
  • Pre-aggregation for dashboards
Customer 360

Customer 360 assembly

Stitch customer records across CRM, billing, support, and product telemetry — with AI-assisted matching and reviewable golden records.

  • Cross-source identity resolution
  • Golden-record promotion workflow
  • Real-time updates as new sources land
Data SLA

Continuous data-quality SLA

Treat cleansed datasets as data products with SLAs. Continuous monitoring + auto-remediation for known issue patterns; escalation for the rest.

  • SLA per data product (freshness, completeness, accuracy)
  • Auto-remediation for known patterns
  • Escalation + audit on SLA breach
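
As a sketch of what an SLA check on a data product might look like, the example below tests freshness and completeness against configured limits and returns the list of breaches. The DataProductSLA class, its fields, and the limits are assumptions for illustration; accuracy checks are omitted for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DataProductSLA:
    max_staleness: timedelta   # how old the latest load may be
    min_completeness: float    # required share of non-null values, 0..1


def check_sla(sla, last_loaded_at, non_null, total, now):
    """Return the names of breached SLA dimensions (empty list if healthy)."""
    breaches = []
    if now - last_loaded_at > sla.max_staleness:
        breaches.append("freshness")
    completeness = non_null / total if total else 0.0
    if completeness < sla.min_completeness:
        breaches.append("completeness")
    return breaches


sla = DataProductSLA(max_staleness=timedelta(hours=6), min_completeness=0.99)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

breaches = check_sla(
    sla,
    last_loaded_at=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
    non_null=980,
    total=1000,
    now=now,
)
healthy = check_sla(sla, last_loaded_at=now, non_null=1000, total=1000, now=now)
```

A non-empty result is what would trigger auto-remediation for known patterns or escalation for everything else.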
Ecosystem

Fits the data stack you already run

Built to slot into your warehouse, orchestrator, quality tooling, and catalog — not replace them.

Warehouses & lakehouses

Pushdown SQL on the warehouse you already pay for. No separate Spark cluster, no data egress.

Orchestration

Drop InsightXform steps into the orchestrator your team already uses instead of adding yet another scheduler.

Quality & observability

Emit validation results into the quality tools your platform team already trusts.

Catalogs

Lineage events flow into your catalog — every transformation, every reviewer override, every SLA breach.

Deployment

A control plane wherever your security team will accept it

The control plane is small — the heavy compute already happens in your warehouse. Cloud SaaS for quick starts; self-hosted when sovereignty matters.

Cloud SaaS

Hosted by us · Start in minutes

AWS VPC

Your account · Your VPC

Azure Private

Your tenant · Your VNet

GCP Private

Your project · Your VPC
Impact

What InsightXform delivers in production

Aggregated outcomes from enterprise deployments. We’ll size the ROI for your data platform on the discovery call.

  • 10x faster onboarding of new data sources
  • ~75% less rule-maintenance overhead
  • Zero data egress: runs in your warehouse
  • 100% lineage coverage on every transform

FAQ

Common questions from data platform teams

What we hear most often from platform leads and architects. We cover more on the discovery call.

Does it pull data out or push compute in?

It pushes compute into your warehouse via SQL pushdown. Data never moves to a separate ETL plane. Your Snowflake, BigQuery, Databricks, or Redshift compute does the heavy lifting; the InsightXform control plane stays small.

Does it work with dbt / Airflow / Dagster?

Yes. Compose InsightXform steps as dbt models, Airflow tasks, or Dagster ops. We don’t replace your orchestrator — we plug into the one your team already runs.

Can a bad cleansing pass be reversed?

Yes. Every pass writes a versioned table rather than updating in place. Roll back per-table or per-rule without rerunning upstream. A full audit trail records who applied what, when, and why.

How does this compare to Great Expectations / Soda?

Great Expectations (GE) and Soda are quality-assertion tools: they tell you something’s wrong. InsightXform is a transformation engine that fixes what’s wrong, then emits results into GE / Soda for downstream visibility. The tools complement each other rather than overlap.

How does lineage propagate?

Every run emits OpenLineage events captured by your catalog (Insight Catalog, DataHub, Atlan, Collibra). Per-column transformation history, reviewer overrides, and SLA breaches all flow to the same place.

Where does our data live?

In your warehouse. The control plane (self-hosted or SaaS) sees only metadata, lineage events, and the small samples used for preview, never raw production rows.

See InsightXform on your warehouse

Bring a messy table from your Snowflake or BigQuery account. We’ll profile, cleanse, and validate it live in a 30-minute walkthrough, preview-before-apply included.

Request a demo