Research methodology

Seven questions decide whether your AI project fails

A reproducible data-quality audit for mid-cap data teams. We publish the tool we use at myBytes to check every AI undertaking against seven dimensions in the first ten minutes. You can run it on your own data.

⚠️ Safety notice - please read without fail.

Use the tool at your own risk. Run it exclusively on backups or sample extracts of your data, never directly on production data. Make a copy first, then work on the copy. The detailed safety rationale is in DISCLAIMER.md in the companion repository. We deliberately repeat this notice several times: data corruption on a production database is irreversible damage that a methodological tool must not cause.

In our mid-cap implementation conversations one finding turns up with tiring regularity: AI pilots do not fail on the model. They fail on the data quality that nobody audited beforehand. The discussion then lands on the provider, on feature engineering or on hyperparameter tuning, rarely where the error actually lies.

We wanted a tool that checks a dataset in ten minutes against the seven dimensions that, in our experience, decide the success or failure of an AI undertaking. We built it and publish it today. This article describes the seven dimensions, the methodological basis and the findings from a first test against a synthetic dataset with known defects.

1 · The seven dimensions

Each dimension measures a different aspect of data quality and produces a traffic-light rating (green, yellow, red) plus a numerical value.

| No. | Dimension | What it measures | Operational use |
| --- | --- | --- | --- |
| 1 | Completeness | Missing share per column and overall | Are mandatory fields filled? Which optional fields actually reach collection quality? |
| 2 | Schema consistency | Type drift within a column; wrong type inference | Where are numbers passed as strings? Where do date fields sit as object dtype? |
| 3 | Uniqueness | Duplicates on the full row and on declared key columns | Are the aggregates inflated? Did the ETL run load twice? |
| 4 | Value plausibility | Tukey 1.5·IQR and 5σ outliers per numeric column | Are the units correct? Did a sensor produce a jump? |
| 5 | Temporal gaps | Gaps larger than three times the median step between observations | Did an ETL source go silent in a period? |
| 6 | Referential integrity | Orphan share on declared foreign-key relations | Are rows silently dropped in the join? |
| 7 | Representativeness | Drift between two subsets (e.g. train vs. test) via Kolmogorov-Smirnov | Does the model generalize? |

Three of them (5, 6, 7) are optional and are only evaluated if the audit call provides the necessary information (time column, foreign-key specification, split definition).
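To make two of the optional dimensions concrete, here is a minimal sketch of an orphan share (dimension 6) and a two-sample Kolmogorov-Smirnov statistic (dimension 7). It uses only the Python standard library. The function names, the sample data and the exact formulas are assumptions made for illustration; the article does not show how the tool computes them.

```python
from bisect import bisect_right


def orphan_share(child_keys, parent_keys):
    """Share of child rows whose foreign key has no match in the parent keys."""
    parents = set(parent_keys)
    child_keys = list(child_keys)
    if not child_keys:
        return 0.0
    orphans = sum(1 for key in child_keys if key not in parents)
    return orphans / len(child_keys)


def ks_statistic(sample_a, sample_b):
    """Largest vertical distance between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    distance = 0.0
    for x in a + b:  # the maximum is attained at one of the observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Hypothetical foreign-key relation: orders.customer_id -> customers.id
orders_customer_id = [1, 2, 2, 3, 9, 9]
customers_id = [1, 2, 3, 4]
print(orphan_share(orders_customer_id, customers_id))  # 2 of 6 rows have no parent

# Hypothetical train/test split of one numeric column
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test_same = [1.5, 2.5, 3.5, 4.5]
test_shifted = [11.0, 12.0, 13.0, 14.0]
print(ks_statistic(train, test_same))     # small distance, similar distributions
print(ks_statistic(train, test_shifted))  # 1.0, the samples do not overlap
```

The sketch returns only the statistic. A library routine such as scipy.stats.ks_2samp additionally reports a p-value, which is what a traffic-light threshold would more plausibly be based on.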

2 · The tool

Installation and execution in two minutes:

git clone https://github.com/myBytesResearch/data-quality-audit.git
cd data-quality-audit
pip install -e .

# Make a copy first! Never directly on production data.
cp /path/to/production.csv /tmp/audit_copy.csv

dqa audit /tmp/audit_copy.csv \
    --report my_report.md \
    --json scorecard.json

The tool outputs a Markdown report with all seven dimensions plus a JSON scorecard for downstream automation. The CLI prints a safety warning before every data-read operation and requires an explicit confirmation if the path is not visibly marked as a sample or copy (--sample or --copy).

Technical guarantees of the tool:

  • The input file is opened exclusively in read mode.
  • No network calls at runtime. No telemetry. No external paths.
  • No automatic data repair. The tool diagnoses, it does not repair. Repair is a deliberate decision in the data-engineering workflow.

The complete code, the seven dimension modules, the tests and a synthetic dataset with known defects are in the companion repository data-quality-audit.

3 · Test against a synthetic dataset

We built a synthetic dataset with five deliberately injected defects and let the tool loose on it. The defects:

  1. email column with a 40 % missing share
  2. phone column with a 20 % missing share
  3. a mixed_col with mixed Python types (strings and integers)
  4. 50 duplicated rows
  5. three extreme outliers in amount (1·10⁹ and −1·10⁶)

The tool detected and correctly rated four of the five defects in full. The exception was defect 3, the mixed types in mixed_col, where an interesting limitation appeared: when loading from CSV, all values are first interpreted as strings, so the mixed types are lost at row granularity. The tool detects this case cleanly on Parquet inputs; with CSV it becomes visible only in the downstream loading step. This limitation is stated explicitly in the companion repository and in the limitations section below.
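The CSV limitation comes from the text format itself and can be reproduced without the tool. The following sketch round-trips a hypothetical mixed column through CSV using the standard-library csv module (not the tool's own loader) and compares the Python types before and after.

```python
import csv
import io

# Hypothetical column that mixes integers and strings.
original = [1, "two", 3, "four"]
types_before = sorted({type(value).__name__ for value in original})

# Write the column to CSV text, then read it back.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["mixed_col"])
writer.writerows([value] for value in original)

buffer.seek(0)
reloaded = [row["mixed_col"] for row in csv.DictReader(buffer)]
types_after = sorted({type(value).__name__ for value in reloaded})

print(types_before)  # ['int', 'str']
print(reloaded)      # ['1', 'two', '3', 'four']
print(types_after)   # ['str']
```

After the round trip every cell is a string, so a per-row type check has nothing left to find. A typed format such as Parquet stores the column type alongside the data, which is why the article recommends it for this dimension.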

Excerpt from the generated report:

**Overall verdict:** 🟡 YELLOW
**Reason:** Yellow on: completeness, schema_consistency,
uniqueness, value_plausibility

## Dimension 1 · Completeness
🟡 YELLOW - Overall missing share: 6.6 %. Worst column: `email`
at 40.4 % missing.

## Dimension 4 · Value plausibility
🟡 YELLOW - `amount`: 407 Tukey outliers, 1 |z| > 5σ
(range −1 M … 1 bn).

The per-dimension traffic-light evaluation plus the weighted overall verdict is the one output a head of data can show in a stand-up without having to attach an explanation.
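The value-plausibility line in the excerpt reports two counts, one per rule from dimension 4. A minimal sketch of both rules, assuming inclusive quartiles and the population standard deviation (the article does not state which variants the tool uses), with made-up data:

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def sigma_outliers(values, threshold=5.0):
    """Values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical column: 105 ordinary values between 90 and 110, plus two extremes.
amounts = [float(v) for v in range(90, 111)] * 5
amounts += [1e9, -1e6]

print(tukey_outliers(amounts))  # [1000000000.0, -1000000.0]
print(sigma_outliers(amounts))  # [1000000000.0]
```

In this toy column the quartile-based rule flags both extremes, while the σ rule flags only the larger one: the extreme value itself inflates the mean and standard deviation, so −1·10⁶ ends up well within five standard deviations. This is one reason the two counts in a report can differ widely.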

4 · Three real audits, three instructive findings

We let the tool loose on three public datasets, each in its unchanged original form, without prior cleaning. The results are in the audit-reports/ of the companion repository for independent verification.

4.1 UCI Adult Income (32,561 rows) - 🔴 red

A classic classification benchmark, frequently used in courses.

| Dimension | Verdict | Finding |
| --- | --- | --- |
| Completeness | 🟢 | 0 % missing after correct na_values='?' handling |
| Schema consistency | 🟢 | uniform types per column |
| Uniqueness | 🟡 | 24 duplicated rows (0.074 %), low, but non-zero |
| Value plausibility | 🔴 | four columns with massive outlier concentration: fnlwgt, capital_gain, capital_loss, plus one |

The fnlwgt finding is methodologically interesting: it is a sampling weight from the census design, not a real observation feature. Statistically it shows heavy-tail character; substantively it is legitimate. This case is a textbook example of why a tool diagnosis must always be interpreted with domain knowledge.

4.2 Titanic (891 rows) - 🔴 red

The notorious introductory dataset for ML tutorials.

| Dimension | Verdict | Finding |
| --- | --- | --- |
| Completeness | 🟡 | 8.1 % missing overall; Cabin at 77.1 % missing (legendary), Age at 19.9 % missing |
| Schema consistency | 🟢 | clean |
| Uniqueness | 🟢 | no duplicates |
| Value plausibility | 🔴 | SibSp, Parch, Fare with outlier clusters |

The Cabin finding (77.1 % missing) has been present in every Kaggle Titanic notebook for years. The tool finds it in ten seconds, without anyone having had to know beforehand that it exists. That is exactly the operational value: not the remembering, but the systematic finding.

4.3 Apple daily prices, 25 years (6,288 rows) - 🔴 red

Yahoo Finance, AAPL 2000-01-01 to 2024-12-31.

| Dimension | Verdict | Finding |
| --- | --- | --- |
| Completeness | 🟢 | clean |
| Schema consistency | 🟡 | Date column as string, should be datetime |
| Uniqueness | 🟢 | clean |
| Value plausibility | 🔴 | six numeric columns with outlier clusters (typical for financial prices with splits and crashes) |
| Temporal gaps | 🔴 | 168 gaps (2.7 % of intervals), largest gap 4 days |

The Date-as-string finding is the most frequent cause of silently failing time-series splits that we see in mid-cap implementation work.
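One way a string-typed date column can break a split without raising any error is lexicographic comparison. The sketch below uses hypothetical MM/DD/YYYY strings; it illustrates the failure mode in general and is not taken from the AAPL file. ISO-formatted strings (YYYY-MM-DD) happen to sort correctly, which is part of why the problem often goes unnoticed.

```python
from datetime import datetime

# Hypothetical date column stored as MM/DD/YYYY strings.
rows = ["12/30/2019", "12/31/2019", "01/02/2020", "01/03/2020"]
cutoff = "01/01/2020"


def parse(text):
    """Parse an MM/DD/YYYY string into a datetime."""
    return datetime.strptime(text, "%m/%d/%Y")


# String comparison is character by character: no error, but the wrong split.
train_as_strings = [d for d in rows if d < cutoff]

# Parsing first gives the intended chronological split.
train_as_dates = [d for d in rows if parse(d) < parse(cutoff)]

print(train_as_strings)  # []
print(train_as_dates)    # ['12/30/2019', '12/31/2019']
```

The string-based split puts every 2019 row on the wrong side of the cutoff and leaves the training set empty, yet nothing fails at runtime.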

The 168 temporal gaps are methodologically not data defects, they are weekends and US exchange holidays. The tool detects them correctly; the substantive interpretation (“this is normal for daily stock prices”) lies with the domain expert. The finding is still operationally valuable: a time-series model architect working with this dataset must build in the business-day mechanics explicitly, otherwise it comes back as a bug.
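The gap rule from dimension 5 can be sketched in a few lines. The example below builds the US trading days of January 2024 (weekdays minus Martin Luther King Jr. Day on 15 January) and applies "larger than three times the median step". The function name and the strict comparison are assumptions; the article does not show the tool's code.

```python
from datetime import date, timedelta
from statistics import median


def temporal_gaps(dates, factor=3):
    """Return (start, end) pairs whose step is larger than factor * median step."""
    ordered = sorted(dates)
    steps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not steps:
        return []
    threshold = factor * median(steps)
    return [
        (a, b)
        for a, b, step in zip(ordered, ordered[1:], steps)
        if step > threshold
    ]


holidays = {date(2024, 1, 15)}  # Martin Luther King Jr. Day, exchange closed
trading_days = []
day = date(2024, 1, 2)
while day <= date(2024, 1, 31):
    if day.weekday() < 5 and day not in holidays:  # Monday to Friday only
        trading_days.append(day)
    day += timedelta(days=1)

print(temporal_gaps(trading_days))
# [(datetime.date(2024, 1, 12), datetime.date(2024, 1, 16))]
```

With a median step of one day the threshold is three days. Under a strict comparison an ordinary weekend (Friday to Monday, three days) is not flagged, while a weekend extended by a holiday (four days) is, which would be consistent with the reported largest gap of 4 days.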

4b · Three cross-cutting lessons from the audits

The three datasets are very different, and yet a common pattern emerges:

  1. Value plausibility is the most frequently red-lit dimension. Heavy tails are so widespread in real data that the tool almost always raises an alarm here. The next necessary step is always a human one: is the outlier a sensor error, a unit confusion or a genuine rare event?
  2. Schema drift on date columns is inconspicuous at the start and only becomes drama in the time-series split. Of the three datasets, AAPL is the one that shows it, and it is the textbook example.
  3. Completeness can be falsely green if the CSV loader does not recognize the sentinel values. The UCI Adult example is instructive: without na_values='?', Workclass would appear 100 % filled; in reality 5.6 % of its values are missing. A methodologically honest data-quality evaluation requires a deliberate configuration of the sentinel values before the evaluation.
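The sentinel effect from the third lesson is easy to reproduce. The sketch below counts missing values in a small made-up extract in the style of UCI Adult, once with only empty cells treated as missing and once with '?' added as a sentinel. It uses the standard-library csv module; with pandas, the equivalent switch is the na_values parameter of read_csv. The helper name missing_share is hypothetical.

```python
import csv
import io

# Made-up rows in the style of UCI Adult; '?' marks an unknown workclass.
RAW = """age,workclass,education
39,State-gov,Bachelors
50,?,HS-grad
38,Private,HS-grad
53,?,11th
28,Private,Bachelors
"""


def missing_share(text, column, sentinels=("",)):
    """Share of rows whose value in `column` is one of the sentinel strings."""
    rows = list(csv.DictReader(io.StringIO(text)))
    missing = sum(1 for row in rows if row[column].strip() in sentinels)
    return missing / len(rows)


print(missing_share(RAW, "workclass"))                       # 0.0
print(missing_share(RAW, "workclass", sentinels=("", "?")))  # 0.4
```

The data is identical in both calls; only the declared sentinels change, and the completeness reading moves from fully filled to 40 % missing.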

5 · How to use it as a discovery trigger

The tool is deliberately also an entry point to a conversation with us. Concretely:

  1. Make a copy of a dataset you work with or are preparing for an AI project.
  2. Run dqa audit over it.
  3. If the scorecard shows three or more dimensions yellow or red, or if the findings surprise you, get in touch. We read your scorecard (anonymized if you prefer) within the next 48 hours and reply with an assessment that helps you sort your data strategy.

Three ways to reach us:

  • Structured via our pre-meeting questionnaire. You describe your starting situation in ten minutes, we prepare specifically for the conversation. That is the most effective route if you have a concrete AI or data-quality undertaking in preparation.
  • By email to contact@mybytes.com. Send us the scorecard and a few sentences of context, we reply in under 48 hours.
  • Directly on +49 40 60940415. If you prefer to talk briefly before sending material.

No sales pitch, no multi-part sequence. One conversation in which we find out whether the tool has surfaced something where our implementation experience can help you further.

6 · Steel-man: “we already do data quality with great_expectations”

A plausible counter-position: “We already have a data-quality stack with great_expectations, pydantic or dbt tests. Do we even need this?” Three answers:

  • Different layers. great_expectations and similar frameworks are excellent for testing expectations against a known schema. They require you to formulate the expectation in advance. Our tool starts without an expectation and shows you what is actually hiding in your dataset. It is the first step before the tests.
  • Ten minutes instead of two weeks. Fully setting up a great_expectations suite takes, depending on the data model, between one and four weeks. A dqa evaluation of a fresh dataset takes ten minutes. Both have their place; they do not exclude each other.
  • Method tradition. The dimensions are not invented but drawn from the established statistical literature (Tukey 1977, Kolmogorov 1933, Smirnov 1939) and the pandas convention. The tool bundles them in a reproducible CLI form.

7 · What this tool does not do

Six methodological caveats that belong to the candid picture:

  • It does not replace a schema contract. If you need a data-engineering tool for contract tests against a known schema, great_expectations or pandera is the better choice. dqa is the diagnostic first stage before it.
  • It does not repair. The tool diagnoses. The repair is a deliberate decision that must happen in your data-engineering workflow.
  • CSV limit on schema consistency. On the CSV read path, mixed types are masked by the pandas CSV reader. If you need this dimension sharply, export to Parquet for the audit phase.
  • Optional dimensions need hints. Temporal gaps, referential integrity and representativeness are only meaningful if the call provides the time column, the foreign-key specification or the split.
  • No domain knowledge. “Outlier” means statistically exceptional. Whether a value is actually implausible in your domain only your domain expert knows.
  • Safety limit. The tool is only safe if you stick to the disclaimer. Copy first. Audit on the copy. Never on production.

The complete limitations list with further points is in DISCLAIMER.md and in the companion-repo README.

8 · Reading list

  1. Tukey 1977, Exploratory Data Analysis, a classic, source of the 1.5·IQR outlier test.
  2. Kolmogorov 1933 / Smirnov 1939, two-sample KS test, the drift-detection basis in dimension 7.
  3. pandas documentation, Missing data, the conventions the tool follows.
  4. Great Expectations docs, the contract-test layer above the diagnosis layer.

Related work

Companion repository

myBytesResearch/data-quality-audit, complete code, seven dimension modules, CLI, tests, a synthetic example dataset with known defects. Private at publication time; the visibility flip to public is a separate decision. Important: read DISCLAIMER.md first. Use at your own risk, exclusively on backups or samples.

Disclaimer

This article describes a data-quality audit tool from our implementation practice at mid-cap companies. It is neither an investment recommendation nor data-engineering strategy advice. The tool is to be applied at your own risk and exclusively on backups or sample extracts, never directly on production data. Repairing identified defects is the responsibility of your data-engineering workflow, not of this tool.
Independent reviewer: open invitation. Companion repository data-quality-audit with a pip-install pipeline, seven dimension modules, a CLI with safety banner and confirmation gate, JSON+Markdown output, a test suite and a synthetic example dataset.