Continuous Integration

Learn more about using Diff in your CI process

Data Diff works best as part of the CI process. As with automated testing, this keeps you in full control of the changes you make to your data pipeline, and you know exactly what their impact is.
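
For example, a CI job can kick off a diff between the table built from the PR branch and its production counterpart once the PR version has been built. The sketch below is only an illustration of that step; the endpoint path, payload fields, and environment variable are assumptions made for this example, not the documented Datafold API.

```python
# Illustrative sketch only: the endpoint path, payload fields, and environment
# variable are assumptions for this example, not the documented Datafold API.
import os
import requests

def trigger_data_diff(prod_table: str, pr_table: str, primary_key: str) -> dict:
    """Ask the diffing service to compare the PR build of a table against production."""
    response = requests.post(
        "https://app.datafold.example/api/v1/datadiffs",  # hypothetical endpoint
        headers={"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"},
        json={
            "table_a": prod_table,       # latest production run
            "table_b": pr_table,         # version built from the PR branch
            "primary_key": primary_key,  # used to match rows between the two tables
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # A CI job could run this script after the PR version of the table is built.
    print(trigger_data_diff("analytics.sales", "analytics_pr.sales", "order_id"))
```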

Let's consider the following PR:

Suppose you want to trim the input values that you're getting from an external source. When you open the PR, Datafold automatically compares the version that you've created against the latest production run and posts the differences.
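
As an illustration of such a change, the PR might strip whitespace from the incoming values before they are loaded downstream. A minimal sketch, assuming a hypothetical `raw_events` DataFrame with a `customer_name` column:

```python
# Illustrative sketch only: `raw_events` and `customer_name` are hypothetical names.
import pandas as pd

def clean_external_input(raw_events: pd.DataFrame) -> pd.DataFrame:
    """Trim whitespace from values received from the external source."""
    cleaned = raw_events.copy()
    # The change under review: strip leading/trailing whitespace before loading downstream.
    cleaned["customer_name"] = cleaned["customer_name"].str.strip()
    return cleaned
```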

The compiled report gives a high-level overview of what changed in the table and what changed in the downstream dependencies (a sketch after the list illustrates how these metrics can be computed):

  • Table rows – total number of rows in each table

  • Table columns – total number of columns in each table

  • Schema diff

    • Total columns – total number of columns in each table

    • Mismatched columns – number of columns that are exclusive to one table, have been reordered, or have changed type.

  • Primary keys diff

    • Distinct PKs – number of unique primary keys in each table

    • Exclusive PKs – number of primary keys that exist in one table but not in the other.

    • Null PKs – number of rows where the primary key field is NULL

    • Duplicate PKs – number of rows with the same value in the primary key field

  • Values diff

    • Differing columns – number of columns that have differences in values. Only matching columns (same name, same type) are compared.

    • Total differing rows – number of rows with at least one differing column value.

    • Total differing values – total number of different values across both rows and columns (cells) in the table.
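
To make these definitions concrete, here is a minimal sketch of how the metrics could be computed for two pandas DataFrames holding the production and PR versions of a table. It is an illustration of what the numbers mean, not how Datafold computes the report:

```python
# Minimal sketch, assuming the production and PR versions of a table are loaded
# into DataFrames `prod` and `pr` with a primary-key column `pk`.
import pandas as pd

def summarize_diff(prod: pd.DataFrame, pr: pd.DataFrame, pk: str) -> dict:
    # Schema diff: columns exclusive to one of the two tables
    # (reorderings and type changes are omitted in this sketch).
    mismatched_columns = set(prod.columns) ^ set(pr.columns)

    # Primary keys diff.
    prod_pks, pr_pks = set(prod[pk].dropna()), set(pr[pk].dropna())
    exclusive_pks = prod_pks ^ pr_pks
    null_pks = int(prod[pk].isna().sum() + pr[pk].isna().sum())
    duplicate_pks = int(prod[pk].duplicated().sum() + pr[pk].duplicated().sum())

    # Values diff: only matching columns are compared, on rows present in both tables.
    shared_cols = [c for c in prod.columns if c in pr.columns and c != pk]
    joined = prod.set_index(pk)[shared_cols].join(
        pr.set_index(pk)[shared_cols], how="inner", lsuffix="_prod", rsuffix="_pr"
    )
    unequal = pd.DataFrame(
        {c: joined[f"{c}_prod"].ne(joined[f"{c}_pr"]) for c in shared_cols}
    )

    return {
        "mismatched_columns": len(mismatched_columns),
        "exclusive_pks": len(exclusive_pks),
        "null_pks": null_pks,
        "duplicate_pks": duplicate_pks,
        "differing_columns": int(unequal.any(axis=0).sum()),
        "total_differing_rows": int(unequal.any(axis=1).sum()),
        "total_differing_values": int(unequal.to_numpy().sum()),
    }
```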

For more details, you can jump into Datafold, where you can see a more detailed overview of the changes.

You can also jump into the differences in the distributions of the columns.

At Datafold, we believe this process needs to be automated. Without automation, a team under pressure might fall back on shallow checks, letting bugs sneak into the pipeline.

In this case, we've added orders that were marked as pending to the sales table. The reviewer of the PR can do the review by checking Datafold, and doesn't need to compose queries for sanity checking.
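
To make the example concrete, the change under review might relax a status filter on the orders feeding the sales table. A minimal sketch, assuming a hypothetical `orders` DataFrame with a `status` column:

```python
# Illustrative sketch only: `orders`, `status`, and the status values are hypothetical.
import pandas as pd

def build_sales(orders: pd.DataFrame) -> pd.DataFrame:
    # Before the PR, only completed orders made it into the sales table:
    #   return orders[orders["status"] == "completed"]
    # After the PR, pending orders are included as well.
    return orders[orders["status"].isin(["completed", "pending"])]
```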
