LogoLogo
  • Datafold
  • Introduction
    • Data Diff
      • Continuous Integration
      • Manual Data Diff
      • Diff Results
    • Column-level lineage
      • Usage, popularity, & impact per table or column
    • Alerting
  • ⏱️Quickstart Guide
  • Getting Started
    • Data Warehouses
      • Snowflake
      • BigQuery
      • Redshift
      • Postgres
      • Databricks
    • Configuration
      • Indexing
      • Filtering
      • Profiling
      • Lineage
    • On-prem Deployment
      • AWS
      • GCP
    • SSO
      • Okta
      • Google OAuth
      • SAML
  • Integrations
    • Continuous Integration
      • Source Control with Git
        • GitHub
          • On-prem Github
        • GitLab
      • dbt Cloud
      • dbt Core / datafold-sdk
        • GitHub example
        • GitLab example
      • dbt Configurations
      • datafold-sdk
    • Alert Integrations
      • Slack integration
        • Slack Alerts
        • On-prem Slack Integration
      • Alerting webhooks
    • Data Apps
      • Mode
      • Hightouch
  • Developer
    • Datafold API
      • Alerting
      • GraphQL Metadata API
      • Data Diff
      • Error handling
    • Security
      • GDPR
      • Network Security
Powered by GitBook
On this page
  • Data Catalog
  • Column level lineage

Was this helpful?

  1. Introduction

Column-level lineage

At the heart of Datafold is the data catalog including column level lineage

PreviousDiff ResultsNextUsage, popularity, & impact per table or column

Last updated 2 years ago

Was this helpful?

In order to make the most of the data, data practitioners need to be able to review and assess the various data assets they are working with. Datafold’s Catalog feature helps data practitioners discover their relevant data assets and learn important information about them.

Seeing how the data looks like, and how it flows throughout the system. Allowing people to quickly see where the data comes from, and which tables are affected when making a change to the pipeline.

Data Catalog

When opening the catalog, you can filter down on a database/schema level to quickly see which tables live within the data warehouse.

There are two ways to search for data assets:

  • Simply use the search bar: type a keyword in the search field and find the relevant asset in the dropdown list or resulting table.

  • Use the filters on the left side of the screen to narrow down the results. You can filter for specific schemas or tags, and we can also filter by owners and asset types.

The information shown can be ingested from many sources:

  • Through the Datafold UI you can easily add descriptions, tags, and set owners of certain tables.

  • Fetching the table/column descriptions directly from the data warehouse from the schemas.

  • For full control, you can use the API you set the metadata.

  • If there are any integrations that you would like to see, please don't hesistate to reach out!

These features make sure that the UI is rich, and that you can easily integrate Datafold into your existing data stack. In addition to that, Datafold gives easy insight into the data itself:

The Profile views shows:

  • When the profile were computed

  • Number of rows and columns

  • When the table was last updated

  • The data owner of the data asset

  • The upstream and downstream tables

  • Table/Column description, tags, etc.

Next to that, for each column is gives the profile of the data:

This view gives the analyst, engineer, or scientist quick insights into:

  • Sparsity, which gives an indication of how well that field is populated. Do all rows have an entry or are there many nulls?

  • Distribution, which shows how data is distributed across the number range in the case of numerical data.

Column level lineage

Most data in your database is not loaded directly in the database, but has been assembled from underlying raw data using queries. This is the T step in ELT to refine and combine data from different sources, to combine it into useful information. One table may be generated by an SQL query on the raw data, and that table may again be used by an ETL function to generate another dataset. This creates directional relationships between the various datasets, similar to a family tree with descendants and ancestors.

This relationship is called lineage and can be shown in a lineage graph. It can be generated on a column level (showing the relationships of each column to columns in other tables), or on the table level (showing the relationships of tables in their entirety).

In order to learn where the data comes from, you can easily see an overview in the Lineage tab. Datafold gets the data from the SQL statements that are executed by your ETL tool against the Data Warehouse.

This information is very powerful because we can easily trace downstream and upstream dependencies for any data set in our warehouse and understand exactly how the data is produced and consumed.

It is very easy to navigate through the lineage. By right clicking on another table, you can easily see the up- and downstream dependencies of the specific table.

In the screenshot above you can see the . Datafold also has a rich search engine that allows you to quickly find your data assets:

Sourcing the table/column descriptions, tags, and owners .

For certain databases, we have additional ways of ingesting metadata. For example, for Snowflake we support ingesting the

Original query (click on “Show SQL query” to show the that created the table)

The profiles are computed on an ad hoc basis by default, but it is to compute them outside of business hours, to have the profiles cached in Datafold.

public Datafold dbt-beers repository
from dbt
Query tags.
SQL statement
highly recommended