Great Expectations on Databricks

Run great_expectations in a hosted environment like Databricks

Probhakar
5 min read · Aug 31, 2021

I was working with great_expectations on Databricks Community Edition and initially ran into some issues: GE (i.e. great_expectations) is easy to use via its CLI, but in a hosted environment that workflow is difficult. Here I try to summarize the steps below, which can be done without the CLI and without any data context.

For those wondering what GE is: it is a Python library for data validation.

So, simply put: when we accept data from some source, or when we transform data, we want to check its quality and catch any anomalies before handing the data over to the client or writing it to storage. GE plays the role of the data validator. It has some cool features —

  1. Easy CLI-based operation
  2. Automated data profiling
  3. Compatibility with many data sources like Pandas, Spark, SQL, etc.
  4. HTML reports that are very human friendly
  5. After validation, reports can be sent to email, Slack, Microsoft Teams, etc.

Here we will mainly focus on how to use GE in a hosted environment. For the local environment, it is very well documented here.

So, how does GE work? We have some data as shown below —

We basically define our expectations:

  1. I want the month column to not be null at least 95% of the time
  2. I want the month column to be an integer, with a minimum value of 1 (i.e. January) and a maximum value of 12 (i.e. December).

In GE terms, the first expectation is written as

{'kwargs': {'column': 'month', 'mostly': 0.95},
 'expectation_type': 'expect_column_values_to_not_be_null',
 'meta': {}}
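
The second one maps to the built-in expect_column_values_to_be_between expectation and would look something like this —

{'kwargs': {'column': 'month', 'min_value': 1, 'max_value': 12},
 'expectation_type': 'expect_column_values_to_be_between',
 'meta': {}}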

Here is the list of all the available expectations.

Prerequisite:

  1. A cluster up and running on Databricks

Run the below commands to create a folder named GE_spark_tutorial and download the flights dataset.

# create the directory
dbutils.fs.mkdirs('/FileStore/GE_spark_tutorial/')

Then download the file

!wget --no-check-certificate "https://assets.datacamp.com/production/repositories/1237/datasets/fa47bb54e83abd422831cbd4f441bd30fd18bd15/flights_small.csv" -O /tmp/flights.csv

Now move the file inside /GE_spark_tutorial/data/

# move the file
dbutils.fs.mv('file:/tmp/flights.csv', '/FileStore/GE_spark_tutorial/data/flights.csv')

From here on I will be posting images; the full notebook can be found at the end of this article.

1. Create a unique run id to identify each validation run
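
A minimal sketch of this step, assuming the run id is simply a timestamp-suffixed name (matching the great_expectations_1630421008 folder that shows up later) —

import time

# unique id for this validation run, e.g. great_expectations_1630421008
run_id = f"great_expectations_{int(time.time())}"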

2. Create the Spark DataFrame
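
Something along these lines, reading the CSV we downloaded above —

# read the flights CSV into a Spark DataFrame
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/FileStore/GE_spark_tutorial/data/flights.csv"))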

3. Create a GE wrapper around the Spark DataFrame
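
GE's SparkDFDataset class wraps a Spark DataFrame so that the expectation and profiling methods become available on it —

from great_expectations.dataset import SparkDFDataset

# wrap the Spark DataFrame with GE's dataset API
gdf = SparkDFDataset(df)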

4. Now that we have the gdf object, we can do all sorts of things like

  1. profiling
  2. validation

4.1. Profiling

We will be using the BasicDatasetProfiler to profile the data.
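
A sketch of the profiling call; BasicDatasetProfiler.profile returns the suggested suite along with the validation result produced during profiling —

from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# profile the wrapped DataFrame; returns (suite, validation result)
expectation_suite_based_on_profiling, profiling_result = BasicDatasetProfiler.profile(gdf)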

This creates an initial version of the expectation suite: GE goes through the data and suggests expectations based on what it sees. They may be wrong or not very good, but they are obviously a great starting point. If we print expectation_suite_based_on_profiling, it shows JSON that is not very user-friendly; GE comes in handy for generating HTML from it.

Generating the HTMLs
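
Without a data context, the page renderers can be used directly (a sketch using GE's renderer classes) —

from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer,
    ProfilingResultsPageRenderer,
)
from great_expectations.render.view import DefaultJinjaPageView

# render the draft suite and the profiling result into HTML strings
suite_html = DefaultJinjaPageView().render(
    ExpectationSuitePageRenderer().render(expectation_suite_based_on_profiling))
profiling_html = DefaultJinjaPageView().render(
    ProfilingResultsPageRenderer().render(profiling_result))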

Now we can display them. For example, to display the initial draft expectations —
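
In a Databricks notebook this is just the built-in displayHTML helper —

displayHTML(suite_html)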

To see the profiling result —
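
Similarly —

displayHTML(profiling_html)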

Now, when you profile the data, it automatically adds the suggested expectations to the existing expectation suite named “default”. We can edit that suite by removing unnecessary expectations and adding required ones —

This way we can edit the expectation suite and call gdf.validate() to validate against the edited “default” suite. Here, though, we will create a brand-new expectation suite to validate against —
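
One context-free way to do this is to re-wrap the DataFrame (so we start from an empty suite), call the expectation methods we actually want, and export them; the suite name flights_suite below is made up —

# start from a fresh wrapper so the profiler's suggestions are not carried over
gdf = SparkDFDataset(df)

# declare the expectations we actually want
gdf.expect_column_values_to_not_be_null("month", mostly=0.95)
gdf.expect_column_values_to_be_between("month", min_value=1, max_value=12)

# export them as a brand-new suite, keeping even the failing ones
custom_suite = gdf.get_expectation_suite(discard_failed_expectations=False)
custom_suite.expectation_suite_name = "flights_suite"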

Now we can validate the data against our custom suite —
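
A sketch, passing in the custom suite and the run id created earlier —

# validate the data against the custom suite
validation_result = gdf.validate(expectation_suite=custom_suite, run_id=run_id)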

Now let’s create the HTML for this
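
Same renderer pattern as before, this time with ValidationResultsPageRenderer —

from great_expectations.render.renderer import ValidationResultsPageRenderer

# render the validation result into an HTML string
validation_html = DefaultJinjaPageView().render(
    ValidationResultsPageRenderer().render(validation_result))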

Now save the files
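
dbutils can write the rendered strings straight to DBFS; anything under /FileStore is reachable from the browser via /files/ (the layout below mirrors the URL in the next step) —

# write the HTML reports under /FileStore so they are browsable
base = f"/FileStore/GE_spark_tutorial/{run_id}"
dbutils.fs.put(f"{base}/suite.html", suite_html, True)
dbutils.fs.put(f"{base}/profiling.html", profiling_html, True)
dbutils.fs.put(f"{base}/validation.html", validation_html, True)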

If you are using Databricks, you can access the files from the browser. For example, I am using Databricks Community Edition, so I can access validation.html like this —

https://community.cloud.databricks.com/files/GE_spark_tutorial/great_expectations_1630421008/validation.html?o=44xxxxxxxxxxxx79

Finally, if you want to send the validation result to Microsoft Teams or Slack, you can do that too —
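
GE ships notification actions for use with a data context; without one, posting a short summary to an incoming webhook yourself works too. A hypothetical sketch (the webhook URL is a placeholder) —

import json
import requests

# placeholder incoming-webhook URL; replace with your own Slack/Teams webhook
webhook_url = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

stats = validation_result.to_json_dict()["statistics"]
requests.post(webhook_url, data=json.dumps({
    "text": f"GE run {run_id}: "
            f"{stats['successful_expectations']}/{stats['evaluated_expectations']} "
            f"expectations passed"
}))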

A few extras: if you want to store the expectation suite as a JSON file and use it later —
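
A sketch: serialize the suite with to_json_dict, park the JSON on DBFS, and rebuild the suite later —

import json
from great_expectations.core import ExpectationSuite

suite_path = "/FileStore/GE_spark_tutorial/flights_suite.json"

# save the suite as JSON
dbutils.fs.put(suite_path, json.dumps(custom_suite.to_json_dict()), True)

# ... later: read it back (fs.head returns up to the first 64 KB as a string,
# enough for a small suite) and rebuild the suite
d = json.loads(dbutils.fs.head(suite_path))
loaded_suite = ExpectationSuite(
    expectation_suite_name=d["expectation_suite_name"],
    expectations=d["expectations"],
)
result = SparkDFDataset(df).validate(expectation_suite=loaded_suite)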

Find the files here on Databricks (the link is valid till Feb 2022).

Or the GitHub link.
