Data Quality

Data Expectations

Fennel's powerful type system lets you maintain data integrity by outright rejecting any data that doesn't meet the given types. However, some data expectations are probabilistic in nature.

As an example, you may have a field in a dataset of type Optional[str] that denotes the user's city (it can be None if the user didn't provide their city). While this field is nullable, in practice we expect most people to fill out their city. In other words, we don't want to reject null values outright, but we still want to track whether the fraction of null values is higher than expected.

Fennel lets you do this by writing data expectations. Once expectations are specified, Fennel tracks the percentage of rows that fail each expectation and can alert you when the failure rate is higher than the specified tolerance.

from datetime import datetime

from fennel.datasets import dataset
from fennel.lib.expectations import (
    expectations,
    expect_column_values_to_be_between,
    expect_column_values_to_be_in_set,
    expect_column_pair_values_A_to_be_greater_than_B,
)
from fennel.lib.schema import between


@dataset
class Sample:
    # Type restrictions: rows that violate these are rejected outright.
    passenger_count: between(int, 0, 100)
    gender: str
    age: between(int, 0, 100, strict_min=True)
    mothers_age: between(int, 0, 100, strict_min=True)
    timestamp: datetime

    @expectations
    def my_function(cls):
        return [
            # At least 95% of rows should have 1 to 6 passengers.
            expect_column_values_to_be_between(
                column=str(cls.passenger_count),
                min_value=1,
                max_value=6,
                mostly=0.95,
            ),
            # At least 99% of rows should have one of these values.
            expect_column_values_to_be_in_set(
                str(cls.gender), ["male", "female"], mostly=0.99
            ),
            # Pairwise expectation: a mother is older than her child.
            expect_column_pair_values_A_to_be_greater_than_B(
                column_A=str(cls.mothers_age), column_B=str(cls.age)
            ),
        ]
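
To make the tolerance idea concrete, here is a minimal sketch in plain Python of what tracking a pass rate against a `mostly` threshold could look like, using the nullable city example from above. It is not Fennel's implementation: the helpers `pass_rate` and `should_alert` are hypothetical names, and the sketch assumes `mostly` means the minimum fraction of rows expected to pass, as the name suggests.

```python
from typing import Callable, Optional, Sequence


def pass_rate(
    values: Sequence[Optional[str]],
    passes: Callable[[Optional[str]], bool],
) -> float:
    """Fraction of values that satisfy the check (1.0 for empty input)."""
    if not values:
        return 1.0
    passed = sum(1 for v in values if passes(v))
    return passed / len(values)


def should_alert(
    values: Sequence[Optional[str]],
    passes: Callable[[Optional[str]], bool],
    mostly: float,
) -> bool:
    """Hypothetical alert rule: fire when fewer than `mostly` of the rows pass."""
    return pass_rate(values, passes) < mostly


def not_null(city: Optional[str]) -> bool:
    return city is not None


# Two of the ten users did not provide a city.
cities = ["Paris", None, "Lima", "Oslo", None, "Pune", "Rome", "Kyiv", "Doha", "Bonn"]

rate = pass_rate(cities, not_null)
strict = should_alert(cities, not_null, mostly=0.9)
lenient = should_alert(cities, not_null, mostly=0.75)
```

With 8 of 10 rows passing, the stricter threshold of 0.9 would trigger an alert while 0.75 would not. In both cases every row, including the null ones, is kept.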

Type Restrictions vs Expectations

Type restrictions and expectations may appear similar but serve very different purposes. Type restrictions simply reject any row that doesn't satisfy the restriction. As a result, all data stored in Fennel datasets can be trusted to follow the type restriction rules.

Data expectations, on the other hand, don't reject data. They passively track how often an expectation is violated and alert if that rate is higher than some threshold. Type restrictions are a stronger check and should be preferred if no exceptions to the restriction are allowed.
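
The difference can be illustrated with a small plain-Python sketch. The two functions below are hypothetical stand-ins rather than Fennel APIs: one filters rows the way a type restriction would, the other stores everything and only reports whether the pass rate fell below an assumed `mostly` threshold.

```python
from typing import List, Tuple


def ingest_with_restriction(ages: List[int]) -> List[int]:
    """Restriction style: rows outside [0, 100] are rejected and never stored."""
    return [a for a in ages if 0 <= a <= 100]


def ingest_with_expectation(ages: List[int], mostly: float) -> Tuple[List[int], bool]:
    """Expectation style: every row is stored; violations are only counted."""
    passed = sum(1 for a in ages if 0 <= a <= 100)
    alert = bool(ages) and passed / len(ages) < mostly
    return list(ages), alert


raw = [25, 40, 130, 61, -3]

stored_strict = ingest_with_restriction(raw)
stored_all, alert = ingest_with_expectation(raw, mostly=0.9)
```

Here `stored_strict` contains only the three valid ages, so downstream code can rely on the range. `stored_all` still contains all five rows, and `alert` is set because only 3 of 5 rows pass.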
