Data Quality
Data Expectations
Fennel's powerful type system lets you maintain data integrity by outright rejecting any data that doesn't match the given types. However, some data expectations are probabilistic in nature.
As an example, you may have a field in a dataset of type Optional[str] that denotes the user's city (it can be None if the user didn't provide their city). While this field is nullable, in practice we expect most people to fill out their city. In other words, we don't want to reject null values outright, but we still want to track whether the fraction of null values is higher than expected.
Fennel lets you do this by writing data expectations. Once expectations are specified, Fennel tracks the percentage of rows that fail each expectation and can alert you when the failure rate exceeds the specified tolerance.
    from datetime import datetime

    from fennel.datasets import dataset
    from fennel.lib import (
        expectations,
        expect_column_values_to_be_between,
        expect_column_values_to_be_in_set,
        expect_column_pair_values_A_to_be_greater_than_B,
    )
    from fennel.dtypes import between


    @dataset
    class Sample:
        # Type restrictions: rows violating these are rejected outright.
        passenger_count: between(int, 0, 100)
        gender: str
        age: between(int, 0, 100, strict_min=True)
        mothers_age: between(int, 0, 100, strict_min=True)
        timestamp: datetime

        @expectations
        def my_function(cls):
            return [
                # At least 95% of rows should have 1 to 6 passengers.
                expect_column_values_to_be_between(
                    column=str(cls.passenger_count),
                    min_value=1,
                    max_value=6,
                    mostly=0.95,
                ),
                # At least 99% of rows should have one of the listed values.
                expect_column_values_to_be_in_set(
                    str(cls.gender), ["male", "female"], mostly=0.99
                ),
                # Pairwise expectation
                expect_column_pair_values_A_to_be_greater_than_B(
                    column_A=str(cls.age), column_B=str(cls.mothers_age)
                ),
            ]
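To make the tolerance concrete, here is a minimal sketch in plain Python of the bookkeeping an expectation implies, using the nullable city field from the earlier example. It is not Fennel's implementation: it assumes that `mostly` is the minimum fraction of rows that must pass (so `mostly=0.95` tolerates up to 5% failures), and the helper names `pass_fraction` and `should_alert` are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pass_fraction(
    values: Sequence[Optional[str]],
    check: Callable[[Optional[str]], bool],
) -> float:
    """Return the fraction of values that satisfy `check`."""
    if not values:
        return 1.0  # assumption: an empty batch has no failures
    return sum(1 for v in values if check(v)) / len(values)


def should_alert(
    values: Sequence[Optional[str]],
    check: Callable[[Optional[str]], bool],
    mostly: float,
) -> bool:
    """Alert when fewer than `mostly` of the rows meet the expectation."""
    return pass_fraction(values, check) < mostly


def has_city(city: Optional[str]) -> bool:
    """The expectation: the user filled out their city."""
    return city is not None


# Hypothetical batch in which 2 of 10 users left their city blank.
cities = ["Paris", None, "Lima", "Oslo", "Pune", None, "Rome", "Kyiv", "Doha", "Cairo"]

rate = pass_fraction(cities, has_city)
print(f"rows with a city: {rate:.0%}")
print("alert at mostly=0.95:", should_alert(cities, has_city, mostly=0.95))
print("alert at mostly=0.75:", should_alert(cities, has_city, mostly=0.75))
```

In this batch 80% of rows pass, so the stricter threshold would raise an alert while the looser one would not. No row is dropped in either case.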
Type Restrictions vs Expectations
Type restrictions and expectations may appear similar but serve very different purposes. Type restrictions simply reject any row that doesn't satisfy the restriction. As a result, all data stored in Fennel datasets can be trusted to follow the type restriction rules.
Data expectations, on the other hand, don't reject data. They passively track how often an expectation is not met and alert you if that frequency is higher than some threshold. Type restrictions are the stronger check and should be preferred if no exceptions to the restriction are allowed.
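The difference can be sketched in plain Python. This is an illustration under assumptions, not Fennel's code: the function names `ingest_with_restriction` and `ingest_with_expectation` and the sample ages are hypothetical.

```python
from typing import Callable, List, Tuple


def ingest_with_restriction(
    rows: List[int], is_valid: Callable[[int], bool]
) -> List[int]:
    """Type-restriction style: rows that fail the check are never stored."""
    return [row for row in rows if is_valid(row)]


def ingest_with_expectation(
    rows: List[int], is_expected: Callable[[int], bool]
) -> Tuple[List[int], float]:
    """Expectation style: every row is stored; mismatches are only counted."""
    mismatches = sum(1 for row in rows if not is_expected(row))
    mismatch_rate = mismatches / len(rows) if rows else 0.0
    return list(rows), mismatch_rate


def plausible_age(age: int) -> bool:
    """Mirrors between(int, 0, 100, strict_min=True): 0 < age <= 100."""
    return 0 < age <= 100


ages = [25, 40, -3, 130, 61]

# As a restriction, the two out-of-range rows are rejected.
stored_strict = ingest_with_restriction(ages, plausible_age)

# As an expectation, all rows are kept and the mismatch rate is reported.
stored_all, mismatch_rate = ingest_with_expectation(ages, plausible_age)

print(stored_strict)
print(stored_all, mismatch_rate)
```

With a restriction, downstream readers never see the two bad rows. With an expectation, they see all five rows and the 40% mismatch rate is available for alerting.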