Fennel's powerful type system lets you maintain data integrity by outright rejecting any data that doesn't match the declared types. Sometimes, however, expectations about data are more probabilistic in nature.
As an example, you may have a field in a dataset of type Optional[str] that denotes the user's city (it can be None if the user didn't provide their city). While this field is nullable, in practice we expect most people to fill out their city. In other words, we don't want to reject None values outright, but we still want to track whether the fraction of None values is higher than expected.
Fennel lets you do this by writing data expectations. Once expectations are specified, Fennel tracks the percentage of rows that fail each expectation and can alert you when the failure rate is higher than the specified tolerance.
```python
from datetime import datetime

from fennel.datasets import dataset
from fennel.lib.expectations import (
    expectations,
    expect_column_values_to_be_between,
    expect_column_values_to_be_in_set,
    expect_column_pair_values_A_to_be_greater_than_B,
)
from fennel.lib.schema import between


@dataset
class Sample:
    passenger_count: between(int, 0, 100)
    gender: str
    age: between(int, 0, 100, strict_min=True)
    mothers_age: between(int, 0, 100, strict_min=True)
    timestamp: datetime

    @expectations
    def my_function(cls):
        return [
            # passenger_count should usually fall between 1 and 6
            expect_column_values_to_be_between(
                column=str(cls.passenger_count),
                min_value=1,
                max_value=6,
                mostly=0.95,
            ),
            # gender should usually be one of the listed values
            expect_column_values_to_be_in_set(
                str(cls.gender), ["male", "female"], mostly=0.99
            ),
            # Pairwise expectation: a mother is older than her child
            expect_column_pair_values_A_to_be_greater_than_B(
                column_A=str(cls.mothers_age), column_B=str(cls.age)
            ),
        ]
```
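To make the idea of a tolerance concrete, here is a minimal plain-Python sketch of how a pass rate might be compared against a threshold. It does not use Fennel and is not how Fennel evaluates expectations internally. The helpers `pass_rate` and `expectation_met` are hypothetical, and the sketch assumes that `mostly` means the minimum fraction of rows that must satisfy the check.

```python
from typing import Callable, Iterable, Optional


def pass_rate(
    values: Iterable[Optional[str]],
    check: Callable[[Optional[str]], bool],
) -> float:
    """Return the fraction of values for which `check` holds (1.0 if empty)."""
    items = list(values)
    if not items:
        return 1.0
    return sum(1 for v in items if check(v)) / len(items)


def expectation_met(
    values: Iterable[Optional[str]],
    check: Callable[[Optional[str]], bool],
    mostly: float,
) -> bool:
    """Assumed semantics: at least a `mostly` fraction of rows must pass."""
    return pass_rate(values, check) >= mostly


def has_city(city: Optional[str]) -> bool:
    return city is not None


# Ten hypothetical rows of the user city field; two users left it empty.
cities = ["Paris", None, "Lima", "Oslo", "Pune", None, "Kyiv", "Rome", "Doha", "Bonn"]

rate = pass_rate(cities, has_city)
alert = not expectation_met(cities, has_city, mostly=0.9)
```

In this sketch no row is dropped: all ten values remain available, and only the measured pass rate decides whether an alert would be raised.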
Type Restrictions vs Expectations
Type restrictions and expectations may appear similar, but they serve very different purposes. Type restrictions simply reject any row that doesn't satisfy the restriction. As a result, all data stored in Fennel datasets can be trusted to follow the type restriction rules.
Data expectations, on the other hand, don't reject data. They passively track how often an expectation is violated and alert if that frequency exceeds some threshold. Type restrictions are the stronger check and should be preferred when no exceptions to the restriction are allowed.
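The difference between the two checks can be illustrated with a small plain-Python sketch. It is a simplified model, not Fennel code: `enforce_restriction` and `track_expectation` are hypothetical helpers, the rows are made up, and the restriction is assumed to mean 0 < age <= 100, mirroring `between(int, 0, 100, strict_min=True)`.

```python
from typing import List, Optional, Tuple, TypedDict


class Row(TypedDict):
    age: int
    city: Optional[str]


def enforce_restriction(rows: List[Row]) -> List[Row]:
    # Hard check: rows that violate 0 < age <= 100 are rejected outright.
    return [r for r in rows if 0 < r["age"] <= 100]


def track_expectation(rows: List[Row], tolerance: float) -> Tuple[List[Row], bool]:
    # Soft check: every row is kept; only the failure rate is measured
    # and compared against the tolerance.
    failures = sum(1 for r in rows if r["city"] is None)
    rate = failures / len(rows) if rows else 0.0
    return rows, rate > tolerance


incoming: List[Row] = [
    {"age": 34, "city": "Lima"},
    {"age": 0, "city": "Oslo"},   # violates the type restriction
    {"age": 51, "city": None},    # fails the expectation
    {"age": 27, "city": "Pune"},
]

stored = enforce_restriction(incoming)
kept, should_alert = track_expectation(stored, tolerance=0.1)
```

Here the row with age 0 never reaches storage, while the row with a missing city is stored and only contributes to the failure rate that triggers the alert.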