Data Quality
Strong Typing
Fennel supports a rich and powerful data type system.
All dataset fields and features in Fennel must be given a type and Fennel
enforces these types strongly. In particular, types don't auto typecast (e.g.
int
values can not be passed where float
is expected) and nullable types
are explicitly declared (e.g. Optional[str]
can take nulls but not str
).
Let's see how this helps prevent quality bugs:
Dataset Fields
Every dataset field must be given a type. Fennel simply rejects any incoming data that doesn't match the type of any field in the dataset - as a result, datasets can always be trusted to only have type compliant data. This prevents any downstream bugs/failures arising due to operations on invalid data.
Type Restrictions
Sometimes application data models require much finer grained enforcement of types
than what is supported by programming languages. For instance, if a dataset field
represents a zip code, while the datatype is str
, only a subset of strings that
match a zip code regex are semantically valid.
Or as another example, if a dataset field represents gender, maybe only a
handful of values are valid (e.g. male
, female
, non-binary
etc.). Fennel's
type system supports type restrictions
using which all these and lot more constraints can be encoded as data types and
thus get checked at compile and runtime everywhere.
Datetime Parsing
Timestamps can be encoded in a variety of formats and this often creates a bunch of
bugs in data engineering world. Fennel has a separate data type for datetime
which
is automatically parsed from a wide variety of formats. As a result, some of the data
may be encoding time as milliseconds since epoch and another as a string in RFC 3339
format and Fennel supports their inter-operation quite nicely.