Concepts
Dataset
A dataset is table-like data with typed columns. Datasets can be sourced from external datasets or derived from other datasets via pipelines. Datasets are written as Pydantic-inspired Python classes decorated with the @dataset decorator.
Example
@meta(owner="[email protected]")
@dataset
class User:
    uid: int = field(key=True)
    dob: datetime
    country: str
    update_time: datetime = field(timestamp=True)
Dataset Schema
A dataset has a few fields (interchangeably referred to as columns), each with a type and a unique name. Each field must have a pre-specified datatype. See the typing section to learn the types supported by Fennel.
Field Descriptors
You might have noticed the field(...) descriptor next to the uid and update_time fields. These optional descriptors are used to provide non-typing related information about the field. Currently, Fennel supports two field descriptors:
Key Fields
Key fields are those with field(key=True) set on them. The semantics are somewhat similar to those of a primary key in relational databases. Datasets can be looked up by providing the values of the key fields. It is okay to have a dataset with zero key fields (e.g. click streams); in that case, it is not possible to do lookups on the dataset at all. Fennel also supports multiple key fields on a dataset; in that case, all of them need to be provided when doing a lookup. Fields with optional data types cannot be key fields.
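The lookup semantics described above can be illustrated with a small, self-contained sketch. It is plain Python, not Fennel's lookup API: the (uid, country) key pair, the `plan` column, and the helper names `key_of`, `upsert` and `lookup` are all hypothetical and exist only for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stand-in for a dataset with two key fields.
# This is NOT how Fennel stores or serves data; it only shows the idea
# that a lookup needs a value for every key field.
Row = Dict[str, object]
KEY_FIELDS = ("uid", "country")  # assumed key fields for this example


def key_of(row: Row) -> Tuple[object, ...]:
    """Build the composite key from every key field, in a fixed order."""
    return tuple(row[name] for name in KEY_FIELDS)


def upsert(index: Dict[Tuple[object, ...], Row], row: Row) -> None:
    """Store the row under its composite key, replacing any previous row."""
    index[key_of(row)] = row


def lookup(index: Dict[Tuple[object, ...], Row], **key_values: object) -> Optional[Row]:
    """Return the row for the given key values, or None if there is none.

    Raises KeyError when a key field is left out, mirroring the rule that
    all key fields must be provided for a lookup.
    """
    missing = [name for name in KEY_FIELDS if name not in key_values]
    if missing:
        raise KeyError(f"missing key fields: {missing}")
    return index.get(tuple(key_values[name] for name in KEY_FIELDS))
```

With this sketch, `lookup(index, uid=1, country="US")` returns the stored row, while `lookup(index, uid=1)` raises because one key field is missing. A dataset with no key fields would have no such index at all, which is why lookups are not possible on it.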
Timestamp Field
Timestamp fields are those with field(timestamp=True) set on them. Every dataset must have exactly one timestamp field, and this field must always be of type datetime. Semantically, this field must represent the 'event time' of the row. Fennel uses the timestamp field to associate a particular state of a row with a timestamp. This allows Fennel to handle out-of-order events, do time-window aggregations, and compute point-in-time correct features for training data generation.
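To make the role of event time concrete, here is a minimal sketch of an "as of" lookup over row versions ordered by their timestamp. It is an illustration in plain Python under simplifying assumptions (one int key, one str column, everything in memory); the class `VersionedRows` and its methods are hypothetical and say nothing about how Fennel implements this internally.

```python
import bisect
from datetime import datetime
from typing import Dict, List, Optional


class VersionedRows:
    """Keeps every version of a row per key, sorted by event time."""

    def __init__(self) -> None:
        self._times: Dict[int, List[datetime]] = {}
        self._values: Dict[int, List[str]] = {}

    def insert(self, uid: int, event_time: datetime, country: str) -> None:
        """Record a new state of the row; arrival order does not matter."""
        times = self._times.setdefault(uid, [])
        values = self._values.setdefault(uid, [])
        # Insert at the sorted position so late-arriving (out-of-order)
        # events still land in the right place on the timeline.
        pos = bisect.bisect_right(times, event_time)
        times.insert(pos, event_time)
        values.insert(pos, country)

    def as_of(self, uid: int, ts: datetime) -> Optional[str]:
        """Return the value whose event time is the latest one <= ts."""
        times = self._times.get(uid, [])
        pos = bisect.bisect_right(times, ts)
        return self._values[uid][pos - 1] if pos else None
```

Because every state is tied to its event time rather than its arrival time, a query "as of" a past timestamp only ever sees rows that existed at that moment, which is the property needed to avoid leaking future information into training data.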
If a dataset has exactly one field of datetime type, it is assumed to be the timestamp field of the dataset without explicit annotation. However, if a dataset has multiple datetime fields, one of them needs to be explicitly annotated as the timestamp field.
Here are some examples of valid and invalid datasets:
Valid - has no key fields, which is fine. There is no explicitly marked timestamp field, so update_time, which is of type datetime, is automatically assumed to be the timestamp field.
@meta(owner="[email protected]")
@dataset
class UserValidDataset:
    uid: int
    country: str
    update_time: datetime
Invalid - key fields cannot have an optional type
@meta(owner="[email protected]")
@dataset
class User:
    uid: Optional[int] = field(key=True)
    country: str
    update_time: datetime
Invalid - no field of datetime type
@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    update_time: int
Invalid - no explicitly marked timestamp field and multiple fields of type datetime, hence the timestamp field is ambiguous
@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    created_time: datetime
    updated_time: datetime
Valid - even though there are multiple datetime fields, one of them is explicitly annotated as the timestamp field.
@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    update_time: datetime = field(timestamp=True)
    signup_time: datetime
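The rules behind the valid and invalid examples above can be summarized in a short checker. This is a sketch written from the rules as stated in this section, not Fennel's actual validation code: the function `resolve_timestamp_field`, its parameters, and the plain dict used as a schema are hypothetical, and only `Optional[...]` style annotations are recognized as optional.

```python
from datetime import datetime
from typing import Dict, Iterable, Optional, Union, get_args, get_origin


def is_optional(tp: object) -> bool:
    """True for Optional[X] / Union[X, None] annotations."""
    return get_origin(tp) is Union and type(None) in get_args(tp)


def resolve_timestamp_field(
    schema: Dict[str, object],
    keys: Iterable[str] = (),
    timestamp: Optional[str] = None,
) -> str:
    """Return the name of the timestamp field, or raise ValueError.

    schema maps field names to types, keys lists the fields marked as
    key fields, and timestamp is the explicitly marked field, if any.
    """
    # Rule: key fields cannot have an optional type.
    for name in keys:
        if is_optional(schema[name]):
            raise ValueError(f"key field '{name}' cannot be optional")

    # Rule: an explicitly marked timestamp field must be a datetime.
    if timestamp is not None:
        if schema[timestamp] is not datetime:
            raise ValueError(f"timestamp field '{timestamp}' must be datetime")
        return timestamp

    # Rule: without a marker, exactly one datetime field must exist.
    candidates = [name for name, tp in schema.items() if tp is datetime]
    if not candidates:
        raise ValueError("no field of type datetime")
    if len(candidates) > 1:
        raise ValueError(f"ambiguous timestamp field: {candidates}")
    return candidates[0]
```

For instance, `resolve_timestamp_field({"uid": int, "country": str, "update_time": datetime})` picks update_time on its own, while a schema with both created_time and updated_time of type datetime raises unless `timestamp` names one of them.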
Meta Flags
Datasets can be annotated with useful meta information via meta flags, either at the dataset level or at the level of a single field. To ensure code ownership, Fennel requires every dataset to have an owner. Here is an example:
@meta(owner="[email protected]", tags=["PII", "experimental"])
@dataset
class UserWithMetaFlags:
    uid: int = field(key=True)
    height: float = field().meta(description="height in inches")
    weight: float = field().meta(description="weight in lbs")
    updated: datetime