Concepts
Dataset
A dataset is table-like data with typed columns. Datasets can be sourced from external datasets or derived from other datasets via pipelines. Datasets are written as Pydantic-inspired Python classes decorated with the @dataset decorator. Let's look at an example:
```python
from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib import meta


@meta(owner="[email protected]")
@dataset
class User:
    uid: int = field(key=True)
    dob: datetime
    country: str
    update_time: datetime = field(timestamp=True)
```
Dataset Schema
A dataset has a few fields (interchangeably referred to as columns throughout the documentation) with types and unique names. Each field must have a pre-specified datatype. See the typing section to learn the types supported by Fennel.
Field Descriptors
You might have noticed the field(...) descriptor next to the uid and update_time fields. These optional descriptors provide non-typing information about the field. Currently, Fennel supports two field descriptors:
Key Fields
Key fields are those with field(key=True) set on them. The semantics are similar to those of a primary key in a relational database. Datasets can be looked up by providing the values of the key fields. It is okay to have a dataset with zero key fields (e.g. click streams); in that case, lookups on the dataset are not possible at all. Fennel also supports multiple key fields on a dataset; in that case, all of them must be provided when doing a lookup. Fields with optional data types cannot be key fields.
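The lookup semantics described above can be sketched with a plain dictionary keyed by a tuple of key values. This is a toy model under stated assumptions, not Fennel's implementation: the names KEY_FIELDS, table, and lookup are hypothetical and exist only for illustration.

```python
# Hypothetical dataset with two key fields: (uid, country).
KEY_FIELDS = ("uid", "country")

# Rows are stored under the tuple of their key values.
table = {
    (1, "US"): {"plan": "free"},
    (1, "CA"): {"plan": "pro"},
}


def lookup(table: dict, **keys):
    """Return the row for the given keys, or None if no such row exists.

    Every key field must be supplied, mirroring the rule that all key
    fields are needed for a lookup.
    """
    missing = [k for k in KEY_FIELDS if k not in keys]
    if missing:
        raise KeyError(f"missing key fields: {missing}")
    return table.get(tuple(keys[k] for k in KEY_FIELDS))


print(lookup(table, uid=1, country="CA"))  # {'plan': 'pro'}
print(lookup(table, uid=2, country="US"))  # None
```

Calling lookup(table, uid=1) raises KeyError in this sketch, because only one of the two key fields was provided.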
Timestamp Field
Timestamp fields are those with field(timestamp=True) set on them. Every dataset must have exactly one timestamp field, and this field must always be of type datetime. Semantically, this field must represent the 'event time' of the row. Fennel uses the timestamp field to associate a particular state of a row with a timestamp. This allows Fennel to handle out-of-order events, do time-window aggregations, and compute point-in-time correct features for training data generation.
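To make point-in-time correctness concrete, here is a minimal sketch in plain Python. It is an illustration, not how Fennel works internally. It assumes that a lookup at time ts should see the latest row whose event time is at or before ts, regardless of the order in which rows arrived. The names rows and as_of are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Hypothetical rows for a single key (uid=1), listed in arrival order.
# Each row is (event_time, country).
rows = [
    (datetime(2023, 3, 1), "CA"),
    (datetime(2023, 1, 1), "US"),  # arrived late, but has an earlier event time
    (datetime(2023, 6, 1), "UK"),
]


def as_of(rows: list, ts: datetime) -> Optional[str]:
    """Return the value of the latest row with event time <= ts, if any."""
    visible = [(t, v) for t, v in rows if t <= ts]
    if not visible:
        return None
    # Pick by event time, not by arrival order.
    return max(visible, key=lambda row: row[0])[1]


print(as_of(rows, datetime(2023, 2, 1)))  # US
print(as_of(rows, datetime(2023, 4, 1)))  # CA
print(as_of(rows, datetime(2022, 1, 1)))  # None
```

Because the result depends only on event times, the late-arriving January row still answers a lookup for February, which is the property that keeps training features free of information from the future.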
If a dataset has exactly one field of datetime type, that field is assumed to be the timestamp field without explicit annotation. However, if a dataset has multiple datetime fields, one of them must be explicitly annotated as the timestamp field.
Here are some examples of valid and invalid datasets:
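```python
from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib import meta


# Valid: update_time is the only datetime field, so it is
# inferred to be the timestamp field.
@meta(owner="[email protected]")
@dataset
class UserValidDataset:
    uid: int
    country: str
    update_time: datetime
```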
```python
from datetime import datetime
from typing import Optional

from fennel.datasets import dataset, field
from fennel.lib import meta


# Invalid: a key field cannot have an optional type.
@meta(owner="[email protected]")
@dataset
class User:
    uid: Optional[int] = field(key=True)
    country: str
    update_time: datetime
```
```python
from fennel.datasets import dataset, field
from fennel.lib import meta


# Invalid: there is no field of type datetime, so the dataset
# has no timestamp field.
@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    update_time: int
```
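```python
from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib import meta


# Invalid: there are two datetime fields and neither is
# explicitly marked as the timestamp field.
@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    created_time: datetime
    updated_time: datetime
```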
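```python
from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib import meta


# Valid: there are two datetime fields, and update_time is
# explicitly marked as the timestamp field.
@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    update_time: datetime = field(timestamp=True)
    signup_time: datetime
```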
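The schema rules from this section (no optional key fields, exactly one timestamp field, inference when there is a single datetime field) can be summarized as a small checker. This is a sketch in plain Python under stated assumptions, not Fennel's validation code: FieldSpec and find_timestamp_field are hypothetical names, and the schema is modeled as a plain dict of field name to FieldSpec.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, get_args


@dataclass
class FieldSpec:
    """Hypothetical stand-in for one column: its type plus descriptor flags."""
    dtype: object
    key: bool = False
    timestamp: bool = False


def is_optional(dtype) -> bool:
    """True for types such as Optional[int], which include NoneType."""
    return type(None) in get_args(dtype)


def find_timestamp_field(schema: dict) -> str:
    """Return the name of the timestamp field, or raise ValueError."""
    for name, spec in schema.items():
        if spec.key and is_optional(spec.dtype):
            raise ValueError(f"key field {name!r} cannot be optional")

    explicit = [n for n, s in schema.items() if s.timestamp]
    if len(explicit) > 1:
        raise ValueError("more than one field is marked as timestamp")
    if explicit:
        if schema[explicit[0]].dtype is not datetime:
            raise ValueError("the timestamp field must be of type datetime")
        return explicit[0]

    # No explicit annotation: infer only if exactly one datetime field exists.
    datetimes = [n for n, s in schema.items() if s.dtype is datetime]
    if len(datetimes) == 1:
        return datetimes[0]
    raise ValueError("need exactly one datetime field or an explicit timestamp")


valid = {
    "uid": FieldSpec(int),
    "country": FieldSpec(str),
    "update_time": FieldSpec(datetime),
}
print(find_timestamp_field(valid))  # update_time
```

Passing a schema with two unannotated datetime fields, or with a key of type Optional[int], raises ValueError in this sketch, matching the invalid examples above.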
Index
Keyed datasets can be indexed by setting index to True in the dataset decorator, and indexed datasets can be looked up via the lookup method on datasets.
```python
from datetime import datetime

from fennel.datasets import dataset, field


@dataset(index=True)
class User:
    uid: int = field(key=True)
    dob: datetime
    country: str
    update_time: datetime = field(timestamp=True)
```
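```python
from datetime import datetime

from fennel.datasets import dataset


# This dataset has no key fields. Since only keyed datasets can be
# indexed, index=True is not expected to be accepted here.
@dataset(index=True)
class UserValidDataset:
    uid: int
    country: str
    update_time: datetime
```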
By default, setting index=True builds both the online index and the full historical offline index. This can be configured via the online and offline kwargs on the decorator (by default, online is set to True and offline is set to forever).
```python
from datetime import datetime

from fennel.datasets import dataset, field


# Only the online index is built; the offline index is turned off.
@dataset(index=True, offline=None)
class User:
    uid: int = field(key=True)
    dob: datetime
    country: str
    update_time: datetime = field(timestamp=True)
```
Selectively turning on only one of these can be a good way to save storage and compute costs. A dataset needs to be indexed with offline set to forever for it to be joinable as the right-hand side (RHS) dataset of a join.
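The defaults and the join requirement described above can be captured in a small model. This is a sketch under stated assumptions and not part of Fennel's API: IndexConfig is a hypothetical class, and spelling the offline value as the string "forever" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexConfig:
    """Hypothetical model of the index settings on the dataset decorator."""
    index: bool = False
    online: bool = True                 # default stated in the text
    offline: Optional[str] = "forever"  # default stated in the text (spelling assumed)

    def has_online_index(self) -> bool:
        return self.index and self.online

    def joinable_as_rhs(self) -> bool:
        # The text requires an offline index set to forever for RHS joins.
        return self.index and self.offline == "forever"


default = IndexConfig(index=True)
online_only = IndexConfig(index=True, offline=None)

print(default.joinable_as_rhs())       # True
print(online_only.joinable_as_rhs())   # False
print(online_only.has_online_index())  # True
```

The online_only configuration corresponds to the index=True, offline=None example: it saves the cost of the offline index but gives up the ability to serve as the RHS of a join.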
Versioning
All Fennel datasets are versioned and each version is immutable. The version of a dataset can be explicitly specified in the @dataset decorator as follows:
```python
from datetime import datetime

from fennel.datasets import dataset, field


@dataset(version=2)
class Product:
    pid: int = field(key=True)
    price: float
    update_time: datetime = field(timestamp=True)
```
Increasing the version of a dataset can be accompanied by any other changes: changing the schema, changing the source, changing pipeline code, adding or removing an index, etc. In all of these scenarios, Fennel recomputes the full dataset when the version is incremented.
The version of a dataset also captures its entire ancestry graph. In other words, when the version of a dataset is incremented, the versions of all downstream datasets that depend on it must also be incremented, leading to their reconstruction as well.
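The ancestry rule amounts to a graph traversal: bumping one dataset means bumping everything reachable downstream of it. The sketch below illustrates that with a breadth-first search over a made-up dependency graph. The graph contents and the function name datasets_to_bump are hypothetical, and this is not a Fennel utility.

```python
from collections import deque

# Hypothetical dependency graph: dataset -> datasets derived directly from it.
downstream = {
    "Product": ["ProductStats", "Orders"],
    "Orders": ["Revenue"],
    "ProductStats": [],
    "Revenue": [],
    "User": ["UserProfile"],
    "UserProfile": [],
}


def datasets_to_bump(changed: str, graph: dict) -> list:
    """Return every dataset downstream of `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(datasets_to_bump("Product", downstream))
# ['ProductStats', 'Orders', 'Revenue']
```

In this example, incrementing the version of Product requires incrementing ProductStats, Orders, and Revenue as well, while User and UserProfile are unaffected.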
As of right now, Fennel doesn't support keeping two versions of a dataset alive simultaneously. It recommends either creating datasets with different names or running two parallel branches with different versions of the dataset.
Meta Flags
Datasets can be annotated with useful meta information via meta flags, either at the dataset level or at the level of a single field. To ensure code ownership, Fennel requires every dataset to have an owner. Here is an example:
```python
from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib import meta


@meta(owner="[email protected]", tags=["PII", "experimental"])
@dataset
class UserWithMetaFlags:
    uid: int = field(key=True)
    height: float = field().meta(description="height in inches")
    weight: float = field().meta(description="weight in lbs")
    updated: datetime
```
Typing the same owner name again and again for each dataset can get repetitive. To avoid that, you can specify an owner at the module level by setting __owner__, and all the datasets in that module inherit this owner. Example:
```python
from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib import meta

# Module-level owner, inherited by datasets without an explicit owner.
__owner__ = "[email protected]"


@dataset
class UserBMI:
    uid: int = field(key=True)
    height: float
    weight: float
    bmi: float
    updated: datetime


# An explicit meta flag overrides the module-level owner.
@meta(owner="[email protected]")
@dataset
class UserName:
    uid: int = field(key=True)
    name: str
    updated: datetime


@dataset
class UserLocation:
    uid: int = field(key=True)
    city: str
    updated: datetime
```
In this example, datasets UserBMI and UserLocation both inherit the owner from the module-level __owner__, whereas dataset UserName overrides it by providing an explicit meta flag.
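The precedence between the two ways of setting an owner can be written out as a short function. This is a sketch of the rule as described here, not Fennel's own code: resolve_owner is a hypothetical helper, and the example.com addresses are placeholders.

```python
from typing import Optional


def resolve_owner(explicit_owner: Optional[str], module_owner: Optional[str]) -> str:
    """Pick the owner of a dataset.

    An explicit meta flag wins; otherwise the module-level __owner__ is
    used. A dataset with neither is rejected, since every dataset must
    have an owner.
    """
    if explicit_owner is not None:
        return explicit_owner
    if module_owner is not None:
        return module_owner
    raise ValueError("every dataset must have an owner")


# Placeholder addresses, used only for illustration.
module_owner = "data-team@example.com"

print(resolve_owner(None, module_owner))                 # data-team@example.com
print(resolve_owner("names@example.com", module_owner))  # names@example.com
```

The first call mirrors UserBMI and UserLocation, which fall back to the module-level owner, and the second mirrors UserName, which overrides it.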