Concepts

Dataset

A dataset is table-like data with typed columns. Datasets can be sourced from external datasets or derived from other datasets via pipelines. They are written as Pydantic-inspired Python classes decorated with the @dataset decorator. Let's look at an example:

from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

@meta(owner="[email protected]")
@dataset
class User:
    uid: int = field(key=True)
    dob: datetime
    country: str
    update_time: datetime = field(timestamp=True)

Dataset Schema

A dataset has a few fields (interchangeably referred to as columns throughout the documentation), each with a type and a unique name. Each field must have a pre-specified datatype. See the typing section to learn the types supported by Fennel.

Field Descriptors

You might have noticed the field(...) descriptor next to the uid and update_time fields. These optional descriptors provide information about a field that is not related to its type. Currently, Fennel supports two field descriptors:

Key Fields

Key fields are those with field(key=True) set on them. Their semantics are somewhat similar to those of a primary key in relational databases. Datasets can be looked up by providing the values of their key fields. It is okay to have a dataset with zero key fields (e.g. click streams); in that case, it is not possible to do lookups on the dataset at all. Fennel also supports multiple key fields on a dataset; in that case, all of them need to be provided when doing a lookup. Fields with optional data types cannot be key fields.
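The lookup semantics described above can be sketched in plain Python. This is only an illustration, not Fennel's implementation or API: the KeyedTable class and its upsert and lookup methods are hypothetical names introduced here, and the sketch keeps only the latest row per key.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, object]


class KeyedTable:
    """Hypothetical in-memory stand-in for a dataset with key fields."""

    def __init__(self, key_fields: Tuple[str, ...]) -> None:
        if not key_fields:
            # Mirrors the rule that keyless datasets cannot be looked up.
            raise ValueError("a table with zero key fields cannot be looked up")
        self.key_fields = key_fields
        self._rows: Dict[Tuple[object, ...], Row] = {}

    def upsert(self, row: Row) -> None:
        key = tuple(row[f] for f in self.key_fields)
        # Key values must always be present, so None is rejected.
        if any(value is None for value in key):
            raise ValueError("key fields cannot be None")
        self._rows[key] = row

    def lookup(self, **keys: object) -> Optional[Row]:
        # Every key field must be supplied, no more and no fewer.
        if set(keys) != set(self.key_fields):
            raise KeyError(f"lookup requires exactly {self.key_fields}")
        return self._rows.get(tuple(keys[f] for f in self.key_fields))


table = KeyedTable(key_fields=("uid", "country"))
table.upsert({"uid": 1, "country": "US", "plan": "pro"})
print(table.lookup(uid=1, country="US"))
```

With two key fields, calling table.lookup(uid=1) alone raises a KeyError in this sketch, which matches the requirement that all key fields be provided for a lookup.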

Timestamp Field

Timestamp fields are those with field(timestamp=True) set on them. Every dataset must have exactly one timestamp field, and this field must always be of type datetime. Semantically, this field must represent the 'event time' of the row. Fennel uses the timestamp field to associate a particular state of a row with a timestamp. This allows Fennel to handle out-of-order events, do time-window aggregations, and compute point-in-time correct features for training data generation.

If a dataset has exactly one field of datetime type, it is assumed to be the timestamp field of the dataset without explicit annotation. However, if a dataset has multiple datetime fields, one of them needs to be explicitly annotated as the timestamp field.
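To make the role of event time concrete, here is a minimal sketch of a point-in-time lookup that tolerates out-of-order arrival. It is an illustration under stated assumptions, not how Fennel stores data: the VersionedRows class and its insert and as_of methods are hypothetical names, and rows are keyed by a single integer uid for simplicity.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple


class VersionedRows:
    """Hypothetical store that keeps every version of a row, ordered by event time."""

    def __init__(self) -> None:
        self._history: Dict[int, List[Tuple[datetime, dict]]] = {}

    def insert(self, uid: int, ts: datetime, row: dict) -> None:
        versions = self._history.setdefault(uid, [])
        # Position by event time, so a late-arriving row still lands in order.
        idx = bisect_right([t for t, _ in versions], ts)
        versions.insert(idx, (ts, row))

    def as_of(self, uid: int, ts: datetime) -> Optional[dict]:
        # Return the latest version whose event time is at or before ts.
        versions = self._history.get(uid, [])
        idx = bisect_right([t for t, _ in versions], ts)
        return versions[idx - 1][1] if idx > 0 else None


rows = VersionedRows()
rows.insert(1, datetime(2023, 3, 1), {"country": "CA"})
rows.insert(1, datetime(2023, 1, 1), {"country": "US"})  # arrives late
print(rows.as_of(1, datetime(2023, 2, 1)))
```

Because versions are ordered by event time rather than arrival order, a query as of February sees the January row even though it was inserted after the March row. This is the property that makes point-in-time correct feature generation possible.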

Here are some examples of valid and invalid datasets:

from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

@meta(owner="[email protected]")
@dataset
class UserValidDataset:
    uid: int
    country: str
    update_time: datetime

Valid: the sole datetime field is used as the timestamp field, and it is okay to have no keys.

from datetime import datetime
from typing import Optional

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

@meta(owner="[email protected]")
@dataset
class User:
    uid: Optional[int] = field(key=True)
    country: str
    update_time: datetime

Invalid: key fields cannot have an optional type.

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    update_time: int

Invalid: there is no datetime field, so there is no timestamp field.

from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    created_time: datetime
    updated_time: datetime

Invalid: multiple datetime fields without an explicit timestamp field.

from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

@meta(owner="[email protected]")
@dataset
class User:
    uid: int
    country: str
    update_time: datetime = field(timestamp=True)
    signup_time: datetime

Valid: multiple datetime fields, but one is explicitly marked as the timestamp field.
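The rules behind the examples above can be summarized as a small checker. This is a sketch in plain Python, not Fennel's actual validation logic: the functions is_optional and resolve_timestamp_field, and the idea of passing the schema as a dict of field names to types, are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, Iterable, Optional, get_args


def is_optional(tp: object) -> bool:
    # Optional[X] is Union[X, None], so NoneType appears among its type arguments.
    return type(None) in get_args(tp)


def resolve_timestamp_field(
    schema: Dict[str, object],
    keys: Iterable[str] = (),
    explicit: Optional[str] = None,
) -> str:
    """Return the timestamp field name, or raise if the schema breaks a rule."""
    for name in keys:
        if is_optional(schema[name]):
            raise TypeError(f"key field '{name}' cannot have an optional type")
    if explicit is not None:
        if schema[explicit] is not datetime:
            raise TypeError(f"timestamp field '{explicit}' must be of type datetime")
        return explicit
    candidates = [name for name, tp in schema.items() if tp is datetime]
    if len(candidates) == 1:
        return candidates[0]  # the sole datetime field is inferred
    if not candidates:
        raise ValueError("no datetime field, so no timestamp field")
    raise ValueError("multiple datetime fields need one explicit timestamp field")


# Mirrors the two valid examples above.
print(resolve_timestamp_field({"uid": int, "country": str, "update_time": datetime}))
print(resolve_timestamp_field(
    {"uid": int, "country": str, "update_time": datetime, "signup_time": datetime},
    explicit="update_time",
))
```

The three invalid examples map to the three raise statements: an optional key field, no datetime field at all, and several datetime fields with none marked explicitly.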

Meta Flags

Datasets can be annotated with useful meta information via meta flags, either at the dataset level or at the level of a single field. To ensure code ownership, Fennel requires every dataset to have an owner. Here is an example:

from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

@meta(owner="[email protected]", tags=["PII", "experimental"])
@dataset
class UserWithMetaFlags:
    uid: int = field(key=True)
    height: float = field().meta(description="height in inches")
    weight: float = field().meta(description="weight in lbs")
    updated: datetime

Typing the same owner name for every dataset gets repetitive. To avoid that, you can set an owner at the module level by defining __owner__, and all the datasets in that module inherit this owner. Example:

from datetime import datetime

from fennel.datasets import dataset, field
from fennel.lib.metadata import meta

__owner__ = "[email protected]"

@dataset
class UserBMI:
    uid: int = field(key=True)
    height: float
    weight: float
    bmi: float
    updated: datetime

@meta(owner="[email protected]")
@dataset
class UserName:
    uid: int = field(key=True)
    name: str
    updated: datetime

@dataset
class UserLocation:
    uid: int = field(key=True)
    city: str
    updated: datetime

In this example, the datasets UserBMI and UserLocation both inherit the owner from the module-level __owner__, whereas the dataset UserName overrides it by providing an explicit meta flag.
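The precedence described here (an explicit meta owner wins, otherwise the module-level owner applies, otherwise the dataset is rejected) can be sketched as follows. The resolve_owner function and the example.com addresses are hypothetical and only illustrate the rule; they are not part of Fennel.

```python
from typing import Optional


def resolve_owner(explicit_owner: Optional[str], module_owner: Optional[str]) -> str:
    """Pick the owner for a dataset: explicit meta flag first, then module level."""
    if explicit_owner is not None:
        return explicit_owner
    if module_owner is not None:
        return module_owner
    # Mirrors the requirement that every dataset has an owner.
    raise ValueError("every dataset must have an owner")


# Hypothetical addresses standing in for the owners in the example above.
module_owner = "data-team@example.com"
owners = {
    "UserBMI": resolve_owner(None, module_owner),
    "UserName": resolve_owner("names-team@example.com", module_owner),
    "UserLocation": resolve_owner(None, module_owner),
}
print(owners)
```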
