Concepts
Featureset
A featureset is a group of logically related features, where each feature is backed by a Python function that knows how to extract it. A featureset is written as a Python class annotated with the @featureset decorator. A single application will typically have many featuresets. Let's see an example:
import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs


@featureset
class Movie:
    duration: int
    over_2hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def my_extractor(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        # Compare each duration against 2 * 3600 (two hours, in seconds).
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)
The above example defines a featureset called Movie with two features: duration and over_2hrs. Each feature has a type and is given a monotonically increasing id that is unique within the featureset. This featureset has one extractor, my_extractor, which, when given the duration feature, knows how to extract the over_2hrs feature.
Extractors
Extractors are stateless Python functions in a featureset that are annotated with the @extractor decorator. Each extractor accepts zero or more inputs (marked in the inputs decorator) and produces one or more features (marked in the outputs decorator).
In the above example, if the value of the feature duration is known, my_extractor can extract the value of over_2hrs. When you are interested in getting the values of some features, Fennel locates the extractors responsible for those features, verifies that their inputs are available and runs their code to obtain the feature values.
API of Extractors
Conceptually, an extractor could be thought of as getting a table of timestamped input features and producing a few output features for each row. Something like this (assuming 3 input features and 2 output features):
| Timestamp | Input 1 | Input 2 | Input 3 | Output 1 | Output 2 |
|---|---|---|---|---|---|
| Jan 11, 2022, 11:00am | 123 | 'hello' | True | ? | ? |
| Jan 12, 2022, 8:30am | 456 | 'world' | True | ? | ? |
| Jan 13, 2022, 10:15am | 789 | 'fennel' | False | ? | ? |
Each output is the value of that feature as of the row's timestamp, assuming the input features have the values given in that row.
An extractor is a classmethod, so its first argument is always cls. The second argument is always a series of timestamps (as shown above), followed by one series for each input feature. The output is a named series or a dataframe, depending on the number of output features, with the same length as the input series.
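To make the calling convention concrete, here is a minimal sketch in plain Python that mimics the table above. It uses lists in place of pandas series so that it runs without Fennel installed. The function name toy_extractor and the two output formulas are made up for illustration; only the shape of the contract (timestamps first, one column per input, one output value per row) comes from the description above.

```python
from datetime import datetime


def toy_extractor(ts, input1, input2, input3):
    """Toy stand-in for an extractor body: one column per input, one row per timestamp."""
    columns = [ts, input1, input2, input3]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("timestamps and all input columns must have the same length")
    # Hypothetical output 1: double input1 when the boolean input3 is set.
    output1 = [value * 2 if flag else value for value, flag in zip(input1, input3)]
    # Hypothetical output 2: length of the string input.
    output2 = [len(text) for text in input2]
    # Two outputs, so return named columns (the analogue of a dataframe).
    return {"output1": output1, "output2": output2}


ts = [
    datetime(2022, 1, 11, 11, 0),
    datetime(2022, 1, 12, 8, 30),
    datetime(2022, 1, 13, 10, 15),
]
result = toy_extractor(ts, [123, 456, 789], ["hello", "world", "fennel"], [True, True, False])
print(result)
```

Every output column has exactly as many entries as there are timestamps, which is the length requirement described above.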
Validity Rules of Extractors
- A featureset can have zero or more extractors.
- An extractor can have zero or more inputs but must have at least one output.
- Input features of an extractor can belong to any number of featuresets but all output features must be from the same featureset as the extractor.
- For any feature, there can be at most one extractor where the feature appears in the output list.
With this, let's look at a few valid and invalid examples:
Valid: a featureset can have zero extractors.

from fennel.featuresets import featureset


@featureset
class MoviesZeroExtractors:
    duration: int
    over_2hrs: bool
A featureset having a feature without an extractor simply means that Fennel doesn't know how to compute that feature and hence that feature must always be provided as an input to the resolution process.
Valid: a featureset can have multiple extractors, as long as each feature is produced by at most one of them.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs


@featureset
class MoviesManyExtractors:
    duration: int
    over_2hrs: bool
    over_3hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)

    @extractor
    @inputs("duration")
    @outputs("over_3hrs")
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)
Invalid: the feature over_3hrs appears in the output list of two extractors.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs


@featureset
class Movies:
    duration: int
    over_2hrs: bool
    over_3hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs", "over_3hrs")
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.DataFrame:
        two_hrs = durations > 2 * 3600
        three_hrs = durations > 3 * 3600
        return pd.DataFrame({"over_2hrs": two_hrs, "over_3hrs": three_hrs})

    # Invalid: over_3hrs is already produced by e1.
    @extractor
    @inputs("duration")
    @outputs("over_3hrs")
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)
Valid: input features of an extractor can come from other featuresets.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs


@featureset
class Length:
    limit_secs: int


@featureset
class MoviesForeignFeatureInput:
    duration: int
    over_limit: bool

    @extractor
    @inputs(Length.limit_secs, "duration")
    @outputs("over_limit")
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_limit", data=durations > limits)
Invalid: the output feature belongs to a different featureset than the extractor.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs


@featureset
class Request:
    too_long: bool


@featureset
class Movies:
    duration: int
    limit_secs: int

    # Invalid: Request.too_long is not a feature of Movies.
    @extractor
    @inputs("limit_secs", "duration")
    @outputs(Request.too_long)
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series):
        return pd.Series(name="movie_too_long", data=durations > limits)
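The validity rules above can be checked mechanically. The following is a toy checker in plain Python, not Fennel's actual validation code: the check_extractors function and the dictionary layout are made up for illustration, and features are written as "Featureset.feature" strings for simplicity.

```python
from collections import Counter


def check_extractors(featuresets, extractors):
    """Return a list of rule violations (an empty list means the definitions are valid).

    featuresets: dict of featureset name -> set of feature names
    extractors:  list of dicts with keys "name", "featureset", "inputs", "outputs",
                 where inputs and outputs are "Featureset.feature" strings
    """
    errors = []
    for ex in extractors:
        # Rule: zero or more inputs, but at least one output.
        if not ex["outputs"]:
            errors.append(f"{ex['name']}: must have at least one output")
        # Rule: every output must belong to the extractor's own featureset.
        for out in ex["outputs"]:
            owner, _, feature = out.partition(".")
            if owner != ex["featureset"] or feature not in featuresets.get(owner, set()):
                errors.append(f"{ex['name']}: output {out} is not in {ex['featureset']}")
    # Rule: a feature may appear in the outputs of at most one extractor.
    counts = Counter(out for ex in extractors for out in ex["outputs"])
    for out, n in sorted(counts.items()):
        if n > 1:
            errors.append(f"{out}: produced by {n} extractors")
    return errors


featuresets = {"Movies": {"duration", "over_2hrs", "over_3hrs"}}
valid = [
    {"name": "e1", "featureset": "Movies", "inputs": ["Movies.duration"], "outputs": ["Movies.over_2hrs"]},
    {"name": "e2", "featureset": "Movies", "inputs": ["Movies.duration"], "outputs": ["Movies.over_3hrs"]},
]
# Same as above, except e1 now also claims over_3hrs, which e2 already produces.
invalid = [dict(valid[0], outputs=["Movies.over_2hrs", "Movies.over_3hrs"]), valid[1]]

print(check_extractors(featuresets, valid))
print(check_extractors(featuresets, invalid))
```

The first call reports no violations, while the second flags over_3hrs because two extractors list it as an output.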
Extractor Resolution
Suppose you have an extractor A that takes feature f1 as input and outputs f2, and another extractor B that takes f2 as input and returns f3 as output. Further, suppose that the value of f1 is available and you're interested in computing the value of f3.
Fennel can automatically deduce that in order to go from f1 to f3, it must first run extractor A to get f2 and then run B on f2 to get f3.
More generally, Fennel is able to recursively resolve feature extractors and find a path via extractors from a set of input features to a set of output features. This allows you to reuse feature logic and not have every feature depend on root-level inputs like uid.
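One way to picture this resolution is as a depth-first walk backwards from the requested features. The sketch below is a toy planner in plain Python under that assumption. It is not Fennel's implementation, and the plan_extraction function and its data layout are hypothetical.

```python
def plan_extraction(extractors, available, targets):
    """Return the names of extractors to run, in order, to compute the targets.

    extractors: dict of extractor name -> (list of input features, list of output features)
    available:  features whose values are already provided
    targets:    features whose values are requested
    """
    # Each feature has at most one producing extractor, so this mapping is unambiguous.
    producer = {out: name for name, (_, outs) in extractors.items() for out in outs}
    known = set(available)
    plan = []

    def resolve(feature, visiting):
        if feature in known:
            return
        if feature not in producer:
            raise ValueError(f"{feature} has no extractor and was not provided as an input")
        if feature in visiting:
            raise ValueError(f"cyclic dependency involving {feature}")
        name = producer[feature]
        inputs, outputs = extractors[name]
        # Recursively make every input available before running this extractor.
        for dep in inputs:
            resolve(dep, visiting | {feature})
        plan.append(name)
        known.update(outputs)

    for target in targets:
        resolve(target, frozenset())
    return plan


extractors = {"A": (["f1"], ["f2"]), "B": (["f2"], ["f3"])}
print(plan_extraction(extractors, available=["f1"], targets=["f3"]))
```

With f1 available and f3 requested, the plan is to run A and then B, matching the walk-through above. A feature with no extractor that is also not provided as an input raises an error.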
Features on Features
A natural consequence of extractor resolution is that it's trivial to build features using other features: just have the extractor for a feature f1 take another feature f2 of the same featureset as an input. Example:
import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs


@featureset
class Movies:
    duration: int
    over_3hrs: bool

    # over_3hrs is computed from duration, another feature of the same featureset.
    @extractor
    @inputs("duration")
    @outputs("over_3hrs")
    def e(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)
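To illustrate how a feature built on another feature flows through a chain, here is a small plain-Python sketch that runs toy extractors in order and feeds each output back in as a possible input. The run_chain helper and the length_label feature are hypothetical and exist only for this illustration; Fennel itself works on pandas series and determines the order automatically.

```python
def run_chain(extractors, values, order):
    """Run toy extractors in the given order, feeding outputs back in as inputs.

    extractors: dict of name -> (list of input features, output feature, function)
    values:     dict of feature name -> list of values that are already known
    order:      extractor names in the order they should run
    """
    values = dict(values)  # copy, so the caller's dict is left untouched
    for name in order:
        inputs, output, fn = extractors[name]
        values[output] = fn(*(values[feature] for feature in inputs))
    return values


extractors = {
    # duration -> over_3hrs
    "e1": (["duration"], "over_3hrs", lambda durations: [d > 3 * 3600 for d in durations]),
    # over_3hrs -> length_label (a made-up feature that depends on another feature)
    "e2": (["over_3hrs"], "length_label", lambda flags: ["long" if f else "regular" for f in flags]),
}
result = run_chain(extractors, {"duration": [5400, 12000]}, ["e1", "e2"])
print(result["length_label"])
```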
Dataset Lookups
A large fraction of real world ML features are built on top of stored data. However, featuresets don't have any storage of their own and are completely stateless. Instead, they are able to do random lookups on datasets and use that for the feature computation. Let's see an example:
from datetime import datetime

import pandas as pd

from fennel.connectors import source, Webhook
from fennel.datasets import dataset, field
from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs

webhook = Webhook(name="fennel_webhook")


@source(webhook.endpoint("User"), disorder="14d", cdc="upsert")
@dataset(index=True)
class User:
    uid: int = field(key=True)
    name: str
    timestamp: datetime


@featureset
class UserFeatures:
    uid: int
    name: str

    @extractor(deps=[User])
    @inputs("uid")
    @outputs("name")
    def func(cls, ts: pd.Series, uids: pd.Series):
        # Look up each uid in the User dataset at the given timestamps.
        names, found = User.lookup(ts, uid=uids)
        # Replace missing values with a default name.
        names.fillna("Unknown", inplace=True)
        return names[["name"]]
In this example, the extractor just looks up the value of name given the uid from the User dataset and returns it as the feature value. Note a few things:
- The extractor has to explicitly declare that it depends on the User dataset. This helps Fennel build an explicit lineage between features and the datasets they depend on.
- Inside the body of the extractor, the function lookup is invoked on the dataset. This function also takes a series of timestamps as its first argument; you'd almost always pass the extractor's timestamp series to this function as is. In addition, all the key fields of the dataset become keyword arguments of the lookup function.
- It's not possible to do lookups on a dataset without keys.
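The lookup described above can be pictured as an "as of" query: for each (timestamp, key) pair, return the latest value written at or before that timestamp, along with a flag saying whether anything was found. The sketch below is a toy in-memory model under that assumption, with integer timestamps and a hypothetical ToyKeyedDataset class. It is not Fennel's lookup implementation.

```python
from bisect import bisect_right


class ToyKeyedDataset:
    """In-memory stand-in for a keyed dataset with upsert semantics."""

    def __init__(self):
        self._history = {}  # key -> list of (timestamp, value), sorted by timestamp

    def upsert(self, key, timestamp, value):
        rows = self._history.setdefault(key, [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def lookup(self, ts, keys):
        """For each (timestamp, key) pair, return the latest value at or before that time."""
        values, found = [], []
        for t, key in zip(ts, keys):
            rows = self._history.get(key, [])
            # Number of rows written at or before t; the last of them is the "as of" value.
            idx = bisect_right([row_ts for row_ts, _ in rows], t)
            found.append(idx > 0)
            values.append(rows[idx - 1][1] if idx > 0 else None)
        return values, found


users = ToyKeyedDataset()
users.upsert(key=1, timestamp=10, value="Ada")
users.upsert(key=1, timestamp=20, value="Ada L.")

names, found = users.lookup(ts=[15, 25, 5, 15], keys=[1, 1, 1, 2])
# Fill misses with a default, as the extractor above does with fillna("Unknown").
names = [name if ok else "Unknown" for name, ok in zip(names, found)]
print(names, found)
```

The third query asks about key 1 before its first write and the fourth asks about an unknown key, so both fall back to the default.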
Versioning
Fennel supports versioning of features. Since a feature is defined by its extractor, feature versioning is managed by setting version on extractors.
import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs


@featureset
class Movie:
    duration: int
    is_long: bool

    @extractor(version=1)
    @inputs("duration")
    @outputs("is_long")
    def fn(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="is_long", data=durations > 2 * 3600)
If not specified explicitly, versions in Fennel begin at 1. Feature/extractor details cannot be mutated without changing the version. To update the definition of a feature, its extractor must be replaced with another extractor with a larger version number. For instance, the above featureset can be updated to this one without an error:
@featureset
class Movie:
    duration: int
    is_long: bool

    # Version bumped from 1 to 2 because the threshold changed from 2 to 3 hours.
    @extractor(version=2)
    @inputs("duration")
    @outputs("is_long")
    def fn(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="is_long", data=durations > 3 * 3600)
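The versioning rule can be summarized as: a changed definition requires a larger version number. The following plain-Python sketch encodes just that rule. The check_update function and the dictionary representation of an extractor are hypothetical, and how Fennel compares definitions internally is not covered here.

```python
DEFAULT_VERSION = 1  # versions begin at 1 when not specified


def check_update(old, new):
    """Raise if an extractor's definition changed without a version increase.

    old, new: dicts with a "definition" entry (any comparable description of the
    extractor) and an optional "version" entry.
    """
    old_version = old.get("version", DEFAULT_VERSION)
    new_version = new.get("version", DEFAULT_VERSION)
    if new["definition"] != old["definition"] and new_version <= old_version:
        raise ValueError(
            f"definition changed but version went from {old_version} to {new_version}"
        )
    return new_version


current = {"definition": "durations > 2 * 3600"}               # implicit version 1
bumped = {"definition": "durations > 3 * 3600", "version": 2}  # changed, version raised
silent = {"definition": "durations > 3 * 3600", "version": 1}  # changed, same version

print(check_update(current, bumped))
try:
    check_update(current, silent)
except ValueError as err:
    print("rejected:", err)
```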
Auto Generated Extractors
Fennel extractors are flexible enough to mix any combination of read & write side computations. In practice, however, the most common scenario is computing something on the write-side in a pipeline and then serving it as it is. Fennel supports a shorthand annotation to auto-generate these formulaic extractors.
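from datetime import datetime

import pandas as pd

from fennel.connectors import source, Webhook
from fennel.datasets import dataset, field
from fennel.featuresets import featureset, extractor, feature
from fennel.lib import inputs, outputs

webhook = Webhook(name="fennel_webhook")


@source(webhook.endpoint("endpoint"), disorder="14d", cdc="upsert")
@dataset(index=True)
class Movie:
    id: int = field(key=True)
    duration: int
    updated_at: datetime


@featureset
class MovieFeatures:
    id: int
    # Auto-generated extractor: look up Movie.duration by id, defaulting to -1.
    duration: int = feature(Movie.duration, default=-1)
    over_2hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def is_over_2hrs(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)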
In the above example, the extractor for the feature duration is auto-generated via the feature(...) initializer. In particular, the first argument to the feature function (in this case, a field of the dataset defined above) describes what the extractor should return. This snippet generates code roughly equivalent to the following:
import numpy as np


@featureset
class MovieFeatures:
    id: int
    duration: int
    over_2hrs: bool

    @extractor(deps=[Movie], version=1)
    @inputs("id")
    @outputs("duration")
    def get_duration(cls, ts: pd.Series, id: pd.Series) -> pd.Series:
        df, found = Movie.lookup(ts, id=id)
        # Keep the looked-up value where the key was found; otherwise use the default (-1).
        df["duration"] = np.where(found, df["duration"], -1)
        return df["duration"]

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def is_over_2hrs(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)
The way this works is that Fennel looks at the dataset whose field is being referred to and verifies that it is a keyed dataset so that it can be looked up. It then verifies that for each key field (i.e. id here), there is a feature in the featureset with the same name. If default is set, logic equivalent to the np.where line is added to fill in missing values, and the types are validated.
As shown above, auto-generated extractors can be freely mixed with manually written extractors, with a subset of the extractors written by hand and the rest derived automatically.
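The checks described above (the dataset must be keyed, each key field needs a same-named feature, and a default fills in misses) can be sketched as a small factory function. This is a toy model in plain Python with a hypothetical make_lookup_extractor helper and a dictionary standing in for the dataset. It ignores timestamps and type validation, and says nothing about how Fennel generates extractors internally.

```python
def make_lookup_extractor(dataset_keys, dataset_rows, field, feature_names, default=None):
    """Build a toy extractor that looks up `field` by the dataset's key fields.

    dataset_keys:  names of the key fields (must be non-empty, i.e. a keyed dataset)
    dataset_rows:  dict mapping a tuple of key values -> dict of field values
    feature_names: names of the features declared in the featureset
    """
    if not dataset_keys:
        raise ValueError("dataset has no key fields, so it cannot be looked up")
    missing = [key for key in dataset_keys if key not in feature_names]
    if missing:
        raise ValueError(f"featureset has no feature named after key field(s): {missing}")

    def extractor(ts, *key_columns):
        # One key column per key field, in the same order as dataset_keys.
        # This toy ignores ts; a real lookup would be evaluated as of each timestamp.
        results = []
        for key_values in zip(*key_columns):
            row = dataset_rows.get(key_values)
            # Fall back to the default when the key is not found.
            results.append(row[field] if row is not None else default)
        return results

    return extractor


movies = {(1,): {"duration": 5400}, (2,): {"duration": 9000}}
get_duration = make_lookup_extractor(
    ["id"], movies, "duration", {"id", "duration", "over_2hrs"}, default=-1
)
print(get_duration([0, 0, 0], [1, 2, 3]))
```

The id 3 is not in the toy dataset, so its duration falls back to the default of -1.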
Feature Aliases
Above, we saw how to auto generate an extractor that looks up a field of a keyed dataset. Fennel also lets you create aliases of features via auto-generated extractors.
# F is an alias for feature; other imports and webhook are as in the earlier examples.
from fennel.featuresets import featureset, feature as F


@source(webhook.endpoint("Movie"), disorder="14d", cdc="upsert")
@dataset(index=True)
class Movie:
    id: int = field(key=True)
    duration: int
    updated_at: datetime


@featureset
class Request:
    movie_id: int


@featureset
class MovieFeatures:
    id: int = F(Request.movie_id)
    duration: int = F(Movie.duration, default=-1)
Two extractors are auto-generated in the above example. One looks up a field of the dataset, as before. The second simply aliases the feature Request.movie_id to MovieFeatures.id. Together they generate code roughly equivalent to the following:
@featureset
class MovieFeatures:
    id: int
    duration: int

    # Alias: pass Request.movie_id through unchanged as MovieFeatures.id.
    @extractor(version=1)
    @inputs(Request.movie_id)
    @outputs("id")
    def get_id(cls, ts: pd.Series, movie_id: pd.Series) -> pd.Series:
        return movie_id

    @extractor(deps=[Movie], version=1)
    @inputs("id")
    @outputs("duration")
    def get_duration(cls, ts: pd.Series, id: pd.Series) -> pd.Series:
        df, found = Movie.lookup(ts, id=id)
        # Keep the looked-up value where the key was found; otherwise use the default (-1).
        df["duration"] = np.where(found, df["duration"], -1)
        return df["duration"]
Performance
Fennel completely bypasses Python for auto-generated extractors and executes the whole computation in Rust. As a result, they are significantly faster (by >2x based on some benchmarks) and should be preferred whenever possible.
Convention
Most of your extractors will likely be auto-generated. To reduce the visual clutter of repeatedly writing feature, a convention followed in the Fennel docs is to import feature as F.
from fennel.featuresets import feature as F


@source(webhook.endpoint("Movie"), disorder="14d", cdc="upsert")
@dataset(index=True)
class Movie:
    id: int = field(key=True)
    duration: int
    updated_at: datetime


@featureset
class Request:
    movie_id: int


@featureset
class MovieFeatures:
    id: int = F(Request.movie_id)
    duration: int = F(Movie.duration, default=-1)