Concepts

Featureset

A featureset is a group of logically related features, each backed by a Python function that knows how to extract it. A featureset is written as a Python class annotated with the @featureset decorator. A single application will typically have many featuresets. Let's see an example:

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs

@featureset
class Movie:
    duration: int
    over_2hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def my_extractor(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        # 2 hours expressed in seconds
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)

The above example defines a featureset called Movie with two features: duration and over_2hrs. Each feature has a type and is given a monotonically increasing id that is unique within the featureset. This featureset has one extractor, my_extractor, which knows how to extract the over_2hrs feature when given the duration feature.

Extractors

Extractors are stateless Python functions in a featureset that are annotated with the @extractor decorator. Each extractor accepts zero or more inputs (declared in the @inputs decorator) and produces one or more features (declared in the @outputs decorator).

In the above example, if the value of the feature duration is known, my_extractor can extract the value of over_2hrs. When you ask for the values of some features, Fennel locates the extractors responsible for those features, verifies that their inputs are available, and runs their code to obtain the feature values.

API of Extractors

Conceptually, an extractor could be thought of as getting a table of timestamped input features and producing a few output features for each row. Something like this (assuming 3 input features and 2 output features):

Timestamp               Input 1   Input 2    Input 3   Output 1   Output 2
Jan 11, 2022, 11:00am   123       'hello'    True      ?          ?
Jan 12, 2022, 8:30am    456       'world'    True      ?          ?
Jan 13, 2022, 10:15am   789       'fennel'   False     ?          ?

Each output should be the value of that feature as of the given timestamp, assuming the input features take the given values.

An extractor is a classmethod, so its first argument is always cls. The second argument is always a series of timestamps (as shown above), followed by one series for each input feature. The output is a named series (for a single output feature) or a dataframe (for several), of the same length as the input series.

Validity Rules of Extractors

  1. A featureset can have zero or more extractors.
  2. An extractor can have zero or more inputs but must have at least one output.
  3. Input features of an extractor can belong to any number of featuresets but all output features must be from the same featureset as the extractor.
  4. For any feature, there can be at most one extractor where the feature appears in the output list.

With this, let's look at a few valid and invalid examples:

from fennel.featuresets import featureset

@featureset
class MoviesZeroExtractors:
    duration: int
    over_2hrs: bool

Valid: a featureset can have zero extractors.

A featureset having a feature without an extractor simply means that Fennel doesn't know how to compute that feature and hence that feature must always be provided as an input to the resolution process.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs

@featureset
class MoviesManyExtractors:
    duration: int
    over_2hrs: bool
    over_3hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)

    @extractor
    @inputs("duration")
    @outputs("over_3hrs")
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)

Valid: a featureset can have multiple extractors for different features.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import meta, inputs, outputs

@meta(owner="[email protected]")
@featureset
class Movies:
    duration: int
    over_2hrs: bool
    over_3hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs", "over_3hrs")
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.DataFrame:
        two_hrs = durations > 2 * 3600
        three_hrs = durations > 3 * 3600
        return pd.DataFrame({"over_2hrs": two_hrs, "over_3hrs": three_hrs})

    # Violates rule 4: over_3hrs is already an output of e1
    @extractor
    @inputs("duration")
    @outputs("over_3hrs")
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)

Invalid: multiple extractors extract the same feature, over_3hrs.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs

@featureset
class Length:
    limit_secs: int

@featureset
class MoviesForeignFeatureInput:
    duration: int
    over_limit: bool

    @extractor
    @inputs(Length.limit_secs, "duration")
    @outputs("over_limit")
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series):
        return pd.Series(name="over_limit", data=durations > limits)

Valid: an input feature can come from another featureset.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs

@featureset
class Request:
    too_long: bool

@featureset
class Movies:
    duration: int
    limit_secs: int

    # Violates rule 3: Request.too_long belongs to a different featureset
    @extractor
    @inputs("limit_secs", "duration")
    @outputs(Request.too_long)
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series):
        return pd.Series(name="movie_too_long", data=durations > limits)

Invalid: an extractor outputs a feature from another featureset.

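The four validity rules can also be read as a mechanical check. The sketch below is plain Python and is not Fennel's own validation code: ExtractorSpec and check_featureset are hypothetical names, and features are written as qualified strings such as "Movies.duration" purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractorSpec:
    """Hypothetical description of one extractor (not a Fennel type)."""
    name: str
    inputs: list = field(default_factory=list)   # e.g. ["Movies.duration"]
    outputs: list = field(default_factory=list)  # e.g. ["Movies.over_2hrs"]


def check_featureset(featureset: str, extractors: list) -> list:
    """Return a list of rule violations; an empty list means the set is valid."""
    errors = []
    producer = {}  # output feature -> name of the extractor that produces it
    # Rule 1: zero extractors is fine, the loop simply does not run.
    for ex in extractors:
        # Rule 2: inputs may be empty, but there must be at least one output.
        if not ex.outputs:
            errors.append(f"{ex.name}: must have at least one output")
        for out in ex.outputs:
            # Rule 3: inputs may come from anywhere, but every output must
            # belong to the extractor's own featureset.
            if out.split(".")[0] != featureset:
                errors.append(f"{ex.name}: output {out} is outside {featureset}")
            # Rule 4: at most one extractor may output a given feature.
            if out in producer:
                errors.append(
                    f"{ex.name}: {out} is already produced by {producer[out]}"
                )
            else:
                producer[out] = ex.name
    return errors
```

Applied to the invalid Movies example with two extractors, where e1 outputs over_2hrs and over_3hrs and e2 outputs over_3hrs again, this check reports a single rule 4 violation against e2.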

Extractor Resolution

Suppose you have an extractor A that takes feature f1 as input and outputs f2, and another extractor B that takes f2 as input and returns f3 as output. Further, suppose that the value of f1 is available and you're interested in computing the value of f3.

Fennel can automatically deduce that in order to go from f1 to f3, it must first run extractor A to get to f2 and then run B on f2 to get f3.

More generally, Fennel can recursively resolve feature extractors and find a path, via extractors, from a set of input features to a set of output features. This allows you to reuse feature logic and not have every feature depend on root-level inputs like uid.
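To make the idea concrete, here is a minimal sketch of recursive resolution in plain Python. It is an illustration only, not Fennel's resolver: the resolve function and the dictionary layout are hypothetical, and the sketch relies on the rule that each feature is produced by at most one extractor.

```python
def resolve(extractors, available, targets):
    """Return extractor names, in run order, needed to compute `targets`.

    `extractors` maps an extractor name to an (inputs, outputs) pair of
    feature-name tuples. `available` is the set of features already known.
    """
    # Each feature has at most one producing extractor, so this map is unambiguous.
    producer = {
        out: name for name, (_, outs) in extractors.items() for out in outs
    }
    plan, done, in_progress = [], set(), set()

    def visit(feature):
        if feature in available:
            return  # provided as an input, nothing to run
        if feature not in producer:
            raise ValueError(f"{feature!r} is neither provided nor extractable")
        name = producer[feature]
        if name in done:
            return  # already scheduled
        if name in in_progress:
            raise ValueError(f"cyclic dependency through extractor {name!r}")
        in_progress.add(name)
        for inp in extractors[name][0]:
            visit(inp)  # schedule whatever this extractor depends on first
        in_progress.discard(name)
        done.add(name)
        plan.append(name)

    for target in targets:
        visit(target)
    return plan


# The f1 -> f2 -> f3 chain described above.
extractors = {"A": (("f1",), ("f2",)), "B": (("f2",), ("f3",))}
print(resolve(extractors, available={"f1"}, targets=["f3"]))  # ['A', 'B']
```

If f2 were already available, the same call would schedule only B, which is what lets intermediate features be reused as inputs.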

Dataset Lookups

A large fraction of real world ML features are built on top of stored data. However, featuresets don't have any storage of their own and are completely stateless. Instead, they are able to do random lookups on datasets and use that for the feature computation. Let's see an example:

from datetime import datetime

import pandas as pd

from fennel.datasets import dataset, field
from fennel.connectors import source, Webhook
from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs

webhook = Webhook(name="fennel_webhook")


@source(webhook.endpoint("User"), disorder="14d", cdc="upsert")
@dataset(index=True)
class User:
    uid: int = field(key=True)
    name: str
    timestamp: datetime


@featureset
class UserFeatures:
    uid: int
    name: str

    @extractor(deps=[User])
    @inputs("uid")
    @outputs("name")
    def func(cls, ts: pd.Series, uids: pd.Series):
        # Look up User rows by key, passing the extractor's timestamps through
        names, found = User.lookup(ts, uid=uids)
        # Replace missing values with a placeholder
        names.fillna("Unknown", inplace=True)
        return names[["name"]]

In this example, the extractor simply looks up the value of name for the given uid in the User dataset and returns it as the feature value. Note a couple of things:

  • The extractor has to explicitly declare that it depends on the User dataset. This helps Fennel build an explicit lineage between features and the datasets they depend on.
  • Inside the body of the extractor, the lookup function is invoked on the dataset. It takes a series of timestamps as its first argument; you'd almost always pass the extractor's timestamp series to it unchanged. In addition, every key field of the dataset becomes a keyword argument of lookup.
  • It's not possible to do lookups on a dataset without keys.
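The shape of such a lookup can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not Fennel's implementation: KeyedHistory is a hypothetical class, plain lists stand in for pandas series, and the sketch assumes that a lookup returns, for each row, the latest value written at or before that row's timestamp together with a flag saying whether anything was found.

```python
from bisect import bisect_right
from datetime import datetime


class KeyedHistory:
    """Toy stand-in for a keyed dataset (hypothetical, not a Fennel class)."""

    def __init__(self):
        self._rows = {}  # key -> list of (timestamp, value), sorted by timestamp

    def upsert(self, key, ts, value):
        rows = self._rows.setdefault(key, [])
        rows.append((ts, value))
        rows.sort(key=lambda row: row[0])

    def lookup(self, timestamps, keys):
        """Return (values, found): the latest value at or before each timestamp."""
        values, found = [], []
        for ts, key in zip(timestamps, keys):
            rows = self._rows.get(key, [])
            # Index of the first row strictly after ts; the row before it wins.
            idx = bisect_right([row[0] for row in rows], ts)
            if idx == 0:
                values.append(None)  # unknown key, or nothing written yet
                found.append(False)
            else:
                values.append(rows[idx - 1][1])
                found.append(True)
        return values, found


users = KeyedHistory()
users.upsert(1, datetime(2022, 1, 11), "Ann")
users.upsert(1, datetime(2022, 1, 13), "Anna")

names, found = users.lookup([datetime(2022, 1, 12), datetime(2022, 1, 10)], [1, 1])
# Mirror the fillna step of the extractor above for rows that were not found.
names = [name if ok else "Unknown" for name, ok in zip(names, found)]
```

Without at least one key field there would be nothing to index the history by, which is one way to see why unkeyed datasets cannot be looked up.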

Versioning

Fennel supports versioning of features. Since a feature is defined by its extractor, feature versioning is managed by setting version on extractors.

import pandas as pd

from fennel.featuresets import featureset, extractor
from fennel.lib import inputs, outputs

@featureset
class Movie:
    duration: int
    is_long: bool

    @extractor(version=1)
    @inputs("duration")
    @outputs("is_long")
    def fn(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="is_long", data=durations > 2 * 3600)

If not specified explicitly, versions in Fennel begin at 1. Feature or extractor details cannot be changed without changing the version. To update the definition of a feature, its extractor must be replaced with another extractor that has a larger version number. For instance, the above featureset can be updated to this one without an error:

@featureset
class Movie:
    duration: int
    is_long: bool

    @extractor(version=2)
    @inputs("duration")
    @outputs("is_long")
    def fn(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="is_long", data=durations > 3 * 3600)

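The versioning rule can be summarized as a small check. This is a sketch of the rule as described here, not Fennel's actual logic: can_replace is a hypothetical helper, each definition is modelled as a (version, code) pair, and rejecting a lower version number is an assumption of the sketch.

```python
def can_replace(old, new):
    """Return True if extractor definition `new` may replace `old`.

    Each definition is a hypothetical (version, code) pair, where `code`
    stands for everything about the extractor other than its version.
    """
    old_version, old_code = old
    new_version, new_code = new
    if new_version > old_version:
        return True  # a larger version may carry any new definition
    if new_version == old_version:
        return new_code == old_code  # same version: nothing may change
    return False  # assumption: versions never go backwards


v1 = (1, "durations > 2 * 3600")
v2 = (2, "durations > 3 * 3600")
silent_edit = (1, "durations > 3 * 3600")

print(can_replace(v1, v2))           # True
print(can_replace(v1, silent_edit))  # False
```

The second call corresponds to changing the threshold from two hours to three while leaving version=1 in place, which is the kind of silent change the version number is meant to prevent.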

Auto Generated Extractors

Fennel extractors are flexible enough to mix any combination of read-side and write-side computations. In practice, however, the most common scenario is computing something on the write side in a pipeline and then serving it as is. Fennel supports a shorthand annotation to auto-generate these formulaic extractors.

from datetime import datetime

import pandas as pd

from fennel.connectors import source, Webhook
from fennel.datasets import dataset, field
from fennel.featuresets import featureset, extractor, feature
from fennel.lib import inputs, outputs

webhook = Webhook(name="fennel_webhook")

@source(webhook.endpoint("endpoint"), disorder="14d", cdc="upsert")
@dataset(index=True)
class Movie:
    id: int = field(key=True)
    duration: int
    updated_at: datetime

@featureset
class MovieFeatures:
    id: int
    # Auto-generated extractor: look up Movie.duration by id, falling back to -1
    duration: int = feature(Movie.duration, default=-1)
    over_2hrs: bool

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def is_over_2hrs(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)

In the above example, the extractor for the feature duration is auto-generated via the feature(...) initializer. The first argument to feature (here, a field of the dataset defined above) describes what the extractor should produce. This snippet generates code roughly equivalent to the following:

import numpy as np

@featureset
class MovieFeatures:
    id: int
    duration: int
    over_2hrs: bool

    @extractor(deps=[Movie], version=1)
    @inputs("id")
    @outputs("duration")
    def get_duration(cls, ts: pd.Series, id: pd.Series) -> pd.Series:
        df, found = Movie.lookup(ts, id=id)
        # Keep the looked-up duration where a row was found, else use the default (-1)
        df["duration"] = np.where(found, df["duration"], -1)
        return df["duration"]

    @extractor
    @inputs("duration")
    @outputs("over_2hrs")
    def is_over_2hrs(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)

The way this works is that Fennel looks at the dataset whose field is being referred to and verifies that it is a keyed dataset, so that it can be looked up. It then verifies that for each key field (here, id) there is a feature in the featureset with the same name. If a default is set, logic equivalent to the np.where line is added and the types are validated.

As shown above, auto-generated extractors can be freely mixed with manually written ones: a subset of the extractors can be written by hand and the rest derived automatically.
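The checks and the default handling described above can be sketched in plain Python. This illustrates the described behaviour and is not Fennel's code generator: plan_auto_extractor and apply_default are hypothetical helpers, and plain lists stand in for pandas series.

```python
def plan_auto_extractor(dataset_keys, featureset_features, field_name):
    """Check the preconditions for auto-generation and describe the extractor.

    `dataset_keys` lists the key fields of the dataset, and
    `featureset_features` lists the feature names in the featureset.
    """
    if not dataset_keys:
        raise ValueError("dataset has no key fields, so it cannot be looked up")
    missing = [key for key in dataset_keys if key not in featureset_features]
    if missing:
        raise ValueError(f"no feature named after key field(s): {missing}")
    # The key-named features become inputs; the dataset field becomes the output.
    return {"inputs": list(dataset_keys), "output": field_name}


def apply_default(values, found, default):
    """Keep looked-up values where a row was found, otherwise use the default."""
    return [value if ok else default for value, ok in zip(values, found)]


plan = plan_auto_extractor(["id"], ["id", "duration", "over_2hrs"], "duration")
durations = apply_default([7200, None], [True, False], default=-1)
```

Here the second movie id has no row in the dataset, so its duration falls back to the default of -1 rather than staying empty.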

Feature Aliases

Above, we saw how to auto-generate an extractor that looks up a field of a keyed dataset. Fennel also lets you create aliases of features via auto-generated extractors.

@source(webhook.endpoint("Movie"), disorder="14d", cdc="upsert")
@dataset(index=True)
class Movie:
    id: int = field(key=True)
    duration: int
    updated_at: datetime

@featureset
class Request:
    movie_id: int

@featureset
class MovieFeatures:
    id: int = feature(Request.movie_id)
    duration: int = feature(Movie.duration, default=-1)

Two extractors are auto-generated in the above example. The first looks up a field of the dataset, as before. The second simply aliases the feature Request.movie_id to MovieFeatures.id. Together they generate code roughly equivalent to the following:

@featureset
class MovieFeatures:
    id: int
    duration: int

    @extractor(version=1)
    @inputs(Request.movie_id)
    @outputs("id")
    def get_id(cls, ts: pd.Series, movie_id: pd.Series) -> pd.Series:
        # Alias: pass Request.movie_id through unchanged as MovieFeatures.id
        return movie_id

    @extractor(deps=[Movie], version=1)
    @inputs("id")
    @outputs("duration")
    def get_duration(cls, ts: pd.Series, id: pd.Series) -> pd.Series:
        df, found = Movie.lookup(ts, id=id)
        # Keep the looked-up duration where a row was found, else use the default (-1)
        df["duration"] = np.where(found, df["duration"], -1)
        return df["duration"]

Performance

Fennel completely bypasses Python for auto-generated extractors and executes the whole computation in Rust. As a result, they are significantly faster (by more than 2x in some benchmarks) and should be preferred whenever possible.

Convention

Most of your extractors will likely be auto-generated. To reduce the visual clutter of writing feature repeatedly, the Fennel docs follow the convention of importing feature as F.

from fennel.featuresets import feature as F

@source(webhook.endpoint("Movie"), disorder="14d", cdc="upsert")
@dataset(index=True)
class Movie:
    id: int = field(key=True)
    duration: int
    updated_at: datetime

@featureset
class Request:
    movie_id: int

@featureset
class MovieFeatures:
    id: int = F(Request.movie_id)
    duration: int = F(Movie.duration, default=-1)
