Featureset

A featureset is a group of logically related features, each backed by a Python function that knows how to extract it. A featureset is written as a Python class annotated with the @featureset decorator. A single application will typically have many featuresets. Let's see an example:

Example

overview.py
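import pandas as pd

# Import paths assume a Fennel client version that exposes these modules.
from fennel.featuresets import featureset, feature, extractor
from fennel.lib.schema import inputs, outputs


@featureset
class Movies:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)

    @extractor
    @inputs(duration)
    @outputs(over_2hrs)
    def my_extractor(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        # 2 * 3600 is two hours expressed in seconds
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)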

The example above defines a featureset called Movies with two features: duration and over_2hrs. Each feature has a type and is given a monotonically increasing id that is unique within the featureset. This featureset has one extractor, my_extractor, which knows how to extract the over_2hrs feature when given the duration feature. Let's look at extractors in a bit more detail.

Extractors

Extractors are stateless Python functions in a featureset that are annotated with the @extractor decorator. Each extractor accepts zero or more inputs (declared in the @inputs decorator) and produces one or more features (declared in the @outputs decorator).

In the above example, if the value of the duration feature is known, my_extractor can extract the value of over_2hrs. When you ask for the values of some features, Fennel locates the extractors responsible for those features, verifies that their inputs are available, and runs their code to obtain the feature values.
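The resolution step described above can be pictured with a toy resolver. This is a minimal sketch in plain Python, not Fennel's implementation: the registry `EXTRACTORS`, the `resolve` function, and the use of single values instead of series are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

Extractor = Tuple[Tuple[str, ...], Callable[..., object]]

# Hypothetical registry mapping each output feature to its input features
# and the function that computes it.
EXTRACTORS: Dict[str, Extractor] = {
    "over_2hrs": (("duration",), lambda duration: duration > 2 * 3600),
    "over_3hrs": (("duration",), lambda duration: duration > 3 * 3600),
}


def resolve(feature: str, known: Dict[str, object]) -> object:
    """Return the value of `feature`, running extractors as needed."""
    if feature in known:  # supplied by the caller or already computed
        return known[feature]
    if feature not in EXTRACTORS:
        raise KeyError(f"no value and no extractor for feature {feature!r}")
    inputs, func = EXTRACTORS[feature]
    # Make sure every input is available, resolving it recursively if needed.
    args = [resolve(name, known) for name in inputs]
    known[feature] = func(*args)
    return known[feature]


values = {"duration": 9000}  # 2.5 hours, in seconds
print(resolve("over_2hrs", values))  # True
print(resolve("over_3hrs", values))  # False
```

The sketch omits cycle detection and batching, both of which a production resolver would need.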

API of Extractors

Conceptually, an extractor could be thought of as getting a table of timestamped input features and producing a few output features for each row. Something like this (assuming 3 input features and 2 output features):

Timestamp              | Input 1 | Input 2  | Input 3 | Output 1 | Output 2
Jan 11, 2022, 11:00am  | 123     | 'hello'  | True    | ?        | ?
Jan 12, 2022, 8:30am   | 456     | 'world'  | True    | ?        | ?
Jan 13, 2022, 10:15am  | 789     | 'fennel' | False   | ?        | ?

Each output is supposed to be the value of that feature as of the row's timestamp, assuming the input features have the values given in that row.

An extractor is a classmethod, so its first argument is always cls. The second argument is always a series of timestamps (the Timestamp column above), followed by one series for each input feature. The output has the same length as the inputs and is a named series when there is a single output feature, or a dataframe when there are several.
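To make the calling convention concrete, here is a sketch that uses plain Python lists in place of pandas series so that it runs without any dependencies. The bare `@classmethod`, the method name `over_2hrs`, and the sample values are illustrative assumptions; a real extractor is declared with the Fennel decorators shown earlier and receives pd.Series objects.

```python
from datetime import datetime
from typing import List


class Movies:
    # Stand-in for an extractor: `cls` first, then the timestamps, then one
    # column per input feature (here only `durations`).
    @classmethod
    def over_2hrs(cls, ts: List[datetime], durations: List[int]) -> List[bool]:
        # One output value per input row, in the same order.
        return [d > 2 * 3600 for d in durations]


ts = [
    datetime(2022, 1, 11, 11, 0),
    datetime(2022, 1, 12, 8, 30),
    datetime(2022, 1, 13, 10, 15),
]
durations = [5400, 7200, 9000]  # seconds
flags = Movies.over_2hrs(ts, durations)

assert len(flags) == len(ts)  # output has the same length as the inputs
print(flags)  # [False, False, True]
```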

Rules of Extractors

  1. A featureset can have zero or more extractors.
  2. An extractor can have zero or more inputs but must have at least one output.
  3. Input features of an extractor can belong to any number of featuresets but all output features must be from the same featureset as the extractor.
  4. For any feature, there can be at most one extractor where the feature appears in the output list.
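These rules are mechanical enough to check in code. The following is a rough sketch of such a check, not Fennel's own validation: the `ExtractorSpec` record, the `check_rules` function, and the "Featureset.feature" string naming are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ExtractorSpec:
    """Hypothetical description of one extractor; features are 'Featureset.name'."""
    name: str
    featureset: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def check_rules(extractors: List[ExtractorSpec]) -> List[str]:
    """Return a list of rule violations (empty if every extractor is valid)."""
    errors: List[str] = []
    producer: Dict[str, str] = {}  # output feature -> extractor that produces it
    for ex in extractors:
        if not ex.outputs:  # rule 2: at least one output
            errors.append(f"{ex.name}: must have at least one output")
        for out in ex.outputs:
            owner = out.split(".", 1)[0]
            if owner != ex.featureset:  # rule 3: outputs stay in the same featureset
                errors.append(f"{ex.name}: output {out} belongs to another featureset")
            if out in producer:  # rule 4: at most one extractor per feature
                errors.append(f"{ex.name}: {out} is already extracted by {producer[out]}")
            else:
                producer[out] = ex.name
    return errors


specs = [
    ExtractorSpec("e1", "Movies", ("Movies.duration",), ("Movies.over_2hrs", "Movies.over_3hrs")),
    ExtractorSpec("e2", "Movies", ("Movies.duration",), ("Movies.over_3hrs",)),
    ExtractorSpec("e3", "Movies", ("Length.limit_secs",), ("Request.too_long",)),
]
for problem in check_rules(specs):
    print(problem)
```

Inputs are deliberately left unchecked, since rule 3 allows them to come from any featureset.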

With this, let's look at a few valid and invalid examples:

Valid - featureset can have zero extractors

@featureset
class MoviesZeroExtractors:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)

Valid - featureset can have multiple extractors provided they are all valid

@featureset
class MoviesManyExtractors:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)
    over_3hrs: bool = feature(id=3)

    @extractor
    @inputs(duration)
    @outputs(over_2hrs)
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)

    @extractor
    @inputs(duration)
    @outputs(over_3hrs)
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)

Invalid - multiple extractors extracting the same feature. over_3hrs is extracted both by e1 and e2:

@featureset
class Movies:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)
    # invalid: both e1 & e2 output `over_3hrs`
    over_3hrs: bool = feature(id=3)

    @extractor
    @inputs(duration)
    @outputs(over_2hrs, over_3hrs)
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.DataFrame:
        two_hrs = durations > 2 * 3600
        three_hrs = durations > 3 * 3600
        return pd.DataFrame(
            {"over_2hrs": two_hrs, "over_3hrs": three_hrs}
        )

    @extractor
    @inputs(duration)
    @outputs(over_3hrs)
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)

Valid - input feature of extractor coming from another featureset.

@featureset
class Length:
    limit_secs: int = feature(id=1)


@featureset
class MoviesForeignFeatureInput:
    duration: int = feature(id=1)
    over_limit: bool = feature(id=2)

    @extractor
    @inputs(Length.limit_secs, duration)
    @outputs(over_limit)
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series):
        return pd.Series(name="over_limit", data=durations > limits)

Invalid - output feature of extractor from another featureset

@featureset
class Request:
    too_long: bool = feature(id=1)


@featureset
class Movies:
    duration: int = feature(id=1)
    limit_secs: int = feature(id=2)

    @extractor
    @inputs(limit_secs, duration)
    @outputs(
        Request.too_long
    )  # cannot output a feature of another featureset
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series):
        return pd.Series(name="movie_too_long", data=durations > limits)

Dataset Lookups

A large fraction of real-world ML features are built on top of stored data. However, featuresets don't have any storage of their own and are completely stateless. Instead, they can do random lookups on datasets and use the results to compute features. Let's see an example:

@meta(owner="[email protected]")
@source(webhook.endpoint("User"))
@dataset
class User:
    uid: int = field(key=True)
    name: str
    timestamp: datetime


@meta(owner="[email protected]")
@featureset
class UserFeatures:
    uid: int = feature(id=1)
    name: str = feature(id=2)

    @extractor(depends_on=[User])
    @inputs(uid)
    @outputs(name)
    def func(cls, ts: pd.Series, uids: pd.Series):
        names, found = User.lookup(ts, uid=uids)
        names.fillna("Unknown", inplace=True)
        return names[["name"]]

In this example, the extractor simply looks up the value of name for the given uid in the User dataset and returns it as the feature value. Note a few things:

  • In the @extractor(depends_on=[User]) decorator, the extractor has to explicitly declare that it depends on the User dataset. This helps Fennel build an explicit lineage between features and the datasets they depend on.
  • In the call User.lookup(ts, uid=uids), the extractor invokes the lookup function on the dataset. This function takes a series of timestamps as its first argument; you'd almost always pass the extractor's own timestamps to it unchanged. In addition, every key field of the dataset becomes a keyword argument of the lookup function.
  • It's not possible to do lookups on a dataset without keys.
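To give a feel for what a timestamped, keyed lookup involves, here is a dependency-free sketch. It is not how Fennel implements lookup: the in-memory `USER_ROWS` table, the standalone `lookup` function, and the "latest row at or before the timestamp" semantics are assumptions chosen for illustration, with plain lists standing in for series.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Toy stand-in for the User dataset: uid -> time-sorted list of (timestamp, name).
USER_ROWS: Dict[int, List[Tuple[datetime, str]]] = {
    1: [(datetime(2022, 1, 1), "Ann"), (datetime(2022, 6, 1), "Anne")],
    2: [(datetime(2022, 3, 1), "Bob")],
}


def lookup(ts: List[datetime], uid: List[int]) -> Tuple[List[Optional[str]], List[bool]]:
    """For each (timestamp, uid) pair, return the latest name at or before the timestamp."""
    names: List[Optional[str]] = []
    found: List[bool] = []
    for t, u in zip(ts, uid):
        rows = USER_ROWS.get(u, [])
        # Number of rows with timestamp <= t; the last of those is the match.
        idx = bisect_right([row_ts for row_ts, _ in rows], t)
        if idx == 0:
            names.append(None)
            found.append(False)
        else:
            names.append(rows[idx - 1][1])
            found.append(True)
    return names, found


ts = [datetime(2022, 2, 1), datetime(2022, 7, 1), datetime(2022, 2, 1), datetime(2022, 2, 1)]
names, found = lookup(ts, uid=[1, 1, 2, 3])
# Mirror the extractor above: replace misses with a default value.
names = [n if ok else "Unknown" for n, ok in zip(names, found)]
print(names)  # ['Ann', 'Anne', 'Unknown', 'Unknown']
```

The key is passed as a keyword argument (`uid=`) to echo the shape of the call in the example, and the sketch also shows why a dataset without keys has nothing to look up by.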