Concepts
Featureset
A featureset is a group of logically related features, where each feature is
backed by a Python function that knows how to extract it. A featureset is written
as a Python class annotated with the @featureset decorator. A single application
will typically have many featuresets. Let's see an example:
Example
@featureset
class Movies:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)

    @extractor
    @inputs(duration)
    @outputs(over_2hrs)
    def my_extractor(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        # True when the movie runs longer than 2 hours (duration is in seconds)
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)
The above example defines a featureset called Movies with two features: duration
and over_2hrs. Each feature has a type and is given a monotonically increasing id
that is unique within the featureset. This featureset has one extractor,
my_extractor, which, when given the duration feature, knows how to extract the
over_2hrs feature. Let's look at extractors in a bit more detail.
Extractors
Extractors are stateless Python functions in a featureset that are annotated
with the @extractor decorator. Each extractor accepts zero or more inputs
(marked with the @inputs decorator) and produces one or more features (marked
with the @outputs decorator).
In the above example, if the value of the duration feature is known, my_extractor
can extract the value of over_2hrs. When you are interested in getting values of
some features, Fennel locates the extractors responsible for those features, verifies
that their inputs are available, and runs their code to obtain the feature values.
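The locate, verify, and run steps described above can be pictured with a toy
resolver. This is only a minimal sketch in plain Python, not Fennel's actual
implementation: the EXTRACTORS registry and the resolve function are hypothetical
names introduced for illustration, and features are single values rather than series.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: output feature -> (input feature names, function).
# Durations are in seconds, as in the Movies example.
ExtractorSpec = Tuple[List[str], Callable[..., object]]

EXTRACTORS: Dict[str, ExtractorSpec] = {
    "over_2hrs": (["duration"], lambda duration: duration > 2 * 3600),
}


def resolve(feature: str, known: Dict[str, object]) -> object:
    """Return the value of `feature`, running extractors when needed."""
    if feature in known:
        return known[feature]
    # Locate the extractor responsible for this feature.
    if feature not in EXTRACTORS:
        raise KeyError(f"no extractor produces {feature!r}")
    input_names, fn = EXTRACTORS[feature]
    # Verify every input is available, resolving it recursively if it is not.
    args = [resolve(name, known) for name in input_names]
    # Run the extractor and remember the result.
    known[feature] = fn(*args)
    return known[feature]


value = resolve("over_2hrs", {"duration": 9000})
```

Here duration is supplied directly, so resolving over_2hrs only needs to run one
extractor. Asking for a feature that is neither supplied nor produced by any
extractor raises a KeyError in this sketch.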
API of Extractors
Conceptually, an extractor could be thought of as getting a table of timestamped input features and producing a few output features for each row. Something like this (assuming 3 input features and 2 output features):
Timestamp | Input 1 | Input 2 | Input 3 | Output 1 | Output 2 |
---|---|---|---|---|---|
Jan 11, 2022, 11:00am | 123 | 'hello' | True | ? | ? |
Jan 12, 2022, 8:30am | 456 | 'world' | True | ? | ? |
Jan 13, 2022, 10:15am | 789 | 'fennel' | False | ? | ? |
Each output is supposed to be the value of that feature as of the row's timestamp, assuming the input features take the values given in that row.
An extractor is a classmethod, so its first argument is always cls. The second
argument is always a series of timestamps (as shown above), followed by one series
for each input feature. The output is a named series or a dataframe, depending on
the number of output features, with the same length as the input series.
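To make the calling convention concrete, here is a small sketch that uses pandas
only, without Fennel. The extract function is a plain function that mimics the
argument order described above (cls, timestamps, then one series per input), and
its logic is hypothetical, chosen only to fill in the "?" cells of the table.

```python
import pandas as pd

# Timestamps and the three input columns from the table above.
ts = pd.Series(
    pd.to_datetime(["2022-01-11 11:00", "2022-01-12 08:30", "2022-01-13 10:15"])
)
input_1 = pd.Series([123, 456, 789])
input_2 = pd.Series(["hello", "world", "fennel"])
input_3 = pd.Series([True, True, False])


def extract(cls, ts, in1, in2, in3) -> pd.DataFrame:
    # Hypothetical logic: the document does not say what the outputs compute.
    output_1 = in1 * 2
    # Length of the string where input 3 is True, otherwise 0.
    output_2 = in2.str.len().where(in3, 0)
    # Two output features, so the result is a dataframe with one column each.
    return pd.DataFrame({"output_1": output_1, "output_2": output_2})


# cls is unused here because this is a plain function, not a real extractor.
out = extract(None, ts, input_1, input_2, input_3)
```

The result has one row per input row and one named column per output feature.
With a single output feature, the same function would return a named series instead.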
Rules of Extractors
- A featureset can have zero or more extractors.
- An extractor can have zero or more inputs but must have at least one output.
- Input features of an extractor can belong to any number of featuresets but all output features must be from the same featureset as the extractor.
- For any feature, there can be at most one extractor where the feature appears in the output list.
With this, let's look at a few valid and invalid examples:
Valid - featureset can have zero extractors
@featureset
class MoviesZeroExtractors:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)
Valid - featureset can have multiple extractors provided they are all valid
@featureset
class MoviesManyExtractors:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)
    over_3hrs: bool = feature(id=3)

    @extractor
    @inputs(duration)
    @outputs(over_2hrs)
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_2hrs", data=durations > 2 * 3600)

    @extractor
    @inputs(duration)
    @outputs(over_3hrs)
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)
Invalid - multiple extractors extracting the same feature. over_3hrs is
extracted by both e1 and e2:
@featureset
class Movies:
    duration: int = feature(id=1)
    over_2hrs: bool = feature(id=2)
    # invalid: both e1 & e2 output `over_3hrs`
    over_3hrs: bool = feature(id=3)

    @extractor
    @inputs(duration)
    @outputs(over_2hrs, over_3hrs)
    def e1(cls, ts: pd.Series, durations: pd.Series) -> pd.DataFrame:
        two_hrs = durations > 2 * 3600
        three_hrs = durations > 3 * 3600
        return pd.DataFrame(
            {"over_2hrs": two_hrs, "over_3hrs": three_hrs}
        )

    @extractor
    @inputs(duration)
    @outputs(over_3hrs)
    def e2(cls, ts: pd.Series, durations: pd.Series) -> pd.Series:
        return pd.Series(name="over_3hrs", data=durations > 3 * 3600)
Valid - input feature of extractor coming from another featureset.
@featureset
class Length:
    limit_secs: int = feature(id=1)


@featureset
class MoviesForeignFeatureInput:
    duration: int = feature(id=1)
    over_limit: bool = feature(id=2)

    @extractor
    @inputs(Length.limit_secs, duration)
    @outputs(over_limit)
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series):
        return pd.Series(name="over_limit", data=durations > limits)
Invalid - output feature of extractor from another featureset
@featureset
class Request:
    too_long: bool = feature(id=1)


@featureset
class Movies:
    duration: int = feature(id=1)
    limit_secs: int = feature(id=2)

    @extractor
    @inputs(limit_secs, duration)
    @outputs(
        Request.too_long
    )  # cannot output feature of another featureset
    def e(cls, ts: pd.Series, limits: pd.Series, durations: pd.Series):
        return pd.Series(name="movie_too_long", data=durations > limits)
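The rules in this section can also be summarized as a small standalone checker.
The sketch below is an illustration in plain Python, not Fennel's validation code:
ExtractorDecl and check_featureset are hypothetical names, and features are
written as qualified strings such as Movies.over_2hrs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExtractorDecl:
    name: str
    inputs: List[str]   # qualified names, e.g. "Length.limit_secs"
    outputs: List[str]  # qualified names, e.g. "Movies.over_2hrs"


def check_featureset(featureset: str, extractors: List[ExtractorDecl]) -> List[str]:
    """Return a list of rule violations; an empty list means valid."""
    errors: List[str] = []
    producer: Dict[str, str] = {}  # output feature -> first extractor producing it
    for ex in extractors:
        # Zero inputs are fine, but at least one output is required.
        if not ex.outputs:
            errors.append(f"{ex.name}: must have at least one output")
        for out in ex.outputs:
            # Outputs must belong to the extractor's own featureset.
            if out.split(".")[0] != featureset:
                errors.append(f"{ex.name}: output {out} is not in {featureset}")
            # A feature can be produced by at most one extractor.
            if out in producer:
                errors.append(f"{out}: extracted by both {producer[out]} and {ex.name}")
            producer.setdefault(out, ex.name)
    return errors


duplicate = check_featureset("Movies", [
    ExtractorDecl("e1", ["Movies.duration"], ["Movies.over_2hrs", "Movies.over_3hrs"]),
    ExtractorDecl("e2", ["Movies.duration"], ["Movies.over_3hrs"]),
])
foreign_output = check_featureset("Movies", [
    ExtractorDecl("e", ["Movies.limit_secs", "Movies.duration"], ["Request.too_long"]),
])
foreign_input = check_featureset("MoviesForeignFeatureInput", [
    ExtractorDecl("e", ["Length.limit_secs", "MoviesForeignFeatureInput.duration"],
                  ["MoviesForeignFeatureInput.over_limit"]),
])
```

The three calls mirror the examples above: duplicate and foreign_output each
report one violation, while foreign_input is empty because inputs may come from
any featureset.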
Dataset Lookups
A large fraction of real-world ML features are built on top of stored data. However, featuresets have no storage of their own and are completely stateless. Instead, they can do random lookups on datasets and use the results for feature computation. Let's see an example:
 1  @meta(owner="[email protected]")
 2  @source(webhook.endpoint("User"))
 3  @dataset
 4  class User:
 5      uid: int = field(key=True)
 6      name: str
 7      timestamp: datetime
 8
 9
10  @meta(owner="[email protected]")
11  @featureset
12  class UserFeatures:
13      uid: int = feature(id=1)
14      name: str = feature(id=2)
15
16      @extractor(depends_on=[User])
17      @inputs(uid)
18      @outputs(name)
19      def func(cls, ts: pd.Series, uids: pd.Series):
20          names, found = User.lookup(ts, uid=uids)
21          names.fillna("Unknown", inplace=True)
22          return names[["name"]]
In this example, the extractor just looks up the value of name given the uid
from the User dataset and returns it as the feature value. Note a few things:
- In line 16, the extractor has to explicitly declare that it depends on the User dataset. This helps Fennel build an explicit lineage between features and the datasets they depend on.
- In line 20, the extractor calls the lookup function on the dataset. This function also takes a series of timestamps as its first argument; you'd almost always pass the extractor's timestamp series to it as is. In addition, all the key fields in the dataset become kwargs to the lookup function.
- It's not possible to do lookups on a dataset without keys.
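The shape of such a lookup call can be imitated with a toy in-memory dataset. This
sketch makes an assumption the text does not spell out: that a lookup returns, for
each timestamp, the latest row for the key at or before that timestamp. USER_ROWS
and this lookup function are hypothetical stand-ins written in plain Python, with
integer timestamps for brevity; they are not Fennel's API.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a keyed dataset:
# uid -> rows of (timestamp, name), sorted by timestamp.
USER_ROWS: Dict[int, List[Tuple[int, str]]] = {
    1: [(100, "Ann"), (200, "Anna")],
    2: [(150, "Bob")],
}


def lookup(ts: List[int], uid: List[int]) -> Tuple[List[Optional[str]], List[bool]]:
    """For each (timestamp, uid) pair, return the latest name at or before it."""
    names: List[Optional[str]] = []
    found: List[bool] = []
    for t, u in zip(ts, uid):
        rows = USER_ROWS.get(u, [])
        # Number of rows with timestamp <= t; the last of them is the match.
        idx = bisect_right([row_ts for row_ts, _ in rows], t)
        if idx == 0:
            names.append(None)
            found.append(False)
        else:
            names.append(rows[idx - 1][1])
            found.append(True)
    return names, found


# The key field becomes a keyword argument, as in User.lookup(ts, uid=uids).
names, found = lookup(ts=[120, 250, 120, 300], uid=[1, 1, 2, 3])
# Replace missing values, like the fillna("Unknown") call in the extractor above.
names = [n if n is not None else "Unknown" for n in names]
```

Rows that do not exist yet at the requested timestamp (uid 2 at time 120) and
unknown keys (uid 3) both come back as not found in this sketch, which is why
the extractor fills in a default value.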