Data Quality

Metaflags

In the real world, features and datasets are not static; each has its own lifecycle. Metaflags are a mechanism for annotating and managing the lifecycle of Fennel objects.

Here are a few common scenarios where Metaflags help:

  • Ownership of a dataset needs to be tracked so that if it has data quality issues, the problem can be routed to an appropriate person to investigate.
  • Features and data need to be documented so that their users can easily understand what they do.
  • For compliance reasons, all features that depend on PII data, either directly or through a long chain of upstream dependencies, need to be audited. To do that, all such features must first be identified.

Let's look at an example:

```python
from datetime import datetime

# The Fennel names used below (meta, dataset, field, featureset, feature,
# extractor, inputs, outputs) come from the Fennel client library.


@meta(owner='[email protected]', tags=['PII', 'hackathon'])
@dataset(index=True)
class User:
    uid: int = field(key=True)
    height: float = field().meta(description='in inches')
    weight: float = field().meta(description='in lbs')
    at: datetime


@meta(owner='[email protected]')
@featureset
class UserFeatures:
    uid: int = feature()
    zip: str = feature().meta(tags=['PII'])
    bmi: float = feature().meta(owner='[email protected]')
    bmr: float = feature().meta(deprecated=True)
    ...

    @meta(description='based on algorithm specified here: bit.ly/xyy123')
    @extractor
    @inputs(...)
    @outputs(...)
    def some_fn(...):
        ...
```

Fennel currently supports 5 metaflags:

  1. owner - email address of the owner of the object. Ownership flows down transitively. For instance, the owner of a featureset becomes the default owner of all its features unless a feature overrides it by specifying its own owner.
  2. description - description of the object, used solely for documentation purposes.
  3. tags - list of arbitrary string tags associated with the object. Tags flow across the lineage graph and are additive. For instance, if a dataset is tagged 'PII', all other objects that read from the dataset inherit this tag. Fennel supports searching for objects with a given tag.
  4. deleted - whether the object is deleted or not. Sometimes it is desirable to delete an object but keep a tombstone marker in the codebase; that is where deleted should be used. For instance, a feature may be deleted while its ID must never be reused. In that case, mark it as deleted and leave the marker in place permanently (the code for its extractor can be removed).
  5. deprecated - like deleted, but only marks the object as due to be deprecated in the near future. Prospective users of the object can see this flag in code and are nudged not to use it.
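The owner fallback and additive tag rules above can be sketched in plain Python. This is a toy model under stated assumptions, not Fennel's implementation: the `Node` class, the `effective_owner` and `effective_tags` helpers, the 'derived' tag, and the example.com addresses are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical stand-in for a dataset, featureset, or feature."""
    name: str
    owner: Optional[str] = None
    tags: Set[str] = field(default_factory=set)
    parent: Optional["Node"] = None  # enclosing object, e.g. a featureset
    upstream: List["Node"] = field(default_factory=list)  # objects read from


def effective_owner(node: Node) -> Optional[str]:
    """Return the nearest explicit owner, walking up through parents."""
    current: Optional[Node] = node
    while current is not None:
        if current.owner is not None:
            return current.owner
        current = current.parent
    return None


def effective_tags(node: Node, _seen: Optional[Set[str]] = None) -> Set[str]:
    """Union a node's own tags with the tags of everything upstream of it."""
    seen = set() if _seen is None else _seen
    if node.name in seen:  # already counted via another path
        return set()
    seen.add(node.name)
    tags = set(node.tags)
    for source in node.upstream:
        tags |= effective_tags(source, seen)
    return tags


# Assumed example objects, loosely mirroring the User/UserFeatures example.
user = Node("User", owner="data-team@example.com", tags={"PII", "hackathon"})
user_features = Node("UserFeatures", owner="ml-team@example.com")
zip_code = Node("zip", parent=user_features, upstream=[user])
bmi = Node(
    "bmi",
    owner="health-team@example.com",
    parent=user_features,
    tags={"derived"},
    upstream=[user],
)

print(effective_owner(zip_code))    # ml-team@example.com (inherited)
print(effective_owner(bmi))         # health-team@example.com (explicit)
print(sorted(effective_tags(bmi)))  # ['PII', 'derived', 'hackathon']
```

In this sketch an explicit owner always wins over an inherited one, while tags only ever accumulate, which is what makes a tag such as 'PII' useful for finding every object that transitively touches sensitive data.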

Enforcement of Ownership

Enforcing ownership of code is a well-known approach in software engineering for maintaining the quality and health of code, but most ML teams don't enforce ownership of pipelines or features.

Fennel requires that every dataset and feature has an explicit owner email, and it routes alerts and notifications about those constructs to their owners. As people change teams or move on, explicit ownership makes it more likely that context is handed over.
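As an illustration of what such a requirement amounts to, here is a minimal sketch that flags objects without an owner and addresses an alert to the owner of a given object. The `find_unowned` and `route_alert` helpers, the object names, and the addresses are hypothetical; they do not describe how Fennel performs these checks internally.

```python
from typing import Dict, List, Optional


def find_unowned(owners: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of objects that have no owner, sorted for stable output."""
    return sorted(name for name, owner in owners.items() if not owner)


def route_alert(obj_name: str, message: str, owners: Dict[str, Optional[str]]) -> str:
    """Format an alert addressed to the owner of obj_name."""
    owner = owners.get(obj_name)
    if not owner:
        raise ValueError(f"{obj_name} has no owner to notify")
    return f"To: {owner} | {obj_name}: {message}"


# Assumed registry of object name -> owner email.
registry = {
    "User": "data-team@example.com",
    "Transaction": None,
    "UserFeatures.bmi": "health-team@example.com",
}

print(find_unowned(registry))
print(route_alert("User", "null rate above threshold", registry))
```

A check like `find_unowned` would typically run before objects are registered, so that a missing owner is rejected up front rather than discovered when an alert has nowhere to go.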

Fennel also makes it easy to identify downstream dependencies: given a dataset, it's easy to see whether any other datasets or features depend on it. Knowing that a construct truly has no dependents makes it much easier for a team to simply delete it during an ownership transition instead of keeping it around.
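Finding downstream dependents is a graph traversal over lineage edges. The sketch below assumes the lineage is available as a plain dictionary mapping each object to the objects it reads from; the `READS_FROM` data, the object names in it, and the `downstream_of` helper are hypothetical and only illustrate the idea.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage: each key reads from the objects in its list.
READS_FROM: Dict[str, List[str]] = {
    "User": [],
    "Transaction": [],
    "UserSpend": ["User", "Transaction"],
    "UserFeatures.bmi": ["User"],
    "SpendFeatures.total": ["UserSpend"],
}


def downstream_of(target: str, reads_from: Dict[str, List[str]]) -> Set[str]:
    """Return every object that depends on target, directly or transitively."""
    # Invert the edges: source -> objects that read from it.
    readers: Dict[str, List[str]] = {}
    for obj, sources in reads_from.items():
        for source in sources:
            readers.setdefault(source, []).append(obj)

    # Breadth-first search over the inverted edges.
    found: Set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for reader in readers.get(current, []):
            if reader not in found:
                found.add(reader)
                queue.append(reader)
    return found


print(sorted(downstream_of("Transaction", READS_FROM)))
print(downstream_of("SpendFeatures.total", READS_FROM))  # empty: safe to delete
```

An empty result is the signal described above: nothing reads from the object, so removing it cannot break anything downstream.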

Ownership and other metaflags do not, in themselves, magically prevent quality issues, but they should encourage better hygiene for code and data.

Module Level Ownership

Typing an owner for each dataset/featureset gets repetitive over time. You can avoid this by specifying __owner__ once at the start of a module; any dataset/featureset in that module without an explicit owner then uses the module-level value.

In the following example, both User and Transaction inherit their owner from the module level __owner__ property.

```python
from datetime import datetime

# Module-level owner: used by every dataset/featureset in this module
# that does not specify its own owner.
__owner__ = "[email protected]"


@dataset
class User:
    uid: int
    country: str
    age: float
    signup_at: datetime


@dataset
class Transaction:
    uid: int
    amount: float
    payment_country: str
    merchant_id: int
    timestamp: datetime
```
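The fallback order described in this section (explicit owner first, then the module-level value) can be sketched as follows. This is an assumption-laden illustration rather than Fennel's code: `resolve_owner` is a hypothetical helper, a plain dictionary stands in for the module namespace, and the addresses are placeholders.

```python
from typing import Mapping, Optional


def resolve_owner(
    explicit_owner: Optional[str],
    module_namespace: Mapping[str, object],
) -> str:
    """Prefer an explicit owner; otherwise fall back to the module's __owner__."""
    if explicit_owner is not None:
        return explicit_owner
    module_owner = module_namespace.get("__owner__")
    if not isinstance(module_owner, str) or not module_owner:
        raise ValueError("no explicit owner and no module-level __owner__")
    return module_owner


# A dict standing in for a module that sets __owner__ once at the top.
module_namespace = {"__owner__": "data-team@example.com"}

print(resolve_owner(None, module_namespace))                   # module-level owner
print(resolve_owner("ml-team@example.com", module_namespace))  # explicit owner wins
```

In a real module, the namespace passed in would be the module's own globals, so the helper sees whatever `__owner__` was assigned at the top of the file.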
