Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-78] Proposal for Advanced Types #17852

Closed
cccs-RyanS opened this issue Dec 22, 2021 · 8 comments
Closed

[SIP-78] Proposal for Advanced Types #17852

cccs-RyanS opened this issue Dec 22, 2021 · 8 comments
Labels
sip Superset Improvement Proposal

Comments

@cccs-RyanS
Copy link
Contributor

cccs-RyanS commented Dec 22, 2021

[SIP-78] Proposal for Advanced Types

Motivation

Our users are leveraging Superset to explore and visualize network security logs. These logs often contain IP, CIDR range, port, hostname, and/or protocol fields, among others. While such fields might be stored as simple INTEGERS or VARCHARS in the backend database, these fields also have a “advanced” type which, if leveraged, can provide additional context and improve the user experience.

User-Friendly Display

A field’s advanced type can influence how that field is interacted with by the users. Consider, for example, an IP field. For performance reasons, we store IP values as INTEGERs in our DW e.g. the IP “127.0.0.1” would be stored as its integer value 2130706433. However, the string representation is substantially more human-readable and, as such, we wish to display IP fields in dot-decimal notation in visualizations and filters. This numeric format is fairly standard (link)and having this display functionality would be valuable in superset in case the underlying DB does not support this conversion functionality out of the box. As another example, a Port field stored as an integer (e.g., 443) could be displayed as a protocol name where such mapping is known (e.g., HTTPS).

Applicable Operators for filters

Certain operators (e.g., LIKE, =, <, IS NULL, IN, etc.) may be valid for a database type, but may not be applicable to a advanced type. For example, an organization might decide that the LIKE operator is not relevant to Port fields. Ideally, we want to be able to configure a list of applicable operators that will appear for a field of a certain advanced type.

Input Validation

Advanced type can be associated with (configurable) input validation and type conversion rules. For example, an input validator for a Port field could validate that the filter string is a parsable number between 0 and 65536, or that it contains known port names such as HTTPS, SSH, FTP.

Human Readable input values

As with displaying the human readable format of a value to the user (i.e the dot notation of an IP) users should be able to enter the more readable form of a value and have it be applied properly to a query. This would mean a user could create a filter such as {col: "ip_col", op: "IN", val: [“1.1.1.1”, “1.1.1.2”]} and have the query be executed as ip_col IN [16842752, 16842753]. The functionality for a user to simply enter {col: "ip_col", op: "IN", val: [16842752, 16842753]} should also be preserved.

Behavior specified in the config

The code to specify the behavior of each advanced type should be defined outside of the superset codebase (i.e the config). This way those who wish to implement a advanced type may do so freely without having to make any code changes to superset.

Proof of concept

Figure 1: a video displaying a proof of concept for this proposal. a version of the modified filters popover is shown. The user inputs both the human readable and numeric values for ports and the conversion is shown. Also it is shown that the user cannot save an invalid value, abc in this case. The user then saves and submits a valid filter and runs the query. the query is then displayed.

Demo.Port.mp4

Proposed Changes

We propose that the concept of an “advanced type” be introduced to dataset columns in Superset and that this advanced type be leveraged to allow display formats, applicable operators, and validation rules to be added to fields of a particular advanced type. Advanced types would be a purely optional attribute which could be specified on columns; if a advanced type is not specified for a particular column, there would be no changes to how a column is displayed, what operators are available, or what validation is performed on filter inputs.

Adding the advanced type to dataset

Introduce the concept of a advanced type on dataset columns

  • Add a advanced_type attribute to the BaseColumn model. This column only needs to be a nullable VARCHAR.
  • Add a “Advanced Type” field to the “Edit Datasource” modal (Figure 1) where dataset creators/editors can set the advanced type for a column. This should be restricted to applicable advanced types

Figure 2: Edit Datasource modal with Advanced Type field

unnamed

Allow applicable operator sets to be defined for each advanced type

The Explorer UI has an ad-hoc filter which lets the user enter a condition on a column, a value and an operator (>, =, LIKE, IS NULL, etc).

Currently all operators are available to the user. We propose returning a list of applicable operators in the response payload. The list of applicable operators should be configurable for each custom advanced type. For maximum flexibility, the applicable operators should be tied to the advanced type.

Figure 3: Current list of operators (left) vs customized list of operators (right)

Before After
all_ops unnamed

Allow input validation and value conversion on advanced type

Currently any input string can be entered into an ad-hoc filter. We propose exposing a server REST endpoint to validate input strings and also return the "normalized" representation. The normalized representation should be accessible to the user so they can verify the conversion. This could be displayed via a tooltip in order to minimize UI changes to the control.

Example inputs

  • Time range: "Last day" -> "2021-03-16T21:57:33.000Z - 2021-03-17T21:57:33.000Z"
  • Numeric IP: "192.168.0.1" -> 3232235521
  • Numeric IP: 3232235521 -> 3232235521
  • Port: "HTTPS" -> "443"
  • Port: "8080" -> "8080"
  • Numeric CIDR: "1.1.1.1/16" -> {16842752 - 16908287}

An input validator for a PORT could validate that the string is a parsable number between 0 and 65536 or that it contains known port names such as HTTPS, SSH, FTP. An input validator for a DOMAIN could validate that the string is parsable as a domain and contains only alphanumeric characters, periods (.), and dashes (-).

It should be possible to add additional input validators via a configuration like config.py.

Figure 3: Error message for a filter value that fails validation for a field of type Port

unnamed

Allow custom behavior for each operator on a advanced type

Ad-hoc filters currently work as would be expected on the backend: a filter is translated directly into an SQL statement that matches the filter, so the filter {col: "ip_col", op: "IN", val: "1.1.1.1/16"} is translated to the statement ip_col IN '1.1.1.1/16'. Based on the idea of advanced types, we want the value '1.1.1.1/16' to first be converted to a decimal range (16842752 - 16908287) as covered above, but just doing the conversion is not enough for this range type as we would get something like this in the best case: ip_col IN [16842752, 16842753, ..., 16908287]. We would more likely want this: ip_col >= 16842752 AND ip_col <= 16908287. In order to achieve this, we propose making it possible to specify custom behavior for each defined operator associated with an advanced type. This would allow a lot of flexibility for each type.

Allow for custom display values

Columns with advanced types should (optionally) be displayed in their human readable format (i.e dot decimal notation for IPs even though they are stored as integers). In order to accomplish this, we should generate a list of user-friendly values based on the target column and append it to the result set or replace the original column with these new values. This could be accomplished in the filter in superset/views/core.py. We should optimally clean up some technical debt in the backend code by porting it to the v1 API, and while doing so we can introduce the additional functionality needed for this feature (two birds with one stone).

New or Changed Public Interfaces

Changes to the config

In order to implement advanced logic for advanced types we propose adding a collection of AdvancedType objects. These will define the available advanced types and their functionalities. The proposed class will be as follows are as follows:

description:

  • a description of the usage of the advanced type

verbose_name:

  • name used for displaying purposes, e.g. “IP and CIDR (v4)”

Applicatable_types

  • a list of GenericDataType types that this advanced type applies to. Using the earlier example of IPs stored as ints, that advanced type would only function on integer value columns

type_translation:

  • a function that accepts unparsed values and return: a status indicating whether the provided value was valid, the parsed value(s) based on advanced type, valid operators, and formatted value(s)

filter_translation:

  • a function that accepts a SqlaColumn, FilterOperator, and an unparsed value (or values). The function calls the advanced_type_translations function (for appropriate advanced type), and returns a valid [SqlaExpression].
  • The purpose of these functions is to take in a valid filter operator and a column and return a valid sqla expression that can be used to form the query.

Model changes

BaseColumn model

  • Add advanced type field (Nullable VARCHAR)

Change SqlaTable's get_sql_query function inside the filter loop so that if a advanced type is present on a column, the operator logic is overridden by an advanced type’s operator logic (via callback defined in advanced_type_filter_translations). This callback’s response will then be appended to the where_and list to produce the query.

Addition of the v1/advanced_types/ endpoints

In order to use any of this logic in the front end we will need to define a set of API endpoints for advanced types. Under this section we would define the following endpoints:

[GET] v1/advanced_type/convert

  • Calls the appropriate advanced_type_translation function as defined in the config for the given advanced type.
  • Parameters:
  • advanced type
  • a set of value(s) to be validated/converted
  • Returns:
  • The output of the function retrieved from advanced_type_translation

[GET] v1/advanced_type/types

  • Get the list of implemented advanced types and their descriptions. This list would be defined by the intersection of the set of keys for the advanced_type_translations and advanced_type_filter_translations dictionaries. This would be used in order to populate lists to select advanced types for a given column.
  • Parameters:
  • None
  • Returns:
  • A list of implemented advanced types and their descriptions

UI changes

The AdhocFilterEditPopoverSimpleTabContent will need to be modified. If an advanced type is present on a selected column, we use the v1/advanced_type/convert endpoint to retrieve: an input’s valid/invalid status, the list of operators to display, and the currently selected comparator’s converted value. The valid_operators list will need to override the current operators list defined by the isOperatorRevelant function. A tooltip component will also need to be added to the component to display the returned converted value. No behavioral changes should be present if a column without a advanced type is selected.

Migration Plan and Compatibility

A database migration is required. This will only effect the BaseColumn model. We will need to add a varchar field to the columns table.

in order to maintain compatibility we will insure the advanced type column in nullable as well as hiding changes behind the ENABLE_ADVANCED_TYPES feature flag.

Rejected Alternatives

no alternatives have been strongly considered at this time

Note

Please note that the API endpoints are subject to change during the development process. This was the first logical attempt

@cccs-RyanS cccs-RyanS changed the title [SIP-79] Proposal for Business Types [SIP-78] Proposal for Business Types Dec 22, 2021
@rusackas
Copy link
Member

Thank you for this detailed proposal! I think this includes a LOT of fantastic ideas/potential.

I may be misunderstanding some intent/approach here, but I'd love to see these business types (and validators, etc) contributed to the core codebase, rather than as some form of configuration, since I can see how a library of these business types might grow/evolve and add value to all sorts of users/orgs. I feel like changes to config should be limited to enabling/disabling/overriding/extending these business types, but the default should be to define them in a way that adds value for everyone, unless it's something super obscure/proprietary.

Thinking ahead further on that, it may be worth considering (at some point) making this a plugin-based architecture, if we expect there to be a proliferation of business types that Superset maintainers might want to add/remove/contribute. The only reason I consider this worth mentioning here and now is that we may have a similar problem to consider with things like formatters (e.g. supporting all the currencies of the world) which could use some organization/scalability efforts in a similar way.

One tiny nit you're probably way ahead of - the Business Type selector in the Dataset modal should probably be a Select component, and display those enabled business types (presumably populated by v1/business_type/types, rather than expecting the user to know/type them.

My biggest question is how all of this fits into any known/proposed/speculated plans for Superset's semantic layer, to make sure it intersects well with any potentially related efforts (e.g. how we build drill-down/across/through). I'll solicit some feedback on this front, and hopefully more folks will chime in on the thread.

Again, this is great work... thank you for all of the hard work and consideration that's already gone into this!

@betodealmeida
Copy link
Member

betodealmeida commented Dec 22, 2021

This is great! A couple comments:

  1. BaseColumn is going away ([SIP-68] A better model for Datasets #14909 and feat: new dataset/table/column models #17543) in the next few weeks, so I wonder if we should wait a little and implement this directly in the new Column model.
  2. Instead of having this in the config, we could leverage Python entry points to have a plugin system, like @rusackas mentioned. This way we can have a builtin set of business types that comes with Superset, and admins can develop, distribute, and install custom business types that would be automatically discovered by Superset. Best of both worlds! :)

@cccs-RyanS
Copy link
Contributor Author

cccs-RyanS commented Dec 22, 2021

Thanks for the feedback @betodealmeida @rusackas I really appreciate the interest!

So first addressing the idea of having a plugin system, we are 100% on board and had that in mind as well. We were just unsure if introducing this idea might be a separate SIP, extra feature or something long those lines. This could also address the idea of having some of these validators / types / etc be pre-canned and but also extensible as you both have mentioned, I think this is certainly a good idea.

in in addition v1/business_type/types will be used to populate a select control to populate a drop-down to select the business type. just so users don't get into weird situations selecting types that don't exist and such.

I was unaware about the changes to BaseColumn so implementing this on the new Column model does make sense to me.

at this time I do not have a ton of insight into how this fits into known/proposed/speculated plans for Superset's semantic layer but am willing to take on any feedback in regards to this.

Thanks again for the feedback!

@ktmud
Copy link
Member

ktmud commented Mar 2, 2022

What is proposed here seems to be an extension of data types. I'd probably call it AdvancedDataType instead of BusinessType. I was quite confused when I first saw the PR #18794 because I didn't know which "business" it was referring to.

@cccs-RyanS
Copy link
Contributor Author

cccs-RyanS commented Mar 2, 2022

What is proposed here seems to be an extension of data types. I'd probably call it AdvancedDataType instead of BusinessType. I was quite confused when I first saw the PR #18794 because I didn't know which "business" it was referring to.

For sure, I'm fine with the name change. I think the original name was in reference to business logic, however AdvancedDataType makes sense to me.

@cccs-RyanS cccs-RyanS changed the title [SIP-78] Proposal for Business Types [SIP-78] Proposal for Advanced Types Mar 31, 2022
@ktmud
Copy link
Member

ktmud commented Apr 11, 2022

While this will definitely be a very useful feature, we need to be careful with the implementation here. From a user experience point of view, I think it'd be a better experience if we simply combine the Data Type and Advanced Data Type column into one dropdown and allow users to coerce column types to any standard and advanced types (similar to Tableau). We should also make it very clear when values are displayed as originals and when are they translated, otherwise it could cause confusions. For example, it seems currently the filter search and dropdown values can be a mix of raw values and the translated values, which feels fragile and could produce unexpected results depending on the data types.

However, I do agree current implementation in the POC is the most straightforward and we can always iterate as long as this is behind a feature flag.

@villebro
Copy link
Member

@ktmud I agree that there is a lot of complexity in this feature which needs to be iterated on to make sure we get it fully right (@zhaoyongjie raised similar concerns). For that reason we'll keep the API and general functionality open to breaking changes until the full set of features has been completed. Which reminds me, I'll open a PR today to migrate the feature flag docs to the official documentation and add a few lines warning about the possibility of breaking changes during development: https://github.com/apache/superset/blob/master/RESOURCES/FEATURE_FLAGS.md

@rusackas rusackas added the sip Superset Improvement Proposal label Apr 27, 2022
@rusackas rusackas moved this to APPROVED (in mailing list) in SIPs (Superset Improvement Proposals) Dec 8, 2022
@rusackas rusackas moved this from APPROVED (in mailing list) to IMPLEMENTED / DONE in SIPs (Superset Improvement Proposals) Jan 24, 2023
@rusackas
Copy link
Member

Closing as done/implemented, since it's already merged. If there's more to do on this on the immediate horizon, just say the word, and we can re-open it if that makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sip Superset Improvement Proposal
Projects
Status: Implemented / Done
Development

No branches or pull requests

5 participants