[Feature request] Add field mapping correlation type metadata concept #7082
Labels:
- discuss: Issues intended to help drive brainstorming and decision making
- enhancement: Enhancement or improvement to existing feature or request
- feature: New feature or request
- Indexing: Indexing, Bulk Indexing and anything related to indexing
Is your feature request related to a problem?
As part of the Integration campaign and the [Integration RFC](https://github.com/opensearch-project/OpenSearch-Dashboards/issues/3412), we have introduced the SimpleSchema for the Observability domain, which is based on the concept of a well-structured index that follows a schema.
Schema
A schema is associated with an index using the mapping configuration.
This mapping structure is also composable using the `composed_of` template capability, which is used extensively to allow different assemblies of various log types.

Another concept behind the schema is the capability of reflecting relationships.
This representation is currently defined in a proprietary way, by adding this information to the index mapping template's metadata.
In the Observability domain, a `log` entity's relationship to a `trace` entity, `(:log)-[:associated]-(:trace)`, using the `traceId` correlation field, is described in the log's mapping metadata section.

What solution would you like?
I would like the field mapping API to be extended with this metadata information.
Recently there have been large extensions to the conceptual operation of OpenSearch as a search engine.
These extensions include:
The evolution of a knowledge layer on top of the data layer is an existing trend, both in OpenSearch and in other storage engines.
A key part of any knowledge layer is the concept of relationships between the different entities.
P1 - The First Step
This step includes the introduction of the `correlations` concept into the field mapping.

Even though the concept of index relationships does exist today:
Both options imply a physical, explicit relationship between indices, which has a strong side effect on physical index storage and query time.
In addition, the specific field mapping does not reflect this join, which is only present at the higher index mapping level.
The new `field-mapping-correlation` feature addresses the metadata aspect of the relationship between well-structured entities residing in different indices.
A `correlation` is a weaker constraint, in the sense that it does not impose a foreign key constraint as a relational database would, but rather implies that such a correlation exists and may be joined using a query engine.
Another difference from the existing `join` fields is that this `correlation` will at first be a declarative metadata definition that is not enforced against the actual data inside the indices; only the mapping correlation metadata will be enforced, as detailed below.
New Correlation Section in Field mapping
Field mapping for a field which has a relationship to another foreign field in the target entity's index:
GET log/_mapping/field/traceId
Will respond with:
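The response body did not survive in this copy of the issue. As a rough sketch only, the shape below assumes the `correlations` section sits inside the field's own mapping and uses the `foreign-schema` / `foreign-field` keys named in the Enforcement section. The exact nesting, the `keyword` type, and the helper `field_correlations` are assumptions for illustration, not a confirmed API; the payload is written as a Python dict.

```python
# Hypothetical response for GET log/_mapping/field/traceId.
# The outer layout follows the existing get-field-mapping response;
# the "correlations" section and its placement are assumptions.
hypothetical_response = {
    "log": {
        "mappings": {
            "traceId": {
                "full_name": "traceId",
                "mapping": {
                    "traceId": {
                        "type": "keyword",  # assumed field type
                        "correlations": [
                            {
                                "foreign-schema": "traces",
                                "foreign-field": "traceId",
                            }
                        ],
                    }
                },
            }
        }
    }
}


def field_correlations(response, index, field):
    """Return the correlation entries declared for a field (empty list if none)."""
    field_entry = response.get(index, {}).get("mappings", {}).get(field, {})
    mapping = field_entry.get("mapping", {}).get(field, {})
    return mapping.get("correlations", [])
```

A client could call `field_correlations(hypothetical_response, "log", "traceId")` to discover which foreign schema and field the `traceId` field points at.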
This metadata information will be used by the SQL / PPL query engine to allow explicit correlation between different data-streams or datasources.
Having this information explicitly will allow better understanding and enhance investigation capabilities.
Once a SQL / PPL correlation (join) query is submitted against the corresponding index, it will create a regular SQL join query.
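To illustrate how a query engine might consume this metadata, here is a minimal sketch that turns one correlation entry into a SQL join. The function name `build_join_sql` is hypothetical, it assumes the foreign schema name can be used directly as a table name, and it does not reflect how the SQL / PPL plugin is actually implemented.

```python
def build_join_sql(index, field, correlation):
    """Build a plain SQL join from one correlation entry (illustrative only)."""
    foreign_table = correlation["foreign-schema"]
    foreign_field = correlation["foreign-field"]
    # Quote identifiers with backticks since names may contain dots or dashes.
    return (
        f"SELECT * FROM `{index}` AS l "
        f"JOIN `{foreign_table}` AS r "
        f"ON l.`{field}` = r.`{foreign_field}`"
    )


sql = build_join_sql(
    "log", "traceId", {"foreign-schema": "traces", "foreign-field": "traceId"}
)
print(sql)
```

Because the correlation is declarative metadata only, nothing here guarantees that matching rows exist; the join simply returns whatever the data supports.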
Enforcement
In the first `P1` step, the `mapping` API would enforce the following when a field mapping correlation is requested:

- `foreign-schema` mapping exists (in the above example, `"foreign-schema": "traces"` implies that an index template named `traces` must exist)
- `foreign-field` mapping exists (in the above example, `"foreign-field": "traceId"` implies that a field named `traceId` must exist)

The `correlations` field may accept multiple correlations for additional remote indices, including remote tables and datasources.

P2 - The next Step
The next phase of the correlation capability would include actually precomputing the correlated data using auxiliary data structures / indices.
The auxiliary data structure may take the form of an eager correlation task which precomputes the join and materializes it into secondary storage.
An additional skipping index can be introduced to further optimize filter-based queries using a Bloom filter or another probabilistic data sketch.
SQL queries would return much faster thanks to these auxiliary structures, enabling faster, investigation-driven use cases on top of huge indices and event data-lake based correlations.
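As an illustration of the skipping-index idea, the sketch below keeps one small Bloom filter per index and uses it to decide whether a `traceId` lookup can skip that index entirely. The class name, index names, sizing parameters, and hashing scheme are all assumptions chosen for brevity, not a proposed design.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


# One filter per (hypothetical) index acts as a skipping index over traceId values.
skipping_index = {"traces-2023-01": BloomFilter(), "traces-2023-02": BloomFilter()}
skipping_index["traces-2023-01"].add("trace-abc")


def candidate_indices(trace_id):
    """Indices that may hold trace_id; all others can be skipped without a scan."""
    return [name for name, bf in skipping_index.items() if bf.might_contain(trace_id)]
```

A filter that answers "no" is always right, so skipped indices never hide a match; a "yes" may be a false positive and still requires the real lookup.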
What alternatives have you considered?
A clear and concise description of any alternative solutions or features you've considered.
Do you have any additional context?