Distinction between structured Body and Attributes #1613

Closed
jmm opened this issue Apr 13, 2021 · 31 comments
Labels
help wanted Extra attention is needed release:required-logdatamodel-ga Required for declaring log data model stable spec:logs Related to the specification/logs directory

Comments

@jmm

jmm commented Apr 13, 2021

Is there a clear distinction between what belongs in a structured Body vs. in Attributes? If there's not, it seems to make it less predictable where data is expected, and how to map between other models.

A top-level "message" string seems to be pretty common among other log data models. In this model, if you want to have a top-level message string and structured data describing the event, are you supposed to put the message in Body and other data that would otherwise be in Body in Attributes? I don't see anything about special-casing a property of Body as a top-level message.

From the mapping perspective, considering the Elastic Common Schema (ECS) example, it doesn't include all ECS fields, but message is the only one shown as mapping to Body. Body seems like a logical (perhaps the most logical) place for fields like error.message or event.id. So the fact that fields like error.message are shown as mapping into Attributes makes it seem that there's not a clear distinction, and that the mapping could be ambiguous depending on whether message is populated.

(BTW, unlike most of the rest of the document, the ECS example refers to "body" and "attributes" rather than "Body" and "Attributes", which I assume is a typo or holdover from a previous version.)


Attributes is documented as:

SHOULD follow OpenTelemetry semantic conventions for Attributes.

If I understand correctly, that means that a property of Attributes that has the name of a well-known attribute should have the meaning and data type defined for that attribute, but meanwhile Attributes can also include arbitrary custom attributes.

That also seems to have implications for placement of data within Body vs. Attributes, because whereas no semantic conventions apply to Body, if a property gets bumped from Body into Attributes, then it may be that conventions are supposed to apply that wouldn't otherwise.

@jmm jmm added the spec:logs Related to the specification/logs directory label Apr 13, 2021
@tigrannajaryan tigrannajaryan added the help wanted Extra attention is needed label Apr 28, 2021
@pmm-sumo
Contributor

pmm-sumo commented Apr 29, 2021

Does it actually help to have another container for key/values associated with a Log Record? If something is not covered by semantic conventions, I think it simply means that its definition is out of scope of OT - it might be related to a given environment, come in MDC, etc. As long as there's no conflict in the key names, why can't we reuse Attributes?

@tigrannajaryan
Member

As long as there's no conflict in the key names, why can't we reuse Attributes?

Attributes require a key to record a value. Body does not. Body is better suited for the most common legacy use case of logs: an unstructured text log line. To record it in the Attributes we would need a semantic convention for what key to use which is different from everything else.

Separate Body and Attributes appear to better fit existing logging data models (e.g. MSG vs STRUCTURED-DATA in Syslog, or log message vs log fields in Zap logger).

@jmm
Author

jmm commented May 4, 2021

Thank you both for commenting.

What I was really wondering is if it would make the most sense to have a top-level Message that's always a string when defined, and structured data that's part of the event would always go into a map in a top-level property such as Attributes (either directly or perhaps under a key reserved for a nested map of custom data with no semantic conventions?), or perhaps a top-level map reserved for that data (though maybe that clashes with reserving top-level properties for things that are almost always present).

Maybe I misunderstood, but I got the impression @pmm-sumo was suggesting that any structured data that could currently be placed in Body could instead go in Attributes, instead of Body sometimes containing structured data, which is part of what I was getting at. By "another container" did you mean Body or Context from #1660?

@pmm-sumo
Contributor

pmm-sumo commented May 5, 2021

I think I messed up and my comment was largely referring to Context of #1660 (somehow I mixed up the two issues). Apologies for that @jmm!

I think we may also want to change perspective when looking at that. Let's consider that both Body and Attributes contain key-values. Does it make any practical difference from a processor, exporter, or vendor perspective whether a given key-value is present in Body or in Attributes?

One case I can think of is when someone wants to put a boundary between metadata (present in Attributes) and record content (which might be a structured Body), so there would be a clear distinction between those.

To give an example, consider someone who has a temperature sensor and is logging its output. The sensor has some metadata assigned, e.g. id, connection type, etc., that are not part of the record. Practically, this might look like the following:

Body: {"temperature": 21.4, "unit": "degrees_celsius"}
Attributes: {"sensor_id": "1a90c", "manufacturer": "Acme", "connection": "usb 3.2"}
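A minimal Python sketch of the sensor example may help make the boundary concrete. The `LogRecord` class here is a hypothetical, simplified stand-in with only two fields; it is not the OTLP schema or any SDK type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LogRecord:
    """Hypothetical two-field record, for illustration only."""
    body: Any = None  # record content: string, map, array, ...
    attributes: Dict[str, Any] = field(default_factory=dict)  # metadata about the record


reading = LogRecord(
    body={"temperature": 21.4, "unit": "degrees_celsius"},
    attributes={"sensor_id": "1a90c", "manufacturer": "Acme", "connection": "usb 3.2"},
)

# A consumer can treat the two containers differently, for example by
# indexing the metadata keys while storing the content as an opaque payload.
indexed_keys = sorted(reading.attributes)
payload = reading.body
```

With this split, the measurement itself never mixes with the keys that describe the device that produced it.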

@jmm
Author

jmm commented May 6, 2021

@pmm-sumo no worries!

The only case I can think of is when someone wants to put a boundary between metadata (present in Attributes) and record content (which might be a structured Body), so there would be a clear distinction between those.

Yeah that's what I was getting at. But the way it's designed currently, any time you want to have structured record content and a top-level message, the structured content would have to be bumped into Attributes anyway. Unless the idea is that message and structured record content are mutually exclusive. And the example mapping of ECS to this model shows data that I see as part of the event, and therefore probably most at home in Body, being mapped into Attributes.

@pmm-sumo
Contributor

pmm-sumo commented May 6, 2021

Unless the idea is that message and structured record content are mutually exclusive.

My understanding of the Body field description is exactly that - either a raw message or structured content (map or array). This is further reinforced by the any type definition.

@jmm
Author

jmm commented May 6, 2021

Right, it's mutually exclusive in terms of Body. But if you populate Body with a message string, is that supposed to mean you can't also populate arbitrary structured data that doesn't have standardized semantics? If that's not the intent then the structured data would have to get bumped into Attributes, right? Let's say a message and tags, for example.

So what I'm really wondering is if things would be more straightforward by having a top-level Message that's always a string when populated, and additional structured data without standardized semantics would always go within a certain top-level property (whether it's under Attributes or Body or something else).

@pmm-sumo
Contributor

pmm-sumo commented May 6, 2021

But if you populate Body with a message string, is that supposed to mean you can't also populate arbitrary structured data that doesn't have standardized semantics?

I believe they can hold any sort of data. They SHOULD (not MUST) follow Semantic Conventions according to the data model.

Let's consider several options:

  1. Attributes only
    My understanding is that it's much like the ECS schema. There can be a special key, say Message, that denotes an attribute containing the raw string message. If the message is structured, the related fields are mixed with metadata.

  2. Attributes and Message
    This is the same as above, except Message is now a field of the record. Everything else holds true.

  3. Attributes, Message and MessageAttributes (for lack of a better name)
    Let's say we want to separate record-level attributes and message-level attributes (if the original message was structured). That's one way to do it.

  4. Attributes and Body
    This is the current approach. Body is flexible enough to cover either a structured original message or a plain-text original message but not really both (unless some standard field name would be introduced for plain-text message).
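To compare the four options side by side, here is a rough Python sketch that lays out one and the same event in each shape. The event, the key names (`disk`, `used_pct`, `host`), and the plain-dict representation are all invented for illustration; only the field names `Attributes`, `Message`, `MessageAttributes`, and `Body` come from the list above.

```python
message = "Disk usage high"
fields = {"disk": "/dev/sda1", "used_pct": 93}  # structured part of the original message
metadata = {"host": "web-1"}                    # record-level metadata

# 1. Attributes only: the message lives under a special key, mixed with everything else.
option_1 = {"Attributes": {"Message": message, **fields, **metadata}}

# 2. Attributes and Message: the message is a record field, the rest is still mixed.
option_2 = {"Message": message, "Attributes": {**fields, **metadata}}

# 3. Attributes, Message and MessageAttributes: message-level and record-level data are separated.
option_3 = {"Message": message, "MessageAttributes": fields, "Attributes": metadata}

# 4. Attributes and Body (current approach): Body holds either the text or the structure.
option_4_text = {"Body": message, "Attributes": {**fields, **metadata}}
option_4_structured = {"Body": fields, "Attributes": metadata}
```

Note that option 4 needs two variants: once Body carries the plain-text message, the structured fields have nowhere to go except Attributes, and vice versa.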

Things get a bit more complex when there's a mix of structured and unstructured data. For a practical example, here's a random output from OpenTelemetry Collector:

2021-05-06T21:43:23.740+0200	info	service/application.go:261	Starting OpenTelemetry Collector...	{"Version": "v0.24.0-27-gfa73baf8", "GitHash": "fa73baf8", "NumCPU": 16}

Taking the timestamp, log level and caller aside, we end up with essentially a message: Starting OpenTelemetry Collector... and some attributes: {"Version": "v0.24.0-27-gfa73baf8", "GitHash": "fa73baf8", "NumCPU": 16}. With the current approach, should they land in Attributes or as a part of a structured Body? If it's the latter, what about the message?

Actually, the log data model comes with an answer for that case - the attributes go to Attributes and the message to Body:

| Field | Type | Description | Maps to Unified Model Field |
|---|---|---|---|
| ts | Timestamp | Time when an event occurred measured by the origin clock. | Timestamp |
| level | enum | Logging level. | Severity |
| caller | string | Calling function's filename and line number. | Attributes, key=TBD |
| msg | string | Human readable message. | Body |
| All other fields | any | Structured data. | Attributes |
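The mapping table above can be sketched as a small Python function. This is an illustration under assumptions, not Collector code: `map_zap_record` is a hypothetical helper, the input is assumed to be an already-parsed dict using the field names from the table, and since the table leaves the attribute key for `caller` as TBD, the key `"caller"` below is a placeholder.

```python
from typing import Any, Dict


def map_zap_record(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Zap-style entry to unified-model fields, following the table."""
    rest = dict(entry)  # copy so the caller's dict is not mutated
    record: Dict[str, Any] = {
        "Timestamp": rest.pop("ts", None),
        "Severity": rest.pop("level", None),
        "Body": rest.pop("msg", None),
        "Attributes": {},
    }
    if "caller" in rest:
        # The table says "key=TBD"; "caller" is a placeholder chosen here.
        record["Attributes"]["caller"] = rest.pop("caller")
    record["Attributes"].update(rest)  # all other fields go to Attributes
    return record


record = map_zap_record({
    "ts": "2021-05-06T21:43:23.740+0200",
    "level": "info",
    "caller": "service/application.go:261",
    "msg": "Starting OpenTelemetry Collector...",
    "Version": "v0.24.0-27-gfa73baf8",
    "GitHash": "fa73baf8",
    "NumCPU": 16,
})
```

Applied to the Collector line quoted earlier, the message ends up in Body and the three structured fields, plus the caller, end up in Attributes.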

@tigrannajaryan
Member

@jmm @pmm-sumo let's discuss today in Log SIG meeting if you plan to attend.

@tigrannajaryan
Member

What do you think about the following?


If the log record contains one or more "pieces" of data that may fit either in Body or Attributes but do not fit the description of the other top-level fields (Timestamp, Severity, etc) then follow these guidelines to decide how to record these pieces of data in the Body and Attribute fields:

  1. If the log record contains a single raw byte sequence that is not associated with any particular key (such that we cannot think of it as a key/value pair) put it in the Body.
  2. If the log record contains a single raw Unicode character string that is not associated with any particular key then put it in the Body.
  3. If the log record contains a single JSON-like array value or a primitive value that is not associated with any particular key then put it in the Body.
  4. If the log record contains a single JSON-like object with key/value pairs it may be recorded either in the Body or in the Attributes. The choice in this case should be based on whether we are interested in individual key/value pairs (e.g. each key/value has a particular meaning, possibly independently from the meaning of the other key/value pairs), in which case recording them in the Attributes would be preferable. Otherwise if we are more interested in just recording the object as a whole and getting it delivered then the Body would be preferable.
  5. If the log record contains a piece of data that is associated with a key then record that piece as a key/value pair in the Attributes.

Note: if there is more than one piece of data that matches rules 1-4 then we cannot record them in the Body; we have to come up with some keys and record each piece of data as a key/value pair in the Attributes.
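One possible reading of these heuristics, sketched in Python. Everything here is an assumption made for illustration: the `place` function, the `pairs_matter` flag that stands in for the judgment call in rule 4, and the generated `piece_N` keys used when several unkeyed pieces compete for the Body.

```python
from typing import Any, Dict, List, Optional, Tuple


def place(unkeyed: List[Any], keyed: Dict[str, Any],
          pairs_matter: bool = True) -> Tuple[Optional[Any], Dict[str, Any]]:
    """Return (body, attributes) for the given pieces of data."""
    attributes = dict(keyed)  # rule 5: keyed data becomes Attributes
    body = None
    if len(unkeyed) == 1:
        piece = unkeyed[0]
        if isinstance(piece, dict) and pairs_matter:
            # Rule 4, individual key/value pairs are of interest.
            # Keys from the object overwrite same-named keyed pieces.
            attributes.update(piece)
        else:
            # Rules 1-3 (bytes, string, array, primitive), or rule 4
            # when the object is only of interest as a whole.
            body = piece
    elif len(unkeyed) > 1:
        # Note: several unkeyed pieces cannot share the Body, so keys
        # are invented for them (hypothetical naming scheme).
        for i, piece in enumerate(unkeyed):
            attributes[f"piece_{i}"] = piece
    return body, attributes
```

For example, `place(["disk full"], {"host": "web-1"})` keeps the string in the Body, while `place(["a", b"b"], {})` leaves the Body empty and records both pieces under invented keys.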


This is a non-exhaustive set of heuristics but should probably be a good starting point. What do you think?

@tigrannajaryan
Member

@pmm-sumo @djaglowski will one of you be able to submit a PR to make corresponding changes in the spec?

@pmm-sumo
Contributor

pmm-sumo commented May 27, 2021

@tigrannajaryan @djaglowski sure, preparing a proposal. Since the guidelines are quite clear, I am going to literally put those into a dedicated section

@tigrannajaryan
Member

@pmm-sumo since this is assigned to you, do you feel that you can make progress on this? Or, given the lack of consensus on #1727, should we close this issue and keep it unspecified for now until we have a better understanding?

@pmm-sumo
Contributor

We had a great discussion but reached no consensus on either #1727 (providing guidelines) or #1752 (being more restrictive). It is currently one of the major issues blocking Logs GA. Perhaps we could sync during the next Log SIG and discuss it online?

The next Log SIG is scheduled for Oct 27th, 10am Pacific, though we suggest moving it a week earlier, to Oct 20th, same time. If that time does not work, we could organise a dedicated sync for this issue.

Would that work for you? @yurishkuro @SergeyKanzhelev @errordeveloper @tigrannajaryan @djaglowski @jmm

@tigrannajaryan
Member

tigrannajaryan commented Oct 13, 2021

@yurishkuro your last comment on the PR was that you fail to see what the confusion is. Here is one more example of the confusion https://cloud-native.slack.com/archives/CJFCJHG4Q/p1627987975028000
Do you still think it is not worth addressing?

@yurishkuro
Member

the question in chat was this (with my emphasis):

Knowing that the body on logs can be a structured data, when should one be using the body to provide additional attributes instead of using the attributes from the log record itself?

I think the problem we're having is we are not clear about who is this "one". Which persona is that? An application developer using a logging API from an application? Or an infrastructure engineer writing a transformation of one data format into OTLP? These are distinct use cases requiring distinct solutions.

Having said that, I like the suggestion in https://github.com/open-telemetry/opentelemetry-specification/pull/1727/files#r643105246 of restricting Body to be a primitive type. I think it resolves the confusions without limiting the expressiveness of OTLP model.

@tigrannajaryan
Member

I think the problem we're having is we are not clear about who is this "one". Which persona is that? An application developer using a logging API from an application?

Yes, I think the confusion primarily arises when one needs to generate a log record (via a logging API or emit an OTLP log record). This is a clean slate situation: I know I want to generate a log record, I know roughly what data I want to put in the record, but I can structure the same data such that some primary bit goes into the Body (e.g. the human readable message) and the rest goes into the Attributes, or I can put everything into Attributes (with the human readable message being just another Attribute). In other words, one can also say that it is a data modelling problem: given some information I know I want to record, and having the ability to record it in a few different ways, how do I decide what shape I want this information to take?

Or an infrastructure engineer writing a transformation of one data format into OTLP?

This likely is not an issue, since it is primarily driven by the semantics of that particular data format. We have examples of transformations in the log data model doc, a few of which I wrote and I think it was fairly straightforward to know what to put in the Body vs Attributes.

Having said that, I like the suggestion in https://github.com/open-telemetry/opentelemetry-specification/pull/1727/files#r643105246 of restricting Body to be a primitive type.

I don't think that will work. There are formats where log body is a complex structured data (there are a couple examples in log data model doc).

@yurishkuro
Member

I think the confusion primarily arises when one needs to generate a log record (via a logging API or emit an OTLP log record)

Well, to me this is not at all the same case. Writing via Logging API should avoid this problem altogether because the API should allow the user to express their intent. Nobody just "emits OTLP log record", OTLP is not something end users are ever exposed to directly.

I don't think that will work. There are formats where log body is a complex structured data (there are a couple examples in log data model doc).

@tigrannajaryan please elaborate / point to which examples you mean. Logically I don't see how a body could be simultaneously structured and yet unnamed ("body" == no name). Take the Zap logger API, for example. The Body there (aka "message") is always a string; you cannot pass structured data via the API as the body. But you can pass it as an attribute as long as you name it. If there is a use case in the examples where a body is both structured and unnamed, then I think the structure in that case is not arbitrary (Any), but pre-defined (e.g. an IoT device emitting some data), and therefore should be translated into attributes.

@tigrannajaryan
Member

Well, to me this is not at all the same case. Writing via Logging API should avoid this problem altogether because the API should allow the user to express their intent. Nobody just "emits OTLP log record", OTLP is not something end users are ever exposed to directly.

Fair enough, I agree. It will only be a problem if we allow the Logging API to accept a structured Body, which we don't have to if we don't think it is necessary. So, our API can restrict the Body to be a string only (actually a string as it is known in some languages may not be good enough, we may need to allow any sequence of bytes, not just valid Unicode strings).

please elaborate / point to which examples you mean. Logically I don't see how a body could be simultaneously structured and yet unnamed ("body" == no name).

Splunk HEC has a structured Body and additional fields (Attributes). Similarly, I believe Google Cloud Logging can have a structured json_payload mapped to Body.

@yurishkuro
Member

So I find it interesting that both Splunk and Google examples have the exact same situation - a body and attributes. So what are their guidelines for which way structured data should go?

@yurishkuro
Member

To put it differently, if they haven't solved the semantic distinction between structured body and attributes (and they sit much closer to user intent), how can we expect to solve it downstream of them? I would instead be inclined to change the mapping tables you linked to and say that those structured fields do NOT map to Body, but to similarly named attributes, like json_payload (especially if it helps to change Body field to a primitive type).

@tigrannajaryan
Member

So I find it interesting that both Splunk and Google examples have the exact same situation - a body and attributes. So what are their guidelines for which way structured data should go?

Unfortunately there are no guidelines that I am aware of. The Google Cloud logging says that it is a union of one of the 3 things:

Union field payload. The log entry payload, which can be one of multiple types. payload can be only one of the following...

In a sense it takes the same stance that the OTel data model currently takes: it tells what can be represented, but doesn't give any recommendations on how to use it.

@tigrannajaryan
Member

I would instead be inclined to change the mapping tables you linked to and say that those structured fields do NOT map to Body, but to similarly named attributes, like json_payload (especially if it helps to change Body field to a primitive type).

I have a feeling this is not the right approach. I think this makes things worse, we are wrapping something that doesn't need to be wrapped. We already have the exact matching concept for it, so wrapping it adds unnecessary data nesting (bad for UIs and for querying, etc).

@yurishkuro
Member

Ok, but it seems the only reason we have body as a structured field is to support these mappings from 3rd party formats. With respect to this ticket, we can easily say "we recommend treating body as a string", and only treat body as structured when doing those 3rd party conversions.

@tigrannajaryan
Member

With respect to this ticket, we can easily say "we recommend treating body as a string", and only treat body as structured when doing those 3rd party conversions.

@pmm-sumo what do you think?

@pmm-sumo
Contributor

I am subjectively inclined towards limiting the Body type to String and Byte Array:

What we gain:

  • the model is closer to spans and metrics, which makes it easier to reuse tools (e.g. processors) for other signals
  • much easier to manipulate logs when not having to deal with both Body and Attributes key-values (we still have Resource Attributes, which actually makes it one more)
  • the confusion we discuss is gone
  • somewhat following Occam's razor approach :)

What we lose:

  • we can no longer store complex arrays directly in Body (though such content can be put into Attributes)
  • we can no longer store basic primitive types natively (int64, bool, double) - though they can be converted to String or stored in Attributes
  • structured input (e.g. JSON) would either need to be stored as raw data in Body or parsed and put into Attributes, which might bring some confusion in the future (though I find it easier to explain)
  • ability to have bijective 3rd party conversion (e.g. HEC -> OTLP -> HEC will yield output that differs from the input)

I am not sure if either of the listed above is a hard requirement. The state we have now is somewhat confusing and we can avoid that by limiting allowed types. If we want to do it, I think it's easier to do it now, while the OTLP Logs adoption is still relatively low. Then, if we find there are valid use-cases, it can be brought back.

With respect to this ticket, we can easily say "we recommend treating body as a string", and only treat body as structured when doing those 3rd party conversions.

@pmm-sumo what do you think?

Yeah, that sounds like a viable approach as well. I find it similar to #1727, further narrowing down when complex Body types should be used.

@tigrannajaryan
Member

I am subjectively inclined towards limiting the Body type to String and Byte Array

I don't mind such limitation in the Logging API.

However, if we are talking about the protocol I am strongly against doing this for 2 reasons.

Firstly, we have written code that uses Body in structured form. It would be a major breaking change for the Collector, including third-party distros, with unclear consequences. Yes, formally we are allowed to make such breaking changes because the protocol is in Beta, but we need to be considerate of the pain we are causing.

Secondly, I do not think it is a good idea to limit the protocol in such a way anyway (see below why).

ability to have bijective 3rd party conversion (e.g. HEC -> OTLP -> HEC will yield output that differs from the input)

It is a stated goal of the data model to support lossless and unambiguous conversions. We will be breaking our design promise: It should be possible to unambiguously map existing log formats to this Data Model. Translating log data from an arbitrary log format to this Data Model and back should ideally result in identical data.

This design promise enables an important property of the Collector - passing data through the Collector in a particular protocol is lossless and does not change the data in any way. We will lose this if we drop that design constraint.
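The round-trip argument can be illustrated with a short Python sketch. It assumes a heavily simplified HEC-like event with just an `event` payload and a `fields` map, mapped to Body and Attributes respectively; the helper names and the `structured_body` switch are hypothetical and do not correspond to any Collector component.

```python
import json
from typing import Any, Dict


def hec_to_record(event: Dict[str, Any], structured_body: bool) -> Dict[str, Any]:
    """Convert a simplified HEC-like event to a Body/Attributes record."""
    payload = event["event"]
    if not structured_body and not isinstance(payload, str):
        # A string-only Body forces the structured payload to be serialized.
        payload = json.dumps(payload, sort_keys=True)
    return {"Body": payload, "Attributes": dict(event.get("fields", {}))}


def record_to_hec(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert the record back to the simplified HEC-like shape."""
    return {"event": record["Body"], "fields": dict(record["Attributes"])}


original = {"event": {"action": "login", "ok": True}, "fields": {"region": "eu"}}

lossless = record_to_hec(hec_to_record(original, structured_body=True))
lossy = record_to_hec(hec_to_record(original, structured_body=False))
```

With a structured Body the round trip returns the original event unchanged. With a string-only Body the payload comes back as a JSON string, and the exporter has no way to tell whether the source sent a string or an object.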

@pmm-sumo
Contributor

Yeah, that is a strong argument. I think it's most reasonable to keep the model as it is and just clarify intended use

@tigrannajaryan
Member

We discussed this today in Log SIG. To summarize:

  • We think that there is value in allowing Body to be structured in the data model, particularly for the purpose of representing non-OpenTelemetry data formats unambiguously. We need to explain this value in the data model document.
  • We want to produce guidelines for authors which implement a mapping from a data format to OpenTelemetry data model (e.g. Collector receiver authors), which tells where to put a particular piece of data they have.
  • We cannot think of a good use of a structured Body in the logging APIs. All of the logging libraries that we are aware of produce log body (message) as a string. Because of this we think that it is reasonable to limit logging APIs to accept the body as a string only and the structured data that logging libraries allow should go to the attributes.
  • We want to produce guidelines for logging library authors to know what goes into the Body (which is just a string) and what goes into the attributes or other fields (the limitation of Body as a string helps to write these guidelines).

These guidelines and explanations can be either in the spec or in the form of OTEPs.
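As a rough sketch of what a string-only logging API could look like, consider the Python function below. The `emit` function and its dict return value are hypothetical and are not part of any OpenTelemetry API; the sketch also accepts only `str`, although it was noted earlier in the thread that arbitrary byte sequences may need to be allowed too.

```python
from typing import Any, Dict


def emit(body: str, **attributes: Any) -> Dict[str, Any]:
    """Hypothetical API: the body is a string, structured data goes to attributes."""
    if not isinstance(body, str):
        raise TypeError("body must be a string; pass structured data as attributes")
    return {"Body": body, "Attributes": attributes}


record = emit("user logged in", user_id=42, method="password")
```

Because the signature leaves only one place for structured data, the caller never has to choose between a structured Body and Attributes.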

I will open separate issues to address these 4 topics individually.

@tigrannajaryan
Member

tigrannajaryan commented Oct 27, 2021

Separate issues created:
#2066
#2067
#2068
#2069

Unless someone objects I suggest closing this issue since the 4 above capture the totality of the work to be done.

@tigrannajaryan tigrannajaryan added the release:required-logdatamodel-ga Required for declaring log data model stable label Nov 4, 2021
tigrannajaryan pushed a commit that referenced this issue Nov 4, 2021
Resolves #2066 and #1752

Supports #2068

## Changes

Adds a note to the log data model which explains the intended usage of the `Body` field. 

## Additional Context

Extensive discussion has been had on this issue on [#1613](#1613 (comment)), as well as in the Log SIG group.
@djaglowski
Member

Unless someone objects I suggest closing this issue since the 4 above capture the totality of the work to be done.

Shall we close this?

joaopgrassi pushed a commit to dynatrace-oss-contrib/semantic-conventions that referenced this issue Mar 21, 2024
Resolves #2066 and open-telemetry#1752

Supports #2068

## Changes

Adds a note to the log data model which explains the intended usage of the `Body` field. 

## Additional Context

Extensive discussion has been had on this issue on [open-telemetry#1613](open-telemetry/opentelemetry-specification#1613 (comment)), as well as in the Log SIG group.

6 participants