New component: Blob Upload Processor #33737
Comments
I am willing to potentially sponsor this, but I would love to see if any others have needed to store very large or sensitive attributes separately. I plan to raise this tomorrow at the SIG meeting.
I raised this at the SIG meeting today, but this wasn't an issue people on the call had run into before.
There is some consideration of moving the "larger" genai attributes. open-telemetry/semantic-conventions#483 (comment)
We at Langtrace are also interested in testing out this span processor, as we are also thinking about this problem. We currently have 2 GenAI OTEL instrumentation libraries: Python and TypeScript.
The LLM Semconv WG is considering reporting prompts and completions in event payloads (and breaking them down into individual structured pieces) - open-telemetry/semantic-conventions#980. Still, there is a possibility that prompts/completion messages could be big. There is interest in the community in recording generated images, audio, etc. for debugging/evaluation purposes. From a general semconv perspective, we don't usually define span attributes that may contain unbounded data. In this context, it could make sense to also support blob uploads with LogProcessor. See also open-telemetry/semantic-conventions#1217, where similar concerns have been raised for logs.
In the interests of transparency, I have started related work on this here: I originally started with a "processor", but I'm having doubts about whether this functionality is possible with a processor and am now looking into representing it as an "exporter" that wraps another exporter (but perhaps this is incorrect?). In any event, the (very early, not yet complete) code is in development here: I appreciate the insight that this may shift to a different representation... with that in mind, I am going to try to make this more general. While I will start with span attributes to handle current representations, I will keep the naming general and allow this to grow to address write-aside to blob storage from other signal types and other parts of the signal.
Quick Status update:
Will give another update in 2 weeks' time or when this is working, whichever is sooner.
Apologies that this is taking longer than expected. I am, however, still working on this.
The general shape of this is now present and can be found in: I still need to polish this and create end-to-end testing, but there is probably enough here to get early feedback. Note that while the original scope was intended to focus on spans, the above covers BOTH spans AND span events, given the pivot of the GenAI semantic conventions towards span event attributes. I also pivoted from hand-rolling the string interpolation to trying to leverage OTTL to do it: ... this required some hackery in OTTL, though, and I am wondering if there is an even cleaner approach than this.
@michaelsafyan thanks! To bring you up to date, the current semconv 1.27.0 already uses span events, so this is relevant. What's a question mark to many is the change to log events. For example, not all backends know what to do with them, and there is some implied indexing. So, I would expect that once this is in, folks will want to transform log events (with span context) back to span events. Do you feel up to adding a function like interpolateSpanEvent to do that? Something like
@codefromthecrypt can you elaborate on what you mean by The way that I'm thinking about this is that
What I have there now targets:
A logical expansion of this logic would be to also handle:
Other types of conversions (such as span events to logs, or logs back into span events) make sense and would be useful, but probably should be considered out of scope for this particular component (and should probably be tracked in a separate issue), though I agree that it is important for different users to decide whether their events data is recorded as events attached to a span or as separate logs (and that a connector is likely to be a good way to implement that).
@michaelsafyan so the main question about log events was in relation to the genai spec, which is about to switch to them. Since this spec is noted in the description, I thought it might be in scope for this change/PR. What do you think is a better place to move the topic of transforming "span events to log events" to? If you don't have a specific idea, I'll open a new issue; I just didn't want to duplicate this if it was in scope.
I think new, separate issues for "Log Events -> Span Event Connector" and "Span Events -> Logs Connector" would make sense.
cool. I opened #34695 first, and if I made any mistakes in the description please correct if you have karma to do so, or ask me to, if you don't.
Just providing another update, since it has been a while. I was out on vacation last week and had other work to catch up on this past week. I am hoping to resume this work this coming week. This is still on my plate.
Quick status update:
I am, however, encountering merge conflicts when attempting to sync from upstream ... so this may require some additional work to resolve.
Status update: Still working on writing tests. As per usual, getting progressively from one error to a different kind of error. Now the errors that I'm getting are related to the string interpolation library, which relates to open issue #34700. I'm also realizing that the data model in https://github.com/michaelsafyan/open-telemetry.opentelemetry-collector-contrib/tree/blob_writer_span_processor/connector/blobattributeuploadconnector/internal/foreignattr is one that probably requires more input/agreement in OTel SemConv. I will be opening up an issue there shortly to discuss further and to ensure that it won't block upstreaming this code when it is done.
Status update: now have the string interpolation logic in OTTL working. Next steps:
Status update:
To keep the change from growing out of control and to prevent horrible merge conflicts down the road, I'm thinking about upstreaming parts of this piecemeal and then expanding capabilities rather than trying to include every single signal type from the outset before starting to upstream.
I'm renaming this from A renamed version now exists in this development branch: I'm going to work on getting pieces of this upstreamed and, in parallel, I am going to start a new development branch for adding capabilities related to logs. That work will proceed here:
I will sponsor this component. Thanks @michaelsafyan for working on this!
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping
The purpose and use-cases of the new component
The Blob Uploader Processor takes selected attributes/fields (from spans, span events, logs, etc.) and:
This component is intended to address a number of concerns:
Motivating Examples:
- http.request.body.content and http.response.body.content
- gen_ai.prompt and gen_ai.completion

Use Cases Related to the Examples:
Additional restrictions around the access are needed beyond that of the general operations solution; writing to a separate blob storage allows additional access controls to be applied. Links to the destination enable the results to be located in a separate backend storage system that provides the necessary checks on access.
Full request/responses get used rarely by the oncallers, only when their end user opens a ticket through their support mechanism; writing this data to a separate, low-cost storage system allows the user to save on their ops storage costs.
Example configuration for the component (subject to change)
The following is intended to illustrate the general idea, but is subject to change:
The configuration consists of a list of ConfigStanzas. Each config stanza defines how it will handle exactly one type of attribute. The properties of the stanza are:
- The attribute to match (e.g. http.request.body.content)
- Where to look for the attribute:
  - SPAN: only look at span-level attributes (not resource, scope, or event attributes)
  - RESOURCE: only look at resource-level attributes (not span, scope, or event attributes)
  - SCOPE: only look at scope-level attributes (not span, resource, or event attributes)
  - EVENT: only look at event-level attributes (not span, resource, or scope attributes)
- The destination to write to, given as a pattern (e.g. gs://example-bucket/full-http/request/payloads/${trace_id}/${span_id}.txt). Values available for interpolation include:
  - trace_id
  - span_id
  - resource.attributes
  - span.attributes
  - scope.attributes
  Individual attributes are referenced by key (e.g. span.attributes.foo or span.attributes[foo]).
- The content type of the uploaded data (default: AUTO):
  - AUTO: attempt to infer the content type automatically
  - extract_from: expr: derive it from other information in the signal
    - Ex: extract_from: span.attributes["http.request.header.content-type"]
  - a string literal (e.g. "application/json"): to use a static value
- What to do with the original attribute after the write (default: REPLACE_WITH_REFERENCE):
  - REPLACE_WITH_REFERENCE: replace the value with a reference to the destination location.
  - KEEP: the write is a copy, but the original data is not altered.
  - DROP: the fact that a write happened will not be recorded in the attribute
- What to do with data outside the matched fraction (default: DROP):
  - DROP: remove the attribute in its entirety
  - KEEP: don't modify the original data if this fraction wasn't matched

Here is a full example with the above in mind:
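Separately from the configuration example referred to above, the destination interpolation and the REPLACE_WITH_REFERENCE / KEEP / DROP options can be illustrated with a minimal Go sketch. This is not the component's implementation (the work discussed in the thread moved to OTTL for interpolation); it only uses the standard library's os.Expand to show the idea. The names Action, interpolate, and applyAction are hypothetical, a plain map[string]string stands in for real span attributes, and treating DROP as "delete the attribute" is an assumption based on the wording in the issue.

```go
package main

import (
	"fmt"
	"os"
)

// Action mirrors the option names listed in the issue. The Go type itself
// is hypothetical and is not taken from the component's code.
type Action string

const (
	ReplaceWithReference Action = "REPLACE_WITH_REFERENCE"
	Keep                 Action = "KEEP"
	Drop                 Action = "DROP"
)

// interpolate expands ${name} placeholders in a destination pattern.
// Names that are missing from vars expand to the empty string.
func interpolate(pattern string, vars map[string]string) string {
	return os.Expand(pattern, func(name string) string {
		return vars[name]
	})
}

// applyAction updates attrs[key] in place, assuming its value has already
// been written to dest.
func applyAction(attrs map[string]string, key, dest string, action Action) error {
	switch action {
	case ReplaceWithReference:
		// The attribute now points at the uploaded blob.
		attrs[key] = dest
	case Keep:
		// The upload was a copy; the original value is left untouched.
	case Drop:
		// Assumption: DROP removes the attribute without leaving a reference.
		delete(attrs, key)
	default:
		return fmt.Errorf("unknown action %q", action)
	}
	return nil
}
```

With trace_id set to "abc" and span_id set to "def", calling interpolate on the pattern gs://example-bucket/full-http/request/payloads/${trace_id}/${span_id}.txt yields gs://example-bucket/full-http/request/payloads/abc/def.txt, which applyAction then stores in place of the original value when the action is REPLACE_WITH_REFERENCE.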
Telemetry data types supported
Traces
Is this a vendor-specific component?
Code Owner(s)
braydonk, michaelsafyan, dashpole
Sponsor (optional)
dashpole
Additional context
No response