Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/stanza] Support to Customize bufio.SplitFunc #14593

Closed
atingchen opened this issue Sep 29, 2022 · 11 comments
Closed

[pkg/stanza] Support to Customize bufio.SplitFunc #14593

atingchen opened this issue Sep 29, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request pkg/stanza priority:p2 Medium

Comments

@atingchen
Copy link
Contributor

Is your feature request related to a problem? Please describe.

Now fileexport support to marshal telemetry data in proto format and compress data.When using proto format or any kind of encoding, encoded telemetry data is no longer written to file line by line. The size of the encoded telemetry data will be written before writing the data to the file

When we need read the telemetry data back in, we read the size, then read the bytes into a separate buffer, then parse from that buffer.

Currently fileconsume.Manager only supports splitting log entries by newlines or regex patterns and reading logs from a file.

Can pkg/stanza provide a way to customize bufio.SplitFunc ?

Describe the solution you'd like

helper.SplitterConfig add the function to customize bufio.SplitFunc

func (c *SplitterConfig) CreateCustomizedSplitter(splitter bufio.SplitFunc) {
	c.CustomizedSplitter = splitter
}

fileconsumer.Config add a function to call CreateCustomizedSplitter

func (c *Config) SetSplitter(splitter bufio.SplitFunc) {
	c.Splitter.CreateCustomizedSplitter(splitter)
}

Describe alternatives you've considered

No response

Additional context

No response

@atingchen atingchen added enhancement New feature or request needs triage New item requiring triage labels Sep 29, 2022
@evan-bradley evan-bradley added priority:p2 Medium pkg/stanza and removed needs triage New item requiring triage labels Sep 29, 2022
@github-actions
Copy link
Contributor

Pinging code owners: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Copy link
Member

I think something like this makes sense, though I would want to look at a PR before being sure.

The part I am struggling with is that the fileconsumer embeds SplitterConfig which exposes several settings to users that will be ignored when using a custom splitfunc. This is not ideal but could probably be handled through config validation and documentation. It would be a lot to untangle otherwise, but I'd be happy to look at a PR for that too if you have ideas there.

@atingchen
Copy link
Contributor Author

I think something like this makes sense, though I would want to look at a PR before being sure.

The part I am struggling with is that the fileconsumer embeds SplitterConfig which exposes several settings to users that will be ignored when using a custom splitfunc. This is not ideal but could probably be handled through config validation and documentation. It would be a lot to untangle otherwise, but I'd be happy to look at a PR for that too if you have ideas there.

I'd like to do it.

@djaglowski
Copy link
Member

Thanks again for all the work on this @atingchen!

@atingchen
Copy link
Contributor Author

Thanks again for all the work on this @atingchen!

My pleasure😀

@h0cheung
Copy link
Contributor

Thanks a lot. But how can we use this in an existing receiver like filelogreceiver with minimal code change? Is further development needed?

@djaglowski
Copy link
Member

Thanks a lot. But how can we use this in an existing receiver like filelogreceiver with minimal code change? Is further development needed?

I don't see this feature as "user facing", since it requires code. It's really meant for component developers, to allow custom splitting logic in other components.

Can you describe the case you are trying to solve?

@h0cheung
Copy link
Contributor

We are trying to find a way to split logs, which is able to handle both multi-line logs and different formats.
For example:

2023-01-18 19:19:45.134 [ERROR] [some service] [some package] some message
Traceback
  ...
  ...
xxx.xxx.xxx.xxx - - [18/Jan/2023:19:19:47 +0800] "GET /xxx HTTP/1.1" 200 xxx "-" "curl/x.xx.x" "yyy.yyy.yyy.yyy"
2023-01-18 19:19:50.134 [INFO] [some service] [some package] some message

If split correctly, the result should be three logs.
We are still discussing whether this should be implemented in the splitFunc or transform/recombine.

@djaglowski
Copy link
Member

@h0cheung, you may be able to take advantage of non-linear data flow in the filelog receiver.

I think you could read each line individually, use the router operator to separate lines based on whether they match the 2023-01-18 19:19:45.134 [ERROR] [some service] [some package] some message format or not, and send the non-matching logs to the recombine operator. Something like:

receivers:
  filelog:
    include: ...
    operators:
      - type: router
        default: recombine
        routes:
        - expr: 'body matches "2023-01-18 19:19:45.134 [ERROR] [some service] [some package] some message"' # need to extract a proper regex here
          output: regex_parser
      - type: recombine
         ...
        output: noop # skip to last operator
      - type: regex_parser
        regex: ...
      - type: noop # See: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/noop.md#why-is-this-necessary

@h0cheung
Copy link
Contributor

@h0cheung, you may be able to take advantage of non-linear data flow in the filelog receiver.

I think you could read each line individually, use the router operator to separate lines based on whether they match the 2023-01-18 19:19:45.134 [ERROR] [some service] [some package] some message format or not, and send the non-matching logs to the recombine operator. Something like:

receivers:
  filelog:
    include: ...
    operators:
      - type: router
        default: recombine
        routes:
        - expr: 'body matches "2023-01-18 19:19:45.134 [ERROR] [some service] [some package] some message"' # need to extract a proper regex here
          output: regex_parser
      - type: recombine
         ...
        output: noop # skip to last operator
      - type: regex_parser
        regex: ...
      - type: noop # See: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/noop.md#why-is-this-necessary

Sorry. I didn't make it clear. The result should be like this:

1st entry:

2023-01-18 19:19:45.134 [ERROR] [some service] [some package] some message
Traceback
  ...
  ...

2nd entry:

xxx.xxx.xxx.xxx - - [18/Jan/2023:19:19:47 +0800] "GET /xxx HTTP/1.1" 200 xxx "-" "curl/x.xx.x" "yyy.yyy.yyy.yyy"

3rd entry:

2023-01-18 19:19:50.134 [INFO] [some service] [some package] some message

Anyway, thanks a lot.

@djaglowski
Copy link
Member

Got it. You may be able to use the same strategy with some tweaks:

receivers:
  filelog:
    include: ...
    operators:
      - type: router
        default: recombine
        routes:
        - expr: 'body matches "xxx.xxx.xxx.xxx - - "' # need to extract a proper regex here
          output: regex_parser
      - type: recombine
         is_first_entry: 'body matches "YYYY-MM-DD..."' # need to extract a proper regex here
        output: noop # skip to last operator
      - type: regex_parser
        regex: ...
      - type: noop # See: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/noop.md#why-is-this-necessary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pkg/stanza priority:p2 Medium
Projects
None yet
Development

No branches or pull requests

4 participants