Prototype(symbolization): Add symbolization in Pyroscope read path #3799

Draft
wants to merge 5 commits into
base: main

marcsanmi
Contributor

@marcsanmi marcsanmi commented Dec 20, 2024

Context

This PR introduces a prototype implementation for DWARF symbolization of unsymbolized profiles in the Pyroscope read path. While this approach may introduce additional latency and requires further optimization, it enables us to gather performance statistics by running it alongside the eBPF collector in development environments.

All development has been done with V2 architecture support in mind.

Key features

  • Remote debug info fetching via debuginfod
  • DWARF debug info parsing and address resolution
  • Support for inline function resolution
  • Symbolization in the read path
  • Dedicated bucket storage for caching debug files with configurable expiration
  • Configurable cache to store downloaded debug info

Configuration Example

query_backend:
  symbolizer:
    debuginfod_url: "https://debuginfod.elfutils.org"
    storage:
      backend: s3
      s3:
        bucket_name: debug-symbols-bucket
        # other specific config
    cache:
      enabled: false  # disabled by default
      max_age: 168h  # 1 week default

Missing points

  • Caching layer for debug files
  • Prometheus metrics to get some statistics

Collaborator

@korniltsev korniltsev left a comment

It would be nice to have some benchmarks of symbolizing different numbers of locations and different file sizes. I think they can help us pick the right place and architecture for using this.


// Save the debuginfo to a temporary file
tempDir := os.TempDir()
filePath := filepath.Join(tempDir, fmt.Sprintf("%s.elf", buildID))
Collaborator

nit: this may cause a path traversal, allowing somebody to specify buildID as ../../.../. We should sanitize the user-provided data in buildID.
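A minimal sketch of one way to sanitize the value, assuming debuginfod-style build IDs that are plain hex strings. The helper names (`sanitizeBuildID`, `debugFilePath`) and the 128-character bound are hypothetical and not part of this PR:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errInvalidBuildID is returned when a build ID is not a plain hex string.
var errInvalidBuildID = errors.New("invalid build ID")

// sanitizeBuildID accepts only non-empty hex strings of bounded length.
// Assumption: build IDs are hex-encoded; the 128-character limit is arbitrary.
func sanitizeBuildID(buildID string) (string, error) {
	if len(buildID) == 0 || len(buildID) > 128 {
		return "", errInvalidBuildID
	}
	for _, c := range buildID {
		isHex := (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
		if !isHex {
			return "", errInvalidBuildID
		}
	}
	return buildID, nil
}

// debugFilePath builds the temporary path only from a validated build ID,
// so separators and ".." can never reach filepath.Join.
func debugFilePath(buildID string) (string, error) {
	id, err := sanitizeBuildID(buildID)
	if err != nil {
		return "", fmt.Errorf("%w: %q", err, buildID)
	}
	return filepath.Join(os.TempDir(), id+".elf"), nil
}

func main() {
	for _, id := range []string{"fbce2598b34f1cf8d0c899f34c2218864e1da6d1", "../../etc/passwd"} {
		p, err := debugFilePath(id)
		fmt.Println(p, err)
	}
}
```

Rejecting anything that is not hex is stricter than stripping separators, and it also covers inputs such as absolute paths or empty strings.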

Member

I'd also add that, generally speaking, we should avoid file operations as much as possible. If it's possible to do this in memory, I'd do it in memory. If the file is too large to fit in memory, I'd see if we could stream it.
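A minimal sketch of the in-memory variant using only the Go standard library (`debug/elf` accepts any `io.ReaderAt`). The function names and the size limit are assumptions for illustration, not the PR's code:

```go
package main

import (
	"bytes"
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"io"
	"strings"
)

// maxDebugInfoSize caps how much is buffered in memory (assumed value).
const maxDebugInfoSize = 512 << 20

// parseDWARF parses an ELF image held in memory; no temporary file is created.
func parseDWARF(data []byte) (*dwarf.Data, error) {
	f, err := elf.NewFile(bytes.NewReader(data))
	if err != nil {
		return nil, fmt.Errorf("parse ELF: %w", err)
	}
	return f.DWARF()
}

// loadDWARF buffers at most limit bytes from r and parses them.
func loadDWARF(r io.Reader, limit int64) (*dwarf.Data, error) {
	// Read one byte past the limit so oversized inputs can be detected.
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, fmt.Errorf("read debug info: %w", err)
	}
	if int64(len(data)) > limit {
		return nil, fmt.Errorf("debug info exceeds %d bytes", limit)
	}
	return parseDWARF(data)
}

func main() {
	// A non-ELF payload is rejected without touching the filesystem.
	_, err := loadDWARF(strings.NewReader("not an ELF file"), maxDebugInfoSize)
	fmt.Println(err)
}
```

For files too large to buffer, the same `elf.NewFile` call works with any `io.ReaderAt`, so a range-reading object storage client could stand in for the byte slice.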

@marcsanmi marcsanmi force-pushed the marcsanmi/symbolization-poc branch from efdde88 to 6b009d3 Compare January 16, 2025 12:15
if r.Symbolizer == nil {
return false
}
if len(loc.Line) == 0 {
Collaborator

I guess it's time to move this logic (unsymbolized fallback)

gl.Line = append(gl.Line, &googleProfile.Line{
somewhere here to Symbols/resolver

Collaborator

@kolesnikovae kolesnikovae left a comment

Good work, Marc! I'm excited to see some experimental results 🚀

I think we can implement a slightly more optimized version for production use:

sequenceDiagram
    autonumber

    participant QF as Query Frontend
    participant M  as Metastore
    participant QB as Query Backend
    participant SYM as Symbolizer

    QF ->>+M: Query Metadata
    Note left of M: Build identifiers are returned<br> along with the metadata records
    M ->>-QF: 

    par
        QF ->>+SYM: Request for symbolication
        Note left of SYM: Prepare symbols for<br>the objects requested
    and
        QF ->>+QB: Data retrieval and aggregation
        Note left of QB: The main data path<br>Might be serverless
    end

    QB ->>-QF: Data in pprof format
    Note over QF: Because of the truncation,<br> only a limited set of locations<br>make it here (16K by default) 

    QF --)SYM: Location addresses
    
    SYM ->>-QF: Symbols
    
    QF ->>QF: Flame graph rendering

Even without a parallel pipeline and dedicated symbolication service, we could implement something like this:

sequenceDiagram
    autonumber

    participant QF as Query Frontend
    participant M  as Metastore
    participant QB as Query Backend
    participant SYM as Symbols

    QF ->>+M: Query Metadata
    Note left of M: No build identifiers are returned
    M ->>-QF: 

    QF ->>+QB: Data retrieval and aggregation
    Note left of QB: The main data path<br>Might be serverless

    QB ->>-QF: Data in pprof format
    Note over QF: Because of the truncation,<br> only a limited set of locations<br>make it here (16K by default)

    QF ->>+SYM: Fetch symbols
    SYM ->>-QF: Symbols
    Note over QF: In terms of the added latency,<br>this approach is not worse than<br>block level symbolication
    
    QF ->>QF: Flame graph rendering

I think we should avoid symbolization at the block level if the symbols are not already present in the block itself. Otherwise, this approach leads to excessive processing, increased latency, and higher resource usage. Please keep in mind that a query may span many thousands of blocks.

I won't delve too deeply into how we fetch and process ELF/DWARF files, but I strongly doubt we can bypass the need for an intermediate representation optimized for our access patterns. Additionally, we need a solution to prevent concurrent access to the debuginfod service.
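One possible reading of the last point is that concurrent requests for the same build ID should collapse into a single debuginfod fetch. A minimal, standard-library-only sketch of that idea follows; `fetchGroup` and its methods are hypothetical names, and the fetch callback stands in for the real HTTP client:

```go
package main

import (
	"fmt"
	"sync"
)

// call tracks one in-flight fetch and its result.
type call struct {
	done chan struct{}
	data []byte
	err  error
}

// fetchGroup deduplicates concurrent fetches for the same build ID.
type fetchGroup struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func newFetchGroup() *fetchGroup {
	return &fetchGroup{inflight: make(map[string]*call)}
}

// Do runs fetch at most once per build ID at a time; concurrent callers
// for the same ID wait for the first call and share its result.
func (g *fetchGroup) Do(buildID string, fetch func() ([]byte, error)) ([]byte, error) {
	g.mu.Lock()
	if c, ok := g.inflight[buildID]; ok {
		g.mu.Unlock()
		<-c.done // another goroutine is already fetching this build ID
		return c.data, c.err
	}
	c := &call{done: make(chan struct{})}
	g.inflight[buildID] = c
	g.mu.Unlock()

	defer func() {
		// Forget the call first, then release the waiters.
		g.mu.Lock()
		delete(g.inflight, buildID)
		g.mu.Unlock()
		close(c.done)
	}()
	c.data, c.err = fetch()
	return c.data, c.err
}

func main() {
	g := newFetchGroup()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			data, err := g.Do("fbce2598b34f1cf8d0c899f34c2218864e1da6d1", func() ([]byte, error) {
				return []byte("debug info"), nil // placeholder for the HTTP fetch
			})
			fmt.Println(len(data), err)
		}()
	}
	wg.Wait()
}
```

This only deduplicates work in a single process; limiting load on debuginfod across replicas would need a shared cache or a dedicated service, as in the first diagram.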

@@ -19,6 +21,13 @@ func buildTree(
appender *SampleAppender,
maxNodes int64,
) (*model.Tree, error) {
// Try debuginfod symbolization first
if symbols != nil && symbols.Symbolizer != nil {
if err := symbolizeLocations(ctx, symbols); err != nil {
Collaborator

Note that symbols here include all locations of the partition. The profile we want to symbolize typically includes only a subset of them.

What we should also leverage is truncation. After pruning insignificant nodes, only a limited number of locations are preserved (this could reduce the number from millions to thousands). This is also helpful because some of the mappings will be excluded entirely.

Comment on lines +260 to +261
// Find all locations needing symbolization
for i, loc := range symbols.Locations {
Collaborator

Just an observation (I'm not suggesting implementing it here): it is much more performant and simpler to sort locations by mapping ID and then split the slice into groups with a single pass. There should be no mappings with distinct IDs and matching BuildID.
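The sort-then-split idea can be sketched as follows. The `location` type is a simplified, hypothetical stand-in for the symdb record, and the function sorts an index slice rather than the locations themselves so the input is left untouched:

```go
package main

import (
	"fmt"
	"sort"
)

// location is a hypothetical, simplified stand-in for a symdb location.
type location struct {
	MappingID uint32
	Address   uint64
}

// groupByMapping returns groups of location indexes, one group per mapping ID.
// It sorts an index slice by mapping ID and then splits it in a single pass.
func groupByMapping(locs []location) [][]int {
	idx := make([]int, len(locs))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return locs[idx[a]].MappingID < locs[idx[b]].MappingID
	})

	var groups [][]int
	start := 0
	for i := 1; i <= len(idx); i++ {
		// Close the current group at the end or when the mapping ID changes.
		if i == len(idx) || locs[idx[i]].MappingID != locs[idx[start]].MappingID {
			groups = append(groups, idx[start:i])
			start = i
		}
	}
	return groups
}

func main() {
	locs := []location{{2, 0x10}, {1, 0x20}, {2, 0x30}, {1, 0x40}}
	for _, g := range groupByMapping(locs) {
		fmt.Println(locs[g[0]].MappingID, g)
	}
}
```

Each group then maps to exactly one build ID, so the debug file is looked up once per group instead of once per location.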

locIdx := locs[i].idx

// Clear the existing lines for the location
symbols.Locations[locIdx].Line = nil
Collaborator

Symbols are read-only (they are shared when multiple queries access the same dataset simultaneously).
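One way to respect that constraint is to keep symbolization results in a per-query overlay and leave the shared slice untouched. A minimal sketch with hypothetical types (`queryOverlay`, `location`, `line`), not the symdb API:

```go
package main

import "fmt"

// line and location are hypothetical, simplified stand-ins for the symdb records.
type line struct {
	Function string
	Line     int64
}

type location struct {
	Address uint64
	Lines   []line
}

// queryOverlay holds per-query symbolization results.
// The shared locations slice is only ever read.
type queryOverlay struct {
	shared   []location
	resolved map[int][]line
}

func newQueryOverlay(shared []location) *queryOverlay {
	return &queryOverlay{shared: shared, resolved: make(map[int][]line)}
}

// SetLines records resolved lines for a location index in the overlay only.
func (o *queryOverlay) SetLines(idx int, lines []line) {
	o.resolved[idx] = lines
}

// Lines prefers the overlay and falls back to the shared data.
func (o *queryOverlay) Lines(idx int) []line {
	if l, ok := o.resolved[idx]; ok {
		return l
	}
	return o.shared[idx].Lines
}

func main() {
	shared := []location{{Address: 0x10}}
	o := newQueryOverlay(shared)
	o.SetLines(0, []line{{Function: "main.run", Line: 42}})
	fmt.Println(o.Lines(0), len(shared[0].Lines))
}
```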


// Find all locations needing symbolization
for i, loc := range symbols.Locations {
if mapping := &symbols.Mappings[loc.MappingId]; symbols.needsDebuginfodSymbolization(&loc, mapping) {
Collaborator

I believe we should check it mapping-wide. If a mapping has symbols, all locations referring to it have them, and vice versa. If the mapping flags are not set correctly, we should fix that.
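A minimal sketch of the mapping-wide check, assuming a per-mapping flag such as `HasFunctions` indicates that symbols are present. The types and function names are hypothetical simplifications:

```go
package main

import "fmt"

// mapping and location are hypothetical, simplified stand-ins for the symdb records.
type mapping struct {
	BuildID      string
	HasFunctions bool // assumed flag: true when the mapping already carries symbols
}

type location struct {
	MappingID uint32
	Address   uint64
}

// unsymbolizedMappings decides once per mapping whether remote symbolization is needed.
func unsymbolizedMappings(mappings []mapping) []bool {
	need := make([]bool, len(mappings))
	for i, m := range mappings {
		need[i] = !m.HasFunctions && m.BuildID != ""
	}
	return need
}

// addressesToResolve collects addresses per build ID, consulting only the
// per-mapping decision and never inspecting individual locations.
func addressesToResolve(mappings []mapping, locs []location) map[string][]uint64 {
	need := unsymbolizedMappings(mappings)
	out := make(map[string][]uint64)
	for _, loc := range locs {
		if int(loc.MappingID) < len(need) && need[loc.MappingID] {
			id := mappings[loc.MappingID].BuildID
			out[id] = append(out[id], loc.Address)
		}
	}
	return out
}

func main() {
	mappings := []mapping{{BuildID: "aaa", HasFunctions: true}, {BuildID: "bbb"}}
	locs := []location{{0, 0x1}, {1, 0x2}, {1, 0x3}}
	fmt.Println(addressesToResolve(mappings, locs))
}
```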

Comment on lines 65 to 85
var sym *symbolizer.Symbolizer
if config.DebuginfodURL != "" {
sym = symbolizer.NewSymbolizer(
symbolizer.NewDebuginfodClient(config.DebuginfodURL),
)
}

q := QueryBackend{
config: config,
logger: logger,
reg: reg,
backendClient: backendClient,
blockReader: blockReader,
symbolizer: sym,
}

// Pass symbolizer to BlockReader if it's the right type
if br, ok := blockReader.(*BlockReader); ok {
br.symbolizer = sym
}

Collaborator

If we are going to merge it at some point, I'd suggest implementing dependency injection differently.

  1. symbolizer.Symbolizer does not belong to the queryContext.
  2. symbolizer.Symbolizer should be passed to querybackend.NewBlockReader in initQueryBackend at initialization. querybackend.NewBlockReader should accept symbolizer.Symbolizer as an argument (might be an interface).
  3. symbolizer.Symbolizer should be injected into symbols via *symdb.Reader here:
for _, ds := range md.Datasets {
	dataset := block.NewDataset(ds, object)
	dataset.Symbols().SetSymbolizer(b.symbolizer)
	qcs = append(qcs, newQueryContext(ctx, b.log, r, agg, dataset))
}

https://github.com/grafana/pyroscope/blob/main/pkg/phlaredb/symdb/block_reader.go#L419-L427:

func (p *partition) Symbols() *Symbols {
	return &Symbols{
		Stacktraces: p,
		Locations:   p.locations.slice(),
		Mappings:    p.mappings.slice(),
		Functions:   p.functions.slice(),
		Strings:     p.strings.slice(),
	}
}

*partition has access to *symdb.Reader that should have Symbolizer (interface defined in the symdb package as it is the main/only consumer).
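The shape of that wiring can be sketched as below. All types are hypothetical, heavily reduced stand-ins: the real `symdb.Reader`, `partition`, and `Symbols` have more fields, and the `Symbolize` method signature is an assumption made for illustration:

```go
package main

import "fmt"

// Symbolizer is a hypothetical interface declared next to its consumer.
type Symbolizer interface {
	// Symbolize resolves addresses of the object identified by buildID.
	Symbolize(buildID string, addrs []uint64) ([]string, error)
}

// Reader stands in for *symdb.Reader and receives the dependency at construction.
type Reader struct{ symbolizer Symbolizer }

func NewReader(s Symbolizer) *Reader { return &Reader{symbolizer: s} }

// partition reaches the symbolizer through its reader.
type partition struct{ reader *Reader }

// Symbols stands in for the read-side view handed to queries.
type Symbols struct{ Symbolizer Symbolizer }

func (p *partition) Symbols() *Symbols {
	return &Symbols{Symbolizer: p.reader.symbolizer}
}

// staticSymbolizer is a test double that fabricates names from the inputs.
type staticSymbolizer struct{}

func (staticSymbolizer) Symbolize(buildID string, addrs []uint64) ([]string, error) {
	names := make([]string, len(addrs))
	for i, a := range addrs {
		names[i] = fmt.Sprintf("%s+0x%x", buildID, a)
	}
	return names, nil
}

func main() {
	p := &partition{reader: NewReader(staticSymbolizer{})}
	names, err := p.Symbols().Symbolizer.Symbolize("abc", []uint64{0x10})
	fmt.Println(names, err)
}
```

With the interface owned by the consumer, the query backend needs neither a type assertion on the block reader nor a setter on the symbols.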

@korniltsev
Collaborator

I have not looked into the code yet, but I've tried running it locally, and it looks like it's trying to load a lot of unnecessary debug files.

I ran the eBPF profiler with no on-target symbolization and also ran a simple python -m http.server to mock debuginfod responses.

I then queried only one executable: process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="unknown", process_executable_path="/home/korniltsev/.cache/JetBrains/IntelliJIdea2024.2/tmp/GoLand/___go_build_go_opentelemetry_io_ebpf_profiler"}

I see 268 GET requests, with 13 requests to "GET /buildid/fbce2598b34f1cf8d0c899f34c2218864e1da6d1/debuginfo HTTP/1.1" 200 - (which is the profiler binary I put into the mock server for testing) and a bunch of 404s, which I assume are for build IDs of files in other processes that the query does not target.

Other than that, it works \M/ Can't wait to run it in dev.
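For repeating this kind of check in a Go test rather than with python -m http.server, a mock along the following lines could count requests per path. It is a sketch under assumptions: `mockDebuginfod` and `get` are hypothetical names, and only the /buildid/<id>/debuginfo path seen in the log above is handled:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// mockDebuginfod serves debug files from memory and counts requests per path.
type mockDebuginfod struct {
	mu    sync.Mutex
	files map[string][]byte // build ID -> debug file contents
	hits  map[string]int    // request path -> number of requests
}

func newMockDebuginfod(files map[string][]byte) *mockDebuginfod {
	return &mockDebuginfod{files: files, hits: make(map[string]int)}
}

func (m *mockDebuginfod) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	p := r.URL.Path
	m.mu.Lock()
	m.hits[p]++
	m.mu.Unlock()

	if !strings.HasPrefix(p, "/buildid/") || !strings.HasSuffix(p, "/debuginfo") {
		http.NotFound(w, r)
		return
	}
	id := strings.TrimSuffix(strings.TrimPrefix(p, "/buildid/"), "/debuginfo")
	data, ok := m.files[id]
	if !ok {
		http.NotFound(w, r) // unknown build ID
		return
	}
	w.Write(data)
}

// get issues an in-process GET against the handler and returns the status code.
func get(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	m := newMockDebuginfod(map[string][]byte{
		"fbce2598b34f1cf8d0c899f34c2218864e1da6d1": []byte("fake debug file"),
	})
	fmt.Println(get(m, "/buildid/fbce2598b34f1cf8d0c899f34c2218864e1da6d1/debuginfo"))
	fmt.Println(get(m, "/buildid/0000/debuginfo"))
	fmt.Println(m.hits)
}
```

Asserting on the hit counts would make both the repeated fetches of the same build ID and the requests for untargeted build IDs visible as test failures.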


@marcsanmi marcsanmi changed the title POC feat(symbolization): Add DWARF symbolization with debuginfod support Prototype(symbolization): Add symbolization for unsymbolized profiles in Pyroscope read path Jan 19, 2025
@marcsanmi marcsanmi changed the title Prototype(symbolization): Add symbolization for unsymbolized profiles in Pyroscope read path Prototype(symbolization): Add symbolization in Pyroscope read path Jan 19, 2025
4 participants