Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(blooms)!: Index structured metadata into blooms #14061

Merged
merged 7 commits into from
Sep 10, 2024

Conversation

chaudum
Copy link
Contributor

@chaudum chaudum commented Sep 5, 2024

What this PR does / why we need it:

Instead of indexing ngrams of the log line content, we index the plain values of the structured metadata keys and values.

Resulting tokens are:

  • name
  • chunkPrefix + name
  • value
  • chunkPrefix + value
  • name + '=' value
  • chunkPrefix + name + '=' + value

Indexed fields (metadata name) are also extracted into the series metadata.

⚠️ Indexed metadata values cannot be queried with the bloom gateways yet.

Special notes for your reviewer:

This PR does not cleanup unused code used for ngram-tokenization. This is scope for a follow up.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

}
for _, kv := range entry.StructuredMetadata {
stats.SourceBytes += len(kv.Name) + len(kv.Value)
stats.Fields.Add(Field(kv.Name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't think stats is the place to put the fields. Can we return the added fields instead? Also, if needed, we can add Fields to BloomCreation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should not be called stats. It's metadata from the indexing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Stats" also threw me off, since it makes me think stats are for observability and not for business logic.

IndexingInfo would be a little more clear to me, personally.

Copy link
Contributor

@salvacorts salvacorts Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about renaming this to metadata or IndexingInfo as Robert suggests for now?
Alternatively I'd separate

  • Stats: for things that ww use for o11y
  • Metadata/IndexingInfo: for things that are relevant to the bussiness logic later on such as the indexed fields list. We may also add IndexedFields to BloomCreation right away.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I used a separate struct instead of adding the field IndexedFields to the BloomCreation is that I can use the Merge function to aggregate multiple population operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer IndexingInfo over Metadata, since the latter is associated with Structured Metadata as well

func (t *StructuredMetadataTokenizer) Tokens(kv push.LabelAdapter) iter.Iterator[string] {
combined := fmt.Sprintf("%s=%s", kv.Name, kv.Value)
t.buf = append(t.buf[:0],
kv.Name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we adding the name and the value separately?

I think we can achieve the same with combined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would start indexing name and value for now as well. Once we see how to implement the read path, we can remove the unnecessary tokens.

@chaudum chaudum force-pushed the chaudum/structured-metadata-tokenizer branch 2 times, most recently from 7b68d6e to 9f9acdc Compare September 9, 2024 09:43
Comment on lines +58 to +69
// q.Reset()

// fmt.Println("-----------------------------")

// count = 0
// for q.Next() {
// swb := q.At()
// series := swb.Series
// fmt.Printf("%s (%3d) %v\n", series.Fingerprint, series.Chunks.Len(), swb.Meta.Fields.Items())
// count++
// }
// fmt.Printf("Stream count: %4d\n", count)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO

Currently we can either iterate over SeriesWithMeta via the LazySeriesIter using the BlockQuerier or over SeriesWithBlooms using the BlockQuerierIter. The former lets us inpect the indexed fields, the latter the individual pages and the blooms themselves).

Ideally, can only use a single SeriesWithMetaAndBlooms iterable that exposes both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaudum chaudum marked this pull request as ready for review September 9, 2024 09:56
@chaudum chaudum requested a review from a team as a code owner September 9, 2024 09:56
@chaudum chaudum requested a review from salvacorts September 9, 2024 10:02
Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few questions

pkg/storage/bloom/v1/tokenizer.go Outdated Show resolved Hide resolved
iter "github.com/grafana/loki/v3/pkg/iter/v2"
)

const (
MaxRuneLen = 4
)

type StructuredMetadataTokenizer struct {
prefix string
buf []string
Copy link
Member

@rfratto rfratto Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Personally I think buf tends to imply []byte in Go, so I was thrown off by how the tokenizer worked (I thought it was building one massive string). I'd recommend calling this strings or stringsBuf instead.

pkg/storage/bloom/v1/bloom_tokenizer.go Outdated Show resolved Hide resolved
cachedInserts++
continue
}

// maxBloomSize is in bytes, but blooms operate at the bit level; adjust
var collision bool
collision, full = bloom.ScalableBloomFilter.TestAndAddWithMaxSize(tok, bt.maxBloomSize*eightBits)
collision, full = bloom.ScalableBloomFilter.TestAndAddWithMaxSize([]byte(tok), bt.maxBloomSize*eightBits)

if full {
// edge case: one line maxed out the bloom size -- retrying is futile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(unrelated to the PR, but just a general question)

Would this edge case cause false negatives? Can the bloom gateway know whether something was skipped from being indexed?

(Either way I think it's worth expanding the comment here for why this is safe or why this leads to issues)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this edge case cause false negatives? Can the bloom gateway know whether something was skipped from being indexed?

No, we cannot know that on the read path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems potentially risky then, unless I'm misunderstanding it. Does the break outer skip creating blooms for the chunk?

I initially interpreted this as "the rest of the chunk still gets indexed, but lines which are too long don't" which is why I asked about false negatives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the check and made sure that at all metadata of at least one entry per chunk is indexed, see d483c31

Copy link
Contributor

@salvacorts salvacorts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments. Approving to unblock.

_ = entryIter.Next()
}

break outer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the outer tag is not needed anymore

if len(bt.cache) >= cacheSize { // While crude, this has proven efficient in performance testing. This speaks to the similarity in log lines near each other
clear(bt.cache)
}
}
}

// Only break out of the loop if the bloom filter is full after indexing all structured metadata of an entry.
if full {
// edge case: one line maxed out the bloom size -- retrying is futile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment and if statement relevant now?

IIUC, we will add at least one line to the bloom with all it's structured metadata. So the break when full would be sufficient here, no need to skip the next line (I think that's actually buggy).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole peeking does not make sense any more tbh.

pkg/storage/bloom/v1/builder.go Show resolved Hide resolved
iter "github.com/grafana/loki/v3/pkg/iter/v2"
)

const (
MaxRuneLen = 4
)

type StructuredMetadataTokenizer struct {
prefix string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd add what the prefix typically is here:

Suggested change
prefix string
// prefix to add to the tokens: typically the chunkref
prefix string

Copy link
Contributor

@salvacorts salvacorts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % final comment

Comment on lines +239 to +244
// Only break out of the loop if the bloom filter is full after indexing all structured metadata of an entry.
if full {
break
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found where my confusion is w/r/t breaking out of the loop here. I see now that there's a comment at the call site explaining that you need to call addChunkToBloom multiple times with new blooms until the iterator is fully consumed.

I think it would help if that comment was moved or duplicated onto the doc comment for addChunkToBloom, something like:

addChunkToBloom returns true if the bloom has been completely filled, and may not have consumed the entire iterator. addChunkToBloom must be called multiple times until returning false with new blooms until the iterator has been fully consumed.

Signed-off-by: Christian Haudum <[email protected]>
Signed-off-by: Christian Haudum <[email protected]>
Signed-off-by: Christian Haudum <[email protected]>
this should clarify the usage of the datastructure for business logic
rather than metrics

Signed-off-by: Christian Haudum <[email protected]>
Signed-off-by: Christian Haudum <[email protected]>
@chaudum chaudum force-pushed the chaudum/structured-metadata-tokenizer branch from d97ff5b to 4c2bebb Compare September 10, 2024 15:03
@chaudum chaudum merged commit a2fbaa8 into main Sep 10, 2024
61 checks passed
@chaudum chaudum deleted the chaudum/structured-metadata-tokenizer branch September 10, 2024 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants