feat(blooms)!: Index structured metadata into blooms #14061
Conversation
}
for _, kv := range entry.StructuredMetadata {
	stats.SourceBytes += len(kv.Name) + len(kv.Value)
	stats.Fields.Add(Field(kv.Name))
Nit: I don't think stats is the place to put the fields. Can we return the added fields instead? Also, if needed, we can add Fields to BloomCreation?
Maybe it should not be called stats. It's metadata from the indexing.
"Stats" also threw me off, since it makes me think stats are for observability and not for business logic.
IndexingInfo would be a little more clear to me, personally.
What about renaming this to metadata or IndexingInfo, as Robert suggests, for now?
Alternatively I'd separate Stats (for things that we use for o11y) from Metadata/IndexingInfo (for things that are relevant to the business logic later on, such as the list of indexed fields). We may also add IndexedFields to BloomCreation right away.
The reason I used a separate struct instead of adding the field IndexedFields to the BloomCreation is that I can use the Merge function to aggregate multiple population operations.
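As a rough, self-contained sketch of that idea (these are not the actual Loki types; the field names and the Field alias below are assumptions), a dedicated result struct with a Merge method lets callers aggregate several population operations:

```go
package main

import "fmt"

// Field stands in for the indexed-field name type; a plain string alias is assumed here.
type Field string

// IndexingInfo is a hypothetical stand-in for the struct discussed above: it carries
// data needed by later business logic (e.g. the indexed fields) rather than o11y stats.
type IndexingInfo struct {
	SourceBytes int
	Fields      map[Field]struct{}
}

// Merge aggregates the results of multiple population operations.
func (i IndexingInfo) Merge(other IndexingInfo) IndexingInfo {
	if i.Fields == nil {
		i.Fields = make(map[Field]struct{})
	}
	i.SourceBytes += other.SourceBytes
	for f := range other.Fields {
		i.Fields[f] = struct{}{}
	}
	return i
}

func main() {
	a := IndexingInfo{SourceBytes: 10, Fields: map[Field]struct{}{"traceID": {}}}
	b := IndexingInfo{SourceBytes: 20, Fields: map[Field]struct{}{"pod": {}}}
	merged := a.Merge(b)
	fmt.Println(merged.SourceBytes, len(merged.Fields)) // 30 2
}
```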
I prefer IndexingInfo over Metadata, since the latter is associated with Structured Metadata as well.
pkg/storage/bloom/v1/tokenizer.go
Outdated
func (t *StructuredMetadataTokenizer) Tokens(kv push.LabelAdapter) iter.Iterator[string] {
	combined := fmt.Sprintf("%s=%s", kv.Name, kv.Value)
	t.buf = append(t.buf[:0],
		kv.Name,
Why are we adding the name and the value separately? I think we can achieve the same with combined.
I would start indexing name and value for now as well. Once we see how to implement the read path, we can remove the unnecessary tokens.
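For illustration, here is a self-contained sketch of the token shapes being discussed (the real Tokens method takes a push.LabelAdapter and returns a Loki iterator; those dependencies are dropped here, and the prefix value is just an assumption):

```go
package main

import "fmt"

// StructuredMetadataTokenizer is a simplified stand-in for the type in
// pkg/storage/bloom/v1/tokenizer.go.
type StructuredMetadataTokenizer struct {
	prefix string   // typically derived from the chunk reference
	buf    []string // reused between calls to avoid per-call allocations
}

// Tokens emits the name, the value, and the combined "name=value" form,
// each with and without the prefix.
func (t *StructuredMetadataTokenizer) Tokens(name, value string) []string {
	combined := fmt.Sprintf("%s=%s", name, value)
	t.buf = append(t.buf[:0],
		name, t.prefix+name,
		value, t.prefix+value,
		combined, t.prefix+combined,
	)
	return t.buf
}

func main() {
	tok := StructuredMetadataTokenizer{prefix: "chk:"}
	fmt.Println(tok.Tokens("traceID", "3146"))
}
```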
// q.Reset()

// fmt.Println("-----------------------------")

// count = 0
// for q.Next() {
// 	swb := q.At()
// 	series := swb.Series
// 	fmt.Printf("%s (%3d) %v\n", series.Fingerprint, series.Chunks.Len(), swb.Meta.Fields.Items())
// 	count++
// }
// fmt.Printf("Stream count: %4d\n", count)
TODO: Currently we can either iterate over SeriesWithMeta via the LazySeriesIter using the BlockQuerier, or over SeriesWithBlooms using the BlockQuerierIter. The former lets us inspect the indexed fields, the latter the individual pages and the blooms themselves. Ideally, we would use a single SeriesWithMetaAndBlooms iterable that exposes both.
There is a WIP branch for implementing this https://github.com/grafana/loki/compare/chaudum/structured-metadata-tokenizer...chaudum/series-with-meta-and-blooms?expand=1
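Not part of this PR, but as a very rough sketch of the combined element such an iterator could yield (the placeholder types and fields below are guesses, not the WIP branch's actual code):

```go
package main

import "fmt"

// Placeholder types standing in for Loki's actual series, metadata and bloom types.
type Series struct{ Fingerprint uint64 }
type Meta struct{ Fields []string }
type Bloom struct{ Data []byte }

// SeriesWithMetaAndBlooms is a hypothetical combined element exposing both the
// indexed-field metadata and the blooms for a series, so a single iterator could
// replace the separate LazySeriesIter / BlockQuerierIter paths.
type SeriesWithMetaAndBlooms struct {
	Series Series
	Meta   Meta
	Blooms []Bloom
}

func main() {
	s := SeriesWithMetaAndBlooms{
		Series: Series{Fingerprint: 42},
		Meta:   Meta{Fields: []string{"traceID", "pod"}},
		Blooms: []Bloom{{Data: []byte{0x01}}},
	}
	fmt.Printf("%d %v (%d blooms)\n", s.Series.Fingerprint, s.Meta.Fields, len(s.Blooms))
}
```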
Looks good, just a few questions
pkg/storage/bloom/v1/tokenizer.go
Outdated
iter "github.com/grafana/loki/v3/pkg/iter/v2" | ||
) | ||
|
||
const ( | ||
MaxRuneLen = 4 | ||
) | ||
|
||
type StructuredMetadataTokenizer struct { | ||
prefix string | ||
buf []string |
nit: Personally I think buf tends to imply []byte in Go, so I was thrown off by how the tokenizer worked (I thought it was building one massive string). I'd recommend calling this strings or stringsBuf instead.
			cachedInserts++
			continue
		}

		// maxBloomSize is in bytes, but blooms operate at the bit level; adjust
		var collision bool
-		collision, full = bloom.ScalableBloomFilter.TestAndAddWithMaxSize(tok, bt.maxBloomSize*eightBits)
+		collision, full = bloom.ScalableBloomFilter.TestAndAddWithMaxSize([]byte(tok), bt.maxBloomSize*eightBits)

		if full {
			// edge case: one line maxed out the bloom size -- retrying is futile
(unrelated to the PR, but just a general question)
Would this edge case cause false negatives? Can the bloom gateway know whether something was skipped from being indexed?
(Either way I think it's worth expanding the comment here for why this is safe or why this leads to issues)
Would this edge case cause false negatives? Can the bloom gateway know whether something was skipped from being indexed?
No, we cannot know that on the read path.
Seems potentially risky then, unless I'm misunderstanding it. Does the break outer skip creating blooms for the chunk?
I initially interpreted this as "the rest of the chunk still gets indexed, but lines which are too long don't" which is why I asked about false negatives.
I removed the check and made sure that all metadata of at least one entry per chunk is indexed, see d483c31
Left some minor comments. Approving to unblock.
		_ = entryIter.Next()
	}

	break outer
The outer label is not needed anymore.
			if len(bt.cache) >= cacheSize { // While crude, this has proven efficient in performance testing. This speaks to the similarity in log lines near each other
				clear(bt.cache)
			}
		}
	}

	// Only break out of the loop if the bloom filter is full after indexing all structured metadata of an entry.
	if full {
		// edge case: one line maxed out the bloom size -- retrying is futile
Are this comment and the if statement still relevant?
IIUC, we will add at least one line to the bloom with all its structured metadata. So the break when full would be sufficient here; there is no need to skip the next line (I think that's actually buggy).
The whole peeking does not make sense any more tbh.
iter "github.com/grafana/loki/v3/pkg/iter/v2" | ||
) | ||
|
||
const ( | ||
MaxRuneLen = 4 | ||
) | ||
|
||
type StructuredMetadataTokenizer struct { | ||
prefix string |
nit: I'd add what the prefix typically is here:
	// prefix to add to the tokens: typically the chunkref
	prefix string
LGTM!
LGTM % final comment
	// Only break out of the loop if the bloom filter is full after indexing all structured metadata of an entry.
	if full {
		break
	}
I think I found where my confusion is w/r/t breaking out of the loop here. I see now that there's a comment at the call site explaining that you need to call addChunkToBloom multiple times with new blooms until the iterator is fully consumed.
I think it would help if that comment was moved or duplicated onto the doc comment for addChunkToBloom, something like:
addChunkToBloom returns true if the bloom has been completely filled and may not have consumed the entire iterator. It must be called repeatedly with new blooms until it returns false, at which point the iterator has been fully consumed.
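To make that contract concrete, here is a hedged, self-contained sketch of the call-site loop (the real addChunkToBloom works on Loki's bloom and iterator types; the map-based bloom and capacity below are stand-ins):

```go
package main

import "fmt"

// addChunkToBloom is a simplified stand-in: it consumes tokens until the bloom
// reaches capacity and reports whether it stopped before the iterator was drained.
func addChunkToBloom(bloom map[string]struct{}, capacity int, tokens *[]string) (full bool) {
	for len(*tokens) > 0 {
		if len(bloom) >= capacity {
			return true // bloom filled up; iterator not fully consumed
		}
		bloom[(*tokens)[0]] = struct{}{}
		*tokens = (*tokens)[1:]
	}
	return false
}

func main() {
	tokens := []string{"a", "b", "c", "d", "e"}
	var blooms []map[string]struct{}
	// Keep calling with fresh blooms until the iterator is fully consumed.
	for {
		b := make(map[string]struct{}, 2)
		full := addChunkToBloom(b, 2, &tokens)
		blooms = append(blooms, b)
		if !full {
			break
		}
	}
	fmt.Printf("created %d blooms\n", len(blooms)) // 3
}
```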
What this PR does / why we need it:
Instead of indexing ngrams of the log line content, we index the plain values of the structured metadata keys and values.
Resulting tokens are:
name
chunkPrefix + name
value
chunkPrefix + value
name + '=' + value
chunkPrefix + name + '=' + value
Indexed fields (metadata name) are also extracted into the series metadata.
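For example (a minimal sketch; the chunk prefix format used here is illustrative, not Loki's exact encoding), the token set for one structured metadata pair looks like this:

```go
package main

import "fmt"

// tokensFor lists the six token forms described above for a single
// structured metadata key/value pair.
func tokensFor(chunkPrefix, name, value string) []string {
	combined := name + "=" + value
	return []string{
		name, chunkPrefix + name,
		value, chunkPrefix + value,
		combined, chunkPrefix + combined,
	}
}

func main() {
	for _, tok := range tokensFor("0000-1111-", "traceID", "3146") {
		fmt.Println(tok)
	}
}
```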
Special notes for your reviewer:
This PR does not clean up unused code used for ngram tokenization. That is in scope for a follow-up.
Checklist
- Reviewed the CONTRIBUTING.md guide (required)
- feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that affect upgrading are documented in docs/sources/setup/upgrade/_index.md
- For Helm chart changes, bump the version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
- If deprecating or deleting configuration options, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR