feat(blooms)!: Index structured metadata into blooms #14061
Conversation
}
for _, kv := range entry.StructuredMetadata {
	stats.SourceBytes += len(kv.Name) + len(kv.Value)
	stats.Fields.Add(Field(kv.Name))
Nit: I don't think stats is the place to put the fields. Can we return the added fields instead? Also, if needed, we can add Fields to BloomCreation?
Maybe it should not be called stats. It's metadata from the indexing.
"Stats" also threw me off, since it makes me think stats are for observability and not for business logic.
IndexingInfo would be a little more clear to me, personally.
What about renaming this to metadata or IndexingInfo, as Robert suggests, for now?
Alternatively I'd separate Stats (for things that we use for o11y) from Metadata/IndexingInfo (for things that are relevant to the business logic later on, such as the list of indexed fields). We may also add IndexedFields to BloomCreation right away.
The reason I used a separate struct instead of adding the field IndexedFields to the BloomCreation is that I can use the Merge function to aggregate multiple population operations.
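As a rough, self-contained sketch of that idea (these are not the actual Loki types; the field names and the Field alias below are assumptions), a dedicated result struct with a Merge method lets callers aggregate several population operations:

```go
package main

import "fmt"

// Field stands in for the indexed-field name type; a plain string alias is assumed here.
type Field string

// IndexingInfo is a hypothetical stand-in for the struct discussed above: it carries
// data needed by later business logic (e.g. the indexed fields) rather than o11y stats.
type IndexingInfo struct {
	SourceBytes int
	Fields      map[Field]struct{}
}

// Merge aggregates the results of multiple population operations.
func (i IndexingInfo) Merge(other IndexingInfo) IndexingInfo {
	if i.Fields == nil {
		i.Fields = make(map[Field]struct{})
	}
	i.SourceBytes += other.SourceBytes
	for f := range other.Fields {
		i.Fields[f] = struct{}{}
	}
	return i
}

func main() {
	a := IndexingInfo{SourceBytes: 10, Fields: map[Field]struct{}{"traceID": {}}}
	b := IndexingInfo{SourceBytes: 20, Fields: map[Field]struct{}{"pod": {}}}
	merged := a.Merge(b)
	fmt.Println(merged.SourceBytes, len(merged.Fields)) // 30 2
}
```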
I prefer IndexingInfo over Metadata, since the latter is associated with Structured Metadata as well.
pkg/storage/bloom/v1/tokenizer.go
Outdated
func (t *StructuredMetadataTokenizer) Tokens(kv push.LabelAdapter) iter.Iterator[string] {
	combined := fmt.Sprintf("%s=%s", kv.Name, kv.Value)
	t.buf = append(t.buf[:0],
		kv.Name,
Why are we adding the name and the value separately? I think we can achieve the same with combined.
I would start indexing name and value for now as well. Once we see how to implement the read path, we can remove the unnecessary tokens.
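For illustration, here is a self-contained sketch of the token shapes being discussed (the real Tokens method takes a push.LabelAdapter and returns a Loki iterator; those dependencies are dropped here, and the prefix value is just an assumption):

```go
package main

import "fmt"

// StructuredMetadataTokenizer is a simplified stand-in for the type in
// pkg/storage/bloom/v1/tokenizer.go.
type StructuredMetadataTokenizer struct {
	prefix string   // typically derived from the chunk reference
	buf    []string // reused between calls to avoid per-call allocations
}

// Tokens emits the name, the value, and the combined "name=value" form,
// each with and without the prefix.
func (t *StructuredMetadataTokenizer) Tokens(name, value string) []string {
	combined := fmt.Sprintf("%s=%s", name, value)
	t.buf = append(t.buf[:0],
		name, t.prefix+name,
		value, t.prefix+value,
		combined, t.prefix+combined,
	)
	return t.buf
}

func main() {
	tok := StructuredMetadataTokenizer{prefix: "chk:"}
	fmt.Println(tok.Tokens("traceID", "3146"))
}
```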
// q.Reset()

// fmt.Println("-----------------------------")

// count = 0
// for q.Next() {
// 	swb := q.At()
// 	series := swb.Series
// 	fmt.Printf("%s (%3d) %v\n", series.Fingerprint, series.Chunks.Len(), swb.Meta.Fields.Items())
// 	count++
// }
// fmt.Printf("Stream count: %4d\n", count)
TODO: Currently we can either iterate over SeriesWithMeta via the LazySeriesIter using the BlockQuerier, or over SeriesWithBlooms using the BlockQuerierIter. The former lets us inspect the indexed fields, the latter the individual pages and the blooms themselves. Ideally, we would use a single SeriesWithMetaAndBlooms iterable that exposes both.
There is a WIP branch for implementing this https://github.com/grafana/loki/compare/chaudum/structured-metadata-tokenizer...chaudum/series-with-meta-and-blooms?expand=1
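Not part of this PR, but as a very rough sketch of the combined element such an iterator could yield (the placeholder types and fields below are guesses, not the WIP branch's actual code):

```go
package main

import "fmt"

// Placeholder types standing in for Loki's actual series, metadata and bloom types.
type Series struct{ Fingerprint uint64 }
type Meta struct{ Fields []string }
type Bloom struct{ Data []byte }

// SeriesWithMetaAndBlooms is a hypothetical combined element exposing both the
// indexed-field metadata and the blooms for a series, so a single iterator could
// replace the separate LazySeriesIter / BlockQuerierIter paths.
type SeriesWithMetaAndBlooms struct {
	Series Series
	Meta   Meta
	Blooms []Bloom
}

func main() {
	s := SeriesWithMetaAndBlooms{
		Series: Series{Fingerprint: 42},
		Meta:   Meta{Fields: []string{"traceID", "pod"}},
		Blooms: []Bloom{{Data: []byte{0x01}}},
	}
	fmt.Printf("%d %v (%d blooms)\n", s.Series.Fingerprint, s.Meta.Fields, len(s.Blooms))
}
```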
Looks good, just a few questions
pkg/storage/bloom/v1/tokenizer.go
Outdated
iter "github.com/grafana/loki/v3/pkg/iter/v2" | ||
) | ||
|
||
const ( | ||
MaxRuneLen = 4 | ||
) | ||
|
||
type StructuredMetadataTokenizer struct { | ||
prefix string | ||
buf []string |
nit: Personally I think buf tends to imply []byte in Go, so I was thrown off by how the tokenizer worked (I thought it was building one massive string). I'd recommend calling this strings or stringsBuf instead.
			cachedInserts++
			continue
		}

		// maxBloomSize is in bytes, but blooms operate at the bit level; adjust
		var collision bool
-		collision, full = bloom.ScalableBloomFilter.TestAndAddWithMaxSize(tok, bt.maxBloomSize*eightBits)
+		collision, full = bloom.ScalableBloomFilter.TestAndAddWithMaxSize([]byte(tok), bt.maxBloomSize*eightBits)

		if full {
			// edge case: one line maxed out the bloom size -- retrying is futile
(unrelated to the PR, but just a general question)
Would this edge case cause false negatives? Can the bloom gateway know whether something was skipped from being indexed?
(Either way I think it's worth expanding the comment here for why this is safe or why this leads to issues)
Would this edge case cause false negatives? Can the bloom gateway know whether something was skipped from being indexed?
No, we cannot know that on the read path.
Seems potentially risky then, unless I'm misunderstanding it. Does the break outer skip creating blooms for the chunk?
I initially interpreted this as "the rest of the chunk still gets indexed, but lines which are too long don't" which is why I asked about false negatives.
I removed the check and made sure that all metadata of at least one entry per chunk is indexed, see d483c31
Left some minor comments. Approving to unblock.
		_ = entryIter.Next()
	}

	break outer
The outer label is not needed anymore.
			if len(bt.cache) >= cacheSize { // While crude, this has proven efficient in performance testing. This speaks to the similarity in log lines near each other
				clear(bt.cache)
			}
		}
	}

	// Only break out of the loop if the bloom filter is full after indexing all structured metadata of an entry.
	if full {
		// edge case: one line maxed out the bloom size -- retrying is futile
Are this comment and the if statement still relevant?
IIUC, we will add at least one line to the bloom with all its structured metadata. So the break when full would be sufficient here; there is no need to skip the next line (I think that's actually buggy).
The whole peeking does not make sense any more tbh.
iter "github.com/grafana/loki/v3/pkg/iter/v2" | ||
) | ||
|
||
const ( | ||
MaxRuneLen = 4 | ||
) | ||
|
||
type StructuredMetadataTokenizer struct { | ||
prefix string |
nit: I'd add what the prefix typically is here:
	// prefix to add to the tokens: typically the chunkref
	prefix string
LGTM!
LGTM % final comment
	// Only break out of the loop if the bloom filter is full after indexing all structured metadata of an entry.
	if full {
		break
	}
I think I found where my confusion is w/r/t breaking out of the loop here. I see now that there's a comment at the call site explaining that you need to call addChunkToBloom multiple times with new blooms until the iterator is fully consumed.
I think it would help if that comment was moved or duplicated onto the doc comment for addChunkToBloom, something like:
addChunkToBloom returns true if the bloom has been completely filled and may not have consumed the entire iterator. It must be called repeatedly with new blooms until it returns false, at which point the iterator has been fully consumed.
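To make that contract concrete, here is a hedged, self-contained sketch of the call-site loop (the real addChunkToBloom works on Loki's bloom and iterator types; the map-based bloom and capacity below are stand-ins):

```go
package main

import "fmt"

// addChunkToBloom is a simplified stand-in: it consumes tokens until the bloom
// reaches capacity and reports whether it stopped before the iterator was drained.
func addChunkToBloom(bloom map[string]struct{}, capacity int, tokens *[]string) (full bool) {
	for len(*tokens) > 0 {
		if len(bloom) >= capacity {
			return true // bloom filled up; iterator not fully consumed
		}
		bloom[(*tokens)[0]] = struct{}{}
		*tokens = (*tokens)[1:]
	}
	return false
}

func main() {
	tokens := []string{"a", "b", "c", "d", "e"}
	var blooms []map[string]struct{}
	// Keep calling with fresh blooms until the iterator is fully consumed.
	for {
		b := make(map[string]struct{}, 2)
		full := addChunkToBloom(b, 2, &tokens)
		blooms = append(blooms, b)
		if !full {
			break
		}
	}
	fmt.Printf("created %d blooms\n", len(blooms)) // 3
}
```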
What this PR does / why we need it:
Instead of indexing ngrams of the log line content, we index the plain values of the structured metadata keys and values.
Resulting tokens are:
name
chunkPrefix + name
value
chunkPrefix + value
name + '=' + value
chunkPrefix + name + '=' + value
Indexed fields (metadata name) are also extracted into the series metadata.
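For example (a minimal sketch; the chunk prefix format used here is illustrative, not Loki's exact encoding), the token set for one structured metadata pair looks like this:

```go
package main

import "fmt"

// tokensFor lists the six token forms described above for a single
// structured metadata key/value pair.
func tokensFor(chunkPrefix, name, value string) []string {
	combined := name + "=" + value
	return []string{
		name, chunkPrefix + name,
		value, chunkPrefix + value,
		combined, chunkPrefix + combined,
	}
}

func main() {
	for _, tok := range tokensFor("0000-1111-", "traceID", "3146") {
		fmt.Println(tok)
	}
}
```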
Special notes for your reviewer:
This PR does not clean up unused code used for ngram tokenization. That is in scope for a follow-up.
Checklist
- Reviewed the CONTRIBUTING.md guide (required)
- feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that affect upgrading are documented in docs/sources/setup/upgrade/_index.md
- For Helm chart changes, bump the version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
- If deprecating or deleting configuration options, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR