
SOLR-4587: integrate lucene-monitor into solr #2382

Draft · wants to merge 82 commits into main
Conversation

@kotman12 commented Apr 1, 2024

https://issues.apache.org/jira/browse/SOLR-4587

Description

This module aims to simplify distributing and scaling query indexes for monitoring and alerting workflows (also known as reverse search) by providing a bridge between a Solr-managed search index and lucene-monitor's efficient reverse-search algorithms.

Here is some evidence that the community might find this useful.

  1. A blog post that partly inspired the current approach.
  2. Users asking about a percolator-like feature on Stack Overflow.
  3. Someone contributed this extension, but it doesn't really provide percolator-like functionality, and because it wasn't upstreamed it fell out of maintenance.
  4. A plug for my own question on the issue!

Solution

This is still a WiP, but I am opening it up as a PR to get community feedback. The current approach is to ingest queries as Solr documents, decompose them for performance, and then use the child-document feature to index the decomposed subqueries under one atomic parent document block. On the search side, the latest approach is a dedicated component that hooks into lucene-monitor's Presearcher, QueryTermFilter, and CandidateMatcher.
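The decomposition and block-indexing idea can be pictured with a miniature sketch. This is a toy illustration with hypothetical names, not the PR's code: lucene-monitor's QueryDecomposer and Solr's child-document support do the real work.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the indexing-side decomposition described above.
// A top-level disjunction is split into subqueries that would be indexed
// as child documents under one atomic parent block sharing the query id.
public class DecomposeSketch {
    // Split a top-level disjunction "a OR b" into its disjuncts;
    // any other query is kept whole. (lucene-monitor's QueryDecomposer
    // does this on parsed Query trees, not strings.)
    static List<String> decompose(String query) {
        List<String> parts = new ArrayList<>();
        for (String part : query.split("\\s+OR\\s+")) {
            parts.add(part.trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Each disjunct becomes a child document under one parent id, so
        // deleting the parent removes all child disjuncts atomically.
        List<String> children = decompose("title:solr OR body:lucene");
        System.out.println(children); // [title:solr, body:lucene]
    }
}
```

Indexing the disjuncts separately lets the presearcher prefilter match narrower subqueries, while the shared parent block keeps updates and deletes atomic per query id.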

The current optional cache implementation uses Caffeine instead of lucene-monitor's simpler ConcurrentHashMap. It's worth noting that this cache should likely be quite a bit larger than your average query or document cache, since query parsing involves a non-trivial amount of compute and disk I/O (especially for large results and/or queries). It's also worth noting that lucene-monitor, in its default configuration, keeps all the indexed queries cached in memory. A unique solr-monitor feature is the addition of a bespoke cache warmer that tries to populate the cache with approximately all the queries updated since the last commit. This was added to have a baseline when comparing against lucene-monitor performance: the goal was to make it possible to effectively cache all queries in memory (since that is what lucene-monitor enables by default) but not to require it.
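The caching trade-off is easiest to see with a minimal memoizing cache. This is a sketch in the spirit of lucene-monitor's ConcurrentHashMap approach, not the Caffeine-based implementation in this PR; the point is only that the expensive parse should happen at most once per cached query.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of a parsed-query cache. Real code would cache parsed
// Lucene Query objects; Caffeine adds size bounds and eviction on top.
public class QueryCacheSketch {
    static final AtomicInteger parses = new AtomicInteger();
    static final ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();

    // Stand-in for the expensive step (query parsing plus disk I/O).
    static Object parse(String queryString) {
        parses.incrementAndGet();
        return new Object();
    }

    static Object lookup(String queryString) {
        // computeIfAbsent guarantees a single parse per key under concurrency.
        return cache.computeIfAbsent(queryString, QueryCacheSketch::parse);
    }

    public static void main(String[] args) {
        lookup("title:solr");
        lookup("title:solr"); // served from the cache, no second parse
        System.out.println(parses.get()); // 1
    }
}
```

An unbounded map like this is effectively lucene-monitor's default "everything in memory" behavior; a bounded Caffeine cache plus the warmer described above is what makes full in-memory caching possible without requiring it.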

Currently the PR has some visitor classes in the org.apache.lucene.monitor package that expose certain lucene-monitor internals. If this approach is accepted, the Lucene project will likely need to be updated to expose what is necessary.

Tests

  1. testMonitorQuery: basic functionality before and after an update
  2. testNoDocListInResponse: The current API allows two types of responses: a special monitorDocuments response that can relay lucene-monitor's response structure and unique features such as "reverse" highlights, and a regular Solr document list in which each "response" document really refers to a query that matches the "real" document being matched. This test ensures you can disable the Solr document list in the response.
  3. testDefaultParser: validate that solr-monitor routes to default parser when none is selected.
  4. testDisjunctionQuery: validate that subqueries of a disjunction get indexed separately.
  5. testNoDanglingDecomposition: validate that deleting a top-level query also removes all the child disjuncts.
  6. testNotQuery
  7. testWildCardQuery
  8. testDefaultQueryMatchTypeIsNone: If no match type is selected with the monitorMatchType field then only a solr document list is returned (same behavior as "forward" search).
  9. testMultiDocHighlightMatchType: Test highlight matcher on a multi-document batch and ensure it returns the character offsets and positions of all individual matches. It is worth noting that percolator returns the actual matching text snippet. This is something we could consider supporting within solr or adding to lucene-monitor.
  10. testHighlightMatchType: Single-doc highlight test. Slightly different from the one above in that the highlighted field does not need storeOffsetsWithPositions="true", which is pretty convenient. I am not sure if I am relying on a MemoryIndex implementation detail, but it is a bit tedious for users to update their schemas to storeOffsetsWithPositions="true" just to get character offsets back from the highlight matcher. I also don't know if there is a better way to handle the multi-doc case .. maybe break each doc into its own MemoryIndex reader so that we get offsets by default without specifying storeOffsetsWithPositions="true"?
  11. manySegmentsQuery: The cache warmer has reader-leaf-dependent logic so this was included to verify everything works on a multi-segment index.

All of the above are also tested with the custom configurations below:

  1. Parallel matcher - lucene-monitor allows running the final, most expensive matching step in a multi-threaded environment. The current solr-monitor implementation allows this with some restrictions. For instance, it is difficult to populate a document response list from a fully asynchronous matching component because it would require awkwardly opening and closing leaf collectors on demand. The more idiomatic Solr approach would be to run this on many shards and gain parallelism as recommended here. Still, during testing I found that a fully async postfilter in a single shard had better performance than an equally parallel multi-sharded, synchronous postfilter, so I've decided to keep it in the initial proposal. On top of that, it helps achieve greater feature parity with lucene-monitor (which has no concept of sharding and so can only parallelize with a special matcher).
  2. Stored monitor query - allow storing queries with stored="true" instead of using the recommended docValues. docValues have stricter single-value size limits, so this is mainly to accommodate humongous queries.
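The parallel-matcher idea above concerns only the final, expensive matching step. As a rough pure-JDK sketch (not lucene-monitor's actual ParallelMatcher API), fanning that step out over a thread pool looks like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of parallelizing the final match step: each candidate query is
// checked against the document batch on a worker thread, and matching
// query ids are gathered in submission order.
public class ParallelMatchSketch {
    // Stand-in for the expensive per-query match (real code runs a Lucene
    // query against a MemoryIndex of the document batch).
    static boolean matches(String queryId, String doc) {
        return doc.contains(queryId);
    }

    static List<String> matchAll(List<String> queryIds, String doc, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String id : queryIds) {
                futures.add(pool.submit(() -> matches(id, doc) ? id : null));
            }
            List<String> hits = new ArrayList<>();
            for (Future<String> f : futures) {
                String id;
                try {
                    id = f.get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
                if (id != null) hits.add(id);
            }
            return hits;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(matchAll(List.of("solr", "kafka"), "intro to solr", 2)); // [solr]
    }
}
```

The restriction described above follows from this shape: once matches complete on arbitrary worker threads, feeding them back into a per-leaf collector (which expects single-threaded, in-order collection) becomes awkward.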

I'll note that I also have some local performance tests which are difficult to port but which helped guide some of the decisions so far. I've also "manually" tested the custom tlog deserialization of the derived query field, but this verification should probably go in a TlogReplay test. I haven't gone down that rabbit hole yet, as I wanted to poll for feedback first. The reason I skip the tlog for the derived query fields is that these fields wrap a token stream, which is difficult to serialize without a custom analyzer. The goal was to let users leverage their existing document schema as much as possible instead of having to create something custom for the query-monitoring use case.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check. TODO some apparently unrelated test failures
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide


<config>
<luceneMatchVersion>9.4</luceneMatchVersion>
Contributor:

minor/maybe

Suggested change
<luceneMatchVersion>9.4</luceneMatchVersion>
<luceneMatchVersion>${tests.luceneMatchVersion:LATEST}</luceneMatchVersion>

Author:

thanks for catching this! .. I had to change it in a few other places so I made a separate commit


apply plugin: 'java-library'

description = 'Apache Solr Monitor'
Contributor:

That is so puzzling to anyone who isn't intimately familiar with Lucene Monitor. I don't even think we should be calling this "Solr Monitor"; it looks like an infrastructure monitoring thing. Possibly "Solr-Lucene-Monitor", but still... a puzzling name.

Author:

This is a great point .. The library used to be called luwak which I find to be a much better name... I'll try to think of a better name (maybe solr-reverse-search or solr-query-alerting). I'll reply in more detail to your mailing list message also touching on solr.cool and the sandbox.

Contributor:

Saved Searches is a common name; I assume it is possible to list a user's saved searches too. Or Alerting, but then most people will expect some functionality to ship alerts somewhere...

Author:

You're right, if anything this might be a part of some larger alerting system, but "saved search" is more accurate.

Contributor:

Saved searches is a pretty indicative name. Percolator is also a known name for this kind of functionality.

Author:

Interesting, I thought ES invented "percolator" as more of a metaphor... I wasn't aware that this is a more generic name. I was worried that "percolator" might clash too much with ES.

Contributor @cpoerschke left a comment:

Hi @kotman12 - thanks for working on this!

I started "just browsing" this PR this morning, so the inline comments may seem a bit random or general, but I'm sharing them anyhow in case they're useful. I haven't considered any naming or solr-versus-solr-sandbox-versus-elsewhere aspects at this point, i.e. I was just browsing.

Comment on lines 34 to 35
private String queryFieldNameOverride;
private String payloadFieldNameOverride;
Contributor:

subjective: could maybe initialise to the defaults here, overriding in init if applicable, and then avoid the null-or-not checks in getInstance

Suggested change
private String queryFieldNameOverride;
private String payloadFieldNameOverride;
private String queryFieldName = MonitorFields.MONITOR_QUERY;
private String payloadFieldName = MonitorFields.PAYLOAD;

Author:

yea .. this makes sense to me. Is there really any value to these overrides though? I don't have a good reason why I chose to make these two fields overridable but not the other reserved fields. Is it safe to assume that a field prefixed by _ won't be in the user space anyway? If that is the case then this override business is overkill. Otherwise, we probably should make everything overridable.


public class MonitorUpdateProcessorFactory extends UpdateRequestProcessorFactory {

private Presearcher presearcher = PresearcherFactory.build();
Contributor:

noting that init has PresearcherFactory.build(presearcherType) also.

Suggested change
private Presearcher presearcher = PresearcherFactory.build();
private Presearcher presearcher;

Author:

With the latest change the Presearcher only gets initialized in the ReverseQueryParserPlugin and I share that core-level-singleton by making MonitorUpdateProcessorFactory a SolrCoreAware type. Not sure if there is a better pattern for this? This would be admittedly nicer with some kind of DI mechanism.

*/

/** This package contains Solr's lucene monitor integration. */
package org.apache.solr.monitor;
Contributor:

minor: surprised that package-info.java seems to be not needed for the lucene/monitor sub-directory, or maybe the checking logic just isn't checking for it.

Author:

I was hoping to actually make some changes to Lucene in order to avoid the need for that package. I wanted to gauge the viability of "solr-monitor" before suggesting changes to the Lucene upstream. The way I see it, lucene-monitor has very nice optimizations for making saved search fast, but the interface is tightly sealed and makes very opinionated choices about things like caching, which makes it hard to integrate into something like Solr. Not to mention, lucene-monitor's index isn't "pluggable" or exposed in any way. It just seemed easier to expose the relevant algorithms within lucene-monitor rather than trying to hack the whole kitchen sink into Solr. Sorry about the tangent 😃

Comment on lines 72 to 75
this.queryFieldName = queryFieldName;
this.payloadFieldName = payloadFieldName;
this.core = core;
this.indexSchema = core.getLatestSchema();
Contributor:

Wondering if there's an assumption somehow w.r.t. queryFieldName and payloadFieldName being or not being within indexSchema -- and if there's an assumption to check it somewhere, maybe at initialisation time rather than when the first document(s) arrive to make use of the fields etc.

Likewise w.r.t. the MonitorFields.RESERVED_MONITOR_FIELDS referenced later on.

Author:

I've added a narrower MonitorFields.REQUIRED_MONITOR_FIELDS set which gets cross-validated against the schema in MonitorUpdateProcessorFactory::inform. In the same place I've also added some more specific schema-validations which get invoked for more specific configurations, i.e. which type of presearcher you are using.

Comment on lines 44 to 52
@Override
public void close() throws IOException {
super.close();
}

@Override
public void init(NamedList<?> args) {
super.init(args);
}
Contributor:

Suggested change
@Override
public void close() throws IOException {
super.close();
}
@Override
public void init(NamedList<?> args) {
super.init(args);
}

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

class ParallelSolrMatcherSink<T extends QueryMatch> implements SolrMatcherSink {
Author @kotman12 commented Apr 20, 2024:

@cpoerschke I wonder if this whole class would be obviated by #2248 .. I found that because there can be significant overhead in pre-processing documents for reverse search (mainly analysis), parallelizing by throwing more Solr cores at the problem wasn't quite as fast as simply parallelizing the expensive post filter. But it seems that if that PR (or something similar) were merged, we might already run post filters in parallel for each segment?

Contributor:

good question, i don't know.

kotman12#2 proposes to defer the parallel matching logic, i.e. to not have it in the initial integration: that reduces scope and complexity initially, both code-wise and from users' perspectives (when setting a system up, should more shards, replicas, or threads be used, etc.).

Defer as opposed to remove, i.e. it could come back as a future optimisation or enhancement after an initial integration.

Author:

@cpoerschke after pulling latest upstream and playing around with multiThreaded I discovered that the multi-threaded searcher is explicitly skipped for any query that includes a post filter. As such, the current implementation won't easily benefit from this feature. I am considering refactoring this to be a custom query with a two-phase iterator instead. Another motivation for this would be to take advantage of the latest intra-segment parallelization work done in lucene. I might create a branch to explore this idea..

Contributor:

Yes please do a TwoPhaseIterator. It's nice to see contributors like you come along that know of such Lucene low level depths :-) . PostFilter should be a choice of last resort; it pre-dated TwoPhaseIterator. I fought to reduce where PostFilter is used in Solr; and it's finally at the point where it's used where it must be used, except FunctionRangeQuery SOLR-14164
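The appeal of a TwoPhaseIterator over a PostFilter can be sketched abstractly. This is a conceptual pure-JDK stand-in, not Lucene's actual API: a cheap approximation proposes candidates and an expensive second phase confirms them, letting the engine decide when to run the expensive check.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Conceptual stand-in for Lucene's two-phase iteration: a fast,
// over-matching approximation narrows the candidate set, then an exact
// but expensive check runs only on the survivors.
public class TwoPhaseSketch {
    static List<String> search(List<String> docs,
                               Predicate<String> cheapApproximation,
                               Predicate<String> expensiveMatch) {
        return docs.stream()
            .filter(cheapApproximation)  // phase 1: fast, may over-match
            .filter(expensiveMatch)      // phase 2: exact, runs on few docs
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of("solr monitor", "solr cloud", "kafka");
        List<String> hits = search(docs,
            d -> d.contains("solr"),        // presearcher-style prefilter
            d -> d.endsWith("monitor"));    // full reverse-search match
        System.out.println(hits); // [solr monitor]
    }
}
```

In the real API the approximation is a DocIdSetIterator and the confirmation is `TwoPhaseIterator.matches()`; expressing the reverse-search check this way lets Lucene interleave it with other clauses and with segment-level parallelism, which a PostFilter cannot do.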

Author @kotman12 commented Nov 26, 2024:

@dsmiley I've PoC'd the two-phase iterator approach here kotman12#6 .. the existing tests pass but I think I want to add a few more. Most of the change is in the newly added ReverseSearchQuery... I'm far from a lucene expert so I invite you to take a look 😃 .. any feedback would be much appreciated (I've invited you as a collaborator).

QueryMatch.SIMPLE_MATCHER::createMatcher,
new IndexSearcher(documentBatch.get()),
matchingQueries -> {
if (rb.isDebug()) {
Contributor:

Just noting that the 3e7a53c commit is a refactor except for the addition of the rb.isDebug() qualification here. WDYT?

And I wonder, if the solr/core DebugComponent had a 'custom info provider' concept, could ReverseSearchComponent potentially implement that i.e. no need for a custom reverse debug component then? Though maybe I'm still not understanding in detail enough the interaction between the reverse query component and the base query component and consequently the interaction with the debug component after that, and the resulting implementation nuances etc.

Comment on lines 51 to 59
var originalMatchQuery = entry.getMatchQuery();

var matchQuery = new ConstantScoreQuery(originalMatchQuery);

boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
if (isMatch && !queryId.equals(lastQueryId)) {
lastQueryId = queryId;
super.collect(doc);
}
Contributor:

Seeking to better understand the lastQueryId logic here:

  • could this method be called from multiple threads concurrently?
  • could the equality check happen earlier? what would happen if there was no check and the doc was always collected if there's a match?
Suggested change
var originalMatchQuery = entry.getMatchQuery();
var matchQuery = new ConstantScoreQuery(originalMatchQuery);
boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
if (isMatch && !queryId.equals(lastQueryId)) {
lastQueryId = queryId;
super.collect(doc);
}
if (!queryId.equals(lastQueryId)) {
var originalMatchQuery = entry.getMatchQuery();
var matchQuery = new ConstantScoreQuery(originalMatchQuery);
boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
if (isMatch) {
lastQueryId = queryId;
super.collect(doc);
}
}

vs.

Suggested change
var originalMatchQuery = entry.getMatchQuery();
var matchQuery = new ConstantScoreQuery(originalMatchQuery);
boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
if (isMatch && !queryId.equals(lastQueryId)) {
lastQueryId = queryId;
super.collect(doc);
}
var originalMatchQuery = entry.getMatchQuery();
var matchQuery = new ConstantScoreQuery(originalMatchQuery);
boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
if (isMatch) {
super.collect(doc);
}

Author:

I can revert .. this was also driven by user feedback. lucene-monitor decomposes top-level disjunctions, i.e. this OR that, into two separate queries, this and that, which roll up to the same queryId but are indexed as separate documents for performance reasons. Basically I wanted to deduplicate, and I was under the impression that query/leaf collectors weren't thread-safe and would always be called by a single thread at a time. But perhaps that's not true, or it's too much of an implementation detail. I haven't yet considered where else this deduplication could go...

Author @kotman12 commented Nov 13, 2024:

I index disjunctions as nested documents which I assume would always be collected in the correct order since they occupy the same block.

Edit

Ideally I wanted to avoid grouping, i.e. keeping large in-memory sets to achieve the deduplication... Hence the reliance on ordering .. but maybe this isn't the right place and/or right approach.
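The last-seen-id trick only works because disjuncts of the same query occupy one block and are collected contiguously. A toy sketch of why the ordering assumption matters (hypothetical names, not the PR's collector):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the lastQueryId deduplication discussed above: collect a
// query id at most once, assuming matches for the same id arrive
// contiguously (as they do when disjuncts share a parent/child block).
public class DedupSketch {
    static List<String> collect(List<String> matchedQueryIds) {
        List<String> out = new ArrayList<>();
        String lastQueryId = null;
        for (String queryId : matchedQueryIds) {
            if (!queryId.equals(lastQueryId)) {
                out.add(queryId);
                lastQueryId = queryId;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Contiguous duplicates are collapsed...
        System.out.println(collect(List.of("q1", "q1", "q2"))); // [q1, q2]
        // ...but interleaved ids would be collected twice, which is why
        // block ordering (or a set-based approach) is needed.
        System.out.println(collect(List.of("q1", "q2", "q1"))); // [q1, q2, q1]
    }
}
```

Tracking only the previous id keeps memory constant, whereas a HashSet of all seen ids would grow with the number of matching queries; the trade-off is the hard dependency on collection order.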

var searcher = req.getSearcher();
MonitorQueryCache solrMonitorCache =
(SharedMonitorCache) searcher.getCache(this.solrMonitorCacheName);
SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(req.getCore());
Contributor @cpoerschke commented Nov 13, 2024:

From code reading

Suggested change
SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(req.getCore());
SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(this.queryDecomposer);

looks possible here, will attempt to do that.

edit: never mind, it looked tempting to remove the SolrCore (and thus solr) dependency from the class but the core is needed for the SharedMonitorCacheLatestRegenerator code path.

Comment on lines 87 to 116
var req = rb.req;
var documentBatch = documentBatch(req);
var matcherSink =
new SyncSolrMatcherSink<>(
QueryMatch.SIMPLE_MATCHER::createMatcher,
new IndexSearcher(documentBatch.get()),
matchingQueries -> {
if (rb.isDebug()) {
rb.req
.getContext()
.put(
ReverseSearchDebugComponent.ReverseSearchDebugInfo.KEY,
new ReverseSearchDebugComponent.ReverseSearchDebugInfo(
matchingQueries.getQueriesRun()));
}
});
Query preFilterQuery = presearcher.buildQuery(documentBatch.get(), getTermAcceptor(rb.req));
List<Query> mutableFilters =
Optional.ofNullable(rb.getFilters()).map(ArrayList::new).orElseGet(ArrayList::new);
rb.setQuery(new MatchAllDocsQuery());
mutableFilters.add(preFilterQuery);
var searcher = req.getSearcher();
MonitorQueryCache solrMonitorCache =
(SharedMonitorCache) searcher.getCache(this.solrMonitorCacheName);
SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(req.getCore());
mutableFilters.add(
new MonitorPostFilter(
new SolrMonitorQueryCollector.CollectorContext(
solrMonitorCache, queryDecoder, matcherSink)));
rb.setFilters(mutableFilters);
Contributor:

Added 113ca8f commit with some reordering and edits here to aid code reading comprehension, also removes duplicate req.getSearcher().getCache() call as a side effect.

*
*/

package org.apache.lucene.monitor;
Contributor:

Opened apache/lucene#13993 to propose to make DocumentBatch public.

@zzBrunoBrito commented:

Hi @kotman12, sorry for my late reply; I haven't been able to check GitHub lately. Thank you for responding. Yes, I am interested in using it; it's a very interesting feature for me with lots of potential to power impactful features. I'll follow your instructions and am happy to provide feedback. I'm curious to test it and see how it works.

Author @kotman12 commented Dec 7, 2024:

Hey @zzBrunoBrito .. thanks for replying. I've added some general docs in this branch that could help you out. I plan to add it to the refguide but I'm still working through some stuff.

Until this is upstreamed and Solr picks up Lucene 9.12 (which is in the works), you will have to use this workaround:

  1. Build the module and copy build/packaging/saved-search/ into solr-9.7.0/modules/ of your solr installation. Alternatively, you can use the one I just built to get started: saved-search.zip. You might notice the strange name aa-solr-saved-search-10.0.0-SNAPSHOT.jar .. this is a temporary hack to load saved-search classes before lucene-monitor classes. This won't be necessary once lucene 9.12 is picked up.
  2. You can pretty much follow the instructions of Basic Configuration in the README but any time you see a classpath pointing to a new class like solr.SavedSearchUpdateProcessorFactory .. you will have to replace the resource path with solr.savedsearch.update.* or solr.savedsearch.cache.* or solr.savedsearch.search.* depending on the class. That is unless you want to build the whole solr distribution from scratch .. because the resource loader name mapping is in solr-core which I've updated in this PR but is obviously not something you'll pick up if you only patch in the saved-search jar.
  3. Run bin/solr start -e cloud -Dsolr.modules=saved-search
