-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Meilisearch-compatible search engine #35650
Conversation
Thanks for the pull request, @regisb! What's next?Please work through the following steps to get your changes ready for engineering review: 🔘 Get product approvalIf you haven't already, check this list to see if your contribution needs to go through the product review process.
🔘 Provide contextTo help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:
🔘 Get a green buildIf one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green. 🔘 Let us know that your PR is ready for review:Who will review my changes?This repository is currently maintained by Where can I find more information?If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:
When can I expect my changes to be merged?Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:
💡 As a result it may take up to several weeks or months to complete a review and merge your PR. |
39f469d
to
e071022
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @regisb ! I've tested the course discovery and courseware content search, but need to get forums and notes set up to test these -- will tackle them tomorrow.
Noted a couple of minor issues inline.
SEARCH_ENGINE = "openedx.core.djangoapps.content.search.engine.MeilisearchEngine" | ||
|
||
You will then need to create the new indices by running: | ||
|
||
./manage.py lms shell -c "from openedx.core.djangoapps.content.search import engine; engine.create_indexes()" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To use Meilisearch in the LMS, we also need to move the meilisearch settings in the CMS env to lms.env.common
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. But then we would have to have the same settings in cms.envs.common, right? Right above the line from lms.envs.common import (...)
there is a big warning:
Although this module itself may not use these imported variables, other dependent modules may.
Warning: Do NOT add any new variables to this list. This is incompatible with future plans to
have more logical separation between LMS and Studio (CMS). It is also incompatible with the
direction documented in OEP-45: Configuring and Operating Open edX:
https://open-edx-proposals.readthedocs.io/en/latest/oep-0045-arch-ops-and-config.html
OEP-45 says we should use a defaults.py settings file: https://open-edx-proposals.readthedocs.io/en/latest/architectural-decisions/oep-0045-arch-ops-and-config.html#configuration
But there is no such settings file yet. I'd rather not use this PR to introduce this new module (scope creep...). What I can do is move all those settings to content/search/api.py, like this:
MEILISEARCH_ENABLED = getattr(settings, "MEILISEARCH_ENABLED", False)
MEILISEARCH_INDEX_PREFIX = getattr(settings, "MEILISEARCH_INDEX_PREFIX", "")
...
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol.. sounds like a Google situation, where the working approach is deprecated, but the new approach isn't developed yet 😄
So long as we can pass these settings from the env to the LMS + Studio, I'm good with having them in the api.py
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I'll do that. The only issue I'm facing is that I can't import from api.py because that application is not in the INSTALLED_APPS for the lms -- only for the cms. The api.py module imports models.py , and this fails in the context of the lms. So I'll move the settings to a settings.py module.
elif isinstance(value, dict): | ||
processed[key] = process_document(value) | ||
else: | ||
# Pray that there are not datetime objects inside lists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this comment is accurate, then we can just convert datetimes to timestamps like we're doing above, and not worry about their tzoffsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. The way we process timezone-aware timestamps is that we convert dict entries from:
{
"mydatetime": datetime(...)
}
to:
{
"mydatetime": 1275767, # timestamp
"mydatetime__tzoffset": 3600 # timezone offset in seconds
}
If a datetime entry finds its way in a list, then I no longer know how to convert it in a way that preserves the timezone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could at least convert them to UTC timestamps, so at least the deserialized time is accurate, even if the timezone isn't preserved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Datetimes in lists will be converted to strings, as per the DocumentEncoder. I don't have other clues to decide whether this is the right decision, as opposed to timestamps...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is something we can deal with if and when we need to.
Thanks for the very quick review Jill, I appreciate it! |
"catalog_visibility", # exclude visibility="none" | ||
"enrollment_end", # include only enrollable courses | ||
], | ||
"courseware_content": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use CoursewareSearchIndexer.INDEX_NAME instead of a hard-coded string here.
There's surely a variable for "course_info"
too, but I couldn't find it quickly.
It's unfortunate that we need to store the filterable fields for all the uses of this engine here, instead of near the code that uses them, but I don't see a better way to deal with this..
Should we provide a way for people to inject their own index + fields into this hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use CoursewareSearchIndexer.INDEX_NAME instead of a hard-coded string here.
There's surely a variable for "course_info" too, but I couldn't find it quickly.
This is actually one of the issues that are raised in this document: https://openedx.atlassian.net/wiki/spaces/AC/pages/3884744738/State+of+edx-search+2023
Basically, index names are hardcoded across multiple repositories in Open edX... which is why I decided to be lazy to also hardcode the index names. But you're right, let's not aggravate the problem.
I won't go with CoursewareSearchIndexer.INDEX_NAME, because it's in the cms. Instead I suggest to go with COURSEWARE_INFO_INDEX_NAME and COURSEWARE_CONTENT_INDEX_NAME.
What I can do is to change the CoursewareSearchIndexer to pick the index names from the same settings.
Should we provide a way for people to inject their own index + fields into this hash?
🤔 yes... but I'd rather avoid adding yet another configuration setting. What I propose is that we leave it up to developers to extend INDEX_FILTERABLES
. And then in create_indexes
we concatenate the lists of existing and required filterables, such that we don't accidentally delete any existing index. In addition, we make it possible to pass an index_filterables
argument to create_indexes
such that developers can use it to create their own indexes.
Hi @regisb -- my apologies, I ran out of time today to continue testing this for Discussions, but I will pick it up first thing tomorrow.
@ormsbee @bradenmacdonald What do you think? |
Previously, Elasticsearch indices configured for courseware and course info storage were hard-coded in the search indexer classes. With this change, these classes now follow the convention set by edx-search-api and load the index names from COURSEWARE_CONTENT_INDEX_NAME and COURSEWARE_INFO_INDEX_NAME. This does not resolve all inconsistencies, but it's an improvement. This is not considered a breaking change because when the settings are not defined they default to the same values: "course_info" and "courseware_content".
Moving settings from cms/envs/common.py makes it possible to reuse Meilisearch settings in the LMS without duplicating the settings (see OEP-45).
FYI discussions and notes won't work out of the box, because they use their own engines. This is on my TODO-list. |
The goal of this change is to introduce a search engine that is compatible with the edx-search API but that uses Meilisearch instead of Elasticsearch. That way, we can replace one by the other across edx-platform by simply changing a single SEARCH_ENGINE django setting. There are a couple of differences between Meilisearch and Elasticsearch: 1. Filterable attributes must be defined explicitly. 2. No support for datetime objects, which must be converted to timestamps (with an extra field to store the timezone). 3. No special characters allowed in the primary key values, such that we must hash course IDs before we can use them as primary key values. Note that this PR does not introduce any breaking change. This is an opt-in engine that anyone is free to use. There is some setup work for every search feature: see the engine module documentation for more information. See the corresponding conversation here: openedx/frontend-app-authoring#1334 (comment)
22c8f0b
to
4ae5722
Compare
Finding all the code related to edx-search and how it works has always been difficult for me, with the way it's spread out everywhere. I would prefer to locate this closer to some other code that's using edx-search, but I think this is OK. I just want to have a more clear separation between the Studio Search code that doesn't use edx-search APIs at all, and this engine backend that does use edx-search (even though they have a few settings in common, they're totally different implementations and used in totally different ways). So I'd suggest putting the new files in a subfolder, e.g. |
Ah yes, of course! I realised this when I remembered that we're still using the old cs_comments_service as the default forum :( Good news is there's a Meilisearch SDK for ruby, but I still don't envy you or anyone else who has to work on the cs_comments_service code!
I get what you're saying about an intention to replace edx-search, but marking this with "legacy" could be confusing too, since the "legacy" search engine is Elasticsearch. What if we put these new files under the existing FYI I ran into weird (unrelated) issues with my tutor dev stack today, so created #35662 to generate a grove k8s sandbox so others can test this out too. |
I totally get that. In such a case can I suggest moving this code to a separate repo? This would have several advantages:
I don't intend to change the ruby cs_comments_service repo. Instead, I'll only work on the new forum v2 repo (in the process of being migrated to the openedx org). This repo will go live in Sumac. |
@regisb Separate repo is fine with me.. so why not put this in edx-search ?
Ooo spectacular! |
That would be the most logical place. @bradenmacdonald you are (understandably) averse to all things related to edx-search, but since this is an implementation of an edx-search interface, do you agree that we should add the code there? EDIT: here's the PR: openedx/edx-search#162 I created it early such that we can install edx-search in the Tutor Docker image for Sumac. |
@regisb Yes that's totally fine with me! Thanks. |
Closing in favour of openedx/edx-search#162 |
Description
The goal of this change is to introduce a search engine that is compatible with the edx-search API but that uses Meilisearch instead of Elasticsearch. That way, we can replace one by the other across edx-platform by simply changing a single SEARCH_ENGINE django setting.
There are a couple of differences between Meilisearch and Elasticsearch:
Note that this PR does not introduce any breaking change. This is an opt-in engine that anyone is free to use. There is some setup work for every search feature: see the engine module documentation for more information.
Supporting information
This change was discussed here: openedx/frontend-app-authoring#1334 (comment)
Testing instructions
Set the following setting:
Then, follow the instructions from the engine module to try out the different search features.
Deadline
I'd like to merge this in time for the Sumac branch (October 23rd). If it's not possible, then I might opt to install this feature in Tutor via a separate Python package instead.
Open questions
As far as I'm concerned there are a couple of open questions for this PR: