feat: add Meilisearch-compatible search engine #35650

regisb · 2024-10-14T14:31:00Z

Description

The goal of this change is to introduce a search engine that is compatible with the edx-search API but that uses Meilisearch instead of Elasticsearch. That way, we can replace one by the other across edx-platform by simply changing a single SEARCH_ENGINE django setting.

There are a couple of differences between Meilisearch and Elasticsearch:

Filterable attributes must be defined explicitly.
No support for datetime objects, which must be converted to timestamps (with an extra field to store the timezone).
No special characters allowed in the primary key values, such that we must hash course IDs before we can use them as primary key values.

Note that this PR does not introduce any breaking change. This is an opt-in engine that anyone is free to use. There is some setup work for every search feature: see the engine module documentation for more information.

Supporting information

This change was discussed here: openedx/frontend-app-authoring#1334 (comment)

Testing instructions

Set the following setting:

SEARCH_ENGINE = "openedx.core.djangoapps.content.search.engine.MeilisearchEngine"

Then, follow the instructions from the engine module to try out the different search features.

Deadline

I'd like to merge this in time for the Sumac branch (October 23rd). If it's not possible, then I might opt to install this feature in Tutor via a separate Python package instead.

Open questions

As far as I'm concerned there are a couple of open questions for this PR:

Is edx-platform the right place to add this module? (I think it is, because edx-search is used only in edx-platform). If yes, did I pick the right location in the edx-platform file hierarchy, and did I pick the right module name?
Are there other features which make use of Elasticsearch and for which we should test this Meilisearch engine? I am aware of the following:
- student notes
- forum

openedx-webhooks · 2024-10-14T14:31:04Z

Thanks for the pull request, @regisb!

What's next?

Please work through the following steps to get your changes ready for engineering review:

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

Dependencies

This PR must be merged before / after / at the same time as ...
Blockers

This PR is waiting for OEP-1234 to be accepted.
Timeline information

This PR must be merged by XX date because ...
Partner information

This is for a course on edx.org.
Supporting documentation
Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

🔘 Let us know that your PR is ready for review:

Who will review my changes?

This repository is currently maintained by @openedx/wg-maintenance-edx-platform. Tag them in a comment and let them know that your changes are ready for review.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

pomegranited

Thank you for this PR @regisb ! I've tested the course discovery and courseware content search, but need to get forums and notes set up to test these -- will tackle them tomorrow.

Noted a couple of minor issues inline.

pomegranited · 2024-10-15T04:52:02Z

openedx/core/djangoapps/content/search/engine.py

+    SEARCH_ENGINE = "openedx.core.djangoapps.content.search.engine.MeilisearchEngine"
+
+You will then need to create the new indices by running:
+
+    ./manage.py lms shell -c "from openedx.core.djangoapps.content.search import engine; engine.create_indexes()"


To use Meilisearch in the LMS, we also need to move the meilisearch settings in the CMS env to lms.env.common.

Good point. But then we would have to have the same settings in cms.envs.common, right? Right above the line from lms.envs.common import (...) there is a big warning:

Although this module itself may not use these imported variables, other dependent modules may.
Warning: Do NOT add any new variables to this list. This is incompatible with future plans to
have more logical separation between LMS and Studio (CMS). It is also incompatible with the
direction documented in OEP-45: Configuring and Operating Open edX:
https://open-edx-proposals.readthedocs.io/en/latest/oep-0045-arch-ops-and-config.html

OEP-45 says we should use a defaults.py settings file: https://open-edx-proposals.readthedocs.io/en/latest/architectural-decisions/oep-0045-arch-ops-and-config.html#configuration

But there is no such settings file yet. I'd rather not use this PR to introduce this new module (scope creep...). What I can do is move all those settings to content/search/api.py, like this:

MEILISEARCH_ENABLED = getattr(settings, "MEILISEARCH_ENABLED", False) MEILISEARCH_INDEX_PREFIX = getattr(settings, "MEILISEARCH_INDEX_PREFIX", "") ...

What do you think?

Lol.. sounds like a Google situation, where the working approach is deprecated, but the new approach isn't developed yet 😄

So long as we can pass these settings from the env to the LMS + Studio, I'm good with having them in the api.py file.

OK I'll do that. The only issue I'm facing is that I can't import from api.py because that application is not in the INSTALLED_APPS for the lms -- only for the cms. The api.py module imports models.py , and this fails in the context of the lms. So I'll move the settings to a settings.py module.

pomegranited · 2024-10-15T05:36:35Z

openedx/core/djangoapps/content/search/engine.py

+        elif isinstance(value, dict):
+            processed[key] = process_document(value)
+        else:
+            # Pray that there are not datetime objects inside lists


If this comment is accurate, then we can just convert datetimes to timestamps like we're doing above, and not worry about their tzoffsets.

Right. The way we process timezone-aware timestamps is that we convert dict entries from:

{ "mydatetime": datetime(...) }

to:

{ "mydatetime": 1275767, # timestamp "mydatetime__tzoffset": 3600 # timezone offset in seconds }

If a datetime entry finds its way in a list, then I no longer know how to convert it in a way that preserves the timezone.

We could at least convert them to UTC timestamps, so at least the deserialized time is accurate, even if the timezone isn't preserved?

Datetimes in lists will be converted to strings, as per the DocumentEncoder. I don't have other clues to decide whether this is the right decision, as opposed to timestamps...

I guess this is something we can deal with if and when we need to.

openedx/core/djangoapps/content/search/engine.py

regisb · 2024-10-15T08:28:06Z

Thanks for the very quick review Jill, I appreciate it!

pomegranited · 2024-10-16T07:08:34Z

openedx/core/djangoapps/content/search/engine.py

+        "catalog_visibility",  # exclude visibility="none"
+        "enrollment_end",  # include only enrollable courses
+    ],
+    "courseware_content": [


Should use CoursewareSearchIndexer.INDEX_NAME instead of a hard-coded string here.

There's surely a variable for "course_info" too, but I couldn't find it quickly.

It's unfortunate that we need to store the filterable fields for all the uses of this engine here, instead of near the code that uses them, but I don't see a better way to deal with this..

Should we provide a way for people to inject their own index + fields into this hash?

Should use CoursewareSearchIndexer.INDEX_NAME instead of a hard-coded string here.

There's surely a variable for "course_info" too, but I couldn't find it quickly.

This is actually one of the issues that are raised in this document: https://openedx.atlassian.net/wiki/spaces/AC/pages/3884744738/State+of+edx-search+2023

Basically, index names are hardcoded across multiple repositories in Open edX... which is why I decided ~~to be lazy~~ to also hardcode the index names. But you're right, let's not aggravate the problem.

I won't go with CoursewareSearchIndexer.INDEX_NAME, because it's in the cms. Instead I suggest to go with COURSEWARE_INFO_INDEX_NAME and COURSEWARE_CONTENT_INDEX_NAME.

What I can do is to change the CoursewareSearchIndexer to pick the index names from the same settings.

Should we provide a way for people to inject their own index + fields into this hash?

🤔 yes... but I'd rather avoid adding yet another configuration setting. What I propose is that we leave it up to developers to extend INDEX_FILTERABLES. And then in create_indexes we concatenate the lists of existing and required filterables, such that we don't accidentally delete any existing index. In addition, we make it possible to pass an index_filterables argument to create_indexes such that developers can use it to create their own indexes.

pomegranited · 2024-10-16T07:09:01Z

Hi @regisb -- my apologies, I ran out of time today to continue testing this for Discussions, but I will pick it up first thing tomorrow.

Is edx-platform the right place to add this module? (I think it is, because edx-search is used only in edx-platform). If yes, did I pick the right location in the edx-platform file hierarchy, and did I pick the right module name?

@ormsbee @bradenmacdonald What do you think?

Previously, Elasticsearch indices configured for courseware and course info storage were hard-coded in the search indexer classes. With this change, these classes now follow the convention set by edx-search-api and load the index names from COURSEWARE_CONTENT_INDEX_NAME and COURSEWARE_INFO_INDEX_NAME. This does not resolve all inconsistencies, but it's an improvement. This is not considered a breaking change because when the settings are not defined they default to the same values: "course_info" and "courseware_content".

Moving settings from cms/envs/common.py makes it possible to reuse Meilisearch settings in the LMS without duplicating the settings (see OEP-45).

regisb · 2024-10-16T15:18:10Z

FYI discussions and notes won't work out of the box, because they use their own engines. This is on my TODO-list.

The goal of this change is to introduce a search engine that is compatible with the edx-search API but that uses Meilisearch instead of Elasticsearch. That way, we can replace one by the other across edx-platform by simply changing a single SEARCH_ENGINE django setting. There are a couple of differences between Meilisearch and Elasticsearch: 1. Filterable attributes must be defined explicitly. 2. No support for datetime objects, which must be converted to timestamps (with an extra field to store the timezone). 3. No special characters allowed in the primary key values, such that we must hash course IDs before we can use them as primary key values. Note that this PR does not introduce any breaking change. This is an opt-in engine that anyone is free to use. There is some setup work for every search feature: see the engine module documentation for more information. See the corresponding conversation here: openedx/frontend-app-authoring#1334 (comment)

bradenmacdonald · 2024-10-17T00:27:40Z

Is edx-platform the right place to add this module? (I think it is, because edx-search is used only in edx-platform). If yes, did I pick the right location in the edx-platform file hierarchy, and did I pick the right module name?

@ormsbee @bradenmacdonald What do you think?

Finding all the code related to edx-search and how it works has always been difficult for me, with the way it's spread out everywhere. I would prefer to locate this closer to some other code that's using edx-search, but I think this is OK. I just want to have a more clear separation between the Studio Search code that doesn't use edx-search APIs at all, and this engine backend that does use edx-search (even though they have a few settings in common, they're totally different implementations and used in totally different ways). So I'd suggest putting the new files in a subfolder, e.g. openedx.core.djangoapps.content.search.legacy . This also highlights that we want to replace edx-search with a newer abstraction layer in future versions (or just support meilisearch if we're happy enough with it at that point).

pomegranited · 2024-10-17T06:05:35Z

@regisb

FYI discussions and notes won't work out of the box, because they use their own engines. This is on my TODO-list.

Ah yes, of course! I realised this when I remembered that we're still using the old cs_comments_service as the default forum :( Good news is there's a Meilisearch SDK for ruby, but I still don't envy you or anyone else who has to work on the cs_comments_service code!

@bradenmacdonald

So I'd suggest putting the new files in a subfolder, e.g. openedx.core.djangoapps.content.search.legacy . This also highlights that we want to replace edx-search with a newer abstraction layer in future versions (or just support meilisearch if we're happy enough with it at that point).

I get what you're saying about an intention to replace edx-search, but marking this with "legacy" could be confusing too, since the "legacy" search engine is Elasticsearch. What if we put these new files under the existing openedx.features.djangoapps.course_search module?

FYI I ran into weird (unrelated) issues with my tutor dev stack today, so created #35662 to generate a grove k8s sandbox so others can test this out too.

regisb · 2024-10-17T07:00:36Z

Finding all the code related to edx-search and how it works has always been difficult for me, with the way it's spread out everywhere.

I totally get that. In such a case can I suggest moving this code to a separate repo? This would have several advantages:

Reusable code in the context of edx-notes-api.
No need to make changes to edx-platform so close to the release branch cutoff.
Easily installable with Tutor.

Ah yes, of course! I realised this when I remembered that we're still using the old cs_comments_service as the default forum :( Good news is there's a Meilisearch SDK for ruby, but I still don't envy you or anyone else who has to work on the cs_comments_service code!

I don't intend to change the ruby cs_comments_service repo. Instead, I'll only work on the new forum v2 repo (in the process of being migrated to the openedx org). This repo will go live in Sumac.

pomegranited · 2024-10-17T07:25:06Z

@regisb Separate repo is fine with me.. so why not put this in edx-search ?

I don't intend to change the ruby cs_comments_service repo. Instead, I'll only work on the new forum v2 repo (in the process of being openedx/axim-engineering#1272 to the openedx org). This repo will go live in Sumac.

Ooo spectacular!

regisb · 2024-10-17T09:15:18Z

so why not put this in edx-search ?

That would be the most logical place. @bradenmacdonald you are (understandably) averse to all things related to edx-search, but since this is an implementation of an edx-search interface, do you agree that we should add the code there?

EDIT: here's the PR: openedx/edx-search#162 I created it early such that we can install edx-search in the Tutor Docker image for Sumac.

bradenmacdonald · 2024-10-17T16:16:07Z

@regisb Yes that's totally fine with me! Thanks.

regisb · 2024-10-18T07:34:55Z

Closing in favour of openedx/edx-search#162

openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Oct 14, 2024

regisb mentioned this pull request Oct 14, 2024

Default Configurations for new libraries in Sumac openedx/frontend-app-authoring#1334

Open

regisb force-pushed the regisb/meilisearch branch 3 times, most recently from 39f469d to e071022 Compare October 14, 2024 15:03

pomegranited reviewed Oct 15, 2024

View reviewed changes

regisb mentioned this pull request Oct 15, 2024

Content Search Powered by Meilisearch openedx/platform-roadmap#378

Open

pomegranited reviewed Oct 16, 2024

View reviewed changes

regisb added 2 commits October 16, 2024 15:54

chore: refactor search settings for reusability

8b58f95

Moving settings from cms/envs/common.py makes it possible to reuse Meilisearch settings in the LMS without duplicating the settings (see OEP-45).

regisb force-pushed the regisb/meilisearch branch from 22c8f0b to 4ae5722 Compare October 16, 2024 16:42

pomegranited mentioned this pull request Oct 17, 2024

DO NOT MERGE: sandbox for testing Meilisearch-compatible search engine #35662

Closed

itsjeyd mentioned this pull request Oct 17, 2024

feat: integrate openedx search api with existing CoursewareSearch.jsx openedx/frontend-app-learning#1433

Closed

regisb mentioned this pull request Oct 17, 2024

feat: add Meilisearch-compatible search engine openedx/edx-search#162

Closed

regisb closed this Oct 18, 2024

regisb deleted the regisb/meilisearch branch October 29, 2024 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Meilisearch-compatible search engine #35650

feat: add Meilisearch-compatible search engine #35650

regisb commented Oct 14, 2024 •

edited

Loading

openedx-webhooks commented Oct 14, 2024

pomegranited left a comment •

edited

Loading

pomegranited Oct 15, 2024

regisb Oct 15, 2024

pomegranited Oct 16, 2024

regisb Oct 16, 2024

pomegranited Oct 15, 2024

regisb Oct 15, 2024

pomegranited Oct 16, 2024

regisb Oct 16, 2024

pomegranited Oct 17, 2024

regisb commented Oct 15, 2024

pomegranited Oct 16, 2024

regisb Oct 16, 2024

pomegranited commented Oct 16, 2024

regisb commented Oct 16, 2024

bradenmacdonald commented Oct 17, 2024

pomegranited commented Oct 17, 2024

regisb commented Oct 17, 2024

pomegranited commented Oct 17, 2024 •

edited

Loading

regisb commented Oct 17, 2024 •

edited

Loading

bradenmacdonald commented Oct 17, 2024

regisb commented Oct 18, 2024

feat: add Meilisearch-compatible search engine #35650

feat: add Meilisearch-compatible search engine #35650

Conversation

regisb commented Oct 14, 2024 • edited Loading

Description

Supporting information

Testing instructions

Deadline

Open questions

openedx-webhooks commented Oct 14, 2024

What's next?

🔘 Get product approval

🔘 Provide context

🔘 Get a green build

🔘 Let us know that your PR is ready for review:

Who will review my changes?

Where can I find more information?

When can I expect my changes to be merged?

pomegranited left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

regisb commented Oct 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pomegranited commented Oct 16, 2024

regisb commented Oct 16, 2024

bradenmacdonald commented Oct 17, 2024

pomegranited commented Oct 17, 2024

regisb commented Oct 17, 2024

pomegranited commented Oct 17, 2024 • edited Loading

regisb commented Oct 17, 2024 • edited Loading

bradenmacdonald commented Oct 17, 2024

regisb commented Oct 18, 2024

regisb commented Oct 14, 2024 •

edited

Loading

pomegranited left a comment •

edited

Loading

pomegranited commented Oct 17, 2024 •

edited

Loading

regisb commented Oct 17, 2024 •

edited

Loading