Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Meilisearch-compatible search engine #35650

Closed
wants to merge 3 commits into from

Conversation

regisb
Copy link
Contributor

@regisb regisb commented Oct 14, 2024

Description

The goal of this change is to introduce a search engine that is compatible with the edx-search API but that uses Meilisearch instead of Elasticsearch. That way, we can replace one by the other across edx-platform by simply changing a single SEARCH_ENGINE django setting.

There are a couple of differences between Meilisearch and Elasticsearch:

  1. Filterable attributes must be defined explicitly.
  2. No support for datetime objects, which must be converted to timestamps (with an extra field to store the timezone).
  3. No special characters allowed in the primary key values, such that we must hash course IDs before we can use them as primary key values.

Note that this PR does not introduce any breaking change. This is an opt-in engine that anyone is free to use. There is some setup work for every search feature: see the engine module documentation for more information.

Supporting information

This change was discussed here: openedx/frontend-app-authoring#1334 (comment)

Testing instructions

Set the following setting:

SEARCH_ENGINE = "openedx.core.djangoapps.content.search.engine.MeilisearchEngine"

Then, follow the instructions from the engine module to try out the different search features.

Deadline

I'd like to merge this in time for the Sumac branch (October 23rd). If it's not possible, then I might opt to install this feature in Tutor via a separate Python package instead.

Open questions

As far as I'm concerned there are a couple of open questions for this PR:

  • Is edx-platform the right place to add this module? (I think it is, because edx-search is used only in edx-platform). If yes, did I pick the right location in the edx-platform file hierarchy, and did I pick the right module name?
  • Are there other features which make use of Elasticsearch and for which we should test this Meilisearch engine? I am aware of the following:
    • student notes
    • forum

@openedx-webhooks
Copy link

Thanks for the pull request, @regisb!

What's next?

Please work through the following steps to get your changes ready for engineering review:

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

🔘 Let us know that your PR is ready for review:

Who will review my changes?

This repository is currently maintained by @openedx/wg-maintenance-edx-platform. Tag them in a comment and let them know that your changes are ready for review.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

Copy link
Contributor

@pomegranited pomegranited left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @regisb ! I've tested the course discovery and courseware content search, but need to get forums and notes set up to test these -- will tackle them tomorrow.

Noted a couple of minor issues inline.

Comment on lines +7 to +11
SEARCH_ENGINE = "openedx.core.djangoapps.content.search.engine.MeilisearchEngine"

You will then need to create the new indices by running:

./manage.py lms shell -c "from openedx.core.djangoapps.content.search import engine; engine.create_indexes()"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use Meilisearch in the LMS, we also need to move the meilisearch settings in the CMS env to lms.env.common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. But then we would have to have the same settings in cms.envs.common, right? Right above the line from lms.envs.common import (...) there is a big warning:

Although this module itself may not use these imported variables, other dependent modules may.
Warning: Do NOT add any new variables to this list. This is incompatible with future plans to
have more logical separation between LMS and Studio (CMS). It is also incompatible with the
direction documented in OEP-45: Configuring and Operating Open edX:
https://open-edx-proposals.readthedocs.io/en/latest/oep-0045-arch-ops-and-config.html

OEP-45 says we should use a defaults.py settings file: https://open-edx-proposals.readthedocs.io/en/latest/architectural-decisions/oep-0045-arch-ops-and-config.html#configuration

But there is no such settings file yet. I'd rather not use this PR to introduce this new module (scope creep...). What I can do is move all those settings to content/search/api.py, like this:

MEILISEARCH_ENABLED = getattr(settings, "MEILISEARCH_ENABLED", False)
MEILISEARCH_INDEX_PREFIX = getattr(settings, "MEILISEARCH_INDEX_PREFIX", "")
...

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol.. sounds like a Google situation, where the working approach is deprecated, but the new approach isn't developed yet 😄

So long as we can pass these settings from the env to the LMS + Studio, I'm good with having them in the api.py file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'll do that. The only issue I'm facing is that I can't import from api.py because that application is not in the INSTALLED_APPS for the lms -- only for the cms. The api.py module imports models.py , and this fails in the context of the lms. So I'll move the settings to a settings.py module.

elif isinstance(value, dict):
processed[key] = process_document(value)
else:
# Pray that there are not datetime objects inside lists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this comment is accurate, then we can just convert datetimes to timestamps like we're doing above, and not worry about their tzoffsets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The way we process timezone-aware timestamps is that we convert dict entries from:

{
    "mydatetime": datetime(...)
}

to:

{
    "mydatetime": 1275767, # timestamp
    "mydatetime__tzoffset": 3600 # timezone offset in seconds
}

If a datetime entry finds its way in a list, then I no longer know how to convert it in a way that preserves the timezone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could at least convert them to UTC timestamps, so at least the deserialized time is accurate, even if the timezone isn't preserved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datetimes in lists will be converted to strings, as per the DocumentEncoder. I don't have other clues to decide whether this is the right decision, as opposed to timestamps...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is something we can deal with if and when we need to.

openedx/core/djangoapps/content/search/engine.py Outdated Show resolved Hide resolved
@regisb
Copy link
Contributor Author

regisb commented Oct 15, 2024

Thanks for the very quick review Jill, I appreciate it!

"catalog_visibility", # exclude visibility="none"
"enrollment_end", # include only enrollable courses
],
"courseware_content": [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use CoursewareSearchIndexer.INDEX_NAME instead of a hard-coded string here.

There's surely a variable for "course_info" too, but I couldn't find it quickly.

It's unfortunate that we need to store the filterable fields for all the uses of this engine here, instead of near the code that uses them, but I don't see a better way to deal with this..

Should we provide a way for people to inject their own index + fields into this hash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use CoursewareSearchIndexer.INDEX_NAME instead of a hard-coded string here.

There's surely a variable for "course_info" too, but I couldn't find it quickly.

This is actually one of the issues that are raised in this document: https://openedx.atlassian.net/wiki/spaces/AC/pages/3884744738/State+of+edx-search+2023

Basically, index names are hardcoded across multiple repositories in Open edX... which is why I decided to be lazy to also hardcode the index names. But you're right, let's not aggravate the problem.

I won't go with CoursewareSearchIndexer.INDEX_NAME, because it's in the cms. Instead I suggest to go with COURSEWARE_INFO_INDEX_NAME and COURSEWARE_CONTENT_INDEX_NAME.

What I can do is to change the CoursewareSearchIndexer to pick the index names from the same settings.

Should we provide a way for people to inject their own index + fields into this hash?

🤔 yes... but I'd rather avoid adding yet another configuration setting. What I propose is that we leave it up to developers to extend INDEX_FILTERABLES. And then in create_indexes we concatenate the lists of existing and required filterables, such that we don't accidentally delete any existing index. In addition, we make it possible to pass an index_filterables argument to create_indexes such that developers can use it to create their own indexes.

@pomegranited
Copy link
Contributor

Hi @regisb -- my apologies, I ran out of time today to continue testing this for Discussions, but I will pick it up first thing tomorrow.

  1. Is edx-platform the right place to add this module? (I think it is, because edx-search is used only in edx-platform). If yes, did I pick the right location in the edx-platform file hierarchy, and did I pick the right module name?

@ormsbee @bradenmacdonald What do you think?

Previously, Elasticsearch indices configured for courseware and course
info storage were hard-coded in the search indexer classes. With this
change, these classes now follow the convention set by edx-search-api
and load the index names from COURSEWARE_CONTENT_INDEX_NAME and
COURSEWARE_INFO_INDEX_NAME. This does not resolve all inconsistencies,
but it's an improvement.

This is not considered a breaking change because when the settings are
not defined they default to the same values: "course_info" and
"courseware_content".
Moving settings from cms/envs/common.py makes it possible to reuse
Meilisearch settings in the LMS without duplicating the settings (see
OEP-45).
@regisb
Copy link
Contributor Author

regisb commented Oct 16, 2024

FYI discussions and notes won't work out of the box, because they use their own engines. This is on my TODO-list.

The goal of this change is to introduce a search engine that is
compatible with the edx-search API but that uses Meilisearch instead of
Elasticsearch. That way, we can replace one by the other across
edx-platform by simply changing a single SEARCH_ENGINE django setting.

There are a couple of differences between Meilisearch and Elasticsearch:

1. Filterable attributes must be defined explicitly.
2. No support for datetime objects, which must be converted to
   timestamps (with an extra field to store the timezone).
3. No special characters allowed in the primary key values, such that we
   must hash course IDs before we can use them as primary key values.

Note that this PR does not introduce any breaking change. This is an
opt-in engine that anyone is free to use. There is some setup work for
every search feature: see the engine module documentation for more
information.

See the corresponding conversation here:
openedx/frontend-app-authoring#1334 (comment)
@bradenmacdonald
Copy link
Contributor

  1. Is edx-platform the right place to add this module? (I think it is, because edx-search is used only in edx-platform). If yes, did I pick the right location in the edx-platform file hierarchy, and did I pick the right module name?

@ormsbee @bradenmacdonald What do you think?

Finding all the code related to edx-search and how it works has always been difficult for me, with the way it's spread out everywhere. I would prefer to locate this closer to some other code that's using edx-search, but I think this is OK. I just want to have a more clear separation between the Studio Search code that doesn't use edx-search APIs at all, and this engine backend that does use edx-search (even though they have a few settings in common, they're totally different implementations and used in totally different ways). So I'd suggest putting the new files in a subfolder, e.g. openedx.core.djangoapps.content.search.legacy . This also highlights that we want to replace edx-search with a newer abstraction layer in future versions (or just support meilisearch if we're happy enough with it at that point).

@pomegranited
Copy link
Contributor

@regisb

FYI discussions and notes won't work out of the box, because they use their own engines. This is on my TODO-list.

Ah yes, of course! I realised this when I remembered that we're still using the old cs_comments_service as the default forum :( Good news is there's a Meilisearch SDK for ruby, but I still don't envy you or anyone else who has to work on the cs_comments_service code!

@bradenmacdonald

So I'd suggest putting the new files in a subfolder, e.g. openedx.core.djangoapps.content.search.legacy . This also highlights that we want to replace edx-search with a newer abstraction layer in future versions (or just support meilisearch if we're happy enough with it at that point).

I get what you're saying about an intention to replace edx-search, but marking this with "legacy" could be confusing too, since the "legacy" search engine is Elasticsearch. What if we put these new files under the existing openedx.features.djangoapps.course_search module?

FYI I ran into weird (unrelated) issues with my tutor dev stack today, so created #35662 to generate a grove k8s sandbox so others can test this out too.

@regisb
Copy link
Contributor Author

regisb commented Oct 17, 2024

Finding all the code related to edx-search and how it works has always been difficult for me, with the way it's spread out everywhere.

I totally get that. In such a case can I suggest moving this code to a separate repo? This would have several advantages:

  1. Reusable code in the context of edx-notes-api.
  2. No need to make changes to edx-platform so close to the release branch cutoff.
  3. Easily installable with Tutor.

Ah yes, of course! I realised this when I remembered that we're still using the old cs_comments_service as the default forum :( Good news is there's a Meilisearch SDK for ruby, but I still don't envy you or anyone else who has to work on the cs_comments_service code!

I don't intend to change the ruby cs_comments_service repo. Instead, I'll only work on the new forum v2 repo (in the process of being migrated to the openedx org). This repo will go live in Sumac.

@pomegranited
Copy link
Contributor

pomegranited commented Oct 17, 2024

@regisb Separate repo is fine with me.. so why not put this in edx-search ?

I don't intend to change the ruby cs_comments_service repo. Instead, I'll only work on the new forum v2 repo (in the process of being openedx/axim-engineering#1272 to the openedx org). This repo will go live in Sumac.

Ooo spectacular!

@regisb
Copy link
Contributor Author

regisb commented Oct 17, 2024

so why not put this in edx-search ?

That would be the most logical place. @bradenmacdonald you are (understandably) averse to all things related to edx-search, but since this is an implementation of an edx-search interface, do you agree that we should add the code there?

EDIT: here's the PR: openedx/edx-search#162 I created it early such that we can install edx-search in the Tutor Docker image for Sumac.

@bradenmacdonald
Copy link
Contributor

@regisb Yes that's totally fine with me! Thanks.

@regisb
Copy link
Contributor Author

regisb commented Oct 18, 2024

Closing in favour of openedx/edx-search#162

@regisb regisb closed this Oct 18, 2024
@regisb regisb deleted the regisb/meilisearch branch October 29, 2024 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants