Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the search() method #119

Closed
Guikingone opened this issue Nov 23, 2020 · 23 comments · Fixed by #120
Closed

Improve the search() method #119

Guikingone opened this issue Nov 23, 2020 · 23 comments · Fixed by #120
Labels
breaking-change The related changes are breaking for the users SDK PHP

Comments

@Guikingone
Copy link
Contributor

Guikingone commented Nov 23, 2020

Hi everyone 👋

Small issue here (more a feature idea in fact), after using the search() method recently, I've faced a use case where a SearchResult object could be a good idea: sorting hits.

Actually, we can easily sort hits via native functions but it force use to grab the hits key, sort and put back it on the result array, it works but thinking on terms of DX, that's probably not an "easy" way of doing it.

So, the idea is to add a new class and update the search method:

    public function search($query, array $options = [], bool $asObject = false) // The return typehint cannot be specified until PHP 8
    {
        $parameters = array_merge(
            ['q' => $query],
            $options
        );

        $result = $this->http->post(self::PATH.'/'.$this->uid.'/search', $parameters);

        return $asObject ? SearchResult::create($result) : $result;
    }

Here's the SearchResult that I'm thinking about:

<?php

declare(strict_types=1);

namespace MeiliSearch;

use ArrayIterator;
use function array_filter;
use function array_key_exists;
use function count;
use function end;
use function is_array;

final class SearchResult
{
    /**
     * @var array<int, array>
     */
    private $hits = [];

    /**
     * @var int
     */
    private $offset;

    /**
     * @var int
     */
    private $limit;

    /**
     * @var int
     */
    private $nbHits;

    /**
     * @var bool
     */
    private $exhaustiveNbHits = false;

    /**
     * @var int
     */
    private $processingTimeMs;

    /**
     * @var string
     */
    private $query;

    /**
     * @var bool|null
     */
    private $exhaustiveFacetsCount;

    /**
     * @var array<string, mixed>
     */
    private $facetsDistribution;

    public static function create(array $result): SearchResult
    {
        $self = new self();

        // Build the object using the array

        return $self;
    }

    /**
     * {@inheritdoc}
     */
    public function filter(callable $callback): SearchResult
    {
        $results = array_filter($this->hits, $callback, ARRAY_FILTER_USE_BOTH);

        $this->hits = $results;
        $this->nbHits = count($results);

        return $this;
    }

    public function getHit(int $key, $default = null)
    {
        return $this->hits[$key] ?? $default;
    }

    /**
     * @return array<int, array>
     */
    public function getHits(): array
    {
        return $this->hits;
    }

    public function getOffset(): int
    {
        return $this->offset;
    }

    public function getLimit(): int
    {
        return $this->limit;
    }

    public function getNbHits(): int
    {
        return $this->nbHits;
    }

    public function getExhaustiveNbHits(): bool
    {
        return $this->exhaustiveNbHits;
    }

    public function getProcessingTimeMs(): int
    {
        return $this->processingTimeMs;
    }

    public function getQuery(): string
    {
        return $this->query;
    }

    public function getExhaustiveFacetsCount(): ?bool
    {
        return $this->exhaustiveFacetsCount;
    }

    /**
     * @return array<string, mixed>
     */
    public function getFacetsDistribution(): array
    {
        return $this->facetsDistribution;
    }

    /**
     * {@inheritdoc}
     */
    public function toArray(): array
    {
        return [
            'hits' => $this->hits,
            'offset' => $this->offset,
            'limit' => $this->limit,
            'nbHits' => $this->nbHits,
            'exhaustiveNbHits' => $this->exhaustiveNbHits,
            'processingTimeMs' => $this->processingTimeMs,
            'query' => $this->query,
            'exhaustiveFacetsCount' => $this->exhaustiveFacetsCount,
            'facetsDistribution' => $this->facetsDistribution,
        ];
    }

    /**
     * {@inheritdoc}
     */
    public function getIterator(): ArrayIterator
    {
        return new ArrayIterator($this->hits);
    }

    /**
     * {@inheritdoc}
     */
    public function count(): int
    {
        return $this->nbHits;
    }
}

The interesting method here is:

    /**
     * {@inheritdoc}
     */
    public function filter(callable $callback): SearchResult
    {
        $results = array_filter($this->hits, $callback, ARRAY_FILTER_USE_BOTH);

        $this->hits = $results;
        $this->nbHits = count($results);

        return $this;
    }

This one allows to filter the hits and return a filtered result without triggering a new search (a method retry() could trigger a new search if needed).

Of course, this class is not required as the developer can always retrieve an array but it could act as an helper when advanced use cases occurs and we need to mutate the result before playing with it.

Let me know if this improvements is a good idea and if it fits the current lib API, thanks for the feedback and have a great day 🙂

PS: Maybe the current API can be mutated to add a third argument in search (typed callable) that allows us to filter the result before retrieving it and without using an object if a new class is not an option.

@curquiza
Copy link
Member

curquiza commented Nov 24, 2020

Hello @Guikingone!
Thanks for this deeply detailed issue 😁
I like the filter method! However, I'm not sure to see the usecase: why would you like to sort the hits returned by MeiliSearch? It should be the job of MeiliSearch to give the perfect order. Or maybe do you also mean applying some changes like put a field in uppercase, not only "sorting"? I just want to be sure this issue is not a workaround of a relevancy issue you have with MeiliSearch 😉

Let's come back to your idea, tell me if I'm wrong, there would be two ways to use it:

search_result_object = $index->search('prince', options)
search_result_object->filters(your_callable)
print_r(serach_result_object->getHits)

or directly:

search_result_object = $index->search('prince', options, your_callable)
print_r(serach_result_object->getHits)

I like this callable possibility you introduce, I think this brings a great user experience 🙂

In the same idea, we (the Meili team) also noticed a need from the users: the need to sort/apply modification on the facetsDistribution, more than to the hits. What do you think about integrating this? Should it be the 4th parameter of the search method? Wouldn't it make the search usage to "heavy"? Would it be possible to write:

search_result_object = $index->search('prince', null, null, my_callable_for_facets_distribution)

?

I know the facetsDistribution sorting is not the main topic of this issue, if we implement it, it will be done in another issue and in another PR, but I want we all have it in mind when implementing the filter for hits. That's what I'm asking your POV 🙂

If you are around, what do you think @shokme and @ppshobi? Does the @Guikingone suggestion bring a great developer experience from your POV? 🙂

@Guikingone
Copy link
Contributor Author

Hello @curquiza 👋

Yes, sorry for the lack of details, here's the use case (maybe an edge case? 😄 ) that I'm facing:

  • Let's imagine that I made a search on foo and that I want to retrieve id and title (maybe for a list display)

I could specify facetFilters but what If I want to filter the hits this way:

  • Found in every hits that have no more than 3 tags (and specified in a list) and a title that's not longer than 20 words (weird use case but can be used on articles in a blog/users in a intranet)

This will not trigger a second search as I only need to filter the hits retrieved via MS, plus, I can use a specific filter/mutation on the hits as you mentioned (IMO, this could be a new document load but that's an other point).

On the usages, yes, both syntaxes are valid, IMO, the search method must return the SearchResult object or an array (as it does now), this way, the fluent syntax can be used.

The callable is another option that could filter the hits (as we don't really need to interact with other informations in some cases.

The use case that you're talking about could also be integrated IMO, it does not make the method "heavy" IMO as it's an optional argument.

If the "configuration" aspect of the method is important, maybe an array of "options" can be introduced:

    public function search($query, array $options = [], array $extraConfiguration = []): array
    {
        $parameters = array_merge(
            ['q' => $query],
            $options
        );

        $result = $this->http->post(self::PATH.'/'.$this->uid.'/search', $parameters);

        if (array_key_exist('hitsFilter', $extraConfiguration) {
            $result['hits'] = array_filter($result['hits'], $extraConfiguration['hitsFilter'], ARRAY_FILTER_USE_BOTH);
        }

        return $result;
    }

Same idea can be applied to facetsDistribution filter if needed.

@curquiza
Copy link
Member

Thanks for the details about your usecases, I understand better! 👍

I like your idea about extraConfiguration array.

Here what I suggest for the associated PR:

  • create a SearchResult class with the described methods
  • the search prototype is now:
search($query, array $search_params = [], array $options = [])

The $search_params only contains the MeiliSearch parameters during a search.
$options is all the options related to this SDK, like transformHits (and transformFacetsDstribution later). Naming could be discussed in the future PR. I just want to avoid the filtering term since it can be confused with the filters of MeiliSearch.

  • Thesearch method returns a SearchResult object. I'm not sure this is really useful to let the possibility to return an array instead of a SearchResult object, but if you think this is important, this option should be passed in the $options parameter of the search() method.

@curquiza
Copy link
Member

curquiza commented Nov 24, 2020

Also, about the nbHits and the count() method. I'm not sure, but I think there is a confusion. nbHits returned by MeiliSearch is the number of matches, not the number of hits. You can ask for a limit of 2, so the total number of hits returned by MeilISearch is 2, but in your dataset, you have 23 matches for your query. With this example, the nbHits would be set to 23 despite the length of the hits array would be 2.
Plus, (for the moment) this nbHits is always non-exhaustive. Indeed, exhaustiveNbHits always returns false, for the moment again. It means nbHits is not the real number of matches but an estimation. So, nbHits could not be used for pagination, for example.

The real count() you might want is the one you do in filter() which is count($this->hits).

I'm aware this is a bad naming coming from MeiliSearch, the name nbHits is really confusing because:

  • it does not return the exact number of hits but the number of matches
  • and this number is not even reliable for the moment because it's an estimation

Be sure I'm going to discuss this with the team.

TLDR; do not use nbHits to get the number of hits, use count($this->hits).

And sorry about this digression, but I wanted to be sure everyone is aware of this since I think this is not obvious from my POV.

@curquiza curquiza changed the title [API] Add SearchResult Improve the search() method Nov 24, 2020
@curquiza curquiza added breaking-change The related changes are breaking for the users SDK PHP labels Nov 24, 2020
@Guikingone
Copy link
Contributor Author

Guikingone commented Nov 24, 2020

The search method returns a SearchResult object. I'm not sure this is really useful to let the possibility to return an array instead of a SearchResult object, but if you think this is important, this option should be passed in the $options parameter of the search() method.

IMO, the search can return SearchResult, I was thinking about returning an array to not break the actual interface (which is not the goal of a minor version) but if it's not required, let's return directly the new class.

Thanks for the informations about nbHits and "matches", I'm gonna improve the SearchResult API according to these informations.

The PR should be up later this day and mention this issue 🙂

@ppshobi
Copy link
Contributor

ppshobi commented Nov 25, 2020

Even though I like the idea of having a DTO class for our search results, I would like to point out some cons here.

I'm not sure why something like this can be the burden of the user who wants to wrap the results in one of the classes that they like? Rather than we introduce a class from our side? I ask this because just like @Guikingone said and I quote

Of course, this class is not required as the developer can always retrieve an array but it could act as a helper when advanced use cases occurs and we need to mutate the result before playing with it.

I personally don't like mutating the data via some callbacks/static methods (Due to the fact that unexpected side effects can happen through these kinds of opportunities,) But if you want it for some reasonm it can be done from the user land, rather than we enforce it. without any performance issues or confusions.

Also having a class like this being the default return types can create breaking changes quite often, rather than we return a plain array, since every client depends on it understand it without any additional modification( eg: laravel-scout)

Having a plain array return type will always make sure that we remain cross compatible with meilisearch-core (well, mostly) and the clients depend on us without any additional effort. Otherwise we will have to keep adding methods into this giant class for every field or functionality meilisearch-core introduces.

So why not keep the use of these type of classes in the user land itself?

@Guikingone
Copy link
Contributor Author

Guikingone commented Nov 25, 2020

@ppshobi Actually, it depends regarding the return typehint, the first idea was to remove the : array and add a third arguments that allows to receive either an array of the class, this way, there's not BC breaks.

I personally don't like mutating the data via some callbacks/static methods

Actually, filter() does not mutate the data but the SearchResult object, a new hits array is passed through the new instance and the old one stays available in SearchResult::getRaw().

When it comes to the clients, I don't know if Rust supports this type of "DTO" but the Go one can easily support it IMHO, when it comes to bundles, this does not break things if the SDK remove the return typehint.

When it comes to adding methods and improving the class over time, keep in mind that every other classes requires to be updated according to new MS features, that's a normal thing IMHO, I agree that this issue is not about a new feature.

Again, that's just an idea/suggestion, if it doesn't fit, you can easily close the PR/issue and keep the current API as it 🙂

@ppshobi
Copy link
Contributor

ppshobi commented Nov 26, 2020

I think the whole thing should be on the userland, rather than we package it ourselves. let the user create abstractions if required.
Also, we don't use this library across many languages, rather it's used by libraries like laravel-scout and other possible clients like symfony-meilisearch. I was thinking that we might break them unnecessarily with the introduction of this class, where the simple solution does what it requires.

@curquiza
Copy link
Member

curquiza commented Dec 3, 2020

Hello @ppshobi, thanks for your feedback 😄

I personally would rather manipulate an instance of a class instead of a raw array. That's why I like using an SDK instead of a really basic API wrapper: having classes and methods provided by the library to make the experience easier. That's why I find the idea of @Guikingone is a good idea because it improves the dev experience IMO.

About the maintenance issue, I understand this class brings more maintenance, but I'm not afraid about this because we indeed work closely with the core team of MeiliSearch: make all the SDKs up-to-date with the last version of MeiliSearch is always a priority for us. So the changes in the SearchResult will be always in the pipeline of our team (the integration team).
Plus, I think this would not involve too much maintenance for a better dev experience.

About the potential increase of breaking changes that could impact the libraries depending on meilisearch-php (laravel-scout and symfony plugins): I'm not really sure to understand how this could happen more often.
Ex: If a field in the MeiliSearch response is renamed during a search (let's say hits becomes matches), it would be breakable for everyone using the hits, no matter what they use:

  • if the SearchResult instance is returned: it means they used the getHitsmethod, now renamed into getMatches, so this is indeed breaking.
  • if an array is returned: they used the response['hits'] to get the data, so now it should be response['matches'], so this is also breaking.

But maybe I miss something.
In any case, I'm currently asking the help/advice of the creators of the symfony and laravel plugins to be sure I don't miss anything. I'm currently waiting for their feedback. Sorry @Guikingone, it might make your PR a little longer in the pipeline before merging it (or not). Sorry about this!

@ppshobi
Copy link
Contributor

ppshobi commented Dec 3, 2020

For example

let's say hits becomes matches

This SDK doesn't need any changes at all (not that the work is heavy to modify the codebase), only the user-land libraries or code needs to be updated. They will know this this issue even without reading the exact sdk documentation, rather meilisearch documentation/upgrade guide would suffice. So, WE are not breaking anything.

Plain arrays are so flexible it does whatever we need, for example we don't need to do any additional work on serialization, deserialization, encoding/decoding etc... Also, this will ensure that multiple meilisearch version work among many versions of SDK.

Another example is when meilisearch introduces a new field, then also we don't need to do any work, by default it exposes all the fields as simple as possible (i.e: No need to call getRaw() )

This class abstracts the array and add getters on top of it. Which you can do from the array itself, just that the syntax is different and less work and simple.

Call me lazy 😂 , But why take airplane to goto another room where you can simply go there.

Plus the algolia and elasticsearch api clients return plain arrays.

@Guikingone
Copy link
Contributor Author

So, WE are not breaking anything.

I'm not okay with this statement as for now, previous versions have introduced BC breaks (think about the introduction of Indexes as return type hint), so IMHO, breaking things in a version which is not tagged as the first stable one could be something that the team does not worry about.

BUT, obviously, when this SDK is tagged 1.*, breaking things is no longer acceptable in a minor version.

Plain arrays are so flexible it does whatever we need

Flexible, yes, performant, that's a point that need to be debated IMO 😄
Plus, there's not work on serialization/deserialization "for now" as the result is calculated/built via the class.

When it comes to introducing a new field, for now, it only needs a getter/attribute, IMHO, that's not a lof of work and BTW, every SDK needs to be updated if a new field is introduced.

That's just my vision here, nothing personal against your message 🙂

@curquiza
Copy link
Member

curquiza commented Dec 3, 2020

Yes, the examples you mentioned are what I mean about maintenance I'm not really afraid of. This will be the job of the Meili Integration team to update it!
And you're not lazy 😂 that's really kind of you to mention all the problems we can encounter to be sure everyone is aware of it!

Yes Algolia and Elastic return arrays, but they have a lot of other high-level tools (using those API wrappers) we don't have yet. So fewer people use the client directly I guess. We already have symfony and laravel plugins but that's just the beginning. That why I would take the opportunity a contributor as @Guikingone provides this kind of feature, improving the dev experience of our current libraries.
Plus, libraries I like, as Twilo and Stripe, mostly use classes if I'm not wrong. I think both are doable, even if you're right to compare us to Elastic and Algolia first.

@ppshobi
Copy link
Contributor

ppshobi commented Dec 3, 2020

@Guikingone

I'm not okay with this statement as for now, previous versions have introduced BC breaks (think about the introduction of Indexes as return type hint), so IMHO, breaking things in a version which is not tagged as the first stable one could be something that the team does not worry about.

I can agree here that BC breaks are bound to happen, but my question is why do it unnecessarily when a simple solution covers that for free

Flexible, yes, performant, that's a point that need to be debated IMO 😄

So, you mean a class over an array is more performant?

Plus, there's not work on serialization/deserialization "for now" as the result is calculated/built via the class.

I didn't get this part, but Imagine you need to send the search result to another backend; how would you do it for example you want to encode it in url?

@Guikingone
Copy link
Contributor Author

So, you mean a class over an array is more performant?

Depending on the use case, yes, here's a link (not the one that I wanted to share but the first that I retrieved in my bookmarks 😄 ).

I didn't get this part, but Imagine you need to send the search result to another backend; how would you do it for example you want to encode it in url?

It depends, let's imagine that you need to send it to a REST API, you can just:

json_endode($result->getHits());

Same applies for URL's, it depends on the required format, BTW, encoding the result of a search into the URL is strange IMHO 🤔

If the idea is to send it to another backend, it depends on the format required but actually, the SearchResult is not meant to be serialized IMO as it's only a representation of the search.

@ppshobi
Copy link
Contributor

ppshobi commented Dec 3, 2020

From the link you shared seems like the article got translated from some other language using google translate. I was not able to understand a thing. Something was missing and no citations for any arguments. Regardless we are not replacing the array rather this class is on top of the array, anyway array access happens all the time. So, the performance argument is not valid here.

Regarding the second part, I was saying

json_endode($result['hits']);
json_encode($result); 
json_encode($result['any']['nested']['element']);

all of the above would work if the class is not there.

instead of $result->getHits() the $result['hits'] gives us the same thing with no work or maintenance in the future and keeps the simplicity. Of course, I agree that class adds the syntactical sugar, but that is it, I think.

This is my personal opinion, Because I have struggled with unnecessary abstractions in codebase. If everyone else is happy, let's go with it.

@Guikingone
Copy link
Contributor Author

IMO, the performances argument still relevant, the link that I've shared don't, here's the ressources I've been looking for:

I agree on the simplicity, problem is, when it comes to interacting with the "results" array, using an object is a better idea and a better DX without a lot of extra layers, again, I could be wrong 🙂

@ppshobi
Copy link
Contributor

ppshobi commented Dec 5, 2020

I think it comes down to the problems similar to the single vs double quote debates. In reality we rarely bother about them. From the links that you have given, the associative arrays were was assessed by sorting a really large array where the classed version was performing well with a barely noticeable change in performance with less memory. But at the same time serialization performance were really subpar. The pros and cons are anyways there. In our case we are just accessing elements of one array, so maybe it would have been better, If you were showing the access time for class vs array.

Regardless we can argue more about why such a thing in real scenario would be happening and when optimizing such thing is needed. A real performance optimization can be where, when we open multiple DB connections and try to reuse some of those connections, which can have significant performance gains. Which is far productive than of debating about whether we should keep the connection string in double or single quotes.

@curquiza
Copy link
Member

curquiza commented Dec 10, 2020

Hello @Guikingone and @ppshobi!
Sorry for the delay, thank you for your patience and most of all for your involvement! 👍

I discussed with the laravel and symfony plugin maintainers: there are ok with this change since the Meili team is comfortable with announcing breaking changes -> Yes, we are, we are trying to create the best tools in terms of dev experiences. And this, unfortunately, involves breaking changes, which is ok since we are in an early stage.

About performance issues the users may encounter in some context: we could also (this would need a new issue and a new PR) add an option to the options parameter of the search method: getRawSearch: false by default. If true, there would be no class manipulation.

@Guikingone
Copy link
Contributor Author

Hello @curquiza, sorry for the delay

I agree on the new option, IMHO, we could implement this behaviour in the current PR without taking too much time and BC breaks, the idea is to have 2 type of return if I'm right?

  • The SearchResult class
  • The default array

@curquiza curquiza linked a pull request Dec 15, 2020 that will close this issue
@curquiza
Copy link
Member

Hello @Guikingone! No worries for the delay: thank you for still being involved in this issue 👍

Yes, if the search method is called with the getRawSearch options, the method returns an array without creating any SearchResult instance (for performance issue, if I understand well). An example of call:

search('prince', null, [ 'getRawSearch': true ]);

By default, the search method returns a SearchResult instance.

Another addition I would like to see: a toJson method in the SearchResult class like the Stripe's one: https://github.com/stripe/stripe-php/blob/3febf22da8ed6a3b45842eac97d0fcece4d91e13/lib/StripeObject.php#L458-L466

If you don't have time to do one or both changes, tell me, I will do them after merging your PR 🙂

@curquiza
Copy link
Member

(I don't know your opinion about this: but I can't wait for threads in GitHub issues!)

@Guikingone
Copy link
Contributor Author

(I don't know your opinion about this: but I can't wait for threads in GitHub issues!)

❤️ 😄

No problem on my side to handle the modifications, I'll take a look at it tonight 🙂

@Guikingone
Copy link
Contributor Author

Small update, I've pushed a first POC of the raw option, the linter seems to fail on a strange rules, I'll take a look tomorrow morning 🙂

bors bot added a commit that referenced this issue Dec 21, 2020
120: Indexes::search() improvements r=curquiza a=Guikingone

Hi everyone 👋 

As mentioned here: #119, here's the `SearchResult` class and the improvements that comes with it. 

There's just a small "DX" way that I'm not sure about: 

- When using `SearchResult::filter()` should the `hits` array be reset to 0 or stay as it? 

IMO, the array could stays as it as we return a new `SearchResult` object but sometimes, it could be good idea to reset the keys from 0 (loops, etc) 🤔 

Thanks for the feedback and have a great day 🙂 

Co-authored-by: Loulier Guillaume <[email protected]>
@bors bors bot closed this as completed in #120 Dec 21, 2020
bors bot added a commit to meilisearch/meilisearch-laravel-scout that referenced this issue Jan 27, 2021
80: Fix getTotalCount() method r=curquiza a=curquiza

Remove `nbHits` usage because this is not reliable information.

The Meili team is aware of this. Here are the different issues and comments about it to explain why it's confusing, and why we should not use it:
- meilisearch/meilisearch#1120
- meilisearch/documentation#561
- meilisearch/meilisearch-php#119 (comment)

TLDR;
`nbHits` is not reliable for pagination because can be exhaustive or not, depending on the value of `exhaustiveNbHits` that MeiliSearch returns which is always `false` for the moment.

We are sorry for this. We all hope this confusion will be fixed asap in MeiliSearch.

⚠️ The linter error in the CI will be fixed with #82  

Co-authored-by: Clémentine Urquizar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change The related changes are breaking for the users SDK PHP
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants