Find indexes of item in list #20812

Open
rjthoen opened this issue Jan 20, 2025 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@rjthoen
Contributor

rjthoen commented Jan 20, 2025

Description

In #19894 index_of was introduced, which gets the index of the first occurrence of an item in a column/list when available. This is a very helpful feature, but why only return the first occurrence and not the first n or all of them?

Could index_of be changed to indexes_of where it returns all the occurrences of the item in the list and an empty list when the item is not found? To get back the current behavior (minus the return type) an optional keyword argument n = number of indexes to return could be added to this expression or to indexes_of_exact (like in split_exact).

Expanding on the example in the documentation for index_of we would get the following usage:

>>> df = pl.DataFrame({"a": [1, None, 17, 1]})
>>> df.select(
...    [
...        pl.col("a").indexes_of(1).alias("one"),
...        pl.col("a").indexes_of(17).alias("seventeen"),
...        pl.col("a").indexes_of(None).alias("null"),
...        pl.col("a").indexes_of(55).alias("fiftyfive"),
...    ]
... )
shape: (1, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ one       ┆ seventeen ┆ null      ┆ fiftyfive │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ list[u32] ┆ list[u32] ┆ list[u32] ┆ list[u32] │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ [0, 3]    ┆ [2]       ┆ [1]       ┆ []        │
└───────────┴───────────┴───────────┴───────────┘
rjthoen added the enhancement label Jan 20, 2025
@rjthoen
Contributor Author

rjthoen commented Jan 20, 2025

The output of the proposed indexes_of expression can already be achieved with polars:

import polars as pl

name_value_map = {
    "one": 1,
    "seventeen": 17,
    "null": None,
    "fiftyfive": 55
}

df = pl.DataFrame({"a": [1, None, 17, 1]})

(
    df
    .with_row_index()
    .select(
        [
            pl.col("index")
            .filter(
                # Use exactly one of the two options as the filter predicate.
                # Option A: with `index_of`
                # pl.col("a").index_of(value).over("index").is_not_null()

                # Option B: without `index_of`
                (pl.col("a") == value) |
                (pl.col("a").is_null() & (value is None))
            )
            .implode()
            .alias(name)
            for name, value in name_value_map.items()
        ]
    )
)

@cmdlineluser
Contributor

There is also .arg_true()

df.select(
    pl.col("a").eq_missing(value).arg_true().implode()
      .alias(name)
    for name, value in name_value_map.items()
)

# shape: (1, 4)
# ┌───────────┬───────────┬───────────┬───────────┐
# │ one       ┆ seventeen ┆ null      ┆ fiftyfive │
# │ ---       ┆ ---       ┆ ---       ┆ ---       │
# │ list[u32] ┆ list[u32] ┆ list[u32] ┆ list[u32] │
# ╞═══════════╪═══════════╪═══════════╪═══════════╡
# │ [0, 3]    ┆ [2]       ┆ [1]       ┆ []        │
# └───────────┴───────────┴───────────┴───────────┘

@rjthoen
Contributor Author

rjthoen commented Jan 21, 2025

There is also .arg_true()

Thanks @cmdlineluser, would you say that using the compound expression should be favored over adding a dedicated expression like indexes_of?

The result of index_of could (inefficiently) be replicated in a similar way:

df.select(
    pl.col("a").eq_missing(value).arg_true().first()
      .alias(name)
    for name, value in name_value_map.items()
)

If there is no intention to add an expression that returns all the indexes of an item in a column/list, then I would suggest the following:

  • Rename index_of to first_index_of to make the name more descriptive.
  • Mention in the documentation of (first_)index_of how to use the compound expression to find all indexes, as I can imagine this is a relatively common and closely related use case.
