Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add DataFrame.iter_rows #317

Merged
merged 9 commits into from
Jun 21, 2024

Conversation

Priyansh121096
Copy link
Contributor

@Priyansh121096 Priyansh121096 commented Jun 18, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

I've matched the API with https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rows.html

@github-actions github-actions bot added the enhancement New feature or request label Jun 18, 2024
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is amazing, well @Priyansh121096 for figuring it all out! πŸ™Œ

I just have some minor comments really - do you have time / interest to address them? if not no worries, I can take it from here, lmk your preference


We define a library agnostic function:

>>> def func(df_any, named):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but this is a "boolean trap" - can we make named keyword-only for the example please?

i.e.

func(df_any, *, named):

then, when you call it - func(df_pd, named=False), etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

),
],
)
@pytest.mark.filterwarnings("ignore:Determining|Resolving.*")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this in this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

],
)
@pytest.mark.filterwarnings("ignore:Determining|Resolving.*")
def test_rows(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you make a new file tests/frame/rows_test.py for this one please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is great! Thanks for the PR! I left a comment to add type hinting 😁
P.s. it would require some additional imports

Comment on lines 631 to 633
def rows(
self, *, named: bool = False
) -> list[tuple[Any, ...]] | list[dict[str, Any]]:
Copy link
Member

@FBruzzesi FBruzzesi Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a great addition to add type hinting for the two cases:

Suggested change
def rows(
self, *, named: bool = False
) -> list[tuple[Any, ...]] | list[dict[str, Any]]:
@overload
def rows(
self, *, named: Literal[True]
) -> list[dict[str, Any]]:
@overload
def rows(
self, *, named: Literal[False]
) -> list[tuple[Any, ...]]:
@overload
def rows(
self, *, named: bool
) -> list[tuple[Any, ...]] | list[dict[str, Any]]:
def rows(
self, *, named: bool = False
) -> list[tuple[Any, ...]] | list[dict[str, Any]]:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we want to put these overloads in an if TYPE_CHECKING: block. What do you think @FBruzzesi ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing it!

What do you think @FBruzzesi ?

This is personal taste grey zone! I will let @MarcoGorelli make the final call 😁

Copy link
Member

@MarcoGorelli MarcoGorelli Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we want to put these overloads in an if TYPE_CHECKING: block

I didn't know you could do that 😳 What you've done looks good to me anyway, it's probably nice to have the overloads near the definitions? no strong opinion

I think what you've done here looks great πŸ™Œ

@Priyansh121096
Copy link
Contributor Author

Thanks for your comments. I'll address them soon.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing stuff! thanks a tonne @Priyansh121096

Just got a question about this functionality - instead of DataFrame.rows, couldn't we just add DataFrame.iter_rows?

Because then, if someone wants a list, then can do

rows = list(df.iter_rows())

However, if someone is just going to loop over the rows, then

for row in df.iter_rows():
    # do something with `row`

would be more efficient than

for row in `df.rows()`:
    # do something with `row`

Comment on lines 631 to 633
def rows(
self, *, named: bool = False
) -> list[tuple[Any, ...]] | list[dict[str, Any]]:
Copy link
Member

@MarcoGorelli MarcoGorelli Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we want to put these overloads in an if TYPE_CHECKING: block

I didn't know you could do that 😳 What you've done looks good to me anyway, it's probably nice to have the overloads near the definitions? no strong opinion

I think what you've done here looks great πŸ™Œ

@Priyansh121096
Copy link
Contributor Author

Just got a question about this functionality - instead of DataFrame.rows, couldn't we just add DataFrame.iter_rows?

@MarcoGorelli I was going to propose something similar (you beat me to it πŸ˜„). I was thinking we could expose both rows and iter_rows since the polars API does as well.

One issue with this one is I'm not sure if pandas has a convenient equivalent to this when named=True. Rest of the cases are fine:

  1. polars + named=False = df.iter_rows(named=False)
  2. polars + named=True = df.iter_rows(named=True)
  3. pandas + named=False = df.itertuples(index=False, name=None)
  4. pandas + named=True = ?

Please let me know if you're aware of a pandas API which returns an iterator for iterating over rows as dictionaries.

@FBruzzesi
Copy link
Member

@Priyansh121096 I tried to play with the into argument of .to_dict() method, sadly with no success.

I guess that 4. can become iter(df.to_dict("records"))?!

@MarcoGorelli
Copy link
Member

How about using https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict (which , despite the underscore, is public):

something like (simplified)

def iter_rows(df):
    yield from (row._asdict() for row in df.itertuples(index=False))

@Priyansh121096
Copy link
Contributor Author

I guess that 4. can become iter(df.to_dict("records"))?!

I feel like this defeats the purpose of using iter_rows over rows though.

How about using https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict (which , despite the underscore, is public):

Amazing! This should work.

@Priyansh121096
Copy link
Contributor Author

@MarcoGorelli I'll raise another PR for iter_rows soon. Can we merge this one?

@MarcoGorelli
Copy link
Member

thanks - tbh I'm not really sure about DataFrame.rows, I might even suggest deprecating it altogether in Polars itself

could we just repurpose this one for DataFrame.iter_rows please? sorry for not having thought about this straight away

@Priyansh121096
Copy link
Contributor Author

could we just repurpose this one for DataFrame.iter_rows please?

@MarcoGorelli pushed a change for this.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @Priyansh121096 !

narwhals/dataframe.py Outdated Show resolved Hide resolved
@MarcoGorelli
Copy link
Member

@pre-commit.ci autofix

@MarcoGorelli MarcoGorelli merged commit 4018aec into narwhals-dev:main Jun 21, 2024
15 checks passed
@Priyansh121096 Priyansh121096 deleted the feat-rows branch June 21, 2024 16:24
@Priyansh121096 Priyansh121096 changed the title feat: add DataFrame.rows feat: add DataFrame.iter_rows Jun 21, 2024
@Priyansh121096 Priyansh121096 mentioned this pull request Jun 29, 2024
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enh]: Add support for DataFrame.rows
3 participants