performance problem: various table operations #575

Labels: performance 🏃 (Speed things up)
Concrete solution:

```python
if isinstance(rows, Table):
    if rows.number_of_rows == 0:
        return self
    if self.number_of_columns == 0:
        return rows
    different_column_names = set()
    different_column_names.update(set(self.column_names) - set(rows.column_names))
    if len(different_column_names) > 0:
        raise UnknownColumnNameError(
            sorted(
                different_column_names,
                key={val: ix for ix, val in enumerate(self.column_names)}.__getitem__,
            ),
        )
    new_df = pd.concat([self._data, rows._data]).infer_objects()
    new_df.columns = self.column_names
    schema = Schema.merge_multiple_schemas([self.schema, rows.schema])
    result = Table._from_pandas_dataframe(new_df, schema)
    return result
```
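The speedup claimed for this special case comes from concatenating the incoming table's `DataFrame` in one step instead of splitting it into rows first. A standalone sketch (not the library's code; sizes and timings are illustrative) of the two approaches in plain pandas:

```python
# Sketch: concatenating a whole DataFrame at once vs. boxing every row into
# its own one-row frame first (which is what splitting a Table into Row
# objects amounts to). Timings are machine-dependent.
import time

import pandas as pd

left = pd.DataFrame({"a": range(2_000), "b": range(2_000)})
right = pd.DataFrame({"a": range(2_000), "b": range(2_000)})

# Fast path: one concat of the two frames, as in the snippet above.
start = time.perf_counter()
fast = pd.concat([left, right], ignore_index=True).infer_objects()
fast_time = time.perf_counter() - start

# Slow path: materialize one single-row frame per row, then concat them all.
start = time.perf_counter()
pieces = [left] + [right.iloc[[i]] for i in range(len(right))]
slow = pd.concat(pieces, ignore_index=True)
slow_time = time.perf_counter() - start

assert fast.equals(slow)
print(f"fast: {fast_time:.4f}s, slow: {slow_time:.4f}s")
```

Both paths produce the same frame; only the number of intermediate objects differs.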
lars-reimann added a commit that referenced this issue on Apr 3, 2024:

> Fixes partially #575
>
> ### Summary of Changes
>
> Add special case to `add_rows` when adding a `Table`. The table is no longer split into a list of rows; instead, it is directly added to the `DataFrame`. In special cases, this may improve the performance up to 1000x.
>
> Co-authored-by: Lars Reimann <[email protected]>
lars-reimann pushed a commit that referenced this issue on Apr 17, 2024:

> ## 0.21.0 (v0.20.0...v0.21.0) (2024-04-17)
>
> ### Features
> * add ARIMA model (#577) (8b9c7a9), closes #570
> * Add ImageList class (#534) (3cb74a2), closes #528 #599 #600
> * more hash, sizeof and eq implementations (#609) (2bc0b0a)
>
> ### Performance Improvements
> * Add special case to `Table.add_rows` to increase performance (#608) (ffb8304), closes #606
> * improve performance of model & forward layer (#616) (e856cd5), closes #610
> * lazily import our modules and external libraries (#624) (20fc313)
> * treat Tables specially when calling add_rows (#606) (e555b85), closes #575
Describe the bug

Some `Table` operations take multiple hours to complete on large tables, although they should be much faster. When using pandas' `loc` functionality, these operations complete in a few seconds. These operations include:

* `filter_rows`
* `group_rows_by`
* `add_rows` (Table)
* `sort_rows` (probably, untested)
* `transform_column` (probably, untested)

The common problem of these operations is that the first step boxes the rows into a pandas `Series` and wraps them in a `Row` object (both of which lead to a huge overhead). Most of the time, the single row objects are only needed so they can be passed to the lambda function that is provided to these operations.
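The overhead described above is visible in plain pandas as well. A small sketch (not the library's code) contrasting a filter that materializes one `Series` per row against a single vectorized `.loc` mask:

```python
# Sketch: per-row filtering (iterrows() creates one pandas Series per row,
# analogous to wrapping each row in a Row object) vs. a vectorized mask.
import pandas as pd

df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

# Row-by-row: the predicate receives a freshly built Series for every row.
slow = df[[row["a"] % 2 == 0 for _, row in df.iterrows()]]

# Vectorized: a single boolean mask, evaluated column-wise with .loc.
fast = df.loc[df["a"] % 2 == 0]

assert slow.equals(fast)
```

The results are identical; only the slow variant pays the per-row boxing cost the issue describes.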
Possible resolutions:

* No `Row` object is passed to the lambda functions (for `filter_rows`, `group_rows_by`, `sort_rows`, `transform_column`). Instead, only pass a general type that is indexable with strings (column names), like `Mapping[str, Any]`, to access the values. Maybe also pass the schema, as the other needed operations on a `Row` object can be handled with a schema.
* `add_rows` could handle adding a whole table as a special case, so tables are not split into `Row`s first, avoiding the problem without changing the interface.

To Reproduce

Load a large table with `Table.from_csv_file` and run the operations listed above.
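The first resolution proposed above (passing a plain `Mapping[str, Any]` instead of a `Row`) could look roughly like this sketch. The function name and signature are hypothetical, not the library's actual API:

```python
# Hypothetical sketch: a filter that hands each row to the predicate as a
# plain Mapping[str, Any] built from itertuples(), avoiding the creation of
# a pandas Series (and Row wrapper) per row.
from typing import Any, Callable, Mapping

import pandas as pd

def filter_rows(
    df: pd.DataFrame,
    predicate: Callable[[Mapping[str, Any]], bool],
) -> pd.DataFrame:
    # itertuples() yields lightweight namedtuples; _asdict() turns each one
    # into a cheap mapping keyed by column name.
    mask = [predicate(row._asdict()) for row in df.itertuples(index=False)]
    return df.loc[mask]

table = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
even = filter_rows(table, lambda row: row["a"] % 2 == 0)
```

The predicate still reads values by column name, so existing lambdas keep working, but no `Series` is built per row. (Note: `itertuples()` requires column names to be valid Python identifiers.)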
Expected behavior
Faster execution of the listed operations
Screenshots (optional)
No response
Additional Context (optional)
No response