Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of Sciline's Pipeline #165

Merged
merged 68 commits into from
Jun 3, 2024
Merged

Rewrite of Sciline's Pipeline #165

merged 68 commits into from
Jun 3, 2024

Conversation

SimonHeybrock
Copy link
Member

@SimonHeybrock SimonHeybrock commented May 8, 2024

Overview

This is a rewrite of Pipeline, as discussed in many places. Please see the rewrite.ipynb notebook for some context.

List of changes

  • TypeVars need to be constrained. The new mechanism eagerly inserts duplicate nodes based on this.
  • Mechanism to favor most-specialized provider is removed. The last provider being inserted is the one that is used. This may be a pitfall, but otherwise it is never clear if an insert should replace existing providers or not, and may even make it impossible to replace specialized providers with a generic one. The new implementation is very simple, so that is what we do.
  • Param table support removed, in favor of map and reduce based on Cyclebane. Not that groupby is not supported yet, but we have never used this in practice.
  • Provider input type hints must be unique.
  • __setitem__ and __getitem__ can be used to compose more complex graphs, sidestepping the dependency-injection-driven assembly of the base pipeline.

@SimonHeybrock SimonHeybrock changed the title V2 prototype Rewrite of Sciline's Pipeline May 8, 2024
@SimonHeybrock SimonHeybrock marked this pull request as ready for review May 8, 2024 08:27
@SimonHeybrock SimonHeybrock requested a review from jl-wynen May 8, 2024 08:27
Copy link
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed DataGraph yet. That will need more time.

pyproject.toml Outdated
@@ -30,6 +30,7 @@ requires-python = ">=3.10"
# Run 'tox -e deps' after making changes here. This will update requirement files.
# Make sure to list one dependency per line.
dependencies = [
"cyclebane @ git+https://github.com/scipp/cyclebane@fix-conceptual-naming-issues",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should merge this and release cyclebane before merging the current PR. I don't want to risk depending on a branch in a released package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still requires review.

self._args = args
self._kwargs = kwargs
self._return = return_

def __len__(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think __len__ is the best choice here. The user might wonder whether this includes the return annotation or not. (This is how len(__annotations__) works)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the same argument, isn't what existing methods like keys() return also confusing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. It should be more like argument_keys.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just an internal implementation detail, I suggest to leave this for now.

Comment on lines 232 to 235
out = self.map_keys(lambda arg: _bind_free_typevars(arg, bound=bound))
if self._return is not None:
out._return = _bind_free_typevars(self._return, bound=bound)
return out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved into map_keys?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that make the method name misleading (since it is not only mapping the keys any more)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. In Sciline, 'key' refers to the things that connect providers, i.e., domain types. So it applies to the return type as well.

But my question was in particular about whether it always makes sense to map the return key.

src/sciline/pipeline.py Outdated Show resolved Hide resolved
@@ -31,7 +31,7 @@ def as_float(x: int) -> float:

def make_task_graph() -> Graph:
pl = sl.Pipeline([as_float], params={int: 1})
return pl.build(float, handler=sl.HandleAsBuildTimeException())
return pl.build((float,), handler=sl.HandleAsBuildTimeException())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the interface of build and get different now? Why? This seems confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build is a lower-level method with a simpler interface. The complexity of distinguishing between single and multiple keys was moved in to get. build should (as before) not be used by users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why is it public?

@SimonHeybrock SimonHeybrock requested a review from jl-wynen May 16, 2024 12:04
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about performance. Did you check if the new solution is any slower than the old?
I know that building graphs should be fast compared to the actual computation. But given that this may be done per update in Beamlime, I would like to make sure that it doesn't add too much latency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cyclebane currently has some sub-optimal handling of graph creation after map-reduce. But I have not tested/benchmarked so far.

But given that this may be done per update in Beamlime

I don't think we should or plan to do that?

"""Returns the set of all TypeVars in a type expression."""
if isinstance(t, TypeVar):
return {t}
return set(itertools.chain(*map(_find_all_typevars, get_args(t))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chain of map looked a bit odd to me at first. Is it equivalent to this?

Suggested change
return set(itertools.chain(*map(_find_all_typevars, get_args(t))))
return set.union(*map(_find_all_typevars, get_args(t)))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't get that to work, can you fix the suggested syntax?

src/sciline/data_graph.py Outdated Show resolved Hide resolved
if key is NoneType:
raise ValueError('Key must not be None')
if key in self._graph:
self._graph.remove_edges_from(list(self._graph.in_edges(key)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._graph.remove_edges_from(list(self._graph.in_edges(key)))
self._graph.remove_edges_from(self._graph.in_edges(key))

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think NetworkX' in_edges returns a view, so that would break? @MridulS can correct me?

Copy link
Member

@MridulS MridulS May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's the view of the dictionary underneath, so need to create a new list object to remove the edges.

"""
# This is a questionable approach: Using MyGeneric[T] as a key will actually
# not pass mypy [valid-type] checks. What we do on our side is ok, but the
# calling code is not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why this fails in mypy?

Does this mean that every insertion of a param needs to have a type: ignore[valid-type] comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently MyGeneric[T] is not a valid type. MyGeneric[MyType] is ok. So no, not every insertion needs a type: ignore[valid-type], just the less common ones using a generic (not specialized).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to annotate __setitem__ with something more lenient? Maybe even Any? I don't like having legitimate usages flagged as errors. That is, I would prefer false negatives over false positives.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should chat about that, as I have struggled quite a bit coming up with something sensible that is not just Any everywhere.


def __setitem__(self, key: Key, value: DataGraph | Any) -> None:
"""
Provide a concrete value for a type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a data graph.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

src/sciline/data_graph.py Show resolved Hide resolved
docs/developer/architecture-and-design/rewrite.ipynb Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a bit of text explaining that type vars must have constraints?

"Then we can compute `Result` for each index in the parameter table:"
"Note how we used a node name of the pipeline as the column name in the parameter table.\n",
"For convenience we used a `pandas.DataFrame` to represent the table above, but the use of Pandas is entirely optional.\n",
"Equivalently the table could be represented as a `dict`, where each key corresponds to a column header and each value is a list of values for that column.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an example of this?

@SimonHeybrock
Copy link
Member Author

Are there more comments here that cannot be done in a follow-up?

Copy link
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with merging. But please make sure to add docstrings where missing in follow up work.

@SimonHeybrock SimonHeybrock merged commit 4a7b218 into main Jun 3, 2024
5 checks passed
@SimonHeybrock SimonHeybrock deleted the v2-prototype branch June 3, 2024 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants