Skip to content

Conversation

@cswartzvi
Copy link
Contributor

@cswartzvi cswartzvi commented Aug 31, 2024

Attempts to fix #1114 by introducing structural subtyping for Parallelizable and Collect.

Changes

Both Parallelizable and Collect were converted to generic protocols (vice abstract classes). The function is_parallelizable_type was also updated because generic protocols are not runtime checkable via issubclass - the function now checks the origin type via typing.get_origin.

How I tested this

As mentioned in #1114, I was having issues running the full test suite locally on my Windows 11 machine, but I was able to run all the tests where Parallelizable is used, that includes the following folders:

  • ./tests/execution/
  • ./tests/function_modifiers/
  • ./tests/resources/

Notes

I didn't find any usage of is_parallelizable_type in the codebase. Is this intended for a future feature?

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
@elijahbenizzy
Copy link
Contributor

Hey! Thanks for this -- makes sense at a high-level. Also the issue is quite helpful in highlighting the problem.

TL;DR -- originally I intended it to be a generator but that's over-engineered. Looks good!

So, two changes here:
(1) changing it from a Generator to an Iterable
(2) Changing it to be covariant types + using P

Covariant types make sense -- this is a pretty clear-cut case of where any subclass of V would work. Re: iterable -- generator was implemented by design (although not strictly necessary). Specifically, the future of this feature intends to allow more fine-grained pipelining for streaming results -- the send/yield from of a generator would be critical there.

The problem is that python has no builtin protocol for Generator (it's a concrete type) which forces us to use ABC, which then goes up against the same problem as earlier. The "right" way would be to define a protocol, but ughh, that's a bit ridiculous. This is hitting the limitations of python's (quite lacking) type system. All this as far as I understand, I'm not an expert in python's typing system.

All that said, we haven't actually made use of the fact that it's a generator, and we could always have another type (E.G. Stream). So talking myself into a circle, I think it's fine to loosen it to Iterable, then we can have a special case when we ant for a generator.

Mind adding a few comments on the type? Also while you're working on this, maybe we should change the name of U and V to ParallelizableElement and CollectElement? Or better names.

Appreciate this, and the chance to dig into python's type system!

@cswartzvi
Copy link
Contributor Author

Ah! So you were thinking about streaming via send - I thought that might be the case when I saw Generator as the base class. For what it's worth, I think that adding an explicit Stream type down the road, separate from Parallelizable, fits nicely with the overall library.

I feel your frustrations with python's type system - I have been using it pretty heavily over the last 3-4 years (I work with a lot of C# people and that is the only way to get them involved with python) but I still bump into rough paths all the time - like no builtin protocol for Generator, which I have seen before, but definitely did not remember.

Anyway, let me add those comments and update the type variable names - I decided to go with your suggestions for the names. Note that I also updated the commented out Sequential so that it will start in a consistent place if/when it is introduced.

@elijahbenizzy
Copy link
Contributor

Ah! So you were thinking about streaming via send - I thought that might be the case when I saw Generator as the base class. For what it's worth, I think that adding an explicit Stream type down the road, separate from Parallelizable, fits nicely with the overall library.

I feel your frustrations with python's type system - I have been using it pretty heavily over the last 3-4 years (I work with a lot of C# people and that is the only way to get them involved with python) but I still bump into rough paths all the time - like no builtin protocol for Generator, which I have seen before, but definitely did not remember.

Anyway, let me add those comments and update the type variable names - I decided to go with your suggestions for the names. Note that I also updated the commented out Sequential so that it will start in a consistent place if/when it is introduced.

Yep, I agree on the way to proceed. I'm just that weird one hoping for true compiled python at some point 😆

Appreciate the comments + changes + in-depth thoughts! This looks good, merging!

@elijahbenizzy elijahbenizzy self-requested a review September 1, 2024 18:48
Copy link
Contributor

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@elijahbenizzy elijahbenizzy merged commit d90212f into apache:main Sep 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants