
Document rationale for numba #2427

Closed
MaxGhenis opened this issue Jul 13, 2020 · 7 comments

Comments

@MaxGhenis (Contributor) commented Jul 13, 2020

With documentation moving to jupyter-book (#2420), we'll have a pretty home for documenting the taxcalc API (#2421, #2422). Much of this will document the user-facing parts of the API, e.g. calc_all(), but calcfunctions.py has many other functions that could be useful. However, in my experience, the design of using numba with functions that take scalars makes it harder to reproduce the underlying logic than vectorized functions that take a DataFrame with the right columns and return a Series (e.g. for calculating AGI or a particular deduction). The chosen design is also more verbose in many cases. I'd suggest documenting the rationale for this choice in case users are curious about it, and in case there's a way to run those functions directly (I didn't fully follow #292).
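To illustrate the contrast described above, here is a hedged sketch of the same logic written in the two styles. The variable names (e00300 for taxable interest, e00600 for dividends) follow taxcalc conventions, but the functions themselves are illustrative, not the actual taxcalc code or API:

```python
import numpy as np

# Scalar style: one tax unit at a time. In taxcalc, a function like this
# would be wrapped by @iterate_jit and compiled with numba.
def invinc_base_scalar(e00300, e00600):
    # hypothetical exclusion base = interest + dividends
    return e00300 + e00600

# Vectorized style: the same logic applied to whole columns at once.
def invinc_base_vectorized(data):
    return data["e00300"] + data["e00600"]

data = {"e00300": np.array([100.0, 0.0]),
        "e00600": np.array([50.0, 25.0])}
```

The vectorized version computes every row in a single call and can be run directly against any DataFrame-like column store, while the scalar version must be looped (or jitted) over rows.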

In this test, I found that numba reduced the runtime by 75%, once it compiled, compared to a vectorized function taking a DataFrame. This is a basic reproduction of the ALD_InvInc_ec_base function only, without taxcalc's jitting, celery as mentioned here to avoid the need to warm up, etc., so I don't know if it's representative of overall time savings.

This could relate to the general taxcalc speed discussion in #2376.

@hdoupe (Collaborator) commented Jul 14, 2020

@MaxGhenis It seems like the main advantages of using numba are performance and code-readability. @MattHJensen has mentioned in the past that the code was difficult to work with and reason about before the iterate_jit decorator was introduced. Here's an example of some of the code that used the dataframe/vectorized based approach:
[Screenshot from 2020-07-14: example of the earlier DataFrame/vectorized code]

ref: #88; iterate_jit was actually introduced in #102.
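For readers unfamiliar with the decorator, here is a much-simplified, pure-Python sketch of the idea behind iterate_jit. The real decorator in taxcalc's decorators.py also numba-compiles the loop and writes results back into the Records object; this stand-in (iterate_rows and the function under it are hypothetical names) only shows the row-iteration concept:

```python
import inspect
import numpy as np

def iterate_rows(func):
    """Simplified stand-in for taxcalc's iterate_jit: apply a
    scalar-valued function row by row across array-valued inputs.
    The real decorator compiles this loop with numba for speed."""
    names = list(inspect.signature(func).parameters)
    def wrapper(**arrays):
        n = len(arrays[names[0]])
        return np.array([func(*(arrays[k][i] for k in names))
                         for i in range(n)])
    return wrapper

@iterate_rows
def combined_income(wages, interest):
    # scalar logic, written one tax unit at a time
    return wages + interest
```

Calling combined_income(wages=np.array([1000.0, 2000.0]), interest=np.array([10.0, 20.0])) then returns an array of per-row results, so the tax logic can be written in scalar terms while still operating over whole columns.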

I've done a couple of experiments that rework how the JIT-ing is applied. Neither is complete, but I got far enough along that both seemed likely to work:

  • df-based: uses a DataFrame as the data store in Records instead of a collection of
  • add-dask-option: uses Dask to parallelize the calculations on Dask DataFrames.

I worked on add-dask-option first and then circled back with df-based, storing all data as a DataFrame to make it easier to add Dask support. Both branches are pretty stale at this point, but if anyone is interested in these projects, I would love to pick them back up. (I would also be much more inclined to do so if someone else wanted to help out.)

One last thing: a lot of the iterate_jit code was added several years ago, and I would be surprised if there weren't off-the-shelf solutions we could use to simplify and/or improve it.

@MaxGhenis (Contributor, Author)

Thanks for the background @hdoupe. I'm not sure about the readability argument (that screenshot could be implemented as this vectorized function calculating tax liability from an MTR structure), but JIT-ing does seem to improve performance.
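For context, a vectorized bracket-tax calculation of the kind referenced above might look like the following sketch. The thresholds and rates in the usage check are made-up illustrative values, not an actual taxcalc schedule:

```python
import numpy as np

def bracket_tax(income, thresholds, rates):
    """Tax liability under a marginal-rate schedule, computed for a
    whole array of incomes at once; len(rates) == len(thresholds) + 1."""
    income = np.asarray(income, dtype=float)[:, None]
    lower = np.concatenate(([0.0], thresholds))  # bracket floors
    width = np.append(np.diff(lower), np.inf)    # bracket widths (top unbounded)
    taxed = np.clip(income - lower, 0.0, width)  # income falling in each bracket
    return (taxed * np.asarray(rates)).sum(axis=1)
```

For example, with thresholds [10000, 40000] and rates [0.1, 0.2, 0.3], an income of 50000 owes 10000*0.1 + 30000*0.2 + 10000*0.3 = 10000. A numba-jitted scalar equivalent would instead loop over brackets one tax unit at a time; both styles express the same marginal-rate logic.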

These are nice experiments; I'll DM you.

@jdebacker (Member)

This issue highlights another benefit of the work in PR #2585

@martinholmer (Collaborator) commented Sep 28, 2021

@jdebacker said before closing this issue:

This issue highlights another benefit of the work in PR #2585

I don't understand this comment given that @MaxGhenis reports in this issue that:

In this test, I found that numba reduced the runtime by 75%, once it compiled, compared to a vectorized function taking a DataFrame. This is a basic reproduction of the ALD_InvInc_ec_base function only, without taxcalc's jitting, celery as mentioned here to avoid the need to warm up, etc.

Given Max's test results, why would you expect the work in PR #2585 to do anything other than slow down Tax-Calculator?

@jdebacker (Member)

@martinholmer you know that one has to balance:

  1. Run time
  2. Programming time

Max provides some evidence on (1) (though it may be biased toward numba, given the small example dataset and the fact that Dask DataFrames weren't used). At the same time, Max raises (2), noting that the API documentation is limited in that it can't see into the jitted functions. This affects programming time because it makes it harder for contributors/users to understand how the model works. Other discussion suggests that, when reading the source code, the logic may be easier to follow for jitted than for vectorized functions, but there's still a question of how much clearer the code is.

In any case, I'd welcome you opening a PR to make the documentation more clear (Max's original comment).

@martinholmer (Collaborator) commented Sep 28, 2021

@jdebacker said in closed issue #2427:

Max references (2), noting that the API documentation is limited in that it can't see into the jitted functions.

Not sure what he means by "can't see into the jitted functions". The pre-jitted functions are in the calcfunctions.py module, so he must mean can't see into the post-jitted functions. I'm not sure why anybody would value seeing the post-jitted functions. Anybody can use the pre-jitted functions in the Python debugger if they are worried about bugs in the tax calculation logic or if they just want to understand how the tax logic works.

@jdebacker continues:

This [not being able to see into the post-jitted functions] affects programming time because it makes it harder for contributors/users to understand how the model works.

I simply don't understand what you're saying. "How the model works" to compute taxes is clearly laid out in the pre-jitted calcfunctions.py code. If you want to see something much less easy to understand, take a look at parts of the vectorized code in PR #2585. Why exactly would anybody want to look into the post-jitted code?

And @jdebacker concludes by saying:

In any case, I'd welcome you opening a PR to make the documentation more clear (Max's original comment).

I don't think it is my role to document something that I did not create. The jitting decorators used in the calcfunctions.py module and specified in the decorators.py module are the creation of @talumbau who, like me, is no longer part of the Tax-Calculator team.

@jdebacker (Member)

@martinholmer I don't know what you mean by "team" or "role". This is a public repository for an open source project. The project would welcome contributions from you (or anyone else) that help with documentation (or other aspects of the project).
