
Document rationale for numba #2427

Closed
MaxGhenis opened this issue Jul 13, 2020 · 7 comments

Comments

@MaxGhenis (Contributor) commented Jul 13, 2020

With documentation moving to jupyter-book (#2420), we'll have a pretty home for documenting the taxcalc API (#2421, #2422). Much of this will document the user-facing parts of the API, e.g. calc_all(), but calcfunctions.py has many other functions that could be useful. However, in my experience, the design of using numba with functions that take scalars makes it harder to reproduce the underlying logic than vectorized functions that take a DataFrame with the right columns and return a Series (e.g. for calculating AGI or a particular deduction). The chosen design is also more verbose in many cases. I'd suggest documenting the rationale for this choice in case users are curious about it, and in case there's a way to run those functions directly (I didn't fully follow #292).
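To illustrate the contrast described above, here is a hedged sketch of the same logic written in the two styles. The variable names (e00300 for taxable interest, e00600 for dividends) follow taxcalc conventions, but the functions themselves are illustrative, not the actual taxcalc code or API:

```python
import numpy as np

# Scalar style: one tax unit at a time. In taxcalc, a function like this
# would be wrapped by @iterate_jit and compiled with numba.
def invinc_base_scalar(e00300, e00600):
    # hypothetical exclusion base = interest + dividends
    return e00300 + e00600

# Vectorized style: the same logic applied to whole columns at once.
def invinc_base_vectorized(data):
    return data["e00300"] + data["e00600"]

data = {"e00300": np.array([100.0, 0.0]),
        "e00600": np.array([50.0, 25.0])}
```

The vectorized version computes every row in a single call and can be run directly against any DataFrame-like column store, while the scalar version must be looped (or jitted) over rows.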

In this test, I found that numba reduced the runtime by 75%, once it compiled, compared to a vectorized function taking a DataFrame. This is a basic reproduction of the ALD_InvInc_ec_base function only, without taxcalc's jitting, celery as mentioned here to avoid the need to warm up, etc., so I don't know if it's representative of overall time savings.

This could relate to the general taxcalc speed discussion in #2376.

@hdoupe (Collaborator) commented Jul 14, 2020

@MaxGhenis It seems like the main advantages of using numba are performance and code-readability. @MattHJensen has mentioned in the past that the code was difficult to work with and reason about before the iterate_jit decorator was introduced. Here's an example of some of the code that used the dataframe/vectorized based approach:
[Screenshot from 2020-07-14: example of the earlier DataFrame/vectorized code]

ref: #88; iterate_jit was actually introduced in #102.
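For readers unfamiliar with the decorator, here is a much-simplified, pure-Python sketch of the idea behind iterate_jit. The real decorator in taxcalc's decorators.py also numba-compiles the loop and writes results back into the Records object; this stand-in (iterate_rows and the function under it are hypothetical names) only shows the row-iteration concept:

```python
import inspect
import numpy as np

def iterate_rows(func):
    """Simplified stand-in for taxcalc's iterate_jit: apply a
    scalar-valued function row by row across array-valued inputs.
    The real decorator compiles this loop with numba for speed."""
    names = list(inspect.signature(func).parameters)
    def wrapper(**arrays):
        n = len(arrays[names[0]])
        return np.array([func(*(arrays[k][i] for k in names))
                         for i in range(n)])
    return wrapper

@iterate_rows
def combined_income(wages, interest):
    # scalar logic, written one tax unit at a time
    return wages + interest
```

Calling combined_income(wages=np.array([1000.0, 2000.0]), interest=np.array([10.0, 20.0])) then returns an array of per-row results, so the tax logic can be written in scalar terms while still operating over whole columns.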

I've done a couple of experiments that rework how the JIT-ing is applied. Neither is complete, but I got far enough along that both seemed likely to work:

  • df-based: uses a DataFrame as the data store in Records instead of a collection of
  • add-dask-option: uses Dask to parallelize the calculations on Dask DataFrames.

I worked on add-dask-option first and then circled back with df-based, storing all data as a DataFrame to make it easier to add Dask support. Both branches are pretty stale at this point, but if anyone is interested in these projects, I would love to pick them back up. (I would also be much more inclined to do so if someone else wanted to help out.)

One last thing: a lot of the iterate_jit code was added several years ago, and I would be surprised if there weren't off-the-shelf solutions we could use to simplify and/or improve it.

@MaxGhenis (Contributor, Author)

Thanks for the background @hdoupe. I'm not sure about the readability argument (that screenshot could be implemented as this vectorized function calculating tax liability from an MTR structure), but JIT-ing does seem to improve performance.
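For context, a vectorized bracket-tax calculation of the kind referenced above might look like the following sketch. The thresholds and rates in the usage check are made-up illustrative values, not an actual taxcalc schedule:

```python
import numpy as np

def bracket_tax(income, thresholds, rates):
    """Tax liability under a marginal-rate schedule, computed for a
    whole array of incomes at once; len(rates) == len(thresholds) + 1."""
    income = np.asarray(income, dtype=float)[:, None]
    lower = np.concatenate(([0.0], thresholds))  # bracket floors
    width = np.append(np.diff(lower), np.inf)    # bracket widths (top unbounded)
    taxed = np.clip(income - lower, 0.0, width)  # income falling in each bracket
    return (taxed * np.asarray(rates)).sum(axis=1)
```

For example, with thresholds [10000, 40000] and rates [0.1, 0.2, 0.3], an income of 50000 owes 10000*0.1 + 30000*0.2 + 10000*0.3 = 10000. A numba-jitted scalar equivalent would instead loop over brackets one tax unit at a time; both styles express the same marginal-rate logic.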

These are nice experiments; I'll DM you.

@jdebacker (Member)

This issue highlights another benefit of the work in PR #2585

@martinholmer (Collaborator) commented Sep 28, 2021

@jdebacker said before closing this issue:

This issue highlights another benefit of the work in PR #2585

I don't understand this comment given that @MaxGhenis reports in this issue that:

In this test, I found that numba reduced the runtime by 75%, once it compiled, compared to a vectorized function taking a DataFrame. This is a basic reproduction of the ALD_InvInc_ec_base function only, without taxcalc's jitting, celery as mentioned here to avoid the need to warm up, etc.

Given Max's test results, why would you expect the work in PR #2585 to do anything other than slow down Tax-Calculator?

@jdebacker (Member)

@martinholmer you know that one has to balance:

  1. Run time
  2. Programming time

Max provides some evidence on (1) (though it may be biased toward numba, given the small example dataset and the fact that Dask DataFrames weren't used). At the same time, Max raises (2), noting that the API documentation is limited in that it can't see into the jitted functions. This affects programming time because it makes it harder for contributors/users to understand how the model works. Other discussion suggests that, when reading the source code, the logic may be easier to follow for jitted than for vectorized functions, but there's still a question of how much clearer the code is.

In any case, I'd welcome you opening a PR to make the documentation more clear (Max's original comment).

@martinholmer (Collaborator) commented Sep 28, 2021

@jdebacker said in closed issue #2427:

Max references (2), noting that the API documentation is limited in that it can't see into the jitted functions.

Not sure what he means by "can't see into the jitted functions". The pre-jitted functions are in the calcfunctions.py module, so he must mean can't see into the post-jitted functions. I'm not sure why anybody would value seeing the post-jitted functions. Anybody can use the pre-jitted functions in the Python debugger if they are worried about bugs in the tax calculation logic or if they just want to understand how the tax logic works.

@jdebacker continues:

This [not being able to see into the post-jitted functions] affects programming time because it makes it harder for contributors/users to understand how the model works.

I simply don't understand what you're saying. "How the model works" to compute taxes is clearly laid out in the pre-jitted calcfunctions.py code. If you want to see something much less easy to understand, take a look at parts of the vectorized code in PR #2585. Why exactly would anybody want to look into the post-jitted code?

And @jdebacker concludes by saying:

In any case, I'd welcome you opening a PR to make the documentation more clear (Max's original comment).

I don't think it is my role to document something that I did not create. The jitting decorators used in the calcfunctions.py module and specified in the decorators.py module are the creation of @talumbau who, like me, is no longer part of the Tax-Calculator team.

@jdebacker (Member)

@martinholmer I don't know what you mean by "team" or "role". This is a public repository for an open source project. The project would welcome contributions from you (or anyone else) that help with documentation (or other aspects of the project).
