Impute itemized deduction amounts to non-itemizers #32
Comments
I have just finished the write-up (pdf attached), and the imputed version of the 09 puf is also available. I'm aware that we'll be switching to the 10 puf soon, and imputing the 10 puf should be very efficient as long as we are not revising or updating our current donor (14 CEX). I'll be streamlining the code and will post all of it on github once done. Meanwhile, please let me know if there are any comments, concerns or remarks regarding the write-up. |
@GoFroggyRun, I think it would be very helpful to add a case study of converting an itemized deduction into a credit. That would give users confidence in the technique. You could incorporate the scores with and without the info for standard deduction filers. There could be some good material for comparisons in this document. http://www.cbo.gov/sites/default/files/cbofiles/ftpdocs/121xx/doc12167/charitablecontributions.pdf |
Sean Wang (@GoFroggyRun) and Chi Tran, I've briefly read your 22-Aug-2016 paper entitled "On Cold-Deck Imputation with Data Quality Improvement Using Simulation Model". I have three types of comments: some stylistic suggestions regarding the paper, a substantive issue regarding what you did in the imputation work, and an issue about the lack of imputation results in the paper. Stylistic Issues Substantive Issue Imputation Results |
Thank you so much for your thoughtful and detailed comments on the write-up. Regarding your concerns: Stylistic Issues
I am aware of that, and will try to pinpoint my citations to make them more reader-friendly.
Yes. They are, however, being discussed somewhat discursively. I will try to find a good place to cite them.
Right. Since they are web pages, I wasn't quite sure what the best way to cite them is. Substantive Issue My apologies that this part might look a bit confusing. There's one variable in the CEX dataset called Imputation Results Sure. I'll include a case study, as suggested by @MattHJensen, as well as some statistical distributions of imputed variables in the write-up. Thanks again for all your comments. And, as always, any comments, concerns or remarks would be more than appreciated. |
Sean Wang (@GoFroggyRun) said:
That sounds more reasonable, but is not the impression that the current draft gives. |
I have finished a revision of the draft (please find attachment) that addresses the concerns mentioned in previous discussions. Thanks to @martinholmer for the careful review and thoughtful comments. Before proceeding to any imputation-related reforms and comparisons (as suggested by @MattHJensen), I'd like to check in with you all to see which reforms would be helpful, since the report includes a lot of reforms and some of them are not applicable in TC. Any ideas would be appreciated. And feel free to let me know if there are any further comments, concerns or remarks regarding the revision. |
@GoFroggyRun, I've finally had a chance to read your revised description of imputing itemized expense amounts for non-itemizers. That description was attached to the conversation of taxdata issue #32 during April 2017. This latest version is much improved, so thanks for all the extra work. I have a suggestion and several questions. (1) Need a shorter and more descriptive paper title. How about Imputing Deductible Expense Amounts for Non-Itemizers? This describes what you are doing and is shorter (so that the page numbers in the LaTeX header don't get swallowed by the long title). (2) Questions about Data Cleaning at bottom of page 3. I don't understand why you have dropped these three groups from the CEX sample. (3) Questions about categorizing CEX data by earnings and PUF data by income. At top of page 7, there is a confusing description where CEX units categorized by earnings group seem to be compared with PUF units categorized by income. I don't understand how that can be done in a sensible way. Perhaps more explanation will eliminate this issue. (4) Beginning in Section 4 there is no description of what the six e variables mean. Why not give the reader a break and explain in words what the deductible expense variables mean? (5) Question about what the phrase "non-ordinal categorial variable" means. I didn't see anyplace in the paper that describes what you mean by this term. Can you explain? (6) Questions about the imputed distributions shown on page 13. This is my biggest concern with what you have done (as far as I can tell from the paper). The six variables have imputed-value distributions on page 13 that are very different from what I would expect. For example, I would expect among non-itemizers that most would have zero non-cash charitable contributions and that a few would be positive non-cash charitable contributions. But the distribution for |
Where is the document discussed here?
dan feenberg
…On Mon, 1 May 2017, Martin Holmer wrote:
@GoFroggyRun, I've finally had a chance to read your revised description of
imputing itemized expense amounts for non-itemizers. That description was
attached to the conversation of taxdata issue #32 during April 2017.
This latest version is much improved, so thanks for all the extra work. I
have a suggestion and several questions.
(1) Need a shorter and more descriptive paper title.
How about Imputing Deductible Expense Amounts for Non-Itemizers?
This describes what you are doing and is shorter (so that the page numbers in
the LaTeX header don't get swallowed by the long title).
(2) Questions about Data Cleaning at bottom of page 3.
I don't understand why you have dropped these three groups from the CEX sample.
(a) You are splitting families into filing units, so why can't you split CEX
consumer units into families?
(b) Why not treat "surviving spouse units" as single or head of household filing
units depending on whether or not they have dependents?
(c) Why are CEX units without earnings being dropped? Who is left in the CEX
sample to use to impute to non-itemizing PUF retirees, who will most likely have
zero earnings but positive social security and/or pension income? This seems
like a big mistake, but maybe further explanation can change my mind about that.
(3) Questions about categorizing CEX data by earnings and PUF data by income.
At top of page 7, there is a confusing description where CEX units categorized
by earnings group seem to be compared with PUF units categorized by income. I
don't understand how that can be done in a sensible way. Perhaps more
explanation will eliminate this issue.
(4) Beginning in Section 4 there is no description of what the six e variables
mean.
Why not give the reader a break and explain in words what the deductible expense
variables mean?
Also, why is e18500 (real-estate taxes paid) not imputed? Seems like we still
have a problem after all your imputation work because we still have a major
deductible expense missing.
(5) Question about what the phrase "non-ordinal categorial variable" means.
I didn't see anyplace in the paper that describes what you mean by this term.
Can you explain?
(6) Questions about the imputed distributions shown on page 13.
This is my biggest concern with what you have done (as far as I can tell from
the paper). The six variables have imputed-value distributions on page 13 that
are very different from what I would expect. For example, I would expect among
non-itemizers that most would have zero non-cash charitable contributions and
that a few would be positive non-cash charitable contributions. But the
distribution for e20100 on page 13 shows most non-itemizers having a value of
about $1400 and almost none having a value of zero. Why is that? Is my
expectation about this variable's distribution in the CEX subsample of
non-itemizers mistaken? Or, by taking the imputed value to be the average of the
nearest 80 neighbors (if I'm understanding correctly what you're doing) are you
distorting the CEX distribution of this variable? Put another way, I don't see
how your imputation method handles correctly the mass point at zero for these six
variables. Maybe more explanation would answer my question.
@MattHJensen @feenberg @Amy-Xu @andersonfrailey
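The mass-point concern in question (6) can be illustrated with a small sketch. It uses purely synthetic data (not the CEX or PUF), a single hypothetical matching covariate, and an assumed 70 percent share of zeros among donors. It compares imputing the mean of the 80 nearest donors with drawing one donor at random from those 80. The second rule is included only as a contrast and is not claimed to be the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic donor file (NOT real CEX data): one matching covariate and an
# expense amount that is zero for about 70 percent of donors (assumption).
n_donors = 5_000
donor_x = rng.uniform(0.0, 1.0, n_donors)
is_positive = rng.uniform(size=n_donors) < 0.3
donor_y = np.where(is_positive, rng.lognormal(7.0, 1.0, n_donors), 0.0)

# Synthetic recipient file with the same covariate.
recip_x = rng.uniform(0.0, 1.0, 2_000)


def k_nearest(x, k):
    """Indices of the k donors whose covariate is closest to x."""
    return np.argpartition(np.abs(donor_x - x), k - 1)[:k]


def impute_mean(k):
    """Impute the average amount of the k nearest donors."""
    return np.array([donor_y[k_nearest(x, k)].mean() for x in recip_x])


def impute_draw(k):
    """Impute the amount of one donor drawn at random from the k nearest."""
    return np.array([donor_y[rng.choice(k_nearest(x, k))] for x in recip_x])


share_zero_donor = np.mean(donor_y == 0.0)
share_zero_mean = np.mean(impute_mean(80) == 0.0)
share_zero_draw = np.mean(impute_draw(80) == 0.0)
print(f"donors with zero:           {share_zero_donor:.2f}")
print(f"mean-of-80 imputed zeros:   {share_zero_mean:.2f}")
print(f"draw-from-80 imputed zeros: {share_zero_draw:.2f}")
```

Under these assumptions an average of 80 donors is zero only when all 80 donors are zero, so the two rules can be compared directly on how much of the zero mass survives.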
|
On Mon, May 1, 2017 at 2:44 PM, Daniel Feenberg <[email protected]>
wrote:
Where is the document discussed here?
I've attached the pdf to this email.
|
@GoFroggyRun, Let me amplify the concerns I expressed in my question (6), which was posed in taxdata issue #32 on May 1, 2017. I have no idea what the distribution of non-cash charitable contributions ( Here is what I did:
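So, only 21.09 million of the 42.12 million itemizers (about 50 percent) have a positive amount. And only 4.84 million (about 23 percent of the positives and 11 percent of all itemizers) have a 2013 e20100 value larger than $1,000. But your graph on page 13 shows the vast majority of non-itemizers have imputed values of e20100 around $1,300 and very few have zero. Can the basic shape of the e20100 distribution among non-itemizers really be that different from the basic shape of the e20100 distribution among itemizers? |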
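A weighted tabulation of this kind can be sketched as follows. The data are synthetic, the column names `e20100` (non-cash contributions) and `s006` (sampling weight) are assumptions about the file layout, and `weighted_count` is a hypothetical helper. It is not the code actually used to produce the figures quoted in this comment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for itemizer records (NOT real PUF data).
# Assumed columns: e20100 = non-cash contributions, s006 = sampling weight.
n = 10_000
e20100 = np.where(rng.uniform(size=n) < 0.5, rng.lognormal(6.0, 1.2, n), 0.0)
s006 = rng.uniform(50.0, 500.0, n)


def weighted_count(mask, weights):
    """Weighted number of filing units for which mask is true."""
    return float(weights[mask].sum())


total = weighted_count(np.ones(n, dtype=bool), s006)
positive = weighted_count(e20100 > 0.0, s006)
over_1000 = weighted_count(e20100 > 1000.0, s006)

print(f"share with a positive amount:   {positive / total:.3f}")
print(f"share of positives over $1,000: {over_1000 / positive:.3f}")
print(f"share of all units over $1,000: {over_1000 / total:.3f}")
```

Running the same three ratios on the imputed non-itemizer records would give a direct, like-for-like comparison with the itemizer shares.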
@martinholmer thanks for your thoughtful comments and follow-up analysis, I'll first partially address your concerns.
I don't have a strong preference regarding the title, so I don't have problems with this one.
For (a), maybe I am confused, but why would we be interested in families rather than filing units? For (b), the number of observations considered "surviving spouse units" is rather insignificant compared to either the single group or the HH group. My judgment is thus that dealing with them is probably not worthwhile and would not matter. Moreover, including them in either group could potentially introduce distortion. The quality of the donor is much more important than such an insignificant increase in sample size, so I'd rather not make that trade. For (c), earnings is the factor we used to break down consumer units (CUs) in the CEX. When earnings are zero, there's no way to determine how to split CUs. Indeed, zero-earnings units can have positive social security and/or pension amounts, but introducing these factors would make things more complicated (I prefer generalized treatments over special treatments). More importantly, CUs with zero earnings are not interesting in themselves, in that their expenditures are mostly negligible. The effect on the imputation of including those CUs (supposing we had an ideal way to break them down) is more or less the same as having trivial records with zero or close to zero expenditures in the donor dataset (recall that we only use number of exemptions, earnings and marital status to measure similarity).
The "income" I used in PUF is actually
I'll add descriptions for those variables. Thanks for your suggestion.
It means that this categorical variable has no clear ordering. I probably shouldn't have included the term "non-ordinal", since a categorical variable readily implies that ordering is not clear. I'll post a separate comment to address the rest of the concerns. |
@GoFroggyRun said about what @martinholmer said:
I'm simply suggesting (because you have so few CEX observations) that you not discard multiple-family CEX units. Split those CEX units into families, and then use your procedures to split each of those families into tax filing units. |
@martinholmer said:
and followed-up by:
First of all, the distribution I presented in the paper is not weighted. Each record in the distribution has a uniform weight. I probably should have specified that in the paper. Given this, each distribution I presented can actually be viewed as a re-scaled version of the corresponding CEX distribution. I wouldn't use the word "distort", since I'm simply averaging everything without extra treatment. I'm not sure how weights would affect the aggregate distributions and results, nor whether the weighted distribution would meet your expectation. Last Friday, @feenberg also raised a concern regarding the distribution plots: he thinks these distributions are not smooth enough, in that they have multiple dips. In his opinion, taking the nearest 80 neighbors should already have alleviated this issue compared to taking only 1 neighbor, but these distributions still aren't good enough. One possible solution is to incorporate previous-year CEX releases, but I am not sure how much time and effort that might take. I'm still thinking about what the best strategy is to deal with the concerns regarding these distributions. Any comments, concerns or remarks are most welcome. cc @MattHJensen |
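The question of how weights change a plotted distribution can be explored with a short sketch. The amounts and weights below are synthetic, and the rule tying weights to amounts is an assumption made only so that weighting has a visible effect. The sketch does not say anything about how the actual PUF weights relate to the imputed values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic imputed amounts and sampling weights (hypothetical data).
n = 5_000
amount = rng.lognormal(7.0, 0.8, n)
# Assumption for illustration: low-amount records represent more filing
# units, so the weights fall as the amount rises.
weight = 1_000.0 / (1.0 + amount / 1_000.0)

# Same bins for both versions so the two densities are comparable.
bins = np.linspace(0.0, 10_000.0, 21)
unweighted, _ = np.histogram(amount, bins=bins, density=True)
weighted, _ = np.histogram(amount, bins=bins, weights=weight, density=True)

unweighted_mean = amount.mean()
weighted_mean = np.average(amount, weights=weight)
print(f"unweighted mean: {unweighted_mean:,.0f}")
print(f"weighted mean:   {weighted_mean:,.0f}")
```

Plotting `unweighted` and `weighted` against the bin midpoints gives the two density curves side by side.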
I do think mixing 80 values is the problem. I left Sean with some ideas
last Friday for re-evaluating the optimal k and we should give him a
chance to implement that.
dan
…On Tue, 16 May 2017, Sean.Wang wrote:
@martinholmer said:
This is my biggest concern with what you have done (as far as I
can tell from the paper). The six variables have imputed-value
distributions on page 13 that are very different from what I
would expect. For example, I would expect among non-itemizers
that most would have zero non-cash charitable contributions and
that a few would be positive non-cash charitable contributions.
But the distribution for e20100 on page 13 shows most
non-itemizers having a value of about $1400 and almost none
having a value of zero. Why is that? Is my expectation about
this variable's distribution in the CEX subsample of
non-itemizers mistaken? Or, by taking the imputed value to be
the average of the nearest 80 neighbors (if I'm understanding
correctly what you're doing) are you distorting the CEX
distribution of this variable? Put another way, I don't see how
your imputation method handles correctly the mass point at zero
for these six variables. Maybe more explanation would answer my
question.
and followed-up by:
So, only 21.09 million of the 42.12 million itemizers (about 50
percent) have a positive amount.
And only 4.84 million (about 23 percent of the positives and 11
percent of all itemizers) have a 2013 e20100 value larger than
$1,000.
But your graph on page 13 shows the vast majority of
non-itemizers have imputed values of e20100 around $1,300 and
very few have zero.
Can the basic shape of the e20100 distribution among
non-itemizers really be that different from the basic shape of
the e20100 distribution among itemizers?
|
@GoFroggyRun, in addition to posting the charts that we discussed today, could you post an overview of Dan's ideas relating to "mixing 80 values"? |
Here are the two plots we've discussed: First, the weighted version of the imputed variables: [plot: weighted-density] And the original CEX distribution with uniform weights: [plot: cex_distribution] Last Friday, @feenberg suggested a way of evaluating the effect of "mixing 80 neighbors" by plotting the correlation (variance) plot (i.e. variance against number of neighbors). After giving the suggestion a second thought, I don't think it is sensible. Currently I am using mean squared error (MSE) to evaluate model goodness. The bias-variance tradeoff shows that the two components of MSE, namely bias and variance, will be monotonically decreasing and increasing respectively as the number of neighbors increases. Thus some appropriate choice of the number of neighbors would minimize the MSE. The idea behind such a curve is that we are picking a point where bias won't overwhelm variance and vice versa. @feenberg's idea is, in simple words, that we want a method that minimizes the correlation(variance). An immediate consequence of such a choice (in this case choosing one neighbor to impute) is that our result will be seriously biased. Maybe I am confused about our objective, since I'm using an algorithm that gives "global" optimization.
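Choosing the number of neighbors by MSE can be sketched as a held-out comparison across candidate values of k. Everything below is an assumption for illustration: the data are synthetic, there is a single matching covariate, and the one-third holdout split and the candidate list are arbitrary. It is not the procedure from the write-up.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (hypothetical): amount rises with the covariate, plus noise.
n = 3_000
x = rng.uniform(0.0, 1.0, n)
y = 2_000.0 * x**2 + rng.normal(0.0, 300.0, n)

# Hold out one third of the records to measure prediction error.
test = np.arange(n) % 3 == 0
x_train, y_train = x[~test], y[~test]
x_test, y_test = x[test], y[test]

# Rank every training record by distance to each held-out record once,
# so that each candidate k only needs a slice of the ranking.
dist = np.abs(x_test[:, None] - x_train[None, :])
order = np.argsort(dist, axis=1)


def holdout_mse(k):
    """MSE of predicting y_test by the mean of the k nearest training values."""
    pred = y_train[order[:, :k]].mean(axis=1)
    return float(np.mean((pred - y_test) ** 2))


candidates = [1, 5, 20, 80, 320, 1280]
mse = {k: holdout_mse(k) for k in candidates}
best_k = min(mse, key=mse.get)
for k in candidates:
    print(f"k={k:5d}  held-out MSE={mse[k]:12.1f}")
print("k with the smallest held-out MSE:", best_k)
```

The printed table shows how the held-out error behaves at both very small and very large k for this toy setup, which is the curve the MSE criterion is minimizing.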
I think we want to minimize the error in the estimate of the correlation.
dan
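One possible reading of this objective is to compare, for each k, the correlation between the imputed amount and a covariate against the same correlation in the donor data. The sketch below follows that reading only as an assumption. The data are synthetic, the linear relationship and noise level are invented, and `corr_error` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic donors and recipients (hypothetical data, one covariate).
n_donors, n_recip = 2_000, 1_000
donor_x = rng.uniform(0.0, 1.0, n_donors)
donor_y = 1_000.0 * donor_x + rng.normal(0.0, 500.0, n_donors)
recip_x = rng.uniform(0.0, 1.0, n_recip)

# Correlation the imputed data would ideally reproduce.
target_corr = np.corrcoef(donor_x, donor_y)[0, 1]

# Rank donors by distance to each recipient once.
dist = np.abs(recip_x[:, None] - donor_x[None, :])
order = np.argsort(dist, axis=1)


def corr_error(k):
    """Absolute gap between the imputed and the donor correlation."""
    imputed = donor_y[order[:, :k]].mean(axis=1)
    return abs(np.corrcoef(recip_x, imputed)[0, 1] - target_corr)


errors = {k: corr_error(k) for k in (1, 5, 20, 80)}
for k, err in errors.items():
    print(f"k={k:3d}  |corr(imputed) - corr(donor)| = {err:.3f}")
```

Under this criterion the quantity being minimized is a property of the whole imputed distribution rather than the record-by-record prediction error, which is why it can favor a different k than MSE does.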
|
We should impute itemized deduction amounts to non-itemizers so that we can simulate reforms that increase the number of itemizers.
This issue was moved from PSLmodels/Tax-Calculator#230.
@GoFroggyRun recently took on this project. @GoFroggyRun, could you please post an update on your work?
Feel free to link to or attach your and Chi Tran's presentations or any other information you think might be relevant.