# Household allocation {#ha}
So far, this book has explored data on two levels: the individual level and
the level of administrative zones. The household is another
fundamental building block of human organisation, around which key decision-making,
economic and data-collecting activities are centred. This chapter also presents results
from a specific study of Belgium. Note that this
chapter was written by Johan Barthélemy^[This is a contributed chapter by Johan Barthélemy, SMART Infrastructure Facility, University of Wollongong.] and Morgane Dumont. The second part of the chapter is based on research carried out by Morgane Dumont (UNamur) and funded by the
Wallonia Region of Belgium. Timoteo Carletti (UNamur),
Eric Cornélis (UNamur), Philippe Toint (UNamur) and Thierry
Eggericks (UCL Louvain-La-Neuve) were involved in the research.
The DEMO research group of UCL Louvain-La-Neuve and the
OWS (Observatoire Wallon de la Santé) also provided
support.^[More precisely, we can cite Dominique Dubourg (OWS), Véronique Tellier (OWS),
Luc Dal (DEMO), Mélanie Bourguignon (DEMO) and Jean-Paul Sanderson (DEMO).]
\index{household}
This chapter explains how to take spatial microdata, of the type
we have generated in the previous chapters, and allocate the resulting
individuals into household units.
As with all spatial microsimulation work, the appropriate method
for household creation
depends on the data available. Data availability scenarios,
in descending order of detail, include:

- Access to a sample of
households for which you have information about
each member.
- Access to separate datasets about individuals and households, stored in independent
data tables that are not linked by household ID.
- No access to aggregate data relating to households, but access to some
individual level variables describing the household to which each individual belongs
(e.g. number of cohabitants, type of household).

This chapter explains methods
for household level data generation in the latter two cases. The first
possibility, having a sample of households,
is the topic of the next chapter (Chapter 11), on the TRESIS method.
In this chapter, we focus on the two cases
where you have no microdata for the households
(meaning data with one row per household).
The chapter is structured as follows:

- *Independent data (individuals and households)* (\@ref(IndData)) considers the
case in which data on households and individuals remain completely separate.
- *With additional household's data* (\@ref(AddData)) presents a strategy for the case
in which additional data on household structure complement the individual level data.

Note that the first section describes a method from the literature in theoretical terms
only, whereas the second is developed in more detail
and presents results for Belgium.
## Independent data (individuals and households) {#IndData}
When the individual level data are
independent from the household level data, they can rarely be linked.
Data coming from different sources,
sometimes implying different total populations,
can cause this inconsistency.
This section describes the method
proposed by Johan Barthélemy
for dealing with such situations.
The method proceeds in three steps.
First, we determine the individual distribution `Ind`,
for example by using the package `mipfp`, as
explained before. Second, we determine the distribution
of characteristics for the household level data, hereafter named `Hh`. This can
be done using the same technique as for the individual level data, treating
households in the way individuals were treated in
the previous chapters.
Third, after individual and household level distributions have been
estimated, the individuals can be allocated
to households. This is done one household
at a time by first selecting its type before randomly
drawing its constituent members [@Barthelemy2012].
### Household type selection
The household type is selected so that
the distribution of the generated synthetic households
is statistically similar to the previously estimated one, i.e. $Hh$.
This is achieved by choosing the type $hh^*$ such that the
distribution $Hh'$ of the already generated households
(including the household being built) minimizes the
$\chi^2$ distance between $Hh'$ and $Hh$, i.e.

$$d_{\chi^2}=\sum_{i} \frac{(hh'_i-hh_i)^2}{hh_i^2}$$

where $hh_i$ and $hh'_i$ denote the number
of households of type $i$ in the estimated and generated
synthetic populations, respectively. Note that this optimization is simple
as the number of household types is limited.
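
The selection rule can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the authors' implementation: the function names `chi2_dist` and `choose_hh_type`, the three household types and their counts are all hypothetical, the distance follows the formula exactly as written above, and every estimated count is assumed to be strictly positive.

```{r}
# Distance between generated (hh_gen) and estimated (hh_est) household
# counts, following the formula given in the text above
chi2_dist <- function(hh_gen, hh_est) {
  sum((hh_gen - hh_est)^2 / hh_est^2)
}

# Pick the type whose addition to the generated households
# minimises the distance to the estimated distribution
choose_hh_type <- function(hh_gen, hh_est) {
  d <- sapply(names(hh_est), function(type) {
    candidate <- hh_gen
    candidate[type] <- candidate[type] + 1
    chi2_dist(candidate, hh_est)
  })
  names(which.min(d))
}

# Hypothetical estimated counts per household type (Hh)
hh_est <- c(isolated_man = 30, couple = 50, couple_children = 20)
hh_gen <- c(isolated_man = 0, couple = 0, couple_children = 0)

# Generate the household types one at a time
for (k in seq_len(sum(hh_est))) {
  type <- choose_hh_type(hh_gen, hh_est)
  hh_gen[type] <- hh_gen[type] + 1
}
hh_gen
```

Because adding a household to a type that is still below its target always lowers the distance, this greedy loop fills every type up to its estimated count before exceeding any of them.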
### Constituent members selection
Now that a household type has been determined, we can
describe the member selection process. First, a household
head is drawn from the pool of individuals `IndPool` defined
by the estimated individual distribution `Ind`. Then,
depending on the household type, a partner, children
and additional adults are also drawn if necessary.
This process is illustrated in Figure 10.1.
```{r, fig.cap="Constituent members selection process", fig.height=6, echo=FALSE}
img <- readPNG("figures/Jojo.png")
grid.raster(img)
```
Some attributes of the members can be directly
obtained from their household type (for instance
the gender of the head for a household of the type
`Isolated Man`). The remaining missing attributes are then:

- either randomly drawn according to some known distribution
(e.g. household type x head's gender x head's age x partner's age);
- or, if different values are feasible and equally likely,
set to the value that minimizes the $\chi^2$ distance between the generated
and estimated distributions.
This is similar to what is done for the household type selection.

After an individual type has been determined, the
corresponding member is added to the household being generated:

- if the selected class is still populated in `IndPool`,
we extract an individual from this class and add it to the household;
- otherwise we find a suitable member by searching among the members of
the households already generated. This individual is then replaced in their original household by
an appropriate one drawn from `IndPool`.

Note that if some additional data are available, for instance the age difference between partners in a couple,
we can use them to constrain the selection of the current individual type.
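
As a rough illustration of the first branch of this rule, the sketch below stores the pool as a named vector of counts per class and tries to extract one individual. The object `ind_pool`, the class names and the helper `draw_member` are hypothetical. The fallback on households already generated is only signalled here, not implemented.

```{r}
# Hypothetical pool: number of remaining individuals per class
# (class = gender x age group here, for illustration only)
ind_pool <- c(man_20_39 = 2, woman_20_39 = 1, child_0_19 = 0)

# Try to extract one individual of the requested class from the pool.
# Returns the updated pool and a flag telling whether the draw succeeded;
# when it fails, the caller has to fall back on the households already built.
draw_member <- function(pool, class) {
  if (pool[[class]] > 0) {
    pool[[class]] <- pool[[class]] - 1
    list(pool = pool, found = TRUE)
  } else {
    list(pool = pool, found = FALSE)
  }
}

res <- draw_member(ind_pool, "woman_20_39")
res$found   # TRUE: the class was still populated
res$pool

res2 <- draw_member(res$pool, "child_0_19")
res2$found  # FALSE: a member must be sought in existing households
```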
### End of the household generation process
The household generation process ends as soon as any one of these three conditions is met:
all households have been
constructed; the pool of individuals is empty; or the process fails
to find a member for a household among the previously generated ones.
When the procedure stops, two types of inconsistencies
may remain in the synthetic population: the final number of
households may be smaller than estimated and/or the estimated number of individuals
may be less than the known population of the area. In this case,
we can form households with the remaining individuals, even if
these households are improbable, and then try to make exchanges to improve
the fit. These
exchanges can be made by following the principles of a *tabu search*:
an algorithm that remembers its most recent moves
to avoid repeating the same exchange many times [@tabuSearch].
## Cross data: individual and household level information
In some cases, information about households is included in the
individual dataset.
For example, individual level data
may include variables on the type of household and/or
the number of cohabitants, in addition to gender and age. This provides cross-tabulated information between
the households and the individuals. Starting from such a microdataset,
IPF can help to obtain, per zone, inhabitants
described by individual level variables (such as sex, age and income) and some
household level information (such as household type and household size).
To form households from the resulting data, there are two alternatives.
The first is to aggregate the information concerning the individuals and
the households independently. In this way, we build two independent
tables and can use an algorithm similar to the one in \@ref(IndData).
The second aims to preserve the full potential of the data.
This means that individuals are grouped subject to constraints that follow
their characteristics as closely as possible. For example, two household
heads cannot live in the same household, and a person with 3 cohabitants
must be in a household of 4 individuals.
The first alternative is simpler and
requires only the methods of the earlier chapters of the book. However, it results in
a loss of precision.
The second alternative, which preserves all the information in the individual and
household level tables, is explained in this section.
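
The two example constraints mentioned above (a single head per household, and a household size equal to the number of cohabitants plus one) can be written as a small check. This is only a sketch: the column names `link` and `n_cohab`, the function `is_consistent` and the two toy households are assumptions made for illustration.

```{r}
# Hypothetical individuals: role in the household and declared
# number of cohabitants (household size minus one)
hh_ok  <- data.frame(id = 1:4,
                     link = c("head", "partner", "child", "child"),
                     n_cohab = c(3, 3, 3, 3),
                     stringsAsFactors = FALSE)
hh_bad <- data.frame(id = 5:6,
                     link = c("head", "head"),
                     n_cohab = c(1, 3),
                     stringsAsFactors = FALSE)

# A household is consistent if it has exactly one head and every
# member's number of cohabitants matches the household size
is_consistent <- function(hh) {
  one_head  <- sum(hh$link == "head") == 1
  size_fits <- all(hh$n_cohab + 1 == nrow(hh))
  one_head && size_fits
}

is_consistent(hh_ok)   # TRUE
is_consistent(hh_bad)  # FALSE
```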
With cross data, we usually proceed in two stages. First, we create
the individuals with all their characteristics.
Second, we group these individuals into households using
combinatorial optimisation. Each
person must be matched to
one and only one household.
For this process there are two possible methods. One assumes
that household level variables are available only in the individual level
data. The other assumes
access to additional data concerning the structure of the
households, such as the age difference within a married couple.
These options are described below.
Note that in both
situations, the aim is to form households such that each individual
belongs to one and only one household. Moreover, each individual must respect,
as well as possible, the household attributes recorded for them.
### Without additional household's data {#WithoutHHdata}
When household level constraints are only
contained in the individuals' characteristics,
there are often several possible groupings.
Consider the case of our Belgian study, where the individual level variables are
age, gender, municipality, education level, professional status,
size and type of household, and link
with the household head (e.g. wife, child). A good grouping is one that
maximises the number of satisfied constraints.
The perfect
grouping would be one in which each individual respects their household size, their
type of household and their link with the head. In general, it
is impossible to reach a perfect grouping, since the data are not
perfect. For example, there may be an odd number
of people who need to live in a couple, making it impossible
to pair all of them.
```{r, fig.cap="Illustration of the problem of grouping members of married couples with children.", fig.height=6, echo=FALSE}
img <- readPNG("figures/HH-CO.png")
grid.raster(img)
```
As illustrated
in Figure 10.2, the individuals can be categorised first by
type of household ("married couples with children" in this case)
and then by size of household.
This household type has a size of at least 3
(two parents and at least one child).
Inside this restricted set of households, the next step is to
look at the link that each individual has with the household head
and again split the pool of individuals by link. Only after this classification
do we proceed to the random draw, respecting the links.
For example, for the married couples, we first randomly draw a head
and then a partner of the opposite gender (the national register
of Belgium for 2011 does not contain same-sex couples). Then, depending
on the size of the household to be generated, the right number of children
is drawn. This process ends when no additional household can be drawn
while respecting the constraints. Figure 10.2 shows
a household with head 1, who is a woman; partner 2, who is a man;
and two children (with ids 2 and 5).
The main sources of error with this method are inconsistencies
in the data and errors introduced by the IPF process before the grouping.
The method implicitly assumes that each household is equally likely to occur, independent
of its characteristics.
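
The grouping described above can be sketched as follows for a single household type. This is a simplified illustration, not the code used in the Belgian study: the data frame `ind`, its column names and the function `group_households` are hypothetical, and all individuals are assumed to belong to the type "married couples with children".

```{r}
set.seed(1)  # for reproducibility of the random draws

# Hypothetical individuals, all of type "married couple with children"
ind <- data.frame(
  id      = 1:11,
  gender  = c("f", "m", "m", "f", "f", "m", "f", "m", "m", "f", "m"),
  link    = c("head", "partner", "child", "child",
              "head", "partner", "child",
              "head", "child", "child", "child"),
  hh_size = c(4, 4, 4, 4, 3, 3, 3, 4, 4, 4, 3),
  stringsAsFactors = FALSE
)

# Draw one element at random (safe when only one candidate remains)
pick <- function(x) x[sample.int(length(x), 1)]

group_households <- function(ind) {
  ind$hh_id <- NA_integer_
  next_id <- 1L
  for (s in sort(unique(ind$hh_size))) {
    heads <- ind$id[ind$hh_size == s & ind$link == "head"]
    # Consider the heads of this household size in random order
    for (head_id in heads[sample.int(length(heads))]) {
      free <- is.na(ind$hh_id) & ind$hh_size == s
      head_gender <- ind$gender[ind$id == head_id]
      partners <- ind$id[free & ind$link == "partner" &
                           ind$gender != head_gender]
      children <- ind$id[free & ind$link == "child"]
      # Skip this head if the constraints cannot be respected
      if (length(partners) == 0 || length(children) < s - 2) next
      members <- c(head_id, pick(partners),
                   children[sample.int(length(children), s - 2)])
      ind$hh_id[ind$id %in% members] <- next_id
      next_id <- next_id + 1L
    }
  }
  ind
}

grouped <- group_households(ind)
table(grouped$hh_id, useNA = "ifany")
```

In this toy example the male head (id 8) has no available partner of the opposite gender, so he and the surplus children remain unassigned. This mirrors the situation in which the data do not allow a perfect grouping.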
### With additional household's data {#AddData}
Without additional data on household structure, the only
possible method is the one described in \@ref(WithoutHHdata).
However, this allows improbable households, such as
a couple formed by an 18-year-old and
an 81-year-old. For this reason, when we create households,
it is often very useful to take into account
the distribution of age differences (when these data are available).
We can consider the ages within a
couple, but also those of parents and children.
To do this, we need tables of age differences. These tables are relevant only when
they involve variables already included in the simulation (for example, it is
impossible to use a table of age differences per hair color if this
variable is not in the model). To explain
the process, we describe here the methodology used
for the creation of the couples. We have
men and women of different ages and roles in the household (head or spouse)
and we need to form couples. The random draw used when
no additional data are available is improved by considering the
real age distributions. Imagine that part of the additional data is
as shown in Table 10.1.

Table: Example of an age distribution table for couples without children.

| Municipality | Woman's age | Man's age | Count |
|:--|:-----:|:------:|:------:|
| TestCity | 20-25 | 15-20 | 4 |
| TestCity | 20-25 | 20-25 | 25 |
| TestCity | 20-25 | 25-30 | 18 |
| TestCity | 20-25 | 30-35 | 8 |
| TestCity | 20-25 | 35-40 | 2 |
| ... | ... | ... | ... |

Note that this is a fictitious table, not corresponding to any real
data, used only to explain the reasoning. From this
table, we know that to fit the real population we need 25
couples in which the man and the woman are both in the age class 20-25, and so on.
However, these data are not perfect (because they come from different sources with
slight variations in the counts, or because the synthetic population is not perfect),
so their marginals could be inconsistent
with those of the current synthetic population. For this reason, we use
the new information only as proportions. In our example,
this means that out of all women aged 20-25
(57 individuals), $\frac{4}{57} \approx 0.07$, i.e. 7%, are married to a
man aged 15-20. With this reasoning, we can calculate the new Table 10.2,
with an additional column containing the proportions.
\index{age distribution}

Table: Example of an age distribution table with, for women of each age class, the proportion married to a man of each age class.

| Municipality | Woman's age | Man's age | Count | Proportion (%) |
|:--|:-----:|:------:|:------:|:------:|
| TestCity | 20-25 | 15-20 | 4 | 7.0 |
| TestCity | 20-25 | 20-25 | 25 | 43.9 |
| TestCity | 20-25 | 25-30 | 18 | 31.6 |
| TestCity | 20-25 | 30-35 | 8 | 14.0 |
| TestCity | 20-25 | 35-40 | 2 | 3.5 |
| ... | ... | ... | ... | ... |

These proportions are used in the next step of the overall process.
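
The proportions in Table 10.2 can be reproduced from the counts of the fictitious Table 10.1 with a few lines of base R. The data frame `couples` and its column names are assumptions made for this illustration.

```{r}
# Counts from the fictitious Table 10.1 (women aged 20-25 in TestCity)
couples <- data.frame(
  woman_age = "20-25",
  man_age   = c("15-20", "20-25", "25-30", "30-35", "35-40"),
  count     = c(4, 25, 18, 8, 2),
  stringsAsFactors = FALSE
)

# Total number of couples for each woman's age class (57 here)
totals <- ave(couples$count, couples$woman_age, FUN = sum)

# Proportion (%) of each man's age class within the woman's age class
couples$prop_pct <- round(100 * couples$count / totals, 1)
couples
```

Grouping by `woman_age` with `ave()` means the same code would still work if the table contained several age classes for women.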
The methodology for couples with male heads is illustrated in Figure 10.3. For
female heads, the process is analogous.
```{r, fig.cap="Illustration of the algorithm to form the couples", fig.height=6, echo=FALSE}
img <- readPNG("figures/IllustrationCouples.png")
grid.raster(img)
```
First, we split the set of
individuals depending on their role and gender.
This gives male heads to be matched with female
partners on the one hand, and female heads
to be matched with male partners on the other. We consider
each male head in turn. For each head, we
determine the theoretical distribution of the woman's age,
depending on the age of the head
(using the additional age distribution table).
From this distribution, we
remove the ages that are no longer available in the set of
possible partners. Indeed, towards the end of the process,
only a few partners remain to be assigned, so
not all ages are still represented.
From the remaining distribution of ages, we calculate the
proportions, which are used as probabilities in the random draw.
We then draw an age and, knowing
the age of the partner, choose a wife at random among the women of that age. This
process is repeated until the set of remaining individuals
is empty, or until there are no remaining partners in the possible ages for the
remaining heads.
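
A minimal sketch of this algorithm for male heads is given below. It is an illustration under stated assumptions rather than the code used in the research: the objects `heads`, `partners` and `age_tab`, the function `form_couples` and all the counts are hypothetical.

```{r}
set.seed(42)  # reproducible draws

# Hypothetical inputs: male heads, female partners and an
# age distribution table for couples (counts per pair of age classes)
heads    <- data.frame(id = 1:4,
                       age = c("20-25", "20-25", "25-30", "25-30"),
                       stringsAsFactors = FALSE)
partners <- data.frame(id = 101:104,
                       age = c("20-25", "20-25", "25-30", "15-20"),
                       stringsAsFactors = FALSE)
age_tab <- matrix(c(4, 25, 18,
                    1, 10, 30),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(man   = c("20-25", "25-30"),
                                  woman = c("15-20", "20-25", "25-30")))

form_couples <- function(heads, partners, age_tab) {
  couples <- data.frame(head = integer(0), partner = integer(0))
  for (i in seq_len(nrow(heads))) {
    # Theoretical distribution of the woman's age given the head's age
    w <- age_tab[heads$age[i], ]
    # Remove the ages no longer available among the remaining partners
    w[!(names(w) %in% partners$age)] <- 0
    if (sum(w) == 0) next  # no feasible partner for this head
    # Proportions used as probabilities in the random draw of an age
    age <- sample(names(w), 1, prob = w / sum(w))
    # Random choice of a partner of the drawn age
    cand <- which(partners$age == age)
    chosen <- cand[sample.int(length(cand), 1)]
    couples <- rbind(couples,
                     data.frame(head = heads$id[i],
                                partner = partners$id[chosen]))
    partners <- partners[-chosen, ]
  }
  couples
}

couples <- form_couples(heads, partners, age_tab)
couples
```

Setting the weights of unavailable ages to zero before normalising is what guarantees that the drawn age always corresponds to at least one remaining partner.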
In our research, this algorithm has been applied to all municipalities in Belgium.
The final result is illustrated in Figure 10.4. On this graph,
each point corresponds to a combination (woman's age x man's age) for a
municipality. Its abscissa is the theoretical count for this category,
taken from the database of age distributions within couples.
Its ordinate is the number of couples in this category in our
synthetic population. Since the points lie along the line
on which both coordinates are equal, we can argue that our
simulation worked well.
```{r, fig.cap="Illustration of the results for the couples in Belgium", fig.height=6, echo=FALSE}
img <- readPNG("figures/Belgium/Couples.png")
grid.raster(img)
```
The assignment of children to heads was carried out by a similar process
and gives equally good results in terms of the distribution of ages. However,
when no further combination is possible, some individuals may remain
without an assigned household. To improve the spatial microsimulation here, we
chose to join remaining individuals to households regardless of their declared
household size if this improves the age distribution. This means fewer people
without a household, but some individuals end up in a household whose size does not
correspond to their declared household size.
Figure 10.5 indicates that the
worst municipality has only 0.5% of non-assigned individuals. The vast majority
of Belgian municipalities have fewer than 0.1% of such individuals.
```{r, fig.cap="Illustration of the non-assigned individuals in Belgium", fig.height=6, echo=FALSE}
img <- readPNG("figures/Belgium/NonAssigne.png")
grid.raster(img)
```
Figure 10.6 illustrates the proportion of individuals living
with the wrong number of cohabitants. Across municipalities,
the proportion varies from 0.7% to 1.05%. These errors affect
only a small proportion of each municipality
and concern only the size of the household (the type of household is always respected).
In conclusion, the simulation fits the strong constraints well
(age distributions within couples and between heads and children).
The individuals assigned to a household have the right link with the head
and most of them live with the right number of cohabitants. Only a few
individuals have not been assigned to a household.
Thus, we can consider the synthetic population acceptable.
Thanks to spatial microsimulation, these new synthetic data
are statistically very similar to the real Belgian population.
Since these individuals are 'synthetic', the resulting population does not
raise the privacy issues associated with real personal data.
```{r, fig.cap="Illustration of the individuals in a household of a different size for Belgium", fig.height=6, echo=FALSE}
img <- readPNG("figures/Belgium/BadSize.png")
grid.raster(img)
```
\pagebreak
Note that simulated annealing could be another method for solving this kind of problem.
We tested it in our case, but it required a very long CPU time to obtain a result
as accurate as the one shown above. However, in cases where
the objectives are different, simulated annealing may perform
better. For our purpose it worked well, and computational time
was its only major drawback. If you would like to fit age distributions,
diploma distributions and more complicated cases simultaneously, simulated annealing could
become a good option.
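
To give an idea of what such an approach involves, the sketch below applies a generic simulated annealing scheme to a toy matching problem. It is not the version tested in the research. The ages, the objective function `cost` (total absolute age gap within couples), the function `anneal` and its parameter values are all assumptions chosen for illustration.

```{r}
set.seed(123)

# Hypothetical ages of heads and of partners to be matched
head_age    <- c(25, 40, 60, 33, 52)
partner_age <- c(58, 27, 35, 50, 41)

# Hypothetical objective: total absolute age gap within couples,
# where perm[i] is the index of the partner assigned to head i
cost <- function(perm) sum(abs(head_age - partner_age[perm]))

anneal <- function(n_iter = 2000, temp = 10, cooling = 0.995) {
  current <- sample(length(partner_age))  # random initial matching
  best <- current
  for (k in seq_len(n_iter)) {
    # Neighbouring solution: exchange the partners of two heads
    ij <- sample(length(current), 2)
    proposal <- current
    proposal[ij] <- current[rev(ij)]
    delta <- cost(proposal) - cost(current)
    # Always accept improvements; accept worse moves with a
    # probability that shrinks as the temperature decreases
    if (delta <= 0 || runif(1) < exp(-delta / temp)) current <- proposal
    if (cost(current) < cost(best)) best <- current
    temp <- temp * cooling
  }
  best
}

best <- anneal()
cost(best)
```

Each iteration evaluates the objective for a full candidate matching, which hints at why computing time grows quickly when the population and the number of objectives increase.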
## Chapter summary
This chapter has described methods for allocating individuals into households.
Depending on the precision of the available data, the process can result in
synthetic households that are more or less representative of the real population.
The more relevant information is available, the more realistic the resulting
households will be.