RDP 2003-08: A Tale of Two Surveys: Household Debt and Financial Constraints in Australia
Appendix C: Imputation of Household Income in HILDA[19]

Compared with similar international household surveys, HILDA does not suffer greatly from problems of missing data (Watson and Wooden 2003). However, there is a relatively high incidence of missing data for income-related questions. We can separate the most common reasons for the missing data into ‘item non-response’ and ‘incomplete households’. Item non-response occurs when a member of a selected household agrees to be interviewed, but then either refuses, or is unable, to answer some of the questions asked. This is the main source of missing data, accounting for 64 per cent of the missing household income information. Most of the missing income data is due to item non-response for income sourced from business (missing 23.5 per cent) and investments (missing 8.1 per cent). Wages and salaries (missing 7.2 per cent) and government benefits and pensions (missing 1.4 per cent) have relatively low incidences of missing data.

The other major source of missing data is the 810 incomplete households, accounting for 10.5 per cent of the household sample and 36 per cent of the missing household income information; these are households in which not all eligible adult members agreed, or were able, to be interviewed. The HILDA unit record files do not include an entry for household income if any of its eligible members were not interviewed, or did not report complete income information; in all, 29 per cent of households have a missing value for household income.
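As a rough consistency check, the shares quoted above can be reconciled with one another. The short sketch below uses only the rounded figures given in the text, so the implied household counts are approximations and not figures reported by the survey.

```python
# Rounded figures quoted in the text above.
incomplete_households = 810
incomplete_share_of_sample = 0.105    # 10.5 per cent of the household sample
missing_income_share = 0.29           # 29 per cent of households lack household income
incomplete_share_of_missing = 0.36    # 36 per cent of the missing income cases

# Household sample size implied by the first two figures (approximate).
implied_sample = incomplete_households / incomplete_share_of_sample

# Households with missing income, and the part attributed to incomplete households.
implied_missing = missing_income_share * implied_sample
implied_incomplete = incomplete_share_of_missing * implied_missing

print(round(implied_sample), round(implied_missing), round(implied_incomplete))
# prints: 7714 2237 805
```

The implied count of about 805 incomplete households is close to the 810 reported, with the gap attributable to rounding in the quoted percentages.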

In such circumstances we have two choices: drop from the sample the 29 per cent of households for which income data is missing, or impute the income of the individuals with missing data. Our choice to impute is shaped by two factors. First, because income non-response is neither random nor uncorrelated with the variable(s) of interest, the missing cases cannot be safely dropped from the sample (Watson and Wooden 2003). For example, men, individuals outside the labour force, individuals living in Tasmania and Perth, people who have been recently divorced, and people who have a high regard for their leisure time (and generally have low incomes) were more likely to offer complete income information than other individuals. Second, we have a large cross-section of information from the HILDA Survey that permits us to do a reasonable job of imputing income for the individuals with missing data.

Following the recommendations of the HILDA Survey team and methods adopted in the British Household Panel Survey (BHPS), we impute income using the ‘predictive mean matching’ method (Little 1988; Watson and Wooden 2003). This is a stochastic imputation technique that has the advantage of maintaining the underlying data distribution by allowing the imputation of error around the mean.
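To illustrate the general idea, the following is a minimal sketch of one common variant of predictive mean matching. It is not the specification used in this paper: the single covariate, the number of candidate donors `k`, the random seed and the data are all hypothetical, whereas the models described in this appendix use many household, family and personal variables. A regression is fitted on respondents with observed income, each missing case is matched to the respondents whose predicted income is closest to its own, and the observed income of a randomly chosen close match is used as the imputed value.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(x_obs, y_obs, x_mis, k=3, seed=0):
    """Impute y for each value in x_mis by predictive mean matching."""
    rng = random.Random(seed)
    intercept, slope = fit_line(x_obs, y_obs)
    donor_pred = [intercept + slope * xi for xi in x_obs]
    imputed = []
    for xi in x_mis:
        target = intercept + slope * xi
        # Rank donors by the distance between predicted means; keep the k closest.
        nearest = sorted(range(len(x_obs)),
                         key=lambda j: abs(donor_pred[j] - target))[:k]
        # Borrow the observed (not predicted) income of a randomly chosen donor.
        imputed.append(y_obs[rng.choice(nearest)])
    return imputed

# Hypothetical data: one covariate and observed incomes for six respondents,
# plus two individuals whose income is missing.
x_obs = [1, 2, 3, 4, 5, 6]
y_obs = [21000, 30000, 26000, 41000, 39000, 52000]
x_mis = [2.5, 5.5]
values = pmm_impute(x_obs, y_obs, x_mis)
print(values)
```

Because every imputed value is an income actually reported by a donor, the imputed values stay within the range of the observed data and retain its dispersion, rather than collapsing onto the regression line.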

The nature of the missing data leaves us with the need to impute income for three separate types of missing cases:

  1. Individuals that did not complete a person questionnaire and therefore did not report any income information (Type I) (n = 1158).
  2. Individuals that completed a person questionnaire but did not provide information on wage income (Type II) (n = 673).
  3. Individuals that completed a person questionnaire but did not provide information on non-wage income (Type III) (n = 1621).

Three separate models are estimated to impute income for each type of missing case. For Type I respondents we have information on the characteristics of their household (e.g., value of the dwelling, geographic location, the number of bedrooms) and a limited range of personal information from the household questionnaire. We also have personal information collected about other respondents in the household. These ‘family variables’ include the income, labour force status and occupation of other household members. Both the household and family variables are likely to be correlated with both personal and household income and hence act as useful explanatory variables in the model. We impute total gross financial year income for these individuals.

We have the same information for Type II and Type III respondents, together with additional personal information obtained from the items that they did complete during the interview (labour force status, age, gender, English-speaking background), including information about the sources of their income. This allows us to predict wage and non-wage income in the final two models, and to add the income that individuals report from other sources to our estimates. For example, for Type III individuals we add their imputed non-wage income to any actual reported wage and salary income.
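The way reported and imputed components are combined can be sketched as follows. The function name, its arguments and the dollar amounts are hypothetical and serve only to illustrate the rule that a reported component is always preferred to an imputed one.

```python
from typing import Optional

def total_income(reported_wage: Optional[float],
                 reported_non_wage: Optional[float],
                 imputed_wage: Optional[float] = None,
                 imputed_non_wage: Optional[float] = None) -> float:
    """Total income, using a reported component whenever one exists."""
    wage = reported_wage if reported_wage is not None else imputed_wage
    non_wage = (reported_non_wage if reported_non_wage is not None
                else imputed_non_wage)
    if wage is None or non_wage is None:
        raise ValueError("a missing component needs an imputed value")
    return wage + non_wage

# Hypothetical Type III individual: wage income reported, non-wage income imputed.
type_iii_total = total_income(reported_wage=45000.0,
                              reported_non_wage=None,
                              imputed_non_wage=3200.0)
print(type_iii_total)  # prints: 48200.0
```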

The regression model for Type I cases explains nearly 32 per cent of the variation in total gross household income, with a root mean square error (RMSE) of about $26,000. The model for Type II cases explains about 46 per cent of the variation in individuals' wage and salary income, with an RMSE of nearly $19,000. The model for Type III cases explains nearly 21 per cent of the variation in individuals' non-wage income, with an RMSE of about $20,500. Although these errors are quite large, we regard the imputation as being relatively successful, not least because it allows us to use actual data for other income sources and other household members that would otherwise be lost. The actual results from the three regression models are available from the authors upon request.
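For reference, the two fit statistics quoted above can be computed as in the sketch below. The data are hypothetical, and the RMSE here divides by the number of observations; regression software often divides by the residual degrees of freedom instead, and the text does not say which convention was used.

```python
import math

def r_squared_and_rmse(actual, predicted):
    """Share of variation explained and root mean square error of a fit."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)  # no degrees-of-freedom correction
    return r2, rmse

# Hypothetical incomes (in $'000) and fitted values.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
r2, rmse = r_squared_and_rmse(actual, predicted)
print(round(r2, 3), round(rmse, 2))  # prints: 0.948 2.55
```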

Our income imputation strategy allows us to recover household income estimates for all but 337 households (about 4 per cent of the sample), ensuring that any bias introduced by dropping missing observations from the sample is minimised.


[19] The work in Appendix C was done by Gianni La Cava and Jeremy Lawson. Further discussion of the income imputation procedure may be found in Ellis, Lawson and Roberts-Thomson (2003).