Estimating Family Income from Administrative Banking Data

A Machine Learning Approach

December 04, 2018

The JPMorgan Chase Institute was established to leverage the power of administrative banking data to deepen our understanding of critical economic issues and provide timely insights to decision makers. We recently developed a machine learning-based estimate of family income to enable deeper insights and improved representation in our research. We describe our approach and results in this new release.

Q&A on JPMC Institute Income Estimate

We discussed the motivation for this new income estimate and the potential of applying machine learning approaches to administrative banking data.

What is the JPMC Institute Income Estimate (JPMC IIE) and why did the JPMorgan Chase Institute create it?

Put simply, the Institute Income Estimate is an estimate of gross family income for families who regularly use a Chase checking account. Analyzing and understanding financial behavior of families and how it varies across the income spectrum is a central theme of the Institute's work. To better assess these dynamics, we needed to create a methodology for estimating gross family income across our data sets.

Our ability to extend insights gained from the Chase portfolio to the US population relies on having or approximating a sample that is representative of the broader population and being able to differentiate results by key attributes, such as age, income, and geography. For example, if we want to measure growth in consumer spending in Houston, as we do with our Local Consumer Commerce Index, we want to make sure that the customers we observe in Houston are truly representative of that city, and we might also want to know who within Houston is contributing most of the growth.

We know that the Chase portfolio is not a perfect mirror of the US population and doesn't offer a perfect window into its customers' income. For example, it inherently excludes the unbanked, who tend to have lower incomes. Even for banked families, a financial institution might see payroll income arrive in a customer's account but not the deductions the employer has already made for taxes, insurance, and retirement. And there might be other sources of income that aren't deposited into the customer's account.

To make our samples more representative, we have to be able to re-weight our population to match the income distribution of the country. And to study the economic behavior of low-income families, we want to define low-income consistently with national benchmarks. So we need an income measure that is comparable to Census income measures in order to re-weight and benchmark our sample. That's why we chose to create the JPMC IIE.
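To illustrate the re-weighting idea, here is a minimal sketch of post-stratification weights by income quintile. The function name, the sample data, and the use of quintiles as weighting cells are assumptions made for illustration; the text above does not specify the Institute's actual weighting procedure.

```python
from collections import Counter

def poststratification_weights(sample_bins, target_shares):
    """Return one weight per sample unit so that the weighted share of
    each bin matches the benchmark share in target_shares."""
    counts = Counter(sample_bins)
    n = len(sample_bins)
    weights = []
    for b in sample_bins:
        sample_share = counts[b] / n
        # Under-represented bins get weights above 1, over-represented below 1.
        weights.append(target_shares[b] / sample_share)
    return weights

# Hypothetical sample skewed away from the lowest quintile (1 = lowest, 5 = highest).
sample_quintiles = [1, 2, 2, 3, 3, 3, 4, 4, 4, 5]
# By definition, each quintile holds 20 percent of the benchmark population.
target = {q: 0.2 for q in range(1, 6)}
weights = poststratification_weights(sample_quintiles, target)

def weighted_share(q):
    """Weighted share of the sample that falls in quintile q."""
    total = sum(weights)
    return sum(w for w, b in zip(weights, sample_quintiles) if b == q) / total

for q in range(1, 6):
    print(q, round(weighted_share(q), 3))
```

After weighting, every quintile carries the same share of the sample, which is what matching the sample to a national income distribution amounts to when the cells are quintiles.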

At a high level, what is the methodology behind the JPMC IIE?

The idea behind JPMC IIE is quite simple: it is a classic “supervised learning” problem within machine learning. For some customers we actually know their gross family income, because they applied for a mortgage or credit card with us, and we were required to ask them about their income as part of the underwriting process. These customers represent our “truth set.” Among these customers we can then ascertain which of the characteristics that we observe for all of our customers are highly predictive of gross family income. In this sense we can train a model to predict gross family income using only features observable for everyone. Once we have tuned that model to be as predictive as possible of the ground truth, we can then deploy it to generate a predicted gross family income for everyone else.
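The train-then-deploy pattern described above can be sketched in a few lines. This is a deliberately simplified stand-in, not the Institute's model: it assumes a single hypothetical feature (annual account inflows), made-up truth-set values, and an ordinary least squares fit on the log scale.

```python
import math

def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical truth set: (annual account inflows, stated gross family income).
truth_set = [(30000, 42000), (45000, 60000), (60000, 85000),
             (90000, 130000), (120000, 170000)]

# Train on customers whose income is known, working in logs.
log_inflows = [math.log(inflows) for inflows, _ in truth_set]
log_income = [math.log(income) for _, income in truth_set]
slope, intercept = fit_simple_ols(log_inflows, log_income)

def predict_income(inflows):
    """Predict gross family income from a feature observed for everyone."""
    return math.exp(intercept + slope * math.log(inflows))

# Deploy on customers outside the truth set (hypothetical inflows).
unlabeled_inflows = [38000, 75000]
predictions = [predict_income(v) for v in unlabeled_inflows]
print([round(p) for p in predictions])
```

The key design point is that the model only ever sees features available for all customers, so whatever is learned on the truth set can be applied to everyone else.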

Just how predictive of family income is the JPMC IIE?

Our first version of JPMC IIE leveraged a wide variety of features to predict gross family income, including both account information internal to the bank and publicly available data. It is able to come up with a prediction of gross family income that is, on average, within 41 percent of the truth. That is to say, on average, the estimate could be higher or lower than actual income by 41 percent. This is referred to as the “mean absolute error.”
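As a rough illustration of how an error figure like this is computed, the sketch below averages the absolute gap between predicted and actual income, expressed as a share of actual income. Measuring the gap relative to actual income is an assumption based on the "within 41 percent" wording, and the income values are made up.

```python
def mean_absolute_percent_error(actual, predicted):
    """Average of |predicted - actual| / actual across families."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical incomes for four families in a truth set.
actual = [40000, 80000, 100000, 50000]
predicted = [50000, 60000, 110000, 50000]

error = mean_absolute_percent_error(actual, predicted)
print(f"{error:.0%}")  # individual errors of 25%, 25%, 10%, and 0% average to 15%
```

Because the absolute value is taken before averaging, overestimates and underestimates do not cancel each other out.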

Since we mostly care about ascertaining a family's income quintile, we also assessed performance based on how often predicted income fell into the same quintile as the family's true income. On that score, predicted quintile matched the true quintile 55 percent of the time and was equal or adjacent to the true quintile roughly 90 percent of the time.
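The two quintile measures can be computed as follows. This is a minimal sketch with made-up quintile assignments; the function name is hypothetical.

```python
def quintile_agreement(true_q, pred_q):
    """Return (share with exact quintile match, share within one quintile)."""
    n = len(true_q)
    exact = sum(t == p for t, p in zip(true_q, pred_q)) / n
    within_one = sum(abs(t - p) <= 1 for t, p in zip(true_q, pred_q)) / n
    return exact, within_one

# Hypothetical true and predicted income quintiles for ten families.
true_q = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
pred_q = [1, 3, 3, 4, 3, 2, 2, 3, 5, 5]

exact, within_one = quintile_agreement(true_q, pred_q)
print(exact, within_one)  # 0.6 exact, 0.9 equal or adjacent
```

In this toy data, six of ten predictions land in the true quintile and nine of ten land in the true or a neighboring quintile.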

Now, that certainly leaves room for improvement. But let us put those numbers into perspective. Had we simply guessed each family's income based on the average income among families living in the same zip code, according to tax records, we would have been off by 103 percent on average. So that shows the value of leveraging administrative banking data to predict family income.

We also took JPMC IIE on a test run by using it to weight the population in our Healthcare Out of Pocket Spending Panel. Sure enough, weighting by age and JPMC IIE made our population more representative of the general population than weighting by age alone.

How is this work different from typical JPMorgan Chase Institute research, and what were the key lessons learned from it?

This is one of the first applications of machine learning to our work. In addition, whereas most of our research aims to answer concrete research questions, this publication describes the methodology behind a key data asset for us, the JPMC IIE, which is foundational for other research.

We certainly learned quite a bit from this exercise. We'll share one key highlight.

Early on it became clear to the team that our prediction was only going to be as strong as our truth set. And we needed to make sure that the truth set was representative of the larger universe of customers for whom we were trying to predict income. By relying on mortgage and credit card applicants for the ground truth, our truth set was biased in favor of higher-income families, so we had to oversample our truth set for mortgage and credit card applicants who had lower incomes. Stratifying the truth set by income yielded a 28 percentage point improvement in the quintile prediction for families in the lowest income quintile.
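One way to picture stratifying a truth set by income is to resample it so that every income quintile contributes the same number of training records. The sketch below does this with sampling with replacement. The record layout, the equal-allocation rule, and the function name are assumptions for illustration, since the exact sampling scheme is not described here.

```python
import random
from collections import defaultdict

def stratified_resample(records, stratum_of, per_stratum, seed=0):
    """Draw per_stratum records from each stratum, with replacement."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[stratum_of(record)].append(record)
    sample = []
    for stratum in sorted(by_stratum):
        # Sampling with replacement lets small strata be oversampled.
        sample.extend(rng.choices(by_stratum[stratum], k=per_stratum))
    return sample

# Hypothetical truth set skewed toward high incomes: (customer_id, income_quintile).
quintiles = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
truth_set = list(enumerate(quintiles))

balanced = stratified_resample(truth_set, stratum_of=lambda r: r[1], per_stratum=4)
print(len(balanced))  # 20 records, 4 from each quintile
```

Here the single lowest-quintile record is drawn four times, so the model sees low-income families as often as high-income ones during training.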

So what's next for JPMC IIE? Are there plans to continue to enhance or expand the scope of the income estimate and these approaches?

As we mentioned earlier, with a mean absolute error of 41 percent, there is a lot of room for further improvement. We are taking our initial learnings and continuing to improve the model.

We are busy refining and adding to our original features to see if we can improve the accuracy of the prediction. We are also trying to expand the size of our truth set by finding additional customers within the bank for whom we have gross family income.

We also see promise in expanding the scope of this income estimate beyond checking account customers to credit customers so that we have a uniform income estimate across our universe of customers.

We hope that publishing our initial methodology will not only teach the public about the power of leveraging administrative banking data for prediction but also generate a lot of feedback for future improvement. So reach out to us with ideas and stay tuned!