A Synthetic Penalized Logitboost to Model Mortgage Lending with Imbalanced Data

Most classical econometric methods and tree-boosting-based algorithms tend to increase the prediction error with binary imbalanced data. We propose a synthetic penalized Logitboost based on weighting corrections. The procedure (i) improves prediction performance under the phenomenon in question, (ii) allows interpretability, since the coefficients stabilize during the recursive procedure, and (iii) reduces the risk of overfitting. We consider a mortgage lending case study using publicly available data to illustrate our method. Results show that errors are smaller in many extreme prediction scores, outperforming a number of existing methods. Our interpretations are consistent with results obtained using a classic econometric model.


Introduction
Predicting binary decision problems is important in empirical economics. For instance, identifying whether an applicant will default in the future or be turned down under the Home Mortgage Disclosure Act (HMDA) contributes to the study of financial inclusion policy. In fact, the distinction between events and non-events (a binary response) can be the result of a latent, unobserved random variable that triggers an event when it is high enough, so that extreme values of this variable turn into event responses.
Class-imbalanced data are relevant primarily in the context of supervised machine learning involving two (dichotomous) or more classes. Imbalanced means that the number of observations is not the same for each class of a categorical variable; in other words, one class is represented by a large number of observations while the other is represented by only a few (Japkowicz and Stephen 2002).
In the context of mortgage lending, for example, Munnell et al. (1996) have dealt with an imbalanced class problem. They found that black and Hispanic applicants were more likely than whites to be denied mortgage loans. Thus, the class corresponding to the applicants who were denied was much smaller than the applicants who were approved. The minority class (denied mortgage lending) could be coded as one, while the majority class (approved for mortgage lending) could be coded as zero.
There is evidence that the prediction accuracy of this type of events seems to remain problematic. King and Zeng (2001) note that classical econometric methods can underestimate the probability of occurrence in the minority class, while Krawczyk (2016) finds that machine-learning methods tend to exhibit a bias towards the majority class.
There is a vast literature devoted to proposing techniques to handle the class imbalance problem. Barandela et al. (2003), Kotsiantis et al. (2006), Longadge et al. (2013), and Lin et al. (2017) summarize four types of techniques: (i) data preprocessing (balancing the data by oversampling, which increases the number of observations in the minority class, or by undersampling, which reduces observations in the majority class) or an algorithmic approach (creating or modifying algorithms with threshold and one-class learning methods), (ii) cost-sensitive solutions (minimizing the costs of misclassification), (iii) feature selection (finding the optimal combination of covariates that gives the best classification), and (iv) resampling techniques incorporated in classifier ensembles such as boosting or bagging, which have given rise to proposals such as the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al. 2002), RUSBoost (Seiffert et al. 2009), UnderBagging (UB) (Barandela et al. 2003), and OverBagging (Wang and Yao 2009).
In our view, however, jointly modelling and interpreting imbalanced class phenomena without overfitting the data remains a challenge that standard machine learning does not fully address. We propose a Synthetic Penalized Logitboost that aims to decrease the mean square error in the highest and lowest prediction scores of the probability of minority class occurrence, by introducing a weighting mechanism that recalibrates a Logitboost to reduce the risk of overfitting. The Synthetic Penalized Logitboost improves the detection of extremes in the data if the purpose is to look for unusual patterns rather than for average cases. For this purpose, we borrow the specification of the model put forward by Munnell et al. (1996) to predict mortgage loan denial with a logistic regression. The paper is divided into five sections after the introduction. Section 2 describes the theoretical framework that motivates the paper. Section 3 describes the methodology in detail, specifically logistic regression (an econometric model for binary prediction), Logitboost, Gradient Tree Boost (boosting-based machine learning for binary prediction) and the proposed algorithm. Section 4 describes the data set used in an illustrative example. Section 5 sets out the results and the predictive performance measured by the root-mean-square error and includes the model's interpretation. Finally, Sect. 6 contains the conclusions.

A Closer Look at the Theoretical Framework
Considering a supervised statistical learning framework, let us start from a data set of n observations with a quantitative target variable (dependent variable) Y_i, i = 1, …, n, that has some relationship with a set of P predictor variables (also known as covariates) denoted as X_ip, p = 1, …, P. This can be written as:

$$Y_i = F(X_{ip}) + \varepsilon_i,$$

where F is a deterministic function of the X_ip, and ε_i is the error or disturbance term that captures the influence of omitted factors, is independent of X_ip and has zero mean.
In econometrics, parametric models, such as linear or generalized linear models, and non-parametric models, such as spline regressions or generalized additive models, adopt their corresponding regression form. So, in simple models, instead of estimating the corresponding P-dimensional function F(X_ip), it is necessary only to obtain the P + 1 coefficient estimates β_p of the linear predictor

$$\beta_0 + \sum_{p=1}^{P} \beta_p X_{ip}.$$

Machine learning also uses alternative F in the form of classification and decision trees (Breiman et al. 1984), radial basis functions (Gomez-Verdejo et al. 2005), and random Markov fields (Dietterich et al. 2008), among others. The function F is known as a base learner in the machine learning literature.
Function F can be used to make inferences or predictions, or both. Even though econometric models are aimed at explanatory or predictive modelling, or both, non-econometric models are mainly used for prediction purposes (classification or regression problems), because their F functions are not able to provide coefficient estimates that are directly interpretable as marginal effects. When F is used for prediction purposes, given that the model above has an error term that averages zero, a predicted target variable Ŷ_i, obtained from an estimate F̂ of the observed F, can be written as follows:

$$\hat{Y}_i = \hat{F}(X_{ip}).$$

In this setting, James et al. (2013) identify two types of errors: reducible and irreducible. When the expected value or average of the squared difference between the observed Y_i and predicted Ŷ_i is taken, we obtain:

$$E\big[(Y_i - \hat{Y}_i)^2\big] = E\big[\big(F(X_{ip}) + \varepsilon_i - \hat{F}(X_{ip})\big)^2\big],$$

which gives as a result:

$$E\big[(Y_i - \hat{Y}_i)^2\big] = \big[F(X_{ip}) - \hat{F}(X_{ip})\big]^2 + \mathrm{Var}(\varepsilon_i),$$

where the reducible error is [F(X_ip) − F̂(X_ip)]^2, and the irreducible error is Var(ε_i), the variance of the error term. In fact, machine learning with non-econometric models aims to minimize the reducible error, which is equivalent to minimizing the distance between Y_i and Ŷ_i. This distance is known as the loss function, and will be denoted as φ(Y_i, Ŷ_i). Note that Var(ε_i) cannot be reduced: these models only have a deterministic part that learns excessively from a given data set; in other words, they remove the only stochastic term. Consequently, highly accurate predictive machine learning algorithms such as certain tree-based or boosting-based techniques may result in overfitting, which means that the fitted models do not perform well on other databases. This is known as non-reproducibility. This result has also been verified by Pesantez-Narvaez et al. (2019).
Many loss functions have been proposed to develop machine learning algorithms with greater predictive accuracy. They must be convex and differentiable. This paper will focus on the exponential loss function that is used in a Logitboost:

$$\varphi(Y_i, \hat{Y}_i) = \exp(-Y_i \hat{Y}_i).$$

In order to increase the predictive capacity, therefore, it makes sense to consider a simple econometric method as a base learner in a boosting-based algorithm. Firstly, the irreducible error may be effectively reduced by readjusting the base learner to improve the model fit. Secondly, the reducible error can also be computed. The statistical intuition behind choosing a primitive econometric model is that the newest iterations of boosting-based algorithms correct the prediction error by considering the previous iterations. This can be done more efficiently if the base learner is a weak one, because there is more variability to learn from in weak base learners than in strong ones that already have good predictive performance and no or almost no variability.

Description of Methodology
Three groups of algorithms are considered: the classical econometric model, gradient boosting for classification, and Logitboost-based algorithms. The first group consists of logistic regression. The second group consists of the original gradient boosting algorithm and the gradient boosting tree. The third group consists of the original Logitboost and the proposed Synthetic Penalized Logitboost.
Note that F(X_ip; u) is the base learner mentioned earlier. It is a function of the covariates X_ip and the parameters represented by u.
In the data set that will be used in Sect. 4, there are n individuals and P covariates. The target variable Y_i is now an observed binary response variable that takes two values coded as 1 for the minority class (denied mortgage loan) and 0 for the majority class (approved for mortgage loan). Let D be the number of iterations of the boosting procedure, with d = 1, …, D.

Logistic Regression
Let us assume that in the data set of n individuals and P covariates, the target variable Y_i is now an observed binary response variable that takes two values coded as 1 for the rare class and 0 for the majority class. A logistic regression is a classical econometric tool that is used to model and predict binary dependent variables explained by quantitative or qualitative covariates. It is a specific case of a generalized linear model when the link is the logit function and is given as:

$$\log\!\left(\frac{P(Y_i = 1)}{1 - P(Y_i = 1)}\right) = \beta_0 + \sum_{p=1}^{P} \beta_p X_{ip},$$

where β_0, β_1, …, β_P are the model parameters, and P(Y_i = 1) is the probability that Y_i equals 1 conditional on the covariates. By a simple algebraic manipulation, P(Y_i = 1) is:

$$P(Y_i = 1) = \frac{\exp\!\left(\beta_0 + \sum_{p=1}^{P} \beta_p X_{ip}\right)}{1 + \exp\!\left(\beta_0 + \sum_{p=1}^{P} \beta_p X_{ip}\right)}.$$

A logistic regression can be estimated by the maximum likelihood method (for further details, see for example McCullagh and Nelder 1989).
For example, if F(X_ip; u) is a regression model, u represents the coefficient estimates β, whereas if F(X_ip; u) is a classification and regression tree (CART), then u represents the branches of the tree (splitting rules).
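As a brief illustration of this base model, the following Python sketch fits a logistic regression by maximum likelihood and recovers the predicted probabilities P(Y_i = 1). It uses simulated data and statsmodels as one possible estimator; the variable names and coefficient values are hypothetical and only serve to show the mechanics.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, P = 1000, 3
X = rng.normal(size=(n, P))                      # hypothetical covariates X_ip
eta = -2.0 + X @ np.array([0.8, -0.5, 0.3])      # linear predictor (negative intercept -> imbalance)
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))      # binary target, 1 = minority class

Xc = sm.add_constant(X)                          # add the intercept beta_0
logit_fit = sm.Logit(y, Xc).fit(disp=0)          # maximum likelihood estimation
print(logit_fit.params)                          # coefficient estimates beta_p
p_hat = logit_fit.predict(Xc)                    # estimated P(Y_i = 1 | X_ip)
```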

Gradient Boosting
The idea behind the Gradient Boosting proposed by Friedman (2001) is to compute a sum of optimized functions through an iterative process. The optimized functions are the result of minimizing a loss function φ. Let us assume that in the data set of n individuals and P covariates, the target variable Y_i is now continuous. The gradient boosting procedure starts with an initial guess of the prediction Ŷ^0_i, obtained by minimizing the loss function between the observed Y_i and an arbitrary constant ρ:

$$\hat{Y}^{0}_{i} = \arg\min_{\rho} \sum_{i=1}^{n} \varphi(Y_i, \rho).$$
Begin Algorithm: For d = 1 to D do:

The pseudo-residuals r_i are computed as the negative gradient of the loss function evaluated at the previous prediction:

$$r_i = -\left[\frac{\partial \varphi(Y_i, \hat{Y}_i)}{\partial \hat{Y}_i}\right]_{\hat{Y}_i = \hat{Y}^{d-1}_i}.$$

Then the squared error between the pseudo-residuals and F(X_ip; u) is minimized. This results in an updated u_d:

$$u_d = \arg\min_{u} \sum_{i=1}^{n} \big[r_i - F(X_{ip}; u)\big]^2.$$

Let γ_d be the result of minimizing the loss function between the observed Y_i and Ŷ^{d−1}_i + γ Ŷ^d_i:

$$\gamma_d = \arg\min_{\gamma} \sum_{i=1}^{n} \varphi\big(Y_i,\, \hat{Y}^{d-1}_i + \gamma\, \hat{Y}^{d}_i\big).$$

Note that Ŷ^d_i = F(X_ip; u_d) is the prediction obtained from the given covariates X_ip and the updated parameters u_d at iteration d.

The prediction is then updated as the sum of the previous prediction Ŷ^{d−1}_i and γ_d Ŷ^d_i, so that the final prediction is the one obtained at iteration D.

End For End Algorithm.
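A minimal sketch of this generic recursion is shown below, assuming (purely for illustration) a squared loss and a shallow regression tree as base learner; any convex, differentiable loss and any weak learner could be substituted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, D=100):
    """Minimal generic gradient boosting sketch for a continuous target with
    squared loss; the base learner is a depth-1 regression tree (an assumption)."""
    F = np.full(len(y), y.mean())          # initial guess: argmin_rho of the squared loss
    learners, gammas = [], []
    for d in range(D):
        r = y - F                          # pseudo-residuals (negative gradient of squared loss)
        h = DecisionTreeRegressor(max_depth=1).fit(X, r)
        pred = h.predict(X)
        # line search: gamma minimizing sum (y - F - gamma * pred)^2
        gamma = (r @ pred) / (pred @ pred + 1e-12)
        F = F + gamma * pred               # update the prediction
        learners.append(h)
        gammas.append(gamma)
    return learners, gammas, F
```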

Gradient L2 TreeBoost (Two-Class Logistic Boost)
Let us assume that in the data set of n individuals and P covariates, the target variable Y_i is now an observed binary response variable that takes two values coded as 1 for the rare class and 0 for the majority class. The L2 TreeBoost proposed by Friedman (2001) differs from the Original Gradient Boost in:
• Initial prediction Ŷ^0_i
• Loss function: logistic loss function
• Base learner: decision tree
The first estimation is calculated as follows:

$$\hat{Y}^{0}_{i} = \frac{1}{2}\log\!\left(\frac{\bar{Y}}{1-\bar{Y}}\right),$$

where Ȳ is the mean of the dependent variable.
Begin Algorithm: The base learner F(X_ip; u, R) equals $\sum_{j=1}^{J} u_j \, 1(X_{ip} \in R_j)$, with J terminal nodes known as leaves, and R_j regions or classification rules, j = 1, …, J. The parameters u correspond to the score of each leaf, which is the proportion of cases classified into Y_i given covariates X_ip. Tree-based algorithms are theoretically more efficient than linear or generalized linear methods in capturing non-linearities. The idea is that tree-based algorithms use information gain (measured by Gini impurity or entropy) to split a node. This helps to order the decision nodes associated with each covariate X_ip, so that the decision node with the highest information gain splits first, and so on down to the one with the lowest information gain. The information gain builds the R_j classification rules that map each observation i onto the correct leaf j by minimizing the entropy or Gini impurity of each node, so that the observations contained in the node are the most homogeneous [see further details in Hastie et al. (2009)].

For d = 1 to D do:

Now R_jd is computed by mapping all observations onto leaf j of the tree (j = 1, …, J) at iteration d, considering the pseudo-residuals r_i as the target variable and X_ip as covariates:

$$\{R_{jd}\}_{j=1}^{J} = J\text{-terminal-node tree}\big(\{r_i, X_{ip}\}_{i=1}^{n}\big).$$

Therefore γ_jd is calculated for each leaf by minimizing the logistic loss function between the observed Y_i and the previous prediction plus a constant γ in that leaf:

$$\gamma_{jd} = \arg\min_{\gamma} \sum_{X_{ip} \in R_{jd}} \varphi\big(Y_i,\, \hat{Y}^{d-1}_i + \gamma\big).$$

However, since there is no closed form for the previous equation, an approximation of γ_jd is obtained through a single Newton-Raphson step on the logistic loss (see Friedman 2001). And the final prediction Ŷ^d_i is computed as:

$$\hat{Y}^{d}_{i} = \hat{Y}^{d-1}_{i} + \sum_{j=1}^{J} \gamma_{jd}\, 1(X_{ip} \in R_{jd}).$$

End For End Algorithm.
Since tree-based algorithms generally overfit, decision tree pruning is considered in order to build a smaller tree with fewer terminal nodes J, which leads to smaller variance by retaining the most relevant information and removing the least relevant (see further details in Hastie et al. 2009). For simplicity, Gradient L2 TreeBoost will be referred to as Gradient Tree Boost from here on.
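The sketch below illustrates the two-class recursion described above, under the assumption (taken from Friedman 2001) that the labels are recoded as Y* = 2Y − 1 ∈ {−1, +1}; the tree growing is delegated to scikit-learn and the leaf constants use the single Newton step mentioned in the text. It is a minimal illustration, not the exact implementation used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_treeboost(X, y01, D=100, J=4):
    """Minimal two-class L2 TreeBoost sketch (after Friedman, 2001).
    y01 holds labels in {0, 1}; they are recoded internally to {-1, +1}."""
    y = 2 * y01 - 1.0
    F = np.full(len(y), 0.5 * np.log((1 + y.mean()) / (1 - y.mean())))  # initial log-odds
    trees, leaf_values = [], []
    for d in range(D):
        r = 2 * y / (1 + np.exp(2 * y * F))            # pseudo-residuals of the logistic loss
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)
        leaf = tree.apply(X)                           # leaf index of each observation
        gamma = {}
        for j in np.unique(leaf):                      # one Newton step per leaf
            rj = r[leaf == j]
            gamma[j] = rj.sum() / ((np.abs(rj) * (2 - np.abs(rj))).sum() + 1e-12)
        F = F + np.array([gamma[j] for j in leaf])     # update the score
        trees.append(tree)
        leaf_values.append(gamma)
    p = 1 / (1 + np.exp(-2 * F))                       # probability of the class coded +1
    return trees, leaf_values, p
```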

Logitboost
The previous gradient boosting algorithms require the minimization of a loss function φ(Y_i, Ŷ_i). However, Friedman et al. (2000) showed how to approximate the logistic model by an additive logistic regression procedure known as "Logitboost".
Let us assume that in the data set of n individuals and P covariates, the target variable Y_i is now an observed binary response variable that takes two values coded as 1 for the rare class and 0 for the majority class.
The Logitboost has some initial conditions: Ŷ^0_i = 0 and p^0(X_i) = 1/2.

Begin Algorithm: For d = 1 to D do:

Each iteration starts by computing the working response z_i:

$$z_i = \frac{Y_i - p(X_i)}{p(X_i)\,\big(1 - p(X_i)\big)}.$$
In this case the weighted least squares fit is based on a quadratic approximation of the log-likelihood with which a logistic regression can be estimated, as explained in Sect. 3.2. According to Friedman et al. (2000), this quadratic approximation can be a gentle alternative when the exponential loss function is used. Therefore, the working response z_i is an expression analogous to the pseudo-residuals r_i. Again, the exponential loss function introduced earlier is:

$$\varphi(Y_i, \hat{Y}_i) = \exp(-Y_i \hat{Y}_i),$$

where Ŷ_i is obtained as follows:

$$\hat{Y}_i = \frac{1}{2}\log\!\left(\frac{p(X_i)}{1 - p(X_i)}\right).$$

A vector of weights w_i is computed as follows:

$$w_i = p(X_i)\,\big(1 - p(X_i)\big).$$

A base learner F(X_i; u) must be trained by fitting a weighted least squares regression, as explained in Sect. 3.1, with the vector of weights w_i and the target variable z_i. Note that even though a binary target variable is set for this boosting, this F admits continuous target variables. The reason is that the working response z_i transforms the binary variable Y_i into a continuous one, so that only two distinct values are found in the first iteration. From the second iteration onwards, however, the values of z_i change during the boosting, so that at the end several distinct values of z_i are found.
Ŷ^d_i has to be updated as follows:

$$\hat{Y}^{d}_{i} = \hat{Y}^{d-1}_{i} + \frac{1}{2}\, F(X_i; u_d),$$

where the parameters u_d are the coefficient estimates obtained in the weighted least squares regression. Then the probabilities have to be updated:

$$p(X_i) = \frac{e^{\hat{Y}^{d}_{i}}}{e^{\hat{Y}^{d}_{i}} + e^{-\hat{Y}^{d}_{i}}}.$$

End For End Algorithm.
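A compact sketch of this recursion with a weighted least-squares linear base learner is given below. The small floor placed on the weights is an illustrative numerical safeguard, not part of the original formulation, and the closed-form normal-equation solve is just one way of fitting the weighted regression.

```python
import numpy as np

def logitboost(X, y, D=100):
    """Minimal Logitboost sketch with a weighted least-squares linear base
    learner (after Friedman, Hastie and Tibshirani, 2000). y is coded {0, 1}."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])   # add an intercept column
    F = np.zeros(n)                          # initial score Y_hat^0 = 0
    p = np.full(n, 0.5)                      # initial probability p^0 = 1/2
    coefs = []
    for d in range(D):
        w = p * (1 - p)                                  # weights
        z = (y - p) / np.clip(w, 1e-4, None)             # working response
        WX = Xc * w[:, None]                             # weighted design matrix
        u = np.linalg.solve(Xc.T @ WX, Xc.T @ (w * z))   # weighted least squares fit
        F = F + 0.5 * (Xc @ u)                           # update the score
        p = 1 / (1 + np.exp(-2 * F))                     # update the probabilities
        coefs.append(u)
    return np.array(coefs), p
```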

Synthetic Penalized Logitboost
The proposed Synthetic Penalized Logitboost incorporates slight changes to the original Logitboost and introduces a new, alternative weighting mechanism w_i. This methodological proposal was particularly motivated by Pesantez-Narvaez and Guillen (2020a, b), who proposed weighting corrections in parametric models to improve their predictive performance for binary dependent variables. We keep the two initial conditions for Ŷ^0_i and p^0(X_i) (Ŷ^0_i = 0 and p^0(X_i) = 1/2).

Begin Algorithm: For d = 1 to D do:

Each iteration starts by computing the working response z_i:

$$z_i = \frac{Y_i - p(X_i)}{p(X_i)\,\big(1 - p(X_i)\big) + \epsilon},$$
where ε is a very small number (close to zero), e.g. 0.0001, so that we avoid division by zero. Ŷ_i is obtained as follows:

$$\hat{Y}_i = \frac{1}{2}\log\!\left(\frac{p(X_i)}{1 - p(X_i)}\right).$$

A vector of weights w_i is then computed. This weighting mechanism aims to penalize by giving less weight to observations whose distance between the observed Y_i and the probability estimate p(X_i) is greater than the mean of the dependent variable. In other words, we penalize observations which are more likely to be misclassified. This weighting mechanism leads to stabilization after very few iterations of the boosting procedure. The weights must be normalized by dividing by the sum of the vector of weights:

$$w_i \leftarrow \frac{w_i}{\sum_{l=1}^{n} w_l}.$$

F(X_ip; u) has to be trained as a weighted least squares regression with weights w_i. Ŷ^d_i has to be computed as follows:

$$\hat{Y}^{d}_{i} = \hat{Y}^{d-1}_{i} + \frac{1}{2}\, F(X_{ip}; u_d).$$

And we must update the probabilities:

$$p(X_i) = \frac{e^{\hat{Y}^{d}_{i}}}{e^{\hat{Y}^{d}_{i}} + e^{-\hat{Y}^{d}_{i}}}.$$

The final p(X_i) is related to the log-odds through the expression for Ŷ_i given above.
End For End Algorithm.
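The exact penalized weighting formula is not reproduced in the text above, so the snippet below only illustrates one possible reading of the verbal description: observations whose distance |Y_i − p(X_i)| exceeds the mean of the dependent variable receive less weight, and the weights are then normalized. Both the functional form and the `shrink` factor are assumptions made purely for illustration.

```python
import numpy as np

def penalized_weights(y, p, shrink=0.5):
    """Illustrative only: down-weight observations whose distance between the
    observed y and the probability estimate p exceeds the mean of y.
    The exact formula of the Synthetic Penalized Logitboost is not shown here;
    `shrink` is a hypothetical down-weighting factor."""
    dist = np.abs(y - p)                     # distance between Y_i and p(X_i)
    w = np.where(dist > y.mean(), shrink * (1.0 - dist), 1.0 - dist)
    return w / w.sum()                       # normalize so the weights sum to one
```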

Illustrative Data and Descriptive Statistics
In order to illustrate the proposed methodology, we use a publicly available Home Mortgage Disclosure Act (HMDA) cross-section data set, which was collected by the US Government through a survey designed to gather additional information on minority group applicants. The intention was to uncover whether discrimination based on the applicants' race occurs in mortgage lending. The sample has 2381 applicants who were chosen by simple random sampling in Boston, Massachusetts (United States) in 1997-1998. The sample contains an equal number of denials among white and minority applicants in order to provide sufficient power to detect any discrimination. The HMDA database is also available in the Ecdat package in R. We drop the last observation due to missingness; no imputation technique was necessary. Table 1 describes the variables in the HMDA cross-section data set.
Even though these data are old, we believe that they are useful for showing the implementation and testing of the newly proposed model, since the data set contains the variables required to replicate the model proposed by Munnell et al. (1996). Table 2 shows the descriptive statistics for the HMDA data set. The last row reveals that a substantial part of the sample has an approved mortgage application (88.03%). The mean ratios corresponding to debt to total income and housing expenses to income are slightly higher for applicants whose mortgage application was denied, which means that their debt burden is higher than it is for the other applicants. Additionally, the mean ratio of the size of loan to assessed value of property is almost 9% higher for applicants with a denied mortgage application. The consumer credit score and mortgage credit score of approved applicants are, respectively, 0.6 and 0.88 points better than the scores of denied applicants. Whereas 56.57% of applicants with a bad public credit record were approved, 43.43% were denied. Moreover, 8.33% of applicants who were denied mortgage insurance had an approved mortgage application, while 91.67% were also denied their mortgage application. While 83.39% of self-employed applicants were approved, 88.65% of applicants who were not self-employed were approved. Also, 84.94% of single applicants were approved, while 15.06% were not. There is only a slight percentage difference between applicants who live in a condominium and had an approved mortgage application and those who live in a condominium and had a denied mortgage application. Lastly, 71.68% of black applicants were approved, while 90.74% of non-black applicants were approved.

Results and Discussion
This section contains two parts. The first part presents the results of the prediction performance of the Synthetic Penalized Logitboost in comparison to the algorithms described in Sect. 3; the results are shown below based on three calculations. The second part presents a proposal to recover the interpretability of the Synthetic Penalized Logitboost model.

The variables refer to the debt payment to total income ratio (Dir); housing expenses to income ratio (Hir); ratio of size of loan to assessed value of property (Lvr); consumer credit score from 1, as the best score, to 6, as the lowest score (Css); mortgage credit score from 1, as the best score, to 4, as the lowest score (Mcs); whether the applicant has a public bad credit record (Pbcr); whether the applicant was denied mortgage insurance (Dmi); whether the applicant is self-employed (Self); whether the applicant is single (Single); the 1989 Massachusetts unemployment rate in the applicant's industry (Uria); whether the applicant lives in a condominium (Condominium); whether the applicant is black (Black); and finally, the mortgage application (Y), which was coded as 1 when the mortgage application was denied, and 0 otherwise.

Table 3 presents the root-mean-square error (RMSE) of Logistic regression, Logitboost, Gradient Tree Boost and the Synthetic Penalized Logitboost, tested for three scenarios: the entire sample (all observations), the observations that correspond to Y_i = 1, and the observations that correspond to Y_i = 0. The RMSE, defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \hat{Y}_i\big)^2},$$

is suitable for measuring the distance between the observed Y_i and the predicted Ŷ_i, so the predictive performance does not depend, for example, on the precision of the threshold picked to build a confusion matrix.
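Since the tables report RMSE separately on the lowest and highest accumulated prediction scores, a small helper of the following kind can be used. This is a hypothetical reading of that evaluation scheme, not code from the paper: it computes the RMSE on the fraction of observations whose predicted scores are most extreme.

```python
import numpy as np

def extreme_rmse(y, y_hat, fraction, tail="lower"):
    """RMSE computed only on the observations whose predicted scores fall in
    the lowest (tail='lower') or highest (tail='upper') `fraction` of the
    accumulated prediction scores. Illustrative reading of the evaluation scheme."""
    order = np.argsort(y_hat)
    k = max(1, int(np.ceil(fraction * len(y_hat))))
    idx = order[:k] if tail == "lower" else order[-k:]
    return np.sqrt(np.mean((np.asarray(y)[idx] - np.asarray(y_hat)[idx]) ** 2))
```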

Prediction Performance
The Gradient Boost (tree) is built with the model developer's default hyperparameters from the gbm package in R, which correspond to the number of trees (100), the maximum depth of variable interactions (1), the minimum number of observations in the terminal nodes of the trees (10), and shrinkage (0.1). The Gradient Boost (tree) GS-CV is built with tenfold cross-validation and hyperparameters optimized through grid search with the caret package in R, which correspond to the number of trees (150), the maximum depth of variable interactions (2), the minimum number of observations in the terminal nodes of the trees (10), and shrinkage (0.1). Logistic, Logitboost, and Synthetic Penalized Logitboost are built according to the definitions in Sect. 3, and they do not have hyperparameters.
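gbm and caret are R packages; a rough Python analogue of the same setup is sketched below, with scikit-learn's GradientBoostingClassifier used as a stand-in (number of trees → n_estimators, interaction depth → max_depth, minimum terminal-node size → min_samples_leaf, shrinkage → learning_rate). The candidate grid values are illustrative assumptions around the selected values reported above, and X_train, y_train are hypothetical names.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Rough analogue of the gbm defaults reported in the text
gbt_default = GradientBoostingClassifier(
    n_estimators=100, max_depth=1, min_samples_leaf=10, learning_rate=0.1
)

# Rough analogue of the caret grid search with tenfold cross-validation
param_grid = {
    "n_estimators": [50, 100, 150],       # illustrative candidate values
    "max_depth": [1, 2, 3],
    "min_samples_leaf": [10],
    "learning_rate": [0.1],
}
gbt_gscv = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=10)

# gbt_default.fit(X_train, y_train)       # X_train, y_train: hypothetical training data
# gbt_gscv.fit(X_train, y_train)
```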
In the first calculation, Logistic regression and Logitboost perform almost the same, confirming numerically what was noted theoretically. Synthetic Penalized Logitboost has a smaller RMSE in some of the lowest and highest accumulated predictions, even when it is compared with the Gradient Tree Boosting models (with and without optimized hyperparameters).
When analysing the observations that correspond to denied applications (Y_i = 1), both Gradient Tree Boost models perform worse than Logistic and Logitboost for some high score predictions. This confirms the fact that optimized Gradient Tree Boost methods risk failing to predict the minority class (Y_i = 1) even when their performance is better with the complete data set. However, the Synthetic Penalized Logitboost performs better than Logistic and Logitboost in the lowest accumulated predictions, and better than the Gradient Tree Boost GS-CV in the 1% and 5% highest accumulated predictions.
When analysing the observations that correspond to accepted applications (Y_i = 0), Logitboost differs considerably from Logistic in the highest predictions, where it performs much better, while in the lowest scores, the results are very similar for both models. Now, Gradient Tree Boost GS-CV performs better than the two classical methods, while Synthetic Penalized Logitboost also generally performs better than the classical methods.
It can be concluded that Synthetic Penalized Logitboost makes slightly more accurate predictions than the other algorithms in most observations for the scores in the upper and lower extremes.
Table 4 Root-mean-square error of logistic regression, logitboost, gradient tree boost and the synthetic penalized logitboost for the training and testing HMDA data sets. The HMDA database was randomly split into training data (70%) and testing data (30%). Each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the lowest accumulated prediction scores is shown on the left-hand side of the table under "Lower Extreme", and each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the highest accumulated prediction scores is shown on the right-hand side of the table.

The second calculation in Table 4 shows the RMSE of the previously discussed methods split into training and testing HMDA data sets. The Synthetic Penalized Logitboost performs quite similarly in the training and testing data sets. This result might be explained by the fact that the algorithm is built with an error term that allows for random variation in covariates when modelling the target variable; consequently, it avoids overfitting. A similar behaviour is obtained with logistic regression, which is a parametric model. Gradient Tree Boost requires hyperparameter optimization and cross-validation procedures to correct overfitting. While correction methods to avoid overfitting are widely accepted in the machine learning literature, tuning shrinkage parameters is risky in terms of interpretation.
As their values increase, they deliberately shrink or remove variables (nodes) with smaller entropy or Gini impurity. However, empirical econometric analysis demands the measurement of the coefficient estimates even when they are not significant in the model; otherwise the analyst may lose sight of their natural effect on the dependent variable.
Table 5 Predictive measures of logistic regression, logitboost, gradient tree boost and the synthetic penalized logitboost for the testing and training HMDA data sets. The HMDA database was randomly split into training data (70%) and testing data (30%). The threshold used to convert the continuous response into a binary response is the mean of the outcome variable. Recall measures the ratio of applicants who were classified in the denied mortgage application group to those who were effectively denied. Specificity measures the ratio of applicants who were classified in the denied group to those who were not denied. Accuracy measures the proportion of applicants who are correctly classified. Precision is the ratio of correctly predicted denied applicants to the total predicted denied applicants. The F1 Score is the weighted average of Precision and Recall.

The third calculation in Table 5 presents the predictive measures of the discussed methods. The Synthetic Penalized Logitboost has higher accuracy than Logistic and Logitboost and higher specificity than Gradient Boost (Tree) GS-CV in the testing data sets. In aggregate terms, the Synthetic Penalized Logitboost has a larger RMSE than the alternative methods. Note that the error correction through penalization is focused on observations which are far from the average values, so the proposed method tends not to affect the predictive improvement of mean observations. We observe quite similar patterns of performance when the Synthetic Penalized Logitboost is applied to data sets that have low frequencies, for example, HMDA 2012 and 2017. The results obtained for the testing and training data sets are very close to each other and do not differ significantly. Moreover, the Synthetic Penalized Logitboost has a lower RMSE than the alternative methods in the 1% and/or 5% lower and upper extremes. Further details and discussion of the results obtained with HMDA 2012 and 2017 are presented in the "Appendix". Figure 1 shows the evolution of the RMSE within 100 iterations of the Synthetic Penalized Logitboost. The algorithm's RMSE becomes stable after a number of iterations. While there is no theoretical guarantee that the proposed method will stabilize after some iterations, we obtained similar behaviour when applying the Synthetic Penalized Logitboost to the HMDA 2012 and 2017 data sets. We propose trying alternative initial values if this does not happen.
The RMSE is smaller and more homogeneous for observations in the minority group (Y_i = 1) in the lowest predictions, while the RMSE is larger and more heterogeneous for observations in the majority group (Y_i = 0) in the highest predictions. In aggregate terms, the lowest 1% and the highest 1% of predicted scores (extreme values) have a much more accurate performance than the other accumulated percentages of predictions.

Recovering the Interpretability of the Model
Machine learning algorithms are sometimes considered black boxes since their interpretability is not straightforward. In contrast, the Synthetic Penalized Logitboost can be seen as a method that recalibrates a least squares regression in reweighted versions and penalizes incorrect predictions, so its interpretability can be recovered. Let us note again in Fig. 1 that when the RMSE achieves stabilization in the boosting procedure (minimum variance), so too do the coefficient estimates of the model. Therefore, if the coefficients are averaged, one might gain some intuition about the sign and magnitude of the covariate effect on the response. Table 6 shows the coefficient estimates obtained by a logistic regression and the Synthetic Penalized Logitboost. The results obtained by the logistic regression are consistent with the conclusions obtained by Munnell et al. (1996).
Moreover, the sign of the mean of the coefficient estimates of the Synthetic Penalized Logitboost within iterations is almost the same before and after the stabilization. The signs and the magnitude of the coefficients are consistent with the ones obtained by logistic regression. Nonetheless, the magnitude seems to be expressed on another scale, which was expected since the target variable used in the two methods is not the same.
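In practice, the interpretable summary can be obtained by averaging the per-iteration coefficient estimates once the boosting has stabilized; a small sketch follows, where `coefs` is the iteration-by-coefficient array returned by a Logitboost-style routine (such as the sketches above) and the `burn_in` cut-off is an assumption chosen by inspecting the RMSE path (e.g. Fig. 1).

```python
import numpy as np

def averaged_coefficients(coefs, burn_in=30):
    """coefs: array of shape (D, P + 1) holding the coefficient estimates of the
    weighted least-squares base learner at each boosting iteration.
    Returns the mean coefficient vector over the post-stabilization iterations."""
    coefs = np.asarray(coefs)
    return coefs[burn_in:].mean(axis=0)
```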
Regarding the economic interpretation, Table 6 provides interesting results. Both the applicants with a high debt payment to income ratio and the applicants with a high ratio of size of loan to assessed value of property are more likely to have their mortgage application denied. Moreover, the applicants with the lowest consumer and mortgage credit scores are more likely to be denied. Single applicants are more likely than non-single applicants to have a denied mortgage application. A higher unemployment rate in the applicant's industry also makes a denial more likely. Last but not least, black applicants are more likely than others to have a denied mortgage application, even when controlling for all the ratios and factors included in the model. The Synthetic Penalized Logitboost provides similar interpretations, as the mean coefficients have almost the same signs as the logistic regression coefficients, even though they are not directly comparable in size. The signs are compared with those reported by Munnell et al. (1996), and the effect of Condominium should be analysed in depth, since more types of living spaces (more and less expensive) must be controlled for to verify payment guarantees.


Conclusions
We borrowed the mortgage lending model specification put forward by Munnell et al. (1996) to provide a real-life application of the proposed algorithm in empirical economics. We conclude that weighting corrections in machine learning algorithms with an econometric base learner can improve predictive performance by decreasing the RMSE in several segments of the predictions. The Synthetic Penalized Logitboost preserves a stochastic term and trains a weighted linear regression as its base learner in order to prevent overfitting. Hence, the algorithm can be applied to alternative data sets without losing predictive power. Although the improvement in predictive performance is not large, we provide evidence that the method can achieve a smaller RMSE than the Gradient Tree Boost (recognized for smartly capturing non-linearities) for observations that belong to the minority class in imbalanced data problems, which tend to be underestimated by econometric methods and machine learning algorithms in general.
Beyond that, the empirical sciences face challenges with machine learning architectures when their purpose is not only to make predictions using imbalanced data, but also to explain their causes in detail. On the one hand, economists have so far used econometrics to analyse the determinants of a specific phenomenon, but some models tend to be oversimplified owing to the rigidity of the linear specifications in most classical models. On the other hand, machine learning handles large-scale, complex data accurately but cannot provide direct coefficient estimates that link the effects of exogenous variables to the response outcome. The Synthetic Penalized Logitboost starts to combine these two approaches by providing some statistical intuition about its coefficient estimates, since the base learner is a weighted least squares regression. As a result, the model stabilizes its coefficients while also being able to deal with complex structures and imbalanced phenomena.
Since the Synthetic Penalized Logitboost strongly penalizes observations whose probability estimates deviate considerably from the observed target variable, we wonder whether the predictive performance could be further improved in more imbalanced data sets or more complex models than the one presented here. While the model specification in Munnell et al. (1996) works with tailor-made survey data, our proposed model can also work with extensive data obtained through web scraping or with device-collected data.
Appendix

This appendix provides the results of the prediction performance of the Synthetic Penalized Logitboost in comparison to the algorithms described in Sect. 3 for the HMDA 2012 and HMDA 2017 data sets.

Table 7 shows the RMSE of Logistic regression, Logitboost, Gradient Tree Boost and Synthetic Penalized Logitboost for the training and testing HMDA 2012 data sets. The Synthetic Penalized Logitboost has a lower RMSE than the other methods, especially in the 1% and 5% lower and upper extremes. The second-best prediction performance for the lower extremes corresponds to the results obtained by the Logistic and Logitboost, with a prediction error equal to zero, while the second-best for the upper extremes corresponds to the Gradient Boost (tree) GS-CV. Moreover, the Synthetic Penalized Logitboost has a similar performance in the testing and training data sets. Table 8 presents additional predictive measures of Logistic regression, Logitboost, Gradient Tree Boost and the Synthetic Penalized Logitboost for the testing and training HMDA 2012 data sets. The Synthetic Penalized Logitboost has the highest recall, with similar rates in the training and testing data sets. The second highest recall corresponds to the Gradient Boost (tree) GS-CV. Figure 2 shows the RMSE across 100 iterations of the Synthetic Penalized Logitboost for the HMDA 2012 data set. Approximately the first 5-10 iterations show abrupt changes; however, after roughly iteration 30 the RMSE becomes stable.

Table 7 Root-mean-square error of logistic regression, logitboost, gradient tree boost and the synthetic penalized logitboost for the training and testing HMDA 2012 data sets. The HMDA database was randomly split into training data (70%) and testing data (30%). Each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the lowest accumulated prediction scores is shown on the left-hand side of the table under "Lower Extreme", and each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the highest accumulated prediction scores is shown on the right-hand side of the table.

Table 9 shows the RMSE of Logistic regression, Logitboost, Gradient Tree Boost and Synthetic Penalized Logitboost for the training and testing HMDA 2017 data sets. All methods have a prediction error equal to zero in the lower extreme, while the Synthetic Penalized Logitboost and Logistic regression have the smallest RMSE in the upper extremes. Additionally, Table 10 presents alternative predictive measures for the mentioned methods. The Synthetic Penalized Logitboost again has the highest recall, even though its RMSE in aggregate terms is higher than that of the other methods. Finally, Fig. 3 shows that the RMSE becomes stable after roughly iteration 40.

Table 8 Predictive measures of logistic regression, logitboost, gradient tree boost and the synthetic penalized logitboost for the testing and training HMDA 2012 data sets. The HMDA database was randomly split into training data (70%) and testing data (30%). The threshold used to convert the continuous response into a binary response is the mean of the outcome variable.

Considering the results examined for HMDA, HMDA 2012 and HMDA 2017, the Synthetic Penalized Logitboost increases the true positive rate of the predictions, in particular in the most extreme observations, and it can reach convergence after some iterations in the boosting procedure.

Table 9 Root-mean-square error of logistic regression, logitboost, gradient tree boost and the synthetic penalized logitboost for the training and testing HMDA 2017 data sets. The HMDA database was randomly split into training data (70%) and testing data (30%). Each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the lowest accumulated prediction scores is shown on the left-hand side of the table under "Lower Extreme", and each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the highest accumulated prediction scores is shown on the right-hand side of the table.