The Use of Telematics Devices to Improve Automobile Insurance Rates

Most automobile insurance databases contain a large number of policyholders with zero claims. This high frequency of zeros may reflect the fact that some insureds make little use of their vehicle, or that they do not wish to make a claim for small accidents in order to avoid an increase in their premium, but it might also be because of good driving. We analyze information on exposure to risk and driving habits using telematics data from a pay‐as‐you‐drive sample of insureds. We include distance traveled per year as part of an offset in a zero‐inflated Poisson model to predict the excess of zeros. We show the existence of a learning effect for large values of distance traveled, so that longer driving should result in higher premiums, but there should be a discount for drivers who accumulate longer distances over time due to the increased proportion of zero claims. We confirm that speed limit violations and driving in urban areas increase the expected number of accident claims. We discuss how telematics information can be used to design better insurance and to improve traffic safety.


INTRODUCTION AND MOTIVATION
According to the World Health Organization (WHO, 2017), road traffic injuries are responsible for more than 1.2 million deaths every year. Indeed, they are the leading cause of mortality among those aged between 15 and 29, at a cost to governments of approximately 3% of their GDP. This situation is exacerbated if we contemplate the fact that from the beginning of 2013 until the end of 2015, there was a 16% increase in the number of vehicles on the world's roads.
Automobile insurance is compulsory in almost all countries and, recently, many insurance companies have begun to collect telematics data about drivers' exposure to traffic (i.e., distance driven and vehicle location) and their driving behavior (excess speed and aggressiveness). This information can improve the insurance rate-making process and also allows conclusions to be drawn about how to make driving safer (Ayuso, Guillen, & Nielsen, 2018; Edlin, 2003; Ferreira & Minikel, 2013; Langford, Koppel, McCarthy, & Srinivasan, 2008; Lemaire, Park, & Wang, 2016; Litman, 2005; Paefgen, Staake, & Fleisch, 2014; Paefgen, Staake, & Thiesse, 2013; Pérez-Marin & Guillen, 2019; Sivak et al., 2007). New automobile insurance products (known by the acronyms PAYD, pay-as-you-drive, or PHYD, pay-how-you-drive) require the installation of a GPS device in the insured vehicle to record and store relevant information about variables that change over time, including, for example, the number of kilometers driven per day by the insured, the percentage of kilometers driven above the speed limit, and the percentage of kilometers driven at night, among others. This development represents a remarkable advance, given that, previously, automobile insurance companies could only use variables related to certain fixed characteristics of the insured (e.g., age, gender, or number of years since the driver's license was issued) and the vehicle (age of the automobile, engine power, etc.).
Most automobile insurance databases contain many policyholders with zero claims. This high frequency of "zeros" may be due to the presence of insureds who have no wish to claim for small accidents in order to avoid a premium increase or, alternatively, it might be due to the relative lack of use they make of their vehicles. If the vehicle is parked in a garage, it is not exposed to the risk of accident. Here, we analyze distance driven as a measure of exposure to risk and examine its role in the probability of an insured having zero claims. We show how to differentiate those drivers who almost never use their vehicles (and so have little exposure to the risk of an accident) from those who are good drivers, that is, those who, despite recording high mileages, are not involved in any accidents. In what follows, we refer to accidents as opposed to claims, even though we are aware that some accidents are not reported to the insurance company. Indeed, a detailed discussion of the difference between the number of accidents and the number of claims has previously been reported by Boucher, Denuit, and Guillen (2009).
We discover a positive relationship between the distance driven and the number of excess zeros observed in the number of claims. We argue that this is due to a learning effect, where good drivers are more frequent than expected among those who drive long distances. The overall effect of the driving distance variable is positive: although longer driving should result in a higher premium, there is a discount arising from the increased proportion of zeros in the claim frequencies, which we attribute to the learning effect. The overall effect is still an increase in the premium, but not as much as we would expect without the learning effect.
Our research is innovative because (1) we introduce telematics covariates while dealing with the excess of zeros and (2) we discuss the implications for new insurance products and traffic safety that are obtained on the basis of distance driven. Additional variables may be measured to assess the quality of drivers and in future work these new telematics signals could be much more sophisticated than distance driven.
Various studies have explored the potential of telematics when applied to risks of road accidents, beginning with a preliminary analysis by Vickrey (1968). More recently, several papers have examined the impact of new technologies on road safety and how driving habits can be measured (Ayuso, Guillen, & Alcañiz, 2010; Ayuso, Guillen, & Pérez-Marín, 2014; Elias, Toledo, & Shiftan, 2010; Ellison, Bliemer, & Greaves, 2015; Jun, Guensler, & Ogle, 2011; Shafique & Hato, 2015; Underwood, 2013; Xu et al., 2015), while others have focused specifically on mileage and new risk factors that might be included in the rate-making process; see Ayuso et al. (2018) for an extended review. Recently, it has been shown that including standard telematics variables significantly improves risk assessment of insureds; therefore, insurers should be able to tailor their products to the customers' risk profile (Baecke & Bocca, 2017). The objective for the insurance industry is to penalize high-risk drivers with higher premiums by taking into consideration factors related to dangerous driving, including, for example, exceeding the speed limits or not respecting safety distances. We show that having information about the annual distance driven by the insured improves the rate-making process considerably, not only because it is a measure of exposure to risk, but also because of the crucial role it plays in the analysis of the absence of claims, that is, the probability of zero claims. On the relevance of including distance driven as a traffic risk factor, see Mercer (1989) and Segui-Gomez et al. (2011).
In terms of methodology, Poisson regression models have traditionally been used to predict the number of automobile claims in insurance. The Poisson regression model is a special case of the generalized linear model class and serves as a benchmark model (Gourieroux, Monfort, & Trognon, 1984a, 1984b). However, various corrections have to be made when assuming that the probability of zero is larger than the probability under the Poisson assumption, a so-called excess of zeros. Various papers suggest that this excess is caused by asymmetrical information with an insured preferring not to declare a claim so as to avoid certain deductibles or the application of a bonus-malus system (Chiappori & Salanié, 2000; Dionne & Vanasse, 1992). In this article, we wish to differentiate those drivers who have no claims because they rarely use their vehicles during the year (in the extreme case, making no use of the vehicle at all) from those who have no claims despite being frequent drivers. To do this, we propose using a zero-inflated Poisson (ZIP) model corrected by distance (kilometers driven per year by the driver). While various studies have used ZIP models (Cameron & Trivedi, 2013; Lambert, 1992; Winkelmann, 2003) and applied them to the context of automobile insurance (Boucher, Denuit, & Guillen, 2007; Lord, Washington, & Ivan, 2005; Sarul & Sahin, 2015), none of these contributions has analyzed the role of exposure to risk in terms of distance driven.
From an empirical point of view, we draw on a real automobile claims database for a sample of insureds. This includes individual details about annual mileage traveled and other aspects of driving behavior, which enable us to study the effects of various indicators on the probability of making a claim. We highlight the implications of this for the design of new insurance rate-making processes.
The rest of the article is structured as follows. In Section 2, we present the methodology used when including distance as an offset variable in the ZIP model. The database and some descriptive results are presented in Section 3 and our main results obtained with the models specified are analyzed in Section 4. Finally, a discussion and the main conclusions drawn from this research are presented in Section 5.

METHODOLOGY
A Poisson regression with an offset variable is the logical way to include an exposure-to-risk variable in our model. Here, therefore, we opt to use a Poisson model with offset and a two-step procedure aimed at introducing telematics data, which serves as a correction to the classical model.
ZIP regression is a model for count data with an excess of zeros. It assumes that with probability p the only possible observation is 0, and with probability (1 − p), a Poisson(λ) random variable is observed. For example, in a different context, the same model can be used in quality control: when a manufacturing system is properly aligned, defects are nearly impossible, and p is large. However, when the machine is misaligned, defects may occur according to a Poisson(λ) distribution. This same principle is also plausible in motor insurance when modeling the number of accidents per year. Some drivers hardly use their vehicle or use it very rarely, so for them the probability of not being involved in an accident should be large.
Both the probability of no accidents and the mean number of defects λ in the imperfect state (when people use their cars) may depend on covariates that are defined for each individual. To simplify notation, we omit here the subscript i that refers to the ith observation in a sample of size n. Sometimes, p and λ are unrelated; but on other occasions, p is a simple function of λ, such as $p = 1/(1 + \lambda^{T})$ for an unknown constant T. In either case, ZIP regression models are easy to fit. Maximum likelihood estimates (MLEs) are approximately normal in large samples, and confidence intervals can be constructed by inverting likelihood ratio tests or using the approximate normality of the MLE. The estimation can be performed with standard statistical software, such as R or SAS, but the interpretation of the results of a ZIP regression model is not straightforward. For example, Lambert (1992) reports that in an experiment involving soldering defects on printed wiring boards, two sets of conditions resulted in roughly the same mean number of defects; however, the perfect state was more likely under one set of conditions and the mean number of defects in the imperfect state was smaller under the other set. In other words, ZIP regression can show not only which conditions give the lower mean number of defects but also why the means are lower.
Notice that formally we introduce an extended model of zero claims in insurance using distance driven as the exposure to risk variable. However, while this simple model extension primarily improves understanding of zero claims, it may have another important effect. When factors other than just mileage are included in the model, then essentially the extension suggested here also serves as a bias correction. With the data provided herein, the adjustment via our extended model improved considerably when mileage was included, and only marginally when further variables were included. Finally, therefore, we opted only to include mileage in the extension of the model, thus facilitating a straightforward interpretation. In this way, the excess zeros in our extended model are simply interpreted as a function of miles driven.
In the zero part of the model, we have only a Bernoulli variable that distinguishes between the zero event (no claim) versus the nonzero event (at least one claim), so the expectation for this binary response random variable is exactly the probability of excess zero claims, which should be limited to the [0,1] interval. For this reason, we have no offset in this part and the parameter of the log-distance is not necessarily equal to one.
Below we first introduce the simple Poisson model with and without exposure as it has traditionally been presented. Exposure, in our study, is equivalent to miles driven per year.

The Poisson Model
Let us assume that, given $x_i$, the dependent variable $Y_i$ follows a Poisson distribution with parameter $\lambda_i$, which is a function of the linear combination of parameters and regressors,

$$\lambda_i = \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}).$$

The unknown parameters to be estimated are $(\beta_0, \ldots, \beta_k)$.

The Poisson Model with Exposure
When exposure to risk is introduced, an offset is included in the model. Let us call $T_i$ the exposure factor for policyholder $i$ ($i = 1, \ldots, n$); in our case $T_i = \ln(D_i)$, where $D_i$ indicates distance traveled. Then the model can incorporate this factor as follows:

$$E(Y_i \mid x_i, D_i) = D_i \lambda_i = \exp(\ln(D_i) + \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}).$$

Under this model, the probability of zero using the Poisson distribution is calculated as $P(Y_i = 0) = \exp(-D_i \lambda_i)$, so it depends on the distance and, since $\lambda_i$ is always positive by definition, the probability of zero claims declines naturally as distance driven increases.
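As a minimal sketch of the relationship just described, the following Python snippet evaluates the Poisson probability of zero claims for increasing distances. The function name, the coefficient values, and the single covariate are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def poisson_zero_prob(distance, betas, covariates):
    """P(Y = 0) = exp(-D * lambda), with lambda = exp(b0 + b1*x1 + ... + bk*xk)."""
    linear = betas[0] + sum(b * x for b, x in zip(betas[1:], covariates))
    lam = math.exp(linear)  # claim rate per unit of distance, always positive
    return math.exp(-distance * lam)

# Hypothetical intercept, slope, and covariate value (illustration only).
betas = [-3.0, 0.5]
covariates = [1.0]
distances = [1.0, 5.0, 10.0, 20.0]
zero_probs = [poisson_zero_prob(d, betas, covariates) for d in distances]
```

Under these assumed values, `zero_probs` is strictly decreasing in distance, which is the behavior that the plain Poisson model with offset imposes by construction.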
We are now ready to extend the traditional Poisson regression models above to include excess zeros via ZIP models. This extension is also introduced with and without exposure.

The ZIP Model
In the ZIP model, the probability of zero is specified as follows:

$$P(Y_i = 0) = p_i + (1 - p_i)\, P(Y_i^* = 0),$$

where $p_i$ is the probability of the perfect, zero-defect state and $(1 - p_i)$ is the probability of the complementary state. The new variable $Y_i^*$ follows a Poisson distribution with parameter $\exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})$ and captures the claims distribution that is not contaminated by the excess of zeros. Note that $p_i$ may depend on some covariates. Under this model, the probability of suffering $k$ accidents, when $k$ is greater than or equal to one, is:

$$P(Y_i = k) = (1 - p_i)\, P(Y_i^* = k), \quad k = 1, 2, \ldots$$
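The ZIP probabilities described in this subsection can be sketched in a few lines of Python. This is an illustrative implementation of the standard ZIP mixture (a point mass at zero with weight p plus a Poisson component with weight 1 − p), not the estimation code used in the study; the function names and the values of `p` and `lam` are assumptions made for the example.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, p, lam):
    """ZIP probability of k claims: excess-zero probability p, Poisson mean lam."""
    if k == 0:
        return p + (1.0 - p) * poisson_pmf(0, lam)
    return (1.0 - p) * poisson_pmf(k, lam)

def zip_loglik(counts, p, lam):
    """Log-likelihood of observed claim counts under a ZIP with common p and lam."""
    return sum(math.log(zip_pmf(k, p, lam)) for k in counts)

# Hypothetical parameter values (illustration only).
p, lam = 0.3, 0.4
total = sum(zip_pmf(k, p, lam) for k in range(30))
zip_mean = sum(k * zip_pmf(k, p, lam) for k in range(30))
```

The probabilities sum to one, the zero probability exceeds the plain Poisson value, and the mean equals (1 − p) times the Poisson mean, which is the property used later when exposure is introduced.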

The ZIP Model with Exposure
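Here, we assume that $p_i$ is the probability of an excess of zeros for the $i$th observation and it is specified as a logistic regression model such that:

$$p_i = \frac{\exp(\alpha_0 + \alpha_1 \ln(D_i))}{1 + \exp(\alpha_0 + \alpha_1 \ln(D_i))}. \tag{4}$$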
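The Poisson model for $Y_i^*$ is specified as follows, with an exposure,

$$E(Y_i^*) = D_i \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}).$$

Then,

$$P(Y_i = 0) = p_i + (1 - p_i) \exp\left(-D_i \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right).$$

Using the definition of the expectation of a discrete random variable, the expectation of the Poisson part is:

$$E(Y_i) = (1 - p_i)\, D_i \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) = D_i^* \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}),$$

where

$$D_i^* = \frac{D_i}{1 + \exp(\alpha_0 + \alpha_1 \ln(D_i))}$$

is a transformation of the original exposure $D_i$. So, when we include zero inflation there is a transformation of the exposure in the Poisson model. Let us study the transformation. If $\alpha_1 > 1$, when $D_i$ is large, then $D_i^*$ tends to zero, but when $\alpha_1 < 1$ then $D_i^*$ increases when $D_i$ increases. On the other hand, when $D_i$ tends to zero, $D_i^*$ tends to zero.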
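The behavior of the transformed exposure can be checked numerically. The sketch below assumes the logistic specification for the excess-zero probability and computes the transformed exposure as distance times (1 − p). The intercept `alpha0 = -1.0` and the comparison slope 1.5 are hypothetical; the slope 0.404 is the log-distance coefficient quoted in the Results section.

```python
import math

def excess_zero_prob(distance, alpha0, alpha1):
    """Logistic probability of an excess zero as a function of log-distance."""
    z = alpha0 + alpha1 * math.log(distance)
    return 1.0 / (1.0 + math.exp(-z))

def effective_exposure(distance, alpha0, alpha1):
    """Transformed exposure D* = D * (1 - p) = D / (1 + exp(alpha0 + alpha1*ln D))."""
    return distance * (1.0 - excess_zero_prob(distance, alpha0, alpha1))

alpha0 = -1.0  # hypothetical intercept (illustration only)
distances = [1.0, 10.0, 100.0, 1000.0]

# Slope below one (0.404, as quoted in the Results section): D* keeps growing.
slow = [effective_exposure(d, alpha0, 0.404) for d in distances]
# Hypothetical slope above one: D* eventually shrinks toward zero.
fast = [effective_exposure(d, alpha0, 1.5) for d in distances]
```

With a slope below one the transformed exposure still rises with distance, only more slowly than distance itself, which corresponds to a premium that increases with distance but less than proportionally.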
If we examine the logistic regression part (Equation (4)), we observe that $p_i$ can be understood again as a transformation of the exposure into the [0,1] interval, which tends to zero when $D_i$ tends to zero if $\alpha_1$ is positive. Moreover, the derivative of Equation (4) with respect to $D_i$ shows how much the expected claims would change as a function of $D_i$ and indicates that if $\alpha_1$ is significantly different from zero, then the relationship is not linear. Since insurance premiums are based on the expected number of claims, this is an important result as it potentially shows that insurance prices should not necessarily be linearly proportional to distance driven.
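A short worked derivation makes the nonlinearity explicit. It is a sketch that follows from the definitions in this section rather than a result quoted from the tables. It writes $\mu_i = \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})$, a shorthand introduced here, and takes $p_i$ to be the logistic function of $\alpha_0 + \alpha_1 \ln(D_i)$.

```latex
\begin{align*}
\frac{\partial p_i}{\partial D_i} &= \frac{\alpha_1}{D_i}\, p_i (1 - p_i), \\
E(Y_i) &= (1 - p_i)\, D_i\, \mu_i, \\
\frac{\partial E(Y_i)}{\partial D_i}
  &= \mu_i \left[ (1 - p_i) - D_i \frac{\partial p_i}{\partial D_i} \right]
   = \mu_i\, (1 - p_i)(1 - \alpha_1 p_i).
\end{align*}
```

Without zero inflation ($p_i = 0$) the derivative is the constant $\mu_i$, so expected claims are proportional to distance. When $\alpha_1 \neq 0$, $p_i$ varies with $D_i$ and the marginal effect of an additional kilometer is no longer constant.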

DATA
We use information on the risk exposure and number of claims for 25,014 insureds with car insurance coverage throughout 2011, that is, individuals exposed to the risk for a full year. Note that in our case these data concern drivers up to a maximum age of 37, given that the insurance product was sold primarily to young drivers. Our aim is to discriminate between good and bad drivers in this portfolio segment and to identify the influence of driving short distances (Ayuso et al., 2014). Claim frequencies are presented in Table I, with an expected value of 0.23 claims per person. Table I has information on the frequency of all reported claims. The sum of reported claims that were not at fault is 3,108, while the sum of claims at fault is 2,652. Overall, 5,760 claims were reported. Descriptive statistics for the risk exposure indicator (kilometers per year) are presented in Table II, where we analyze drivers with and without claims separately. The rest of the indicators, both those derived from traditional rate-making factors and those obtained from telematics devices, are presented in Table III, where we also present the definitions of these variables and their main descriptive statistics.
The results presented in Table II in relation to the annual distance traveled by the insured drivers reveal differences between those with no claims and those with claims. If we focus on the 25% of drivers who traveled the smallest distance over the year (first quartile), we observe that the insureds who claim at least one accident drove more kilometers per year than those with no claims (the respective quartile values being 4.87 versus 4.00). A similar pattern of behavior is observed for the second (median) and third quartiles, with those making claims driving larger distances than those with no claims. This result was as expected and is a clear indication of a relationship between claims and distance driven.
The Mann-Whitney test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample is less than or greater than a randomly selected value from a second sample. The Mann-Whitney test shows that the differences in the mean for the exposure risk regressor (Table II), as well as for the other classical and telematics regressors (Table III), are statistically significant in the cases of drivers with no claims and drivers with claims, with the exception of vehicle age (p-value = 0.331) and the percentage of kilometers driven over the speed limit squared (p-value = 0.9293). Note that the normality hypothesis of these variables is rejected when using the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare the statistical distribution of two samples. From a univariate point of view, drivers who made a claim for at least one accident are, on average, younger than those who made no claim and have held their driving license for fewer years. A similar conclusion can be drawn in the case of ownership of a powerful vehicle, where those insureds making at least one claim present a higher value than those making no claims. Unexpectedly, in the case of cars parked overnight in a garage, the percentage value is higher among those who made at least one claim than it is among those who made no claim. We would expect such cars to be safer, but it appears that this variable may be closely related to car type, with powerful, more expensive cars being kept in garages. As for the new driving behavior indicators derived from telematics, driving at night and driving in urban areas present larger mean values in the claims group than in the no claims group.
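To make the Mann-Whitney comparison used in this section concrete, the following sketch computes the U statistic by direct pairwise comparison. The two small samples are hypothetical annual distances invented for the example and are not taken from Table II; a full test would also require a p-value, which is omitted here.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: count of (a, b) pairs with a > b, ties count 1/2."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical annual distances, for illustration only.
with_claims = [4.9, 7.8, 12.1, 9.4, 6.3]
without_claims = [4.0, 6.5, 10.2, 3.1, 5.7]

u_stat = mann_whitney_u(with_claims, without_claims)
# Share of pairs in which the driver with claims drove farther.
share = u_stat / (len(with_claims) * len(without_claims))
```

Under the null hypothesis, `share` would be close to 0.5; values well above 0.5 indicate that drivers with claims tend to record larger distances.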

RESULTS
Tables IV and V present the ZIP models including exposure to risk (kilometers driven per year) as the offset variable in the models as discussed in Section 2. Fig. 1 gives an overview of the estimated models.
Traditional software programs facilitate the maximum likelihood estimation of these models; the results here were obtained using SAS PROC GENMOD. To compare the models, we use the Akaike information criterion (AIC), calculated as twice the number of parameters in the model minus twice the value of the log-likelihood at the maximum. The best model is the one that presents the smallest AIC value. (The AIC penalizes the number of parameters less strongly than the Bayesian information criterion (BIC), which is calculated on the basis of the logarithm of the number of observations as opposed to multiplying the number of parameters by two, as with the AIC.) Table IV highlights a clear improvement in the results when considering all the model regressors (the lowest AIC value being obtained for the first specification). These results seem to validate the conclusions drawn in previous studies (Ayuso et al., 2018; Ferreira & Minikel, 2013; Lemaire et al., 2016), in which the relevance of the new indicators related to distance traveled and driving habits is highlighted, but where they are used in conjunction with the classical regressors. Individual significance is observed for a large number of parameters, including those of the logit model in its zero-inflation part. On first inspection, the positive sign of the parameter associated with the log-distance in the logistic part might seem surprising and it could be interpreted erroneously. This value (0.404) in the first column does not mean that the greater the distance driven, the greater the probability of the insured having zero claims. Rather, it means that the greater the distance driven, the greater the proportion of excess zero claims, indicating a deviation from the Poisson distribution that can be captured by the ZIP model. In the case of the classical variables, all the parameters for gender, driving experience, vehicle age, and the power of the vehicle are statistically significant.
Thus, we find an increasing expectation in the number of claims for women drivers as opposed to men, inexperienced drivers as opposed to experienced, and owners of old and powerful vehicles as opposed to owners of newer and less powerful cars. As for the new telematics regressors, two (the percentage of kilometers per year driven over the speed limit and the percentage of urban kilometers driven per year) are significant in explaining the expected number of claims. Thus, the number of claims increases as these two regressors increase. No significance is observed in the case of night driving. In column 2, we present the estimation results of the reduced model when removing the covariates with insignificant coefficients in the full model. Finally, if we compare the results of the third and fourth specifications (columns 3 and 4, respectively), the best results are obtained for the model that only includes variables related to driving habits (telematics), as indicated by its lower AIC value.
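The model comparison by information criteria can be sketched as follows. The AIC and BIC formulas are the standard ones described in the text; the log-likelihood values and parameter counts are hypothetical placeholders, not the figures from Tables IV and V. Only the sample size of 25,014 comes from the Data section.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

n_obs = 25014  # sample size reported in the Data section

# Hypothetical (log-likelihood, number of parameters) pairs, illustration only.
candidates = {
    "all_variables": (-12000.0, 12),
    "telematics_only": (-12040.0, 6),
}

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
bic_scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
best_by_aic = min(aic_scores, key=aic_scores.get)
```

Because ln(25,014) is roughly 10, each extra parameter costs about five times more under the BIC than under the AIC for a sample of this size, so the two criteria can rank specifications differently when the gain in log-likelihood is small.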
Our model predicts the highest number of expected claims for younger women, with little driving experience, driving old and powerful vehicles, driving in urban zones, and exceeding the speed limit. Note that this result is in line with the results reported by Mercer (1989).
Previous research (Mercer, 1987) has shown that it may be interesting to include age and gender interaction in the model. The results for all the models, which are available from the authors, show that this interaction is not significant. In practice, gender cannot be used for pricing insurance in the European Union, but it can certainly be used for risk evaluation and it can help to understand male/female differences with implications for traffic safety. Our conclusion for this sample is that there is no interaction between age and gender. There are potentially two reasons for that. (1) The sample consists of drivers aged less than 37 years, so age may not have enough range to show a significantly different effect by gender.
(2) As found by other authors, the influence of gender is masked by the fact that men on average drive significantly longer distances than women. The relationship between distance driven and gender was discovered by independent researchers in different E.U. countries considering average daily distance in a Spanish data set (Ayuso, Guillen, & Pérez-Marín, 2016), or using average trip distance for a Belgian sample (Verbelen, Antonio, & Claeskens, 2018), or even taking both average trip distance and total distance in another European portfolio sample (Wüthrich, 2017). They all concluded that gender differences in the risk of accidents are, to a large extent, attributable to the fact that men drive longer average distances than women.
Similar results are obtained when only claims at fault are considered in Table V, with the exception that the age of the driver is now significant while gender is not. Here, again, a better goodness of fit is obtained for the specification that includes all variables (both telematic and nontelematic) and the model that includes only the telematics variables (the lowest AIC value being obtained for column 1). As in Table IV, a lower AIC is obtained for the specification using only telematic variables as opposed to that using only classical variables (columns 2 and 3, respectively).
The age of at-fault drivers is inversely related to the expected number of claims, that is, a higher number of accidents is expected among younger drivers. However, the significance of the age squared parameter indicates a nonlinear relationship between the two variables. Inexperienced drivers (measured in terms of the number of years in which they have been in possession of a driving license) and drivers of old vehicles show a higher expected number of claims than that recorded by their more experienced counterparts and drivers of newer vehicles. In common with the result in Table IV, the percentage of kilometers per year driven over the speed limit, and additionally here the percentage of kilometers driven at night, have an impact on the expected number of claims in which the driver is at fault. The percentage of kilometers driven at night is significant at the 10% level when we only consider the telematic variables but the AIC value for this model is lower than that obtained for the first model.
Results for the models on the not-at-fault claims indicate similar conclusions. We have not discussed the not-at-fault cases because in insurance premium calculation only claims at fault are of main interest. Claims at fault indicate that the driver has caused an accident, while not at fault means that the accident was due to someone else. If the accident is caused by someone else, then the insured driver should not pay a higher insurance premium compared to someone who did not report a claim.
Comparisons with the classical Poisson model with offsets (without considering zero inflation), both for the total sample and for claims where the policyholder is at fault, are not included here, but they do not enable us to see the impact of distance on the excess of zeros. These results are available on request from the authors. The goodness-of-fit results are always better in the zero-inflated models because they take into account differences between false zeros (nonrisk exposure) and true zeros (risk exposure and zero claims).
In a similar context, it has been shown that prediction models for hurricane power outage can be improved by a new two-step outage prediction model and the inclusion of additional environmental variables that increase the overall accuracy (McRoberts, Quiring, & Guikema, 2016). Our model also improves the classical approach by introducing telematics information into the prediction of the number of claims and this can be done in a two-stage model approach (Ayuso et al., 2018).
In addition to the results presented in Tables IV and V, we have performed a hold-out analysis, and we have tested the models against test sets that were not used in the training process. We have chosen a 70% training sample, versus a 30% hold-out sample. In all cases we have confirmed the conclusions on the significance of the parameter that we had in the initial analysis. The chi-square test of differences between observed and fitted frequencies was equal to 946.7 for the whole sample. The hold-out analysis indicates very similar values (1,041.3 with 6 degrees of freedom in the training sample and 1,005.9 with 6 degrees of freedom in the test sample for the model of all claims and all variables). We find analogous results for other predictive performance measures at the policyholder level, such as the Gini index (Frees, Meyers, & Cummings, 2011), which is equal to 82.4% in the whole sample while it equals 82.5% and 82.1% in the training and test samples, respectively.
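The hold-out procedure and the chi-square comparison of observed and fitted frequencies can be sketched as below. The 70/30 split follows the text; the helper names, the random seed, the stand-in policy records, and the small frequency vectors are assumptions for illustration and do not reproduce the statistics reported above.

```python
import random

def train_test_split(records, train_share=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and hold-out parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def chi_square_statistic(observed, fitted):
    """Sum over claim-count classes of (observed - fitted)^2 / fitted."""
    return sum((o - f) ** 2 / f for o, f in zip(observed, fitted))

# Hypothetical policy identifiers standing in for the insured records.
policies = list(range(1000))
train, test = train_test_split(policies)

# Hypothetical observed and fitted frequencies for 0, 1, and 2+ claims.
observed = [80, 15, 5]
fitted = [78.0, 18.0, 4.0]
stat = chi_square_statistic(observed, fitted)
```

In practice the model would be fitted on `train` only, and the fitted frequencies for `test` would be computed from those training estimates before the statistic is evaluated.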
In order to evaluate the variable importance, we have estimated the models using standardized covariates, so that we can compare the coefficients. This analysis reveals that the most important factor that determines the risk of a crash is the percentage in urban driving, followed by the age of the driver's license. The third factor is the percentage of speed limit violations. The least relevant factors are the age of the vehicle, gender of the driver, percent of night distance driven, and parking in a garage.
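Standardizing covariates before estimation, as described above, puts the coefficients on a common scale. A minimal sketch, assuming the usual z-score transformation (subtract the mean, divide by the standard deviation), is shown here; the paper does not state which standardization was used, and the sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical percentages of urban driving for four drivers (illustration only).
urban_pct = [10.0, 25.0, 40.0, 55.0]
urban_z = standardize(urban_pct)
```

After this transformation each covariate has mean zero and unit standard deviation, so a larger absolute coefficient can be read as a larger effect per standard deviation of the covariate.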

CONCLUSIONS
We have shown that the part of the zero accident frequency not explained by traditional insurance risk factors increases with the distance driven by the policyholder. This means that when considering policyholders with the same characteristics but with different exposures to risk in terms of distance driven per year, we can conclude that those with a greater exposure present a larger proportion of excess zero claims than those with less exposure. This can be understood as an indication of a learning effect, or in terms of distance driven, that even if exposure to risk increases with distance driven, the probability of not making a claim also increases compared to that of drivers in the group who drive a shorter distance. This finding is evidence of the fact that good drivers-if we identify them with those reporting no claims-are more frequent than expected among the group of drivers that drive long distances than among those that drive shorter distances, all other things being equal.
This conclusion has a direct impact on the future design of PAYD insurance products, insofar as the premium paid should not be strictly proportional to the distance driven. Moreover, the premium should take into account the learning effect analyzed here. One possible solution would be to make the marginal increase in the insurance price per kilometer driven dependent on the accumulated distance. Here, we have shown that this relationship is not linearly dependent, as we report that the zero-inflation part plays a significant role. Taking the derivative of Equation (4) makes this nonlinearity immediately apparent.
The probability of excess zeros increases with distance. The coefficient for the logarithm of the number of kilometers driven per year in the logit model (which predicts zero inflation) is positive, that is, the probability of observing false zeros increases with increasing distance. Moreover, we have shown that the ZIP model gives better results in terms of goodness of fit than those obtained with the classical Poisson model (non-ZIP model).
Here, therefore, we have shown both the significance of the impact of the distance variable coefficient and the positive relationship between traffic violations involving excess speed and urban driving with the expected number of claims. These results are in line with reports issued by official traffic institutions where it is argued that speed limit violations should be considered in the design of insurance premiums so that safer driving is rewarded (Ayuso et al., 2010).
Previous traffic studies published in Risk Analysis (Mercer, 1989;Segui-Gomez et al., 2011) have stressed the desirability of including risk exposure in terms of distance driven. We have shown that indeed vehicle telemetry, and the collection of information using GPS-based technology such as percentages of kilometers driven at night, over the speed limit, and in urban zones, among others, can be included in the rate-making process, thus improving the results obtained when just using classical driver variables, such as age and gender. This opens the question whether PAYD should also consider a different price per mile depending on the time of the day and the location.
Our study shows that ZIP models with mileage as their offset variable can improve the definition of drivers' risk profiles and provide valuable policy guidelines that might be implemented to improve driving behavior. Furthermore, the higher premium associated with a higher percentage of kilometers driven in an urban area (as a consequence of a higher expected number of claims) could discourage the use of private vehicles in cities, as called for by various European institutions (not least to reduce levels of pollution). Clearly, similar conclusions can be drawn in terms of traffic violations, with an increase in the premium for drivers with a tendency to exceed the speed limit.