Can Automobile Insurance Telematics Predict the Risk of Near-Miss Events?

Telematics data from usage-based motor insurance provide valuable information – including vehicle usage, attitude toward speeding, and time and proportion of urban/nonurban driving, which can be used for ratemaking. Additional information on acceleration, braking, and cornering can likewise be usefully employed to identify near-miss events, a concept taken from aviation that denotes a situation that might have resulted in an accident. We analyze near-miss events from a sample of drivers in order to identify the risk factors associated with a higher risk of near-miss occurrence. Our empirical application with a pilot sample of real usage-based insurance data reveals that certain factors are associated with a higher expected number of near-miss events, but that the association differs depending on the type of near miss. We conclude that nighttime driving is associated with a lower risk of cornering events, urban driving increases the risk of braking events, and speeding is associated with acceleration events. These results are relevant for the insurance industry in order to implement dynamic risk monitoring through telematics, as well as preventive actions.


INTRODUCTION AND MOTIVATION
Before the emergence of telematics, insurers had no verifiable information on the driving patterns and real vehicle usage of the insured. Driving circumstances and styles could only be determined, and then indirectly, in the specific case of an accident. Today, in contrast, telematics provides a novel source of data for risk classification before an accident, or even before a dangerous event, occurs, in what insurers refer to as a "near miss." A near miss (a name taken from aviation safety, where reports cover potentially dangerous practices or mistakes that could have led to a fatal accident) can be defined as a narrowly avoided accident, such as when a driver has to brake suddenly or make rapid steering operations (Arai et al. 2001). The occurrence of near misses, though, seems to be related to a higher risk of being involved in future accidents.
Although defining a near miss is straightforward enough, being able to verify its occurrence in real life is not. Insurance claims require that actual accidents be reported to the insurer, but near misses can only be measured if they are well defined and measured on the spot. In the pilot dataset analyzed here, only near-miss events were observed, and no data on claims were available. As a matter of fact, we believe that near misses, as defined in this study, could be used to predict real accident events, but at present we are not able to confirm this because we do not have real accidents in our sample. Future studies drawing on actual insurance data could therefore distinguish between a near miss, as captured by telematics data, and an actual accident, which could exhibit characteristics similar to those of a near miss.
This study focuses on near misses identified in a sample of drivers that have a telematics sensor fitted in their vehicles. We model the number of three types of near-miss events, namely, acceleration, braking, and cornering events (see Section 4 for full definitions), as a function of two types of variables. First, we consider the traditional risk factors of age, gender, driving experience, and vehicle power; second, we consider telematics information describing driving patterns, that is, urban and nighttime driving and speed behavior. Among the traditional risk factors, we conclude that age is relevant in predictions of near-miss events, but we do not see significant differences between men's and women's expected risks of near misses in our sample. Importantly, the impact of the risk factors on the expected number of near-miss events differs depending on the type of near-miss event being analyzed. Thus, it would be incorrect to model the sum of all near-miss events as opposed to each type of event separately, because the impacts are confounded. For instance, engine power presents a significant association with a higher frequency of cornering and acceleration events; nighttime driving is associated with a lower risk of cornering events than daytime driving; urban driving is associated with a higher frequency of braking events; and, in line with expectations, excess speed increases the expected frequency of abnormal acceleration events.
These results are valuable for risk classification in insurance companies offering usage-based insurance (UBI) motor policies. Moreover, monitoring and predicting the risk of each near miss can serve to construct alerts that can warn drivers when their levels are approaching a dangerous threshold or risk level. The conclusions are also of interest to traffic authorities concerned with accident prevention.
The rest of this article is organized as follows. In the following section, we provide a brief description of the background to this study and of recent research in this field. In Section 3, we outline the methods used. Section 4 describes the real dataset employed in the article for the analysis of near-miss events. The model results are presented in Section 5 and, finally, Section 6 concludes.

BACKGROUND
The use of telematics in the insurance industry provides insurers with information that can be used for risk classification. In addition to the traditional risk factors considered for insurance ratemaking (such as age, driving experience, type of vehicle, etc.), global positioning system (GPS)-based technology provides a new wave of data with details about a driver's mileage, speeding, braking, cornering, and location, as well as about road and traffic conditions. The insurance industry now faces the challenge of integrating this information correctly in its ratemaking schemes, which is far from straightforward. Apart from the high costs of the technology, insurers need to familiarize themselves with insurance telematics data (Ma et al. 2018) and the value of the information contained in data streams obtained from sensor sources. Moreover, in order to use telematics factors as rating factors, the response variable has to be at least evidently associated with accidents. In that sense, Quddus et al. (2002) found that rapid acceleration, deceleration (braking), and sharp turns may increase driving risk and damage levels. Similarly, af Wåhlberg (2004) found evidence of a significant correlation between driver acceleration behavior and accident frequency. Jun et al. (2011) also found that drivers who had crash experiences tended to drive at higher speeds than crash-not-involved drivers, and concluded that there is a real potential to identify at-risk drivers based on in-vehicle data collection technologies. More recently, Bian et al. (2018) investigated how behavioral data of drivers affects driving risk and how driver behavior should affect UBI pricing schemes. Based on empirical data, Bian et al. (2018) found that their driver risk classification model achieves good accuracy in terms of risk-level classification. Additionally, the link between near misses and accident risk has been investigated. Wang et al. (2015) assessed the driving risk associated with near-crash events. Their results indicated that the speed when braking and the potential crash type, among other factors, exerted the greatest influence on the driving-risk level of a near crash.
Telematics-based data have been shown to be valuable for risk classification purposes in the insurance industry (Ayuso et al. 2014, 2016, 2019; Baecke and Bocca 2017), allowing insurers to consider the concept of risk exposure, no longer measured solely in terms of duration of policy coverage, but also of distance and time traveled. Although mileage was used as a ratemaking factor before telematics data became available (e.g., in the United States, France, and Germany), telematics allows insurers to measure a driver's exact exposure so they do not have to rely on the insured's declaration on their initial application. In this sense, Boucher et al. (2017) show, using generalized additive models (GAM), that the simultaneous effect of distance traveled and exposure time on the risk of accident can be highly informative in the context of usage-based insurance. Likewise, Verbelen et al. (2018) recently analyzed a dataset from a Belgian telematics product aimed at young drivers and report their development of generalized additive models and compositional predictors to quantify and interpret the effect of telematics variables on expected claim frequencies. They found that such variables increase the predictive power and render the use of gender as a rating variable redundant. Ayuso et al. (2016) obtained similar results in a data set for drivers in Spain. Telematics information has also been used to explain the excess of zeros observed in the frequency of claims. For example, Guillen et al. (2019) included the distance traveled per year as part of an offset in a zero-inflated Poisson model to predict the excess of zeros, which may reflect the fact that some insureds make little use of their vehicle. The authors showed the existence of a learning effect for large values of distance traveled, so that while drivers driving more should pay higher premiums, there should be a discount for drivers who accumulate longer distances over time. They also confirmed that speed limit violations and driving in urban areas increase the expected number of accident claims.
Ma et al. (2018) show that vehicle mileage, hard brakes, hard starts, peak-time travel, and speeding are strongly correlated with higher accident rates. They also find that contextual driving factors (such as driving at a speed significantly different from that of traffic flow) are relevant risk factors. As a result, the authors show how second-by-second GPS data can be integrated into existing or new auto insurance pricing structures. They also analyze how usage-based insurance solution providers have chosen different measurements to evaluate driver performance. Among these, the authors describe the Progressive Insurance UBI program, where a combination of hard braking (deceleration over 7 mph/s), number of miles driven, time and day, fast starts, and trip regularity is used to calculate each driver's risk level. Ma et al. (2018) also examine the Allstate Drivewise program, which rewards drivers who limit high-speed driving, late-night trips, and hard braking (in this case, driving at speeds above 80 mph is considered unsafe).
Recently, Stipancic et al. (2018) analyze hard braking and accelerating events and compare them with historical crash data. Both maneuvers are positively correlated with crash frequency at the link and intersection levels. Locations with more braking and accelerating are also associated with more collisions. Higher numbers of vehicle maneuvers are also related to increased collision severity, though this relationship is not always statistically significant. Previously, Wahlström et al. (2015) detected dangerous vehicle cornering events, based on statistics related to the no-sliding and no-rollover conditions. Osafune et al. (2017) analyze aggressive driving behavior using a large dataset of accelerometer readings collected from drivers' smartphones. Their objective is to explore accident risk indexes that statistically separate safe drivers from risky drivers. They conclude that the frequency of acceleration exceeding 2.4 m/s², that of deceleration exceeding 1.4 m/s², and that of left acceleration exceeding 1.1 m/s² separate safe from risky drivers.
The distinction between accidents and near misses has also been investigated in the context of car-to-cyclist crashes and near crashes (Ito et al. 2018). Here, the factors that differentiate near crashes from crashes are examined and the causes of the latter are identified. Ito et al. (2018) conclude that car-to-cyclist crashes are unavoidable when the car approaching the cyclist enters an area in which the average deceleration required to stop the car is more than 4.4 m/s². Finally, Sanders (2015) has analyzed the impact of near-miss and collision experiences on the perceived traffic risk for cyclists.
According to Arai et al. (2001), near-miss data can be useful for diagnosing driving behavior and developing driving safety programs and driver assistance devices, and, as such, near-miss events have attracted researcher attention in recent years. Indeed, it is our contention that insurance companies need to analyze occurrences of both accidents and near misses and the circumstances in which they take place. In this regard, the expected number of near misses should become a standard risk index for drivers, thus helping to personalize motor insurance rates.

METHODS
We use negative binomial (NB) regression to model the number of near-miss events observed over a period of time. The NB distribution is a Poisson-gamma mixture; that is, the NB is a Poisson($\lambda$) distribution, where $\lambda$ is itself a random variable following a gamma distribution. Given the gamma parameter, the NB regression is a special type of generalized linear model where the mean of the dependent variable $y$, $\mu$, depends on a set of $k$ independent variables $(x_1, \ldots, x_k)$ according to
$$\ln(\mu_i) = \ln(km_i) + \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}, \quad i = 1, \ldots, n,$$
where $n$ is the sample size, $km_i$ is the total distance traveled during the observation period (which is 1 week for all observations) and is used as an offset variable, $\beta_0, \beta_1, \ldots, \beta_k$ are unknown parameters that need to be estimated, and $\alpha$ is the inverse of the scale parameter of the gamma distribution. The parameters of the NB regression model can be estimated by maximum likelihood using PROC GENMOD of SAS.
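As a rough illustration of the model just described, the following Python sketch evaluates the NB log-likelihood with a log link and the distance offset, using only the standard library. The function names, the toy data, the toy parameter values, and the parameterization Var(y) = μ + αμ² are assumptions made for illustration; this is not the SAS implementation used in the study.

```python
import math

def expected_count(km, x, beta):
    """Mean of the count model: mu = km * exp(beta_0 + sum_j beta_j * x_j).

    km enters as an offset, i.e. ln(km) is added to the linear predictor
    with a coefficient fixed at 1.
    """
    linear = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return km * math.exp(linear)

def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y for an NB with mean mu.

    Assumed parameterization: gamma shape r = 1/alpha, Var(y) = mu + alpha*mu^2.
    """
    r = 1.0 / alpha
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def nb_log_likelihood(rows, beta, alpha):
    """Sum of log-probabilities over rows of (y, km, x)."""
    return sum(nb_log_pmf(y, expected_count(km, x, beta), alpha)
               for y, km, x in rows)

# Toy weekly records (hypothetical): (event count, km driven, [covariates]).
rows = [
    (0, 120.0, [0.10, 0.30]),
    (2, 200.0, [0.05, 0.20]),
    (1, 90.0, [0.00, 0.40]),
]
beta = [-5.0, 2.0, 0.5]   # hypothetical coefficients
alpha = 1.5               # hypothetical dispersion
print(nb_log_likelihood(rows, beta, alpha))
```

A maximum likelihood fit would maximize `nb_log_likelihood` over `beta` and `alpha`; in practice a statistical package would be used for that step.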
Here, we have a small sample of drivers that are observed over a maximum period of up to 15 weeks. This period varies from one driver to another in relation to their participation in the sample. We also consider a panel model in our analysis. In essence, the method is simply a generalization of the Poisson or negative binomial model, in which we consider time and individual fixed effects, in order to account for the driver correlations observed over time. For the sake of simplicity, we do not include any more details, but a complete overview of panel data for counts can be found in Frees (2004) and Boucher and Guillen (2009), and for specific applications to the insurance industry in .

THE DATASET
A pilot study was conducted to collect telematics information on drivers in Greece during 2017. All drivers agreed to provide data from car sensors that measured all three types of near-miss event. A weekly summary allowed us to analyze the relationship between the response, defined as the observed number of near-miss counts of each type, and the explanatory factors, that is, personal information (the traditional risk factors) and behavioral values (telematics covariates). The traditional risk factors include driver's gender, age and experience and vehicle age and engine power. Telematics covariates measure the total distance traveled per week, measure nighttime and urban driving, and provide information about speeding and the near-miss events of each type, that is, acceleration, braking, and cornering events, as defined in the following. Table 1 presents and describes the variables in the dataset.
We have predefined what a near-miss event is for the purpose of this study. Unfortunately, the number of observations is far too small to let near-miss patterns be defined naturally by searching for some specific structure in the data. The identification of each near miss is based on the calculation of a severity score for each event type, which lies within [0,10]. For example, in the case of acceleration events, the calculation takes into consideration the difference between the maximum acceleration reading and the acceleration detected in the first reading above the acceleration event detection threshold (set at 6 m/s²). This threshold was chosen in accordance with previous studies. Note that Hynes and Dickey (2008) considered 5.7 m/s² as the threshold for a low peak acceleration event during rear-end impacts. We calculate the ratio between this difference and the corresponding timestamps of the latter readings. The final severity score is a transformation of this ratio multiplied by 10, which means we obtain a final score within [0,10]. Acceleration is also used to determine the severity of braking events, given that negative acceleration can essentially be considered as deceleration. In the case of cornering events, severity depends on the ratio between the speed of a reading and the maximum speed possible during a turn for the vehicle to stay on track (note that this definition is similar to the no-sliding condition used by Wahlström et al. [2015] in their study on dangerous cornering events). Here acceleration events are considered near misses because of the high severity of the event, but in real life, in most cases, an acceleration event results in a braking event rather than an accident. In this analysis, we also consider the total number of near-miss events, defined as the sum of acceleration, braking, and cornering events.
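The steps of the acceleration severity calculation can be sketched as follows. The text only states that the score is "a transformation" of the ratio, scaled to [0,10]; the saturating transformation used here, the `scale` constant, the reading of the ratio as acceleration change divided by elapsed time, and the function name are all hypothetical placeholders. The sketch illustrates the sequence of steps, not the scoring rule actually used in the pilot.

```python
THRESHOLD = 6.0  # m/s^2, acceleration event detection threshold from the text

def acceleration_severity(readings, scale=5.0):
    """Severity score in [0, 10) for one candidate acceleration event.

    readings: list of (timestamp_seconds, acceleration_m_s2) in time order.
    Returns None when no reading exceeds the detection threshold.
    """
    above = [(t, a) for t, a in readings if a > THRESHOLD]
    if not above:
        return None

    t_first, a_first = above[0]                       # first reading above threshold
    t_peak, a_peak = max(above, key=lambda r: r[1])   # maximum acceleration reading

    elapsed = t_peak - t_first
    # If the first reading above the threshold is also the peak, there is no rise.
    ratio = (a_peak - a_first) / elapsed if elapsed > 0 else 0.0

    # Hypothetical transformation: maps [0, inf) onto [0, 1).
    transformed = ratio / (ratio + scale)
    return 10.0 * transformed

example = [(0.0, 5.0), (0.5, 6.5), (1.0, 8.0), (1.5, 7.0)]
print(acceleration_severity(example))  # ratio 3.0 m/s^3 -> score 3.75
```

Any monotone transformation onto [0,1] would fit the description in the text equally well; the choice here only keeps the example bounded.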
The final dataset comprises 1225 observations, corresponding to 157 drivers observed over an average period of 8 weeks in the years 2016-2017. This means that the number of data points per driver equals on average 8 (the number of observed weeks). Seventy-five percent of the drivers were observed for at most 10 weeks. Table 2 shows the descriptive statistics and frequency tables of the nontelematics variables. Women account for 24.2% of the sample. Table 2 also shows the descriptive statistics for urban and nighttime driving and speeding. On average, 30% of the kilometers driven are driven at night, 13% on urban roads, and 3% at speeds above the limits. The average distance traveled per week is 147.27 km. Figures 1, 2, and 3 represent the distribution of the total number of near-miss events (by weeks) for the three types of event (acceleration, braking, and cornering) considered here. In approximately 73% of the weeks no acceleration events were recorded, while this percentage was 67.92% in the case of braking and 74% in the case of cornering events. Table 3 shows the descriptive statistics of the total number of near-miss events (by weeks) in the dataset. Braking was the most frequent near-miss event, followed by accelerations and cornering. Note that the standard deviation of braking and acceleration events is high, indicating that the drivers in the sample are quite heterogeneous with respect to these occurrences. Table 4 shows the parameter estimates of the NB regression models for the acceleration, braking, and cornering events, when pooling all observations in the sample. There are three model specifications: for each type of event, we consider the model with only the traditional rating factors, then with only the telematics covariates, and finally with all the covariates. The results have been obtained by using the GENMOD procedure of SAS.

RESULTS
In the case of acceleration events (first three columns in Table 4), the Akaike information criterion (AIC) shows that the best model is the one that includes all the explanatory variables. Customer age and vehicle engine power are associated with a higher number of acceleration events, while vehicle age and vehicle night parking are associated with a lower number. Among the telematics variables, speed, as expected, is associated with a higher number of acceleration events. Thus, as a driver increases the percentage distance driven above the speed limits by 1%, the expected number of acceleration events increases by about 6%. Here, the coefficient in the model with all variables is equal to 5.63, which means that $e^{5.63 \times 0.01} \approx 1.06$, which is the multiplicative impact on the expected number of acceleration events. That excessive speed is associated with abrupt accelerations is unsurprising, but what is important is the magnitude of the association when controlling for all other factors. Using the best model, the one that includes all variables, we computed the fitted values and calculated the chi-squared test statistic for a theoretical NB distribution.
In the case of braking events, the results of the parameter estimates of the negative binomial regression model are shown in the central columns of Table 4. Here, again, the model with the lowest AIC is the one that includes all the variables. It can be seen that customer age increases the number of braking events, while the older the vehicle, the lower is the number of braking events. Vehicle night parking also reduces the number of braking events. The remaining traditional risk factors (CustomerGender, CustomerYearsHavingL, and VehicleEnginePower) do not present a significant effect. Among the telematics variables, urban is the only factor presenting a significant effect, being associated with a higher number of braking events. This is also expected due to the density of traffic in urban areas. Again, we used the best model, the one that includes all variables, and we computed the fitted values and calculated the chi-squared test statistic for a theoretical NB distribution, which results in a value equal to 20.65.
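The interpretation of the speeding coefficient for acceleration events can be checked with a few lines of Python. Under the log link, a change of `delta` in a covariate multiplies the expected count by exp(coefficient × delta). The helper name `rate_ratio` is introduced here for illustration only; the coefficient 5.63 is the value reported in the text.

```python
import math

def rate_ratio(coefficient, delta):
    """Multiplicative change in the expected count when a covariate
    increases by delta, under a log-link count model."""
    return math.exp(coefficient * delta)

speed_coefficient = 5.63  # reported estimate for the speeding covariate

# One additional percentage point (0.01) of distance driven above the limit.
print(round(rate_ratio(speed_coefficient, 0.01), 3))  # 1.058, i.e. about a 6% increase
```

The same helper applies to any coefficient in Table 4, provided the covariate is expressed on the scale used in the model (here, a proportion between 0 and 1).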
In the case of cornering events, the model presenting the lowest AIC is the one that includes both telematics and nontelematics variables. Among the traditional rating variables, customer age presents a positive and significant coefficient, indicating that cornering events are more frequent among older drivers. As expected, driving experience reduces the number of cornering events (the coefficient being significant and negative), while the greater the vehicle engine power, the lower is the number of cornering events. The remaining traditional risk factors (CustomerGender, VehicleAge, and VehicleNightParking) do not present a significant effect. Among the telematics variables, nighttime driving is the only factor presenting a significant effect. Driving during the night is associated with a lower expected number of cornering events, probably reflecting that drivers drive more carefully and more smoothly in the nighttime hours, compared to the daytime hours. As we did before, we used the model that includes all variables and computed the fitted values and calculated the chi-squared test statistic for a theoretical NB distribution, which results in a value equal to 26.26.
When we consider the sum of near-miss events as the response variable presented in Table 5, the model results are not as clear as before and the influence of each driving pattern on the aggregate number cannot be interpreted. Table 5 shows that only the effect of excess speed is significant at the 5% level of significance for the model based on telematics covariates. Again, we used the model in Table 5 (the one that includes all variables) to compute the fitted values and calculated the chi-squared test statistic for a theoretical NB distribution, which results in a value equal to 60.32. However, we recommend analyzing near misses by type rather than in an aggregate form, in order to detect the influence of urban versus nonurban driving, as well as the effects of nighttime driving.
As an alternative to the NB models presented here, we have also fitted a Poisson, zero-inflated Poisson, zero-inflated NB, generalized additive model (GAM) with Poisson response, and GAM regression with NB response. Tweedie was not used, as we do not have any information about costs or severities. These models have been used to estimate acceleration, braking, cornering, and total number of events (four different response variables) by using traditional and telematics risk factors as explanatory variables. In the four cases, the best model was the GAM regression with NB response, as it was the one with the lowest AIC (see Table A.1 in the Appendix). We also calculated the chi-squared test statistic for a theoretical NB distribution with an expected value equal to the fitted values of the alternative GAM regressions. The chi-squared test statistic was 34.27, 40.99, and 82.28 for acceleration, braking, and total number of near-miss events, respectively. These values are higher than those corresponding to the traditional NB regression. In the case of cornering events, the chi-squared test statistic was 19.71, which is lower than the corresponding value for the traditional NB model. This means that, according to the chi-squared test statistic, the NB model performs better than the alternative GAM regression in all cases, except for cornering events. Nevertheless, we should be careful with the interpretation of these results, as the distributions of the response variables had a heavy tail and we grouped the extreme observations in order to calculate the test statistic. As a consequence, we decided to focus on the results of the NB regression model for simplicity of interpretation of the linear component.
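One way such a chi-squared comparison with grouped extreme observations could be computed is sketched below. The text does not spell out the exact procedure, so the cell layout (counts 0 to `tail_start - 1` plus one grouped tail cell), the NB parameterization Var(y) = μ + αμ², the function names, and the toy numbers are all assumptions made for illustration.

```python
import math

def nb_pmf(y, mu, alpha):
    """NB probability of count y with mean mu (assumed: Var = mu + alpha*mu^2)."""
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def expected_frequencies(fitted_means, alpha, tail_start):
    """Expected number of observations in cells 0, 1, ..., tail_start-1
    and in one grouped cell for counts >= tail_start."""
    cells = [0.0] * (tail_start + 1)
    for mu in fitted_means:
        probs = [nb_pmf(y, mu, alpha) for y in range(tail_start)]
        for y, p in enumerate(probs):
            cells[y] += p
        cells[tail_start] += 1.0 - sum(probs)  # grouped heavy tail
    return cells

def observed_frequencies(counts, tail_start):
    """Observed number of observations in the same cells."""
    cells = [0] * (tail_start + 1)
    for y in counts:
        cells[min(y, tail_start)] += 1
    return cells

def pearson_chi2(observed, expected):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy example (hypothetical weekly counts and fitted means).
counts = [0, 0, 1, 0, 3, 7, 0, 2]
fitted = [0.5, 0.4, 1.2, 0.3, 2.0, 3.5, 0.6, 1.1]
obs = observed_frequencies(counts, tail_start=4)
exp = expected_frequencies(fitted, alpha=1.5, tail_start=4)
print(pearson_chi2(obs, exp))
```

A lower statistic indicates that the fitted distribution reproduces the observed frequencies more closely, which is how the values quoted above are compared across models.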
We also performed a panel data analysis using Poisson regression, but the results are not included here (they are available from the authors on request) because, although most of the coefficient signs of the telematics variables are the same as those obtained in the regression models without a panel approach, they are quite unstable and depend heavily on the number of observed weeks considered. In most cases, no substantial changes are seen with regard to the influence of the telematics covariates, but the regressors that do not change over the weeks of observation, such as age (in years), gender, and vehicle power and age, cannot be included if individual effects have already been considered. A more sophisticated analysis with observational periods longer than 15 weeks is recommended to assess the effect of time trends on the observed responses.

CONCLUSIONS
The occurrence of near-miss events, and not only accidents, needs the attention of traffic authorities and insurers. Knowing the circumstances in which near misses occur is relevant for risk quantification and also for accident prevention, given that such incidents are informative about narrowly avoided accidents and, more importantly, about the type of accident that could have occurred under a set of known circumstances.
The main conclusion to be drawn from our analysis is the different impact of a range of behavioral factors on the occurrence of different types of near-miss events. This clearly suggests that analyzing near misses without distinguishing the type of event is likely to lead to a confounding of the factors influencing an increase in the expected number of near misses.
In this article we have analyzed three types of near-miss events, cornering, braking, and accelerating, and we have shown that both traditional and telematics variables are relevant risk factors. Among the former, we conclude that the driver's age is associated with a higher risk of all three types of near-miss event. Specifically, older drivers present a higher risk of near misses, perhaps owing to an excess of self-confidence at the wheel. Having said that, driving experience decreases the risk of cornering events. Among the other factors, vehicle power is associated with a higher risk of acceleration events but with a lower risk of cornering events. Finally, vehicle age is associated with a lower risk of braking and acceleration events, perhaps owing to limitations in the technical characteristics of older vehicles compared to those of newer automobiles.
Telematics risk factors have been found to be relevant for predicting the risk of each specific near-miss event. Nighttime driving is associated with a lower risk of cornering events. This is probably due to smoother driving at night, compared to daytime driving. Speeding is associated with a higher risk of acceleration events, which is as expected. Finally, urban driving is associated with a higher risk of braking events, which is not surprising if we take into account traffic conditions in cities. We believe that these results are relevant for traffic authorities, for example, pointing to the need to promote actions encouraging drivers to maintain a safe following distance, not only on highways, but also in cities, where there is a higher risk of braking events.
Given that the average number of near misses differs according to the event type, insurers could usefully establish benchmarks so that whenever a driver exceeds one of the factors (e.g., driving a high percentage of traveled distance in urban areas), this would trigger an alarm indicating a greater risk of near-miss events and, therefore, a higher risk of accident. However, one of the limitations of this analysis is that while the methodology seems transferable from one portfolio to another, some of the estimated models may only be valid for the country and situation in which these data were collected. According to the findings of this study, near-miss count data modeling shows considerable potential for the setting of personalized benchmark levels and for offering motor insurance premium rewards, based on a driver's expected number of near misses. As such, count models can be used as predictive tools to calculate the expected level of near-miss events dynamically (that is, as the telematics measurements are processed), and drivers can be warned if the predicted levels exceed a dangerous threshold and be rewarded for good driving when near-miss counts are observed below their predicted level.
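The warning and reward logic described above could be prototyped along the following lines. The function name, the labels, the rule ordering, and the threshold value are hypothetical; the text does not specify how thresholds would be set, so this is only a sketch of the decision rule.

```python
def weekly_feedback(observed, predicted, alert_threshold):
    """Classify one driver-week.

    observed:        near-miss events actually recorded in the week
    predicted:       expected number of events from a fitted count model
    alert_threshold: hypothetical level above which the predicted risk
                     is treated as dangerous
    """
    if predicted > alert_threshold:
        return "alert"    # predicted level exceeds the dangerous threshold
    if observed < predicted:
        return "reward"   # fewer near misses than the model expected
    return "neutral"

print(weekly_feedback(observed=0, predicted=1.2, alert_threshold=3.0))  # reward
print(weekly_feedback(observed=2, predicted=4.1, alert_threshold=3.0))  # alert
```

In a dynamic setting, `predicted` would be recomputed as new telematics measurements arrive, separately for each type of near-miss event.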