A Priori Ratemaking Using Bivariate Poisson Regression Models

In automobile insurance, it is useful to achieve a priori ratemaking by resorting to generalized linear models, and here the Poisson regression model constitutes the most widely accepted basis. However, insurance companies distinguish between claims with or without bodily injuries, or claims with full or partial liability of the insured driver. This paper examines an a priori ratemaking procedure when including two different types of claim. When assuming independence between claim types, the premium can be obtained by summing the premiums for each type of guarantee and is dependent on the rating factors chosen. If the independence assumption is relaxed, then it is unclear as to how the tariff system might be affected. In order to answer this question, bivariate Poisson regression models, suitable for paired count data exhibiting correlation, are introduced. It is shown that the usual independence assumption is unrealistic here. These models are applied to an automobile insurance claims database containing 80,994 contracts belonging to a Spanish insurance company. Finally, the consequences for pure and loaded premiums when the independence assumption is relaxed by using a bivariate Poisson regression model are analysed.


Introduction
Designing a tariff structure for insurance is one of the main tasks for actuaries. Such pricing is particularly complex in the branch of automobile insurance because of highly heterogeneous portfolios. A thorough review of ratemaking systems for automobile insurance, including the most recent developments, can be found in Denuit et al. (2007).
One way to handle this problem of heterogeneity in a portfolio, referred to as tariff segmentation or a priori ratemaking, involves segmenting the portfolio into homogeneous classes so that all insured parties belonging to a particular class pay the same premium. This procedure ensures that the exact weight of each risk is fairly distributed within the portfolio. In the case of automobile insurance, in order to group the policies into homogeneous classes, a series of classification variables are used (e.g., age, sex and place of residence of the driver, or horsepower, class and use of the vehicle). These variables are called a priori ratemaking variables, since their values can be determined before the insured party begins to drive.
If all the factors influencing a risk could be identified, measured and introduced in the tariff system, then the classes defined would be homogeneous. However, this is not the case, as there are important risk factors that are not considered in the a priori tariff. Some are especially difficult to quantify, such as a driver's reflexes, aggressiveness or knowledge of the Highway Code. As a result, tariff classes can be quite heterogeneous.
Hence, the idea has arisen of considering individual differences in policies within the same class by using an a posteriori mechanism, i.e., fitting an individual premium based on the experience of claims for each insured party. This concept has received the name of a posteriori tariff, experience rating or the bonus-malus system.
Here, only the first step in pricing is studied, the a priori ratemaking. In short, the classification or segmentation of risks involves establishing different classes of risk according to their nature and probability of occurrence. For this purpose, factors are determined in order to classify each risk, and it is statistically tested that the probability of a claim depends on these factors, and hence, their influence can be measured. A priori classification based on generalized linear models is the most widely accepted method; see e.g. Dionne and Vanasse (1989), Haberman and Renshaw (1996), Pinquet (1999), Bermúdez et al. (2001) and Boucher and Denuit (2006) for applications in the actuarial sciences, and McCullagh and Nelder (1989) or Dobson (1990) for a general overview of the statistical theory.
The most commonly used generalized linear model for this tariff system is the Poisson regression model and its generalizations (Denuit et al., 2007). Introduced by Dionne and Vanasse (1989), the model can be applied if the number of claims for each individual policy observation is known. Although it is possible to use the total number of claims as the response variable, the nature of automobile insurance policies (covering different risks) is such that the response variable is the number of claims for each type of guarantee. Therefore, a premium is obtained for each class of guarantee as a function of different factors. Then, assuming independence between types of claim, the total premium is obtained from the sum of the expected number of claims of each guarantee.
Here, two different types of guarantee are assumed: third-party liability automobile insurance and the rest of guarantees. Following the usual methodology, assuming independence between types, the premium paid by the policyholder is obtained by summing the premiums for each type of guarantee, and this depends on the rating factors. However, the question remains as to whether the independence assumption is realistic. When this assumption is relaxed, it is interesting to see how the tariff system might be affected.
In this study, a bivariate Poisson regression model is introduced. Holgate (1964) provided a practical basis for the bivariate Poisson distribution, but its use has been largely ignored, mainly because of computational difficulties. Therefore, only a few applications can be found: for example, Jung and Winkelmann (1993) used a bivariate Poisson regression in a labour mobility study and Karlis and Ntzoufras (2003) modelled sports data. For a comprehensive review of the bivariate Poisson distribution and its applications (especially multivariate regression), the reader should see Kocherlakota and Kocherlakota (1992, 2001) and Johnson, Kotz and Balakrishnan (1997).
One early application of the bivariate Poisson distribution in the actuarial literature is described in Cummins and Wiltbank (1983). In ruin theory, some applications of this distribution are also to be found, for example Partrat (1994), Ambagaspitiya (1999), Walhin and Paris (2000) and Centeno (2005). In microeconomic insurance, Cameron and Trivedi (1998) studied the relationship between type of health insurance and various responses that measure the demand for health care by using a bivariate Poisson regression. In addition, two studies concerned with distribution fitting should also be mentioned, although neither considers rating factors. First, Vernic (1997) carried out a comparative study with the bivariate Poisson distribution based on data related to natural events insurance and third-party liability automobile insurance. Second, Walhin (2003) compared bivariate Hoffmann and bivariate Poisson distributions by fitting a data set for accidents sustained by members of a sample of 122 shunters in two consecutive 2-year periods.
However, in a ratemaking context, bivariate Poisson regression models have not been used to model claim counts that depend on the usual rating factors.
In the next section, the model used here is defined. This model is based on the bivariate Poisson regression model, which is appropriate for modelling paired count data that exhibit correlation. In Section 3 the database obtained from a Spanish insurance company is described.
In Section 4 the results are summarised. Finally, some concluding remarks are given in Section 5.

Bivariate Poisson regression models
Let $N_1$ and $N_2$ be the number of claims for third-party liability and for the rest of guarantees respectively, and let $N = N_1 + N_2$. The usual methodology to obtain the a priori premium under the assumption of independence between types of claims can be described as follows. First, the model assumed is $N_1 \sim \text{Poisson}(\lambda_1)$ and $N_2 \sim \text{Poisson}(\lambda_2)$ independently, where $\lambda_1$ and $\lambda_2$ depend on a number of rating factors associated with the characteristics of the car, the driver and the use of the car. Second, with $\lambda_1$ and $\lambda_2$ estimated for each policyholder and following the net premium principle, the total net premium ($\pi$), assuming the amount of the expected claim equals one monetary unit, is obtained as
$$\pi = E[N] = E[N_1] + E[N_2] = \lambda_1 + \lambda_2.$$
However, a loading is added to the net premium to ensure that the insurer will not, on average, lose money. Many well-known premium principles can be applied for this purpose. Here the variance premium principle is used. This principle builds on the net premium by including a risk loading that is proportional to the variance of the risk. Under the above assumptions, the variance is equal to the expected value, and the total loaded premium ($\pi^*$) is equal to
$$\pi^* = E[N] + \alpha V[N] = (1 + \alpha)(\lambda_1 + \lambda_2),$$
where $\alpha > 0$ denotes the loading factor.
In bivariate Poisson regression models, the independence assumption is relaxed. The model can be defined as follows. Let us consider independent random variables $X_i$ ($i = 1, 2, 3$) distributed as Poisson with parameters $\lambda_i$ respectively. Then the random variables $N_1 = X_1 + X_3$ and $N_2 = X_2 + X_3$ jointly follow a bivariate Poisson distribution:
$$(N_1, N_2) \sim BP(\lambda_1, \lambda_2, \lambda_3).$$
This is the so-called trivariate reduction method that leads to the bivariate Poisson distribution.
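The trivariate reduction construction can be illustrated by simulation. The following is a minimal sketch using only the Python standard library; the parameter values, the seed and the helper names (`sample_poisson`, `sample_bivariate_poisson`) are illustrative assumptions and are not taken from the paper. Because both counts share the common term $X_3$, the sample covariance should lie close to $\lambda_3$ and the sample means close to $\lambda_1 + \lambda_3$ and $\lambda_2 + \lambda_3$.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate by multiplying uniforms (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_bivariate_poisson(lam1, lam2, lam3, rng):
    """Trivariate reduction: N1 = X1 + X3 and N2 = X2 + X3 share the common term X3."""
    x1 = sample_poisson(lam1, rng)
    x2 = sample_poisson(lam2, rng)
    x3 = sample_poisson(lam3, rng)
    return x1 + x3, x2 + x3


def sample_covariance(pairs):
    """Unbiased sample covariance of a list of (n1, n2) pairs."""
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    return sum((a - mean1) * (b - mean2) for a, b in pairs) / (n - 1)


rng = random.Random(2024)          # fixed seed so the run is reproducible
lam1, lam2, lam3 = 0.8, 1.2, 0.4   # hypothetical parameter values
pairs = [sample_bivariate_poisson(lam1, lam2, lam3, rng) for _ in range(100_000)]

mean_n1 = sum(a for a, _ in pairs) / len(pairs)
mean_n2 = sum(b for _, b in pairs) / len(pairs)
cov_n1_n2 = sample_covariance(pairs)

print(f"mean N1 = {mean_n1:.3f} (theory {lam1 + lam3:.3f})")
print(f"mean N2 = {mean_n2:.3f} (theory {lam2 + lam3:.3f})")
print(f"cov     = {cov_n1_n2:.3f} (theory {lam3:.3f})")
```

Setting `lam3 = 0.0` in the sketch makes the two counts independent, which corresponds to the double Poisson case.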
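The joint probability function of $(N_1, N_2)$ is given by:
$$P(N_1 = n_1, N_2 = n_2) = e^{-(\lambda_1 + \lambda_2 + \lambda_3)} \frac{\lambda_1^{n_1}}{n_1!} \frac{\lambda_2^{n_2}}{n_2!} \sum_{i=0}^{\min(n_1, n_2)} \binom{n_1}{i} \binom{n_2}{i} \, i! \left( \frac{\lambda_3}{\lambda_1 \lambda_2} \right)^{i}. \qquad (1)$$
The bivariate Poisson distribution defined above presents several interesting and useful properties. First, it allows for positive dependence between the random variables $N_1$ and $N_2$; moreover, $Cov(N_1, N_2) = \lambda_3$, and therefore $\lambda_3$ is a measure of this dependence. Obviously, if $\lambda_3 = 0$ the two random variables are independent and the bivariate Poisson distribution reduces to the product of two independent Poisson distributions, referred to as a double Poisson distribution (Kocherlakota and Kocherlakota, 1992). Second, the marginal distributions for $N_1$ and $N_2$ are Poisson with $E[N_1] = \lambda_1 + \lambda_3$ and $E[N_2] = \lambda_2 + \lambda_3$. Hence, the total net premium can be obtained with
$$\pi = E[N] = \lambda_1 + \lambda_2 + 2\lambda_3.$$
The variance necessary to obtain the loaded premium is now $V[N] = V[N_1] + V[N_2] + 2\,Cov(N_1, N_2) = \lambda_1 + \lambda_2 + 4\lambda_3$. Since $\lambda_3$ is expected to be positive, the relaxation of the independence assumption leads to a variance greater than the expected value. Overdispersion has often been observed when modelling claim counts in automobile insurance data (Denuit et al., 2007).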
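The moments quoted above can be checked numerically. The sketch below evaluates the bivariate Poisson probability function as a convolution over the common term $X_3$, which is algebraically equivalent to the usual closed form with the sum up to $\min(n_1, n_2)$, and then sums it over a truncated grid. The parameter values and the loading factor `alpha` of the variance premium principle ($\pi^* = E[N] + \alpha V[N]$) are hypothetical, chosen only for illustration.

```python
import math


def bivariate_poisson_pmf(n1, n2, lam1, lam2, lam3):
    """P(N1 = n1, N2 = n2) when N1 = X1 + X3 and N2 = X2 + X3."""
    total = 0.0
    for i in range(min(n1, n2) + 1):
        # i counts the events contributed by the common term X3.
        total += (
            lam1 ** (n1 - i) / math.factorial(n1 - i)
            * lam2 ** (n2 - i) / math.factorial(n2 - i)
            * lam3 ** i / math.factorial(i)
        )
    return math.exp(-(lam1 + lam2 + lam3)) * total


def moments_of_total(lam1, lam2, lam3, upper=40):
    """Mean and variance of N = N1 + N2, summing the pmf over 0..upper-1 in each margin."""
    mean = second_moment = 0.0
    for n1 in range(upper):
        for n2 in range(upper):
            prob = bivariate_poisson_pmf(n1, n2, lam1, lam2, lam3)
            mean += (n1 + n2) * prob
            second_moment += (n1 + n2) ** 2 * prob
    return mean, second_moment - mean ** 2


lam1, lam2, lam3 = 0.08, 0.10, 0.01   # hypothetical claim frequencies
alpha = 0.25                           # hypothetical loading factor

mean_bp, var_bp = moments_of_total(lam1, lam2, lam3)
loaded_bp = mean_bp + alpha * var_bp
# Independent Poisson counts with the same marginal means have variance equal to the mean.
loaded_independent = (1.0 + alpha) * mean_bp

print(f"E[N] = {mean_bp:.4f}, V[N] = {var_bp:.4f}")
print(f"loaded premium, bivariate Poisson:   {loaded_bp:.4f}")
print(f"loaded premium, independent Poisson: {loaded_independent:.4f}")
```

With the marginal means held fixed, the two loaded premiums differ by $2\alpha\lambda_3$, which is the extra loading attributable to the dependence between claim types.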
Let us assume that $N_{1j}$ and $N_{2j}$ denote the random variables indicating the number of claims of each type of guarantee for the $j$th policyholder. If covariates are introduced to model $\lambda_1$, $\lambda_2$ and $\lambda_3$, a bivariate Poisson regression model can be defined with the following scheme:
$$(N_{1j}, N_{2j}) \sim BP(\lambda_{1j}, \lambda_{2j}, \lambda_{3j}), \qquad \log(\lambda_{ij}) = x_{ij}^{\top} \beta_i, \quad i = 1, 2, 3, \qquad (2)$$
where $j = 1, \ldots, n$ denotes the observed policies with sample size $n$, $x_{ij}$ denotes a vector of explanatory variables and $\beta_i$ denotes the corresponding vector of regression coefficients ($i = 1, 2, 3$).
In the case of the explanatory variables, two aspects should be stressed. First, different covariates can be used to model each parameter $\lambda_{ij}$. Second, to facilitate the interpretation, no covariates are used to model $\lambda_3$. However, they can be included so as to know more about the influence of the covariates on each pair of variables.
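As a brief illustration of the regression scheme, the sketch below maps the covariates of one policy to the three parameters, assuming the usual log link $\log(\lambda_{ij}) = x_{ij}^{\top}\beta_i$. The coefficient values, the covariate vector and the function names are hypothetical and are not estimates from the database; $\lambda_3$ is given an intercept only, which corresponds to the constant-covariance case.

```python
import math


def linear_predictor(x, beta):
    """Inner product x' beta for one policy."""
    return sum(value * coef for value, coef in zip(x, beta))


def bp_rates(x1, x2, x3, beta1, beta2, beta3):
    """Return (lambda1, lambda2, lambda3) for one policy under a log link."""
    return (
        math.exp(linear_predictor(x1, beta1)),
        math.exp(linear_predictor(x2, beta2)),
        math.exp(linear_predictor(x3, beta3)),
    )


# Hypothetical coefficients: an intercept plus two dummy rating factors.
beta1 = [-2.6, 0.15, -0.30]
beta2 = [-2.3, 0.25, 0.40]
beta3 = [-4.5]                 # intercept only, i.e. constant lambda3

x = [1.0, 1.0, 0.0]            # policy with the first dummy factor active
lam1, lam2, lam3 = bp_rates(x, x, [1.0], beta1, beta2, beta3)

# Expected total number of claims for this policy: lambda1 + lambda2 + 2 * lambda3.
expected_total = lam1 + lam2 + 2.0 * lam3
print(f"lambda1 = {lam1:.4f}, lambda2 = {lam2:.4f}, lambda3 = {lam3:.4f}")
print(f"expected total claims = {expected_total:.4f}")
```

Passing different covariate vectors for `x1`, `x2` and `x3` reflects the first point above, namely that each parameter may be modelled with its own set of covariates.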
A problem arises when examining the joint probability function given in (1), particularly in $\min(n_1, n_2)$. In the database, looking at the entries in Table 2, it can be clearly seen that the proportion of (0, 0) is larger than that of other frequencies. Moreover, most policies have no claims. Therefore, it seems reasonable to fit a zero-inflated model.
Few studies to date have discussed zero-inflated models in bivariate discrete distributions.
Such models have been proposed by Li et al. (1999) and Wang et al. (2003) who considered inflation only for the (0, 0) cell, or Walhin (2001) who discussed zero-inflated bivariate Poisson models. However, here we follow the zero-inflated bivariate Poisson model proposed by Karlis and Ntzoufras (2005). In fact, they propose an extension of the simple zero-inflated model which inflates the probabilities in the diagonal of the probability table. It seems reasonable to believe, for instance, that there also exists a higher proportion of (1, 1) because the same accident can lead to one claim of each type being made.
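Taking the bivariate Poisson model (BP) defined above as the starting point, the diagonal inflated bivariate Poisson model (DIBP) is specified by the probability function:
$$P_{DIBP}(n_1, n_2) = \begin{cases} (1 - p)\, BP(n_1, n_2 \mid \lambda_1, \lambda_2, \lambda_3), & n_1 \neq n_2, \\ (1 - p)\, BP(n_1, n_2 \mid \lambda_1, \lambda_2, \lambda_3) + p\, D(n_1 \mid \theta), & n_1 = n_2, \end{cases} \qquad (3)$$
where $BP(\cdot)$ is the probability function given in (1), $p$ is the mixing proportion and $D(n \mid \theta)$ is a discrete distribution on $\{0, 1, 2, \ldots\}$ with parameter vector $\theta$. Writing $E_D$ for the expectation under $D$, since $E_D[N_1] = 0$ when only cell (0, 0) is inflated, the marginal distributions in the zero-inflated model are overdispersed and the marginal mean and variance for $N_1$ are:
$$E_{ZIBP}[N_1] = (1 - p)(\lambda_1 + \lambda_3), \qquad V_{ZIBP}[N_1] = (1 - p)\left[ (\lambda_1 + \lambda_3) + p(\lambda_1 + \lambda_3)^2 \right].$$
For the analysis presented in the following sections, the covariance between $N_1$ and $N_2$ for a zero-inflated model, $Cov_{ZIBP}[N_1, N_2]$, needs to be calculated. From (3), it follows that:
$$E_{DIBP}[N_1 N_2] = (1 - p)\, E_{BP}[N_1 N_2] + p\, E_D[N_1^2],$$
which for the zero-inflated model is:
$$E_{ZIBP}[N_1 N_2] = (1 - p)\left[ \lambda_3 + (\lambda_1 + \lambda_3)(\lambda_2 + \lambda_3) \right].$$
Thus, the covariance for a zero-inflated model is given by:
$$Cov_{ZIBP}[N_1, N_2] = (1 - p)\left[ \lambda_3 + p(\lambda_1 + \lambda_3)(\lambda_2 + \lambda_3) \right].$$
Different algorithms have been provided to implement bivariate Poisson regression models (Ho and Singer, 2001; Kocherlakota and Kocherlakota, 2001; or, adopting a Bayesian point of view, Tsionas, 2001; Karlis and Meligkotsidou, 2005). Here the EM algorithm provided by Karlis and Ntzoufras (2005) is used, through its implementation in R (bivpois package). Standard errors for the parameters are calculated using standard bootstrap methods (boot package in R).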
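To give an idea of how an EM algorithm works in this setting, the sketch below fits the simplest case, a bivariate Poisson model without covariates and without inflation, by treating the common term $X_3$ as missing data. It is a simplified illustration under stated assumptions and not the bivpois implementation: the function names, the starting value, the number of iterations and the simulated data are all hypothetical. The E-step uses $E[X_3 \mid n_1, n_2] = \lambda_3 P(n_1 - 1, n_2 - 1) / P(n_1, n_2)$ when $\min(n_1, n_2) > 0$ (and zero otherwise), and the M-step uses the complete-data Poisson estimates.

```python
import math
import random
from collections import Counter


def bp_pmf(n1, n2, lam1, lam2, lam3):
    """Bivariate Poisson probability, written as a convolution over the common term."""
    total = 0.0
    for i in range(min(n1, n2) + 1):
        total += (
            lam1 ** (n1 - i) / math.factorial(n1 - i)
            * lam2 ** (n2 - i) / math.factorial(n2 - i)
            * lam3 ** i / math.factorial(i)
        )
    return math.exp(-(lam1 + lam2 + lam3)) * total


def log_likelihood(table, lam1, lam2, lam3):
    """Log-likelihood computed from the cross-tabulation {(n1, n2): frequency}."""
    return sum(freq * math.log(bp_pmf(a, b, lam1, lam2, lam3))
               for (a, b), freq in table.items())


def fit_bp_em(pairs, iterations=300):
    """EM for (lambda1, lambda2, lambda3), treating the common term X3 as missing data."""
    table = Counter(pairs)            # work on the cross-tabulation, not on raw rows
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    lam3 = 0.5 * min(mean1, mean2)    # arbitrary starting value
    lam1, lam2 = mean1 - lam3, mean2 - lam3
    start = (lam1, lam2, lam3)
    for _ in range(iterations):
        # E-step: expected value of X3 given each observed pair.
        expected_x3 = 0.0
        for (a, b), freq in table.items():
            if min(a, b) > 0:
                ratio = bp_pmf(a - 1, b - 1, lam1, lam2, lam3) / bp_pmf(a, b, lam1, lam2, lam3)
                expected_x3 += freq * lam3 * ratio
        # M-step: complete-data Poisson maximum likelihood estimates.
        lam3 = expected_x3 / n
        lam1, lam2 = mean1 - lam3, mean2 - lam3
    return (lam1, lam2, lam3), start, table


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate by multiplying uniforms (adequate for small lam)."""
    threshold, count, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(7)
true_lam = (1.0, 1.5, 0.5)            # hypothetical values used to simulate data
pairs = []
for _ in range(20_000):
    x1, x2, x3 = (sample_poisson(lam, rng) for lam in true_lam)
    pairs.append((x1 + x3, x2 + x3))

estimate, start, table = fit_bp_em(pairs)
print("estimates:", [round(value, 3) for value in estimate])
print("log-likelihood at start:", round(log_likelihood(table, *start), 2))
print("log-likelihood at end:  ", round(log_likelihood(table, *estimate), 2))
```

Introducing covariates would replace the closed-form M-step with Poisson regressions fitted to the expected latent counts, and a bootstrap for standard errors would repeat the fit on resampled policies; neither is attempted in this sketch.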

The database
The original sample comprised a ten percent sample of the automobile portfolio of a major insurance company operating in Spain in 1995. Only cars categorised as being for private use were considered. The data contain information from 80,994 policyholders. The sample is not representative of the actual portfolio as it was drawn from a larger panel of policyholders who had been customers of the company for at least seven years; however, it will be helpful for illustrative purposes.
Twelve exogenous variables were considered plus the yearly number of accidents recorded for both types of claim. For each policy, the initial information at the beginning of the period and the total number of claims from policyholders at fault were reported within this yearly period.
The exogenous variables, described in Table 1, were previously used in Pinquet et al. (2001), Bolancé et al. (2003) and in Boucher et al. (2007). Moreover, in Table 2, the cross-tabulation of the number of claims for third-party liability ($N_1$) and the number of claims for the rest of guarantees ($N_2$) is shown.
For this study, all customers had had a policy with the company for at least three years.
Therefore, variable v7 was rejected and variable v8 retained its definition and its baseline was now established as a customer who had been with the company for fewer than five years.
The meaning of those variables referring to the policyholders' coverage should also be clarified.
Even with a small correlation between $N_1$ and $N_2$, including $\lambda_3$ in the model produced a better fit for the data used.
Once the effectiveness of the bivariate model had been assessed, covariates to model $\lambda_1$, $\lambda_2$ and $\lambda_3$ were included. In fact, first the same variables for $\lambda_1$ and $\lambda_2$ were included, maintaining $\lambda_3$ constant. The results are shown in Table 3. It can be seen that the intercept for $\lambda_3$ was significant (at the 5% level), indicating that the bivariate Poisson model is more appropriate for this data than is the model that assumes independence between $N_1$ and $N_2$ (double Poisson). As regards the fit, the AIC values for these models also indicate the improvement achieved with the bivariate model.
Focusing on $\lambda_1$ (claims for third-party liability), for the bivariate Poisson model the parameters from v4 to v8 and v10 were significant. For the double Poisson model no important differences were found, except for the parameter v10, which was not significant. This difference may indicate that it is advisable to include this covariate when modelling the covariance term $\lambda_3$ (see Table 4).
Following the discussion above concerning claims for third-party liability, driving experience (v5 and v6) reduced the expected number of claims, while driving in northern Spain (v4) and having been with the company for fewer than 5 years (v8) increased it. As regards the type of coverage, comprehensive coverage except fire (v10) lowered the expected number of claims only in the bivariate model.
Concentrating on $\lambda_2$ (the rest of claims, except third-party liability), most of the parameters were significant and no noticeable differences were found between bivariate and double Poisson models. In particular, the parameters for v2 to v5, v8 and v10 to v12 were statistically significant.
Here, some differences with the third-party liability claims were found. First, parameters related to the type of coverage (v10 and v11) were always significant and their presence increased the expected number of claims markedly. Second, the car's horsepower was also significant here.
When it was greater than or equal to 5500cc (v12), the probability of having a claim increased.
Finally, driving in an urban area (v2) became significant and increased the expected number of claims. As regards the driving zone and driving experience, the sign of the coefficient changed for the v4 and v5 variables with respect to third-party liability claims.
In order to model the covariance term ($\lambda_3$), the covariates were introduced in the bivariate Poisson model, with the result that only the parameter for v10 was significant. In Table 4 the results for this bivariate model with a covariate on $\lambda_3$ are shown. The improvement in AIC with respect to the bivariate model with constant $\lambda_3$ can be observed. However, no substantial differences regarding the coefficients were found with the previous bivariate Poisson models from Table 3. When the policy included comprehensive coverage (v10), the correlation between $N_1$ and $N_2$ was not as strong. Note that the guarantees covered by comprehensive coverage were not caused by an accident and so no liability claims could be reported.
Finally, as mentioned in Section 2, looking at the entries of Table 2, it is clear that the proportion of (0, 0) is larger than that of other frequencies. Therefore, two additional models were fitted using zero-inflated bivariate Poisson models. In Table 5 the results for these models are shown, with the model with constant $\lambda_3$ on the left-hand side and the model with a regressor (v10) on $\lambda_3$ on the right-hand side.
The parameter p referring to this zero-inflated model was significant and relatively large.
Moreover, the AIC values improved substantially with respect to those of the non-zero-inflated models. This suggests that the use of a zero-inflated model is a good choice for fitting this database (Boucher and Denuit, 2008). Other models with inflation in the diagonal were fitted, but they were rejected because the respective elements of the parameter vector $\theta$ were not significant. Thus, the existence of a higher proportion of (1, 1) or (2, 2) cannot be considered for this database.

Comparing a priori ratemaking when introducing dependence
The impact of using these models in a priori ratemaking was analysed by comparing the models proposed in Section 3 in terms of the mean (a priori pure premium) and the variance (necessary for the a priori loaded premium) of the number of claims per year for some profiles of the insured parties.
Five different, yet representative, profiles were selected from the portfolio (Table 6). The first can be classified as the best profile since it presents the lowest mean score. The second was chosen from among the profiles considered as good drivers, with a mean lower than the portfolio average (0.1833). A profile with a mean lying very close to this average was chosen for the third profile. Finally, a profile considered as being a bad driver (with a mean above the average) and the worst driver profile were selected. Table 7 shows the results for the five profiles and the five models considered. From these results, the differences in ratemaking when using a bivariate Poisson model as opposed to two independent Poisson models can be observed. In general, without distinguishing between bivariate models, such models produce higher means for good risks and lower means for bad risks, while leaving the means for average risks almost unchanged. As regards variances, the bivariate models increased them in most cases. A further difference with respect to the double Poisson model that should be emphasized is the overdispersion detected in the bivariate models.
In Table 7, it can be observed that the zero-inflated bivariate models did not present any noticeable differences with the non-zero-inflated models in terms of the mean scores, but differences were present in the case of the variance. The bivariate Poisson models (BP1 and BP2) increased the variances for the good risks more than they did for the bad ones, while the zero-inflated bivariate models (ZIBP1 and ZIBP2) increased the variances much more for the bad risks.
Finally, the differences between the bivariate models with constant $\lambda_3$ (BP1 and ZIBP1) and those that included a covariate on $\lambda_3$ (BP2 and ZIBP2) were examined. A comparison of non-zero-inflated models showed that the model including a covariate (BP2) presented a mean and variance lower than those presented by the BP1 model for good risks, yet higher than those presented by the BP1 model for bad risks. However, no differences were detected between zero-inflated models.
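The quantities compared in this section, the mean and variance of $N = N_1 + N_2$, can be computed in closed form for each model. The sketch below is an illustration under stated assumptions: the three parameter sets and the loading factor `alpha` are hypothetical (in practice each model is fitted separately, so the parameters differ across models), and the function names are not from the paper. For the zero-inflated model with mixing proportion $p$ on cell (0, 0), the formulas used are $E[N] = (1 - p)\,m$ and $V[N] = (1 - p)(v + p\,m^2)$, with $m = \lambda_1 + \lambda_2 + 2\lambda_3$ and $v = \lambda_1 + \lambda_2 + 4\lambda_3$; a brute-force sum over the probability function is included as a check.

```python
import math


def bp_pmf(n1, n2, lam1, lam2, lam3):
    """Bivariate Poisson probability, written as a convolution over the common term."""
    total = 0.0
    for i in range(min(n1, n2) + 1):
        total += (
            lam1 ** (n1 - i) / math.factorial(n1 - i)
            * lam2 ** (n2 - i) / math.factorial(n2 - i)
            * lam3 ** i / math.factorial(i)
        )
    return math.exp(-(lam1 + lam2 + lam3)) * total


def zibp_pmf(n1, n2, p, lam1, lam2, lam3):
    """Zero-inflated bivariate Poisson: extra mass p on the (0, 0) cell only."""
    base = (1.0 - p) * bp_pmf(n1, n2, lam1, lam2, lam3)
    return base + p if n1 == 0 and n2 == 0 else base


def total_claims_moments(p, lam1, lam2, lam3):
    """Closed-form mean and variance of N (p = 0 gives BP; p = lam3 = 0 gives double Poisson)."""
    mean_bp = lam1 + lam2 + 2.0 * lam3
    var_bp = lam1 + lam2 + 4.0 * lam3
    return (1.0 - p) * mean_bp, (1.0 - p) * (var_bp + p * mean_bp ** 2)


def brute_force_moments(p, lam1, lam2, lam3, upper=40):
    """Same moments obtained by summing the probability function over a truncated grid."""
    mean = second_moment = 0.0
    for n1 in range(upper):
        for n2 in range(upper):
            prob = zibp_pmf(n1, n2, p, lam1, lam2, lam3)
            mean += (n1 + n2) * prob
            second_moment += (n1 + n2) ** 2 * prob
    return mean, second_moment - mean ** 2


alpha = 0.25  # hypothetical loading factor for the variance premium principle
# Hypothetical (p, lambda1, lambda2, lambda3) for one profile under each model.
models = {
    "double Poisson": (0.0, 0.09, 0.11, 0.0),
    "BP": (0.0, 0.08, 0.10, 0.01),
    "ZIBP": (0.3, 0.12, 0.15, 0.015),
}
results = {}
for name, params in models.items():
    mean, var = total_claims_moments(*params)
    results[name] = (mean, var, mean + alpha * var)
    print(f"{name}: pure = {mean:.4f}, variance = {var:.4f}, loaded = {mean + alpha * var:.4f}")
```

The ratio of variance to mean in each row gives a quick reading of the overdispersion implied by each model for the chosen profile.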

Conclusions
This paper has tested the independence assumption between claim types given a set of known risk factors and it has shown that independence should be rejected. The bivariate Poisson model is presented as an instrument that can account for the underlying connection between two types of claims arising from the same policy. The interpretation of a number of bivariate Poisson models has been illustrated in the context of automobile insurance claims and the conclusion is that using a bivariate Poisson model leads to an a priori ratemaking that presents larger variances and, hence, larger loadings than those obtained under the independence assumption.
For the five models analysed here there seems to be a relationship between
In short, the main finding is that the independence assumption that is implicitly used when pricing automobile insurance by adding the pure premiums for each guarantee (which are obtained using count data regression models) is insufficient because correlations (conditional on the covariates) are ignored. A natural extension for this paper would be to identify other multivariate count data models that might consider correlations in pricing several guarantees simultaneously in automobile insurance.

References
Ambagaspitiya, R.S., 1999. On the distributions of two classes of correlated aggregate claims.