A state-space approach for measuring regional manufacturing production indices*

In this paper we propose a latent variable model, in the spirit of Israilevich and Kuttner (1993), to measure regional manufacturing production. To test the validity of the proposed methodology, we have applied it to those Spanish regions that have a direct quantitative index. The results demonstrate the accuracy of the proposed methodology and show that it can overcome some of the difficulties of the indirect method applied by the INE, the Spanish National Institute of Statistics.


Introduction
Despite the predominance of service industries in most developed economies, the evolution of manufacturing activities is still crucial for determining current economic conditions. In this context, quantitative manufacturing indices have become valuable tools for regularly checking how national and regional economies change over short periods of time. Two different methods are used to obtain quantitative manufacturing indices: direct and indirect. Direct quantitative indicators are constructed using industrial production data from a survey addressed to a sample of firms. This method provides the best quantitative indices for monitoring the evolution of industrial production, but its costs are very high. Indirect quantitative indicators estimate industrial production using pre-existing information (Clar, 1998). As a consequence, the estimation is not as accurate as the one obtained by direct methods, but it offers the advantage of being cheaper. For this reason, these indicators have been (and still are) used in a range of economies, mainly regional.
In Spain, until very recently, it was very difficult to analyse the short-term evolution of industrial activity at a regional level because the statistical information required was largely unavailable. In response, several public and private initiatives were launched in some Spanish regions in recent years to overcome these deficiencies.
Despite this considerable effort, not every Spanish region had a quantitative indicator of the evolution of industrial activity and, moreover, the available regional indicators were not directly comparable, because they were constructed with non-homogeneous methodologies. This prompted a debate in different forums about the most appropriate methodology for producing regional industrial production indicators that are highly reliable and, at the same time, cheap. As a result, the INE recently published regional industrial production indicators based on an indirect method, which has partially overcome some of the existing deficiencies.
In a previous work, Clar et al. (2000), we analysed the reliability of the INE's methodology from both a theoretical and an empirical point of view.
First, the theoretical analysis of the INE's methodology led to the conclusion that, in spite of its low cost, it does not guarantee the reliability of the indicators obtained for each region at a monthly frequency.
In particular, the good performance of the methodology for a particular region depends on the degree of geographical concentration of the industrial production, the level of detail of the base information, the weight of the regional industrial production in the national production, the similarity between the regional productive structure and the national structure and the availability of a priori information. With these considerations in mind, the INE's methodology is completely justified for certain regions, but not for every Spanish region.
Second, the existence of direct quantitative indices for some Spanish regions (País Vasco, Asturias and Andalucía) also made it possible to validate the methodology empirically.
The reason for focusing on these three regions was that they are three of the four regions that have their own direct indicator (Extremadura, the fourth region, was not considered because there were only eight observations available). The results reinforce the previous conclusions: the INE's methodology is not valid for every Spanish region.
In this paper, we propose a latent variable model, following Israilevich and Kuttner (1993), as a way to overcome some of the difficulties of the method applied by the INE. Again, the existence of quantitative direct indices for some Spanish regions (País Vasco, Asturias and Andalucía) offers the possibility of validating the methodology proposed. In the next section, the theoretical model is shown and, in the third section, the model is estimated for these three regions and the results are compared with the direct indexes and the INE indirect indexes. The results show that this method overcomes some of the difficulties of INE's method.

A latent variable model for measuring regional manufacturing production: The theoretical model and its specification in state-space form
In this section, an alternative strategy for calculating indirect quantitative manufacturing indicators at a regional level for the Spanish case is proposed. Following Israilevich and Kuttner (1993), the regional industrial production can be considered as a latent variable. To estimate latent variable models and obtain the desired regional indicators, the considered model can be expressed in state-space form and, then, be estimated using the Kalman filter.
The basic assumption of the model is that monthly regional industrial production can be treated as a latent variable which depends on observable regional variables (proxies of labour and capital) and on national variables (the national IPI). This assumption permits the specification of a parametric Cobb-Douglas production function at a regional level in which regional industrial output (monthly and regional) depends on two inputs, capital (proxied by the regional electric energy consumption for industrial purposes) and labour (proxied by the number of hours worked in the manufacturing sector):
$$\Delta x^{reg}_{t,m} = \gamma + \alpha\,\Delta e^{reg}_{t,m} + \beta\,\Delta l^{reg}_{t,m} + \varepsilon_{t,m} \tag{1}$$
where the subindexes $t$ and $m$ refer respectively to the year and the month; $\Delta(\cdot)_{t,m}$ denotes the monthly difference operator; $x^{reg}$ represents the unobservable regional production (in logarithms); $e^{reg}$ the regional electric energy consumption for industrial purposes (in logarithms); $l^{reg}$ the number of hours worked in the region in the period considered (in logarithms); $\alpha$ and $\beta$ are, respectively, the shares of the capital and labour inputs; $\gamma$ measures technological progress; and $\varepsilon_{t,m}$ is a perturbance term which represents the shocks in the production function. Moreover, as is usual in the literature, it is supposed that equation (1) is neutral in Hicks's sense. The problem with equation (1) is that it is not possible to obtain a direct estimate of monthly regional industrial output ($x^{reg}$). To solve this difficulty, another assumption needs to be included in the model. It is assumed that there is a relationship between the evolution of national and regional industrial output (the first provides indirect information about the second). So, the monthly national IPI ($x^{nation}$) can be considered as an indirect measure (a noisy indicator) of monthly regional industrial activity.
In this sense, the relationship between the fluctuations of the nation and those of region $i$ can be expressed as follows²:
$$\Delta x^{nation}_{t,m} = \mu_i + \lambda_i\,\Delta x^{reg}_{t,m} + \eta_{t,m,i} \tag{2}$$
where $\Delta x^{nation}_{t,m}$ is the growth rate of the national IPI; $\Delta x^{reg}_{t,m}$ is the growth rate of the regional production indicator of region $i$; $\mu_i$ is included in the model to allow the average (monthly) growth rates of the national IPI and of regional output to differ; $\lambda_i$ is the weight associated with the regional output fluctuations; and, last, $\eta_{t,m,i}$ is a perturbance term that represents those fluctuations of the national IPI which are not related to regional output. Positive values of $\mu_i$ reflect slower growth in the region than in the nation. $\lambda_i$ reflects the relation between the variations of the regional industrial production index and those of the national index. So, positive values of $\lambda_i$ indicate a direct relation between the fluctuations of region $i$ and those of the nation: when $\Delta x^{reg}_{t,m}$ increases (decreases), $\Delta x^{nation}_{t,m}$ also increases (decreases). As regards the magnitude of fluctuations, the larger $\lambda_i$ is, the larger the effect of the movements in region $i$ on the national ones; in other words, the higher the value of the parameter $\lambda_i$, the higher the correlation between national and regional fluctuations will be. However, if $0 < \lambda_i < 1$, national fluctuations will be smaller than regional ones (aggregation reduces the variability). From this point of view, $\lambda_i$ has the same interpretation as the slope in a classical regression equation: it determines the relative size of the regional fluctuations with respect to the national ones. Last, $\eta_{t,m,i}$ is an error term which represents the shocks experienced by regional production that are not directly related to national shocks.
According to the above definitions, $\lambda_i$ and $\sigma^2_\eta$ (the variance of $\eta_{t,m,i}$) allow the quantification of the linkage between (the fluctuations of) region $i$ and the nation: the higher $\lambda_i$ and the lower $\sigma^2_\eta$, the stronger the linkage. As $\sigma^2_\eta$ is a measure of the amount of noise in the relationship between the nation and the region, the lower the noise is, the smaller the effects of the variations of industrial output in other regions on the national one will be. In fact, $\sigma^2_\eta$ is lower-bounded by zero (no noise) and has an upper bound at which the national indicator does not give any information on the regional evolution.
² For more details, see Clar et al. (1998).
Israilevich and Kuttner (1993) propose normalising $\sigma^2_\eta$ by the variance of the endogenous variable of (2) to measure the link between the fluctuations of region $i$ and the nation, calling this measure pseudo-$R^2$. The pseudo-$R^2$ measures how informative the fluctuations of the national indicator are for inferring the size and the direction of the fluctuations of the regional indicator: if it is equal to zero, the national IPI fluctuations will not supply any information about the regional fluctuations; if it takes values close to one, the national indicator provides valuable information about the regional evolution. Note, consequently, that an advantage of the proposed model is that it yields, as a by-product, a measure of the nature and the degree of the linkages between regional and national economic activity.
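As an illustration of this normalisation, the following Python sketch computes a pseudo-R² as one minus the ratio between the noise variance of equation (2) and the variance of national growth. The function name `pseudo_r2`, the use of the population variance and the exact "one minus ratio" form are assumptions made here; they are chosen only to reproduce the two limiting cases described above (zero when the national indicator is uninformative, one when there is no noise).

```python
from statistics import pvariance


def pseudo_r2(noise_variance, national_growth):
    """Pseudo-R^2: share of the variance of national growth not due to noise.

    noise_variance  -- variance of the disturbance of the measurement equation
    national_growth -- sequence of growth rates of the national index
    Assumption: the normalising variance is the population variance.
    """
    total_variance = pvariance(national_growth)
    if total_variance <= 0.0:
        raise ValueError("national growth must have positive variance")
    return 1.0 - noise_variance / total_variance
```

With no noise the function returns one, and when the noise variance equals the variance of national growth it returns zero.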
The last element of the model is the requirement that the estimates of monthly regional output be consistent with the (only) available indicator of regional production, the annual industrial GAV data. This is imposed through a condition, equation (3), in which $\Delta_A x^{reg}_t$ represents the annual variation (of the logarithm) of regional production.
The model is then formed by equations (1), (2) and (3); taken together, these equations can be expressed in state-space form⁴, which permits the application of the Kalman filter to obtain estimates of the regional manufacturing production indexes. A possible specification⁵ of equations (1), (2) and (3) is given by equations (4) and (5).
⁴ See Harvey (1989).

Estimation of the model and validation of the results
Once the theoretical model has been introduced and expressed in state-space form, in this section we use the Kalman filter to obtain estimates of the regional indicators for those regions with direct regional indexes (País Vasco, Asturias and Andalucía). Next, these estimates are compared with the available direct indicators and the INE indirect indexes to assess the validity of the methodology.

Statistical information available for the exogenous variables.
Statistical information about labour, capital, regional GAV in manufacturing and the national IPI (the exogenous variables) is needed for the model. As regards the labour input, the number of hours worked in manufacturing (proxy of labour) is not available on a monthly basis. So, we approximate it by another variable which was available for the three regions considered: the number of industrial workers in the General Social Security System. For these regions, electric energy consumption for industrial purposes (proxy of capital) has only been available on a monthly basis from January 1993 to December 1996. The source used for the regional GAV was Contabilidad Regional de España and the national IPI was the direct quantitative indicator produced by the INE. As a consequence, after differencing, only 36 observations (for the period 1994-96) were available for each variable considered.
⁵ The definition of the state vector depends on the characteristics of the system considered, but there is usually more than one possible state-space form for each system. As a rule, models with a lower number of parameters are preferred.

Estimation using the Kalman filter: hyperparameters and initial values.
Once the model formed by equations (4) and (5) has been specified in terms of a state-space form, it is possible to apply the Kalman filter to estimate the values of the latent variable. The Kalman filter is a recursive procedure that allows us to obtain the optimal estimates (in terms of MSE) of the state vector at time t using all the available information at time t-1, and updates and improves these estimates when additional information about the observable variables becomes available.
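The recursion can be sketched in Python for a scalar latent variable. This is a simplified illustration and not the specification estimated in the paper: it assumes that regional growth equals a constant plus weighted growth of energy and labour plus noise, that national growth equals an intercept plus a weight times regional growth plus noise, that all hyperparameters are known, and it ignores the annual consistency condition (3). The names `Hyper` and `filter_regional_growth` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hyper:
    alpha: float    # capital share (state equation)
    beta: float     # labour share (state equation)
    gamma: float    # technological progress (state equation)
    mu: float       # intercept of the measurement equation
    lam: float      # weight of regional growth in the measurement equation
    var_eps: float  # variance of the state disturbance
    var_eta: float  # variance of the measurement disturbance


def filter_regional_growth(d_energy, d_labour, d_national, h):
    """Return filtered estimates of regional growth and their variances."""
    estimates, variances = [], []
    for de, dl, dn in zip(d_energy, d_labour, d_national):
        # Prediction step: under the assumed state equation there is no
        # lagged state, so the prediction uses only the current inputs.
        a_pred = h.gamma + h.alpha * de + h.beta * dl
        p_pred = h.var_eps
        # Innovation: national growth minus its one-step-ahead prediction.
        v = dn - (h.mu + h.lam * a_pred)
        f = h.lam ** 2 * p_pred + h.var_eta
        # Update step: correct the prediction with the national indicator.
        k = p_pred * h.lam / f
        estimates.append(a_pred + k * v)
        variances.append(p_pred - k * h.lam * p_pred)
    return estimates, variances
```

For instance, with unit disturbance variances, a unit weight and zero input growth, an observed national growth of 2.0 is split evenly between the latent variable and the noise, so the filtered estimate is 1.0 with variance 0.5.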
However, as noted above, applying the Kalman filter requires knowledge of the values of the hyperparameters. In the proposed model, the hyperparameters are $\alpha$, $\beta$, $\gamma$, $\mu_i$, $\lambda_i$ and the variances of the error terms of equations (4) and (5). There are three different approaches to estimating these values.
The usual approach is to estimate the unknown hyperparameters by maximum likelihood using the prediction error decomposition (Harvey, 1984). Since the analytical expressions of the derivatives of the system likelihood function are too complicated, numerical expressions and numerical optimisation procedures are usually used. The main disadvantage of using this recursive procedure is its high sensitivity to the numerical optimisation procedure chosen and to the available sample.
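The prediction error decomposition can be illustrated with a short Python sketch for a single observed series. It assumes Gaussian disturbances and takes as inputs the one-step-ahead prediction errors and their variances, as produced by a run of the Kalman filter for given hyperparameter values; the function name `gaussian_loglik` is hypothetical.

```python
import math


def gaussian_loglik(innovations, innovation_variances):
    """Log-likelihood built from prediction errors v_t and their variances F_t.

    Each observation contributes -0.5 * (log(2 * pi * F_t) + v_t**2 / F_t).
    """
    total = 0.0
    for v, f in zip(innovations, innovation_variances):
        if f <= 0.0:
            raise ValueError("innovation variances must be positive")
        total += -0.5 * (math.log(2.0 * math.pi * f) + v * v / f)
    return total
```

A numerical optimiser would evaluate this function repeatedly, rerunning the filter for each candidate set of hyperparameters, which is one reason why the result can be sensitive to the optimisation procedure and to a short sample.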
The second approach consists of estimating the values of the hyperparameters using the EM algorithm, first developed by Dempster et al. (1977) and introduced in this framework by Shumway and Stoffer (1982) and Watson and Engle (1983).
The third approach consists of specifying the model as simply as possible and fixing some of the hyperparameters using a priori (external, ad hoc) information, as this considerably reduces the complexity of the maximum likelihood estimation (Hackl and Westlund, 1996). This approach is strongly recommended when the number of available observations is small. It also considerably reduces the computational cost of the estimation procedure.
In this paper, we have tried to consider these three approaches. However, our attempts to apply the maximum likelihood estimation method for the whole set of hyperparameters have not been satisfactory as no convergence using the usual criteria was obtained. For this reason, we have focused on the other two alternatives, the EM algorithm and the a priori estimation.
With regard to the a priori estimation of the hyperparameters, the parameters of the state […] The parameters of the measurement equation ($\mu_i$ and $\lambda_i$) have been obtained as the regional share in the national GAV (see table 2) using an OLS regression. Table 2 also summarises the results of the estimation process using the EM algorithm for the three regions considered. As can be seen, there are important differences between the two sets of estimates, especially for the parameters associated with the factor shares in the regional production function. In particular, the EM estimates are much lower than the a priori estimates and have no economic interpretation, which is one of the main disadvantages of applying this algorithm. Last, it is important to remark that the variances of the perturbance terms of both equations have been estimated in both cases by maximum likelihood, obtaining similar values.

TABLE 2
Nevertheless, to obtain estimates of the state vector (the regional manufacturing indicator), the initial values of the state vector and its associated prediction error covariance matrix are also needed. To solve the problem of the initialisation of the Kalman filter, we have followed the proposal of Harvey (1981 and 1989) and Bell and Hillmer (1991), which consists of starting the recursions from $t=0$ assuming that the initial values of the unobservable variable ($a_0$) are equal to zero and that the prediction error covariance matrix ($P_0$) is equal to $\kappa I$, where $I$ is the identity matrix and the constant $\kappa$ has been approximated by $10^6$, as in most empirical studies.
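The effect of this initialisation can be seen in a minimal Python sketch of a single measurement update for a scalar state. It assumes a measurement equal to a weight times the state plus noise (no intercept), and the function name `first_update` and its arguments are hypothetical. With a very large initial variance, the first update is driven almost entirely by the data, so the arbitrary initial value of zero has practically no influence.

```python
def first_update(y, lam, var_eta, a0=0.0, kappa=1e6):
    """One measurement update starting from state a0 with variance kappa."""
    f = lam ** 2 * kappa + var_eta  # variance of the first innovation
    k = kappa * lam / f             # Kalman gain
    a1 = a0 + k * (y - lam * a0)    # updated state estimate
    p1 = kappa - k * lam * kappa    # updated state variance
    return a1, p1
```

With a unit weight and unit noise variance, an observation of 2.0 moves the estimate from zero to approximately 2.0, whereas a very small initial variance would leave it close to zero.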

Validation of the results
Once the hyperparameters have been estimated using the two approaches mentioned above (a priori estimation and the EM algorithm) and the problem of the initial values of the Kalman filter has been solved, it is straightforward to obtain estimates of the regional industrial production […] Kalman filter using a priori estimation; SS-KF-EM, state-space Kalman filter with EM algorithm estimation) are compared. As shown, the results obtained with the latent variable models provide a good approximation to the evolution of the direct indexes. The comparison between the results obtained using the a priori (external) estimates of the hyperparameters and those obtained using the EM algorithm shows that the second works better for the País Vasco and the first works better for Asturias, while for Andalucía there are no remarkable differences.

FIGURES 1 TO 3
The results for the pseudo-$R^2$ have also been calculated and are shown in table 3. The results using the a priori estimates of the hyperparameters and those using the EM algorithm are very similar. The values for Asturias and País Vasco are very similar and close to 0.5, while the value for Andalucía is higher, near 0.7. As the pseudo-$R^2$ measures how informative the fluctuations of the national indicator are for inferring the size and the direction of the fluctuations of the regional indicator, this means that the national indicator provides more valuable information about the regional evolution in Andalucía than in the other two regions.

Conclusions
In a previous work, Clar et al. (2000), we analysed the advantages and disadvantages of the indirect method used by the INE for measuring quantitative manufacturing indicators for the Spanish regions. The results showed that this method is only valid under certain hypotheses (among others, the similarity of the regional productive structure to the national structure and the weight of regional manufacturing production in national production). For this reason, in this paper, we have estimated a latent variable model, following Israilevich and Kuttner (1993), using the Kalman filter with the aim of overcoming some of the deficiencies of the INE's method. The results obtained for the three Spanish regions where direct indicators were available show that this method overcomes some of these difficulties. These conclusions could be confirmed as more regional statistical information becomes available. In this sense, it is also important to remark that the estimates of labour and capital could be improved; the production function could be extended to consider richer specifications for technology and possible changes in factor shares over time; and the dynamics of the relationships between regional and national economic activity could be analysed in more detail.
Last, the empirical results also highlight the potential of state-space modelling and the Kalman filter for analysing economic data, especially in a regional economic context. Considering different procedures for estimating the unknown values of the hyperparameters has also made it possible to compare their accuracy. There seems to be evidence in favour of using the EM algorithm instead of the a priori (external) estimates of some hyperparameters. However, the a priori estimates provide quite accurate results, have a clear economic interpretation and involve much less complexity and computational time.