On the relationship between α connections and the asymptotic properties of predictive distributions

In a recent paper, Komaki studied the second-order asymptotic properties of predictive distributions, using the Kullback–Leibler divergence as a loss function. He showed that estimative distributions with asymptotically efficient estimators can be improved by predictive distributions that do not belong to the model. The model is assumed to be a multidimensional curved exponential family. In this paper we generalize the result by taking any f divergence as the loss function. A relationship arises between α connections and optimal predictive distributions. In particular, when an α divergence is used to measure the goodness of a predictive distribution, the optimal shift of the estimative distribution is related to α-covariant derivatives. The expression that we obtain for the asymptotic risk is also useful for studying the higher-order asymptotic properties of an estimator, in the mentioned class of loss functions.


Introduction
The main goal of this work is to provide distributions that are close, in the sense of an f divergence, to an unknown distribution belonging to a curved exponential family $\mathcal{P}$. We measure the closeness of a distribution $\hat{p}(x; x(1), \dots, x(N))$, built from the sample, to the true distribution $p(x; u)$ by

$$D_f\bigl(p(x; u), \hat{p}(x; x(1), \dots, x(N))\bigr) = \int f\!\left(\frac{\hat{p}(x; x(1), \dots, x(N))}{p(x; u)}\right) p(x; u)\,\mu(dx), \qquad (1)$$

where f is a smooth, strictly convex function that vanishes at 1.

In order to choose $\hat{p}$, we could try to find the distribution that minimizes (1), uniformly in u, among all probability distributions equivalent to p. Since there are some technical problems in giving the structure of a differentiable manifold to this infinite-dimensional space, we follow the procedure suggested by Komaki (1996) and try to solve the problem only for distributions belonging to a finite-dimensional model containing $\mathcal{P}$. We construct this model by enlarging $\mathcal{P}$ in orthogonal directions. We shall see that, for large samples, there is a special direction such that the improvement on the estimative density is maximal if and only if this direction belongs to the tangent space associated with the enlarged model. The solution does not change if we add more orthogonal directions and, in this sense, we can consider the problem solved in the infinite-dimensional space of all probability distributions equivalent to p.
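For simplicity, we shall work with α divergences $D_\alpha$ (for their use in statistical inference, see Amari (1985, Chapter 3)), i.e. f divergences with

$$f_\alpha(t) = \begin{cases} \dfrac{4}{1-\alpha^2}\,\bigl\{1 - t^{(1+\alpha)/2}\bigr\}, & \alpha \neq \pm 1, \\ t \log t, & \alpha = 1, \\ -\log t, & \alpha = -1. \end{cases}$$

In the final remark, we extend the results to any f divergence.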
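As a small numerical illustration of the α divergences used throughout, the following sketch evaluates $D_f(p, q) = \sum_x f(q(x)/p(x))\,p(x)$ on a finite sample space. It assumes the standard generator from Amari (1985), $f_\alpha(t) = 4(1-\alpha^2)^{-1}\{1 - t^{(1+\alpha)/2}\}$ with the limiting cases $t\log t$ ($\alpha = 1$) and $-\log t$ ($\alpha = -1$). The probability vectors `p` and `q` are hypothetical and serve only as an example.

```python
import math


def f_alpha(t, alpha):
    """Assumed generator of the alpha divergence (Amari's convention), with f(1) = 0."""
    if alpha == 1:
        return t * math.log(t)
    if alpha == -1:
        return -math.log(t)
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - t ** ((1.0 + alpha) / 2.0))


def f_divergence(p, q, f):
    """D_f(p, q) = sum_x f(q(x) / p(x)) p(x) for strictly positive probability vectors."""
    return sum(f(qx / px) * px for px, qx in zip(p, q))


def alpha_divergence(p, q, alpha):
    """Alpha divergence from p to q, as the f divergence generated by f_alpha."""
    return f_divergence(p, q, lambda t: f_alpha(t, alpha))


# Hypothetical distributions on three points (illustration only).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Reference quantities for the special cases alpha = -1, 1, 0.
kl_pq = sum(px * math.log(px / qx) for px, qx in zip(p, q))
kl_qp = sum(qx * math.log(qx / px) for px, qx in zip(p, q))
hellinger_type = 4.0 * (1.0 - sum(math.sqrt(px * qx) for px, qx in zip(p, q)))

for alpha in (-1, 0, 1):
    print(alpha, alpha_divergence(p, q, alpha))
```

Under this convention, $\alpha = -1$ gives the Kullback–Leibler divergence from p to q, $\alpha = 1$ gives the divergence with the arguments exchanged, and $\alpha = 0$ gives a multiple of the squared Hellinger distance.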

The enlarged model
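Let $\mathcal{E}$ be an n-dimensional full exponential family, i.e.

$$\mathcal{E} = \bigl\{p(x; \theta) = \exp\{\theta^i x_i - \psi(\theta)\},\ \theta \in \Theta\bigr\},$$

where the probability functions $p(x; \theta)$ are densities with respect to some σ-finite reference measure μ and Θ is an open subset of $\mathbb{R}^n$. We consider the model $\mathcal{P}$ to be an (n, m)-curved exponential family of $\mathcal{E}$, m < n,

$$\mathcal{P} = \bigl\{p(x; u) = p(x; \theta(u)) \in \mathcal{E} : u \in U \subset \mathbb{R}^m\bigr\}.$$

Let

$$\ell_\alpha(x; u) = \begin{cases} \dfrac{2}{1-\alpha}\, p(x; u)^{(1-\alpha)/2}, & \alpha \neq 1, \\ \log p(x; u), & \alpha = 1, \end{cases}$$

be the so-called α representation of $p(x; u)$ (Amari 1985, p. 66). From now on, the index α will be used to denote everything that concerns the α representation of geometric quantities. The tangent space $T_u$ of $\mathcal{P}$ at u is identified with the vector space spanned by

$$\partial_a \ell_\alpha(x; u), \qquad a = 1, \dots, m,$$

which are the components of what we call the α-score function. The first and second derivatives of $\ell_\alpha(x; u)$ are related to those of $\ell(x; u) = \log p(x; u) = \ell_1(x; u)$ by

$$\partial_a \ell_\alpha = p^{(1-\alpha)/2}\, \partial_a \ell, \qquad \partial_a \partial_b \ell_\alpha = p^{(1-\alpha)/2} \Bigl\{\partial_a \partial_b \ell + \frac{1-\alpha}{2}\, \partial_a \ell\, \partial_b \ell\Bigr\},$$

and we have that the inner product of the vectors $\partial_a \ell_\alpha$ and $\partial_b \ell_\alpha$,

$$\langle \partial_a \ell_\alpha, \partial_b \ell_\alpha \rangle_\alpha = \int p^{\alpha}\, \partial_a \ell_\alpha\, \partial_b \ell_\alpha\, \mu(dx) = \int p\, \partial_a \ell\, \partial_b \ell\, \mu(dx),$$

does not depend on the α representation; it is the (a, b) component of the Fisher information matrix $g_{ab}$. In the sequel, we omit the subscript α in the inner product and in the expectation, since it will be clear from the representation used. We indicate by $g^{ab}$ the inverse of $g_{ab}$ and use the repeated index convention.

Following Amari et al. (1987), we can construct a vector bundle on $\mathcal{P}$ by associating to each point $p(x; u) \in \mathcal{P}$ a linear space $H_u$ defined by

$$H_u = \Bigl\{h : \int p^{(1+\alpha)/2}\, h\, \mu(dx) = 0,\ \int p^{\alpha} h^2\, \mu(dx) < \infty \Bigr\}.$$

Attached to different points we have different but isomorphic Hilbert spaces. In order to see this, let $p = p(x; u)$ and $q = p(x; u')$ be two different points of $\mathcal{P}$ and consider the transformation

$$I^{u'}_{u} : H_u \to H_{u'}, \qquad I^{u'}_{u}(h) = \Bigl(\frac{p}{q}\Bigr)^{\alpha/2} h - q^{(1-\alpha)/2} \int q^{(1+\alpha)/2} \Bigl(\frac{p}{q}\Bigr)^{\alpha/2} h\, \mu(dx).$$

In fact, it is easy to see that $I^{u'}_{u}(h) \in H_{u'}$, since

$$\int q^{(1+\alpha)/2}\, I^{u'}_{u}(h)\, \mu(dx) = 0$$

and

$$\int q^{\alpha} \bigl\{I^{u'}_{u}(h)\bigr\}^2 \mu(dx) = \int p^{\alpha} h^2\, \mu(dx) - \Bigl(\int q^{(1+\alpha)/2} \Bigl(\frac{p}{q}\Bigr)^{\alpha/2} h\, \mu(dx)\Bigr)^{2}. \qquad (2)$$

Moreover, $I^{u'}_{u}$ is linear, its inverse is

$$\bigl(I^{u'}_{u}\bigr)^{-1}(k) = \Bigl(\frac{q}{p}\Bigr)^{\alpha/2} \Bigl\{k - q^{(1-\alpha)/2}\, \frac{\int p^{1/2} q^{\alpha/2}\, k\, \mu(dx)}{\int (pq)^{1/2}\, \mu(dx)}\Bigr\}, \qquad k \in H_{u'},$$

and, by (2), it is bounded. $I^{u'}_{u}$ is then a continuous linear bijection, i.e.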
an isomorphism between $H_u$ and $H_{u'}$. The aggregate $H(\mathcal{P}) = \bigcup_{u \in U} H_u$ constitutes Amari's Hilbert bundle.

It is necessary to establish a one-to-one correspondence between $H_u$ and $H_{u'}$, when $p(x; u)$ and $p(x; u')$ are neighbouring points, in order to express the rate of variation in a vector field as an element of the Hilbert bundle. If we move in the direction $\partial_a \ell_\alpha$ and $h_u \in H_u$, then $\partial_a h_u \notin H_u$ in general. Anyway, if $h_u$ is a smooth vector field, in the sense that we can interchange the integral and the derivative,

Thus, we can define the α-covariant derivative in H as

and the α-covariant derivative in $\mathcal{P}$ is the projection of $\nabla^{\alpha(H)}$

These connections coincide with the α connections defined by Amari (1985, p. 38). We use the superscripts m and e respectively for the −1 and 1-covariant derivatives.
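The claim that the inner product of α-scores does not depend on the α representation can be checked numerically on a toy model. The sketch below is only an illustration under stated assumptions: it takes Amari's α representation $\ell_\alpha = 2(1-\alpha)^{-1} p^{(1-\alpha)/2}$ (with $\ell_1 = \log p$), the weighted inner product $\sum_x p^{\alpha} f g$, and a hypothetical one-parameter binomial model with three trials, which is not part of the paper. The derivative of $\ell_\alpha$ is approximated by a central difference.

```python
import math

N_TRIALS = 3  # hypothetical toy model: binomial with three trials


def binom_pmf(x, u):
    """Probability of x successes in N_TRIALS trials with success probability u."""
    return math.comb(N_TRIALS, x) * u ** x * (1.0 - u) ** (N_TRIALS - x)


def ell_alpha(x, u, alpha):
    """Assumed alpha representation of p(x; u) (Amari's convention)."""
    p = binom_pmf(x, u)
    if alpha == 1:
        return math.log(p)
    return 2.0 / (1.0 - alpha) * p ** ((1.0 - alpha) / 2.0)


def d_ell_alpha(x, u, alpha, eps=1e-6):
    """Central-difference approximation of the alpha-score d/du ell_alpha(x; u)."""
    return (ell_alpha(x, u + eps, alpha) - ell_alpha(x, u - eps, alpha)) / (2.0 * eps)


def alpha_inner_product_of_scores(u, alpha):
    """<d ell_alpha, d ell_alpha> = sum_x p^alpha (d ell_alpha)^2."""
    return sum(
        binom_pmf(x, u) ** alpha * d_ell_alpha(x, u, alpha) ** 2
        for x in range(N_TRIALS + 1)
    )


def alpha_mean_of_score(u, alpha):
    """sum_x p^((1 + alpha) / 2) d ell_alpha, which should vanish for a vector of H_u."""
    return sum(
        binom_pmf(x, u) ** ((1.0 + alpha) / 2.0) * d_ell_alpha(x, u, alpha)
        for x in range(N_TRIALS + 1)
    )


u = 0.3
fisher = N_TRIALS / (u * (1.0 - u))  # Fisher information of the binomial model
for alpha in (-1, 0, 0.5, 1, 3):
    print(alpha, alpha_inner_product_of_scores(u, alpha), fisher)
```

For every α the weighted inner product agrees with the Fisher information of the toy model, and the α-score satisfies the zero-mean constraint that places it in $H_u$.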
Let $\mathcal{M}$ be any regular parametric model containing $\mathcal{P}$. We can consider on $\mathcal{M}$ the coordinate system (u, s), where $u^a$, $a = 1, \dots, m$, is the old coordinate system on $\mathcal{P}$ and $s^I$, $I = m+1, \dots, r$, $r > m$, are new coordinates on $\mathcal{M}$. We suppose that s = 0 for the points in the original manifold $\mathcal{P}$ and that u and s are orthogonal in $\mathcal{P}$. The tangent space to the enlarged model $\mathcal{M}$ is now spanned by the vectors $\partial_a \ell_\alpha(x; u, s)$, $a = 1, \dots, m$, and $\partial_I \ell_\alpha(x; u, s)$, $I = m+1, \dots, r$.

Predictive distributions
We consider predictive distributions $p(x; u_N(\bar{x}), s(\bar{x}))$, with $s(\bar{x}) = O_p(N^{-1})$, so that

and $u_N(\bar{x})$ is a smooth, asymptotically efficient estimator, and hence first-order equivalent to the maximum-likelihood estimator, of the form

For fixed x, both depend on N only through x.

Predictions and α connections
For each N, $u_N$ is a map $u_N : \mathcal{E} \to \mathcal{P}$, since $\bar{x}$ can be identified with the point in $\mathcal{E}$ having expectation parameters $\eta_i = \bar{x}_i$. Then, $u_\infty$ is also a map from $\mathcal{E}$ to $\mathcal{P}$ and we can associate with $u_N$ a family of ancillary (n − m)-dimensional submanifolds of $\mathcal{E}$, $\mathcal{A} = \{A(u)\}$, where $A(u) = u_\infty^{-1}(u)$. In some discrete cases, even though the exponential model is regular, $\bar{x}$ could correspond to the expectation parameters of a point in $\mathcal{E}$ with a probability different from one. However, since this probability goes to one exponentially in N, we can consider a modification of $\bar{x}$, say $\bar{x}^*$, such that $\bar{x}^* = \bar{x} + o_p(N^{-2})$ and $\bar{x}^*$ are the expectation coordinates of some point in $\mathcal{E}$. Then, all the results could be rewritten in terms of $\bar{x}^*$ instead of $\bar{x}$.
Following Amari (1985, p. 128), it can be shown that $u_\infty$ is consistent if and only if every $p(x; u) \in \mathcal{P}$ is contained in the associated submanifold A(u), and $u_\infty$ is asymptotically first-order efficient if and only if A(u) is orthogonal to $\mathcal{P}$ at u. On the other hand, since $\lim_{N \to \infty} u_N = u_\infty$ in distribution, the results still hold for $u_N$.
If we introduce a coordinate system $v^\kappa$, $\kappa = m+1, \dots, n$, on each A(u), every point in the full exponential family containing $\mathcal{P}$ is uniquely determined by a pair (u, v). It is convenient to fix v = 0 for the points in $\mathcal{P}$. We denote by indices $a, b, c, \dots \in \{1, \dots, m\}$ the coordinates u in $\mathcal{P}$, by $\kappa, \lambda, \mu, \dots \in \{m+1, \dots, n\}$ the coordinates v in A(u) and by $\alpha, \beta, \gamma, \dots \in \{1, \dots, n\}$ the new coordinates w = (u, v) in $\mathcal{E}$. Since $u_N$ is asymptotically efficient, $g_{a\kappa}(u) = 0$. Indices $i, j, \dots \in \{1, \dots, n\}$ are used to denote the natural parameters θ in $\mathcal{E}$ and indices $I, J, K, \dots \in \{m+1, \dots, r\}$ for the coordinates s that we add to enlarge the model $\mathcal{P}$. By the coordinate system we choose on $\mathcal{M}$, $g_{aI}(u) = 0$. Under these assumptions, we have the following theorem.
Theorem 3.1. The average α divergence from the true distribution $p(x; u_0)$ to a predictive distribution $p(x; u_N(\bar{x}), s(\bar{x}))$ is given by

where all the quantities are evaluated at $u_0$, $u = u(E(\bar{x}))$, $s = s(E(\bar{x}))$, and $\nabla^{\alpha}_a$ is the a component of the general covariant derivative of a tensor with respect to the α connection.
Proof. Only an outline of the proof is given; see Corcuera and Giummolè (1996) for detailed calculations. Since, from the definition of $f_\alpha$,

and $s(\bar{x}) = O_p(N^{-1})$, the expansion of an α divergence from $p(x; u_0)$ to $p(x; u_N, s)$ is

where $\tilde{u} = u_N - u_0$ and

The brackets [ ] refer to the sum of a number of different terms obtained by permutation of free indices, e.g.
The mean value of $D_\alpha$ is

The mean squared error of $u_N$ can be written as

where $\tilde{x}_i = \bar{x}_i - \partial_i \psi$. We can easily calculate the moments of $\tilde{x}$:

By using geometrical properties of curved exponential families, it can be shown that

Moreover, by (11) and (10),

where $u = u(E_{u_0}(\bar{x}))$. If we substitute (11) and (12) in (9), we can finally write (13). By (4), we also have that

where $s = s(E_{u_0}(\bar{x}))$. By (11) and (10),

and

We can now use (13)–(17) and (7) to calculate each term of (8). With some further calculations we obtain the result. □

From (6) we can obtain a decomposition of the average α divergence from the true distribution to any predictive one, in two parts:

The first term in (18) depends on the choice of the estimative distribution and the other on the shift orthogonal to the model $\mathcal{P}$. It is well known that the problem of choosing a second-order efficient estimator $u_N(\bar{x})$ does not, in general, have a unique solution. On the other hand, the following theorem solves the problem of the choice of the optimal shift orthogonal to the model.
Theorem 3.2. The optimal choice of $s^I(\bar{x})$, with respect to an α divergence, is given, up to order $N^{-1}$, by

where $u_N(\bar{x})$ is any asymptotically efficient estimator.
Proof. It is easy to see, by differentiating (18) with respect to s, that the minimum value of the asymptotic risk corresponds to

The result follows by (4). □

Let us now define, for $a, b = 1, \dots, m$,

The vectors $h_{ab}$ are, by definition, orthogonal to the original model $\mathcal{P}$. Moreover, they belong to $H_u$. The following theorem explains the important role that they play in our analysis.
Theorem 3.3. The difference in average α divergence from the true distribution, between the estimative distribution $p(x; u_N(\bar{x}))$ and the optimal predictive distribution $p(x; u_N(\bar{x}), s_{\mathrm{opt}}(\bar{x}))$, is maximal if and only if the vector $g^{ab} h_{ab}$ belongs to the linear space spanned by the $h_I$. In this case, the optimal predictive distribution is

Proof. By (20) and the definition of $H^{\alpha}_{abI}$, we have that

By substituting (19) in (18), we obtain an expression for the difference

$$E_{u_0}\bigl\{D_\alpha\bigl(p(x; u_0), p(x; u_N(\bar{x}))\bigr)\bigr\} - E_{u_0}\bigl\{D_\alpha\bigl(p(x; u_0), p(x; u_N(\bar{x}), s_{\mathrm{opt}}(\bar{x}))\bigr)\bigr\}$$

which depends only on the projection of $g^{ab} h_{ab}$ on the linear space spanned by the $h_I$. Thus, it is maximal if and only if $g^{ab} h_{ab}$ is included in this space and its maximal value is

In this situation, by (19), (22) and (20), we have that

and the result follows by substituting (24) in (3). □

Remark. Including the vector $g^{ab} h_{ab}$ in the enlarged model allows us to attain the best improvement on the estimative distribution. For any regular parametric model $\mathcal{M}$ containing $\mathcal{P}$ and $g^{ab} h_{ab}$ we obtain the same optimal predictive distribution. In this sense, (21) gives a predictive distribution that can be considered optimal among all probability distributions equivalent to p.
In the case when $\mathcal{P}$ itself is a full exponential family, we can write (21) in a simpler form

Note that for α = 1 there is no correction, i.e. we do not move out of the full exponential model. Moreover, for α = −1 we obtain exactly the same result as Vidoni (1995, p. 858, equation (3.1)).

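Example 3.1. We consider m-dimensional multivariate normal distributions $N(\mu, I_m)$:

$$p(x; \mu) = (2\pi)^{-m/2} \exp\Bigl\{-\tfrac{1}{2}\, \|x - \mu\|^2\Bigr\},$$

where $\mu = (\mu^i)$, $i = 1, \dots, m$, is unknown. We have that $g_{ij}(\mu) = \delta_{ij}$ and $\Gamma^{\alpha}_{ijk}(\mu) = 0$, for all α. Now let $x(l)$, $l = 1, \dots, N$, be independent observations from $N(\mu, I_m)$ and let $\hat{\mu} = \hat{\mu}_N(\bar{x})$ be any estimator for the mean vector μ, where $\bar{x} = N^{-1} \sum_{l=1}^{N} x(l)$. We thus have that the optimal predictive distribution can be written in closed form as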
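The size of the gain available in this example can be illustrated numerically for the Kullback–Leibler case α = −1. The sketch below does not use the closed form of the example. As an assumption, it compares the estimative density $N(\bar{x}, I_m)$ with the familiar predictive density $N(\bar{x}, (1 + 1/N) I_m)$, which is the Bayesian predictive density under a flat prior on μ. For both, the average Kullback–Leibler divergence from $N(\mu, I_m)$ is available exactly, since $E\|\bar{x} - \mu\|^2 = m/N$. The function names are hypothetical.

```python
import math


def kl_risk_normal(m, n_obs, scale):
    """Exact E[KL(N(mu, I_m) || N(xbar, scale * I_m))] when xbar ~ N(mu, I_m / n_obs).

    Uses KL(N(mu, I) || N(a, c I)) = 0.5 * (m log c + m / c + |mu - a|^2 / c - m).
    """
    mse = m / n_obs  # E |xbar - mu|^2
    return 0.5 * (m * math.log(scale) + (m + mse) / scale - m)


def risk_gain(m, n_obs):
    """Average KL divergence of the estimative density minus that of the predictive one."""
    estimative = kl_risk_normal(m, n_obs, 1.0)
    predictive = kl_risk_normal(m, n_obs, 1.0 + 1.0 / n_obs)  # assumed predictive density
    return estimative - predictive


m = 3
for n_obs in (10, 100, 1000):
    gain = risk_gain(m, n_obs)
    # The last column approaches 1, i.e. the gain behaves like m / (4 N^2).
    print(n_obs, gain, gain * 4 * n_obs ** 2 / m)
```

The estimative risk is m/(2N) and the predictive risk is (m/2) log(1 + 1/N), so the gain is positive and of order $N^{-2}$, the order at which the improvements studied in this paper appear.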
For α = −1, the optimal predictive distribution coincides, up to order $N^{-1}$, with the result of Barndorff-Nielsen and Cox (1994, p. 318). By (23), we can calculate the difference in average α divergence between the estimative distribution and the predictive distribution:

which does not depend on $\hat{\mu}$, the efficient estimator used. Now let $\hat{\mu}$ be the James–Stein estimator for μ, i.e.

We can use (6) with s = 0 to compare the two estimative distributions obtained respectively from the maximum-likelihood estimator $\hat{\mu}_{\mathrm{mle}} = \bar{x}$ and the James–Stein estimator:

$$E_\mu\bigl\{D_\alpha\bigl(p(x; \mu), p(x; \hat{\mu}_{\mathrm{mle}})\bigr)\bigr\} - E_\mu\bigl\{D_\alpha\bigl(p(x; \mu), p(x; \hat{\mu})\bigr)\bigr\}$$

Remark. Let us consider an f divergence $D_f$ as a loss function. Without loss of generality, we can suppose that $f''(1) = 1$. Theorem 3.1 can be easily generalized to this case by putting $\alpha = 2 f'''(1) + 3$ and by substituting the coefficient $(\alpha - 11)(\alpha - 1)/32$ of the term $Q_{abcd}\, g^{ab} g^{cd} / N^2$ by

$$\beta = \frac{f^{(4)}(1) - 2 f'''(1) - 4}{8}.$$

In fact, in the expansion of $D_f$, the first- and second-order terms remain unchanged. The coefficient of the third-order term is

and it can be written as

with $\alpha = 2 f'''(1) + 3$. The coefficient β is calculated by

Since $H_u$ is a closed linear subspace of $L^2(p^{\alpha} \mu)$, it is a Hilbert space. It is easy to see that $T_u \subset H_u$ and that the inner product defined on $T_u$ is compatible with that in $H_u$.
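The correspondence in the final remark can be checked exactly for the α divergences themselves. The sketch below assumes the generator $f_\alpha(t) = 4(1-\alpha^2)^{-1}\{1 - t^{(1+\alpha)/2}\}$, $\alpha \neq \pm 1$ (Amari's convention), and computes its derivatives at t = 1 in rational arithmetic. It then verifies that $f''(1) = 1$, that $2 f'''(1) + 3$ returns α, and that $\beta = \{f^{(4)}(1) - 2 f'''(1) - 4\}/8$ reduces to $(\alpha - 11)(\alpha - 1)/32$. The function names are hypothetical.

```python
from fractions import Fraction


def derivatives_at_one(alpha):
    """Second, third and fourth derivatives at t = 1 of the assumed generator
    f_alpha(t) = 4 / (1 - alpha^2) * (1 - t^k), with k = (1 + alpha) / 2 and alpha != +-1."""
    alpha = Fraction(alpha)
    k = (1 + alpha) / 2
    c = Fraction(4) / (1 - alpha ** 2)
    d2 = -c * k * (k - 1)   # f''(1)
    d3 = d2 * (k - 2)       # f'''(1)
    d4 = d3 * (k - 3)       # f''''(1)
    return d2, d3, d4


def alpha_and_beta(d3, d4):
    """alpha = 2 f'''(1) + 3 and beta = (f''''(1) - 2 f'''(1) - 4) / 8, assuming f''(1) = 1."""
    return 2 * d3 + 3, (d4 - 2 * d3 - 4) / 8


for a in (Fraction(0), Fraction(1, 2), Fraction(3), Fraction(-3), Fraction(5)):
    d2, d3, d4 = derivatives_at_one(a)
    recovered_alpha, beta = alpha_and_beta(d3, d4)
    print(a, d2, recovered_alpha, beta, (a - 11) * (a - 1) / 32)
```

Because the arithmetic is exact, the agreement of the last two printed columns is an identity in α and not a numerical coincidence, which supports reading β as the natural replacement for the coefficient $(\alpha - 11)(\alpha - 1)/32$ when a general f is used.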