Line Search Multilevel Optimization as Computational Methods for Dense Optical Flow

We evaluate the performance of different optimization techniques developed in the context of optical flow computation with different variational models. In particular, based on truncated Newton (TN) methods that have been an effective approach for large-scale unconstrained optimization, we develop the use of efficient multilevel schemes for computing the optical flow. More precisely, we compare the performance of a standard unidirectional multilevel algorithm—called multiresolution optimization (MR/Opt)—with that of a bidirectional multilevel algorithm—called full multigrid optimization (FMG/Opt). The FMG/Opt algorithm treats the coarse grid correction as an optimization search direction and eventually scales it using a line search. Experimental results on three image sequences using four models of optical flow with different computational efforts show that the FMG/Opt algorithm outperforms both the TN and MR/Opt algorithms in terms of the computational work and the quality of the optical flow estimation.


Introduction
The problem of optical flow computation consists in finding the 2D displacement field that represents the apparent motion of objects in a sequence of images. Many efforts have been devoted to it in computer vision and applied mathematics [25,35,18,3,4,11,40,44,12,10,31,27,1,2,6,32,33,36]. The computation of optical flow is usually based on the conservation of some property during motion, either the objects' gray level or their shape properties. In a variational setting, the problem is usually formulated as the minimization of an energy functional, which is a weighted sum of two terms: a data term coming from the motion modeling and a smoothing term resulting from the regularization process. For standard optical flow algorithms, the data term is usually based on a brightness constancy assumption, which assumes that the object illumination does not change along its motion trajectory. The regularization process ensures that the optical flow estimation problem is well-posed. Many regularization terms have been investigated in the last two decades, ranging from isotropic to anisotropic. Isotropic smoothers tend to blur motion edges, while anisotropic ones require additional computational resources.

Key words and phrases: optical flow, line search multigrid optimization, multiresolution, truncated Newton.

This research was initiated while the first author was visiting the Centre de Recerca Matemàtica and Universitat Pompeu Fabra in Barcelona during 2008. The first author was partially supported by FRGS grant UUM/RIMC/P-30 (S/O code 11872) offered by the Ministry of Higher Education Malaysia. The second and third authors acknowledge partial support by MICINN project MTM2009-08171 and by GRC reference 2009 SGR 773. The third author also acknowledges partial support by the "ICREA Acadèmia" prize for excellence in research, funded by the Generalitat de Catalunya.
Two strategies may be followed for the numerical minimization of an energy functional arising in a variational framework like the above. The first one, called optimize-discretize, consists in deriving, discretizing and solving the corresponding Euler-Lagrange equations. In the second approach, called discretize-optimize, one directly applies numerical optimization methods to a discrete version of the problem. While the first approach has been commonly used for variational optical flow computation, the second computational strategy has been less investigated in this context, to the best of our knowledge.
Adopting the first strategy, efficient algorithms have been designed to approximate the optical flow. In particular, in [1,2] the authors used a multiscale approximation to solve the corresponding Euler-Lagrange equations with the Nagel-Enkelmann regularization term [36]. Thus, they computed a series of approximations where each one solves a regularized version of the Euler-Lagrange equations starting from the previous one, while keeping the original grid fixed. A different approach is proposed in [9,41], where the authors use fixed point iterations to solve the corresponding Euler-Lagrange equations, fully implicit in the smoothness term and semi-implicit in the data term. Still, this fixed point iteration leads to implicit equations, and they linearize some of their terms using a warping technique. The equations obtained are fully linearized using a lagged diffusivity method [9,41,10]. The final linear system is solved using a linear solver like a Gauss-Seidel type method or an SOR method. The connections with warping are detailed in [9,41].
On the other hand, in order to develop efficient and accurate algorithms working in real time, there have been recent efforts to improve the performance of optical flow algorithms using multilevel techniques. We distinguish at this stage between two classes of multilevel algorithms. The first one, known as coarse-to-fine multiresolution, uses a sequence of coarse grid subproblems to find a good initialization for the finest grid problem that avoids possible local minima. We shall refer to this strategy in this paper as multiresolution. The second strategy alternates between solution relaxations using the underlying algorithm and solution corrections obtained from a sequence of subproblems defined on coarse grids. This leads to recursive algorithms like the so-called V- or W-cycle, which traverse between fine and coarse grids in the mesh hierarchy. We will reserve the term multigrid for it. In the case of elliptic PDEs, and for a wide class of problems, multigrid methods are known to outperform multiresolution methods.
However, being straightforward to implement, multiresolution methods have been more widely used in computer vision, particularly for motion estimation; e.g. [18,3,42,40,44,15]. For instance, Enkelmann [18] developed a coarse-to-fine algorithm for oriented smoothed optical flow. Also, Cohen and Herlin [15] recently used a multiresolution technique with nonuniform sampling for total variation regularization.
In the quest for optimal performance, multigrid schemes were developed to solve the resulting Euler-Lagrange equations for both isotropic and anisotropic regularizations; see [21,45,20,27,26,12,10] and references therein. The first attempts are due to Glazer [21] and Terzopoulos [45], using standard multigrid components. Improvements in performance were reported on simple synthetic images, and later on standard multigrids were stated in [42] to be inappropriate due to a possible information conflict between different scales. However, the method was recently better adapted to optical flow computation by tuning the multigrid components. In [20], an entirely algebraic multigrid approach was developed for a weighted anisotropic regularization based model. A geometric multigrid based on the Galerkin coarse grid discretization approximation was developed for the classical Horn-Schunck method in [27]. This algorithm was extended and parallelized in [26] for real time computation of vector motion fields for 3D images. Another recent geometric multigrid investigation was presented in [12,10] with anisotropic regularization (a full account of it can be found in [10]).
All these works have considered data terms based on the brightness constancy assumption, which lead to less accurate optical flow fields when the image sequence contains illumination variations in the temporal domain, as is often the case in real images. In [14], a model is proposed for illumination invariant optical flow computation, previously introduced in [17] in the context of image registration. The brightness constancy assumption is replaced by the assumption that the shapes of the image move along the sequence. In this context, the terms of the Euler-Lagrange equation corresponding to the data attachment term, which contain derivatives of the unit normal vector fields, are highly nonlinear. They do not produce systems of equations with a symmetric and positive semi-definite matrix (usually after linearization), which are the basic systems to which the previous multigrid methods for optical flow have been applied.
For that reason we follow in this paper the second strategy, discretize-optimize, which allows us to handle this kind of variational problem. Instead of computing the Euler-Lagrange equations of the energy model, discretizing and solving them, our approach is based on the use of numerical optimization methods to solve the discrete version of the energy, be it based on gray level constancy or on shape properties. This leads to the need of developing efficient algorithms for the numerical resolution of a large scale optimization problem. Therefore only large scale unconstrained optimization methods are relevant to solve this variational problem. One of them is the truncated Newton method [16]. It requires only the computation of function and gradient values and thus has suitable storage requirements for large-scale problems. However, due to the intensive computations required to solve the energy minimization problem, only multilevel versions of the method (multiresolution and multigrid) are expected to provide suitable performance.
In our context, the bidirectional multigrid scheme has to be adapted directly to an optimization problem. Research is still ongoing in this direction for solving variational problems lacking a governing PDE in different fields. Recently [38,29], the multigrid strategy has been extended to optimization problems solved by truncated Newton methods. Motivated by variational problems lacking a governing PDE, this multigrid optimization was derived from the Full Approximation Storage (FAS) scheme [7] for nonlinear PDEs, but applied directly in an optimization setting. With respect to nonlinear multigrid, the MG/OPT algorithm includes two safeguards that guarantee convergence: bounds on the coarse grid correction, which introduce a constrained optimization subproblem, and a line search that eventually scales the coarse grid correction treated as a search direction. In [29], it has been shown that the coarse grid subproblem is a first-order approximation of the fine grid problem, which somewhat justifies the introduction of the two safeguards. The first-order approximation suggests that the correction will be reliable only near the restricted approximation, and it relates MG/OPT to the steepest descent method. The latter connection indicates that the coarse grid correction will typically not be a well-scaled descent direction, which, in turn, implies that a line search should be performed to adjust the scale of the multigrid search direction. This may not be necessary when MG/OPT is near convergence, since in that case it provides a Newton-like search direction for which a step length equal to 1 will likely be accepted. These connections to both the steepest descent and Newton methods suggest that MG/OPT will perform well both far from and near the solution.
Our aim is to develop the MG/OPT method for the computation of optical flow, a problem with considerable computational resource requirements. To the best of our knowledge, the MG/OPT method is studied here for the first time for optical flow computation. Several components of the MG/OPT technique have been tuned for high efficiency, and the algorithm is fully evaluated with respect to one-way multiresolution optimization. The proposed numerical strategy can be adapted to the minimization of other nonlinear energy functionals, like the illumination invariant model proposed in [14], or to depth estimation in stereo problems. Although these are an important motivation for the development of the present techniques, they will not be considered in this paper.
The outline of our paper is as follows. In Section 2, we start off with a review of the variational formulation of the optical flow problem. In Section 3, we recall the basics of the truncated Newton method. In Section 4, we present multilevel algorithms applied to optimization problems: first, we discuss the coarse-to-fine multiresolution strategy; then, after recalling the idea of multigrid for linear systems, we describe its application to optimization-based problems. In Section 5, we outline some of the implementation details, namely the calculation of the objective functional for each of the four models considered in this paper, of its Hessian, and of the image derivatives. In Section 6, we report our experimental results on three classical sequences of synthetic images: the translating tree, the diverging tree and the Yosemite sequences. In Section 7, we conclude the paper and indicate future research directions.

Variational models for optical flow
Let us consider a sequence of gray level images I(t, x, y), t ∈ [0, T], (x, y) ∈ Q, where Q denotes the image domain, which we assume to be a rectangle in R². We shall consider either the case where t ∈ [0, T] or the case where the sequence is sampled at the times t_j = j∆t, j = 0, ..., K.
Assuming that the gray level of a point does not change over time, we may write the constraint I(t, x(t), y(t)) = I(0, x, y), where (x(t), y(t)) is the apparent trajectory of the point (x(0), y(0)) = (x, y). Differentiating with respect to t and denoting (u(t, x, y), v(t, x, y)) = (x'(t), y'(t)), we obtain the optical flow constraint

(2.1)  I_t + u I_x + v I_y = 0.

The vector field w(t, x, y) := (u(t, x, y), v(t, x, y)) is called optic flow, and I_t, I_x, I_y denote the partial derivatives of I with respect to t, x, y, respectively. Clearly, the single constraint (2.1) is not sufficient to uniquely compute the two components (u, v) of the optic flow (this is called the aperture problem); it only gives the component of the flow in the direction of the image gradient, i.e., normal to the level lines of the image. As usual, in order to recover a unique flow field, a regularization constraint is added. For that, we assume that the optic flow varies smoothly in space or, better, that it is piecewise smooth in Q. This can be achieved by including a smoothness term of the form

(2.2)  R(w) = ∫_Q G(∇u, ∇v) dx dy,

where the case G = |∇u|² + |∇v|² corresponds to the Horn-Schunck model [25], the case G = trace((∇w)^T D(∇I) ∇w) corresponds to the Nagel-Enkelmann model [36], and the cases G = √(|∇u|² + |∇v|²) or G = |∇u| + |∇v| correspond to total variation regularization models. For a full account of this and the associated taxonomy, we refer to [46,24].
Both the data attachment and regularization terms can be combined into a single energy functional

(2.3)  E(w) = ∫_Q (I_t + u I_x + v I_y)² dx dy + α ∫_Q G(∇u, ∇v) dx dy,

where α > 0 is the regularization parameter weighting the relative importance of both terms.
In case of illumination changes, the gray level constancy assumption is violated and may be substituted by the constancy of the gradient [5,41], which can be expressed in differential form as

(2.4)  ⟨w, ∇I_x⟩ = 0,  ⟨w, ∇I_y⟩ = 0.
Other cases include the constancy of the gradient direction [13] or the assumption that the shapes of the image (identified with its level lines) move along the sequence [14] (an assumption used in [17] in the context of image registration). Higher order derivative models have been studied in [41].
The above models (2.3), (2.4) do not take into account that video sequences are sampled in time, so that our data have the form I_j(x, y) := I(t_j, x, y), t_j = j∆t, j = 0, ..., K. Without loss of generality, let us assume that ∆t = 1. In that case, the gray level constancy may be expressed as

(2.5)  I_j(x + u_j(x, y), y + v_j(x, y)) = I_{j−1}(x, y),

where (u_j(x, y), v_j(x, y)) is the optical flow between the images I_{j−1} and I_j. As argued in [1,2], the linearized gray level constancy constraint (2.1) may not be a good approximation in case of large displacements, and the form (2.5) may then be more appropriate [10].
A corresponding energy functional can be obtained by combining the non-linearized form of the brightness constancy assumption with a regularization term. For convenience, we assume that we want to compute the optical flow between two images I_1(x, y) and I_2(x, y). We may write the energy

(2.6)  E(w) = ∫_Q Ψ((I_2(x + u(x, y), y + v(x, y)) − I_1(x, y))²) dx dy + α ∫_Q G(∇u, ∇v) dx dy,

where Ψ : R → R is an increasing smooth function and α > 0. Examples of the function G have been given above (see [46,24,10] for an account of the many different possibilities).
Observe that the energy (2.6) is nonlinear and non-convex. In order to develop the basic numerical optimization methods, the variational optical flow problem is set into the simplified minimization form

(2.7)  min_w f(w),  with f(w) := D(w) + α R(w),

where D denotes a given data term based on either (2.1) or (2.6), and R denotes a regularization term that is either quadratic or the total variation; the precise discretizations we consider in this paper are detailed in Section 5. As discussed in the introduction, we adopt the discretize-optimize strategy. This means that we first discretize the objective functional and then solve a finite dimensional but large scale optimization problem. Therefore, only large scale unconstrained optimization methods are relevant to solve this variational problem. One of these methods, which requires only the computation of function and gradient values and has suitable storage requirements for large-scale problems, is the truncated Newton method [16]. We note that other choices of the underlying optimization procedure, like the BFGS quasi-Newton method, are possible. The truncated Newton algorithm is embedded in a multilevel strategy, both multiresolution and multigrid.
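To make the discretize-optimize setting concrete, the following minimal NumPy sketch (our own illustration, not the paper's C implementation) evaluates the discretized objective f(w) = D(w) + αR(w) for the linear data term with quadratic regularization; `Ix`, `Iy`, `It` are assumed to be precomputed image-derivative arrays and `w` stacks the two flow components.

```python
import numpy as np

def hs_objective(w, Ix, Iy, It, alpha, h=1.0):
    """Discrete Horn-Schunck energy: quadratic data term plus quadratic regularizer."""
    n, m = Ix.shape
    u, v = w[:n * m].reshape(n, m), w[n * m:].reshape(n, m)
    data = ((Ix * u + Iy * v + It) ** 2).sum()           # linearized constancy term
    # Forward differences along both grid axes; the last difference is taken as zero.
    ux = np.diff(u, axis=1, append=u[:, -1:]) / h
    uy = np.diff(u, axis=0, append=u[-1:, :]) / h
    vx = np.diff(v, axis=1, append=v[:, -1:]) / h
    vy = np.diff(v, axis=0, append=v[-1:, :]) / h
    reg = (ux ** 2 + uy ** 2 + vx ** 2 + vy ** 2).sum()  # quadratic smoothness term
    return data + alpha * reg
```

A gradient routine with the same array layout is what the truncated Newton solver of the next section consumes.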

Truncated Newton optimization
As is well known, Newton methods are based on the second order Taylor approximation of the objective function to build an iterative process approaching a local minimum. The step s_k to move from the current point w_k to a new iterate w_{k+1} is chosen to be a minimizer of the quadratic model of f given by

(3.1)  q_k(s) = f(w_k) + g_k^T s + (1/2) s^T H_k s,

where g_k := ∇f(w_k) and H_k := ∇²f(w_k). The Newton step s_k is then obtained by solving the linear system

(3.2)  H_k s_k = −g_k.

For large-scale problems, solving this linear system exactly is very expensive. Truncated Newton (TN) methods instead use an iterative method to find an approximate solution of (3.2) and truncate the iterates as soon as a required accuracy is reached or whenever a negative curvature is detected (in case the Hessian matrix H_k is not positive definite). One of the best known iterative methods within TN is the Preconditioned Conjugate Gradient algorithm (PCG), see Algorithm 2, due to its efficiency and modest memory requirements.
In this context, we will refer to the process of finding the step s_k as the inner iterations, while the process of updating w_k using the computed s_k will be called the outer iterations. Our discussion in the sequel assumes the use of the PCG method as the inner solver.
Depending on how the step solution s_k of the quadratic model (3.1) is exploited, two broad classes of algorithms are distinguished: line search methods and trust region methods. Line search methods scale the step s_k by a factor α_k that approximately minimizes f along the line that passes through w_k in the direction s_k. On the other hand, trust region methods restrict the search for s_k to some region B_k around w_k in which the algorithm "trusts" that the model function q_k behaves like the objective function f. This paper focuses on the former, the line search method; the comparison between both approaches (line search and trust region) will be the object of future work.
There are three components of the truncated Newton method, related to speeding up the convergence of the inner iterations, that may have a great impact on its overall efficiency: a) the truncation criterion, b) the preconditioning strategy, and c) the handling of the case when the Hessian matrix is not positive definite. Indeed, for the latter, one of the advantages of trust region over line search is that negative curvature directions can be properly exploited. In the PCG algorithm (Algorithm 2), the inner iterations are truncated as soon as a negative curvature direction is detected, that is, when p_j^T H_k p_j is negative. In our case, we replace the negative curvature test with the equivalent descent direction test [48]; see lines 8-11 of Algorithm 2. In our practical experience, the descent direction test has a better numerical behavior than the negative curvature test.
For the other two components a) and b), we use a scaled two-step limited memory BFGS [37] with diagonal scaling for preconditioning the CG method, and we truncate the inner iterations when the norm of the PCG residual r_j at inner iteration j, measured in the metric induced by the preconditioning matrix M_k (both provided at outer iteration k), falls below a tolerance; this tolerance may be computed easily if M_k is updated using the BFGS method [39].
Line search TN methods ensure that the new TN step s_k provides a good descent by performing a line minimization along this direction, so that the new outer iterate becomes w_{k+1} = w_k + α_k s_k. Exact line search is avoided because of the expensive function evaluations; normally, a sufficient function decrease is obtained by imposing the Wolfe conditions:

f(w_k + α_k s_k) ≤ f(w_k) + c_1 α_k g_k^T s_k,
g(w_k + α_k s_k)^T s_k ≥ c_2 g_k^T s_k,

where 0 < c_1 < c_2 < 1. In order to obtain a maximum decrease with a minimum number of function evaluations, interpolating polynomials are usually employed; in our TN implementation a cubic interpolation is used [34].
The algorithm associated with the outer iterations of the line search TN is shown in Algorithm 1. The algorithm iteratively updates w_k by computing a search direction and performing a line search until a given tolerance is reached. In our case we set a tolerance ε_g on the gradient to detect local minima, and tolerances ε_f and ε_x on the function values and iterates to detect convergence. Note also that the preconditioning matrix M_k is updated here, which can be done easily from previous values of w_k and g(w_k), see [39].
Algorithm 1 Line Search Truncated Newton (outer iterations of TN)
1: for k = 0 to max_outer do
2:   if ||g(w_k)|| ≤ ε_g then
3:     exit with solution w_k
4:   end if
5:   // Compute a direct search direction
6:   Compute s_k by calling Algorithm 2
7:   // Update the iterate
8:   Perform a line search to scale the step s_k by α_k; set w_{k+1} = w_k + α_k s_k
9:   Update M_{k+1} by the BFGS formula
10:  if |f(w_{k+1}) − f(w_k)| ≤ ε_f or ||w_{k+1} − w_k|| ≤ ε_x then
11:    exit with solution w_{k+1}
12:  end if
13: end for

Algorithm 2 Preconditioned Conjugate Gradient (inner iterations of TN)
1: z_0 = 0, r_0 = −g_k, d_0 = M_k^{−1} r_0, p_0 = d_0
2: for j = 0 to max_inner do
3:   Compute q_j = p_j^T H_k p_j  // Hessian-vector product by finite differences
4:   if |q_j| is negligible then
5:     exit with s_k = z_j (for j = 0 take s_k = d_0)
6:   end if
7:   // Descent Direction Test replaces Negative Curvature test
8:   z_{j+1} = z_j + (r_j^T d_j / q_j) p_j
9:   if g_k^T z_{j+1} ≥ g_k^T z_j then
10:    exit with s_k = z_j (for j = 0 take s_k = d_0)
11:  end if
12:  r_{j+1} = r_j − (r_j^T d_j / q_j) H_k p_j
13:  d_{j+1} = M_k^{−1} r_{j+1}
14:  if the truncation criterion of Section 3 is satisfied then
15:    exit with s_k = z_{j+1}
16:  end if
17:  p_{j+1} = d_{j+1} + (r_{j+1}^T d_{j+1} / r_j^T d_j) p_j
18: end for
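As a compact illustration of how the outer and inner iterations interact, here is a hedged Python sketch of a Hessian-free truncated Newton solver: CG on the Newton equations with truncation on negative curvature, finite-difference Hessian-vector products, and a simple Armijo backtracking line search. It omits the preconditioner M_k and the descent direction test, so it is a simplification of the method above, not a reimplementation.

```python
import numpy as np

def tn_minimize(f, grad, w, max_outer=50, tol=1e-5, cg_max=20):
    """Line search truncated Newton (sketch): CG inner solve of H s = -g."""
    for _ in range(max_outer):
        g = grad(w)
        if np.linalg.norm(g) <= tol:
            break
        eps = np.sqrt(np.finfo(float).eps) / max(np.linalg.norm(w), 1.0)
        Hv = lambda p: (grad(w + eps * p) - g) / eps      # finite-difference H_k p
        s, r = np.zeros_like(w), -g.copy()                # CG on H s = -g
        p, rs = r.copy(), r @ r
        for _ in range(cg_max):
            Hp = Hv(p)
            curv = p @ Hp
            if curv <= 0:                                 # negative curvature: truncate
                if s @ g >= 0:
                    s = -g                                # steepest descent fallback
                break
            a = rs / curv
            s, r = s + a * p, r - a * Hp
            rs_new = r @ r
            if np.sqrt(rs_new) <= 0.1 * np.linalg.norm(g):  # truncation criterion
                break
            p, rs = r + (rs_new / rs) * p, rs_new
        alpha, f0, slope = 1.0, f(w), g @ s               # Armijo backtracking
        while f(w + alpha * s) > f0 + 1e-4 * alpha * slope and alpha > 1e-8:
            alpha *= 0.5
        w = w + alpha * s
    return w
```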

Multilevel methods
Multilevel methods use a set of subproblems defined on coarser levels of resolution in order to improve the convergence to the global optimum of (2.7), compared to using only one level of resolution. Using multiple levels of resolution allows one to deal, in the case of optical flow computation, with large displacements, and it favors convergence to the global minimum, since some local minima may disappear at sufficiently coarse resolutions.
Let us denote by Ω_i the image domain at level i, i = 0, ..., r, where i = 0 corresponds to the finest level of resolution and i = r to the coarsest one. The grid spacing at the coarser grid Ω_{i+1} is usually twice the spacing of the grid Ω_i. Two approaches are currently widely used in the computer vision field, namely multiresolution and multigrid methods. Both are explained below.

4.1. Multiresolution methods. Multiresolution methods have been applied successfully to many computer vision and image analysis problems that can be expressed as a global optimization problem. They use a series of coarse-to-fine resolution levels to obtain an estimate of the solution. An initial estimate is obtained at the coarsest level; in our case, this estimate may be obtained by applying the TN algorithm (see Algorithm 1). The estimate is then extended to the next level of resolution, where it is refined. This process is repeated until the finest level is reached, where the final estimate is obtained; see Algorithm 3, where the MR (multiresolution) method is shown. This function is called as MR(L − 1, x_{L−1,0}), where L is the number of resolution levels and x_{L−1,0} is the initial estimate on the coarsest level.

Algorithm 3 Multiresolution method
1: function MR(i, x_{i,0})
2: i: level (initially the coarsest, L − 1)
3: x_{i,0}: initial estimate
4: repeat
5:   Apply TN with initial estimate x_{i,0} to obtain x_i
6:   // Prolong current estimate to next finer level
7:   if i > 0 then x_{i−1,0} = P x_i and i = i − 1
8: until the finest level i = 0 has been solved
9: return x_0
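A hedged sketch of this coarse-to-fine loop is given below; `solve_level` stands for the per-level TN solver and the pyramid is built by plain down-scaling, so the details (sampling, boundary handling) are illustrative rather than the paper's exact choices.

```python
import numpy as np
from scipy.ndimage import zoom

def multiresolution_flow(I1, I2, levels, solve_level):
    """Coarse-to-fine multiresolution (Algorithm 3, sketch): solve on the
    coarsest grid, then prolong the flow as the initial guess at each finer level."""
    pyr = [(I1, I2)]                                 # level 0 = finest
    for _ in range(1, levels):
        a, b = pyr[-1]
        pyr.append((zoom(a, 0.5), zoom(b, 0.5)))
    w = np.zeros((2,) + pyr[-1][0].shape)            # zero flow on the coarsest grid
    for i in range(levels - 1, -1, -1):
        a, b = pyr[i]
        w = solve_level(a, b, w)                     # e.g. truncated Newton at level i
        if i > 0:
            fy = pyr[i - 1][0].shape[0] / a.shape[0]
            fx = pyr[i - 1][0].shape[1] / a.shape[1]
            # Prolong; flow vectors are doubled because the grid spacing halves.
            w = 2.0 * np.stack([zoom(w[0], (fy, fx)), zoom(w[1], (fy, fx))])
    return w
```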
The advantage of coarse-to-fine multiresolution is that a good initial guess for the finest grid problem may be obtained by estimating a solution on the coarser grids. However, for linear systems it is well known that one-way multilevel methods do not reach the optimal efficiency of standard bidirectional multigrid methods, which are detailed in the next section.

4.2. Multigrid methods. Multigrid methods were originally developed to solve elliptic PDEs and are at present known to be among the most powerful numerical methods for improving the computational efficiency of a wide class of equations. For a classical reference on multigrid, we refer to [7,8]. The main idea of multigrid algorithms is based on the observation that different frequencies are present in the error of the solution of the finest grid problem. Some algorithms, called smoothers (such as Gauss-Seidel), are known to efficiently reduce the high frequency components of the error on a grid (in other words, the components whose "wavelength" is comparable to the grid's mesh size). However, these algorithms have a small effect on the low frequency error components. This is the reason why the application of schemes like Gauss-Seidel to solve a problem on a given grid effectively reduces the error in the first iterations of the procedure (due to the smoothing of the high frequency errors) but then converges slowly to the solution (because the low frequency error components are barely reduced).
One may observe that low frequency errors appear as higher frequencies on coarser grids, where they may be effectively reduced using a smoother. Moreover, smoothing at a coarser level typically has a much lower cost than at finer ones. The core idea of multigrid is to use a sequence of subproblems defined on coarser grids as a means to accelerate the solution process (by relaxation such as Gauss-Seidel) on the finest grid. This leads to recursive algorithms like the so-called V- or W-cycle, which traverse between fine and coarse grids in the mesh hierarchy. Since ultimately only a small number of relaxation steps must be performed on each level, multigrid provides an asymptotically optimal method whose complexity is only O(N), where N is the number of mesh points.

4.2.1. Multigrid for linear systems. We first recall the basics of multigrid techniques. We consider a sparse linear system

(4.1)  A_i x_i = b_i,

which typically results from the discretization of a partial differential equation on a fine grid Ω_i with a given grid spacing h.
Let x_{i,0} be an initial approximation to (4.1) and x_i the exact solution. The first step of the algorithm consists in smoothing the error x_i − x_{i,0} by applying N_0 iterations of a relaxation scheme S to (4.1) that has the smoothing property [23]; examples of such smoothers S are the Richardson, Jacobi and Gauss-Seidel iterations. The obtained smooth approximation x_{i,1} satisfies equation (4.1) up to a residual r_i:

A_i x_{i,1} = b_i − r_i,  i.e.,  r_i = b_i − A_i x_{i,1}.

The corresponding error equation is therefore given by

(4.2)  A_i e_i = r_i,

where e_i = x_i − x_{i,1} is the unknown smooth error. Equation (4.2) is then solved on a coarser grid Ω_{i+1}, whose grid spacing has to be larger than h (the typical choice is 2h, but other choices are possible). For this, the fine grid error equation must be approximated by a coarse grid equation:

(4.3)  A_{i+1} e_{i+1} = r_{i+1}.

We therefore need to transfer the residual to the coarse grid and construct a coarse version of the fine matrix. Let R be a restriction operator mapping functions on Ω_i to functions on Ω_{i+1} (common examples are injection and full weighting, see [8]). The coarse residual is then given by

r_{i+1} = R r_i.

The coarse grid matrix may be obtained by re-discretization on Ω_{i+1} or by using the Galerkin coarse grid approximation

A_{i+1} = R A_i P.

Here P is a prolongation (or interpolation) operator mapping functions from Ω_{i+1} to Ω_i (standard examples are linear and bilinear interpolation).
The result e_{i+1,*} of solving (4.3) is transferred back to the fine grid to obtain e_{i,*} = P e_{i+1,*}.
With this approximation of the fine grid error, the approximate solution is corrected to obtain

x_{i,2} = x_{i,1} + e_{i,*}.

In order to damp the high frequency error components that might arise due to the interpolation, N_1 iterations of the smoother S are applied on the fine grid to get the new iterate x_{i,3}. The coarse grid problem (4.3) has a lower dimension than the original problem (4.1), but solving it accurately at each iteration can be very costly. In a typical multigrid method with three levels or more, this problem is solved by calling the algorithm recursively γ times until the coarsest grid is reached, where (4.3) is solved exactly at negligible cost. The steps of this multigrid algorithm are summarized in Algorithm 4.
The resulting x_{i,3} may be injected iteratively as the initialization of Algorithm 4 until the residual on the finest grid is smaller than a given tolerance.
Note that for the numbers of smoothing steps N_0 and N_1, only the sum N_0 + N_1 matters in a convergence analysis. Typical values for the tuple (N_0, N_1) are (1, 1), (2, 1) and (2, 2). For the cycling parameter γ, only the values 1 and 2 are commonly used.
Algorithm 4 Linear Multigrid V-Cycle
1: function MG(i, A_i, b_i, x_{i,0})
2: i: level
3: A_i, b_i: defines the linear system to be solved
4: x_{i,0}: initialization
5: if i is the coarsest level then
6:   Solve A_i x_i = b_i exactly and return x_i
7: end if
8: Apply N_0 iterations of the smoother S to obtain x_{i,1}  // pre-smoothing
9: r_{i+1} = R (b_i − A_i x_{i,1})  // restrict the residual
10: x_{i+1,0} = 0
11: Call MG(i + 1, A_{i+1}, r_{i+1}, x_{i+1,0}) γ times
12: Let e_{i+1,*} be the returned value
13: e_{i,*} = P e_{i+1,*}
14: x_{i,2} := x_{i,1} + e_{i,*}  // Apply correction step
15: Apply N_1 iterations of the smoother S to obtain x_{i,3}  // post-smoothing
16: return x_{i,3}

The so-called V-cycle corresponds to γ = 1 and the W-cycle to γ = 2. This is illustrated in Figure 4.1 for three levels.
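The following self-contained sketch shows the same V-cycle pattern on the simplest possible example, a 1-D Poisson system with weighted-Jacobi smoothing (grid sizes of the form 2^k − 1 are assumed); it is meant to illustrate Algorithm 4, not the optical flow systems used later.

```python
import numpy as np

def vcycle(b, x, n_pre=2, n_post=1):
    """One multigrid V-cycle (gamma = 1) for -u'' = b, zero boundary values."""
    n = b.size
    h2 = 1.0 / (n + 1) ** 2
    def smooth(x, iters):                              # weighted-Jacobi smoother
        for _ in range(iters):
            xp = np.pad(x, 1)
            x = x + (2.0 / 3.0) * (0.5 * (h2 * b + xp[:-2] + xp[2:]) - x)
        return x
    if n <= 3:
        return smooth(x, 100)                          # cheap "exact" coarsest solve
    x = smooth(x, n_pre)                               # pre-smoothing
    xp = np.pad(x, 1)
    r = b - (2 * x - xp[:-2] - xp[2:]) / h2            # residual r = b - A x
    rc = 0.25 * (r[:-2:2] + 2 * r[1:-1:2] + r[2::2])   # full-weighting restriction
    ec = vcycle(rc, np.zeros_like(rc), n_pre, n_post)  # coarse error equation
    e = np.zeros(n)
    e[1::2] = ec                                       # prolongation: copy at coarse points,
    ecp = np.pad(ec, 1)
    e[0::2] = 0.5 * (ecp[:-1] + ecp[1:])               # linear interpolation in between
    return smooth(x + e, n_post)                       # correction + post-smoothing
```

Iterating `x = vcycle(b, x)` reduces the residual by a roughly constant factor per cycle, independently of the grid size, which is the behavior the O(N) complexity claim above refers to.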
An important variation of multigrid, known as the full multigrid method (FMG) [7] or nested iteration technique [23], combines a multiresolution approach with a standard multigrid cycle. The FMG starts at the coarsest grid level, solves a very low-dimensional problem, extends the solution to a finer space, performs a multigrid cycle, and repeats the process until a multigrid cycle is performed on the finest grid level. In this way, a good initialization is obtained to start the multigrid cycle on the finest level, which usually reduces the total number of iterations required. The method is illustrated in Figure 4.1 for three levels using a V-cycle.

Figure 4.1. V-cycle, W-cycle and FMG with V-cycle.

4.2.2. Multigrid for optimization problems. As commented previously, the multigrid strategy has recently been extended to optimization problems for both line search and trust region methods. Previous works on nonlinear multigrid methods applied the techniques directly to the systems of nonlinear equations obtained from the first-order optimality conditions. This approach is not suitable for problems that are not easily transformed into systems of nonlinear equations. Another possible way to use multigrid for optimization problems is to apply linear multigrid in the inner iterations to solve the linear system (3.2) for a given iterate w_k. This idea implicitly assumes that the Newton method is the underlying optimization algorithm and that the Hessian matrices can be computed explicitly. Since many large-scale optimization algorithms only require the computation of function and gradient values, it is less obvious how such a multigrid algorithm can be applied. Moreover, solving the Newton equations using multigrid may only lead to a better unilevel optimization algorithm, as long as the multigrid technique is not applied to the outer iterations.
We deal here with a multigrid algorithm that works directly on the optimization problem, which enables us to handle problems that are not easily transformed into systems of nonlinear equations. The resulting multigrid line search optimization (MG/OPT) strategy is described in the following.
As with multigrid for linear systems, in order to solve the optimization problem (2.7) on the original (finest) grid level i = 0, a sequence of optimization subproblems is considered on nested coarser grids. Given a grid level i ≥ 0, let f_i denote a representation of the objective function f on this level, and let w_{i,0} be an initial approximation to the optimization problem at level i. On the finest level i = 0, the optimization problem corresponds to the minimization of f_0, the finest representation of the objective function f. On a coarser level i, however, the optimization problem corresponds to the minimization of a function h_i that is specified below. The first step of the multigrid procedure, called the pre-optimization phase (by analogy with pre-smoothing in linear multigrid), consists in applying N_0 iterations of an optimization procedure such as truncated Newton (in our case line search TN) to h_i to obtain w_{i,N_0}. As in nonlinear multigrid, this w_{i,N_0} is transferred to the coarser grid to obtain w_{i+1,0} := R w_{i,N_0}. The residual at this level is given by

v_{i+1} := ∇f_{i+1}(w_{i+1,0}) − R ∇h_i(w_{i,N_0}),

and

(4.4)  h_{i+1}(w) := f_{i+1}(w) − v_{i+1}^T w

is the function to be minimized on the coarse grid level i + 1. We take h_0 := f_0 = f on the finest level. Assume that w_{i+1,*} is a solution of the optimization of (4.4). The difference between w_{i+1,*} and the initial approximation w_{i+1,0} is called the coarse grid correction. This correction is extended back to level i,

s_{i,N_0} = P(w_{i+1,*} − w_{i+1,0}).
In an optimization context, this correction step, used to update the current solution w_{i,N_0} to w_{i,N_0+1}, is considered as a search direction called recursive, to distinguish it from the direct search direction computed by a given optimization procedure on the same grid level.
Finally, in order to remove the oscillatory components that may have been introduced by the correction step, one may finish with a post-optimization phase, applying N_1 iterations of the optimization procedure (in our work, line search TN) to h_i with initial guess w_{i,N_0+1}, obtaining w_{i,N_0+N_1+1}.
The coarse optimization subproblem for h_{i+1} can be seen as a first order approximation of the fine grid problem for h_i, since their gradients coincide at w_{i+1,0}:

∇h_{i+1}(w_{i+1,0}) = R ∇h_i(w_{i,N_0}).

The first-order approximation suggests that the correction will be reliable only near the restricted approximation, and it relates the multigrid optimization algorithm to the steepest descent method. The latter connection indicates that the coarse grid correction will typically not be a well-scaled descent direction, which, in turn, implies that a line search should be performed to adjust the scale of the recursive search direction. However, it can be shown that the multigrid algorithm is also related to Newton's method, in the sense that the recursive search direction s_i is an approximate Newton direction [29]. Accordingly, in order to improve computational efficiency, the line search for a recursive direction step s_i is performed only if the unit step α_k = 1 does not reduce the value of h_i. That is, if h_i(w_{i,k} + s_i) < h_i(w_{i,k}), we update w_{i,k+1} = w_{i,k} + s_i; otherwise a line search is performed.
In [29], bound constraints are proposed for the optimization subproblem. In the context of our work, we have seen that bound constraints may improve the robustness of the MG/OPT algorithm. These bound constraints may be implemented by means of active sets [39]. However, in our case we have found that we do not need to set up such bounds, since the line search TN algorithm used to optimize the subproblem h_{i+1} already provides a similar safeguard: line search algorithms restrict the search at each iteration w_{i+1,k} to an upper bound that depends on the gradient norm, thus ensuring that the update w_{i+1,k+1} is not far away from w_{i+1,k}.
We have implemented the MG/OPT algorithm using a full multigrid method (FMG) to solve the problem; the resulting algorithm will be denoted by FMG/OPT. As in the linear case, FMG/OPT starts at the coarsest grid level, where enough TN iterations are performed, and prolongs the solution to the next finer level i, where V_i iterations of the MG/OPT cycle are performed.
Algorithm 5 shows the V-cycle algorithm used for FMG/OPT. At each iteration, the algorithm computes a step s_i either directly, using the inner iterations of the TN method (Algorithm 2) on the current level, or recursively by means of the multigrid strategy. However, as noted in [47,22], the recursive call is useful only if the restriction of the current gradient to the coarser grid remains significant. Thus we restrict the use of a coarser level i + 1 to the case where

(4.5)  ||R ∇h_i(w_{i,k})|| ≥ κ_g ||∇h_i(w_{i,k})||  and  ||R ∇h_i(w_{i,k})|| > ε_g,

for some constant κ_g ∈ (0, min(1, ||R||)), where ε_g ∈ (0, 1) is a measure of first order criticality for h_{i+1}. The latter condition is easy to check before trying to compute a step at level i + 1.
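For concreteness, here is a hedged two-level sketch of the MG/OPT correction just described: the coarse model is the coarse objective shifted by a linear term so that its gradient matches the restricted fine gradient, the prolonged correction is treated as a search direction, the unit step is tried first, and a backtracking line search is the fallback. All names (`R`, `P`, `coarse_solver`, ...) are placeholders, and the pre-/post-optimization phases are left out.

```python
import numpy as np

def mgopt_step(f_fine, g_fine, f_coarse, g_coarse, R, P, w, coarse_solver):
    """One two-level MG/OPT correction (sketch of the recursive direction)."""
    wc0 = R(w)                                   # restrict the current fine iterate
    v = g_coarse(wc0) - R(g_fine(w))             # shift enforcing gradient matching
    h = lambda wc: f_coarse(wc) - v @ wc         # coarse model h_{i+1}, cf. (4.4)
    gh = lambda wc: g_coarse(wc) - v             # grad h_{i+1}(wc0) = R grad f(w)
    wc = coarse_solver(h, gh, wc0)               # (approximately) minimize h_{i+1}
    s = P(wc - wc0)                              # prolonged coarse grid correction
    if f_fine(w + s) < f_fine(w):                # unit step accepted if it decreases f
        return w + s
    alpha, f0 = 0.5, f_fine(w)                   # otherwise: backtracking line search
    while f_fine(w + alpha * s) >= f0 and alpha > 1e-8:
        alpha *= 0.5
    return w + alpha * s
```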

Implementation issues
The numerical algorithms line search TN, MR/OPT and FMG/OPT have been implemented in C using the MegaWave2 library. In this section, we provide details about the computation of the derivatives of both the objective function and the image.
5.1. Functional gradient calculation. The gradient of the objective function in (2.7) is calculated analytically and is given by

∇f(w) = ∇D(w) + α ∇R(w).

5.1.1. Horn-Schunck data term. For the linear data term we have

D(u, v) = Σ_{i,j} ψ(I_x u_{i,j} + I_y v_{i,j} + I_t),

where i (respectively j) corresponds to the discrete column (respectively row) of the image, the coordinate origin being located at the top-left corner of the image. The function ψ is used to enhance robustness with respect to outliers. In our work we have used the truncated quadratic

ψ(x) = x² if |x| ≤ γ,  ψ(x) = γ² otherwise,

where γ is a given threshold. The gradient of D for |x| ≤ γ, with x := I_x u_{i,j} + I_y v_{i,j} + I_t, is therefore given by

(D_{u_{i,j}}, D_{v_{i,j}}) = 2 x (I_x, I_y),

where D_{u_{i,j}} and D_{v_{i,j}} refer to the partial derivatives of D(u, v) with respect to the variables u_{i,j} and v_{i,j}, respectively. Here I_x, I_y, I_t are the spatial and temporal image derivatives, whose computation is explained in Section 5.3. Note that for |x| > γ the gradient of D is (D_{u_{i,j}}, D_{v_{i,j}})^T = (0, 0)^T.
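A short NumPy sketch of this robust data term and its gradient, with the outlier threshold handled as described above (zero gradient outside the threshold), could look as follows; the truncated-quadratic ψ is our reading of the text.

```python
import numpy as np

def linear_data_term(u, v, Ix, Iy, It, gamma):
    """Robust Horn-Schunck data term with truncated quadratic psi (sketch)."""
    x = Ix * u + Iy * v + It                      # pointwise constraint residual
    inlier = np.abs(x) <= gamma
    D = np.where(inlier, x ** 2, gamma ** 2).sum()
    Du = np.where(inlier, 2.0 * x * Ix, 0.0)      # dD/du_{ij}; zero on outliers
    Dv = np.where(inlier, 2.0 * x * Iy, 0.0)      # dD/dv_{ij}
    return D, Du, Dv
```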
Algorithm 5 The V-cycle for the FMG/OPT algorithm
1: function MG/OPT-cycle(i, h_i, w_{i,0})
2: i: level
3: w_{i,0}: initial approximation to the minimizer of h_i
4: if i is the coarsest level then
5:   Apply TN iterations to h_i starting from w_{i,0} and return the solution
6: end if
7: // Pre-optimization
8: Apply N_0 iterations of Algorithm 1 to h_i with initial guess w_{i,0}, obtaining w_{i,N_0}
9: // Recursive call: the coarse grid correction as a search direction
10: if condition (4.5) holds then
11:   w_{i+1,0} = R w_{i,N_0}; form h_{i+1} according to (4.4)
12:   Call MG/OPT-cycle(i + 1, h_{i+1}, w_{i+1,0}); let w_{i+1,*} be the returned solution
13:   s_i = P(w_{i+1,*} − w_{i+1,0})
14:   if h_i(w_{i,N_0} + s_i) < h_i(w_{i,N_0}) then
15:     w_{i,N_0+1} = w_{i,N_0} + s_i
16:   else
17:     Perform a line search to scale the step s_i by α_k and set w_{i,N_0+1} = w_{i,N_0} + α_k s_i
18:   end if
19: else
20:   w_{i,N_0+1} = w_{i,N_0}
21: end if
22: // Post-optimization
23: Apply N_1 iterations of Algorithm 1 to h_i with initial guess w_{i,N_0+1}, obtaining w_{i,N_0+N_1+1}
24: return w_{i,N_0+N_1+1}

5.1.2. Intensity constancy based data term. The nonlinear data term based on the constancy assumption is as follows:

D(u, v) = Σ_{i,j} ψ(I_2(i + u_{i,j}, j + v_{i,j}) − I_1(i, j)).

The gradient of this functional for |x| ≤ γ, with x := I_2(i + u_{i,j}, j + v_{i,j}) − I_1(i, j), is given by

(D_{u_{i,j}}, D_{v_{i,j}}) = 2 x (∂_x I_2, ∂_y I_2)(i + u_{i,j}, j + v_{i,j}).

5.1.3. Quadratic regularization term. We now consider the regularization term, starting with the quadratic functional. We have

R(u, v) = Σ_{i,j} ((u_x^{i,j})² + (u_y^{i,j})² + (v_x^{i,j})² + (v_y^{i,j})²),

where the partial derivatives of u, v are computed by forward finite differences with a discretization step h, that is, u_x^{i,j} = h^{−1}(u_{i+1,j} − u_{i,j}) and u_y^{i,j} = h^{−1}(u_{i,j+1} − u_{i,j}) (the derivatives v_x and v_y are computed similarly). The gradient of this functional is obtained, at interior points, as

R_{u_{i,j}} = 2 h^{−1} (u_x^{i−1,j} − u_x^{i,j} + u_y^{i,j−1} − u_y^{i,j}),

with the corresponding expression for R_{v_{i,j}} and the obvious modifications at the image boundary.

5.1.4. Total variation term. In this case we take

R(u, v) = Σ_{i,j} (|∇u_{i,j}| + |∇v_{i,j}|).

To overcome the problem of non-differentiability of the total variation, a widespread technique consists in approximating R by a differentiable function:

R_µ(u, v) = Σ_{i,j} (|∇u_{i,j}|_µ + |∇v_{i,j}|_µ),  with |∇u|_µ = √((u_x)² + (u_y)² + µ),

where µ is a small positive parameter. Using again forward finite differences, the gradient of this approximation is given by

R_{u_{i,j}} = h^{−1} ( u_x^{i−1,j} / |∇u_{i−1,j}|_µ − (u_x^{i,j} + u_y^{i,j}) / |∇u_{i,j}|_µ + u_y^{i,j−1} / |∇u_{i,j−1}|_µ ),

and similarly for R_{v_{i,j}}. An alternative, theoretically twice as accurate, approximation of the total variation can also be used; numerically, in our implementation, both approximations lead to the same results.

5.2. Hessian calculation. For computing the Newton direction in the truncated Newton method, the linear conjugate gradient is a Hessian-free procedure (see Algorithm 2) and only needs a routine for computing the product of the Hessian with the direction p. This matrix-vector product is computed via forward finite differences:

H_k p ≈ (g(w_k + ε p) − g(w_k)) / ε,

where ε is chosen to be the square root of the machine precision divided by the norm of w.
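In code, this Hessian-free product is a one-liner around the gradient routine; the step size below follows the choice stated above (square root of the machine precision divided by the norm of w).

```python
import numpy as np

def hessian_vec(grad, w, p, g_w=None):
    """Approximate H(w) @ p by a forward difference of the gradient (sketch)."""
    if g_w is None:
        g_w = grad(w)                             # can be reused across CG iterations
    eps = np.sqrt(np.finfo(float).eps) / max(np.linalg.norm(w), 1e-12)
    return (grad(w + eps * p) - g_w) / eps
```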

5.3. Image gradient calculation. Differentiation is an ill-posed problem [5], and regularization may be used to obtain good numerical derivatives. Such regularization may be accomplished with a low-pass filter such as the Gaussian, and is essential for motion estimation [4,43]. More recently, [19] proposed using a matched pair of low-pass and differentiation filters as a gradient operator; these are the filters used in this work. Usually, the derivative I_x at an integer point (i, j) is computed by applying a separable filter composed of a Gaussian filter (alternatively, a matched low-pass filter) in the y direction and the derivative of the Gaussian (alternatively, a matched derivative filter) in the x direction. Conversely, the derivative in the y direction at an integer point is computed by applying a separable filter composed of a Gaussian (or matched low-pass filter) in the x direction and the derivative of the Gaussian (or a matched derivative filter) in the y direction.
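With a Gaussian pair (rather than the matched filters of [19]), this separable low-pass/derivative scheme is a one-liner in SciPy; we show it only to fix ideas, as the paper's actual filters come from [19].

```python
from scipy.ndimage import gaussian_filter

def image_gradient(I, sigma=1.0):
    """Regularized image gradient via separable Gaussian-derivative filters (sketch)."""
    Ix = gaussian_filter(I, sigma, order=(0, 1))  # smooth along axis 0, derive along axis 1
    Iy = gaussian_filter(I, sigma, order=(1, 0))  # derive along axis 0, smooth along axis 1
    return Ix, Iy
```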
For motion estimation applications it may be necessary to compute the gradient at non-integer points, since non-integer displacements are allowed. This is the case, for instance, for the nonlinear data term of Section 5.1.2. In such cases, a simple way to proceed is a two-step process: first, the original image is interpolated at the required points using bilinear interpolation or an interpolation kernel such as [28], and then the derivative is computed from the interpolated values. Another way to proceed is to first compute the gradient at integer points and then apply an interpolation kernel to these gradient values. Both procedures are theoretically equivalent, since differentiation and interpolation are linear operators (and thus interchangeable). We call this approach linear gradient interpolation.
In this work, the computation of the gradient at non-integer points is done by means of a shift in the Fourier domain. Moreover, rather than shifting the image (or gradient) to obtain the interpolated values, the taps of the derivative filter are shifted so as to obtain the filter coefficients that have to be applied to the integer image values in order to obtain the corresponding interpolated gradient value.
Assume that the gradient and interpolation kernels are linear, shift-invariant and separable. Such kernels may be found in [19]. Without loss of generality, the interpolation problem in the Fourier domain can thus be restricted to one dimension. Let us consider Figure 5.1, which illustrates the proposed technique. On top, several samples of a one-dimensional signal are shown. Filter taps A are used to obtain the gradient at integer points (in the example, filter A is centered at x = 4), whereas filter taps B may be used to obtain the gradient at any non-integer point half-way between two integer points (in the example, filter B is centered at x = 4.5). The filter taps B are obtained from filter taps A by means of a shift of 0.5 in the Fourier domain. They can thus be applied directly to the original data. We call this procedure Fourier gradient interpolation.
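The tap-shifting step itself is a direct application of the Fourier shift theorem; a hedged one-dimensional version (for odd-length filters, where no Nyquist-bin care is needed) is:

```python
import numpy as np

def shift_taps(taps, delta):
    """Shift a 1-D filter's taps by a sub-pixel offset via the Fourier shift theorem."""
    F = np.fft.fft(taps)
    k = np.fft.fftfreq(taps.size)                 # frequency in cycles per sample
    return np.real(np.fft.ifft(F * np.exp(-2j * np.pi * k * delta)))

# e.g. taps_B = shift_taps(taps_A, 0.5) evaluates the derivative half-way
# between two integer sample positions (the sign convention sets the direction).
```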
Figure 5.2 shows an example in which a set of matched filters [19] of size 9 is interpolated at non-integer points, both by a shift in the Fourier domain and by linear interpolation. Performing a linear interpolation of the filter taps in order to apply them to the original data values is equivalent to the linear gradient interpolation approach described above. Note that, as expected, the filter taps obtained by the two methods differ. The experimental section will show that Fourier based interpolation leads to better performance than linear interpolation, especially in the multigrid approach.
In the two-dimensional case, if I_x has to be computed at a point (i + ∆i, j + ∆j), where (i, j) is an integer discrete image position and ∆i < 1 and ∆j < 1, the matched low-pass filter in the y direction (respectively the matched derivative filter in the x direction) is obtained by shifting the corresponding taps by ∆i (respectively ∆j) in the Fourier domain. A similar procedure is used to compute I_y at a non-integer point.
The previous scheme entails a high computational load if the Fourier shift has to be applied at every non-integer position where the gradient must be interpolated. The computational load can be reduced significantly if, at the initialization of the optimization algorithm, the gradient is computed with the Fourier gradient interpolation at all points of a grid of size h/D. Then, each time the gradient has to be computed at a non-integer point, a bilinear interpolation from the neighboring samples of this finer grid is used. Figure 5.2 shows the case D = 10.

Experimental results
To assess the performance of the proposed algorithms, we use three classical sequences of synthetic images representing scenes of various complexity, namely the translating tree, the diverging tree and the Yosemite sequences. The reference frame and the corresponding ground truth of the synthetic sequences are shown in Figures 6.1 and 6.2. The image size of the tree sequences is 150 × 150, while the Yosemite sequence is of size 316 × 252.
Since we are mainly interested in the computational complexity of the proposed algorithms, we first compare the CPU time needed by each numerical algorithm to reach a similar accuracy using four optical flow models. All tests are performed on a PC with a 2.0 GHz Mobile Intel DualCore processor and 1 GB RAM. We also compute the number of function and gradient evaluations performed by each algorithm to reach the estimated optical flow. We measure the overall number of function evaluations as

N_f = Σ_{i=0}^{L−1} N_{f,i} / F_i,

where N_{f,i} is the number of function evaluations performed by the optimization algorithm at resolution level i. For the TN and our MR/OPT and FMG/OPT algorithms, the function is evaluated during the line search procedure. F_i is the mesh resolution ratio of a given level i with respect to the finest level 0; in this work, F_i = 2^{2i}. The number of gradient evaluations is defined in a similar manner. While the function is evaluated only during the line search procedure, the gradient is additionally evaluated during the inner iterations of the TN algorithm (i.e., in the Hessian-vector products). Thus, we expect the number of function evaluations to always be lower than the number of gradient evaluations. Note also that the number of gradient evaluations matters more for the overall computational work than that of function evaluations: one gradient evaluation is approximately equivalent to two function evaluations when using quadratic regularization, and to almost three in the case of TV regularization. Moreover, we compare the quality of the optical flow estimates of the numerical algorithms. For this, we measure the average angular error (AAE) and the standard deviation (STD) of the estimated flow w_e with respect to the ground truth w_c. For a given pixel (i, j), the angular error (AE) between the ground truth motion vector w^c_{i,j} = (u^c_{i,j}, v^c_{i,j}) and the estimated flow w^e_{i,j} = (u^e_{i,j}, v^e_{i,j}) is computed as

AE(w^c_{i,j}, w^e_{i,j}) = arccos( (u^c_{i,j} u^e_{i,j} + v^c_{i,j} v^e_{i,j} + 1) / (√((u^c_{i,j})² + (v^c_{i,j})² + 1) √((u^e_{i,j})² + (v^e_{i,j})² + 1)) ).

The average angular error is the mean of the angular error over all N_np pixels of the image:

AAE(w_c, w_e) = (1/N_np) Σ_{i,j} AE(w^c_{i,j}, w^e_{i,j}).
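These error measures are straightforward to compute; a sketch using the standard 3-D vector extension of the flow is given below (the STD is defined right after).

```python
import numpy as np

def angular_error_stats(uc, vc, ue, ve):
    """Average angular error (AAE) and its standard deviation, in degrees."""
    num = uc * ue + vc * ve + 1.0
    den = np.sqrt(uc ** 2 + vc ** 2 + 1.0) * np.sqrt(ue ** 2 + ve ** 2 + 1.0)
    ae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    return ae.mean(), ae.std()
```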
The standard deviation is computed as

STD(w_c, w_e) = √( (1/N_np) Σ_{i,j} ( AE(w^c_{i,j}, w^e_{i,j}) − AAE(w_c, w_e) )² ).

Before going through the performance evaluation of the multilevel optimization algorithms, we first demonstrate the competitiveness of the adopted discretize-optimize approach (numerical optimization) versus the standard optimize-discretize approach (Gauss-Seidel), and also justify the selection of the chosen numerical optimization algorithm. To this end, we compare the CPU time needed to reach a similar AAE by the following algorithms: 1) the proposed line search two-step preconditioned truncated Newton approach (see Section 3), called here TN1; 2) the line search quasi-Newton L-BFGS approach of [30], called here QN. In the latter, a positive-definite approximation to the Hessian is obtained at each iteration by storing the previous steps and gradients and applying the BFGS formula. Instead of solving the Newton equations at each outer iteration, the L-BFGS method takes advantage of the fact that the resulting BFGS matrix is easily invertible; thus, in the L-BFGS approach the step computation of Algorithm 1 (line 6) is substituted by

s_k = −M_k g_k,

where M_k is the L-BFGS approximation of the inverse Hessian. The proposed algorithm TN1 is also compared against 3) a line search L-BFGS preconditioned truncated Newton method, called here TN2. This corresponds to the same algorithm as TN1, but the L-BFGS preconditioning approach of [30] is used instead of the two-step BFGS approach of [37]. It should be noted that these three numerical optimization algorithms (TN1, TN2 and QN) use the same line search procedure [34]. The algorithms were compared using the linear data term and quadratic regularization, which corresponds to the classical Horn-Schunck model; accordingly, we also implemented a Gauss-Seidel Horn-Schunck solver [25], which we call GS. The experimental results for the four algorithms (GS, TN1, TN2 and QN) are shown in Table 6.1. The best results for QN were obtained using 4 steps, and TN2 was set up with the same number of steps for the preconditioner. As expected from a first-order stationary iteration, GS reduces the error quickly, in less than one second, but then gets stuck and does not reach the optimal solution. Overall, the truncated Newton algorithms perform better, justifying the choice of TN as the underlying optimization smoother within the multilevel algorithms to determine direct search directions. TN1 and TN2 perform almost the same, with a slightly better performance of TN1; this algorithm will be used in the sequel as the smoother and will be denoted by OPT.

We now report the performance evaluation of the unilevel, multiresolution and multigrid optimization algorithms applied to estimate the optical flow between two successive frames of the above three sequences. We have considered the four optical flow models described in Section 2 and Subsection 5.1. We note that we use thresholding in all the data terms to remove outliers. In all experiments we have considered L = 6 levels of resolution for the multilevel algorithms (multiresolution and multigrid). The stopping criteria for all algorithms were set on the relative error of the objective function, the gradient norm or the solution norm, with a tolerance of 10^{−5}. For the MR/OPT algorithm, the maximum number of outer iterations was set to 10 per resolution level for all the optical flow models, except when using the nonlinear data term with TV regularization, for which we use a maximum of 15 iterations. For FMG/OPT, we perform 2 or 3 V-cycles when using the nonlinear data term, while only 1 V-cycle is sufficient in the case of the linear data term. We note that, for the purpose of a fair comparison of computational work, we stop the outer iterations once a similar accuracy of the optical flow estimation is reached. For all the algorithms, the maximum number of inner iterations within the OPT method was set to 20.
In Tables 6.2-6.4, we summarize the quality of the solution and the computational costs of the three numerical algorithms for the four optical flow models. Model 1 refers to the linear data term plus quadratic regularization; model 2 to the nonlinear data term plus quadratic regularization; model 3 to the linear data term plus TV regularization; and model 4 to the nonlinear data term plus TV regularization. In terms of the quality of the solution, comparing the unilevel algorithm with the multilevel algorithms, we note that the optical flow estimates of the four models are more accurate when computed using the latter algorithms, for all the tested images. In this regard, multigrid optimization has been shown to provide a more accurate estimation than multiresolution optimization if we take the average angular error as the accuracy measure; see Tables 6.2-6.4.
Overall, the FMG/OPT algorithm performs at least twice as well as the MR/OPT algorithm and ten times better than the unilevel truncated Newton algorithm; see Table 6.5. We also notice that the FMG/OPT algorithm is less sensitive to the image size, since it often takes a similar number of function and gradient evaluations for the same optical flow model across images.

Conclusion
Based on the discretize-optimize approach, we have applied different numerical optimization techniques to variational models for optical flow computation. First, we have shown the competitiveness of this strategy compared to the classical optimize-discretize approach: three Newton-based optimization algorithms were superior to the Gauss-Seidel method when applied to the classical Horn-Schunck model. In particular, truncated Newton was shown to be a suitable unilevel optimization algorithm and was chosen as the smoother for the optimization-based multilevel methods. We have then implemented the FMG/OPT algorithm based on a line search strategy to scale the (direct) Newton or the (recursive) multigrid search direction. Several components of the MG/OPT technique have been tuned for high efficiency, and the algorithm has been fully evaluated with respect to unilevel and (one-way) multiresolution optimization. Our experimental results have demonstrated that the FMG/OPT algorithm can be used effectively for optical flow computation: using different models and images, we have observed that FMG/OPT is faster and more accurate than both unilevel and multiresolution truncated Newton. Further research will investigate the use of line search multigrid versus trust region multigrid in the context of dense optical flow computation. The proposed numerical strategy can also be adapted to the minimization of other nonlinear energy functionals, like the illumination invariant model proposed in [14], or to depth estimation in stereo problems.


Figure 5.1. Filter tap interpolation by means of a shift in the Fourier domain.

Figure 5.2. A set of matched filters of size 9 interpolated at non-integer points by a shift in the Fourier domain (black) and by linear interpolation (light gray).

Figure 6.1. On the top, one frame of the original sequence is shown. On the bottom, the ground truth for the corresponding translating (left) and diverging (right) tree sequences is shown. Motion vectors have been scaled by a factor of 2.5 for better visibility.

Figure 6.2. On the left, one frame of the Yosemite sequence is shown. On the right, the corresponding ground truth is depicted, where motion vectors have been scaled by a factor of 2.5 for better visibility.

Table 6.1. CPU time needed by the approaches GS, TN1, TN2 and QN to reach similar accuracy for the Yosemite sequence using the Horn-Schunck model.

Table 6.2. Comparison of computational work and optical flow estimation for two frames of the translating tree sequence using the optimization algorithms OPT, MR/OPT and FMG/OPT. Time is CPU time in seconds.

Table 6.3. Comparison of computational work and optical flow estimation for two frames of the diverging tree sequence using the optimization algorithms OPT, MR/OPT and FMG/OPT. Time is CPU time in seconds.

Table 6.4. Comparison of computational work and optical flow estimation for two frames of the Yosemite sequence using the line search optimization algorithms (OPT, MR/OPT, FMG/OPT). Time is CPU time in seconds.

Table 6.5. Global characteristics of OPT, MR/OPT and FMG/OPT for the optical flow models on all three image sequences. Nfg is the total number of function and gradient evaluations.