Discriminating signal from background using neural networks. Application to top-quark search at the Fermilab Tevatron

The application of Neural Networks in High Energy Physics to the separation of signal from background events is studied. A variety of problems usually encountered in this sort of analysis, from variable selection to systematic errors, is presented. The top-quark search is used as an example to illustrate the problems and the proposed solutions.

It is well known that neural networks (NN's) are useful tools for pattern recognition.
In High Energy Physics, they have been used or proposed as good candidates for signal-versus-background classification tasks. However, most of the existing studies are somewhat academic, in the sense that they essentially compare the NN performance with that of other classical classification techniques using Monte Carlo (MC) events. In realistic applications, real events must be analyzed and compared with simulated events, which introduces systematic effects that have to be taken into account and could significantly modify the efficiency of the analysis. We try to give some insight in this direction using the top quark search at the Fermilab Tevatron as an illustration. The top quark has been observed by the CDF [1] and D0 [2] collaborations. Recently, NN's have been applied to experimental top quark searches by the D0 Collaboration [3], for a fixed top quark mass, concluding that NN's are more efficient than traditional methods, in agreement with previous parton-level studies [4].
In this paper we continue and complete the analysis of Ref. [4] for the top quark search at the Tevatron. A more realistic study is performed by including parton hadronization and detector simulation with jet reconstruction. In addition, contrary to Ref. [4], where the top mass was fixed, the present study is valid for a large range of top mass values. Moreover, the number of kinematical variables considered is enlarged, and different ways of selecting the subsets most relevant to the process under consideration are discussed. Finally, the influence of systematic errors on the NN results is studied.
The analysis is focused on the top quark search at the Fermilab Tevatron pp̄ collider operating at √s = 1.8 TeV. The one-charged-lepton channel, pp̄ → tt̄ → lνjjjj with l = e±, µ±, is considered as the signal. The main background is pp̄ → Wjjjj → lνjjjj. Exact tree-level amplitudes with spin correlations were used to generate MC samples for both signal and background; the latter was evaluated with VECBOS [5]. The CTEQ structure functions [6] at the scale Q = m_t (Q = ⟨p_t⟩) for the top signal (background) were used.
The LUND fragmentation model [7] was used to hadronize the quarks and/or gluons. Events with one charged lepton and four jets satisfying the following acceptance cuts were selected: p_t^j, p_t^l, p̸_t > 20 GeV; |η_j|, |η_l| < 2; and ∆R_jl, ∆R_jj > 0.7. The symbol p_t (η) stands for transverse momentum (pseudorapidity), the indices j = 1, ..., 4 and l refer to the four jets and the charged lepton respectively, p̸_t is the missing transverse momentum associated with the undetected neutrino, and ∆R = √((∆η)² + (∆φ)²) is the distance in η-φ space, where φ is the azimuthal angle. The cross sections after the acceptance cuts for the signal and the background are given in Table I.
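In code, these acceptance cuts reduce to simple per-event selections. A minimal sketch with an illustrative event layout (dicts of pt, eta, phi in GeV; the layout and function names are not from the original analysis):

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Distance in eta-phi space: sqrt((d_eta)^2 + (d_phi)^2)."""
    dphi = abs(phi1 - phi2)
    if dphi > math.pi:                      # wrap azimuthal angle into [0, pi]
        dphi = 2.0 * math.pi - dphi
    return math.hypot(eta1 - eta2, dphi)

def passes_acceptance(lepton, jets, pt_miss):
    """Apply the cuts: pt > 20 GeV, |eta| < 2, and Delta R > 0.7 for
    all lepton-jet and jet-jet pairs.  lepton and each jet are dicts
    with keys 'pt', 'eta', 'phi'."""
    if len(jets) != 4 or pt_miss <= 20.0:
        return False
    objs = [lepton] + jets
    for obj in objs:
        if obj['pt'] <= 20.0 or abs(obj['eta']) >= 2.0:
            return False
    for i in range(len(objs)):
        for j in range(i + 1, len(objs)):
            if delta_r(objs[i]['eta'], objs[i]['phi'],
                       objs[j]['eta'], objs[j]['phi']) <= 0.7:
                return False
    return True
```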
In order to use NN's as signal/background classifiers, we considered layered feed-forward NN's with topologies N_i × N_h × N_o (N_i, N_h and N_o being the number of input, hidden and output neurons, respectively), with back-propagation as the learning algorithm to minimize a quadratic output error. Using a set of physical variables as inputs and taking the desired output as 1 for signal events and 0 for background events, the network output gives, after learning, the conditional probability that new test events are of signal or background type [8,9], provided that the signal/background ratio used in the learning phase corresponds to the real one.
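As a concrete, purely illustrative sketch of this setup, the following trains a small feed-forward network with one hidden layer by back-propagation on a quadratic output error, using two toy Gaussian samples in place of the physics events; with a 1:1 training mixture the output approximates the conditional probability of being signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the physics samples: 'signal' and 'background'
# events drawn from two overlapping 2D Gaussians (illustrative only).
n = 500
x = np.vstack([rng.normal(+1.0, 1.0, (n, 2)),
               rng.normal(-1.0, 1.0, (n, 2))])
t = np.hstack([np.ones(n), np.zeros(n)])   # desired output: 1 = signal

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Topology N_i x N_h x N_o = 2 x 5 x 1; plain back-propagation
# minimizing the quadratic output error.
n_i, n_h = 2, 5
W1 = rng.normal(0.0, 0.5, (n_i, n_h)); b1 = np.zeros(n_h)
W2 = rng.normal(0.0, 0.5, (n_h, 1));   b2 = np.zeros(1)

lr = 1.0
for _ in range(1000):
    h = sigmoid(x @ W1 + b1)               # hidden activations
    y = sigmoid(h @ W2 + b2).ravel()       # network output in (0, 1)
    d_o = (y - t) * y * (1.0 - y)          # quadratic-error gradient at output
    d_h = (d_o[:, None] @ W2.T) * h * (1.0 - h)
    W2 -= lr * h.T @ d_o[:, None] / len(x); b2 -= lr * d_o.mean()
    W1 -= lr * x.T @ d_h / len(x);          b1 -= lr * d_h.mean(axis=0)
```

After training, `y` is close to 1 for the signal-like half of the sample and close to 0 for the background-like half, which is the behavior the classification cuts below rely on.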
The robustness of the NN method is shown by making the results independent of the top mass, using several values in the learning and testing phases. During the learning phase a general network (GN) is fed with a set of events containing a signal sample, composed of three subsamples corresponding to m_t = 150, 174 and 200 GeV, and a background sample in a 1:1 proportion. In so doing, the NN output loses its direct Bayesian interpretation when applied to data whose signal/background proportion is not 1:1. Nevertheless, the NN is still useful for classification [8]. This way of proceeding has been shown to optimize the learning process and allows one to use the network over a wide interval of signal masses [10].
A set of N = 15 initial variables was considered. Some of them are chosen specifically to pin down the a priori main characteristics of the top signal, while others are not specific to it. For each reconstructed event we compute: (1) S, the sphericity; (2) A, the aplanarity; (3) m_jj^W, the invariant mass of the hadronically decaying W; (4) p_t^Wl, the transverse momentum of the leptonically decaying W; (5) E_T, the total transverse energy; (6) p_t^l, the charged-lepton transverse momentum; (7) η_l, the charged-lepton pseudorapidity; (8-11) p_t^i, i = 1, ..., 4, the transverse momenta of the jets in decreasing order; and (12-15) η_i, i = 1, ..., 4, the jet pseudorapidities in decreasing order. The missing transverse momentum has been assigned to the undetectable neutrino and its longitudinal momentum inferred along the lines suggested in Ref. [11]. In the testing phase, the GN with topology 15 × 15 × 1 is fed with new background and top events. The latter can be chosen with masses either corresponding to the values used for learning or to the new values m_t = 167 or 189 GeV. This differs from previous works [12,4], where the same mass values were used in both the learning and testing steps. Figure 1 shows the reconstructed top mass obtained for the five top signals and the background, corresponding to an integrated luminosity L = 100 pb⁻¹. A good top reconstruction is achieved for all masses considered, but there is a substantial background contribution. To further appreciate the GN's usefulness, five specialized NN's (SN) were trained, each with a top mass specific to it and a generic background common to all. Again, a 1:1 signal-to-background ratio was used for learning. The GN and SN average errors, shown in Table II, are similar for all masses considered, indicating that the GN performs fairly well over the whole mass range. Nevertheless, it is clear that the window for the top mass should be reduced if the mass is more precisely known. The NN outputs for the individual events are listed in Table III.
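A common implementation of the W-mass constraint used to infer the neutrino longitudinal momentum (a standard technique, not necessarily the exact prescription of Ref. [11]) solves (p_l + p_ν)² = m_W² as a quadratic in p_z^ν:

```python
import math

M_W = 80.4  # GeV, W boson mass used in the constraint (illustrative value)

def neutrino_pz(pl, pt_nu):
    """Solve (p_l + p_nu)^2 = M_W^2 for the neutrino longitudinal momentum.
    pl = (px, py, pz, E) of the charged lepton (taken massless);
    pt_nu = (px, py) is the missing transverse momentum.
    Returns the two real solutions; if the discriminant is negative
    (common once resolutions enter), the real part is returned twice,
    which is one conventional choice."""
    plx, ply, plz, el = pl
    pnx, pny = pt_nu
    ptl2 = plx**2 + ply**2
    mu = 0.5 * M_W**2 + plx * pnx + ply * pny
    a = mu * plz / ptl2
    disc = a**2 - (el**2 * (pnx**2 + pny**2) - mu**2) / ptl2
    if disc < 0.0:
        return a, a
    r = math.sqrt(disc)
    return a - r, a + r
```

The twofold ambiguity is intrinsic: both roots satisfy the mass constraint, and an additional criterion (e.g. the smaller |p_z|) is usually applied to pick one.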
It can be seen that most of them give values close to 1, showing that they are more compatible with our signal simulation than with our simulated background.
The selection of the most relevant variables for a given process is one of the major problems in experimental analyses. Too many variables may introduce noise and make the event selection very difficult; on the other hand, too much sensitivity may be lost when too few variables are used. In general, a large number N of variables can be considered and measured for an event. All N variables carry some information on signal-versus-background differences, but some subsets will clearly be more valuable than others for the separation task. It is therefore of interest to select a subset of n 'best' variables (n < N) carrying the largest discriminating power between signal and background, even if somewhat lower classification efficiencies may follow.
In the process of reducing the number of variables, it is convenient to control the efficiency loss in the classification task. We suggest that NN's can be used both for the variable selection and for the evaluation of the efficiency loss: the former through the weights of the trained network, while the latter is naturally estimated in terms of the error function. When reducing the number of variables, it is convenient to eliminate only a few variables at each step rather than rejecting many at once; this introduces a mild dependence of the chosen variables on the number of rejection steps, but turns out to be more efficient. The adopted procedure thus iterates two steps: (1) train an NN with the current set of variables and evaluate its average error; (2) use the trained weights to rank the input variables and reject the least relevant ones, stopping when the increase of the average error exceeds an allowed amount (25% in our case). Three methods involving the weights were considered for the variable ranking at step 2. For every input neuron k, the following quantities, built from its connections w_kl with the hidden-layer units, have been considered: the sum of the absolute weights, Σ_l |w_kl| (Method 1) [8]; their variance (Method 2) [14]; and their saliencies (Method 3) [15]. The five finally selected variables are the mass of the hadronically decaying W, the total transverse energy E_T, and the jet transverse momenta p_t^1, p_t^3 and p_t^4. The quadratic error associated with this set of five variables, obtained through the systematic reduction, can be compared, for instance, with the one obtained for the intuitive variables used in Ref. [4]: S, A, m_jj^W, p_t^Wl, E_T. The former is 18% lower than the latter, showing the usefulness of the methodical reduction.
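The weight-based rankings above can be sketched as follows (illustrative code; for Method 3 a simplified sum of squared weights is used as a crude stand-in for the second-derivative saliencies of Ref. [15]):

```python
import numpy as np

def input_relevance(W1):
    """Rank input variables from the first-layer weight matrix W1
    (shape: n_inputs x n_hidden).  Three per-input measures, in the
    spirit of the three methods discussed in the text:
      method 1: sum over hidden units of |w_kl|
      method 2: variance of the weights w_kl fanning out of input k
      method 3: sum of w_kl^2, a simplified proxy for the saliencies"""
    m1 = np.abs(W1).sum(axis=1)
    m2 = W1.var(axis=1)
    m3 = (W1 ** 2).sum(axis=1)
    return m1, m2, m3

# Iterative reduction would drop the lowest-ranked input, retrain the
# network on the remaining variables, and repeat while the average
# error stays within the allowed increase (retraining not shown here).
```

Note that a constant fan-out of weights has zero variance even when the weights are large, one reason the variance-based Method 2 can be less suited for the selection, in line with the conclusions of the paper.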
We have trained an NN with the five relevant variables to study the enhancement of the signal/background ratio as a function of the NN output cut. For a given cut, only events with a network output above it are selected. Since the signal is peaked around 1 and the background around 0, increasing the cut clearly enlarges the signal/background ratio. A typical quantity used to reveal the existence of a signal is the statistical significance, defined as the number of selected signal events over the square root of the number of selected background events, N_s/√N_b. NN output cuts between 0.6 and 0.8 increase the signal/background ratio with a minimal loss of signal and a significant reduction of the background. Figure 3 shows the reconstructed top mass using only the events with NN output larger than 0.7; the signals now dominate clearly over the background.
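A sketch of the cut scan (illustrative code; in practice the per-event weights come from cross section times luminosity over the number of generated events, and N_s/√N_b is the conventional significance definition assumed here):

```python
import numpy as np

def significance_scan(y_sig, y_bkg, w_sig, w_bkg, cuts):
    """For each NN-output cut, count weighted signal (Ns) and
    background (Nb) events above the cut and return Ns / sqrt(Nb).
    y_sig, y_bkg: arrays of network outputs for signal and background
    test events; w_sig, w_bkg: per-event weights (scalars here)."""
    out = []
    for c in cuts:
        ns = w_sig * np.count_nonzero(y_sig > c)
        nb = w_bkg * np.count_nonzero(y_bkg > c)
        out.append(ns / np.sqrt(nb) if nb > 0 else float('inf'))
    return out
```

Scanning `cuts` over, say, `np.linspace(0, 0.95, 20)` and taking the maximizing cut reproduces the kind of optimization that selects the 0.7 working point quoted below.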
At this point, one may wonder about the benefit of using a reduced number of variables in the analysis. The main reason is to avoid possible noise when a large number of variables is used. In fact, the allowed 25% increase of the average error translates into decreases of the signal efficiency and the statistical significance. We have found that the efficiency (statistical significance) diminishes from 0.75 (6.8) to 0.58 (6.0) when reducing from the initial 15 to the final 5 variables, for an NN output cut of 0.7, a value chosen because it maximizes the statistical significance. These cannot be considered dramatic losses. Moreover, our initial number of variables, N = 15, was moderate and we could optimize the NN learning avoiding local minima. In general this can be done for small sets of variables, but it is very difficult for large ones, so NN's trained with small subsets of relevant variables may even reach better efficiencies and/or statistical significances than NN's trained with larger variable sets.
We now consider some sources of systematic error coming from possible disagreements between MC and real data. In standard analyses, where single cuts are applied to single variables, the effects of systematic errors need only be studied in the region around the cuts, in an easy and well understood way. In the case of an NN, the only possibility is to propagate the estimated systematic errors on the input variables to the output. Two basic effects can be considered: shifts between data and MC, and different resolutions for the variables used. We have studied the effect of a 2% shift and of a 2% change of resolution on the cluster energies. With these new energies the five selected variables were reconstructed to obtain a "new" test sample for evaluating systematic effects; the 2% variation of the reconstructed cluster energies was chosen for illustration purposes. This procedure automatically includes the correlations of the NN input variables (there are studies in the literature where this is not the case [16]). The results depend on the NN output cut; in the region of interest, we have found the uncertainty due to systematic errors to be comparable with the uncertainty coming from an error on m_t of ±11 GeV.
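The variation itself is straightforward to sketch (illustrative function and parameter names; the shifted or smeared cluster energies would then be used to rebuild the five input variables and re-evaluate the trained NN, the change in selected yields being the systematic uncertainty):

```python
import numpy as np

def shifted_and_smeared(energies, shift=0.02, smear=0.02, rng=None):
    """Produce 'systematically varied' cluster energies: a coherent
    2% scale shift, and an extra 2% Gaussian smearing modelling a
    resolution difference between data and MC."""
    rng = rng or np.random.default_rng(0)
    e = np.asarray(energies, dtype=float)
    e_shift = e * (1.0 + shift)                          # scale shift
    e_smear = e * rng.normal(1.0, smear, size=e.shape)   # resolution change
    return e_shift, e_smear
```

Because the whole event is re-reconstructed from the varied energies, correlations between the NN inputs are propagated automatically, as noted in the text.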
The application of Neural Networks to discriminate signal from background in High Energy Physics has been studied, using the top quark search at Fermilab as an example.
The analysis is valid for a large range of top mass values. Special attention was paid to the selection of the most relevant variables. Several methods, based on the weights connecting the input and the hidden neurons, were considered. We conclude that Methods 1 and 3, which use the sum of the absolute values of the weights and the weight saliencies respectively, give similar results and are better suited for the variable selection than Method 2, which uses the weight variances. The performance of the reduced NN was studied in terms of the statistical significance; compared with the initial NN, we found a small decrease of the statistical significance and a moderate loss of signal efficiency. Finally, the effect of propagating systematic errors arising from energy shifts and changes in resolution has been evaluated and found, in the region of interest, to be comparable with the uncertainty coming from a ±11 GeV error on m_t.