Enhancing the top signal at Tevatron using Neural Nets

We show that Neural Nets can be useful for top analysis at Tevatron. The main features of $t\bar t$ and background events on a mixed sample are projected in a single output, which controls the efficiency and purity of the $t\bar t$ signal.

The announced discovery of the top quark by CDF [1] at Tevatron has generated great excitement in the scientific community. Although the statistics are too limited † to establish the existence of the top quark, it is nevertheless natural to interpret the excess of events as $t\bar t$. The experimental situation will certainly improve in the coming months and the top will hopefully be confirmed. From the theoretical point of view, the consistency of the Standard Model demands that the top be the partner of the bottom quark, ensuring the absence of flavor changing neutral currents [3]. The CDF value of the top mass, $m_t = 174 \pm 10\,^{+13}_{-12}$ GeV [1], is consistent with recent theoretical studies on radiative corrections combined with precision measurements of the Z boson mass and the strong coupling constant at LEP, leading to $m_t = 165\,^{+13\,+18}_{-14\,-19}$ GeV [4].
The dominant top production mechanism at Tevatron is $q\bar q \to t\bar t$, followed by $gg \to t\bar t$. Once produced, the top decays into $bW$, with the subsequent $W \to l\nu,\ q\bar q'$ decay, inside the detector. There are therefore three possible final states for the $t\bar t$ signal which, in order of increasing branching ratio, are:
1. Two charged leptons, missing energy and two jets
2. One charged lepton, missing energy and four jets
3. Six jets.
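The ordering of the three channels by branching ratio can be checked with a lowest-order counting. The sketch below assumes that each leptonic W mode carries weight 1/9 and that the hadronic modes carry 6/9 in total; these weights and the function name are illustrative assumptions, not values quoted in the text.

```python
from fractions import Fraction

# Assumption: lowest-order counting. Each W -> l nu mode (e, mu, tau) has
# weight 1/9 and W -> q qbar' has weight 6/9 (two quark doublets x 3 colours).
BR_LEPTON = Fraction(2, 9)    # l = e or mu only
BR_HADRONS = Fraction(6, 9)

def ttbar_channel_fractions():
    """Return the fraction of t tbar decays in each of the three channels."""
    return {
        "dilepton": BR_LEPTON ** 2,
        "lepton_plus_jets": 2 * BR_LEPTON * BR_HADRONS,
        "all_jets": BR_HADRONS ** 2,
    }

channels = ttbar_channel_fractions()
for name, value in channels.items():
    print(f"{name}: {value} = {float(value):.3f}")
```

Under this counting the six-jet channel is the most likely and the dilepton channel the rarest, in line with the ordering of the list above.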
These channels need different strategies for top searches, and different backgrounds have to be considered for each. The first channel suffers from a small branching ratio and from the presence of two undetected neutrinos, which makes top reconstruction unfeasible. It has been analyzed in terms of the correlations among the charged leptons [5] and, recently, it has been suggested to be separable from its possible backgrounds [6]. The most investigated channel so far is the one containing one charged lepton [7]. It has a sizeable branching ratio with a moderate background. Still, the neutrino escapes detection and hence the event cannot be completely reconstructed. The third channel, six final jets, is the most likely and allows full top reconstruction, but at the expense of a huge QCD background. Recently, it has been pointed out that tagging of a b-quark can help to obtain acceptable signal to background ratios for $m_t < 180$ GeV [8].
All the channels mentioned need specific experimental cuts for detecting jets and/or hard leptons as well as for their isolation. This, together with detector performance, implies a sizeable reduction in the number of possible $t\bar t$ candidates, and demands a good efficiency for discerning real $t\bar t$ events from fake background events. We propose to use Neural Nets (NNs) for the analysis of experimental data, trying to maximize the signal to background ratio without significant losses in statistics, and in particular we apply them to top analysis at Tevatron. NNs are by now well known for their ability to classify among different distributions and are being used for this purpose in several high energy applications [9]. Some examples are the Higgs search at LHC [10], b and τ analysis [11], quark and gluon jet analysis [12], the determination of Z to heavy quark branching ratios [13], and bottom jet recognition [14]. It has been shown that NNs give, after proper training, the probability that a given event belongs to some class [15], therefore providing a useful tool for classification decisions. We are not interested in a deep and exhaustive analysis but rather in the possibilities that a NN can offer for enlarging the signal to background ratio. We therefore restrict ourselves to the parton level, without considering hadronization, detector acceptance, resolution effects, efficiencies, etc., in order to illustrate the potential of the NN compared with the classical analysis in terms of cuts on a given set of variables.

† CDF has reported 12 events, with 6 events for the estimated background, and a 0.26% probability of observing such a background fluctuation. D0, on the other hand, does not have a clear signal of the top quark [2].
We focused our analysis on the one-charged-lepton channel, with $l = e^\pm, \mu^\pm$, using the exact tree-level amplitudes with spin correlations [16]. The main background to this process is [17] $p\bar p \to W + 4$ jets, followed by a second mechanism which is an order of magnitude smaller [18]. We have only considered the first mechanism and have used VECBOS ‡ [19] for its evaluation.
We have taken $m_t = 174$ GeV and have normalized the total $t\bar t$ cross section at Tevatron to 5.1 pb, a value that takes into account $O(\alpha_s^3)$ corrections and the resummation of leading soft gluon corrections to all orders in perturbation theory [20]. CDF measures a $t\bar t$ cross section of $13.9\,^{+6.1}_{-4.8}$ pb [1], which is a factor around 2.5 bigger than the theoretical value we have used. Notice that using the CDF value, the signal to background ratio would increase by the same factor. We have used the HMRS set 1 structure functions [21] at the scale $Q = m_t$ ($Q = \langle p_t \rangle$) for the top signal (background). We generated events satisfying reasonable acceptance cuts on the jet, charged-lepton and missing transverse momenta and on the jet and lepton pseudorapidities, and requiring jet and lepton isolation, where $\Delta R = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$ is the distance in the lego plot. These cuts are intended to simulate the experimental cuts needed to detect jets and hard leptons inside the detector and to select good candidates for top production (from now on these cuts will be referred to as acceptance cuts). The cross section after the acceptance cuts is 0.35 pb (1.2 pb) for the $t\bar t$ signal (background), in good agreement with Ref. [22]. We generated 4000 $t\bar t$ and 4000 background events. The total number of events is essentially limited by the time needed to generate a statistically significant sample for the background. (More efficient generation techniques have been recently proposed [8] which could hopefully circumvent this problem.)
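The lego-plot distance used for the isolation requirement can be sketched as follows. The helper names and the `min_delta_r` parameter are hypothetical, and the threshold in the example call is a placeholder rather than the value used in the analysis.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Lego-plot distance between two objects given as (eta, phi)."""
    # Wrap the azimuthal difference into [-pi, pi] so that objects on
    # either side of phi = +-pi are recognised as close to each other.
    dphi = math.remainder(phi1 - phi2, 2.0 * math.pi)
    return math.hypot(eta1 - eta2, dphi)

def is_isolated(obj, others, min_delta_r):
    """True if obj = (eta, phi) is farther than min_delta_r from all others.

    min_delta_r is a hypothetical parameter; the threshold of the
    analysis is not reproduced here.
    """
    return all(delta_r(*obj, *other) > min_delta_r for other in others)

# Illustrative call with made-up directions and a placeholder threshold.
lepton = (0.3, 3.0)
jets = [(0.3, -3.0), (1.5, 0.2)]
print(delta_r(*lepton, *jets[0]))   # small: the pair straddles phi = pi
print(is_isolated(lepton, jets, 0.2))
```

Wrapping the azimuthal difference matters because a naive subtraction would treat two objects near $\phi = \pm\pi$ as almost back to back.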
Notice that the acceptance cuts have to be supplemented either with additional cuts or with some other criterion, a NN for instance, on some kinematical variables in order to assign a single event as signal or background, leading to a reduction of the $t\bar t$ and background event samples. (b tagging, for example, reduces the signal by a factor of order 0.3 [23].) We have considered six kinematical variables in our analysis:
• i) $p_T^W$, the transverse momentum of the leptonically decaying W.
• ii) $E_T$, the total transverse energy.
• iii) $m_{Wjj}$, the invariant mass of the hadronically decaying W.
• iv) $m_t$, the reconstructed top mass.
• v) $S$, the sphericity.
• vi) $A$, the aplanarity.
Variables i and ii are completely defined when assigning the missing transverse momentum to the undetected neutrino. The third variable requires pairing two jets with an invariant mass close to the W mass. Variables iv, v and vi need the knowledge of the longitudinal momentum of the neutrino, which is not measured. It can however be inferred by assuming that the $l\nu$ pair comes from an on-shell W. This leads to a two-fold ambiguity which can be resolved to some extent by requiring $t\bar t$ reconstruction along the lines suggested in Ref. [22], to which we refer for details. The sphericity and aplanarity, computed for the lepton plus neutrino plus 4-jet momenta, take into account the topology of the events, with larger values expected for the signal than for the background distributions.
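The two-fold ambiguity in the neutrino longitudinal momentum follows from solving the on-shell condition $(p_l + p_\nu)^2 = m_W^2$, which is quadratic in $p_z^\nu$. A minimal sketch is given below, assuming massless leptons and a W mass of 80.4 GeV; the function name and the numerical inputs are illustrative and not taken from the analysis.

```python
import math

M_W = 80.4  # GeV; assumed W mass for this illustration

def neutrino_pz_solutions(lepton, nu_px, nu_py, m_w=M_W):
    """Solve (p_l + p_nu)^2 = m_w^2 for the neutrino p_z.

    lepton is (px, py, pz) of a massless charged lepton; (nu_px, nu_py)
    is the missing transverse momentum assigned to the neutrino.
    Returns a list with the two solutions, or an empty list when the
    constraint has no real solution.
    """
    lx, ly, lz = lepton
    e_l = math.sqrt(lx * lx + ly * ly + lz * lz)
    pt_l_sq = lx * lx + ly * ly
    pt_nu_sq = nu_px * nu_px + nu_py * nu_py
    a = 0.5 * m_w * m_w + lx * nu_px + ly * nu_py
    disc = a * a - pt_l_sq * pt_nu_sq
    if disc < 0.0:
        return []
    root = e_l * math.sqrt(disc)
    return [(a * lz - root) / pt_l_sq, (a * lz + root) / pt_l_sq]

# Illustrative call with made-up momenta (GeV).
print(neutrino_pz_solutions((30.0, 10.0, 20.0), -20.0, 25.0))
```

Choosing between the two roots requires extra information, which in the text comes from the $t\bar t$ reconstruction.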
The usual strategy for classifying events as signal or background is to apply different cuts on the kinematical variables considered, the six mentioned above in our case. These cuts are usually given by simple expressions (for instance: var1 > cut1 and var2 < cut2), so that the different regions are separated by hyperplanes in the variable space (from now on these cuts will be referred to as kinematical cuts). Denoting by $T$ ($B$) the number of top signal (background) events passing our selection criteria, and by $T_t$ the total number of $t\bar t$ events selected after the acceptance cuts, Eqs. (4-6), one would like to find the combination of cuts on the kinematical variables that maximizes the efficiency $\eta \equiv T/T_t$, or the purity $P \equiv T/(T+B)$, or both simultaneously. In the latter case, one method is to maximize the statistical significance of the filtered subsample, $S_s \equiv T/\sqrt{B}$, a criterion that can be used to enhance a new signal over its expected background. In any case, this requires a subtle fine tuning of the cuts to reach the maximum, which can become a hard problem as the number of kinematical variables grows.
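As an illustration of the quantities just defined, the sketch below applies a set of rectangular cuts to an event and computes the efficiency, purity and statistical significance from event counts. The variable names and bounds are hypothetical placeholders, not the cuts used in the analysis.

```python
import math

def passes_cuts(event, cuts):
    """Apply rectangular ("hypercubic") cuts.

    event maps variable names to values; cuts maps variable names to
    (low, high) bounds, where None leaves that side open.
    """
    for name, (low, high) in cuts.items():
        value = event[name]
        if low is not None and value <= low:
            return False
        if high is not None and value >= high:
            return False
    return True

def selection_metrics(n_top, n_bkg, n_top_total):
    """Return (efficiency, purity, significance) = (T/T_t, T/(T+B), T/sqrt(B)).

    Purity and significance are None when they are undefined.
    """
    efficiency = n_top / n_top_total
    purity = n_top / (n_top + n_bkg) if n_top + n_bkg > 0 else None
    significance = n_top / math.sqrt(n_bkg) if n_bkg > 0 else None
    return efficiency, purity, significance

# Hypothetical cuts and event, for illustration only.
cuts = {"pt_w": (20.0, None), "m_wjj": (70.0, 90.0)}
event = {"pt_w": 40.0, "m_wjj": 82.0}
print(passes_cuts(event, cuts))
print(selection_metrics(n_top=50, n_bkg=25, n_top_total=100))
```

Every additional variable adds two bounds to tune, which is the fine-tuning problem referred to in the text.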
We are interested in the separation of signal and background using a layered feedforward NN which, as we will show, avoids fine tuning in a multivariable space. A feedforward NN consists of several layers of units called neurons. Among the layers we can distinguish one input layer, where the information comes in, one or several hidden layers, where the information is processed, and one output layer, which yields the output of the NN.
The input of neuron $i$ in layer $l$ is given by

$$I_i^l = \begin{cases} in_i^{(e)} & l = 1, \\ \sum_j w_{ij}^l\, S_j^{l-1} + B_i^l & l > 1, \end{cases}$$

where $in_i^{(e)}$ is the set of kinematical variables for event $e$, the sum is extended over the neurons of the preceding layer $(l-1)$, $S_j^{l-1}$ is the state of neuron $j$, $w_{ij}^l$ is the connection weight between neuron $j$ and neuron $i$, and $B_i^l$ is a bias input to neuron $i$. The state of a neuron is a function of its input, $S_j^l = F(I_j^l)$, where $F$ is the neuron response function. In this paper we take $F(I_j^l) = 1/(1 + \exp(-I_j^l))$, the so-called "sigmoid function", which is similar to the response curve of the biological neuron and offers more sensitive modelling of real data than a linear function.
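The propagation rule can be sketched in a few lines for a net with one hidden layer and a single output. The function names and the numerical parameters are illustrative, and the sketch assumes that the input-layer states are the normalized variables themselves.

```python
import math

def sigmoid(x):
    """F(I) = 1 / (1 + exp(-I)), written to avoid overflow for large |I|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def layer_states(prev_states, weights, biases):
    """S_i^l = F(sum_j w_ij^l S_j^(l-1) + B_i^l) for every neuron i of layer l."""
    return [sigmoid(sum(w * s for w, s in zip(row, prev_states)) + b)
            for row, b in zip(weights, biases)]

def network_output(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate one event through an input -> hidden -> single-output net.

    Assumption: the input-layer states are the normalised variables themselves.
    """
    hidden = layer_states(inputs, hidden_w, hidden_b)
    return layer_states(hidden, [output_w], [output_b])[0]

# Illustrative 6-6-1 net with arbitrary (untrained) parameters.
event = [0.2, 0.5, 0.8, 0.7, 0.1, 0.05]
hidden_w = [[0.1 * (i - j) for j in range(6)] for i in range(6)]
hidden_b = [0.0] * 6
output_w = [0.3, -0.2, 0.1, 0.4, -0.1, 0.2]
print(network_output(event, hidden_w, hidden_b, output_w, 0.0))
```

Because the output neuron is also a sigmoid, the net output always lies between 0 and 1, which is what allows a single cut on it.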
Owing to their parallel behaviour, NNs have the capacity to learn from a set of given examples. A very popular learning algorithm is error backpropagation (BP) [24]. The main objective of BP is to minimize an error function, also called energy,

$$E = \frac{1}{2} \sum_e \left( o^{(e)} - out^{(e)} \right)^2,$$

by adjusting the $w_{kl}$ and $B_n$ parameters, where $o^{(e)}$ is the state of the output neuron, $out^{(e)}$ is its desired state, and $e$ runs over the event sample. Taking the desired output as 1 for each signal event and 0 for each background event, the output of the net, after training, gives the conditional probability that, given the observed quantities for a single event, this event is a signal [15], provided that the ratio of signal to background in the learning sample corresponds to the real one.
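A minimal sketch of backpropagation for a net with one hidden layer is given below. It assumes a quadratic energy with a factor 1/2, online updates (one per event), a fixed learning rate and a toy two-variable sample; none of these choices, nor the function names, are taken from the analysis.

```python
import math
import random

def sigmoid(x):
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def init_net(n_in, n_hidden, rng):
    """Small random weights and zero biases for an n_in -> n_hidden -> 1 net."""
    return {
        "w_h": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b_h": [0.0] * n_hidden,
        "w_o": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_o": 0.0,
    }

def forward(net, x):
    """Return (hidden states, output state) for one event."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w_h"], net["b_h"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_o"], hidden)) + net["b_o"])
    return hidden, out

def energy(net, samples):
    """E = 1/2 * sum over events of (output - desired output)^2."""
    return 0.5 * sum((forward(net, x)[1] - d) ** 2 for x, d in samples)

def train(net, samples, rate=0.5, epochs=200):
    """Online gradient descent on E: one parameter update per event."""
    for _ in range(epochs):
        for x, d in samples:
            hidden, out = forward(net, x)
            # Error terms propagated backwards from the output neuron.
            delta_o = (out - d) * out * (1.0 - out)
            delta_h = [delta_o * w * h * (1.0 - h)
                       for w, h in zip(net["w_o"], hidden)]
            for j, h in enumerate(hidden):
                net["w_o"][j] -= rate * delta_o * h
                net["b_h"][j] -= rate * delta_h[j]
                for i, xi in enumerate(x):
                    net["w_h"][j][i] -= rate * delta_h[j] * xi
            net["b_o"] -= rate * delta_o

# Toy sample: "signal" (desired output 1) at large values, "background" (0) at small ones.
rng = random.Random(1)
samples = [([rng.uniform(0.6, 1.0), rng.uniform(0.6, 1.0)], 1.0) for _ in range(20)]
samples += [([rng.uniform(0.0, 0.4), rng.uniform(0.0, 0.4)], 0.0) for _ in range(20)]
rng.shuffle(samples)
net = init_net(n_in=2, n_hidden=6, rng=rng)
energy_before = energy(net, samples)
train(net, samples)
energy_after = energy(net, samples)
print(energy_before, energy_after)
```

The hidden-layer error terms are computed before the output weights are updated, so that each step uses the gradient evaluated at the current parameters.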
We have used a 3-layer NN with 6 input neurons, which are activated with the kinematical variables mentioned in the previous section (normalized to 1 for convenience), a hidden layer with 6 neurons, and a single output neuron whose desired output is 1 for the signal and 0 for the background. We have found that using 6 neurons in the hidden layer optimizes the minimum energy.
For the training step we have used 2000 top events and 2000 background events, which do not correspond to the expected ratio of cross sections. However, since we are interested not in the conditional probability mentioned above but in the efficiency and purity as a function of the cut on the output activation of the NN, this causes no difficulty and makes the learning more efficient. As a test sample, we have taken 570 (2000) top (background) events, statistically independent from the training ones. The top/background ratio of the test sample is chosen equal to the one obtained from the expected cross sections. All results presented have been obtained from the test sample.
Figure 1 shows the distribution of signal and background events as a function of the NN output activation for the test sample. We see two peaks close to 1 and 0, corresponding mainly to the signal and background respectively. It is clear from this plot that by cutting on the output of the net we can obtain samples richer in signal or in background as desired. The solid (dashed) line in Figure 2 shows the efficiency (purity) as a function of the net output cut. It is clear that we have to choose an output cut close to 1 if we want high purity, or a cut close to 0 for high efficiency. The highest output cut that improves the purity, given a fixed luminosity, would be the one still leaving enough signal events (a minimum of 5). This cut will be very close to 1, due to the fact that the efficiency is larger than 0.9 for any value of the output cut except for values very close to 1. Figure 3 shows the efficiency versus the purity (solid line) when varying the NN output cut from 0 to 0.99998. The points correspond to some hypercubic cuts applied to the six input variables, and represent the traditional procedure (each point represents a given combination of cuts requiring variables to be bigger than certain values, or masses to lie around a certain central value, for instance $p_t > p_t^{\min}$, $S > S^{\min}$, $m_W - \delta < m_{Wjj} < m_W + \delta$, ...), chosen to favor the signal over the background. We find that the NN, working with only one variable, the output of the net, performs better than the traditional analysis for any combination of purity and efficiency, showing the great improvement offered by the method. A complex problem in many variables has been reduced to the study of only one variable, the NN output, which even improves the analysis.
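The efficiency and purity curves as a function of the output cut can be obtained with a simple scan over the NN outputs of the test sample. The sketch below uses made-up output values and a hypothetical function name.

```python
def scan_output_cut(signal_outputs, background_outputs, cuts):
    """Return a list of (cut, efficiency, purity) for events with output > cut.

    Purity is None when no event survives the cut.
    """
    rows = []
    for cut in cuts:
        n_top = sum(1 for o in signal_outputs if o > cut)
        n_bkg = sum(1 for o in background_outputs if o > cut)
        efficiency = n_top / len(signal_outputs)
        purity = n_top / (n_top + n_bkg) if n_top + n_bkg > 0 else None
        rows.append((cut, efficiency, purity))
    return rows

# Made-up NN outputs, for illustration only.
signal_outputs = [0.9, 0.8, 0.95, 0.3]
background_outputs = [0.1, 0.2, 0.85, 0.05]
for row in scan_output_cut(signal_outputs, background_outputs, [0.0, 0.5, 0.87]):
    print(row)
```

Raising the cut trades efficiency for purity, so the whole efficiency versus purity curve is traced by a single parameter.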
When the important point is to reveal the existence of the signal, the relevant quantity should be the statistical significance. Values of $S_s > 5$ are commonly accepted as proof of the existence of a clear signal. Figure 4 shows the statistical significance as a function of the efficiency and the purity for $T_t = 1$ signal event (changing the number of signal events, $T_t$, does not modify the shape of the surface in Fig. 4 and only rescales its height, which is proportional to $\sqrt{T_t}$). Figure 5 shows the statistical significance as a function of the net output cut for 7 signal events before kinematical cuts (corresponding to an integrated luminosity of 20 pb$^{-1}$). We see that the statistical significance increases as the output cut increases. As in the case of improving the purity, the highest output cut, given a fixed luminosity, would be the one still leaving enough signal events (a minimum of 5), and is very close to 1.
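The $\sqrt{T_t}$ scaling follows directly from the definitions of $\eta$, $P$ and $S_s$ given earlier. Writing $T = \eta T_t$ and solving $P = T/(T+B)$ for the background, $B = T(1-P)/P$, one obtains

$$S_s = \frac{T}{\sqrt{B}} = \sqrt{\frac{T\,P}{1-P}} = \sqrt{T_t}\,\sqrt{\frac{\eta\,P}{1-P}},$$

so that at fixed efficiency and purity the significance depends on the number of signal events only through the overall factor $\sqrt{T_t}$.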
One of the problems faced in $p\bar p$ collisions is the estimation of the background. A factor of 2 in the background could destroy any evidence of the signal. Figure 6 shows the allowed region of the output cut versus the factor $f$ of the background ($f = 2$ means that the background is twice as large as we have computed) where, for a luminosity of 20 pb$^{-1}$, we can still obtain a 5 sigma effect with at least 5 signal events. For a fixed factor $f$, the largest and smallest values of the output cut correspond to the highest purity and highest efficiency respectively. Notice that output cuts very close to 1 are not included in the allowed region, although this is not visible in the plot.
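The two conditions that define the allowed region (at least 5 signal events and at least a 5 sigma effect) can be sketched as follows. The cross sections after acceptance cuts are those quoted in the text; the fractions of signal and background events surviving the output cut are hypothetical inputs, as are the function names.

```python
import math

SIGMA_TOP_PB = 0.35   # t tbar cross section after acceptance cuts (from the text)
SIGMA_BKG_PB = 1.2    # background cross section after acceptance cuts (from the text)

def expected_events(sig_eff, bkg_eff, lumi_pb=20.0, f=1.0):
    """Expected (signal, background) counts after the output cut.

    sig_eff and bkg_eff are the fractions of accepted signal and background
    events that survive the cut; f rescales the background estimate.
    """
    n_top = sig_eff * SIGMA_TOP_PB * lumi_pb
    n_bkg = f * bkg_eff * SIGMA_BKG_PB * lumi_pb
    return n_top, n_bkg

def is_allowed(sig_eff, bkg_eff, f, lumi_pb=20.0, min_signal=5.0, min_sigma=5.0):
    """True if at least min_signal events remain and T/sqrt(f*B) >= min_sigma."""
    n_top, n_bkg = expected_events(sig_eff, bkg_eff, lumi_pb, f)
    if n_top < min_signal:
        return False
    return n_bkg == 0.0 or n_top / math.sqrt(n_bkg) >= min_sigma

def max_background_factor(sig_eff, bkg_eff, lumi_pb=20.0, min_sigma=5.0):
    """Largest f for which the significance still reaches min_sigma."""
    n_top, n_bkg = expected_events(sig_eff, bkg_eff, lumi_pb)
    return math.inf if n_bkg == 0.0 else (n_top / min_sigma) ** 2 / n_bkg

# Hypothetical efficiencies at some output cut, for illustration only.
print(is_allowed(0.8, 0.02, f=2.0), max_background_factor(0.8, 0.02))
```

The requirement of at least 5 signal events is what excludes output cuts very close to 1, where the significance is high but too few events survive.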
Our results indicate that NNs are suitable for top analysis at Tevatron. Although we focused our study on a particular channel and worked at the parton level, we expect similar behaviour for the other channels with the corresponding backgrounds, and when performing more realistic analyses including hadronization and detector simulation. We do not claim to have used the best kinematical variables for our analysis, nor to have found the best NN topology. Our aim was only to study the potential use of NNs as a cross check of the traditional analysis in terms of cuts on a multidimensional variable space. More elaborate studies are postponed to a forthcoming publication.
In conclusion, we have shown that a NN trained with a mixed sample of $t\bar t$ and background events learns the main features of the different samples in a multivariable input space and projects them onto a single output. This output turns out to be very useful for discrimination between signal and background events.

Figure Captions

• Fig. 2 Efficiency (solid line) and purity (dashed line) as a function of the NN output cut.
• Fig. 3 Efficiency versus purity for the test sample. The solid line shows the NN result whereas the points correspond to several sets of linear cuts (see text) applied to the six input variables.
• Fig. 4 Statistical significance as a function of the efficiency and purity, normalized to $T_t = 1$ signal event. It scales as $\sqrt{T_t}$.
• Fig. 5 Statistical significance as a function of the NN output cut for an integrated luminosity of 20 pb$^{-1}$.
• Fig. 6 Allowed region (shaded area) of the output cut versus the factor $f$ of the background ($f = 2$ means that the background is twice as large as we have estimated) where, for a luminosity of 20 pb$^{-1}$, we can still obtain a 5 sigma effect with at least 5 signal events.