Adaptive Asymmetric Least Squares baseline estimation for analytical instruments

Automated signal processing in analytical instrumentation is today required for the analysis of highly complex biomedical samples. Baseline estimation techniques are often used to correct long term instrument contamination or degradation. They are essential for accurate peak area integration. Some methods approach the baseline estimation iteratively, trying to ignore peaks which do not belong to the baseline. The proposed method in this work consists of a modification of the Asymmetric Least Squares (ALS) baseline removal technique developed by Eilers and Boelens. The ALS technique suffers from bias in the presence of intense peaks (in relation to the noise level). This is typical of diverse instrumental techniques such as Gas Chromatography-Mass Spectrometry (GC-MS) or Gas Chromatography-Ion Mobility Spectrometry (GC-IMS). In this work, we propose a modification (named psalsa) to the asymmetry weights of the original ALS method in order to better reject large peaks above the baseline. Our method will be compared to several versions of the ALS algorithm using synthetic and real GC signals. Results show that our proposal improves previous versions being more robust to parameter variations and providing more accurate peak areas.


I. INTRODUCTION
Several instrumental techniques such as GC-MS or GC-IMS produce signals consisting of chemical information, baseline and noise. In order to extract reliable chemical information, such as a list of peak positions and peak areas, it is crucial to denoise the signal and remove its baseline. Analysts usually manually select the peak boundaries and fit a curve to them to estimate the baseline of each peak, however manual baseline estimation is very expensive when the number of peaks in the signal increases (such as in complex biological samples) or when there is a large number of signals to analyse. Moreover, the analyst adds a subjective component to peak identification that depends on her/his expertise. For this cases, an automatic baseline estimation method is needed.
There are many automatic baseline estimation methods published, such as methods based on polynomial fitting [1], methods based on weighted least squares [2]- [4] or methods based on wavelets [5].
In this work, a modification to the Asymmetric Least Squares method is proposed. This modification is presented in order to improve the performance of the baseline estimation when large peaks are present in the signal. This paper is organized as follows: First a description of the synthetic and real datasets is given on section II. Then, on section III the three methods under comparison are described: in section III-A the original ALS method [6] is summarized; in section III-B the improved method airPLS [3] is described; and finally our proposed method "Peaked Signal's Asymmetric Least Squares Algorithm" (psalsa) is detailed in section III-C. Results and discussion are shown on sections IV and V respectively. Finally some conclusions are given.

II. DATA DESCRIPTION
Two Gas Chromatography datasets are used to compare the different methods: On the one hand, a synthetic dataset offers the possibility to assess objectively the performance of the different methods, as we know the real baseline added to the synthetic signal and therefore we can compute the error of the different baseline estimations. On the other hand, a real dataset lets us check how the different methods perform on real world samples, which inevitably are more complex than synthetic chromatograms.

A. Synthetic dataset
A dataset with N synth = 100 samples was generated. Each synthetic chromatogram lasted 30 minutes long with a sampling frequency of 2 Hz. Each sample was the combination of three components: a baseline, noise and a signal consisting of the addition of several peaks.
A synthetic chromatogram is shown at figure 1 1) Peak model: In order to generate the signal, several peaks are generated and placed randomly on the signal. As peak density of 0.25 peaks/s is chosen, a total of 450 peaks/sample are generated.
Peaks are modelled following a Generalized Exponential (GEX) function. The generalized exponential function [7] is an empirical peak model that has been used successfully [8] to describe chromatographic peaks, taking into account factors such as peak shape and peak asymmetry.
The GEX model is given by: 2) Baseline model: In order to generate a realistic baseline, it is generated as the addition of three contributions: where The parameters for each baseline contribution are chosen from random uniform distributions in the following ranges: [−π, π] 3) Noise model: Gaussian noise with A = 100 + 200t, µ = 0 and σ = 400 has been added to the signal. The amplitude increases with the retention time to simulate the fact that the end of the chromatogram is more noisy than the beginning.

B. Real samples
Chromatograms from a GC-MS dataset of human urine samples were used to test the proposed algorithm. Figure 2 shows samples from this dataset, notice the logarithmic scale Samples were analysed at the PCB (Barcelona Scientific Park) premises, using a gas chromatograph -mass spectrometer (Focus GC-DSQ II) from Thermo Scientific equipped with a quadrupole analyser and an electron multiplier detector. The capillary column used was DB-624 (60 m × 0.32 mm i.d.) coated with 6 % cyanopropylphenyl 94 % dimethylpolysiloxane (film thickness 1.8 µm). The temperature program of the chromatographic oven began at 60 • C (2 min), ramped to 220 • C at 8 • C min −1 and held for 5 min. The injection port was maintained at 220 • C throughout the experiments.

III. METHOD DESCRIPTION
In 1987, Newey and Powell introduced [9] Asymmetric Least Squares (ALS) in order to construct statistical tests for homoskedasticity applying them to Econometrics. More recently, Eilers et al. applied ALS for baseline estimation in connection to Parametric Time Warping alignment [10], and presented it in detail [6]. In 2010, Zhang et al. presented airPLS [3] which improved the weights of the original ALS method. Additionally, J Peng et al. [4] presented a different improvement to the original ALS method focusing on baseline estimation with multiple samples.

A. Original Asymmetric Least Squares
Given a signal y of length m, ALS aims to estimate a signal z smoother than y but still similar to it. ALS proposes a modelfree cost function given by: where d i = y i − z i are the residuals of the estimation and The first term in S accounts for the fidelity from z to y, while the second term imposes smoothness to z. Smoothness is controlled by parameter λ, usually chosen between 10 2 ≤ λ ≤ 10 9 . The cost function can be generalized by introducing weights w: These weights w are introduced so as, if properly defined, will be able to reject penalizations to the cost function produced by regions where the signal is above the estimated baseline (i.e. peaked regions).
The proposed definition of w is based on a parameter p which is usually chosen as 0.001 ≤ p ≤ 0.1: As one can see from the definition of w i and the values of p, regions where the signal is placed above the baseline will have a much smaller contribution to the penalty.
Minimization of 4 leads to: where W = diag(w) and D being the difference matrix: As there is no model imposed on z, there will be m equations forming a sparse system, where only the diagonal end two sub-diagonals above and below it are non-zero. A solution to eq. 4 can be found by iterating. Given an initial set of weights w i = 1, an initial estimation for z i can be computed. From z i , weights are computed and used to get a new estimation for z. Less than 20 iterations are needed for a proper estimation of z.
According to [6], a proper value for p may be validated by considering the histogram of the residuals d, so as the noise components are normally distributed near zero and peaks are represented in the histogram as a positive asymmetric component. The right value for p will produce a baseline that cuts the noise instead of fitting below or above it.

B. airPLS correction
Zhang et al proposed in [3] an improvement to the definition of w with two objectives: To remove the parameter p, simplifying the usage of the algorithm; and to improve the quality of the estimation by adapting the weights depending on the distance from the signal to the baseline.
The definition of the weight vector w for airPLS is as follows: where t is the current iteration. With this definition of weights, regions of the signal where the signal is above the estimated baseline are ignored at the next iteration. For the rest of the weights, the further the signal is from the baseline the least it contributes to the penalty.
Having the current iteration t in the exponent forces the weights to be smaller on each iteration, making more significant the smoothness term as iterations go on.
The criteria set by airPLS to stop iterating is given by either a maximum number of 20 iterations or by: The featured airPLS version 2.0 for MATLAB was used as the reference implementation. In this version, a p value is used to set the weights of points found at the beginning and at the end of the spectra as the adaptation of the weights does not give good estimates close to the signal limits.

C. Proposed method: psalsa
We propose a different definition for the weights much more similar to the original ALS algorithm. ALS is not able to fit very intense peaks because, even though a small p value is chosen, the peak residual is big enough to contribute significantly to the baseline fit. If a smaller value of p is used instead, then the baseline fits completely below the noise instead of cutting through it. Therefore, an adaptive value for the weights depending on the residuals is required and the following definition is proposed: The difference with respect to the original ALS method is on the positive residuals, where p is pondered by exp(− di k ). Peak regions will show large residuals getting smaller weights, whereas noise regions will present small residuals and weights close to p. k is an additional parameter that controls the exponential decay of the weights. An easy way to estimate k is by setting it to the peak height we want to start rejecting. Note that by taking the limit k → ∞ we recover the traditional ALS method.
As the original ALS method does, the criteria used by psalsa to stop iterating is given by either a maximum number of iterations (usually 20) or when the residuals do not change of sign with respect to the previous iteration.

A. Synthetic chromatograms
The three described methods were applied to the synthetic chromatograms. An example of how the different baseline estimations look like is shown on figure 3.
In order to estimate the best parameters for each method, the parameter space was swept. For each sweep, the root mean square error (RMSE) was computed applying equation 10 to each sample. The RMSE values were averaged obtaining a global RMSE. The optimal parameter values for the synthetic database were chosen as the parameters with the smallest global RMSE.  In order to compare the three algorithms, figure 4 shows a boxplot of the RMSE distribution for the different methods in their optimal settings.
As the psalsa algorithm uses an additional parameter to control the exponential decay of the weights, we wanted to check the influence of that parameter on the overall performance. Several values of k chosen on a wide range of orders of magnitude were tested, obtaining figure 5.

B. Real samples
The three methods were applied to real samples. Figure 6 and figure 7 show the estimated baselines on different regions of a real urine sample.

V. DISCUSSION
The original ALS algorithm was not designed specifically to fit signals with peaks several orders of magnitude above the baseline. Considering eq. 4, one may notice that even though a small value for w i is given for d i > 0, given a large enough d i , its contribution to S may still be dominant, producing an estimation of the baseline which contains part of the peak area. In order to avoid the over-fitting, the value for p has to be chosen so as the baseline is not over-fitted to the peaks, instead of being chosen so as it cuts through the noise as described by [6]. This means that the value for p will have to be smaller, leading to baseline estimations below the real baseline. Given that the estimation is below the baseline, a flexible baseline will be easier to adapt to the real baseline whenever possible, that is the reason why λ values are smaller in the ALS method with respect to the other methods.
Therefore, on the analysed signals, the parameters which minimize the RMSE on the ALS method are chosen to be able to properly fit the large peaks, instead of according to their theoretical purpose.
On the other hand, the airPLS algorithm is able to cope with large peaks, as it gives w i = 0 for d i > 0. Unfortunately, that approach again leads to baselines fitted below the noise level instead of cutting through it. The airPLS algorithm was designed with the aim of removing the p parameter, and indeed p contribution is less relevant to the final estimation than the contribution of p at the original ALS algorithm, as it is only used at the boundaries of the signal.
Finally, psalsa algorithm does not suffer the issues of the original ALS method, as the exponential modulation reduces the contribution to S of the large peaks. This makes it possible to use p to enforce that the baseline crosses the noise level, instead of fitting below it. p value is not comparable directly to the ALS method, as its contribution is modulated by the exponential.
Even though psalsa requires an additional parameter (k) to control the exponential decay of the weights, figure 5 shows that the RMSE value is smaller on psalsa on a range of three orders of magnitude, making it easy to provide a value for k that improves ALS results.
When applying the three methods on real samples, we can confirm how psalsa is able to estimate a baseline cutting through the noise, instead of being under-fitted as happens with the other two methods.

VI. CONCLUSION
The psalsa algorithm is able to fit baselines of chromatographic signals with large peaks, improving the results of the original ALS method and the airPLS algorithm without adding computational complexity. Even though it requires an additional parameter k, it is easy to provide a reasonable estimate by relating it to the height of the peaks.
The proposed algorithm has been applied to both synthetic and real chromatograms obtaining successful estimations in both cases.
The algorithm is being adapted to work with Ion Mobility Spectrometry datasets and results will be published in a near future.