## Abstract

In this work, we optimize and further advance a noise reduction scheme for heterodyne spectroscopy. This scheme linearly combines data from reference detectors to predict the noise statistics in the signal detector through an optimized coefficient matrix. We validate this scheme for visible white-light-continuum and 800-nm light sources using un-matched CMOS arrays and show that the signal-to-noise ratio can approach the noise floor of the signal detector while using only ~5% of the energy for reference detection. We also optimize the strategy for estimating the coefficient matrix in practical applications. When combined with elaborate algorithms to perform pixel data compression and expansion, our scheme is applicable in difficult situations, including when the sample position is rapidly scanned, when detectors exhibit nonlinear response, and/or when laser fluctuations are large. The scheme is generalized to scenarios with complex chopping or phase cycling patterns, and a simple approach is provided for the chopping case. Finally, a robust and computationally efficient method is devised to remove multiplicative noise.

© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Heterodyne detection is a widely used technique in optical spectroscopy because it can preserve signal phase information and achieve a high signal-to-noise ratio (SNR) even for weak signals. In heterodyne spectroscopy, a sample signal field is mixed with a local oscillator (LO) and detected together. As a result, the LO intensity fluctuations often dominate the total noise in heterodyne detection and severely deteriorate the final SNR. The LO noise can also introduce baseline shifts and signal distortions [1,2]. Therefore, noise reduction is critical in all heterodyne spectroscopies, especially the self-heterodyne techniques where detectors with higher full-well-capacity (FWC) are preferred [3–6]. For this class of spectroscopies, the LO itself is involved in exciting the sample signal [7], and is called the probe beam, named after the widely used pump-probe spectroscopy. For the other class of heterodyne spectroscopies, such as photon echo, the LO can be attenuated independently without weakening the sample signal. Because detectors with lower FWC are often used in this case [8–11], LO fluctuations contribute less to the total noise, but still requires proper treatment for optimal SNR.

A common way to suppress noise in heterodyne spectroscopy is via reference detection, where a second photodetector is employed to measure and compensate the LO fluctuation. Various referencing methods differ from each other mainly in the choice of signal and reference detectors, and the algorithm to process their outputs. Conventional methods can use different combinations of detector types: both detectors are single-pixel [3,5,12–14]; the signal detector is an array and the reference detector is single-pixel [1,10,15]; or both detectors are matched arrays [2,4,6,13,16,17]. The outputs from two detectors are processed by either a subtraction algorithm [5,12,14] or a ratiometric algorithm [1–6,10,16–18]. However, whatever the choice of detector type or algorithm, a property intrinsic to all conventional methods is that any individual signal pixel can only be referenced by one reference pixel. That specific reference pixel is either a single photodiode or assigned by accurate wavelength mapping.

When both detectors are single-pixel, conventional methods are mostly sufficient. However, if spectral information is recorded, conventional methods suffer from various limitations. Using a single photodiode cannot adequately monitor wavelength-dependent fluctuations. Using a matched reference array requires careful alignment and calibration, but still cannot suppress noise down to the noise floor of the signal detector. In our previous work [19], we devised a very versatile noise reduction scheme that allows using the information from all reference pixels to suppress noise in any individual signal pixel through a minimization routine. As demonstrated with mid-IR sources detected by mercury cadmium telluride (MCT) arrays, this scheme greatly improves SNR and simplifies alignment.

In this work we further advance and optimize our scheme. First, a theoretical description of the scheme is given in Section 2, as the starting point for later sections. After suppressing additive noise, we present a robust and computationally efficient method to remove multiplicative noise. In Section 3.1, we present the current demonstration setup which features visible white-light-continuum (WLC) and 800-nm light sources detected by CMOS arrays. Basic rules for detector choices are discussed in Section 3.2 and referencing performance for various combinations of light sources and detectors are discussed in Sections 3.3 and 3.4. Section 4.1 discusses the practical issue of training-set implementation, with consideration of the extra time cost. New algorithms are developed in Sections 4.2 and 4.3 to fully exploit the information content of many-pixel detectors. In Section 4.2, pixel data compression is used to facilitate estimation of the coefficient matrix, which is very important for applications such as hyperspectral imaging. In Section 4.3, pixel data expansion is used to push the scheme beyond the linear approximation, which is especially useful when detectors exhibit nonlinear response [20] and/or when light sources are very noisy. Finally, the scheme is generalized to scenarios with complex chopping or phase-cycling patterns in Section 4.4. In particular, a simple yet robust choice of the training set is suggested for the chopping case, which allows our scheme to be applied without any modification to the routine data collection procedures.

## 2. Theoretical description

Consider a general setup of heterodyne spectroscopy with reference detection: a signal field is created in a sample by some excitation/pump beams, mixed with a LO, and detected by a signal array detector (*g* pixels). To reduce the effect of laser fluctuation, a replica of LO is detected by a reference array detector (*h* pixels). ${I}_{tot}$(not ${I}_{Sig}$ or ${I}_{LO}$) and${I}_{Ref}$denote the spectra acquired by the signal and reference arrays, respectively. It should be noted that ** I** is just the detector output, rather than the true intensity, meaning that it always contains photon shot noise and detector noise.

In real experiments, both detectors acquire spectra many times. Therefore, ${I}_{tot}$and${I}_{Ref}$are saved as matrices with *g* and *h* columns, respectively. The *i*^{th} column of either matrix, denoted by ** I**(

*i*), is the output from the

*i*

^{th}pixel of that detector. For mathematical convenience, we treat the row indices (corresponding to different acquisition events) implicitly, each column as a random variable, and the matrix as a random row vector. Accordingly,$\u3008I\u3009$,$\mathrm{var}\left(I\right)$, and $\sigma \left(I\right)$are the expected value, variance, and standard deviation of the random vector

**, respectively, and therefore are row vectors themsevles.**

*I*For the signal array, we can decompose${I}_{tot}$into three contributions:

where

*S*_{h}is the heterodyne term to be extracted from the measurement,${I}_{LO}$and${I}_{Sig}$are the homodyne contributions from LO and sample signal fields, respectively.${I}_{Sig}$can be removed by phase-cycling or spectral interferometry when applicable. We will simply ignore${I}_{Sig}$in subsequent derivations because noise reduction is most important when this term is much weaker than the other terms. The strong${I}_{LO}$ background is removed by chopping or phase-cycling. In either case, two distinct acquisition events occur consecutively, which are marked by superscripts * and ′. For chopping, * refers to pump-on and ′ refers to pump-off. For phase-cycling, * and ′ refer to phase 0 or π. We define Δ as a difference operator which applies to random variables in this form: $\Delta I\triangleq {I}^{*}-{I}^{\prime}$. More complex definitions of Δ will be discussed in Section 4.3. Applying Δ to Eq. (1), we get:

*χ*^{(}

^{n}^{)}is the nonlinear susceptibility of the sample with the superscript

*n*indicating the order of field-matter interactions. The convolution with

*n*electric fields is included in

**. The specific expression of**

*F***dependes on the technique and experimental details. In the pump-probe configuration, ${\chi}^{\left(n\right)}\circ F$is equal to $\Delta OD\circ {I}_{LO}$, where ΔOD is the absorbance change defined in e-base.**

*F***is then equal to${I}_{LO}^{\text{*}}{I}_{\text{Pu}}^{\text{*}}$ for chopping and $({I}_{LO}^{\text{*}}{I}_{\text{Pu}}^{\text{*}}+{{I}^{\prime}}_{LO}{{I}^{\prime}}_{\text{Pu}})$for phase-cycling [19]. In principle,**

*F***should be evaluated with the true intensities of relevant beams. In practice, true intensities are unknown and must be approximated by detector outputs:${I}_{\text{Pu}}$is usually obtained from a photodiode and ${I}_{LO}$ can be approximated by the signal or reference array (see Section 4 of [19]). As a result, the realistic**

*F***data contains shot noise and detector noise.**

*F**χ*^{(}^{n}^{)} is the true information we want to obtain from $\Delta {I}_{tot}$. However, $\Delta {I}_{tot}$ is always corrupted by two noise components:$\Delta {I}_{LO}$and ** F**, which contains laser fluctuation, photon shot noise, and detector noise. Through their distinct ways of corrupting $\Delta {I}_{tot}$, $\Delta {I}_{LO}$is called the additive noise and

**is called the multiplicative noise (or convolutional noise). The definition of Δ**

*F***gurantees that $\u3008\text{\Delta}{I}_{LO}\u3009=0$, which means that the baseline converges to zero for Eq. (2). In general $\Delta {I}_{LO}$ is close to Gaussian white noise, as we showed previously [19], and hence $\sigma \left(\text{\Delta}{I}_{LO}\right)$describes the magnitude of the additive noise. For typical heterodyne nonlinear spectroscopy, additive noise dominates and multiplicative noise has a relatively minor impact because it is sclaed by the small**

*I*

*χ*^{(}

^{n}^{)}. In the pump-probe configuration, ${\chi}^{\left(n\right)}\circ \sigma \left(F\right)$is on the same order as $\Delta OD\circ \sigma \left({I}_{LO}\right)$, and usually orders of magnitude smaller than $\sigma \left(\text{\Delta}{I}_{LO}\right)$ for a typical ΔOD ~10

^{−3}. The additive noise is therefore the main target of noise reduction.

With our reference scheme [19], the additive noise can be suppressed in the following form:

**is a**

*B**h*×

*g*matrix, $\Delta {I}_{Ref}B$is a matrix multiplication between $\Delta {I}_{Ref}$and

**, andis the residual additive noise after referencing. Previously we showed that Eq. (3) is a general expression for referencing calculations [19]. All conventional methods are special cases where**

*B***is either a**

*B**g*×

*g*diagonal matrix (for matched arrays where

*h*=

*g*) or a 1 ×

*g*matrix (for a single reference photodiode), where the matrix elements of

**are usually the ratios between the mean intensities of the signal and reference detectors. Equation (3) generalizes conventional methods so that**

*B**h*no longer needs to be 1 or

*g*and so that reference arrays can be strategically chosen.

The definition of Δ** K** gurantees that 〈Δ

**〉 = 0 for any**

*K***because$\u3008\Delta {I}_{LO}\u3009=\u3008\Delta {I}_{Ref}\u3009=0$. This condition gurantees that the baseline converges to zero for Eq. (3). Therefore, we can choose a**

*B***that minimizes each$\sigma \left(\text{\Delta}K(i)\right)$. This optimized**

*B***can be solved for by ordinary least squares:**

*B*

*M*^{−1}is matrix inversion. The relative noise reduction after referencing is:

*R*

^{2}in linear regression. The additive noise is therefore lowered by a factor of$\sqrt{1-{R}^{2}}$ after referencing, and the nonnegative

*R*

^{2}gurantees that the noise level will not increase for any reference detector.

After suppressing the additive noise, ** F** is divided out (division involving matrices or vectors is an element-wise operation) from both sides of Eq. (3) to remove the multiplicative noise and recover

*χ*^{(}

^{n}^{)}. However, dividing by a realistic

**inevitably re-introduces some shot noise and detector noise into the results, and may even cause numerical instability in spectral ranges with low intensity. In most experiments, the residual noises are further reduced by averaging over repeated measurements with the same experimental parameters (e.g. the time delay in transient absorption). Therefore, the order of the referencing calculation and averaging is important.**

*F*With ratiometric methods, the division of ${I}_{tot}$ by ${I}_{Ref}$ must be performed before averaging [1] because these two operations do not commute. In our scheme, the specific form of Eq. (3) allows averaging $\Delta {I}_{tot}$,$\Delta {I}_{Ref}$, and ** F** data before combining them, which gives:

_{S}indicates a sample mean over

*n*

_{S}repeated measurments, in contrast to the expected value 〈 〉 (true mean over infinite samples). ${\u3008\Delta {I}_{tot}\u3009}_{\text{s}}$,${\u3008\Delta {I}_{Ref}\u3009}_{\text{s}}$, and 〈

**〉**

*F*_{S}are all averaged from the corresponding shots with the same parameters, which also greatly reduces the computational cost, especially for high-repetition-rate systems [21–23]. Finally 〈

**〉**

*F*_{S}is divided out from both sides of Eq. (7) to remove the multiplicative noise and recover

*χ*^{(}

^{n}^{)}. It is more accurate and robust to perform the division operation only once after averaging, as compared with performing it on a shot-to-shot basis as we suggested previously [19], because the shot noise and detector noise in

**are effectively averaged out in 〈**

*F***〉**

*F*_{S}. When the multiplicative noise is very weak, the final division step can be omitted.

Because Δ** K** is zero-mean white Gaussian noise (see experimental illustration in Section 3.1), 〈Δ

**〉**

*K*_{S}is also zero-mean white Gaussian noise with $\sigma \left({\begin{array}{c}\u3008\Delta K\u3009\end{array}}_{\text{s}}\right)={n}_{\text{s}}{}^{-1/2}\sigma \left(\text{\Delta}K\right)$. When calculating 〈

**〉**

*F*_{S}for the chopping case,${\u3008{I}_{LO}^{\text{*}}{I}_{\text{Pu}}^{\text{*}}\u3009}_{\text{s}}$can be reasonably approximated by the more convenient ${\u3008{{I}^{\prime}}_{LO}{{I}^{\prime}}_{\text{Pu}}\u3009}_{\text{s}}$ because their difference scales as

*n*

_{S}

^{−1/2}. After factoring out 〈

**〉**

*F*_{S}, the residual noise 〈Δ

**〉**

*K*_{S}/ 〈

**〉**

*F*_{S}is still zero-mean because 〈

**〉**

*F*_{S}is uncorrelated with 〈Δ

**〉**

*K*_{S}. This property gurantees a clean sample spectrum without baseline shifts or signal distortions for any reference detector. In contrast, ratiometric referencing often suffers from these issues [1].

Equations (3) and (7) (without or with averaging) are the basic equations to accomplish noise reduction, whereas the value of ** B** has to be calculated first with Eq. (5). However, the actual value of$\Delta {I}_{LO}$is unkown when a sample signal is present, so

**has to be estimated from blank shots, which are defined as the acquisition events without a sample signal. The acquisition events with a sample signal are defined as signal shots. The blank shots are taken under exactly the same experimental conditions as the signal shots, except that (one of) the excitation/pump beam(s) is blocked. Details about**

*B***estimation are given in Section 4.**

*B*## 3. Basic demonstration

#### 3.1. Experimental setup and noise statistics

To demonstrate the statistics of noise and its suppression, we collected data with the pump-probe setup illustrated in Fig. 1. Two light sources were used. The femtosecond 800-nm pulse with a bandwidth of ~15 nm (full-width-half-maximum) was directly from a 1-kHz Ti:sapphire regenerative amplifier. The WLC was generated by focusing ~2 μJ of 800-nm onto a 10-mm thick water cell and collimating the outgoing continuum. The water cell was removed when taking the 800-nm spectra. The outgoing beam from the light source was split into a probe/LO beam and a reference beam. The probe beam was focused onto a sample cell, where it was overlapped with a pump beam. For the purpose of noise analysis only, all data were collected with the pump off and no real sample was present. The LO and reference beams were then dispersed by two home-built spectrographs, one with a focal length of 300 mm and the other 250 mm. For the 800-nm light, 1800-grooves/mm gratings were used in the spectrographs. For the WLC, 300-grooves/mm gratings were used. Two different CMOS sensors were used for detection: DH is a high FWC detector (Hamamatsu S10453-1024). It has 1024 pixels with a 25-µm pixel width. DL is a low FWC detector (Hamamatsu S11639-01). It has 2048 pixels with a 14-µm pixel width. DH and DL are mounted, respectively, behind the spectrographs with focal lengths of 250 mm and 300 mm. This arrangement was to provide a similar spectral range for two detectors: about 780–820 nm for the 800-nm beam, and 400–740 nm for the WLC, with DH detecting a ~7% broader range than DL. The driver circuits are Synertronic Designs LineScan-II for DH, and Hamamatsu C13015-01 for DL. Both readout electronics have 16-bit analog-to-digital converter (ADC) output and are synchronized with each other and to the 1-kHz laser repetition rate. All the data processing, including referencing calculations, are easily handled by a standard desktop PC.

With this experimental setup, we collected one-minute data on both detectors using either 800-nm light or WLC as the light source. The mean spectra are plotted in Figs. 2(a) and 2(b). The maximum intensities on both arrays are controlled to be near the top of the linear response range. Roughly 1–2 nJ of pulse energy entered the spectrograph for DH, and ~5% of that entered the spectrograph for DL. Because the referencing scheme described in this work does not require pixel correspondence between two arrays, we did not perform wavelength calibration.

There are two important properties of typical ultrafast light sources that justify the effectiveness of our referencing scheme. The first property is that spectral correlation exists between pixels corresponding to different wavelengths, as plotted in Figs. 2(c) and 2(d). The 800-nm light is representative of light sources with relatively simple spectral correlation, such as those generated by optical parametric amplifiers. WLC is commonly used as a probe beam and it has complex spectral correlation. Spectral correlation indicates a redundancy of information because multiple pixels on the reference array are correlated with a single pixel on the signal array. This reason is why using a linear combination of many reference pixels is better than using one reference pixel corresponding to the same wavelength.

The second property is that Δ** I** is well represented by zero-mean Gaussian white noise, as illustrated in Fig. 3 for the WLC data. The distribution of Δ

**can be fully described by its standard deviation. In contrast,**

*I***exhibits 1/f noise features which are much more complex. The Δ operator effectively simplifies the statistics of laser intensity fluctuation. Similar to Δ**

*I***, Δ**

*I***is also well-represented by zero-mean Gaussian white noise except that it has a much narrower distribution. The 800-nm data exhibits basically the same characteristics.**

*K*#### 3.2. Decomposition of noise contributions and properties of detectors

To demonstrate the differences between two CMOS sensors and their respective advantages, the standard deviation of additive noise, Δ*I*, is decomposed as:

*I*is defined for a pair of shots.${N}_{\text{L}}$is the number of electrons generated in a pixel from each laser pulse, and${N}_{\text{L}}\times {p}_{2}$ is the contribution from laser intensity fluctuation where

*p*

_{2}is a measure of the fractional change between two consecutive shots.${(2{N}_{\text{L}})}^{1/2}$is the photon shot noise,$\sqrt{2}{N}_{\text{D}}$is the detector noise (counted by the number of electrons) including contributions from pixel dark current and all readout circuits. ${(2{N}_{\text{L}}+2{N}_{\text{D}}{}^{2})}^{1/2}$is the noise floor of that pixel. Ideally, the shot noise, detector noise and laser noise are all uncorrelated with each other, and thus their contributions are added up in a root-sum-square way. To illustrate the relative weights of different noise terms, we plot them as functions of laser intensity in Fig. 4 based on parameters estimated from the sensor specifications and real measurements: DH has a FWC of ~2 × 10

^{6}electrons and dynamic range of ~2800; whereas the DL has a FWD of ~8 × 10

^{4}electrons and dynamic range of ~3500. The

*p*

_{2}factor is set to 0.5% to simulate a relatively stable light source. From Fig. 4(a), shot noise dominates the noise floor of DL except at very low intensities, and shot noise also accounts for a large part of total noise except at high intensities. This means that the shot noise limit can be approached in DL with a medium range intensity and a relatively stable light source. In comparison, the total noise of DH is dominated by detector noise at very low intensities, and by laser noise at higher intensities in Fig. 4(b). The shot noise limit is never approached in DH for all intensity ranges.

For experiments where the LO intensity can be adjusted independently without affecting the sample signal, DL is a better signal detector because the shot noise limit is relatively easily approached. For self-heterodyne experiments, we will show in the next section that DH is better because it can realize a much higher SNR when the laser noise is properly suppressed.

Equation (8) describes the decomposition of $\sigma \left(\text{\Delta}I\right)$when only a signal array is used. We can derive a similar equation for Δ** K** when a reference array is used. For each array, we can decompose the total noises into contributions from the laser fluctuation and the pixel noise floor: $\Delta {I}_{LO}=\Delta {I}_{LO}^{L}+\Delta {I}_{LO}^{P}$ and $\Delta {I}_{Ref}=\Delta {I}_{Ref}^{L}+\Delta {I}_{Ref}^{P}$, where the superscript

**L**(aser) and

**P**(ixel) denotes the sources of noise. The two noise sources are uncorrelated with each other, so$\mathrm{var}\begin{array}{c}\left(\Delta {I}_{LO}\right)=\mathrm{var}\left(\Delta {I}_{LO}^{L}\right)\end{array}+\mathrm{var}\left(\Delta {I}_{LO}^{P}\right)$, which is equivalent to Eq. (8). Invoking Eq. (4), we get $\begin{array}{c}\Delta K=\left(\Delta {I}_{LO}^{L}-\Delta {I}_{Ref}^{L}B\right)+\left(\Delta {I}_{LO}^{P}-\Delta {I}_{Ref}^{P}B\right)\end{array}$. When there is no unwanted electronic crosstalk between signal and reference detectors, the laser noise is the only common-mode noise shared by both detectors, which yields:

For the conventional referencing methods, in the ideal situation where two identical arrays are used to detect the LO and reference spectra of the same intensity, ** B** is assumed to be an identity matrix. In this case,$\mathrm{var}\left(\Delta K\right)$is equal to$\mathrm{var}\left(\Delta {I}_{LO}^{L}-\Delta {I}_{Ref}^{L}\right)+2\mathrm{var}\left(\Delta {I}_{LO}^{P}\right)$. This relationship means that the residual noise with conventional methods is at least$\sqrt{2}\sigma \left(\Delta {I}_{LO}^{P}\right)$, and any mismatch between the two arrays will further increase the noise level. This$\sqrt{2}$factor limits the application of conventional reference detection to situations where laser noise is much higher than the pixel noise floor, not to mention the added complications of carefully matching two arrays. With our new scheme, not only can the residual noise theoretically approach $\sigma \left(\Delta {I}_{LO}^{P}\right)$ (depending on

**), but also the alignment of reference array is much simpler because no careful matching is needed.**

*B*#### 3.3. Referencing performance

One way to evaluate the performance of our reference scheme is to calculate$\sigma \left(\text{\Delta}K\right)$and compare it with$\sigma \left(\text{\Delta}I\right)$and the pixel noise floor of the signal detector (in unit of counts). However, to directly compare numbers between detectors of different characteristics, it is useful to introduce a dimensionless quantity,$SN{R}_{0}\triangleq \u3008I\u3009/\sigma \left(\text{\Delta}K\right)$with referencing or $\u3008I\u3009/\sigma \left(\text{\Delta}I\right)$without referencing. *SNR*_{0} characterizes the ability to distinguish a small excitation-induced intensity change from the shot-to-shot fluctuation of the LO itself. In self-heterodyne experiments, ΔOD × *SNR*_{0} is equal to the final SNR of the measurement, and 1/*SNR*_{0} describes the detection limit.

Figure 5 shows the referencing performance with different light sources and with different choices of detectors. For each light source, we can assign DH as the signal detector and DL as the reference detector, or vice versa, and process the collected data accordingly based on the designation. When all reference pixels are used in the referencing calculation, no matter which light source is used and how each detector is assigned, *SNR*_{0} after referencing (the red curve) is very close to the *SNR*_{0} simulated from the pixel noise floor (the green curve) for all three cases shown in Fig. 5. However, the degree of noise suppression depends on the detector choice and light source. Assigning DH as the signal array, the maximum *SNR*_{0} achieved for both 800-nm light and WLC are ~800, but assigning DL as the signal array, the maximum *SNR*_{0} achieved is only ~200. This result indicates that DH is a better signal detector in self-heterodyne experiments. The maximum *SNR*_{0} here is mainly limited by the shot noise. At the same time, it is preferable to use DL as the reference detector because it only requires ~5% of the pulse energy for reference detection, yet still achieves satisfactory noise reduction. Compared with conventional methods where LO/probe energy is split in half for referencing [23], our reference scheme enables efficient use of energy to maximize *SNR*_{0}. This energy efficiency is a critical advantage, especially when the available LO/probe energy is low.

Although the noise floor is approached with just ~5% of energy for referencing, further reducing this energy cost may impair the performance. This effect can be examined by evaluating *SNR*_{0} using the same reference array but putting it behind a less dispersive spectrograph, or by using another reference array with lower FWC. For best consistency and convenience, we simulate this situation by using the same data set, but keeping only the data from one reference pixel out of every *s* reference pixels. This approach is equivalent to reducing the total energy on the reference detector by a factor of 1/*s* without altering the spectral range. Comparing Figs. 5(a) and 5(b), with the same WLC source, the *SNR*_{0} in (a) drops very fast as *s* is increased from 2, 8, to 64, whereas the corresponding *SNR*_{0} in (b) has little drop. This behavior is straightforward to interpret: in (b) the energy on the reference array is still ~30% of that on the signal array after keeping one out of every 64 pixels, whereas in (a) the percentage becomes only ~0.08%. Because accurate information about intensity fluctuation cannot be extracted without enough photons, a sufficient amount of energy should still be spent on referencing with our scheme, although it usually requires much less energy than conventional methods. Such an increase in energy cost can be facilitated by increasing the number of reference pixels within the relevant spectral range, or by switching to another reference array with higher FWC.

Equation (5) shows that spectral correlation can affect the residual noise through its role in determining ** B**. This relationship suggests that different light sources may need different energy costs to get satisfactory performance. Comparing Figs. 5(a) and 5(c), the decrease of

*SNR*

_{0}with increasing

*s*is faster for WLC than for 800 nm. This behavior can be rationalized by recognizing that WLC has more complex spectral correlation pattern, consistent with the intuition that WLC needs better referencing.

#### 3.4. Flexible choice of the reference array

Because ** B** can be calculated for arbitrary reference detectors, the physical form of the reference array is no longer important as long as it is correlated with the signal pixels. For example, it is possible to use a photodiode and an array together for referencing, or even use some regions on the signal array instead of a separate reference detector while the data is still processed the same way as having a separate array. The effectiveness of the latter case is demonstrated through the black curve in Fig. 5(c) for the 800-nm light. We calculate this curve under the assumption that the sample signal appears only around the peak maximum in Fig. 2(a) but not in the wings of the spectrum, and used part of DH pixels whose intensity is below 1/4 of the peak intensity as the reference pixels. The resulting

*SNR*

_{0}is lower than referencing with all, 1/2, and 1/8 DL pixels, but is still significantly improved over no referencing.

This flexibility can be very useful in some situations, such as that discussed in Ref [24], where noise reduction is needed but impossible with conventional methods. The only prerequisite for such regions is that they do not contain any pump-induced signal (including scattering) in order to satisfy the condition that $\u3008\Delta {I}_{Ref}\u3009=0$. Otherwise, referencing using these regions will incorrectly add a background to the real signal. With this choice, the referencing pixels reside in the otherwise unused regions of the signal array and are therefore automatically synchronized. However, in practice, such sample-signal-free regions may not exist, depending on the experimental configurations, or they are not known beforehand. In these cases, using a separate reference detector is still the most robust approach.

## 4. Advanced demonstration

#### 4.1. Implementation of the training set and convergence of **B** estimation

We define the estimation of ** B** using Eq. (5) as the training step of

**, and the data set used in this step is defined as the training set. Because the exact value of $\Delta {I}_{LO}$involving any signal shot is unknown, the training set must be composed of only blank shots. The estimated**

*B***is then applied in Eqs. (3) or (7) to suppress the additive noise, which is defined as the application step of**

*B***, and the data set used in this step is defined as the test set. The test set must contain signal shots in real experiments to measure the sample signal, but in our demonstration, it only contains blank shots in order to evaluate the performance of**

*B***estimation. In Section 3, the entire one-minute data set is used as both the training set and the test set. Because the training set should only be a subset in real data collection, it is expected that**

*B***estimated from a subset will not perform as well as**

*B***estimated from the whole set. Therefore, the referencing performance illustrated in Fig. 5 represents the best possible performance, and the real performance will depend on the size and distribution of the training set. In this section, we will optimize the strategy for generating a training set.**

*B*Consider that we collect a data set of a fixed length as shown in Fig. 6. The training set (in red) has a total number of *N*_{b} shots and the test set (in blue) has *N*_{t} shots. To be compatible with the case of phase-cycling, where the test set is composed of all signal shots, the training set considered here is generated by collecting extra blank shots. Hence, it does not have any overlap with the test set. While the discussion here equally applies to the case of chopping, the chopped shots are blank shots themselves, and thus it is possible to include them in the training set. We will present a simple way to deal with chopping without extra blank shots in Section 4.4. Because all calculations are based on Δ** I**, which is defined for a consecutive pair, we use

*n*

_{t}and

*n*

_{b}to denote the number of consecutive pairs in the test set and training set, respectively. For the test set, ${n}_{\text{t}}={N}_{\text{t}}/2$ where each consecutive pair contains a sample signal that is not necessarily taken with the same experimental parameters as other consecutive pairs.

The training set can have a quite flexible distribution. For the fully-aggregated distribution in Fig. 6(a), the training set is composed of a single segment that contains only and all blank shots. When there are more than two blank shots in a segment, consecutive pairs can be obtained by taking all possible combinations of adjacent shots. This gives $n\text{b}=N\text{b}-1$. For the fully-dispersed distribution in Fig. 6(b), two consecutive blank shots are inserted as pairs and the pairs are dispersed evenly throughout data collection. This gives $n\text{b}=N\text{b}/2$. For the 3/4-dispersed distribution in Fig. 6(c), four consecutive blank shots are inserted at a time and the quadruplets are dispersed evenly throughout. Because 4 consecutive blank shots can produce 3 consecutive pairs, this gives $n\text{b}=3N\text{b}/4$.

The quality of ** B** estimation can be characterized by a dimensionless quality factor,

*q*:

**,**

*B**q*can be used to characterize how different factors in estimating

**affect the noise reduction performance. In this section, we use**

*B**q*to quantify the effect of training set size and distribution. The $\sigma \left(\text{\Delta}K\right)$in both the numerator and denominator should be evaluated on the same test set, so that

*q*is only sensitive to difference in

**. To make $\sigma \left(\text{\Delta}K\right)$an unbiased estimator for test sets of finite sizes, proper normalization factors should be applied in the calculation of $\sigma \left(\text{\Delta}K\right)$depending on the**

*B***used. The specific**

*B***in the numerator is estimated from a training set which does not overlap with the test set. Therefore, the corresponding $\sigma \left(\text{\Delta}K\right)$is normalized by ${n}_{\text{t}}-1$(Bessel's correction) where one degree of freedom was removed through the calculation of the sample mean. The standard**

*B***in the denominator is estimated by using the test set itself as the training set. Therefore, the corresponding$\sigma \left(\text{\Delta}K\right)$is normalized by ${n}_{\text{t}}-1-h$(standard error of the regression), where the degree of freedom is further reduced by the number of reference pixels (**

*B**h*) used for determining

**. Because the standard**

*B***is the best possible**

*B***but unknown in real data collection,**

*B**q*is always larger than 1 and it describes how close a feasible

**estimation can approach its optimal value.**

*B*Figure 7(a) illustrates how *q* depends on the size and distribution of the training set. All the calculations are performed on the one-minute WLC data. Here *h* is 128 because we bin every 16 neighboring pixels into one effective pixel before the referencing calculations. For the fully-aggregated distribution, because the relative position of the training set with respect to the test set can affect *q*, we designate 50 different positions within one minute to place the training set and calculate the corresponding *q*. The mean value and standard deviation of the 50 *q* values for a given *n*_{b} are used to create one point with an error bar in the figure. When the training set is very large, *q* converges to 1 for any distribution. The fully-dispersed distribution has the lowest *q* because it can most closely represent the statistical fluctuations in the test set over a period of time. The fully-aggregated distribution is the least representative and has the highest *q*, whereas the 3/4-dispersed distribution is between the other two cases. In terms of the position preference for the training set in the fully-aggregated distribution, the middle positions are not statistically better than edge positions. Also, *q* becomes less sensitive to the training-set position (smaller standard deviation) as the training-set size increases. For the fully-dispersed distribution, it is found that *q* accurately follows Eq. (11):

*q*only depends on

*n*

_{b}and not on

*n*

_{t}, after$\sigma \left(\text{\Delta}K\right)$is properly normalized to be unbiased.

Although a larger training set is preferable for yielding a better ** B** estimation and better noise reduction, it also spends more time on collecting blank shots, which reduces the time that could be used to average more signal shots to improve SNR. This is a trade-off that should be balanced. To quantify the competing effects between lowering

*q*and reducing the time cost, we define an overall performance factor that scales with the square-root of experimental time needed to achieve a certain SNR:

The definition of *Q* involves the number of shots *N*_{b}, but not the number of consecutive pairs *n*_{b}, because it is *N*_{b} that determines the time cost for referencing, and the factor *N*_{b} */ N*_{t} is a measure of this time cost. The lower the *Q*, the higher the overall SNR. To illustrate how *Q* depends on *N*_{b} under the same amount of signal-shot averaging (at fixed *N*_{t}), we set *N*_{t} to 60K and use the *q* from Fig. 7(a) and the corresponding *N*_{b} to calculate *Q*. The resulting *Q* is plotted as a function of *N*_{b} in Fig. 7(b). Unlike *q*, *Q* has a minimum which represents the optimal trade-off point. Surprisingly, the minimum *Q* values are very similar for different distributions, and the 3/4-dispersed distribution marginally gives the lowest *Q* value. Although the fully-dispersed distribution has the lowest *q* for the same *n*_{b}, the corresponding *N*_{b} is the largest which amplifies *Q*. Around its minimum, the *Q* curve is very flat and therefore insensitive to the exact value of *N*_{b}, which is a desired property for robust performance and easy implementation. The training-set size required to achieve the minimum *Q* is also quite similar for different distributions.

Different training-set distributions have different advantages. The fully-aggregated distribution can be easily implemented by using a motorized shutter to block any pump beam. To test the referencing performance, blocking any pump beam manually is also convenient. In contrast, dispersed distributions usually require a fast modulator and some programming to insert blank shots evenly between signal shots. However, the dispersed distributions are more robust than the fully-aggregated distribution because their *Q* factors are much less affected by fluctuations or drift between the training set and the test set.

The above discussion about *Q* corresponds to a situation where ** B** estimation is periodically updated after collecting a test set with

*N*

_{t}( = 60K in Fig. 7(b)) shots. For some experiments, updating

**is necessary when LO/probe spectra (and hence $\Delta {I}_{LO}$) are actively changed due to an experimental parameter change, such as scanning the sample position. Updating**

*B***is not always required in most other situations, such as scanning time delays or pump wavelengths, because LO/probe spectra are not actively changed and many experimental systems can often remain stable for hours. As demonstrated in our previous work [19], a fixed**

*B***estimated from a fully-aggregated 8-sec-long training set worked well for a whole night, unless there was external disturbance. Nevertheless, updating**

*B***periodically is always more robust because it can compensate the system fluctuation/drift during a long data collection without manual intervention.**

*B*For practical application of our referencing scheme, it is important to provide a guideline on how many blank shots are needed to achieve optimal performance depending on *N*_{t} and *h*. For the fully-dispersed distribution, we can insert Eq. (11) into Eq. (12) and write an analytical expression for *Q* as a function of *N*_{b}. It can then be easily derived that *Q* is minimized at:

*Q*value is approximately

Equation (14) shows that *Q*_{min} approaches 1 when *N*_{t} / *h* is infinitely large. In fact, *Q* = 1 is the extreme case with the best referencing performance where the whole test set is used as the training set and hence *q* = 1 and *N*_{b} = 0, and the referencing performance is described by Eq. (6). In practice, *Q* = 1 is not attainable because only blank shots are collected in this case. When signal shots need to be collected, Eq. (6) should be additionally scaled by *Q* to describe the actual referencing performance as $\sqrt{1-{R}^{2}}Q$. Recalling the discussion in Section 3.2 and the results demonstrated in Fig. 5, the pixel noise floor $\sigma \left(\Delta {I}_{LO}^{P}\right)$ can be achieved in the extreme case of *Q* = 1 using our scheme with all reference pixels. This indicates that the lowest possible residual noise achievable with conventional methods corresponds to the case where *Q* = $\sqrt{2}$. We can use this *Q* value as a threshold to set the minimum value for *N*_{t}/*h*, so that any *N*_{t}/*h* larger than the threshold can achieve a residual noise lower than the lowest possible level with conventional methods. Another reason why we set the *Q* threshold at $\sqrt{2}$is to avoid allowing the training set to be larger than the test set so that *N*_{b} ≤ *N*_{t} in Eq. (12). From Eq. (14), it can be determined that the minimum *N*_{t}/*h* is ~20. For the data shown in Fig. 7(b), *N*_{t}/*h* is ~470 and *Q*_{min} is ~1.07. For *N*_{t}/*h* ≥ 500, further drop of *Q*_{min} is much slower considering it scales as (2*h*/*N*_{t})^{1/2} asymptotically. In general, to achieve good referencing results, *N*_{t}/*h* should be reasonably large to achieve a *Q*_{min} close to 1. Because ** B** is updated every (

*N*

_{t}+

*N*

_{b}

^{min}) shots, this criterion will put a limit on how fast one can update

**, which is ultimately limited by**

*B**h*.

Using the results in Fig. 7(a), we have also varied the value of *N*_{t}/*h* to see how it affects the *Q* curves for different training-set distributions. As the value of *N*_{t}/*h* increases, the *Q* curve flattens around its minimum, and the *Q*_{min} values for different distributions gets closer to one another. As a result, although Eqs. (13) and (14) are only accurate for the fully-dispersed distribution, they serve as good estimations for other distributions. Usually other distributions need a slightly smaller training set than the *N*_{b}^{min} from Eq. (13) to achieve the minimum *Q*, and 90% is a good correction factor as a rule of thumb. The exact size of the training set is not so important because the *Q* curve is very flat around its minimum. Equations (11), (13), and (14) are the main results from this section. Although the results were validated using the WLC data, they are independent of the light source because they are direct consequence of ordinary least squares.

#### 4.2. Compression of reference-pixel data

As discussed in Section 4.1, to have good a referencing performance with *Q*_{min} close to 1, the number of reference pixels, *h*, sets an upper limit for the update rate of ** B**. In applications like hyperspectral imaging, a fast

**update is necessary because many sample positions are scanned. For other applications, reasonably fast**

*B***updates (~1 minute) are still desired, although not required, to swiftly respond to any system drifts. This is especially true if the fully-aggregated distribution is used. For the MCT arrays used in our previous work [19],**

*B**h*is a small number: 32, 64, or 96. CMOS/CCD arrays typically have many more pixels, and

*h*is 2048 for the DL used in this work. If we directly use all the physical pixels as the reference pixels, the period required for updating

**to achieve**

*B**Q*

_{min}~1.07 would be ~15 mins, and this would also come with a high computational cost. Therefore, it is necessary to first compress the reference-pixel data to a smaller number of effective pixels before doing the referencing calculation. In Section 4.1, we have illustrated this concept by binning 16 neighboring pixels. In this section, we will further explore different ways of data compression and their performance.

The procedure of data compression can be formalized as a decomposed estimation of ** B** with

**=**

*B***. Δ**

*C**D***becomes:**

*K***is the**

*C**h*×

*m*(

*h*>

*m*) compression matrix. Binning neighboring pixels is just one intuitive choice of

**. $\text{\Delta}{I}_{Ref}^{\text{eff}}\triangleq \text{\Delta}{I}_{Ref}C$is the**

*C**m*effective reference pixels after data compression.

**is the**

*D**m*×

*g*regression coefficient matrix for the effective pixels, which is solved by:

**is now limited by**

*D**m*, and the discussion in Section 4.1 (after replacing

*h*with

*m*) is equally applicable here. With 16-pixel binning,

*m*is only 1/16 of

*h*which can accelerate the rate for

**update and reduce**

*B**Q*

_{min}. Pixel compression also improves the robustness of the regression algorithm if there are bad or saturated pixels in the reference array, where matrix inversion will fail in Eq. (5) but may still work in Eq. (16). However, pixel data compression also comes with the disadvantage of reducing referencing performance. As shown in Fig. 8(a), the

*SNR*

_{0}decreases very slowly with moderate binning until around

*m*= 32, after which it decreases very quickly. Compared with referencing with a subset of pixels as discussed in Section 3.3, data compression through pixel binning does not cut the total energy for referencing even though both methods lose degrees of freedom in the regression. However, not all degrees of freedom are important in predicting $\Delta {I}_{LO}$. That is why the

*SNR*

_{0}with 32 effective reference pixels (red) in Fig. 8(a) is much higher than the corresponding curve (yellow) in Fig. 5(a). In the following, we will optimize strategies of pixel data compression for different light sources by dropping those degrees of freedom with little predicting power.

To quantify the effect of data compression on referencing performance, we can use the same *q* defined in Eq. (10), while using the whole one-minute data set as both the training set and test set. The specific ** B** in the numerator is estimated with

*m*effective pixels, and the corresponding $\sigma \left(\text{\Delta}K\right)$is normalized by ${n}_{\text{t}}-1-m$. The standard

**in the denominator is estimated with the original 2048 reference pixels, and the corresponding $\sigma \left(\text{\Delta}K\right)$is normalized by ${n}_{\text{t}}-1-2048$. By definition,**

*B**q*converges to 1 when

*m*approaches 2048. Figures 8(b) and 8(c) plot

*q*as a function of

*m*for WLC and 800-nm light sources, respectively. With simple binning,

*q*decreases very quickly with increasing

*m*initially, and the decrease slows down with further increase in

*m*. This trend means that only a small number of effective pixels are needed to realize most of the performance from the full 2048 pixels. However, different light sources need different

*m*to get similar

*q*. For 800 nm,

*q*is already ~1.2 with

*m*= 1, whereas the corresponding

*m*is 7 for the WLC. This phenomenon can be explained by the spectral correlation plotted in Fig. 2: the WLC exhibits a more complex spectral correlation that requires more effective pixels to capture its noise statistics. This result also demonstrates why it is often insufficient to use a single photodiode as the reference detector.

Other choices of the compression matrix ** C** are possible and can be more efficient than simple binning. One method is principal component regression (PCR). A principal component analysis (PCA) is first performed on $\text{\Delta}{I}_{Ref}$by eigendecomposition of$\mathrm{cov}\left(\text{\Delta}{I}_{Ref}\right)$, and matrix

**then consists of the eigenvectors corresponding to the**

*C**m*largest eigenvalues. Although $\text{\Delta}{I}_{Ref}^{\text{eff}}$ obtained from PCA can capture the major contributions in the variation of $\text{\Delta}{I}_{Ref}$, it does not necessarily yield the best predictors for $\Delta {I}_{LO}$. PCR is also sensitive to the relative scaling of different reference pixels, which can be problematic when using multiple reference detectors, such as a photodiode and an array. To address these concerns, we can seek a

**and the corresponding $\text{\Delta}{I}_{Ref}^{\text{eff}}$ that have the highest prediction power for $\Delta {I}_{LO}$by minimizing a weighted mean residual noise:**

*C***is calculated by Eqs. (15) and (16). The subscript ‘rms’ indicates a quadratic mean over all signal pixels, and**

*K***is a weight factor. The**

*w***factor introduces some flexibility by allowing different signal pixels to have different relative importance. For example,**

*w***can be set as $1/\sigma \left(\Delta {I}_{LO}^{P}\right)$to give higher weights to signal pixels with a lower noise floor, or some of its elements can be set to higher values for those signal pixels that are expected to see the sample signals. In this section, we set**

*w***to be $1/\sigma \left(\text{\Delta}K\right)$that is calculated from the standard**

*w***with 2048 reference pixels, and therefore ${\u3008\sigma \left(\text{\Delta}K\right)\circ w\u3009}_{rms}$ mimics**

*B**q*. The solution to Eq. (17) can be found by maximum redundancy analysis (MRA) [25,26]. This method basically performs eigendecomposition twice to obtain all redundancy coordinates, and matrix

**then consists of the first**

*C**m*redundancy coordinates. Due to the spectral correlation between signal pixels, a small change in

**will not dramatically change the performance of**

*w***and hence**

*C***is quite robust against different choices. For example, with 8 effective pixels, the corresponding**

*w**q*’s only differ by less than 0.1% whether

**is set as $1/\sigma \left(\text{\Delta}K\right)$ or an all-ones vector.**

*w*We calculate the *q* with different numbers of effective pixels by PCR or MRA, and compare the results with binning in Figs. 8(b) and 8(c). For both light sources, MRA has the fastest *q* convergence, and binning has the slowest convergence, but their difference becomes very small after *m* ≥ 32. The advantage of MRA or PCR is more significant for WLC compared with 800 nm. These facts indicate that more advanced compression methods are only needed when the LO/probe exhibits complex spectral correlations, or when a very small *m* is required to facilitate very fast ** B** update, such as in the case of sample scanning in hyperspectral imaging. For most situations where

**is just updated to adapt for slow system drift, simply binning neighboring reference pixels is sufficient even for complex light sources like WLC. The exact number of effective pixels needed depends on the specific light source used and the desired**

*B***update rate.**

*B*For binning, ** C** is fixed and therefore

**( =**

*B***) is updated only by updating**

*C**D***. For PCR and MRA,**

*D***itself is estimated from the data and thus it can also be updated. Because each column vector in**

*C***has**

*C**h*coefficients whereas that in

**has**

*D**m*coefficients (

*h*>

*m*), the estimation of

**needs more data points. It is desirable to update**

*C***less frequently than**

*C***. To facilitate**

*D***estimation in MRA and PCR, we can pre-process the original $\text{\Delta}{I}_{Ref}$data by binning it into**

*C**h*′ pixels first, followed by further compression into

*m*(≤

*h*′) pixels by MRA or PCR. Figure 9(a) shows the effect of pre-processing on referencing performance by plotting

*q*as a function of

*m*, similar to Fig. 8(b). The different curves corresponding to

*h*′ = 32 to 1024 are almost indistinguishable, indicating that pre-processing the original$\text{\Delta}{I}_{Ref}$ data with moderate binning results in very little performance loss for further compression. Intuitively, the degrees of freedom lost in binning neighboring pixels are likely those from uncorrelated detector and shot noises, which are not important for predicting the noise statistics in the signal detector and will be lost in further compression anyway. This effect is examined more carefully in Fig. 9(b) by fixing

*m*= 8 and varying

*h*′. For both PCR and MRA,

*q*only begins to increase significantly when

*h*′ is smaller than 32. In this sense, pre-processing the $\text{\Delta}{I}_{Ref}$ data with binning is preferable before further compression by PCR or MRA because it greatly facilitates

**estimation with negligible performance loss and reduces computational cost.**

*C*Let us combine the discussion in Sections 4.1 and 4.2 and consider the overall effect of pixel compression on referencing performance. The variable *h* in Eqs. (13) and (14) is now replaced by *m*. From the perspective of noise reduction power, having more effective pixels is better. However, due to the finite-size training set and extra time cost, *Q*_{min} in Eq. (14) increases with *m*. When *N*_{t} is large, this is not a critical problem. When *N*_{t} is small, which means that ** B** is updated very fast,

*m*should be carefully chosen to minimize the residual noise. For example, in Fig. 9(c) we demonstrate how to update

**every second for the one-minute WLC data. This is done by updating**

*B***every second, while**

*D***is kept constant for one minute (it can be updated every minute). Three compression methods were compared: directly binning into**

*C**m*effective pixels, or first pre-processing $\text{\Delta}{I}_{Ref}$with binning (

*h*′ = 32) and then further compressing into

*m*effective pixels with MRA or PCR. For each

*m*, a training set with a size

*N*

_{b}(determined from Eq. (13) to minimize

*Q*), is evenly distributed with a test set (with a size

*N*

_{t}= 1000 −

*N*

_{b}) in a fully-dispersed way.

**is estimated from each training set (**

*D**N*

_{b}shots) and updated every second. For MRA or PCR with pre-processing,

**is estimated from the combined training sets (60 ×**

*C**N*

_{b}shots) for one minute. The performance of each method is characterized by the corresponding

*Q*, which includes the overall effects of pixel compression, finite-size training set and extra time cost. A

*Q*value can be calculated every second, so the mean value and standard deviation of

*Q*over 60 one-second segments are used to create the error bars and are plotted as a function of

*m*in Fig. 9(c). It is clear from the plot that an optimal

*m*exists, and different methods have a different optimal

*m*. For simple binning, the

*Q*is minimized at

*m*~14 with a value of 1.27. For MRA with pre-processing, the

*Q*is minimized at

*m*~8 with a value of 1.17. The advanced compression methods provide better referencing performance over simple binning for the fast update case illustrated here, but the improvement is not huge. For light sources with simple spectral correlation, or if the update rate is slower than 1000 shots, the advantage will be even smaller. In conclusion, when the reference array has many pixels and pixel compression is preferable, moderate binning will be sufficient in most cases. When advanced compression methods are used, pre-processing with binning is also useful. The specific choice of

*m*should be analyzed based on individual experimental conditions.

#### 4.3. Beyond linear approximation: reference-pixel data expansion

In our previous work [19], we found that the referencing scheme can be applied in the nonlinear response range of detectors without first calibrating the response. This is because the linear approximation is adequate when Δ** I** represents small fluctuations around zero. In this section, we will show an extended algorithm that works for high nonlinearity and large fluctuations.

Equations (3) and (4) involve a linear function of${I}_{Ref}$, which can be naturally generalized to an arbitrary function ** f** of ${I}_{Ref}$:

**to ensure that 〈Δ**

*f***〉 = 0 for any**

*K***. The exact functional form of**

*f***is unknown, but we can approximate it with a Taylor expansion of**

*f***around the mean intensity so that it keeps the linear form:and $\Delta {I}_{Ref}^{\text{eff}}$is a data matrix that contains linear and higher order terms:**

*f**i*,

*j*, and

*k*can have arbitrary combinations, and { } represents a collection of different combinations. $\text{\Delta}{I}_{Ref}^{\text{eff}}$can be calcuated directly from raw data, and

**is then estimated with methods discussed in previous sections. We call this algorithm reference-pixel data expansion, which reduces to the normal method when only ${I}_{Ref}$ is included in Eq. (20).**

*B*For the case of pixel nonlinearity, inclusion of higher order terms is straightforward because we can always approximate the calibration curve (like the one in the Section 5.3 of [19]) of each pixel with a Taylor series and insert it into Eq. (4), which naturally produces Eqs. (19) to (21). In general, Taylor expansion is chosen due to its common use, but, in principle, other functions could be used. Although pixel expansion is beyond the linear approximation, it still uses a linear algorithm to estimate ** B** and thus is computationally robust and efficient. Generally, there is no guarantee that pixel expansion will necessarily improve referencing. Therefore, its effect should be evaluated on experimental data directly.

As the name “pixel expansion” suggests, the number of possible higher order terms increases very quickly with the number of reference pixels, *h*, if every combination is included. This prevents us from doing a full optimization of all expanded terms, as was done for pixel compression in Section 4.2, because that will require a lot of time to collect a large enough training set for the minimization to converge. To make the pixel expansion algorithm cost-effective, we summarize some rules of thumb below to help reduce the number of terms and identify the potentially important terms. From the several cases we tested, we found that the additional improvement from full optimization is very small compared to that from our rules of thumb. These rules are demonstrated with both WLC (collected by CMOS arrays) and mid-IR (collected by a 2 × 32-pixel MCT array, the details are previously described in [19]). Because the light sources have very different properties and use very different detector arrays, the summarized rules are expected to be generalizable to some extent.

First, binning neighboring pixels into a proper number of effective pixels before pixel expansion can greatly reduce the number of possible terms. We use notation 〈a, b, c〉 to represent a specific binning configuration for $\left({I}_{Ref}\text{,}{I}_{Ref}^{\text{(2)}}\text{,}{I}_{Ref}^{\text{(3)}}\right)$. For example, 〈2048, 0, 0〉 means that the configuration contains only the ${I}_{Ref}$ term with 2048 pixels and no ${I}_{Ref}^{\text{(2)}}$ or ${I}_{Ref}^{\text{(3)}}$terms. 〈64, 8, 8〉 means that ${I}_{Ref}$contains 64 effective pixels binned from original reference pixels, ${I}_{Ref}^{\text{(2)}}$ is expanded from 8 binned pixels using Eq. (21) and contains 36 terms ($C(8,2)+C(8,1)$ where $C(n,k)$ is the binomial coefficient), and ${I}_{Ref}^{\text{(3)}}$ is also expanded from 8 binned pixels and contains 120 terms ($C(8,3)+2\times C(8,2)+C(8,1)$). For WLC, the referencing performance of 〈64, 8, 0〉 is compared with that of 〈2048, 0, 0〉 in Fig. 10(a). It is clear that the *SNR*_{0} is higher with pixel expansion, even though the total number of effective pixels is much less (100 < 2048). Although the extent of improvement depends on the specific data sets, this result shows that pre-binning makes the expansion algorithm feasible, especially for those reference arrays which have many physical pixels.

Second, contributions from cross terms (when *i* ≠ *j* in Eq. (21)) are usually less important than pure terms (*i* = *j*). We use a subscript 0 in 〈64, 16_{0}, 0〉 to represent a configuration whose ${I}_{Ref}^{\text{(2)}}$ only includes pure square terms. In Fig. 10(a), the *SNR*_{0} with 〈64, 8, 0〉 referencing is compared with 〈64, 16_{0}, 0〉, and most of the improvement is realized by 〈64, 16_{0}, 0〉. Because the number of pure terms (for any order) scales linearly with the pixel number, it is easier to include only pure terms than to include all terms, and it may be done with less binning. Moreover, considering that the *SNR*_{0} is improved only in some spectral range in Fig. 10(a), it is possible to just incorporate square terms for those relevant reference pixels, which further reduces the number of terms.

The above result implies that the improvement from pixel expansion is mainly due to the nonlinearity of detectors. However, both data sets in Fig. 10 were actually collected within linear response range (or very weak nonlinearity). Because it is often difficult to predict or rationalize why and how pixel expansion improves referencing, their effects should be evaluated on experimental data directly. For the highly nonlinear response range, the improvement can be more significant. In those cases, the multiplicative noise with nonlinear response should be further corrected as described in the Section 5.3 of Ref [19].

Third, it is usually unnecessary to include ${I}_{Ref}^{\text{(3)}}$ terms. In Fig. 10(a), we can see that the *SNR*_{0} with 〈64, 8, 8〉 referencing is marginally higher than 〈64, 8, 0〉, indicating that the extra 120 ${I}_{Ref}^{\text{(3)}}$ terms only introduce a very small improvement. This makes sense because the higher order terms in a Taylor expansion tend to vanish. The situation is similar with the mid-IR source in Fig. 10(b): 〈32, 32_{0}, 0〉 is basically as good as 〈32, 32_{0}, 32_{0}〉.

Fourth, pixel expansion works very well when the laser is very noisy and there are only few reference pixels available. In Fig. 10(b) for a mid-IR data set with ~4% standard deviation noise, referencing with 〈32, 32_{0}, 0〉 is significantly better than 〈32, 0, 0〉 across the full spectral range. Actually, it is remarkable that even the performance of 〈4, 4_{0}, 0〉 is a little better than 〈32, 0, 0〉. Here the 4 refers to 4 physical pixels (1, 9, 17 and 25th pixels to simulate a 4-pixel reference array) rather than 4 binned pixels. This example shows that pixel expansion is a very useful tool for IR detection where the number of detector pixels is usually very limited due to cost.

Finally, it is possible to do pixel compression after pixel expansion. Because the units of the higher order terms and the linear terms are different, only MRA is compatible with this approach.

#### 4.4. Generalized definition of Δ**I**

In previous sections, Δ** I** is defined as the intensity difference between two consecutive shots. This definition is valid because most chopping and phase cycling patterns are simply cycled by two shots. However, there exist complex scenarios where the extraction of the sample signal involves more than two consecutive shots. In these scenarios, the Δ operator should be defined based on how the sample signal is extracted, and hence we will generalize its definition with descriptive subscripts. The normal definition of the Δ operator can be denoted as Δ

_{n}_{−(}

_{n}_{+1)}, because the difference is taken between the

*n*

^{th}and (

*n*+ 1)

^{th}shots, as illustrated in the top panel of Fig. 11(a). For simplicity, we denote it in short-hand form as Δ

_{1−2}by setting the dummy variable

*n*= 1. Obviously, the exact value of

*n*does not matter and only the lag between shots is critical in the subscripts, and thus Δ

_{3−4}is equivalent to Δ

_{1−2}, both with lag = 1. We will give a few examples of the generalized Δ operators below. It should be noted that when defining and utilizing the generalized Δ operator, exactly the same set of laser shots should still be used in calculating $\Delta {I}_{LO}$ and the corresponding $\Delta {I}_{Ref}$, and they should fulfill $\u3008\Delta {I}_{LO}\u3009=\u3008\Delta {I}_{Ref}\u3009=0$ to guarantee the correct convergence of the baseline.

In some experiments with dual color pump beams, such as femtosecond stimulated Raman experiments [4,18], the sample signal can involve the intensity difference between the *n*^{th} and (*n* + 2)^{th} shots, and the corresponding short-form Δ operator is denoted as Δ_{1−3}. In some phase-cycling patterns, the sample signal is extracted by taking the difference between the *n*^{th} shot and the mean of the (*n* + 1)^{th} and (*n* + 2)^{th} shots [27], and hence Δ_{1−(2+3)/2} can be defined accordingly. Both Δ_{1−3} and Δ_{1−(2+3)/2} involve three consecutive shots to extract the sample signal, as illustrated in the bottom panel of Fig. 11(a). Another scenario is when the array pixel integrates several laser pulses before readout. This method is often used when the array readout rate is slower than the laser repetition rate, or when a single laser pulse is not intense enough to fill the pixel FWC. In this case, the sample signal is extracted by taking the difference between two consecutive array readouts, each integrating the total energy of several consecutive laser pulses. For example, if 4 consecutive shots are integrated as one readout [15], the Δ operator can be generalized as Δ_{(1+2+3+4)−(5+6+7+8)}.

For each distinct generalized Δ operator, the training and application steps of ** B** can be performed as discussed in the previous sections by applying the same generalized Δ operator to both the training set and the test set. However, the residual noise after referencing increases as the lag in the Δ definition becomes larger. To illustrate this effect, we calculate the residual noise for different Δ

_{1−(1+lag)}and plot the corresponding

*SNR*

_{0}in Fig. 11(b). Clearly the

*SNR*

_{0}decreases when the lag increases. The

*SNR*

_{0}drops very quickly at the beginning with a small increase in the lag from lag = 1 to lag = 5, but the drop becomes much more gradual as the lag increases further from lag = 23 to lag = 999. The highest

*SNR*

_{0}of Δ

_{1−1000}is about half of that for Δ

_{1−2}.

When the lag increases, the spectral correlation of Δ** I** between the signal and reference detectors becomes weaker, which results in a higher residual noise as indicated by Eq. (6). This can be rationalized by recognizing that a Δ operator with a larger lag will leak in more low-frequency noises that are uncorrelated between detectors. In our demonstration, the laser runs at 1 kHz and thus Δ

_{1−1000}spans one second in time. During such a long period, the detected intensities are modulated by low frequency fluctuations in the experimental system, such as beam pointing, mechanical vibrations, and the utility frequency. If the laser repetition rate is higher than 1 kHz, the

*SNR*

_{0}in Fig. 11(b) is expected to drop slower with an increasing lag. For high repetition rate laser systems, it is therefore possible to integrate many shots into one readout rather than doing shot-to-shot referencing and still get good referencing performance so long as the time scale of integration is well separated from those of low frequency noises.

Fortunately, practical chopping or phase-cycling patterns do not involve a long lag. In particular, most complex Δ operators can be reduced into a linear combination of simple Δ operators such as Δ_{1−2} or Δ_{1−3}. For example, Δ_{1−2+3−4}, commonly used for scattering removal [28,29], can be reduced into Δ_{1−2} + Δ_{3−4}. This reduction means that estimation of ** B** only needs to involve two consecutive shots because Δ

_{1−2}is equivalent with Δ

_{3−4}, even though the extraction of sample signal involves 4 consecutive shots. This reduction can also be done for more complex cases, like 16-phase-cycling [30], after proper ordering of different phase masks. Therefore, one can strategically define and reduce Δ so that the decrease in

*SNR*

_{0}due to lag will not be significant in practice.

Although using the same Δ operator in both the training and application steps of ** B** can yield the best performance in noise suppression, it makes the practical implementation more complex. This complication is especially severe for the fully-dispersed distribution because it requires inserting different training set patterns for different Δ operators. Compared with the normal Δ

_{1−2}operator, the generalized Δ operator also requires a larger number of

*N*

_{b}shots to generate the same number of

*n*

_{b}pairs, which contributes to an unpreferred higher

*Q*, as defined in Eq. (12). However, as indicated by Eq. (4), the condition of 〈Δ

**〉 = 0 does not depend on how**

*K***is estimated as long as $\u3008\Delta {I}_{LO}\u3009=\u3008\Delta {I}_{Ref}\u3009=0$. This property allows us to use a Δ operator in the training step of**

*B***different from that used in the application step. This approach results in slightly higher residual noise, which is a trade-off for much easier implementation. One convenient choice is to always estimate**

*B***based on Δ**

*B*_{1−2}even though the sample signal is extracted by a different Δ operator in the application step. In this case, the training set can be distributed in the same way as discussed in Section 4.1, and a single

**estimation can be used for multiple sample signals extracted by different Δ operators.**

*B*To test the referencing performance under the situation described above, we use the *q* defined in Eq. (10) to quantify the training quality with the specific ** B** in the numerator estimated from Δ

_{1−2}and the standard

**in the denominator from Δ**

*B*_{1−(1+lag)}. The result is plotted as a function of lag in Fig. 11(c). When lag = 1,

*q*is exactly 1 because Δ

_{1−(1+lag)}is just Δ

_{1−2}. From lag = 2 to lag = 100,

*q*increases with small oscillations on top of a steady positive-slope trend. The oscillation has a period of ~16 (ms) and is due to the utility frequency of 60 Hz. However,

*q*remains reasonably low (within 1.05) until around lag = 20, which is already much longer than those involved in realistic chopping or phase-cycling patterns. This result suggests that Δ

_{1−2}is a reasonable replacement for Δ

_{1−(1+lag)}in the training step of

**.**

*B*Until now we have focused mainly on phase-cycling where some extra blanks shots must be inserted to form a training set because there are no blank shots normally. However, for the chopping case, the chopped shots are blank shots themselves and can be used as the training set too. In particular, for a pump-probe experiment that utilizes simple on-off chopping, we can use (all or part of) the chopped shots to estimate ** B** based on Δ

_{1−3}, although the sample signal is extracted by Δ

_{1−2}. For the one-minute WLC data set, doing this yields a

*q*of ~1.003, achieving basically the same noise level as the extreme case where

**is estimated from the whole one-minute set based on Δ**

*B*_{1−2}. Although no extra blank shots are inserted (

*N*

_{b}= 0 and

*Q*=

*q*), the training set contains up to

*N*

_{t}/2 shots and therefore

**estimation converges well (see Section 4.1). In this way, the data collection procedures are exactly the same as routine, only the data processing method is different. This convenient trick is robust and easily generalizable to more complex chopping patterns.**

*B*## 5. Summary and conclusions

In this work, we have optimized and expanded our noise reduction scheme for heterodyne spectroscopy. Conventional referencing methods are intuitive but limited in their performance because any signal pixel can only be referenced by one reference pixel. In contrast, our scheme optimally suppresses additive noise by fully exploiting the spectral correlation between signal and reference detectors through a coefficient matrix ** B**. This is followed by a robust division-after-averaging step to further remove the multiplicative noise. As a result, our scheme enables unprecedented benefits in terms of noise reduction performance, flexible detector choice, simple alignment, and efficient computation. We have validated this idea on 800 nm and WLC detected by two different CMOS arrays. Like the initial demonstration with mid-IR detected by MCT detectors, the residual noise in the shorter wavelength range approaches the noise floor of the signal array. Because CMOS arrays are widely available with quite different characteristics, we have discussed how to optimize detector choice, and showed that a small fraction of pulse energy is sufficient for reference detection to achieve satisfactory performance. Our results indicate that a detector with a high FWC is preferable as the signal detector for self-heterodyne experiments, whereas a sensitive detector with a low FWC is a better choice for experiments where the LO intensity can be adjusted independently.

Practical implementation of this scheme requires accurate estimation of ** B**. The basic idea is very simple: use a training set of only blank shots to fit for

**and apply it for noise reduction in the test set. For the case of phase-cycling, we have optimized the size and distribution of the training set and provided analytical equations to estimate the optimal size and corresponding referencing performance. For the case of chopping, our analysis shows that using the chopped shots as the training set is a simple yet robust way to implement our scheme. We have also generalized the scheme to scenarios with complex chopping or phase cycling patterns and showed that the corresponding**

*B***estimation can be reduced into the simpler cases.**

*B*The basic method can be combined with elaborate algorithms to account for difficult situations. Pixel compression, with three algorithms given, is an effective approach to deal with situations that require rapidly updating ** B**. This method is useful to adapt for either system drifts or sample scanning in hyperspectral imaging, and reasonably good performance is achieved when the update rate is as fast as 1 second. Furthermore, pixel expansion generalizes the functional form beyond a linear function, but keeps the benefit of using linear least squares to determine

**. This method improves the referencing performance for detectors operating in the nonlinear response range and/or for very noisy light sources.**

*B*Although these different algorithms may seem complex, our purpose is to cover all possible scenarios. For a specific experimental setup and measurement type, the analysis just needs to be performed once to determine the best way to implement the scheme. The relevant parameters include the choice of detectors, the update rate of ** B**, the size and distribution of the training set, the level of pixel compression, and whether pixel expansion is needed. After the initial analysis is done, the routine implementation is simple. Especially for the chopping case, usually, no modification to the data collection step is needed.

When this scheme is properly implemented, the residual noise originating from the laser and detector should approach the noise floor of the signal detector and follows a zero-mean Gaussian white noise distribution. Because the residual noise is unbiased and distortion-free, it does not interfere with extracting real spectral features from sample signals. This property is particularly useful for experiments utilizing WLC where conventional referencing often results in baseline distortion. Moreover, the magnitude of residual noise, $\sigma \left(\text{\Delta}K\right)$, can be estimated from the training set and updated throughout the measurement. It can either be used as diagnostic information during data collection to assist analyzing and rejecting outlier shots due to scattering events [31], or be used as inputs for post-processing steps such as the weight factors in spectral fitting. Due to the linear nature of the subtraction method used in Eqs. (3) and (7), any additional procedures to remove parasitic signals, such as fluorescence, can be performed after the removal of additive noise. We believe that our scheme fully solves the challenges associated with laser intensity fluctuation in heterodyne spectroscopy.

## Funding

National Science Foundation (NSF) (CHE-1414466, CHE-1310693, CHE-0802913); University of California, Irvine.

## Acknowledgments

This work was supported by the University of California, Irvine, and the National Science Foundation Center for Chemical Innovation on Chemistry at the Space-Time Limit (CaSTL), grant number CHE-1414466, in conjunction with instrumentation funded in part by the National Science Foundation under grant no. CHE-1310693 and CHE-0802913. We are grateful to Hiroaki Maekawa for his help with laser maintenance.

## References

**1. **J. Brazard, L. A. Bizimana, and D. B. Turner, “Accurate convergence of transient-absorption spectra using pulsed lasers,” Rev. Sci. Instrum. **86**(5), 053106 (2015). [CrossRef] [PubMed]

**2. **M. Bradler and E. Riedle, “Temporal and spectral correlations in bulk continua and improved use in transient spectroscopy,” J. Opt. Soc. Am. B **31**(7), 1465–1475 (2014). [CrossRef]

**3. **C. Schriever, S. Lochbrunner, E. Riedle, and D. J. Nesbitt, “Ultrasensitive ultraviolet-visible 20 fs absorption spectroscopy of low vapor pressure molecules in the gas phase,” Rev. Sci. Instrum. **79**(1), 013107 (2008). [CrossRef] [PubMed]

**4. **A. L. Dobryakov, S. A. Kovalenko, A. Weigel, J. L. Pérez-Lustres, J. Lange, A. Müller, and N. P. Ernsting, “Femtosecond pump/supercontinuum-probe spectroscopy: optimized setup and signal analysis for single-shot spectral referencing,” Rev. Sci. Instrum. **81**(11), 113106 (2010). [CrossRef] [PubMed]

**5. **C. A. Werley, S. M. Teo, and K. A. Nelson, “Pulsed laser noise analysis and pump-probe signal detection with a data acquisition card,” Rev. Sci. Instrum. **82**(12), 123108 (2011). [CrossRef] [PubMed]

**6. **M. Kaucikas, J. Barber, and J. J. Van Thor, “Polarization sensitive ultrafast mid-IR pump probe micro-spectrometer with diffraction limited spatial resolution,” Opt. Express **21**(7), 8357–8370 (2013). [CrossRef] [PubMed]

**7. **S. Mukamel, *Principles of Nonlinear Optical Spectroscopy*, Oxford Series on Optical and Imaging Sciences (Oxford University Press, 1999).

**8. **T. Brixner, I. V. Stiopkin, and G. R. Fleming, “Tunable two-dimensional femtosecond spectroscopy,” Opt. Lett. **29**(8), 884–886 (2004). [CrossRef] [PubMed]

**9. **M. J. Nee, R. McCanne, K. J. Kubarych, and M. Joffre, “Two-dimensional infrared spectroscopy detected by chirped pulse upconversion,” Opt. Lett. **32**(6), 713–715 (2007). [CrossRef] [PubMed]

**10. **L. A. Bizimana, J. Brazard, W. P. Carbery, T. Gellen, and D. B. Turner, “Resolving molecular vibronic structure using high-sensitivity two-dimensional electronic spectroscopy,” J. Chem. Phys. **143**(16), 164203 (2015). [CrossRef] [PubMed]

**11. **A. P. Spencer, B. Spokoyny, S. Ray, F. Sarvari, and E. Harel, “Mapping multidimensional electronic structure and ultrafast dynamics with single-element detection and compressive sensing,” Nat. Commun. **7**(1), 10434 (2016). [CrossRef] [PubMed]

**12. **J. A. Moon, “Optimization of signal‐to‐noise ratios in pump‐probe spectroscopy,” Rev. Sci. Instrum. **64**(7), 1775–1778 (1993). [CrossRef]

**13. **K. Röttger, S. Wang, F. Renth, J. Bahrenburg, and F. Temps, “A femtosecond pump–probe spectrometer for dynamics in transmissive polymer films,” Appl. Phys. B **118**(2), 185–193 (2015). [CrossRef]

**14. **M. A. R. Reber, Y. Chen, and T. K. Allison, “Cavity-enhanced ultrafast spectroscopy: ultrafast meets ultrasensitive,” Optica **3**(3), 311–317 (2016). [CrossRef]

**15. **S. D. McClure, D. B. Turner, P. C. Arpin, T. Mirkovic, and G. D. Scholes, “Coherent Oscillations in the PC577 Cryptophyte Antenna Occur in the Excited Electronic State,” J. Phys. Chem. B **118**(5), 1296–1308 (2014). [CrossRef] [PubMed]

**16. **P. Hamm, S. Wiemann, M. Zurek, and W. Zinth, “Highly sensitive multichannel spectrometer for subpicosecond spectroscopy in the midinfrared,” Opt. Lett. **19**(20), 1642–1644 (1994). [CrossRef] [PubMed]

**17. **A. Ghosh, A. L. Serrano, T. A. Oudenhoven, J. S. Ostrander, E. C. Eklund, A. F. Blair, and M. T. Zanni, “Experimental implementations of 2D IR spectroscopy through a horizontal pulse shaper design and a focal plane array detector,” Opt. Lett. **41**(3), 524–527 (2016). [CrossRef] [PubMed]

**18. **D. W. McCamant, P. Kukura, S. Yoon, and R. A. Mathies, “Femtosecond broadband stimulated Raman spectroscopy: Apparatus and methods,” Rev. Sci. Instrum. **75**(11), 4971–4980 (2004). [CrossRef] [PubMed]

**19. **Y. Feng, I. Vinogradov, and N.-H. Ge, “General noise suppression scheme with reference detection in heterodyne nonlinear spectroscopy,” Opt. Express **25**(21), 26262–26279 (2017). [CrossRef] [PubMed]

**20. **J. F. Holmes and B. J. Rask, “Optimum optical local-oscillator power levels for coherent detection with photodiodes,” Appl. Opt. **34**(6), 927–933 (1995). [CrossRef] [PubMed]

**21. **B. M. Luther, K. M. Tracy, M. Gerrity, S. Brown, and A. T. Krummel, “2D IR spectroscopy at 100 kHz utilizing a Mid-IR OPCPA laser source,” Opt. Express **24**(4), 4117–4127 (2016). [CrossRef] [PubMed]

**22. **F. Kanal, S. Keiber, R. Eck, and T. Brixner, “100-kHz shot-to-shot broadband data acquisition for high-repetition-rate pump-probe spectroscopy,” Opt. Express **22**(14), 16965–16975 (2014). [CrossRef] [PubMed]

**23. **G. Auböck, C. Consani, R. Monni, A. Cannizzo, F. van Mourik, and M. Chergui, “Femtosecond pump/supercontinuum-probe setup with 20 kHz repetition rate,” Rev. Sci. Instrum. **83**(9), 093105 (2012). [CrossRef] [PubMed]

**24. **W. Rock, Y.-L. Li, P. Pagano, and C. M. Cheatum, “2D IR Spectroscopy using Four-Wave Mixing, Pulse Shaping, and IR Upconversion: A Quantitative Comparison,” J. Phys. Chem. A **117**(29), 6073–6083 (2013). [CrossRef] [PubMed]

**25. **A. L. van den Wollenberg, “Redundancy analysis an alternative for canonical correlation analysis,” Psychometrika **42**(2), 207–219 (1977). [CrossRef]

**26. **K. E. Muller, “Relationships between redundancy analysis, canonical correlation, and multivariate regression,” Psychometrika **46**(2), 139–142 (1981). [CrossRef]

**27. **Z. Zhang, K. L. Wells, E. W. J. Hyland, and H.-S. Tan, “Phase-cycling schemes for pump–probe beam geometry two-dimensional electronic spectroscopy,” Chem. Phys. Lett. **550**, 156–161 (2012). [CrossRef]

**28. **R. Bloem, S. Garrett-Roe, H. Strzalka, P. Hamm, and P. Donaldson, “Enhancing signal detection and completely eliminating scattering using quasi-phase-cycling in 2D IR experiments,” Opt. Express **18**(26), 27067–27078 (2010). [CrossRef] [PubMed]

**29. **J. Helbing and P. Hamm, “Compact implementation of Fourier transform two-dimensional IR spectroscopy without phase ambiguity,” J. Opt. Soc. Am. B **28**(1), 171–178 (2011). [CrossRef]

**30. **H. Seiler, S. Palato, and P. Kambhampati, “Coherent multi-dimensional spectroscopy at optical frequencies in a single beam with optical readout,” J. Chem. Phys. **147**(9), 094203 (2017). [CrossRef] [PubMed]

**31. **K. E. H. Anderson, S. L. Sewall, R. R. Cooney, and P. Kambhampati, “Noise analysis and noise reduction methods in kilohertz pump-probe experiments,” Rev. Sci. Instrum. **78**(7), 073101 (2007). [CrossRef] [PubMed]