## Abstract

We introduce a high performance parallelization to the PSTD solution of Maxwell equations by employing the fast Fourier transform on local Fourier basis. Meanwhile a reformatted derivative operator allows the adoption of a staggered-grid such as the Yee lattice in PSTD, which can overcome the numerical errors in a collocated-grid when spatial discontinuities are present. The accuracy and capability of our method are confirmed by two analytical models. In two applications to surface tissue optics, an ultra wide coherent backscattering cone from the surface layer is found, and the penetration depth of polarization gating identified. Our development prepares a tool for investigating the optical properties of surface tissue structures.

©2010 Optical Society of America

## 1. Introduction

Modeling light-tissue interactions helps better understanding the fundamentals underlying various optical techniques in biomedical applications. At present, the radiative transfer theory, basically the differential radiative transfer equation (RTE) serves as the corner stone of most models in describing photon migration in tissue. However, in the derivation of RTE from Boltzmann transport equation the photons are treated as particles following straight trajectories between scattering centers and as waves at each process of scattering. The alternating role of particle and wave excludes interference and near-field effects, and can render RTE erroneous when the separation between scattering centers is insufficient [1] and incompetent to directly handle coherent effects in light scattering [2,3].

In general it is hard to solve RTE directly. Monte Carlo (MC) simulation is a statistical, step-by-step emulation of the photon transport process associated with each term of RTE. Current implementations of MC treat the tissue as a single- or multiple-layered homogeneous medium, consisting of structureless scattering points on a random walk path realized by a random number generator. This picture completely ignores the fine structures and full heterogeneities of a surface tissue, such as the epithelial layer, and therefore is not a good representation of the latter. Real world applications would require a much more sophisticated modeling that can reflect the abundant features of the modeled. As 85% of all cancers originate from the epithelia lining the cavities and surfaces of human organs, selectively probing the photons traveling through the epithelia is important because they carry the most diagnostic information, while the signal from the underlying bulk only complicates the detection. Recent years have seen rapid development of many surface-sensitive or depth-resolved optical methodologies, such as polarization gating [4,5] and low-coherence enhanced backscattering (LEBS) [6]. So far, MC simulation is the most common theoretical tool used to explore the characteristics of these techniques. Quite success has been achieved in studies utilizing homogeneous phantom models. But these oversimplified models lack the information about tissue structure and heterogeneity at cell level, and further to what extent the actual surface tissue-light interaction is describable by MC is still undetermined. Simulation with cellular resolution proves to be a tremendous task even for supercomputing. High performance parallelization algorithms are crucial to a successful model beyond MC.

Given a tissue configuration with fine details of cell organization, such simulations inevitably need to resort to the direct solutions of Maxwell equations, the first principle of electromagnetism. The finite difference time domain method (FDTD) [7] and its close variation, the pseudo-spectral time domain method (PSTD) [8] discretize the Maxwell equations on a media mesh and utilize a leapfrog time-marching mechanism to obtain numerical solutions of a complex electromagnetic structure. However, FDTD is unsuitable for tissue optics because of the sheer size of tissue and the intangible number of grids required to achieve numerical accuracy and stability. On the contrary, PSTD allows a much larger grid space, up to a half of the wavelength, and hence reduces the total number of grids. But for large-scale problems PSTD still remains computationally intensive. Up until now, reported PSTD simulations on light scatterings by random media are either two-dimensional [9] or three-dimensional with relatively large scatterers [1], resulting in small number of grids.

The current formalism of PSTD is only weakly parallelizable because its derivative operator is global. Besides, the formulation of the derivative is equivalent to deploy the six field components to the same grid point. Such collocated arrangement makes the derivative susceptible to spectral artifacts at the Nyquist frequency arising from discontinuities in the model space [10]. At present PSTD implements the fast Fourier transform on a global Fourier basis (gPSTD), which requires the access of the full data set scattered across all computation nodes in a distributed-memory multiprocessor computing environment. The best configuration of domain decomposition is a stack of equal slabs along one dimension (for example the *x* direction) [11], allowing each node to hold all the data in the $y\text{-}z$ plane, so differentiation with respect to these two variables can be evaluated with local data only. However, data required for the $x\text{-}$derivatives are lined up across all subdomains, and the algorithm utilizes two massive data transpositions per time step to assemble the scattered data to the same node and after the inner-node computation redistribute the results back to the original nodes. The performance of gPSTD is affected by several drawbacks. Firstly, the communication-intensive global data exchange drags down the parallel efficiency; secondly, the model scale in $y\text{-}z$ dimension is limited by the resource of a single node since its memory holds all the grids in the $y\text{-}z$ plane; at last, the code is not directly scalable when the number of computation nodes increases or the problem size grows.

The derivative is a local operation that depends only on the function values in the vicinity of the targeted local coordinate. FFT on local Fourier basis and its associated overlapping domain decomposition scheme are well-established methods in computation theory and already widely used in seismology [12,13]. Accordingly, the computation region can be flexibly divided into arbitrary number of subdomains in the *x*, *y* and *z* directions, data exchanges occur only between adjacent nodes concurrently, and local FFT are conducted simultaneously within each node without degradation of accuracy.

In this paper, we report a new parallel implementation of PSTD with local Fourier basis. By reformatting the derivative operator, the electric and magnetic fields are arranged in the staggered-grid Yee-lattice style. We first introduce the details of this algorithm and discuss the benefits of employing the Yee grid in terms of discontinuity suppression. We further validate the staggered-grid local PSTD (SLPSTD) by comparison with two analytical models. Then we apply it to the investigation of surface-layer EBS and polarization gating. Finally we present some details about the coding and the supercomputing environment, and compare the performance between SLPSTD and gPSTD.

## 2. Methods

#### 2.1 FFT with local Fourier basis

Current parallel implementations of FFT heavily rely on inter-CPU communications and suffer from low efficiency. So in PSTD each FFT is a sequential computation within a single CPU. The obstacle to the full parallelization of PSTD lies in the *conceived* global nature of the derivative operator. Therefore in traditional implementations of PSTD, FFT is performed on the global data set. The spectral multi-domain technique with local Fourier basis is an established method to break the one-single global computation into multiple independent concurrent local ones after a carefully designed “patching” procedure on the data [12]. In this approach, the full computation region is first split into multiple subdomains. The local partial data within each subdomain consist of its own and a copy of data from the neighborhood just outside its boundaries (Fig. 1
). The latter form the overlapping regions. To enforce the periodic condition of FFT on the local data set, the data in the overlapping regions are artificially tapered to zero by a weighting function. As a result, the derivatives in the non-overlapping region are unaffected and remain continuous across domain boundaries.

To illustrate the above local FFT algorithm, we consider an analytical “toy” function

*N*grids and

*m*subdomains. Each subdomain obtains ${n}_{\text{ovl}}$ grid values from its left neighbor and another ${n}_{\text{ovl}}$ grid values from its right neighbor. A “bell” function is defined as

*ℓ*denotes the local grid index. The overlapping region is ${n}_{\text{ovl}}$ (an even number) thick at each side, and the non-overlapping region starts at ${\mathit{\ell}}_{1}={n}_{\text{ovl}}$ and ends at ${\mathit{\ell}}_{2}$ (to be determined later). The local data set given by

In order to facilitate FFT calculations and also balance the load on different computation nodes, it is advantageous to keep the length of local FFT identical. Note that the first (last) subdomain at the far left (right) has no left (right) neighbor; therefore, these two end subdomains should be ${n}_{\text{ovl}}$ wider than the middle ones. Consequently, decomposing *N* number of grids into *m* subdomains results in ${n}_{\text{ovl}}+\left(N-2{n}_{\text{ovl}}\right)/m$ grids for subdomains 1 and *m*, and $\left(N-2{n}_{\text{ovl}}\right)/m$ grids for subdomains 2 to $m-1$. The length of local FFT turns out to be a constant $L=\left(N-2{n}_{\text{ovl}}\right)/m+2{n}_{\text{ovl}}$, and ${\mathit{\ell}}_{2}=L-{n}_{\text{ovl}}-1$.

To demonstrate the feasibility and accuracy of the above multi-domain FFT procedure, we calculate the first derivative of Eq. (1) utilizing a staggered-grid formula [10]

*Δ*denotes the grid size,

*F*and ${F}^{-1}$ the forward and inverse FFT

*within the local subdomain*, respectively. Unlike existing PSTD implementations, in Eq. (4) we offset the coordinate of the derivative by half a grid size so that a Yee lattice can be employed directly (see below). To be more specific about the meaning of Eq. (4), the $F\left(\tilde{f}\right)$ on the right hand side switches $\tilde{f}\left(x\right)$ from the spatial domain to the Fourier domain, and the spatial derivative is identical to the multiplication of a factor $jk$ in the Fourier domain, while the extra exponential factor ${e}^{jk\mathrm{\Delta}/2}$ in the Fourier domain is equivalent to shifting the coordinate by $\mathrm{\Delta}/2$ in the spatial domain. Thus in Eq. (4), the value on the right hand side at coordinate

*x*gives the first derivate at coordinate $x+\mathrm{\Delta}/2$.

Although our applications in Section 3 only consider continuous wave, short-pulsed cases are also in our mind. In the “toy” function Eq. (1) $\sigma =20\pi $ is a reasonable approximation to a Gaussian wave packet containing dozens of oscillations and two frequencies with an offset 1. Here, we set the grid resolution to one-fifth of the period, or $\mathrm{\Delta}\equiv 2\pi /5$. Figure 1b presents the results calculated using Eq. (4) versus the analytical values directly derived from Eq. (1), with parameters $L=32$, $m=50$ and ${n}_{\text{ovl}}=10$ at the float precision of C (or the equivalent real*4 precision of Fortran). From Fig. 1b, it is evident the two curves agree so well that they are basically indistinguishable. The relative error is at the same order of magnitude (${10}^{-4}$) should a global FFT algorithm is employed.

#### 2.2 PSTD on staggered-grid with local Fourier basis and overlapping domain decomposition

Existing PSTD implementations assign the field components $\left({E}_{x},{E}_{y},{E}_{z},{H}_{x},{H}_{y},{H}_{z}\right)$ to a collocated lattice. Its equivalent derivative operator is known to be susceptible to numerical errors introduced by spatial discontinuities within the system [10]. Especially when point sources are involved, the *δ*-functions must be replaced with extended ones to avoid the Gibbs phenomenon [14,15]. On the other hand, the symptom is absent in the derivative operator given by Eq. (4) and its equivalent staggered grid configuration [10]. The reason is as follows. Let *Δ* be the grid spacing, and the maximum frequency of FFT is the Nyquist frequency $\pm {f}_{c}$ where ${f}_{c}={\left(2\mathrm{\Delta}\right)}^{-1}$. A spatial discontinuity in the system generates spectral components at frequencies beyond $\pm {f}_{c}$, causing the spectrum of field *ψ*, i.e. $F\left[\psi \right]$, no longer vanishes at $\pm {f}_{c}$. Consequently in the Fourier domain, $ikF\left[\psi \right]$ flips sign when $-{f}_{c}\leftrightarrow {f}_{c}$ and exhibits a discontinuity at the Nyquist frequency, leaving ${\psi}^{\prime}\equiv {F}^{-1}\left[ikF\left(\psi \right)\right]$ ill-behaved. On the contrary, the extra exponential factor in Eq. (4) compensates the sign change and renders the derivative invariant. Moreover, the 3D version of Eq. (4) allows the well-known and well-behaved Yee lattice to be adopted in PSTD.

The extension of the overlapping domain decomposition described in Section 2.1 to 3D is straightforward. The total computation region of ${N}_{x}\times {N}_{y}\times {N}_{z}$ grids is divided into ${m}_{x}\times {m}_{y}\times {m}_{z}$ subdomains which are one-to-one mapped to $m\equiv {m}_{x}\times {m}_{y}\times {m}_{z}$ computation nodes. Within each subdomain, field data in the overlapping region must be obtained from all of its neighbors before the local FFT can be performed. According to the curl equations in the Maxwell’s theory, only the fields tangential to the domain interfaces are needed to be overlapped. Normally an interior subdomain, i.e. one that locates underneath the surface of the total region, requires $4{n}_{\text{ovl}}$ layers of interprocessor data exchange to update a single field. This number is lower for subdomains located on the surfaces, edges and corners of the total region. So on the per-time-step and per-node base, the upper limit of the overall data exchanges would be $8{n}_{\text{ovl}}\left(\frac{{N}_{x}{N}_{y}}{{m}_{x}{m}_{y}}+\frac{{N}_{y}{N}_{z}}{{m}_{y}{m}_{z}}+\frac{{N}_{z}{N}_{x}}{{m}_{z}{m}_{x}}\right)$., whereas it is $12{N}_{x}\left({N}_{y}-{N}_{x}/m\right){N}_{z}/m$ for the double matrix transposes in conventional PSTD. Considering an $N\times N\times N$ lattice and an ${m}_{x}={m}_{y}={m}_{z}={m}^{1/3}$ domain decomposition, our method gain by a factor $N/\left(2{n}_{\text{ovl}}{m}^{1/3}\right)$ in terms of data exchange efficiency, equal to the ratio between the width of the non-overlapping region and that of the overlapping region. Moreover, because in a parallel environment the data exchanges are run concurrently, our algorithm only requires 12 sequences of communication per-time-step, whereas the double matrix transposes in the conventional PSTD needs $2\left(m-1\right)$ sequences. The extra MPI latency to initiate $2\left(m-1\right)$ communications is another factor of cost. The above two analysis indicate that the advantage of using local Fourier basis becomes significant as the size of simulation (*N*) and the scale of supercomputer (*m*) go up.

Without losing generality, here we consider the updating procedure for ${E}_{x}$ alone. Discussions for the other five fields are similar. By Ampere’s Law,

*ε*and

*σ*the permittivity and conductivity, respectively. Same as in the conventional PSTD, the left hand side of Eq. (5) is digitized at time step $n+1/2$ as

In this paper we are interested in the light scattering characteristics of top layer EBS and polarized gating. The incident light is treated as a monochromic plane wave, pure scattered field formulation is adopted, and the standard anisotropic perfectly matched layer absorbing boundary condition is imposed to terminate outgoing waves and prevent the wraparound effect.

#### 2.3 Elimination of numerical dispersions

According to Nyquist sampling theorem, the results of spatial derivatives in PSTD are exact if the meshing density is higher than two samplings per wavelength. But the finite difference expression of the time derivative is only second-order accurate, leading to a numerical dispersion relationship such that [16]

*ε*and

*μ*the permittivity and permeability of the medium respectively. To speed up code execution, a large $\mathrm{\Delta}t$ is preferred to decrease the total number of leapfrogs before reaching the target time

*t*. However, this would increase the numerical dispersion according to Eq. (8) and cause a large numerical error. Fortunately, if we define $\alpha \equiv \frac{\omega \mathrm{\Delta}t}{2}{\mathrm{sin}}^{-1}\left(\frac{\omega \mathrm{\Delta}t}{2}\right)$ and replace

*ε*with $\alpha \epsilon $ and

*μ*with $\alpha \mu $, Eq. (8) is simplified to the ideal dispersion relation of a wave propagating at an isotropic phase velocity, and thus one restriction on $\mathrm{\Delta}t$ is removed. Though still subjected to the numerical stability requirement [16], relatively large $\mathrm{\Delta}t$ can be chosen to reduce the computation time without compromising the numerical accuracy.

#### 2.4 Reduction of numerical error in the near-to-far-field transformation

Near-to-far-field transformation is implemented to extend the computed near field results to the far field region where the scattering features are analyzed. It involves an integration of the equivalent surface current over the surfaces of a virtual rectangle enclosing the structure being modeled. However, unlike in FDTD, approximating the surface integral by direct surface summations is unwise due to the coarse grid in PSTD. Here we propose a measure with spectral accuracy to suppress this error. As an example, regarding the integration over the surface $z={z}_{0}$, the surface current $J\left(x,y,{z}_{0}\right)$ is first expressed as summation $\frac{1}{{N}_{x}{N}_{y}}{\displaystyle \sum _{m,n}{J}_{mn}}\left({z}_{0}\right){e}^{j({k}_{x,m}x+{k}_{y,n}y)}$, where ${J}_{mn}\left({z}_{0}\right)$ is the 2D FFT of $J\left(x,y,{z}_{0}\right)$ and ${k}_{x,m}$ and ${k}_{y,n}$ the corresponding discrete wave vectors. The retarded phase relation, which is the main source of error, can now be integrated out analytically, i.e.

The procedures for the other surface integrations are similar. Our numerical experiments demonstrate the improvement in accuracy is evident, especially for large grid spacings.

## 3. Applications to surface-tissue optics modeling

#### 3.1 Verification of the SLPSTD algorithm

The dilute solution of mono-dispersed dielectric microspheres is frequently used to study light scattering problems of tissue optics. The exact Mie solution of single sphere scattering serves as the touchstone for many model simulations. To demonstrate the accuracy of SLPSTD, we consider the case of plane wave scattering by a single polystyrene sphere of 2-micron diameter embedded in water. The wavelength of the incident light is 785 nm in vacuum, and the refractive indices of polystyrene and water are 1.59 and 1.33 respectively. The shape of the sphere is outlined by staircase approximation. A denser mesh can reduce staircasing problems but increases computation cost. Numerical experiments determine the best tradeoff is a grid resolution of 0.098 micron, corresponding to five sampling points per wavelength in polystyrene. The mesh consists of $108\times 108\times 108$ grids and is decomposed into $2\times 2\times 2$ subdomains. Figure 2 shows that the differential scattering cross section $d\sigma /d\theta $ given by SLPSTD only differs from the Mie solution by merely ${10}^{-6}$. The robustness of SLPSTD is further confirmed by test results of other combinations of sphere diameters and wavelengths.

#### 3.2 Verification on a point source: harmonic dipole emission in a simple cell model

Problems of Raman scattering or fluorescence are usually treated as classical radiations of electric dipoles. Here we model the radiation from a harmonic electric dipole embedded in a concentric dielectric sphere as a primitive simulation of a single Raman emitter or fluorophore inside a biological cell. As shown in the inset of Fig. 3 , the concentric sphere is composed of two layers. The core layer simulates the cell nucleus and has a diameter of 6 μm and refractive index 1.40, while the outer layer simulates the cell cytoplasm and has a diameter of 10 μm and refractive index 1.37. The sphere itself is immerged in water (refractive index 1.33). To show more features of the radiation, the point-like $\widehat{z}$-polarized dipole is off-centered by 1.4 μm along the $\widehat{z}$ axis. The oscillation frequency of the dipole is chosen to be $3.82\times {10}^{14}$Hz. Radiation intensities are then computed on a $172\times 172\times 172$ grid mesh decomposed into $2\times 2\times 2$ subdomains. Again, the gird size is set to one-fifth the wavelength in nucleus. The SLPSTD results are presented in Fig. 3 together with the analytical solution for comparison.

In Section 2.2 we point out that the staggered grid can suppress numerical errors originated from spatial discontinuities in the system. In Fig. 4
we present a snapshot of the distribution of electric field ${E}_{x}$ on plane $z=1.4\text{\mu m .}$
Note that the dipole is faithfully represented by a point source occupying only *a single grid in 3D*, without resorting to the double-grid or point-smoothing tricks in literature [14,15]. In Fig. 4a, the nice ring pattern is correctly predicted by SLPSTD. Meanwhile, the collocated-grid PSTD on global Fourier basis is unable to produce a meaningful result (Fig. 4b).

#### 3.3 Analysis on the angular profile of the intensity cone of EBS from surface tissue layer

We first apply SLPSTD to the study of the enhanced backscattering (EBS) phenomenon, a constructive interference effect between photons traveling along time-reversed paths. Previously Akkermans et al. employed a diffusion approximation approach to derive the angular profile of the enhanced intensity cone backscattered from a semi-infinite turbid medium [17]. Tseng et al. further investigated the enhanced backscattering in a 2D PSTD frame [9]. Their numerical EBS cone profile agrees very well with the prediction given by Akkermans’ theory. According to the theory, angular width of the EBS peak is in the order of $\lambda /{l}_{s}^{*}$, where *λ* is wavelength of the incident light and ${l}_{s}^{*}$ the transport mean free path length. Thus, an extreme narrow EBS peak is predicted for biological tissues (${l}_{s}^{*}$ at the order of 1 mm), which hampers its biomedical applications. However, Kim et al. proposed a novel EBS scheme utilizing low spatial coherent light (LEBS) as the illumination to destroy the phase coherence between photons distantly apart on the beam cross section [6]. Effectively only photons fall into the same coherence area ($\sim {L}_{sc}^{2}$, and ${L}_{sc}$ is the coherent length) can contribute to the EBS signal. Equivalently, low spatial coherence behaves like a spatial filter to reject long photon paths and restrict the trajectories of EBS photons to the superficial layer of the medium. As a result, much-broadened EBS cone could then be possible. Indeed, such effect was confirmed by experimental observations [6]. To reconcile Akkermans’ theory and Kim’s LEBS, one can argue that the EBS signal originated from the top surface layer is broad in nature, but is significantly sharpened by that from the underlying bulk with long pathlengths [18]. Here we want to numerically investigate the first point of the argument in 3D.

The model setup is presented in Fig. 5 . A linearly polarized plane wave impinges upon a thin volume of water suspension of mono-dispersed polystyrene beads. A trial and test procedure is designed to generate a uniform and random distribution of 2-μm polystyrene spheres in the rectangle. To simplify programming, the scattering medium is placed in water so there is no tissue-air interface. The bead concentration within the $100\times 100\times 50{\text{\mu m}}^{\text{3}}$ volume is set to 2.9%, corresponding to $12.5\text{\mu m}$ mean free path by Mie theory. The grid resolution equals to five samplings per wavelength in polystyrene. The incident angle is 15° tilted from the normal to avoid specular reflection, whose existence shows up in our high precision SLPSTD results. We further adopt the frequency-averaging trick to suppress speckles in the simulation [9]. Backscattering intensity are calculated and averaged over 21 frequencies, evenly spanning the region $\left[0.95{f}_{0},1.05{f}_{0}\right]$ with ${f}_{0}$ the corresponding frequency of 785 nm wavelength in vacuum. Due to the cost of supercomputing, all the 21 simulations are repeated on the same sphere distribution. The EBS cone from the top 50 μm layer, equivalent to an optical depth $\tau =4$, is presented in Fig. 6 . From the curve we can see the FWHM of the cone is as wide as 1.4° and the enhancement factor at 0° is roughly 1.6. The width is at the same order of Kim’s experimental result (~0.5°) in the case of ${L}_{sc}=35\text{\mu m}$ [6]. The extra factor of 3 can be attributed to the difference in the scattering mean free path length. In addition, since the singly backscattered light raises the baseline of the curve, the enhancement factor is less than 2.

So far, the cone width in Fig. 6 only qualitatively explains Kim’s experimental result. To quantitatively investigate that, we need to incorporate the same parameters in their experiment into our model configuration. At present it is unclear how ${L}_{sc}$ can be precisely translated into the penetration depth. Nevertheless, they should be at the same order. So we make a further assumption that the penetration depth equals to ${L}_{sc}$, and a scattering medium of size $100\times 100\times 35{\text{\mu m}}^{\text{3}}$ is generated, consisting of 1.5-μm diameter polystyrene spheres with a concentration corresponding to the same mean scattering length ${\mathit{\ell}}_{s}=700\text{\hspace{0.17em}}\mu m$ at wavelength 532 nm. This leads to a total 79 spheres in the volume by Mie theory. The EBS cone is then calculated under this configuration. However, due to the extreme dilute nature of the bead solution, speckle effects are so prominent that the 21-frequency average on one single set of sphere distribution is insufficient to alleviate the speckles. Thus, a further average over multiple sets of re-generated sphere distributions is carried out. Figure 7 shows the result of a 4-set average, including 84 different frequency-set runs. Still the curve is not free of speckles, and a complete suppression of speckles would require much more sets. Evidently, the calculated FWHM of EBS cone is in fact ~0.4°. Because in our model the phantom is embedded in water (see the figure caption in Fig. 5), the refraction at the air-water interface would convert it to a ~0.5° EBS cone width in air, in very good agreement with Kim’s experimental result. For the first time in literature, our numerical experiment from the first principle of electromagnetism confirms the physical picture of low-coherence EBS in Ref [6].

#### 3.4 Analysis on the penetration depth of polarization gating

The penetration depth of polarization gating in the backward scattering geometry has been studied previously using MC simulations [5]. Mueller matrix formulism is utilized to track the polarization variation after each scattering. However, MC over-simplifies the tissue structure to a collection of scattering centers. This statistical average picture is valid for deep tissues, but it is uncertain how well it approximates the surface layers where the first few scattering events are not the exact replica of the processes in nature. In future, a much-sophisticated surface tissue model should construct a 3D parameter representation of the organization of cells and extracellular matrix. Investigating its optical properties would require a parallel analytical tool like SLPSTD. Here as the first step toward that goal, we conduct the exact 3D calculation about the penetration depth of polarization gating on tissue phantoms.

The generation of the scattering medium is similar to that in Section 3.3. To study the dependence of depolarization on the penetration depth, the geometric thickness of the sphere cluster is progressively increased with the bead concentration fixed (2.9% vol.). The dimensionless parameter $\tau \equiv {\mu}_{s}{\mathit{\ell}}_{z}$ characterizes the number of scatterings a photon undergoes to penetrate the volume. The exact value of *τ* is given by Mie theory. The size of the rectangle is kept constant in the horizontal directions with ${\mathit{\ell}}_{x}={\mathit{\ell}}_{y}=100\text{\mu m}$, but varies in the vertical direction taking ${\mathit{\ell}}_{z}=$12.5, 25, 37.5, 50, 62.5, 75, 87.5, and 100 μm, corresponding to *τ* ranging from 1 to 8.

Unlike Fig. 5 the incidence is herein normal to the cluster (Fig. 8 ). At an oblique angle the left sidewall of the medium becomes a surface, and the variation of the thickness changes the surface area seen by the illumination. On the other hand, numerical results in Section 3.3 show the existence of specular reflection, indicating a slight mismatch between the refractive indices of the suspension and water. Its foot-to-foot angular width is 0.34°. Thus, at normal incidence the backscattering signal contains specular reflection component within the $\theta \le {0.17}^{o}$ cone (Fig. 9 ).

In another aspect regarding suppressing the speckles, we average over 5 sets of sphere distributions per value of *τ* to obtain the final results.

In a *θ* resolved collection geometry, the co- and cross-polarized Poynting components ${S}_{\parallel}\left(\theta ,\varphi \right)$ and ${S}_{\perp}\left(\theta ,\varphi \right)$ are directly produced by the SLPSTD computation. The collected signals are proportional to their integral over the azimuth angle *ϕ*, i.e.

Heuristic arguments claim ${I}_{dif}\left(\theta \right)$ contains signal mainly from the superficial area of medium under the assumption that multiple scattering events totally depolarize photons. Figures 9a and 9b plot the curves of ${I}_{tot}\left(\theta \right)$ and ${I}_{dif}\left(\theta \right)$ at different *τ*. The irregular oscillations are due to residual speckle effects. One obvious trend in Fig. 9b is the gap between adjacent curves decreases as *τ* increases, indicating a saturation as the increased depth does not contribute to ${I}_{dif}\left(\theta \right)$. Considering a 5° collection angle in the backward direction, a difference quantity can be further defined as

*τ*on the right hand side of Eq. (12) indexes the curve associated with

*τ*. The 0.2° lower limit of the integral excludes the specular reflection. The

*saturation curve*, first introduced in Ref [5], is thus obtained (Fig. 10 ). It rises to a plateau at ${\tau}_{c}=4$ reaching 90% of the saturation value, slightly different from the ${\tau}_{c}=3$ identified in Ref [5]. A conclusion can be drawn that in the backscattering geometry a single scattering event is insufficient to achieve complete depolarization of a linearly polarized light. Polarization gating is sensitive to regions 4 mean free paths under the surface.

All the computations were performed on the Dawning 5000A supercomputer at the Shanghai Supercomputer Center (SSC). The tasks were allocated to 8 hardware nodes, each composed of four 2.0GHz 4-core AMD Barcelona CPUs. The total 32 CPUs were configured into a $4\times 4\times 2$ topology, corresponding to $4\times 4\times 2$ computation nodes. We implemented a hybrid MPI and OpenMP programming for the parallel execution. MPI managed the inter-CPU data exchange, while 4 OpenMP threads split the local task to the 4 cores on each CPU. The subdomain sizes were carefully chosen so that FFT could run efficiently. To compare the computational efficacy and scalability of SLPSTD and conventional PSTD, we conducted numerical experiments using both method on two scattering media, with size 50 × 50 × 50 μm^{3} and 100 × 100 × 100 μm^{3}, respectively. The polystyrene sphere diameter, volume concentration and illumination configuration were the same as in Fig. 8. On a per time step and per node base, the time spent on data exchange and inner-node computation as well as their summation after averaged over 200 leapfrogs are listed in Table 1
. To facilitate FFT computation, the grid numbers were slightly different for gPSTD and SLPSTD, resulting in slight different grid resolutions. The advantage of SLPSTD becomes more prominent as the model size scales up. Also can be seen is that the size of the second model is 8 times that of the first one. The inner-node computation time of SLPSTD scales at a factor of 8, while the communication time scales at 4, proportional to the area of subdomain surfaces. Due to limitations to access 8 times as many CPUs, we were unable to test the scalability of SLPSTD with respect to an increased number of computation nodes. In Table 1, when the grid number increases from 512^{3} to 1024^{3}, the computation time of gPSTD scales disproportionally at a factor of $22.42/2.29=9.8$, instead of 8.

## 4. Summary

In this paper, the introduction of FFT with local Fourier basis endows PSTD with a scalable, full parallelism without degradation of numerical accuracy. Unlike common parallel PSTD codes, we calculate spatial derivatives based on local data and avoid the vast global data movement between computation nodes. Highly efficient, flexible and extendable domain decompositions become available to solve large-scale problems. This makes PSTD simulations able to handle huge number of variables commonly encountered in 3D tissue optics modeling. Furthermore, a reformatted derivative operator allows a Yee lattice for the electric and magnetic field allocation of PSTD. Its inherent staggered-grid formalism automatically suppresses artifacts from spatial discontinuities. The newly developed SLPSTD proves a powerful tool for studying the optical properties of surface layers of tissue phantoms. The hypothesis that the EBS photons from the surface layer create a large cone width is verified, and the predicted numerical value of the width agrees with experimental result. On the other hand, the precisely computed penetration depth of polarization gating is larger than predicted by MC simulations. We envision future applications of SLPSTD to the modeling of a fine surface tissue structure having cellular resolution.

## Acknowledgement

The computational part of this work was carried out at the Shanghai Supercomputer Center. This work was partially supported by the National Natural Science Foundation of China (Grant No. 10874195).

## References and links

**1. **S. H. Tseng and B. Huang, “Comparing Monte Carlo simulation and pseudospectral time-domain numerical solutions of Maxwell’s equations of light scattering by a macroscopic random medium,” Appl. Phys. Lett. **91**(5), 051114 (2007). [CrossRef]

**2. **F. Voit, J. Schäfer, and A. Kienle, “Light scattering by multiple spheres: comparison between Maxwell theory and radiative-transfer-theory calculations,” Opt. Lett. **34**(17), 2593–2595 (2009). [CrossRef] [PubMed]

**3. **H. Subramanian, P. Pradhan, Y. L. Kim, Y. Liu, X. Li, and V. Backman, “Modeling low-coherence enhanced backscattering using Monte Carlo simulation,” Appl. Opt. **45**(24), 6292–6300 (2006). [CrossRef] [PubMed]

**4. **V. Backman, R. Gurjar, K. Badizadegan, L. Itzkan, R. R. Dasari, L. T. Perelman, and M. S. Feld, “Polarized light scattering spectroscopy for quantitative measurement of epithelial cellular structures in situ,” IEEE J. Sel. Top. Quantum Electron. **5**(4), 1019–1026 (1999). [CrossRef]

**5. **Y. Liu, Y. L. Kim, X. Li, and V. Backman, “Investigation of depth selectivity of polarization gating for tissue characterization,” Opt. Express **13**(2), 601–611 (2005). [CrossRef] [PubMed]

**6. **Y. L. Kim, P. Pradhan, H. Subramanian, Y. Liu, M. H. Kim, and V. Backman, “Origin of low-coherence enhanced backscattering,” Opt. Lett. **31**(10), 1459–1461 (2006). [CrossRef] [PubMed]

**7. **A. Taflove, and S. C. Hagness, *Computational Electrodynamics: The Finite-Difference Time-Domain Method, Second Edition* (Artech House, 2000).

**8. **Q. H. Liu, “The PSTD algorithm: A time-domain method requiring only two cells per wavelength,” Microw. Opt. Technol. Lett. **15**(3), 158–165 (1997). [CrossRef]

**9. **S. H. Tseng, Y. L. Kim, A. Taflove, D. Maitland, V. Backman, and J. T. Walsh Jr., “Simulation of enhanced backscattering of light by numerically solving Maxwell’s equations without heuristic approximations,” Opt. Express **13**(10), 3666–3672 (2005). [CrossRef] [PubMed]

**10. **G. J. P. Correa, M. Spiegelman, S. Carbotte, and J. C. Mutter, “Centered and Staggered Fourier derivatives and Hilbert transforms,” Geophysics **67**, 1558–1563 (2002). [CrossRef]

**11. **K. Reuter, F. Jenko, C. B. Forest, and R. A. Bayliss, “A parallel implementation of an MHD code for the simulation of mechanically driven, turbulent dynamos in spherical geometry,” Comput. Phys. Commun. **179**(4), 245–249 (2008). [CrossRef]

**12. **M. Israeli, L. Vozovoi, and A. Averbuch, “Spectral multidomain technique with local Fourier basis,” J. Sci. Comput. **8**(2), 135–149 (1993). [CrossRef]

**13. **Q. B. Liao and G. A. McMechan, “2-D pseudo-spectral viscoacoustic modeling in a distributed-memory multi-processor computer,” Bull. Seismol. Soc. Am. **83**, 1345–1354 (1993).

**14. **T. W. Lee and S. C. Hagness, “A compact wave source condition for the pseudospectral time-domain method,” IEEE Antennas Wirel. Propag. Lett. **3**(14), 253–256 (2004). [CrossRef]

**15. **Q. H. Liu, “Large-scale simulations of electromagnetic and acoustic measurements using the pseudospetral time-domain(PSTD) algorithm,” IEEE Trans. Geosci. Rem. Sens. **37**(2), 917–926 (1999). [CrossRef]

**16. **Y. F. Leung and C. H. Chan, “Combining the FDTD and PSTD methods,” Microw. Opt. Technol. Lett. **23**(4), 249–254 (1999). [CrossRef]

**17. **E. Akkermans, P. E. Wolf, and R. Maynard, “Coherent backscattering of light by disordered media: Analysis of the peak line shape,” Phys. Rev. Lett. **56**(14), 1471–1474 (1986). [CrossRef] [PubMed]

**18. **K. M. Koo, Y. Takiguchi, and R. R. Alfano, “Weak localization of photons: contributions from the different scattering pathlengths,” IEEE Photon. Technol. Lett. **58**, 94–96 (1989).