## Abstract

The excessive time required by fluorescence diffuse optical tomography (fDOT) image reconstruction based on path-history fluorescence Monte Carlo model is its primary limiting factor. Herein, we present a method that accelerates fDOT image reconstruction. We employ three-level parallel architecture including multiple nodes in cluster, multiple cores in central processing unit (CPU), and multiple streaming multiprocessors in graphics processing unit (GPU). Different GPU memories are selectively used, the data-writing time is effectively eliminated, and the data transport per iteration is minimized. Simulation experiments demonstrated that this method can utilize general-purpose computing platforms to efficiently implement and accelerate fDOT image reconstruction, thus providing a practical means of using path-history-based fluorescence Monte Carlo model for fDOT imaging.

© 2015 Optical Society of America

## 1. Introduction

Fluorescence diffuse optical tomography technology (fDOT) plays an important role in biomedical research [1] and has broad application potential in functional genomics and proteomics, genetic pathology, cancer diagnostics, cellular and molecular biology, and pharmaceuticals [2,3 ]. It enables quantitative observation and precise location of biochemical reactions and the corresponding changes that occur inside living organisms [4]. Most biological tissues are highly scattering media. It is essential to establish a model that can accurately simulate the excitation and propagation of fluorescence for quantitative and locative fDOT imaging.

As is well known, photon propagation in biological tissue can be described by the radiative transfer equation. The Monte Carlo model derived from the radiative transfer equation [5] has become the gold standard to describe photon transport in biological tissue [6–10 ]. In 1997, a standard fluorescence Monte Carlo (sfMC) model was proposed [11], which describes fluorescence excitation and transport in biological tissue. Due to its low computational efficiency, the sfMC model cannot be utilized for fDOT image reconstruction. Subsequently, an adjoint fluorescence Monte Carlo (afMC) model was proposed [12], which is based on the Born approximation. It can be applied to smaller numbers of source-detector (SD) pairs; however, as the number of SD pairs increases, the calculation time required by the afMC model increases exponentially [13]. In order to overcome this difficulty, path-history-based fluorescence Monte Carlo models were developed, such as the perturbation fluorescence Monte Carlo (pfMC) model [14–16 ] and the decoupled fluorescence Monte Carlo (dfMC) model [17–19 ]. Both methods can directly calculate fluorescence by using path information of the excitation light when the optical parameters are changed. They both can improve the computational efficiency of the fDOT image reconstruction [14–19 ].

The excessive time required by the aforementioned Monte Carlo methods is the main drawback to using these techniques. Initially, researchers reduced the calculation time by employing parallel multi-node and parallel multi-processor methods. In 1997, the design of a Monte Carlo program that could run simultaneously on up to 24 computers was presented [20]. Similarly, in 2000, a Monte Carlo program that could run on a multi-processor computer was developed [21]. Over the past 20 years, Intel and AMD have continuously improved central processing unit (CPU) performance. CPU have developed from single-core to multi-core. Multi-core CPUs can execute programs simultaneously in parallel. During the same time, NVIDIA developed programmable graphics processor units (GPUs), which are more suitable than CPUs for parallel computing. In 2008, executing a Monte Carlo program on a GPU was first proposed [22], and in 2009, a parallel Monte Carlo algorithm to simulate time-resolved photon transport in arbitrary 3D turbid media accelerating by GPU was reported [23]. Parallel implementation of a Monte Carlo simulation based on a triangle mesh model was first demonstrated in 2010 [24]. In 2011, a GPU-accelerated method was first introduced into fDOT image reconstruction [25] and used the afMC Monte Carlo model. From 2011 to 2015, a path-history-based fluorescence Monte Carlo model was widely used for fDOT image reconstruction [15,19 ]. Nowadays, due to the development of computer technology, multiple nodes in parallel in cluster, multiple cores in parallel in CPU, and multiple streaming multiprocessors in parallel in GPU preclude the ability to rapidly reconstruct images via fDOT. However, in all the above acceleration methods, either multiple nodes in parallel in cluster or multiple streaming multiprocessors in parallel in GPU was used to accelerate Monte Carlo simulations, which makes fDOT image reconstruction significantly time-consuming on general-purpose computing platforms.

There has been no research thus far on accelerating fDOT image reconstruction based on path-history fluorescence Monte Carlo model. In this study, we combined multiple parallel nodes in cluster, multiple parallel cores in CPU, and multiple parallel streaming multiprocessors in GPU to develop and implement a more rapid fDOT image reconstruction method based on the dfMC model. Likewise, this acceleration method is also suitable for the fDOT image reconstruction based on the pfMC model. First, we introduce the dfMC model and apply it to fDOT image reconstruction. Then, we present the three-level parallel architecture scheme and the implementation of fDOT image reconstruction using this architecture. After that, we introduce how to store, read/write, and transfer the large amounts of data involved in fDOT image reconstruction. Furthermore, we present a simulation experiment in which the effects of different tissue optical parameters on the image reconstruction time required by the three-level parallel architecture fDOT image reconstruction were studied. A discussion of the results follows. Finally, we summarize the paper.

## 2. Methods

#### 2.1 fDOT image reconstruction problem

The fDOT image reconstruction process mainly consists of two parts: 1) establishing an accurate forward model to simulate fluorescence excitation and transport in biological tissue and 2) reconstructing the concentrations of fluorescence targets in biological tissue based on the intensities of excitation light and fluorescence measured by the detectors.

### 2.1.1 Decoupled fluorescence Monte Carlo forward model

By decoupling the excitation-to-emission and transport process of fluorescence from the path probability density function [17], the dfMC method calculates fluorescence along the trajectories of the excitation light. The calculation process can be divided into three parts: excitation light transmission, fluorescence excitation, and fluorescence transmission. In the dfMC forward model, $p$ is defined as the state of photon transmission and is a six-dimensional vector that includes information about the spatial position and direction of the transmitted photons. An excitation photon with initial weight ${w}_{0}$ is incident on the tissue surface at ${r}_{s}$. The weight of an excitation photon that achieves fluorescence excitation at position $r$ has decreased to

In the process of fluorescence excitation, the probability of conversion from excitation to emission at $r$ is

Assuming that the initial weight of an excited fluorescence photon is ${w}_{0}^{\text{'}}$, then when the fluorescence photon reaches the detector at ${r}_{d}$, its weight will have decreased to

In fDOT image reconstruction, we calculate the fluorescence intensity on the detector along the excitation light path. If an excitation photon with initial weight ${w}_{0}$ propagates from ${r}_{s}$ and is detected at ${r}_{d}$, then its weight will have decreased to

Substituting Eq. (5) into Eq. (4), the relationship between the weight of the excitation light and the fluorescence can be obtained:

Since the difference between the excitation and emission wavelengths is generally negligible [14,15 ], we can approximate ${\mu}_{a}^{ex}\approx {\mu}_{a}^{em},{\mu}_{s}^{ex}\approx {\mu}_{s}^{em}$, so Eq. (6) can be written as:

For calculation convenience, the biological tissue can be voxelized. In a single voxel ${v}_{f}$, if the optical parameter distribution is considered to be uniform, then ${{\mu}_{af}\left(\overrightarrow{r}\right)|}_{r\in voxel\text{\hspace{0.17em}}{v}_{f}}={\mu}_{af{v}_{f}}$,${{\mu}_{s}\left(\overrightarrow{r}\right)|}_{r\in voxel\text{\hspace{0.17em}}{v}_{f}}={\mu}_{s{v}_{f}}$, and ${g\left(\overrightarrow{r}\right)|}_{r\in voxel\text{\hspace{0.17em}}{v}_{f}}={g}_{{v}_{f}}$. There may be multiple scattering points in a single voxel, and it is likely that fluorescence will be excited at each scattering point. If an excitation photon propagates from a source at ${r}_{s}$ and is excited in voxel ${v}_{f}$, then the total weight of fluorescence detected at ${r}_{d}$ can be expressed as:

The excitation photon passes through $f$voxels before reaching voxel ${v}_{f}$ in which fluorescence is excited. Here, ${p}_{{h}_{f-1}+1},\cdots ,{p}_{{h}_{f}}$ (${h}_{0}=0,{h}_{f}=k-1$) represents the sequence of states of a propagating photon that falls within the ${f}^{th}$ voxel, while ${p}_{k},\cdots ,{p}_{k+t}$ represents the sequence of states of a propagating photon that falls within voxel ${v}_{f}$. In the forward calculation, it is necessary to store the physical quantities ${L}_{1}=l\left({p}_{k-1}\to {p}_{k}\right)$ and ${L}_{2}={P}_{I}\left({\widehat{s}}_{k-1}\cdot {\widehat{s}}_{k}\right)/{P}_{A}\left({\widehat{s}}_{k-1}\cdot {\widehat{s}}_{k},{g}_{{v}_{f}}^{ex}\right)$ at each step along the excitation light path to perform fDOT image reconstruction.

### 2.1.2 Image reconstruction with Tikhonov regularization

Because of the nonlinearity and ill-posedness of the inverse problem [26], Tikhonov regularization is usually introduced into the reverse calculation to improve its stability [27]. The process of fDOT image reconstruction involves finding the optimal solution to minimize the objective function [28], which can be expressed as

For calculation convenience, Eq. (9) can be transformed into the equivalent form

#### 2.2 Three-level parallel architecture scheme for fDOT image reconstruction

In order to obtain the accurate image reconstruction results, a large number of photons must be simulated in the forward calculation. Furthermore, the stored information about the photons’ states when they reach the detectors and their paths in the tissue must be imported to calculate the Jacobi matrix, and then image reconstruction results are obtained by iteration in the reverse calculation. Obviously, this task is computationally intensive task. The calculations for each photon are independent from one another. Since the process of fDOT image reconstruction is well suited to parallel calculation, we ported the program executed by CPUs in serial to three-level parallel architecture in parallel execution, thus greatly reducing the image reconstruction time. In Fig. 1(a) , we show the developed three-level parallel architecture, which can analyze in parallel multiple nodes in cluster, multiple cores in CPU, and multiple streaming multiprocessors (SMs) in GPU. The implementation process for the three-level parallel architecture is as follows.

- 1. By employing a Message Passing Interface (MPI), multiple nodes in cluster can be coordinated to execute a program in parallel. The specific steps are as follows.
- (1) Each computer is regarded as a node. Node 0 is regarded as the host node, while the other nodes are regarded as the child node. Each node obtains information about its own number and type of GPUs.
- (2) Each child node transfers the information about its own number and type of GPUs to the host node.
- (3) According to the GPU number and type at each child node, the host node assigns computing tasks to the child nodes.

- 2. By using Open Multi-Processing (OpenMP), multiple cores in CPU can be coordinated to execute a program in parallel. The specific steps are as follows:
- 3. Compute unified device architecture (CUDA) can be utilized to coordinate multiple streaming multiprocessor in GPU to execute a program in parallel. The specific steps are as follows:
- (1) According to the type of GPU, the available resources in each streaming multiprocessor is determined. According to the program, the amount of resource used in each streaming multiprocessor is determined.
- (2) According to the scale of the task, the dimension and size of the grid and block in each streaming multiprocessor is determined.
- (3) The GPU further assigns the computing tasks to each streaming multiprocessor.

In Fig. 1(b), we present a schematic of fDOT image reconstruction on single child node. The specific process used for fDOT image reconstruction in the three-level parallel architecture is as follows:

- 1. The parallel MPI environment is initialized, and the child node obtains information about its own number and type of GPUs, transfers the information to the host node, and receives the serial numbers of the sources to be calculated from the host node.
- 2. The parallel Open MP environment is initialized, and the multi-core CPU on the child node starts the number of processes corresponding to its number of GPUs and assigns the serial numbers of the sources to be calculated to each process.
- 3. The parallel CUDA environment is initialized, and the number of blocks in each grid and the number of threads in each block is determined according to the available resources in streaming multiprocessor on GPU. The child node reads the number of photons to be calculated for each source. Each photon is processed on single thread.
- 4. In the forward calculation, the transmission of each photon through the biological tissue is tracked on each thread. In the reverse calculation, the contribution of each photon to the Jacobi matrix is calculated on each thread.
- 5. After the computing tasks have been completed on each thread, in the forward calculation, the child node writes the information about the photon’s path in the tissue and its state on the detector onto its disk. In the reverse calculation, the child node calculates the corresponding vector and transfers them to the host node. The image reconstruction results can then be obtained by performing the conjugate gradient method on the host node.

#### 2.3 Programmatic implementation of three-level parallel architecture

In the forward calculation, it is necessary to read the initial information about each photon, size of the tissue model, optical parameters of the tissue and store the information about each photon’s state on the detector and its path in the tissue. In the reverse calculation, it is necessary to read the information about each photon’s state on the detector and its path in the tissue. Due to the large amount of data acquired in the three-level parallel calculation method, it is particularly important to efficiently store, read/write, and transfer the data. In the following section, we will discuss data processing in this method.

### 2.3.1 Allocation of memory

CUDA has a specific memory model in which each thread has its own registers and each block has its own shared memory. All threads in the grid can access the global memory, constant memory, and texture memory. The read/write speeds of the different memory types vary significantly. The read/write speed of global memory is the slowest, but its memory space is the greatest among the aforementioned memory types. To implement programs, it is preferred to use high-speed memory. The allocation of memory is shown in Fig. 2 . The register and shared memory are used to store temporary variables. The global memory is used to store the information about the photon states on the detectors. Constant memory can be used to store the initial information about the photon, the size of the tissue model, and the optical parameters of the tissue, while texture memory can be used to store the index of the tissue model, as all of these quantities only need to be read [24,29 ]. The amount of information about the photons’ path in the tissue to be stored is large and must be copied from the CPU memory to the GPU memory in the forward calculation. Data copying requires significant time. The page-locked host memory located at the CPU can be mapped into the address space of the GPU, eliminating the need to copy it to or from the GPU memory [30]. Thus, there is no need to allocate memory on the GPU. Hence, in order to save data-copying time, the page-locked host memory can be used to store information about the photons’ path in the tissue.

### 2.3.2 Eliminating the data-writing time

As shown in Fig. 3(a) , in the original forward calculation time flow, one CPU core on each node starts a process to call the GPU and simulate the photon transport in the biological tissue. After the simulation has been completed, the information about the photons’ states on the detectors and their paths in the tissue is copied from the GPU memory to the CPU memory and is written onto the disk. This process is called a simulation cycle. Thus, a simulation cycle includes GPU computing time, data-copying time, and data-writing time. From Fig. 3(a), it is apparent that the CPU resources are not fully utilized and that the data-writing time occupies a large proportion of the time consumed by each simulation cycle. In order to accelerate the calculations, we optimized the forward calculation time flow. As shown in Fig. 3(b), in the optimized time flow, one CPU core starts a process to call the GPU to simulate the photon transport in the biological tissue and complete the data-copying. At the same time, the other core starts a process to write onto the disk the stored information from the previous simulation cycle about the photons’ states on the detectors and their paths in the tissue. After both of these tasks have been completed for both processes, the simulation cycle ends. In Fig. 3(b), it is apparent that, after optimizing the time flow, the parallelism of the entire simulation is maximized and the time required for one simulation cycle can be reduced from ${T}_{A}$ to ${T}_{B}$.

### 2.3.3 Minimization of data transport per iteration

In the forward calculation, each node only stores the state and path information for photons propagated from a portion of the sources. Thus, in the reverse calculation, each node only calculates one block of the Jacobi matrix. If the block of the Jacobi matrix on each child node is transferred to the host node and the total iteration process is completed on the host node, then the amount of data transferred will be immense and the computational burden on the host node will be excessive. As shown in Fig. 4 , our technique uses the conjugate gradient method [31] to perform iterative reconstruction on the three-level parallel architecture. As shown, only a portion of the computing tasks are assigned to each node. Thus, the amount of data transfer and the memory space required in the host node are reduced. The specific steps are as follows:

- 1. Each child node initializes iteration step ${x}^{0}$ for the specific absorption coefficient of the fluorophore and calculates the vector${J}_{i}^{T}{J}_{i}x{\text{\hspace{0.05em}}}^{0}\text{\hspace{0.05em}}-\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{J}_{i}^{T}\Delta D{\text{\hspace{0.05em}}}_{i}$, where $i$ is the serial number of the node. After that, it is transferred to the host node to calculate $\nabla f\left({x}^{0}\right)$. The calculation process on the host node is as follows:$$\begin{array}{l}\nabla f\left({x}^{0}\right)=H{x}^{0}+b\\ \text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{0.17em}}=\left({J}^{T}J\text{+}\lambda \right){x}^{0}\text{\hspace{0.05em}}\text{\hspace{0.05em}}-\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{J}^{T}\Delta D.\\ \text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{0.17em}}=\text{\hspace{0.05em}}\text{\hspace{0.05em}}\left(\left({J}_{0}^{T}{J}_{0}+\lambda \right){x}^{0}-{J}_{0}^{T}\Delta D{\text{\hspace{0.05em}}}_{0}\right)+{\displaystyle \sum _{i=1}^{n\text{-}1}\left({J}_{i}^{T}{J}_{i}{x}^{0}\text{\hspace{0.05em}}\text{\hspace{0.05em}}-\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{J}_{i}^{T}\Delta D{\text{\hspace{0.05em}}}_{i}\right)}\end{array}$$
where $n$ is the total number of nodes.

- 2. The host node judges whether iteration step ${x}^{k}$ minimizes the objective function, where $k$ is the index of the iteration. If the value of ${\Vert {g}^{k}\Vert}_{2}$ is less than the threshold, the iteration ends; otherwise, the algorithm proceeds to the third step. Here, ${g}^{k}=\nabla f\left({x}^{k}\right)$.
- 3. The search direction ${s}^{k}$ is calculated on the host node and transferred to each child node. If $k=0$, then ${s}^{0}=-\nabla f\left({x}^{0}\right)=-{g}^{0}$. If $k>0$, then ${s}^{k}=-{g}^{k}+{\beta}_{k-1}{s}^{k-1}$.
- 4. The vector ${J}_{i}^{T}{J}_{i}{s}^{k}$ is calculated on each child node. After that, it is transferred to the host node to calculate $H{s}^{k}$. The calculation process on the host node is as follows:$$\begin{array}{l}H{s}^{k}=\left({J}^{T}J\text{+}\lambda \right){s}^{k}\\ \text{\hspace{0.17em}}\text{\hspace{1em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}=\text{\hspace{0.05em}}\text{\hspace{0.05em}}\left({J}_{0}^{T}{J}_{0}+\lambda \right){s}^{k}+{\displaystyle \sum _{i=1}^{n\text{-}1}\left({J}_{i}^{T}{J}_{i}{s}^{k}\right)}.\end{array}$$
- 5. The intermediate variable ${m}_{k}=-\frac{{\left({g}^{k}\right)}^{T}{s}^{k}}{{\left({s}^{k}\right)}^{T}H{s}^{k}}$, ${g}^{k+1}={g}^{k}+H{s}^{k}{m}^{k}$, and ${\beta}_{k}=\frac{{\Vert {g}^{k+1}\Vert}^{2}}{{\Vert {g}^{k}\Vert}^{2}}$ are calculated on the host node in order.
- 6. A new iterative step is determined by ${x}^{k+1}={x}^{k}+{m}_{k}{s}^{k}$ on the host node, and the algorithm returns to step 2.

## 3. Results

The three-level parallel architecture cluster we built has three nodes. The first node has a six-core Intel E5-2630 v2 CPU with 32 GB of RAM. It contains two GPUs (Quadro K4000), and the GPU with 768 CUDA cores has four streaming multiprocessors. The second node has a six-core Intel Xeon X5680 with 32 GB of RAM. It contains two GPUs (GTX770), and the GPU with 1536 CUDA cores has eight streaming multiprocessors. The third node has a four-core Intel i7-2600 CPU with 16 GB of RAM. It contains two GPUs (GTX670), and the GPU with 1344 CUDA cores has seven streaming multiprocessors.

#### 3.1 Image reconstruction time for cylindrical model

We tested the three-level-parallel-architecture fDOT image reconstruction through simulation experiments. The designed cylinder model is homogeneous, as shown in Fig. 5(a) . It contains three fluorescent targets with different specific absorption coefficients. In general, the optical parameters at the excitation and emission wavelengths are considered to be the same [14,15 ], so we set ${\mu}_{a}={\mu}_{a}^{ex}={\mu}_{a}^{em}$ and ${\mu}_{s}={\mu}_{s}^{ex}={\mu}_{s}^{em}$. The quantum efficiency was set to 1, and the other optical parameters used in the simulation are shown in Table 1 [13,14,32 ]. The cylinder model is 2.0 cm × 2.0 cm × 3.0 cm in size and was discretized into 12,000 voxels, which were each 0.1 cm in size. The reconstructed region, which is 12 mm in height, was discretized into 3,660 voxels. In the fDOT image reconstruction, we simulated 120 sources that were 10 mm in height and were located at azimuthal angles ranging from ${0}^{\xb0}$ to ${360}^{\xb0}$. The sources were evenly distributed in six rings with 20 sources in each ring. We selected 240 detectors that were 20 mm in height and were located at azimuthal angles ranging from ${0}^{\xb0}$ to ${360}^{\xb0}$. The 240 sources were evenly distributed in 10 layers with 24 detectors in each layer. For each source, $2\times {10}^{6}$ photons were simulated.

According to the number and type of GPUs as well as the read/write performance of the disk on each node, the numbers of sources assigned to the three nodes were 25, 50, and 45. We selected the conjugate gradient method to iterate. The 3D and cross-section (z = 15 mm) image reconstruction results for the concentrations of the fluorescent targets are shown in Figs. 5(b) and 5(c), respectively. In Fig. 5(d), the linear correlation between the reconstructed and actual fluorescence concentrations can reach 0.985. For a linear correlation of 0.985, the time required for image reconstruction is shown in Table 2 . We repeated the simulation experiments five times, and the standard deviation of the image reconstruction time was determined to be 0.07.

#### 3.2 Factors influencing reconstruction efficiency

When the path-history fluorescence Monte Carlo model is used, the image reconstruction efficiency of the three-level parallel architecture is mainly influenced by two factors. Firstly, it is influenced by the optical parameters of the cylinder model, such as the absorption coefficient of the background medium, scattering coefficient of the background medium, and specific absorption coefficient of the fluorescent target. In addition, it is influenced by the fDOT image reconstruction parameters, such as the numbers of sources, detectors, and photons simulated for each source. The pfMC and dfMC models are both path-history fluorescence Monte Carlo models, and they have the similar calculation process. As the numbers of sources and photons increase, the time required for fDOT image reconstruction increases linearly [13]. We mainly studied the effects of the cylinder model optical parameters on the computational efficiency of fDOT image reconstruction. The numbers of sources, detectors, and photons simulated for each source are consistent with those in the case described in 3.1. In the fDOT image reconstruction, the initial specific absorption coefficient of the fluorescent target was set to 0. When the linear correlation between the reconstructed and actual fluorescence concentrations reached 0.98, the iteration ended. The other parameters were held constant, and the effects of the absorption coefficient of the background medium, scattering coefficient of the background medium, and specific absorption coefficient of the fluorescent target on the image reconstruction time were investigated. When the absorption coefficient of the background medium is changed, the experimental results are shown in Table 3 and Fig. 6 . When the scattering coefficient of the background medium is changed, the experimental results are shown in Table 4 and Fig. 7 . When the specific absorption coefficient of the fluorescent targets are changed, the experimental results are shown in Table 5 and Fig. 8 .

## 4. Discussion and conclusions

In Figs. 5(b) and 5(c), it is evident that the reconstructed and actual fluorescent target positions are very close to each other. In Fig. 5(d), the linear correlation between the reconstructed and actual fluorescent target concentrations is high. The image reconstruction results confirm that three-level parallel architecture fDOT image reconstruction can recover the positions and concentrations of fluorescent targets correctly. From Table 2, it is evident that the image reconstruction time is very stable. Its minor variation mainly originates from the random changes in the read/write speed of the disk over time. Thus, the use of three-level parallel architecture reduces the image reconstruction time to an acceptable level.

As shown in Table 3 and Fig. 6, as the absorption coefficient of the background medium increases, the image reconstruction time is stable at first and then increases slightly. This trend appears because, as the absorption coefficient of the background medium increases, the number of photons that reach the detector decreases. Therefore, to achieve the same reconstructed accuracy, more iterations are needed. Table 4 and Fig. 7 demonstrate that, as the scattering coefficient of the background medium increases, the image reconstruction time increases. This increase is present because a larger scattering coefficient of the background medium corresponds to longer photon paths through the voxels and, hence, a greater amount of path information that must be read/written. In Table 5 and Fig. 8, when the initial specific absorption coefficient of the fluorescent target in the iteration is set to 0, the image reconstruction time increases linearly as the specific absorption coefficient of the fluorescent target increases. This increase is caused because, with a fixed iteration step size, when the actual specific absorption coefficient of the fluorescent target is closer to the initial value, the iteration proceeds faster.

In conclusion, utilizing general-purpose computing platforms, we developed the three-level parallel architecture to accelerate fDOT image reconstruction based on path-history fluorescence Monte Carlo model. Through the simulation experiments, we demonstrated the stability and efficiency of the three-level parallel architecture fDOT image reconstruction. With the development of electronic technology, the CPU and GPU performances can be further improved. In addition, if the numbers of nodes and of CPUs are increased, the image reconstruction time can be further reduced. This accelerating method of fDOT image reconstruction based on path-history fluorescence Monte Carlo model will potentially be useful for in vivo optical imaging in the future.

## Acknowledgments

This work was supported by National Major Scientific Research Program of China (No. 2011CB910401), National Key Scientific Instrument & Equipment Development Program of China (No. 2012YQ030260), Science Fund for Creative Research Group (No. 61421064), and National Natural Science Fund (No. 61078072).

## References and links

**1. **J. Lackowicz, *Principle of Fluorescence Spectroscopy* (Springer, New York, 2006).

**2. **A. Corlu, R. Choe, T. Durduran, M. A. Rosen, M. Schweiger, S. R. Arridge, M. D. Schnall, and A. G. Yodh, “Three-dimensional in vivo fluorescence diffuse optical tomography of breast cancer in humans,” Opt. Express **15**(11), 6696–6716 (2007). [CrossRef] [PubMed]

**3. **V. Ntziachristos, “Fluorescence molecular imaging,” Annu. Rev. Biomed. Eng. **8**(1), 1–33 (2006). [CrossRef] [PubMed]

**4. **S. C. Davis, K. S. Samkoe, K. M. Tichauer, K. J. Sexton, J. R. Gunn, S. J. Deharvengt, T. Hasan, and B. W. Pogue, “Dynamic dual-tracer MRI-guided fluorescence tomography to quantify receptor density in vivo,” Proc. Natl. Acad. Sci. U.S.A. **110**(22), 9025–9030 (2013). [CrossRef] [PubMed]

**5. **B. F. Manly, *Randomization, Bootstrap and Monte Carlo Methods in Biology* (CRC Press, 2006).

**6. **C. Zhu and Q. Liu, “Review of Monte Carlo modeling of light transport in tissues,” J. Biomed. Opt. **18**(5), 050902 (2013). [CrossRef] [PubMed]

**7. **T. Li, H. Gong, and Q. Luo, “MCVM: Monte Carlo modeling of photon migration in voxelized media,” J. Innov. Opt. Health Sci. **3**(2), 91–102 (2010). [CrossRef]

**8. **B. H. Hokr, V. V. Yakovlev, and M. O. Scully, “Efficient time-dependent Monte Carlo simulations of stimulated raman scattering in a turbid medium,” ACS Photonics **1**(12), 1322–1329 (2014). [CrossRef]

**9. **A. V. Bykov, M. Y. Kirillin, and A. V. e. Priezzhev, “Monte Carlo simulation of an optical coherence Doppler tomograph signal: the effect of the concentration of particles in a flow on the reconstructed velocity profile,” Quantum Electron. **35**(2), 135–139 (2005). [CrossRef]

**10. **T. Li, Y. Zhao, Y. Sun, and K. Li, “Effects of wavelength, beam type and size on cerebral low-level laser therapy by a Monte Carlo study on visible Chinese human,” J. Innov. Opt. Health Sci. **8**(1), 1540002 (2015). [CrossRef]

**11. **A. J. Welch, C. Gardner, R. Richards-Kortum, E. Chan, G. Criswell, J. Pfefer, and S. Warren, “Propagation of fluorescent light,” Lasers Surg. Med. **21**(2), 166–178 (1997). [CrossRef] [PubMed]

**12. **V. Ntziachristos and R. Weissleder, “Experimental three-dimensional fluorescence reconstruction of diffuse media by use of a normalized Born approximation,” Opt. Lett. **26**(12), 893–895 (2001). [CrossRef] [PubMed]

**13. **J. Chen and X. Intes, “Comparison of Monte Carlo methods for fluorescence molecular tomography-computational efficiency,” Med. Phys. **38**(10), 5788–5798 (2011). [CrossRef] [PubMed]

**14. **A. Liebert, H. Wabnitz, N. Zołek, and R. Macdonald, “Monte Carlo algorithm for efficient simulation of time-resolved fluorescence in layered turbid media,” Opt. Express **16**(17), 13188–13202 (2008). [CrossRef] [PubMed]

**15. **J. Chen, V. Venugopal, and X. Intes, “Monte Carlo based method for fluorescence tomographic imaging with lifetime multiplexing using time gates,” Biomed. Opt. Express **2**(4), 871–886 (2011). [CrossRef] [PubMed]

**16. **A. T. Kumar, “Direct Monte Carlo computation of time-resolved fluorescence in heterogeneous turbid media,” Opt. Lett. **37**(22), 4783–4785 (2012). [CrossRef] [PubMed]

**17. **Z. Luo, Y. Deng, K. Wang, L. Lian, X. Yang, and Q. Luo, “Decoupled fluorescence Monte Carlo model for direct computation of fluorescence in turbid media,” J. Biomed. Opt. **20**(2), 025002 (2015). [CrossRef] [PubMed]

**18. **X. Jiang, Y. Deng, Z. Luo, K. Wang, L. Lian, X. Yang, I. Meglinski, and Q. Luo, “Evaluation of path-history-based fluorescence Monte Carlo method for photon migration in heterogeneous media,” Opt. Express **22**(26), 31948–31965 (2014). [CrossRef] [PubMed]

**19. **Y. Deng, Z. Luo, X. Jiang, W. Xie, and Q. Luo, “Accurate quantification of fluorescent targets within turbid media based on a decoupled fluorescence Monte Carlo model,” Opt. Lett. **40**(13), 3129–3132 (2015). [CrossRef] [PubMed]

**20. **D. R. Kirkby and D. T. Delpy, “Parallel operation of Monte Carlo simulations on a diverse network of computers,” Phys. Med. Biol. **42**(6), 1203–1208 (1997). [CrossRef] [PubMed]

**21. **A. Colasanti, G. Guida, A. Kisslinger, R. Liuzzi, M. Quarto, P. Riccio, G. Roberti, and F. Villani, “Multiple processor version of a Monte Carlo code for photon transport in turbid media,” Comput. Phys. Commun. **132**(1-2), 84–93 (2000). [CrossRef]

**22. **E. Alerstam, T. Svensson, and S. Andersson-Engels, “Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration,” J. Biomed. Opt. **13**(6), 060504 (2008). [CrossRef] [PubMed]

**23. **Q. Fang and D. A. Boas, “Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units,” Opt. Express **17**(22), 20178–20190 (2009). [CrossRef] [PubMed]

**24. **N. Ren, J. Liang, X. Qu, J. Li, B. Lu, and J. Tian, “GPU-based Monte Carlo simulation for light propagation in complex heterogeneous tissues,” Opt. Express **18**(7), 6811–6823 (2010). [CrossRef] [PubMed]

**25. **G. Quan, H. Gong, Y. Deng, J. Fu, and Q. Luo, “Monte Carlo-based fluorescence molecular tomography reconstruction method accelerated by a cluster of graphic processing units,” J. Biomed. Opt. **16**(2), 026018 (2011). [CrossRef] [PubMed]

**26. **X. Song, D. Wang, N. Chen, J. Bai, and H. Wang, “Reconstruction for free-space fluorescence tomography using a novel hybrid adaptive finite element algorithm,” Opt. Express **15**(26), 18300–18317 (2007). [CrossRef] [PubMed]

**27. **B. W. Pogue, T. O. McBride, J. Prewitt, U. L. Österberg, and K. D. Paulsen, “Spatially variant regularization improves diffuse optical tomography,” Appl. Opt. **38**(13), 2950–2961 (1999). [CrossRef] [PubMed]

**28. **J. Ye, C. Chi, Z. Xue, P. Wu, Y. An, H. Xu, S. Zhang, and J. Tian, “Fast and robust reconstruction for fluorescence molecular tomography via a sparsity adaptive subspace pursuit method,” Biomed. Opt. Express **5**(2), 387–406 (2014). [CrossRef] [PubMed]

**29. **Y. Wang, P. Li, C. Jiang, J. Wang, and Q. Luo, “GPU accelerated electric field Monte Carlo simulation of light propagation in turbid media using a finite-size beam model,” Opt. Express **20**(15), 16618–16630 (2012). [CrossRef]

**30. **“Nvidia Corporation, “CUDA programming guide 6.5,” (2014).”

**31. **J. R. Shewchuk, *An Introduction to the Conjugate Gradient Method Without the Agonizing**Pain* (Carnegie-Mellon University, Department of Computer Science, 1994).

**32. **T. Correia, N. Ducros, C. D’Andrea, M. Schweiger, and S. Arridge, “Quantitative fluorescence diffuse optical tomography in the presence of heterogeneities,” Opt. Lett. **38**(11), 1903–1905 (2013). [CrossRef] [PubMed]