
High-precision microscopic autofocus with a single natural image


Abstract

In industrial microscopic detection, learning-based autofocus methods have enabled operators to acquire high-quality images quickly. However, learning-based methods suffer from two sources of error: the fitting error of the network model and the labeling error of the prior dataset, which limit the potential for further improvements in focusing accuracy. In this paper, a high-precision autofocus pipeline was introduced, which predicts the defocus distance from a single natural image. A new method for making datasets was proposed, which overcomes the limitations of the sharpness metric itself and improves the overall accuracy of the dataset. Furthermore, a lightweight regression network, namely the Natural-image Defocus Prediction Model (NDPM), was built to improve the focusing accuracy. A realistic dataset of sufficient size was made to train all models. Experiments show that NDPM has better focusing performance than other models, with a mean focusing error of 0.422 µm.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In industrial microscopic real-time detection, autofocus systems serve as an indispensable component, providing high-quality, well-focused images of inspected samples such as wafers [1], printed circuit boards [2], and chips [3]. The system adjusts the focal plane automatically, eliminating the need for manual adjustments and allowing inspectors to concentrate on analyzing the images to determine whether the inspected samples are defective. Therefore, autofocus systems play an important role in industrial real-time detection owing to their high productivity and focusing accuracy.

One of the most common types of autofocus methods in industry is image-based autofocus, as shown in Fig. 1(a), which uses sharpness metrics [4–9] to evaluate image quality and determines the optimal focus position by search strategies [10–12]. However, these methods are time-consuming and sensitive to noise and background. To solve these problems, various learning-based methods have been proposed to overcome the disadvantages of traditional multi-step focusing and to improve focusing robustness through training on large amounts of data. A learning-based autofocus method assumes an optical defocus model, as shown in Fig. 1(b), which establishes a connection between the defocus amount of the image and the defocus distance. Such methods can be divided into three categories according to the output of the network. The first category converts autofocus into a classification problem, which samples uniformly within a specific stroke range and regards each sampling point as a category. Some works take a complete focal stack [13], two focal slices [13,14], or a single focal slice [13,15–17] as network input and then calculate the defocus distance from the category with the highest score. The second category regards autofocus as a regression problem. Compared with the classification approach, this method omits the intermediate step and maps the focal state of the image directly to the defocus distance. In general, the input is one or two focal slices from the natural image [18–21] or the magnitude of its Fourier transform [22,23], and the output is the estimated defocus distance from the in-focus position. The third category, which does not need to estimate the defocus distance, is entirely different from the previous two: it relies on robust generative adversarial networks to restore the out-of-focus image to the in-focus image [24–27], realizing end-to-end prediction.

Fig. 1. (a) The workflow of Image-based methods. These methods determine the optimal focus position by sharpness metric and search strategies. (b) The workflow of Learning-based methods. Such methods predict the defocus distance from a single defocus image or multiple images.

As an extension of image-based methods, learning-based methods improve not only the focusing speed but also the focusing accuracy. However, these methods still contain errors that affect the focusing accuracy. All network models in learning-based methods have fitting errors due to the nonlinearity of the model and the diversity of the training data, and the magnitude of this error depends on the model’s ability to extract image features. The training dataset is usually annotated by a sharpness metric, as shown in Fig. 2, similar to the traditional image-based method, and the limitations of the metric itself affect the accuracy of the resulting dataset. Both the dataset and the models are prone to errors, thereby limiting the potential for further improvements in focusing accuracy. To address this problem, some methods [13,28] determine the optimal focus position in the dataset by selecting the best fitting algorithm for a given application or by computing depth for each image using a modified multi-view stereo pipeline. With the continuous improvement of computing power, a large number of excellent network models [29–33] have been proposed; they not only have stronger feature extraction capabilities but also smaller fitting errors.

Fig. 2. The conventional process of making a dataset.

In this paper, a high-precision autofocus pipeline was introduced, which predicts the defocus distance from a single natural image. A new method with active laser illumination for making datasets was proposed, which labels the dataset by the spot features of the split image and overcomes the limitations of the sharpness metric itself, improving the overall accuracy of the dataset. Furthermore, a lightweight regression network, namely the Natural-image Defocus Prediction Model (NDPM), was built; it pays more attention to the global features of the image, fully establishes the optical defocus model of the microscope, and further improves the focusing accuracy at arbitrary magnification. A realistic dataset of sufficient size was made to train all models. The experiments demonstrate that NDPM has better focusing performance than other prediction models. In addition, compared with current mainstream lightweight neural networks, NDPM has a stronger ability to extract global features, and therefore higher focusing accuracy and robustness.

2. Making datasets

2.1 Challenges of making datasets

There are many challenges in the process of making datasets. Prior work [14] found the optimal focal position by eye, which introduces large human error. Other works [18,25] generally rely on a sharpness metric to determine the focal position, as shown in Fig. 2. First, the objective lens scans along the z-axis with a fixed sampling step, and the region of interest is extracted from each image as a slice. These slices, obtained from different focal planes, form a focal stack. Then, the sharpness metric is maximized to determine the ground truth focal position of each focal stack as its reference slice ($k^{*}$), whose actual defocus distance is marked as 0$\mathrm {\mu }$m. At the same time, the actual defocus distances ($\Delta z_{k}$) of the remaining slices are assigned according to the fixed sampling step:

$$k^{*} = \underset{k}{\arg \max }\, G\left(I_{k}\right), \qquad \Delta z_{k} = \left(k-k^{*}\right) \cdot s, \qquad \forall k \in\{1, \ldots, n\}$$
where, $G$ represents the sharpness metric, $I_k$ is a slice in the focal stack, $k$ is the index corresponding to the slice, $k^{*}$ is the index value corresponding to the sharpest focused slice calculated by sharpness metric, $n$ is the total number of slices in one focal stack, and $\Delta z_{k}$ is the actual defocus distance of the slice, $s$ is the fixed sampling step.
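As a minimal sketch of this conventional labeling procedure, the snippet below picks the sharpest slice of a focal stack with a generic sharpness metric and assigns defocus labels according to Eq. (1); the variance-of-Laplacian metric and the 0.5 µm step are illustrative assumptions, not the exact choices made in any particular prior work.

```python
import numpy as np
import cv2

def sharpness(img):
    # Illustrative metric: variance of the Laplacian (any metric G from [4-9] could be used).
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def label_focal_stack(slices, step_um=0.5):
    """Return the defocus distance (in micrometres) of every slice in a focal stack."""
    scores = [sharpness(s) for s in slices]
    k_star = int(np.argmax(scores))            # index of the reference (sharpest) slice
    return [(k - k_star) * step_um for k in range(len(slices))]
```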

The following subsections describe how the inherent limitations of the sharpness metric affect the precision and speed of dataset construction in certain scenarios.

2.1.1 Sensitive problem of sharpness metric

The sharpness metric is sensitive to noise and contrast [13]. When the surface has particle-type defects [1], texture [34], dust [35], or other types of noise, or when the main texture feature is not prominent, the sharpness evaluation is seriously degraded, resulting in a deviation between the calculated focal plane and the ideal focal plane.

2.1.2 Selection problem of sharpness metric

Different sharpness metrics employ distinct algorithms to evaluate image clarity. As a result, the ideal focal plane calculated by different sharpness metrics will differ for a given focal stack. Consequently, it is difficult to determine the optimal sharpness metric for a scene. Table 1 lists the index values of the ideal focal plane obtained by 13 different sharpness metrics [4–9] for 3 different scenes. Each focal stack contains 45 slices of different focal planes, indexed from 0 to 44.

Table 1. The index values of the ideal focal plane obtained by 13 different sharpness metrics for 3 different scenes

2.1.3 Calculation problem of sharpness metric

The evaluation results of a sharpness metric are calculated by computer, which makes them non-intuitive when collecting images offline and leads to significant deviations in the selection of the initial focal plane. Given the difficulty of discerning the clearest image within the depth of field, additional images must be captured to ensure dataset uniformity, which is both time-consuming and labor-intensive. Figure 3 shows the ideal focal plane calculated by the sharpness metric and the initial focal plane selected manually in one focal stack.

Fig. 3. The ideal focal plane and the initial focal plane in one focal stack.

2.2 Our method for making datasets

To overcome the limitations of making datasets with a sharpness metric, a new method with active laser illumination was proposed to determine the ground truth focal position of each focal stack. This method not only eliminates potential interference from background noise, but also speeds up and improves the accuracy of focal plane localization.

2.2.1 Principle of the split-image focusing

The split-image focusing system [36–38] consists of a lens, a pair of triangular prisms called optical wedges, and a pattern mask. The optical wedges interlace to divide the spot pattern into two portions, and their structure is shown in Fig. 4. According to the positional relationship between the focal point and the optical wedges, the system can be in a far-focus, in-focus, or near-focus state. When the split-image system is properly focused, as shown in Fig. 4(b), the ray bundles converge at the intersection of the optical wedges, making the split-image patterns horizontally symmetrical. Otherwise, the upper and lower spot patterns are shifted left and right and are no longer symmetrical. This phenomenon can therefore be used to achieve split-image focusing quickly.

Fig. 4. The principle of split-image focusing system in three situations. (a), (b), and (c) create aerial images at $A, B, C$ positions and virtual images at $A_u, A_l, C_u, C_l$ positions. The red-marked optical wedge is in front of the blue one.

Upon the complete installation of the entire split-image system, the position of its focal point can be ascertained, with the optimal focusing distance being denoted as $z_{split-image}$. As the optical paths of both the entire split-image system and the camera imaging system are mutually independent, the focus position of the camera system can similarly be determined upon the completion of its installation, with the ideal focusing distance being represented as $z_{camera}$.

The split-image system utilizes active laser illumination to project the split-image features onto the surface of the detected object, thereby establishing a co-planar relationship between the region containing the split-image features and the object surface. When the split-image system and the camera system are conjugate, the split-image features exhibit perfect symmetry, and the natural image contrast calculated by the sharpness metric reaches its maximum. In principle, it is feasible to adjust the focusing ring of the camera lens to alter the focus position of the camera system, rendering the split-image system and the camera system conjugate, that is:

$$z_{split-image} = z_{camera}$$

However, the focusing ring is adjusted manually and cannot be set to make the two systems exactly conjugate, resulting in a constant deviation, as depicted in Fig. 5. Therefore, it is necessary to calibrate this constant deviation $\Delta z_{diff}$ to correlate the two systems, which is defined as:

$$\Delta z_{diff } = z_{camera }-z_{split-image }$$

Fig. 5. The split image system and the camera imaging system are in a non-conjugate state.

Following the calibration of this constant deviation, the split-image features can be employed as a rapid and effective substitute for conventional sharpness metrics in determining the precise focus position of the camera system, thus overcoming the inherent limitations of the latter.

2.2.2 Calibration of constant deviation

A high-resolution calibration board with a resolution of 0.25$\mathrm {\mu }$m, as shown in Fig. 6(c), was selected to calibrate the constant deviation between the split-image system and the camera system. With the high-resolution stripe patterns on the calibration board serving as the background, the objective lens is scanned along the z-axis near the focal point from bottom to top, covering a range of $R$ $\mathrm {\mu }$m with a step size of $S$ $\mathrm {\mu }$m, and one split image and one natural image are captured at each position. This procedure is repeated at $\frac {R}{S}+1$ positions to collect two sets of images. The split images were taken with the laser turned on and the bright-field light turned off, while the natural images were taken with the bright-field light turned on and the laser turned off. The focus positions of the two sets of images were then calculated and compared.

Fig. 6. (a) The motorized industrial microscope. (b) The chips. (c) High resolution calibration board. (d) A1, B1, and C1 show 7 split images at the same defocus distance from 3 focal stacks under different scenes. A2, B2, and C2 show 7 natural images at the same defocus distance, and their ground truth focal positions are determined by the split images.

When the pixel distance between the upper and lower split-image patterns in the split image is smallest, the image at that position is considered the ground truth focal position, and its actual defocus distance is marked as 0$\mathrm {\mu }$m. The Normalized Cross Correlation (NCC) [39], which regards each pixel as a feature and calculates the correlation between two feature vectors, was used to calculate the pixel distance, as shown in Fig. 7(a). NCC can be defined as follows:

$$NCC(x, y) = \frac{\sum_{w \in M} \sum_{h \in N}\left|L(x+w, y+h)-\bar{L}_{x, y}\right||U(w, h)-\bar{U}|}{\sqrt{\sum_{w \in M} \sum_{h \in N}\left[L(x+w, y+h)-\bar{L}_{x, y}\right]^{2} \sum_{w \in M} \sum_{h \in N}[U(w, h)-\bar{U}]^{2}}}$$
$$\bar{L}_{x, y} = \frac{1}{T} \sum_{w \in M} \sum_{h \in N}[L(x+w, y+h)]$$
$$\bar{U} = \frac{1}{T} \sum_{w \in M} \sum_{h \in N} U(w, h)$$
where $M$ and $N$ represent the set of pixel coordinate values selected from the upper split image, $T$ denotes the total number of the set, $L(x,y)$ and $\bar {L}_{x, y}$ are the pixel value and average pixel value of the lower split image, respectively, whereas $U(w, h)$ and $\bar {U}$ signify the pixel value and average pixel value of the upper split image, respectively.
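A minimal sketch of this step is shown below: it matches the upper split-image patch against the lower pattern and takes the best-matching horizontal shift as the pixel distance. The ROI coordinates are hypothetical, and OpenCV's `cv2.matchTemplate` with `TM_CCOEFF_NORMED` is used as a stand-in for the hand-written sums of Eq. (4).

```python
import cv2
import numpy as np

def split_image_offset(split_img, upper_roi, lower_roi):
    """Estimate the horizontal pixel distance between the upper and lower split-image patterns.

    upper_roi / lower_roi are (x, y, w, h) boxes around the two patterns (hypothetical coordinates).
    """
    ux, uy, uw, uh = upper_roi
    lx, ly, lw, lh = lower_roi
    upper = split_img[uy:uy + uh, ux:ux + uw]
    lower = split_img[ly:ly + lh, lx:lx + lw]
    # Zero-mean normalized cross-correlation, close in spirit to Eq. (4).
    ncc = cv2.matchTemplate(lower, upper, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(ncc)
    best_x = max_loc[0]                      # column of the best match inside the lower ROI
    return abs((lx + best_x) - ux)           # horizontal misalignment in pixels

# Applied to every split image in a stack, the slice with the smallest offset is
# taken as the ground truth focal position (defocus distance 0 um).
```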

Fig. 7. Calibration of constant deviation. (a) NCC was used to calculate the pixel distance of the split image. (b) Tenengrad was used to evaluate the contrast of the natural image.

When the natural image contrast is at its maximum, the image at that position is considered the ground truth focal position, and its actual defocus distance is marked as 0$\mathrm {\mu }$m. A sharpness metric was used to evaluate the clarity of each natural image, as shown in Fig. 7(b). Different sharpness metrics, such as EOG [40], Tenengrad [40], and Brenner [7], yield the same ground truth focal position for a given focal stack, so Tenengrad was selected for this work. It can be defined as follows:

$$G_{i}(i, j) = I(i-1, j+1)+2 I(i, j+1)+I(i+1, j+1)-I(i-1, j-1)-2 I(i, j-1)-I(i+1, j-1)$$
$$G_{j}(i, j) = I(i-1, j-1)+2 I(i-1, j)+I(i-1, j+1)-I(i+1, j-1)-2 I(i+1, j)-I(i+1, j+1)$$
$$S(i, j) = \sqrt{G_{i}^{2}(i, j)+G_{j}^{2}(i, j)}$$
where, $I(i,j)$ is the gray value of the image at the pixel $(i,j)$. $G_i (i,j)$ represents the gradient value of the pixel $(i,j)$ along the $x$ direction. $G_j (i,j)$ represents the gradient value of the pixel $(i,j)$ along the $y$ direction. $S(i,j)$ represents the sharpness evaluation at the pixel $(i,j)$.
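The Tenengrad evaluation of Eqs. (7)-(9) amounts to accumulating the Sobel gradient magnitude over the image; a small sketch using OpenCV's Sobel operator (assumed here as an equivalent of the hand-written kernels) is:

```python
import cv2
import numpy as np

def tenengrad(img):
    """Tenengrad sharpness: mean Sobel gradient magnitude over the image, Eqs. (7)-(9)."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float64)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # gradient along one axis, Eq. (7)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # gradient along the other axis, Eq. (8)
    return float(np.mean(np.sqrt(gx**2 + gy**2)))     # Eq. (9), averaged over pixels
```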

The difference between the ground truth focal positions obtained from the two sets of images was calculated, and the final calibration result was obtained by averaging the differences of multiple groups. The calibrated constant deviation was 2$\mathrm {\mu }$m, that is:

$$\Delta z_{diff } = 2 \mu m$$

After calibrating the constant deviation, the ground truth focal position of the natural images can be determined by the split images, and the dataset with higher accuracy can be obtained.
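Putting the two measurements together, a hedged sketch of the calibration and labeling logic might look as follows; `split_image_offset` and `tenengrad` are the helpers sketched above, and the stage positions are assumed to be recorded with each frame.

```python
import numpy as np

DELTA_Z_DIFF_UM = 2.0   # calibrated constant deviation between the two systems

def calibrate_deviation(split_offsets, natural_sharpness, positions_um):
    """Deviation between the split-image focus (smallest offset) and the camera focus (highest contrast)."""
    z_split = positions_um[int(np.argmin(split_offsets))]
    z_camera = positions_um[int(np.argmax(natural_sharpness))]
    return z_camera - z_split                # Eq. (3)

def natural_ground_truth(split_offsets, positions_um, dz_diff=DELTA_Z_DIFF_UM):
    """Ground truth focus of the natural images, derived from the split images plus the deviation."""
    z_split = positions_um[int(np.argmin(split_offsets))]
    return z_split + dz_diff
```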

2.3 Dataset

The chip surfaces were used as the experimental object and observed under a 10X objective lens. The numerical aperture of the 10X objective lens was 0.3, the depth of field was 3.5$\mathrm {\mu }$m, and the working distance was 8.5mm. A 5-megapixel Daheng industrial camera (MER2-503-23GC) was used as the image acquisition device. A laser with a wavelength of 532nm was utilized to project the split-image pattern.

240 focal stacks of different scenes were captured by the motorized industrial microscope. Each focal stack contains 161 slices from −40$\mathrm {\mu }$m to 40$\mathrm {\mu }$m with a step size of 0.5$\mathrm {\mu }$m, whose ground truth focal position was determined by the split images and the constant deviation. The negative sign represents near-focus, and the positive sign far-focus. The z-axis, with 50nm motion resolution, is moved in steps of 0.5$\mathrm {\mu }$m and continuously triggers the camera to collect images. Each raw image is composed of 2448 $\times$ 2048 pixels in 24-bit color. To facilitate training, the region of interest containing the split-image pattern is segmented from the raw image, so each slice is 320 $\times$ 320 pixels. Fig. 6(d) shows some labeled images under different focal planes, including split images and natural images. All data are partitioned randomly into training (80%, 192 focal stacks), validation (10%, 24 focal stacks), and testing (10%, 24 focal stacks) sets.
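As a minimal sketch, assuming each focal stack is stored as a list of 161 slices with the index of the ground-truth slice already determined from the split images, the labeling and the stack-wise random split could be built as:

```python
import random

STEP_UM = 0.5            # sampling step of the 10X dataset
SLICES_PER_STACK = 161   # from -40 um to +40 um

def label_stack(slices, k_star, step_um=STEP_UM):
    """Pair each slice with its signed defocus distance; negative = near-focus, positive = far-focus."""
    return [(img, (k - k_star) * step_um) for k, img in enumerate(slices)]

def split_stacks(stacks, seed=0):
    """Random 80/10/10 split performed per focal stack, so slices of one stack never leak across sets."""
    order = list(range(len(stacks)))
    random.Random(seed).shuffle(order)
    n_train = int(0.8 * len(stacks))
    n_val = int(0.1 * len(stacks))
    train = [stacks[i] for i in order[:n_train]]
    val = [stacks[i] for i in order[n_train:n_train + n_val]]
    test = [stacks[i] for i in order[n_train + n_val:]]
    return train, val, test
```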

3. Model

In convolutional neural networks, the configuration of the convolutional kernel limits the size of the receptive field, resulting in each convolutional layer performing feature extraction only in a local region, which cannot capture global features. Therefore, current mainstream feature extraction networks [33,41,42] introduce image attention mechanisms, enabling the model to adaptively adjust the focus on different regions of the input image based on its content. Common image attention mechanisms include Squeeze-and-Excitation [43], Coordinate Attention [44], and Convolutional Block Attention Module [45]. However, these attention mechanisms primarily focus on the relationships between channels and spatial dimensions, rather than global relationships.

Multi-head self-attention mechanism, as a vital component of Vision Transformer [29,30,46], facilitates the acquisition of global spatial information from the feature maps. Multi-head self-attention uses multiple different query, key, and value mapping functions to compute multiple self-attention representations, which are then concatenated together. This approach allows the model to more finely model the correlations between different feature subspaces, thereby better capturing global information.

When using multi-head self-attention, the image feature matrix $\boldsymbol {X} \in \mathbb {R}^{h \times w \times d}$ is linearly transformed into multiple sets of query, key, and value matrices:

$$\boldsymbol{Q_{k}} = \boldsymbol{X W_{k}^{Q}} \in \mathbb{R}^{h \times w \times d_{k}}$$
$$\boldsymbol{K_{k}} = \boldsymbol{X W_{k}^{K}} \in \mathbb{R}^{h \times w \times d_{k}}$$
$$\boldsymbol{V_{k}} = \boldsymbol{X W_{k}^{V}} \in \mathbb{R}^{h \times w \times d_{v}}$$
where, $\boldsymbol {W}_k^Q \in \mathbb {R}^{d \times d_k}$ , $\boldsymbol {W}_k^K \in \mathbb {R}^{d \times d_k}$ , and $\boldsymbol {W}_k^V \in \mathbb {R}^{d \times d_v}$ are the weight matrices for query, key, and value, respectively, and $d_k$ and $d_v$ are the dimensions of key and value.

Then, each set of query, key, and value matrices is used to compute a self-attention representation:

$$\begin{array}{r} \boldsymbol{Z}_{k} = softmax\left(\frac{\boldsymbol{Q}_{k} \boldsymbol{K}_{k}^T}{\sqrt{d_k}}\right) \boldsymbol{V}_k \\ = Attention\left( \boldsymbol{X}; \boldsymbol{W}_k^Q, \boldsymbol{W}_k^K, \boldsymbol{W}_k^V \right) \end{array}$$

Finally, the multiple self-attention representations are concatenated and transformed to obtain the final self-attention representation:

$$\boldsymbol{Z} = Concat\left( \boldsymbol{Z}_1,{\ldots},\boldsymbol{Z}_H \right) \boldsymbol{W}_O$$
where $H$ is the number of heads, $\boldsymbol {W}_O \in \mathbb {R}^{Hd_v \times d}$ is the weight matrix used for concatenation, and $Concat$ denotes concatenation along the last dimension.
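A compact NumPy sketch of Eqs. (11)-(15), with the feature map flattened to $hw$ tokens of dimension $d$; the weight matrices are randomly initialized here purely for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, H=4, d_k=16, d_v=16, seed=0):
    """X: (h*w, d) flattened feature map. Returns (h*w, d) following Eqs. (11)-(15)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    heads = []
    for _ in range(H):
        W_q = rng.standard_normal((d, d_k))      # W_k^Q
        W_k = rng.standard_normal((d, d_k))      # W_k^K
        W_v = rng.standard_normal((d, d_v))      # W_k^V
        Q, K, V = X @ W_q, X @ W_k, X @ W_v      # Eqs. (11)-(13)
        A = softmax(Q @ K.T / np.sqrt(d_k))      # Eq. (14): attention weights
        heads.append(A @ V)                      # Z_k
    W_o = rng.standard_normal((H * d_v, d))      # W_O
    return np.concatenate(heads, axis=-1) @ W_o  # Eq. (15)
```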

In this work, the NDPM was proposed to predict the defocus distance from a single natural image. The architecture of NDPM is depicted in Fig. 8. To accelerate the model’s operation and reduce memory consumption, the model was based on depthwise separable convolutions, which have achieved significant success in MobileNet [33,47,48], GhostNet [49], and EfficientNet [50]. The whole model includes five blocks. The first block was defined by a convolution followed by batch normalization and a ReLU transformation. The second, third, and fourth blocks were composed of inverted residuals [47], which implement a $1 \times 1$ expansion convolution followed by a $3 \times 3$ depthwise convolution and a $1 \times 1$ reduction convolution. In the third and fourth blocks, the input and output were connected with a residual connection. The self-attention mechanism was introduced as an image attention mechanism in the third block, enabling the model to dynamically determine the significance of each spatial position by performing self-comparisons across all positions. This is accomplished by calculating similarity scores between each position and the other positions, thereby facilitating the extraction of global features. Before applying the self-attention mechanism, the features extracted by the depthwise convolution were adjusted to a size of $16 \times 16 \times C$ through convolutional operations. Finally, global max pooling was employed to compute the weights for each channel, which are subsequently applied to the original feature maps. The fifth block encompassed a fully connected layer and a LeakyReLU transformation; it integrates the high-level features extracted by the preceding layers and adjusts the dimensionality of the output features.

Fig. 8. The architecture of NDPM. The NDPM’s input is a slice of size $256 \times 256 \times 3$ and the output is the predicted defocus distance.

The third block is placed in the first three stages because, as the network depth increases, self-attention introduces significant computational cost and heightened latency. By introducing the attention mechanism in the early layers of the network, global features can be extracted efficiently from the images, while the model remains suitable for embedded devices where memory and computational resources are limited.
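The following Keras sketch illustrates what such an attention-augmented inverted residual block could look like; the channel counts, expansion ratio, fixed input resolution, and the way the pooled attention output re-weights the feature map are hypothetical readings of Fig. 8, not the authors' exact layer configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attn_inverted_residual(x, out_ch, expand=4, heads=4, key_dim=16):
    """Inverted residual (1x1 expand -> 3x3 depthwise -> 1x1 reduce) with a
    self-attention branch that re-weights channels, loosely following block 3 of NDPM.
    Assumes a statically known input resolution."""
    in_ch = int(x.shape[-1])
    mid_ch = in_ch * expand
    h = layers.Conv2D(mid_ch, 1, padding="same", use_bias=False)(x)
    h = layers.ReLU()(layers.BatchNormalization()(h))
    h = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(h)
    h = layers.ReLU()(layers.BatchNormalization()(h))

    # Attention branch (assumed design): resize features to 16x16, apply multi-head
    # self-attention over the 256 spatial tokens, then pool to per-channel weights.
    a = layers.Conv2D(mid_ch, 1, strides=max(1, int(h.shape[1]) // 16))(h)
    tokens = layers.Reshape((-1, mid_ch))(a)
    tokens = layers.MultiHeadAttention(num_heads=heads, key_dim=key_dim)(tokens, tokens)
    w = layers.GlobalMaxPooling1D()(tokens)
    w = layers.Activation("sigmoid")(layers.Dense(mid_ch)(w))
    h = layers.Multiply()([h, layers.Reshape((1, 1, mid_ch))(w)])

    h = layers.Conv2D(out_ch, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    if in_ch == out_ch:
        h = layers.Add()([x, h])   # residual connection as in blocks 3 and 4
    return h
```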

4. Experiments

4.1 Training details

The models were trained using Keras with a TensorFlow backend on an NVIDIA GeForce GTX 1650 GPU. The Adam optimizer was utilized with parameter values of $\beta _1=0.5$ and $\beta _2=0.999$. The initial learning rate was set to $1 \times 10^{-4}$ and halved every five epochs. A batch size of 12 was employed during training, and each model was trained for a total of 36 epochs to ensure comprehensive training. All training data were augmented by random flipping and rotation. Mean Squared Error (MSE) is used as the loss function of the models, formulated as follows:

$$LOSS = \frac{1}{N} \sum_{i = 1}^{N}\left(NDPM\left(I_{i}\right) - \Delta z_{i}^{{true }}\right)^{2}$$
where $I_i$ is the out-of-focus input image, $NDPM(*)$ is the prediction model, $\Delta z_{i}^{true}$ denotes the actual defocus distance of the input image, $N$ represents the number of input images.
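A hedged sketch of this training configuration in Keras; `build_ndpm` and the `x_train`/`y_train` arrays are placeholders for the model construction and data pipeline, which are not shown here.

```python
import tensorflow as tf

def lr_schedule(epoch, lr, initial_lr=1e-4):
    # Initial learning rate 1e-4, halved every five epochs.
    return initial_lr * (0.5 ** (epoch // 5))

model = build_ndpm()                                   # placeholder for the NDPM architecture
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.5, beta_2=0.999),
    loss="mse",                                        # Eq. (16)
)
model.fit(
    x_train, y_train,                                  # placeholders: training slices and defocus labels
    validation_data=(x_val, y_val),
    epochs=36,
    batch_size=12,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)],
)
```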

4.2 Comparison of four prediction models

The experiments demonstrated that the regression model in the spatial domain has higher focusing accuracy than the other three prediction models: the ordinal regression model in the spatial domain, the classification model in the spatial domain, and the regression model in the frequency domain. All four prediction models are single-shot autofocus algorithms. An ordinal regression model is a statistical method for predicting and analyzing ordered categorical outcomes. A classification model predicts the class of a given observation based on its features. A regression model predicts a continuous outcome variable based on the relationship between the independent variables and the target variable. The ordinal regression, classification, and regression models all use the feature extraction network proposed in NDPM to extract image features; they differ only in their outputs.

All models used the same datasets. The comparative results are presented in Table 2, where the values are the root mean square error (RMSE) and mean absolute error (MAE) on the testing set, defined as follows:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n}\left(\Delta z_{i}^{ {true }}-\Delta z_{i}^{ {predict }}\right)^{2}}$$
$$MAE = \frac{1}{n} \sum_{i = 1}^{n}\left|\Delta z_{i}^{ {true }}-\Delta z_{i}^{ {predict }}\right|$$
where $n$ is the total number of images in the testing set, $\Delta z_{i}^{true }$ represents the actual defocus distance, and $\Delta z_{i}^{predict }$ denotes the predicted defocus distance.
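For completeness, the two metrics in NumPy, as a direct transcription of Eqs. (17) and (18):

```python
import numpy as np

def rmse(z_true, z_pred):
    return float(np.sqrt(np.mean((np.asarray(z_true) - np.asarray(z_pred)) ** 2)))

def mae(z_true, z_pred):
    return float(np.mean(np.abs(np.asarray(z_true) - np.asarray(z_pred))))
```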

Table 2. Results of four prediction models

The experiment demonstrated that NDPM outperformed the other prediction models, with an RMSE of 0.528 compared to the closest baseline value of 0.731 and an MAE of 0.422 compared to 0.534. Focusing accuracy was worse when the natural image was converted from the spatial domain to the frequency domain to predict the defocus distance. The reason is that the frequency domain captures the amplitude and phase information of frequency components while disregarding the original image’s position, details, and texture. Both the ordinal regression and classification models predict categories, so their focusing accuracy is limited by the sampling step of 0.5$\mathrm {\mu }$m, whereas the regression model predicts a continuous variable and can break this limitation.

Figure 9 shows the autofocusing performance of the four prediction models for 4 focal stacks. In each focal stack, the scatter plot depicts the focusing error of each natural image and the MAE of each model. Compared with the other models, the focusing accuracy of the ordinal regression model varied considerably across scenes: because it directly predicts the position of the in-focus image in each focal stack, its predictions can be biased as a whole. Conversely, NDPM showed stronger robustness across different detection objects.

Fig. 9. Comparison of autofocusing performance of NDPM with other prediction models for 4 focal stacks. In each focal stack, the scatter plot depicts the focusing error of 161 slices, and MAEs are calculated by the four prediction models respectively. The natural image of each focal stack is displayed on the graph.

4.3 Comparison of lightweight feature extraction networks

In addition, to further demonstrate the superiority of the feature extraction network in NDPM, six different lightweight feature extraction networks were compared. The focusing accuracy of each was calculated on different datasets, where the sampling step size of the 5X dataset is 1$\mathrm {\mu }$m, that of the 10X dataset is 0.5$\mathrm {\mu }$m, and that of the 20X dataset is 0.2$\mathrm {\mu }$m. Each dataset contains 240 focal stacks. The results are presented in Table 3. On the 5X dataset, NDPM achieved the largest improvement in focusing accuracy over the other feature extraction networks, with RMSE and MAE improved by 60% relative to the closest baseline. On the 10X and 20X datasets, NDPM achieved a smaller but still clear improvement. Although NDPM outperforms the state-of-the-art networks in focusing accuracy and robustness, it sacrifices some inference time and memory, as presented in Table 4. Since the self-attention mechanism, which is known to be computationally demanding, is introduced into NDPM, the computational complexity of the model increases. However, with GPU-accelerated inference, real-time autofocus can still be achieved.

Table 3. Comparison of NDPM and state-of-the-art lightweight feature extraction networks in focusing accuracy

Table 4. Comparison of NDPM and state-of-the-art lightweight feature extraction networks in computational overhead

In industrial detection, it is often necessary to first determine the approximate position of the detection object under a 5X objective lens and then magnify the observation under a 20X objective lens. Therefore, achieving autofocus at multiple magnifications can improve detection efficiency. The experimental results showed that NDPM outperformed the other networks on the mixed 5X and 20X dataset, with an RMSE of 0.863 compared to the closest baseline value of 1.173 and an MAE of 0.646 compared to 0.804. Although the accuracy of the model trained on the mixed dataset was slightly lower than that of the single-magnification models, it greatly simplifies detection. Fig. 10 shows the focusing performance of the same focal stack under the single-data model and the mixed-data model. Owing to the smaller sampling step, the focusing accuracy of the 20X data model was generally better than that of the 5X data model.

Fig. 10. Comparison of focusing performance of the same focal stack under single data model and mixed data model. (a) and (b) are focal stacks in 5X dataset. (c) and (d) are focal stacks in 20X dataset. The MAEs and RMSEs of each focal stack are calculated by different data models.

Figure 11 displays the process of NDPM focusing from an out-of-focus image. First, the region of interest (ROI) with a size of $320 \times 320$ pixels was segmented from the natural image. Then, the defocus distance was predicted by NDPM. Finally, the objective lens was moved to the target position. The split images corresponding to the natural image during defocus and at focus are shown in Fig. 11(a). Taking a defocus distance of 20$\mathrm {\mu }$m as an example, after focusing was completed, 6 images were collected around the focus position in 0.5$\mathrm {\mu }$m steps, and the clarity of the ROI in each image was calculated by Tenengrad. The curve in Fig. 11(b) shows that the image was sharpest at the focus position. The focusing error for this focal stack was irregularly distributed, as shown in Fig. 11(c). Fig. 11(d) displays the actual and predicted defocus distances of some images in the focal stack.
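A minimal sketch of this single-shot focusing loop follows; `move_stage_um`, the ROI box, the normalization, and the trained `model` are placeholders for the actual acquisition, preprocessing, and motion-control code.

```python
import numpy as np

def autofocus_once(frame, model, roi_box):
    """Single-shot autofocus: crop the ROI, predict the defocus distance, move the stage."""
    x, y, w, h = roi_box
    roi = frame[y:y + h, x:x + w]                        # region of interest around the split-image pattern
    roi = roi.astype(np.float32)[None, ...] / 255.0      # batch of 1, normalized (assumed preprocessing)
    dz_um = float(model.predict(roi).ravel()[0])         # predicted defocus distance in micrometres
    move_stage_um(-dz_um)                                # drive the objective back toward the focal plane
    return dz_um
```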

Fig. 11. NDPM for one focal stack. (a) The process of predicting the defocus distance from a defocused natural image. The split images corresponding to the natural images are shown on the right. (b) The images were collected with a step size of 0.5$\mathrm {\mu }$m near the predicted focus (the red dot) and the contrast of the ROI was calculated by Tenengrad. (c) The focusing error of NDPM for this focal stack is plotted as a function of the axial defocus distance. (d) shows the prediction results of some images on different focal planes.

5. Discussion

In this paper, all prediction models are capable of distinguishing whether a defocused image is near-focus or far-focus, which is difficult for a human to discern, as shown in Fig. 11(d). The defocus model of the image is often assumed to follow the Point Spread Function (PSF), which describes the distribution of a point light source on the focal plane during the imaging process of an optical system. Due to the asymmetry in the propagation of light through a lens or optical system, the PSF also exhibits asymmetry [13,18].

When the image is located on the focal plane, the PSF exhibits minimal asymmetry near the focal point: on the focal plane, the light rays converge to a single point after propagating through the optical system, resulting in a more symmetric distribution of the PSF. In this case, the defocus direction predicted by the models is often inaccurate. However, when the image is located off the focal plane, the asymmetry of the PSF becomes more pronounced. As the object moves away from the focal plane, the PSF shows skewed or elongated features. Therefore, the prediction models can accurately infer the defocus direction of each image, as shown in Fig. 9. Moreover, after being trained on images from various focal planes, the prediction models can accurately predict the defocus distance of any given image.

6. Conclusion

In this work, a high-precision autofocus pipeline was introduced, which predicts the defocus distance from a single natural image. To improve the focusing accuracy, a lightweight regression network was built that pays more attention to the global features of the image. Furthermore, a new method with active laser illumination for making datasets was proposed to overcome the limitations of the sharpness metric itself and improve the overall accuracy of the dataset. Compared with other prediction models, NDPM has better focusing performance. It was also demonstrated that NDPM has a stronger ability to extract global features, and therefore higher focusing accuracy and robustness.

Funding

National Natural Science Foundation of China (51975344, 62176149).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Cheon, H. Lee, C. O. Kim, et al., “Convolutional neural network for wafer surface defect classification and the detection of unknown defect class,” IEEE Trans. Semicond. Manufact. 32(2), 163–170 (2019). [CrossRef]  

2. P. Pereira Gonçalves and A. Otsuki, “Determination of liberation degree of mechanically processed waste printed circuit boards by using the digital microscope and sem-eds analysis,” Electronics 8(10), 1202 (2019). [CrossRef]  

3. T. Pei, F. Cheng, X. D. Jia, et al., “High-resolution detection of microwave fields on chip surfaces based on scanning microwave microscopy,” IEEE Trans. Instrum. Meas. 72, 1–6 (2023). [CrossRef]  

4. Z. Bian, C. Guo, S. Jiang, et al., “Autofocusing technologies for whole slide imaging and automated microscopy,” J. Biophotonics 13(12), e202000227 (2020). [CrossRef]  

5. S. Pertuz, D. Puig, and M. A. Garcia, “Analysis of focus measure operators for shape-from-focus,” Pattern Recognit. 46(5), 1415–1432 (2013). [CrossRef]  

6. A. Santos, C. Ortiz De Solórzano, J. J. Vaquero, et al., “Evaluation of autofocus functions in molecular cytogenetic analysis,” J. Microsc. 188(3), 264–272 (1997). [CrossRef]  

7. L. Firestone, K. Cook, K. Culp, et al., “Comparison of autofocus methods for automated microscopy,” Cytom. The J. Int. Soc. for Anal. Cytol. 12(3), 195–206 (1991). [CrossRef]  

8. S.-Y. Lee, J.-T. Yoo, Y. Kumar, et al., “Reduced energy-ratio measure for robust autofocusing in digital camera,” IEEE Signal Process. Lett. 16(2), 133–136 (2009). [CrossRef]  

9. Y. Sun, S. Duthaler, and B. J. Nelson, “Autofocusing in computer microscopy: selecting the optimal focus algorithm,” Microsc. Res. Tech. 65(3), 139–149 (2004). [CrossRef]  

10. J. He, R. Zhou, and Z. Hong, “Modified fast climbing search auto-focus algorithm with adaptive step size searching technique for digital camera,” IEEE Trans. Consumer Electron. 49(2), 257–262 (2003). [CrossRef]  

11. N. Kehtarnavaz and H.-J. Oh, “Development and real-time implementation of a rule-based auto-focus algorithm,” Real-Time Imaging 9(3), 197–203 (2003). [CrossRef]  

12. Z. Wu, D. Wang, and F. Zhou, “Bilateral prediction and intersection calculation autofocus method for automated microscopy,” J. Microsc. 248(3), 271–280 (2012). [CrossRef]  

13. C. Herrmann, R. S. Bowen, N. Wadhwa, et al., “Learning to autofocus,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 2230–2239.

14. C. Li, A. Moatti, X. Zhang, et al., “Deep learning-based autofocus method enhances image quality in light-sheet fluorescence microscopy,” Biomed. Opt. Express 12(8), 5214–5226 (2021). [CrossRef]  

15. L. Wei and E. Roberts, “Neural network control of focal position during time-lapse microscopy of cells,” Sci. Rep. 8(1), 7313 (2018). [CrossRef]  

16. T. Pitkäaho, A. Manninen, and T. J. Naughton, “Focus prediction in digital holographic microscopy using deep convolutional neural networks,” Appl. Opt. 58(5), A202–A208 (2019). [CrossRef]  

17. Y. Xiang, Z. He, Q. Liu, et al., “Autofocus of whole slide imaging based on convolution and recurrent neural networks,” Ultramicroscopy 220, 113146 (2021). [CrossRef]  

18. J. Liao, X. Chen, G. Ding, et al., “Deep learning-based single-shot autofocus method for digital microscopy,” Biomed. Opt. Express 13(1), 314–327 (2022). [CrossRef]  

19. C. Wang, Q. Huang, M. Cheng, et al., “Deep learning for camera autofocus,” IEEE Trans. Comput. Imaging 7, 258–271 (2021). [CrossRef]  

20. T. R. Dastidar and R. Ethirajan, “Whole slide imaging system using deep learning-based automated focusing,” Biomed. Opt. Express 11(1), 480–491 (2020). [CrossRef]  

21. A. Shajkofci and M. Liebling, “Spatially-variant cnn-based point spread function estimation for blind deconvolution and depth estimation in optical microscopy,” IEEE Trans. on Image Process. 29, 5848–5861 (2020). [CrossRef]  

22. H. Pinkard, Z. Phillips, A. Babakhani, et al., “Deep learning for single-shot autofocus microscopy,” Optica 6(6), 794–797 (2019). [CrossRef]  

23. S. Jiang, J. Liao, Z. Bian, et al., “Transform-and multi-domain deep learning for single-frame rapid autofocusing in whole slide imaging,” Biomed. Opt. Express 9(4), 1601–1612 (2018). [CrossRef]  

24. Y. Xu, X. Wang, C. Zhai, et al., “A single-shot autofocus approach for surface plasmon resonance microscopy,” Anal. Chem. 93(4), 2433–2439 (2021). [CrossRef]  

25. C. Zhang, H. Jiang, W. Liu, et al., “Correction of out-of-focus microscopic images by deep learning,” Comput. Struct. Biotechnol. J. 20, 1957–1966 (2022). [CrossRef]  

26. Y. Luo, L. Huang, Y. Rivenson, et al., “Single-shot autofocusing of microscopy images using deep learning,” ACS Photonics 8(2), 625–638 (2021). [CrossRef]  

27. L. Jin, Y. Tang, Y. Wu, et al., “Deep learning extended depth-of-field microscope for fast and slide-free histology,” Proc. Natl. Acad. Sci. 117(52), 33051–33060 (2020). [CrossRef]  

28. M. Bonet Sanz, F. Machado Sánchez, and S. Borromeo, “An algorithm selection methodology for automated focusing in optical microscopy,” Microsc. Res. Tech. 85(5), 1742–1756 (2022). [CrossRef]  

29. S. Mehta and M. Rastegari, “Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer,” arXiv:2110.02178 (2021). [CrossRef]

30. J. Li, X. Xia, W. Li, et al., “Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios,” arXiv:2207.05501 (2022). [CrossRef]

31. S. W. Zamir, A. Arora, S. Khan, et al., “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2022), pp. 5728–5739.

32. J. Chen, S.-h. Kao, H. He, et al., “Run, don’t walk: Chasing higher flops for faster neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), pp. 12021–12031.

33. A. Howard, M. Sandler, G. Chu, et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 1314–1324.

34. H.-I. Lin and P. Menendez, “Image denoising of printed circuit boards using conditional generative adversarial network,” in 2019 IEEE 10th International Conference on Mechanical and Intelligent Manufacturing Technologies (ICMIMT), (IEEE, 2019), pp. 98–103.

35. V. Semenov, Y. G. Astsaturov, and Y. B. Khanzhonkov, “Determining the level of dust on printed circuit boards of radio-electronic equipment by optoelectronic method,” Inorg. Mater. 56(15), 1458–1461 (2020). [CrossRef]  

36. Z. Hua, X. Zhang, D. Tu, et al., “Learning to high-performance autofocus microscopy with laser illumination,” Measurement 216, 112964 (2023). [CrossRef]  

37. D. A. Kerr, “Principle of the split image focusing aid and the phase comparison autofocus detector in single lens reflex cameras,” (2005).

38. Z. Hua, X. Zhang, and D. Tu, “Autofocus methods based on laser illumination,” Opt. Express 31(18), 29465–29479 (2023). [CrossRef]  

39. D. I. Barnea and H. F. Silverman, “A class of algorithms for fast digital image registration,” IEEE Trans. Comput. C-21(2), 179–186 (1972). [CrossRef]  

40. W. Huang and Z. Jing, “Evaluation of focus measures in multi-focus image fusion,” Pattern Recognit. Lett. 28(4), 493–500 (2007). [CrossRef]  

41. J. Terven and D. Cordova-Esparza, “A comprehensive review of yolo: From yolov1 to yolov8 and beyond,” arXiv:2304.00501 (2023). [CrossRef]

42. H. Zhang, C. Wu, Z. Zhang, et al., “Resnest: Split-attention networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 2736–2746.

43. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 7132–7141.

44. Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for efficient mobile network design,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2021), pp. 13713–13722.

45. S. Woo, J. Park, J.-Y. Lee, et al., “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), (2018), pp. 3–19.

46. A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv:2010.11929 (2020). [CrossRef]

47. M. Sandler, A. Howard, M. Zhu, et al., “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 4510–4520.

48. A. G. Howard, M. Zhu, B. Chen, et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861 (2017). [CrossRef]

49. K. Han, Y. Wang, Q. Tian, et al., “Ghostnet: More features from cheap operations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 1580–1589.

50. M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning, (PMLR, 2019), pp. 6105–6114.
