
SCIFI: 3D face reconstruction via smartphone screen lighting

Open Access

Abstract

3D face reconstruction on smartphones has a wide range of applications, such as face recognition, liveness detection, face animation, etc. To realize this function, smartphones have been equipped with specialized hardware modules (an infrared dot projector or a depth sensor). However, this inevitably increases the production cost of smartphones and the operation difficulty for users. In this article, we propose a smartphone screen illumination-based face reconstructIon (SCIFI) framework, which relies only on the front camera and screen lighting. Specifically, we investigate calibrated planar lighting to achieve fine-grained textures in Lambertian-based reconstruction. Further, we introduce face landmarks to align multiple photographs, which aims to adjust the position mismatch caused by hand jitter. Moreover, we propose two different methods to eliminate outlier normals based on the characteristics of the human face. Extensive experiments based on different environments (dark and bright), different lighting patterns (4-zones and 9-zones), and different testing subjects have validated the effectiveness and robustness of SCIFI in reconstructing a 3D face surface with a favorable surface shape as well as micro facial texture.

© 2021 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

With the rapid development of visual technology, smartphones are usually equipped with multiple CMOS camera modules [1]. The combination of these camera sensors and 3D face reconstruction techniques has spawned many popular applications, such as security authentication [2] (e.g. face reconstruction and face anti-spoofing [3]), user-generated content [4] (e.g. face filtering in augmented reality), interactive entertainment [5] (e.g. Animoji), etc. On the other hand, the operating environment of smartphones is complex and diverse. Thus, obtaining a stable and accurate 3D reconstruction of the human face with a smartphone [6,7] is a very challenging but urgent task.

Generally speaking, 3D face reconstruction can be categorized into three groups: 1) Software-based. This kind of method usually utilizes 3D processing software [8] (e.g. 3DS MAX, Maya, Unity3D) to establish a face model. The biggest problems are that it requires professional experience, the number of candidate face models is limited, and the reconstructed face lacks realism. 2) Hardware-based. This kind of method usually adopts a specialized projector or laser (e.g. iPhone X and HUAWEI Mate20 Pro) [9] to obtain the face shape in a non-contact way. Although it can be accurate to the millimeter level, it involves high hardware cost and increases the operation difficulty. 3) Image-based. This kind of method usually uses one or more pictures to build a 3D face shape based on computer vision reflectance models (e.g. Lambert, Phong, Cook-Torrance, etc.) [10]. It has the advantages of simple operation and low cost, but there are still few high-precision reconstruction results based on smartphones, as such methods are heavily affected by the environment. Therefore, it is meaningful to investigate how to accurately reconstruct the 3D face using the mobile camera in natural scenarios at low cost.

To reconstruct an accurate 3D face surface, some efforts [11–17] have been made in the past few years. Among them, parametric face modeling and photometric stereo are two main representative methods. The earliest parametric face model is the 3D morphable model (3DMM) [11], which first obtains a 3D face dataset through high-precision scanning, and then decouples the 3D face data into shape and texture features via principal component analysis (PCA). The parametric-based methods make it feasible to reconstruct a 3D face from a small number of images or even a single image. However, the limitation of these methods is that they can only recover the low-frequency shape information, and have difficulty recovering the high-frequency detail information of the 3D face surface. In addition, the reconstruction results are also constrained by the capacity of the related face databases. Compared with parametric face modeling, photometric stereo [13,18] is able to reconstruct the 3D face surface with high-frequency details. The advantage of the photometric stereo-based methods is that they can restore the normal vector for each pixel, which makes it feasible to restore the high-frequency details of the 3D surface. However, the selection and calibration of the light sources have a non-negligible influence on the reconstruction results. For example, Vogiatzis and Hernández [12] introduced multi-view geometry and multi-spectral photometric stereo for 3D face reconstruction. A coarse face model was first reconstructed via structure-from-motion (SfM), and it was then used for the light source calibration. After obtaining the light source parameters, the corresponding face normal map can be estimated. Under a near-point light source, Cao et al. [13] adopted 3DMM to fit a proxy face model, and estimated the light source direction for each surface patch. After obtaining the light source parameters, they further utilized the photometric stereo method to generate the surface normal, and iteratively obtained a high-precision face normal map. Wang et al. [14] proposed a two-stage coarse-to-fine face surface normal restoration network. In the first stage, a proxy estimation network obtained the 3DMM parameters from the input face images, and subsequently rendered the corresponding coarse normal map. In the second stage, a normal estimation network was employed to obtain the enhanced normal map. However, the above methods need to be performed under ideal laboratory conditions, based on complicated light source and camera configurations, so they are difficult to apply to real scenes in daily life.

Most recently, some face reconstruction methods have been developed based on camera sensors from smartphones or laptops. Agrawal et al. [15] used a smartphone to take a video around the face, and reconstructed the corresponding 3D face shape via multi-view geometry. However, it needs a long shooting time ($15-20$ seconds) and consumes considerable computing resources, which makes it difficult to use in practical security authentication-based applications. In addition, the reconstructed faces are overly smoothed due to the filtering of video frames. Farrukh et al. [16] designed a liveness detection system via the smartphone camera based on uncalibrated 3D reconstruction. Due to the unknown lighting assumption, they have to leverage an additional face template and singular value decomposition (SVD) to estimate the surface normal. As a result, this method usually generates oversmoothed surface texture and a distorted face shape. Moreover, the computation of SVD is a non-negligible payload for a power-constrained smartphone.

In this article, we propose a Smartphone sCreen Illumination-based Face reconstructIon (SCIFI) scheme based on the front camera and screen lighting. Specifically, to improve the reconstructed details of the face surface, we propose to adopt calibrated planar lighting to solve the Lambertian-based reconstruction. Further, we employ face landmarks to align multiple face photographs taken by handheld devices, which effectively relieves the position mismatch caused by hand jitter in face reconstruction. In addition, we design different normal outlier elimination methods according to the characteristics of the human face, which can avoid the face deformation caused by singular normals. Finally, qualitative and quantitative experiments show that SCIFI can effectively restore the 3D surface shape as well as the texture of the human face. Compared with recent representative methods, SCIFI has the following main contributions.

  • We have designed a robust 3D face reconstruction system for smartphones, which only relies on the front camera and screen lighting. SCIFI can effectively reconstruct the overall face shape as well as micro facial expressions.
  • To improve the performance of 3D face reconstruction, we have investigated how to employ a calibrated planar lighting model to address the complicated computation of uncalibrated illumination. Moreover, we have also explored how to solve the mismatch issues for multiple face photographs, and how to eliminate the outlier normals.
  • To verify the robustness of SCIFI, we have designed several challenging evaluation scenes, including different environments (dark and bright), different lighting patterns (4-zones and 9-zones), and different testing subjects. Extensive experiments validate the effectiveness and applicability of SCIFI.

The organization of this article is as follows. In Section 2., we briefly introduce the basic foundation of the Lambertian model for 3D surface reconstruction. In Section 3., we provide the details of the proposed face reconstruction scheme. In Section 4., we provide the comparison results and performance analysis of SCIFI. Finally, the overall conclusion is drawn in Section 5.

2. Background

2.1 Lambertian model

Given an object to be reconstructed, calibrated photometric stereo requires a fixed viewpoint (i.e. camera position) and at least three calibrated parallel light sources. For a Lambertian surface, the pixel intensity $i(x,y)$ at any position $(x,y)$ on the object surface can be represented by

$$i(x,y)=\lambda \rho \vec{\boldsymbol{l}}\cdot \vec{\boldsymbol{n}},$$
where $\vec {\boldsymbol {l}} \in \mathbb {R} ^3$ denotes the light source direction, $\lambda$ denotes the light intensity, $\vec {\boldsymbol {n}} \in \mathbb {R} ^3$ represents the normal vector, and $\rho$ is the surface albedo. Since $\lambda$ and $\rho$ are constants and $\vec {\boldsymbol {n}}$ can eventually be normalized to a unit vector, Eq. (1) can be simplified as
$$i(x,y)=\vec{\boldsymbol{l}} \cdot \vec{\boldsymbol{n}}.$$

Under multiple light sources, Eq. (2) can also be converted into a matrix form as

$$\mathbf{I}=\mathbf{LN},$$
where $\mathbf {I}\in\mathbb {R} ^{m\times p}$, $\mathbf {L}\in\mathbb {R} ^{m\times 3}$, and $\mathbf {N}\in\mathbb {R} ^{3\times p}$. $m$ represents the number of light sources, which also equals the number of photographs, and $p$ represents the number of valid pixels. In general, Eq. (3) can be solved by the least squares method to obtain the surface normal $\mathbf {N}$.
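For concreteness, the least-squares solution of Eq. (3) can be sketched in a few lines of NumPy; the function name and array layout below are our own illustration, not code released with the paper.

```python
import numpy as np

def recover_normals(I, L):
    """Solve I = L N for per-pixel surface normals (Eq. (3)).

    I : (m, p) array of pixel intensities, one row per light source.
    L : (m, 3) array of calibrated light directions (unit vectors).
    Returns a (3, p) array of unit surface normals.
    """
    # Least-squares solution of the overdetermined system L N = I.
    N, *_ = np.linalg.lstsq(L, I, rcond=None)
    # Normalize each column to a unit normal vector.
    norms = np.linalg.norm(N, axis=0, keepdims=True)
    return N / np.clip(norms, 1e-8, None)
```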

3. Proposed SCIFI framework

In this section, we give a detailed description of the proposed SCIFI scheme, as illustrated in Figure 1. SCIFI has been specially designed for both dark and bright lighting conditions. When a lighting pattern is selected and the related face photographs are captured, face alignment is first conducted to address the position mismatch issues caused by hand jitter. Subsequently, calibrated planar lighting is employed to estimate the face surface normal. To remove outlier normals, different filtering methods have been designed for different facial regions. Finally, a discrete geometry-based surface-from-normal method is employed to generate the 3D face shape.

Fig. 1. Pipeline of the proposed Smartphone sCreen Illumination-based Face reconstructIon (SCIFI) framework.

3.1 Screen lighting patterns

According to the isometric model of the planar light source, we need to obtain the relative distance $D$ between each face surface patch and the planar light source to perform the surface reconstruction. However, this poses a dilemma: once we knew the relative distance $D$ for each surface patch, we would have already reconstructed the related surface. Considering that the depth variation across face regions is small compared with the distance between the face and the planar light source, we assume that all face surface patches lie on the same plane. We take the nose tip as a reference, assume for simplicity that the distance between the reference and the screen of the smartphone is a constant, and assume that the face is parallel to the phone screen. In other words, the distance between each screen region and the face surface is considered a constant, and it is set to $20~cm$ in the experiments.

Figure 2 illustrates the proposed screen lighting patterns with different colors. When a subject holds the smartphone, different screen regions are lighted sequentially. At the same time, the front camera is used as the photograph acquisition device. The screen of the smartphone can be divided into $4$-zones or $9$-zones. According to the color of the lighting source, the patterns can be further divided into white ($W$), red ($R$), green ($G$), and blue ($B$). A simple sketch of generating such frames is given below.
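As a minimal sketch, the zone patterns could be produced as full-screen frames as follows, assuming the zones form a uniform 2×2 or 3×3 grid as in Fig. 2; the function and its arguments are illustrative only.

```python
import numpy as np

def zone_frames(height, width, zones=4, color=(255, 255, 255)):
    """Yield full-screen RGB frames with one zone lit at a time.

    zones is assumed to form a square grid (4 -> 2x2, 9 -> 3x3),
    matching the 4-zone and 9-zone patterns in Fig. 2.
    """
    n = int(round(zones ** 0.5))
    hs, ws = height // n, width // n
    for r in range(n):
        for c in range(n):
            frame = np.zeros((height, width, 3), dtype=np.uint8)
            frame[r * hs:(r + 1) * hs, c * ws:(c + 1) * ws] = color
            yield frame
```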

Fig. 2. Illustration of the proposed screen lighting patterns. (a) The relative position relationship between the human face and the screen of smartphone. (b) Two lighting patterns (top: $4$-zones; bottom: $9$-zones). From left to right in the zig-zag way, we mark them as $W4$, $R4$, $G4$, $B4$, $W9$, $R9$, $G9$, and $B9$, respectively.

3.2 Face alignment

In 3D face reconstruction based on the smartphone camera, hand jitter is an inevitable problem when capturing photographs. Due to hand jitter, the face positions of the participating subject in different photographs may differ. To solve this mismatch problem, we propose to employ face alignment [19] to relocate the face position. In the experiments, we generate 106 landmarks and estimate a similarity matrix to relocate the face position.

In the experiments, we take the first face photograph as the reference, and align the other face photographs. The relationship between the landmarks $(x_i^{\prime },y_i^{\prime })(i=1,2,\ldots,106)$ of the reference photograph and the landmarks $(x_i,y_i )(i=1,2,\ldots,106)$ of the other photographs can be modeled as

$$\small \left[\begin{array}{c}x_i^{\prime} \\ y_i^{\prime} \\ 1\end{array}\right]=\left[\begin{array}{ccc}\cos \theta & -\sin \theta & t_{x} \\ \sin \theta & \cos \theta & t_{y} \\ 0 & 0 & 1\end{array}\right]\left[\begin{array}{c}x_{i} \\ y_{i} \\ 1\end{array}\right],$$
where the affine matrix is represented by the rotation angle $\theta$, the offset $t _{x}$ on the $x$-axis, and the offset $t _{y}$ on the $y$-axis.

For computational efficiency, the mappings for all landmarks are organized as a matrix multiplication, and Eq. (4) is reorganized as

$$\small \left[\begin{array}{c}x_{1}^{\prime} \\ y_{1}^{\prime} \\ \vdots \\ x_{106}^{\prime} \\ y_{106}^{\prime}\end{array}\right]=\left[\begin{array}{cccc}x_{1} & -y_{1} & 1 & 0 \\ y_{1} & x_{1} & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ x_{106} & -y_{106} & 1 & 0 \\ y_{106} & x_{106} & 0 & 1\end{array}\right]\left[\begin{array}{c}\cos \theta \\ \sin \theta \\ t_{x} \\ t_{y}\end{array}\right].$$

For convenience, Eq. (5) can also be represented as

$$\mathbf{F}^{\prime}=\mathbf{FT},$$
where $\mathbf {T}$ denotes the affine transform, $\mathbf {F}$ denotes original face landmark position matrix, and $\mathbf {F}^{\prime }$ denotes the aligned landmark position matrix.

Finally, solving for $\mathbf {T}$ can be converted into the following optimization problem

$$\underset{\theta, t_{x}, t_{y}}{\operatorname{\boldsymbol{\arg \min}}}\|\mathbf{FT}-\mathbf{F}^{\prime}\|.$$

In order to deal with the impact of some seriously deviated landmarks, we can also adopt iteratively reweighted least squares (IRLS) to solve Eq. (7).
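As a hedged illustration, the least-squares solution of Eq. (5) might be implemented as follows; the helper name and the (106, 2) landmark layout are assumptions of this sketch, and the IRLS variant is omitted.

```python
import numpy as np

def estimate_similarity(landmarks_src, landmarks_ref):
    """Estimate (cos t, sin t, t_x, t_y) from matched landmarks via Eq. (5).

    landmarks_src, landmarks_ref : (106, 2) arrays of (x, y) positions.
    Returns the 3x3 affine matrix of Eq. (4).
    """
    x, y = landmarks_src[:, 0], landmarks_src[:, 1]
    A = np.zeros((2 * len(x), 4))
    # Even rows encode the x' equations, odd rows the y' equations.
    A[0::2] = np.stack([x, -y, np.ones_like(x), np.zeros_like(x)], axis=1)
    A[1::2] = np.stack([y, x, np.zeros_like(x), np.ones_like(x)], axis=1)
    b = landmarks_ref.reshape(-1)                  # [x1', y1', x2', y2', ...]
    cos_t, sin_t, tx, ty = np.linalg.lstsq(A, b, rcond=None)[0]
    return np.array([[cos_t, -sin_t, tx],
                     [sin_t,  cos_t, ty],
                     [0.0,    0.0,   1.0]])
```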

3.3 Feature extraction

According to the illumination condition, the use cases of SCIFI can be broadly divided into dark environments and bright environments. In a dark environment, we conduct the experiments under white light, such as the $W4$ and $W9$ modes shown in Figure 2. In a bright environment, it is difficult to create shading on the face due to the similarity of the white screen light and natural light. For this reason, we propose a “shading-from-color-lighting” method that uses colored light as the lighting source.

In the experiments, we first take a face photograph under natural light as the reference $\mathbf {I}_{ref}$, with the screen unlighted. Then, we obtain the face photographs $\mathbf {I}_k$ by sequentially lighting the colored lights in different screen regions. Finally, we obtain the difference image by subtracting the reference from the photograph obtained under the color lighting source, which is used as the feature image $\mathbf {I}_{type}^k$ for the corresponding color type.

$$\mathbf{I}_{type}^k=\mathbf{I}_k-\mathbf{I}_{ref}, \quad (k=1,2,\ldots),$$
where $\mathbf {I}_{type}^k \in \{\mathbf {I}_{R}^k, \mathbf {I}_{G}^k, \mathbf {I}_{B}^k\}$.

For example, when the green light is used, the green component $\mathbf {I}_{G}^k$ is extracted as the feature. In fact, we can also use the color lighting source in a dark environment, and directly extract the corresponding color component as the feature image; in other words, $\mathbf {I}_{ref}=\mathbf {0}$. It is worth noting that we need to flip the captured photograph horizontally before computing the feature image due to the use of the front camera.
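A minimal sketch of the feature extraction in Eq. (8) is shown below; the channel index convention, the horizontal flip, and the clipping of negative residuals are our assumptions rather than details specified in the paper.

```python
import numpy as np

def color_feature(photo_lit, photo_ref, channel=1):
    """Return the single-channel feature image I_type^k = I_k - I_ref.

    channel=1 corresponds to green in an RGB image; in a dark environment
    photo_ref can simply be an all-zero image.
    """
    # Flip horizontally because the front camera mirrors the scene (assumption).
    lit = np.fliplr(photo_lit.astype(np.float32))
    ref = np.fliplr(photo_ref.astype(np.float32))
    diff = np.clip(lit - ref, 0, None)   # suppress negative residuals (assumption)
    return diff[..., channel]
```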

3.4 Calibrated planar lighting

Under a planar lighting condition $\vec {\boldsymbol {l}}$, using the parallel light assumption to solve Eq. (1) will introduce errors into the estimation of $\vec {\boldsymbol {n}}$. Therefore, it is necessary to incorporate different types of light source assumptions, such as near-point light sources [18,20], circular planar light sources [21], and rectangular planar light sources [22]. In [22], Clark made the assumption that the illumination effect of a rectangular planar light source can be treated as equivalent to a point light source at infinity. As illustrated in Figure 3, we set the position of the face surface patch as the origin of the world coordinate system, and assume that the rectangular planar light source is perpendicular to the $z$-axis. The emissivity of the resulting micro surface patch can be modeled by $\mathcal {R}_{s}(\cdot )$, and defined as

$$\mathcal{R}_{s}\left(p, q, \rho ; x_{1}, x_{2}, y_{1}, y_{2}, D\right)=\rho \mathcal{R}_{\vec{\boldsymbol{l}}} \int_{y_{1}}^{y_{2}} \int_{x_{1}}^{x_{2}} \frac{(p x+q y+D)}{\sqrt{1+p^{2}+q^{2}} \sqrt{\left(x^{2}+y^{2}+D^{2}\right)^{3}}} d x d y,$$
where $(x_1, y_1, D)$, $(x_1, y_2, D)$, $(x_2, y_1, D)$, and $(x_2, y_2, D)$ represent the four vertices of the lighting rectangle, $\mathcal {R}_{\vec {\boldsymbol {l}}}$ denotes the emissivity of the lighting source, and the normal vector is $\vec {\boldsymbol {n}}=(p, q, 1) / \sqrt {p^{2}+q^{2}+1}$. Equation (9) has the closed-form solution
$$\small \mathcal{R}_{s}\left(p, q, \rho ; x_{1}, x_{2}, y_{1}, y_{2}, D\right)=\mathcal{R}_{\vec{\boldsymbol{l}}} \frac{\rho\left[q G_{1}+ G_{2}+p G_{3}\right]}{\sqrt{1+p^{2}+q^{2}}},$$
where $G_{1}=\log \left (\frac {\left (x_{1}+\sqrt {D^{2}+y_{2}^{2}+x_{1}^{2}}\right )\left (x_{2}+\sqrt {D^{2}+y_{1}^{2}+x_{2}^{2}}\right )}{\left (x_{1}+\sqrt {D^{2}+y_{1}^{2}+x_{1}^{2}}\right )\left (x_{2}+\sqrt {D^{2}+y_{2}^{2}+x_{2}^{2}}\right )}\right )$, $G_{2}=\arctan \left (\frac {x_{2} y_{2}}{D \sqrt {D^{2}+y_{2}^{2}+x_{2}^{2}}}-\frac {x_{2} y_{1}}{D \sqrt {D^{2}+y_{1}^{2}+x_{2}^{2}}}\right )$ - $\arctan \left (\frac {x_{1} y_{2}}{D \sqrt {D^{2}+y_{2}^{2}+x_{1}^{2}}}-\frac {x_{1} y_{1}}{D \sqrt {D^{2}+y_{1}^{2}+x_{1}^{2}}}\right )$, and $G_{3}=\log \left (\frac {\left (y_{1}+\sqrt {D^{2}+y_{1}^{2}+x_{2}^{2}}\right )\left (y_{2}+\sqrt {D^{2}+y_{2}^{2}+x_{1}^{2}}\right )}{\left (y_{1}+\sqrt {D^{2}+y_{1}^{2}+x_{1}^{2}}\right )\left (y_{2}+\sqrt {D^{2}+y_{2}^{2}+x_{2}^{2}}\right )}\right )$.

Fig. 3. Illustration of the relative position between a planar lighting source and a micro face surface patch.

Consequently, Eq. (10) can be rewritten into the vector form as,

$$\small \mathcal{R}_{s}\left(p, q, \rho ; x_{1}, x_{2}, y_{1}, y_{2}, D\right)=\mathcal{R}_{\vec{\boldsymbol{l}}}^{E} \rho \vec{\boldsymbol{l}}_{E} \cdot \vec{\boldsymbol{n}},$$
where $\mathcal {R}_{\vec {\boldsymbol {l}}}^{E}=\mathcal {R}_{\vec {\boldsymbol {l}}} \sqrt {G_{1}^{2}+G_{2}^{2}+G_{3}^{2}}$ is the equivalent light source emissivity, and $\vec {\boldsymbol {l}}_{E}=\frac {\left (G_{1}, G_{2}, G_{3}\right )}{\sqrt {G_{1}^{2}+G_{2}^{2}+G_{3}^{2}}}$ is the equivalent infinite-distance point light source direction.
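A sketch of this closed-form light calibration (Eqs. (10)–(11), with the normalized direction following Eq. (17)) is given below; the helper name planar_light_direction and the argument order are our own illustration.

```python
import numpy as np

def planar_light_direction(x1, x2, y1, y2, D):
    """Equivalent light direction of a rectangular screen zone (Eqs. (10)-(11), (17)).

    (x1, y1) and (x2, y2) are the zone corners in the patch-centered frame,
    D is the face-to-screen distance (set to 20 cm in the paper).
    """
    r = lambda x, y: np.sqrt(D**2 + y**2 + x**2)
    G1 = np.log((x1 + r(x1, y2)) * (x2 + r(x2, y1)) /
                ((x1 + r(x1, y1)) * (x2 + r(x2, y2))))
    G2 = (np.arctan(x2 * y2 / (D * r(x2, y2)) - x2 * y1 / (D * r(x2, y1)))
          - np.arctan(x1 * y2 / (D * r(x1, y2)) - x1 * y1 / (D * r(x1, y1))))
    G3 = np.log((y1 + r(x2, y1)) * (y2 + r(x1, y2)) /
                ((y1 + r(x1, y1)) * (y2 + r(x2, y2))))
    g = np.array([G1, G2, G3])
    return g / np.linalg.norm(g)       # unit direction as in Eq. (17)
```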

3.5 Normal recovery

After obtaining the feature component $\mathbf {I}_{type}^k$, we further recover the normal vector of the face surface. First, we compute a convex hull from the aligned 106 landmark points to locate the region of interest. Subsequently, we carry out pixel-by-pixel light modeling and normal estimation. Assume that the height and width of the smartphone screen are $H_{sp}$ and $W_{sp}$, respectively.

In the $4$-zones screen lighting pattern, the position matrix of the lighting zones with respect to the nose tip is denoted by $\mathbf {P}_{4}$, where each row stores the parameters $[x_1, x_2, y_1, y_2, D]$ of one zone, and is expressed as

$$\small \mathbf{P}_{4}=\left[\begin{array}{ccccc} 0 & \frac{W_{sp}}{2} & 0 & \frac{H_{sp}}{2} & D \\ -\frac{W_{sp}}{2} & 0 & 0 & \frac{H_{sp}}{2} & D \\ 0 & \frac{W_{sp}}{2} & -\frac{H_{sp}}{2} & 0 & D \\ -\frac{W_{sp}}{2} & 0 & -\frac{H_{sp}}{2} & 0 & D \end{array}\right].$$

Moreover, we denote the height and width of the bounding box of the face region of interest as $H_{bb}$ and $W_{bb}$, respectively. The height of the average human face, $H_{face}$, is assumed to be $15~cm$. Then, the width of a real face, $W_{face}$, can be estimated as

$$W_{face}=\frac{H_{face} \times W_{bb}}{H_{bb}}.$$

Let the lower-left corner of the bounding box be the origin of the coordinate system. Let the nose tip be the center of the bounding box, with its position denoted as $(c_x, c_y)$. Let the pixel position in the $i$-th row and $j$-th column of the face image be $(p_x, p_y)$. Then, we deduce the physical offset between the face image pixel and the nose tip of the real face by Eq. (14).

$$\begin{aligned} &x_{offset}^{i, j}=\frac{\left(p_{x}-c_{x}\right) \times W_{face}}{W_{bb}},\\ &y_{offset}^{i, j}=\frac{\left(p_{y}-c_{y}\right) \times H_{face}}{H_{bb}}. \end{aligned}$$

To calculate the relative position of the face surface patch and the lighting source, we only need to subtract the corresponding offsets from the $x$ and $y$ coordinates in the position matrix. For instance, the position matrix $\mathbf {P}_{4}^{i, j}$ corresponding to the $i$-th row and $j$-th column is

$$\small \left\{ {\begin{array}{c} \mathbf{P}_{4(:, 1: 2)}^{i, j}=\mathbf{P}_{4(:, 1: 2)}-x_{offset}^{i, j} \\ \mathbf{P}_{4(:, 3: 4)}^{i, j}=\mathbf{P}_{4(:, 3: 4)}-y_{offset}^{i, j} \end{array}} \right..$$

After obtaining the surface patch position matrix $\mathbf {P}_{4}^{i, j}$, we can further obtain the planar lighting direction. According to Eq. (10), we first compute three internal parameters for the $k$-th planar light under the 4-zones pattern by

$$\small \left\{ {\begin{array}{c} G_1^k = G_1 \left( \mathbf{P}_{4(k, 1)}^{i, j}, \mathbf{P}_{4(k, 2)}^{i, j},\mathbf{P}_{4(k, 3)}^{i, j},\mathbf{P}_{4(k, 4)}^{i, j}, D \right) \\ G_2^k = G_2 \left( \mathbf{P}_{4(k, 1)}^{i, j}, \mathbf{P}_{4(k, 2)}^{i, j},\mathbf{P}_{4(k, 3)}^{i, j},\mathbf{P}_{4(k, 4)}^{i, j}, D \right) \\ G_3^k = G_3 \left( \mathbf{P}_{4(k, 1)}^{i, j}, \mathbf{P}_{4(k, 2)}^{i, j},\mathbf{P}_{4(k, 3)}^{i, j},\mathbf{P}_{4(k, 4)}^{i, j}, D \right) \end{array}} \right..$$

Then, we can compute the planar light direction by

$$\small \vec{\boldsymbol{l}}_{E,k}^{i,j} = \frac{{\left( {G_1^k ,G_2^k ,G_3^k } \right)}}{{\sqrt {\left( {G_1^k } \right)^2 + \left( {G_2^k } \right)^2 + \left( {G_3^k } \right)^2 } }}.$$

In a similar way, the planar light directions can be obtained for the 9-zones lighting pattern. Consequently, the direction of each equivalent parallel light source is obtained, and the normal vector of each face surface patch can be computed via Eq. (11).
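Building on the earlier sketch, the per-pixel light directions for the 4-zones pattern could be assembled as follows; this reuses the hypothetical planar_light_direction helper from Section 3.4 and assumes each row of P4 stores [x1, x2, y1, y2, D].

```python
import numpy as np

def per_pixel_light_dirs(P4, x_off, y_off, D):
    """Per-pixel planar light directions for the 4-zone pattern (Eqs. (15)-(17)).

    P4    : (4, 5) zone-corner matrix of Eq. (12), one row per zone.
    x_off, y_off : offsets of this pixel from the nose tip (Eq. (14)).
    Returns a (4, 3) array of unit light directions, one row per zone.
    """
    dirs = []
    for x1, x2, y1, y2, _ in P4:
        # Shift the zone corners into the patch-centered frame (Eq. (15)).
        dirs.append(planar_light_direction(x1 - x_off, x2 - x_off,
                                           y1 - y_off, y2 - y_off, D))
    return np.stack(dirs)
```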

3.6 Outlier elimination

Under ideal conditions, if the reconstructed object is rigid and stationary relative to the camera, the surface conforms to Lambertian reflection. However, illumination, hand jitter, and micro expressions together make the reconstruction conditions complex. As a result, outlier normals usually appear in areas such as the eyes, nose, eyebrows, and face contour.

In the eye region, the surface of the eyeball exhibits specular reflection. Further, hand jitter and alignment errors jointly cause slightly overlapping artifacts. In the nose region, the grease on the nose tip tends to produce specular reflection. In the eyebrow area, slight movement of the face may lead to the mixing of black non-reflective hair and normal skin. On the face contour, inaccurate landmark prediction causes the contour to expand erroneously, so the edge area contains complex background. Therefore, it is necessary to remove local outliers. Figures 4(b) and 4(c) show the results before and after the outlier elimination, respectively.

Fig. 4. Illustration of the outlier normal elimination.

The eyes and nose belong to the low-frequency components of the face surface, so we adopt a mean filter to remove the high-frequency noise and retain the low-frequency shape. For the eyes, we construct a bounding box from the landmarks as the filtering area, and then perform mean filtering with a kernel size of $(w_{eye}, w_{eye})$. For the nose tip, we shrink the landmark bounding box to half its size as the filtering area, and perform mean filtering with a kernel size of $(w_{nose}, w_{nose})$. For the eye and nose regions, $\mathbf {N}$ is filtered by

$$\small \bar {\mathbf{N}}\left( {x,y} \right) = \frac{{\sum\limits_{s ={-} w}^w {\sum\limits_{t ={-} w}^w {\kappa \left( {s,t} \right)\mathbf{N}_{\{nose, eye\}}\left( {x + s,y + t} \right)} } }}{{{{\left\| {\sum\limits_{s ={-} w}^w {\sum\limits_{t ={-} w}^w {\kappa \left( {s,t} \right)\mathbf{N}_{\{nose, eye\}}\left( {x + s,y + t} \right)} } } \right\|}_2}}},$$
where $\kappa$ is a $(2w+1)\times(2w+1)$ average convolution kernel, and ${\left \| \cdot \right \|_2}$ denotes the $L_2$-norm, which re-normalizes the filtered normal to unit length. A toy example of the filtering of the eyes and nose is provided in Figure 4(a).
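A minimal sketch of the normalized mean filtering in Eq. (18) is given below, assuming the normal map is stored as an (H, W, 3) array and using SciPy's uniform_filter; whether $w_{eye}$ and $w_{nose}$ denote half-widths or full kernel sizes is our reading of the text, not a detail the paper states explicitly.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mean_filter_normals(N, w):
    """Average-filter a normal map N of shape (H, W, 3) and re-normalize (Eq. (18)).

    w is the half-width, so the kernel size is (2w+1, 2w+1); the experimental
    values w_eye=5 and w_nose=11 appear to be full kernel sizes (assumption).
    """
    size = 2 * w + 1
    filtered = np.stack(
        [uniform_filter(N[..., c], size=size) for c in range(3)], axis=-1)
    norms = np.linalg.norm(filtered, axis=-1, keepdims=True)
    return filtered / np.clip(norms, 1e-8, None)
```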

The eyebrows contain high-frequency details, so we design a special operation as follows. For the left eyebrow (Figure 4(d)), we construct the area bounded by the upper vertex of the nose contour and the upper-left vertex of the face. As shown in Figure 4(e), we remove the eyes, eyebrows, and edges from this area, and denote the average normal of this area as the reference $\boldsymbol {n}_{ref}^{x,y}$. Considering that the angles between normal vectors in this area should be similar, we propose to use a threshold $t_{ebs}$ to remove abnormal normals. For any normal in the eyebrow area, if the angle between it and $\boldsymbol {n}_{ref}^{x,y}$ is greater than the threshold $t_{ebs}$, it is regarded as an outlier. The related normal is then replaced by the average normal of the eyebrow area. The eyebrow processing can be formulated as

$$\small \bar {\mathbf{N}}\left( {x,y} \right) = \left\{ {\begin{array}{cc} {{\sum {\boldsymbol{n}_{ebs}^{x,y}} } \mathord{\left/ {\vphantom {{\sum {\boldsymbol{n}_{ebs}^{x,y}} } {\left\| {\sum {\boldsymbol{n}_{ebs}^{x,y}} } \right\|}}} \right. } {\left\| {\sum {\boldsymbol{n}_{ebs}^{x,y}} } \right\|}}, & {\arccos \left( {{ \frac{\boldsymbol{n}_{ebs}^{x,y} \cdot \boldsymbol{n}_{ref}^{x,y} }{\left\| {\boldsymbol{n}_{ebs}^{x,y} } \right\|\left\| {\boldsymbol{n}_{ref}^{x,y} } \right\|} {} }} \right) \ge t_{ebs} } \\ {\boldsymbol{n}_{ebs}^{x,y}}, & {otherwise} \\ \end{array}} \right.,$$
where $\left \| {\sum \boldsymbol {n}_{ebs}^{x,y}} \right \|$ normalizes the summed eyebrow normals to a unit vector. For the right eyebrow, we adopt a similar scheme, as shown in Figures 4(f) and 4(g).
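The angle-threshold replacement of Eq. (19), which is also reused for the face contour below, can be sketched as follows; the flattened (K, 3) region layout and the degree-valued threshold are assumptions of this illustration.

```python
import numpy as np

def replace_outlier_normals(N_region, n_ref, t_deg):
    """Replace normals whose angle to the reference exceeds t_deg (Eq. (19)).

    N_region : (K, 3) normals of the eyebrow (or contour) region.
    n_ref    : (3,) average normal of the reference area.
    Returns the cleaned (K, 3) normal array.
    """
    n_ref = n_ref / np.linalg.norm(n_ref)
    cos_ang = np.clip(N_region @ n_ref / np.linalg.norm(N_region, axis=1), -1.0, 1.0)
    outliers = np.degrees(np.arccos(cos_ang)) >= t_deg
    # Normalized average normal of the region, used as the replacement.
    mean_n = N_region.sum(axis=0)
    mean_n /= np.linalg.norm(mean_n)
    cleaned = N_region.copy()
    cleaned[outliers] = mean_n
    return cleaned
```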

For the face contour, we obtain the facial boundary via the aligned landmarks from Eq. (7), and denote the boundary width by $w_{edge}$ (Figures 4(h) and 4(k)). The main goal is to remove outliers and smooth the face contour region. For the left boundary area, we construct the reference area at the left vertex of the nose contour, and remove the eyes and edge. This reference area is shown in Figure 4(i), and $t_{edge}$ is used as the threshold to filter abnormal normals via Eq. (19). For the right edge area, we take a similar approach, as shown in Figure 4(j). Finally, we further perform smoothing with Eq. (18) on the edge area with a kernel size of $(w_{edge}, w_{edge})$.

3.7 Face surface reconstruction

After recovering the normal map, we can integrate it to obtain the final reconstructed face surface, a process also called surface-from-normal (SfN) [23]. Specifically, [24] developed an efficient SfN method based on discrete geometric processing (DGP) to reconstruct the 3D surface. In DGP, each pixel in a normal map is expressed as a quadrilateral facet. Let the quadrilateral facet related to a pixel $({x,y})$ be represented by $\mathcal {F}_{x,y}$ with the normal $\vec {\mathbf {n}}_{x,y}$. $\mathcal {F}_{x,y}$ has four corner vertexes, $\mathbf {v}_{x,y}$, $\mathbf {v}_{x,y+1}$, $\mathbf {v}_{x+1,y+1}$, and $\mathbf {v}_{x+1,y}$, ordered from the upper-left to the bottom-left. Let $\mathbf {d}_{x,y}$ be the depth component of $\mathbf {v}_{x,y}$. In addition, let $\mathcal {Z} (\cdot )$ represent the height (or depth) field function, so that $\mathcal {Z}(\mathbf {v}_{x,y})=\mathbf {d}_{x,y}$.

Given $\vec {\mathbf {n}}_{x,y}$, a shape-up strategy is used to predict the depth value of $\mathbf {v}_{x,y}$. A straightforward way to reconstruct a face surface ${\mathcal {F}}$ is to move every vertex $\mathbf {v}_{x,y}$ to a proper position such that the normal of ${\mathcal {F}_{x,y}}$ is the same as $\vec {\mathbf {n}}_{x,y}$. Thus, the SfN problem can be represented by the following optimization.

$${\mathop {\boldsymbol{\min} }_{\left\{ \mathbf{d}_{x,y} \right\}} } \quad {\mathbf{E}\left( {\mathcal{F} } \right)} \quad {subject~~ to} \quad {\mathbf{N}\left( { {\mathcal{F}_{x,y} }} \right) = {\vec{\mathbf{n}}_{x,y}}} ,$$
where $\mathbf {E} (\cdot )$ denotes a function measuring shape variation of the reconstructed 3D face surface $\mathcal {F}$, and $\mathbf {N}(\cdot )$ returns the normal vector for the facet ${\mathcal {F}_{x,y} }$.

Numerical DGP solver: We solve the problem in Eq. (20) by a local-global DGP strategy. In other words, the 3D face shape is reconstructed by iteratively applying local projection and global deformation operations. To be more specific, each facet ${\mathcal {F}_{x,y}}$ is first lifted by locally projecting the related depth values onto a feasible region one by one, and then all facets are globally blended by minimizing the shape variation subject to the normal constraint.

1) Local projection: In the local projection, the vertexes of ${\mathcal {F}_{x,y}}$ are projected onto a plane with the normal $\vec {\mathbf {n}}_{x,y}$, where the plane is assumed to pass through the center $\vec {\mathbf {c}}_{x,y}$. Then, the local projection of each vertex in ${\mathcal {F}_{x,y}}$ can be obtained by

$$\small \mathcal{L}(\mathbf{v}_{x+i,y+j})=\vec{\mathbf{c}}_{x,y} - \frac{{ {i} \times {\vec{\mathbf{n}}}_{x,y}^x + {j} \times {\vec{\mathbf{n}}}_{x,y}^y}}{{{\vec{\mathbf{n}}}_{x,y}^z}},$$
where $\vec {\mathbf {c}}_{x,y}=[x, y, \left ( {\mathbf {d}_{x,y} + \mathbf {d}_{x + 1,y} + \mathbf {d}_{x,y + 1} + \mathbf {d}_{x + 1,y + 1} } \right ) / 4]^T$, and $\vec {\mathbf {n}}_{x,y}^{x}$ denotes the $x$-axis component of $\vec {\mathbf {n}}_{x,y}$. Accordingly, $\vec {\mathbf {n}}_{x,y}^{y}$ and $\vec {\mathbf {n}}_{x,y}^{z}$ are the $y$- and $z$-axis components of $\vec {\mathbf {n}}_{x,y}$, respectively.

2) Global deformation: In the global deformation, the face surface is deformed so that all vertexes are expected to generate the same surface shape as the projected one in Eq. (21). Thus, the global deformation process, $\mathcal {G} \left ( {\mathcal {F}_{x,y}} \right )$, can be modeled by minimizing the following least squares problem,

$$\small \mathcal{G} \left( {\mathcal{F}_{x,y}} \right) = \sum_{{ {\mathcal{F}_{x,y}}}} { {{\left\| {{\mathcal{Z}}\left( {{ {\mathcal{F}_{x,y}}}} \right) - {\mathcal{L}}\left( {{ {\mathcal{F}_{x,y}}}} \right)} \right\|}_2^2}} ,$$
where ${\mathcal {Z}}\left ( {\mathcal {F}_{x,y}} \right )$ returns a column vector stacking the depths of the four vertexes of ${\mathcal {F}_{x,y}}$, and $\mathcal {L}( {\mathcal {F}_{x,y}})$ returns a column vector stacking the local projection depths for all vertexes in the facet ${\mathcal {F}_{x,y}}$. Typically, $\mathcal {G} \left ( {\mathcal {F}_{x,y}} \right )$ enforces ${\mathcal {Z}}\left ( {\mathcal {F}_{x,y}} \right )$ to be close to ${\mathcal {L}}\left ( {\mathcal {F}_{x,y}} \right )$.

After initializing $\mathbf {d}_{x,y}$, it is used to predict the local positions by Eq. (21) in the local projection. Then, the resulting $\mathcal {L}(\mathbf {v}_{x,y})$ is used to update the vertex positions by minimizing Eq. (22) in the global deformation. This local-global operation is performed iteratively, eventually forming a mesh surface $\mathcal {F}$ whose facet orientations are expected to match the input normals. A minimal sketch of such a solver is given below.
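The following sketch illustrates one way to realize this local-global iteration (Eqs. (21)–(22)); the zero depth initialization and the fixed iteration count are assumptions, and it is not the authors' implementation of [24].

```python
import numpy as np

def surface_from_normals(normal_map, iters=200):
    """Local-global DGP sketch: recover a vertex depth grid from unit normals.

    normal_map : (H, W, 3) unit normals; returns an (H+1, W+1) depth grid.
    """
    H, W, _ = normal_map.shape
    d = np.zeros((H + 1, W + 1))                          # initial vertex depths
    nx, ny, nz = normal_map[..., 0], normal_map[..., 1], normal_map[..., 2]
    nz = np.where(np.abs(nz) < 1e-6, 1e-6, nz)            # avoid division by zero
    for _ in range(iters):
        # Local projection (Eq. (21)): facet center depth minus slope offsets.
        c = 0.25 * (d[:-1, :-1] + d[:-1, 1:] + d[1:, :-1] + d[1:, 1:])
        t00 = c                                  # corner offset (i, j) = (0, 0)
        t01 = c - ny / nz                        # (i, j) = (0, 1)
        t10 = c - nx / nz                        # (i, j) = (1, 0)
        t11 = c - (nx + ny) / nz                 # (i, j) = (1, 1)
        # Global deformation (Eq. (22)): average the targets at each shared vertex.
        acc = np.zeros_like(d)
        cnt = np.zeros_like(d)
        for t, (di, dj) in zip((t00, t01, t10, t11), ((0, 0), (0, 1), (1, 0), (1, 1))):
            acc[di:H + di, dj:W + dj] += t
            cnt[di:H + di, dj:W + dj] += 1
        d = acc / np.maximum(cnt, 1)
    return d
```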

4. Experimental results and analysis

In the experiments, we choose the Xiaomi Mi 8SE as the test smartphone. The focal length of the front camera is $3.86~mm$. The screen height $H_{sp}$ and width $W_{sp}$ are $13.5~cm$ and $6.5~cm$, respectively. In the outlier elimination, $w_{eye}=5$, $w_{nose}=11$, $w_{edge}=11$, $t_{ebs}=30$, and $t_{edge}=90$. To satisfy the assumptions mentioned above, we draw a face contour on the front screen. When capturing an image, once a subject fits the reference face contour, we consider that the test face meets the previous assumptions for simplicity.

4.1 Visual analysis of the reconstruction results

4.1.1 Face reconstruction in a dark environment

Figure 5 shows the reconstruction results based on the two screen lighting patterns (i.e. 4-zones and 9-zones) in a dark environment. We have the following observations based on the practical reconstructed faces: 1) The 9-zones pattern provides better visual results than the 4-zones pattern under the same color lighting. The main reason is that Eq. (3) can obtain a better estimation of $\mathbf {N}$ when the number of lights increases. 2) White light ($W4$ and $W9$ in Figs. 5(a) and (e)) generates a better surface than the R-G-B color lights. The main reason may be that the three primary lighting colors can be absorbed or affected by the facial skin color, which further increases the outlier normals. 3) Red light in the 4-zones pattern generates the worst reconstructed surface. It can be seen in Figure 5(b) that the red light $R4$ interacts heavily with the skin color. The central axis area of the face under red lighting shows large noise, which causes serious distortion of the reconstructed face shape. However, the reconstruction performance can be greatly improved (see $R9$ in Figure 5(f)) when the number of input face images increases.

Fig. 5. Reconstruction results of SCIFI in a practical dark environment. Two screen lighting patterns (4-zones and 9-zones) are evaluated under four different color lights, including white ($W4$ and $W9$), red ($R4$ and $R9$), green ($G4$ and $G9$), and blue ($B4$ and $B9$). From left to right: face photographs, estimated normals, surface in view 1, and surface in view 2.

4.1.2 Face reconstruction in a bright environment

Figure 6 shows the reconstruction results in a natural light environment. From left to right, we provide the captured images under different light sources, the restored normal map, and the reconstruction results from two viewpoints. In a natural light environment, the blue and green lights achieve more satisfactory reconstruction results than the red light. Further, the reconstructed result of the red light in a bright environment is slightly better than in a dark environment. This also confirms that the red color is similar to skin color, and that the environment lighting can alleviate the effect of the red light.

Fig. 6. Reconstruction results of SCIFI in a practical natural lighting condition. Two screen lighting patterns (4-zones and 9-zones) are evaluated under three different color lights, including red ($R4$ and $R9$), green ($G4$ and $G9$), and blue ($B4$ and $B9$). From left to right: face photographs, estimated normals, surface in view 1, and surface in view 2.

4.1.3 Facial expression reconstruction

Figure 7 shows the reconstruction results of facial expression details under the 4-zones pattern in a dark environment. For comparison, we magnify some key face regions, such as the mouth, nose, and brow. It shows that SCIFI can successfully restore the micro textures of different facial expressions. This can be very useful for face recognition, especially in liveness detection.

Fig. 7. Reconstruction results of SCIFI on facial expressions. Micro expression details can be successfully reconstructed in a dark environment. Please zoom in the electronic version for details.

4.1.4 Reconstruction performance comparisons

Figure 8 shows the comparison results for five different subjects. Without loss of generality, we provide the reconstruction results under the white light in the 4-zones pattern. For each subject, we compare three representative reconstruction methods: SCIFI, Farrukh2020 [16], and Chen2019 [25]. Compared with the other two latest methods, SCIFI provides better reconstruction results. Further, the deep-learning model in Chen2019 [25] has been trained with image data captured in a laboratory environment, and its reconstruction results are poor under a really challenging condition. This also confirms that domain adaptation remains one of the main unsolved problems in deep learning.

Fig. 8. Comparison results for five different subjects under the $W4$ pattern. From left to right: face photographs, SCIFI, Farrukh2020 [16], and Chen2019 [25].

4.2 Quantitative comparison and analysis

4.2.1 Shape angle change analysis

To evaluate the shape correctness of the reconstruction results, we select $20$ models from the high-precision face dataset Facescape [17], and compute the average normal map as the baseline (denoted by “Baseline”). Figure 9(a) shows a typical example. We conduct the evaluation by comparing our reconstructed normal map with the Baseline.

Fig. 9. Illustrations of sampling lines and the related surface angle change curves. Please zoom in the electronic version for better details.

As shown by the yellow line in Figure 9(a), we take a cross-section of the 3D face in the vertical direction, and then sample it. Specifically, the profile of this cross-section is represented as $y(x)$. Its derivative can be obtained from the difference of neighboring normals, which represents the slope at each point. Finally, by converting the slope into an angle, we obtain the angle function $\theta (x)$, and then investigate the 3D face property based on $\theta (x)$.

Further, we segment the sampling line with the help of the acquired face landmarks. The positions of the key points (blue line) on the cross-section are shown in Figure 9(b), and the corresponding angle change is shown in Figure 9(c). Compared with the Baseline, the angle change curve of Farrukh2020 [16] is larger, indicating that the recovered face shape is severely curved in the vertical direction. For Chen2019 [25], the angle change curve is almost a straight line, indicating that the recovered face shape has almost no angle change in the vertical direction. Compared with Farrukh2020 and Chen2019, SCIFI achieves the highest similarity to the Baseline curve. Similarly, we also conduct experiments in the horizontal direction, as shown in Figure 9(d). The experimental results show a similar trend, which validates the effectiveness of SCIFI.

4.2.2 Quantitative comparison

To quantitatively analyze the quality of face reconstruction, we compute the average angle deviation between each sampling point and the corresponding position of the baseline. The average angle deviation is denoted by $\hat {\theta }$, and defined as

$$\hat{\theta}=\frac{\left(\left(\sum_{k=1}^{K_{h}}\left|\theta_{k}^{h_r}-\theta_{k}^{h}\right|\right) / K_{h}\right) +\left(\left(\sum_{k=1}^{K_{v}}\left|\theta_{k}^{v_r}-\theta_{k}^{v}\right|\right) / K_{v}\right)}{2},$$
where $\theta _{k}^{h_r}$ and $\theta _{k}^{h}$ represent the angles at the $k$-th sampling point on the surface of the reconstructed face and the Baseline in the horizontal direction, respectively; $\theta _{k}^{v_r}$ and $\theta _{k}^{v}$ represent the corresponding angles in the vertical direction; and $K_{h}$ and $K_{v}$ denote the total numbers of sampling points in the horizontal and vertical directions, respectively.
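A minimal sketch of this metric is shown below, assuming the horizontal and vertical mean deviations are averaged as in Eq. (23); the function name and per-point angle inputs are illustrative.

```python
import numpy as np

def average_angle_deviation(theta_h_rec, theta_h_base, theta_v_rec, theta_v_base):
    """Average angle deviation of Eq. (23); inputs are per-point angles (degrees)."""
    dev_h = np.mean(np.abs(np.asarray(theta_h_rec) - np.asarray(theta_h_base)))
    dev_v = np.mean(np.abs(np.asarray(theta_v_rec) - np.asarray(theta_v_base)))
    return 0.5 * (dev_h + dev_v)
```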

Table 1 provides the comparison of the average angle deviation and the running time in a dark environment. The average angle deviation is computed by Eq. (23), where the five subjects in Figure 8 are denoted by #1–#5 from top to bottom. Compared with Farrukh2020 and Chen2019, SCIFI achieves the smallest angle deviation, which demonstrates the fidelity of the reconstructed face structure.

Table 1. Comparison of the average angle deviation and the running time in dark environment.

5. Conclusion and future work

3D face reconstruction on smartphones is a very useful technique, which can be widely used in face recognition, liveness detection, face animation, etc. In this paper, we have presented a robust 3D face reconstruction system for smartphones, which relies only on a single front camera with screen illumination. To restore micro facial textures, we have explored calibrated planar lighting for Lambertian-based reconstruction. In addition, we have employed face landmarks to remove the position mismatch between multiple photographs caused by hand jitter. Moreover, we have designed different methods to eliminate singular normals based on the characteristics of the human face. Experimental results show that the proposed method can restore a more satisfying surface shape as well as micro facial texture in comparison with the latest uncalibrated and data-driven schemes. We believe that the proposed method can provide new inspiration for smartphone-based security authentication in the future.

In the future, some further explorations can be developed to improve the performance of the proposed method. First, the current algorithm is simple and efficient for a single-view 3D face surface, but its performance in a multiple-view case is unknown and needs to be investigated. Second, the proposed method depends on the screen illumination, and a complicated lighting condition can be an obstacle. Learning-based methods can be a useful tool to address this problem. Third, the normal precision estimated by the proposed method is relatively lower than that of a hardware-based method, and it is worth further investigating normal enhancement methods to improve the reconstruction quality.

Funding

National Natural Science Foundation of China (61701310, 61902251); Natural Science Foundation of Guangdong Province (2019A1515010961, 2021A1515011877); Natural Science Foundation of Shenzhen City (20200805200145001, JCYJ20180305124209486).

Acknowledgements

The face photographs used in this article are from our laboratory members, and we thank them for their permission. We would like to give thanks to Mr. Maolin Cui for the preliminary data preparation. The authors would also like to express their sincere gratitude to the anonymous reviewers for valuable comments and suggestions.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. A. D. Griffiths, J. Herrnsdorf, M. J. Strain, and M. D. Dawson, “Scalable visible light communications with a micro-led array projector and high-speed smartphone camera,” Opt. Express 27(11), 15585–15594 (2019). [CrossRef]  

2. M. Uzair, A. Mahmood, F. Shafait, C. Nansen, and A. Mian, “Is spectral reflectance of the face a reliable biometric?” Opt. Express 23(12), 15160–15173 (2015). [CrossRef]  

3. L. Birla and P. Gupta, “PATRON: Exploring respiratory signal derived from non-contact face videos for face anti-spoofing,” Elsevier Expert. Syst. with Appl. 187, 115883 (2022). [CrossRef]  

4. C. Shin, S.-H. Hong, and H. Yoon, “Enriching Natural Monument with User-Generated Mobile Augmented Reality Mashup,” J. Multimed. Inf. Syst. 7(1), 25–32 (2020). [CrossRef]  

5. R. McGregor, “Apple Animoji Demo for IPhoneX,” https://www.youtube.com/watch?v=Hdvqb3PJWYw (2017). [Online; accessed Nov. 2021].

6. Z. Niu, J. Shi, L. Sun, Y. Zhu, J. Fan, and G. Zeng, “Photon-limited face image super-resolution based on deep learning,” Opt. Express 26(18), 22773–22782 (2018). [CrossRef]  

7. P. Zhou, J. Zhu, and Z. You, “3D face registration solution with speckle encoding based spatial-temporal logical correlation algorithm,” Opt. Express 27(15), 21004–21019 (2019). [CrossRef]  

8. Y. Guo, “3D graphics platforms and tools for mobile applications,” Tampere University of Technology, pp. 1–66 (2014).

9. P. Zhou, J. Zhu, and H. Jing, “Optical 3D surface reconstruction with color binary speckle pattern encoding,” Opt. Express 26(3), 3452–3465 (2018). [CrossRef]  

10. L. Ma, Y. Lyu, X. Pei, Y. M. Hu, and F. M. Sun, “Scaled SFS method for Lambertian surface 3D measurement under point source lighting,” Opt. Express 26(11), 14251–14258 (2018). [CrossRef]  

11. V. Blanz and T. Vetter, “A morphable model for the synthesis of 3D faces,” in Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), (1999), pp. 187–194.

12. G. Vogiatzis and C. Hernández, “Self-calibrated, multi-spectral photometric stereo for 3D face capture,” Springer Int. J. Comput. Vis. 97(1), 91–103 (2012). [CrossRef]  

13. X. Cao, Z. Chen, A. Chen, X. Chen, S. Li, and J. Yu, “Sparse photometric 3D face reconstruction guided by morphable models,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018), pp. 4635–4644.

14. X. Wang, Y. Guo, B. Deng, and J. Zhang, “Lightweight photometric stereo for facial details recovery,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2020), pp. 740–749.

15. S. Agrawal, A. Pahuja, and S. Lucey, “High accuracy face geometry capture using a smartphone video,” in IEEE Winter Conference on Applications of Computer Vision, (2020), pp. 81–90.

16. H. Farrukh, R. M. Aburas, S. Cao, and H. Wang, “FaceRevelio: a face liveness detection system for smartphones with a single front camera,” in Annual International Conference on Mobile Computing and Networking, (2020), pp. 1–13.

17. H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, and X. Cao, “Facescape: a large-scale high quality 3D face dataset and detailed riggable 3D face prediction,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2020), pp. 601–610.

18. W. Xie, C. Dai, and C. C. Wang, “Photometric stereo with near point lighting: A solution by mesh deformation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015), pp. 4585–4593.

19. Face++, “Face Landmarks,” https://www.faceplusplus.com/landmarks/ (2021). [Online; accessed Oct. 2021].

20. W. Xie, Y. Nie, Z. Song, and C. C. Wang, “Mesh-based computation for solving photometric stereo with near point lighting,” IEEE Comput. Grap. Appl. 39(3), 73–85 (2019). [CrossRef]  

21. L. Bi, Z. Song, and L. Xie, “A novel LCD based photometric stereo method,” in IEEE International Conference on Information Science and Technology (ICIST), (2014), pp. 611–614.

22. J. J. Clark, “Photometric stereo using LCD displays,” Elsevier Image Vis. Comput. 28(4), 704–714 (2010). [CrossRef]  

23. M. Wang, W. Xie, and M. Cui, “Surface Reconstruction with Unconnected Normal Maps: An Efficient Mesh-based Approach,” in ACM International Conference on Multimedia (MM), (2020), pp. 2617–2625.

24. W. Xie, Y. Zhang, C. C. Wang, and R. C.-K. Chung, “Surface-from-gradients: An approach based on discrete geometry processing,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2014), pp. 2195–2202.

25. G. Chen, K. Han, B. Shi, Y. Matsushita, and K.-Y. K. Wong, “Self-calibrating deep photometric stereo networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2019), pp. 8739–8747.



