Neural distortion fields for spatial calibration of wide field-of-view near-eye displays

Abstract

We propose a spatial calibration method for wide field-of-view (FoV) near-eye displays (NEDs) with complex image distortions. Image distortions in NEDs can break the realism of virtual objects and cause sickness. To achieve distortion-free images in NEDs, it is necessary to establish a pixel-by-pixel correspondence between the viewpoint and the displayed image. Designing compact, wide-FoV NEDs requires complex optical designs. In such designs, the displayed images are subject to gaze-contingent, non-linear geometric distortions, which can be difficult to represent with explicit geometric models or computationally expensive to optimize. To solve these problems, we propose the neural distortion field (NDF), a fully-connected deep neural network that implicitly represents display surfaces complexly distorted in space. NDF takes a spatial position and gaze direction as input and outputs the display pixel coordinate and its intensity as perceived in the input gaze direction. We synthesize the distortion map from a novel viewpoint by querying points on the ray from the viewpoint and computing a weighted sum that projects the output display coordinates into an image. Experiments showed that NDF calibrates an augmented reality NED with a 90° FoV to a median error of about 3.23 pixels (5.8 arcmin) using only 8 training viewpoints. Additionally, we confirmed that NDF calibrates more accurately than non-linear polynomial fitting, especially around the center of the FoV.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Near-eye displays (NEDs) provide an immersive visual experience in virtual reality (VR) and augmented reality (AR) by overlaying virtual images directly onto the user’s view. Improving NED performance inevitably involves trade-offs between optical design and image quality. Achieving a wide field of view (FoV), a key requirement for NEDs to enhance immersion [1], tends to compromise either the form factor [2] or image quality, because image-expanding optics introduce distortion. Modern NEDs employ a variety of off-axis optics for wide FoV, such as curved beam splitters [3–5], holographic optical elements (HOEs) [6,7], and polarization-based pancake optics [8,9]. While they improve the FoV and potentially reduce the display form factor, these complex, off-axis optics cause non-linear, viewpoint-dependent image distortion (Fig. 1(a)). Due to this distortion, a rectangular image on the display appears warped, as if projected onto a curved surface, from the viewpoint. Especially with wide-FoV NEDs, large image distortion occurs at the periphery of the display, causing the image to constantly sway as the eyes rotate and move, a phenomenon called pupil swim [10]. This image distortion and pupil swim can break the reality of virtual objects and, in the worst case, cause severe headaches and nausea.

Fig. 1. (a) Images of the checker pattern displayed on the wide-FoV AR-NED, captured from the viewpoint camera. Due to the complex beam combiner optics of the wide-FoV NED, the displayed image is subject to viewpoint-dependent image distortion. The images are captured from two different viewpoints with a distance of 6 mm along the x-axis. (b) Schematic diagram of the mapping between the display image and the retinal (or the viewpoint) image. (Left) When images are displayed directly on the NED, they appear distorted according to the user’s viewpoint. The mapping function $f$ represents the correspondence from the viewpoint coordinate to the display coordinate. (Right) If we correct the displayed image in advance, we can perceive the image without distortion.

This paper focuses on modeling this non-linear, viewpoint-dependent image distortion in wide-FoV NEDs. Let $D$ and $E$ be the coordinate systems of the display and retinal image (or viewpoint camera image), respectively. To formulate the image distortion, we seek a mapping function $f$ that indicates which coordinate on the display image ${\textbf{u}}_{D}\in \mathbb {R}^{2}$ appears at a given coordinate in the retinal image ${\textbf{u}}_{E}\in \mathbb {R}^{2}$, as shown in Fig. 1(b). Especially for wide-FoV NEDs, this map varies not only with ${\textbf{u}}_{E}$ but also with the translation and rotation of the eye. In this paper, we denote an eye pose by a 6D vector $\textbf{p}=[{\textbf{v}}, \textbf{t}] \in \mathbb {R}^{6}$ composed of the 3D rotation vector ${\textbf{v}} \in \mathbb {R}^{3}$ and the position vector $\textbf{t} \in \mathbb {R}^{3}$. Using this notation, the mapping function we focus on, $f:\mathbb {R}^{8}\rightarrow \mathbb {R}^{2}$, is written as

$${\textbf{u}}_{D} = f({\textbf{u}}_{E}, \textbf{p}).$$

Theoretically, we can estimate $f$ from the optical prescription of the NED. However, deformation of the optics and assembly errors are inevitable due to the manufacturing process and aging. Thus, calibration of $f$ is necessary for practical use. In the case of an NED with a typical beam splitter, the light emitted from each display pixel is perceived as a point light source at the viewpoint (Fig. 2(a)). Hence, conventional work [11] recovers $f$ by estimating the 3D position of the point source of each display pixel using triangulation, then projecting it into a retinal image at a novel viewpoint. However, in the case of wide-FoV NEDs, the virtual light source of each display pixel is distributed in space due to the complex optics of the projection system (Fig. 2(b)). Therefore, accurate mapping with wide-FoV NEDs requires estimating the light field formed by the complex optical design.

Fig. 2. (a) A simplified optical simulation of NEDs employing different optics. (Left) Each display pixel is perceived as a point light source in a conventional flat beam splitter. (Right) When we use a curved mirror to expand the FoV, the perceived pixel gradually deviates from the point light source. For this simulation, we used a 2D ray optics simulator [22]. (b) Diagram of the relationship between the rays from the viewpoint and the perceived light source from the NED in the wide-FoV case in (a). In this case, we can model the perceived light sources as multiple translucent curved displays in space.

Furthermore, in wide-FoV NEDs, image distortion changes dynamically with the viewpoint, making image distortion correction more challenging. Modeling and correcting static image distortion in VR-NEDs is a well-established technique [12,13]. For dynamic image distortion correction, mainstream approaches extend the polynomial model used for static distortion correction to handle translations and rotations of the eye. Although some studies [14–16] have dealt with eye translation, the number of coefficients is insufficient to represent dynamic image distortion. Moreover, these studies do not take into account image distortions due to eye rotation, except for a method that approximates the light ray field directly with Gaussian polynomial fitting [17].

On the other hand, ray-tracing-based approaches [5,10] model image distortion by simulating the multi-stage refraction and reflection of light rays passing through the optical system. Although these methods achieve high accuracy, they require substantial computation time and are not practical for interactive VR/AR applications. As a hybrid approach, concurrent with our work, Guan et al. applied nonlinear dimensionality reduction to pre-traced light rays from a lens design application to simulate viewpoint-dependent image distortion in real time [18].

In contrast, we propose an implicit representation model for viewpoint-dependent image distortion, which differs fundamentally from previous approaches. Our neural distortion field (NDF) learns a distortion map $f$ directly from a set of observed images without explicitly simulating the light field or modeling optical aberrations with polynomials. NDF is an extension of neural radiance fields (NeRF) [19], a neural network-based representation developed for novel view synthesis from multi-view color images. NeRF implicitly learns viewpoint-dependent light reflections and refractions and synthesizes novel-view images. Similarly, NDF is a neural network-based representation of the behavior of light rays traveling from each display pixel through the NED optics. By using volumetric rendering on the NDF representation, we can synthesize a non-linear, viewpoint-dependent distortion map from a novel viewpoint.

The key novelty of our work is in applying NeRF’s implicit, neural net-based view synthesis method to image distortion correction. In the field of holography, Neural Holography [20,21] implicitly represents optical misalignment, wave propagation, and display characteristics, and has significantly improved the quality of displayed images and processing time. Similarly, our method aims to provide another solution based on implicit function representation for the problem of image distortion modeling in NEDs, offering advantages such as improved processing time and accuracy. Furthermore, NDF could be incorporated into existing ray-tracing-based image distortion correction and extended to a hybrid distortion representation model, similar to [18]. As a proof of concept and a first step toward such a model, in this paper we evaluate the accuracy of image distortion reproduction by a fully implicit NDF model.

Contributions. Our main contributions include the following:

  • We propose NDF, a neural network model that implicitly learns complex, viewpoint-dependent image distortion maps of NEDs directly from observed images.
  • Experiments using an off-the-shelf wide-FoV AR-NED show that NDF can simulate image distortion as accurately as or better than conventional non-linear polynomial mapping.
  • We discuss improvements of NDF for wide-FoV HMDs and other optical designs and outline future research directions for image distortion correction with implicit representation models.

2. Methods

In this section, we describe the basic NDF pipeline. Figure 3 visualizes the overall pipeline of NDF. Sec. 2.1 provides an overview of the NDF representation. Sec. 2.2 describes how to synthesize the mapping $f$ (Eq. (1)) from the outputs of NDF. Finally, Sec. 2.3 describes how to train NDF from ground truth data, including some tricks that improve NDF training. Note that the description of NDF in this section follows the original NeRF. With the recent development of NeRF research, various improved methods have already been proposed. Our implementation is based on an improved version of NeRF, which is described in Sec. 3.

Fig. 3. An overview of our NDF representation and distortion map generation. (a) We cast a ray from the viewpoint through each pixel on the retinal image and sample points along the ray. NDF accepts as input the 3D coordinates of each point and the ray direction. (b) The NDF returns the corresponding display coordinates and their intensity. The model assumes that there are myriad translucent curved displays in space, similar to Fig. 2(b). The output coordinates of the NDF represent which display pixels reach the eye at the input position and viewing direction. (c) We sum the output of the NDF at each point on the ray weighted by the intensity, and (d) estimate the subpixel display coordinate perceived at the eye pose that defines the ray. During training, we compute the loss between the estimated coordinate and the ground truth and back-propagate it to the NDF.

2.1 Neural distortion field for distortion map representation

First, we describe our NDF representation. Briefly, when we look at a point from a certain direction, NDF returns the display pixel coordinate from which the perceived light ray originates and its intensity. NDF is represented as a multi-layer perceptron (MLP) $F_{\Theta }$, whose inputs are 5D coordinates (spatial position ${\textbf{x}} = [x, y, z]^{\mathrm {T}} \in \mathbb {R}^{3}$ and viewing direction $(\theta, \phi )$), and whose outputs are the display coordinates ${\textbf{u}}_{D}$ and the intensity $\rho$ of the light source. In practice, we express the viewing direction as a 3D Cartesian unit vector ${\textbf{d}} \in \mathbb {R}^{3}$, i.e., $F_{\Theta }:\mathbb {R}^{6}\rightarrow \mathbb {R}^{3}$. Note that later, in Sec. 2.3, we encode the input position ${\textbf{x}}$ into a higher-dimensional vector of dimension $L$, i.e., $F_{\Theta }:\mathbb {R}^{L+3}\rightarrow \mathbb {R}^{3}$.

We consider the ray connecting the eye position $\textbf{t}$ and each pixel on the retinal image ${\textbf{u}}_{E}$ (Fig. 3(a)). This ray is written as ${\textbf{r}}(s) = \textbf{t} + s{\textbf{d}}$. Since ${\textbf{d}} \in \mathbb {R}^{3}$ is the ray direction, which moves in conjunction with the eye rotation ${\textbf{v}}$, the ray ${\textbf{r}}(s)$ is essentially determined from ${\textbf{u}}_{E}$ and $\textbf{p}=[{\textbf{v}}, \textbf{t}]$. When we sample a position and viewing direction $({\textbf{r}}(s_i), {\textbf{d}})$ on the ray at depth $s_i$ as input, the NDF $F_{\Theta }$ outputs the display coordinate and the intensity $({{\textbf{u}}_{D}}_{i}, \rho _{i})$ (Fig. 3(b)).

Qualitatively, as the light from the microdisplay passes through the optical system, reflections and refractions create numerous translucent display surfaces in space, as shown in Fig. 3(b). From this viewpoint, NDF can be regarded as implicitly learning these multiple, translucent display surfaces formed in space.
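To make the interface above concrete, the following Python/NumPy sketch samples points along a viewing ray and queries a stand-in for $F_{\Theta }$. The function names (`sample_along_ray`, `ndf_query`), the single hidden layer, and the random weights are illustrative assumptions only; the actual network is the mip-NeRF-based MLP described in Sec. 3, and the real pipeline additionally applies the positional encoding of Sec. 2.3.1.

```python
import numpy as np

def sample_along_ray(t, d, s_near, s_far, num_samples):
    """Sample points r(s) = t + s*d along a viewing ray.

    t: (3,) eye position, d: (3,) unit viewing direction.
    Returns sample depths s_i and the 3D positions r(s_i)."""
    s = np.linspace(s_near, s_far, num_samples)           # depths s_1 .. s_P
    points = t[None, :] + s[:, None] * d[None, :]          # (P, 3) positions
    return s, points

def ndf_query(points, direction, params):
    """Stand-in for the trained MLP F_Theta: (x, d) -> (u_D, rho).

    Returns per-sample display coordinates (P, 2) and intensities (P,)."""
    W1, b1, W2, b2 = params
    x = np.concatenate([points, np.broadcast_to(direction, points.shape)], axis=1)
    h = np.maximum(x @ W1 + b1, 0.0)                        # one hidden ReLU layer
    out = h @ W2 + b2
    u_D, rho = out[:, :2], np.maximum(out[:, 2], 0.0)       # keep intensity non-negative
    return u_D, rho

# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
params = (rng.normal(size=(6, 64)), np.zeros(64),
          rng.normal(size=(64, 3)), np.zeros(3))
s, pts = sample_along_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]), 0.1, 2.0, 64)
u_D, rho = ndf_query(pts, np.array([0.0, 0.0, 1.0]), params)
```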

2.2 Distortion map reconstruction from neural distortion field

NDF outputs the display coordinate ${{\textbf{u}}_{D}}_{i}$ and its intensity $\rho _i$, which describe the source of the light perceived at $({\textbf{r}}(s_i), {\textbf{d}})$. By computing a weighted sum of the outputs along the ray ${\textbf{r}}(s)$, we can estimate the display pixel that is the source of the light perceived at each pixel ${\textbf{u}}_{E}$ on the retinal image (Fig. 3(c)). We denote this weighted sum of display pixel coordinates by ${\bar {\textbf{u}}}_{D}$.

Here, we sample $P$ points along the ray ${\textbf{r}}(s)$, indexed as $\{s_i\}_{i=1}^{P}$ in order of proximity to the viewpoint. Given these sample points as input, NDF outputs $\{({\textbf{u}}_{D i}, \rho _i )\}_{i=1}^{P}$. Using these outputs, we calculate ${{\bar {\textbf{u}}}_{D}}$ as

$${\bar{\textbf{u}}}_{D}({\textbf{r}})=\sum^{P}_{i=1}\tau_i(1-\exp(-\rho_i\delta_i)){\textbf{u}}_{D i}, \quad \tau_i = \exp\left(-\sum^{i-1}_{j=1}\rho_j\delta_j\right),$$
where $\delta _i = s_{i+1}-s_{i}$ is the distance between adjacent samples. Note that from Sec. 2.1, since the ray ${\textbf{r}}$ is determined from ${\textbf{u}}_{E}$ and $\textbf{p}$, Eq. (2) satisfies the form of Eq. (1).
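As a minimal illustration, the following NumPy sketch evaluates Eq. (2) for one ray, assuming the per-sample display coordinates and intensities have already been obtained from the NDF; the variable names and the treatment of the last interval $\delta_P$ are our own choices.

```python
import numpy as np

def composite_display_coordinate(u_D, rho, s):
    """Alpha-composite per-sample display coordinates along a ray (Eq. (2)).

    u_D: (P, 2) display coordinates, rho: (P,) intensities,
    s:   (P,) sample depths ordered from near to far."""
    delta = np.diff(s, append=s[-1] + (s[-1] - s[-2]))   # delta_i = s_{i+1} - s_i
    alpha = 1.0 - np.exp(-rho * delta)                   # opacity of each sample
    # Transmittance tau_i = exp(-sum_{j<i} rho_j * delta_j)
    tau = np.exp(-np.concatenate([[0.0], np.cumsum(rho[:-1] * delta[:-1])]))
    weights = tau * alpha                                # (P,) per-sample weights
    return (weights[:, None] * u_D).sum(axis=0)          # (2,) composited u_bar_D
```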

2.3 Optimizing neural distortion field

By applying Eq. (2) to the entire field of view, the display coordinate system $D$ is mapped as a 2D manifold on the retinal image $E$ (Fig. 3(d)). To train NDF, we back-propagate the difference between the ground truth map obtained from several viewpoints and the map synthesized from Eq. (2).

Let ${\textbf{u}}_{D}^{*}({\textbf{r}})$ denote the ground truth of the display coordinates for each ray ${\textbf{r}}$. By definition, the number of rays ${\textbf{r}}$ in a single retinal image $E$ equals the number of pixels in $E$. In practice, we randomly sample a batch of rays ${\textbf{r}}$ from the pixels at each optimization iteration, then compute the total-squared loss $\mathcal {L}$:

$$\mathcal{L} = \sum_{{\textbf{r}}\in\mathcal{R}} \| {\bar{\textbf{u}}}_{D}({\textbf{r}}) - {\textbf{u}}_{D}^{*} ({\textbf{r}}) \|^{2}_{2}$$
where $\mathcal {R}$ denotes the set of randomly sampled rays.
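The sketch below shows the batched version of Eq. (3) used at each iteration. In actual training, ${\bar {\textbf{u}}}_{D}$ is recomputed for each sampled ray from the current network via Eq. (2); the pre-computed arrays and the batch size of 1024 here are placeholders for illustration.

```python
import numpy as np

def sample_ray_batch(u_bar_all, u_gt_all, batch_size, rng):
    """Draw a random batch of rays (one ray per pixel) for one training step."""
    idx = rng.choice(len(u_bar_all), size=batch_size, replace=False)
    return u_bar_all[idx], u_gt_all[idx]

def total_squared_loss(u_bar_batch, u_gt_batch):
    """Eq. (3): sum of squared L2 errors over the sampled ray batch."""
    return float(np.sum((u_bar_batch - u_gt_batch) ** 2))

# Toy usage with stand-in data (batch size 1024, as in Sec. 3).
rng = np.random.default_rng(0)
u_bar = rng.normal(size=(100000, 2))   # composited estimates, one per ray
u_gt = rng.normal(size=(100000, 2))    # ground-truth display coords per ray
loss = total_squared_loss(*sample_ray_batch(u_bar, u_gt, 1024, rng))
```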

The remainder of this subsection introduces improvements to more accurately simulate image distortion: positional encoding (Sec. 2.3.1) and deviation map learning (Sec. 2.3.2).

2.3.1 Positional encoding

Instead of training the NDF directly with Eq. (3), we introduce a technique called positional encoding, also used in the original NeRF, to help the neural network capture higher-order image distortions. We encode the input position ${\textbf{r}}(s_i) \in \mathbb {R}^{3}$ into a higher-dimensional vector $\gamma ({\textbf{r}}(s_i)) \in \mathbb {R}^{L}$. Positional encoding is generally expressed as a combination of trigonometric functions [23], similar to a Fourier basis:

$$\begin{aligned}{\textbf{P}} = \begin{bmatrix} 1 & 0 & 0 & 2 & 0 & 0 & & 2^{L-1} & 0 & 0 \\ 0 & 1 & 0 & 0 & 2 & 0 & \cdots & 0 & 2^{L-1} & 0 \\ 0 & 0 & 1 & 0 & 0 & 2 & & 0 & 0 & 2^{L-1} \\ \end{bmatrix}^{\mathrm{T}}, \quad \gamma(\textbf{x})=\begin{bmatrix} \sin(\textbf{P} \textbf{x}) \\ \cos(\textbf{P} \textbf{x}) \end{bmatrix}. \end{aligned}$$
By introducing the encoding, we redefine the MLP function as $F_{\Theta }:\mathbb {R}^{L+3}\rightarrow \mathbb {R}^{3}$ and the total-squared loss $\mathcal {L}$ as:
$$\mathcal{L} = \sum_{{\textbf{r}}\in\mathcal{R}} \| {\bar{\textbf{u}}}_{D}(\gamma(\textbf{r})) - {\textbf{u}}_{D}^{*} (\gamma(\textbf{r})) \|^{2}_{2}.$$

Note that, compared to the original NeRF, which targets natural images, NDF deals with distorted image coordinates that vary relatively smoothly in space. Thus, we are interested in how encoding up to high frequencies affects NDF. In later experiments (Sec. 5.5), we evaluate the accuracy for different values of $L$ in the positional encoding of NDF.
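A NumPy sketch of Eq. (4) is given below. Here `num_bands` plays the role of the number of frequency bands, so a 3D input maps to 6 × `num_bands` sin/cos features; the interleaved ordering of the terms is an implementation choice of this sketch.

```python
import numpy as np

def positional_encoding(x, num_bands):
    """Eq. (4): encode 3D positions with sin/cos at frequencies 2^0 .. 2^(L-1).

    x: (..., 3) positions. Returns (..., 6 * num_bands) encoded features."""
    freqs = 2.0 ** np.arange(num_bands)                 # [1, 2, 4, ..., 2^(L-1)]
    scaled = x[..., None, :] * freqs[:, None]           # (..., L, 3), i.e. P x
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1).reshape(
        *x.shape[:-1], -1)

# Example: one 3D point with 16 frequency bands -> 96-dimensional feature.
gamma = positional_encoding(np.array([0.1, -0.2, 0.5]), num_bands=16)
print(gamma.shape)   # (96,)
```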

2.3.2 Learning of deviation map from a reference viewpoint

In general, we normalize raw data to the range [0, 1] to promote the training of neural networks. In our NDF, the output of the neural network is an image coordinate. For example, for a Full HD display, the range of raw output values in the horizontal direction is [0, 1920]. In this case, we would have to multiply the output value of the neural network by approximately $2.0\times 10^{3}$, so that even very small rounding errors ($< 0.001$) in the network output translate into significant errors (up to about 2 pixels) in the final result. In the case of NeRF, even if the color changes slightly due to this scaling, there is no significant perceptual difference. In the case of NDF, however, this difference appears as a perceptually significant distortion.

To avoid this, we set a reference eye pose $\hat {\textbf{p}}$ near the center of the eyebox and use the measured display coordinates ${\hat {\textbf{u}}}_{D}$ at the reference viewpoint $\hat {\textbf{p}}$ as the reference map. Then, we train the neural net $F_{\Theta }$ using the deviation $\Delta {\textbf{u}}_{D}= {\textbf{u}}_{D}-{\hat {\textbf{u}}}_{D}$ instead of the raw ${\textbf{u}}_{D}$. In our training data set (Sec. 4), the range of $\Delta {\textbf{u}}_{D}$ is [-41.0, 39.5]. Thus, the scaling factor is 80.5, and we reduce the effect of rounding errors in the neural network to about 1/25 of the case using raw data.
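The following sketch illustrates this deviation-map normalization; the exact affine mapping (here to [0, 1]) is an assumption of the sketch, while the default range comes from the training set described above.

```python
import numpy as np

def to_deviation_target(u_D, u_D_ref, dev_min=-41.0, dev_max=39.5):
    """Convert raw display coordinates into normalized deviation targets.

    u_D, u_D_ref: (..., 2) measured display coords and the reference-viewpoint map.
    The default deviation range comes from the training set in Sec. 2.3.2."""
    delta = u_D - u_D_ref                                 # deviation Delta u_D
    return (delta - dev_min) / (dev_max - dev_min)        # roughly in [0, 1]

def from_deviation_output(y, u_D_ref, dev_min=-41.0, dev_max=39.5):
    """Invert the normalization to recover absolute display coordinates."""
    return y * (dev_max - dev_min) + dev_min + u_D_ref
```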

3. Implementation

To demonstrate our concept of implicit distortion map generation, we implemented NDF on top of the mip-NeRF framework [24], implemented in JAX [25]. mip-NeRF streamlines NeRF rendering by extending the NeRF query to be an expectation over a spatial region rather than a point, resulting in highly accurate and fast image reconstruction with fewer parameters. Note that we selected mip-NeRF for its ease of implementation, training speed, and accuracy. Hence, although performance could be improved by building on other NeRF frameworks, the underlying NDF concept (Sec. 2) remains unchanged.

Sampling Strategy of mip-NeRF. Instead of sampling individual points on a ray, mip-NeRF samples a conical frustum connecting the viewpoint position and the pixel area. As a result, mip-NeRF reduces unpleasant aliasing artifacts and improves the detail representation capability of NeRF. With this improvement, the cone frustum around ${\textbf{x}}$ is modeled as a multivariate Gaussian distribution, and the mean value $\mathbb {E}[\gamma (\textbf{x})]$ within the frustum is used as the integrated positional encoding.

Architecture. As the intensity network, we use an MLP with eight fully connected ReLU layers of 256 channels each. Then, we connect another MLP with four fully connected ReLU layers of 128 channels each as the coordinate network in the latter stage. This architecture uses the same configuration as NeRF [19], on which this work is based. The original NeRF and derivative studies have adopted the same network architecture for controlled experiments, and our paper follows this convention. In NeRF, increasing the number of layers and channels beyond this configuration does not yield significant improvements in accuracy. To accommodate NDF, we change the output layer of the mip-NeRF code base from the 3D color ${\textbf{c}}$ to the 2D coordinate $\Delta {\textbf{u}}_{D}$.

In the original NeRF, to reduce the influence of the view direction on the output intensity $\rho$, the direction $\mathbb {E}[{\textbf{d}}]$ is injected into the network only after $\rho$ has been extracted from the first stage. We adopt the same two-stage architecture for NDF, because we consider that the directivity of light emitted from a display does not change significantly with minute changes in angle.

In mip-NeRF, the activation functions used to generate the color ${\textbf{c}}$ (in NDF, the map $\Delta {\textbf{u}}_{D}$) and the intensity $\rho$ are sigmoid and softplus, respectively. Other activation functions are possible candidates. For the color (in NeRF) or coordinate (in NDF) output, mip-NeRF uses sigmoid to constrain the output ${\textbf{c}}$ to the [0, 1] floating-point RGB color space. Instead, we consider a piecewise-linear function such as ReLU more appropriate, because NDF outputs the coordinate value $\Delta {\textbf{u}}_{D}$, which is not confined to [0, 1]. Also, for the intensity output, the original NeRF uses softplus. We additionally consider sigmoid as a candidate, reasoning that a stochastic model may be preferable given that the light emitted from each point of the display is gradually attenuated from 100%. Based on these hypotheses, we evaluate the impact of different activation functions on accuracy in Sec. 5.5.

Training. We train NDF with Adam [26] using a batch size of 1024 and a learning rate $\{\eta _{i}\}$ that is annealed logarithmically from $\eta _{0}=5 \cdot 10^{-4}$ to $\eta _{n}=5 \cdot 10^{-6}$. We train all NDF models for up to $5.0 \times 10^{5}$ iterations, at which point the training error no longer decreases appreciably on a logarithmic scale. Later, in Sec. 5.5, we evaluate the relationship between the number of training iterations and accuracy in detail.
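For reference, a minimal sketch of a log-linearly annealed learning-rate schedule with the values above; the exact interpolation used in the mip-NeRF code base may differ slightly.

```python
import numpy as np

def log_annealed_lr(step, num_steps=int(5e5), lr_init=5e-4, lr_final=5e-6):
    """Learning rate annealed log-linearly from lr_init to lr_final."""
    t = np.clip(step / num_steps, 0.0, 1.0)
    return float(np.exp((1.0 - t) * np.log(lr_init) + t * np.log(lr_final)))

# log_annealed_lr(0) == 5e-4, log_annealed_lr(5e5) == 5e-6
```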

Our network takes about 4 hours to train on an NVIDIA RTX 3090 GPU and about 20 seconds to generate the whole distortion map. Map generation could be accelerated to real time with recent NeRF architectures, as discussed later in Sec. 6.

4. Data acquisition

To compare NDF with other mapping estimation methods, we acquired a dataset using a commercial wide-FoV AR-NED. Sec. 4.1 describes the hardware setup for capturing the NED from different viewpoints. Then, Sec. 4.2 describes the viewpoint camera locations and sampling intervals for training and testing. Finally, Sec. 4.3 describes how to obtain the correspondence between viewpoint image coordinates and display coordinates at each viewpoint.

4.1 Hardware setup

Figure 4(a) shows the hardware setup of our experiment. We use a Meta 2 (Meta Company, 90$^{\circ }$ FoV) as a wide-FoV AR-NED with curved beam combiners, a Dell U2718Q as a background display, and two Blackfly S Color 12.3 MP USB3 cameras as a viewpoint camera and a world camera, respectively. We mount the AR-NED and the world camera on a composite translation stage that moves along the x-, y-, and z-axes. We fix the positions of the viewpoint camera and the background display with 3D-printed jigs and move the OST-HMD relative to them. In other words, the viewpoint camera position is translated relative to the NED, and the world camera position is treated as the origin. To prevent the background display from reflecting room light, we cover the entire setup with black cloth.

Fig. 4. The hardware setup for the calibration. (Left) A back view of the calibration setup. (Right) Bird’s-eye view of the setup. The OST-HMD and the two cameras are rigidly mounted to each other and placed on an XYZ stage. The display is placed in front of the system and used as the calibration reference.

4.2 Viewpoint camera positions

Before obtaining the coordinate transformation map between the HMD and the viewpoint camera, we set the measuring viewpoint positions for both training and testing.

Figure 5 shows the viewpoint camera positions. For training, we acquire data from 125 viewpoints located at the vertices of the grid cells obtained by dividing the eyebox cube into $4^{3}$ cells, as shown in Fig. 5(a). Each grid cell is 3 mm on a side, so the entire eyebox cube is 12 mm on a side. We use these $5^{3} = 125$ datasets as training data. In the experiment, we also evaluated the accuracy of each method on datasets with wider spacing (i.e., fewer viewpoints). With a 6 mm spacing, the number of training viewpoints is $3^{3} = 27$; with a 12 mm spacing, we use only the $2^{3} = 8$ viewpoints at the corners of the eyebox as training data.
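The viewpoint lattice can be reproduced with the short sketch below; the coordinate origin (one corner of the eyebox) and the axis orientation are assumptions of this sketch.

```python
import numpy as np

def viewpoint_grid(spacing_mm=3.0, points_per_axis=5):
    """Vertices of a cubic lattice spanning the 12 mm eyebox (5 x 5 x 5 = 125)."""
    axis = np.arange(points_per_axis) * spacing_mm        # e.g. [0, 3, 6, 9, 12] mm
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

full = viewpoint_grid()                                     # 125 viewpoints, 3 mm pitch
mid = viewpoint_grid(spacing_mm=6.0, points_per_axis=3)     # 27 viewpoints, 6 mm pitch
corners = viewpoint_grid(spacing_mm=12.0, points_per_axis=2)  # 8 eyebox corners
```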

Fig. 5. Viewpoint camera positions for data acquisition. (a) Camera positions for training data. For the full training set (125 views), we place the viewpoint camera on each vertex of a cubic lattice with 3 mm pitch. To create training data with fewer viewpoints, we extract the 8- and 27-viewpoint subsets at the vertices indicated by red dots. (b) Camera positions for test data. We sample 12 points on each diagonal of the 12 mm eyebox cube, indicated by blue dots, for a total of 48 viewpoint positions as the test dataset. To avoid positional overlap with the training data, we divide the diagonal of each 3 mm grid cube into six equal parts and take the 1st, 3rd, and 5th points.

For testing, we acquire data from 48 viewpoints, as shown in Fig. 5(b). We sample 12 points on each diagonal of the 12 mm cube representing the entire eyebox and use these 48 points as test viewpoint positions. This test sampling scheme is based on [16] and is designed so that the test points cover the entire eyebox as uniformly as possible.

To detect the poses of the viewpoint camera, we display a 38.85 mm AR marker with a $4 \times 4$ binary pattern on the background display. Then, we obtain the viewpoint camera pose $\textbf{p}$ as a relative pose with the world camera as the origin. Since we adjust the translation stage manually, there is a slight discrepancy between the ideal viewpoint positions described above and the actual measured positions. However, this discrepancy does not affect training, because we train the MLP with the viewpoint poses obtained from the actual measurements.

4.3 Obtaining map at each viewpoint

At each viewpoint, we obtain the correspondence between the coordinate system of the viewpoint camera and the display coordinate system. Let $N$ be the number of training viewpoints and $\{ \textbf{p}^{*}_i = [{\textbf{v}}^{*}_i, {\textbf{t}}^{*}_i]\}_{i=1}^{N}$ be the set of eye poses at the training viewpoints. At each training viewpoint $\textbf{p}^{*}_i$, we obtain the ground truth of the mapping function from ${\textbf{u}}_{E}$ to ${\textbf{u}}_{D}$, denoted $f^{*}_i:\mathbb {R}^{2}\rightarrow \mathbb {R}^{2}$. This $f^{*}_i$ can also be regarded as $f$ given the eye pose $\textbf{p}^{*}_i$, i.e.,

$${\textbf{u}}_{D}=f^{*}_i({\textbf{u}}_{E})=f({\textbf{u}}_{E}, \textbf{p}^{*}_i).$$

To establish the set of maps at the training viewpoints $\{f_i^{*}\}_{i=1}^{N}$, we first display Gray-code pattern images and capture them with the viewpoint camera. Then, we obtain the discrete correspondence between viewpoint coordinates and display coordinates from the Gray-code images as a look-up table (LUT). At viewpoint $\textbf{p}^{*}_i$, $J_i$ denotes the number of pixel pairs for which a correspondence can be obtained, and $\{({\textbf{u}}_{E ij}, {\textbf{u}}_{D ij})\}_{j=1}^{J_i}$ denotes the LUT of coordinate pairs.
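A hedged sketch of Gray-code decoding into such a LUT is shown below. It assumes one binarized captured image per bit plane for each display axis, with the most significant bit first; thresholding, pattern ordering, and validity masking are simplifications of the actual capture pipeline.

```python
import numpy as np

def decode_gray_code(bit_planes):
    """Decode captured Gray-code bit planes into display coordinates per pixel.

    bit_planes: (B, H, W) boolean array, most significant bit first.
    Returns an (H, W) integer display coordinate along one axis."""
    bits = bit_planes.astype(np.uint32)
    binary = np.zeros_like(bits)
    binary[0] = bits[0]
    for b in range(1, bits.shape[0]):        # Gray -> binary: b_i = b_{i-1} XOR g_i
        binary[b] = binary[b - 1] ^ bits[b]
    weights = 2 ** np.arange(bits.shape[0])[::-1]         # MSB first
    return (binary * weights[:, None, None]).sum(axis=0)

def build_lut(col_planes, row_planes, valid_mask):
    """Pair each valid viewpoint pixel u_E with its decoded display pixel u_D."""
    u_D_x = decode_gray_code(col_planes)
    u_D_y = decode_gray_code(row_planes)
    ys, xs = np.nonzero(valid_mask)
    u_E = np.stack([xs, ys], axis=1)                      # viewpoint-image coords
    u_D = np.stack([u_D_x[ys, xs], u_D_y[ys, xs]], axis=1)
    return u_E, u_D
```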

Then we apply Gaussian kernel regression to interpolate this LUT as a continuous polynomial function $f^{*}_{i}$ [17]. We express $f^{*}_{i}$ as a Gaussian polynomial model:

$$f^{*}_{i}({\textbf{u}}_{E})= {\textbf{A}^{\mathrm{T}}}{\mathbf{\phi}}({\textbf{u}}_{E}) = \sum^{K}_{k=1} \boldsymbol{\alpha}_{k}^{\mathrm{T}} \phi_k({\textbf{u}}_{E}),\quad \phi_k({\textbf{u}}_{E}) = \exp\left(\frac{-({\textbf{u}}_{E}-\boldsymbol{\mu}_{k})^{\mathrm{T}}({\textbf{u}}_{E}-\boldsymbol{\mu}_{k})}{2\sigma^{2}}\right)$$
where $\boldsymbol {\phi }$ is the Gaussian radial basis vector, $K$ is the number of basis functions, $\sigma$ is the kernel width, $\{\boldsymbol {\mu }_k\}$ are the Gaussian kernel centers (randomly chosen from $\{{\textbf{u}}_{E ij}\}$), and ${\textbf{A}} = [\boldsymbol {\alpha }_{1}, \ldots, \boldsymbol {\alpha }_{k}, \ldots, \boldsymbol {\alpha }_{K}]^{\mathrm {T}}$ is a $K \times 2$ coefficient matrix. We then determine ${\textbf{A}}$ using the regularized least-square estimator:
$${\textbf{A}} = (\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} + \lambda {\textbf{I}}_{K} )^{{-}1}\boldsymbol{\Phi}^{\mathrm{T}}{\textbf{U}}_{D}$$
where $\boldsymbol {\Phi }$ is a $J_i \times K$ design matrix defined as $[\boldsymbol {\Phi }]_{jk}={\mathbf{\phi}}_k({\textbf{u}}_{E ij})$, $\lambda$ is the regularization parameter, ${\textbf{I}}_{K}$ is a $K \times K$ identity matrix, and ${\textbf{U}}_{D} = [{\textbf{u}}_{D i1}, \ldots, {\textbf{u}}_{D ij}, \ldots, {\textbf{u}}_{D iJ_i}]^{\mathrm {T}}$. We implement Eq. (8) using MATLAB R2022a. We repeat the above operations for all training viewpoints $\{\textbf{p}^{*}_{i}\}_{i=1}^{N}$ to obtain a set of ground truth maps $\{f_i^{*}\}_{i=1}^{N}$. Additionally, we obtain the ground truth maps for all test viewpoints for evaluation.
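The ground-truth fitting above is implemented in MATLAB R2022a; the following NumPy sketch of Eqs. (7)–(8) is only an equivalent illustration, with placeholder values for the number of centers, the kernel width, and the regularization weight.

```python
import numpy as np

def fit_gaussian_map(u_E, u_D, num_centers=200, sigma=50.0, lam=1e-3, seed=0):
    """Fit f*_i(u_E) = A^T phi(u_E) by regularized least squares (Eqs. (7)-(8)).

    u_E: (J, 2) viewpoint coords, u_D: (J, 2) display coords."""
    rng = np.random.default_rng(seed)
    centers = u_E[rng.choice(len(u_E), size=num_centers, replace=False)]   # mu_k
    def design(U):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared dists
        return np.exp(-d2 / (2.0 * sigma ** 2))                     # Phi, (J, K)
    Phi = design(u_E)
    # A = (Phi^T Phi + lam I_K)^{-1} Phi^T U_D, a K x 2 coefficient matrix
    A = np.linalg.solve(Phi.T @ Phi + lam * np.eye(num_centers), Phi.T @ u_D)
    return lambda U: design(np.atleast_2d(U)) @ A

# Usage: f_star = fit_gaussian_map(u_E_lut, u_D_lut); u_D_pred = f_star(u_E_query)
```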

5. Experiments

Using the acquired dataset, we compared the accuracy of NDF with other map interpolation methods (Sec. 5.1) and evaluated its performance. We applied each method to the dataset and generated maps at the test viewpoints. Then, we quantitatively evaluated the reprojection error with respect to the ground truth (Sec. 5.2). After that, we evaluated how well dynamic image distortion is reproduced across the FoV of the viewpoint image (Sec. 5.3) and across the spatial distribution of the test viewpoints (Sec. 5.4). Finally, we evaluated the differences in accuracy when changing the network configuration of NDF (Sec. 5.5).

5.1 Interpolation methods for comparison

Prior to the evaluation, we briefly describe the other interpolation methods used for comparison. The problem addressed in this paper is to interpolate the map $f:\mathbb {R}^{8}\rightarrow \mathbb {R}^{2}$ at a novel viewpoint $\textbf{p}$ from the ground truth maps at the training viewpoints $\{f^{*}_i\}_{i=1}^{N}$ (Eq. (6)). We implemented three interpolation methods in addition to NDF: (i) 3D reconstruction, (ii) linear interpolation, and (iii) Gaussian (non-linear) polynomial interpolation. Note that, except for the 3D reconstruction-based interpolation, we ignore the eye rotation ${\textbf{v}}$. In other words, we train $\hat {f}({\textbf{u}}_{E}, \textbf{t}):\mathbb {R}^{5}\rightarrow \mathbb {R}^{2}$ instead of the complete $f$.

(i) 3D Reconstruction of Virtual Display Surface. Assuming that each pixel ${\textbf{u}}_{D}$ on the display image forms a virtual display surface in space, we recover the 3D point of each pixel by triangulation and bundle adjustment [11]. Then we estimate the map $f$ by re-projecting the reconstructed 3D surface onto the image plane ${\textbf{u}}_{E}$ at the new viewpoint $\textbf{p}$. As discussed in Sec. 1, this model assumes each pixel as a point light source.

(ii) Linear Interpolation. We take the 8 training viewpoints on the grid cell containing the new viewpoint position $\textbf{t}$ (Fig. 5(a)). Then, we estimate $\hat {f}({\textbf{u}}_{E}, \textbf{t})$ by tri-linear interpolation of $f^{*}_i({\textbf{u}}_{E})$ at the 8 vertices of the grid cell; a sketch of this blend is given after the method descriptions.

(iii) Non-Linear Gaussian Polynomial Fitting. We learn $\hat {f}({\textbf{u}}_{E}, \textbf{t}):\mathbb {R}^{5}\rightarrow \mathbb {R}^{2}$ directly from the Gaussian polynomial model (Eq. (7)), using the ground truth maps $\{f^{*}_{i}({\textbf{u}}_{E}):\mathbb {R}^{2}\rightarrow \mathbb {R}^{2}\}_{i=1}^{N}$ and corresponding eye positions $\{{\textbf{t}}^{*}_i\in \mathbb {R}^{3}\}_{i=1}^{N}$.
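The following NumPy sketch illustrates the tri-linear blend used in method (ii), assuming the eight ground-truth maps have been evaluated on a common pixel grid; the uniform-cell assumption and the variable names are ours.

```python
import numpy as np

def trilinear_map(t, corner_positions, corner_maps):
    """Interpolate a distortion map at eye position t from 8 grid-cell corners.

    corner_positions: (8, 3) corner coordinates of the enclosing grid cell.
    corner_maps:      (8, H, W, 2) ground-truth maps f*_i on a shared pixel grid."""
    lo, hi = corner_positions.min(axis=0), corner_positions.max(axis=0)
    w = (t - lo) / np.where(hi > lo, hi - lo, 1.0)          # fractional position in cell
    out = np.zeros(corner_maps.shape[1:])
    for pos, f in zip(corner_positions, corner_maps):
        corner = (pos > lo + 0.5 * (hi - lo)).astype(float)   # 0/1 corner index per axis
        weight = np.prod(np.where(corner > 0, w, 1.0 - w))    # trilinear weight
        out += weight * f
    return out
```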

5.2 Reprojection error between distortion models

We trained maps using each method with $N=8, 27, 125$ training viewpoints and calculated the reprojection error of the output display coordinates of each pixel at the 48 test viewpoints. For the error calculation, we used only pixels within the area where the AR-NED is visible in each viewpoint image. In addition to the per-pixel error, we calculated the angular error from the viewpoint at each pixel. The NDFs in this subsection were trained with positional encoding dimension $L=16$ and with ReLU and SoftPlus as the activation functions for the output coordinate and intensity, respectively.

Figure 6 shows the pixel errors and median angular errors for different numbers of training viewpoints and interpolation methods. For $N=8$, the mean errors are (i) 18.86 pixels (31.96 arcmin), (ii) 15.60 pixels (24.86 arcmin), (iii) 3.64 pixels (5.99 arcmin), and (iv) NDF 3.23 pixels (5.79 arcmin). From these results, we confirm that NDF can recover maps with accuracy comparable to non-linear polynomial-based fitting using very few training viewpoints. Also, the poor accuracy of the (i) 3D reconstruction-based method supports the spatially distributed light-source model assumed by (iv) NDF as the better approximation of the optical model of the wide-FoV NED.

Fig. 6. The re-projection error of ${\textbf{u}}_{D}$ for different numbers of training viewpoints and interpolation methods. We plot the median error at each of the 48 test viewpoints. (a) Summary of median projection errors in pixels. (b) Summary of median angular errors.

From Fig. 6, while the accuracy of the nonlinear polynomial-based method does not change significantly as the number of training viewpoints increases, NDF improves: (iii) 2.63 pixels (5.38 arcmin) vs. (iv) 3.13 pixels (5.42 arcmin) for $N=27$, and (iii) 3.61 pixels (6.46 arcmin) vs. (iv) 2.17 pixels (4.25 arcmin) for $N=125$. Moreover, in all cases of $N=8, 27, 125$, (iii) Gaussian polynomial fitting has a larger error variance across the test viewpoints than (iv) NDF. From this result, we infer that the explicit polynomial-based optimization increasingly overfits the map at the center of the eyebox as more viewpoints are trained, whereas the implicit function representation of NDF optimizes uniformly over the eyebox. We quantitatively analyze this difference in error distribution across test viewpoint positions in Sec. 5.4.

5.3 Error distribution in the field of view

We analyzed the distribution of reprojection errors perceived in the FoV by comparing (iii) Gaussian polynomial fitting and (iv) NDF. Figure 7 shows the difference between the ground truth and the display coordinate system transformed with each estimated distortion map. From Fig. 7(b), the map estimated with Gaussian fitting is accurate vertically but has relatively large horizontal deviations. In contrast, the map estimated by NDF (Fig. 7(c)) shows a uniform fit in both the horizontal and vertical directions near the center of the FoV, although its error is larger than that of the Gaussian fitting at the periphery of the FoV.

Fig. 7. Qualitative comparison of the estimated display coordinates from (a) linear interpolation, (b) Gaussian polynomial interpolation, and (c) NDF, on the viewpoint image near the center of the eyebox. The green grids represent the ground truth, and the red grids are the display coordinates estimated by each method. The better the two coordinate systems match, the closer the grid color approaches yellow. Three regions of interest (blue rectangles) are enlarged on each image.

To evaluate the error distribution within the FoV, we calculated the pixel-wise average of the reprojection error over all test viewpoint images, as shown in Fig. 8. While the Gaussian fitting does not have a smooth error distribution, NDF has smaller errors from the center to the lower right of the FoV, and the pixels with the largest errors are concentrated only at the periphery of the FoV. We also confirmed that NDF has smaller errors for 55% of the FoV pixels. This result shows that NDF learns the distortion of the target AR-NED well.

Fig. 8. Error averaged over each pixel of the viewpoint camera. (a) Error from Gaussian fitting, (b) error from NDF, and (c) difference between the two. The red area indicates that Gaussian fitting is better, and the blue area indicates that NDF is better.

In NDF, the error tends to be larger at the periphery of the FoV. This is likely because NDF learns not only the map but also its intensity pixel-wise, i.e., the shape of the FoV. As a result, NDF cannot cope with abrupt changes of the intensity $\rho$ at the periphery, resulting in large errors in the weighted sum (Eq. (2)). This problem could be addressed by combining NDF with an explicit model, e.g., explicitly defining the display surface in space in advance and sampling the surrounding area with NDF. Such a combination of NDF and explicit models to improve accuracy is further discussed in Sec. 6.

5.4 Error distribution depending on viewpoint position

Next, we evaluated the accuracy of the reconstructed distortion map with respect to changes in viewpoint $\textbf{p}$. With Gaussian fitting and NDF trained on 8 viewpoints, we calculated the reprojection error of each pixel at the test viewpoint positions shown in Fig. 5(b). Figure 9(a) shows the median reprojection error over all pixels of the FoV at each test viewpoint position. To further clarify the difference between the two methods, Fig. 9(b) shows the distribution of the NDF reprojection error minus the Gaussian one, as in Fig. 8(c). From Fig. 9(b), NDF shows better results at viewpoints far from the center of the eyebox. This means that the Gaussian fitting overfits the training data near the center of the eyebox, while NDF reproduces the distribution of the distortion map uniformly across the entire eyebox. From this result, we confirmed that NDF reproduces the distortion map robustly even when the viewpoint position changes.

Fig. 9. Comparison of reprojection errors against the distribution of 48 test viewpoints in the eyebox using 8 training viewpoints. (a) Median reprojection errors in the reconstructed distortion map at each viewpoint position. The color of each circle indicates the method (red: Gaussian, blue: NDF), and the radius indicates the error at each viewpoint position. (b) Difference of (a) reprojection errors between NDF and Gaussian at each test viewpoint. The red plot shows the viewpoint position at which Gaussian fitting reconstructs the distortion map more accurately, and the blue plot shows the opposite. (c) Scatter plots of the results in (b) projected onto the (Left) x-y plane and (Right) x-z plane.

5.5 Error analysis of different network architecture

Finally, we evaluate the effects of different network parameters on accuracy. It is known that the number of layers and channels in the network has little effect on accuracy, while the number of dimensions of the positional encoding, the activation function, and the number of training steps significantly impact accuracy [23,27]. Thus, we varied these parameters, trained NDFs on the $N=8$ training dataset under each condition, and evaluated their accuracy.

5.5.1 Training step

We compared networks trained with different numbers of training steps, from $1.0\times 10^{5}$ to $5.0\times 10^{5}$. During this experiment, the encoding dimension was fixed to $L=16$, and the activation functions of the coordinate MLP and the intensity MLP were fixed to ReLU and SoftPlus, respectively.

Figure 10(a) shows the error of NDF for different numbers of training steps. As the number of steps increases in increments of $1.0\times 10^{5}$, the mean error changes as {3.61, 3.78, 3.75, 3.72, 3.23} pixels. The results show that the minimum error decreases as the number of training steps increases. However, the median error over all viewpoints hardly changes, confirming that the accuracy of NDF is already comparable to the Gaussian fitting at $1.0\times 10^{5}$ training steps. This result indicates that NDF represents the image distortion well in the early stages of training. One possible reason is that the image distortion reproduced by NDF is structurally simpler than the natural images targeted by NeRF.

Fig. 10. (a) The error of Gaussian interpolation and NDF with varying number of training steps from $1.0\times 10^{5}$ to $5.0\times 10^{5}$ using the 8 training viewpoints. (b) The errors of Gaussian interpolation for reference and NDFs for different combinations of positional encoding dimensions and output activation functions. Each label (i.)–(iv.) on the figure indicates, in order, the dimension of positional encoding, activation function for coordinates, and for intensity.

5.5.2 Input dimension of positional encoding

We compared networks trained with different encoding dimensions ($L=6, 16$), fixing the number of training steps at $5.0\times 10^{5}$.

Figures 10(b i.) and (b ii.) show the training results with the positional encoding dimension $L$ set to 16 and 6, respectively. The mean errors of (b i.) and (b ii.) are 3.23 pixels and 3.62 pixels, respectively. As discussed in Sec. 2.3.1, we had assumed that reducing the number of encoding dimensions would not change the accuracy of NDF. However, as with the original NeRF, increasing the number of encoding dimensions reduced the error. This can be attributed to the fact that the current NDF learns not only the distortion of the image but also the range of the FoV in which the display is visible, which introduces higher frequencies at the periphery of the NED.

5.5.3 Selection of activation functions

Finally, we trained the network with different combinations of activation functions for the coordinate MLP (ReLU or Sigmoid) for ${\textbf{u}}_{D}$ and the intensity MLP (SoftPlus or Sigmoid) for $\rho$.

Figures 10(b i.), (b iii.), and (b iv.) show the errors when varying the combination of output activation functions. The mean errors of (b i.), (b iii.), and (b iv.) are 3.23 pixels, 3.60 pixels, and 5.36 pixels, respectively. As expected in Sec. 3, the accuracy was greatly improved by using ReLU as the output activation function for ${\textbf{u}}_{D}$. In contrast, changing the activation function of $\rho$ from SoftPlus to Sigmoid did not significantly improve the accuracy. This result suggests that the virtual display surfaces of the wide-FoV NEDs targeted in this paper form multiple images through repeated multistage reflections and refractions.

6. Limitation and future work

Our experiments showed that NDF can synthesize novel-view distortion maps for a wide-FoV AR-NED with accuracy equal to or better than explicit polynomial fitting models. Our fully implicit, MLP-based approach is fundamentally different from existing approaches to distortion correction for NEDs. While the current NDF model is still rough around the edges, it offers many possibilities for improvement and future research. This section discusses these limitations and potential research directions.

Real Time Dynamic Distortion Correction. As mentioned in Sec. 1, dynamic distortion correction is one of the most important yet unsolved issues for NEDs. In particular, real time distortion correction in response to eye tracking is required to make dynamic adjustments that are imperceptible to the user. Our NDF is compatible with real time distortion map generation. Thanks to its neural network-based architecture, NDF can be GPU-accelerated. Moreover, since NDF has almost the same configuration as NeRF, acceleration methods proposed in the NeRF literature can be applied almost directly to NDF. For example, InstantNeRF [28] uses hash tables to adapt multi-resolution positional encoding to GPU computation, enabling the generation of $1920\times 1080$ pixel resolution images in tens of milliseconds. Since our goal in this paper is to verify the NDF concept, and InstantNeRF is implemented with customized CUDA kernels that are hard to modify, we have not yet implemented NDF on that architecture. However, in principle, NDF can run on such a real time framework.

Combination with Explicit Models. In this paper, we defined NDF as a fully implicit model that does not assume an a priori optical model. However, since NDF is essentially a ray-casting-based method, it can be extended to a hybrid model that combines NDF with conventional distortion correction methods that trace rays through explicitly defined optical models. In the field of NeRF, methods have been proposed that recover both the 3D shape and the viewpoint-dependent texture of an object with high accuracy by intensively sampling points close to the object surface [29,30]. In the same way, by intensively sampling points near the focal plane of a roughly defined NED optical design, NDF could improve the accuracy of the distortion map while fine-tuning to the actual optical properties of the NED.

Correction of Chromatic Aberration and Viewpoint-Dependent Blur. By extending the number of output dimensions, NDF could be used to calibrate various pixel-wise, viewpoint-dependent properties, such as chromatic aberration [31] and viewpoint-dependent blur [32]. Although increasing the number of output dimensions may make training harder to converge, it could also capture correlations among these properties, for example, an implicitly learned, viewpoint-dependent color-mixing matrix for chromatic aberration.

Applying NDF to Other Severely Distorted Optics. Although this paper only discusses the application of NDF to wide-FoV NEDs, NDF is expected to be applicable to other non-smooth and extremely distorted optical systems, just as NeRF can be applied to images with abrupt changes in adjacent pixel values. For example, NDF may be applied to aerial displays that form images using a special beam combiner [33], or to acquiring correspondences between 3D scene and image coordinates in dynamic projection mapping.

7. Conclusion

We proposed NDF, an MLP-based distortion map generation method for wide-FoV NEDs. NDF implicitly learns virtual display surfaces as viewpoint-dependent light-source distributions in space, a concept complementary to explicit, geometric-optics models. Experiments show that NDF can synthesize distortion maps with an error of about 5.8 arcmin using only 8 training viewpoints, which is competitive with non-linear polynomial fitting. We also confirmed that NDF produces maps with better accuracy around the center of the FoV and that its accuracy improves as the number of training viewpoints increases. NDF has the potential for higher accuracy through combination with explicit optical models and for real time distortion correction through GPU optimization. We hope that our new approach will facilitate subsequent research and contribute to the realization of an immersive virtual experience that combines a wide field of view with perfect spatial consistency.

Funding

Japan Society for the Promotion of Science (JP17H04692, JP20H04222, JP20H05958, JP22J01340); Fusion Oriented REsearch for disruptive Science and Technology (JPMJFR206E); Precursory Research for Embryonic Science and Technology (JPMJPR17J2).

Acknowledgments

The authors thank Daisuke Iwai and Takumi Kaminokado for engaging in valuable discussions.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. G. A. Koulieris, K. Aksit, M. Stengel, R. K. Mantiuk, K. Mania, and C. Richardt, “Near-eye display and tracking technologies for virtual and augmented reality,” Comput. Graph. Forum 38(2), 493–519 (2019). [CrossRef]  

2. X. Hu and H. Hua, “High-resolution optical see-through multi-focal-plane head-mounted display using freeform optics,” Opt. Express 22(11), 13896–13903 (2014). [CrossRef]  

3. K. Aksit, W. Lopes, J. Kim, P. Shirley, and D. Luebke, “Near-eye varifocal augmented reality display using see-through screens,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]  

4. D. Dunn, C. Tippets, K. Torell, P. Kellnhofer, K. Aksit, P. Didyk, K. Myszkowski, D. Luebke, and H. Fuchs, “Wide field of view varifocal near-eye display using see-through deformable membrane mirrors,” IEEE Trans. Visual. Comput. Graphics 23(4), 1322–1331 (2017). [CrossRef]  

5. Q. Guo, H. Tang, A. Schmitz, W. Zhang, Y. Lou, A. Fix, S. Lovegrove, and H. M. Strasdat, “Raycast calibration for augmented reality hmds with off-axis reflective combiners,” in 2020 IEEE International Conference on Computational Photography (ICCP) (2020), pp. 1–12.

6. C. Jang, K. Bang, S. Moon, J. Kim, S. Lee, and B. Lee, “Retinal 3d: augmented reality near-eye display via pupil-tracked light field projection on retina,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]  

7. J. Kim, Y. Jeong, M. Stengel, K. Akşit, R. Albert, B. Boudaoud, T. Greer, J. Kim, W. Lopes, Z. Majercik, P. Shirley, J. Spjut, M. McGuire, and D. Luebke, “Foveated ar: dynamically-foveated augmented reality display,” ACM Trans. Graph. 38(4), 1–15 (2019). [CrossRef]  

8. A. Maimone and J. Wang, “Holographic optics for thin and lightweight virtual reality,” ACM Trans. Graph. 39(4), 1–14 (2020). [CrossRef]  

9. O. Cakmakci, Y. Qin, P. Bosel, and G. Wetzstein, “Holographic pancake optics for thin and lightweight optical see-through augmented reality,” Opt. Express 29(22), 35206–35215 (2021). [CrossRef]  

10. Y. Geng, J. Gollier, B. Wheelwright, F. Peng, Y. Sulai, B. Lewis, N. Chan, W. S. T. Lam, A. Fix, D. Lanman, Y. Fu, A. Sohn, B. Bryars, N. Cardenas, Y. Yoon, and S. McEldowney, “Viewing optics for immersive near-eye displays: pupil swim/size and weight/stray light,” in Digital Optics for Immersive Displays, vol. 10676, B. C. Kress, W. Osten, and H. Stolle, eds. (SPIE, 2018), pp. 19–35.

11. M. Klemm, F. Seebacher, and H. Hoppe, “High accuracy pixel-wise spatial calibration of optical see-through glasses,” Comput. & Graph. 64, 51–61 (2017). [CrossRef]  

12. W. Robinett and J. P. Rolland, “A computational model for the stereoscopic optics of a head-mounted display,” Virtual Real. Syst. 1(1), 45–62 (1992). [CrossRef]  

13. J. P. Rolland and T. Hopkins, “A method of computational correction for optical distortion in head-mounted displays, Technical Report TR93-045,” University of North Carolina at Chapel Hill (1993).

14. M. B. Hullin, J. Hanika, and W. Heidrich, “Polynomial Optics: A construction kit for efficient ray-tracing of lens systems,” Comput. Graph. Forum 31(4), 1375–1383 (2012). [CrossRef]  

15. E. Schrade, J. Hanika, and C. Dachsbacher, “Sparse high-degree polynomials for wide-angle lenses,” Comput. Graph. Forum 35(4), 89–97 (2016). [CrossRef]  

16. J. Martschinke, J. Martschinke, M. Stamminger, and F. Bauer, “Gaze-dependent distortion correction for thick lenses in hmds,” in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR) (2019), pp. 1848–1851.

17. Y. Itoh and G. Klinker, “Light-field correction for spatial calibration of optical see-through head-mounted displays,” IEEE Trans. Visual. Comput. Graphics 21(4), 471–480 (2015). [CrossRef]  

18. P. Guan, O. Mercier, M. Shvartsman, and D. Lanman, “Perceptual requirements for eye-tracked distortion correction in vr,” ACM Trans. Graph. (preprint) (2022).

19. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European Conference on Computer Vision (ECCV) (2020).

20. Y. Peng, S. Choi, N. Padmanaban, and G. Wetzstein, “Neural Holography with Camera-in-the-loop Training,” ACM Trans. Graph. (SIGGRAPH Asia) (2020).

21. S. Choi, M. Gopakumar, Y. Peng, J. Kim, and G. Wetzstein, “Neural 3d holography: Learning accurate wave propagation models for 3d holographic virtual and augmented reality displays,” ACM Trans. Graph. (SIGGRAPH Asia) (2021).

22. W.-F. S. Yi-Ting Tu, “Ray optics simulation,” https://github.com/ricktu288/ray-optics (2022).

23. M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng, “Fourier features let networks learn high frequency functions in low dimensional domains,” Neural Information Processing Systems (NeurIPS) (2020).

24. J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,” in International Conference on Computer Vision (ICCV) (2021).

25. J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” (2018).

26. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR) (2015).

27. V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” in Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20) (2020).

28. T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph. 41(4), 1–15 (2022). [CrossRef]  

29. M. Oechsle, S. Peng, and A. Geiger, “Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction,” in International Conference on Computer Vision (ICCV) (2021).

30. L. Yariv, J. Gu, Y. Kasten, and Y. Lipman, “Volume rendering of neural implicit surfaces,” in Thirty-Fifth Conference on Neural Information Processing Systems (2021).

31. Y. Itoh, M. Dzitsiuk, T. Amano, and G. Klinker, “Semi-parametric color reproduction method for optical see-through head-mounted displays,” IEEE Trans. Visual. Comput. Graphics 21(11), 1269–1278 (2015). [CrossRef]  

32. Y. Itoh, T. Amano, D. Iwai, and G. Klinker, “Gaussian light field: estimation of viewpoint-dependent blur for optical see-through head-mounted displays,” IEEE Trans. Visual. Comput. Graphics 22(11), 2368–2376 (2016). [CrossRef]  

33. X. Luo, J. Lawrence, and S. M. Seitz, “Pepper’s cone: An inexpensive do-it-yourself 3D display,” in Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (Association for Computing Machinery, 2017), pp. 623–633.
