Automatic montaging of adaptive optics SLO retinal images based on graph theory

Open Access

Abstract

We present a fully automatic montage pipeline for adaptive optics SLO retinal images. It contains a flexible module to estimate the translation between pairwise images. The user can swap in the module best suited to aligning a given dataset, provided that it estimates the translation between image pairs and provides a quantitative confidence metric for the match between 0 and 1. We use these pairwise comparisons and associated metrics to construct a graph in which nodes represent frames and edges represent overlap relations. We use a small diameter spanning tree to determine the best pairwise alignment for each image based on the entire set of image relations. The final stage of the pipeline is a blending module that uses dynamic programming to improve the smoothness of the transitions between frames. Datasets ranging from 26 to 119 images were obtained from individuals aged 24 to 81 years, including visually normal control eyes and eyes with glaucoma or diabetes. The resulting automatically generated montages were qualitatively and quantitatively compared to results from semi-automated alignment. Datasets were specifically chosen to include both high quality and medium quality data. The results obtained from the automatic method are comparable to or better than results obtained by an experienced operator performing semi-automated montaging. For the plug-in pairwise alignment module, we tested a technique that utilizes SIFT + RANSAC, normalized cross-correlation (NCC), and a combination of the two. The pipeline produces consistent results not only on outer retinal layers, but also on inner retinal layers such as the nerve fiber layer or the vascular plexuses, even when images are not of excellent quality.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Over the past two decades, there has been remarkable progress in the development of high-resolution retinal imaging devices that utilize adaptive optics (AO). With real-time sensing and the correction of the eye’s aberrations, AO-based retinal imaging systems allow us to obtain cellular and subcellular resolution to measure retinal structure and function in vivo. By combining AO with numerous retinal imaging modalities like adaptive optics scanning laser ophthalmoscope (AOSLO) [1,2], adaptive optics optical coherence tomography (AO-OCT) [3-5], and adaptive optics flood-illumination cameras [6-9], it has become possible to obtain quantitative measures of cone photoreceptors [10-13], nerve fiber bundles [14-16], ganglion cells [17-19], RPE [20-23] and the retinal vasculature [24].

AO-based retinal imaging is, however, limited by a small field of view (FOV). In AO systems the FOV is typically only 1-2 degrees, a limitation imposed by the small size of the human eye’s isoplanatic angle [25]. The small FOV makes it difficult for researchers to investigate changes to the retina occurring over wider retinal regions, and can limit studies of pathological relationships arising from retinal disease. Therefore, it is common to apply a standard post-processing step that combines these small FOV images into montages covering larger retinal areas through an image stitching operation.

Many current montaging approaches are labor-intensive, requiring an experienced operator to align each of the averaged frames to any other frame that may have overlapping regions using software such as Adobe Photoshop (Adobe Creative Suite). Most AO labs use manual or semi-automated techniques to build these larger montages. There are obvious limitations to creating the montage manually. First, it is time-consuming. A typical procedure to create a single 10 × 10 degree montage from decent quality images can take hours. With lower quality images and in participants with poorer fixation, the required time can be doubled or even tripled. Second, even semi-automated montaging requires subjective decisions to optimize the alignment of overlapping regions, especially in the presence of locally distorted images.

Because of the time involved in manual montaging, there is increasing interest in automated montaging. Complex alignment programs such as i2kRetina (DualAlign, LLC, Clifton Park, NY) are used by some groups [13,26-28]. However, misalignments sometimes exist, and the assignment of a continuous region to several individual sub-montages occurs relatively often. One primary challenge for automatic montaging is to estimate the correct translation between images. Li et al. [29] presented an automated method incorporating the scale-invariant feature transform (SIFT) [30] to detect and describe salient key points in overlapping images and then used random sample consensus (RANSAC) to estimate the translation matrix. This approach was later extended [31] to apply to multiple imaging modalities. Chen et al. evaluated the approach using normalized cross-correlation (NCC) and normalized mutual information (NMI) metrics, and the SIFT/RANSAC method worked well on images of the photoreceptor layer. Davidson et al. used the Oriented FAST and Rotated BRIEF (ORB) descriptor [32] and locality sensitive hashing to perform fast montaging, achieving 1-2 minutes per montage of 250 frames [33]. In addition, Chen et al. performed automatic longitudinal alignment of AO images at the photoreceptor layer using a constellation method, which matches structural patterns in cone mosaics rather than simple intensity similarities [34].

These feature-dependent alignment approaches work well; however, several challenges remain for performing general purpose, high-quality automated montaging. First, the algorithm needs to determine which adjacent image pairs to use for translation estimation, especially when there are multiple overlapping images that could serve as the basis for alignment and the different pairings result in slightly different translation estimates. These multiple candidates occur because montages are typically created from images that densely tile the retina, with each region sampled in several different images. Second, alignment can be challenging for datasets acquired from inner retinal layers such as the nerve fiber layer (NFL) or the vascular plexuses, since they have fewer high spatial frequency features and can have a degree of self-similarity, both of which can lead to misalignment. For this reason, an algorithm that can incorporate matching results from more than one matching technique may be desirable. In addition, in some patients image quality is not as high as desired. For instance, in glaucoma patients dry eye is common, and image quality can change rapidly between blinks. This variable image quality poses challenges both for the identification of overlaps and for stitching aligned frames together to form a final image that uses the best quality data. This paper proposes a processing pipeline which can incorporate any of the alignment modalities proposed to date, while also addressing the problem of converging on a final set of spatial relations. The pipeline utilizes graph theory to determine locally optimal adjacent pairs and a dynamic programming algorithm to blend the overlapping regions for improved montage quality. We demonstrate the ability of this pipeline using AOSLO data.

2. Method

2.1 Indiana AOSLO system and the acquisition of the data

The Indiana AOSLO used in this study has been previously described [19,35]. Briefly, a supercontinuum laser (SuperK Extreme; NKT Photonics, Birkerød, Denmark) provided light for imaging and wavefront sensing. The two imaging beams were 820 nm (bandwidth 19 nm) and 775 nm (bandwidth 26 nm), and wavefront sensing was performed with a portion of the 775 nm light. The system allows steering of the imaging beam across the retina over almost 30 degrees without changing fixation location. The fixation channel is generated by a digital light projector (DLP) that allows positioning of the fixation target over a range of approximately 25 degrees. Thus, the nominal imaging position is controlled by the combined position of the steering mirrors and the fixation point, and both of these positions are stored in an acquisition database for each acquisition. The total light level was safe according to the American National Standards Institute standard for laser safety (ANSI Z136.1-2014). The system uses two deformable mirrors, a Mirao (Mirao 52-e; Imagine Eyes, Orsay, France) and a BMC (Multi-DM; Boston Micromachines Corporation, Cambridge, MA, USA), to correct the lower order and higher order wavefront error, respectively [36]. The horizontal (15.1 kHz) and vertical (28 Hz) scanners form a raster scan on the human retina. Most data were collected with a nominal 1 micron/pixel spacing, generating a scanned region of 1.8 × 1.9 degrees or 520 × 547 µm.

The research protocol and data presented in this study were approved by the Institutional Review Board at Indiana University Bloomington and followed the tenets of the Declaration of Helsinki. Data previously collected from ten control subjects and thirteen subjects with either diabetes or glaucoma were used for testing the method. The subject information is provided in Table 1.


Table 1. Description of Datasets used

The workflow for image processing (Fig. 1) used several steps to automatically generate the montage. First, images were averaged based on the retinal location being imaged (section 2.2). Next, we generated a potential connection matrix from the nominal retinal location of each image, based on the scanner and fixation positions stored by the instrumentation at the time of imaging (section 2.3). The connection matrix was then refined based on a calculated translation matrix between any potentially overlapping pairwise images, and a confidence level was assigned to each potential translation (section 2.4). From the weighted connection matrix, we computed a small diameter spanning tree (SDST, section 2.5) to determine the optimal global singly connected graph, as well as a blending order. Finally, we blended the montage (section 2.6) based on the local contrast of overlapping images. The details of the individual steps are elaborated in the following sections.


Fig. 1. The workflow of the automatic montage pipeline for a hypothetical example with five frames. Following the blue arrows, the pipeline starts with five frames from different retinal locations (color coded in five different colors) as described in section 2.2. We created a connection matrix (shown as a 5 × 5 matrix) which indicates possible matches based on the target retinal location determined from the fixation and scan position tables. We typically assume the eye is within several hundred microns of the intended position, and potential overlaps are marked by ones as described in section 2.3. We next refined the candidate matches, and possible displacements, as described in section 2.4, eliminating poor candidates (shown as red zeros in the matrix). We then used a small diameter spanning tree (SDST) to compute the optimal relations between candidate matches as described in section 2.5. Finally, we took the matches computed in section 2.5 and merged them to create a final blended montage as described in section 2.6.


2.2 Averaging frames

We created an average frame for each block of 100 continuous frames (approximately 3 seconds of video) collected at a single nominal location. The procedure for averaging frames has been described previously [37]. Briefly, a template image is automatically selected by examining the bulk motion of the retina over time using a cross-correlation approach. From temporal epochs within which there is minimal motion, a high-quality video frame is selected which has little motion relative to the previous or subsequent frame. Then a strip alignment approach is used to correct small eye movements [38]. From the resulting eye movement estimates within each frame, the median position across frames for all motion estimates is computed, a detailed set of motions is then calculated, and each pixel in the frame is placed according to the eye motion estimates. Frames with excess motion or blinks are rejected. The resulting stack of images is then averaged using “lucky” averaging [39]. For this work, averaged images were histogram stretched to mitigate differences in intensity between frames.

To test the algorithms, we chose data ranging from high quality to lower but acceptable quality. The resulting averages were of variable size, since eye movements could cause frames to shift retinal location within each 3-second video acquisition, and our implementation of the strip alignment algorithm accepted matches where there was not complete overlap between the individual frames. As a result, the final average image can have irregular edges arising from eye movements. These irregular margins of the averaged frames were automatically cropped to a rectangular image. To accomplish this, the algorithm defined a pixel as “valid” if its intensity value was higher than 50 (on a 0-255 scale) and computed a bounding box for regions with valid pixels. The threshold was based on experience with our imaging system and is therefore arbitrary, but it does help to avoid noisy frames where the intensity has dropped significantly, for instance due to drying of the cornea between blinks. The algorithm then runs a scanning line from each of the four edges of the bounding box towards its center, shrinking the box one line at a time, and stops when the ratio of the number of valid pixels to the total number of pixels on that line is higher than a predefined threshold (80%).
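To make the cropping rule concrete, the following Python sketch implements the procedure just described for an 8-bit grayscale averaged frame; the released code is MATLAB, so the function name and the handling of frames with no valid pixels are ours, while the thresholds (50 and 80%) are those quoted above.

```python
import numpy as np

def crop_to_valid_rect(avg_img, valid_thresh=50, line_fill_ratio=0.80):
    """Crop an averaged frame with irregular margins to a rectangular region.

    A pixel is "valid" if its intensity exceeds valid_thresh (0-255 scale).
    Starting from the bounding box of valid pixels, each edge is moved inward
    one row/column at a time until the fraction of valid pixels on that edge
    reaches line_fill_ratio.
    """
    valid = avg_img > valid_thresh
    rows = np.flatnonzero(valid.any(axis=1))
    cols = np.flatnonzero(valid.any(axis=0))
    if rows.size == 0 or cols.size == 0:
        return None                                    # no valid content in this frame
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]

    while top < bottom and valid[top, left:right + 1].mean() < line_fill_ratio:
        top += 1
    while bottom > top and valid[bottom, left:right + 1].mean() < line_fill_ratio:
        bottom -= 1
    while left < right and valid[top:bottom + 1, left].mean() < line_fill_ratio:
        left += 1
    while right > left and valid[top:bottom + 1, right].mean() < line_fill_ratio:
        right -= 1
    return avg_img[top:bottom + 1, left:right + 1]
```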

2.3 Initialization of the connection matrix

For small sets of images, it may be feasible to compare every frame to every other frame. However, as the number of frames increases, the number of possible matches increases quadratically; that is, it is an $O(n^2)$ problem. To reduce the runtime for large datasets, the algorithm utilizes the focus depth and the scan size that are stored in the acquisition database. These values are used as a first-order segregation, grouping images from the same session into different clusters so that images from the superficial and deep retina are separated, as are images with different scan sizes. While in principle the differences in scan sizes could be accounted for by a scaling stage, this was not included in the present pipeline. For each cluster, the algorithm then uses the scanner location and fixation location to determine whether any pairwise images have potentially overlapping regions. Since the actual image location deviates from the programmed location due to eye movements and fixation variability of the participant, we used a predetermined cutoff distance to identify frame pairs that are nominally close enough to form the potential connection matrix. This cutoff was typically 1.3 degrees for the data presented, but varied with image size. After applying these filtering criteria, and because the number of neighboring frames that are close enough to overlap a single frame is bounded (the average number of connections per frame in our datasets was 7.56), the computation becomes linear in the number of frames. It is worth noting that this step in the pipeline is not required; it simply speeds up the computation. For some systems (or datasets) where such information is not available, we treat the matrix as fully connected at the outset.
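As an illustration of this initialization step, a minimal Python sketch is shown below. It assumes the nominal center of each averaged frame is available in degrees (scanner plus fixation offset) and uses the 1.3 degree cutoff quoted above; clustering by focus depth and scan size is omitted for brevity, and the function name is ours rather than part of the released code.

```python
import numpy as np

def potential_connection_matrix(nominal_xy_deg, cutoff_deg=1.3):
    """Build the initial (symmetric, binary) connection matrix from the nominal
    retinal position of each averaged frame, in degrees.

    Two frames are flagged as potentially overlapping when their nominal centers
    are closer than cutoff_deg; the diagonal is left at zero.
    """
    pos = np.asarray(nominal_xy_deg, dtype=float)      # shape (n_frames, 2)
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    conn = (dists < cutoff_deg).astype(np.uint8)
    np.fill_diagonal(conn, 0)
    return conn
```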

2.4 Refining the connection matrix

The connection matrix gives a necessary, but not sufficient, condition for two frames to overlap. The automatic montage pipeline then “plugs in” a matching module to refine the connection matrix by removing pairings with a low probability of overlap. For the pairings that are retained in the connection matrix, this module is expected to estimate the relative position of the two frames (i.e., a translation estimate) and to assign a similarity score within [0, 1] expressing the confidence of that estimate. The higher the similarity score, the higher the confidence in the estimated spatial relation between the images. This calculation was applied to each pair of frames in the connection matrix from step 2.3.
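The plug-in contract can be summarized with a short sketch. The interface below uses our own names (it is not the API of the released MATLAB code) and captures the two requirements just stated: a matching module either rejects a candidate pair or returns a translation estimate plus a confidence score in [0, 1], and the pipeline applies it to every pair flagged in the connection matrix.

```python
from typing import Optional, Protocol, Tuple
import numpy as np

class PairwiseMatcher(Protocol):
    """Contract for a plug-in matching module (illustrative names only).

    Given two averaged frames, return None when no credible overlap is found,
    or (dy, dx, score): the translation of frame_b relative to frame_a in
    pixels, plus a confidence score in [0, 1].
    """
    def match(self, frame_a: np.ndarray,
              frame_b: np.ndarray) -> Optional[Tuple[float, float, float]]:
        ...

def refine_connections(frames, conn, matcher: PairwiseMatcher):
    """Apply the matcher to every candidate pair flagged in the connection matrix."""
    n = len(frames)
    scores = np.zeros((n, n))
    shifts = {}
    for i in range(n):
        for j in range(i + 1, n):
            if not conn[i, j]:
                continue
            result = matcher.match(frames[i], frames[j])
            if result is None:
                continue                               # pairing rejected by the module
            dy, dx, s = result
            scores[i, j] = scores[j, i] = s
            shifts[(i, j)] = (dy, dx)
    return scores, shifts
```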

In the current study, we tested three similarity metrics as candidates. The first is based on SIFT and RANSAC as used by others [29,31,40]. This method uses SIFT to detect and describe key points in frames and applies a matching algorithm on these key points to form a list of pairs between the two frames. Next, RANSAC processes these key point pairs and returns a consensus model of the transform from one frame to the other. In this study, we focus on a rigid transform with three degrees of freedom (i.e., the horizontal and vertical translation offsets and the rotation angle) between frames. Because the rotation between frames is typically small [41], the estimated rotation angle is used to exclude unlikely matches. If the estimated rotation angle is larger than 3 degrees, it is assumed there is no overlap between the two frames. When RANSAC finds a valid translation, its similarity score is assigned as $\min(100, n_{inlier})/100$, where $n_{inlier}$ is the number of inlier pairs.
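An OpenCV-based sketch of this first module is shown below, under stated assumptions: 8-bit grayscale frames, Lowe's ratio test for key point matching, and cv2.estimateAffinePartial2D for the RANSAC fit (a four-parameter similarity model standing in for the strictly rigid three-parameter model described above). The 3 degree rotation cutoff and the $\min(100, n_{inlier})/100$ score follow the text; the ratio-test threshold is a conventional default, not a value from the paper.

```python
import cv2
import numpy as np

def sift_ransac_match(frame_a, frame_b, max_rot_deg=3.0, reproj_thresh=1.5):
    """SIFT + RANSAC matcher sketch. Returns (dy, dx, score) or None, with the
    translation mapping frame_b coordinates into frame_a coordinates."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame_a, None)
    kp_b, des_b = sift.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return None

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_a, des_b, k=2)
    # Lowe ratio test to keep only distinctive key point pairs.
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 4:
        return None

    src = np.float32([kp_b[m.trainIdx].pt for m in good])   # points in frame_b
    dst = np.float32([kp_a[m.queryIdx].pt for m in good])   # matching points in frame_a
    M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                             ransacReprojThreshold=reproj_thresh)
    if M is None:
        return None
    rot_deg = np.degrees(np.arctan2(M[1, 0], M[0, 0]))
    if abs(rot_deg) > max_rot_deg:
        return None                                    # rotation too large, unlikely match
    n_inlier = int(inliers.sum())
    score = min(100, n_inlier) / 100.0
    return float(M[1, 2]), float(M[0, 2]), score       # (dy, dx, score)
```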

The second method we tested is based solely on generalized normalized cross-correlation (NCC) [42]. Given two images A and B of size ${n_A} \times {m_A}$ and ${n_B} \times {m_B}$ respectively, NCC computes a map of correlation coefficients (cmap) of size $(n_A + n_B - 1) \times (m_A + m_B - 1)$. Each location in the cmap corresponds to an estimated translation between A and B. Simply picking the location with the largest coefficient often leads to false positives, owing to the similarity of inner retinal linear structures such as the nerve fiber layer or blood vessels. Instead, we generate a list of candidate locations as the local maxima of an adjusted cmap (Acmap), which is produced by applying top-hat filtering to the raw cmap (Rcmap), and we then apply a series of filtering criteria (Table 2) to determine whether an exclusive “dominant peak” location exists. These thresholds were chosen based on experience with errors seen when examining the cmaps. In addition, we required the resulting overlap region of the image pair to cover at least 10% of either A or B for the translation estimate to be considered valid. The similarity score is assigned as the correlation coefficient at the selected peak location. A sketch of this peak-selection procedure follows Table 2.


Table 2. The four criteria required for a correlation coefficient (cc) map peak to represent a valid match between two frames. We apply top-hat filtering to the raw cmap (Rcmap) to obtain an adjusted cmap (Acmap) for certain filtering operations. Note that a “regional maximum” means that the coefficient at a location is larger than the coefficients of its 4 neighbors, and the “neighboring region” in the table refers to a 41 × 41 pixel region surrounding that location. See section 2.4 for other terms.
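The sketch below illustrates the NCC-based peak selection in Python. The padded correlation map from skimage.feature.match_template approximates the full $(n_A + n_B - 1) \times (m_A + m_B - 1)$ map, and the dominance test is a placeholder for the full set of criteria in Table 2; the thresholds shown are illustrative, not the values used in the paper, and the 10% minimum-overlap requirement is omitted.

```python
import numpy as np
from scipy import ndimage
from skimage.feature import match_template

def ncc_dominant_peak(frame_a, frame_b, tophat_size=41, min_cc=0.3, peak_margin=0.1):
    """NCC-based matcher sketch: accept a translation only when a single dominant
    correlation peak exists. Returns (dy, dx, score) or None, with the translation
    mapping frame_b onto frame_a and score equal to the correlation at the peak."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    # Padded NCC map (approximation of the full correlation coefficient map, Rcmap).
    rcmap = match_template(a, b, pad_input=True)
    # Top-hat filtering suppresses broad background correlation (Acmap).
    acmap = ndimage.white_tophat(rcmap, size=(tophat_size, tophat_size))
    # Regional maxima over the 4-neighborhood are the candidate translations.
    footprint = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=bool)
    is_peak = acmap == ndimage.maximum_filter(acmap, footprint=footprint)
    cand = np.argwhere(is_peak & (rcmap > min_cc))
    if cand.size == 0:
        return None
    vals = acmap[cand[:, 0], cand[:, 1]]
    order = np.argsort(vals)[::-1]
    best = cand[order[0]]
    # Require a "dominant" peak: the best must exceed the runner-up by a margin.
    if order.size > 1 and vals[order[0]] - vals[order[1]] < peak_margin:
        return None
    center = np.array(b.shape) // 2
    dy, dx = best - center        # shift of frame_b's center within frame_a coordinates
    return int(dy), int(dx), float(rcmap[best[0], best[1]])
```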

The third method combines the first two. We accepted an overlapping pair of frames only if both methods agreed that there was an overlap; this rule for the combined matching method produces fewer accepted matches than either of the other two. The translation estimate is taken from the NCC-based estimation, since it optimizes the correlation over the overlapping region. The combined similarity score is assigned as the reciprocal of the difference, in pixels, between the SIFT and NCC translation estimates; thus, when the two techniques agree closely, the confidence in the displacement estimate is high.
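Combining the two modules can then be sketched as below, reusing the two functions from the earlier sketches. Clamping the reciprocal score into (0, 1] when the two estimates coincide exactly is our choice; the paper specifies only the reciprocal of the offset difference.

```python
import numpy as np

def combined_match(frame_a, frame_b):
    """Combined matcher sketch: accept a pair only when SIFT+RANSAC and NCC both
    report an overlap, take the NCC translation, and score by how closely the
    two translation estimates agree."""
    sift_res = sift_ransac_match(frame_a, frame_b)
    ncc_res = ncc_dominant_peak(frame_a, frame_b)
    if sift_res is None or ncc_res is None:
        return None                                    # the two methods must agree
    diff = np.hypot(sift_res[0] - ncc_res[0], sift_res[1] - ncc_res[1])
    score = 1.0 / max(diff, 1.0)                       # clamp so score stays in (0, 1]
    return ncc_res[0], ncc_res[1], score
```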

2.5 Automatic montage based on small diameter spanning tree (SDST)

Since a frame may overlap with more than one neighboring frame, it may have multiple, conflicting estimated translations relative to the same reference frame via different alignment sequences. While these differences may be small, they can accumulate over long series of sequential alignments. To arrive at a good alignment sequence, we modeled the refined connection matrix as the adjacency matrix of a graph $G = (V, E)$, where $V = \{v_i\}_{i=1}^n$ is a set of nodes, each of which represents an averaged frame, and $E = \{(u_i, v_i)\}_{i=1}^m$ is a set of edges connecting overlapping frames. We assign a weight $w_{uv} = 1/s(u,v)$ to each edge $(u,v)$, where $s(u,v)$ is the similarity score of the estimate derived from the overlapping region as computed in section 2.4.

This graph captures the connections between frames with overlapping regions. Ideally, an entire session can be modeled as a graph containing a single connected component. However, because of eye movements as well as the rejection of some edges due to poor quality, as described above, we might end up with multiple disconnected components (groups), which in turn can be represented as graphs, or with individual frames without clear matches (i.e., singletons). The number of these discontinuities is discussed in Results.

Theoretically, for each connected component of the graph, we can form a montage by starting with any node as the anchoring frame and registering the rest of the frames by aligning them one after another along edges. Generally, only a subset of edges is needed for alignment, and this subset is not unique, because a given frame could have multiple base frames to which it can be aligned and only one of them is selected. Different approaches can be taken to arrive at the final set of alignments, but a common problem is that small errors in alignment can accumulate, causing the final mosaic to depend on the order in which images were matched. We envision the final selected edges as forming a spanning tree (T) of the original graph, where a spanning tree is a subgraph that connects every vertex but contains no loops. To arrive at the desired spanning tree, we consider the inverse of the similarity scores as “weights” on the edges, representing the “cost” of aligning two frames. In other words, higher weights indicate lower alignment quality or lower confidence in the estimated translation between two images. A minimum spanning tree (MST), which minimizes the total weight over the tree edges, is therefore a good potential candidate among all possible spanning trees. However, because each registration step can introduce alignment error, a tree with a long path and few branches (i.e., a large diameter tree) is less preferable, as alignment error accumulates with increasing path length. In fact, a long path might cause noticeable misalignment when two frames are adjacent in the montage but lie at either end of a long path. Therefore, we use a small diameter spanning tree (SDST) [43] rather than an MST to balance the overall weight and the diameter of the resulting tree. The anchoring frame is chosen as the node closest to the center of the tree, independent of its actual spatial coordinates, and we used a breadth-first search to traverse the SDST to obtain an alignment order.
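The graph construction and tree traversal can be sketched with networkx as follows, assuming the scores matrix produced by a matching module (e.g., refine_connections above). Because the SDST algorithm of Ref. [43] is not reproduced here, a shortest-path tree rooted at the weighted graph center is used as a simple low-diameter stand-in; the function and variable names are ours.

```python
import networkx as nx

def alignment_tree_and_order(scores, eps=1e-9):
    """For each connected component, return (anchor, spanning tree, BFS alignment order).

    Edges are weighted by 1/score, so low-confidence matches are expensive; the
    anchor is the node of minimum weighted eccentricity (the graph "center").
    """
    n = scores.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i, j] > 0:
                g.add_edge(i, j, weight=1.0 / (scores[i, j] + eps))

    plans = []
    for comp in nx.connected_components(g):
        sub = g.subgraph(comp)
        if sub.number_of_nodes() == 1:
            continue                                   # singleton frame, nothing to align
        lengths = dict(nx.all_pairs_dijkstra_path_length(sub, weight="weight"))
        anchor = min(lengths, key=lambda v: max(lengths[v].values()))
        # A shortest-path tree from the anchor keeps paths, and hence error
        # accumulation, short (a stand-in for the SDST of Ref. [43]).
        preds, _ = nx.dijkstra_predecessor_and_distance(sub, anchor, weight="weight")
        tree = nx.Graph((v, p[0]) for v, p in preds.items() if p)
        order = [anchor] + [v for _, v in nx.bfs_edges(tree, anchor)]
        plans.append((anchor, tree, order))
    return plans
```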

2.6 Blending frames for the final montage

Once the relative location of each frame has been determined for the final montage, we applied a montage blending procedure to minimize image discontinuities that can arise if frames are directly overlayed. This blending procedure is also based on a graphical approach and uses Dijkstra’s algorithm to find the shortest path on a “cost graph” corresponding to the overlapping region between two frames.

Figure 2(A)-(B) illustrates how montage blending works. For an overlapping region between frame A and frame B, we want to find a path between s and t, the two intersection points of the frame boundaries (Fig. 2(A)). The goal is to choose a line from s to t such that the discontinuity along the path is minimized. We define each entry d(i, j) of the intensity discontinuity map on the overlapping region as the absolute pixel intensity difference between the two frames (Fig. 2(B)). A cost graph is constructed from this map such that each pair of neighboring pixels corresponds to a node in the graph; the four nodes associated with the same pixel are connected, and the weight on these four edges is set to the intensity difference at that pixel. We then run Dijkstra’s algorithm to find the shortest path on this cost graph, illustrated as a red path in Fig. 2(B). Finally, the shortest path is mapped back to the discontinuity map to mark a boundary. We can then fill the overlapping region with pixels from frame A on one side of this boundary and with those of B on the other side. By stitching two frames together along this boundary, we create a larger frame which in turn is used as one of the inputs in the next iteration of the montage blending. We repeat these steps using the order of image registration determined in the previous section, expanding the montage until all frames have been combined into the final image.
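A simplified Python sketch of the seam search is given below. It computes the pixel-wise discontinuity map and runs Dijkstra's algorithm over a 4-connected pixel grid; placing s and t at opposite corners of the overlap, and the left-of-seam compositing rule, are simplifications of the boundary-node construction described above and in Fig. 2.

```python
import heapq
import numpy as np

def blend_seam(overlap_a, overlap_b):
    """Find a low-discontinuity seam through the overlap of two frames (sketch).

    The cost of entering a pixel is the absolute intensity difference between
    the two frames at that pixel; Dijkstra's algorithm finds the cheapest
    4-connected path from the top-left corner (s) to the bottom-right corner (t).
    Returns a boolean mask that is True where the blend should use overlap_a.
    """
    d = np.abs(overlap_a.astype(float) - overlap_b.astype(float))
    h, w = d.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    start, goal = (0, 0), (h - 1, w - 1)
    dist[start] = d[start]
    heap = [(d[start], start)]
    while heap:
        cost, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if cost > dist[i, j]:
            continue                                   # stale heap entry
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < h and 0 <= nj < w and cost + d[ni, nj] < dist[ni, nj]:
                dist[ni, nj] = cost + d[ni, nj]
                prev[(ni, nj)] = (i, j)
                heapq.heappush(heap, (dist[ni, nj], (ni, nj)))

    # Walk the seam back from t to s, then take frame A on one side of it.
    seam = np.zeros((h, w), dtype=bool)
    node = goal
    while True:
        seam[node] = True
        if node == start:
            break
        node = prev[node]
    take_a = np.cumsum(seam, axis=1) == 0              # pixels left of the seam come from A
    return take_a | seam
```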


Fig. 2. Procedure for frame blending. (A) The goal of blending is to find a path (from s to t) which represents the optimal boundary between frame 1 and frame 2. The merged image will then consist of pixels from frame 1 that lie to the left of and above line st, and pixels from frame 2 to the right of and below line st. (B) To find the line st, we treat the overlap region as a graph where the center of each pixel edge represents a node. The cost of using an edge between two nodes is defined in the text; Dijkstra’s algorithm then finds the least cost path across the overlap region, which is the line st. This is visualized in panel (B): (Left) the intensity discontinuity map corresponding to a 3 × 3 region, where each cell has an associated cost d(i, j); (Middle) the discontinuity map converted to a cost graph of the 3 × 3 map, where the graph is constructed from each pair of adjacent pixels with the vertex at the center of the pixel boundary, the source position at location s, and the terminal position at t. The red line indicates the resulting lowest cost path from source to terminal; (Right) the pixels crossed by the shortest path are boundary pixels (red); they are converted back to the discontinuity map and represent the boundary line. (C) Result of direct overlay of two frames showing a discontinuity at the border between images (red arrows); an overlap region is expanded within the yellow box. (D) Result of image blending on the same two frames.


2.7 Semi-automated montage technique

To compare the automated technique to human guided montaging, we used a semi-automated montage routine for each dataset as a “gold standard”. In this semi-automated routine, we used the same averaged frames and connection matrix as described in sections 2.2 and 2.3. At that point, however, the operator was provided with a tool to judge potential alignments individually.

For a given cluster of potentially overlapping images that could form a montage, the operator chooses a starting frame based on the number of potentially overlapping frames and the presence of potential alignment features (such as blood vessels). The program then proceeds through the connection matrix, presenting all possibly paired images to the operator in a graphical user interface (GUI) and asking the operator whether to accept or reject the potential alignment provided by a 2D NCC. The tool allowed frame pairs to be automatically alternated (blink comparison) or overlaid. If the operator believes there is an alignment that was not identified by the NCC, the operator can override the NCC and generate a manual alignment using a mouse. This manual alignment mode was often used for aligning images containing blood vessels. Once a pairing was accepted, the program moved to the next set of connected frames, until all frames were either connected or put into a disconnected group. Finally, the connected images and the corresponding connection matrix are saved, and a montage is written to Adobe Photoshop CS6 Extended (Adobe Systems, Inc., San Jose, CA, USA) with each frame in a different layer. The human operator could then individually adjust the order of the layers (individual image frames) and correct alignments by translating individual frames, with the goal of having the frames aligned and the higher quality images on “top”. The operator can also use other image adjustments (such as histogram stretching) to mitigate brightness discontinuities between frames and import frames that were not matched to produce an optimal final semi-automated montage. These montages are treated as a “gold standard” against which we compare the results of the automated montaging.

2.8 Calculation of performance metrics

Various metrics [31] have been proposed to measure the similarity of overlapping regions of a montage; however, we could not use NCC-based metrics because NCC is one of the methods used for alignment, which would bias the metric in favor of our technique. Instead, we used the average intensity difference (AID) as an independent metric of the dissimilarity across the overlapping regions. This metric calculates the average pixel-wise intensity difference between each pair of images that overlap. In particular,

$$AID(A,B) = \frac{1}{n}\sum\limits_x |A(x) - B(x)|,$$
where A and B are the intensities of the overlapping regions and n is the number of pixels in this region. The AID of the entire montage is then averaged across all overlapping pairs of frames, weighted by the size of the overlapping region. The AID of a montage can be directly interpreted as the intensity difference between images that are nominally acquired from the same location. However, due to changes within an imaging session as well as post-acquisition processing, the raw AID is not directly comparable between datasets. Hence, we normalized the raw AID of the montage obtained from our pipeline by dividing by the AID of a random montage formed by randomly pairing each frame with another frame from the same session and aligning the pair in a random manner. We call this normalized version the relative AID and use it to compare across datasets.
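For reference, the AID and relative AID computations can be sketched as follows; the helper names and the list-of-overlap-regions input format are ours, chosen only for illustration.

```python
import numpy as np

def aid(region_a, region_b):
    """Average intensity difference between the overlapping regions of two frames."""
    return np.abs(region_a.astype(float) - region_b.astype(float)).mean()

def montage_relative_aid(pair_regions, random_pair_regions):
    """Relative AID of a montage (sketch).

    pair_regions and random_pair_regions are lists of (region_a, region_b) tuples
    for the true overlaps and for randomly paired/aligned frames, respectively.
    Pairwise AIDs are averaged weighted by overlap size, and the montage AID is
    normalized by the random-montage baseline.
    """
    def weighted_aid(pairs):
        num = sum(aid(a, b) * a.size for a, b in pairs)
        den = sum(a.size for a, _ in pairs)
        return num / den
    return weighted_aid(pair_regions) / weighted_aid(random_pair_regions)
```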

A lower relative AID indicates higher quality alignment within a session. Alignment methods with high precision typically produce a low relative AID. On the other hand, a method optimizing precision could simply reject alignments with low confidence, which would result in multiple smaller montages or even non-montaged frames. Therefore, we also used the percentage of frames that were dropped from the montage and the percentage of frames in the two largest groups as additional performance metrics.

As a final metric, we used human judgments to determine the accuracy of algorithmically accepted pairs of images. Human raters were presented with every pair of matched images in the final montage in a custom GUI and could choose to overlay the two images or alternate them. The graders then indicated whether the alignment produced an accurate spatial relation between the frames. The percentage of “good” matches was used as a metric for the success of the algorithm.

3. Results

3.1 Datasets

We compiled 26 datasets from 23 individuals aged 24 to 81 years (see Table 1). Datasets were collected to include various experimental conditions, such as different planes of focus and disease states. Table 3 lists the health condition and imaged layer for each subject, as well as the number of image frames (excluding defocused ones) in each dataset. Overall, the montage pipeline was able to make montages from all datasets, and the montages generally were of high apparent quality (Fig. 3 shows two examples). However, some characteristic errors occurred. First, for photoreceptors, while the montages appear contiguous and smooth, there was a tendency for the merging algorithm to choose black pixels in regions with a low average luminance (regions within white boxes in Fig. 3(A)). These regions arose during the histogram equalization process and only occurred when the gray scale values of the images were near zero. There were also more errors when montaging images from the inner retina. These occurred most often in regions in and around the fovea, where there can be very few retinal features (Fig. 3(B), center-bottom) and the frames ended up being rejected. The other characteristic error tended to occur with inner retinal images containing relatively large blood vessels (Fig. 4). Here the NCC could make mistakes because of the high correlation that can occur between any two blood vessels with a similar angle in the image; Fig. 4(C) shows that two different vessels can have a relatively high correlation, whereas the image locations with the same vessel (Fig. 4(C), green box) produced a smaller correlation coefficient. This example represents a case where the SIFT and NCC approaches would not agree, and thus could represent a match failure for the combined algorithm, which requires agreement between all contributing algorithms.


Fig. 3. Automatic montage results for two relatively large datasets acquired at different retinal depths. (A) A photoreceptor montage of a healthy retina encompassing the foveal region out to approximately 7 degrees in the temporal retina. The montages of most photoreceptor datasets were excellent using any of the three alignment methodologies (SIFT, NCC or combined). The remaining errors arose in the blending stage, where dim regions of the retina, such as below vessel junctions, could appear black due to histogram equalization adjustments; two examples are shown in the white boxes. (B) A montage of images obtained when focused at the nerve fiber layer. Here the montage is extensive; however, the complete dataset included some images in the fovea, and due to the lack of detail a few frames were omitted (the dark notch indicated by the white arrow).



Fig. 4. Example of an alignment failure arising from using NCC when imaging regions with multiple blood vessels. (A) and (B) show the two frames to be aligned. (C) shows the cross-correlation map of these two frames. Each bright spot in the map is a local peak which corresponds to a potential alignment. The red boxed peak corresponds to a wrong alignment (D) that was chosen by the algorithm. The green boxed peak corresponds to the correct alignment (E), which was verified by manual comparison of the image pairs.



Table 3. Detailed list of 26 datasets from 23 subjects. For “layer” columns: S = superficial layer, P = photoreceptor layer

3.2 Performance comparison of automated and semi-automated montaging

All three matching modules produced average intensity differences (AID) similar to the semi-automated method (Fig. 5(A)). While the semi-automated method provided the lowest AID, indicating the best alignment, the differences were not statistically significant. The slightly better performance of the “combined” method came at the cost of slightly more discontinuities, as can be seen in Fig. 5(B). This is expected, since the combined method’s requirement that the two translation estimates agree increases the number of rejected potential matches.


Fig. 5. Box plots of the performance of the automatic montaging framework using three different alignment techniques compared with semi-automated montaging for all 26 datasets. (A) Relative average intensity difference normalized by the random matching baseline. (B) The ratio of the number of frames in the largest two groups to the total number of frames in the dataset. Plot A shows the alignment quality while plot B shows the discontinuity of these montaging frameworks.


Overall, the algorithm generated excellent montages of the photoreceptor layer for all matching modules, and across the combined data sets the human graders judged essentially all of the photoreceptor matches between pairs of frames to be acceptable (Fig. 6). There were more errors when using images from the more superficial layers of the retina; however, even for these data sets the median percentage of acceptable matches was more than 98% of the pairwise matches (Fig. 6). As expected from the discussion of Fig. 3, the matches for the superficial retinal montages that were judged to be incorrect often occurred due to spurious correlations in regions of either high nerve fiber layer reflectance or in the presence of blood vessels.


Fig. 6. Percentage of correct alignments among the matches accepted by the combined method for all datasets. The precision is the percentage of the algorithmically determined matches judged to be good matches by human observers, who were required to make a forced choice decision as to whether each frame pair was aligned. The left box plot shows that essentially all the photoreceptor layer pairings were acceptable, and the right box plot shows that the median for the superficially focused data was greater than 95%, although there was an outlier dataset where only about 79% of the matches were judged acceptable.


The accuracy measurement shown in Fig. 6 does not account for frames that were either excluded from any match or placed in smaller groupings of aligned frames. Because ending up with only one, or possibly two, large sub-montages from a dataset is an important goal of this process, we also analyzed the percentage of all frames from a dataset that were contained in the two largest groupings (Fig. 7(A)). While all of the accepted matches for the photoreceptor images were “good”, there were still some frames that could not be placed into the final montage, although in general more than 95% of all frames were accurately placed in the two largest groups. This percentage drops for the superficial retinal layers, where the median percentage of frames is still more than 90%, but there is a longer tail where datasets with lower contrast or more low-quality images dropped the percentage of included frames to between 60 and 75%.


Fig. 7. (A) Boxplot of the ratio of the number of frames in the largest two groups to the total number of frames in the dataset. For photoreceptor imaging, more than 75% of the datasets had more than 92% of their frames in the top two groups. For the superficial layers this dropped to about 86%, with a long tail. (B) Average number of inliers found by SIFT + RANSAC for photoreceptor and superficial layer images. The number of inliers depends on both the image quality and the degree of overlap of the frames, which in our datasets tended to be about 40% but could be as low as 10%. Some of the failures and disconnected groups for the superficial imaging arise from both the lower number of inliers found for the superficial images and the errors in the NCC metric for the same images, as discussed in Fig. 3.


Semi-automated montaging of a 60-frame dataset with average image quality takes between 30 minutes and a few hours, while the same task can be completed within 3 minutes by the automatic montage framework using the “combined” method, implemented in MATLAB as shown in Code 1 (Ref. [44]) and run on a MacBook Pro laptop with an M1 Pro chip (3.2 GHz).

4. Conclusion and discussion

This paper proposes a framework for generating montages of AOSLO images based on graph theory and presents results from a range of retinal images. We believe the approach is generic in the sense of being applicable to other forms of imaging where small views need to be stitched together to make a larger view, although testing on other imaging modalities has yet to be done and the current results pertain solely to AOSLO images.

The proposed framework is generic in that one can flexibly plug in the translation estimation module that works best for a given dataset. We have demonstrated the use of three different modules. In our case the combined module outperforms the modules based on either SIFT + RANSAC or NCC alone, in the sense that it generated well connected graphs where almost all of the final matches were good and the number of false positives was low [31]. It also produced more disconnected groups and singletons, because the combined method requires agreement between the NCC and SIFT + RANSAC estimates and therefore filters out less confident estimations. Overall, all three modules performed well, and the framework appears able to produce useful montages from any suitable matching technique. This modularity makes it straightforward to add improved or alternative approaches such as ORB. The use of the SDST produces a compact matching set, which helps when there is a long sequence of possible matches, since building up the montage sequentially based solely on nearest neighbor results can allow errors to accumulate. We have also demonstrated qualitatively and quantitatively that the montages produced by this automatic pipeline are comparable to or better than those semi-automatically generated by an experienced operator. That is, while we still did not produce a perfect complete montage in one step, we could perform as well as a trained operator and in much less time.

We included the option of generating displacement estimates from multiple algorithms because we wanted the technique to perform well under multiple imaging conditions. While SIFT + RANSAC has been shown to work well for several imaging conditions, it depends on the key point detector (e.g., SIFT) finding as many key points as possible so that the matching algorithm can generate a sufficient number of inliers for RANSAC to reach a reliable translation estimate. Chen et al. [31] reported that once there are more than 100 inlier matches, the marginal gain from additional matches is negligible. While we used a stricter criterion for acceptance of inliers (1.5 pixel displacement vs 6 pixel displacement), both criteria generate fewer inliers for the inner retinal layers than for the photoreceptors (Fig. 7(B)). Thus, we wanted to include other matching methods, and in fact being able to include other means for estimating displacements seemed desirable; as shown, the combined algorithm represents a useful balance between the number of final groups and the accuracy of the final comparisons.

We believe that envisioning the formation of the montage as a modular pipeline will allow it to be more readily adapted to improved algorithms at any stage of the pipeline. In fact, we have identified several areas for future improvement. For instance, the current pipeline assumes that the transform between averaged frames consists only of translation, and it discards a translation estimate if the estimated rotation angle is larger than a predetermined cutoff, as this often indicates an error. This works well for our approach, since we typically move the scanning beam, not the eye. However, in many applications there may be rotations, and the framework can be readily expanded to allow rotation as long as a similarity score can still be calculated. Another limitation of the proposed pipeline is that the resulting montages are susceptible to placement errors that do not result in multiple final groups, as seen in the high value of the NCC for the incorrect match shown in Fig. 4. While we created several ad hoc rules to filter candidate pairings based on the correlation coefficient map in the NCC-based module to increase its accuracy when it claims an overlap, and similar criteria limit the acceptability of SIFT based matches, these criteria may not be optimal. When verifying matches, machine learning methods could be introduced to determine matches in a data-driven manner, generating improved filtering of incorrect matches. Similarly, we currently use the SDST approach to create a montage with the least likelihood of drift from cumulative positioning errors. The SDST approach could, however, be replaced by an approach that evaluates all possible paths in the graph and then determines a consensus graph to generate the final location estimates. This may help because, when we did find errors, there was usually one bad match, after which the subsequent matches of that branch of the graph were “good” but their final locations were linked to the misplaced edge; as a result, the entire branch might overlay other images and be discordant.

We have also found that, when using the output of this technique, we often manually “connect” the separate groups using Photoshop and then automatically extract the corrected positions from Photoshop and apply them to other simultaneously collected imaging modalities; for instance, the operator corrects the confocal images and that correction is then applied to the multiply scattered light images. Automating this step of matching seemingly disconnected groups based on another imaging modality, such as non-AO images, would be useful. In addition, the quality of the averaged frames plays an essential part in whether the automatic montage procedure will be successful. Many of the frames that cannot be aligned to the montage may be attributed to a low signal-to-noise ratio. These occur more often in older subjects, where dry eye becomes more of a problem, producing lower quality images if they fixate for too long without blinking. The pipeline could be expanded by adding an upstream filtering module to detect low-quality frames and either prevent them from entering the downstream montage process or go back to the original choice of the template frame for that portion of the montage and recompute the average image. While the current implementation allows recomputing averages, we have not yet implemented a feedback approach to provide an improved manner of template choice, and most of our datasets do not have the dual imaging required to produce template independent imaging [45]. Finally, the current blending method is based on the relative intensity between images. For confocal images this is a good approach, since brightness tends to co-vary with image contrast. This is not always true for multiply scattered light images; however, the same blending method could still be used, based instead on local contrast or entropy, to include the ‘best’ images.

In summary, we have presented a modular approach to generating montages from high magnification adaptive optics image sets. The approach can incorporate known information on the intended position of individual images, if available, and allows the user to flexibly choose an algorithm for estimating the displacement between frames.

Funding

National Eye Institute (EY024315).

Acknowledgements

The authors wish to thank Dr. Ann E. Elsner for helpful discussion.

Disclosures

The authors have nothing to disclose related to this work.

Data Availability

The full set of montage datasets is not available from this work; however, an example dataset as well as a complete software implementation are publicly available as MATLAB functions [44].

References

1. A. Roorda, F. Romero-Borja, W. Donnelly III, et al., “Adaptive optics scanning laser ophthalmoscopy,” Opt. Express 10(9), 405–412 (2002). [CrossRef]  

2. S. A. Burns, R. Tumbar, A. E. Elsner, et al., “Large-field-of-view, modular, stabilized, adaptive-optics-based scanning laser ophthalmoscope,” J. Opt. Soc. Am. A 24(5), 1313–1326 (2007). [CrossRef]  

3. B. Hermann, E. J. Fernandez, A. Unterhuber, et al., “Adaptive-optics ultrahigh-resolution optical coherence tomography,” Opt. Lett. 29(18), 2142–2144 (2004). [CrossRef]  

4. R. J. Zawadzki, S. M. Jones, S. S. Olivier, et al., “Adaptive-optics optical coherence tomography for high-resolution and high-speed 3D retinal in vivo imaging,” Opt. Express 13(21), 8532–8546 (2005). [CrossRef]  

5. Y. Zhang, J. Rha, R. Jonnal, et al., “Adaptive optics parallel spectral domain optical coherence tomography for imaging the living retina,” Opt. Express 13(12), 4792–4811 (2005). [CrossRef]  

6. D. T. Miller, D. R. Williams, G. M. Morris, et al., “Images of cone photoreceptors in the living human eye,” Vision Res 36(8), 1067–1079 (1996). [CrossRef]  

7. J. Liang, D. R. Williams, and D. T. Miller, “Supernormal vision and high-resolution retinal imaging through adaptive optics,” J. Opt. Soc. Am. A 14(11), 2884–2892 (1997). [CrossRef]  

8. A. Roorda and D. R. Williams, “The arrangement of the three cone classes in the living human eye,” Nature 397(6719), 520–522 (1999). [CrossRef]  

9. J. Rha, R. S. Jonnal, K. E. Thorn, et al., “Adaptive optics flood-illumination camera for high speed retinal imaging,” Opt. Express 14(10), 4552–4569 (2006). [CrossRef]  

10. T. Y. Chui, H. Song, and S. A. Burns, “Individual variations in human cone photoreceptor packing density: variations with refractive error,” Invest. Ophthalmol. Vis. Sci. 49(10), 4679–4687 (2008). [CrossRef]  

11. R. J. Zawadzki, B. Cense, Y. Zhang, et al., “Ultrahigh-resolution optical coherence tomography with monochromatic and chromatic aberration correction,” Opt. Express 16(11), 8126–8143 (2008). [CrossRef]  

12. H. Song, T. Y. Chui, Z. Zhong, et al., “Variation of cone photoreceptor packing density with retinal eccentricity and age,” Invest. Ophthalmol. Vis. Sci. 52(10), 7376–7384 (2011). [CrossRef]  

13. L. Sawides, A. de Castro, and S. A. Burns, “The organization of the cone photoreceptor mosaic measured in the living human retina,” Vision Res 132, 34–44 (2017). [CrossRef]  

14. O. P. Kocaoglu, B. Cense, R. S. Jonnal, et al., “Imaging retinal nerve fiber bundles using optical coherence tomography with adaptive optics,” Vision Res 51(16), 1835–1844 (2011). [CrossRef]  

15. K. Takayama, S. Ooto, M. Hangai, et al., “High-resolution imaging of the retinal nerve fiber layer in normal eyes using adaptive optics scanning laser ophthalmoscopy,” PLoS One 7(3), e33158 (2012). [CrossRef]  

16. G. Huang, T. J. Gast, and S. A. Burns, “In vivo adaptive optics imaging of the temporal raphe and its relationship to the optic disc and fovea in the human retina,” Invest. Ophthalmol. Vis. Sci. 55(9), 5952–5961 (2014). [CrossRef]  

17. Z. Liu, K. Kurokawa, F. Zhang, et al., “Imaging and quantifying ganglion cells and other transparent neurons in the living human retina,” Proc. Natl. Acad. Sci. U.S.A. 114(48), 12803–12808 (2017). [CrossRef]  

18. E. A. Rossi, C. E. Granger, R. Sharma, et al., “Imaging individual neurons in the retinal ganglion cell layer of the living eye,” Proc. Natl. Acad. Sci. U.S.A. 114(3), 586–591 (2017). [CrossRef]  

19. K. A. Sapoznik, T. Luo, A. de Castro, et al., “Enhanced retinal vasculature imaging with a rapidly configurable aperture,” Biomed. Opt. Express 9(3), 1323–1333 (2018). [CrossRef]  

20. A. Roorda, Y. Zhang, and J. L. Duncan, “High-resolution in vivo imaging of the RPE mosaic in eyes with retinal disease,” Invest. Ophthalmol. Vis. Sci. 48(5), 2297–2303 (2007). [CrossRef]  

21. E. A. Rossi, P. Rangel-Fonseca, K. Parkins, et al., “In vivo imaging of retinal pigment epithelium cells in age related macular degeneration,” Biomed. Opt. Express 4(11), 2527–2539 (2013). [CrossRef]  

22. J. I. Morgan, A. Dubra, R. Wolfe, et al., “In vivo autofluorescence imaging of the human and macaque retinal pigment epithelial cell mosaic,” Invest. Ophthalmol. Vis. Sci. 50(3), 1350–1359 (2009). [CrossRef]  

23. Z. Liu, O. P. Kocaoglu, and D. T. Miller, “3D Imaging of Retinal Pigment Epithelial Cells in the Living Human Retina,” Invest. Ophthalmol. Vis. Sci. 57(9), OCT533 (2016). [CrossRef]  

24. Z. Zhong, B. L. Petrig, X. Qi, et al., “In vivo measurement of erythrocyte velocity and retinal blood flow using adaptive optics scanning laser ophthalmoscopy,” Opt. Express 16(17), 12746–12756 (2008). [CrossRef]  

25. P. Bedggood, M. Daaboul, R. Ashman, et al., “Characteristics of the human isoplanatic patch and implications for adaptive optics retinal imaging,” J. Biomed. Opt. 13(2), 024008 (2008). [CrossRef]  

26. P. Godara, C. Siebe, J. Rha, et al., “Assessing the photoreceptor mosaic over drusen using adaptive optics and SD-OCT,” Ophthalmic Surg Lasers Imaging 41(S1), S104–108 (2010). [CrossRef]  

27. R. F. Cooper, A. M. Dubis, A. Pavaskar, et al., “Spatial and temporal variation of rod photoreceptor reflectance in the human retina,” Biomed. Opt. Express 2(9), 2577–2589 (2011). [CrossRef]  

28. K. Gocho, V. Sarda, S. Falah, et al., “Adaptive optics imaging of geographic atrophy,” Invest Ophthalmol Vis Sci 54(5), 3673–3680 (2013). [CrossRef]  

29. H. Li, J. Lu, G. Shi, et al., “Automatic montage of retinal images in adaptive optics confocal scanning laser ophthalmoscope,” Opt. Eng. 51(5), 057005 (2012). [CrossRef]  

30. D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision 60(2), 91–110 (2004). [CrossRef]  

31. M. Chen, R. F. Cooper, G. K. Han, et al., “Multi-modal automatic montaging of adaptive optics retinal images,” Biomed. Opt. Express 7(12), 4899–4918 (2016). [CrossRef]  

32. E. Rublee, V. Rabaud, K. Konolige, et al., “ORB: An efficient alternative to SIFT or SURF,” in 2011 International Conference on Computer Vision, (2011), 2564–2571.

33. B. Davidson, A. Kalitzeos, J. Carroll, et al., “Fast adaptive optics scanning light ophthalmoscope retinal montaging,” Biomed. Opt. Express 9(9), 4317–4328 (2018). [CrossRef]  

34. M. Chen, R. F. Cooper, J. C. Gee, et al., “Automatic longitudinal montaging of adaptive optics retinal images using constellation matching,” Biomed. Opt. Express 10(12), 6476–6496 (2019). [CrossRef]  

35. R. D. Ferguson, Z. Zhong, D. X. Hammer, et al., “Adaptive optics scanning laser ophthalmoscope with integrated wide-field retinal imaging and tracking,” J. Opt. Soc. Am. A 27(11), A265–277 (2010). [CrossRef]  

36. W. Zou, X. Qi, and S. A. Burns, “Wavefront-aberration sorting and correction for a dual-deformable-mirror adaptive-optics system,” Opt. Lett. 33(22), 2602–2604 (2008). [CrossRef]  

37. T. Y. Chui, M. Dubow, A. Pinhas, et al., “Comparison of adaptive optics scanning light ophthalmoscopic fluorescein angiography and offset pinhole imaging,” Biomed. Opt. Express 5(4), 1173–1189 (2014). [CrossRef]  

38. S. B. Stevenson and A. Roorda, “Correcting for miniature eye movements in high resolution scanning laser ophthalmoscopy,” Proceedings of the SPIE 5688, 145–151 (2005).

39. G. Huang, Z. Zhong, W. Zou, et al., “Lucky averaging: quality improvement of adaptive optics scanning laser ophthalmoscope images,” Opt. Lett. 36(19), 3786–3788 (2011). [CrossRef]  

40. A. E. Salmon, R. F. Cooper, M. Chen, et al., “Automated image processing pipeline for adaptive optics scanning light ophthalmoscopy,” Biomed. Opt. Express 12(6), 3142–3168 (2021). [CrossRef]  

41. S. Sadeghpour and J. Otero-Millan, “Torsional component of microsaccades during fixation and quick phases during optokinetic stimulation,” J Eye Mov Res 13(5), 2 (2020). [CrossRef]  

42. J. P. Lewis, “Fast normalized cross-correlation,” Vision Interface 95, 1995 (1995).

43. M. Kano and H. Matsumura, “Spanning trees with small diameters,” AKCE International Journal of Graphs and Combinatorics (2019).

44. Luo, “Montaging Approach for retinal images based on Graph Theory,” figshare (2024). https://doi.org/10.6084/m9.figshare.24999914.

45. T. Luo, R. L. Warner, K. A. Sapoznik, et al., “Template free eye motion correction for scanning systems,” Opt. Lett. 46(4), 753–756 (2021). [CrossRef]  

Supplementary Material (1)

Code 1: Montaging Approach for retinal images based on Graph Theory [44].

Data Availability

The total montage datasets are not available from this work, however an example dataset as well as a complete software implementation are publicly available as Matlab functions [44].




Figures (7)

Fig. 1. The workflow of the automatic montage pipeline for a hypothetical example with five frames. Following the blue arrows, the pipeline starts with five frames from different retinal locations (color coded in five different colors) as described in section 2.2. We created a connection matrix (shown as a 5 × 5 matrix) that indicates possible matches based on the target retinal location determined from the fixation and scan position tables. We typically assume the eye is within several hundred microns of the intended position, and thus potential overlaps are shown by ones, as described in section 2.3. We next refined the candidate matches, and their possible displacements, as described in section 2.4, eliminating poor candidates (shown as red zeros in the matrix). We then used a small diameter spanning tree (SDST) to compute the optimal relations among the candidate matches as described in section 2.5. Finally, we took the matches computed in section 2.5 and merged them to create a final blended montage as described in section 2.6.
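As an informal illustration of the connection-matrix and spanning-tree stages summarized in Fig. 1 (not the published MATLAB implementation [44]), the Python sketch below builds candidate overlaps from nominal scan positions and then selects pairwise alignments from a low-diameter spanning tree obtained by rooting shortest paths at the best-connected frame. The field size, tolerance, and the `confidence` matrix are hypothetical placeholders.

```python
# Minimal sketch (assumptions noted above), not the authors' implementation.
import numpy as np
import networkx as nx

def candidate_overlaps(positions_deg, fov_deg=1.5, tol_deg=0.3):
    """Connection matrix: 1 where two frames' nominal scan positions
    (an n x 2 array, in degrees) suggest a possible overlap."""
    n = len(positions_deg)
    C = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if np.all(np.abs(positions_deg[i] - positions_deg[j]) < fov_deg + tol_deg):
                C[i, j] = C[j, i] = 1
    return C

def alignment_tree(confidence):
    """Spanning tree over frames, weighting edges by 1 - confidence so that
    high-confidence pairwise alignments are preferred; rooting Dijkstra
    shortest paths at the best-connected frame keeps the tree diameter small."""
    G = nx.Graph()
    n = confidence.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if confidence[i, j] > 0:
                G.add_edge(i, j, weight=1.0 - confidence[i, j])
    root = max(G.degree, key=lambda kv: kv[1])[0]
    paths = nx.single_source_dijkstra_path(G, root)
    T = nx.Graph()
    for node, path in paths.items():
        T.add_edges_from(zip(path[:-1], path[1:]))
    return T, root
```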
Fig. 2. Procedure for frame blending. (A) The goal of blending is to find a path (from s to t) that represents the optimal boundary between frame 1 and frame 2. The merged image then consists of pixels from frame 1 that lie to the left of and above the line st, and pixels from frame 2 to the right of and below the line st. (B) To find the line st, we treat the overlap region as a graph in which the center of each pixel boundary represents a node. The cost of using an edge between two nodes is defined in the text; Dijkstra's algorithm then finds the least-cost path across the overlap region, which is the line st. (Left) The intensity discontinuity map corresponding to a 3 × 3 region, where each cell has an associated cost d(i, j); (Middle) the discontinuity map is converted to a cost graph of the 3 × 3 map, where the graph is constructed from each pair of adjacent pixels with the vertex at the center of the pixel boundary, and paths start at the source position s and proceed towards the terminal position t. The red line indicates the resulting lowest-cost path from source to terminal; (Right) the pixels crossed by the shortest path are boundary pixels (red); they are mapped back to the discontinuity map and represent the boundary line. (C) Result of direct overlay of two frames showing a discontinuity at the border between images (red arrows); an overlap region is expanded within the yellow box. (D) Result of image blending on the same two frames.
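For a concrete picture of the blending step, the sketch below finds a least-cost boundary through the overlap with Dijkstra's algorithm on a 4-connected pixel grid. It is a simplification of the scheme in Fig. 2 (the published method places graph nodes at pixel-boundary centers); here the edge cost is simply the absolute intensity difference between the two frames, and the path runs corner to corner.

```python
# Minimal sketch under the simplifying assumptions described above.
import heapq
import numpy as np

def seam_path(overlap1, overlap2):
    """Return the (row, col) pixels on the least-cost path from the top-left
    corner (s) to the bottom-right corner (t) of the overlap region."""
    cost = np.abs(overlap1.astype(float) - overlap2.astype(float))
    rows, cols = cost.shape
    start, goal = (0, 0), (rows - 1, cols - 1)
    dist = {start: cost[start]}
    prev = {}
    heap = [(cost[start], start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr, nc]
                if nd < dist.get((nr, nc), np.inf):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from t to s to recover the boundary pixels.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```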
Fig. 3. Automatic montage results for two relatively large datasets acquired at different retinal depths. (A) A photoreceptor montage of a healthy retina encompassing the foveal region out to approximately 7 degrees in the temporal retina. Most photoreceptor montages were excellent using any of the three alignment methodologies (SIFT, NCC, or combined). The remaining errors arose in the blending stage, where dim regions of the retina, such as below vessel junctions, could appear black due to histogram equalization adjustments; two examples are shown in the white boxes. (B) A montage of images obtained when focused at the nerve fiber layer. Here the montage is extensive; the complete data set included some images in the fovea, but due to the lack of detail there, a few frames were omitted (the dark notch indicated by the white arrow).
Fig. 4. Example of an alignment failure arising from using NCC when imaging regions with multiple blood vessels. (A) and (B) show the two frames to be aligned. (C) shows the cross-correlation map of these two frames. Each bright spot in the map is a local peak that corresponds to a potential alignment. The red boxed peak corresponds to a wrong alignment that was chosen by the algorithm (D). The green boxed peak corresponds to the correct alignment (E), which was verified by manual comparison of the image pairs.
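The failure mode in Fig. 4 can be illustrated with the hedged sketch below: normalized cross-correlation over vessel-rich overlaps often yields several comparably strong peaks, so candidate peaks must be vetted (as in Table 2) rather than accepting the global maximum outright. The peak-spacing parameter and number of peaks are illustrative only, not values from the published pipeline.

```python
# Minimal sketch of extracting multiple NCC peak candidates from a pair of
# frames; parameters are assumptions for illustration.
import numpy as np
from skimage.feature import match_template, peak_local_max

def ncc_candidates(frame_a, frame_b, num_peaks=5):
    """Return candidate peak locations (row, col) in the correlation map and
    their correlation values, ordered from strongest to weakest."""
    cmap = match_template(frame_a, frame_b, pad_input=True)
    peaks = peak_local_max(cmap, min_distance=20, num_peaks=num_peaks)
    values = cmap[tuple(peaks.T)]
    order = np.argsort(values)[::-1]
    return peaks[order], values[order]
```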
Fig. 5. Box plots of the performance of the automatic montaging framework using three different alignment techniques compared with semi-automated montaging for all 26 datasets. (A) Relative average intensity difference normalized by the random matching baseline. (B) The ratio of the number of frames in the largest two groups to the total number of frames in the dataset. Plot A characterizes alignment quality, while plot B characterizes the discontinuity of these montaging frameworks.
Fig. 6. Percentage of correct alignments for the estimates of acceptable matches generated by the combined method for all datasets. The precision is the percentage of the algorithmically determined matches judged to be good matches by human observers, who made a forced-choice decision as to whether each frame pair was aligned. The left box plot shows that essentially all the photoreceptor layer pairings were acceptable, and the right box plot shows that the median for the superficially focused data was greater than 95%, although there was an outlier data set where only about 79% of the matches were judged acceptable.
Fig. 7. (A) Box plot of the ratio of the number of frames in the largest two groups to the total number of frames in the dataset. For photoreceptor imaging, more than 75% of the datasets had more than 92% of their frames in the top two groups. For the superficial layers this dropped to about 86%, but a long tail was present. (B) Average number of inliers per frame found by SIFT + RANSAC for both photoreceptor and superficial layer images. The number of inliers depends on both the image quality and the degree of overlap of the frames, which in our datasets tended to be about 40% but could be as low as 10%. Some of the failures and disconnected groups for the superficial imaging arise from both the lower number of inliers found for the superficial images and the errors in the NCC metric for the same images, as discussed in Fig. 4.

Tables (3)


Table 1. Description of Datasets used


Table 2. The four criteria required for a correlation coefficient (cc) map peak to represent a valid match between two frames. We apply top-hat filtering to the raw cmap (Rcmap) to get an adjusted cmap (Acmap) for certain filtering operations. Note that a “regional maximum” means that the coefficient at a location is larger than the coefficients of its 4 neighbors, and the “neighboring region” in the table refers to a 41 × 41 pixel region surrounding that location. See section 2.4 for other terms.
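As a rough illustration of the top-hat adjustment and regional-maximum test named in Table 2 (not the exact published criteria), a sketch using standard image-processing primitives might look like the following; the 41-pixel region size follows the table, everything else is an assumption.

```python
# Minimal sketch: top-hat filtering suppresses the slowly varying background
# of the raw correlation map so isolated peaks stand out; a pixel at least as
# large as its 4-connected neighbours is flagged as a regional maximum.
import numpy as np
from scipy.ndimage import white_tophat, maximum_filter

def adjusted_cmap(raw_cmap, region=41):
    """Return the top-hat adjusted map and a mask of regional maxima."""
    acmap = white_tophat(raw_cmap, size=region)
    four_neighbours = np.array([[0, 1, 0],
                                [1, 1, 1],
                                [0, 1, 0]], dtype=bool)
    regional_max = raw_cmap == maximum_filter(raw_cmap, footprint=four_neighbours)
    return acmap, regional_max
```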


Table 3. Detailed list of 26 datasets from 23 subjects. For “layer” columns: S = superficial layer, P = photoreceptor layer

Equations (1)


$$\mathrm{AID}(A,B) = \frac{1}{n}\sum_{x}\left|A(x)-B(x)\right|,$$

where the sum runs over the n overlapping pixels x of the aligned frames A and B.
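A minimal sketch of evaluating this average intensity difference (AID), assuming the overlapping pixels of the two aligned frames have already been extracted into matching arrays:

```python
# Mean absolute intensity difference over the n overlapping pixels;
# variable names are illustrative.
import numpy as np

def aid(overlap_a, overlap_b):
    a = overlap_a.astype(float)
    b = overlap_b.astype(float)
    return np.abs(a - b).mean()
```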