Gaussian mixture model for coarse-grained modeling from XFEL

Tetsuro Nagai; Yuki Mochizuki; Yasumasa Joti; Florence Tama; Osamu Miyashita

doi:10.1364/OE.26.026734

1. Introduction

Determination of biomolecular structure is one of the central interests of structural biology, and many techniques have been developed for it. The X-ray free electron laser (XFEL) is an emerging and promising technique that provides a strong and coherent X-ray beam, by which macromolecular structures can be determined [1,2]. Because of the "diffraction-before-destruction" strategy [3], cryogenic cooling is not necessary, and room-temperature structure and dynamics may be investigated. Differences in the dynamics of biological molecules, as well as of ordered water molecules, between room temperature and cryogenic temperature have been reported [4,5]. XFEL is also promising for time-resolved structural studies [6]. Moreover, because of the coherence and brilliance of XFEL, a single particle can produce sufficient diffraction to enable 2D coherent diffraction imaging [7–12], which was initially performed with a synchrotron beam for a large system [13]. As recent improvements in XFEL experimental methodology produce sufficient numbers of diffraction patterns, 3D reconstruction of structural models is becoming more feasible [14,15].

The determination of high-resolution 3D structure models requires the assembly of many diffraction patterns [16,17] obtained with a sufficiently strong XFEL source [18,19]. As current sample delivery systems do not allow control of the sample orientation [20], the orientation of the particle with respect to the incident beam must be estimated during the 3D reconstruction [14,21–23]. As the state of the sample is also difficult to control, conformational heterogeneity has to be considered. In [15], structural heterogeneity was investigated by analyzing conformational spectra of diffraction images lying in similar orientations.

However, the acquisition of large data sets is still a significant challenge, especially for biological systems. Thus, the ability to study structure and dynamics of biological systems from a small number of diffraction patterns would be beneficial. In general, data retrieved from XFEL experiments tends to be of low-resolution, which means that there is not enough information available to reconstruct 3D models ab initio. In such a situation, computational modeling can be used to generate a large number of hypothetical structural models (e.g. using a database or molecular mechanics simulations), and then the experimental data can be used as queries to select models that are most likely to represent the system. Such approaches have been used for SAXS [24–26], cryo-EM [27–29], and also demonstrated for XFEL diffraction patterns [11,30]. A key algorithm in this approach is to simulate XFEL diffraction patterns from a large number of models with many incident beam angles. In the previous study to quantify the accuracy of such an approach [31], diffraction patterns were simulated from atomic models to compare against experimental data. However, the calculation is time-consuming. In order to efficiently generate a large number of diffraction patterns during thorough conformational sampling, the simulation of diffraction patterns from structure models needs to be fast. In addition, the target systems in the current XFEL experiments are large macromolecular complexes, where atomic details are not essential. Therefore, low-resolution/coarse-grained models should be used for such data analysis.

In this work, we explored the use of the Gaussian mixture model (GMM) as a coarse-grained model for structure modeling from XFEL data. Kawabata demonstrated that the GMM is useful for obtaining docking poses given a meso-resolution volume of a complex (from cryo-EM) and the structures of its components [32]. This is because the overlap between Gaussians can be calculated analytically, which makes its evaluation computationally fast. Here we exploit the fact that the Fourier transform of a Gaussian can also be computed very efficiently. The diffraction image of a protein can therefore be obtained quickly through the use of GMMs, which makes the GMM a useful tool for 3D reconstruction from XFEL experiments.

2. Methods

2.1. Gaussian mixture model

In the Gaussian mixture model (GMM), a macromolecule is represented by a sum of N_g Gaussian distributions. We denote by r a position in three-dimensional real space. A molecule is represented as the density function $f (r | Θ) = \sum_{i = 1}^{N_{g}} π_{i} ϕ (r | μ_{i}, Σ_{i})$ , where ϕ(r|μ_i, Σ_i) stands for the Gaussian distribution with mean μ_i and covariance matrix Σ_i, and π_i ∈ [0, 1] is the weight of the ith Gaussian distribution. The weights are set to sum to unity, i.e., $\sum_{i = 1}^{N_{g}} π_{i} = 1$ . The individual Gaussian distributions are given by $ϕ (r | μ_{i}, Σ_{i}) = {(2 π)}^{- 3 / 2} {| Σ_{i} |}^{- 1 / 2} \exp [- \frac{1}{2} {(r - μ_{i})}^{T} Σ_{i}^{- 1} (r - μ_{i})]$ . Here, Θ collectively denotes the parameters of these Gaussian distributions.

The parameters Θ can be optimized by the maximum likelihood method for a given macromolecule. When the coordinates of the atoms of the protein ({r₁, r₂, . . ., r_N}) are given, these values can be regarded as observations drawn from the GMM. The likelihood function is then given by $L (Θ) = Π_{j = 1}^{N} \sum_{i = 1}^{N_{g}} π_{i} ϕ (r_{j} | μ_{i}, Σ_{i})$ , and the GMM parameter values that maximize L(Θ) can be regarded as the best representation of the molecular shape. In this way, we made GMMs for some of the protein structures available in the PDB. In the proposed approach, the resulting GMMs were considered as electron densities, which diffract the coherent X-ray beam. We used the program gmconvert [32] to obtain the optimized GMM. In the optimization, no hydrogen atoms were included, as crystal structures do not typically provide information on the positions of hydrogens in the biomolecule. This omission can also be justified by the fact that the number of electrons associated with hydrogen is small compared to non-hydrogen atoms. In addition, all the available non-hydrogen-atom coordinates are treated equally. This treatment should be an acceptable approximation, as most of the atoms in proteins are carbon (atomic number Z = 6), nitrogen (Z = 7) and oxygen (Z = 8), which all have a similar number of electrons. When heavy metal ions and/or nucleic acids are important, some special care needs to be taken.
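The density and likelihood defined above can be written down in a few lines. The following is a minimal sketch in plain Python, not the gmconvert implementation: the function names (`det3`, `inv3`, `gaussian_pdf`, `gmm_density`, `log_likelihood`) are hypothetical, and the logarithm of L(Θ) is used because the product over atoms underflows quickly for real proteins.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate divided by the determinant."""
    d = det3(m)
    inv = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            r = [x for x in range(3) if x != i]
            s = [x for x in range(3) if x != j]
            minor = m[r[0]][s[0]] * m[r[1]][s[1]] - m[r[0]][s[1]] * m[r[1]][s[0]]
            # Cofactor (i, j) goes to position (j, i) of the inverse.
            inv[j][i] = (-1) ** (i + j) * minor / d
    return inv

def gaussian_pdf(r, mu, cov):
    """phi(r | mu, Sigma) for a three-dimensional Gaussian."""
    diff = [r[a] - mu[a] for a in range(3)]
    inv = inv3(cov)
    quad = sum(diff[a] * inv[a][b] * diff[b] for a in range(3) for b in range(3))
    norm = (2.0 * math.pi) ** -1.5 * det3(cov) ** -0.5
    return norm * math.exp(-0.5 * quad)

def gmm_density(r, weights, means, covs):
    """f(r | Theta) = sum_i pi_i * phi(r | mu_i, Sigma_i)."""
    return sum(w * gaussian_pdf(r, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

def log_likelihood(points, weights, means, covs):
    """log L(Theta) = sum_j log f(r_j | Theta) over atom coordinates r_j."""
    return sum(math.log(gmm_density(r, weights, means, covs)) for r in points)
```

An optimizer such as expectation-maximization would adjust the weights, means, and covariances to increase `log_likelihood`; that step is left out here.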

2.2. Calculation of diffraction images

Since the Fourier transformation of GMM can be performed analytically, the structure factor of GMM may be given by

\begin{array}{l} F (s) \equiv ∭ f (r | Θ) e^{i s \cdot r} d r \\ = \sum_{i = 1}^{N_{g}} π_{i} e^{i s \cdot μ_{i}} \exp [- \frac{1}{2} s^{T} Σ_{i} s], \end{array}

where s represents a diffraction wave vector. Hereafter, we use k = s/(2π).
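Because the structure factor is a closed-form sum over Gaussians, it can be evaluated without any numerical integration. The sketch below assumes plain nested lists for the GMM parameters; the function names and the two-Gaussian toy model are hypothetical and are not taken from the authors' program.

```python
import cmath
import math

def structure_factor(s, weights, means, covs):
    """F(s) = sum_i pi_i * exp(i s.mu_i) * exp(-0.5 * s^T Sigma_i s)."""
    total = 0j
    for w, mu, cov in zip(weights, means, covs):
        phase = sum(s[a] * mu[a] for a in range(3))
        quad = sum(s[a] * cov[a][b] * s[b] for a in range(3) for b in range(3))
        total += w * cmath.exp(1j * phase) * math.exp(-0.5 * quad)
    return total

def intensity(k, weights, means, covs):
    """|F|^2 at wave vector k, using the convention k = s / (2*pi)."""
    s = [2.0 * math.pi * x for x in k]
    return abs(structure_factor(s, weights, means, covs)) ** 2

# Hypothetical toy model: two isotropic Gaussians 10 Å apart.
toy_weights = [0.6, 0.4]
toy_means = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
toy_covs = [[[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]],
            [[9.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 9.0]]]
```

Since the weights sum to one, F(0) = 1, so `intensity` is already expressed relative to the forward-scattering value I(0).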

In this study, without loss of generality, we assumed that the incident beam comes from the positive side of z-axis, i.e., the wave vector of the incident beam is (0, 0, −k_inc), that the object is at (0, 0, d), and that the detector is set perpendicular to z-axis with its center being at (0, 0, 0), where d is the distance between object and detector (d > 0). In elastic scattering, the intensity of diffraction at (x, y, 0), I(x, y), is proportional to |F(k_x, k_y, k_z)|² such that

k_{x} = k_{inc} \frac{x}{\sqrt{d^{2} + x^{2} + y^{2}}},

k_{y} = k_{inc} \frac{y}{\sqrt{d^{2} + x^{2} + y^{2}}},

k_{z} = k_{inc} - \sqrt{k_{inc}^{2} - k_{x}^{2} - k_{y}^{2}} .

The coordinates (k_x, k_y, k_z) form (half of) the Ewald sphere. We obtained the two-dimensional diffraction image I(x, y) relative to I(0), and, in the following, the diffraction patterns are discussed using the k values corresponding to pixels as the coordinates, i.e., I(x, y) → I(k_x, k_y). The calculation was performed with an in-house program written mainly in Python. Computationally expensive parts are written in C. The program can be obtained upon request.
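The mapping from a detector position to a point on the Ewald sphere follows directly from the three expressions above. This is a minimal sketch with a hypothetical function name; x, y, and d are assumed to share one length unit, and k_inc is in Å⁻¹.

```python
import math

def pixel_to_k(x, y, d, k_inc=1.0):
    """Map detector position (x, y) at distance d to (k_x, k_y, k_z)."""
    norm = math.sqrt(d * d + x * x + y * y)
    kx = k_inc * x / norm
    ky = k_inc * y / norm
    # k_z is the offset of the Ewald sphere from the plane through the origin.
    kz = k_inc - math.sqrt(k_inc ** 2 - kx ** 2 - ky ** 2)
    return kx, ky, kz

# Every mapped point lies on a sphere of radius k_inc centred at (0, 0, k_inc).
kx, ky, kz = pixel_to_k(30.0, 40.0, 120.0)
radius = math.sqrt(kx ** 2 + ky ** 2 + (kz - 1.0) ** 2)
```

The diffraction intensity at that pixel is then proportional to |F|² evaluated at the returned wave vector.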

In all the diffraction calculations, the wavelength of the incident beam was set to 1 Å, i.e., k_inc = 1 Å⁻¹. The size of the pixel in the diffraction image is 0.001 Å⁻¹ × 0.001 Å⁻¹ in k-space near the center.

We also obtained diffraction images simulated from atomic models by using the program ‘SPSim,’ which has been incorporated into ‘condor’ [33]. With this program we acquired the diffraction images calculated with atomic scattering factors, which we used as the reference to evaluate the effect of the GMM approximation. The expected values of diffraction patterns, i.e., without shot noise, were used in the analyses. We did not include hydrogen atoms in our simulations. The parameters were chosen to match those used in the calculation of GMM-based diffraction patterns. The resulting diffraction images were compared with those obtained using GMMs.

2.3. Correlation coefficient and measurement of resolution

Pearson’s correlation coefficient was used to quantitatively compare the diffraction patterns simulated from the atomic structure and those simulated from the GMM described above. We first obtained the logarithm of the intensity of the two diffraction images as a function of polar coordinates k and ϕ, i.e., log₁₀ I₁(k, ϕ) and log₁₀ I₂(k, ϕ). Here, k and ϕ are defined such that $k = \sqrt{k_{x}^{2} + k_{y}^{2}}$ and (k_x, k_y) = (k cos ϕ, k sin ϕ). Then Pearson’s coefficient between log₁₀ I₁(k, ϕ) and log₁₀ I₂(k, ϕ) for a fixed k is defined as CC(k). When two diffraction images match at k, CC(k) = 1. As a measure of (inverse of) resolution, k_0.5 was defined by 0.5 = CC(k_0.5). To obtain this numerically, we first fit CC(k) by a sigmoid function 1/(1 + exp(C₁(k − C₂))), where C₁ and C₂ are fitting parameters. Then C₂ is reported as k_0.5.
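As an illustration of CC(k), the sketch below correlates the log-intensities of two patterns on a ring of radius k. It assumes each pattern is available as a callable returning a positive intensity at (k_x, k_y); that interface, the number of angular samples, and the toy pattern are hypothetical choices, and the sigmoid is shown only as the model function, without the fitting step.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    # Undefined (division by zero) if either pattern is constant on the ring.
    return cov / math.sqrt(var_u * var_v)

def ring_cc(intensity_1, intensity_2, k, n_phi=360):
    """CC(k): correlate log10 I1 and log10 I2 sampled over the angle phi."""
    phis = [2.0 * math.pi * j / n_phi for j in range(n_phi)]
    log_1 = [math.log10(intensity_1(k * math.cos(p), k * math.sin(p))) for p in phis]
    log_2 = [math.log10(intensity_2(k * math.cos(p), k * math.sin(p))) for p in phis]
    return pearson(log_1, log_2)

def sigmoid(k, c1, c2):
    """Model fitted to CC(k); it equals 0.5 at k = c2, which is reported as k_0.5."""
    return 1.0 / (1.0 + math.exp(c1 * (k - c2)))

def toy_intensity(kx, ky):
    """Hypothetical anisotropic pattern used only to exercise ring_cc."""
    return math.exp(-100.0 * (kx * kx + 4.0 * ky * ky)) + 1e-6
```

Because the comparison is made on log₁₀ I, multiplying one pattern by a constant leaves CC(k) unchanged, so the two images do not need a common intensity scale.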

2.4. Test of angular assignment and model selection by similarity score

In addition, we examined the performance of the GMM approach in practical algorithms. When a diffraction pattern simulated from a hypothetical 3D model is compared against experimental diffraction patterns, the first requirement is to estimate the orientation of the molecule against the incident beam [31]. To check the feasibility of this angular assignment, we compared the atomic model’s diffraction image with the diffraction images of many rotated GMMs. The rotation of the atomic model and the GMM is performed using the three Euler angles α, β, and γ, with the convention R_z(γ)R_y(β)R_z(α). Here, α and β can be associated with (negative values of) the incident beam angles, and γ can be associated with (the negative value of) the in-plane rotation angle about the axis parallel to the incident beam. The rotated GMM’s diffraction image is compared with the atomic model’s diffraction image by the similarity score that we defined as $H_{α, β, γ} = (1 / N_{k}) \sum_{ℓ = 1}^{N_{k}} {CC}_{α, β, γ} (ℓ Δ k)$ [34]. We used N_k = 5 and Δk = 0.02 Å⁻¹ in this study. These parameter values were chosen so that Δk is equal to or larger than the correlation length in k-space, which can be approximated by 1/L ≈ 0.01 Å⁻¹ for the test system used. In addition, the maximum wavenumber N_kΔk should not be much larger than the resolution that the GMM can reach. As shown in Results and Discussion, 0.1 Å⁻¹ can reasonably be achieved for the model system used. We also studied whether the correct model can be selected with the GMM by comparing this similarity score.
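Rotating a GMM does not require refitting: each mean is rotated as a vector and each covariance matrix is transformed as Σ → RΣRᵀ. The sketch below builds R = R_z(γ)R_y(β)R_z(α) from angles in degrees using standard right-handed rotation matrices; the helper names are hypothetical, and the sign convention relating these angles to the beam direction is not addressed here.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def euler_zyz(alpha, beta, gamma):
    """R = R_z(gamma) R_y(beta) R_z(alpha), angles in degrees."""
    return matmul(rot_z(gamma), matmul(rot_y(beta), rot_z(alpha)))

def rotate_gmm(means, covs, rot):
    """Rotate Gaussian means (mu -> R mu) and covariances (Sigma -> R Sigma R^T)."""
    new_means = [[sum(rot[a][b] * mu[b] for b in range(3)) for a in range(3)]
                 for mu in means]
    new_covs = [matmul(matmul(rot, cov), transpose(rot)) for cov in covs]
    return new_means, new_covs
```

The weights π_i are unaffected by rotation, so a rotated model can be passed straight to a structure-factor routine.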

2.5. Model systems

Proteins used in this study are summarized in Table 1. These proteins were selected by considering variations in sizes and overall shapes. The system size varies from 6375 to 398820 in terms of the number of non-hydrogen atoms. Two conformations of yeast elongation factor 2 (EF2) were included to see the sensitivity of GMM to the conformational variation and to test the feasibility of angular assignment and model selection. The structure of 1n0v is the apo-form, while the structure in 1n0u is the holo-form complex with an inhibitor sordarin. Upon sordarin binding, the three C-terminal domains undergo substantial conformational change, whereas three N-terminal domains are almost rigid [35]. These two structures differ by 14 Å RMSD. All HETATM entries were eliminated before any calculations. We also examined the accuracy of GMM approach in relation to their volumes. Volume was calculated with gmx sasa in gromacs package [36], with the probe radius set to 0. The values of volume are also tabulated in Table 1.

Table 1. The proteins used for testing in this study. Here, N denotes the number of atoms in the protein, and V denotes the volume calculated with gromacs [36].

3. Results

3.1. Diffraction data simulated using GMMs is sensitive to conformational changes observed between two yeast translation elongation factor 2 structures

As expected, diffraction patterns obtained with the GMM have better correlation to the atomic diffraction patterns as N_g increases (Fig. 1). Yet, even with a small N_g of 40, the central regions of the diffraction patterns were well reproduced, indicating that the low-resolution region can be expressed by GMM with small N_g. The differences in diffraction patterns between incident beam angles were also well reproduced by the GMM diffraction patterns (Figs. 1 and 2). The diffraction patterns produced from GMMs with N_g = 1000 are nearly identical to atomic diffraction patterns up to a high wavenumber (≈ 0.2 Å⁻¹) regardless of incident beam angles (Figs. 1 and 2). Increase in similarity between atomic and GMM diffraction images with large N_g can be more clearly shown by CC(k) (Fig. 3). For this particular system, with N_g ≈ 700 or more, the diffraction images of GMM and atomic models are identical up to 0.1 Å⁻¹. The achievable resolution of GMM was evaluated in terms of k_0.5. It is shown that k_0.5 increases as N_g increases (Fig. 4), consistent with the behavior of CC(k) shown in Fig. 3. This relationship suggests that in practical applications we can adjust the parameter N_g so that the achievable resolution of GMM covers the experimentally targeted resolution. We note that k_0.5 is independent of incident-beam wavenumber.

Fig. 1 (a) Structure and GMM envelope (N_g = 200) and (b) diffraction image calculated from atomic model. Panels (c) to (f) are those obtained from GMMs with different N_g. The molecule in (a) is shown from the incident beam direction, corresponding to the (001) direction. The pixel size is 0.001 Å⁻¹ times 0.001 Å⁻¹. The intensities in the diffraction images are shown as log₁₀[I(k)/I(0)], where the colors represent the intensity at each pixel as a common logarithm relative to the maximum intensity.

Fig. 2 (a) Structure and GMM envelope (N_g = 200) shown from the (010) direction, from which the incident beam comes. Diffraction images calculated from the atomic model and the GMM with N_g = 1000 are shown in panels (b) and (c), respectively. Panels (d) to (f) are counterparts for the (100) direction. The pixel size is 0.001 Å⁻¹ times 0.001 Å⁻¹. The intensities in the diffraction images are shown as log₁₀[I(k)/I(0)].

Fig. 3 Correlation coefficient as a function of radial distance k, CC(k), improves as N_g increases. The panels (a), (b), and (c) correspond to incident beam direction (001), (010), and (100), respectively.

Fig. 4 k_0.5 as a function of N_g, indicating resolution of GMM approximation. Purple square, green circle, and blue triangle represent incident beam angle direction (001), (010), and (100), respectively.

To see the sensitivity of GMM with respect to structural heterogeneity, the diffraction images of two isomers of EF2, i.e., PDB ID 1n0u and 1n0v were calculated (Fig. 5). As mentioned above, the structures of 1n0v and 1n0u are the apo-form and the holo sordarin-bound complex, respectively [35], differing by 14 Å RMSD when aligned as shown in Figs. 5(a) and 5(d). This alignment was performed globally with rigid body rotation and translation with RMSD tool in VMD [41], so that the backbone RMSD was minimum. The resulting diffraction images are significantly different, and thus demonstrate that their respective GMMs can distinguish the two different conformations through their corresponding diffraction images.

Fig. 5 Comparison between two conformational isomers suggests the sensitivity of GMM to structural heterogeneity. Panels (a) to (c) show 1n0u and panels (d) to (f) show 1n0v. The envelopes shown in panels (a) and (d) are based on GMM with N_g = 200. Panels (b) and (e) show the diffraction images simulated from atomic model, while panels (c) and (f) show those simulated from GMM. The pixel size is 0.001 Å⁻¹ times 0.001 Å⁻¹. The intensity of diffraction images is shown as log₁₀[I(k)/I(0)].

3.2. Data with groEL

The results for groEL are shown in Figs. 6 and 7. The GMMs produce diffraction patterns that match diffraction patterns produced from atomic models very well up to k = 0.1 Å⁻¹. Also, the correlation with the diffraction image simulated from atomic model increases as N_g increases.

Fig. 6 Structure and diffraction images of groEL. Panels (a), (d), and (g) are structure and envelope based on GMM with N_g = 200 shown from the incident beam direction. Hereafter views of (a), (d), and (g) are referred to as “bottom,” “side,” and “tilt,” respectively. In panels (b), (e), and (h), diffraction images obtained with atomic models are shown, while in panels (c), (f), and (i), those simulated from GMM are shown. The pixel size is 0.001 Å⁻¹ times 0.001 Å⁻¹. The intensity of diffraction images is shown as log₁₀[I(k)/I(0)].

Fig. 7 Correlation coefficient of diffraction images between GMM and atomic model, CC(k), as a function of radial distance k for the bottom view (a), side view (b), and tilt view (c) of groEL. CC(k) improves as N_g increases.

3.3. Data with Satellite Tobacco Necrosis Virus Capsid (STNVC)

Even for a larger complex like the Satellite Tobacco Necrosis Virus Capsid (STNVC), the GMM captures the small angle region very well (Fig. 8). The correlation between diffraction image obtained with GMM and one simulated from the atomic model increases as N_g increases (Fig. 9).

Fig. 8 Structure and diffraction images of STNVC. Panel (a) shows the structure viewed from the incident beam direction. The envelope in panel (a) is based on a GMM with N_g = 200. Panels (b) and (c) show the diffraction images simulated from the atomic model and the GMM with N_g = 1000, respectively. The pixel size is 0.0010 Å⁻¹ times 0.0010 Å⁻¹. The intensity of diffraction images is shown as log₁₀[I(k)/I(0)].

Fig. 9 Correlation coefficient of diffraction images between GMM and atomic models as a function of k for STNVC.

3.4. Effect of the number of Gaussians on simulation accuracy

To obtain a guideline for the number of Gaussians necessary to reproduce diffraction patterns at a required resolution, we performed scaling analysis. We found that $k_{0.5} \propto N_{g}^{1 / 3}$ and that the resolution obeys a universal behavior $k_{0.5} V^{1 / 3} \approx N_{g}^{1 / 3}$ (Fig. 10). This relationship can be used to set N_g for a given system size and a required resolution. As k_0.5 is independent of k_inc as discussed above, this relationship holds for any wavelength of incident beam.

Fig. 10 The scaling analysis demonstrates the universal behavior and illustrates the guideline of resolution. Panel (a) shows k_0.5 as a function of $N_{g}^{1 / 3}$ , suggesting that the resolution is in proportion to $N_{g}^{1 / 3}$ . Panel (b) shows that the slope can be normalized by $V^{1 / 3}$ and that one universal guideline was obtained. Purple and green marks stand for EF2 and groEL, respectively. Different symbols mark different incident beam angles. The blue inverse triangle marks the polio virus capsid complex, which is the largest system studied. The orange diamond marks glutamate dehydrogenase (1euz) and the red inverse hexagon represents STNVC (4bcu). The black line represents the fitting line $k_{0.5} V^{1 / 3} = 0.90 N_{g}^{1 / 3}$ . The values of V are tabulated in Table 1.

This universal behavior may be explained as follows. Each Gaussian has nine degrees of freedom (three for the mean and six for the symmetric covariance matrix), so 9N_g parameters are optimized in order to approximate the electron density of the molecule. This translates into 9N_g sampling points inside the volume V, whereby the sampling interval is (V/9N_g)^(1/3). The highest resolution that such a sampling can represent is given by k_max = (1/2)(V/9N_g)^(−1/3) ≈ 1.04(N_g/V)^(1/3). This relation is similar to the observed relation, k_0.5 ≈ (N_g/V)^(1/3).
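The scaling relation can be turned into a small helper for choosing N_g. The sketch below uses the fitted coefficient 0.90 from Fig. 10 as the default; the function names and the example volume of 1.0 × 10⁵ Å³ are hypothetical and do not correspond to any entry in Table 1.

```python
def achievable_k(n_gauss, volume, coeff=0.90):
    """k_0.5 (1/Å) expected from k_0.5 * V^(1/3) = coeff * N_g^(1/3)."""
    return coeff * (n_gauss / volume) ** (1.0 / 3.0)

def n_gauss_for_resolution(k_target, volume, coeff=0.90):
    """Invert the same relation: N_g needed to reach k_target for volume V (Å^3)."""
    return (k_target * volume ** (1.0 / 3.0) / coeff) ** 3

def sampling_coeff(params_per_gaussian=9):
    """Prefactor of the sampling argument: (1/2) * 9^(1/3), about 1.04."""
    return 0.5 * params_per_gaussian ** (1.0 / 3.0)

# Hypothetical example: a 1.0e5 Å^3 molecule targeted at k = 0.1 1/Å.
example_n_gauss = n_gauss_for_resolution(0.1, 1.0e5)
```

For this hypothetical volume the estimate is about 137 Gaussians. Because N_g grows with the cube of the target wavenumber, doubling the resolution requires roughly eight times as many Gaussians.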

3.5. Feasibility of angular alignment and model selection

When data from real experiments are analyzed, the incident-beam angles and the in-plane rotation angle of the diffraction images need to be estimated. For this purpose, it is necessary to rotate GMMs so that their diffraction patterns fit the experimental patterns, whereby the incident-beam angles and in-plane rotation angles can be estimated. To see the feasibility of this procedure, we made ten hypothetical experimental diffraction patterns for each of 1n0u and 1n0v (“target” diffraction patterns). We first rotated the atomic models of 1n0u and 1n0v by using ten sets of Euler angles (α, β, and γ) from the aligned structures shown in Figs. 5(a) and 5(d). The ten sets of these Euler angles were made by drawing α, β, and γ from uniform distributions in [0°, 360°], [0°, 180°], and [−180°, 180°], respectively. The convention of rotation was R_z(γ)R_y(β)R_z(α). After the rotations, target diffraction patterns were calculated with the incident beam along the z axis. As noted above, the structures of 1n0v and 1n0u are the apo-form and the holo sordarin-bound complex [35], respectively, differing by 14 Å RMSD.

First, we tested whether the three Euler angles could be found by maximizing similarity score between the target diffraction patterns and diffraction patterns of rotated GMM. In other words, we tested whether these angles could be given by arg max_α,β,γ H_α,β,γ, where H is measured between target diffraction patterns and diffraction patterns from GMM that are rotated by R_z(γ)R_y(β)R_z(α). The maximization was performed with exhaustive search. Here, α and β were searched every 5° and γ was searched every 1°. As can be seen in Fig. 11, when the target diffraction patterns and GMM’s patterns are from the same conformation, large similarity scores are observed around correct angles, for N_g ≥ 40. The number of successful angular assignments out of 10 trials, m, is summarized in Table 2. It is shown that, for N_g ≥ 40, the correct angles were almost always obtained. This indicates that, for EF2, the resolution of 1/k_0.5 ≈ 12 Å is sufficient for angular assignment.
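The exhaustive search amounts to evaluating the similarity score on a grid of Euler angles and keeping the maximum. The sketch below assumes a `score(alpha, beta, gamma)` callable standing in for H between a target pattern and the rotated GMM's pattern; the function names, the toy score, and the wrap-around treatment of angular differences are hypothetical. The ±7.5° tolerance follows the criterion stated for Table 2.

```python
import itertools

def exhaustive_search(score, step_ab=5, step_g=1):
    """Return the (alpha, beta, gamma) grid point, in degrees, maximizing score."""
    alphas = range(0, 360, step_ab)
    betas = range(0, 181, step_ab)
    gammas = range(-180, 180, step_g)
    return max(itertools.product(alphas, betas, gammas),
               key=lambda abg: score(*abg))

def angular_error(a, b, period=360.0):
    """Smallest absolute difference between two angles, with wrap-around."""
    diff = abs(a - b) % period
    return min(diff, period - diff)

def is_correct(estimate, truth, tol=7.5):
    """True if all three Euler angles are within tol degrees of the truth."""
    return all(angular_error(e, t) <= tol for e, t in zip(estimate, truth))

def toy_score(alpha, beta, gamma, truth=(236, 152, 30)):
    """Hypothetical stand-in for H: peaks at the true orientation."""
    return -sum(angular_error(x, t) ** 2
                for x, t in zip((alpha, beta, gamma), truth))

# Coarser gamma step than in the text, only to keep the demonstration quick.
best = exhaustive_search(toy_score, step_ab=5, step_g=10)
```

With the 5° and 1° steps used in the text, the grid has 72 × 37 × 360 points, which is why a fast pattern calculation matters for this step.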

Fig. 11 The similarity score is illustrated as a function of two of the three Euler angles (α, β), which correspond to the incident beam angles. It is defined as max_γ[H_α,β,γ], where γ corresponds to the in-plane rotation angle and $H = (1 / 5) \sum_{k = 0.02, 0.04, 0.06, 0.08, 0.1} CC (k)$ . In panels (a), (b), and (c), the similarity score was measured between the 1n0u atomic structure and the GMM obtained from 1n0u. In panels (d), (e), and (f), the similarity score was measured between the 1n0v atomic structure and the GMM obtained from 1n0u. In panels (g), (h), and (i), the similarity score was measured between the 1n0u atomic structure and the GMM obtained from 1n0v. In panels (j), (k), and (l), the similarity score was measured between the 1n0v atomic structure and the GMM obtained from 1n0v. In the left panels (a), (d), (g), and (j), N_g = 20. In the central panels (b), (e), (h), and (k), N_g = 40. In the right panels (c), (f), (i), and (l), N_g = 100. In this trial, the correct angle is (α, β) = (236°, 152°).

Table 2. The number of successful angular registrations m out of 10 trials is tabulated, together with the number of successful model selections n. The beam angle and in-plane rotation angle are different between the trials. If all three estimated Euler angles are simultaneously within ±7.5° of the corresponding correct Euler angles, the registration was considered to be correct. As to n, we considered that the model was accurately selected if the maximum similarity score between the target and GMM patterns from the same conformation is larger than that between the incorrect pairs.

As the GMM’s resolution relative to the molecule size should matter more than the absolute value of the resolution, N_g = 40 could be a reasonable number for angular assignment of an object that has as much asymmetry as EF2. We generally expect that angular assignment of a molecule with greater symmetry is more difficult. Therefore, N_g may have to be larger for molecules with greater symmetry.

In addition, we tested whether the GMM that corresponds to the correct conformation can be selected for target diffraction patterns with unknown angles. We compared the similarity score of the 1n0u atomic model against the GMM made from 1n0u with the score of the 1n0u atomic model against the GMM made from 1n0v. As the former score is almost always higher than the latter, we can select the correct conformation (see the second column from the right in Table 2). Similarly, we can select the GMM generated from 1n0v for the target diffraction patterns made from 1n0v by comparing the similarity scores (the rightmost column of Table 2). These results indicate that the correct model can be selected on the basis of the similarity score. This test demonstrates the feasibility of a hybrid approach with GMMs, whereby we will be able to choose a model that explains the diffraction images obtained in future XFEL experiments.

3.6. Computational time

For SPSim, the computational time for a 200×200 pixel diffraction pattern was approximately 0.0015N + 4.55 seconds on a Linux PC with an i7-5930 @3.50GHz, where N denotes the number of atoms. With the GMM, on the other hand, the time was approximately 0.0040N_g + 0.68 seconds on the same machine.
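To make the two timing fits concrete, the sketch below evaluates them for the largest system size quoted in Section 2.5 (N = 398820) and for N_g = 1000. The function names are hypothetical, and applying the linear fits at these particular sizes is an assumption, since the fitted ranges are not stated.

```python
def spsim_seconds(n_atoms):
    """Fitted SPSim time for one 200x200 pattern: 0.0015 N + 4.55 s."""
    return 0.0015 * n_atoms + 4.55

def gmm_seconds(n_gauss):
    """Fitted GMM time for one 200x200 pattern: 0.0040 N_g + 0.68 s."""
    return 0.0040 * n_gauss + 0.68

atomic_time = spsim_seconds(398820)   # about 603 s
gmm_time = gmm_seconds(1000)          # about 4.7 s
```

Under these fits the GMM cost depends on N_g rather than on the number of atoms, so the gap widens as the system grows while N_g is held fixed.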

4. Discussion

In recent studies on XFEL single-particle 3D reconstruction of large structures, the resolution R reached 125 nm and 9 nm for the 450-nm giant mimivirus [14] and the 70-nm PR772 [15], respectively. These numbers indicate R⁻¹V^(1/3) ≈ 3.6 and 7.7, and thus correspond to N_g ≈ 42 and 410, respectively. Therefore, the GMM has sufficient resolution for current XFEL studies. Furthermore, if necessary in future work, the resolution of the GMM can be increased systematically in proportion to $N_{g}^{1 / 3}$ . In the case of EF2, the resolution of the GMM reached k ≈ 0.2 Å⁻¹ (R ≈ 5 Å). This shows that the GMM can also be used for protein structure reconstruction. On the other hand, when experimental diffraction images do not provide high-resolution information, we can reduce N_g to gain computational efficiency and to avoid overfitting.
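The quoted N_g values can be checked by hand. The arithmetic below assumes that V^(1/3) is approximated by the particle diameter and that the sampling-argument coefficient 1.04 from Section 3.4 is used; both are inferences from the numbers rather than statements in the text.

```latex
\begin{aligned}
R^{-1} V^{1/3} &\approx \frac{450}{125} = 3.6, &
N_g &\approx \left(\frac{3.6}{1.04}\right)^{3} \approx 41.5 \approx 42, \\
R^{-1} V^{1/3} &\approx \frac{70}{9} \approx 7.7, &
N_g &\approx \left(\frac{7.7}{1.04}\right)^{3} \approx 406 \approx 410 .
\end{aligned}
```

With the fitted coefficient 0.90 instead, the same inputs give N_g ≈ 64 and N_g ≈ 630, which is the same order of magnitude and leads to the same conclusion.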

In addition, once a good initial GMM is obtained even with a relatively small N_g ≈ 40, the incident beam angle can be estimated very well, as demonstrated in Fig. 11. The conformational heterogeneity can also be distinguished with GMMs (Fig. 11 and Table 2). These observations show that GMM can be used in a hybrid approach to extract dynamic information from XFEL data. The presented method will be a low-resolution yet fast alternative to the approach with atomic models [31].

GMMs make an exhaustive search for angular assignment affordable because of their low computational cost. As such angular assignments are also performed in cryo-EM (see, e.g., [42]), it would be attractive to use the optimization schemes of EM image analysis tools such as Xmipp [43] in future applications.

As the diffraction images used in this study were ideal, without shot noise, these images were compared using Pearson’s r of intensity, CC(k). However, an actual experimental diffraction image can be very noisy. In the future, it will be interesting to test the robustness of the GMM’s angular assignment against very noisy diffraction patterns. The proposed approach can also be incorporated into diffraction pattern simulators that include other experimental setup parameters for further applications.

In conclusion, we have demonstrated that the resolution of GMM depends on the system size and the number of representing Gaussians as $R \approx N_{g}^{- 1 / 3} V^{1 / 3}$ . Thus, according to needs of resolution and computational resources, one can seek the efficient coarse-graining level. It was also demonstrated that the GMM can be used to estimate incident beam angles and to detect conformational variation. Even with the relatively small number of Gaussians, incident beam angles were accurately estimated and correct models were selected. Thus, GMMs can be used to efficiently generate various conformations and simulate expected diffraction patterns in order to identify most likely models which have high correspondence to a given experimental diffraction image. In summary, GMM has the potential to serve as a good coarse-grained model for XFEL, and can be a useful part of a hybrid approach, i.e., the 3D reconstruction and the modeling of conformational dynamics on the basis of XFEL single particle experiment.

Funding

JSPS (JP26119006, JP15K21711, JP16K05527, JP16K07286, JP17K07305, JP26790083); FOCUS for Establishing Supercomputing Center of Excellence.

Acknowledgments

The authors gratefully acknowledge Drs. Nakano and Tiwari of RIKEN for helpful and fruitful discussion.

References

1. H. N. Chapman, P. Fromme, A. Barty, T. A. White, R. A. Kirian, A. Aquila, M. S. Hunter, J. Schulz, D. P. DePonte, U. Weierstall, R. B. Doak, F. R. N. C. Maia, A. V. Martin, I. Schlichting, L. Lomb, N. Coppola, R. L. Shoeman, S. W. Epp, R. Hartmann, D. Rolles, A. Rudenko, L. Foucar, N. Kimmel, G. Weidenspointner, P. Holl, M. Liang, M. Barthelmess, C. Caleman, S. Boutet, M. J. Bogan, J. Krzywinski, C. Bostedt, S. Bajt, L. Gumprecht, B. Rudek, B. Erk, C. Schmidt, A. Hömke, C. Reich, D. Pietschner, L. Strüder, G. Hauser, H. Gorke, J. Ullrich, S. Herrmann, G. Schaller, F. Schopper, H. Soltau, K.-U. Kühnel, M. Messerschmidt, J. D. Bozek, S. P. Hau-Riege, M. Frank, C. Y. Hampton, R. G. Sierra, D. Starodub, G. J. Williams, J. Hajdu, N. Timneanu, M. M. Seibert, J. Andreasson, A. Rocker, O. Jönsson, M. Svenda, S. Stern, K. Nass, R. Andritschke, C.-D. Schröter, F. Krasniqi, M. Bott, K. E. Schmidt, X. Wang, I. Grotjohann, J. M. Holton, T. R. M. Barends, R. Neutze, S. Marchesini, R. Fromme, S. Schorb, D. Rupp, M. Adolph, T. Gorkhover, I. Andersson, H. Hirsemann, G. Potdevin, H. Graafsma, B. Nilsson, and J. C. H. Spence, “Femtosecond X-ray protein nanocrystallography,” Nature 470, 73–77 (2011).

2. J. C. H. Spence, “XFELs for structure and dynamics in biology,” IUCrJ 4, 322–339 (2017).

3. R. Neutze, R. Wouts, D. van der Spoel, E. Weckert, and J. Hajdu, “Potential for biomolecular imaging with femtosecond X-ray pulses,” Nature 406, 752–757 (2000).

4. W. Liu, D. Wacker, C. Gati, G. W. Han, D. James, D. Wang, G. Nelson, U. Weierstall, V. Katritch, A. Barty, N. A. Zatsepin, D. Li, M. Messerschmidt, S. Boutet, G. J. Williams, J. E. Koglin, M. M. Seibert, C. Wang, S. T. A. Shah, S. Basu, R. Fromme, C. Kupitz, K. N. Rendek, I. Grotjohann, P. Fromme, R. A. Kirian, K. R. Beyerlein, T. A. White, H. N. Chapman, M. Caffrey, J. C. H. Spence, R. C. Stevens, and V. Cherezov, “Serial femtosecond crystallography of G protein-coupled receptors,” Science 342, 1521–1524 (2013).

5. J. L. Thomaston, R. A. Woldeyes, T. Nakane, A. Yamashita, T. Tanaka, K. Koiwai, A. S. Brewster, B. A. Barad, Y. Chen, T. Lemmin, M. Uervirojnangkoorn, T. Arima, J. Kobayashi, T. Masuda, M. Suzuki, M. Sugahara, N. K. Sauter, R. Tanaka, O. Nureki, K. Tono, Y. Joti, E. Nango, S. Iwata, F. Yumoto, J. S. Fraser, and W. F. DeGrado, “XFEL structures of the influenza M2 proton channel: Room temperature water networks and insights into proton conduction,” Proc. Natl. Acad. Sci. 114, 13357–13362 (2017).

6. R. Neutze and K. Moffat, “Time-resolved structural studies at synchrotrons and X-ray free electron lasers: Opportunities and challenges,” Curr. Opin. Struct. Biol. 22, 651–659 (2012).

7. M. M. Seibert, T. Ekeberg, F. R. Maia, M. Svenda, J. Andreasson, O. Jönsson, D. Odić, B. Iwan, A. Rocker, D. Westphal, M. Hantke, D. P. Deponte, A. Barty, J. Schulz, L. Gumprecht, N. Coppola, A. Aquila, M. Liang, T. A. White, A. Martin, C. Caleman, S. Stern, C. Abergel, V. Seltzer, J. M. Claverie, C. Bostedt, J. D. Bozek, S. Boutet, A. A. Miahnahri, M. Messerschmidt, J. Krzywinski, G. Williams, K. O. Hodgson, M. J. Bogan, C. Y. Hampton, R. G. Sierra, D. Starodub, I. Andersson, S. Bajt, M. Barthelmess, J. C. Spence, P. Fromme, U. Weierstall, R. Kirian, M. Hunter, R. B. Doak, S. Marchesini, S. P. Hau-Riege, M. Frank, R. L. Shoeman, L. Lomb, S. W. Epp, R. Hartmann, D. Rolles, A. Rudenko, C. Schmidt, L. Foucar, N. Kimmel, P. Holl, B. Rudek, B. Erk, A. Hömke, C. Reich, D. Pietschner, G. Weidenspointner, L. Strüder, G. Hauser, H. Gorke, J. Ullrich, I. Schlichting, S. Herrmann, G. Schaller, F. Schopper, H. Soltau, K. U. Kühnel, R. Andritschke, C. D. Schröter, F. Krasniqi, M. Bott, S. Schorb, D. Rupp, M. Adolph, T. Gorkhover, H. Hirsemann, G. Potdevin, H. Graafsma, B. Nilsson, H. N. Chapman, and J. Hajdu, “Single mimivirus particles intercepted and imaged with an X-ray laser,” Nature 470, 78–82 (2011).

8. M. Gallagher-Jones, Y. Bessho, S. Kim, J. Park, S. Kim, D. Nam, C. Kim, Y. Kim, D. Y. Noh, O. Miyashita, F. Tama, Y. Joti, T. Kameshima, T. Hatsui, K. Tono, Y. Kohmura, M. Yabashi, S. S. Hasnain, T. Ishikawa, and C. Song, “Macromolecular structures probed by combining single-shot free-electron laser diffraction with synchrotron coherent X-ray imaging,” Nat. Commun. 5, 3798 (2014).

9. G. van der Schot, M. Svenda, F. R. N. C. Maia, M. Hantke, D. P. DePonte, M. M. Seibert, A. Aquila, J. Schulz, R. Kirian, M. Liang, F. Stellato, B. Iwan, J. Andreasson, N. Timneanu, D. Westphal, F. N. Almeida, D. Odic, D. Hasse, G. H. Carlsson, D. S. D. Larsson, A. Barty, A. V. Martin, S. Schorb, C. Bostedt, J. D. Bozek, D. Rolles, A. Rudenko, S. Epp, L. Foucar, B. Rudek, R. Hartmann, N. Kimmel, P. Holl, L. Englert, N.-T. Duane Loh, H. N. Chapman, I. Andersson, J. Hajdu, and T. Ekeberg, “Imaging single cells in a beam of live cyanobacteria with an X-ray laser,” Nat. Commun. 6, 5704 (2015).

10. R. Xu, H. Jiang, C. Song, J. A. Rodriguez, Z. Huang, C.-C. Chen, D. Nam, J. Park, M. Gallagher-Jones, S. Kim, S. Kim, A. Suzuki, Y. Takayama, T. Oroguchi, Y. Takahashi, J. Fan, Y. Zou, T. Hatsui, Y. Inubushi, T. Kameshima, K. Yonekura, K. Tono, T. Togashi, T. Sato, M. Yamamoto, M. Nakasako, M. Yabashi, T. Ishikawa, and J. Miao, “Single-shot three-dimensional structure determination of nanocrystals with femtosecond X-ray free-electron laser pulses,” Nat. Commun. 5, 4061 (2014).

11. T. Kimura, Y. Joti, A. Shibuya, C. Song, S. Kim, K. Tono, M. Yabashi, M. Tamakoshi, T. Moriya, T. Oshima, T. Ishikawa, Y. Bessho, and Y. Nishino, “Imaging live cell in micro-liquid enclosure by X-ray laser diffraction,” Nat. Commun. 5, 3052 (2014).

12. Y. Takayama, Y. Inui, Y. Sekiguchi, A. Kobayashi, T. Oroguchi, M. Yamamoto, S. Matsunaga, and M. Nakasako, “Coherent X-ray diffraction imaging of chloroplasts from Cyanidioschyzon merolae by using X-ray free electron laser,” Plant Cell Physiol. 56, 1272–1286 (2015).

13. C. Song, H. Jiang, A. Mancuso, B. Amirbekian, L. Peng, R. Sun, S. S. Shah, Z. H. Zhou, T. Ishikawa, and J. Miao, “Quantitative imaging of single, unstained viruses with coherent X rays,” Phys. Rev. Lett. 101 (2008).

14. T. Ekeberg, M. Svenda, C. Abergel, F. R. Maia, V. Seltzer, J.-M. Claverie, M. Hantke, O. Jönsson, C. Nettelblad, G. van der Schot, M. Liang, D. P. DePonte, A. Barty, M. M. Seibert, B. Iwan, I. Andersson, N. D. Loh, A. V. Martin, H. Chapman, C. Bostedt, J. D. Bozek, K. R. Ferguson, J. Krzywinski, S. W. Epp, D. Rolles, A. Rudenko, R. Hartmann, N. Kimmel, and J. Hajdu, “Three-dimensional reconstruction of the giant mimivirus particle with an X-ray free-electron laser,” Phys. Rev. Lett. 114, 098102 (2015).

15. A. Hosseinizadeh, G. Mashayekhi, J. Copperman, P. Schwander, A. Dashti, R. Sepehr, R. Fung, M. Schmidt, C. H. Yoon, B. G. Hogue, G. J. Williams, A. Aquila, and A. Ourmazd, “Conformational landscape of a virus by single-particle X-ray scattering,” Nat. Methods 14, 877–881 (2017).

16. M. Tegze and G. Bortel, “Atomic structure of a single large biomolecule from diffraction patterns of random orientations,” J. Struct. Biol. 179, 41–45 (2012).

17. A. Tokuhisa, J. Taka, H. Kono, and N. Go, “Classifying and assembling two-dimensional X-ray laser diffraction patterns of a single particle to reconstruct the three-dimensional diffraction intensity function: Resolution limit due to the quantum noise,” Acta Crystallogr. Sect. A 68, 366–381 (2012).

18. T. Oroguchi and M. Nakasako, “Three-dimensional structure determination protocol for noncrystalline biomolecules using x-ray free-electron laser diffraction imaging,” Phys. Rev. E 87, 022712 (2013).

19. M. Nakano, O. Miyashita, S. Jonic, A. Tokuhisa, and F. Tama, “Single-particle XFEL 3D reconstruction of ribosome-size particles based on Fourier slice matching: Requirements to reach subnanometer resolution,” J. Synchrotron Radiat. 25, 1010–1021 (2018).

20. O. Miyashita and Y. Joti, “X-ray free electron laser single-particle analysis for biological systems,” Curr. Opin. Struct. Biol. 43, 163–169 (2017).

21. N.-T. D. Loh and V. Elser, “Reconstruction algorithm for single-particle diffraction imaging experiments,” Phys. Rev. E 80, 026705 (2009).

22. H. T. Philipp, K. Ayyer, M. W. Tate, V. Elser, and S. M. Gruner, “Solving structure with sparse, randomly-oriented x-ray data,” Opt. Express 20, 13129–13137 (2012).

23. M. Nakano, O. Miyashita, S. Jonic, C. Song, D. Nam, Y. Joti, and F. Tama, “Three-dimensional reconstruction for coherent diffraction patterns obtained by XFEL,” J. Synchrotron Radiat. 24, 727–737 (2017).

24. C. Gorba, O. Miyashita, and F. Tama, “Normal-mode flexible fitting of high-resolution structure of biological molecules toward one-dimensional low-resolution data,” Biophys. J. 94, 1589–1599 (2008).

25. A. G. Kikhney, A. Panjkovich, A. V. Sokolova, and D. I. Svergun, “DARA: A web server for rapid search of structural neighbours using solution small angle X-ray scattering data,” Bioinformatics 32, 616–618 (2016).

26. H. Liu, A. Hexemer, and P. H. Zwart, “The Small Angle Scattering ToolBox (SASTBX): An open-source software for biomolecular small-angle scattering,” J. Appl. Crystallogr. 45, 587–593 (2012).

27. M. Topf, K. Lasker, B. Webb, H. Wolfson, W. Chiu, and A. Sali, “Protein structure fitting and refinement guided by cryo-EM density,” Structure 16, 295–307 (2008).

28. R. McGreevy, I. Teo, A. Singharoy, and K. Schulten, “Advances in the molecular dynamics flexible fitting method for cryo-EM modeling,” Methods 100, 50–60 (2016).

29. O. Miyashita, C. Kobayashi, T. Mori, Y. Sugita, and F. Tama, “Flexible fitting to cryo-EM density map using ensemble molecular dynamics simulations,” J. Comput. Chem. 38, 1447–1461 (2017).

30. H. Liu, B. K. Poon, D. K. Saldin, J. C. H. Spence, and P. H. Zwart, “Three-dimensional single-particle imaging using angular correlations from X-ray laser data,” Acta Crystallogr. Sect. A Foundations Crystallogr. 69, 365–373 (2013).

31. A. Tokuhisa, S. Jonic, F. Tama, and O. Miyashita, “Hybrid approach for structural modeling of biological systems from X-ray free electron laser diffraction patterns,” J. Struct. Biol. 194, 325–336 (2016).

32. T. Kawabata, “Multiple subunit fitting into a low-resolution density map of a macromolecular complex using a Gaussian mixture model,” Biophys. J. 95, 4643–4658 (2008).

33. M. F. Hantke, T. Ekeberg, and F. R. Maia, “Condor: A simulation tool for flash X-ray imaging,” J. Appl. Crystallogr. 49, 1356–1362 (2016).

34. A. Tokuhisa, J. Arai, Y. Joti, Y. Ohno, T. Kameyama, K. Yamamoto, M. Hatanaka, B. Gerofi, A. Shimada, M. Kurokawa, F. Shoji, K. Okada, T. Sugimoto, M. Yamaga, R. Tanaka, M. Yokokawa, A. Hori, Y. Ishikawa, T. Hatsui, and N. Go, “High-speed classification of coherent X-ray diffraction patterns on the K computer for high-resolution single biomolecule imaging,” J. Synchrotron Radiat. 20, 899–904 (2013).

35. R. Jørgensen, P. A. Ortiz, A. Carr-Schmid, P. Nissen, T. G. Kinzy, and G. R. Andersen, “Two crystal structures demonstrate large conformational changes in the eukaryotic ribosomal translocase,” Nat. Struct. Mol. Biol. 10, 379–385 (2003).

36. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, “GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers,” SoftwareX 1–2, 19–25 (2015).

37. M. Nakasako, T. Fujisawa, S. Adachi, T. Kudo, and S. Higuchi, “Large-scale domain movements and hydration structure changes in the active-site cleft of unligated glutamate dehydrogenase from Thermococcus profundus studied by cryogenic X-ray crystal structure analysis and small-angle X-ray scattering,” Biochemistry 40, 3069–3079 (2001).

38. C. Chaudhry, A. L. Horwich, A. T. Brunger, and P. D. Adams, “Exploring the structural dynamics of the E. coli chaperonin GroEL using translation-libration-screw crystallographic refinement of intermediate states,” J. Mol. Biol. 342, 229–245 (2004).

39. R. J. Ford, A. M. Barker, S. E. Bakker, R. H. Coutts, N. A. Ranson, S. E. Phillips, A. R. Pearson, and P. G. Stockley, “Sequence-specific, RNA-protein interactions overcome electrostatic barriers preventing assembly of satellite tobacco necrosis virus coat protein,” J. Mol. Biol. 425, 1050–1064 (2013).

40. D. J. Filman, R. Syed, M. Chow, A. J. Macadam, P. D. Minor, and J. M. Hogle, “Structural factors that control conformational transitions and serotype specificity in type 3 poliovirus,” EMBO J. 8, 1567–1579 (1989).

41. W. Humphrey, A. Dalke, and K. Schulten, “VMD: Visual molecular dynamics,” J. Mol. Graph. pp. 33–38 (1996).

42. S. Jonić, C. O. S. Sorzano, P. Thévenaz, C. El-Bez, S. De Carlo, and M. Unser, “Spline-based image-to-volume registration for three-dimensional electron microscopy,” Ultramicroscopy 103, 303–317 (2005).

43. J. M. de la Rosa-Trevín, J. Otón, R. Marabini, A. Zaldívar, J. Vargas, J. M. Carazo, and C. O. S. Sorzano, “Xmipp 3.0: An improved software suite for image processing in electron microscopy,” J. Struct. Biol. 184, 321–328 (2013).

protein name (pdb id)	N	V [Å³]	reference
yeast elongation factor 2 (1n0u)	6375	6.9 × 10⁴	[35]
yeast elongation factor 2 (1n0v)	6419	6.9 × 10⁴	[35]
glutamate dehydrogenase (1euz)	19337	2.1 × 10⁵	[37]
groEL (1sx4)	50080	6.4 × 10⁵	[38]
satellite tobacco necrosis virus (4bcu)	88140	9.5 × 10⁵	[39]
polio virus (2plv)	398820	4.3 × 10⁶	[40]

Optics Express

N_g	m (1n0u)	m (1n0v)	n (1n0u)	n (1n0v)
20	4	2	9	8
30	9	7	10	10
40	10	9	10	10
80	10	10	10	10
100	10	10	10	10
200	10	10	10	10