## Abstract

Based on a distributed method of bit-error-rate (BER) monitoring, a novel multi-link faults restoration algorithm is proposed for dynamic optical networks. The concept of fuzzy fault set (FFS) is first introduced for multi-link faults localization, which includes all possible optical equipment or fiber links with a membership describing the possibility of faults. Such a set is characterized by a membership function which assigns each object a grade of membership ranging from zero to one. OSPF protocol extension is designed for the BER information flooding in the network. The BER information can be correlated to link faults through FFS. Based on the BER information and FFS, multi-link faults localization mechanism and restoration algorithm are implemented and experimentally demonstrated on a GMPLS enabled optical network testbed with 40 wavelengths in each fiber link. Experimental results show that the novel localization mechanism has better performance compared with the extended limited perimeter vector matching (LVM) protocol and the restoration algorithm can improve the restoration success rate under multi-link faults scenario.

©2013 Optical Society of America

## 1. Introduction

Dynamic all optical networks with GMPLS control plane is very attractive for the future core network as it can reduce the power consumption of electronic switches and be very efficient in delivering ultra large files for datacenter application and backup. Due to the huge bandwidth supported by the networks, any fault in the fiber or other component can lead to huge amount of data loss, so it is very important to incorporate efficient network monitoring, fast faults diagnosis and restoration mechanisms into the optical network. As an important fault type, multi-link fault will deteriorate the network performance to a large extent, which may occur in a shared risk link group (SRLG) [1]. A SRLG failure may cause multiple links to break simultaneously due to the failure of a common resource. Accordingly, some SRLG based schemes have been proposed to complete survivability for multi-link faults [2, 3]. For example, a dynamic shared-path protection algorithm (DSPP) is proposed based on Shared Risk Link Group (SRLG) for protecting the multi-link failures in WDM mesh networks [2], and a new path-protection algorithm called Enhanced Shared Backup Paths Protection (ESBPP) to designed to provide the complete survivability for double-link failures in wavelength-division-multiplexing optical networks. Compared to previous algorithms for double-link failures, ESBPP can obtain higher resource-utilization ratio and lower blocking probability [3].

Before carrying out protection and restoration mechanism, the network managers must first find out the actual number and the location of faults and where each fault occurs. In order to localize multi-link faults in large scale optical networks, the network manager must deploy an efficient localization algorithm to find out the most possible faulty elements based on received information. In the literature, there are many fast localization algorithms proposed for optical networks monitoring, especially for SRLG localization [4–6]. These methods may use different network models, channel description, information, and processing methodologies to localize link faults [7]. Usually, centralized fault localization mechanisms provide a list of components whose fault explains the observed alarms. For example, pre-computation and sequential diagnosis mechanisms are used to keep up with the scalability [8]. But distributed mechanisms rely on keep-alive or notification messages to locate the root of a fault. One of the most representative distributed-localization mechanisms is the link management protocol (LMP) [9], which is part of the GMPLS protocol suite. LMP is always required on supervisory channel, which should not be in-fiber to avoid loss of the supervisor channel in the event of link faults.

A distributed fault localization protocol, called Limited perimeter Vector Matching (LVM) protocol, is proposed for localizing single link fault in all optical networks [10]. This protocol assumes that no power monitoring is available at each intermediate node and only destination node and edge node are able to detect the power loss or quality degradation of an optical signal. Also parallel limited perimeter vector matching (P-LVM) protocol is proposed for localizing multi-link faults in all optical networks [11]. To handle multi-link faults, it tries to separate each fault in a small perimeter area after identifying each perimeter area with its corresponding fault and then localize the faults in parallel respectively in a distributed manner. In the paper [12], we implement fuzzy fault set (FFS) based multi-link faults localization mechanism in multi-domain large capacity optical networks. It has higher scalability, speed and success rate compared with extended LVM protocol.

On the other hand, the multi-link faults localization mechanism mainly depends on the obtained information, such as the bit-error-rate (BER) of the lightpath. Although analog optical monitoring techniques [13] can be used to estimate the BER of the links by measuring the optical signal-to-noise ratio (OSNR), but the techniques are expensive and the results may not be accurate [8, 14]. What’s more, some other schemes, such as m-cycle, m-path, and m-trail, will occupy extra bandwidth resource [15–17]. Because the problem of multi-link faults location is NP-hard [18], the processing time may become an issue for large scale mesh optical networks, which leads to the poor scalability for the schemes above. Then, we propose a novel BER monitoring approach for transparent and translucent dynamic optical networks, which can be used to monitor, detect and localize multiple soft link faults without incurring any additional optical monitoring equipment [19].

Different from the centralized and distributed fault localization mechanisms, our proposed multi-link fault localization algorithm can be performed at any node of the network based on fault fuzzy set (FFS), which is constructed with the BER information collected distributed. Compared with all-optical out-of-band monitoring mechanisms, such as monitoring-cycles (m-cycles), m-paths, and m-trails, no supervisory lightpaths are needed. Restoration action is necessary after the completion of localization phase. Through the BER monitoring approach and multiple faults localization mechanism, this paper proposes a novel multi-soft link faults restoration algorithm based on FFS.

The rest of this paper is organized as follows: the problem statement is described in Section 2, and then the concept of fault fuzzy set (FFS) is introduced in Section 3. The necessary protocol extension is designed in Section 4, and multi-link faults localization mechanism is proposed based on FFS in Section 5. Section 6 focuses on the multi-link faults restoration algorithm. Numeric results on the GMPLS enabled optical network testbed are discussed in Section 7. Finally Section 8 concludes the paper.

## 2. Problem statement

In order to illustrate the important nature of multi-link faults localization such as the mathematical model, we only consider the optical networks with no receivers to estimate the BER at each intra-domain nodes. Only edge nodes and destination nodes are equipped with receivers to estimate the BER. Then the optical flow traverses along the lightpath without any optical-to-electrical conversion except edge nodes and destination nodes.

We assume each fiber is wavelength multiplexed, so that the number of lightpaths traversing a link could be as large as the multiplexing degree. When a link fails, a set of destination nodes connected to lightpaths detects signal loss or high BER. As shown in Fig. 1
, the established lightpaths in the network are *P _{1}*:

*c-f-e*,

*P*:

_{2}*b-c-f-g-a, P*:

_{3}*c-f-g-h-d*. The destination nodes

*a*,

*d*detect the signal loss or abnormal BER but the destination node

*e*does not detect any problem. So the occurred faults affect lightpaths

*P*and

_{2}*P*. Here we only consider link fault. In the single fault case, we can determine that

_{3}*Link*is the failed link. But under multiple faults, we cannot determine which links are at fault because there are several combinations which can cause the observed status, such as:

_{1}*Link*&

_{3}*Link*,

_{4}*Link*&

_{3}*Link*,

_{5}*Link*&

_{3}*Link*&

_{4}*Link*, etc. In other words, we cannot determine whether these links failed or not and these links can be defined as network risk resources. We introduce the concept of fuzzy fault set (FFS) and put the network risk resources into FFS.

_{5}## 3. Concept of FFS

In fuzzy set theory, the concept of possibility is used instead of probability. Possibility is defined by a number between one (completely possible) and zero (totally impossible). Probability is an appropriate measure of uncertainty if statistical information is available. A classical set includes elements which can either belong to or not belong to the set, whereas elements of a fuzzy set may have various degrees of belonging. Fuzzy fault set theory offers new methods for modeling the inexactness and uncertainty concerning fault restoration. A FFS can be defined as follows:

If *L* is a collection of network links denoted by *x* then a fuzzy fault set *F* in *L* is a set of ordered pairs:

$\mu F(x)$ is called the membership function or grade of membership of *x* in *F*. The membership function describes the degree to which the element *x* belongs to the fuzzy fault set *F*. Operations with a fuzzy set are defined via their membership functions. Basic set operations are intersection, union and complement (Eqs. 1, 2 and 3).

As presented in Fig. 2
, the established light paths in the network are *P1: c-f-e, P2: b-c-f-g-a, P3: c-f-g-h-d*. The destination nodes *a, d* detect the signal loss or high BER but the destination node *e* does not detects anything abnormal. Under multiple faults, Link1, Link2, Link3, Link4 and Link5 are the possible failed links, namely the network risk resources. So the fuzzy fault set should include all these elements. The next issue is how to calculate the membership of each link in the FFS.

We use a bipartite graph to represent the dependency between likely causes and observed BER information, as shown in Fig. 3 . The top partition consists of all possible failed links and the bottom partition consists of all received BER information. An edge exists between a likely cause and received BER if the BER is larger than a threshold. Details about the method of BER information collection and threshold configuration can be found in reference [19]. This bipartite graph can be obtained based on network topology and the total established lightpaths. The multiple fault localization can be described as finding a likely causes set in top partition of the bipartite graph which include a minimal number of faults. Also the obtained likely fault set must be completely adequate to cover all the faulty links. This problem is similar to clique cover problem (CLCP) [20]. In computational complexity theory, finding a minimum clique cover is a graph-theoretical NP-complete problem. Similarly, we cannot find a likely causes set which include minimal number of faults in polynomial time. Even though we obtain the optimal likely cause set, we cannot guarantee the optimal likely causes set has occurred in the practical network because there may be several likely cause sets which include a minimal number of failed faults.

In order to calculate the membership of each likely cause, we specify the following hypothesis: (1) the FFS includes all likely causes. FFS is composed of all the obtained likely cause sets. (2) The summation on membership of each obtained likely cause set is 1. Hypothesis (1) indicates that FFS is a complete set, and it includes all possible failed elements. Hypothesis (2) indicates that each obtained likely cause set can lead to all observed symptoms and is the most likely cause. Below we give the formula of the membership calculation method. We use *SYM _{x}* to represent the BER caused by link

*x*and

*SYM*to represent the total BER at the bottom of Fig. 3.

*COM*(*x*, *y*) represents identical symptoms caused by element *x* and *y*. For the example shown in Fig. 3, *fault _{1}* in the top partition of bipartite graph leads to three BER values (

*b*,

_{1}*b*,

_{2}*b*) in the bottom partition of bipartite graph and the total number of BER values is nine (the BER values here are those larger than the threshold), so the membership of

_{3}*Fault*is 3/9. The

_{1}*fault*&

_{1}*fault*in the top partition of bipartite graph cause six BER values (

_{2}*b*,

_{1}*b*,

_{2}*b*,

_{3}*b*,

_{4}*b*,

_{5}*b*) in the bottom partition of the bipartite graph, so the membership of

_{6}*fault*&

_{1}*fault*is 6/9. The faults presented in Fig. 2 could be modeled into the bipartite graph as presented in Fig. 4 . Based on the above formulas, we can obtain the FFS:

_{2}The meaning of FFS: (1) includes all the possible failed elements. (2) Each element in FFS has a risk of fault (membership). (3) Each element in the FFS can be handled as network risk resource. (4) FFS provides the method how to handle network risk resources.

## 4. OSPF protocol extension for BER information flooding and FFS construction

BER information can be considered as the inputs for multiple faults localization mechanism based on FFS, so it is important to collect the BER information quickly and efficiently. OSPF is an adaptive routing protocol for Internet Protocol (IP) networks, which uses a link state routing algorithm and falls into the group of interior routing protocols, operating within a single autonomous region (AR). OSPF uses a hierarchical model to represent a network, where each AR is connected to the backbone network. The border router of each AR is in charge of connecting and exchanging information with the outside network.

In each AR, OSPF works by developing adjacencies with its neighbors, periodically sending hello packets to neighbors, flooding changes to neighbors whenever a link's status changes, and sending “paranoia updates” to neighbors of all recent link state changes every 30 minutes. In this paper, a new sub-Type Length Values (TLVs), i.e. Path-BER sub-TLV, is designed for the Opaque 10 Link State Advertisement (LSA) of OSPF protocol to support the BER monitoring. Path-BER sub-TLV contains the total BER value along a connection request’s path, and detailed information about the lightpath including the source node address, destination node address, and the wavelength serves the connection request.

Then we can construct the bipartite graph based on the received BER information. We use the *Bip_G* to represent the bipartite graph representing the dependency between BER values and faults. *F* represents the likely fault set and *A* represents the received BER value set. *a* represents one received BER and Fault(*a*) represents all the likely faults which can cause the BER *a*. NLP represents the normal light path set and *n* represents one normal light path in the normal light path set. Normal (*n*) represents the total normal elements in this normal light path.

Algorithm 1 Bip_G |
---|

Require: Bip_G to be set up |

1: F←Φ |

2: $a\in A$dofor |

3: Find all likely fault: Fault (a) and record the dependency |

4: F = F $\cup $Fault (a) |

5: end for |

6: $n\in NLP$dofor |

7: Find all normal element Normal (n) |

8: F = F /Normal (n) |

9: end for |

10: Return A, F and the dependency |

The output of algorithm 1 is the needed *Bip_G* and then we use Eq. (5) to calculate the membership of each element in *F*. Each element in *F* has a membership when this set becomes FFS while including all the likely faults. But from Algorithm 1, the output may be several separated bipartite graphs. We could obtain several separated FFS with the corresponding membership.

Equation (5) is used to calculate the membership of one failed element in the FFS but is not suitable for calculating the total risk of a light path. Equation (6) is used to calculate the total risk of a lightpath when it uses the network risk resources in one FFS. Equation (7) is used to calculate the total risk of a light path when it needs to use the network risk resources in several separated FFS.

We use the *Tol_R* represent the total risk of a light path. *L* represents the light path and *l* represents the one link along light path. *FT* represents total fuzzy fault sets obtained in the ** Algorithm1** and

*F*represents one separated fuzzy fault set in

_{i}*FT*. A strategy is proposed for using fuzzy fault sets in the restoration phase. It computes the total risk of a light path

*L*through calculating the summation on membership of each link along the light path

*L*.

Algorithm 2 Tol_R |
---|

Require: Tol_R to be calculated |

1: Tol_R←0 |

2: ${F}_{i}\in FT$dofor |

3: $l\in L\&\&l\in {F}_{i}$dofor |

4: Tol_R = Tol_R + ${\mu}_{{F}_{i}}(l)$ |

5: end for |

6: end for |

7: Return Tol_R |

## 5.Multi-link faults localization mechanism based on FFS

We also consider multi-domain all-optical networks with no receivers at each intra-domain nodes to estimate the BER, but only the edge and destination nodes are equipped with the receivers. Paths that span multiple domains are computed using the distributed model with the path computation element (PCE) responsible for each domain. The multiple faults localization can be divided into two phases: BER information flooding phase and faults localization phase. The BER information flooding phase can be implemented based on the OSPF protocol extension in Section 4, and then the faults localization phase can be completed based on PCE responsible for each domain.

In order to localize cross-domain link faults, we divide a lightpath into the inter-domain part and intra-domain part. Firstly we determine whether the faulty link belongs to the inter-domain or intra-domain. If we cannot determine whether an inter-domain link is faulty or not, we put this link into the FFS. Second we continue the faults localization separately in each domain.

- 1) Once the BER flooding phase finishes, the PCE in each domain receives all the BER information caused by the damaged lightpaths (DLP) passing through this domain. There are three special types of light path as depicted in Fig. 5 .

- 2) The received BERs are divided into the inter-domain and intra-domain parts. For the inter-domain case: if only one edge node in the adjacent domains detect the fault, we can conclude that a faulty link lie between this two edge nodes. If two edge nodes in adjacent domains all detect a fault, we put this link into the FFS and calculate the membership according to Eqs. (2, 3, 4). If both edge nodes in adjacent domains do not detect a fault, we conclude that this link is a healthy link. Then PCE will broadcast inter fault location and FFS messages to the PCEs in adjacent domains.
- 3) The fault localization algorithm is executed within each independent domain as depicted in Fig. 6 . In this step, the edge nodes are treated as virtual source nodes or sink nodes. We use a bipartite graph to represent the dependency between the BER and link faults [21]. The top partition consists of the likely faulty links and the bottom partition consists of the received BER. An edge exists between a likely faulty link and a received BER if a damaged light path (DLP) passes through this link and ends at the sink nodes which receive the BER. This bipartite graph is obtained according to the domain topology and the damaged light paths.
- 4) We use the ant colony algorithm to reduce the multi-link faults localization complexity so that we can obtain the minimal number of fault sets. The pheromones concentrations are computed according to the number of the obtained fault sets. The smaller the number of fault sets is, the heavier the pheromone deposition should be. The desirability takes on a value directly proportional to the degree of fault. This value means keeping using the fault with bigger sharing.
- 5) Once the failed intra-domain links are localized, each PCE will update the topology and re-calculate the new lightpath requests based on the updated topology and resource.

## 6. Survival restoration algorithms for multi-link faults

In this new survival restoration algorithm we compute the recovery path for each lightpath whose BER exceeds the threshold value (it is set as 0.5 in this paper) for each damaged light path (DLP) based on the FFS obtained above. If enough network wavelength resources are available, we can compute the recovery paths that avoid using any of the resources in the FFS. This recovery path is the shortest path to the destination node excluding the resources in the FFS. We can guarantee that the recovery path is fully capable of restoring the damaged traffic. On the other hand, if the recovery path cannot be established because there is insufficient wavelength resource, the restoration will fail. In this new scheme, we use the total risk (Tol_R) value to measure the reliability of a recovery path in the FFS. This scheme makes the best effort to restore each damaged lightpath as presented in Fig. 7 .

Based on the received BER information and network topology, the PCE can calculate the bipartite graph representing the dependency between the BER and faults, and the membership of each failed element. Next, the FFS, DLP and *K*-shortest paths are input in the restoration phase, and as the FFS is updated, the number of elements in FFS will decrease. We use the algorithm 2 calculates the total risk of *K*-shortest recovery paths for each DLP. If the membership of one element increases to one, we can conclude that it has a fault and delete it. The FFS is used to deal with uncertain faults, if we can determine that all the elements have failed, then the FFS becomes an ordinary fault set and we only need to delete them in the ordinary fault set.

The restoration phase makes the best effort to recover more DLPs based on the FFS. The inputs are the DLPs, FFS, network topology and *K*-shortest paths. The *K*-shortest paths are used to calculate *K*-shortest recovery paths for each DLP and all these *K*-shortest recovery paths must satisfy the threshold value of the risk in each recovery path. We use the risk as the link metric, which is obtained from the FFS. The threshold value of the risk is represented as *R* in the paper.

In the restoration phase we will successfully recover some DLPs but also find some DLPs cannot be recovered. If the successfully recovered DLPs use the risk resources in the FFS, we can infer that these risk resources have no fault because DLPs have been successfully set up. Then we can delete these resources from the FFS. All the recovery paths which cannot be set up for DLPs in the restoration phase can be seen as the new DLPs then we can update the relationship between BER and faults in the Fig. 3 using these new DLPs. After the steps above, we obtain a new FFS which will more accurately represent the fault status in the network than the FFS in the restoration phase. We use the notations and conventions as shown in Tab.1 .

The *i _{th}* DLP is defined by a

*4*-

*tuple*(

*s*,

_{i}*d*,

_{i}*π*).

_{i}*s*∈

_{i}*V*,

*d*∈

_{i}*V*, are the source and destination nodes of the DLP.

*π*is the number of requested restored light paths. ${c}_{j}^{\omega}$is the risk of using wavelength ${\lambda}_{\omega}$on link$j\in E$. ${c}_{j}^{\omega}={\mu}_{F}(j)$if wavelength ${\lambda}_{\omega}$is a free wavelength on link

_{i}*j*; ${c}_{j}^{\omega}=+\infty $, otherwise (wavelength ${\lambda}_{\omega}$is used by another light path passing through link$j$). ${C}_{i,k}^{\omega}={\displaystyle {\sum}_{j\in {p}_{i,k}}{c}_{j}^{\omega}}$ is the risk of using wavelength ${\lambda}_{\omega}$on

*P*, the

_{i, k}*k*alternate restoration path in

_{th}*R*connecting the source to the destination node of

_{i}*DLP i*. The risk function is determined as follows:

*ε*is a positive value corresponding to the risk accumulation on path

*P*(${\lambda}_{\omega}$is free on all the fiber links of

_{i,k}*P*). ${c}_{i,k}^{\omega}=+\infty $ if wavelength ${\lambda}_{\omega}$ is already used by other light path on at least one fiber link of

_{i,k}*P*. ${\gamma}_{i,k}^{\omega},\text{\hspace{0.17em}}1\le i\le D,\text{\hspace{0.17em}}1\le k\le K,\text{\hspace{0.17em}}1\le \omega \le W$, is a binary value indicating whether wavelength ${\lambda}_{\omega}$ is a path-free wavelength along the

_{i,k}*k*alternate path

_{th}*P*, connecting the source to the destination node of DLP$i$.

_{i,k}*W*-dimensional binary vector showing the availability of all wavelengths on

*P*. ${\sigma}_{i,k}={\displaystyle {\sum}_{\omega =1}^{W}{\gamma}_{i,k}^{W}},\text{\hspace{0.17em}}1\le i\le D,\text{\hspace{0.17em}}1\le k\le K$, is the number of available wavelengths on

_{i,k}*P*. ${B}_{i,k}$is the restoration set of paths in

_{i,k}*P*that has at least one common link with restoration path

*P*.

_{i,k}The following algorithm aims at selecting the paths and wavelengths which can satisfy reliability of restoration path when the number of available wavelengths on each fiber link is taken into account.

Algorithm 3Risk based Restoration Algorithm for DLP $i$ (RRA) |
---|

Require: $i$ DLP to be recovered, return a physical route and a free wavelength for the DLPi |

Input: K, R_{i}, W |

1: each forDLP $i$ do |

2: fork = 1 to K do |

3: ω = 1 to forW do |

4: Compute ${\gamma}_{i,k}^{W}$, ${c}_{j,k}^{\omega}$ |

5: end for |

6: end for |

7: (${{\displaystyle {\sum}_{k=1}^{K}\sigma}}_{i,k}\ge {\pi}_{i}$) and (${c}_{j,k}^{\omega}\le R$), Ifthen |

8: Set up the DLP |

9: k←1 |

10: p←1 |

11: ($\omega \le K$) && ($p\le {\pi}_{i}$) whiledo |

12: ω←1 |

13: ($\omega \le W$) and ($p\le {\pi}_{i}$) whiledo |

14: ${c}_{j,k}^{\omega}\le R$ ifthen |

15: ${c}_{j,k}^{\omega}\leftarrow +\infty $(note: the cost on wavelength ${\lambda}_{\omega}$of all paths in ${B}_{i,k}$that share at least one common fiber link with${P}_{i,k}^{}$) |

16: p = p + 1 |

17: end if |

18: $\omega =\omega +1$ |

19: end while |

20: end while |

21: else |

22: The DLP cannot be set up |

23: end if |

24: end for |

At the DLP_{i}, the associated *K*-alternate restoration paths are considered in turn according to their risk (sum of${\mu}_{F}(j)$). The DLP is set up whenever there is at least one restoration path in *Ri* on which there is at least one wavelength available. The wavelength assigned to this path is selected according to a first-fit scheme whenever multiple wavelengths are available on the considered path. If there is no wavelength to satisfy the demand, the DLP is rejected. Note that whenever there are enough wavelengths available on two or several distinct paths, the minimal ${c}_{j,k}^{\omega}$one is preferred because it is more reliable.

As mentioned above, the restoration phase computes the recovery path for DLP after the multiple faults emerge in the network. The restoration phase is initiated to compute recovery path for the DLP according to the above algorithm. The objective is to minimize the number of DLPs. We assume the KSP, wavelength and FFS are the input parameters. For each DLP, the worst-case is that we must check the each path of KSP and each available wavelength of each path. Hence the worst-case time complexity is O (K*W). In this part, we only consider the time complexity of restoration without adding the time complexity of KSP and multi-link localization.

## 7. Numeric results and analysis

#### 7.1 Testbed deployment

In order to validate the proposed localization mechanism and restoration algorithm, simulations have been conducted on a GMPLS enabled optical network testbed [22] as shown in Fig. 8 . The testbed consists of control plane, transport plane, management plane and service plane. The distributed control plane was implemented on an array of virtual machines created by VMWare software running at IBM servers. Since a virtual machine has its own computation resource and operation system, each control node could independently carry out the extended versions of GMPLS protocols, such as PCEP and OSPF-TE extensions. As shown in Fig. 9 , each protocol stack communicates with other protocol stack through their network cards. So it is the same as running each protocol stack on one computer, while a lot of resource and cost can be saved with this solution.

The simulation topology is composed of two COST239 and two NSFNET topologies as shown in Fig. 10 and each pair is interconnected by three inter-domain links. There are 40 channels in each link, and each channel can be up to 10Gbps. Wavelength continuity is considered within each domain but not for inter-domain. Traffic flows are generated at all sources uniformly, with a mean payload of 10 seconds. And the faults are chosen randomly among all the links.

#### 7.2 Results and analysis

### A. Performance of multiple faults localization mechanism

First we analyze the performance of the proposed multiple faults localization mechanism in terms of success rate and average BER information packets for localizing multi-link failures compared with the extended LVM protocol. Ant-colony heuristic algorithm [12] and LVM protocol runs under three and four faults for different times. For each time, the traffic is generated randomly, including the traffic load, traffic distribution, and service number. The success rate is the ratio of successful localization number and total localization number. And the average BER information packets are the information packets flooding in the network.

It can be seen that the ant-colony heuristic algorithm has better success rate because of the ant algorithm can run continuously, it adapts better to fault changes in real time. The simulation results in Fig. 11 and Fig. 12 show that the ant-based heuristic algorithm is more practical to the large-scale optical transport networks when we change the fault number. It is seen that as the number of lightpath (service number) increases, the BER information alarm packet number goes up slowly and the number of BER information packets is much smaller than the LVM protocol as in Fig. 13 , which means that there is less flooding information in the network.

### B. Performance of multiple faults restoration algorithm

Then we mainly analyze the success rate of the proposed multi-link faults restoration algorithm RRA compared with traditional restoration algorithm, in which the greedy algorithm (algorithm 1) is used to search for the most likely fault set and the D algorithm is used to calculate the recovery paths. In RRA algorithm, we calculate the end-to-end path for the new arrived path request based on FFS. The fault generator can produce concurrent three-link faults and concurrent four-link faults. The threshold value of risk *R* is set from 0.1 to 0.9. The success rate here has the same meaning with the success rate in the section above. But the total service number is all 10000 for different traffic loads.

Under three-link faults as presented in Fig. 14 , the success rate of new arrival path request gradually decreases with the increase of the traffic load. But the GA based success rate of new arrived path request fluctuates much bigger than RRA based success rate. As presented in Fig. 15 , the recovery success rate of damaged lightpath gradually increases with the increase of the traffic load. But the GA based success rate of damaged lightpath fluctuates much bigger than RRA based success rate.

Under four-link faults as presented in Figs. 16
and 17
, the same conclusion can be gained that the GA based recovery mechanism fluctuates much bigger than RRA based recovery mechanism. From the simulation results shown in Figs. 14 to 17, we can also find that the success rate and recovery success rate increase with *R* increasing, because more DLPs can be recovered. More lightpaths will improve the success rate of multi-link localization. The more lightpaths are in the network, the easier multi-link localization can be finished. Hence, the success rate of GA increases along with the traffic load in the Fig. 15 and Fig. 17. On the other hand, the remaining resource in the network decreases with the traffic load increasing, the remaining resource for new arrived path request also decreases. As presented in the Fig. 14 and Fig. 16, the success rate of new arrived path request gradually decreases with traffic load increasing.

## 8 Conclusions

The concept of the fuzzy fault set (FFS) is introduced from the point of NP-hard problem in the multiple faults localization problem and the calculation formula of the membership in the fuzzy fault set is given. A new approach for monitoring dynamic all-optical WDM networks without incurring additional hardware is used for the BER information collection. It can be easily implemented by extending the OSPF protocol. Based on the BER information and FFS, a novel multi-link faults restoration algorithm is proposed and demonstrated on a multi-domain GMPLS enabled optical network testbed. Experimental results show that it can improve the restoration success rate under multi-link faults.

## Acknowledgment

This work has been supported in part by 973 program (2010CB328204, 2012CB315604), 863 program (2012AA011301), NSFC project (61271189, 61201154, 60932004), RFDP Project (20090005110013, 20120005120019), and the Fundamental Research Funds for the Central Universities.

## References and links

**1. **D. Papadimitriou, F. Poppe, U. Dharanikota, R. Hartani, R. Jain, J. Jones, S. Venkatachalam, and Y. Xue, “Inference of shared risk link group,” *draft-papadimitriou-ccamp-srlg-processing-02.txt*, Jun.2003.

**2. **L. Guo, H. Yu, and L. Li, “Dynamic shared-path protection based on SRLG constraints in WDM mesh networks,” *Communications, Circuits and Systems, International Conference on*, Chengdu, China, June 2004.

**3. **L. Guo, L. Li, J. Cao, H. Yu, and X. Wei, “On finding feasible solutions with shared backup resources for surviving double-link failures in path-protected WDM mesh networks,” J. Lightwave Technol. **25**(1), 287–296 (2007). [CrossRef]

**4. **S. Ahuja, S. Ramasubramanian, and M. Krunz, “SRLG failure localization in optical networks,” IEEE ACM T NETWORK. **19**(4), 989–999 (2011). [CrossRef]

**5. **B. Wu, P. Ho, J. Tapolcai, and P. Babarczi, “Optimal allocation of monitoring trails for fast SRLG failure localization in all-optical networks,” *Global Telecommunications Conference (GLOBECOM 2010)*, Miami, Florida, USA, Dec 2010.

**6. **W. He, B. Wu, P. Ho, and J. Tapolcai, “Monitoring trail allocation for SRLG failure localization,” *Global Telecommunications Conference (GLOBECOM 2011)*, Houston, Texas, USA, Dec 2011.

**7. **C. Pinart, “A multilayer fault localization framework for IP over all-optical multilayer networks,” IEEE Netw. **23**(3), 4–9 (2009). [CrossRef]

**8. **C. Mas, I. Tomkos, and O. K. Tonguz, “Fault location algorithm for transparent optical networks,” IEEE J. Sel. Areas Comm. **23**(8), 1508–1519 (2005). [CrossRef]

**9. **J. Lang, ed., “Link Management Protocol (LMP),” *RFC4204*, October 2005.

**10. **A. V. Sichani and H. T. Mouftah, “Limited-perimeter vector matching fault-localization protocol for transparent all-optical communication networks,” IET Commun **1**(3), 472–478 (2007). [CrossRef]

**11. **M. Khair, J. Zheng, and H.T. Mouftah, “Distributed multi-faults localization protocol for all-optical networks,” *ONDM2009*, Braunschweig, Germany, February 2009.

**12. **J. Luo, S.Huang, J. Zhang, X. Li, and W. Gu, “A novel multi-fault localization mechanism in PCE-based multi-domain large capacity optical transport networks,” *OFC/NFOEC2012*, Los Angeles, CA, USA, March 2012.

**13. **D. C. Kilper, R. Bach, D. J. Blumenthal, D. Einstein, T. Landolsi, L. Ostar, M. Preiss, and A. E. Willner, “Optical performance monitoring,” J. Lightwave Technol. **22**(1), 294–304 (2004). [CrossRef]

**14. **T. Mizuochi, T. Kobayashi, J. Abe, K. I. Shida, T. Kobayashi, J. Abe, K. J. K. Motoshima, and K. Kasahara, “A comparative study of DPSK and OOK WDM transmission over transoceanic distances and their performance degradations due to nonlinear phase noise,” J. Lightwave Technol. **21**, 1933–1943 (2003).

**15. **B. Wu, P.-H. Ho, and K. L. Yeung, “Monitoring trail: on fast link failure localization in WDM mesh networks,” J. Lightwave Technol. **27**, 4175–4185 (2009). [CrossRef]

**16. **S. S. Ahuja, S. Ramasubramanian, and M. Krunz, “Single-link failure detection in all-optical networks using monitoring cycles and paths,” IEEE ACM T NETWORK. **17**(4), 1080–1093 (2009). [CrossRef]

**17. **J. Tapolcai, B. Wu and P.-H. Ho, “On monitoring and failure localization in mesh all-optical networks,” *IEEE INFOCOM2009*, Rio De Janeiro, Brazil, Apr. 2009.

**18. **N. S. V. Rao, “Computational complexity issues in operative diagnosis of graph-based systems,” IEEE T. COMPUT. **42**(4), 447–457 (1993). [CrossRef]

**19. **H. Li, Z. Yu, Y. Zhao, and J. Zhang, “Digital bit-error-rate monitoring and soft link-faults diagnosis for dynamic all-optical WDM network,” *ECOC2012,* Amsterdam, The Netherlands, Sep.2012.

**20. **C. Rhee and Y. D. Liang, “An NC algorithm for the clique cover problem in comparability graphs and its application,” J. Inform Process Lett **57**(5), 287–290 (1996). [CrossRef]

**21. **C. F. Lam, “Optical network technologies for datacenter networks,” *OFC/NFOEC2010*, Mountain View, CA, USA, March 2010.

**22. **J. Zhang, Y. Zhao, X. Chen, Y. Ji, M. Zhang, H. Wang, Y. Zhao, Y. Tu, Z. Wang, and H. Li, “The first experimental demonstration of a DREAM-based large-scale optical transport network with 1000 control plane nodes,” Opt. Express **19**(26), B746–B755 (2011). [CrossRef] [PubMed]