"SD-WAN" Health Check (Performance SLA)

SD-WAN health check:

Corresponding CLI:
FGT100E_Master # config system virtual-wan-link
FGT100E_Master (virtual-wan-link) # config health-check
FGT100E_Master (health-check) # edit 114_Check
FGT100E_Master (114_Check) # show
config health-check
    edit "114_Check"
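        set server "114.114.114.114"        // Up to two probe servers can be specified; each server can be an IP address or an FQDN
        set members 3 1 2 4                    // SD-WAN members (outgoing interfaces) from which the health-check probes are sent
        config sla                                     // SLA targets are optional. In SD-WAN rules, the "Lowest Cost" rule (mode sla) and the "Maximize Bandwidth" rule (mode load-balance) rely on the values measured against the SLA targets to evaluate and select the traffic path.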
            edit 1
                set link-cost-factor latency packet-loss
                set latency-threshold 50
                set packetloss-threshold 3
            next
        end
    next
end
FGT100E_Master (114_Check) # show full-configuration
config health-check
    edit "114_Check"
        set probe-packets enable
        set addr-mode ipv4
        set server "114.114.114.114"
        set protocol ping                               // SLA probe protocols are ping, http, tcp-echo (cli), udp-echo (cli) and twamp (cli).
        set ha-priority 1
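        set interval 500                                // Send a health-check probe every 500 milliseconds
        set failtime 5                                   // After 5 consecutive probes get no reply from either of the two SLA servers, the unresponsive SD-WAN member is marked dead and removed from the SD-WAN path-selection interface group
        set recoverytime 5                         // After either of the two SLA servers replies to 5 consecutive probes, the member switches back to alive and is restored to the SD-WAN path-selection interfaces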
        set update-cascade-interface enable
        set update-static-route enable
        set sla-fail-log-period 0
        set sla-pass-log-period 0
        set threshold-warning-packetloss 0
        set threshold-alert-packetloss 0
        set threshold-warning-latency 0
        set threshold-alert-latency 0
        set threshold-warning-jitter 0
        set threshold-alert-jitter 0
        set members 3 1 2 4
        config sla
            edit 1
                set link-cost-factor latency packet-loss
                set latency-threshold 50
                set packetloss-threshold 3
            next
        end
    next
end
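
The interaction of interval, failtime and recoverytime described in the comments above can be sketched as a small state machine. This is only an illustrative model under stated assumptions, not FortiOS code: the class MemberState and the function detection_time_ms are hypothetical names, and "replied" is assumed to mean that at least one of the configured SLA servers answered the probe.

```python
from dataclasses import dataclass


@dataclass
class MemberState:
    """Hypothetical model of one SD-WAN member's alive/dead state."""
    failtime: int = 5        # consecutive lost probes before the member is marked dead
    recoverytime: int = 5    # consecutive replies before the member is marked alive again
    alive: bool = True
    _lost: int = 0           # current run of lost probes
    _ok: int = 0             # current run of answered probes

    def probe(self, replied: bool) -> bool:
        """Feed one probe result and return the resulting state (True = alive)."""
        if replied:
            self._ok += 1
            self._lost = 0
            if not self.alive and self._ok >= self.recoverytime:
                self.alive = True
        else:
            self._lost += 1
            self._ok = 0
            if self.alive and self._lost >= self.failtime:
                self.alive = False
        return self.alive


def detection_time_ms(interval_ms: int, failtime: int) -> int:
    """Approximate time needed to declare a member dead (assumes one probe per interval)."""
    return interval_ms * failtime
```

With the values shown above (interval 500, failtime 5), this model gives roughly 2.5 seconds between the first lost probe and the member being marked dead.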


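How the server IP is handled in the background:

Each configured SLA server IP adds FIB route entries to the FortiGate kernel, so that every SD-WAN member interface has its own independent route to the SLA server IP.
These dedicated kernel routes (marked "proto=17") make the reachability from each SD-WAN member to the SLA server completely independent of the regular routing/forwarding table (static and dynamic routes), so probes to the SLA server can be sent regardless of the state of the static or dynamic routes.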

FGT100E_Master # diag ip address list
IP=10.10.10.1->10.10.10.1/255.255.255.0 index=5 devname=dmz
IP=192.168.91.13->192.168.91.13/255.255.255.0 index=6 devname=mgmt
IP=202.100.1.100->202.100.1.100/255.255.255.0 index=7 devname=wan1
IP=101.100.1.10->101.100.1.10/255.255.255.0 index=8 devname=wan2
IP=192.168.10.1->192.168.10.1/255.255.255.0 index=11 devname=port1
IP=111.100.1.10->111.100.1.10/255.255.255.0 index=23 devname=port13
IP=127.0.0.1->127.0.0.1/255.0.0.0 index=30 devname=root
IP=169.254.1.1->169.254.1.1/255.255.255.0 index=33 devname=fortilink
IP=127.0.0.1->127.0.0.1/255.0.0.0 index=34 devname=vsys_ha
IP=169.254.0.1->169.254.0.1/255.255.255.192 index=35 devname=port_ha
IP=127.0.0.1->127.0.0.1/255.0.0.0 index=36 devname=vsys_fgfm
IP=169.254.0.65->169.254.0.65/255.255.255.192 index=37 devname=havdlink0
IP=169.254.0.66->169.254.0.66/255.255.255.192 index=38 devname=havdlink1
IP=127.0.0.1->127.0.0.1/255.0.0.0 index=39 devname=vsys_hamgmt
IP=114.100.1.204->114.100.1.196/255.255.255.255 index=47 devname=PPPOE1_DR_PENG

FGT100E_Master # get router info kernel | grep 114.114    // Routes installed in the kernel, independent of the regular routing table
tab=254 vf=0 scope=0 type=1 proto=17 prio=0 114.100.1.204/255.255.255.255/0->114.114.114.114/32 pref=0.0.0.0 gwy=114.100.1.196 dev=47(PPPOE1_DR_PENG)
tab=254 vf=0 scope=0 type=1 proto=17 prio=0 101.100.1.10/255.255.255.255/0->114.114.114.114/32 pref=0.0.0.0 gwy=101.100.1.192 dev=8(wan2)
tab=254 vf=0 scope=0 type=1 proto=17 prio=0 202.100.1.100/255.255.255.255/0->114.114.114.114/32 pref=0.0.0.0 gwy=202.100.1.192 dev=7(wan1)
tab=254 vf=0 scope=0 type=1 proto=17 prio=0 111.100.1.10/255.255.255.255/0->114.114.114.114/32 pref=0.0.0.0 gwy=111.100.1.192 dev=23(port13)

tab=254 vf=0 scope=0 type=1 proto=17 prio=0 114.100.1.204/255.255.255.255/0->114.114.119.119/32 pref=0.0.0.0 gwy=114.100.1.196 dev=47(PPPOE1_DR_PENG)
tab=254 vf=0 scope=0 type=1 proto=17 prio=0 101.100.1.10/255.255.255.255/0->114.114.119.119/32 pref=0.0.0.0 gwy=101.100.1.192 dev=8(wan2)
tab=254 vf=0 scope=0 type=1 proto=17 prio=0 202.100.1.100/255.255.255.255/0->114.114.119.119/32 pref=0.0.0.0 gwy=202.100.1.192 dev=7(wan1)
tab=254 vf=0 scope=0 type=1 proto=17 prio=0 111.100.1.10/255.255.255.255/0->114.114.119.119/32 pref=0.0.0.0 gwy=111.100.1.192 dev=23(port13)
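
To check that every SD-WAN member has its own probe route to each SLA server, the output above can be grouped by destination. The following is a minimal sketch that assumes the line layout shown in this capture; the regular expression and the function name probe_routes are illustrative and may need adjusting for other FortiOS versions.

```python
import re

# Matches the fields of interest in one "get router info kernel" line.
# Assumption: the layout is the one shown in the capture above.
ROUTE_RE = re.compile(
    r"proto=(?P<proto>\d+).*?"
    r"(?P<src>\d+\.\d+\.\d+\.\d+)/\S+->(?P<dst>\d+\.\d+\.\d+\.\d+)/(?P<plen>\d+)"
    r".*?gwy=(?P<gwy>\S+)\s+dev=\d+\((?P<dev>[^)]+)\)"
)


def probe_routes(kernel_output: str) -> dict[str, list[tuple[str, str]]]:
    """Group proto=17 routes by SLA server: {server_ip: [(interface, gateway), ...]}."""
    routes: dict[str, list[tuple[str, str]]] = {}
    for line in kernel_output.splitlines():
        m = ROUTE_RE.search(line)
        if m and m["proto"] == "17":
            routes.setdefault(m["dst"], []).append((m["dev"], m["gwy"]))
    return routes
```

Each SLA server should then list one (interface, gateway) pair per member of the health check.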

Health-check SLA history:

FGT100E_Master # diag sys virtual-wan-link health-check 114_Check  // Real-time SLA health-check measurements
Health Check(114_Check):
Seq(3): state(alive), packet-loss(0.000%) latency(103.851), jitter(2.119) sla_map=0x0
Seq(1): state(alive), packet-loss(0.000%) latency(104.251), jitter(1.978) sla_map=0x0
Seq(2): state(alive), packet-loss(1.000%) latency(104.010), jitter(2.366) sla_map=0x0
Seq(4): state(alive), packet-loss(1.000%) latency(104.442), jitter(2.415) sla_map=0x0

FGT100E_Master # diag sys virtual-wan-link sla-log 114_Check 1   // SLA history log of a single member interface over the last ten minutes
Timestamp: Tue Oct 22 18:08:56 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.884, jitter: 1.693, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:08:56 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.851, jitter: 1.709, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:08:57 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.861, jitter: 1.679, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:08:57 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.943, jitter: 1.724, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:08:58 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.875, jitter: 1.672, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:08:58 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.818, jitter: 1.660, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:08:59 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.595, jitter: 1.493, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:08:59 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.623, jitter: 1.314, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:00 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.647, jitter: 1.310, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:00 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.512, jitter: 1.163, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:01 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.527, jitter: 1.014, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:01 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.547, jitter: 1.019, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:02 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.627, jitter: 1.078, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:02 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.654, jitter: 1.130, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:03 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.723, jitter: 1.135, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:03 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.688, jitter: 1.181, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:04 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.662, jitter: 1.172, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:04 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.596, jitter: 1.148, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:05 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.532, jitter: 1.146, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:05 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.464, jitter: 1.141, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:06 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.475, jitter: 1.144, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:06 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.455, jitter: 1.175, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:07 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.431, jitter: 1.170, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:08 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.475, jitter: 1.239, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:08 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.398, jitter: 1.206, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:09 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.377, jitter: 1.150, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:09 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.433, jitter: 1.099, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:10 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.424, jitter: 1.082, packet loss: 1.000%.
Timestamp: Tue Oct 22 18:09:10 2019, vdom root, health-check 114_Check, interface: wan1, status: up, latency: 102.479, jitter: 1.145, packet loss: 1.000%.

Enabling SLA logging:
config system virtual-wan-link
config health-check
    edit "114_Check"
        set server "114.114.114.114" "114.114.119.119"
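        set sla-fail-log-period 300    // While a member is in the SLA fail state, generate an SLA log at this interval (in seconds)
        set sla-pass-log-period 120 // While a member is in the SLA pass state, generate an SLA log at this interval (in seconds)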
        set members 3 1 2 4
        config sla
            edit 1
                set link-cost-factor latency packet-loss
                set latency-threshold 50
                set packetloss-threshold 3
            next
        end
    next
end

SLA PASS log:
date=… time=… logid="0100022925" type="event" subtype="system" level="information" vd="root" eventtime=1558965416 logdesc="Link monitor SLA information" name="114_Check" interface="wan1" status="up" msg="Latency: 102.700, jitter: 0.188, packet loss: 0.000%, inbandwidth: 100Mbps, outbandwidth: 100Mbps, bibandwidth: 200Mbps, sla_map: 0x0"

SLA FAIL log:
date=… time=… logid="0100022925" type="event" subtype="system" level="notice" vd="root" eventtime=1558967950 logdesc="Link monitor SLA information" name="114_Check" interface="port13" status="down" msg="Latency: 0.000, jitter: 0.000, packet loss: 5.000%, inbandwidth: 100Mbps, outbandwidth: 100Mbps, bibandwidth: 200Mbps, sla_map: 0x0"
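
For anyone post-processing these SLA logs, the key=value layout can be split with a short regular expression. This is a sketch that assumes the log format shown above; parse_log and parse_msg are hypothetical helper names, not part of any Fortinet tool.

```python
import re

# key=value pairs, where the value is either a quoted string or a bare token.
KV_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_log(line: str) -> dict[str, str]:
    """Split one event log line into a dict of its key=value fields."""
    return {key: value.strip('"') for key, value in KV_RE.findall(line)}


def parse_msg(msg: str) -> dict[str, str]:
    """Split the msg field ("Latency: 102.700, jitter: 0.188, ...") into metrics."""
    metrics = {}
    for part in msg.split(", "):
        name, _, value = part.partition(": ")
        metrics[name.lower()] = value
    return metrics
```

For example, parse_msg(parse_log(line)["msg"])["packet loss"] returns the packet-loss string of a log line.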

FGT100E_Master # diagnose sys virtual-wan-link member
Member(1): interface: wan1, gateway: 202.100.1.192, priority: 0, weight: 0
Member(2): interface: wan2, gateway: 101.100.1.192, priority: 100, weight: 0
Member(3): interface: port13, gateway: 111.100.1.192, priority: 0, weight: 0
Member(4): interface: PPPOE1_DR_PENG, gateway: 114.100.1.196, priority: 0, weight: 0

FGT100E_Master # diagnose sys virtual-wan-link health-check 114_Check   // When all health checks succeed
Health Check(114_Check):
Seq(3): state(alive), packet-loss(0.000%) latency(37.532), jitter(4.034) sla_map=0x1
Seq(1): state(alive), packet-loss(0.000%) latency(37.421), jitter(3.994) sla_map=0x1
Seq(2): state(alive), packet-loss(0.000%) latency(37.424), jitter(3.935) sla_map=0x1
Seq(4): state(alive), packet-loss(0.000%) latency(37.623), jitter(3.815) sla_map=0x1

FGT100E_Master # diagnose sys virtual-wan-link health-check 114_Check   // Packet loss is induced on Member 3 (port13); the failed health check now reports that member as dead
Health Check(114_Check):
Seq(3): state(dead), packet-loss(96.000%) sla_map=0x0
Seq(1): state(alive), packet-loss(0.000%) latency(27.498), jitter(1.385) sla_map=0x1
Seq(2): state(alive), packet-loss(0.000%) latency(27.482), jitter(1.442) sla_map=0x1
Seq(4): state(alive), packet-loss(0.000%) latency(27.703), jitter(1.426) sla_map=0x1
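
The sla_map values in these outputs appear to follow the SLA target configured earlier (latency-threshold 50, packetloss-threshold 3): members at about 104 ms show sla_map=0x0, members at about 27 to 37 ms show sla_map=0x1, and the dead member shows 0x0. The sketch below parses the output and re-evaluates SLA target 1 under that reading. The function names, the interpretation of bit 0 as "SLA target 1 met", and the use of "less than or equal" for the comparison are assumptions rather than documented behavior.

```python
import re

# One "Seq(...)" line; latency and jitter are absent for dead members.
LINE_RE = re.compile(
    r"Seq\((?P<seq>\d+)\): state\((?P<state>\w+)\), packet-loss\((?P<loss>[\d.]+)%\)"
    r"(?: latency\((?P<latency>[\d.]+)\), jitter\((?P<jitter>[\d.]+)\))?"
    r" sla_map=(?P<sla_map>0x[0-9a-fA-F]+)"
)


def parse_health_check(output: str) -> list[dict]:
    """Turn the diagnose output into one dict per SD-WAN member."""
    members = []
    for m in LINE_RE.finditer(output):
        d = m.groupdict()
        members.append({
            "seq": int(d["seq"]),
            "state": d["state"],
            "loss": float(d["loss"]),
            "latency": float(d["latency"]) if d["latency"] else None,
            "sla_map": int(d["sla_map"], 16),
        })
    return members


def meets_sla(member: dict, latency_threshold: float = 50.0,
              loss_threshold: float = 3.0) -> bool:
    """Assumed evaluation of SLA target 1 (link-cost-factor latency packet-loss)."""
    if member["state"] != "alive" or member["latency"] is None:
        return False
    return member["latency"] <= latency_threshold and member["loss"] <= loss_threshold
```

Comparing meets_sla(member) with bit 0 of member["sla_map"] for each parsed line is a quick way to cross-check the thresholds against what the device reports.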

FGT100E_Master # diagnose sys virtual-wan-link service  // Under the corresponding SD-WAN rule, the port13 member is also shown as dead
Service(1): Address Mode(IPV4) flags=0x60
  TOS(0x0/0x0), Protocol(0: 1->65535), Mode(priority), link-cost-factor(latency), link-cost-threshold(10), health-check(Default_Office_365)
  Service role: standalone
  Member sub interface:
  Members:
    1: Seq_num(1), alive, latency: 53.240, selected
    2: Seq_num(2), alive, latency: 51.667, selected
    3: Seq_num(4), alive, latency: 82.328, selected
    4: Seq_num(3), dead   // No longer used for path selection by the SD-WAN rule
  Internet Service: Microsoft-Office365(327782) Microsoft-Office365.Published(327880)
  Src address:
        192.168.10.0-192.168.10.255

FGT100E_Master # diag firewall proute list
list route policy info(vf=root):

id=2084110337 vwl_service=1(OFFICE_365) vwl_mbr_seq=1 2 4 dscp_tag=0xff 0xff flags=0x0 tos=0x00 tos_mask=0x00 protocol=0 sport=0:65535 iif=0 dport=1-65535 oif=7 gwy=202.100.1.192 oif=8 gwy=101.100.1.192 oif=47 gwy=114.100.1.196   // oif 23 (port13) has been removed from the SD-WAN forwarding table
source(1): 192.168.10.0-192.168.10.255
destination wildcard(1): 0.0.0.0/0.0.0.0
internet service(2): Microsoft-Office365(327782) Microsoft-Office365.Published(327880)

Because update-static-route is enabled in the health check, the SD-WAN static route through port13 is also marked inactive:

FGT100E_Master # get router info routing-table  database  // The corresponding route also becomes inactive
Routing table for VRF=0
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       > - selected route, * - FIB route, p - stale info

S     > 0.0.0.0/0 [1/0] via 111.100.1.192, port13 inactive    // The corresponding route also becomes inactive
     *>           [1/0] via 114.100.1.196, PPPOE1_DR_PENG
     *>           [1/0] via 202.100.1.192, wan1
     *>           [1/0] via 101.100.1.192, wan2, [100/0]
C    *> 101.100.1.0/24 is directly connected, wan2
C    *> 111.100.1.0/24 is directly connected, port13
C    *> 114.100.1.196/32 is directly connected, PPPOE1_DR_PENG
C    *> 114.100.1.204/32 is directly connected, PPPOE1_DR_PENG
C    *> 192.168.10.0/24 is directly connected, port1
C    *> 202.100.1.0/24 is directly connected, wan1

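As long as the health check fails, port13 is no longer used to forward traffic by policy routes, SD-WAN rules, or the routing table. The effect is similar to link-monitor; in fact both are handled by the same process, the "link-monitor process (lnkmtd)", and this can be confirmed by debugging the lnkmtd process.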



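A related question: when the SD-WAN health check detects a faulty egress, the outgoing interface of the route changes. How does the FortiGate handle forwarding when the outgoing interface of a policy route, SD-WAN rule, or route changes?

Start with this question:
When the next hop/outgoing interface of an SD-WAN rule switches, does the traffic switch immediately?

Two commands need to be understood.
The first command:
config system global
      set snat-route-change disable    // default: disable
end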
end

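When a route changes, are SNAT session entries updated to follow it?
disable  The session entry does not change; the session is kept and continues to use its original egress (default)
enable  The session entry changes: the session is flagged dirty, and subsequent packets are sent up to the CPU to look up the policy, route, and other parameters again and re-match the session. Once the route changes, traffic is forwarded through the new egress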

The second command:
config system interface
    edit "wan1"
        set preserve-session-route disable    // default: disable
    next
end

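Is session preservation forced on the outgoing interface?
disable  Sessions are not preserved
enable  Sessions are forcibly preserved; the next-hop address of the route in the session is not actively updated

This command has the higher priority: once session preservation is enabled on the interface, the session is preserved regardless of whether snat-route-change is enabled, and the session/traffic does not change its outgoing interface.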


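In summary:
By default, existing SNAT sessions are preserved. They do not follow changes in the outgoing interface of the SD-WAN rule/route and keep forwarding through their original outgoing interface, so the traffic of these sessions does not switch. New SNAT sessions go through the normal new-session process: they are sent up to the CPU, the current SD-WAN rules/routes are looked up, and the traffic is forwarded through the new egress.

Without SNAT, for sessions that are purely Layer 3 routed, the session is flagged dirty when the route changes and is forced up to the CPU for a fresh SD-WAN rule/route/policy lookup. Once the outgoing interface of the route changes, the session's traffic immediately switches to the new outgoing interface.

Once preserve-session-route enable is configured on an interface, session preservation on that interface takes priority over the behavior described above.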

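One more parameter (asymmetric routing) affects forwarding behavior when routes change:
config system settings
    set asymroute disable    // default: disable
end

The default is disable: asymmetric routing is not allowed.
In this case, whether a session (NAT or non-NAT) is preserved when a route changes depends on the "snat-route-change" and "preserve-session-route" settings.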

config system settings
    set asymroute enable
end
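When the administrator manually sets it to enable, asymmetric routing is allowed.
In this case, regardless of how "snat-route-change" and "preserve-session-route" are configured, as soon as a route changes the traffic looks up the route again and is forced to the new next hop/egress. All session preservation stops working. This setting has the highest priority of all and overrides every session-preservation mechanism; enabling asymmetric routing is not recommended.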
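
The precedence among the three settings, as described in this section, can be condensed into a small decision function. This is a sketch of the logic stated above, not FortiOS source code: session_action and its parameters are hypothetical names, and the defaults mirror the default values listed for each command.

```python
def session_action(*, snat: bool,
                   snat_route_change: bool = False,
                   preserve_session_route: bool = False,
                   asymroute: bool = False) -> str:
    """Return what happens to an existing session when its route changes.

    "keep"    = the session stays on its original outgoing interface
    "reroute" = the session is re-evaluated and follows the new route
    """
    # asymroute enable overrides every session-preservation mechanism.
    if asymroute:
        return "reroute"
    # preserve-session-route on the outgoing interface outranks snat-route-change.
    if preserve_session_route:
        return "keep"
    # SNAT sessions follow the route change only if snat-route-change is enabled.
    if snat:
        return "reroute" if snat_route_change else "keep"
    # Plain routed (non-SNAT) sessions are marked dirty and follow the new route.
    return "reroute"
```

With all defaults, session_action(snat=True) returns "keep" and session_action(snat=False) returns "reroute", which matches the summary of default behavior given earlier.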


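Notes on traffic switchover after the next hop/outgoing interface of an SD-WAN rule switches:
Because SD-WAN switchovers are relatively frequent, unsuitable values for the parameters above can cause a service interruption every time SD-WAN switches. Unless there is a special requirement, keep all three parameters at their defaults, which helps traffic switch over smoothly when SD-WAN routes change. Make sure you understand what each of the three parameters means before selectively enabling any of them.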