.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0
.. Copyright 2017 AT&T Intellectual Property.  All rights reserved.

VNF Resiliency
-------------------------

The VNF is responsible for meeting its resiliency goals and must factor
in the expected availability of the targeted virtualization environment,
which is likely to be much lower than that of a traditional data center.
Resiliency is defined as the ability of the VNF to respond to error
conditions and continue to provide the intended service. A number of
software resiliency dimensions have been identified as areas that should
be addressed to increase resiliency. As VNFs are deployed into the
Network Cloud, resiliency must be designed into the VNF software to
provide high availability, rather than relying on the Network Cloud to
achieve that end.

Section 4.2 Resiliency in *VNF Guidelines* describes
the overall guidelines for designing VNFs to meet resiliency goals.
Below are more detailed resiliency requirements for VNFs.

All Layer Redundancy
^^^^^^^^^^^^^^^^^^^^^^

Design the VNF to be resilient to failures of the underlying
virtualized infrastructure (Network Cloud). VNF design considerations
include techniques such as multiple VLANs, multiple local and
geographic instances, multiple local and geographic data replication,
and virtualized services such as load balancers.

All Layer Redundancy Requirements


.. req::
    :id: R-52499
    :target: VNF
    :keyword: MUST

    The VNF **MUST** meet its own resiliency goals and not rely
    on the Network Cloud.

.. req::
    :id: R-42207
    :target: VNF
    :keyword: MUST

    The VNF **MUST** be designed with resiliency built in such that the
    resiliency deployment model (e.g., active-active) can be chosen at
    run-time.

.. req::
    :id: R-03954
    :target: VNF
    :keyword: MUST

    The VNF **MUST** survive any single point of failure within
    the Network Cloud (e.g., virtual NIC, VM, disk failure).

.. req::
    :id: R-89010
    :target: VNF
    :keyword: MUST

    The VNF **MUST** survive any single point of software failure
    internal to the VNF (e.g., in-memory structures, JMS message queues).

.. req::
    :id: R-67709
    :target: VNF
    :keyword: MUST

    The VNF **MUST** be designed, built and packaged to enable
    deployment across multiple fault zones (e.g., VNFCs deployed in
    different servers, racks, OpenStack regions, geographies) so that
    in the event of a planned/unplanned downtime of a fault zone, the
    overall operation/throughput of the VNF is maintained.

.. req::
    :id: R-35291
    :target: VNF
    :keyword: MUST

    The VNF **MUST** support the ability to fail over a VNFC
    automatically to other geographically redundant sites, if not
    deployed active-active, to increase the overall resiliency of the VNF.

.. req::
    :id: R-36843
    :target: VNF
    :keyword: MUST

    The VNF **MUST** support the ability of the VNFC to be deployable
    in multi-zoned cloud sites to allow for site support in the event of cloud
    zone failure or upgrades.

.. req::
    :id: R-00098
    :target: VNF
    :keyword: MUST NOT

    The VNF **MUST NOT** lose the ability to provide
    service/function due to a single container restart.

.. req::
    :id: R-79952
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** support container snapshots, if not for rebuild
    and evacuate, then as a rollback or back-out mechanism.

Minimize Cross Data-Center Traffic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Avoid performance-sapping data center-to-data center replication delay
by applying techniques such as caching and persistent transaction paths.
Eliminate the impact of replication delay between data centers by using
a concept of stickiness (i.e., once a client is routed to data center "A",
the client stays with data center "A" until the entire session is
completed).
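
The stickiness concept can be illustrated with a minimal Python sketch.
It is not a prescribed implementation: the ``StickyRouter`` class, the
data center names, and the hash-based initial placement are all
assumptions made for illustration. The only property it demonstrates is
that a session keeps the data center chosen for its first request until
the session ends.

```python
import hashlib

# Hypothetical data center names, used only for illustration.
DATA_CENTERS = ["dc-a", "dc-b", "dc-c"]


class StickyRouter:
    """Pins each session to the data center chosen for its first request."""

    def __init__(self, data_centers):
        self._data_centers = list(data_centers)
        self._assignments = {}  # session_id -> data center name

    def route(self, session_id):
        # Reuse an existing assignment so the session never changes site.
        if session_id not in self._assignments:
            # Assumed placement rule: hash the session id to pick a site.
            digest = hashlib.sha256(session_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % len(self._data_centers)
            self._assignments[session_id] = self._data_centers[index]
        return self._assignments[session_id]

    def end_session(self, session_id):
        # Drop the pin once the entire session has completed.
        self._assignments.pop(session_id, None)
```

A real deployment would more likely hold the assignment in a load
balancer or session cookie; the in-memory dictionary here only keeps the
sketch self-contained.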

Minimize Cross Data-Center Traffic Requirements


.. req::
    :id: R-92935
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** minimize the propagation of state information
    across multiple data centers to avoid cross data center traffic.

Application Resilient Error Handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ensure an application communicating with a downstream peer is equipped
to intelligently handle all error conditions. Make sure code can handle
exceptions seamlessly: implement smart retry logic and implement
multi-point entry (multiple data centers) for back-end system
applications.

Application Resilient Error Handling Requirements


.. req::
    :id: R-26371
    :target: VNF
    :keyword: MUST

    The VNF **MUST** detect communication failures between VNFC
    instances and within or between VNFs, and re-establish communication
    automatically, without manual intervention, to provide service
    continuity.

.. req::
    :id: R-18725
    :target: VNF
    :keyword: MUST

    The VNF **MUST** handle the restart of a single VNFC instance
    without requiring all VNFC instances to be restarted.

.. req::
    :id: R-06668
    :target: VNF
    :keyword: MUST

    The VNF **MUST** handle the start or restart of VNFC instances
    in any order with each VNFC instance establishing or re-establishing
    required connections or relationships with other VNFC instances and/or
    VNFs required to perform the VNF function/role without requiring VNFC
    instance(s) to be started/restarted in a particular order.

.. req::
    :id: R-80070
    :target: VNF
    :keyword: MUST

    The VNF **MUST** handle errors and exceptions so that they do
    not interrupt processing of incoming VNF requests to maintain service
    continuity (where the error is not directly impacting the software
    handling the incoming request).

.. req::
    :id: R-32695
    :target: VNF
    :keyword: MUST

    The VNF **MUST** provide the ability to modify the number of
    retries, the time between retries and the behavior/action taken after
    the retries have been exhausted for exception handling to allow the
    NCSP to control that behavior, where the interface and/or functional
    specification allows for altering behavior.
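
One way to keep the retry count, the delay between retries, and the
action on exhaustion adjustable is to gather them in a single policy
object. The Python sketch below is illustrative only; ``RetryPolicy``,
``call_with_retry``, and their parameters are hypothetical names, not
part of any ONAP interface.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RetryPolicy:
    """Operator-adjustable retry settings (hypothetical names)."""
    max_retries: int = 3
    delay_seconds: float = 0.5
    # Called with the last exception once retries are exhausted.
    # When None, the last exception is re-raised instead.
    on_exhausted: Optional[Callable[[Exception], object]] = None


def call_with_retry(operation, policy, sleep=time.sleep):
    """Run operation(), retrying on any exception according to policy."""
    attempts = max(1, policy.max_retries + 1)
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            # Wait only if another attempt will follow.
            if attempt < attempts - 1:
                sleep(policy.delay_seconds)
    if policy.on_exhausted is not None:
        return policy.on_exhausted(last_error)
    raise last_error
```

Because the policy is plain data, its values could be loaded from
configuration and changed without touching the calling code.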

.. req::
    :id: R-48356
    :target: VNF
    :keyword: MUST

    The VNF **MUST** fully exploit exception handling to the extent
    that resources (e.g., threads and memory) are released when no longer
    needed regardless of programming language.
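
As an illustration of releasing resources through exception handling, the
Python sketch below wraps a limited set of worker slots in a context
manager so that a slot is returned even when the work raises. The
``WorkerSlots`` class is a hypothetical example; other languages would
use their own equivalent (for example ``finally`` blocks or
scope-bound destructors).

```python
import threading
from contextlib import contextmanager


class WorkerSlots:
    """Limits concurrent work and always returns the slot afterwards."""

    def __init__(self, limit):
        self._semaphore = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_use = 0

    @contextmanager
    def slot(self):
        self._semaphore.acquire()
        with self._lock:
            self.in_use += 1
        try:
            yield
        finally:
            # Runs on success and on exception, so the slot never leaks.
            with self._lock:
                self.in_use -= 1
            self._semaphore.release()
```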

.. req::
    :id: R-67918
    :target: VNF
    :keyword: MUST

    The VNF **MUST** handle replication race conditions both locally
    and geo-located in the event of a database instance failure to maintain
    service continuity.

.. req::
    :id: R-36792
    :target: VNF
    :keyword: MUST

    The VNF **MUST** automatically retry/resubmit failed requests
    made by the software to its downstream system to increase the success rate.

.. req::
    :id: R-70013
    :target: VNF
    :keyword: MUST NOT

    The VNF **MUST NOT** require any manual steps to get it ready for
    service after a container rebuild.

.. req::
    :id: R-65515
    :target: VNF
    :keyword: MUST

    The VNF **MUST** provide a mechanism and tool to start VNF
    containers (VMs) without impacting service or service quality, assuming
    another VNF in the same or another geographical location is processing
    service requests.

.. req::
    :id: R-94978
    :target: VNF
    :keyword: MUST

    The VNF **MUST** provide a mechanism and tool to perform a graceful
    shutdown of all the containers (VMs) in the VNF without impacting service
    or service quality, assuming another VNF in the same or another
    geographical location can take over traffic and process service requests.

System Resource Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ensure an application is using appropriate system resources for the task
at hand. For example, do not perform network or I/O operations inside
critical sections: if such operations cannot complete, they can block
other threads or processes or consume memory. Critical sections should
contain only memory operations.
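
A minimal Python sketch of this guidance follows. The lock protects only
an in-memory list swap, and the slow write happens after the lock has
been released. ``EventBuffer`` and its ``sink`` callable are hypothetical
names chosen for the example.

```python
import threading


class EventBuffer:
    """Buffers events in memory and writes them out in batches."""

    def __init__(self, sink):
        self._sink = sink  # hypothetical callable that performs the slow I/O
        self._lock = threading.Lock()
        self._pending = []

    def add(self, event):
        # Critical section: a single in-memory append.
        with self._lock:
            self._pending.append(event)

    def flush(self):
        # Critical section: swap the list, nothing else.
        with self._lock:
            batch, self._pending = self._pending, []
        # The slow operation runs outside the lock, so add() is never
        # blocked by the sink.
        if batch:
            self._sink(batch)
        return len(batch)
```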

System Resource Optimization Requirements


.. req::
    :id: R-22059
    :target: VNF
    :keyword: MUST NOT

    The VNF **MUST NOT** execute long running tasks (e.g., IO,
    database, network operations, service calls) in a critical section
    of code, so as to minimize blocking of other operations and increase
    concurrent throughput.

.. req::
    :id: R-63473
    :target: VNF
    :keyword: MUST

    The VNF **MUST** automatically advertise newly scaled
    components so there is no manual intervention required.

.. req::
    :id: R-74712
    :target: VNF
    :keyword: MUST

    The VNF **MUST** utilize FQDNs (and not IP addresses) for
    both Service Chaining and scaling.

.. req::
    :id: R-41159
    :target: VNF
    :keyword: MUST

    The VNF **MUST** deliver any and all functionality from any
    VNFC in the pool (where pooling is the most suitable solution). The
    VNFC pool member should be transparent to the client. Upstream and
    downstream clients should only recognize the function being performed,
    not the member performing it.

.. req::
    :id: R-85959
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** automatically enable/disable added/removed
    components or sub-components so there is no manual intervention required.

.. req::
    :id: R-06885
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** support the ability to scale down a VNFC pool
    without jeopardizing active sessions. Ideally, an active session should
    not be tied to any particular VNFC instance.

.. req::
    :id: R-12538
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** support load balancing and discovery
    mechanisms in resource pools containing VNFC instances.

.. req::
    :id: R-98989
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** utilize resource pooling (threads,
    connections, etc.) within the VNF application so that resources
    are not being created and destroyed resulting in resource management
    overhead.
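
Resource pooling can be sketched in Python with a fixed-size queue of
pre-created objects. This is an illustrative outline under stated
assumptions: ``ConnectionPool`` and the ``factory`` callable are
hypothetical, and a production pool would also need health checks and
replacement of broken connections.

```python
import queue


class ConnectionPool:
    """Creates a fixed number of connections once and reuses them."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    def acquire(self, timeout=None):
        # Blocks until a connection is idle; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, connection):
        # Return the connection for reuse instead of destroying it.
        self._idle.put(connection)
```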

.. req::
    :id: R-55345
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** use techniques such as "lazy loading" when
    initialization includes loading catalogues and/or lists which can grow
    over time, so that the VNF startup time does not grow at a rate
    proportional to that of the list.
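
The following Python sketch shows one form of lazy loading: entries are
fetched on first use and cached, so startup cost does not depend on the
catalogue size. ``LazyCatalogue`` and the ``loader`` callable are
assumptions made for the example.

```python
class LazyCatalogue:
    """Loads catalogue entries on first access instead of at startup."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callable: key -> entry
        self._cache = {}

    def get(self, key):
        # Only the requested entry is loaded; later calls hit the cache.
        if key not in self._cache:
            self._cache[key] = self._loader(key)
        return self._cache[key]
```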

.. req::
    :id: R-35532
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** release and clear all shared assets (memory,
    database operations, connections, locks, etc.) as soon as possible,
    especially before long running sync and asynchronous operations, so as
    to not prevent use of these assets by other entities.

Application Configuration Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Leverage configuration management audit capabilities to drive conformity
and to develop gold configurations for technologies such as Java and Python.

Application Configuration Management Requirements


.. req::
    :id: R-77334
    :target: VNF
    :keyword: MUST

    The VNF **MUST** allow configurations and configuration parameters
    to be managed under version control to ensure consistent configuration
    deployment, traceability and rollback.

.. req::
    :id: R-99766
    :target: VNF
    :keyword: MUST

    The VNF **MUST** allow configurations and configuration parameters
    to be managed under version control to ensure the ability to rollback to
    a known valid configuration.

.. req::
    :id: R-73583
    :target: VNF
    :keyword: MUST

    The VNF **MUST** allow changes of configuration parameters
    to be consumed by the VNF without requiring the VNF or its sub-components
    to be bounced so that the VNF availability is not affected.
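
As a hedged illustration of consuming configuration changes without a
restart, the Python sketch below re-reads a JSON file on request and
swaps in the new values only when the content has changed. The
``ReloadableConfig`` class, the JSON format, and the digest-based change
detection are assumptions for this example, not a mandated mechanism.

```python
import hashlib
import json
import threading


class ReloadableConfig:
    """Holds configuration values that can be refreshed while running."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._digest = None
        self._values = {}
        self.reload_if_changed()

    def reload_if_changed(self):
        """Re-read the file; return True if new values were applied."""
        with open(self._path, "rb") as handle:
            raw = handle.read()
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._digest:
            return False
        # Parse before swapping: invalid JSON raises here and the
        # previously loaded values stay in effect.
        values = json.loads(raw)
        with self._lock:
            self._values, self._digest = values, digest
        return True

    def get(self, key, default=None):
        with self._lock:
            return self._values.get(key, default)
```

A caller might invoke ``reload_if_changed()`` from a periodic task or a
management command; the choice of trigger is left open here.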

Intelligent Transaction Distribution & Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Leverage intelligent load balancing and redundant components (hardware
and modules) for all transactions, such that a failure in any one
component at any point in the transaction (front end, middleware, or
back end) does not result in a failure of the application or system;
i.e., transactions continue to flow, albeit at a possibly reduced
capacity, until the failed component restores itself. Create redundancy
in all layers (software and hardware) at local and remote data centers,
and minimize interdependencies of components (e.g., data replication,
deploying non-related elements in the same container).

Intelligent Transaction Distribution & Management Requirements


.. req::
    :id: R-21558
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** use intelligent routing by having knowledge
    of multiple downstream/upstream endpoints that are exposed to it, to
    ensure there is no dependency on external services (such as load balancers)
    to switch to alternate endpoints.
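
A minimal Python sketch of endpoint-aware routing is shown below. The
caller knows several endpoints and moves to the next one on a connection
failure, without an external load balancer. ``EndpointRouter``, the
``transport`` callable, and the choice of ``ConnectionError`` as the
failure signal are assumptions for illustration.

```python
class EndpointRouter:
    """Sends requests to the first reachable endpoint in a known list."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._preferred = 0  # index of the last endpoint that worked

    def send(self, request, transport):
        """transport(endpoint, request) is a hypothetical send function."""
        count = len(self._endpoints)
        last_error = None
        for offset in range(count):
            index = (self._preferred + offset) % count
            try:
                response = transport(self._endpoints[index], request)
            except ConnectionError as exc:
                last_error = exc
                continue  # try the next known endpoint
            # Remember the working endpoint for subsequent requests.
            self._preferred = index
            return response
        raise ConnectionError("all endpoints failed") from last_error
```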

.. req::
    :id: R-08315
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** use redundant connection pooling to connect
    to any backend data source that can be switched between pools in an
    automated/scripted fashion to ensure high availability of the connection
    to the data source.

.. req::
    :id: R-27995
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** include control loop mechanisms to notify
    the consumer of the VNF when it exceeds SLA thresholds, so that the
    consumer is able to control its load against the VNF.

Deployment Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^

Reduce opportunity for failure, by human or by machine, through smarter
deployment practices and automation. This can include rolling code
deployments, additional testing strategies, and smarter deployment
automation (remove the human from the mix).

Deployment Optimization Requirements


.. req::
    :id: R-73364
    :target: VNF
    :keyword: MUST

    The VNF **MUST** support at least two major versions of the
    VNF software and/or sub-components to co-exist within production
    environments at any time so that upgrades can be applied across
    multiple systems in a staggered manner.

.. req::
    :id: R-02454
    :target: VNF
    :keyword: MUST

    The VNF **MUST** support the existence of multiple major/minor
    versions of the VNF software and/or sub-components and interfaces that
    support both forward and backward compatibility to be transparent to
    the Service Provider usage.

.. req::
    :id: R-57855
    :target: VNF
    :keyword: MUST

    The VNF **MUST** support hitless staggered/rolling deployments
    between its redundant instances to allow "soak-time/burn in/slow roll"
    which can enable the support of low traffic loads to validate the
    deployment prior to supporting full traffic loads.

.. req::
    :id: R-64445
    :target: VNF
    :keyword: MUST

    The VNF **MUST** support the ability of a requestor of the
    service to determine the version (and therefore capabilities) of the
    service so that Network Cloud Service Provider can understand the
    capabilities of the service.

.. req::
    :id: R-56793
    :target: VNF
    :keyword: MUST

    The VNF **MUST** test for adherence to the defined performance
    budgets at each layer, during each delivery cycle with delivered
    results, so that the performance budget is measured and the code
    is adjusted to meet performance budget.

.. req::
    :id: R-77667
    :target: VNF
    :keyword: MUST

    The VNF **MUST** test for adherence to the defined performance
    budget at each layer, during each delivery cycle so that the performance
    budget is measured and feedback is provided where the performance budget
    is not met.

.. req::
    :id: R-49308
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** test for adherence to the defined resiliency
    rating recommendation at each layer, during each delivery cycle with
    delivered results, so that the resiliency rating is measured and the
    code is adjusted to meet software resiliency requirements.

.. req::
    :id: R-16039
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** test for adherence to the defined
    resiliency rating recommendation at each layer, during each
    delivery cycle so that the resiliency rating is measured and
    feedback is provided where software resiliency requirements are
    not met.

Monitoring & Dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^

Promote dashboarding as a tool to monitor and support the general
operational health of a system. Dashboarding is critical to the
implementation of many resiliency patterns essential to the maintenance
of the system. It can help identify unusual conditions that might
indicate failure or the potential for failure, which contributes to
improving Mean Time to Identify (MTTI), Mean Time to Repair (MTTR), and
post-incident diagnostics.

Monitoring & Dashboard Requirements


.. req::
    :id: R-34957
    :target: VNF
    :keyword: MUST

    The VNF **MUST** provide a method of metrics gathering for each
    layer's performance to identify/document variances in the allocations so
    they can be addressed.

.. req::
    :id: R-49224
    :target: VNF
    :keyword: MUST

    The VNF **MUST** provide unique traceability of a transaction
    through its life cycle to ensure quick and efficient troubleshooting.
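
One possible way to make a transaction traceable is to attach a unique
identifier to every log record emitted while the transaction is being
processed. The Python sketch below uses the standard ``contextvars`` and
``logging`` modules; ``TransactionIdFilter``, ``start_transaction``, and
the ``transaction_id`` record attribute are hypothetical names.

```python
import contextvars
import logging
import uuid

# Holds the identifier of the transaction currently being processed.
_transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copies the current transaction id onto every log record."""

    def filter(self, record):
        record.transaction_id = _transaction_id.get()
        return True


def start_transaction(transaction_id=None):
    """Set (or generate) the id used for subsequent log records."""
    value = transaction_id or uuid.uuid4().hex
    _transaction_id.set(value)
    return value
```

With the filter attached to a handler, a format string such as
``%(transaction_id)s %(message)s`` would place the identifier on each
line.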

.. req::
    :id: R-52870
    :target: VNF
    :keyword: MUST

    The VNF **MUST** provide a method of metrics gathering
    and analysis to evaluate the resiliency of the software from both
    a granular and a holistic standpoint. This includes, but is
    not limited to, thread utilization, errors, timeouts, and retries.

.. req::
    :id: R-92571
    :target: VNF
    :keyword: MUST

    The VNF **MUST** provide operational instrumentation such as
    logging, so as to facilitate quick resolution of issues with the VNF to
    provide service continuity.

.. req::
    :id: R-48917
    :target: VNF
    :keyword: MUST

    The VNF **MUST** monitor for and alert on (at both sender and
    receiver) errant, long-running, and missing file transfers,
    so as to minimize the impact due to file transfer errors.

.. req::
    :id: R-28168
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** use an appropriately configured logging
    level that can be changed dynamically, so as to not cause performance
    degradation of the VNF due to excessive logging.
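
Changing the logging level at run time can be sketched with Python's
standard ``logging`` module. The ``set_log_level`` helper and the logger
name used in the usage note are hypothetical; how the new level reaches
the VNF (management API, configuration push, signal) is not specified
here.

```python
import logging

# Level names accepted by this sketch.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def set_log_level(logger_name, level_name):
    """Change a logger's level while the process keeps running."""
    try:
        level = _LEVELS[level_name.upper()]
    except KeyError:
        raise ValueError("unknown log level: %r" % level_name) from None
    logging.getLogger(logger_name).setLevel(level)
    return level
```

For example, ``set_log_level("vnf.demo", "debug")`` would enable verbose
output during troubleshooting, and a later call with ``"warning"`` would
reduce logging overhead again.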

.. req::
    :id: R-87352
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** utilize Cloud health checks, when available
    from the Network Cloud, from inside the application through APIs to check
    the network connectivity, dropped packet rate, and injection, and to
    automatically fail over to alternate sites if needed.

.. req::
    :id: R-16560
    :target: VNF
    :keyword: SHOULD

    The VNF **SHOULD** conduct a resiliency impact assessment for all
    inter/intra-connectivity points in the VNF to provide an overall resiliency
    rating for the VNF to be incorporated into the software design and
    development of the VNF.