From 7ff70854842a9c67e567d471dc05984fa6a22af8 Mon Sep 17 00:00:00 2001
From: mrichomme
Date: Tue, 9 Nov 2021 17:26:29 +0100
Subject: [S3P] Add S3P Istanbul documentation

This section includes the results of
- CI tests
- stability tests
- resiliency tests

Issue-ID: INT-1988
Signed-off-by: mrichomme
Change-Id: I1ccaf28014e6e68022caf723796ffd166363f02b
(cherry picked from commit 8bb34f0622352d0aaaa17e832a05af17eae79bed)
---
 docs/files/csv/s3p-instantiation.csv              |   6 +
 docs/files/csv/s3p-sdc.csv                        |   6 +
 docs/files/s3p/istanbul-dashboard.png             | Bin 0 -> 60652 bytes
 docs/files/s3p/istanbul_daily_healthcheck.png     | Bin 0 -> 21941 bytes
 .../istanbul_daily_infrastructure_healthcheck.png | Bin 0 -> 21499 bytes
 docs/files/s3p/istanbul_daily_security.png        | Bin 0 -> 16609 bytes
 docs/files/s3p/istanbul_daily_smoke.png           | Bin 0 -> 21629 bytes
 .../s3p/istanbul_instantiation_stability_10.png   | Bin 0 -> 90935 bytes
 docs/files/s3p/istanbul_resiliency.png            | Bin 0 -> 15880 bytes
 docs/files/s3p/istanbul_sdc_stability.png         | Bin 0 -> 75166 bytes
 docs/integration-s3p.rst                          | 504 +++++++--------------
 11 files changed, 172 insertions(+), 344 deletions(-)
 create mode 100644 docs/files/csv/s3p-instantiation.csv
 create mode 100644 docs/files/csv/s3p-sdc.csv
 create mode 100644 docs/files/s3p/istanbul-dashboard.png
 create mode 100644 docs/files/s3p/istanbul_daily_healthcheck.png
 create mode 100644 docs/files/s3p/istanbul_daily_infrastructure_healthcheck.png
 create mode 100644 docs/files/s3p/istanbul_daily_security.png
 create mode 100644 docs/files/s3p/istanbul_daily_smoke.png
 create mode 100644 docs/files/s3p/istanbul_instantiation_stability_10.png
 create mode 100644 docs/files/s3p/istanbul_resiliency.png
 create mode 100644 docs/files/s3p/istanbul_sdc_stability.png

diff --git a/docs/files/csv/s3p-instantiation.csv b/docs/files/csv/s3p-instantiation.csv
new file mode 100644
index 000000000..6b3febd3d
--- /dev/null
+++ b/docs/files/csv/s3p-instantiation.csv
@@ -0,0 +1,6 @@
+Parameters;Istanbul;Honolulu
+Number of tests;1310;1410
+Global success rate;97%;96%
+Min duration;193s;81s
+Max duration;2128s;2000s
+mean duration;564s;530s
\ No newline at end of file
diff --git a/docs/files/csv/s3p-sdc.csv b/docs/files/csv/s3p-sdc.csv
new file mode 100644
index 000000000..f89fef24a
--- /dev/null
+++ b/docs/files/csv/s3p-sdc.csv
@@ -0,0 +1,6 @@
+Parameters;Istanbul;Honolulu
+Number of tests;1085;715
+Global success rate;92%;93%
+Min duration;111s;80s
+Max duration;799s;1128s
+mean duration;366s;565s
\ No newline at end of file
diff --git a/docs/files/s3p/istanbul-dashboard.png b/docs/files/s3p/istanbul-dashboard.png
new file mode 100644
index 000000000..f8bad42ad
Binary files /dev/null and b/docs/files/s3p/istanbul-dashboard.png differ
diff --git a/docs/files/s3p/istanbul_daily_healthcheck.png b/docs/files/s3p/istanbul_daily_healthcheck.png
new file mode 100644
index 000000000..e1cf16ae6
Binary files /dev/null and b/docs/files/s3p/istanbul_daily_healthcheck.png differ
diff --git a/docs/files/s3p/istanbul_daily_infrastructure_healthcheck.png b/docs/files/s3p/istanbul_daily_infrastructure_healthcheck.png
new file mode 100644
index 000000000..1e8877d0e
Binary files /dev/null and b/docs/files/s3p/istanbul_daily_infrastructure_healthcheck.png differ
diff --git a/docs/files/s3p/istanbul_daily_security.png b/docs/files/s3p/istanbul_daily_security.png
new file mode 100644
index 000000000..605edb140
Binary files /dev/null and b/docs/files/s3p/istanbul_daily_security.png differ
diff --git a/docs/files/s3p/istanbul_daily_smoke.png b/docs/files/s3p/istanbul_daily_smoke.png
new file mode 100644
index 000000000..cdeb999da
Binary files /dev/null and b/docs/files/s3p/istanbul_daily_smoke.png differ
diff --git a/docs/files/s3p/istanbul_instantiation_stability_10.png b/docs/files/s3p/istanbul_instantiation_stability_10.png
new file mode 100644
index 000000000..73749572a
Binary files /dev/null and b/docs/files/s3p/istanbul_instantiation_stability_10.png differ
diff --git a/docs/files/s3p/istanbul_resiliency.png b/docs/files/s3p/istanbul_resiliency.png
new file mode 100644
index 000000000..567a98c5c
Binary files /dev/null and b/docs/files/s3p/istanbul_resiliency.png differ
diff --git a/docs/files/s3p/istanbul_sdc_stability.png b/docs/files/s3p/istanbul_sdc_stability.png
new file mode 100644
index 000000000..67346cb0d
Binary files /dev/null and b/docs/files/s3p/istanbul_sdc_stability.png differ
diff --git a/docs/integration-s3p.rst b/docs/integration-s3p.rst
index b73a49318..b41a37323 100644
--- a/docs/integration-s3p.rst
+++ b/docs/integration-s3p.rst
@@ -20,10 +20,14 @@ CI results
 ----------
 
 As usual, a daily CI chain dedicated to the release is created after RC0.
-A Honolulu chain has been created on the 6th of April 2021.
+An Istanbul chain has been created on the 5th of November 2021.
 The daily results can be found in `LF daily results web site
-`_.
+`_.
+
+.. image:: files/s3p/istanbul-dashboard.png
+   :align: center
+
 Infrastructure Healthcheck Tests
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -31,8 +35,9 @@ Infrastructure Healthcheck Tests
 These tests deal with the Kubernetes/Helm tests on ONAP cluster.
 
 The global expected criteria is **75%**.
-The onap-k8s and onap-k8s-teardown providing a snapshop of the onap namespace in
-Kubernetes as well as the onap-helm tests are expected to be PASS.
+
+The onap-k8s and onap-k8s-teardown, providing a snapshot of the onap namespace
+in Kubernetes, as well as the onap-helm tests are expected to be PASS.
 
 nodeport_check_certs test is expected to fail. Even tremendous progress have
 been done in this area, some certificates (unmaintained, upstream or integration
@@ -40,7 +45,7 @@ robot pods) are still not correct due to bad certificate issuers (Root CA
 certificate non valid) or extra long validity. Most of the certificates have
 been installed using cert-manager and will be easily renewable.
 
-.. image:: files/s3p/honolulu_daily_infrastructure_healthcheck.png
+.. image:: files/s3p/istanbul_daily_infrastructure_healthcheck.png
   :align: center
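+
+For manual troubleshooting, the certificate exposed on a given NodePort can be
+inspected as sketched below (the IP address and port are placeholders to adapt
+to the target cluster):
+
+.. code-block:: shell
+
+  # print issuer and validity dates of the certificate behind a NodePort
+  echo | openssl s_client -connect 10.12.5.100:30226 2>/dev/null \
+    | openssl x509 -noout -issuer -dates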
 
 Healthcheck Tests
 ~~~~~~~~~~~~~~~~~
 
@@ -49,16 +54,9 @@ Healthcheck Tests
 These tests are the traditionnal robot healthcheck tests and additional tests
 dealing with a single component.
 
-Some tests (basic_onboard, basic_cds) may fail episodically due to the fact that
-the startup of the SDC is sometimes not fully completed.
-
-The same test is run as first step of smoke tests and is usually PASS.
-The mechanism to detect that all the components are fully operational may be
-improved, timer based solutions are not robust enough.
-
 The expectation is **100% OK**.
 
-.. image:: files/s3p/honolulu_daily_healthcheck.png
+.. image:: files/s3p/istanbul_daily_healthcheck.png
   :align: center
 
 Smoke Tests
 ~~~~~~~~~~~
 
@@ -69,13 +67,13 @@ Smoke Tests
 See the :ref:`the Integration Test page ` for details.
 
 The expectation is **100% OK**.
 
-.. figure:: files/s3p/honolulu_daily_smoke.png
+.. figure:: files/s3p/istanbul_daily_smoke.png
   :align: center
 
-An error has been detected on the SDNC preventing the basic_vm_macro to work.
-See `SDNC-1529 `_ for details.
-We may also notice that SO timeouts occured more frequently than in Guilin.
-See `SO-3584 `_ for details.
+An error has been reported since Guilin (https://jira.onap.org/browse/SDC-3508):
+a possible race condition in SDC prevents the completion of the certification
+and leads to onboarding errors. This error may occur in case of parallel
+processing.
 
 Security Tests
 ~~~~~~~~~~~~~~
 
@@ -83,243 +81,126 @@ Security Tests
 These tests are tests dealing with security.
 See the :ref:`the Integration Test page ` for details.
 
-The expectation is **66% OK**. The criteria is met.
+Waivers have been granted on different projects for the different tests.
+The list of waivers can be found in
+https://git.onap.org/integration/seccom/tree/waivers?h=istanbul.
 
-It may even be above as 2 fail tests are almost correct:
+The expectation is **100% OK**. The criteria is met.
 
-- The unlimited pod test is still fail due testing pod (DCAE-tca).
-- The nonssl tests is FAIL due to so and so-etsi-sol003-adapter, which were
-  supposed to be managed with the ingress (not possible for this release) and
-  got a waiver in Frankfurt. The pods cds-blueprints-processor-http and aws-web
-  are used for tests.
-
-.. figure:: files/s3p/honolulu_daily_security.png
+.. figure:: files/s3p/istanbul_daily_security.png
   :align: center
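+
+The waiver list can be reviewed locally, for instance as follows (a sketch,
+assuming the seccom repository is cloneable over https and exposes an istanbul
+branch):
+
+.. code-block:: shell
+
+  # fetch the seccom repository and list the granted waivers
+  git clone --branch istanbul "https://git.onap.org/integration/seccom"
+  ls seccom/waivers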
 
 Resiliency tests
 ----------------
 
 The goal of the resiliency testing was to evaluate the capability of the
-Honolulu solution to survive a stop or restart of a Kubernetes control or
-worker node.
-
-Controller node resiliency
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default the ONAP solution is installed with 3 controllers for high
-availability. The test for controller resiliency can be described as follows:
-
-- Run tests: check that they are PASS
-- Stop a controller node: check that the node appears in NotReady state
-- Run tests: check that they are PASS
-
-2 tests were performed on the weekly honolulu lab. No problem was observed on
-controller shutdown, tests were still PASS with a stoped controller node.
-
-More details can be found in .
-
-Worker node resiliency
-~~~~~~~~~~~~~~~~~~~~~~
-
-In community weekly lab, the ONAP pods are distributed on 12 workers. The goal
-of the test was to evaluate the behavior of the pod on a worker restart
-(disaster scenario assuming that the node was moved accidentally from Ready to
-NotReady state).
-The original conditions of such tests may be different as the Kubernetes
-scheduler does not distribute the pods on the same worker from an installation
-to another.
-
-The test procedure can be described as follows:
-
-- Run tests: check that they are PASS (Healthcheck and basic_vm used)
-- Check that all the workers are in ready state
-  ::
-
-    $ kubectl get nodes
-    NAME                      STATUS   ROLES    AGE   VERSION
-    compute01-onap-honolulu   Ready             18h   v1.19.9
-    compute02-onap-honolulu   Ready             18h   v1.19.9
-    compute03-onap-honolulu   Ready             18h   v1.19.9
-    compute04-onap-honolulu   Ready             18h   v1.19.9
-    compute05-onap-honolulu   Ready             18h   v1.19.9
-    compute06-onap-honolulu   Ready             18h   v1.19.9
-    compute07-onap-honolulu   Ready             18h   v1.19.9
-    compute08-onap-honolulu   Ready             18h   v1.19.9
-    compute09-onap-honolulu   Ready             18h   v1.19.9
-    compute10-onap-honolulu   Ready             18h   v1.19.9
-    compute11-onap-honolulu   Ready             18h   v1.19.9
-    compute12-onap-honolulu   Ready             18h   v1.19.9
-    control01-onap-honolulu   Ready    master   18h   v1.19.9
-    control02-onap-honolulu   Ready    master   18h   v1.19.9
-    control03-onap-honolulu   Ready    master   18h   v1.19.9
-
-- Select a worker, list the impacted pods
-  ::
-
-    $ kubectl get pod -n onap --field-selector spec.nodeName=compute01-onap-honolulu
-    NAME                                             READY   STATUS    RESTARTS   AGE
-    onap-aaf-fs-7b6648db7f-shcn5                     1/1     Running   1          22h
-    onap-aaf-oauth-5896545fb7-x6grg                  1/1     Running   1          22h
-    onap-aaf-sms-quorumclient-2                      1/1     Running   1          22h
-    onap-aai-modelloader-86d95c994b-87tsh            2/2     Running   2          22h
-    onap-aai-schema-service-75575cb488-7fxs4         2/2     Running   2          22h
-    onap-appc-cdt-58cb4766b6-vl78q                   1/1     Running   1          22h
-    onap-appc-db-0                                   2/2     Running   4          22h
-    onap-appc-dgbuilder-5bb94d46bd-h2gbs             1/1     Running   1          22h
-    onap-awx-0                                       4/4     Running   4          22h
-    onap-cassandra-1                                 1/1     Running   1          22h
-    onap-cds-blueprints-processor-76f8b9b5c7-hb5bg   1/1     Running   1          22h
-    onap-dmaap-dr-db-1                               2/2     Running   5          22h
-    onap-ejbca-6cbdb7d6dd-hmw6z                      1/1     Running   1          22h
-    onap-kube2msb-858f46f95c-jws4m                   1/1     Running   1          22h
-    onap-message-router-0                            1/1     Running   1          22h
-    onap-message-router-kafka-0                      1/1     Running   1          22h
-    onap-message-router-kafka-1                      1/1     Running   1          22h
-    onap-message-router-kafka-2                      1/1     Running   1          22h
-    onap-message-router-zookeeper-0                  1/1     Running   1          22h
-    onap-multicloud-794c6dffc8-bfwr8                 2/2     Running   2          22h
-    onap-multicloud-starlingx-58f6b86c55-mff89       3/3     Running   3          22h
-    onap-multicloud-vio-584d556876-87lxn             2/2     Running   2          22h
-    onap-music-cassandra-0                           1/1     Running   1          22h
-    onap-netbox-nginx-8667d6675d-vszhb               1/1     Running   2          22h
-    onap-policy-api-6dbf8485d7-k7cpv                 1/1     Running   1          22h
-    onap-policy-clamp-be-6d77597477-4mffk            1/1     Running   1          22h
-    onap-policy-pap-785bd79759-xxhvx                 1/1     Running   1          22h
-    onap-policy-xacml-pdp-7d8fd58d59-d4m7g           1/1     Running   6          22h
-    onap-sdc-be-5f99c6c644-dcdz8                     2/2     Running   2          22h
-    onap-sdc-fe-7577d58fb5-kwxpj                     2/2     Running   2          22h
-    onap-sdc-wfd-fe-6997567759-gl9g6                 2/2     Running   2          22h
-    onap-sdnc-dgbuilder-564d6475fd-xwwrz             1/1     Running   1          22h
-    onap-sdnrdb-master-0                             1/1     Running   1          22h
-    onap-so-admin-cockpit-6c5b44694-h4d2n            1/1     Running   1          21h
-    onap-so-etsi-sol003-adapter-c9bf4464-pwn97       1/1     Running   1          21h
-    onap-so-sdc-controller-6899b98b8b-hfgvc          2/2     Running   2          21h
-    onap-vfc-mariadb-1                               2/2     Running   4          21h
-    onap-vfc-nslcm-6c67677546-xcvl2                  2/2     Running   2          21h
-    onap-vfc-vnflcm-78ff4d8778-sgtv6                 2/2     Running   2          21h
-    onap-vfc-vnfres-6c96f9ff5b-swq5z                 2/2     Running   2          21h
-
-- Stop the worker (shutdown the machine for baremetal or the VM if you installed
-  your Kubernetes on top of an OpenStack solution)
-- Wait for the pod eviction procedure completion (5 minutes)
-  ::
-
-    $ kubectl get nodes
-    NAME                      STATUS     ROLES    AGE   VERSION
-    compute01-onap-honolulu   NotReady            18h   v1.19.9
-    compute02-onap-honolulu   Ready               18h   v1.19.9
-    compute03-onap-honolulu   Ready               18h   v1.19.9
-    compute04-onap-honolulu   Ready               18h   v1.19.9
-    compute05-onap-honolulu   Ready               18h   v1.19.9
-    compute06-onap-honolulu   Ready               18h   v1.19.9
-    compute07-onap-honolulu   Ready               18h   v1.19.9
-    compute08-onap-honolulu   Ready               18h   v1.19.9
-    compute09-onap-honolulu   Ready               18h   v1.19.9
-    compute10-onap-honolulu   Ready               18h   v1.19.9
-    compute11-onap-honolulu   Ready               18h   v1.19.9
-    compute12-onap-honolulu   Ready               18h   v1.19.9
-    control01-onap-honolulu   Ready      master   18h   v1.19.9
-    control02-onap-honolulu   Ready      master   18h   v1.19.9
-    control03-onap-honolulu   Ready      master   18h   v1.19.9
-
-- Run the tests: check that they are PASS
-
-.. warning::
-   In these conditions, **the tests will never be PASS**. In fact several components
-   will remeain in INIT state.
-   A procedure is required to ensure a clean restart.
-
-List the non running pods::
-
-  $ kubectl get pods -n onap --field-selector status.phase!=Running | grep -v Completed
-  NAME                                             READY   STATUS     RESTARTS   AGE
-  onap-appc-dgbuilder-5bb94d46bd-sxmmc             0/1     Init:3/4   15         156m
-  onap-cds-blueprints-processor-76f8b9b5c7-m7nmb   0/1     Init:1/3   0          156m
-  onap-portal-app-595bd6cd95-bkswr                 0/2     Init:0/4   84         23h
-  onap-portal-db-config-6s75n                      0/2     Error      0          23h
-  onap-portal-db-config-7trzx                      0/2     Error      0          23h
-  onap-portal-db-config-jt2jl                      0/2     Error      0          23h
-  onap-portal-db-config-mjr5q                      0/2     Error      0          23h
-  onap-portal-db-config-qxvdt                      0/2     Error      0          23h
-  onap-portal-db-config-z8c5n                      0/2     Error      0          23h
-  onap-sdc-be-5f99c6c644-kplqx                     0/2     Init:2/5   14         156
-  onap-vfc-nslcm-6c67677546-86mmj                  0/2     Init:0/1   15         156m
-  onap-vfc-vnflcm-78ff4d8778-h968x                 0/2     Init:0/1   15         156m
-  onap-vfc-vnfres-6c96f9ff5b-kt9rz                 0/2     Init:0/1   15         156m
-
-Some pods are not rescheduled (i.e. onap-awx-0 and onap-cassandra-1 above)
-because they are part of a statefulset. List the statefulset objects::
-
-  $ kubectl get statefulsets.apps -n onap | grep -v "1/1" | grep -v "3/3"
-  NAME                            READY   AGE
-  onap-aaf-sms-quorumclient       2/3     24h
-  onap-appc-db                    2/3     24h
-  onap-awx                        0/1     24h
-  onap-cassandra                  2/3     24h
-  onap-dmaap-dr-db                2/3     24h
-  onap-message-router             0/1     24h
-  onap-message-router-kafka       0/3     24h
-  onap-message-router-zookeeper   2/3     24h
-  onap-music-cassandra            2/3     24h
-  onap-sdnrdb-master              2/3     24h
-  onap-vfc-mariadb                2/3     24h
-
-For the pods being part of the statefulset, a forced deleteion is required.
-As an example if we consider the statefulset onap-sdnrdb-master, we must follow
-the procedure::
-
-  $ kubectl get pods -n onap -o wide |grep onap-sdnrdb-master
-  onap-sdnrdb-master-0   1/1   Terminating   1   24h   10.42.3.92    node1
-  onap-sdnrdb-master-1   1/1   Running       1   24h   10.42.1.122   node2
-  onap-sdnrdb-master-2   1/1   Running       1   24h   10.42.2.134   node3
-
-  $ kubectl delete -n onap pod onap-sdnrdb-master-0 --force
-  warning: Immediate deletion does not wait for confirmation that the running
-  resource has been terminated. The resource may continue to run on the cluster
-  indefinitely.
-  pod "onap-sdnrdb-master-0" force deleted
-
-  $ kubectl get pods |grep onap-sdnrdb-master
-  onap-sdnrdb-master-0   0/1   PodInitializing   0   11s
-  onap-sdnrdb-master-1   1/1   Running           1   24h
-  onap-sdnrdb-master-2   1/1   Running           1   24h
-
-  $ kubectl get pods |grep onap-sdnrdb-master
-  onap-sdnrdb-master-0   1/1   Running   0   43s
-  onap-sdnrdb-master-1   1/1   Running   1   24h
-  onap-sdnrdb-master-2   1/1   Running   1   24h
-
-Once all the statefulset are properly restarted, the other components shall
-continue their restart properly.
-Once the restart of the pods is completed, the tests are PASS.
+Istanbul solution to survive a stop or restart of a Kubernetes worker node.
+
+This test has been automated thanks to the Litmus chaos framework
+(https://litmuschaos.io/) and is run in the CI on the weekly chains.
+
+2 additional tests based on Litmus chaos scenarios have been added but will be
+tuned in Jakarta:
+
+- node cpu hog (temporary increase of CPU on 1 kubernetes node)
+- node memory hog (temporary increase of Memory on 1 kubernetes node)
+
+The main test for Istanbul is node drain, corresponding to the resiliency
+scenario previously managed manually.
+
+The system under test is defined in OOM.
+The resources are described in the table below:
+
+.. code-block:: shell
+
+  +-------------------------+-------+--------+-------+
+  | Name                    | vCPUs | Memory | Disk  |
+  +-------------------------+-------+--------+-------+
+  | compute12-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute11-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute10-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute09-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute08-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute07-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute06-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute05-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute04-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute03-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute02-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | compute01-onap-istanbul | 16    | 24 GB  | 10 GB |
+  | etcd03-onap-istanbul    | 4     | 6 GB   | 10 GB |
+  | etcd02-onap-istanbul    | 4     | 6 GB   | 10 GB |
+  | etcd01-onap-istanbul    | 4     | 6 GB   | 10 GB |
+  | control03-onap-istanbul | 4     | 6 GB   | 10 GB |
+  | control02-onap-istanbul | 4     | 6 GB   | 10 GB |
+  | control01-onap-istanbul | 4     | 6 GB   | 10 GB |
+  +-------------------------+-------+--------+-------+
+
+The test sequence can be defined as follows:
+
+- Cordon a compute node (prevent any new scheduling)
+- Launch the node drain chaos scenario: all the pods on the given compute node
+  are evicted
+
+Once all the pods have been evicted:
+
+- Uncordon the compute node
+- Replay a basic_vm test
+
+This test has been successfully executed.
+
+.. image:: files/s3p/istanbul_resiliency.png
+   :align: center
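+
+For reference, a manual equivalent of the drained-node scenario can be played
+with kubectl only; this is a sketch, the node name being an example:
+
+.. code-block:: shell
+
+  # prevent new pods from being scheduled on the node
+  kubectl cordon compute01-onap-istanbul
+  # evict the pods running on that node (DaemonSet pods are ignored);
+  # on older kubectl releases --delete-emptydir-data is named --delete-local-data
+  kubectl drain compute01-onap-istanbul --ignore-daemonsets --delete-emptydir-data
+  # replay the tests, then make the node schedulable again
+  kubectl uncordon compute01-onap-istanbul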
 
 .. important::
-   K8s node reboots/shutdown is showing some deficiencies in ONAP components in
-   regard of their availability measured with HC results. Some pods may
-   still fail to initialize after reboot/shutdown(pod rescheduled).
+   Please note that the chaos framework selects one compute node (the first one
+   by default).
+   The distribution of the pods is random; on our target architecture about 15
+   pods are scheduled on each node. The chaos therefore affects only a limited
+   number of pods.
+
+For the Istanbul tests, the evicted pods (on compute01) were:
 
-   However cluster as a whole behaves as expected, pods are rescheduled after
-   node shutdown (except pods being part of statefulset which need to be deleted
-   forcibly - normal Kubernetes behavior)
+.. code-block:: shell
 
-   On rebooted node, should its downtime not exceed eviction timeout, pods are
-   restarted back after it is again available.
+  NAME                                            READY   STATUS    RESTARTS   AGE
+  onap-aaf-service-dbd8fc76b-vnmqv                1/1     Running   0          2d19h
+  onap-aai-graphadmin-5799bfc5bb-psfvs            2/2     Running   0          2d19h
+  onap-cassandra-1                                1/1     Running   0          2d19h
+  onap-dcae-ves-collector-856fcb67bd-lb8sz        2/2     Running   0          2d19h
+  onap-dcaemod-distributor-api-85df84df49-zj9zn   1/1     Running   0          2d19h
+  onap-msb-consul-86975585d9-8nfs2                1/1     Running   0          2d19h
+  onap-multicloud-pike-88bb965f4-v2qc8            2/2     Running   0          2d19h
+  onap-netbox-nginx-5b9b57d885-hjv84              1/1     Running   0          2d19h
+  onap-portal-app-66d9f54446-sjhld                2/2     Running   0          2d19h
+  onap-sdnc-ueb-listener-5b6bb95c68-d24xr         1/1     Running   0          2d19h
+  onap-sdnc-web-8f5c9fbcc-2l8sp                   1/1     Running   0          2d19h
+  onap-so-779655cb6b-9tzq4                        2/2     Running   1          2d19h
+  onap-so-oof-adapter-54b5b99788-x7rlk            2/2     Running   0          2d19h
 
-Please see `Integration Resiliency page `_
-for details.
+In the future, it would be interesting to define a resiliency testing strategy
+in order to check the eviction of all the critical components.
 
 Stability tests
 ---------------
 
-Three stability tests have been performed in Honolulu:
+Stability tests have been performed on the Istanbul release:
 
 - SDC stability test
-- Simple instantiation test (basic_vm)
 - Parallel instantiation test
 
+The results can be found in the weekly backend logs:
+https://logs.onap.org/onap-integration/weekly/onap_weekly_pod4_istanbul.
+
 SDC stability test
 ~~~~~~~~~~~~~~~~~~
 
 In this test, we consider the basic_onboard automated test and we run 5
-simultaneous onboarding procedures in parallel during 72h.
+simultaneous onboarding procedures in parallel during 24h.
 
 The basic_onboard test consists in the following steps:
 
@@ -329,13 +210,13 @@ The basic_onboard test consists in the following steps:
 - [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
   in SDC.
 
-The test has been initiated on the honolulu weekly lab on the 19th of April.
+The test has been initiated on the Istanbul weekly lab on the 14th of November.
 
 As already observed in daily|weekly|gating chain, we got race conditions on
 some tests (https://jira.onap.org/browse/INT-1918).
 
-The success rate is above 95% on the 100 first model upload and above 80%
-until we onboard more than 500 models.
+The success rate is expected to be above 95% on the first 100 model uploads
+and above 80% until more than 500 models are onboarded.
 
 We may also notice that the function test_duration=f(time) increases
 continuously. At the beginning the test takes about 200s, 24h later the same
@@ -345,31 +226,31 @@ explaining the linear decrease of the success rate.
 
 The following graphs provides a good view of the SDC stability test.
 
-.. image:: files/s3p/honolulu_sdc_stability.png
+.. image:: files/s3p/istanbul_sdc_stability.png
   :align: center
 
-.. important::
-   SDC can support up to 100s models onboarding.
-   The onbaording duration increases linearly with the number of onboarded
-   models
-   After a while, the SDC is no more usable.
-   No major Cluster resource issues have been detected during the test. The
-   memory consumption is however relatively high regarding the load.
-
-.. image:: files/s3p/honolulu_sdc_stability_resources.png
-   :align: center
-
+.. csv-table:: S3P Onboarding stability results
+   :file: ./files/csv/s3p-sdc.csv
+   :widths: 60,20,20
+   :delim: ;
+   :header-rows: 1
 
-Simple stability test
-~~~~~~~~~~~~~~~~~~~~~
-
-This test consists on running the test basic_vm continuously during 72h.
-
-We observe the cluster metrics as well as the evolution of the test duration.
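+
+As an illustration only, several basic_onboard occurrences can be launched in
+parallel with a simple loop, run_basic_onboard being a placeholder for the
+actual test runner used in the CI:
+
+.. code-block:: shell
+
+  # launch 5 onboarding runs in parallel and wait for their completion
+  for i in 1 2 3 4 5; do
+    run_basic_onboard > "basic_onboard_${i}.log" 2>&1 &
+  done
+  wait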
+.. important::
+   The onboarding duration increases linearly with the number of onboarded
+   models, which is already reported and may be due to the fact that models
+   cannot be deleted. In fact the test client has to retrieve the list of
+   models, which is continuously increasing. No limit tests have been
+   performed.
+   However 1085 onboarded models is already a very high figure regarding the
+   possible ONAP usage.
+   Moreover the mean duration time is much lower in Istanbul.
+   This explains why it was possible to run 35% more tests within the same
+   time frame.
 
-The test basic_vm is described in :ref:`the Integration Test page `.
+Parallel instantiations stability test
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The basic_vm test consists in the different following steps:
+The test is based on the single test (basic_vm) that can be described as follows:
 
 - [SDC] VendorOnboardStep: Onboard vendor in SDC.
 - [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
 - [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
 - [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
   in SDC.
 - [AAI] RegisterCloudRegionStep: Register cloud region.
 - [AAI] ComplexCreateStep: Create complex.
 - [AAI] LinkCloudRegionToComplexStep: Connect cloud region with complex.
 - [AAI] CustomerCreateStep: Create customer.
 - [AAI] CustomerServiceSubscriptionCreateStep: Create customer's service
   subscription.
 - [AAI] ConnectServiceSubToCloudRegionStep: Connect service subscription with
   cloud region.
 - [SO] YamlTemplateServiceAlaCarteInstantiateStep: Instantiate service
   described in YAML using SO a'la carte method.
 - [SO] YamlTemplateVnfAlaCarteInstantiateStep: Instantiate vnf described in
   YAML using SO a'la carte method.
@@ -391,102 +272,37 @@
 - [SO] YamlTemplateVfModuleAlaCarteInstantiateStep: Instantiate VF module
   described in YAML using SO a'la carte method.
 
-The test has been initiated on the Honolulu weekly lab on the 26th of April 2021.
-This test has been run after the test described in the next section.
-A first error occured after few hours (mariadbgalera), then the system
-automatically recovered for some hours before a full crash of the mariadb
-galera.
-
-::
-
-  debian@control01-onap-honolulu:~$ kubectl get pod -n onap |grep mariadb-galera
-  onap-mariadb-galera-0   1/2   CrashLoopBackOff   625    5d16h
-  onap-mariadb-galera-1   1/2   CrashLoopBackOff   1134   5d16h
-  onap-mariadb-galera-2   1/2   CrashLoopBackOff   407    5d16h
-
-It was unfortunately not possible to collect the root cause (logs of the first
-restart of onap-mariadb-galera-1).
-
-Community members reported that they already faced such issues and suggest to
-deploy a single maria instance instead of using MariaDB galera.
-Moreover, in Honolulu there were some changes in order to allign Camunda (SO)
-requirements for MariaDB galera..
-
-During the limited valid window, the success rate was about 78% (85% for the
-same test in Guilin).
-The duration of the test remain very variable as also already reported in Guilin
-(https://jira.onap.org/browse/SO-3419). The duration of the same test may vary
-from 500s to 2500s as illustrated in the following graph:
-
-.. image:: files/s3p/honolulu_so_stability_1_duration.png
-   :align: center
-
-The changes in MariaDB galera seems to have introduced some issues leading to
-more unexpected timeouts.
-A troubleshooting campaign has been launched to evaluate possible evolutions in
-this area.
-
-Parallel instantiations stability test
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Still based on basic_vm, 5 instantiation attempts are done simultaneously on the
-ONAP solution during 48h.
+10 instantiation attempts are done simultaneously on the ONAP solution during 24h.
 
 The results can be described as follows:
 
-.. image:: files/s3p/honolulu_so_stability_5.png
+.. image:: files/s3p/istanbul_instantiation_stability_10.png
   :align: center
 
-For this test, we have to restart the SDNC once. The last failures are due to
-a certificate infrastructure issue and are independent from ONAP.
-
-Cluster metrics
-~~~~~~~~~~~~~~~
+.. csv-table:: S3P Instantiation stability results
+   :file: ./files/csv/s3p-instantiation.csv
+   :widths: 60,20,20
+   :delim: ;
+   :header-rows: 1
+
+The results are good with a success rate above 95%. After 24h more than 1300
+VNFs have been created and deleted.
+
+As for SDC, we can observe a linear increase of the test duration. This issue
+has been reported since Guilin. For SDC, as it is not possible to delete the
+models, the duration may increase because the database of models continuously
+grows, so the client has to retrieve an ever bigger list of models.
+But for the instantiations it is not the case, as the references
+(module, VNF, service) are cleaned at the end of each test and all the tests
+use the same model. The duration of an instantiation test should therefore be
+almost constant, which is not the case. Further investigations are needed.
 
 .. important::
-   No major cluster resource issues have been detected in the cluster metrics
-
-The metrics of the ONAP cluster have been recorded over the full week of
-stability tests:
-
-.. csv-table:: CPU
-   :file: ./files/csv/stability_cluster_metric_cpu.csv
-   :widths: 20,20,20,20,20
-   :delim: ;
-   :header-rows: 1
-
-.. image:: files/s3p/honolulu_weekly_cpu.png
-   :align: center
-
-.. image:: files/s3p/honolulu_weekly_memory.png
-   :align: center
-
-The Top Ten for CPU consumption is given in the table below:
-
-.. csv-table:: CPU
-   :file: ./files/csv/stability_top10_cpu.csv
-   :widths: 20,15,15,20,15,15
-   :delim: ;
-   :header-rows: 1
-
-CPU consumption is negligeable and not dimensioning. It shall be reconsider for
-use cases including extensive computation (loops, optimization algorithms).
-
-The Top Ten for Memory consumption is given in the table below:
-
-.. csv-table:: Memory
-   :file: ./files/csv/stability_top10_memory.csv
-   :widths: 20,15,15,20,15,15
-   :delim: ;
-   :header-rows: 1
-
-Without surprise, the Cassandra databases are using most of the memory.
-
-The Top Ten for Network consumption is given in the table below:
-
-.. csv-table:: Network
-   :file: ./files/csv/stability_top10_net.csv
-   :widths: 10,15,15,15,15,15,15
-   :delim: ;
-   :header-rows: 1
+   The test has been executed with the mariadb-galera replica count set to 1
+   (3 by default). With this configuration the results during 24h are very
+   good. When set to 3, the error rate is higher and after some hours
+   most of the instantiations are failing.
+   However, even with a replica count set to 1, a test on the Master weekly
+   chain showed that the system is hitting another limit after about 35h
+   (https://jira.onap.org/browse/SO-3791).
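+
+As an indication only, the replica count used for this experiment can be
+overridden at deployment time; the exact key depends on the OOM charts, the
+sketch below assumes the mariadb-galera chart exposes replicaCount:
+
+.. code-block:: shell
+
+  # deploy (or upgrade) ONAP with a single mariadb-galera replica
+  helm upgrade --install onap local/onap --namespace onap \
+    --set mariadb-galera.replicaCount=1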
-- 
cgit 1.2.3-korg