diff --git a/docs/integration-s3p.rst b/docs/integration-s3p.rst
index 38a76f995..fe0890a1f 100644
--- a/docs/integration-s3p.rst
+++ b/docs/integration-s3p.rst
@@ -1,108 +1,84 @@
+.. This work is licensed under a
+ Creative Commons Attribution 4.0 International License.
.. _integration-s3p:
-:orphan:
-
-ONAP Maturity Testing Notes
----------------------------
+Stability/Resiliency
+====================
.. important::
The Release stability has been evaluated by:
- - The Daily Guilin CI/CD chain
- - A simple 24h healthcheck verification
- - A 7 days stability test
+ - The daily Honolulu CI/CD chain
+ - Stability tests
+ - Resiliency tests
.. note::
   The scope of these tests remains limited and does not provide a full set of
   KPIs to determine the limits and the dimensioning of the ONAP solution.
CI results
-==========
+----------
As usual, a daily CI chain dedicated to the release is created after RC0.
-A Daily Guilin has been created on the 18th of November 2020.
-
-Unfortunately several technical issues disturbed the chain:
-
-- Due to policy changes in DockerHub (new quotas), the installation chain was
- not stable as the quota limit was rapidly reached. As a consequence the
- installation was incomplete and most of the tests were failing. The problem
- was fixed by the subscription of unlimitted account on DockerHub.
-- Due to an upgrade of the Git Jenkins plugin done by LF IT, the synchronization
- of the miror of the xtesting repository, used daily to generate the test suite
- dockers was corrupted. The dockers were built daily from Jenkins but with an
- id from the 25th of September. As a consequence the tests reported lots of
- failure because they were corresponding to Frankfurt tests without the
- adaptations done for Guilin. The problem was fixed temporarily by moving to
- GitLab.com Docker registry then by the downgrade of the plugin executed by LF
- IT during Thanksgiving break.
-
-The first week of the Daily Guilin results are therefore not really usable.
-Most of the results from the `daily Guilin result portal
-<https://logs.onap.org/onap-integration/daily/onap_daily_pod4_guilin/>`_
-are not trustable and may be misleading.
-The results became more stable from the the 6th of December.
-
-The graphs given hereafter are based on the data collected until the 8th of
-december. This Daily chain will be maintained during the Honolulu development
-cycle (Daily Master) and can be audited at any time. In case of reproducible
-errors, the integration team will open JIRA on Guilin.
-
-Several public Daily Guilin chains have been put in place, one in Orange
-(Helm v2) and one in DT (Helm v3). DT results are pushed in the test DB and can
-be observed in
-`ONAP Testing DT lab result page <http://testresults.opnfv.org/onap-integration/dt/dt.html>`_.
+A daily Honolulu chain was created on the 6th of April 2021.
+
+The daily results can be found on the `LF daily results web site
+<https://logs.onap.org/onap-integration/daily/onap_daily_pod4_honolulu/2021-04/>`_.
Infrastructure Healthcheck Tests
-................................
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These tests deal with the Kubernetes/Helm tests on the ONAP cluster.
-The global expected criteria is **50%** when installing with Helm 2.
-The onap-k8s and onap-k8s-teardown providing a snapshop of the onap namespace in
-kubernetes are expected to be PASS but two tests are expected to fail:
-
-- onap-helm (32/33 OK) due to the size of the SO helm chart (too big for Helm2).
-- nodeport_check_certs due to bad certificate issuers (Root CA certificate non
- valid). In theory all the certificate shall be generated during the installation
- and be valid for the 364 days after the installation. It is still not the case.
- However, for the first time, no certificate was expired. Next certificates to
- renew are:
- - Music (2021-02-03)
- - VID (2021-03-17)
- - Message-router-external (2021-03-25)
- - CDS-UI (2021-02-18)
- - AAI and AAI-SPARKY-BE (2021-03-17)
-
-.. image:: files/s3p/guilin_daily_infrastructure_healthcheck.png
+
+The global expected pass criterion is **75%**.
+The onap-k8s and onap-k8s-teardown tests, which provide a snapshot of the onap
+namespace in Kubernetes, as well as the onap-helm tests are expected to PASS.
+
+The nodeport_check_certs test is expected to fail. Even though tremendous
+progress has been made in this area, some certificates (unmaintained, upstream
+or integration robot pods) are still not correct due to bad certificate issuers
+(Root CA certificate not valid) or an overly long validity period. Most of the
+certificates have been installed using cert-manager and will be easy to renew.
+
+.. image:: files/s3p/honolulu_daily_infrastructure_healthcheck.png
:align: center
Healthcheck Tests
-.................
+~~~~~~~~~~~~~~~~~
These tests are the traditional robot healthcheck tests and additional tests
dealing with a single component.
+Some tests (basic_onboard, basic_cds) may fail sporadically because the SDC
+startup is sometimes not fully completed.
+
+The same test is run as the first step of the smoke tests and is usually PASS.
+The mechanism used to detect that all the components are fully operational
+could be improved; timer-based solutions are not robust enough.
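+
+One possible direction, sketched below, is to gate the healthcheck campaign on
+Kubernetes readiness conditions rather than on a fixed timer. This is only a
+sketch: it assumes the readiness probes accurately reflect application startup,
+and the ``app=sdc-be`` label selector and timeout value are assumptions::
+
+  # block until the SDC backend pods report Ready instead of sleeping blindly
+  $ kubectl wait pod -n onap -l app=sdc-be --for=condition=Ready --timeout=600s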
+
The expectation is **100% OK**.
-.. image:: files/s3p/guilin_daily_healthcheck.png
+.. image:: files/s3p/honolulu_daily_healthcheck.png
:align: center
Smoke Tests
-...........
+~~~~~~~~~~~
-These tests are end to end tests.
+These tests are end-to-end and automated use case tests.
See :ref:`the Integration Test page <integration-tests>` for details.
The expectation is **100% OK**.
-.. figure:: files/s3p/guilin_daily_smoke.png
+.. figure:: files/s3p/honolulu_daily_smoke.png
:align: center
-An error has been detected on the SDC when performing parallel tests.
-See `SDC-3366 <https://jira.onap.org/browse/SDC-3366>`_ for details.
+An error has been detected on the SDNC, preventing the basic_vm_macro test from
+working.
+See `SDNC-1529 <https://jira.onap.org/browse/SDNC-1529/>`_ for details.
+We may also notice that SO timeouts occurred more frequently than in Guilin.
+See `SO-3584 <https://jira.onap.org/browse/SO-3584>`_ for details.
Security Tests
-..............
+~~~~~~~~~~~~~~
These tests deal with security.
See :ref:`the Integration Test page <integration-tests>` for details.
@@ -111,43 +87,287 @@ The expectation is **66% OK**. The criteria is met.
It may even be higher, as 2 failing tests are almost correct:
-- the unlimited pod test is still fail due to only one pod: onap-ejbca.
-- the nonssl tests is FAIL due to so and os-vnfm adapter, which were supposed to
- be managed with the ingress (not possible for this release) and got a waiver
- in Frankfurt.
+- The unlimited pod test still fails due to a testing pod (DCAE-tca).
+- The nonssl test is FAIL due to so and so-etsi-sol003-adapter, which were
+  supposed to be managed through the ingress (not possible for this release) and
+  got a waiver in Frankfurt. The pods cds-blueprints-processor-http and aws-web
+  are used for tests.
-.. figure:: files/s3p/guilin_daily_security.png
+.. figure:: files/s3p/honolulu_daily_security.png
:align: center
-A simple 24h healthcheck verification
-=====================================
+Resiliency tests
+----------------
+
+The goal of the resiliency testing was to evaluate the capability of the
+Honolulu solution to survive a stop or restart of a Kubernetes control or
+worker node.
+
+Controller node resiliency
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default the ONAP solution is installed with 3 controllers for high
+availability. The test for controller resiliency can be described as follows
+(a command sketch is given after the list):
+
+- Run tests: check that they are PASS
+- Stop a controller node: check that the node appears in NotReady state
+- Run tests: check that they are PASS
+
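+A minimal sketch of the node checks for this procedure, assuming the controller
+nodes are OpenStack VMs named as in the cluster listing of the next section::
+
+  # stop one of the three controller VMs from the OpenStack side
+  $ openstack server stop control01-onap-honolulu
+
+  # the node shall be reported NotReady once the grace period has expired
+  $ kubectl get nodes | grep control
+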
+2 tests were performed on the weekly Honolulu lab. No problem was observed on
+controller shutdown; the tests were still PASS with a stopped controller node.
+
+More details can be found in `TEST-309 <https://jira.onap.org/browse/TEST-309>`_.
+
+Worker node resiliency
+~~~~~~~~~~~~~~~~~~~~~~
+
+In the community weekly lab, the ONAP pods are distributed over 12 workers. The
+goal of the test was to evaluate the behavior of the pods on a worker restart
+(disaster scenario assuming that the node was moved accidentally from Ready to
+NotReady state).
+The initial conditions of such tests may differ from one run to another, as the
+Kubernetes scheduler does not distribute the pods on the same workers from one
+installation to the next.
+
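+The distribution of the pods over the workers can be checked with a one-liner
+such as the following sketch (the node name is the 7th column of the wide
+output)::
+
+  # count the ONAP pods hosted on each node
+  $ kubectl get pods -n onap -o wide --no-headers | awk '{print $7}' | sort | uniq -c
+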
+The test procedure can be described as follows:
+
+- Run tests: check that they are PASS (Healthcheck and basic_vm used)
+- Check that all the workers are in ready state
+ ::
+ $ kubectl get nodes
+ NAME STATUS ROLES AGE VERSION
+ compute01-onap-honolulu Ready <none> 18h v1.19.9
+ compute02-onap-honolulu Ready <none> 18h v1.19.9
+ compute03-onap-honolulu Ready <none> 18h v1.19.9
+ compute04-onap-honolulu Ready <none> 18h v1.19.9
+ compute05-onap-honolulu Ready <none> 18h v1.19.9
+ compute06-onap-honolulu Ready <none> 18h v1.19.9
+ compute07-onap-honolulu Ready <none> 18h v1.19.9
+ compute08-onap-honolulu Ready <none> 18h v1.19.9
+ compute09-onap-honolulu Ready <none> 18h v1.19.9
+ compute10-onap-honolulu Ready <none> 18h v1.19.9
+ compute11-onap-honolulu Ready <none> 18h v1.19.9
+ compute12-onap-honolulu Ready <none> 18h v1.19.9
+ control01-onap-honolulu Ready master 18h v1.19.9
+ control02-onap-honolulu Ready master 18h v1.19.9
+ control03-onap-honolulu Ready master 18h v1.19.9
+
+- Select a worker, list the impacted pods
+ ::
+ $ kubectl get pod -n onap --field-selector spec.nodeName=compute01-onap-honolulu
+ NAME READY STATUS RESTARTS AGE
+ onap-aaf-fs-7b6648db7f-shcn5 1/1 Running 1 22h
+ onap-aaf-oauth-5896545fb7-x6grg 1/1 Running 1 22h
+ onap-aaf-sms-quorumclient-2 1/1 Running 1 22h
+ onap-aai-modelloader-86d95c994b-87tsh 2/2 Running 2 22h
+ onap-aai-schema-service-75575cb488-7fxs4 2/2 Running 2 22h
+ onap-appc-cdt-58cb4766b6-vl78q 1/1 Running 1 22h
+ onap-appc-db-0 2/2 Running 4 22h
+ onap-appc-dgbuilder-5bb94d46bd-h2gbs 1/1 Running 1 22h
+ onap-awx-0 4/4 Running 4 22h
+ onap-cassandra-1 1/1 Running 1 22h
+ onap-cds-blueprints-processor-76f8b9b5c7-hb5bg 1/1 Running 1 22h
+ onap-dmaap-dr-db-1 2/2 Running 5 22h
+ onap-ejbca-6cbdb7d6dd-hmw6z 1/1 Running 1 22h
+ onap-kube2msb-858f46f95c-jws4m 1/1 Running 1 22h
+ onap-message-router-0 1/1 Running 1 22h
+ onap-message-router-kafka-0 1/1 Running 1 22h
+ onap-message-router-kafka-1 1/1 Running 1 22h
+ onap-message-router-kafka-2 1/1 Running 1 22h
+ onap-message-router-zookeeper-0 1/1 Running 1 22h
+ onap-multicloud-794c6dffc8-bfwr8 2/2 Running 2 22h
+ onap-multicloud-starlingx-58f6b86c55-mff89 3/3 Running 3 22h
+ onap-multicloud-vio-584d556876-87lxn 2/2 Running 2 22h
+ onap-music-cassandra-0 1/1 Running 1 22h
+ onap-netbox-nginx-8667d6675d-vszhb 1/1 Running 2 22h
+ onap-policy-api-6dbf8485d7-k7cpv 1/1 Running 1 22h
+ onap-policy-clamp-be-6d77597477-4mffk 1/1 Running 1 22h
+ onap-policy-pap-785bd79759-xxhvx 1/1 Running 1 22h
+ onap-policy-xacml-pdp-7d8fd58d59-d4m7g 1/1 Running 6 22h
+ onap-sdc-be-5f99c6c644-dcdz8 2/2 Running 2 22h
+ onap-sdc-fe-7577d58fb5-kwxpj 2/2 Running 2 22h
+ onap-sdc-wfd-fe-6997567759-gl9g6 2/2 Running 2 22h
+ onap-sdnc-dgbuilder-564d6475fd-xwwrz 1/1 Running 1 22h
+ onap-sdnrdb-master-0 1/1 Running 1 22h
+ onap-so-admin-cockpit-6c5b44694-h4d2n 1/1 Running 1 21h
+ onap-so-etsi-sol003-adapter-c9bf4464-pwn97 1/1 Running 1 21h
+ onap-so-sdc-controller-6899b98b8b-hfgvc 2/2 Running 2 21h
+ onap-vfc-mariadb-1 2/2 Running 4 21h
+ onap-vfc-nslcm-6c67677546-xcvl2 2/2 Running 2 21h
+ onap-vfc-vnflcm-78ff4d8778-sgtv6 2/2 Running 2 21h
+ onap-vfc-vnfres-6c96f9ff5b-swq5z 2/2 Running 2 21h
+
+- Stop the worker (shut down the machine for bare metal, or the VM if your
+  Kubernetes is installed on top of an OpenStack solution)
+- Wait for the pod eviction procedure completion (5 minutes)
+ ::
+ $ kubectl get nodes
+ NAME STATUS ROLES AGE VERSION
+ compute01-onap-honolulu NotReady <none> 18h v1.19.9
+ compute02-onap-honolulu Ready <none> 18h v1.19.9
+ compute03-onap-honolulu Ready <none> 18h v1.19.9
+ compute04-onap-honolulu Ready <none> 18h v1.19.9
+ compute05-onap-honolulu Ready <none> 18h v1.19.9
+ compute06-onap-honolulu Ready <none> 18h v1.19.9
+ compute07-onap-honolulu Ready <none> 18h v1.19.9
+ compute08-onap-honolulu Ready <none> 18h v1.19.9
+ compute09-onap-honolulu Ready <none> 18h v1.19.9
+ compute10-onap-honolulu Ready <none> 18h v1.19.9
+ compute11-onap-honolulu Ready <none> 18h v1.19.9
+ compute12-onap-honolulu Ready <none> 18h v1.19.9
+ control01-onap-honolulu Ready master 18h v1.19.9
+ control02-onap-honolulu Ready master 18h v1.19.9
+ control03-onap-honolulu Ready master 18h v1.19.9
+
+- Run the tests: check that they are PASS
+
+.. warning::
+ In these conditions, **the tests will never be PASS**. In fact several
+ components will remain in INIT state.
+ A procedure is required to ensure a clean restart.
+
+List the non-running pods::
+
+ $ kubectl get pods -n onap --field-selector status.phase!=Running | grep -v Completed
+ NAME READY STATUS RESTARTS AGE
+ onap-appc-dgbuilder-5bb94d46bd-sxmmc 0/1 Init:3/4 15 156m
+ onap-cds-blueprints-processor-76f8b9b5c7-m7nmb 0/1 Init:1/3 0 156m
+ onap-portal-app-595bd6cd95-bkswr 0/2 Init:0/4 84 23h
+ onap-portal-db-config-6s75n 0/2 Error 0 23h
+ onap-portal-db-config-7trzx 0/2 Error 0 23h
+ onap-portal-db-config-jt2jl 0/2 Error 0 23h
+ onap-portal-db-config-mjr5q 0/2 Error 0 23h
+ onap-portal-db-config-qxvdt 0/2 Error 0 23h
+ onap-portal-db-config-z8c5n 0/2 Error 0 23h
+ onap-sdc-be-5f99c6c644-kplqx 0/2 Init:2/5 14 156
+ onap-vfc-nslcm-6c67677546-86mmj 0/2 Init:0/1 15 156m
+ onap-vfc-vnflcm-78ff4d8778-h968x 0/2 Init:0/1 15 156m
+ onap-vfc-vnfres-6c96f9ff5b-kt9rz 0/2 Init:0/1 15 156m
+
+Some pods are not rescheduled (e.g. onap-awx-0 and onap-cassandra-1 above)
+because they are part of a statefulset. List the statefulset objects::
+
+ $ kubectl get statefulsets.apps -n onap | grep -v "1/1" | grep -v "3/3"
+ NAME READY AGE
+ onap-aaf-sms-quorumclient 2/3 24h
+ onap-appc-db 2/3 24h
+ onap-awx 0/1 24h
+ onap-cassandra 2/3 24h
+ onap-dmaap-dr-db 2/3 24h
+ onap-message-router 0/1 24h
+ onap-message-router-kafka 0/3 24h
+ onap-message-router-zookeeper 2/3 24h
+ onap-music-cassandra 2/3 24h
+ onap-sdnrdb-master 2/3 24h
+ onap-vfc-mariadb 2/3 24h
+
+For the pods being part of a statefulset, a forced deletion is required.
+As an example, if we consider the statefulset onap-sdnrdb-master, we must follow
+this procedure::
+
+ $ kubectl get pods -n onap -o wide |grep onap-sdnrdb-master
+ onap-sdnrdb-master-0 1/1 Terminating 1 24h 10.42.3.92 node1
+ onap-sdnrdb-master-1 1/1 Running 1 24h 10.42.1.122 node2
+ onap-sdnrdb-master-2 1/1 Running 1 24h 10.42.2.134 node3
+
+ $ kubectl delete -n onap pod onap-sdnrdb-master-0 --force
+ warning: Immediate deletion does not wait for confirmation that the running
+ resource has been terminated. The resource may continue to run on the cluster
+ indefinitely.
+ pod "onap-sdnrdb-master-0" force deleted
+
+ $ kubectl get pods |grep onap-sdnrdb-master
+ onap-sdnrdb-master-0 0/1 PodInitializing 0 11s
+ onap-sdnrdb-master-1 1/1 Running 1 24h
+ onap-sdnrdb-master-2 1/1 Running 1 24h
+
+ $ kubectl get pods |grep onap-sdnrdb-master
+ onap-sdnrdb-master-0 1/1 Running 0 43s
+ onap-sdnrdb-master-1 1/1 Running 1 24h
+ onap-sdnrdb-master-2 1/1 Running 1 24h
+
+Once all the statefulsets are properly restarted, the other components shall
+continue their restart properly.
+Once the restart of the pods is completed, the tests are PASS.
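+
+A possible final check before re-running the tests is sketched below (the
+30 minute timeout is an arbitrary value)::
+
+  # wait until every onap deployment reports full availability again
+  $ kubectl wait deployment -n onap --all --for=condition=Available --timeout=30m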
-This test consists in running the Healthcheck tests every 10 minutes during
-24h.
+.. important::
-The test was run from the 6th of december to the 7th of december.
+  K8s node reboots/shutdowns reveal some deficiencies in ONAP components with
+  regard to their availability as measured by the healthcheck results. Some
+  pods may still fail to initialize after a reboot/shutdown (pod rescheduled).
-The success rate was 100%.
+  However, the cluster as a whole behaves as expected: pods are rescheduled
+  after a node shutdown (except the pods being part of a statefulset, which
+  need to be deleted forcibly - normal Kubernetes behavior).
-The results are stored in the
-`test database <http://testresults.opnfv.org/onap/api/v1/results?pod_name=onap_daily_pod4_master-ONAP-oom&case_name=full>`_
+  On a rebooted node, provided its downtime does not exceed the eviction
+  timeout, the pods are restarted once the node is available again.
-A 6 days stability test
-=======================
+Please see `Integration Resiliency page <https://jira.onap.org/browse/TEST-308>`_
+for details.
-This test consists on running the test basic_vm continuously during 1 week.
+Stability tests
+---------------
-We observe the cluster metrics as well as the evolution of the test duration.
-The test basic_vm is describe in :ref:`the Integration Test page <integration-tests>`.
+Three stability tests have been performed in Honolulu:
+
+- SDC stability test
+- Simple instantiation test (basic_vm)
+- Parallel instantiation test
+
+SDC stability test
+~~~~~~~~~~~~~~~~~~
+
+In this test, we consider the basic_onboard automated test and run 5
+simultaneous onboarding procedures in parallel for 72h (a possible launch
+pattern is sketched after the step list below).
+
+The basic_onboard test consists of the following steps:
+
+- [SDC] VendorOnboardStep: Onboard vendor in SDC.
+- [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
+- [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
+- [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
+ in SDC.
+
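+A possible launch pattern for the 5 parallel streams is sketched below; the
+``run_basic_onboard.sh`` wrapper around the xtesting/onaptests runner is
+hypothetical::
+
+  # start 5 independent onboarding loops and stop them after 72 hours
+  $ for i in 1 2 3 4 5; do
+  >   ( while true; do ./run_basic_onboard.sh "stream-$i"; done ) &
+  > done
+  $ sleep $((72 * 3600)) && kill $(jobs -p)
+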
+The test has been initiated on the Honolulu weekly lab on the 19th of April.
+
+As already observed in the daily|weekly|gating chains, we got race conditions
+on some tests (https://jira.onap.org/browse/INT-1918).
+
+The success rate is above 95% for the first 100 model uploads and above 80%
+until more than 500 models have been onboarded.
+
+We may also notice that the test duration increases continuously over time. At
+the beginning the test takes about 200s; 24h later the same test takes around
+1000s.
+Finally, after 36h, the SDC systematically answers with an HTTP 500 response
+code, which explains the linear decrease of the success rate.
+
+The following graphs provide a good view of the SDC stability test.
+
+.. image:: files/s3p/honolulu_sdc_stability.png
+ :align: center
+
+.. important::
+  SDC can support the onboarding of a few hundred models.
+  The onboarding duration increases linearly with the number of onboarded
+  models.
+  After a while, the SDC is no longer usable.
+  No major cluster resource issues have been detected during the test. The
+  memory consumption is however relatively high with regard to the load.
-Within a long duration test context, the test will onboard a service once then
-instantiate this service multiple times. Before instantiating, it will
-systematically contact the SDC and the AAI to verify that the resources already
-exist. In this context the most impacted component is SO, which was delivered
-relatively late compared to the other components.
+.. image:: files/s3p/honolulu_sdc_stability_resources.png
+ :align: center
-Basic_vm test
-.............
+
+Simple stability test
+~~~~~~~~~~~~~~~~~~~~~
+
+This test consists in running the basic_vm test continuously for 72h.
+
+We observe the cluster metrics as well as the evolution of the test duration.
+
+The test basic_vm is described in :ref:`the Integration Test page <integration-tests>`.
The basic_vm test consists of the following steps:
@@ -171,84 +391,87 @@ The basic_vm test consists in the different following steps:
- [SO] YamlTemplateVfModuleAlaCarteInstantiateStep: Instantiate VF module
described in YAML using SO a'la carte method.
-The test has been initiated on a weekly lab on the 2nd of december.
-The results provided hereafter correspond to the period from 2020-12-02 to
-2020-12-08.
+The test has been initiated on the Honolulu weekly lab on the 26th of April 2021.
+This test has been run after the test described in the next section.
+A first error occurred after a few hours (mariadb-galera), then the system
+automatically recovered for some hours before a full crash of the mariadb
+galera.
-.. csv-table:: Basic_vm results
- :file: ./files/csv/stability_basic_vm.csv
- :widths: 70, 30
- :delim: ;
- :header-rows: 1
+::
-.. note::
+ debian@control01-onap-honolulu:~$ kubectl get pod -n onap |grep mariadb-galera
+ onap-mariadb-galera-0 1/2 CrashLoopBackOff 625 5d16h
+ onap-mariadb-galera-1 1/2 CrashLoopBackOff 1134 5d16h
+ onap-mariadb-galera-2 1/2 CrashLoopBackOff 407 5d16h
- The corrected success rate excludes the FAIL results obtained during the SDNC
- saturation phase.
- The cause of the errors shall be analyzed more in details. The huge majority of
- errors (79%) occurs on SO service creation, 18% on VNF creation and 3% on
- module creation.
-.. important::
- The test success rate is about 86%.
- CPU consumption is low (see next section).
- Memory consumption is high.
+It was unfortunately not possible to collect the root cause (logs of the first
+restart of onap-mariadb-galera-1).
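+
+For reference, the command below is the one that would normally be used; it only
+returns the logs of the immediately previous container instance, which at that
+point was already many restarts past the initial failure (the container name is
+an assumption)::
+
+  # retrieve the logs of the previous instance of the galera container
+  $ kubectl logs -n onap onap-mariadb-galera-1 -c mariadb-galera --previous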
- After ~ 24-48h, the test is systematically FAIL. The trace shows that the SDNC
- is no more responding. This error required the manual restart of the SDNC.
- It seems that the SDNC exceeds its limits set in OOM. The simple manual
- restart (delete of the pod was enough, the test after the restart is PASS,
- and keep most of the time PASS for the next 24-48h)
+Community members reported that they had already faced such issues and suggested
+deploying a single MariaDB instance instead of using MariaDB galera.
+Moreover, in Honolulu some changes were made in order to align with the Camunda
+(SO) requirements for MariaDB galera.
-We can observe the consequences of the manual restart of the SDNC on its memory
-graph as well as the memory threshold.
+During the limited valid window, the success rate was about 78% (85% for the
+same test in Guilin).
+The duration of the test remains very variable, as already reported in Guilin
+(https://jira.onap.org/browse/SO-3419). The duration of the same test may vary
+from 500s to 2500s, as illustrated in the following graph:
-.. figure:: files/s3p/stability_sdnc_memory.png
- :align: center
+.. image:: files/s3p/honolulu_so_stability_1_duration.png
+ :align: center
-The duration of the test is increasing slowly over the week and can be described
-as follows:
+The changes in MariaDB galera seem to have introduced some issues leading to
+more unexpected timeouts.
+A troubleshooting campaign has been launched to evaluate possible evolutions in
+this area.
-.. figure:: files/s3p/basic_vm_duration.png
- :align: center
+Parallel instantiations stability test
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-If we consider the histogram, we can see the distribution of the duration.
+Still based on basic_vm, 5 instantiation attempts are run simultaneously on the
+ONAP solution for 48h.
-.. figure:: files/s3p/basic_vm_duration_histo.png
- :align: center
+The results can be described as follows:
-As a conclusion, the solution seems stable.
+.. image:: files/s3p/honolulu_so_stability_5.png
+ :align: center
-The memory issue detected in the SDNC may be due to a bad sizing of the limits
-and requests in OOM but a problem of light memory leak cannot be exclude.
-The workaround consisting in restarting of the SDNC seems to fix the issue.
-The issue is tracked in `SDNC-1430 <https://jira.onap.org/browse/SDNC-1430>`_.
-Further study shall be done on this topic to consildate the detection of the
-root cause.
+For this test, we had to restart the SDNC once. The last failures are due to
+a certificate infrastructure issue and are independent from ONAP.
Cluster metrics
-...............
+~~~~~~~~~~~~~~~
-The Metrics of the ONAP cluster on this 6 days period are given by the
-following tables:
+.. important::
+  No major cluster resource issues have been detected in the cluster metrics.
+
+The metrics of the ONAP cluster have been recorded over the full week of
+stability tests:
.. csv-table:: CPU
:file: ./files/csv/stability_cluster_metric_cpu.csv
- :widths: 20,10,10,10,10,10,10,10
+ :widths: 20,20,20,20,20
:delim: ;
:header-rows: 1
-.. csv-table:: Memory
- :file: ./files/csv/stability_cluster_metric_memory.csv
- :widths: 20,10,10,10,10,10,10,10
+.. image:: files/s3p/honolulu_weekly_cpu.png
+ :align: center
+
+.. image:: files/s3p/honolulu_weekly_memory.png
+ :align: center
+
+The Top Ten for CPU consumption is given in the table below:
+
+.. csv-table:: CPU
+ :file: ./files/csv/stability_top10_cpu.csv
+ :widths: 20,15,15,20,15,15
:delim: ;
:header-rows: 1
-.. csv-table:: Network
- :file: ./files/csv/stability_cluster_metric_network.csv
- :widths: 10,15,15,15,15,15,15
- :delim: ;
- :header-rows: 1
+CPU consumption is negligible and not dimensioning. It shall be reconsidered for
+use cases including extensive computation (loops, optimization algorithms).
The Top Ten for Memory consumption is given in the table below:
@@ -258,9 +481,12 @@ The Top Ten for Memory consumption is given in the table below:
:delim: ;
:header-rows: 1
-At least 9 components exceeds their Memory Requests. And 7 are over the Memory
-limits set in OOM: the 2 Opendaylight controllers and the cassandra Databases.
+Unsurprisingly, the Cassandra databases use most of the memory.
+
+The Top Ten for Network consumption is given in the table below:
-As indicated CPU consumption is negligeable and not dimensioning.
-It shall be reconsider for use cases including extensive computation (loops,
-optimization algorithms).
+.. csv-table:: Network
+ :file: ./files/csv/stability_top10_net.csv
+ :widths: 10,15,15,15,15,15,15
+ :delim: ;
+ :header-rows: 1