Diffstat (limited to 'docs/integration-s3p.rst')
-rw-r--r-- docs/integration-s3p.rst 339
1 files changed, 171 insertions, 168 deletions
diff --git a/docs/integration-s3p.rst b/docs/integration-s3p.rst
index 49c67850f..13e36c17a 100644
--- a/docs/integration-s3p.rst
+++ b/docs/integration-s3p.rst
@@ -1,204 +1,207 @@
+.. This work is licensed under a
+ Creative Commons Attribution 4.0 International License.
.. _integration-s3p:
-ONAP Maturity Testing Notes
----------------------------
-
-Historically integration team used to execute specific stability and resilience
-tests on target release. For frankfurt a stability test was executed.
-Openlab, based on Frankfurt RC0 dockers was also observed a long duration
-period to evaluate the overall stability.
-Finally the CI daily chain created at Frankfurt RC0 was also a precious indicator
-to estimate the solution stability.
-
-No resilience or stress tests have been executed due to a lack of resources
-and late availability of the release. The testing strategy shall be amended in
-Guilin, several requirements have been created to improve the S3P testing domain.
+:orphan:
Stability
=========
-ONAP stability was tested through a 72 hour test.
-The intent of the 72 hour stability test is not to exhaustively test all
-functions but to run a steady load against the system and look for issues like
-memory leaks that cannot be found in the short duration install and functional
-testing during the development cycle.
-
-Integration Stability Testing verifies that the ONAP platform remains fully
-functional after running for an extended amounts of time.
-This is done by repeated running tests against an ONAP instance for a period of
-72 hours.
-
-::
-
- **The 72 hour stability run result was PASS**
-
-The onboard and instantiate tests ran for over **115 hours** before environment
-issues stopped the test. There were errors due to both tooling and environment
-errors.
-
-The overall memory utilization only grew about **2%** on the work nodes despite
-the environment issues. Interestingly the kubernetes ochestration node memory
-grew more which could mean we are over driving the API's in some fashion.
-
-We did not limit other tenant activities in Windriver during this test run and
-we saw the impact from things like the re-installation of SB00 in the tenant
-and general network latency impacts that caused openstack to be slower to
-instantiate.
-For future stability runs we should go back to the process of shutting down
-non-critical tenants in the test environment to free up host resources for
-the test run (or other ways to prevent other testing from affecting the stability
-run).
-
-The control loop tests were **100% successful** and the cycle time for the loop was
-fairly consistent despite the environment issues. Future control loop stability
-tests should consider doing more policy edit type activites and running more
-control loop if host resources are available. The 10 second VES telemetry event
-is quite aggressive so we are sending more load into the VES collector and TCA
-engine during onset events than would be typical so adding additional loops
-should factor that in. The jenkins jobs ran fairly well although the instantiate
-Demo vFWCL took longer than usual and should be factored into future test planning.
-
-
-Methodology
-~~~~~~~~~~~
+.. important::
+ The Release stability has been evaluated by:
-The Stability Test has two main components:
+ - The daily CI/CD chain
+ - Stability tests
-- Running "ete stability72hr" Robot suite periodically. This test suite
- verifies that ONAP can instantiate vDNS, vFWCL, and VVG.
-- Set up vFW Closed Loop to remain running, then check periodically that the
- closed loop functionality is still working.
+.. note::
+ The scope of these tests remains limited and does not provide a full set of
+ KPIs to determine the limits and the dimensioning of the ONAP solution.
-The integration-longevity tenant in Intel/Windriver environment was used for the
-72 hour tests.
+CI results
+----------
-The onap-ci job for "Project windriver-longevity-release-manual" was used for
-the deployment with the OOM set to frankfurt and Integration branches set to
-master. Integration master was used so we could catch the latest updates to
-integration scripts and vnf heat templates.
+As usual, a daily CI chain dedicated to the release is created after RC0.
-The jenkins job needs a couple of updates for each release:
+The daily results can be found on the `LF DT lab daily results web site <https://logs.onap.org/onap-integration/daily/onap-daily-dt-oom-master/>`_.
-- Set the integration branch to 'origin/master'
-- Modify the parameters to deploy.sh to specify "-i master" and "-o frankfurt"
- to get integration master an oom frankfurt clones onto the nfs server.
+.. image:: files/s3p/jakarta-dashboard.png
+ :align: center
-The path for robot logs on dockerdata-nfs changed in Frankfurt so the
-/dev-robot/ becomes /dev/robot
-.. note::
- For Frankfurt release, the stability test has been executed on an
- kubernetes infrastructure based on El Alto recommendations. The kubernetes
- version was 1.15.3 (frankfurt 1.15.11) and the helm version was 2.14.2
- (frankfurt 2.16.6). However the ONAP dockers were updated to Frankfurt RC2
- candidate versions. The results are informative and can be compared with
- previous campaigns. The stability tests used robot container image
- **1.6.1-STAGING-20200519T201214Z**. Robot container was patched to use GRA_API
- since VNF_API has been deprecated.
+Infrastructure Healthcheck Tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Shakedown consists of creating some temporary tags for stability72hrvLB,
-stability72hrvVG,stability72hrVFWCL to make sure each sub test ran successfully
-(including cleanup) in the environment before the jenkins job started with the
-higher level testsuite tag stability72hr that covers all three test types.
+These tests check the Kubernetes and Helm status of the ONAP cluster.
-Clean out the old buid jobs using a jenkins console script (manage jenkins)
+The overall expected criterion is **100%**.
-::
+The onap-k8s and onap-k8s-teardown tests, which provide a snapshot of the onap
+namespace in Kubernetes, as well as the onap-helm tests, are expected to PASS.
- def jobName = "windriver-longevity-stability72hr"=
- def job = Jenkins.instance.getItem(jobName)
- job.getBuilds().each { it.delete() }
- job.nextBuildNumber = 1
- job.save()
+.. image:: files/s3p/istanbul_daily_infrastructure_healthcheck.png
+ :align: center
+Healthcheck Tests
+~~~~~~~~~~~~~~~~~
-appc.properties updated to apply the fix for DMaaP message processing to call
-http://localhost:8181 for the streams update.
+These tests are the traditional robot healthcheck tests and additional tests
+dealing with a single component.
-Results: 100% PASS
-~~~~~~~~~~~~~~~~~~
-=================== ======== ========== ======== ========= =========
-Test Case Attempts Env Issues Failures Successes Pass Rate
-=================== ======== ========== ======== ========= =========
-Stability 72 hours 77 19 0 58 100%
-vFW Closed Loop 60 0 0 100 100%
-**Total** 137 19 0 158 **100%**
-=================== ======== ========== ======== ========= =========
-
-Detailed results can be found at https://wiki.onap.org/display/DW/Frankfurt+Stability+Run+Notes
-
-.. note::
- - Overall results were good. All of the test failures were due to
- issues with the unstable environment and tooling framework.
- - JIRAs were created for readiness/liveness probe issues found while
- testing under the unstable environment. Patches applied to oom and
- testsuite during the testing helped reduce test failures due to
- environment and tooling framework issues.
- - The vFW Closed Loop test was very stable and self recovered from
- environment issues.
-
-Resources overview
-~~~~~~~~~~~~~~~~~~
-============ ====================== =========== ========== ==========
-Date #1 CPU #1 RAM CPU* RAM**
-============ ====================== =========== ========== ==========
-May 20 18:45 dcae-tca-anaytics:511m appc:2901Mi 1649 36092
-May 21 12:33 dcae-tca-anaytics:664m appc:2901Mi 1605 38221
-May 22 09:35 dcae-tca-anaytics:425m appc:2837Mi 1459 38488
-May 23 11:01 cassandra-1:371m appc:2849Mi 1829 39431
-============ ====================== =========== ========== ==========
-
-.. note::
- - Results are given from the command "kubectl -n onap top pods | sort -rn -k 3
- | head -20"
- - * sum of the top 20 CPU consumption
- - ** sum of the top 20 RAM consumption
+The expectation is **100% OK**.
-CI results
-==========
+.. image:: files/s3p/istanbul_daily_healthcheck.png
+ :align: center
+
+Smoke Tests
+~~~~~~~~~~~
-A daily Frankfurt CI chain has been created after RC0.
+These tests are automated end-to-end use case tests.
+See :ref:`the Integration Test page <integration-tests>` for details.
-The evolution of the full healthcheck test suite can be described as follows:
+The expectation is **100% OK**.
-|image1|
+.. figure:: files/s3p/istanbul_daily_smoke.png
+ :align: center
-Full healthcheck testsuite verifies the status of each component. It is
-composed of 47 tests. The success rate from the 9th to the 28th was never under
-95%.
+Security Tests
+~~~~~~~~~~~~~~
-4 test categories were defined:
+These tests deal with security.
+See :ref:`the Integration Test page <integration-tests>` for details.
-- infrastructure healthcheck: test of ONAP kubernetes cluster and help chart status
-- healthcheck tests: verification of the components in the target deployment
- environment
-- smoke tests: basic VM tests (including onboarding/distribution/instantiation),
- and automated use cases (pnf-registrate, hvves, 5gbulkpm)
-- security tests
+Waivers have been granted on different projects for the different tests.
+The list of waivers can be found in
+https://git.onap.org/integration/seccom/tree/waivers?h=jakarta.
-The security target (66% for Frankfurt) was reached after the RC1. A regression
-due to the automation of the hvves use case (triggering the exposition of a
-public port in HTTP) was fixed on the 28th of May.
+The nodeport_check_certs test is expected to fail. Even though tremendous
+progress has been made in this area, some certificates (unmaintained, upstream
+or integration robot pods) are still not correct due to bad certificate issuers
+(invalid Root CA certificate) or an overly long validity period. Most of the
+certificates have been installed using cert-manager and will be easily renewable.
-|image2|
+The expectation is **80% OK**. The criterion is met.
-Orange Openlab
-==============
+.. figure:: files/s3p/istanbul_daily_security.png
+ :align: center
-The Orange Openlab is a community lab targeting ONAP end user. It provides an
-ONAP and cloud resources to discover ONAP.
-A Frankfurt pre-RC0 version was installed beginning of May. The usual gating
-testing suite was run daily in addition of the traffic generated by the lab
-users. The VM instantiation has been working well without any reinstallation
-over the **27** last days.
+Stability tests
+---------------
-Resilience
-==========
+Stability tests have been performed on the Istanbul release:
-The resilience test executed in El Alto was not realized in Frankfurt.
+- SDC stability test
+- Parallel instantiation test
-.. |image1| image:: files/s3p/daily_frankfurt1.png
- :width: 6.5in
+The results can be found in the weekly backend logs
+https://logs.onap.org/onap-integration/weekly/onap_weekly_pod4_istanbul.
+
+SDC stability test
+~~~~~~~~~~~~~~~~~~
-.. |image2| image:: files/s3p/daily_frankfurt2.png
- :width: 6.5in
+In this test, we take the basic_onboard automated test and run 5 onboarding
+procedures in parallel for 24h.
+
+The basic_onboard test consists of the following steps:
+
+- [SDC] VendorOnboardStep: Onboard vendor in SDC.
+- [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
+- [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
+- [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
+ in SDC.
+
+The test has been initiated on the Istanbul weekly lab on the 14th of November.
+
+As already observed in the daily, weekly and gating chains, we got race
+conditions on some tests (https://jira.onap.org/browse/INT-1918).
+
+The success rate is expected to be above 95% for the first 100 model uploads
+and above 80% until more than 500 models have been onboarded.
+
+The test duration also increases continuously over time. At the beginning the
+test takes about 200s; 24h later the same test takes around 1000s.
+Finally, after 36h, SDC systematically answers with an HTTP 500 status code,
+which explains the linear decrease of the success rate.
+
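As a rough illustration of the growth described above, the two quoted figures (about 200s at the start, around 1000s after 24h) can be turned into an average slope. This is only a sketch: it assumes a strictly linear trend between the two points, and the function names are illustrative, not part of the test suite.

```python
# Approximate figures quoted in the text above.
START_DURATION_S = 200.0   # duration of one onboarding test at the start
END_DURATION_S = 1000.0    # duration of the same test 24 h later
ELAPSED_H = 24.0


def growth_rate(start_s, end_s, elapsed_h):
    """Average increase of the test duration, in seconds per hour."""
    return (end_s - start_s) / elapsed_h


def projected_duration(start_s, rate_s_per_h, hours):
    """Duration predicted by a straight-line model after `hours` of run time."""
    return start_s + rate_s_per_h * hours


rate = growth_rate(START_DURATION_S, END_DURATION_S, ELAPSED_H)
print(f"average growth: {rate:.1f} s per hour")
print(f"straight-line projection at 36 h: "
      f"{projected_duration(START_DURATION_S, rate, 36.0):.0f} s")
```

The projection is purely arithmetic; it says nothing about the cause of the HTTP 500 answers observed after 36h.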
+The following graphs provide a good view of the SDC stability test.
+
+.. image:: files/s3p/istanbul_sdc_stability.png
+ :align: center
+
+.. csv-table:: S3P Onboarding stability results
+ :file: ./files/csv/s3p-sdc.csv
+ :widths: 60,20,20,20
+ :delim: ;
+ :header-rows: 1
+
+.. important::
+ The onboarding duration increases linearly with the number of onboarded
+ models. This has already been reported and may be due to the fact that
+ models cannot be deleted: the test client has to retrieve the list of
+ models, which is continuously increasing. No limit tests have been
+ performed.
+ However, 1085 onboarded models is already a very high figure with regard to
+ possible ONAP usage.
+ Moreover, the mean duration is much lower in Istanbul, which explains why
+ it was possible to run 35% more tests within the same time frame.
+
+Parallel instantiations stability test
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The test is based on a single test (basic_vm), which consists of the following steps:
+
+- [SDC] VendorOnboardStep: Onboard vendor in SDC.
+- [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
+- [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
+- [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
+ in SDC.
+- [AAI] RegisterCloudRegionStep: Register cloud region.
+- [AAI] ComplexCreateStep: Create complex.
+- [AAI] LinkCloudRegionToComplexStep: Connect cloud region with complex.
+- [AAI] CustomerCreateStep: Create customer.
+- [AAI] CustomerServiceSubscriptionCreateStep: Create customer's service
+ subscription.
+- [AAI] ConnectServiceSubToCloudRegionStep: Connect service subscription with
+ cloud region.
+- [SO] YamlTemplateServiceAlaCarteInstantiateStep: Instantiate service described
+ in YAML using SO a'la carte method.
+- [SO] YamlTemplateVnfAlaCarteInstantiateStep: Instantiate vnf described in YAML
+ using SO a'la carte method.
+- [SO] YamlTemplateVfModuleAlaCarteInstantiateStep: Instantiate VF module
+ described in YAML using SO a'la carte method.
+
+10 instantiation attempts are run simultaneously on the ONAP solution for 24h.
+
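The way such a campaign is driven can be sketched as follows. This is not the integration team's actual tooling: ``run_once`` is a hypothetical stand-in for one basic_vm run, and the use of a thread pool is an assumption made for illustration. Each worker repeats the test until a shared deadline, and the success rate is computed over all attempts.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_once():
    """Hypothetical stand-in for one basic_vm run; returns True on success."""
    time.sleep(0.01)
    return True


def worker(deadline, test):
    """Repeat `test` until `deadline` (a time.monotonic() value).

    Returns a tuple (successes, attempts).
    """
    ok = total = 0
    while time.monotonic() < deadline:
        total += 1
        try:
            if test():
                ok += 1
        except Exception:
            pass  # an exception is counted as a failed attempt
    return ok, total


def run_campaign(test, parallel=10, duration_s=24 * 3600):
    """Run `parallel` workers for `duration_s` seconds.

    Returns (successes, attempts, success rate in percent).
    """
    deadline = time.monotonic() + duration_s
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = [pool.submit(worker, deadline, test) for _ in range(parallel)]
        results = [future.result() for future in futures]
    ok = sum(r[0] for r in results)
    total = sum(r[1] for r in results)
    rate = 100.0 * ok / total if total else 0.0
    return ok, total, rate


if __name__ == "__main__":
    # Short demo run; the campaign described above lasts 24h with 10 workers.
    ok, total, rate = run_campaign(run_once, parallel=10, duration_s=0.2)
    print(f"{ok}/{total} attempts succeeded ({rate:.1f}%)")
```

Counting exceptions as failed attempts keeps one failing run from stopping its worker, so the campaign continues for the whole time window.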
+The results can be described as follows:
+
+.. image:: files/s3p/istanbul_instantiation_stability_10.png
+ :align: center
+
+.. csv-table:: S3P Instantiation stability results
+ :file: ./files/csv/s3p-instantiation.csv
+ :widths: 60,20,20,20
+ :delim: ;
+ :header-rows: 1
+
+The results are good, with a success rate above 95%. After 24h more than 1300
+VNFs have been created and deleted.
+
+As for SDC, we can observe a linear increase of the test duration. This issue
+has been reported since Guilin. For SDC, as it is not possible to delete the
+models, the duration plausibly increases because the database of models keeps
+growing, so the client has to retrieve an ever larger list of models.
+For the instantiations this is not the case: the references
+(module, VNF, service) are cleaned at the end of each test and all the tests
+use the same model. The duration of an instantiation test should therefore be
+almost constant, which is not what we observe. Further investigations are
+needed.
+
+.. important::
+ The test has been executed with the mariadb-galera replicaset set to 1
+ (3 by default). With this configuration the results during 24h are very
+ good. When set to 3, the error rate is higher and after some hours
+ most of the instantiations fail.
+ However, even with a replicaset set to 1, a test on the Master weekly chain
+ showed that the system hits another limit after about 35h
+ (https://jira.onap.org/browse/SO-3791).