Diffstat (limited to 'docs')
-rw-r--r-- docs/development/devtools/distribution-s3p-results/distribution-jmeter-testcases.png | bin 68050 -> 57822 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p-results/performance-monitor.png | bin 27349 -> 136960 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p-results/performance-statistics.png | bin 105215 -> 118840 bytes
-rw-r--r--[-rwxr-xr-x] docs/development/devtools/distribution-s3p-results/performance-threads.png | bin 43635 -> 197890 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p-results/performance-threshold.png | bin 68979 -> 77349 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p-results/stability-monitor.png | bin 27304 -> 101015 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p-results/stability-statistics.png | bin 98335 -> 111593 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p-results/stability-threads.png | bin 47752 -> 202963 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p-results/stability-threshold.png | bin 62425 -> 71809 bytes
-rw-r--r-- docs/development/devtools/distribution-s3p.rst | 50
-rw-r--r-- docs/development/prometheus-metrics.rst | 153
11 files changed, 155 insertions, 48 deletions
diff --git a/docs/development/devtools/distribution-s3p-results/distribution-jmeter-testcases.png b/docs/development/devtools/distribution-s3p-results/distribution-jmeter-testcases.png
index db28a7b2..86a437a7 100644
--- a/docs/development/devtools/distribution-s3p-results/distribution-jmeter-testcases.png
+++ b/docs/development/devtools/distribution-s3p-results/distribution-jmeter-testcases.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/performance-monitor.png b/docs/development/devtools/distribution-s3p-results/performance-monitor.png
index e7a12ed7..71fd7fca 100644
--- a/docs/development/devtools/distribution-s3p-results/performance-monitor.png
+++ b/docs/development/devtools/distribution-s3p-results/performance-monitor.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/performance-statistics.png b/docs/development/devtools/distribution-s3p-results/performance-statistics.png
index e6218537..3f8693c7 100644
--- a/docs/development/devtools/distribution-s3p-results/performance-statistics.png
+++ b/docs/development/devtools/distribution-s3p-results/performance-statistics.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/performance-threads.png b/docs/development/devtools/distribution-s3p-results/performance-threads.png
index b59b7db6..2488abd9 100755..100644
--- a/docs/development/devtools/distribution-s3p-results/performance-threads.png
+++ b/docs/development/devtools/distribution-s3p-results/performance-threads.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/performance-threshold.png b/docs/development/devtools/distribution-s3p-results/performance-threshold.png
index 85c2f5d4..73b20ff2 100644
--- a/docs/development/devtools/distribution-s3p-results/performance-threshold.png
+++ b/docs/development/devtools/distribution-s3p-results/performance-threshold.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/stability-monitor.png b/docs/development/devtools/distribution-s3p-results/stability-monitor.png
index 2d2848d9..bebaaeb0 100644
--- a/docs/development/devtools/distribution-s3p-results/stability-monitor.png
+++ b/docs/development/devtools/distribution-s3p-results/stability-monitor.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/stability-statistics.png b/docs/development/devtools/distribution-s3p-results/stability-statistics.png
index 04cd9063..f8465eb3 100644
--- a/docs/development/devtools/distribution-s3p-results/stability-statistics.png
+++ b/docs/development/devtools/distribution-s3p-results/stability-statistics.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/stability-threads.png b/docs/development/devtools/distribution-s3p-results/stability-threads.png
index a2e9e9f0..4cfd7a78 100644
--- a/docs/development/devtools/distribution-s3p-results/stability-threads.png
+++ b/docs/development/devtools/distribution-s3p-results/stability-threads.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p-results/stability-threshold.png b/docs/development/devtools/distribution-s3p-results/stability-threshold.png
index a9cc71eb..f348761b 100644
--- a/docs/development/devtools/distribution-s3p-results/stability-threshold.png
+++ b/docs/development/devtools/distribution-s3p-results/stability-threshold.png
Binary files differ
diff --git a/docs/development/devtools/distribution-s3p.rst b/docs/development/devtools/distribution-s3p.rst
index 9ae93378..9a169bad 100644
--- a/docs/development/devtools/distribution-s3p.rst
+++ b/docs/development/devtools/distribution-s3p.rst
@@ -10,22 +10,6 @@ Policy Distribution component
72h Stability and 4h Performance Tests of Distribution
++++++++++++++++++++++++++++++++++++++++++++++++++++++
-VM Details
-----------
-
-The stability and performance tests are performed on VM's running in the OpenStack cloud
-environment in the ONAP integration lab.
-
-**Policy VM details**
-
-- OS: Ubuntu 18.04 LTS (GNU/Linux 4.15.0-151-generic x86_64)
-- CPU: 4 core
-- RAM: 15 GB
-- HardDisk: 39 GB
-- Docker version 20.10.7, build 20.10.7-0ubuntu1~18.04.2
-- Java: openjdk 11.0.11 2021-04-20
-
-
Common Setup
------------
@@ -88,7 +72,7 @@ Install and verify docker-compose
.. code-block:: bash
- # Install compose
+ # Install compose (check if version is still available or update as necessary)
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
@@ -118,9 +102,9 @@ Modify the versions.sh script to match all the versions being tested.
vi ~/distribution/testsuites/stability/src/main/resources/setup/versions.sh
-Ensure the correct docker image versions are specified - e.g. for Istanbul-M4
+Ensure the correct docker image versions are specified - e.g. for Jakarta-M4
-- export POLICY_DIST_VERSION=2.6.1-SNAPSHOT
+- export POLICY_DIST_VERSION=2.7-SNAPSHOT
Run the start.sh script to start the components. After installation, the script executes
``docker ps`` and shows the running containers.
@@ -137,14 +121,13 @@ Run the start.sh script to start the components. After installation, script will
Creating policy-api ... done
Creating policy-pap ... done
- CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
- f91be98ad1f4 nexus3.onap.org:10001/onap/policy-pap:2.5.1-SNAPSHOT "/opt/app/policy/pap…" 1 second ago Up Less than a second 6969/tcp policy-pap
- d92cdbe971d4 nexus3.onap.org:10001/onap/policy-api:2.5.1-SNAPSHOT "/opt/app/policy/api…" 1 second ago Up Less than a second 6969/tcp policy-api
- 9a019f5d641e nexus3.onap.org:10001/onap/policy-db-migrator:2.3.1-SNAPSHOT "/opt/app/policy/bin…" 2 seconds ago Up 1 second 6824/tcp policy-db-migrator
- 108ba238edeb nexus3.onap.org:10001/mariadb:10.5.8 "docker-entrypoint.s…" 3 seconds ago Up 1 second 3306/tcp mariadb
- bec9b223e79f nexus3.onap.org:10001/onap/policy-models-simulator:2.5.1-SNAPSHOT "simulators.sh" 3 seconds ago Up 1 second 3905/tcp simulator
- 74aa5abeeb08 nexus3.onap.org:10001/onap/policy-distribution:2.6.1-SNAPSHOT "/opt/app/policy/bin…" 3 seconds ago Up 1 second 6969/tcp, 9090/tcp policy-distribution
-
+ fa4e9bd26e60 nexus3.onap.org:10001/onap/policy-pap:2.6-SNAPSHOT-latest "/opt/app/policy/pap…" 1 second ago Up Less than a second 6969/tcp policy-pap
+ efb65dd95020 nexus3.onap.org:10001/onap/policy-api:2.6-SNAPSHOT-latest "/opt/app/policy/api…" 1 second ago Up Less than a second 6969/tcp policy-api
+ cf602c2770ba nexus3.onap.org:10001/onap/policy-db-migrator:2.4-SNAPSHOT-latest "/opt/app/policy/bin…" 2 seconds ago Up 1 second 6824/tcp policy-db-migrator
+ 99383d2fecf4 pdp/simulator "sh /opt/app/policy/…" 2 seconds ago Up 1 second pdp-simulator
+ 3c0e205c5f47 nexus3.onap.org:10001/onap/policy-models-simulator:2.6-SNAPSHOT-latest "simulators.sh" 3 seconds ago Up 2 seconds 3904/tcp simulator
+ 3ad00d90d6a3 nexus3.onap.org:10001/onap/policy-distribution:2.7-SNAPSHOT-latest "/opt/app/policy/bin…" 3 seconds ago Up 2 seconds 6969/tcp, 9090/tcp policy-distribution
+ bb0b915cdecc nexus3.onap.org:10001/mariadb:10.5.8 "docker-entrypoint.s…" 3 seconds ago Up 2 seconds 3306/tcp mariadb
.. note::
The containers on this docker-compose are running with HTTP configuration. For HTTPS, ports
@@ -165,7 +148,7 @@ Download and install JMeter
# Install JMeter
mkdir -p jmeter
cd jmeter
- wget https://dlcdn.apache.org//jmeter/binaries/apache-jmeter-5.4.1.zip
+ wget https://dlcdn.apache.org//jmeter/binaries/apache-jmeter-5.4.1.zip # check if valid version
unzip -q apache-jmeter-5.4.1.zip
rm apache-jmeter-5.4.1.zip
@@ -180,7 +163,7 @@ monitor CPU, Memory and GC for Distribution while the stability tests are runnin
sudo apt install -y visualvm
-Run these commands to configure permissions
+Run these commands to configure permissions (if permission errors occur, use ``sudo su``)
.. code-block:: bash
@@ -255,6 +238,7 @@ The 72h stability test will run the following steps sequentially in a single thr
- **Add CSAR** - Adds CSAR to the directory that distribution is watching
- **Get Healthcheck** - Ensures Healthcheck is returning 200 OK
- **Get Statistics** - Ensures Statistics is returning 200 OK
+- **Get Metrics** - Ensures Metrics is returning 200 OK
- **Assert PDP Group Query** - Checks that PDPGroupQuery contains the deployed policy
- **Assert PoliciesDeployed** - Checks that the policy is deployed
- **Undeploy/Delete Policy** - Undeploys and deletes the Policy for the next loop
@@ -342,7 +326,7 @@ time and rest call throughput for all the requests when the number of requests a
saturate the resource and find the bottleneck.
It also tests that distribution can handle multiple policy CSARs and that these are deployed within
-30 seconds consistently.
+60 seconds consistently.
Setup Details
@@ -358,7 +342,7 @@ Performance test plan is different from the stability test plan.
- Instead of handling one policy csar at a time, multiple csar's are deployed within the watched
folder at the exact same time.
-- We expect all policies from these csar's to be deployed within 30 seconds.
+- We expect all policies from these csar's to be deployed within 60 seconds.
- There are also multithreaded tests running towards the healthcheck and statistics endpoints of
the distribution service.
@@ -368,7 +352,7 @@ Running the Test Plan
Check that the /tmp folder permissions allow the Testplan to insert the CSAR into the
/tmp/policydistribution/distributionmount folder.
-Clean up from previous run. If necessary, put containers down with script `down.sh` from setup
+Clean up from previous run. If necessary, put containers down with script ``down.sh`` from setup
folder mentioned on :ref:`Setup components <setup-distribution-s3p-components>`
.. code-block:: bash
@@ -401,3 +385,5 @@ Test Results
.. image:: distribution-s3p-results/performance-monitor.png
.. image:: distribution-s3p-results/performance-threads.png
+
+End of document
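
Reviewer sketch (not part of the commit): the performance plan above expects all policies to be deployed within 60 seconds. The actual check lives in the JMeter test plan, which this diff does not show. A minimal polling loop with a deadline, assuming a hypothetical ``check`` callable that reports whether the policies are deployed, could look like the following. The names ``wait_until``, ``clock`` and ``sleep`` are illustrative only.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 60.0,
               interval_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed.

    check is a hypothetical callable, e.g. one that queries PAP and
    returns True once every expected policy is reported as deployed.
    clock and sleep are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting the clock keeps the timeout logic testable; with the defaults the function simply waits up to 60 seconds in one-second steps.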
diff --git a/docs/development/prometheus-metrics.rst b/docs/development/prometheus-metrics.rst
index 90bc9225..341d6d5a 100644
--- a/docs/development/prometheus-metrics.rst
+++ b/docs/development/prometheus-metrics.rst
@@ -12,28 +12,149 @@ Prometheus Metrics support in Policy Framework Components
This page explains the prometheus metrics exposed by different Policy Framework components.
-XACML-PDP
+
+1. Context
+==========
+
+Collecting application metrics is the first step towards gaining insights into Policy Framework services and infrastructure from the point of view of Availability, Performance, Reliability and Scalability.
+
+The goal of monitoring is to achieve the below operational needs:
+
+1. Monitoring via dashboards: Provide visual aids to display health, key metrics for use by Operations.
+2. Alerting: Something is broken, and the issue must be addressed immediately OR, something might break soon, and proactive measures are taken to avoid such a situation.
+3. Conducting retrospective analysis: Rich information that is readily available to better troubleshoot issues.
+4. Analyzing trends: How fast is the usage growing? What does the incoming traffic look like? Helps assess needs for scaling to meet forecasted demands.
+
+The principles outlined in the `Four Golden Signals <https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals>`__ developed by Google Site Reliability Engineers have been adopted to define the key metrics for Policy Framework.
+
+- Request Rate: # of requests per second as served by Policy services.
+- Event Processing rate: # of requests/events per second as processed by the PDPs.
+- Errors: # of those requests/events processed that are failing.
+- Latency/Duration: Amount of time those requests take, and for PDPs relevant metrics for event processing times.
+- Saturation: Measures the degree of fullness or % utilization of a service emphasizing the resources that are most constrained: CPU, Memory, I/O, custom metrics by domain.
+
+
+2. Policy Framework Key metrics
+===============================
+
+System Metrics common across all Policy components
+--------------------------------------------------
+
+These standard metrics are available and exposed via a Prometheus endpoint since Istanbul release and can be categorized as below:
+
+CPU Usage
*********
-The following Prometheus metric counters are present in the current release:
+CPU usage percentage can be derived from *"system_cpu_usage"* for springboot applications and *"process_cpu_seconds_total"* for non springboot applications using `PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`__.
+
+Process uptime
+**************
+
+The process uptime in seconds is available via *"process_uptime_seconds"* for springboot applications, or can be derived from *"process_start_time_seconds"* otherwise.
+
+Status of the applications is available via the standard *"up"* metric.
+
+JVM memory metrics
+******************
+
+These metrics begin with the prefix *"jvm_memory_"*.
+There is a lot of data here; however, one of the key metrics to monitor is the total heap memory usage, *e.g. sum(jvm_memory_used_bytes{area="heap"})*.
+
+`PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`__ can be leveraged to represent the total or rate of memory usage.
+
+JVM thread metrics
+******************
+
+These metrics begin with the prefix *"jvm_threads_"*. Some of the key data to monitor for are:
+
+- *"jvm_threads_live_threads"* (springboot apps), or *"jvm_threads_current"* (non springboot) shows the total number of live threads, including daemon and non-daemon threads
+- *"jvm_threads_peak_threads"* (springboot apps), or *"jvm_threads_peak"* (non springboot) shows the peak total number of threads since the JVM started
+- *"jvm_threads_states_threads"* (springboot apps), or *"jvm_threads_state"* (non springboot) shows number of threads by thread state
+
+JVM garbage collection metrics
+******************************
+
+There are many garbage collection metrics, with prefix *"jvm_gc_"* available to get deep insights into how the JVM is managing memory. They can be broadly categorized into:
+
+- Pause duration *"jvm_gc_pause_"* for springboot applications gives us information about how long GC took. For non springboot applications, the collection duration metrics *"jvm_gc_collection_"* provide the same information.
+- Memory pool size increase can be assessed using *"jvm_gc_memory_allocated_bytes_total"* and *"jvm_gc_memory_promoted_bytes_total"* for springboot applications.
+
+Average garbage collection time and rate of garbage collection per second are key metrics to monitor.
+
+
+Key metrics for Policy API
+--------------------------
+
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Metric name | Metric description | Metric labels |
++=====================================+====================================================================================================+=======================================================================================================================================================================+
+| process_uptime_seconds | Uptime of policy-api application in seconds. | |
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| http_server_requests_seconds_count | Number of API requests filtered by uri, REST method and response status among other labels | "exception": any exception string; "method": REST method used; "outcome": response status string; "status": http response status code; "uri": REST endpoint invoked |
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| http_server_requests_seconds_sum | Time taken for an API request filtered by uri, REST method and response status among other labels | "exception": any exception string; "method": REST method used; "outcome": response status string; "status": http response status code; "uri": REST endpoint invoked |
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+
+Key metrics for Policy PAP
+--------------------------
+
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Metric name | Metric description | Metric labels |
++=====================================+====================================================================================================+=======================================================================================================================================================================+
+| process_uptime_seconds | Uptime of policy-pap application in seconds. | |
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| http_server_requests_seconds_count | Number of API requests filtered by uri, REST method and response status among other labels | "exception": any exception string; "method": REST method used; "outcome": response status string; "status": http response status code; "uri": REST endpoint invoked |
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| http_server_requests_seconds_sum | Time taken for an API request filtered by uri, REST method and response status among other labels | "exception": any exception string; "method": REST method used; "outcome": response status string; "status": http response status code; "uri": REST endpoint invoked |
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| pap_policy_deployments | Number of TOSCA policy deploy/undeploy operations | "operation": Possible values are deploy, undeploy; "status": Deploy/Undeploy status values - SUCCESS, FAILURE, TOTAL |
++-------------------------------------+----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+
+Key metrics for APEX-PDP
+------------------------
+
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+| Metric name | Metric description | Metric labels |
++=============================================+=====================================================================================+======================================================================================================================+
+| process_start_time_seconds | Uptime of apex-pdp application in seconds | |
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+| pdpa_policy_deployments_total | Number of TOSCA policy deploy/undeploy operations | "operation": Possible values are deploy, undeploy; "status": Deploy/Undeploy status values - SUCCESS, FAILURE, TOTAL |
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+| pdpa_policy_executions_total | Number of TOSCA policy executions | "status": Execution status values - SUCCESS, FAILURE, TOTAL |
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+| pdpa_engine_state | State of APEX engine | "engine_instance_id": ID of the engine thread |
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+| pdpa_engine_last_start_timestamp_epoch | Epoch timestamp of the instance when engine was last started to derive uptime from | "engine_instance_id": ID of the engine thread |
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+| pdpa_engine_event_executions | Number of APEX event executions per engine thread | "engine_instance_id": ID of the engine thread |
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+| pdpa_engine_average_execution_time_seconds | Average time taken to execute an APEX policy in seconds | "engine_instance_id": ID of the engine thread |
++---------------------------------------------+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
+
+Key metrics for Drools PDP
+--------------------------
-- pdpx_policy_deployments_total counts the total number of deployment operations.
-- pdpx_policy_decisions_total counts the total number of decisions.
+Key metrics for XACML PDP
+-------------------------
-pdpx_policy_deployments_total
-+++++++++++++++++++++++++++++
++--------------------------------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Metric name | Metric description | Metric labels |
++================================+===================================================+==============================================================================================================================================================================================================================+
+| process_start_time_seconds | Uptime of xacml-pdp application in seconds. | |
++--------------------------------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| pdpx_policy_deployments_total | Counts the total number of deployment operations | "deploy": Counts the number of successful or failed deploys; "undeploy": Counts the number of successful or failed undeploys |
++--------------------------------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| pdpx_policy_decisions_total | Counts the total number of decisions | "permit": Counts the number of permit decisions; "deny": Counts the number of deny decisions; "indeterminant": Counts the number of indeterminant decisions; "not_applicable": Counts the number of not applicable decisions. |
++--------------------------------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-This counter supports the following labels:
-- "deploy": Counts the number of successful or failed deploys.
-- "undeploy": Counts the number of successful or failed undeploys.
+Key metrics for Policy Distribution
+-----------------------------------
-pdpx_policy_decisions_total
-+++++++++++++++++++++++++++
+3. OOM changes to enable prometheus monitoring for Policy Framework
+===================================================================
-This counter supports the following labels:
+Policy Framework uses the ServiceMonitor custom resource definition (CRD) to allow Prometheus to monitor the services it exposes. Label selection is used to determine which services are selected to be monitored.
+For label management and troubleshooting refer to the documentation at: `Prometheus operator <https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md#overview-of-servicemonitor-tagging-and-related-elements>`__.
-- "permit": Counts the number of permit decisions.
-- "deny": Counts the number of deny decisions.
-- "indeterminant": Counts the number of indeterminant decisions.
-- "not_applicable": Counts the number of not applicable decisions.
+`OOM charts <https://github.com/onap/oom/tree/master/kubernetes/policy/components>`__ for policy include a ServiceMonitor, and its properties can be overridden based on the deployment specifics.
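
Reviewer sketch (not part of the commit): the Policy API and PAP tables list ``http_server_requests_seconds_sum`` and ``http_server_requests_seconds_count`` with a ``uri`` label. Dividing the first by the second gives the mean request duration per endpoint. The example below illustrates that arithmetic on a made-up scrape sample; the URIs and values are assumptions, and the parser is deliberately simplified (it ignores timestamps and escaped characters in label values).

```python
import re
from collections import defaultdict

# Made-up sample in the Prometheus text exposition format.
SAMPLE = '''\
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{method="GET",status="200",uri="/healthcheck"} 4
http_server_requests_seconds_sum{method="GET",status="200",uri="/healthcheck"} 0.2
http_server_requests_seconds_count{method="POST",status="200",uri="/policies"} 2
http_server_requests_seconds_sum{method="POST",status="200",uri="/policies"} 1.0
'''

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs


def mean_latency_by_uri(text):
    """Return {uri: seconds per request} from _sum and _count samples."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, samples without labels
        name, raw_labels, value = match.groups()
        uri = dict(LABEL.findall(raw_labels)).get("uri")
        if uri is None:
            continue
        if name == "http_server_requests_seconds_sum":
            sums[uri] += float(value)
        elif name == "http_server_requests_seconds_count":
            counts[uri] += float(value)
    # Aggregate across method/status labels, then divide per endpoint.
    return {uri: sums[uri] / counts[uri] for uri in counts if counts[uri] > 0}


if __name__ == "__main__":
    for uri, seconds in sorted(mean_latency_by_uri(SAMPLE).items()):
        print(f"{uri}: {seconds:.3f} s per request")
```

This yields an average over the lifetime of the process. In Prometheus itself the same ratio is usually taken over ``rate()`` of the two series so that it reflects a recent time window.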