After DOR test on AIO-DX, alarm for CPU level above 90% was not cleared for more than 5 mins

Bug #1797438 reported by Anujeyan Manokeran
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bin Qian
Milestone: (none)

Bug Description

After the DOR (Dead Office Recovery) test, the CPU level on both controllers goes above 90%. The CPU alarm on the standby controller (controller-0) was not cleared for more than 5 minutes. This was observed on an AIO-duplex system.

: Executing command...
[2018-10-09 21:10:29,207] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2018-10-09 21:10:30,628] 382 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
| 84bc0544-9074-42e5-817b-35342b35c3b8 | 100.101 | Platform CPU threshold exceeded; 95%, actual 95% | host=controller-1 | critical | 2018-10-09T21:08:52.881034 |
| 6d631cbd-fa20-4e73-84ed-077b8dbf1863 | 100.101 | Platform CPU threshold exceeded; 95%, actual 97% | host=controller-0 | critical | 2018-10-09T20:41:01.853283 |
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
controller-1:~$

[2018-10-09 21:14:24,098] 419 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2018-10-09 21:14:24,098] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2018-10-09 21:14:25,684] 382 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
| 6d631cbd-fa20-4e73-84ed-077b8dbf1863 | 100.101 | Platform CPU threshold exceeded; 95%, actual 97% | host=controller-0 | critical | 2018-10-09T20:41:01.853283 |
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
controller-1:~$
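
To make the two snapshots above easier to compare, here is a minimal sketch that parses the ASCII table printed by `fm alarm-list` and computes how long an alarm has been raised. The helper names `parse_alarm_table` and `alarm_age_seconds` are hypothetical (they are not part of the `fm` CLI), and the sketch assumes the log timestamps and the alarm "Time Stamp" column use the same clock.

```python
from datetime import datetime

# Second snapshot from the log above (taken at about 2018-10-09 21:14:25).
SAMPLE = """\
+------+----------+-------------+-----------+----------+------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------+----------+-------------+-----------+----------+------------+
| 6d631cbd-fa20-4e73-84ed-077b8dbf1863 | 100.101 | Platform CPU threshold exceeded; 95%, actual 97% | host=controller-0 | critical | 2018-10-09T20:41:01.853283 |
+------+----------+-------------+-----------+----------+------------+
controller-1:~$
"""


def parse_alarm_table(text):
    """Turn the table text into a list of dicts keyed by column header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts, and log lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def alarm_age_seconds(alarm, now):
    """Seconds between the alarm's Time Stamp and `now` (same clock assumed)."""
    raised = datetime.strptime(alarm["Time Stamp"], "%Y-%m-%dT%H:%M:%S.%f")
    return (now - raised).total_seconds()


if __name__ == "__main__":
    snapshot_time = datetime(2018, 10, 9, 21, 14, 25)
    for alarm in parse_alarm_table(SAMPLE):
        age = alarm_age_seconds(alarm, snapshot_time)
        print(alarm["Entity ID"], alarm["Alarm ID"], f"{age:.0f}s")
```

Under these assumptions, the controller-0 alarm raised at 20:41:01 is still present in the 21:14:25 snapshot, which is well beyond the expected 5-minute window.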

Severity
--------
Major

Steps to Reproduce
------------------
1. Launch VMs
2. Power off all the hosts
3. Power on all the hosts
4. Wait for all the hosts to become enabled and active
5. Verify alarms. CPU usage on both controllers was 90-95%
6. The CPU alarm on controller-0 was not cleared for more than 5 minutes

Expected Behavior
------------------
The alarm clears within 5 minutes.
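
The expected behavior can be checked with a simple polling loop. The following is a minimal sketch, not the test framework's actual code: `wait_for_alarm_clear` and `fetch_alarms` are hypothetical names, `fetch_alarms` is assumed to return the active alarms as dicts with "Alarm ID" and "Entity ID" keys (for example, parsed `fm alarm-list` output), and the 10-second polling interval is an arbitrary choice.

```python
import time


def wait_for_alarm_clear(fetch_alarms, alarm_id, entity_id,
                         timeout=300.0, interval=10.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until the alarm is gone; return True if it cleared in time.

    `clock` and `sleep` are injectable so the loop can be tested
    without actually waiting five minutes.
    """
    deadline = clock() + timeout
    while True:
        still_active = any(
            alarm["Alarm ID"] == alarm_id and alarm["Entity ID"] == entity_id
            for alarm in fetch_alarms()
        )
        if not still_active:
            return True
        if clock() >= deadline:
            return False  # alarm outlived the allowed window
        sleep(interval)
```

A call such as `wait_for_alarm_clear(fetch, "100.101", "host=controller-0")` would return `False` in the failing scenario described in this report.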

Actual Behavior
----------------
As per description

Reproducibility
---------------
50% reproducible. The alarm does not clear within 5 minutes.

System Configuration
--------------------
Duplex system

Branch/Pull Time/Commit
-----------------------
2018-10-08_01-52-01

Timestamp/Logs
--------------
2018-10-09 21:10:29,207

Tags: stx.1.0 stx.ha
Ghada Khalil (gkhalil)
summary: - STX: After DOR test on AIO-DX alarm for CPU level above 90% was not
- cleared for more than 5 mins
+ After DOR test on AIO-DX, alarm for CPU level above 90% was not cleared
+ for more than 5 mins
Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Ghada Khalil (gkhalil) wrote :

Suspecting this is a result of an SM CPU hog introduced in the last few weeks. Gating for stx.2018.10

tags: added: stx.ha
tags: added: stx.2018.10
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-ha (master)

Reviewed: https://review.openstack.org/610022
Committed: https://git.openstack.org/cgit/openstack/stx-ha/commit/?id=3eb68cc2310f6c634ef9b9b167a1450b1cdd983b
Submitter: Zuul
Branch: master

commit 3eb68cc2310f6c634ef9b9b167a1450b1cdd983b
Author: Bin Qian <email address hidden>
Date: Fri Oct 12 08:34:36 2018 -0400

    Add idle time to worker thread

    Adding 50ms idle time in worker thread loop.
    This is to fix sm taking 100% cpu for a busy loop.

    Closes-Bug 1797438

    Change-Id: Ia41acfab86c0188ceb5c80822010376977c6fc74
    Signed-off-by: Bin Qian <email address hidden>
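
The idea behind the fix can be illustrated with a short sketch. This is not the stx-ha code (SM is not written in Python); it is a hypothetical worker loop that assumes the busy loop came from polling an empty work queue, and it shows how a 50 ms idle time keeps such a loop from pinning a CPU core. The names `worker_loop`, `tasks`, and `stop_event` are illustrative only.

```python
import queue
import threading
import time

IDLE_SECONDS = 0.05  # 50 ms, the idle time named in the commit message


def worker_loop(tasks, stop_event, idle=IDLE_SECONDS):
    """Run queued callables, idling briefly when the queue is empty."""
    while not stop_event.is_set():
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            # Without this pause the loop would spin continuously and
            # consume close to 100% of one CPU core while doing no work.
            time.sleep(idle)
            continue
        task()


if __name__ == "__main__":
    tasks = queue.Queue()
    stop_event = threading.Event()
    tasks.put(lambda: print("task executed"))
    worker = threading.Thread(target=worker_loop, args=(tasks, stop_event))
    worker.start()
    time.sleep(0.2)   # let the worker pick up the task and go idle
    stop_event.set()
    worker.join()
```

The trade-off in this sketch is latency: a task that arrives while the worker is idling may wait up to 50 ms before it runs.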

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-ha (r/2018.10)

Fix proposed to branch: r/2018.10
Review: https://review.openstack.org/610998

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-ha (r/2018.10)

Reviewed: https://review.openstack.org/610998
Committed: https://git.openstack.org/cgit/openstack/stx-ha/commit/?id=a63f6ea6caca9a2ab5bab2f44f7bd1caba816315
Submitter: Zuul
Branch: r/2018.10

commit a63f6ea6caca9a2ab5bab2f44f7bd1caba816315
Author: Bin Qian <email address hidden>
Date: Fri Oct 12 08:34:36 2018 -0400

    Add idle time to worker thread

    Adding 50ms idle time in worker thread loop.
    This is to fix sm taking 100% cpu for a busy loop.

    Closes-Bug 1797438

    Change-Id: Ia41acfab86c0188ceb5c80822010376977c6fc74
    Signed-off-by: Bin Qian <email address hidden>

Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10