Grid Infrastructure Cluster Health Monitor – A Deconstruction

While digging around Grid Infrastructure logs, I came across a new 12c feature called Cluster Health Monitor (CHM). I knew the MGMTDB database was good for something when I opted to install it, even though it is not required.

From Oracle’s Documentation

“The Cluster Health Monitor (CHM) detects and analyzes operating system and cluster resource-related degradation and failures. CHM stores real-time operating system metrics in the Oracle Grid Infrastructure Management Repository that you can use for later triage with the help of My Oracle Support should you have cluster issues.”

CHM consists of three components (see below) that collect and store data on the cluster's overall health for later review.

System Services Monitor (osysmond) 

Every node in the cluster runs this process, which provides real-time monitoring and metric collection at the operating system level.

Cluster Logger Service (ologgerd)

This process retrieves data from osysmond and writes it to the repository.

Grid Infrastructure Management Repository

An Oracle instance that stores the data collected by osysmond. It runs on only a single (hub) node in a cluster and, by design, will fail over to another node should the one it is on become unavailable. Interestingly enough, the data files for the instance are located on the same disk group as the OCR and voting files. The Oracle documentation does not discuss any specific sizing, but the oclumon utility is responsible for retention of the stored data.

Let’s take a look at all processes associated with a Grid Infrastructure setup.


[root@flex1 ~]# ps -ef | grep root | grep grid
root 2210 1 1 21:07 ? 00:00:57 /u01/app/ reboot
root 2516 1 0 21:07 ? 00:00:06 /u01/app/
root 2729 1 0 21:07 ? 00:00:02 /u01/app/
root 2743 1 0 21:07 ? 00:00:02 /u01/app/
root 4809 1 0 21:08 ? 00:00:20 /u01/app/ reboot
root 5699 1 1 21:08 ? 00:01:04 /u01/app/
root 5705 1 0 21:08 ? 00:00:39 /u01/app/ reboot
root 5969 1 0 21:08 ? 00:00:22 /u01/app/
root 19405 1 0 22:10 ? 00:00:01 /u01/app/ -trace-level 1 -ip-address -startup-endpoint ipc://GNS_flex1.muscle_5969_a598858b344350d1
root 20713 1 0 22:13 ? 00:00:01 /u01/app/ -M -d /u01/app/

Diagnostics Collection

The most convenient way to query the data in the CHM repository is the oclumon utility.

To collect diagnostic information, preferably from all nodes in the cluster, you can run the collection script located in $GRID_HOME/bin. The script has options to collect either all CRS daemon process logs or only specific ones.

[root@flex1 tmp]# /u01/app/ --help
Production Copyright 2004, 2010, Oracle. All rights reserved

Cluster Ready Services (CRS) diagnostic collection tool

[--crs] For collecting crs diagnostic information
[--adr] For collecting diagnostic information for ADR; specify ADR location
[--chmos] For collecting Cluster Health Monitor (OS) data
[--acfs] Unix only. For collecting ACFS diagnostic information
[--all] Default.For collecting all diagnostic information.
[--core] UNIX only. Package core files with CRS data
[--afterdate] UNIX only. Collects archives from the specified date. Specify in mm/dd/yyyy format
[--aftertime] Supported with -adr option. Collects archives after the specified time. Specify in YYYYMMDDHHMISS24 format
[--beforetime] Supported with -adr option. Collects archives before the specified date. Specify in YYYYMMDDHHMISS24 format
[--crshome] Argument that specifies the CRS Home location
[--incidenttime] Collects Cluster Health Monitor (OS) data from the specified time. Specify in MM/DD/YYYYHH24:MM:SS format
If not specified, Cluster Health Monitor (OS) data generated in the past 24 hours are collected
[--incidentduration] Collects Cluster Health Monitor (OS) data for the duration after the specified time. Specify in HH:MM format.
If not specified, all Cluster Health Monitor (OS) data after incidenttime are collected
1. You can also do the following --collect --crs --crshome
--clean cleans up the diagnosability
information gathered by this script
--coreanalyze UNIX only. Extracts information from core files
and stores it in a text file
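The timestamp formats in the help output above (MM/DD/YYYYHH24:MM:SS for --incidenttime, HH:MM for --incidentduration) are easy to get wrong. A small helper, sketched here in Python and not part of any Oracle tooling, can build them:

```python
from datetime import datetime

def incident_window(start, hours):
    """Format --incidenttime / --incidentduration arguments in the
    formats the collection script's help output describes:
    MM/DD/YYYYHH24:MM:SS for the start, HH:MM for the duration."""
    return (start.strftime("%m/%d/%Y%H:%M:%S"), "%02d:00" % hours)

# Collect two hours of CHM data starting Feb 6, 2014 at 21:00.
t = datetime(2014, 2, 6, 21, 0, 0)
print(incident_window(t, 2))   # ('02/06/201421:00:00', '02:00')
```

Note that the date and time run together with no separator in the --incidenttime format, which is exactly why a helper like this is handy.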

1. First off, we need to find out which node the OLOGGERD service is currently running on.

[root@flex1 bin]# /u01/app/ manage -get master

Master = flex1

2. Good, it happens to run on the same node I am currently on. Next, we can invoke the script to collect the data in the repository.

[root@flex1 tmp]# /u01/app/ --collect
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
The following CRS diagnostic archives will be created in the local directory.
crsData_flex1_20140206_2335.tar.gz -> logs,traces and cores from CRS home. Note: core files will be packaged only with the --core option.
ocrData_flex1_20140206_2335.tar.gz -> ocrdump, ocrcheck etc
coreData_flex1_20140206_2335.tar.gz -> contents of CRS core files in text format

osData_flex1_20140206_2335.tar.gz -> logs from Operating System
Collecting crs data
/bin/tar: log/flex1/cssd/ocssd.log: file changed as we read it
Collecting OCR data
Collecting information from core files
No corefiles found
The following diagnostic archives will be created in the local directory.
acfsData_flex1_20140206_2335.tar.gz -> logs from acfs log.
Collecting acfs data
Collecting OS logs
Collecting sysconfig data

3. It generates a few tarballs and a text file.

[root@flex1 tmp]# ls -lhtr
total 25M
-rw-r--r-- 1 root root 25M Feb 6 23:36 crsData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root 57K Feb 6 23:37 ocrData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root 927 Feb 6 23:37 acfsData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root 329K Feb 6 23:37 osData_flex1_20140206_2335.tar.gz
-rw-r--r-- 1 root root 31K Feb 6 23:37 sysconfig_flex1_20140206_2335.txt

4. You can limit the data that is collected by using the date options:

[root@flex1 tmp]# /u01/app/ --collect --afterdate 02/04/2014

5. I was curious, so I untarred the crsData_flex1_20140206_2335.tar.gz file, and found that the logs come from the following locations in the $GRID_HOME directory.



Now that we have dispensed with the logs, let’s see what this fancy OCLUMON can do.

1. First off, we need to set the logging level for the daemon we’d like to monitor.

[root@flex1 tmp]# /u01/app/ debug log osysmond CRFMOND:3

2. Next, start the dump with the dumpnodeview option.

[root@flex1 tmp]# /u01/app/ dumpnodeview -n flex1


Node: flex1 Clock: '14-02-07 00.02.08' SerialNo:2081



#pcpus: 1 #vcpus: 1 cpuht: N chipname: Intel(R) cpu: 26.27 cpuq: 2 physmemfree: 141996 physmemtotal: 4055420 mcache: 2017648 swapfree: 3751724 swaptotal: 4063228 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 105 iow: 248 ios: 55 swpin: 0 swpout: 0 pgin: 105 pgout: 182 netr: 47.876 netw: 24.977 procs: 302 rtprocs: 12 #fds: 24800 #sysfdlimit: 6815744 #disks: 9 #nics: 3 nicErrors: 0


topcpu: 'apx_vktm_+apx1(7240) 3.40' topprivmem: 'java(19943) 138464' topshm: 'ora_mman_sport(14493) 223920' topfd: 'ocssd.bin(2778) 341' topthread: 'console-kit-dae(1973) 64'


Node: flex1 Clock: '14-02-07 00.02.13' SerialNo:2082



#pcpus: 1 #vcpus: 1 cpuht: N chipname: Intel(R) cpu: 46.13 cpuq: 17 physmemfree: 110400 physmemtotal: 4055420 mcache: 2027560 swapfree: 3751708 swaptotal: 4063228 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 7714 iow: 210 ios: 399 swpin: 0 swpout: 6 pgin: 7212 pgout: 190 netr: 19.810 netw: 17.537 procs: 303 rtprocs: 12 #fds: 24960 #sysfdlimit: 6815744 #disks: 9 #nics: 3 nicErrors: 0


topcpu: 'apx_vktm_+apx1(7240) 3.00' topprivmem: 'java(19943) 138464' topshm: 'ora_mman_sport(14493) 223920' topfd: 'ocssd.bin(2778) 341' topthread: 'console-kit-dae(1973) 64'

This regularly dumps output similar to the Linux "top" command. As with the collection script, oclumon also accepts date and duration parameters.
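The flat "key: value" format of these node views is dense to read. A quick parser, a hypothetical Python sketch rather than anything Oracle ships, can turn one metrics line into a dict:

```python
import re

def parse_nodeview(line):
    """Parse one dumpnodeview metrics line ('key: value key: value ...')
    into a dict, converting numeric values where possible."""
    out = {}
    for key, val in re.findall(r'(\S+):\s+(\S+)', line):
        try:
            # Keep floats as floats (e.g. cpu: 26.27), counts as ints.
            out[key] = float(val) if '.' in val else int(val)
        except ValueError:
            out[key] = val  # non-numeric values like cpuht: N
    return out

sample = "#pcpus: 1 #vcpus: 1 cpuht: N chipname: Intel(R) cpu: 26.27 cpuq: 2"
m = parse_nodeview(sample)
print(m["cpu"], m["cpuq"])   # 26.27 2
```

From there it is easy to watch a single metric such as cpuq or physmemfree across successive node views instead of eyeballing the whole line.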

3. The data (in the MGMTDB instance) is stored in the CHM schema.

SQL> select table_name from dba_tables where owner = 'CHM';


10 rows selected.

4. As mentioned earlier, you can also manage the repository retention period from oclumon.

4.1 To find out the current settings, we can issue the -get parameter.

4.1.1 Find the repository size. (Despite the name, the value reported appears to be the retention period in seconds rather than bytes, as the resize output further down suggests.)

[root@flex1 tmp]# /u01/app/ manage -get repsize

CHM Repository Size = 136320

4.1.2 Find the repository data file location

[root@flex1 tmp]# /u01/app/ manage -get reppath

CHM Repository Path = +DATA/_MGMTDB/DATAFILE/sysmgmtdata.260.835192031

4.1.3 Find the master and logger nodes

[root@flex1 tmp]# /u01/app/ manage -get master

Master = flex1
[root@flex1 tmp]# /u01/app/ manage -get alllogger

Loggers = flex1,
[root@flex1 tmp]# /u01/app/ manage -get mylogger

Logger = flex1

4.2 To set parameters, follow some of the examples below.

4.2.1 The changeretentiontime setting is merely an indicator of how long the underlying tablespace can accommodate the collected data. The value is (I believe) in seconds.

[root@flex1 tmp]# /u01/app/ manage -repos changeretentiontime 1000

The Cluster Health Monitor repository can support the desired retention for 2 hosts

4.2.2 Change the repository’s tablespace size (in MB). This also changes the retention period.

[root@flex1 tmp]# /u01/app/ manage -repos changerepossize 6000
The Cluster Health Monitor repository was successfully resized.The new retention is 399240 seconds.

The alert.log for -MGMTDB shows a simple ALTER TABLESPACE command.

Fri Feb 07 00:28:11 2014

The reported repository size has obviously increased as well.

[root@flex1 tmp]# /u01/app/ manage -get repsize

CHM Repository Size = 399240

5. And last, but not least, the version check!

[root@flex1 tmp]# /u01/app/ version

Cluster Health Monitor (OS), Version - Production Copyright 2007, 2013 Oracle. All rights reserved.

Well, I hope this has been an insightful post on the new CHM feature in the 12c release of Grid Infrastructure. If anything, this will be a nice replacement for the RDA output that Support might request. I haven't had to troubleshoot any clusterware issues on 12c yet, but I plan to break this environment and use the oclumon utility to debug the processes at a later date.


