An Implementation of GoldenGate Monitoring configured in a Real Application Cluster Environment

Nearly a year ago I wrote about GoldenGate monitoring using a Metric Extension, and later expanded on how to create the Metric Extension itself. I have since installed and configured it at several customer sites where GoldenGate was running on a standalone server.

Over the last few months, several people have approached me about the process, and some have improved on it. One such improvement is credited to Bobby Curtis (@dbasolved), who taught me how to buffer in Perl. Bobby also has a neat collection of monitoring scripts for GoldenGate; you can find them here.

The latest implementation is a collaboration led by my tenacious and talented colleague Tucker Thompson (LinkedIn). He is responsible for maintaining and managing the Enterprise Manager environment from an operational perspective for (among many others) a rather large retail corporation; let’s call them Furry Feet (FF). Their environment contains multiple Exadata machines as well as several dozen non-Exadata environments.

A few weeks ago he approached me with a question on GoldenGate monitoring with Enterprise Manager that does not involve the GoldenGate Plugin. In his own words, Tucker described the problem below:

“The client was previously using [custom built] crontab scripts to monitor multiple items (including GoldenGate) in their large Exadata environment, despite having an OEM 12c implementation. Our desire was to move all of their crontab elements into a centralized strategy utilizing OEM 12c.

The current GoldenGate plugin for OEM 12c was tested, but seemed very buggy and the client was not ready to use it yet.

The client has AGCTL configured to assist in running multiple highly available GoldenGate instances in the same Exadata DBM.”

Per Oracle’s documentation, Agent Control (AGCTL) is the agent command line utility to manage bundled agents (XAG) for application High Availability (HA) using Oracle Grid Infrastructure.

Tucker explains why this solution wasn’t always reliable by stating:

“The crontab scripts operated per compute node to check the logs for errors, send a lag status to the elements, and check the AGCTL status to determine where the instance was running. However, we found that any alert in the alert.log triggered a critical alert through their ticketing system, as it grepped for any ORA-XXXXX error.

For instance, if there were any long running queries, we would get a ticket created. Another major issue was that we would repeatedly encounter cases where the status of GoldenGate in AGCTL did not accurately reflect its actual status. For example, an instance could be showing as down through AGCTL, but through GGSCI, the status was RUNNING.”

We seem to find a pattern with issues in monitoring GoldenGate, don’t we? This doesn’t necessarily mean that the tool itself is at fault, but rather the available monitoring options. I had already come up with an adequate way to monitor GoldenGate using Metric Extensions in EM12c, but it was designed to run against a host target where GoldenGate runs in standalone mode. In FF’s environment, there were several GoldenGate instances running across the various Exadata compute nodes, configured to fail over and restart GoldenGate seamlessly. This made for an interesting problem to resolve because my initial script assumes a static GoldenGate home.

As an example, consider three nodes in a cluster, each with a different GoldenGate instance that is managed by XAG.


Tucker’s innovative solution was to retrieve the information from clusterware via AGCTL to run the GoldenGate check against the nodes where the instance is currently running.

“What this script does is execute against the Exadata Database Machine as a target. This means that it will first find a compute node available, run AGCTL to determine the names of the GoldenGate instances, and respective nodes they are configured to run on. This information is always available from any node, and the script does not take into account where AGCTL thinks GoldenGate is actually currently running.

Next, with that information registered, the script runs olsnodes to grab the host names of all compute nodes registered in the DBM. It then uses information pulled from the AGCTL configuration per GoldenGate instance to ssh to each compute node and grep for the manager process for that specific GG instance. With the manager found running on a certain node, the script then runs ggsci from that node against that GG instance, and parses the results to tell us if the different components are running, stopped, or abended. It will also take the lag into consideration and set warning thresholds, rather than a critical alert for any amount of lag. The script even goes as far as to add the agctl status as an informational column, so it can be seen if agctl is showing as down, but GGSCI shows all processes running fine.


If the manager is not found anywhere in the Exadata environment, it extracts the first node that the instance is configured to run on from the AGCTL configuration, and runs ggsci from that node. This allows the script to still show all of the components as stopped or abended, and their respective lags.”
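The node-selection logic Tucker describes can be sketched in a few lines. This is not the actual Perl script, only a rough Python illustration under stated assumptions: the function names `configured_nodes` and `pick_check_node` are hypothetical, the node list is read from the "Configured to run on Nodes:" line that `agctl config goldengate` prints, and the ssh/grep step is replaced by a plain set of nodes where the manager was found.

```python
import re

# Sample text in the format printed by `agctl config goldengate <name>`.
SAMPLE_CONFIG = """\
GoldenGate location is: /u02/GG/Source/AUAPPROD
GoldenGate instance type is: source
Configured to run on Nodes: prod-ora-n1 prod-ora-n2 prod-ora-n3
"""


def configured_nodes(agctl_config_output):
    """Return the node names from the 'Configured to run on Nodes:' line.

    Hypothetical helper: returns an empty list when the line is absent.
    """
    match = re.search(r"^[ \t]*Configured to run on Nodes:[ \t]*(.*)$",
                      agctl_config_output, re.MULTILINE)
    return match.group(1).split() if match else []


def pick_check_node(cluster_nodes, configured, manager_nodes):
    """Choose the node ggsci should be run from.

    cluster_nodes: all compute nodes (what olsnodes would list).
    configured:    nodes from the AGCTL configuration of this instance.
    manager_nodes: nodes where a manager process was found (stands in
                   for the ssh + grep step, which is not modelled here).

    Returns (node, manager_found). If no manager is found anywhere, the
    first configured node is used so the components can still be
    reported as stopped or abended.
    """
    for node in cluster_nodes:
        if node in manager_nodes:
            return node, True
    if not configured:
        raise ValueError("instance has no configured nodes")
    return configured[0], False


nodes = configured_nodes(SAMPLE_CONFIG)
# Manager relocated to the second node: ggsci is run there.
print(pick_check_node(nodes, nodes, {"prod-ora-n2"}))
# Manager not running anywhere: fall back to the first configured node.
print(pick_check_node(nodes, nodes, set()))
```

Note that the choice deliberately ignores where AGCTL believes the instance is running, which matches the description above: only the presence of the manager process decides.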

Another thing to note is that if a GoldenGate instance relocates to a different node for some reason, instead of just getting an alert that the instance went down, we would get that alert followed by a clear alert once the GoldenGate objects (manager, extract, replicat, pump) are back up and running on the new node.

Once tested via the Metric Extension setup screens, the output looks like:

[Screenshot: Metric Extension test results]

“This strategy allows one script to monitor multiple diagnostics across GoldenGate instances configured to run in a large, highly available environment.”

The script outputs the following information when run from a prompt:

ggate_baby077|MANAGER|MANAGER|RUNNING|0|0|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_E01|EXTRACT|RUNNING|22940|25|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R01|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R02|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R03|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R04|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R05|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R06|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R08|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R09|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R10|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R11|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby077|CODS_R12|REPLICAT|RUNNING|0|3|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_baby19sss|MANAGER|MANAGER|RUNNING|0|0|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|MKCN_E01|EXTRACT|RUNNING|7|6|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|POMS_E01|EXTRACT|RUNNING|9|5|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|POMS_E02|EXTRACT|RUNNING|8|9|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|POMS_P02|EXTRACT|RUNNING|0|6|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|XFER_E01|EXTRACT|RUNNING|8|0|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|XFER_EM1|EXTRACT|RUNNING|7|1|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|XFER_P01|EXTRACT|RUNNING|0|8|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|XFER_PM1|EXTRACT|RUNNING|0|6|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|BASE_R01|REPLICAT|RUNNING|0|7|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|MKCN_R01|REPLICAT|RUNNING|0|7|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_baby19sss|POMS_R01|REPLICAT|RUNNING|0|5|0|exababy01db04|Goldengate  instance 'ggate_baby19sss' is running on exababy01db04|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby19ssa
ggate_test|MANAGER|MANAGER|STOPPED|0|0|1|exababy01db01|Goldengate  instance 'ggate_test' is not running|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_test
ggate_test|TEST_E01|EXTRACT|ABENDED|6|7|2|exababy01db01|Goldengate  instance 'ggate_test' is not running|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_test
ggate_test|TEST_R01|REPLICAT|ABENDED|0|0|2|exababy01db01|Goldengate  instance 'ggate_test' is not running|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_test
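Each line above is a pipe-delimited record with ten fields. As a rough sketch of how such a line could be consumed, the Python below splits a record into named fields and flags the case where the agctl message disagrees with what GGSCI reports. The field names are my guesses, not names taken from the script: in particular `lag_a`, `lag_b`, and `state_code` are hypothetical labels for the three numeric columns (the third looks like 0 for RUNNING, 1 for STOPPED, 2 for ABENDED in the sample).

```python
from typing import NamedTuple

# Two records copied from the sample output above.
SAMPLE = """\
ggate_baby077|CODS_E01|EXTRACT|RUNNING|22940|25|0|exababy01db02|Goldengate  instance 'ggate_baby077' is running on exababy01db02|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_baby077
ggate_test|TEST_E01|EXTRACT|ABENDED|6|7|2|exababy01db01|Goldengate  instance 'ggate_test' is not running|/u01/app/oracle/product/11.2.1.0.6/gghome_1/gg_test
"""


class GGRow(NamedTuple):
    instance: str      # XAG instance name
    process: str       # manager / extract / replicat group name
    kind: str          # MANAGER, EXTRACT or REPLICAT
    status: str        # RUNNING, STOPPED or ABENDED
    lag_a: int         # first numeric column (assumed to be a lag value)
    lag_b: int         # second numeric column (assumed to be a lag value)
    state_code: int    # third numeric column (assumed status code)
    host: str          # node the check was run against
    agctl_status: str  # informational message from agctl
    gg_home: str       # GoldenGate home of the instance


def parse_row(line):
    """Split one pipe-delimited record into a GGRow."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 10:
        raise ValueError(f"expected 10 fields, got {len(parts)}")
    return GGRow(parts[0], parts[1], parts[2], parts[3],
                 int(parts[4]), int(parts[5]), int(parts[6]),
                 parts[7], parts[8], parts[9])


def agctl_disagrees(row):
    """True when agctl says 'not running' but the process is RUNNING."""
    return row.status == "RUNNING" and "is not running" in row.agctl_status


rows = [parse_row(line) for line in SAMPLE.splitlines()]
for row in rows:
    print(row.instance, row.process, row.status, agctl_disagrees(row))
```

Keeping the agctl message as its own field is what makes the mismatch check possible: the status column comes from GGSCI, so the two sources can be compared per record.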


The Perl script itself is fairly simple and can be plugged into the Metric Extension example from earlier. The only caveat is that your GoldenGate environment must be an XAG resource.
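To give a feel for the parsing step, here is a rough Python sketch (not the Perl script itself) that turns GGSCI `info all` output into rows of name, type, status, and two lag values in seconds. The sample text is illustrative rather than captured from a real system, the function names are hypothetical, and I am assuming the usual column order of Program, Status, Group, Lag at Chkpt, Time Since Chkpt with lags printed as HH:MM:SS.

```python
# Illustrative `info all` output; not captured from a real system.
SAMPLE_INFO_ALL = """\
Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     CODS_E01    06:22:20      00:00:25
REPLICAT    ABENDED     TEST_R01    00:00:00      00:00:00
"""


def hms_to_seconds(text):
    """Convert an HH:MM:SS lag string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_info_all(output):
    """Return (name, type, status, lag_seconds, since_chkpt_seconds) tuples.

    Header, blank, and unrecognised lines are skipped. The manager has
    no group name or lag columns, so it is reported with its type as
    its name and zero lag.
    """
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ("MANAGER", "EXTRACT", "REPLICAT"):
            continue
        if fields[0] == "MANAGER" and len(fields) >= 2:
            rows.append(("MANAGER", "MANAGER", fields[1], 0, 0))
        elif len(fields) >= 5:
            rows.append((fields[2], fields[0], fields[1],
                         hms_to_seconds(fields[3]),
                         hms_to_seconds(fields[4])))
    return rows


for row in parse_info_all(SAMPLE_INFO_ALL):
    # Pipe-delimited, in the spirit of the script output shown earlier.
    print("|".join(str(value) for value in row))
```

Converting the lag to seconds up front is what lets the Metric Extension compare it against numeric warning thresholds instead of alerting on any non-zero lag.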

Once the Metric Extension is deployed, its information is accessible in the “All Metrics” section for the Exadata Database Machine.


The point of this exercise was to solve a particular use case where GoldenGate instances are configured as clusterware resources that can be restarted on a different node each time. What I would like to see next is a version of this GoldenGate monitoring script for a clustered environment that doesn’t necessarily use XAG.

Thanks again to Tucker for coming up with the idea to retrieve the information from clusterware; it was fun to incorporate my original script into his version.

Cheers!

4 comments

  1. Sridhar Subramaniam

    Hey Maaz, I am not using Metric Extensions and Cloud Control yet to manage GoldenGate, but I have a question about setting up XAG.

    My attempts to follow the only document on XAG, the HA best practices from Oracle, are failing, and I was wondering whether you could point me toward anything I should do differently. I have raised a support call with Oracle but that is not really helping much.

    Apologies for the rather long email!

    Following is a brief on the problem

    I am trying to establish HA for a GoldenGate setup running on a 3 node Oracle 11.2.0.4.4 RAC on RHEL 6.3 64-bit environment.

    I am following the Oracle GoldenGate HA best practice documents
    http://www.oracle.com/technetwork/products/clusterware/overview/ogg-xag-bp-1915977.pdf
    http://www.oracle.com/technetwork/products/clusterware/overview/ogiba-v2-1916262.pdf

    As part of the setup, I have done the following:
    1. Downloaded the latest version of the xagpack_6.zip
    2. Installed this on all 3 nodes in the RAC using “xagsetup.sh --install --directory $XAG_HOME --all_nodes”. The install has gone through fine
    3. Created an Application VIP, started the vip
    4. Created a Cluster resource

    Attempts to start the Cluster resource fail with the “*** Could not open error log ggserr.log (error 13,Permission denied) ***” error

    Here is everything I have done

    Create the application vip
    su -
    $GRID_HOME/bin/appvipcfg create -network=1 -ip=10.1.40.41 -vipname=au-ogg-auapprod-vip -user=grid -group=oinstall

    Grant permission to oracle and grid users
    $GRID_HOME/bin/crsctl setperm resource au-ogg-auapprod-vip -u user:grid:r-x
    $GRID_HOME/bin/crsctl setperm resource au-ogg-auapprod-vip -u user:oracle:r-x

    Start the application vip
    $GRID_HOME/bin/crsctl status resource au-ogg-auapprod-vip
    $GRID_HOME/bin/crsctl start resource au-ogg-auapprod-vip

    [grid@prod-ora-n1][+ASM1][~]
    [20:40:23]$ $GRID_HOME/bin/crsctl status resource au-ogg-auapprod-vip
    NAME=au-ogg-auapprod-vip
    TYPE=app.appvip_net1.type
    TARGET=ONLINE
    STATE=ONLINE on prod-ora-n1

    Create Cluster resource
    su - grid
    $XAG_HOME/bin/agctl add goldengate auapprod_source --gg_home /u02/GG/Source/AUAPPROD --instance_type source --nodes prod-ora-n1,prod-ora-n2,prod-ora-n3 --vip_name au-ogg-auapprod-vip --filesystems ora.registry.acfs --databases ora.auapprod.db --oracle_home /u01/app/oracle/product/11.2.0/dbhome_1 --monitor_extracts XUAP01,PUAP01 --critical_extracts XUAP01,PUAP01

    [grid@prod-ora-n1][+ASM1][~]
    [20:40:31]$ $XAG_HOME/bin/agctl config goldengate auapprod_source
    GoldenGate location is: /u02/GG/Source/AUAPPROD
    GoldenGate instance type is: source
    Configured to run on Nodes: prod-ora-n1 prod-ora-n2 prod-ora-n3
    ORACLE_HOME location is: /u01/app/oracle/product/11.2.0/dbhome_1
    Databases needed: ora.auapprod.db
    File System resources needed: ora.registry.acfs
    Extracts to monitor: XUAP01,PUAP01
    Replicats to monitor:
    Critical extracts: XUAP01,PUAP01
    Critical replicats:
    Autostart on DataGuard role transition to PRIMARY: no
    Autostart JAgent: no

    Attempts to start the resource take a long time and finally fail

    $XAG_HOME/bin/agctl start goldengate auapprod_source

    Looking into the XAG_HOME/../log directory, I see that the resource start attempts are failing.

    Questions:
    1. GI Home is owned by ‘grid’, Oracle DB home by ‘oracle’ and GG home by ‘oggadm’. Who should be the user when I create the application vip? Sure I run this as root, but in the following command, user=grid. Is this correct?

    $GRID_HOME/bin/appvipcfg create -network=1 -ip=10.1.40.41 -vipname=au-ogg-auapprod-vip -user=grid -group=oinstall

    2. Do I need to use setperm to grant ‘oggadm’ the privilege, including rw?
    3. Do you see anything incorrect in what I am doing?

    Would really appreciate if you could help guide me towards resolving the issue.

    Regards

    Sridhar Subramaniam
    Sydney, Australia

    1. Hi Sridhar,

      Having never set up XAG before, unfortunately I will not be able to assist you. Clearly it is a permissions issue. Does the “oggadm” user share the same groups as the “grid” or “oracle” users?

      Cheers,
      Maaz

  2. Brenda Eagleston

    I had this problem, and found that I had all required files for the local node, but not the 2nd node (my action script and ggprofile are local, as we only want to mount the ACFS mount on the node where GG is running), which is why I’m here. I have the resource running but can’t switch nodes and manage the mount point.

  3. Rebecca

    A good doc that may help you: https://mdinh.wordpress.com/2014/12/14/goldendate-ha-maa-rac-acfs-xag/
    Also, you have “File System resources needed: ora.registry.acfs”.
    You should give the correct ACFS resource name, not the registry ACFS resource, which is for managing the mount registry.
    You should have an ACFS resource name for the ACFS file system that was created as CRS-managed.
