26.6 Creating Package Monitoring Scripts, If Necessary | HP-UX CSE(c) Official Study Guide and Desk Reference

As we have seen, applications that spawn child processes are likely to need an application monitoring script to integrate them into a Serviceguard package. Here is the application monitoring script that I created for my application; I called it CLOCKWATCH.sh.

#!/sbin/sh
########################################################################
##
## ServiceGuard shell script for clockwatch application.
##
## This will monitor all processes in the array called PROC[]
## To add elements to the array ensure the array subscript increases
##  i.e.
## PROC[0]="proc1"
## PROC[1]="proc2"
##
## To start clockwatch simply run ...
##
##    CLOCKWATCH.sh start
##
## To stop clockwatch simply run ...
##
##    CLOCKWATCH.sh stop
##
## To monitor clockwatch simply run ...
##
##    CLOCKWATCH.sh monitor
##
##  When one of the processes in the PROC[] array fails, this script
##  will exit.
##
## P.S. it is important the process name is unique and identifiable.
## If this script and the process being monitored share any common
## naming characteristics the monitoring script could end up monitoring
## itself and never die !
##

APPNAME=clockwatch
CLOCKHOME=/clockwatch
BIN=${CLOCKHOME}/bin
LOG=${CLOCKHOME}/logs
RETVAL=0

PROC[0]="/clockwatch/bin/clockwatch"

case $1 in
'start')
        echo "Attempting to start ${APPNAME}"
        if [ -f ${LOG}/watchlog ]
        then
                mv ${LOG}/watchlog ${LOG}/OLDwatchlog
        fi
        ${BIN}/clockwatch ${LOG}
        ;;
'stop')
        PID=$(cat ${LOG}/.watchpid)
        if ps -fp ${PID} >&- 2>&-
        then
                echo "Attempting to stop ${APPNAME}; PID = ${PID}"
                kill -SIGUSR1 $(cat ${LOG}/.watchpid)
        else
                echo "PID = ${PID} does not exist. Application may already be dead."
                if [ -f ${LOG}/.watchpid ]
                then
                        echo "Removing ${LOG}/.watchpid to allow ${APPNAME} to be restarted on another node."
                        rm -f ${LOG}/.watchpid
                fi
        fi
        ;;
'monitor')
        while
                ( for proc in ${PROC[*]}
                  do
                        ps -ef | grep $proc | grep -v grep >&- 2>&-
                        let RETVAL=RETVAL+$?
                  done
                  return $RETVAL
                )
        do
                sleep 10
        done
        ;;
*)
        echo "Usage : $0 <start|stop|monitor>"
        exit 1
        ;;
esac

My intention is to use a shared volume group and store the clockwatch program and log files within it, under the filesystems /clockwatch/bin and /clockwatch/logs, respectively.

I hope that the comments at the beginning of the script are quite self-explanatory. Essentially, if I want to monitor multiple processes, I simply add entries to the PROC[] array. All processes listed will be monitored, and if any of them is not running, the monitoring script will die. Take particular note of the P.S. at the end of the script comments. It is a common mistake to give the monitoring script a name similar to the application's. In my example, I have intentionally made some directory and file names the same to illustrate this point. The critical aspect of my application monitoring script is the process name I am monitoring. Through my extensive testing, I have ensured that the process name is unique: /clockwatch/bin/clockwatch. If I had simply listed the process name as clockwatch, the monitoring script would have ended up monitoring itself, because the monitoring script resides in a clockwatch directory that is displayed as part of its own process name.

Another thing to note is how I 'stop' the application. The 'stop' function is executed whenever the application is halted. This is also the case when the application fails; Serviceguard needs to ensure that all filesystems, volume groups, and processes are no longer in use before starting the application on another node. In the case of clockwatch, the 'stop' function checks for the PID file /clockwatch/logs/.watchpid. If this file still exists when clockwatch starts, it tells the application that an abnormal termination happened previously and that it should not start up again. When the recorded process is no longer running, the 'stop' function therefore removes the stale file. This automated cleanup is necessary in the case of clockwatch, because without it the application will not start up on an adoptive node. This, again, illustrates the importance of understanding how your application works.
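The naming pitfall described above can be seen with a simulated process listing. This is only a sketch: the PIDs and the location of the monitoring script (/etc/cmcluster/clockwatch) are assumptions made for illustration, and the listing is a fixed string rather than live ps -ef output.

```shell
#!/bin/sh
# Simulated "ps -ef" output (hypothetical PIDs and paths, for illustration only).
# The second line is the monitoring script itself, running from a directory
# whose name also contains "clockwatch".
PS_OUTPUT='root  2101     1  0 10:00:00 ?  0:00 /clockwatch/bin/clockwatch /clockwatch/logs
root  2150     1  0 10:00:05 ?  0:00 /sbin/sh /etc/cmcluster/clockwatch/CLOCKWATCH.sh monitor'

# count_matches PATTERN: print the number of listing lines containing PATTERN
count_matches() {
    printf '%s\n' "$PS_OUTPUT" | grep -c "$1"
}

# While the application is running
LOOSE=$(count_matches "clockwatch")
STRICT=$(count_matches "/clockwatch/bin/clockwatch")

# Simulate the application dying: only the monitor's own line remains
PS_OUTPUT=$(printf '%s\n' "$PS_OUTPUT" | grep -v "/clockwatch/bin/clockwatch")
LOOSE_AFTER=$(count_matches "clockwatch")
STRICT_AFTER=$(count_matches "/clockwatch/bin/clockwatch")

echo "running: 'clockwatch' matches $LOOSE, full path matches $STRICT"
echo "dead:    'clockwatch' matches $LOOSE_AFTER, full path matches $STRICT_AFTER"
```

With the loose pattern, one line still matches after the application has died (the monitor's own entry), so a monitor loop based on it would never exit. The full path matches nothing at that point, which is what lets the monitor die and Serviceguard react.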

We use the customer_defined_run_cmds to start up clockwatch, and the customer_defined_halt_cmds to issue the necessary shutdown commands for clockwatch. In writing my application monitoring script, I decided to emulate a standard Serviceguard Toolkit insofar as I would have one central script that I could use to start, halt, and monitor my application. I hope this will make the administration of this application a little easier.
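As a rough sketch, the central script might be wired into a legacy package control script along the following lines. This is a fragment, not a complete control script: the script location /etc/cmcluster/clockwatch/CLOCKWATCH.sh and the service name clock_mon are assumptions chosen for illustration, and test_return is the helper that the control script template itself is expected to define.

```shell
# Fragment of a package control script (assumed paths and service name).

# Service definition: Serviceguard runs the monitor and treats its exit
# as a failure of the service. "clock_mon" is a hypothetical name.
SERVICE_NAME[0]="clock_mon"
SERVICE_CMD[0]="/etc/cmcluster/clockwatch/CLOCKWATCH.sh monitor"
SERVICE_RESTART[0]=""

function customer_defined_run_cmds
{
        # Start the application through the central script
        /etc/cmcluster/clockwatch/CLOCKWATCH.sh start
        test_return 51
}

function customer_defined_halt_cmds
{
        # Stop the application and clean up any stale PID file
        /etc/cmcluster/clockwatch/CLOCKWATCH.sh stop
        test_return 52
}
```

Keeping start, stop, and monitor behind one script means the control script only ever needs to know a single path and three keywords.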

I am now ready to move on to the next task: "Distribute the application monitoring scripts to all relevant nodes in the cluster."