TNPM/TNPM dc

From neil.tappsville.com

Datachannels - how to fix them

Start by ensuring that the hosts file is populated the 'Proviso way':

that is, the first entry for the host lists the 'short' hostname first. (Well, I hope you name your servers HOSTNAME and not host.blah.blah.blah, as Proviso will hate you.)

$ head /etc/inet/hosts
#
# Internet host table
#
::1     localhost
127.0.0.1       localhost
10.111.100.125  HOSTNAME hostname hostname.admin.network.co.nz loghost
10.111.100.113  hostname.console.network.co.nz hostname.console
10.111.64.71    hostname.uname.network.co.nz hostname.uname
146.171.1.1 ish.network.co.nz
146.171.59.144 hostname.prod.network.co.nz
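A quick sanity check for the rule above. This is a sketch working against a scratch copy of a hosts line; in real use you would point it at /etc/inet/hosts and take the short name from `uname -n`:

```shell
#!/bin/sh
# Sketch: verify the short hostname is the FIRST alias on its hosts line.
# A scratch file stands in for /etc/inet/hosts here.
hosts=$(mktemp)
printf '10.111.100.125  HOSTNAME hostname hostname.admin.network.co.nz loghost\n' > "$hosts"
short=hostname            # in real use: short=$(uname -n)
# find the line whose aliases include our short name, print its first alias
first=$(awk -v h="$short" '!/^#/ { for (i = 2; i <= NF; i++) if ($i == h) { print $2; exit } }' "$hosts")
echo "first alias: $first"    # -> first alias: HOSTNAME
rm -f "$hosts"
```

If the printed alias is not the short hostname, the hosts file is not populated the 'Proviso way'.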

Stop all datachannel components on all boxes

pkill -9 visual

Remove the 'working' files from the state directory

rm /appl/proviso/datachannel/state/*.pid
rm /appl/proviso/datachannel/state/*.bos
rm /appl/proviso/datachannel/state/walk*
rm /appl/proviso/datachannel/log/walk*

It's also worth cleaning up /tmp - remove anything belonging to pvuser

and the installer/deployer tmp files

rm -rf /tmp/Pv_Install
rm -rf /tmp/Proviso_Consumer
rm -rf /tmp/inst*
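The manual removals above can be done as a single find pass. A sketch against a scratch directory (in real use, point it at /tmp and always review the `-print` output before deleting anything):

```shell
#!/bin/sh
# Sketch: remove installer/deployer leftovers matching 'inst*' from a directory.
# A scratch directory stands in for /tmp so this is safe to run anywhere.
tmpdir=$(mktemp -d)
touch "$tmpdir/inst12345" "$tmpdir/keep.txt"
find "$tmpdir" -name 'inst*' -print            # dry run: show what would go
find "$tmpdir" -name 'inst*' -exec rm -rf {} + # then actually remove it
ls "$tmpdir"                                   # -> keep.txt
rm -rf "$tmpdir"
```

For the pvuser cleanup itself, `find /tmp -user pvuser -print` (then `-exec rm -rf {} +` once reviewed) is the equivalent ownership-based pass.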


Datachannel Patches 4.4.3+

4.4.3+ patches end with .pvst. These can simply be placed in the 'binary' state folder, and the datachannel/version file can then be updated to reflect the latest patch. Assume /appl/proviso/datachannel/ is where the datachannel binaries reside and /appl/proviso/data/datachannel is where the 'floating' data files are (e.g. LDR.6/done).

/appl/proviso/datachannel/state/IF0061.pvst

$ cat /appl/proviso/datachannel/version
4.4.3.2
IFLabel: IF0061
Application: Ginger.265.30
Dataload Version: Ginger.91.2
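Updating the version file can be scripted. A minimal sketch of the edit against a scratch copy (the label IF0062 and the file contents are illustrative, and the sed-to-tempfile dance avoids GNU-only `sed -i`):

```shell
#!/bin/sh
# Sketch: bump the IFLabel line in a scratch copy of the datachannel version file.
v=$(mktemp)
printf '4.4.3.2\nIFLabel: IF0061\nApplication: Ginger.265.30\n' > "$v"
# rewrite the IFLabel line to point at the newly dropped .pvst patch
sed 's/^IFLabel: .*/IFLabel: IF0062/' "$v" > "$v.new" && mv "$v.new" "$v"
grep IFLabel "$v"    # -> IFLabel: IF0062
rm -f "$v"
```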

When successfully loaded you will see the following message when a component starts

2011.08.26-00.18.13 UTC AMGRW.SF2068-6477       1
PATCH   Loaded Patch File: /appl/proviso/datachannel/state/IF0061.pvst


Datachannel Patches 4.4.1 and Test Patches

4.4.1 patches end with .st. These can simply be placed in the 'binary' state folder; however, they require a correct 'startup' file or they will not be loaded. Assume /appl/proviso/datachannel/ is where the datachannel binaries reside and /appl/proviso/data/datachannel is where the 'floating' data files are (e.g. LDR.6/done).

Put the test fix in place:
/appl/proviso/datachannel/state/test_fix.st

Find the current version:
grep Application /appl/proviso/datachannel/version | cut -d " " -f2

Create a startup file named after that version:
echo "'../state/test_fix.st' asFilename fileIn." > /appl/proviso/datachannel/state/Ginger.265.30.startup
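The version lookup and startup-file creation can be combined so the name is always derived from the version file. A sketch against a scratch directory (the real files live under /appl/proviso/datachannel):

```shell
#!/bin/sh
# Sketch: derive the startup filename from the 'Application' line of the version file.
dc=$(mktemp -d)    # stands in for /appl/proviso/datachannel/state
printf '4.4.1.0\nApplication: Ginger.265.30\n' > "$dc/version"
ver=$(grep Application "$dc/version" | cut -d " " -f2)
echo "'../state/test_fix.st' asFilename fileIn." > "$dc/$ver.startup"
ls "$dc"    # the startup file is now named Ginger.265.30.startup
rm -rf "$dc"
```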

When successfully loaded you will see the following message when a component starts

2011.08.26-00.18.12 UTC AMGRW.SF2068-6477       I
PATCH   Executed patch file with contents: '../state/test_fix.st' asFilename fileIn.

How to examine BOF files for uniqueness

How do I examine BOF files to make sure no duplicates exist?

Answer

To confirm that no duplicates exist in a suspect BOF file, do the following:
bofDump <filename> | awk -F"," '{print $3 $2 $5}' | sort | wc -l

Output from this command will be in the form of a number.

Then, do the following on the same file 
bofDump <filename> | awk -F"," '{print $3 $2 $5}' | sort | uniq | wc -l

Output from this command will also be a number.

If the number produced by the first command equals the number produced by the second, it is confirmed that there are no duplicates in the file.
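The two counts can be compared in one step. A sketch, with a here-document of made-up sample rows standing in for real `bofDump` output so that it runs anywhere:

```shell
#!/bin/sh
# Sketch: duplicate check on the Rid+Mid+timestamp key, comparing total vs unique counts.
# The sample rows below stand in for `bofDump <filename>` output.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,12757,200771194,x,1390260600
2,12757,200771194,x,1390260600
3,12758,200770272,x,1390261500
EOF
total=$(awk -F, '{print $3 $2 $5}' "$sample" | sort | wc -l)
uniqc=$(awk -F, '{print $3 $2 $5}' "$sample" | sort -u | wc -l)
if [ "$total" -eq "$uniqc" ]; then
    echo "no duplicates"
else
    echo "duplicates found"    # rows 1 and 2 share Mid/Rid/date, so this prints
fi
rm -f "$sample"
```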

Timezone and aggset issues

If multiple timezones are in use in the installation,

  • ensure that the timezone is applied to the correct reporting groups at the root level (one timezone per group!)
  • ensure the aggsets are installed
  • ensure the aggsets are applied to the datachannels (in the database)

If the last step isn't completed, the datachannel will compute the aggregation in the CME, but the LDR simply won't load it into the database. You can check for UPDATEDBSTATS messages, which appear to show how much was loaded (exact field meanings unconfirmed) and to which tablespace and aggset:

$ grep LDR.6 proviso.log | grep UPDATEDBSTATS
2013.06.18-05.33.38 UTC LDR.6-4897      2       STOREDPROCOK    UPDATEDBSTATS - (6,BASE,000,H0,1371506400,670)
2013.06.18-05.33.41 UTC LDR.6-4897      2       STOREDPROCOK    UPDATEDBSTATS - (6,NRAW,000,H0,1371506400,804)
2013.06.18-05.37.58 UTC LDR.6-4897      2       STOREDPROCOK    UPDATEDBSTATS - (6,1DGA,001,H0,1371513600,231)
2013.06.18-05.38.01 UTC LDR.6-4897      2       STOREDPROCOK    UPDATEDBSTATS - (6,1DRA,001,H0,1371513600,117)
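The UPDATEDBSTATS lines can be tallied per aggset/table to see what is actually being loaded. A sketch against a scratch log (assuming, as noted above but unconfirmed, that the second field inside the parentheses is the aggset/table name and the last is a row count):

```shell
#!/bin/sh
# Sketch: tally the trailing count in UPDATEDBSTATS lines per aggset/table name.
# A scratch log stands in for proviso.log.
log=$(mktemp)
cat > "$log" <<'EOF'
2013.06.18-05.33.38 UTC LDR.6-4897      2       STOREDPROCOK    UPDATEDBSTATS - (6,BASE,000,H0,1371506400,670)
2013.06.18-05.33.41 UTC LDR.6-4897      2       STOREDPROCOK    UPDATEDBSTATS - (6,NRAW,000,H0,1371506400,804)
2013.06.18-05.37.58 UTC LDR.6-4897      2       STOREDPROCOK    UPDATEDBSTATS - (6,1DGA,001,H0,1371513600,231)
EOF
# split on parentheses and commas: field 3 is the set name, field 7 the count
awk -F'[(,)]' '/UPDATEDBSTATS/ { sum[$3] += $7 } END { for (a in sum) print a, sum[a] }' "$log" | sort
# -> 1DGA 231 / BASE 670 / NRAW 804
rm -f "$log"
```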


Files ".jch" piling up under UBA 'state' directory

Technote (FAQ)

Why are files with extension "*.jch" piling up under the UBA '../state' directory?

Cause

Changes 'upstream' of the Data Channel, such as newly added devices, can increase the volume of input data to the UBA. The UBA then creates many .jch journal files while building its memory model; if it cannot keep up with the volume, its housekeeping logic is rendered moot, because the load of incoming data is too much for it to consolidate and purge effectively.

Answer

These ".jch" files that you see piling up are 'journal' files: key files that the DC component uses to persist to disk the elements of its running memory model, so that if there is some failure and it restarts, it will know how to rebuild its running image. They contain items such as a list of processed files, a list of files in its channel awaiting processing (things in state/do), as well as components of the existing meta-data for inventory.

Most UBAs, as well as all of the core Data Channel components, have a mechanism to consolidate and purge old journal files. There are two types of journal files, "jch" and "jcp". As the DC component runs it creates journal files (.jch); after some condition is met, either an inventory insert or the passage of time, the component flushes all .jch files, consolidates the information that needs to be persisted, and purges old unneeded info. The remaining persisted info ends up in a .jcp file. The .jcp files are usually larger than the .jch files, and there are usually just one or two of them versus many .jch files.


This means that the DC components have housekeeping and garbage-collection code to handle this for you, so you should never delete the journal files manually unless advised to by Support. Doing so could seriously corrupt the memory model of the Data Channel component. It would essentially delete its memory of itself, in some cases forcing it to rebuild the meta-data model, and in more extreme cases causing it to orphan unprocessed incoming inputs, and data loss could and would occur.

Datachannel - dccmd interaction with components

The dccmd utility is a command line program used to manage a Data Channel environment. It can start and stop components, report component status, or issue debug commands.

dccmd work flow:

AMGR -->> CNS    Send a request to get the IOR of the CMGR
AMGR <<-- CNS    Receive the IOR of the CMGR
AMGR -->> CMGR   Communicate directly with the CMGR via the ORB, using the IOR

Acronym explanation

IOR - Interoperable Object Reference: contains the communication details that a client uses to communicate with a CORBA object.

ORB - Object Request Broker: provides the mechanism required for distributed objects to communicate with one another, whether locally or on remote devices, written in different languages, or at different locations on a network.

CNS - Component Name Service (also known as Channel Name Service): enables Netcool/Proviso components to communicate with one another.

CMGR - Channel Manager: manages data flow between components.

AMGR - Application Manager.

Manipulate bofDump output

The values returned by bofDump are in pseudo-scientific notation; you can convert them using printf in the shell (bash).

If there is no digit after the trailing 'd', use the value as it is (i.e. treat it as e+0); otherwise replace the 'd' with 'e+'.

bofDump  1.2014.01.20-23.30.00-00368.250.NRAW.BOF

3421, (Mid) 12757,(Rid) 200771194, (date) 1390260600, 2014.01.20-23.30.00, (Val) 0.0d
In this case the value is '0':
printf '%f' 0.e+0
0.000000

3425, (Mid) 51535,(Rid) 200771194, (date) 1390260600, 2014.01.20-23.30.00, (Val) 1.390260867d12
printf '%f' 1.390260867e+12
1390260867000.000000

4724, (Mid) 12762,(Rid) 200772756, (date) 1390261500, 2014.01.20-23.45.00, (Val) 8.159325d7
printf '%f' 8.159325e+7
81593250.000000

4118, (Mid) 12758,(Rid) 200770272, (date) 1390261500, 2014.01.20-23.45.00, (Val) 960.0d
printf '%f' 960.0e+0
960.000000

4088, (Mid) 12758,(Rid) 200770173, (date) 1390261500, 2014.01.20-23.45.00, (Val) 139573.0d
printf '%f' 139573.0e+0
139573.000000

A negative:
279554, (Mid) 101747652,(Rid) 200757931, (date) 1390262402, 2014.01.21-00.00.02, (Val) -0.0066005192212094d
printf '%f' -0.0066005192212094e+0
-0.006601

Use '%.0f' to round to zero decimal places.
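The substitution rule above can be wrapped in a small helper. A sketch (the `conv` function name is mine, not part of the product):

```shell
#!/bin/bash
# Sketch: convert bofDump's 'd' exponent notation into printf-friendly 'e+' form.
conv() {
    # a bare trailing 'd' means exponent 0; otherwise 'd' becomes 'e+'
    echo "$1" | sed 's/d$/e+0/; s/d/e+/'
}
printf '%f\n' "$(conv 1.390260867d12)"        # -> 1390260867000.000000
printf '%f\n' "$(conv 0.0d)"                  # -> 0.000000
printf '%f\n' "$(conv -0.0066005192212094d)"  # -> -0.006601
```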


How to interpret Poll Id values in .BOF files created by the SNMP DL

Technote (FAQ)

Question

After using the bofDump utility to convert a .BOF data file collected by the SNMP DL, I observe that the Poll Id value for the same RID/MID combination differs between records, even though I did not change the polling interval. Could you explain Poll Id?


Answer

The Poll Id is the time of the poll within the hour, rounded down to the polling frequency. So for collections at 15-minute polling (900 s) there will be 4 measures produced each hour, at Poll Id = {0, 900, 1800, 2700}, even if the actual collections happened slightly delayed at, for example, {115, 1021, 1945, 2856}.


This column is necessary to 'align' data that may be produced at the same frequency but not collected at exactly the same second. It is also used by the CME to de-duplicate data when two different SNMP collectors both send data for the same collection number to a single CME. For each metric/resource ID, the collection time may be slightly different on each collector, but the Poll Id will be the same, allowing the de-duplication.
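The rounding described above is just integer division by the polling interval; a sketch using the delayed collection times from the example:

```shell
#!/bin/bash
# Sketch: Poll Id = seconds-within-the-hour rounded down to the polling interval.
interval=900
for t in 115 1021 1945 2856; do
    # for a full epoch timestamp, take t % 3600 first
    echo "collected at ${t}s -> Poll Id $(( t / interval * interval ))"
done
# -> Poll Ids 0, 900, 1800, 2700
```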


Find how much memory each CME is using

grep MEM_STATS proviso.log | perl -ne 'print "$1 $2\n" if /(CME.*?)\-.*?Total Image Size:(.*?) Used/' | sort | uniq


2014.05.06-21.10.46 UTC CME.2.7-4313    2       MEM_STATS       Total Image Size: 1,567,953 kb Used 1,347,098 kb Free: 220,855 kb
2014.05.06-22.02.33 UTC CME.2.7-4313    2       MEM_STATS       Total Image Size: 1,567,953 kb Used 1,343,074 kb Free: 224,879 kb
2014.05.06-22.06.14 UTC CME.2.7-4313    2       MEM_STATS       Total Image Size: 1,567,953 kb Used 1,343,184 kb Free: 224,769 kb
2014.05.06-22.10.27 UTC CME.2.7-4313    2       MEM_STATS       Total Image Size: 1,567,953 kb Used 1,347,083 kb Free: 220,870 kb
2014.05.06-23.15.36 UTC CME.2.7-4313    2       MEM_STATS       Total Image Size: 1,567,953 kb Used 1,343,054 kb Free: 224,899 kb
2014.05.06-23.17.48 UTC CME.2.7-4313    2       MEM_STATS       Total Image Size: 1,567,953 kb Used 1,343,152 kb Free: 224,801 kb



CME.1.1  981,993 kb
CME.1.2  513,226 kb
CME.1.3  747,609 kb
CME.1.4  1,646,081 kb
CME.1.5  1,255,441 kb
CME.1.6  1,246,201 kb
CME.2.10  474,162 kb
CME.2.11  396,034 kb
CME.2.12  1,294,505 kb
CME.2.13  1,099,185 kb
CME.2.14  1,207,137 kb
CME.2.15  903,865 kb


Run Command on Datachannel Hosts

One of the most useful tools; it should be part of the product.

#!/bin/bash
# Run command on all SNMP datachannels
# ENHANCED VERSION....
# 201307 - Neil and Tomas
# 201406 - Neil ported from dataload to datachannel
# usage: run_command_on_datachannel.sh "command && command2 && command3"

# REMINDER: x && y, if x gracefully exits run y
# REMINDER: x ; y, no matter how x goes, run y

command="$1"

printf "Gathering information, please wait\n"

. /appl/proviso/datachannel/dataChannel.env
#dcoutput=`dccmd debug CMGR "self dbCfgPrint" | egrep -i "FTE.*SOURCE.*SNMP" | cut -d "@" -f2 | sort -n -t "." -k 3`
dcoutput=`dccmd debug CMGR "self dbCfgPrint" | grep -i "AMGR=AMGR" | cut -d "." -f 5 | sort | uniq`

printf "\nRunning command: $command\n"

# Takes a command to run and runs it on each datachannel host...

for entry in $dcoutput
do
    # entry will look like: STGPROVISODL2
    host=$entry
    printf "\n############# $host #############\n"
    dl_cmd=". /appl/proviso/datachannel/dataChannel.env && $command"
    ssh -q $host "$dl_cmd"
done


Thresholds

Proviso: disabling the CME from triggering a trap as part of the interpolation calculation

  • Technote (FAQ)
  • Question

How do I disable the condition where the CME triggers a Burst threshold violation at a moment when the data value is under the threshold violation level?

  • Cause

The CME triggers a trap as a threshold violation at a moment when the collected data value is under the threshold violation level.

  • Answer

The known condition where the CME sends a burst threshold violation event while the current data value does not exceed the threshold is caused by the interpolation calculation: the CME calculates how much of the time since the last threshold state change (from violation to under the violation level) should be added to the time accumulator for Critical or Warning time. If that added time causes the accumulated time to exceed the threshold duration period defined in Warning or Critical time, a trap is generated even though the current data sample is below the threshold.

To disable the interpolation calculation and stop the CME from triggering a burst threshold violation while the value is below the threshold level, enter the value -1 for Warning and/or Critical Time.

Then save the threshold and exit.

To validate the change, open the threshold again to review it: you will see the word RESERVED as the value in the Critical or Warning time field.

LDR MEM_EMERGENCY IV1669573 2014-04-14

By default the memory allowed for an LDR is 1 GB; if more than 1 GB is required, the next step up is 4 GB.

Change the topology - add the following

LDR.1.PV_MEMORY_POLICY=TRUE

Add the following entry to the ../topologyEditor/metadata/core/dataChannel.xsd and ../topologyEditor/plugins/Icc Common_1.0.0/src/metadata/core/dataChannel.xsd files under the LDR properties:

Add the PV_MEMORY_POLICY element shown below, right after the USE_PIPE property in the LDR properties section of dataChannel.xsd:
<xsd:complexType name="LDR">
 ...
 <xsd:element name="USE_PIPE" type="xsd:boolean" default="false">
 <xsd:annotation>
 <xsd:documentation>
 <ext:label>USE_PIPE_LABEL</ext:label>
 <ext:description>USE_PIPE_DESC</ext:description>
 <ext:help>USE_PIPE_HELP</ext:help>
 </xsd:documentation>
 <xsd:appinfo>
 <ext:advanced>false</ext:advanced>
 <ext:readOnly>false</ext:readOnly>
 <ext:visible>true</ext:visible>
 </xsd:appinfo>
 </xsd:annotation>
 </xsd:element>
 <xsd:element name="PV_MEMORY_POLICY" type="xsd:boolean" default="false">
 <xsd:annotation>
 <xsd:documentation>
 <ext:label>PV_MEMORY_POLICY_LABEL</ext:label>
 <ext:description>PV_MEMORY_POLICY_DESC</ext:description>
 <ext:help>PV_MEMORY_POLICY_HELP</ext:help>
 </xsd:documentation>
 <xsd:appinfo>
 <ext:advanced>false</ext:advanced>
 <ext:readOnly>false</ext:readOnly>
 <ext:visible>true</ext:visible>
 </xsd:appinfo>
 </xsd:annotation>
 </xsd:element>
 </xsd:sequence>
 </xsd:extension>
 </xsd:complexContent>
 </xsd:complexType>