TNPM/TNPM dc
Contents
- 1 Datachannel - how to fix them
- 2 How to examine BOF files for uniqueness
- 3 Timezone and aggset issues
- 4 Files ".jch" piling up under UBA 'state' directory
- 5 Datachannel - dccmd interaction with components
- 6 Manipulate bofdump
- 7 How to interpret Poll Id values in .BOF file created by SNMP DL
- 8 Thresholds
Datachannel - how to fix them
Start by ensuring that the hosts file is populated the 'proviso way', that is, the first name on the host's entry is the 'short' hostname. (Ideally your servers are named HOSTNAME and not host.blah.blah.blah, as proviso will hate you otherwise.)
$ head /etc/inet/hosts
#
# Internet host table
#
::1 localhost
127.0.0.1 localhost
10.111.100.125 HOSTNAME hostname hostname.admin.network.co.nz loghost
10.111.100.113 hostname.console.netowork.co.nz hostname.console
10.111.64.71 hostname.uname.network.co.nz hostname.uname
146.171.1.1 ish.network.co.nz
146.171.59.144 hostname.prod.network.co.nz
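The hosts-file check can be scripted. The sketch below is only an illustration: the function names first_name_for_ip and hosts_entry_ok are made up for this page, and it assumes an awk that supports -v (nawk or a POSIX awk on Solaris).

```shell
# first_name_for_ip HOSTS_FILE IP
# Prints the first name listed after IP in HOSTS_FILE (comment lines are skipped).
first_name_for_ip() {
    awk -v ip="$2" '
        /^[ \t]*#/ { next }            # ignore comment lines
        $1 == ip  { print $2; exit }   # first name after the address
    ' "$1"
}

# hosts_entry_ok HOSTS_FILE IP SHORTNAME
# Succeeds only when the first name for IP is the short hostname.
hosts_entry_ok() {
    [ "$(first_name_for_ip "$1" "$2")" = "$3" ]
}
```

For example, hosts_entry_ok /etc/inet/hosts 10.111.100.125 HOSTNAME returns success for the file shown above.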
Stop all datachannel components on all boxes
pkill -9 visual
Remove the 'working' files from the state directory
rm /appl/proviso/datachannel/state/*.pid
rm /appl/proviso/datachannel/state/*.bos
rm /appl/proviso/datachannel/state/walk*
rm /appl/proviso/datachannel/log/walk*
It's also worth cleaning up /tmp: remove anything belonging to pvuser, and the installer/deployer tmp files
rm -rf /tmp/PvInstall
rm -rf /tmp/ProvisoConsumer
rm -rf /tmp/inst*
Datachannel Patches 4.4.3+
4.4.3+ patches end with .pvst. They can simply be placed in the 'binary' state folder, and the datachannel/version file can be updated to reflect the latest patch. This assumes /appl/proviso/datachannel/ is where the datachannel binaries reside, and /appl/proviso/data/datachannel is where the 'floating' data files are (e.g. LDR.6/done)
/appl/proviso/datachannel/state/IF0061.pvst
cat /appl/proviso/datachannel/version
  4.4.3.2
  IFLabel: IF0061
  Application: Ginger.265.30
  Dataload Version: Ginger.91.2
When successfully loaded you will see the following message when a component starts
2011.08.26-00.18.13 UTC AMGRW.SF2068-6477 1 PATCH Loaded Patch File: /appl/proviso/datachannel/state/IF0061.pvst
Datachannel Patches 4.4.1 and Test Patches
4.4.1 patches end with .st. They can simply be placed in the 'binary' state folder; however, they require a correct 'startup' file, otherwise they will not be loaded. This assumes /appl/proviso/datachannel/ is where the datachannel binaries reside, and /appl/proviso/data/datachannel is where the 'floating' data files are (e.g. LDR.6/done)
Put the test fix in place
/appl/proviso/datachannel/state/test_fix.st
Find the current version
grep Application /appl/proviso/datachannel/version | cut -d " " -f2
Create a startup file
echo "'../state/test_fix.st' asFilename fileIn." > /appl/proviso/datachannel/state/Ginger.265.30.startup
When successfully loaded you will see the following message when a component starts
2011.08.26-00.18.12 UTC AMGRW.SF2068-6477 I PATCH Executed patch file with contents: '../state/test_fix.st' asFilename fileIn.
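The 4.4.1 steps (find the Application version, then create the matching startup file) can be combined into one helper. This is a minimal sketch: the function name write_patch_startup and its two arguments are made up for this page, and it assumes the version file contains a line of the form "Application: Ginger.265.30" as shown earlier.

```shell
# write_patch_startup DC_HOME PATCH_FILE
# DC_HOME    datachannel binary directory, e.g. /appl/proviso/datachannel
# PATCH_FILE name of the .st file already copied into DC_HOME/state
write_patch_startup() {
    dc_home=$1
    patch=$2
    # Same lookup as the manual step: second space-separated field of the Application line.
    app=$(grep Application "$dc_home/version" | cut -d " " -f2)
    if [ -z "$app" ]; then
        echo "no Application line found in $dc_home/version" >&2
        return 1
    fi
    # The startup file is named after the Application version.
    echo "'../state/$patch' asFilename fileIn." > "$dc_home/state/$app.startup"
    echo "$dc_home/state/$app.startup"
}
```

Calling write_patch_startup /appl/proviso/datachannel test_fix.st prints the path of the startup file it wrote.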
How to examine BOF files for uniqueness
How do I examine BOF files to make sure no duplicates exist?
Answer
- In order to confirm that there exist no duplicates in a suspected BOF file, do the following
bofDump <filename> | awk -F"," '{print $3 "," $2 "," $5}' | sort | wc -l
Output from this command will be in the form of a number.
- Then, do the following on the same file
bofDump <filename> | awk -F"," '{print $3 "," $2 "," $5}' | sort | uniq | wc -l
Output from this command will also be a number.
If the number produced by the first command is equal to the number in the second command, then it can be confirmed that there are no duplicates in the file.
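Comparing the two counts says whether duplicates exist, but not which records are duplicated. A small variation lists the duplicated keys directly. This is a sketch only: the function name bof_duplicate_keys is made up for this page, and it reads bofDump output on standard input using the same fields (3, 2 and 5) as the commands above.

```shell
# bof_duplicate_keys
# Reads comma-separated bofDump output on stdin and prints every
# field-3,field-2,field-5 key that occurs more than once.
bof_duplicate_keys() {
    awk -F',' '{ print $3 "," $2 "," $5 }' | sort | uniq -d
}
```

Usage: bofDump <filename> | bof_duplicate_keys. No output means no duplicates were found.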
Timezone and aggset issues
If multiple timezones are in use in the installation, ensure that
- the timezone is applied to the correct reporting groups at the root level (one timezone per group!)
- the aggsets are installed
- the aggsets are applied to the datachannels (in the database)
If the last step isn't completed, the datachannel will compute the aggregation in the CME, but the LDR simply won't load it into the database. You can check for UPDATEDBSTATS messages, which appear to show how much was loaded (not sure), to which tablespace, and for which aggset
$ grep LDR.6 proviso.log | grep UPDATEDBSTATS
2013.06.18-05.33.38 UTC LDR.6-4897 2 STOREDPROCOK UPDATEDBSTATS - (6,BASE,000,H0,1371506400,670)
2013.06.18-05.33.41 UTC LDR.6-4897 2 STOREDPROCOK UPDATEDBSTATS - (6,NRAW,000,H0,1371506400,804)
2013.06.18-05.37.58 UTC LDR.6-4897 2 STOREDPROCOK UPDATEDBSTATS - (6,1DGA,001,H0,1371513600,231)
2013.06.18-05.38.01 UTC LDR.6-4897 2 STOREDPROCOK UPDATEDBSTATS - (6,1DRA,001,H0,1371513600,117)
Files ".jch" piling up under UBA 'state' directory
Technote (FAQ)
Why are files with extension "*.jch" piling up under UBA '../state' directory?
Cause
Changes 'upstream' of the Data Channel, such as newly added devices, can increase the volume of input data to the UBA. The UBA then creates many jch journal files while building its memory model and cannot keep up with the volume, so its housekeeping logic becomes ineffective: the load of incoming data is too much for it to consolidate and purge.
Answer
These ".jch" files that you are seeing piling up are 'journal' files. They are key files that the DC component uses to persist to disk the elements of its running memory model, so that if there is a failure and it is restarted, it knows how to rebuild its running image. They contain items such as a list of processed files, a list of files in its channel awaiting processing (things in state/do), and components of the existing meta-data for inventory.
Most UBAs, as well as all of the core Data Channel components, have a mechanism to consolidate and purge old journal files. There are two types of journal files, "jch" and "jcp". While the DC component is running it creates journal files (jch). After certain conditions are met, either an inventory insert or the passage of time, the component flushes all jch files, consolidates the information that needs to be persisted, and purges all old unneeded information. The remaining persisted information ends up in a jcp file. The jcp files are usually larger than the jch files, and there are usually just 1 or 2 of them versus many jch files.
This means that the DC components have housekeeping and garbage collection code to handle this for you, so you should never delete the journal files manually unless advised to by Support. Doing so could seriously corrupt the memory model of the Data Channel component. It would essentially delete its memory of itself. In some cases this forces it to rebuild the meta-data model; in more extreme cases it orphans unprocessed incoming inputs, and data loss will occur.
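To see whether journal files really are piling up (many jch files against the usual 1 or 2 jcp files), the two types can be counted without touching them. A minimal sketch, assuming the state directory is passed as an argument; the function name count_journals is made up for this page.

```shell
# count_journals STATE_DIR
# Prints the number of *.jch and *.jcp files directly under STATE_DIR.
# Read-only: it never deletes or modifies a journal file.
count_journals() {
    dir=$1
    for ext in jch jcp; do
        n=0
        for f in "$dir"/*."$ext"; do
            # The glob stays literal when nothing matches, so test for a real file.
            if [ -f "$f" ]; then
                n=$((n + 1))
            fi
        done
        printf '%s=%s\n' "$ext" "$n"
    done
}
```

Run it periodically against the UBA state directory; a jch count that keeps growing between runs suggests the component is not keeping up with consolidation.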
Datachannel - dccmd interaction with components
The dccmd utility is a command line program used to manage a Data Channel environment. It can start and stop components, report component status, or issue debug commands.
- dccmd work flow explained below
AMGR -->> CNS Send request to get the IOR of CMGR
AMGR <<-- CNS Receive IOR of CMGR
AMGR -->> CMGR Communicate directly with CMGR via ORB using IOR
- Acronym explanation
IOR - Interoperable Object Reference. It contains the communication details that a client uses to communicate with a CORBA object.
ORB - Object Request Broker. Provides the mechanism required for distributed objects to communicate with one another, whether locally or on remote devices, written in different languages, or at different locations on a network.
CNS - Component Name Service (also known as Channel Name Service). CNS enables Netcool/Proviso components to communicate with one another.
CMGR - Channel Manager; it manages data flow between components
AMGR - Application Manager
Manipulate bofdump
The values returned from bofDump are in pseudo scientific notation; you can convert them using printf in the shell (bash).
If there is nothing after the d, the exponent is zero, so replace d with e+0; otherwise replace d with e+
bofDump 1.2014.01.20-23.30.00-00368.250.NRAW.BOF

[[3421]], (Mid) 12757,(Rid) 200771194, (date) [[1390260600]], 2014.01.20-23.30.00, (Val) 0.0d
In this case the value is '0'; the conversion still works:
printf '%f' 0.0e+0
0.000000

[[3425]], (Mid) 51535,(Rid) 200771194, (date) [[1390260600]], 2014.01.20-23.30.00, (Val) 1.390260867d12
printf '%f' 1.390260867e+12
1390260867000.000000

[[4724]], (Mid) 12762,(Rid) 200772756, (date) [[1390261500]], 2014.01.20-23.45.00, (Val) 8.159325d7
printf '%f' 8.159325e+7
81593250.000000

[[4118]], (Mid) 12758,(Rid) 200770272, (date) [[1390261500]], 2014.01.20-23.45.00, (Val) 960.0d
printf '%f' 960.0e+0
960.000000

[[4088]], (Mid) 12758,(Rid) 200770173, (date) [[1390261500]], 2014.01.20-23.45.00, (Val) 139573.0d
printf '%f' 139573.0e+0
139573.000000

A negative value
[[279554]], (Mid) 101747652,(Rid) 200757931, (date) [[1390262402]], 2014.01.21-00.00.02, (Val) -0.0066005192212094d
printf '%f' -0.0066005192212094e+0
-0.006601

Use '%0.f' to round to zero decimal places.
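The conversion rule can be wrapped in a small function so it does not have to be applied by hand. This is a sketch: the name bof_to_float is made up for this page. It rewrites d to e rather than e+, which gives the same result for the values above and, as an assumption not taken from the examples, would also cope with a negative exponent such as d-3.

```shell
# bof_to_float VALUE [FORMAT]
# Converts a bofDump value such as 8.159325d7 or 960.0d to a plain decimal.
# FORMAT is an optional printf format, default %f.
bof_to_float() {
    # A trailing bare "d" means exponent 0; otherwise "d" introduces the exponent.
    sci=$(printf '%s\n' "$1" | sed -e 's/d$/e+0/' -e 's/d/e/')
    # LC_ALL=C keeps "." as the decimal separator regardless of the user's locale.
    LC_ALL=C printf "${2:-%f}\n" "$sci"
}
```

For example, bof_to_float 8.159325d7 prints 81593250.000000, and bof_to_float 139573.0d '%.0f' prints 139573.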
How to interpret Poll Id values in .BOF file created by SNMP DL
Technote (FAQ)
Question
After using the bofDump utility to convert a .BOF data file collected by the SNMP DL, I observe that the Poll Id for the same RID/MID combination takes different values, although I did not change the polling interval. Could you explain the Poll Id?
Answer
The Poll Id is the time of the polling within the hour, rounded to the polling frequency.
So for collections at 15min polling (900s) ; there will be 4 measures produced during each hour, at Poll Id = {0 , 900, 1800, 2700} ; even if the actual collection happened slightly delayed at (for example) { 115, 1021, 1945, 2856 }
This column is necessary to 'align' data that may be produced at the same frequency but is not collected at exactly the same second.
It is also used by the CME to de-duplicate data if 2 different SNMP collectors are both sending data for the same collection number to a single CME. For each metric ID and resource ID, the collection time may be slightly different on each collector, but the Poll Id will be the same, allowing the de-duplication.
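The Poll Id arithmetic can be reproduced in the shell. This is a sketch under two assumptions that the technote does not spell out: "rounded" is taken to mean rounded down to the start of the polling slot (which matches the 15-minute example), and the input is either seconds within the hour or an epoch timestamp, whose value modulo 3600 is the offset within the hour for whole-hour timezones. The function name poll_id is made up for this page.

```shell
# poll_id SECONDS PERIOD
# SECONDS  collection time (seconds within the hour, or an epoch timestamp)
# PERIOD   polling period in seconds, e.g. 900
poll_id() {
    # Offset within the hour, rounded down to a multiple of the polling period.
    echo $(( ($1 % 3600) / $2 * $2 ))
}
```

With a 900s period, the delayed collections at 115, 1021, 1945 and 2856 seconds map to 0, 900, 1800 and 2700.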
Find how much memory each CME is using
grep MEM_STAT proviso.log | perl -ne 'print "$1 $2\n" if /(CME.*)\-.*?Total Image Size:(.*?) Used/' | sort | uniq

The MEM_STATS lines in proviso.log look like this:
2014.05.06-21.10.46 UTC CME.2.7-4313 2 MEM_STATS Total Image Size: 1,567,953 kb Used 1,347,098 kb Free: 220,855 kb
2014.05.06-22.02.33 UTC CME.2.7-4313 2 MEM_STATS Total Image Size: 1,567,953 kb Used 1,343,074 kb Free: 224,879 kb
2014.05.06-22.06.14 UTC CME.2.7-4313 2 MEM_STATS Total Image Size: 1,567,953 kb Used 1,343,184 kb Free: 224,769 kb
2014.05.06-22.10.27 UTC CME.2.7-4313 2 MEM_STATS Total Image Size: 1,567,953 kb Used 1,347,083 kb Free: 220,870 kb
2014.05.06-23.15.36 UTC CME.2.7-4313 2 MEM_STATS Total Image Size: 1,567,953 kb Used 1,343,054 kb Free: 224,899 kb
2014.05.06-23.17.48 UTC CME.2.7-4313 2 MEM_STATS Total Image Size: 1,567,953 kb Used 1,343,152 kb Free: 224,801 kb

Output of the command:
CME.1.1 981,993 kb
CME.1.2 513,226 kb
CME.1.3 747,609 kb
CME.1.4 1,646,081 kb
CME.1.5 1,255,441 kb
CME.1.6 1,246,201 kb
CME.2.10 474,162 kb
CME.2.11 396,034 kb
CME.2.12 1,294,505 kb
CME.2.13 1,099,185 kb
CME.2.14 1,207,137 kb
CME.2.15 903,865 kb
Run Command on Datachannel Hosts
One of the most useful tools, should be part of the product
#!/bin/bash
# Run command on all SNMP datachannels
# ENHANCED VERSION....
# 201307 - Neil and Tomas
# 201406 - Neil ported from dataload to datachannel
# usage: run_command_on_datachannel.sh "command && command2 && command3"
# REMINDER: x && y, if x gracefully exits run y
# REMINDER: x ; y, no matter how x goes, run y
command=$1
printf "Gathering information, please wait\n"
. /appl/proviso/datachannel/dataChannel.env
#dcoutput=`dccmd debug CMGR "self dbCfgPrint" | egrep -i "$FTE.*SOURCE.*SNMP" | cut -d "@" -f2 | sort -n -t "." -k 3`
# One entry per datachannel host, taken from the AMGR lines of the CMGR configuration
dcoutput=`dccmd debug CMGR "self dbCfgPrint" | grep -i "AMGR=AMGR" | cut -d "." -f 5 | sort | uniq`
# %s keeps any % characters in the command from being read as printf directives
printf "\nRunning command: %s\n" "$command"
# Takes a command to run and runs it on each datachannel host...
for entry in $dcoutput
do
    # will look like
    # STGPROVISODL2
    host=`echo $entry`
    printf "\n############# %s #############\n " "$host"
    # Source the datachannel environment on the remote host before running the command
    dl_cmd=". /appl/proviso/datachannel/dataChannel.env && $command"
    ssh -q "$host" "$dl_cmd"
done
Thresholds
Proviso: preventing the CME from triggering a trap as part of the interpolation calculation
- Technote (FAQ)
- Question
How can I stop the CME from triggering a Burst threshold violation at a moment when the data value is below the threshold violation level?
- Cause
The CME triggers a trap for a threshold violation at a moment when the collected data value is below the threshold violation level.
- Answer
The known condition in which the CME sends a burst threshold violation event while the current data value does not exceed the threshold is caused by the interpolation calculation. The CME calculates how much of the time since the last threshold state change (from violation to below violation) should be added to the time accumulator for Critical or Warning time. If that added time causes the accumulated time to exceed the threshold duration period defined in Warning or Critical time, a trap is generated, even though the current data sample is below the threshold.
To disable the interpolation calculation and stop the CME from triggering burst threshold violations for values below the threshold level, enter the value -1 for Warning and/or Critical Time.
Then save the threshold and exit.
To validate the change, open the threshold again: the word RESERVED is shown as the value in the Critical or Warning time field.
LDR MEM_EMERGENCY IV1669573 2014-04-14
By default the memory allowed for an LDR is 1GB; if more than 1GB is required, the next step is 4GB
Change the topology - add the following
LDR.1.PV_MEMORY_POLICY=TRUE
Add the following entry to the ../topologyEditor/metadata/core/dataChannel.xsd and ../topologyEditor/plugins/IccCommon_1.0.0/src/metadata/core/dataChannel.xsd files. Insert the PV_MEMORY_POLICY element shown below right after the USE_PIPE property, in the LDR properties section of dataChannel.xsd:
<xsd:complexType name="LDR">
  ...
        <xsd:element name="USE_PIPE" type="xsd:boolean" default="false">
          <xsd:annotation>
            <xsd:documentation>
              <ext:label>USE_PIPE_LABEL</ext:label>
              <ext:description>USE_PIPE_DESC</ext:description>
              <ext:help>USE_PIPE_HELP</ext:help>
            </xsd:documentation>
            <xsd:appinfo>
              <ext:advanced>false</ext:advanced>
              <ext:readOnly>false</ext:readOnly>
              <ext:visible>true</ext:visible>
            </xsd:appinfo>
          </xsd:annotation>
        </xsd:element>
        <xsd:element name="PV_MEMORY_POLICY" type="xsd:boolean" default="false">
          <xsd:annotation>
            <xsd:documentation>
              <ext:label>PV_MEMORY_POLICY_LABEL</ext:label>
              <ext:description>PV_MEMORY_POLICY_DESC</ext:description>
              <ext:help>PV_MEMORY_POLICY_HELP</ext:help>
            </xsd:documentation>
            <xsd:appinfo>
              <ext:advanced>false</ext:advanced>
              <ext:readOnly>false</ext:readOnly>
              <ext:visible>true</ext:visible>
            </xsd:appinfo>
          </xsd:annotation>
        </xsd:element>
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>