
OBIEE Monitoring and Diagnostics with InfluxDB and Grafana


In this article I’m going to look at collecting time-series metrics into the InfluxDB database and visualising them in snazzy Grafana dashboards. The datasets I’m going to use are OS metrics (CPU, Disk, etc) and the DMS metrics from OBIEE, both of which are collected using the support for a Carbon/Graphite listener in InfluxDB.

The Dynamic Monitoring System (DMS) in OBIEE is one of the best ways of being able to peer into the internals of the product and find out quite what’s going on. Whether performing diagnostics on a specific issue or just generally monitoring to make sure things are ticking over nicely, using the DMS metrics you can level-up your OBIEE sysadmin skills beyond what you’d get with Fusion Middleware Control out of the box. In fact, the DMS metrics are what you can get access to with Cloud Control 12c (EM12c) – but for that you need EM12c and the BI Management Pack. In this article we’re going to see how to easily set up our DMS dashboard.
N.B. if you’ve read my previous articles, what I write here (use InfluxDB/Grafana) supersedes what I wrote in those (use Graphite) as my recommended approach to working with arbitrary time-series metrics.

Overview

To get the DMS data out of OBIEE we’re going to use the obi-metrics-agent tool that Rittman Mead open-sourced last year. This connects to OPMN and pulls the data out. We’ll store the data in InfluxDB, and then visualise it in Grafana. Whilst not mandatory for the DMS stats, we’ll also set up collectl so that we can show OS stats alongside the DMS ones.

InfluxDB

InfluxDB is a database, but unlike an RDBMS such as Oracle – good for generally everything – it is what’s called a Time-Series Database (TSDB). This category of database focuses on storing data for a series, holding a given value for a point in time. Generally they’re optimised for handling large quantities of inbound metrics (think Internet of Things), rather than necessarily excelling at handling changes to the data (update/delete) – but that’s fine here since metric events in the past don’t generally change.

I’m using InfluxDB here for a few reasons:

  1. Grafana supports it as a source, with lots of active development for its specific features.
  2. It’s not Graphite. Whilst I have spent many a happy hour using Graphite I’ve spent many a frustrating day and night trying to install the damn thing – every time I want to use it on a new installation. It’s fundamentally long in the tooth, and whilst good for its time it is now legacy in my mind. Graphite is also several things – a data store (whisper), a web application (graphite web), and a data collector (carbon). Since we’re using Grafana, the web front end that Graphite provides is redundant, and is where a lot of the installation problems come from.
  3. KISS! Yes I could store time series data in Oracle/mySQL/DB2/yadayada, but InfluxDB does one thing (storing time series metrics) and one thing only, very well and very easily with almost no setup.

For an eloquent discussion of Time-Series Databases read these couple of excellent articles by Baron Schwartz here and here.

Grafana

On the front-end we have Grafana which is a web application that is rapidly becoming accepted as one of the best time-series metric visualisation tools available. It is a fork of Kibana, and can work with data held in a variety of sources including Graphite and InfluxDB. To run Grafana you need to have a web server in place – I’m using Apache just because it’s familiar, but Grafana probably works with whatever your favourite is too.

OS

This article is based around the OBIEE SampleApp v406 VM, but should work without modification on any OL/CentOS/RHEL 6 environment.

InfluxDB and Grafana run on both RHEL and Debian based Linux distros, as well as Mac OS. The specific setup steps detailed here might need some changes depending on the OS.

Getting Started with InfluxDB

InfluxDB Installation and Configuration as a Graphite/Carbon Endpoint

InfluxDB is a doddle to install. Simply download the rpm, install it, and run. BOOM. Compared to Graphite, this makes it a massive winner already.

wget http://s3.amazonaws.com/influxdb/influxdb-latest-1.x86_64.rpm
sudo rpm -ivh influxdb-latest-1.x86_64.rpm

This downloads and installs InfluxDB into /opt/influxdb and configures it as a service that will start at boot time.

Before we go ahead and start it, let’s configure it to work with existing applications that are sending data to Graphite using the Carbon protocol. InfluxDB can support this and enables you to literally switch Graphite out in favour of InfluxDB with no changes required on the source.

Edit the configuration file that you’ll find at /opt/influxdb/shared/config.toml and locate the line that reads:

[input_plugins.graphite]

In v0.8.8 this is at line 41. In the following stanza set the plugin to enabled, specify the listener port, and give the name of the database that you want to store data in, so that it looks like this.

# Configure the graphite api
[input_plugins.graphite]
enabled = true
# address = "0.0.0.0" # If not set, is actually set to bind-address.
port = 2003
database = "carbon"  # store graphite data in this database
# udp_enabled = true # enable udp interface on the same port as the tcp interface

Note that the file is owned by a user created at installation time, influxdb, so you’ll need to use sudo to edit the file.

Now start up InfluxDB:

sudo service influxdb start

You should see it start up successfully:

[oracle@demo influxdb]$ sudo service influxdb start
Setting ulimit -n 65536
Starting the process influxdb [ OK ]
influxdb process was started [ OK ]

You can see the InfluxDB log file and confirm that the Graphite/Carbon listener has started:

[oracle@demo shared]$ tail -f /opt/influxdb/shared/log.txt
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/cluster.func·005:1187) Recovered local server
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:133) recovered
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/coordinator.(*Coordinator).ConnectToProtobufServers:898) Connecting to other nodes in the cluster
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:139) Starting admin interface on port 8083
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:152) Starting Graphite Listener on 0.0.0.0:2003
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:178) Collectd input plugins is disabled
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:187) UDP server is disabled
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:187) UDP server is disabled
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:216) Starting Http Api server on port 8086
[2015/02/02 20:24:04 GMT] [INFO] (github.com/influxdb/influxdb/server.(*Server).reportStats:254) Reporting stats: &client.Series{Name:"reports", Columns:[]string{"os", "arch", "id", "version"}, Points:[][]interface {}{[]interface {}{"linux", "amd64", "e7d3d5cf69a4faf2", "0.8.8"}}}

At this point if you’re using the stock SampleApp v406 image, or if indeed any machine with a firewall configured, you need to open up ports 8083 and 8086 for InfluxDB. Edit /etc/sysconfig/iptables (using sudo) and add:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 8083 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8086 -j ACCEPT

immediately after the existing ACCEPT rules. Restart iptables to pick up the change:

sudo service iptables restart

If you now go to http://localhost:8083/ (replace localhost with the hostname of the server on which you’ve installed InfluxDB), you’ll get the InfluxDB web interface. It’s fairly rudimentary, but suffices just fine:

Login as root/root, and you’ll see a list of nothing much, since we’ve not got any databases yet. You can create a database from here, but for repeatability and a general preference for using the command line, here is how to create a database called carbon using the HTTP API, called from curl (assuming you’re running it locally; change localhost if not):

curl -X POST 'http://localhost:8086/db?u=root&p=root' -d '{"name": "carbon"}'

Simple huh? Now hit refresh on the web UI and after logging back in again you’ll see the new database:

You can call the database anything you want, just make sure what you create in InfluxDB matches what you put in the configuration file for the graphite/carbon listener.

Now we’ll create a second database that we’ll need later on to hold the internal dashboard definitions from Grafana:

curl -X POST 'http://localhost:8086/db?u=root&p=root' -d '{"name": "grafana"}'

You should now have two InfluxDB databases, primed and ready for data:

Validating the InfluxDB Carbon Listener

To make sure that InfluxDB is accepting data on the carbon listener use the NetCat (nc) utility to send some dummy data to it:

echo "example.foo.bar 3 `date +%s`"|nc localhost 2003

Now go to the InfluxDB web interface and click Explore Data ». In the query field enter

list series

To see the first five rows of data itself use the query

select * from /.*/ limit 5

InfluxDB Queries

You’ll notice that what we’re doing here (“SELECT … FROM …”) looks pretty SQL-like. Indeed, InfluxDB supports a SQL-like query language, which if you’re coming from an RDBMS background is nicely comforting ;-)

The syntax is documented, but what I would point out is the apparently odd /.*/ constructor for the “table” is in fact a regular expression (regex) to match the series for which to return values. We could have written select * from example.foo.bar but the .* wildcard enclosed in the / / regex delimiters is a quick way to check all the series we’ve got.

Going off on a bit of a tangent (but hey, why not), let’s write a quick Python script to stick some randomised data into InfluxDB. Paste the following into a terminal window to create the script and make it executable:

cat >~/test_carbon.py<<EOF
#!/usr/bin/env python
import socket
import time
import random
import sys

CARBON_SERVER = sys.argv[1]
CARBON_PORT = int(sys.argv[2])

while True:
        # Build a Carbon-format message: metric path, random value, current epoch timestamp
        message = 'test.data.foo.bar %d %d\n' % (random.randint(1,20),int(time.time()))
        print 'sending message:\n%s' % message
        # Open a connection to the Carbon listener, send the metric, wait a second, then close
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        sock.sendall(message)
        time.sleep(1)
        sock.close()
EOF
chmod u+x ~/test_carbon.py

And run it: (hit Ctrl-C when you’ve had enough)

$ ~/test_carbon.py localhost 2003
sending message:
test.data.foo.bar 3 1422910401

sending message:
test.data.foo.bar 5 1422910402
[...]

Now we’ve got two series in InfluxDB:

  • example.foo.bar – that we sent using nc
  • test.data.foo.bar – using the python script

Let’s go back to the InfluxDB web UI and have a look at the new data, using the literal series name in the query:

select * from test.data.foo.bar

Well fancy that – InfluxDB has done us a nice little graph of the data. But more to the point, we can see all the values in the series.

And a regex shows us both series, matching on the ‘foo’ part of the name:

select * from /foo/ limit 3

Let’s take it a step further. InfluxDB supports aggregate functions, such as max, min, and so on:

select count(value), max(value),mean(value),min(value) from test.data.foo.bar

Whilst we’re at it, let’s bring in another way to get data out – with the HTTP API, just like we used for creating the database above. Given a query, it returns the data in json format. There’s a nice little utility called jq which we can use to pretty-print the json, so let’s install that first:

sudo yum install -y jq

and then call the InfluxDB API, piping the return into jq:

curl --silent --get 'http://localhost:8086/db/carbon/series?u=root&p=root' --data-urlencode "q=select count(value), max(value),mean(value),min(value) from test.data.foo.bar"|jq '.'

The result should look something like this:

[
  {
    "name": "test.data.foo.bar",
    "columns": [
      "time",
      "count",
      "max",
      "mean",
      "min"
    ],
    "points": [
      [
        0,
        12,
        14,
        5.666666666666665,
        1
      ]
    ]
  }
]

We could have used the Web UI for this, but to be honest the inclusion of the graphs just confuses things because there’s nothing to graph and the table of data that we want gets hidden lower down the page.

Setting up obi-metrics-agent to Send OBIEE DMS metrics to InfluxDB

obi-metrics-agent is an open-source tool from Rittman Mead that polls your OBIEE system to pull out all the lovely juicy DMS metrics from it. It can write them to file, insert them to an RDBMS, or as we’re using it here, send them to a carbon-compatible endpoint (such as Graphite, or in our case, InfluxDB).

To install it simply clone the git repository (I’m doing it to /opt but you can put it where you want)

# Install pre-requisite
sudo yum install -y libxml2-devel python-devel libxslt-devel python-pip
sudo pip install lxml
# Clone the git repository
git clone https://github.com/RittmanMead/obi-metrics-agent.git ~/obi-metrics-agent
# Move it to /opt folder
sudo mv ~/obi-metrics-agent /opt

and then run it:

cd /opt/obi-metrics-agent

./obi-metrics-agent.py \
--opmnbin /app/oracle/biee/instances/instance1/bin/opmnctl \
--output carbon \
--carbon-server localhost

I’ve used line continuation character \ here to make the statement clearer. Make sure you update opmnbin for the correct path of your OPMN binary as necessary, and localhost if your InfluxDB server is not local to where you are running obi-metrics-agent.

After running this you should be able to see the metrics in InfluxDB. For example:

select * from /Oracle_BI_DB_Connection_Pool\..+\.*Busy/ limit 5

Setting up collectl to Send OS metrics to InfluxDB

collectl is an excellent tool written by Mark Seger and reports on all sorts of OS-level metrics. It can run interactively, write metrics to file, and/or send them on to a carbon endpoint such as InfluxDB.

Installation is a piece of cake, using the EPEL yum repository:

# Install the EPEL yum repository
sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/`uname -p`/epel-release-6-8.noarch.rpm
# Install collectl
sudo yum install -y collectl
# Set it to start at boot
sudo chkconfig --level 35 collectl on

Configuration to enable logging to InfluxDB is a simple matter of modifying the /etc/collectl.conf configuration file either by hand or using this set of sed statements to do it automagically.
The localhost in the second sed command is the hostname of the server on which InfluxDB is running:

sudo sed -i.bak -e 's/^DaemonCommands/#DaemonCommands/g' /etc/collectl.conf
sudo sed -i -e '/^#DaemonCommands/a DaemonCommands = -f \/var\/log\/collectl -P -m -scdmnCDZ --export graphite,localhost:2003,p=.os,s=cdmnCDZ' /etc/collectl.conf

If you want to log more frequently than ten seconds, make this change (for 5 second intervals here):

sudo sed -i -e '/#Interval =     10/a Interval = 5' /etc/collectl.conf

Restart collectl for the changes to take effect:

sudo service collectl restart

As above, a quick check through the web UI should confirm we’re getting data through into InfluxDB:

Note that the very handy regex support lets us be lazy with the series naming. We know there is a metric with ‘cputotal’ somewhere in its name, so /cputotal/ will match anything containing it.
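For example, to check that the CPU data is arriving (the exact series names will depend on your hostname and the collectl export settings above):

select * from /cputotal/ limit 5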

Installing and Configuring Grafana

Like InfluxDB, Grafana is also easy to install, although it does require a bit of setting up. It needs to be hooked into a web server, as well as configured to connect to a source for metrics and a store for its dashboard definitions.

First, download the binary (this is based on v1.9.1, but releases are frequent so check the downloads page for the latest):

cd ~
wget http://grafanarel.s3.amazonaws.com/grafana-1.9.1.zip

Unzip it and move it to /opt:

unzip grafana-1.9.1.zip
sudo mv grafana-1.9.1 /opt

Configuring Grafana to Connect to InfluxDB

We need to do a bit of configuration, so first create the configuration file based on the template given:

cd /opt/grafana-1.9.1
cp config.sample.js config.js

And now open the config.js file in your favourite text editor. Grafana supports various sources for metrics data, as well as various targets to which it can save the dashboard definitions. The configuration file helpfully comes with configuration elements for many of these, but all commented out. Uncomment the InfluxDB stanzas and amend them as follows:

datasources: {
    influxdb: {
        type: 'influxdb',
        url: "http://sampleapp:8086/db/carbon",
        username: 'root',
        password: 'root',
    },
    grafana: {
        type: 'influxdb',
        url: "http://sampleapp:8086/db/grafana",
        username: 'root',
        password: 'root',
        grafanaDB: true
    },
},

Points to note:

  1. The server name is the host as you will be accessing it from your web browser. Whilst the configuration we did earlier was all based around ‘localhost’, since that was just communication between components on the same server, the Grafana configuration is used by the web application running in your browser. So unless you are using a web browser on the same machine as where InfluxDB is running, you must put the resolvable server address of your InfluxDB machine here (there’s a quick check for this just after this list).
  2. The default InfluxDB username/password is root/root, not admin/admin
  3. Edit the database names in the url, either as shown if you’ve followed the same names used earlier in the article or your own versions of them if not.
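A quick sanity check of the address you’ve put in config.js is to call the InfluxDB HTTP API with it, in the same way we did earlier with curl. If this returns JSON rather than an error, then Grafana running in your browser should be able to reach it too (substitute sampleapp for your own InfluxDB host):

curl --silent --get 'http://sampleapp:8086/db/carbon/series?u=root&p=root' --data-urlencode "q=list series"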

Setting Grafana up in Apache

Grafana runs within a web server, such as Apache or nginx. Here I’m using Apache, so first off install it:

sudo yum install -y httpd

And then set up an entry for Grafana in the configuration folder by pasting the following to the command line:

cat > /tmp/grafana.conf <<EOF
Alias /grafana /opt/grafana-1.9.1

<Location /grafana>
Order deny,allow
Allow from 127.0.0.1
Allow from ::1
Allow from all
</Location>
EOF

sudo mv /tmp/grafana.conf /etc/httpd/conf.d/grafana.conf

Now restart Apache:

sudo service httpd restart

And if the gods of bits and bytes are smiling on you, when you go to http://yourserver/grafana you should see:

Note that as with InfluxDB, you may well need to open your firewall for Apache which is on port 80 by default. Follow the same iptables instructions as above to do this.
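For reference, the additional rule for Apache’s default port 80 looks like this, added to /etc/sysconfig/iptables in the same place as before and followed by a sudo service iptables restart:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT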

Building Grafana Dashboards on Metrics Held in InfluxDB

So now we’ve set up our metric collectors, sending data into InfluxDB.

Let’s see now how to produce some swanky dashboards in Grafana.

Grafana has a concept of Dashboards, which are made up of Rows and within those Panels. A Panel can have on it a metric Graph (duh), but also static text or single-figure metrics.

To create a new dashboard click the folder icon and select New:

You get a fairly minimal blank dashboard. On the left you’ll notice a little green tab: hover over that and it pops out to form a menu box, from where you can choose the option to add a graph panel:

Grafana Graph Basics

On the blank graph that’s created click on the title (with the accurate text “click here”) and select edit from the options that appear. This takes you to the graph editing page, which looks equally blank but from here we can now start adding metrics:

In the box labelled series start typing Active_Sessions and notice that Grafana will autocomplete it to any available metrics matching this:

Select Oracle_BI_PS_Sessions.Active_Sessions and your graph should now display the metric.

To change the time period shown in the graph, use the time picker at the top of the screen. You can also click & drag (“brushing”) on any graph to select a particular slice of time.

So, set the time filter to 15 minutes ago and from the Auto-refresh submenu set it to refresh every 5 seconds. Now login to your OBIEE instance, and you should see the Active Sessions value increase (one per session login):

To add another metric to the graph you can click on Add query at the bottom right of the page, or if it’s closely related to the one you’ve defined already click on the cog next to it and select duplicate:

In the second query add Oracle_BI_General.Total_sessions (remember, you can just type part of the string and Grafana autocompletes based on the metric series stored in InfluxDB). Run a query in OBIEE to cause sessions to be created on the BI Server, and you should now see the Total sessions increase:

To save the graph, and the dashboard, click the Save icon. To return to the dashboard to see how your graph looks alongside others, or to add more panels to it, click on Back to dashboard.

Grafana Graph Formatting

Let’s now take a look at the options we’ve got for modifying the styling of the graph. There are several tabs/sections to the graph editor – General, Metrics (the default), Axes & Grid, and Display Styles. The first obvious thing to change is the graph title, which can be changed on the General tab:

From here you can also change how the graph is sized on the dashboard using the Span and Height options. A new feature in recent versions of Grafana is the ability to link dashboards to help with analysis paths – guided navigation as we’d call it in OBIEE – and it’s from the General tab here that you can define this.

On the Metrics tab you can specify what text to use in the legend. By default you get the full series name, which is usually too big to be useful as well as containing a lot of redundant repeating text. You can either specify literal text in the alias field, or you can use segments of the series name identified by $x where x is the zero-based segment number. In the example I’ve hardcoded the literal value for the second metric query, and used a dynamic segment name for the first:
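To make the segment numbering concrete, take one of the DMS series used in this article (the obi11-01.OBI prefix will vary according to your host and obi-metrics-agent configuration):

obi11-01.OBI.Oracle_BI_General.Total_sessions
$0 = obi11-01, $1 = OBI, $2 = Oracle_BI_General, $3 = Total_sessions

So an alias of $3 gives a legend entry of just Total_sessions.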

On the Axes & Grid tab you can specify the obvious stuff like min/max scales for the axes and the scale to use (bits, bytes, etc). To put metrics on the right axis (and to change the colour of the metric line too) click on the legend line, and from there select the axis/colour as required:

You can set thresholds to overlay on the graph (to highlight warning/critical values, for example), as well as customise the legend to show an aggregate value for each metric, show it in a table, or not at all:

The last tab, Display Styles, has even more goodies. One of my favourite new additions to Grafana is the Tooltip. Enabling this gives you a tooltip when you hover over the graph, displaying the value of all the series at that point in time:

You can change the presentation of the graph, which by default is a line, adding bars and/or points, as well as changing the line width and fill.

  • Solid Fill:

  • Bars only

  • Points and translucent fill:

Advanced InfluxDB Query Building in Grafana

Identifying Metric Series with RegEx

In the example above there were two fairly specific metrics that we wanted to report against. What you will find is much more common is wanting to graph out a set of metrics from the same ‘family’. For example, OBIEE DMS metrics include a great deal of information about each Connection Pool that’s defined. They’re all in a hierarchy that looks like this:

obi11-01.OBI.Oracle_BI_DB_Connection_Pool.Star_01_-_Sample_App_Data_ORCL_Sample_Relational_Connection

Under which you’ve got

Capacity
Current Connection Count
Current Queued Requests

and so on.

So rather than creating an individual metric query for each of these (similar to how we did for the two session metrics previously) we’ll use InfluxDB’s rather smart regex method for identifying metric series in a query. And because Grafana is awesome, writing the regex isn’t as painful as it could be because the autocomplete validates your expression in realtime. Let’s get started.

First up, let’s work out the root of the metric series that we want. In this case, it’s the orcl connection pool. So in the series box, enter /orcl/. The / delimiters indicate that it is a regex query. As soon as you enter the second / you’ll get the autocomplete showing you the matching series:

/orcl/

If you scroll down the list you’ll notice there are other metrics in there besides the Connection Pool ones, so let’s refine our query a bit:

/orcl_Connection_Pool/

That’s better, but we’ve now got all the Connection Pool metrics, which, whilst fascinating to study (no, really), complicate our view of the data a bit, so let’s pick out just the ones we want. First up we’ll put in the dot that’s going to precede any of the final identifiers for the series (.Capacity, .Current Connection Count, etc). A dot is a special character in regex so we need to escape it: \.

/orcl_Connection_Pool\./

And now let’s check we’re on the right lines by specifying just Capacity to match:

/orcl_Connection_Pool\.Capacity/

Excellent. So we can now add in more permutations, with a bracketed list of options separated with the pipe (regex OR) character:

/orcl_Connection_Pool\.(Capacity|Current)/

We can use a wildcard .* for expressions that are not directly after the dot that we specified in the match pattern. For example, let’s add any metric that includes Queued:

/orcl_Connection_Pool\.(Capacity|Current|.*Queued)/

But now we’ve a rather long list of matches, so let’s refine the regex to narrow it down:

/orcl_Connection_Pool\.(Capacity|Current|Peak.*Queued).+(Requests|Connection)/

(Something else I tried before this was regex negative look-behind, but it looks like Go (which InfluxDB is written in) doesn’t support it).

Setting the Alias to $4, and the legend to include values in a table format gives us this:

Now to be honest here, in this specific example, I could have created four separate metric queries in a fraction of the time it took to construct that regex. That doesn’t detract from the usefulness and power of regex though, it simply illustrates the point of using the right tool for the right job, and where there’s a few easily identified and static metrics, a manual selection may be quicker.

Aggregates

By default Grafana will request the mean of a series at the defined grain of time from InfluxDB. The grain of time is calculated automatically based on the time window you’ve got shown in your graph. If you’re collecting data every five seconds, and build a graph to show a week’s worth of data, showing all 120960 data points will end up as a very indistinct line:

So instead Grafana generates an InfluxDB query that rolls the data up to more sensible intervals – in the case of a week’s worth of data, every 10 minutes:

You can see, and override, the time grouping in the metric panel. By default it’s dynamic and you can see the current value in use in lighter text, like this:

You can also set an optional minimal time grouping in the second of the “group by time” box (beneath the first). This is a time grouping under which Grafana will never go, so if you always want to roll up to, say, at least a minute (but higher if the duration of the graph requires it), you’d set that here.

So I’ve said that InfluxDB can roll up the figures – but how does it roll up multiple values into one? By default, it takes the mean of all the values. Depending on what you’re looking at, this can be less than desirable, because you may miss important spikes and troughs in your data. So you can change the aggregate rule, to look at the maximum value, minimum, and so on. Do this by clicking on the aggregation in the metric panel:

This is the same series of data, but shown as 5 second samples rolled up to a minute, using the mean, max, and min aggregate rules:

For a look at how all three series can be better rendered together see the discussion of Series Specific Overrides later in this article.

You can also use aggregate functions with measures that may not be simple point-in-time values. For example, with an incrementing/accumulating measure (such as a counter like “number of requests since launch”) you actually want to graph the rate of change, the delta between each point. To do this, use the derivative function. In this graph you can see the default aggregation (mean, in green) against derivative, in yellow. One is in effect the “actual” value of the measure, the other is the rate of change, which is much more useful to see in a time series.

Note that if you are using derivative you may need to fix the group by time to the grain at which you are storing data. In my example I am storing data every 5 seconds, but if the default time grain on the graph is 1s then it won’t show the derivative data.

See more details about the aggregations available in InfluxDB in the docs here. If you want to use an aggregation (or any query) that isn’t supported in the Grafana interface simply click on the cog icon and select Raw query mode from where you can customise the query to your heart’s content.
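As an example of what ends up in raw query mode, a derivative query might look something like the following. Treat this as a sketch only: the series is one of the accumulating DMS counters listed later in this article, $timeFilter is the placeholder Grafana fills in with the current time range, and the 5s grouping assumes the five-second collection interval discussed above:

select derivative(value) from "obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Accumulated_Requests" where $timeFilter group by time(5s) order asc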

Drawing inverse graphs

As mentioned just above, you can customise the query sent to InfluxDB, which means you can do this neat trick to render multiple related series that would otherwise overlap by inverting one of them. In this example I’ve got the network I/O drawn conventionally:

But since metrics like network I/O, disk I/O and so on have a concept of adding and taking, it feels much more natural to see the input as ‘positive’ and output as ‘negative’.

This, for my money, certainly makes it easier to see at a glance whether we’ve got data coming or going, and at what volume. To implement this simply set up your series as usual, and then for the series you want to invert click on the cog icon and select Raw query mode. Then in place of

mean(value)

put

mean(value*-1)

Series Specific Overrides

The presentation options that you specify for a graph will by default apply to all series shown in the graph. As we saw previously you can change the colour, width, fill etc of a line, or render the graph as bars and/or points instead. This is all good stuff, but presumes that all measures are created equal – that every piece of data on the graph has the same meaning and importance. Often we’ll want to change how we display a particular set of data, and we can use Series Specific Overrides in Grafana to do that.

For example in this graph we can see the number of busy connections and the available capacity:

But the actual (Busy Connections) is the piece of data we want to see at a glance, against the context of the available Capacity. So by setting up a Series Specific Override we can change the formatting of each line individually – calling out the actual (thick green) and making the threshold more muted (purple):

To configure a Series Specific Override go to the Display Styles panel and click Add series override rule. Pick the specific series or use a regex to identify it, and then use the + button to add formatting options:

A very useful formatting option is Z-index, which enables you to define the layering on the graph so that a given series is rendered on top (or below) another. To bring something to the very front use a Z-index of 3; for the very back use -3. Series Specific Overrides are also a good way of dynamically assigning multiple Y axes.

Another great use of Series Specific Overrides is to show the min/max range for data as a shaded area behind the main line, thus providing more context for aggregate data. I discussed above how Grafana can get InfluxDB to roll up (aggregate) values across time periods to make graphs more readable when shown for long time frames – and how this can mask data exceptions. If you only show the mean, you miss small spikes and troughs; if you only show the max or min then you over or under count the actual impact of the measure. But, we can have the best of all worlds! The next two graphs show the starting point – showing just the mean (missing the subtleties of a data series) and showing all three versions of a measure (ugly and unusable):

Instead of this, let’s bring out the mean, but still show it in context of the range of the values within the aggregate:

I hope you’d agree that this a much cleaner and clearer way of presenting the data. To do it we need two steps:

  1. Make sure that each metric has an alias. This is used in the label but importantly is also used in the next step to identify each data series. You can skip this bit if you really want and regex the series to match directly in the next step, but setting an alias is much easier

  2. On the Display Styles tab click Add series override rule at the bottom of the page. In the alias or regex box you should see your aliases listed. Select the one which is the maximum series. Then choose the formatting option Fill below to and select the minimum series

    You’ll notice that Grafana automagically adds in a second rule to disable lines for the minimum series, as well as on the existing maximum series rule.

    Optionally, add another rule for your mean series, setting the Z-index to 3 to bring it right to the front.

    All pretty simple really, and a nice result:

Variables in Grafana (a.k.a. Templating)

In lots of metric series there are often going to be groups of measures that are associated with recurring instances of a parent. For example, CPU details for multiple servers, or in the OBIEE world connection pool details for multiple connection pools.

centos-base.os.cputotals.user
db12c-01.os.cputotals.user
gitserver.os.cputotals.user
media02.os.cputotals.user
monitoring-01.os.cputotals.user
etc

Instead of creating a graph for each permutation, or modifying the graph each time you want to see a different instance, you can instead use Templating, which is basically creating a variable that can be incorporated into query definitions.

To create a template you first need to enable it per dashboard, using the cog icon in the top-right of the dashboard:

Then open the Templating option from the menu opened by clicking on the cog on the left side of the screen

Now set up the name of the variable, and specify a full (not partial, as you would in the graph panel) InfluxDB query that will return all the values for the variable – or rather, the list of all series from which you’re going to take the variable name.

Let’s have a look at an example. Within the OBIEE DMS metrics you have details about the thread pools within the BI Server, and there are different thread pool types, and it is that type that I want to store. Here’s a snippet of the series:

[...]
obi11-01.OBI.Oracle_BI_Thread_Pool.DB_Gateway.Peak_Queued_Requests
obi11-01.OBI.Oracle_BI_Thread_Pool.DB_Gateway.Peak_Queued_Time_milliseconds
obi11-01.OBI.Oracle_BI_Thread_Pool.DB_Gateway.Peak_Thread_Count
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Accumulated_Requests
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Average_Execution_Time_milliseconds
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Average_Queued_Requests
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Average_Queued_Time_milliseconds
obi11-01.OBI.Oracle_BI_Thread_Pool.Server.Avg_Request_per_sec
[...]

Looking down the list, it’s the DB_Gateway and Server values that I want to extract. First up is some regex to return the series with the thread pool name in:

/.*Oracle_BI_Thread_Pool.*/

and now build it as part of an InfluxDB query:

list series /.*Oracle_BI_Thread_Pool.*/

You can validate this against InfluxDB directly using the web UI for InfluxDB or curl as described much earlier in this article. Put the query into the Grafana Template definition and hit the green play button. You’ll get a list back of all series returned by the query:

Now we want to extract out the threadpool names and we do this using the regex capture group ( ):

/.*Oracle_BI_Thread_Pool\.(.*)\./

Hit play again and the results from the first query are parsed through the regex and you should have just the values you need:

If the values are likely to change (for example, Connection Pool names will change in OBIEE depending on the RPD) then make sure you select Refresh on load. Click Add and you’re done.

You can also define variables with fixed values, which is good if they’re never going to change, or they are but you’ve not got your head around RegEx. Simply change the Type to Custom and enter comma-separated values.
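For example, to hardcode the two thread pool types we identified above rather than deriving them with a query, the custom value list would simply be:

DB_Gateway,Server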

To use the variable simply reference it prefixed with a dollar sign, in the metric definition:

or in the title:

To change the value selected just use the dropdown from the top of the screen:

Annotations

Another very nice feature of Grafana is Annotations. These are overlays on each graph at a given point in time to provide additional context to the data. I use it when analysing test data, to be able to see which script I ran, and when:

There are two elements to Annotations – setting them up in Grafana, and getting the data into the backend (InfluxDB in this case, but they work with other data sources such as Graphite too).

Storing an Annotation

An annotation is nothing more than some time series data, but typically a string at a given point in time rather than a continually changing value (measure) over time.

To store it, just chuck the data at InfluxDB and it creates the necessary series. In this example I’m using one called events but it could be called foobar for all it matters. You can read more about putting data into InfluxDB here and choose the method most suitable to the event that you want to record and display as an annotation. I’m running some bash-based testing, so curl fits well here, but if you were using a python program you could use the python InfluxDB client, and so on.

Sending data with curl is easy, and looks like this:

curl -X POST -d '[{"name":"events","columns":["id","action"],"points":[["big load test","start"]]}]' 'http://monitoring-server.foo.com:8086/db/carbon/series?u=root&p=root'

The main bit of interest, other than the obvious server name and credentials, is the JSON payload that we’re sending. Pulling it out and formatting it a bit more nicely:

{  
   "name":"events",
   "columns":[  
      "test-id",
      "action"
   ],
   "points":[  
      [  
         "big load test",
         "start"
      ]
   ]
}

So the series (“table”) we’re loading is called events, and we’re going to store an entry for this point in time with two columns, id and action, storing the values big load test and start respectively. Interestingly (and this is something that’s very powerful), InfluxDB’s schema can evolve in a way that no traditional RDBMS could. Never mind that we’ve not had to define events before loading it; we could even load it at subsequent time points with more columns if we want to, simply by sending them in the data payload.
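To illustrate that, a later event could include an extra column just by sending it in the payload. Here’s a sketch that adds a free-text comment column (a made-up column name, purely for illustration) to the same events series:

curl -X POST -d '[{"name":"events","columns":["id","action","comment"],"points":[["big load test","end","completed with no errors"]]}]' 'http://monitoring-server.foo.com:8086/db/carbon/series?u=root&p=root'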

Coming back to real-world usage, we want to make the load as dynamic as possible, so with a few variables and a bit of bash magic we have something like this that will automatically load to InfluxDB the start and end time of every load test that gets run, along with the name of the script that ran it and the host on which it ran:

INFLUXDB_HOST=monitoring-server.foo.com
INFLUXDB_PORT=8086
INFLUXDB_USER=root
INFLUXDB_PW=root
HOSTNAME=$(hostname)
SCRIPT=`basename $0`
curl -X POST -d '[{"name":"events","columns":["host","id","action"],"points":[["'"$HOSTNAME"'","'"$SCRIPT"'","start"]]}]' "http://$INFLUXDB_HOST:$INFLUXDB_PORT/db/carbon/series?u=$INFLUXDB_USER&p=$INFLUXDB_PW"

echo 'Load testing bash code goes here. For now let us just go to sleep'
sleep 60

curl -X POST -d '[{"name":"events","columns":["host","id","action"],"points":[["'"$HOSTNAME"'","'"$SCRIPT"'","end"]]}]' "http://$INFLUXDB_HOST:$INFLUXDB_PORT/db/carbon/series?u=$INFLUXDB_USER&p=$INFLUXDB_PW"

Displaying annotations in Grafana

Once we’ve got a series (“table”) in InfluxDB with our events in, pulling them through into Grafana is pretty simple. Let’s first check the data we’ve got, by going to the InfluxDB web UI (http://influxdb:8083) and from the Explore Data » page running a query against the series we’ve loaded:

select * from events

The time value is in epoch milliseconds, and the remaining values are whatever you sent to it.

Now in Grafana enable Annotations for the dashboard (via the cog in the top-right corner)

Once enabled use the cog in the top-left corner to open the menu from which you select the Annotations dialog. Click on the Add tab. Give the event group a name, and then the InfluxDB query that pulls back the relevant data. All you need to do is take the query you used above to test out the data and append the time predicate where $timeFilter, so that only events for the time window currently being shown are returned:

select * from events where $timeFilter

Click Add and then set your time window to include a period when an event was recorded. You should see a nice clear vertical line and a marker on the x-axis that when you hover over it gives you some more information:

You can use the Column Mapping options in the Annotations window to bring in additional information into the tooltip. For example, in my event series I have the id of the test, action (start/end), and the hostname. I can get this overlaid onto the tooltip by mapping the columns thus:

Which then looks like this on the graph tooltips:

N.B. currently (Grafana v1.9.1) when making changes to an Annotation definition you need to refresh the graph views after clicking Update on the annotation definition, otherwise you won’t see the change reflected in the annotations on the graphs.

Sparklines

Everything I’ve written about Grafana so far has revolved around the graphs that it creates, and unsurprisingly because this is the core feature of the tool, the bread and butter. But there are other visualisation options available – “Singlestat”, and “Text”. The latter is pretty obvious and I’m not going to discuss it here, but Singlestat, a.k.a. Sparkline and/or Performance Tiles, is awesome and well worth a look. First, an illustration of what I’m blathering about:

A nice headline figure of the current number of active sessions, along with a sparkline to show the trend of the metric.

To add one of these to your dashboard go to the green row menu icon on the left (it’s mostly hidden and will pop out when you hover over it) and select Add Panel -> singlestat.

On the panel that appears go to the edit screen as shown:

In the Metrics panel specify the series as you would with a graph, but remember you need to pull back just a single series – no point writing a regex to match multiple ones. Here I’m going to show the number of queued requests on a connection pool. Note that because I want to show the latest value I change the aggregation to last:

In the General tab set a title for the panel, as well as the width of it – unlike graphs you typically want these panels to be fairly narrow since the point is to show a figure not lots of detail. You’ll notice that I’ve also defined a Drilldown / detail link so that a user can click on the summary figure and go to another dashboard to see more detail.

The Options tab gives you the option to set font size, prefixes/suffixes, and is also where you set up sparkline and conditional formatting.

Tick the Spark line box to draw a sparkline within the panel – if you’ve not seen them before, sparklines are great visualisations for showing the trend of a metric without fussing with axes and specific values. Tick the Background mode to use the entire height of the panel for the graph and overlay the summary figure on top.

Now for the bit I think is particularly nice – conditional formatting of the singlestat panel. It’s dead easy and not a new concept, but is a really great way to let a user see at a glance if there’s something that needs their attention. In the case of this example here, queueing connections, any queueing is dodgy and more than a few is bad (m’kay). So let’s colour code it:

You can even substitute values for words – maybe the difference between 61 queued sessions and 65 is fairly irrelevant; it’s the fact that there is that magnitude of queued sessions that is the real problem:

Note that the values are absolutes, not ranges. There is an open issue for this so hopefully that will change. The effect is nice though:

Conclusion

Hopefully this article has given you a good idea of what is possible with data stored in InfluxDB and visualised in Grafana, and how to go about doing it.

If you’re interested in OBIEE monitoring you might also be interested in the ELK suite of tools that complements what I have described here well, giving an overall setup like this:

You can read more about its use with OBIEE here, or indeed get in touch with us if you’d like to learn more or have us come and help with your OBIEE monitoring and diagnostics.


An Introduction to Analysing ODI Runtime Data Through Elasticsearch and Kibana 4


An important part of working with ODI is analysing the performance when it runs, and identifying steps that might be inefficient as well as variations in runtime against a baseline trend. The Operator tool in ODI itself is great for digging down into individual sessions and load plan executions, but for broader analysis we need a different approach. We also need to make sure we keep the data available for trend analysis, as it’s often the case that tables behind Operator are frequently purged for performance reasons.

In this article I’m going to show how we can make use of a generic method of pulling information out of an RDBMS such as Oracle and storing it in Elasticsearch, from where it can be explored and analysed through Kibana. It’s standalone, it’s easy to do, it’s free open source – and it looks and works great! Here I’m going to use it for supporting the analysis of ODI runtime information, but it is equally applicable to any time-based data you’ve got in an RDBMS (e.g. OBIEE Usage Tracking data).

Kibana is an open-source data visualisation and analysis tool, working with data stored in Elasticsearch. These tools work really well for very rapid analysis of any kind of data that you want to chuck at them quickly and work with. By skipping the process of schema definition and data modelling, the time taken to the first results is drastically reduced. It enables you to quickly start “chucking about” data and getting meaning out of it before you commit full-scale to how you want to analyse it, which is what the traditional modelling route can sometimes force you to do prematurely.

ODI writes runtime information to the database, about sessions run, steps executed, time taken and rows processed. This data is important for analysing things like performance issues, and batch run times. Whilst with the equivalent runtime data (Usage Tracking) from OBIEE there is the superb RPD/Dashboard content that Oracle ship in SampleApp v406, for ODI the options aren’t as vast, ultimately being based on home-brew SQL against the repository tables using the repository schema documentation from Oracle. Building an OBIEE metadata model against the ODI schema is one option, but then requires an OBIEE server on which to run it – or merging into an existing OBIEE deployment – which means that it can become more hassle than it’s worth. It also means a bunch of up-front modelling before you get any kind of visualisations and data out. By copying the data across into Elasticsearch it’s easy to quickly build analyses against it, and has the additional benefit of retaining the data as long as you’d like meaning that it’s still available for long-term trend analysis once the data’s been purged from the ODI repository itself.

Let’s take a bit of a walk through the ODI dashboard that I’ve put together. First up is a view on the number of sessions that have run over time, along with their duration. For duration I’ve shown 50th (median), 75th and 95th percentiles to get an idea of the spread of session runtimes. At the moment we’re looking at all sessions, so it’s not surprising that there is a wide range since there’ll always be small sessions and longer ones:

Next up on the dashboard comes a summary of top sessions by runtime, both cumulative and per-session. The longest running sessions are an obvious point of interest, but cumulative runtime is also important; something may only take a short while to run when compared to some singular long-running sessions, but if it runs hundreds of times then it all adds up and can give a big performance boost if time is shaved off it.

Plotting out session execution times is useful to be able to see when the longest running sessions ran:

The final element on this first dashboard is one giving the detail for each of the top x long-running session executions, including the session number so that it can be examined in further detail through the Operator tool.

Kibana dashboards are interactive, so you can click on a point in a graph to zoom in on that time period, as well as click & drag to select an arbitrary range. The latter technique is sometimes known as “Brushing”, and if I’m not describing it very well have a look at this example here and you’ll see in an instant what I mean.

As you focus on a time period in one graph the whole dashboard’s time filter changes, so where you have a table of detail data it then just shows it for the range you’ve selected. Notice also that the granularity of the aggregation changes as well, from a summary of every three hours in the first of the screenshots through to 30 seconds in the last. This is a nice way of presenting a summary of data, but isn’t always desirable (it can mask extremes and abnormalities) so can be configured to be fixed as well.

Time isn’t the only interaction on the dashboard – anything that’s not a metric can be clicked on to apply a filter. So in the above example, where the top sessions by cumulative time are listed out, we might want to find out more about the one with several thousand executions.

Simply clicking on it then filters the dashboard and now the session details table and graph show information just for that session, including duration, and rows processed:

Session performance analysis

As an example of the benefit of using a spread of percentiles, we can see here a particular session that had an erratic runtime with great variation, which then stabilised. The purple line is the 95th percentile response time; the green and blue are 50th and 75th respectively. It’s clear that whilst up to 75% of the sessions completed in about the same kind of time each time they ran, the remaining quarter took anything up to five times as long.

One of the most important things in performance is ensuring consistent performance, and that is what happens here from about half way along the horizontal axis at c.February:

But what was causing the variation? By digging a notch deeper and looking at the runtime of the individual steps within the given session it can be seen that the inconsistent runtime was caused by a single step (the green line in this graph) within the execution. When this step’s runtime stabilises, so does the overall performance of the session:

This is performing a post-mortem on a resolved performance problem to illustrate how useful the data is – obviously if there were still a performance problem we’d have a clear path of investigation to pursue thanks to this data.

How?

Data’s pulled from the ODI repository tables using the Elasticsearch JDBC river, from where it’s stored and indexed in Elasticsearch, and presented through Kibana 4 dashboards.


The data load from the repository tables into Elasticsearch is incremental, meaning that the solution works for both historical analysis and more immediate monitoring too. Because the data’s actually stored in Elasticsearch for analysis it means the ODI repository tables can be purged if required and you can still work with a full history of runtime data in Kibana.
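For the curious, registering the river is a single REST call to Elasticsearch. Treat the following purely as an illustrative sketch: the exact configuration options vary between versions of the JDBC river plugin, and the connection details, credentials and column list (here pulling a handful of columns from the work repository’s SNP_SESSION table) are all assumptions you’d replace with your own:

curl -XPUT 'http://localhost:9200/_river/odi_sessions/_meta' -d '{
   "type" : "jdbc",
   "jdbc" : {
      "url" : "jdbc:oracle:thin:@//odi-repo-db:1521/orcl",
      "user" : "ODI_WORK_REPO",
      "password" : "password",
      "sql" : "select sess_no, sess_name, sess_beg, sess_end, sess_dur, sess_status from snp_session",
      "index" : "odi",
      "type" : "session"
   }
}'

Once the river is registered the data appears in the odi index, from where Kibana 4 can be pointed at it; scheduling and incremental load strategies are configured through further plugin options, for which the plugin’s own documentation is the best reference.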

If you’re interested in finding out more about this solution and how Rittman Mead can help you with your ODI and OBIEE implementation and monitoring needs, please do get in touch.

Instrumenting OBIEE Database Connections For Improved Performance Diagnostics


Nearly four years ago I wrote a blog post entitled “Instrumenting OBIEE – The Final Chapter”. With hindsight, that title suffix (“The Final Chapter”) may have been a tad presumptuous and naïve of me (or perhaps I can just pretend to be ironic now and go for a five-part-trilogy style approach…). Back then OBIEE 11g had only just been released (who remembers 11.1.1.3 in all its buggy-glory?), and in the subsequent years we’ve had significant patchset releases of OBIEE 11g bringing us up to 11.1.1.7.150120 now and with talk of OBIEE 12c around the corner.

As a fanboi of Cary Millsap and his approach to measuring and improving performance, instrumenting code in general – and OBIEE specifically – is something that’s interested me for a long time. The article was the final one that I wrote on my personal blog before joining Rittman Mead and it’s one that I’ve been meaning to re-publish here for a while. A recent client engagement gave me cause to revisit the instrumentation approach and refine it slightly as well as update it for a significant change made in OBIEE 11.1.1.7.1.

What do I mean by instrumentation? Instrumentation is making your program expose information about what is being done, as well as actually doing it. Crudely put, it’s something like this:

10 PRINT "THE TIME IS " NOW()
20 CALL DO_MY_THING()
30 PRINT "I'VE DONE THAT THING, IT TOOK " X " SECONDS"
40 GOTO 10

Rather than just firing some SQL at the database, instead we associate with that SQL information about what program sent it, and what that program was doing, who was using it, and so on. Instrumentation enables you to start analysing performance metrics against tangible actions rather than just amorphous clumps of SQL. It enables you to understand the workload profile on your system and how that’s affecting end users.

Pop quiz: which of these is going to be easier to work with for building up an understanding of a system’s behaviour and workload?

CLIENT_INFO          MODULE                    ACTION       CPU_TIME DISK_READS 
-------------------- ------------------------  ---------- ---------- ---------- 
                                               a17ff8e1         2999          1 
                                               fe6abd92         1000          6 
                                               a264593a         5999          2 
                                               571fe814         5000         12 
                                               63ea4181         7998          4 
                                               7b2fcb68        11999          5

or

CLIENT_INFO          MODULE                    ACTION       CPU_TIME DISK_READS
-------------------- ------------------------  ---------- ---------- ----------
06 Column Selector   GCBC Dashboard/Performan  a17ff8e1         2999          1
05 Table with condit GCBC Dashboard/Performan  a264593a         5999          2
06 View Selector     GCBC Dashboard/Performan  571fe814         5000         12
05 Table with condit GCBC Dashboard/Performan  63ea4181         7998          4
<unsaved analysis>   nqsserver@obi11-01        fe6abd92         1000          6
<unsaved analysis>   nqsserver@obi11-01        7b2fcb68        11999          5

The second one gives us the same information as before, plus the analysis being run by OBIEE, and the dashboard and page.

The benefits of instrumentation work both ways. It makes DBAs happy because they can look at resource usage on the database and trace it back easily to the originating OBIEE dashboard and user. Instrumentation also makes life much easier for troubleshooting OBIEE performance because it’s easy to trace a user’s entire session through from browser, through the BI Stack, and down into the database.
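As a quick example of the kind of thing this enables, once the OBIEE user’s name is being passed through as CLIENT_IDENTIFIER (set up in the steps below), a DBA can switch on SQL trace for just that user’s database sessions using DBMS_MONITOR. A sketch, using the prodney user that appears in the example output later in this article:

-- Enable SQL trace (with wait events) for all sessions carrying this CLIENT_IDENTIFIER
BEGIN
   DBMS_MONITOR.CLIENT_ID_TRACE_ENABLE(client_id => 'prodney', waits => TRUE, binds => FALSE);
END;
/

-- ...and switch it off again when done
BEGIN
   DBMS_MONITOR.CLIENT_ID_TRACE_DISABLE(client_id => 'prodney');
END;
/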

Instrumentation for OBIEE – Step By Step

If you want the ‘tl;dr’ version, the “how” rather than the “why”, here we go. For full details of why it works, see later in the article.

  1. In your RPD create three session variables. These are going to be the default values for variables that we’re going to send to the database. Make sure you set “Enable any user to set the value”.
    • SAW_SRC_PATH
    • SAW_DASHBOARD
    • SAW_DASHBOARD_PG


  2. Set up a session variable initialization block to populate these variables. It is just a “dummy” init block as all you’re doing is setting them to empty/default values, so a ‘SELECT … FROM DUAL’ is just fine:

  3. For each Connection Pool you want to instrument, go to the Connection Scripts tab and add these three scripts to the Execute before query section:

    -- Pass the OBIEE user's name to CLIENT_IDENTIFIER
    call dbms_session.set_identifier('VALUEOF(NQ_SESSION.USER)')

    -- Pass the Analysis name to CLIENT_INFO
    call dbms_application_info.set_client_info(client_info=>SUBSTR('VALUEOF(NQ_SESSION.SAW_SRC_PATH)',(LENGTH('VALUEOF(NQ_SESSION.SAW_SRC_PATH)')-instr('VALUEOF(NQ_SESSION.SAW_SRC_PATH)','/',-1,1))*-1))

    -- Pass the dashboard name & page to MODULE
    -- NB OBIEE >=11.1.1.7.131017 will set ACTION itself so there is no point setting it here (it will get overridden)
    call dbms_application_info.set_module(module_name=> SUBSTR('VALUEOF(NQ_SESSION.SAW_DASHBOARD)', ( LENGTH('VALUEOF(NQ_SESSION.SAW_DASHBOARD)') - INSTR('VALUEOF(NQ_SESSION.SAW_DASHBOARD)', '/', -1, 1) ) *- 1) || '/' || 'VALUEOF(NQ_SESSION.SAW_DASHBOARD_PG)' ,action_name=> '' );

    You can leave the comments in there, and in fact I’d recommend doing so to make it clear for future RPD developers what these scripts are for.

    Your connection pool should look like this:


    An important point to note is that you generally should not be adding these scripts to connection pools that are used for executing initialisation blocks. Initialisation block queries won’t have these request variables so if you did want to instrument them you’d need to find something else to include in the instrumentation.
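As an illustration of the init block query from step 2, here is a minimal sketch. The literal default values are arbitrary placeholders (anything that makes unsaved or ad-hoc requests easy to spot will do), and the three columns map to SAW_SRC_PATH, SAW_DASHBOARD and SAW_DASHBOARD_PG respectively:

-- Dummy init block query: supplies default values for the three
-- variables when Presentation Services does not send them as
-- request variables (e.g. an unsaved analysis run outside a dashboard)
SELECT '<unsaved analysis>' AS saw_src_path,
       '<no dashboard>'     AS saw_dashboard,
       '<no page>'          AS saw_dashboard_pg
FROM   dual;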

Once you’ve made the above changes you should see MODULE, CLIENT_IDENTIFIER and CLIENT_INFO being populated in the Oracle system views :

SELECT SID, 
       PROGRAM, 
       CLIENT_IDENTIFIER, 
       CLIENT_INFO, 
       MODULE, 
       ACTION 
FROM   V$SESSION 
WHERE  LOWER(PROGRAM) LIKE 'nqsserver%';

SID PROGRAM CLIENT_ CLIENT_INFO              MODULE                       ACTION
--- ------- ------- ------------------------ ---------------------------- --------
 17 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 32846912
 65 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 4bc2a368
 74 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 35c9af67
193 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 10bdad6c
302 nqsserv prodney Geographical Analysis 1  11.10 Flights Delay/Overview 3a39d178
308 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 1fad81e0
421 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 4e5d36c1

You’ll note that we don’t set ACTION – that’s because OBIEE now sends a hash of the physical query text across in this column, meaning we can’t use it ourselves. Unfortunately the current version of OBIEE doesn’t store the physical query hash anywhere other than in nqquery.log, meaning that you can’t take advantage of it (i.e. link it back to data from Usage Tracking) within the database alone.

That’s all there is to it – easy! If you want to understand exactly how and why it works, read on…

Instrumentation for OBIEE – How Does it Work?

Connection Pools

When OBIEE runs a dashboard, it does so by taking each analysis on that dashboard and sending a Logical Request for that analysis to the BI Server (nqsserver). The BI Server parses and compiles that Logical request into one or more Physical requests which it then sends to the source database(s).

OBIEE connects to the database via a Connection Pool which specifies the database-specific connection information including credentials, data source name (such as TNS for Oracle). The Connection Pool, as the name suggests, pools connections so that OBIEE is not going through the overhead of connecting and disconnecting for every single query that it needs to run. Instead it will open one or more connections as needed, and share that connection between queries as needed.


As well as the obvious configuration options in a connection pool such as database credentials, OBIEE also supports the option to send additional SQL to the database when it opens a connection and/or sends a new query. It’s this nice functionality that we piggy-back to enable our instrumentation.


Variables

The information that OBIEE can send back through its database connection is limited by what we can expose in variables. From the BI Server’s point of view there are three types of variables:

  1. Repository
  2. Session
  3. Request

The first two are fairly simple concepts; they’re defined within the RPD and populated with Initialisation Blocks (often known as “init blocks”) that are run by the BI Server either on a schedule (repository variables) or per user (session variables). There’s a special type of session variables known as System Session Variables, of which USER is a nice obvious example. These variables are pre-defined in OBIEE and are generally populated automatically when the user session begins (although some, like LOGLEVEL, still need an init block to set them explicitly).

The third type of variable, request variable, is slightly less obvious in function. In a nutshell, they are variables that are specified in the logical request sent to the BI Server, and are passed through to the internals of the BI Server. They’re often used for activating or disabling certain functionality. For example, you can tell OBIEE to specifically not use its cache for a request (even if it finds a match) by setting the request variable DISABLE_CACHE_HIT.

Request variables can be set manually inline in an analysis from the Advanced tab:


And they can also be set from Variable Prompts either within a report prompt or as a standalone dashboard prompt object. The point about request variables is that they are freeform; if they specify the name of an existing session variable then they will override it (if permitted), but they do not require the session variable to exist. We can see this easily enough – and see a variable request prompt in action at the same time. From the Prompts tab of an analysis I’ve added a Variable Prompt (rather than the usual Column Prompt) and given it a made up name, FOO:


Now when I run the analysis I specify a value for it:


and in the query log there’s the request variable:

-------------------- SQL Request, logical request hash:
bfb12eb6
SET VARIABLE FOO='BAR';
SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"
ORDER BY 1
FETCH FIRST 5000001 ROWS ONLY

I’ve cut the quoted Logical SQL down to illustrate the point about the variable, because what was actually there is this:

-------------------- SQL Request, logical request hash:
bfb12eb6
SET VARIABLE QUERY_SRC_CD='Report',SAW_SRC_PATH='/users/prodney/request variable example',FOO='BAR', PREFERRED_CURRENCY='USD';
SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"
ORDER BY 1
FETCH FIRST 5000001 ROWS ONLY

which brings me on very nicely to the key point here. When Presentation Services sends a query to the BI Server it does so with a bunch of request variables set, including QUERY_SRC_CD and SAW_SRC_PATH. If you’ve worked with OBIEE for a while then you’ll recognise these names – they’re present in the Usage Tracking table S_NQ_ACCT. Ever wondered how OBIEE knows what values to store in Usage Tracking? Now you know. It’s whatever Presentation Services tells it to. You can easily test this yourself by playing around in nqcmd:

[oracle@demo ~]$ rlwrap nqcmd -d AnalyticsWeb -u prodney -p Admin123 -NoFetch

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

[...]

Give SQL Statement: SET VARIABLE QUERY_SRC_CD='FOO',SAW_SRC_PATH='BAR';SELECT 0 s_0 FROM "A - Sample Sales"
SET VARIABLE QUERY_SRC_CD='FOO',SAW_SRC_PATH='BAR';SELECT 0 s_0 FROM "A - Sample Sales"

Statement execute succeeded

and looking at the results in S_NQ_ACCT:

BIEE_BIPLATFORM@pdborcl > select to_char(start_ts,'YYYY-MM-DD HH24:MI:SS') as start_ts,saw_src_path,query_src_cd from biee_biplatform.s_nq_acct where start_ts > sysdate -1 order by start_ts;

START_TS            SAW_SRC_PATH                             QUERY_SRC_CD
------------------- ---------------------------------------- --------------------
2015-03-21 11:55:10 /users/prodney/request variable example  Report
2015-03-21 12:44:41 BAR                                      FOO
2015-03-21 12:45:26 BAR                                      FOO
2015-03-21 12:45:28 BAR                                      FOO
2015-03-21 12:46:23 BAR                                      FOO

Key takeaway here: Presentation Services defines a bunch of useful request variables when it sends Logical SQL to the BI Server:

  • QUERY_SRC_CD
  • SAW_SRC_PATH
  • SAW_DASHBOARD
  • SAW_DASHBOARD_PG

Embedding Variables in Connection Script Calls

There are four attributes of the Oracle database session that we can set from OBIEE when it connects to the database. These are:

  • CLIENT_IDENTIFIER
  • CLIENT_INFO
  • MODULE
  • ACTION

As of OBIEE version 11.1.1.7.1 (i.e. OBIEE >= 11.1.1.7.131017) OBIEE automatically sets the ACTION field to a hash of the physical query – for more information see Doc ID 1941378.1. That leaves us with three remaining fields (since OBIEE sets ACTION after anything we do with the Connection Pool):

  • CLIENT_IDENTIFIER
  • CLIENT_INFO
  • MODULE

The syntax to use in a Connection Script is physical SQL, with the VALUEOF function used to embed the OBIEE variable:

  • VALUEOF(REPOSITORY_VARIABLE)
  • VALUEOF(NQ_SESSION.SESSION_VAR)
  • VALUEOF(NQ_SESSION.REQUEST_VAR)

As a simple example here is passing the userid of the OBIEE user, using the Execute before query connection script:

-- Pass the OBIEE user's name to CLIENT_IDENTIFIER
call dbms_session.set_identifier('VALUEOF(NQ_SESSION.USER)')


This would be set for every Connection Pool used for query execution – but not those used for init blocks. Run a query that is routed through the Connection Pool you defined the script against and check out V$SESSION:

SQL> select sid,program,client_identifier from v$session where program like 'nqsserver%';

       SID PROGRAM                                          CLIENT_IDENTIFIER
---------- ------------------------------------------------ ----------------------------------------------------------------
        22 nqsserver@demo.us.oracle.com (TNS V1-V3)         prodney

The USER session variable is always present, so this is a safe thing to do. But, what about SAW_SRC_PATH? This is the path in the Presentation Catalog of the analysis being executed. Let’s add this into the Connection Pool script, passing it through as the CLIENT_INFO:

-- Pass the Analysis name to CLIENT_INFO
call dbms_application_info.set_client_info(client_info=>'VALUEOF(NQ_SESSION.SAW_SRC_PATH)')

This works just fine for analyses within a dashboard, or standalone analyses that have been saved. But what about a new analysis that hasn’t been saved yet? Unfortunately the result is not pretty:


[10058][State: S1000] [NQODBC] [SQL_STATE: S1000] [nQSError: 10058] A general error has occurred.
[nQSError: 43113] Message returned from OBIS.
[nQSError: 43119] Query Failed:
[nQSError: 23006] The session variable, NQ_SESSION.SAW_SRC_PATH, has no value definition.
Statement execute failed

That’s because SAW_SRC_PATH is a request variable and since the analysis has not been saved Presentation Services does not pass it to BI Server as a request variable. The same holds true for SAW_DASHBOARD and SAW_DASHBOARD_PG if you run an analysis outside of a dashboard – the respective request variables are not set and hence the connection pool script causes the query itself to fail.

The way around this is to cheat, slightly. If you create session variables with the names of the request variables that we want to use in the connection pool scripts then we avoid the above nasty failures. If the request variables are set then all is well, and if they are not then we fall back on whatever value we initialise the session variables with.

The final icing on the cake of the solution given above is a bit of string munging with INSTR and SUBSTR to convert and concatenate the dashboard path and page into a single string, so instead of:

/shared/01. QuickStart/_portal/1.30 Quickstart/Overview

we get:

1.30 Quickstart/Overview

Which is much easier on the eye when looking at dashboard names. Similarly with the analysis path we strip all but the last section of it.
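If you want to sanity-check the string munging outside of OBIEE, the same SUBSTR/INSTR expression can be run directly against the database. This is just an illustrative sketch using a hypothetical dashboard path:

-- Extract everything after the last '/' in a path
SELECT SUBSTR('/shared/01. QuickStart/_portal/1.30 Quickstart',
              (LENGTH('/shared/01. QuickStart/_portal/1.30 Quickstart')
               - INSTR('/shared/01. QuickStart/_portal/1.30 Quickstart', '/', -1, 1)) * -1
             ) AS last_element
FROM   dual;
-- Returns: 1.30 Quickstart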

Granular monitoring of OBIEE on the database

Once OBIEE has been configured to be more articulate in its connection to the database, it enables the use of DBMS_MONITOR to understand more about the performance of given dashboards, analyses, or queries for a given user. Through DBMS_MONITOR the collection of statistics such as DB time and DB CPU can be triggered, as well as trace-file generation for queries matching the criteria specified.

As an example, here is switching on system statistics collection for just one dashboard in OBIEE, using SERV_MOD_ACT_STAT_ENABLE:

call dbms_monitor.SERV_MOD_ACT_STAT_ENABLE(
    module_name=>'GCBC Dashboard/Overview'
    ,service_name=>'orcl'
);

Now Oracle starts to collect statistics whenever that particular dashboard is run, which we can use to understand more about how it is performing from a database point of view:

SYS@orcl AS SYSDBA> select module,stat_name,value from V$SERV_MOD_ACT_STATS;

MODULE                   STAT_NAME                           VALUE
------------------------ ------------------------------ ----------
GCBC Dashboard/Overview  user calls                             60
GCBC Dashboard/Overview  DB time                              6789
GCBC Dashboard/Overview  DB CPU                               9996
GCBC Dashboard/Overview  parse count (total)                    15
GCBC Dashboard/Overview  parse time elapsed                    476
GCBC Dashboard/Overview  execute count                          15
GCBC Dashboard/Overview  sql execute elapsed time             3887
[...]
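When you have finished, collection can be switched off again with the corresponding disable call. A sketch, using the same module and service name as the enable example above:

call dbms_monitor.SERV_MOD_ACT_STAT_DISABLE(
    module_name=>'GCBC Dashboard/Overview'
    ,service_name=>'orcl'
);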

Similarly the CLIENT_IDENTIFIER field can be used to collect statistics with CLIENT_ID_STAT_ENABLE or trigger trace file generation with CLIENT_ID_TRACE_ENABLE. What you populate CLIENT_IDENTIFIER with is up to you – by default the script I've detailed at the top of this article puts the OBIEE username in it, but you may want to put the analysis here if that's of more use from a diagnostics point of view on the database side. The CLIENT_INFO field is still available for the other item, but cannot be used with DBMS_MONITOR for identifying queries.
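To give an idea of what that looks like, here is a hedged sketch of the DBMS_MONITOR calls involved. The client identifier value ('prodney') is just the example user from earlier in this article:

-- Collect statistics for all database work tagged with a given OBIEE user
call dbms_monitor.CLIENT_ID_STAT_ENABLE(client_id=>'prodney');

-- ...or write trace files for that user's queries
call dbms_monitor.CLIENT_ID_TRACE_ENABLE(client_id=>'prodney');

-- Inspect the statistics collected so far
SELECT client_identifier, stat_name, value
FROM   v$client_stats
WHERE  client_identifier = 'prodney';

-- Switch collection and tracing off again when done
call dbms_monitor.CLIENT_ID_STAT_DISABLE(client_id=>'prodney');
call dbms_monitor.CLIENT_ID_TRACE_DISABLE(client_id=>'prodney');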

OBIEE nqcmd Tidbits


nqcmd is the ODBC command line tool that has always shipped with OBIEE, and hopefully always will. It enables you to manually fire queries directly at the BI Server, rather than going through the usual route of Presentation Services generating Logical SQL and sending it to the BI Server. This can be useful in several cases:

  1. Automated cache purging, by sending one of the SAPurge[…] ODBC commands to the BI Server, usually done as part of a script (see the example just after this list)
  2. Automated execution of Logical SQL, often done to support testing scenarios
  3. Load Testing the BI Server (via a magic undocumented switch, SA_NQCMD_ADVANCED)
  4. Manual interrogation of the BI Server – if you want to poke and prod nqsserver without launching a web browser, nqcmd is your friend :)
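As an aside on the first of those use cases, a cache purge is just another statement that you can pass to nqcmd, typically from a script file via the -s flag. A minimal sketch using SAPurgeAllCache, the ODBC procedure that clears the whole BI Server cache:

-- Purge the entire BI Server cache
Call SAPurgeAllCache();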

In using nqcmd there’re a couple of things I want to demonstrate here that I find useful but haven’t seen discussed [in detail] elsewhere.

Query Log via nqcmd

All BI Server queries run with a LOGLEVEL>=1 will write some log details to nqquery.log. The usual route to view this is either on the server directly itself, transferring it off with a tool such as WinSCP, or through the Administration page of OBIEE. Another option that is available is from nqcmd itself. You need to do two things:

  1. Set the environment variable SA_NQCMD_ADVANCED to Yes
  2. Include the command line arguments -ShowQueryLog -H when you invoke nqcmd. I don’t know what -H does – it’s just specified as being required for this to work.

Here’s a simple example in action:

[oracle@demo ~]$ export SA_NQCMD_ADVANCED=Yes
[oracle@demo ~]$ nqcmd -d AnalyticsWeb -u prodney -p Admin123 -ShowQueryLog -H

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------



Connection open with info:
[0][State: 01000] [DataDirect][ODBC lib] Application's WCHAR type must be UTF16, because odbc driver's unicode type is UTF16

        [T]able info
        [C]olumn info
        [D]ata type info
        [F]oreign keys info
        [P]rimary key info
        [K]ey statistics info
        [S]pecial columns info
        [Q]uery statement
Select Option: Q

Give SQL Statement: SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"
SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"
-----------------------
s_1
-----------------------
70000000.00
-----------------------
Row count: 1
-----------------------
[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-0] [] [ecid: 0054Sw944KmFw000jzwkno0003ac0000rl,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] ###
########################################### [[
-------------------- SQL Request, logical request hash:
d2294415
SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"

]]
[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-34] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Query Status: Successful Completion [[

]]
[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-28] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Physical query response time 0 (seconds), id <<333971>> [[

]]

]]
[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-29] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Physical Query Summary Stats: Number of physical queries 1, Cumulative time 0, DB-connect time 0 (seconds) [[

]]
[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-33] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Logical Query Summary Stats: Elapsed time 0, Response time 0, Compilation time 0 (seconds) [[

]]

Neat! But so what? Well, I see two uses straight away:

  1. In some situations you may not have access to the filesystem of the server on which the BI Server is running. For example, as a consultant I’ve been to clients where I’m given the Administration Tool client installation only. If I want to debug an RPD that I’m developing I’ll usually want to poke around in nqquery.log to see quite what physical SQL is being generated – and now I can.
  2. There was a discussion on the EMG mailing list recently about generating Physical SQL without executing it on the database. I’m going to discuss this in the next section of this article, and to do the analysis for this rapidly I’m using the inline query log.

Generating Physical SQL for OBIEE without Executing it – SKIP_PHYSICAL_QUERY_EXEC

OBIEE generates the Physical SQL that it runs against the database dynamically, at runtime. It takes the Logical request (“Logical SQL”), runs it through the RPD and generates one or more “Physical SQL” statements to be executed on the database as required to pull back the necessary data. A question arose recently on the EMG mailing list as to whether it is possible to get the Physical SQL – without executing it. You can imagine the benefits of this (namely, regression testing) since executing the database query each time is typically going to be expensive in machine resource and time consuming.

In SampleApp v406 there is a /home/oracle/scripts/PhysicalSQLGenerator, which does two things. First off it generates the Logical SQL for a given analysis, presumably using the generateReportSQL web service. It then takes that and runs it through nqcmd, scraping the nqquery.log for the resulting Physical SQL. In all of this no database queries get run. Very cool. But what’s the “secret sauce” at play here – can we distill it down in order to use it ourselves?

First, let’s look at how the SampleApp script does it. It sets some additional request variables in the Logical SQL:

[oracle@demo PhysicalSQLGenerator]$ cat lsql-out-dir/q1.lsql
SET VARIABLE SKIP_PHYSICAL_QUERY_EXEC=1, LOGLEVEL=2, DISABLE_CACHE_HIT=1, DISABLE_CACHE_SEED=1, QUERY_SRC_CD='SampleApp-PSQLGEN', SAW_SRC_PATH='/users/prodney/folder/request variable example':SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"
ORDER BY 1
FETCH FIRST 5000001 ROWS ONLY
;

And if we extract the relevant part out of the bash script we can see that it also uses a couple of extra command line arguments (-q -NoFetch) when invoking nqcmd:

nqcmd -q -NoFetch -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

When it’s run we check nqquery.log and lo-and-behold we get this: (edited for brevity)

------------------- Sending query to database named 01 - Sample App Data (ORCL) (id: <<69923>>), connection pool named Sample Relational Connection, logical request hash dd4fb54f, physical request hash 8d6f36
3d: [[
WITH
SAWITH0 AS (select sum(T42442.Revenue) as c1
from
     BISAMPLE.SAMP_REVENUE_FA2 T42442 /* F21 Rev. (Aggregate 2) */ )
select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1,
     D1.c1 as c2
from
     SAWITH0 D1 ) D1 where rownum <= 5000001

]]

Query Status: Successful Completion [[

Rows 0, bytes 24 retrieved from database query id: <<69923>> Simulation Gateway 

Physical query response time 0 (seconds), id <<69923>> Simulation Gateway

Whilst the log says it is “Sending query to database” it does no such thing, and the “Simulation Gateway” is the giveaway clue. Proof that it doesn’t connect to the database? I shut the database down, and it still worked just fine. Crude, yes, but effective.

I’ll intersperse here the little trick that I mentioned in the first part of this article : -ShowQueryLog. It’s tedious switching back and forth between nqcmd and the nqquery.log when doing this kind of testing, so let’s do it all as one:

export SA_NQCMD_ADVANCED=Yes
nqcmd -H -ShowQueryLog -q -NoFetch -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

Unfortunately it looks like -ShowQueryLog is mutually exclusive with -q and -NoFetch since it doesn't return anything, even though the nqquery.log did get additional entries. But that's fine, since by removing these two flags in order to get -ShowQueryLog to work we're whittling down what is actually needed to generate the physical SQL on its own without database execution. Here's the nqcmd invocation, showing the query log inline and still showing the "Simulation Gateway" entry indicative of no physical query execution:

[oracle@demo PhysicalSQLGenerator]$ export SA_NQCMD_ADVANCED=Yes
[oracle@demo PhysicalSQLGenerator]$ nqcmd -H -ShowQueryLog -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

[...]

------------------------------------
s_0          s_1
------------------------------------
------------------------------------
Row count: 0
------------------------------------
[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-0] [] [ecid: 0054Ut7AJ33Fw000jzwkno0005UZ00005Q,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] ############################################## [[
-------------------- SQL Request, logical request hash:
dd4fb54f
SET VARIABLE SKIP_PHYSICAL_QUERY_EXEC=1, LOGLEVEL=2, DISABLE_CACHE_HIT=1, DISABLE_CACHE_SEED=1, QUERY_SRC_CD='SampleApp-PSQLGEN', SAW_SRC_PATH='/users/prodney/folder/request variable example':SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"
ORDER BY 1
FETCH FIRST 5000001 ROWS ONLY



[...]

[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-18] [] [ecid: 0054Ut7AK5DFw000jzwkno0005UZ00005S,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Sending query to database named 01 - Sample App Data (ORCL) (id: <<70983>>), connection pool named Sample Relational Connection, logical request hash dd4fb54f, physical request hash 8d6f363d: [[
WITH
SAWITH0 AS (select sum(T42442.Revenue) as c1
from
     BISAMPLE.SAMP_REVENUE_FA2 T42442 /* F21 Rev. (Aggregate 2) */ )
select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1,
     D1.c1 as c2
from
     SAWITH0 D1 ) D1 where rownum <= 5000001

]]
[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-34] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Query Status: Successful Completion [[

]]
[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-26] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Rows 0, bytes 24 retrieved from database query id: <<70983>> Simulation Gateway [[

]]
[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-28] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Physical query response time 0 (seconds), id <<70983>> Simulation Gateway [[

[...]

It’s clear that the “-q -Nofetch” parameters used in nqcmd don’t have an effect on whether the physical query is executed (they’re to do with whether nqcmd as an ODBC client pulls back and displays the data you ask for). It’s actually just a single request variable that does the job, and it goes under the rather obvious name of SKIP_PHYSICAL_QUERY_EXEC. When set to 1 it generates all the necessary physical SQL but doesn’t execute it, and the presence of “Simulation Gateway” in the log signals this.

Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools


There comes a point in any sufficiently complex or difficult problem diagnosis when the log files in OBIEE alone are not sufficient for building up a complete picture of what's going on. Even with the debug/trace data that Presentation Services and the other components can be configured to write, you're sometimes just left having to guess what is going on inside the black box of each of the OBIEE system components.

Here we’re going to look at a couple of examples of lifting the lid just a little bit further on what OBIEE is up to, using standard Linux diagnostic tools. These are not something to be reaching for in the first instance, but more getting on to a last resort. Almost always the problem is simpler than you’ll think, and leaping for an network trace or stack trace is going to be missing the wood for the trees.

Diagnostics in action

At a client recently they had a problem with a custom skin deployment on a clustered (scaled-out) OBIEE deployment. Amongst other things the skin was setting the default palette for charts (viewui/chart/dvt-graph-skin.xml), and they were seeing only 50% of chart executions pick up the custom palette – the other 50% used the default. If either entire node was shut down, things were fine, but otherwise it was a 50:50 chance what the colours would be. Most odd….

When you configure a custom skin in OBIEE you should be setting CustomerResourcePhysicalPath in instanceconfig.xml, along with CustomerResourceVirtualPath. Both these are necessary so that Presentation Services knows:

  1. Logical – How to generate URLs for content requested by the user’s browser (eg logos, CSS files, etc).
  2. Physical – How to physically reference files on the file system that are read by OBIEE itself (eg XML files, language files)

The way the client had configured their custom skin was that it was on storage local to each node, and in a node-specific path, something like this:

  • /data/instance1/s_custom/
  • /data/instance2/s_custom/

Writing out the details in hindsight always makes a problem’s root cause a lot more obvious, but at the time this was a tricky problem. Let’s start with the basics. Java Host is responsible for rendering charts, and for some reason, it was not reading the custom colour scheme file from the custom skin correctly. Presentation Services uses all the available Java Hosts in a cluster to request charts, presumably on some kind of round-robin basis. An analysis request on NODE01 has a 50:50 chance of getting its chart rendered on Java Host on NODE01 or Java Host on NODE02:


Turning all the log files up to 11 didn't yield anything useful. For some reason half the time Java Host would just "ignore" the custom skin. Shutting down each node in turn proved that in isolation the custom skin configuration on each node was definitely correct, because then the colours started working just fine. It was only when multiple Java Hosts across the nodes were active that there was a problem.

How Java Host picks up the custom skin is entirely undocumented, and I ended up figuring out that it must get the path to the skin as part of the chart request from Presentation Services. Since Presentation Services on NODE01 has been configured with a CustomerResourcePhysicalPath of /data/instance1/s_custom/, Java Host on NODE02 would fail to find this path (since on NODE02 the skin is located at /data/instance2/s_custom/) and so fall back on the default. This was my hypothesis, which I then proved by making the path for each skin available on each node (via a symlink; using a standard path such as /data/shared/s_custom, or better still a shared mount point, would also have worked), and from there everything worked just fine.

But a hypothesis and successful resolution alone wasn’t entirely enough. Sure the client was happy, but there was that little itch, that unknown “black box” system that appeared to behave how I had deduced, but could we know for sure?

tcpdump – network analysis

All of the OBIEE components communicate with each other and the outside world over TCP. When Presentation Services wants a chart rendered it does so by sending a request to Java Host – over TCP. Using the tcpdump tool we can see that in action, and inspect what gets sent:

$ sudo tcpdump -i venet0 -i lo -nnA 'port 9810'

The -A flag captures the ASCII representation of the packet; use -X if you want ASCII and hex. Port 9810 is the Java Host listen port.

The output looks like this:


You’ll note that in this case it’s intra-node communication, i.e. src and dest IP addresses are the same. The port for Java Host (9810) is clear, and we can verify that the src port (38566) is Presentation Services with the -p (process) flag of netstat:

$ sudo netstat -pn |grep 38566
tcp        0      0 192.168.10.152:38566        192.168.10.152:9810         ESTABLISHED 5893/sawserver

So now if you look in a bit more detail at the footer of the request from Presentation Services that tcpdump captured you’ll see loud and clear (relatively) the custom skin path with the graph customisation file:


Proof that Presentation Services is indeed telling Java Host where to go and look for the custom attributes (including colours)! (NB this is on a test environment, so the paths vary from the /data/instance... example above.)

strace – system call analysis

So tcpdump gives us the smoking gun, but can we find the corpse as well? Sure we can! strace is a tool for tracing system calls, and a fantastically powerful one, but here’s a very simple example:

$ strace -o /tmp/obijh1_strace.log -f -p $(pgrep -f obijh1)

-o means to write the output to file, -f follows child processes as well, and -p passes the process id that strace should attach to. Having set the trace running, I run my chart and then go and pick through the trace file.

We know it’s the dvt-graph-skin.xml file that Java Host should be reading to pick up the custom colours, so let’s search for that:


Well there we go – Java Host went to look for the skin in the path that it was given by Presentation Services, and couldn't find it. From there it'll fall back on the product defaults.

Right Tool, Right Job

As I said at the top of this article, these diagnostic tools are not the kind of things you'd be using day to day. Understanding their output is not always easy and it's probably easy to do more harm than good with false assumptions about what a trace is telling you. But, in the right situations, they are great for really finding out what is going on under the covers of OBIEE.

Analysing ODI performance with Flame Graphs


Flame Graphs are a visualisation that I learnt about through the excellent Linux systems performance work of Brendan Gregg, and saw Luca Canali talk about recently at UKOUG Tech 14. They’re a brilliant way of summarising extremely dense information in a way from which the main components accounting for the most time can be identified. I was recently doing some analysis for a client on their ODI batch runtime and I thought it would be a good idea to try them out. Load Plans can have complex hierarchies to them and working out which main sections account for what time can be tricky, as can following a load plan step through to a session and on to a session step and its constituent parts.

A flame graph is made up of the "stack trace" on the y-axis, and the amount of time spent in each element on the x-axis. This is different from most other standard visualisations where the x-axis represents the passage of time; instead a flame graph summarises the data at multiple levels of the stack trace hierarchy. The "stack trace" in this case with ODI is Load plan -> load plan step (load plan step […]) -> session -> session step -> task. It's as easy to see the overall run time as it is that of a load plan step part way down, or of a constituent task of a session step. And what's more, flame graphs look nice! This may seem a flimsy reason for using them on their own, but it's a bonus over trawling through dull tables of data alone.

Looking at the flame graph above (taken from a demo BI Apps implementation) it’s nice and easy to see that the Warehouse Load Phase accounts for c.75% of the time, within which the two areas accounting for most time are AP and AR balances. This is from literally a single glance at one graphic. Flame Graphs are built as SVGs which enables them to be interactive (here’s an example). Clicking on any of the stack trace boxes drills into that area, so for the tasks taking less time (and so displaying less text) this is useful to see the specifics. Here’s the GL balance load in detail, showing how long the row inserts take in proportion to the index build:

 

Creating the flame graph is simple. You just need a stack trace that is semi-colon separated, followed by a space-delimited counter value at the end. A bit of recursive SQL magic with the SNP_ tables (helpfully documented by Oracle here) gives us this kind of output file with one line for every task executed and its duration:

;Start_Load_Plan;Global_Variable_Refresh;Source_Extract_Phase;1_General;2_General_PRE-SDE;3_PRE-SDE_Day;Finalize_Day;Finalize_W_DAY_D;CREATE_INDEXES;Create_Indexes_:_W_DAY_D_2/2;EXEC_TABLE_MAINT_PROC;TABLE_MAINT_PROC;Create Indexes 3
[...]

which you then run through the Flame Graph tool:

cat /tmp/odi.out |~/git/FlameGraph/flamegraph.pl --title "EBSVISION FIN HR_21_20141021_223159 / 2014-10-24 15:41:42" > /tmp/odi-flame-graph.svg

Simply load the resulting SVG into a web browser such as Chrome, and you’re done. Here’s an example that you can download and try out.

BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack


I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).

OBIEE Regression Testing

In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.

In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.

If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.

Data Discovery with the ELK Stack

My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:

The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:

By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.

The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!


Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.

Using the ELK Stack to Analyse Donor’s Choose Data


Donor’s Choose is an online charity in America through which teachers can post details of projects that need funding and donors can give money towards them. The data from the charity since it began in 2000 is available to download freely here in several CSV datasets. In this article I’m going to show how to use the ELK stack of data discovery tools from Elastic to easily import some data (the donations dataset) and quickly start analysing it to produce results such as this one:

I’m assuming you’ve downloaded and unzipped Elasticsearch, Logstash and Kibana and made Java available if not already. I did this on a Mac, but the tools are cross-platform and should work just the same on Windows and Linux. I’d also recommend installing Kopf, which is an excellent plugin for the management of Elasticsearch.

CSV Data Ingest with Logstash

First off we’re going to get the data in to Elasticsearch using Logstash, after which we can do some analysis using Kibana.

To import the data with Logstash requires a configuration file which in this case is pretty straightforward. We’ll use the file input plugin, process it with the csv filter, set the date of the event to the donation timestamp (rather than now), cast a few fields to numeric, and then output it using the elasticsearch plugin. See inline comments for explanation of each step:

input {  
    file {  
        # This is necessary to ensure that the file is  
        # processed in full. Without it logstash will default  
        # to only processing new entries to the file (as would  
        # be seen with a logfile for a live application, but  
        # not static data like we're working with here)  
        start_position  => beginning  
        # This is the full path to the file to process.  
        # Wildcards are valid.  
        path =>  ["/hdd/ELK/data/opendata/opendata_donations.csv"]  
    }
}

filter {  
        # Process the input using the csv filter.  
        # The list of column names I took manually from the  
        # file itself  
        csv {separator => ","  
                columns => ["_donationid","_projectid","_donor_acctid","_cartid","donor_city","donor_state","donor_zip","is_teacher_acct","donation_timestamp","donation_to_project","donation_optional_support","donation_total","dollar_amount","donation_included_optional_support","payment_method","payment_included_acct_credit","payment_included_campaign_gift_card","payment_included_web_purchased_gift_card","payment_was_promo_matched","via_giving_page","for_honoree","donation_message"]}

        # Store the date of the donation (rather than now) as the  
        # event's timestamp  
        # 
        # Note that the data in the file uses formats both with and  
        # without the milliseconds, so both formats are supplied  
        # here.  
        # Additional formats can be specified using the Joda syntax  
        # (http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html)  
        date { match => ["donation_timestamp", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd HH:mm:ss"]}  
        # ------------
        # Cast the numeric fields to float (not mandatory but makes for additional analysis potential)
        mutate {
        convert => ["donation_optional_support","float"]
        convert => ["donation_to_project","float"]
        convert => ["donation_total","float"]
        }
}

output {  
        # Now send it to Elasticsearch which here is running  
        # on the same machine.  
        elasticsearch { host => "localhost" index => "opendata" index_type => "donations"}  
        }

With the configuration file created, we can now run the import:

./logstash-1.5.0.rc2/bin/logstash agent -f ./logstash-opendata-donations.conf

This will take a few minutes, during which your machine CPU will rocket as logstash processes all the records. Since logstash was originally designed for ingesting logfiles as they’re created it doesn’t actually exit after finishing processing the file, but you’ll notice your machine’s CPU return to normal, at which point you can hit Ctrl-C to kill logstash.

If you’ve installed Kopf then you can see at a glance how much data has been loaded:

Or alternatively query the index using Elasticsearch’s API directly:

curl -XGET 'http://localhost:9200/opendata/_status?pretty=true'

[...]  
    "opendata" : {  
      "index" : {  
        "primary_size_in_bytes" : 3679712363,  
      },  
[...]  
      "docs" : {  
        "num_docs" : 2608803,

Note that Elasticsearch will take more space than the source data (in total the 1.2Gb dataset ends up taking c.5Gb).

Data Exploration with Kibana

Now we can go to Kibana and start to analyse the data. From the Settings page of Kibana add the opendata index that we’ve just created:

Go to Discover and if necessary click the cog icon in the top right to set the index to opendata. The time filter defaults to the last 15 minutes only, and if your logstash has done its job right the events should have the timestamp of the actual donation, so you need to click on the time filter in the very top right of the screen to change time period to, for example, Previous year. Now you should see a bunch of data:

Click the toggle on one of the events to see the full data for it, including things like the donation amount, the message with the donation, and geographical details of the donor. You can find details of all the fields on the Donor’s Choose website here.

Click on the fields on the left to see a summary of the data within, showing very easily that within that time frame and sample of 500 records:

  • two thirds of donations were in the 10-100 dollar range
  • four-fifths included the optional donation towards the running costs of Donor’s Choose.

You can add fields into the table itself (which by default just shows the complete row of data) by clicking on add for the fields you want:

Let’s save this view (known as a “Search”), since it can be used on a Dashboard later:

Data Visualisation with Kibana

One of my favourite features of Kibana is its ability to aggregate data at various dimensions and grains with ridiculous ease. Here’s an example: (click to open full size)

Now let’s amend that chart to show the method of donation, or the donation amount range, or both: (click to open full size)

You can also change the aggregation from the default “Count” (in this case, number of donations) to other aggregations including sum, median, min, max, etc. Here we can compare cheque (check) vs paypal as a payment method in terms of amount given:

Kibana Dashboards

Now let’s bring the visualisations together along with the data table we saw in the the Discover tab. Click on Dashboard, and then the + icon:

Select the visualisations that you’ve created, and then switch to the Searches tab and add in the one that you saved earlier. You’ve now got a data table showing all currently selected data, along with various summaries on it.

You can rearrange the dashboard by dragging each box around to suit. Once you’ve got the elements of the dashboard in place you can start to drill into your data further. To zoom in on a time period click and drag a selection over it, and to filter on a particular data item (for example, state in the “Top ten states” visualisation) click on it and accept the prompt at the top of the screen. You can also use the freetext search at the top of the screen (this is valid on the Discover and Visualize pages too) to search across the dataset, or within a given field.

Example Analysis

Let’s look at some actual data analyses now. One of the most simple is the amount given in donations over time, split by amount given to project and also as the optional support amount:

One of the nice things about Kibana is the ability to quickly change resolution in a graph’s time frame. By default a bar chart will use an “Auto” granularity on the time axis, updating as you zoom in and out so that you always see an appropriate level of aggregation. This can be overridden to show, for example, year-on-year changes:

You can also easily switch the layout of the chart, for example to show the percentage of the two aggregations relative to each other. So whilst the above chart shows the optional support amount increasing by the year, it's actually remaining pretty much the same when taken as a percentage of the donations overall – which if you look into the definition of the field ("we encourage donors to dedicate 15% of each donation to support the work that we do.") makes a lot of sense.

Analysis based on text in the data is easy. Using the Terms sub-aggregation, here we can see the top five states in terms of donation amount, with California consistently at the top of the table.

Since the Terms sub-aggregation shows the Top-x only, you can’t necessarily judge the importance of those values in relation to the rest of the data. To do this more specific analysis you can use the Filters sub-aggregation to use free-form searches to create buckets, such as here to look at how much those from NY and CA donated, vs all other states. The syntax is field:value to include it, and -field:value to negate it. You can string these expressions together using AND and OR.

A lot of the analysis generally sits well in the bar chart visualisation, but the line chart has a role to play too. Donations are grouped according to the value range (<10, between 10 and 100, > 100), and these plot out nicely when considering the number of donations made (rather than total value). Whilst the total donation in a time period is significant, so is the engagement with the donors hence the number of donations made is important to analyse:

As well as splitting lines and bars, you can split charts themselves, which works well when you want to start comparing multiple dimensions without cluttering up a single chart. Here’s the same chart as previously but split out with one line per instance. Arguably it’s clearer to understand, and the relative values of the three items can be better seen here than in the clutter of the previous chart:

Following on from this previous graph, I’m interested in the spike in mid-value ($10-$100) donations at the end of 2011. Let’s pull the graph onto a dashboard and dig into it a bit. I’ve saved the visualisation and brought it in with the saved Search (from the Discover page earlier) and an additional visualisation showing payment methods for the donations:

Now I can click and drag the time frame to isolate the data of interest and we see that the number of donations jumps eight-fold at this point:

Clicking on one of the data points drills into it, and we eventually see that the spike was attributable to the use of campaign gift cards, presumably issued with a value > $10 and < $100.


Limitations

The simplicity described in this article comes at a cost, or rather, has its limits. You may well notice fields in the input data such as "_projectid", and if you wanted to relate a donation to a given project you'd need to go and look that project code up manually. There's no (easy) way of doing this in Elasticsearch – whilst you can easily bring in all the project data too and search on projectid, you can't display the two (project and donation) alongside each other (easily). That's because Elasticsearch is a document store, not a relational database. There are some options discussed on the Elasticsearch blog for handling this, none of which to my mind are applicable to this kind of data discovery (but Elasticsearch is used in a variety of applications, not just as a data store for Kibana, so in other cases it is more relevant). Given that, and if you wanted to resolve this relationship, you'd have to go about it a different way, maybe using the linux join command to pre-process the files and denormalise them prior to ingest with logstash. At this point you reach the "right tool/right job" decision – ELK is great, but not for everything :-)

Reprocessing

If you need to reload the data (for example, when building this I reprocessed the file in order to define the numerics as such, rather than the default string), you need to:

  • Drop the Elasticsearch data:
    curl -XDELETE 'http://localhost:9200/opendata'
  • Remove the “sincedb” file that logstash uses to record where it last read from in a file (useful for tailing changing input files; not so for us with a static input file)
    rm ~/.sincedb*

    (better here would be to define a bespoke sincedb path in the file input parameters so we could delete a specific sincedb file without impacting other logstash processing that may be using sincedb in the same path)
  • Rerun the logstash as above

 


What’s New in OBIEE 11.1.1.9 for Systems Administrators and Developers


After over two years since the last major release of OBIEE, Oracle released version 11.1.1.9 in May 2015. You can find the installers here and documentation here. 11.1.1.9 is termed the “terminal release” of the 11g line, and the 12c version is already out in closed-beta. We’d expect to see patchsets for 11g to continue for some time covering bugs and any security issues, but for new functionality in 11g I would hazard a guess that this is pretty much it as Oracle concentrate their development efforts on OBIEE 12c and BICS, particularly Visual Analyser.

For both the end user and backend administrator/developer, OBIEE 11.1.1.9 has brought with it some nice little touches, none of which are going to revolutionise the OBIEE world but many of which are going to make life with the tool just that little bit smoother. In this article we take a look at what 11.1.1.9 brings for the sysadmin & developer.

BI Server Query Instrumentation and Usage Tracking

There are some notable developments here:

  1. Millisecond precision when logging events from the BI Server
  2. Usage Tracking now includes the physical query hash, which is what is also visible in the database, enabling end-to-end tracing
  3. User sessions can be tracked and summarised more precisely because session ID is now included in Usage Tracking.
  4. The execution of initialisation blocks is now also recorded, in a new Usage Tracking table called S_NQ_INITBLOCK.

Millisecond precision in BI Server logs

OBIEE 11.1.1.9 writes the nqquery.log with millisecond precision for both the timestamp of each entry, and also the summary timings for a query execution (at last!). It also calls out explicitly “Total time in BI Server” which is a welcome addition from a time profiling/performance analysis point of view:

[2016-07-31T02:11:48.231-04:00 [...] Sending query to database named X0 - Airlines Demo Dbs (ORCL) (id: <<221516>>), connection pool named Aggr Connection, logical request hash 544131ec, physical request hash 5018e5db: [[  
[2016-07-31T02:12:04.31-04:00 [...] Query Status: Successful Completion  
[2016-07-31T02:12:04.31-04:00 [...] Rows 2, bytes 32 retrieved from database query id: <<221516>>  
[2016-07-31T02:12:04.31-04:00 [...] Physical query response time 2.394 (seconds), id <<221516>>  
[2016-07-31T02:12:04.31-04:00 [...] Physical Query Summary Stats: Number of physical queries 1, Cumulative time 2.394, DB-connect time 0.002 (seconds)  
[2016-07-31T02:12:04.31-04:00 [...] Rows returned to Client 2  
[2016-07-31T02:12:04.31-04:00 [...] Logical Query Summary Stats: Elapsed time 16.564, Total time in BI Server 16.555, Response time 16.564, Compilation time 0.768 (seconds), Logical hash 544131ec

One thing to notice here is the subsecond timestamp precision seems to vary between 2 and 3 digits, which may or may not be a bug.

Being able to see this additional level of precision is really important. Previously OBIEE recorded information by the second, which was fine if you were looking at query executions taking dozens of seconds or minutes – but hopefully our aspirations for systems performance are actually closer to the realms of seconds or subsecond. At this scale the level of precision in the timings really matters. On the assumption that OBIEE was rounding values to the nearest whole number, you’d see “0 seconds” for a Logical SQL compile (for example) that was maybe 0.499 seconds. Per query this is not so significant, but if those queries run frequently then cumulatively that time stacks up and would be useful to be properly aware of and target with optimisation if needed.

Usage Tracking changes

Usage Tracking has five new columns for each logical query recorded in S_NQ_ACCT:

  • ECID
  • TENANT_ID
  • SERVICE_NAME
  • SESSION_ID
  • HASH_ID

The presence of SESSION_ID is very useful, because it means that user behaviour can be more accurately analysed. For example, within a session, how many reports does a user run? What is the median duration of a session? Note that the session here is the session as seen by the BI Server, rather than Presentation Services.
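As a flavour of the sort of analysis this enables, here is a hedged sketch of a query against S_NQ_ACCT (column names other than SESSION_ID are the standard Usage Tracking ones):

-- Queries run per BI Server session, with first and last activity
SELECT session_id,
       COUNT(*)      AS queries_run,
       MIN(start_ts) AS first_query,
       MAX(end_ts)   AS last_query
FROM   biee_biplatform.s_nq_acct
GROUP  BY session_id
ORDER  BY queries_run DESC;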


ECID is also very useful for being able to link data in Usage Tracking back to more detailed entries in nqquery.log. Note that an ECID is multipart, concatenated with an RID, so you won’t necessarily get a direct hit on the ECID you find in Usage Tracking against that in nqquery.log, but rather a substring of it. In this example the root ECID is 11d1def534ea1be0:20f8da5c:14d4441f7e9:–8000–0000000000001891,0:1:103 and the varying component of the relationship (the RID) is 1 and 3 respectively:

Usage Tracking:

select ecid, session_id, start_dt, start_hour_min, saw_src_path from biee_biplatform.s_nq_acct


nqquery.log:

[2015-05-12T08:58:38.704-04:00] [...] [ecid: 11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103:3] [...]  
-------------------- SQL Request, logical request hash:  
3fabea2b  
SET VARIABLE QUERY_SRC_CD='Report',SAW_DASHBOARD='/shared/02. Visualizations/_portal/2.11 Table Designs',SAW_DASHBOARD_PG='Conditional Format',SAW_SRC_PATH='/shared/02. Visualizations/Configured Visuals/Conditional Formats/CF based on a hidden column',PREFERRED_CURRENCY='USD';SELECT^M  
   0 s_0,^M  
[...]

In the above example note how the absence of a timezone in the Usage Tracking data is an impediment to accurate interpretation of the results, compared to nqquery.log which has a fully qualified timezone offset.

Usage Tracking changes – Physical Hash ID

As well as additions to the logical query table, there are two new columns for each physical query logged in S_NQ_DB_ACCT:

  • HASH_ID
  • PHYSICAL_HASH_ID

The implications of this are important – there is now native support in OBIEE for tracing OBIEE workloads directly down to the database (as discussed for OBIEE < 11.1.1.9 here), because the PHYSICAL_HASH_ID is what OBIEE sets as the ACTION field when it connects to the database, and is available in Oracle through AWR, V$ views, and DBMS_MONITOR. For example, in V$SESSION the ACTION field is set to the physical hash:

SQL> select username,program,action 
  from v$session where lower(program) like 'nqs%';

USERNAME PROGRAM                                          ACTION  
-------- ------------------------------------------------ ---------  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         5065e891  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         2b6148b2  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         8802f14e  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         206c8d54  
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         c1c121a7
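
Since the ACTION attribute carries the physical hash, you can also target database-side diagnostics at a single OBIEE query. The following is a sketch only: the service name ('SALES') and module name are placeholders, so substitute the SERVICE_NAME and MODULE values you actually see in V$SESSION for the nqsserver connections.

-- Enable SQL trace for just the OBIEE physical query with hash 5065e891
-- (service and module names below are placeholders)
BEGIN
  DBMS_MONITOR.SERV_MOD_ACT_TRACE_ENABLE(
    service_name => 'SALES',
    module_name  => 'nqsserver@demo.us.oracle.com (TNS V1-V3)',
    action_name  => '5065e891',
    waits        => TRUE,
    binds        => TRUE);
END;
/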

The ACTION is also available in many EM screens such as this one:

Now with OBIEE 11.1.1.9 the physical hash – which was previously only available in the nqquery.log file – is available in S_NQ_DB_ACCT which can in turn be joined to S_NQ_ACCT to find out the logical request related to the physical query seen on the database. Cool huh!

SELECT PHYSICAL_HASH_ID,  
       USER_NAME,  
       SAW_SRC_PATH,  
       SAW_DASHBOARD,  
       SAW_DASHBOARD_PG  
FROM   BIEE_BIPLATFORM.S_NQ_DB_ACCT PHYS  
       INNER JOIN BIEE_BIPLATFORM.S_NQ_ACCT LOGL  
               ON LOGL.ID = PHYS.LOGICAL_QUERY_ID  
WHERE  PHYS.PHYSICAL_HASH_ID = '5065e891'


This can be extended even further to associate AWR workload reports with specific OBIEE requests:
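
For example (a sketch, and one that assumes you are licensed for the Diagnostics Pack), Active Session History can be filtered on the same ACTION value to see the database time and SQL associated with a given OBIEE physical query:

-- Database activity for the OBIEE physical query with hash 5065e891
SELECT SQL_ID,
       EVENT,
       COUNT(*) AS ASH_SAMPLES
FROM   V$ACTIVE_SESSION_HISTORY
WHERE  ACTION = '5065e891'
GROUP  BY SQL_ID, EVENT
ORDER  BY ASH_SAMPLES DESC;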


One little grumble (no pleasing some people…) – it would have been nice if Usage Tracking also stored:

  • Timings at millisecond precision as well
  • The number of bytes (rather than just row count)
  • A proper TIMESTAMP WITH TIME ZONE (rather than the weird triplet of TS/DT/HOUR_MIN)
  • “Total time in BI Server”

Who knows, maybe in 12c?…

Footnote – START_TS in Usage Tracking in 11.1.1.9

As a note for others who may hit this issue, my testing has shown that Usage Tracking in 11.1.1.9 appears to have introduced a bug with START_TS (on both S_NQ_ACCT and S_NQ_DB_ACCT), in that it stores only the date, not date + time as it did in previous versions. For example:

  • 11.1.1.7:
    SELECT TO_CHAR(START_TS, 'YYYY-MM-DD HH24:MI:SS') AS START_TS, 
           TO_CHAR(START_DT, 'YYYY-MM-DD HH24:MI:SS') AS START_DT, 
           START_HOUR_MIN 
    FROM   S_NQ_ACCT 
    WHERE  ROWNUM < 2 
    
    START_TS            START_DT            START_HOUR_MIN   
    ------------------- ------------------- -----  
    2015-03-19 15:32:23 2015-03-19 00:00:00 15:32
  • 11.1.1.9:
    SELECT TO_CHAR(START_TS, 'YYYY-MM-DD HH24:MI:SS') AS START_TS, 
           TO_CHAR(START_DT, 'YYYY-MM-DD HH24:MI:SS') AS START_DT, 
           START_HOUR_MIN 
    FROM   S_NQ_ACCT 
    WHERE  ROWNUM < 2 
    
    START_TS            START_DT            START_HOUR_MIN   
    ------------------- ------------------- -----  
    2015-01-27 00:00:00 2015-01-27 00:00:00 10:41

Initialisation Block information in Usage Tracking

A new table, S_NQ_INITBLOCK, has been added to BIPLATFORM and holds details of when an init block ran, for which user, and importantly, how long it took. From a performance analysis point of view this is really valuable data and it’s good to see it being added to the diagnostic data captured to database with Usage Tracking.
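
As a sketch of the kind of analysis this enables, you could look for the slowest-running init blocks. Note that the column names used here (BLOCK_NAME, USER_NAME, START_TS, END_TS, DURATION) are assumptions for illustration only, so describe S_NQ_INITBLOCK in your own BIPLATFORM schema and adjust accordingly:

-- Slowest init block executions (column names are assumptions - verify locally)
SELECT BLOCK_NAME,
       USER_NAME,
       START_TS,
       END_TS,
       DURATION
FROM   BIEE_BIPLATFORM.S_NQ_INITBLOCK
ORDER  BY DURATION DESC;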

From a glance at the data it looks like there’s a bit of bonus logging going on, with user sign in/sign out also recorded (“SIGNNING ON/SIGNED ON/SIGNED OFF”).


Note that there is no MBean for Init Block Usage Tracking, so regardless of how you configure the rest of Usage Tracking, you need to go to NQSConfig.ini to enable this one.

Presentation Services Cursor Cache

Oracle have added some additional Administration functionality for viewing and managing sessions and the cursor cache in Presentation Services. These let you track and trace user sessions more precisely.

From the Administration Page in OBIEE the new options are:


  1. Set dynamic log level per session from manage sessions
  2. Filter cursor cache based on specific user sessions
  3. Change sort order of cursor cache
  4. Show Presentation Services diagnostics per cursor
  5. Download cursor cache list as CSV

Some of these are somewhat low-level and will not be used day-to-day, but the general move towards a more open diagnostics interface with OBIEE is really positive and I hope we see more of it in 12c… :-)

Command Line Aggregate Advisor

Only for use by those with an Exalytics licence, the Summary Advisor was previously available in the Windows Administration Tool only but can now be run from the command line:

[oracle@demo setup]$ nqaggradvisor -h

Usage:  
    nQAggrAdvisor -d <dataSource> -u <userName> -o <outputFile> -c <tupleInQuotes>  
                  [-p <password>] [-F <factFilter>] [-z <maxSizeAggr>] [-g <gainThreshold>]  
                  [-l <minQueryTime>] [-t <timeoutMinutes>] [-s <startDate>]  
                  [-e <endDate>] [-C <on/off>] [-M <on/off>] [-K <on/off>]

Options:  
    -d      : Data source name  
    -u      : User name  
    -o      : Output aggregate persistence script file name  
    -c      : Aggregate persistence target - tuple in quotes: Fully qualified Connection pool, fully qualified schema name, capacity in MB  
    -p      : Password  
    -F      : Fact filter file name  
    -z      : Max size of any single aggregate in MB  
    -g      : Summary advisor will run until performance improvement for new aggregates drops below this value, default = 1  
    -l      : The minimum amount of query time accumulated per LTS in seconds, before it is included for analysis, default = 0  
    -t      : Max run time in minutes - 0 for unlimited, default = 0  
    -s      : Statistics start date  
    -e      : Statistics end date  
    -C      : Prefer optimizer estimates - on/off, default = off  
    -M      : Only include measures used in queries - on/off, default = off  
    -K      : Use surrogate keys - on/off, default = on

Examples:  
    nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt"  
        -c "DW_Aggr"."Connection Pool","DW_Aggr".."AGGR",1000

    nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt" -F "C:\temp\fact_filter.txt" -g 10  
        -c "TimesTen_instance1"."Connection Pool","dbo",2000 -s "2011-05-02 08:00:00" -e "2011-05-07 18:30:00"  -C on -M on -K off

Note that in the BIPLATFORM schema S_NQ_SUMMARY_STATISTICS is now called S_NQ_SUMMARY_ADVISOR.

HTML5 images

In previous versions of OBIEE graph images were rendered in Flash by default, and as PNG on mobile devices. You could force it to use PNG for all images but would lose the interactivity (tooltips etc). Now in OBIEE 11.1.1.9 you can change the default from Flash to HTML5. This removes the need for a Flash plugin and is generally the way that a lot of visualisations are done on the web nowadays. To my eye there’s no difference in appearance:


To use HTML5 graphs by default, edit instanceconfig.xml and under <Charts> section add:

<DefaultWebImageType>html5</DefaultWebImageType>

Note that html5 is case-sensitive. The config file should look something like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>  
<WebConfig xmlns="oracle.bi.presentation.services/config/v1.1">  
   <ServerInstance>  
   [...]  
        <Views>  
        [...]  
            <Charts>  
                <DefaultWebImageType>html5</DefaultWebImageType>  
            [...]  
            </Charts>  
        [...]  
        </Views>  
    [...]  
   </ServerInstance>  
</WebConfig>

If Presentation Services doesn’t come back up when you restart it after making this change then check the stdout logfile console~coreapplication_obips1~1.log as well as the standard sawlog.log file, both of which you’ll find in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/. The reason to check the console log file is that Presentation Services will refuse to start if the configuration supplied is invalid, and you’ll see an error message stating this here.

NQS ODBC functions

One for the Neos amongst you, a quick call of NQSGetSQLProcedures (as seen in SampleApp dashboard 7.90 NQS ODBC Procedures) and comparison with 11.1.1.7.150120 shows the following new & changed NQS ODBC calls. If this means nothing to you then it probably doesn’t need to, but if you’re interested in exploiting OBIEE functionality from all angles, documented or not, then these might be of interest. It goes without saying, these are entirely undocumented and unsupported, completely liable to change or be removed at any time by Oracle.

  • Added
    • NQSGetUserDefinedFunctions
    • NQSPAFIntegration
    • NQSSearchPresentationObjects
    • NQS_GetAllCacheEntries
    • NQS_GetOverallCacheInfo
    • NQS_GetRepositories
    • NQS_LoadNewBaseRP
    • NQS_LoadNewRPVersion
    • NQS_LockSessionAgainstAutoRPSwitchOver
    • NQS_SetRPDReadOnlyMode
    • NQS_SwitchOverThisSessionToNewRP
    • SAPurgeCacheBySubjectArea
    • SAPurgeCacheEntryByIDVector
    • SAPurgeXSACache
    • SASeedXSACache
  • Modified
    • NQSGetQueryLogExcerpt (additional parameter)
    • SAPurgeInternalCache (additional enum)
  • Removed
    • NQSChangeSelfPassword

Web Services

Web Services are one of the best ways to integrate with OBIEE programmatically. You don’t need to be building heavy Java apps just to use them – you can create and send the necessary SOAP messages from Python or even just send them from bash with curl.

There are 2.5 new WSDLs – two new ones (v9, v10) plus v8 which has changed. The new services are:

  • KPIAssessmentService
  • ScorecardAssessmentService
  • ScorecardMetadataService
  • UserPersonalizationService

You’ll find documentation for the Web Services in the Integrator’s Guide.

User Image Upload

Users can now upload their own images for use in Title views, conditional formats, etc. From an administration point of view this means you’ll want to be keeping an eye on /root/shared/custom/images/ in the Presentation Catalog, either on disk or through the OBIEE Catalog view (switch to Admin and enable “Show Hidden Items”).

QUERY_LIMIT_WARNING_INSTEAD_OF_ERROR

This new setting in NQSConfig.ini will warn users when they’re breaching defined query limits, but it won’t abort the query.

Pointless hacks

If you’re a geek like me, part of the fun of a new tool is simply poking around and seeing what’s new – not necessarily what’s useful. There’s plenty of great new stuff in 11.1.1.9, but let’s take a look under the hood, just Because Geek.

It was John Minkjan who first blogged several years ago about the xsd configuration schema files, and it is from these that we can find all the things that Presentation Services might be able to do – not just what it definitely can do, and not just what Oracle have documented that it can do. I wrote about some of these options a while back, and there are a few new ones in 11.1.1.9.

ALL OF THESE ARE COMPLETELY UNDOCUMENTED AND UNSUPPORTED. DO NOT USE THEM.

  • EnableCloudBIEEHome sets the home page of OBIEE to be as it would be on BI Cloud Service (BICS). This is completely pointless since all the interesting stuff (Load Data, Model, Manage) is non-existent, even if it does give us a clue which application deployments are going to be supplying them (bimodeler and biserviceadministration respectively)

  • GridViews/ShowDataModels outputs a bunch of debug data in Answers Table or Pivot Views:

  • VirusScannerConfiguration – When a user uploads a custom image, this command will be called with it. For example, this simple script writes to a file the time and name of the file passed to it:

    #!/bin/bash
    # Append a separator, the current time, and the path of the uploaded file to a log
    echo '---' >> /tmp/log.txt
    date >> /tmp/log.txt
    echo "$1" >> /tmp/log.txt

    If I save this as /tmp/test-script.sh and add it to instanceconfig.xml:

    <VirusScannerConfiguration>  
       <ScannerInvocationCommandLine>/tmp/test-script.sh</ScannerInvocationCommandLine>  
    </VirusScannerConfiguration>

    When I upload an image I get a row written to my log file. That in itself isn’t useful, but it could be a handy hook maybe from an auditing point of view, or indeed, virus scanning:

    [oracle@demo tmp]$ cat /tmp/log.txt  
    ---  
    Wed May 20 16:01:47 EDT 2015  
    /app/oracle/biee/instances/instance1/tmp/OracleBIPresentationServicesComponent/coreapplication_obips1/defaultpool/sawserver_8673_5553759a_2-1.tmp

Security patches released for OBIEE 11.1.1.7/11.1.1.9, and ODI DQ 11.1.1.3

Oracle issued their quarterly Critical Patch Update yesterday, and with it notice of several security issues of note:

  • The most serious for OBIEE (CVE-2013-2186) rates 7.5 (out of 10) on the CVSS scale, affecting the OBIEE Security Platform on both 11.1.1.7 and 11.1.1.9. The access vector is by the network, there’s no authentication required, and it can partially affect confidentiality, integrity, and availability.
    • The patch for users of OBIEE 11.1.1.7 is to install the latest patchset, 11.1.1.7.150714 (3GB, released – by no coincidence I’m sure – just yesterday too).
    • For OBIEE 11.1.1.9 there is a small patch (64Kb), number 21235195.
  • There’s also an issue affecting BI Mobile on the iPad prior to 11.1.1.7, with a partial impact on integrity.
  • For users of ODI DQ 11.1.1.3 there’s a whole slew of issues, fixed in CPU patch 21418574.
  • Exalytics users who are on ILOM versions earlier than 3.2.6 are also affected by two issues (one of which is 10/10 on the CVSS scale)

The CPU document also notes that it is the final patch date for 10.1.3.4.2. If you are still on 10g, now really is the time to upgrade!

Full details of the issues can be found in the Critical Patch Update document, and information about the patches on My Oracle Support, DocID 2005667.1.

OBIEE BI Server Cache Management Strategies

The OBIEE BI Server cache can be one of the most effective ways of improving response times of OBIEE dashboards. By using data already in the cache it reduces load on the database, the network, and the BI Server.

Should you be using it? I always describe it as the “icing on the cake” – it’s not a fix for a badly-designed OBIEE system, but it does make a lot of sense to use once you’re happy that the foundations for the system are in place. If the foundations are not in place? Then you’re just papering over the cracks and at some point it’s probably going to come back to bite you. As Mark Rittman put it nearly seven years ago, it’s “[…]usually the last desperate throw of the dice”. The phrase “technical debt”? Yeh, that. But BI Server caching used after performance review and optimisation, rather than instead of it – then it’s a Good Thing.

So you’ve decided to use the BI Server cache, and merrily trotted over to Enterprise Manager to enable it, restarted the BI Server, and now your work is done, right? Not quite. Because the BI Server cache will start to store data from all the queries that you run, and use it to satisfy subsequent queries. Not only will it match on a direct hit for the same query, it will use a subset of an existing cache entry where appropriate, and can even aggregate up from what’s in the cache to satisfy a query at a higher level. Clever stuff. But, what happens when you load new data into your data warehouse? Well, the BI Server continues to serve requests out of the cache, because why shouldn’t it? And herein lies the problem with “just turn caching on”. You have to have a cache management strategy.

A cache management strategy sounds grand doesn’t it? But it boils down to two things:

  1. Accuracy – Flush any data from the cache that is now stale
  2. Speed – Prime the cache so that as many queries get a hit on it, first time

Maintaining an Accurate Cache

Every query that is run through the BI Server, whether from a Dashboard, Answers, or more funky routes such as custom ODBC clients or JDBC, will end up in cache. It’s possible to “seed” (“prime”/“warmup”) the cache explicitly, and this is discussed later. The only time you won’t see data in the cache is if (a) you have BI Server caching disabled, or (b) you’ve disabled the Cacheable option for a physical table that is involved in providing the data for the query being run.

You can see metadata for the current contents of the cache in the Administration Tool when connected online to the BI Server, through the Manage -> Cache menu option. This gives you lots of useful information (particularly when you come to optimising cache usage) including the size of each entry, when it was created, when it was last used, and so on.

Purging Options

So we’ve a spread of queries run that hit various dimension and fact tables and created lots of cache entries. Now we’ve loaded data into our underlying database, so we need to make sure that the next time a user runs an OBIEE query that uses the new data they can see it. Otherwise we commit the cardinal sin of any analytical system and show the user incorrect data which is a Bad Thing. It may be fast, but it’s WRONG….

We can purge the whole cache, but that’s a pretty brutal approach. The cache is persisted to disk and can hold lots of data stretching back months – to blitz all of that just because one table has some new data is overkill. A more targeted approach is to purge by physical database, physical table, or even logical query. When would you use these?

  • Purge entire cache – the nuclear option, but also the simplest. If your data model is small and a large proportion of the underlying physical tables may have changed data, then go for this
  • Purge by Physical Database – less brutal than clearing the whole cache; if you have various data sources that are loaded at different points in the batch schedule then targeting a particular physical database makes sense.
  • Purge by Physical Table – if many tables within your database have remained unchanged, whilst a large proportion of particular tables have changed (or it’s a small table) then this is a sensible option to run for each affected table
  • Purge by Query – If you add a few thousand rows to a billion row fact table, purging all references to that table from the cache would be a waste. Imagine you have a table with sales by day. You load new sales figures daily, so purging the cache by query for recent data is obviously necessary, but data from previous weeks and months may well remain untouched so it makes sense to leave queries against those in the cache. The specifics of this choice are down to you and your ETL process and business rules inherent in the data (maybe there shouldn’t be old data loaded, but what happens if there is? See above re. serving wrong data to users). This option is the most complex to maintain because you risk leaving behind in the cache data that may be stale but doesn’t match the precise set of queries that you purge against.

Which one is correct depends on

  1. your data load and how many tables you’ve changed
  2. your level of reliance on the cache (can you afford low cache hit ratio until it warms up again?)
  3. time to reseed new content

If you are heavily dependent on the cache and have large amounts of data in it, you are probably going to need to invest time in a precise and potentially complex cache purge strategy. Conversely, if you use caching as the ‘icing on the cake’ and/or it’s quick to seed new content then the simplest option is to purge the entire cache. Simple is good; OBIEE has enough moving parts without adding to its complexity unnecessarily.

Note that OBIEE itself will perform cache purges in some situations including if a dynamic repository variable used by a Business Model (e.g. in a Logical Column) gets a new value through a scheduled initialisation block.

Performing the Purge

There are several ways in which we can purge the cache. First I’ll discuss the ones that I would not recommend except for manual testing:

  1. Administration Tool -> Manage -> Cache -> Purge. Doing this every time your ETL runs is not a sensible idea unless you enjoy watching paint dry (or need to manually purge it as part of a deployment of a new RPD etc).
  2. In the Physical table, setting Cache persistence time. Why not? Because this time period starts from when the data was loaded into the cache, not when the data was loaded into your database.
    An easy mistake to make would be to think that with a daily ETL run, setting the Cache persistence time to 1 day might be a good idea. It’s not, because if your ETL runs at 06:00 and someone runs a report at 05:00, there is going to be a stale cache entry present for another 23 hours. Even if you use cache seeding, you’re still relinquishing control of the data accuracy in your cache. What happens if the ETL batch overruns or underruns?
    The only scenario in which I would use this option is if I was querying directly against a transactional system and wanted to minimise the number of hits OBIEE made against it – the trade-off being users would deliberately be seeing stale data (but sometimes this is an acceptable compromise, so long as it’s made clear in the presentation of the data).

So the two viable options for cache purging are:

  1. BI Server Cache Purge Procedures
  2. Event Polling Table

BI Server Cache Purge Procedures

These are often called “ODBC” Procedures, but technically ODBC is just one – of several – ways that the commands can be sent to the BI Server to invoke them.

As well as supporting queries for data from clients (such as Presentation Services) sent as Logical SQL, the BI Server also has its own set of procedures. Many of these are internal and mostly undocumented (Christian Berg does a great job of explaining them here, and they do creep into the documentation here and here), but there are some cache management ones that are fully supported and documented. They are:

  • SAPurgeCacheByQuery
  • SAPurgeCacheByTable
  • SAPurgeCacheByDatabase
  • SAPurgeAllCache
  • SAPurgeCacheBySubjectArea (>= 11.1.1.9)
  • SAPurgeCacheEntryByIDVector (>= 11.1.1.9)

The names of these match up to the purge processes that I describe above. The syntax is in the documentation, but what I am interested in here is how you can invoke them. They are my preferred method for managing the BI Server cache because they enable you to tightly couple your data load (ETL) to your cache purge. Setting the cache to purge based on a drop-dead timer (whether crontab, tivoli, Agent/iBot, whatever) gives you a huge margin of error if your ETL runtime does not remain consistent. Whether it organically increases in runtime as data volumes increase, or it fails and has to be fixed and restarted, ETL does not always finish bang-on when it is ‘supposed’ to.
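
To illustrate the call syntax (the database, schema and table names below are placeholders for whatever is defined in your RPD Physical layer), the commands sent to the BI Server look like this:

-- Purge the entire cache
call SAPurgeAllCache();

-- Purge all cache entries that reference one physical table
-- ('My DB', 'My Schema' and 'FACT_SALES' are placeholder RPD Physical layer names)
call SAPurgeCacheByTable('My DB', '', 'My Schema', 'FACT_SALES');

-- Purge all cache entries for one physical database
call SAPurgeCacheByDatabase('My DB');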

You can call these procedures in several ways, including:

  1. nqcmd – one of the most common ways, repeated on many a blog, but requires nqcmd/OBIEE to be installed on the machine running it. nqcmd is a command-line ODBC client for connecting to the BI Server
  2. ODBC – requires BI to be installed on the machine running it in order to make the OBIEE ODBC driver available
  3. JDBC – just requires the OBIEE JDBC driver, which is a single .jar file and thus portable
  4. Web Service – the OBIEE BI Server Web Service can be used to invoke these procedures from any machine with no dependencies other than some WSM configuration on the OBIEE server side.

My preference is for JDBC or Web Service, because they can be called from anywhere. In larger organisations the team building the ETL may have very little to do with OBIEE, and so asking them to install OBIEE components on their server in order to trigger cache purging can be quite an ask. Using JDBC only a single .jar needs copying onto the server, and using the web service not even that:

curl --silent --header "Content-Type: text/xml;charset=UTF-8" \
     --user weblogic:Admin123 \
     --data @purge_cache_soap.xml \
     http://192.168.56.102:7780/AdminService/AdminService

[...]
[59118] Operation SAPurgeAllCache succeeded!
[...]

For details of configuring ODI to use the BI Server JDBC driver in order to tightly couple the cache management into an existing ODI load job, stay tuned for a future blog!

Event Polling Tables (EPT)

NB these are not Event “Pooling” Tables, as I’ve often seen them called

The second viable approach to automated cache purging is EPT, which is a decoupled approach to managing the cache purge, with two components:

  1. An application (your ETL) inserts a row into the table S_NQ_EPT (which is created at installation time by the RCU in the BIPLATFORM schema) with the name of the physical table in which data has been changed
  2. The BI Server polls (hence the name) the S_NQ_EPT table periodically, and if it finds entries in it, purges the cache of data that is from those tables.

So EPT is in a sense the equivalent of using SAPurgeCacheByTable, but in a manner that is not tightly coupled. It relies on configuring the BI Server for EPT, and there is no easy way to know from your ETL if the cache purge has actually happened. It also means that the cache remains stale potentially as long as the polling interval that you’ve configured. Depending on when you’re running your ETL and the usage patterns of your users this may not be an issue, but if you are running ETL whilst users are on the system (for example intra-day micro ETL batches) you could end up with users seeing stale data. Oracle themselves recommend not setting the polling interval any lower than 10 minutes.
EPT has the benefit of being very easy to implement on the ETL side, because it is simply a database table into which the ETL developers need to insert a row for each table that they update during the ETL.
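
As a sketch of what the ETL needs to do, the insert looks something like the following. The column list reflects the standard event polling table structure, but verify it against the S_NQ_EPT table that the RCU created in your environment, and substitute your own RPD Physical layer names for the placeholders:

-- Tell the BI Server that 'My DB'.'My Schema'.FACT_SALES has changed
INSERT INTO BIEE_BIPLATFORM.S_NQ_EPT
  (UPDATE_TYPE, UPDATE_TS, DATABASE_NAME, CATALOG_NAME, SCHEMA_NAME, TABLE_NAME, OTHER_RESERVED)
VALUES
  (1, SYSDATE, 'My DB', NULL, 'My Schema', 'FACT_SALES', NULL);
COMMIT;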

Seeding the Cache

Bob runs an OBIEE dashboard, and the results are added to the cache so that when Bill runs the same dashboard Bill gets a great response rate because his dashboard runs straight from cache. Kinda sucks for Bob though, because his query ran slow as it wasn’t in the cache yet. What’d be nice would be if, for the first user on a dashboard, the results were already in cache. This is known as seeding the cache, or ‘priming’ it. Because the BI Server cache is not dumb and will hit the cache for queries that aren’t necessarily direct replicas of what previously ran, working out the optimal way to seed the cache can take some careful research (and a bit of trial and error). The documentation does a good job of explaining what will and won’t qualify for a cache hit, and it’s worth reading this first.

There are several options for seeding the cache. These all assume you’ve figured out the queries that you want to run in order to load the results into cache.

  1. Run the analysis manually, which will return the analysis data to you and insert it into the BI Server Cache too.
  2. Create an Agent to run the analysis with destination set to Oracle BI Server Cache (For seeding cache), and then either:
    1. Schedule the analysis to run from an Agent on a schedule
    2. Trigger it from a Web Service in order to couple it to your ETL data load / cache purge batch steps.
  3. Use the BI Server Procedure SASeedQuery (which is what the Agent does in the background) to load the given query into cache without returning the data to the client. This is useful for doing over JDBC/ODBC/Web Service (as discussed for purging above). You could just run the Logical SQL itself, but you probably don’t want to pull the actual data back to the client, hence using the procedure call instead.

Sidenote – Checking the RPD for Cacheable Tables

The RPD Query Tool is great for finding objects matching certain criteria. However, it seems to invert results when looking for Cacheable Physical tables – if you add a filter of Cacheable = false you get physical tables where Cacheable is enabled! And the same in reverse (Cacheable = true -> shows Physical tables where Cacheable is disabled)

Day in the Life of an OBIEE Cache Entry (Who Said BI Was Boring?)

In this example here I’m running a very simple report from SampleApp v406:

The Logical SQL for this is:

SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
ORDER BY 1, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY

Why’s that useful to know? Because when working with the cache you need to resubmit queries frequently, and doing so directly from an interface like nqcmd is much faster (for me) than a web GUI. Horses for courses…

So I’ve run the query and now we have a cache entry for it. How do we know? Because we see it in the nqquery.log (and if you don’t have it enabled, go and enable it now):

[2015-09-23T15:58:18.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-42]  
[] [ecid: 00586hFR07mFw000jzwkno0005Qx00007U,0] [tid: 84a35700]  
[requestid: a9730015] [sessionid: a9730000] [username: weblogic]  
--------------------  
Query Result Cache: [59124] The query for user 'weblogic' was inserted into 
the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_57498_0.TBL'.

We see it in Usage Tracking (again, if you don’t have this enabled, go and enable it now):

SELECT TO_CHAR(L.START_TS, 'YYYY-MM-DD HH24:Mi:SS') LOGICAL_START_TS,  
  SAW_SRC_PATH,  
  NUM_CACHE_INSERTED,  
  NUM_CACHE_HITS,  
  NUM_DB_QUERY  
FROM BIEE_BIPLATFORM.S_NQ_ACCT L  
ORDER BY START_TS DESC;

We can also see it in the Administration Tool (when connected online to the BI Server):

We can even see it and touch it (figuratively) on disk:

So we have the data in the cache. The same query run again will now use the cache entry, as seen in nqquery.log:

[2015-09-23T16:09:24.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-21]
[] [ecid: 11d1def534ea1be0:6066a19d:14f636f1dea:-8000-000000000000b948,0:1:1:5]  
[tid: 87455700] 
[requestid: a9730017] [sessionid: a9730000] [username: weblogic]  
--------------------  
Cache Hit on query: [[  
Matching Query: SET VARIABLE QUERY_SRC_CD='Report',SAW_SRC_PATH='/users/weblogic/Cache Test 01',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
ORDER BY 1, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY

Created by:     weblogic

and in Usage Tracking:

“Interestingly” Usage Tracking shows a count of 1 for number of DB queries run, which we would not expect for a cache hit. The nqquery.log shows the same, but no query logged as being sent to the database, so I’m minded to dismiss this as an instrumentation bug.

Now what about if we want to run a query but not use the BI Server Cache? This is an easy one, plenty blogged about it elsewhere – use the Request Variable DISABLE_CACHE_HIT=1. This overrides the built in system session variable of the same name. Here I’m running it directly against the BI Server, prefixed onto my Logical SQL – if you want to run it from within OBIEE you need the Advanced tab in the Answers editor.

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',
DISABLE_CACHE_HIT=1:SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
ORDER BY 1, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY

Now we get a cache ‘miss’, because we’ve specifically told the BI Server to not use the cache. As you’d expect, Usage Tracking shows no cache hit, but it does show a cache insert – because why shouldn’t it?

If you want to run a query without seeding the cache either, you can use DISABLE_CACHE_SEED=1:

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',
DISABLE_CACHE_HIT=1,DISABLE_CACHE_SEED=1:SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
ORDER BY 1, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY

These request variables can be set per analysis, or per user by creating a session initialisation block to assign the required values to the respective variables.

Cache Location

The BI Server cache is held on disk, so it goes without saying that storing it on fast (eg SSD) disk is a Good Idea. There’s no harm in giving it its own filesystem on *nix to isolate it from other work (in terms of filesystems filling up) and to make monitoring it super easy.

Use the DATA_STORAGE_PATHS configuration element in NQSConfig.ini to change the location of the BI Server cache.
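
For example (the path and size here are illustrative only), the entry in the CACHE section of NQSConfig.ini looks something like this:

DATA_STORAGE_PATHS = "/u02/obiee_cache" 500 MB;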

Summary

  1. Use BI Server Caching as the ‘icing on the cake’ for performance in your OBIEE system. Make sure you have your house in order first – don’t use it to try to get around bad design.
  2. Use the SAPurgeCache procedures to directly invoke a purge, or the Event Polling Tables for a more loosely-coupled approach. Decide carefully which purge approach is best for your particular caching strategy.
  3. If using the SAPurgeCache procedures, use JDBC or Web Services to call them so that there is minimal/no installation required to call them from your ETL server.
  4. Invest time in working out an optimal cache seeding strategy, making use of Usage Tracking to track cache hit ratios.
  5. Integrate both purge and seeding into your ETL. Don’t use a schedule-based approach because it will come back to haunt you with its inflexibility and scope for error.

The post OBIEE BI Server Cache Management Strategies appeared first on Rittman Mead Consulting.

Managing the OBIEE BI Server Cache from ODI 12c

I wrote recently about the OBIEE BI Server Cache and how useful it can be, but how important it is to manage it properly, both in the purging of stale data and seeding of new. In this article I want to show how to walk-the-walk and not just talk-the-talk (WAT? But you’re a consultant?!). ODI is the premier data integration tool on the market and one that we are great fans of here at Rittman Mead. We see a great many analytics implementations built with ODI for the data load (ELT, strictly speaking, rather than ETL) and then OBIEE for the analytics on top. Managing the BI Server cache from within your ODI batch makes a huge amount of sense. By purging and reseeding the cache directly after the data has been loaded into the database we can achieve optimal cache usage with no risk of stale data.

There are two options for cleanly hooking into OBIEE from ODI 12c with minimal fuss: JDBC, and Web Services. JDBC requires the OBIEE JDBC driver to be present on the ODI Agent machine, whilst Web Services have zero requirement on the ODI side, but a bit of config on the OBIEE side.

Setting up the BI Server JDBC Driver and Topology

Here I’m going to demonstrate using JDBC to connect to OBIEE from ODI. It’s a principle that was originally written up by Julien Testut here. We take the OBIEE JDBC driver bijdbc.jar from $FMW_HOME/Oracle_BI1/bifoundation/jdbc and copy it to our ODI machine. I’m just using a local agent for my testing, so put it in ~/.odi/oracledi/userlib/. For a standalone agent it should go in $AGENT_HOME/odi/agent/lib.

[oracle@ip-10-103-196-207 ~]$ cd /home/oracle/.odi/oracledi/userlib/  
[oracle@ip-10-103-196-207 userlib]$ ls -l  
total 200  
-rw-r----- 1 oracle oinstall    332 Feb 17  2014 additional_path.txt  
-rwxr-xr-x 1 oracle oinstall 199941 Sep 22 14:50 bijdbc.jar

Now fire up ODI Studio, sign in to your repository, and head to the Topology pane. Under Physical Architecture -> Technologies you’ll see Oracle BI.

Right click and select New Data Server. Give it a sensible name and put your standard OBIEE credentials (eg. weblogic) under the Connection section. Click the JDBC tab and click the search icon to the right of the JDBC Driver text box. Select the default, oracle.bi.jdbc.AnaJdbcDriver, and then in the JDBC Url box put your server and port (9703, unless you’ve changed the listen port of OBIEE BI Server)
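
For reference, the BI Server JDBC URL typically takes the form shown below; the hostname is a placeholder, and it is worth double-checking the exact format against the driver documentation for your version:

jdbc:oraclebi://my-obiee-server.foo.com:9703/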

Now click Test Connection (save the data server when prompted, and click OK at the message about creating a physical schema), and select the Local Agent with which to run it. If you get an error then click Details to find out the problem.

One common problem can be the connection through to the OBIEE server port, so to cut ODI out of the equation try this from the command prompt on your ODI machine (assuming it’s *nix):

nc -vz my-obiee-server.foo.com 9703

If the host resolves correctly and the port is open then you should get:

Connection to my-obiee-server.foo.com 9703 port [tcp/*] succeeded!

If not you’ll get something like:

nc: my-obiee-server.foo.com port 9703 (tcp) failed: Connection refused

Check the usual suspects – firewall (eg iptables) on the OBIEE server, firewalls on the network between the ODI and OBIEE servers, etc.

Assuming you’ve got a working connection you now need to create a Physical Schema. Right click on the new data server and select New Physical Schema.

OBIEE’s BI Server acts as a “database” to clients, within which there are “schemas” (Subject Areas) and “tables” (Presentation Tables). On the New Physical Schema dialog you just need to set Catalog (Catalog), and when you click the drop-down you should see a list of the Subject Areas within your RPD. Pick one – it doesn’t matter which.

Save the physical schema (ignore the context message). At this point your Physical Architecture for Oracle BI should look like this:

Now under Logical Architecture locate the Oracle BI technology, right click on it and select New Logical Schema. From the Physical Schemas dropdown select the one that you’ve just created. Give a name to the Logical Schema.

Your Logical Architecture for Oracle BI should look like this:

Building the Cache Management Routine

Full Cache Purge

Over in the Designer tab go to your ODI project into which you want to integrate the OBIEE cache management functions. Right click on Procedures and select Create New Procedure. Give it a name such as OBIEE Cache – Purge All and set the Target Technology to Oracle BI

Switch to the Tasks tab and add a new Task. Give it a name, and set the Schema to the logical schema that you defined above. Under Target Command enter the call you want to make to the BI Server, which in this case is

call SAPurgeAllCache();

Save the procedure and then from the toolbar menu click on Run. Over in the Operator tab you should see the session appear and soon after complete – all being well – successfully.

You can go and check your BI Server Cache from the OBIEE Administration Tool to confirm that it is now empty:

And confirm it through Usage Tracking:

From what I can see at the default log levels, nothing gets written to either nqquery.log or nqserver.log for this action unless there is an error in your syntax, in which case it is logged in nqserver.log:

(For more information on that particular error see here)

Partial Cache Purge

This is the same pattern as above – create an ODI Procedure to call the relevant OBIEE command, which for purging by table is SAPurgeCacheByTable. We’re going to get a step more fancy now, and add a variable that we can pass in so that the Procedure is reusable multiple times over throughout the ODI execution for different tables.

First off create a new ODI Variable that will hold the name of the table to purge. If you’re working with multiple RPD Physical Database/Catalog/Schema objects you’ll want variables for those too:

Now create a Procedure as before, with the same settings as above but a different Target Command, based on SAPurgeCacheByTable and passing in the four parameters as single quoted, comma separated values. Note that these are the Database/Catalog/Schema/Table as defined in the RPD. So “Database” is not your TNS or anything like that, it’s whatever it’s called in the RPD Physical layer. Same for the other three identifiers. If there’s no Catalog (and often there isn’t) just leave it blank.

When including ODI Variable(s) make sure you still single-quote them. The command should look something like this:
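
As a sketch (the physical database and schema names, and the ODI project/variable name #OBIEE_CACHE.TABLE_NAME, are placeholders for whatever exists in your RPD and ODI project):

call SAPurgeCacheByTable('My DB', '', 'My Schema', '#OBIEE_CACHE.TABLE_NAME');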

Now let’s seed the OBIEE cache with a couple of queries, one of which uses the physical table and one of which doesn’t. When we run our ODI Procedure we should see one cache entry go and the other remain. Here’s the seeded cache:

And now after executing the procedure:

And confirmation through Usage Tracking of the command run:

Cache Seeding

As before, we use an ODI Procedure to call the relevant OBIEE command. To seed the cache we can use SASeedQuery, which strictly speaking isn’t documented, but a quick perusal of the nqquery.log when you run a cache-seeding OBIEE Agent shows that it is what is called in the background, so we’re going to use it here (and it’s mentioned in support documents on My Oracle Support, so it’s not a state secret). The documentation here gives some useful advice on what you should be seeding the cache with — not necessarily only exact copies of the dashboard queries that you want to get a cache hit for.

Since this is a cookie-cutter of what we just did previously you can use the Duplicate Selection option in ODI Designer to clone one of the other OBIEE Cache procedures that you’ve already created. Amend the Target Command to:
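
As a sketch of what the Target Command might look like; the exact argument form for SASeedQuery isn't formally documented, so check the call captured in your own nqquery.log when an Agent seeds the cache, and substitute the Logical SQL that you actually want to seed:

call SASeedQuery('SELECT 0 s_0, "A - Sample Sales"."Time"."T02 Per Name Month" s_1, "A - Sample Sales"."Base Facts"."1- Revenue" s_2 FROM "A - Sample Sales"');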

When you run this you should see a positive confirmation in the nqserver.log of the cache seed:

[2015-09-23T23:23:10.000+01:00] [OracleBIServerComponent] [TRACE:3]  
[USER-42] [] [ecid: 005874imI9nFw000jzwkno0007q700008K,0] [tid: 9057d700]  
[requestid: 477a0002] [sessionid: 477a0000] [username: weblogic]  
--------------------  
Query Result Cache: [59124]  
The query for user 'weblogic' was inserted into the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_84190_2.TBL'. [[

A very valid alternative to calling SASeedQuery would be to call the OBIEE SOA Web Service to trigger an OBIEE Agent that populated the cache (by setting ‘Destination’ to ‘Oracle BI Server Cache (For seeding cache)’). OBIEE Agents can also be ‘daisy chained’ so that one Agent calls another on completion, meaning that ODI could kick off a single ‘master’ OBIEE Agent which then triggered multiple ‘secondary’ OBIEE Agents. The advantage of this approach over SASeedQuery is that cache seeding is more likely to change as OBIEE usage patterns do, and it is easier for OBIEE developers to maintain all the cache seeding code within ‘their’ area (OBIEE Presentation Catalog) than put in a change request to the ODI developers each time to change a procedure.

Integrating it in the ODI batch

You’ve two options here, using Packages or Load Plans. Load Plans were introduced in ODI 11.1.1.5 and are a clearer and more flexible way of orchestrating the batch.

To use this in a Load Plan, create a serial step that calls a mapping followed by the procedure to purge the affected table. In the procedure step of the Load Plan, set the value for the variable. At the end of the Load Plan, call the OBIEE cache seed step:

Alternatively, to integrate the above procedures into a Package instead of a load plan you need to add two steps per mapping. First, the variable is updated to hold the name of the table just loaded, and then the OBIEE cache is purged for the affected table. At the end of the flow a call is made to reseed the cache:

These are some very simple examples, but hopefully illustrate the concept and the powerful nature of integrating OBIEE calls directly from ODI. For more information about OBIEE Cache Management, see my post here.

The post Managing the OBIEE BI Server Cache from ODI 12c appeared first on Rittman Mead Consulting.

Using the BI Server Metadata Web Service for Automated RPD Modifications

A little-known new feature of OBIEE 11g is a web service interface to the BI Server. Called the “BI Server Metadata Web Service” it gives a route into making calls into the BI Server using SOAP-based web services calls. Why is this useful? Because it means you can make any call to the BI Server (such as SAPurgeAllCache) from any machine without needing to install any OBIEE-related artefacts such as nqcmd, JDBC drivers, etc. The IT world has been evolving over the past decade or more towards a more service-based architecture (remember the fuss about SOA?) where we can piece together the functionality we need rather than having one monolithic black box trying to do everything. Being able to make use of this approach in our BI deployments is a really good thing. We can do simple things like BI Server cache management using a web service call, but we can also do more funky things, such as actually updating repository variable values in real time – and we can do it from within our ETL jobs, as part of an automated deployment script, and so on.

Calling the BI Server Metadata Web Service

First off, let’s get the web service configured and working. The documentation for the BI Server Metadata Web Service can be found here, and it’s important to read it if you’re planning on using this. What I describe here is the basic way to get it up and running.

Configuring Security

We need to configure the security policy on the web service to define what kind of authentication it requires. If you don’t do this, you won’t be able to make calls to it. Setting up the security is a case of attaching a security policy in WebLogic Server to the web service (called AdminService) itself. I’ve used oracle/wss_http_token_service_policy, which means that the credentials can be passed through using standard HTTP Basic authentication.

You can do this through Enterprise Manager:

Or you can do it through WLST using the attachWebServicePolicy call.

You also need to configure WSM, adding some entries to the credential store as detailed here.

Testing the Web Service

The great thing about web services is that they can be used from pretty much anywhere. Many software languages will have libraries for making SOAP calls, and if they don’t, it’s just HTTP under the covers so you can brew your own. For my testing I’m using the free version of SoapUI. In anger, I’d use something like curl for a standalone script or hook it up to ODI for integration in the batch.

Let’s fire up SoapUI and create a new project. Web services provide a Web Service Definition Language (WSDL) that describes what and how they work, which is pretty handy. We can pass this WSDL to SoapUI for it to automatically build some sample requests for us. The WSDL for this web service is

http://<biserver>:<port>/AdminService/AdminService?WSDL

Where biserver is your biserver (duh) and port is the managed server port (usually 9704, or 7780).

We’ve now got a short but sweet list of the methods we can invoke:

Expand out callProcedureWithResults and double click on Request 1. This is a template SOAP message that SoapUI has created for you, based on the WSDL.

Edit the XML to remove the parameters section, and – just to test things out – put GetOBISVersion() in procedureName. Your XML request should look like this:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ws="http://ws.admin.obiee.oracle/">  
   <soapenv:Header/>  
   <soapenv:Body>  
      <ws:callProcedureWithResults>  
         <procedureName>GetOBISVersion()</procedureName>  
      </ws:callProcedureWithResults>  
   </soapenv:Body>  
</soapenv:Envelope>

If you try and run this now (green arrow or Cmd-Enter on the Mac) you’ll see the exact same XML appear in the right pane, which is strange… but click on Raw and you’ll see what the problem is:

Remember the security stuff we set up on the server previously? Well as the client we now need to keep our side of the bargain, and pass across our authentication with the SOAP call. Under the covers this is a case of sending HTTP Basic auth (since we’re using oracle/wss_http_token_service_policy), and how you do this depends on how you’re making the call to the web service. In SoapUI you click on the Auth button at the bottom of the screen, Add New Authentication, Type: Basic, and then put your OBIEE server username/password in.

Now you can click submit on the request again, and you should see details of your BI Server version in the response pane (on the right) in the XML view (you may have to scroll to the right to see it).

This simple test is just about validating the end-to-end calling of a BI Server procedure from a web service. Now we can get funky with it….

Updating Repository Variables Programmatically

Repository variables in OBIEE are “global”, in that every user session sees the same value. They’re often used for holding things like “what was the last close of business date”, or “when was the data last updated in the data warehouse”. Traditionally these have been defined as dynamic repository variables with an initialisation block that polled a database query on a predefined schedule to get the current value. This meant that to have a variable showing when your data was last loaded you’d need to (a) get your ETL to update a row in a database table with a timestamp and then (b) write an init block to poll that table to get the value. That polling of the table would have to be as frequent as you needed in order to show the correct value, so maybe every minute. It’s kinda messy, but it’s all we had. Here I’d like to show an alternative approach.

Let’s say we have a repository variable, LAST_DW_REFRESH. It’s a timestamp that we use in report predicates and titles so that when they are run we can show the correct data based on when the data was last loaded into the warehouse. Here’s a rather over-simplified example:

The Title view uses the code:

Data correct as of: @{biServer.variables['LAST_DW_REFRESH']}

Note that we’re referencing the variable here. We could also put it in the Filter clause of the analysis. Somewhat tenuously, let’s imagine we have a near-realtime system in which we’re federating across a DW and direct from OLTP, and we just want to show data from the last load into the DW:

For the purpose of this example, the report is less important than the diagnostics it gives us when run, in nqquery.log. First we see the inbound logical request:

-------------------- SQL Request, logical request hash:  
45f9492a  
SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T05 Per Name Year" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
WHERE  
("Time"."T00 Calendar Date" <  VALUEOF("LAST_DW_REFRESH"))  
ORDER BY 1, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY

Note the VALUEOF clause. When this is parsed out we get to see the actual value of the repository variable that OBIEE is going to execute the query with:

-------------------- Logical Request (before navigation): [[
RqList  
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
DetailFilter:  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-24 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

We can see the value through the Administration Tool Manage Sessions page too, but it’s less convenient for tracking in testing:

If we update the RPD online with the Administration Tool (nothing fancy at this stage) to change the value of this static repository variable:


And then rerun the report, we can see the value has changed:

-------------------- Logical Request (before navigation): [[
RqList  
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
DetailFilter:  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-25 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Now let’s do this programmatically. First off, the easy stuff, that’s been written about plenty before. Using biserverxmlgen we can create an XUDML version of the repository. Searching through this we can pull out the variable definition:

<Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
<Expr><![CDATA[TIMESTAMP '2015-09-25 00:30:00']]></Expr>  
</Variable>

and then wrap it in the correct XML structure and update the timestamp we want to use:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-26 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>

(you can also get this automagically by modifying the variable in the RPD, saving the RPD file, and then comparing it to the previous copy to generate a patch file with the Administration Tool or comparerpd)

Save this XML snippet, with the updated timestamp value, as last_dw_refresh.xml. Now to update the value on the BI Server in-flight, first using the OBIEE tool biserverxmlcli. This is available on all OBIEE servers and client installations – we’ll get to web services for making this update call remotely in a moment.

biserverxmlcli -D AnalyticsWeb -R Admin123 -U weblogic -P Admin123 -I last_dw_refresh.xml

Here -D is the BI Server DSN, -U / -P are the username/password credentials for the server, and -R is the RPD password.

Running the analysis again shows that it is now working with the new value of the variable:

-------------------- Logical Request (before navigation): [[
RqList  
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
DetailFilter:  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-26 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Getting Funky – Updating RPD from Web Service

Let’s now bring these two things together – RPD updates (in this case to update a variable value, but could be anything), and BI Server calls.

In the above web service example I called the very simple GetOBISVersion. Now we’re going to use the slightly more complex NQSModifyMetadata. This is actually documented, and what we’re going to do is pass across the same XUDML that we sent to biserverxmlcli above, but through the web service. As a side note, you could also do this over JDBC if you wanted (under the covers, AdminService is just a web app with a JDBC connector to the BI Server).

I’m going to do this in SoapUI here for clarity but I actually used SUDS to prototype it and figure out the exact usage.

So as a quick recap, this is what we’re going to do:

  1. Update the value of a repository variable, using XUDML. We generated this XUDML from biserverxmlgen and wrapped it in the correct XML structure.
    We could also have got it through comparerpd or the Administration Tool ‘create patch’ function
  2. Use the BI Server’s NQSModifyMetadata call to push the XUDML to the BI Server online.
    We saw that biserverxmlcli can also be used as an alternative to this for making online updates through XUDML.
  3. Use the BI Server Metadata Web Service (AdminService) to invoke the NQSModifyMetadata call on the BI Server, using the callProcedureWithResults method

In SoapUI create a new request:

Set up the authentication:

Edit the XML SOAP message to remove parameters and specify the basic BI Server call:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ws="http://ws.admin.obiee.oracle/">  
   <soapenv:Header/>  
   <soapenv:Body>  
      <ws:callProcedureWithResults>  
         <procedureName>NQSModifyMetadata()</procedureName>  
      </ws:callProcedureWithResults>  
   </soapenv:Body>  
</soapenv:Envelope>

Now the tricky bit – we need to cram the XUDML into our XML SOAP message. But XUDML is its own kind of XML, and all sorts of grim things happen here if we’re not careful (because the XUDML gets swallowed up into the SOAP message, it all being XML). The solution I came up with (which may not be optimal…) is to encode all of the HTML entities, which a tool like this does (and which happens automagically if you’re using a client library such as SUDS). So our XUDML, with another new timestamp for testing:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-21 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>

becomes:

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 00:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;

We’re not quite finished yet. Because this is actually a call (NQSModifyMetadata) nested in a call (callProcedureWithResults) we need to make sure NQSModifyMetadata gets the arguments (the XUDML chunk) through intact, so we wrap it in single quotes – which also need encoding (&apos;):

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 23:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;

and then for final good measure, the single quotes around the timestamp need double-single quoting:

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;

Nice, huh? The WSDL suggests that parameters for these calls should be able to be placed within the XML message as additional entities, which might allow for proper encoding, but I couldn’t get it to work (I kept getting java.sql.SQLException: Parameter 1 is not bound).
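
If you’d rather not do the escaping by hand (or with an online tool), the whole encode-and-quote dance can be scripted. The following is just an illustrative Python sketch, assuming the XUDML from earlier is saved as last_dw_refresh.xml; it reproduces the escaped, quote-doubled string shown above:

# Illustrative sketch: build the escaped NQSModifyMetadata payload for a
# hand-crafted SOAP message (e.g. to paste into SoapUI)
from xml.sax.saxutils import escape

with open('last_dw_refresh.xml') as f:
    # collapse to a single line – line breaks can break things
    xudml = ' '.join(f.read().split())

# encode &, <, > and " as entities (the inner single quotes stay as-is)
encoded = escape(xudml, {'"': '&quot;'})

# double the single quotes around the timestamp, then wrap the whole
# argument in &apos; for the procedure call
payload = '&apos;' + encoded.replace("'", "''") + '&apos;'

print('NQSModifyMetadata(%s)' % payload)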

So, stick this mess of encoding plus twiddles into your SOAP message and it should look like this: (watch out for line breaks; these can break things)

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ws="http://ws.admin.obiee.oracle/">  
   <soapenv:Header/>  
   <soapenv:Body>  
      <ws:callProcedureWithResults>  
         <procedureName>NQSModifyMetadata(&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;)</procedureName>  
      </ws:callProcedureWithResults>  
   </soapenv:Body>  
</soapenv:Envelope>

Hit run, and with a bit of luck you’ll get a “nothing to report” response:

If you look in nqquery.log you’ll see:

[...] NQSModifyMetadata started.  
[...] NQSModifyMetadata finished successfully.

and all-importantly when you run your report, the updated variable will be used:

-------------------- Logical Request (before navigation): [[
RqList  
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
DetailFilter:  
D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-21 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

If this doesn’t work … well, best of luck. Use nqquery.log, bi_server1.log (where the AdminService writes some diagnostics) to try and trace the issue. Also test calling NQSModifyMetadata from JDBC (or ODBC) directly, and then add in the additional layer of the web service call.
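
As a hedged illustration of that last point, something like the following (pyodbc here, but any ODBC/JDBC client would do) lets you exercise NQSModifyMetadata directly against the BI Server. It assumes an ODBC DSN named AnalyticsWeb is already configured on the client, and the call syntax mirrors the procedure call used in the SOAP message above – treat it as a sketch rather than gospel:

# Sketch only – assumes an ODBC DSN 'AnalyticsWeb' pointing at the BI Server
import pyodbc

conn = pyodbc.connect('DSN=AnalyticsWeb;UID=weblogic;PWD=Admin123', autocommit=True)
cursor = conn.cursor()

with open('last_dw_refresh.xml') as f:
    # single-line the XUDML and double up its single quotes for the string literal
    xudml = ' '.join(f.read().split()).replace("'", "''")

cursor.execute("call NQSModifyMetadata('%s')" % xudml)
conn.close()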

So, in the words of Mr Rittman, there you have it. Programmatically updating the value of repository variables, or anything else, done online, and for bonus points done through a web service call, making it possible to use without any local OBIEE client/server tools.

The post Using the BI Server Metadata Web Service for Automated RPD Modifications appeared first on Rittman Mead Consulting.

Forays into Kafka – Logstash transport / centralisation

The holy trinity of Elasticsearch, Logstash, and Kibana (ELK) are a powerful trio of tools for data discovery and systems diagnostics. In a nutshell, they enable you to easily search through your log files, slice & dice them visually, drill into problem timeframes, and generally be the boss of knowing where your application’s at.

Getting application logs into ELK in the most basic configuration means doing the processing with Logstash local to the application server, but this has two overheads – the CPU required to do the processing, and (assuming you have more than one application server) the management of multiple configurations and deployments across your servers. A more flexible and maintainable architecture is to ship logs from the application server to a separate ELK server with something like Logstash-forwarder (aka Lumberjack), and do all your heavy ELK-lifting there.

In this article I’m going to demonstrate an alternative way of shipping and centralising your logs for Logstash processing, with Apache Kafka.

Kafka is a “publish-subscribe messaging rethought as a distributed commit log”. What does that mean in plainer English? My over-simplified description would be that it is a tool that:

  1. Enables one or more components, local or across many machines, to send messages (of any format) to …
  2. …a centralised store, which may be holding messages from other applications too…
  3. …from where one or more consumers can independently opt to pull these messages in exact order, either as they arrive, batched, or ‘rewinding’ to a previous point in time on demand.

Kafka has been designed from the outset to be distributed and fault-tolerant, and for very high performance (low latency) too. For a good introduction to Kafka and its concepts, the introduction section of the documentation is a good place to start, as is Gwen Shapira’s Kafka for DBAs presentation.
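
To make the produce/consume model above a bit more concrete, here’s a minimal sketch using the kafka-python client (an assumption on my part – it isn’t part of the Kafka distribution, so pip install kafka-python, and you may need to pin a release compatible with your broker version). The host and topic names match the ones used later in this article:

# Minimal produce/consume sketch with kafka-python (illustrative only)
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers='ubuntu-02:9092')
producer.send('logstash', b'hello from a producer')
producer.flush()

# An independent consumer, reading the topic from the earliest retained message
consumer = KafkaConsumer('logstash',
                         bootstrap_servers='ubuntu-02:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.offset, message.value)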

If you’re interested in reading more about Kafka, the article that really caught my imagination with its possibilities was by Martin Kleppmann in which he likens (broadly) Kafka to the unix Pipe concept, being the joiner between components that never had to be designed to talk to each other specifically.

Kafka gets a lot of press in the context of “Big Data”, Spark, and the like, but it also makes a lot of sense as a “pipe” between slightly more ‘mundane’ systems such as Logstash…

Overview

In this article we’re using Kafka at its very simplest – one Producer, one Topic, one Consumer. But hey, if it works and it is a good use of the technology, who cares if it’s not a gazillion-message-per-second throughput to give us bragging rights on Hacker News?

We’re going to run Logstash twice; once on the application server to simply get the logfiles out and in to Kafka, and then again to pull the data from Kafka and process it at our leisure:

Once Logstash has processed the data we’ll load it into Elasticsearch, from where we can do some nice analysis against it in Kibana.

Build

This article was written based on three servers:

  1. Application server (OBIEE)
  2. Kafka server
  3. ELK server

In practice, Kafka could run on the ELK server if you needed it to and throughput was low. If things got busier, splitting them out would make sense as would scaling out Kafka and ELK across multiple nodes each for capacity and resilience. Both Kafka and Elasticsearch are designed to be run distributed and are easy to do so.

The steps below show how to get the required software installed and running.

Networking and Host Names

Make sure that each host has a proper hostname (not ‘demo’) that can be resolved from all the other hosts being used. Liberally hardcoding IP/hostname mappings in /etc/hosts and copying the file to each host is one way around this in a sandbox environment. In the real world, use DNS CNAMEs to resolve the static IP of each host.

Make sure that the hostname is accessible from all other machines in use. That is, if you type hostname on one machine:

rmoff@ubuntu-03:~$ hostname  
ubuntu-03

Make sure that you can ping it from another machine:

rmoff@ubuntu-02:/opt$ ping ubuntu-03  
PING ubuntu-03 (192.168.56.203) 56(84) bytes of data.  
64 bytes from ubuntu-03 (192.168.56.203): icmp_seq=1 ttl=64 time=0.530 ms  
64 bytes from ubuntu-03 (192.168.56.203): icmp_seq=2 ttl=64 time=0.287 ms  
[...]

and use netcat to hit a particular port (assuming that something’s listening on that port):

rmoff@ubuntu-02:/opt$ nc -vz ubuntu-03 9200  
Connection to ubuntu-03 9200 port [tcp/*] succeeded!

Application Server – log source (“sampleappv406”)

This is going to be the machine from which we’re collecting logs. In my example it’s OBIEE that’s generating the logs, but it could be any application. All we need to install is Logstash, which is going to ship the logs – unprocessed – over to Kafka. Because we’re working with Kafka, it’s also useful to have the console scripts (that ship with the Kafka distribution) available as well, but strictly speaking, we don’t need to install Kafka on this machine.

  • Download: Kafka is optional, but it’s useful to have the console scripts there for testing
    wget https://download.elastic.co/logstash/logstash/logstash-1.5.4.zip  
    wget http://apache.mirror.anlx.net/kafka/0.8.2.0/kafka_2.10-0.8.2.0.tgz
  • Install
    unzip logstash*.zip  
    tar -xf kafka*
    
    sudo mv kafka* /opt  
    sudo mv logstash* /opt

Kafka host (“ubuntu-02”)

This is our kafka server, where Zookeeper and Kafka run. Messages are stored here before being passed to the consumer.

  • Download
    wget http://apache.mirror.anlx.net/kafka/0.8.2.0/kafka_2.10-0.8.2.0.tgz
  • Install
    tar -xf kafka*  
    sudo mv kafka* /opt
  • Configure: If there’s any funny business with your networking, such as a hostname on your kafka server that won’t resolve externally, make sure you set the advertised.host.name value in /opt/kafka*/config/server.properties to a hostname/IP for the kafka server that can be connected to externally.
  • Run: Use separate sessions, or even better, screen, to run both of these concurrently:
    • Zookeeper
      cd /opt/kafka*  
      bin/zookeeper-server-start.sh config/zookeeper.properties
    • Kafka Server
      cd /opt/kafka*  
      bin/kafka-server-start.sh config/server.properties

ELK host (“ubuntu-03”)

All the logs from the application server (“sampleappv406” in our example) are destined for here. We’ll do post-processing on them in Logstash to extract lots of lovely data fields, store it in Elasticsearch, and produce some funky interactive dashboards with Kibana. If, for some bizarre reason, you didn’t want to use Elasticsearch and Kibana but had some other target for your logs after Logstash had parsed them, you could use one of the many other output plugins for Logstash.

  • Download: Kafka is optional, but it’s useful to have the console scripts there for testing
    wget https://download.elastic.co/kibana/kibana/kibana-4.1.2-linux-x64.tar.gz  
    wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.2.zip  
    wget https://download.elastic.co/logstash/logstash/logstash-1.5.4.zip  
    wget http://apache.mirror.anlx.net/kafka/0.8.2.0/kafka_2.10-0.8.2.0.tgz
  • Install
    tar -xf kibana*  
    unzip elastic*.zip  
    unzip logstash*.zip  
    tar -xf kafka*
    
    sudo mv kafka* /opt  
    sudo mv kibana* /opt  
    sudo mv elastic* /opt  
    sudo mv logstash* /opt
    
    # Kopf is an optional, but very useful, Elasticsearch admin web GUI  
    /opt/elastic*/bin/plugin --install lmenezes/elasticsearch-kopf
  • Run: Use separate sessions, or even better, screen, to run both of these concurrently:
    /opt/elastic*/bin/elasticsearch  
    /opt/kibana*/bin/kibana

Configuring Kafka

Create the topic. This can be run from any machine with the kafka console tools available. The important thing is that you specify --zookeeper correctly, pointing at the Zookeeper instance that the Kafka cluster is registered with, so that the script knows where to find Kafka.

cd /opt/kafka*  
bin/kafka-topics.sh --create --zookeeper ubuntu-02:2181 --replication-factor 1 --partitions 1 --topic logstash

Smoke test

  1. Having created the topic, check that the other nodes can connect to zookeeper and see it. The point is less about viewing the topic than about checking that the connectivity between the machines is working.
    $ cd /opt/kafka*  
    $ ./bin/kafka-topics.sh --list --zookeeper ubuntu-02:2181  
    logstash

    If you get an error then check that the host resolves and the port is accessible:
    $ nc -vz ubuntu-02 2181
    
    found 0 associations  
    found 1 connections:  
          1: flags=82<CONNECTED,PREFERRED>  
         outif vboxnet0  
         src 192.168.56.1 port 63919  
         dst 192.168.56.202 port 2181  
         rank info not available  
         TCP aux info available
    
    Connection to ubuntu-02 port 2181 [tcp/eforward] succeeded!
  2. Set up a simple producer / consumer test
    1. On the application server node, run a script that will be the producer, sending anything you type to the kafka server:
      cd /opt/kafka*  
      ./bin/kafka-console-producer.sh   
      --broker-list ubuntu-02:9092   
      --topic logstash

      (I always get the warning WARN Property topic is not valid (kafka.utils.VerifiableProperties); it seems to be harmless so ignore it…)

      This will sit waiting for input; you won’t get the command prompt back.

    2. On the ELK node, run a script that will be the consumer:
      cd /opt/kafka*  
      ./bin/kafka-console-consumer.sh   
      --zookeeper ubuntu-02:2181   
      --topic logstash

    3. Now go back to the application server node and enter some text and press enter. You should see the same appear shortly afterwards on the ELK node. This is demonstrating Producer -> Kafka -> Consumer
    4. Optionally, run kafka-console-consumer.sh on a second machine (either the kafka host itself, or on a Mac where you’ve run brew install kafka). Now when you enter something on the Producer, you see both Consumers receive it

If the two above tests work, then you’re good to go. If not, then you’ve got to sort this out first because the later stuff sure isn’t going to.

Configuring Logstash on the Application Server (Kafka Producer)

Logstash has a very simple role on the application server – to track the log files that we want to collect, and pass new content in the log file straight across to Kafka. We’re not doing any fancy parsing of the files this side – we want to be as light-touch as possible. This means that our Logstash configuration is dead simple:

input {  
    file {  
        path =>  ["/app/oracle/biee/instances/instance1/diagnostics/logs/*/*/*.log"]  
    }
}

output {  
    kafka {  
        broker_list => 'ubuntu-02:9092'  
        topic_id => 'logstash'  
    }
}

Notice the wildcards in the path variable – in this example we’re going to pick up everything related to the OBIEE system components here, so in practice you may want to restrict it down a bit at least during development. You can specify multiple path patterns by comma-separating them within the square brackets, and you can use the exclude parameter to (…drum roll…) exclude specific paths from the wildcard match.

If you now run Logstash with the above configuration (assuming it’s saved as logstash-obi-kafka-producer.conf)

/opt/logstash*/bin/logstash -f logstash-obi-kafka-producer.conf

Logstash will now sit and monitor the file paths that you’ve given it. If they don’t exist, it will keep checking. If they did exist, and got deleted and recreated, or truncated – it’ll still pick up the differences. It’s a whole bunch smarter than your average bear^H^H^H^H tail -f.

If you happen to have left your Kafka console consumer running you might be in for a bit of a shock, depending on how much activity there is on your application server:

Talk about opening the floodgates!

Configuring Logstash on the ELK server (Kafka Consumer)

Let’s give all these lovely log messages somewhere to head. We’re going to use Logstash again, but on the ELK server this time, and with the Kafka input plugin:

input {  
    kafka {  
            zk_connect => 'ubuntu-02:2181'  
            topic_id => 'logstash'  
    }
}

output {  
    stdout { codec => rubydebug }  
}

Save and run it:

/opt/logstash*/bin/logstash -f logstash-obi-kafka-consumer.conf

and assuming the application server is still writing new log content we’ll get it written out here:

So far we’re doing nothing fancy at all – simply dumping to the console whatever messages we receive from kafka. In effect, it’s the same as the kafka-console-consumer.sh script that we ran as part of the smoke test earlier. But now we’ve got the messages coming in to Logstash we can do some serious processing on them with grok and the like (something I discuss and demonstrate in an earlier article) to pull out meaningful data fields from each log message. The console is not the best place to write all this to – Elasticsearch is! So we specify that as the output plugin instead. An extract of our configuration looks something like this now:

input {  
    kafka {  
            zk_connect => 'ubuntu-02:2181'  
            topic_id => 'logstash'  
    }
}

filter {  
    grok {  
        match => ["file", "%{WLSSERVER}"]  
        [...]  
    }

    geoip { source => "saw_http_RemoteIP"}

[...]  
}

output {  
    elasticsearch {  
        host => "ubuntu-03"  
        protocol=> "http"  
    }
}

Note the [...] bit in the filter section – this is all the really cool stuff where we wring every last bit of value from the log data and split it into lots of useful data fields…which is why you should get in touch with us so we can help YOU with your OBIEE and ODI monitoring and diagnostics solution!

Advert break over, back to the blog. We’ve set up the new hyper-cool config file, we’ve primed the blasters, we’ve set the “lasers” to 11 … we hit run … and …

…nothing happens. “Logstash startup completed” is the last sign of visible life we see from this console. Checking our kafka-console-consumer.sh we can still see the messages are flowing through:

But Logstash remains silent? Well, no – it’s doing exactly what we told it to, which is to send all output to Elasticsearch (and nowhere else). Don’t believe me? Add the stdout output (the console, in this case) back in to the output stanza of the configuration file:

output {  
    elasticsearch {  
        host => "ubuntu-03"  
        protocol=> "http"  
    }
    stdout { codec => rubydebug }  
}

(Did I mention Logstash is mega-powerful yet? You can combine, split, and filter data streams however you want from and to multiple sources. Here we’re sending it to both elasticsearch and stdout, but it could easily be sending it to elasticsearch and then conditionally to email, or pagerduty, or enriched data back to Kafka, or … you get the idea)

Re-run Logstash with the updated configuration and sure enough, it’s mute no longer:

(this snippet gives you an idea of the kind of data fields that can be extracted from a log file – and this is one of the less interesting ones, difficult though that may be to imagine).

Analysing OBIEE Log Data in Elasticsearch with Kibana

The kopf plugin provides a nice web frontend to some of the administrative functions of Elasticsearch, including a quick overview of the state of a cluster and number of documents. Using it we can confirm we’ve got some data that’s been loaded from our Logstash -> Kafka -> Logstash pipeline:

and now in Kibana:

You can read a lot more about Kibana, including the (minimal) setup required to get it to show data from Elasticsearch, in other articles that I’ve written here, here, and here.

Using Kibana we can get a very powerful but simple view over the data we extracted from the log files, showing things like response times, errors, hosts, data models used, and so on:

MOAR Application Servers

Let’s scale this thing out a bit, and add a second application server into the mix. All we need to do is replicate the Logstash install and configuration on the second application server – everything else remains the same. Doing this we start to see the benefit of centralising the log processing, and decoupling it from the application server.

Set the Logstash ‘producer’ running on the second application server, and the data starts passing through, straight into Elasticsearch and Kibana at the other end, no changes needed.

Reprocessing data

One of the appealing features of Kafka is that it stores data for a period of time. This means that consumers can stream or batch as they desire, and that they can also reprocess data. By acting as a durable ‘buffer’ for the data it means that recovering from a client crash, such as a Logstash failure like this:

Error: Your application used more memory than the safety cap of 500M.  
Specify -J-Xmx####m to increase it (#### = cap size in MB).  
Specify -w for full OutOfMemoryError stack trace

is really simple – you just restart Logstash and it picks up processing from where it left off. Because Kafka tracks the last message that a consumer (Logstash in this case) read, it can scroll back through its log and pass the consumer just the messages that have accumulated since that point.
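
To illustrate that ‘rewind’ behaviour outside of Logstash, here’s a hedged kafka-python sketch (the group id is hypothetical) that seeks back to the oldest retained message and replays the topic:

# Illustrative only – replay the 'logstash' topic from the beginning
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='ubuntu-02:9092',
                         group_id='logstash-replay',   # hypothetical, unused group id
                         enable_auto_commit=False,
                         consumer_timeout_ms=5000)
tp = TopicPartition('logstash', 0)
consumer.assign([tp])
consumer.seek_to_beginning(tp)          # rewind to the oldest retained offset
for message in consumer:
    print(message.offset, message.value)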

Another benefit of the data being available in Kafka is the ability to reprocess data because the processing itself has changed. A pertinent example of this is with Logstash. The processing that Logstash can do on logs is incredibly powerful, but it may be that a bug is there in the processing, or maybe an additional enrichment (such as geoip) has been added. Instead of having to go back and bother the application server for all its logs (which may have since been housekept away) we can just rerun our Logstash processing as the Kafka consumer and re-pull the data from Kafka. All that needs doing is telling the Logstash consumer to reset its position in the Kafka log from which it reads:

input {  
    kafka {  
            zk_connect => 'ubuntu-02:2181'  
            topic_id => 'logstash'  
            # Use the following two if you want to reset processing  
            reset_beginning => 'true'  
            auto_offset_reset => 'smallest'  
    }
}

Kafka will keep data for the length of time, or size of data, as defined in the log.retention.minutes and log.retention.bytes configuration settings respectively. This is set globally by default to 7 days (and no size limit), and can be changed globally or per topic.

Conclusion

Logstash with Kafka is a powerful and easy way to stream your application log files off the application server with minimal overhead and then process them on a dedicated host. Elasticsearch and Kibana are a great way to visualise, analyse, and diagnose issues within your application’s log files.

Kafka enables you to loosely couple your application server to your monitoring and diagnostics with minimal overhead, whilst adding the benefit of log replay if you want to reprocess them.

The post Forays into Kafka – Logstash transport / centralisation appeared first on Rittman Mead Consulting.

Introducing the Rittman Mead OBIEE Performance Analytics Service

Fix Your OBIEE Performance Problems Today

OBIEE is a powerful analytics tool that enables your users to make the most of the data in your organisation. Ensuring that expected response times are met is key to driving user uptake and successful user engagement with OBIEE.

Rittman Mead can help diagnose and resolve performance problems on your OBIEE system. Taking a holistic, full-stack view, we can help you deliver the best service to your users. Fast response times enable your users to do more with OBIEE, driving better engagement, higher satisfaction, and greater return on investment. We enable you to:

  • Create a positive user experience
  • Ensure OBIEE returns answers quickly
  • Empower your BI team to identify and resolve performance bottlenecks in real time

Rittman Mead Are The OBIEE Performance Experts

Rittman Mead have many years of experience in the full life cycle of data warehousing and analytical solutions, especially in the Oracle space. We know what it takes to design a good system, and to troubleshoot a problematic one.

We are firm believers in a practical and logical approach to performance analytics and optimisation. Eschewing the drunk man anti-method of ‘tuning’ configuration settings at random, we advocate making a clear diagnosis and baseline of performance problems before changing anything. Once a clear understanding of the situation is established, steps are taken in a controlled manner to implement and validate one change at a time.

Rittman Mead have spoken at conferences, produced videos, and written many blogs specifically on the subject of OBIEE Performance.

Performance Analytics is not a dark art. It is not the blind application of ‘best practices’ or ‘tuning’ configuration settings. It is the logical analysis of performance behaviour to accurately determine the issue(s) present, and the possible remedies for them.

Diagnose and Resolve OBIEE Performance Problems with Confidence

When you sign up for the Rittman Mead OBIEE Performance Analytics Service you get:

  1. On-site consultancy from one of our team of Performance experts, including Mark Rittman (Oracle ACE Director), and Robin Moffatt (Oracle ACE).
  2. A Performance Analysis Report to give you an assessment of the current performance and prioritised list of optimisation suggestions, which we can help you implement.
  3. Use of the Performance Diagnostics Toolkit to measure and analyse the behaviour of your system and correlate any poor response times with the metrics from the server and OBIEE itself.
  4. Training, which is vital for enabling your staff to deliver optimal OBIEE performance. We work with your staff to help them understand the good practices to be looking for in design and diagnostics. Training is based on formal courseware, along with workshops based on examples from your OBIEE system where appropriate.

Let Us Help You, Today!

Get in touch now to find out how we can help improve your OBIEE system’s performance. We offer a free, no-obligation sample of the Performance Analysis Report, built on YOUR data.

Don’t just call us when performance may already be problematic – we can help you assess your OBIEE system for optimal performance at all stages of the build process. Gaining a clear understanding of the performance profile of your system and any potential issues gives you the confidence and ability to understand any potential risks to the success of your project – before it gets too late.

The post Introducing the Rittman Mead OBIEE Performance Analytics Service appeared first on Rittman Mead Consulting.


Forays into Kafka – Enabling Flexible Data Pipelines

One of the defining features of “Big Data” from a technologist’s point of view is the sheer number of tools and permutations at one’s disposal. Do you go Flume or Logstash? Avro or Thrift? Pig or Spark? Foo or Bar? (I made that last one up). This wealth of choice is wonderful because it means we can choose the right tool for the right job each time.

Of course, we need to establish that we have indeed chosen the right tool for the right job. But here’s the paradox. How do we easily work out if a tool is going to do what we want of it and is going to be a good fit, without disturbing what we already have in place? Particularly if it’s something that’s going to be part of an existing Productionised data pipeline, inserting a new tool partway through what’s there already is going to risk disrupting it. We potentially end up with a series of cloned environments, all diverging from each other, and not necessarily comparable (not to mention the overhead of the resource to host it all).

The same issue arises when we want to change the code or configuration of an existing pipeline. Bugs creep in, ideas to enhance the processing that you’ve currently got present themselves. Wouldn’t it be great if we could test these changes reliably and with no risk to the existing system?

This is where Kafka comes in. Kafka is very useful for two reasons:

  1. You can use it as a buffer for data that can be consumed and re-consumed on demand
  2. Multiple consumers can all pull the data, independently and at their own rate.

So you take your existing pipeline, plumb in Kafka, and then as and when you want to try out additional tools (or configurations of existing ones) you simply take another ‘tap’ off the existing store. This is an idea that Gwen Shapira put forward in May 2015 and really resonated with me.

I see Kafka sitting right on that Execution/Innovation demarcation line of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

Kafka enables us to build a pipeline for our analytics that breaks down into two phases:

  1. Data ingest from source into Kafka, simple and reliable. Fewest moving parts as possible.
  2. Post-processing. Batch or realtime. Uses Kafka as source. Re-runnable. Multiple parallel consumers:
    • Productionised processing into Event Engine, Data Reservoir and beyond
    • Adhoc/loosely controlled Data Discovery processing and re-processing

These two steps align with the idea of “Obtain” and “Scrub” that Rittman Mead’s Jordan Meyer talked about in his BI Forum 2015 Masterclass on Data Discovery:

So that’s the theory – let’s now look at an example of how Kafka can enable us to build a more flexible and productive data pipeline and environment.

Flume or Logstash? HDFS or Elasticsearch? … All of them!

Mark Rittman wrote back in April 2014 about using Apache Flume to stream logs from the Rittman Mead web server over to HDFS, from where they could be analysed in Hive and Impala. The basic setup looked like this:

Another route for analysing data is through the ELK stack. It does a similar thing – streams logs (with Logstash) in to a data store (Elasticsearch) from where they can be analysed, just with a different set of tools with a different emphasis on purpose. The input is the same – the web server log files. Let’s say I want to evaluate which is the better mechanism for analysing my log files, and compare the two side-by-side. Ultimately I might only want to go forward with one, but for now, I want to try both.

I could run them literally in parallel:

The disadvantage with this is that I have twice the ‘footprint’ on my data source, a Production server. A principle throughout all of this is that we want to remain light-touch on the sources of data. Whether a Production web server, a Production database, or whatever – upsetting the system owners of the data we want is never going to win friends.

An alternative to running in parallel would be to use one of the streaming tools to load data in place of the other, i.e.

or

The issue with this is I want to validate the end-to-end pipeline. Using a single source is better in terms of load/risk to the source system, but less so for validating my design. If I’m going to go with Elasticsearch as my target, Logstash would be the better fit source. Ditto HDFS/Flume. Both support connectors to the other, but using native capabilities always feels to me a safer option (particularly in the open-source world). And what if the particular modification I’m testing doesn’t support this kind of connectivity pattern?

Can you see where this is going? How about this:

The key points here are:

  1. One hit on the source system. In this case it’s flume, but it could be logstash, or another tool. This streams each line of the log file into Kafka in the exact order that it’s read.
  2. Kafka holds a copy of the log data, for a configurable time period. This could be days, or months – up to you and depending on purpose (and disk space!)
  3. Kafka is designed to be distributed and fault-tolerant. As with most of the boxes on this logical diagram it would be physically spread over multiple machines for capacity, performance, and resilience.
  4. The eventual targets, HDFS and Elasticsearch, are loaded by their respective tools pulling the web server entries exactly as they were on disk. In terms of validating end-to-end design we’re still doing that – we’re just pulling from a different source.

Another massively important benefit of Kafka is this:

Sooner or later (and if you’re new to the tool and code/configuration required, probably sooner) you’re going to get errors in your data pipeline. These may be fatal and cause it to fall in a heap, or they may be more subtle and you only realise after analysis that some of your data’s missing or not fully enriched. What to do? Obviously you need to re-run your ingest process. But how easy is that? Where is the source data? Maybe you’ll have a folder full of “.processed” source log files, or an HDFS folder of raw source data that you can reprocess. The issue here is the re-processing – you need to point your code at the alternative source, and work out the range of data to reprocess.

This is all eminently do-able of course – but wouldn’t it be easier just to rerun your existing ingest pipeline, rewinding the point from which it pulls data? Minimising the amount of ‘replumbing’ and reconfiguration to run a re-process job vs. new ingest makes it faster to do, and more reliable. Each additional configuration change is an opportunity to mis-configure. Each ‘shadow’ script clone for re-running vs normal processing increases the risk of code diverging and stale copies being run.

The final pipeline in this simple example looks like this:

  • The source server logs are streamed into Kafka, with a permanent copy up onto Amazon’s S3 for those real “uh oh” moments. Kafka, in a sandbox environment with a ham-fisted sysadmin, won’t be bullet-proof. Better to recover a copy from S3 than have to bother the Production server again. This is something I’ve put in for this specific use case, and wouldn’t be applicable in others.
  • From Kafka the web server logs are available to stream, as if natively from the web server disk itself, through Flume and Logstash.

There’s a variation on a theme of this, that looks like this:

Instead of Flume -> Kafka, and then a second Flume -> HDFS, we shortcut this and have the same Flume agent that is pulling from the source also write to HDFS. Why have I not put this as the final pipeline? Because of this:

Let’s say that I want to do some kind of light-touch enrichment on the files, such as extracting the log timestamp in order to partition my web server logs in HDFS by the date of the log entry (not the time of processing, because I’m working with historical files too). I’m using a regex_extractor interceptor in Flume to determine the timestamp from the event data (log entry) being processed. That’s great, and it works well – when it works. If I get my regex wrong, or the log file changes date format, the house of cards comes tumbling down. Now I have a mess, because my nice clean ingest pipeline from the source system now needs fixing and re-running. As before, of course it is possible to write this cleanly so that it doesn’t break, etc etc, but from the point of view of decoupling operations for manageability and flexibility it makes sense to keep them separate (remember the Obtain vs Scrub point above?).
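
Before trusting a regex like that in a Flume interceptor, it’s worth sanity-checking it against real log lines. Here’s a quick, purely illustrative Python sketch using the same timestamp pattern and the sample log line from the partitioning config shown later in this article:

# Quick sanity check of the timestamp regex before handing it to Flume
import re
from datetime import datetime

pattern = r'(\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2}\s\+\d{4})'
line = ('76.164.194.74 - - [06/Apr/2014:03:38:07 +0000] "GET / HTTP/1.1" 200 38281 '
        '"-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"')

match = re.search(pattern, line)
print(match.group(1))                                             # 06/Apr/2014:03:38:07 +0000
print(datetime.strptime(match.group(1), '%d/%b/%Y:%H:%M:%S %z'))  # Python 3: parsed timestamp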

The final note on this is to point out that technically we can implement the pipeline using a Kafka Flume channel, which is a slightly neater way of doing things. The data still ends up in the S3 sink, and available in Kafka for streaming to all the consumers.

Kafka in Action

Let’s take a look at the configuration to put the above theory into practice. I’m running all of this on Oracle’s BigDataLite 4.2.1 VM which includes, amongst many other goodies, CDH 5.4.0. Alongside this I’ve installed into /opt :

  • apache-flume-1.6.0
  • elasticsearch-1.7.3
  • kafka_2.10-0.8.2.1
  • kibana-4.1.2-linux-x64
  • logstash-1.5.4

The Starting Point – Flume -> HDFS

First, we’ve got the initial Logs -> Flume -> HDFS configuration, similar to what Mark wrote about originally:

# http://flume.apache.org/FlumeUserGuide.html#exec-source  
source_agent.sources = apache_server  
source_agent.sources.apache_server.type = exec  
source_agent.sources.apache_server.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_server.batchSize = 1  
source_agent.sources.apache_server.channels = memoryChannel

# http://flume.apache.org/FlumeUserGuide.html#memory-channel  
source_agent.channels = memoryChannel  
source_agent.channels.memoryChannel.type = memory  
source_agent.channels.memoryChannel.capacity = 100

## Write to HDFS  
source_agent.sinks = hdfs_sink  
source_agent.sinks.hdfs_sink.type = hdfs  
source_agent.sinks.hdfs_sink.channel = memoryChannel  
source_agent.sinks.hdfs_sink.hdfs.path = /user/oracle/incoming/rm_logs/apache_log  
source_agent.sinks.hdfs_sink.hdfs.fileType = DataStream  
source_agent.sinks.hdfs_sink.hdfs.writeFormat = Text  
source_agent.sinks.hdfs_sink.hdfs.rollSize = 0  
source_agent.sinks.hdfs_sink.hdfs.rollCount = 10000  
source_agent.sinks.hdfs_sink.hdfs.rollInterval = 600

After running this

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \
--conf-file flume_website_logs_02_tail_source_hdfs_sink.conf

we get the logs appearing in HDFS and can see them easily in Hue:

Adding Kafka to the Pipeline

Let’s now add Kafka to the mix. I’ve already set up and started Kafka (see here for how), and Zookeeper’s already running as part of the default BigDataLite build.

First we need to define a Kafka topic that is going to hold the log files. In this case it’s called apache_logs:

$ /opt/kafka_2.10-0.8.2.1/bin/kafka-topics.sh --zookeeper bigdatalite:2181 \
--create --topic apache_logs  --replication-factor 1 --partitions 1

Just to prove it’s there and that we can send/receive messages on it, I’m going to use the Kafka console producer/consumer to test it. Run these in two separate windows:

$ /opt/kafka_2.10-0.8.2.1/bin/kafka-console-producer.sh \
--broker-list bigdatalite:9092 --topic apache_logs

$ /opt/kafka_2.10-0.8.2.1/bin/kafka-console-consumer.sh \
--zookeeper bigdatalite:2181 --topic apache_logs

With the Consumer running enter some text, any text, in the Producer session and you should see it appear almost immediately in the Consumer window.

Now that we’ve validated the Kafka topic, let’s plumb it in. We’ll switch the existing Flume config to use a Kafka sink, and then add a second Flume agent to do the Kafka -> HDFS bit, giving us this:

The original flume agent configuration now looks like this:

source_agent.sources = apache_log_tail  
source_agent.channels = memoryChannel  
source_agent.sinks = kafka_sink

# http://flume.apache.org/FlumeUserGuide.html#exec-source  
source_agent.sources.apache_log_tail.type = exec  
source_agent.sources.apache_log_tail.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_log_tail.batchSize = 1  
source_agent.sources.apache_log_tail.channels = memoryChannel

# http://flume.apache.org/FlumeUserGuide.html#memory-channel  
source_agent.channels.memoryChannel.type = memory  
source_agent.channels.memoryChannel.capacity = 100

## Write to Kafka  
source_agent.sinks.kafka_sink.channel = memoryChannel  
source_agent.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink  
source_agent.sinks.kafka_sink.batchSize = 5  
source_agent.sinks.kafka_sink.brokerList = bigdatalite:9092  
source_agent.sinks.kafka_sink.topic = apache_logs

Restart the kafka-console-consumer.sh from above so that you can see what’s going into Kafka, and then run the Flume agent. You should see the log entries appearing soon after. Remember that kafka-console-consumer.sh is just one consumer of the logs – when we plug in the Flume consumer to write the logs to HDFS we can opt to pick up all of the entries in Kafka, completely independently of what we have or haven’t consumed in kafka-console-consumer.sh.

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \ 
--conf-file flume_website_logs_03_tail_source_kafka_sink.conf

[oracle@bigdatalite ~]$ /opt/kafka_2.10-0.8.2.1/bin/kafka-console-consumer.sh \
--zookeeper bigdatalite:2181 --topic apache_logs  

37.252.227.70 - - [06/Sep/2015:08:08:30 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com)"  
174.121.162.130 - - [06/Sep/2015:08:08:35 +0000] "HEAD /blog HTTP/1.1" 301 - "http://oraerp.com/blog" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"  
177.71.183.71 - - [06/Sep/2015:08:08:35 +0000] "GET /blog/ HTTP/1.0" 200 145999 "-" "Mozilla/5.0 (compatible; monitis - premium monitoring service; http://www.monitis.com)"  
174.121.162.130 - - [06/Sep/2015:08:08:36 +0000] "HEAD /blog/ HTTP/1.1" 200 - "http://oraerp.com/blog" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"  
173.192.34.91 - - [06/Sep/2015:08:08:44 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com)"  
217.146.9.53 - - [06/Sep/2015:08:08:58 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; monitis - premium monitoring service; http://www.monitis.com)"  
82.47.31.235 - - [06/Sep/2015:08:08:58 +0000] "GET / HTTP/1.1" 200 36946 "-" "Echoping/6.0.2"

Set up the second Flume agent to use Kafka as a source, and HDFS as the target just as it was before we added Kafka into the pipeline:

target_agent.sources = kafkaSource  
target_agent.channels = memoryChannel  
target_agent.sinks = hdfsSink 

target_agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource  
target_agent.sources.kafkaSource.zookeeperConnect = bigdatalite:2181  
target_agent.sources.kafkaSource.topic = apache_logs  
target_agent.sources.kafkaSource.batchSize = 5  
target_agent.sources.kafkaSource.batchDurationMillis = 200  
target_agent.sources.kafkaSource.channels = memoryChannel

# http://flume.apache.org/FlumeUserGuide.html#memory-channel  
target_agent.channels.memoryChannel.type = memory  
target_agent.channels.memoryChannel.capacity = 100

## Write to HDFS  
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink  
target_agent.sinks.hdfsSink.type = hdfs  
target_agent.sinks.hdfsSink.channel = memoryChannel  
target_agent.sinks.hdfsSink.hdfs.path = /user/oracle/incoming/rm_logs/apache_log  
target_agent.sinks.hdfsSink.hdfs.fileType = DataStream  
target_agent.sinks.hdfsSink.hdfs.writeFormat = Text  
target_agent.sinks.hdfsSink.hdfs.rollSize = 0  
target_agent.sinks.hdfsSink.hdfs.rollCount = 10000  
target_agent.sinks.hdfsSink.hdfs.rollInterval = 600

Fire up the agent:

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent -n target_agent \
-f flume_website_logs_04_kafka_source_hdfs_sink.conf

and as the website log data streams in to Kafka (from the first Flume agent) you should see the second Flume agent sending it to HDFS and evidence of this in the console output from Flume:

15/10/27 13:53:53 INFO hdfs.BucketWriter: Creating /user/oracle/incoming/rm_logs/apache_log/FlumeData.1445954032932.tmp

and in HDFS itself:

Play it again, Sam?

All we’ve done to this point is add Kafka into the pipeline, ready for subsequent use. We’ve not changed the nett output of the data pipeline. But, we can now benefit from having Kafka there, by re-running some of our HDFS load without having to go back to the source files. Let’s say we want to partition the logs as we store them. But, we don’t want to disrupt the existing processing. How? Easy! Just create another Flume agent with the additional configuration in it to do the partitioning.

target_agent.sources = kafkaSource  
target_agent.channels = memoryChannel  
target_agent.sinks = hdfsSink

target_agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource  
target_agent.sources.kafkaSource.zookeeperConnect = bigdatalite:2181  
target_agent.sources.kafkaSource.topic = apache_logs  
target_agent.sources.kafkaSource.batchSize = 5  
target_agent.sources.kafkaSource.batchDurationMillis = 200  
target_agent.sources.kafkaSource.channels = memoryChannel  
target_agent.sources.kafkaSource.groupId = new  
target_agent.sources.kafkaSource.kafka.auto.offset.reset = smallest  
target_agent.sources.kafkaSource.interceptors = i1

# http://flume.apache.org/FlumeUserGuide.html#memory-channel  
target_agent.channels.memoryChannel.type = memory  
target_agent.channels.memoryChannel.capacity = 1000

# Regex Interceptor to set timestamp so that HDFS can be written to partitioned  
target_agent.sources.kafkaSource.interceptors.i1.type = regex_extractor  
target_agent.sources.kafkaSource.interceptors.i1.serializers = s1  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.name = timestamp  
#
# Match this format logfile to get timestamp from it:  
# 76.164.194.74 - - [06/Apr/2014:03:38:07 +0000] "GET / HTTP/1.1" 200 38281 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"  
target_agent.sources.kafkaSource.interceptors.i1.regex = (\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s\\+\\d{4})  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.pattern = dd/MMM/yyyy:HH:mm:ss Z  
#

## Write to HDFS  
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink  
target_agent.sinks.hdfsSink.type = hdfs  
target_agent.sinks.hdfsSink.channel = memoryChannel  
target_agent.sinks.hdfsSink.hdfs.path = /user/oracle/incoming/rm_logs/apache/%Y/%m/%d/access_log  
target_agent.sinks.hdfsSink.hdfs.fileType = DataStream  
target_agent.sinks.hdfsSink.hdfs.writeFormat = Text  
target_agent.sinks.hdfsSink.hdfs.rollSize = 0  
target_agent.sinks.hdfsSink.hdfs.rollCount = 0  
target_agent.sinks.hdfsSink.hdfs.rollInterval = 600

The important lines of note here (as highlighted above) are:

  • the regex_extractor interceptor which determines the timestamp of the log event, then used in the hdfs.path partitioning structure
  • the groupId and kafka.auto.offset.reset configuration items for the kafkaSource.
    • The groupId ensures that this flume agent’s offset in the consumption of the data in the Kafka topic is maintained separately from that of the original agent that we had. By default it is flume, and here I’m overriding it to new. It’s a good idea to specify this explicitly in all Kafka flume consumer configurations to avoid complications.
    • kafka.auto.offset.reset tells the consumer that if no existing offset is found (which it won’t be, if the groupId is a new one) to start from the beginning of the data rather than the end (which is what it will do by default).
    • Thus if you want to get Flume to replay the contents of a Kafka topic, just set the groupId to an unused one (eg ‘foo01’, ‘foo02’, etc) and make sure the kafka.auto.offset.reset is smallest

Now run it (concurrently with the existing flume agents if you want):

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent -n target_agent \
-f flume_website_logs_07_kafka_source_partitioned_hdfs_sink.conf

You should see a flurry of activity (or not, depending on how much data you’ve already got in Kafka), and some nicely partitioned apache logs in HDFS:

Crucially, the existing flume agent and non-partitioned HDFS pipeline stays in place and functioning exactly as it was – we’ve not had to touch it. We could then run the two side-by-side until we’re happy the partitioning is working correctly, and then decommission the first. Even at this point we have the benefit of Kafka, because we just turn off the original HDFS-writing agent – the new “live” one continues to run, it doesn’t need reconfiguring. We’ve validated the actual configuration we’re going to use for real, and we’ve not had to simulate it with mock data sources that then need re-plumbing prior to real use.

Clouds and Channels

We’re going to evolve the pipeline a bit now. We’ll go back to a single Flume agent writing to HDFS, but add in Amazon’s S3 as the target for the unprocessed log files. The point here is not so much that S3 is the best place to store log files (although it is a good option), but as a way to demonstrate a secondary method of keeping your raw data available without impacting the source system. It also fits nicely with using the Kafka flume channel to tighten the pipeline up a tad:

Flume’s HDFS sink can write directly to Amazon’s S3 service using Hadoop’s S3N filesystem support. You need to have already set up your S3 ‘bucket’, and have the appropriate AWS Access Key ID and Secret Key. To get this to work I added these credentials to /etc/hadoop/conf.bigdatalite/core-site.xml (I tried specifying them inline with the flume configuration but with no success):

<property>  
    <name>fs.s3n.awsAccessKeyId</name>  
    <value>XXXXXXXXXXXXX</value>  
</property>  
<property>  
    <name>fs.s3n.awsSecretAccessKey</name>  
    <value>YYYYYYYYYYYYYYYYYYYY</value>  
</property>
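
Before pointing Flume at the bucket, it can be worth sanity-checking the credentials and bucket name independently – for example with the boto library (an illustrative sketch only; the bucket name matches the s3n:// path used in the sink configuration below, and the keys are of course your own):

# Illustrative credential/bucket check with boto (classic boto, not boto3)
import boto

conn = boto.connect_s3(aws_access_key_id='XXXXXXXXXXXXX',
                       aws_secret_access_key='YYYYYYYYYYYYYYYYYYYY')
bucket = conn.get_bucket('rmoff-test')      # same bucket as the s3n:// path below
for key in bucket.list(prefix='apache'):
    print(key.name, key.size)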

Once you’ve set up the bucket and credentials, the original flume agent (the one pulling the actual web server logs) can be amended:

source_agent.sources = apache_log_tail  
source_agent.channels = kafkaChannel  
source_agent.sinks = s3Sink

# http://flume.apache.org/FlumeUserGuide.html#exec-source  
source_agent.sources.apache_log_tail.type = exec  
source_agent.sources.apache_log_tail.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_log_tail.batchSize = 1  
source_agent.sources.apache_log_tail.channels = kafkaChannel


## Write to Kafka Channel  
source_agent.channels.kafkaChannel.channel = kafkaChannel  
source_agent.channels.kafkaChannel.type = org.apache.flume.channel.kafka.KafkaChannel  
source_agent.channels.kafkaChannel.topic = apache_logs  
source_agent.channels.kafkaChannel.brokerList = bigdatalite:9092  
source_agent.channels.kafkaChannel.zookeeperConnect = bigdatalite:2181

## Write to S3  
source_agent.sinks.s3Sink.channel = kafkaChannel  
source_agent.sinks.s3Sink.type = hdfs  
source_agent.sinks.s3Sink.hdfs.path = s3n://rmoff-test/apache  
source_agent.sinks.s3Sink.hdfs.fileType = DataStream  
source_agent.sinks.s3Sink.hdfs.filePrefix = access_log  
source_agent.sinks.s3Sink.hdfs.writeFormat = Text  
source_agent.sinks.s3Sink.hdfs.rollCount = 10000  
source_agent.sinks.s3Sink.hdfs.rollSize = 0  
source_agent.sinks.s3Sink.hdfs.batchSize = 10000  
source_agent.sinks.s3Sink.hdfs.rollInterval = 600

Here the source is the same as before (server logs), but the channel is now Kafka itself, and the sink S3. Using Kafka as the channel has the nice benefit that the data is now already in Kafka, we don’t need that as an explicit target in its own right.

Restart the source agent using this new configuration:

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \
--conf-file flume_website_logs_09_tail_source_kafka_channel_s3_sink.conf

and you should get the data appearing on both HDFS as before, and now also in the S3 bucket:

Didn’t Someone Say Logstash?

The premise at the beginning of this exercise was that I could extend an existing data pipeline to pull data into a new set of tools, as if from the original source, but without touching that source or the existing configuration in place. So far we’ve got a pipeline that is pretty much as we started with, just with Kafka in there now and an additional feed to S3:

Now we’re going to extend (or maybe “broaden” is a better term) the data pipeline to add Elasticsearch into it:

Whilst Flume can write to Elasticsearch given the appropriate sink, I’d rather use a tool much closer to Elasticsearch in origin and direction – Logstash. Logstash supports Kafka as an input (and an output, if you want), making the configuration ridiculously simple. To smoke-test the configuration just run Logstash with this configuration:

input {  
        kafka {  
                zk_connect => 'bigdatalite:2181'  
                topic_id => 'apache_logs'  
                codec => plain {  
                        charset => "ISO-8859-1"  
                }
                # Use both the following two if you want to reset processing  
                reset_beginning => 'true'  
                auto_offset_reset => 'smallest'

        }  
}

output {  
        stdout {codec => rubydebug }  
        }

A few things to point out in the input configuration:

  • You need to specify the plain codec (assuming your input from Kafka is plain text). The default codec for the Kafka plugin is json, and Logstash does NOT like trying to parse plain text as json, as I found out:

    37.252.227.70 - - [06/Sep/2015:08:08:30 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com)" {:exception=>#<NoMethodError: undefined method `[]' for 37.252:Float>, :backtrace=>["/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/event.rb:73:in `initialize'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-codec-json-1.0.1/lib/logstash/codecs/json.rb:46:in `decode'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-1.0.0/lib/logstash/inputs/kafka.rb:169:in `queue_event'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-1.0.0/lib/logstash/inputs/kafka.rb:139:in `run'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:177:in `inputworker'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:171:in `start_input'"], :level=>:error}

  • As well as specifying the codec, I needed to specify the charset. Without this I got \\u0000\\xBA\\u0001 at the beginning of each message that Logstash pulled from Kafka
  • Specifying reset_beginning and auto_offset_reset tell Logstash to pull everything in from Kafka, rather than starting at the latest offset.

When you run the configuration file above you should see a stream of messages to your console of everything that is in the Kafka topic:

$ /opt/logstash-1.5.4/bin/logstash -f logstash-apache_10_kafka_source_console_output.conf

The output will look like this – note that Logstash has added its own special @version and @timestamp fields:

{  
       "message" => "203.199.118.224 - - [09/Oct/2015:04:13:23 +0000] \"GET /wp-content/uploads/2014/10/JFB-View-Selector-LowRes-300x218.png HTTP/1.1\" 200 53295 \"http://www.rittmanmead.com/2014/10/obiee-how-to-a-view-selector-for-your-dashboard/\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",  
      "@version" => "1",  
    "@timestamp" => "2015-10-27T17:29:06.596Z"  
}

Having proven the Kafka-Logstash integration, let’s do something useful – get all those lovely log entries streaming from source, through Kafka, enriched in Logstash with things like geoip, and finally stored in Elasticsearch:

input {  
        kafka {  
                zk_connect => 'bigdatalite:2181'  
                topic_id => 'apache_logs'  
                codec => plain {  
                        charset => "ISO-8859-1"  
                }
                # Use both the following two if you want to reset processing  
                reset_beginning => 'true'  
                auto_offset_reset => 'smallest'  
        }
}


filter {  
        # Parse the message using the pre-defined "COMBINEDAPACHELOG" grok pattern  
        grok { match => ["message","%{COMBINEDAPACHELOG}"] }

        # Ignore anything that's not a blog post hit, characterised by /yyyy/mm/post-slug form  
        if [request] !~ /^\/[0-9]{4}\/[0-9]{2}\/.*$/ { drop{} }

        # From the blog post URL, strip out the year/month and slug  
        #  http://www.rittmanmead.com/2015/02/obiee-monitoring-and-diagnostics-with-influxdb-and-grafana/  
        #     year  => 2015  
        #     month =>   02  
        #     slug  => obiee-monitoring-and-diagnostics-with-influxdb-and-grafana  
        grok { match => [ "request","\/%{NUMBER:post-year}\/%{NUMBER:post-month}\/(%{NUMBER:post-day}\/)?%{DATA:post-slug}(\/.*)?$"] }

        # Combine year and month into one field  
        mutate { replace => [ "post-year-month" , "%{post-year}-%{post-month}" ] }

        # Use GeoIP lookup to locate the visitor's town/country  
        geoip { source => "clientip" }

        # Store the date of the log entry (rather than now) as the event's timestamp  
        date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]}  
}

output {  
        elasticsearch { host => "bigdatalite"  index => "blog-apache-%{+YYYY.MM.dd}"}  
        }
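
Before leaving this running, it’s worth validating the configuration file first. A quick sketch – Logstash 1.5 supports a --configtest flag that parses the configuration and exits without processing any data:

$ /opt/logstash-1.5.4/bin/logstash -f logstash-apache_01_kafka_source_parsed_to_es.conf --configtest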

Make sure that Elasticsearch is running and then kick off Logstash:

$ /opt/logstash-1.5.4/bin/logstash -f logstash-apache_01_kafka_source_parsed_to_es.conf

Nothing will appear to happen on the console:

log4j, [2015-10-27T17:36:53.228]  WARN: org.elasticsearch.bootstrap: JNA not found. native methods will be disabled.  
Logstash startup completed

But in the background Elasticsearch will be filling up with lots of enriched log data. You can confirm this through the useful kopf plugin to see that the Elasticsearch indices are being created:

and directly through Elasticsearch’s RESTful API too:

$ curl -s -XGET http://bigdatalite:9200/_cat/indices?v|sort  
health status index                  pri rep docs.count docs.deleted store.size pri.store.size  
yellow open   blog-apache-2015.09.30   5   1      11872            0       11mb           11mb  
yellow open   blog-apache-2015.10.01   5   1      13679            0     12.8mb         12.8mb  
yellow open   blog-apache-2015.10.02   5   1      10042            0      9.6mb          9.6mb  
yellow open   blog-apache-2015.10.03   5   1       8722            0      7.3mb          7.3mb

And of course, the whole point of streaming the data into Elasticsearch in the first place – easy analytics through Kibana:
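
If you want to poke at the data outside of Kibana, the same kinds of question can be asked directly of Elasticsearch’s REST API. As a rough sketch (the aggregation name is arbitrary, and the query relies only on the @timestamp field that Logstash populates), this counts blog hits per day across all of the blog-apache indices:

$ curl -s -XGET 'http://bigdatalite:9200/blog-apache-*/_search?search_type=count' -d '{  
    "aggs": {  
        "hits_per_day": {  
            "date_histogram": { "field": "@timestamp", "interval": "day" }  
        }  
    }  
}'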

Conclusion

Kafka is awesome :-D

We’ve seen in this article how Kafka enables the implementation of flexible data pipelines that can evolve organically without requiring system rebuilds to implement or test new methods. It allows the data discovery function to tap into the same source of data as the more standard analytical reporting one, without risk of impacting the source system at all.

The post Forays into Kafka – Enabling Flexible Data Pipelines appeared first on Rittman Mead Consulting.

Driving OBIEE User Engagement with Enhanced Usage Tracking for OBIEE

Measuring and monitoring user interactions and behaviour with OBIEE is a key part of Rittman Mead’s User Engagement Service. By understanding and proving how users are engaging the system we can improve the experience for the user, driving up usage and ensuring maximum value for your OBIEE investment. To date, we have had the excellent option of Usage Tracking for finding out about system usage, but this only captures actual dashboard and analysis executions. What I am going to discuss in this article is taking Usage Tracking a step further, and capturing and analysing every click that the user makes. Every login, every search, every report build action. This can be logged to a database such as Oracle, and gives us Enhanced Usage Tracking!

Why?

Because the more we understand about our user base, the more we can do for them in terms of improved content and accessibility, and the more we can do for us, the OBIEE dev/sysadmin, in terms of easier maintenance and better knowledge of the platform for which we are developing.

Here is a handful of questions that this data can answer – I’m sure once you see the potential of the data you will be able to think of plenty more…

How many users are accessing OBIEE through a mobile device?

Maybe you’re about to implement a mobile strategy, perhaps deploying MAD or rolling out BI Mobile HD. Wouldn’t it be great if you could quantify its uptake, and not only that but the impact that the provision of mobile makes on the general user engagement levels of your OBIEE user base?

Perhaps you think your users might benefit from a dedicated Mobile OBIEE strategy, but to back up your business case for the investment in mobile licences or time to optimise content for mobile consumption you want to show how many users are currently accessing full OBIEE through a web browser on their mobile device. And not only ‘a mobile device’, but which one, which browser, and which OS. Enhanced Usage Tracking data can provide all this, and more.

Which dashboards get exported to Excel the most frequently?

The risks that Excel-marts present are commonly discussed, and broader solutions such as data-mashup capabilities within OBIEE itself exist – but how do you identify which dashboards are frequently exported from OBIEE to Excel, and by whom? We’ve all probably got a gut-instinct, or indirect evidence, of when this happens – but now we can know for sure. Whilst Usage Tracking alone will tell us when a dashboard is run, only Enhanced Usage Tracking can show what the user then did with the results:

What do we do with this information? It Depends, of course. In some cases exporting data to Excel is a – potentially undesirable but pragmatic – way of getting certain analysis done, and to try to prevent it would be unnecessarily petulant and counterproductive. In many other cases though, people use it simply as a way of doing something that could be done in OBIEE, but they lack the awareness or training to do it. The point is that by quantifying and identifying when it occurs you can start an informed discussion with your user base, from which both sides of the discussion benefit.

Precise Tracking of Dashboard Usage

Usage Tracking is great, but it has limitations. One example of this is where a user visits a dashboard page more than once in the same session, meaning that it may be served from the Presentation Services cache, and if that happens, the additional visit won’t be recorded in Usage Tracking. By using click data we can actually track every single visit to a dashboard.

In this example here we can see a user visiting two dashboard pages, and then going back to the first one – which is captured by the Enhanced Usage Tracking, but not the standard one, which only captures the first two dashboard visits:

This kind of thing can matter, both from an audit point of view and for more advanced uses, such as examining user behaviour across repeated visits to a dashboard. For example, does it highlight that a dashboard design is not optimal, with the user having to switch between multiple tabs to build up a complete picture of the data that they are analysing?

Predictive Modelling to Identify Users at Risk of ‘Churn’

Churn is when users disengage from a system, when they stop coming back. Being able to identify those at risk of doing this before they do it can be hugely valuable, because it gives you the opportunity to prevent it. By analysing the patterns of system usage in OBIEE and looking at users who have stopped using OBIEE (i.e. churned), we can then build a predictive model to identify those who have similar patterns of usage but are still active.

Measures such as the length of time it takes to run the first dashboard after login, or how many dashboards are run, or how long it takes to find data when building an analysis, can all be useful factors to include in the model.

Are any of my users still accessing OBIEE through IE6?

A trend that I’ve seen in the years working with OBIEE is that organisations are [finally] moving to a more tolerant view on web browsers other than IE. I suppose this is as the web application world evolves and IE becomes more standards compliant and/or web application functionality forces organisations to adopt browsers that provide the latest capabilities. OBIEE too, is a lot better nowadays at not throwing its toys out of the pram when run on a browser that happens to have been written within the past decade.

What’s my little tirade got to do with enhanced usage tracking? Because as those responsible for the development and support of OBIEE in an organisation we need to have a clear picture of the user base that we’re supporting. Sure, corporate ‘standard’ is IE9, but we all know that Jim in design runs one of those funny Mac things with Safari, Fred in accounts insists on Firefox, Bob in IT prides himself on running Konquerer, and it would be a cold day in hell before you prise the MD’s copy of IE5 off his machine. Whether these browsers are “supported” or not is only really a secondary point to whether they’re being used. A lot of the time organisations will take the risk on running unsupported configurations, consciously or in blissful ignorance, and being ‘right’ won’t cut it if your OBIEE patch suddenly breaks everything for them.

Enhanced Usage Tracking gives us the ability to analyse browser usage over time:

as well as the Enhanced Usage Tracking data rendered through OBIEE itself, showing browser usage in total (nb the Log scale):

It’s also easy to report on the Operating System that users have:

Where are my users connecting to OBIEE from?

Whilst a lot of OBIEE deployments are run within the confines of a corporate network, there are those that are public-facing, and for these it could be interesting to include location as another dimension by which we analyse the user base and their patterns of usage. Enhanced Usage Tracking includes the capture of a user’s IP address, which for public networks we can easily look up and use the resulting location data in our analysis.

Even on a corporate network the user’s IP can be useful, because the corporate network will be divided into subnets and IP ranges, which will usually have geographical association to them – you just might need to code your own lookup in order to translate 192.168.11.5 to “Bob’s dining room”.

Who deleted this report? Who logged in? Who clicked the Do Not Click Button?

The uses for Enhanced Usage Tracking are almost endless. Any user interaction with OBIEE can now be measured and monitored.

A frequent question that I see on the OTN forums is along the lines of “for audit purposes, we need to know who logged in”. Since Usage Tracking alone won’t capture this directly (although the new init block logging in > 11.1.1.9 probably helps indirectly with this), this information usually isn’t available…until now! In this table we see the user, their session ID, and the time at which they logged in:

What about who updated a report last, or deleted it? We can find that out too! This simple example shows some of the operations in the Presentation Catalog recorded as clear as day in Enhanced Usage Tracking:

Want to know more? We’d love to tell you more!

Measuring and monitoring user interactions and behaviour with OBIEE is a key part of Rittman Mead’s User Engagement Service. By understanding and proving how users are engaging the system we can improve the experience for the user, driving up usage and ensuring maximum value for your OBIEE investment.

If you’d like to find out more, including about Enhanced Usage Tracking and getting a free User Engagement Report for your OBIEE system, get in touch now!

The post Driving OBIEE User Engagement with Enhanced Usage Tracking for OBIEE appeared first on Rittman Mead Consulting.

Using Linux Control Groups to Constrain Process Memory

Linux Control Groups (cgroups) are a nifty way to limit the amount of resource, such as CPU, memory, or IO throughput, that a process or group of processes may use. Frits Hoogland wrote a great blog demonstrating how to use it to constrain the I/O that a particular process could use, which was the inspiration for this one. I have been doing some digging into the performance characteristics of OBIEE in certain conditions, including how it behaves under memory pressure. I’ll write more about that in a future blog, but wanted to write this short blog to demonstrate how cgroups can be used to constrain the memory that a given Linux process can be allocated.

This was done on Amazon EC2 running an image imported originally from Oracle’s OBIEE SampleApp, built on Oracle Linux 6.5.

$ uname -a  
Linux demo.us.oracle.com 2.6.32-431.5.1.el6.x86_64 #1 SMP Tue Feb 11 11:09:04 PST 2014 x86_64 x86_64 x86_64 GNU/Linux

First off, install the necessary package in order to use them, and start the service. Throughout this blog where I quote shell commands those prefixed with # are run as root and $ as non-root:

# yum install libcgroup  
# service cgconfig start

Create a cgroup (I’m shamelessly ripping off Frits’ code here, hence the same cgroup name ;-) ):

# cgcreate -g memory:/myGroup
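
The new cgroup also appears as a directory under the cgroup filesystem mount. As a quick sanity check – assuming the default Oracle Linux 6 mount point of /cgroup (check /etc/cgconfig.conf if yours differs) – you can list it and read the same counters directly:

# ls /cgroup/memory/myGroup/  
# cat /cgroup/memory/myGroup/memory.limit_in_bytes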

You can use cgget to view the current limits, usage, & high watermarks of the cgroup:

# cgget -g memory:/myGroup|grep bytes  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 0  
memory.memsw.usage_in_bytes: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 0  
memory.usage_in_bytes: 0

For more information about the field meaning see the doc here.

To test out the cgroup ability to limit memory used by a process we’re going to use the tool stress, which can be used to generate CPU, memory, or IO load on a server. It’s great for testing what happens to a server under resource pressure, and also for testing memory allocation capabilities of a process which is what we’re using it for here.

We’re going to configure cgroups so that stress is added to the myGroup group whenever it runs:

$ cat /etc/cgrules.conf  
*:stress memory myGroup

[Re-]start the cg rules engine service:

# service cgred restart
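
As an aside, if you don’t want a blanket rule applied to every invocation of stress, libcgroup also provides cgexec to launch a single process directly within a given cgroup. A sketch only, since the rest of this post uses the cgrules approach:

# cgexec -g memory:myGroup stress --vm-bytes 150M --vm-keep -m 1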

Now we’ll use the watch command to re-issue the cgget command every second enabling us to watch cgroup’s metrics in realtime:

# watch --interval 1 cgget -g memory:/myGroup  
/myGroup:  
memory.memsw.failcnt: 0  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 0  
memory.memsw.usage_in_bytes: 0  
memory.oom_control: oom_kill_disable 0  
        under_oom 0  
memory.move_charge_at_immigrate: 0  
memory.swappiness: 60  
memory.use_hierarchy: 0  
memory.stat: cache 0  
        rss 0  
        mapped_file 0  
        pgpgin 0  
        pgpgout 0  
        swap 0  
        inactive_anon 0  
        active_anon 0  
        inactive_file 0  
        active_file 0  
        unevictable 0  
        hierarchical_memory_limit 9223372036854775807  
        hierarchical_memsw_limit 9223372036854775807  
        total_cache 0  
        total_rss 0  
        total_mapped_file 0  
        total_pgpgin 0  
        total_pgpgout 0  
        total_swap 0  
        total_inactive_anon 0  
        total_active_anon 0  
        total_inactive_file 0  
        total_active_file 0  
        total_unevictable 0  
memory.failcnt: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 0  
memory.usage_in_bytes: 0

In a separate terminal (or even better, use screen!) run stress, telling it to grab 150MB of memory:

$ stress --vm-bytes 150M --vm-keep -m 1

Review the cgroup, and note that the usage fields have increased:

/myGroup:  
memory.memsw.failcnt: 0  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 157548544  
memory.memsw.usage_in_bytes: 157548544  
memory.oom_control: oom_kill_disable 0  
        under_oom 0  
memory.move_charge_at_immigrate: 0  
memory.swappiness: 60  
memory.use_hierarchy: 0  
memory.stat: cache 0  
        rss 157343744  
        mapped_file 0  
        pgpgin 38414  
        pgpgout 0  
        swap 0  
        inactive_anon 0  
        active_anon 157343744  
        inactive_file 0  
        active_file 0  
        unevictable 0  
        hierarchical_memory_limit 9223372036854775807  
        hierarchical_memsw_limit 9223372036854775807  
        total_cache 0  
        total_rss 157343744  
        total_mapped_file 0  
        total_pgpgin 38414  
        total_pgpgout 0  
        total_swap 0  
        total_inactive_anon 0  
        total_active_anon 157343744  
        total_inactive_file 0  
        total_active_file 0  
        total_unevictable 0  
memory.failcnt: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 157548544  
memory.usage_in_bytes: 157548544

Both memory.memsw.usage_in_bytes and memory.usage_in_bytes are 157548544 bytes = 150.25MB.

Having a look at the process stats for stress shows us:

$ ps -ef|grep stress  
oracle   15296  9023  0 11:57 pts/12   00:00:00 stress --vm-bytes 150M --vm-keep -m 1  
oracle   15297 15296 96 11:57 pts/12   00:06:23 stress --vm-bytes 150M --vm-keep -m 1  
oracle   20365 29403  0 12:04 pts/10   00:00:00 grep stress

$ cat /proc/15297/status

Name:   stress  
State:  R (running)  
[...]  
VmPeak:   160124 kB  
VmSize:   160124 kB  
VmLck:         0 kB  
VmHWM:    153860 kB  
VmRSS:    153860 kB  
VmData:   153652 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       328 kB  
VmSwap:        0 kB  
[...]

The man page for proc gives us more information about these fields, but of particular note are:

  • VmSize: Virtual memory size.
  • VmRSS: Resident set size.
  • VmSwap: Swapped-out virtual memory size by anonymous private pages

Our stress process has a VmSize of 156MB, VmRSS of 150MB, and zero swap.

Kill the stress process, and set a memory limit of 100MB for any process in this cgroup:

# cgset -r memory.limit_in_bytes=100m myGroup

Run cgget and you should see the new limit. Note that at this stage we’re just setting memory.limit_in_bytes and leaving memory.memsw.limit_in_bytes unchanged.

# cgget -g memory:/myGroup|grep limit|grep bytes  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 104857600
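
Note that limits applied with cgset are runtime-only and won’t survive a restart of the cgconfig service; to persist them you would add them to /etc/cgconfig.conf. A sketch of what that entry might look like:

group myGroup {  
    memory {  
        # Persist the 100MB limit for the myGroup cgroup  
        memory.limit_in_bytes = "100m";  
    }  
}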

Let’s see what happens when we try to allocate the memory, observing the cgroup and the process’s virtual memory information at each point:

  • 15MB:

    $ stress --vm-bytes 15M --vm-keep -m 1  
    stress: info: [31942] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
    
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 15990784  
    memory.usage_in_bytes: 15990784
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:    21884 kB  
    VmSize:    21884 kB  
    VmLck:         0 kB  
    VmHWM:     15616 kB  
    VmRSS:     15616 kB  
    VmData:    15412 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:        60 kB  
    VmSwap:        0 kB

  • 50MB:

    $ stress --vm-bytes 50M --vm-keep -m 1  
    stress: info: [32419] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
    
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 52748288  
    memory.usage_in_bytes: 52748288     
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:    57724 kB  
    VmSize:    57724 kB  
    VmLck:         0 kB  
    VmHWM:     51456 kB  
    VmRSS:     51456 kB  
    VmData:    51252 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:       128 kB  
    VmSwap:        0 kB

  • 100MB:

    $ stress --vm-bytes 100M --vm-keep -m 1  
    stress: info: [20379] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 105197568  
    memory.usage_in_bytes: 104738816
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:   108924 kB  
    VmSize:   108924 kB  
    VmLck:         0 kB  
    VmHWM:    102588 kB  
    VmRSS:    101448 kB  
    VmData:   102452 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:       232 kB  
    VmSwap:     1212 kB

Note that VmSwap has now gone above zero, despite the machine having plenty of usable memory:

# vmstat -s  
     16330912  total memory  
     14849864  used memory  
     10583040  active memory  
      3410892  inactive memory  
      1481048  free memory  
       149416  buffer memory  
      8204108  swap cache  
      6143992  total swap  
      1212184  used swap  
      4931808  free swap

So it looks like the memory cap has kicked in and the stress process is being forced to get the additional memory that it needs from swap.
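
If you want to watch the swapping happen as it occurs, the same /proc status trick used above can be wrapped in watch, for example:

$ watch --interval 1 'cat /proc/$(pgrep stress|tail -n1)/status|grep VmSwap'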

Let’s tighten the screw a bit further:

$ stress --vm-bytes 200M --vm-keep -m 1  
stress: info: [21945] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is now using 100MB of swap (since we’ve asked it to grab 200MB but cgroup is constraining it to 100MB real):

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
VmPeak:   211324 kB  
VmSize:   211324 kB  
VmLck:         0 kB  
VmHWM:    102616 kB  
VmRSS:    102600 kB  
VmData:   204852 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       432 kB  
VmSwap:   102460 kB

The cgget command confirms that we’re using swap, as the memsw value shows:

# cgget -g memory:/myGroup|grep usage|grep -v max  
memory.memsw.usage_in_bytes: 209788928  
memory.usage_in_bytes: 104759296

So now what happens if we curtail the use of all memory, including swap? To do this we’ll set the memory.memsw.limit_in_bytes parameter. Note that running cgset whilst a task in the cgroup is executing seems to be ignored if the new limit is below the current usage (per the usage_in_bytes field). If it is above this then the change is instantaneous:

  • Current state

    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 9223372036854775807  
    memory.memsw.max_usage_in_bytes: 209915904  
    memory.memsw.usage_in_bytes: 209784832  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104775680

  • Set the limit below what is currently in use (150m limit vs 200m in use)

    # cgset -r memory.memsw.limit_in_bytes=150m myGroup

  • Check the limit – it remains unchanged

    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 9223372036854775807  
    memory.memsw.max_usage_in_bytes: 209993728  
    memory.memsw.usage_in_bytes: 209784832  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104751104

  • Set the limit above what is currently in use (250m limit vs 200m in use)

    # cgset -r memory.memsw.limit_in_bytes=250m myGroup

  • Check the limit – it’s taken effect

    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 262144000  
    memory.memsw.max_usage_in_bytes: 210006016  
    memory.memsw.usage_in_bytes: 209846272  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104816640

So now we’ve got limits in place of 100MB real memory and 250MB total (real + swap). What happens when we test that out?

$ stress --vm-bytes 245M --vm-keep -m 1  
stress: info: [25927] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is using 245MB in total (VmData), of which 95MB is resident (VmRSS) and 150MB is swapped out (VmSwap):

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
VmPeak:   257404 kB  
VmSize:   257404 kB  
VmLck:         0 kB  
VmHWM:    102548 kB  
VmRSS:     97280 kB  
VmData:   250932 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       520 kB  
VmSwap:   153860 kB

The cgroup stats reflect this:

# cgget -g memory:/myGroup|grep bytes  
memory.memsw.limit_in_bytes: 262144000  
memory.memsw.max_usage_in_bytes: 257159168  
memory.memsw.usage_in_bytes: 257007616  
[...]  
memory.limit_in_bytes: 104857600  
memory.max_usage_in_bytes: 104857600  
memory.usage_in_bytes: 104849408

If we try to go above this absolute limit (memory.memsw.limit_in_bytes) then the cgroup kicks in and stops the process getting the memory, which in turn causes stress to fail:

$ stress --vm-bytes 250M --vm-keep -m 1  
stress: info: [27356] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd  
stress: FAIL: [27356] (415) <-- worker 27357 got signal 9  
stress: WARN: [27356] (417) now reaping child worker processes  
stress: FAIL: [27356] (451) failed run completed in 3s

This gives you an indication of how careful you need to be using this type of low-level process control. Most tools will not be happy if they are starved of resource, including memory, and may well behave in unstable ways.
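
If you do hit this and want to confirm that it was the cgroup limit (rather than a system-wide memory shortage) that triggered the kill, the cgroup’s failcnt counters and the kernel log are good places to look; a sketch:

# cgget -g memory:/myGroup|grep failcnt  
# dmesg | tail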


Thanks to Frits Hoogland for reading a draft of this post and providing valuable feedback.

The post Using Linux Control Groups to Constrain Process Memory appeared first on Rittman Mead Consulting.

OBIEE 12c – Repository Password Corruption Issue

Here at Rittman Mead we’ve been working with OBIEE 12c for some time now, as part of the beta programme and more recently with clients looking to get the most out of an upgrade to OBIEE 12c. We’ve also been hard at work on our brand new OBIEE 12c training course. What we’ve seen in terms of the stability of OBIEE 12c has been pleasantly surprising. Anyone who’s worked with software long enough will be familiar with the reputation that first releases in general have for nasty bugs, and it’s probably fair to say that with the first release of 11g (11.1.1.3) this was proven out. With the first release of OBIEE 12c, however, we’re seeing a stable tool with very few issues so far.

That said…I’m going to demonstrate an issue here that is a bit of a nasty one. It’s nasty because the trigger for it appears innocuous, and what it breaks is one of the really new things in OBIEE 12c—the way in which the RPD is stored on disk and accessed by the BI Server. This makes it a bit of a tough one to get to the bottom of at first, but it’s a good excuse to go digging!

Summary

If you open the RPD in online mode, use File -> Copy As, and then use the Save option, the repository password on the server gets corrupted.

[nQSError: 43113] Message returned from OBIS  
[nQSError: 13042] Repository password is wrong

From this point on you cannot checkin any changes, and when you restart the BI Server it will fail to start up.

Details

In OBIEE (12c, and before) it is possible to open the RPD as a straight binary file on disk (“Offline” mode), or by connecting directly to the BI Server and opening the copy that it is currently running (“Online mode”). Offline mode suits larger changes and development, with the RPD then being deployed onto the server once the development work is ready to be tested. Online mode is a good way for making changes on a dedicated dev server, minor changes on a shared server, or indeed just for viewing the RPD that’s currently being run.

Here’s what we’ve seen as the problem:

  1. Open RPD in online mode
  2. File -> Copy As
  3. Enter a password with which to protect the RPD being saved on disk.
  4. Do one of:
    • File -> Close, and then when prompted to save changes click Yes
    • File -> Save
    • Click the Save icon on the toolbar
    • Ctrl-S

What happens now is two-fold:

  1. You cannot check in any changes made online—the check in fails with an error from the Administration Tool:
    [nQSError: 43113] Message returned from OBIS  
    [nQSError: 13042] Repository password is wrong
  2. The BI Server will fail on restart with the same error:
    Opening latest versioned cached RPD for : /app/oracle/biee/bi/bifoundation/server/empty.rpd which is /app/oracle/biee/user_projects/domains/bi/bidata/service_instances/ssi/metadata/datamodel/customizations/liverpd.rpd_5  
    [nQSError: 13042] Repository password is wrong. [[

We saw this on SampleApp v511, as well as on vanilla installations of OBIEE. Versions on both were 12.2.1.0.0.

After we reported this to Oracle, they agreed it was a bug and have logged it as bug number 22682937, with no patch currently (February 10, 2016) available.

Workaround

If you open the RPD online and use File -> Copy As, don’t hit save or check in, even if prompted by the Admin Tool. Close the RPD straightaway.

Often people will use File -> Copy As to take a copy of the current live RPD before doing some changes to it. At Rittman Mead, we’d always recommend using source control such as git to store all code including the RPD, and using this approach you obviate the need to open the RPD online simply to get the latest copy (because the latest copy is in source control).

You can also use the data-model-cmd downloadrpd option to download the actual live RPD—that’s exactly what this option is provided for.
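
As a sketch of what that looks like – the flags here mirror the uploadrpd example shown below, and the -O output flag in particular is an assumption that you should verify against the tool’s built-in help on your own installation:

$ /app/oracle/biee/user_projects/domains/bi/bitools/bin/data-model-cmd.sh downloadrpd -O /home/oracle/live_copy.rpd -W Password01 -U weblogic -P Admin123 -SI ssi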

Solution – if BI Server (OBIS) has not yet been restarted

If you’ve hit this bug and are hitting “Repository password is wrong” when you try to checkin, and if the BI Server is still running, then redeploy the RPD using the data-model-cmd uploadrpd tool. By redeploying the RPD the password appears to get sorted out.

If the BI Server is down, then this is not an option because it has to be running in order for data-model-cmd uploadrpd to work.

Solution – if BI Server (OBIS) has been restarted and failed

At this point using data-model-cmd uploadrpd is not possible because OBIS is not running and so the data-model-cmd uploadrpd will fail with the error:

[oracle@demo ~]$ /app/oracle/biee/user_projects/domains/bi/bitools/bin/data-model-cmd.sh uploadrpd -I /home/oracle/rmoff.rpd -W Password01 -U weblogic -P Admin123 -SI ssi  
Service Instance: ssi

Operation failed.  
An exception occurred during execution, please check server logs.

The only option from this point is to use importServiceInstance to reset the service instance, either to an empty instance, to SampleAppLite, or to an existing .bar export of your environment. For example:

/app/oracle/biee/oracle_common/common/bin/wlst.sh  
importServiceInstance('/app/oracle/biee/user_projects/domains/bi','ssi','/app/oracle/biee/bi/bifoundation/samples/sampleapplite/SampleAppLite.bar')

This will enable OBIS to start up correctly, from which point the desired RPD can then be re-uploaded if required using data-model-cmd uploadrpd.

Conclusion

The easiest thing is to simply not use File -> Copy As in online mode. Whilst this on its own is fine, the UI means it’s easy to accidentally use the Save option, which then triggers this problem. Instead, use data-model-cmd downloadrpd, and/or use source control so that you can easily identify the latest RPD that you want to develop against.

If you do hit this repository password corruption problem, then keep calm and don’t restart the BI Server—just re-upload the RPD using data-model-cmd uploadrpd. If you have already uploaded the RPD, then you need to use importServiceInstance to restore things to a working state.

As part of the diagnostics that we did to get to the bottom of this issue, we found some interesting things in OBIEE 12c, such as a web service endpoint for RPD upload/download, as well as the detailed workings of the RPD upload process and that infamous liverpd.rpd file. Stay tuned for a blog post on this and more soon! And in the meantime, be sure to get in touch with us to discuss how we can help you with your OBIEE systems, including OBIEE 12c upgrade, and OBIEE 12c training.

The post OBIEE 12c – Repository Password Corruption Issue appeared first on Rittman Mead Consulting.

OBIEE Performance – Why Metrics Matter (and…Announcing obi-metrics-agent v2!)

One of the first steps to improve OBIEE performance is to determine why it is slow. That may sound obvious—can’t fix it if you don’t know what you’re fixing, right? Unfortunately, the “Drunk Man anti-method”, in which we merrily stumble from one change to another, maybe breaking things along the way and certainly having a headache at the end of it, is far too prevalent. This comes about partly through unawareness of a better method to follow, and partly encouraged by tuning documents comprising reams of configuration settings to “tune” and fiddle with without really knowing why or how to prove if they indeed actually fixed anything…

Determining the cause of performance problems is often a case of working out what it’s not just as much as what it is. This is for two important reasons. Firstly, we begin to narrow down the area of focus and analysis. Secondly, we know what to leave alone. If we can prove that, for example, the database is running the query behind a report quickly, then there is no point “tuning” the database, because the problem doesn’t lie there. Similarly, if we can see that a report taking 60 seconds in total to run spends 59 seconds of that in the database, fiddling with Java Heap Size settings on OBIEE is going to at the very, very most reduce our total runtime to…59 seconds! This kind of time profiling is important to do, and something that we produce automatically in our Performance Analytics Report:

So, how do we pinpoint what is, or isn’t, going wrong? We need data, and specifically, we need metrics. We need log files, too, maybe for the real nitty-gritty of explain plans, but a huge amount can be understood about a system by looking at the metrics available.

Any modern operating system, from Windows to Linux, AIX to Solaris, will have copious utilities that will expose important metrics such as CPU usage, disk throughput, and so on. These can often be of great assistance in diagnosing performance problems.

OBIEE DMS Metrics

When it comes to OBIEE itself, we are spoilt by the performance counters available that since 11g (and still in 12c) have been exposed through the Dynamic Monitoring System (DMS). They were even there in 10g too, but accessed through JMX. These metrics give us information ranging from things like the number of logged in users, through how many connections are open to a given database, down to real low-level internals like how many threads are in use for handling LDAP lookups. Crucially, there are also metrics showing current and peak levels of queueing within the various internal systems in OBIEE, which is where DMS becomes particularly important.

By being able to prove that OBIEE has, for example, run out of available connections to the database, we can confidently state that by changing a given configuration parameter we will alleviate a bottleneck. Not only that, but we can monitor and determine how many connections we really do need at a given workload level. The chart below illustrates this. The capacity of the connection pool is plotted against the number of busy connections. As the number of active sessions increases so does the pressure on the connection pool, until it hits capacity at which point queueing starts—which now means queries are waiting for a connection to the database before they can even begin to execute (and it’s at this point we’d expect to see response times suffer).

So this is the kind of valuable information that is just not available anywhere other than the DMS metrics, and you can see from the above illustration just how useful it is. To access DMS metrics in OBIEE 11g and 12c, you have several options available out of the box, including Fusion Middleware Control, the DMS Spy servlet, and WLST.

Some of these are useful for programmatically scraping the data, others for interactively checking values at a point in time.

obi-metrics-agent – v2

At Rittman Mead, we always recommend collecting and storing DMS metrics (alongside others, including OS) all the time—not just if you find yourself with performance problems. That way you can compare before and after states, you can track historical trends—and you’re all set to hit the ground running with your diagnostics when (if) you do hit performance problems.

You can capture DMS metrics with the BI Management Pack in Enterprise Manager, you can write something yourself, or you can take advantage of an open-source tool from Rittman Mead, obi-metrics-agent.

I wrote about obi-metrics-agent originally when we first open-sourced it almost two years ago. The principle in version 2 is still the same; we’ve just rewritten it in Jython so as to remove the need for any dependencies like Python and associated libraries. We’ve also added native InfluxDB output, as well as retaining the option to send data in the original carbon/graphite protocol.

You can run obi-metrics-agent and just write the DMS data to CSV, but our recommendation is always to persist it straight to a time series data store such as InfluxDB. Once you’ve collected the data you can analyse and monitor it with several tools, our favourite being Grafana (read more about this here).
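
Once the agent is writing to InfluxDB you can quickly confirm that data is arriving using the influx command-line client. A sketch only – the database name (obi here) is an assumption and depends entirely on how you’ve configured the agent and InfluxDB:

$ influx -database 'obi' -execute 'SHOW MEASUREMENTS'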

As part of our Performance Analytics Service we’ve built a set of Performance Analytics Dashboards, making available a full-stack view of OBIEE metrics (including DMS, OS, and even Oracle ASH data), as seen in the video here:

If you’d like to find out more about these and the Performance Analytics service offered by Rittman Mead, please get in touch. You can download obi-metrics-agent itself freely from our github repository.

The post OBIEE Performance – Why Metrics Matter (and…Announcing obi-metrics-agent v2!) appeared first on Rittman Mead Consulting.
