
Built-In OBIEE Load Testing with nqcmd


nqcmd ships with all installations of OBIEE and includes some very useful hidden functionality – the ability to generate load tests against OBIEE. There are lots of ways of generating load against OBIEE, but most require third party tools of varying degrees of complexity to work with.

It’s easy to try this out. First set the OBIEE environment:  [I'm using SampleApp v309R2 as an example; your FMW_HOME path will vary]

. ~/obiee/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

and then the “open sesame” setting which enables the hidden nqcmd functionality:

export SA_NQCMD_ADVANCED=Yes

On Windows, run set SA_NQCMD_ADVANCED=YES instead. If you don’t set this environment variable then nqcmd just throws an error if you try to use one of the hidden options.

Now if you list the available options for nqcmd you’ll see lots of new options in addition to the usual ones:

Command: nqcmd - a command line client which can issue SQL statements
                 against either Oracle BI server or a variety
                 of ODBC compliant backend databases.
SYNOPSIS
         nqcmd [OPTION]...
DESCRIPTION
         -d<data source name>
         -u<user name>
         -p<password>
         -s<sql input file name>
         -o<output result file name>
         -D<Delimiter>
         -b<super batch file name>
         -w<# wait seconds>
         -c<# cancel interval seconds>
         -C<# number of fetched rows by column-wise binding>
         -n<# number of loops>
         -r<# number of requests per shared session>
         -R<# number of fetched rows by row-wise binding>
         -t<# number of threads>
         -T (a flag to turn on time statistics)
         -a (a flag to enable async processing)
         -f (a flag to enable to flush output file for each write)
         -H (a flag to enable to open/close a request handle for each query)
         -z (a flag to enable UTF8 in the output result file
         -utf16 (a flag to enable UTF16 for communicating to Oracle BI ODBC driver)
         -q (a flag to turn off row output)
         -NoFetch (a flag to disable data fetch with query execution)
         -SmartDiff (a flag to enable SmartDiff tags in output)
         -NotForwardCursor (a flag to disable forwardonly cursor)
         -v (a flag to display the version)
         -P<the percent of statements to disable cache hit>
         -impersonate <the impersonate username>
         -runas <the runas username>
         -td <the time duration to run >
         -qsel <the query selection>
         -ds <the dump statistics duration in secs>
         -qstats <print Query statistics at end of run>
         -login <login scenario for PSR. login/execute sqls/logout for sql file>
         -ShowQueryLog <to display query log from server, -H is required for this setting>
         -i <ramup interval for each user for load testing, -i is required for this setting>
         -ONFormat<FormatString, i.e. TM9, 0D99>

You’re on your own figuring out the new options, as they’re not documented (and therefore presumably not supported and liable to change or be dropped at any time). What I’ve done below is my best guess at how to use them – don’t take this as gospel. The one source that I did find is a post on Oracle’s CEAL blog: OBIEE 11.1.1 – Advanced Usage of nqcmd command, from which I’ve taken some of the detail below.

Let’s have a look at how we can generate a load test. First off, I’ll create a very simple query:

and from the Advanced tab extract the Logical SQL from it:

SELECT
   0 s_0,
   "A - Sample Sales"."Products"."P2  Product Type" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

This Logical SQL I’ve saved to a file, report01.lsql.

To run this Logical SQL from nqcmd I use the standard (documented) syntax, passing the Logical SQL filename with the -s flag:

[oracle@obieesample loadtest]$ nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

Connection open with info:
[0][State: 01000] [DataDirect][ODBC lib] Application's WCHAR type must be UTF16, because odbc driver's unicode type is UTF16
SELECT
   0 s_0,
   "A - Sample Sales"."Products"."P2  Product Type" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY
[...]

0            Smart Phones   6773120.36
--------------------
Row count: 11
--------------------

Processed: 1 queries

Adding the -q flag will do the same, but suppress the data output:

[oracle@obieesample loadtest]$ nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q

[...]
----------------------------------------------------------------------
Row count: 11
-------------------------------------------------------------------------------------------------------------   
Processed: 1 queries

The basic parameters for load testing are

  • -t – how many threads [aka Virtual Users]
  • -td – test duration
  • -ds – how frequently to write out load test statistics
  • -T – enable time statistics [without this they will not be reported correctly]

You also need to supply -o with an output filename. Even if you’re not writing the data returned from the query to disk (which you shouldn’t, and -q disables), nqcmd needs this in order to be able to write its load test statistics properly (I got a lot of zeros and nan otherwise). In addition, the -T (Timer) flag should be enabled for accurate timings.

So to run a test for a minute with 5 threads, writing load test stats to disk every 5 seconds, you’d run:

nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -T -td 60 -t 5 -ds 5 -o output

The load test stats are written to a file based on the name given in the -o parameter, with a _Counters.txt suffix:

$ cat output_Counters.txt
                        nQcmd Load Testing
TimeStamp       Sqls/Sec        Avg RT  CumulativePrepareTime   CumulativeExecuteTime   CumulativeFetchTime
00:00:05        56.200000       0.065925        2.536000                13.977000               2.012000
00:00:10        66.800000       0.065009        5.641000                33.479000               4.306000
00:00:15        69.066667       0.066055        8.833000                52.234000               7.366000
00:00:20        73.100000       0.063984        11.978000               71.944000               9.622000
[...]
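If you just want a quick look at the throughput figures without going near Graphite, something as simple as awk over the counters file does the job (a rough sketch; the column positions are as per the header above):

# print the timestamp and queries-per-second columns, skipping the two header lines
awk 'NR>2 {print $1, $2}' output_Counters.txt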

Using obi-metrics-agent to pull out the OBIEE metrics and Graphite to render them, we can easily visualise what happened when we ran the test, starting with the Oracle_BI_General.Total_sessions metric.

Ramping Up the Load

nqcmd also has a -i parameter, to specify the ramp up per thread. Most load tests should incorporate a “ramp up”, whereby the load is introduced gradually. This is important so that you don’t overwhelm a server all at once. It might be that the server will not support the total number of users planned, so by using a ramp up period you can examine the server’s behaviour as the load increases gradually, spotting the point at which the wheels begin to come off.

The -i parameter for nqcmd is the delay between each thread launching, and this has an interesting effect on the duration of the test. If you specify a test duration (-td) of 5 seconds, five threads (-t), and a rampup (-i) of 10 seconds, the total elapsed time will be c.55 seconds (5×10 + 5).

I’ve used the standard time command on Linux to validate this by specifying it before the nqcmd call.

$ time nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -td 5 -t 5 -ds 1 -o $(date +%Y-%m-%d-%H%M%S) -T -i 10 

[...]

real    0m56.896s
user    0m2.350s
sys     0m1.434s

So basically the -td is the “Steady State” once all threads are ramped up, and the total elapsed time is equal to (rampup * number of threads) + (desired steady state).
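As a quick sanity check of that arithmetic against the values used above (-i 10, -t 5, -td 5):

# expected elapsed time = (rampup interval x threads) + steady state duration
echo $(( (10 * 5) + 5 ))
# 55 -- close to the 56.9 seconds that time reported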

The ramp-up itself can be seen clearly in the session metrics in Graphite.

BTW a handy trick I’ve used here is to use a timestamp for the output name so that the _Counters.txt from one test doesn’t overwrite another, by specifying date using an inline bash command:

nqcmd [...]   -o $(date +%Y-%m-%d-%H%M%S)   [...]

Whilst we’re at it for tips & tricks – if you want to stop nqcmd running but Ctrl-C isn’t instant enough for you, the following will stop it in its tracks:

pkill -9 nqcmd

Wait a Moment…

…or two. Wait time, or “think time”, is also important in producing a realistic load test. Unless you want to hammer your server just for the lulz to see how fast you can overload it, you’ll want to make sure the workload you’re simulating represents how it is actually used — and in reality users will be pausing (thinking) between report requests. The -w flag provides this option to nqcmd.
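So to add a five-second think time to the earlier one-minute test, the command would look something like this (a sketch based on the flags described above, rather than the exact command I ran):

# same test as before, plus -w 5 for a five second wait between queries on each thread
nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -T -td 60 -t 5 -ds 5 -w 5 -o output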

In this test below, whilst the Total Sessions is as before (no ramp up), the Connection Pool shows far fewer busy connections. On previous tests the busy connections were equal to the number of active threads, because the server was continuously running queries.


And the CPU, which in the previous test was exhausted at five users with no wait time, is now a bit more relaxed.


For comparison, this was the CPU in the first test we ran (5 threads, no wait time, no ramp up). Note that ‘idle’ drops to zero, i.e. the CPU is flat-out.


Load Test in Action

Let’s combine ramp up and wait times to run a load test and see what we can see in the underlying OBIEE metrics. I’m specifying:

  • Write the output to a file named with the current timestamp (via date, in the format YYYY-MM-DD-HHMMSS)
    -o $(date +%Y-%m-%d-%H%M%S)
  • 20 threads
    -t 20
  • 10 second gap between starting each new thread
    -i 10
  • 5 second wait between each thread submitting a new query
    -w 5
  • Run for a total of 230 seconds (20 threads x 10 seconds ramp up = 200 seconds, plus 30 seconds steady state)
    -td 230

$ date;time nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s queries.lsql -q -T -o $(date +%Y-%m-%d-%H%M%S) -t 20 -ds 5 -td 230 -w 5 -i 10;date

Here’s what happened.

  • At first, as the users ramp up, the Connection Pool gets progressively busier.
  • However, when we hit c.14 threads, things start to go awry. The busy count stays at 10, even though the user count is increasing.
    (I was viewing this in flot, which you can get to on the /graphlot URL of your Graphite server.)
  • So the user count is increasing, but we’re not seeing increasing activity on the Connection Pool… so what does that do for the response times?
    OK, so the Average Query Elapsed Time is a metric I’d normally be wary of, but this is a dedicated server running just my load test workload (and a single query within it), so in this case it’s a valid indicator — and it’s showing that the response time is going up. Why’s it going up?
  • Looking more closely at the Connection Pool we can see a problem — we’re hitting the capacity of ten connections, and requests are starting to queue up.
    Note how once the Current Busy Connection Count hits the Capacity of ten, the Current Queued Requests value starts to increase — because the number of users is increasing, trying to run more queries, but having to wait.

So this is a good example of where users would see slow performance, but some of the usual “Silver Bullets” around hardware and the database would completely miss the target, because the bottleneck here is actually in the configuration of the Connection Pool.


If you’re interested in hearing more about this subject, make sure you register for the BI Forum in Brighton, 7-9 May where I’m delighted to be speaking for the second time, presenting “No Silver Bullets : OBIEE Performance in the Real World“.


Visualising OBIEE DMS metrics with Graphite


Assuming you have set up obi-metrics-agent and collectl as described in my previous post, you have a wealth of data at your disposal for graphing and exploring in Graphite, including:

  • OS (CPU, disk, network, memory)
  • OBIEE’s metrics
  • Metrics about DMS itself
  • Carbon (Graphite’s data collector agent) metrics

In this post I’ll show you some of the techniques we can use to put together a simple dashboard.

Building graphs

First off, let’s see how Graphite actually builds graphs. When you select a data series from the Metrics pane it is added to the Graphite composer where you can have multiple metrics. They’re listed in a legend, and if you click on Graph Data you can see the list of them.

Data held in Graphite (or technically, held in whisper) can be manipulated and pre-processed in many ways before Graphite renders it. This can be mathematical transforms of the data (e.g. Moving Average), but also how the data and its label are shown. Here I’ll take the example of several of the CPU metrics (via collectl) to see how we can manipulate them.

To start with, I’ve just added idle, wait and user from the cputotals folder, giving me a nice graph.

We can do some obvious things like add in a title, from the Graph Options menu.

Graphite functions

Looking at the legend there’s a lot of repeated text (the full qualification of the metric name) which makes the graph more cluttered and less easy to read. We can use a Graphite function to fix this. Click on Graph Data, and use ctrl-click to select all three metrics:

Now click on Apply Function -> Set Legend Name By Metric. The aliasByMetric function is wrapped around the metrics, and the legend on the graph now shows just the metric names which is much smarter:

You can read more about Graphite functions here.
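These functions aren’t limited to the Composer, either; the same expressions can be passed to Graphite’s render API, which is handy if you want to script checks or embed charts elsewhere. A sketch (the hostname and metric path are assumptions based on the examples in this post; adjust for your own environment):

# fetch the last hour of the user CPU metric as a PNG, smoothed with a 10-point moving average
curl -o cpu_user.png "http://localhost/render?target=movingAverage(obieesample.OS.cputotals.user,10)&from=-1h"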

Another useful technique is being able to graph out metrics using a wildcard. Consider the ProcessInfo group of metrics that DMS provides about some of the OBIEE processes:

Let’s say we want a graph that shows cpuTime for each of the processes (not all are available). We could add each metric individually:

But that’s time consuming, and assumes there are only two processes. What if DMS gives us data for other processes? Instead we can use a wildcard in place of the process name:

obieesample.DMS.dms_cProcessInfo.ProcessInfo.*.cpuTime

You can do this by selecting a metric and then amending it in the Graph Data view, or from the Graph Data view itself click on Add and use the auto-complete to manually enter it.

But now the legend is pretty unintelligible, and this time using the aliasByMetric function won’t help because the metric name is constant (cpuTime). Instead, use the Set Legend Name By Node function. In this example we want the third node (the name of the process). Combined with a graph title this gives us:

This aliasByNode method works well for Connection Pool data too. However it can be sensitive to certain characters (including brackets) in the metric name, throwing an IndexError: list index out of range error. The latest version of obi-metrics-agent should work around this by modifying the metric names before sending them to Carbon.

The above graph shows a further opportunity for using Graphite functions. The metric is a cumulative one – the amount of CPU time that the process has used, in total. What would be more useful is if we could show the delta between each occurrence. For this, the derivative function is appropriate.
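As with the other functions, this can be applied in the Graph Data view or via the render API. A sketch combining derivative with the per-process legend aliasing from earlier (the node index here is an assumption based on this example metric path; count the dot-separated parts of your own metric, zero-indexed, to pick the right one):

# per-interval CPU time for each OBIEE process, with the legend labelled by process name
curl -o cputime_delta.png "http://localhost/render?target=aliasByNode(derivative(obieesample.DMS.dms_cProcessInfo.ProcessInfo.*.cpuTime),4)&from=-1h"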

Sometimes you’ll get graphs with gaps in them; maybe the server was busy and the collector couldn’t keep up.

To “gloss over” these, use the Keep Last Value function.

Saving graphs

You don’t have to log in to Graphite by default, but to save and return to graphs and dashboards between sessions you’ll want to. If you used the obi-metrics-agent installation script then Graphite will have a user oracle with password Password01. Click the Login button in the top right of the Graphite screen and enter the credentials.

Once logged in, you should see a Save icon (for you young kids out there, that’s a 3.5″ floppy disk…).

You can return to saved graphs from the Tree pane on the left:

flot

As well as the standard Graphite graphing described above, you also have the option of using flot, which is available from the link in the top-right options, or the icon on an existing graph.

Graphlot/Flot is good for things like examining data values at specific times.

Creating a dashboard

So far we’ve seen individual graphs in isolation, which is fine for ad-hoc experimentation but doesn’t give us an overall view of a system. Click on Dashboard in the top-right of the Graphite page to go to the dashboards area, ignoring the error about the default theme.

You can either build Graphite dashboards from scratch, or you can bring in graphs that you have prepared already in the Graphite Composer and saved.

At the top of the Graphite Dashboard screen are the metrics available to you. Clicking on them drills down the metric tree, as does typing in the box underneath.

Selecting a metric adds it in a graph to the dashboard, and selecting a second adds it into a second graph:

You can merge graphs by dragging and dropping one onto the other:

Metrics within a graph can be modified with functions in exactly the same way as in the Graphite Composer discussed above:

To add in a graph that you saved from Graphite Composer, use the Graphs menu

You can resize the graphs shown on the dashboard, again using the Graphs menu:

To save your dashboard, use the Dashboard -> Save option.

Example Graphite dashboards

Here are some examples of obi-metrics-agent/Graphite being used in anger. Click on an image to see the full version.

  • OS stats (via collectl)
  • Presentation Services sessions, cache and charting
  • BI Server (nqserver) Connection and Thread Pools
  • Response times vs active users (via JMeter)

BI Forum 2014 preview – No Silver Bullets : OBIEE Performance in the Real World


I’m honoured to have been accepted to speak at this year’s Rittman Mead BI Forum, the sixth year of this expert-level conference that draws some of the best Oracle BI/DW minds together from around the world. It’s running May 8th-9th in Brighton, and May 15-16th in Atlanta, with an optional masterclass from Cloudera’s Lars George the day before the conference itself at each venue.

My first visit to the BI Forum was in 2009 where I presented Performance Testing OBIEE, and now five years later (five years!) I’m back, like a stuck record, talking about the same thing – performance. That I’m still talking about it means that there’s still an audience for it, and this time I’m looking beyond just testing performance to how it’s approached by people working with OBIEE. For an industry built around 1s and 0s, computers doing just what you tell them to and nothing else, there is a surprising amount of suspect folklore and “best practices” used when it comes to “fixing” performance problems.

OBIEE performance good luck charm

Getting good performance with OBIEE is just a matter of being methodical. Understanding where to look for information is half the battle. By understanding where the time goes, improvements can be targeted where they will be most effective. Heavily influenced by Cary Millsap and his Method-R approach to performance, I will look at how to practically apply this to OBIEE. Most of the information needed to build up a full picture is readily available from OBIEE’s log files.

I’ll also dig a bit deeper into OBIEE, exploring how to determine how the system’s behaving “under the covers”. The primary technique for this is through OBIEE’s DMS metrics, which I have written about recently in relation to the new Rittman Mead open-source tool, obi-metrics-agent, and am using day-to-day to rapidly examine and resolve performance problems that clients see.

I’m excited to be presenting again on this topic, and I hope to see you in Brighton next month. The conference always sells out, so don’t delay – register today!

Extended Visualisation of OBIEE Performance Data with Grafana


Recently I wrote about the new obi-metrics-agent tool and how it enables easy collection of DMS data from OBIEE into whisper, the time-series based database behind graphite. In this post I’m going to show two things that take this idea further:

  1. How easy it is to add other data into Graphite
  2. How to install and use Grafana, a most excellent replacement for the graphite front-end.

Collecting data in Graphite

One of the questions I have been asked about using Graphite for collecting and rendering OBIEE DMS metrics is a very valid one: given that OBIEE is a data visualisation tool, and that it usually sits alongside a database, where is the value in introducing another tool that apparently duplicates both data storage and visualisation?

My answer is that it is horses for courses. Graphite has a fairly narrow use-case but what it does it does superbly. It lets you throw any data values at it (as we’re about to see) over time, and rapidly graph these out alongside any other metric in the same time frame.

You could do this with OBIEE and a traditional RDBMS, but you’d need to design the database table, write a load script, handle duplicates, handle date-time arithmetic, build an RPD, build graphs – and even then, you wouldn’t have some of the advanced flexibility that I am going to demonstrate with Grafana below.

Storing nqquery.log response times in Graphite

As part of my Rittman Mead BI Forum presentation “No Silver Bullets – OBIEE Performance in the Real World”, I have been doing a lot of work examining some of the internal metrics that OBIEE exposes through DMS and how these correlate with the timings that are recorded in the BI Server log, nqquery.log, for example:

[2014-04-21T22:36:36.000+01:00] [OracleBIServerComponent] [TRACE:2] [USER-33] [] [ecid: 11d1def534ea1be0:6faf73dc:14586304e07:-8000-00000000000006ca,0:1:9:6:102] [tid: e4c53700] [requestid: c44b002c] [sessionid: c44b0000] [username: weblogic] -------------------- 
Logical Query Summary Stats: Elapsed time 5, Response time 2, Compilation time 0 (seconds) [[
]]

Now, flicking back and forth between the query log and DMS is tedious even with a single-user system, and as soon as you have multiple reports running it is pretty much impossible to track the timings from the log against data points in DMS. The astute of you at this point will be wondering about Usage Tracking data, but for reasons that you can find out if you attend the Rittman Mead BI Forum I am deliberately using nqquery.log instead.

Getting data into Graphite is ridiculously easy. Simply chuck a metric name, value, and timestamp at the Graphite data collector Carbon, and that’s it. You can use whatever method you want for sending it; here I am just using the Linux commandline tool NetCat (nc):

echo "example.foo.bar 3 `date +%s`"|nc localhost 2003

This will log the value of 3 for a metric example.foo.bar for the current timestamp (date +%s). Timestamps are in Unix Time, which is the number of seconds since 1st Jan 1970. You can specify historical values for your metric too:

echo "foo.bar 3 1386806400"|nc localhost 2003

Looking in Graphite we can see the handful of test values I just sent through appear.

Tip: if you don’t see your data coming through, check out the logs in ~/graphite/storage/log/carbon-cache/carbon-cache-a/ (assuming you have Graphite installed in ~/graphite)

So, we know what data we want (nqquery.log timings), and how to get data into Graphite (send the data value to Carbon via nc). How do we bring the two together? We do this in the same way that many Linux things work, and that is by using pipes to join different commands together, each doing one thing and one thing well. The above example demonstrates this – the output from echo is redirected to nc.

To extract the data I want from nqquery.log I am using grep to isolate the lines of data that I want, and then gawk to parse the relevant data value out of each line. The output from gawk is then piped to nc just like above. The resulting command looks pretty grim, but is mostly a result of the timestamp conversion into Unix time:

grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$23);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.elapsed",$23,e}'|nc localhost 2003

An example of the output of the above is:

nqquery.logical.response 29 1395766983
nqquery.logical.response 22 1395766983
nqquery.logical.response 22 1395766983
nqquery.logical.response 24 1395766984
nqquery.logical.response 86 1395767047
nqquery.logical.response 10 1395767233
nqquery.logical.response 9 1395767233

which we can then send straight to Carbon.

I’ve created additional versions for other available metrics, which in total gives us:

# This will parse nqquery.log and send the following metrics to Graphite/Carbon, running on localhost port 2003
#       nqquery.logical.compilation
#       nqquery.logical.elapsed
#       nqquery.logical.response
#       nqquery.logical.rows_returned_to_client
#       nqquery.physical.bytes
#       nqquery.physical.response
#       nqquery.physical.rows
# NB it parses the whole file each time and sends all values to carbon. 
# Carbon will ignore duplicates, but if you're working with high volumes 
# it would be prudent to ensure the nqquery.log file is rotated 
# appropriately.

grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$23);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.elapsed",$23,e}'|nc localhost 2003
grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$26);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.response",$26,e}'|nc localhost 2003
grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.compilation",$29,e}'|nc localhost 2003
grep "Physical query response time" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.response",$(NF-4),e}'|nc localhost 2003
grep "Rows returned to Client" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.rows_returned_to_client",$(NF-1),e}'|nc localhost 2003
grep "retrieved from database" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$(NF-9));split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.rows",$(NF-9),e}'|nc localhost 2003
grep "retrieved from database" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.bytes",$(NF-7),e}'|nc localhost 2003

When I run this script it scrapes the data out of nqquery.log and sends it to Carbon, from where I can render it in Graphite, or, even better, in Grafana.

Grafana

Grafana is a replacement for the default Graphite front-end, written by Torkel Ödegaard and available through its very active GitHub repository.

It’s a great way to very rapidly develop and explore dashboards of data sourced from Graphite. It’s easy to install too. Using SampleApp as an example (set up per the obi-metrics-agent example), do the following:

# Create a folder for Grafana
mkdir /home/oracle/grafana
cd /home/oracle/grafana
# Download the zip from http://grafana.org/download/
wget http://grafanarel.s3.amazonaws.com/grafana-1.5.3.zip
# Unzip it and rearrange the files
unzip grafana-1.5.3.zip
mv grafana-1.5.3/* .
# Create & update the config file
cp config.sample.js config.js
sed -i -e 's/8080/80/g' config.js
# Add grafana to apache config
sudo sed -i'.bak' -e '/Alias \/content/i Alias \/grafana \/home\/oracle\/grafana' /etc/httpd/conf.d/graphite-vhost.conf 
sudo service httpd restart 

# Download ElasticSearch from http://www.elasticsearch.org/overview/elkdownloads/
cd /home/oracle/grafana
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.zip
unzip elasticsearch-1.1.1.zip
# Run elasticsearch
nohup /home/oracle/grafana/elasticsearch-1.1.1/bin/elasticsearch &
# NB if you get an out of memory error, it could be a problem with the JDK available. Try installing java-1.6.0-openjdk.x86_64 and adding it to the path.

At this point you should be able to go to the URL on your sampleapp machine http://localhost/grafana/ and see the Grafana homepage.

One of the reasons I like working with Grafana so much is how easy it is to create very smart, interactive dashboards. Here’s a simple walkthrough.

  1. Click on the Open icon and then New to create a new dashboard
  2. On the new dashboard, click Add a panel to this row, set the Panel Type to Graphite, click on Add Panel and then Close.

  3. Click on the title of the new graph and select Edit from the menu that appears. In the edit screen click on Add query and from the select metric dropdown list define the metric that you want to display


    From here you can add additional metrics to the graph, or add graphite functions to the existing metric. I described the use of functions in my previous post about OBIEE and Graphite
  4. Click on Back to Dashboard at the top of the screen, to see your new graph in place. You can add rows to the dashboard, resize graphs, and add new ones. One of the really nice things you can do with Grafana is drag to zoom a graph, updating the time window shown for the whole page:


    You can set dashboards to autorefresh too, from the time menu at the top of the screen, from where you can also select pre-defined windows.
  5. When it comes to interacting with the data, being able to click on a legend entry to temporarily ‘mute’ that metric is really handy.

This really is just scratching the surface of what Grafana can do. You can see more at the website, and a smart YouTube video.

Summary

Here I’ve shown how we can easily put additional, arbitrary data into Graphite’s datastore, called Whisper. In this instance it was nqquery.log data that I wanted to correlate with OBIEE’s DMS data, but I’ve also used it very successfully in the past to overlay the number of executing JMeter load test users with other data in graphite.

I’ve also demonstrated Grafana, a tool I strongly recommend if you do any work with Graphite. As a front-end it is an excellent replacement for the default Graphite web front end, and it’s very easy to use too.

Mobile App Designer mis-configuration error


I’ve been doing some work recently with OBIEE’s new Mobile App Designer (MAD). It’s a great bit of software that I’m genuinely impressed with, but it’s got its little v1 quirks, and helpful error messages are not its forte. I hit a MADdening (sorry) problem with it that Google and My Oracle Support both drew blanks on, so I’m posting it here in case it helps out others with the same problem.

Setting up MAD is a bit of a fiddly process involving patching OBIEE (regardless of the base version you’re on – hopefully in the future it will get rolled into the patchsets) and performing other bits of setup detailed in the documentation. The problem that I hit manifested itself in two ways:

  1. Publishing an App to the Apps Library worked fine, but updating an existing App threw an error in the browser:
    Failed to publish /~weblogic/GCBC Mobile - Phone.xma:oracle.xdo.webservice.exception.AccessDeniedException: PublicReportService::executeUpdateTemplateForReport Failure: user has no access to report[/Apps Library//GCBC Mobile - Phone.xma] due to [Ljava.lang.StackTraceElement;@4e6106df
  2. Trying to subscribe to any App threw a generic error in the browser: “Error occurred while accessing server. Please contect administrator.” with the corresponding bipublisher.log showing: 
    [2014-05-13T16:49:53.449+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] User (weblogic) with session id: q2fq8fkh66f85ghamsq164u9qs98itvnk0c826i is looking for object in biee path: /shared/Apps Library//GCBC.xma/_mreport.xma[[
    Object Error [Context: 0, code: QM3V3HLV, message: Invalid path (/shared/Apps Library//GCBC.xma/_mreport.xma) -- ]
    ]]
    [2014-05-13T16:49:53.450+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] oracle.xdo.XDOException: Target app not found in the repository :/Apps Library//GCBC.xma[[
        at oracle.xdo.online.AppStoreIO.doPost_subscribeApp(AppStoreIO.java:311)
        at oracle.xdo.online.AppStoreIO.doPost(AppStoreIO.java:120)
    [...]

One of my esteemed Rittman Mead colleagues, Francesco Tisiot, pointed out that the path referenced in the errors has a double slash in it. On checking my configuration, I had indeed fat-fingered one of the settings. APPS_LIBRARY_FOLDER_LOCAL is defined in the <DOMAIN_HOME>/config/bipublisher/repository/Admin/Configuration/xmlp-server-config.xml file, and mine looked like this:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library/"/>

All I needed to do was to remove the trailing slash after Library:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library"/>

After restarting the bimad application deployment all was good again with the MAD world and I could republish and subscribe to Apps happily.

 

Visual Regression Testing of OBIEE with PhantomCSS


Earlier this year I wrote a couple of blog posts (here and here) discussing the topic of automated Regression Testing and OBIEE. One of the points that I was keen to make was that OBIEE is a stack of elements, and depending on the change being tested, it may be sensible to focus on certain elements in the stack instead of all of it. For example, if you are changing the RPD, there is little value in doing a web-based test when you can actually test for the vast majority of regressions using the nqcmd tool alone.

I also argued that testing the front end of OBIEE using tools such as Selenium is difficult to do comprehensively; it can be inflexible, time-consuming, and in some cases just not a sensible use of effort. These tools are built around the idea of parsing the web page that is served up and checking for the presence (or absence) of a particular piece of text or an element on a web page. So for example, you could run a test and tell it to fail if it finds the text “Error” on the page, or you could say only pass the test if some known content is present, such as a report title or data figure. This type of testing is prone to a great deal of false-negatives, because to efficiently build any kind of test case you must focus on something to check for in the page, but you cannot code for every possible error or failure. It is also usually based heavily on the internal IDs of elements on the page in locating the ‘something’ to check for. As the OBIEE Document Object Model (DOM) is undocumented code, Oracle are presumably at liberty to change it whenever they feel like it, and thus any tests written based on it may fail. Finally, OBIEE 11g still defaults to serving up graphs as Flash objects, which Selenium et al just cannot handle, and so cannot be tested.

So, what do we do about regression testing the OBIEE front end?

What do we need to test in the front end?

There is still a strong case for regression testing the OBIEE front end. Analyses get changed, Dashboards break, permissions are updated – all these things can cause errors or problems for the end user, and they are things that testing further down the OBIEE stack (using something like nqcmd) will not cover.

Consider a simple dashboard:

If one of the dashboard pages that are linked to in the central section gets moved in the Presentation Catalog, then this happens:

OK, so Invalid Link Path: is pretty easy to code in as an error check into Selenium. But, what about if the permissions on an analysis used in the dashboard get changed and the user can no longer access it when running the dashboard?

This is a different problem altogether. We need to check for the absence of something. There’s no error, there just isn’t the analysis that ought to be present. One way around this would be to code for the presence of the analysis title text or content – but that is not going to scale nor be maintainable to do for every dashboard being tested.

Another thing that is important to check in the front end is that authorisations are enforced as they should be. That is, a user can see the dashboards that they should be able to, and cannot see the ones that they shouldn’t. Changes made in the LDAP directory holding users and their groups, or a configuration change in the Application Roles, could easily mean that a user can no longer see the dashboards they should be able to. We could code for this specific issue using something like Web Services to programmatically check each and every actual permission – but that could well be overkill.

What I would like to introduce here is the idea of testing OBIEE for regressions visually - but automated, of course.

Visual Regression Testing

Driven by the huge number of applications that are accessed solely on the web (sorry, “Cloud”), a new set of tools has been developed to support the idea of testing web pages for regressions visually. Instead of ‘explaining’ to the computer specifically what to look for in a page (no error text, etc), visual regression testing uses a process to compare images of a web page, comparing a baseline to a sample taken afterwards. This means that the number of false-negatives (missing genuine errors because the test didn’t detect them) drops drastically: instead of relying on coding a test program to parse the Document Object Model (DOM) of an OBIEE web page (which is extremely complex), it simply considers whether two snapshots of the resulting rendered page look the same.

The second real advantage of this method is that typically the tools (including the one I have been working with and will demonstrate below, PhantomCSS) are based on the actual engine that drives the web browsers in use by real end-users. So it’s not a case of parsing the HTML and CSS that the web server sends us and trying to determine if there’s a problem or not – it is actually rendering it the same as Chrome etc and taking a snapshot of it. PhantomCSS uses PhantomJS, which uses the engine that Safari is built on, WebKit.

Let’s Pretend…

So, we’ve got a tool – that I’ll demonstrate shortly – that can programmatically fetch and snapshot OBIEE pages, and compare the snapshots to check for any changes. But what about graphs rendered in Flash? These are usually a blind spot. Well here we can be a bit cheeky. If you pretend (in the User-Agent HTTP request header) to be an iPhone or iPad (devices that don’t support Flash) then OBIEE obligingly serves up PNG graphs plus some javascript to do the hover tooltips. Because it’s a PNG image, it will be rendered correctly in our “browser”, and so included in the snapshot for comparison.

CasperJS

Let’s see this scripting in action. Some clarification of the programs we’re going to use first:

  • PhantomJS is the core functionality we’re using, a headless browser sporting Javascript (JS) APIs
  • CasperJS provides a set of APIs on top of PhantomJS that make working with web page forms, navigation etc much easier
  • PhantomCSS provides the regression testing bit, taking snapshots and running code to compare them and report differences.
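I won’t cover installation here in any detail, but as a rough sketch of one way to get the three of them onto a machine (package names are assumptions; check each project’s own documentation, and note that PhantomCSS is typically pulled straight from its GitHub repository):

# install the headless browser and CasperJS globally via npm
npm install -g phantomjs casperjs
# clone PhantomCSS alongside your test scripts
git clone https://github.com/Huddle/PhantomCSS.git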

We’ll consider a simple CasperJS example first, and come on to PhantomCSS after. Because PhantomCSS uses CasperJS for its core interactions, it makes sense to start with the basics.

Here is a bare-bones script. It loads the login page for OBIEE, echoes the page title to the console, takes a snapshot, and exits:

var casper = require('casper').create();

casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

casper.run();

I run it from the command line:

$ casperjs casper_example_01.js
Oracle Business Intelligence Sign In
$

As you can see, it outputs the title of the page, and then in the screenshots folder I have this:

I want to emphasise again to make clear why this is so useful: I ran this from the commandline only. I didn’t run a web browser, I didn’t take any snapshots by hand – it was all automatic.

Now, let’s build a bit of a bigger example, where we login to OBIEE and see what dashboards are available to us:

// Set the size of the browser window as part of the 
// Casper instantiation
var casper = require('casper').create({viewportSize: {
        width: 800,
        height: 600
    }});

// Load the login page
casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

// Do login
casper.then(function(){
  this.fill('form#logonForm', { NQUser: 'weblogic' ,
                                NQPassword: 'Password01'
                              }, true);
}).
waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?bieehome',function(){
  this.echo('Logged into OBIEE','INFO')
  this.capture('casper_screenshots/afterlogin.png');
  });

// Now "click" the Dashboards menu
casper.then(function() {
  this.echo('Clicking Dashboard menu','INFO')
  casper.click('#dashboard');
  this.waitUntilVisible('div.HeaderPopupWindow', function() {
    this.capture('casper_screenshots/dashboards.png');
  });
});

casper.run();

So I now get a screenshot of after logging in:

and after “clicking” the Dashboard menu:

The only bit of the script above that isn’t self-explanatory is where I am referencing elements. The references are CSS3 selectors and are easily found using something like Chrome Developer Tools. Where the click on Dashboards is simulated, there is a waitUntilVisible function, which is crucial for making sure that the page has rendered fully. A user clicking the menu would obviously wait until it appears, but computers work much faster, so functions like this are important for reining them back.

To round off the CasperJS script, let’s add to the above navigating to a Dashboard, snapshotting it (with graphs!), and then logging out.

[...]
casper.then(function(){
  this.echo('Navigating to GCBC Dashboard','INFO')
  casper.clickLabel('GCBC Dashboard');
})

casper.waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?dashboard', function() {
  casper.waitWhileVisible('div.AjaxLoadingOpacity', function() {
    casper.waitWhileVisible('div.ProgressIndicatorDiv', function() {
      this.capture('casper_screenshots/dashboard.png');
    })
  })
});

casper.then(function() {
  this.echo('Signing out','INFO')
  casper.clickLabel('Sign Out');
});

Again, there’s a couple of waitWhileVisible functions in there, necessary to get CasperJS to wait until the dashboard has rendered properly. The dashboard rendered is captured thus:

PhantomCSS

So now let’s see how we can use the above CasperJS code in conjunction with PhantomCSS to generate a viable regression test scenario for OBIEE.

The script remains pretty much the same, except CasperJS’s capture gets replaced with a phantomcss.screenshot based on an element (html for the whole page), and there’s some extra code “footer” to include that executes the actual test.
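Running the combined script is the same idea as the earlier CasperJS examples: invoke it with casperjs from the shell. A sketch (the filename is mine; depending on your CasperJS version and whether the script uses the test API, you may need the test subcommand):

casperjs obiee_phantomcss_test.js
# or, if the script uses CasperJS's tester module:
casperjs test obiee_phantomcss_test.js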

So let’s see how the proposed test method holds up to the examples above – broken links and disappearing reports.

First, we run the baseline capture, the “known good”. The console output shows that this is the first time it’s been run, because there are no existing images against which to compare:

In the screenshots folder is the ‘baseline’ image for each of the defined snapshots:

Now let’s break something! First off I’ll rename the target page for one of the links in the central pane of the dashboard, which will cause the ‘Invalid Link Path’ message to display.

Now I run the same PhantomCSS test again, and this time it tells me there’s a problem:

When an image is found to differ, a composite of the two highlighting the differences is created:

OK, so first test passed (or rather, failed), but arguably this could have been picked up simply by parsing the page returned from the OBIEE server for known error strings. But what about a disappearing analysis – that’s more difficult to ascertain from the page source alone.

Again, PhantomCSS picks up the difference, and highlights it nice and clearly in the generated image:

The baseline image that you capture should be against a “gold” version of a dashboard – there’s no point including ad-hoc reports or dashboards under development. You’d also want to work with data that is unchanging, so where available use a time filter fixed at a point in the past, rather than ‘current day’, which will be changing frequently.

Belts and Braces?

So visual regression testing is a great thing, but I think a hybrid approach, of parsing the page contents for text too, is worthwhile. CasperJS provides its own test APIs (which PhantomCSS uses), and we can write simple tests such as the following:

this.test.assertTextDoesntExist('Invalid Link Path', 'Check for error text on page');
this.test.assertTextDoesntExist('View Display Error', 'Check for error text on page');
phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

So check for a couple of well-known errors, and then snapshot the page too for subsequent automatic comparison. If an assertion fails, it shows in the console.

This means that what is already being done in Selenium (or for which Selenium is an assumed default tool) could even be brought into the same single test rig based around CasperJS/PhantomCSS.

Frame of Reference

The eagle-eyed of you will have noticed that the snapshots generated by PhantomCSS above are not the entire OBIEE webpage, whereas the ones from CasperJS earlier in this article are. That is because PhantomCSS deliberately wants to focus on an area of the page to test, identified using a CSS3 selector. So if you are testing a dashboard, then considering the toolbar is irrelevant and can only lead to false-positives.

phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

Similarly, considering the available dashboard list (to validate enforced authorisations) just needs to look at the list itself, not the rest of the page.  (and yes, that does say “Protals” – even developers have fat fingers sometimes ;-) )

phantomcss.screenshot('div.HeaderSharedProtals','Dashboard list');

Using this functionality means that the generated snapshots used for comparison can be done to exclude things like the alerts bar (which may appear or disappear between tests).

The Devil’s in the Detail

I am in no doubt that the method described above has definitely got its place in the regression testing arsenal for OBIEE. What I am yet to be fully convinced of is quite to what extent. My beef with Selenium et al is the level of detail one has to get in to when writing tests – identifying strings to test for, their location in the DOM, and so on. Yet above in my CasperJS/PhantomCSS examples, I have DOM selectors too, so is this just the same problem? At the moment, I don’t think so. For Selenium, to build a comprehensive test, you have to dissect the DOM for every single test you want to build. Whereas with CasperJS/PhantomCSS I think there is the need to write a basic framework for OBIEE (the basics of which are provided in this post; you’re welcome), which can then be parameterised based on dashboard name and page only. Sure, additional types of tests may need new code, but it would be more reusable.

Given that OBIEE doesn’t come with an out-of-the-box test rig, whatever we build to test it is going to be bespoke, whether it’s nqcmd, Selenium, JMeter, LoadRunner, OATS, QTP, etc. — the smart money is on picking the option that will be the most flexible, most scalable, easiest to maintain, and take the least effort to develop. There is no one “program to rule them all” – an accurate, comprehensive, and flexible test suite is invariably going to utilise multiple components focussing on different areas.

In the case of regression testing – what is the aim of the testing? What are you looking to validate hasn’t broken after what kind of change?  If all that’s changed in the system is the DBAs adding some indexes or partitioning to the data, I really would not be going anywhere near the front end of OBIEE. However, more complex changes affecting the Presentation Catalog and the RPD can be well covered by this technique in conjunction with nqcmd. Visual regression testing will give you a pass/fail, but then it’s up to you to decipher the images, whereas nqcmd will give you a pass/fail but also an actual set of data to show what has changed.

Don’t forget that other great tool — you! Or rather, you and your minions, who can sit at OBIEE for 5 minutes and spot certain regressions that would take orders of magnitude more time to build a test to locate. Things like testing for UI/UX changes between OBIEE versions are realistically handled manually. The testing of the dashboards can be automated, but faster than I can even type the requirement, let alone build a test to validate it – does clicking on the save icon bring up the save box? Well go click for yourself – done? Next test.

Summary

I have just scratched the surface of what is possible with headless browser scripting for testing OBIEE. Being able to automate and capture the results of browser interactions as we’ve seen above is hugely powerful. You can find the CasperJS API reference here if you want to find out more about how it is possible to interact with the web page as a “user”.

I’ve put the complete PhantomCSS script online here. Let me know in the comments section or via twitter if you do try it out!

Thanks to Christian Berg and Gianni Ceresa for reading drafts of this article and providing valuable feedback. 

SampleApp v406 – Automatic startup of OBIEE


Last week Oracle released v406 of SampleApp. SampleApp is a one-stop-shop for a demonstration of pretty much any conceivable thing that OBIEE is capable of, and then some more on top of it. It is a VirtualBox appliance with everything on the one machine (Database, OBIEE, Endeca, TimesTen, Essbase, etc), and demonstrates basic analysis building techniques through to dashboarding, Mobile App Designer, analysis visualisations with D3, ADF, JavaScript, backend hackery with undocumented NQS calls (hi Christian!), and much, much more.

So, SampleApp is awesome, but … there’s no such thing as absolute perfection ;-) One of the things that is still missing from it is the automatic start/stop of OBIEE when you boot up or shut down the machine. Setting it up is easy to do, as I demonstrate below. I’ve also included my cheatsheet for whipping SampleApp into line for my day-to-day use, focussed on the commandline and automation (because, respectively, that’s where I spend most of my time on a server and because I’m lazy).

The OBIEE init.d service script that I demonstrate here is available for use on any Linux installation of OBIEE. For more information, see the Rittman Mead public scripts github repository here: http://ritt.md/init.d

Before we get started I’m assuming here that you’ve:

  • downloaded the 28GB worth of zip files
  • Unpacked them using 7zip
  • Found 70+GB of disc space free and imported the OVF into VirtualBox
  • Started up the VM

There’s full instructions in the SampleApp_QuickDeploymentGuide–406.pdf, available in the SampleApp v406 Documentation download from the same page as the SampleApp image itself.

So to make OBIEE start automatically, the first thing we need to do is make sure that the database (where the RCU schemas are) is doing the same, by setting it up as a service (init.d). This is based on this article if you want more details about quite how it works, but for now you just need to copy and paste this whole code block into the command prompt on SampleApp to create the necessary files. If you can’t copy & paste between your host and the VirtualBox guest, just go to this blog from Firefox within the SampleApp VM itself.
(Or if you know what you’re doing, SSH onto the server and paste the text into an SSH client on your host machine.)

Copy and paste this whole code block into the command prompt:

# Create startup/shutdown script files
mkdir -p /home/oracle/scripts
chown oracle.oinstall /home/oracle/scripts
cat>>/home/oracle/scripts/startup.sh<<EEOF
#!/bin/bash

# Start Listener
lsnrctl start

# Start Database
sqlplus / as sysdba << EOF
STARTUP;
ALTER PLUGGABLE DATABASE ALL OPEN;
EXIT;
EOF
EEOF

cat>>/home/oracle/scripts/shutdown.sh<<EEOF
#!/bin/bash

# Stop Database
sqlplus / as sysdba << EOF
ALTER PLUGGABLE DATABASE ALL CLOSE IMMEDIATE;
SHUTDOWN IMMEDIATE;
EXIT;
EOF

# Stop Listener
lsnrctl stop
EEOF

# Make them executable
chmod u+x /home/oracle/scripts/startup.sh /home/oracle/scripts/shutdown.sh
chown oracle.oinstall /home/oracle/scripts/startup.sh /home/oracle/scripts/shutdown.sh

# Create service script
cat>/tmp/dbora<<EOF
#!/bin/sh
# chkconfig: 345 90 25
# description: Oracle auto start-stop script.
#
# Set ORA_OWNER to the user id of the owner of the 
# Oracle database software.

ORA_OWNER=oracle

case "\$1" in
    'start')
        # Start the Oracle databases:
        # The following command assumes that the oracle login 
        # will not prompt the user for any values
        su - \$ORA_OWNER -c "/home/oracle/scripts/startup.sh >> /home/oracle/scripts/startup_shutdown.log 2>&1"
        touch /var/lock/subsys/dbora
        ;;
    'stop')
        # Stop the Oracle databases:
        # The following command assumes that the oracle login 
        # will not prompt the user for any values
        su - \$ORA_OWNER -c "/home/oracle/scripts/shutdown.sh >> /home/oracle/scripts/startup_shutdown.log 2>&1"
        rm -f /var/lock/subsys/dbora
        ;;
esac
EOF
sudo mv /tmp/dbora /etc/init.d/dbora
sudo chown root. /etc/init.d/dbora

# Make the service script executable
sudo chmod 750 /etc/init.d/dbora

# Associate the dbora service with the appropriate run levels and set it to auto-start using the following command.
sudo chkconfig --add dbora

On SampleApp v406 there is an Oracle 12c container database (CDB), with two “pluggable” databases (PDB) within it. Assuming you’ve not started the database yet, trying to connect to one of the PDBs for the RCU schema will fail:

[oracle@demo ~]$ sqlplus BIEE_BIPLATFORM/Oracle123@localhost:1521/pdborcl

SQL*Plus: Release 12.1.0.1.0 Production on Tue Jun 17 03:03:51 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

ERROR:
ORA-12541: TNS:no listener

Now run the service start (which will happen automatically at boot)

sudo service dbora start

And check the status again:

[oracle@demo ~]$ sqlplus BIEE_BIPLATFORM/Oracle123@localhost:1521/pdborcl

SQL*Plus: Release 12.1.0.1.0 Production on Tue Jun 17 03:06:12 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Tue Jun 17 2014 03:02:09 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL>

Now we can set up OBIEE as a service that can start automatically at boot time. This uses a script that I wrote, shared on the Rittman Mead public scripts GitHub repository here: http://ritt.md/init.d. Installing it on SampleApp v406 needs a couple of changes for the environment paths, dependencies, etc., which I’ve incorporated in this code block. Again, copy and paste the whole thing into a command prompt on SampleApp.

# lsof is used in the script and isn't installed by default, so let's install it: 
sudo yum -y install lsof

# Now fetch the init.d script itself
sudo wget --no-check-certificate https://raw.githubusercontent.com/RittmanMead/scripts/master/obi/init.d/obiee -O /etc/init.d/obiee

# Update the FMW_HOME path in the script
# If you're doing this manually, you just need to change the line 
# "FMW_HOME=" and put in the FMW_HOME path for your installation. 
# In the case of SampleApp v406 it is /app/oracle/biee
sudo sed -i -e 's/FMW_HOME=.*$/FMW_HOME=\/app\/oracle\/biee/g' /etc/init.d/obiee

# Make the script executable
sudo chmod 750 /etc/init.d/obiee

You should now be able to run the following status command to see if OBIEE is running or not:

sudo service obiee status

and if it’s not running, start it up:

sudo service obiee start


To shut it down:

sudo service obiee stop

You can read more about the script here.

There’s one more step to run to make OBIEE start automatically at bootup:

sudo chkconfig --add obiee
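
To double-check that the service is registered for the run levels you expect, you can list it. The output below is illustrative only; the exact run levels depend on the chkconfig header in the script:

sudo chkconfig --list obiee

obiee           0:off   1:off   2:on    3:on    4:on    5:on    6:off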

Now that the service scripts are in place, restart the machine and check they work:

sudo shutdown -r now

When the VM restarts, it may appear to “hang” on boot – the progress bar and “Oracle Linux Server 6.5” will show, but no desktop. That’s because the services are starting up, and you can switch to the console view to see this. On my Mac I press Shift-Function-Backspace to do this; it may vary on other host operating systems (try ALT + d on Windows):

Once the desktop appears you should be able to launch Firefox and go straight to OBIEE, up and running.

There are some additional tweaks I usually make to any new server I work with:

  1. Install screen (don’t know what screen is? Read this!):
    sudo yum -y install screen

    Note I’m using sudo which this version of SampleApp thankfully has configured for the oracle user – previous versions didn’t IIRC.
  2. Configure screen with a nice statusbar:

    cat > ~/.screenrc <<EOF
    [...]
    EOF

    I’m using the bash “here document” functionality here to embed the contents of a document in a command statement. It means I can cut and paste it from my clipboard (if you’re on a Mac, you really should check out Alfred, which has a clipboard history manager, so I always have this screenrc and many other snippets available to paste within just a couple of key presses). Cutting and pasting a single command is easier (and thus less error-prone) than explaining what content to edit into which text file with which text editor. A worked example of such a .screenrc follows this list.

  3. Set up SSH keys. I wrote about SSH keys previously on this blog (in fact, all of the things on this checklist are covered there). Put simply, it makes logging in much faster and removes the problem of fat-fingering the password:

    As with the screen configuration in the previous step, I use a snippet of code from my clipboard, pasted into the shell of any new server I’m working on:

    # setup ssh key
    mkdir ~/.ssh
    cat >> ~/.ssh/authorized_keys <<EOF
    ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDMvNjJ8y1a3+NR5OEZHjAd6HhPFdkOI4mDiIIegbRDG+ABhTQmE80DuDe0vp4tex7qefS1u5FG3Xf+hmKDBzKFyegpC+9oEmrtIPkuW7b1Uf/f7Dr6EjXbyA6x1LGEu3THCrGVfw5Mkakqzr43ZfGEFf8k0RiMroHQbp3vOcR+5y0Q132Tbrr0GPd0RPuj4cVOC7QQ6jXSs7TZWepzrcqEpB4M1+N3xS/jEQ5aiCY2FTDOUVj6Y4c0Ogg93bLos6JltPKznq08KXy0ZW/rWDindAmeL0ZAMuU12Qv1ehQZJsxjUwq3jz4yNMgXs5ba3nSvMAr7ZVRzbK5nLl7+eAN3 zaphod@beeblebrox
    EOF
    #
    chmod -R 700 ~/.ssh

    As a side note, that’s my public SSH key there. One’s public ssh key is just that – public. If you want to use it, go ahead; it just means I’ll be able to login to your server. That’s because I have the other half of the key-pair, the private key, and that is the one that should not be shared, because with it you can access any server that has been configured with the public key. Private SSH keys should be treated just as your most important passwords (and indeed, you should use a passphrase when creating your private SSH key, to add a layer of security to it, so that it can only be used if the passphrase is known). If you need to generate a key-pair of your own in the first place, there’s a one-liner for that after this list.
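
For reference, here’s a minimal sketch of the kind of .screenrc I mean for the screen step; the specific hardstatus string is just an example and entirely a matter of taste:

cat > ~/.screenrc <<'EOF'
# show a permanent status line: hostname, window list, date and clock
hardstatus alwayslastline "%{= kG}[ %H ] %= %-w%{+b r}%n %t%{-} %+w %= [ %m/%d %c ]"
# keep a bigger scrollback buffer than the default
defscrollback 5000
EOF

And if you don’t yet have an SSH key-pair of your own, generating one (with a passphrase) is a one-liner. The email address is just a comment/label, so substitute your own:

ssh-keygen -t rsa -b 4096 -C "you@example.com"

The public half ends up in ~/.ssh/id_rsa.pub, and that’s the part you paste into authorized_keys as above.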

The next step is Installing obi-metrics-agent, Graphite, and collectl, but that’s a matter for another blog post :-)

OBIEE SampleApp in The Cloud: Importing VirtualBox Machines to AWS EC2


Virtualisation has revolutionised how we work as developers. A decade ago, using new software would mean trying to find room on a real tin server to install it, hoping it worked, and if it didn’t, picking apart the pieces probably leaving the server in a worse state than it was to begin with. Nowadays, we can just launch a virtual machine to give a clean environment and if it doesn’t work – trash it and start again.
The sting in the tail of virtualisation is that full-blown VMs are heavy – for disk we need several GB just for a blank OS, and dozens of GB if you’re talking about a software stack such as Fusion Middleware (FMW), and the host machine needs to have the RAM and CPU to support it all too. Technologies such as Linux Containers go some way to making things lighter by abstracting out a chunk of the OS, but this isn’t something that’s reached the common desktop yet.

So whilst VMs are awesome, it’s not always practical to maintain a library of all of them on your local laptop (even 1TB drives fill up pretty quickly), nor will your laptop have the grunt to run more than one or two VMs at most. VMs like this are also local to your laptop or server – but wouldn’t it be neat if you could duplicate that VM and make a server based on it instantly available to anyone in the world with an internet connection? And that’s where The Cloud comes in, because it enables us to store as much data as we can eat (and pay for), and provision “hardware” at the click of a button for just as long as we need it, accessible from anywhere.

Here at Rittman Mead we make extensive use of Amazon Web Services (AWS) and their Elastic Computing Cloud (EC2) offering. Our website runs on it, our training servers run on it, and it scales just as we need it to. A class of 3 students is as easy to provision for as a class of 24 – no hunting around for spare servers or laptops, no hardware sat idle in a cupboard as spare capacity “just in case”.

One of the challenges that we’ve faced up until now is that all servers have had to be built from scratch in the cloud. Obviously we work with development VMs on local machines too, so wouldn’t it be nice if we could build VMs locally and then push them to the cloud? Well, now we can. Amazon offer a route to import virtual machines, and in this article I’m going to show how that works. I’ll use the superb SampleApp v406 VM that Oracle provide, because this is a great real-life example of a VM that is so useful, but many developers can find too memory-intensive to be able to run on their local machines all the time.

This tutorial is based on exporting a Linux guest VM from a Linux host server. A Windows guest probably behaves differently, but a Mac or Windows host should work fine since VirtualBox is supported on both. The specifics are based on SampleApp, but the process should be broadly the same for all VMs. 

Obtain the VM

We’re going to use SampleApp, which can be downloaded from Oracle.

  1. Download the six-part archive from http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples-167534.html
  2. Verify the md5 checksums against those published on the download page:
    [oracle@asgard sampleapp406]$ ll
    total 30490752
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:33 SampleAppv406.zip.001
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:30 SampleAppv406.zip.002
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:03 SampleAppv406.zip.003
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:34 SampleAppv406.zip.004
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:19 SampleAppv406.zip.005
    -rw-r--r-- 1 oracle oinstall 4977591522 Sep  9 02:53 SampleAppv406.zip.006
    [oracle@asgard sampleapp406]$ md5sum *
    2b9e11f69ada5f889088dd74b5229322  SampleAppv406.zip.001
    f8a1a5ae6162b20b3e9c6c888698c071  SampleAppv406.zip.002
    68438cfea87e8d3a2e2f15ff00dadf12  SampleAppv406.zip.003
    b71d9ace4f75951198fc8197da1cfe62  SampleAppv406.zip.004
    4f1a5389c9e0addc19dce6bbc759ec20  SampleAppv406.zip.005
    2c430f87e22ff9718d5528247eff2da4  SampleAppv406.zip.006
  3. Unpack the archive using 7zip — the instructions for SampleApp are very clear that you must use 7zip, and not another archive tool such as winzip.
    [oracle@asgard sampleapp406]$ time 7za x SampleAppv406.zip.001
    
    7-Zip (A) [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
    p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,80 CPUs)
    
    Processing archive: SampleAppv406.zip.001
    
    Extracting SampleAppv406Appliance
    Extracting SampleAppv406Appliance/SampleAppv406ga-disk1.vmdk
    Extracting SampleAppv406Appliance/SampleAppv406ga.ovf
    
    Everything is Ok
    
    Folders: 1
    Files: 2
    Size: 31191990916
    Compressed: 5242880000
    
    real 1m53.685s
    user 0m16.562s
    sys 1m15.578s
  4. Because we need to change a couple of things on the VM first (see below), we’ll have to import the VM to VirtualBox so that we can boot it up and make these changes. You can import using the VirtualBox GUI, or as I prefer, the VBoxManage command line interface. I like to time all these things (just because, numbers), so stick a time command on the front:
    time VBoxManage import --vsys 0 --eula accept SampleAppv406Appliance/SampleAppv406ga.ovf

    This took 12 minutes or so, but that was on a high-spec system, so YMMV.
    [...]
    0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
    Successfully imported the appliance.
    
    real    12m15.434s
    user    0m1.674s
    sys     0m2.807s

Preparing the VM

Importing Linux VMs to Amazon EC2 will only work if the kernel is supported, which according to an AWS blog post includes Red Hat Enterprise Linux 5.1 – 6.5. Whilst SampleApp v406 is built on Oracle Linux 6.5 (which isn’t listed by AWS as supported), we have the option of telling the VM to use a kernel that is Red Hat Enterprise Linux compatible (instead of the default Unbreakable Enterprise Kernel – UEK). There are some other pre-requisites that you need to check if you’re trying this with your own VM, including a network adaptor configured to use DHCP. The aforementioned blog post has details.
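
Once the VM is booted (step 1 below) you can check which kernel it is currently running with uname. On a stock SampleApp v406 the version string will include uek, which is exactly what we are about to change the default away from:

uname -r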

  1. Boot the VirtualBox VM, which should land you straight in the desktop environment, logged in as the oracle user.
  2. We need to modify a file as root (superuser). Here’s how to do it graphically, or use vi if you’re a real programmer:
    1. Open a Terminal window from the toolbar at the top of the screen
    2. Enter
      sudo gedit /etc/grub.conf

      The sudo bit is important, because it tells Linux to run the command as root. (I’m on an xkcd-roll here: 1, 2)

    3. In the text editor that opens, you will see a header to the file and then a set of repeating sections beginning with title. These are the available kernels that the machine can run under. The default is 3, which is zero-based, so it’s the fourth title section. Note that the kernel version details include uek which stands for Unbreakable Enterprise Kernel – and is not going to work on EC2.
    4. Change the default to 0, so that we’ll instead boot to a Red Hat Compatible Kernel, which will work on EC2
    5. Save the file
  3. Optional steps:
    1. Whilst you’ve got the server running, add your SSH key to the image so that you can connect to it easily once it is up on EC2. For more information about SSH keys, see my previous blog post here, and a step-by-step for doing it on SampleApp here.
    2. Disable non-SSH key logins (in /etc/ssh/sshd_config, set PasswordAuthentication no and PubkeyAuthentication yes), so that your server once on EC2 is less vulnerable to attack. Particularly important if you’re using the stock image with Admin123 as the root password. A command-line sketch of this change follows this list.
    3. Set up screen, and OBIEE and the database as a Linux service, both covered in my article here.
  4. Shutdown the instance by entering this at a Terminal window:
    sudo shutdown -h now
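
For the sshd_config change in the optional steps above, one non-interactive way to do it is with sed. This is just a sketch (GNU sed syntax; review the resulting file, because the change only takes effect once sshd is restarted):

# flip password logins off and key logins on, keeping a backup of the original file
sudo sed -i.bak -e 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' \
                -e 's/^#\?PubkeyAuthentication.*/PubkeyAuthentication yes/' /etc/ssh/sshd_config
sudo service sshd restart

Make sure your SSH key is in place and working before you disable password logins, or you’ll lock yourself out.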

Export the VirtualBox VM to Amazon EC2

Now we’re ready to really get going. The first step is to export the VirtualBox VM to a format that Amazon EC2 can work with. Whilst they don’t explicitly support VMs from VirtualBox, they do support the VMDK format – which VirtualBox can create. You can do the export from the graphical interface, or as before, from the command line:

time VBoxManage export "OBIEE SampleApp v406" --output OBIEE-SampleApp-v406.ovf

0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully exported 1 machine(s).

real    56m51.426s
user    0m6.971s
sys     0m12.162s

If you compare the result of this to what we downloaded from Oracle it looks pretty similar – an OVF file and a VMDK file. The only difference is that the VMDK file is updated with the changes we made above, including the modified kernel settings which are crucial for the success of the next step.

[oracle@asgard sampleapp406]$ ls -lh
total 59G
-rw------- 1 oracle oinstall  30G Sep  9 10:55 OBIEE-SampleApp-v406-disk1.vmdk
-rw------- 1 oracle oinstall  15K Sep  9 09:58 OBIEE-SampleApp-v406.ovf

We’re ready now to get all cloudy. For this, you’ll need:

  1. An AWS account
    1. You’ll also need your AWS account’s Access Key and Secret Key
  2. AWS EC2 commandline tools installed, along with a Java Runtime Environment (JRE) 1.7 or greater:

    wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
    sudo mkdir /usr/local/ec2
    sudo unzip ec2-api-tools.zip -d /usr/local/ec2
    # You might need to fiddle with the following paths and version numbers: 
    sudo yum install -y java-1.7.0-openjdk.x86_64
    cat >> ~/.bash_profile <<EOF
    export JAVA_HOME="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre"
    export EC2_HOME=/usr/local/ec2/ec2-api-tools-1.7.1.1/
    export PATH=\$PATH:\$EC2_HOME/bin
    EOF

  3. Set your credentials as environment variables:
    export AWS_ACCESS_KEY=xxxxxxxxxxxxxx
    export AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxx
  4. Ideally a nice fat pipe to upload the VM file over, because at 30GB it is not trivial (not in 2014, anyway)

What’s going to happen now is we use an EC2 command line tool to upload our VMDK (virtual disk) file to Amazon S3 (a storage platform), from where it gets converted into an EBS volume (Elastic Block Store, i.e. an EC2 virtual disk), and from there attached to a new EC2 instance (a “server”/”VM”).

Before we can do the upload we need an S3 “bucket” to put the disk image in that we’re uploading. You can create one from https://console.aws.amazon.com/s3/. In this example, I’ve got one called rmc-vms – but you’ll need your own.

Once the bucket has been created, we build the command line upload statement using ec2-import-instance:

time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY

Points to note:

  • m3.large is the spec for the VM. You can see the available list here. In the AWS blog post it suggests only a subset will work with the import method, but I’ve not hit this limitation yet.
  • region is the AWS Region in which the EBS volume and EC2 instance will be built. I’m using eu-west-1 (Ireland), and it makes sense to use the one geographically closest to where you or your users are located. Still waiting for uk-yorks-1
  • architecture and platform relate to the type of VM you’re importing.

The upload process took just over 45 minutes for me, and that’s from a data centre with a decent upload:

[oracle@asgard sampleapp406]$ time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY
Requesting volume size: 200 GB
TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       0       Status       active  StatusMessage   Pending : Downloaded 0
Creating new manifest at rmc-vms/d77672aa-0e0b-4555-b368-79d386842112/OBIEE-SampleApp-v406-disk1.vmdkmanifest.xml
Uploading the manifest file
Uploading 31191914496 bytes across 2975 parts
0% |--------------------------------------------------| 100%
   |==================================================|
Done
Average speed was 11.088 MBps
The disk image for import-i-fh08xcya has been uploaded to Amazon S3
where it is being converted into an EC2 instance.  You may monitor the
progress of this task by running ec2-describe-conversion-tasks.  When
the task is completed, you may use ec2-delete-disk-image to remove the
image from S3.

real    46m59.871s
user    10m31.996s
sys     3m2.560s

Once the upload has finished Amazon automatically converts the VMDK (now residing on S3) into a EBS volume, and then attaches it to a new EC2 instance (i.e. a VM). You can monitor the status of this task using ec2-describe-conversion-tasks, optionally filtered on the TaskId returned by the import command above:

ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       3898992128
Status  active  StatusMessage   Pending : Downloaded 31149971456

This is now an ideal time to mention, as a side note, the Linux utility watch, which simply re-issues a command for you every x seconds (2 by default). This way you can leave a window open and keep an eye on the progress of what is going to be a long-running job.

watch ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

Every 2.0s: ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya                                                             Tue Sep  9 12:03:24 2014

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5848511808
Status  active  StatusMessage   Pending : Downloaded 31149971456

And whilst we’re at it, if you’re using a remote server to do this (as I am, to take advantage of the large bandwidth), you will find screen invaluable for keeping tasks running and being able to reconnect at will. You can read more about screen and watch here.

So back to our EC2 import job. To start with, the task will be Pending: (NB unlike lots of CLI tools, you read the output of this one left-to-right, rather than as columns with headings)

$ ec2-describe-conversion-tasks --region eu-west-1
TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  active  StatusMessage   Pending InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       4707330352      Status  active  StatusMessage   Pending : Downloaded 5010658304

After a few moments it gets underway, and you can see a Progress percentage indicator: (scroll right in the code snippet below to see)

TaskType        IMPORTINSTANCE  TaskId  import-i-fgr0djcc       ExpirationTime  2014-09-15T15:39:28Z    Status  active  StatusMessage   Progress: 53%   InstanceID      i-c7692e87
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5582545920      VolumeId        vol-f71368f0    VolumeSize      20      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5582536640      Status  completed

Note that at this point you’ll also see an Instance in the EC2 list, but it won’t launch (no attached disk – because it’s still being imported!)

If something goes wrong you’ll see the Status as cancelled, such as in this example here where the kernel in the VM was not a supported one (observe it is the UEK kernel, which isn’t supported by Amazon):

TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  cancelled       StatusMessage   ClientError: Unsupported kernel version 2.6.32-300.32.1.el5uek       InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeId        vol-91b1c896    VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted    5021128688      Status  completed

After an hour or so, the task should complete:

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  completed       InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeId        vol-a383f8a4    VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBy
tesConverted    31191855472     Status  completed

At this point you can remove the VMDK from S3 (and should do, else you’ll continue to be charged for it), following the instructions for ec2-delete-disk-image.

Booting the new server on EC2

Go to your EC2 control panel, where you should see an instance (EC2 term for “server”) in Stopped state and with no name.

Select the instance, and click Start on the Actions menu. After a few moments a Public IP will be shown in the details pane. But, we’re not home free quite yet…read on.

Firewalls

So this is where it gets a bit tricky. By default, the instance will have launched with Amazon’s Firewall (known as a Security Group) in place which – unless you have an existing AWS account and have modified the default security group’s configuration – is only open on port 22, which is for ssh traffic.

You need to head over to the Security Group configuration page, accessed in several ways but easiest is clicking on the security group name from the instance details pane:

Click on the Inbound tab and then Edit, and add “Custom TCP Rule” for the following ports:

  • 7780 (OBIEE front end)
  • 7001 (WLS Console / EM)
  • 5902 (oracle VNC)

You can make things more secure by allowing access to the WLS admin (7001) and VNC port (5902) to a specific IP address or range only.
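
If you’d rather script this than click through the console, the same ec2-api-tools can add the rules. Treat this as a sketch only: check ec2-authorize --help for the exact flags in your version of the tools, and substitute your own security group name and source range for the 'default' group and documentation address range used here:

# open the OBIEE front end to the world, and the admin/VNC ports to an example range only
ec2-authorize default --region eu-west-1 -P tcp -p 7780 -s 0.0.0.0/0
ec2-authorize default --region eu-west-1 -P tcp -p 7001 -s 198.51.100.0/24
ec2-authorize default --region eu-west-1 -P tcp -p 5902 -s 198.51.100.0/24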

Whilst we’re talking about security, your server is now open to the internet and all the nefarious persons out there, so you’ll be wanting to harden your server not least by resetting all the passwords to ones which aren’t publicly documented in the SampleApp user documentation!

Once you’ve updated your Security Group, you can connect to your server! If you installed the OBIEE and database auto start scripts (and if not, why not??) you should find OBIEE running just nicely on http://[your ip]:7780/analytics – note that the port is 7780, not 9704.


If you didn’t install the script, you will need to start the services manually per the SampleApp documentation. To connect to the server you can ssh (using Terminal, PuTTY, etc) to the server or connect on VNC (Admin123 is the password). For VNC clients try Screen Share on Macs (installed by default), or RealVNC on Windows.

Caveats & Disclaimers

  • Running a server on AWS EC2 costs real money, so watch out. Once you’ve put your credit card details in, Amazon will continue to charge your card whilst there are chargeable items on your account (EBS volumes, instances whether running or not, and so on). You can get an idea of the scale of charges here.
  • As mentioned above, a server on the open internet is a lot more vulnerable than one virtualised on your local machine. You will get poked and probed, usually by automated scripts looking for open ports, weak passwords, and so on. SampleApp is designed to open the toybox of a pimped-out OBIEE deployment to you, it is not “hardened”, and you risk learning the tough way about the need for it if you’re not careful.

Cloning

Amazon EC2 supports taking a snapshot of a server, either for backup/rollback purposes or spinning up as a clone, using an Amazon Machine Image (AMI). From the Instances page, simply select “Create an Image” to build your AMI. You can then build another instance (or ten) from this AMI as needed, exact replicas of the server as it was at the point that you created the image.
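
You can do the same from the command line with the ec2-api-tools if you prefer. This is a sketch, re-using the instance id from the import above and an arbitrary image name and description of my own choosing:

ec2-create-image i-b07d3bf0 --region eu-west-1 -n "sampleapp-v406-baseline" -d "SampleApp v406 imported from VirtualBox"

Note that by default the instance is rebooted whilst the image is taken, so time it accordingly.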

Lather, Rinse, and Repeat

There’s a whole host of VirtualBox “appliances” out there, and some of them such as the developer-tools-focused ones only really make sense as local VMs. But there are plenty that would benefit from a bit of “Cloud-isation”, where they’re too big or heavy to keep on your laptop all the time, but are handy to be able to spin up at will. A prime example of this for me is the EBS Vision demo database that we use for our BI Apps training. Oracle used to provide a pre-built Amazon image (known as an AMI) of this, but has since withdrawn it. However, Oracle do publish Oracle VM VirtualBox templates for EBS 12.1.3 and 12.2.3 (related blog), so from this with a bit of leg-work and a big upload pipe, it’s a simple matter to brew your own AWS version of it, ready to run whenever you need it.


First-timer tips for Oracle Open World


Last week I had the great pleasure to attend Oracle Open World (OOW) for the first time, presenting No Silver Bullets – OBIEE Performance in the Real World at one of the ODTUG user group sessions on the Sunday. It was a blast, as the saying goes, but the week before OOW I was more nervous about the event itself than my presentation. Despite having been to smaller conferences before, OOW is vast in its scale and I felt like I did the week before going to university for the first time, full of uncertainty about what lay ahead and worrying that everyone would know everyone else except me! So during the week I jotted down a few things that I’d have found useful to know ahead of going, and that hopefully will help others going to OOW take it all in their stride from the very beginning.

Coming and going

I arrived on the Friday at midday SF time, and it worked perfectly for me. I was jetlagged so walked around like a zombie for the remainder of the day. Saturday I had the chance to walk around SF and get my bearings geographically, culturally, and climate-wise. Sunday is “day zero” when all the user group sessions are held, along with the opening OOW keynote in the evening. I think if I’d arrived Saturday afternoon instead I’d have felt a bit thrust into it all straight away on the Sunday.

In terms of leaving, the last formal day is Thursday and it’s a full day of sessions too. I left straight after breakfast on Thursday and felt I was leaving too early. But OOW is a long few days & nights, so chances are by Thursday you’ll be beat anyway; check the schedule and plan your escape around it.

Accommodation

Book in advance! Like, at least two months in advance. There are 60,000 people descending on San Francisco, all wanting some place to stay.

Get an Airbnb; you get a lot more for your money than a hotel. Wifi is generally going to be a lot better, and having a living space in which to exist is nicer than just a hotel room. Don’t fret about the “perfect” location – anywhere walkable to Moscone (where OOW is held) is good because it means you can drop your rucksack off at the end of the day etc, but other than that the events are spread around so you’ll end up walking further to at least some of them. Or, get an Uber like the locals do!

Sessions

Go to Oak Table World (OTW), it’s great, and free. Non-marketing presentations from some of the most respected speakers in the industry. Cuts through the BS. It’s also basically on the same site as the rest of OOW, so easy to switch back and forth between OOW/OTW sessions.

Go and say hi to the speakers. In general they’re going to want to know that you liked it. Ask questions — hopefully they like what they talk about so they’ll love to speak some more about it. You’ll get more out of a five minute chat than two hours of keynote. And on that subject, don’t fret about dropping sessions — people tweet them, the slides are usually available, and in fact you could be sat at your desk instead of OOW and have missed the whole lot so just be grateful for what you do see. Chance encounters and chats aren’t available for download afterwards; most presentations are. Be strict in your selection of “must see” sessions, lest you drop one you really really did want to see.

Use the schedule builder in advance, but download it to your calendar (watch out for line-breaks in the exported file that will break the import) and sync it to your mobile phone so you can see rapidly where you need to head next. Conference mobile apps are rarely that useful and frequently bloated and/or unstable.

Don’t feel you need to book every waking moment of every day to sessions. It’s not slacking if you go to half as many but are twice as effective from not being worn out!

Dress

Dress-wise, jeans and a polo are fine, with a company polo or a shirt for delivering presentations. Day wear is fine for evenings too, no need to dress up. Some people do wear shorts but they’re in the great minority. There are lots of suits around, given it is a customer/sales conference too.

Socialising

The sessions and random conversations with people during the day are only part of OOW — the geek chat over a beer (or soda) is a big part too. Look out for the Pythian blogger meetup, meetups from your country’s user groups, companies you work with, and so on.

Register for the evening events that you get invited to (ODTUG, Pythian, etc) because often if you haven’t pre-registered you can’t get in if you change your mind, whereas if you do register but then don’t go that’s fine as they’ll bank on no-shows. The evening events are great for getting to chat to people (dare I say, networking), as are the other events that are organised like the swim in the bay, run across the bridge, etc.

Sign up for stuff like the swim in the bay; it’s good fun, and I can’t even swim really. The run and bike across the bridge are two other organised events. Hang around on twitter for details; people like Yury Velikanov and Jeff Smith are usually in the know, if not doing the actual organising.

General

When the busy days and long evenings start to take their toll don’t be afraid to duck out and go and decompress. Grab a shower, get a coffee, do some sight seeing. Don’t forget to drink water as well as the copious quantities of coffee and soda.

Get a data package for your mobile phone in advance of going, e.g. £5 per day unlimited data. Conference wifi is just about OK at best, often flaky. Trying to organise short-notice meetups with other people by IM/twitter/email gets frustrating if you only get online half an hour after the time they suggested to meet!

Don’t pack extra clothes ‘just in case’. Pack minimally because (1) you are just around the corner from Market Street with Gap, Old Navy etc so can pick up more clothes cheaply if you need to and (2) you’ll get t-shirts from exhibitors, events (eg swim in the bay) and you’ll need the suitcase space to bring them all home. Bring a suitcase with space in or that expands, don’t arrive with a suitcase that’s already at capacity.

Food

So much good food and beer. Watch out for some of the American beers; they seem to start at about 5% ABV and go upwards, compared to around 3.6% ABV here in the UK. Knocking back the former at the same rate as the latter will get messy.

In terms of food you really are spoilt, some of my favourites were:

  • Lori’s diner (map) : As a Brit, I loved this American diner, and great food - yum yum. 5-10 minutes walk from Moscone.
  • Mel’s drive-in (map) : Just round the corner from Moscone, very busy but lots of seats. Great american breakfast experience! yum
  • Grove (map) : Good place for breakfast if you want somewhere a bit less greasy than a diner (WAT!)

 

Using rlwrap with Apache Hive beeline for improved readline functionality


rlwrap is a nice little wrapper in which you can invoke commandline utilities and get them to behave with full readline functionality just like you’d get at the bash prompt. For example, up/down arrow keys to move between commands, but also home/end to go to the start/finish of a line, and even ctrl-R to search through command history to rapidly find a command. It’s one of the standard config changes I’ll make to any system with Oracle’s sqlplus on, and it works just as nicely with Apache Hive’s commandline interface, beeline.

beeline comes with some of this functionality (up/down arrow) but not all (for me, it was ‘home’ and ‘end’ not working, printing 1~ and 5~ respectively instead, that prompted me to set up rlwrap with it).

Installing rlwrap

To install rlwrap simply add the EPEL yum packages to your repository configuration:

sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/`uname -p`/epel-release-6-8.noarch.rpm

and then install rlwrap from yum:

sudo yum install -y rlwrap

Use

Once rlwrap is installed you can invoke beeline through it manually, specifying all the standard beeline options as you would normally: (I’ve used the \ line continuation character here just to make the example nice and clear)

rlwrap -a beeline \
-u jdbc:hive2://bdanode1:10000 \
-n rmoffatt -p password \
-d org.apache.hive.jdbc.HiveDriver

Now I can connect to beeline, and as before I press up arrow to access commands from when I previously used the tool, but I can also hit ctrl-R to start typing part of a command to recall it, just like I would in bash. Some other useful shortcuts:

  • Ctrl-l clears the screen but with the current line still shown
  • Ctrl-k deletes to the end of the line from the current cursor position
  • Ctrl-u deletes to the beginning of the line from the current cursor position
  • Esc-f moves forward one word
  • Esc-b moves backward one word
    (more here)

And most importantly, Home and End work just fine! (or, ctrl-a/ctrl-e if you prefer).

NB the -a argument for rlwrap is necessary because beeline already does some readline-esque functions, and we want rlwrap to forcibly override them (otherwise neither works very well). Or more formally (from man rlwrap):

Always remain in “readline mode”, regardless of command’s terminal settings. Use this option if you want to use rlwrap with commands that already use readline.

Alias

A useful thing to do is to add an alias directly in your profile so that it is always available to launch beeline under rlwrap, in this case as the rlbeeline command:

# this sets up "rlbeeline" as the command to run beeline
# under rlwrap, you can call it what you want though. 
cat >> ~/.bashrc<<EOF
alias rlbeeline='rlwrap -a beeline'
EOF
# example usage:
# rlbeeline \
# -u jdbc:hive2://bdanode1:10000 \
# -n rmoffatt -p password \
# -d org.apache.hive.jdbc.HiveDriver

If you want this alias available for all users on a machine create the above as a standalone .sh file in /etc/profile.d/.
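
For example (the filename is arbitrary; anything ending in .sh under /etc/profile.d/ is sourced at login):

cat > /tmp/rlbeeline.sh <<EOF
alias rlbeeline='rlwrap -a beeline'
EOF
sudo mv /tmp/rlbeeline.sh /etc/profile.d/rlbeeline.sh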

Autocomplete

One possible downside of using rlwrap with beeline is that you lose the native auto-complete option within beeline for the HiveQL statements. But never fear – we can have the best of both worlds, with the -f argument for rlwrap, specifying a list of custom auto-completes. So this is even a level-up for beeline, because we could populate it with our own schema objects and so on that we want auto-completed.

As a quick-start, run beeline without rlwrap, hit tab twice and then ‘y’ to show all options, and paste the resulting list into a text file (eg beeline_autocomplete.txt). Now call beeline, via rlwrap, passing that file as an argument to rlwrap:

rlwrap -a -f beeline_autocomplete.txt beeline

Once connected, use auto-complete just as you would normally (hit tab after typing a character or two of the word you’re going to match):

Connecting to jdbc:hive2://bdanode1:10000
Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
[...]
Beeline version 0.12.0-cdh5.0.1 by Apache Hive
0: jdbc:hive2://bdanode1:10000> SE
SECOND        SECTION       SELECT        SERIALIZABLE  SERVER_NAME   SESSION       SESSION_USER  SET
0: jdbc:hive2://bdanode1:10000> SELECT

Conclusion

rlwrap is the tool that keeps on giving; just as I was writing this article, I noticed that it also auto-highlights opening parentheses when typing the closing one. Nice!

Monitoring OBIEE with the ELK stack


Monitoring the health of an OBIEE system and diagnosing problems that may occur is a vital task for the system’s administrator and support staff. It’s one that at Rittman Mead we help customers with implementing themselves, and also provide as a managed service. In this article I am going to discuss the ELK stack, which fills a specific gap between the high-level monitoring and configuration functionality of Enterprise Manager 11g Fusion Middleware Control, and the Enterprise-grade monitoring, alerting and configuration management of Enterprise Manager 12c Cloud Control.

The ELK stack enables you to rapidly access both summary and detail information across the stack, supporting swift identification and diagnosis of any issues that may occur. The responsive interface lets you drill into time periods or any ad-hoc field or filter as you wish, to analyse and diagnose problems. Data can be summarised and grouped arbitrarily, displaying relative error rates and ensuring that genuine problems are not lost in the ‘noise’ of usual operation.

Out of the box, OBIEE ships with Enterprise Manager 11g Fusion Middleware Control (FMC), which as the name says is part of the Enterprise Manager line of tools from Oracle for managing systems. It is more of a configuration and deployment tool than it is really a monitoring and diagnostics one. The next step up is FMC’s (very) big brother, Enterprise Manager 12c Cloud Control (EM12c). This is very much its own product, requiring its own infrastructure and geared up to monitoring an organisation’s entire fleet of [Oracle] hardware and software. With this greatly enhanced functionality in EM12c also comes a license cost. The ELK stack conceptually fits perfectly alongside your existing EM FMC, providing a most excellent OBIEE monitoring dashboard and analysis tool, and allowing you to explore the kind of diagnostics and historical data that you could have access to in EM 12c.

In ELK we can see at a glance what kind of relative activity there has been on the system over the past few days:

There have been some errors, and the top three nQS and ORA error codes and messages are shown. This is an important differentiator to EM, where you can search for errors but cannot see straightaway if an error is a one-off or has multiple occurrences. By grouping by error message it’s possible to quickly see what the biggest problem on a system may currently be:

At this point we might want to drill down into what was being run when the errors were being thrown. For example, from the error summary alone we can see the biggest problem was a locked database account – but which database was being queried? Lower down the dashboard page is a list of log details, and by clicking on the search icon against an error message we can filter the results shown:

We can use the search icon again to restrict results by ECID

And from there see all the related log entries, including which connection pool the request was against (and thus which database account is locked)

Another way of diagnosing a sudden rash of errors would be to instead drill down on time alone to take a more holistic view of the logs (useful also given that ECIDs don’t always give the full picture). Using the system activity timeline along with the events log view it is a piece of cake to do this – simply click and drag a time window on the chart to instantly zoom into it.

Taking a step back up, we can see at a glance which areas of the OBIEE metadata model (RPD) are being used, as well as where we are pulling logs from – and all of these are clickable in order to filter the results further. So it’s easy to see, for a given subject area, what’s the current error rate? Or to quickly access all the log files for a specific set of components alone (for example, BI Server and OPMN). Any field that is displayed, whether in a chart or a detailed log view, can be clicked and used as the input for an ad-hoc filter.

It’s not just errors and logs we can monitor – the current and trending performance of the system (or a part of it; note the filtering by subject area and database described above) can be observed and of course, drilled into:

The ELK stack

The ELK stack is a suite of free software made up of three tools, the first letter of each giving it its name:

  • ElasticSearch
  • Logstash
  • Kibana

At a very high level, we collect and enrich diagnostic data from log files using logstash, store it in ElasticSearch, and present and analyse it through Kibana.

  • ElasticSearch is a document store, in which data with no predefined structure can be stored. Its origins and core strength are in full text search of any of the data held within it, and it is this that differentiates it from pure document stores such as MongoDB that Mark Rittman wrote about recently. Data is loaded and retrieved from ElasticSearch through messages sent over the HTTP protocol, and one of the applications that can send data this way and works extremely well is Logstash.
  • Logstash is an innocuous looking tool that at first glance one could mistakenly write off as “just” a log parser. It does a lot more than that and a healthy ecosystem of input, filter, codec and output plugins means that it can interface between a great variety of applications, shifting data from one to another and optionally processing and enriching it along the way.
  • The final piece of the stack is Kibana, a web application that enables one to build very flexible and interactive time-based dashboards, sourcing data from ElasticSearch. Interestingly, another of my favourite tools that I have written about before – and will write about again in this article – is Grafana which is forked from Kibana (and modified to source its data from time-series databases like graphite/carbon/whisper or InfluxDB) – thus if you’re at home with one you will be the other.

In this article I’m going to show how to set up your own ELK stack to monitor OBIEE, based on SampleApp v406.

Who is this for?

As you will see below, setting up and configuring the ELK stack does involve rolling up ones sleeves and diving right in. If you’re looking for an off-the-shelf monitoring solution then you should look elsewhere (such as EM12c). But if you want to have a crack at it I think you’ll be pleasantly surprised at what is possible once you get past the initially (bumpy) learning curve. The capabilities are great, and there’s an active support community as is the case with lots of open-source tools. With a bit of work it is possible to create a monitoring environment tailored pretty much entirely to your design.

Installing the stack

ELK runs on all common linux distributions (including Oracle Linux), as well as Mac OS. The only prerequisites are a JDK for ElasticSearch and Logstash, and a web server for Kibana; here I am using Apache.

First up, let’s install JDK 1.7 (SampleApp has 1.6, which isn’t enough):

sudo yum install -y java-1.7.0-openjdk.x86_64

Apache is already installed on SampleApp, which we can verify thus:

[oracle@demo ~]$ sudo yum install -y httpd
Loaded plugins: refresh-packagekit
Setting up Install Process
Package httpd-2.2.15-29.0.1.el6_4.x86_64 already installed and latest version
Nothing to do
[oracle@demo ~]$ sudo service httpd status
httpd is stopped

It’s shut down by default and that’s fine, because we need to update its configuration anyway.

ElasticSearch

The easiest way to install ElasticSearch is using the yum repository:

sudo rpm --import http://packages.elasticsearch.org/GPG-KEY-elasticsearch

cat > /tmp/elasticsearch.repo<<EOF
[elasticsearch-1.3]
name=Elasticsearch repository for 1.3.x packages
baseurl=http://packages.elasticsearch.org/elasticsearch/1.3/centos
gpgcheck=1
gpgkey=http://packages.elasticsearch.org/GPG-KEY-elasticsearch
enabled=1
EOF

sudo mv /tmp/elasticsearch.repo /etc/yum.repos.d/
sudo yum install -y elasticsearch

I’d then set it to start at boot automagically:

sudo chkconfig elasticsearch on

and then start it up:

sudo service elasticsearch start

One final, optional, step in the installation is a plugin called kopf which gives a nice web dashboard for looking at the status of ElasticSearch:

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/
cd /usr/share/elasticsearch/bin
sudo ./plugin -install lmenezes/elasticsearch-kopf

logstash

There’s no repository for logstash, but it’s no biggie because there’s no install as such, just a download and unpack. Grab the download archive for logstash from the ELK download page, and then unpack it:

cd ~/Downloads
wget https://download.elasticsearch.org/logstash/logstash/logstash-1.4.2.tar.gz
# Need to use sudo because /opt is owned by root
sudo tar -xf logstash-1.4.2.tar.gz --directory /opt/
sudo mv /opt/logstash-1.4.2/ /opt/logstash/
sudo chown -R oracle. /opt/logstash/

Kibana

As with logstash, Kibana just needs downloading and unpacking. There’s also a wee bit of configuration to do, so that the web server (Apache, in our case) knows to talk to it, and so that Kibana knows how to find ElasticSearch.

cd ~/Downloads/
wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.0.tar.gz
sudo tar -xf kibana-3.1.0.tar.gz --directory /opt
sudo mv /opt/kibana-3.1.0/ /opt/kibana/
sudo chown -R oracle. /opt/kibana/

Now to configure Apache, telling it where to find Kibana. If you have existing sites configured, you’ll need to sort this bit out yourself, but on a vanilla SampleApp v406 you can use the following sed command to set up the needful:

sudo sed -i'.bak' -e 's/DocumentRoot.*$/DocumentRoot "\/opt\/kibana\/"/g' /etc/httpd/conf/httpd.conf

Lastly, Kibana needs to know where to find ElasticSearch, which is where it is going to pull its data from. An important point here is that the URL of ElasticSearch must be resolvable and accessible from the web browser you run Kibana in, so if you are using a DNS name it must resolve, etc. You can update the configuration file config.js by hand (it’s the elasticsearch: definition that needs updating), or use this sed command:

sed -i'.bak' -e 's/^\s*elasticsearch:.*$/elasticsearch: "http:\/\/demo.us.oracle.com:9200",/g' /opt/kibana/config.js

Finally, [re]start Apache so that it uses the new configuration:

sudo service httpd restart

You should be able to now point your web browser at the server and see the default Kibana dashboard. So for sampleapp, if you’re running Firefox locally on it, the URL would simply be http://localhost/ (port 80, so no need to specify it in the URL). Note that if you’re doing anything funky with network, your local web browser needs to be able to hit both Apache (port 80 by default), and ElasticSearch (port 9200 by default).

Configuring ELK end-to-end

Now that we’ve got the software installed, let’s see how it hangs together and create our first end-to-end example. There’s a good logstash tutorial here that covers a lot of the functionality. Here, I’ll just look at some of the very basics, creating a very simple logstash configuration which will prompt for input (i.e. stdin) and send it straight to ElasticSearch. The kopf plugin that we installed above can show that the data made it to ElasticSeach, and finally we will create a very simple Kibana dashboard to demonstrate its use.

Logstash works by reading a configuration file and then running continually waiting for the configured input. As well as the input we configure an output, and optionally in between we can have a set of filters. For now we will keep it simple with just an input and output. Create the following file in /opt/logstash and call it logstash-basic.conf:

input {
        stdin {}
        }
output {
        elasticsearch {}
        }

It’s pretty obvious what it’s saying – for the input, use stdin, and send it as output to elasticsearch (which will default to the localhost).
Run this with logstash:

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/
cd /opt/logstash
bin/logstash -f logstash-basic.conf

After a few moments you should get a prompt. Enter some text, and nothing happens… apparently. What’s actually happened is that the text, plus some information such as the current timestamp, has been sent to ElasticSearch.

Let’s see where it went. In a web browser, go to http://localhost:9200/_plugin/kopf/. 9200 is the port on which ElasticSearch listens by default, and kopf is a plugin we can use to inspect ElasticSearch’s state and data. ElasticSearch has a concept of an index, in which documents, maybe of the same repeating structure but not necessarily, can be stored. Crudely put, this can be seen as roughly analogous to tables and rows of data respectively. When logstash sends data to ElasticSearch it creates one index per day, and in kopf now you should see an index with the current date, with a “document” for each line you entered after running logstash:

ElasticSearch can be queried using HTTP requests, and kopf gives a nice way to construct these and see the results which are in JSON format. Click on the rest (as in, REST API) menu option, leave the request as default http://localhost:9200/_search, and click send. You’ll see in the response pane a chunk of JSON in amongst which are the strings that you’ve entered to logstash:
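
You don’t need kopf to issue these requests; curl against the same REST API works just as well. For example, to run the default match-all search with readable output, or to search for a string you entered at the logstash prompt (substitute your own text for hello):

curl 'http://localhost:9200/_search?pretty'
curl 'http://localhost:9200/_search?q=message:hello&pretty'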

Enter a few more lines into the logstash prompt, and then head over to http://localhost/ where you should find the default dashboard, and click on the Logstash Dashboard option:

It’s fairly bare, because there’s very little data. Notice how you have a histogram of event rates over the past day at the top, and then details of each event at the bottom. There are two things to explore here. First up, go and enter a bit more data into logstash, so that the created events are spread out over time. Click the refresh icon on the Kibana dashboard, and then click-drag to select just the period on the chart that has data. This will zoom in on it and you’ll see in greater definition when the events were created. Go and click on one of the event messages in the lower pane and see how it expands, showing the value of each field – including message, which is what logstash sent through from its input to output.

Now let’s get some proper data in, by pointing logstash at the BI Server log (nqserver.log). Create a new configuration file, logstash-obi.conf, and build it up as follows. First we’ll use the file input to get data from …wait for it… a file! The syntax is fairly obvious:

input {
        file {
                path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
                }
        }

Now we need to tell Logstash how to interpret the file. By default it’ll chuck every line of the log to ElasticSearch, with the current timestamp – rather than the timestamp of the actual event.

Now is the time to introduce the wonderful world of the grok. A grok is one of the most important of the numerous filter plugins that are available in logstash. It defines expected patterns of content in the input, and maps it to fields in the output. So everything in a log message, such as the timestamp, user, ecid, and so on – all can be extracted from the input and stored as distinct items. They can also be used for further processing – such as amending the timestamp output from the logstash event to that of the log file line, rather than the system time at which it was processed.

So, let us see how to extract the timestamp from the log line. An important part of grok’ing is patterns. Grok statements are written as regular expressions, or regex (obXKCD), so to avoid continual wheel-reinventing of regex statements for common objects (time, ip addresses, etc) logstash ships with a bunch of these predefined, and you can build your own too. Taking a line from nqserver.log we can see the timestamp matches the ISO 8601 standard:

So our grok will use the pre-defined pattern TIMESTAMP_ISO8601, and then everything else (“GREEDYDATA”) after the timestamp, map to the log message field. The timestamp is in square brackets, which I’ve escaped with the backslash character. To indicate that it’s a grok pattern we want to match, it’s enclosed in %{ } markers.

\[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:log_message}

This can be broken down as follows:

\[                                  The opening square bracket, escaped by \
%{TIMESTAMP_ISO8601:timestamp}      Capture an ISO 8601 timestamp, store it in a field called 'timestamp'
\]                                  The closing square bracket, escaped by \
%{GREEDYDATA:log_message}           Capture everything else ('GREEDYDATA' is also a grok pattern) and store it in the 'log_message' field

A grok operator in logstash is part of the filter processing, so we need a new stanza in the configuration file, after input and before output. Note that the grok operator is matching our pattern we built above against the message field, which is pre-populated by default by the input stream. You can grok against any field though.

input {
        file {
                path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
                }
        }
filter {
        grok {
                match => ["message","\[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:log_message}"]
            }
        }
output {
        elasticsearch {}
        }

Now we can see in the resulting capture we’ve extracted the timestamp to a field called “timestamp”, with the remainder of the field in “log_message”

But the actual timestamp that we have attached to the event stored in ElasticSearch – a special field called @timestamp – is still reflecting the time at which logstash read the logfile entry (30th September), rather than when the logfile entry was created (11th June). To fix this, we use a second filter plugin (grok being the first), the date filter:

date {
        match => ["timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSZZ"]
}

This matches the timestamp field that we captured with the grok, and converts it into the special @timestamp field of the event, using the mask specified. Now if we go back to kopf, the ElasticSearch admin tool, we can see that a new index has been created, with the date of log entry that we just parsed and correctly extracted the actual date from:
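
For reference, pulling the grok and date snippets together, the filter section of logstash-obi.conf now looks like this:

filter {
        grok {
                match => ["message","\[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:log_message}"]
                }
        date {
                match => ["timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSZZ"]
                }
        }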

And over in Kibana if you set the timeframe long enough in the filter at the top you’ll be able to find the log entries showing up, now with the correct timestamp. Note that ElasticSearch accounts for the timezone of the message, storing it in UTC:

A large part of setting up your own ELK stack is this configuration of the filters in order to imbue and extract as much information from the log as you want. Grok filters are just the beginning – you can use conditional paragraphs to grok certain fields or logs for certain strings (think, error messages and ORA codes), and you can mutate output to add in your own fields and tags based on what has been read. The point of doing this is that you enrich what is in the logs by giving the different pieces of data meaning, which you can then use to drive the Kibana dashboard and its filtering.
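
As a taste of what that looks like, here is a hedged sketch (the tag names are my own, not a convention) that tags any log line containing an ORA- or nQSError code, so that the dashboard can filter or count on them:

filter {
        # placed after the grok filter above, so that log_message exists
        if [log_message] =~ /ORA-[0-9]+/ {
                mutate { add_tag => ["ora_error"] }
                }
        if [log_message] =~ /nQSError/ {
                mutate { add_tag => ["nqs_error"] }
                }
        }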

Tips for using logstash

To think of the logstash .conf file as merely “configuration” in the same way a simple .ini is would be to underestimate it. In effect you are writing a bit of data transformation code, so brace yourself – but the silver lining is that whatever you want to do, you probably can. Logstash is a most flexible and powerful piece of software, and one in which the model of input, filter, codec and output work very well.

In the spirit of code development, here are some tips you may find useful:

  • Grok patterns can inherit. Study the provided patterns (in the logstash patterns folder), and build your own. Look for commonality across files and look to reuse as much as possible. If you have multiple unrelated patterns for every type of log file in FMW then you’ve done it wrong.
  • When you’re building grok statements and patterns, use the superb http://grokdebug.herokuapp.com. The advantage of this over a standard regex debugger is that it supports the grok syntax of capture groups and even custom patterns.
  • For building up and testing regex from scratch, there are some useful sites:

    On the Mac there is a useful tool called Oyster which does the same as the above sites but doesn’t require an Internet connection.

  • By default a failed grok will add the tag _grokparsefailure to the event. If you have multiple grok clauses, give each its own unique tag_on_failure value so that you can see from the output message which grok statement failed.
    grok {
        match => [  "message", "%{WLSLOG}"
        ]
        tag_on_failure => ["_grokparsefailure","grok03"]
    }

  • You can use # as a comment character, so comment your code liberally, particularly whilst you’re finding your way with the configuration language so you can find your way back from the gingerbread house.
  • grok’ing takes resources, so target your grok statements by using conditionals. For example, to only parse nqquery logs for an expected string, you could write:
    if [path] =~ /nqquery/ {
        # Do your grok here
        }

    NB the =~ denotes a regex match, and the / characters delimit the regex string.

  • Use a conditional to split the output, writing and grok failures to stdout and/or a file, so that it is easier to see what’s failing and when. Note the use of the rubydebug codec here to prettify output sent to stdout.
    output {
        if "_grokparsefailure" not in [tags] {
            elasticsearch { }
        } else {
            file { path => "unprocessed.out" }
            stdout { codec => rubydebug }
        }
    }

  • Use the multiline codec to work with log messages that span multiple lines, and watch out for newline characters – grok does not like them, so use a mutate gsub to strip them out. A sketch of this is shown below.
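    A minimal sketch of that combination, assuming each new log entry begins with a [ timestamp and that the joined-up message should have its newlines replaced with spaces:
    input {
        file {
            path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
            # Any line not starting with [ belongs to the previous event
            codec => multiline {
                pattern => "^\["
                negate => true
                what => "previous"
            }
        }
    }
    filter {
        # Newlines upset grok, so swap them for spaces before any grok filters run
        mutate {
            gsub => ["message", "\n", " "]
        }
    }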

Building a Kibana dashboard

Now that we’ve got the data streaming through nicely from logstash into ElasticSearch, let us take a look at building dashboards against it in Kibana. Kibana is a web application that runs within an existing web server such as Apache, and it builds dashboards from data stored in ElasticSearch.

The building blocks of a Kibana dashboard are rows, which contain panels of a given pane width, up to twelve per row. On a panel goes one of the types of object, such as a graph. We’ll see now how to build up a dashboard and the interactions that we can use for displaying and analysing data.

When you first load Kibana you get a default dashboard that links to a few other start dashboards. Here we’re going to properly start from scratch so as to build up a picture of how a dashboard is created. Click on Blank Dashboard, and then in the top-right corner click on Configure Dashboard. Click on Index and from the Timestamping option select day. What this does is tell Kibana which ElasticSearch indices it is to pull data from, in this case using the standard Logstash index naming pattern – which we observed in kopf earlier – logstash-YYYY.MM.DD. Click Save, and then select Add a row. Give the row a title such as System Activity, click on Create Row and then Save. A new row appears on the dashboard.

Now we’ll add a graph to the row. Click on Add panel to empty row, and select Histogram as the Panel Type. Note that by default the width is 4 – change this to 12. You’ll note that there are plenty of options to explore, but to start with we’ll just keep it simple, so go ahead and click Save. By default the chart will show all data, so use the Time filter dropdown option at the top of the screen to select a recent time period. Assuming your data has loaded from logstash into ElasticSearch you should see a graph similar to this:

This is a graph of the number of events (log file entries) per time period. The graph will amend the resolution according to the zoom so that a reasonable resolution of data is shown, or you can force it through the Resolution option in the graph properties. In the legend of the chart you can see the resolution currently used.

Assuming you’ve selected a broad time interval, such as the last week, you’ll presumably want to drill into the data shown. This is very intuitive in Kibana – simply click and drag horizontally over the time period you want to examine.

There are two important concepts for selecting and grouping data in Kibana, called filters and queries. A query groups data based on conditions, and we’ll explore those later. Filtering is a predicate applied to all data returned. Think of it just like a WHERE clause on a SQL query. You can see the current filter(s) applied at the top of the dashboard.

You may well want to hide these, and they can be collapsed – as can the query row and all of the dashboard rows – by clicking on the little triangle:

From the Filter area you can also add, amend, disable and remove filters.

So far all we’ve got is a graph showing system activity over time, based on events recorded in a log file. But, we’ve no way of seeing what those logs are, and this is where the Table panel comes in. Add a new row, give it a title of Log messages and add the Table panel to it specifying the span as 12. You should now see a list of messages with timestamps corresponding to the time period shown in the graph. You can customise the Table panel, for example specifying which fields to show; by default it shows _source which is the raw row returned by ElasticSearch. More useful to us is the log_message field that we parsed out using the grok in logstash earlier. You can do this by selecting the relevant field from the Fields list on the left (which can be collapsed for convenience), or editing the Table panel and specifying it in the Columns area.

From the Table panel it is possible to select the data shown even more precisely by adding additional filters based on data in the table. Clicking on a particular row will expand it and show all of the associated fields, and each field has a set of Action options. You can filter only for that value, or specifically exclude it, and you can also add each field into the table shown (just like we did above for log_message). So here I can opt to only display messages that I’ve tagged in logstash as coming from the BI Server component itself:

You’ll note in the second screenshot, once the filter has been applied, that the graph has changed and is showing less data. That is because a filter is global to a dashboard. But what if we want to show counts for all logs on the graph, but just those for the BI Server in the data table? Here is where queries come into play. A query also looks like a predicate, but rather than restricting the data returned it just identifies a set of data within what is returned. To illustrate this I’m first going to remove all existing filters except the time period one:

And now in the Query area click on the + (to the right of the line). Now there are two queries, both with a wildcard as their value, meaning they’ll each match everything. In the second query box I add the query Component: OracleBIServerComponent – note that for this to work your logstash must be sending messages to ElasticSearch with the necessary Component field. Once updated, the second query’s impact can be seen in the graph, which is showing both the “all” query and the BI Server component rows. Use the View > option in the top left of the graph as a quick way of getting to the graph settings, including disabling cumulative/stack view:

Each panel in Kibana can be configured to show all or some of the query groups that have been defined. This is most useful for creating breakdowns of data, where you are splitting it in different ways and don’t want every query group displayed in its entirety on every panel. You might want to group out the components, and the types of error, and then show a breakdown of system activity by one or the other – but not necessarily both. To configure which query a panel is to show, use the Configure option in the top-right of a panel and go to the Queries tab. If it’s set to all then each and every query set will be shown individually on the panel.

If it’s set to selected then you can choose one or more of the defined query sets to display:

There are about a dozen types of panel in Kibana, and I’m not going to cover them all here. The other ones particularly of interest for building this kind of OBIEE monitoring dashboard include:

  • Terms is basically a SELECT FIELD, COUNT(*) ... GROUP BY FIELD. It shows the top x terms for a given field, and how frequently they occurred. Results can be shown as a pie or bar chart, or just a table. From a Terms panel you can add filters by clicking on a term – in the example above, clicking on the pie segment, or the table row icon, for ERROR would add a filter to show just ERROR log entries.
  • Trends shows the trend of event occurrences in a given time frame. Combined with an appropriate query you can show things like error rates.
  • Stats shows a set of statistics, so you can identify mean response times, maximum users logged on, and so on – assuming you have this data coming through from the logstash parsing.

Once you’ve built your dashboard, save it (Kibana stores it in ElasticSearch). The dashboard is defined in JSON and you can opt to download this too.
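
If you want to look at the stored definitions themselves, they live in their own Elasticsearch index – kibana-int, assuming the default configuration – and can be listed with a query such as the one below (the hostname is whatever your Elasticsearch server is):

curl --silent -XGET 'http://localhost:9200/kibana-int/dashboard/_search?pretty'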

The complete OBIEE monitoring view

Parsing logs is a great way to get valuable information out of the stored text and build visualisations and metrics on top of it. However, for pure metrics alone (such as machine CPU, OBIEE DMS metrics, and so on) a close relation of Kibana, Grafana, is better suited to the task. Thus we have the text-based data going into ElasticSearch and reported through Kibana, and the pure metrics going into a time-based store such as whisper (graphite’s database) and reported through Grafana. Because Grafana is a fork of Kibana, the look and feel is very similar.

Using obi-metrics-agent the DMS metrics from OBIEE can be collected and stored in whisper, and so also graphed out in Grafana alongside the system metrics. This gives us an overall architecture like this:

Obviously, it would be nice if we could integrate the fundamental time-based nature of both Kibana and Grafana together, so that drilling into a particular time period of interest – maybe from an error-rate point of view in the logs – would also show the system and DMS metrics for the same period. There has been discussion about this (1, 2, 3) but I don’t get the impression that it will happen soon, if ever. One other item of interest here is Marvel, which is a commercial offering for monitoring ElasticSearch – through Kibana. It makes use of stock Kibana panel types, along with some new ones including the Nodes panel type, which suits the requirement we have of monitoring OBIEE/system metrics within a Kibana view, but unfortunately it looks like it is going to remain within Marvel only for now.

One other path to consider is trying to get the metrics currently sent to graphite/whisper into ElasticSearch instead, so that Kibana can then report on them. The problem with this is twofold. Firstly, ElasticSearch is fundamentally a text-based store, whereas whisper is a much better fit for time/metric data (as would be another DB such as InfluxDB). So trying to crow-bar the two together may not be the best solution, and it would be better resolved at the front end as discussed above. Secondly, Kibana’s graphing capabilities do not conceptually extend to multiple metrics in the same graph – only multiple queries – which means that graphing something that would be simple in Grafana (such as CPU wait/user/sys) would be overly complex in Kibana.

Architecting an ELK deployment

So far I’ve shown how to configure ELK on a single server, reading logs from that same server. But there are two extra things we should consider. First, Logstash in particular can be quite a ‘heavy’ process depending on how much work you’re doing with it. If you are processing all the logs that FMW writes, and have lots of grok filters (which isn’t a bad thing; it means you’re extracting lots of good information), then you will see logstash using a lot of CPU and a lot of IO, possibly to the detriment of other processes on the system – a tad ironic if the purpose of using logstash is to monitor for any system problems that occur. Secondly, ELK works very well with multiple servers. You might have a scaled-out OBIEE stack, or want to monitor multiple environments. Rather than replicating the ELK stack on each server, it’s better for each server to push its log messages to a central ELK server for processing. And since the processing takes place on the ELK server and not the server being monitored, we reduce the local resource footprint too.

To implement this kind of deployment, you need something like logstash-forwarder on the OBI server, which is a light-touch program that sends the messages to logstash itself on the ELK server over a custom protocol called lumberjack. Logstash then processes the messages as before, except it is reading the input from the logstash-forwarder rather than from file. An alternative approach to this is using redis as a message broker, with logstash running on both the source (sending output to redis) and the ELK server (using redis as the input). This approach is documented very well here / here, and the former approach, using logstash-forwarder, here. Logstash-forwarder worked very well for me in my tests, and seems to fit the purpose nicely.
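
As a rough sketch of the logstash-forwarder route – the hostname, port and certificate paths here are placeholders, and you need to generate the SSL certificate/key pair yourself – the forwarder on the OBI server takes a small JSON configuration listing where to send the data and which files to watch:

{
    "network": {
        "servers": [ "elkserver.example.com:5043" ],
        "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt"
    },
    "files": [
        {
            "paths": [ "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log" ],
            "fields": { "type": "nqserver" }
        }
    ]
}

and on the ELK server logstash listens with the lumberjack input instead of (or as well as) the file input:

input {
        lumberjack {
                port => 5043
                ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
                ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
        }
}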

Conclusion

Responsive monitoring tools are crucial for successful and timely support of an OBIEE system, and the ELK stack provides an excellent basis on which to build beyond the capabilities of Enterprise Manager Fusion Middleware Control. The learning curve is a bit steep at first, and you have to be comfortable with installing unpackaged tools, but the payoff makes it worth it! If you are interested in finding out about how Rittman Mead can help with your OBIEE implementation or other areas, please contact us.

Analytics with Kibana and Elasticsearch through Hadoop – part 1 – Introduction


Introduction

I’ve recently started learning more about the tools and technologies that fall under the loose umbrella term of Big Data, following a lot of the blogs that Mark Rittman has written, including getting Apache log data into Hadoop, and bringing Twitter data into Hadoop via Mongodb.

What I wanted to do was visualise the data I’d brought in, looking for patterns and correlations. Obviously the de facto choice at our shop would be Oracle BI, which Mark previously demonstrated reporting on data in Hadoop through Hive and Impala. But, this was more at the “Data Discovery” phase that is discussed in the new Information Management and Big Data Reference Architecture that Rittman Mead helped write with Oracle. I basically wanted a quick and dirty way to start chucking around columns of data without yet being ready to impose the structure of the OBIEE metadata model on it. One of the tools I’ve worked with recently is a visualisation tool called Kibana, which is part of the ELK stack (that I wrote about previously for use in building a monitoring solution for OBIEE). In this article we’ll take a look at making data available to Kibana and then the kind of analytics and visualisations you can do with it. In addition, we’ll see how loading the data into ElasticSearch has the benefit of extremely fast query times compared to querying through Hive alone.

The Data

I’ve got three sources of data I’m going to work with, all related to the Rittman Mead website:

  • Website logs, from Apache webserver
  • Tweets about Rittman Mead blog articles, via Datasift
  • Metadata about blog posts, extracted from the WordPress MySQL database

At the moment I’ve focussed on just getting the data in, so it’s mostly coming from static files, with the exception of the tweets which are held in a noSQL database (MongoDB).

The Tools

This is where ‘big data’ gets fun, because instead of “Acme DI” and “Acme Database” and “Acme BI”, we have the much more interesting – if somewhat silly – naming convention of “the whackier the better”. Here I’m using:

  • Kibana – data visualisation tool for Elasticsearch
  • Elasticsearch – data store & analytics / search engine
  • HDFS – Hadoop’s distributed file system
  • MongoDB – NoSQL database
  • Hive – enables querying data held in various places including HDFS (and Elasticsearch, and MongoDB) with a SQL-like query language
  • Beeline – Hive command line interface
  • Datasift – online service that streams tweets matching a given pattern to a nominated datastore (such as MongoDB)
  • mongo-hadoop – a connector for MongoDB to Hadoop including Hive
  • elasticsearch-hadoop – a connector for Elasticsearch to Hadoop including Hive

Kibana only queries data held in Elasticsearch, which acts as both the data store and the analytics engine. There are various ways to get data into Elasticsearch directly from source but I’ve opted not to do that here, instead bringing it all in via HDFS and Hive. I’ve done that because my – albeit fairly limited – experience is that Elasticsearch is great once you’ve settled on your data and schema, but in the same way I’m not building a full OBIEE metadata model (RPD) yet, nor did I want to design my Elasticsearch schema up front and have to reload from source if it changed. Options for reprocessing and wrangling data once in Elasticsearch seem limited and complex, and by making all my data available through Hive first I could supplement it and mash it up as I wanted, loading it into Elasticsearch only when I had a chunk of data to explore. Another approach that I haven’t tried but could be useful if the requirement fits it would be to load the individual data elements directly into their own Elasticsearch area and then using the elasticsearch-hadoop connector run the required mashups with other data through Hive, loading the results back into Elasticsearch. It all depends on where you’re coming from with the data.

Overview

Here’s a diagram of what I’m building:

I’ll explain it in steps as follows:

  1. Loading the data and making it accessible through Hive
  2. Loading data from Hive to Elasticsearch
  3. Visualising and analysing data in Kibana

Getting the data into Hive

Strictly speaking we’re not getting the data into Hive, so much as making it available through Hive. Hive simply enables you to define and query tables sitting on top of data held in places including HDFS. The beauty of the Hadoop ecosystem is that you can physicalise data in a bunch of tools and the components will most often support interoperability with each other. It’s only when you get started playing with it that you realise how powerful this is.

The Apache log files and WordPress metadata suit themselves fairly well to a traditional RDBMS format of [de]normalised tables, so we can store them in HDFS with simple RDBMS tables defined on top through Hive. But the twitter data comes in JSON format (like this), and if we were going to store the Twitter data in a traditional RDBMS we’d have to work out how to explode the document into a normalised schema, catering for varying structures depending on the type of tweet and data payload within it. At the moment we just want to collect all the data that looks useful, and then look at different ways to analyse it afterwards. Instead of having to compromise one way (force a structure over the variable JSON) or another (not put a relational schema over obviously relational data) we can do both, and decide at run-time how to best use it. From there, we can identify important bits of data and refactor our design as necessary. This “schema on read” approach is one of the real essences of Hadoop and ‘big data’ in general.

So with that said, let’s see how we get the data in. This bit is the easy part of the article to write, because a lot of it is pretty much what Mark Rittman has already written up in his articles, so I’ll refer to those rather than duplicate here.

Apache log data

References:

I’ve used a variation on the standard Apache log SerDe that the interwebs offers, because I’m going to need to work with the timestamp quite closely (we’ll see why later) so I’ve burst it out into individual fields.

The DDL is:

CREATE EXTERNAL TABLE apachelog (
host STRING,    identity STRING,    user STRING,
time_dayDD STRING,  time_monthMMM STRING,   time_yearYYYY STRING,
time_hourHH STRING, time_minmm STRING,  time_secss STRING,  time_tzZ STRING,
http_call STRING,   url STRING, http_status STRING, status STRING,  size STRING,    referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(\\d{2})\\/(\\w{3})\\/(\\d{4}):(\\d{2}):(\\d{2}):(\\d{2}) (.*?)\\] \\\"(\\w*) ([^ ]*?)(?:\\/)? ([^ \\\"]*)\\\" (\\d*) (\\d*) \\\"(.*?)\\\" \\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s")
STORED AS TEXTFILE LOCATION '/user/oracle/apache_logs';

The EXTERNAL is important on the table definition as it stops Hive moving the HDFS files into its own area on HDFS. If Hive does move the files it is annoying if you want to also access them through another program (or Hive table), and downright destructive if you DROP the table since it’ll delete the HDFS files too – unless it’s EXTERNAL. Note the LOCATION must be an HDFS folder, even if it just holds one file.

For building and testing the SerDe regex Rubular is most excellent, but note that it’s Java regex you’re specifying in the SerDe which has its differences from Python or Ruby regex that Rubular (and most other online regex testers) support. For the final validation of Java regex I use the slightly ugly but still useful regexplanet, which also gives you the fully escaped version of your regex which you’ll need to use for the actual Hive DDL/DML.

A sample row from the apache log on disk looks like this:

74.208.161.70 - - [12/Oct/2014:03:47:43 +0000] "GET /2014/09/sunday-times-tech-track-100/ HTTP/1.0" 301 247 "-" "-"

and now in Hive:

0: jdbc:hive2://bigdatalite:10000> !outputformat vertical
0: jdbc:hive2://bigdatalite:10000> select * from apachelog limit 1;
host           74.208.161.70
identity       -
user           -
time_daydd     12
time_monthmmm  Oct
time_yearyyyy  2014
time_hourhh    03
time_minmm     47
time_secss     43
time_tzz       +0000
http_call      GET
url            /2014/09/sunday-times-tech-track-100/
http_status    HTTP/1.0
status         301
size           247
referer        -
agent          -

Twitter data

Reference:

The twitter data we’ve got includes the Hive ARRAY datatype for the collections of hashtag(s) and referenced url(s) from within a tweet. A point to note here is that the author_followers data appears in different locations of the JSON document depending on whether it’s a retweet or not. I ended up with two variations of this table and a UNION on top.

The table is mapped on data held in MongoDB and as with the HDFS data above the EXTERNAL is crucial to ensure you don’t trash your data when you drop your table.

CREATE EXTERNAL TABLE tweets
(
id string,
url string,
author string,
content string,
created_at string,
hashtags ARRAY<string>,
referenced_urls ARRAY<string>,
sentiment STRING,
author_handle string,
author_id string,
author_followers string,
author_friends string
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","url":"interaction.interaction.link","author":"interaction.interaction.author.name","content":"interaction.interaction.content","created_at":"interaction.interaction.created_at","hashtags":"interaction.interaction.hashtags","referenced_urls":"interaction.links.url","sentiment":"interaction.salience.content.sentiment","author_handle":"interaction.interaction.author.username","author_id":"interaction.interaction.author.id","author_followers":"interaction.twitter.user.followers_count","author_friends":"interaction.twitter.user.friends_count"}')
TBLPROPERTIES('mongo.uri'='mongodb://bigdatalite.localdomain:27017/rm_tweets.rm_tweets')
;
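
For reference, this is a sketch of the kind of UNION I mean – treat it as illustrative rather than the exact DDL used, since the retweets table and its column mapping are just a variation on the definition above:

CREATE VIEW all_tweets AS
SELECT * FROM (
    SELECT id, url, author, content, created_at, hashtags, referenced_urls,
           sentiment, author_handle, author_id, author_followers, author_friends
    FROM   tweets
    UNION ALL
    SELECT id, url, author, content, created_at, hashtags, referenced_urls,
           sentiment, author_handle, author_id, author_followers, author_friends
    FROM   retweets
) t;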

The other point to note is that we’re now using mongo-hadoop for Hive to connect to MongoDB. I found that I had to first build the full set of jar files by running ./gradlew jar -PclusterVersion='cdh5', and also download the MongoDB java driver, before copying the whole lot into /usr/lib/hadoop/lib. This is what I had by the end of it:

[oracle@bigdatalite mongo-hadoop-r1.3.0]$ ls -l /usr/lib/hadoop/lib/mongo-*
-rw-r--r--. 1 root root 105446 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-core-1.3.0.jar
-rw-r--r--. 1 root root  21259 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-hive-1.3.0.jar
-rw-r--r--. 1 root root 723219 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-pig-1.3.0.jar
-rw-r--r--. 1 root root    261 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-r1.3.0.jar
-rw-r--r--. 1 root root 697644 Oct 24 00:36 /usr/lib/hadoop/lib/mongo-hadoop-streaming-1.3.0.jar
-rw-r--r--. 1 root root 591189 Oct 24 00:44 /usr/lib/hadoop/lib/mongo-java-driver-2.12.4.jar

After all that, the data as it appears in Hive looks like this:

id                5441097d591f90cf2c8b45a1
url               https://twitter.com/rmoff/status/523085961681317889
author            Robin Moffatt
content           Blogged: Using #rlwrap with Apache #Hive #beeline for improved readline functionality http://t.co/IoMML2UDxp
created_at        Fri, 17 Oct 2014 12:19:46 +0000
hashtags          ["rlwrap","Hive","beeline"]
referenced_urls   ["http://www.rittmanmead.com/2014/10/using-rlwrap-with-apache-hive-beeline-for-improved-readline-functionality/"]
sentiment         4
author_handle     rmoff
author_id         82564066
author_followers  790
author_friends    375

For reference, without the mongo-hadoop connectors I was getting the error

Error in loading storage handler.com.mongodb.hadoop.hive.MongoStorageHandler

and with them installed but without the MongoDB java driver I got:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON (state=08S01,code=1)
Caused by: java.lang.ClassNotFoundException: com.mongodb.util.JSON

WordPress metadata

WordPress holds its metadata in a MySQL database, so it’s easy to extract out:

  1. Run a query in MySQL to generate the CSV export files, such as:

    SELECT p.ID, p.POST_TITLE,p.POST_DATE_GMT,
           p.POST_TYPE,a.DISPLAY_NAME,p.POST_NAME,
           CONCAT('/', DATE_FORMAT(POST_DATE_GMT, '%Y'), '/', LPAD(
           DATE_FORMAT(POST_DATE_GMT, '%c'), 2, '0'), '/', p.POST_NAME) AS
           generated_url
    FROM   posts p
           INNER JOIN users a
                   ON p.POST_AUTHOR = a.ID
    WHERE  p.POST_TYPE IN ( 'page', 'post' )
           AND p.POST_STATUS = 'publish' 
    into outfile '/tmp/posts.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n';

  2. Copy the CSV file to your Hadoop machine, and copy it onto HDFS. Make sure each type of data goes in its own HDFS folder:

    hadoop fs -mkdir posts
    hadoop fs -copyFromLocal /tmp/posts.csv posts

  3. Define the Hive table on top of it:

    CREATE EXTERNAL TABLE posts 
    ( post_id string,title string,post_date string,post_type string,author string,url string ,generated_url string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    "input.regex" = "^(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s")
    location '/user/oracle/posts'
    ;

Rinse & repeat for the category data, and post->category relationships.
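
To give a flavour of the “repeat”, here is a sketch of the categories table – the regex is an assumption based on the column output shown below rather than the exact export format, so the quoting in your CSV may well differ:

CREATE EXTERNAL TABLE categories
( category_id string, cat2_id string, category_name string, category_code string, catslug string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\d*),(\\d*),\\\"(.*?)\\\",\\\"(.*?)\\\",\\\"(.*?)\\\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s")
location '/user/oracle/categories'
;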

The data once modelled in Hive looks like this:

0: jdbc:hive2://bigdatalite:10000> select * from posts limit 1;
post_id        788
title          Blog
post_date      2007-03-07 17:45:07
post_type      page
author         Mark Rittman
url            blog
generated_url  /2007/03/blog

0: jdbc:hive2://bigdatalite:10000> select * from categories limit 1;
category_id    5
cat2_id        5
category_name  category
category_code  BI (General)
catslug        bi

0: jdbc:hive2://bigdatalite:10000> select * from post_cats limit 5;
post_id      8046
category_id  1

The WordPress metadata quite obviously joins together, as it is already from the relational schema in which it was held on MySQL. Here is an example of where “schema on read” comes into play, because you could look at the above three tables (posts / post_cats / categories) and conclude it was redundant to export all three from WordPress and that a single query listing posts and their respective category would be sufficient. But, some posts have more than one category, which then leads to a design/requirements decision. Either we retain one row per post – and collapse down the categories, but in doing so lose the ability to easily treat categories as individual data – or we have one row per post/category, and end up with multiple rows per post, which complicates matters if we’re doing a simple count of posts. So we bring it all in raw from source, and then decide how we’re going to use it afterwards.
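
For what it’s worth, the “collapse the categories” option is straightforward enough in HiveQL. Here is a sketch – not taken from the original design, just to illustrate the trade-off – using collect_set to give one row per post with an ARRAY of its categories:

SELECT p.post_id,
       p.title,
       collect_set(c.category_code) AS categories
FROM   posts p
       INNER JOIN post_cats pc ON p.post_id = pc.post_id
       INNER JOIN categories c ON pc.category_id = c.category_id
GROUP BY p.post_id, p.title;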

Bringing the data together

At this point I have six tables in Hive that I can query (albeit slowly) with HiveQL – a close relation to SQL, with a few interesting differences – running through the Hive client Beeline. The data is tweets, website visits, and details about the blog posts themselves.

0: jdbc:hive2://bigdatalite:10000> show tables;
+------------------------+
|        tab_name        |
+------------------------+
| apachelog              |
| categories             |
| post_cats              |
| posts                  |
| retweets               |
| tweets                 |
+------------------------+

As well as time, the other common element running throughout all the data is the blog article URL, whether it is a post, a visit to the website, or a tweet about it. But to join on it is not quite as simple as you’d hope, because all the following are examples of recorded instances of the data for the same blog post:

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So whether it’s querying the data within Hive, or loading it joined together to another platform, we need to be able to unify the values of this field.

Tangent: RegEx

And now it’s time, if you hadn’t already for your SerDe against the Apache log file, to really immerse yourself in Regular Expressions (RegEx). Part of the “schema on read” approach is that it can get messy. You need to juggle and wrangle and munge data in ways that it really might not want to, and RegEx is an essential tool with which to do this. Regex isn’t specific to Hadoop – it’s used throughout the computing world.

My journey with regex over quite a few years in computing has gone in stages something like this:

  1. To be a fully rounded geek, I should learn regex. Looks up regex. Hmm, looks complicated….Squirrel!
    1. To be a fully round (geddit?!) geek, I should keep eating these big breakfasts
  2. I’ve got a problem, I’ve got a feeling regex will help me. But my word it looks complicated … I’ll just do it by hand.
  3. I’ve got another problem, I need to find this text in a file but with certain patterns around it. Here’s a regex I found on google. Neat!
  4. Hmmm another text matching problem, maybe I should really learn regex instead of googling it to death each time
  5. Mastered the basic concepts of regex
  6. Still a long way to go…

If you think you’ll nail RegEx overnight, you won’t (or at least, you’re a better geek than me). It’s one of those techniques, maybe a bit like SQL, that to fully grok takes a period of exposure and gradually increasing usage, before you have an “ah hah!” moment. There’s a great site explaining regex here: www.regular-expressions.info. My best advice is to take a real example text that you want to work with (match on, replace bits of, etc), and stick it in one of these parsers and experiment with the code:

Oh and finally, watch out for variations in regex – what works in a Java-based program (most of the Hadoop world) may not in Python, and vice versa. Same goes for PHP, Ruby, and so on – they all have different regex engines that may or may not behave as you’d expect.

Back on track : joining data on non-matching columns

So to recap, we want to be able to analyse our blog data across tweets, site hits and postings, using the common field of the post URL, which from the various sources can look like any of the following (and more):

http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

So out comes the RegEx. First off, we’ll do the easy one – strip the http:// and server bit. Using the Hive function REGEXP_REPLACE we can use this in the query:

regexp_replace(ref_url,'http:\\/\\/www.rittmanmead.com','')

This means, take the ref_url column and if you find http://www.rittmanmead.com then replace it with nothing, i.e. delete it. The two backslashes before each forward slash simply escape them since a forward slash on its own has a special meaning in regex. Just to keep you on your toes – Java regex requires double backslash escaping, but all other regex (including the online parser I link to below) uses a single one.

So now our list of possible join candidates has shrunk by one to look like this:

/2014/01/automated-regression-testing-for-obiee/
/2014/01/automated-regression-testing-for-obiee
/2014/01/automated-regression-testing-for-obiee/feed
/2014/01/automated-regression-testing-for-obiee/foobar+foobar

The variation as you can see is whether there is a trailing forward slash (/) after the post ‘slug’, and whether there is additional cruft after that too (feed, foobar+foobar, etc). So let’s build it up a piece at a time. On each one, I’ve linked to an online parser that you can use to see it in action.

  1. We’ll match on the year and month (/2014/01/) because they’re fixed pattern, so using \d to match on digits and {x} to match x repetitions: (see example on Rubular.com)

    \/\d{4}\/\d{2}\/

    This will match /2014/01/.

  2. Now we need to match the slug, but we’re going to ditch the forward slash suffix if there is one. This is done with two steps.

    First, we define a “match anything except x” group, which is what the square brackets (group) and the caret ^ (negate) do, and in this case x is the forward slash character, escaped.

    Secondly, the plus symbol + tells regex to match at least one repetition of the preceding group – i.e. any character that is not a forward slash. (example)

    [^\/]+

    Combined with the above regex from the first step we will now match /2014/01/automated-regression-testing-for-obiee.

  3. The final step is to turn the previous REGEXP_REPLACE on its head: instead of replacing the content from the string that we don’t want, we’ll extract the content that we do want, using a regex capture group, which is defined by round brackets (parentheses, just like these). We’ve now brought in a couple of extra bits to make it hang together, seen in the completed regex here:

    \S*(\/\d{4}\/\d{2}\/[^\/]+).*$

    1. The \S* at the beginning means match any non-whitespace character, which will replace the previous regex replace we were doing to strip out the http://www.rittmanmead.com
    2. After the capture group, which is the content from steps one and two above surrounded by parentheses (\/\d{4}\/\d{2}\/[^\/]+), there is a final .* to match anything else that might be present (e.g. trailing forward slash, foobar, etc.)

    Now all we need to do is escape it for Java regex, and stick it in the Hive REGEXP_EXTRACT function, specifying 1 as the capture group number to extract: (example)

    regexp_extract(url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)

So now all our URLs will look like this, regardless of whether they’re from tweet data, website hits, or wordpress:

/2014/01/automated-regression-testing-for-obiee

Which is nice, because it means we can use it as the common join in our queries. For example, to look up the title of the blog post that someone has tweeted about, and who wrote the post:

SELECT
x.author AS tweet_author, 
x.tweet ,
x.tweet_url, 
x.created_at, 
p.author as post_author, 
p.title as post_title
FROM            ( 
SELECT 'tweets' , 
t.url AS tweet_url , 
t.author , 
t.content AS tweet , 
t.created_at ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
FROM   tweets t 
LATERAL VIEW EXPLODE (referenced_urls) refs as ref_url 
WHERE  t.author_followers IS NOT NULL 
AND    ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*' ) x 
INNER JOIN posts p 
ON regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url ;

[...]
tweet_author  Dain Hansen
tweet         Like a Big Data kid in a Hadoop candy store: Presos on #bigdata for BI, DW, Data Integration http://t.co/06DLnvxINx via @markrittman
tweet_url     https://twitter.com/dainsworld/status/520463199447961600
created_at    Fri, 10 Oct 2014 06:37:51 +0000
post_author   Mark Rittman
post_title    Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

tweet_author  Robin Moffatt
tweet         Analyzing Twitter Data using Datasift, MongoDB and Pig http://t.co/h67cd4kJo2 via @rittmanmead
tweet_url     https://twitter.com/rmoff/status/524197131276406785
created_at    Mon, 20 Oct 2014 13:55:09 +0000
post_author   Mark Rittman
post_title    Analyzing Twitter Data using Datasift, MongoDB and Pig
[...]

Note here also the use of LATERAL VIEW EXPLODE () as a way of denormalising out the Hive ARRAY of referenced url(s) in the tweet so there is one row returned per value.
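
In isolation, a cut-down example of the same technique looks like this – each element of the referenced_urls array comes back as its own row:

SELECT t.id, ref_url
FROM   tweets t
LATERAL VIEW EXPLODE (referenced_urls) refs AS ref_url;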

Summary

We’ve got our three sources of data available to us in Hive, and can query across them. Next we’ll take a look at loading the data into Elasticsearch, taking advantage of our conformed url column to join data that we load. Stay tuned!

Analytics with Kibana and Elasticsearch through Hadoop – part 2 – Getting data into Elasticsearch


Introduction

In the first part of this series I described how I made several sets of data relating to the Rittman Mead blog from various sources available through Hive. This included blog hits from the Apache webserver log, tweets, and metadata from WordPress. Having got it into Hive, I now need to get it into ElasticSearch as a pre-requisite for using Kibana, to see how it holds up as an analysis tool or as a “data discovery” option. Here’s a reminder of the high-level architecture, with the parts I’ve divided it into covered across the three blog posts indicated:

In this article we will see how to go about doing that load into ElasticSearch, before getting into some hands-on with Kibana in the final article of this series.

Loading data from Hive to Elasticsearch

We need to get the data into Elasticsearch itself since that is where Kibana requires it to be for generating the visualisations. Elasticsearch holds the data and provides the analytics engine, and Kibana provides the visualisation rendering and the generation of queries into Elasticsearch. Kibana and Elasticsearch are the ‘E’ and ‘K’ of the ELK stack, which I have written about previously (the ‘L’ being Logstash but we’re not using that here).

Using the elasticsearch-hadoop connector we can load data exposed through Hive into Elasticsearch. It’s possible to load data directly from origin into Elasticsearch (using, for example, Logstash) but here we’re wanting to bring together several sets of data using Hadoop/Hive as the common point of integration.

Elasticsearch has a concept of an ‘index’ within which data is stored, held under a schema known as a ‘mapping’. Each index can have multiple mappings. It’s dead easy to run Elasticsearch – simply download it, unpack the archive, and then run it – it really is as easy as that:

[oracle@bigdatalite ~]$ /opt/elasticsearch-1.4.0.Beta1/bin/elasticsearch
[2014-10-30 16:59:39,078][INFO ][node                     ] [Master] version[1.4.0.Beta1], pid[13467], build[1f25669/2014-10-01T14:58:15Z]
[2014-10-30 16:59:39,080][INFO ][node                     ] [Master] initializing ...
[2014-10-30 16:59:39,094][INFO ][plugins                  ] [Master] loaded [], sites [kopf, gui]
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] initialized
[2014-10-30 16:59:43,184][INFO ][node                     ] [Master] starting ...
[2014-10-30 16:59:43,419][INFO ][transport                ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.57.3:9300]}
[2014-10-30 16:59:43,446][INFO ][discovery                ] [Master] elasticsearch/mkQYgr4bSiG-FqEVRkB_iw
[2014-10-30 16:59:46,501][INFO ][cluster.service          ] [Master] new_master [Master][mkQYgr4bSiG-FqEVRkB_iw][bigdatalite.localdomain][inet[/192.168.57.3:9300]], reason: zen-disco-join (elected_as_master)
[2014-10-30 16:59:46,552][INFO ][http                     ] [Master] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.57.3:9200]}
[2014-10-30 16:59:46,552][INFO ][node                     ] [Master] started

You can load data directly across into Elasticsearch from Hive without having to prepare anything on Elasticsearch – it will create the index and mapping for you. But, for it to work how we want, we do need to specify the mapping in advance because we want to tell Elasticsearch two important things:

  • To treat the date field as a date – crucial for Kibana to do its time series-based magic
  • Not to “analyze” certain fields. By default Elasticsearch will analyze each string field so that you can display most common terms within it etc. However if we want to report things like blog title, breaking it down into individual words doesn’t make sense.

This means that the process is as follows:

  1. Define the Elasticsearch table in Hive
  2. Load a small sample of data into Elasticsearch from Hive
  3. Extract the mapping and amend the date field and mark required fields as non-analysed
  4. Load the new mapping definition to Elasticsearch
  5. Do a full load from Hive into Elasticsearch

Steps 2 and 3 can be sidestepped by crafting the mapping by hand from the outset but it’s typically quicker not to.

Before we can do anything in terms of shifting data around, we need to make elasticsearch-hadoop available to Hadoop. Download it from the github site, and copy the jar file to /usr/lib/hadoop and add it to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive-env.sh.
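
In shell terms that amounts to something like the following – the version number here is an assumption, so substitute whichever release you downloaded:

# Copy the connector jar into the Hadoop lib path
cp elasticsearch-hadoop-2.0.2.jar /usr/lib/hadoop/

# Then in /usr/lib/hive/conf/hive-env.sh add it to the auxiliary jars path
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:/usr/lib/hadoop/elasticsearch-hadoop-2.0.2.jar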

Defining the Hive table over Elasticsearch

The Hive definition for a table stored in Elasticsearch is pretty simple. Here’s a basic example of a table that’s going to hold a list of all blog posts made. Note the _es suffix, a convention I’m using to differentiate the Hive table from others with the same data and denoting that it’s in Elasticsearch (es). Also note the use of EXTERNAL as previously discussed, to stop Hive trashing the underlying data if you drop the Hive table:

CREATE EXTERNAL TABLE all_blog_posts_es (
ts_epoch bigint ,
post_title string ,
post_title_a string ,
post_author string ,
url string ,
post_type string )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
'es.nodes'='bigdatalite.localdomain',
'es.resource'='all_blog/posts'
) ;

The ROW FORMAT and STORED BY are standard, but the TBLPROPERTIES values should be explained (you’ll find full details in the manual):

  1. es.nodes – this is the hostname of the Elasticsearch server. If you have multiple nodes it will discover the others from this.
  2. es.resource – this is the index and mapping where the data should be stored. We’ll see more about these later, because they’re important.

Time for a tangent …

The biggest issue I had getting data from Hive into Elasticsearch was timestamps. To cut a very long story (involving lots of random jiggling, hi Christian!) short, I found it was easiest to convert timestamps into Unix epoch (number of seconds since Jan 1st 1970), rather than prat about with format strings (and prat about I did). For timestamps already matching the ISO8601 standard such as those in my WordPress data, I could leverage the Hive function UNIX_TIMESTAMP which returns exactly that

0: jdbc:hive2://bigdatalite:10000> select post_date, unix_timestamp(post_date) as post_date_epoch from posts limit 1;
post_date        2007-03-07 17:45:07
post_date_epoch  1173289507

For others though that included the month name as text such as Wed, 17 Sep 2014 08:31:20 +0000 I had to write a very kludgy CASE statement to first switch the month names for numbers and then concatenate the whole lot into a ISO8601 that could be converted to unix epoch. This is why I also split the apache log SerDe so that it would bring in the timestamp components (time_dayDD, time_monthMMM, etc) individually, making the epoch conversion a little bit neater:

unix_timestamp(concat(concat(concat(concat(concat(concat(
a.time_yearyyyy,'-')
,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
,a.time_daydd,' ')
,a.time_hourhh,':')
,a.time_minmm,':')
,a.time_secss,'')
)

Because if you thought this was bad, check out what I had to do to the twitter timestamp:

unix_timestamp(
    concat(concat(concat(concat(regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-')
    ,case regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} ',''),' .*$','') 
    when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,regexp_replace(regexp_replace(created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))
)

As with a few things here, this was all for experimentation rather than streamlined production usage, so it probably could be rewritten more efficiently or solved in a better way – suggestions welcome!

So the nett result of all of these is the timestamp as epoch in seconds – but note that Elasticsearch works with millisecond epoch, so they all need multiplying by 1000.

As I’ve noted above, this feels more complex than it needed to be, and maybe with a bit more perseverance I could have got it to work without resorting to epoch. The issue I continued to hit with passing timestamps across as non-epoch values (i.e. as strings using the format option of the Elasticsearch mapping definition, or Hive Timestamp, and even specifying es.mapping.timestamp) was org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: TimestampParsingException, regardless of the careful format masks that I applied.

Back on track – loading a sample row into Elasticsearch

We want to send a sample row of data to Elasticsearch now for two reasons:

  1. As a canary to prove the “plumbing” – no point chucking thousands of rows across through MapReduce if it’s going to fall over for a simple problem (I learnt my lesson during the timestamp fiddling above).
  2. Automagically generate the Elasticsearch mapping, which we subsequently need to modify by hand and is easier if it’s been created for us first.

Since the table is defined in Hive, we can just run a straightforward INSERT to send some data across, making use of the LIMIT clause of HiveQL to just send a couple of rows:

INSERT INTO TABLE all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
       title, 
       title, 
       author, 
       REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
       post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
LIMIT 2
;

Hive will generate a MapReduce job that pushes the resulting data over to Elasticsearch. You can see the log for the job – essential for troubleshooting – at /var/log/hive/hive-server2.log (by default). In this snippet you can see a successful completion:

2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Starting Job = job_1414451727442_0011, Tracking URL = http://bigdatalite.localdomain:8088/proxy/application_1414451727442_0011/
2014-10-30 22:35:14,977 INFO  exec.Task (SessionState.java:printInfo(417)) - Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414451727442_0011
2014-10-30 22:35:22,244 INFO  exec.Task (SessionState.java:printInfo(417)) - Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 1
2014-10-30 22:35:22,275 WARN  mapreduce.Counters (AbstractCounters.java:getGroup(234)) - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2014-10-30 22:35:22,276 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:22,276 Stage-0 map = 0%,  reduce = 0%
2014-10-30 22:35:30,757 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:30,757 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.51 sec
2014-10-30 22:35:40,098 INFO  exec.Task (SessionState.java:printInfo(417)) - 2014-10-30 22:35:40,098 Stage-0 map = 100%,  reduce = 100%, Cumulative CPU 4.44 sec
2014-10-30 22:35:40,100 INFO  exec.Task (SessionState.java:printInfo(417)) - MapReduce Total cumulative CPU time: 4 seconds 440 msec
2014-10-30 22:35:40,132 INFO  exec.Task (SessionState.java:printInfo(417)) - Ended Job = job_1414451727442_0011
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - MapReduce Jobs Launched:
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.44 sec   HDFS Read: 4313 HDFS Write: 0 SUCCESS
2014-10-30 22:35:40,158 INFO  ql.Driver (SessionState.java:printInfo(417)) - Total MapReduce CPU Time Spent: 4 seconds 440 msec
2014-10-30 22:35:40,159 INFO  ql.Driver (SessionState.java:printInfo(417)) - OK

But if you’ve a problem with your setup you’ll most likely see this generic error instead passed back to beeline prompt:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

Meaning that you need to go to the Hive log file for the full diagnostics.

Amending the Elasticsearch mapping

So assuming the previous step worked (if you got the innocuous No rows affected from beeline then it did) you now have an index and mapping (and a couple of “documents” of data) in Elasticsearch. You can inspect the mapping in several ways, including with the GUI for Elasticsearch admin kopf.

You can also interrogate Elasticsearch directly with its REST API, which is what we’re going to use to update the mapping, so let’s use it also to view it. I’m going to use curl to do the HTTP call, and then pipe it | straight to jq to prettify the resulting JSON that Elasticsearch sends back.

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/posts/_mapping' | jq '.'
{
  "all_blog": {
    "mappings": {
      "posts": {
        "properties": {
          "url": {
            "type": "string"
          },
          "ts_epoch": {
            "type": "long"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string"
          },
          "post_author": {
            "type": "string"
          }
        }
      }
    }
  }
}

We can see from this that Elasticsearch has generated the mapping to match the data that we’ve sent across from Hive (note how it’s picked up the ts_epoch type as being numeric not string, per our Hive table DDL). But, as mentioned previously, there are two things we need to rectify here:

  1. ts_epoch needs to be a date type, not long. Without the correct type, Kibana won’t recognise it as a date field.
  2. Fields that we don’t want broken down for analysis need marking as such. We’ll see the real difference that this makes when we get on to Kibana later.

To amend the mapping we just take the JSON document, make the changes, and then push it back with curl again. You can use any editor with the JSON (I’ve found Atom on the Mac to be great for its syntax highlighting, brace matching, etc). To change the type of the date field just change long to date. To mark a field not for analysis add "index": "not_analyzed" to the column definition. After these changes, the amended fields in my mapping JSON look like this:

[...]
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
            [...]

The particularly eagle-eyed of you will notice that I am loading post_title in twice. This is because I want to use the field both as a label and to analyse it as a field in itself, looking at which terms get used most. So in the updated mapping, only post_title is set to not_analyzed; post_title_a is left alone.

To remove the existing mapping, use this API call:

curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/posts'

and then the amended mapping put back. Note that the "all_blog" / "mappings" outer levels of the JSON have been removed from the JSON that we send back to Elasticsearch:

curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
{
      "posts": {
        "properties": {
          "url": {
            "type": "string","index": "not_analyzed"
          },
          "ts_epoch": {
            "type": "date"
          },
          "post_type": {
            "type": "string"
          },
          "post_title_a": {
            "type": "string"
          },
          "post_title": {
            "type": "string","index": "not_analyzed"
          },
          "post_author": {
            "type": "string","index": "not_analyzed"
          }
        }
      }
    }
'

Full load into Elasticsearch

Now we can go ahead and run a full INSERT from Hive, and this time the existing mapping will be used. Depending on how much data you’re loading, it might take a while but you can always tail the hive-server2.log file to monitor progress. So that we don’t duplicate the ‘canary’ data that we sent across, use the INSERT OVERWRITE statement:

INSERT OVERWRITE table all_blog_posts_es 
SELECT UNIX_TIMESTAMP(post_date) * 1000 AS post_date_epoch, 
title, 
title, 
author, 
REGEXP_EXTRACT(generated_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) , 
post_type 
FROM   posts 
WHERE  post_date IS NOT NULL
;

To check the data’s made it across we can do a count from Hive:

0: jdbc:hive2://bigdatalite:10000> select count(*) from all_blog_posts_es;
+------+
| _c0  |
+------+
| 2257 |
+------+
1 row selected (27.005 seconds)

But this requires a MapReduce job to run and is fairly slow. Much faster is direct from the horse’s mouth – from Elasticsearch itself where the data is. Just as we called a REST API to get and set the mapping, Elasticsearch can also give us statistics back this way too:

[oracle@bigdatalite ~]$ curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/_stats/docs' | jq '.indices[].total.docs'
{
  "deleted": 0,
  "count": 2257
}

Here I’ve used a bit more jq to parse down the stats in JSON that Elasticsearch sends back. If you want to explore more of what jq can do, you’ll find https://jqplay.org/ useful.

Code

For reference, here is the set of three curl/DDL/DML that I used:

  • Elasticsearch index mappings

    # For reruns, remove and recreate index
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog' && curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts'
    # Create posts mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/posts' -d '
    {
    "posts" : {
    "properties": {
    "ts_epoch": {"type": "date"},
    "post_author": {"type": "string", "index" : "not_analyzed"},
    "post_title": {"type": "string", "index" : "not_analyzed"},
    "post_title_a": {"type": "string", "index" : "analyzed"},
    "post_type": {"type": "string", "index" : "not_analyzed"},
    "url": {"type": "string", "index" : "not_analyzed"}
    }}}
    '
    
    # For partial rerun, remove mapping
    # Create tweets mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets'
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/tweets' -d '
    {"tweets": {
    "properties": {
    "tweet_url": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_type": {
    "type": "string"
    },
    "ts_epoch": {
    "type": "date"
    },
    "tweet_author": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_author_followers": {
    "type": "string"
    },
    "tweet_author_friends": {
    "type": "string"
    },
    "tweet_author_handle": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet": {
    "index": "not_analyzed",
    "type": "string"
    },
    "tweet_analysed": {     "type": "string"      }
    ,"post_author": {       "index": "not_analyzed","type": "string"      }
    ,"post_title": {       "index": "not_analyzed", "type": "string"      }
    ,"post_title_a": {    "type": "string"    }
    
    }
    }
    }'
    
    # For partial rerun, remove mapping
    curl -XDELETE 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache'
    # Create apachelog mapping
    curl -XPUT 'http://bigdatalite.localdomain:9200/all_blog/_mapping/apache' -d '
    
    {"apache": {
    "properties": {
    "user": {
    "type": "string"
    },
    "url": {
    "index": "not_analyzed",
    "type": "string"
    },
    "status": {
    "type": "string"
    },
    "agent": {
    "index": "not_analyzed",
    "type": "string"
    },
    "host": {
    "type": "string"
    },
    "http_call": {
    "type": "string"
    },
    "http_status": {
    "type": "string"
    },
    "identity": {
    "type": "string"
    },
    "referer": {
    "index": "not_analyzed",
    "type": "string"
    },
    "ts_epoch": {
    "type": "date"
    },
    "size": {
    "type": "string"
    },"post_author": {      "index": "not_analyzed","type": "string"      }
    ,"post_title": {       "index": "not_analyzed", "type": "string"      }
    ,"post_title_a": {    "type": "string"    }
    
    }}}'

  • Hive table DDL

    drop table all_blog_posts_es;
    CREATE external TABLE all_blog_posts_es(
    ts_epoch bigint ,
    post_title string ,
    post_title_a string ,
    post_author string ,
    url string ,
    post_type string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/posts')
    ;
    
    drop table all_blog_tweets_es;
    CREATE EXTERNAL TABLE all_blog_tweets_es(
    tweet_type string ,
    tweet_url string ,
    tweet_author string ,
    tweet string ,
    tweet_analysed string ,
    ts_epoch bigint ,
    tweet_author_handle string ,
    tweet_author_followers string ,
    tweet_author_friends string ,url string ,post_author string ,
    post_title string ,post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/tweets')
    ;
    
    drop table all_blog_apache_es;
    CREATE EXTERNAL TABLE all_blog_apache_es(
    host string ,
    identity string ,
    user string ,
    ts_epoch bigint ,
    http_call string ,
    url string ,
    http_status string ,
    status string ,
    size string ,
    referer string ,
    agent string, post_author string ,
    post_title string ,post_title_a string )
    ROW FORMAT SERDE
    'org.elasticsearch.hadoop.hive.EsSerDe'
    STORED BY
    'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
    'es.nodes'='bigdatalite.localdomain',
    'es.resource'='all_blog/apache');

  • Hive DML – load data to Elasticsearch

    insert into table all_blog_posts_es
    select unix_timestamp(post_date) * 1000 as post_date_epoch,title,title,author,
    regexp_extract(generated_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1)
    ,post_type
    from posts
    where post_date is not null
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title
    from (
    select 'tweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from tweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert overwrite table all_blog_tweets_es
    select x.*,p.author,p.title
    from (
    select 'retweets'
    ,t.url as tweet_url
    ,t.author
    ,t.content as tweet
    ,t.content as tweet_analyzed
    ,unix_timestamp(concat(concat(concat(concat(regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} ',''),' .*$',''),'-'),case regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} ',''),' .*$','') when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-'),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, ',''),' .*$',''),' '),regexp_replace(regexp_replace(t.created_at,'^\\w{3}, \\d{2} \\w{3} \\d{4} ',''),' .*$',''))) * 1000 as ts_epoch
    ,t.author_handle
    ,t.author_followers
    ,t.author_friends
    ,regexp_extract(ref_url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url
    from retweets t lateral view explode (referenced_urls) refs as ref_url
    where t.author_followers is not null
    and ref_url regexp '\\S*\\/\\d{4}\\/\\d{2}\\/.*'
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;
    
    insert into table all_blog_apache_es
    select x.*,p.author,p.title,p.title
    from (
    select
    a.host,a.identity,a.user
    ,unix_timestamp(concat(concat(concat(concat(concat(concat(
    a.time_yearyyyy,'-')
    ,case a.time_monthmmm when 'Jan' then 1 when 'Feb' then 2 when 'Mar' then 3 when 'Apr' then 4 when 'May' then 5 when 'Jun' then 6 when 'Jul' then 7 when 'Aug' then 8 when 'Sep' then 9 when 'Oct' then 10 when 'Nov' then 11 when 'Dec' then 12 else 0 end,'-')
    ,a.time_daydd,' ')
    ,a.time_hourhh,':')
    ,a.time_minmm,':')
    ,a.time_secss,'')
    ) * 1000 as ts_epoch
    ,a.http_call ,regexp_extract(a.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) as url,a.http_status ,a.status ,a.size ,a.referer ,a.agent
    from apachelog a
    where a.url regexp "^\\/\\d{4}\\/\\d{2}\\/.*"
    ) x left outer join posts p on regexp_extract(x.url,'\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*',1) = p.generated_url
    ;

Summary

With the data loaded into Elasticsearch we’re now ready to start our analysis against it. Stay tuned for the final part in this short blog series to see how we use Kibana to do this.

Analytics with Kibana and Elasticsearch through Hadoop – part 3 – Visualising the data in Kibana


In this post we will see how Kibana can be used to create visualisations over various sets of data that we have combined together. Kibana is a graphical front end for data held in Elasticsearch, which also provides the analytic capabilities. Previously we looked at where the data came from and how to expose it through Hive, and then how to load it into Elasticsearch. Here’s what we’ve built so far, the borders denoting what was covered in the previous two blog articles and what we’ll cover here:


Now that we’ve got all the data into Elasticsearch, via Hive, we can start putting some pictures around it. Kibana works by directly querying Elasticsearch, generating the same kind of queries that you can run yourself through the Elasticsearch REST API (similar to what we saw when defining the mappings in the previous article). In this sense there is a loose parallel between OBIEE’s Presentation Services and the BI Server – one does the fancy front end stuff, generating queries to the hard-working backend.

I’ve been looking at both the current release version of Kibana (3.x), and also the beta of Kibana 4 which brings with it a very smart visualiser that we’ll look at in detail. It looks like Kibana 4 is a ground-up rewrite rather than modifications to Kibana 3, which means that at the moment it is a long way from parity of functionality – which is why I’m flitting between the two. For a primer in Kibana 3 and its interface see my article on using it to monitor OBIEE.

Installing Kibana 3 is pretty easy, involving a simple config change to a web server of your choice that you need to provide (details in my previous blog). It has been made even easier in Kibana 4, which actually ships with its own web server so you literally just download it, unarchive it and run it.
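
For the record, getting the Kibana 4 beta running looks roughly like this – a sketch only, since the exact download filename and version will vary; port 5601 is the Kibana 4 default:

# having downloaded the Kibana 4 beta tarball from the Elasticsearch website:
tar -xzf kibana-4*.tar.gz
cd kibana-4*
./bin/kibana
# then point a browser at http://<your-host>:5601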

So the starting point is the assumption we have all the data in a single Elasticsearch index all_blog, with three different mappings which Kibana refers to accurately as “types”: blog posts, blog visits, and blog tweets.
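
If you want to sanity-check what’s in the index before firing up Kibana, a quick terms aggregation on the _type field gives the document count per type. This is a sketch using the Elasticsearch 1.x aggregation syntax, piped through jq as before:

curl --silent -XGET 'http://bigdatalite.localdomain:9200/all_blog/_search?search_type=count' -d '
{ "aggs": { "by_type": { "terms": { "field": "_type" } } } }
' | jq '.aggregations.by_type.buckets'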

Kibana 3

Starting with a simple example first, and to illustrate the “analysed” vs “non-analysed” mapping configuration that I mentioned previously, let’s look at the Term visualisation in Kibana 3. This displays the results of an Elasticsearch analysis against a given field. If the field has been marked as “not analysed” we get a listing of the literal values, ranking by the number of times they repeat. This is useful, for example, to show who has blogged the most:

But it’s less useful if we want to analyse the use of words in blog titles, since with the non-analysed field we just get a listing of the blog titles themselves:

(there are indeed two blog posts entitled “Odds and Ends” from quite a while ago 1 2)

Building the Term visualisation against the post title field that has been analysed gives us a more interesting, although hardly surprising, result:

Here I’ve weeded out the obvious words that will appear all the time (‘the’, ‘a’, etc), using the Exclude Term(s) option.

Term visualisations are really useful for displaying any kind of top/bottom ranked values, and also because they are interactive – if you click on the value it is applied as a filter to the data on the page. What that means is that we can take a simple dashboard using the two Term objects above, plus a histogram of posts made over time:

And by clicking on one of the terms (for example, my name in the authors list) it shows that I only started posting on the Rittman Mead blog three years ago, and that I write about OBIEE, performance, and exalytics.

Taking another tack, we can search for any term and add it in to the histogram. Here we can see when interest in 11g (the green line), as well as big data (red), started:

Note here we’re just analysing post titles, not content, so it’s not 100% representative. Maybe loading our post contents into Elasticsearch will be my next blog post. But that does then start to get a little bit meta…

Adding in a Table view gives us the ability to show the actual posts and links to them.

Let’s explore the data a bit. Clicking on an entry in the table gives us the option to filter down further:

Here we can see for a selected blog post, what its traffic was and when (if at all) it was tweeted:

Interesting in the profile of blog hits is a second peak that looks like it might correlate with tweets. Let’s drill further by drag-clicking (brushing) on the graph to select the range we want, and bring in details of those tweets:

So this is all pretty interesting, and importantly, very rapid in terms of both the user experience and the response time.

Kibana 4

Now let’s take a look at what Kibana 4 offers us. As well as a snazzier interface (think hipster data explorer vs hairy ops guy parsing logs), its new Visualiser builder is great. Kibana 3 dumped you on a dashboard in which you have to build rows and panels and so on. Kibana 4 has a nice big “Visualize” button. Let’s see what this does for us. To start with it’s a nice “guided” build process:

By default we get a single bar, counting all the ‘documents’ for the time period. We can use the Search option at the top to filter just the ‘type’ of document we want, which in this case is going to be tweets about our blog articles.

Obviously, a single bar on its own isn’t that interesting, so let’s improve it. We’ll click the “Add Aggregation” button (even though to my pedantic mind the data is already aggregated to total), and add an X-Axis of date:

The bucket size in the histogram defaults to automatic, and the axis label tells us it’s per three hours. At the volume of tweets we’re analysing, we’d see patterns better at a higher grain such as daily (the penultimate bar to the right of the graph shows a busy day of tweets that’s lost in the graph at 3-hour intervals):

NB at the moment in Kibana 4 intervals are fixed (in Kibana 3 they were freeform).

Let’s dig into the tweets a bit deeper. Adding a “Sub Aggregation” to split the bars based on top two tweet authors per day gives us this:

You can hover over the legend to highlight the relevant bar block too:

Now with a nifty function in the Visualizer we can change the order of this question. So instead of, “by day, who were the top two tweeters”, we can ask “who were the top two tweeters over the time period, and what was their tweet count by day” – all just by rearranging the buckets/aggregation with a single click:

Let’s take another angle on the data, looking not at time but which blog links were most tweeted, and by whom. Turns out I’m a self-publicist, tweeting four times about my OOW article. Note that I’ve also included some filtering on my data to exclude automated tweets:

Broadening out the tweets to all those from accounts we were capturing during the sample we can see the most active tweeters, and also what proportion are original content vs retweets:

Turning our attention to the blog hits, it’s easy to break it down by top five articles in a period, accesses by day:

Having combined (dare I say, mashed up) post metadata with apache logs, we can overlay information about which author gets the most hits. Unsurprisingly Mark Rittman gets the lion’s share, but interestingly Venkat, who has not blogged for quite a while, is still in the top three authors (based on blog page hits) in the time period analysed:

It’s in the lack of a table visualisation that Kibana 4 is currently limited (although one is planned), because this analysis (for the top three authors, their respective two most popular posts) just makes no sense as a graph:

but would be nice and easy to read off a table. You can access a table view of sorts from the arrow at the bottom of the screen, but this feels more like a debug option than a first-class method for presenting the data.

Whilst you can access the table on a dashboard, it doesn’t persist as the default option of the view, always showing the graph initially. As noted above, a table visualisation is planned and under development for Kibana 4.

Speaking of dashboards, Kibana 4 has a very nice dashboard builder with interactive resizing of objects both within rows and columns – quite a departure from Kibana 3 which has a rigid system of rows and panels:

Summary

Kibana 3 is great for properly analysing data and trends as you find them in the data, if you don’t mind working your way through the slightly rough interface. In contrast, Kibana 4 has a pretty slick UI but being an early beta is missing features like Term and Table from Kibana 3 that would enable tables of data as well as the pretty graphs. It’ll be great to see how it develops.

Putting the data in Elasticsearch makes it very fast to query. I’m doing this on the Big Data Lite VM, which admittedly is not very representative of a real-world Hadoop cluster, but the relative speeds are interesting – dozens of seconds for any kind of Hive query, subsecond for any kind of Kibana/Elasticsearch query. The latter is of course very appealing from a data exploration point of view, because you not only have the speed but also the visualisations, and the interaction with those visuals, to dig and drill further into the data.

Whilst Elasticsearch is extremely fast to query, I’ve not compared it to other options that are designed for speed (eg Impala) and which support a more standard interface, such as ODBC or JDBC so you can bring your own data visualisation tool (eg T-who-shall-not-be-named). In addition, there is the architectural consideration of Elasticsearch’s fit with the rest of the Hadoop stack. Whilst the elasticsearch-hadoop connector is two-way, I’m not sure if you would necessarily site your data in Elasticsearch alone, opting instead to duplicate all or part of it from somewhere like HDFS.

What would be interesting is to look at a similar analysis exercise using the updated Hue Search in CDH 5.2, which uses Apache Solr and is therefore based on the same project as Elasticsearch (Apache Lucene). Another angle on this is Oracle’s forthcoming Big Data Discovery tool, which also looks like it covers a similar purpose.

Auditing OBIEE Presentation Catalog Activity with Custom Log Filters


A question that I’ve noticed coming up a few times on the OBIEE OTN forums goes along the lines of “How can I find out who deleted a report from the Presentation Catalog?”. And whilst the BI Server’s Usage Tracking is superb for auditing who ran what report, we don’t by default have a way of seeing who deleted a report.

The Presentation Catalog (or “Web Catalog” as it was called in 10g) records who created an object and when it was last modified, accessible through both OBIEE’s Catalog view, and the dedicated Catalog Manager tool itself:


But if we want to find out who deleted an object, or maybe who modified it before the most recent person (that is, build up an audit trail of who modified an object) we have to dig a bit deeper.

Presentation Services Log Sources

Perusing the OBIEE product manuals, one will find documented additional Logging in Oracle BI Presentation Services options. This is more than just turning up the log level en masse, because it also includes additional log writers and filters. What this means is that you can have your standard Presentation Services logging, but then configure a separate file to capture more detailed information about just specific goings on within Presentation Services.

Looking at a normal Presentation Services log (in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/) you’ll see various messages by default – greater or fewer depending on the health of your system – but they all include the Location stack trace, such as this one here:

[2014-11-10T06:33:19.000-00:00] [OBIPS] [WARNING:16] [] [saw.soap.soaphelpers.writeiteminfocontents] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000001ede,0:1] [tid: 2569512704] Resolving and writing full ACL for path /shared/Important stuff/Sales by brand[[
File:soaphelpers.cpp
Line:609
Location:
        saw.soap.soaphelpers.writeiteminfocontents
        saw.soap.catalogservice
        saw.SOAP
        saw.httpserver.request.soaprequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales by brand
AuthProps: AuthSchema=UidPwd-soap|PWD=******|UID=weblogic|User=weblogic
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000001ede,0:1
ThreadID: 2569512704

And it is the Location that is of interest to us here, because it’s what gives hints about the types of log messages that can be emitted and that we may want to filter. For example, the one quoted above is evidently something to do with the Presentation Catalog and SOAP, which I’d guess is a result of Catalog Manager (which uses web services/SOAP to access OBIEE).

To get a full listing of all the possible log sources, first set up the BI command line environment with bi-init:

source $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

and then run:

sawserver -logsources

(If you get an error, almost certainly you didn’t set up the command line environment properly with bi-init). You’ll get a list of over a thousand lines (which gives you an idea of quite how powerful this granular logging is). Assuming you’ll want to peruse it at your leisure, it makes sense to write it to disk, which if you’re running this on *nix you can simply do thus:

sawserver -logsources > sawserver.logsources.txt

To find what you want on the list, you can just search through it. Looking for anything related to “catalog” and narrowing it down further, I came up with these interesting sources:

[oracle@demo ~]$ sawserver -logsources|grep catalog|grep local
saw.catalog.item.getlocalized
saw.catalog.local
saw.catalog.local.checkforcatalogupgrade
saw.catalog.local.copyItem
saw.catalog.local.createFolder
saw.catalog.local.createLink
saw.catalog.local.deleteItem
saw.catalog.local.getItemACL
saw.catalog.local.getItemInfo
saw.catalog.local.loadCatalog
saw.catalog.local.moveItem
saw.catalog.local.openObject
saw.catalog.local.readObject
saw.catalog.local.search
saw.catalog.local.setItemACL
saw.catalog.local.setItemInfo
saw.catalog.local.setMaintenanceMode
saw.catalog.local.setOwnership
saw.catalog.local.writeObject

Configuring granular Presentation Services logging

Let us see how to go and set up this additional logging. Remember, this is not the same as just going to Enterprise Manager and bumping the log level to 11 globally – we’re going to retain the default logging level, but for just specific actions that occur within the tool, capture greater information. The documentation for this is here.

The configuration is found in the instanceconfig.xml file, so like all good sysadmins let’s take a backup first:

cd $FMW_HOME/instances/instance1/config/OracleBIPresentationServicesComponent/coreapplication_obips1/
cp instanceconfig.xml instanceconfig.xml.20141110

Now depending on your poison, open the instanceconfig.xml directly in a text editor from the command line, or copy it to a desktop environment where you can open it in your favourite text editor there. Either way, these are the changes we’re going to make:

  1. Locate the <Logging> section. Note that within it there are three child entities – <Writers>, <WriterClassGroups> and <Filters>. We’re going to add an entry to each.
  2. Under <Writers>, add:

    <Writer implementation="FileLogWriter" name="RM Presentation Catalog Audit" disableCentralControl="true" writerClassId="6" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="rm_pres_cat_audit" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>

    This defines a new writer that will write logs to disk (FileLogWriter), in 10MB files of which it’ll keep 10. If you’re defining additional Writers, make sure they have a unique writerClassId. See the docs for detailed syntax.

  3. Under <WriterClassGroups> add:

    <WriterClassGroup name="RMLog">6</WriterClassGroup>

    This defines the RMLog class group as being associated with writerClassId 6 (as defined above), and is used in the Filters section to direct logs. If you wanted you could log entries to multiple logs (eg both file and console) this way.

  4. Under <Filters> add:

    <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.moveItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
    <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.deleteItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>

    Here we’re defining two event filters, with levels turned up to max (32), directing the capture of any occurrences to the RMLog writerClassGroup.

After making the changes to instanceconfig.xml, restart Presentation Services:

$FMW_HOME/instances/instance1/bin/opmnctl restartproc ias-component=coreapplication_obips1

Here’s the completed instanceconfig.xml from the top of the file through to the end of the <Logging> section, with my changes overlaid on the defaults:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Oracle Business Intelligence Presentation Services Configuration File -->
<WebConfig xmlns="oracle.bi.presentation.services/config/v1.1">
   <ServerInstance>

      <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><CatalogPath>/app/oracle/biee/instances/instance1/SampleAppWebcat</CatalogPath>

      <DSN>AnalyticsWeb</DSN>

      <Logging>

         <Writers>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="FileLogWriter" name="Global File Logger" writerClassId="1" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="sawlog" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="CoutWriter" name="Console Logger" writerClassId="2" maxFileSizeKb="10240"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><Writer implementation="EventLogWriter" name="System Event Logger" writerClassId="3" maxFileSizeKb="10240"/>
            <!--  The following writer is not centrally controlled -->
            <Writer implementation="FileLogWriter" name="Webcat Upgrade Logger" disableCentralControl="true" writerClassId="5" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="webcatupgrade" maxFileSizeKb="2147483647" filesN="1" fmtName="ODL-Text"/>
            <Writer implementation="FileLogWriter" name="RM Presentation Catalog Audit" disableCentralControl="true" writerClassId="6" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}" filePrefix="rm_pres_cat_audit" maxFileSizeKb="10240" filesN="10" fmtName="ODL-Text"/>
         </Writers>

         <WriterClassGroups>
            <WriterClassGroup name="All">1,2,3,5,6</WriterClassGroup>
            <WriterClassGroup name="File">1</WriterClassGroup>
            <WriterClassGroup name="Console">2</WriterClassGroup>
            <WriterClassGroup name="EventLog">3</WriterClassGroup>
            <WriterClassGroup name="UpgradeLogFile">5</WriterClassGroup>
            <WriterClassGroup name="RMLog">6</WriterClassGroup>
         </WriterClassGroups>

         <Filters>
            <!--  These FilterRecords are updated by centrally controlled configuration -->
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><FilterRecord writerClassGroup="File" path="saw" information="1" warning="31" error="31" trace="0" incident_error="1"/>
            <!--This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control--><FilterRecord writerClassGroup="File" path="saw.mktgsqlsubsystem.joblog" information="1" warning="31" error="31" trace="0" incident_error="1"/>

            <!--  The following FilterRecords are not centrally controlled -->
            <FilterRecord writerClassGroup="UpgradeLogFile" disableCentralControl="true" path="saw.subsystem.catalog.initialize.upgrade" information="1" warning="32" error="32" trace="1" incident_error="32"/>
            <FilterRecord writerClassGroup="UpgradeLogFile" disableCentralControl="true" path="saw.subsystem.catalog.upgrade" information="1" warning="32" error="32" trace="1" incident_error="32"/>
            <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.moveItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
            <FilterRecord writerClassGroup="RMLog" disableCentralControl="true" path="saw.catalog.local.deleteItem" information="32" warning="32" error="32" trace="32" incident_error="32"/>
         </Filters>

      </Logging>

[...]
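
One thing worth doing after any change to instanceconfig.xml is checking that Presentation Services actually came back up cleanly – malformed XML is the most common slip. A quick sketch, tailing the standard sawlog in the instance log directory referenced earlier, which will show a parse error if the file is invalid:

tail -n 50 $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/sawlog*.log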

Granular logging in action

Having restarted Presentation Services after making the above change, I can see in my new log file whenever an item from the Presentation Catalog is deleted, by whom, and from what IP address:

[2014-11-10T07:13:36.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.deleteItem] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002cf1,0:1] [tid: 2458068736] Succeeded with '/shared/Important stuff/Sales by brand 2'[[
File:localwebcatalog.cpp
Line:626
Location:
        saw.catalog.local.deleteItem
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales by brand 2
SessionID: p8n6ojs0vkh7tou0mkstmlc9me381hadm9o1fui
AuthProps: AuthSchema=UidPwd|PWD=******|UID=r.mellie|User=r.mellie
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002cf1,0:1
ThreadID: 2458068736
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: action='rm',_scid='QR5zMdHIL3JsW1b67P9p',icharset='utf-8',urlGenerator='qualified',paths='["/shared/Important stuff/Sales by brand 2"]'
]]

And the same for when a file is moved/renamed:

[2014-11-10T07:28:17.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.moveItem] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000003265,0:1] [tid: 637863680] Source '/shared/Important stuff/copy of Sales by brand', Destination '/shared/Important stuff/Sales by brand 2': Succeeded with '/shared/Important stuff/copy of Sales by brand'[[
File:localwebcatalog.cpp
Line:1186
Location:
        saw.catalog.local.moveItem
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/copy of Sales by brand
SessionID: ddt6eo7llcm0ohs5e2oivddj7rtrhn8i41a7f32
AuthProps: AuthSchema=UidPwd|PWD=******|UID=f.saunders|User=f.saunders
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000003265,0:1
ThreadID: 637863680
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: path='/shared/Important stuff/copy of Sales by brand',action='ren',_scid='84mO8SRViXlwJ*180HV7',name='Sales by brand 2',keepLink='f',icharset='utf-8',urlGenerator='qualified'
]]
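
Since the audit events now land in their own dedicated file, pulling together a rough “who deleted what, and from where” report is just a grep away. A sketch – the directory is the standard Presentation Services log path shown earlier, and the file prefix is the one we configured above:

cd $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/
# deletions, along with the user and IP address that issued them
grep -A 20 'saw.catalog.local.deleteItem' rm_pres_cat_audit*.log | grep -E 'deleteItem|AuthProps|RemoteIP'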

Be careful with your logging

Just because you can log everything, don’t be tempted to actually log everything. Bear in mind that we’re crossing over from simple end-user logging here into the very depths of the sawserver (Presentation Services) code, accessing logging that is extremely diagnostic in nature. Which for our specific purpose of tracking when someone deletes an object from the Presentation Catalog is handy. But as an example, if you enable saw.catalog.local.writeObject event logging, you may think that it will record who changed a report when, and that might be useful. But – look at what gets logged every time someone saves a report:

[2014-11-10T07:19:32.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.writeObject] [ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002efb,0:1] [tid: 2454759168] Succeeded with '/shared/Important stuff/Sales 01'[[
File:localwebcatalog.cpp
Line:1476
Location:
        saw.catalog.local.writeObject
        saw.httpserver.processrequest
        saw.rpc.server.responder
        saw.rpc.server
        saw.rpc.server.handleConnection
        saw.rpc.server.dispatch
        saw.threadpool.socketrpcserver
        saw.threads
Path: /shared/Important stuff/Sales 01
SessionID: p8n6ojs0vkh7tou0mkstmlc9me381hadm9o1fui
AuthProps: AuthSchema=UidPwd|PWD=******|UID=r.mellie|User=r.mellie
ecid: 11d1def534ea1be0:15826b4a:14996b86fbb:-8000-0000000000002efb,0:1
ThreadID: 2454759168
HttpCommand: CatalogTreeModel
RemoteIP: 192.168.57.1
HttpArgs: path='/shared/Important stuff/Sales 01',action='wr',_scid='QR5zMdHIL3JsW1b67P9p',repl='t',followLinks='t',icharset='utf-8',modifiedTime='1415600931000',data='<saw:report xmlns:saw="com.siebel.analytics.web/report/v1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sawx="com.siebel.analytics.web/expression/v1.1" xmlVersion="201201160"><saw:criteria xsi:type="saw:simpleCriteria" subjectArea="&quot;A - Sample Sales&quot;" withinHierarchy="true"><saw:columns><saw:column xsi:type="saw:regularColumn" columnID="c1dff1637cbc77948"><saw:columnFormula><sawx:expr xsi:type="sawx:sqlExpression">"Time"."T05 Per Name Year"</sawx:expr></saw:columnFormula></saw:column></saw:columns></saw:criteria><saw:views currentView="0"><saw:view xsi:type="saw:compoundView" name="compoundView!1"><saw:cvTable><saw:cvRow><saw:cvCell viewName="titleView!1"><saw:displayFormat><saw:formatSpec/></saw:displayFormat></saw:cvCell></saw:cvRow><saw:cvRow><saw:cvCell viewName="tableView!1"><saw:displayFormat><saw:formatSpec/></saw:displayFormat></saw:cvCell></saw:cvRow></saw:cvTable></saw:view><saw:view xsi:type="saw:titleView" name="titleView!1"/><saw:view xsi:type="saw:tableView" name="tableView!1" scrollingEnabled="false"><saw:edges><saw:edge axis="page" showColumnHeader="true"/><saw:edge axis="section"/><saw:edge axis="row" showColumnHeader="true"><saw:edgeLayers><saw:edgeLayer type="column" columnID="c1dff1637cbc77948"/></saw:edgeLayers></saw:edge><saw:edge axis="column" showColumnHeader="rollover"/></saw:edges></saw:view></saw:views></saw:report>',sig='queryitem1'
]]

It’s the whole report definition! And this is a very very small report – real life reports can be page after page of XML. That is not a good level at which to be recording this information. If you want to retain this kind of control over who is saving what report, you should maybe be looking at authorisation groups for your users in terms of where they can save reports, and have trusted ‘gatekeepers’ for important areas.

As well as the verbose report capture with the writeObject event, you also get this background chatter:

[2014-11-10T07:20:27.000-00:00] [OBIPS] [TRACE:1] [] [saw.catalog.local.writeObject] [ecid: 0051rj7FmC3Fw000jzwkno0007PK000000,0:200] [tid: 3034580736] Succeeded with '/users/r.mellie/_prefs/volatileuserdata'[[
File:localwebcatalog.cpp
Line:1476
Location:
        saw.catalog.local.writeObject
        saw.subsystem.security.cleanup
        saw.Sessions.cache.cleanup
        saw.taskScheduler.processJob
        taskscheduler
        saw.threads
Path: /users/r.mellie/_prefs/volatileuserdata
ecid: 0051rj7FmC3Fw000jzwkno0007PK000000,0:200
ThreadID: 3034580736
task: Cache/Sessions
]]

volatileuserdata is presumably just that (user data that is volatile, constantly changing) and not something that would be of interest to anyone to log – but you can’t capture actual report writes without capturing this too. On a busy system you’re going to be unnecessarily thrashing the log files if you capture this event by routine – so don’t!

Summary

The detailed information is there for the taking in Presentation Services’ excellent granular log sources – just be careful what you capture lest you bite off more than you can chew.


OBIEE SampleApp v406 Amazon EC2 AMI – available for public use


I wrote a while ago about converting Oracle’s superb OBIEE SampleApp from a VirtualBox image into an EC2-hosted instance. I’m pleased to announce that Oracle have agreed for us to make the image (AMI) on Amazon available publicly. This means that anyone who wants to run their own SampleApp v406 server on Amazon’s EC2 cloud service can do so.


Important caveats

Before getting to the juicy stuff there are some important points to note about access to the AMI, which you are implicitly bound by if you use it:

  1. In accessing it you’re bound by the same terms and conditions that govern the original SampleApp
  2. SampleApp is only ever for use in your own development/testing/prototyping/demonstrating with OBIEE. It must not be used as the basis for any kind of Productionisation.
  3. Neither Oracle nor Rittman Mead provide any support for SampleApp or the AMI, nor any warranty against issues caused through their use.
  4. Once launched, the server will be accessible to the public and it’s your responsibility to secure it as such.

How does it work?

  1. Create yourself an AWS account, if you haven’t already. You’ll need your credit card for this. Read more about getting started with AWS here.
  2. Request access to the AMI (below)
  3. Launch the AMI on your AWS account
  4. Everything starts up automagically. After 15-20 minutes, enjoy your fully functioning SampleApp v406 instance, running in the cloud!

How much does it cost?

You can get an estimate of the cost involved using the Amazon Calculator.

As a rough guide, as of November 2014 an “m3.large” instance costs around $4 a day  – but it’s your responsibility to check pricing and commitments.

Be aware that once a server is created you’ll incur costs on it right through until you “terminate” it. You can “stop” it (in effect, power it off) which reduces the running costs but you’ll still pay for the ‘disk’ (EBS volume) that holds it. The benefit of this though is that you can then power it back up and it’ll be as you left it (just with a different IP).

You can track your AWS usage through the AWS page here.

Security

  • Access to the instance’s command line is through SSH as the oracle user using SSH keys only (provided by you when you launch the server) – no password access
    • You cannot ssh to the server as root; instead connect as oracle and use sudo as required.
    • The ssh key does not get set up until the very end of the first boot sequence, which can be 20 minutes. Be patient!
  • All the OBIEE/WebLogic usernames and passwords are per the stock SampleApp v406 image, so you are well advised to change them. Otherwise if someone finds your instance running, they’ll be able to access it
  • There is no firewall (iptables) running on the server. Since this is a public server you’d be wise to make use of Amazon’s Security Group functionality (in effect, a firewall at the virtual hardware level) to block access on all ports except those necessary.
    For example, you could block all traffic except 7780, and then enable access on port 22 (SSH) and 7001 (Admin Server) just when you need to access it for admin – see the AWS CLI sketch just after this list.
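
If you prefer the command line to the AWS console for this, the same can be done with the AWS CLI. This is only a sketch – it assumes you have the AWS CLI installed and configured for your account, the group name is a placeholder of my own invention, and 203.0.113.10 stands in for your own IP address (in a non-default VPC you’d use --group-id rather than --group-name):

aws ec2 create-security-group --group-name sampleapp-v406 --description "OBIEE SampleApp v406"
# open analytics to the world (or better, restrict the CIDR to your own address range)
aws ec2 authorize-security-group-ingress --group-name sampleapp-v406 --protocol tcp --port 7780 --cidr 0.0.0.0/0
# open SSH and the Admin Server only to your own IP, just while you need them
aws ec2 authorize-security-group-ingress --group-name sampleapp-v406 --protocol tcp --port 22 --cidr 203.0.113.10/32
aws ec2 authorize-security-group-ingress --group-name sampleapp-v406 --protocol tcp --port 7001 --cidr 203.0.113.10/32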

Using the AMI

  1. You first need to get access to the AMI, through the form below. You also need an active AWS account.
  2. Launch the server:
    1. From the AWS AMI page locate the SampleApp AMI using the details provided when you request access through the form below. Make sure you are on the Ireland/eu-west-1 region. Click Launch.
    2. Select an Instance Type. An “m3.large” size is a good starting point (this site is useful to see the spec of all instances).
    3. Click through the Configure Instance Details, Add Storage, and Tag Instance screens without making changes unless you need to.
    4. On the Security Group page select either a dedicated security group if you have already configured one, or create a new one.
      A security group is a firewall that controls traffic to the server regardless of any software firewall configured or not on the instance. By default only port 22 (SSH) is open, so you’ll need to open at least 7780 for analytics, and 7001 too if you want to access WLS/EM as well
      Note that you can amend a security group’s rules once the instance is created, but you cannot change which security group it is bound to. For ad-hoc purposes I’d always use a dedicated security group per instance so that you can change rules just for your server without impacting others on your account.
    5. Click on Review and Launch, check what you’ve specified, and then click Launch. You’ll now need to either specify an existing SSH key pair, or generate a new one. It’s vital that you get this bit right, otherwise you’ll not be able to access the server. If you generate a new key pair, make sure you download it (it’ll be a .pem file).
    6. Click Launch Instances
      You’ll get a hyperlinked Instance ID; click on that and it’ll take you to the Instances page filtered for your new server.
      Shortly you’ll see the server’s public IP address shown.
  3. OBIEE is configured to start automagically at boot time along with the database. This means that in theory you don’t need to actually access the server directly. It does take 15-20 minutes on first boot to all fire up though, so be patient.
  4. The managed server is listening on port 7780, and admin server on 7001. If your server IP is 42.42.42.42 the URLs would be:
    • Analytics: http://42.42.42.42:7780/analytics
    • WLS: http://42.42.42.42:7001/console
    • EM: http://42.42.42.42:7001/em

On the server

The server is a stock SampleApp v406 image, with a few extras:

  • obiee and dbora services configured and set to run at bootup. Control obiee using:
    sudo service obiee status
    sudo service obiee stop
    sudo service obiee start
    sudo service obiee restart
  • screen installed with a .screenrc setup

Accessing the AMI

To get access to the AMI, please complete this short form and we will send you the AMI details by email.

By completing the form and requesting access to the AMI, you are acknowledging that you have read and understood the terms and conditions set out by Oracle here.

Linux cluster sysadmin — SSH keys


In this short series of blog posts I’m going to take a look at a few very useful tools that can make your life as the sysadmin of a cluster of Linux machines easier. This may be a Hadoop cluster, or just a plain simple set of ‘normal’ machines on which you want to run the same commands and monitoring.

To start with, we’re going to use the ever-awesome ssh keys to manage security on the cluster. After that we’ll look at executing the same command across multiple machines at the same time using PDSH, and then monitoring OS metrics across a cluster with colmux.

In a nutshell, ssh keys enable us to do password-less authentication in a secure way. You can find a detailed explanation of them in a previous post that I wrote, tips and tricks for OBIEE Linux sysadmin. Beyond the obvious time-saving function of not having to enter a password each time we connect to a machine, having SSH keys in place enables the use of the tools we discuss later, pdsh and colmux.

Working with SSH keys involves taking the public key from a pair, and adding that to another machine in order to allow the owner of the pair’s private key to access that machine. What we’re going to do here is generate a unique key pair that will be used as the identity across the cluster. So each node will have a copy of the private key, in order to be able to authenticate to any other node, which will be holding a copy of the public key (as well as, in turn, the same private key).

In this example I’m going to use my own client machine to connect to the cluster. You could easily use any of the cluster nodes too if a local machine would not be appropriate.
As a side-note, this is another reason why I love the fact that the Rittman Mead standard-issue laptop is a MacBook, and just under the covers of Mac OS is a *nix-based command line, meaning that a lot of sysadmin work can be done natively without needing the additional tools that you would on Windows (e.g. PuTTY, WinSCP, Pageant, etc).

SSH key strategy

We’ve several ways we could implement the SSH keys. Because it’s a purely sandbox cluster, I could use the same SSH key pair that I generate for the cluster on my machine too, so the same public/private key pair is distributed thus:

If we wanted a bit more security, a better approach might be to distribute my personal SSH key’s public key across the cluster too, and leave the cluster’s private key to truly identify cluster nodes alone. An additional benefit of this approach is that the client does not need to hold a copy of the cluster’s SSH private key, instead just continuing to use their own.

For completeness, the extreme version of the key strategy would be for each machine to have its own ssh key pair (i.e. its own security identity), with the corresponding public keys distributed to the other nodes in the cluster:

But anyway, here we’re using the second option – a unique keypair used across the cluster and the client’s public ssh key distributed across the cluster too.

Generating the SSH key pair

First, we need to generate the key. I’m going to create a folder to hold it first, because in a moment we’re going to push it and a couple of other files out to all the servers in the cluster and it’s easiest to do this from a single folder.

mkdir /tmp/rnmcluster02-ssh-keys

Note that in the ssh-keygen command below I’m specifying the target path for the key with the -f argument; if you don’t then watch out that you don’t accidentally overwrite your own key pair in the default path of ~/.ssh.

The -q -N "" flags instruct the key generation to use no passphrase for the key and to not prompt for it either. This is the lowest friction approach (you don’t need to unlock the ssh key with a passphrase before use) but also the least secure. If you’re setting up access to a machine where security matters then bear in mind that without a passphrase on an ssh key anyone who obtains it can therefore access any machine to which the key has been granted access (i.e. on which its public key has been deployed).

ssh-keygen -f /tmp/rnmcluster02-ssh-keys/id_rsa -q -N ""

This generates in the tmp folder two files – the private and public (.pub) keys of the pair:

robin@RNMMBP ~ $ ls -l /tmp/rnmcluster02-ssh-keys
total 16
-rw-------  1 robin  wheel  1675 30 Nov 17:28 id_rsa
-rw-r--r--  1 robin  wheel   400 30 Nov 17:28 id_rsa.pub

Preparing the authorized_keys file

Now we’ll prepare the authorized_keys file which is where the public SSH key of any identity permitted to access the machine is stored. Note that each user on a machine has their own authorized_keys file, in ~/.ssh/. So for example, the root user has the file in /root/.ssh/authorized_keys and any public key listed in that file will be able to connect to the server as the root user. Be aware of the American [mis-]spelling of “authorized” – spell it [correctly] as “authorised” and you’ll not get any obvious errors, but the ssh key login won’t work either.

So we’re going to copy the public key of the unique pair that we just created for the cluster into the authorized_keys file. In addition we will copy in our own personal ssh key (and any other public key that we want to give access to all the nodes in the cluster):

cp /tmp/rnmcluster02-ssh-keys/id_rsa.pub /tmp/rnmcluster02-ssh-keys/authorized_keys
# [optional] Now add any other keys (such as your own) into the authorized_keys file just created
cat ~/.ssh/id_rsa.pub >> /tmp/rnmcluster02-ssh-keys/authorized_keys
# NB make sure the previous step is a double >> not > since the double appends to the file, a single overwrites.

Distributing the SSH artefacts

Now we’re going to push this set of SSH files out to the .ssh folder of the target user on each node, which in this case is the root user. From a security point of view it’s probably better to use a non-root user for login and then sudo as required, but we’re keeping things simple (and less secure) to start with here. So the files in our folder are:

  • id_rsa – the private key of the key pair
  • id_rsa.pub – the public key of the key pair. Strictly speaking this doesn’t need distributing to all nodes, but it’s conventional and handy to hold it alongside the private key.
  • authorized_keys – this is the file that the sshd daemon on each node will look at to validate an incoming login request’s offered private key, and so needs to hold the public key of anyone who is allowed to access the machine as this user.

To copy the files we’ll use scp, but how you get them in place doesn’t really matter so much, so long as they get to the right place:

scp -r /tmp/rnmcluster02-ssh-keys root@rnmcluster02-node01:~/.ssh

At this point you’ll need to enter the password for the target user, but rejoice! This is the last time you’ll need to enter it as subsequent logins will be authenticated using the ssh keys that you’re now configuring.

Run the scp for all nodes in the cluster. If you’ve four nodes in the cluster your output should look something like this:

$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node01:~/.ssh
root@rnmcluster02-node01's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node02:~/.ssh
Warning: Permanently added the RSA host key for IP address '172.28.128.7' to the list of known hosts.
root@rnmcluster02-node02's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node03:~/.ssh
root@rnmcluster02-node03's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
$ scp -r /tmp/rnmcluster02-ssh-keys/ root@rnmcluster02-node04:~/.ssh
root@rnmcluster02-node04's password:
authorized_keys                                                  100%  781     0.8KB/s   00:00
id_rsa                                                           100% 1675     1.6KB/s   00:00
id_rsa.pub                                                       100%  400     0.4KB/s   00:00
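
Rather than typing out the scp for each node by hand, a simple loop does the job – a sketch, assuming the four-node naming pattern used in this example (you’ll still be prompted for the password once per node at this stage):

for n in 01 02 03 04; do
  scp -r /tmp/rnmcluster02-ssh-keys root@rnmcluster02-node$n:~/.ssh
done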

Testing login authenticated through SSH keys

The moment of truth. From your client machine, try to ssh to each of the cluster nodes. If you are prompted for a password, then something is not right – see the troubleshooting section below.

If you put your own public key in authorized_keys when you created it then you don’t need to specify which key to use when connecting because it’ll use your own private key by default:

robin@RNMMBP ~ $ ssh root@rnmcluster02-node01
Last login: Fri Nov 28 17:13:23 2014 from 172.28.128.1



[root@localhost ~]#

There we go – logged in automagically with no password prompt. If we’re using the cluster’s private key (rather than our own) you need to specify it with -i when you connect.

robin@RNMMBP ~ $ ssh -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01
Last login: Fri Nov 28 17:13:23 2014 from 172.28.128.1



[root@localhost ~]#

Troubleshooting SSH key connections

SSH keys are one of the best things in a sysadmin’s toolkit, but when they don’t work they can be a bit tricky to sort out. The first thing to check is that on the target machine the authorized_keys file that does all the magic (by listing the ssh keys that are permitted to connect inbound on a host to the given user) is in place:

[root@localhost .ssh]# ls -l ~/.ssh/authorized_keys
-rw-r--r-- 1 root root 775 Nov 30 18:55 /root/.ssh/authorized_keys

If you get this:

[root@localhost .ssh]# ls -l ~/.ssh/authorized_keys
ls: cannot access /root/.ssh/authorized_keys: No such file or directory

then you have a problem.

One possible issue in this specific instance could be that the above pre-canned scp assumes that the user’s .ssh folder doesn’t already exist (since it doesn’t, on brand new servers) and so specifies it as the target name for the whole rnmcluster02-ssh-keys folder. However if it does already exist then it ends up copying the rnmcluster02-ssh-keys folder into the .ssh folder:

[root@localhost .ssh]# ls -lR
.:
total 12
-rw------- 1 root root 1675 Nov 22  2013 id_rsa
-rw-r--r-- 1 root root  394 Nov 22  2013 id_rsa.pub
drwxr-xr-x 2 root root 4096 Nov 30 18:49 rnmcluster02-ssh-keys

./rnmcluster02-ssh-keys:
total 12
-rw-r--r-- 1 root root  775 Nov 30 18:49 authorized_keys
-rw------- 1 root root 1675 Nov 30 18:49 id_rsa
-rw-r--r-- 1 root root  394 Nov 30 18:49 id_rsa.pub
[root@localhost .ssh]#

To fix this simply move the authorized_keys from rnmcluster02-ssh-keys back into .ssh:

[root@localhost .ssh]# mv ~/.ssh/rnmcluster02-ssh-keys/authorized_keys ~/.ssh/

Other frequent causes of problems are file/folder permissions that are too lax on the target user’s .ssh folder (which can be fixed with chmod -R 700 ~/.ssh) or the connecting user’s ssh private key (fix: chmod 600 id_rsa). The latter will show on connection attempts very clearly:

robin@RNMMBP ~ $ ssh -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0777 for '/tmp/rnmcluster02-ssh-keys/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /tmp/rnmcluster02-ssh-keys/id_rsa
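
Putting the permissions right is quick – a sketch using the same paths as in this example:

# on each target node, as the target user
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# on the client, protect the private key before offering it
chmod 600 /tmp/rnmcluster02-ssh-keys/id_rsa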

Another one that has bitten me twice over time – and that eludes the troubleshooting I’ll demonstrate in a moment – is that SELinux gets stroppy about root access using ssh keys. I always just take this as a handy reminder to disable selinux (in /etc/selinux/config, set SELINUX=disabled), having never had cause to leave it enabled. But, if you do need it enabled you’ll need to hit the interwebs to check the exact cause/solution for this problem.
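
If you suspect SELinux is the culprit, checking (and, if you’re happy to, disabling) it is straightforward – a sketch:

# check the current mode
getenforce
# switch to permissive for the current session only
setenforce 0
# to make it permanent, edit /etc/selinux/config, set SELINUX=disabled, and reboot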

So to troubleshoot ssh key problems in general, do two things. Firstly, from the client side, specify verbosity (-v for a bit of verbosity, -vvv for most):

ssh -v -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01

You should observe ssh trying to use the private key, and if the server rejects it it’ll fall back to any other ssh private keys it can find, and then password authentication:

[...]
debug1: Offering RSA public key: /tmp/rnmcluster02-ssh-keys/id_rsa
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug1: Next authentication method: password

Quite often the problem will be on the server side, so assuming that you can still connect to the server (eg through the physical console, or using password authentication) then go and check /var/log/secure where you’ll see all logs relating to attempted connections. Here’s the log file corresponding to the above client log, where ssh key authentication is attempted but fails, and then password authentication is used to successfully connect:

Nov 30 18:15:05 localhost sshd[13156]: Authentication refused: bad ownership or modes for file /root/.ssh/authorized_keys
Nov 30 18:15:15 localhost sshd[13156]: Accepted password for root from 172.28.128.1 port 59305 ssh2
Nov 30 18:15:15 localhost sshd[13156]: pam_unix(sshd:session): session opened for user root by (uid=0)

Now we can see clearly what the problem is – “bad ownership or modes for file /root/.ssh/authorized_keys”.
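
A fix along these lines, run on the target server as the user you’re connecting as, should sort it (a sketch rather than a prescription – the exact ownership depends on which account you’re using):

chown -R root:root ~/.ssh          # the connecting user must own the folder and its contents
chmod 700 ~/.ssh                   # the folder must not be accessible to other users
chmod 600 ~/.ssh/authorized_keys   # nor the authorized_keys file itself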

The last roll of the troubleshooting dice is to get sshd (the ssh daemon that runs on the host we’re trying to connect to) to issue more verbose logs. You can either set LogLevel DEBUG1 (or DEBUG2, or DEBUG3) in /etc/ssh/sshd_config and restart the ssh daemon (service sshd restart), or you can actually run a (second) ssh daemon from the host with specific logging. This would be appropriate on a multi-user server where you can’t just go changing sshd configuration. To run a second instance of sshd you’d use:

/usr/sbin/sshd -D -d -p 2222

You have to run sshd from an absolute path (you’ll get told this if you try not to). The -D flag stops it running as a daemon and instead runs interactively, so we can see easily all the output from it. -d specifies the debug logging (-dd or -ddd for greater levels of verbosity), and -p 2222 tells sshd to listen on port 2222. Since we’re doing this on top of the existing sshd, we obviously can’t use the default ssh port (22) so pick another port that is available (and not blocked by a firewall).

Now on the client retry the connection, but pointing to the port of the interactive sshd instance:

ssh -v -p 2222 -i /tmp/rnmcluster02-ssh-keys/id_rsa root@rnmcluster02-node01

When you run the command on the client you should get both the client and host machine debug output go crackers for a second, giving you plenty of diagnostics to pore through and analyse the ssh handshake etc to get to the root of the issue.

Hopefully you’ve now sorted your SSH keys, because in the next article we’re going to see how we can use them to run commands against multiple servers at once using pdsh.

Summary

When working with multiple Linux machines I would first and foremost make sure SSH keys are set up in order to ease management through password-less logins.

We’ll see in the next couple of articles some other tools that are useful when working on a cluster:

  • pdsh
  • colmux

I’m interested in what you think – what particular tools or tips do you have for working with a cluster of Linux machines? Leave your answers in the comments below, or tweet them to me at @rmoff.

Linux cluster sysadmin — Parallel command execution with PDSH


In this series of blog posts I’m taking a look at a few very useful tools that can make your life as the sysadmin of a cluster of Linux machines easier. This may be a Hadoop cluster, or just a plain simple set of ‘normal’ machines on which you want to run the same commands and monitoring.

Previously we looked at using SSH keys for inter-machine authorisation, which is a pre-requisite for what we’ll look at here — executing the same command across multiple machines using PDSH. In the next post of the series we’ll see how we can monitor OS metrics across a cluster with colmux.

PDSH is a very smart little tool that enables you to issue the same command on multiple hosts at once, and see the output. You need to have set up ssh key authentication from the client to each host, so if you followed the steps in the previous article you’ll be good to go.

The syntax for using it is nice and simple:

  • -w specifies the addresses. You can use numerical ranges [1-4] and/or comma-separated lists of hosts. If you want to connect as a user other than the current user on the calling machine, you can specify it here (or as a separate -l argument)
  • After that is the command to run.

For example, running it against a small cluster of four machines that I have:

robin@RNMMBP $ pdsh -w root@rnmcluster02-node0[1-4] date

rnmcluster02-node01: Fri Nov 28 17:26:17 GMT 2014
rnmcluster02-node02: Fri Nov 28 17:26:18 GMT 2014
rnmcluster02-node03: Fri Nov 28 17:26:18 GMT 2014
rnmcluster02-node04: Fri Nov 28 17:26:18 GMT 2014

PDSH can be installed on the Mac under Homebrew (did I mention that Rittman Mead laptops are Macs, so I can do all of this straight from my work machine… :-) )

brew install pdsh

And if you want to run it on Linux, install it from the EPEL yum repository (RHEL-compatible, but packages for other distros are available):

yum install pdsh

You can run it from a cluster node, or from your client machine (assuming your client machine is mac/linux).
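
As an aside, if you get bored of typing the -w argument every time, pdsh can read a default list of target hosts from a file named in the WCOLL environment variable. A quick sketch – the file path here is just an example, and I’m using -l to specify the remote user:

cat > ~/rnmcluster02.hosts <<EOF
rnmcluster02-node01
rnmcluster02-node02
rnmcluster02-node03
rnmcluster02-node04
EOF
export WCOLL=~/rnmcluster02.hosts
pdsh -l root date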

Example – install and start collectl on all nodes

I started looking into pdsh when it came to setting up a cluster of machines from scratch. One of the must-have tools I like to have on any machine that I work with is the excellent collectl. This is an OS resource monitoring tool that I initially learnt of through Kevin Closson and Greg Rahn, and provides the kind of information you’d get from top etc – and then some! It can run interactively, log to disk, run as a service – and it also happens to integrate very nicely with graphite, making it a no-brainer choice for any server.

So, instead of logging into each box individually I could instead run this:

pdsh -w root@rnmcluster02-node0[1-4] yum install -y collectl
pdsh -w root@rnmcluster02-node0[1-4] service collectl start
pdsh -w root@rnmcluster02-node0[1-4] chkconfig collectl on

Yes, I know there are tools out there like puppet and chef that are designed for doing this kind of templated build of multiple servers, but the point I want to illustrate here is that pdsh enables you to do ad-hoc changes to a set of servers at once. Sure, once I have my cluster built and want to create an image/template for future builds, then it would be daft if I were building the whole lot through pdsh-distributed yum commands.

Example – setting up the date/timezone/NTPD

Often the accuracy of the clock on each server in a cluster is crucial, and we can easily set this up with pdsh:

Install the packages:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] yum install -y ntp ntpdate

Set the timezone:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] ln -sf /usr/share/zoneinfo/Europe/London /etc/localtime

Force a time refresh:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] ntpdate pool.ntp.org
rnmcluster02-node03: 30 Nov 20:46:22 ntpdate[27610]: step time server 176.58.109.199 offset -2.928585 sec
rnmcluster02-node02: 30 Nov 20:46:22 ntpdate[28527]: step time server 176.58.109.199 offset -2.946021 sec
rnmcluster02-node04: 30 Nov 20:46:22 ntpdate[27615]: step time server 129.250.35.250 offset -2.915713 sec
rnmcluster02-node01: 30 Nov 20:46:25 ntpdate[29316]: 178.79.160.57 rate limit response from server.
rnmcluster02-node01: 30 Nov 20:46:22 ntpdate[29316]: step time server 176.58.109.199 offset -2.925016 sec

Set NTPD to start automatically at boot:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] chkconfig ntpd on

Start NTPD:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] service ntpd start

Example – using a HEREDOC (here-document) and sending quotation marks in a command with PDSH

Here documents (heredocs) are a nice way to embed multi-line content in a single command, enabling the scripting of a file creation rather than the clumsy instruction to “open an editor and paste the following lines into it and save the file as /foo/bar”.

Fortunately heredocs work just fine with pdsh, so long as you remember to enclose the whole command in quotation marks. And speaking of which, if you need to include quotation marks in your actual command, you need to escape them with a backslash. Here’s an example of both, setting up the configuration file for my ever-favourite gnu screen on all the nodes of the cluster:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] "cat > ~/.screenrc <<EOF
hardstatus alwayslastline \"%{= RY}%H %{kG}%{G} Screen(s): %{c}%w %=%{kG}%c  %D, %M %d %Y  LD:%l\"
startup_message off
msgwait 1
defscrollback 100000
nethack on
EOF
"

Now when I log in to each individual node and run screen, I get a nice toolbar at the bottom:

Combining commands

To combine commands together that you send to each host you can use the standard bash operator semicolon ;

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] "date;sleep 5;date"
rnmcluster02-node01: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node03: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node04: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node02: Sun Nov 30 20:57:06 GMT 2014
rnmcluster02-node01: Sun Nov 30 20:57:11 GMT 2014
rnmcluster02-node03: Sun Nov 30 20:57:11 GMT 2014
rnmcluster02-node04: Sun Nov 30 20:57:11 GMT 2014
rnmcluster02-node02: Sun Nov 30 20:57:11 GMT 2014

Note the use of the quotation marks to enclose the entire command string. Without them the bash interpreter will take the ; as the delimiter of the local commands, and try to run the subsequent commands locally:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node0[1-4] date;sleep 5;date
rnmcluster02-node03: Sun Nov 30 20:57:53 GMT 2014
rnmcluster02-node04: Sun Nov 30 20:57:53 GMT 2014
rnmcluster02-node02: Sun Nov 30 20:57:53 GMT 2014
rnmcluster02-node01: Sun Nov 30 20:57:53 GMT 2014
Sun 30 Nov 2014 20:58:00 GMT

You can also use && and || to run subsequent commands conditionally if the previous one succeeds or fails respectively:

robin@RNMMBP $ pdsh -w root@rnmcluster02-node[01-4] "chkconfig collectl on && service collectl start"

rnmcluster02-node03: Starting collectl: [  OK  ]
rnmcluster02-node02: Starting collectl: [  OK  ]
rnmcluster02-node04: Starting collectl: [  OK  ]
rnmcluster02-node01: Starting collectl: [  OK  ]

Piping and file redirects

Similar to combining commands above, you can pipe the output of commands on the remote hosts; as before, you need to use quotation marks to enclose the whole command string.

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] "chkconfig|grep collectl"
rnmcluster02-node03: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node01: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node04: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node02: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off

However, you can pipe the output from pdsh to a local process if you want:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] chkconfig|grep collectl
rnmcluster02-node02: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node04: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node03: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off
rnmcluster02-node01: collectl           0:off   1:off   2:on    3:on    4:on    5:on    6:off

The difference is that you’ll be shipping the whole of the output across the network in order to process it locally, so if you’re just grepping etc this doesn’t make any sense. For utilities held locally and not on the remote server, though, it might.

File redirects work the same way – within quotation marks and the redirect will be to a file on the remote server, outside of them it’ll be local:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] "chkconfig>/tmp/pdsh.out"
robin@RNMMBP ~ $ ls -l /tmp/pdsh.out
ls: /tmp/pdsh.out: No such file or directory

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] chkconfig>/tmp/pdsh.out
robin@RNMMBP ~ $ ls -l /tmp/pdsh.out
-rw-r--r--  1 robin  wheel  7608 30 Nov 19:23 /tmp/pdsh.out

Cancelling PDSH operations

As you can see from above, the precise syntax of pdsh calls can be hugely important. If you run a command and it appears ‘stuck’, or if you have that heartstopping realisation that the shutdown -h now you meant to run locally you ran across the cluster, you can press Ctrl-C once to see the status of your commands:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] sleep 30
^Cpdsh@RNMMBP: interrupt (one more within 1 sec to abort)
pdsh@RNMMBP:  (^Z within 1 sec to cancel pending threads)
pdsh@RNMMBP: rnmcluster02-node01: command in progress
pdsh@RNMMBP: rnmcluster02-node02: command in progress
pdsh@RNMMBP: rnmcluster02-node03: command in progress
pdsh@RNMMBP: rnmcluster02-node04: command in progress

and press it twice (or within a second of the first) to cancel:

robin@RNMMBP ~ $ pdsh -w root@rnmcluster02-node[01-4] sleep 30
^Cpdsh@RNMMBP: interrupt (one more within 1 sec to abort)
pdsh@RNMMBP:  (^Z within 1 sec to cancel pending threads)
pdsh@RNMMBP: rnmcluster02-node01: command in progress
pdsh@RNMMBP: rnmcluster02-node02: command in progress
pdsh@RNMMBP: rnmcluster02-node03: command in progress
pdsh@RNMMBP: rnmcluster02-node04: command in progress
^Csending SIGTERM to ssh rnmcluster02-node01
sending signal 15 to rnmcluster02-node01 [ssh] pid 26534
sending SIGTERM to ssh rnmcluster02-node02
sending signal 15 to rnmcluster02-node02 [ssh] pid 26535
sending SIGTERM to ssh rnmcluster02-node03
sending signal 15 to rnmcluster02-node03 [ssh] pid 26533
sending SIGTERM to ssh rnmcluster02-node04
sending signal 15 to rnmcluster02-node04 [ssh] pid 26532
pdsh@RNMMBP: interrupt, aborting.

If you’ve got threads yet to run on the remote hosts, but want to keep running whatever has already started, you can use Ctrl-C, Ctrl-Z:

robin@RNMMBP ~ $ pdsh -f 2 -w root@rnmcluster02-node[01-4] "sleep 5;date"
^Cpdsh@RNMMBP: interrupt (one more within 1 sec to abort)
pdsh@RNMMBP:  (^Z within 1 sec to cancel pending threads)
pdsh@RNMMBP: rnmcluster02-node01: command in progress
pdsh@RNMMBP: rnmcluster02-node02: command in progress
^Zpdsh@RNMMBP: Canceled 2 pending threads.
rnmcluster02-node01: Mon Dec  1 21:46:35 GMT 2014
rnmcluster02-node02: Mon Dec  1 21:46:35 GMT 2014

NB the above example illustrates the use of the -f argument to limit how many threads are run against remote hosts at once. We can see the command is left running on the first two nodes and returns the date, whilst the Ctrl-C – Ctrl-Z stops it from being executed on the remaining nodes.

PDSH_SSH_ARGS_APPEND

By default, when you ssh to a new host for the first time you’ll be prompted to validate the remote host’s SSH key fingerprint.

The authenticity of host 'rnmcluster02-node02 (172.28.128.9)' can't be established.
RSA key fingerprint is 00:c0:75:a8:bc:30:cb:8e:b3:8e:e4:29:42:6a:27:1c.
Are you sure you want to continue connecting (yes/no)?

This is one of those prompts that the majority of us just hit enter at and ignore; if that includes you then you will want to make sure that your PDSH call doesn’t fall in a heap because you’re connecting to a bunch of new servers all at once. PDSH is not an interactive tool, so if it requires input from the hosts it’s connecting to it’ll just fail. To avoid this SSH prompt, you can set up the environment variable PDSH_SSH_ARGS_APPEND as follows:

export PDSH_SSH_ARGS_APPEND="-q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"

The -q makes failures less verbose, and the -o passes in a couple of options, StrictHostKeyChecking to disable the above check, and UserKnownHostsFile to stop SSH keeping a list of host IP/hostnames and corresponding SSH fingerprints (by pointing it at /dev/null). You’ll want this if you’re working with VMs that are sharing a pool of IPs and get re-used, otherwise you get this scary failure:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
00:c0:75:a8:bc:30:cb:8e:b3:8e:e4:29:42:6a:27:1c.
Please contact your system administrator.

For both of the above options, make sure you’re aware of the security implications that you’re opening yourself up to. For a sandbox environment I just ignore them; for anything where security is of importance, make sure you are aware of quite which server you are connecting to by SSH, and protect yourself from MitM attacks.

PDSH Reference

You can find out more about PDSH at https://code.google.com/p/pdsh/wiki/UsingPDSH

Summary

When working with multiple Linux machines I would first and foremost make sure SSH keys are set up in order to ease management through password-less logins.

After SSH keys, I would recommend pdsh for parallel execution of the same SSH command across the cluster. It’s a big time saver particularly when initially setting up the cluster given the installation and configuration changes that are inevitably needed.

In the next article of this series we’ll see how the tool colmux is a powerful way to monitor OS metrics across a cluster.

So now your turn – what particular tools or tips do you have for working with a cluster of Linux machines? Leave your answers in the comments below, or tweet them to me at @rmoff.

Linux cluster sysadmin — OS metric monitoring with colmux


In this mini-series of blog posts I’m taking a look at a few very useful tools that can make your life as the sysadmin of a cluster of Linux machines easier. This may be a Hadoop cluster, or just a plain simple set of ‘normal’ machines on which you want to run the same commands and monitoring.

First we looked at using SSH keys for inter-machine authorisation, which is a pre-requisite for executing the same command across multiple machines using PDSH, as well as for what we look at in this article – monitoring OS metrics across a cluster with colmux.

Colmux is written by Mark Seger, the same person who wrote collectl. It makes use of collectl on each target machine to report back OS metrics across a cluster to a single node.


Install collectl across the cluster

Using pdsh we can easily install collectl on each node (if it’s not already), which is a pre-requisite for colmux:

pdsh -w root@rnmcluster02-node0[1-4] "yum install -y collectl && service collectl start && chkconfig collectl on"

NB by enabling the collectl service on each node it will capture performance data to file locally, which colmux can replay centrally.

Then install colmux itself, which you can download from Sourceforge. It only needs to be actually installed on a single host, but obviously we could push it out across the cluster with pdsh if we wanted to be able to invoke it on any node at will. Note that here I’m running it on a separate linux box (outside of the cluster) rather than on my Mac:

cd /tmp
# Make sure you get the latest version of collectl-utils, from https://sourceforge.net/projects/collectl-utils/files/
# This example is hardcoded to a version and particular sourceforge mirror
curl -O http://garr.dl.sourceforge.net/project/collectl-utils/collectl-utils-4.8.2/collectl-utils-4.8.2.src.tar.gz
tar xf collectl-utils-4.8.2.src.tar.gz
cd collectl-utils-4.8.2
sudo ./INSTALL
# collectl-utils also includes colplot, so if you might want to use it restart
# apache (assuming it's installed)
sudo service httpd restart

Colmux and networking

Couple of important notes:

  1. The machine you run colmux from needs to have port 2655 open in order for each node’s collectl to send back the data to it (see the sketch just after this list for one way of opening it).
  2. You also may encounter an issue if you have any odd networking (eg NAT on virtual machines) that causes colmux to not work because it picks the ‘wrong’ network interface of the host to tell collectl on each node to send its data to. Details and workaround here.
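
For the first point, on an iptables-based firewall (as found on RHEL/OEL 6) something like the following on the colmux machine would do it – a sketch only, so adapt it to however you manage your firewall rules:

iptables -I INPUT -p tcp --dport 2655 -j ACCEPT   # allow each node's collectl to connect back
service iptables save                             # persist the rule across reboots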

colmux in action – real-time view, point in time

Command

colmux -addr 'rnmcluster02-node0[1-4]' -username root

Output

# Mon Dec  1 22:20:40 2014  Connected: 4 of 4
#                    <--------CPU--------><----------Disks-----------><----------Network---------->
#Host                cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
rnmcluster02-node01    1   1    28     36      0      0      0      0      0      2      0       2
rnmcluster02-node04    0   0    33     28      0      0     36      8      0      1      0       1
rnmcluster02-node03    0   0    15     17      0      0      0      0      0      1      0       1
rnmcluster02-node02    0   0    18     18      0      0      0      0      0      1      0       1


Real-time view, persisted

-cols puts the hosts across the top and time as rows. You specify one or more columns from the output that you get when running without -cols. In this example it is the cpu value, along with the disk read/write KB (columns 1, 5 and 7 of the metrics as seen above):

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -cols 1,5,7

Output

cpu                            KBRead                         KBWrit
 node01 node02 node03 node04 |  node01 node02 node03 node04 |  node01 node02 node03 node04
      0      0      0      0 |       0      0      0      0 |      12     28      0      0
      0      0      0      0 |       0      0      0      0 |      12     28      0      0
      1      0      1      0 |       0      0      0      0 |       0      0      0      0
      0      0      0      0 |       0      0      0      0 |       0      0      0      0
      0      0      0      0 |       0      0      0      0 |       0      0      0      0
      0      0      0      0 |       0      0      0      0 |       0     20      0      0
      0      0      0      0 |       0      0      0      0 |      52      4      0      0
      0      0      0      2 |       0      0      0      0 |       0      0      0      0
      1      0      0      0 |       0      0      0      0 |       0      0      0      0
     15     16     15     15 |       0      4      4      4 |      20     40     32     48
      0      0      1      1 |       0      0      0      0 |       0      0      4      0
      1      0      0      0 |       0      0      0      0 |       0      0      0      0


To check the numbers of the columns that you want to reference, run the command with the --test argument:

colmux -addr 'rnmcluster02-node0[1-4]' -user root --test

>>> Headers <<<
#                    <--------CPU--------><----------Disks-----------><----------Network---------->
#Host                cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut

>>> Column Numbering <<<
 0 #Host   1 cpu     2 sys     3 inter   4 ctxsw   5 KBRead  6 Reads   7 KBWrit
 8 Writes  9 KBIn   10 PktIn  11 KBOut  12 PktOut

And from there you get the numbers of the columns to reference in the -cols argument.

To include the timestamp, use -oT in the -command argument and offset the column numbers by 1:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -cols 2,6,8 -command '-oT'

Output

sys                            Reads                          Writes
#Time    node01 node02 node03 node04 |  node01 node02 node03 node04 |  node01 node02 node03 node04
22:24:50      0      0      0      0 |       0      0      0      0 |       0      0      0      0
22:24:51      1      0      0      0 |       0      0      0      0 |       0      0      0      0
22:24:52      0      0      0      0 |       0      0      0      0 |       0     16      0     16
22:24:53      1      0      0      0 |       0      0      0      0 |      36      0     16      0
22:24:54      0      0      0      1 |       0      0      0      0 |       0      0      0      0
22:24:55      0      0      0      0 |       0      0      0      0 |       0     20     32     20

NB There’s a bug with colmux 4.8.2 that prevents you accessing the first metric with -cols when you also enable the timestamp with -oT – details here.

Specifying additional metrics

Collectl (which is what colmux calls to get the data) can fetch metrics from multiple subsystems on a node. You can access all of these through colmux too. By default when you run colmux you get cpu, disk and network but you can specify others using the -s argument followed by the subsystem identifier.

To examine the available subsystems run collectl on one of the target nodes:

[root@rnmcluster02-node01 ~]# collectl --showsubsys
The following subsystems can be specified in any combinations with -s or
--subsys in both record and playbackmode.  [default=bcdfijmnstx]

These generate summary, which is the total of ALL data for a particular type
  b - buddy info (memory fragmentation)
  c - cpu
  d - disk
  f - nfs
  i - inodes
  j - interrupts by CPU
  l - lustre
  m - memory
  n - network
  s - sockets
  t - tcp
  x - interconnect (currently supported: OFED/Infiniband)
  y - slabs

From the above list we can see that if we want to also show memory detail alongside CPU we need to include m and c in the subsystem list:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-scm'

Output

# Tue Dec  2 08:02:38 2014  Connected: 4 of 4
#                    <--------CPU--------><-----------Memory----------->
#Host                cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map
rnmcluster02-node02    1   0    19     18  33M  15M 345M 167M  30M  56M
rnmcluster02-node04    0   0    30     24  32M  15M 345M 167M  30M  56M
rnmcluster02-node03    0   0    30     36  32M  15M 345M 165M  30M  56M
rnmcluster02-node01    0   0    16     16  29M  15M 326M 167M  27M  81M

Changing the sample frequency

To change the sample frequency use the -i syntax in -command:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-scm -i10 -oT' -cols 2,4

Samples every 10 seconds:

sys                            ctxsw
#Time    node01 node02 node03 node04 |  node01 node02 node03 node04
08:06:29     -1     -1     -1     -1 |      -1     -1     -1     -1
08:06:39     -1     -1     -1     -1 |      -1     -1     -1     -1
08:06:49      0      0      0      0 |      14     13     15     19
08:06:59      0      0      0      0 |      13     13     17     21
08:07:09      0      0      0      0 |      19     18     15     24
08:07:19      0      0      0      0 |      13     13     15     19
08:07:29      0      0      0      0 |      13     13     14     19
08:07:39      0      0      0      0 |      12     13     13     19

Column width

Add the -colwidth argument:

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-scm' -cols 1 -colwidth 20

Output

cpu
  rnmcluster02-node01  rnmcluster02-node02  rnmcluster02-node03  rnmcluster02-node04
                   -1                   -1                   -1                   -1
                   -1                   -1                   -1                   -1
                    1                    0                    0                    0
                    0                    0                    0                    0
                    0                    1                    0                    0
                    0                    0                    1                    0
                    1                    0                    1                    0
                    0                    1                    0                    0

Playback

As well as running interactively, collectl can run as a service and record metric samples to disk. Using colmux you can replay these from across the cluster.

Within the -command, include -p and the path to the collectl log files (this assumes that the path is the same on each host). As with real-time mode, for different subsystems change the flags after -s.

Command

colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-p /var/log/collectl/*20141201* -scmd -oD'

Output

[...]
# 21:48:50  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:48:50    0   0    17     15  58M  10M 340M 162M  30M  39M      0      0      1      0
rnmcluster02-node03 20141201 21:48:50    0   0    11     13  58M  10M 340M 160M  30M  39M      0      0      0      0
rnmcluster02-node02 20141201 21:48:50    0   0    11     15  58M  10M 340M 163M  29M  39M      0      0      1      0
rnmcluster02-node01 20141201 21:48:50    0   0    12     14  33M  12M 342M 157M  27M  63M      0      0      1      0

# 21:49:00  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:49:00    0   0    17     15  58M  10M 340M 162M  30M  39M      0      0      4      0
rnmcluster02-node03 20141201 21:49:00    0   0    13     14  58M  10M 340M 160M  30M  39M      0      0      5      0
rnmcluster02-node02 20141201 21:49:00    0   0    12     14  58M  10M 340M 163M  29M  39M      0      0      1      0
rnmcluster02-node01 20141201 21:49:00    0   0    12     15  33M  12M 342M 157M  27M  63M      0      0      6      0

# 21:49:10  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:49:10    0   0    23     23  58M  10M 340M 162M  30M  39M      0      0      1      0
rnmcluster02-node03 20141201 21:49:10    0   0    19     24  58M  10M 340M 160M  30M  39M      0      0      2      0
rnmcluster02-node02 20141201 21:49:10    0   0    18     23  58M  10M 340M 163M  29M  39M      0      0      2      1
rnmcluster02-node01 20141201 21:49:10    0   0    18     24  33M  12M 342M 157M  27M  63M      0      0      1      0
[...]

Restrict the time frame by adding the --from and/or --thru arguments to -command:

[oracle@rnm-ol6-2 ~]$ colmux -addr 'rnmcluster02-node0[1-4]' -user root -command '-p /var/log/collectl/*20141201* -scmd -oD --from 21:40:00 --thru 21:40:10'
# 21:40:00  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:40:00    0   0    16     14  59M  10M 340M 162M  30M  39M      0      0      0      0
rnmcluster02-node03 20141201 21:40:00    0   0    12     14  58M  10M 340M 160M  30M  39M      0      0      8      1
rnmcluster02-node02 20141201 21:40:00    0   0    12     15  59M  10M 340M 162M  30M  39M      0      0      6      1
rnmcluster02-node01 20141201 21:40:00    0   0    13     16  56M  11M 341M 156M  27M  42M      0      0      7      1

# 21:40:10  Reporting: 4 of 4
#                                      <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Host               Date     Time      cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes
rnmcluster02-node04 20141201 21:40:10    0   0    26     33  59M  10M 340M 162M  30M  39M      1      0     10      2
rnmcluster02-node03 20141201 21:40:10    0   0    20     31  58M  10M 340M 160M  30M  39M      0      0      4      1
rnmcluster02-node02 20141201 21:40:10    0   0    23     35  59M  10M 340M 162M  30M  39M      3      0      9      2
rnmcluster02-node01 20141201 21:40:10    0   0    23     37  56M  11M 341M 156M  27M  42M      4      1      4      1


[oracle@rnm-ol6-2 ~]$

colmux reference

You can find out more about colmux on the collectl-utils project page on SourceForge, as well as in the built-in man page (man colmux).

Visualising collectl data with colplot

As a little bonus to the above, colmux is part of the collectl-utils package, which also includes colplot, a gnuplot-based web tool that renders collectl data into graphs. It’s pretty easy to set up, running under Apache just fine and needing only gnuplot installed if you don’t have it already. It can report metrics across a cluster if you make sure that you first make each node’s collectl data available locally to colplot.
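
For the cross-cluster part, something as simple as pulling each node’s collectl logs into per-host folders on the colplot machine does the job. A rough sketch, assuming the default collectl log location and the root SSH keys we set up earlier in this series:

for host in rnmcluster02-node0{1..4}; do
  mkdir -p /var/log/collectl/$host                                    # one folder per node
  rsync -az root@$host:/var/log/collectl/ /var/log/collectl/$host/    # pull that node's collectl logs
done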

Navigating to the web page shows the interface from which you can trigger graph plots based on the collectl data available:

colplot’s utilitarian graphs are a refreshing contrast to every webapp that is built nowadays promising “beautiful” visualisations (which no doubt the authors are “passionate” about making “awesome”):

The graphs are functional and can be scaled as needed, but each change is a trip back to the front page to tweak options and re-render:


For me, colplot is an excellent tool for point-in-time analysis and diagnostics, but for more generalised monitoring with drilldown into detail, it is too manual to be viable and I’ll be sticking with collectl -> graphite -> grafana with its interactive and flexible graph rendering:


Do note however that colplot specifically does not drop data points, so if there is a spike in your data you will see it. Other tools (possibly including graphite, but I’ve not validated this) will, for larger timespans, average out data series so as to provide a smoother picture of a metric (eg instead of a point every second, maybe one every ten seconds). If you are doing close analysis of a system’s behaviour in a particular situation this may be a problem. If you want a more generalised overview of a system’s health, with the option to drill into historical data as needed, it will be less of an issue.

Summary

When working with multiple Linux machines I would first and foremost make sure SSH keys are set up in order to ease management through password-less logins.

After SSH keys, I would recommend pdsh for parallel execution of the same SSH command across the cluster. It’s a big time saver particularly when initially setting up the cluster given the installation and configuration changes that are inevitably needed.

To monitor a cluster I would always recommend collectl as the base metric collector. colmux works excellently for viewing these metrics from across the cluster in a single place from the commandline. For viewing the metrics over the longer term you can either store them in (or replay them into) Graphite/Carbon, and render them in Grafana. You also have the option of colplot, since this is installed as part of collectl-utils.

So now your turn – what particular tools or tips do you have for working with a cluster of Linux machines? Leave your answers in the comments below, or tweet them to me at @rmoff.

Concurrent RPD Development in OBIEE


OBIEE is a well established product, having been around in various incarnations for well over a decade. The latest version, OBIEE 11g, was released 3.5 years ago, and there are mutterings of OBIEE 12c already. In all of this time however, one thing it has never quite nailed is the ability for multiple developers to work with the core metadata model – the repository, known as the RPD – concurrently and in isolation. Without this, development is doomed to be serialised – with the associated bottlenecks and inability to scale in line with the number of developers available.

My former colleague Stewart Bryson wrote a series of posts back in 2013 in which he outlines the criteria for a successful OBIEE SDLC (Software Development LifeCycle) method. The key points were:

  • There should be a source control tool (a.k.a version control system, VCS) that enables us to store all artefacts of the BI environment, including RPD, Presentation Catalog, etc etc. From here we can tag snapshots of the environment at a given point as being ready for release, and as markers for rollback if we take a wrong turn during development.
  • Developers should be able to do concurrent development in isolation.
    • To do this, source control is mandatory in order to enable branch-based development, also known as feature-driven development, which is a central tenet of an Agile method.

Oracle’s only answer to the SDLC question for OBIEE has always been MUDE. But MUDE falls short in several respects:

  • It only manages the RPD – there is no handling of the Presentation Catalog etc
  • It does not natively integrate with any source control
  • It puts the onus of conflict resolution on the developer rather than the “source master” who is better placed to decide the outcome.

Whilst it wasn’t great, it wasn’t bad, and MUDE was all we had. Either that, or manual integration into source control (1, 2) tools, which was clunky to say the least. The RPD remained a single object that could not be merged or managed except through the Administration Tool itself, so any kind of automatic merge strategies that the rest of the software world were adopting with source control tools were inapplicable to OBIEE. The merge would always require the manual launching of the Administration Tool, figuring out the merge candidates, before slowly dying in despair at having to repeat such a tortuous and error-prone process on a regular basis…

Then back in early 2012 Oracle introduced a new storage format for the RPD. Instead of storing it as a single binary file, closed to prying eyes, it was instead burst into a set of individual files in MDS XML format.

For example, one Logical Table was now one XML file on disk, made up of entities such as LogicalColumn, ExprText, LogicalKey and so on:

It even came with a set of configuration screens for integration with source control. It looked like the answer to all our SDLC prayers – now us OBIEE developers could truly join in with the big boys at their game. The reasoning went something like:

  1. An RPD stored in MDS XML is no longer binary
  2. git can merge code that is plain text from multiple branches
  3. Let’s merge MDS XML with git!

But how viable is MDS XML as a storage format for the RPD used in conjunction with a source control tool such as git? As we will see, it comes down to the Good, the Bad, and the Ugly…

The Good

As described here, concurrent and unrelated developments on an RPD in MDS XML format can be merged successfully by a source control tool such as git. Each logical object is a file, so git just munges (that’s the technical term) the files modified in each branch together to come up with a resulting MDS XML structure with the changes from each development in it.

The Bad

This is where the wheels start to come off. See, our automagic merging fairy dust is based on the idea that individually changed files can be spliced together, and that since MDS XML is not binary, we can trust a source control tool such as git to also work well with changes within the files themselves too.

Unfortunately this is a fallacy, and by using MDS XML we expose ourselves to greater complications than we would if we just stuck to a simple binary RPD merged through the OBIEE toolset. The problem is that whilst MDS XML is not binary, it is not unstructured either. It is structured, and it has application logic within it (the mdsid, of which more below).

Within the MDS XML structure, individual first-class objects such as Logical Tables are individual files, and structured within them in the XML are child-objects such as Logical Columns:

Source control tools such as git cannot parse this structure, and therefore do not understand what is a real conflict versus an unrelated change within the same object. If you stop and think for a moment (or longer) about quite what would be involved in accurately parsing XML (let alone MDS XML), you’ll realise that you basically need to reverse-engineer the Administration Tool to come up with an accurate engine.

We kind of get away with merging when the file differences are within an element in the XML itself. For example, the expression for a logical column is changed in two branches, causing clashing values within ExprText and ExprTextDesc. When this happens git will throw a conflict and we can easily resolve it, because the difference is within the element(s) themselves:
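
To give a flavour of it, the conflict that git presents looks something like this – heavily paraphrased, with purely illustrative expressions and branch names:

<<<<<<< HEAD
            <ExprText>"Sales"."Revenue" * 1.2</ExprText>
            <ExprTextDesc>"Sales"."Revenue" * 1.2</ExprTextDesc>
=======
            <ExprText>"Sales"."Revenue" * 1.1</ExprText>
            <ExprTextDesc>"Sales"."Revenue" * 1.1</ExprTextDesc>
>>>>>>> feature/revised-uplift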

Easy enough, right?

But take a similarly “simple” merge conflict where two independent developers add or modify different columns within the same Logical Table, and we see what a problem there is when we try to merge it back together relying on source control alone.

Obvious to a human, and obvious to the Administration Tool is that these two new columns are unrelated and can be merged into a single Logical Table without problem. In a paraphrased version of MDS XML the two versions of the file look something like this, and the merge resolution is obvious:

But a source control tool such as git looks at the MDS XML as a plaintext file, not understanding the concept of an XML tree and sibling nodes, and throws its toys out of the pram with a big scary merge conflict:

Now the developer has to roll up his or her sleeves and try to reconcile two XML files – with no GUI to support or validate the change made except loading it back into the Administration Tool each time.

So if we want to use MDS XML as the basis for merging, we need to restrict our concurrent developments to completely independent objects. But, that kind of hampers the ideal of more rapid delivery through an Agile method if we’re imposing rules and restrictions like this.

The Ugly

This is where it gets a bit grim. Above we saw that MDS XML can cause unnecessary (and painful) merge conflicts. But what if two developers inadvertently create the same object concurrently? The behaviour we’d expect to see is a single resulting object. But what we actually get is both versions of the object, and a dodgy RPD. Uh Oh.

Here are the two concurrently developed RPDs, produced in separate branches isolated from each other:

And here’s what happens when you leave it to git to merge the MDS XML:

The duplicated objects now cannot be edited in the Administration Tool in the resulting merged RPD – any attempt to save them throws the above error.

Why does it do this? Because the MDS XML files are named after a globally unique identifier known as the mdsid, and not their corresponding RPD qualified name. And because the mdsid is unique across developments, two concurrent creations of the same object end up with different mdsid values, and thus different filenames.

Two files from separate branches with different names are going to be seen by source control as being unrelated, and so both are brought through in the resulting merge.

As with the unnecessary merge conflict above, we could define process around same-object creation, or add in a manual equalise step. The real issue here is that the duplicates can arise without us being aware, because there is no conflict seen by the source control tool. It’s not like merging an un-equalised repository in the Administration Tool, where we’d get #1 suffixes on the duplicate object so that at least (a) we spot the duplication and (b) the repository remains valid and the duplicate objects available to edit.

MDS XML Repository opening times

Whether a development strategy based on MDS XML is for you or not, another issue to be aware of is that for anything beyond a medium-sized RPD, opening times of an MDS XML repository are considerable. As in, a minute from binary RPD, versus 20 minutes from MDS XML. And to be fair, after 20 minutes I gave up, on the basis that no sane developer would write off that amount of their day simply waiting for the repository to open before they can even do any work on it. This rules out working in MDS XML format with any big repositories, such as that from BI Apps.

So is MDS XML viable as a Repository storage format?

MDS XML does have two redeeming features:

  1. It reduces the size of your source control repository, because on each commit you will be storing just a delta of the overall repository change, rather than the whole binary RPD each time.
  2. For tracking granular development progress and changes you can identify what was modified through the source control tool alone – because the new & modified objects will be shown as changes:

But the above screenshots both give a hint of the trouble in store. The mdsid unique identifier is used not only in filenames – causing object duplication and strange RPD behaviour – but also within the MDS XML itself, referencing other files and objects. This means that as an RPD developer, or RPD source control overseer, you need to be confident that each time you perform a merge of branches you are correctly putting Humpty Dumpty back together in a valid manner.

If you want to use MDS XML with source control you need to view it as part of a larger solution, involving clear process and almost certainly a hybrid approach with the binary RPD still playing a part — and whatever you do, the Administration Tool within short reach. You need to be aware of the issues detailed above, decide on a process that will avoid them, and make sure you have dedicated resource that understands how it all fits together.

If not MDS XML, then what?…

Source control (e.g. git) is mandatory for any kind of SDLC, concurrent development included. But instead of storing the RPD in MDS XML, we store it as a binary RPD.

Wait wait wait, don’t go yet!… it gets better.

By following the git-flow method, which dictates how feature-driven development is done in source control (git), we can write a simple script that determines, when merging branches, what the candidates are for an OBIEE three-way RPD merge.

In this simple example we have two concurrent developments – coded “RM-1” and “RM-2”. First off, we create two branches which take the code from our “mainline”. Development is done on the two separate features in each branch independently, and committed frequently per good source control practice. The circles represent commit points:

The first feature to be completed is “RM-1”, so it is merged back into “develop”, the mainline. Because nothing has changed in develop since RM-1 was created from it, the binary RPD file and all other artefacts can simply ‘overwrite’ what is there in develop:

Now at this point we could take “develop” and start its deployment into System Test etc, but the second feature we were working on, RM-2, is also tested and ready to go. Here comes the fancy bit! Git recognises that both RM-1 and RM-2 have made changes to the binary RPD, and being binary, git cannot merge it itself. But now, instead of just collapsing in a heap and leaving it for the user to figure out, our script makes use of git and the git-flow method we have followed to work out the merge candidates for the OBIEE Administration Tool:

Even better, it invokes the Administration Tool (which can be run from the command line; alternatively, the command-line tools comparerpd/patchrpd can be used) to automatically perform the merge. If the merge is successful, it goes ahead and commits the merge into the “develop” branch in git. The developer has not had to do any kind of interaction to complete the merge and commit.
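
To give an idea of the mechanics, here’s a rough sketch of how the candidates could be identified and merged – this is illustrative rather than the actual script, with made-up file names, branch names and RPD password, so check the built-in help of comparerpd/patchrpd for the exact flags on your version:

# Identify the three merge candidates using git
ORIG=$(git merge-base develop feature/RM-2)            # common ancestor of the two branches
git show $ORIG:repository.rpd        > original.rpd    # "original" candidate
git show develop:repository.rpd      > current.rpd     # "current" candidate (already contains RM-1)
git show feature/RM-2:repository.rpd > modified.rpd    # "modified" candidate (the feature branch)

# Generate a patch of the feature branch's changes relative to the original...
comparerpd -P Admin123 -C modified.rpd -W Admin123 -G original.rpd -D rm2-changes.xml

# ...and apply that patch to the current RPD to produce the merged repository
patchrpd -P Admin123 -C current.rpd -Q Admin123 -G original.rpd -I rm2-changes.xml -O merged.rpd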

If the merge is not a slam-dunk, then we can launch the Administration Tool and graphically figure out the correct resolution – but using the already-identified merge candidates in order to shorten the process.

This is not perfect, but there is no perfect solution. It is the closest thing that there is to perfection though, because it will handle merges of:

  • Unique objects
  • Same objects, different modifications (c.f. two new columns on same table example above)
  • Duplicate objects – by equalisation

Conclusion

There is no single right answer here, nor are any of the options overly appealing.

If you want to work with OBIEE in an Agile method, using feature-driven development, you will have to adopt and learn specific processes for working with OBIEE. The decision you have to make is on how you store the RPD (binary or multiple MDS XML files, or maybe both) and how you handle merging it (git vs Administration Tool).

My personal view is that taking advantage of git-flow logic, combined with the OBIEE toolset to perform three-way merges, is sufficiently practical to warrant leaving the RPD in binary format. The MDS XML format is a lovely idea but there are too few safeguards against dodgy/corrupt RPD (and too many unnecessary merge conflicts) for me to see it as a viable option.

Whatever option you go for, make sure you are using regression testing to test the RPD after you merge changes together, and ideally automate the testing too. Here at Rittman Mead we’ve written our own suite of tools that do just this – get in touch to find out more.
