Solaris Analytics Collections, partitions, slices and operators

Having poked around with Solaris Analytics and the WebUI a little, you might have wondered what statistics we gather by default. For instance, which statistics do we collect to make the Solaris Dashboard sheet useful?

[Image: the Solaris Dashboard sheet (/images/2018/05/solaris-dashboard.png)]

The feature we provide to make this happen is the Collection. Collections give us a handy shorthand for gathering statistics. We ship with several collections:

# sstore list //:class.collection//:collection.name/*
IDENTIFIER
//:class.collection//:collection.name/root/apache-stats
//:class.collection//:collection.name/root/compliance-stat
//:class.collection//:collection.name/root/cpu-stats
//:class.collection//:collection.name/root/network-stats
//:class.collection//:collection.name/root/solaris-dashboard
//:class.collection//:collection.name/root/system

Listing collections is a privileged operation; if I run the command above as myself then I get a very different result:

$ sstore list //:class.collection//:collection.name/*
Warning (//:class.collection//:collection.name/*) - lookup error: no matching collections found

The collection which is enabled by default is //:class.collection//:collection.name/root/system, and you can see what it gathers by running sstore info on it:

# sstore info //:class.collection//:collection.name/root/system
Identifier: //:class.collection//:collection.name/root/system
  ssid: //:class.system//:*
 state: enabled
  uuid: 7a002985-2cf4-4965-adc9-b53116d8ae67
 owner: root
 cname: system
crtime: 1523243338963817

I quite like having the solaris-dashboard and apache-stats collections enabled, and that is really easy to do:

# sstoreadm enable-collection \
    //:class.collection//:collection.name/root/solaris-dashboard \
    //:class.collection//:collection.name/root/apache-stats

One thing I'm always concerned with, since our family media server is, shall we say, homebrew, is whether my disks are doing ok. Fortunately for me, it is very easy to cons up my own collection and stash it in /usr/lib/sstore/metadata/collections:

[
    {
        "$schema": "//:collection",
        "description": "disk-related statistics",
        "enabled": true,
        "id": "disk-stats",
        "ssids": [
            "//:class.disk//:res*//:*"
        ],
        "user": "root"
    }
]

and once you've restarted sstored you can see it like so:

# sstore info -a //:class.collection//:collection.name/root/disk-stats
Identifier: //:class.collection//:collection.name/root/disk-stats
   ssid: //:class.disk//:res*//:*
  state: enabled
   uuid: bee6c5c5-487e-4376-9d91-f4eb933fd64e
  owner: root
  cname: disk-stats
 crtime: 1525373259871426

[Note that your collection must validate against the collections schema, so run soljsonvalidate /path/to/my/collection.json, and if it needs reformatting, run soljsonfmt /path/to/my/collection.json.]
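If you would rather generate the file than hand-edit it, here is a minimal sketch in plain shell. It assumes nothing beyond the JSON layout shown above: make_collection is a hypothetical helper (not part of Solaris), it does no JSON escaping, and the soljsonvalidate step only runs where that tool is actually installed.

```shell
# Hypothetical helper: print a collection definition in the layout shown above.
# $1 = id, $2 = description, $3 = SSID pattern
# No JSON escaping is done, so keep quotes and backslashes out of the arguments.
make_collection() {
    cat <<EOF
[
    {
        "\$schema": "//:collection",
        "description": "$2",
        "enabled": true,
        "id": "$1",
        "ssids": [
            "$3"
        ],
        "user": "root"
    }
]
EOF
}

collection_file=$(mktemp)
make_collection disk-stats "disk-related statistics" '//:class.disk//:res*//:*' > "$collection_file"

# Validate only where the Solaris tool exists; elsewhere just show the result.
if command -v soljsonvalidate >/dev/null 2>&1; then
    soljsonvalidate "$collection_file"
fi
cat "$collection_file"
rm -f "$collection_file"
```

Note the backslash in front of $schema: without it the here-document would try to expand a shell variable of that name and leave the key empty.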

So that's useful - now what? How about looking at the illegal requests counter? When you run iostat -En that information is jumbled up with all the other errors and can be a little difficult to distinguish:

$ iostat -En sd0
c2t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: WDC WD30EFRX-68E Revision: 0A82 Serial No: WD-WCC4N7CNYH0S
Size: 3000.59GB <3000592982016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 537 Predictive Failure Analysis: 0 Non-Aligned Writes: 0

With Solaris Analytics, however, we can gather all of those errors together in aggregate and partition them at the same time. This command shows us the most recent data point (the -p -1 argument):

$ sstore export -p -1 "//:class.disk//:res.name/sd0//:stat.errors//:part.type"
TIME                VALUE IDENTIFIER
2018-05-08T19:07:44  //:class.disk//:res.name/sd0//:stat.errors//:part.type
                    device-not-ready: 0.0
                    hard-errors: 0.0
                    illegal-requests: 537.0
                    media-errors: 0.0
                    no-device: 0.0
                    non-aligned-writes: 0.0
                    predictive-failure-analysis: 0.0
                    recoverable: 0.0
                    soft-errors: 0.0
                    transport-errors: 0.0

That's a bit more useful! (Yes, having to use sdN rather than cXtYdZ is a pain, sorry.) So... how about just looking for the illegal-requests? That's where we really make use of the partition concept - and let's throw in the arguments to give a daily total from the start of this month (May 2018):

$ sstore export -t 2018-05-01T00:00:00 -i 86400  "//:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)"
TIME                VALUE IDENTIFIER
2018-05-01T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 27.0
2018-05-02T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 27.0
2018-05-03T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 27.0
2018-05-04T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 79.0
2018-05-05T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 174.0
2018-05-06T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 270.0
2018-05-07T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 365.0
2018-05-08T00:00:00  //:class.disk//:res.name/sd0//:stat.errors//:part.type(illegal-requests)
                    illegal-requests: 461.0

Much more useful - and observe that because we're using () to extract the partition element, we need to quote the argument so the shell doesn't get snippy with us.
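To make the quoting point concrete, here is a small sketch that builds the partitioned identifier in a variable and always passes it on inside double quotes. disk_error_ssid is a made-up helper for illustration, and the last line only echoes the command rather than running it, so it is safe to try on a machine without sstore.

```shell
# Hypothetical helper: build a partitioned SSID for one disk's error statistic.
# $1 = disk resource name (e.g. sd0), $2 = partition key (e.g. illegal-requests)
disk_error_ssid() {
    printf '//:class.disk//:res.name/%s//:stat.errors//:part.type(%s)\n' "$1" "$2"
}

ssid=$(disk_error_ssid sd0 illegal-requests)

# The double quotes keep the parentheses away from the shell's parser.
# Dry run: print the command that would be executed.
echo sstore export -p -1 "$ssid"
```

Keeping the identifier in a variable means the parentheses only ever appear inside quotes, which is the thing that stops the shell from complaining.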

To finish this post, let's take a look at two more really useful features: slices and operators. One operator that I'm particularly happy with is //:op.changed, which shows you when a statistic's value changed. It is not particularly useful for volatile statistics on a per-second basis (watch //:class.system//:stat.virtual-memory for a few minutes and you'll see what I mean), but if you aggregate such stats over a longer time period, such as a day, you get a better understanding of what that stat is doing. So, with disk errors again, but on a daily basis (-i 86400) from the start of this month (-t 2018-05-01T00:00:00):

$ sstore export -t 2018-05-01T00:00:00 -i 86400  "//:class.disk//:res.name/sd0//:stat.errors//:op.changed"
TIME                VALUE IDENTIFIER
2018-05-01T00:00:00 27.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-04T00:00:00 79.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-05T00:00:00 174.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-06T00:00:00 270.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-07T00:00:00 365.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-08T00:00:00 461.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
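The values above are running totals, so the interesting number is often the increase between rows. As a rough sketch (daily_deltas is a hypothetical helper, and it assumes the three-column TIME VALUE IDENTIFIER layout shown above), a little awk does the subtraction:

```shell
# Hypothetical helper: read "TIME VALUE IDENTIFIER" rows on stdin and print
# each timestamp with the increase since the previous numeric row.
# The header row is skipped because its second field is not a number.
daily_deltas() {
    awk '$2 ~ /^[0-9.]+$/ {
        if (seen) printf "%s %d\n", $1, $2 - prev
        prev = $2
        seen = 1
    }'
}

# Sample rows copied from the //:op.changed output above.
daily_deltas <<'EOF'
TIME                VALUE IDENTIFIER
2018-05-01T00:00:00 27.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-04T00:00:00 79.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-05T00:00:00 174.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-06T00:00:00 270.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-07T00:00:00 365.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-08T00:00:00 461.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
EOF
```

For these rows it prints increases of 52, 95, 96, 95 and 96. Because //:op.changed only emits a row when the value changes, the first increase (52) covers the three days from 2018-05-01 to 2018-05-04, not a single day.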

Finally, slices. These take the form //:s.[...], and inside the brackets you list the names you wish to extract, whether resource names or statistic names. Once again I'm using //:op.changed to constrain the output:

$ sstore export -i 86400 \
    "//:class.disk//:res.name/sd//:s.[0,28]//:stat.errors//:op.changed" \
    "//:class.disk//:res.name/sd//:s.[0,28]//:stat.//:s.[vendor,serial-number]//:op.changed"
TIME                VALUE IDENTIFIER
1970-01-01T10:00:00 27.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-04T10:00:00 119.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-05T10:00:00 214.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-06T10:00:00 309.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-07T10:00:00 405.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
2018-05-08T10:00:00 500.0 //:class.disk//:res.name/sd0//:stat.errors//:op.changed
1970-01-01T10:00:00 169.0 //:class.disk//:res.name/sd28//:stat.errors//:op.changed
2018-05-04T10:00:00 120.0 //:class.disk//:res.name/sd28//:stat.errors//:op.changed
2018-05-05T10:00:00 215.0 //:class.disk//:res.name/sd28//:stat.errors//:op.changed
2018-05-06T10:00:00 310.0 //:class.disk//:res.name/sd28//:stat.errors//:op.changed
2018-05-07T10:00:00 406.0 //:class.disk//:res.name/sd28//:stat.errors//:op.changed
2018-05-08T10:00:00 501.0 //:class.disk//:res.name/sd28//:stat.errors//:op.changed
1970-01-01T10:00:00 ATA      //:class.disk//:res.name/sd0//:stat.vendor//:op.changed
1970-01-01T10:00:00 Z1D5K89L //:class.disk//:res.name/sd0//:stat.serial-number//:op.changed
2018-05-04T10:00:00 WD-WCC4N7CNYH0S //:class.disk//:res.name/sd0//:stat.serial-number//:op.changed
1970-01-01T10:00:00 ATA      //:class.disk//:res.name/sd28//:stat.vendor//:op.changed
1970-01-01T10:00:00  //:class.disk//:res.name/sd28//:stat.serial-number//:op.changed
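Going by the output above, each slice expands to one identifier per listed name, and two slices in the same SSID give every combination. The sketch below only mimics that expansion in plain shell to make the pattern concrete; expand_disk_slices is a made-up helper and says nothing about how sstore does this internally.

```shell
# Hypothetical helper: mimic the expansion of
#   //:class.disk//:res.name/sd//:s.[<disks>]//:stat.//:s.[<stats>]
# $1 = comma-separated disk instance numbers, $2 = comma-separated stat names
expand_disk_slices() {
    old_ifs=$IFS
    IFS=,                       # split the arguments on commas
    for disk in $1; do
        for stat in $2; do
            printf '//:class.disk//:res.name/sd%s//:stat.%s\n' "$disk" "$stat"
        done
    done
    IFS=$old_ifs                # restore normal word splitting
}

expand_disk_slices 0,28 vendor,serial-number
```

Two disks times two statistics gives four identifiers, which lines up with the four vendor and serial-number identifiers in the output above.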

For more information about operators, slices and partitions, have a read of the ssid-op(7) man page.

Tune in next time when I'll guide you through the process of using a proper certificate for your WebUI instance, rather than the default self-signed certificate.