A chance to use SunResolve
Earlier this week I was asked to help a colleague peg the CPU usage of a particular process running on Solaris 10. We went back and forth for a day or so and, amongst other things, figured out that the new Solaris 10 SOE that this mob have installed didn’t include the SUNWpool package, which of course is what you need in order to make use of resource pools ... and some aspects of the Fair Share Scheduler (FSS).
The next day I arrived to find a Gropewise calendar request to meet and nut out the problem …. including my team lead, the monitoring team and their team lead, the app owner and the bloke I’d been talking to the previous day.
Just as I read this, one of the monitoring team came over and said he had some data to show me about this very issue.
As it turns out, the heaviest process on this system (a Sun Fire v100) was a messaging application instance which had 52 threads associated with one process, and about 10 threads each for three other processes associated with it.
That’s not so bad, right?
Well, the data that the monitoring team had gathered showed me pretty clearly that the monitoring system was, relatively speaking, not using all that much CPU. Even so, when we got into the meeting, pretty much the first thing that the app owner said was “the monitoring system is the problem, fix it so we can run our app.”
I had had a chance to mull over the data for a few hours by this time, so I asked a few SunResolve-style questions:
- When did the problem start revealing itself? {when the monitoring system was installed and activated}
- Did it ever work? {yes, before the monitoring system was activated}
- When was the machine installed? {about 3 years ago}
- What version of Solaris did it run then? {Solaris 8*}
- Did you run the monitoring system on it when you initially installed it? {no}
- Did you do any OS tuning when the system was installed? {no}
- Have you done any OS tuning since the system was migrated to Solaris 10? {no}
The app bloke was incredibly keen to blame the monitoring system, and this got me a bit annoyed. It had become quite obvious to me that “The Problem” was most likely not what he was blaming.
The conversation actually got a bit heated at that point. I told the app bloke and the bloke I’d been talking to the day before that I’d spent 6 of my last 7 years at Sun doing high level troubleshooting (like Steve, Roy, Mark, TPenta and ChrisG amongst others). I also told them that if they really wanted my help then they would have to accept and implement whatever recommendations I might make. In fact, I demanded their complete cooperation. Somehow, I don’t think my response was what they really expected. Given the heat in the conversation I was actually surprised when they agreed. I made sure that their agreement went into the meeting minutes :-)
Let’s be serious, though – I gained SunResolve program leader accreditation while I was at Sun. (For those who came in late, that’s the Sun implementation of Kepner-Tregoe’s Problem Solving and Decision Making course.) I taught the SunResolve course. I used the SunResolve methodology and techniques in many facets of my PTS incarnation, and even more so in my Storage software engineering incarnation, where I personally reduced our team’s outstanding issue list by 10%. I’m hardly likely to roll over and not ask questions when somebody asks me to help with a problem, especially regarding performance!
The first step in cases like this is to define the problem, which I did (and followed up with an email later that afternoon). Then, in order to accelerate the problem specification, I requested that the app owner find out from the messaging vendor what the minimum spec Solaris 10 system is for this application. As it happens, the information which he mailed back to me today indicated that the Sun Fire v100 was indeed the minimum spec system. I wasn’t all that surprised. I did get somewhat riled when he said to me yesterday that when they installed the system “it worked perfectly fine under Solaris 8 so it should work fine with Solaris 10 therefore it’s the monitoring system’s fault”. That doesn’t follow, because we don’t know that the system was OK to start with, and the monitoring system is more likely a symptom than the cause of the problem. This last assumption is mine – because the minimum spec boxes that I’ve been working with at this place in the last 5 weeks have been dual-proc 4GB Sun Fire v240s, and every system we put into production is required to have the standard monitoring system installed, otherwise Operations refuses to accept responsibility for it!
Another step in the process is to ask “what are you looking at to make this claim?” In this case, the solitary tool being used by the app team was vmstat. Not something you should ever use in isolation. Ever. And if you don’t really understand what it measures then you shouldn’t use it at all, and should ask somebody who knows somewhat more about the environment to assist. These guys are lucky because they’ve got me. (Arrogant, but true!) If you don’t have me, you should log a call with your OS and hardware vendor.
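As a rough illustration of why a system-wide tool like vmstat needs a per-process view alongside it, here is a minimal Python sketch that ranks processes by CPU share from the text output of `ps -eo pid,nlwp,pcpu,comm`. The function names, the process names and every number in the sample are made up for illustration; none of it is data from the system in this story, and the parsing assumes the usual four-column layout with a single header line.

```python
# Sketch only: rank processes by CPU share using `ps -eo pid,nlwp,pcpu,comm` output.
# All names and figures in SAMPLE are hypothetical, not taken from the real system.

def parse_ps(text):
    """Turn the four-column ps output into a list of dicts (header line skipped)."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        # Split into at most four fields so a command path stays in one piece.
        pid, nlwp, pcpu, comm = line.split(None, 3)
        rows.append({
            "pid": int(pid),
            "nlwp": int(nlwp),      # number of threads (LWPs) in the process
            "pcpu": float(pcpu),    # recent CPU usage as a percentage
            "comm": comm.strip(),
        })
    return rows


def top_by_cpu(rows, n=3):
    """Return the n rows with the highest CPU percentage, highest first."""
    return sorted(rows, key=lambda r: r["pcpu"], reverse=True)[:n]


# Hypothetical sample in the shape ps would print it.
SAMPLE = """\
  PID NLWP %CPU COMMAND
  101   52 61.0 /opt/example/msg-instance
  102   10 12.5 /opt/example/msg-helper
  203    4  3.2 /opt/example/monitor-agent
  300    1  0.1 /usr/sbin/cron
"""

if __name__ == "__main__":
    for row in top_by_cpu(parse_ps(SAMPLE)):
        print(f'{row["pcpu"]:5.1f}%  {row["nlwp"]:3d} threads  {row["comm"]}')
```

The point of the sketch is only that a per-process breakdown shows who is actually consuming the CPU, which a system-wide summary cannot tell you on its own.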
The most recent entry in this story concerns the monitoring system. It seems that the Solaris 10 SOE that the app team used was not what we currently have as SOE, and it includes some alpha revisions of the monitoring system scripts which do some (frankly) rather insane things. So we don’t have a good basis for blaming the monitoring system anyway.
I’ll keep you posted, but suffice to say that it felt really, really good to be able to make use of some of the skills I used at Sun to solve problems.
Really good indeed.