Wednesday, November 12, 2008

Problem Solving Skills: Are you born with it or can you learn it?

An interesting title, I know, but after a recent experience and typical day-to-day operations, I figured it was time to drag out the Soap Box and try to have a philosophical moment.

"So what has brought this on" I hear you ask? Well, like I mentioned above I have had a recent experience. Now to ensure I don't get in too much trouble, here is the summary of the experience:

I received a phone call the other night about a strange buzzing sound coming from the bathroom of a family member's home. After asking all of the logical questions:

  1. Is the fan on?
  2. Is the heater on?
  3. Is the Air-conditioner on?
  4. Is the hair dryer on?
  5. Is anything on in the bathroom?

I then progressed to the not-so-obvious questions and suggestions:

  1. Get a broom and lift up the manhole in the roof to see if you can hear the buzzing.
  2. Could it be the "whirly-bird" roof heat extraction fan spinning?
  3. What about the motor that moves the outside blinds up and down?
  4. What about electrical interference from the AV signal receiver connected to the TV (on the other side of the wall)?

When I went around to the house, I put my head in the roof but couldn't hear the noise. I went into the bathroom and removed the fan cover. This is when the noise appeared to come from the shower recess. Although this had baffled a couple of people, after careful listening, the cause of the noise was obvious and in plain sight.

This has made me think: are problem solving skills something you can learn, or something you are born with?

Coincidentally, I actually had this same conversation about a week prior with an academic colleague. The conversation started when we were discussing hiring graduates (which I look upon in a highly positive way). My colleague mentioned that in the IT educational system today, Computer Science is not what it was previously. "Today it is all about learning to program in Java or C#, and it is no longer about applying algorithms to solve the problems". I will be the first to say that I don't necessarily know about all of the different documented algorithms there are to solve problems - but I digress. My point is, this is just one facet of problem solving in IT. In this instance, my colleague believed you cannot necessarily teach problem solving skills.

But there are other facets of IT when it comes to problem solving. Take for example the infrastructure side of IT, supporting the servers and application software that we use on a daily basis. I would be lying if I said it was always smooth sailing.

I believe that good software should not be judged solely on the functionality that it offers. When evaluating a product, it is important to understand how it can be supported as part of the process. All the cool features and widgets are useless if the software stops and you cannot return it to service. Software engineers should, in general, ensure their code contains enough debug support code, so that if an issue should arise, it can help the supporters pin-point the root cause. By debug support, I mean entries in Windows Event Logs, file logs, etc. And, oh yes, you need to be able to turn this on when the issue arises, without destroying the current symptoms (so don't put the switch in the Web.config, since editing that file restarts the application :) )
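To make the "turn it on when the issue arises" idea concrete, here is a minimal sketch in Python (not the .NET stack discussed above). It assumes a hypothetical flag file: while the file exists, logging is verbose, and when it is removed, logging goes quiet again. The function name and the flag file approach are my own illustration, not a standard mechanism.

```python
import logging
import os


def refresh_log_level(logger, flag_path):
    """Set the logger to DEBUG while the flag file exists, else to WARNING.

    flag_path is a hypothetical "switch" file that a support person can
    create or delete while the application keeps running.
    """
    level = logging.DEBUG if os.path.exists(flag_path) else logging.WARNING
    if logger.level != level:
        logger.setLevel(level)
    return level
```

Calling this at the start of each request (or on a timer) means support staff can switch verbose logging on and off without restarting the process, so the symptoms being investigated stay intact.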

But even with all of these wonderful "support aids", you still need people with good problem solving skills to debug the situation.

So, how do I try and solve a problem? Well, I have a really simple process that I follow. This process is inspired by the CSI TV shows - which is basically "Follow the evidence!". Look in the event logs and note the errors that appear to highlight the symptoms. But also look at the entries logged just before and after those errors, and look for multiple occurrences of the same error. This will help you get an idea of all the symptoms. Hopefully this is enough to reveal the root cause and therefore a solution. But more often than not, it isn't.
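As a rough sketch of "follow the evidence", the Python function below scans log lines for an error pattern, keeps the lines just before and after each hit, and counts how often each error appears. It assumes the log has already been exported to plain text lines; it does not read the Windows Event Log directly, and the function name and sample lines are purely illustrative.

```python
import re
from collections import Counter


def errors_with_context(lines, pattern, before=2, after=2):
    """Return (hits, counts) for every line matching pattern.

    hits   -- list of (line index, surrounding lines including the match)
    counts -- Counter of the matched text, to spot repeated errors
    """
    regex = re.compile(pattern)
    hits = []
    counts = Counter()
    for i, line in enumerate(lines):
        match = regex.search(line)
        if match:
            start = max(0, i - before)  # do not run off the top of the log
            hits.append((i, lines[start:i + after + 1]))
            counts[match.group(0)] += 1
    return hits, counts


# Hypothetical exported log lines, for illustration only.
sample_log = [
    "09:00 Service started",
    "09:05 Job failed (Error 15404)",
    "09:06 Retrying job",
    "09:10 Job failed (Error 15404)",
    "09:11 Giving up",
]
hits, counts = errors_with_context(sample_log, r"Error \d+", before=1, after=1)
```

The counts tell you whether an error is a one-off or a pattern, and the context lines often contain the real clue, which is also handy material for the search step.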

This leads you into the next step of the process: Internet Search. I am a firm believer in "Googling" the event log messages (Yes, it hurt me to say this - but "Windows Live It" - well, it doesn't really roll off the tongue). Why? Well, unless you are sitting on the bleeding edge of technology, I am pretty confident in saying you are not the only person in the world that has experienced this problem. You might be unfortunate enough to be the only person that has written about the problem though...

If you still have no luck in obtaining the root cause (or there is no simple fix), then you need to reproduce the problem. This must be performed in an isolated lab environment. Tinkering in production is not recommended and should be avoided wherever possible. So if you can reproduce the symptoms in the lab environment, then this is where you should start. Now this is where a little discipline goes a long way. Try to make only one change at a time, and ensure you document each and every one. This is important, because if you fix it and you can't remember how, well then you are no better off - right?

The most important step at the end is to share the knowledge. Document it, tell somebody, blog it, write it up in your Knowledge Base system (Don't have one? Try WSS 3 and the KB application template). This ensures the information is shared and if the problem occurs again, someone else can fix it. Also, if you believe in Karma and that what goes around comes around, well, you will be sorted again on your next issue.

So, I might have digressed a bit, but not really. Having shed some light on my problem solving process, do you conclude that I am a good problem solver or a good Internet searcher? Well, I believe, a bit of both. It is not possible to know everything. We in IT are constantly playing catch-up. Understanding the basics of how things work, combined with the ability to research, does give you the edge to solve any issue. What I can't believe is how many times I have to ask people whether they have searched the Internet for the symptom. What I don't understand is why people don't do this first. In actual fact, this is a real interview question I ask of potential candidates, where the outcome can be influenced by the answer. Showing an understanding of technology is one thing, but when you are stuck, following a process and learning from it is another.

I remember once interviewing a new graduate. Having once been a graduate myself, I understand that you don't have real world experience and that getting your first job can actually be rather difficult. But this one interview I remember vividly:

Me: "Assuming that you have done some sort of hands-on project for your studies, tell me about a time when you were stumped and how did you resolve the problem?"
Candidate: "It was some C++ code. I went to the lecturer, and he resolved it for me".
Me: "Ok, so what was the solution?"
Candidate: "I don't know, the lecturer fixed it for me."

The interview then ended.

Now, I didn't care too much that the lecturer solved the problem, but it was the sheer lack of learning from or caring for the solution that killed it for me.

I believe that problem solving skills can indeed be learned. They come with experience. Every problem is different, and every solution brings with it a new set of learnings that assist with the next problem to solve. But it is a two-way street, and you must want to learn from your experiences in order for your skills to be enhanced.

As a final point, my learning has shown that the absurd theories for the root cause of the problem should never be ruled out until they have first been proven wrong. Technology is a wonderful thing in that in most cases it's never what you expected (I suppose that ties into "It's always in the last place you look" - well dah!)

But seriously, take these snippets as an example:

Example 1:

Symptom: Lotus Notes Client published through Citrix runs extremely slowly. Servers are well resourced with low load. Doesn't affect everyone, only certain people.

Root Cause: Dual monitors on certain desktops claimed extra server memory and resources, resulting in slow performance of the application.

Stump Time: A few days

Example 2:

Symptom: SQL Server 2005 replication schedule fails. Event logs display the following:

The job failed. Unable to determine if the owner (domain\username) of job SQLJobName has server access (reason: Could not obtain information about Windows NT group/user 'domain\username', error code 0x5. [SQLSTATE 42000] (Error 15404)).

Root Cause: The SQL Server Service Account password had expired, so it could not contact Active Directory to validate whether the Job Owner was a valid account.

Stump Time: About 1 day, with a large audience of people involved. Fix time: 2 minutes.

Oh, and by the way, it was an electric razor vibrating against the "shower caddy" in the shower!!!
