Friday, May 16, 2008

Challenges of testing the down-scaling of systems

I once came across an unexplainable phenomenon. We were in the middle of a POC, on a very tight schedule, and it was Thursday evening. We decided to leave the server on, not inject any transactions or events, and come back Saturday evening to continue working on it. When we returned, we found out that the system had crashed after only a few hours (11 or 12, I believe)!

That was the first time I saw a big system fall so hard while not dealing with anything, just sitting in an idle state. Needless to say, from that day forward we have included an 'idle state test' in the regression of every release.
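
A minimal sketch of what such an idle state test could look like (the health endpoint, the idle window and the polling interval below are hypothetical placeholders, just to illustrate the idea):

    # Idle state test sketch: bring the system up, inject no transactions or
    # events, and periodically verify that it is still alive.
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/health"   # hypothetical health endpoint
    IDLE_HOURS = 12                                # roughly the window in which our POC system died
    CHECK_EVERY_SECONDS = 300                      # poll every 5 minutes

    def is_alive(url):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    def idle_state_test():
        deadline = time.time() + IDLE_HOURS * 3600
        while time.time() < deadline:
            assert is_alive(HEALTH_URL), "system died while completely idle"
            time.sleep(CHECK_EVERY_SECONDS)

    if __name__ == "__main__":
        idle_state_test()
        print("system survived the idle window")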

This brings us to the question: do we know how to test the down-scaling of systems? We always ask how to test systems that scale up, but down-scaling is a big issue as well. Systems are 'used to' high communication and a high volume of events, and they exercise daemons, loggers, and other means to make sure everything is 'alive and kicking', but we seldom see big-system testing that tries to simulate small-scale traffic.

What other issues should be taken into consideration in down-scaling testing (a rough test sketch follows the list)?
- traffic
- buffers
- log mechanism
- memory
- synchronous things that should happen vs. asynchronous ones
- firing 'on request' processes and/or queries and/or reports after a long idle time
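
To make the list concrete, here is a rough sketch of a down-scaling (trickle load) test; send_event(), run_report() and rss_memory_mb() are hypothetical stand-ins for whatever interface and monitoring your own system exposes:

    # Down-scaling (trickle load) test sketch: long silences, single events,
    # and 'on request' work fired only after the system has been quiet.
    import random
    import time

    IDLE_GAP_SECONDS = (600, 3600)   # long, irregular gaps between single events
    TEST_HOURS = 24
    MAX_RSS_GROWTH_MB = 50           # arbitrary example threshold for memory creep

    def trickle_test(send_event, run_report, rss_memory_mb):
        baseline_mb = rss_memory_mb()
        deadline = time.time() + TEST_HOURS * 3600
        while time.time() < deadline:
            # 1. Stay quiet for a long, randomized interval (down-scaled traffic).
            time.sleep(random.uniform(*IDLE_GAP_SECONDS))
            # 2. Fire a single event, then an 'on request' report after the silence.
            send_event()
            run_report()
            # 3. Watch memory and buffers: nearly idle systems can leak too.
            growth = rss_memory_mb() - baseline_mb
            assert growth < MAX_RSS_GROWTH_MB, f"memory grew {growth} MB while nearly idle"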

Add your own by commenting on this post.

Alon Linetzki
Best-Testing

1 comment:

Unknown said...

Hi Alon,

in the literature I've never read anything about down-scaling testing, but I can report a direct experience in which I analysed a problem similar to yours.

My idea is that it is formally correct to identify boundary limits on the load, and to suppose that in some cases it is possible to identify the lowest load that assures the correct behaviour of the system.

Although it can appear strange, some architectures can have this problem (e.g. the management of empty queues and timeouts on event reading).

I can briefly summarize the only situation in which I've encountered this matter.

In a real-time application (using a famous RTOS) for an embedded system, during a session of load monitoring, the whole system crashed after only 5 hours. From the post-mortem analysis of the logs, we weren't able to identify the cause of the problem.

But we were lucky, because we had the possibility to reproduce and observe the same situation "live": the problem we identified was that some processes entered execution when the system was unloaded (or under low load) and caused the crash of the system. These processes were:

- a process that was flushing some data
- a process that consolidated some queues to free resources
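
Roughly, the pattern was something like the sketch below; the names and the logic are only an illustration of the idea, not the original RTOS code:

    # Illustration only: maintenance work that wakes up exactly when the system
    # is idle can exercise paths that normal load never tests.
    import queue
    import threading
    import time

    events = queue.Queue()

    def handle_idle_timeout():
        # Placeholder for the rarely exercised 'nothing to read' branch.
        print("queue empty, timeout path taken")

    def flush_and_consolidate():
        # Placeholder for the flushing/consolidation work that frees resources.
        print("flushing data and consolidating queues")

    def consumer():
        while True:
            try:
                event = events.get(timeout=5)   # under load this never times out
            except queue.Empty:
                handle_idle_timeout()           # under low load this path runs often
                continue
            print("processing", event)

    def maintenance():
        while True:
            time.sleep(60)
            if events.empty():
                flush_and_consolidate()         # only runs when the system is unloaded

    if __name__ == "__main__":
        threading.Thread(target=consumer, daemon=True).start()
        threading.Thread(target=maintenance, daemon=True).start()
        time.sleep(180)                         # leave it idle and watch what wakes up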

So, when the specification identifies an average load and a maximum load, it implicitly identifies a minimum load as well.

Very interesting topic :-)

Bye,
Alessandro