What is this?

This is basically where I write down stuff that I work with at my job as a GIS Technical Analyst (previously system administrator). I do it because it's practical for documentation purposes (although I remove anything that might pose a security risk), and I hope it can be of use to someone out there. I frequently search the net for help myself, and this is my way of contributing.

Tuesday, April 19, 2011

Unstable ArcIMS virtual servers and how to deal with it

For some time now we’ve experienced instability in our ArcIMS production environment. It has been quite difficult to deal with, and despite trying to apply all best practices from ESRI we could find, we were not able to stabilize the ArcIMS server.

Production environment: ArcIMS 9.2 SP4 running on Windows 2003 x86 and ArcSDE 9.2 SP4 running on MSSQL 2005 SP2 / Windows 2003 x86. From what I read, I believe the problem exists on later versions of ArcIMS as well.

Essentially what would happen is that during periods of high load (especially if someone decided to start image tiling our maps) one or more virtual servers (image servers) would stop responding. We then had to gracefully stop all ArcIMS services and manually kill the process for the particular image server that had stopped responding. This would also leave dead processes hanging on our SDE server, which in turn created problems there due to the high number of connections. Some days this would happen 5-6 times, which left our users very frustrated, quite understandably. We weren't able to pinpoint any particular event that caused it: the server had been running fine for two years before it gradually became more and more unstable over the last few months. The only pattern we saw was that it seemed to affect the most heavily accessed image servers most often, but not always the same one. It also only affected image servers (not feature servers, query servers, etc.).
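The failure described above (an image server that silently stops answering) is the kind of thing a small watchdog can at least detect early. What follows is only a minimal Python sketch under stated assumptions, not an ArcIMS tool: the `probe` function, the `Watchdog` class, the timeout and the threshold of three consecutive failures are all hypothetical names and values chosen for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.
        return False


class Watchdog:
    """Flags a service as hung after N consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed value, tune for your site
        self.failures = {}          # service name -> consecutive failures

    def record(self, name, ok):
        """Record one probe result; return True if the service looks hung."""
        if ok:
            self.failures[name] = 0
            return False
        self.failures[name] = self.failures.get(name, 0) + 1
        return self.failures[name] >= self.threshold
```

In use, you would call `probe` on a schedule with a cheap map request against a service hosted on each image server (the exact URL depends on your installation and is not shown here) and feed the result to `record`. Requiring several failures in a row avoids raising an alarm over a single slow request during heavy load.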

Here are some of the things we did:
  • Restarted all virtual servers every night (basically all ArcIMS processes). 
  • Trimmed axl-files to reduce the number of layers and remove any (minor) errors and warnings during image server startup. 
  • Enforced stricter scaling restrictions for heavy datasets. 
  • Increased memory for Tomcat. 
  • Added more CPUs and RAM to provide better load balancing. 
  • Increased the number of image servers (virtual servers) and gave each image server more instances. 
  • Gave each image server (virtual server) exclusive access to its own ‘server’. 
  • Increased the frequency of log file rotation to avoid appending to huge log files, and cleaned all temp folders frequently. 
  • Increased logging to debug level and tried playing back map requests in order to reproduce the problem, but weren't able to do that consistently. 
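
The temp folder cleaning mentioned in the list is easy to script. This is a minimal Python sketch, assuming that deleting files by age is acceptable for your output folders; the function name `remove_old_files` and the age limit are hypothetical and not part of ArcIMS.

```python
import os
import time


def remove_old_files(folder, max_age_hours, now=None):
    """Delete regular files in folder older than max_age_hours.

    Returns the sorted names of the files that were removed.
    Subfolders and symbolic links are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    # Read the listing first so we never delete while iterating.
    with os.scandir(folder) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Run from a scheduled task, a call such as `remove_old_files(folder, 1)` would keep only files written within the last hour. Pick an age comfortably longer than the time a client needs to fetch a generated image.
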
We managed to increase processing speed quite a lot, but nothing we did seemed to make it more stable. In short, we were stuck. That was until I happened to read a post by Kelly Watts in this long thread: (http://forums.esri.com/Thread.asp?c=64&f=792&t=247210&ESRISessionID=P0xuaUmvWOi6GSAlHoEDo3TOHx3bdB7j6EBX2YxCMXCtb43wRjC7vkXJDv42HP-armKb) which suggested that all image servers should only use one instance each. This is quite contrary to what ESRI suggests in their best practice documents (the default is two, but for high-load servers you’re encouraged to increase that).

Well, long story short: I altered the most vulnerable virtual servers to use two servers with one instance each, rather than two (or more) instances each, as shown below:



What a difference. Our ArcIMS environment became rock solid overnight. In the three weeks since, we’ve experienced only a couple of issues. I can finally go home and not have to go over to the computer to restart ArcIMS every few hours. As expected, processing is a little slower than it was with more instances, but that’s a small price to pay. Besides, in a year or so we’re going to replace ArcIMS with ArcGIS Server anyway.