1
2
3
4
NO SLEEP
10PM. The WhatsApp image that gets your brain motoring. Let's dream in code
18
Jun 2018
The 10PM Nightmare
John Ince
10PM
In the afterglow of a successful day of coding and updating there's a ping on the mobile phone. "Take a look at the server", Darren says, "it's going nuts". All the webapps are consuming CPU resource. We designed our eco, efficient and green webapps to sleep most of the time when there's nothing to do. We see the odd flicker from 0% CPU time to 0.1% CPU time as a norm. What's gone wrong? Where to start? How long until the morning when we can delve into the code? Must be something in the last update. Watching TV and thinking about code. A restless night follows dreaming of code.
The 10PM Nightmare
Houston, there's a problem. What on Earth is going on? Where would you start? Lucky for us we've a wealth of webapp support tools we've built over the years to help us track down this problem. However best place to start is recent updates. However we've a nice process monitor to see exactly what is consuming. An even better place to start...
Everything runs in a Thread
In our webapp world everything runs in a thread. We're multi-threaded. We can attend to more than one thing at once via a process/task/thread. We've also implemented a thread pool to speed things up even more. The thread pool is totally configurable in our admin panel and we keep statistics on usage. We can bump up the pool to demand. That's not our story here though. We get used to the processes that frequent our thread pool snapshots and their associated behaviours. Sad being a mechanic isn't it. When we checked our snapshot, the morning after the night before, we made an interesting observation. CThreadTaskStaticMessageCheck ( ) continually had a very small runtime, in the order of a few milliseconds. Typically this thread is longer running and resets only when we pick up a message from the static website version (to support a contact form on our fallback legacy browser version) or it times out. This kind of implied that our static message folder was being bombarded with files, since a file is created to respond to a static site message in order to bridge the regular website world with our live webapp world. So, we've a possibility, but damn, the folder is empty and there aren't lots of static site messages appearing in our user messaging system. Close but not the answer.

However. In a recent sprint to support a new retail feature, DUOLOGI interest free monthly payment, we refactored this static message function. Previously we'd fire a thread each minute to check the folder where the static website messages are saved. We'd then re-queue another task to run in a minutes time. It's a common concept that we have, self repeating tasks. For our DUOLOGI enhancement their API uses web hooks to send messages to our server via the conventional web server. It's simply a JSON (structured text file) that defines the state of the customer's finance application. We didn't want to wait a minute, so we used some windows based directory API monitoring functions, firstly FindFirstChangeNotification ( ). It's simple. We get to know via the operating system when a file is created and run some code in response. Perfect, no more checking every minute and perfect time-stamping on when the event occurred. We also refactored this into the CThreadTaskStaticMessageCheck ( ). Maybe there's something here, new code, erratic behaviour in the thread pool snapshot. Let's look a little deeper.
The 10PM Nightmare
Some of our mechanic tools are quite polished, some aren't. Our process/thread pool snapshot is simply that, not polished or alive, but will be someday! It's a simple dump of what's doing what at that moment in time, with a timer showing how long it's been doing it for. That turned out to be the key to this bug, how long?
Our AIM monitors health too
Our AIM (Application Instance Manager) monitors webapp health amongst lots of other things. You can see it duplicates MS Windows task manager in terms of Instance CPU and total CPU. But it also shows message flow in the IN/OUT columns. Since in general tasks are spawned as a result of messages then it's good to check that nothing is firing messages constantly. Nothing abnormal in the message flow. The CPU matches the results from task manager BUT another interesting thing. Our retail webapp variants are consuming double the CPU of our regular webapps. A coincidence? Could it be that our eCommerce webapps have both the static website message check and the DUILOGI enhancement? This could explain what we are seeing. But neither directories are receiving files, what's going on?
The 10PM Nightmare
Normal CPU usage and normal message flow. Our AIM monitors our instances health live.
The Breakthrough
So currently we have a rather disjointed hypothesis. We've made assumptions. But all good bug fixes start this way. We've a sniff of what's going on so let's really work it all out. We needed to take a look at the code surrounding the event firing to handle an incoming file and to process the file. It seemed to us that the directory change notification was firing all the time, hence the short lived thread that should be long lived. These young threads must be different threads, hence the CPU usage.

In our recent enhancement we considered the implications of webapp shutdown and longer live tasks. Since these threads that monitor directory changes are, or should be, longer lived what would happen if we shut down one of our webapps using our AIM. The update process does this to, stop, update, restart. We would have to deal with that or our webapp shutdown would be waiting for the thread to complete until it times out. So we add a bit of code to create a shutdown event AND monitor for either condition. A file arrives or a webapp is being shutdown terminates the thread. If it's not being shutdown then it must be a file, so we process and simply start another thread to monitor for further directory changes.

We checked our source code. We use the MS Windows API CreateEvent ( ) and WaitForMultipleObjects ( ). On close inspection of the documentation for creating events it seems we are creating a named event in global namespace, just by giving it the name "Shutdown" in the 4th (optional name) parameter. Let's think about this.

Here's the new hypothesis: If we shutdown a single webapp using our AIM, for maintenance or a simple update, we create the named 'PUBLIC' event. Since it's public ALL our webapps running hear the event fire, it's then always fired, and break out of the directory change wait, because they think that they are shutting down but aren't at all. Since they are not shutting down the task for monitoring task is recreated and then immediately returns to process ad infinitum. Bingo. That's it.

Lesson learnt. A single word in a computer program can break your entire system. To fix is easy. Just move the event to the private namespace by not assigning an optional name. ALL FIXED.
Our in-built Tools Help
We fixed this bug in double quick time. Why? Because our tools like the thread snapshot and the AIM webapp health pointed us in the right direction. Since we run everything in a thread then in order for CPU to be used there must be an associated thread. Thread snapshot and we're on our way, new changes validate our assumptions, AIM health proves our double CPU usage theory and last but not least RTFM (Read The F'ing Manual). All's well that ends well. We are but human.
Comments
PROJECT PEACH
Doing it differently since 2012. We're simply 10 years ahead. Our software is designed, coded, maintained, supported & hosted in the United Kingdom.
READING
Blogs
Newfeed
PORTFOLIO
DBD International
The Hawthorn Gallery
ScriptTrack
Knot 2b Missed
Overview
Monero Mine
COMPANY
Home
About Us
Contact Us
Pricing
Pay Us
Copyright 2018 Project Peach