Blog

Query Notifications and the Importance of Quick Callback Handlers

By Alejandro Villarreal

A few months ago we developed a slightly complex service that made use of query notifications to keep an in-memory copy of some database tables. The service worked like a charm, it passed QA, UAT, and went into production, where it has been working ever since. Recently, however, our client had a vague suspicion that something weird was happening with it, but could not pinpoint exactly what, and they only felt like this after a couple of weeks of the service working just fine. Restarting the service dissipated whatever problem the service appeared to be having, so we all lived with that for a while… until they found actual evidence that the service was not responding correctly to the notifications sent by the database; some of them were being ignored, and the changes were obviously not reflected in the in-memory copy of the tables.

At first this seemed like one of those bugs that would take weeks to discover, since that’s exactly how it behaved in production; even so, I gave myself the task of understanding what was happening, and luckily for me the bug turned out to be very easy to find.

The first thing I did was to run a profiler over the database, monitoring Query Notification events, to verify that the notifications were in fact being sent to the application, which we thought was the one ignoring them. But oh surprise! The database was not firing notifications. More accurately, it was not firing some notifications. And to make things even more complicated, each time I tried to determine which ones, or why, some others started failing. At this step, it seemed like a nightmare was approaching.

I checked the service’s log files and found that each time this error happened the service seemed to get stuck in a finite, but very long loop of firing the event handler for a particular table. At times, the service would spend hours firing the handler again, and again, and again… I tried to correlate these events with something in the SQL Server Logs, and did find that each time this happened, an event like the following would appear in there:

The query notification dialog on conversation handle '{5925E62A-A3BA-DC11-9E8E-000C293EC5A4}.'closed due to the following error: '<?xml version="1.0"?><Error xmlns="http://schemas.microsoft.com/SQL/ServiceBroker/Error"><Code>-8470</Code><Description>Remote service has been dropped.</Description></Error>'.

I proceeded to read everything I could find about how Query Notifications works under the covers, since I firmly believe that having a deep and thorough understanding of your tools is probably the best asset you can have while debugging a problem. This research eventually led me to this blog, written by an independent consultant who has been working in the SQL Server team at Microsoft for 6 years, and as he himself puts it: “some of the features you are using today were designed, coded or tested by myself”. He seems to have a particular interest for Query Notifications, and has some very detailed posts about it, one of which eventually led me to solve my problem.

It turns out Query Notifications sets up some infrastructure to notify its client applications of any changes, and some timeout-triggered clean-up logic so if one application requested to be notified of a change, but then disconnects abruptly, the infrastructure created for that notification is automatically removed after some time. But if this happens and then another notification fires before the infrastructure has been re-created (which happens when the client application subscribes again), then an entry will be logged to the SQL Error logs, stating that the remote service has been dropped; exactly the notification I was seeing each time the service got stuck. Furthermore, the notification that fires while there’s no infrastructure will never reach the client application, which in turn will not be able to re-subscribe, and thus any further changes will not be seen by it. That explains why our service was losing synchronization with the database!

The next step was to determine why this was happening. According to the blog I found, something that could cause the time-out to expire was that your callback handlers were taking too long to execute (more than 2 minutes, which is the time-out set by Query Notifications). I could see in our service’s logs that the handler took less than a second to execute, so this didn’t seem to be the issue… but then I remembered that the handler was executing many times, and wondered if all that time added up; and in fact it did. So now I knew that the problem was being caused by the handler executing over and over and over… But why was it doing that?

I started up a fresh instance of the service, and opened a real-time log file monitoring tool to see what was happening. And I started causing the DB to fire notifications that in turn activated the problematic callback handler. Update in the DB. 2 log entries for the handler. Another update in the DB. 4 log entries for the handler. No, that certainly can’t be it… Can it? Another update in the DB. 8 log entries for the handler. Yes, it is. Yet another update in the DB. 16 log entries for the handler. Any programmer should be able to see that the service was calling the handler an exponential amount of times and that would eventually, but certainly, lead to an update that took more than 2 minutes to process.

Armed with that knowledge, I went straight to the code and quickly found the issue. Our callback handler was calling 2 other methods, and each of those created a subscription to the same table. So the next time the table changes, 2 handlers fire, and each in turn calls the 2 methods again, now creating 4 subscriptions to the database. Etc, etc, etc. The problem was easily solved with a bit of refactoring, and the lesson learned, albeit a bit painfully, was: if you need to re-subscribe to the DB-issued notifications in your callback handler, make sure you do it quickly, and only do it once per handler.

I hope this helps somebody in the future, and saves them the couple of days it took me to debug and correct this issue.

 

Comments

Leave a comment

 
 
 
 
CAPTCHA Image Validation