Internet Explorer 9 was released 4 years ago, in 2011. That sounds like a short time, but it really is an eternity in the world of web development. IE has had 3 releases since then and the 4th is on the horizon, but the upgrade cycle of many companies doesn’t reflect that fast pace. So we must continue to support IE9, and in doing so we will uncover more obscure bugs that will never see fixes.
This was the case recently when I was working on a grid view in JavaScript. A customer reported that IE9 would crash, as in close the browser window down, when dragging a column to reorder it. They could drag the column in either direction and reproduce the crash reliably. There was no error message and no clue about where to start, other than the drag action itself.
Since there were no error messages to go on, I used WinDbg after reading an MSDN blog post on debugging Internet Explorer. I already had a virtual machine downloaded from modern.ie. After getting WinDbg attached to IE, I set off to replicate the crash and had no trouble doing so. The stack trace ended inside mshtml.dll, which is also known as Trident, the IE rendering engine.
WinDbg delivered the crash stack:
6D6DD5BB 81 3F D0 AD A3 6D cmp dword ptr [edi],6DA3ADD0h
mshtml.dll!CDispScroller::SetScrollOffsetInternal() Unknown
mshtml.dll!CDispScroller::SetScrollOffset(class CSize const &,int,bool,enum tagCOORDINATE_SYSTEM) Unknown
mshtml.dll!CDragDropManager::DragOver(unsigned long,struct _POINTL,class CSize const &,unsigned long *) Unknown
mshtml.dll!CDoc::DragOver(unsigned long,struct _POINTL,class CSize const &,unsigned long *) Unknown
mshtml.dll!CDoc::DragOver(unsigned long,struct _POINTL,unsigned long *) Unknown
mshtml.dll!CDropTarget::DragOver(unsigned long,struct _POINTL,unsigned long *) Unknown
Looking at the SetScrollOffsetInternal function at the top of the stack, I suspected the problem might be in the CSS positioning of the grid area. The grid could scroll left and right as the user dragged, to handle a grid wider than the browser window. I removed all positioning from the element and was still able to get the crash. I have since lost the crash log that had the right clue, which is the one I got after removing the positioning classes. That log mentioned CSS colors and painting, and I remembered that there was a hover style on the columns as you dragged over them to indicate which one was active. I removed the class and couldn’t crash the browser! However, that cost me the visual cue, and having found the issue I still wanted to keep the existing functionality in place.
The problem was in the timing. When we hit the drag enter event we would check whether the element was a valid target, then add the class that changed its look. The crash happened when dragging an element, scrolling the element container, and updating the class on the drag enter target’s parent when it was outside the container. That update caused an attempt at a paint, which ultimately took down the browser on a null exception. WinDbg even let me walk through the assembly instructions to see what Trident was trying to do at the time of the crash. The solution was easy: update the CSS on the next turn of the event loop rather than inside the same event handler. So now we drag and scroll, then update the style using setTimeout, which pushes that action onto the JavaScript event queue.
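The deferral described above can be sketched roughly as follows. This is a minimal illustration, not the grid's actual code: the class names `grid-column` and `column-drag-over`, the helpers `isValidDropTarget` and `addClass`, and the optional `defer` parameter are all hypothetical, and for simplicity the class is applied to the event target itself. The only point it shows is that the handler returns before the class is touched.

```javascript
// Hypothetical class names, chosen only for this sketch.
var COLUMN_CLASS = 'grid-column';
var ACTIVE_CLASS = 'column-drag-over';

// Assumption: valid drop targets are identified by a marker class.
function isValidDropTarget(el) {
  return !!el && new RegExp('(^|\\s)' + COLUMN_CLASS + '(\\s|$)').test(el.className);
}

// IE9 does not support classList, so edit className directly.
function addClass(el, name) {
  if ((' ' + el.className + ' ').indexOf(' ' + name + ' ') === -1) {
    el.className = el.className ? el.className + ' ' + name : name;
  }
}

// Drag enter handler. The style change is not made here; it is queued so it
// runs after the current drag/scroll event has finished being processed.
// `defer` is optional and only exists to make the sketch easy to exercise.
function onDragEnter(event, defer) {
  var target = event.target || event.srcElement;
  if (!isValidDropTarget(target)) {
    return;
  }
  defer = defer || function (fn) { setTimeout(fn, 0); };
  defer(function () {
    addClass(target, ACTIVE_CLASS);
  });
}
```

In a page this would be wired up with something like `column.addEventListener('dragenter', onDragEnter, false)`; a delay of 0 is enough because the goal is only to leave the current event, not to wait any particular amount of time.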
The older a technology gets, and the longer you have to keep supporting it, the more you need to learn new tools and take the investigation into your own hands. Don’t be afraid of debugging tools like WinDbg. It didn’t give me the answer outright, and Trident is closed source, but the stack trace gave me some logic to follow. It gave me the clues I needed to deduce the problem and come up with a solution that not only works, but one where I understand why I had to add that setTimeout.