It’s been a long time since I’ve done any reasonable blogging. Between work and life, my plate has be been full and the last thing on my list usually is “write a blog about stuff you’ve done.” More times than not, the things I’ve done I cannot write about due to work contractual concerns with work. In an attempt to revitalize some posting and encourage myself to push more content out, I’m going to try to start creating these QuickBlogs. These will be mostly focused around specific, bite-size content and will likely get a lot less word-smithing than my typical articles. So there’s the intro. Quick. Easy. Let’s move on to the juicy stuff.
As part of my job, I’m often forced to look at or work on obscure setups that probably shouldn’t be production solutions .. and then fix them. In this case, we’re talking about a Trac instance supporting an SVN repository. Trac is being run under Apache2 using mod_wsgi in it’s daemonized, workermode configuration. The problem statement is simple: Trac begins spitting out 500 response codes to certain requests and before long, all requests are returning 500’s after exceptionally long wait times. To make matters a bit more interesting, these errors began occurring mere hours after a full system upgrade.
The first problem was logging. Trac’s logs simply didn’t tell us what was happening. Apache’s logs were even worse, random 500’s with no real indicator or commonality between the failures. A peek at the resources on the box showed CPU and memory consumption would eventually reach obscene levels and the only obvious fix required processes to be killed.
After a number of eyes spent time pouring through the logs we identified a particular SVN revision that seemed to be giving Trac the business more than others. A quick check of this changeset revealed a 1.2 Gigabyte diff of a large file with significant modifications. This changeset also was committed just a few hours after our system upgrade, making sense with our problem timeline. Now that we’ve got a potential root cause, it was time to sort out this logging issue.
Trac’s native logging leaves something to be desired, but the trac.log file did include enough log text to make finding the last logged action within the code trivial. Looking around within the code, I noticed that Trac provides an object for doing debugging logging that I could use to dump to the trac.log file. Most code within Trac has access to the self.log.debug() function, which I made heavy use of throughout to identify where execution was going, what variables looked like, and most importantly – where the code eventually would block and stop processing the diff.
A few hours of debugging later, I had my code location (or at least close enough to know what the issue was.) After the diff() was run within Trac, Trac attempts to htmlify() the results for presentation to the user; an action that takes far too much time when your diff is 1.2 Gigabytes. With the additional logging I added, I could see the 500’s and the completion of this long-running function coincide, telling me that I’ve found my bottleneck. This is only half the answer however; as we have pages that have nothing to do with this, or any, changeset to the code that also will toss 500’s back at us.
The obvious question was “is this related to our massive changeset issue,” to which I thought an equally obvious answer was “probably, yes.” During the troubleshooting of the changeset issue, I noticed that the logging and seemed to lag behind my actual work. I’d make an update to the code, refresh the Trac web page, and my logs would not reflect my code changes. A few requests later, I’d see those changes. Based on this, I was fairly certain that something about these diffs was cascading into other execution.
With a problem definition in mind, I moved forward to test my problem hypothesis. I modified my Trac logging so that I could clearly see when requests came in and completed. I then opened up a two different pages in the Trac web interface. One page was a diff-view of the problem changeset and the other was an unrelated, lightweight page that should have always returned content. I issued a few fast paced requests to the diff-page (literally just spam clicking the refresh/cancel button in firefox), followed by a single request to our canary page (in this case a random Trac ticket.) Watching the logs, I saw exactly what I expected – when each diff-page request got to the htmlify() function, it would block ALL THE OTHER REQUESTS from even being accepted (at the mod_wsgi level, Apache still saw them.) As each request finished, the next in queue would be processed and would block again, and again and again. This cascade of blocks caused the requests for our ticket to fail because even tho the Python/Trac code executed quickly, by the time the request was even processed by Trac, Apache had already given up on it and moved on with a 500.
So the question you may be asking here is (especially if you don’t know me, and haven’t heard a rant on python threading) – Why didn’t a worker-mode, daemonized version of mod_wsgi simply allow the various threads to handle the requests in parallel, completely side-stepping this problem? The answer is within the Python interpreter itself and it’s GIL (as well as potentially the implementation on the Trac side.)
Python’s GIL acts as a safety mechanism, preventing you from doing “unsafe things” to volatile shared memory. This works excellently for protecting your data, but also means that Python’s threads can only run one at a time, per core/processor. Having only a single instance of Python means that, because Python can only use a single core/processor per instance, every request blocked the rest. This resulted in a single bad page being able to push the entire request stack so far back in time that they all had to result in 500’s.
While Trac could be patched to detect large diffs to ignore or the Apache/mod_wsgi set-up could potentially be moved to a pre-fork/embedded mode (much more memory intensive, but less CPU blocking) – we determined the best fix was to simply purge our SVN of this bad revision (which really shouldn’t have been committed to begin with.)
Fixing the SVN was fairly trivial. I identified the problem revisions, and used svnadmin’s dump and load, with a few flags, to rebuild the SVN repository minus the bad changes. Note, this may cause problems if you’re removing anything that is modified by a future revision, in my case I knew these weren’t further modified and were safe to remove revisions of. We’ll assume an SVN repository going up to revision 100, with problem revisions 89 and 92:
svnadmin create /tmp/svn-workspace/fixed-svn
svnadmin dump /path/to/svn -r 1:88 > /tmp/svn-workspace/first_part.dump
svnadmin dump /path/to/svn --incremental -r 89:91 > /tmp/svn-workspace/second_part.dump
svnadmin dump /path/to/svn --incremental -r 93:HEAD > /tmp/svn-workspace/third_part.dump
svnadmin load --force-uuid /tmp/svn-workspace/fixed-svn < /tmp/svn-workspace/first_part.dump
svnadmin load --force-uuid /tmp/svn-workspace/fixed-svn < /tmp/svn-workspace/second_part.dump
svnadmin load --force-uuid /tmp/svn-workspace/fixed-svn < /tmp/svn-workspace/third_part.dump
trac-admin /path/to/tracdata/ repository resync YourSVNName