IRC outage (2015-04-15)

Through a ridiculous series of events and incredibly poor luck (read: I hope it’s not incompetence), our VPS provider is experiencing power issues. You can follow along at TransIP’s site; some of you may need to hunt down a babelfish.

Updates will be available through this post and through the psych0tik twitter.

Since I can’t make it go, I’m going to take a shower instead.

Update@2015-04-15 03:11 UTC:
Some systems are returning to service, but we don’t have the VPS control panel yet. I might recommend a good pair of jumper cables to them soon.

Update@2015-04-15 05:26 UTC:
Refresh; still nothing; repeat. I might just sleep and get everything running first thing in the morning. I’ll give them another 30 minutes.

Update@2015-04-15 06:19 UTC:
IRC is back!


QuickBlog: Trac, SVN, Python and threads

It’s been a long time since I’ve done any reasonable blogging.  Between work and life, my plate has been full, and the last thing on my list is usually “write a blog about stuff you’ve done.”  More often than not, the things I’ve done I can’t write about due to contractual concerns with work.  In an attempt to revitalize some posting and encourage myself to push more content out, I’m going to try to start creating these QuickBlogs.  These will be mostly focused around specific, bite-size content and will likely get a lot less word-smithing than my typical articles.  So there’s the intro.  Quick. Easy.  Let’s move on to the juicy stuff.

The problem

As part of my job, I’m often forced to look at or work on obscure setups that probably shouldn’t be production solutions ... and then fix them.  In this case, we’re talking about a Trac instance supporting an SVN repository.  Trac is being run under Apache2 using mod_wsgi in its daemonized, worker-mode configuration.  The problem statement is simple: Trac begins spitting out 500 response codes to certain requests, and before long all requests are returning 500s after exceptionally long wait times.  To make matters a bit more interesting, these errors began occurring mere hours after a full system upgrade.

The first problem was logging.  Trac’s logs simply didn’t tell us what was happening.  Apache’s logs were even worse: random 500s with no real indicator or commonality between the failures.  A peek at the resources on the box showed CPU and memory consumption would eventually reach obscene levels, and the only obvious fix was killing processes.

After a number of eyes spent time poring over the logs, we identified a particular SVN revision that seemed to be giving Trac the business more than others.  A quick check of this changeset revealed a 1.2 gigabyte diff of a large file with significant modifications.  This changeset was also committed just a few hours after our system upgrade, which fit our problem timeline.  Now that we had a potential root cause, it was time to sort out the logging issue.

Trac’s native logging leaves something to be desired, but the trac.log file did include enough text to make finding the last logged action within the code trivial.  Looking around within the code, I noticed that Trac provides an object for debug logging that I could use to dump to the trac.log file.  Most code within Trac has access to the self.log.debug() function, which I made heavy use of throughout to identify where execution was going, what variables looked like, and, most importantly, where the code eventually would block and stop processing the diff.
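To give a flavour of it, the instrumentation was nothing fancier than sprinkling lines like these into the suspect code paths.  The method and helper names below are made-up stand-ins; self.log.debug() is the piece Trac actually provides:

  # inside an existing Trac method that already has access to self.log;
  # every name other than self.log.debug is a hypothetical placeholder
  def _render_changeset(self, req, chgset):
      self.log.debug("rendering changeset %r", chgset.rev)
      diff = self._build_diff(chgset)
      self.log.debug("diff built, %d bytes, about to htmlify", len(diff))
      html = self._htmlify(diff)    # the call that turned out to block
      self.log.debug("htmlify finished for changeset %r", chgset.rev)
      return html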

A few hours of debugging later, I had my code location (or at least close enough to know what the issue was).  After the diff() is run, Trac attempts to htmlify() the results for presentation to the user; an action that takes far too much time when your diff is 1.2 gigabytes.  With the additional logging I added, I could see the 500s coincide with the completion of this long-running function, telling me I had found my bottleneck.  This was only half the answer, however, as we had pages that had nothing to do with this, or any, changeset that would also toss 500s back at us.

The obvious question was “is this related to our massive changeset issue,” to which I thought an equally obvious answer was “probably, yes.”  During the troubleshooting of the changeset issue, I noticed that the logging seemed to lag behind my actual work.  I’d make an update to the code, refresh the Trac web page, and my logs would not reflect my code changes.  A few requests later, I’d see those changes.  Based on this, I was fairly certain that something about these diffs was cascading into other execution.

With a problem definition in mind, I moved forward to test my hypothesis.  I modified my Trac logging so that I could clearly see when requests came in and completed.  I then opened up two different pages in the Trac web interface.  One was a diff view of the problem changeset and the other was an unrelated, lightweight page that should always return content.  I issued a few fast-paced requests to the diff page (literally just spam-clicking the refresh/cancel button in Firefox), followed by a single request to our canary page (in this case a random Trac ticket).  Watching the logs, I saw exactly what I expected: when each diff-page request got to the htmlify() function, it would block ALL THE OTHER REQUESTS from even being accepted at the mod_wsgi level (Apache still saw them).  As each request finished, the next in the queue would be processed and would block again, and again, and again.  This cascade of blocks caused the requests for our ticket to fail because, even though the Python/Trac code executed quickly, by the time the request was even processed by Trac, Apache had already given up on it and moved on with a 500.
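If you’d rather not spam-click Firefox, the same experiment can be approximated with a small script.  This is only a sketch: the URLs are placeholders for your own Trac instance, and it uses the third-party requests library:

  # fire a handful of expensive diff-page requests, then one cheap "canary"
  # request, and print how long each takes to come back
  import threading, time
  import requests   # third-party: pip install requests

  DIFF_URL = "http://trac.example.com/changeset/89"   # placeholder: the huge diff
  CANARY_URL = "http://trac.example.com/ticket/1"     # placeholder: a cheap page

  def timed_get(url):
      start = time.time()
      try:
          status = requests.get(url, timeout=300).status_code
      except Exception as exc:
          status = exc
      print("%-45s %6.1fs -> %s" % (url, time.time() - start, status))

  # hammer the expensive page first...
  diff_threads = [threading.Thread(target=timed_get, args=(DIFF_URL,))
                  for _ in range(5)]
  for t in diff_threads:
      t.start()
  time.sleep(1)

  # ...then request the cheap page; if requests are serialising behind one
  # long htmlify() call, this comes back slow (or Apache gives up with a 500)
  timed_get(CANARY_URL)
  for t in diff_threads:
      t.join()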

So the question you may be asking here (especially if you don’t know me and haven’t heard one of my rants on Python threading) is: why didn’t a worker-mode, daemonized mod_wsgi simply let the various threads handle the requests in parallel, completely side-stepping this problem?  The answer lies within the Python interpreter itself and its GIL (as well as, potentially, the implementation on the Trac side).

Python’s GIL acts as a safety mechanism, preventing you from doing “unsafe things” to volatile shared memory.  This works excellently for protecting your data, but it also means that within a single Python process only one thread can execute Python bytecode at any given moment, no matter how many cores the box has.  With Trac living in a single daemonized Python instance, every long-running request effectively blocked the rest.  This resulted in a single bad page being able to push the entire request queue so far back in time that everything behind it ended up as a 500.
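As a standalone illustration (nothing Trac-specific here), two CPU-bound threads take roughly as long as doing the same work back to back, because only one of them can hold the GIL at a time:

  # two CPU-bound threads vs. the same work run serially; under CPython the
  # threaded version is no faster, because the GIL lets only one thread
  # execute Python bytecode at a time
  import threading, time

  def burn(n=10 ** 7):
      total = 0
      while n:
          total += n
          n -= 1
      return total

  start = time.time()
  burn(); burn()
  print("serial:   %.2fs" % (time.time() - start))

  start = time.time()
  threads = [threading.Thread(target=burn) for _ in range(2)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  print("threaded: %.2fs" % (time.time() - start))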

While Trac could be patched to detect and ignore large diffs, or the Apache/mod_wsgi setup could potentially be moved to a pre-fork/embedded mode (much more memory intensive, but less CPU blocking), we determined the best fix was simply to purge the bad revision from our SVN (it really shouldn’t have been committed to begin with).

Fixing the SVN was fairly trivial.  I identified the problem revisions and used svnadmin’s dump and load, with a few flags, to rebuild the SVN repository minus the bad changes.  Note that this can cause problems if you’re removing anything that is modified by a later revision; in my case I knew these files weren’t touched again, so the revisions were safe to drop.  We’ll assume an SVN repository going up to revision 100, with problem revisions 89 and 92:

  # scratch space and a fresh, empty repository to rebuild into
  mkdir /tmp/svn-workspace
  svnadmin create /tmp/svn-workspace/fixed-svn
  # dump everything except the problem revisions (89 and 92); the later dumps
  # use --incremental so they only contain those revisions' changes
  svnadmin dump /path/to/svn -r 1:88 > /tmp/svn-workspace/first_part.dump
  svnadmin dump /path/to/svn --incremental -r 90:91 > /tmp/svn-workspace/second_part.dump
  svnadmin dump /path/to/svn --incremental -r 93:100 > /tmp/svn-workspace/third_part.dump
  # replay the dumps in order; --force-uuid keeps the original repository UUID
  svnadmin load --force-uuid /tmp/svn-workspace/fixed-svn < /tmp/svn-workspace/first_part.dump
  svnadmin load --force-uuid /tmp/svn-workspace/fixed-svn < /tmp/svn-workspace/second_part.dump
  svnadmin load --force-uuid /tmp/svn-workspace/fixed-svn < /tmp/svn-workspace/third_part.dump
  # tell Trac to rebuild its changeset cache against the rewritten repository
  trac-admin /path/to/tracdata/ repository resync YourSVNName



Let’s write a compiler in rust: pt1

In the tried and true tradition of building esoteric things to deal with
brainfuck, nelhage nerdsniped me into writing a brainfuck compiler
in rust.

I’m gunna gloss over tons of stuff, and only write about the interesting parts.
I’m also not going to actually walk through the code as I wrote it because it
was and is incredibly sketchy, but it’s better separated now so we’re going to
look at master (at time of writing).

Writing a parser

Rust makes this comically simple.

We define a recursive enum containing all the opcodes, and a Vec to hold the
contents of any loops:

#[deriving(Show)]
pub enum OpCode {
    // variant names here are illustrative; the real ones live in the repo
    IncPtr, DecPtr, Inc, Dec, Output, Input,
    Loop(Vec<OpCode>),
}

By deriving Show we’ll be able to use format!, println! and friends at
various points of debugging and get something plausible-ish.

Our actual parser is just about as simple. I won’t inline the
code, but it should read pretty simply.

We create a mutable vec to hold our actual program, and a loop_stack to hold
the contents of each nested loop, inserting those loops into its parent each
time we encounter a closing bracket. This also makes it simple to test for
unbalanced brackets.

This then lets us return the Program. Using Option was left as an escape
hatch (unbalanced braces could be caught, as could the file errors, but for now
failing is reasonable since it’s not exactly “production ready”).
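
For the curious, here’s the loop_stack trick sketched in Python rather than
the actual rust; the names are mine, not the ones in the real code:

  # parse brainfuck into nested lists of ops, using a stack of parent scopes
  def parse(source):
      program, loop_stack = [], []
      current = program
      for ch in source:
          if ch == '[':
              loop_stack.append(current)       # remember the parent
              current = []                     # start collecting this loop's body
          elif ch == ']':
              if not loop_stack:
                  return None                  # unbalanced closing bracket
              body, current = current, loop_stack.pop()
              current.append(('loop', body))   # insert the loop into its parent
          elif ch in '+-<>.,':
              current.append(('op', ch))
      if loop_stack:
          return None                          # an unclosed loop
      return program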

Next time, the evaluator.


Programming is literally the hardest thing

A conversation on the twitter really got me thinking about programming as a
profession, the human obsession with winning at everything (up to and
including failing at life), and, much though it kills me to say this,

The blog post that started it all is interesting- it begins with the author
saying that his friends have more physical jobs than he does, and he goes on
to explain why programming is so awful.

It’s not that I don’t see the point of his post. Programming really is awful.
These past few weeks have been among the most harrowing and demoralising that I
can remember, both in terms of invested effort and in losing faith in the tools
that we trust day to day. I enjoy ranting about what a hilarious shitshow
software is well above the mean for people who do what I do.

That said though- it’s worth maintaining some modicum of perspective. Ignoring
everything else, programming is innately safe. No job is without its risks,
but I would wager that comparing the outcome of RSI to the average workplace
injury on a construction site skews way in favour of the construction site
for both frequency and severity. (As an aside, where the hell would I find
stats for this?)

Michael raised a valid point about mental health, but I totally fail to see how
this relates to computer science. If you’re in a workplace that’s causing you
mental or emotional harm you should seek help, immediately. My intuition
suggests that you stand much better chances of actually receiving it if you’re
in the typically privileged shoes of your average software engineer.

My intent with this post isn’t to victim blame or to tell anyone miserable in
their job (or just needing a space online to vent) to ‘harden up’. I do really
stop to wonder though, when I read about the number of studies showing that
using a male name improves chances of success in the workplace and academia,
whether taking to a privileged soapbox to complain about how hard it is after
you get there without a handicap might be missing the bigger picture.


Heartbleed, psych0tik edition

As per the rest of the internet, our collective jaws fell through the floor when Heartbleed was disclosed.

First and foremost- our IRC server was not vulnerable. At present, the only live IRC node is magikarp, owned and run by carbon, and running gnutls. OpenSSL based *clients*, on the other hand, could plausibly have been compromised.

With that said, the first order of business once patching was done was to develop a working exploit. Given that the internet has had a few days to patch, and that numerous other exploits are now public anyway, I’m releasing mine.

Given the time-sensitive nature of the situation (we needed a working PoC to quickly enumerate internal services at $dayjob), I took the quick and dirty approach: rather than developing from scratch, I produced a patched libssl that, when linked against, turns any SSL client into a working Heartbleed vector.

You can find it at . For obvious reasons, I would recommend building without shared object support, and not installing this anywhere.


Caffeine napping for glorious success

Toward the end of my biphasic experiment I started experimenting with caffeine napping- I don’t remember where I originally got the idea, but there are plenty of plausible-sounding posts on the internets.

Basically, the routine is simple. Wait till you’re tired, have a shot of espresso, and immediately get 15-20 minutes of sleep. I’ve found it borderline impossible to pull off if I’m not tired, although I find it difficult to sleep without impetus anyway. I’m not convinced this is something I could do long-term, but it’s proving to be significantly more convenient to slot in around my work/life balancing act than the 45-minute daily naps I was taking during the biphasic experiment.
