Once, long ago, there was an engineer who broke the operating system particularly badly. Now, if you’ve implemented important software for any serious length of time, you’ve seriously screwed up at least once — but this was notable for a few reasons. First, the change that the engineer committed was egregiously broken: the machine that served as our building’s central NFS server wasn’t even up for 24 hours running the change before the operating system crashed — an outcome so bad that the commit was unceremoniously reverted (which we called a “backout”).
Last Friday I got pulled in to a very hot customer call.
The issue was best summarised as:
Since migrating our WebLogic and database services from AIX to Solaris, at random times we are seeing the WebLogic server pause for a few minutes at a time. This takes down the back office and client services part of the business for these periods and is causing increasing frustration on the part of staff and customers. This only occurs when a particular module is enabled in WebLogic.
cache hostile: no reuse.
The older I get, the more engineering values matter to me — and the more I seek out shared values in those with whom I endeavor to build things. For us at Joyent, those engineering values reflect that we operate the software we make: we believe that foundational systems must be designed to be robust and high-performing — and when they fail in this regard, it is incumbent upon the system itself to provide the tooling to diagnose the errant behavior. These values are not new (indeed, they are some of the oldest in computing), but there are times when they can feel endangered.
Update October 8, 2015: Android 5.1 (“Lollipop”) OTAs
Like many programmers I like to try out new languages. After lunch with Alex Crichton, one of the Rust contributors, I started writing my favorite program in Rust. Rust is a “safe” systems language that introduces concepts of data ownership and mutability to semantically prevent whole categories of problems.
I recently received a support request to investigate what happens when vdevs are removed from a pool, in a recoverable way, while operations are running on the pool.
My initial thought was to make the disk devices available to a guest ldom from a control ldom, but I found that Solaris and LDOMS coupled things too tightly for me to do something which had the potential to cause damage.