The 10-Page Torture Test
Topic: 01PTT: Microsoft Azure Apology by Jason Zander
Pitchpatch (Administrator)
« on: November 22, 2014, 09:39 AM »



Corporate apologies suck. They always drip with responsibility-dodging weasel words marshaled into emotionless, impersonal paragraphs bristling with mitigation and damage control, obfuscated from start to finish by meaningless jargon and nickel-and-dime customer-service platitudes.

"We take this issue very seriously." Really?  Then how could this happen?  If it happened while you were already taking "it" very seriously, doesn't that mean you need to dial it up all the way to "extremely seriously" to stop it happening again?

No.  "We take this issue very seriously" is meant to convince an audience that the people in charge are fit to be in charge.  It means: "Oh shit, we dropped the ball and now everybody's watching -- for the love of god somebody grab that thing -- Rodney, RODNEY pick it up, pick it -- no, he fumbled.  SAM, it's on your left, your OTHER left, grab it GRAB IT!  Ah, gentlemen, ladies and gentlemen, we're experiencing a slight interruption to our ball-carrying capacity affecting a very low number of our ball carriers, and we're actively engaging our affected customers to achieve a recovery posture at this time, so everything's fine, everything's great, just really really peachy, thank you."

As it happens, this corporate apology from Microsoft's Jason Zander is not a good example of corporate assholery.  Jason does a reasonable job of hitting the high notes:

Stuff broke and we apologize.  Here's how we're fixing it and here's when we think stuff will be fixed.  Here's what we know so far about what broke and why.  Here's what we propose to do to not have it happen again.  Here's what to do if you're affected and how to contact us.  Here's what we think is a fair way to compensate you for our screw-up.

Jason forgets to stick the landing and omits some of those last points, but overall he's not being a dick about it.  Not deliberately.  But I'll point out some corporate PR habits that are kind of dickish anyway.

Jason's blog post is at http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/

It's four paragraphs long.  So we'll tackle the analysis in four parts.  Our goal: cut the bullshit, deliver the message, be sincere, be helpful, get the hell out of Dodge.



Yesterday evening Pacific Standard Time, Azure storage services experienced a service interruption across the United States, Europe and parts of Asia, which impacted multiple cloud services in these regions.  I want to first sincerely apologize for the disruption this has caused.  We know our customers put their trust in us and we take that very seriously.  I want to provide some background on the issue that has occurred.

"Experienced a service interruption" downplays seriousness.  That might be okay for ten or twenty minutes of downtime on a single day, but this "interruption" went on for about 11 hours.  Notice how the focus is on the services, not the affected users.  It impacted "Azure storage service" and "cloud services," not customers and people.  Not a great start to your apology, Jason.  We'll see as we go how the blog post speaks to customers in the third person, how it's not speaking directly to YOU, the reader, who probably came to the Microsoft Azure site to get some urgently needed answers.

"I want to first sincerely apologize for the disruption this has caused."  That's the single most important sentence in the message.  Somebody in charge is sorry.  And we hope this contrite person has all the authority necessary to kick asses and take names so this never happens again.

"We know our customers put their trust in us and we take that very seriously."  Aaand Jason immediately stumbles after a good start.  Taking things seriously is a given.  Yet it's the first pony companies trot out at the first damage-control show.  Yes, we know you take it seriously or we wouldn't be your customers.  We'd be sitting over there laughing and hugging our aching sides while a grinning clown in a suit shreds our important documents with a lawnmower.  IF YOU HAVE TO SAY YOU'RE TAKING IT SERIOUSLY THEN YOU HAVEN'T BEEN TAKING IT SERIOUSLY.  Like when you're on the operating table about to go under the knife and your brain surgeon pauses to announce: "Nothing to worry about.  I take this very seriously."

The final line in this paragraph is embarrassingly useless -- the same as nervously clearing your throat and tugging at your suddenly too-tight collar.

Here's my revision:

Yesterday evening Pacific Standard Time, an Azure storage services interruption impacted multiple cloud services across the United States, Europe and parts of Asia.  I want to first sincerely apologize to you for this disruption.  You trust us and we take that very seriously.  This is what happened.

What happened.  Apologize.  Explain.  Your audience is here for the facts, so squeeze that trigger and don't let go until you're out of bullets.

"I want to first sincerely apologize to you for this disruption."  TO YOU, our affected customers.  Very important.  Otherwise Jason might as well be apologizing to his manager, to his team, to his wife, to his dog for not arriving home in time to go for walkies -- who knows.

There's no avoiding that "we take this very seriously" so get it out of the way quickly and move on to the explanation.

Paragraph two is the workhorse of the blog post.  Let's see how it performs.



As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.

"An issue was discovered."  "It had been tested."  "Customer-facing."  "Flighting."  "An issue that resulted in."  "The net result."  "Inability for the front ends to take on traffic."  "Services built on top experienced issues."

All very calm, droll, passive and impersonal.  A scientist squinting through a microscope at a not particularly exciting specimen.  No fault to be placed at Jason's feet.  He's doing an okay job explaining what caused the outage.  But do we need to know about "flighting"?  It's like he dropped it in there as a cool bit of trivia for the tech-geek readers.  I suppose it sort of shifts blame from his testing team to a testing process, reducing their culpability: "Hey, boss, it wasn't us!  That damn flighting process let the team down again."  Ugh.  I think we can improve this paragraph.  Trim and simplify and throw out repetition and pretend we're speaking to a wider audience than just the I.T. crowd.

A performance update to Azure Storage reduced capacity for some Microsoft services including Virtual Machines, Visual Studio Online, Websites, and Search. We tested the update over several weeks in a process we call “flighting,” where we try to identify issues before we broadly deploy updates. The flighting test showed a notable performance improvement, so we deployed the update across the full storage service. During the rollout we discovered storage blob front ends were entering an infinite loop, which flighting had failed to detect. The front ends rejected new traffic, which in turn caused other services to fail.

In this paragraph, Jason pretty much admits his team rushed to deploy the update because they saw a desired performance gain.  That's bad practice -- something Google understands and avoids by rolling out its updates in controlled stages.

Observe how it's no longer about machines and process.  It's stripped down to the "we": "we tested, we tried, we deployed, we discovered."  It's about the soldiers in the trenches scrambling to cope with this emergency.  An audience can empathize with people.  Machines and process and services mean nothing to us emotionally.  If you want us on your side, tell us about YOUR struggle, about the problems your people faced and overcame during this disaster.  Your customers suffered and are understandably unhappy.  At least give us an engaging story -- a story about people -- that makes it easier to process what happened.
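
(An aside for the non-I.T. crowd, since "flighting" came up: a staged rollout is basically a loop with an exit ramp.  Below is a back-of-the-napkin Python sketch -- every name in it is mine, invented for illustration, nothing from Azure's actual tooling -- of the kind of gate that's supposed to stop a bug like that infinite loop from ever reaching the full fleet.)

import random
import time

STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet at each stage
SOAK_SECONDS = 1                    # shortened for the sketch; real soak times run hours or days

def deploy(fraction):
    """Stub: push the update to this fraction of the fleet."""
    print(f"deploying to {fraction:.0%} of hosts")

def healthy(fraction):
    """Stub health check.  A real one compares error rates and latency
    for the updated slice against the rest of the fleet."""
    return random.random() > 0.1    # simulate an occasional bad signal

def rollback():
    """Stub: restore the previous build everywhere the update landed."""
    print("rolling back")

def flight_update():
    """Expand the rollout stage by stage; stop and roll back on the first bad signal."""
    for fraction in STAGES:
        deploy(fraction)
        time.sleep(SOAK_SECONDS)    # let the stage soak before judging it
        if not healthy(fraction):
            rollback()
            return False            # the bug never reaches the full fleet
    return True

if __name__ == "__main__":
    print("update shipped" if flight_update() else "update aborted")

The point isn't the code, it's the discipline: the update only moves to the next stage after the current one has proven itself -- the controlled-stages practice mentioned above.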



Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions. While services are generally back online, a limited subset of customers are still experiencing intermittent issues, and our engineering and support teams are actively engaged to help customers through this time.

This paragraph works hard to keep the reader at arm's length.  Crowd control.  Stay behind the barriers, you filth!  Lots of passive construction and idea sanitizing.  "Mitigation... availability improvement... limited subset... intermittent issues... actively engaged."

And the oily jaw-flapping of "help customers through this time."  Sweet jitterbugging Jesus.

The revision:

We immediately rolled back the change, then restarted the storage front ends to fully undo the update, and most of our customers saw improvement across the affected regions. Services are generally back online, and our engineering and support teams continue to assist those customers still experiencing issues.

And the final paragraph, the mea culpa and where-to-from-here:



When we have an incident like this, our main focus is rapid time to recovery for our customers, but we also work to closely examine what went wrong and ensure it never happens again.  We will continually work to improve our customers’ experiences on our platform.  We will update this blog with a RCA (root cause analysis) to ensure customers understand how we have addressed the issue and the improvements we will make going forward.

"Rapid time to recovery."  Sigh.

When things go wrong we focus on rapid recovery. Then we closely examine what happened and ensure it never happens again.  We will update this blog with a root cause analysis (RCA) to further explain what happened and how we will improve your experience on our platform.

And we're done.

All told, that's about 370 words shrunk to 240 -- a 35% reduction in wordiness.

One day more companies will realize a stream of words is not communication.  And an apology is only sincere if communicated clearly.




« Last Edit: November 22, 2014, 01:21 PM by Pitchpatch »

NTSF:SD:SUV::