Measuring Quality
What is Software Quality? How do you make sure you're shipping a high-quality experience? And how do you build that into your development practice?
A Programmer can understand a problem or feature request, envision the solution, break it down into stories, and implement it in code. Not everyone can do that. Not everyone is suited to it. That's truly a gift. An Engineer goes even further: they consider how that software will operate in the wild. How will it react under unforeseen conditions? How will it scale? How will it be managed in production? The former is a builder who can create an experience out of whole cloth. The latter is a professional who delivers and manages software for a living. As a career. As an Engineer.
Understanding Software Quality, knowing how to build it into your development process, and knowing how to manifest it in your team will set you apart from the crowd, regardless of where you are in your career.
Quality distinguishes a mediocre, forgettable experience from a truly memorable and premium one. Quality is often implied more than specified, but it can be measured. In fact, it must be measured. As an engineer you cannot just assume quality exists and improves over time. Regardless of how the software is brought to life (often in the scrappiest of processes), you must ensure that its operation improves over time, and the only way to do that is… to measure it.
From abstract to concrete: what is meant by Quality when it comes to software, and how do you know you're delivering it?
Some questions you might ask yourself or your team:
What is the current status of your production infrastructure? Is everything working? How do you know? How long does it take to find out?
Even if things are working, are they working more or less efficiently than yesterday? More or less than last week? Or last year?
How is your system scaling with additional usage? Is it ready to react to a rapid influx of new traffic or customers? How do you know?
Are most of your defects reported by end users or are they caught internally and resolved before they reach a customer?
Do you have a well-understood process for triaging production issues? How do you ensure those issues don't happen again? Are there checks and balances in place?
Does your system have availability targets? Macro targets for the experience overall or specific targets for individual components and services? What are your SLAs, SLIs, and SLOs (Service Level Agreements, Indicators, and Objectives, respectively)? See the sketch after this list for how they relate.
When was your last outage? How was it triaged? What are your Mean Time To Respond and Mean Time To Mitigate for production issues? Are they increasing (worsening) or decreasing (improving) over time?
Financial: Is your system costing more and more to run as you bring on new customers, or is it getting cheaper and cheaper? Are you handling more customers with less infrastructure, or the reverse?
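To make those terms concrete: an SLI is the measured indicator, an SLO is the internal objective you hold yourselves to, and an SLA is the external promise to customers, usually with consequences attached. Here is a minimal sketch, with hypothetical numbers, of checking an availability SLI against a 99.9% SLO and the error budget that implies:

```python
# Hypothetical numbers for illustration only.
total_requests = 1_250_000   # requests served this month
failed_requests = 980        # responses that violated expectations (5xx, timeouts)

# SLI: the measured indicator, here simple availability.
availability_sli = (total_requests - failed_requests) / total_requests

# SLO: the internal objective we hold ourselves to ("three nines").
availability_slo = 0.999

# Error budget: how much failure the SLO allows before we're out of budget.
allowed_failures = total_requests * (1 - availability_slo)
budget_remaining = allowed_failures - failed_requests

print(f"SLI: {availability_sli:.5f}  SLO: {availability_slo}")
print(f"Error budget remaining this period: {budget_remaining:.0f} requests")

if availability_sli < availability_slo:
    print("SLO breached: time to prioritize reliability work over features.")
```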
Putting it into Practice: If those questions caught you off guard, that's good! Knowing how your software operates is known as "Observability", and it is implemented through "Instrumentation": you build metrics reporting and measurement directly into the software and then track those metrics at regular intervals. You must trigger alarms and page the support team when expected tolerances are exceeded. So where to start?
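As a rough sketch of what building measurement into the software can look like, here's a toy example: a wrapper that records latency and error counts for a request handler, plus a crude tolerance check that pages the on-call. All of the names (instrumented, check_tolerances, page_on_call) and the 1% threshold are illustrative assumptions, not any particular tool's API.

```python
import time

# Toy in-process metric stores; a real system would ship these to a metrics backend.
latencies_ms = []
error_count = 0
request_count = 0

def page_on_call(message):
    # Stand-in for a real paging integration (PagerDuty, Opsgenie, etc.).
    print(f"PAGE: {message}")

def instrumented(handler):
    """Wrap a request handler to record latency and success/failure counts."""
    def wrapper(*args, **kwargs):
        global error_count, request_count
        start = time.monotonic()
        request_count += 1
        try:
            return handler(*args, **kwargs)
        except Exception:
            error_count += 1
            raise
        finally:
            latencies_ms.append((time.monotonic() - start) * 1000)
    return wrapper

def check_tolerances():
    """Crude alarm: page the on-call if the error rate exceeds a 1% tolerance."""
    if request_count and error_count / request_count > 0.01:
        page_on_call(f"error rate {error_count / request_count:.2%} exceeds 1% tolerance")
```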
If your team doesn't already track key metrics, you might consider starting a discussion. Build some mindshare and support amongst your team and get a process up and running. I generally refer to this as the "Operational Excellence" or "OE" process, and I've provided an outline below.
Ask, "What are other teams doing within your organization?" There's probably an opportunity to standardize and share lessons learned across groups, and working with another team can give you a jump start. Does your organization have an SRE (Site Reliability Engineering) team? Reach out to them. Operational Excellence is literally their job, and not only can they guide you, it's in their best interests to work closely with software engineering so you both can be successful.
You might create some tooling or scaffolding to automate your instrumentation, but before you do, consider that there's a whole world of tools and techniques out there that can help you implement and track the metrics that determine the health of your production software. Whatever you build you'll have to maintain, so beware of reinventing a wheel that's rougher and bumpier than what already exists in the Open Source community.
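As one example of leaning on existing tooling (assuming a Python service and the open-source prometheus_client library; your stack will dictate the equivalent), exposing a request counter and a latency histogram takes only a few lines, and the surrounding ecosystem handles scraping, storage, dashboards, and alerting for you:

```python
from prometheus_client import Counter, Histogram, start_http_server
import random, time

REQUESTS = Counter("app_requests_total", "Total requests served", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

@LATENCY.time()                    # records how long each call takes
def handle_request():
    time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)        # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```

Point a Prometheus server at that endpoint and you get storage, querying, graphing, and alerting without writing or maintaining any of that plumbing yourself.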
What to expect: Expect lots of issues in the beginning. Services outside of SLA, slow or unresponsive screens, bugs going undetected, crashes and failures. Left alone, entropy always increases, and so it is when you ship software constantly without a solid Operational Excellence process in place.
You may even encounter cultural resistance. That's normal, and a potential sign that not all your teammates are on a software professional's path. That's ok. We're all on a journey.
When you go to the doctor for your yearly physical (you do schedule a physical, don't you?) you want to hear the magic word "unremarkable". It means nothing's wrong. Congratulations! You're a healthy individual. Boring is good. We want the same thing when it comes to our systems' health. Total, utter boredom is the goal!
Suggested Operational Excellence process:
Sit down weekly with representatives from every part of your system. Why weekly? You need to develop muscle memory. Spacing reviews close together forces teams to be smart about how data is collected. It should be automatic and frictionless.
Have each attendee bring data: evidence of their software's performance. These might be graphs, dashboards, log trends, or something else. The key is that it must be hard data that's been measured, not prose or bullets on a slide that someone has hand-written.
All data should be available outside of the review for all to query and question. This ensures the process is transparent.
Don’t assign a single person to collect everyone’s data on their behalf. This is both unfair and ineffective. Instead have each attendee bring their own data and report on it themselves. This builds accountability into the process.
Review any incidents or anomalies that occurred in the past week. Check for variances. Are things trending one way or another week over week? Discuss any outages. Don't use this time for an in-depth retrospective; rather, just ensure that there's a retro scheduled and that it's owned by a specific individual. (More on Retrospectives and Ownership in future chapters.)
You may find not everything lends itself to being measured. That's normal. If you can't measure the thing directly, you can surely measure the things around it that point to its existence. You can track a trend line and, over time, improve its direction. Without measurement you are not doing engineering. You're acting on intuition, otherwise known as "winging it".
Some ideas for what to measure:
Service Availability - was your service up and running, meaning responding with HTTP 200s (or whatever the expected response is) for the full period in which you’re reporting?
Service Throughput - how many requests per second were served? How did this number vary over time? When were the peak hours of operation? What did your system’s utilization look like? Did your fleet have any headroom? Did you auto-scale or have to pro-actively provision infrastructure to handle the volume?
Latency - what were your response times? Page or screen load times? Above the fold vs below. (ATF or Above The Fold refers to the portion of the experience that the customer sees without having to scroll).
Crash Rates - are they nominal, or were there any spikes? Is there an explanation that correlates with when the spike occurred? Has there been a customer impact? Can you quantify it?
For many metrics it's best to talk in terms of percentiles. You might hear phrases like "tp90" or "tp95". In plain language this just means that for a set of datapoints, e.g. service response times, the tp95 is the value at or below which 95% of the requests were served. Generally tp90 is too low to be useful, as it means 10% of the data might be outside of SLA; for a system of any reasonable scale that's too much. Your aim should be to ensure tp95 or higher is within your Service Level.
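As a quick illustration (with made-up response times and a simple nearest-rank method), computing a tp value is just a matter of sorting the datapoints and reading off the value at the requested rank:

```python
import math

def tp(percentile, datapoints):
    """Value at or below which `percentile`% of the datapoints fall (nearest-rank)."""
    ordered = sorted(datapoints)
    rank = math.ceil(len(ordered) * percentile / 100)
    return ordered[rank - 1]

# Hypothetical response times in milliseconds.
response_times_ms = [32, 41, 45, 47, 52, 55, 58, 63, 70, 80,
                     88, 95, 110, 130, 145, 190, 420, 950, 1200, 2400]

print("tp50:", tp(50, response_times_ms), "ms")
print("tp95:", tp(95, response_times_ms), "ms")
print("tp99:", tp(99, response_times_ms), "ms")
```

The same shape works for any metric you collect: sort, take the rank, and compare the result to your Service Level.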
Things to avoid: Be careful this review doesn't turn into a lot of happy-talk. If there's nothing to report (because everything ran smoothly), move on and give the time back. The review shouldn't be a status report to upper management; it should be a collaborative session aimed at the betterment of the team. Nor is it a blame session. If one team is struggling, take it offline and help out. Having all teams report creates a healthy tension and keeps everyone honest. Keep the session tight and focused. If it turns out there's not much to talk about, that's great. But ask yourself: are you really measuring the right things? Is the Quantitative data supporting what your customers are reporting in the field (which is often more Qualitative)?
This is just a rough outline and I encourage you to tailor the process to suit your needs. Start somewhere, no matter how rough, and then iterate and discuss amongst your team to improve the process week over week. As an individual, this is a chance for you to drive the conversation and influence your team for the better. Remember why you're doing it and work accordingly to achieve that aim.
Quality is a vast topic and we've only just scratched the surface. I haven't covered the development process or the various types of testing: automation, chaos engineering, load and performance testing, and so on. There's a lot to explore, and the pursuit of quality in the work you produce should be a theme that runs through your entire career. As a Software Engineer, when you understand the full spectrum of what you build, you elevate yourself to a whole new level of responsibility, influence and impact. I hope I've given you some ideas to think about and some things to discuss with your team.
Why it matters: Customer trust is built on Quality and trust is what drives the business. Think of the brands you admire and perhaps aspire to work for one day. They all have quality baked in and you probably don’t even think about it. “It just works”. That assurance is implicit when we think of The Fruit Company in Cupertino or the Book Company in Seattle. It’s hard to quantify the impact on the business but it is very real and we know this because we feel it…when it’s missing.