Troubleshooting best practices for DevOps teams – strategies to recover quickly from downtime


It’s Saturday night – your system is down and customers can’t access the application anymore. And your key developers are out of reach. Sounds like quite an uncomfortable situation. Read here what you can do to prepare for such events – and recover as quickly as possible from outages and downtime.

The usual steps to remedy problems are clear: understand the issue, fix the root cause. Sounds very straightforward. But what if the person on call is not an experienced developer and doesn’t know right away what to do? In DevOps teams with shared responsibility and distributed 24/7 support, that will happen sooner or later. To be on the safe side you need an approach that enables the people on call to remedy the most common problems without deep expert knowledge. How do you prepare for that? Here are some best practices.

(1) Know the usual suspects

Chances are that this is not the first time the service has failed. That may be due to a known but not yet fixed problem, or due to a dependency on a service outside of the team’s control. Such potential “known” issues should be documented prominently, along with step-by-step instructions on how to get up and running again. Ideally this should be part of the ‘troubleshooting’ section of your runbook (see below).

(2) Provide quick diagnostics via monitoring boards

A good monitoring board is the starting point for efficient troubleshooting. Each service should have its ‘health’ board where the status of each major component is displayed, e.g. via green/red status panels. Make sure the overall situation can be perceived at a glance. Where finer-grained information is available, make it accessible as a drill-down from these panels. An example: for service latency you can use time series plots over the last few hours. For such charts it may be helpful to add horizontal lines indicating the ‘normal’ range of the value.

Monitoring Board

The board should also show the status of each required external service. This will immediately indicate whether your own service or some external dependency – such as a database – is the cause of the problem.

Building good monitoring boards takes time and effort. For each component you need to come up with a reliable status test. However, the work will pay off sooner or later. Running production systems without such monitors is like driving a car at night without the lights on.
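What such a status test looks like depends entirely on your stack. As a purely illustrative sketch in Python (Flask), here is a minimal ‘health’ endpoint that a monitoring board could poll – the component names, host and URL are made up, and the probes are deliberately simple:

```python
# Minimal sketch of a per-component health endpoint (Flask).
# Component names, hosts and URLs are made-up examples – replace them
# with real probes for your own components and dependencies.
import socket
import time

import requests
from flask import Flask, jsonify

app = Flask(__name__)

def check_database(host: str = "db.internal", port: int = 5432) -> bool:
    # Generic TCP reachability probe; a real check would rather run "SELECT 1".
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

def check_payment_api() -> bool:
    # Probe an external dependency via its own health endpoint (URL is hypothetical).
    try:
        return requests.get("https://payments.example.com/health", timeout=2).ok
    except requests.RequestException:
        return False

@app.route("/health")
def health():
    components = {
        "database": check_database(),
        "payment_api": check_payment_api(),
    }
    status = "green" if all(components.values()) else "red"
    return jsonify({"status": status, "components": components, "checked_at": time.time()})
```

A board panel can then simply turn red whenever this endpoint reports anything other than “green” – including the case where the endpoint itself does not answer.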

Grafana is a widely used open source tool for building such boards. There are also lots of other tools, including commercial systems that automatically take care of code instrumentation for health and latency measurement.

(3) Set up symptom based fix procedures

This is the most underrated approach to speeding up system recovery. It takes some time and effort to prepare, but it will most likely provide valuable learnings for the team and put you in a much better position when problems occur. How does it work?

As engineers we are used to reasoning about system behaviour from the viewpoint of individual components:

“If the database index is corrupt => service xyz will have high latency”

However, in an outage situation such information is not very helpful, especially for non-experts. The people on call will not see the database index problem – they will see the high service latency. And they want to know what to do about it. So let’s analyse the system and set up instructions that start from exactly such observable symptoms. This is how it may look:

“High latency of service xyz may be caused by an overloaded database”

symptom-cause-fix

Imagine your 24/7 support had a complete list of possible system problems (symptoms) – and for each of them a corresponding fix procedure. Troubleshooting would be a lot easier. Of course there may be more than one potential root cause, or additional checks may be required to find out which of the possible causes is the culprit. Here is an approach to do this analysis in a systematic way. For best outcomes do it with the entire team:

Phase 1: Problem brainstorming

  • Brainstorm possible system problems and failure symptoms
  • Ask yourself: what can go wrong and how would that become visible in system behaviour?
  • Try to make this list as exhaustive as possible

Phase 2: Assign root causes and verification checks

  • For each symptom list the possible root causes
  • If required or useful: add instructions to verify or exclude the suspected root cause – these are the verification checks

Phase 3: Write down fix procedures

  • For each root cause write down the required steps to bring the system back up to normal operation
  • If possible include verification instructions – how would you check that the procedure solved the problem?

Congratulations: you just created a troubleshooting guideline :)
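How you store this guideline is up to you – a wiki page works fine. If you prefer to keep it in a machine-readable form next to the code, one purely illustrative sketch in Python could look like this (all symptoms, verification checks and fix steps are made-up examples):

```python
# Purely illustrative structure for a symptom-based troubleshooting guideline.
# All symptoms, verification checks and fix steps are made-up examples –
# replace them with the results of your own team's brainstorming.
TROUBLESHOOTING_GUIDE = [
    {
        "symptom": "High latency of service xyz",
        "possible_causes": [
            {
                "cause": "Database overloaded",
                "verification": "Check DB CPU load and active connections on the monitoring board",
                "fix": [
                    "Kill long-running queries",
                    "Postpone the nightly reporting job until after business hours",
                ],
            },
            {
                "cause": "Corrupt database index",
                "verification": "Run the index consistency check (see runbook)",
                "fix": [
                    "Rebuild the index",
                    "Verify that latency returns to the normal range on the board",
                ],
            },
        ],
    },
]

def lookup(symptom_keyword: str) -> list:
    """Return all guideline entries whose symptom mentions the given keyword."""
    return [
        entry
        for entry in TROUBLESHOOTING_GUIDE
        if symptom_keyword.lower() in entry["symptom"].lower()
    ]

if __name__ == "__main__":
    for entry in lookup("latency"):
        print(entry["symptom"], "->", [c["cause"] for c in entry["possible_causes"]])
```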

Do this exercise with the team, and repeat it every few weeks or months to make it more complete over time – and to adapt it to modified system behaviour or new features. The troubleshooting guideline is also an essential part of the fourth best practice:

(4) Keep a runbook

Set up and maintain a runbook for each of your applications and services. The runbook contains the basic operational data for the service:

  • Name and short description
  • SLA (service level agreement or target)
  • Involved team members
  • Involved artefacts and libraries – and corresponding links to the repositories
  • Consumed external services (dependencies)
  • Build and deployment approach
  • Smoke tests (how would you quickly verify that the service is up and running? – see the sketch below)
  • Monitoring KPIs and strategies
  • Troubleshooting guideline (see above)
  • …and everything else that may be helpful

Keep the runbook up to date – and make sure it is easily accessible to whoever may need the information.
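The ‘smoke tests’ entry in particular pays off during an incident: a person on call should be able to confirm within seconds whether the service is up. As a purely illustrative sketch (the URL and expected response are made up, matching the health endpoint sketched above):

```python
# Minimal smoke test sketch – the URL and expected response are made-up examples.
import sys

import requests

SERVICE_URL = "https://xyz.example.com"  # hypothetical service endpoint

def smoke_test() -> bool:
    """Return True if the service answers and reports itself as healthy."""
    try:
        response = requests.get(f"{SERVICE_URL}/health", timeout=5)
        return response.ok and response.json().get("status") == "green"
    except requests.RequestException:
        return False

if __name__ == "__main__":
    ok = smoke_test()
    print("service is up and running" if ok else "service is NOT healthy")
    sys.exit(0 if ok else 1)
```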

And how about logging?

Logs are important, no doubt about that. However, you should not rely on logs alone to find out about your system’s health status. Set up monitoring boards for that purpose. And have your logs ready and easily accessible for verification checks – or for situations where approaches 1…3 did not help and you need to dive one level deeper.

Fast software release cycles – how to avoid accidents at high speed

Why are fast release cycles so important for software development – and what strategies can help to avoid accidents even though the team is producing at high speed?

Blurr - fast software releases
Photo by chuttersnap on Unsplash

Fast release cycles create customer value

The goal of every software development team should be to deliver new functionality to the users as soon as possible. Why? Finished software that sits on the shelf waiting for the next release is not usable. It is incomplete work, wasted effort and money. To add value you need to put that shiny new feature into the hands of the customer. Only then do the new features make a difference in the real world. This means your software is only complete after release and deployment. The entire process from development and testing to deployment needs to be optimized for speed.

Fast release cycles enable flexibility

Or think about a situation where your tests have discovered a security problem in your software. Now you need to be able to fix it quickly. Or you may need to adapt to a breaking change in some other consumed service that is not even in your own hands. Things happen, and in the cloud world you need to be flexible and able to adapt quickly. Once again – you need to be able to fix fast, but this only helps if you are also fast at testing and deployment. However, nobody wants to be reckless. Jump and see how it goes? You want to be sure that your fix works.

Fast release cycles - but no reckless jump into the unknown
Photo by Victor Rodriguez on Unsplash

Why incremental changes are your friend

The good news is that you won’t change the entire product from one day to the next. If planned accordingly, the team can break down the work into small steps. Ideally these can be tested individually to get immediate feedback. It works? Great. There’s a new problem? Ok, we should know pretty well where it comes from, since only a small number of changes occurred since the last good version. And the developers will have these changes fresh in their minds. Fixing the problem should be much easier compared to yesterday’s approach, where many changes came to test a long time after implementation – and all at the same time. So let’s assume the new small change is implemented and tested separately.

Incremental step wise changes help to shorten release cycles
Photo by Lindsay Henwood on Unsplash

The next and final step is to deploy this incremental change, and we’re done? Sounds too good to be true, and indeed… how can you be sure that the small change didn’t cause any side effects and break something within the existing overall system? Such a breakage is called a regression.

The new bottleneck: regression testing

So you need to test for regressions. And this basically means that you need an overall test of the entire system, which is often a huge effort. If you want to be on the safe side you will have to repeat this exercise over and over again for each small incremental change. Now if such an overall test took days or weeks, it would kill the nice-and-small incremental approach. It would just be too slow and too expensive.

Software test lab
Photo by Ani Kolleshi on Unsplash

The only way out of this dilemma is…

Test automation – the enabler for high speed releases

Imagine a setup where you could prove with the click of a button that your software is doing what it is supposed to do – that today’s changes did not introduce any regression. Test automation aims at achieving exactly this. Manually clicking through an application may still have its place within an overall testing concept, but in no way should this be your normal approach. Put the test procedures in code and execute them automatically. This is what enables quick feedback on code changes – and therefore fast release cycles. The automated approach has the added benefit of repeatability – and of test status reports that the test framework will create automatically if set up accordingly.
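What ‘test procedures in code’ look like depends on your stack. As one small, hypothetical example in Python using pytest – the function under test and its expected results are made up for illustration:

```python
# Hypothetical regression test with pytest – calculate_price and the expected
# values are made-up examples, not code from a real project.
import pytest

def calculate_price(quantity: int, unit_price: float, discount: float = 0.0) -> float:
    """Toy implementation standing in for real production code."""
    if quantity < 0:
        raise ValueError("quantity must not be negative")
    return round(quantity * unit_price * (1.0 - discount), 2)

@pytest.mark.parametrize(
    "quantity, unit_price, discount, expected",
    [
        (1, 10.0, 0.0, 10.0),
        (3, 9.99, 0.0, 29.97),
        (10, 5.0, 0.1, 45.0),
    ],
)
def test_calculate_price(quantity, unit_price, discount, expected):
    assert calculate_price(quantity, unit_price, discount) == expected

def test_negative_quantity_is_rejected():
    with pytest.raises(ValueError):
        calculate_price(-1, 10.0)
```

Once such tests exist for the critical behaviour of the system, running them on every change (ideally in a CI pipeline) turns the regression check from a multi-day exercise into a matter of minutes.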

Does this mean that testers are not required anymore? Not at all – rather the opposite is true. Test automation won’t save cost; this is about saving time and improving quality. Or in other words: about avoiding regressions although the team is going at high speed. This is where testers play a key role. However, with test automation the tester’s focus and know-how profile change completely. Yesterday testing meant manually executing test procedures over and over again. Today it means developing and coding tests. Testers are becoming developers – or developers take more and more responsibility for testing. Welcome once more to the DevOps world. (more here).

Fast software release – mission accomplished?

So let’s assume the team works with incremental changes. You have automation in place to quickly test the changes for functionality and regressions. We are good to go – now what needs to happen to put the new version into production, into the hands of the users? This will be covered in the next article about Deployment automation. Stay tuned.

What is DevOps? How development for the cloud changes a dev team’s life

What DevOps means is quickly explained: Development + Operations together. But what does DevOps really mean for development teams and their day-to-day work? And what is ‘operations’ to begin with…?

“Operations” explained

What is operations, and does all software need to be operated? To explain this, let’s take your local Word and Excel, or whatever local software you have installed, as an example. It just sits on your notebook. Once in a while you’ll probably update it to a newer version, but that’s it. It is your personal software on your own machine – no real operations involved.

Compare that to your email. Here again you may use some local client, or just the browser. No operations. But then there is your email provider and all that is required to manage your mail account and transmit your mails. This is done by services that run in some datacenter, and you can count on a team of experts who look after that software. They make sure that the system runs smoothly, apply the latest security patches, protect it against hacking attacks and securely back up your data, to name just a few of the tasks. This is what an operations team does. And you need to rely on it, because you are using services that are not under your own control. In other words: whatever software runs in a cloud or datacenter will need operations.

Software development vs. operation

Traditionally there has been a very clear separation between teams that develop software and the ones that operate it:

DevOps separation of development and operation

Photo by Raj Eiamworakul on Unsplash (stuff in red by Tom)

The developers would write their code and maybe even test it 🙂 but as soon as possible throw it over the wall to the operations team. Then it would be up to them to figure out how to install and run it. Okay, maybe there are companies with good collaboration between these groups, but still there may be some conflict of interest. Developers want their new versions out on the productive system as soon and as frequently as possible, to bring new features and bug fixes to their users. The Ops team will try to slow things down a bit and play it safe, since every update is considered a risk and may require some system downtime.

Operation for cloud software

Now fast forward to the cloud world with an agile team in Scrum mode. Software sitting idle waiting to be deployed is considered ‘waste’. The infrastructure does not consist of physical servers owned by administrators anymore. Now infrastructure is code, and the dev team’s architectural design decisions have a huge impact on the required cloud building blocks and the corresponding cost. The ongoing operations effort is also to a large part determined by the architecture. System changes may require modifications of the infrastructure code, too: adapted configuration of the consumed cloud provider services, extensions to the system monitoring, etc. The strict separation between development and operations does not make sense anymore. As they say: you can do it, but it’s not good.

DevOps to the rescue

Instead let’s put everybody in one team. Ensure that operational concerns are considered during development and when designing the system architecture. The team should consider the overall lifecycle cost and minimize effort accordingly. This is what DevOps is supposed to mean. No wall any more, not even two distinct groups, hopefully.

development team
Photo by rawpixel on Unsplash
