Improve your reliability with modern operations practices

Free Online Course: Improve your reliability with modern operations practices, provided by Microsoft Learn, is an online course with 2-3 hours of material. The course is taught in English and is free of charge.

Overview
    • Module 1: Discover a map for navigating reliability challenges and sustainably achieving the appropriate level of reliability in your systems, services, and products.
    • By the end of this module, you will be able to:

      • Express why reliability is crucial to your success
      • Describe modern operations practices that offer tools you can use to work on your reliability challenges
      • Explain the Dickerson hierarchy of reliability and the map it provides for approaching reliability challenges
    • Module 2: Learn how to use monitoring to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
    • In this module you will:

      • Learn how to increase your operational awareness as a precursor to reliability work
      • Expand your understanding of reliability itself
      • Change the way you frame your thinking about monitoring to make it more impactful
      • Gain a basic understanding of the applicable monitoring platform and tools available on Azure
      • Learn a practice from site reliability engineering that can immediately start to create an impact on reliability
      • Learn to craft actionable alerts to make your operational practices sustainable
    • Module 3: Learn the incident response fundamentals necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
    • In this module you will:

      • Learn the importance of effective incident response
      • Gain an understanding of the lifecycle of an incident so you know where to apply your efforts
      • Learn the building blocks for constructing an incident response process that allows you to respond with urgency
      • Begin to track your incidents effectively using Azure DevOps tools
      • Explore ways to automate your incident tracking for a speedy and consistent response
      • Understand the guidelines around communication that allow incident response to be more efficient
      • Visit some Azure tools that can significantly speed up your remediation times during an incident
    • Module 4: Learn about post-incident reviews, a practice necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
    • In this module you will:

      • Discover the importance of learning from incidents
      • Understand the aspects of complex systems that make learning from failure important
      • Learn when and how to conduct a post-incident review
      • Understand the purpose and goals of a post-incident review
      • Learn the components that go into a good post-incident review
      • Explore the Azure tools that can assist with getting started with post-incident reviews
      • Become aware of common traps to avoid
      • Identify helpful practices to conduct a better review
    • Module 5: Learn about deployment practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
    • In this module you will:

      • Learn what software deployment is and the different kinds of deployments you might employ
      • Discover the significant benefits of switching from an "epic deployment" model to a "continuous deployment" model
      • Explore the components of continuous deployment
      • Look deep into pipelines and how they are implemented in Azure Pipelines
      • Learn a number of different strategies for deployment to production that can help us avoid incidents
      • Examine some important best practices that can minimize the risk when rolling out new software or a new version of existing software
    • Module 6: Learn about capacity planning and scaling practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
    • In this module you will:

      • Learn about scalability and the scalability/reliability relationship
      • Understand the role of capacity planning in preparing for growth
      • Learn basic concepts and fundamental terms related to scaling
      • Eliminate single points of failure
      • Understand the different kinds of growth and how to respond to them
      • Be able to measure capacity in the cloud
      • Catch issues with service limits and quotas before they emerge using Azure tools
      • Understand important steps to take before beginning work on scaling
      • List techniques for making an application more scalable, including decoupling, queues, in-memory caching, and database sharding
      • Learn about the Azure tools that make it possible to take your application or service global

Syllabus
    • Module 1: Improve your reliability with modern operations practices: An introduction
      • Introduction
      • Why reliability matters
      • Modern operations
      • The Dickerson hierarchy of reliability
      • Summary
    • Module 2: Improve your reliability with modern operations practices: Monitoring
      • Introduction
      • Operational awareness
      • Expanding our understanding of reliability
      • Changing the frame
      • Azure monitoring tools
      • Log analytics and KQL queries
      • Service level indicators (SLIs) and service level objectives (SLOs)
      • Actionable alerts
      • Summary
    • Module 3: Improve your reliability with modern operations practices: Incident response
      • Introduction
      • Importance of incident response
      • Characteristics and lifecycle of an incident
      • Foundations of incident response
      • Incident tracking
      • Communication and collaboration
      • Remediation
      • Summary
    • Module 4: Improve your reliability with modern operations practices: Learning from failure
      • Introduction
      • Why learn from incidents?
      • What is a post-incident review?
      • Characteristics and components of a good post-incident review
      • The post-incident review process
      • Common traps to avoid
      • Helpful practices for learning from failure
      • Summary
    • Module 5: Improve your reliability with modern operations practices: Deployment
      • Introduction
      • What is software deployment?
      • The continuous delivery deployment model
      • Test automation and the delivery pipeline
      • Deployment strategies
      • Summary
    • Module 6: Improve your reliability with modern operations practices: Capacity planning and scaling
      • Introduction
      • What is scalability?
      • Prepare for growth
      • Capacity planning considerations
      • Make applications scalable
      • Go global
      • Summary