Distributed programming on the cloud

Distributed programming on the cloud is a comprehensive online course provided by Microsoft Learn, with roughly 6-7 hours of material. The course is taught in English and is free of charge.

Overview
    • Module 1: Carnegie Mellon University's cloud developer course. Learn about distributed programming and why it's useful for the cloud, including programming models, types of parallelism, and symmetrical vs. asymmetrical architecture. A short message-passing sketch follows this module's objectives.
    • In this module, you will:

      • Classify programs as sequential, concurrent, parallel, and distributed
      • Indicate why programmers usually parallelize sequential programs
      • Explain why cloud programs are important for solving complex computing problems
      • Define distributed systems, and indicate the relationship between distributed systems and clouds
      • Define distributed programming models
      • Indicate why synchronization is needed in shared-memory systems
      • Describe how tasks can communicate by using the message-passing programming model
      • Outline the difference between synchronous and asynchronous programs
      • Explain the bulk synchronous parallel (BSP) model
      • Outline the difference between data parallelism and graph parallelism
      • Distinguish between these distributed programs: single program, multiple data (SPMD); and multiple program, multiple data (MPMD)
      • Discuss the two main techniques that distributed programs can incorporate to address the communication bottleneck in the cloud
      • Define heterogeneous and homogeneous clouds, and identify the main reasons for heterogeneity in the cloud
      • State when and why synchronization is required in the cloud
      • Identify the main technique that can be used to tolerate faults in clouds
      • Outline the difference between task scheduling and job scheduling

      In partnership with Dr. Majd Sakr and Carnegie Mellon University.
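
      The message-passing model covered in this module can be illustrated with a minimal Python sketch (not part of the course material; the worker function and the data are invented for illustration): a coordinator process sends a task to a worker process over a pipe and receives the partial result back as a message.

        # Minimal message-passing sketch: two processes communicate only
        # through explicit messages, never through shared memory.
        from multiprocessing import Process, Pipe

        def worker(conn):
            numbers = conn.recv()          # receive the task input as a message
            conn.send(sum(numbers))        # send the partial result back
            conn.close()

        if __name__ == "__main__":
            parent_conn, child_conn = Pipe()
            p = Process(target=worker, args=(child_conn,))
            p.start()
            parent_conn.send([1, 2, 3, 4, 5])
            print("partial sum:", parent_conn.recv())
            p.join()

      Splitting the input across several such workers is the same idea a distributed program applies across machines, with the network taking the place of the pipe.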

    • Module 2: Carnegie Mellon University's cloud developer course. MapReduce was a breakthrough in big data processing that has become mainstream and been improved upon significantly. Learn about how MapReduce works; a brief word-count sketch follows the objectives below.
    • In this module, you will:

      • Identify the underlying distributed programming model of MapReduce
      • Explain how MapReduce can exploit data parallelism
      • Identify the input and output of map and reduce tasks
      • Define task elasticity, and indicate its importance for effective job scheduling
      • Explain the map and reduce task-scheduling strategies in Hadoop MapReduce
      • List the elements of the YARN architecture, and identify the role of each element
      • Summarize the lifecycle of a MapReduce job in YARN
      • Compare and contrast the architectures and the resource allocators of YARN and the previous Hadoop MapReduce
      • Indicate how job and task scheduling differ in YARN as opposed to the previous Hadoop MapReduce

      In partnership with Dr. Majd Sakr and Carnegie Mellon University.
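
      To make the map and reduce stages concrete, here is a single-machine word-count sketch in Python (illustrative only, not the course's code; map_task and reduce_task are hypothetical names). Map tasks emit (word, 1) pairs, the shuffle groups the pairs by key, and reduce tasks sum the counts per word.

        # Word count in the MapReduce style, run on one machine for illustration.
        from collections import defaultdict

        def map_task(line):
            # Map: emit a (word, 1) pair for every word in the line.
            return [(word, 1) for word in line.split()]

        def reduce_task(word, counts):
            # Reduce: sum all counts emitted for the same word.
            return word, sum(counts)

        lines = ["the quick brown fox", "the lazy dog", "the fox"]

        grouped = defaultdict(list)            # shuffle: group values by key
        for line in lines:
            for word, count in map_task(line):
                grouped[word].append(count)

        result = dict(reduce_task(w, c) for w, c in grouped.items())
        print(result)                          # e.g. {'the': 3, 'fox': 2, 'quick': 1, ...}

      Because each map call touches only its own line and each reduce call only its own key, both phases can run in parallel across many machines, which is the data parallelism the module describes.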

    • Module 3: Carnegie Mellon University's cloud developer course. GraphLab is a big data tool developed by Carnegie Mellon University to help with data mining. Learn about how GraphLab works and why it's useful; a small graph-parallel sketch follows the objectives below.
    • In this module, you will:

      • Describe the unique features in GraphLab and the application types that it targets
      • Recall the features of a graph-parallel distributed programming framework
      • Recall the three main parts in the GraphLab engine
      • Describe the steps that are involved in the GraphLab execution engine
      • Discuss the architectural model of GraphLab
      • Recall the scheduling strategy of GraphLab
      • Describe the programming model of GraphLab
      • List and explain the consistency levels in GraphLab
      • Describe the in-memory data placement strategy in GraphLab and its performance implications for certain types of graphs
      • Discuss the computational model of GraphLab
      • Discuss the fault-tolerance mechanisms in GraphLab
      • Identify the steps that are involved in the execution of a GraphLab program
      • Compare and contrast MapReduce, Spark, and GraphLab in terms of their programming, computation, parallelism, architectural, and scheduling models
      • Identify a suitable analytics engine given an application's characteristics

      In partnership with Dr. Majd Sakr and Carnegie Mellon University.
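
      The graph-parallel style can be sketched with a PageRank-like vertex update in plain Python (this is not the GraphLab API; the graph and constants are invented for illustration): each vertex gathers values from its in-neighbors and applies an update to its own state, and the engine repeats this until the values settle.

        # Graph-parallel sketch: a PageRank-style vertex update on a tiny graph.
        edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # out-edges per vertex
        rank = {v: 1.0 for v in edges}

        def vertex_update(v):
            # Gather contributions from in-neighbors, then apply the new rank.
            incoming = [u for u, outs in edges.items() if v in outs]
            return 0.15 + 0.85 * sum(rank[u] / len(edges[u]) for u in incoming)

        for _ in range(20):                                 # iterate toward convergence
            rank = {v: vertex_update(v) for v in edges}

        print({v: round(r, 3) for v, r in rank.items()})

      A framework such as GraphLab distributes the vertices across machines and schedules these updates automatically, while enforcing the consistency level the programmer chooses.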

    • Module 4: Carnegie Mellon University's cloud developer course. Spark is an open-source cluster-computing framework with different strengths than MapReduce has. Learn about how Spark works; a short RDD sketch follows the objectives below.
    • In this module, you will:

      • Recall the features of an iterative programming framework
      • Describe the architecture and job flow in Spark
      • Recall the role of resilient distributed datasets (RDDs) in Spark
      • Describe the properties of RDDs in Spark
      • Compare and contrast RDDs with distributed shared-memory systems
      • Describe the fault-tolerance mechanisms in Spark
      • Describe the role of lineage in RDDs for fault tolerance and recovery
      • Understand the different types of dependencies between RDDs
      • Understand the basic operations on Spark RDDs
      • Step through a simple iterative Spark program
      • Recall the various Spark libraries and their functions

      In partnership with Dr. Majd Sakr and Carnegie Mellon University.
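
      The sketch below shows a small iterative Spark program in PySpark (it assumes a local Spark installation; the thresholds and data are arbitrary). Each transformation extends the lineage of the RDD, caching keeps the reused data in memory, and the count() action in each iteration triggers actual computation.

        # Iterative PySpark sketch: transformations build lineage, actions execute it.
        from pyspark import SparkContext

        sc = SparkContext("local", "iterative-sketch")
        rdd = sc.parallelize(list(range(1, 11))).cache()   # cached: reused every iteration

        for threshold in (2, 4, 6):
            # filter is a transformation; it only records a new step in the lineage
            rdd = rdd.filter(lambda x, t=threshold: x > t)
            print(threshold, rdd.count())                  # count is an action: runs the job

        sc.stop()

      If a partition is lost, Spark can recompute it from this recorded lineage, which is the role of lineage in fault tolerance and recovery that the module covers.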

    • Module 5: Carnegie Mellon University's cloud developer course. The growth of available data has given rise to continuous streams of real-time data that must be processed. Learn about different systems and techniques for consuming and processing real-time data streams; a minimal producer/consumer sketch follows the objectives below.
    • In this module, you will:

      • Define a message queue and recall a basic architecture
      • Recall the characteristics, and present the advantages and disadvantages, of a message queue
      • Explain the basic architecture of Apache Kafka
      • Discuss the roles of topics and partitions, as well as how scalability and fault tolerance are achieved
      • Discuss general requirements of stream processing systems
      • Recall the evolution of stream processing
      • Explain the basic components of Apache Samza
      • Discuss how Apache Samza achieves stateful stream processing
      • Discuss the differences between the Lambda and Kappa architectures
      • Discuss the motivation for the adoption of message queues and stream processing in the LinkedIn use case

      In partnership with Dr. Majd Sakr and Carnegie Mellon University.
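
      As a minimal in-process analogue of the message-queue pattern (not Kafka or Samza themselves; the event names are made up), the Python sketch below decouples a producer from a consumer with a queue, so the two sides can run at different rates without losing messages.

        # In-process producer/consumer sketch of the message-queue pattern.
        # A real deployment would put a broker such as Apache Kafka between the two.
        import queue
        import threading

        q = queue.Queue()

        def producer():
            for i in range(5):
                q.put(f"event-{i}")     # publish a message to the queue
            q.put(None)                 # sentinel: no more messages

        def consumer():
            while True:
                msg = q.get()           # block until a message is available
                if msg is None:
                    break
                print("processed", msg)

        threading.Thread(target=producer).start()
        consumer()

      Kafka adds durable, partitioned, and replicated topics to this picture, which is how it provides the scalability and fault tolerance discussed in the module.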

Syllabus
    • Module 1: What is distributed programming?
      • Introduction
      • Categories of computer programs
      • Why use distributed programming?
      • Distributed programming on the cloud
      • Programming models for clouds
      • Synchronous vs. asynchronous computation
      • Types of parallelism
      • Symmetrical vs. asymmetrical architecture
      • Cloud challenges: Scalability
      • Cloud challenges: Communication
      • Cloud challenges: Heterogeneity
      • Cloud challenges: Synchronization
      • Cloud challenges: Fault tolerance
      • Cloud challenges: Scheduling
      • Summary
    • Module 2: Distributed computing on the cloud: MapReduce
      • Introduction
      • Programming model
      • Data structure
      • Example MapReduce programs
      • Computation and architectural models
      • Job and task scheduling
      • Fault tolerance
      • YARN
      • Summary
    • Module 3: Distributed computing on the cloud: GraphLab
      • Introduction
      • Data structure and graph flow
      • Architectural model
      • Programming model
      • Computational model
      • Fault tolerance
      • An example application in GraphLab
      • Comparison of distributed analytics engines
      • Summary
    • Module 4: Distributed computing on the cloud: Spark
      • Introduction
      • Spark overview
      • Resilient distributed datasets
      • Lineage, fault tolerance, and recovery
      • Programming in Spark
      • The Spark ecosystem
      • Summary
    • Module 5: Message queues and stream processing
      • Introduction
      • Message queues
      • Message queues: Case study
      • Stream processing systems
      • Streaming architectures: Case study
      • Big data processing architectures
      • Real-time architectures in practice
      • Summary