Data Ingestion with Python

Data Ingestion with Python is a free online course from LinkedIn Learning with 1-2 hours of material. It is taught in English by Miki Tebeka, and you can receive an e-certificate from LinkedIn Learning upon completion.

Overview
  • Learn how to use Python tools and techniques to solve one of the main challenges data scientists face: getting good data to train their algorithms.

Syllabus
  • Introduction
    • Why is data ingestion important?
    • What you should know
    • Using the exercise files
    1. Data Ingestion Overview
    • Overview of data scientists' work
    • Where does data come from?
    • Different types of data
    • The data pipeline (ETL)
    • Final destination (data lake)
    2. Reading Files
    • Working in CSV
    • Working in XML
    • Working in Parquet, Avro, and ORC
    • Unstructured text
    • JSON
    • Challenge: CSV to JSON
    • Solution: CSV to JSON
    3. Calling APIs
    • Working with JSON
    • Making HTTP calls
    • Processing event-based data
    • Challenge: Location from IP
    • Solution: Location from IP
    4. Web Scraping
    • Try to find an API
    • Working with Beautiful Soup
    • Working with Scrapy
    • Working with Selenium
    • Other considerations
    • Challenge: GitHub API
    • Solution: GitHub API
    5. Schema
    • What are schemas?
    • Working with ontologies
    • What should be in schema
    • Schema changes
    • Schema validations
    6. Working with Databases
    • Types of databases
    • Hosted and cost of ops
    • Working with relational databases
    • Working with key-value databases
    • Working with document databases
    • Working with graph databases
    • Challenge: ETL
    • Solution: ETL
    7. Troubleshooting Data
    • Data is never 100% okay
    • Causes of errors
    • Filling missing values
    • Finding outliers (manual)
    • Finding outliers (ML)
    • Challenge: Clean rides according to ride duration
    • Solution: Clean rides according to ride duration
    8. Data KPIs and Process
    • Design your data
    • KPIs
    • What to monitor?
    Conclusion
    • Next steps