Applied Big Data Analytics

Apply Data Science techniques, from Exploratory Data Analysis to Model Development and Deployment, on Big Data

What you will learn

This big data certification course is for software engineers, data analysts, business analysts, technical program managers, architects, database administrators, and researchers with an interest in data science and big data engineering. The format of the course is 50% lectures and 50% labs with exercises. This course is a practical introduction to the interdisciplinary field of data science and machine learning, which is at the intersection of computer science, statistics, and business.

A significant portion of the course takes a hands-on approach to the fundamental modeling techniques and machine learning algorithms that enable you to build robust predictive models. You will learn to use the Python programming language, along with AWS and Azure Machine Learning tools and technologies, to apply machine learning techniques to practical real-world problems. Two in-class projects, a Kaggle capstone and an IoT streaming case study, will crystallize the concepts learned in the course.

Key skills & technologies

Acquire, clean, and parse large sets of data using Python
Choose the appropriate modeling technique to apply to your data
Apply probability and statistics concepts to create and validate predictions about your data
Programmatically create predictive data models using machine learning techniques
Communicate your results to an appropriate audience

Quick info

Duration: Mon-Fri (50 hours)
Schedule: 9:00am-7:00pm
Level: Basic to Intermediate
Pricing: $3,500

Course plan

Each module introduces one or two core machine learning techniques, along with big data framework concepts and tools, while working through a practical implementation.

DATA EXPLORATION AND VISUALIZATION: The first and most important task of the data scientist is to understand their data. The bulk of our first day is dedicated to the theory and practice of understanding data. Through a series of interactive, hands-on exercises, we teach you how to dissect and explore data, engineer your features, and clean your data to prepare it for modeling. You will learn not just the mechanics of data exploration, but also the proper mindset, one that will help you tease out the patterns hidden in your data.
INTRODUCTION TO PREDICTIVE ANALYTICS AND CLASSIFICATION: Our first foray into predictive analytics is guided by a deep dive into the mechanics and theory behind decision tree models. The basis of some of the most successful predictive models, decision trees provide a useful vehicle for hands-on exercises in training and testing classification models in Python.
EVALUATION OF PREDICTIVE MODELS: One of the subtlest and trickiest areas of modern data science is in model evaluation. The risk of “overfitting” and producing a model that generalizes very poorly constantly hangs over the practitioner’s head. We teach you about the metrics and methods you can use to protect yourself from this danger, giving you direct, practical experience in how to tune your models for greatest effectiveness. We’ll familiarize you with the evaluation and model tuning capabilities of Python.
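
As a rough illustration of the kind of metric this module refers to, the sketch below computes accuracy, precision, and recall on held-out labels using only the standard library. The function name `classification_metrics` and the sample labels are made up for this example and are not taken from the course material.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against empty denominators instead of dividing by zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical held-out labels and model predictions (illustrative only).
held_out_truth = [1, 0, 1, 1, 0, 0, 1, 0]
held_out_preds = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(held_out_truth, held_out_preds)
print(metrics)
```

Computing the same metrics on the training data and on held-out data, then comparing the two, is one simple way to spot a model that has memorized its training set.
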
ENSEMBLE METHODS: Two models working in concert are better than one. That’s the fundamental principle underlying one of the most important modern advances in machine learning, ensembles. We take you through the underlying theory, explaining why ensembles outperform single models, as well as pointing out the most common pitfalls and dangers. We cover both bagging and boosting strategies for constructing ensembles, using random forests and AdaBoost as concrete examples. You’ll build and tune multiple kinds of ensemble methods for yourself, in Python.
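
One way to see why an ensemble can beat a single model is to compute how often a majority vote is right. The sketch below assumes every model is correct with the same probability and that their errors are independent, which real models rarely satisfy exactly; the function name `majority_vote_accuracy` is invented for this example.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of independent models is right."""
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    needed = n_models // 2 + 1
    # Sum the binomial probabilities of k correct votes, for every k
    # large enough to form a majority.
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


single = majority_vote_accuracy(1, 0.6)     # one model on its own
trio = majority_vote_accuracy(3, 0.6)       # three models voting
weak_trio = majority_vote_accuracy(3, 0.4)  # three worse-than-chance models
print(single, trio, weak_trio)
```

Under these assumptions, voting helps when each model is better than chance and hurts when each is worse, which is one of the pitfalls to keep in mind when building ensembles.
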
DEPLOYING MACHINE LEARNING MODELS: The best model in the world is useless if you can’t get new data to it. Azure and Amazon Machine Learning both provide direct and simple processes for setting up real-time prediction endpoints in the cloud, allowing you to access your trained model from anywhere in the world. We walk you through constructing your own endpoints, and show a few practical demos of how this can be used to expose a predictive model to anyone you’d like to use it.
PARAMETER TUNING: Modern machine learning algorithms are designed to work well “out of the box”, but the default settings are rarely the truly optimal ones. We teach you to understand the effects of each algorithm’s configuration parameters, and to use this knowledge to tune your models for optimal performance.
INTRODUCTION TO REGRESSION: Regression and classification are the two sides of the supervised learning coin. You will learn how to adapt the techniques you have learned to the challenge of predicting prices, revenues, click rates, and more. We give you an overview of how regression models learn, teach you how to evaluate them, and demonstrate the use of regularization to prevent overfitting. We end with hands-on exercises in Python.
TEXT ANALYTICS: Many applications of data science require analysis of unstructured data. We will teach you the basics of converting text into structured data, showing you how to avoid some of the most common pitfalls.
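
A minimal sketch of converting text into structured data is a bag-of-words count matrix. The helper names `tokenize` and `bag_of_words` and the sample sentences are assumptions made for illustration; the course may use different tools.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one row of term counts per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = ["The bike was fast.", "Fast bike, fast ride!"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Lowercasing and stripping punctuation before counting avoids one common pitfall: treating "Fast" and "fast," as different terms.
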
FUNDAMENTALS OF BIG DATA ENGINEERING: The first challenge of big data isn’t one of analysis, but rather of volume and velocity. How do you process terabytes of data in a reliable, relatively rapid way? We teach you the basics of MapReduce and HDFS, the technologies which underlie Hadoop, the most popular distributed computing platform. We also introduce you to Spark, the next wave of distributed analysis platforms.
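
The MapReduce idea can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop code; the function names and input records are invented for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "big clusters", "data data"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines and the records live in a distributed file system; the sketch keeps only the data flow.
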
A/B TESTING & ONLINE EXPERIMENTATION: Online experimentation is perhaps the most misused of data science techniques. We will walk through the best practices for designing and evaluating A/B and multi-variate tests. We discuss how to choose the appropriate metrics, how to detect and avoid errors, and how to properly interpret test results.
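
As one hedged example of evaluating an A/B test, the sketch below runs a pooled two-proportion z-test on conversion counts. The function name `two_proportion_z_test` and the visitor numbers are hypothetical, and the test assumes samples large enough for the normal approximation to hold.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical experiment: 1,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value alone does not settle the question: checking the result repeatedly while the test is still running, or testing many metrics at once, inflates the false-positive rate.
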
KAGGLE CAPSTONE: You’ve been learning the knowledge and skills of data science for 3 days. Now it’s time to put those new skills to the test with a real problem. Kaggle’s Bike Sharing Demand prediction competition is the perfect testing ground to cut your teeth on.
EVENT INGESTION AND STREAM PROCESSING: How do we build a scalable data ingestion process? Modern companies need platforms which can redirect gigabytes of data per second, while handling interruptions gracefully and preserving the integrity of the data.
IoT CASE STUDY: You’re now prepared to embark on building your own end-to-end ETL pipeline in the cloud. You will stream data from your smartphone to an event ingestor, process that data, and write it out to cloud storage. You will then be able to read the data into Azure ML for analysis and processing.

Is it for me?

Do I need any programming experience to attend?

Yes. You will need previous experience with the Python language.

What will I be doing?

In one week you will learn the foundations of data science, including supervised and unsupervised learning techniques applied to unstructured text and time-series data, both locally and on big data clusters.

What age is this for?

This workshop is suitable for university students and working professionals.

Do I need a computer?

It is best if you bring your own laptop with 16 GB of RAM and 100-150 GB of free space on your hard drive.

What’s the experience like?

The course is 50% lectures that explain the concepts and 50% labs with exercises, including two projects.

Register

Call us at (214)-997-6100 if you have special circumstances or are looking for dedicated corporate training.