Welcome to the Distributed Computing and Big Data course! The massive increase in the availability of data has made its storage, management, and analysis extremely challenging. Various tools, technologies, and frameworks have surfaced to address this challenge. Apache Hadoop is one such framework: it makes distributed computing easier and thereby enables us to handle big data. Hadoop abstracts away concerns such as reliability, distributed file management, and distributed processing. In this course, we start by understanding the characteristics of big data and the fundamental concepts of cloud computing. We then explore the Hadoop ecosystem, specifically HDFS, MapReduce, Pig, and NoSQL databases. Our objective is to handle big data effectively and to build web applications and RESTful services over the cloud. This is an introductory course focused on the breadth of the big data landscape.
Key Learning Objectives
At the end of this course, you should be able to:
- Understand how distributed file systems work, and be able to use Hadoop HDFS.
- Understand the fundamentals of distributed processing, and write programs using the MapReduce framework and Pig scripts.
- Understand NoSQL database concepts using MongoDB and HBase.
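The MapReduce programming model mentioned in the objectives above can be illustrated without a cluster. The following is a minimal, illustrative pure-Python sketch (not the Hadoop API; production Hadoop jobs are typically written in Java or run via Hadoop Streaming) showing the map, shuffle/sort, and reduce phases of the classic word-count job:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for the same word.
    return (word, sum(counts))

def run_job(lines):
    # Apply the mapper to every input line.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort: group pairs by key, as the framework would between phases.
    pairs.sort(key=itemgetter(0))
    # One reducer call per distinct key.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(run_job(["big data is big", "data is everywhere"]))
# → {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

In real Hadoop, the mapper and reducer run on different machines and the framework handles the shuffle, fault tolerance, and data locality; this sketch only conveys the programming model you will use in the assignments.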
Lecture Schedule
Evaluation
| Instrument | Weight |
| --- | --- |
| Mid Exam | 20% |
| Final Exam | 35% |
| Assignments (3 × 10%) | 30% |
| Student Presentation and Demos | 15% |
Student Presentation and Demos
The presentation topics will be released soon. You may work individually or in a group of up to four students. Students may be called individually for a viva. Further instructions about the submission format and expectations will be discussed in class.
Pre-requisites
None.
Resources
Text
There is no prescribed text for this course. Readings will be shared during the lectures.
References
Optional Readings