## Topics with syllabus and resources

100.00 Introduction to Big Data

- Introduction to Big Data, Big Data characteristics, types of Big Data, Traditional vs. Big Data business approach, Case Study of Big Data Solutions.

200.00 Introduction to Hadoop

- What is Hadoop?
- Core Hadoop Components
- Hadoop Ecosystem
- Physical Architecture
- Hadoop limitations

300.00 NoSQL

- What is NoSQL? NoSQL business drivers; NoSQL case studies.
- NoSQL data architecture patterns: Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns.
- Using NoSQL to manage big data:- What is a big data NoSQL solution? Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models: master-slave versus peer-to-peer; Four ways that NoSQL systems handle big data problems.

400.00 MapReduce and the New Software Stack

401.00 Distributed File Systems

- Physical Organization of Compute Nodes, Large-Scale File-System Organization.

402.00 MapReduce

- The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures.
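As a study aid for the Map, Grouping by Key, and Reduce stages listed above, here is a minimal single-process sketch using word count. It only imitates the data flow and is not Hadoop code; the function names `map_task`, `group_by_key`, `reduce_task`, and `word_count` are illustrative assumptions.

```python
from collections import defaultdict

def map_task(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]

def group_by_key(pairs):
    """Grouping by key: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce: combine the list of values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    # Run every map task, then group, then run one reduce per key.
    pairs = [pair for doc in documents for pair in map_task(doc)]
    return dict(reduce_task(k, v) for k, v in group_by_key(pairs).items())

counts = word_count(["big data is big", "data streams"])
```

Because the reduce function here is a sum, which is associative and commutative, the same function could also serve as a combiner on each map task's output.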

403.00 Algorithms Using MapReduce

- Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce, Computing Natural Join by MapReduce, Grouping and Aggregation by MapReduce, Matrix Multiplication, Matrix Multiplication with One MapReduce Step.
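The matrix-vector multiplication topic above can be sketched in the same map, group, reduce shape. This sketch assumes the whole vector fits in memory and the matrix is given as `(i, j, value)` triples; the function name `matrix_vector_mapreduce` is hypothetical.

```python
from collections import defaultdict

def matrix_vector_mapreduce(entries, vector):
    """Compute M * v where entries is an iterable of (i, j, m_ij) triples."""
    # Map: each matrix element emits the pair (i, m_ij * v_j).
    pairs = [(i, m_ij * vector[j]) for i, j, m_ij in entries]

    # Group by key: gather all partial products for the same row i.
    grouped = defaultdict(list)
    for row, product in pairs:
        grouped[row].append(product)

    # Reduce: sum the partial products to get component i of the result.
    return {row: sum(products) for row, products in grouped.items()}

# The matrix [[1, 2], [3, 4]] stored as sparse triples, times v = [5, 6].
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
result = matrix_vector_mapreduce(entries, [5, 6])
```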

500.00 Finding Similar Items

- Applications of Near-Neighbor Search, Jaccard Similarity of Sets, Similarity of Documents, Collaborative Filtering as a Similar-Sets Problem.
- Distance Measures:- Definition of a Distance Measure, Euclidean Distances, Jaccard Distance, Cosine Distance, Edit Distance, Hamming Distance.
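For revision, a few of the distance measures listed above can be written as short functions. This is a sketch under stated assumptions: cosine distance is returned as the angle between the vectors in degrees, which is one common convention, and edit distance is left out because its definition varies between texts.

```python
import math

def jaccard_distance(a, b):
    """1 minus the Jaccard similarity |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_distance(x, y):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def hamming_distance(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))
```

For example, `jaccard_distance({1, 2, 3}, {2, 3, 4})` shares 2 of 4 distinct elements, giving a distance of 0.5.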

600.00 Mining Data Streams

601.00 The Stream Data Model

- A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing.

602.00 Sampling Data in a Stream

- Obtaining a Representative Sample, The General Sampling Problem, Varying the Sample Size.

603.00 Filtering Streams

- The Bloom Filter, Analysis.
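A minimal Bloom filter sketch for the topic above. The class and method names are assumptions, and salted SHA-256 stands in for the family of independent hash functions; a real deployment would likely use a faster hash. The helper `false_positive_rate` uses the standard approximation (1 - e^(-km/n))^k for n bits, m inserted keys, and k hash functions.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive one bit position per hash function by salting with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

def false_positive_rate(num_bits, num_items, num_hashes):
    """Approximate probability that an absent key passes the filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

allowed = BloomFilter(num_bits=1024, num_hashes=3)
for address in ["a@example.com", "b@example.com"]:
    allowed.add(address)
```

An inserted key always passes `might_contain`, so the filter never produces false negatives; only false positives are possible.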

604.00 Counting Distinct Elements in a Stream

- The Count-Distinct Problem, The Flajolet-Martin Algorithm, Combining Estimates, Space Requirements.
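The core idea of the Flajolet-Martin algorithm can be sketched as follows: hash each element, track the largest number of trailing zero bits seen (the tail length R), and estimate the distinct count as 2^R. This single-hash version is only illustrative; the use of SHA-256 truncated to 32 bits and the names `trailing_zeros` and `fm_estimate` are assumptions. Combining estimates from many hash functions, as listed in the topics above, is not shown.

```python
import hashlib

def trailing_zeros(n, width=32):
    """Number of trailing zero bits in n (width if n is zero)."""
    if n == 0:
        return width
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def fm_estimate(stream, seed=0):
    """Estimate the number of distinct elements as 2 ** (max tail length)."""
    max_tail = 0
    for item in stream:
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        hashed = int.from_bytes(digest[:4], "big")  # keep 32 bits
        max_tail = max(max_tail, trailing_zeros(hashed))
    return 2 ** max_tail
```

Since the estimate depends only on the set of hash values seen, repeated elements in the stream do not change it, and the memory used is a single integer per hash function.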

605.00 Counting Ones in a Window

- The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani Algorithm, Query Answering in the DGIM Algorithm, Decaying Windows.

700.00 Link Analysis

- PageRank Definition, Structure of the Web, Dead Ends, Using PageRank in a Search Engine, Efficient Computation of PageRank:- PageRank Iteration Using MapReduce, Use of Combiners to Consolidate the Result Vector.
- Topic-Sensitive PageRank, Link Spam, Hubs and Authorities.
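The PageRank iteration can be sketched as repeated redistribution of rank along out-links, with a teleport (taxation) term controlled by `beta`. This sketch assumes the graph has no dead ends, that is, every node has at least one out-link; the function name, the default `beta=0.85`, and the four-node example graph are illustrative assumptions.

```python
def pagerank(graph, beta=0.85, iterations=100):
    """graph maps each node to the list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share (1 - beta) / n.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for node, successors in graph.items():
            # Split this node's taxed rank equally among its out-links.
            share = beta * rank[node] / len(successors)
            for successor in successors:
                new_rank[successor] += share
        rank = new_rank
    return rank

web = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
ranks = pagerank(web, beta=1.0)
```

With `beta=1.0` (no taxation) this strongly connected example converges to 3/9 for A and 2/9 for each of B, C, and D. In a MapReduce setting, the inner loop corresponds to the map output and the accumulation into `new_rank` to the reduce step.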

800.00 Frequent Itemsets

801.00 Handling Larger Datasets in Main Memory

- Algorithm of Park, Chen, and Yu, The Multistage Algorithm, The Multihash Algorithm.

802.00 The SON Algorithm and MapReduce

803.00 Counting Frequent Items in a Stream

- Sampling Methods for Streams, Frequent Itemsets in Decaying Windows.

900.00 Clustering

- CURE Algorithm, Stream-Computing, A Stream-Clustering Algorithm, Initializing & Merging Buckets, Answering Queries.

1000.00 Recommendation Systems

- A Model for Recommendation Systems, Content-Based Recommendations, Collaborative Filtering.

1100.00 Mining Social-Network Graphs

- Social Networks as Graphs, Clustering of Social-Network Graphs, Direct Discovery of Communities, SimRank, Counting triangles using MapReduce.