Week 3
TASK:
Word Count
Week 2
Week 6
Prime Number Count
TASK:
TASK:
Machine Learning - Classifiers
Week 9
Take away
- Schedule differences
- Mini talks complemented knowledge
- Helpful on retakes
- Slow but Sure
- Amazing course structure
K-mer count
Familiarize with snapshots
Minimizing collisions by k-mer subset selection
Week 3
We had more time because of the smaller sample size
- Getting a better grasp of Hadoop & Spark concepts
- Finally, Coding!
- Master Crashed? RE-Networking
Install Spark
- Ratio of training to test data: cross-validation
- Python Versioning – Our last headache
Parallelizing training set with RDD: Big performance improvement
Take I2DL and Machine Learning lectures
- Changing the replication factor to 1, losing a node and crying over a bucket of ice cream
Using ‘links’ to view the Spark UI
Support Vector Machine
Learn Ansible
Random Forest Classifier
Maddy found a clever way to make it work
Week 7
TASK:
Decision Tree Classifier
Encoding
Week 2
Hadoop Ecosystem
MapReduce Limitations
Week 2
Ansible
Minimizing collisions by varying encoding scheme
We explored multiple encodings; the best was the 7-bit group encoding, which yielded a 9.02% collision rate
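How an encoding's collision rate can be measured is sketched below. This is not the team's 7-bit group encoding, which the slides do not define. It is a plain 2-bit-per-base encoding folded into a fixed number of buckets, and the names `encode_2bit`, `collision_rate` and `num_buckets` are hypothetical. The collision rate is assumed here to mean the fraction of distinct k-mers that end up sharing a bucket with another k-mer.

```python
from itertools import product

# Assumption: DNA alphabet only, two bits per base.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_2bit(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODES[base]
    return value


def collision_rate(kmers, num_buckets: int) -> float:
    """Fraction of distinct k-mers that do not get a bucket of their own."""
    distinct = set(kmers)
    if not distinct:
        return 0.0
    # Fold each encoded k-mer into the table; fewer used buckets means more collisions.
    used_buckets = {encode_2bit(kmer) % num_buckets for kmer in distinct}
    return 1.0 - len(used_buckets) / len(distinct)


# All 256 possible 4-mers, measured against two table sizes.
all_4mers = ["".join(p) for p in product("ACGT", repeat=4)]
print(collision_rate(all_4mers, 256))
print(collision_rate(all_4mers, 64))
```

With as many buckets as distinct k-mers the 2-bit encoding is collision-free; shrinking the table forces collisions, which is the trade-off an encoding scheme has to manage.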
- Silly mistakes, re-computation
- Memory issues, cleaning the system
Time crunch
Week 5
TASK:
Week 1
Setup
We further explore k-mer counting
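For reference, a minimal single-machine sketch of k-mer counting, assuming the input is a plain nucleotide string. The function name `count_kmers` is hypothetical, and the cluster versions in Spark or Dask would distribute the same idea across nodes.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A window of width k slides one base at a time; short inputs yield no k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ACGTACGT", 3)
print(counts.most_common(2))
```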
Week 8
TASK:
We start to hash the k-mers
K-means
Multi-Node
Bisecting K-means
TASK:
We look into optimizing spark-submit
Week 5
Hadoop
This was the first time we experimented with a different configuration
Temporarily lost the master node. PANIC!!!
We documented the fix for that
Finally started working on Scala with sbt
Temporary loss of teammate
Hamming Distance
MCA
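As a reminder of what the Hamming distance named above measures, here is a small sketch, assuming two equal-length sequences such as encoded feature strings. The function name `hamming_distance` is hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    # Each mismatching position contributes 1 to the sum.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("ACGT", "ACCA"))
```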
Week 8
Environment
- Data sparsity is a big problem
- Do NOT be afraid to try new things
- Go for the cleaner approach to quantify the data
Lots of math!!
Best K: using the highest squared Euclidean distance
K-means: 5
Bisecting K-means: 15
Week 1
Hafiz Sameeullah
Hadoop
Multi-Node
Installation
Informatics - Master of Science (M.Sc.)
Expectations
- Under-the-hood data mining
- Digestion of distributed-systems-buzz
- It’s gonna be boring
Learning Outcome
VS
mapred-site.xml
core-site.xml
- Linux World
- Looking for the right tools
- Different ways to look at the same problem
- Python, Scala, Dask
yarn-site.xml
hdfs-site.xml
Hadoop Configuration
TEAM 3
Mohammad Shaharyar Shaukat
Jonathan Narvaez
Informatics - Master of Science (M.Sc.)
Data Engineering and Analytics
General concepts
- Distributed systems IN2259
General concepts
- Networking and Virtualization
- Foundations of Data Engineering
- Distributed Systems
- Basic Linux administration skills
Expectations
- Apply Theoretical Concepts
- Map Reduce
- Hands-on: Hadoop ecosystem
Mining
Group 3
Madhavendra S. Negi
Aka Maddy
Informatics - Master of Science (M.Sc.)
Expectations
- Learn new tools and languages
- Learning to work in a distributed environment
Week 4
TASK:
Working with Dask
Distributed Data Mining Lab
Cluster scale-out vs. scale-up dilemma
K-mer Count
Week 4
- Runtime comparison for the same problem using different tools
- Learning curve for Dask and Scala
- Reconfiguration of Cluster