Thesis

Thesis presentation
by

Arantxa Duque

on 18 September 2013

Transcript of Thesis

BIG DATA
ANALYTICS
BIG DATA
30 billion pieces of content shared every month
approximately 12TB of data per day
25 billion devices connected to the Internet by 2015 and 50 billion by 2020.
A METHODOLOGY
FOR IDENTIFYING
SIMILAR SUPPORT
REQUESTS
USING HADOOP AND
BIG DATA
The world’s information is doubling every two years and is predicted to reach 35ZB by 2020.
BIG DATA
Velocity
Volume
Variety
DATA ANALYTICS
Extracting information from data to gain meaningful insight and to make more informed decisions.
DATA ANALYTICS
credit card fraud
biology
market basket analysis
target customers
Technical Support Centre
Traditionally, call centres discarded data related to customer enquiries within a relatively short period of time.
VALUABLE INFORMATION CAN BE EXTRACTED FROM TECHNICAL SUPPORT DATA
first call resolution
customer satisfaction
IN THIS THESIS
daily closure rates
TECHNOLOGIES
HADOOP
HBASE
HIVE
MAHOUT
clustering
collaborative filtering
classification
canopy clustering
fuzzy k-means
text clustering
text vectorization
term frequency
inverse document frequency
term frequency-inverse document frequency
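The term-weighting scheme named above can be illustrated with a minimal sketch. It uses the textbook formula, weight = tf × log(N / df), which is an assumption: the exact weighting variant used in the thesis (or by Mahout) is not given in this transcript and may differ. The `tokenize` and `tf_idf` helpers and the sample texts are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return cleaned.split()

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical support-request snippets, for illustration only.
docs = [
    "desktop sources are busy",
    "slow performance when connecting to desktop",
    "password reset request",
]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", sorted(vec, key=vec.get, reverse=True)[:2])
```

In this sketch a term that occurs in every document receives a weight of zero, while a term unique to one document is weighted highest, which is the property that makes TF-IDF vectors useful as input to clustering.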
latent Dirichlet allocation
how do we put the pieces together?
k-means
Dirichlet clustering
Service Request information is stored in Salesforce.com
Information stored in Salesforce.com is exported as CSV files
CSV files are loaded into HDFS daily
Mahout MapReduce clustering algorithms are run to analyse the data
Clustering results are stored in HBase to ensure real-time access
Users query the clustering results using a web interface
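The flow above (CSV export, load, cluster, store, query) can be sketched in a single Python process. This is only an illustration under stated assumptions and not the thesis implementation, which runs Mahout MapReduce jobs over HDFS and stores results in HBase: here an in-memory string stands in for the CSV export, a plain k-means over term counts stands in for the Mahout job, and a dict stands in for the HBase table. The column names `sr_id` and `description`, the sample rows, and all function names are hypothetical.

```python
import csv
import io
import math
from collections import Counter

# Hypothetical CSV export; column names and rows are assumptions.
CSV_EXPORT = """sr_id,description
SR-1,desktop sources busy cannot connect to virtual desktop
SR-2,virtual desktop slow performance when connecting
SR-3,license key expired cannot activate license
SR-4,license activation fails with expired key
"""

def load_requests(csv_text):
    """Parse the export into {sr_id: term-count vector}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["sr_id"]: Counter(row["description"].lower().split()) for row in reader}

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Average a list of sparse term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return Counter({term: count / len(vectors) for term, count in total.items()})

def kmeans(requests, seed_ids, iterations=10):
    """Plain k-means using cosine similarity; the caller picks the seed requests."""
    centroids = [requests[sid] for sid in seed_ids]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            sr_id: max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
            for sr_id, vec in requests.items()
        }
        if new_assignment == assignment:
            break  # converged: no request changed cluster
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [requests[s] for s, c in assignment.items() if c == k]
            if members:
                centroids[k] = centroid(members)
    return assignment

def build_index(assignment):
    """Group request ids by cluster; this dict stands in for the HBase table."""
    index = {}
    for sr_id, cluster in assignment.items():
        index.setdefault(cluster, []).append(sr_id)
    return index

def similar_requests(sr_id, assignment, index):
    """Look up the other requests in the same cluster (the 'web interface' query)."""
    return [other for other in index[assignment[sr_id]] if other != sr_id]

requests = load_requests(CSV_EXPORT)
assignment = kmeans(requests, seed_ids=["SR-1", "SR-3"])
index = build_index(assignment)
print(similar_requests("SR-2", assignment, index))  # prints ['SR-1']
```

Precomputing the cluster index and answering queries by lookup mirrors the reason given above for storing results in HBase: the expensive clustering runs in batch, and the user-facing query stays fast.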
SRImport
k-means
k-means with
canopy clustering
k-means vs k-means with
canopy clustering
fuzzy k-means
k-means vs fuzzy k-means
Dirichlet clustering
latent Dirichlet allocation
clustering evaluation
LoadHBase
demo
TF-IDF limitations
Conclusions
Who has the 1st question?
sample SRs
We have 20 vm running in the virtual desktop environment and 11 of those machines coming up with this error "all available desktop sources for this desktop are currently busy, please try to contact system administrator"
Users experiencing a slow performance when connecting to VM although other users connected to physical server works properly
Arantxa Duque Barrachina
Supervised by Aisling O'Driscoll