Developing scalable solutions for data cleaning and API generation (BigData)

Presented at: PlugFest Conference, Singapore (http://www.plugfest.asia/workshops/20130112)

Gautam Anand

on 13 May 2013

Transcript of "Developing scalable solutions for data cleaning and API generation (BigData)"

Scalable Data Cleaning & API Generation for BigData
Problem? Solution!
Gautam Anand

Let's try to understand it by example.

BigData: large volume, high velocity, high variety, veracity!
Multiple data sources: sensor + social + open data, etc.
The rate of flow is too high.
Analysts and business decision makers are not confident, as the data is not visualized properly!
Structured + semi-structured + unstructured

Catch: one needs to use multiple data sources to quantify data better, but every data source has a different format!

Why does this matter?
Data scientists need to convert all the formats into one single MIME type before they can make sense of the data.
"This is not real time, and hence BigData analysis is not working."

Data source 1: Data.gov / raw data
Data source 2: Social data via Twitter
... and so on; you can take any number of sources.

What is the problem statement?

What we have:
A huge volume of already collected data
New real-time data generated by sensors, the internet, etc.
Cloud computing to power HPC
Good analytic software for finding nuggets by mining
Post-analysis visualization

What is missing:
Already collected data is in a format that requires multiple MIME conversions.
Real-time data may be in web standards such as XML/JSON.
This data is not in the "post-query format" that would make analysis easier.
This data is not cleaned.

Catch!
Computation is wasted on MIME conversions.
Time and effort are wasted serializing to an "analytic software format".
The data is less valuable because cleaning is done after the MIME conversions, when some data values will have been changed or listed as garbage and removed.

If somehow we could have a solution that could:
Help major data sources convert their data into a web-standard format such as XML or JSON
Port this over an API (public or private)
Provide pre-defined visualization channels for representing this data better
Automate data cleaning operations and make them fast, with suggestions in a GUI

Why is this a solution?
The same format makes the data universally available and understandable.
Intelligent automated "operations" can be applied, saving time and effort.
Powerful applications around the web can be implemented!
More users! More knowledgeable insights!

Why is this a solution?
Data becomes developer friendly.
More new, innovative applications appear.
More users arrive, and hence more data is collected. A powerful data ecology booster.

Why is this a solution?
Rules for real-time manipulation have to be set by experts.
If they have a toolbox for pre-defined visualization, the data will make more sense and human error decreases.
They can convince their business partners.

Why is this a solution?
This stage is an important step of the knowledge discovery process.
Analytical and high-end quantifying algorithms work on cleaned data.
This will save time when processing a BigData set.

How data can be cleaned: Data Wrangler

How does the framework work?
Fetch data from various "static" sources. Real-time data is already in a web-standard format if one uses a cloud solution for terminal sensor fetching.
Scalable data cleaning and processing
Generate API calls
Generate security on the API type

Any open data site is good to go!

Web application with the required interactive GUI, using open Google APIs such as BigQuery.

Convert scalable spreadsheets, scripts, and structured text into JSON/XML using open-source Python libraries.
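
As a rough illustration of the cleaning and conversion steps above, the sketch below cleans a small CSV extract and serializes it to JSON and XML using only the Python standard library. The sample columns (`station`, `reading`), the cleaning rules, and the function names are assumptions made for this example; they are not taken from the talk or from Data Wrangler.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical raw extract: stray whitespace, a missing value, a duplicate row.
RAW_CSV = """station,reading
 S1 ,10.5
S2,
S1,10.5
S3,7
"""


def clean_rows(text):
    """Trim whitespace, drop rows with empty fields, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        row = {key.strip(): (value or "").strip() for key, value in row.items()}
        if not all(row.values()):
            continue  # incomplete record
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # duplicate record
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


def to_json(rows):
    """Serialize the cleaned rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_xml(rows, root_tag="records", row_tag="record"):
    """Serialize the cleaned rows as XML, one child element per column."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for key, value in row.items():
            ET.SubElement(record, key).text = value
    return ET.tostring(root, encoding="unicode")


rows = clean_rows(RAW_CSV)
print(to_json(rows))
print(to_xml(rows))
```

Cleaning before serializing means the JSON and XML outputs are built from the same already-cleaned rows, which is the ordering the talk argues for.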

This makes your structured calls compatible with SOAP/RESTful services.

Data fetching limitations and conversion policies.

Let's talk more!
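
One possible shape for the "generate API calls" and "generate security" steps of the framework is sketched below as a tiny WSGI application that serves cleaned records as JSON behind an API key. The `/records` path, the `X-API-Key` header, the key value, and the sample records are all hypothetical choices for this example, not details given in the talk.

```python
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical cleaned records and API key, for illustration only.
RECORDS = [
    {"station": "S1", "reading": "10.5"},
    {"station": "S3", "reading": "7"},
]
API_KEY = "demo-key"


def app(environ, start_response):
    """Tiny WSGI app: GET /records returns the records as JSON."""
    def respond(status, payload):
        body = json.dumps(payload).encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(body)))])
        return [body]

    # Reject callers that do not present the expected key.
    if environ.get("HTTP_X_API_KEY") != API_KEY:
        return respond("401 Unauthorized", {"error": "invalid API key"})
    if environ.get("REQUEST_METHOD") == "GET" and environ.get("PATH_INFO") == "/records":
        return respond("200 OK", {"count": len(RECORDS), "records": RECORDS})
    return respond("404 Not Found", {"error": "unknown resource"})


def call(path, api_key=None):
    """Invoke the app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a default GET request
    environ["PATH_INFO"] = path
    if api_key is not None:
        environ["HTTP_X_API_KEY"] = api_key
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


print(call("/records", api_key="demo-key"))
print(call("/records"))
```

To expose the same `app` over HTTP for local experiments, it could be passed to `wsgiref.simple_server.make_server`; a production deployment would need a real WSGI server and proper key management.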
