Developing scalable solutions for data cleaning and API generation (BigData)

Presented at: PlugFest Conference @ Singapore: http://www.plugfest.asia/workshops/20130112

Gautam Anand

on 13 May 2013

Transcript of Developing scalable solutions for data cleaning and API generation (BigData)

Scalable Data Cleaning & API Generation for BigData
Problem? Solution!
Gautam Anand

Let's try to understand it by example.

BigData: large volume, high velocity, high variety, veracity!
Multiple data sources: sensor + social + open data, etc.
Rate of flow is too high.
Analysts and business decision makers are not confident, as the data is not visualized properly!
Structured + semi-structured + unstructured.

Catch: one needs to use multiple data sources to quantify data better! But all the data sources have different formats?!

Why does this matter?
Data scientists need to convert all the formats into one single MIME type before they can make sense of the data.
"This is not real time, and hence BigData analysis is not working."

Data source 1: Data.gov / raw data
Data source 2: Social data via Twitter
And so on... you can take any number of sources.

What is the problem statement?

What we have:
A huge volume of already collected data.
New real-time data is generated by sensors, the internet, etc.
Cloud computing to power HPC.
Good analytic software for finding nuggets by mining.
Post-analysis visualization.

What is missing:
Already collected data is in a format that requires multiple MIME conversions.
Real-time data may be in web standards like XML/JSON.
This data is not in the "post-query format" that would make analysis easier.
This data is not cleaned.

Catch!
Computation is wasted on MIME conversions.
Time and effort are wasted serializing to an "analytic software format".
The data is less valuable when cleaning is done after MIME conversions, as some data values will have been changed, or listed as garbage and removed.

If somehow we could have a solution that could:
Help major data sources convert their data into web-standard formats such as XML/JSON, etc.
Port this over an API (public or private).
Provide pre-defined visualization channels for representing this data better.
Automate data cleaning operations and make them fast, with suggestions in a GUI.

Why a solution?
The same format makes the data universally available and understandable.
Intelligent automated "operations" can be applied, saving time and effort.
Powerful applications around the Web can be implemented!
More users! More knowledgeable insights!

Why is it a solution?
Data becomes developer friendly.
More new, innovative applications come.
More users, and hence more data collected. A powerful data ecology booster.

Why is this a solution?
Rules of real-time manipulation have to be set by experts.
If they have a toolbox for pre-defined visualization, it will make more sense and human error decreases.
They can convince their business partners.

Why is this a solution?
This stage is an important step of the knowledge discovery process.
Analytical and high-end quantifying algorithms work on cleaned data.
This will save time in processing it on a BigData set.

How data can be cleaned: Data Wrangler

How does the framework work?
Fetch data from various "static" sources. Real-time data is already in a web-standard format if one uses a cloud solution for terminal sensor fetching.
Scalable data cleaning and processing.
Generate API calls.
Generate security on API type.

Any open data site is good to go!
Web application with the required interactive GUI, using open Google APIs like BigQuery, etc.
Convert scalable spreadsheets, scripts, and structured text into JSON/XML using open-source Python libraries.
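
The cleaning and conversion steps above can be sketched roughly as follows. This is a minimal illustration and not the framework from the talk: the sample CSV, the column names, and the helpers clean_rows, to_json and to_xml are all hypothetical, and only the Python standard library is used, whereas a real deployment might pick other open-source libraries.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a CSV export from an open-data site.
RAW_CSV = """station, reading ,unit
A1, 23.5 ,C
A2,,C
A1, 23.5 ,C
B7, 19.0 ,C
"""


def clean_rows(text):
    """Trim whitespace, drop rows with empty fields, drop exact duplicates."""
    reader = csv.DictReader(io.StringIO(text))
    seen = set()
    cleaned = []
    for row in reader:
        # Strip stray whitespace from header names and values.
        # Values stay as strings; type conversion is left out of this sketch.
        record = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore surplus columns without a header
        }
        if not all(record.values()):
            continue  # incomplete record, treated here as garbage
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue  # duplicate of a record already kept
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


def to_json(records):
    """Serialize the cleaned records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def to_xml(records, root_tag="records", item_tag="record"):
    """Serialize the cleaned records as XML.

    Assumes the column names are valid XML element names.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


records = clean_rows(RAW_CSV)
print(to_json(records))
print(to_xml(records))
```

Cleaning runs before serialization here, so no computation is spent converting rows that would be discarded anyway.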

This makes your structured calls compatible with SOAP/RESTful services.
Data fetching limitations and conversion policies.
Let's talk more!

