Bergen 2010

Exploitation of web-based data

Maciej Piotrowski

on 3 April 2012

Transcript of Bergen 2010

Research design, collection and exploitation of web-based data
Experiences from real estate and labor market research

Maciej Piotrowski

Institute of Economics
University of Information Technology in Rzeszów, Poland
UITM
http://www.wsiz.pl

Figure legend: 20 threads, 5 threads, 1 thread

Web-based data collection: sample approaches
http://www.msnbc.msn.com/id/38463013/ns/technology_and_science-security/
http://www.techeye.net/internet/apple-bbc-ibm-intel-and-more-downloaded-facebook-user-torrent
http://www.wolframalpha.com

But ....
limited sources of data linked
no flexibility to create own databases

But ....
no possibility to automatically browse through pages
no option to automate incremental storage of data in specified time intervals

Efficiency
Other factors:
our network connection
website network connection
number of variables
complexity of regular expressions for variables

DoS attack ?!?

The need
Our solution
Challenges
Business application
Research opportunities

Thank you for your attention! Questions?

Maciej Piotrowski

Institute of Economics
University of Information Technology in Rzeszów, Poland

e-mail: mpiotrowski@wsiz.rzeszow.pl
Skype: maciej_piotrowski
WWW: http://ig.wsiz.pl
Complex quantitative analysis of the real estate market in Poland:
price levels depending on type of property, location, parameters
analysis of dynamics (monthly period)
spatial analysis
Complex quantitative analysis of the labor market in Poland:
availability of jobs by region
analysis of structure and dynamics
sought-after skills and competences

Exploitation of real estate data:
Real estate agencies
Investors and developers
End customers
Media (especially local and regional)

Exploitation of labor market data:
job offices and agencies
education institutions

Future plans:
Complex spatial analysis

Development of a "price attractiveness mark" and its integration with real estate listings

efficiency of various algorithms, regular expressions, databases, infrastructures
optimization of large databases for future processing
optimization of end-user interfaces linking to databases
data mining techniques

Legal aspects:
lawfulness of extensive web-data collection and database building
comparative analysis of national and international regulations
Statistics, econometrics, forecasting:
spatial analysis of data (spatial dependency, autocorrelation, spatial interpolation) using GIS software
forecasting models
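
Spatial autocorrelation is often summarized with Moran's I. The slides do not say which statistic was used, so the following is only a minimal Python sketch under that assumption; the function name, the toy price values and the neighbour weights are all hypothetical.

```python
from typing import Sequence


def morans_i(values: Sequence[float],
             weights: Sequence[Sequence[float]]) -> float:
    """Moran's I for `values` given a square spatial weight matrix.

    weights[i][j] > 0 means locations i and j are treated as neighbours.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations from the mean.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    squared = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared)


# Hypothetical example: four districts in a row, each adjacent to the next.
LINE_WEIGHTS = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

if __name__ == "__main__":
    # Prices that rise steadily along the row give a positive value.
    print(morans_i([1.0, 2.0, 3.0, 4.0], LINE_WEIGHTS))
    # Prices that alternate between neighbours give a negative value.
    print(morans_i([1.0, -1.0, 1.0, -1.0], LINE_WEIGHTS))
```

Values near +1 suggest that similar prices cluster together, values near -1 suggest that neighbours differ, and values near 0 suggest no spatial pattern.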
page-turning (pagination) rules
error-proof mechanisms
flexibility and extensibility
optimization of data cleaning, sorting and other data operations mechanisms
data presentation and visualization interfaces for large databases (e.g. the real estate database contains ca. 1 million records for each month, each with 10-30 variables depending on the type of property)
optimization of data collection algorithms
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart)
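
Two of the challenges listed above, page-turning rules and error-proof mechanisms, can be illustrated with a short Python sketch. It is not the presented system: the function names, the page size and the retry policy are assumptions made for illustration.

```python
import time
from typing import Callable, Iterator


def page_numbers(total_records: int, per_page: int) -> Iterator[int]:
    """Yield 1-based page numbers needed to cover `total_records`."""
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    pages = (total_records + per_page - 1) // per_page  # ceiling division
    return iter(range(1, pages + 1))


def fetch_with_retry(fetch: Callable[[str], str], url: str,
                     attempts: int = 3, delay: float = 0.0) -> str:
    """Call `fetch(url)`, retrying on OSError up to `attempts` times.

    urllib.error.URLError and socket timeouts are OSError subclasses,
    so a urllib-based `fetch` would be covered by this handler.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay * attempt)  # wait longer after each failure
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # 45 listings shown 20 per page require three result pages.
    print(list(page_numbers(45, 20)))
```

The total number of records would come from the results page itself, and a non-zero `delay` also keeps the collector from hammering the target site.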

System architecture

Basic facts:
The data collection application, written in C# (.NET Framework 3.5), is fully scalable and can benefit from multiple CPUs and multicore CPUs
Data is saved in a PostgreSQL database through Npgsql
The web-based management interface is written in PHP, XHTML and AJAX; SOAP is used to communicate with the server
The web-based presentation interface is written in PHP, XHTML and AJAX, with Adobe Flex used additionally
4 thousand lines of code
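
The original collector is a C# application; the Python sketch below only illustrates the general idea of spreading page downloads over a configurable number of worker threads (the slides compare 1, 5 and 20 threads). The names `collect_pages` and `fake_fetch` and the example URLs are hypothetical, and the download function is injected so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def collect_pages(urls: Iterable[str],
                  fetch: Callable[[str], str],
                  workers: int = 5) -> Dict[str, str]:
    """Download every URL using a pool of `workers` threads."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so they align with url_list.
        bodies = list(pool.map(fetch, url_list))
    return dict(zip(url_list, bodies))


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request.
    return "<html>" + url + "</html>"


if __name__ == "__main__":
    pages = collect_pages(
        ["http://example.com/page/%d" % i for i in range(1, 11)],
        fake_fetch,
        workers=5,
    )
    print(len(pages))
```

Raising `workers` speeds up collection only until the network connection or the target website becomes the bottleneck, and too many parallel requests can look like a DoS attack to the site being read.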

Regular expressions:
getting number of records: @"(?<=<td class=.paginateopis. nowrap>wyniki[0-9 -]+spo.r.d )([0-9]+)"

getting links of records: @"(?<=<div class=.line0.><a href=./details,[0-9]+,)([0-9]+)"

getting street name: @"((?<=ulica:.)[\w\s-]+)"

getting house type: @"(?<=<div class=.opis.>Podstawowe informacje</div>[\w\s.,-]+)(blok|kamienica|dom wolnostoj.cy|apartamentowiec|wie.owiec|inny?)"
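
The patterns above are written for the .NET regex engine, which allows variable-length lookbehind. As a rough illustration of how they behave, here is a Python sketch; Python's `re` module accepts only fixed-width lookbehind, so the variable-length prefixes are rewritten as ordinary prefixes followed by a capture group. The sample HTML and the name `extract_listing` are invented for this example and do not come from the original system.

```python
import re
from typing import Dict, List, Optional, Union

# Variable-length lookbehind replaced by a consumed prefix plus a group.
LINK_ID = re.compile(r"<div class=.line0.><a href=./details,[0-9]+,([0-9]+)")
# Fixed-width lookbehind, so this pattern works unchanged in Python.
STREET = re.compile(r"(?<=ulica:.)[\w\s-]+")
# The lazy +? stops at the first house type after the section header.
HOUSE_TYPE = re.compile(
    r"<div class=.opis.>Podstawowe informacje</div>[\w\s.,-]+?"
    r"(blok|kamienica|dom wolnostoj.cy|apartamentowiec|wie.owiec|inny?)"
)

Listing = Dict[str, Union[List[str], Optional[str]]]


def extract_listing(html: str) -> Listing:
    """Pull record ids, street name and house type out of one page."""
    street = STREET.search(html)
    house = HOUSE_TYPE.search(html)
    return {
        "link_ids": LINK_ID.findall(html),
        "street": street.group(0).strip() if street else None,
        "house_type": house.group(1) if house else None,
    }


# Hypothetical page fragment shaped to fit the patterns.
SAMPLE_HTML = (
    '<div class="line0"><a href="/details,12,34567">Offer</a></div>\n'
    '<div class="opis">Podstawowe informacje</div>\n'
    'Mieszkanie, 3 pokoje, kamienica\n'
    'ulica: Lwowska 12<br>'
)

if __name__ == "__main__":
    print(extract_listing(SAMPLE_HTML))
```

Fields whose pattern does not match come back as `None` (or an empty list for the record ids), which makes missing variables easy to detect before the data is stored.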

getting dimensions of the land: @"((?<=Wymiary dzia[&#0-9ęóąśłżźćń;]+ki:[a-z\s,.-]+)|(?<=wymiary[ ]*)|(?<=kszta[&#0-9ęóąśłżźćń;]+t:[a-z&#0-9ęóąśłżźćń;.\s]+)|(?<=na[ ]*)|(?<=wa[ ]*)|(?<=wy[ ]*)|(?<=wymiarach ok.[\s]*))([0-9,.]+)([-mxszer.dl/\s]+|[na\s]+)([0-9,.]+)"

http://ig.wsiz.pl

Other promising areas of application:
Business entities database
Online auction websites (e.g. used cars price index, automatic valuation of cars based on features)
Social networks and other e-communities websites
Automatic sentiment analysis
