Flight Delay Analysis

Michael Fouché

on 15 June 2015

Transcript of Flight Delay Analysis

Neo4j Loves Cypher
delay analysis
We will focus on delays related to flights.
This can include:

Flight routes with the most delays
Flight routes with the least delays
Airports with high delays to stay clear of
Airports with the least delays
The average delay for each type of delay, for example weather, security, carrier, late aircraft and NAS (National Aviation System) delays

We love travelling
We were looking for a project that could provide knowledge from the research and a good opportunity to investigate new databases.
We both enjoy exploring and travelling, and thought it would be great to get information about flight delays.
Furthermore, we wanted to create a project that can be expanded for our Project 4, which we plan to do in 2016.

Reasons for the project
Relational vs NoSQL databases
Padhy (2011) explains that a relational database is a data structure that allows you to link information from different tables by referring to a table's index, or key.
Other tables can refer to that key in order to create a link between their data members and the member pointed to by the index.

A non-relational database, also known as a NoSQL database, stores data without explicit and structured mechanisms to link data from different tables to one another.

Delayed Flights
Problem Space
People tend to like applications that give them information relevant to themselves, which they can use to improve their decisions. As Wang, Park and Fesenmaier (2011) explain, with the recent boom in social media accessibility people have found pleasure in reading about the experiences of travelers and about ways to improve their own journeys (Cole 2015).

People find it extremely frustrating when they have to wait for something, especially when they do not know the reason for the delay.

We saw the opportunity to provide people with information to improve their journeys by suggesting ways to avoid delayed flights. Furthermore, we could show the most common reasons why flights get delayed.

Flight Delay Analysis
Commercial Flight Delay Analysis with Java and a graph database (Neo4j)
Neo4j goes hand in hand with graph databases.
It has:
Connectivity support for NetBeans
Excellent documentation
Great community support
Support for ACID properties

It was therefore the most obvious choice for our graph database.
Why Neo4j
Big data
We found a data set about commercial flights from the US Department of Transportation.
We investigated Relational vs Non-Relational databases
We investigated NoSQL database types
We investigated Graph Databases
NoSQL Databases
NoSQL databases can be categorized into the following areas:
Key-Value Store
Document Database
Column Family Store
Graph Database
NoSQL Database types
Based on our research we decided to use a graph database, with Neo4j as the graph database platform.
Disney World
From our research we can conclude:
Orlando is consistently in the top 5 for least delays, whether ranked by average or by lowest delay. We would therefore recommend that, if you are looking for a trip with the least delays while having fun, you go to Disney World in Orlando via Orlando International Airport.
If you were thinking of visiting Buffalo, New York for its architectural art, you should consider whether it is worth the likelihood of waiting for delayed flights, as this destination was in our top 5 for delays by highest value, by average value, and by route.
Mindmap of the dataset
The data set contained over 100 columns and therefore needed to be narrowed down to fit our scope.

We kept data related to:
Flight details
Origin airport
Destination airport
Delay time
Cause for delays

The Data set
Relational vs NoSQL database
Using NoSQL databases allows us to develop a solution without having to convert in-memory structures to relational structures, which could cause a frustrating impedance mismatch (Sadalage 2014).

NoSQL databases give us many database options to choose from and allow us to fine-tune our system to meet the specific requirements.

According to Nance et al. (2013), NoSQL databases:
Are highly scalable
Have simple data models
Are able to handle unstructured data.

Some claim that NoSQL databases enable better performance by not having the constraints of relational databases (Nance et al. 2013).
Key-value stores
Key-value stores are the simplest NoSQL data stores to use from an API perspective.
A key-value store is implemented as a hash table in which each unique key points to a particular item of data.
The value is a blob that the database just stores, without caring or knowing about the content of the value.
This kind of database has great performance because it uses primary-key access.
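
As a minimal sketch of the hash table idea, the Java class below keeps opaque byte arrays under unique string keys. The class name, method names and the example key are hypothetical and only illustrate the concept; real key-value stores add persistence, distribution and much more.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory key-value store: unique key -> opaque value (blob).
final class KeyValueStoreSketch {
    private final Map<String, byte[]> table = new HashMap<>();

    // Store the blob under its key; the store never inspects the bytes.
    void put(String key, byte[] value) {
        table.put(key, value);
    }

    // Primary-key access: a single hash lookup, or null if the key is absent.
    byte[] get(String key) {
        return table.get(key);
    }

    public static void main(String[] args) {
        KeyValueStoreSketch store = new KeyValueStoreSketch();
        // The key and value below are made-up example data.
        store.put("flight:AA100", "delay=12".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.get("flight:AA100"), StandardCharsets.UTF_8));
    }
}
```

Because the store only ever looks at the key, any question about the content of a value has to be answered by the application after fetching it.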

Document databases
A document database stores documents, which can be XML, JSON, BSON, etc.
These documents are self-describing, hierarchical tree data structures which can consist of maps, collections, and scalar values (Sadalage 2014).
Documents are stored in the value part of a key-value store; document databases can be seen as key-value stores in which the value is examinable.

MongoDB is an example of a document database.
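
To illustrate what "the value is examinable" means, here is a small Java sketch in which each document is a map of fields and the store can search inside the documents. It is an in-memory illustration with hypothetical names, not the API of MongoDB or any other product.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document store: id -> document, where a document is a map of fields.
final class DocumentStoreSketch {
    private final Map<String, Map<String, Object>> documents = new HashMap<>();

    void save(String id, Map<String, Object> document) {
        documents.put(id, document);
    }

    // Unlike a plain key-value store, the store can look inside the value.
    List<String> findIdsByField(String field, Object expected) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : documents.entrySet()) {
            if (expected.equals(entry.getValue().get(field))) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        DocumentStoreSketch store = new DocumentStoreSketch();
        Map<String, Object> flight = new HashMap<>();
        flight.put("origin", "BUF"); // made-up example fields
        flight.put("delayMinutes", 30);
        store.save("flight-1", flight);
        System.out.println(store.findIdsByField("origin", "BUF"));
    }
}
```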

Column family stores
A column-family database stores data as rows that have many columns associated with a row key (Sadalage 2014).
Column families are groups of related data that are often accessed together.
Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns.
The difference is that the rows do not need to have the same columns, and columns can be added dynamically to any individual row.
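
The row-key and dynamic-column idea can be sketched in Java as a map of maps. The names and example columns below are hypothetical, and the sketch ignores everything a real column-family store does for storage and distribution.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical column family: row key -> (column name -> value).
final class ColumnFamilySketch {
    private final Map<String, Map<String, String>> rows = new TreeMap<>();

    // A column can be added to any individual row at any time.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, key -> new TreeMap<>()).put(column, value);
    }

    // The set of columns can differ from row to row.
    Set<String> columnsOf(String rowKey) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? Collections.<String>emptySet() : row.keySet();
    }

    public static void main(String[] args) {
        ColumnFamilySketch flights = new ColumnFamilySketch();
        flights.put("AA100", "origin", "JFK");
        flights.put("AA100", "weatherDelay", "15"); // only this row has a weather delay
        flights.put("DL200", "origin", "MCO");
        System.out.println(flights.columnsOf("AA100"));
        System.out.println(flights.columnsOf("DL200"));
    }
}
```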

Graph database
Graph databases allow us to store entities, also known as nodes, and the relationships between these entities, also known as edges.
Nodes are organized by relationships, which allow us to find patterns between the nodes.
The organization of the graph stores the data once and then interprets it in different ways according to the relationships (Fowler, Sadalage 2012).
A query on a graph is also known as traversing the graph.
An advantage of graph databases is that we can change the traversing requirements without having to change the nodes or edges (Fowler, Sadalage 2012).
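
A rough Java sketch of nodes, typed relationships and a traversal follows. The relationship type names ORIGIN and DESTINATION are assumptions made for this example and are not taken from our data model; the point is that the same stored relationships can be traversed in different ways without changing the nodes or edges.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory graph: each relationship is stored once as {type, endNode}.
final class GraphSketch {
    private final Map<String, List<String[]>> outgoing = new HashMap<>();

    void relate(String start, String type, String end) {
        outgoing.computeIfAbsent(start, key -> new ArrayList<>()).add(new String[] {type, end});
    }

    // Traversal: follow only the relationships of the requested type.
    List<String> neighbours(String start, String type) {
        List<String> result = new ArrayList<>();
        for (String[] rel : outgoing.getOrDefault(start, Collections.<String[]>emptyList())) {
            if (rel[0].equals(type)) {
                result.add(rel[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        GraphSketch graph = new GraphSketch();
        graph.relate("AA100", "ORIGIN", "JFK");
        graph.relate("AA100", "DESTINATION", "MCO");
        // Changing the traversal requirement only changes the argument, not the data.
        System.out.println(graph.neighbours("AA100", "ORIGIN"));
        System.out.println(graph.neighbours("AA100", "DESTINATION"));
    }
}
```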

Graph Databases
Using a graph database for our solution was preferable for many reasons.
Traversing relationships in graph databases is very fast because the relationships are persisted and not calculated at query time (Fowler, Sadalage 2012). This is a large advantage for us because we work with a very large dataset.
Most of the value of graph databases is derived from relationships, which is exactly what we needed in order to relate our various entities.
Nodes and edges can be represented visually in graph databases.
Less storage space is required for the database because null values are not stored.
The agility of graph databases allows us to develop without friction, and the testable nature of the API and query language lets us develop an application in a controlled manner.
Graph database development aligns well with today's agile and test-driven software development practices (Robinson, Webber, Eifrem 2013).
Java program: Exploring the possibilities
Neo4j Spring data
Based on our experience with NetBeans we decided to implement the solution in NetBeans; fortunately there are provisions to allow this.
In our research we found 2 ways to do this:
A Maven Java project with Spring and REST.
A Java project without Spring or REST that connects directly to the Neo4j server.

We attempted the Maven project approach first, as it offered much more built-in functionality.
We could only manage to get an embedded Neo4j database working.
This was great, but it did not connect to the Neo4j server itself and therefore did not allow visual output for queries.

Java Program: Information flow from .CSV file -> Java -> Neo4j
The Java program reads the CSV file.
The Java program processes it and writes the data into the relevant Java class objects.
From the Java class objects, the Java program writes the data into the Neo4j server.
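
The first two steps of this flow could look roughly like the Java sketch below, which turns one CSV line into a flight object. The column order (carrier, origin, destination, arrival delay) and all names are assumptions made for the example, not the layout of the actual data set, and the simple split assumes that no field contains a quoted comma.

```java
// Hypothetical sketch: parse one CSV line into a Java object.
final class FlightCsvSketch {

    static final class Flight {
        final String carrier;
        final String origin;
        final String destination;
        final Double arrivalDelay; // null when the CSV field is empty

        Flight(String carrier, String origin, String destination, Double arrivalDelay) {
            this.carrier = carrier;
            this.origin = origin;
            this.destination = destination;
            this.arrivalDelay = arrivalDelay;
        }
    }

    // Assumed column order: carrier, origin, destination, arrival delay in minutes.
    static Flight parseLine(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        return new Flight(fields[0].trim(), fields[1].trim(), fields[2].trim(),
                parseOrNull(fields[3]));
    }

    static Double parseOrNull(String field) {
        String trimmed = field.trim();
        return trimmed.isEmpty() ? null : Double.valueOf(trimmed);
    }

    public static void main(String[] args) {
        Flight flight = parseLine("AA,JFK,MCO,-5.0"); // made-up example row
        System.out.println(flight.origin + " -> " + flight.destination + ": " + flight.arrivalDelay);
    }
}
```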

Java Program: Side effects
The entire process of getting the nodes for each flight and then setting the relationships took a long time to run.
We improved the code significantly, and finally came to a point where we could not optimize it further.
We suspect that using Maven with Spring and REST would be a solution to the time constraint.
This will be investigated in our Project 4, which we plan to do in 2016.
Java program: Refining the solution
Graphical representation
We reverted to a normal Java project using the connection parameters.
This enabled us to:
Write nodes, relationships and properties to the Neo4j server
Get instances of existing nodes in order to create relationships to them
View a graphical representation of the database.
Java Program: Write to Neo4j
To write to the Neo4j server:
First, all the airlines and airports that were discovered had to be created in Neo4j as nodes with their properties.
Then, for each flight, a node was created with its properties.
During this iteration the program had to get the nodes already created for the airline, originating airport and destination airport from the Neo4j database.
The relationships were then created between the retrieved existing nodes and the newly created flight node.
Furthermore, if there was a cause for a delay, we created a node with properties set to the causes.
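
One way to picture these steps is as the Cypher statements they might translate to. The Java sketch below only builds the statement text and does not talk to a server; the labels (Airline, Airport, Flight), property names and relationship types are assumptions for illustration and not necessarily the ones used in our program. String concatenation is used to keep the example short; code that sends statements to a server should pass values as query parameters instead.

```java
// Hypothetical sketch: build Cypher text for the write steps described above.
final class CypherWriteSketch {

    // Step 1: make sure a node exists for each airport (airlines work the same way).
    static String mergeAirport(String code) {
        return "MERGE (:Airport {code: '" + code + "'})";
    }

    // Steps 2 to 4: look up the existing nodes, create the flight node,
    // then relate the new flight to the nodes that were found.
    static String createFlight(String number, String airline, String origin, String destination) {
        return "MATCH (al:Airline {code: '" + airline + "'}), "
                + "(o:Airport {code: '" + origin + "'}), "
                + "(d:Airport {code: '" + destination + "'}) "
                + "CREATE (f:Flight {number: '" + number + "'}) "
                + "CREATE (al)-[:OPERATES]->(f) "
                + "CREATE (f)-[:DEPARTS_FROM]->(o) "
                + "CREATE (f)-[:ARRIVES_AT]->(d)";
    }

    public static void main(String[] args) {
        System.out.println(mergeAirport("JFK"));
        System.out.println(createFlight("AA100", "AA", "JFK", "MCO"));
    }
}
```

The MATCH at the start of the second statement corresponds to fetching the existing airline and airport nodes for every flight, which is the lookup the program repeats on each iteration.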
Java Program: Write to Neo4j

One of the advantages of graph databases, as we mentioned already, is that they only contain the information that is present, instead of having a column for each possible data value as relational databases do.

Therefore we had to determine for each value whether it is present (not null), and only then create the node or property for that value.
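
The presence check can be sketched in Java as a filter that keeps only the non-null values before anything is written. The method name and the example property names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: keep only the properties that actually have a value.
final class SparsePropertiesSketch {

    static Map<String, Object> presentOnly(Map<String, Object> raw) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : raw.entrySet()) {
            if (entry.getValue() != null) { // absent values are skipped, not stored as null
                properties.put(entry.getKey(), entry.getValue());
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("weatherDelay", 15.0); // made-up example values
        raw.put("securityDelay", null);
        System.out.println(presentOnly(raw).keySet());
    }
}
```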
Flights added
Results
In the following slides we present the results from Neo4j in graphical form. Each chart shows the delay in minutes.
Average delay by arriving airport
Average delay by departing airport
Highest delay by arriving airport
Highest delay by departing airport
Average delay based on route
Average delay based on cause (delay cause against delay in minutes)
All flights with New York as origin
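
For readers who want to see what an "average delay by arriving airport" figure involves, the Java sketch below performs the same kind of grouping and averaging in memory. It is an illustration only: our results came from queries against Neo4j, and the class names and sample values here are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: average arrival delay per airport, computed in memory.
final class AverageDelaySketch {

    static final class Arrival {
        final String airport;
        final double delayMinutes; // negative means the flight arrived early

        Arrival(String airport, double delayMinutes) {
            this.airport = airport;
            this.delayMinutes = delayMinutes;
        }
    }

    // Group the arrivals by airport and average the delay within each group.
    static Map<String, Double> averageDelayByAirport(List<Arrival> arrivals) {
        return arrivals.stream().collect(Collectors.groupingBy(
                a -> a.airport,
                TreeMap::new,
                Collectors.averagingDouble(a -> a.delayMinutes)));
    }

    public static void main(String[] args) {
        List<Arrival> sample = Arrays.asList( // made-up sample values
                new Arrival("MCO", -4.0),
                new Arrival("MCO", -2.0),
                new Arrival("BUF", 30.0));
        System.out.println(averageDelayByAirport(sample));
    }
}
```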
Airports to choose
Orlando - Florida
Based on the results we received we can suggest the following tips:

If you would have to plan a
, a
airport to choose would be Orlando - Florida or Lexington - Kentucky, because their flights on average arrive with negative delays.
On the flipside, Jacksonville - Florida, or Lansing - Michigan are
options to
On an overall basis, the flight route with the best statistics is from Memphis to New York.

NAS (National Aviation System) Delays have typically the
impact on delays; for example, having to circle around an airport once before you can land is simply a slap on the wrist compared to other delays.
Airports to avoid
Buffalo, New York
Based on the results we received we can suggest the following tips:

If you would have to plan a
, a
airport to choose would be Scranton - Pennsylvania or Rapid City - South Dakota, because their flights on average arrive with very high delays.
On the flipside, Green Bay - Wisconsin or Knoxville - Tennessee are
options to
On an overall basis, the flight route with the
statistics is from Buffalo to New York

If a weather delay strikes at Buffalo though, strap yourself in because a weather delay is typically the
References
Cole, L, 2015, 'future of travel', The Economist, Spain, available online: <http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time>
Fowler, M, Sadalage, P, 2012, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Addison-Wesley, Cape Town.
Nance, C, Losser, T, Iype, R, Harmon, G, 2013, 'NoSQL vs RDBMS: Why There is Room for Both', Proceedings of the Southern Association for Information Systems Conference, pp. 111-116.
Padhy, R, 2011, 'RDBMS to NoSQL: Reviewing Some Next-Generation Non-Relational Database's', International Journal of Advanced Engineering Sciences and Technologies, vol. 11, no. 1, pp. 015-030.
Robinson, I, Webber, J, Eifrem, E, 2013, Graph Databases, O'Reilly Media, United States of America.
Sadalage, P, 2014, NoSQL Databases: An overview, viewed 4 June 2015, <http://www.thoughtworks.com/insights/blog/nosql-databases-overview>
United States Department of Transportation, 2015, On-Time: On-Time Performance, viewed 24 May 2015.
Wang, D, Park, S, Fesenmaier, D, 2011, 'The Role of Smartphones in Mediating the Touristic Experience', Journal of Travel Research, vol. 51, no. 4, pp. 371-387.