Loading presentation...

Present Remotely

Send the link below via email or IM

Copy

Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

DeleteCancel

Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

Data Mining

No description
by

Riley Englin

on 8 October 2015

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of Data Mining

Indirect Association Rule Mining for Crime Data Analysis
Overview
Results & Analysis
What is Data Mining?
Discover inherent and previously unknown information from data
Find hidden knowledge within data
Association Rule Mining
Explore relationships among itemsets
First introduced as a way to discover interesting co-occurrences in supermarket data
Apriori Algorithm
Step One
Find frequent itemsets
Itemsets are combinations of items
purchased together in at least a certain, specified number
n
transactions

Create association rules in the form of X Y and/or Y X from frequent itemsets
Support
- proportion of records that contain item/items

Support (X) = # records with X
# records in DB
Confidence
- degree of correlation among items

Confidence (X Y) = # records with X and Y
#records with X
high confidence can be deceptive and suggest strong association rules even if items are independent
Lift
- determines correlation of items in rule

Lift (X Y) = Confidence (X Y)
Support (Y)
Introduction
Related Work
Methods
Results & Analysis
Conclusion & Future Work

OR
Support (X U Y)
Support (X) * Support (Y)
Lift < 1
Negatively correlated
Lift = 1
Not related
Lift > 1
Positively correlated
Step Two
Example
- Buys Cereal Buys Milk
[1] A. Sacasere, E. Omiecinski, and S. Navathe. “Mining for strong negative associations in a large database of customer transactions.” In Proc. of the 14th International Conference on Data Engineering, pages 494-502, Orlando, Florida, February 1998.

[2] Buczak, A.L., Gifford, C.M. “Fuzzy association rule mining for community pattern discovery,” In ACM SIGKDD Workshop on Intelligence and Security Informatics, 2010.

[3] C. McCue, “Connecting the Dots: Data Mining and Predictive Analytics in Law Enforcement and Intelligence Analysis,” Copyright held by the International Association of Chiefs of Police. The Police Chief, vol. 70, no. 10, October 2003. Accessed 12/4/2014.

[4] C.M. Kuok, A. Fu, and M.H. Wong, “Mining Fuzzy Association Rules in Databases,” ACM SIGMOD Record 27(1), pp. 41-46, New York, NY, 1998.

[5] C. Rudin, R. Sevieri, D. Wagner, and T. Wang, “Learning to Detect Patterns of Crime”. http://web.mit.edu/rudin/www/WangRuWaSeECML13.pdf

[6] City of Chicago Data Portal, “Crimes – 2001 to present,” 2011. https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

[7] Fournier-Viger, P., Gomariz, Gueniche, T., A., Soltani, A., Wu., C., Tseng, V. S. (2014). SPMF: a Java Open-Source Pattern Mining Library. Journal of Machine Learning Research (JMLR), 15: 3389-3393.

[8] H. Verlinde, M. De Cock, and R. Boute, “Fuzzy Versus Quantitative Association Rules: A Fair Data-Driven Comparison,” In IEEE Transactions on Systems, Man, and Cybernetics, Vol. 36, No. 3, June 2006.

Data Set
City of Chicago open data portal
Data was extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system
22 possible categorical, quantitative, and boolean attributes
4,556,343 records ranging from January 1, 2001 to October 1, 2014
1

2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
ID

Case Number
Date
Block
IUCR
Primary Type
Description
Location Description
Arrest
Domestic
Beat
District
Ward
Community Area
FBI Code
X-Coordinate
Y-Coordinate
Year
Updated On
Latitude
Longitude
Location
Unique identifier for each record
Case number assigned to each case
Date/Time that the crime took place
Approximate address of occurrence
Illinois Uniform Crime Reporting codes
Type of crime committed
Further description of the type of crime committed
Describes type of place crime took place
Yes or no if culprit was arrested
Yes or no if crime was domestic
Code corresponding to territory and time of police patrol
22 Districts
50 Wards
77 Community areas
Code assigned to case based off primary type of crime
X-Coordinate of crime
Y-Coordinate of crime
Year crime took place
Date the case was last updated
Latitude of crime
Longitude of crime
(Latitude, Longitude)
Data Subset
Record Subset
Training set : October 2011 - September 2013
655,309 records
Prediction set : October 2013 - September 2014
276,209 records

Attribute Subset
"Duplicate" attributes removed
Date/Time split, Day/Night attribute added
1
2
3
4
5
6
7
8
9
10
11


Date
Time
Primary Type
Description
Location Description
Arrest
Domestic
Beat
District
Ward
Community Area


Date that the crime took place
Time that the crime took place
Type of crime committed
Further description of the type of crime committed
Describes type of place crime took place
Yes or no if culprit was arrested
Yes or no if crime was domestic
Code corresponding to territory and time of police patrol
22 Districts
50 Wards
77 Community areas

Data Cleansing
Commas Within Attributes
Date/Time Clean Up
Missing Values
Location Attribute
Filled with "NONE"
Crimes such as "Deceptive Practice" - "Theft"

Latitude and Longitude Attributes
Filled with "0"
Location Description Attribute

"Boat, Watercraft" "Hotel, Motel"

"Boat/Watercraft" "Hotel/Motel"

Beat, District, Ward
and Community Area
Observations
Beat has no missing values
For each beat, common district, ward, and community area values
For each record containing no missing district, ward, and community area values, list values and keep count of each
Beat - 10

District
22
39
Count
5
3
Ward
134
Count
8
CA
111
76
45
Count
1
3
4
Scan data again
For each record containing missing values, fill with highest recorded count for corresponding beat
Beat - 10
District - 22
Ward - 134
Community Area - 45
Split into 3 separate attributes:
Date Attribute
Time Attribute
Day/Night Attribute
Conclusion & Future Work
Further data transformation
Improve algorithm efficiency
Experiment with thresholds
References
Thank you!

Questions?
Riley Englin
Advisor: Dr. Li

Introduction
Data mining interest
Crime data challenges
Domestic violence focus using public crime data set
Indirect Association Rule Mining
and Extension
Cosine
- determines how closely related items or rules are
Cosine (X Y) = # records with X and Y
(# records with X) * (# records with Y)
Interest
- quantifies strength between items
Interest (X Y) = # records with X and Y
(# records with X) * (# records with Y)
Extensions of Apriori
FP-Growth: uses FP-Tree to generate candidate item sets
Partition: uses the intersection of item sets to generate candidate item sets

Association Rule Mining Algorithms Used On Crime Data
Quantitative
- Map categorical values to integers
- Information loss vs. execution time

Fuzzy
- Member or non-member of every set
- Subject matter expert to define "fuzzification" membership functions

Indirect Association Rule Mining
Web recommendations, text mining, stock market data

Discovers relationships between items that would not pass the support threshold for other algorithms
Example:
X and Y rarely occur in the same transaction
{X, Y} does not pass the minimum support value
HOWEVER
X and Y are highly dependant on item/item set Z
THEREFORE
X and Y are indirectly associated through mediator Z
{X, Y | {Z}}
[9] Hipp, Jochen, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. "Algorithms for association rule mining—a general survey and comparison." ACM sigkdd explorations newsletter 2.1 (2000): 58-64.

[10] Kazienko, Przemysław. "Mining indirect association rules for web recommendation." International Journal of Applied Mathematics and Computer Science 19.1 (2009): 165-186.

[11] Merceron, Agathe, and Kalina Yacef. "Interestingness measures for association rules in educational data." Educational Data Mining 2008. 2008.

[12] Pang-Ning, Tan, Michael Steinbach, and Vipin Kumar. "Introduction to data mining." Library of Congress. 2006.

[13] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” In Proceedings of the International Conference on Management of Data, Washington, D.C., pp. 207-216, May 1993.

[14] R. Srikant and R. Arawal, “Mining Quantitative Association Rules in Large Relational Tables,” In Proceedings of the International Conference on Management of Data, Montreal, Quebec, Canada, pp. 1-12, 1996.

[15] Tan, Pang-Ning, Vipin Kumar, and Jaideep Srivastava. “Indirect association: Mining higher order dependencies in data.” Springer Berlin Heidelberg. 2000.

[16] Y. Sucahyo, R. Gopalan. “CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth.” In 14th Australasian Database Conference (ADC2003), Vol 17, Adelaide, Australia, 2003.

Indirect Association Rule Mining
SPMF Data Mining Library

Input
: Data set with integers representing all values
Database mapping
3 Parameters required

minsup
- minimum support for X or Y and M
ts
- minimum support for {X, Y}
minconf
- minimum confidence for indirect associations

Output
: Rules in the form of {X, Y} | M

Example
minsup = 60% ts = 40% minconf = 10%











SUP (B U D) / 5 = 0.6
SUP (C U D) / 5 = 0.8

SUP (B U C) / 5 = 0.4 >= ts

SUP (B U D) / SUP (B) = 1.0
SUP (C U D) / SUP (C) = 1.0
{B, C | {D}}
>= minsup
>= minconf
Transaction ID
1
2
3
4
5
Items
{C, D}
{B, C, D}
{A, C, D, E}
{B, D}
{A, B, C, D}
Determines if each item passes minsup

Apriori-style generation of frequent sets

For each item set j of size k, for k >2

Compare j against all other sets of size k
looking for only 1 differing item

If found
Remove 2 differing items (ex. A & B)
Remaining items are mediatior M
Determine if valid rule
i.e.
Support {A, B} > ts
Confidence {A, M} > minconf
Confidence {B, M} > minconf
Algorithm Extension
Parameters Added
Lift
Cosine
Interest

Algorithm Tweaks
Allow for indirect association rules that could demonstrate an indirect relationship between sets of items rather than single items
Restrict specific attributes to mediator set
Focus on "Domestic" and "Arrest" attributes
Training Data Parameter Results
Training Data VS Test Data Parameter Results
Indirect Association Crime Rules
Step 1:
931,518 records total
only 110,472 (~12%) "NOT Arrest" and "Domestic"
minsup = 0.08 ts = 0.07 minconf = 0.05
Item X
Domestic
Domestic
Domestic
Domestic
Domestic
Item Y
Street
Theft
Battery
Residence
Apartment
Mediator
NOT Arrest
NOT Arrest
NOT Arrest
NOT Arrest
NOT Arrest
Step 2:
Restrict "Arrest" "NOT Arrest" "Domestic" "NOT Domestic" to Mediator
Item to item set relations added

Step 3:
Subset of data with "Domestic"
94,885 records
Item X
Battery
Criminal Damage
Apartment
Narcotics
Narcotics
Battery
$500 and Under
Narcotics
Street
Item Y
Theft, $500 and Under
Battery, Domestic Battery Simple
Battery, Domestic Battery Simple
Theft, $500 and Under
Battery, Domestic Battery Simple
Theft, Domestic Battery Simple
Battery, Domestic Battery Simple
Theft, $500 and Under
Theft, $500 and Under
Mediator
NOT Arrest
NOT Arrest
NOT Arrest
Arrest
Arrest
NOT Domestic
NOT Domestic
Arrest, NOT Domestic
NOT Arrest, NOT Domestic
Item X
CommArea25
District7
Street
Simple
Other Offense
Other Offense
Assault
Street
Street
District7
Simple
Item Y
Battery, Residence
Battery, Domestic Battery Simple
Domestic Battery Simple, Apartment
Battery, Apartment
Battery Residence
Battery, Domestic Battery Simple
Battery, Residence
Battery, Residence
Domestic Battery Simple, Residence
Battery, Apartment
Domestic Battery Simple, Apartment
Mediator
Domestic
Domestic
Domestic
Domestic
Domestic
Domestic
NOT Arrest
Domestic
NOT Arrest
Domestic
Domestic
Full transcript