International Journal of Computer Application (2250-1797), Volume 5, No. 5, August 2015
Particle Swarm Optimization Based Association Rule Mining
Swati Shrivastava #1, Vikram Rajput #2
#1 M.Tech (CSE) Scholar, Dept. of CSE, LNCT Bhopal (M.P.), swati1989shrivastava@gmail.com
#2 Assistant Professor, Dept. of CSE, LNCT Bhopal (M.P.), [email protected]
ABSTRACT: Association rule mining is one of the most widely used and simplest techniques for finding frequent itemsets in large datasets. However, generating frequent itemsets from a large dataset with classical association rule mining is not efficient. This can be improved by using the particle swarm optimization (PSO) algorithm, a population-based evolutionary heuristic search method used for solving combinatorial problems. In general, the rules generated by association rule mining techniques do not consider negative occurrences of attributes; by applying the PSO algorithm over these rules, the system can also predict rules that contain negative attributes.
Keywords: Evolutionary algorithm, PSO, GSA, Association rule mining, Data mining.
Corresponding Author: Swati Shrivastava
INTRODUCTION: Data mining is the process of processing large volumes of data (usually stored in a database), searching for patterns and relationships within that data. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost. Data mining is sorting through data to identify patterns and establish relationships [1], [3-6].
Data: Data are any facts, numbers, or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data in different formats and
different databases.
Information: The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data can yield information
on which products are selling.
Knowledge: Information can be converted into knowledge about historical patterns and future
trends. For example, summary information on retail supermarket sales can be analyzed in light of
promotional efforts to provide knowledge of consumer buying behavior.
In general, association rule mining can be viewed as a two-step process: (1) find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup; (2) generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence [1]. Most currently existing association rule mining algorithms were designed with market basket analysis in mind. Also, it is necessary to specify the minimum support of the mined rules in advance, often leading to either too many or too few rules; this negatively impacts the performance of the overall system [3].
II. What is an Association Rule?
Association rule mining (Apriori): example association rules take the following form:
IF NFC and RIA THEN BI
IF QC and VLSI THEN IPR
where NFC, RIA, QC, VLSI, BI, and IPR are among the eight subjects from which a student has to choose four. The rules above show that if a student takes NFC and RIA, then the probability is high that he will choose BI too; similarly, if he chooses QC and VLSI, then the probability is high that he will take IPR. There is no limit on the number of antecedents in a rule, but there is a constraint on the number of consequents: the number of consequents is 1. This limit does no harm, because if the user wants to see the confidence value of a rule that contains more than one consequent [6], the same can be obtained by taking two rules from the system and intersecting them.
Apriori algorithm:
General process: association rule generation is usually split into two separate steps:
• First, minimum support is applied to find all frequent itemsets in a database.
• Second, these frequent itemsets and the minimum confidence constraint are used to form rules.
While the second step is straightforward, the first step needs more attention. Finding all frequent itemsets in a database is difficult, since it involves searching all possible itemsets (item combinations). The set of possible itemsets is the power set over I and has size 2^n − 1 (excluding the empty set, which is not a valid itemset). Exploiting the downward-closure property of support (every subset of a frequent itemset is itself frequent), efficient algorithms can find all frequent itemsets.
Apriori Algorithm Pseudocode
Apriori(T, min_support) {        // T is the transaction database
    L1 = {frequent 1-itemsets};
    for (k = 2; L(k-1) != ∅; k++) {
        Ck = candidates generated from L(k-1)
        // i.e., the join L(k-1) × L(k-1), eliminating any candidate that
        // contains a (k-1)-size itemset that is not frequent
        for each transaction t in the database do {
            // increment the count of all candidates in Ck contained in t
        }
        Lk = candidates in Ck with at least min_support
    } // end for
    return the union of all Lk;
}
As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing the individual items purchased), the algorithm attempts to find subsets of items that are common to at least a minimum number of the itemsets.
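To make the pseudocode concrete, the following is a minimal Python sketch of the candidate-generate-prune-count loop. It is an illustration under assumed conventions (transactions as Python sets, min_support as an absolute count), not the authors' implementation:

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset in `transactions` (a list of sets)."""
    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_support}
    all_frequent = set(freq)
    k = 2
    while freq:
        # Join step: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Count step: one scan of the database per pass
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

# Example: four transactions, absolute minimum support count of 2
print(apriori([{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}], 2))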
III. Problem Statement
Frequent itemset mining plays an important role in several data mining fields, such as association rules, warehousing, correlations, clustering of high-dimensional biological data, and classification. Given a dataset d that contains k items, the number of itemsets that could be generated is 2^k − 1, excluding the empty set. In order to find the frequent itemsets, the support of each itemset must be computed by scanning each transaction in the dataset. A brute-force approach for doing this is computationally expensive due to the exponential number of itemsets whose support counts must be determined.
Let I be a set of items. A set X = {i1, . . . , ik} ⊆ I is called an itemset, or a k-itemset if it contains k items. A transaction over I is a pair T = (tid, I') where tid is the transaction identifier and I' is an itemset. A transaction T = (tid, I') is said to support an itemset X ⊆ I if X ⊆ I'. A transaction database D over I is a set of transactions over I. We omit I whenever it is clear from the context. The cover of an itemset X in D consists of the set of transaction identifiers of the transactions in D that support X:
cover(X, D) := {tid | (tid, I') ∈ D, X ⊆ I'}.
The support of an itemset X in D is the number of transactions in the cover of X in D:
support(X, D) := |cover(X, D)|.
The frequency of an itemset X in D is the probability of X occurring in a transaction T ∈ D:
frequency(X, D) := P(X) = support(X, D) / |D|.
Note that |D| = support({}, D). We omit D whenever it is clear from the context. An itemset is called frequent if its support is no less than a given absolute minimal support threshold σ_abs, with 0 ≤ σ_abs ≤ |D|. When working with frequencies of itemsets instead of their supports, we use a relative minimal frequency threshold σ_rel, with 0 ≤ σ_rel ≤ 1. Obviously, σ_abs = ⌈σ_rel · |D|⌉.
IV. Particle Swarm Optimization
In PSO, a population of particles moves through the search space by following the current optimum particles and changing positions in order to find the optimum. When a particle takes part of the population as its topological neighbours, the best value is a local best. The particles tend to move to good areas in the search space as information spreads through the swarm. Each particle is moved to a new position calculated from the velocity updated at each time step t; the acceleration coefficients c1 and c2 determine the significance of the personal best pbest(t) and the global best gbest(t), respectively. Furthermore, the velocity v_i(t) at any time step of the algorithm is constrained by the parameter v_max. The swarm in PSO is initialized by assigning each particle a uniformly and randomly chosen position in the search space. Velocities are initialized randomly in the range [−v_max, v_max].
V. Proposed Methodology
Swarm behavior can be seen in bird flocks and fish schools, as well as in insects like mosquitoes and midges. Many animal groups such as fish schools and bird flocks clearly display structural order, with the behavior of the organisms so integrated that even though they may change shape and direction, they appear to move as a single coherent entity [11]. The main principles of this collective behavior are:
• Homogeneity: every bird in the flock has the same behavior model. The flock moves without a leader, even though temporary leaders seem to appear.
• Locality: the motion of each bird is influenced only by its nearest flock mates. Vision is considered the most important sense for flock organization.
• Collision Avoidance: avoid collisions with nearby flock mates.
• Velocity Matching: attempt to match velocity with nearby flock mates.
• Flock Centering: attempt to stay close to nearby flock mates.
According to the definition of association rule mining, the intersection of the antecedent itemset X and the consequent itemset Y of a rule (X → Y) must be empty: items that appear in itemset X do not appear in itemset Y, and vice versa. Hence, both a front and a back partition point must be given for the purpose of chromosome encoding. The items before the front partition point form itemset X, while those between the front and back partition points form itemset Y. The chromosome encoding approach in this study is string encoding. Each value represents a different item name, which means that item 1 is encoded as "1" and item 2 is encoded as "2". The value of each item is encoded into a string-type chromosome in the corresponding order.
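As a sketch of this encoding (the helper below is hypothetical, written only to illustrate the two partition points):

def decode_rule(chromosome, front, back):
    """Split an encoded particle into antecedent X and consequent Y.

    Items before `front` form itemset X; items between `front` and
    `back` form itemset Y, so X and Y are disjoint by construction.
    """
    X = set(chromosome[:front])
    Y = set(chromosome[front:back])
    return X, Y

# Example: the encoded string [1, 2, 3, 4, 5] with partition points
# front=2 and back=3 decodes to the rule {1, 2} -> {3}
X, Y = decode_rule([1, 2, 3, 4, 5], front=2, back=3)
print(X, "->", Y)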
1. Fitness Value Calculation
The fitness value is used to evaluate the importance of each particle. The fitness value of each particle comes from the fitness function:
Fitness(k) = confidence(k) × log(support(k) × length(k) + 1)
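In code, this fitness function is one line; the sketch below assumes a natural logarithm, which the paper does not specify:

import math

def fitness(confidence, support, length):
    """Fitness of rule k: confidence(k) * log(support(k) * length(k) + 1).

    The +1 keeps the logarithm defined (and non-negative) even when the
    support-length product is zero.
    """
    return confidence * math.log(support * length + 1)

# Example: a rule with confidence 0.83, support count 10, length 2
print(fitness(0.83, 10, 2))  # ~2.53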
2. Population Generation
3. Search for the Best Particle
The particle is moved to a new position calculated from the velocity updated at each time step t. This new position is calculated as the sum of the previous position and the new velocity:
x_i(t + 1) = x_i(t) + v_i(t + 1)
The velocity update is performed as indicated in the following equation:
v_i(t + 1) = w · v_i(t) + c1 · U(0,1) · (pbest_i(t) − x_i(t)) + c2 · U(0,1) · (gbest(t) − x_i(t))
The parameter w is called the inertia weight and controls the magnitude of the old velocity v_i(t) in the calculation of the new velocity, whereas c1 and c2 determine the significance of the personal best pbest_i(t) and the global best gbest(t), respectively. Furthermore, v_i(t) at any time step of the algorithm is constrained by the parameter v_max. The swarm in PSO is initialized by assigning each particle a uniformly and randomly chosen position in the search space. Velocities are initialized randomly in the range [−v_max, v_max].
Main steps of the procedure are:
1: Initialize Population
2: repeat
3: Calculate fitness values of particles
4: Modify the best particles in the swarm
5: Choose the best particle
6: Calculate the velocities of particles
7: Update the particle positions
8: until requirements are met
Particles' velocities in each dimension are clamped to a maximum velocity v_max. If the sum of accelerations would cause the velocity in a dimension to exceed v_max, which is a parameter specified by the user, then the velocity in that dimension is limited to v_max.
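The whole loop can be sketched in a few lines of NumPy. This is a generic PSO under assumed defaults (swarm size, inertia weight, acceleration coefficients, and search bounds are illustrative choices, not values from the paper):

import numpy as np

def pso(fitness, dim, n_particles=30, iters=100,
        w=0.7, c1=1.5, c2=1.5, v_max=1.0, lo=0.0, hi=1.0):
    rng = np.random.default_rng(0)
    x = rng.uniform(lo, hi, (n_particles, dim))         # positions
    v = rng.uniform(-v_max, v_max, (n_particles, dim))  # velocities
    pbest = x.copy()                                    # personal bests
    pbest_f = np.array([fitness(p) for p in x])
    gbest = pbest[pbest_f.argmax()].copy()              # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Velocity update: inertia + cognitive + social terms
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        v = np.clip(v, -v_max, v_max)   # clamp each dimension to v_max
        x = x + v                       # position update
        f = np.array([fitness(p) for p in x])
        better = f > pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[pbest_f.argmax()].copy()
    return gbest, pbest_f.max()

# Example: maximize a simple concave function over [0, 1]^3
best_x, best_f = pso(lambda p: -np.sum((p - 0.5) ** 2), dim=3)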
Figure 1: Flow diagram of the proposed methodology
VI. Result Analysis
The standard dataset of an online American lottery system has transaction IDs as index numbers and lottery numbers as item IDs. The file "lottery.txt" has a lower number of data entries, around 55, with transaction IDs on the left-hand side and values varying from four to six on the right-hand side. "Testdata.txt" has 200 transaction IDs on the left-hand side and entries varying from five to nine on the right-hand side. The other dataset is the "karate" dataset, which has thirty-four transaction IDs on the left-hand side and values varying from one to thirty-four on the right-hand side. The mining criteria parameters are minimum support, minimum confidence, and the size of the file to be mined. The mining results are compared on mining time, time per rule, and number of rules generated. Support and confidence are measured and calculated by equations (1) and (2):
support(X, D) := |cover(X, D)|   (1)
confidence(X → Y, D) := P(Y | X) = support(X ∪ Y, D) / support(X, D)   (2)
The frequency of an itemset X in D is the probability of X occurring in a transaction T ∈ D.
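As an illustrative check of equation (2): if X ∪ Y is supported by 10 of 34 transactions (support ≈ 29.41%) while X alone is supported by 12, then confidence(X → Y) = 10/12 ≈ 83.33%, which matches the first karate rule reported below.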
Table 1: Results with confidence 70 percent for the Test dataset
S.No.   Support   Mining Time   Time per entry   No. of rules
1       20        1.1679        0.005753         5
2       30        0.5979        0.002945         8
3       40        0.585168      0.002882         8
Table 2: Results with confidence 40 percent for the Lottery dataset
S.No.   Support   Mining Time   Time per entry   No. of rules
1       10        2.166         0.040331         4
2       15        2.08166       0.040331         No rules generated
3       20        2.0879        0.04991          No rules generated
Table 3: Results with confidence 70 percent for the Student dataset
S.No.   Support   Mining Time   Time per entry   No. of rules
1       10        2.08166       NA               5
2       20        2.79          NA               No rules generated
3       40        2.89          NA               No rules generated
Proposed Methodology Based Result Analysis
Karate dataset with 70 percent confidence, mined using the personal best of each element in the swarm with different fitness values.
Table 4: Results with confidence 70 percent for the Karate dataset
S.No.   Support   Mining Time   Time per entry   No. of rules
1       10        0.02366       1.040331         3
2       15        0.041775      1.089331         5
3       20        0.014307      0.04991          4
Rules for serial No. 1, with the individual support and confidence of each transaction itemset:
Rule (Support, Confidence)
items33 -> items34 (29.4118%, 83.3333%)
items34 -> items33 (29.4118%, 58.8235%)
items2 -> items1 (20.5882%, 77.7778%)
Rules for serial No. 2, with the individual support and confidence of each transaction itemset:
items33 -> items34 (29.4118%, 83.3333%)
items34 -> items33 (29.4118%, 58.8235%)
items1 -> items2 (20.5882%, 43.75%)
items2 -> items1 (20.5882%, 77.7778%)
items3 -> items34 (17.6471%, 60%)
Rules for serial No. 3, with the individual support and confidence of each transaction itemset:
items33 -> items34 (29.4118%, 83.3333%)
items34 -> items33 (29.4118%, 58.8235%)
items1 -> items2 (20.5882%, 43.75%)
items2 -> items1 (20.5882%, 77.7778%)
Comparison of Existing and Proposed Methodology
Here the existing method and the proposed method are compared on the number of rules generated for the associativity of items. The tables below show that the existing method produces more rules than the proposed methodology; consequently, the mining time of the proposed methodology is also lower, as shown in Tables 5 and 6.
Table 5: Comparison of the number of rules generated in mining
S.No.   Support    Existing Method   PSO (Proposed) Method
1       Sup = 10   5                 3
2       Sup = 15   8                 5
3       Sup = 20   8                 4
Table 6: Comparison of mining time
S.No.   Support    Existing Method   PSO (Proposed) Method
1       Sup = 10   1.1679            0.02366
2       Sup = 15   0.5979            0.041775
3       Sup = 20   0.585168          0.014307
Figure 2: The graph between support and the number of rules generated during mining (series: Existing Methods, PSO (proposed) method; x-axis: Sup = 10, 15, 20)
Figure 3: The graph between support and mining time for rule generation (series: Existing Methods, PSO (proposed) method; x-axis: Sup = 10, 15, 20)
Figures 2 and 3 show the graphical comparison of the two methods. The first graph shows the number of generated rules. The number of rules is lower in the proposed method because of the optimization property of PSO, so the processing time during rule mining from the given dataset is also lower.
VII. Conclusion
Data mining is the exploration of knowledge from the large sets of data generated as a result of various data processing activities, and frequent pattern mining is a very important task within it. The proposed method works well with static datasets by using the support count together with the particle swarm optimization concept. If an application supports multiple minimum supports, the above work extends easily and a comparative study can be done. Mining streams, in contrast, requires fast, real-time processing in order to keep up with the high data arrival rate, with mining results expected to be available within a short response time.
VIII. References
[1] S. Nicholson, "The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making," Information Technology and Libraries, 22(4):146-151, 2003.
[2] Jing Yang, Hao Wang, Xue-Gang Hu, and Zhong-Hui Hu, "A New Classification Algorithm Based on Rough Set and Entropy," Proceedings of the Second International Conference on Machine Learning and Cybernetics, IEEE, 2003.
[3] De-Sheng Yin, Guo-Yin Wang, and W. Wu, "A Self-Learning Algorithm for Decision Tree Pre-Pruning," Proceedings of the Third International Conference on Machine Learning and Cybernetics, IEEE, 2004.
[4] B. R. Yang, Knowledge Discovery Based on Inner Mechanism: Construction, Realization and Application, Elliott & Fitzpatrick Inc., USA, 2004.
[5] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, "Mining frequent patterns in data streams at multiple time granularities," in Data Mining: Next Generation Challenges and Future Directions, AAAI/MIT Press, 2004.
[6] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: A review," ACM SIGMOD Record, vol. 34, no. 1, 2005.
[7] Jiann-Cherng Shieh and Yung-Shun Lin, "Bibliomining User Behaviors in the Library," Journal of Educational Media & Library Sciences, 44(1):36-60, 2006.
[8] N. Jiang and L. Gruenwald, "Research issues in data stream association rule mining," ACM SIGMOD Record, 35(1):14-19, 2006.
[9] Gregory Buehrer, Srinivasan Parthasarathy, and Amol Ghoting, "Out-of-core frequent pattern mining on a commodity PC," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2006, pp. 86-95.
[10] Nele Dexters, Paul W. Purdom, and Dirk Van Gucht, "A probability analysis for candidate-based frequent itemset algorithms," in SAC '06: Proceedings of the 2006 ACM Symposium on Applied Computing, New York, NY, USA, ACM, 2006, pp. 541-545.
[11] C. C. Aggarwal, "On density based transforms for uncertain data mining," in Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), pages 866-875, Istanbul, Turkey, 2007.
[12] C. C. Aggarwal, Data Streams: Models and Algorithms, Springer, 2007.
[13] C. K. Chui, B. Kao, and E. Hung, "Mining frequent itemsets from uncertain data," in 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2007), Nanjing, China, pages 47-58, 2007.
[14] N. Dalvi and D. Suciu, "Efficient query evaluation on probabilistic databases," The VLDB Journal, 16(4):523-544, 2007.
[15] Ehsan Valian, Shahram Mohanna, and Saeed Tavakoli, "Improved Cuckoo Search Algorithm for Feedforward Neural Network Training," International Journal of Artificial Intelligence & Applications (IJAIA), vol. 2(3), pp. 36-43, July 2011.
[16] S. Walton, O. Hassan, K. Morgan, and M. R. Brown, "Modified Cuckoo Search: A new gradient free optimisation algorithm," Chaos, Solitons & Fractals, doi:10.1016/j.chaos.2011.06.004, 30 June 2011.
[17] D. Rodrigues, L. A. M. Pereira, A. N. Souza, C. C. Ramos, and Xin-She Yang, "A Binary Cuckoo Search Algorithm for Feature Selection," IEEE International Symposium on Circuits and Systems (ISCAS), pp. 465-468, 2013.
[18] C. K.-S. Leung, C. L. Carmichael, and B. Hao, "Efficient mining of frequent patterns from uncertain data," in ICDMW '07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, pages 489-494, 2007.
[19] Weiming Hu, Ou Wu, Zhouyao Chen, Zhouyu Fu, and S. Maybank, "Recognition of Pornographic Web Pages by Classifying Texts and Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[20] Jun Fang, Lei Guo, XiaoDong Wang, and Ning Yang, "Ontology-Based Automatic Classification and Ranking for Web Documents," Fuzzy Systems and Knowledge Discovery (FSKD 2007), 2007.
114
Fly UP