Book Details
Data Mining, the automatic extraction of implicit and potentially useful information from data, is increasingly used in commercial, scientific and other application areas. Principles of Data Mining explains and explores the principal techniques of Data Mining: classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed worked examples, with a focus on algorithms rather than mathematical formalism. The book is written for readers without a strong background in mathematics or statistics, and any formulae used are explained in detail. This second edition has been expanded to include additional chapters on using frequent pattern trees for association rule mining, comparing classifiers, ensemble classification and dealing with very large volumes of data. Principles of Data Mining aims to help general readers develop the necessary understanding of what is inside the 'black box' so that they can use commercial data mining packages discriminatingly, and to enable advanced readers and academic researchers to understand or contribute to future technical advances in the field. It is suitable as a textbook to support courses at undergraduate or postgraduate level in a wide range of subjects, including Computer Science, Business Studies, Marketing, Artificial Intelligence, Bioinformatics and Forensic Science.

Table of Contents

Cover
Principles of Data Mining, Second Edition
ISBN 9781447148838, ISBN 9781447148845
About This Book
Note on the Second Edition
Contents

1 Introduction to Data Mining
1.1 The Data Explosion
1.2 Knowledge Discovery
1.3 Applications of Data Mining
1.4 Labelled and Unlabelled Data
1.5 Supervised Learning: Classification
1.6 Supervised Learning: Numerical Prediction
1.7 Unsupervised Learning: Association Rules
1.8 Unsupervised Learning: Clustering

2 Data for Data Mining
2.1 Standard Formulation
2.2 Types of Variable
2.2.1 Categorical and Continuous Attributes
2.3 Data Preparation
2.3.1 Data Cleaning
2.4 Missing Values
2.4.1 Discard Instances
2.4.2 Replace by Most Frequent/Average Value
2.5 Reducing the Number of Attributes
2.6 The UCI Repository of Datasets
2.7 Chapter Summary
2.8 Self-assessment Exercises for Chapter 2

3 Introduction to Classification: Naïve Bayes and Nearest Neighbour
3.1 What Is Classification?
3.2 Naïve Bayes Classifiers
3.3 Nearest Neighbour Classification
3.3.1 Distance Measures
3.3.2 Normalisation
3.3.3 Dealing with Categorical Attributes
3.4 Eager and Lazy Learning
3.5 Chapter Summary
3.6 Self-assessment Exercises for Chapter 3

4 Using Decision Trees for Classification
4.1 Decision Rules and Decision Trees
4.1.1 Decision Trees: The Golf Example
4.1.2 Terminology
4.1.3 The degrees Dataset
4.2 The TDIDT Algorithm
4.3 Types of Reasoning
4.4 Chapter Summary
4.5 Self-assessment Exercises for Chapter 4

5 Decision Tree Induction: Using Entropy for Attribute Selection
5.1 Attribute Selection: An Experiment
5.2 Alternative Decision Trees
5.2.1 The Football/Netball Example
5.2.2 The anonymous Dataset
5.3 Choosing Attributes to Split On: Using Entropy
5.3.1 The lens24 Dataset
5.3.2 Entropy
5.3.3 Using Entropy for Attribute Selection
5.3.4 Maximising Information Gain
5.4 Chapter Summary
5.5 Self-assessment Exercises for Chapter 5

6 Decision Tree Induction: Using Frequency Tables for Attribute Selection
6.1 Calculating Entropy in Practice
6.1.1 Proof of Equivalence
6.1.2 A Note on Zeros
6.2 Other Attribute Selection Criteria: Gini Index of Diversity
6.3 The χ² Attribute Selection Criterion
6.4 Inductive Bias
6.5 Using Gain Ratio for Attribute Selection
6.5.1 Properties of Split Information
6.5.2 Summary
6.6 Number of Rules Generated by Different Attribute Selection Criteria
6.7 Missing Branches
6.8 Chapter Summary
6.9 Self-assessment Exercises for Chapter 6

7 Estimating the Predictive Accuracy of a Classifier
7.1 Introduction
7.2 Method 1: Separate Training and Test Sets
7.2.1 Standard Error
7.2.2 Repeated Train and Test
7.3 Method 2: k-fold Cross-validation
7.4 Method 3: N-fold Cross-validation
7.5 Experimental Results I
7.6 Experimental Results II: Datasets with Missing Values
7.6.1 Strategy 1: Discard Instances
7.6.2 Strategy 2: Replace by Most Frequent/Average Value
7.6.3 Missing Classifications
7.7 Confusion Matrix
7.7.1 True and False Positives
7.8 Chapter Summary
7.9 Self-assessment Exercises for Chapter 7

8 Continuous Attributes
8.1 Introduction
8.2 Local versus Global Discretisation
8.3 Adding Local Discretisation to TDIDT
8.3.1 Calculating the Information Gain of a Set of Pseudo-attributes
8.3.2 Computational Efficiency
8.4 Using the ChiMerge Algorithm for Global Discretisation
8.4.1 Calculating the Expected Values and χ²
8.4.2 Finding the Threshold Value
8.4.3 Setting minIntervals and maxIntervals
8.4.4 The ChiMerge Algorithm: Summary
8.4.5 The ChiMerge Algorithm: Comments
8.5 Comparing Global and Local Discretisation for Tree Induction
8.6 Chapter Summary
8.7 Self-assessment Exercises for Chapter 8

9 Avoiding Overfitting of Decision Trees
9.1 Dealing with Clashes in a Training Set
9.1.1 Adapting TDIDT to Deal with Clashes
9.2 More About Overfitting Rules to Data
9.3 Pre-pruning Decision Trees
9.4 Post-pruning Decision Trees
9.5 Chapter Summary
9.6 Self-assessment Exercise for Chapter 9

10 More About Entropy
10.1 Introduction
10.2 Coding Information Using Bits
10.3 Discriminating Amongst M Values (M Not a Power of 2)
10.4 Encoding Values That Are Not Equally Likely
10.5 Entropy of a Training Set
10.6 Information Gain Must Be Positive or Zero
10.7 Using Information Gain for Feature Reduction for Classification Tasks
10.7.1 Example 1: The genetics Dataset
10.7.2 Example 2: The bcst96 Dataset
10.8 Chapter Summary
10.9 Self-assessment Exercises for Chapter 10

11 Inducing Modular Rules for Classification
11.1 Rule Post-pruning
11.2 Conflict Resolution
11.3 Problems with Decision Trees
11.4 The Prism Algorithm
11.4.1 Changes to the Basic Prism Algorithm
11.4.2 Comparing Prism with TDIDT
11.5 Chapter Summary
11.6 Self-assessment Exercise for Chapter 11

12 Measuring the Performance of a Classifier
12.1 True and False Positives and Negatives
12.2 Performance Measures
12.3 True and False Positive Rates versus Predictive Accuracy
12.4 ROC Graphs
12.5 ROC Curves
12.6 Finding the Best Classifier
12.7 Chapter Summary
12.8 Self-assessment Exercise for Chapter 12

13 Dealing with Large Volumes of Data
13.1 Introduction
13.2 Distributing Data onto Multiple Processors
13.3 Case Study: PMCRI
13.4 Evaluating the Effectiveness of a Distributed System: PMCRI
13.5 Revising a Classifier Incrementally
13.6 Chapter Summary
13.7 Self-assessment Exercises for Chapter 13

14 Ensemble Classification
14.1 Introduction
14.2 Estimating the Performance of a Classifier
14.3 Selecting a Different Training Set for Each Classifier
14.4 Selecting a Different Set of Attributes for Each Classifier
14.5 Combining Classifications: Alternative Voting Systems
14.6 Parallel Ensemble Classifiers
14.7 Chapter Summary
14.8 Self-assessment Exercises for Chapter 14

15 Comparing Classifiers
15.1 Introduction
15.2 The Paired t-Test
15.3 Choosing Datasets for Comparative Evaluation
15.3.1 Confidence Intervals
15.4 Sampling
15.5 How Bad Is a 'No Significant Difference' Result?
15.6 Chapter Summary
15.7 Self-assessment Exercises for Chapter 15

16 Association Rule Mining I
16.1 Introduction
16.2 Measures of Rule Interestingness
16.2.1 The Piatetsky-Shapiro Criteria and the RI Measure
16.2.2 Rule Interestingness Measures Applied to the
16.2.3 Using Rule Interestingness Measures for Conflict Resolution
16.3 Association Rule Mining Tasks
16.4 Finding the Best N Rules
16.4.1 The J-Measure: Measuring the Information Content of a Rule
16.4.2 Search Strategy
16.5 Chapter Summary
16.6 Self-assessment Exercises for Chapter 16

17 Association Rule Mining II
17.1 Introduction
17.2 Transactions and Itemsets
17.3 Support for an Itemset
17.4 Association Rules
17.5 Generating Association Rules
17.6 Apriori
17.7 Generating Supported Itemsets: An Example
17.8 Generating Rules for a Supported Itemset
17.9 Rule Interestingness Measures: Lift and Leverage
17.10 Chapter Summary
17.11 Self-assessment Exercises for Chapter 17

18 Association Rule Mining III: Frequent Pattern Trees
18.1 Introduction: FP-Growth
18.2 Constructing the FP-tree
18.2.1 Pre-processing the Transaction Database
18.2.2 Initialisation
18.2.3 Processing Transaction 1: f, c, a, m, p
18.2.4 Processing Transaction 2: f, c, a, b, m
18.2.5 Processing Transaction 3: f, b
18.2.6 Processing Transaction 4: c, b, p
18.2.7 Processing Transaction 5: f, c, a, m, p
18.3 Finding the Frequent Itemsets from the FP-tree
18.3.1 Itemsets Ending with Item p
18.3.2 Itemsets Ending with Item m
18.4 Chapter Summary
18.5 Self-assessment Exercises for Chapter 18

19 Clustering
19.1 Introduction
19.2 k-Means Clustering
19.2.1 Example
19.2.2 Finding the Best Set of Clusters
19.3 Agglomerative Hierarchical Clustering
19.3.1 Recording the Distance Between Clusters
19.3.2 Terminating the Clustering Process
19.4 Chapter Summary
19.5 Self-assessment Exercises for Chapter 19

20 Text Mining
20.1 Multiple Classifications
20.2 Representing Text Documents for Data Mining
20.3 Stop Words and Stemming
20.4 Using Information Gain for Feature Reduction
20.5 Representing Text Documents: Constructing a Vector Space Model
20.6 Normalising the Weights
20.7 Measuring the Distance Between Two Vectors
20.8 Measuring the Performance of a Text Classifier
20.9 Hypertext Categorisation
20.9.1 Classifying Web Pages
20.9.2 Hypertext Classification versus Text Classification
20.10 Chapter Summary
20.11 Self-assessment Exercises for Chapter 20

A Essential Mathematics
B Datasets
C Sources of Further Information
D Glossary and Notation
E Solutions to Self-assessment Exercises
Index