|Title:||Mining Sequential Patterns from Probabilistic Data|
|Presented at:||University of Leicester|
|Abstract:||Sequential Pattern Mining (SPM) is an important data mining problem. Although classical SPM assumes that the data to be mined is deterministic, it is now recognized that data obtained from a wide variety of sources, such as sensor data or web data collected from different (potentially conflicting) sources, is inherently noisy or uncertain. Probabilistic databases are a popular framework for modelling uncertainty. Recently, several data mining and ranking problems have been studied in probabilistic databases. To the best of our knowledge, this is the first systematic study of mining sequential patterns from probabilistic databases. In this work, we consider the kinds of uncertainty that could arise in SPM. We propose four novel uncertainty models for SPM, namely tuple-level uncertainty, event-level uncertainty, source-level uncertainty and source-level uncertainty in deduplication, all of which fit into the probabilistic databases framework, and motivate them using potential real-life scenarios. We then define the interestingness predicate for two measures of interestingness, namely expected support and probabilistic frequentness. Next, we consider the computational complexity of evaluating the interestingness predicate for various combinations of uncertainty models and interestingness measures, and show that different combinations have very different outcomes from a complexity-theoretic viewpoint: whilst some cases are computationally tractable, we show other cases to be computationally intractable. We give a dynamic programming algorithm to compute the source support probability, and hence the expected support, of a sequence in a source-level uncertain database. We then propose optimizations to speed up the support computation task. Next, we propose probabilistic SPM algorithms based on the candidate generation and pattern growth frameworks for the source-level uncertainty model and the expected support measure. We implement these algorithms, evaluate them empirically, and show their scalability under different parameter settings using both real and synthetic datasets. Finally, we demonstrate the effectiveness of the probabilistic SPM framework at extracting meaningful patterns in the presence of noise.|
|Rights:||Copyright © the author, 2012|
|Appears in Collections:||Theses, Dept. of Computer Science|
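
The two interestingness measures named in the abstract, expected support and probabilistic frequentness, can be illustrated with a minimal sketch. It is not the thesis's algorithm. It assumes that each source supports a given pattern independently and that the per-source support probabilities are already known (the abstract says the thesis computes these by dynamic programming; that step is not reproduced here). All function names and the `min_support` and `threshold` parameters are hypothetical, chosen for illustration only.

```python
from typing import List


def expected_support(source_probs: List[float]) -> float:
    """Expected number of sources supporting the pattern.

    By linearity of expectation this is the sum of the per-source
    support probabilities (assumed given).
    """
    return sum(source_probs)


def support_distribution(source_probs: List[float]) -> List[float]:
    """Distribution of the support count, assuming independent sources.

    dist[k] is the probability that exactly k sources support the
    pattern. Built one source at a time (a Poisson binomial recurrence).
    """
    dist = [1.0]  # with zero sources, the support is 0 with certainty
    for p in source_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this source does not support the pattern
            nxt[k + 1] += q * p       # this source supports the pattern
        dist = nxt
    return dist


def frequentness_probability(source_probs: List[float], min_support: int) -> float:
    """Probability that at least `min_support` sources support the pattern."""
    dist = support_distribution(source_probs)
    return sum(dist[max(0, min_support):])


def is_probabilistically_frequent(
    source_probs: List[float], min_support: int, threshold: float
) -> bool:
    """Hypothetical interestingness predicate: P(support >= min_support) >= threshold."""
    return frequentness_probability(source_probs, min_support) >= threshold
```

Under these assumptions, expected support costs time linear in the number of sources, while the support distribution takes quadratic time. The intractable cases mentioned in the abstract arise for other combinations of uncertainty model and measure, which this sketch does not cover.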
Items in LRA are protected by copyright, with all rights reserved, unless otherwise indicated.