Cogprints

A Probabilistic Model for Data Cube Compression and Query Approximation

Missaoui, Rokia and Goutte, Cyril and Kouomou Choupo, Anicet and Boujenoui, Ameur (2007) A Probabilistic Model for Data Cube Compression and Query Approximation. [Conference Paper] (In Press)

Full text available as:

[img]
Preview
PDF
178Kb

Abstract

Databases and data warehouses contain an overwhelming volume of information that users must wade through in order to extract valuable and actionable knowledge to support the decision-making process. This contribution addresses the problem of automatically analyzing large multidimensional tables to get a concise representation of data, identify patterns and provide approximate answers to queries. Since data cubes are nothing but multi-way tables, we propose to analyze the potential of a probabilistic modeling technique, called non-negative multi-way array factorization, for approximating aggregate and multidimensional values. Using such a technique, we compute the set of components (clusters) that best fit the initial data set and whose superposition approximates the original data. The generated components can then be exploited for approximately answering OLAP queries such as roll-up, slice and dice operations. The proposed modeling technique will then be compared against the log-linear modeling technique which has already been used in the literature for compression and outlier detection in data cubes. Finally, three data sets will be used to discuss the potential benefits of non-negative multi-way array factorization.

Item Type:Conference Paper
Keywords:Data warehousing, data cubes, OLAP, approximation, compression, data mining, non-negative multi-way array factorization, log-linear modeling
Subjects:Computer Science > Statistical Models
Computer Science > Artificial Intelligence
ID Code:5702
Deposited By: Goutte, Dr. Cyril
Deposited On:10 Sep 2007
Last Modified:11 Mar 2011 08:56

References in Article

Select the SEEK icon to attempt to find the referenced article. If it does not appear to be in cogprints you will be forwarded to the paracite service. Poorly formated references will probably not work.

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716--723, 1974.

J. L. Ambite, C. Shahabi, R. R. Schmidt, and A. Philpot. Fast approximate evaluation of OLAP queries for integrated statistical data. In Proceedings of the First National Conference on Digital Government Research, 2001.

B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international

conference on Management of data, pages 539--550, New York, NY, USA, 2003. ACM Press.

D. Barbara and X. Wu. Using loglinear models to compress datacubes. In WAIM '00: Proceedings of the First International Conference on Web-Age Information Management, pages 311--322, London, UK, 2000. Springer-Verlag.

D. Barbara and X. Wu. Loglinear-based quasi cubes. J. Intell. Inf. Syst., 16(3):255--276, 2001.

A. Boujenoui and D. Zéghal. Effet de la structure des droits de vote sur la qualité des mécanismes internes de gouvernance: cas des entreprises canadiennes. Canadian Journal of Administrative Sciences, 23(3):183--201, 2006.

K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. The VLDB Journal, 10(2-3):199--223, 2001.

S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1):65--74, 1997.

R. Christensen. Log-linear Models. Springer-Verlag, New York, 1997.

W.E. Deming and F.F. Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal total are known. The Annals of Mathematical Statistics, 11(4):427--444, 1940.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1--38, 1977.

I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Advances in Neural Information Processing Systems 17, pages

283--290, 2005.

Venkatesh Ganti, Mong-Li Lee, and Raghu Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. In VLDB '00: Proceedings of the 26th International Conference on Very Large Data Bases, pages 176--187, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

Eric Gaussier and Cyril Goutte. Relation between PLSA and NMF and implications. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 601--602, New York, NY, USA, 2005. ACM Press.

C. Goutte, R. Missaoui, and A. Boujenoui. Data cube approximation and mining using probabilistic modelling. Technical Report NRC 49284, National Research Council, 2007. http://iit-iti.nrc-cnrc.gc.ca/publications/nrc-49284_e.html.

S.J. Haberman. Analysis of qualitative data, Volume 1. Academic Press, New York, 1978.

Thomas Hofmann. Probabilistic latent semantic analysis. In UAI'99, pages 289--296. Morgan Kaufmann, 1999.

T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data Min. Knowl. Discov., 6(3):219--257, 2002.

Laks V. S. Lakshmanan, Jian Pei, and Yan Zhao. Quotient cube: How to summarize the semantics of a data cube. In Proceedings of the 28th International Conference on Very Large Databases, VLDB, pages 778--789, 2002.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788--791, 1999.

Hongjun Lu, Ling Feng, and Jiawei Han. Beyond intratransaction association analysis: mining multidimensional intertransaction association rules. ACM Trans. Inf. Syst., 18(4):423--454, 2000.

S. Naouali and R. Missaoui. Flexible query answering in data cubes. In DaWaK, pages 221--232, 2005.

Themis Palpanas, Nick Koudas, and Alberto Mendelzon. Using datacube aggregates for approximate querying and deviation detection. IEEE Transactions on Knowledge and Data Engineering, 17(11):1465--1477, 2005.

Sunita Sarawagi, Rakesh Agrawal, and Nimrod Megiddo. Discovery-driven exploration of olap data cubes. In EDBT '98: Proceedings of the 6th International Conference on Extending Database Technology, pages 168--182, London, UK, 1998. Springer-Verlag.

Gideon Schwartz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461--464, 1978.

J. Shanmugasundaram, U. Fayyad, and P. S. Bradley. Compressed data cubes for olap aggregate query approximation on continuous dimensions. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 223--232. ACM Press, 1999.

J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceeding of the SIGMOD'99 Conference, pages 193--204, 1999.

Max Welling and Markus Weber. Positive tensor factorization. Pattern Recognition Letters, 22(12):1255--1261, 2001.

F. Yu and W. Shan. Compressed data cube for approximate OLAP query processing. J. Comput. Sci. Technol., 17(5):625--635, 2002.

Metadata

Repository Staff Only: item control page