--- abstract: | Databases and data warehouses contain an overwhelming volume of information that users must wade through in order to extract valuable and actionable knowledge to support the decision-making process. This contribution addresses the problem of automatically analyzing large multidimensional tables to get a concise representation of data, identify patterns and provide approximate answers to queries. Since data cubes are nothing but multi-way tables, we propose to analyze the potential of a probabilistic modeling technique, called non-negative multi-way array factorization, for approximating aggregate and multidimensional values. Using such a technique, we compute the set of components (clusters) that best fit the initial data set and whose superposition approximates the original data. The generated components can then be exploited for approximately answering OLAP queries such as roll-up, slice and dice operations. The proposed modeling technique will then be compared against the log-linear modeling technique which has already been used in the literature for compression and outlier detection in data cubes. Finally, three data sets will be used to discuss the potential benefits of non-negative multi-way array factorization. altloc: [] chapter: ~ commentary: ~ commref: ~ confdates: 'Nov. 9, 2007' conference: ACM Tenth International Workshop on Data Warehousing and OLAP confloc: 'Lisbon, Portugal' contact_email: ~ creators_id: [] creators_name: - family: Missaoui given: Rokia honourific: '' lineage: '' - family: Goutte given: Cyril honourific: '' lineage: '' - family: Kouomou Choupo given: Anicet honourific: '' lineage: '' - family: Boujenoui given: Ameur honourific: '' lineage: '' date: 2007 date_type: published datestamp: 2007-09-10 department: ~ dir: disk0/00/00/57/02 edit_lock_since: ~ edit_lock_until: ~ edit_lock_user: ~ editors_id: [] editors_name: [] eprint_status: archive eprintid: 5702 fileinfo: /style/images/fileicons/application_pdf.png;/5702/1/DOLAP_final.pdf full_text_status: public importid: ~ institution: ~ isbn: ~ ispublished: inpress issn: ~ item_issues_comment: [] item_issues_count: 0 item_issues_description: [] item_issues_id: [] item_issues_reported_by: [] item_issues_resolved_by: [] item_issues_status: [] item_issues_timestamp: [] item_issues_type: [] keywords: 'Data warehousing, data cubes, OLAP, approximation, compression, data mining, non-negative multi-way array factorization, log-linear modeling' lastmod: 2011-03-11 08:56:57 latitude: ~ longitude: ~ metadata_visibility: show note: ~ number: ~ pagerange: ~ pubdom: FALSE publication: ~ publisher: Sheridan Printing refereed: TRUE referencetext: | H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716--723, 1974. J. L. Ambite, C. Shahabi, R. R. Schmidt, and A. Philpot. Fast approximate evaluation of OLAP queries for integrated statistical data. In Proceedings of the First National Conference on Digital Government Research, 2001. B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 539--550, New York, NY, USA, 2003. ACM Press. D. Barbara and X. Wu. Using loglinear models to compress datacubes. In WAIM '00: Proceedings of the First International Conference on Web-Age Information Management, pages 311--322, London, UK, 2000. Springer-Verlag. D. Barbara and X. Wu. Loglinear-based quasi cubes. J. Intell. Inf. Syst., 16(3):255--276, 2001. A. Boujenoui and D. Zéghal. Effet de la structure des droits de vote sur la qualité des mécanismes internes de gouvernance: cas des entreprises canadiennes. Canadian Journal of Administrative Sciences, 23(3):183--201, 2006. K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. The VLDB Journal, 10(2-3):199--223, 2001. S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1):65--74, 1997. R. Christensen. Log-linear Models. Springer-Verlag, New York, 1997. W.E. Deming and F.F. Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal total are known. The Annals of Mathematical Statistics, 11(4):427--444, 1940. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1--38, 1977. I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Advances in Neural Information Processing Systems 17, pages 283--290, 2005. Venkatesh Ganti, Mong-Li Lee, and Raghu Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. In VLDB '00: Proceedings of the 26th International Conference on Very Large Data Bases, pages 176--187, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. Eric Gaussier and Cyril Goutte. Relation between PLSA and NMF and implications. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 601--602, New York, NY, USA, 2005. ACM Press. C. Goutte, R. Missaoui, and A. Boujenoui. Data cube approximation and mining using probabilistic modelling. Technical Report NRC 49284, National Research Council, 2007. http://iit-iti.nrc-cnrc.gc.ca/publications/nrc-49284_e.html. S.J. Haberman. Analysis of qualitative data, Volume 1. Academic Press, New York, 1978. Thomas Hofmann. Probabilistic latent semantic analysis. In UAI'99, pages 289--296. Morgan Kaufmann, 1999. T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data Min. Knowl. Discov., 6(3):219--257, 2002. Laks V. S. Lakshmanan, Jian Pei, and Yan Zhao. Quotient cube: How to summarize the semantics of a data cube. In Proceedings of the 28th International Conference on Very Large Databases, VLDB, pages 778--789, 2002. Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788--791, 1999. Hongjun Lu, Ling Feng, and Jiawei Han. Beyond intratransaction association analysis: mining multidimensional intertransaction association rules. ACM Trans. Inf. Syst., 18(4):423--454, 2000. S. Naouali and R. Missaoui. Flexible query answering in data cubes. In DaWaK, pages 221--232, 2005. Themis Palpanas, Nick Koudas, and Alberto Mendelzon. Using datacube aggregates for approximate querying and deviation detection. IEEE Transactions on Knowledge and Data Engineering, 17(11):1465--1477, 2005. Sunita Sarawagi, Rakesh Agrawal, and Nimrod Megiddo. Discovery-driven exploration of olap data cubes. In EDBT '98: Proceedings of the 6th International Conference on Extending Database Technology, pages 168--182, London, UK, 1998. Springer-Verlag. Gideon Schwartz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461--464, 1978. J. Shanmugasundaram, U. Fayyad, and P. S. Bradley. Compressed data cubes for olap aggregate query approximation on continuous dimensions. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 223--232. ACM Press, 1999. J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceeding of the SIGMOD'99 Conference, pages 193--204, 1999. Max Welling and Markus Weber. Positive tensor factorization. Pattern Recognition Letters, 22(12):1255--1261, 2001. F. Yu and W. Shan. Compressed data cube for approximate OLAP query processing. J. Comput. Sci. Technol., 17(5):625--635, 2002. relation_type: [] relation_uri: [] reportno: ~ rev_number: 13 series: ~ source: ~ status_changed: 2007-09-18 02:42:48 subjects: - comp-sci-stat-model - comp-sci-art-intel succeeds: ~ suggestions: ~ sword_depositor: ~ sword_slug: ~ thesistype: ~ title: A Probabilistic Model for Data Cube Compression and Query Approximation type: confpaper userid: 7131 volume: ~