Title: Learned Index for Subset Query of NoSQL Databases (NoSQL 資料庫子集查詢的學習索引)
Author: Hsu, Hsuan-Hsiang (許軒祥)
Contributors: Shan, Man-Kwan (沈錳坤); Hsu, Hsuan-Hsiang (許軒祥)
Keywords: Learned Index; NoSQL Database; Subset Query
Date: 2024
Upload time: 4-Sep-2024 14:59:08 (UTC+8)
Abstract:
NoSQL databases target semi-structured or unstructured data, and subset queries are common in them. In recent years, learned index techniques based on machine learning have opened new avenues for database indexing. Compared with traditional B-Trees, learned indexes offer significant advantages in query time: the query time of a traditional index is dominated by memory access, whereas that of a learned index is dominated by CPU computation. Existing research on learned indexes focuses mainly on queries over traditional relational databases. For subset queries, the only existing approach is the recent DGM, which is based on Deep Sets. DGM is designed for space efficiency but leaves room for improvement in query speed.

This thesis proposes two novel learned index techniques, LI4Subset-D and LI4Subset-P, to improve the performance of subset queries in NoSQL databases. LI4Subset-D builds on Deep Sets, and LI4Subset-P builds on the PGM-index, a learned index. Experimental results show that LI4Subset-D is nearly 149 times faster than DGM in query speed, at the cost of an approximately 7-fold increase in memory space. LI4Subset-P is approximately 3235 times faster than DGM in query speed, at the cost of an approximately 4-fold increase in memory space.

References:
[1] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, The Case for Learned Index Structures, in Proceedings of the ACM 2018 International Conference on Management of Data (SIGMOD), pp. 489-504, 2018.
[2] A. Davitkova, D. Gjurovski, and S. Michel, Learning over Sets for Databases, in Proceedings of the 27th International Conference on Extending Database Technology (EDBT), pp. 68-80, 2024.
[3] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, Deep Sets, in Proceedings of Advances in Neural Information Processing Systems (NIPS), vol. 30, 2017.
[4] P. Ferragina and G. Vinciguerra, The PGM-index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds, in Proceedings of the VLDB Endowment, vol. 13, no. 8, pp. 1162-1175, 2020.
[5] U. Deppisch, S-tree: A Dynamic Balanced Signature Index for Office Retrieval, in Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 77-87, 1986.
[6] M. Morzy, T. Morzy, A. Nanopoulos, and Y. Manolopoulos, Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes, in Proceedings of the 7th East European Conference on Advances in Databases and Information Systems, Springer, pp. 236-252, 2003.
[7] S. Helmer, R. Aly, T. Neumann, and G. Moerkotte, Indexing Set-Valued Attributes with a Multi-Level Extendible Hashing Scheme, in Proceedings of the 18th International Conference on Database and Expert Systems Applications, Springer, pp. 98-108, 2007.
[8] S. Bevc and I. Savnik, Using Tries for Subset and Superset Queries, in Proceedings of the ITI 2009 31st International Conference on Information Technology Interfaces, IEEE, pp. 147-152, 2009.
[9] I. Savnik, Efficient Subset and Superset Queries, in DB&Local Proceedings, Citeseer, pp. 45-57, 2012.
[10] I. Savnik, Index Data Structure for Fast Subset and Superset Queries, in Proceedings of the International Conference on Availability, Reliability, and Security, Springer, pp. 134-148, 2013.
[11] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska, FITing-Tree: A Data-Aware Index Structure, in Proceedings of the 2019 ACM International Conference on Management of Data (SIGMOD), pp. 1189-1206, 2019.
[12] J. Rao and K. A. Ross, Cache Conscious Indexing for Decision-Support in Main Memory, in Proceedings of the 25th VLDB Conference, 1999.
[13] A. Kipf et al., RadixSpline: A Single-Pass Learned Index, in Proceedings of the 3rd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, pp. 1-5, 2020.
[14] R. Marcus et al., Benchmarking Learned Indexes, in Proceedings of the VLDB Endowment, vol. 14, no. 1, 2020.

Description: Master's thesis
National Chengchi University
Department of Computer Science
111753122
Source: http://thesis.lib.nccu.edu.tw/record/#G0111753122
Type: thesis
Advisor: Shan, Man-Kwan (沈錳坤)
Other identifier: G0111753122
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/153375
Format: application/pdf, 1262048 bytes

Table of contents:
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
Chapter 2 Related Work
2.1 Subset Queries
2.2 Learned Indexes
Chapter 3 Methodology
3.1 Problem Definition
3.2 Research Framework
3.3 Data Preprocessing
3.4 Inversion Construction
3.5 Set2Seq
3.6 Partitioning
3.7 Ranking
3.8 Deep Sets
3.9 PGM-index
3.10 Key Lookup
Chapter 4 Experimental Design and Result Analysis
4.1 Experimental Design and Evaluation Methods
4.1.1 Datasets
4.1.2 Query Evaluation Method
4.2 Experimental Results and Analysis
4.2.1 Performance Comparison of LI4Subset-D and DGM
4.2.2 Effect of Model Complexity on LI4Subset-D Performance
4.2.3 Effect of Set2Seq on LI4Subset-D Performance
4.2.4 Effect of Seq2Int Hash on Partitioning in LI4Subset-D
4.2.5 Effect of Partitioning on LI4Subset-D Performance
4.2.6 Effect of Batch Processing on the Performance of LI4Subset-D and DGM
4.2.7 Performance Comparison of LI4Subset-P and DGM
4.2.8 Effect of Set2Seq on LI4Subset-P Performance
4.2.9 Effect of Model Complexity on LI4Subset-P Performance
4.2.10 Effect of Partitioning on LI4Subset-P Performance
4.2.11 Memory Usage Comparison of Learned and Traditional Index Methods
4.2.12 Discussion of Implementation Issues
Chapter 5 Conclusion
References
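
Illustrative sketch (not part of the thesis record): the abstract relies on two notions, a subset query and a learned index, which the Python sketch below tries to make concrete. It assumes that a subset query returns every record whose set-valued attribute contains all elements of the query set, and it answers such a query with a plain inverted index. It also shows a toy learned index that predicts a key's position with a single linear model and then corrects the prediction with a search bounded by the model's maximum error. This is not LI4Subset-D, LI4Subset-P, DGM, or the actual PGM-index; the names `build_inverted_index`, `subset_query`, and `LinearLearnedIndex` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def build_inverted_index(records):
    """Map each element to the ids of the records whose set contains it."""
    index = defaultdict(list)
    for rid, items in enumerate(records):
        for item in items:
            index[item].append(rid)
    return index


def subset_query(index, query):
    """Return the ids of records that contain every element of `query`.

    Assumption: this is the query semantics; the record does not define it.
    """
    postings = [set(index.get(item, ())) for item in query]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


class LinearLearnedIndex:
    """Toy learned index: one linear model plus an error-bounded search."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        first, last = self.keys[0], self.keys[-1]
        # Fit a line through the first and last (key, position) pairs.
        self.slope = (n - 1) / (last - first) if last > first else 0.0
        self.intercept = -self.slope * first
        # Largest distance between predicted and true position of any key.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of `key` in the sorted keys, or None."""
        n = len(self.keys)
        guess = self._predict(key)
        # Only the window [guess - max_err, guess + max_err] can hold the key.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        pos = bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return None


records = [{"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "b"}]
inverted = build_inverted_index(records)
print(subset_query(inverted, {"a", "b"}))  # ids of records containing a and b

learned = LinearLearnedIndex([3, 8, 15, 16, 42, 100])
print(learned.lookup(42), learned.lookup(5))
```

The lookup cost of the toy learned index is one arithmetic prediction plus a search over at most `2 * max_err + 1` positions, which is one way to read the abstract's remark that learned-index query time is dominated by CPU computation rather than memory access.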