Title Learned Index for Subset Query of NoSQL Databases (NoSQL 資料庫子集查詢的學習索引)
Author Hsu, Hsuan-Hsiang (許軒祥)
Contributors Shan, Man-Kwan (沈錳坤)
Hsu, Hsuan-Hsiang (許軒祥)
Keywords Learned Index
NoSQL Database
Subset Query
Date 2024
Upload time 4-Sep-2024 14:59:08 (UTC+8)
Abstract NoSQL databases target semi-structured or unstructured data, and subset queries are common in them. In recent years, learned index techniques based on machine learning have opened new avenues for database indexing. Compared with traditional B-Trees, learned indexes offer significant advantages in query time: the query time of a traditional index is dominated by memory access, whereas that of a learned index is dominated by CPU computation. Existing research on learned indexes mainly targets queries over traditional relational databases. For subset queries, the only existing approach is the recent DGM, which is based on Deep Sets. DGM is designed for space efficiency but leaves room for improvement in query speed. This thesis proposes two novel learned index techniques, LI4Subset-D and LI4Subset-P, to improve the performance of subset queries in NoSQL databases. LI4Subset-D builds on Deep Sets, and LI4Subset-P builds on the PGM-index, a learned index. Experimental results show that LI4Subset-D answers queries nearly 149 times faster than DGM, at the cost of a roughly 7-fold increase in memory space. LI4Subset-P is approximately 3235 times faster than DGM, at the cost of a roughly 4-fold increase in memory space.
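The contrast the abstract draws between memory-bound and CPU-bound lookups can be illustrated with a toy learned index. The sketch below is not the thesis's LI4Subset-D or LI4Subset-P, and it is not the PGM-index. It is a minimal illustration of the general idea, assuming distinct numeric keys: fit one straight line from key to position, record the worst prediction error, and search only inside that error window. The class name `LinearLearnedIndex` and its methods are hypothetical.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index: one least-squares line plus its worst-case error.

    Illustrative only; assumes distinct numeric keys (duplicates are dropped).
    """

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        n = len(self.keys)
        if n == 0:
            raise ValueError("at least one key is required")
        # Least-squares fit of position (0..n-1) as a linear function of key.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (p - mean_pos)
                  for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Largest gap between a key's true position and its predicted position.
        self.max_err = max(abs(p - self._predict(k))
                           for p, k in enumerate(self.keys))

    def _predict(self, key):
        # Model inference: pure arithmetic, no pointer chasing.
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted key list, or None."""
        n = len(self.keys)
        guess = self._predict(key)
        # Any stored key lies within max_err of its prediction, so a
        # binary search over this window is sufficient.
        lo = min(max(guess - self.max_err, 0), n)
        hi = max(min(guess + self.max_err + 1, n), lo)
        i = bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 108, 4096])
position = index.lookup(23)
missing = index.lookup(5)
```

The size of the search window depends on how well the model fits the key distribution. A single line fits skewed keys poorly, which is why practical learned indexes use several models or segments rather than one.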
References
[1] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, The Case for Learned Index Structures, in Proceedings of the ACM 2018 International Conference on Management of Data (SIGMOD), pp. 489-504, 2018.
[2] A. Davitkova, D. Gjurovski, and S. Michel, Learning over Sets for Databases, in Proceedings of the 27th International Conference on Extending Database Technology (EDBT), pp. 68-80, 2024.
[3] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, Deep Sets, in Proceedings of Advances in Neural Information Processing Systems (NIPS), vol. 30, 2017.
[4] P. Ferragina and G. Vinciguerra, The PGM-index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds, in Proceedings of the VLDB Endowment, vol. 13, no. 8, pp. 1162-1175, 2020.
[5] U. Deppisch, S-tree: A Dynamic Balanced Signature Index for Office Retrieval, in Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 77-87, 1986.
[6] M. Morzy, T. Morzy, A. Nanopoulos, and Y. Manolopoulos, Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes, in Proceedings of the 7th East European Conference on Advances in Databases and Information Systems, Springer, pp. 236-252, 2003.
[7] S. Helmer, R. Aly, T. Neumann, and G. Moerkotte, Indexing Set-Valued Attributes with a Multi-Level Extendible Hashing Scheme, in Proceedings of the 18th International Conference on Database and Expert Systems Applications, Springer, pp. 98-108, 2007.
[8] S. Bevc and I. Savnik, Using Tries for Subset and Superset Queries, in Proceedings of the ITI 2009 31st International Conference on Information Technology Interfaces, IEEE, pp. 147-152, 2009.
[9] I. Savnik, Efficient Subset and Superset Queries, in DB&Local Proceedings, Citeseer, pp. 45-57, 2012.
[10] I. Savnik, Index Data Structure for Fast Subset and Superset Queries, in Proceedings of International Conference on Availability, Reliability, and Security, Springer, pp. 134-148, 2013.
[11] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska, FITing-Tree: A Data-Aware Index Structure, in Proceedings of the 2019 ACM International Conference on Management of Data (SIGMOD), pp. 1189-1206, 2019.
[12] J. Rao and K. A. Ross, Cache Conscious Indexing for Decision-Support in Main Memory, in Proceedings of the 25th VLDB Conference, 1999.
[13] A. Kipf et al., RadixSpline: A Single-Pass Learned Index, in Proceedings of the 3rd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, pp. 1-5, 2020.
[14] R. Marcus et al., Benchmarking Learned Indexes, in Proceedings of the VLDB Endowment, vol. 14, no. 1, 2020.
Description Master's thesis
National Chengchi University
Department of Computer Science
111753122
Source http://thesis.lib.nccu.edu.tw/record/#G0111753122
Type thesis
dc.contributor.advisor Shan, Man-Kwan (沈錳坤)
dc.date.accessioned 4-Sep-2024 14:59:08 (UTC+8)
dc.date.available 4-Sep-2024 14:59:08 (UTC+8)
dc.date.issued (Upload time) 4-Sep-2024 14:59:08 (UTC+8)
dc.identifier (Other Identifiers) G0111753122
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/153375
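dc.description.tableofcontents
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
  1.3 Research Objectives
Chapter 2 Related Work
  2.1 Subset Queries
  2.2 Learned Indexes
Chapter 3 Methodology
  3.1 Problem Definition
  3.2 Research Framework
  3.3 Data Preprocessing
  3.4 Inversion Construction
  3.5 Set2Seq
  3.6 Partitioning
  3.7 Ranking
  3.8 Deep Sets
  3.9 PGM-index
  3.10 Key Lookup
Chapter 4 Experimental Design and Result Analysis
  4.1 Experimental Design and Evaluation Method
    4.1.1 Datasets
    4.1.2 Query Evaluation Method
  4.2 Experimental Results and Analysis
    4.2.1 Performance Comparison of LI4Subset-D and DGM
    4.2.2 Effect of Model Complexity on LI4Subset-D Performance
    4.2.3 Effect of Set2Seq on LI4Subset-D Performance
    4.2.4 Effect of Seq2Int Hash on Partitioning in LI4Subset-D
    4.2.5 Effect of Partitioning on LI4Subset-D Performance
    4.2.6 Effect of Batch Processing on LI4Subset-D and DGM Performance
    4.2.7 Performance Comparison of LI4Subset-P and DGM
    4.2.8 Effect of Set2Seq on LI4Subset-P Performance
    4.2.9 Effect of Model Complexity on LI4Subset-P Performance
    4.2.10 Effect of Partitioning on LI4Subset-P Performance
    4.2.11 Memory Usage Comparison of Learned and Traditional Index Methods
    4.2.12 Discussion of Implementation Issues
Chapter 5 Conclusion
References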
dc.format.extent 1262048 bytes
dc.format.mimetype application/pdf