Title 協同多層次描述方法之無人機定位機制
Cooperative Localization Mechanism for Drones Using Multi-level Representations
Author 黃曉柔 (Huang, Hsiao-Jou)
Contributors 廖文宏 (Liao, Wen-Hung), advisor
黃曉柔 (Huang, Hsiao-Jou), author
Keywords GNSS-denied Environment
UAV Visual Localization
Feature Extraction and Matching
Semantic Segmentation
Oriented Bounding Box
Date 2025
Uploaded 1-Sep-2025 16:57:13 (UTC+8)
Abstract The rapid advancement of unmanned aerial vehicle (UAV) technology has brought transformative changes to both military and civilian sectors. As UAVs are increasingly deployed for tasks such as ground observation, disaster response, and smart city inspection, there is a growing demand for reliable autonomous localization systems that function in GNSS-denied environments. Traditional systems that rely on Global Navigation Satellite Systems (GNSS) often suffer from signal loss or amplified errors in urban canyons, indoor settings, or adverse weather conditions, highlighting the need for robust alternatives. To address this challenge, this study proposes a multi-level UAV visual localization framework that integrates semantic, geometric, and feature-based matching strategies to enhance both accuracy and resilience. The proposed system is highly modular and adapts its pipeline based on the availability of reference imagery, operating in either a reference-based or reference-free mode. At the semantic level, segmentation masks are generated using a fine-tuned Segment Anything Model (SAM), followed by structural-weighted Semantic IoU (S-IoU) filtering and a grid-based semantic layout check to prune candidate images. The geometric level uses Oriented Bounding Boxes (OBB) and a Vector IoU measure to assess the consistency of object configurations, while the feature level employs the Generalizable Image Matcher (GIM) and LightGlue for detailed local feature matching. Experimental results demonstrate that the system can accurately estimate UAV positions even without GNSS support, with the semantic and geometric layers handling the majority of cases and the feature layer serving as a robust fallback for challenging scenarios.
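The abstract names three matching levels but gives none of their formal definitions, so the following Python sketches are illustrative only. All helper names (weighted_semantic_iou, grid_layout_match, obb_from_mask, rotation_offset), weights, and file paths are assumptions introduced here, not the thesis' implementation.

First, a minimal sketch of the semantic-level filters, assuming the structural-weighted S-IoU is a class-weighted mean of per-class IoU and the "nine-grid" check compares the dominant class in each cell of a 3×3 partition:

```python
# Illustrative semantic-level filters; the weighting and grid rule are assumptions.
import numpy as np

def weighted_semantic_iou(mask_a, mask_b, class_weights):
    """Class-weighted mean of per-class IoU over two integer label masks."""
    score, total = 0.0, 0.0
    for cls, w in class_weights.items():
        a, b = (mask_a == cls), (mask_b == cls)
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class absent from both views
        score += w * np.logical_and(a, b).sum() / union
        total += w
    return score / total if total else 0.0

def grid_layout_match(mask_a, mask_b, grid=3):
    """Fraction of grid cells whose dominant semantic class agrees."""
    h, w = mask_a.shape
    hits = 0
    for i in range(grid):
        for j in range(grid):
            cell_a = mask_a[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            cell_b = mask_b[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            hits += (np.bincount(cell_a.ravel()).argmax()
                     == np.bincount(cell_b.ravel()).argmax())
    return hits / grid**2
```

At the object level, one plausible way to obtain an OBB from a binary object mask is OpenCV's minimum-area rotated rectangle; the angle difference between query and reference OBBs can then drive the rotation compensation the abstract mentions. The thesis' Vector IoU formula is not reproduced in this record and is omitted:

```python
# Sketch of OBB extraction and angle compensation; not the thesis' exact method.
import cv2
import numpy as np

def obb_from_mask(binary_mask):
    """Minimum-area rotated rectangle ((cx, cy), (w, h), angle) of a mask."""
    contours, _ = cv2.findContours(binary_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # empty mask: no object to box
    points = np.vstack([c.reshape(-1, 2) for c in contours])
    return cv2.minAreaRect(points)

def rotation_offset(obb_query, obb_ref):
    """Signed angle difference used to rotate the query mask before comparison."""
    return obb_query[2] - obb_ref[2]
```

For the feature level, the sketch below follows the public LightGlue reference implementation's API with SuperPoint features. The thesis pairs LightGlue with GIM; loading GIM-trained weights is not shown here, and the image paths are placeholders:

```python
# Feature-level matching with the open-source LightGlue package (API as in its
# README); the GIM-trained weights used in the thesis are not loaded here.
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
extractor = SuperPoint(max_num_keypoints=2048).eval().to(device)
matcher = LightGlue(features="superpoint").eval().to(device)

image0 = load_image("uav_frame.png").to(device)   # query UAV frame (placeholder)
image1 = load_image("reference.png").to(device)   # candidate reference (placeholder)
feats0, feats1 = extractor.extract(image0), extractor.extract(image1)
matches01 = matcher({"image0": feats0, "image1": feats1})
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]
matches = matches01["matches"]                    # (K, 2) keypoint index pairs
kpts0 = feats0["keypoints"][matches[:, 0]]
kpts1 = feats1["keypoints"][matches[:, 1]]
```

Read together with the abstract, a natural integration is a cascade: candidates that survive the semantic and object filters are ranked by feature-match count, consistent with the described role of the feature layer as a fallback.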
Description Master's thesis
National Chengchi University
Department of Computer Science
112753117
Source http://thesis.lib.nccu.edu.tw/record/#G0112753117
Type thesis
URI https://nccur.lib.nccu.edu.tw/handle/140.119/159413
Table of Contents
Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives and Contributions
1.3 Thesis Organization
Chapter 2 Technical Background and Related Work
2.1 Vision-based Localization
2.1.1 Visual Odometry
2.1.2 Simultaneous Localization and Mapping (SLAM)
2.2 Semantic Segmentation
2.2.1 Deep-learning-based Semantic Segmentation
2.2.2 Segment Anything Model (SAM)
2.2.3 Challenges of Semantic Segmentation in Aerial Views
2.3 Object Detection and Oriented Bounding Boxes (OBB)
2.3.1 OBB Concepts and Representation
2.3.2 Geometry-based OBB Extraction
2.4 Feature Extraction and Matching
2.4.1 Traditional Feature Extraction and Matching
2.4.2 Deep-learning-based Feature Extraction and Matching
Chapter 3 System Architecture and Methods
3.1 Architecture Overview and Design Principles
3.2 Semantic-Level Matching
3.2.1 Semantic Class Filtering
3.2.2 Noise Region Filtering
3.2.3 Nine-grid Semantic Layout Consistency Matching
3.2.4 Semantic IoU (S-IoU)
3.2.5 Observations on Semantic-level Correspondence Structure
3.2.6 Second-round Semantic-level Matching
3.3 Object-Level Matching
3.3.1 OBB Representation and Extraction
3.3.2 OBB Angle Difference and Mask Rotation Compensation
3.3.3 Geometric Similarity: Vector IoU
3.4 Feature-Level Matching
3.4.1 Feature-level Method Evaluation
3.4.2 Feature-level Testing on Edge Devices
3.4.3 Edge Device Overview: Jetson Orin Nano
3.4.4 Experimental Results and Analysis
Chapter 4 Feasibility Verification and Analysis
4.1 Preliminary Work: Adapting the Semantic Segmentation Model
4.2 General Cases
4.2.1 Experimental Design and Image Data
4.2.2 Experimental Results and Analysis
4.2.3 Threshold Sensitivity Observations and Analysis
4.3 Frame-drop Cases
4.3.1 Experimental Design and Image Data
4.3.2 Results and Analysis (reference-based mode)
4.3.3 Results and Analysis (reference-free mode)
4.4 Simulated Return-flight Cases
4.4.1 Experimental Design and Image Data
4.4.2 Experimental Results and Analysis
4.5 Matching Scenario Classification and Strategy Analysis
4.5.1 Ideal Case: localization completed at the semantic level
4.5.2 Common Case: object- or feature-level supplement needed
4.5.3 Semantic Failure Case: semantic information fails and the feature model must compensate
4.5.4 Failure Case: situations the current pipeline cannot handle
Chapter 5 Conclusions and Future Work
References
Format 45462295 bytes, application/pdf