Restaurant Robot Based on Reinforcement Learning: Application of Reception and Delivery

Title	基於強化學習下的餐廳機器人— 接待與送餐之應用 (Restaurant Robot Based on Reinforcement Learning: Application of Reception and Delivery)
Author	鄭玉筠 (Cheng, Yu-Yun)
Contributors	蔡子傑 (Tsai, Tzu-Chieh); 鄭玉筠 (Cheng, Yu-Yun)
Keywords	Reinforcement Learning Proximal Policy Optimization (RL-PPO) algorithm; Markov Decision Process; Partially Observable; Restaurant Robot; Reception and Delivery
Date	2023
Upload time	9-Mar-2023 18:26:04 (UTC+8)
Abstract	Taiwan's population declined for the first time in 2020, and the falling birth rate has become a major cause of labor shortages across industries. In addition, during outbreaks of highly contagious diseases, close contact between people is undesirable. The catering service industry faces both problems. Introducing an automated AI system in which service robots take over part of the reception and meal-delivery work can ease the labor shortage and reduce the risk of infection. A restaurant with a multi-robot service system can use job scheduling to complete different tasks at the same time, which reduces the labor required and can also raise customer satisfaction. This thesis proposes a training framework for a multi-robot service system based on the Reinforcement Learning Proximal Policy Optimization (RL-PPO) algorithm and explores the feasibility of building an automated smart restaurant that needs less manpower. The system integrates OpenAI Gym and Pygame as the simulation environment, applies RL-PPO, and compares performance in the final stage. We build a model of the restaurant service robot system and evaluate it on two indicators: the number of customers served (to be increased) and the customer waiting time (to be reduced), both of which are positively correlated with the path-planning distance traveled by the robots. Within this framework, other indicators, such as customer satisfaction and employee productivity per working hour, could also be optimized. Because the problem involves sequential decisions that must also be made in real time, we model the two service tasks as a Markov Decision Process and solve it with the RL-PPO algorithm. The simulation demonstrates that a robot system trained under this RL-PPO framework needs only partial observations of the restaurant and, through self-learning, maintains its service performance. In other words, when the restaurant temporarily adjusts its service hardware layout, the robots can continue reception and delivery work without any change to the system or architecture. The framework is therefore more flexible, generalizable, and stable, and can serve as a basis for next-generation restaurant service robot systems.
References
[1] A. M. Turing (1950). Computing Machinery and Intelligence. Mind, New Series, 59(236), 433-460.
[2] David Silver (2016). Tutorial: Deep Reinforcement Learning.
[3] Chathurangi Shyalika, Thushari Silva, Asoka Karunananda (2020). Reinforcement Learning in Dynamic Task Scheduling: A Review. SN Computer Science, 1(6), 306.
[4] Byrd, K., Fan, A., et al. (2021). Robot vs human: expectations, performances and gaps in off-premise restaurant service modes. International Journal of Contemporary Hospitality Management, 33(11), 3996-4016.
[5] Jun Yang, Xinghui You, et al. (2019). Application of reinforcement learning in UAV cluster task scheduling. Future Generation Computer Systems, 95, 140-148.
[6] Tingxiang Fan, Pinxin Long, et al. (2020). Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios. SAGE Journals.
[7] Takeshi Shimmura, Ryosuke Ichikari, et al. (2020). Service robot introduction to a restaurant enhances both labor productivity and service quality. Procedia CIRP, 88, 589-594.
[8] Ruijun Yang, Liang Cheng (2019). Path Planning of Restaurant Service Robot Based on A-star Algorithms with Updated Weights. 2019 12th International Symposium on Computational Intelligence and Design (ISCID).
[9] Thanh Thi Nguyen, Ngoc Duy Nguyen, et al. (2020). Deep Reinforcement Learning for Multi-Agent Systems: A Review of Challenges, Solutions and Applications. IEEE Transactions on Cybernetics, 50(9).
[10] Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
[11] Thorndike, E. L. (1898). Animal intelligence: an experimental study of the associate processes in animals. American Psychologist, 53(10), 1125-1127.
[12] Deng, L., and Yu, D. (2014). Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3-4), 197-387.
[13] Min-Gyu Kim, Heeyoon Yoon, et al. (2021). Investigating Frontline Service Employees to Identify Behavioral Goals of Restaurant Service Robot: An Exploratory Study. 2021 18th International Conference on Ubiquitous Robots (UR).
[14] Prejitha C. T., Vikram Raj N., et al. (2020). Design of Restaurant Service Robot for Contact less and Hygienic Eating Experience. International Research Journal of Engineering and Technology (IRJET), 07(08), 2938-2943.
[15] OpenAI (Christopher Berner, Greg Brockman, et al.) (2021). Dota 2 with Large Scale Deep Reinforcement Learning. arXiv:1912.06680v1.
[16] K. Lakshmi Narayanan, et al. (2021). Fuzzy Guided Autonomous Nursing Robot through Wireless Beacon Network. Multimedia Tools and Applications. doi:10.1007/s11042-021-11264-6.
[17] Lai, Chien-Jung; Tsai, Ching-Pei (2018). Design of Introducing Service Robot into Catering Services. Proceedings of the 2018 International Conference on Service Robotics Technologies, 62-66. doi:10.1145/3208833.3208837.
[18] Osman El-Said, Sara Al Hajri (2022). Are customers happy with robot service? Investigating satisfaction with robot service restaurants during the COVID-19 pandemic. Heliyon, 8(10). doi:10.1016/j.heliyon.2022.e08986.
[19] Hideharu Ouchi, Ryosuke Ueno, et al. (2019). Development of Robot Restaurant Simulator. 2019 16th International Conference on Ubiquitous Robots.
[20] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
[21] Vanessa Hayes, et al. (2019). Human origins in a southern African palaeo-wetland and first migrations. Nature.
[22] Thorndike, E. L. (1898). Animal intelligence: an experimental study of the associate processes in animals. American Psychologist, 53(10), 1125-1127.
[23] Minsky, M. L. (1954). Theory of neural-analog reinforcement systems and its application to the brain model problem. Princeton University.
[24] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484.
[25] Beakcheol Jang, Myeonghwi Kim, et al. (2019). Q-Learning Algorithms: A Comprehensive Classification and Applications. IEEE Access, 7.
[26] Schulman, John, et al. (2015). Trust Region Policy Optimization. arXiv:1502.05477.
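Description	Master's degree; National Chengchi University (國立政治大學); Department of Computer Science, In-service Master's Program (資訊科學系碩士在職專班); 109971017
Source	http://thesis.lib.nccu.edu.tw/record/#G0109971017
Type	thesis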
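
The abstract states that the service tasks are modeled as a Markov Decision Process and solved with RL-PPO. As a rough illustration of what PPO optimizes, the sketch below computes the clipped surrogate objective described in the PPO paper (reference [20]) for a batch of samples. It is a minimal sketch and not the thesis's implementation: the function name `ppo_clip_objective`, the plain-list inputs, and the default `epsilon=0.2` are illustrative assumptions, and the thesis's own PPO parameter settings are not part of this record.

```python
from typing import Sequence


def ppo_clip_objective(
    ratios: Sequence[float],
    advantages: Sequence[float],
    epsilon: float = 0.2,  # illustrative clip range, not taken from the thesis
) -> float:
    """Mean clipped surrogate objective of PPO (to be maximized).

    ratios[t]     = pi_new(a_t | s_t) / pi_old(a_t | s_t)
    advantages[t] = advantage estimate for step t
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("at least one sample is required")

    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the probability ratio to [1 - epsilon, 1 + epsilon].
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)
```

Taking the minimum of the two terms removes the incentive to move the new policy far from the old one in a single update, which is the property that makes PPO a common choice for problems that need stable training.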

dc.contributor.advisor	蔡子傑	zh_TW
dc.contributor.advisor	Tsai, Tzu-Chieh	en_US
dc.contributor.author (Authors)	鄭玉筠	zh_TW
dc.contributor.author (Authors)	Cheng, Yu-Yun	en_US
dc.creator (Author)	鄭玉筠	zh_TW
dc.creator (Author)	Cheng, Yu-Yun	en_US
dc.date (Date)	2023	en_US
dc.date.accessioned	9-Mar-2023 18:26:04 (UTC+8)	-
dc.date.available	9-Mar-2023 18:26:04 (UTC+8)	-
dc.date.issued (Upload time)	9-Mar-2023 18:26:04 (UTC+8)	-
dc.identifier (Other Identifiers)	G0109971017	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/143784	-
dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	資訊科學系碩士在職專班 (Department of Computer Science, In-service Master's Program)	zh_TW
dc.description (Description)	109971017	zh_TW
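dc.description.tableofcontents
Contents
  Acknowledgments
  Abstract (Chinese)
  Abstract (English)
  Table of Contents
  List of Tables
  List of Figures
  Chapter 1: Introduction
    Section 1: Thesis Overview
    Section 2: Research Motivation and Objectives
    Section 3: Literature Review
    Section 4: Thesis Organization
  Chapter 2: Theoretical Background
    Section 1: Reinforcement Learning
    Section 2: Proximal Policy Optimization (PPO) Algorithm
    Section 3: Shortest-Path Planning: The A-Star Algorithm
  Chapter 3: Simulation System Architecture
    Section 1: Simulation Environment
    Section 2: Simulation System Architecture
  Chapter 4: Experimental Design and Analysis of Results
    Section 1: Experimental Design
    Section 2: Analysis of Results
  Chapter 5: Conclusions and Future Work
  References
List of Tables
  Table 1: System environment model parameters
  Table 2: PPO algorithm parameter settings in the system
  Table 3: Random-environment parameters in the simulation system
  Table 4: Random obstacle parameters in the simulation system
  Table 5: Average customer waiting time, fixed restaurant mode
  Table 6: Number of customers served and average robot travel distance, fixed restaurant mode
  Table 7: Performance comparison of the central scheduling system, fixed restaurant mode
  Table 8: Average customer waiting time, random restaurant mode
  Table 9: Number of customers served and average robot travel distance, random restaurant mode
  Table 10: Comparison of average customer waiting time, random-obstacle restaurant mode
  Table 11: Number of customers served and average robot travel distance, random restaurant mode
List of Figures
  Figure 1: Relationship among artificial intelligence, machine learning, and reinforcement learning
  Figure 2: Reinforcement learning methods
  Figure 3: Classification of reinforcement learning algorithms
  Figure 4: Prototype of the Q-Learning algorithm
  Figure 5: Actor-Critic method
  Figure 6: PPO algorithm
  Figure 7: TRPO [26] and PPO [20] algorithms
  Figure 8: PPO2 [20] algorithm
  Figure 9: PPO algorithm [20]
  Figure 10: Derivation of the A* algorithm formula
  Figure 11: Interactions between the simulation system and its roles
  Figure 12: Restaurant layout
  Figure 13: Interaction between customers and the central scheduling system
  Figure 14: Interaction between restaurant service robots and the central scheduling system
  Figure 15: State-transition model of restaurant service robots, customers, and central scheduling
  Figure 16: Deep neural network of the central scheduling system under the PPO architecture
  Figure 17: Deep neural network of the service robot system under the PPO architecture
  Figure 18: Relationship among Agent, Environment, and Reward
  Figure 19: Reinforcement learning PPO algorithm of the simulation system
  Figure 20: Average rewards of the central scheduling system, fixed restaurant mode
  Figure 21: Average robot rewards of the service robot path-planning system, fixed restaurant mode
  Figure 22: Video screenshot of the simulation system, RL-PPO group, fixed restaurant mode
  Figure 23: Comparison of average customer waiting time, fixed restaurant mode
  Figure 24: Number of customers served and average robot travel distance, fixed restaurant mode
  Figure 25: Robot footprint map, RL-PPO group
  Figure 26: Robot footprint map, comparison group (FIFO+A*)
  Figure 27: Performance comparison of the central scheduling system, fixed restaurant mode
  Figure 28: Average rewards of the central scheduling system, random restaurant mode
  Figure 29: Average robot rewards of the service robot path-planning system, random restaurant mode
  Figure 30: Video screenshot of the simulation system, A* group, random restaurant mode
  Figure 31: Video screenshot of the simulation system, RL-PPO group, random restaurant mode
  Figure 32: Video screenshot of the simulation system, RL-PPO group, random restaurant mode
  Figure 33: Comparison of average customer waiting time, random restaurant mode
  Figure 34: Number of customers served and average robot travel distance, random restaurant mode
  Figure 35: Comparison of average customer waiting time, random-obstacle restaurant mode
  Figure 36: Number of customers served and average robot travel distance, random-obstacle restaurant mode
  Figure 37: Video screenshot of the simulation system, RL-PPO group, random-obstacle restaurant mode
  Figure 38: Effect of illegal actions on the robots' average customer waiting time
  Figure 39: Effect of model sharing on the robots' average customer waiting time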
dc.format.extent	3089861 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0109971017	en_US
dc.subject (Keywords)	強化學習近端策略優化（RL-PPO）演算法	zh_TW
dc.subject (Keywords)	馬可夫決策過程	zh_TW
dc.subject (Keywords)	局部觀測	zh_TW
dc.subject (Keywords)	餐廳機器人	zh_TW
dc.subject (Keywords)	接待與送餐	zh_TW
dc.subject (關鍵詞)	Reinforcement Learning—Proximal Policy Optimization Algorithm	en_US
dc.subject (關鍵詞)	Markov Decision Process	en_US
dc.subject (關鍵詞)	Partially Observable	en_US
dc.subject (關鍵詞)	Robot of Restaurant	en_US
dc.subject (關鍵詞)	Reception and Delivery	en_US
dc.title (Title)	基於強化學習下的餐廳機器人— 接待與送餐之應用	zh_TW
dc.title (Title)	Restaurant Robot Based on Reinforcement Learning: Application of Reception and Delivery	en_US
dc.type (Type)	thesis	en_US
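
The table of contents lists A-Star as the shortest-path planning method covered in Chapter 2 and names a FIFO+A* comparison group in the experiments. For readers unfamiliar with it, the sketch below shows the standard A* search, which expands the node with the smallest f(n) = g(n) + h(n). It is a minimal sketch under stated assumptions and not the thesis's planner: the occupancy-grid representation (0 = free, 1 = obstacle), 4-connected moves with unit cost, the Manhattan-distance heuristic, and the function name `a_star` are all illustrative choices.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)


def a_star(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Return a shortest path from start to goal on a grid, or None if unreachable.

    Assumes grid[r][c] == 1 marks an obstacle and every move costs 1.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell: Cell) -> int:
        # Manhattan distance never overestimates on a 4-connected unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap: List[Tuple[int, int, Cell]] = [(heuristic(start), 0, start)]
    g_cost: Dict[Cell, int] = {start: 0}
    came_from: Dict[Cell, Cell] = {}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start.
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > g_cost[current]:
            continue  # stale heap entry; a cheaper route was already found

        row, col = current
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbor = (row + d_row, col + d_col)
            if not (0 <= neighbor[0] < rows and 0 <= neighbor[1] < cols):
                continue
            if grid[neighbor[0]][neighbor[1]] == 1:
                continue
            new_g = g + 1
            if new_g < g_cost.get(neighbor, float("inf")):
                g_cost[neighbor] = new_g
                came_from[neighbor] = current
                heapq.heappush(open_heap, (new_g + heuristic(neighbor), new_g, neighbor))
    return None
```

A planner of this kind needs the full obstacle map in advance, which is a useful contrast with the abstract's claim that the RL-PPO robots operate from partial observations only.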
