Recognizing human actions is crucial for an effective and safe collaboration between humans and robots. For example, in a collaborative assembly task, human workers can use gestures to communicate with the robot, and the robot can use the recognized actions to anticipate the next steps in the assembly process, leading to improved safety and productivity. In this work, we propose a general framework for human action recognition based on 3D pose estimation and ensemble techniques, which allows to recognize both body actions and hand gestures. The framework relies on OpenPose and 2D to 3D lifting methods to estimate 3D joints for the human body and the hands, feeding then these joints into a set of graph convolutional networks based on the Shift-GCN architecture. The output scores of all networks are combined using an ensemble approach to predict the final human action. The proposed framework was evaluated on a custom dataset designed for human–robot collaboration tasks, named IAS-Lab Collaborative HAR dataset. The results showed that using an ensemble of action recognition models improves the accuracy and robustness of the overall system; moreover, the proposed framework can be easily specialized on different scenarios and achieve state-of-the-art results on the HRI30 dataset when coupled with an object detector or classifier.

A general skeleton-based action and gesture recognition framework for human–robot collaboration

Matteo Terreran
;
Stefano Ghidoni;
2023

Abstract

Recognizing human actions is crucial for an effective and safe collaboration between humans and robots. For example, in a collaborative assembly task, human workers can use gestures to communicate with the robot, and the robot can use the recognized actions to anticipate the next steps in the assembly process, leading to improved safety and productivity. In this work, we propose a general framework for human action recognition based on 3D pose estimation and ensemble techniques, which allows to recognize both body actions and hand gestures. The framework relies on OpenPose and 2D to 3D lifting methods to estimate 3D joints for the human body and the hands, feeding then these joints into a set of graph convolutional networks based on the Shift-GCN architecture. The output scores of all networks are combined using an ensemble approach to predict the final human action. The proposed framework was evaluated on a custom dataset designed for human–robot collaboration tasks, named IAS-Lab Collaborative HAR dataset. The results showed that using an ensemble of action recognition models improves the accuracy and robustness of the overall system; moreover, the proposed framework can be easily specialized on different scenarios and achieve state-of-the-art results on the HRI30 dataset when coupled with an object detector or classifier.
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S0921889023001628-main.pdf

accesso aperto

Tipologia: Published (publisher's version)
Licenza: Creative commons
Dimensione 2.85 MB
Formato Adobe PDF
2.85 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3495700
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 2
social impact