Deep neural network for human interaction prediction in video using fuzzy relations and optical flow

Article type: Research paper

Authors

  • M. Afrasiabi
  • H. Khotanlou
  • M. Mansoorizadeh
Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran

Abstract

Interaction prediction in video is an active topic in computer vision; its aim is to predict an interaction before it has fully taken place. The task remains of interest because of its open challenges. This paper presents a deep neural network for interaction prediction using fuzzy relations and optical flow. The novelty of the method is the construction of two fuzzy images from a video, one based on the gradient and one on the optical flow. Fuzzy membership functions suited to the spatial relations between interacting people are defined over the gradient and optical-flow images; in addition, one distance membership function weights the frames and another weights the region between the interacting people. Spatio-temporal features are then extracted from these images with a convolutional neural network, and the final prediction is obtained by aggregating the network's two outputs. The method is evaluated on two standard interaction datasets, BIT and UT. The results show that constructing the fuzzy images and extracting deep features from them improves interaction-prediction accuracy over previous methods.
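As a concrete illustration of the per-frame pipeline described above, the sketch below builds the two fuzzy images from the gradient and the optical flow of consecutive frames. It is a minimal sketch under stated assumptions: a Gaussian membership function, OpenCV's Farneback flow estimator, and max-normalisation stand in for the paper's actual membership functions and flow method, none of which are specified in the abstract, so every name and parameter here is illustrative.

```python
import cv2
import numpy as np

def gaussian_membership(x, mean=0.5, sigma=0.25):
    # Illustrative membership function: maps each normalised pixel value
    # to a degree in [0, 1]. The mean/sigma values are assumptions,
    # not taken from the paper.
    return np.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

def fuzzy_images(prev_gray, curr_gray):
    # Inputs: two consecutive grayscale (uint8) video frames.
    # Gradient magnitude of the current frame via Sobel derivatives.
    gx = cv2.Sobel(curr_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(curr_gray, cv2.CV_32F, 0, 1)
    grad_mag = cv2.magnitude(gx, gy)

    # Dense optical flow between the frames (Farneback method; the paper
    # may well use a different flow estimator).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flow_mag = cv2.magnitude(flow[..., 0], flow[..., 1])

    # Normalise each map to [0, 1] and fuzzify it; the two returned
    # arrays are the "fuzzy images" fed to the CNN streams.
    fuzzy_grad = gaussian_membership(grad_mag / (grad_mag.max() + 1e-8))
    fuzzy_flow = gaussian_membership(flow_mag / (flow_mag.max() + 1e-8))
    return fuzzy_grad, fuzzy_flow
```

In the full method, these fuzzy maps would additionally be weighted by the two distance membership functions (over frames, and over the region between the interacting people) before the CNN feature-extraction and aggregation stages.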

Keywords

  • Fuzzy spatial relationship
  • Gradient
  • Optical flow
  • Convolutional neural network