Weakly Supervised Query Expansion Using a Deep Siamese Long Short-Term Memory (LSTM) Network

Article type: Research article

Authors

Department of Computer Engineering, Yazd University

Abstract

Vocabulary mismatch is the most important challenge facing web information retrieval systems. Vocabulary mismatch refers to the differences between users' queries and the content of web documents even when both refer to the same topic. To confront this problem, query expansion methods reformulate the user's query so as to increase the overlap between the terms of the query and those of the documents. This paper presents a query expansion framework based on a deep Siamese long short-term memory (LSTM) network. In addition, relevant relatedness is defined here for the first time and is used to label pairs consisting of a user query and a substitute query. The Siamese network, trained on these weakly supervised labeled pairs, not only assigns a label to each input pair but also computes and reports its comparison cost. After labeling, the pairs with the lowest comparison cost are selected and merged into a single expanded query. Experimental results show that the proposed method outperforms comparable word-embedding-based query expansion methods.
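The comparison cost mentioned above corresponds to the contrastive loss of Hadsell et al. [26]. As a sketch of that standard formulation (symbols follow [26], not this abstract):

```latex
L(W, Y, \vec{X}_1, \vec{X}_2) = (1 - Y)\,\frac{1}{2}\,(D_W)^2 + Y\,\frac{1}{2}\,\left\{\max\left(0,\; m - D_W\right)\right\}^2
```

where \(Y = 0\) for a related pair and \(Y = 1\) otherwise, \(D_W\) is the distance between the two learned encodings, and \(m\) is a margin; minimizing \(L\) pulls related pairs together and pushes unrelated pairs at least \(m\) apart.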

Keywords


Title [English]

Weakly Supervised Query Expansion using Deep Siamese LSTM

Authors [English]

  • F. Kaveh-Yazdy
  • A. M. Zareh-Bidoki
Department of Computer Engineering, Yazd University, Yazd, Iran
Abstract [English]

Term mismatch is the most important challenge in web information retrieval. The term mismatch problem is defined as the difference between users' queries and the contents of documents that refer to the same topic. Query expansion methods deal with term mismatch by reformulating queries to increase their term overlap with relevant documents. In this paper, we propose a query expansion framework based on a deep Siamese LSTM neural network. In addition, we define relevant relatedness for the first time and use this concept to label pairs made from a user query and a candidate query. These weakly supervised labeled pairs are used to train the deep Siamese network. The trained network provides labels for test-set pairs along with their contrastive loss values; the contrastive loss reflects the cost of pulling similar pairs together. Pairs with minimum contrastive loss are selected and merged to form a single expanded query. Our experiments show that the proposed framework outperforms similar word-embedding-based query expansion methods.
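As a concrete illustration of the selection-and-merge step described above, the sketch below (hypothetical function and variable names, not the authors' code) assumes each candidate query has already been scored with a contrastive loss by the trained Siamese network; it keeps the k lowest-cost candidates and merges their novel terms into the user query:

```python
def expand_query(user_query, scored_candidates, k=3):
    """Merge the k lowest-cost candidate queries into the user query.

    scored_candidates: list of (candidate_query, contrastive_loss) pairs,
    where the loss would come from the trained Siamese LSTM network.
    """
    # Keep the k candidates the network pulled closest to the user query.
    best = sorted(scored_candidates, key=lambda pair: pair[1])[:k]
    expanded = user_query.split()
    seen = set(expanded)
    for candidate, _loss in best:
        for term in candidate.split():
            if term not in seen:  # add each new term only once
                seen.add(term)
                expanded.append(term)
    return " ".join(expanded)
```

For example, `expand_query("jaguar speed", [("jaguar car speed", 0.1), ("jaguar animal", 0.9), ("fast jaguar car", 0.2)], k=2)` returns `"jaguar speed car fast"`.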

Keywords [English]

  • Information Retrieval
  • Query Expansion
  • Word Embedding
  • Semantic Relatedness
  • Relevant Relatedness
  • Deep Siamese Network
  • LSTM cell
References

[1] C. Carpineto and G. Romano, “A Survey of Automatic Query Expansion in Information Retrieval,” ACM Comput. Surv., vol. 44, no. 1, pp. 1:1–1:50, Jan. 2012.
[2] R. Khodaei, M. A. Balafar, and S. N. Razavi, “Effectiveness of query expansion based on clustering of pseudo-relevance-feedback documents with the K-NN algorithm,” Tabriz Journal of Electrical Engineering, vol. 46, no. 1, pp. 143–151, 2016 (in Persian).
[3]  S. A. Takale and S. S. Nandgaonkar, “Measuring Semantic Similarity Between Words Using Web Search Engines,” in Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, 2007, pp. 757–766.
[4]  K. Gulordava and M. Baroni, “A Distributional Similarity Approach to the Detection of Semantic Change in the Google Books Ngram Corpus,” in Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Stroudsburg, PA, USA, 2011, pp. 67–71.
[5] F. Kaveh-Yazdy, A.-M. Zareh-Bidoki, and M. R. Pajoohan, “Unsupervised semantic similarity measurement using random walks on a linguistic substitution graph,” vol. 48, no. 1, pp. 237–249, 2018 (in Persian).
[6]  H. Bast, B. Buchhold, and E. Haussmann, “Semantic Search on Text and Knowledge Bases,” Found. Trends Inf. Retr., vol. 10, no. 1, pp. 119–271, 2016.
[7]  H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma, “Probabilistic Query Expansion Using Query Logs,” in Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, 2002, pp. 325–332.
[8]  S. V. Pantazi, “Unsupervised grammar induction and similarity retrieval in medical language processing using the Deterministic Dynamic Associative Memory (DDAM) model,” J. Biomed. Inform., vol. 43, no. 5, pp. 844–857, Oct. 2010.
[9]  T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Advances in Neural Information Processing Systems (NIPS’ 13), Lake Tahoe, Nevada, 2013, pp. 3111–3119.
[10]  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global Vectors for Word Representation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'14), Doha, Qatar, 2014, pp. 1532–1543.
[11]  D. Roy, D. Paul, M. Mitra, and U. Garain, “Using Word Embeddings for Automatic Query Expansion,” in SIGIR Workshop on Neural Information Retrieval, Pisa, Italy, 2016, pp. 1–5.
[12]  S. Kuzi, A. Shtok, and O. Kurland, “Query Expansion Using Word Embeddings,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM ’16), Indianapolis, Indiana, USA, 2016, pp. 1929–1932.
[13]  D. Ganguly, D. Roy, M. Mitra, and G. J. F. Jones, “Word Embedding based Generalized Language Model for Information Retrieval,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’15), Santiago, Chile, 2015, pp. 795–798.
[14]  S. Balaneshin-kordan and A. Kotov, “Embedding-based Query Expansion for Weighted Sequential Dependence Retrieval Model,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17), Shinjuku, Tokyo, Japan, 2017, pp. 1213–1216.
[15]  H. Zamani and W. B. Croft, “Estimating Embedding Vectors for Queries,” in Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR ’16), Newark, Delaware, USA, 2016, pp. 123–132.
[16]  H. Zamani and W. B. Croft, “Relevance-based Word Embedding,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17), Shinjuku, Tokyo, Japan, 2017, pp. 505–514.
[17]  G. Zheng and J. Callan, “Learning to Reweight Terms with Distributed Representations,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 2015, pp. 575–584.
[18]  M. ALMasri, C. Berrut, and J.-P. Chevallet, “A Comparison of Deep Learning Based Query Expansion with Pseudo-Relevance Feedback and Mutual Information,” in Advances in Information Retrieval, Padua, Italy, 2016, pp. 709–715.
[19]  G. Zuccon, B. Koopman, P. Bruza, and L. Azzopardi, “Integrating and Evaluating Neural Word Embeddings in Information Retrieval,” in Proceedings of the 20th Australasian Document Computing Symposium (ADCS ’15), Sydney, Australia, 2015, pp. 12:1–12:8.
[20]  F. C. Fernández-Reyes, J. Hermosillo-Valadez, and M. Montes-y-Gómez, “A Prospect-Guided global query expansion strategy using word embeddings,” Inf. Process. Manag., vol. 54, no. 1, pp. 1–13, Jan. 2018.
[21]  Q. Liu, H. Huang, J. Lut, Y. Gao, and G. Zhang, “Enhanced word embedding similarity measures using fuzzy rules for query expansion,” in IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Naples, Italy, 2017, pp. 1–6.
[22]  N. Rekabsaz, M. Lupu, A. Hanbury, and H. Zamani, “Word Embedding Causes Topic Shifting; Exploit Global Context!,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17), Shinjuku, Tokyo, Japan, 2017, pp. 1105–1108.
[23]  Z.-H. Zhou, “A brief introduction to weakly supervised learning,” Natl. Sci. Rev., vol. 5, no. 1, pp. 44–53, 2018.
[24]  S. Chopra, R. Hadsell, and Y. LeCun, “Learning a Similarity Metric Discriminatively, with Application to Face Verification,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 2005, vol. 1, pp. 539–546.
[25]  J. Mueller and A. Thyagarajan, “Siamese Recurrent Architectures for Learning Sentence Similarity,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Terrigal, NSW, 2016, pp. 2786–2792.
[26]  R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality Reduction by Learning an Invariant Mapping,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 2006, vol. 2, pp. 1735–1742.
[27]  E. Darrudi, H. Baradaran-Hashemi, A. AleAhmad, A.-M. Zareh-Bidoki, A.-H. Habibian, F. Mahdikhani, A. Shakery, and M. Rahgozar, “dorIR collection for Persian web retrieval,” DBRG-TR-138702, Tech. Report for Iran Telecommunication Research Center, Tehran, Iran, 2008.
[28]  A. AleAhmad, H. Amiri, E. Darrudi, M. Rahgozar, and F. Oroumchian, “Hamshahri: A standard Persian text collection,” Knowledge-Based Syst., vol. 22, no. 5, pp. 382–387, Jul. 2009.
[29]  R. T.-W. Lo, B. He, and I. Ounis, “Automatically Building a Stopword List for an Information Retrieval System,” J. Digit. Inf. Manag., vol. 3, no. 1, pp. 3–8, 2005.
[30]  L. Geng and H. J. Hamilton, “Interestingness Measures for Data Mining: A Survey,” ACM Comput. Surv., vol. 38, no. 3, 2006.
[31]  F. Kaveh-Yazdy and A.-M. Zareh-Bidoki, “Aleph or Aleph-Maddah, That is the Question! Spelling Correction for Search Engine Autocomplete Service,” in The 4th International eConference on Computer and Knowledge Engineering (ICCKE’14), Mashhad, Iran, 2014, pp. 1-10.