
Evaluating Cross-Lingual Semantic Textual Similarity Using BERT-Based Sentence Embeddings

Semantic textual similarity is a well-established Natural Language Processing (NLP) task present in various applications and systems. Despite its importance, the task remains challenging due to inherent textual factors such as irony, slang, and domain-specific language. This paper compares the performance of pre-trained large language models (LLMs), specifically RoBERTa, DistilBERT, and MiniLM, in evaluating semantic similarity across multilingual texts in three languages: English, Spanish, and Portuguese. A classical NLP baseline based on word counts and the Dice coefficient is also included for comparison. Similarity scores are assessed using Pearson correlation. The experiments show that the LLMs significantly outperform the classical word-count approach. Results for English, a higher-resource language, are significantly better than those for the lower-resource languages, Portuguese and Spanish. Among the evaluated models, MiniLM exhibited the best overall performance.
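To illustrate the kind of classical baseline and evaluation metric the abstract mentions, the sketch below computes the Dice coefficient over word sets for a few sentence pairs and correlates the predicted scores with gold similarity labels via Pearson correlation. The sentence pairs, gold scores, and preprocessing (lowercasing, whitespace tokenization) are illustrative assumptions, not the paper's actual data or pipeline.

```python
import math

def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over lowercase word sets: 2|A ∩ B| / (|A| + |B|)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0  # two empty texts are treated as identical
    return 2 * len(wa & wb) / (len(wa) + len(wb))

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical sentence pairs with made-up gold similarity scores in [0, 1]
pairs = [
    ("a man is playing a guitar", "a person plays the guitar", 0.90),
    ("the cat sleeps on the sofa", "stocks fell sharply today", 0.05),
    ("children are playing in the park", "kids play outside in a park", 0.85),
]
predicted = [dice_coefficient(s1, s2) for s1, s2, _ in pairs]
gold = [g for _, _, g in pairs]
print(round(pearson(predicted, gold), 3))
```

A BERT-based approach would instead encode each sentence into a dense vector (e.g. with a MiniLM sentence encoder) and score pairs by cosine similarity, which captures paraphrases such as "playing"/"plays" that pure word overlap misses.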

Nathália O. Pereira
Universidade de Brasília
Brazil

Vinicius R. P. Borges
Universidade de Brasília
Brazil