
Evaluating Cross-Lingual Semantic Textual Similarity Using BERT-Based Sentence Embeddings

Semantic textual similarity is a well-established Natural Language Processing (NLP) task present in various applications and systems. Despite its importance, the task remains challenging due to inherent textual factors such as irony, slang, and domain-specific language. This paper compares the performance of pre-trained large language models (LLMs), specifically RoBERTa, DistilBERT, and MiniLM, in evaluating semantic similarity across multilingual texts in three languages: English, Spanish, and Portuguese. A classical NLP baseline based on word counts and the Dice coefficient is also included for comparison. Similarity scores are assessed using Pearson correlation. The experiments show that the LLMs significantly outperform the classical word-count approach. Results for English, a higher-resource language, are significantly better than those for the lower-resource languages, Portuguese and Spanish. Among the evaluated models, MiniLM exhibited the best overall performance.
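To illustrate the kind of classical baseline and evaluation metric the abstract mentions, the sketch below computes the Dice coefficient over word sets for a few sentence pairs and correlates the predicted scores with gold similarity labels via Pearson correlation. The sentence pairs, gold scores, and preprocessing (lowercasing, whitespace tokenization) are illustrative assumptions, not the paper's actual data or pipeline.

```python
import math

def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over lowercase word sets: 2|A ∩ B| / (|A| + |B|)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0  # two empty texts are treated as identical
    return 2 * len(wa & wb) / (len(wa) + len(wb))

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical sentence pairs with made-up gold similarity scores in [0, 1]
pairs = [
    ("a man is playing a guitar", "a person plays the guitar", 0.90),
    ("the cat sleeps on the sofa", "stocks fell sharply today", 0.05),
    ("children are playing in the park", "kids play outside in a park", 0.85),
]
predicted = [dice_coefficient(s1, s2) for s1, s2, _ in pairs]
gold = [g for _, _, g in pairs]
print(round(pearson(predicted, gold), 3))
```

A BERT-based approach would instead encode each sentence into a dense vector (e.g. with a MiniLM sentence encoder) and score pairs by cosine similarity, which captures paraphrases such as "playing"/"plays" that pure word overlap misses.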

Nathália O. Pereira
Universidade de Brasília
Brazil

Vinicius R. P. Borges
Universidade de Brasília
Brazil