Comparative Evaluation of Multimodal Large Language Models for Technical Content Simplification and Visual Interpretation

This study highlights the role of Large Language Models (LLMs) in simplifying technical content and integrating visual data for accessible communication. It compares GPT-4 and LLaMA-3.2-90b-Vision-Preview on readability, semantic similarity, and multimodal interpretation, using metrics such as Flesch Reading Ease, Gunning Fog Index, and CLIP Score. GPT-4 retains key information and achieves high semantic and textual integration scores, making it the more suitable model for complex technical scenarios. LLaMA, by contrast, prioritizes readability and simplicity and outperforms GPT-4 in generating accessible captions. Both models perform best at a temperature setting of 0.5, which balances simplicity against meaning preservation. The research underscores the potential of LLMs to democratize technical knowledge across disciplines, but notes limitations in precision and multimodal integration. Future directions include fine-tuning for domain-specific applications and expanding input modalities to improve accessibility and efficiency in real-world technical tasks.

Leonardo Pilarski
Universidade de Trás-os-Montes e Alto Douro
Portugal

Luiz E. Luiz
Universidade de Trás-os-Montes e Alto Douro and CeDRI, SusTEC, Instituto Politécnico de Bragança
Portugal

Gonçalo S. Gomes
Universidade de Trás-os-Montes e Alto Douro
Portugal

Tiago Pinto
Universidade de Trás-os-Montes e Alto Douro and INESCTEC
Portugal

Vitor Filipe
Universidade de Trás-os-Montes e Alto Douro and INESCTEC
Portugal

João Barroso
Universidade de Trás-os-Montes e Alto Douro and INESCTEC
Portugal