Comparative Evaluation of Multimodal Large Language Models for Technical Content Simplification and Visual Interpretation
This study examines the role of large language models (LLMs) in simplifying technical content and interpreting visual data for accessible communication. It compares GPT-4 and LLaMA-3.2-90b-Vision-Preview on readability, semantic similarity, and multimodal interpretation, using established metrics such as Flesch Reading Ease, the Gunning Fog Index, and CLIP Score. GPT-4 retains key information and achieves high semantic-similarity and text-image integration scores, making it better suited to complex technical scenarios. Conversely, LLaMA prioritizes readability and simplicity, outperforming GPT-4 in generating accessible captions. Both models perform best at a temperature setting of 0.5, which balances simplification against meaning preservation. The research underscores the potential of LLMs to democratize technical knowledge across disciplines, while noting limitations in precision and multimodal integration. Future directions include fine-tuning for domain-specific applications and expanding input modalities to improve accessibility and efficiency in real-world technical tasks.
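To make the readability metrics named above concrete, the following is a minimal Python sketch that computes Flesch Reading Ease and the Gunning Fog Index from their standard published formulas. The regex-based syllable counter and the sample sentences are illustrative assumptions, not the tooling or data used in the study.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: each run of consecutive vowels counts as one syllable."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Words with three or more syllables are the "complex words" of the Fog Index.
    complex_words = [w for w in words if count_syllables(w) >= 3]

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease: higher scores mean easier text (standard coefficients).
    flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Gunning Fog: estimated years of schooling needed to understand the text.
    fog = 0.4 * (words_per_sentence + 100 * len(complex_words) / len(words))
    return {"flesch_reading_ease": flesch, "gunning_fog": fog}

# Hypothetical before/after pair illustrating the direction simplification moves both metrics.
original = "The mitochondrion synthesizes adenosine triphosphate via oxidative phosphorylation."
simplified = "Mitochondria make the cell's energy."
print(readability(original))
print(readability(simplified))
```

A successful simplification should raise the Flesch Reading Ease score and lower the Gunning Fog Index, which is the behavior these two sample sentences are meant to exhibit.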