Examining quantitative metrics, including sacreBLEU, TER, ChrF, ChrF++, BERTScore, ROUGE, METEOR, and Semantic Similarity, to evaluate and compare the quality of responses from more than a dozen popular Large Language Models (LLMs).

To quote LangChain’s recent post, Conversational Retrieval Agents, “LLMs only know what they are trained on. To combat this, a style of generation known as ‘retrieval augmented generation’ has emerged. In this technique, documents are retrieved and then inserted into the prompt, and the language model is instructed to only…