Salesforce announced the world’s first LLM benchmark for CRM to help businesses evaluate the rapidly growing number of large language models (LLMs) for customer relationship management (CRM) systems.
According to industry sources, the new benchmark is a comprehensive evaluation framework that measures the performance of LLMs against four key measures: accuracy, cost, speed, and trust and safety. It has been specifically designed to evaluate common sales and service use cases, including prospecting, lead nurturing, sales opportunity, and service case summaries.
The benchmark also includes a public leaderboard to help professionals decide which LLM is best for their CRM needs. Salesforce will continue to incorporate new use case scenarios into the benchmark and enhance its evaluation of LLMs, which will soon include fine-tuned LLMs.
Silvio Savarese, EVP & Chief Scientist, Salesforce AI Research, stated, “As AI continues to evolve, enterprise leaders are saying it’s important to find the right mix of performance, accuracy, responsibility, and cost to unlock the full potential of generative AI to drive business growth. Salesforce’s new LLM Benchmark for CRM is a significant step forward in the way businesses assess their AI strategy within the industry. It not only provides clarity on next-generation AI deployment but also can accelerate time to value for CRM-specific use cases. Our commitment is to continuously evolve this benchmark to keep pace with technological advancements, ensuring it remains relevant and valuable.”
Clara Shih, CEO of Salesforce AI, stated, “Business organizations are looking to utilize AI to drive growth, cut costs, and deliver personalized customer experiences, not to plan a kid’s birthday party or summarize Othello. Our customers have been asking for a purpose-built way to evaluate and select from among the proliferation of new AI models, and we are thrilled to introduce the world’s first LLM benchmark for CRM to help them navigate the complex landscape of models. This benchmark is not just a measure; it’s a comprehensive, dynamically evolving framework that empowers companies to make informed decisions, balancing accuracy, cost, speed, and trust.”
Why it matters: Existing LLM benchmarks have been limited to academic and consumer use cases, with very little business relevance. They also lack adequate expert human evaluations and fail to address accuracy, speed, cost, and trust considerations. These deficiencies have left CRM customers lacking a reliable way to gauge the effectiveness of generative AI-powered CRM solutions. Without a clear sense of how LLMs perform across those metrics for specific use cases, businesses are left to make decisions in the dark.
Dive deeper: Developed by Salesforce AI Research, the benchmark is distinctive in two ways: it uses real-world CRM data, and it relies on expert human evaluations by practitioners. This enables businesses to use the benchmark to make more strategic decisions about how to incorporate generative AI into their CRM systems, with specific attention to:
- Accuracy: This metric comprises four subcategories: factuality, completeness, conciseness, and instruction-following. The more accurate the predictions or recommendations, the more valuable the results are to teams across the organization. And the more valuable the results, the better the actions they can take to improve customer experience. If a model is accurate enough for a use case, it’s also important to consider the other metrics. Even if the model isn’t accurate enough, techniques like prompt engineering and fine-tuning can improve it.
- Cost: The cost metric is categorized as high, medium, or low, based on percentiles. It reflects the estimated operational cost, which varies by CRM use case. Customers can evaluate the cost-effectiveness of different LLMs to ensure they align with their budget and resource allocation strategies.
- Speed: This metric assesses the LLM’s responsiveness and efficiency in processing and delivering information. Faster response times enhance the user experience, reduce wait times for customers, and enable sales and service teams to address inquiries and issues promptly.
- Trust and Safety: This metric measures the LLM’s capability to shield sensitive customer data, adhere to data privacy regulations, secure information, and refrain from bias and toxicity for CRM use cases. By assessing the reliability of LLMs for CRM, this benchmark gives organizations a sense of transparency regarding trust and safety.
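To make the trade-off between the four measures concrete, here is a minimal Python sketch of how a team might shortlist models from leaderboard-style scores. It is illustrative only and is not Salesforce's scoring method. The `ModelScore` fields, the numeric scales, the unweighted average used for accuracy, the tercile cut-offs for cost tiers, and the thresholds are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ModelScore:
    """Hypothetical per-model scores for one CRM use case (scales are assumed)."""
    name: str
    factuality: float
    completeness: float
    conciseness: float
    instruction_following: float
    cost_per_request: float   # assumed unit, e.g. currency per request
    latency_seconds: float
    trust_safety: float       # assumed 0..1 score

    @property
    def accuracy(self) -> float:
        # Assumption: accuracy is the unweighted mean of the four subcategories.
        return mean([
            self.factuality,
            self.completeness,
            self.conciseness,
            self.instruction_following,
        ])


def cost_tier(cost: float, all_costs: list[float]) -> str:
    """Bucket a cost into low/medium/high by its rank among all costs.

    Assumption: tiers are terciles; the article only says "based on percentiles".
    """
    rank = sum(c <= cost for c in all_costs)
    n = len(all_costs)
    if rank * 3 <= n:
        return "low"
    if rank * 3 <= 2 * n:
        return "medium"
    return "high"


def shortlist(models: list[ModelScore], min_accuracy: float,
              min_trust: float) -> list[ModelScore]:
    """Keep models that clear the accuracy and trust bars, cheapest and fastest first."""
    eligible = [
        m for m in models
        if m.accuracy >= min_accuracy and m.trust_safety >= min_trust
    ]
    return sorted(eligible, key=lambda m: (m.cost_per_request, m.latency_seconds))
```

The design choice here mirrors the guidance in the accuracy bullet: accuracy (together with trust and safety) acts as a gate, and only models that pass it are compared on cost and speed. A team with different priorities could just as reasonably sort by latency first or apply weights instead of hard thresholds.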