Jesús Vicente García, Senior Executive in Data Science
Generative and cognitive artificial intelligence, while far from having reached their full potential, continue to penetrate many areas of both daily life and the business world. Foundational models and LLMs are evolving constantly, and multimodality is becoming the norm: the understanding and generation of images, audio, and video are no longer a dream, but a reality. This new scenario not only implies an improvement in the raw capabilities of these artificial intelligences; it is also driving miniaturization, optimization, and even specialization, which in turn improve generation times and scalability, two of the great obstacles that still lie ahead. The purely technical aspects, however, pale in comparison with the greatest challenge AI faces: meeting (and, who knows, perhaps even surpassing) the expectations already placed on it, and providing all of the value that we anticipate it will provide. As usual, this question will not be resolved by any means other than experimentation, learning and, above all, the analysis of its application:
1. Observability of AI
Observability is understood as the ability to measure a system's state and internal performance from what we can observe of it in operation. It has long been a key element within strategies for applying AI and machine learning in the business world, with objectives that revolve around the cataloging, governance, explainability, and even regulation of artificial intelligence solutions. However, applying these concepts to the field of generative AI has come up against the complexity of adding a cognitive layer to measurement and evaluation systems. The case of determining the capabilities of a conversational agent is paradigmatic: assessing the quality of the responses given to user requests, and deriving metrics from that assessment, requires an analysis based on the same techniques that gave rise to the solution in the first place, such as LLMs, prompt engineering, or chain-of-thought (CoT). This degree of subjectivity not only creates requirements for architectures that can capture and process all of this activity and these interactions, but also broadens the scope in which a sound AI application strategy is essential. It also introduces new concepts to work with, such as the ROUGE and BLEU score metrics.
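By way of illustration, the following Python sketch computes ROUGE and BLEU scores for a single agent response against a reference answer, using the open-source rouge-score and nltk packages. The packages, the sample texts, and the choice of metrics are assumptions made for the example rather than a prescribed toolchain; in practice, lexical-overlap metrics like these are usually complemented by the LLM-based evaluations mentioned above.

```python
# Illustrative only: scoring one agent answer against a reference answer.
# The reference/candidate strings are placeholder examples, not real data.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Refunds are processed within 14 days of receiving the returned item."
candidate = "Your refund will be issued within 14 days after we receive the return."

# ROUGE-1 / ROUGE-L: unigram and longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# BLEU: modified n-gram precision, smoothed because single sentences are short.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

print({name: round(score.fmeasure, 3) for name, score in rouge.items()})
print(f"BLEU: {bleu:.3f}")
```

Scores like these only become observability once they are captured systematically for every interaction, stored alongside the prompts and responses that produced them, and exposed to the teams governing the solution.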
2. Measuring Impact and Return
The quality of a solution or product from a purely technical perspective is no guarantee of its success at a business level. Such success can only be established through effective measurement in business terms, where we talk about concepts such as engagement, impact and, above all, return. Establishing all of these KPIs effectively requires, fundamentally, the technological enablement that accompanies the AI observability paradigm. It must also include a pre- and post-implementation strategy: the definition of an operational or performance baseline, appropriate metrics around the processes to be improved, and a rollout that makes it possible to measure, compare, and determine whether the impact achieved bears out the study carried out prior to the investment.
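As a purely illustrative sketch of that pre/post comparison, the snippet below contrasts a baseline sample of an operational KPI with measurements taken after rollout. The KPI (average handling time), the sample values, and the use of a two-sample t-test are assumptions made for the example, not part of any particular methodology.

```python
# Illustrative pre/post comparison of one operational KPI around an AI rollout.
# The values below are invented placeholders for the sake of the example.
from statistics import mean
from scipy import stats

baseline_handle_time = [12.4, 11.8, 13.1, 12.9, 12.2, 13.4]   # minutes, pre-rollout
assisted_handle_time = [9.7, 10.2, 9.1, 10.8, 9.9, 10.4]      # minutes, post-rollout

# Relative improvement against the baseline, plus a simple significance check.
uplift = (mean(baseline_handle_time) - mean(assisted_handle_time)) / mean(baseline_handle_time)
t_stat, p_value = stats.ttest_ind(baseline_handle_time, assisted_handle_time)

print(f"Average reduction in handling time: {uplift:.1%} (p = {p_value:.4f})")
```

The point is less the statistics than the discipline: without a baseline defined before the investment, there is nothing meaningful to compare the post-rollout figures against.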
3. Continuous Improvement
Expectations, however, are not always easy to meet. That said, the most solid strategy in the face of possible disappointment is understanding that generative and cognitive AI, much like earlier stages of artificial intelligence, requires several iterations and a great deal of learning, not only on the part of the models and systems, but also of the developers behind them. Thus the circle closes, and the learning acquired can be applied from a new technological perspective, completing the path from business to engineering, from engineering to observability, from observability back to business, and again to engineering, conceiving a new iteration that descends the gradient of expectations toward true business value. Here it is key to understand that the AI we have today will only be as good as the information and data we manage; that the grounding and our knowledge base are really the fuel; and that this fuel must be refined as much as possible, in many cases with equally cognitive techniques, so that the mixture of algorithm and information combusts with the expected power. Above all, it is essential to remember that the knowledge AI "generates" by bending and adapting itself to our needs is not created out of nothing: it belongs to the same plane of reality in which the business's own knowledge lives, represented by all those relevant actors whose feedback, necessarily and thankfully human, can be incorporated into these systems to give rise to a new iteration.
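A minimal sketch of that last idea, under very simple assumptions: reviewed human feedback on agent answers is folded back into the grounding knowledge base so that the next iteration can draw on it. The data structures and the approval rule are hypothetical, intended only to make the feedback loop concrete.

```python
# Illustrative feedback loop: approved answers and reviewer corrections become
# new grounding documents for the next iteration. Structures are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FeedbackItem:
    question: str
    agent_answer: str
    reviewer_correction: Optional[str]  # None when the reviewer has no correction
    approved: bool

@dataclass
class KnowledgeBase:
    documents: list = field(default_factory=list)

    def incorporate(self, feedback: list) -> None:
        """Add reviewed content so the next iteration is grounded on it."""
        for item in feedback:
            if item.approved:
                self.documents.append(f"Q: {item.question}\nA: {item.agent_answer}")
            elif item.reviewer_correction:
                self.documents.append(f"Q: {item.question}\nA: {item.reviewer_correction}")

kb = KnowledgeBase()
kb.incorporate([
    FeedbackItem("What is the return window?", "30 days.", None, approved=True),
    FeedbackItem("Is express shipping free?", "Yes.", "Only for orders above 50 EUR.", approved=False),
])
print(len(kb.documents))  # -> 2 grounding documents for the next iteration
```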
Of course, other paradigms and challenges still lie ahead. Current trends already point to smaller, denser models specialized in specific areas of knowledge (SLMs), as well as the democratization of the computing resources that make algorithmic hyper-personalization possible. These will undoubtedly bring advantages when it comes to increasing AI's adaptability to each business need, but not at any price: methodologies such as LLMOps must embrace this "eternal return" and address needs in the governance and control of AI that some believed, if only temporarily, had already been overcome. Whatever the case, SDG Group is helping companies through this constant transformation, laying the technical, methodological, and business foundations that allow them to truly obtain the maximum value from AI.