You might have heard that “AI is the new electricity” or that we are having our “next Gutenberg moment.”
What has been lost amidst this rapturous hyperbole are the words “potential” and “might.” That is, AI might have the potential to be the new electricity and might ascend the Gutenbergian throne of world-transforming technology—but we certainly aren’t there yet.
If we are ever to get to the point at which this early hype is warranted, we will need a clear-eyed way of measuring how effective AI is. Failing to do so sets us up for a chronic oscillation between hype and disillusionment.
That isn’t to say that AI efficacy isn’t currently being measured. Indeed, the early results are in, and they aren’t stellar. AI isn’t getting the same level of engagement as the human baseline, whether in customer service or web copy. Many report being turned off by the familiar patterns of AI-speak, with telltale phrases like “unlock the potential” and “delve into.”
The obvious conclusion of late has been that AI was, in the end, just another hype bubble. Still, many of these judgments are missing several critical factors. To address this, I will introduce two new ways of thinking about AI’s effectiveness—AI Capability and AI Impact—that can help us get closer to delivering on some of the breathless hype.
AI Capability
Let’s take a workforce of 1,000 employees. At the moment, let’s assume their ability to do their work can be measured by the variable x, which we’ll define as the amount of work they can do in a fixed period of time.
Now let’s imagine that workforce maxed out on GenAI capabilities. How will that x be affected?
Without rigorous metrics in place, we can’t say for sure what the multiplier to x will be. But in terms of maximizing their ability with cutting-edge frontier AI models like ChatGPT, Claude, and Gemini (models that can be used in conjunction to achieve even more impressive results), we can look to some recent research to give us a rough notion.
Three different studies, focusing on three different industries, each with realistic use cases, compared cohorts using GenAI in their work to cohorts doing the exact same tasks without it.
The first study, by Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond (2023), found that customer service reps increased the number of customer issues they could resolve per hour by 13.8%. This study also baked in a longitudinal component, tracking users over several months to determine how long it took them to finish a typical customer service training program when using GenAI. The results: it typically takes customer service representatives 8 months to complete training; with GenAI, it took only 4 months.
Another study, by Shakked Noy and Whitney Zhang (2023), focused on how many documents business professionals could write with and without AI. Compared to the non-AI cohort, those who used GenAI generated 59% more business documents per hour.
Finally, a third study, by Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer (2023), determined that programmers who used AI completed 126% more coding projects per week, more than double the baseline.
Across all three studies, the quality of outputs did not falter, and in some cases it even slightly increased.
Interestingly, all of these studies were conducted with GPT-3.5-era models, which are far less sophisticated than today’s cutting-edge models such as GPT-4o and Claude 3.5 Sonnet.
Additionally, apart from the first study (the one tracking the customer service reps), the studies did not actually train the participants to become better at using GenAI over time, something that would likely have led to even more impressive results for the AI cohorts.
With these results and factors in mind, let’s return to our hypothetical example of a 1,000-employee workforce and ask what multiplier to x, the productivity/efficiency boost available with GenAI, we should expect.
A conservative estimate for this multiplier would be 1.5x, a 50% increase. Remember, the customer service reps cut their training time in half, the business professionals wrote nearly 60% more documents (without a lapse in quality), and the coders were more than twice as productive (also without a corresponding drop in the quality of their code).
This delta, the 0.5 difference between 1.5x and the 1x baseline, is one of the two components of something I call AI Capability. The larger this number, the more your organization has to gain from using AI.
The second and final factor that makes up AI Capability is the size of your workforce: simply put, the larger your workforce, the larger your AI Capability.
The following equation captures this:
AI Capability = (maximized ability with GenAI training − current ability) × number of employees
Our example of 1.5x and 1,000 employees yields an AI Capability score of (1.5 − 1.0) × 1,000 = 500. Below is a graph of how this score would play out using the numbers cited in the research above, assuming a total of 700 employees in an organization.
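For those who prefer code to graphs, here is a minimal Python sketch of the calculation (the function name and structure are my own illustration, not drawn from any of the studies):

```python
def ai_capability(maximized_ability: float, current_ability: float,
                  num_employees: int) -> float:
    """AI Capability = (maximized ability with GenAI - current ability) x headcount.

    Abilities are expressed as multipliers of a 1.0 baseline, so a
    workforce that becomes 50% more productive has a maximized_ability
    of 1.5 against a current_ability of 1.0.
    """
    return (maximized_ability - current_ability) * num_employees

# The running example: a 1.5x multiplier across 1,000 employees.
print(ai_capability(1.5, 1.0, 1000))  # 500.0

# The per-study gains cited above, applied to a hypothetical
# 700-employee organization (as in the graph):
study_gains = {
    "customer service (+13.8%)": 0.138,
    "business writing (+59%)": 0.59,
    "programming (+126%)": 1.26,
}
for use_case, gain in study_gains.items():
    print(use_case, "->", ai_capability(1.0 + gain, 1.0, 700))
# -> roughly 96.6, 413.0, and 882.0, respectively
```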
There are of course many factors at play that could change the delta between maximized GenAI ability and current ability. Again, 50% is an approximation, and a conservative one, based on current research. Still, let’s dive a little deeper into some of the complexities of measuring this delta.
Industry, for one, matters. How does a marketing strategy firm compare to a marketing content generation agency? How does either compare to a bioengineering firm? In the final shakeout, industries will inevitably differ in the extent to which GenAI can aid them. But again, as we saw in the three studies across three very different industries, the results are promising.
Other factors that determine this delta include how much employees currently know about GenAI. An online startup that does data science training will probably have a higher baseline than a financial services institution.
Despite these nuances, the studies point to a positive delta, one that will become even more pronounced the more GenAI training employees receive. Coupled with constantly improving frontier models, this training will help them get closer to their “maximized ability with GenAI training.”
AI Impact
But this is not the whole story. Simply using AI across all situations and use cases is not always beneficial. Or, framing it more scientifically, the “maximized ability with GenAI” is not necessarily better than the human baseline.
Let’s use an example that many can relate to: The New York Times. Regardless of your political leanings, you’ll probably agree that the quality of writing in this newspaper is very high. Or to put it more bluntly: these people know how to write.
Would having them use GenAI in their writing process make them better writers? Potentially. It might make them faster at generating ideas, and it might help them find that perfect word sooner (though that “perfect word” might simply not be one an LLM, given its training, would arrive at).
Ultimately, we don’t know whether maximizing their GenAI writing ability would make the final article any better. It certainly might, and who knows, some New York Times staff writers may already be using AI to improve their speed and perhaps even some aspects of their writing.
But we don’t know for sure. To capture this uncertainty, the second important metric is something I call AI Impact.
AI Impact = maximized quality/efficiency with AI for use case X − quality/efficiency without AI for use case X
AI Impact is important because it allows us to focus on specific use cases within any organization. This leads to a more nuanced way of knowing when (and when not) to use GenAI.
If this number is negative, then it doesn’t make sense to use AI for that use case. If the number is positive, then using GenAI for that use case will likely reap benefits.
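To make this decision rule concrete, here is a minimal Python sketch; the use cases and their scores are hypothetical, chosen only to illustrate the sign test:

```python
def ai_impact(score_with_ai: float, score_without_ai: float) -> float:
    """AI Impact = maximized quality/efficiency with AI for a use case
    minus quality/efficiency without AI for that same use case."""
    return score_with_ai - score_without_ai

# Hypothetical quality/efficiency scores on a 0-10 scale, for illustration only.
use_cases = {
    "drafting routine support replies": (8.5, 6.0),
    "long-form investigative writing": (6.5, 9.0),
}
for name, (with_ai, without_ai) in use_cases.items():
    impact = ai_impact(with_ai, without_ai)
    verdict = "worth using GenAI" if impact > 0 else "skip GenAI here"
    print(f"{name}: impact = {impact:+.1f} -> {verdict}")
```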
Below is a hypothetical graph, since no studies have yet measured quality across these specific use cases. It is intended to illustrate the idea of AI Impact and how it might play out across different use cases.
Implications of AI Capability and AI Impact
So what does this all mean?
Having these two metrics allows an organization to determine the value of training its workforce to maximize GenAI capabilities.
More specifically, AI Capability gives a broad organizational perspective, whereas AI Impact pinpoints the specific use cases where GenAI should be used and, by extension, the type of training the organization should provide.
Measuring the delta and determining exactly what maximal GenAI use looks like will not be easy. It will certainly take some upfront investment, but the positive effects of creating a GenAI-enabled workforce will quickly become a strong competitive advantage.
Additionally, organizations do not need to know the exact ceiling of GenAI’s maximal potential. As long as they believe they are close to that ceiling, determining AI Capability and AI Impact will be instructive about how best to proceed with creating AI-enhanced workers at organizational scale.
Companies that ignore these metrics short-change not only the technology but also their employees.