Enterprise AI Might Be Crap
Some of you may have noticed: companies want you to use their AI tools.
I know, I know, it's subtle and you may have missed it 1.
I was recently exploring customer-AI interaction in the scope of a business support tool. In my research I came across two amazingly relevant research papers published back in May (of 2025).
The first is a Microsoft research paper, LLMs Get Lost In Multi-Turn Conversation, in which they found that LLM unreliability skyrockets in multi-turn conversations. And this was with the big models!
The second was a paper from Salesforce: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. This study was even more telling:
Experiments show leading LLM agents achieve only around a 58% single-turn success rate on CRMArena-Pro, with performance dropping significantly to roughly 35% in multi-turn settings.
In summary: it's not great.
What strikes me as interesting here is that, especially in the case of the Salesforce study, the model had high-quality data and a best-case environment for completing the needed tasks, and it still struggled significantly in multi-turn conversations.
I don't want to dismiss the single-turn metrics entirely, but, based on nothing more than my own vibes, it seems highly likely that the tasks enterprise users need assistance with will require more than one prompt. And even in the single-turn case, a 58% success rate seems rough.
Enterprise environments have a lot less room for error. If you are relying on a system like this, being able to accurately determine its error rate is crucial, and there are many, many platforms, communities, and SaaS solutions that purport to offer this service.
My question is: should they? With studies like the above I think we can clearly see that the underlying models themselves (I'm talking GPT, Gemini, etc.) have some serious reliability problems when it comes to longer-running interactions (and if any of you have used AI regularly in any interface you know that the classic "turn it off and on again" is a reliable tactic for getting better results) 2.
This lends support to an overarching concern of mine: people really are overestimating the efficacy of LLMs. I know, hot take. But it concerns me when I see this rush to implement AI in an enterprise environment. It's not impossible, of course, but you're looking at:
- Sanitizing inputs - ensuring that we coerce user input in a way that the LLM can more reliably parse.
- Building RAG/MCP integrations - to help your LLM/Agent "do less" and rely on backend integrations.
- Sanitizing outputs - you don't want sensitive data returned to the user.
- Using LLM-as-a-judge - to evaluate responses beforehand or on the fly as another guardrail.
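To make the shape of that pipeline concrete, here's a minimal sketch of the sanitize-in, sanitize-out, judge flow. Everything here is hypothetical: `call_model` is a stub standing in for your actual LLM API, the sanitizers are naive regexes, and the "judge" is a plain function where a real system would likely use a second model call.

```python
import re

def call_model(prompt: str) -> str:
    # Stub for the LLM call -- in practice this hits your model's API.
    return f"Echo: {prompt}"

def sanitize_input(text: str) -> str:
    """Coerce user input into something the model parses more reliably:
    strip control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def sanitize_output(text: str) -> str:
    """Redact patterns that look like sensitive data before anything
    reaches the user (here, a naive SSN-style pattern)."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", text)

def judge(response: str) -> bool:
    """Stand-in for an LLM-as-judge check; here just a cheap guardrail
    that rejects empty or redaction-triggering responses."""
    return bool(response) and "[REDACTED]" not in response

def handle(user_input: str) -> str:
    prompt = sanitize_input(user_input)
    response = sanitize_output(call_model(prompt))
    if not judge(response):
        return "Sorry, I can't help with that -- routing to a human."
    return response
```

Even this toy version shows the testing burden: four functions, each with its own failure modes, wrapped around a model whose behavior you don't control.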
And with this we've drastically increased complexity in an effort to skirt the issues inherent in LLM unreliability. Now we have way more to test, way more that can go wrong, and a lot more work in defining what our error rate actually is because, even with all of the above, it remains non-zero.
What Should We Do About It?
I am not fully versed in every possible AI use case, and I'm not trying to say that success here isn't worth chasing. I know larger companies say they are implementing support chatbots. You may have noticed that the paper documenting multi-turn unreliability in business scenarios was written by Salesforce...which also sells support chatbots as part of its product. I have to wonder how reliable that part of their service is when their own research shows quite a few issues.
But to me the solution (in this moment) is: don't let AI drive.
Customer interactions should be driven by a workflow that is "hardcoded" or follows a preset number of steps. AI can be sprinkled into this as much as you want, and you can more reliably test the efficacy of those steps. Depending on the step in the workflow, this implementation may result in mostly single-turn interactions, where we can take better advantage of the LLM's higher success rate for that metric.
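A sketch of what "don't let AI drive" might look like, under my own assumptions: the step sequence is hardcoded, and the LLM is confined to a single step (`llm_classify`, stubbed here) that can be evaluated in isolation as a single-turn call.

```python
# Hypothetical workflow: the step order is fixed code, not model output.
# The LLM appears in exactly one step, as a single-turn call you can
# benchmark on its own.

def llm_classify(text: str) -> str:
    # Stub for a single-turn LLM call (e.g. intent classification).
    return "billing" if "invoice" in text.lower() else "general"

STEPS = ["greet", "classify", "resolve"]

def run_workflow(message: str) -> list[str]:
    transcript = []
    for step in STEPS:
        if step == "greet":
            # Fixed copy -- no LLM involved.
            transcript.append("Hi! How can we help?")
        elif step == "classify":
            # The only LLM turn in the whole workflow.
            transcript.append(f"intent={llm_classify(message)}")
        elif step == "resolve":
            # Deterministic routing based on the classified intent.
            if "billing" in transcript[-1]:
                transcript.append("Routing to billing team.")
            else:
                transcript.append("Here is our FAQ link.")
    return transcript
```

The payoff is that the only nondeterministic piece is one classification call, so your evaluation suite targets a single-turn task instead of an open-ended conversation.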
LLMs will continue to improve, and I know the big companies are not blind to this. I also believe that with enough time and effort you could stand up a robust system - but it won't be some plug-and-play solution; it will take a significant amount of work and testing. In AI Engineering by Chip Huyen, a primary focus is the evaluation of LLMs. If you're not evaluating your LLM implementation, you are rolling the dice. Whatever you build, make sure it's thoroughly tested because, as these research papers indicate, the error rate will remain non-zero for a long time to come.
Thanks so much for reading!
-- Rick
This is sarcasm.↩
In the current context it's now "close this chat and start a new one". Even Cursor employees recommend doing this every 5 minutes.↩