“If I see a sick gen AI demo on Twitter, my first reaction is that the product actually sucks”

This is a common reaction I hear from people nowadays, and there are valid reasons behind it. Startup demos using generative AI have been popping up everywhere for the last few months, and the demos look genuinely impressive (this LLM technology is impressive, after all). But when you actually try most of these products, they rarely, if ever, work as promised. The very small number of gen AI products with sustained success and usage suggests that these poor individual experiences are the norm. The hype may just be hype.

A number of us at Codeium used to work in the self-driving industry, and the same thing happened there in the beginning - impressive demos, lots of money, and not much to show in terms of productionized rollouts. It was easy to come up with a demo that solved 90% or even 99% of driving, but the reality is that for the last decade and a half, self-driving companies have been trying to productionize a technology that just wasn’t ready. We may finally be turning a corner in the self-driving industry, where the maturity of the technology might be catching up with the robustness required for productionization, but it is fair to ask whether gen AI is set to have a similarly extended hype cycle.

I think it is safe to say the short answer is no - the Midjourneys and ChatGPTs of the world have clearly shown that some stuff using LLMs can be productionized - that is, deployed and trusted by paying users. So how can you identify what applications can be and can’t be effectively addressed with LLMs?

At Codeium, we often get asked how we develop so quickly yet rarely, if ever, miss wildly with our new capabilities. In the AI code assistant space, we were the first to roll out in-IDE integrated chat and natural language based search, with more cool functionality in the pipeline. While we aren’t going to say that we can see the future, we also don’t leave adoption to luck, so we wanted to talk a little bit about how we view AI product development and how we separate what we should build from what we shouldn’t.

Figuring out what to build comes down to finding the intersection between two things: (a) being honest about what level of robustness LLM technology is at today and (b) being cognizant of what will actually help target users, who in our case are developers. If you miss either, you will end up with something that doesn’t really stick.

A great example in our space of a failure to operate in this intersection is GitHub CopilotX (the research-y division of GitHub Copilot) launching automated PR descriptions. We tried this CopilotX feature internally when it came out, and all of us stopped using it after a few days because it would often miss important context or even entire files. Why? At face value, this is an example of being cognizant of what will help developers - developers don’t like writing PR descriptions, which means reviewers often have to read through the entire change list (and perhaps more) to get an idea of what is going on. But GitHub was not being honest with itself about the level of reliability required from LLM technology to productionize this. The PR review is the final step before code goes into production, so the PR description must be accurate close to 100% of the time; otherwise, it just wastes more of the reviewer’s time, and developers quickly lose trust in the tool.

On the flip side, a good example of the opposite mistake (a robust application that is not useful to developers) was also in this feature! In the automated PR description, CopilotX generated a haiku summarizing the change. LLM technology is absolutely robust enough to meet developers’ “reliability requirements” for haiku quality, but will these haikus actually help target users? Nope. The proof is in the pudding: this entire PR summarization feature has not been brought into the main GitHub Copilot product, even though it was launched almost half a year ago.

Compare this with code autocomplete, which is a modality that most AI code assistants share, for good reason. It is absolutely useful to a developer for an AI to autocomplete boilerplate code for them, allowing the developer to stay in the flow state and worry about more interesting things. And at the same time, autocomplete is passive and low-stakes, so if a suggestion isn’t accurate, a developer can just keep typing and continue on with their day, as long as the suggestion is short enough to review quickly. LLM technology is robust enough today to give suggestions at a high enough quality at a high enough rate to provide value in this modality, and with better models and better context awareness, the area of intersection between the two factors of usefulness and accuracy keeps growing.

We then almost always get a follow-up question - is it worth bringing incremental value at this intersection, or should gen AI companies focus on some home run application? In the self-driving analogy, self-driving companies go straight for “L5 autonomy” because of the belief that moving from L1 to L5 would require massively different paradigms, so it would be silly to build five companies along the way to L5 when the learnings from one would not carry over to the next.

Bringing it back to AI code assistants, people often cite software development “agents,” a set of LLMs that work with each other to fully generate and commit PRs from scratch, as an end state in our AI code assistant field. Heck, you can see a million demos of this on Twitter, with the subheader that “software engineering is a dead career.” Well, the careers are still around, and this is obviously an egregious example of an application where the LLM technology is just not robust enough today to work reliably in production. This is why, even though our end goal at Codeium is to accelerate every aspect of the software development life cycle with AI, we did not join in the hype wave of LLM agents, and now that wave is starting to die down. Agents in some capacity are part of the future, and just like with chat and search, we hope to be the ones to make the final breakthroughs when the time is right.

The difference between self-driving and gen AI in this respect, however, is that in gen AI, we can actually build things that will both be useful today and required tomorrow. Maybe we do get to a state of fully generating PRs as the level of reliability of LLM technology increases, but things like real-time context awareness are still going to be required, and even the functionalities of autocomplete and chat will still be used as backups to help guide a developer when the full end-to-end system isn’t confident enough. If we do this right, the things we build today can still be used in the future.

So this is how we develop our AI product. We find the balance between usefulness and robustness today while keeping an eye on whether such features are building blocks towards the future. It sounds simple, but it works. Sure, we have needed some small adjustments to our product or feature launches, but nothing major. And we have organically built traction and sustained usage, with hundreds of thousands of users, Fortune 500 enterprise customers, and hours upon hours of time given back to developers. Fundamentally we are making the most educated decisions in gen AI product development, and that is how we will keep going.