tl;dr We have built expertise in pretraining, task-specific finetuning, customer-specific finetuning, and model API integrations so that we can always provide the best models across the variety of software development tasks.
Observation #1: Models are Components of Tools
How many parameters? What’s your max context length? What’s your base model? What is your training data? What are your [name your favorite model benchmark] scores? How does it compare to [name your favorite LLM]?
These are all valuable questions, and we get them all the time. If we were a research lab or perhaps a model inference provider, we would probably agree that these questions closely approximate the set of all relevant questions.
But we are a product company. We drive value to a customer via our entire application, our entire tool. And a model is not the same as a tool.
A model is one incredibly important component of a tool, and these LLMs have fundamentally expanded the set of applications that we can build tools for with high enough quality to be trusted. That being said, a model is still just a component of the tool, so any discussion about models needs to happen in the context of the end application quality and end user experience.
Observation #2: Focus Required on User Expectations
So, if all of the talk of “what is the best model” is really a proxy for a conversation about how best to solve a particular task, there should be a lot more conversation about what the task needs in the first place to satisfy the user. The best way to demonstrate this is to talk about a few of the different tasks that a tool like Codeium provides its users, and what the requirements are from a user experience perspective:
- Autocomplete: This workload passively provides a suggestion on every keystroke that the developer performs. For that reason, latency is critical - if the suggestion comes even a second after the keystroke, it is far too late no matter the quality. The developer would have already moved on to the next keystroke.
- Command: This workload takes an instruction and performs refactoring inline, such as adding comments, consistently changing variable naming, or performing some code optimizations. The latency requirement is a bit less strict, but when it comes to output, it is important for the model to only provide changes scoped to the instruction. If the model starts rambling or performs cosmetic edits to irrelevant lines of code, that becomes extremely distracting to a developer trying to review the changes proposed by the AI.
- Chat: This workload takes a question from a developer, and it could be really any kind of question, whether a general programming question or a specific question about the codebase. For this, there is actually quite a large amount of leeway on latency from a UX perspective (as long as it is not obnoxiously slow and the outputs stream out). On the other hand, users definitely expect the answers to be consistently accurate and relevant, a far cry from Autocomplete, where if a suggestion is irrelevant, the user just types the next character and moves on.
Given how different tasks are from a pure user expectation perspective, it would be odd to assume that there is a single model that is objectively the best at all tasks, as we will discuss later.
User expectation also points to a part of a tool that gets little love. Bigger models, all else equal, are generally better. One of the key parts of “all else equal” is the latency piece. Because user expectations are so tied to latency of the tool, it is incredibly important for companies to think about the model inference infrastructure. An edge on the infrastructure simply increases the ceiling of what models can be used while still fitting within user expectations (which is why we call ourselves an infrastructure company before we call ourselves an AI company!).
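As a minimal sketch of the point above, the following picks the largest model whose estimated response time fits a workload's latency budget. Every name and number here (the `ModelOption` type, the model labels, the per-token latencies, the `infra_speedup` factor) is a hypothetical assumption for illustration, not a measurement of any real system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str            # hypothetical label
    params_b: float      # parameter count in billions (illustrative)
    ms_per_token: float  # assumed decode latency on baseline infrastructure


def largest_model_within_budget(
    options: Sequence[ModelOption],
    budget_ms: float,
    tokens: int,
    infra_speedup: float = 1.0,
) -> Optional[ModelOption]:
    """Return the largest option whose estimated latency fits the budget, else None."""
    feasible = [
        m for m in options
        if (m.ms_per_token / infra_speedup) * tokens <= budget_ms
    ]
    # Among the options that fit, prefer the one with the most parameters.
    return max(feasible, key=lambda m: m.params_b, default=None)


# Illustrative numbers only.
OPTIONS = [
    ModelOption("small", params_b=1.0, ms_per_token=4.0),
    ModelOption("medium", params_b=7.0, ms_per_token=20.0),
    ModelOption("large", params_b=70.0, ms_per_token=60.0),
]

baseline = largest_model_within_budget(OPTIONS, budget_ms=200, tokens=16)
faster = largest_model_within_budget(OPTIONS, budget_ms=200, tokens=16, infra_speedup=2.0)
```

With these assumed numbers, only the small model fits the budget on baseline infrastructure, while a 2x infrastructure speedup brings the medium model within budget. The user-facing latency has not changed; only the ceiling on model size has.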
Observation #3: Democratization of Industry-Leading Models
Two years ago, when ChatGPT came out, it seemed that OpenAI had an insurmountable lead when it came to these foundational models. No matter what competitors or open source did, it always felt that the most recent GPT model was clearly the gold standard of achieved model potential.
Today, that feels less like the case. Not only are there many reports of other closed-source models such as Anthropic’s Claude Sonnet outperforming OpenAI’s GPT-4o model on a variety of tasks, but open-source models such as Meta’s Llama 3.1 405B and NVIDIA’s Nemotron 340B have reached a comparable level of quality and opened a whole new world of possibility. Because these models are open-sourced, anyone has the potential to do additional training on these models in order to clearly outperform closed-source models such as GPT-4o on application-specific tasks. And because these models are commercially usable, there is now reason to invest in this additional model tuning, unlike before, when the exorbitant 8 or 9 figure price tag for pretraining such a model completely eliminated the possibility of owning a massive proprietary model. In many ways, the rise of OpenAI competitors and open-source models has democratized the top-end models.
A number of startups raised a lot of money under the claim that it was necessary to train and own the model. The logic was that there would be fundamental limitations of closed-source models, so one might as well skip to the end state where the best product for a particular application would have proprietary models. These no longer feel like smart investments - just to keep up with Meta, NVIDIA, and others, a startup would have to invest at least a billion dollars in a training cluster, and coming back to Observation #1, at the end of the day, the model is just one component of the overall tool.
Observation #4: The Training Cost Question is Nuanced
Very often, the conversation around training costs is very simple - more parameters (bigger model) means that it costs more to train. In reality, there are two main factors to cost: size of model and stage of training.
Obviously, the larger the model, the more expensive it is to train the model. That is the core of modern discourse as people compare the costs of training various foundational models.
The additional wrinkle is that there are multiple potential stages of training: conditional on a fixed model size, pretraining a model from scratch is more expensive than task-specific finetuning, which in turn is more expensive than customer-specific finetuning. This is simply a function of the amount of data used at each stage and the number of training steps required to maximally capture the informational entropy of that data. For clarity on what the difference is between task-specific and customer-specific finetuning, we strongly recommend this blog post.
So what does this wrinkle mean in practice? Performing the full training (including pretraining) end-to-end for a smaller model can be roughly as cost-effective as performing task-specific or customer-specific finetuning on a very large model. For pretraining a very large model, the juice is perhaps simply not worth the squeeze, but that does not mean that we should discredit training smaller models from scratch or performing additional finetuning on very large models. These are significantly more feasible.
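A rough back-of-the-envelope sketch can make the two cost factors concrete. It uses the common rule of thumb that training a dense transformer takes about 6 FLOPs per parameter per training token; the parameter counts and token counts below are hypothetical assumptions chosen for illustration, not figures for any actual model, and the estimate ignores hardware efficiency and parameter-efficient finetuning methods.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Hypothetical scenarios (all sizes and token counts are assumptions).
small_pretrain = training_flops(params=7e9, tokens=1e12)    # small model, from scratch
large_finetune = training_flops(params=400e9, tokens=2e10)  # large model, finetuning only
large_pretrain = training_flops(params=400e9, tokens=1e13)  # large model, from scratch

for label, flops in [
    ("pretrain small model", small_pretrain),
    ("finetune large model", large_finetune),
    ("pretrain large model", large_pretrain),
]:
    print(f"{label:>22}: {flops:.1e} FLOPs")
```

Under these assumptions, pretraining the small model and finetuning the large one land in the same order of magnitude, while pretraining the large model costs hundreds of times more. The stage of training matters as much as the parameter count.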
Our Current Models
Okay, those are a lot of observations, but they help us turn the simplistic question of “What is the best model?” into the more nuanced question: “What is the best feasible model that will satisfy user expectations?”
As per Observation #2, let us look at each workload that we actually provide in Codeium to explain why we use the models that we do:
- Autocomplete: Because of the strict latency constraint, there is a fundamental limit to how large a model can even be used for autocomplete. It is simply not realistic to run GPT-4o or another large model on every keystroke, even with all of the model serving infrastructure tricks in the book. At the same time, the workload realities of editing existing text and pulling together discontinuous code snippets as context actually require tasks such as fill in the middle, inline fill in the middle, and context awareness to be captured within the pretraining process. These are not tasks that can simply be patched over via finetuning or “prompt engineering” of an existing base model that has neither been trained as such nor seen any examples even remotely similar to the prompts used for autocomplete. So, for autocomplete, we trained the model entirely from scratch, with these tasks built into the training data and objective functions from the beginning. With our infrastructure background, we still maxed out the model size possible to fit inference within user expectations on latency, but now we have a purpose-built model that we are confident outperforms every other model on the market at Autocomplete.
- Command: We strongly recommend reading the case study on Command development. The tl;dr is that we started with off-the-shelf closed-source models and, for various reasons (latency, or an inability to perform the instruction-following task concisely), realized that these models just did not cut it. We instead found that taking an open-source model and doing additional task-specific finetuning taught the model to perform this particular instruction-following task incredibly well. There was no reason to go pretrain our model entirely from scratch here! At the same time, because of the looser latency constraints, we actually could use a much larger base model for task-specific finetuning, one that we would be less likely to want to pretrain ourselves in the first place for cost feasibility reasons.
- Chat: A whole different situation. The reality is that really large foundational models, whether closed-source or open-source, are just fine for Chat. In fact, given the breadth of questions that could appear, our best strategy to drive the most value is to provide maximum optionality to the user between models so that they can compare results from multiple models and use the best results. We provide smaller, Codeium-optimized Chat models for easier questions where faster results are better, but also provide optionality to OpenAI’s GPT-4o, Meta’s Llama 3.1 405B, and very soon, Google’s Gemini and Anthropic’s Claude Sonnet. Note that we do not expose every OpenAI model or every Google model or every Anthropic model - just the best and biggest. Latency is not as strong a consideration for Chat, so we should always be suggesting the best models, not ones that are objectively suboptimal choices.
To recap:
- For Autocomplete, we pretrained a model from scratch
- For Command, we performed task-specific finetuning on a great open-source foundational model
- For Chat, we allow for optionality across a bunch of leading models
All different strategies, but with one common thread: serve the models that actually drive the most value for the particular workload. There is no need to provide optionality everywhere as an unnecessary abstraction (which, used improperly, may actually lead to poorer performance for the end user), but definitely provide optionality where it adds value.
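The recap can be sketched as a small routing table in which only Chat exposes a user-selectable model. This is an illustrative sketch, not a description of Codeium's implementation: the `Strategy` enum, the `resolve_model` function, and the model identifiers are all hypothetical names.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    PRETRAINED_FROM_SCRATCH = "model pretrained from scratch"
    TASK_FINETUNED_OPEN_MODEL = "task-specific finetuning of an open-source model"
    USER_SELECTED_MODEL = "user chooses among leading models"


# One strategy per workload, mirroring the recap above.
WORKLOAD_STRATEGY = {
    "autocomplete": Strategy.PRETRAINED_FROM_SCRATCH,
    "command": Strategy.TASK_FINETUNED_OPEN_MODEL,
    "chat": Strategy.USER_SELECTED_MODEL,
}


def resolve_model(workload: str, user_choice: Optional[str] = None) -> str:
    """Return a (hypothetical) model identifier for the given workload."""
    strategy = WORKLOAD_STRATEGY[workload]
    if strategy is Strategy.USER_SELECTED_MODEL:
        # Optionality is offered only where it adds value.
        return user_choice or "default-chat-model"
    # Purpose-built workloads ignore any user preference.
    return f"{workload}-model"
```

For example, `resolve_model("chat", "gpt-4o")` returns the user's choice, while `resolve_model("autocomplete", "gpt-4o")` still returns the purpose-built `"autocomplete-model"`.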
An Aside: Customer-specific Finetuning
The discussion above was all about the choices around pretraining and task-specific finetuning given the particular workloads, but what about additional finetuning of these models on a customer’s private data for an even higher quality experience for the developers at that particular company? The short answer is that we enable this too! That being said, customer-specific finetuning is not a silver bullet, does not work in every scenario, and comes with additional maintenance overhead. We strongly recommend reading more about the limitations of customer-specific finetuning, especially when compared to the personalization benefits driven by our industry-leading context awareness engine.
Codeium’s Ongoing Model Strategy
We will continue to have the unique competency to do anything and everything when it comes to models: pretraining, task-specific finetuning, customer-specific finetuning, and API integrations. We will continue to be honest with ourselves about what drives the most value for each workload, keep reevaluating whatever we currently have deployed, and iterate accordingly as additional research and releases happen in the industry. We will continue to lean into open-source models and steer them to our benefit. Turns out, we are the only company in this space that does all of this!
And from an ethical and legal standpoint, we will continue to never use non-permissively licensed code in any training that we perform, so that we stay proactive on compliance, and we will continue to combine this with our state-of-the-art attribution filtering and logging to provide an end-to-end compliant offering.
Our model strategy is simple: Be opinionated, but not dogmatic, on the models being used.