On this blog, we have not shied away from talking about how important it is to personalize our state-of-the-art base system to a particular customer. This allows us to drive significantly more value, as evidenced by the 27% increase in the amount of code accepted when we turned on our first version of context awareness. When personalizing an LLM application, there are two primary levers - personalize the data being fed into the model and personalize the model itself. At Codeium, we have built two systems, context awareness and finetuning, as implementations of these two levers, and have talked about both of these in depth on this blog. However, we still noticed a lot of questions about these approaches, uncertainties around their comparative benefits and weaknesses, and general confusion about our terminology in the context of the industry. This article is meant to add some clarity to how we think about personalization.

Some Clarifications

Before we get into the actual technologies and their benefits/drawbacks, there is a lot of confusion we need to address.

First, we need to be clear on what the end goal of any of these approaches is - to get results that are higher quality for the particular tasks at hand. This is why we call the general goal “personalization.” Very often we get potential customers asking us for context awareness or customer-specific finetuning, but it is important to note that these are just tools in the toolbox - what these customers really want are autocomplete suggestions and chat responses that leverage the existing code, match syntax and semantics, and utilize preexisting entities, i.e. personalized responses.

Second, we need to clear up what context awareness and finetuning really are in the context of a code assistance tool like Codeium.

Context Awareness vs RAG

If you have paid any attention to the LLM world, you have probably heard the term “RAG,” short for Retrieval Augmented Generation, thrown around a lot. The idea behind RAG is relatively intuitive - if you can provide the LLM with snippets of information relevant to the prompt (in conjunction with the prompt), then the responses will have fewer hallucinations, will utilize the most recent accurate information (no re-training necessary), and can even cite sources, increasing trust. In other words, a RAG system retrieves particularly relevant snippets from a pre-indexed corpus of generally relevant information, augments the prompt with these snippets, and lets the LLM generate a more meaningful response from this augmented prompt.
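The retrieve, augment, generate loop can be sketched in a few lines. This is a minimal illustration and not how any production system works: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `CORPUS`, `retrieve`, and `augment` are hypothetical names chosen for this example. The final generation step is omitted since it is just a call to whichever LLM is in use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k snippets in the corpus most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(corpus, key=lambda s: cosine(query_vector, embed(s)), reverse=True)
    return ranked[:k]


def augment(query, snippets):
    """Build the augmented prompt: retrieved snippets first, then the question."""
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return f"Relevant context:\n{context}\n\nQuestion: {query}"


# Hypothetical pre-indexed corpus of "generally relevant" snippets.
CORPUS = [
    "def parse_config(path): loads the YAML config file at path",
    "def send_email(to, body): sends an email through the SMTP relay",
    "def load_user(user_id): fetches a user row from the database",
]

query = "how do I load the config file"
prompt = augment(query, retrieve(query, CORPUS, k=1))
print(prompt)  # this augmented prompt is what would be sent to the LLM
```

In a real system the embeddings would be computed once ahead of time and stored in an index, rather than recomputed on every query as this sketch does.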

What we term as context awareness is similar in many ways to RAG, but also different in others. Part of the system is indeed retrieving useful snippets of code in an existing codebase to augment a prompt (autocomplete or chat) - we AST parse the codebase ahead-of-time to chunk the code into semantic pieces, compute embeddings, and index these embeddings for real-time retrieval. We do this for both the current repo and even remote repositories. This is very textbook RAG. Our retrieval logic, however, is a bit more complex. Not only do we use this index, but we also crawl imports and look at directory structure to determine snippets with higher “relevance,” leveraging the unique organization of code. This might still be called RAG. However, we also take into account a lot of user intent as context, such as what tabs are open and which files have been recently edited, and keep memory on past interactions, such as previous chat messages. We are no longer doing retrieval on just a data store, but also building state on how a user is interacting with the tool. These details are what make context awareness such a powerful personalization tool for pretty much every company’s code.
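To make the “AST parse and chunk” step concrete, here is a small sketch using Python's standard `ast` module on Python source. It is only an illustration under simplifying assumptions: it handles a single language, chunks only top-level functions and classes, and the helper names `chunk_source` and `imported_modules` are invented for this example rather than taken from Codeium's system.

```python
import ast


def chunk_source(source):
    """Split Python source into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of the definition, ready to be embedded.
                "text": ast.get_source_segment(source, node),
            })
    return chunks


def imported_modules(source):
    """Collect imported module names, a cheap signal for crawling related code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


# Hypothetical file contents used only for this example.
SAMPLE = '''import os
from billing.invoices import total

def foo(x, y=2):
    return x * y

class Cart:
    def add(self, item):
        self.items.append(item)
'''

chunks = chunk_source(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
print(sorted(imported_modules(SAMPLE)))
```

Chunking along syntactic boundaries means each embedded piece is a complete function or class instead of an arbitrary window of lines, and the import list gives the retriever a starting point for finding related files.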

Context awareness can be thought of as a super-charged RAG system that also has memory and incorporates signals on user intent.
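One way to picture how intent signals change retrieval is a re-ranking step that adds small boosts on top of embedding similarity. The sketch below is purely illustrative: the `Candidate` type, the signal names, and the additive weights are assumptions made up for this example, and a real system would likely tune or learn such weights.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    similarity: float  # embedding similarity from the index, in [0, 1]


def rerank(candidates, open_tabs=frozenset(), recently_edited=frozenset(),
           imported_paths=frozenset(), w_open=0.15, w_recent=0.10, w_import=0.20):
    """Order candidates by similarity plus additive boosts for intent signals.

    The weights are hypothetical defaults chosen for illustration only.
    """
    def score(candidate):
        total = candidate.similarity
        if candidate.path in open_tabs:
            total += w_open
        if candidate.path in recently_edited:
            total += w_recent
        if candidate.path in imported_paths:
            total += w_import
        return total

    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("utils/strings.py", 0.62),
    Candidate("billing/invoices.py", 0.55),
]

# With no intent signals, plain similarity decides the order.
baseline = rerank(candidates)

# With the invoices file open and recently edited, it moves to the top.
personalized = rerank(
    candidates,
    open_tabs={"billing/invoices.py"},
    recently_edited={"billing/invoices.py"},
)
print([c.path for c in baseline], [c.path for c in personalized])
```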

Finetuning vs Finetuning vs… Finetuning?

If RAG is an overused term, then finetuning as a term is overused beyond saving. The first distinction that needs to be made is what finetuning means in the context of training our base model and what it means in the context of personalization.

When we train our base model, we have two main stages - pretraining and finetuning. Pretraining is training a foundational model from scratch that generally understands what text and code are, and is an incredibly expensive and time-consuming process. Popular foundational models such as the Llama family are the output of pretraining on a lot of text. However, there are a lot of particular tasks that we want an LLM to do well given what our application requires. For example, we have a context awareness engine and we really want the model to be good at ingesting the retrieved code snippets in the format that we specify them in. While prompt engineering might somewhat help, during pretraining the model was never taught to do such a task (it just got really good at producing outputs linearly). For application-specific tasks, we might generate a lot of examples with special tokens and objective functions and do “finetuning” stages of training to get the model to be intrinsically good at these unique tasks. The reason we do this as a follow-up finetuning stage is that we can iterate on getting better at these tasks a lot faster (and with a lot less cost) than if we did this during pretraining, and often we don’t have enough examples of this specific task for it to be a meaningful fraction of the pretraining data.

However, and crucially, this finetuning is still for building our base model. None of it changes weights for, or personalizes the model to, an individual customer. So what then is the finetuning that we do for a customer? Well, given the even smaller order of magnitude of code present in a customer’s codebase (even “large” codebases of hundreds of millions of lines pale in comparison to the hundreds of billions of lines used to train our base model), we created a separate system to ingest a customer’s codebase and further modify the model to hopefully be more tuned to the customer’s existing code.

In hindsight, we probably should not have called this just “finetuning,” since that fails to distinguish it from the finetuning done when building the base model, which is what the popular discourse on finetuning is focused on. At the least, we should have been calling it customer-specific finetuning from the start. Finetuning when generating a base model, with enough data, is actually quite a powerful technique, but finetuning to match a company’s codebase is much more limited in its value-additive use cases, as we will see in the next section.

Finally, as yet another source of confusion, we have reason to believe that many other code assistant tools use the term “finetuning” to actually refer to their context awareness system, which doesn’t modify any weights in the model. This totally does not help the situation…

Context Awareness vs Customer-Specific Finetuning

With hopefully a lot of confusion removed, we can really focus on the pros and cons of a context awareness system and customer-specific finetuning. To be clear, this is not an either/or decision where one is objectively better than the other. The two can be complementary, but at the same time, they are not universally applicable.

The first distinction is the goal of personalization that each technique helps with. Context awareness helps with retrieval - pulling in the relevant information for what the user is trying to do at the moment. Customer-specific finetuning helps with learning tasks that the base model does not intrinsically know, even with its extensive pretraining and code-specific finetuning stages. Customer-specific finetuning unfortunately is popularly seen as a catch-all for all situations, including as a solution for retrieval, when in reality context awareness is the right solution there. As some intuition, if we want an autocomplete suggestion to reuse some private function foo, it is a lot more likely to happen if the exact function signature of foo is in the augmented prompt passed in during inference, as compared to hoping that foo is perfectly fuzzily embedded in some model weights.
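The foo example can be made tangible with a small sketch of what “the exact function signature in the augmented prompt” might look like. Everything here is an assumption for illustration: the body and parameters of `foo`, the `# available:` comment format, and the helper names are invented, and a real prompt format would be whatever the model was trained to ingest.

```python
import inspect


def foo(order_id: int, *, retries: int = 3) -> dict:
    """Hypothetical private helper that exists only in the customer's codebase."""
    return {"order_id": order_id, "retries": retries}


def signature_line(fn):
    """Render the exact signature of a function as a single line of source."""
    return f"def {fn.__name__}{inspect.signature(fn)}"


def build_prompt(prefix, retrieved_functions):
    """Prepend retrieved signatures as comments ahead of the code being completed."""
    header = "\n".join(
        f"# available: {signature_line(fn)}" for fn in retrieved_functions
    )
    return f"{header}\n{prefix}" if header else prefix


# The code the user has typed so far; the model completes from the end of it.
prefix = "def process(order_id):\n    result = "
prompt = build_prompt(prefix, [foo])
print(prompt)
```

With the signature sitting directly in the prompt, the model only has to copy the name and argument shape, rather than recall them from weights.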

That being said, let us say you have some incredibly rare or domain-specific languages (DSLs) extensively used in your private codebase. The base model simply does not know how these languages work at an intrinsic level - no matter how good the “retrieval” piece is, there is no way the model will zero-shot understand how a language works from a few snippets in the prompt. In these cases, customer-specific finetuning can potentially help to improve performance generally on these languages. It is not a silver bullet, and if the language is really similar to popular languages, then you might need lots of data to really teach the model to distinguish the particular structure of this new one, but there is a chance. On the other hand, if you try to finetune a base model already trained on multiple billions of lines of React code on an additional tens of millions of lines of private React code, the model is not going to get any better at React. Even when it comes to frameworks, we have empirically found retrieval to work a lot better than customer-specific finetuning to match the semantics of the framework.

The second big distinction is maintainability, and this is one where context awareness handily trumps customer-specific finetuning. If the base model improves or is updated, all of the precomputed indices in context awareness still apply, but customer-specific finetuning will need to be redone as there is no easy way to “transfer” weights. Even more than this, as soon as a codebase changes, it is incredibly fast to update the indices in the context awareness system, but any new code won’t be reflected via customer-specific finetuning until a new finetuning job is completed. Since customer-specific finetuning should not be used for retrieval anyway, this last point is less worrisome, but it still highlights the difference in update speed between the two systems.
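Part of why index updates are fast is that only the files that changed need to be reprocessed. A minimal sketch of that idea, assuming a content hash per file and an in-memory store, is below. The `SnippetIndex` class is hypothetical, and splitting a file into lines stands in for the real chunking and embedding work.

```python
import hashlib


class SnippetIndex:
    """Toy in-memory index keyed by file path, with a content hash per file."""

    def __init__(self):
        self.hashes = {}
        self.entries = {}

    def sync(self, files):
        """Re-index only new or changed files and drop deleted ones.

        `files` maps path -> current file text. Returns the paths re-indexed.
        """
        reindexed = []
        for path, text in files.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.hashes.get(path) != digest:
                self.hashes[path] = digest
                # Stand-in for the expensive part: chunking and embedding.
                self.entries[path] = text.splitlines()
                reindexed.append(path)
        for path in list(self.hashes):
            if path not in files:
                del self.hashes[path]
                del self.entries[path]
        return sorted(reindexed)


index = SnippetIndex()
first = index.sync({"a.py": "x = 1\n", "b.py": "y = 2\n"})
second = index.sync({"a.py": "x = 10\n", "b.py": "y = 2\n"})  # only a.py changed
third = index.sync({"a.py": "x = 10\n"})  # b.py deleted, nothing to re-embed
print(first, second, third)
```

The contrast with customer-specific finetuning is that there is no comparable per-file shortcut for weights: a changed codebase is only reflected after a new finetuning job.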

The final distinction is on how predictable a win is with the corresponding tool. Getting wins from context awareness is much more predictable, and its outputs are inspectable, while for customer-specific finetuning, it often takes multiple cycles of choosing which parts of the codebase to finetune over and testing quality of results over long periods of time before seeing any wins. For our customers who have seen wins from customer-specific finetuning on their DSLs and rare languages, it has always taken at least a few weeks of constant iteration and evaluation to get to a place where there is confidence. On top of this, with context awareness there are more places where a developer can help guide the system by providing more explicit user intent, such as with @ mentions or context pinning, which makes improved quality in results even more likely.

All of these positives of context awareness and negatives of customer-specific finetuning are not meant to convince people that the former is objectively better than the latter. We just want to be clear why we recommend customer-specific finetuning to be used sparingly and tactically since almost every potential customer asks us about our finetuning system under the impression that it will create a personalized experience. The reality is that with Codeium, companies indeed will get a personalized experience, but more likely because of our advanced context awareness system.

What’s Next for Codeium’s Personalization Engine

To be transparent, these findings reflect the current state of Codeium’s context awareness and customer-specific finetuning systems, and they align with results that we have heard from other practitioners. Our opinions may completely change as we continue to iterate on and improve the systems.

And the reality is that we actually think there are improvements to be made on both context awareness and customer-specific finetuning, and we are actively working on brand new paradigms for both. Context awareness still suffers from false positives in retrieval, which only become more problematic as a codebase gets larger, and customer-specific finetuning likely has targeted applications where the value is really highlighted. We have been researching how to massively increase the precision and recall of retrieval, and how to minimize the overhead of customer-specific finetuning while still being able to encode a lot of context as “long term memory.” We also said that context awareness and customer-specific finetuning were two tools in the personalization toolbox - we actually think there might be more… More to come on personalization engine breakthroughs shortly.