Get AI code assist in VSCode with local LLMs using Ollama and the Continue.dev extension - Mac

This article covers running VSCode AI code assist locally as a replacement for Copilot or a similar service. You might run local models to guarantee that none of your code ends up on external servers, or because you don't want to maintain an ongoing AI subscription.

We are going to use Ollama as our LLM service and the continue.dev VSCode extension as the language service inside VSCode.  

This was tested on a MacBook using the Apple GPU. Macs are an interesting platform for running local AI code assist and LLMs because you can treat much of main memory as GPU VRAM.

Related blog articles and videos

Several related blog articles and videos cover VSCode and local LLMs.

Installing Ollama

Ollama is a simple tool that lets you run models locally, assuming you have the required hardware. It can run in a server mode, providing local API endpoints for tools like the VSCode AI assist extensions. The Ollama team provides guidance on the different models and the VRAM they require.

Download and install Ollama. It can run as a CLI tool, or it can stay resident in the system tray or menu bar, depending on the platform.
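
One way to install it on a Mac, as a minimal sketch assuming you use Homebrew (the direct download from the Ollama site works just as well):

# install the Ollama CLI via Homebrew
brew install ollama

# confirm the CLI is on your PATH
ollama --version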

Selecting Models

We want models that are tuned for coding. I used models from the Ollama catalog.

You should be judicious with the size of the models. You want them to run on the GPU and fit in video memory.  Note: You can use the same model for auto-completion and chat if you are low on space. The configs below use two models.

On Windows, you will be constrained by your NVIDIA GPU VRAM. You will need at least 8GB of VRAM for a single model plus embeddings. You can run better models, or separate chat and auto-complete models, with 12-16GB. On a Mac, the GPU can access a percentage of system memory. You will want at least 32GB of RAM on a Mac because main memory serves as VRAM on an Apple Silicon Mac.

macOS gives the GPU access to two-thirds of system memory on Macs with 36GB or less and three-quarters on machines with 48GB or more. A 96GB Mac therefore has 72GB available to the GPU. Some of that memory is needed for overhead beyond the model weights themselves.

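To estimate how much of your Mac's memory the GPU can use, start from the physical RAM. A quick sketch from the terminal (the awk step just converts bytes to GB):

# print total physical RAM in GB on macOS
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1073741824}'
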
The image to the right shows GPU usage spiking when I run code assist queries on an M1 MacBook.

Downloading models with Ollama

Ollama has 19 coding-oriented models in its catalog at the time of this article. You'll want to investigate which models to use based on accuracy, GPU memory, and disk space.

Pull the models you want to use for code completion with

ollama pull <your_model>


I pulled codegemma (5GB on disk) and codeqwen (4.2GB on disk).  The commands for these two models were


ollama pull codegemma
ollama pull codeqwen

We can see the available models with the ollama list command


(base) joefreeman@Joes-MBP ~ % ollama list
NAME                ID              SIZE      MODIFIED
codeqwen:latest     df352abf55b1    4.2 GB    40 minutes ago
codegemma:latest    0c96700aaada    5.0 GB    2 days ago
phi3:medium         cf611a26b048    7.9 GB    2 days ago
phi:latest          e2fd6321a5fe    1.6 GB    2 days ago
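
Once a model is pulled, you can inspect its details, such as parameter count, context length, and quantization. A quick check (recent Ollama releases print a summary; older releases require a flag such as --modelfile):

# show details for a downloaded model
ollama show codeqwen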

Then we run Ollama in server mode. This essentially creates an LLM service: the running Ollama server can serve up any of the downloaded models.


ollama serve
...
a bunch of log messages
and finally
...
...level=INFO source=server.go:632 msg="llama runner started in 0.76 seconds"
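
Before wiring up VSCode, you can confirm the server is answering on its default port of 11434. A minimal sketch against the Ollama REST API:

# list the models the running server can serve
curl http://localhost:11434/api/tags

# run a one-off, non-streaming generation against a pulled model
curl http://localhost:11434/api/generate -d '{"model": "codeqwen", "prompt": "Write a hello world function in Python", "stream": false}'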

VSCode Extension: Chat and auto-complete functions

The extension supports two modes for interacting with the LLM: chat and auto-complete. Ollama can serve multiple models simultaneously, and the VSCode extension understands this, making all models on the Ollama server/endpoint available for code assistance.

You can use different models for these two functions if you have enough local models or access to external model endpoints.  I'm only going to show local models going forward.
  1. Chat: Chat happens in a dedicated pane where you can type or copy/paste your question. The extension lets you select any of the configured LLMs. Each conversation in the chat window is bound to a single LLM definition. You can change the LLM used for a new conversation with the drop-down list in the chat pane. The following image was captured from a locally running Ollama instance with 4 available models.

  2. Autocomplete: Tab-based auto-complete that works inline as you type. It just works once it is configured correctly. The auto-complete LLM has its own configuration section. You can use the same LLM for both chat and auto-complete.
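
Because Ollama can keep more than one model loaded, you can check which models are resident in memory while chat and auto-complete are both in use. A quick sketch, assuming a recent Ollama release that includes the ps subcommand:

# list the models currently loaded by the running server
ollama ps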

Install and Configure the Continue.dev extension.

Install the Continue extension.

The extension will recommend moving the chat pane to the right-hand side, away from the primary sidebar. This makes the chat window the secondary sidebar. That worked for me.
The Continue extension may run a configuration wizard on the first install. If you miss it or need to reconfigure the extension later, you will need to edit config.json by hand. I just go straight to the JSON editor.

Click on the gear icon at the bottom of the Continue extension view, or edit the config file at ~/.continue/config.json directly.
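
If you prefer the terminal, you can open the same file directly. A small sketch, assuming the VSCode "code" command-line launcher is installed:

# open the Continue configuration file in VSCode
code ~/.continue/config.json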

Configuration on startup

You only see this screen once. After that, you always edit config.json.

The initial configuration screen looks like this. Skip the onboarding and edit config.json using the gear icon mentioned above.

~/.continue/config.json

This is the configuration file for the Continue Visual Studio Code extension. We must edit two areas to enable and configure both chat and auto-complete.

models: The models for chat

This configuration says to use the local Ollama instance and auto-detect all of the models currently available on it. We could select a single model, but we use AUTODETECT instead so that we can pick a specific model for any single chat session. You can see the model selector above.


"models": [
{
"title": "Ollama",
"provider": "ollama",
"model": "AUTODETECT"
}
],


This screenshot shows the output of a question about writing a Dart unit test.

tabAutocompleteModel: The model for inline suggestions and auto-complete

This configuration says to use this specific model for tab auto-complete. The model is available on the local Ollama server just like the models in the models list.


"tabAutocompleteModel": {
"title": "Tab Autocomplete Model",
"provider": "ollama",
"model": "codeqwen"
},


Embeddings

No configuration is required when using Ollama embeddings. At least none was offered when using a local Ollama instance.
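
If you later want to experiment with a dedicated local embedding model rather than the defaults, Ollama's catalog includes embedding models that you pull the same way as the coding models. This is optional and was not needed for the setup above; nomic-embed-text is one example from the catalog:

# optional: pull a dedicated embedding model from the Ollama catalog
ollama pull nomic-embed-text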

Complete sample config.json file

This is my complete config.json with the two entries shown above. Everything else is exactly as the extension created it, without modification.

{
  "allowAnonymousTelemetry": false,
  "models": [
    {
      "title": "Ollama",
      "provider": "ollama",
      "model": "AUTODETECT"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Tab Autocomplete Model",
    "provider": "ollama",
    "model": "codeqwen"
  },
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    },
    {
      "name": "diff",
      "params": {}
    },
    {
      "name": "terminal",
      "params": {}
    },
    {
      "name": "problems",
      "params": {}
    },
    {
      "name": "folder",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {}
    }
  ],
  "slashCommands": [
    {
      "name": "edit",
      "description": "Edit selected code"
    },
    {
      "name": "comment",
      "description": "Write comments for the selected code"
    },
    {
      "name": "share",
      "description": "Export the current chat session to markdown"
    },
    {
      "name": "cmd",
      "description": "Generate a shell command"
    },
    {
      "name": "commit",
      "description": "Generate a git commit message"
    }
  ]
}
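
Hand-edited JSON is easy to break with a stray comma. A quick sanity check before reloading VSCode, assuming python3 is available:

# fail loudly if the config is not valid JSON
python3 -m json.tool ~/.continue/config.json > /dev/null && echo "config.json is valid"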


Refresh the workspace after reconfiguring the extension

Reload the VSCode window after editing the config file. This forces an immediate configuration refresh.


cmd-shift-p (ctrl-shift-p on Windows/Linux)
Developer: Reload Window


The reload command is available in the command palette after a cmd-shift-p, as shown here.


Examples

A chat session

This chat is about a unit test. No prompt engineering was done to improve the answers.


Autocomplete in action

Autocomplete provides suggestions while you are typing in a code window. The light grey text in this image was suggested by the LLM when I clicked into a Dart unit test file.

Windows

Ollama on Windows is in pre-release at this time.

Fini

That's it. At this point, you should have AI assist in your editors and in the chat pane. All of this capability without leaving the friendly environment of your local network!

Revision History

Created: 2024-08

