Self-Hosting LLMs: A Practical Guide to Ollama
Large Language Models (LLMs) like GPT-4, Claude, and others have become essential tools for developers, but they come with costs, privacy concerns, and a dependence on an internet connection. Ollama offers a way to run LLMs on your own computer instead. This guide will walk you through what Ollama is, why you might want to use it, and how to get started.
What is Ollama?
Ollama is a free, open-source tool that makes it easy to download and run large language models on your own computer. Think of it like having ChatGPT or Claude running locally on your machine instead of accessing them through a website.
Instead of sending your code or questions to external services, everything stays on your computer. You can use models like Llama, Mistral, CodeLlama, and others. Ollama handles all the complicated setup - you just use simple commands to download and run models.
Why Self-Host?
Before diving into the technical details, let’s consider when self-hosting makes sense:
Privacy and data control - Your code, questions, and data never leave your computer. This is important if you’re working with client code, proprietary information, or just want more privacy.
Cost considerations - API calls to services like ChatGPT or Claude add up quickly, especially if you use them heavily. Local models have no per-use costs, though you do need decent hardware upfront.
Learning and experimentation - Running models locally helps you understand how LLMs actually work. You can experiment freely without worrying about API costs piling up.
Works offline - Once downloaded, models work without internet. Useful if you travel or have unreliable connectivity.
Customization options - You can fine-tune models for your specific needs or modify them for particular use cases.
Trade-offs to consider:
- Cloud models like GPT-4 or Claude Opus are generally more capable than local models
- You need decent computer hardware to run local models well
- When you need the absolute best results, cloud APIs are still your best bet
- Sharing access with a team is harder with local models
Hardware Requirements
Let’s be realistic about what you need. LLMs need decent hardware to run well.
Minimum for basic models (7B parameters):
- 16GB RAM
- Modern CPU (Intel i5/i7 or AMD Ryzen 5/7 from the last few years)
- 10-20GB free disk space per model
- No GPU required, but it helps a lot
Recommended for better performance:
- 32GB+ RAM
- Dedicated GPU with 16GB+ VRAM (NVIDIA or AMD - both work well with Ollama)
- 50GB+ free disk space for multiple models
- SSD storage for faster loading
Reality check: Smaller models (7B-13B parameters) can run on decent consumer hardware but won’t match GPT-4 or Claude in quality. Larger models (30B-70B) need serious hardware and still won’t reach that level. Set your expectations accordingly.
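If you're not sure what your machine has, a few standard system commands will tell you. These are ordinary OS utilities, not part of Ollama (Linux shown; the nvidia-smi line assumes an NVIDIA GPU with drivers installed):
# Total and available RAM
free -h
# Free disk space
df -h
# GPU model and VRAM usage (NVIDIA only)
nvidia-smi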
Installing Ollama
Ollama works on Linux, macOS, and Windows.
On Linux (example for Ubuntu/Debian):
curl -fsSL https://ollama.ai/install.sh | sh
On macOS:
# Using Homebrew
brew install ollama
On Windows: Download the installer from ollama.ai and run it.
After installation, open a terminal and verify it works:
ollama --version
The Ollama service should start automatically. If not:
# Linux/macOS
ollama serve
# Windows: Usually runs as a system service automatically
# If needed, you can start it from PowerShell/Windows Terminal
Note for Windows users: Throughout this guide, when you see commands to type, use PowerShell or Windows Terminal to run them.
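To confirm the background service is actually reachable, you can also hit its local endpoint; by default Ollama listens on port 11434:
# Should respond with "Ollama is running"
curl http://localhost:11434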
Downloading Your First Model
Ollama makes this straightforward. Let’s start with a popular, reasonably-sized model.
Example: Download Llama 3.1 (8B parameter version):
ollama pull llama3.1
This downloads the model (several GB) and sets it up. The first time takes a while depending on your internet connection.
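Most library models come in several sizes, selected with a tag after the model name. For example, llama3.1 is published in 8B and 70B variants (check the library page for the tags currently offered):
# Explicitly pull the 8B variant (the default tag at the time of writing)
ollama pull llama3.1:8b
# The 70B variant - only practical with a powerful GPU and plenty of RAM
ollama pull llama3.1:70b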
Other popular models to try:
# Mistral - good balance of performance and size
ollama pull mistral
# CodeLlama - optimized for code generation
ollama pull codellama
# Phi-3 - smaller, faster, decent quality
ollama pull phi3
You can see available models at ollama.ai/library.
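Ollama also has simple commands for managing what you've downloaded, which helps given how much disk space models consume:
# List downloaded models and their sizes
ollama list
# Remove a model you no longer need
ollama rm codellama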
Basic Usage
Once you have a model, you can interact with it directly from the command line:
ollama run llama3.1
This starts an interactive chat session. Type your questions and press Enter.
Example interaction (the prompts and responses below are illustrative):
>>> Explain what a Python decorator is
A decorator in Python is a function that modifies the behavior of another
function...
>>> Write a simple example
def my_decorator(func):
    def wrapper():
        print("Before function")
        func()
        print("After function")
    return wrapper
...
Type /bye to exit the chat.
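You don't have to use the interactive chat at all. Passing the prompt as an argument returns a single answer and exits, and you can pipe a file in as extra context (notes.txt is just a placeholder file name):
# One-shot question, no interactive session
ollama run llama3.1 "Explain what a Python decorator is in two sentences"
# Summarize the contents of a file
cat notes.txt | ollama run llama3.1 "Summarize this text:"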
Using Open WebUI (For Those Who Prefer a Visual Interface)
If you’re not comfortable with the command line, Open WebUI provides a ChatGPT-like interface for Ollama.
Installing Open WebUI with Docker:
The easiest way to install Open WebUI is using Docker. First, make sure you have Docker installed on your system (download from docker.com).
Then run this command:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
Once it’s running, open your web browser and go to:
http://localhost:3000
You’ll see a clean interface similar to ChatGPT. The first time you visit, you’ll need to create an account (this is just a local account on your computer, not an online account).
Using Open WebUI:
- After logging in, you’ll see a chat interface
- Click the model dropdown at the top to select which Ollama model to use
- Type your questions in the chat box
- The interface keeps your conversation history organized
- You can create multiple conversations for different topics
This is much more user-friendly than the command line, especially if you’re used to ChatGPT’s interface.
Note: Open WebUI runs in Docker, so you’ll need to have Docker running whenever you want to use it. Ollama also needs to be running in the background.
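If the interface ever fails to load, a couple of standard Docker commands will tell you whether the container is up and let you restart it or inspect its logs:
# Is the open-webui container running?
docker ps --filter name=open-webui
# Start it again if it has stopped
docker start open-webui
# Check the logs for errors
docker logs open-webui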
Creating a Desktop Shortcut:
After the initial Docker setup, the container keeps running automatically (because of the --restart always flag). You don’t need to run that long Docker command again.
To make it easy to access Open WebUI, create a desktop shortcut:
On Windows:
- Right-click on your desktop and select New → Shortcut
- For the location, enter: http://localhost:3000
- Click Next
- Name it “Open WebUI” or whatever you prefer
- Click Finish
Now you can double-click the shortcut to open Open WebUI in your browser anytime.
On macOS:
- Open Safari and go to http://localhost:3000
- Go to File → Add to Dock
- The icon will appear in your Dock for quick access
On Linux:
Create a desktop file at ~/.local/share/applications/open-webui.desktop:
[Desktop Entry]
Name=Open WebUI
Exec=xdg-open http://localhost:3000
Type=Application
Icon=web-browser
As long as Docker is running, clicking your shortcut will take you straight to Open WebUI.
Practical Use Cases
Here’s where local LLMs typically work well:
Code explanation and documentation - Understanding unfamiliar code or generating docstrings. Local models handle this reasonably well.
Boilerplate generation - Creating standard code structures, config files, or repetitive code patterns.
Local development assistance - Quick questions without leaving your editor or consuming API credits.
Learning and experimentation - Trying different prompts, testing model capabilities, or learning about LLMs without cost concerns.
Privacy-sensitive work - Analyzing proprietary code or working with confidential information.
Where local models struggle:
- Complex reasoning tasks
- Very long context (they have smaller context windows than top-tier cloud models)
- Latest information (they have training cutoffs like all LLMs)
- Highly specialized or domain-specific tasks
Model Selection Guide
Different models have different strengths. Here’s a general comparison:
Llama 3.1 (8B) - Good all-rounder. Decent at code and general tasks. Reasonable resource requirements.
Mistral (7B) - Similar to Llama, sometimes better at reasoning tasks. Worth trying both.
CodeLlama (7B-13B) - Optimized for code generation. Better than general models for programming tasks but not dramatically so.
Phi-3 (3.8B) - Smaller and faster, lower quality but can run on modest hardware.
Larger models (30B+) - Better quality but require more resources. Even these won’t match GPT-4 or Claude Opus.
Start with a 7B-8B model and see if it meets your needs before investing in hardware for larger models.
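A quick way to compare models on your own hardware is to run the same prompt with timing statistics enabled; the --verbose flag prints token counts and generation speed after each response (the prompt here is arbitrary):
ollama run llama3.1 --verbose "Write a haiku about recursion"
Comparing the reported generation speed across models gives you a rough feel for what your machine handles comfortably.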
Performance Tips
Use a GPU if available - Makes a dramatic difference. Both NVIDIA (with CUDA) and AMD GPUs work well with Ollama.
Close unnecessary applications - LLMs are memory-hungry. Free up RAM before running models.
Use appropriate model sizes - Bigger isn’t always better if it makes your system sluggish.
Adjust context length - Smaller context windows use less memory and run faster.
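For example, inside an interactive ollama run session you can shrink the context window for that session (2048 here is just an illustration; defaults vary by model):
/set parameter num_ctx 2048
The same setting can be baked into a reusable variant via a Modelfile (PARAMETER num_ctx 2048) and ollama create, if you'd rather not set it every time.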
Integration with Development Tools
Ollama can integrate with various tools:
VS Code - Several extensions let you use Ollama models directly in your editor. Search for “Ollama” in the VS Code marketplace.
Continue.dev - An open-source Copilot alternative that works with Ollama models.
Custom integrations - Use Ollama’s API to build your own tools or integrate with existing workflows.
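Ollama serves a local HTTP API on port 11434, which is what most editor integrations use under the hood. A minimal request from the command line looks like this (the prompt is just a placeholder):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a one-line description of a hash map",
  "stream": false
}'
The response comes back as JSON, so it's straightforward to script against from any language or tool.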
Cost-Benefit Analysis
Let’s be realistic about costs:
Cloud APIs (example costs, not current pricing):
- $0.01-0.03 per 1K tokens (varies by model)
- Heavy usage: $50-200/month easily
- Scales with usage
- Zero upfront cost
Self-hosting with Ollama:
- Hardware cost: $0 if you already have decent hardware, roughly $500-2,000 for a capable GPU
- One-time setup time investment
- Fixed cost regardless of usage volume
Self-hosting makes economic sense if:
- You have existing capable hardware
- You use LLMs heavily (10K+ API calls/month)
- Privacy concerns justify the investment
- You want to learn about LLM operations
It may not make sense if:
- You’re on a budget and have modest hardware
- Your usage is light or occasional
- You need the highest quality outputs
Limitations to Be Aware Of
Local models are less capable - Even the best local models don’t match GPT-4, Claude Opus, or Gemini Ultra in quality, reasoning, or knowledge.
Hardware requirements are real - Don’t expect good performance on a basic laptop with 8GB RAM.
No magic solutions - Local LLMs have the same fundamental limitations as cloud models: hallucinations, knowledge cutoffs, inconsistency.
Maintenance required - You’re responsible for updates, troubleshooting, and managing disk space.
Limited context windows - Local models typically have smaller context windows than top-tier cloud models.
When to Use Cloud vs. Local
My practical approach:
Use cloud APIs (Claude, GPT-4, etc.) for:
- Complex reasoning tasks
- Important code generation
- When quality matters most
- Production applications
- Tasks requiring large context
Use local Ollama models for:
- Code explanation and exploration
- Boilerplate generation
- Learning and experimentation
- Privacy-sensitive analysis
- High-volume simple tasks
Many developers use both: local models for day-to-day work and cloud APIs when quality is critical.
Getting Started Checklist
- Verify your hardware meets minimum requirements
- Install Ollama following instructions for your OS
- Download a model - start with llama3.1 or mistral
- Test basic usage with ollama run model-name, or install Open WebUI for a visual interface
- Integrate with your workflow - add to your editor or build custom tools
- Evaluate quality for your specific use cases
- Decide if local models meet your needs or if you need cloud models
Resources and Next Steps
Official documentation: ollama.ai/docs
Model library: ollama.ai/library - browse available models
Community: GitHub discussions and Discord (links on Ollama website)
Learning more:
- Experiment with different models and compare results
- Read model cards to understand training and capabilities
- Monitor your resource usage to understand costs
- Join local LLM communities to learn from others
Final Thoughts
Ollama makes self-hosting LLMs accessible, but it’s not a silver bullet. Local models are useful tools with real limitations. They won’t replace cloud APIs for demanding tasks, but they offer privacy, cost savings for high volume use, and valuable learning opportunities.
Start small, experiment with different models, and find the right balance between local and cloud models for your needs. For me, having local models available has been useful for certain tasks, but I still use cloud APIs when I need the highest quality outputs.
The technology is evolving rapidly. Models get better and more efficient regularly. What’s true today about performance and capabilities will change. Stay curious and keep experimenting.