Self-Hosting LLMs: A Practical Guide to Ollama
Large Language Models (LLMs) like GPT-4, Claude, and others have become essential tools for developers, but they come with costs, privacy concerns, and a dependence on an internet connection. Ollama offers a way to run LLMs on your own computer instead. This guide will walk you through what Ollama is, why you might want to use it, and how to get started.
What is Ollama?
Ollama is a free, open-source tool that makes it easy to download and run large language models on your own computer. Think of it like having ChatGPT or Claude running locally on your machine instead of accessing them through a website.
Instead of sending your code or questions to external services, everything stays on your computer. You can use models like Llama, Mistral, CodeLlama, and others. Ollama handles all the complicated setup - you just use simple commands to download and run models.
Why Self-Host?
Before diving into the technical details, let’s consider when self-hosting makes sense:
Privacy and data control - Your code, questions, and data never leave your computer. This is important if you’re working with client code, proprietary information, or just want more privacy.
Cost considerations - API calls to services like ChatGPT or Claude add up quickly, especially if you use them heavily. Local models have no per-use costs, though you do need decent hardware upfront.
Learning and experimentation - Running models locally helps you understand how LLMs actually work. You can experiment freely without worrying about API costs piling up.
Works offline - Once downloaded, models work without internet. Useful if you travel or have unreliable connectivity.
Customization options - You can fine-tune models for your specific needs or modify them for particular use cases.
Trade-offs to consider:
- Cloud models like GPT-4 or Claude Opus are generally more capable than local models
- You need decent computer hardware to run local models well
- When you need the absolute best results, cloud APIs are still your best bet
- Sharing access with a team is harder with local models
Hardware Requirements
Let’s be realistic about what you need. LLMs need decent hardware to run well.
Minimum for basic models (7B parameters):
- 16GB RAM
- Modern CPU (Intel i5/i7 or AMD Ryzen 5/7 from the last few years)
- 10-20GB free disk space per model
- No GPU required, but it helps a lot
Recommended for better performance:
- 32GB+ RAM
- Dedicated GPU with 16GB+ VRAM (NVIDIA or AMD - both work well with Ollama)
- 50GB+ free disk space for multiple models
- SSD storage for faster loading
Reality check: Smaller models (7B-13B parameters) can run on decent consumer hardware but won’t match GPT-4 or Claude in quality. Larger models (30B-70B) need serious hardware and still won’t reach that level. Set your expectations accordingly.
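If you're not sure what your machine has, a few standard system commands will tell you. These are ordinary OS utilities, not part of Ollama (Linux shown; the nvidia-smi line assumes an NVIDIA GPU with drivers installed):
# Total and available RAM
free -h
# Free disk space
df -h
# GPU model and VRAM usage (NVIDIA only)
nvidia-smi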
Installing Ollama
Ollama works on Linux, macOS, and Windows.
On Linux (example for Ubuntu/Debian):
curl -fsSL https://ollama.ai/install.sh | sh
On macOS:
# Using Homebrew
brew install ollama
On Windows: Download the installer from ollama.ai and run it.
After installation, open a terminal and verify it works:
ollama --version
The Ollama service should start automatically. If not:
# Linux/macOS
ollama serve
# Windows: Usually runs as a system service automatically
# If needed, you can start it from PowerShell/Windows Terminal
Note for Windows users: Throughout this guide, when you see commands to type, use PowerShell or Windows Terminal to run them.
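To confirm the background service is actually reachable, you can also hit its local endpoint; by default Ollama listens on port 11434:
# Should respond with "Ollama is running"
curl http://localhost:11434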
Downloading Your First Model
Ollama makes this straightforward. Let’s start with a popular, reasonably-sized model.
Example: Download Llama 3.1 (8B parameter version):
ollama pull llama3.1
This downloads the model (several GB) and sets it up. The first time takes a while depending on your internet connection.
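Most library models come in several sizes, selected with a tag after the model name. For example, llama3.1 is published in 8B and 70B variants (check the library page for the tags currently offered):
# Explicitly pull the 8B variant (the default tag at the time of writing)
ollama pull llama3.1:8b
# The 70B variant - only practical with a powerful GPU and plenty of RAM
ollama pull llama3.1:70b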
Other popular models to try:
# Mistral - good balance of performance and size
ollama pull mistral
# CodeLlama - optimized for code generation
ollama pull codellama
# Phi-3 - smaller, faster, decent quality
ollama pull phi3
You can see available models at ollama.ai/library.
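Ollama also has simple commands for managing what you've downloaded, which helps given how much disk space models consume:
# List downloaded models and their sizes
ollama list
# Remove a model you no longer need
ollama rm codellama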
Basic Usage
Once you have a model, you can interact with it directly from the command line:
ollama run llama3.1
This starts an interactive chat session. Type your questions and press Enter.
Example interaction (the prompts and responses below are illustrative):
>>> Explain what a Python decorator is
A decorator in Python is a function that modifies the behavior of another
function...
>>> Write a simple example
def my_decorator(func):
    def wrapper():
        print("Before function")
        func()
        print("After function")
    return wrapper
...
Type /bye to exit the chat.
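You don't have to use the interactive chat at all. Passing the prompt as an argument returns a single answer and exits, and you can pipe a file in as extra context (notes.txt is just a placeholder file name):
# One-shot question, no interactive session
ollama run llama3.1 "Explain what a Python decorator is in two sentences"
# Summarize the contents of a file
cat notes.txt | ollama run llama3.1 "Summarize this text:"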
Using Open WebUI (For Those Who Prefer a Visual Interface)
If you’re not comfortable with the command line, Open WebUI provides a ChatGPT-like interface for Ollama.
Installing Open WebUI with Docker:
The easiest way to install Open WebUI is using Docker. First, make sure you have Docker installed on your system (download from docker.com).
Then run this command:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
Once it’s running, open your web browser and go to:
http://localhost:3000
You’ll see a clean interface similar to ChatGPT. The first time you visit, you’ll need to create an account (this is just a local account on your computer, not an online account).
Using Open WebUI:
- After logging in, you’ll see a chat interface
- Click the model dropdown at the top to select which Ollama model to use
- Type your questions in the chat box
- The interface keeps your conversation history organized
- You can create multiple conversations for different topics
This is much more user-friendly than the command line, especially if you’re used to ChatGPT’s interface.
Note: Open WebUI runs in Docker, so you’ll need to have Docker running whenever you want to use it. Ollama also needs to be running in the background.
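If the interface ever fails to load, a couple of standard Docker commands will tell you whether the container is up and let you restart it or inspect its logs:
# Is the open-webui container running?
docker ps --filter name=open-webui
# Start it again if it has stopped
docker start open-webui
# Check the logs for errors
docker logs open-webui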
Creating a Desktop Shortcut:
After the initial Docker setup, the container keeps running automatically (because of the --restart always flag). You don’t need to run that long Docker command again.
To make it easy to access Open WebUI, create a desktop shortcut:
On Windows:
- Right-click on your desktop and select New → Shortcut
- For the location, enter: http://localhost:3000
- Click Next
- Name it “Open WebUI” or whatever you prefer
- Click Finish
Now you can double-click the shortcut to open Open WebUI in your browser anytime.
On macOS:
- Open Safari and go to http://localhost:3000
- Go to File → Add to Dock
- The icon will appear in your Dock for quick access
On Linux:
Create a desktop file at ~/.local/share/applications/open-webui.desktop:
[Desktop Entry]
Name=Open WebUI
Exec=xdg-open http://localhost:3000
Type=Application
Icon=web-browser
As long as Docker is running, clicking your shortcut will take you straight to Open WebUI.
Practical Use Cases
Here’s where local LLMs typically work well:
Code explanation and documentation - Understanding unfamiliar code or generating docstrings. Local models handle this reasonably well.
Boilerplate generation - Creating standard code structures, config files, or repetitive code patterns.
Local development assistance - Quick questions without leaving your editor or consuming API credits.
Learning and experimentation - Trying different prompts, testing model capabilities, or learning about LLMs without cost concerns.
Privacy-sensitive work - Analyzing proprietary code or working with confidential information.
Where local models struggle:
- Complex reasoning tasks
- Very long context (they have smaller context windows than top-tier cloud models)
- Latest information (they have training cutoffs like all LLMs)
- Highly specialized or domain-specific tasks
Model Selection Guide
Different models have different strengths. Here’s a general comparison:
Llama 3.1 (8B) - Good all-rounder. Decent at code and general tasks. Reasonable resource requirements.
Mistral (7B) - Similar to Llama, sometimes better at reasoning tasks. Worth trying both.
CodeLlama (7B-13B) - Optimized for code generation. Better than general models for programming tasks but not dramatically so.
Phi-3 (3.8B) - Smaller and faster, lower quality but can run on modest hardware.
Larger models (30B+) - Better quality but require more resources. Even these won’t match GPT-4 or Claude Opus.
Start with a 7B-8B model and see if it meets your needs before investing in hardware for larger models.
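A quick way to compare models on your own hardware is to run the same prompt with timing statistics enabled; the --verbose flag prints token counts and generation speed after each response (the prompt here is arbitrary):
ollama run llama3.1 --verbose "Write a haiku about recursion"
Comparing the reported generation speed across models gives you a rough feel for what your machine handles comfortably.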
Performance Tips
Use a GPU if available - Makes a dramatic difference. Both NVIDIA (with CUDA) and AMD GPUs work well with Ollama.
Close unnecessary applications - LLMs are memory-hungry. Free up RAM before running models.
Use appropriate model sizes - Bigger isn’t always better if it makes your system sluggish.
Adjust context length - Smaller context windows use less memory and run faster.
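For example, inside an interactive ollama run session you can shrink the context window for that session (2048 here is just an illustration; defaults vary by model):
/set parameter num_ctx 2048
The same setting can be baked into a reusable variant via a Modelfile (PARAMETER num_ctx 2048) and ollama create, if you'd rather not set it every time.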
Integration with Development Tools
Ollama can integrate with various tools:
VS Code - Several extensions let you use Ollama models directly in your editor. Search for “Ollama” in the VS Code marketplace.
Continue.dev - An open-source Copilot alternative that works with Ollama models.
Custom integrations - Use Ollama’s API to build your own tools or integrate with existing workflows.
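Ollama serves a local HTTP API on port 11434, which is what most editor integrations use under the hood. A minimal request from the command line looks like this (the prompt is just a placeholder):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a one-line description of a hash map",
  "stream": false
}'
The response comes back as JSON, so it's straightforward to script against from any language or tool.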
Cost-Benefit Analysis
Let’s be realistic about costs:
Cloud APIs (example costs, not current pricing):
- $0.01-0.03 per 1K tokens (varies by model)
- Heavy usage: $50-200/month easily
- Scales with usage
- Zero upfront cost
Self-hosting with Ollama:
- Hardware cost: $0 if you already have decent hardware, roughly $500-2,000 for a capable GPU
- One-time setup time investment
- Fixed cost regardless of usage volume
Self-hosting makes economic sense if:
- You have existing capable hardware
- You use LLMs heavily (10K+ API calls/month)
- Privacy concerns justify the investment
- You want to learn about LLM operations
It may not make sense if:
- You’re on a budget and have modest hardware
- Your usage is light or occasional
- You need the highest quality outputs
Limitations to Be Aware Of
Local models are less capable - Even the best local models don’t match GPT-4, Claude Opus, or Gemini Ultra in quality, reasoning, or knowledge.
Hardware requirements are real - Don’t expect good performance on a basic laptop with 8GB RAM.
No magic solutions - Local LLMs have the same fundamental limitations as cloud models: hallucinations, knowledge cutoffs, inconsistency.
Maintenance required - You’re responsible for updates, troubleshooting, and managing disk space.
Limited context windows - Local models typically have smaller context windows than top-tier cloud models.
When to Use Cloud vs. Local
My practical approach:
Use cloud APIs (Claude, GPT-4, etc.) for:
- Complex reasoning tasks
- Important code generation
- When quality matters most
- Production applications
- Tasks requiring large context
Use local Ollama models for:
- Code explanation and exploration
- Boilerplate generation
- Learning and experimentation
- Privacy-sensitive analysis
- High-volume simple tasks
Many developers use both: local models for day-to-day work and cloud APIs when quality is critical.
Getting Started Checklist
- Verify your hardware meets minimum requirements
- Install Ollama following instructions for your OS
- Download a model - start with llama3.1 or mistral
- Test basic usage with ollama run model-name, or install Open WebUI for a visual interface
- Integrate with your workflow - add to your editor or build custom tools
- Evaluate quality for your specific use cases
- Decide if local models meet your needs or if you need cloud models
Resources and Next Steps
Official documentation: ollama.ai/docs
Model library: ollama.ai/library - browse available models
Community: GitHub discussions and Discord (links on Ollama website)
Learning more:
- Experiment with different models and compare results
- Read model cards to understand training and capabilities
- Monitor your resource usage to understand costs
- Join local LLM communities to learn from others
Final Thoughts
Ollama makes self-hosting LLMs accessible, but it’s not a silver bullet. Local models are useful tools with real limitations. They won’t replace cloud APIs for demanding tasks, but they offer privacy, cost savings for high volume use, and valuable learning opportunities.
Start small, experiment with different models, and find the right balance between local and cloud models for your needs. For me, having local models available has been useful for certain tasks, but I still use cloud APIs when I need the highest quality outputs.
The technology is evolving rapidly. Models get better and more efficient regularly. What’s true today about performance and capabilities will change. Stay curious and keep experimenting.