In the growing world of AI, the ability to run large language models (LLMs) locally offers privacy, flexibility, and cost savings. One of the best tools to do this is Ollama, a free and open-source solution that allows you to download and run models like LLaMA, Mistral, and others entirely on your machine.
Whether you’re a developer, an AI enthusiast, or someone who just wants to experiment without relying on online services like ChatGPT, this guide walks you through setting up and using Ollama effectively.
💾 Installing Ollama
- Go to ollama.com and download the installer for your OS (Windows, macOS, Linux).
- Install it just like any regular app.
- On Windows, you can launch Ollama from the Start menu; on Mac or Linux, use terminal commands.
Once installed, run ollama in your terminal to verify it’s set up correctly; you should see the CLI’s usage and list of available commands.
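Ollama also runs a small local server in the background (covered in more detail later in this guide). If you prefer to confirm the setup from code, a quick request to that server works too. The snippet below is a minimal sketch, assuming the default port of 11434 and that the Python requests library is installed; a healthy install typically answers with a short "Ollama is running" message.

import requests

# The local Ollama server listens on port 11434 by default
try:
    resp = requests.get("http://localhost:11434", timeout=2)
    # Expect something like: 200 Ollama is running
    print(resp.status_code, resp.text)
except requests.ConnectionError:
    print("Ollama doesn't appear to be running on this machine.")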
🧑💻 Running Your First Model
To run a model:
ollama run llama2
Ollama will automatically download the model if it’s not already present and start an interactive session.
You can switch to other models, such as Mistral, simply by running:
ollama run mistral
List your installed models:
ollama list
Remove models:
ollama rm <model-name>
🚀 Using Ollama’s HTTP API
Ollama provides a built-in HTTP server so you can interact with models through code.
Start the server:
ollama serve
This runs the API locally (port 11434 by default); if you installed the desktop app, the server is usually already running in the background. You can then use a language like Python to interact with it:
import requests

# Ollama's chat endpoint on the default local port
url = "http://localhost:11434/api/chat"

payload = {
    "model": "mistral",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ]
}

# The response is streamed back one JSON object per line
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))
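Each streamed line is a small JSON object, with the generated text in the message.content field and a final object marked "done": true. If you only want the assistant's reply as plain text, you can parse the stream as it arrives. This is a sketch under the same assumptions as above (the mistral model and the default port):

import json
import requests

url = "http://localhost:11434/api/chat"
payload = {
    "model": "mistral",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ]
}

# Stream the reply and print only the generated text
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a fragment of the reply until "done" is true
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break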
🔧 Customizing Models
You can create custom personalities or configurations by writing a Modelfile:
FROM llama2
PARAMETER temperature 0.8
SYSTEM "You are a helpful assistant."
Then create the model:
ollama create mario -f ./Modelfile
Run it:
ollama run mario
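Custom models behave like any other model once created, so you can also call them through the HTTP API shown earlier. Here is a short sketch assuming the mario model built above, the default port, and streaming turned off so a single JSON object comes back:

import requests

# Address the custom model by the name passed to "ollama create"
payload = {
    "model": "mario",
    "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}],
    "stream": False,  # return one complete JSON response instead of a stream
}

response = requests.post("http://localhost:11434/api/chat", json=payload)
print(response.json()["message"]["content"])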
📊 Hardware Considerations
Keep in mind:
- LLaMA 2 7B: ~8GB RAM minimum
- Mistral 7B: ~12-16GB RAM
- LLaMA 65B or 70B: 48GB+ RAM or use quantized versions
You can opt for lighter models or use 4-bit quantized versions for better performance on lower-end machines; many models in the Ollama library publish quantized variants as tags you can pull directly.
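If you’re unsure whether a machine can handle a particular model, a quick comparison of available memory against the rough figures above can save a failed attempt. The snippet below is only a back-of-the-envelope sketch: it assumes the third-party psutil package is installed, and the numbers are the approximations from this section, not official requirements.

import psutil

# Rough RAM needs in GB, taken from the approximate figures above
MODEL_RAM_GB = {
    "llama2:7b": 8,
    "mistral:7b": 12,
    "llama2:70b": 48,
}

def can_probably_run(model: str) -> bool:
    """Compare currently available memory against a model's rough requirement."""
    available_gb = psutil.virtual_memory().available / (1024 ** 3)
    needed_gb = MODEL_RAM_GB.get(model)
    if needed_gb is None:
        raise ValueError(f"no estimate recorded for {model}")
    return available_gb >= needed_gb

print(can_probably_run("mistral:7b"))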
🌟 Why Use Ollama?
- Privacy: Your data stays on your device
- Cost: Completely free
- Flexibility: Use any open-source model you want
- Offline Capability: No need for internet access