How to Fine-Tune Your Own LLM Locally Using Ollama and Unsloth


Fine-tuning your own language model and running it locally might sound complicated, but with tools like Ollama and Unsloth, it’s surprisingly doable—even on a consumer-grade GPU like the Nvidia RTX 4090. In this guide, I’ll walk you through everything step-by-step, from selecting the right dataset to running your fine-tuned model locally with an OpenAI-compatible API.


🔍 Step 1: Choosing the Right Dataset

The dataset you use is critical. A well-matched dataset can help a small fine-tuned model outperform much larger ones on specific tasks. In this tutorial, I’m building a small, fast LLM that generates SQL queries based on given table data.

One of the best datasets for this task is the Synthetic Text-to-SQL dataset. It includes over 105,000 entries with:

  • Prompt
  • SQL query
  • Complexity
  • And more useful metadata
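If you want to poke at it before formatting, a quick loading snippet might look like this; the Hugging Face dataset id (gretelai/synthetic_text_to_sql) and split name are my assumptions, so check the dataset card if they've changed:

# Load the dataset from the Hugging Face Hub (dataset id is an assumption -- see the dataset card)
from datasets import load_dataset

dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(dataset)       # row count and column names
print(dataset[0])    # inspect a single record before formatting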

🧰 Step 2: Setting Up Your Environment

I’m using Ubuntu with an Nvidia RTX 4090, but you can also do this in Google Colab if you don’t have local GPU access. The hardware requirements are modest: because Unsloth combines 4-bit quantization with LoRA adapters, an 8B model fine-tunes comfortably within the 4090’s 24 GB of VRAM.

Prerequisites

You’ll need conda (or another environment manager), Python 3.10, and an Nvidia GPU with up-to-date CUDA drivers.

Install Dependencies

Create and activate your environment:

conda create -n unsloth-env python=3.10
conda activate unsloth-env

Install required libraries:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/unslothai/unsloth.git
pip install jupyter

Then launch Jupyter Notebook:

jupyter notebook
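Once you’re in a notebook, a quick sanity check confirms that PyTorch can actually see your GPU:

import torch

print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # should name your RTX 4090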

🚀 Step 3: Load the Model with Unsloth

In your Jupyter Notebook, import FastLanguageModel from Unsloth:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct",  # Llama 3 model
    max_seq_length=2048,
    load_in_4bit=True  # Reduces memory usage significantly
)

You’ll see a fun ASCII graphic when it’s successfully loaded!


🧠 Step 4: Set Up PEFT (LoRA Adapters)

Unsloth uses PEFT (Parameter-Efficient Fine-Tuning) with LoRA adapters to train only a small portion (1–10%) of the model’s parameters.

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)    # make the 4-bit model trainable
lora_config = LoraConfig(...)                     # rank, alpha, dropout, target modules -- see the sketch below
model = get_peft_model(model, lora_config)        # wrap the base model with LoRA adapters

No need to retrain the entire model—saving you time, compute, and energy.
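For reference, here’s what a filled-in LoraConfig might look like. The rank, alpha, dropout, and target modules below are typical values for Llama-style models rather than settings prescribed by this tutorial, so treat them as a starting point:

# Illustrative LoRA hyperparameters -- tune these for your own run
lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=16,              # scaling factor applied to the adapter output
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections in Llama
)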


📊 Step 5: Format Your Dataset for LLaMA 3

We’ll fine-tune LLaMA 3 on Alpaca-style prompts, which look like this:

### Instruction:
Generate an SQL query for the following task.

### Input:
Table: users(id, name, age)

Task: Get all users over the age of 25.

### Response:
SELECT * FROM users WHERE age > 25;

Format your dataset to follow this structure. You’ll need to extract and arrange the prompt, input, and response fields accordingly.
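Here’s one way that mapping could look, assuming you loaded the dataset as in Step 1. The column names (sql_prompt, sql_context, sql) are my reading of the Synthetic Text-to-SQL schema, so verify them against the actual dataset before running:

# Sketch: build an Alpaca-style prompt per record, then tokenize it for training
def format_example(example):
    text = (
        "### Instruction:\n"
        "Generate an SQL query for the following task.\n\n"
        "### Input:\n"
        f"{example['sql_context']}\n\n"
        f"Task: {example['sql_prompt']}\n\n"
        "### Response:\n"
        f"{example['sql']}" + tokenizer.eos_token
    )
    return {"text": text}

def tokenize_example(example):
    tokens = tokenizer(example["text"], truncation=True, max_length=2048)
    tokens["labels"] = tokens["input_ids"].copy()   # causal LM: labels mirror the inputs
    return tokens

your_dataset = (
    dataset.map(format_example)
           .map(tokenize_example, remove_columns=dataset.column_names + ["text"])
)

The your_dataset variable is what gets passed to the trainer in the next step.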


🏋️ Step 6: Fine-Tune the Model with Hugging Face Trainer

Here’s a minimal setup using Hugging Face’s Trainer:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    num_train_epochs=3,
    max_steps=1000,                  # caps training; overrides num_train_epochs when both are set
    save_steps=500,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_dir="./logs",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,      # the tokenized dataset from Step 5
    tokenizer=tokenizer
)

trainer.train()
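Checkpoints are written to ./outputs as training runs. If you’d also like to keep the trained LoRA adapters on their own before exporting, you can save them explicitly (the directory name here is just an example):

model.save_pretrained("llama3-sql-lora")      # saves only the small adapter weights
tokenizer.save_pretrained("llama3-sql-lora")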

🔁 Step 7: Convert to Ollama-Compatible Format

Once training is done, you need to convert the fine-tuned model into a format that Ollama understands.

Unsloth provides a one-liner to do this:

model.save_pretrained_gguf("llama3-sql", tokenizer, quantization_method="q4_k_m")

This writes a quantized .gguf file into the llama3-sql output folder, ready to be served by Ollama.


⚙️ Step 8: Configure Ollama with a Modelfile

Open your terminal and create a Modelfile in the same directory as your .gguf file:

touch Modelfile

Open it in your favorite code editor and add the following, pointing FROM at the exact .gguf filename Unsloth produced (it may differ from the name shown here):

FROM llama3-sql.gguf
PARAMETER temperature 0.2
SYSTEM "You are an SQL generator that takes a user's request and returns a helpful SQL query."

This file is like a Dockerfile but for language models. You define the base model and behavior here.
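With the Modelfile saved, register the model with Ollama so it’s available under a name you can run (llama3-sql here is just the tag I’m giving it):

ollama create llama3-sql -f Modelfile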


🧪 Step 9: Run the Model with Ollama

Start Ollama and load your model:

ollama run llama3-sql

And just like that, your fine-tuned LLM is now running locally, generating SQL queries on demand. Ollama also exposes an OpenAI-compatible API, so other tools can talk to it like a hosted model.
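For programmatic access, you can point any OpenAI client at Ollama’s local endpoint. This sketch uses the official openai Python package and assumes Ollama is listening on its default port, 11434:

# Query the local model through Ollama's OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # any non-empty key works

response = client.chat.completions.create(
    model="llama3-sql",
    messages=[{
        "role": "user",
        "content": "Table: users(id, name, age)\nTask: Get all users over the age of 25.",
    }],
)
print(response.choices[0].message.content)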

