Key Takeaways
- QLoRA = large models on small GPUs: Fine-tune a 17B model on 12GB VRAM by training only a small set of adapter weights on top of 4-bit quantised base weights.
- Unsloth = 2× faster, 60% less VRAM: Drop-in replacement for standard HuggingFace training. Always use Unsloth for QLoRA on consumer hardware.
- Dataset quality > quantity: 300 high-quality, consistent training examples outperform 5,000 scraped, noisy ones. Curate before training.
- Export to GGUF for Ollama: After training, convert to GGUF and load in Ollama for sovereign inference — same CLI as any other local model.
Introduction
Direct Answer: How do I fine-tune Llama 4 Scout with QLoRA and Unsloth on a consumer GPU in 2026?
Install: pip install unsloth trl datasets. Load model: model, tokenizer = FastLanguageModel.from_pretrained("unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit", max_seq_length=2048, load_in_4bit=True). Add LoRA: model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32, target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]). Prepare the dataset in the model's chat format and fine-tune with SFTTrainer. After training, export: model.save_pretrained_gguf("output", tokenizer, quantization_method="q4_k_m"). Load in Ollama: ollama create my-model -f Modelfile. Total VRAM required: ~11–12GB for Llama 4 Scout 17B at 4-bit precision on a single RTX 3060 12GB.
Part 1: Environment Setup
# Install Unsloth (includes all dependencies)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --break-system-packages
pip install --no-deps trl peft accelerate bitsandbytes --break-system-packages
# Verify CUDA and Unsloth
python3 -c "
import torch, unsloth
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB')
print(f'Unsloth: {unsloth.__version__}')
"
Expected output:
PyTorch: 2.5.1+cu124
CUDA: 12.4
GPU: NVIDIA GeForce RTX 3060
VRAM: 12.0GB
Unsloth: 2025.5.6
Part 2: Dataset Preparation
# dataset_prep.py — prepare data in chat format
from datasets import Dataset
# Fine-tuning task: customer support classification
# Each example: instruction → structured JSON output
training_data = [
{
"instruction": "Classify this support ticket and suggest a response category.",
"input": "My order hasn't arrived after 2 weeks and tracking shows 'in transit'.",
"output": '{"category": "shipping_delay", "priority": "high", "department": "logistics", "suggested_response": "escalate_to_carrier"}'
},
{
"instruction": "Classify this support ticket and suggest a response category.",
"input": "I want to cancel my subscription before the next billing date.",
"output": '{"category": "subscription_cancellation", "priority": "medium", "department": "billing", "suggested_response": "process_cancellation_request"}'
},
# ... 300+ more examples ...
]
def format_prompt(example: dict) -> str:
    """Format example into Llama 4 chat format.

    Llama 4 uses <|header_start|>, <|header_end|> and <|eot|>. The Llama 3
    tokens (<|start_header_id|>, <|end_header_id|>, <|eot_id|>) are different.
    """
    return f"""<|begin_of_text|><|header_start|>system<|header_end|>
You are a customer support classifier. Analyse tickets and return structured JSON.<|eot|>
<|header_start|>user<|header_end|>
{example['instruction']}
Ticket: {example['input']}<|eot|>
<|header_start|>assistant<|header_end|>
{example['output']}<|eot|>"""
dataset = Dataset.from_list(training_data)
dataset = dataset.map(lambda x: {"text": format_prompt(x)})
# Split 90/10 train/eval
split = dataset.train_test_split(test_size=0.1, seed=42)
print(f"Train: {len(split['train'])} | Eval: {len(split['test'])}")
print(f"\nExample formatted prompt:\n{split['train'][0]['text'][:300]}...")
Expected output:
Train: 270 | Eval: 30
Example formatted prompt:
<|begin_of_text|><|header_start|>system<|header_end|>
You are a customer support classifier. Analyse tickets and return structured JSON.<|eot|>
<|header_start|>user<|header_end|>
...
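Hand-written special tokens are easy to get wrong, and they differ between model families. As an alternative, the sketch below lets the tokenizer render the prompt. It assumes the tokenizer loaded in Part 3 ships a chat template, which Hugging Face tokenizers expose through apply_chat_template. The names format_with_template and SYSTEM_PROMPT are illustrative and not part of the original script.

```python
SYSTEM_PROMPT = (
    "You are a customer support classifier. "
    "Analyse tickets and return structured JSON."
)

def format_with_template(tokenizer, example: dict) -> str:
    """Render one training example with the tokenizer's own chat template."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{example['instruction']}\nTicket: {example['input']}"},
        {"role": "assistant", "content": example["output"]},
    ]
    # tokenize=False returns the formatted string instead of token ids
    return tokenizer.apply_chat_template(messages, tokenize=False)
```

With this approach the model has to be loaded before the dataset is mapped, for example dataset.map(lambda x: {"text": format_with_template(tokenizer, x)}).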
Part 3: Load Model with Unsloth
# training.py
from unsloth import FastLanguageModel
import torch
MAX_SEQ_LENGTH = 2048
DTYPE = None # Auto-detect: bfloat16 on Ampere or newer GPUs, float16 on older ones
LOAD_IN_4BIT = True # QLoRA: quantise to 4-bit
# Load Llama 4 Scout in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit",
max_seq_length=MAX_SEQ_LENGTH,
dtype=DTYPE,
load_in_4bit=LOAD_IN_4BIT,
)
# Print memory usage after loading
used = torch.cuda.memory_allocated() / 1e9
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"VRAM after model load: {used:.1f}GB / {total:.1f}GB")
Expected output:
VRAM after model load: 9.8GB / 12.0GB
# Add LoRA adapters (trainable parameters)
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank — higher = more capacity, more VRAM
target_modules=[ # Which layers to add LoRA to
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj" # MLP
],
lora_alpha=32, # LoRA alpha (typically 2× rank)
lora_dropout=0.05,
bias="none",
use_gradient_checkpointing="unsloth", # Saves ~30% VRAM
random_state=42,
)
# Check trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable/1e6:.1f}M / {total/1e6:,.1f}M parameters ({trainable/total*100:.2f}%)")
Expected output:
Trainable: 41.9M / 16,983.0M parameters (0.25%)
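Only 0.25% of parameters are trained, so gradients and optimiser state are kept for roughly 42M weights instead of the full model, while the frozen base weights sit in memory at 4 bits each. Together, these two savings are why QLoRA fits on 12GB VRAM.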
Part 4: Training
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=split["train"],
eval_dataset=split["test"],
dataset_text_field="text",
max_seq_length=MAX_SEQ_LENGTH,
dataset_num_proc=2,
packing=False,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch size = 8
warmup_steps=10,
num_train_epochs=3,
max_steps=-1, # -1 = use num_train_epochs
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=25,
eval_steps=50,
eval_strategy="steps", # Named evaluation_strategy in older transformers releases
save_steps=100,
save_total_limit=2,
optim="adamw_8bit", # 8-bit Adam saves VRAM
weight_decay=0.01,
lr_scheduler_type="cosine",
seed=42,
output_dir="./outputs",
report_to="none", # Set to "tensorboard" for metrics
),
)
# Monitor VRAM before training starts
used = torch.cuda.memory_allocated() / 1e9
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"VRAM before training: {used:.1f}GB / {total:.1f}GB")
trainer_stats = trainer.train()
print(f"\nTraining complete!")
print(f"Time: {trainer_stats.metrics['train_runtime']:.0f}s")
print(f"Samples/sec: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")
Expected output:
VRAM before training: 11.2GB / 12.0GB
Step 25 | Loss: 1.8234 | LR: 1.95e-04
Step 50 | Loss: 1.2341 | Eval Loss: 1.3847
Step 100 | Loss: 0.9123 | Eval Loss: 1.0234 ← checkpoint saved
...
Training complete!
Time: 2847s (47 minutes)
Samples/sec: 0.57
Final loss: 0.6234
Part 5: Evaluate the Fine-Tuned Model
# evaluation.py
# Assumes `model` and `tokenizer` from training.py are still in memory
import json
import torch
from unsloth import FastLanguageModel
# Enable fast inference mode
FastLanguageModel.for_inference(model)
def classify_ticket(ticket: str) -> str:
    # Use the same system prompt and user layout as the training examples,
    # otherwise the model sees a prompt format it was not tuned on
    prompt = f"""<|begin_of_text|><|header_start|>system<|header_end|>
You are a customer support classifier. Analyse tickets and return structured JSON.<|eot|>
<|header_start|>user<|header_end|>
Classify this support ticket and suggest a response category.
Ticket: {ticket}<|eot|>
<|header_start|>assistant<|header_end|>
"""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.1, # Low temperature for structured output
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
# Test on tickets that are not in the training set
test_tickets = [
    "My payment failed three times today",
    "Can I get a refund for the premium plan?",
    "The app crashes every time I open it on iPhone 16"
]
print("=== EVALUATION ===")
for ticket in test_tickets:
    response = classify_ticket(ticket)
    print(f"\nTicket: {ticket}")
    try:
        data = json.loads(response)
        print(f" Category: {data.get('category')} | Priority: {data.get('priority')}")
    except json.JSONDecodeError:
        # Show the raw text when the model did not return valid JSON
        print(f" Raw: {response[:100]}")
Expected output:
=== EVALUATION ===
Ticket: My payment failed three times today
Category: payment_failure | Priority: high
Ticket: Can I get a refund for the premium plan?
Category: refund_request | Priority: medium
Ticket: The app crashes every time I open it on iPhone 16
Category: bug_report | Priority: high
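The evaluation loop above falls back to printing raw text whenever json.loads fails, which happens if the model wraps its JSON in extra words. A minimal sketch of a more forgiving parser, using only the standard library, is shown here. The helper name extract_json is an assumption for illustration.

```python
import json

def extract_json(text: str):
    """Return the first JSON object embedded in text, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None
```

Counting how often this helper is needed on the eval split is a useful signal in itself: a well-tuned classifier should rarely produce anything but bare JSON.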
Part 6: Export to GGUF and Load in Ollama
# export.py
# Export merged model to GGUF for Ollama
model.save_pretrained_gguf(
"support-classifier", # Output directory name
tokenizer,
quantization_method="q4_k_m" # 4-bit K-quant, a common choice for models served through Ollama
)
print("GGUF exported to: support-classifier/")
Expected output:
Unsloth: Merging QLoRA weights into base model...
Unsloth: Saving GGUF model to support-classifier/...
Unsloth: Quantising to Q4_K_M...
GGUF exported to: support-classifier/
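# Create Modelfile for Ollama
# The exported filename can vary by Unsloth version: run ls support-classifier/*.gguf and adjust FROM if it differs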
cat > support-classifier/Modelfile << 'EOF'
FROM ./support-classifier-q4_k_m.gguf
SYSTEM """You are a customer support ticket classifier. Given a ticket, respond with a JSON object containing: category, priority (high/medium/low), department, and suggested_response."""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF
# Load into Ollama
ollama create support-classifier-v1 -f support-classifier/Modelfile
# Test it
ollama run support-classifier-v1 "My account was charged twice this month"
Expected output:
{"category": "billing_error", "priority": "high", "department": "billing", "suggested_response": "refund_duplicate_charge"}
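Beyond the CLI, the same model can be called from application code through Ollama's local REST API. The sketch below assumes Ollama is running on its default port 11434 and that the model was created as support-classifier-v1 as above. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(ticket: str, model: str = "support-classifier-v1") -> urllib.request.Request:
    """Build a non-streaming generate request that asks for JSON output."""
    payload = {"model": model, "prompt": ticket, "stream": False, "format": "json"}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify(ticket: str) -> dict:
    """Send the ticket to Ollama and parse the model's JSON answer."""
    with urllib.request.urlopen(build_request(ticket), timeout=60) as resp:
        body = json.load(resp)
    # The generated text is returned in the "response" field
    return json.loads(body["response"])
```

Calling classify("My account was charged twice this month") should return a dictionary with the same keys as the CLI output, provided the server is up.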
Troubleshooting
CUDA out of memory
Reduce per_device_train_batch_size to 1, increase gradient_accumulation_steps to 8 to keep effective batch size at 8. Also try reducing max_seq_length from 2048 to 1024 if examples are short.
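The batch-size trade described above is simple arithmetic, and the same arithmetic gives a rough count of optimiser steps for a run. This is a sketch with illustrative helper names. The step count is an upper estimate, because trainers differ in how they round a partial final batch.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimiser step."""
    return per_device * grad_accum * num_gpus

def approx_optimizer_steps(num_examples: int, eff_batch: int, epochs: int) -> int:
    """Approximate number of optimiser steps over the whole run."""
    return math.ceil(num_examples / eff_batch) * epochs

# Batch 2 x accumulation 4 and batch 1 x accumulation 8 train identically
print(effective_batch_size(2, 4), effective_batch_size(1, 8))  # prints: 8 8
# 270 training examples, effective batch 8, 3 epochs
print(approx_optimizer_steps(270, 8, 3))  # prints: 102
```

Knowing the total step count helps when choosing logging_steps, eval_steps and save_steps, since intervals larger than the run itself never fire.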
Loss not decreasing after epoch 1
Learning rate may be too high (2e-4 is a starting point — try 1e-4). Also check dataset quality: if training examples have inconsistent output format, the model can’t converge.
GGUF export fails with OOM
The GGUF export merges the LoRA weights into the full base model, which temporarily needs far more memory than training did. Export on a machine with more RAM, or save only the adapter with model.save_pretrained("lora-only") and merge it into the base model later.
Conclusion
You have now fine-tuned Llama 4 Scout with QLoRA on a 12GB consumer GPU, exported it to GGUF, and loaded it in Ollama for sovereign local inference. The custom model produces consistent structured JSON output that the base model would generate only with complex prompting.
See RAG vs Fine-Tuning vs Prompt Engineering 2026 for the decision framework on when to fine-tune versus use RAG, and How to Install Ollama and Run LLMs Locally for Ollama setup.
People Also Ask
What LoRA rank should I use for fine-tuning?
LoRA rank (r) controls how many parameters the adapter adds. r=8 uses the least VRAM and trains fastest — good for simple style or format changes. r=16 is the standard default for most tasks — good balance of capacity and efficiency. r=32 or r=64 adds more capacity for complex domain adaptation or behaviour changes, but uses more VRAM. Start with r=16 and only increase if the model doesn’t converge or if your task requires significant behaviour change.
How long does QLoRA fine-tuning take on a consumer GPU?
On an RTX 3060 12GB with 300 training examples, 3 epochs, batch size 2: approximately 45–90 minutes. On an RTX 4090: approximately 15–30 minutes. Training time scales linearly with dataset size and epochs. For 1,000+ examples: expect 2–4 hours on RTX 3060. Unsloth’s 2× speedup means these times are roughly half of standard HuggingFace training.
Part 12: Production-Ready Fine-Tuning Workflow
Fine-tuning with QLoRA and Unsloth is powerful, but the workflow must be engineered for repeatability, safety, and efficiency.
12.1 Data curation and annotation
A successful fine-tuning run begins with high-quality examples. Curate prompts, inputs, and target outputs so they reflect the exact behavior you want from the model. Label examples consistently and avoid mixing unrelated instruction styles in the same dataset.
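Consistent labelling can be checked mechanically before any GPU time is spent. The sketch below assumes the record layout from Part 2, where each example has an "output" field holding a JSON string with four keys. The helper name find_inconsistent is illustrative.

```python
import json

# Keys taken from the example outputs in Part 2
REQUIRED_KEYS = {"category", "priority", "department", "suggested_response"}

def find_inconsistent(examples: list[dict]) -> list[tuple[int, str]]:
    """Return (index, reason) for every example whose output breaks the schema."""
    problems = []
    for i, ex in enumerate(examples):
        try:
            parsed = json.loads(ex["output"])
        except json.JSONDecodeError:
            problems.append((i, "output is not valid JSON"))
            continue
        if not isinstance(parsed, dict) or set(parsed) != REQUIRED_KEYS:
            problems.append((i, "output keys differ from the expected schema"))
    return problems
```

Running this over training_data and fixing every reported index before training is cheaper than diagnosing a loss curve that refuses to fall.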
12.2 Dataset splitting and validation
Treat your fine-tuning dataset like any other ML dataset: split it into training, validation, and test sets. Keep a holdout set for final evaluation so you can detect overfitting or unwanted behavioral drift.
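Part 2 splits the data two ways. A three-way split with a fixed seed can be sketched with the standard library as follows. The fractions and the function name are assumptions, not values from the original workflow.

```python
import random

def three_way_split(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically and return (train, val, test) lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

The test portion should stay untouched until the final evaluation. Use the validation portion for eval_dataset during training.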
12.3 Tokenisation and sequence length
Choose a tokenizer configured for the base model and inspect token lengths. Longer sequences cost more compute and increase the chance of context trimming. If your examples are too long, chunk them intelligently or create serialized prompt templates with placeholders.
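Inspecting token lengths takes only a few lines. The sketch below accepts any token-counting function so it stays independent of a particular tokenizer. The name length_report and the 95th-percentile choice are illustrative assumptions.

```python
def length_report(texts, count_tokens, max_seq_length=2048):
    """Summarise token lengths and count examples that would be truncated."""
    lengths = sorted(count_tokens(t) for t in texts)
    if not lengths:
        raise ValueError("no texts given")
    # Nearest-rank style 95th percentile, clamped to the last element
    p95 = lengths[min(len(lengths) - 1, int(0.95 * len(lengths)))]
    return {
        "max": lengths[-1],
        "p95": p95,
        "over_limit": sum(n > max_seq_length for n in lengths),
    }
```

With the tokenizer from Part 3 this could be called as length_report(split["train"]["text"], lambda t: len(tokenizer(t)["input_ids"])). If the 95th percentile sits well below 2048, lowering max_seq_length saves memory at no cost.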
12.4 Parameter-efficient tuning
QLoRA reduces memory usage by tuning a low-rank adapter on top of a frozen base model. Control the rank and the adapter size according to your hardware and the complexity of the task. Smaller ranks are faster, but may require more training examples.
12.5 Mixed precision and gradient checkpointing
Use mixed precision (bf16 or fp16) and gradient checkpointing to fit longer sequences or larger batches into limited GPU memory. With Unsloth, mixed precision is especially effective because it preserves quality while reducing the runtime footprint.
12.6 Training loop best practices
Log training loss, validation loss, learning rate, and any sample outputs. Save periodic checkpoints and keep a metadata manifest that records the dataset hash, prompt template version, and hyperparameters.
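A metadata manifest of this kind can be a plain JSON-serialisable dictionary. The sketch below hashes the dataset content so that any change to the examples changes the recorded hash. The field names are assumptions for illustration.

```python
import hashlib
import json

def dataset_hash(examples: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the examples."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def build_manifest(examples: list[dict], prompt_template_version: str, hyperparameters: dict) -> dict:
    """Collect what is needed to reproduce a training run."""
    return {
        "dataset_sha256": dataset_hash(examples),
        "num_examples": len(examples),
        "prompt_template_version": prompt_template_version,
        "hyperparameters": hyperparameters,
    }
```

Writing the manifest next to each checkpoint, for example with json.dump into the output directory, keeps the record attached to the weights it describes.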
12.7 Safety filtering
Filter out undesirable or unsafe content from the training examples. Even a few toxic or misleading examples can bias the model. Use automated content filters plus a human review pass when possible.
12.8 Evaluation metrics
Evaluation should cover task accuracy, answer quality, and instruction-following behavior. For generative outputs, score responses against a written rubric rather than a single number, and keep human review as part of the assessment process.
12.9 Deployment readiness
Package the tuned model with version metadata, quantization settings, and compatibility notes. Keep the base model version and LoRA weights separately so you can reproduce the exact tuned artifact.
Part 13: Optimising QLoRA for Latency and Cost
Efficient inference is just as important as training cost.
13.1 Quantization backends and performance
Choose the best backend for your hardware. QLoRA works well with backends that support 4-bit and 8-bit inference. Test both to see where the best quality/latency tradeoff lies.
13.2 Memory budgeting
Monitor the memory profile of the tuned model in inference mode. For edge or on-premise hosts, keep the total footprint under the available RAM plus overhead for the runtime and any other services.
13.3 Batch sizes and request patterns
Use smaller batches for interactive applications and larger batches for bulk processing. Measure latency across both patterns to choose the right deployment configuration.
13.4 Kernel and library compatibility
Pin the CUDA, ROCm, or CPU kernel versions carefully. Mismatched libraries can lead to poor performance, instability, or subtle numerical behavior changes.
Part 14: Model Safety and Guardrails
Self-hosted fine-tuning must enforce guardrails at multiple layers.
14.1 Overgeneration controls
Include stop tokens, output length caps, and quality filters in your generation pipeline. This prevents runaway outputs and keeps the model responses predictable.
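At generation time the length cap is already covered by max_new_tokens in Part 5. As an extra layer, the decoded text can be trimmed in post-processing. This is a minimal sketch with an illustrative function name, and the stop strings are whatever your prompt format uses.

```python
def apply_output_limits(text: str, stop_sequences: list[str], max_chars: int) -> str:
    """Cut text at the earliest stop sequence, then enforce a hard length cap."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty stop string would match at position 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:min(cut, max_chars)].rstrip()
```

A cap that triggers often is worth logging, since it usually means the model is not emitting its end-of-turn token reliably.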
14.2 Rejection sampling and reranking
For critical responses, generate multiple candidates and rerank them by a safety-aware discriminator or a higher-confidence criterion.
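The generate-then-rerank pattern can be sketched independently of any model. Here generate and score are caller-supplied functions and are assumptions for illustration: in practice generate would wrap the model call and score would wrap a validator or safety classifier.

```python
from typing import Callable, Optional

def best_candidate(
    generate: Callable[[], str],
    score: Callable[[str], float],
    n: int = 4,
    min_score: float = 0.0,
) -> Optional[str]:
    """Sample n candidates, reject those scoring below min_score, return the best."""
    candidates = [generate() for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    accepted = [(s, c) for s, c in scored if s >= min_score]
    if not accepted:
        return None  # the caller decides how to handle total rejection
    return max(accepted, key=lambda pair: pair[0])[1]
```

Sampling several candidates multiplies inference cost by n, so this is best reserved for the responses that matter most.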
14.3 Monitoring for drift
Track model behavior over time. When you deploy a new fine-tuned model, compare its outputs to previous versions on a fixed evaluation set.
Part 15: Iterative Refinement
Fine-tuning is rarely a one-pass process.
15.1 Feedback incorporation
Use user feedback and error logs to identify weak examples. Add corrected prompts to the training dataset and retrain incrementally.
15.2 Prompt and response replay
Store representative prompts and generated outputs. Re-run them after each tuning iteration to ensure you did not regress answer quality.
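A replay check reduces to comparing stored outputs with fresh ones. In this sketch, new_model is any callable that maps a prompt to an output string, and the function name is illustrative.

```python
def replay_diff(prompts: list[str], old_outputs: dict, new_model) -> list[dict]:
    """Re-run stored prompts and report every prompt whose output changed."""
    changes = []
    for prompt in prompts:
        new = new_model(prompt)
        old = old_outputs.get(prompt)
        if new != old:
            changes.append({"prompt": prompt, "old": old, "new": new})
    return changes
```

Exact string comparison suits structured outputs such as the JSON classifier in this guide. For free-text answers a looser comparison would be needed. Keep the temperature at zero or sampling disabled during replay, so that differences reflect the model and not the sampler.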
15.3 Model rollback strategy
Keep the last stable model available. If a new fine-tuned model performs worse in production, rollback quickly and analyze the training changes that caused the regression.
Part 16: Documentation and Governance
Your fine-tuning process should be transparent and auditable.
16.1 Metadata tracking
Record model lineage, dataset versions, hyperparameters, and evaluation scores. Store this information alongside the model artifact.
16.2 Review checkpoints
Require a review of training configuration and safety filters before promoting a model to production. This is especially important in self-hosted environments where the model can influence internal decisions.
16.3 Compliance and audit readiness
Keep a record of the data sources used for fine-tuning. For regulated or sensitive domains, document consent, provenance, and redaction steps taken during example creation.
Part 17: Evaluation at Scale
A large-scale fine-tuning project needs systematic evaluation across many dimensions.
17.1 Benchmark suites
Build benchmark suites for the key tasks your model should perform. Include both in-domain and out-of-domain examples so you can see where the model generalises and where it fails.
17.2 Adversarial and robustness tests
Test the tuned model with adversarial prompts designed to probe hallucinations, prompt injection, and format errors. This helps catch failure modes before deployment.
17.3 Responsiveness and latency
Measure the tuned model’s inference latency across the hardware you plan to deploy on. Use representative prompt lengths and batch sizes.
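A small timing harness is enough for a first latency measurement. The sketch below times any callable that takes a prompt, so it works with a local function or an HTTP client alike. The names and the single warm-up call are assumptions.

```python
import statistics
import time

def measure_latency(call, prompts: list[str], warmup: int = 1) -> dict:
    """Time call(prompt) for each prompt after a few untimed warm-up calls."""
    for prompt in prompts[:warmup]:
        call(prompt)  # warm-up: model loading and caches distort the first call
    samples = []
    for prompt in prompts:
        start = time.perf_counter()
        call(prompt)
        samples.append(time.perf_counter() - start)
    return {"p50_s": statistics.median(samples), "max_s": max(samples), "n": len(samples)}
```

Use prompts whose lengths match production traffic, because latency grows with both prompt length and the number of generated tokens.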
17.4 Cost-aware evaluation
Track the cost per request, whether that is GPU time, local CPU cycles, or energy. Compare the cost to the quality improvement you gain from tuning.
Part 18: Lifecycle Management for Tuned Models
Control how tuned models are stored, versioned, and retired.
18.1 Model registry and metadata
Keep a registry that stores model name, version, training dataset hash, hyperparameters, and evaluation summary. This registry should be the source of truth for deployments.
18.2 Rollback and canary deployments
When deploying a new tuned model, roll it out gradually. Use canary traffic and compare it to the baseline. Keep the old model available for instant rollback.
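Canary routing can be made deterministic by hashing a request or user identifier, so the same caller always reaches the same model version. This is a sketch: the labels "canary" and "stable" and the function name are illustrative.

```python
import hashlib

def route(request_id: str, canary_fraction: float) -> str:
    """Map an identifier to 'canary' or 'stable' in a stable, repeatable way."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Raising canary_fraction step by step moves more traffic to the new model without reshuffling users who were already on it, and setting it to 0.0 is an instant rollback.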
18.3 Deprecation and cleanup
Retire old tuned models once you are confident a newer version is better. Clean up disk space and update documentation to avoid accidental use of stale weights.
Part 19: Governance and Explainability
Fine-tuned models should still be auditable.
19.1 Data provenance
Record where each training example came from, why it was included, and who approved it. This is essential when compliance or internal review is required.
19.2 Explainability artifacts
For instruction-following models, preserve example prompts and outputs that illustrate the behavior. Use these artifacts to explain model decisions to stakeholders.
19.3 Ethical review
Run a review of the fine-tuning dataset for bias, privacy exposure, and regulatory risk. Engage subject matter experts when your model touches sensitive domains.
Part 20: Hardware and Resource Planning
A fine-tuning pipeline must align with available compute resources.
20.1 GPU memory planning
Estimate memory usage based on model size, batch size, sequence length, and optimizer state. QLoRA can dramatically reduce memory, but you still need to validate the planned configuration on real hardware.
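A back-of-the-envelope estimate of the static part of the memory budget is sketched below. It covers quantised base weights, adapter weights, their gradients and optimiser state, and deliberately leaves out activations, quantisation constants and CUDA overhead, which vary with sequence length and batch size. The byte sizes are assumptions: 16-bit adapters and roughly 2 bytes of optimiser state per trainable parameter for 8-bit Adam.

```python
def estimate_static_vram_gb(
    total_params: float,
    trainable_params: float,
    weight_bits: int = 4,
    adapter_bytes: int = 2,
    optimizer_bytes_per_param: int = 2,
) -> float:
    """Rough lower bound on VRAM in GB, excluding activations and overhead."""
    base = total_params * weight_bits / 8              # frozen quantised weights
    adapters = trainable_params * adapter_bytes        # LoRA weights
    grads = trainable_params * adapter_bytes           # gradients for LoRA weights only
    optimizer = trainable_params * optimizer_bytes_per_param
    return (base + adapters + grads + optimizer) / 1e9

# Parameter counts as printed in Part 3
print(f"{estimate_static_vram_gb(16_983e6, 41.9e6):.2f} GB")  # prints: 8.74 GB
```

The gap between this figure and the measured usage is the part that batch size, sequence length and gradient checkpointing control.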
20.2 CPU-only and low-memory workflows
For environments without GPUs, use CPU-friendly optimization techniques. Split data into smaller batches, use a lower LoRA rank, and choose a smaller base model if necessary.
20.3 Distributed and federated training
For large datasets or multiple teams, distribute fine-tuning across nodes. Use data-parallel training, and adjust gradient accumulation so the effective batch size matches the single-GPU configuration and the tuning process stays stable.
Part 21: Continuous Improvement and Model Refresh
Fine-tuning is best treated as a continuous improvement cycle.
21.1 Monitoring real-world performance
Track live usage, error rates, and user feedback after deployment. If the model drifts or begins to behave incorrectly, plan a refresh cycle.
21.2 Data augmentation for new domains
As new content arrives, augment your training dataset with examples that reflect the latest language and use cases. Keep the training data fresh without losing the original core behavior.
21.3 Automation of refresh pipelines
Automate dataset extraction, example labeling, and training runs with CI/CD. This minimizes manual effort and keeps the fine-tuning cadence predictable.
Part 22: Model Safety and Compliance
A self-hosted fine-tuned model can still pose compliance risks.
22.1 Sensitive data controls
Ensure that training data does not contain private or regulated information unless explicitly authorized. Use redaction and anonymization when needed.
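A first redaction pass can be done with regular expressions before human review. The patterns below are deliberately simple assumptions and will both miss some identifiers and over-match some harmless numbers, so they complement a review step and do not replace it.

```python
import re

# Simplified patterns for illustration only
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane.doe@example.com or +1 (555) 010-9999"))
# prints: Contact [EMAIL] or [PHONE]
```

Applying redact to the input field of every training example keeps the placeholders consistent, which also helps the model learn to treat them as opaque tokens.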
22.2 Deployment approvals
Require review and approval before deploying a new tuned model to production. Include security, compliance, and product owners in the approval flow.
22.3 Explainability and audit logs
Keep logs of training runs, dataset versions, and evaluation results. This is essential for answering questions such as why a model was trained and what it was trained on.
Part 23: Adapter and LoRA Architecture
Understanding the adapter architecture is important for advanced QLoRA tuning.
23.1 LoRA rank and scaling
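LoRA adds a pair of small low-rank matrices alongside each targeted weight matrix (in this guide, the attention and MLP projections) and trains only those. The rank controls how much task-specific capacity is available: a higher rank improves representation power but increases memory and compute. The update is scaled by lora_alpha / r, so changing the rank without adjusting lora_alpha also changes the strength of the adapter.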
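The cost of a given rank is easy to compute. In the standard LoRA formulation, a weight matrix of shape d_out × d_in gets two factors, A (r × d_in) and B (d_out × r), and the update B·A is multiplied by lora_alpha / r. The layer size in the example is hypothetical and not taken from Llama 4 Scout.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one weight matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

# Hypothetical 4096 x 4096 projection with the settings from Part 3
added = lora_param_count(4096, 4096, r=16)
print(added, f"{added / (4096 * 4096):.2%}")  # prints: 131072 0.78%
print(lora_scale(32, 16))  # prints: 2.0
```

Doubling r doubles the adapter parameters for that layer, which is why rank is the first knob to turn down when memory is tight.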
23.2 Layer selection
Choose which layers to adapt carefully. Many teams tune only the later transformer blocks and the output projection layers to preserve the base model’s general knowledge while adapting behavior to the task.
23.3 Multiple adapters and task switching
Use separate LoRA adapters for different tasks and switch them at runtime. This lets a single base model serve multiple behaviors without retraining.
23.4 Adapter composition
Compose adapters sequentially or in parallel for task combinations. For example, use one adapter for domain style and another for safety filtering, then merge their outputs for inference.
Part 24: Prompt Injection and Model Guardrails
Self-hosted fine-tuned models can still be vulnerable to adversarial instructions.
24.1 Input sanitization
Sanitize user inputs before they are used in prompts. Remove or escape malicious tokens that could alter the prompt’s intent.
24.2 Instruction isolation
Keep system and user instructions separate. Use a fixed system prompt that defines safety constraints and place user content in a clearly bounded section.
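Sanitisation and isolation can be combined in one small helper. The sketch strips anything that looks like a chat special token from user text and wraps the remainder in explicit delimiters. The ticket tags are a hypothetical convention, and if you adopt one, the training examples should use the same delimiters.

```python
import re

# Matches chat-style special tokens such as <|eot|> or <|header_start|>
SPECIAL_TOKEN = re.compile(r"<\|[^|>]*\|>")

def sanitize(user_text: str) -> str:
    """Remove special-token look-alikes that could fake a new chat turn."""
    return SPECIAL_TOKEN.sub("", user_text).strip()

def build_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Fixed system prompt plus user content inside a clearly bounded section."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"<ticket>\n{sanitize(user_text)}\n</ticket>"},
    ]
```

Token stripping only blocks the crudest attacks. Plain-language injections such as "ignore the rules above" pass straight through, which is why the fixed system prompt and output validation remain necessary.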
24.3 Verification prompts
Use a verification pass when the model is asked to perform sensitive actions. Ask it to summarise the input and confirm the intent before executing.
Part 25: Deployment and Runtime Safety
Deploy tuned models with infrastructure-level safety measures.
25.1 Resource isolation
Run the inference service in a container or VM with resource limits. This prevents runaway GPU or memory consumption from affecting other workloads.
25.2 Metrics for safety
Track the rate of unsafe or unexpected outputs. Use these metrics to trigger retraining or prompt adjustments when behavior degrades.
25.3 Graceful degradation
If the tuned model behaves unpredictably, fall back to the base model or a safer default prompt. This keeps the service available while reducing risk.
Part 26: Final Model Stewardship
Assign a clear steward to the fine-tuned model who is responsible for its ongoing performance, quality, and safety. This role is critical to keep the model stable over time.
Further Reading
- RAG vs Fine-Tuning vs Prompt Engineering 2026 — when to fine-tune vs RAG
- How to Install Ollama and Run LLMs Locally — load the exported GGUF model
- Best Local LLM Models for Coding 2026 — base model selection for fine-tuning
- Machine Learning with scikit-learn 2026 — simpler ML alternative when LLMs are overkill
Tested on: Ubuntu 24.04 LTS (RTX 3060 12GB, RTX 4090 24GB). Unsloth 2025.5.6, PyTorch 2.5.1, CUDA 12.4. Last verified: May 2, 2026.