Fine-Tuning LLMs Explained: What It Is, When to Use It, and When to Skip It
In this lesson: usually you should not fine-tune, the data is the work, and PEFT and QLoRA make it cheap.
Top 3 takeaways
Usually, don't fine-tune
Start with model selection, prompting, and RAG. Fine-tuning earns its place mainly when you want to make a small model good enough, cut down on prompting, or keep data private and local.
The data is the work
Fine-tuning is supervised learning, so bad training data quietly degrades the model for good. Curate a few thousand representative labeled pairs and always hold some back to catch overfitting.
PEFT and QLoRA make it cheap
Freezing most layers, LoRA rank reduction, and quantization let you fine-tune a 13B model on a laptop. From there you tune learning rate, epochs, and dropout, and judge the result on accuracy plus human eval.

Aaron Gallant
Lead Instructor, Gauntlet AI
Lead Instructor at Gauntlet AI since the beginning; 18 years in tech across software development, data science, data pipelines, and internal tooling, including five years at Google. For his day job he synthetically generates HIPAA-compliant medical research data with LLMs and works deep in fine-tuning, reinforcement learning, and synthetic data generation — including fine-tuning jobs run on-premises on real hardware — so he knows how these models behave under the hood. MS in Computer Science, University of Illinois Urbana-Champaign; BA in Political Science & Philosophy, University of Rochester.
Lesson notes
A written walkthrough of the lecture, covering the patterns, the code, and the things that trip people up.
What Fine-Tuning Actually Changes
Fine-tuning changes the model itself, not just how you use it.
Prompting, RAG, memory, and context engineering all happen during inference, where the model's underlying parameters remain unchanged. Fine-tuning modifies those parameters, creating a lasting change in how the model behaves.
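The distinction can be sketched with a toy model. This is a minimal illustration, not an LLM: the two-parameter linear model, the `predict` and `sgd_step` helpers, and the learning rate are all assumptions made up for the example. It shows that calling the model leaves its parameters alone, while a single training step on a labeled example changes them.

```python
# Toy stand-in for a model: y = w * x + b. The "parameters" are w and b.
def predict(params, x):
    return params["w"] * x + params["b"]


def sgd_step(params, x, target, lr=0.1):
    """One gradient-descent step on squared error; returns new parameters."""
    error = predict(params, x) - target
    return {
        "w": params["w"] - lr * 2 * error * x,
        "b": params["b"] - lr * 2 * error,
    }


params = {"w": 1.0, "b": 0.0}
before = dict(params)

# Inference: any number of calls, and the parameters are untouched.
outputs = [predict(params, x) for x in (1.0, 2.0, 3.0)]
unchanged_after_inference = params == before

# Training: one labeled example (x=2.0, target=5.0) moves the parameters.
tuned = sgd_step(params, x=2.0, target=5.0)
changed_after_training = tuned != before

print(unchanged_after_inference, changed_after_training)
```

Whatever you put in the prompt plays the role of `x` here; only the training step touches `w` and `b`.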
Doing this on your own hardware requires an open-weight model, whose published parameters you can keep training. Open-weight is not the same as open-source: an open-weight release exposes the trained parameters but not necessarily the full training process or code.
When to Fine-Tune
Most applications do not require fine-tuning.
Start by selecting the right model and improving your prompts. If the application depends on external knowledge, RAG is usually the better solution because building a knowledge base is much easier than creating high-quality training data.
Fine-tuning makes the most sense when you need a smaller model to perform a specialized task efficiently, when the model has to run locally for privacy reasons, or when it must make the same kind of decision consistently, as in classification tasks such as fraud detection or spam filtering.
The Fine-Tuning Process
Fine-tuning follows four steps:
- Choose a base model.
- Gather and clean training data.
- Train the model.
- Validate the results.
Most of the work happens in data preparation and evaluation. High-quality, representative data matters far more than large quantities of repetitive examples.
Always reserve a validation dataset so you can measure real performance and avoid overfitting.
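One way to hold data back, sketched with the standard library only. The `split_dataset` helper, the 20% validation fraction, and the spam examples are assumptions for illustration, not the lecture's code. The helper drops exact duplicates first, so a repeated example cannot land in both the training and validation sets.

```python
import random


def split_dataset(pairs, validation_fraction=0.2, seed=0):
    """Drop exact duplicate pairs, shuffle, and hold back a validation set."""
    unique = list(dict.fromkeys(pairs))  # keeps first-seen order
    rng = random.Random(seed)            # fixed seed makes the split repeatable
    rng.shuffle(unique)
    n_val = max(1, int(len(unique) * validation_fraction))
    return unique[n_val:], unique[:n_val]


# Hypothetical labeled pairs: (input text, label).
examples = [
    ("Win a free cruise now!!!", "spam"),
    ("Meeting moved to 3pm", "not_spam"),
    ("Win a free cruise now!!!", "spam"),  # exact duplicate, dropped
    ("Your invoice is attached", "not_spam"),
    ("Claim your prize today", "spam"),
    ("Lunch on Friday?", "not_spam"),
]

train, validation = split_dataset(examples)
print(len(train), len(validation))  # 4 1
```

The model never trains on `validation`, so its score there is the number to watch for overfitting.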
PEFT, LoRA, and QLoRA
Modern fine-tuning is practical because only a small portion of the model needs to be trained.
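Parameter-Efficient Fine-Tuning (PEFT) freezes most of the model and updates only a small number of parameters. LoRA (Low-Rank Adaptation) shrinks that trainable set further: the original weight matrices stay frozen and training learns a small low-rank update to them. Quantization is a separate saving that shrinks the model in memory by storing its weights at lower precision.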
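The arithmetic behind LoRA's saving, as a rough sketch. Instead of updating a full d × k weight matrix, LoRA trains two thin matrices, B (d × r) and A (r × k), whose product has the same shape as the original. The 4096 × 4096 layer size and rank 8 are assumed values for illustration, and the helper names are hypothetical.

```python
def full_finetune_params(d, k):
    """Trainable values when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable values for a rank-r update: B is d x r, A is r x k."""
    return r * (d + k)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][t] * b[t][j] for t in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Assumed layer size and rank, chosen only to make the ratio concrete.
full = full_finetune_params(4096, 4096)
trainable = lora_params(4096, 4096, 8)
print(full, trainable, f"{trainable / full:.2%}")

# Tiny shape check: a rank-1 update B @ A still covers a full 4 x 3 matrix.
B = [[1.0], [0.0], [2.0], [0.5]]  # 4 x 1
A = [[0.1, 0.2, 0.3]]             # 1 x 3
delta = matmul(B, A)              # 4 x 3, added to the frozen weights
```

Under these assumptions the adapter holds well under 1% of the values in the layer it modifies, which is where the training savings come from.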
Combined as QLoRA, these techniques make it possible to fine-tune large language models on consumer hardware instead of expensive GPU clusters.
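A back-of-the-envelope sketch of why quantization matters for hardware. It assumes 16-bit weights before quantization and 4-bit weights after (4-bit is the precision the QLoRA paper used), and it counts weight storage only, not activations, adapters, or optimizer state. The toy quantizer below is a simple symmetric absmax scheme chosen for readability; it is not the scheme QLoRA itself uses.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 13e9  # a 13B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(fp16_gb, int4_gb)  # 26.0 6.5


def quantize_absmax(weights, levels=7):
    """Map floats to integers in [-levels, levels] using one shared scale."""
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.82, -0.11, 0.40, -0.95, 0.03]  # made-up values
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each restored weight is off by at most half a quantization step, which is the precision traded away for a model a quarter of the size.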
Evaluation Matters More Than Training
Training a model is only part of the process. The real challenge is proving that it performs better.
Common metrics include accuracy and related classification metrics, perplexity, and alignment with human expectations. The most valuable evaluation is still human judgment against clearly defined success criteria.
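Two of these metrics are simple enough to compute by hand. The sketch below assumes you already have predictions on a held-out set and, for perplexity, the probability the model assigned to each correct token. The function names and sample values are hypothetical.

```python
import math


def accuracy(predictions, labels):
    """Fraction of held-out examples where the prediction matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def perplexity(token_probs):
    """exp of the average negative log-probability; lower is better."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Made-up held-out labels and two models' predictions.
labels = ["spam", "not_spam", "not_spam", "spam"]
base_preds = ["spam", "spam", "not_spam", "not_spam"]
tuned_preds = ["spam", "not_spam", "not_spam", "not_spam"]

base_acc = accuracy(base_preds, labels)    # 0.5
tuned_acc = accuracy(tuned_preds, labels)  # 0.75

# A model that gives every correct token probability 0.25 has perplexity 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Comparing the base and tuned models on the same held-out set shows whether fine-tuning helped. Neither number replaces a human reading the outputs.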
The lecture's main takeaway is that successful fine-tuning depends far more on data quality and evaluation than on the training algorithm itself. Models generate plausible outputs by design, making rigorous testing and human verification essential before deploying them in production.