MLX Fine-Tuning Kit
May 2026 · open-source kit
A reusable kit for fine-tuning open-source LLMs locally on Apple Silicon using MLX. Built around an M2 Max with 64GB unified memory, but works on any modern Mac with 16GB+. The repo bundles the templates, worked examples, and reference docs I copy into every new fine-tuning project.
What's in it
- `templates/` — Drop-in starter files for a new project: `lora_config.yaml`, `prepare_data.py`, `test_model.py`.
- `examples/` — End-to-end worked examples (e.g., an SEO meta description generator) showing the dataset shape and full workflow.
- `docs/` — Reference guides on dataset design, deployment to Ollama/GGUF, and what to do when training looks weird.
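A starter `lora_config.yaml` might look like the sketch below. The field names follow the example LoRA config shipped with mlx-lm; the model id and all values here are placeholder assumptions, not the kit's actual defaults — check them against your installed mlx-lm version.

```yaml
# Illustrative values only — tune for your dataset and machine.
model: "mlx-community/Mistral-7B-Instruct-v0.3-4bit"  # hypothetical model id
train: true
data: "data"              # directory containing train.jsonl / valid.jsonl
batch_size: 4
iters: 600
learning_rate: 1e-5
steps_per_eval: 50        # how often validation loss is reported
adapter_path: "adapters"  # where the LoRA weights land
```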
The five-step workflow
Every project follows the same pattern, which is most of what the kit encodes:
- Define the behavior — Write a two-sentence before/after spec. If you can't, you're not ready to build the dataset.
- Build the dataset — 30 examples gets you a clear pattern; 100–300 gets real generalization. Quality over quantity: one bad example pollutes training.
- Baseline test — Run the base model on your test prompts before training. Skipping this is the #1 beginner mistake.
- Train — `mlx_lm.lora --config lora_config.yaml`. Watch train and val loss; if val loss starts climbing while train loss falls, you're overfitting.
- Evaluate — Compare outputs side-by-side on prompts the model hasn't seen. Loss numbers help, but the real test is "do the outputs look like what I wanted?"
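The dataset step above boils down to writing JSONL splits that `mlx_lm.lora` can read. A minimal sketch of what a `prepare_data.py` might do — the raw records, the prompt template, and the 90/10 split ratio are all assumptions for illustration (mlx-lm also accepts chat-style record formats):

```python
import json
import random
from pathlib import Path

# Hypothetical raw examples — replace with your own (input, output) pairs.
RAW = [
    ("Page about hiking boots",
     "Durable hiking boots for every trail. Shop waterproof styles today."),
    ("Page about trail mix",
     "Small-batch trail mix with roasted nuts and dried fruit. Order online."),
] * 20  # pretend we have 40 examples

def to_record(prompt: str, completion: str) -> dict:
    # mlx_lm.lora accepts {"text": ...} records; apply a simple template.
    return {"text": f"Q: {prompt}\nA: {completion}"}

def main(out_dir: str = "data", valid_frac: float = 0.1, seed: int = 0) -> None:
    rng = random.Random(seed)
    records = [to_record(p, c) for p, c in RAW]
    rng.shuffle(records)  # deterministic shuffle before splitting
    n_valid = max(1, int(len(records) * valid_frac))
    splits = {"valid.jsonl": records[:n_valid], "train.jsonl": records[n_valid:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / name, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    main()
```

The fixed seed matters: you want the same validation examples across runs so loss numbers stay comparable.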
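The overfitting heuristic in the train step can be made mechanical. A sketch of a small helper — the window size and the comparison rule are arbitrary choices of mine, not anything mlx-lm ships:

```python
def looks_overfit(train_losses: list[float], val_losses: list[float],
                  window: int = 3) -> bool:
    """Heuristic: val loss rose on each of the last `window` evals
    while train loss kept falling over the same span."""
    if len(train_losses) < window + 1 or len(val_losses) < window + 1:
        return False  # not enough history to judge
    val_rising = all(
        val_losses[i] < val_losses[i + 1] for i in range(-window - 1, -1)
    )
    train_falling = train_losses[-1] < train_losses[-window - 1]
    return val_rising and train_falling
```

Feed it the loss values `mlx_lm.lora` prints each eval; when it flips to true, stop and keep the last adapter checkpoint saved before the divergence.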
Quick start
```shell
git clone https://github.com/collindjohnson/mlx-finetune-kit
mkdir -p ~/AI/finetuning/my-new-project && cd ~/AI/finetuning/my-new-project
cp ../path/to/mlx-finetune-kit/templates/* .
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip && pip install mlx-lm huggingface_hub

# Edit prepare_data.py with your dataset, then:
python prepare_data.py                  # Build train/valid splits
python test_model.py --base             # Baseline outputs (don't skip)
mlx_lm.lora --config lora_config.yaml   # Train (5–15 min)
python test_model.py --tuned            # See if it worked
```
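The core of a `test_model.py` is the side-by-side comparison from step 5. Stripped of model loading, the logic is a sketch like the one below — `base_fn` and `tuned_fn` are stand-ins for whatever closures you build around mlx-lm's generation call:

```python
from typing import Callable, Iterable

def compare(prompts: Iterable[str],
            base_fn: Callable[[str], str],
            tuned_fn: Callable[[str], str]) -> list[dict]:
    """Run held-out prompts through both models, collect outputs for eyeballing."""
    return [
        {"prompt": p, "base": base_fn(p), "tuned": tuned_fn(p)}
        for p in prompts
    ]

def render(rows: list[dict]) -> str:
    """Plain-text side-by-side report."""
    blocks = [
        f"PROMPT: {row['prompt']}\n  base : {row['base']}\n  tuned: {row['tuned']}"
        for row in rows
    ]
    return "\n\n".join(blocks)
```

Keeping the generation callables injected like this also lets you diff two adapter checkpoints against each other, not just base vs. tuned.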
When not to fine-tune
Honest reminder: most "I want a custom AI" use cases are better served by better prompting, RAG, or few-shot examples in the prompt. Fine-tune when:
- you need a consistent output format prompting can't reliably produce,
- you're running cheaply at scale,
- you need offline/private operation, or
- you want a small specialized model that beats a big general one on a narrow task.