Collin D. Johnson

MLX Fine-Tuning Kit

May 2026 · open-source kit

A reusable kit for fine-tuning open-source LLMs locally on Apple Silicon using MLX. Built around an M2 Max with 64GB unified memory, but works on any modern Mac with 16GB+. The repo bundles the templates, worked examples, and reference docs I copy into every new fine-tuning project.

View the repo on GitHub →

What's in it

The five-step workflow

Every project follows the same pattern, which is most of what the kit encodes:

  1. Define the behavior — Write a two-sentence before/after spec. If you can't, you're not ready to build the dataset.
  2. Build the dataset — 30 examples gets you a clear pattern, 100–300 gets real generalization. Quality over quantity. One bad example pollutes training.
  3. Baseline test — Run the base model on your test prompts before training. Skipping this is the #1 beginner mistake.
  4. Train — Run mlx_lm.lora --config lora_config.yaml. Watch train and val loss; if val starts climbing while train falls, you're overfitting.
  5. Evaluate — Compare side-by-side on prompts the model hasn't seen. Loss numbers help, but the real test is "do the outputs look like what I wanted?"
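Step 2 in practice: mlx-lm's LoRA trainer reads JSONL files (train.jsonl and valid.jsonl) from a data directory. Here's a minimal sketch of a split script, assuming the simple prompt/completion record format mlx-lm supports — the example records are made up, and the kit's prepare_data.py template is the source of truth for the exact format your config expects:

```python
import json
import random
from pathlib import Path

# Hypothetical examples; in practice, load your real before/after pairs.
examples = [
    {"prompt": f"Summarize ticket #{i} in one line.",
     "completion": f"Ticket #{i}: concise one-line summary."}
    for i in range(50)
]

def write_splits(examples, out_dir="data", valid_frac=0.1, seed=42):
    """Shuffle examples and write train.jsonl / valid.jsonl splits."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_frac))
    splits = {"valid.jsonl": shuffled[:n_valid],
              "train.jsonl": shuffled[n_valid:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / name, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    return {name: len(rows) for name, rows in splits.items()}

counts = write_splits(examples)
print(counts)  # {'valid.jsonl': 5, 'train.jsonl': 45}
```

Shuffling with a fixed seed keeps the split reproducible, so reruns of the script don't quietly leak validation examples into training.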
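The overfitting check in step 4 can be made mechanical. This is purely illustrative — mlx_lm.lora prints train and val loss to the console, and the loss curves below are made up:

```python
def looks_overfit(train_losses, val_losses, window=3):
    """Flag overfitting: val loss rising while train loss keeps
    falling over the last `window` evaluation points."""
    if len(train_losses) < window + 1 or len(val_losses) < window + 1:
        return False
    train_falling = all(train_losses[i] > train_losses[i + 1]
                        for i in range(-window - 1, -1))
    val_rising = all(val_losses[i] < val_losses[i + 1]
                     for i in range(-window - 1, -1))
    return train_falling and val_rising

# Made-up loss curves: train keeps dropping, val turns back up.
train = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5]
val = [2.2, 1.8, 1.6, 1.7, 1.9, 2.0]
print(looks_overfit(train, val))  # True
```

When this fires, the usual fixes are fewer iterations, more data, or a smaller adapter.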

Quick start

git clone https://github.com/collindjohnson/mlx-finetune-kit
mkdir -p ~/AI/finetuning/my-new-project && cd ~/AI/finetuning/my-new-project
cp ../path/to/mlx-finetune-kit/templates/* .

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip && pip install mlx-lm huggingface_hub

# Edit prepare_data.py with your dataset, then:
python prepare_data.py        # Build train/valid splits
python test_model.py --base   # Baseline outputs (don't skip)
mlx_lm.lora --config lora_config.yaml   # Train (5–15 min)
python test_model.py --tuned  # See if it worked
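For orientation, here is a sketch of what lora_config.yaml can look like. Field names follow mlx-lm's published example config, but the kit's templates/ copy is the source of truth, and every value below is an illustrative assumption:

```yaml
# Illustrative only — check the kit's lora_config.yaml template.
model: "mlx-community/Mistral-7B-Instruct-v0.3-4bit"  # any MLX-format model
train: true
data: "data"              # directory containing train.jsonl / valid.jsonl
batch_size: 4
iters: 600                # a few hundred iterations is often enough for LoRA
learning_rate: 1e-5
num_layers: 16            # how many layers get LoRA adapters
adapter_path: "adapters"  # where trained adapter weights are saved
```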

When not to fine-tune

Honest reminder: most "I want a custom AI" use cases are better served by better prompting, RAG, or few-shot examples in the prompt. Fine-tune when you need a consistent output format prompting can't reliably produce, you're running cheaply at scale, you need offline/private operation, or you want a small specialized model that beats a big general one on a narrow task.