Training (Docker + Axolotl)
This repo recommends training with Axolotl inside a Docker container. The train/ folder includes:
- Example Axolotl configs in `train/configs/`.
- Chat templates in `train/chat_templates/`.
- A Kubernetes job spec in `train/training-job.yml` that references the Axolotl image.
The instructions below are written to match the paths used by the configs in this repo.
Hardware recommendations
- Recommended minimum VRAM: 24GB total across all GPUs.
- Multi-GPU is fine: e.g. 2×12GB, 1×24GB, 2×16GB, etc.
- More VRAM lets you increase `sequence_len`, batch size, and/or train larger base models.
Fine-tuning approaches (full vs LoRA vs QLoRA)
All three approaches start from a base model (e.g. google/gemma-3-270m-it) and train on this repo’s synthetic dataset.
Full fine-tuning
Full fine-tuning updates all model weights.
- Pros: highest quality ceiling.
- Cons: highest VRAM/compute; largest checkpoints.
- When to use: you have ample VRAM/compute and want maximum adaptation.
LoRA (Low-Rank Adaptation)
LoRA keeps base weights frozen and trains small adapter matrices.
- Pros: much lower VRAM; fast iteration; adapters are small and easy to share.
- Cons: slightly lower ceiling than full fine-tuning.
- When to use: common default when you’re VRAM-constrained.
QLoRA
QLoRA is LoRA, but the frozen base model is loaded in 4-bit quantized form.
- Pros: lowest VRAM footprint; enables training larger models on modest GPUs.
- Cons: can be slower (dequant overhead) and sometimes trickier to tune.
- When to use: you want LoRA but need the base model to fit in memory.
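In Axolotl terms, the three approaches mostly differ in a handful of config keys. The sketch below is illustrative only: the key names are standard Axolotl options, but the rank/alpha values are placeholders and the configs in `train/configs/` remain the source of truth.

```yaml
# Illustrative sketch only -- placeholder values, not taken from this repo's configs.
# Full fine-tuning: leave `adapter` unset, so all weights are trained.

# LoRA: freeze the base model and train low-rank adapters.
adapter: lora
lora_r: 16             # adapter rank (placeholder value)
lora_alpha: 32         # adapter scaling (placeholder value)
lora_target_linear: true

# QLoRA: same adapter settings, but the frozen base model is loaded in 4-bit.
# adapter: qlora
# load_in_4bit: true
```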
Training Scripts
We now recommend using the Axolotl training suite to perform training runs.
The recommended Docker images are:
- `axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0` - CUDA 12.8 with PyTorch 2.8.0
- `axolotlai/axolotl-cloud:main-py3.11-cu126-2.8.0` - CUDA 12.6 with PyTorch 2.8.0
Both images provide the `axolotl` CLI used in the examples below. The "cloud" variants are recommended because they come with various folder mounts already set up, which makes data management easier.
Dataset Generation
The synthetic dataset is generated by scripts under data/.
- Generator: `data/generate_data.py`
- Outputs: JSONL files in `data/output/` (for example `home_assistant_train.jsonl`, `home_assistant_test.jsonl`, and `sample.jsonl`).
The example training configs in `train/configs/` are written to read a dataset file mounted at `/workspace/data/datasets/sample.jsonl`. Depending on which dataset variant you are training on, you may need to edit the config to point at the correct file.
For local Docker training you’ll typically:
- Generate a dataset JSONL under `data/output/`.
- Copy it into a host folder that you mount as `/workspace/data/datasets/`.
Training Configs
This repo currently includes:
- `train/configs/gemma3-270m.yml`
- `train/configs/functiongemma-270m.yml`
Additional configs can be found in the Axolotl repo. They will need to be adapted to use this repo’s dataset paths and potentially have their chat template adjusted.
Model selection
- `base_model`: Hugging Face model id.
- `model_type`: Transformers model class (example: `Gemma3ForCausalLM`).
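For example, using the base model mentioned earlier (check the configs in `train/configs/` for the exact values they use):

```yaml
base_model: google/gemma-3-270m-it   # Hugging Face model id
model_type: Gemma3ForCausalLM        # Transformers model class
```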
Chat formatting and tool calling
- `chat_template: jinja` and `chat_template_jinja: | ...`:
  - Defines how multi-turn chats (and tools) are rendered into training text.
  - For tool calling, this formatting matters a lot.
Related files:
- `train/chat_templates/` contains reusable templates you can adapt.
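In the config, these two options sit together. The template body below is only a toy placeholder (not the template shipped in `train/chat_templates/`), included to show the shape:

```yaml
chat_template: jinja
chat_template_jinja: |
  {# toy placeholder -- adapt a real template from train/chat_templates/ #}
  {% for message in messages %}<start_of_turn>{{ message['role'] }}
  {{ message['content'] }}<end_of_turn>
  {% endfor %}
```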
Dataset configs
- `datasets:` list points at a JSONL file, example:
  - `path: /workspace/data/datasets/sample.jsonl`
  - `ds_type: json`
  - `type: chat_template`
  - `message_field_training: train_on_turn`
  - `roles_to_train: []`
NOTE: `roles_to_train` controls which message roles are used for loss calculation. Alternatively, you can simply enable training against all "assistant" role messages by setting `roles_to_train: [assistant]`.
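Putting those keys together, a `datasets` entry looks roughly like this (fields copied from the list above; adjust the path for your dataset variant):

```yaml
datasets:
  - path: /workspace/data/datasets/sample.jsonl
    ds_type: json
    type: chat_template
    message_field_training: train_on_turn
    roles_to_train: []          # or [assistant] to train on every assistant turn
```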
Sequence length + packing
- `sequence_len`: max context length.
- `sample_packing`:
  - `true` packs multiple short samples into one sequence (better throughput).
  - `false` keeps samples separate (often easier to debug).
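For example (the length below is a placeholder; pick it based on your VRAM and the base model's context window):

```yaml
sequence_len: 2048     # placeholder; raise if VRAM allows
sample_packing: true   # pack short samples together for throughput
```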
Batch sizing
Effective batch size is calculated as:
$$\text{effective\_batch} = \text{micro\_batch\_size} \times \text{gradient\_accumulation\_steps} \times \text{num\_gpus}$$
Having the correct effective batch size is important for training stability. Adjust `micro_batch_size` until the model and optimizer states fit in VRAM, then set `gradient_accumulation_steps` to keep the effective batch size in the right range: 16, 32, or 64 are values shown to work well for this dataset.
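As a worked example (illustrative numbers, not this repo's defaults): if one GPU fits `micro_batch_size: 4`, then `gradient_accumulation_steps: 8` gives an effective batch of 4 × 8 × 1 = 32, inside the range above.

```yaml
# Illustrative: on 1 GPU, effective batch = 4 * 8 * 1 = 32
micro_batch_size: 4
gradient_accumulation_steps: 8
```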
Memory/perf helpers
- `bf16: true`: bfloat16 training on modern NVIDIA GPUs (30-series/Ada or newer).
- `gradient_checkpointing: true`: re-compute activations during the backward pass instead of storing them in VRAM (lower VRAM, more compute).
- `flash_attention: true`: faster attention when supported (almost always).
- `optimizer: adamw_bnb_8bit`: 8-bit optimizer states to save VRAM.
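Together, these options look like this in a config (all four keys come from the list above):

```yaml
bf16: true
gradient_checkpointing: true
flash_attention: true
optimizer: adamw_bnb_8bit
```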
Outputs
- `output_dir: /workspace/data/training-runs/<run-name>` stores checkpoints and TensorBoard logs.
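For example, with an arbitrary run name:

```yaml
output_dir: /workspace/data/training-runs/gemma3-270m-example   # run name is up to you
```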
Running training with Docker (local machine)
These commands assume:
- You are running on Linux with an NVIDIA GPU (or WSL2)
- You have Docker installed
- You have the NVIDIA Driver and the NVIDIA Container Toolkit set up and installed
1) Create host folders to mount
mkdir -p ./train-local/datasets
mkdir -p ./train-local/training-runs
mkdir -p ./train-local/huggingface-cache
2) Generate a dataset JSONL
python3 data/generate_data.py --sample --language english
cp data/output/sample.jsonl ./train-local/datasets/sample.jsonl
3) Run preprocess + train
docker pull axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0
docker run --rm -it \
--gpus all \
-e AXOLOTL_DO_NOT_TRACK=1 \
-e HF_HOME=/workspace/data/huggingface-cache \
-e HF_TOKEN="$HF_TOKEN" \
-v "$PWD/train-local/datasets:/workspace/data/datasets" \
-v "$PWD/train-local/training-runs:/workspace/data/training-runs" \
-v "$PWD/train-local/huggingface-cache:/workspace/data/huggingface-cache" \
-v "$PWD/train/configs:/workspace/configs" \
axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0 \
axolotl preprocess /workspace/configs/gemma3-270m.yml --debug
docker run --rm -it \
--gpus all \
-e AXOLOTL_DO_NOT_TRACK=1 \
-e HF_HOME=/workspace/data/huggingface-cache \
-e HF_TOKEN="$HF_TOKEN" \
-v "$PWD/train-local/datasets:/workspace/data/datasets" \
-v "$PWD/train-local/training-runs:/workspace/data/training-runs" \
-v "$PWD/train-local/huggingface-cache:/workspace/data/huggingface-cache" \
-v "$PWD/train/configs:/workspace/configs" \
axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0 \
axolotl train /workspace/configs/gemma3-270m.yml
Artifacts will appear under ./train-local/training-runs/.
Running on Kubernetes (e.g. cloud GPU host)
`train/training-job.yml` mounts:
- `/workspace/data/datasets` (dataset JSONL)
- `/workspace/data/training-runs` (outputs)
- `/workspace/configs` (Axolotl YAML)
- `/workspace/data/huggingface-cache` (HF cache)
It runs `axolotl preprocess ...` as an init container, then `axolotl train ...`.
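A heavily trimmed sketch of that Job shape is shown below. It is not the contents of train/training-job.yml; the volume definitions, resource requests, and names here are hypothetical.

```yaml
# Hypothetical sketch of the Job structure -- see train/training-job.yml for the real spec.
apiVersion: batch/v1
kind: Job
metadata:
  name: axolotl-train                 # hypothetical name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: preprocess            # tokenize/cache the dataset before training
          image: axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0
          command: ["axolotl", "preprocess", "/workspace/configs/gemma3-270m.yml"]
          volumeMounts:
            - name: data
              mountPath: /workspace/data
            - name: configs
              mountPath: /workspace/configs
      containers:
        - name: train
          image: axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0
          command: ["axolotl", "train", "/workspace/configs/gemma3-270m.yml"]
          resources:
            limits:
              nvidia.com/gpu: 1       # hypothetical; match your node
          volumeMounts:
            - name: data
              mountPath: /workspace/data
            - name: configs
              mountPath: /workspace/configs
      volumes:
        - name: data                  # hypothetical backing volume for datasets, runs, HF cache
          persistentVolumeClaim:
            claimName: training-data
        - name: configs               # hypothetical source for the Axolotl YAML
          configMap:
            name: axolotl-configs
```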
The helper script `train/train.sh` copies `train/configs/<MODEL_NAME>.yml` to a remote server and starts the Kubernetes Job.