# Training (Docker + Axolotl)

This repo recommends training with **Axolotl** inside a Docker container. The `train/` folder includes:

- Example Axolotl configs in `train/configs/`.
- Chat templates in `train/chat_templates/`.
- A Kubernetes job spec in `train/training-job.yml` that references the Axolotl image.

The instructions below are written to match the paths used by the configs in this repo.

## Hardware recommendations

- **Recommended minimum VRAM:** **24GB total** across all GPUs.
- Multi-GPU is fine: e.g. **2×12GB**, **1×24GB**, **2×16GB**, etc.
- More VRAM lets you increase `sequence_len`, batch size, and/or train larger base models.

## Fine-tuning approaches (full vs LoRA vs QLoRA)

All three approaches start from a **base model** (e.g. `google/gemma-3-270m-it`) and train on this repo’s synthetic dataset.

### Full fine-tuning

Full fine-tuning updates **all** model weights.

- **Pros:** highest quality ceiling.
- **Cons:** highest VRAM/compute; largest checkpoints.
- **When to use:** you have ample VRAM/compute and want maximum adaptation.

### LoRA (Low-Rank Adaptation)

LoRA keeps base weights frozen and trains small **adapter** matrices.

- **Pros:** much lower VRAM; fast iteration; adapters are small and easy to share.
- **Cons:** slightly lower ceiling than full fine-tuning.
- **When to use:** a common default when you’re VRAM-constrained.

### QLoRA

QLoRA is LoRA, but the frozen base model is loaded in **4-bit quantized** form.

- **Pros:** lowest VRAM footprint; enables training larger models on modest GPUs.
- **Cons:** can be slower (dequantization overhead) and sometimes trickier to tune.
- **When to use:** you want LoRA but need the base model to fit in memory.

## Training Scripts

We recommend using the [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) training script suite for training runs. The recommended Docker images are:

- `axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0` - CUDA 12.8 with PyTorch 2.8.0
- `axolotlai/axolotl-cloud:main-py3.11-cu126-2.8.0` - CUDA 12.6 with PyTorch 2.8.0

Both images provide the `axolotl` CLI used in the examples below. The "cloud" variants are recommended because they come with custom folder mounts already set up, which makes data management easier.

## Dataset Generation

The synthetic dataset is generated by scripts under `data/`.

- **Generator:** `data/generate_data.py`
- **Outputs:** JSONL files in `data/output/` (for example `home_assistant_train.jsonl`, `home_assistant_test.jsonl`, and `sample.jsonl`).

The example training configs in `train/configs/` are written to read a dataset file mounted at `/workspace/data/datasets/sample.jsonl`. Depending on which dataset variant you train on, you may need to edit the config to point at the correct file.

For local Docker training you’ll typically:

1. Generate a dataset JSONL under `data/output/`.
2. Copy it into a host folder that you mount as `/workspace/data/datasets/`.

## Training Configs

This repo currently includes:

- `train/configs/gemma3-270m.yml`
- `train/configs/functiongemma-270m.yml`

Additional configs can be found in the [Axolotl repo](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples). They will need to be adapted to this repo’s dataset paths, and their chat template may need adjusting.

### Model selection

- `base_model`: Hugging Face model id.
- `model_type`: Transformers model class (example: `Gemma3ForCausalLM`).
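As a minimal sketch, the top of a config might look like the snippet below. The values are just the examples mentioned above; the settings actually used in this repo live in `train/configs/gemma3-270m.yml`.

```yaml
# Example values only -- see train/configs/gemma3-270m.yml for the real config.
base_model: google/gemma-3-270m-it   # Hugging Face model id
model_type: Gemma3ForCausalLM        # Transformers model class
```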
### Chat formatting and tool calling

- `chat_template: jinja` and `chat_template_jinja: | ...`:
  - Defines how multi-turn chats (and tools) are rendered into training text.
  - For tool calling, this formatting matters a lot.

Related files:

- `train/chat_templates/` contains reusable templates you can adapt.

### Dataset configs

- The `datasets:` list points at a JSONL file, for example:
  - `path: /workspace/data/datasets/sample.jsonl`
  - `ds_type: json`
  - `type: chat_template`
  - `message_field_training: train_on_turn`
  - `roles_to_train: []`

> NOTE: `roles_to_train` controls which message roles are used for loss calculation. Alternatively, you can train on all "assistant" role messages by setting `roles_to_train: [assistant]`.

### Sequence length + packing

- `sequence_len`: max context length.
- `sample_packing`:
  - `true` packs multiple short samples into one sequence (better throughput).
  - `false` keeps samples separate (often easier to debug).

### Batch sizing

Effective batch size is calculated as:

$$\text{effective\_batch} = \text{micro\_batch\_size} \times \text{gradient\_accumulation\_steps} \times \text{num\_gpus}$$

Having the correct effective batch size is important for training stability. Set `micro_batch_size` to the largest value whose model and optimizer states still fit in VRAM, then set `gradient_accumulation_steps` to keep the effective batch size in the right range -- 16, 32, or 64 have been shown to work well for this dataset.

### Memory/perf helpers

- `bf16: true`: bfloat16 training on modern NVIDIA GPUs (30-series/Ada or newer).
- `gradient_checkpointing: true`: re-compute activations during the backward pass instead of storing them in VRAM (lower VRAM, more compute).
- `flash_attention: true`: faster attention when supported (almost always).
- `optimizer: adamw_bnb_8bit`: 8-bit optimizer states to save VRAM.

### Outputs

- `output_dir: /workspace/data/training-runs/` stores checkpoints and TensorBoard logs.
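Putting the fields above together, a sketch of the relevant portion of a config is shown below. The numeric values are illustrative assumptions, not this repo’s settings; copy the real numbers from `train/configs/gemma3-270m.yml` or `train/configs/functiongemma-270m.yml`.

```yaml
# Illustrative values only -- the real settings are in train/configs/.
datasets:
  - path: /workspace/data/datasets/sample.jsonl
    ds_type: json
    type: chat_template
    message_field_training: train_on_turn
    roles_to_train: []

sequence_len: 2048        # raise if your VRAM allows longer contexts
sample_packing: true      # set to false when debugging formatting issues

micro_batch_size: 2               # largest per-GPU batch that fits in VRAM
gradient_accumulation_steps: 16   # keeps the effective batch in the 16-64 range

bf16: true
gradient_checkpointing: true
flash_attention: true
optimizer: adamw_bnb_8bit

output_dir: /workspace/data/training-runs/
```

With a single GPU, these example values give an effective batch size of 2 × 16 × 1 = 32, inside the 16-64 range noted above.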
## Running training with Docker (local machine)

These commands assume:

- You are running on Linux with an NVIDIA GPU (or WSL2)
- You have Docker installed
- You have the NVIDIA driver and the NVIDIA Container Toolkit installed and set up

### 1) Create host folders to mount

```bash
mkdir -p ./train-local/datasets
mkdir -p ./train-local/training-runs
mkdir -p ./train-local/huggingface-cache
```

### 2) Generate a dataset JSONL

```bash
python3 data/generate_data.py --sample --language english
cp data/output/sample.jsonl ./train-local/datasets/sample.jsonl
```

### 3) Run preprocess + train

```bash
docker pull axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0
```

```bash
docker run --rm -it \
  --gpus all \
  -e AXOLOTL_DO_NOT_TRACK=1 \
  -e HF_HOME=/workspace/data/huggingface-cache \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$PWD/train-local/datasets:/workspace/data/datasets" \
  -v "$PWD/train-local/training-runs:/workspace/data/training-runs" \
  -v "$PWD/train-local/huggingface-cache:/workspace/data/huggingface-cache" \
  -v "$PWD/train/configs:/workspace/configs" \
  axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0 \
  axolotl preprocess /workspace/configs/gemma3-270m.yml --debug
```

```bash
docker run --rm -it \
  --gpus all \
  -e AXOLOTL_DO_NOT_TRACK=1 \
  -e HF_HOME=/workspace/data/huggingface-cache \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$PWD/train-local/datasets:/workspace/data/datasets" \
  -v "$PWD/train-local/training-runs:/workspace/data/training-runs" \
  -v "$PWD/train-local/huggingface-cache:/workspace/data/huggingface-cache" \
  -v "$PWD/train/configs:/workspace/configs" \
  axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0 \
  axolotl train /workspace/configs/gemma3-270m.yml
```

Artifacts will appear under `./train-local/training-runs/`.

## Running on Kubernetes (e.g. cloud GPU host)

`train/training-job.yml` mounts:

- `/workspace/data/datasets` (dataset JSONL)
- `/workspace/data/training-runs` (outputs)
- `/workspace/configs` (Axolotl YAML)
- `/workspace/data/huggingface-cache` (HF cache)

It runs `axolotl preprocess ...` as an init container, then `axolotl train ...`.

The helper script `train/train.sh` copies `train/configs/.yml` to a remote server and starts the Kubernetes Job.
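For orientation, the overall shape of such a Job is sketched below. This is **not** the repo’s actual manifest (that is `train/training-job.yml`); the job name, volume names, and hostPath locations are hypothetical placeholders, shown only to illustrate the preprocess-init-container-then-train pattern and the four mount paths listed above.

```yaml
# Hypothetical sketch -- NOT train/training-job.yml. Names and hostPath
# locations are placeholders; only the image, commands, and mount paths
# mirror what is described in this document.
apiVersion: batch/v1
kind: Job
metadata:
  name: axolotl-train                # placeholder name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: preprocess
          image: axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0
          command: ["axolotl", "preprocess", "/workspace/configs/gemma3-270m.yml"]
          volumeMounts:
            - { name: datasets, mountPath: /workspace/data/datasets }
            - { name: training-runs, mountPath: /workspace/data/training-runs }
            - { name: hf-cache, mountPath: /workspace/data/huggingface-cache }
            - { name: configs, mountPath: /workspace/configs }
      containers:
        - name: train
          image: axolotlai/axolotl-cloud:main-py3.11-cu128-2.8.0
          command: ["axolotl", "train", "/workspace/configs/gemma3-270m.yml"]
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - { name: datasets, mountPath: /workspace/data/datasets }
            - { name: training-runs, mountPath: /workspace/data/training-runs }
            - { name: hf-cache, mountPath: /workspace/data/huggingface-cache }
            - { name: configs, mountPath: /workspace/configs }
      volumes:
        - { name: datasets, hostPath: { path: /srv/axolotl/datasets } }             # placeholder
        - { name: training-runs, hostPath: { path: /srv/axolotl/training-runs } }   # placeholder
        - { name: hf-cache, hostPath: { path: /srv/axolotl/huggingface-cache } }    # placeholder
        - { name: configs, hostPath: { path: /srv/axolotl/configs } }               # placeholder
```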