Fine-tuning quickstart - Together AI docs

Using a coding agent? Install the together-fine-tuning skill so your agent writes correct fine-tuning code automatically. See Coding agent setup for the install flow.

This quickstart walks through a full fine-tuning lifecycle. You’ll prepare a conversational dataset (CoQA), upload it, launch a LoRA job on Qwen3.5 9B, watch it complete, deploy the result, and compare it to the base model. End-to-end runtime is roughly 20 to 40 minutes for the example dataset. For background on what fine-tuning is and when to use it, see the overview. You can find a runnable notebook for this tutorial on GitHub.

Prerequisites

Before you begin, make sure you have:

A Together AI account and API key.
The Together CLI or the Python / TypeScript SDK installed.
A Python environment. To follow the data-prep step verbatim, install the SDK along with datasets, transformers, and tqdm:

pip install -U together datasets transformers tqdm

Make sure to export your API key before you begin:

export TOGETHER_API_KEY=<your_key>
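If you run the Python steps from a script, it helps to fail fast when the key is missing. This is a minimal sketch. `require_api_key` is a hypothetical helper and is not part of the Together SDK. It assumes, as the export above implies, that the client reads `TOGETHER_API_KEY` from the environment.

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return TOGETHER_API_KEY, or raise a clear error if it is missing or blank.

    Hypothetical helper: the Together client reads the variable itself; this
    only surfaces a friendlier error before any API call is made.
    """
    key = env.get("TOGETHER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TOGETHER_API_KEY is not set; run `export TOGETHER_API_KEY=<your_key>` first"
        )
    return key
```

Call `require_api_key()` once at the top of your script, before creating the client.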

Step 1: Prepare your dataset

This quickstart uses the CoQA conversational dataset. Together AI supports four text data formats: conversational, instruction, preference, and generic text. JSONL is the default file format, but you can use Parquet for pre-tokenized data and custom loss masking. Transform CoQA into the conversational shape:


from datasets import load_dataset

coqa = load_dataset("stanfordnlp/coqa")

system_prompt = (
    "Read the story and extract answers for the questions.\nStory: {}"
)


def map_fields(row):
    messages = [
        {"role": "system", "content": system_prompt.format(row["story"])}
    ]
    for q, a in zip(row["questions"], row["answers"]["input_text"]):
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    return {"messages": messages}


train = coqa["train"].map(
    map_fields, remove_columns=coqa["train"].column_names
)
train.to_json("coqa_train.jsonl")
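To see the record shape that `map_fields` produces without downloading CoQA, the sketch below applies the same logic to a made-up row. The story, questions, and answers are invented for illustration and do not come from CoQA. It then serializes the record, which is roughly how one line of `coqa_train.jsonl` looks.

```python
import json

system_prompt = "Read the story and extract answers for the questions.\nStory: {}"


def map_fields(row):
    # Same transformation as above: one system turn, then alternating user/assistant turns.
    messages = [{"role": "system", "content": system_prompt.format(row["story"])}]
    for q, a in zip(row["questions"], row["answers"]["input_text"]):
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    return {"messages": messages}


# Invented row with the same nesting as CoQA: answers.input_text holds the answer strings.
toy_row = {
    "story": "Ana planted three lemon trees behind the school.",
    "questions": ["Who planted trees?", "How many?"],
    "answers": {"input_text": ["Ana", "three"]},
}

record = map_fields(toy_row)
print(json.dumps(record))  # one JSON object per line in the .jsonl file
```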

To train the model on only part of each example (for instance, the assistant turns but not the user turns), you can use loss masking or data weights.

Next, upload the file. files.upload() runs a local structural check by default (check=True) that catches basic formatting errors, such as non-UTF-8 encoding or malformed JSON lines, before the file is sent. To inspect the check report yourself before uploading, run check_file() first (see Data preparation for details). Then upload:

from together import Together

client = Together()

train_file = client.files.upload(
    file="coqa_train.jsonl",
    purpose="fine-tune",
    check=True,
)
print(train_file.id)

For very large files, you can skip the local check with check=False to speed up the upload. After upload, the server validates the full schema (conversation roles, tool calls, and other dataset requirements) during ingestion, reported through the file’s processing_status.
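The SDK's built-in check is the supported path. To illustrate the kind of problems it targets (non-UTF-8 bytes, lines that aren't valid JSON), here is a stdlib-only sketch. `quick_jsonl_check` is hypothetical. It does not reproduce the SDK's check or the server-side schema validation, so a clean result here doesn't guarantee the file will pass ingestion.

```python
import json


def quick_jsonl_check(path: str, max_errors: int = 10) -> list:
    """Return human-readable problems found in a JSONL file (an empty list means none found).

    Hypothetical helper: checks only UTF-8 decoding, per-line JSON parsing, and
    that each line is a JSON object. Blank lines fail JSON parsing and are reported.
    """
    errors = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            if len(errors) >= max_errors:
                break
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                errors.append(f"line {lineno}: not valid UTF-8 ({exc.reason})")
                continue
            try:
                obj = json.loads(text)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: malformed JSON ({exc.msg})")
                continue
            if not isinstance(obj, dict):
                errors.append(f"line {lineno}: expected a JSON object")
    return errors
```

For example, `quick_jsonl_check("coqa_train.jsonl")` returns an empty list for the file written in the previous step if every line decodes and parses.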

To see files you’ve already uploaded, call client.files.list() in Python or run tg files list from the CLI.

If you upload a file whose contents already exist on Together AI, client.files.upload() doesn’t create a duplicate. It returns the existing file’s metadata, including its id, so you can reuse it directly. To force a re-upload, delete the existing file first with client.files.delete(<file_id>).

Upload returns before ingestion finishes, so poll the Files API until processing_status reaches COMPLETED before launching the job. If validation rejects the dataset, processing_status becomes INVALID_FORMAT and validation_report.error carries the reason.


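import time

deadline = time.time() + 30 * 60  # safety cap: 30 minutes

while True:
    meta = client.files.retrieve(train_file.id)
    if meta.processing_status == "COMPLETED":
        break
    if meta.processing_status == "INVALID_FORMAT":
        # User-correctable: validation_report explains what to fix.
        raise ValueError(
            f"file is not valid for fine-tuning: {meta.validation_report}"
        )
    if meta.processing_status == "FAILED":
        raise RuntimeError(
            f"file processing did not complete: {meta.processing_status}"
        )
    if time.time() > deadline:
        # Stop waiting instead of looping forever if ingestion stalls.
        raise TimeoutError(
            f"file still {meta.processing_status} after 30 minutes"
        )
    time.sleep(5)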

Once processing finishes, the file metadata reflects the outcome. A successful validation (processing_status: COMPLETED):

{
  "processing_status": "COMPLETED",
  "validation_report": {
    "valid": true,
    "dataset_format": "conversation",
    "nlines": 7199
  }
}

A user-correctable failure (processing_status: INVALID_FORMAT):

{
  "processing_status": "INVALID_FORMAT",
  "validation_report": {
    "valid": false,
    "error_type": "INVALID_FORMAT",
    "error": "Line 7: `messages[1]` must contain a `role` field"
  }
}
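If you script the upload, you may want one place that turns this metadata into a go/no-go decision. The sketch below works on plain dicts shaped like the two examples above. `summarize_validation` is a hypothetical helper. The SDK returns typed objects rather than dicts, so field access (for example `meta.validation_report`) may need adapting.

```python
def summarize_validation(meta: dict) -> str:
    """Turn file metadata (shaped like the JSON examples above) into a one-line verdict.

    Raises ValueError for INVALID_FORMAT so a script stops before launching a job.
    """
    status = meta.get("processing_status")
    report = meta.get("validation_report") or {}
    if status == "COMPLETED" and report.get("valid"):
        return f"ok: {report.get('nlines')} lines, format={report.get('dataset_format')}"
    if status == "INVALID_FORMAT":
        # User-correctable: the error message points at the offending line.
        raise ValueError(report.get("error", "dataset failed validation"))
    return f"not ready: processing_status={status}"
```

Given the successful example above, this returns `"ok: 7199 lines, format=conversation"`.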

Save the id from the upload response. You’ll pass it as training_file in the next step.

Step 2: Launch the job

client.fine_tuning.create() starts a LoRA job by default. The example below tunes Qwen3.5 9B for three epochs. See the API reference for the full list of parameters.

job = client.fine_tuning.create(
    training_file=train_file.id,
    model="Qwen/Qwen3.5-9B",
    n_epochs=3,
    n_checkpoints=1,
    learning_rate=1e-5,
    warmup_ratio=0,
    train_on_inputs="auto",
    lora=True,
    suffix="qwen35_9b_demo",
    # wandb_api_key=os.environ.get("WANDB_API_KEY"),  # optional
)
print(job.id)

Output:

ft-d1522ffb-8f3e-4106-9774-aed81e0164a4

Save the job ID.

Job parameters

Here are some common job parameters:

Parameter	Required	Default	Notes
`training_file`	Required	n/a	File ID from Step 1.
`model`	Required	n/a	Base model to fine-tune.
`lora`	Optional	`true`	Set `false` for full fine-tuning.
`n_epochs`	Optional	`1`	Passes through the training set.
`learning_rate`	Optional	`0.00001`	Step size.
`batch_size`	Optional	`"max"`	Examples per optimization step. With packing enabled (the default for JSONL), a step can cover several short examples, so this isn’t the same as JSONL lines per step.
`warmup_ratio`	Optional	`0.0`	Fraction of steps for LR warmup.
`weight_decay`	Optional	`0.0`	L2 regularization.
`max_grad_norm`	Optional	`1.0`	Gradient-clipping threshold. Set to `0` to disable clipping.
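`train_on_inputs`	Optional	`"auto"`	Whether user or prompt tokens count toward the loss. `false` masks them; `"auto"` masks them for conversational and instruction data and trains on all tokens for generic text.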
`suffix`	Optional	n/a	Up to 64 characters appended to the output model name.
`n_checkpoints`	Optional	`1`	Intermediate checkpoints saved during training.
`n_evals`	Optional	`0`	Evaluations against `validation_file` during training.
`hf_api_token`	Optional	n/a	Only required for a private Hugging Face base. Omit otherwise.
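Conceptually, masking decides which turns of each conversation contribute to the loss. The sketch below is a message-level illustration, not Together's implementation, which works on tokens. `trained_turns` is hypothetical, and it assumes that for conversational data `train_on_inputs=False` and `"auto"` train only on assistant turns, while `True` trains on every turn.

```python
def trained_turns(messages, train_on_inputs="auto"):
    """Return the contents of the turns that would count toward the loss.

    Message-level sketch only. Assumption for conversational data: "auto"
    behaves like False (only assistant turns are trained on); True trains on
    every turn, including system and user messages.
    """
    if train_on_inputs not in (True, False, "auto"):
        raise ValueError("train_on_inputs must be True, False, or 'auto'")
    if train_on_inputs is True:
        return [m["content"] for m in messages]
    return [m["content"] for m in messages if m["role"] == "assistant"]


example = [
    {"role": "system", "content": "Read the story..."},
    {"role": "user", "content": "Who planted trees?"},
    {"role": "assistant", "content": "Ana"},
]
print(trained_turns(example))        # only the assistant answer
print(trained_turns(example, True))  # every turn
```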


Each fine_tuning.create() call starts a new billed job. If a call fails with a retryable error, check client.fine_tuning.list() before retrying, in case the first attempt already created a job.
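One way to do that check is to look for an unfinished job that used the same training file and suffix. This is a sketch over plain dicts. `find_possible_duplicate` is hypothetical, and the field names (`training_file`, `suffix`, `status`) are assumptions about the job objects that `client.fine_tuning.list()` returns, so confirm them against the API reference before relying on it.

```python
# Terminal states, as used by the Step 3 polling loop.
TERMINAL_STATES = {"completed", "error", "cancelled"}


def find_possible_duplicate(jobs, training_file, suffix):
    """Return the first unfinished job matching training_file and suffix, else None.

    Assumed field names; finished jobs are ignored because rerunning a
    completed job may be intentional.
    """
    for job in jobs:
        if (
            job.get("training_file") == training_file
            and job.get("suffix") == suffix
            and job.get("status") not in TERMINAL_STATES
        ):
            return job
    return None
```

If this returns a job, wait on that job instead of calling `fine_tuning.create()` again.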

Step 3: Watch the job complete

Jobs move through these states: pending → queued → running → uploading → completed. Queue wait time is typically under an hour. Once the first epoch finishes, multiply its duration by n_epochs to estimate total training time, or by the number of epochs still to run to estimate the time remaining. Poll for completion (or error/cancellation), then read the output model name:

import time

job_id = job.id
deadline = time.time() + 6 * 60 * 60  # safety cap: 6 hours

while True:
    status = client.fine_tuning.retrieve(id=job_id)
    print(status.status)
    if status.status in ("completed", "error", "cancelled"):
        break
    if time.time() > deadline:
        raise TimeoutError(f"Job still {status.status} after 6 hours")
    time.sleep(60)

if status.status != "completed":
    raise RuntimeError(f"Job ended with status: {status.status}")

output_model = status.x_model_output_name
print(output_model)
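The time estimate is simple arithmetic: total training time is roughly the first epoch's duration times `n_epochs`, so the time left after some epochs is that duration times the epochs still to run. A sketch, assuming every epoch takes about as long as the first. `estimate_remaining_seconds` is hypothetical and ignores queue time and the final model upload.

```python
def estimate_remaining_seconds(first_epoch_seconds: float, n_epochs: int, epochs_done: int = 1) -> float:
    """Rough training time left, assuming each epoch takes as long as the first one."""
    if first_epoch_seconds < 0 or n_epochs < 1 or not 0 <= epochs_done <= n_epochs:
        raise ValueError("invalid inputs")
    return first_epoch_seconds * (n_epochs - epochs_done)


# Example: the first of 3 epochs took 160 seconds, so about 320 seconds of training remain.
print(estimate_remaining_seconds(160, n_epochs=3))
```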

Here’s a sample event log:

Fine tune request created
Job started at 2026-04-03T03:19:46Z
Model data downloaded at 2026-04-03T03:19:48Z
WandB run initialized.
Training started for Qwen/Qwen3.5-9B
Epoch completed, at step 24
Epoch completed, at step 48
Epoch completed, at step 72
Training completed for Qwen/Qwen3.5-9B at 2026-04-03T03:27:55Z
Uploading output model
Model upload complete
Job finished at 2026-04-03T03:31:33Z
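If you save the event messages as text, a few lines of parsing recover the useful numbers. The sketch below assumes lines worded exactly like the sample above; that wording comes only from this sample and may change. `parse_event_log` is hypothetical. It extracts the step at each epoch boundary and the wall-clock time from job start to job finish.

```python
import re
from datetime import datetime

SAMPLE_LOG = """\
Fine tune request created
Job started at 2026-04-03T03:19:46Z
Model data downloaded at 2026-04-03T03:19:48Z
WandB run initialized.
Training started for Qwen/Qwen3.5-9B
Epoch completed, at step 24
Epoch completed, at step 48
Epoch completed, at step 72
Training completed for Qwen/Qwen3.5-9B at 2026-04-03T03:27:55Z
Uploading output model
Model upload complete
Job finished at 2026-04-03T03:31:33Z
"""


def parse_event_log(text):
    """Extract epoch-boundary steps and total job duration (seconds) from event lines."""
    epoch_steps = [int(s) for s in re.findall(r"Epoch completed, at step (\d+)", text)]

    def stamp(label):
        # Timestamps end in "Z"; strip it so fromisoformat works on older Pythons too.
        m = re.search(label + r" at (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z", text)
        return datetime.fromisoformat(m.group(1)) if m else None

    started, finished = stamp("Job started"), stamp("Job finished")
    job_seconds = (finished - started).total_seconds() if started and finished else None
    return {"epoch_steps": epoch_steps, "job_seconds": job_seconds}


print(parse_event_log(SAMPLE_LOG))
```

For the sample log, each epoch spans 24 steps and the whole job took 707 seconds (just under 12 minutes).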

You can also monitor the run on the fine-tuning jobs dashboard. For per-step loss curves, see training metrics.

Step 4: Deploy and call your model

Fine-tuned models run on Together AI through dedicated endpoints. The example below deploys your model, sends one request, and deletes the endpoint to stop billing:

# 1. Preflight: confirm the base can host a fine-tune and see the hardware ids it offers
print(client.endpoints.list_hardware(model=status.model))

# 2. Create the endpoint. Use a hardware id returned by list_hardware
# above; for Qwen3.5 9B the platform currently serves 1x H100 80GB SXM.
endpoint = client.endpoints.create(
    display_name="Qwen3.5 9B fine-tune",
    model=output_model,
    hardware="1x_nvidia_h100_80gb_sxm",
    autoscaling={"min_replicas": 1, "max_replicas": 1},
)

# 3. Wait until ready
deadline = time.time() + 20 * 60
while True:
    ep = client.endpoints.retrieve(endpoint.id)
    if ep.state == "STARTED":
        break
    if ep.state in ("FAILED", "STOPPED"):
        raise RuntimeError(f"Endpoint state: {ep.state}")
    if time.time() > deadline:
        raise TimeoutError(f"Endpoint still {ep.state} after 20 minutes")
    time.sleep(30)

# 4. Send a request
response = client.chat.completions.create(
    model=endpoint.name,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)

# 5. Delete when done
client.endpoints.delete(endpoint.id)

Pass endpoint.name (not output_model) as the model parameter when calling inference APIs. The endpoint name includes a unique suffix that routes traffic to your deployment.

Congrats! You just fine-tuned a model, deployed it to a dedicated endpoint, and ran inference end-to-end.
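The three waits in this quickstart (file ingestion, job completion, endpoint startup) share one shape: fetch, check for success, check for failure, sleep, and give up after a deadline. Below is a sketch of that shape as a reusable helper. `poll_until` is hypothetical and not part of the Together SDK. `sleep` and `clock` are injectable only so the helper is easy to test.

```python
import time


def poll_until(fetch, is_done, is_failed, timeout_s, interval_s,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until is_done(result) or is_failed(result) is true, or time runs out.

    Returns the final result on success; raises RuntimeError on a failure
    state and TimeoutError once the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if is_done(result):
            return result
        if is_failed(result):
            raise RuntimeError(f"polling stopped on failure state: {result!r}")
        if clock() > deadline:
            raise TimeoutError(f"still not done after {timeout_s} s: {result!r}")
        sleep(interval_s)


# For example, the endpoint wait from Step 4 could become:
# ep = poll_until(
#     fetch=lambda: client.endpoints.retrieve(endpoint.id),
#     is_done=lambda ep: ep.state == "STARTED",
#     is_failed=lambda ep: ep.state in ("FAILED", "STOPPED"),
#     timeout_s=20 * 60,
#     interval_s=30,
# )
```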

Step 5: Compare against the base model (optional)

To measure the impact of fine-tuning, run the same prompts through the base model and the fine-tuned model.

Many fine-tunable base models aren’t available on serverless. For example, calling Qwen/Qwen3.5-9B directly returns Unable to access non-serverless model. To compare, deploy the base on its own dedicated endpoint, evaluate against endpoint.name, then tear that endpoint down too. Serverless bases (those with a per-token price listed on the models dashboard) can be called directly without deploying anything.

This GitHub notebook runs an Exact Match and F1 comparison on the CoQA validation split. Here’s a sample result from one run:

Model	EM	F1
Base	0.01	0.18
Fine-tuned	0.32	0.41
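For reference, Exact Match and F1 on short answers are usually computed SQuAD-style. Both strings are normalized (lowercased, with punctuation and the articles a/an/the removed and whitespace collapsed), then EM checks for equality and F1 scores token overlap. The sketch below follows that common definition, not necessarily the notebook's code. Its normalization may differ, so compare scores from this sketch with each other rather than with the table above.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_scores(predictions, references):
    """Average EM and F1 over paired lists of model answers and reference answers."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equal-length, non-empty lists")
    n = len(references)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    f1 = sum(f1_score(p, r) for p, r in zip(predictions, references)) / n
    return em, f1
```

Run the same questions through the base and fine-tuned endpoints, then call `corpus_scores` once per model on the collected answers.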

Stop the endpoint

Dedicated endpoints bill per minute as long as they’re running. Step 4 deletes the endpoint at the end of the script, but if you skipped that step or want to delete it later, run:

tg endpoints delete "<ENDPOINT_ID>"

Find the endpoint ID by running tg endpoints list.

Continue from a checkpoint

Resume training from an existing job by passing from_checkpoint:

job = client.fine_tuning.create(
    training_file="<NEW_FILE_ID>",
    from_checkpoint="<PREVIOUS_JOB_ID>",
)

from_checkpoint accepts the output model name, the job ID, or a specific step in the form ft-...:{STEP_NUM}. List available checkpoints with tg fine-tuning list-checkpoints <JOB_ID>.
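If you build the from_checkpoint value in code, a small helper keeps the `{JOB_ID}:{STEP_NUM}` form consistent. This is a sketch. `checkpoint_ref` is hypothetical. It checks only the `ft-` prefix seen in job IDs in this guide and does not confirm that a checkpoint exists at that step, which is what `tg fine-tuning list-checkpoints` is for.

```python
from typing import Optional


def checkpoint_ref(job_id: str, step: Optional[int] = None) -> str:
    """Build a from_checkpoint value: the bare job ID, or "<job_id>:<step>" for a specific step."""
    if not job_id.startswith("ft-"):
        raise ValueError(f"expected a fine-tuning job ID starting with 'ft-', got {job_id!r}")
    if step is None:
        return job_id
    if isinstance(step, bool) or not isinstance(step, int) or step < 1:
        raise ValueError(f"step must be a positive integer, got {step!r}")
    return f"{job_id}:{step}"


# e.g. from_checkpoint=checkpoint_ref(previous_job_id, step=<a step from list-checkpoints>)
```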

Next steps

Data preparation

See the full schema for conversational, instruction, preference, and tokenized data.

Supported models

Browse base models with context lengths and batch size limits.

Preference tuning

Align a model with paired preferred and dispreferred responses.

Deploy your model

Hosting, teardown, and local inference for fine-tuned models.
