# workflow

Run multi-step datasety workflows from YAML or JSON files.
## Usage

```shell
# Auto-detect datasety.yaml in the current directory
datasety workflow

# Specify a file
datasety workflow --file pipeline.yaml

# Validate without running
datasety workflow --dry-run
```

## Options

| Option | Description | Default |
|---|---|---|
| `--file`, `-f` | Path to workflow file | auto-detect |
| `--dry-run` | Validate without executing | `false` |
## File Format

Workflow files define a list of steps, each with a command and its arguments:

### YAML

```yaml
steps:
  - command: resize
    args:
      input: ./raw
      output: ./resized
      resolution: 768x1024
  - command: caption
    args:
      input: ./resized
      output: ./resized
      llm-api: true
      model: gpt-5-nano
```

### JSON

```json
{
  "steps": [
    {
      "command": "resize",
      "args": {
        "input": "./raw",
        "output": "./resized",
        "resolution": "768x1024"
      }
    }
  ]
}
```

## Argument Mapping
| YAML type | CLI equivalent |
|---|---|
| `key: value` | `--key value` |
| `key: true` | `--key` (flag) |
| `key: false` | (omitted) |
| `key: [a, b]` | `--key a --key b` |
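As an illustration, the mapping above could be implemented roughly like this. This is a hypothetical sketch of the rules in the table, not datasety's actual code:

```python
def args_to_cli(args: dict) -> list[str]:
    """Translate a step's YAML `args` mapping into CLI tokens (illustrative only)."""
    tokens = []
    for key, value in args.items():
        flag = f"--{key}"
        if value is True:
            tokens.append(flag)                # key: true  -> --key (flag)
        elif value is False:
            continue                           # key: false -> omitted
        elif isinstance(value, list):
            for item in value:                 # key: [a, b] -> --key a --key b
                tokens += [flag, str(item)]
        else:
            tokens += [flag, str(value)]       # key: value -> --key value
    return tokens

print(args_to_cli({"input": "./raw", "llm-api": True, "quiet": False}))
# → ['--input', './raw', '--llm-api']
```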
## Auto-Detection

When no `--file` is specified, the workflow command searches for:

1. `datasety.yaml`
2. `datasety.yml`
3. `datasety.json`
## Dry Run

The `--dry-run` flag validates each step by:

- Parsing arguments through the real argparse parser
- Checking required parameters
- Verifying that input directories/files exist
- Reporting pass/fail per step

No models are loaded and no images are processed.
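The checks above can be sketched as follows. The stand-in parser and the `--input` convention are assumptions for illustration, not datasety's real validator:

```python
import argparse
from pathlib import Path

def validate_step(parser: argparse.ArgumentParser, tokens: list[str]) -> tuple[bool, str]:
    """Dry-run one step: parse arguments, then check that the input path exists."""
    try:
        ns = parser.parse_args(tokens)       # real argparse parsing
    except SystemExit:                       # raised on missing/unknown arguments
        return False, "argument parsing failed"
    input_path = getattr(ns, "input", None)
    if input_path and not Path(input_path).exists():
        return False, f"input not found: {input_path}"
    return True, "ok"

# A stand-in parser for the resize subcommand (hypothetical)
resize = argparse.ArgumentParser(prog="resize")
resize.add_argument("--input", required=True)
resize.add_argument("--output", required=True)
resize.add_argument("--resolution")

print(validate_step(resize, ["--input", ".", "--output", "./out"]))  # passes: "." exists
print(validate_step(resize, ["--output", "./out"]))                  # fails: --input is required
```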
## Examples

### Face LoRA with Masks

The most common pipeline: resize raw photos, caption with a trigger word, and generate face masks for focused training loss.
```yaml
steps:
  - command: resize
    args:
      input: ./raw
      output: ./dataset
      resolution: 1024x1024
      crop-position: top
  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      trigger-word: "ohwx person,"
  - command: mask
    args:
      input: ./dataset
      output: ./dataset/masks
      keywords: "person,face,hair"
      model: clipseg
      threshold: 0.4
      padding: 10
      blur: 5
```

### Synthetic Augmentation + Re-caption
Expand a small dataset with edited variations, then caption the results.

```yaml
steps:
  - command: synthetic
    args:
      input: ./dataset
      output: ./augmented
      prompt: "the person is wearing a knitted beanie hat"
      steps: 4
      cfg-scale: 2.5
      seed: 42
  - command: caption
    args:
      input: ./augmented
      output: ./augmented
      trigger-word: "ohwx person,"
```

### Upscale Training (Paired Degradation)
Create a paired dataset for super-resolution training by chaining JPEG, noise, and blur artifacts.

```yaml
steps:
  - command: resize
    args:
      input: ./originals
      output: ./resized
      resolution: 1024x1024
  - command: degrade
    args:
      input: ./resized
      output: ./dataset
      type:
        - jpeg
        - noise
        - blur
      chain: true
      intensity-range: "0.3-0.7"
      paired: true
      seed: 42
  - command: align
    args:
      target: ./dataset/target
      control: ./dataset/control
  - command: caption
    args:
      input: ./dataset/target
      output: ./dataset/target
```

### Inpainting Dataset
Resize, generate masks for removable accessories, and caption.

```yaml
steps:
  - command: resize
    args:
      input: ./photos
      output: ./dataset
      resolution: 1024x1024
      crop-position: top
  - command: mask
    args:
      input: ./dataset
      output: ./dataset/masks
      keywords: "hat,glasses,sunglasses,scarf,necklace"
      model: sam3
      threshold: 0.3
      padding: 5
      blur: 3
  - command: caption
    args:
      input: ./dataset
      output: ./dataset
      florence-2-large: true
```

See Workflows for more real-world pipelines, including background replacement, product LoRAs, multi-resolution datasets, and sweep-then-train patterns.