caption

Generate captions for images using Florence-2 or OpenAI-compatible vision APIs.

Usage

```bash
# Florence-2 (default: base model)
datasety caption --input ./images --output ./captions

# Vision API
datasety caption --input ./images --output ./captions --llm-api --model gpt-5-nano
```

Options

| Option | Description | Default |
| --- | --- | --- |
| `--input`, `-i` | Input directory | (required*) |
| `--output`, `-o` | Output directory for .txt files | (required*) |
| `--input-image` | Single input image | |
| `--output-caption` | Single output .txt path | |
| `--device` | auto, cpu, cuda, or mps | auto |
| `--trigger-word` | Text to prepend to captions | (none) |
| `--prompt` | Florence-2 task prompt | `<MORE_DETAILED_CAPTION>` |
| `--model` | HF model or API model ID | (none) |
| `--num-beams` | Beam search width (1 = greedy) | 3 |
| `--florence-2-base` | Use base model (0.23B, faster) | (default) |
| `--florence-2-large` | Use large model (0.77B, better) | |
| `--llm-api` | Use OpenAI-compatible vision API | |
| `--max-tokens` | Max response tokens (API mode) | 300 |
| `--temperature` | Temperature (API mode) | 0.3 |
| `--skip-existing` | Skip images with existing .txt | false |
| `--append` | Append text to existing captions | |
| `--prepend` | Prepend text to existing captions | |
| `--recursive`, `-R` | Search input directory recursively | false |
| `--progress` | Show tqdm progress bar | false |
| `--dry-run` | Preview without writing files | false |
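
Before a run with `--skip-existing`, it can help to see which images still lack a caption. The sketch below is a rough stand-alone check, not part of datasety: it assumes captions are stored as `<image stem>.txt` in the output directory and only looks at `.jpg`, `.jpeg`, and `.png` files. The helper name `list_uncaptioned` is hypothetical.

```bash
# Hypothetical helper: print images in $1 that have no <stem>.txt in $2.
# Assumes caption files are named after the image stem (an assumption,
# not confirmed by the option table).
list_uncaptioned() {
  img_dir=$1
  cap_dir=$2
  for img in "$img_dir"/*.jpg "$img_dir"/*.jpeg "$img_dir"/*.png; do
    [ -e "$img" ] || continue        # skip glob patterns that matched nothing
    stem=$(basename "$img")
    stem=${stem%.*}                  # strip the file extension
    [ -e "$cap_dir/$stem.txt" ] || printf '%s\n' "$img"
  done
}
```

Called as `list_uncaptioned ./dataset ./dataset`, it prints one path per image that has no matching `.txt` file.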

Environment Variables

| Variable | Description |
| --- | --- |
| `OPENAI_API_KEY` | API key (required for `--llm-api`) |
| `OPENAI_BASE_URL` | Custom API endpoint |
| `OPENAI_API_BASE` | Legacy fallback for base URL |
| `OPENAI_MODEL` | Default model when `--model` not specified (default: gpt-5-nano) |
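
The descriptions above imply a lookup order for API mode, which can be sketched with shell parameter expansion. This is inferred from the table (`OPENAI_BASE_URL` first, `OPENAI_API_BASE` as the fallback, `gpt-5-nano` when no model is given) rather than taken from datasety's source, and the helper names `resolve_base_url` and `resolve_model` are hypothetical.

```bash
# Hypothetical helpers mirroring the lookup order implied by the table.
# Not datasety's actual code.

# Prefer OPENAI_BASE_URL; fall back to the legacy OPENAI_API_BASE.
# Prints an empty line when neither is set.
resolve_base_url() {
  printf '%s\n' "${OPENAI_BASE_URL:-${OPENAI_API_BASE:-}}"
}

# $1 is the value given to --model (may be empty).
# Falls back to OPENAI_MODEL, then to gpt-5-nano.
resolve_model() {
  printf '%s\n' "${1:-${OPENAI_MODEL:-gpt-5-nano}}"
}
```

Under these assumptions, an explicit `--model` value wins over `OPENAI_MODEL`, and the legacy variable is only consulted when `OPENAI_BASE_URL` is unset or empty.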

Florence-2 Prompts

| Prompt | Description |
| --- | --- |
| `<CAPTION>` | Brief caption |
| `<DETAILED_CAPTION>` | Detailed caption |
| `<MORE_DETAILED_CAPTION>` | Most detailed (default) |
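
The prompt tokens contain `<` and `>`, which the shell would otherwise parse as redirection operators, so they need quoting on the command line. A minimal sketch, assuming the `./images` and `./captions` paths from the Usage section; the `command -v` guard is only there so the snippet is harmless on a machine where datasety is not installed.

```bash
# Single quotes keep the shell from treating < and > as redirections.
prompt='<CAPTION>'

if command -v datasety >/dev/null 2>&1; then
  datasety caption --input ./images --output ./captions --prompt "$prompt"
else
  echo "datasety not found; would run with --prompt $prompt"
fi
```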

Examples

```bash
# Florence-2 base with trigger word
datasety caption -i ./dataset -o ./dataset --trigger-word "photo of sks person,"

# Florence-2 large
datasety caption -i ./dataset -o ./dataset --florence-2-large --device cuda

# OpenAI vision API
datasety caption -i ./dataset -o ./dataset --llm-api --model gpt-5-nano

# Custom provider via env vars
OPENAI_BASE_URL=https://openrouter.ai/api/v1 \
OPENAI_API_KEY=your-key \
datasety caption -i ./dataset -o ./dataset --llm-api --model x-ai/grok-4.1-fast
```

Released under the MIT License.