inspect

Show dataset statistics, caption coverage, and detect duplicates.

Usage

```bash
datasety inspect --input ./dataset
```

Options

| Option | Description | Default |
| --- | --- | --- |
| --input, -i | Input directory | (required) |
| --duplicates | Detect duplicate/near-duplicate images | false |
| --json | Export report as JSON to this path | |
| --csv | Export per-image data as CSV to this path | |
| --recursive, -R | Search input directory recursively | false |

Report Contents

  • Image count and error count
  • Resolution: min, max, average, and the distribution of unique sizes
  • Orientation: landscape, portrait, square breakdown
  • Formats: file extension distribution
  • File sizes: total and average
  • Caption coverage: how many images have .txt files, average caption length, empty captions
  • Duplicates (with --duplicates): groups of perceptually similar images, found by comparing average hashes against a Hamming distance threshold
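
To make the duplicate check more concrete, here is a minimal sketch of average hashing with a Hamming distance threshold. It is not datasety's actual implementation: the function names, the toy 4x4 grids, and the threshold of 5 are all assumptions made for illustration, and a real pipeline would first convert each image to grayscale and downscale it (commonly to 8x8).

```python
from typing import Sequence


def average_hash(gray: Sequence[Sequence[int]]) -> int:
    """Hash a small grayscale grid: one bit per pixel, 1 if above the mean.

    Assumption: the image is already grayscale and downscaled.
    """
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 5) -> bool:
    # The threshold is a hypothetical value, not a documented datasety default.
    return hamming_distance(a, b) <= threshold


# Two toy 4x4 "images"; the second differs from the first in one pixel.
img_a = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
img_b = [row[:] for row in img_a]
img_b[0][0] = 220  # brighten one corner pixel

hash_a = average_hash(img_a)
hash_b = average_hash(img_b)
print(hamming_distance(hash_a, hash_b), is_near_duplicate(hash_a, hash_b))
```

A small distance means the two images have nearly the same brightness layout, which is why a threshold groups near-duplicates as well as exact copies.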

Examples

```bash
# Basic dataset report
datasety inspect -i ./dataset

# With duplicate detection
datasety inspect -i ./dataset --duplicates

# Export report to JSON
datasety inspect -i ./dataset --json report.json

# Export per-image data to CSV
datasety inspect -i ./dataset --csv images.csv

# Recursive scan with full export
datasety inspect -i ./dataset -R --duplicates --json report.json --csv images.csv
```
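
If you want to run the same commands from a script, one option is to assemble the argument list in Python and pass it to `subprocess.run`. The sketch below uses only the flags listed under Options; the helper names `build_inspect_command` and `run_inspect` are hypothetical, and `run_inspect` assumes `datasety` is installed and on your `PATH`.

```python
import subprocess
from typing import List, Optional


def build_inspect_command(
    input_dir: str,
    recursive: bool = False,
    duplicates: bool = False,
    json_path: Optional[str] = None,
    csv_path: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `datasety inspect` from keyword options."""
    cmd = ["datasety", "inspect", "--input", input_dir]
    if recursive:
        cmd.append("--recursive")
    if duplicates:
        cmd.append("--duplicates")
    if json_path:
        cmd += ["--json", json_path]
    if csv_path:
        cmd += ["--csv", csv_path]
    return cmd


def run_inspect(**options) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_inspect_command(**options), check=True)


print(" ".join(build_inspect_command("./dataset", recursive=True, duplicates=True)))
```

Passing a list rather than a single shell string avoids quoting problems when a dataset path contains spaces.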

JSON Output Format

```json
{
  "path": "./dataset",
  "total_images": 150,
  "errors": 0,
  "resolution": {
    "min": "512x512",
    "max": "2048x2048",
    "average": "1024x1024"
  },
  "formats": { "jpg": 120, "png": 30 },
  "orientation": { "landscape": 45, "square": 80, "portrait": 25 },
  "total_size_mb": 245.3,
  "captions_found": 140,
  "captions_missing": 10,
  "avg_caption_length": 85
}
```
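
As a rough illustration of consuming this report, the sketch below derives a caption coverage percentage from the `captions_found` and `captions_missing` fields. The report is embedded as a trimmed copy of the sample above so the snippet runs on its own; in practice you would load the file written by `--json`. The helper name `caption_coverage` is hypothetical, and the calculation assumes the two caption counts together cover every image.

```python
import json

# Trimmed copy of the sample report shown above.
REPORT_TEXT = """
{
  "path": "./dataset",
  "total_images": 150,
  "errors": 0,
  "captions_found": 140,
  "captions_missing": 10,
  "avg_caption_length": 85
}
"""


def caption_coverage(report: dict) -> float:
    """Return the percentage of images that have a caption file."""
    found = report["captions_found"]
    missing = report["captions_missing"]
    total = found + missing
    return 100.0 * found / total if total else 0.0


report = json.loads(REPORT_TEXT)
coverage = caption_coverage(report)
print(f"{report['total_images']} images, {coverage:.1f}% captioned")
```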

CSV Output Format

| Column | Description |
| --- | --- |
| file | Image file path |
| width | Image width in pixels |
| height | Image height in pixels |
| format | File extension |
| size_kb | File size in KB |
| has_caption | Whether a .txt file exists |
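
The per-image CSV can be filtered with Python's standard `csv` module, for example to list images that still need captions. This is a sketch under assumptions: the rows below are made-up sample data, the file is assumed to start with a header row using the column names above, and because the exact encoding of `has_caption` is not documented here, the code accepts several common truthy spellings.

```python
import csv
import io

# Made-up rows for illustration; replace with open("images.csv", newline="").
SAMPLE_CSV = """file,width,height,format,size_kb,has_caption
a.jpg,1024,1024,jpg,310.5,True
b.png,512,768,png,120.0,False
c.jpg,2048,1024,jpg,890.2,True
"""

# Assumed truthy spellings; adjust to match the actual export.
TRUE_VALUES = {"true", "1", "yes"}


def uncaptioned(rows):
    """Return the file paths of rows whose has_caption value is not truthy."""
    return [
        row["file"]
        for row in rows
        if row["has_caption"].strip().lower() not in TRUE_VALUES
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = uncaptioned(rows)
print(missing)
```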

Released under the MIT License.