Sample

Take a subset of rows from your data using various sampling strategies.

Sockets

| Socket | Direction | Description |
|--------|-----------|-------------|
| input | Input | Data to sample from |
| output | Output | Sampled rows |

Sampling Modes

| Mode | Description | Blocking |
|------|-------------|----------|
| First N | Take the first N rows | No (streaming) |
| Last N | Take the last N rows | Yes |
| Every N | Take every Nth row (with optional offset) | No (streaming) |
| Random | Take N random rows | Yes |
| Random Percent | Take a random percentage of rows | Yes |
| First Percent | Take the first N% of rows | Yes |

Blocking Modes

The Last N, Random, Random Percent, and First Percent modes must read all input before producing any output, so the entire dataset must fit in memory.

First N and Every N are streaming-friendly and work efficiently with large datasets.
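
The difference between a streaming mode and a blocking mode can be sketched in Python. This is an illustration only, not the tool's implementation: `first_n`, `random_n`, and `counting` are hypothetical helpers, and the order in which the sketch returns random rows is an assumption.

```python
import random
from itertools import islice

def first_n(rows, n):
    """Streaming: yields rows as they arrive and stops reading after n rows."""
    return islice(rows, n)

def random_n(rows, n, seed=None):
    """Blocking: every row is materialized before any output is produced."""
    all_rows = list(rows)  # the whole input is held in memory here
    rng = random.Random(seed)
    return rng.sample(all_rows, min(n, len(all_rows)))

def counting(source, counter):
    """Wrap an iterable and count how many rows were actually read."""
    for row in source:
        counter["read"] += 1
        yield row

# Streaming: only 3 of the 1,000,000 available rows are ever read.
streamed = {"read": 0}
head = list(first_n(counting(range(1_000_000), streamed), 3))

# Blocking: all 1,000 rows are read before the 3 sampled rows come out.
blocked = {"read": 0}
picked = random_n(counting(range(1000), blocked), 3, seed=42)

print(head, streamed["read"])
print(len(picked), blocked["read"])
```

The counters show why First N scales to large inputs: the amount of data read depends on N, not on the size of the dataset.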

Configuration

| Option | Default | Applies To | Description |
|--------|---------|------------|-------------|
| Mode | First N | All | Sampling strategy to use |
| N | 100 | First N, Last N, Random, Every N | Number of rows (or interval for Every N) |
| Percent | 10.0 | Random Percent, First Percent | Percentage of rows (0-100) |
| Offset | 0 | Every N | Starting row offset |
| Seed | (none) | Random, Random Percent | Random seed for reproducible results |
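
As a rough model of how these options relate, the table can be mirrored in a small Python structure. `SampleConfig`, `APPLIES_TO`, and `active_options` are hypothetical names made up for this sketch; the tool itself is configured through its own interface, and only the defaults and the mode lists come from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Which modes each option applies to, copied from the configuration table.
APPLIES_TO = {
    "n": {"First N", "Last N", "Random", "Every N"},
    "percent": {"Random Percent", "First Percent"},
    "offset": {"Every N"},
    "seed": {"Random", "Random Percent"},
}
MODES = {"First N", "Last N", "Every N", "Random", "Random Percent", "First Percent"}

@dataclass
class SampleConfig:
    mode: str = "First N"
    n: int = 100
    percent: float = 10.0
    offset: int = 0
    seed: Optional[int] = None

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if not 0 <= self.percent <= 100:
            raise ValueError("percent must be between 0 and 100")

    def active_options(self):
        """Return only the options that the selected mode actually uses."""
        return {
            name: getattr(self, name)
            for name, modes in APPLIES_TO.items()
            if self.mode in modes
        }

print(SampleConfig().active_options())
print(SampleConfig(mode="Every N", n=5).active_options())
```

With the defaults, only N is in play; switching to Every N brings Offset in as well, and the Percent and Seed values are ignored.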

Examples

Take First 1000 Rows

  1. Connect data to the Sample tool
  2. Set Mode to First N
  3. Set N to 1000

Random 10% Sample

  1. Connect data to the Sample tool
  2. Set Mode to Random Percent
  3. Set Percent to 10
  4. Optionally set a Seed for reproducibility
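
The effect of the Seed option can be illustrated with a short Python sketch. It assumes that Random Percent selects an exact share of the rows and keeps them in input order; the tool may work differently (for example, by deciding row by row), and `random_percent` is a hypothetical function, not part of the tool.

```python
import random

def random_percent(rows, percent, seed=None):
    """Keep roughly `percent` percent of the rows, chosen at random."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    all_rows = list(rows)  # blocking: the full input is needed to size the sample
    k = round(len(all_rows) * percent / 100)
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    keep = sorted(rng.sample(range(len(all_rows)), k))  # assumption: keep input order
    return [all_rows[i] for i in keep]

data = list(range(200))
a = random_percent(data, 10, seed=7)
b = random_percent(data, 10, seed=7)
c = random_percent(data, 10)  # no seed: the selection can change between runs

print(len(a), a == b)
```

Two runs with the same seed return the same 20 rows, while the unseeded run is free to pick a different set each time.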

Every 5th Row

  1. Connect data to the Sample tool
  2. Set Mode to Every N
  3. Set N to 5
  4. Set Offset to 0 (start from the first row)

Notes

  • The output schema is always identical to the input schema
  • First N and Every N are the most memory-efficient modes since they don't require loading all data
  • Use Seed with Random/Random Percent modes to get the same sample each time you run the workflow
  • Every N with an offset of 2 and N of 3 takes the rows at positions 2, 5, 8, 11, etc., counting from 0 (an offset of 0 is the first row)
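
The Every N arithmetic can be checked with a minimal Python sketch. `every_n` is a hypothetical helper, and the zero-based row positions are an assumption drawn from the "Offset 0 starts from the first row" wording in the example above.

```python
from itertools import islice

def every_n(rows, n, offset=0):
    """Streaming: skip `offset` rows, then yield every n-th row."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if offset < 0:
        raise ValueError("offset must not be negative")
    return islice(rows, offset, None, n)

# Positions are zero-based, so offset 0 is the first row.
print(list(every_n(range(12), n=3, offset=2)))  # [2, 5, 8, 11]
print(list(every_n(range(20), n=5)))            # [0, 5, 10, 15]
```

Because `islice` reads rows one at a time and never looks back, this mode needs no buffer beyond the current row.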