Sample
Take a subset of rows from your data using various sampling strategies.
Sockets
| Socket | Direction | Description |
|---|---|---|
| input | Input | Data to sample from |
| output | Output | Sampled rows |
Sampling Modes
| Mode | Description | Blocking |
|---|---|---|
| First N | Take the first N rows | No (streaming) |
| Last N | Take the last N rows | Yes |
| Every N | Take every Nth row (with optional offset) | No (streaming) |
| Random | Take N random rows | Yes |
| Random Percent | Take a random percentage of rows | Yes |
| First Percent | Take the first N% of rows | Yes |
Blocking Modes
Last N, Random, Random Percent, and First Percent modes require reading all data before producing output. This means the entire dataset must fit in memory.
First N and Every N are streaming-friendly and work efficiently with large datasets.
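The difference between streaming and blocking modes can be illustrated with a minimal Python sketch. This is not the tool's actual implementation; `first_n`, `last_n`, and `counting_source` are hypothetical names chosen for illustration. The sketch shows that a First N style sampler stops pulling rows once it has enough, while a Last N style sampler has to consume the whole input first.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def first_n(rows: Iterable[T], n: int) -> Iterator[T]:
    """Streaming: yields rows as they arrive and stops after n rows."""
    return islice(rows, n)


def last_n(rows: Iterable[T], n: int) -> List[T]:
    """Blocking: reads every row before it can return anything."""
    buffered = list(rows)  # the whole input is held in memory here
    return buffered[-n:] if n > 0 else []


def counting_source(limit: int, consumed: List[int]) -> Iterator[int]:
    """Hypothetical row source that records which rows were pulled."""
    for i in range(limit):
        consumed.append(i)
        yield i


# First N touches only the rows it needs, even on a large source.
pulled_first: List[int] = []
first_rows = list(first_n(counting_source(1_000_000, pulled_first), 3))

# Last N has to pull every row from the source.
pulled_last: List[int] = []
last_rows = last_n(counting_source(1_000, pulled_last), 3)

print(first_rows, len(pulled_first))
print(last_rows, len(pulled_last))
```

In this sketch the streaming sampler pulls three rows from a million-row source, whereas the blocking sampler pulls all 1,000 rows before returning the final three.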
Configuration
| Option | Default | Applies To | Description |
|---|---|---|---|
| Mode | First N | All | Sampling strategy to use |
| N | 100 | First N, Last N, Random, Every N | Number of rows (or interval for Every N) |
| Percent | 10.0 | Random Percent, First Percent | Percentage of rows (0-100) |
| Offset | 0 | Every N | Starting row offset |
| Seed | (none) | Random, Random Percent | Random seed for reproducible results |
Examples
Take First 1000 Rows
- Connect data to the Sample tool
- Set Mode to First N
- Set N to 1000
Random 10% Sample
- Connect data to the Sample tool
- Set Mode to Random Percent
- Set Percent to 10
- Optionally set a Seed for reproducibility
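The effect of the Seed option in the steps above can be sketched in Python. This is an illustration under stated assumptions, not the tool's implementation: `random_percent` is a hypothetical helper, the sample size is assumed to be the row count times the percentage (rounded), and the sampled rows are assumed to keep their input order.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def random_percent(
    rows: Sequence[T], percent: float, seed: Optional[int] = None
) -> List[T]:
    """Pick a random percentage of rows; blocking, since len(rows) is needed."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    k = round(len(rows) * percent / 100)  # assumed rounding rule
    rng = random.Random(seed)  # seed=None gives a different sample each run
    picked = sorted(rng.sample(range(len(rows)), k))  # assumed: keep input order
    return [rows[i] for i in picked]


rows = list(range(1000))
sample_a = random_percent(rows, 10, seed=42)
sample_b = random_percent(rows, 10, seed=42)

print(len(sample_a))         # 10% of 1000 rows
print(sample_a == sample_b)  # same seed, same sample
```

Using a dedicated `random.Random(seed)` instance keeps the sample independent of any other random state in the program, which is one common way to get repeatable results.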
Every 5th Row
- Connect data to the Sample tool
- Set Mode to Every N
- Set N to 5
- Set Offset to 0 (start from the first row)
Notes
- The output schema is always identical to the input schema
- First N and Every N are the most memory-efficient modes since they don't require loading all data
- Use Seed with Random/Random Percent modes to get the same sample each time you run the workflow
- Every N with an offset of 2 and N of 3 takes rows 2, 5, 8, 11, etc., counting from 0 (an offset of 0 starts from the first row)
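
The Every N arithmetic in the last note can be checked with a short Python sketch. `every_n` is a hypothetical helper, not part of the tool, and it assumes rows are numbered from 0 so that an offset of 0 selects the first row.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def every_n(rows: Iterable[T], n: int, offset: int = 0) -> Iterator[T]:
    """Streaming: yield the row at `offset`, then every nth row after it."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if offset < 0:
        raise ValueError("offset must not be negative")
    return islice(rows, offset, None, n)


# N = 3 with an offset of 2 over rows numbered 0..11
picked = list(every_n(range(12), n=3, offset=2))
print(picked)  # [2, 5, 8, 11]

# N = 5 with the default offset of 0 starts at the first row
every_fifth = list(every_n(range(20), n=5))
print(every_fifth)  # [0, 5, 10, 15]
```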