Unique
Remove duplicate rows from your data, with the option to output both unique and duplicate rows.
Sockets
| Socket | Direction | Description |
|---|---|---|
| input | Input | Data to deduplicate |
| output-unique (U) | Output | Rows identified as unique |
| output-duplicate (D) | Output | Rows identified as duplicates |
Unlike most tools, Unique has two outputs. Connect the U output for unique rows and the D output for duplicates. You can use either or both outputs in your workflow.
Deduplication Modes
| Mode | Unique Output | Duplicate Output |
|---|---|---|
| First Occurrence | First row of each value group | All subsequent rows |
| Strictly Unique | Only rows whose values appear exactly once | All rows from repeated groups |
First Occurrence (Default)
Groups rows by the selected columns. The first row in each group goes to the unique output; all remaining rows in the group go to the duplicate output.
Strictly Unique
Only rows whose value combination appears exactly once in the dataset go to the unique output. If a value combination appears two or more times, all rows with that combination go to the duplicate output.
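The two modes can be sketched in plain Python. This is only an illustration of the behavior described above, not the tool's implementation: the `split_unique` function, its mode strings, and the sample rows are hypothetical, and the sketch assumes rows are compared by exact value equality and that input order is preserved.

```python
from collections import Counter

def split_unique(rows, columns=None, mode="first_occurrence"):
    """Split rows (a list of dicts) into (unique, duplicate) lists.

    Hypothetical helper: names and mode strings are assumptions.
    """
    def key(row):
        # An empty column selection means "compare on every column".
        cols = columns or sorted(row)
        return tuple(row[c] for c in cols)

    unique, duplicate = [], []
    if mode == "first_occurrence":
        seen = set()
        for row in rows:
            k = key(row)
            if k in seen:
                duplicate.append(row)   # later rows in an existing group
            else:
                seen.add(k)
                unique.append(row)      # first row of a new group
    elif mode == "strictly_unique":
        # Counting requires a full pass before any row can be routed.
        counts = Counter(key(row) for row in rows)
        for row in rows:
            if counts[key(row)] == 1:
                unique.append(row)
            else:
                duplicate.append(row)   # every row of a repeated group
    else:
        raise ValueError(f"unknown mode: {mode}")
    return unique, duplicate

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
first_u, first_d = split_unique(rows, ["email"], "first_occurrence")
strict_u, strict_d = split_unique(rows, ["email"], "strictly_unique")
```

With this sample, First Occurrence sends rows 1 and 2 to the unique list and row 3 to the duplicate list, while Strictly Unique keeps only row 2 as unique and sends rows 1 and 3 to the duplicate list.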
Configuration
| Option | Default | Description |
|---|---|---|
| Columns | (all) | Columns to check for duplicates. Leave empty to check all columns. |
| Mode | First Occurrence | How to determine which rows are unique |
Unique is always a blocking operation: it must read all rows before it can determine which are unique and which are duplicates, so the entire dataset must fit in memory.
Examples
Remove Duplicate Emails
- Connect data to the Unique tool
- Select only the `email` column
- Set Mode to First Occurrence
- Wire the U output to downstream tools
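To see what the steps above would produce, here is a small Python sketch that mimics First Occurrence on a single column. The `first_occurrence` helper and the sample contacts are hypothetical and are not part of the tool. The sketch also assumes exact string comparison; whether the tool normalizes case or whitespace is not stated here.

```python
def first_occurrence(rows, columns):
    """Return (unique, duplicate): the first row per key, then the rest."""
    seen = set()
    unique, duplicate = [], []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            duplicate.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicate

# Hypothetical sample data: two contacts share an email address.
contacts = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Ann B.", "email": "ann@example.com"},
]

u_output, d_output = first_occurrence(contacts, ["email"])
print([r["name"] for r in u_output])  # ['Ann', 'Bob']
print([r["name"] for r in d_output])  # ['Ann B.']
```

Only the `email` value decides membership, so "Ann B." counts as a duplicate even though the name differs.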
Find All Duplicated Records
- Connect data to the Unique tool
- Select the columns that define uniqueness
- Set Mode to Strictly Unique
- Wire the D output to see all rows that have duplicates
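The effect of this workflow can be approximated in Python as follows. This is a sketch under assumptions: the `strictly_unique` helper, the `order_id` column, and the sample orders are invented for illustration and do not come from the tool.

```python
from collections import Counter

def strictly_unique(rows, columns):
    """Return (unique, duplicate) where unique holds only one-off keys."""
    def key(row):
        return tuple(row[c] for c in columns)

    # Count every key first; no row can be classified before this finishes.
    counts = Counter(key(row) for row in rows)
    unique = [row for row in rows if counts[key(row)] == 1]
    duplicate = [row for row in rows if counts[key(row)] > 1]
    return unique, duplicate

# Hypothetical sample data: order 101 was recorded twice.
orders = [
    {"order_id": 101, "item": "pen"},
    {"order_id": 102, "item": "ink"},
    {"order_id": 101, "item": "pad"},
    {"order_id": 103, "item": "cap"},
]

u_output, d_output = strictly_unique(orders, ["order_id"])
print([r["item"] for r in d_output])      # ['pen', 'pad']
print([r["order_id"] for r in u_output])  # [102, 103]
```

Unlike First Occurrence, the first `101` row also lands in the duplicate list, which is what makes this mode suited to reviewing every affected record.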
Deduplicate on Multiple Columns
- Connect data to the Unique tool
- Select `first_name`, `last_name`, and `date_of_birth`
- Set Mode to First Occurrence
- The first row for each name + DOB combination goes to U; the rest go to D
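A multi-column key behaves like a tuple of the selected values: two rows match only if every selected column matches. The Python sketch below illustrates this; the `split_by_key` helper and the sample people are hypothetical, and exact value equality is assumed.

```python
def split_by_key(rows, columns):
    """First Occurrence on a composite key; returns (unique, duplicate)."""
    first_seen = {}   # key tuple -> first row with that key (insertion-ordered)
    duplicates = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in first_seen:
            duplicates.append(row)
        else:
            first_seen[key] = row
    return list(first_seen.values()), duplicates

# Hypothetical sample data.
people = [
    {"id": 1, "first_name": "Ana", "last_name": "Ruiz", "date_of_birth": "1990-01-01"},
    {"id": 2, "first_name": "Ana", "last_name": "Ruiz", "date_of_birth": "1990-01-01"},
    {"id": 3, "first_name": "Ana", "last_name": "Ruiz", "date_of_birth": "1985-06-30"},
    {"id": 4, "first_name": "Li", "last_name": "Chen", "date_of_birth": "1990-01-01"},
]

key_columns = ["first_name", "last_name", "date_of_birth"]
u_output, d_output = split_by_key(people, key_columns)
print([r["id"] for r in u_output])  # [1, 3, 4]
print([r["id"] for r in d_output])  # [2]
```

Row 3 shares a name with rows 1 and 2 but has a different date of birth, so it forms its own group and goes to U.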
Notes
- The output schema for both outputs is identical to the input schema
- When no columns are selected, all columns are used for comparison
- First Occurrence is useful for simple deduplication (keep one row per group, discard the rest)
- Strictly Unique is useful for finding records that definitely have no duplicates