Problem Statement
Explain the uniq command and why it's typically used with sort. What are the different options for counting and filtering duplicates?
Explanation
The uniq command removes adjacent duplicate lines from its input, which is why it is almost always used after sort: sort groups identical lines together so uniq can detect them. Without sorting first, uniq removes only duplicates that happen to be consecutive. Example: sort file.txt | uniq removes all duplicate lines after sorting, as sketched below.
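A minimal illustration of the adjacency behavior, using a hypothetical file fruits.txt (the name and contents are invented for this example):

    $ cat fruits.txt
    apple
    banana
    apple
    cherry
    banana
    banana

    $ uniq fruits.txt            # collapses only consecutive duplicates
    apple
    banana
    apple
    cherry
    banana

    $ sort fruits.txt | uniq     # sorting makes all duplicates adjacent
    apple
    banana
    cherry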
uniq -c counts occurrences of each unique line, prefixing each output line with its count. Example: sort access.log | uniq -c | sort -rn lists the most common log entries first. uniq -d prints only duplicated lines (those appearing more than once): sort file.txt | uniq -d. uniq -u prints only lines that appear exactly once, i.e., items with no duplicates: sort file.txt | uniq -u.
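Continuing with the same hypothetical fruits.txt, the counting and filtering options behave like this (exact count padding varies by implementation):

    $ sort fruits.txt | uniq -c              # count each unique line
          2 apple
          3 banana
          1 cherry

    $ sort fruits.txt | uniq -c | sort -rn   # most frequent first
          3 banana
          2 apple
          1 cherry

    $ sort fruits.txt | uniq -d              # lines occurring more than once
    apple
    banana

    $ sort fruits.txt | uniq -u              # lines occurring exactly once
    cherry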
Field-based uniqueness: uniq -f N skips the first N whitespace-separated fields when comparing lines, which is useful for structured data. uniq -w N compares only the first N characters. Example: cut -d',' -f1 data.csv | sort | uniq -c counts occurrences of the first-column values in a CSV. For case-insensitive comparison, fold case in both stages: sort -f file.txt | uniq -i.
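A sketch of the field and character options, using a hypothetical log.txt whose first field is a timestamp:

    $ cat log.txt
    10:01 ERROR disk full
    10:02 ERROR disk full
    10:03 WARN low memory

    $ uniq -f 1 log.txt        # ignore field 1 (the timestamp) when comparing
    10:01 ERROR disk full
    10:03 WARN low memory

    $ printf 'abc123\nabc999\nabd111\n' | uniq -w 3    # compare first 3 chars only
    abc123
    abd111

    $ printf 'Apple\napple\nAPPLE\n' | sort -f | uniq -i   # case-insensitive
    APPLE

In the last example, uniq -i keeps the first line of each run, so which case variant survives depends on how sort -f orders the group in your locale; only the collapse to a single line is guaranteed.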
Practical examples: ps aux | awk '{print $1}' | sort | uniq -c counts running processes per user. sort /var/log/auth.log | uniq prints the unique log entries (equivalent to sort -u /var/log/auth.log). grep 'ERROR' app.log | sort | uniq -c | sort -rn lists error messages by frequency. Understanding how sort and uniq work together is fundamental for data deduplication and frequency analysis in log files and other data processing.
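For instance, the error-frequency pipeline might produce output like this (the messages and counts are invented for illustration):

    $ grep 'ERROR' app.log | sort | uniq -c | sort -rn
         42 ERROR database connection timeout
         17 ERROR disk quota exceeded
          3 ERROR invalid request payload

The first sort makes duplicate messages adjacent so uniq -c can count them, and the final sort -rn orders the counted lines numerically, highest first.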
