
I have a huge list like:

67603;4716-5469-1335-0870;5450-7938-7992-5530;14523593;03 Oct 2016 - 17:01:15
63123;5592-6762-4853-6320;4532-4142-5613-9690;1441407;03 Oct 2016 - 17:01:15
62562;4532-5581-3790-0140;5292-4905-4356-2840;28898987;03 Oct 2016 - 17:01:15
68080;5188-1564-9611-7580;4556-9998-5999-3300;2262361;03 Oct 2016 - 17:01:15

I want to find the duplicate numbers that appear after the 2nd ; and before the 3rd ;.

For the first line that number is 5450-7938-7992-5530, for the next line 4532-4142-5613-9690, and so on.

  • Okay, you want to do something with the value in the 3rd column... the question is what do you want to do? Also, your sample doesn't seem to have any duplicates in the 3rd column. – Sundeep Oct 22 '16 at 06:00
  • It's just a sample; my list is 80 MB, and I want to reduce the text to just the 3rd column and then find the duplicates in it. – Mehran Goudarzi Oct 22 '16 at 06:02
  • Sorry, but "find more duplicate" is unclear... if you can add a better sample with duplicates and the expected output, it will help. – Sundeep Oct 22 '16 at 06:03
  • I want to keep just the 3rd column of my list and delete everything else. – Mehran Goudarzi Oct 22 '16 at 06:06
  • Try awk -F';' '{print $3}' ip.txt or cut -d';' -f3 ip.txt – Sundeep Oct 22 '16 at 06:07
  • Yes, thanks. Now how can I find the duplicates in my new list? – Mehran Goudarzi Oct 22 '16 at 06:15
  • Again, what do you want to do with the duplicates? Remove them? Keep only the unique values? Keep only one copy of each duplicate? That is why it is important to add a clear example with the expected output... otherwise we'll be here all day as you keep adding one feature after another as I suggest potential solutions. Possibly your question is a duplicate of https://unix.stackexchange.com/questions/171091/remove-lines-based-on-duplicates-within-one-column-without-sort – Sundeep Oct 22 '16 at 06:24
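
Putting the comment suggestions together, a minimal sketch of the whole pipeline (using ip.txt from the comment above; substitute your own file name):

cut -d';' -f3 ip.txt | sort | uniq -d

Here uniq -d prints only the values that occur more than once in the sorted input; use sort | uniq -c | sort -nr instead if you also want the counts, highest first.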

2 Answers

1

Consider the following awk script, duplicates.awk:

#!/usr/bin/awk -f
BEGIN {
    # Accept all newline conventions (MS-DOS, old Mac, Unix).
    RS = "(\r\n|\n\r|\r|\n)"
    # Split fields on ';', consuming any whitespace around it.
    FS = "[\t\v\f ]*;[\t\v\f ]*"
    # Start with an empty associative array of counts.
    split("", count)
}

{
    # Count how many times each third-column value occurs.
    count[$3]++
}

END {
    # Print the values seen more than once.
    for (item in count) {
        if (count[item] > 1)
            printf "%s\n", item
    }
}

Remember to make it executable, e.g. with chmod a+rx duplicates.awk. You can either pipe the input to the command, or supply one or more input files as command-line parameters (multiple files are treated as if they were concatenated into a single file).
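
For example, both of these invocations should behave the same way (input.txt here is just a placeholder name for your data file):

./duplicates.awk input.txt
cat input.txt | ./duplicates.awk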

The BEGIN rule sets up universal newlines (that is, it accepts all newline conventions, from MS-DOS to old Macs to Unix) and semicolons (;) as the field separator. (Note that treating a multi-character RS as a regular expression is a GNU awk extension; POSIX leaves the behaviour of a multi-character RS unspecified.) For illustration, I made the field separator also consume any whitespace surrounding it, so that x;foo bar ; y parses into three fields: x, foo bar, and y.
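
To check that field-splitting behaviour, here is a throwaway one-liner (not part of the script) using the same FS:

printf 'x;foo bar ; y\n' | awk 'BEGIN { FS = "[\t\v\f ]*;[\t\v\f ]*" } { print $2 }'

It prints foo bar, with the surrounding whitespace consumed by the separator.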

The record rule (the middle part of the snippet) is applied to every record (line) in the input. Because awk supports associative arrays, we simply use the third field, a string, as a key into the count array, and increment that entry by one. (Incrementing a nonexistent array entry in awk yields 1, so the first increment produces 1, and the code works as you'd expect.)
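
You can verify that with another throwaway one-liner:

awk 'BEGIN { count["x"]++; print count["x"] }'

This prints 1: the nonexistent entry behaves as zero, and the increment takes it to one.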

The END rule scans the count array, printing the entries that occurred at least twice. Note that this output is in an unspecified order. (There are ways to sort the output according to the number of occurrences, or even to keep the original order of first occurrences in the file, as sketched below, but the OP did not mention any requirement regarding ordering, so I didn't bother; undefined order is the simplest to implement.)
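
For instance, keeping the first-occurrence order only takes an auxiliary array; a sketch of the modified record and END rules (the names order and n are my own, not part of the script above):

{
    if (!($3 in count))
        order[++n] = $3    # remember each new value in order of first appearance
    count[$3]++
}

END {
    for (i = 1; i <= n; i++)
        if (count[order[i]] > 1)
            printf "%s\n", order[i]
}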

If you want to print e.g. the number of occurrences followed by the string (the value from the third column), then use the following END rule instead:

END {
    for (item in count)
        printf "%15d %s\n", count[item], item
}

The output is formatted so that the first fifteen characters in the output are reserved for the number, and the value starts at the 17th character.
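
With this variant, the count comes first on each line, so the output can be piped straight into sort; for example, using the placeholder input.txt again:

./duplicates.awk input.txt | sort -nr

sort -nr sorts numerically in descending order, so the most frequent values are listed first.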

  • [your command] | sort | uniq -c | sort -nr is what helped me, thanks. – Mehran Goudarzi Oct 22 '16 at 06:41
  • @MehranGoudarzi: You could use the latter END part with ./duplicates.awk input.txt | sort -nr. Anyway, if this solved your question, do consider accepting this as an answer. – Nominal Animal Oct 22 '16 at 07:53
  • RS = "[\r\n]+"; FS = FS"*;"FS"*" – Costas Oct 22 '16 at 08:27
  • @Costas: That record separator would eat empty lines, too -- but if you skip empty lines anyway, it is a good suggestion. As to the field separator, I prefer the one that is immediately obvious. (Although I'm not sure any regular expression can be said to be immediately obvious...) :) – Nominal Animal Oct 22 '16 at 10:09
0

Creating a few duplicate values in a stack.txt file, then printing the output:

67603;4716-5469-1335-0870;5450-7938-7992-5530;14523593;03 Oct 2016 - 17:01:15
63123;5592-6762-4853-6320;4532-4142-5613-9690;1441407;03 Oct 2016 - 17:01:15
62562;4532-5581-3790-0140;5292-4905-4356-2840;28898987;03 Oct 2016 - 17:01:15
68080;5188-1564-9611-7580;4556-9998-5999-3300;2262361;03 Oct 2016 - 17:01:15
67603;4716-5469-1335-0870;5450-7938-7992-5530;14523593;03 Oct 2016 - 17:01:15
63123;5592-6762-4853-6320;4532-4142-5613-9690;1441407;03 Oct 2016 - 17:01:15
62562;4532-5581-3790-0140;5292-4905-4356-2840;28898987;03 Oct 2016 - 17:01:15
68080;5188-1564-9611-7580;4556-9998-5999-3300;2262361;03 Oct 2016 - 17:01:15
67603;4716-5469-1335-0870;5450-7938-7992-5530;14523593;03 Oct 2016 - 17:01:15
63123;5592-6762-4853-6320;4532-4142-5613-9690;1441407;03 Oct 2016 - 17:01:15

Use the command below:

 awk 'BEGIN{FS=";"}{a[$3]++} END {for(k in a) print a[k],k}' stack.txt

Output:

3 4532-4142-5613-9690
2 5292-4905-4356-2840
3 5450-7938-7992-5530
2 4556-9998-5999-3300
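
Note that this command prints a count for every distinct value in the 3rd column, including values that occur only once; in the sample above every value happens to be duplicated. To print only the duplicates, a small variation in the spirit of the other answer:

 awk 'BEGIN{FS=";"}{a[$3]++} END {for(k in a) if(a[k]>1) print a[k],k}' stack.txt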