2

I am planning to implement an indexing structure in my program. For example, if I have 100 rows in the table, I will number these rows from 1 to 100 in another column, appending an _ to the end of each number (1_, 2_, 3_, etc.) so that each number can be identified uniquely.

After processing the rows, I am storing the output into a file.

For example, I insert the line 1_,2_,4_,5_ into a file.

If I then get a value such as 5_,2_,1_,4_ or 2_,5_,1_,4_, I should not insert it, since it is the same combination in a different order.

An implementation that comes to mind is to sort the numbers and then compare them. However, if the total number of rows grows to 100,000, that won't be a good solution. Is this possible as a one-line command in a Perl script, awk, or sed?

EDIT:

To be more precise and brief: for a set of unique and distinct values, how can I find all the combinations without repetitions?

Example:

If I have 3 unique keys, 1, 2 and 3, how can I find all combinations without the same combination being counted twice?

So for the above example, we can find a combination as,

123

Now, when I search for 213 or 321, it should give me a match, as I have already obtained the combination 123.
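A minimal sketch of the sort-then-compare idea described above, using standard shell tools (the canon helper name is my own invention, not part of the question): sort each combination's keys into a canonical order before comparing, so permutations of the same keys all reduce to the same string.

```shell
# Sort a comma-separated combination into canonical order.
# Permutations of the same keys all produce the same output line.
canon() {
    printf '%s\n' "$1" | tr ',' '\n' | sort | paste -sd, -
}

canon '5_,2_,1_,4_'   # -> 1_,2_,4_,5_
canon '2_,5_,1_,4_'   # -> 1_,2_,4_,5_
```

A new combination is a duplicate exactly when its canonical form is already stored in the file.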

Ramesh

5 Answers

5

Finding uniques is a piece of cake with sed:

{ 
echo identical identical
echo not_so_much as_before
echo abcdef bafcde
} | sed ':u;s/\(.\)\(.*\)\1/\2/;tu'

OUTPUT:

nto_much abfr

Not really sure how to represent this correctly in Markdown, but the output from the above command is actually a newline and the single space that separated the two strings, then the line above, then another newline and a space.

The t sed function is fully portable, defined by POSIX as:

[2addr]t [label] Test.

Branch to the : command verb bearing the label if any substitutions have been made since the most recent reading of an input line or execution of a t. If label is not specified, branch to the end of the script.
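As a small aside (my own toy example, not from the answer above), the branch-until-no-substitution pattern that t enables is easy to see when squeezing runs of repeated characters:

```shell
# Loop back to :r as long as the substitution succeeds;
# each pass collapses one doubled character into a single one.
printf '%s\n' aaabbbccc | sed ':r;s/\(.\)\1/\1/;tr'
# -> abc
```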

Let's watch it as it works, I'll add a print in just the right place:

{ 
echo identical identical
echo not_so_much as_before
echo abcdef bafcde
} | sed ':u;s/\(.\)\(.*\)\1/\2/p;tu'

OUTPUT:

dentical identcal
entical ientcal
ntical intcal
tical itcal
ical ical
cal cal
al al
l l


nt_so_much as_befre
ntso_much asbefre
nto_much abefre
nto_much abfr
nto_much abfr
bcdef bfcde
cdef fcde
def fde
ef fe
f f

You can see there is a differing amount of empty space between the three different tests. This is a result of the order in which sed negates the characters; whitespace characters are also negated, so long as there is an even number of them.

The command:

sed ':u;s/\(.\)\(.*\)\1/\2/;tu'

s/elects the first \(.character\) on a line that can possibly be selected \1twice, and \(.all*\) characters between the two. It then /replaces/ the entire selection with only the \2selection between them. :Rinse, repeat.

OPTIMIZE IT!

We can improve the performance of this function by a very large margin with the simple addition of only two more characters in the command, just so:

{ 
echo identical identical
echo not_so_much as_before
echo abcdef bafcde
} | sed ':u;s/\(..*\)\(.*\)\1/\2/p;tu'

OUTPUT

nt_so_much as_befre
ntso_much asbefre
nto_much abefre
nto_much abfr
nto_much abfr
bcdef bfcde
cdef fcde
f f

sed now performs negations on any sequence of 1 or more characters that can be twice selected, which is why identical makes no appearance in the above - it is completely negated on the first pass.

And without print the results are the same:

{
echo identical identical
echo about_as_much as_before
echo abcdef bafcde
} | sed ':u;s/\(..*\)\(.*\)\1/\2/;tu'

OUTPUT

ta_mch fr

POSITIVE FROM NEGATIVE

It requires very little else - and no more recursion - to negate the negation.

{
echo identical identical
echo not_so_much as_before
echo abcdef bafcde
} | sed 'h;:u;s/\(..*\)\(.*\)\1/\2/;tu
    / ./{H;g;
    s/^/NOT FULL MATCH:\t/
    s/\n/\n\t%:\t/;b}
    g;s/^/FULL MATCH:\t/'

OUTPUT:

FULL MATCH:     identical identical
NOT FULL MATCH: not_so_much as_before
        %:      nto_much abfr
FULL MATCH:     abcdef bafcde

Or even just:

{
echo identical identical
echo not_so_much as_before
echo abcdef bafcde
} | sed -e ':u;s/\(..*\)\(.*\)\1/\2/;tu' \
    -e '/ ./{cshite...' -e 'b};cHOORAY!'

OUTPUT:

HOORAY!
shite...
HOORAY!
mikeserv
  • really impressed by this simplicity, but it works on a known fixed pattern; in this case you could have 12545_ and 545_ where only 12 differs (and that could be in an earlier or later 1254_). – NeronLeVelu May 12 '14 at 07:08
4

You could set up a SQLite database and perform SQL selects against it, which would probably be cleaner to implement and would set you up to be more portable later on.

But here's a rough idea. Say I have 2 files:

$ more index.txt new_vals.txt 
::::::::::::::
index.txt
::::::::::::::
1_,2_,4_,5_
::::::::::::::
new_vals.txt
::::::::::::::
5_,2_,1_,4_
2_,5_,1_,4_

With this command we can match:

$ for i in $(<new_vals.txt); do nums=${i//_,/}; \
        grep -oE "[${nums}_,]+" index.txt; done
1_,2_,4_,5_
1_,2_,4_,5_

This demonstrates that we can match each line from new_vals.txt to an existing line in index.txt.

UPDATE #1

Based on the OP's edit, the following will do what he wants, using a modification of the above approach.

$ for i in $(<new_vals.txt); do 
  nums=${i//_,/} 

  printf "# to check: [%s]" $i
  k=$(grep -oE "[${nums}_,]+" index.txt | grep "[[:digit:]]_$")
  printf " ==> match: [%s]\n" $k

done

With a modified version of test data:

$ more index.txt new_vals.txt 
::::::::::::::
index.txt
::::::::::::::
1_,2_,4_,5_
0_,2_,3_,9_
::::::::::::::
new_vals.txt
::::::::::::::
5_,2_,1_,4_
2_,5_,1_,4_
1_,1_,1_,1_
1_,2_,4_,4_

Now when we run the above (put inside a script for simplicity, parser.bash):

$ ./parser.bash 
# to check: [5_,2_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [2_,5_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [1_,1_,1_,1_] ==> match: []
# to check: [1_,2_,4_,4_] ==> match: []

How it works

The above method works by exploiting some key characteristics of your data. For example, only matches will end with a digit followed by an underscore. The grep "[[:digit:]]_$" picks out only those results.

The other part of the script, grep -oE "[${nums}_,]+" index.txt, picks out the runs in index.txt that are made up only of the characters in the current string from new_vals.txt.

Additional adjustments

If the nature of the data is such that the strings may vary in length, then the 2nd grep will need to be expanded to guarantee that we're only picking out strings of sufficient length. There are several ways to accomplish this: either by expanding the pattern, or by making use of a counter (perhaps wc or some other means) to confirm that the matches are of a certain type.

Expanding it like so:

k=$(grep -oE "[${nums}_,]+" index.txt | \
    grep "[[:digit:]]_,[[:digit:]]_,[[:digit:]]_,[[:digit:]]_$")

Would allow for the elimination of strings like this:

$ ./parser2.bash 
# to check: [5_,2_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [2_,5_,1_,4_] ==> match: [1_,2_,4_,5_]
# to check: [1_,1_,1_,1_] ==> match: []
# to check: [1_,2_,4_,4_] ==> match: []
# to check: [1_,2_,5_] ==> match: []
slm
  • This seems like a good solution. But I might have 1000s of rows, and at that point the command might become too much overhead to execute. – Ramesh Mar 01 '14 at 02:45
  • @Ramesh - maybe, though looping through this isn't that burdensome on a fairly modern system. Can you show more of the actual indexes? I wasn't 100% sure about your full syntax; what I'm showing is rough, in my mind, so I'm not sure how robust it will end up being. I'm not so concerned w/ performance, but more with how tolerant this approach would end up being. – slm Mar 01 '14 at 02:54
  • I have not actually implemented it. It is just an idea that I have in mind. I thought if I used (_) for each number, it would be a better approach for having unique indices. However, I need to think more about this approach before I can explain more about the indices. – Ramesh Mar 01 '14 at 02:57
  • @Ramesh - my first reaction was that the _, was a bad idea. I would need to understand better what your actual goal is. It feels like you're overcomplicating something. – slm Mar 01 '14 at 03:01
  • can you please come to chat? I would try and explain the problem – Ramesh Mar 01 '14 at 03:02
  • @Ramesh - I can't tonight, I'm actually working. Was taking a break and answering your Q. – slm Mar 01 '14 at 03:03
  • Oh sorry. No problem. I will try to think a bit more and make the problem clearer to understand and hopefully to implement. :) – Ramesh Mar 01 '14 at 03:07
  • Correct me if I'm wrong, but won't putting the sequence 123 in a character class in grep like so: [123,_] allow it to match number sequences which are not permutations of 1_,2_,3_ (such as 1_,1_,2_)? – Joseph R. May 09 '14 at 16:07
  • that solution would match 1_,1_,1_,1_,1_ – Emmanuel May 09 '14 at 23:02
  • @Emmanuel - fixed. – slm May 10 '14 at 00:54
1

For the concept, using shell tools sed/grep:

ExistingSeq='8_,1_,2_,3_,4_,5_,9_,7_,6_'

NewSeq="5_,2_,1_,4_"

# prepare
SizeTemp=$( echo "${NewSeq}" | sed -e 's/[^,]//g;s/^/,/' )
Size=${#SizeTemp}
echo "${NewSeq}" | sed -e 's/,/\
,/g;s/^/,/' > /tmp/ToFind

# search
InsideOcc=$( echo "${ExistingSeq}" | sed -e 's/,/\
,/g' | egrep -c -f /tmp/ToFind )

# test
echo "test with an if on 'InsideOcc' [${InsideOcc}] is lower (not present) or equivalent (present) to Size: [${Size}] "

rm /tmp/ToFind

To be efficient and avoid lots of similar sed invocations, work with a (temporary) index file holding one element per line.

Now, this is not a good way to build a huge file index, especially due to the exponential time required as sequence length and index length grow. awk is certainly faster, with one invocation of the tool per sequence and everything in memory (no temp files that eat time to manage).
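A rough sketch of the in-memory awk approach hinted at here (my own assumption of what it would look like, not the author's code): count the elements of the new sequence once, then check each existing line for containing exactly the same elements.

```shell
# Read the index on stdin; print "present" for any line whose elements
# are exactly a permutation of the sequence passed in as new.
awk -v new='5_,2_,1_,4_' '
BEGIN {
    n = split(new, want, ",")
    for (i = 1; i <= n; i++) count[want[i]]++
}
{
    m = split($0, have, ",")
    hits = 0
    for (i = 1; i <= m; i++) if (count[have[i]]-- > 0) hits++
    for (i = 1; i <= m; i++) count[have[i]]++   # restore for next line
    print $0, (hits == n && m == n ? "present" : "absent")
}'
```

Because the counts are decremented as they are matched, a line like 1_,1_,1_,1_ cannot fake a match by repeating one element.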

  • In every comparison I've ever seen, sed outperforms awk. I'm not sure I understand what you mean. – mikeserv May 09 '14 at 20:06
  • on simple operations sed can be faster, but often awk is really faster. awk is an evolution of sed. On which test do you see awk slower? – NeronLeVelu May 10 '14 at 07:13
  • Often awk is faster at what? And how is any operation a computer performs not a simple one? It can't count any higher than one. – mikeserv May 10 '14 at 17:41
1

Here's another technique: create a "key" from a string by sorting its characters:

gawk '
    function generate_key(s,  n,a,i,s2) {
        if (s in cached) return cached[s]
        n = split(s, a, //)
        asort(a)
        for (i=1; i<=n; i++) s2 = s2 a[i]
        cached[s] = s2
        return s2
    }
    {
        key = generate_key($1)
        status = (key in seen) ? "no" : "yes"
        print $1, key, status
        seen[key]++
    }
' OFS="\t" <<END
123
231
321
312
1_,2_,4_,5_
5_,2_,1_,4_
2_,5_,1_,4_
END
123 123 yes
231 123 no
321 123 no
312 123 no
1_,2_,4_,5_ 1245,,,____ yes
5_,2_,1_,4_ 1245,,,____ no
2_,5_,1_,4_ 1245,,,____ no
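A side note (my own addition): asort() is a gawk extension. A hypothetical portable equivalent builds the same kind of key with fold and sort; the character order differs from gawk's, which sorts numeric strings first, but any consistent order works as a key.

```shell
# Sort a string's characters into a canonical key using POSIX tools only.
# LC_ALL=C pins the sort order so keys are stable across locales.
generate_key() {
    printf '%s' "$1" | fold -w1 | LC_ALL=C sort | tr -d '\n'
}

generate_key '5_,2_,1_,4_'   # same key as generate_key '1_,2_,4_,5_'
```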
glenn jackman
0

If indexfile is composed only of lines of 6 characters, this will match any combination of "abcdef" in indexfile:

grep a indexfile | grep b | grep c | grep d |grep e | grep f

if indexfile is more complicated use sed to extract the indexes.
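For instance (a hypothetical sketch of that extraction step, assuming lines that mix the OP's N_-style indexes with other fields), one could strip everything that is not part of an index before grepping:

```shell
# Keep only digits, commas and underscores; everything else is deleted,
# leaving just the index portion of each line.
echo 'row: 1_,2_,3_ status=ok' | sed 's/[^0-9_,]//g'
# -> 1_,2_,3_
```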

Emmanuel