How can I delete a fixed part of some lines from a text file?

Question

I have been using ls -Rlh /path/to/directory > file to create some text file records of what's in some hard drives.

I want to delete some strings from the text files after they've been created.

An example of part of a text file is:

external1:
total 36K
drwxrwxr-x 2 emma emma 4.0K Oct 31 01:29 dir1
drwxrwxr-x 2 emma emma  12K Oct 31 01:29 dir2
drwxrwxr-x 2 emma emma  20K Oct 31 01:29 dir3

external1/dir1:
total 4.5M
-rw-rw-r-- 1 emma emma 769K Oct 31 01:12 a001.jpg
-rw-rw-r-- 1 emma emma 698K Oct 31 01:12 a002.jpg
-rw-rw-r-- 1 emma emma 755K Oct 31 01:12 a003.jpg
-rw-rw-r-- 1 emma emma 656K Oct 31 01:12 a004.jpg
-rw-rw-r-- 1 emma emma 756K Oct 31 01:12 a005.jpg
-rw-rw-r-- 1 emma emma 498K Oct 31 01:12 a006.jpg
-rw-rw-r-- 1 emma emma 455K Oct 31 01:12 a007.jpg

external1/dir2:
total 8.7M
-rw-rw-r-- 1 emma emma  952K Oct 31 01:13 a001.jpg
-rw-rw-r-- 1 emma emma  891K Oct 31 01:13 a002.jpg
-rw-rw-r-- 1 emma emma  838K Oct 31 01:13 a003.jpg
-rw-rw-r-- 1 emma emma  846K Oct 31 01:13 a004.jpg
-rw-rw-r-- 1 emma emma  876K Oct 31 01:13 a005.jpg
-rw-rw-r-- 1 emma emma  834K Oct 31 01:13 a006.jpg
-rw-rw-r-- 1 emma emma  946K Oct 31 01:13 a007.jpg
-rw-rw-r-- 1 emma emma  709K Oct 31 01:13 a008.jpg
-rw-rw-r-- 1 emma emma 1007K Oct 31 01:13 a009.jpg
-rw-rw-r-- 1 emma emma  940K Oct 31 01:13 a010.jpg

external1/dir3:
total 4.6M
-rw-rw-r-- 1 emma emma 408K Oct 31 01:15 a001.jpg
-rw-rw-r-- 1 emma emma 525K Oct 31 01:15 a002.jpg
-rw-rw-r-- 1 emma emma 383K Oct 31 01:15 a003.jpg
-rw-rw-r-- 1 emma emma 512K Oct 31 01:15 a004.jpg
-rw-rw-r-- 1 emma emma 531K Oct 31 01:15 a005.jpg
-rw-rw-r-- 1 emma emma 532K Oct 31 01:15 a006.jpg
-rw-rw-r-- 1 emma emma 400K Oct 31 01:15 a007.jpg
-rw-rw-r-- 1 emma emma 470K Oct 31 01:15 a008.jpg
-rw-rw-r-- 1 emma emma 407K Oct 31 01:15 a009.jpg
-rw-rw-r-- 1 emma emma 470K Oct 31 01:15 a010.jpg

The actual text files are thousands of lines long and several megabytes in size.

What I want to do is delete everything before the file size from each applicable line, so that each line starts with the file size. E.g.

512K Oct 31 01:15 a004.jpg
531K Oct 31 01:15 a005.jpg
532K Oct 31 01:15 a006.jpg
400K Oct 31 01:15 a007.jpg
470K Oct 31 01:15 a008.jpg

However, I want to keep all of the other lines (with the directory names and total sizes) intact, so this means that I can't use colrm or cut.

score 5 · Accepted Answer · edited Apr 13 '17 at 12:36

Parsing the output of ls is unreliable, but this should work in this particular case:

sed -e 's/^.*emma emma //' file

That deletes everything up to and including "emma emma " on each line. If that string doesn't appear on a line, the line is left unchanged.

I've written the regexp to only remove the first space after emma, so that the size field remains right-aligned (e.g. ' 709K' and '1007K' both take the same amount of chars on the line)

If you don't want that, use this instead:

sed -e 's/^.*emma emma  *//' file

That deletes all the spaces after emma, up to the start of the next field.
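To see the difference between the two expressions, here is a small sketch that runs both on two lines taken from the question plus one directory header. The variable names (`sample`, `aligned`, `flush`) are only for illustration.

```shell
# Two file lines with sizes of different widths, plus a directory header.
sample='-rw-rw-r-- 1 emma emma  709K Oct 31 01:13 a008.jpg
-rw-rw-r-- 1 emma emma 1007K Oct 31 01:13 a009.jpg
external1/dir3:'

# Variant 1: remove through the single space after "emma emma";
# the padding in front of the size is kept.
aligned=$(printf '%s\n' "$sample" | sed -e 's/^.*emma emma //')

# Variant 2: also remove the padding spaces in front of the size.
flush=$(printf '%s\n' "$sample" | sed -e 's/^.*emma emma  *//')

printf '%s\n\n%s\n' "$aligned" "$flush"
```

In the first result the 709K line keeps one leading space, so the sizes stay right-aligned; in the second, every size starts in column 1. The header line passes through untouched in both. One thing to keep in mind: `.*` is greedy, so a file whose name itself contains "emma emma " would be cut at that later occurrence.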

Here's a sed version that works with any user and group:

sed -e 's/^.\{10\} [0-9]\+ [^ ]\+ [^ ]\+ //' file

It relies even more heavily on the exact format of your ls output, so it is technically even worse than the first version, but it should work for your particular file.
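Note that `\+` in a basic regular expression is a GNU sed extension. If the command has to run on a sed that doesn't support it, the same idea can be spelled with POSIX interval expressions. This is only a sketch: the function name `strip_ls_prefix` is made up for the example, and it still assumes the mode field is exactly 10 characters wide.

```shell
# Hypothetical helper: reads the files given as arguments, or stdin if none.
strip_ls_prefix() {
    # Fields removed: mode (10 chars), link count, owner, group, one space.
    # \{1,\} is the POSIX way to write "one or more".
    sed -e 's/^.\{10\} \{1,\}[0-9]\{1,\} \{1,\}[^ ]\{1,\} \{1,\}[^ ]\{1,\} //' "$@"
}

printf '%s\n' \
    'external1/dir2:' \
    'total 8.7M' \
    '-rw-r--r-- 1 root root  709K Oct 31 01:13 a008.jpg' \
    '-rw-rw-r-- 1 emma emma 1007K Oct 31 01:13 a009.jpg' |
    strip_ls_prefix
```

The header and total lines don't match the pattern, so they come out unchanged. On systems where ls appends a `+` or `.` to the mode (ACLs or a security context), the mode is 11 characters long and the expression would need adjusting.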

See Why *not* parse `ls`? for info on why parsing ls is bad.

If not all files are owned by emma, you might want to use an awk script like this instead.

awk 'NF>2 {print $5,$6,$7,$8,$9} ; NF<3 {print}' file

For lines with more than 2 fields, it prints only fields 5-9. For lines with fewer than 3 fields, it prints the entire line. Unfortunately, this loses the right-alignment of the size field. That can be fixed with a slightly more complicated awk script:

awk 'NF>2 {printf "%5s %s %s %s %s\n", $5, $6, $7, $8, $9} ; NF<3 {print}' file

This final version merges in the for loop from jasonwryan's answer, so it copes with filenames that have any number of single spaces in them (but not consecutive spaces, as mentioned by G-Man):

awk 'NF>2 {printf "%5s", $5; for(i=6;i<=NF;i++){printf " %s", $i}; printf "\n"} ; NF<3 {print}' file

Thank you. The first sed command does exactly what I wanted. — EmmaV, Oct 31 '15 at 03:28
I notice that the sed commands cope with spaces in file names. So given that the owner and group is always emma:emma, would there be any situation when I would need to use an awk command instead? — EmmaV, Oct 31 '15 at 03:36
can't think of any. if the sed version works for you, use it. the awk versions were for my own entertainment :) think of them as examples of different approaches to the same problem. the sed version is the simplest approach. — cas, Oct 31 '15 at 03:39

score 2 · Answer 2 · jasonwryan · 2015-10-31T02:28:06.720

Using Awk:

awk '{if ($1 ~ /^[-d]/) {for(i=5;i<=NF;i++){printf "%s ", $i}; printf "\n"} else print $0}' file

If the first field begins with - or d, print from the fifth to the final field; otherwise print the entire record.

edited Oct 31 '15 at 02:28

answered Oct 31 '15 at 02:22

your version copes with filenames that have spaces in them. mine doesn't...although easily fixed by adding $11, $12, etc to output. or rewriting it as a loop like yours. – cas Oct 31 '15 at 02:56
It's going to be hard for any awk solution to correctly handle filenames with multiple consecutive spaces; e.g., foo bar; I believe that the sed approach is better. Admittedly, this is a very edgy case. – G-Man Says 'Reinstate Monica' Oct 31 '15 at 03:16
true. handling one space is easy. multiple consecutive spaces is almost impossible (if you split by spaces as awk does by default...and with input like this, there's nothing aside from spaces to split by). – cas Oct 31 '15 at 03:34
one way i can think of to do it is to make a copy of $0, then use match to find starting pos of $9, and then substr() to remove everything up to that char pos. print that as the filename field. but that's parsing ls and therefore wrong – cas Oct 31 '15 at 03:36
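The idea from the last comment (work on a copy of `$0` so the filename is never rebuilt from fields) could look roughly like this. It is a variation rather than exactly what was described: it uses `sub()` in a loop instead of `match()` and `substr()`, the function name `keep_name_spacing` is made up, and it is still parsing ls, with all the caveats that implies.

```shell
# Hypothetical helper: reads the files given as arguments, or stdin if none.
keep_name_spacing() {
    awk '
    # Long-format entries: mode starts with - or d, and there are 9+ fields.
    /^[-d]/ && NF >= 9 {
        name = $0
        # Strip the first eight fields from the copy, one at a time,
        # so that only the untouched filename text is left.
        for (i = 1; i <= 8; i++) sub(/^[^ ]+ +/, "", name)
        printf "%5s %s %s %s %s\n", $5, $6, $7, $8, name
        next
    }
    # Everything else (directory headers, totals, blank lines) is unchanged.
    { print }
    ' "$@"
}

printf '%s\n' \
    'external1/dir2:' \
    '-rw-rw-r-- 1 emma emma  709K Oct 31 01:13 foo  bar.jpg' |
    keep_name_spacing
```

Because the filename is taken from the original line, the two spaces in `foo  bar.jpg` survive. A filename that begins with a space would still lose it, since the last `sub()` eats all spaces after the time field.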

score 1 · Answer 3 · answered Oct 31 '15 at 04:31

Since you're talking about several megabytes of data, it might be worthwhile to use the -o and -g options of GNU ls to avoid printing the user and group, resulting in this format:

-rw-rw-r-- 1 952K Oct 31 01:13 a001.jpg

This sed command will remove the unwanted data at the beginning of the line:

sed 's/^[-a-z]\{10\} \{1,\}[0-9]\{1,\}//'

You can combine the listing and the removing of unneeded data into one step (this also applies to most of the solutions on this page), which can also save you some time:

ls -Rlhog /path/to/directory | sed 's/^[-a-z]\{10\} \{1,\}[0-9]\{1,\}//' > file
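To check what the expression in the combined command does, it can be tried on the single sample line shown earlier. This is just a sketch; the variable names are for illustration, and the second expression is a small variation that is not part of the original answer.

```shell
line='-rw-rw-r-- 1 952K Oct 31 01:13 a001.jpg'

# As in the combined command: the match ends right after the link count,
# so the space in front of the size is left in place.
with_space=$(printf '%s\n' "$line" | sed 's/^[-a-z]\{10\} \{1,\}[0-9]\{1,\}//')

# Variation: also consume the spaces that follow the link count.
no_space=$(printf '%s\n' "$line" | sed 's/^[-a-z]\{10\} \{1,\}[0-9]\{1,\} \{1,\}//')

# Brackets make the leading space visible.
printf '[%s]\n[%s]\n' "$with_space" "$no_space"
```

The first form keeps one leading space on each file line; use the second if every line should start directly with the size. Also note that `[-a-z]` does not cover the uppercase `S` and `T` that ls prints for setuid, setgid or sticky bits without execute permission, so such entries would pass through unmodified.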
