manipulate ls text output to add path to filenames

Question

I sometimes get files with the following ls output format:

/etc/cron.d:
-rw-r--r-- 1 root root 128 May 15  2020 0hourly
-rw------- 1 root root 235 Dec 17  2020 sysstat
/etc/cron.daily:
-rw------- 1 root root 235 Dec 17  2020 sysstat

Is there a way, using standard GNU tools or even plain bash built-ins, to turn that content into:

-rw-r--r-- 1 root root 128 May 15  2020 /etc/cron.d/0hourly
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.d/sysstat
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.daily/sysstat

That would be great.

The easiest thing is to remove the path lines, like this: cat <filename> | grep -v -E "^\/[a-z]"

But, as I said, how do I move these paths down onto the following lines with the file names?

The command that is being run is this one: ls -lR /etc/cron* > <filename>.

I have no influence over that output. I only receive the output of these ls commands, redirected to a separate file <filename> that is transferred to me.

What I'd like to do is turn its content into the second result shown above: basically take the first line and apply that path to the file lines 2 and 3, then take line 4 and apply it to line 5, and then turn that into a general approach.

I think that should be possible using awk.

You need to show what command you are running, however, perhaps ls -l /etc/cron*/* is what you are looking for, or even ls -ld /etc/cron*/*. — Bib, Mar 26 '24 at 22:01
If you are looking for a list of output with full paths, find might be a better tool than ls. E.g. find /etc/cron* or if you need the other data from ls pipe the find output. find /etc/cron* -type f | xargs ls -l — cherdt, Mar 26 '24 at 22:18
Use the stat command, rather than falling into the trap of parsing ls output. Do an online search for "Parsing ls considered harmful". — waltinator, Mar 27 '24 at 00:19
Related: https://unix.stackexchange.com/questions/128985/why-not-parse-ls-and-what-to-do-instead — Vilinkameni, Mar 27 '24 at 10:20
@waltinator What cherdt suggested is passing the list of pathnames to ls (1), presumably not further parsing the result. If the output was further parsed, that would be problematic for the reasons stated in the answers to the question linked above. Of course, there are still all kinds of issues with abnormal pathnames containing newlines, spaces etc. — Vilinkameni, Mar 27 '24 at 10:28

terdon · Answer 1 · 2024-03-27T16:10:50.740

You haven't shown us what command you are using or why you're getting this output, but if the objective is to list all files and directories matching /etc/cron*, you could just use find instead:

find /etc/cron*

Or, if you need the full listing (GNU find):

find /etc/cron* -ls

Any find:

find /etc/cron* -exec ls -ld {} +

Here is example output on my Arch Linux:

$ ls /etc/cron*
/etc/cron.deny  /etc/crontab  /etc/crontab~  /etc/crontab.pacnew
/etc/cron.d:
0hourly
/etc/cron.daily:
/etc/cron.hourly:
0anacron
/etc/cron.monthly:
/etc/cron.weekly:

And with find:

$ find /etc/cron* -ls
   262172      4 drwxr-xr-x   2 root     root         4096 Jan 23 19:41 /etc/cron.d
   263666      4 -rw-r--r--   1 root     root          128 Jan 14 14:59 /etc/cron.d/0hourly
   262173      4 drwxr-xr-x   2 root     root         4096 Sep 30 11:38 /etc/cron.daily
   262618      4 -rw-r--r--   1 root     root           74 Jan 14 14:59 /etc/cron.deny
   262174      4 drwxr-xr-x   2 root     root         4096 Jan 23 19:41 /etc/cron.hourly
   263665      4 -rwxr-xr-x   1 root     root          843 Jan 14 14:59 /etc/cron.hourly/0anacron
   262175      4 drwxr-xr-x   2 root     root         4096 Jun 30  2016 /etc/cron.monthly
   262632      0 -rw-r--r--   1 root     root            0 Oct 31  2017 /etc/crontab
   262633      4 -rw-r--r--   1 root     root           49 Sep 22  2017 /etc/crontab~
   272465      4 -rw-r--r--   1 root     root          119 Jan 14 14:59 /etc/crontab.pacnew
   262176      4 drwxr-xr-x   2 root     root         4096 Sep 30 11:38 /etc/cron.weekly
   275802      4 -rwxr--r--   1 root     root           68 Sep 30 11:37 /etc/cron.weekly/clamscan.sh

GNU find -ls outputs the equivalent to the incantation ls -dils so the columns may be different to that expected. Using GNU utilities, find /path -exec stat -c fmt is a little more flexible. — mr.spuratic, Mar 27 '24 at 19:17
@mr.spuratic, if you have to use GNU tool, find -printf is more flexible than using GNU stat. — Stéphane Chazelas, Mar 27 '24 at 19:44
@StéphaneChazelas by superior number of % format specifiers, I concede :) Though neither makes it easy to recreate exactly the default contextual details of ls (relative age, file type specifics) I think. — mr.spuratic, Mar 29 '24 at 15:46
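For readers who can re-run the listing themselves, the find -printf suggestion from the comments might look roughly like the sketch below. It assumes GNU find; the function name list_full_paths and the temporary directory are made up for illustration, and the chosen columns only approximate what ls -l prints.

```shell
# Sketch (GNU find only): print an ls -l-like line with the full path.
# %M mode, %n link count, %u owner, %g group, %s size in bytes,
# %TY-%Tm-%Td last-modification date, %p path as found.
list_full_paths() {
    find "$@" -type f -printf '%M %n %u %g %s %TY-%Tm-%Td %p\n'
}

# Demo on a throwaway directory (names are made up for illustration).
tmp=$(mktemp -d)
mkdir "$tmp/cron.d"
printf 'hello\n' > "$tmp/cron.d/0hourly"
list_full_paths "$tmp"
rm -rf "$tmp"
```

Because the format string is explicit, the column layout does not depend on file age or locale the way the default ls timestamp does.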

Ed Morton · Accepted Answer · 2024-03-29T11:34:25.080

If none of your file or directory names contain white space then you could do the following using any POSIX awk:

$ awk '
    NF==1 && sub(/:$/,"/") { dir=$0; next }
    match($0,/[^[:space:]]+$/) { $0=substr($0,1,RSTART-1) dir substr($0,RSTART) }
    { print }
' file
-rw-r--r-- 1 root root 128 May 15  2020 /etc/cron.d/0hourly
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.d/sysstat
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.daily/sysstat

or if your file/directory names can contain spaces but your directory paths always start with / and your ls output always has exactly the same number of fields before the file name as shown in your example then you could do something like this:

$ awk '
    /^\// && sub(/:$/,"/") { dir=$0; next }
    match($0,/^([^[:space:]]+[[:space:]]+){8}/) { $0=substr($0,1,RLENGTH) dir substr($0,RLENGTH+1) }
    { print }
' file
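As a rough illustration of this second approach on a name that contains a space, here is a variant that peels off the eight leading fields in a loop instead of using the {8} interval expression, in case the awk at hand lacks support for intervals (some older mawk releases do). The function name prefix_paths and the nfields variable are made up for this sketch, and it relies on the same assumptions as above: headers start with / and the name begins at the ninth field.

```shell
# Hypothetical helper: prefix each file name with the directory from the
# preceding "/path:" header, skipping a fixed number of leading fields.
prefix_paths() {
    awk -v nfields=8 '
        # Header line such as "/etc/cron.d:" -> remember "/etc/cron.d/".
        /^\// && sub(/:$/, "/") { dir = $0; next }
        {
            rest = $0; head = ""
            # Move nfields "field + following blanks" chunks from rest to head.
            for (i = 1; i <= nfields; i++) {
                if (!match(rest, /^[^ \t]+[ \t]+/)) break
                head = head substr(rest, 1, RLENGTH)
                rest = substr(rest, RLENGTH + 1)
            }
            # Only rewrite lines that really had nfields leading fields.
            if (i > nfields && rest != "") $0 = head dir rest
            print
        }
    ' "$@"
}

printf '%s\n' \
    '/etc/cron.d:' \
    '-rw-r--r-- 1 root root 128 May 15  2020 0hourly' \
    '/etc/cron.daily:' \
    '-rw------- 1 root root 235 Dec 17  2020 foo bar' |
prefix_paths
```

On this sample the two file lines come out ending in /etc/cron.d/0hourly and /etc/cron.daily/foo bar, with the original column spacing intact. Lines with fewer leading fields (blank lines, "total N") are printed unchanged.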

But ls doesn't always produce output with those fields: what ls outputs for the date/time depends on the age of your files and your locale settings, and user names can contain spaces, for example. Also, every character in the per-file lines could be present in a directory name, and file names can end with :, since file and directory names can contain any characters except / or NUL. So YMMV with whatever you come up with to tell the lines apart and then figure out where the file name starts in the per-file lines. Plus, file names can contain newlines, which is a whole other world of problems.

So there is no robust way to parse the output of ls for every possible output it could produce. If you want to do this then you just have to figure out what kind of pattern matching you think/hope will be good enough for your needs given whatever context you call ls in and then write your script based on that.
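One way to find out whether a given pattern is "good enough" for a particular file is to list the lines that do not fit it before rewriting anything. The sketch below is only one possible set of assumptions (absolute-path headers, regular files and directories only, a numeric size in field 5, at least nine fields); the function name report_unexpected is made up for illustration.

```shell
# Hypothetical checker: print, with line numbers, every input line that is
# not a "/path:" header, a blank line, a "total N" line, or a long-format
# entry for a regular file or directory with a numeric size field.
report_unexpected() {
    awk '
        /^\/.*:$/                { next }   # directory header
        /^$/ || /^total [0-9]+$/ { next }   # blank line or block total
        /^[-d][-rwxsStT]+[.+@]? / && NF >= 9 && $5 ~ /^[0-9]+$/ { next }
        { printf "%d: %s\n", NR, $0 }       # anything else is suspicious
    ' "$@"
}

printf '%s\n' \
    '/etc/cron.d:' \
    'total 4' \
    '-rw-r--r-- 1 root root 128 May 15  2020 0hourly' \
    'lrwxr-xr-x  1 user  group  9 Mar 29 15:38 link.txt -> hello.txt' \
    'crw-rw-rw-  1 root  root 1, 3 Sep  2  2022 null' |
report_unexpected
```

Here the symlink and device lines (input lines 4 and 5) are reported. An empty report does not prove the file is safe to parse; it only shows that nothing violated these particular assumptions.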

Since some other tool is creating a file of ls output for you to then have to parse you should try to get that other tool fixed since it's well known that you shouldn't try to parse the output of ls (see http://mywiki.wooledge.org/ParsingLs and Why *not* parse `ls` (and what to do instead)?) so that tool is setting you up for failure.

If we can assume no spaces, why do something this complex instead of just awk '{ if(/^\// && sub(/:$/, "")){p=$0; next} $(NF)=p"/"$(NF) }1' file? I don't understand what the match() lines are doing there. This really is a bit complicated and would benefit from an explanation. I know you always think awk is self explanatory, but while not an expert, I do have some passing familiarity with awk and I would need the manual and study to grok this. — terdon, Mar 29 '24 at 13:28
@terdon assigning to $NF or any other field would cause awk to reconstruct $0 from its fields, replacing every string that matches an FS with the OFS character and so would change the spaces between the fields in each line so they'd no longer line up in columns, e.g. if the file sizes weren't all 3 digits or different files had different owners. The match() lines are to separate the file name at the end of the line from the rest of the fields plus the spaces that follow them so we can drop the dir in the middle. — Ed Morton, Mar 29 '24 at 16:09
Nah, the action block was a leftover from other things I was trying. Very good point about rebuilding the line, that's what I was missing. — terdon, Mar 29 '24 at 16:28
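The difference discussed in these comments can be seen on a single line. This is a minimal sketch using one sample line from the question; the function names rebuilt and spliced are made up for illustration.

```shell
# Sample line from the question; note the two spaces before the year.
line='-rw-r--r-- 1 root root 128 May 15  2020 0hourly'

# Assigning to a field makes awk rebuild $0, joining the fields with OFS.
rebuilt() {
    printf '%s\n' "$line" | awk '{ $NF = "/etc/cron.d/" $NF } 1'
}

# Splicing with match()/substr() edits $0 as a string and keeps the spacing.
spliced() {
    printf '%s\n' "$line" |
    awk 'match($0, /[^ \t]+$/) { $0 = substr($0, 1, RSTART-1) "/etc/cron.d/" substr($0, RSTART) } 1'
}

rebuilt
spliced
```

The first call prints the line with a single space before 2020, the second keeps both spaces, which is why the answer avoids field assignment.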

Kaz · Answer 3 · 2024-03-27T19:07:57.717

Solution with TXR Lisp.

Let's take it for granted you got this ls output from somewhere and have to work with it; you cannot go back to the original time and machine and obtain the information in a different format.

$ txr lsdata.tl < lsdata
-rw-r--r-- 1 root root 128 May 15  2020 /etc/cron.d/0hourly
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.d/sysstat
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.daily/sysstat

Where lsdata.tl is:

(let ((curdir ""))
  (whilet ((line (get-line)))
    (match-case line
      (`@dir:` (set curdir dir))
      (`@{metadata 39} @name` (put-line `@metadata @curdir/@name`)))))

This isn't perfect: it will be fooled by a name ending in :. If we can assume that the directory lines are always absolute paths, we can include that in the match:

(let ((curdir ""))
  (whilet ((line (get-line)))
    (match-case line
      (`/@dir:` (set curdir dir))
      (`@{metadata 39} @name` (put-line `@metadata /@curdir/@name`)))))

I'm sorry, but unfortunately I can't use this tool. I need awk or something similar; I can only use the GNU tools. — André Letterer, Mar 28 '24 at 23:08

Stéphane Chazelas · Answer 4 · 2024-03-27T19:42:54.283

You can just do:

ls -ld /etc/cron*/*

The point is to pass the full paths of all the files to ls, together with the -d option so that, for directories, ls shows information about the directory itself rather than listing its contents.

The list of paths there is generated by the shell by expanding that /etc/cron*/* glob.

In the fish shell, you can also do:

ls -ld /etc/cron**

To list all the files whose path starts with /etc/cron, so including /etc/crontab, /etc/cron.d and all the files within.

You can achieve something similar with find with:

find /etc -path '/etc/cron*' -exec ls -ld {} +

Or with zsh with

set -o extendedglob
ls -ld /etc/**/*~^/etc/cron*

(or ls -ld /etc/**~^/etc/cron* if you also enable the globstarshort option)

Corrected initial post. It's not about finding better commands. I need to manipulate the output, probably using awk. — André Letterer, Mar 28 '24 at 23:06

user9101329 · Answer 5 · 2024-03-27T18:54:00.850

Not entirely sure what you want, but try this command:

$ ls -la | awk -v path="$PWD" '{$NF=path"/"$NF;print}' | sed 's| /| \t/|g'

You can drop the sed part if not interested in the alignment of the paths.

There are beartraps here (https://unix.stackexchange.com/questions/128985/why-not-parse-ls-and-what-to-do-instead) – mr.spuratic Mar 27 '24 at 19:33
@mr.spuratic It's okay to parse the output of ls when the goal is to customize/rearrange it for human consumption. If ls doesn't do something you want, you can either write your own from scratch, or tweak the output. – Kaz Mar 28 '24 at 04:56
This awk goes a little bit into the direction I thought. However it seems to use basically pwd but not the outputs from the file. – André Letterer Mar 28 '24 at 22:57
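To take the directory from the header lines in the file rather than from $PWD, and using nothing but shell built-ins as the question asks, a sketch along the following lines might do. It assumes that names contain no spaces and that header lines are absolute paths ending in a colon; the function name add_dir_prefix is made up for illustration. It sticks to POSIX sh features, so it also runs in bash.

```shell
# Hypothetical helper: read ls -lR style text on stdin, print it with the
# directory from each "/path:" header prepended to the following names.
# Assumes names contain no spaces and headers are absolute paths.
add_dir_prefix() {
    dir=
    while IFS= read -r line; do
        case $line in
            /*:)                      # header, e.g. "/etc/cron.d:"
                dir=${line%:}/
                continue ;;
            ''|'total '*)             # blank and "total N" lines: keep as-is
                printf '%s\n' "$line"
                continue ;;
        esac
        name=${line##* }              # text after the last space
        meta=${line%"$name"}          # everything before it, spacing intact
        printf '%s%s%s\n' "$meta" "$dir" "$name"
    done
}

add_dir_prefix <<'EOF'
/etc/cron.d:
-rw-r--r-- 1 root root 128 May 15  2020 0hourly
-rw------- 1 root root 235 Dec 17  2020 sysstat
/etc/cron.daily:
-rw------- 1 root root 235 Dec 17  2020 sysstat
EOF
```

On the sample from the question this prints the three file lines with /etc/cron.d/ or /etc/cron.daily/ in front of the name. A shell read loop is much slower than awk on large files, so this is mainly of interest when awk is not an option.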

score 0 · Answer 6 · answered Mar 29 '24 at 14:09

Simple solution for the simple case:

% awk 'NF == 1 { dir = $1; sub(/:$/, "", dir); next }
       NF >= 9 { $9 = dir "/" $9; print; next }
       { print }' input.txt
-rw-r--r-- 1 root root 128 May 15 2020 /etc/cron.d/0hourly
-rw------- 1 root root 235 Dec 17 2020 /etc/cron.d/sysstat
-rw------- 1 root root 235 Dec 17 2020 /etc/cron.daily/sysstat

On lines with just one field (NF == 1), remove the colon and pick the directory name, and on lines with at least nine fields, add the last seen directory name to the start of the ninth space-separated field ($9), since that's where the (start of) the filename is in the common ls output format. Lines with the wrong number of fields are printed as-is (that would include both empty lines and the total 123 lines that ls -R outputs, not that your sample input includes them).

But more generically, the output of ls can vary, so we need to be careful. For older files, the common timestamp format is May 15 2020, but for recent files, the year is replaced with the hour and minutes, e.g. Mar 29 15:38. Luckily, the number of fields doesn't change there. But the timestamp format may change depending on the locale, and if the listing contains device files or symlinks, other fields in the output change.
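The field counts can be checked quickly. In the sketch below, the first two lines use the two default timestamp styles; the third is a hand-written line in the style of GNU ls --time-style=long-iso, included as an example of a format where the count does change. The helper name count_fields is made up for illustration.

```shell
# Print the number of whitespace-separated fields on each input line.
count_fields() { awk '{ print NF }'; }

printf '%s\n' \
    '-rw-r--r-- 1 root root 128 May 15  2020 old.txt' \
    '-rw-r--r-- 1 root root 128 Mar 29 15:38 new.txt' \
    '-rw-r--r-- 1 root root 128 2024-03-29 15:38 iso.txt' |
count_fields
```

This prints 9, 9 and 8, so with the ISO-style timestamp the file name starts at the eighth field and a script that hard-codes $9 would miss it.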

With symlinks, the symlink target is added after an arrow, and for device files, the size field is replaced with the device information, which might be multiple fields, or not (the first line with the null device below is from Mac, the second from GNU ls):

lrwxr-xr-x  1 user  group  9 Mar 29 15:38 link.txt -> hello.txt
crw-rw-rw-  1 root  wheel  0x3000002 Mar 29 15:39 null
crw-rw-rw-  1 root  root 1, 3 Sep  2  2022 null

Of course, if the username or group name can also contain whitespace, that would also produce issues.

Also, the AWK script above compresses multiple spaces to one, turning e.g. "May 15  2020" (two spaces before the year) into "May 15 2020" and messing up the alignment of other fields. If you care about that, it might be easier to switch to Perl:

% perl -lne  'chomp; if (/^\S+:/) { $dir = s/:$//r; next; } s#((\S+\s+){8})(.*)#$1$dir/$3#; print' input.txt
-rw-r--r-- 1 root root 128 May 15  2020 /etc/cron.d/0hourly
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.d/sysstat
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.daily/sysstat
-rw------- 1 root root 235 Dec 17  2020 /etc/cron.daily/foo bar

Here, the key is the regex ((\S+\s+){8}), which matches and captures eight instances of non-whitespace characters followed by whitespace characters, so the following (.*) matches the rest of the line.
