Using Bash to iterate through nested directories and extract certain fields from YAML files

Question

I am learning bash. I need to iterate through a directory that contains other directories and find all files named example.yaml.

These files have several key-value pairs (example below):

name: Andre
age: 13
address: street
weight: 78kgs

What I need is to use bash commands to find all example.yaml files inside a certain directory (including nested directories) and then copy only the name and age from each one to a new file. This new file needs to be created and would look like this:

persons:
  - name: Andre
    age: 13
  - name: Joao
    age: 18
  ...

I was trying to do something like this to solve this problem

printf 'persons:\n' > output.yml
for i in $(find ./ -name "example.yaml");
do
 name=$(yq '.name' $i)
 age=$(yq '.age' $i)
 # append $name and $age to output.yaml
done

I'd split this: YAML has more than one way to denote such key/value pairs, and parsing it using bash and generic regular expression tools is an explicitly bad idea. So, don't. Bash isn't the right tool, and a good carpenter knows that his chisel is not what a mason needs. — Marcus Müller, Aug 25 '22 at 17:53
Not telling you much new, but shopt -s globstar; for yamlfile in **/example.yaml; do some_specific_yamltool --options "${yamlfile}"; done solves the "iterate through all files" part; and some_specific_yamltool should probably be yq, which is meant for exactly this kind of operation. — Marcus Müller, Aug 25 '22 at 17:55
Hmm... you just added something to the required output that does not seem to be part of the input without explaining it further. If you have further questions about processing YAML, then consider asking a new question instead of modifying the requirements of this already answered question. I'm reverting your edit as it severely alters the question. — Kusalananda, Aug 26 '22 at 10:06

Kusalananda · Accepted Answer · 2022-08-26T06:25:41.567

Note: The length of this answer is due to the fact that there are at least two major variants of utilities called yq, made for parsing YAML data, with slightly different abilities and expression grammar, and I cover both. I also look at simply using filename globbing to find all files and using find (when there simply are too many input files). Finally, I address additional questions asked in the comments.

Don't iterate over the output of find. Instead, call your utility from find using -exec. I have an example of this further down in this answer. You also lack quoting of some expansions.
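To illustrate why, here is a minimal sketch that uses only find and sh. The temporary directory and the directory names ("dir one", "dir2") are made up for the demonstration, and it assumes that the path returned by mktemp contains no whitespace.

```shell
# Hypothetical demo tree; one directory name contains a space on purpose.
demo=$(mktemp -d)
mkdir -p "$demo/dir one" "$demo/dir2"
printf 'name: Andre\nage: 13\n' >"$demo/dir one/example.yaml"
printf 'name: Joao\nage: 18\n'  >"$demo/dir2/example.yaml"

# Fragile: the unquoted command substitution is split on whitespace,
# so "dir one/example.yaml" reaches the loop as two separate words.
fragile_count=0
for f in $(find "$demo" -name example.yaml -type f); do
    fragile_count=$((fragile_count + 1))
done

# Robust: find runs the command itself and passes each pathname
# as exactly one argument, whatever characters it contains.
robust_count=$(find "$demo" -name example.yaml -type f \
    -exec sh -c 'echo "$#"' sh {} +)

echo "loop over \$(find) saw $fragile_count words"
echo "find -exec passed $robust_count pathnames"
rm -rf "$demo"
```

There are only two files, yet the loop iterates three times, and none of the split words for the first file is a usable pathname.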

Given one or several YAML files on the command line, the following yq command would create the YAML data summary file:

yq -y -s '{ persons: map({ name: .name, age: .age }) }' files

The command reads all input into one large array (thanks to -s, or --slurp), which is then passed to the map() function. map() extracts the name and age fields of each element in the array and adds these as an object to the persons array.

This uses Andrey Kislyuk's Python-based yq from https://kislyuk.github.io/yq/, a wrapper around the versatile JSON parser jq. If you drop the -y option from the command, you'll get JSON output instead.

Using Mike Farah's Go-based yq instead:

yq -N '[{ "name": .name, "age": .age }]' files | yq '{ "persons": . }'

In the bash shell, you would apply this to all example.yaml files in the current directory or anywhere below it, creating the output file output.yaml in the current directory, like so:

shopt -s globstar failglob
yq -y -s '{ persons: map({ name: .name, age: .age }) }' ./**/example.yaml >output.yaml

Or, with Mike Farah's yq:

shopt -s globstar failglob
yq -N '[{ "name": .name, "age": .age }]' ./**/example.yaml | yq '{ "persons": . }' >output.yaml

This assumes that there are fewer than a few thousand example.yaml files; with more, the glob would expand to a command line that is too long to execute.

The globstar shell option is first enabled to allow us to use the ** filename globbing pattern, which matches across / in pathnames. We also enable the failglob shell option to make the whole command fail gracefully if there are no matching filenames.
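The effect of the two options can be seen without yq. This is a sketch under a few assumptions: bash 4 or later is installed as bash, the demo tree and the name missing.yaml are invented for the example, and each case is run in a fresh bash -c so the options do not leak into the calling shell.

```shell
# Hypothetical demo tree with example.yaml at three different depths.
demo=$(mktemp -d)
mkdir -p "$demo/dir1/sub"
touch "$demo/example.yaml" "$demo/dir1/example.yaml" "$demo/dir1/sub/example.yaml"

# With globstar, ** matches zero or more directory levels,
# so the pattern finds the file at every depth.
matched=$(cd "$demo" && bash -c 'shopt -s globstar failglob; printf "%s\n" ./**/example.yaml')
printf '%s\n' "$matched"

# Nothing is called missing.yaml. With failglob, printf is never run and
# bash reports an error instead of handing over the literal pattern.
unmatched=$(cd "$demo" && bash -c 'shopt -s globstar failglob; printf "%s\n" ./**/missing.yaml' 2>/dev/null)
status=$?
echo "exit status: $status, output: '$unmatched'"
rm -rf "$demo"
```

Without failglob, the second command would have printed the unexpanded pattern, and a tool like yq would then have complained about a file that does not exist.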

Testing:

$ tree
.
├── dir1
│   └── example.yaml
├── example.yaml
├── script-andrey
└── script-mike

1 directory, 4 files

$ cat script-andrey
shopt -s globstar failglob
yq -y -s '{ persons: map({ name: .name, age: .age }) }' ./**/example.yaml >output.yaml

$ bash script-andrey
$ cat output.yaml
persons:
  - name: Joao
    age: 18
  - name: Andre
    age: 13

Testing Mike's yq as well:

$ cat script-mike
shopt -s globstar failglob
yq -N '[{ "name": .name, "age": .age }]' ./**/example.yaml | yq '{ "persons": . }' >output.yaml

$ bash script-mike
$ cat output.yaml
persons:
  - name: Joao
    age: 18
  - name: Andre
    age: 13

If you have many thousands of these YAML input files, you may want to apply yq a bit more cleverly, using find.

This is using Andrey's yq:

find . -name example.yaml -type f \
    -exec yq -y -s 'map({ name: .name, age: .age })' {} + |
yq -y '{ persons: . }' >output.yaml

This finds all regular files whose name is example.yaml. These are passed in batches to yq which will extract the name and age fields from each, creating an array. There is then a final yq command that collects the generated YAML array and places it as the value of the persons key in the final output.
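To make the batching visible, here is a small sketch in which a trivial sh -c command stands in for yq and reports how many pathnames each invocation received. The demo tree and file contents are made up.

```shell
# Hypothetical demo tree with three input files.
demo=$(mktemp -d)
for i in 1 2 3; do
    mkdir -p "$demo/dir$i"
    printf 'name: person%s\nage: %s\n' "$i" "$((20 + i))" >"$demo/dir$i/example.yaml"
done

# "{} +": find collects as many pathnames as fit on one command line,
# so three files lead to a single invocation here.
batched_calls=$(find "$demo" -name example.yaml -type f \
    -exec sh -c 'echo "got $# file(s)"' sh {} + | grep -c 'got')

# "{} \;": the command is started once per file.
per_file_calls=$(find "$demo" -name example.yaml -type f \
    -exec sh -c 'echo "got $# file(s)"' sh {} \; | grep -c 'got')

echo "with +: $batched_calls invocation(s); with \\;: $per_file_calls invocation(s)"
rm -rf "$demo"
```

With a very large number of files, the + form starts a new invocation only when the argument list would otherwise become too long.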

Similarly, with Mike's yq:

find . -name example.yaml -type f \
    -exec yq -N '[{ "name": .name, "age": .age }]' {} + |
yq '{ "persons": . }' >output.yaml

Testing with the same set of files as above:

$ rm output.yaml
$ find . -name example.yaml -type f -exec yq -y -s 'map({ name: .name, age: .age })' {} + | yq -y '{ persons: . }' >output.yaml

$ cat output.yaml
persons:
  - name: Andre
    age: 13
  - name: Joao
    age: 18

(Running the command designed for Mike's yq generates the same output.)

Note that the ordering of the output depends on the order in which find finds the files.

If you want to sort the output file on, e.g., the name field, the following sorts the file in-place (note that I don't know how to do this with Mike Farah's Go-based yq):

yq -i -y '.persons |= sort_by(.name)' output.yaml

To sort (in-place) in the reverse order:

yq -i -y '.persons |= (sort_by(.name) | reverse)' output.yaml

In comments, the user asks whether one can just append data to an existing file. This is possible.

The commands below assume that the last thing in output.yaml is the end of the persons array (so that the command can just add new array entries to it).

Using Andrey's yq:

shopt -s globstar failglob
yq -y -s 'map({ name: .name, age: .age })' ./**/example.yaml >>output.yaml

or, with find,

find . -name example.yaml -type f \
    -exec yq -y -s 'map({ name: .name, age: .age })' {} + >>output.yaml

Using Mike's yq:

shopt -s globstar failglob
yq -N '[{ "name": .name, "age": .age }]' ./**/example.yaml >>output.yaml

or, using find:

find . -name example.yaml -type f \
    -exec yq -N '[{ "name": .name, "age": .age }]' {} + >>output.yaml

@Kusalananda Thank you for the explanation, but I get an error with the above command:
Error: unknown shorthand flag: 'y' in -y

I edited my question to show how I was trying to solve this — Andre Silva, Aug 25 '22 at 19:00
@AndreSilva There are a few different tools called yq that work differently. In my answer, I mentioned which one I'm using. See also here: https://kislyuk.github.io/yq/#installation — Kusalananda, Aug 25 '22 at 19:18
Thank you so much @Kusalananda, I was able to install yq and it is working. Is this tool legit to use?
Also is it possible with find command to append the array to an existing file instead of creating new file? — Andre Silva, Aug 25 '22 at 21:49
@AndreSilva - Check out my answer below for a solution that works with basic commands available in effectively all distributions — mainmachine, Aug 25 '22 at 23:24
@AndreSilva I added stuff. Now the answer uses both Andrey Kislyuk's yq and Mike Farah's yq. Both are robust software used in production environments, but I find Mike's yq severely lacking in features in general. I will address your other issue about appending data soon. — Kusalananda, Aug 26 '22 at 05:55
@AndreSilva Now added a bit about appending to an existing file as well. — Kusalananda, Aug 26 '22 at 06:06
@Kusalananda I understand that this appends data to an existing file, but I am not getting the right YAML indentation (surely something I am doing wrong). Say I create the file beforehand, with an attribute before persons:
printf 'city: Edinburgh \n persons:\n' >output.yaml;

and then run the command

find . -name example.yaml -type f -exec yq -y -s 'map({ name: .name, age: .age})' {} + >>output.yaml

how can I get correct Indentation inside persons array? — Andre Silva, Aug 26 '22 at 10:16
@AndreSilva The output will be a valid YAML with persons being an array of names and ages. The YAML format allows for indent-less lists. — Kusalananda, Aug 26 '22 at 11:20
@Kusalananda you are correct the output is a valid yaml file. I am able to get the job done using Andrey's yq
But when running with Mike's tool I get the following error

Error: unknown command "[{ \"name\": .name, \"age\": .age }]" for "yq

Can this be related to the tool's version? I am using yq (https://github.com/mikefarah/yq/) version 4.16.2 — Andre Silva, Aug 27 '22 at 19:58
@AndreSilva I see eval was made implicit in release "4.18.1", so try yq -N eval '[{ "name": .name, "age": .age }]' .... I'm using "4.25.1" and can't easily test older releases. — Kusalananda, Aug 27 '22 at 20:08
@Kusalananda you are correct (once more). It works with eval — Andre Silva, Aug 27 '22 at 20:10
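If a script has to run with Mike's yq on both sides of that release, the choice could be automated. This is only a sketch: the helper names version_lt and yq_subcommand_for are hypothetical, the 4.18.1 threshold is taken from the comment above rather than verified, and the comparison assumes plain numeric components such as 4.16.2 (no leading zeros, no suffixes).

```shell
# Succeeds if dotted version $1 is older than dotted version $2.
# Compares up to three numeric components; missing ones count as 0.
version_lt() {
    v1=$1 v2=$2
    for step in 1 2 3; do
        p1=${v1%%.*} p2=${v2%%.*}
        [ "$p1" -lt "$p2" ] && return 0
        [ "$p1" -gt "$p2" ] && return 1
        # Drop the component just compared.
        case $v1 in *.*) v1=${v1#*.} ;; *) v1=0 ;; esac
        case $v2 in *.*) v2=${v2#*.} ;; *) v2=0 ;; esac
    done
    return 1
}

# Prints "eval" for releases that, per the comment above, still need the
# explicit subcommand, and nothing for newer ones.
yq_subcommand_for() {
    if version_lt "$1" 4.18.1; then
        echo eval
    fi
}

echo "4.16.2 -> '$(yq_subcommand_for 4.16.2)'"
echo "4.25.1 -> '$(yq_subcommand_for 4.25.1)'"
```

The printed word would then be placed, unquoted, right after the yq options on the command line; how the installed version number is obtained is left out here.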

mainmachine · Answer 2 · 2022-08-30T18:59:54.550

Lots of ways to do this, but the simplest is probably the find command.

First we create the output file with the new array structure:

echo "persons:" > newfile.yaml

Next, we want to identify every file that matches the filename example.yaml, in your target directory (let's call it /home/user/yaml-files). This is a basic use case for find, and fairly simple to understand:

find /home/user/yaml-files -type f -name example.yaml

find has a powerful built-in feature to execute shell commands when it finds a match, using the -exec and -execdir options. -exec executes in the same working directory from which you ran find, while -execdir is a safer option in that the shell command runs "inside" the directory in which the match is found. For simplicity, we'll use -exec.
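The difference between the two options is easy to observe. Below is a minimal sketch with a made-up demo tree. It assumes a find that supports -execdir (GNU and BSD find do) and a PATH without relative entries, which GNU find requires for -execdir.

```shell
# Hypothetical demo tree with one nested input file.
demo=$(mktemp -d)
mkdir -p "$demo/dir1"
printf 'name: Andre\nage: 13\n' >"$demo/dir1/example.yaml"

# -exec: the command runs in the directory find was started from and
# receives the pathname relative to the starting point.
exec_arg=$(cd "$demo" && find . -type f -name example.yaml \
    -exec sh -c 'printf "%s\n" "$1"' sh {} \;)

# -execdir: the command runs inside the directory holding the match
# and receives just the file's name (GNU find prefixes it with ./).
execdir_arg=$(cd "$demo" && find . -type f -name example.yaml \
    -execdir sh -c 'printf "%s\n" "$1"' sh {} \;)
execdir_cwd=$(cd "$demo" && find . -type f -name example.yaml \
    -execdir sh -c 'basename "$(pwd)"' sh {} \;)

echo "-exec argument:    $exec_arg"
echo "-execdir argument: $execdir_arg (run from within $execdir_cwd)"
rm -rf "$demo"
```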

We need to search these example.yaml files for the lines we want, reformat and append the result to our output file:

find /home/user/yaml-files -type f -name example.yaml -exec awk '$1 ~ /^name:|^age:/ {gsub(/name:/,"  - name:",$1); gsub(/age:/,"    age:",$1); print $0}' {} \; | tee -a newfile.yaml

The awk command searches each example.yaml for lines whose first field starts with name: or age:. Note that awk's default field splitting discards leading whitespace, so indented (nested) keys with these names match as well. gsub is an awk built-in for string substitution; the two gsub calls here format the matched lines before they are printed to stdout.
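Here is the same awk program applied to a single sample file, as a sketch. The nested pet entry is invented for the demonstration; it is included to show what happens with an indented name: key, since $1 is the first whitespace-delimited field.

```shell
# Hypothetical sample input, including a nested key called "name".
sample=$(mktemp)
printf 'name: Andre\nage: 13\naddress: street\npet:\n  name: Rex\n' >"$sample"

# Same program as in the find command above.
formatted=$(awk '$1 ~ /^name:|^age:/ {gsub(/name:/,"  - name:",$1); gsub(/age:/,"    age:",$1); print $0}' "$sample")

printf '%s\n' "$formatted"
rm -f "$sample"
```

The address and pet lines are dropped as intended, but the nested name: Rex line is picked up and turned into a list entry of its own. This is one reason a YAML-aware tool is the safer choice for anything but flat files.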

Typically one would use redirection to write output to a file, but with find -exec that does get a bit more complicated. When that is the case, the tee command is great - it echoes output to the console but also to a file. The -a option tells tee to append to the file, otherwise it would overwrite the file every time and we'd be left with only the results of the last write to the file.
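For comparison, here is a small sketch of both ways of collecting the output, with cat standing in for the awk command and a made-up demo tree. As far as I can tell, a plain >> redirection applied to the whole find command gives the same file; the visible difference is that tee also echoes the lines to the terminal.

```shell
# Hypothetical demo tree with two input files.
demo=$(mktemp -d)
mkdir -p "$demo/in/a" "$demo/in/b"
printf 'name: Andre\nage: 13\n' >"$demo/in/a/example.yaml"
printf 'name: Joao\nage: 18\n'  >"$demo/in/b/example.yaml"

echo "persons:" >"$demo/with-tee.yaml"
echo "persons:" >"$demo/with-redirect.yaml"

# tee -a appends find's combined output to the file
# (its copy to the terminal is silenced here).
find "$demo/in" -type f -name example.yaml -exec cat {} \; |
    tee -a "$demo/with-tee.yaml" >/dev/null

# The redirection applies to find itself, so it is opened only once.
find "$demo/in" -type f -name example.yaml -exec cat {} \; >>"$demo/with-redirect.yaml"

if cmp -s "$demo/with-tee.yaml" "$demo/with-redirect.yaml"; then same=yes; else same=no; fi
line_count=$(grep -c '' "$demo/with-tee.yaml")
echo "identical: $same, lines: $line_count"
rm -rf "$demo"
```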

This solution uses only a few commands, which to my knowledge are present on every Linux system you are likely to encounter - there are no special requirements and the code is quite portable.

score -1 · Answer 3 · answered Aug 25 '22 at 17:53

If you are looking for files with the specific name example.yaml, you can do this very easily. First create a new file with persons: and then append all lines starting with name: or age: from all example.yaml files to it:

printf 'persons:\n' > personsFile
find /target/directory -name example.yaml -exec grep -hE '^(name|age):' {} + >> personsFile

If you really need the - in front of each name entry, and the indentation, you could add it in a second pass:

printf 'persons:\n' > personsFile
find /target/directory -name example.yaml -exec grep -hE '^(name|age):' {} + >> personsFile
sed -i 's/^name/  - name/; s/^age/    age/' personsFile
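One thing to watch with grep and {} +: when grep is given more than one file, it prefixes every match with the pathname unless -h is used (an option supported by GNU and BSD grep). A minimal sketch with a made-up demo tree, piping through sed instead of editing in-place:

```shell
# Hypothetical demo tree with two input files.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
printf 'name: Andre\nage: 13\naddress: street\n' >"$demo/a/example.yaml"
printf 'name: Joao\nage: 18\nweight: 78kgs\n'    >"$demo/b/example.yaml"

# Several files on one grep command line: matches carry a pathname prefix.
prefixed=$(find "$demo" -name example.yaml -type f \
    -exec grep -E '^(name|age):' {} + | head -n 1)
echo "without -h: $prefixed"

# -h drops the prefix; sed adds the list marker and the indentation.
persons=$(
    printf 'persons:\n'
    find "$demo" -name example.yaml -type f -exec grep -hE '^(name|age):' {} + |
        sed 's/^name/  - name/; s/^age/    age/'
)
printf '%s\n' "$persons"
rm -rf "$demo"
```

The order of the two entries depends on the order in which find visits the directories.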

But if you're really dealing with a structured format like YAML, you should probably look at dedicated tools instead of hacking it like this.

score -1 · Answer 4 · answered Aug 25 '22 at 18:04


Read man find xargs grep bash and do something like:

printf "%s\n" "persons:" >newfile
find . -type f -name 'example.yaml' -print0 |
    xargs -0 -r \
        grep -E --no-filename '^(name|age):' >>newfile

Note: This code has NOT been tested.


waltinator


Does not seem to create an actual YAML array. – Kusalananda Aug 26 '22 at 06:09
