Help me understand a script that uses csplit and sed

Question

I wanted a simple way to export notes from the reference manager Zotero. I start by selecting multiple notes and dragging them into a blank text file. I also want to achieve "atomicity" of my notes, so I need to split the resulting text file, which contains the individual notes in sections separated by lines of dashes. I then want to use the heading I gave to each note to name the new files, i.e. rename each file with the first line of its section. I want to save these new files as markdown files.

The script I have put together is made up of suggestions for each of these functions by contributors on the web. I am trying to make sure that I understand the commands in the script correctly before sharing it with colleagues who have a similar use case to mine. My understanding (from reading Gilles' answer to another question - see reference link below) of the need for quote marks around the "$f" in the "head" command does not seem to be correct. I tried the script without the quotes and got the same result. Are the double quotes not really needed because "$f" appears on the right-hand side of an assignment? Are they only there because it is easier to double quote by default than to remember when they are not needed? Any further explanation would be much appreciated.

An example of the input file would be the following in Notes_test.txt

This is note 1
It has some notes

This is note 2
It has some more notes

The output from this should be two files:

This is note 1.md
This is note 2.md

This is the script I am using on the command line:

csplit Notes_test.txt -f_ -z -b'%03d.md' /--------------------------------------------------/1 {*} && sed -i '/./,$!d' *.md && for f in *.md
    do
    f1=$(head -n1 "$f")
    mv -n "$f" "$f1.md"
    done

and this is my understanding of the commands so far:

-fPREFIX Use PREFIX as the output file name prefix. In this case an underscore is specified: "_" which I see is just a placeholder.

-z Suppress the generation of zero-length output files. I think this is necessary because otherwise csplit will produce an empty file at the end of each run through splitting the original files.

-bSUFFIX Use SUFFIX (a printf-style format) as the output file name suffix. In this case: '%03d.md', i.e. a number followed by ".md"

%03d puts a 3 digit number as a placeholder for the file name. I added the zero before the 3 at the suggestion of FelixJN.

/--------------------------------------------------/1 specifies the delimiter for the split, with the split being made 2 lines below the line of "-"s (count starts from 0).

{*} tells csplit to repeat the split until the end of the file. As Felix points out, "{n}" is the number of splits to be executed. In this case "*" means do as many as possible.

&& means execute the following command only if the previous command has completed successfully (exit status 0)

sed -i directs sed to edit the files in place (an optional suffix after -i makes sed keep a backup copy with that suffix).

'/./,$!d' means "remove blank lines at the head of the file". Thanks to Felix again for explaining that this specifies the range on which sed works. A "." means any character, so /./ matches the first line that contains a character. A range is written /start/,/end/ to apply the command between the lines matching "start" and "end". $ refers to the last line, so the range /./,$ runs from the first non-empty line to the end of the document. "!" means "NOT", i.e. it tells sed to select the opposite of the range: in this case, all the lines before the first line with any character. "d" then deletes these lines.

*.md means "which has any name with suffix .md"

f1=$(head -n1 "$f") means: define f1 as the first line of the file. The "$( )" runs the command inside it and substitutes its output, so "f1" becomes a placeholder (in the next line of the script) for the new file name (minus suffix). "head" is a utility that normally outputs the first 10 lines of each file: head [OPTION]... [FILE]... The option -n1 specifies to output one line only. Here, instead of naming a particular FILE, "$f" stands for whichever file the "for" loop is currently processing. The quote marks around "$f" are needed so that whitespace is ignored (otherwise $f uses whitespace as field separator and further splits the files - see reference link below).

mv -n "$f" "$f1.md" means: rename each file as "f1.md"

The "mv" command takes options and arguments: mv [OPTION]... [-T] SOURCE DEST i.e.: "Rename SOURCE to DEST." The -n option stands for --no-clobber "do not overwrite an existing file." I think this is just in case there are files (notes) that have the same first line.
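To check what -n does when two notes share a first line, here is a minimal sketch. The temporary directory and the two sample files are made up for illustration; the loop is the one from the script, with "|| true" added as a precaution in case the installed mv reports a skipped move as a failure.

```shell
# Hypothetical scratch directory with two pieces that share a heading.
workdir=$(mktemp -d)
printf 'Same title\nfirst body\n'  > "$workdir/_000.md"
printf 'Same title\nsecond body\n' > "$workdir/_001.md"

for f in "$workdir"/_*.md; do
    f1=$(head -n1 "$f")                    # first line becomes the new name
    mv -n "$f" "$workdir/$f1.md" || true   # -n: never overwrite an existing file
done

ls "$workdir"
```

The first piece is renamed to "Same title.md"; the second keeps its csplit name because the target already exists, so no note is lost.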

See https://www.tutorialspoint.com/unix_commands/csplit.htm, the coreutils manual for Unix-like operating systems at https://www.gnu.org/software/coreutils/manual/coreutils.pdf, and https://www.howtoforge.com/linux-csplit-command/ ("Q2. How to split files using regular expressions?"), as well as "Why does my shell script choke on whitespace or other special characters?" and "When is double-quoting necessary?"

Please be more detailed on where your problem really lies. Maybe check the guide on how to ask a good question. So far your understanding of the commands is correct, use man csplit to get a manual explaining how csplit works. — FelixJN, Sep 06 '21 at 08:59
It is probably best to put a well formatted description of the algorithm in the question's body, rather than just in the script - it will be easier to read than having to scroll horizontally through your code reading monospaced text... Also, provide an example input and example of your required output. — Greenonline, Sep 11 '21 at 04:26
I have edited my question as suggested by Green and included Felix's explanation. My remaining question is: Are the double quotes not really needed because "$f" appears on the right-hand side of an assignment? Are they only there because it is easier to double quote by default than to remember when they are not needed? — Christopher J Poor, Sep 11 '21 at 21:45
The answer to your question about double quotes is "yes". There's no harm in quoting an assignment, but you can encounter all sorts of unwanted issues if you forget to quote when you should have done so. It's simpler always to quote — Chris Davies, Sep 12 '21 at 00:05
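A small experiment may help separate the two places where quoting comes up in this script. This is only a sketch: the scratch directory and sample file are hypothetical, and "set --" is used purely to count how many words the shell produces from an expansion.

```shell
# Hypothetical scratch file standing in for one csplit output piece.
workdir=$(mktemp -d)
printf 'This is note 1\nIt has some notes\n' > "$workdir/_000.md"
f=$workdir/_000.md

# Right-hand side of an assignment: the result of $(...) is not split,
# so the whole first line is stored even without outer quotes.
f1=$(head -n1 "$f")

# As a command argument, the same value IS split unless it is quoted.
set -- $f1;   unquoted_words=$#
set -- "$f1"; quoted_words=$#

echo "unquoted: $unquoted_words, quoted: $quoted_words"
```

Inside $(...), $f is an ordinary argument to head and would be split if it contained whitespace. The names csplit generates (_000.md and so on) contain none, which would explain why removing those quotes made no visible difference. The quotes in mv -n "$f" "$f1.md" do matter here, because the heading contains spaces.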

FelixJN · Accepted Answer · 2021-09-06T12:04:33.890

Since I do not see any problems in your understanding, I will focus on the sed part.

Ranges

sed may apply a command to a range of lines, e.g. replacing (substituting) the first A in each line with a B, from line 11 to line 20, looks like:

sed '11,20s/A/B/'
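A quick way to try this out, as a sketch on made-up input: four identical lines, with the range shortened to lines 2 to 3 so the effect is easy to see.

```shell
# Four sample lines (illustrative input, not from the notes file).
result=$(printf '%s\n' 'A A' 'A A' 'A A' 'A A' | sed '2,3s/A/B/')
printf '%s\n' "$result"
```

Only lines 2 and 3 change, and only the first A on each of them, because the s command has no g flag.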

A range may also be defined by pattern matches like /start/,/end/ to apply the command between the strings start and end.

In your case we have /./,$.

A . means any character. Empty lines do not have any characters, so /./ only matches a line that is NOT empty, and the range starts at the first such line. $ just refers to the last line, so the range covers the whole document except the empty lines at the beginning.

Now ! comes into play, which means NOT, i.e. select the opposite of the previous range. In this case all lines before the first line with a character.

d then deletes these lines.
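The effect can be checked on a small sample, sketched here with made-up input piped through sed instead of editing a file in place:

```shell
# Two leading blank lines, then text with a blank line in the middle.
cleaned=$(printf '\n\nThis is note 1\n\nIt has some notes\n' | sed '/./,$!d')
printf '%s\n' "$cleaned"
```

The two leading blank lines are deleted, while the blank line between the two text lines survives because it falls inside the range.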

One more comment on '{*}' in csplit: '{n}' is the number of times the split is to be repeated, and the asterisk just means as many times as possible. You could also split only 5 times.
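To watch the repetition at work, here is a minimal sketch. It assumes GNU csplit (for -z and -b), uses a shortened separator of ten dashes, and writes into a scratch directory; the sample notes are illustrative.

```shell
# Hypothetical scratch directory and a shortened separator line.
workdir=$(mktemp -d)
sep='----------'

# Two notes, each followed by a separator line.
printf '%s\n' 'This is note 1' 'It has some notes' "$sep" '' \
              'This is note 2' 'It has some more notes' "$sep" \
    > "$workdir/Notes_test.txt"

# Same options as in the question; the prefix includes the directory.
csplit -z -f "$workdir/_" -b '%03d.md' "$workdir/Notes_test.txt" "/$sep/1" '{*}'

ls "$workdir"
```

With '{*}' csplit keeps splitting until the input runs out. Without -z the last, empty piece would be left behind as a zero-length file.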

Instead of %3d, I'd suggest using %03d for zero-padded three-digit numbers, it makes sorting easier.
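The difference is easy to see with the shell's own printf, which accepts the same conversions. This is a sketch using the underscore prefix from the question and an arbitrary piece number:

```shell
# %03d pads with zeros, %3d pads with spaces.
padded=$(printf '_%03d.md' 7)
unpadded=$(printf '_%3d.md' 7)
printf '[%s]\n' "$padded" "$unpadded"
```

The space-padded form puts blanks into the file name, which is one more reason to prefer the zero-padded one.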

Thank you, Felix, for your explanation of the sed commands. I guess I should have been more specific that this was the part I was entirely ignorant about. I also wanted to be sure that my understanding of the other parts of the script was correct before sharing it with the Zotero community. I use the script to split multiple notes dragged from Zotero into a text document. — Christopher J Poor, Sep 06 '21 at 23:40
