
I have the following data (a list of R packages parsed from an R Markdown file) that I want to turn into a list I can pass to R to install:

d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr

I want to turn the list into a list of the form:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

I currently have a bash pipeline that goes from the raw file to the list above:

grep 'library(' Presentation.Rmd \
| grep -v '#' \
| cut -f2 -d\( \
| tr -d ')'  \
| sort | uniq

I want to add a step that turns the newlines into a comma-separated list. I've tried adding tr '\n' '","', which fails. I've also tried several Stack Overflow answers, which also fail:

This produces library(stringr)))phics) as the result.

This produces ,% as the result.

This answer (with the -i flag removed), produces output identical to the input.

fbt

8 Answers


You can add quotes with sed and then merge the lines with paste, like this:

sed 's/^\|$/"/g'|paste -sd, -

If you are running a system based on GNU coreutils (e.g. most Linux distributions), you can omit the trailing '-'.

If your input data has DOS-style line endings (as @phk suggested), you can modify the command as follows:

sed 's/\r//;s/^\|$/"/g'|paste -sd, -
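
To see why the carriage returns matter, here is a small Python sketch (not part of the original answer; the join_quoted helper and the sample text are made up for illustration) that mimics the quote-and-join step on DOS-style input:

```python
def join_quoted(lines):
    """Wrap each line in double quotes and join with commas, like sed + paste."""
    return ",".join(f'"{line}"' for line in lines)

dos_text = "scales\r\nstringr\r\n"

# Splitting on "\n" only (as line-oriented tools do) leaves a "\r" on each name.
raw = dos_text.split("\n")[:-1]
print(repr(join_quoted(raw)))      # '"scales\r","stringr\r"'

# Dropping the "\r" first (the extra s/\r// step) gives the intended result.
clean = [line.rstrip("\r") for line in raw]
print(repr(join_quoted(clean)))    # '"scales","stringr"'
```

A terminal moves the cursor back to the start of the line on each "\r", which would be consistent with the overwritten-looking output reported in the question.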
zeppelin
  • 2
    On MacOS (and maybe others), you will need to include a dash to indicate that the input is from stdin rather than a file: sed 's/^\|$/"/g'|paste -sd, - – cherdt Jan 17 '17 at 19:08
  • True, "coreutils" version of paste will accept both forms, but "-" is more POSIX. Thx ! – zeppelin Jan 17 '17 at 19:21
  • 2
    Or just with sed alone: sed 's/.*/"&"/;:l;N;s/\n\(.*\)$/, "\1"/;tl' – Digital Trauma Jan 17 '17 at 20:09
  • When I run the command from your answer, I get output as seen in this gist. output.txt is the output after both sed and paste, and sed_output.txt is the output after sed. I am using gsed and gpaste installed from coreutils on MacOS. – fbt Jan 17 '17 at 20:17
  • @fbt would you please provide the full command line you used, and the raw source data?

    I've tested this with the list of packages as provided in your question, and it works just fine (see https://goo.gl/zlBIVS).

    – zeppelin Jan 17 '17 at 20:46
  • 1
    @fbt The note I now added at the end of my answer applies here as well. – phk Jan 17 '17 at 20:52
  • 1
    @DigitalTrauma - not really a good idea; that would be very slow (might even hang with huge files) - see the answers to the Q I linked in my comment on the Q here; the cool thing is to use paste alone ;) – don_crissti Jan 17 '17 at 21:07
  • As per the question, the output should be single quotes instead of double ones, so piping a simple character translation would do the trick: sed 's/\r//;s/^\|$/"/g'|paste -sd, - | tr '"' "'" – Zumo de Vidrio Jan 18 '17 at 08:31
  • Not all Linux have GNU coreutils. – ctrl-alt-delor Jan 19 '17 at 21:41

Using awk:

awk 'BEGIN { ORS="" } { print p"'"'"'"$0"'"'"'"; p=", " } END { print "\n" }' /path/to/list

Alternative with less shell escaping and therefore more readable:

awk 'BEGIN { ORS="" } { print p"\047"$0"\047"; p=", " } END { print "\n" }' /path/to/list

Output:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

Explanation:

The awk script itself, without all the escaping, is BEGIN { ORS="" } { print p"'"$0"'"; p=", " } END { print "\n" }. The variable p is set after the first entry is printed (before that it is unset, which awk treats as an empty string). Every entry (or in awk-speak: record) is printed with p as a prefix and with single quotes around it. The awk output record separator variable ORS is not needed (since the prefix is doing the job for you), so it is set to be empty at the BEGINing. Oh, and we might want our output to END with a newline (e.g. so it works with further text-processing tools); should this not be needed, the part with END and everything after it (inside the single quotes) can be removed.
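
For readers who find the prefix trick easier to follow outside awk, here is a rough Python equivalent. It is only a sketch; join_with_prefix and its parameters are illustrative names, not anything from awk:

```python
def join_with_prefix(records, quote="'", sep=", "):
    """Emit each record preceded by a prefix that is empty only for the first one."""
    out = []
    prefix = ""                  # plays the role of awk's initially unset variable p
    for record in records:
        out.append(prefix + quote + record + quote)
        prefix = sep             # every later record is preceded by the separator
    return "".join(out) + "\n"   # trailing newline, like the END block

print(join_with_prefix(["d3heatmap", "data.table", "ggplot2"]), end="")
```

Because the separator is written before each record rather than after it, nothing has to be trimmed from the end of the output.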

Note

If you have Windows/DOS-style line endings (\r\n), you have to convert them to UNIX style (\n) first. To do this you can put tr -d '\015' at the beginning of your pipeline:

tr -d '\015' < /path/to/input.list | awk […] > /path/to/output

(Assuming you don't have any use for \rs in your file. Very safe assumption here.)

Alternatively, simply run dos2unix /path/to/input.list once to convert the file in-place.

phk
  • When I run this command, I get ', 'stringr23aphics as the output. – fbt Jan 17 '17 at 20:21
  • @fbt See my latest note. – phk Jan 17 '17 at 20:31
  • 2
    print p"'"'"'"$0"'"'"'"; p=", "—holy quotes, Batman! – wchargin Jan 17 '17 at 21:23
  • I know, right‽ :) I thought about mentioning that in many shells print p"'\''"$0"'\''"; would have also worked (it's not POSIXy though), or alternatively using bash's C quoting strings ($'') even just print p"\'"$0"\'"; (might have required doubling other backslashes though) but there's already the other method using awk's character escapes. – phk Jan 17 '17 at 21:41
  • Wow, I can't believe you figured that out. Thank you. – fbt Jan 18 '17 at 17:44
  • the Alternative works for me! – BBK Oct 21 '22 at 16:11

As @don_crissti's linked answer shows, the paste option borders on incredibly fast -- the linux kernel's piping is more efficient than I would have believed if I hadn't just now tried it. Remarkably, if you can be happy with a single comma separating your list items rather than a comma+space, a paste pipeline

(paste -d\' /dev/null - /dev/null | paste -sd, -) <input

is faster than even a reasonable flex program(!)

%option 8bit main fast
%%
.*  { printf("'%s'",yytext); }
\n/(.|\n) { printf(", "); }

But if just decent performance is acceptable (and if you're not running a stress test, you won't be able to measure any constant-factor differences, they're all instant) and you want both flexibility with your separators and reasonable one-liner-y-ness,

sed "s/.*/'&'/;H;1h;"'$!d;x;s/\n/, /g'

is your ticket. Yes, it looks like line noise, but the H;1h;$!d;x idiom is the right way to slurp up everything. Once you can recognize that, the whole thing actually gets easy to read: it's s/.*/'&'/ followed by a slurp and an s/\n/, /g.
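
As a reading aid, the same three steps can be spelled out in Python. This is a sketch only; slurp_and_join and the hold list are illustrative stand-ins for what sed does with its hold space:

```python
def slurp_and_join(text, sep=", "):
    """Quote every line, collect them all, then replace newlines with sep."""
    hold = []                              # stands in for sed's hold space
    for line in text.splitlines():
        hold.append("'" + line + "'")      # s/.*/'&'/ followed by H (1h on line 1)
    buffer = "\n".join(hold)               # what x brings back on the last line
    return buffer.replace("\n", sep)       # s/\n/, /g

print(slurp_and_join("d3heatmap\ndata.table\nggplot2\n"))
```

Since the separator is only introduced in the final replace, any string can be used there, which is the flexibility the sed version has over paste.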


edit: bordering on the absurd, it's fairly easy to get flex to beat everything else hollow, just tell stdio you don't need the builtin multithread/signalhandler sync:

%option 8bit main fast
%%
.+  { putchar_unlocked('\'');
      fwrite_unlocked(yytext,yyleng,1,stdout);
      putchar_unlocked('\''); }
\n/(.|\n) { fwrite_unlocked(", ",2,1,stdout); }

and under stress that's 2-3x quicker than the paste pipelines, which are themselves at least 5x quicker than everything else.

jthill
  • 1
    (paste -d\ \'\' /dev/null /dev/null - /dev/null | paste -sd, -) <infile | cut -c2- would do comma+space @ pretty much the same speed though as you noted, it's not really flexible if you need some fancy string as separator – don_crissti Jan 18 '17 at 11:33
  • That flex stuff is pretty damn cool man... this is the first time I see someone posting flex code on this site... big upvote ! Please post more of this stuff. – don_crissti Jan 24 '17 at 21:39
  • @don_crissti Thanks! I'll look for good opportunities, sed/awk/whatnot are usually better options just for the convenience value but there's often a pretty easy flex answer too. – jthill Jan 25 '17 at 22:19

Python

Python one-liner:

$ python -c "import sys; print(','.join([repr(l.strip()) for l in sys.stdin]))" < input.txt                               
'd3heatmap','data.table','ggplot2','htmltools','htmlwidgets','metricsgraphics','networkD3','plotly','reshape2','scales','stringr'

This works in a simple way: we redirect input.txt into stdin using the shell's < operator, then read each line into a list, with .strip() removing surrounding whitespace (including the newline) and repr() creating a quoted representation of each line. The list is then joined into one big string via the .join() method, with , as the separator.

Alternatively we could use + to concatenate quotes to each stripped line.

 python -c "import sys;sq='\'';print(','.join([sq+l.strip()+sq for l in sys.stdin]))" < input.txt

Perl

Essentially the same idea as before: read all lines, strip the trailing newline, enclose each in single quotes, push everything onto the array @cvs, and print the array values joined with commas.

$ perl -ne 'chomp; $sq = "\047" ; push @cvs,"$sq$_$sq";END{ print join(",",@cvs)   }'  input.txt
'd3heatmap','data.table','ggplot2','htmltools','htmlwidgets','metricsgraphics','networkD3','plotly','reshape2','scales','stringr'
  • IIRC, Python's join should be able to take an iterator, therefore there should be no need to materialize the stdin loop to a list – iruvar Jan 20 '17 at 06:37
  • @iruvar Yes, except look at OP's desired output - they want each word quoted, and we need to remove trailing newlines to ensure output is one line. You have an idea how to do that without a list comprehension ? – Sergiy Kolodyazhnyy Jan 20 '17 at 06:44
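
For reference, the variant discussed in these comments could look like the following sketch, where a generator expression is passed straight to join (quoted_csv is a made-up helper name, and io.StringIO stands in for sys.stdin):

```python
import io

def quoted_csv(stream):
    """Quote each stripped line of a text stream and join the results with commas."""
    # The generator expression is handed directly to join(); no list is built by hand.
    return ",".join("'" + line.strip() + "'" for line in stream)

print(quoted_csv(io.StringIO("d3heatmap\ndata.table\nggplot2\n")))
```

In CPython, str.join still collects the items into a sequence internally before building the result, so this is mostly a matter of style rather than memory use.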

I think the following should do just fine, assuming your data is in the file text:

d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr

Let's use arrays which have the substitution down cold:

#!/bin/bash
input=( $(cat text) ) 
output=( $(
for i in ${input[@]}
        do
        echo -ne "'$i',"
done
) )
output=${output:0:-1}
echo ${output//,/, }

The output of the script should be as follows:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

I believe this was what you were looking for?

phk
  • 2
    Nice solution. But while OP didn't explicitly ask for bash, and while it is safe to assume that someone might use it (after all, AFAIK it's the most used shell), it still shouldn't be taken for granted. Also, there are parts where you could do a better job at quoting (putting in double quotes). For example, while the package names are unlikely to have spaces in them, it still is good convention to quote variables rather than not. You might want to run https://www.shellcheck.net/ over it and see the notes and explanations there. – phk Jan 20 '17 at 06:36

I often have a very similar scenario: I copy a column from Excel and want to convert the content into a comma separated list (for later usage in a SQL query like ... WHERE col_name IN <comma-separated-list-here>).

This is what I have in my .bashrc:

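function lbl {
    local tmpfile
    tmpfile=$(mktemp)
    # Read from the file given as $1, or from stdin ("-") when no argument is passed
    cat "${1:--}" > "$tmpfile"
    dos2unix "$tmpfile"
    # Wrap the lines in parentheses, join them with commas, then strip the stray commas
    (echo "("; cat "$tmpfile"; echo ")") | tr '\n' ',' | sed -e 's/(,/(/' -e 's/,)/)/' -e 's/),/)/'
    rm "$tmpfile"
}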

I then run lbl ("line by line") on the command line, which waits for input, paste the content from the clipboard, and press <C-D>; the function then returns the input surrounded with (). This looks like so:

$ lbl
1
2
3
dos2unix: converting file /tmp/tmp.OGM6UahLTE to Unix format ...
(1,2,3)

(I don't remember why I put the dos2unix in here, presumably because this often causes trouble in my company's setup.)

Rolf

Some versions of sed behave a little differently, but on my Mac I can handle everything but the "uniq" in sed:

sed -n -e '
# Skip commented library lines
/#/b
# Handle library lines
/library(/{
    # Replace line with just quoted filename and comma
    # Extra quoting is due to command-line use of a quote
    s/library(\([^)]*\))/'\''\1'\'', /
    # Exchange with hold, append new entry, remove the new-line
    x; G; s/\n//
    ${
        # If last line, remove trailing comma, print, quit
        s/, $//; p; b
    }
    # Save into hold
    x
}
${
    # Last line not library
    # Exchange with hold, remove trailing comma, print
    x; s/, $//; p
}
'

Unfortunately, to fix the unique part you have to do something like:

grep library Presentation.Rmd | sort -u | sed -n -e '...'
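
If leaving the shell is acceptable, the whole extract, de-duplicate and quote sequence fits in a few lines of Python. This is a sketch under the assumptions of the question (library(...) calls, and any line containing # is skipped); r_packages and as_r_list are illustrative names:

```python
import re

def r_packages(rmd_text):
    """Return sorted, de-duplicated package names from library(...) calls."""
    names = set()
    for line in rmd_text.splitlines():
        if "#" in line:                      # like grep -v '#': skip commented lines
            continue
        for name in re.findall(r"library\(([^)]*)\)", line):
            names.add(name.strip())
    return sorted(names)

def as_r_list(names):
    """Format names as 'a', 'b', 'c' for pasting into an R call."""
    return ", ".join(f"'{n}'" for n in names)

sample = "library(ggplot2)\n# library(MASS)\nlibrary(scales)\nlibrary(ggplot2)\n"
print(as_r_list(r_packages(sample)))   # 'ggplot2', 'scales'
```

Note that sorted() orders by code point, which can differ from the locale-aware ordering of sort for names with mixed case.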

--Paul

PaulC

It is funny that, to use a plain-text list of R packages to install them in R, nobody proposed a solution that uses the list directly in R; instead everyone fights with bash, perl, python, awk, sed or whatever to put quotes and commas in the list. This is not necessary at all, and moreover it does not solve how to input and use the transformed list in R.

You can simply load the plain-text file (say, packages.txt) as a data frame with a single variable, which you can extract as a vector that is directly usable by install.packages. So converting it into a usable R object and installing that list is just:

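# Read one package name per line into a single-column data frame
df <- read.delim("packages.txt", header=FALSE, strip.white=TRUE, stringsAsFactors=FALSE)
# Column V1 is a character vector that install.packages accepts directly
install.packages(df$V1)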

Or without an external file:

packages <-" 
d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr
"
df <- read.delim(textConnection(packages),
                 header=FALSE, strip.white=TRUE, stringsAsFactors=FALSE)
install.packages(df$V1)
Fran