
I want to modify a file to remove all punctuation and digits, convert uppercase letters to lowercase, and put exactly one word per line. For example:
Hello, how are you!

hello
how
are
you

With some help I came up with this:

tr -d '[:punct:]' < file | tr -s '[:space:]' '\n' | tr -d '[0-9]' | tr '[A-Z]' '[a-z]' > cleanfile.txt

The issue, however, is that when I have an address in my file I end up with httpadresscom instead of

http  
adress  
com 

I also DON'T want words like "don't" or "readme.txt" to be split like this:

don  
t  
readme  
txt
gabriel

2 Answers

1

This should isolate all words, keeping only dots and apostrophes that occur inside them. It also keeps underscores, which are probably not wanted; in that case \w does not work and you need an explicit character class instead.

grep -oE "(\w|\.\w|'\w)*" text
one
two
Three
four
linux
file
system
isn't
What
nothing
mailto
a.b
some.org
Molly's
cat
Wrote
a
readme.txt

The input file text contains:

one two. Three four linux' file-system isn't What? "nothing" mailto: a.b@some.org Molly's cat. Wrote a readme.txt.

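The problem with tr is that it works one character at a time, while this task needs at least a minimal amount of context. With tr you would be stuck here, because you want some.org split but readme.txt kept whole. Note that the "@" is gone from the output as well.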
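If underscores and digits are unwanted, the same idea can be written with an explicit character class instead of \w. This is only a sketch: the function name extract_words is made up for illustration, and it still treats some.org and readme.txt the same way (both stay whole), so it does not solve the address case.

```shell
# Hypothetical helper: reads text on stdin, prints one lowercase word per line.
# A word is a run of letters, optionally joined by single dots or apostrophes.
# Underscores, digits, and other punctuation act as separators.
extract_words() {
  grep -oE "[[:alpha:]]+([.'][[:alpha:]]+)*" | tr '[:upper:]' '[:lower:]'
}

printf '%s\n' "Molly's cat wrote a readme.txt for file_system 42." | extract_words
```

Because digits are not part of the class, something like abc123def comes out as two words, abc and def, rather than abcdef.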

0

For the first part: don't remove punctuation; convert it into spaces instead.
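A minimal sketch of that idea, keeping the rest of the pipeline from the question. The function name split_words is hypothetical, and the tr call relies on the second set being padded with its last character, which GNU and BSD tr do.

```shell
# Hypothetical helper: reads text on stdin, prints one lowercase word per line.
split_words() {
  tr '[:punct:]' ' ' |            # punctuation becomes a space instead of vanishing
    tr -d '[:digit:]' |           # drop digits
    tr -s '[:space:]' '\n' |      # any run of whitespace becomes a single newline
    tr '[:upper:]' '[:lower:]' |  # lowercase everything
    sed '/^$/d'                   # remove an empty first line, if any
}

printf '%s\n' 'Hello, how are you! http://adress.com' | split_words
```

With this, http://adress.com is split into http, adress and com, but "don't" and "readme.txt" are split as well.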

For the second part ("don't" etc.): you may need a dictionary of words, or to leave apostrophes (') in place.
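Leaving apostrophes in place could look roughly like this. It is a sketch under assumptions: the function name split_keep_apostrophes is made up, and every other non-letter (including the dot in readme.txt) is still turned into a space, so file names would need the dictionary approach or an extra rule.

```shell
# Hypothetical helper: reads text on stdin, prints one lowercase word per line.
# Everything except letters, apostrophes, and whitespace becomes a space.
split_keep_apostrophes() {
  sed "s/[^[:alpha:]'[:space:]]/ /g" |
    tr -s '[:space:]' '\n' |      # one word per line
    tr '[:upper:]' '[:lower:]' |  # lowercase everything
    sed '/^$/d'                   # remove an empty first line, if any
}

printf '%s\n' "I don't know, OK?" | split_keep_apostrophes
```

Note that apostrophes used as quotation marks around a word are kept too, so they may need separate cleanup.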