I have a list of raw files that were scraped, and it consists of both text and source code. The file types are listed below. I want to remove all files of type C source, Python script, and HTML document, as well as empty files, and keep only the ASCII and Unicode text files.

$ file *
1dW6WJMN.txt:  Python script, ASCII text executable
9dJbZ3Vv.txt:  ASCII text, with CRLF line terminators
9dQsmVU4.txt:  Python script, UTF-8 Unicode text executable, with CRLF line terminators
A5hENB7D.txt:  C source, ASCII text, with CRLF line terminators
cidREdJG.txt:  UTF-8 Unicode text, with very long lines, with CRLF line terminators
exhjw1gK.txt:  UTF-8 Unicode text, with CRLF line terminators
iu7LPrqz.txt:  ASCII text, with very long lines, with CRLF line terminators
LsDHarjD.txt:  ASCII text
nLABt1a6.txt:  C source, ASCII text, with CRLF line terminators
nqMDtVuz.txt:  ASCII text, with CRLF line terminators
nqPuYb23.txt:  UTF-8 Unicode text, with CRLF line terminators
nQtzxhfQ.txt:  ASCII text, with CRLF line terminators
NQuLWwpt.txt:  ASCII text, with CRLF line terminators
nQXeJeED.txt:  ASCII text, with CRLF line terminators
nqXGv6ws.txt:  UTF-8 Unicode text, with CRLF line terminators
nQxr4Hwi.txt:  ASCII text, with CRLF line terminators
nQxr4Hwii.txt: empty
VQjrxevh.txt:  HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
yfDEfn4L.txt:  C source, ASCII text, with CRLF line terminators
yydAEDRn.txt:  HTML document, ASCII text, with very long lines, with CRLF line terminators

I tried a simple grep for ASCII, but the descriptions of the source code files also contain the term ASCII. Is there another way to filter out these source code files? Sometimes there are also PHP and JavaScript files that I want to get rid of. I'm quite new to Linux, and any help would be appreciated. Thanks in advance.

dmorgan

1 Answer

Try a longer pattern. You can use patterns that have spaces or tabs or many words. I would also recommend a progressive approach using pipelines:

$ file * | egrep -v 'ASCII text|Unicode text' | sed 's/: ..*$//'

If that doesn't get you the list of file names you want, hit up-arrow and edit the pattern(s) to match more, less, or different parts of the output of file.

The last step might be to send the output into a file full of commands:

$ file * | egrep -v 'ASCII text|Unicode text' | sed -e 's/: ..*$//' -e 's/^/rm /' > commands

Review the contents of the commands file for correctness, and maybe eliminate that last troublesome case. Use the pipeline to get 95% of the way to where you want to be, then hand edit. No shame in that. Then run the commands your pipeline wrote out:

$ sh ./commands
  • Not sure why but the first solution only returns the empty file 'nQxr4Hwii.txt: empty'. I would try the other one too – dmorgan Jul 01 '20 at 16:40
  • @dmorgan Looks like my egrep pattern isn't correct. All the lines of file command output have either "ASCII text" or "Unicode text" in them. To get this to work, you'll have to try a different pattern. Since I don't have your list of files or their contents, I can't do it. Maybe something like: egrep 'C source|PHP|Python script' - if egrep misses some particular file name, you can add to it. –  Jul 02 '20 at 15:30
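
Another option, sketched here under a few assumptions, is to anchor the pattern to the start of the description instead of listing every source type. In the listing above, the files to keep are the ones whose description begins with "ASCII text" or "UTF-8 Unicode text", while source files put their type first ("C source, ...", "Python script, ..."). The function name list_unwanted is hypothetical, and the sketch assumes file names contain no colons or newlines and that your version of file words its output as shown in the question (the wording can differ between versions).

```shell
# list_unwanted (hypothetical helper): reads `file` output on stdin and
# prints the names of files whose description does NOT start with
# "ASCII text" or "UTF-8 Unicode text".
# Assumes file names contain no ':' and no newlines.
list_unwanted() {
    # Drop lines where the text right after "name:" begins with a plain
    # text type; what remains is source code, HTML, empty files, etc.
    grep -Ev '^[^:]*:[[:space:]]+(ASCII|UTF-8 Unicode) text' |
        # Strip the description, leaving only the file name.
        sed 's/:[[:space:]].*$//'
}

# Example usage (review the list before deleting anything):
#   file * | list_unwanted
#   file * | list_unwanted | sed 's/^/rm /' > commands
```

Because the match is on what the description starts with, PHP, JavaScript, and other source types are caught without having to name each one. As with the pipeline in the answer, it is worth reading the resulting list by eye before running any rm commands.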