7

I am unsure whether it is safe to edit a JavaScript file. When I open it with gedit, it shows the following warning:

The file you opened has some invalid characters. If you continue editing this file you could corrupt this document. You can also choose another character encoding and try again.

The current encoding is UTF-8. As the file has over 100,000 lines of code, is there a quick way to scan for the invalid characters?

  • A quick and dirty way would be to edit the file (probably adding and deleting something will be enough to make gedit think it was changed), and save as another name. Comparing the original and the changed file (diff should help here) should tell you what is going on. – vonbrand Apr 12 '13 at 10:26
  • @vonbrand, thanks for your suggestion. I tried saving the file with another name using gedit but screwed up and overwrote the actual file. Hope others will learn from my mistake by copying the file first instead of using gedit to save :p – Question Overflow Apr 13 '13 at 02:47

1 Answer

14

As the file is supposed to be UTF-8, you could run isutf8, which is part of the moreutils package. It reports the line, character position and byte offset of the bad bytes.

Then use xxd, hexdump or the like to analyze.
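If xxd or hexdump is not at hand, the bytes around a reported offset can also be inspected with a few lines of Python. This is only a rough sketch: the function name hex_window and its context parameter are made up for illustration, and the output format merely imitates a hex dump line.

```python
def hex_window(data: bytes, offset: int, context: int = 8) -> str:
    """Format the bytes around `offset` as one hex-dump-style line.

    `hex_window` and `context` are hypothetical names for this sketch.
    """
    start = max(0, offset - context)
    end = min(len(data), offset + context + 1)
    chunk = data[start:end]
    # Two-digit hex for every byte in the window.
    hex_part = " ".join(f"{b:02x}" for b in chunk)
    # Printable ASCII is shown as-is, everything else as a dot.
    text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    return f"{start:08x}: {hex_part}  {text_part}"


# Small in-memory demo: 0xe9 is a lone Latin-1 byte, invalid as UTF-8 here.
print(hex_window(b"ab\xe9cd", 2, context=1))
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the offset that isutf8 reported.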

Unfortunately, it stops at the first error. But then again, it depends on the file: there may be only one bad byte ;)

I have some C code that does a similar analysis, but for the entire file. It is on a long-forgotten disk somewhere. I could try to find it if needed.
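A whole-file scan of this kind can also be sketched in Python. This is not the C code mentioned above, just an illustration under some assumptions: the function name find_invalid_utf8 and the error-handler name "record_bad_utf8" are invented here, and columns are counted in bytes (1-based), not in characters.

```python
import codecs


def find_invalid_utf8(data: bytes):
    """Return (line, column, offset, bad_bytes) for every invalid UTF-8 sequence.

    Lines and columns are 1-based; the column is a byte position in the line.
    """
    spans = []

    def record(exc):
        # Called by the decoder for each undecodable sequence.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        spans.append((exc.start, exc.end))
        # Substitute U+FFFD and resume decoding right after the bad bytes.
        return ("\ufffd", exc.end)

    # "record_bad_utf8" is an arbitrary handler name chosen for this sketch.
    codecs.register_error("record_bad_utf8", record)
    data.decode("utf-8", errors="record_bad_utf8")

    results = []
    for start, end in spans:
        line = data.count(b"\n", 0, start) + 1
        column = start - data.rfind(b"\n", 0, start)
        results.append((line, column, start, data[start:end]))
    return results


# In-memory demo: 0xe9 (Latin-1 "e acute") is not valid UTF-8 on its own.
sample = b"var a = 1;\nvar b = '\xe9';\n"
for line, column, offset, bad in find_invalid_utf8(sample):
    print(f"line {line}, byte column {column}, offset {offset}: {bad.hex()}")
```

For a real file, read it with open(path, "rb").read() and pass the bytes in. Unlike a tool that stops at the first problem, this keeps decoding after each bad sequence, so every offending position is listed.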

Otherwise, yes, the quick and not that dirty way would be to diff the original against a copy saved with gedit, as proposed by the good Mr. @vonbrand.

Runium
  • 28,811
  • +1 for isutf8. For my large files, the output from diff was too voluminous to interpret, whereas isutf8 immediately gave me the line number and character position of the first non-UTF-8 character. On Ubuntu 16.04: sudo apt-get install moreutils – Steve Saporta Oct 14 '16 at 20:25