The key to this type of approach is having access to a good database of English words. My system has the file /usr/share/dict/words, which contains a lot of words, but other sources could be used instead.
Approach
My general approach would be to use grep like so:
$ grep -vwf /usr/share/dict/words sample.txt
where sample.txt contains your example output.
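To make the flags concrete: -f reads one pattern per line from a file, -w only counts whole-word matches, and -v inverts the result so that only lines containing none of the words are printed. Here is a minimal sketch on throwaway data. The helper name non_dictionary_lines and the files words.txt and lines.txt are made up for illustration and stand in for the real dictionary and sample.txt.

```shell
# Hypothetical helper: print the lines of FILE that contain no word from WORDLIST.
# -f WORDLIST: one pattern per line; -w: whole words only; -v: invert the match.
non_dictionary_lines() {
    wordlist=$1
    file=$2
    grep -vwf "$wordlist" "$file"
}

# Throwaway demo data (made up; not the real dictionary or subtitle file).
demo=$(mktemp -d)
printf '%s\n' the work hard > "$demo/words.txt"
printf '%s\n' \
    'Wir haben eine harte Arbeit vor uns,' \
    "it's going to be hard work" \
    'theater' > "$demo/lines.txt"

non_dictionary_lines "$demo/words.txt" "$demo/lines.txt"
rm -rf "$demo"
```

This prints the German line and "theater". The latter survives because -w refuses to match "the" inside a longer word, and the English line is dropped because it contains "hard" and "work".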
In my limited testing, the size of the words dictionary seemed to bog grep down. My version has 400k+ lines in it, so I started doing something like this to break it up a bit:
$ head -10000 /usr/share/dict/words > ~/10000words
Sample runs (10k)
Then run your file through grep using the first 10k words from the "dictionary".
$ grep -vwf ~/10000words sample.txt
714
01:11:22,267 --> 01:11:27,731
Auch wenn noch viele Generationen auf einen Wechsel hoffen,
715
01:11:27,732 --> 01:11:31,920
werde ich mein Bestes geben
und hoffe, dass andere das gleiche tun.
I'm giving mine, I'm doing my best
hoping the other will do the same
716
01:11:31,921 --> 01:11:36,278
Wir haben eine harte Arbeit vor uns,
um den Lauf der Dinge zu ändern.
it's going to be hard work
for things to turn around.
717
01:11:36,879 --> 01:11:42,881
Wenn man die Zentren künstlicher Besamung,
die Zuchtlaboratorien und die modernen Kuhställe besichtigt,
When visiting artificial insemination centers,
the selection center, modern stables,
NOTE: This approach ran in ~1.5 seconds, on my i5 laptop.
It seems to be a viable approach. When I bumped it up to 100k lines, though, it started to take a long time and I aborted it before it finished. You could instead break the words dictionary up into several files.
NOTE: When I backed it off to 50k lines it took 32 seconds.
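One way to use a dictionary that has been broken into several files is to filter in passes: whatever survives the first chunk is checked against the second, and so on. The sketch below assumes that layout. The helper name filter_in_chunks and the chunk file names are hypothetical, and whether several small passes actually beat one large pass is something to measure on your own data.

```shell
# Hypothetical helper: filter FILE against each word-list chunk in turn.
# Lines that survive one chunk are fed to the next, so the final output
# is the set of lines containing no word from any chunk.
filter_in_chunks() {
    file=$1
    shift
    current=$(mktemp)
    next=$(mktemp)
    cp "$file" "$current"
    for chunk in "$@"; do
        # grep exits with status 1 when it selects no lines; not an error here.
        grep -vwf "$chunk" "$current" > "$next" || true
        cp "$next" "$current"
    done
    cat "$current"
    rm -f "$current" "$next"
}

# Example (assumed paths): split the dictionary into 10,000-line chunks
# named ~/words.chunk.aa, ~/words.chunk.ab, ... and filter with all of them.
#   split -l 10000 /usr/share/dict/words ~/words.chunk.
#   filter_in_chunks sample.txt ~/words.chunk.*
```

Adding -F (fixed strings instead of regular expressions) to the grep call may also speed things up with a large word list, though that is untested here.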
Diving deeper (50k lines)
When I expanded the dictionary to 50k lines I ran into the issue I was afraid of: overlap between the languages.
$ grep -vwf ~/50000words sample.txt
714
01:11:22,267 --> 01:11:27,731
715
01:11:27,732 --> 01:11:31,920
werde ich mein Bestes geben
und hoffe, dass andere das gleiche tun.
hoping the other will do the same
716
01:11:31,921 --> 01:11:36,278
Wir haben eine harte Arbeit vor uns,
um den Lauf der Dinge zu ändern.
717
01:11:36,879 --> 01:11:42,881
Wenn man die Zentren künstlicher Besamung,
die Zuchtlaboratorien und die modernen Kuhställe besichtigt,
the selection center, modern stables,
Analyzing the problem
One good thing about this approach is that you can remove the -v and see where the overlap is:
$ grep -wf ~/50000words sample.txt
Auch wenn noch viele Generationen auf einen Wechsel hoffen,
Even if it takes many generations hoping for a change,
I'm giving mine, I'm doing my best
it's going to be hard work
for things to turn around.
When visiting artificial insemination centers,
The word auf is apparently in both languages... well, at least it's in my words file, so refining the word list as needed might take a bit of trial and error.
NOTE: I knew it was the word auf because grep colored it red; that doesn't show up in the above output due to the limited nature of SE 8-).
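If you can't rely on the coloring, the -o option (available in GNU and BSD grep) prints only the matched text, one match per line, which makes the overlapping words easy to list and count. A minimal sketch follows; the helper name matched_words is made up for illustration.

```shell
# Hypothetical helper: list each dictionary word found in FILE with a count,
# most frequent first.
# -o prints only the matched text; -w keeps matches to whole words.
matched_words() {
    wordlist=$1
    file=$2
    grep -owf "$wordlist" "$file" | sort | uniq -c | sort -rn
}

# Example (paths from above):
#   matched_words ~/50000words sample.txt
```

Any word in that list that you recognize as belonging to the other language is a candidate for removal from the word list.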
A quick search confirms that auf is in the word list:
$ grep auf ~/50000words
auf
aufait
aufgabe
aufklarung
auftakt
baufrey
Beaufert
beaufet
beaufin
Beauford
Beaufort
beaufort
bechauffeur
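Refining the word list can be done with grep as well. The sketch below assumes you keep the words you want dropped in a separate exclude file, one per line. The helper name refine_wordlist and the file names are hypothetical.

```shell
# Hypothetical helper: print WORDLIST without the entries listed in EXCLUDE.
# -x matches whole lines only and -F treats each entry as a literal string,
# so excluding "auf" leaves "aufgabe" and "Beaufort" in place.
refine_wordlist() {
    wordlist=$1
    exclude=$2
    grep -vxFf "$exclude" "$wordlist"
}

# Example (assumed file names):
#   printf '%s\n' auf > ~/exclude.txt
#   refine_wordlist ~/50000words ~/exclude.txt > ~/50000words.refined
#   grep -vwf ~/50000words.refined sample.txt
```

Keeping the exclusions in their own file means the original word list stays untouched and the refinement can be repeated as you find more overlapping words.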