11

I have this text file:

714
01:11:22,267 --> 01:11:27,731
Auch wenn noch viele Generationen auf einen Wechsel hoffen,
Even if it takes many generations hoping for a change,

715
01:11:27,732 --> 01:11:31,920
werde ich mein Bestes geben
und hoffe, dass andere das gleiche tun.
I'm giving mine, I'm doing my best
hoping the other will do the same

716
01:11:31,921 --> 01:11:36,278
Wir haben eine harte Arbeit vor uns, 
um den Lauf der Dinge zu ändern. 
it's going to be hard work
for things to turn around.

717
01:11:36,879 --> 01:11:42,881
Wenn man die Zentren künstlicher Besamung, 
die Zuchtlaboratorien und die modernen Kuhställe besichtigt, 
When visiting artificial insemination centers,
the selection center, modern stables,
...

and would like to parse it so that only the non-English lines stay

is this possible?

    Can you safely assume that there will always be the same number of lines in each language? If there are two German lines will there always also be two English lines etc? – terdon Aug 31 '13 at 12:55

4 Answers

13

There is a difficult way and a much easier way. The difficult way is to use natural language parsing to estimate the probability that a given line is English, and to discard the lines that are.

The easier way is to take a list of English stop words and delete any line that contains a word from that list. To decrease the chance of mis-categorizing a line, you could also check the lines you do not reject for German stop words, to confirm that they are probably German.

Here's a very quick and dirty script that uses such a stop word list to do the filtering:

#!/usr/bin/env python3
# Build the set of English stop words from the list
# (one word per line; "|" starts a comment).
english_stop = set()
with open('english-stop-words.txt') as estop:
    for line in estop:
        bar = line.find('|')
        if bar > -1:
            line = line[:bar]
        line = line.strip()
        if line:
            english_stop.add(line)

# Print only the lines that contain no English stop word.
with open('mixed-german.txt') as mixg:
    for line in mixg:
        for word in line.lower().split():
            if word in english_stop:
                break
        else:                 # no stop word found: keep the line
            print(line, end='')

and the output:

714
01:11:22,267 --> 01:11:27,731
Auch wenn noch viele Generationen auf einen Wechsel hoffen,

715
01:11:27,732 --> 01:11:31,920
werde ich mein Bestes geben
und hoffe, dass andere das gleiche tun.

716
01:11:31,921 --> 01:11:36,278
Wir haben eine harte Arbeit vor uns, 
um den Lauf der Dinge zu ändern. 

717
01:11:36,879 --> 01:11:42,881
Wenn man die Zentren künstlicher Besamung, 
die Zuchtlaboratorien und die modernen Kuhställe besichtigt, 

A slightly more complete version should ignore punctuation such as , and . (but not the English apostrophe ' when it appears within a word). Even greater accuracy could be obtained by looking for codepoints that never occur in English (for example «ßü), but that's left as an exercise for the reader.
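
As a starting point for that exercise, here is a rough sketch of those two refinements (my own illustration, not part of the original answer; the function name and the set of German-marking characters are arbitrary choices):

import string

GERMAN_ONLY = set('äöüßÄÖÜ«»')               # characters that do not occur in English text
PUNCT = string.punctuation.replace("'", "")  # strip punctuation, but keep apostrophes

def looks_english(line, english_stop):
    # A line containing German-only characters is kept without further checks.
    if any(ch in GERMAN_ONLY for ch in line):
        return False
    # Otherwise treat the line as English if any of its words is an English stop word.
    words = (w.strip(PUNCT) for w in line.lower().split())
    return any(w in english_stop for w in words)

In the script above, the inner for/else loop could then be replaced by a call to looks_english(line, english_stop).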

msw
  • Very nice approach. Much better than my hack and slash approach 8-) – slm Aug 31 '13 at 14:12
  • Danke (using stop words as diagnostic of a language came from a part of my mind I didn't know was there ;) – msw Aug 31 '13 at 14:58
5

On your sample, this would work:

awk -v RS= -F '\n' -v OFS='\n' '{NF=NF/2+1;printf "%s", $0 RT}'

Details

  • RS=: sets the record separator. An empty value is a special case meaning that each record is a paragraph (a sequence of lines delimited by empty lines).
  • -F '\n': sets the field separator (fields in each record are lines).
  • OFS='\n': sets the output field separator.

For each record (paragraph):

  • NF=NF/2+1 (that is, 2 for the first 2 lines plus (NF-2)/2, half of the remaining lines): changes the number of fields so that the English lines are excluded. Assigning to NF rebuilds $0 with only that many fields, joined with OFS.
  • printf "%s", $0 RT: prints the record followed by the record terminator (to restore the same amount of spacing between paragraphs).

To see what the above code is doing, it can help to add some print statements into the mix; a rough Python rendering of the same logic is also sketched below.

That assumes Unix line endings. If the file is in MSDOS format as is common with subtitle files, you need to preprocess it with d2u or dos2unix.
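
Here is that rough Python rendering (my own sketch, not part of the original answer; it reuses the mixed-german.txt file name from the script earlier and assumes Unix line endings):

import sys

# Read blank-line-separated subtitle blocks and keep the index line,
# the timestamp line and the first half of the text lines of each.
with open('mixed-german.txt') as f:
    paragraphs = f.read().split('\n\n')

for para in paragraphs:
    lines = para.splitlines()
    if not lines:
        continue
    keep = len(lines) // 2 + 1        # same as awk's NF = NF/2 + 1
    sys.stdout.write('\n'.join(lines[:keep]) + '\n\n')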

  • This assumes that the English lines are always in the 3rd or 4th position, right? – slm Aug 31 '13 at 13:46
  • @slm No, it assumes that half the lines are English. – Stéphane Chazelas Aug 31 '13 at 13:50
  • Looking a bit more, this breaks the lines up into records. You then look within each record at the number of fields (NF); a field is a line in this case, right? I still don't get what you're doing with the NF-=NF/2-1 bit. Are you calculating, say, NF=4 for the first record (714), so you get the values NF=4 and NF/2-1=1, and then subtracting the 1 from NF leaves you with 3? Then printing the first 3 "fields" of the record drops the 4th line? – slm Aug 31 '13 at 14:36
3

The key piece of this type of approach is having access to a good database of English words. There is a file on my system, /usr/share/dict/words, which has a lot of words, but other sources could be used instead.

Approach

My general approach would be to use grep like so:

$ grep -vwf /usr/share/dict/words sample.txt

Where your sample text is in sample.txt.

In my limited testing the size of the words dictionary seemed to bog grep down. My version has 400k+ lines in it. So I started doing something like this to break it up a bit:

$ head -10000 /usr/share/dict/words > ~/10000words

Sample runs (10k)

This runs your file through grep using the first 10k words from the "dictionary".

$ grep -vwf ~/10000words sample.txt
714
01:11:22,267 --> 01:11:27,731
Auch wenn noch viele Generationen auf einen Wechsel hoffen,

715
01:11:27,732 --> 01:11:31,920
werde ich mein Bestes geben
und hoffe, dass andere das gleiche tun.
I'm giving mine, I'm doing my best
hoping the other will do the same

716
01:11:31,921 --> 01:11:36,278
Wir haben eine harte Arbeit vor uns, 
um den Lauf der Dinge zu ändern. 
it's going to be hard work
for things to turn around.

717
01:11:36,879 --> 01:11:42,881
Wenn man die Zentren künstlicher Besamung, 
die Zuchtlaboratorien und die modernen Kuhställe besichtigt, 
When visiting artificial insemination centers,
the selection center, modern stables,

NOTE: This approach ran in ~1.5 seconds, on my i5 laptop.

It seems to be a viable approach. When I bumped it up to 100k lines, though, it started to take a long time; I aborted it before it finished. To work around that, you could break the words dictionary up into several files.

NOTE: When I backed it off to 50k lines it took 32 seconds.

Diving deeper (50k lines)

When I started to expand the dictionary up to 50k words, I ran into the issue I was afraid of: overlap between the languages.

$ grep -vwf ~/50000words sample.txt
714
01:11:22,267 --> 01:11:27,731

715
01:11:27,732 --> 01:11:31,920
werde ich mein Bestes geben
und hoffe, dass andere das gleiche tun.
hoping the other will do the same

716
01:11:31,921 --> 01:11:36,278
Wir haben eine harte Arbeit vor uns, 
um den Lauf der Dinge zu ändern. 

717
01:11:36,879 --> 01:11:42,881
Wenn man die Zentren künstlicher Besamung, 
die Zuchtlaboratorien und die modernen Kuhställe besichtigt, 
the selection center, modern stables,

Analyzing the problem

One good thing about this approach is that you can remove the -v and see where the overlap is:

$ grep -wf ~/50000words sample.txt
Auch wenn noch viele Generationen auf einen Wechsel hoffen,
Even if it takes many generations hoping for a change,
I'm giving mine, I'm doing my best
it's going to be hard work
for things to turn around.
When visiting artificial insemination centers,

The word auf is apparently in both languages (well, at least it's in my words file), so refining the word list may be a bit of a trial-and-error process; one way to automate that pruning is sketched after the grep output below.

NOTE: I knew it was the word auf because grep colored it red; that doesn't show up in the above output due to the limited nature of SE 8-).

$ grep auf ~/50000words 
auf
aufait
aufgabe
aufklarung
auftakt
baufrey
Beaufert
beaufet
beaufin
Beauford
Beaufort
beaufort
bechauffeur
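
One way to automate that pruning (my own sketch, not part of the answer; the file names are placeholders): subtract a German word list from the English one before handing it to grep.

#!/usr/bin/env python3
# Hypothetical clean-up step: drop words that also appear in a German
# word list, so that shared spellings like "auf" no longer make grep
# treat German lines as English. File names are placeholders.
def load_words(path):
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

english = load_words('english-words.txt')
german = load_words('german-words.txt')

with open('english-only-words.txt', 'w') as out:
    out.write('\n'.join(sorted(english - german)) + '\n')

The filtering step then becomes grep -vwf english-only-words.txt sample.txt.
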
slm
  • The word "auf" exists in the English language? That MUST be a bug in the word file. It definitely does not, at least not as a standalone word (which should be the only way it's matched here) anyway. – syntaxerror Sep 05 '13 at 13:50
  • @syntaxerror - as I said, it's in the word list file that I was using. I am matching standalone words; that's what the grep -wf ... does. With a better supply of words this approach would be more direct. The other solution (Stephane's) depends on the data being structured and doesn't look at it in any contextual way; msw's approach seems to have better legs to me, though. – slm Sep 05 '13 at 13:55
  • I assumed you were parsing standalone. Whatever, I affirm that if the word "auf" is really part of an English-language word list, I want to see the dictionary reference where its existence is documented. Most likely, you won't find one...ever. But as you can see, one mere word can create total confusion in parsers of all sorts. – syntaxerror Sep 05 '13 at 17:40
  • @syntaxerror - sorry for the confusion, I wasn't arguing that "auf" is an actual English word, just that it happens to be in the dictionary file I was using. Incidentally, I double-checked the lineage of that file: it comes from a package on my Fedora 14 laptop called words, which cites this URL as the originator of the word lists it uses: http://en.wikipedia.org/wiki/Moby_Project – slm Sep 05 '13 at 18:45
1

This looks like a .srt file. If it is, and if the number of English lines per subtitle is always the same as the number of German lines, then you can use:

awk 'BEGIN { RS="\r\n\r\n"; FS="\r\n"} {for (i=1;i<=(NF-2)/2+2; i++) print $i "\r"; print "\r"}' old.srt > new.srt

Where old.srt and new.srt are your chosen input and output files.