0

I am trying to write a shell script to generate all possible words in the English language less than 20 characters. I doubt there is any truly efficient way to do this other than to brute force some of it. Clearly this is going to generate a lot of gibberish but through the complete set, if the scope is even computable in a decent amount of time, I hope to explore aspects of the human language.

Also if anyone knows how to compute or tell me what the space is I'd love to know. I guess this is basic combinatorics or permutations but I don't know which is which. 26 letters. 20 or 25 length. I'm sure 25 provides enough complexity to come up with some good words but this is bound to increase computation dramatically. In the set no doubt would be aaaaaaadfsf and also bungology.

Joe
  • 1
  • This is a somewhat related Q that might be useful in solving this: http://unix.stackexchange.com/questions/88978/how-can-i-delete-all-english-lines-from-a-textfile/88993#88993 – slm Sep 21 '14 at 00:49

3 Answers3

1

Actually there is a file named /usr/share/words which contains all the English words.

I would probably use that file to find all the English words and to get the words upto particular length, you can do like,

awk 'length <=20' /usr/share/words | wc -l

I get 479396 words inside that file.

Ramesh
  • 39,297
  • I was looking at /usr/share/dict/words on OSX. I do not know how the file is derived. This is obviously not a complete set or the complete space. It is useful but this is likely not to be all the words. Not sure how they generated it. For instance bungology is a word but not included in the list. – Joe Sep 21 '14 at 00:07
1

If you want words with 20 characters then with 26 letters there is

26^20 = 19928148895209409152340197376

possibilities. Computers are fast nowadays, but are they fast enough? Good luck ;)

jimmij
  • 47,140
  • I'm not that smart. Is this combinatorics or permutations? Or both? Which is it? Is it as simple as taking 26 to the 20th power? I'm not a mathematician so I have no idea if that is accurate. I feel the space would be smaller for some reason. – Joe Sep 21 '14 at 00:10
  • If you're right, that's a gigantic space. I was hoping before I die to write a program that could write Shakespeare. I of course am kidding but these are the math spaces I have been considering as of late. – Joe Sep 21 '14 at 00:11
  • I'm not sure this is right as supposedly, which I also doubt the number of atoms in the universe is something of 10^80. I would think it would be far greater than this. – Joe Sep 21 '14 at 00:11
  • One doesn't have to be mathematician. Lets compute it at the primary school level: with only one character there is 26 letters so 26 possibilities, right? With two characters there is 26 possibilities of second letter for each of 26 possibilities of first letter, so 2626, which is equal to 26^2, right? For three characters there is 262626 = 26^3 and so on... For 20 characters there is 26...*26 = 26^20. – jimmij Sep 21 '14 at 00:18
  • 1
    This is the upper bound of all the combinations of 26 letters in the English alphabet taken 20 at a time. This doesn't take into consideration anything about the rules of the language and that no words can exist w/o using at least 1 vowel. So the actual number is much smaller than this! – slm Sep 21 '14 at 00:42
  • Yeah. I see what you're saying now. Though my intuition has started to analyze set theory very deeply, my algorithmic skill is terrible. Any shell or perl code to write this? I tried to write something in bash and kept getting memory errors. I would think that shell if it runs out of memory should request more memory in some kind of dynamic fashion. I'd have no idea how to write it to include words with vowels only. – Joe Sep 21 '14 at 01:01
  • Any idea how long a typical modern quad core would take to generate that many words? Or what the space would be considering only words with vowels? Somehow people are versed deeply in theory and know how fast computers run, what consists of an operation, and how to calculate the answer to that. Disturbing. Why are so many humans created which are not deeply proficient in mathematics? If we can't generate our own language space, there is a problem. – Joe Sep 21 '14 at 01:06
  • The idea of algorithm is very simple, e.g. for 2 characters try the following one liner: for l1 in {a-z}; do for l2 in {a-z}; do echo $l1$l2; done; done. But bash is not the right tool for this job. It should be compiled code to gain a speed (C, fortran, whatever). Now when I think of it I realized there is more then 26^20 possibilities because not all words have exactly 20 characters, so we have to include 27th character (space) at last position. – jimmij Sep 21 '14 at 01:22
1

Since you're looking for words that are less than 20 characters, this includes words that are 1, 2, 3 .. or 19 characters in length (not sure if there is a word in the English language with 19 characters). The total number of possibilities is then 2619 + 2618 + 2617.. + 261.

The brute force way to approach this problem is to create a list which includes all 26 alphabet of the English language. Then inside of a loop for i = 0; i < 20; i++, you create all possible words of length i using the 26 characters in the alphabet array. Recursion is your friend here. Once you have a word of length i, you can then pass it through whatever filtering rules to be used to define words in the English language, e.g. no words can exist without a vowel as mentioned by slm.

Note: Writing the so called filtering rules is not a trivial task. For example, it's fairly easy to check if the word contains any of aieou, but passing this check doesn't mean you've found a word ..there is still quite a long way to go from there.

How long will this brute force method take?

Jimmij posted that 26^20 = 19928148895209409152340197376 ~ 2e28. Now lets say your computer has a quad core 1.5 GHz processor and your program is able to exploit each core 100%. This gives you 1.5e9 x 4 = 6e9 cycles in a second. Each permutation itself will take multiple CPU cycles as it has to consider 26 characters for each permutation, etc. This number however is insignificant when compared to the # of permutations so lets just say that each permutation takes 6 instructions (and each instruction takes 1 CPU cycle) to make the math simple. Finally, you get (6 instructions/permutation x 2e28 permutations)/(6e9 instructions/second) = (2e19 seconds) ~ 6.35e11 years.

Harvinder
  • 395
  • This is 26^20. A big space. Any clue to calculate how fast it might take to run on a macbook pro quad core or how to write this without getting out of memory errors in shell? – Joe Sep 21 '14 at 18:58
  • Check the updated answer, hopefully it all makes sense. – Harvinder Sep 21 '14 at 19:36
  • Jesus. You have to be sh&tting me. How did humans ever create dictionaries? I mean the Oxford dictionary is like 26 volumes. This is clearly not the entire space. Yet it is a huge book set. If you consider to organize a book (I have no idea) there are sets of pages in what I call a sheaf, that is to say that page 1 and page 30 are conjoined on one sheet of paper that wraps around itself and at the inner core of that sheaf page like 10 and 11 are adjacent. So how do you ever possibly physically lay out a book PRE COMPUTER of a set of that possible size… Its not humanly possible. – Joe Sep 21 '14 at 19:58
  • Further if we can't even derive our simple language using our standard computers within a reasonable timeframe… I have no idea how to compute scientific notation but what you are telling me that this is 6 times 10 to the 11th power. How many years is that in regular notation? This is crazy. – Joe Sep 21 '14 at 19:59
  • That's 600000000000 years but the method I described is the caveman approach. You can reduce the number of permutations significantly using clever filtering rules. For example, no word starts with zzz or yyy so permutations that start with those such as zzzab or yyyab are useless and do not need to be considered. However, the algorithm you will need to accomplish this reduction in permutations will be ridiculously complicated. – Harvinder Sep 21 '14 at 20:40