
I have a table, URL_Experiment, in my MySQL database containing 2 million URL links.

For each URL, I check whether some particular text is present in the page it points to and update the result back in my URL_Experiment table.
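A minimal sketch of the per-URL check might look like the following (the script name, the text_found column, the database name mydb, and the search string are all placeholders, not actual names; it assumes the mysql command-line client and curl are available):

```shell
#!/bin/sh
# check_urls.sh (hypothetical name): reads "id<TAB>url" lines on stdin,
# fetches each page, and records whether PATTERN occurs in it.
PATTERN='some particular text'   # placeholder for the real search text
TAB=$(printf '\t')
while IFS="$TAB" read -r id url; do
    if curl -fsS "$url" | grep -q "$PATTERN"; then
        found=1
    else
        found=0
    fi
    mysql -e "UPDATE URL_Experiment SET text_found=$found WHERE id=$id" mydb
done
```

Because the script takes its work on stdin, any number of copies can run at once, each fed a different slice of the table.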

Now, I need to make the script run fast. The easy solution that comes to my mind is:

Split URL_Experiment into 10 tables (each with 200,000 rows) and run the script 10 times simultaneously, one instance per table. This way I can ensure the updates happen properly. But for this I would need to keep 10 copies of the script, each accessing the correct table (with the tables renamed 1 through 10).

Though the above seems a pretty neat solution, I am looking for something more sophisticated: something that will execute the same script simultaneously (in parallel) as 10 processes. My concern is execution time; I do not want resources to sit idle when they could be utilized.

Faheem Mitha
Ramesh
  • No sir. If it were a spamming system, I would not have revealed my real identity. It is related to my thesis research work. I had built the entire code using Excel VBA, but it was failing for larger datasets, so I am rewriting it as a bash script. If you look at my SO questions, most of them are related to data extraction in VBA. – Ramesh Feb 13 '14 at 00:20
  • 1
    I am glad that am getting noticed for my questions :) I am really learning a lot these few days after starting to rewrite the code in script. I will mention all these links in my thesis defense :) – Ramesh Feb 13 '14 at 00:25
  • Are you populating the MySQL DB or is it the source of the URLs that you're being provided? I ask b/c I might be inclined to use sqlite instead of MySQL. BTW, what is your overall objective for your project? – slm Feb 13 '14 at 00:36
  • I generate the URLs based on the names that I get from another source. I know the website follows a particular pattern always and so I check if these URLs exist and if they exist, I extract the information from those URLs and store them in my MySQL database tables. – Ramesh Feb 13 '14 at 00:40
  • How was the performance in the creation of the MySQL DB? Often with this volume of rows the design of the architecture will bite you if you don't pay special attn. to it initially: http://stackoverflow.com/questions/1711631/how-do-i-improve-the-performance-of-sqlite – slm Feb 13 '14 at 00:42

1 Answer


You can try GNU parallel. Many examples can be found in its man page and tutorial.

I am a member of the Swift (parallel scripting language) team, and we do exactly these kinds of things for large research projects.
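As a concrete sketch of the GNU parallel approach (the worker script name ./check_urls.sh, the database name mydb, and the column selection are assumptions, not from the question):

```shell
# Feed every row of the table to 10 worker processes. Each worker
# (./check_urls.sh, a hypothetical script) reads its share of the
# "id<TAB>url" lines on stdin. Guarded so the sketch is a no-op on
# machines where the mysql client or GNU parallel is not installed.
if command -v mysql >/dev/null 2>&1 && command -v parallel >/dev/null 2>&1; then
    mysql -N -e 'SELECT id, url FROM URL_Experiment' mydb \
        | parallel --pipe -j10 --round-robin ./check_urls.sh
fi
```

Here --pipe splits stdin into blocks, -j10 keeps 10 jobs running, and --round-robin distributes the blocks among the running jobs, so you get the "10 processes over 2 million rows" effect without splitting the table or duplicating the script.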

Ole Tange
Ketan
  • Thanks a lot for the suggestion. If I need to use GNU parallel, do I have to change the script or something? Do I need to make any configuration changes? – Ramesh Feb 13 '14 at 00:18
  • Ketan offered to contact him directly if you need help setting up Swift, moving this to the comments, not really part of the answer: "Feel free to contact me if you would like to setup Swift for your problem.". – slm Feb 13 '14 at 00:19
  • 3
    parallel does seem like the best for the task, specifically I'd do something like mysql -e 'select ...' | parallel --pipe -j10 --round-robin ./scriptname.sh – phemmer Feb 13 '14 at 05:07