Writting a template for algorithms development - Basic concurrency

2024-12-25

Software engineering covers a wide range of different fields, being the most known one web development, but there is a lot more than that and databases and cloud computing. One topic that is the big brother who left 5 years ago and only comes back when you are knees deep in a problematic situation Also known as "a technical interview" is algorithms.

Algorithms are a fundamental part of software engineering, used in a wide range of applications, ranging from simple data structures to complex machine learning models. They are the backbone of any software system.

For me, a generalist who is making a PhD in computer science, the algorithm design is the base of my everyday work. After a few years of working with them I have gained some insights the loop that involves their development.

In case that you ever have to do something twice a month, as a Software Engineer I strongly believe that you should do it once and automate the process for the rest of the time. By following this idea, I have developed a template that I use for my projects, and that I share publicly in my GitHub for anyone interested. I hope it because algorithm development is usually done also by people with little knowledge about relevant software topics such as concurrency, file management, etc. And I hope this helps them in any way.

The template originally started as a reaction to the tediously repetitive process of running multiple files and saving the results in different files, later aggregating them. To avoid doing this manually, I first created a java class that automatically handled running multiple algorithms in different files automatically. This was the first improvement, and it was huge. Later this template has been improved in several ways, including more patterns, automations and ideas to simplify the whole process.

I have divided the explanation in several posts, which follow the same order as the improvements of the template, since, just like real life, as I improved things, other things caught my attention.

Starting from the beginning

I work in algorithm design and development. This is a process which involves, first researching information and new ideas, then designing the algorithm, and finally implementing it. This process is usually done in an iterative way, and it is repeated multiple times until your algorithm meets an objective, which might be to be faster, to use less memory, to be more accurate, or to just work.

Since it is done in an iterative way, the whole process is repetitive, requiring to do the same exact steps multiple times. Within that process, it usually starts by implementing the algorithm in your chosen programming language, then validating that it actually works as it should, later compiling the code, uploading it to a virtual machine, running it, waiting, and finally downloading the results. There is so much that can be automated in all these steps, since a mistake in one of the first steps requires to redo the whole process, being a total waste of time. Instead, I always bet for the computers, making our lives easier, by automating the boring stuff.

The first improvement

Within those steps, I started by automating the running of the code. Essentially, in this step you must run your code taking as input a certain amount of files. Now, the important part, each of these executions is an independent long running process. Each of these algorithms could run for 10 minutes, 1 hour, or even 10 hours, and it must run for all, 10, 100, or 1000 files, and sometimes you need to test different configurations of the algorithm, so you need to run it multiple times. The resulting needed time grows extremely fast, and they run sequentially. This is where the problem starts.

First solution: Bad bash script

The first solution I came up with was a bash script that would run 4 divided instances of the algorithm, each for a subset of the files required, meaning that there would be 4 processes running individually in the server concurrently, dividing the total time by 4. Great!

This had one huge inconvenient, those are 4 individual processes, meaning that there were 0 communication between them, therefore, if one of them failed, the others would continue running, and you would have to check which one failed and run it again. This is not a big problem if you need to do it once, but as you need to do it twice a day it becomes exhausting. Also, one of the processes might take the big chunky files and take 10 times more than the others, so you would have to wait for that one to finish, without any kind of load balancing among the different cores. Finally, the last nail in the coffin for this solution was that, I hate bash scripts longer than 3 lines.

Second solution: Good Java multithreading

First I want to thank Java for how easy it makes concurrency handling.

The second solution came as a response for the load balancing problem, I wanted the files to distribute automatically among the cores, and just around that time I was taking a concurrency class in the university, it came down like a gift.

Java allows to create an object ExecutorService, which is a pool of threads that you can use to run tasks concurrently. You just have to create a task, and then submit it to the ExecutorService, and it will run it in one of the threads. If you submit more tasks than threads, it will queue them and run them when a thread is free. It is easy to execute, it is easy to implement, and it is easy to understand, it is perfect.

With this implementation I started to actually give shape to the template, knowing that I would reuse this code in all my implementations. So I attempted to do it as generic as possible, so I could reuse it in any other project.

The implementation is divided in two parts. The first part is the algorithm interface, which only defines two methods, “run” and “runWithTimeLimit”, each algorithm I implement will have to implement this interface, which will allow me to run any mix of algorithm interchangeably. The second part is the algorithm runner class. ExecutorService automatically handles running your functions for you, but these functions must implement the Runnable interface, which forces you to implement the run method. The algorithm runner class implements this interface, and it is the one that will be submitted to the ExecutorService. Within this class, all the required steps for the algorithm will be executed. Therefore, the final process results as:

An object that implements Algorithm is created.
The object is wrapped within the Algorithm Runner class.
The Algorithm Runner object is submitted to the ExecutorService.

And that’s it, the ExecutorService will run the algorithm in one of the threads automatically.

Further improvements and later

Honestly, at this point I haven’t seen any point of improvement for this part of the template. I have been using it and I had not found any issue yet. Neither I have any idea on how to further expand it.

On the other hand, there are other parts which I have not mentioned and they will be covered in later posts.

#Concurrency #Java #Real-Case