Faster computer results without fear of errors | MIT News

Researchers have pioneered a technique that can dramatically accelerate certain types of computer programs automatically, while ensuring that program results remain accurate.

Their system boosts the speed of programs that run in the Unix shell, a ubiquitous programming environment created 50 years ago that is still widely used today. Their method parallelizes these programs, which means that it splits program components into pieces that can be run simultaneously on multiple computer processors.
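To make the idea of splitting a component into simultaneously running pieces concrete, here is a minimal sketch in Python. It is not PaSh code: the function names are invented for illustration, the word-counting stage merely stands in for a shell command such as `wc -w`, and a thread pool is used only to keep the example short (a process pool would be the more realistic choice for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split a list of lines into at most n_chunks contiguous pieces."""
    if not lines:
        return []
    size = -(-len(lines) // n_chunks)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words(lines):
    """Stand-in for a shell stage such as `wc -w`: count words in some lines."""
    return sum(len(line.split()) for line in lines)


def parallel_count_words(lines, n_workers=4):
    """Count words chunk by chunk on a worker pool, then merge by summing."""
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)


text = ["the quick brown fox", "jumps over", "the lazy dog", "", "again and again"]
sequential = count_words(text)
parallel = parallel_count_words(text, n_workers=3)
```

Both paths give the same count because word counts over separate chunks can simply be added together; that merge step is what makes this particular stage safe to split.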

This enables programs to perform tasks such as web indexing, natural language processing, or data analysis in a fraction of their original runtime.

“There are so many people who use these kinds of programs, such as data scientists, biologists, engineers, and economists. Now they can automatically accelerate their programs without fear of getting the wrong results,” says Nikos Vasilakis, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.

The system also makes life easier for the programmers who develop the tools that data scientists, biologists, engineers, and others use. They do not need to make any special adjustments to their program commands to enable this automatic, error-free parallelization, adds Vasilakis, who chairs a committee of researchers from around the world who have been working on this system for nearly two years.

Vasilakis is senior author of the group’s latest research paper, which includes MIT co-author and CSAIL graduate student Tammam Mustafa and will be presented at the USENIX Symposium on Operating Systems Design and Implementation. Co-authors include lead author Konstantinos Kallas, a graduate student at the University of Pennsylvania; Jan Bielak, a student at Warsaw Staszic High School; Dimitris Karnikis, a software engineer at Aarno Labs; Thurston H.Y. Dang, a former MIT postdoctoral fellow who is now a software engineer at Google; and Michael Greenberg, an associate professor of computer science at the Stevens Institute of Technology.

A decades-old problem

This new system, known as PaSh, focuses on programs, or scripts, that run in the Unix shell. A script is a sequence of commands that instructs a computer to perform a calculation. Correct and automatic parallelization of shell scripts is a thorny problem that researchers have grappled with for decades.

The Unix shell remains popular, in part because it is the only programming environment that allows one script to be composed of functions written in several programming languages. Different programming languages are better suited to specific tasks or types of data; if a programmer uses the right language, solving a problem can be much easier.

“People also enjoy developing in different programming languages, so composing all these components into a single program is something that happens very often,” Vasilakis adds.

While the Unix shell enables multilanguage scripts, its flexible and dynamic structure makes these scripts difficult to parallelize using traditional methods.

Parallelizing a program is usually tricky because some parts of the program depend on others. This determines the order in which components must run; get the order wrong and the program fails.
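A toy illustration of such a dependency, written as a hedged Python sketch: two steps share an in-memory dictionary that plays the role of the file system, and the second step reads what the first one writes. The step names and the dictionary layout are assumptions made for this example, not anything taken from PaSh.

```python
def sort_step(files):
    """Like `sort input > sorted`: reads 'input', writes 'sorted'."""
    files["sorted"] = sorted(files["input"])


def head_step(files):
    """Like `head -n 1 sorted > first`: reads 'sorted', writes 'first'."""
    files["first"] = files["sorted"][:1]


def run(steps, files):
    """Run steps one after another against a toy in-memory file system."""
    files = dict(files)  # work on a copy so the caller's dict is not modified
    for step in steps:
        step(files)
    return files


initial = {"input": ["pear", "apple", "fig"]}

# Correct order: head_step finds the 'sorted' entry that sort_step produced.
right_order = run([sort_step, head_step], initial)

# Wrong order: head_step runs before 'sorted' exists, so the lookup fails.
try:
    run([head_step, sort_step], initial)
    wrong_order_failed = False
except KeyError:
    wrong_order_failed = True
```

Any tool that reorders or overlaps these two steps has to know that the second one consumes the first one's output, which is exactly the information that is hard to obtain for arbitrary shell commands.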

When a program is written in a single language, developers have explicit information about its features and the language that helps them determine which components can be parallelized. But those tools do not exist for scripts in the Unix shell. Users cannot easily see what is happening inside the components or extract information that would aid in parallelization.

A just-in-time solution

To overcome this problem, PaSh uses a preprocessing step that inserts simple annotations onto program components that it thinks could be parallelizable. Then PaSh attempts to parallelize those parts of the script while the program is running, at the exact moment it reaches each component.

This avoids another problem with shell programming – it is impossible to predict the behavior of a program in advance.

By parallelizing program components “just in time,” the system avoids this problem. It is able to effectively speed up many more components than traditional methods that try to perform parallelization in advance.

Just-in-time parallelization also ensures that the accelerated program still returns accurate results. If PaSh arrives at a program component that cannot be parallelized (perhaps it depends on a component that has not yet run), it simply runs the original version and avoids causing an error.
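The decide-at-runtime-and-fall-back behavior described above can be sketched in a few lines of Python. Everything here is hypothetical: the `ANNOTATIONS` table, the stage names, and `run_stage` are invented for illustration and do not reflect PaSh's actual annotation format or interface. The sketch only shows the control flow: when a stage is reached, check whether it is marked safe to split; if so, run it on chunks in parallel, otherwise run the original stage unchanged.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical annotation table: is a stage safe to split by input lines?
ANNOTATIONS = {"upper": True, "number": False}

# Toy stages standing in for shell commands (e.g. `tr a-z A-Z` and `cat -n`).
STAGES = {
    "upper": lambda lines: [line.upper() for line in lines],
    "number": lambda lines: [f"{i} {line}" for i, line in enumerate(lines, 1)],
}


def run_stage(name, lines, n_workers=2):
    """Decide at the moment the stage is reached how to run it.

    Returns (output, mode), where mode is 'parallel' or 'original'.
    """
    stage = STAGES[name]
    # Fall back to the unmodified stage when it is not annotated as safe,
    # or when the input is too small to be worth splitting.
    if not ANNOTATIONS.get(name, False) or len(lines) < n_workers:
        return stage(lines), "original"
    size = -(-len(lines) // n_workers)  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(stage, chunks))  # map preserves chunk order
    return [item for part in parts for item in part], "parallel"


upper_out, upper_mode = run_stage("upper", ["a", "b", "c"])
number_out, number_mode = run_stage("number", ["a", "b", "c"])
```

The line-numbering stage is left alone because numbering each chunk separately would restart the count in every chunk; running the original is slower but cannot produce a wrong answer.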

“No matter the performance benefits – if you promise to make something run in a second instead of a year – if there is any chance of returning incorrect results, no one is going to use your method,” Vasilakis says.

Users do not need to make any modifications to use PaSh; they can simply add the tool to their existing Unix shell and tell their scripts to use it.

Acceleration and accuracy

The researchers tested PaSh on hundreds of scripts, from classic to modern programs, and it did not break a single one. The system was able to run programs six times faster, on average, compared to unparallelized scripts, and it achieved a maximum speedup of nearly 34 times.

It also boosted the speed of scripts that other approaches were not able to parallelize.

“Our system is the first to show this kind of fully correct transformation, but there is an indirect benefit, too. The way our system is designed allows other researchers and users in industry to build on top of this work,” Vasilakis says.

He is excited to get further feedback from users and see how they enhance the system. The open-source project joined the Linux Foundation last year, making it widely available to users in industry and academia.

Going forward, Vasilakis wants to use PaSh to tackle the problem of distribution: splitting a program to run on multiple computers, rather than multiple processors within a single computer. He is also looking to improve the annotation scheme so that it is more user-friendly and can better describe complex program components.

“Unix shell scripts play a key role in data analytics and software tasks. These scripts could run faster by making the diverse programs they invoke use the multiple processing units available in modern CPUs. However, the dynamic nature of the shell makes it difficult to come up with parallel execution plans in advance,” says Diomidis Spinellis, a professor of software engineering at Athens University of Economics and Business and a professor of software analysis at Delft Technical University, who was not involved with the research. “Through just-in-time analysis, PaSh-JIT manages to conquer the dynamic complexity of the shell and thus reduces script execution times while maintaining the correctness of the corresponding results.”

“As a drop-in replacement for an ordinary shell that orchestrates steps but does not reorder or split them, PaSh provides a no-hassle way to improve the performance of big data-processing jobs,” adds Douglas McIlroy, an adjunct professor in the Department of Computer Science at Dartmouth College, who previously led the Computing Techniques Research Department at Bell Laboratories (which was the birthplace of the Unix operating system). “Hand optimization to exploit parallelism must be done at a level for which ordinary programming languages (including shells) do not offer clean abstractions. The resulting code intermingles matters of logic and efficiency. It is hard to read and hard to maintain in the face of evolving requirements. PaSh cleverly steps in at this level, preserving the original logic on the surface while achieving efficiency when the program is run.”

This work has been supported, in part, by the Defense Advanced Research Projects Agency and the National Science Foundation.
