Something fascinating happened in the world of scientific publishing last week: The prestigious journal Nature featured an overview of a 15-year-old programming library for the language Python. The widely popular library, called NumPy, gives Python the ability to perform scientific computing functions.
Asked on Twitter why a paper is coming out now, 15 years after NumPy’s creation, Stefan van der Walt of the University of California at Berkeley’s Institute for Data Science, one of the article’s authors, said that the publication of the article would give long-overdue formal recognition to some of NumPy’s contributors.
Our last paper was ~2010 & not fully representative of the team. While we love that people use our software, many of our team members are in academia where citations count. We hope this will give them the credit needed to receive grant funding & produce more high quality software
— Stefan van der Walt (@stefanvdwalt) September 16, 2020
The paper may be timely in another way. As accomplished as NumPy is in the Python programming world, there are clues in the paper that its future may be even more significant.
NumPy has the prospect of becoming an important piece of computing infrastructure, over and above being a very valuable library.
As the article points out, NumPy has moved beyond its original scope of functions on multidimensional arrays. It has over time acquired aspects of infrastructure. The authors write, “It is no longer a small community project, but core scientific infrastructure.” That’s true in more ways than one. NumPy is not only a very valuable library of functions. It is becoming the center of a constellation of emerging libraries.
To understand why, you must understand the modern utility of NumPy.
Array programming, the heart of NumPy, is especially important in artificial intelligence programming, including machine learning and deep learning. Those computing tasks depend on linear algebra, where manipulation of multi-dimensional arrays, known as tensors, is paramount.
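To make the idea of array programming concrete, here is a minimal sketch of the kind of whole-array operation described above. The variable names and values are invented for illustration and do not come from the Nature paper.

```python
import numpy as np

# A 2-D array (a matrix) and a 1-D array (a vector); values are illustrative only.
weights = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])      # shape (3, 2)
inputs = np.array([10.0, 20.0])      # shape (2,)

# Matrix-vector product, a basic linear-algebra step in machine learning.
outputs = weights @ inputs           # shape (3,)

# A 3-D array: a batch of four 3x2 matrices.
batch = np.ones((4, 3, 2))

# Operations apply to the whole array at once, with no explicit Python loop.
scaled = batch * 0.5                 # every element halved
batch_outputs = batch @ inputs       # one matrix-vector product per matrix, shape (4, 3)

print(outputs)
print(batch_outputs.shape)
```

The same `@` operator handles both the single matrix and the batch of matrices, which is the sort of uniform treatment of multidimensional arrays that AI workloads lean on.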
Each of the AI frameworks, such as TensorFlow and PyTorch, has come up with its own way of handling arrays, in part in response to the proliferation of specialized AI computer chips that operate on tensors in different ways. To stem the potential confusion from that, as the authors write, NumPy has “added the capability to act as a central coordination mechanism with a well-specified API.”
The same familiar NumPy code will send a given array function off to the very specific capabilities of the ever-expanding collection of technologies, things such as Dask, the library that can parallelize arrays to run on distributed systems of multiple computers. Examples of such mechanisms, known as protocols, include something called “NEP 18,” which allows arguments of a function in NumPy to invoke additional functionality outside the scope of what NumPy does.
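NEP 18 is the NumPy Enhancement Proposal that introduced the `__array_function__` protocol. A rough sketch of how it works follows, assuming a NumPy version in which the protocol is enabled by default (1.17 or later). The `LoggedArray` class and the `calls` list are hypothetical, invented here for illustration; real libraries such as Dask implement the hook in far more elaborate ways.

```python
import numpy as np

calls = []  # hypothetical log of which NumPy functions were intercepted


class LoggedArray:
    """Hypothetical array-like type, invented for illustration only."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy hands the call to this hook instead of running its own
        # implementation, because one of the arguments defines it.
        calls.append(func.__name__)
        # Simplification: only top-level positional arguments are unwrapped.
        unwrapped = tuple(
            a.data if isinstance(a, LoggedArray) else a for a in args
        )
        # A real library could send the work to a GPU or a cluster here;
        # this sketch simply falls back to plain NumPy.
        return LoggedArray(func(*unwrapped, **kwargs))


x = LoggedArray([1.0, 2.0, 3.0])
result = np.sum(x)          # ordinary NumPy code, routed through the hook

print(calls)
print(type(result).__name__, float(result.data))
```

The caller writes plain `np.sum(x)`, yet the object passed in decides how the function is carried out. That is the sense in which NumPy's API can act as a coordination layer over other implementations.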
ZDNet reached out to ask the corresponding authors whether NumPy will continue to evolve as a piece of infrastructure.
“Your question is a very good one, and one of the more important ones for where the whole ecosystem goes over the next years,” wrote Ralf Gommers, one of the authors, in an email to ZDNet. Gommers is the director of Quansight Labs, part of Quansight, the Austin, Texas, startup that provides support for open source programs.
Gommers, who emphasized that he was speaking only for his personal view, and not for the NumPy community as a whole, told ZDNet, “I’d say, yes — NumPy is very likely to continue to evolve in this way.”
In fact, Gommers and others are attempting to bring some standardization across the Python landscape for how arrays are handled between these various technologies. They have formed something called the Consortium for Python Data API Standards. The initial blog post describes how there’s a risk of fragmentation as array functions get implemented in dozens of different libraries, from Dask to CuPy to Pandas to PyTorch to Koalas, etc.
As NumPy serves more and more as a coordination mechanism, the NumPy application programming interface may grow in importance and expand beyond its actual implementation. ZDNet asked the authors in the same exchange, “might NumPy become over time a piece of infrastructure that is separable from Python, as a resource that can be used regardless of the programming environment, to support distributed array operations and the like?”
“That is already the case I think,” Gommers told ZDNet, “not only with Xtensor or those other libraries I mentioned but also PyTorch and TensorFlow offering NumPy-like C++ APIs.”
Gommers added, “Really long-term I expect the NumPy ‘execution engine’ (i.e., the C and Python code that does the heavy lifting for fast array operations) to become less and less relevant, and the API to stay around or even grow in universality.”
Gommers makes an analogy to Basic Linear Algebra Subprograms, known by the acronym BLAS, a set of standard routines for operations on vectors and matrices that underlie a lot of scientific computing. While “there are many conforming implementations,” Gommers pointed out, “almost no one uses the official Netlib BLAS anymore to run code.”
Again, Gommers’s comments are not an official view of the NumPy community. But his sense of NumPy’s trajectory thus far suggests that the library may over time become much more than a set of function calls. Perhaps it will become something of a universal framework for interaction of lots of capabilities in the realms of scientific computing, especially in AI.