Software Library Security is Scary

Michael Biancardi

Posted March 28, 2024; Last Updated March 28, 2024

Like a lot of developers, I never thought too hard about the security of libraries I use in my code. Surely large libraries like numpy, pandas, and jQuery are reasonably secure; they have the backing of large organizations, after all. Until recently, I hadn't considered the security of smaller libraries. I assumed that if they were available on a platform like pip or npm, then they must be at least somewhat secure. I assumed there was at least a basic vetting process. It turns out this assumption is wrong.

Recently, as part of my master's thesis work, I published an open-source Python package called OpenSTSMiner [Sidenote: My advisor has asked me to make the GitHub repo private and remove the library from PyPI until we publish a paper on this work; this doesn't detract from my findings and observations here, though]. This package implements a spatiotemporal data mining algorithm called Spatiotemporal Sequence Miner, or STS Miner. While the code is on my GitHub, I ideally wanted it available via a package manager like pip so others could easily use it. Plus, being on pip makes your library seem 'official'. I followed the Python packaging tutorial put out by the Python Packaging Authority (PyPA), the same group that makes pip, the most common package manager for Python, which installs packages from the Python Package Index (PyPI). The tutorial I followed ends with uploading your package to TestPyPI, a separate instance of the Python Package Index used for testing and experimentation. Uploading to TestPyPI isn't too complicated. You just create an account, generate an API key, run a command to upload to the server via twine, and specify your API key. By specifying parameters to pip, you can even install your package from TestPyPI. That was pretty cool.

I was curious, though, what it would take to get my library on the official PyPI server. Using TestPyPI, users can't simply "pip install [package]"; the command instead looks like "pip install --index-url https://test.pypi.org/simple/ --no-deps [package]". Moreover, the TestPyPI server isn't designed to be permanent; occasionally, packages get deleted to save space. Luckily, the tutorial I was following contained some information on next steps. As it turns out, uploading to the official PyPI server isn't any more complicated or involved than uploading to TestPyPI. You create an account, generate an API key, upload via twine, and specify your API key. That's it. Your package is then available for installation by using pip like normal.
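
To make the difference between the two install commands concrete, here is a minimal Python sketch that only builds the pip argument lists (it doesn't run them). The helper name pip_install_command and the package name "example-package" are placeholders I made up for illustration; the flags and the TestPyPI URL are the ones quoted above.

```python
import sys

# Index URL for TestPyPI, as quoted in the install command above.
TEST_PYPI_INDEX = "https://test.pypi.org/simple/"


def pip_install_command(package, index_url=None, no_deps=False):
    """Build (but do not run) the argument list for a pip install.

    Hypothetical helper for illustration; it is not part of pip.
    """
    cmd = [sys.executable, "-m", "pip", "install"]
    if index_url is not None:
        # Point pip at a non-default index such as TestPyPI.
        cmd += ["--index-url", index_url]
    if no_deps:
        # Skip dependencies, which may not exist on the test index.
        cmd.append("--no-deps")
    cmd.append(package)
    return cmd


# "example-package" is a placeholder, not a real project I am vouching for.
testpypi_cmd = pip_install_command("example-package", TEST_PYPI_INDEX, no_deps=True)
pypi_cmd = pip_install_command("example-package")

print(" ".join(testpypi_cmd[1:]))
print(" ".join(pypi_cmd[1:]))
```

The only thing separating the "test" install from the "official" one is the index URL. Nothing in the command reflects any extra review of the package.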

Honestly, I expected the process to be more involved. I wasn't expecting an in-depth security audit. But I (mistakenly) assumed someone would look at the library and make sure it's at least semi-legit. But no, there was no such process. My library now has the legitimacy of being on PyPI. To a lot of developers (including myself, up until recently) it would now appear that my library is "official", in some sense. That's scary.

In theory, a library on PyPI could mine for cryptocurrency in the background. It could read someone's files and upload them to a remote server. It could do all kinds of crazy stuff - and there's no check in place to prevent this. Of course, an OS or antivirus may have security controls that mitigate some of this - though an unknowing user could easily be tricked into thinking they're giving a legitimate library reasonable permissions. And sure, someone could audit the code, report it to PyPA, and get the library removed. But who's actually going to do that? How many developers take the time to read through source code before using a library? How much damage could a rogue library do before being caught?
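
Part of why this is possible is that simply importing a Python module runs all of its top-level code. The sketch below is a harmless demonstration of that, not an example of real malware: it writes a throwaway module (the name harmless_looking_lib is made up) into a temporary directory, imports it, and checks that the module quietly created a file as a side effect of the import.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_runs_top_level_code():
    """Import a throwaway module and report whether it had a side effect."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "marker.txt"
        # The module exposes an innocent add() function, but its top-level
        # code also writes a file the moment the module is imported.
        module_src = (
            "from pathlib import Path\n"
            f"Path({str(marker)!r}).write_text('written at import time')\n"
            "def add(a, b):\n"
            "    return a + b\n"
        )
        (Path(tmp) / "harmless_looking_lib.py").write_text(module_src)

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            lib = importlib.import_module("harmless_looking_lib")
            result = lib.add(2, 3)
            side_effect_happened = marker.exists()
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("harmless_looking_lib", None)
        return result, side_effect_happened


result, side_effect_happened = import_runs_top_level_code()
print(result, side_effect_happened)
```

Here the side effect is just a marker file in a temp directory. The point is that the caller only asked for add(), and the file write happened anyway, with whatever permissions the importing process has.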

This problem isn't unique to PyPI/pip, either. Anaconda, another Python package manager, also simply lets you upload a package; see this tutorial. It's similarly easy to upload a JavaScript package to npm; see this freeCodeCamp tutorial. The situation with NuGet in .NET appears to be a bit better; according to Microsoft, packages undergo virus scans and some other validation steps before publication. Overall, this isn't great. So many seemingly "official" sources of libraries do nothing to validate library submissions.

These attacks aren't theoretical, either. Sonatype reported in 2023 that their researchers had found PyPI libraries that acted as Remote Access Trojans (RATs), libraries that downloaded malware, libraries that stole private information, and more. That same year, researchers at ESET Threat Research found 116 malicious packages on PyPI. The reality is that malicious libraries are out there infecting unknowing developers.

To those working in the application security field, this is nothing new. But most developers don't realize the risks associated with installing libraries that haven't been validated. Many developers will blindly install a library that suits their needs - after all, don't reinvent the wheel, right? Until fairly recently, I fell into that same category. But now I know better. From now on, I'll be validating that the libraries I download are legit. For starters, I'm devising a few basic validation steps for myself:

Check who wrote the library. If the library comes from a large company or academic lab, it's probably safe. If it was written by some lone, random person, it's probably best to dig a little deeper and move on to the next steps.
Search the library online. If someone has identified the library as problematic, I want to know.
Scan the package (and dependencies) for legitimacy. There are open-source tools such as Safety CLI that scan against databases of vulnerabilities and known malicious packages. There are no guarantees here, but it's probably better than nothing.
Skim through the library's code. This probably isn't necessary all the time, but it doesn't hurt to glance through the code for a library. This doesn't have to be a full audit, but verify there's nothing obviously malicious. Plus, you'll probably get a better understanding of how the library works, which can help you as a developer.
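
As a rough aid for the last step, a script can point out where to start reading. The sketch below uses Python's built-in ast module to flag a few constructs that deserve a closer look. The lists of "suspicious" names are my own assumption and nowhere near complete, and this is not how Safety CLI or any other real scanner works. A hit only means "read this part", and no hits does not mean the code is safe.

```python
import ast

# Assumed, deliberately small lists of things worth a second look.
SUSPICIOUS_CALLS = {"eval", "exec", "compile", "__import__"}
SUSPICIOUS_MODULES = {"socket", "subprocess", "ctypes", "urllib", "requests"}


def flag_suspicious(source, filename="<unknown>"):
    """Return sorted (line number, message) pairs for constructs to review."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    findings.append((node.lineno, f"imports {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in SUSPICIOUS_MODULES:
                findings.append((node.lineno, f"imports from {node.module}"))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only catches direct calls such as eval(...), not aliased ones.
            if node.func.id in SUSPICIOUS_CALLS:
                findings.append((node.lineno, f"calls {node.func.id}()"))
    return sorted(findings)


# A made-up snippet standing in for a file from a downloaded package.
sample = (
    "import socket\n"
    "from subprocess import run\n"
    "def helper(x):\n"
    "    return eval(x)\n"
)

findings = flag_suspicious(sample)
for lineno, message in findings:
    print(f"line {lineno}: {message}")
```

Plenty of legitimate libraries import subprocess or open sockets, so expect false positives. Obfuscated code can also dodge a check this simple, which is why it supplements skimming the code rather than replacing it.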

If you're an actual company, you'll want a more in-depth validation process than this. And you'll probably want to formalize that process in more detail. But for individual developers looking to be more security-conscious, I think these steps provide a good starting point. General security practices also apply (e.g., don't store sensitive information in plaintext on your desktop). There's certainly more that can be done, such as running the library in a VM or auditing the code. As an individual developer, though, you need to balance security with convenience and productivity. Testing a random library in a VM on the off-chance that it's malicious may be more effort than it's worth. That's a judgment call you'll have to make.

So, in conclusion, most software libraries are not validated in any way. Just because they're on an "official" platform like PyPI or npm doesn't mean they're legit - there's usually no one validating before a library is published. Random libraries you install could very well be malware. If you're a developer, be careful about the libraries you use. Don't trust a library just because it's available on your favorite package manager. And listen to your infosec coworkers: There's a reason they don't want you downloading random software without some validation process.