I downloaded the metadata for every package on PyPi and this is what I found out

Chris Brookes
May 3, 2020

I recently presented a lightning talk to some colleagues on how to port Python 2 code to Python 3. Writing code that supports Python 3 is especially important now that Python 2 reached its end of support on 01/01/2020. While writing the presentation I looked for statistics on how many Python packages now support Python 3, and how many of those have dropped support for Python 2. After scouring the internet I wasn’t able to find anything concrete, so I thought I would answer the question for the rest of the world.

PyPi.org is the main Python Package repository and is maintained by the Python Packaging Authority. There are many other public mirrors and it’s common for large organisations to host their own, but PyPi tends to be the source of truth when it comes to Python libraries. This is likely the best place to get an answer to my question.

One common way to denote support of Python versions is via classifier strings. However, after inspecting the source code of pip, I found that no attention is paid to these values, and they appear to be purely cosmetic. Other ways include the supported Python version in the filename of a .whl file (set by package metadata), and an under-utilised requires_python field, which a package sets through the python_requires argument in its setup.py. It became clear that there was no uniform way for a package to declare its support for Python versions.

The web interface for PyPi does not allow any advanced querying. I also found that it provides a very limited API. There are three types: simple, JSON and XML-RPC, none of which provide a flexible query language beyond classifiers. Getting an answer to my question was going to require digging a bit deeper, so I decided to go with the JSON API and download the metadata for every package into an SQLite database that I could then query. After a coding session I arrived at this. Two hours, a large number of API requests and a 300MB database file later, the script had finished running and I had a local copy of all the package metadata. Apologies to anyone trying to install a package during that time.
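The download step can be sketched roughly as follows. This is a minimal illustration rather than the script used for this post: it assumes PyPi's per-package JSON endpoint (https://pypi.org/pypi/<name>/json), and the table layout and the function names fetch_metadata and store_packages are made up for the example.

```python
import json
import sqlite3
import urllib.request

# PyPi's per-package JSON endpoint; {name} is the project name.
JSON_URL = "https://pypi.org/pypi/{name}/json"


def fetch_metadata(name):
    """Download and decode the JSON metadata document for one package."""
    with urllib.request.urlopen(JSON_URL.format(name=name)) as response:
        return json.load(response)


def store_packages(conn, names, fetch=fetch_metadata):
    """Fetch each package and store its raw JSON document in SQLite.

    The single-table schema is an assumption made for this sketch.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package (name TEXT PRIMARY KEY, metadata TEXT)"
    )
    for name in names:
        try:
            document = fetch(name)
        except OSError:
            # urllib's HTTP and network errors derive from OSError;
            # skip packages that fail to download (e.g. HTTP 404).
            continue
        conn.execute(
            "INSERT OR REPLACE INTO package VALUES (?, ?)",
            (name, json.dumps(document)),
        )
    conn.commit()
```

Calling store_packages(sqlite3.connect("pypi.db"), names) with a list of project names would populate the database. The fetch parameter exists so the download can be swapped out, for example to add rate limiting or to test without touching the network.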

At the time of this experiment (May 3rd 2020), there were 227,069 packages published to PyPi. Alongside well-maintained, actively developed packages, PyPi has its fair share of stale projects and other junk. For this exercise, any package that had not had a release in the last four years was considered stale and unmaintained, and was excluded from the results below. This brings the total number of packages down to 179,171.
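As an illustration, the four-year cut-off could be expressed as a query like the one below. The release table with package and upload_time (ISO 8601 text) columns is a hypothetical schema chosen for this sketch, not necessarily how the data was actually stored.

```python
import sqlite3


def active_packages(conn, as_of="2020-05-03", years=4):
    """Return packages whose newest release falls within the last `years` years.

    Assumes a hypothetical table release(package, upload_time) where
    upload_time is ISO 8601 text, so string comparison follows date order.
    """
    rows = conn.execute(
        """
        SELECT package
        FROM release
        GROUP BY package
        HAVING MAX(upload_time) >= date(?, ?)
        ORDER BY package
        """,
        (as_of, f"-{int(years)} years"),
    )
    return [row[0] for row in rows]
```

Grouping by package and testing the newest upload means a project with a very old first release still counts as active if it has published anything since the cut-off.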

Package Versioning

As mentioned above, classifier strings are the most common way to denote the Python versions a package supports, so I started there. It’s worth noting that of the active packages on PyPi, 49,193 (27%) have no classifiers at all. Running various queries against the classifiers yielded the following results:

Beyond the 27% of packages that do not use classifiers, a further 21,319 packages have no classifiers that relate to Python version support, bringing the total to 70,512, or 39% of all active Python packages on PyPi. However, from the data above we can see that the industry has already made a heavy shift towards Python 3, with at least 56% of packages claiming they support Python 3. 17% have maintained support for Python 2, but 38% exclusively support Python 3.
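A classifier check along these lines can be sketched as below. It relies on the standard "Programming Language :: Python :: ..." classifier strings; the function name and the returned labels are illustrative choices, not the exact queries behind the numbers above.

```python
PREFIX = "Programming Language :: Python :: "


def python_support(classifiers):
    """Summarise which Python major versions a classifier list claims."""
    majors = set()
    for classifier in classifiers:
        if not classifier.startswith(PREFIX):
            continue
        # "3.7" -> "3.7", "3 :: Only" -> "3", "Implementation :: CPython" -> "Implementation"
        version = classifier[len(PREFIX):].split(" :: ")[0].strip()
        if version[:1] in ("2", "3"):
            majors.add(version[0])
    if majors == {"2", "3"}:
        return "both"
    if majors == {"3"}:
        return "python3-only"
    if majors == {"2"}:
        return "python2-only"
    return "unclassified"
```

Packages that fall into "unclassified" here correspond to the group with no version-related classifiers, which need another signal to be classified.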

The next step was to take the 70,512 packages that could not be classified by the method above and look for the python_version in the metadata of the package's latest release. This field is populated to indicate the Python version support for all package formats (e.g. .whl, .msi, .egg, .exe, .dmg files), except for source code distribution files (sdist). This accounts for a further 29,160 packages.
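One way to read that field is sketched below. It assumes each release file is a dict with packagetype and python_version keys, as in the JSON API's file entries, and that values look like "py2.py3", "cp37" or "2.7"; the parsing rules are a simplification written for this example and will not cover every value found in the wild.

```python
import re

# Wheel-style tags such as "py3" or "cp37", and plain versions such as "2.7".
TAG = re.compile(r"^(?:py|cp|pp|ip|jy)([23])\d*$")
VERSION = re.compile(r"^([23])(?:\.\d+)*$")


def majors_from_files(files):
    """Collect the Python major versions named by a release's built files."""
    majors = set()
    for file_info in files:
        if file_info.get("packagetype") == "sdist":
            continue  # source distributions carry no usable version here
        value = file_info.get("python_version") or ""
        plain = VERSION.match(value)
        if plain:
            majors.add(int(plain.group(1)))
            continue
        # Compound tags like "py2.py3" are split and checked part by part.
        for part in value.split("."):
            tagged = TAG.match(part)
            if tagged:
                majors.add(int(tagged.group(1)))
    return majors
```

An empty result means the release offers no version hint beyond its source distribution, which is the case for the packages left over after this step.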

This then leaves 41,352 packages that were not classified. 1,418 of these populated the requires_python field. However, this is a free-text field and parsing it is error prone. For such a small number of packages across a large data set, I decided not to use it as a method of classification.

The final numbers:

Other Insights

Downloading the metadata into a format that can be queried presented an opportunity to find other insights into the Python world. Below are a few more insights drawn from the data obtained from PyPi.

Actively developed packages

Number of packages that have had a release in the last:

Licensing

As the license field in a setup.py is free-text, the data showed a myriad of inconsistencies, spelling mistakes and formats that made it hard to quantify reliably. For a rudimentary view of the license distribution of packages I decided to go for a keyword-based approach.
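A keyword-based approach of this kind might look like the sketch below. The license families and patterns are assumptions picked for illustration, not the exact keywords used for the figures in this post.

```python
import re
from collections import Counter

# Checked in order, so LGPL and AGPL are tested before the broader GPL pattern.
LICENSE_PATTERNS = [
    ("MIT", r"\bmit\b"),
    ("Apache", r"\bapache\b"),
    ("BSD", r"\bbsd\b"),
    ("LGPL", r"\blgpl|lesser general public"),
    ("AGPL", r"\bagpl|affero"),
    ("GPL", r"\bgpl|general public"),
    ("MPL", r"\bmpl|mozilla"),
]


def classify_license(text):
    """Map a free-text license string to a coarse license family."""
    lowered = (text or "").lower()
    for label, pattern in LICENSE_PATTERNS:
        if re.search(pattern, lowered):
            return label
    return "Other"


def license_distribution(license_fields):
    """Count packages per license family."""
    return Counter(classify_license(text) for text in license_fields)
```

Word boundaries keep "mit" from matching inside words such as "permitted", while a pattern like \bgpl still catches variants such as "GPLv3". Anything unmatched, including empty fields, falls into "Other".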

Beyond the main licenses listed above I also came across a few humorous ones:

  • Do whatever you want, don't blame me
  • DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE Version 2, December 2004
  • <open file 'LICENSE', mode 'r' at 0x7f911d29f5d0>
  • Buy snare a beer

Top Authors

The author is a textual field and the table below does not take into account inconsistencies in spelling and format, but here are the top 15 people/organisations that author the most packages published to PyPi:
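A ranking like this can be produced with a simple grouped count. The sketch below assumes a hypothetical package_info table with name and author columns, and like the ranking itself it does no normalisation of author names.

```python
import sqlite3


def top_authors(conn, limit=15):
    """Return (author, package_count) pairs, most prolific first.

    Assumes a hypothetical table package_info(name, author).
    Blank and missing authors are ignored; ties are ordered by name.
    """
    return conn.execute(
        """
        SELECT author, COUNT(*) AS packages
        FROM package_info
        WHERE author IS NOT NULL AND TRIM(author) != ''
        GROUP BY author
        ORDER BY packages DESC, author
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```

Because the grouping is on the raw text, "Jane Doe" and "jane doe" would be counted as two different authors.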

Conclusion

To answer the initial question, we were able to classify the supported Python version for a total of 137,819 active Python packages on PyPi (77%). Of those, 123,686 (69% of all active packages) supported Python 3, and 6708 (21%) supported both Python versions. I expect Python 2’s recent end of support will trigger a larger shift to Python 3 and the results will look very different one year from now.

This proved to be a very long-winded project for a question that I initially thought would be trivial to answer. However, it has shed some light on the secret world of Python packages and will allow for the tracking of trends in the Python industry going forward. I have productionised the code (to some degree) for anyone else who wishes to run their own queries; you can find it here. I have also uploaded the data and queries used in this blog here.

Originally published at https://chris-brookes.blogspot.com on May 3, 2020.

Chris Brookes

‘Just a tech’. Python & C# Developer. Blog of random technical thoughts that hopefully somebody, somewhere will find useful or interesting one day.