Or why counting the number of projects using the EUPL is “Mission Impossible”
How many projects are covered by the EUPL? More than a year ago, GitHub published an estimate: 15,000 projects among the repositories it hosts. But GitHub (which is based in the United States) is just one, albeit probably the most important, of the many existing software forges. Others include SourceForge, Bitbucket, RubyForge, Tigris.org, BountySource, Launchpad, BerliOS, JavaForge, GNU Savannah and GitLab.
There is no doubt that the EUPL is widely known and used today: any Google search shows it. On forums across the world, developers refer to the EUPL or discuss it without needing to explain to others what it is. It seems that, at least in Europe, a great many software developers, if not almost all of them, know the licence. But can we measure this impact in projects, in numbers?
Even if the inputs were perfect and exhaustive, which they are not, counting licences (which in practice means writing software to count licences across many billions of lines of code worldwide) is extremely difficult and requires many normative choices. These choices need to be discussed and disclosed if we are to draw any accurate conclusions. The problems start with deciding what qualifies as a project to count. What is a project, and what is just a licensed file or document? What if a project includes 200 files or many more, each of them licensed under the EUPL? Do you care whether the code actually works and is mature, whether it has had contributions from more than one person, or whether it is just a beta version, a demo, a draft or simply the first formalisation of an idea? How do you handle duplication? Projects often change code hosting sites without removing their old home. If you are crawling multiple hosts, is your counting code smart enough to tell when two programs are the same? Does a forked or slightly modified version count as a separate program? What about versions of the same program for different operating systems: do you count them separately?
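The duplication question can be made concrete with a small sketch. It assumes, purely for illustration, that each crawled project is available as a mapping from file name to file text; the function names `fingerprint` and `count_distinct` and the example URLs are hypothetical. Two copies count as the same project only if their file contents match once whitespace is normalised, which is itself one of the normative choices discussed above.

```python
import hashlib


def fingerprint(files):
    """Hash a project's contents, ignoring file names, file order and whitespace.

    `files` maps a file name to its text (an assumed input format).
    """
    digests = sorted(
        hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        for text in files.values()
    )
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()


def count_distinct(projects):
    """Count projects with distinct fingerprints; `projects` maps a URL to its files."""
    return len({fingerprint(files) for files in projects.values()})


# Hypothetical crawl results: one project mirrored on two forges, plus a fork.
projects = {
    "https://forge-a.example/tool": {"main.py": "print('hi')\n", "LICENCE": "EUPL-1.2"},
    "https://forge-b.example/tool": {"main.py": "print('hi')", "LICENCE": "EUPL-1.2\n"},
    "https://forge-b.example/tool-fork": {"main.py": "print('hello')\n", "LICENCE": "EUPL-1.2"},
}

print(count_distinct(projects))  # three URLs, two distinct fingerprints
```

Under this rule the mirrored copy is merged, but a fork that changes a single word still counts as a separate project. A looser similarity measure would give a different total, which is why the rule has to be disclosed.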
After you have determined which projects qualify, you have to parse or extract their licensing information. Licence notices are not yet predominantly written in structured, machine-readable formats. Unlike patents, copyright notices have no central register and no prescribed form of publication. They are written by and for humans, with typos and inconsistent formatting that confound automated parsers. This is especially true for the EUPL, where copyright notices could be written in 23 languages. Even if English is undoubtedly number one (the developers' lingua franca), developers who are not native English speakers may word their copyright notices in many different ways. When licences are recognised in copyright notices, there may be several of them. A project covered by the EUPL can contain files carrying other, more permissive licence notices, because it is allowable, and common, to redistribute such files as part of a copyleft work. Conversely, because of the EUPL's compatibility provisions, EUPL-covered code can be merged into code covered by a compatible licence. Does that add one just to the EUPL column, or do you also increment the other licence columns?
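A rough sketch shows how both problems, recognising notices and deciding which column to increment, end up as explicit choices in code. The patterns below are illustrative only and far from exhaustive (an acronym, two English spellings and one French wording); the function names and the two counting policies, `every_licence` and `eupl_first`, are hypothetical labels invented for this example.

```python
import re
from collections import Counter

# Illustrative patterns only: real notices vary far more than this.
PATTERNS = {
    "EUPL": re.compile(
        r"\bEUPL\b"
        r"|European\s+Union\s+Public\s+Licen[cs]e"
        r"|Licence\s+publique\s+de\s+l['’]Union\s+europ[ée]enne",
        re.IGNORECASE,
    ),
    "MIT": re.compile(
        r"\bMIT\s+Licen[cs]e\b|SPDX-License-Identifier:\s*MIT\b",
        re.IGNORECASE,
    ),
}


def detect_licences(notice):
    """Return the set of licence names whose pattern occurs in one notice."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(notice)}


def tally(projects, policy):
    """Count licences over projects; each project is a list of notice texts."""
    counts = Counter()
    for notices in projects:
        found = set().union(*(detect_licences(text) for text in notices))
        if policy == "every_licence":
            counts.update(found)      # a mixed project increments every column
        elif policy == "eupl_first":
            if "EUPL" in found:
                counts["EUPL"] += 1   # a mixed project counts as EUPL only
            else:
                counts.update(found)
        else:
            raise ValueError(f"unknown policy: {policy}")
    return counts


# Hypothetical notices: a mixed project, a French notice, an MIT-only project.
projects = [
    ["SPDX-License-Identifier: EUPL-1.2", "Released under the MIT License"],
    ["Sous licence publique de l'Union européenne v. 1.2"],
    ["Licensed under the MIT license."],
]

print(tally(projects, "every_licence"))
print(tally(projects, "eupl_first"))
```

The same three projects yield two different MIT totals depending on the policy, and any notice worded in a way the patterns do not anticipate is silently missed.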
Once you have decided that a project qualifies, and have figured out how to represent its licence(s), you then have to decide how much weight to give it. Do you care about the size of the codebase? If you don't, then you will count a large package like CIRCA or AT4am as equal to a few lines of code in a small Node.js library. If you do care, then you have to create categories so that you compare apples with apples, and those criteria need to be published for readers to properly understand the results. Do you care about the size of the user base or community? If you don't, you will count a GitHub repository containing someone's personal configuration files, kindly shared under an open source licence but really intended only for personal use, the same as a very large application used by thousands of users or serving as the foundation for billions of euros in economic value. If you do care, then you need to share how you determined the user base or community and how that was incorporated into the result.
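To see how much the weighting choice matters, consider a minimal sketch with two invented projects. The `Project` fields, the scheme names and the logarithmic scaling are all assumptions made for this illustration, not an established method, and the figures are made up rather than measured.

```python
import math
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    lines_of_code: int
    users: int


def weight(project: Project, scheme: str) -> float:
    """Return the weight one project contributes to a licence's total."""
    if scheme == "unweighted":
        return 1.0
    if scheme == "size":
        # Log scaling is an arbitrary choice that dampens very large codebases.
        return math.log10(1 + project.lines_of_code)
    if scheme == "users":
        return math.log10(1 + project.users)
    raise ValueError(f"unknown scheme: {scheme}")


# Invented figures, not measurements of any real project.
projects = [
    Project("large-application", lines_of_code=2_000_000, users=50_000),
    Project("personal-dotfiles", lines_of_code=40, users=1),
]

for scheme in ("unweighted", "size", "users"):
    total = sum(weight(p, scheme) for p in projects)
    shares = {p.name: round(weight(p, scheme) / total, 2) for p in projects}
    print(scheme, shares)
```

Each scheme gives the two projects a different share of the total, so a published figure is only interpretable if the scheme behind it is published too.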
Counting licences used across the entire universe of open source software is not an easy job!