SWHID: Tracking past software for future humans

SWHID: Intrinsic identifier for software artefacts

Published on: 26/09/2023 | Last update: 27/09/2023 | News

While most open source users in public administrations are developing better strategies for the future, there are also initiatives to document and catalogue the past of software. The Software Heritage initiative recently released its unique identifier system, the “SoftWare Hash IDentifier” (SWHID), with the objective of creating a generalised way of tracking software packages and versions for users and developers. Similar to ISBNs for books or ISMNs for music, the SWHID would allow for greater certainty in the use of software packages by giving each artefact a reliable identity that all users can depend on.

The SWHID has been in use for years in the Software Heritage Archive, where it covers billions of artefacts, and its public release has recently been entrusted to a working group. The SWHID specification is openly available and maintained by a core team of experts.

While identifiers for other cultural products are usually external, software, and particularly open source software, is in a different position because it does not depend on any particular system for distribution. A song or a book is usually published with the hope that it will be distributed as-is, but developers of open source software usually encourage the creation of modified versions. The SWHID provides a persistent intrinsic identifier for any software source code artefact. Because the identifier is intimately bound to the object it designates, anyone can use it freely: software artefacts do not need a registry, they only need to follow a standard.

“The core of a SWHID identifier is made up by a prefix, swh (registered with IANA), a schema version, a tag to identify the type of the artefact it denotes (snapshot, release, revision, directory or content), and the cryptographic hash of the corresponding object, computed like for Merkle trees. Given a software artefact, everybody can compute the corresponding SWHID using a standard cryptographic algorithm, and can verify that it has not been modified.”

Description of categories of the SW hash identifier
Source: https://www.softwareheritage.org/2020/07/09/intrinsic-vs-extrinsic-identifiers/
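
To make the structure described in the quote more concrete, here is a minimal Python sketch of how a core SWHID could be built, parsed and verified for a single file (a "content" object). It rests on assumptions that are not stated in this article and are taken from the publicly available SWHID specification: the three-letter tags (snp, rel, rev, dir, cnt), the schema version 1, and the fact that the hash of a content object is the SHA-1 that Git computes for a blob. The function names are hypothetical and chosen only for illustration.

```python
import hashlib
import re

# Assumption: three-letter tags for the five artefact types named in the quote.
TAGS = {
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
    "cnt": "content",
}

# Assumption: core identifier = prefix "swh", schema version 1, tag, 40 hex digits.
SWHID_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})")


def content_swhid(data: bytes) -> str:
    """Compute the identifier of a file's content (hypothetical helper).

    Assumption: the hash is the Git blob hash, i.e. SHA-1 over
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def parse_swhid(swhid: str) -> dict:
    """Split a core identifier into the parts listed in the quote."""
    match = SWHID_RE.fullmatch(swhid)
    if match is None:
        raise ValueError(f"not a core SWHID: {swhid!r}")
    tag, digest = match.groups()
    return {"prefix": "swh", "version": 1, "type": TAGS[tag], "hash": digest}


def verify_content(data: bytes, swhid: str) -> bool:
    """Recompute the identifier from the bytes and compare it with the claim."""
    return content_swhid(data) == swhid
```

Under these assumptions, `content_swhid(b"")` yields `swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the same hash Git assigns to an empty file. Changing a single byte of the input produces a different identifier, which is what lets anyone check that an artefact has not been modified without consulting a registry. Directories, revisions, releases and snapshots are hashed over structured manifests and are not covered by this sketch.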

SWHID is designed as a general-purpose solution. The Software Heritage initiative is using it to document the history of software, but other uses are expected to emerge, such as documenting supply chains or tracking software versions to identify at what point a security vulnerability was introduced. With only a few decades of software use behind us, the challenges and importance of conservation and tracking have already grown significantly. This is even more true now that the software industry is gradually entering an automation phase, where anyone can produce AI-generated code, opening the doors of software development to a larger number of people.