Master Data Management using Blockchain technology

Master Data Management for Life Sciences using the Blockchain

Erick Antezana
Rutger Vos, Naturalis Biodiversity Center, Leiden, the Netherlands. ORCiD: 0000-0001-9254-7318

Keywords

master data, ontologies, semantic web, linked data, blockchain, FAIR, data governance, data stewardship, life sciences, ownership

Background

Capital-intensive life science R&D, whether in the red, green, or white domains, has a strong need to trace biological materials such as seeds, samples, tissues, specimens, or plants, not only for R&D purposes but also to comply with regulations when commercializing biotech products. Today, traceable R&D data (needed, for example, to reduce the time it takes to find the origin of a contaminant) is spread over too many databases, which are typically not integrated. These databases record different degrees of detail (e.g. data owner, type of data, status of the data), which complicates querying and further data analysis (e.g. QA, labeling). Moreover, keeping track of biological material and its data from the very beginning (i.e. any form of capture, such as in the lab or in the field) until it reaches the market is highly relevant, as IDs typically change along the way, actions are not well traced, users are not recorded, and so forth. A technology that decentralizes the management of “transactions” is therefore vital. Such a resource should not rely on a central body that manages the “trust” in the information but on the community around it, which should be able to transparently access the data details, including the owners of the data as well as its attributes (e.g. location, degree of completeness, degree of quality, specifications, data ownership, IP, privacy regulations).

What is blockchain?

Blockchain is a distributed technology that allows users to transfer assets (e.g. data) without intermediaries. Each transaction is recorded in a digital ledger that is decentralized and replicated across the chain participants. The ledger keeps transparent, open views (e.g. of the data owner) on each transaction, and each transaction defines a concrete block within the entire chain. Once a transaction has been recorded using cryptographic techniques, it cannot be modified. This is why traceability and provenance are often associated with this technology model.

Figure: blockchain (source: BlockGeeks)
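The link between recording a transaction "using cryptographic techniques" and not being able to modify it can be illustrated with a minimal Python sketch. It is not how any particular blockchain platform is implemented: the function names and the record fields (`owner`, `asset`, `action`) are hypothetical, and there is no network or consensus here, only the hash linking that makes tampering detectable.

```python
import hashlib
import json


def block_hash(index, prev_hash, transaction):
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "transaction": transaction},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, transaction):
    """Append a block that is linked to the current last block."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    chain.append({
        "index": index,
        "prev_hash": prev_hash,
        "transaction": transaction,
        "hash": block_hash(index, prev_hash, transaction),
    })


def is_valid(chain):
    """Recompute every hash and check every link to the previous block."""
    prev_hash = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["prev_hash"], block["transaction"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


chain = []
append_block(chain, {"owner": "lab-A", "asset": "sample-001", "action": "collected"})
append_block(chain, {"owner": "lab-B", "asset": "sample-001", "action": "sequenced"})
print(is_valid(chain))  # True

# Changing an already recorded transaction breaks the stored hash.
chain[0]["transaction"]["owner"] = "someone-else"
print(is_valid(chain))  # False
```

Because every block stores the hash of its predecessor, hiding the change would require recomputing all later blocks on every participant's copy.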

What is a Smart Contract?

Smart contracts are digital contracts, agreed upon by the chain participants, that provide controlled manipulation of the ledger. Consensus is said to be reached when the ledger contents are in sync across all participants' copies of the chain. The participants are the ones who approve the transactions that make it into the chain. Different users might benefit from this technology indirectly and transparently.
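As a rough illustration, a smart contract can be thought of as a rule that every participant evaluates before a transaction is allowed to change the ledger. The Python sketch below is a toy model under stated assumptions: the ownership rule in `transfer_contract`, the `Participant` class and the unanimous approval in `submit` are all hypothetical, and real platforms use far more elaborate consensus protocols.

```python
import hashlib
import json


def transfer_contract(state, tx):
    """Hypothetical contract: only the current owner may transfer an asset.

    An asset without a recorded owner is registered by its first transaction.
    """
    current_owner = state.get(tx["asset"])
    return current_owner is None or current_owner == tx["sender"]


class Participant:
    def __init__(self, name):
        self.name = name
        self.ledger = []   # ordered list of accepted transactions
        self.state = {}    # asset -> current owner

    def approves(self, tx):
        return transfer_contract(self.state, tx)

    def apply(self, tx):
        self.ledger.append(tx)
        self.state[tx["asset"]] = tx["receiver"]

    def digest(self):
        """Fingerprint of this participant's copy of the ledger."""
        data = json.dumps(self.ledger, sort_keys=True).encode("utf-8")
        return hashlib.sha256(data).hexdigest()


def submit(participants, tx):
    """Record tx on every copy only if every participant approves it."""
    if all(p.approves(tx) for p in participants):
        for p in participants:
            p.apply(tx)
        return True
    return False


def in_consensus(participants):
    """All copies are in sync when their fingerprints are identical."""
    return len({p.digest() for p in participants}) == 1


peers = [Participant(n) for n in ("breeder", "lab", "regulator")]
ok1 = submit(peers, {"asset": "seed-lot-7", "sender": "breeder", "receiver": "lab"})
# The breeder no longer owns the lot, so the contract rejects this transfer.
ok2 = submit(peers, {"asset": "seed-lot-7", "sender": "breeder", "receiver": "regulator"})
print(ok1, ok2, in_consensus(peers))  # True False True
```

The rejected transaction never reaches any copy of the ledger, which is why the copies stay in sync.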

What are cryptocurrencies?

Cryptocurrencies are simply a convention in which the participants agree that the transactions on a ledger represent money. The characteristics of the blockchain that protect against fraud mean that the trust issues that normally exist around banking (which require a lot of government regulation and backing) are addressed algorithmically. Bitcoin is the best-known cryptocurrency, but there are others.

Evaluation as of late 2017

Pros

Empowered users: data scientists have long asked for better control over their assets (data, results, analyses, reports, …) and prefer to have that control themselves instead of trusting a third party.
Timely processing: little or no overhead, as no (or minimal) third-party acknowledgment is needed to confirm a transaction.
Transparent logs: all actions are recorded and no one can alter or delete them
Lower transaction costs due to fewer parties involved in the process
Blockchains make your data F.A.I.R.

Cons

Security has been highlighted as a risk that users would rather not face (bitcoins reportedly worth more than 1 billion US dollars were stolen or hacked in the five years up to late 2017).
Transaction performance could be an issue for large blockchains that are expected to support several actions per second.
Privacy of transactions: people might want to keep their actions (and data assets) not known by others.
No use cases showing real benefit are available
It is not a well-understood technology: minimal effort has been made to make its concepts digestible for newcomers, so many people don't trust it.

Application: bio-blockchain

Exchanging data

The simplest application is perhaps a series of data exchange transactions in which the participants transform (enrich/analyze/filter) data in an agreed-upon manner. This will involve the following steps:

create a decentralized network
query the ledger (source: hyperledger)
update the ledger (source: hyperledger)
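The three steps above can be sketched in plain Python, without any blockchain platform. This is not the Hyperledger API: the `Network` class, its `update` and `query` methods, and the example transformations are hypothetical, and the "network" is just a dictionary holding one ledger copy per member. The sketch only shows how agreed-upon transformations could be recorded so that the history of a dataset can be queried later.

```python
import hashlib
import json


def fingerprint(data):
    """Stable SHA-256 fingerprint of JSON-serializable data."""
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode("utf-8")).hexdigest()


class Network:
    """Step 1: a toy 'network' in which every member holds a copy of the ledger."""

    def __init__(self, members, transformations):
        self.ledgers = {m: [] for m in members}
        self.transformations = transformations  # agreed name -> function

    def update(self, member, name, data):
        """Step 3: apply an agreed transformation and record it on every copy."""
        if member not in self.ledgers:
            raise ValueError(f"unknown member: {member}")
        if name not in self.transformations:
            raise ValueError(f"transformation not agreed upon: {name}")
        result = self.transformations[name](data)
        entry = {"by": member, "transformation": name,
                 "input": fingerprint(data), "output": fingerprint(result)}
        for ledger in self.ledgers.values():
            ledger.append(dict(entry))
        return result

    def query(self, member, output_fingerprint):
        """Step 2: trace how a piece of data was produced, newest step first."""
        history, wanted = [], output_fingerprint
        for entry in reversed(self.ledgers[member]):
            if entry["output"] == wanted:
                history.append(entry)
                wanted = entry["input"]
        return history


net = Network(
    members=["field-station", "lab"],
    transformations={
        "filter": lambda rows: [r for r in rows if r["quality"] >= 0.8],
        "enrich": lambda rows: [dict(r, species="Zea mays") for r in rows],
    },
)
raw = [{"id": "s1", "quality": 0.9}, {"id": "s2", "quality": 0.4}]
filtered = net.update("field-station", "filter", raw)
enriched = net.update("lab", "enrich", filtered)

for step in net.query("lab", fingerprint(enriched)):
    print(step["by"], step["transformation"])
# lab enrich
# field-station filter
```

Only fingerprints of the data are written to the ledger in this sketch; where the data itself is stored is a separate design decision.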

Quantifying knowledge

The proposed model assigns a value to each data asset. This is inspired by the Bitcoin model, in which the currency gets its value from economic markets. Within the life sciences, value is typically associated with the quality of resources, e.g. highly curated databases are of great value because they are the result of a quality assurance process.

Awarding for knowledge

Participants will be rewarded for their contributions. In this way, a digital economic model will run in which "currency" rewards, known as Shō ("award" in Japanese), are given to participants according to their contributions (e.g. the better they curate data, the more rewards they get).
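How contributions translate into Shō is not specified here. Purely as an assumption, one naive possibility is to split a fixed pool of Shō in proportion to the number of curated records, weighted by a quality score between 0 and 1. The function `distribute_sho`, the pool size and the scores below are hypothetical and only illustrate the arithmetic.

```python
from fractions import Fraction


def distribute_sho(pool, contributions):
    """Split `pool` Shō in proportion to quality-weighted contributions.

    `contributions` maps a participant to a list of (records_curated, quality)
    pairs, where quality is a hypothetical curation score between 0 and 1.
    Fractions are used so that the shares add up to the pool exactly.
    """
    scores = {
        who: sum(Fraction(records) * Fraction(str(quality)) for records, quality in items)
        for who, items in contributions.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {who: Fraction(0) for who in scores}
    return {who: pool * score / total for who, score in scores.items()}


rewards = distribute_sho(
    pool=100,
    contributions={
        "curator-1": [(40, 0.9), (10, 0.5)],   # weighted score 36 + 5 = 41
        "curator-2": [(20, 0.95)],             # weighted score 19
    },
)
for who, amount in rewards.items():
    print(who, round(float(amount), 2))
# curator-1 68.33
# curator-2 31.67
```

A scheme like this rewards both volume and quality, but it leaves open who assigns the quality scores and how speculation would be handled.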

SHŌ "With shō coin you will show off"

VIVO is an example of a decentralized system: institutions publish VIVO data simply by adhering to a simple data structure in the form of an ontology. Similar to a distributed ledger, VIVO is a decentralized database that is used to maintain a continuously growing list of records. These records aim to include a comprehensive list of scholarly outputs. Although outputs are often described in a single narrative, the published, reviewed paper, the research has generated many other outputs, which may or may not be recorded. The Research Object (RO) is a container for recording all scholarly outputs associated with a particular research effort: the paper, datasets, software, and notes are bundled in an RO, and the published paper is often only a small fraction of its content. VIVO represents works beyond papers, datasets, software and reports, but more is needed to represent all items that might appear in an RO. Another clear gap lies in identifying the collection of works, traditional and non-traditional, as research objects that can then represent relationships between the works. The richness of relations across ROs, for example time-based relationships and logical precursors, is also unrepresented.

We are designing a system based on blockchain technology to keep the ledger for ROs, from conception through the entire life cycle of the RO. Wikipedia describes blockchain as follows:

“By design, blockchains are inherently resistant to modification of the data. Once recorded, the data in any given block cannot be altered retroactively without the alteration of all subsequent blocks and the collusion of the network. Functionally, a blockchain can serve as an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way. The ledger itself can also be programmed to trigger transactions automatically.”

We argue that the entire value chain for ROs should be kept in a distributed ledger; in this way the RO is preserved, and modifications of and transactions over the RO are kept in the ledger. Researchers are thus able to account for their products in a data-based ecosystem that makes it possible for third parties to develop specialized tools over the ROs, researchers and transactions. The blockchain also preserves the metadata associated with transactions and modifications of ROs; the data is portable, and once on the ledger, the researcher and the ROs do not depend on any single node for their existence. The physical existence of the objects is left to specialized apps, e.g. GitHub, figshare, DSpace, Dryad, etc., which use protocols that are part of our design to ensure provenance and traceability, thus asserting the existence of the object. Institutions are suppliers of metadata to the ledgers of their people. VIVO is enhanced as a portal: it becomes an integral tool for researchers to increase the value of the ledger, and it provides tools for reaping value from the distributed collection of ledgers. In this way, the value of the commons is safely kept. Others may enhance the value of the commons by mining and contributing to the commons, by certifying elements of the commons, and by providing additional means for making use of the commons. The distributed ledger makes it possible to account for all the value that is currently unaccounted for. In this context, the tragedy of the commons is not the depletion of resources; it is the generation of uncountable additional resources.
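The protocols that assert the existence of an object are not detailed here. As one possible reading, and purely as an assumption, the ledger could hold only metadata and a checksum for each output, while the object itself stays in an external repository; anyone retrieving the object can then check it against the ledger. The class and field names and the `example.org` URL in this Python sketch are placeholders.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Output:
    """Metadata for one scholarly output; the object itself lives elsewhere."""
    kind: str        # e.g. "paper", "dataset", "software", "notes"
    location: str    # where the object is hosted (placeholder URL below)
    sha256: str      # checksum recorded when the output was registered


@dataclass
class ResearchObject:
    identifier: str
    outputs: list = field(default_factory=list)

    def register(self, kind, location, content):
        """Record metadata and a checksum for `content` (bytes)."""
        entry = Output(kind, location, hashlib.sha256(content).hexdigest())
        self.outputs.append(entry)
        return entry


def matches(entry, retrieved):
    """Check that bytes retrieved from `entry.location` are the registered ones."""
    return hashlib.sha256(retrieved).hexdigest() == entry.sha256


ro = ResearchObject("ro-0001")
dataset = ro.register(
    "dataset", "https://example.org/ro-0001/counts.csv", b"gene,count\nabc1,12\n"
)
print(matches(dataset, b"gene,count\nabc1,12\n"))  # True
print(matches(dataset, b"gene,count\nabc1,13\n"))  # False
```

Keeping only checksums on the ledger keeps the ledger small and leaves storage to the repositories, at the cost of depending on them to keep the objects available.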

Architecture

Figure: blockchain architecture

Open source blockchain code bases

Hyperledger

"Hyperledger is an open source collaborative effort created to advance cross-industry blockchain technologies. It is a global collaboration, hosted by The Linux Foundation, including leaders in finance, banking, Internet of Things, supply chains, manufacturing and Technology. That is what Hyperledger is about – communities of software developers building blockchain frameworks and platforms."

Ethereum

"Ethereum is a decentralized platform that runs smart contracts: applications that run exactly as programmed without any possibility of downtime, censorship, fraud or third party interference. These apps run on a custom built blockchain, an enormously powerful shared global infrastructure that can move value around and represent the ownership of property. This enables developers to create markets, store registries of debts or promises, move funds in accordance with instructions given long in the past (like a will or a futures contract) and many other things that have not been invented yet, all without a middle man or counterparty risk."

Storj

"We are passionate about decentralization, and we love free and open-source software. Our mission is to rethink cloud storage, to provide the security, privacy, and transparency it’s missing. That's why we are building an open-source cloud platform, that aim to fundamentally change the way people and devices own data."

Sia

"Sia splits apart, encrypts, and distributes your files across a decentralized network. Since you hold the keys, you own your data. No outside company can access or control your files, unlike traditional cloud storage providers. Using the Sia blockchain, Sia creates a decentralized storage marketplace in which hosts compete for your business – this leads to the lowest possible prices. Renters pay using Siacoin, which can also be mined and traded."

Tested during the BH17

Hyperledger Fabric (https://hyperledger-fabric.readthedocs.io)
Easy to install
Very good documentation (e.g. tutorials, videos, examples), although none of the examples is from the life sciences :-(
Examples simulating a blockchain network
Supported by The Linux Foundation
The network is implemented in Go; clients can be written in, for example, JavaScript/Node.js

Open questions and not yet tackled issues

What are the characteristics of a simple case where the blockchain is useful for life scientists?
How would a reward strategy work for data assets?
How to deal with speculation on data assets?
Trends (e.g. new technologies, new discoveries) will influence the way this "knowledge currency" evolves over time. Which variables should be taken into account?
Should there be a global governing body (e.g. analogous to a stock exchange) around this "currency"?

Conclusion

Blockchain technology is nowadays more than a buzzword: it has been attracting communities that are looking for innovative ways to solve problems such as the traceability of data assets. Life sciences R&D is one such community, and it expects the technology to help it manage its data assets better.

Literature

https://www.mendeley.com/community/blockchain4ls/
