The [current format](Wikidata Json Format.md) has probably been patched and added to for several years. There is a lot of redundancy and reading it sequentially is quite hard.
This format suggestion tries to remedy both the redundancy and the difficulty of reading the file sequentially.
The basic premise of the file stays the same: one line per item.
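As a rough illustration of why the one-line-per-item layout matters for sequential reading, here is a minimal Python sketch. It assumes the dump is a JSON array whose opening and closing brackets sit on their own lines and whose entries each end with a comma; `iter_entities` and the in-memory sample are hypothetical names and data used only for this example.

```python
import io
import json
from typing import Iterator, TextIO


def iter_entities(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed entity per line, skipping the enclosing array brackets."""
    for line in lines:
        # Drop surrounding whitespace and the comma that separates array entries.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


# Tiny in-memory stand-in for a dump file (assumed layout, made-up content).
sample = io.StringIO('[\n{"id": "Q26"},\n{"id": "P31"}\n]\n')
ids = [entity["id"] for entity in iter_entities(sample)]
```

Because each entity is parsed on its own, only one line has to be held in memory at a time.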
The first property in each and every entry is the `type` property, which states whether the entry represents an `item` or a `property`.
This is redundant, since the `id` property always starts with `Q` for items and `P` for properties.
So I suggest that the `type` property is dropped.
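A consumer that still needs the entry type could recover it from the id. The following is a small Python sketch of that idea; the function name `entity_type` is hypothetical, and it only covers the two prefixes mentioned here.

```python
def entity_type(entity_id: str) -> str:
    """Derive the entry type from the first letter of its id."""
    prefixes = {"Q": "item", "P": "property"}
    try:
        return prefixes[entity_id[0]]
    except (KeyError, IndexError):
        # Unknown prefix or empty id: refuse to guess.
        raise ValueError(f"unrecognised entity id: {entity_id!r}") from None


kinds = [entity_type("Q26"), entity_type("P31")]
```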
Some easy space savings come from the labels, descriptions and aliases.
They make up a substantial share of many items, so shrinking them makes the dump faster to create and also much faster to read and process, with less memory used.
A few hand-optimized examples suggest a space saving of about 30% from optimizing these three sections alone.
The `labels` section is an object with a property for each language for which a label exists:
Original format:
"labels": {
"en-gb": {
"language": "en-gb",
"value": "Northern Ireland"
},...
New format:
"labels": {
"en-gb": "Northern Ireland",
...
}
or, keyed by the item ID of the language instead of the language code:
"labels": {
"Q7979": "Northern Ireland",
...
}
Since there is only one value per language it makes no sense to have an object for each.
The original, compressed version:
"en-gb":{"language":"en-gb","value":"Northern Ireland"}
takes up 55 characters, whereas the new, compressed version:
"en-gb":"Northern Ireland"
takes up only 26 characters!
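The character counts above can be reproduced with a short Python sketch. The helper names `flatten_terms` and `compact` are hypothetical; the sizes are measured on compact JSON with the outer braces stripped, so that only the per-language entry is counted.

```python
import json


def flatten_terms(terms: dict) -> dict:
    """Map {lang: {"language": lang, "value": v}} to {lang: v}."""
    return {lang: entry["value"] for lang, entry in terms.items()}


def compact(obj) -> str:
    """Serialize without optional whitespace, keeping non-ASCII text as-is."""
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


old_labels = {"en-gb": {"language": "en-gb", "value": "Northern Ireland"}}
new_labels = flatten_terms(old_labels)

# Subtract 2 for the enclosing braces of the "labels" object itself.
old_size = len(compact(old_labels)) - 2  # 55
new_size = len(compact(new_labels)) - 2  # 26
```

The same function would work unchanged for descriptions, since they share the one-value-per-language shape.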
The same can be done with descriptions. Original format:
"descriptions": {
"en-gb": {
"language": "en-gb",
"value": "region in north-west Europe, part of the United Kingdom"
},...
New format:
"descriptions": {
"en-gb": "region in north-west Europe, part of the United Kingdom",
...
}
As with labels we could also use Q7979 here.
The compressed versions:
"en-gb":{"language":"en-gb","value":"region in north-west Europe, part of the United Kingdom"}
"en-gb":"region in north-west Europe, part of the United Kingdom"
This goes from a whopping 94 characters to a mere 65.
Aliases are a bit different, since there can be more than one alias per language.
Original version:
"aliases": {
"sco": [
{
"language": "sco",
"value": "Norlin Airlan"
},
{
"language": "sco",
"value": "Norlin Airlann"
}
],...
}
Still, quite a lot can be saved:
"sco": [
"Norlin Airlan",
"Norlin Airlann",
...
]
The corresponding compressed versions:
"sco":[{"language":"sco","value":"Norlin Airlan"},{"language":"sco","value":"Norlin Airlann"}]
"sco":["Norlin Airlan","Norlin Airlann"]
This goes from 94 to a meagre 40 characters for this language. A substantial saving!
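For aliases the conversion keeps the list per language but drops the wrapper objects. A minimal Python sketch, with the hypothetical helper name `flatten_aliases`, that also re-checks the two character counts:

```python
import json


def flatten_aliases(aliases: dict) -> dict:
    """Map {lang: [{"language": lang, "value": v}, ...]} to {lang: [v, ...]}."""
    return {
        lang: [entry["value"] for entry in entries]
        for lang, entries in aliases.items()
    }


old_aliases = {
    "sco": [
        {"language": "sco", "value": "Norlin Airlan"},
        {"language": "sco", "value": "Norlin Airlann"},
    ]
}
new_aliases = flatten_aliases(old_aliases)

# Compact serialization, minus 2 for the enclosing braces of "aliases".
old_size = len(json.dumps(old_aliases, separators=(",", ":"))) - 2  # 94
new_size = len(json.dumps(new_aliases, separators=(",", ":"))) - 2  # 40
```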
Claims can of course be simplified like crazy - here's a first draft idea:
"claims": {
"P31": [
{
"id": string, // unique and looong id for the claim
// No snak object
"snaktype": number, // 0="value", 1="somevalue", 2="novalue"
"type": string, // Note: "type" before "value"
"value": object, // type and layout depending on "type"
"qualifiers": object[], // list of qualifier snaks, optional
"qorder": string[], // optional
"rank": number, // 0="deprecated", 1="normal", 2="preferred"
"refs": object[] // optional
},...
]
}
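To show how a reader might consume the numeric `snaktype` and `rank` codes from this draft, here is a hedged Python sketch. The code-to-name tables follow the comments in the draft; `describe_claim`, the sample claim and its id are hypothetical and only illustrate the lookup, including the defaults for the optional fields.

```python
# Index in each tuple corresponds to the numeric code in the draft.
SNAK_TYPES = ("value", "somevalue", "novalue")
RANKS = ("deprecated", "normal", "preferred")


def describe_claim(claim: dict) -> dict:
    """Expand the numeric codes and fill in defaults for optional fields."""
    return {
        "id": claim["id"],
        "snaktype": SNAK_TYPES[claim["snaktype"]],
        "rank": RANKS[claim["rank"]],
        "qualifiers": claim.get("qualifiers", []),  # optional in the draft
        "refs": claim.get("refs", []),  # optional in the draft
    }


# Made-up claim in the drafted layout, for illustration only.
claim = {
    "id": "hypothetical-claim-id",
    "snaktype": 0,
    "type": "string",
    "value": "example",
    "rank": 1,
}
decoded = describe_claim(claim)
```

Small integer codes keep each claim short in the dump while the names stay recoverable with a fixed lookup table.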