# Table Linker Utility Commands

This document describes the utility commands for the Table Linker (`tl`) system.
Table of Contents:

- `build-elasticsearch-input`: builds a JSON lines file and a mapping file to be loaded into Elasticsearch from a KGTK edge file.
- `load-elasticsearch-index`: loads a JSON lines file into an Elasticsearch index.
- `convert-iswc-gt`: converts an ISWC Ground Truth file to a TL Ground Truth file.
- `metrics`: computes the precision, recall and F1 score for the `tl` pipeline.
Options:

- `-e, --examples`: print some examples and exit.
- `-h, --help`: print this help message and exit.
- `-v, --version`: print the version info and exit.
- `--url {url}`: URL of the Elasticsearch server.
- `--index {name}`: name of the Elasticsearch index.
- `-U {user id}`: the user id for authenticating to the Elasticsearch index.
- `-P {password}`: the password for authenticating to the Elasticsearch index.
## `build-elasticsearch-input`

Builds a JSON lines file and a mapping file to be loaded into Elasticsearch from a KGTK edge file.
This command takes as input an edge file in KGTK format, which must be sorted by `node1`
so that the script can generate the JSON file for the index in a streaming fashion.
The command builds an index of labels and aliases. Extra information in the form of
`key#value` pairs can be stored for retrieval purposes only. The command will also index
the pagerank of `node1` if it is available in the input KGTK edge file.
Options:

- `--input-file {path}`: input KGTK edge file, sorted by `node1`.
- `--output-file {path}`: output JSON lines file, to be loaded into Elasticsearch.
- `--label-properties {a,b,...}`: names of the properties whose values are labels for `node1`.
- `--mapping-file {path}`: path where a mapping file for the Elasticsearch index will be written.
- `--alias-properties {a,b,...}`: names of the properties whose values are aliases for `node1`. Optional, no default.
- `--description-properties {a,b,...}`: names of the properties whose values are descriptions for `node1`. Optional, no default.
- `--pagerank-properties {a,b,...}`: names of the properties whose values are the pagerank of `node1`. Optional, no default.
- `--blacklist-file {path}`: path to a blacklist file; nodes listed in it are ignored in the output. Optional.
- `--extra-information {True|False}`: whether to store extra information about `node1`. Default: False.
- `--add-text {True|False}`: add a text field to the JSON document containing all text in the labels, aliases and descriptions. Default: False.
Example:

Consider the following three-stooges edge file in KGTK format.
It uses two properties to define labels for nodes, `preflabel` and `label`,
and one property to define aliases, `alias`.
```
node1  label      node2
N1     isa        Person
N1     preflabel  "Moe"
N1     label      "'Moeh'@fr"
N2     alias      "Lawrence|Lorenzo"
N2     isa        Person
N2     preflabel  "Larry"
N3     isa        Person
N3     preflabel  "Curly"
```
The following command builds a JSON lines file and a mapping file using the properties
`preflabel` and `label` to define the labels and `alias` to define the aliases of nodes.
```
$ tl build-elasticsearch-input --label-properties preflabel,label --alias-properties alias \
  --mapping-file nodes_mapping.json --input-file nodes.tsv --output-file nodes.jl
```
This command maps the nodes as follows:

- N1: labels: "Moe", "Moeh"
- N2: labels: "Larry"; aliases: "Lawrence", "Lorenzo"
- N3: labels: "Curly"
The following command builds a JSON lines file and a mapping file using only the property
`label` to define the labels and `alias` to define the aliases of nodes.

```
$ tl build-elasticsearch-input --label-properties label --alias-properties alias \
  --mapping-file nodes_mapping.json --input-file nodes.tsv --output-file nodes.jl
```
This command maps the nodes as follows:

- N1: labels: "Moeh"
- N2: aliases: "Lawrence", "Lorenzo"
- N3: no labels or aliases, since `preflabel` is not listed in `--label-properties`
### Implementation
The algorithm uses the properties listed in the `--label-properties` option to collect the set of strings to be indexed. The following cleaning operations are performed on each value:

- When the value contains `|`-separated values, it is split into multiple phrases, and each one is indexed separately. For example, the value `'Curly'@en|'Moe'@sp` is split into the set containing `'Curly'@en` and `'Moe'@sp`.
- If a value contains a language tag, e.g. `'Curly'@en`, the language tag is dropped and the value becomes `Curly`.

The sets of values for all of the label properties are collected into one set and indexed as the labels of the node. A similar operation is performed for all values of the properties listed in the `--alias-properties` option.
The command follows these steps (see the sketch below):

1. Since the Elasticsearch document format is JSON, convert the input KGTK file to JSON documents with the following fields:
   - `id`: the identifier for the node, computed from the column `node1` in the input KGTK file.
   - `labels`: the list of values collected using the `--label-properties` option.
   - `aliases`: the list of values collected using the `--alias-properties` option.
2. Build a mapping file as defined in the next section.
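For illustration, here is a minimal sketch of the value cleaning and document building described above. It is not the actual `tl` implementation; the function names and the simplified handling of quoting are assumptions.

```python
import json

def clean_values(raw):
    """Split |-separated values and drop language tags,
    e.g. "'Curly'@en|'Moe'@sp" -> {"Curly", "Moe"}."""
    cleaned = set()
    for phrase in raw.split("|"):
        phrase = phrase.strip()
        if "@" in phrase:                 # language-tagged string, e.g. 'Curly'@en
            phrase = phrase.rsplit("@", 1)[0]
        cleaned.add(phrase.strip("'\""))  # drop surrounding quote characters
    return cleaned

def build_documents(rows, label_props, alias_props):
    """Turn a node1-sorted stream of (node1, label, node2) edges into one
    JSON-serializable document per node, in a streaming fashion."""
    current, labels, aliases = None, set(), set()
    for node1, prop, value in rows:
        if node1 != current:
            if current is not None:
                yield {"id": current, "labels": sorted(labels), "aliases": sorted(aliases)}
            current, labels, aliases = node1, set(), set()
        if prop in label_props:
            labels |= clean_values(value)
        elif prop in alias_props:
            aliases |= clean_values(value)
    if current is not None:
        yield {"id": current, "labels": sorted(labels), "aliases": sorted(aliases)}

# Write one JSON document per line (the "JSON lines" output), e.g.:
# for doc in build_documents(rows, {"preflabel", "label"}, {"alias"}):
#     out_file.write(json.dumps(doc) + "\n")
```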
### Elasticsearch Index Mapping

The mapping of the fields `id`, `labels` and `aliases` stored in the Elasticsearch index is as follows:

- `id`: indexed with the default Elasticsearch analyzer
- `id.keyword`: stored as is, for exact matches
- `labels`: indexed with the default Elasticsearch analyzer
- `labels.keyword`: stored as is, for exact matches
- `labels.keyword_lower`: stored lowercased, for exact matches
- `aliases`: indexed with the default Elasticsearch analyzer
- `aliases.keyword`: stored as is, for exact matches
- `aliases.keyword_lower`: stored lowercased, for exact matches

The mapping file is a JSON document.
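A minimal sketch of what such a mapping could look like, assuming a recent Elasticsearch version; the `lowercase_normalizer` name and the `settings` block are assumptions rather than the actual `tl` mapping file:

```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": { "type": "custom", "filter": ["lowercase"] }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "labels": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" },
          "keyword_lower": { "type": "keyword", "normalizer": "lowercase_normalizer" }
        }
      },
      "aliases": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" },
          "keyword_lower": { "type": "keyword", "normalizer": "lowercase_normalizer" }
        }
      }
    }
  }
}
```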
## `load-elasticsearch-index`

Loads a JSON lines file into an Elasticsearch index.
Options:

- `--mapping {path}`: the mapping file used to create a custom mapping for the Elasticsearch index.
Examples:

```
# load the file docs.jl into the Elasticsearch index docs_1, creating the index
# first using the specified docs_1_mapping.json
$ tl -U smith -P my_pwd --url http://bah.com --index docs_1 load-elasticsearch-index \
  --mapping docs_1_mapping.json docs.jl

# same as above, but don't create the index using the mapping file
$ tl -U smith -P my_pwd --url http://bah.com --index docs_1 load-elasticsearch-index docs.jl
```
### Implementation

This command has the following steps (see the sketch below):

1. Check whether the index to be created already exists.
   - If the index exists, do nothing and move to the next step.
   - If the index does not exist, create it first, with the mapping file if one is specified and with the default mapping otherwise. Then move to the next step.
2. Batch-load the documents into the Elasticsearch index.
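A minimal sketch of these steps using the `elasticsearch` Python client; the function name and the choice of `id` as the document id are assumptions, not the actual `tl` code:

```python
import json
from elasticsearch import Elasticsearch, helpers

def load_jsonlines_index(es_url, index_name, docs_path, mapping_path=None):
    es = Elasticsearch([es_url])
    # Step 1: create the index only if it does not already exist,
    # using the mapping file when one is given.
    if not es.indices.exists(index=index_name):
        body = None
        if mapping_path:
            with open(mapping_path) as f:
                body = json.load(f)
        es.indices.create(index=index_name, body=body)
    # Step 2: batch-load the documents with the bulk helper.
    def actions():
        with open(docs_path) as f:
            for line in f:
                doc = json.loads(line)
                yield {"_index": index_name, "_id": doc["id"], "_source": doc}
    helpers.bulk(es, actions())
```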
## `convert-iswc-gt`

Converts an ISWC Ground Truth (GT) file to a TL Ground Truth file. This is a one-time operation, listed here for completeness.
Options:

- `-d {path}`: output directory where the files in TL GT format will be created.
This command uses the following constants:

- `dbpedia_sparql_url`: `"http://dbpedia.org/sparql"`
- `elasticsearch_url`: `"http://kg2018a.isi.edu:9200"`
- `elasticsearch_index`: `"wikidata_dbpedia_joined_3"`
- `wikidata_sparql_url`: `"http://dsbox02.isi.edu:8888/bigdata/namespace/wdq/sparql"`
Examples:

```
$ tl convert-iswc-gt -d my-output-path iswc_gt.csv
```
File example:

```
# consider the ISWC GT file
$ cat iswc_gt.csv
v15_1 1 5 http://dbpedia.org/resource/Sachin_Tendulkar http://dbpedia.org/resource/Sachin_r_Tendulkar
v15_1 2 5 http://dbpedia.org/resource/Virat_Kohli
v15_1 3 5 http://dbpedia.org/resource/Rishabh_Pant
v15_1 4 5 http://dbpedia.org/resource/Ishant_Sharma
v15_3 0 1 http://dbpedia.org/resource/Royal_Challengers_Bangalore
v15_3 1 1 http://dbpedia.org/resource/Mumbai_Indians

$ tl convert-iswc-gt -d ../o_path iswc_gt.csv
$ cat ../o_path/*csv

v15_1.csv
column row kg_id
1 5 Q9488
2 5 Q213854
3 5 Q21622311
4 5 Q3522062

v15_3.csv
column row kg_id
0 1 Q1156897
1 1 Q1195237
```
### Implementation

The ISWC GT files have four columns with no column headers. The columns, in order, are:

- `file name`: name of the input file to which the current row's GT KG id belongs.
- `column`: zero-based column index in the input file.
- `row`: zero-based row index in the input file.
- `dbpedia urls`: a space-separated string of DBpedia URLs, the correct URLs linking the input cell.
This command has the following steps, in order (see the sketch after this list):

1. Split the DBpedia URLs in the ISWC ground truth file on space.
2. For each of the DBpedia URLs, do the following:
   - Run a term query against the Elasticsearch index `wikidata_dbpedia_joined_3` on the field `dbpedia_urls.keyword`. If a match is found, use the Qnode from the returned document and record the DBpedia-to-Qnode mapping. If there is no match, move to the next step.
   - Run a SPARQL query against the DBpedia SPARQL endpoint to fetch the relevant Qnode. This query gets the `<http://www.w3.org/2002/07/owl#sameAs>` links for the DBpedia URL, filtering in the Wikidata Qnodes. If a match is found, use the Qnode from the returned result and record the DBpedia-to-Qnode mapping. If there is no match, move to the next step. Example query:

     ```
     SELECT ?item ?qnode WHERE {
       VALUES (?item) { (<http://dbpedia.org/resource/Virat_Kohli>) }
       ?item <http://www.w3.org/2002/07/owl#sameAs> ?qnode .
       FILTER (SUBSTR(str(?qnode), 1, 24) = "http://www.wikidata.org/")
     }
     ```

   - Convert the DBpedia URL to a Wikipedia URL by replacing `http://dbpedia.org/resource/` with `https://en.wikipedia.org/wiki/`, then run a SPARQL query against the Wikidata SPARQL endpoint. If a match is found, use the Qnode from the returned result and record the DBpedia-to-Qnode mapping. Example query:

     ```
     SELECT ?item ?article WHERE {
       VALUES (?article) { (<https://en.wikipedia.org/wiki/Virat_Kohli>) }
       ?article schema:about ?item .
     }
     ```

3. Output a file named after the `file name` column of the ISWC file, in the output directory specified by the `-d` option. The output file has the following columns:
   - `column`: the column index from the ISWC GT file.
   - `row`: the row index from the ISWC GT file.
   - `kg_id`: a `|`-separated string of Qnodes corresponding to the DBpedia URLs.

If the mapping from a DBpedia URL to a Qnode is not found, that row is deleted from the TL GT file.
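As an illustration, the `owl:sameAs` fallback (the second lookup above) might look like the following with the `SPARQLWrapper` library; the function name is hypothetical:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

dbpedia_sparql_url = "http://dbpedia.org/sparql"

def dbpedia_to_qnode(dbpedia_url):
    """Ask DBpedia for owl:sameAs links pointing into Wikidata and
    return the Qnode identifier, or None if there is no match."""
    sparql = SPARQLWrapper(dbpedia_sparql_url)
    sparql.setQuery("""
        SELECT ?qnode WHERE {
          VALUES (?item) { (<%s>) }
          ?item <http://www.w3.org/2002/07/owl#sameAs> ?qnode .
          FILTER (SUBSTR(str(?qnode), 1, 24) = "http://www.wikidata.org/")
        }""" % dbpedia_url)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    if bindings:
        # e.g. "http://www.wikidata.org/entity/Q213854" -> "Q213854"
        return bindings[0]["qnode"]["value"].rsplit("/", 1)[-1]
    return None
```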
## `metrics`

Computes the precision, recall and F1 score for the `tl` pipeline. It takes as input an Evaluation File and outputs a file in the Metrics File format.
Options:

- `-c a`: column name with ranking scores.
- `-k {number}`: default k=1; recall is calculated considering candidates with rank up to k.
- `--tag`: a tag to use in the output file to identify the results of running the given pipeline.
Examples:

```
$ tl metrics -c ranking_score < cities_evaluation.csv > cities_metrics.csv

# same as above, but calculate recall at 5
$ tl metrics -c ranking_score -k 5 < cities_evaluation.csv > cities_metrics.csv
```
### Implementation

Discard the rows with `evaluation_label=0`. Sort all the candidates for an input cell by ranking score, breaking ties alphabetically. If the top-ranked candidate has `evaluation_label=1`, it is counted as a true positive; otherwise it is counted as a false positive. Then compute the precision, recall and F1 score.
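For reference, these are the standard definitions, assuming false negatives (which the text above does not spell out) are the ground-truth cells for which no true positive is retrieved:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$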