Table Linker Utility Commands

This document describes the utility commands for the Table Linker (tl) system.

Usage: tl [OPTIONS] COMMAND

Options:

  • -e, --examples -- Print some examples and exit
  • -h, --help -- Print this help message and exit
  • -v, --version -- Print the version info and exit
  • --url {url}: URL of the Elasticsearch server.
  • --index {name}: name of the Elasticsearch index.
  • -U {user id}: the user id for authenticating to the Elasticsearch index.
  • -P {password}: the password for authenticating to the Elasticsearch index.

Utility commands for tl

build-elasticsearch-input

Builds a JSON lines file and a mapping file to be loaded into Elasticsearch from a KGTK edge file. This command takes as input an Edges file in KGTK format, which must be sorted by node1 so that the script can generate the JSON file for the index in a streaming fashion.

This command builds an index of labels and aliases. Extra information in the form of key#value can be stored for retrieval purposes only. This command will also index the pagerank of node1 if available in the input KGTK edge file.

Options:

  • --input-file {path}: input KGTK edge file, sorted by node1.
  • --output-file {path}: output JSON lines file, to be loaded into Elasticsearch.
  • --label-properties {a,b,...}: comma-separated names of the properties whose values are labels for node1.
  • --mapping-file {path}: path where a mapping file for the Elasticsearch index will be output.
  • --alias-properties {a,b,...}: comma-separated names of the properties whose values are aliases for node1. Optional, no default.
  • --description-properties {a,b,...}: comma-separated names of the properties whose values are descriptions for node1. Optional, no default.
  • --pagerank-properties {a,b,...}: comma-separated names of the properties whose values are pagerank scores for node1. Optional, no default.
  • --blacklist-file {path}: path to a blacklist file; nodes listed in it will be ignored in the output. Optional.
  • --extra-information {True|False}: whether to store extra information about node1. Default False.
  • --add-text {True|False}: add a text field to the JSON which contains all text in labels, aliases and descriptions. Default False.

Example:

Consider the following three-stooges Edges file in KGTK format. It uses two properties to define labels for nodes, preflabel and label, and one property to define aliases, alias.

node1 label     node2
N1    isa       Person
N1    preflabel "Moe"
N1    label     "'Moeh'@fr"
N2    alias     "Lawrence|Lorenzo"
N2    isa       Person
N2    preflabel "Larry"
N3    isa       Person
N3    preflabel "Curly"

The following command will build a JSON lines file and a mapping file, using the properties preflabel and label to define the labels and alias to define the aliases of nodes.

$ tl build-elasticsearch-input --label-properties preflabel,label --alias-properties alias \
  --mapping-file nodes_mapping.json --input-file nodes.tsv --output-file nodes.jl

This command will map nodes as follows:

  • N1: labels: "Moe", "Moeh"
  • N2: labels: "Larry", aliases: "Lawrence", "Lorenzo"
  • N3: labels: "Curly"
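For this first example, the JSON lines output would look roughly as follows (a sketch based on the document fields described under Implementation below; optional fields such as pagerank are omitted):

{"id": "N1", "labels": ["Moe", "Moeh"], "aliases": []}
{"id": "N2", "labels": ["Larry"], "aliases": ["Lawrence", "Lorenzo"]}
{"id": "N3", "labels": ["Curly"], "aliases": []}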

The following command will build a JSON lines file and a mapping file, using only the property label to define the labels and alias to define the aliases of nodes.

$ tl build-elasticsearch-input --label-properties label --alias-properties alias \
  --mapping-file nodes_mapping.json --input-file nodes.tsv --output-file nodes.jl

This command will map nodes as follows:

  • N1: labels: "Moeh"
  • N2: aliases: "Lawrence", "Lorenzo"
  • N3: (no labels or aliases, as N3 only has a preflabel)

Implementation

The algorithm uses the properties listed in the --label-properties option to collect the set of strings to be indexed. The following cleaning operations are performed on each value:

  • When the value contains |-separated values, these will be split into multiple phrases, and each one will be indexed separately. For example, if the value is "'Curly'@en|'Moe'@sp", it is split into the set containing "'Curly'@en" and "'Moe'@sp".
  • If a value contains a language tag, e.g., "'Curly'@en", the language tag is dropped and the value becomes "Curly".
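A minimal Python sketch of this cleaning step (the helper name is hypothetical, and the actual implementation may differ):

import re

def clean_value(value):
    """Split a value on '|' and drop language tags, e.g. 'Curly'@en -> Curly."""
    phrases = set()
    for phrase in value.split("|"):
        match = re.match(r"^'(.*)'@[\w-]+$", phrase)
        # Keep the quoted text if a language tag is present, else the raw phrase.
        phrases.add(match.group(1) if match else phrase.strip('"'))
    return phrases

# clean_value("'Curly'@en|'Moe'@sp") returns {'Curly', 'Moe'}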

The set of all values for each of the properties specified in the --label-properties option is collected into one set and indexed as the labels of the node. A similar operation is performed for the values of the properties in the --alias-properties option.

The command follows these steps:

  • The Elasticsearch document format is JSON, so convert the input KGTK file to JSON documents with the following fields:
    • id: the identifier for the node. This is computed from the column node1 in the input KGTK file.
    • labels: a list of values collected from the properties specified in the --label-properties option.
    • aliases: a list of values collected from the properties specified in the --alias-properties option.
  • Build a mapping file as defined in the next section.

Elasticsearch Index Mapping

The mapping of the fields id, labels and aliases stored in the Elasticsearch index is as follows:

  • id: stored with the default Elasticsearch analyzer
  • id.keyword: stored as is for exact matches
  • labels: stored with the default Elasticsearch analyzer
  • labels.keyword: stored as is for exact matches
  • labels.keyword_lower: stored lowercase for exact matches
  • aliases: stored with the default Elasticsearch analyzer
  • aliases.keyword: stored as is for exact matches
  • aliases.keyword_lower: stored lowercase for exact matches

The mapping file is a JSON document. A sketch of its structure is shown below.
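For illustration, a sketch of what such a mapping might contain for the labels field (the normalizer name and exact settings here are assumptions, not necessarily the tool's actual mapping file):

{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
      }
    }
  },
  "mappings": {
    "properties": {
      "labels": {
        "type": "text",
        "fields": {
          "keyword": {"type": "keyword"},
          "keyword_lower": {"type": "keyword", "normalizer": "lowercase_normalizer"}
        }
      }
    }
  }
}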

load-elasticsearch-index

Loads a JSON lines file into an Elasticsearch index.

Options:

  • --mapping {path}: path to the mapping file used to create a custom mapping for the Elasticsearch index.

Examples:

# load the file docs.jl to the Elasticsearch index docs_1, creating the index first using the specified docs_1_mapping.json
$ tl -U smith -P my_pwd --url http://bah.com --index docs_1 load-elasticsearch-index \
--mapping docs_1_mapping.json docs.jl

# same as above, but don't create the index using the mapping file
$ tl -U smith -P my_pwd --url http://bah.com --index docs_1 load-elasticsearch-index docs.jl

Implementation

This command has the following steps:

  • Check if the index to be created already exists.
    • If the index exists, do nothing and move to the next step.
    • If the index does not exist, create it first with the mapping file, if specified, otherwise with the default mapping. Then move to the next step.
  • Batch load the documents into the Elasticsearch index.
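A minimal Python sketch of these steps, assuming the elasticsearch-py 7.x client (an illustration, not the tool's actual code):

import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://bah.com:9200"], http_auth=("smith", "my_pwd"))

# Create the index with the custom mapping only if it does not already exist.
if not es.indices.exists(index="docs_1"):
    with open("docs_1_mapping.json") as f:
        es.indices.create(index="docs_1", body=json.load(f))

# Batch load the JSON lines documents.
with open("docs.jl") as f:
    actions = ({"_index": "docs_1", "_source": json.loads(line)} for line in f)
    helpers.bulk(es, actions)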

convert-iswc-gt

Converts an ISWC Ground Truth (GT) file to a TL Ground Truth file. This is a one-time operation, listed here for completeness.

Options:

  • -d {path}: output directory where the files in TL GT format will be created

This command relies on a few predefined constants, such as the Elasticsearch index and SPARQL endpoints described under Implementation below.

Examples

$ tl convert-iswc-gt -d my-output-path iswc_gt.csv

File Example:

# consider the ISWC GT File
$ cat iswc_gt.csv

v15_1   1   5   http://dbpedia.org/resource/Sachin_Tendulkar http://dbpedia.org/resource/Sachin_r_Tendulkar 
v15_1   2   5   http://dbpedia.org/resource/Virat_Kohli
v15_1   3   5   http://dbpedia.org/resource/Rishabh_Pant
v15_1   4   5   http://dbpedia.org/resource/Ishant_Sharma
v15_3   0   1   http://dbpedia.org/resource/Royal_Challengers_Bangalore
v15_3   1   1   http://dbpedia.org/resource/Mumbai_Indians

$ tl convert-iswc-gt -d ../o_path iswc_gt.csv
$ cat ../o_path/*csv 

v15_1.csv
column  row     kg_id
1       5       Q9488
2       5       Q213854 
3       5       Q21622311
4       5       Q3522062

v15_3.csv
column  row     kg_id
0       1       Q1156897
1       1       Q1195237

Implementation

The ISWC GT files have four columns with no column headers. The columns, in order, are:

  • file name: name of the input file for which the current row has the ground truth KG id
  • column: zero-based column index in the input file
  • row: zero-based row index in the input file
  • dbpedia urls: a space-separated string of DBpedia URLs that are the correct links for the input cell
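Such a file can be read, for instance, like this (a sketch assuming the file is tab-separated; the column names are made up for illustration):

import pandas as pd

# The ISWC GT file has no header row; assign names matching the description above.
gt = pd.read_csv("iswc_gt.csv", sep="\t", header=None,
                 names=["file", "column", "row", "dbpedia_urls"])

# First conversion step: split the space-separated DBpedia URLs.
gt["dbpedia_urls"] = gt["dbpedia_urls"].str.split(" ")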

This command has the following steps in order:

  • split the dbpedia urls in the ISWC ground truth file by space

  • for each of the DBpedia URLs, do the following:

    • run a term query against the Elasticsearch index wikidata_dbpedia_joined_3 on the field dbpedia_urls.keyword. If a match is found, use the Qnode from the returned document and record the DBpedia to Qnode mapping. If there is no match, move to the next step.
    • run a SPARQL query against the DBpedia SPARQL endpoint to fetch the relevant Qnode. This query gets the <http://www.w3.org/2002/07/owl#sameAs> links for the DBpedia URL, keeping only the Wikidata Qnodes. If a match is found, use the Qnode from the returned result and record the DBpedia to Qnode mapping. If there is no match, move to the next step. Example query,
    select ?item ?qnode where {
       VALUES (?item) { (<http://dbpedia.org/resource/Virat_Kohli>) }
       ?item <http://www.w3.org/2002/07/owl#sameAs> ?qnode .
       FILTER (SUBSTR(str(?qnode), 1, 24) = "http://www.wikidata.org/")
    }

    • convert the DBpedia URL to a Wikipedia URL by replacing http://dbpedia.org/resource/ with https://en.wikipedia.org/wiki/, then run a SPARQL query against the Wikidata SPARQL endpoint. If a match is found, use the Qnode from the returned result and record the DBpedia to Qnode mapping. If there is no match, move to the next step. Example query,

    SELECT ?item ?article WHERE {
        VALUES (?article) { (<https://en.wikipedia.org/wiki/Virat_Kohli>) }
        ?article schema:about ?item .
    }
  • output a file, named after the file name column of the ISWC file, in the output directory specified by the option -d. The output file has the following columns:

    • column: the column index from the ISWC GT file.
    • row: the row index from the ISWC GT file.
    • kg_id: a |-separated string of Qnodes, corresponding to the dbpedia urls.

If no mapping from a DBpedia URL to a Qnode is found, that row is deleted from the TL GT file. A sketch of the SPARQL fallback step is shown below.
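A compact Python sketch of the DBpedia SPARQL fallback, using the SPARQLWrapper library (an illustration with an assumed helper name, not the tool's actual code):

from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_to_qnode(dbpedia_url):
    """Resolve a DBpedia URL to a Wikidata Qnode via owl:sameAs links."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT ?qnode WHERE {{
            <{dbpedia_url}> <http://www.w3.org/2002/07/owl#sameAs> ?qnode .
            FILTER (SUBSTR(str(?qnode), 1, 24) = "http://www.wikidata.org/")
        }}""")
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # Strip the entity prefix: http://www.wikidata.org/entity/Q213854 -> Q213854
    return bindings[0]["qnode"]["value"].rsplit("/", 1)[-1] if bindings else None

# dbpedia_to_qnode("http://dbpedia.org/resource/Virat_Kohli") -> "Q213854"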

metrics

Computes the precision, recall and F1 score for the tl pipeline. Takes as input a file in Evaluation File format and outputs a file in Metrics File format.

Options:

  • -c {column}: column name with ranking scores
  • -k {number}: default k=1; recall is calculated considering candidates with rank up to k
  • --tag {tag}: a tag to use in the output file to identify the results of running the given pipeline

Examples:

$ tl metrics -c ranking_score <  cities_evaluation.csv > cities_metrics.csv

# same as above but calculate recall at 5
$ tl metrics -c ranking_score -k 5 <  cities_evaluation.csv > cities_metrics.csv

Implementation

Discard the rows with evaluation_label=0. Sort all the candidates for an input cell by ranking score, breaking ties alphabetically. If the top-ranked candidate has evaluation_label=1, it is counted as a true positive; otherwise, as a false positive.

Then compute precision, recall and F1 score, as sketched below.
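A minimal sketch of the final computation, using the standard definitions (variable names are illustrative):

def compute_metrics(tp, fp, fn):
    """Standard precision / recall / F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}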