rustdoc-search: use set ops for ranking and filtering
This commit adds ranking and quick filtering to type-based search,
improving performance and having it order results based on their
type signatures.

Motivation
----------

If I write a query like `str -> String`, a lot of functions come up.
That's to be expected, but `String::from_str` should come up on top, and
it doesn't right now. This is because the sorting algorithm is based
on the function's name, and doesn't consider the type signature at all.
`slice::join` even comes up above it!

To fix this, the sorting should take into account the function's
signature, and the closer match should come up on top.

Guide-level description
-----------------------

When searching by type signature, types with a "closer" match will
show up above types that match less precisely.

Reference-level explanation
---------------------------

Function signature search works in these major phases:

* A compact "fingerprint," based on the [bloom filter] technique, is used to
  check for matches and to estimate the distance. It sometimes has false
  positive matches, but it also operates on 128 bits of contiguous memory and
  requires no backtracking, so it performs a lot better than real
  unification.

  The fingerprint represents the set of items in the type signature, but it
  does not represent nesting, and it ignores when the same item appears more
  than once.

  The result is rejected if any query bits are absent in the function, or
  if the distance is higher than the current maximum and 200
  results have already been found.

* The second step performs unification. This is where nesting and true bag
  semantics are taken into account, and it has no false positives. It uses a
  recursive, backtracking algorithm.

  The result is rejected if any query elements are absent in the function.
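
The fingerprint phase can be sketched roughly as follows. This is a minimal
illustration, not the code in this commit: the helper names `addTypeId`,
`fingerprintOf` and `compareFingerprints` are hypothetical, type ids are
assumed to be plain integers, and nesting is ignored entirely. The hashing
mirrors what the diff does in `buildFunctionTypeFingerprint`.

```javascript
// Sketch only: helper names are hypothetical, hashing mirrors the diff.
function addTypeId(fp, distinct, typeId) {
    const h0 = Math.imul(typeId, 0x9e3779b9);
    const h1 = Math.imul(479001599 ^ typeId, 0x9e3779b9);
    const h2 = Math.imul(433494437 ^ typeId, 0x9e3779b9);
    // `<<` only uses the low five bits of its right operand, so a negative
    // remainder still selects one of the 32 bit positions.
    fp[0] |= 1 << (h0 % 32);
    fp[1] |= 1 << (h1 % 32);
    fp[2] |= 1 << (h2 % 32);
    // The fourth word is not a bit set: it counts distinct items.
    distinct.add(typeId);
    fp[3] = distinct.size;
}

// Build a 128-bit fingerprint (4 x 32 bits) from a flat list of type ids.
function fingerprintOf(typeIds) {
    const fp = new Uint32Array(4);
    const distinct = new Set();
    for (const id of typeIds) {
        addTypeId(fp, distinct, id);
    }
    return fp;
}

// Returns false if some query bit is missing from the function,
// otherwise the function's distinct-item count (the distance).
function compareFingerprints(fnFp, queryFp) {
    for (let i = 0; i < 3; i += 1) {
        if ((queryFp[i] & ~fnFp[i]) !== 0) {
            return false;
        }
    }
    return fnFp[3];
}
```

A function whose item set is a superset of the query always passes this
check. A function that lacks a query item usually fails it, but can slip
through when its other items happen to set the same bits, which is why
unification still has to run afterwards.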

[bloom filter]: https://en.wikipedia.org/wiki/Bloom_filter

Drawbacks
---------

This makes the code bigger.

More than that, this design is a subtle trade-off. It makes the cases I've
tested against measurably faster, but it's not clear how well this extends
to other crates with potentially more functions and fewer types.

The more complex things get, the more important it is to gather a good set
of data to test with (this is arguably more important than the actual
benchmarking infrastructure right now).

Rationale and alternatives
--------------------------

Throwing a bloom filter in front makes it faster.

More than that, it tries to take a tactic where the system can not only check
for potential matches, but also gets an accurate distance function without
needing to do unification. That way it can skip unification even on items
that have the needed elems, as long as they have more items than the
currently found maximum.
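
A rough sketch of that skip, under stated assumptions: candidates carry a
precomputed fingerprint (three 32-bit filter words plus a distinct-item
count), `unify` is a caller-supplied stand-in for the real unification, and
`searchWithCutoff` is a hypothetical name. The commit itself keeps results
in maps and uses a fixed limit of 200; the limit is a parameter here so the
behavior is easy to see on small inputs.

```javascript
// Sketch only: returns false on a definite non-match, else the distance.
function fingerprintDistance(fnFp, queryFp) {
    for (let i = 0; i < 3; i += 1) {
        if ((queryFp[i] & ~fnFp[i]) !== 0) {
            return false;
        }
    }
    return fnFp[3];
}

function searchWithCutoff(candidates, queryFp, unify, maxResults = 200) {
    const results = [];
    let maxDist = 0;
    let unifyCalls = 0;
    for (const cand of candidates) {
        const dist = fingerprintDistance(cand.fp, queryFp);
        // Cheap rejection: a query bit is missing from the function.
        if (dist === false) {
            continue;
        }
        // Cheap rejection: enough results already, and this one is worse
        // than the worst result found so far.
        if (results.length >= maxResults && dist > maxDist) {
            continue;
        }
        // Only now pay for the expensive, exact check.
        unifyCalls += 1;
        if (!unify(cand)) {
            continue;
        }
        maxDist = Math.max(maxDist, dist);
        results.push({ name: cand.name, dist });
    }
    results.sort((a, b) => a.dist - b.dist);
    return { results, unifyCalls };
}
```

Note that the sketch does not truncate the result list; it only shows where
unification gets skipped.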

If I didn't want to be able to cheaply do set operations on the fingerprint,
a [cuckoo filter] is supposed to have better performance.
But the nice bit-banging set intersection doesn't work AFAIK.

I also looked into [minhashing], but since it's actually an unbiased
estimate of the similarity coefficient, I'm not sure how it could be used
to skip unification (I wouldn't know if the estimate was too low or
too high).

This function actually uses the number of distinct items as its
"distance function."
This should give the same results that it would have gotten from a Jaccard
Distance $1-\frac{|F\cap{}Q|}{|F\cup{}Q|}$, while being cheaper to compute.
This is because:

* The function $F$ must be a superset of the query $Q$, so their union is
  just $F$ and their intersection is $Q$, and the distance can be reduced to
  $1-\frac{|Q|}{|F|}$.

* There are no magic thresholds. These values are only being used to
  compare against each other while sorting (and, if 200 results are found,
  to compare with the maximum match). This means we only care if one value
  is bigger than the other, not what its actual value is, and since $Q$ is
  the same for everything, it can be safely left out, reducing the formula
  to $1-\frac{1}{|F|} = \frac{|F|-1}{|F|}$. That expression grows
  monotonically with $|F|$, so, since the values are only being compared
  with each other, $|F|$ itself is fine.
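
As a small sanity check of that argument, the sketch below sorts a few
made-up functions by Jaccard distance and by plain item count and compares
the orderings. The function names and item sets are invented for
illustration, and every set is deliberately a superset of the query, which
is the precondition the argument relies on.

```javascript
// Jaccard distance between two sets: 1 - |F ∩ Q| / |F ∪ Q|.
function jaccardDistance(f, q) {
    let inter = 0;
    for (const x of q) {
        if (f.has(x)) {
            inter += 1;
        }
    }
    const union = f.size + q.size - inter;
    return 1 - inter / union;
}

// Hypothetical data: each function's item set contains the whole query.
const query = new Set(["str", "String"]);
const functions = [
    { name: "two_extra", types: new Set(["str", "String", "Result", "Error"]) },
    { name: "exact", types: new Set(["str", "String"]) },
    { name: "one_extra", types: new Set(["str", "String", "slice"]) },
];

const byJaccard = [...functions]
    .sort((a, b) => jaccardDistance(a.types, query) - jaccardDistance(b.types, query))
    .map(f => f.name);
const bySize = [...functions]
    .sort((a, b) => a.types.size - b.types.size)
    .map(f => f.name);

console.log(byJaccard.join(", "));
console.log(bySize.join(", "));
```

Both lines print the same order, with the exact match first. The shortcut
stops being valid as soon as a function is not a superset of the query,
but such functions are rejected before sorting.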

Prior art
---------

This is significantly different from how Hoogle does it.
It doesn't account for order, and it has no special account for nesting,
though `Box<t>` is still two items, while `t` is only one.

This should give the same results that it would have gotten from a Jaccard
Distance $1-\frac{|A\cap{}B|}{|A\cup{}B|}$, while being cheaper to compute.

Unresolved questions
--------------------

`[]` and `()`, the slice/array and tuple/unit operators, are ignored while
building the signature for the query. This is because they match more than
one thing, making them ambiguous. Unfortunately, this also makes them
a performance cliff. Is this likely to be a problem?

Right now, the system just stashes the type distance into the
same field that Levenshtein distance normally goes in. This means exact
query matches show up on top (for example, if you have a function like
`fn nothing(a: Nothing, b: i32)`, then searching for `nothing` will show it
on top even if there's another function with `fn bar(x: Nothing)` that's
technically a closer match in type signature).

Future possibilities
--------------------

It should be possible to adopt more sorting criteria to act as a tie breaker,
which could be determined during unification.

[cuckoo filter]: https://en.wikipedia.org/wiki/Cuckoo_filter
[minhashing]: https://en.wikipedia.org/wiki/MinHash
notriddle committed Nov 28, 2023
1 parent a5b2de4 commit c56f632
Showing 9 changed files with 315 additions and 72 deletions.
3 changes: 2 additions & 1 deletion src/librustdoc/html/static/js/externs.js
@@ -14,7 +14,7 @@ function initSearch(searchIndex){}
* pathWithoutLast: Array<string>,
* pathLast: string,
* generics: Array<QueryElement>,
* bindings: Map<(string|integer), Array<QueryElement>>,
* bindings: Map<integer, Array<QueryElement>>,
* }}
*/
let QueryElement;
@@ -42,6 +42,7 @@ let ParserState;
* totalElems: number,
* literalSearch: boolean,
* corrections: Array<{from: string, to: integer}>,
* typeFingerprint: Uint32Array,
* }}
*/
let ParsedQuery;
254 changes: 197 additions & 57 deletions src/librustdoc/html/static/js/search.js
@@ -238,6 +238,10 @@ function initSearch(rawSearchIndex) {
* @type {Array<Row>}
*/
let searchIndex;
/**
* @type {Uint32Array}
*/
let functionTypeFingerprint;
let currentResults;
/**
* Map from normalized type names to integers. Used to make type search
@@ -1049,6 +1053,8 @@ function initSearch(rawSearchIndex) {
correction: null,
proposeCorrectionFrom: null,
proposeCorrectionTo: null,
// bloom filter built from type ids
typeFingerprint: new Uint32Array(4),
};
}

@@ -1144,7 +1150,6 @@ function initSearch(rawSearchIndex) {
query.error = err;
return query;
}

if (!query.literalSearch) {
// If there is more than one element in the query, we switch to literalSearch in any
// case.
@@ -1952,8 +1957,7 @@ function initSearch(rawSearchIndex) {
* @param {integer} path_dist
*/
function addIntoResults(results, fullId, id, index, dist, path_dist, maxEditDistance) {
const inBounds = dist <= maxEditDistance || index !== -1;
if (dist === 0 || (!parsedQuery.literalSearch && inBounds)) {
if (dist <= maxEditDistance || index !== -1) {
if (results.has(fullId)) {
const result = results.get(fullId);
if (result.dontValidate || result.dist <= dist) {
@@ -2001,17 +2005,37 @@ function initSearch(rawSearchIndex) {
const fullId = row.id;
const searchWord = searchWords[pos];

const in_args = row.type && row.type.inputs
&& checkIfInList(row.type.inputs, elem, row.type.where_clause);
if (in_args) {
// path_dist is 0 because no parent path information is currently stored
// in the search index
addIntoResults(results_in_args, fullId, pos, -1, 0, 0, maxEditDistance);
}
const returned = row.type && row.type.output
&& checkIfInList(row.type.output, elem, row.type.where_clause);
if (returned) {
addIntoResults(results_returned, fullId, pos, -1, 0, 0, maxEditDistance);
// tfpDist is a minimum possible type distance, where "type distance" is the number of
// atoms in the function not present in the query
const tfpDist = compareTypeFingerprints(
fullId,
parsedQuery.typeFingerprint
);
if (tfpDist !== false &&
!(results_in_args.size >= MAX_RESULTS && tfpDist > results_in_args.max_dist)
) {
const in_args = row.type && row.type.inputs
&& checkIfInList(row.type.inputs, elem, row.type.where_clause);
if (in_args) {
results_in_args.max_dist = Math.max(results_in_args.max_dist || 0, tfpDist);
const maxDist = results_in_args.size < MAX_RESULTS ?
(tfpDist + 1) :
results_in_args.max_dist;
addIntoResults(results_in_args, fullId, pos, -1, tfpDist, 0, maxDist);
}
}
if (tfpDist !== false &&
!(results_returned.size >= MAX_RESULTS && tfpDist > results_returned.max_dist)
) {
const returned = row.type && row.type.output
&& checkIfInList(row.type.output, elem, row.type.where_clause);
if (returned) {
results_returned.max_dist = Math.max(results_returned.max_dist || 0, tfpDist);
const maxDist = results_returned.size < MAX_RESULTS ?
(tfpDist + 1) :
results_returned.max_dist;
addIntoResults(results_returned, fullId, pos, -1, tfpDist, 0, maxDist);
}
}

if (!typePassesFilter(elem.typeFilter, row.ty)) {
@@ -2070,6 +2094,17 @@ function initSearch(rawSearchIndex) {
return;
}

const tfpDist = compareTypeFingerprints(
row.id,
parsedQuery.typeFingerprint
);
if (tfpDist === false) {
return;
}
if (results.size >= MAX_RESULTS && tfpDist > results.max_dist) {
return;
}

// If the result is too "bad", we return false and it ends this search.
if (!unifyFunctionTypes(
row.type.inputs,
@@ -2088,11 +2123,12 @@ function initSearch(rawSearchIndex) {
return;
}

addIntoResults(results, row.id, pos, 0, 0, 0, Number.MAX_VALUE);
results.max_dist = Math.max(results.max_dist || 0, tfpDist);
addIntoResults(results, row.id, pos, 0, tfpDist, 0, Number.MAX_VALUE);
}

function innerRunQuery() {
let elem, i, nSearchWords, in_returned, row;
let elem, i, nSearchWords;

let queryLen = 0;
for (const elem of parsedQuery.elems) {
@@ -2206,50 +2242,30 @@ function initSearch(rawSearchIndex) {
);
}

const fps = new Set();
for (const elem of parsedQuery.elems) {
convertNameToId(elem);
buildFunctionTypeFingerprint(elem, parsedQuery.typeFingerprint, fps);
}
for (const elem of parsedQuery.returned) {
convertNameToId(elem);
buildFunctionTypeFingerprint(elem, parsedQuery.typeFingerprint, fps);
}

if (parsedQuery.foundElems === 1) {
if (parsedQuery.elems.length === 1) {
elem = parsedQuery.elems[0];
for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
// It means we want to check for this element everywhere (in names, args and
// returned).
handleSingleArg(
searchIndex[i],
i,
elem,
results_others,
results_in_args,
results_returned,
maxEditDistance
);
}
} else if (parsedQuery.returned.length === 1) {
// We received one returned argument to check, so looking into returned values.
elem = parsedQuery.returned[0];
for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
row = searchIndex[i];
in_returned = row.type && unifyFunctionTypes(
row.type.output,
parsedQuery.returned,
row.type.where_clause
);
if (in_returned) {
addIntoResults(
results_others,
row.id,
i,
-1,
0,
Number.MAX_VALUE
);
}
}
if (parsedQuery.foundElems === 1 && parsedQuery.returned.length === 0) {
elem = parsedQuery.elems[0];
for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
// It means we want to check for this element everywhere (in names, args and
// returned).
handleSingleArg(
searchIndex[i],
i,
elem,
results_others,
results_in_args,
results_returned,
maxEditDistance
);
}
} else if (parsedQuery.foundElems > 0) {
for (i = 0, nSearchWords = searchWords.length; i < nSearchWords; ++i) {
@@ -2796,6 +2812,95 @@ ${item.displayPath}<span class="${type}">${name}</span>\
};
}

/**
* Type fingerprints allow fast, approximate matching of types.
*
* This algorithm creates a compact representation of the type set using a Bloom filter.
* This fingerprint is used in the following ways:
*
* - It accelerates the matching algorithm by checking the function fingerprint against the
* query fingerprint. If any bits are set in the query but not in the function, it can't
* match.
*
* - The fourth section has the number of distinct items in the set.
* This is the distance function[^1], used for filtering and for sorting.
*
* [^1]: Distance is the relatively naive metric of counting the number of distinct items in
* the function that are not present in the query.
*
* @param {FunctionType|QueryElement} type - a single type
* @param {Uint32Array} output - write the fingerprint to this data structure: uses 128 bits
* @param {Set<number>} fps - Set of distinct items
*/
function buildFunctionTypeFingerprint(type, output, fps) {

const input = type.id;
// `[]` doesn't go in the bloom filter because it's special-cased to match things
// other than itself.
if (input !== null && input !== typeNameIdOfArrayOrSlice) {
// https://docs.rs/rustc-hash/1.1.0/src/rustc_hash/lib.rs.html#60
// Rotate is skipped because we're only doing one cycle anyway.
const h0 = Math.imul(input, 0x9e3779b9);
const h1 = Math.imul(479001599 ^ input, 0x9e3779b9);
const h2 = Math.imul(433494437 ^ input, 0x9e3779b9);
output[0] |= 1 << (h0 % 32);
output[1] |= 1 << (h1 % 32);
output[2] |= 1 << (h2 % 32);
fps.add(input);
}
for (const g of type.generics) {
buildFunctionTypeFingerprint(g, output, fps);
}
const fb = {
id: null,
ty: 0,
generics: [],
bindings: new Map(),
};
for (const [k, v] of type.bindings.entries()) {
fb.id = k;
fb.generics = v;
buildFunctionTypeFingerprint(fb, output, fps);
}
output[3] = fps.size;
}

/**
* Compare the query fingerprint with the function fingerprint.
*
* @param {number} fullId - The function
* @param {Uint32Array} queryFingerprint - The query
* @returns {number|false} - False if non-match, number if distance
* This function might return 0!
*/
function compareTypeFingerprints(fullId, queryFingerprint) {

const fh0 = functionTypeFingerprint[fullId * 4];
const fh1 = functionTypeFingerprint[(fullId * 4) + 1];
const fh2 = functionTypeFingerprint[(fullId * 4) + 2];
const qh0 = queryFingerprint[0];
const qh1 = queryFingerprint[1];
const qh2 = queryFingerprint[2];
// Approximate set intersection with bloom filters.
// This can be larger than reality, not smaller, because hashes have
// the property that if they've got the same value, they hash to the
// same thing. False positives exist, but not false negatives.
const [in0, in1, in2] = [fh0 & qh0, fh1 & qh1, fh2 & qh2];
// Approximate the set of items in the query but not the function.
// This might be smaller than reality, but cannot be bigger.
//
// | in_ | qh_ | XOR | Meaning |
// | --- | --- | --- | ------------------------------------------------ |
// | 0 | 0 | 0 | Not present |
// | 1 | 0 | 1 | IMPOSSIBLE because `in_` is `fh_ & qh_` |
// | 1 | 1 | 0 | If one or both is false positive, false negative |
// | 0 | 1 | 1 | Since in_ has no false negatives, must be real |
if ((in0 ^ qh0) || (in1 ^ qh1) || (in2 ^ qh2)) {
return false;
}
return functionTypeFingerprint[(fullId * 4) + 3];
}

function buildIndex(rawSearchIndex) {
searchIndex = [];
/**
@@ -2815,6 +2920,22 @@ ${item.displayPath}<span class="${type}">${name}</span>\
typeNameIdOfSlice = buildTypeMapIndex("slice");
typeNameIdOfArrayOrSlice = buildTypeMapIndex("[]");

// Function type fingerprints are 128-bit bloom filters that are used to
// estimate the distance between function and query.
// This loop counts the number of items to allocate a fingerprint for.
for (const crate in rawSearchIndex) {
if (!hasOwnPropertyRustdoc(rawSearchIndex, crate)) {
continue;
}
// Each item gets an entry in the fingerprint array, and the crate
// does, too
id += rawSearchIndex[crate].t.length + 1;
}
functionTypeFingerprint = new Uint32Array((id + 1) * 4);

// This loop actually generates the search item indexes, including
// normalized names, type signature objects and fingerprints, and aliases.
id = 0;
for (const crate in rawSearchIndex) {
if (!hasOwnPropertyRustdoc(rawSearchIndex, crate)) {
continue;
@@ -2964,17 +3085,36 @@ ${item.displayPath}<span class="${type}">${name}</span>\
}
searchWords.push(word);
const path = itemPaths.has(i) ? itemPaths.get(i) : lastPath;
let type = null;
if (itemFunctionSearchTypes[i] !== 0) {
type = buildFunctionSearchType(
itemFunctionSearchTypes[i],
lowercasePaths
);
if (type) {
const fp = functionTypeFingerprint.subarray(id * 4, (id + 1) * 4);
const fps = new Set();
for (const t of type.inputs) {
buildFunctionTypeFingerprint(t, fp, fps);
}
for (const t of type.output) {
buildFunctionTypeFingerprint(t, fp, fps);
}
for (const w of type.where_clause) {
for (const t of w) {
buildFunctionTypeFingerprint(t, fp, fps);
}
}
}
}
const row = {
crate: crate,
ty: itemTypes.charCodeAt(i) - charA,
name: itemNames[i],
path: path,
desc: itemDescs[i],
parent: itemParentIdxs[i] > 0 ? paths[itemParentIdxs[i] - 1] : undefined,
type: buildFunctionSearchType(
itemFunctionSearchTypes[i],
lowercasePaths
),
type,
id: id,
normalizedName: word.indexOf("_") === -1 ? word : word.replace(/_/g, ""),
deprecated: deprecatedItems.has(i),
4 changes: 2 additions & 2 deletions tests/rustdoc-js/assoc-type.js
@@ -7,16 +7,16 @@ const EXPECTED = [
'query': 'iterator<something> -> u32',
'correction': null,
'others': [
{ 'path': 'assoc_type', 'name': 'my_fn' },
{ 'path': 'assoc_type::my', 'name': 'other_fn' },
{ 'path': 'assoc_type', 'name': 'my_fn' },
],
},
{
'query': 'iterator<something>',
'correction': null,
'in_args': [
{ 'path': 'assoc_type', 'name': 'my_fn' },
{ 'path': 'assoc_type::my', 'name': 'other_fn' },
{ 'path': 'assoc_type', 'name': 'my_fn' },
],
},
// if I write an explicit binding, only it shows up