Change the name from jaccard-strong to haussmann. NFC #49

Merged
merged 1 commit into from
Jan 26, 2024
6 changes: 3 additions & 3 deletions doc/source/api/metrics.rst
@@ -6,10 +6,10 @@ pairwise_distances

.. autofunction:: qbindiff.passes.metrics.pairwise_distances

-jaccard_strong
---------------
+haussmann
+---------

-.. autofunction:: qbindiff.passes.metrics.jaccard_strong
+.. autofunction:: qbindiff.passes.metrics.haussmann

canberra_distances
------------------
4 changes: 2 additions & 2 deletions doc/source/basicex.rst
@@ -25,7 +25,7 @@ BinExport is more flexible since it can be used with IDA, Ghidra and BinaryNinja

* `-n`, `--normalize` Normalize the Call Graph by removing some of the edges/nodes that should worsen the diffing result. **WARNING:** it can potentially lead to a worse matching. To know the details of the normalization step look at {ref}`normalization`

-* `-d`, `--distance` Set the default distance that should be used by the features. The possible values are `canberra, correlation, cosine, euclidean, jaccard-strong`. The default one is `canberra`. To know the details of the jaccard-strong distance look here {ref}`jaccard-strong`
+* `-d`, `--distance` Set the default distance that should be used by the features. The possible values are `canberra, correlation, cosine, euclidean, haussmann`. The default one is `canberra`. To know the details of the haussmann distance look here {ref}`haussmann`

* `-s`, `--sparsity-ratio` Set the density of the similarity matrix. This will lose some information (hence decrease accuracy) but it will also increase the performance. `0.999` means that 99.9% of the matrix will be filled with zeros. The default value is `0.75`

@@ -65,7 +65,7 @@ Some examples are displayed below :
-f dat \
-f cst \
-f addr:0.01 \
--d jaccard-strong -s 0.999 -sr \
+-d haussmann -s 0.999 -sr \
-ff bindiff -o ./result.BinDiff -vv

.. code-block:: bash
8 changes: 4 additions & 4 deletions doc/source/params.rst
@@ -25,16 +25,16 @@ Most of the distance functions that QBinDiff uses come from `Scipy <https://docs
* seuclidean
* sqeuclidean

-However, some distance are unique in QBinDiff, such as the jaccard-strong distance.
+However, some distance are unique in QBinDiff, such as the haussmann distance.
This is an experimental new metric that combines the Jaccard index and the Canberra distance.

-Jaccard-strong
-~~~~~~~~~~~~~~
+Haussmann
+~~~~~~~~~

Formally it is defined as:

.. math::
-d(u, v) = \sum_{i=0}^n\frac{f(u_i, v_i)}{ | \{ i | u_i \neq 0 \lor v_i \neq 0 \} | }
+d(u, v) = \sum_{i=0}^n\frac{f(u_i, v_i)}{ \lvert \{ j | u_j \neq 0 \lor v_j \neq 0 \} \rvert }

.. math::
with\ u, v \in \mathbb{R}^n
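
Reviewer-style sketch of the formula above, to make the shape of the metric concrete. The denominator does not depend on `i`, so the distance is the (optionally weighted) sum of `f(u_i, v_i)` divided by the number of positions where at least one vector is non-zero. The definition of `f` is not visible in this hunk, so `canberra_term` below is only a placeholder assumption, and `haussmann_sketch` is a hypothetical name, not the project's implementation. Returning `0.0` for two all-zero vectors is likewise an assumption.

```python
from typing import Callable, Optional, Sequence


def canberra_term(a: float, b: float) -> float:
    """Placeholder for f (assumption): one Canberra-style term, 0 when both inputs are 0."""
    denom = abs(a) + abs(b)
    return abs(a - b) / denom if denom else 0.0


def haussmann_sketch(
    u: Sequence[float],
    v: Sequence[float],
    f: Callable[[float, float], float] = canberra_term,
    w: Optional[Sequence[float]] = None,
) -> float:
    """Sum of w_i * f(u_i, v_i), divided by |{ j : u_j != 0 or v_j != 0 }|."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    if w is None:
        w = [1.0] * len(u)  # default: every feature has weight 1.0
    elif len(w) != len(u):
        raise ValueError("Weights size mismatch")

    # Size of the joint support: positions where u or v is non-zero.
    support = sum(1 for a, b in zip(u, v) if a != 0 or b != 0)
    if support == 0:
        return 0.0  # assumption: two all-zero vectors are at distance 0

    total = sum(wi * f(a, b) for a, b, wi in zip(u, v, w))
    return total / support


# The last position is zero in both vectors, so it is excluded from the support.
u = [1.0, 0.0, 2.0, 0.0]
v = [1.0, 3.0, 0.0, 0.0]
print(haussmann_sketch(u, v))
print(haussmann_sketch(u, v, w=[1.0, 0.5, 1.0, 1.0]))
```

With the placeholder `f`, the three supported positions contribute 0, 1 and 1, so the unweighted result is 2/3.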
2 changes: 1 addition & 1 deletion src/qbindiff/passes/fast_metrics.pyx
@@ -124,7 +124,7 @@ def sparse_canberra(floating[::1] X_data, int[:] X_indices, int[:] X_indptr,

D[px, py] = d

-def sparse_strong_jaccard(floating[::1] X_data, int[:] X_indices, int[:] X_indptr,
+def sparse_haussmann(floating[::1] X_data, int[:] X_indices, int[:] X_indptr,
floating[::1] Y_data, int[:] Y_indices, int[:] Y_indptr,
double[:, ::1] D, double[:] w):
"""Pairwise canberra distances for CSR matrices"""
20 changes: 10 additions & 10 deletions src/qbindiff/passes/metrics.py
@@ -34,7 +34,7 @@
import sklearn.metrics
from scipy.spatial import distance
from scipy.sparse import issparse, csr_matrix
-from qbindiff.passes.fast_metrics import sparse_canberra, sparse_strong_jaccard
+from qbindiff.passes.fast_metrics import sparse_canberra, sparse_haussmann
from qbindiff.types import Distance


@@ -106,16 +106,16 @@ def canberra_distances(X, Y, w=None):
raise ValueError("Cannot assign weights with non-sparse matrices")


-def jaccard_strong(X, Y, w=None):
+def haussmann(X, Y, w=None):
r"""
-Compute a variation of the jaccard distances between the vectors in X and Y using
-the optional array of weights w.
+Custom distance that takes inspiration from the jaccard index and the canberra distance.
+It computes the distance between the vectors in X and Y using the optional array of weights w.

The distance function between two vectors ``u`` and ``v`` is the following:

.. math::

-\sum_{i}\frac{f(u_i, v_i)}{ | \{ i | u_i \neq 0 \lor v_i \neq 0 \} | }
+\sum_{i}\frac{f(u_i, v_i)}{ | \{ j | u_j \neq 0 \lor v_j \neq 0 \} | }

where the function ``f`` is defined like this:

@@ -127,7 +127,7 @@ def jaccard_strong(X, Y, w=None):

.. math::

-\sum_{i}\frac{w_i * f(u_i, v_i)}{ | \{ i | u_i \neq 0 \lor v_i \neq 0 \} | }
+\sum_{i}\frac{w_i * f(u_i, v_i)}{ | \{ j | u_j \neq 0 \lor v_j \neq 0 \} | }

:param X: array-like of shape (n_samples_X, n_features)
An array where each row is a sample and each column is a feature.
@@ -139,7 +139,7 @@ def jaccard_strong(X, Y, w=None):
Default is None, which gives each value a weight of 1.0

:return D: ndarray of shape (n_samples_X, n_samples_Y)
-D contains the pairwise strong jaccard distances.
+D contains the pairwise haussmann distances.

When X and/or Y are CSR sparse matrices and they are not already
in canonical format, this function modifies them in-place to
@@ -159,19 +159,19 @@ def jaccard_strong(X, Y, w=None):
w = _validate_weights(w)
if w.size != X.shape[1]:
raise ValueError("Weights size mismatch")
-sparse_strong_jaccard(X.data, X.indices, X.indptr, Y.data, Y.indices, Y.indptr, D, w)
+sparse_haussmann(X.data, X.indices, X.indptr, Y.data, Y.indices, Y.indptr, D, w)
return D

if w is None:
-return sparse_strong_jaccard(
+return sparse_haussmann(
X.data, X.indices, X.indptr, Y.data, Y.indices, Y.indptr, D, None
)
raise ValueError("Cannot assign weights with non-sparse matrices")


CUSTOM_DISTANCES = {
Distance.canberra: canberra_distances,
-Distance.jaccard_strong: jaccard_strong,
+Distance.haussmann: haussmann,
}


2 changes: 1 addition & 1 deletion src/qbindiff/types.py
@@ -207,4 +207,4 @@ class Distance(IntEnum):
canberra = 0 # doc: canberra distance
euclidean = 1 # doc: euclidean distance
cosine = 2 # doc: cosine distance
-jaccard_strong = 3 # doc: custom distance
+haussmann = 3 # doc: haussmann distance
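
Reviewer-style note on what the enum rename does and does not preserve. The sketch below uses two trimmed stand-in enums (`OldDistance` and `NewDistance` are hypothetical names; only the four members visible in this hunk are included). It suggests that anything stored as the integer `3` keeps resolving after the rename, while lookups by the old member name would presumably stop working.

```python
from enum import IntEnum


class OldDistance(IntEnum):
    """Stand-in for the enum before this change (assumption: trimmed to four members)."""
    canberra = 0
    euclidean = 1
    cosine = 2
    jaccard_strong = 3


class NewDistance(IntEnum):
    """Stand-in for the enum after this change."""
    canberra = 0
    euclidean = 1
    cosine = 2
    haussmann = 3


# Value-based lookups are unaffected: the integer 3 maps to the renamed member.
stored_value = int(OldDistance.jaccard_strong)
renamed = NewDistance(stored_value)

# Name-based lookups with the old name fail with KeyError.
try:
    NewDistance["jaccard_strong"]
    old_name_resolves = True
except KeyError:
    old_name_resolves = False

print(renamed.name, old_name_resolves)
```

Callers that reference the member by attribute or by string name need the same rename; callers that pass the raw integer do not.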