The release v1.1 introduced component-wise deduping, near duplicate hashing, and fuzzy deduping for names. These features are absolutely incredible! Truly, hats off to all the amazing work! However, no real documentation with code examples was released for these features, nor were tests created that show how these new features work and integrate with the rest of the library.
Component-Deduping
Presumed Usage
libpostal_duplicate_options_t defaultOptions = libpostal_get_default_duplicate_options(); // or the with_languages variant
libpostal_is_street_duplicate("S Sumter Street", "South Sumter St", defaultOptions); // same for other ones. Just not sure how toponym one would work
Questions
It appears that these functions are meant to be used with the results of libpostal_parse_address, but it's not clear how. Some components have corresponding dedupe functions, such as unit, po box, and postal code. However, some do not have corresponding dedupe functions.
For example, some addresses parse into a HouseNumber and Road while others just parse into a House. Ex:
404 Maple Drive, Bldg A, Suite 100
HouseNumber: 404
Road: maple drive
Unit: bldg a suite 100
404 Maple Dr, Building A, Ste 100
House: 404 maple dr building a
Unit: ste 100
Ignoring the fact that "bldg a" gets put in the unit for one but not the other, it's hard to do component-wise matching when the two parses produce different components. My idea was to construct a street line and compare like so:
StringBuilder? sb = null;
Dictionary<AddressLabel, string> parts1 = _libPostalBinding.ParseAddress(address1);
Dictionary<AddressLabel, string> parts2 = _libPostalBinding.ParseAddress(address2);
foreach (AddressLabel label in parts1.Keys)
{
    // Only the street-level components go into the street line.
    if (label is AddressLabel.HouseNumber or AddressLabel.House or AddressLabel.Road)
    {
        if (sb == null)
        {
            sb = new();
            sb.Append(parts1[label]);
        }
        else
        {
            sb.Append(' ');
            sb.Append(parts1[label]);
        }
    }
}
string? streetLine1 = sb?.ToString();
string? streetLine2 = // repeat process for parts2
bool streetsEqual = _libPostalBinding.IsStreetDuplicate(streetLine1, streetLine2);
This wouldn't work for the previous example because "building a" would be in the street line for one but not the other, but for most cases where you have House vs HouseNumber and Road, this would work, I think. Open to hearing how others do it.
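For anyone calling the C API directly, the same street-line idea can be sketched in C. The parser response exposes parallel `labels` and `components` arrays, so the sketch below takes two such arrays as plain arguments. `build_street_line` and `is_street_label` are hypothetical helpers, not libpostal functions, and the label strings are assumed to be the parser's `house_number`, `house`, and `road`.

```c
#include <stddef.h>
#include <string.h>

/* Assumed label strings for the street-level components. */
static int is_street_label(const char *label) {
    return strcmp(label, "house_number") == 0 ||
           strcmp(label, "house") == 0 ||
           strcmp(label, "road") == 0;
}

/* Hypothetical helper: joins the street-level components with single
 * spaces, in the order they were parsed. Returns 0 on success, or -1 if
 * the output buffer is too small. */
int build_street_line(size_t n, const char **labels, const char **components,
                      char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (!is_street_label(labels[i])) continue;
        size_t len = strlen(components[i]);
        size_t sep = used > 0 ? 1 : 0;
        if (used + sep + len + 1 > out_size) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, components[i], len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

The two resulting street lines would then be passed to libpostal_is_street_duplicate. As noted above, this still breaks when one parse puts "building a" in the house component and the other puts it in the unit.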
The normalization functions seem simple enough, and I have tested them. My only question is how to free the char* they return. Expand and parse have destroy functions for their results, but there is no libpostal_normalize_response_destroy. Am I just supposed to call free when I'm done with it? If so, could a function be added that does that, so bindings in other programming languages can free the string without having to create a DLL just for free?
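To make the request concrete, here is a minimal sketch of the kind of export I mean. It assumes the returned string is allocated with malloc, which I have not confirmed. Both function names are hypothetical and neither exists in libpostal: `example_make_string` only stands in for a library call that returns a heap-allocated string.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical export: frees a string the library allocated, using the
 * same C runtime that allocated it. */
void example_string_destroy(char *str) {
    free(str); /* free(NULL) is a no-op, so passing NULL is safe */
}

/* Stand-in for a library call that returns a heap-allocated string. */
char *example_make_string(const char *src) {
    size_t len = strlen(src) + 1;
    char *copy = malloc(len);
    if (copy != NULL) memcpy(copy, src, len);
    return copy;
}
```

With an export like this, a binding would only need to import one more symbol from the library itself, and the allocation and the free would always go through the same runtime.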
Here's what I think could be improved
Documentation should be added for each of these features. It should explain the following:
How the feature integrates with the rest of the library
What the inputs to each function should be
What the outputs of the function will be
Here's how we want to use libpostal
Our main use cases are clustering similar addresses, deduping exact/likely match addresses, and near-dupe hashing to make the dedupe process more efficient.
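As a rough sketch of how we expect hashing to cut down the work: two addresses would only go through the expensive pairwise dedupe check if their hash lists share at least one key. `shares_hash` is a hypothetical helper of ours, and the hash strings are assumed to be opaque keys produced elsewhere.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the two lists of hash keys have at
 * least one key in common, otherwise 0. */
int shares_hash(size_t na, const char **a, size_t nb, const char **b) {
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (strcmp(a[i], b[j]) == 0) return 1;
        }
    }
    return 0;
}
```

In practice the keys would more likely go into a hash map from key to record IDs, so that candidates are found without comparing every pair of lists. The nested loop only illustrates the candidate test itself.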
My country is US
biegehydra changed the title from "Documentation needed for dedupe and fuzzy matching" to "Documentation needed for normalization, component-wise deduping, fuzzy matching and normalization" on Mar 27, 2024
Near-Dupe Hashing
Headers
I believe you aren't meant to put the result from parse address into the near dupe hash functions.
Fuzzy duplicates
Headers
I have no idea how to use any of these functions. Where do the tokens come from? Where do the token scores come from?
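My best guess, stated purely as an assumption, is that the tokens are the words of the name and the scores are per-token weights in a parallel array. The sketch below shows one way to produce such a pair of arrays, with equal weights as a placeholder. `tokenize_uniform` is a hypothetical helper, and I don't know whether this matches what the library expects.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits buf on spaces in place, stores pointers to
 * the tokens, and gives every token the same weight (1/n). Returns the
 * number of tokens, capped at max_tokens. Uses strtok, so it is not
 * thread-safe. */
size_t tokenize_uniform(char *buf, char **tokens, double *scores,
                        size_t max_tokens) {
    size_t n = 0;
    char *tok = strtok(buf, " ");
    while (tok != NULL && n < max_tokens) {
        tokens[n++] = tok;
        tok = strtok(NULL, " ");
    }
    for (size_t i = 0; i < n; i++) {
        scores[i] = 1.0 / (double)n;
    }
    return n;
}
```

If the scores are supposed to come from corpus statistics instead, such as how rare each word is, the documentation should say so and say how they need to be scaled.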