-
Notifications
You must be signed in to change notification settings - Fork 15
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I only left comments on the last commit, and ignored changes that may not be relevant after my feedback on the previous PR.
const hostname = new URL(window.location.href).host; | ||
for (const [vendor, attributeInfo] of Object.entries(extractionData)) { | ||
if (hostname.includes(vendor)) { | ||
const url = window.location.href; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use a regex across the bare URL, a user could theoretically craft a URL with querystring/anchor bits that fool our regex into recognizing the site as a different domain than we intend, causing the add-on to behave in unexpected (and potentially dangerous?) ways. It's safer for us to avoid regexes, and operate only on the correctly-parsed host from the URL
object.
Perhaps, instead of regexes, we let each vendor object have an allowlist of subdomains, like ['amazon.com', 'www.amazon.com', 'smile.amazon.com']
, and we just check if the host is in the list?
for (const [productAttribute, extractor] of Object.entries(attributeInfo)) { | ||
const {selectors, extractUsing} = extractor; | ||
for (const selector of selectors) { | ||
for (const [productAttribute, tuples] of Object.entries(attributeInfo)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuple
is a bit too generic of a word. extractor
was kind've nice since it was a bit more clear about what data the object/list contained, could we go back to that or something similar?
for (const selector of selectors) { | ||
for (const [productAttribute, tuples] of Object.entries(attributeInfo)) { | ||
for (const tuple of tuples) { | ||
const [selector, extractUsing] = tuple; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can perform this destructuring within the for
loop clause instead of on its own line.
if (data[productAttribute]) { | ||
break; | ||
} else { | ||
throw new Error(`Element found did not return a valid product ${productAttribute}.`); | ||
} | ||
} else if (selector === selectors[selectors.length - 1]) { | ||
} else if (tuple === tuples[tuples.length - 1]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is how the code was already, but this is a difficult-to-follow way to loop until something returns a value. Another way would be to write a function that loops over the list of selectors and attempts to extract values, and returns the first truthy value extracted. Splitting it into a function lets you use return
as an early-exit.
14eefe2
to
200bca9
Compare
@Osmose , ready for another look. Weird parts include (better ideas welcome):
Also thank you mythmon for your help on the function composition I had to perform in |
f848ddd
to
dc0ca41
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
src/utils.js
Outdated
default: | ||
throw new Error(`Unrecognized extraction property or attribute '${extractUsing}'.`); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These helpers aren't used anymore, we should remove them.
* that selector. | ||
*/ | ||
const fallbackExtractionData = { | ||
'amazon.com_www.amazon.com_smile.amazon.com': { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of storing these in a map, store them in a list and make the allowed domains an attribute:
const fallbackExtractionData = [
{
domains: ['amazon.com', 'www.amazon.com', 'smile.amazon.com'],
features: {
price: [
// etc.
],
},
},
];
That'd let you avoid trying to fit every domain into a single key.
src/extraction/utils.js
Outdated
/** | ||
* Separates a string into an array of substrings using '$' and '.' as separators | ||
*/ | ||
function splitString(str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function name is too generic to describe what it is doing. We should remove it as a separate function, or rename it.
src/extraction/utils.js
Outdated
* @returns {Number} the price in subunits | ||
*/ | ||
export function getPriceInSubunits(priceEle) { | ||
const priceUnits = getPriceUnits(priceEle.childNodes); | ||
export function getPriceInSubunits(price) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally not a fan of methods that dispatch based on the type of their input. Typically the code can be rearranged so that the type-specific work happens in the spots where the types are different, instead of doing it in the commonly-shared method.
In our case, the parsing process for DOM elements is:
- Convert each child node to a string.
- Split each string by
$
and.
, and flatten into a single list. - Remove strings without digits.
- Remove special characters from strings.
- Convert strings to integers.
- Convert lists of 1 or 2 integers into prices, otherwise fail parsing.
The case of converting a single string is the same set of steps starting from step 2, with step 1 passing in a list of strings to step 2. In the string case, we can pass a list of length 1 with the price string in it. We get something like this:
export function parsePrice(tokens) {
const priceUnits = (
tokens
// Split tokens by $ and . to get the numbers between them
.flatMap(token => token.split(/[.$]/))
// Filter out any tokens that do not contain a digit
.filter(token => /\d/g.test(token))
// Remove any non-digit characters for each token in the list
.map(token => token.replace(/\D/g, ''))
// Convert price token strings to integers
.map(token => parseInt(token, 10))
);
// Convert units and subunits to a single integer value in subunits
switch (priceUnits.length) {
case 1:
return priceUnits[0] * 100;
case 2:
return ((priceUnits[0] * 100) + priceUnits[1]);
default:
return NaN;
}
}
// For fallback extraction
function inUnits(fn) {
const priceString = fn(element);
return element => parsePrice([priceString]);
}
// For Fathom extraction
const PRODUCT_FEATURES = {
price: {
...FEATURE_DEFAULTS,
getValueFromElement(element) {
const tokens = Array.from(element.childNodes).map(node => node.textContent);
return parsePrice(tokens);
},
},
};
parsePrice
in this situation is more loosely coupled to the two modules it is used, because it operates on the common format that both modules can produce themselves. The difficulty is figuring out the data each module has, how and where it is being transformed, and looking for the common data shape between them. Focusing on the types and transformations on those types helps make it a bit easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great point; thank you for the detailed reasoning and code samples.
if (value) { | ||
return value; | ||
} | ||
throw new Error('Element found did not return a valid value for the product feature.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This throw breaks the cascade in case of a failure, and it's also not a case in which we want to halt execution and throw. Maybe remove this, maybe log a warning instead?
} | ||
// None of the selectors matched an element on the page | ||
throw new Error('No elements found with vendor data for the product feature.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the extractors fail, do we want to throw and halt all execution? It's not unreasonable that our extractors get out of date again, so we should probably just be logging a warning and returning null
.
dc0ca41
to
5774b40
Compare
@Osmose Thanks for your feedback; this is ready for another look! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improve fallback extraction by CSS selectors: * Update selectors for the top 5 sites. Add Home Depot and Best Buy. * Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json'). * Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com'). * Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector. * Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
* Replaces ‘fallback_extraction_selectors.json’ with a JS version, which allows the tuples associating a CSS selector with how to extract information to point to a site-specific extraction method that returns the desired value when executed. * Renamed some variables in various functions in ‘fallback_extraction.js’ so that it more closely matches its sister functions in ‘fathom_extraction.js’ for improved readability. * Modified ‘getPriceInSubunits’ function in './src/extraction/utils.js' to take in an HTML element (from Fathom extraction) OR a string (from fallback extraction). * Refactored some of the supporting functions.
* Replace top level keys in CSS selector data (in ‘fallback_extraction_selectors.js’) with a string composed of supported domains/subdomains. * Update ‘getFeatureInfo’ to split the string into a list of domains/subdomains before checking for a hostname match. * Break out part of ‘extractProduct’ in ‘fallback_extraction.js` for better readability. * Rename some internal variables for more consistency/clarity.
* Log a warning instead of throwing an error and halting execution if fallback extraction for a supported site fails in various ways. * Restructure 'fallbackExtractionData' in 'fallback_extraction_selectors.js', so it is an array of objects with keys 'domains' and 'features'. * Format 'parsePrice' utility function's input argument into an array of strings so that it takes in the same type of argument regardless of where the extraction info is coming from (Fathom or fallback). * Remove unused utility methods from './src/utils.js' and './src/extraction/utils.js'.
5774b40
to
9e546b7
Compare
Improve fallback extraction by CSS selectors:
Follow up issue(s) to consider: