-
Notifications
You must be signed in to change notification settings - Fork 15
Fix #42: Improve price string parsing. #111
Conversation
@Osmose , ready for your review! In particular, I wasn't sure the best way to handle the main/subunits situation inside of |
Improve fallback extraction by CSS selectors: * Update selectors for the top 5 sites. Add Home Depot and Best Buy. * Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json'). * Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com'). * Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector. * Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great first pass, nice work!
Thinking about the original issue, my first thought of how to frame the problem would be this:
Currently, we treat the price as a string and the .
as a singular separator between unit and subunit of currency. Some sites, however, split the subunit into a separate DOM element without a .
between.
So the goal should be to describe price parsing like this:
- Start at the first digit after a
$
symbol, or the first digit if there is no$
symbol. - Add digits to the unit number, ignoring
,
symbols, until a "separator" is found- A separator is defined as either a
.
symbol, or a break in DOM elements.
- A separator is defined as either a
- When a separator is found, convert the collected unit number to an integer, and store as a unit integer in a list.
- Repeat steps 2 and 3, appending unit integers to the list, until the last digit is found. That is the final unit integer.
- If the unit integer list has 0 or > 2 elements, parsing fails. Otherwise, multiply the first unit integer by 10, add the second unit integer to it (if it exists), and return that as the currency amount.
The hard part, of course, is treating both a .
and a DOM element break the same: as separators. You can do that in two steps: First split the matched element into child elements and take their innerText
values, then split each innerText
value and split it on .
symbols, and flatten to a single list.
Lemme know if that helps!
src/background/index.js
Outdated
// Do nothing if the extracted product's price is not of the form "$NX.XX". | ||
if (!isPriceFormattedCorrectly(extractedProduct.price)) { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially, this adds an intermediate state for prices:
Price string from page -> Price string in "$NX.XX" format -> Integer amount in subunits (e.g. 1000
for $10)
There's not much point in having the intermediate format. Previously, we sent the price string from the page through until we created the price object, and then converted it to subunits.
There's two places it makes sense to perform the conversion from "price string from page" to "integer amount":
- At the point of extraction, i.e. the content script
- At the point of persistence, i.e.
priceFromExtracted
.
It looks like this PR is favoring (1) over (2), which is fine. That means the content script should probably be sending the extracted product with an integer amount instead of a string.
src/extraction/fathom_extraction.js
Outdated
extractedElements[feature] = fnodesList[0].element; | ||
const element = fnodesList[0].element; | ||
// Check for price units and subunits | ||
if (feature === 'price' && element.children.length > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An if
statement in a loop that only affects one of the loop iterations is typically a code smell; it's hard to keep track of mentally, and doesn't scale well as the special cases grow. In cases like this, the better solution is to defer to the things you're looping over for their behavior.
In other words, it seems like the specific special cases for each feature in PRODUCT_FEATURES
is spread amongst a few different places:
extractProduct
pulls a different attribute for the image feature only.runRuleset
is now performing some extra post-processing for price features.- Not right now, but a future version of
hasAllFeatures
could be doing some extra validation as well.
It'd be better if each feature was a thing (an object?) that knew all the special cases it needs. Maybe something like:
const FEATURE_DEFAULTS = {
getValueFromFnode(fnode) {
return fnode.element.innerText;
},
};
const PRODUCT_FEATURES = {
image: {
...FEATURE_DEFAULTS,
getValueFromFnode(fnode) {
return fnode.element.src;
},
}.
};
It's not a complete example but that's one possible shape. If you gather all the feature-specific needs in those objects, you can write generic loops and other things for the rest of the code that don't need to worry about special cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the returned extracted element sometimes be a special "price object" and testing that later is harder to maintain than being consistent. You may have to refactor more than just this method, but consistency in the values of maps or arrays is very valuable for writing maintainable code.
My first instinct is that runRuleset
being a standalone method from extractProduct
may not be the best split of functions.
src/extraction/fathom_extraction.js
Outdated
let isMainUnit = true; | ||
const priceElements = {}; | ||
// Loop through children: first element containing a digit is main unit, | ||
// second is subunit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this is implemented, the last element containing a digit is the subunit, not the second.
Instead of returning an object with two entries and no way to distinguish weird edge cases (such as more than 2 elements with digits), what if this just returned a list of all the child elements containing digits? Then a method calling this could fail if more than two results are in the returned array, as that's not a case we are capable of handling right now.
src/extraction/fathom_extraction.js
Outdated
} else if (feature === 'price') { | ||
const priceStr = getPriceString(extractedElements[feature]); | ||
extractedProduct[feature] = priceStr; | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same point above about spreading customization per-feature across several methods applies here too.
src/extraction/fathom_extraction.js
Outdated
// Clean up price string and check for subunits | ||
} else if (feature === 'price') { | ||
const priceStr = getPriceString(extractedElements[feature]); | ||
extractedProduct[feature] = priceStr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The priceStr
variable isn't providing any clarity here, we could just assign the value directly.
src/extraction/fathom_extraction.js
Outdated
// Check for subunits e.g. dollars and cents. | ||
if ('mainUnit' in priceObj) { | ||
const mainUnitStr = priceObj.mainUnit.innerText; | ||
const subUnitStr = priceObj.subUnit.innerText; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the elements themselves are never used except for getting their innerText
, there's no point passing them around as elements until now, just pass around the innerText
value from the start.
33d7a01
to
41ddd0e
Compare
@Osmose Ready for a re-review; thank you for the feedback. Also I'm standing next to you. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! The new feature list is a great improvement. Nice work.
const cents = Number.parseInt(priceMatch[2], 10); | ||
return (dollars * 100) + cents; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yassssss
delete that code yassssssss
src/extraction/utils.js
Outdated
*/ | ||
export function getPriceIntegerInSubunits(priceEle) { | ||
// Get all child nodes including text nodes | ||
const childNodes = priceEle.childNodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this variable, just pass it directly to getAllTokens
.
src/extraction/utils.js
Outdated
result.push(token.replace(/\D/g, '')); | ||
} | ||
return result; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function and getPriceTokens
are small enough that we can probably just fold them into getAllTokens
and use the docstrings as comments above them. Then getAllTokens
completely abstracts the token-processing pipeline to make getPriceIntegerInSubunits
simpler.
src/extraction/utils.js
Outdated
* @param {HTMLElement} priceEle | ||
* @returns {Number} An integer that represents the price in subunits | ||
*/ | ||
export function getPriceIntegerInSubunits(priceEle) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Integer
part of the method name is unnecessary, the InSubunits
bit is clear enough about the nature of the conversion.
src/extraction/utils.js
Outdated
} | ||
// Convert units and subunits to a single integer value in subunits | ||
const mainUnits = parseInt(cleanedPriceTokens[0], 10) * 100; | ||
const subUnits = cleanedPriceTokens.length === 2 ? parseInt(cleanedPriceTokens[1], 10) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of an early exit and this ternary statement, it may be clearer to check for successful cases and then default to returning NaN
as a failure case at the end.
If the getAllTokens
method is updated to return "units" (i.e. have it run the parseInt over every unit it discovers, and call it extractUnits
) then this method can be simplified even more.
src/extraction/utils.js
Outdated
* @param {Array.string} tokens | ||
* @returns {Array.string} | ||
*/ | ||
function getAllTokens(tokens) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing being passed in isn't a list of tokens, it's a list of child nodes. Presumably tokens are what's being returned?
src/extraction/utils.js
Outdated
} | ||
|
||
/** | ||
* Separate token strings in a list into substrings using '$' and '.' as separators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're also using dom node splits as a separator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but not in this method.
src/extraction/utils.js
Outdated
let shouldPush = true; | ||
for (const node of tokens) { | ||
const text = node.textContent; | ||
for (const char of text) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use split
and flat
to simplify this:
const allTokens = (
tokens
.map(token => (
token.textContent.split(/[,$]/) // Split by , or $
))
.flat()
);
You don't need to worry about empty strings at the end since the "has a digit" test later on will strip them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, @mythmon keeps telling me to use flatMap
and he's totally right:
const allTokens = tokens.flatMap(token => token.textContent.split(/[,$]/));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow.
* Update how we pull the price string from extracted Fathom price elements to provide main and subunits (e.g. dollars and cents) if available. * Add price string cleaning methods to remove extra characters (like commas) that were causing price parsing to fail. * Handle case when price string parsing still fails after cleaning by checking in the background script that the price string is formatted correctly before rendering the browserAction popup. * This will guarantee we never see the “blank panel” reported in #79 and #88. Price element innerText strings now supported as a result of these changes: * "$1327 /each" ([Home Depot example page](https://www.homedepot.com/p/KitchenAid-Classic-4-5-Qt-Tilt-Head-White-Stand-Mixer-K45SSWH/202546032)) * "$1,049.00" ([Amazon example page](https://www.amazon.com/Fujifilm-X-T2-Mirrorless-F2-8-4-0-Lens/dp/B01I3LNQ6M/ref=sr_1_2?ie=UTF8&qid=1535594119&sr=8-2&keywords=fuji+xt2+camera)) * "US $789.99" ([Ebay example page](https://www.ebay.com/itm/Dell-Inspiron-7570-15-6-Touch-Laptop-i7-8550U-1-8GHz-8GB-1TB-NVIDIA-940MX-W10/263827294291)) * "$4.99+" ([Etsy example page](https://www.etsy.com/listing/555504975/frankenstein-2-custom-stencil?ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=&ref=sr_gallery-1-13)) Note: This does not handle the case where there is more than one price for the product page (e.g. if we see a range of prices such as "$19.92 - $38.00" or if the price changes based on size/color, etc.); that’s handled by Issue #86.
* Move price parsing functions to a new './src/extraction/utils.js' file so that they can be shared with fallback extraction * Price parsing is now simplified so that it: * Takes an element and breaks it into child nodes (including text nodes) * Further subdivides that list with '$' and '.' separators * Filters the list for strings containing digits, resulting in an array of length 2 (e.g. dollars and cents) or 1 (dollars) * Cleans the remaining price strings to remove any non-digit characters * Parse the strings to numbers and return an integer in units of the subunit currency (e.g. cents), which is used downstream * Remove 'priceStringToAmount' method from './src/utils.js' and update 'extractedProductShape' for prop type check.
* Handle the case during price parsing when we have fewer or more price tokens than expected (we would expect one or two). One means we just have a main unit of price (e.g. dollars) and two means we have both a main and subunit of price (e.g. dollars and cents). * Instead of having if statements that only affect a single element in a for..of loop, represent PRODUCT_FEATURES as an object that extends default behavior. This makes reading through the loop much easier.
* Rename 'getPriceIntegerInSubunits' to 'getPriceInSubunits' * Consolidate all helper functions into a single function 'getPriceUnits' with the help of 'Array.map', 'Array.filter' and most especially 'Array.flatmap'. * Replace 'if' statement conditions on the array returned by 'getPriceUnits' to a switch statement
2a1c33b
to
0c21ccb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improve fallback extraction by CSS selectors: * Update selectors for the top 5 sites. Add Home Depot and Best Buy. * Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json'). * Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com'). * Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector. * Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
Improve fallback extraction by CSS selectors: * Update selectors for the top 5 sites. Add Home Depot and Best Buy. * Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json'). * Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com'). * Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector. * Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
Improve fallback extraction by CSS selectors: * Update selectors for the top 5 sites. Add Home Depot and Best Buy. * Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json'). * Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com'). * Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector. * Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
Improve fallback extraction by CSS selectors: * Update selectors for the top 5 sites. Add Home Depot and Best Buy. * Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json'). * Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com'). * Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector. * Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
null
fromextractProduct
infathom_extraction.js
(the same thing that happens for unsuccessful extraction).null
when price string parsing fails.Price element innerText strings now supported as a result of these changes:
Note: This does not handle the case where there is more than one price for the product page (e.g. if we see a range of prices such as "$19.92 - $38.00" or if the price changes based on size/color, etc.); that’s handled by Issue #86.