Skip to content
This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

Fix #42: Improve price string parsing. #111

Merged
merged 4 commits into from
Sep 20, 2018
Merged

Conversation

biancadanforth
Copy link
Collaborator

@biancadanforth biancadanforth commented Sep 12, 2018

Price element innerText strings now supported as a result of these changes:

Note: This does not handle the case where there is more than one price for the product page (e.g. if we see a range of prices such as "$19.92 - $38.00" or if the price changes based on size/color, etc.); that’s handled by Issue #86.

@biancadanforth
Copy link
Collaborator Author

@Osmose , ready for your review!

In particular, I wasn't sure the best way to handle the main/subunits situation inside of fathom_extraction.js. Right now, if there are subunits, extractedElements.price maps to an object literal, else an HTML object. extractedProducts.price from extraction (what we send on to the background script) is always a string though.

biancadanforth added a commit that referenced this pull request Sep 14, 2018
Improve fallback extraction by CSS selectors:
* Update selectors for the top 5 sites. Add Home Depot and Best Buy.
* Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json').
* Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com').
* Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector.
* Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great first pass, nice work!

Thinking about the original issue, my first thought of how to frame the problem would be this:

Currently, we treat the price as a string and the . as a singular separator between unit and subunit of currency. Some sites, however, split the subunit into a separate DOM element without a . between.

So the goal should be to describe price parsing like this:

  1. Start at the first digit after a $ symbol, or the first digit if there is no $ symbol.
  2. Add digits to the unit number, ignoring , symbols, until a "separator" is found
    • A separator is defined as either a . symbol, or a break in DOM elements.
  3. When a separator is found, convert the collected unit number to an integer, and store as a unit integer in a list.
  4. Repeat steps 2 and 3, appending unit integers to the list, until the last digit is found. That is the final unit integer.
  5. If the unit integer list has 0 or > 2 elements, parsing fails. Otherwise, multiply the first unit integer by 10, add the second unit integer to it (if it exists), and return that as the currency amount.

The hard part, of course, is treating both a . and a DOM element break the same: as separators. You can do that in two steps: First split the matched element into child elements and take their innerText values, then split each innerText value and split it on . symbols, and flatten to a single list.

Lemme know if that helps!

// Do nothing if the extracted product's price is not of the form "$NX.XX".
if (!isPriceFormattedCorrectly(extractedProduct.price)) {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, this adds an intermediate state for prices:

Price string from page -> Price string in "$NX.XX" format -> Integer amount in subunits (e.g. 1000 for $10)

There's not much point in having the intermediate format. Previously, we sent the price string from the page through until we created the price object, and then converted it to subunits.

There's two places it makes sense to perform the conversion from "price string from page" to "integer amount":

  1. At the point of extraction, i.e. the content script
  2. At the point of persistence, i.e. priceFromExtracted.

It looks like this PR is favoring (1) over (2), which is fine. That means the content script should probably be sending the extracted product with an integer amount instead of a string.

extractedElements[feature] = fnodesList[0].element;
const element = fnodesList[0].element;
// Check for price units and subunits
if (feature === 'price' && element.children.length > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An if statement in a loop that only affects one of the loop iterations is typically a code smell; it's hard to keep track of mentally, and doesn't scale well as the special cases grow. In cases like this, the better solution is to defer to the things you're looping over for their behavior.

In other words, it seems like the specific special cases for each feature in PRODUCT_FEATURES is spread amongst a few different places:

  • extractProduct pulls a different attribute for the image feature only.
  • runRuleset is now performing some extra post-processing for price features.
  • Not right now, but a future version of hasAllFeatures could be doing some extra validation as well.

It'd be better if each feature was a thing (an object?) that knew all the special cases it needs. Maybe something like:

const FEATURE_DEFAULTS = {
  getValueFromFnode(fnode) {
    return fnode.element.innerText;
  },
};

const PRODUCT_FEATURES = {
  image: {
    ...FEATURE_DEFAULTS,
    getValueFromFnode(fnode) {
      return fnode.element.src;
    },
  }.
};

It's not a complete example but that's one possible shape. If you gather all the feature-specific needs in those objects, you can write generic loops and other things for the rest of the code that don't need to worry about special cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the returned extracted element sometimes be a special "price object" and testing that later is harder to maintain than being consistent. You may have to refactor more than just this method, but consistency in the values of maps or arrays is very valuable for writing maintainable code.

My first instinct is that runRuleset being a standalone method from extractProduct may not be the best split of functions.

let isMainUnit = true;
const priceElements = {};
// Loop through children: first element containing a digit is main unit,
// second is subunit.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is implemented, the last element containing a digit is the subunit, not the second.

Instead of returning an object with two entries and no way to distinguish weird edge cases (such as more than 2 elements with digits), what if this just returned a list of all the child elements containing digits? Then a method calling this could fail if more than two results are in the returned array, as that's not a case we are capable of handling right now.

} else if (feature === 'price') {
const priceStr = getPriceString(extractedElements[feature]);
extractedProduct[feature] = priceStr;
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same point above about spreading customization per-feature across several methods applies here too.

// Clean up price string and check for subunits
} else if (feature === 'price') {
const priceStr = getPriceString(extractedElements[feature]);
extractedProduct[feature] = priceStr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The priceStr variable isn't providing any clarity here, we could just assign the value directly.

// Check for subunits e.g. dollars and cents.
if ('mainUnit' in priceObj) {
const mainUnitStr = priceObj.mainUnit.innerText;
const subUnitStr = priceObj.subUnit.innerText;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the elements themselves are never used except for getting their innerText, there's no point passing them around as elements until now, just pass around the innerText value from the start.

@biancadanforth
Copy link
Collaborator Author

@Osmose Ready for a re-review; thank you for the feedback. Also I'm standing next to you. :)

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! The new feature list is a great improvement. Nice work.

const cents = Number.parseInt(priceMatch[2], 10);
return (dollars * 100) + cents;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yassssss

delete that code yassssssss

*/
export function getPriceIntegerInSubunits(priceEle) {
// Get all child nodes including text nodes
const childNodes = priceEle.childNodes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this variable, just pass it directly to getAllTokens.

result.push(token.replace(/\D/g, ''));
}
return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function and getPriceTokens are small enough that we can probably just fold them into getAllTokens and use the docstrings as comments above them. Then getAllTokens completely abstracts the token-processing pipeline to make getPriceIntegerInSubunits simpler.

* @param {HTMLElement} priceEle
* @returns {Number} An integer that represents the price in subunits
*/
export function getPriceIntegerInSubunits(priceEle) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Integer part of the method name is unnecessary, the InSubunits bit is clear enough about the nature of the conversion.

}
// Convert units and subunits to a single integer value in subunits
const mainUnits = parseInt(cleanedPriceTokens[0], 10) * 100;
const subUnits = cleanedPriceTokens.length === 2 ? parseInt(cleanedPriceTokens[1], 10) : 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of an early exit and this ternary statement, it may be clearer to check for successful cases and then default to returning NaN as a failure case at the end.

If the getAllTokens method is updated to return "units" (i.e. have it run the parseInt over every unit it discovers, and call it extractUnits) then this method can be simplified even more.

* @param {Array.string} tokens
* @returns {Array.string}
*/
function getAllTokens(tokens) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing being passed in isn't a list of tokens, it's a list of child nodes. Presumably tokens are what's being returned?

}

/**
* Separate token strings in a list into substrings using '$' and '.' as separators
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're also using dom node splits as a separator.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but not in this method.

let shouldPush = true;
for (const node of tokens) {
const text = node.textContent;
for (const char of text) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use split and flat to simplify this:

const allTokens = (
  tokens
  .map(token => (
    token.textContent.split(/[,$]/) // Split by , or $
  ))
  .flat()
);

You don't need to worry about empty strings at the end since the "has a digit" test later on will strip them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, @mythmon keeps telling me to use flatMap and he's totally right:

const allTokens = tokens.flatMap(token => token.textContent.split(/[,$]/));

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow.

* Update how we pull the price string from extracted Fathom price elements to provide main and subunits (e.g. dollars and cents) if available.
* Add price string cleaning methods to remove extra characters (like commas) that were causing price parsing to fail.
* Handle case when price string parsing still fails after cleaning by checking in the background script that the price string is formatted correctly before rendering the browserAction popup.
  * This will guarantee we never see the “blank panel” reported in #79 and #88.

Price element innerText strings now supported as a result of these changes:
* "$1327 /each" ([Home Depot example page](https://www.homedepot.com/p/KitchenAid-Classic-4-5-Qt-Tilt-Head-White-Stand-Mixer-K45SSWH/202546032))
* "$1,049.00" ([Amazon example page](https://www.amazon.com/Fujifilm-X-T2-Mirrorless-F2-8-4-0-Lens/dp/B01I3LNQ6M/ref=sr_1_2?ie=UTF8&qid=1535594119&sr=8-2&keywords=fuji+xt2+camera))
* "US $789.99" ([Ebay example page](https://www.ebay.com/itm/Dell-Inspiron-7570-15-6-Touch-Laptop-i7-8550U-1-8GHz-8GB-1TB-NVIDIA-940MX-W10/263827294291))
* "$4.99+" ([Etsy example page](https://www.etsy.com/listing/555504975/frankenstein-2-custom-stencil?ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=&ref=sr_gallery-1-13))

Note: This does not handle the case where there is more than one price for the product page (e.g. if we see a range of prices such as "$19.92 - $38.00" or if the price changes based on size/color, etc.); that’s handled by Issue #86.
* Move price parsing functions to a new './src/extraction/utils.js' file so that they can be shared with fallback extraction
* Price parsing is now simplified so that it:
  * Takes an element and breaks it into child nodes (including text nodes)
  * Further subdivides that list with '$' and '.' separators
  * Filters the list for strings containing digits, resulting in an array of length 2 (e.g. dollars and cents) or 1 (dollars)
  * Cleans the remaining price strings to remove any non-digit characters
  * Parse the strings to numbers and return an integer in units of the subunit currency (e.g. cents), which is used downstream
* Remove 'priceStringToAmount' method from './src/utils.js' and update 'extractedProductShape' for prop type check.
* Handle the case during price parsing when we have fewer or more price tokens than expected (we would expect one or two). One means we just have a main unit of price (e.g. dollars) and two means we have both a main and subunit of price (e.g. dollars and cents).
* Instead of having if statements that only affect a single element in a for..of loop, represent PRODUCT_FEATURES as an object that extends default behavior. This makes reading through the loop much easier.
* Rename 'getPriceIntegerInSubunits' to 'getPriceInSubunits'
* Consolidate all helper functions into a single function 'getPriceUnits' with the help of 'Array.map', 'Array.filter' and most especially 'Array.flatmap'.
* Replace 'if' statement conditions on the array returned by 'getPriceUnits' to a switch statement
@biancadanforth
Copy link
Collaborator Author

biancadanforth commented Sep 20, 2018

@Osmose Ready for another look. Greatly simplified and rebased.

I noticed here and on master, the live loading that should be working from #115 is not working anymore... I'll look into it tomorrow.

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@biancadanforth biancadanforth merged commit 37b3aeb into master Sep 20, 2018
@biancadanforth biancadanforth deleted the 42-price-parsing branch September 20, 2018 17:28
biancadanforth added a commit that referenced this pull request Sep 21, 2018
Improve fallback extraction by CSS selectors:
* Update selectors for the top 5 sites. Add Home Depot and Best Buy.
* Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json').
* Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com').
* Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector.
* Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
biancadanforth added a commit that referenced this pull request Sep 21, 2018
Improve fallback extraction by CSS selectors:
* Update selectors for the top 5 sites. Add Home Depot and Best Buy.
* Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json').
* Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com').
* Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector.
* Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
biancadanforth added a commit that referenced this pull request Sep 28, 2018
Improve fallback extraction by CSS selectors:
* Update selectors for the top 5 sites. Add Home Depot and Best Buy.
* Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json').
* Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com').
* Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector.
* Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
biancadanforth added a commit that referenced this pull request Sep 28, 2018
Improve fallback extraction by CSS selectors:
* Update selectors for the top 5 sites. Add Home Depot and Best Buy.
* Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json').
* Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com').
* Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector.
* Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants