Skip to content
This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

Fix #84: Improve fallback extraction #114

Merged
merged 4 commits into from
Sep 28, 2018
Merged

Conversation

biancadanforth
Copy link
Collaborator

@biancadanforth biancadanforth commented Sep 14, 2018

Improve fallback extraction by CSS selectors:

  • Update selectors for the top 5 sites. Add Home Depot and Best Buy.
  • Represent CSS selectors in JSON file by tuples, so that each selector can specify exactly how the value should be extracted.
  • Replace selectors JSON file to a JS file, so site-specific extraction methods can be directly called to return the desired value for a given CSS selector (was 'product_extraction_data.json', now 'fallback_extraction_selectors.js').
  • Clean price strings from fallback extraction using the same methods used by Fathom extraction (PR Fix #42: Improve price string parsing. #111).

Follow up issue(s) to consider:

@biancadanforth biancadanforth changed the title 84 fallback extraction Fix #84: Improve fallback extraction Sep 14, 2018
@biancadanforth
Copy link
Collaborator Author

biancadanforth commented Sep 14, 2018

@Osmose Ready for your review! This PR builds off PR #111 . Only the last commit is new.

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I only left comments on the last commit, and ignored changes that may not be relevant after my feedback on the previous PR.

const hostname = new URL(window.location.href).host;
for (const [vendor, attributeInfo] of Object.entries(extractionData)) {
if (hostname.includes(vendor)) {
const url = window.location.href;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use a regex across the bare URL, a user could theoretically craft a URL with querystring/anchor bits that fool our regex into recognizing the site as a different domain than we intend, causing the add-on to behave in unexpected (and potentially dangerous?) ways. It's safer for us to avoid regexes, and operate only on the correctly-parsed host from the URL object.

Perhaps, instead of regexes, we let each vendor object have an allowlist of subdomains, like ['amazon.com', 'www.amazon.com', 'smile.amazon.com'], and we just check if the host is in the list?

for (const [productAttribute, extractor] of Object.entries(attributeInfo)) {
const {selectors, extractUsing} = extractor;
for (const selector of selectors) {
for (const [productAttribute, tuples] of Object.entries(attributeInfo)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tuple is a bit too generic of a word. extractor was kind've nice since it was a bit more clear about what data the object/list contained, could we go back to that or something similar?

for (const selector of selectors) {
for (const [productAttribute, tuples] of Object.entries(attributeInfo)) {
for (const tuple of tuples) {
const [selector, extractUsing] = tuple;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can perform this destructuring within the for loop clause instead of on its own line.

if (data[productAttribute]) {
break;
} else {
throw new Error(`Element found did not return a valid product ${productAttribute}.`);
}
} else if (selector === selectors[selectors.length - 1]) {
} else if (tuple === tuples[tuples.length - 1]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is how the code was already, but this is a difficult-to-follow way to loop until something returns a value. Another way would be to write a function that loops over the list of selectors and attempts to extract values, and returns the first truthy value extracted. Splitting it into a function lets you use return as an early-exit.

src/utils.js Outdated Show resolved Hide resolved
@biancadanforth
Copy link
Collaborator Author

@Osmose , ready for another look.

Weird parts include (better ideas welcome):

  • The top level key in the CSS selectors object is a string that's broken out into a list of domains/subdomains.
  • getPriceInSubunits now accepts an HTML element (Fathom extraction) or a string (fallback extraction) as an argument.

Also thank you mythmon for your help on the function composition I had to perform in fallback_extraction_selectors.js.

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

src/utils.js Outdated
default:
throw new Error(`Unrecognized extraction property or attribute '${extractUsing}'.`);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These helpers aren't used anymore, we should remove them.

* that selector.
*/
const fallbackExtractionData = {
'amazon.com_www.amazon.com_smile.amazon.com': {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of storing these in a map, store them in a list and make the allowed domains an attribute:

const fallbackExtractionData = [
  {
    domains: ['amazon.com', 'www.amazon.com', 'smile.amazon.com'],
    features: {
      price: [
        // etc.
      ],
    },
  },
];

That'd let you avoid trying to fit every domain into a single key.

/**
* Separates a string into an array of substrings using '$' and '.' as separators
*/
function splitString(str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name is too generic to describe what it is doing. We should remove it as a separate function, or rename it.

* @returns {Number} the price in subunits
*/
export function getPriceInSubunits(priceEle) {
const priceUnits = getPriceUnits(priceEle.childNodes);
export function getPriceInSubunits(price) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally not a fan of methods that dispatch based on the type of their input. Typically the code can be rearranged so that the type-specific work happens in the spots where the types are different, instead of doing it in the commonly-shared method.

In our case, the parsing process for DOM elements is:

  1. Convert each child node to a string.
  2. Split each string by $ and ., and flatten into a single list.
  3. Remove strings without digits.
  4. Remove special characters from strings.
  5. Convert strings to integers.
  6. Convert lists of 1 or 2 integers into prices, otherwise fail parsing.

The case of converting a single string is the same set of steps starting from step 2, with step 1 passing in a list of strings to step 2. In the string case, we can pass a list of length 1 with the price string in it. We get something like this:

export function parsePrice(tokens) {
  const priceUnits = (
    tokens

    // Split tokens by $ and . to get the numbers between them
    .flatMap(token => token.split(/[.$]/))

    // Filter out any tokens that do not contain a digit
    .filter(token => /\d/g.test(token))

    // Remove any non-digit characters for each token in the list
    .map(token => token.replace(/\D/g, ''))

    // Convert price token strings to integers
    .map(token => parseInt(token, 10))
  );
  
    
  // Convert units and subunits to a single integer value in subunits
  switch (priceUnits.length) {
    case 1:
      return priceUnits[0] * 100;
    case 2:
      return ((priceUnits[0] * 100) + priceUnits[1]);
    default:
      return NaN;
  }
}

// For fallback extraction
function inUnits(fn) {
  const priceString = fn(element);
  return element => parsePrice([priceString]);
}

// For Fathom extraction
const PRODUCT_FEATURES = {
  price: {
    ...FEATURE_DEFAULTS,
    getValueFromElement(element) {
      const tokens = Array.from(element.childNodes).map(node => node.textContent);
      return parsePrice(tokens);
    },
  },
};

parsePrice in this situation is more loosely coupled to the two modules it is used, because it operates on the common format that both modules can produce themselves. The difficulty is figuring out the data each module has, how and where it is being transformed, and looking for the common data shape between them. Focusing on the types and transformations on those types helps make it a bit easier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great point; thank you for the detailed reasoning and code samples.

if (value) {
return value;
}
throw new Error('Element found did not return a valid value for the product feature.');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This throw breaks the cascade in case of a failure, and it's also not a case in which we want to halt execution and throw. Maybe remove this, maybe log a warning instead?

}
// None of the selectors matched an element on the page
throw new Error('No elements found with vendor data for the product feature.');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the extractors fail, do we want to throw and halt all execution? It's not unreasonable that our extractors get out of date again, so we should probably just be logging a warning and returning null.

@biancadanforth
Copy link
Collaborator Author

@Osmose Thanks for your feedback; this is ready for another look!

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve fallback extraction by CSS selectors:
* Update selectors for the top 5 sites. Add Home Depot and Best Buy.
* Rename selectors JSON file to be more descriptive (was 'product_extraction_data.json', now 'fallback_extraction_selectors.json').
* Represent supported sites in 'fallback_extraction_selectors.json' as regular expression strings so that fallback extraction works for any subdomain of the site (e.g. 'smile.amazon.com').
* Represent CSS selectors by tuples in 'fallback_extraction_selectors.json', so that each selector can specify which attribute or property to read for that selector.
* Clean price strings from fallback extraction using the same methods as used by Fathom extraction (PR #111); consolidate and move shared methods to 'utils.js'.
* Replaces ‘fallback_extraction_selectors.json’ with a JS version, which allows the tuples associating a CSS selector with how to extract information to point to a site-specific extraction method that returns the desired value when executed.
* Renamed some variables in various functions in ‘fallback_extraction.js’ so that it more closely matches its sister functions in ‘fathom_extraction.js’ for improved readability.
* Modified ‘getPriceInSubunits’ function in './src/extraction/utils.js' to take in an HTML element (from Fathom extraction) OR a string (from fallback extraction).
  * Refactored some of the supporting functions.
* Replace top level keys in CSS selector data (in ‘fallback_extraction_selectors.js’) with a string composed of supported domains/subdomains.
  * Update ‘getFeatureInfo’ to split the string into a list of domains/subdomains before checking for a hostname match.
* Break out part of ‘extractProduct’ in ‘fallback_extraction.js` for better readability.
* Rename some internal variables for more consistency/clarity.
* Log a warning instead of throwing an error and halting execution if fallback extraction for a supported site fails in various ways.
* Restructure 'fallbackExtractionData' in 'fallback_extraction_selectors.js', so it is an array of objects with keys 'domains' and 'features'.
* Format 'parsePrice' utility function's input argument into an array of strings so that it takes in the same type of argument regardless of where the extraction info is coming from (Fathom or fallback).
* Remove unused utility methods from './src/utils.js' and './src/extraction/utils.js'.
@biancadanforth biancadanforth merged commit 4667bb1 into master Sep 28, 2018
@biancadanforth biancadanforth deleted the 84-fallback-extraction branch September 28, 2018 23:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants