Skip to content
This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

#36: Integrate Fathom-based page extraction with a simple ruleset. #38

Merged
merged 5 commits into from
Aug 3, 2018

Conversation

biancadanforth
Copy link
Collaborator

@biancadanforth biancadanforth commented Aug 1, 2018

@Osmose , ready for your review.

Follow-up PR:
Add a ruleset (and coefficients) to extract product information as described in #36 with the target accuracy based on a training set.

Possible follow-up issue:
Currently, if Fathom finds no price elements and extractData is executed on a non-product page belonging to a domain we have CSS selectors for in product_extraction_data.json (e.g. amazon.com), it will try to find elements based on the selectors, find nothing, and will throw an error. I.e. in this situation, extractData will return nothing (not even null) to getProductInfo. Is that the behavior that we want? This smells like a potential follow-up issue.
One possible fix (and something we will likely end up doing anyway for performance considerations) is to create a ruleset for identifying a product page, so the extraction code only executes on those types of pages.

@biancadanforth biancadanforth requested a review from Osmose August 1, 2018 00:44
Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Along with the feedback you'll want to rebase, since the sidebar removal landed.

* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, You can obtain one at http://mozilla.org/MPL/2.0/. */

/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double asterisk for the doc comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment is an excellent summary of the file!

*/
function formatPrice(priceString) {
const formattedPriceStr = priceString.replace('$', '').replace(/([\s]|[^0-9$.-])/g, '');
return parseFloat(formattedPriceStr.substr(formattedPriceStr.indexOf('$') + 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Floats are unsuitable for representing price because they cannot accurately represent decimal numbers (that article is a bit wordy, you don't necessarily need to read all of it).

We're not actually doing anything with the prices yet anyway; for now just returning the string matched by Fathom is plenty.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just removed this function entirely for now.

from: 'content',
subject: 'ready',
data: {
price: runRuleset('price', tuningRoutines.price.coeffs) || extractData().price,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may behoove us to define an API for price extraction that's common between the two different methods, and use that here instead of hardcoding them. Something like:

const extractedProduct = fathomExtraction.extractProduct(window.document);
// extractedProduct == {price: '$4.95'}

That would make coding a fallback a bit easier if we want it, and also let us configure different extraction method later on without modifying the rest of the code.

let gotText;
if (feature === 'price') {
// strip whitespace, dollar sign, words, and trailing zeros when comparing price
gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is really difficult to follow. You have to follow the execution through tuningRoutine returning a function that takes a document object and returns a list of fnodes (which are a fathom-specific thing), which we then extract dom nodes from, and then grab just the first one (can more than one even be returned?).

I'm not a fan of the design of this API. there's no reason for runRuleset to take the coefficients as an argument; in this WebExtension we're never going to call that function with different coefficients. Also the coefficients are being imported from the same file, why couldn't the tuning routine just use them directly?

I think this file should be loosely coupled with the details of extraction. This file is more concerned with triggering the extraction and then sending that data back to the background script.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the exported method from fathom_ruleset.js so that it simply returns the highest scoring element for the given feature in the given doc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then grab just the first one (can more than one even be returned?

Yes. Fathom is returning all elements with the highest score. Because our ruleset (with it's one naive rule) is really bad right now, and on some pages, there is no div element with a class of "price", it is returning all div elements in that doc.

If the list has multiple elements, then I imagine we are not confident in Fathom's results, and should use our fallback method, where applicable? That's what I have now in the updated PR. What would you recommend @Osmose ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a way to get the scores of the dom elements returned. We could have a minimum threshold below which we consider the matches to not be worth returning; for our simple ruleset this can just be whatever the score of a matching element would be.

Depending on the rules, maybe we also want to rule out ancestors? I could see a parent having the same innerText as a child and getting a similar score depending on the ruleset.

I suspect if we have a threshold, return the maximum-ranking results, and still find more than one matching dom element, it's prudent to just choose the first matching one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay thanks for these thoughts.

I added a threshold based on the score for the element, and return the first element in the list if more than one exceed it.

Re: this comment:

Depending on the rules, maybe we also want to rule out ancestors? I could see a parent having the same innerText as a child and getting a similar score depending on the ruleset.

I agree that's a good consideration to include, but I am thinking that's best for the follow-up PR where I add a more detailed ruleset, since there is currently no innerText rule in this PR's simple ruleset.


import {dom, out, rule, ruleset, score, type} from 'fathom-web';

const tuningRoutines = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that we have a different ruleset per attribute; I thought we were going to use a single ruleset with multiple types?

import {dom, out, rule, ruleset, score, type} from 'fathom-web';

const tuningRoutines = {
price: {routine: tunedPriceFnodes, coeffs: [2]},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm weary about coefficients being stored as an ordered list instead of, say, a named mapping. The reason Swathi did this in her code is because by default the simulated annealer deals with lists of coefficients, but when we add in annealing scripts to the codebase we can use Maps to get around that while still storing coefficients as named values.

What do you think about loading the coefficients from a JSON file?

{
  "hasDivWithPriceClass": 1
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would Maps be better than just importing the JSON object and getting the corresponding coefficient such as coeff.hasDivWithPriceClass?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that we should just import the JSON class here. What I meant was that the annealing API deals with lists of coefficients. Because key ordering in JS objects isn't guaranteed, you have to do some work to make sure the coefficients you're optimizing and put in and taken out of the lists that the optimization works with in the same order.

Maps could theoretically help since their keys are ordered by insertion, but @erikrose mentioned that Maps might be a perf hit during the annealing process depending on how we use them.

In any case, this is a moot point until we get around to adding in an optimization script to the repo, so we can just ignore Maps for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I don't think you'll need to add any sort of optimization script to your repo; that's what the FathomFox Trainer is for. (The randomTransition() function of the optimizer will probably have to become pluggable or at least selectable, however, depending on what sort of math you're doing with your coefficients. Right now, it can reach integral values, even ones <0, which suits 0..1 bounded scores coming out of each scoring callback.)

* Ruleset for product prices
*/
function tunedPriceFnodes(coeffHasDivWithPriceClass = 1) {
function hasDivWithPriceClass(fnode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining these inside another function make them impossible to test. When we get into using regexes and other more complex rules I suspect unit tests of rules may have some value. Can we define these at a global level somehow?

return tuningRoutine;
}

export {tuningRoutines, formatPrice};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to just put an export keyword next to each of these instead of having an export line at the bottom.

Also, actually, I wasn't aware this syntax worked; I would have expected it to complain that we had an unnamed, non-default export. Interesting.

subject: 'ready',
data: {
price: runRuleset('price', tuningRoutines.price.coeffs) || extractData().price,
url: window.document.URL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, are URL objects serializable such that they can be sent in a postMessage? TIL

This patch will run Fathom against the page (not distinguishing a product from a non-product page) and log the extracted price value and page URL to the console via 'background.js'. Failing that, it will fall back to extraction via CSS selectors if any exist for the site in 'product_extraction_data.json', and failing that, it will try extraction via Open Graph meta tags.

This is heavily based on [Swathi Iyer](https://github.com/swathiiyer2/fathom-products) and [Victor Ng’s](https://github.com/mozilla/fathom-webextension) prior work. Currently, there is only one ruleset with one naive rule for one product feature, price. This initial commit is intended cover Fathom integration into the web extension. A later commit will add rules and take training data into account.

Note: The 'runRuleset' method in 'productInfo.js' returns 'NaN' if it doesn't find any elements for any of its rules.

Performance observations:
Originally, I had dumped Swathi's three rulesets (one each for product title, image and price) and tried to run them against any page, similar to Victor Ng's web extension. However, that was [freezing up the tab](#36 (comment)), and after profiling the content script Fathom was running in before and after replacing Swathi's rulesets with a single ruleset with only one rule for one attribute, I did not see any warnings from Firefox, nor detect any significant performance hits in the DevTools profiler due to Fathom. It would therefore appear the performance hit was related to the complex rulesets and not Fathom itself.

Webpack observations:
While [`jsdom`](https://www.npmjs.com/package/jsdom) is a `fathom-web` dependency, it is used only for running `fathom-web` in the Node context for testing. To avoid build errors associated with `jsdom` and its dependencies, I added a `’null-loader’` for that `require` call, which mocks the module as an empty object. This loader is also used in webpack.config.test.js, from PR #32.
Copy link
Collaborator Author

@biancadanforth biancadanforth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some replies and questions for you for my second commit.

import {dom, out, rule, ruleset, score, type} from 'fathom-web';

const tuningRoutines = {
price: {routine: tunedPriceFnodes, coeffs: [2]},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would Maps be better than just importing the JSON object and getting the corresponding coefficient such as coeff.hasDivWithPriceClass?

*/
function formatPrice(priceString) {
const formattedPriceStr = priceString.replace('$', '').replace(/([\s]|[^0-9$.-])/g, '');
return parseFloat(formattedPriceStr.substr(formattedPriceStr.indexOf('$') + 1));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just removed this function entirely for now.

let gotText;
if (feature === 'price') {
// strip whitespace, dollar sign, words, and trailing zeros when comparing price
gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0];
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the exported method from fathom_ruleset.js so that it simply returns the highest scoring element for the given feature in the given doc.

let gotText;
if (feature === 'price') {
// strip whitespace, dollar sign, words, and trailing zeros when comparing price
gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0];
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then grab just the first one (can more than one even be returned?

Yes. Fathom is returning all elements with the highest score. Because our ruleset (with it's one naive rule) is really bad right now, and on some pages, there is no div element with a class of "price", it is returning all div elements in that doc.

If the list has multiple elements, then I imagine we are not confident in Fathom's results, and should use our fallback method, where applicable? That's what I have now in the updated PR. What would you recommend @Osmose ?

@biancadanforth
Copy link
Collaborator Author

@Osmose , Ready for another look. The biggest open question is what to do when Fathom returns a multi-item list of elements, all having the highest score for a given product feature; see my Comment Review for more details.

Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this changes, this is getting pretty close. Nice work!

* Extracts the highest scoring element for a given feature contained
* in a page's HTML document.
*/
export default function runTuningRoutine(doc) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ruleset isn't really a "tuning routine". Maybe rename this runRuleset?

Copy link
Collaborator Author

@biancadanforth biancadanforth Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm not really sure what "tuningRoutine" means in this context, but we certainly aren't tuning anything here. This was a relic of Swathi's/Victor's code.

);

/**
* Extracts the highest scoring element for a given feature contained
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "given feature" bit of this comment doesn't really apply anymore. There's no feature being passed in anymore, it just returns all the features we currently know how to extract.

for (const [vendor, attributeInfo] of Object.entries(extractionData)) {
if (hostname.includes(vendor)) {
return attributeInfo;
const fallbackExtraction = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define each of these extraction objects in separate files and import them? That'd help focus this module a bit more.

For the Fathom one, you can probably just define that in the existing fathom file and export it instead of the "return DOM nodes" function.

const metaEle = document.querySelector(`meta[property='${value}']`);
if (metaEle) {
data[key] = metaEle.getAttribute('content');
data.url = window.document.URL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extraction doesn't change per-extraction-method, so we can just add it in in getProductInfo instead of duplicating it.

Since it's possible for multiple elements to be returned by Fathom for a given ruleset applied to a page, I applied a threshold (currently just set as the score for the only rule we have, 'hasDivWithPriceClass') to help provide an additional measure of confidence.

If more than one element meet or exceed the threshold, the first element in the list is returned.
@biancadanforth
Copy link
Collaborator Author

Okay @Osmose , added the score threshold; ready for another pass from you.

@biancadanforth
Copy link
Collaborator Author

Sorry for the confusion @Osmose ; this should be ready to go now.

The last two commits should incorporate all of your feedback from comments last night and the second review this morning.

biancadanforth added a commit that referenced this pull request Aug 3, 2018
This commit builds off of PR #38, so that PR should merge before this.

Open questions:
* How to test interdependent rules, such as 'isNearProductImage' for the product title and product price candidate elements?
* The only feature that is pulled out accurately on my test page (an Amazon product page) is the image. What rules can I add/modify to get title and price correct?

TODO:
* Add rule to remove ancestor elements who have the same 'innerText' value.
* Consider adding image rule to see if an image element is the largest image on the page (above the fold).
* Add price rule to see if innerText starts with '$'.
Copy link
Contributor

@Osmose Osmose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work!

const SCORE_THRESHOLD = fathomCoeffs.hasDivWithPriceClass;

/**
* Checks to see if an element is a <div> with a class of "price".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description doesn't need to be quite so long. It's pretty clear that rules return coefficients in the context of Fathom, and this doesn't only score <div>s, it scores any DOM element. A one-liner is fine: Scores fnodes with a "price" class.

Probably also want to update the name of this to hasPriceClass or similar.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants