feat(browser.py): add BrowserReplayStrategy; support browser modes record/replay (#872)

* add BrowserReplayStrategy; support browser modes record/replay
* minor refactor
* black/flake8
* update README
* improve README
* add BrowserReplayStrategy to README
* add strategies/visual_browser.py
* fix Action.from_dict and test_action_from_dict to support <cmd>-t
* calculate_tokens_and_cost; bugfix ActionEvent.fromdict; add ActionEvent.next_event; add TODOs; add visual_browser.py::SKIP_MOVE_BEFORE_CLICK
* handle mousemove/scroll; add_screen_tlbr forwards and backwards; RAW_PRECISE/IMPRECISE_MOUSE_EVENTS; openai.MAX_IMAGES = 90; fix merge_consecutive_mouse_scroll_events and tests; filter_invalid_window_events; dump_state timeout;
* add TODO
* noqa
OpenAdaptAI · Oct 24, 2024 · d31fde0
1 parent f4bfc90
commit d31fde0

Showing 20 changed files with 1,923 additions and 274 deletions.
diff --git a/README.md b/README.md
@@ -35,9 +35,8 @@ with the power of Large Multimodal Modals (LMMs) by:
 - Recording screenshots and associated user input
 - Aggregating and visualizing user input and recordings for development
 - Converting screenshots and user input into tokenized format
-- Generating synthetic input via transformer model completions
-- Generating task trees by analyzing recordings (work-in-progress)
-- Replaying synthetic input to complete tasks (work-in-progress)
+- Generating and replaying synthetic input via transformer model completions
+- Generating process graphs by analyzing recording logs (work-in-progress)
 
 The goal is similar to that of
 [Robotic Process Automation](https://en.wikipedia.org/wiki/Robotic_process_automation),
@@ -165,37 +164,6 @@ pointing the cursor and left or right clicking, as described in this
 [open issue](https://github.com/OpenAdaptAI/OpenAdapt/issues/145)
 
 
-### Capturing Browser Events
-
-To capture (record) browser events in Chrome, follow these steps:
-
-1. Go to: [Chrome Extension Page](chrome://extensions/)
-
-2. Enable `Developer mode` (located at the top right):
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/c97eb9fb-05d6-465d-85b3-332694556272)
-
-3. Click `Load unpacked` (located at the top left).
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/00c8adf5-074a-4655-b132-fd87644007fc)
-
-4. Select the `chrome_extension` directory:
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/71610ed3-f8d4-431a-9a22-d901127b7b0c)
-
-5. You should see the following confirmation, indicating that the extension is loaded:
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/7ee19da9-37e0-448f-b9ab-08ef99110e85)
-
-6. Set the flag to `true` if it is currently `false`:
-
-![image](https://github.com/user-attachments/assets/8eba24a3-7c68-4deb-8fbe-9d03cece1482)
-
-7. Start recording. Once recording begins, navigate to the Chrome browser, browse some pages, and perform a few clicks. Then, stop the recording and let it complete successfully.
-
-8. After recording, check the `openadapt.db` table `browser_event`. It should contain all your browser activity logs. You can verify the data's correctness using the `sqlite3` CLI or an extension like `SQLite Viewer` in VS Code to open `data/openadapt.db`.
-
-
 ### Visualize
 
 Quickly visualize the latest recording you created by running the following command:
@@ -243,6 +211,7 @@ Other replay strategies include:
 - [`StatefulReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/stateful.py): Early proof-of-concept which uses the OpenAI GPT-4 API with prompts constructed via OS-level window data.
 - (*)[`VisualReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py): Uses [Fast Segment Anything Model (FastSAM)](https://github.com/CASIA-IVA-Lab/FastSAM) to segment active window.
 - (*)[`VanillaReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/vanilla.py): Assumes the model is capable of directly reasoning on states and actions accurately. With future frontier models, we hope that this script will suddenly work a lot better.
+- (*)[`VisualBrowserReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual_browser.py): Like VisualReplayStrategy but generates segments from the visible DOM read by the browser extension.
 
 
 The (*) prefix indicates strategies which accept an "instructions" parameter that is used to modify the recording, e.g.:
@@ -253,6 +222,22 @@ python -m openadapt.replay VanillaReplayStrategy --instructions "calculate 9-8"
 
 See https://github.com/OpenAdaptAI/OpenAdapt/tree/main/openadapt/strategies for a complete list. More ReplayStrategies coming soon! (see [Contributing](#Contributing)).
 
+### Browser integration
+
+To record browser events in Google Chrome (required by the `BrowserReplayStrategy`), follow these steps:
+
+1. Go to your Chrome extensions page by entering [chrome://extensions](chrome://extensions/) in your address bar.
+
+2. Enable `Developer mode` (located at the top right).
+
+3. Click `Load unpacked` (located at the top left).
+
+4. Select the `chrome_extension` directory in the OpenAdapt repo.
+
+5. Make sure the Chrome extension is enabled (the switch to the right of the OpenAdapt extension widget is turned on).
+
+6. Set the `RECORD_BROWSER_EVENTS` flag to `true` in `openadapt/data/config.json`.
+
 ## Features
 
 ### State-of-the-art GUI understanding via [Segment Anything in High Quality](https://github.com/SysCV/sam-hq):
@@ -306,13 +291,6 @@ We're looking forward to your contributions. Let's build the future 🚀
 
 ## Contributing
 
-### Notable Works-in-progress (incomplete, see https://github.com/OpenAdaptAI/OpenAdapt/pulls and https://github.com/OpenAdaptAI/OpenAdapt/issues/ for more)
-
-- [Video Recording Hardware Acceleration](https://github.com/OpenAdaptAI/OpenAdapt/issues/570) (help wanted)
-- [Audio Narration](https://github.com/OpenAdaptAI/OpenAdapt/pull/346) (help wanted)
-- [Chrome Extension](https://github.com/OpenAdaptAI/OpenAdapt/pull/364) (help wanted)
-- [Gemini Vision](https://github.com/OpenAdaptAI/OpenAdapt/issues/551) (help wanted)
-
 ### Replay Problem Statement
 
 Our goal is to automate the task described and demonstrated in a `Recording`.

diff --git a/chrome_extension/background.js b/chrome_extension/background.js
@@ -1,33 +1,28 @@
 /**
  * @file background.js
- * @description Creates a new background script that listens for messages from the content script
- * and sends them to a WebSocket server.
-*/
+ * @description Background script that maintains the current mode and communicates with content scripts.
+ */
 
 let socket;
+let currentMode = null; // Maintain the current mode here
 let timeOffset = 0; // Global variable to store the time offset
 
-/* 
- * TODO: 
-  * Ideally we read `WS_SERVER_PORT`, `WS_SERVER_ADDRESS` and 
-  * `RECONNECT_TIMEOUT_INTERVAL` from config.py, 
-  * or it gets passed in somehow. 
-*/
+/*
+ * Note: these need to match the corresponding values in config[.defaults].json
+ */
 let RECONNECT_TIMEOUT_INTERVAL = 1000; // ms
 let WS_SERVER_PORT = 8765;
 let WS_SERVER_ADDRESS = "localhost";
 let WS_SERVER_URL = "ws://" + WS_SERVER_ADDRESS + ":" + WS_SERVER_PORT;
 
-
 function socketSend(socket, message) {
   console.log({ message });
   socket.send(JSON.stringify(message));
 }
 
-
 /*
  * Function to connect to the WebSocket server.
-*/
+ */
 function connectWebSocket() {
   socket = new WebSocket(WS_SERVER_URL);
 
@@ -38,11 +33,34 @@ function connectWebSocket() {
   socket.onmessage = function(event) {
     console.log("Message from server:", event.data);
     const message = JSON.parse(event.data);
+
+    // Handle mode messages
+    if (message.type === 'SET_MODE') {
+      currentMode = message.mode; // Update the current mode
+      console.log(`Mode set to: ${currentMode}`);
+
+      // Send the mode to all active tabs
+      chrome.tabs.query(
+        {
+          active: true,
+        },
+        function(tabs) {
+          tabs.forEach(function(tab) {
+            chrome.tabs.sendMessage(tab.id, message, function(response) {
+              if (chrome.runtime.lastError) {
+                console.error("Error sending message to content script in tab " + tab.id, chrome.runtime.lastError.message);
+              } else {
+                console.log("Message sent to content script in tab " + tab.id, response);
+              }
+            });
+          });
+        }
+      );
+    }
   };
 
   socket.onclose = function(event) {
     console.log("WebSocket connection closed", event);
-    // Reconnect after 5 seconds if the connection is lost
     setTimeout(connectWebSocket, RECONNECT_TIMEOUT_INTERVAL);
   };
 
@@ -66,3 +84,32 @@ chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
     sendResponse({ status: "WebSocket connection not open" });
   }
 });
+
+/* Listen for tab activation */
+chrome.tabs.onActivated.addListener((activeInfo) => {
+  // Send current mode to the newly active tab if it's not null
+  if (currentMode) {
+    const message = { type: 'SET_MODE', mode: currentMode };
+    chrome.tabs.sendMessage(activeInfo.tabId, message, function(response) {
+      if (chrome.runtime.lastError) {
+        console.error("Error sending message to content script in tab " + activeInfo.tabId, chrome.runtime.lastError.message);
+      } else {
+        console.log("Message sent to content script in tab " + activeInfo.tabId, response);
+      }
+    });
+  }
+});
+
+/* Listen for tab updates to handle new pages or reloading */
+chrome.tabs.onUpdated.addListener((tabId, changeInfo, tab) => {
+  if (changeInfo.status === 'complete' && currentMode) {
+    const message = { type: 'SET_MODE', mode: currentMode };
+    chrome.tabs.sendMessage(tabId, message, function(response) {
+      if (chrome.runtime.lastError) {
+        console.error("Error sending message to content script in tab " + tabId, chrome.runtime.lastError.message);
+      } else {
+        console.log("Message sent to content script in tab " + tabId, response);
+      }
+    });
+  }
+});
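
Review note: the `SET_MODE` forwarding logic above appears three times in `background.js` (in `socket.onmessage`, `onActivated`, and `onUpdated`). A minimal sketch of how it could be consolidated follows. The helper name `sendModeToTab` and the injected `chromeApi` parameter are hypothetical and not part of this commit; the injection is only there so the function can be exercised outside the browser.

```javascript
// Hypothetical helper (not part of this commit): forwards the current mode to one tab.
// `chromeApi` is assumed to expose the same shape as the global `chrome` object
// (tabs.sendMessage and runtime.lastError); in the extension you would pass `chrome`.
function sendModeToTab(chromeApi, tabId, mode) {
  if (!mode) {
    // Nothing to forward until the server has sent a SET_MODE message.
    return false;
  }
  const message = { type: "SET_MODE", mode: mode };
  chromeApi.tabs.sendMessage(tabId, message, function (response) {
    if (chromeApi.runtime.lastError) {
      console.error(
        "Error sending message to content script in tab " + tabId,
        chromeApi.runtime.lastError.message
      );
    } else {
      console.log("Message sent to content script in tab " + tabId, response);
    }
  });
  return true;
}
```

Under these assumptions, the `onActivated` listener could reduce to a single call such as `sendModeToTab(chrome, activeInfo.tabId, currentMode)`, and the `onUpdated` listener to the same call guarded by `changeInfo.status === 'complete'`.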