feat(browser.py): add BrowserReplayStrategy; support browser modes record/replay (#872)

* add BrowserReplayStrategy; support browser modes record/replay

* minor refactor

* black/flake8

* update README

* improve README

* add BrowserReplayStrategy to README

* add strategies/visual_browser.py

* fix Action.from_dict and test_action_from_dict to support <cmd>-t

* calculate_tokens_and_cost; bugfix ActionEvent.fromdict; add ActionEvent.next_event; add TODOs; add visual_browser.py::SKIP_MOVE_BEFORE_CLICK

* handle mousemove/scroll; add_screen_tlbr forwards and backwards; RAW_PRECISE/IMPRECISE_MOUSE_EVENTS; openai.MAX_IMAGES = 90; fix merge_consecutive_mouse_scroll_events and tests; filter_invalid_window_events; dump_state timeout;

* add TODO

* noqa
abrichr authored Oct 24, 2024
1 parent f4bfc90 commit d31fde0
Showing 20 changed files with 1,923 additions and 274 deletions.
60 changes: 19 additions & 41 deletions README.md
@@ -35,9 +35,8 @@ with the power of Large Multimodal Modals (LMMs) by:
- Recording screenshots and associated user input
- Aggregating and visualizing user input and recordings for development
- Converting screenshots and user input into tokenized format
- Generating synthetic input via transformer model completions
- Generating task trees by analyzing recordings (work-in-progress)
- Replaying synthetic input to complete tasks (work-in-progress)
- Generating and replaying synthetic input via transformer model completions
- Generating process graphs by analyzing recording logs (work-in-progress)

The goal is similar to that of
[Robotic Process Automation](https://en.wikipedia.org/wiki/Robotic_process_automation),
@@ -165,37 +164,6 @@ pointing the cursor and left or right clicking, as described in this
[open issue](https://github.com/OpenAdaptAI/OpenAdapt/issues/145)


### Capturing Browser Events

To capture (record) browser events in Chrome, follow these steps:

1. Go to: [Chrome Extension Page](chrome://extensions/)

2. Enable `Developer mode` (located at the top right):

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/c97eb9fb-05d6-465d-85b3-332694556272)

3. Click `Load unpacked` (located at the top left).

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/00c8adf5-074a-4655-b132-fd87644007fc)

4. Select the `chrome_extension` directory:

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/71610ed3-f8d4-431a-9a22-d901127b7b0c)

5. You should see the following confirmation, indicating that the extension is loaded:

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/7ee19da9-37e0-448f-b9ab-08ef99110e85)

6. Set the flag to `true` if it is currently `false`:

![image](https://github.com/user-attachments/assets/8eba24a3-7c68-4deb-8fbe-9d03cece1482)

7. Start recording. Once recording begins, navigate to the Chrome browser, browse some pages, and perform a few clicks. Then, stop the recording and let it complete successfully.

8. After recording, check the `openadapt.db` table `browser_event`. It should contain all your browser activity logs. You can verify the data's correctness using the `sqlite3` CLI or an extension like `SQLite Viewer` in VS Code to open `data/openadapt.db`.


### Visualize

Quickly visualize the latest recording you created by running the following command:
@@ -243,6 +211,7 @@ Other replay strategies include:
- [`StatefulReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/stateful.py): Early proof-of-concept which uses the OpenAI GPT-4 API with prompts constructed via OS-level window data.
- (*)[`VisualReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py): Uses [Fast Segment Anything Model (FastSAM)](https://github.com/CASIA-IVA-Lab/FastSAM) to segment active window.
- (*)[`VanillaReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/vanilla.py): Assumes the model is capable of directly reasoning on states and actions accurately. With future frontier models, we hope that this script will suddenly work a lot better.
- (*)[`VisualBrowserReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual_browser.py): Like VisualReplayStrategy but generates segments from the visible DOM read by the browser extension.


The (*) prefix indicates strategies which accept an "instructions" parameter that is used to modify the recording, e.g.:
@@ -253,6 +222,22 @@ python -m openadapt.replay VanillaReplayStrategy --instructions "calculate 9-8"

See https://github.com/OpenAdaptAI/OpenAdapt/tree/main/openadapt/strategies for a complete list. More ReplayStrategies coming soon! (see [Contributing](#Contributing)).

### Browser integration

To record browser events in Google Chrome (required by the `BrowserReplayStrategy`), follow these steps:

1. Go to your Chrome extensions page by entering [chrome://extensions](chrome://extensions/) in your address bar.

2. Enable `Developer mode` (located at the top right).

3. Click `Load unpacked` (located at the top left).

4. Select the `chrome_extension` directory in the OpenAdapt repo.

5. Make sure the Chrome extension is enabled (the switch to the right of the OpenAdapt extension widget is turned on).

6. Set the `RECORD_BROWSER_EVENTS` flag to `true` in `openadapt/data/config.json`.

## Features

### State-of-the-art GUI understanding via [Segment Anything in High Quality](https://github.com/SysCV/sam-hq):
@@ -306,13 +291,6 @@ We're looking forward to your contributions. Let's build the future 🚀

## Contributing

### Notable Works-in-progress (incomplete, see https://github.com/OpenAdaptAI/OpenAdapt/pulls and https://github.com/OpenAdaptAI/OpenAdapt/issues/ for more)

- [Video Recording Hardware Acceleration](https://github.com/OpenAdaptAI/OpenAdapt/issues/570) (help wanted)
- [Audio Narration](https://github.com/OpenAdaptAI/OpenAdapt/pull/346) (help wanted)
- [Chrome Extension](https://github.com/OpenAdaptAI/OpenAdapt/pull/364) (help wanted)
- [Gemini Vision](https://github.com/OpenAdaptAI/OpenAdapt/issues/551) (help wanted)

### Replay Problem Statement

Our goal is to automate the task described and demonstrated in a `Recording`.
73 changes: 60 additions & 13 deletions chrome_extension/background.js
@@ -1,33 +1,28 @@
/**
* @file background.js
* @description Creates a new background script that listens for messages from the content script
* and sends them to a WebSocket server.
*/
* @description Background script that maintains the current mode and communicates with content scripts.
*/

let socket;
let currentMode = null; // Maintain the current mode here
let timeOffset = 0; // Global variable to store the time offset

/*
* TODO:
* Ideally we read `WS_SERVER_PORT`, `WS_SERVER_ADDRESS` and
* `RECONNECT_TIMEOUT_INTERVAL` from config.py,
* or it gets passed in somehow.
*/
/*
* Note: these need to match the corresponding values in config[.defaults].json
*/
let RECONNECT_TIMEOUT_INTERVAL = 1000; // ms
let WS_SERVER_PORT = 8765;
let WS_SERVER_ADDRESS = "localhost";
let WS_SERVER_URL = "ws://" + WS_SERVER_ADDRESS + ":" + WS_SERVER_PORT;


function socketSend(socket, message) {
console.log({ message });
socket.send(JSON.stringify(message));
}


/*
* Function to connect to the WebSocket server.
*/
*/
function connectWebSocket() {
socket = new WebSocket(WS_SERVER_URL);

@@ -38,11 +33,34 @@ function connectWebSocket() {
socket.onmessage = function(event) {
console.log("Message from server:", event.data);
const message = JSON.parse(event.data);

// Handle mode messages
if (message.type === 'SET_MODE') {
currentMode = message.mode; // Update the current mode
console.log(`Mode set to: ${currentMode}`);

// Send the mode to all active tabs
chrome.tabs.query(
{
active: true,
},
function(tabs) {
tabs.forEach(function(tab) {
chrome.tabs.sendMessage(tab.id, message, function(response) {
if (chrome.runtime.lastError) {
console.error("Error sending message to content script in tab " + tab.id, chrome.runtime.lastError.message);
} else {
console.log("Message sent to content script in tab " + tab.id, response);
}
});
});
}
);
}
};

socket.onclose = function(event) {
console.log("WebSocket connection closed", event);
// Reconnect after 5 seconds if the connection is lost
setTimeout(connectWebSocket, RECONNECT_TIMEOUT_INTERVAL);
};

@@ -66,3 +84,32 @@ chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
sendResponse({ status: "WebSocket connection not open" });
}
});

/* Listen for tab activation */
chrome.tabs.onActivated.addListener((activeInfo) => {
// Send current mode to the newly active tab if it's not null
if (currentMode) {
const message = { type: 'SET_MODE', mode: currentMode };
chrome.tabs.sendMessage(activeInfo.tabId, message, function(response) {
if (chrome.runtime.lastError) {
console.error("Error sending message to content script in tab " + activeInfo.tabId, chrome.runtime.lastError.message);
} else {
console.log("Message sent to content script in tab " + activeInfo.tabId, response);
}
});
}
});

/* Listen for tab updates to handle new pages or reloading */
chrome.tabs.onUpdated.addListener((tabId, changeInfo, tab) => {
if (changeInfo.status === 'complete' && currentMode) {
const message = { type: 'SET_MODE', mode: currentMode };
chrome.tabs.sendMessage(tabId, message, function(response) {
if (chrome.runtime.lastError) {
console.error("Error sending message to content script in tab " + tabId, chrome.runtime.lastError.message);
} else {
console.log("Message sent to content script in tab " + tabId, response);
}
});
}
});
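
Note: the `SET_MODE` branch in `socket.onmessage` and the `onActivated` and `onUpdated` listeners in background.js all repeat the same `chrome.tabs.sendMessage` call and `chrome.runtime.lastError` check. A minimal sketch of how that logic could be shared follows. It is not part of this commit: `buildSetModeMessage` and `sendModeToTab` are hypothetical names, and `tabsApi`, `runtimeApi`, and `log` are assumed stand-ins for `chrome.tabs`, `chrome.runtime`, and `console` so the function can be exercised outside the browser.

```javascript
// Hypothetical helper, not part of this commit.
// Builds the same message shape that background.js forwards to content scripts.
function buildSetModeMessage(mode) {
  return { type: "SET_MODE", mode: mode };
}

// Sends the current mode to one tab and logs the outcome.
// `tabsApi` / `runtimeApi` are assumed stand-ins for chrome.tabs / chrome.runtime.
// Returns false (and sends nothing) while no mode has been set yet.
function sendModeToTab(tabsApi, runtimeApi, tabId, mode, log = console) {
  if (!mode) {
    return false;
  }
  const message = buildSetModeMessage(mode);
  tabsApi.sendMessage(tabId, message, function (response) {
    if (runtimeApi.lastError) {
      log.error(
        "Error sending message to content script in tab " + tabId,
        runtimeApi.lastError.message
      );
    } else {
      log.log("Message sent to content script in tab " + tabId, response);
    }
  });
  return true;
}
```

Under these assumptions, the tab-activation listener would reduce to `sendModeToTab(chrome.tabs, chrome.runtime, activeInfo.tabId, currentMode)`, and the tab-update listener to the same call guarded by `changeInfo.status === 'complete'`.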