Request for Comments: OpenAdapt Architecture #552

abrichr · 2023-12-19T00:56:17Z

abrichr
Dec 19, 2023
Maintainer

We are inviting the community to comment on our proposed approach to AI-First Process Automation:

https://github.com/OpenAdaptAI/OpenAdapt/wiki/OpenAdapt-Architecture-(draft)

Please feel free to point out limitations and/or suggest alternatives. Thank you for your contributions!

(Please also note that this is a living document and is still being revised.)

atineoSE · 2023-12-19T09:34:15Z

atineoSE
Dec 19, 2023

This is really powerful, @abrichr, thanks for opening this to discussion.

As far as the actual architecture is concerned, I have no major comments to add. Congratulations on such an elaborate piece!

In terms of driving development progress and adoption, I would add a few comments, if that helps:

Adding actual examples of the kind of problems the approach solves, including screenshots, even videos, because it would more immediately motivate its value and potential.
Putting together an initial set of test scenarios would serve as initial benchmarks, and would make it easier to evaluate improvements over a baseline.
There are multiple state-of-the-art technologies being combined, each of which is arguably not yet fully understood. Putting them all together in a sequential fashion may make it really hard to figure out when something's not working as expected. This is the same problem that agentic scenarios are facing today. Coming up with ways to isolate such steps and work them out in isolation would greatly ease development.
It could also be hard to set up the project with multiple models possibly being deployed in different instances. For instance, local models are mentioned as well as EC2 deployments. It would be helpful to have all deployments consolidated for easier setup. In this regard, some configurations could be pre-defined, and expanded upon as requested by the community.
Each step may need different skill sets and may attract a different profile of contributor. It could be helpful to more clearly formulate each step, in order to make them more self-contained and more clearly state their needs. For instance, scrubbing PHI/PII needs familiarity with privacy practices and how to identify such info in images, whereas troubleshooting CoC prompting requires familiarity with LLM prompting and coding in general. In this example, two very different skill sets could be applied.
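
On the point about isolating steps, here is a minimal sketch of one way to do it, assuming each stage can be wrapped as a function from input to output. The `Step`, `run_pipeline`, and `replay_step` names and the three toy stages are hypothetical and are not taken from the OpenAdapt codebase. The idea is to record every stage's input and output during a full run, so that any single stage can later be re-run on its recorded input without the rest of the chain.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    """One named stage of the pipeline (hypothetical structure)."""
    name: str
    run: Callable[[Any], Any]


def run_pipeline(steps, data, trace=None):
    """Run steps in order; optionally record each step's input and output."""
    for step in steps:
        output = step.run(data)
        if trace is not None:
            trace.append({"step": step.name, "input": data, "output": output})
        data = output
    return data


def replay_step(step, trace):
    """Re-run a single step on its recorded input and compare with the recorded output."""
    for entry in trace:
        if entry["step"] == step.name:
            return step.run(entry["input"]) == entry["output"]
    raise KeyError(step.name)


# Toy stand-ins for the real stages (scrubbing, segmentation, prompting).
steps = [
    Step("scrub", lambda text: text.replace("555-0100", "<PHONE>")),
    Step("segment", lambda text: text.split()),
    Step("prompt", lambda tokens: "Next action given: " + ", ".join(tokens)),
]

trace = []
result = run_pipeline(steps, "call 555-0100 now", trace)
```

With a trace saved to disk, a contributor working only on segmentation could iterate on that stage against recorded inputs, without needing the scrubbing or prompting stages to be deployed.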

1 reply

abrichr Jan 3, 2024
Maintainer Author

Thank you for the feedback, @atineoSE!

Adding actual examples of the kind of problems the approach solves, including screenshots, even videos, because it would more immediately motivate its value and potential.

Agreed. I like your proposal of copying data from a spreadsheet into a webform:

Putting together an initial set of test scenarios would serve as initial benchmarks, and would make it easier to evaluate improvements over a baseline.

Agreed. I believe the aforementioned task can serve as the first test. I will create a recording and submit it in a PR.
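
A minimal sketch of how such a test scenario might be scored against a baseline is shown below. It assumes the spreadsheet-to-web-form task can be reduced to comparing the submitted form values with the expected ones. `Scenario`, `score`, `baseline`, and `candidate` are hypothetical names used only for illustration, and the two "automations" are plain functions standing in for real replay strategies.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """A test scenario: spreadsheet rows in, expected form submissions out."""
    name: str
    rows: list
    expected: list


def score(scenario, automate):
    """Return the fraction of expected form submissions reproduced exactly."""
    actual = automate(scenario.rows)
    matches = sum(1 for a, e in zip(actual, scenario.expected) if a == e)
    return matches / len(scenario.expected)


def baseline(rows):
    # Naive stand-in: copies the name column but leaves the email field blank.
    return [{"name": r["name"], "email": ""} for r in rows]


def candidate(rows):
    # Improved stand-in: copies both columns.
    return [{"name": r["name"], "email": r["email"]} for r in rows]


rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": ""},
]
scenario = Scenario("spreadsheet_to_webform", rows, expected=[dict(r) for r in rows])

baseline_score = score(scenario, baseline)
candidate_score = score(scenario, candidate)
```

Keeping the scoring function independent of the automation under test means the same scenario can be reused as the approach changes.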

There are multiple state-of-the-art technologies being combined, each of which is arguably not yet fully understood. Putting them all together in a sequential fashion may make it really hard to figure out when something's not working as expected. This is the same problem that agentic scenarios are facing today. Coming up with ways to isolate such steps and work them out in isolation would greatly ease development.

This is a good point, and part of the reason for the visualize functionality. This will likely need to be expanded to include prompt debugging.

It could also be hard to set up the project with multiple models possibly being deployed in different instances. For instance, local models are mentioned as well as EC2 deployments. It would be helpful to have all deployments consolidated for easier setup. In this regard, some configurations could be pre-defined, and expanded upon as requested by the community.

Another good point. The motivation behind multiple models was to support a fully offline approach, with improved performance via models hosted on larger machines. However, upon reflection, I believe we should focus on hosted models only, so that we can use state-of-the-art models regardless of the capabilities of the desktop machine.

Each step may need different skill sets and may attract a different profile of contributor. It could be helpful to more clearly formulate each step, in order to make them more self-contained and more clearly state their needs. For instance, scrubbing PHI/PII needs familiarity with privacy practices and how to identify such info in images, whereas troubleshooting CoC prompting requires familiarity with LLM prompting and coding in general. In this example, two very different skill sets could be applied.

Thank you for pointing this out. I added a description of the roles we are looking for in the README: https://github.com/OpenAdaptAI/OpenAdapt?tab=readme-ov-file#-open-contract-positions-at-openadaptai

LaPetiteSouris · 2023-12-22T14:40:58Z

LaPetiteSouris
Dec 22, 2023

Thanks a lot for putting all of this together. What an effort!

I will try to review this in multiple parts, as it is too complex to cover in one pass.

Overall:

The decoupling of steps makes total sense. In a previous discussion a long time ago, I mentioned that maybe we need a "chain of actions" for each step; now, with the "chain-of-code" logic, this makes sense.

I think that, reasoning with a "divide-and-conquer" mindset, the problem can be split into small steps such that:

At each moment, we try to describe the current input (current Node in the Graph Action), then deduce the next step (next Action Node).
Instead of directly taking the action, as you mention, maybe it is a good idea to show the action and ask for user feedback if needed. The LLM should keep proposing actions until the user validates one, and only then move to the next node.

I think the problem above is the nucleus of all the chain. If we manage to solve this, then the whole action chain can be composed little by little.
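
A minimal sketch of this base case, assuming the model and the user can each be represented as a callable, might look like the following. `next_confirmed_action`, `stub_propose`, and `stub_confirm` are hypothetical names; in practice `propose` would call an LLM and `confirm` would prompt the user.

```python
def next_confirmed_action(state, propose, confirm, max_attempts=3):
    """Ask for proposed actions until one is approved, or give up.

    `propose(state, rejected)` returns a candidate action; `confirm(state, action)`
    returns True if the user validates it. Rejected actions are passed back to
    `propose` so the same suggestion is not repeated.
    """
    rejected = []
    for _ in range(max_attempts):
        action = propose(state, rejected)
        if confirm(state, action):
            return action
        rejected.append(action)
    return None  # no action was validated within the attempt budget


def stub_propose(state, rejected):
    # Stand-in for the model: returns the first candidate not yet rejected.
    candidates = ["click Cancel", "click Submit"]
    return next(c for c in candidates if c not in rejected)


def stub_confirm(state, action):
    # Stand-in for the user: only approves submitting the form.
    return action == "click Submit"


action = next_confirmed_action("form filled", stub_propose, stub_confirm)
gave_up = next_confirmed_action("form filled", stub_propose, stub_confirm, max_attempts=1)
```

Chaining calls to this function, one per node, would compose the full action chain step by step.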

If I understand correctly, this is also what you mean with the graph and the description.

Now, focusing on this base case, we need to discuss the input in detail. The abstract formula is: given a screen, represent it so that the model can understand it. Then the model can "reason" in Chain of Code to produce the next action.

The key points to discuss here are:

What input to give
How to translate the inputs into a format that makes sense to the model. I am not talking about the embedding, which is a technical detail. Rather, we likely need some kind of "ontology" to represent "the world" to the models. IMHO, inputs like x, y coordinates and mouse buttons do not make much sense on their own, as they make it difficult to capture the "meaning" of the current screen.

If we fail on the two points above, we will get hallucinations, as the LLM can give us anything.

So my thought is that we must somehow find a way to "translate" the UI using our own ontology, to feed as input into the Chain-of-Thought models. With this solved, we can start chaining things together and think about more complex cases, such as fine-tuning (where the decision on the next action depends not only on the current view, but also on the previous views and their order, which together represent the "intent" of the user).

2 replies

LaPetiteSouris Dec 22, 2023

To elaborate more on my idea about 'ontology'.

Here, with https://github.com/SysCV/sam-hq, we want to segment screenshots to understand the semantic meaning of each step. That's a great point. But without a definition of what we expect (buttons, text fields, text inputs, etc.), the segmentation may fail to "translate" the screenshot into meaningful items that help the LLM understand the intent of the current image.

Maybe in the first MVP we need to limit ourselves to a small scope, such as: given a screenshot, use Segment Anything to find a sequence of text and buttons. In this case, text and buttons are part of our ontology, which later helps the LLM understand its world. This ontology dictionary can keep growing as the use cases get more and more complex.

My gut feeling says that this translation process will be the key to helping the LLM understand our problem.
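
As a rough sketch of what such a small-scope ontology could look like, the snippet below defines only two element types, text and button, and renders a list of detected elements as text for a prompt. All names here (`ElementType`, `UIElement`, `describe_screen`) are hypothetical. The sketch also assumes that some earlier stage has already turned segmentation masks into typed, labeled elements with bounding boxes, which segmentation alone does not provide.

```python
from dataclasses import dataclass
from enum import Enum


class ElementType(Enum):
    """The ontology for a first MVP: only text and buttons."""
    TEXT = "text"
    BUTTON = "button"


@dataclass(frozen=True)
class UIElement:
    type: ElementType
    label: str
    bbox: tuple  # (left, top, right, bottom) in pixels


def describe_screen(elements):
    """Render elements top-to-bottom, left-to-right, as numbered lines for a prompt."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    return "\n".join(
        f"{i}. {e.type.value}: {e.label!r}" for i, e in enumerate(ordered, 1)
    )


elements = [
    UIElement(ElementType.BUTTON, "Submit", (10, 200, 90, 230)),
    UIElement(ElementType.TEXT, "Email address", (10, 40, 200, 60)),
]
description = describe_screen(elements)
```

Growing the ontology would then amount to adding members to `ElementType`, while the prompt-rendering code stays unchanged.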

abrichr Jan 12, 2024
Maintainer Author

Thank you for the detailed feedback, @LaPetiteSouris! And thank you for your patience while I have been considering it.

Instead of directly taking the action, as you mention, maybe it is a good idea to show the action and ask for user feedback if needed. The LLM should keep proposing actions until the user validates one, and only then move to the next node.

I agree that it would be useful to have this "explicit confirmation" mode. However, to be clear, ultimately we want the model to take actions autonomously as long as it is confident that it knows what action to take.

I think the problem above is the nucleus of all the chain. If we manage to solve this, then the whole action chain can be composed little by little.
If I understand correctly, this is also what you mean with the graph and the description.

I believe that as long as we can have the model 1) correctly define completion criteria for each step, and 2) reliably determine whether the completion criteria have been satisfied, then we can be confident in the autonomous capability.

Notably, we can verify whether both of these hold true by evaluating completion criteria on existing recordings.
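
A minimal sketch of such an evaluation is given below, assuming a recording can be reduced to a list of steps with a state before and after each one. The recording format and the function names are hypothetical, and the criteria are hand-written predicates standing in for criteria that the model would generate. One extra assumption: a criterion counts as useful only if it is false before the step and true after it, since a criterion that already holds beforehand cannot signal completion.

```python
def evaluate_criteria(recording, criteria):
    """Check each step's completion criterion on the states before and after the step."""
    results = []
    for step, criterion in zip(recording, criteria):
        results.append({
            "step": step["name"],
            "held_before": criterion(step["before"]),
            "held_after": criterion(step["after"]),
        })
    return results


def criteria_are_reliable(results):
    """True if every criterion is unmet before its step and met after it."""
    return all(r["held_after"] and not r["held_before"] for r in results)


# Hypothetical recording of two steps, with simplified state dictionaries.
recording = [
    {"name": "type email", "before": {"email": ""}, "after": {"email": "ada@example.com"}},
    {"name": "submit", "before": {"submitted": False}, "after": {"submitted": True}},
]

# Hand-written stand-ins for model-generated completion criteria.
criteria = [
    lambda state: state.get("email", "") != "",
    lambda state: state.get("submitted", False),
]

results = evaluate_criteria(recording, criteria)
reliable = criteria_are_reliable(results)
```

Running this over many existing recordings would give a rough measure of how often generated criteria can be trusted before enabling autonomous mode.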

What input to give

task description
composite events
composite event screenshots

How to translate the inputs into a format that makes sense to the model. I am not talking about the embedding, which is a technical detail. Rather, we likely need some kind of "ontology" to represent "the world" to the models. IMHO, inputs like x, y coordinates and mouse buttons do not make much sense on their own, as they make it difficult to capture the "meaning" of the current screen.

This is an interesting point and one that we likely need to iterate on. Perhaps we could have the model generate its own ontology that is appropriate for the task at hand, or re-use the ontology used in https://github.com/THUDM/CogVLM.

What do you think?
