Paper | Model Instruction | Framework | Installation | Train | Benchmarks | Acknowledgement
- [TODO]: Update data and code.
- [03.2024] The xLAM model is released! Try it with the AgentLite benchmark or other benchmarks; its performance is comparable to GPT-4!
- [02.2024] Initial Release of AgentOhana and xLAM paper!
This repo is for research purposes only.
Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories.
This repo introduces xLAM, which aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Building on this unified data, our training pipeline balances the different data sources and preserves independent randomness across devices during dataset partitioning and model training.
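The two pipeline properties described above (balanced sampling across data sources and independent randomness per device) can be illustrated with a minimal sketch. This is not the repo's `fm_utils` implementation: the helper names `init_device_seed_sketch` and `interleave_sketch`, the `seed + rank` scheme, and the toy source names are all assumptions made for illustration.

```python
import random
from collections import Counter


def init_device_seed_sketch(seed: int, rank: int) -> int:
    """Hypothetical helper: derive a distinct but reproducible seed per device rank."""
    return seed + rank


def interleave_sketch(sources, sample_probs, num_samples, seed):
    """Mix examples from several sources, picking each source with its given probability.

    sources: dict mapping a source name to a list of examples.
    sample_probs: one sampling weight per source, in the same order as the dict.
    """
    if len(sources) != len(sample_probs):
        raise ValueError("need exactly one probability per source")
    rng = random.Random(seed)  # local RNG, so devices do not share random state
    names = list(sources)
    cursors = {name: 0 for name in names}
    mixed = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=sample_probs, k=1)[0]
        items = sources[name]
        # Cycle through a source again once it is exhausted.
        mixed.append((name, items[cursors[name] % len(items)]))
        cursors[name] += 1
    return mixed


# Toy data standing in for two agent environments (names are made up).
toy_sources = {
    "env_a": [f"a-{i}" for i in range(10)],
    "env_b": [f"b-{i}" for i in range(10)],
}

for rank in range(2):
    device_seed = init_device_seed_sketch(seed=42, rank=rank)
    batch = interleave_sketch(toy_sources, [0.75, 0.25], 4000, device_seed)
    print(rank, Counter(name for name, _ in batch))
```

Each simulated device draws roughly the same proportion from every source, yet follows its own random sequence because its seed differs.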
If you already know Mixtral, xLAM-v0.1 is a significant upgrade and better at many things. With the same number of parameters, the model has been fine-tuned across a wide range of agent tasks and scenarios while preserving the capabilities of the original model.
xLAM-v0.1-r is version 0.1 of the Large Action Model series; the "-r" suffix indicates that it is tagged for research. The model is compatible with the vLLM and FastChat platforms.
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/xLAM-v0.1-r")
model = AutoModelForCausalLM.from_pretrained("Salesforce/xLAM-v0.1-r", device_map="auto")
messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Note: You may need to tune the Temperature setting for different applications. Typically, a lower Temperature is helpful for tasks that require deterministic outcomes. Additionally, for tasks that demand adherence to specific formats or function calls, explicitly including formatting instructions in the prompt is advisable.
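To see why a lower Temperature gives more deterministic outputs, the sketch below applies temperature scaling to a softmax over a few made-up logits. The function name and the logit values are illustrative assumptions, not part of xLAM; in Hugging Face `generate`, the `temperature` argument only takes effect when sampling is enabled with `do_sample=True`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
example_logits = [2.0, 1.0, 0.5]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the highest-scoring token, so sampling behaves more like greedy decoding.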
from fm_datasets import webshop_multi_turn_v2
from fm_utils.seed_random import init_device_seed
from fm_utils.interleave_datasets import interleave_data
sft_webshop_multi_turn = webshop_multi_turn_v2.SFTWebShopMultiTurnV2(tokenizer, script_args)
seed = init_device_seed(seed=42)
train_dataset, eval_dataset = \
    interleave_data(
        data_objects=[sft_webshop_multi_turn],
        sample_probs=[1.0],
        return_type="prompt_answer",
        seq_length=4096,
        seed=seed)
from fm_utils.derived_data_collator import DataCollatorForPromptAnswer
from fm_trainers.sft_foundation_trainer import SFTFoundationTrainer
collator = DataCollatorForPromptAnswer(
    instruction_template=instruction_template_ids,
    response_template=response_template_ids,
    tokenizer=tokenizer,
    mlm=False)
trainer = SFTFoundationTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    packing=False,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=collator,
)
trainer.train()
You can use our configured docker environment `gcr.io/salesforce-research-internal/xlam-2024-02-14`; an example yaml file is provided in `envs_config`.
Then, run `pip install -e . --no-dependencies`.
Alternatively, you can run `pip install -e .` directly. There is a chance that this causes errors in your configured environment.
You can refer to the complete example scripts to learn more details.
Alternatively, run this bash script for a quick start with our example:
nohup accelerate launch --config_file xLAM/train/scripts/multi_gpu.yaml xLAM/train/scripts/sft_mixtral8X7B_accelerator.py --model_name mistralai/Mixtral-8x7B-Instruct-v0.1 --seq_length 4096 --run_name sft_mixtral8X7B_v2_02072024 --output_dir {path} > sft_mixtral8X7B_v2_02072024.nohup 2>&1 &
LLM Name | ZS | ZST | ReAct | PlanAct | PlanReAct | BOLAA |
---|---|---|---|---|---|---|
Llama-2-70B-chat | 0.0089 | 0.0102 | 0.4273 | 0.2809 | 0.3966 | 0.4986 |
Vicuna-33B | 0.1527 | 0.2122 | 0.1971 | 0.3766 | 0.4032 | 0.5618 |
Mixtral-8x7B-Instruct-v0.1 | 0.4634 | 0.4592 | 0.5638 | 0.4738 | 0.3339 | 0.5342 |
GPT-3.5-Turbo | 0.4851 | 0.5058 | 0.5047 | 0.4930 | 0.5436 | 0.6354 |
GPT-3.5-Turbo-Instruct | 0.3785 | 0.4195 | 0.4377 | 0.3604 | 0.4851 | 0.5811 |
GPT-4-0613 | 0.5002 | 0.4783 | 0.4616 | 0.7950 | 0.4635 | 0.6129 |
xLAM-v0.1-r | 0.5201 | 0.5268 | 0.6486 | 0.6573 | 0.6611 | 0.6556 |
LLM Name | ZS | ZST | ReAct | PlanAct | PlanReAct |
---|---|---|---|---|---|
Mixtral-8x7B-Instruct-v0.1 | 0.3912 | 0.3971 | 0.3714 | 0.3195 | 0.3039 |
GPT-3.5-Turbo | 0.4196 | 0.3937 | 0.3868 | 0.4182 | 0.3960 |
GPT-4-0613 | 0.5801 | 0.5709 | 0.6129 | 0.5778 | 0.5716 |
xLAM-v0.1-r | 0.5492 | 0.4776 | 0.5020 | 0.5583 | 0.5030 |
Please note: All prompts provided by AgentLite are considered "unseen prompts" for xLAM-v0.1-r, meaning the model has not been trained with data related to these prompts.
LLM Name | Act | ReAct | BOLAA |
---|---|---|---|
GPT-3.5-Turbo-16k | 0.6158 | 0.6005 | 0.6652 |
GPT-4-0613 | 0.6989 | 0.6732 | 0.7154 |
xLAM-v0.1-r | 0.6563 | 0.6640 | 0.6854 |
LLM Name | Easy F1 Score | Easy Accuracy | Medium F1 Score | Medium Accuracy | Hard F1 Score | Hard Accuracy |
---|---|---|---|---|---|---|
GPT-3.5-Turbo-16k-0613 | 0.410 | 0.350 | 0.330 | 0.25 | 0.283 | 0.20 |
GPT-4-0613 | 0.611 | 0.47 | 0.610 | 0.480 | 0.527 | 0.38 |
xLAM-v0.1-r | 0.532 | 0.45 | 0.547 | 0.46 | 0.455 | 0.36 |
LLM Name | Unseen Insts & Same Set | Unseen Tools & Seen Cat | Unseen Tools & Unseen Cat |
---|---|---|---|
TooLlama V2 | 0.4385 | 0.4300 | 0.4350 |
GPT-3.5-Turbo-0125 | 0.5000 | 0.5150 | 0.4900 |
GPT-4-0125-preview | 0.5462 | 0.5450 | 0.5050 |
xLAM-v0.1-r | 0.5077 | 0.5650 | 0.5200 |
LLM Name | 1-step | 2-step | 3-step | 4-step | 5-step |
---|---|---|---|---|---|
GPT-4-0613 | - | - | - | - | 69.45 |
Claude-Instant-1 | 12.12 | 32.25 | 39.25 | 44.37 | 45.90 |
xLAM-v0.1-r | 4.10 | 28.50 | 36.01 | 42.66 | 43.96 |
Claude-2 | 26.45 | 35.49 | 36.01 | 39.76 | 39.93 |
Lemur-70b-Chat-v1 | 3.75 | 26.96 | 35.67 | 37.54 | 37.03 |
GPT-3.5-Turbo-0613 | 2.73 | 16.89 | 24.06 | 31.74 | 36.18 |
AgentLM-70b | 6.48 | 17.75 | 24.91 | 28.16 | 28.67 |
CodeLlama-34b | 0.17 | 16.21 | 23.04 | 25.94 | 28.16 |
Llama-2-70b-chat | 4.27 | 14.33 | 15.70 | 16.55 | 17.92 |
LLM Name | Success Rate | Progress Rate |
---|---|---|
xLAM-v0.1-r | 0.433 | 0.677 |
DeepSeek-67B | 0.400 | 0.714 |
GPT-3.5-Turbo-0613 | 0.367 | 0.627 |
GPT-3.5-Turbo-16k | 0.317 | 0.591 |
Lemur-70B | 0.283 | 0.720 |
CodeLlama-13B | 0.250 | 0.525 |
CodeLlama-34B | 0.133 | 0.600 |
Mistral-7B | 0.033 | 0.510 |
Vicuna-13B-16K | 0.033 | 0.343 |
Llama-2-70B | 0.000 | 0.483 |
We want to acknowledge the work that has contributed to our paper and to the agent research community! If you find our paper, code, or model useful, please cite:
@article{zhang2024agentohana,
title={AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning},
author={Zhang, Jianguo and Lan, Tian and Murthy, Rithesh and Liu, Zhiwei and Yao, Weiran and Tan, Juntao and Hoang, Thai and Yang, Liangwei and Feng, Yihao and Liu, Zuxin and others},
journal={arXiv preprint arXiv:2402.15506},
year={2024}
}