Add a LanguageModel class that implements the Llama2 architecture via llama2.c and Emscripten #26

Open
wants to merge 15 commits into
base: main

Conversation

gohai
Member

@gohai gohai commented Aug 1, 2023

This makes it possible to do inference on toy models (15M, 42M, 110M parameters) using the Llama2 architecture, as implemented in llama2.c.

The included emscripten build artifacts came from this tree: https://github.com/gohai/llama2.c-emscripten

TODO:

  • CDN for models & tokenizer.bin
  • not firm about naming of things ("LanguageModel"?, "manual*"?, "tokens"?)
  • would be nice to come up with a Colab notebook to do training/fine-tuning of custom models

@gohai gohai requested review from shiffman and ziyuan-linn August 1, 2023 09:33
@gohai gohai force-pushed the model-llama2 branch 3 times, most recently from 53bd1f6 to e18b755 Compare August 2, 2023 09:31
Member

@shiffman shiffman left a comment

Wow, this is amazing @gohai!

First, as we work on documenting this, a wonderful reference is Let Us Show You How GPT Works by @aatishb. Hi Aatish, tagging you mostly to say hi, but would of course welcome your input!

I added a few comments @gohai. In my experience with teaching the previous charRNN models with ml5.js, the main pain points were:

  • Generating character by character or token by token was very confusing for students and should probably be reserved for more advanced use cases (though great to include in ml5.js if we can?)
  • What students often want to do is train their own model, but I was never able to successfully pull this off with all the steps required to train in python then convert to JS. A clear and easy to use colab notebook could absolutely work though!

My other "worry" relates to treading into this territory and providing "out of the box" models. LLMs, as is well documented, have all sorts of bias and other issues. It would be wonderful to collaborate with the team members who are working on educational materials and documentation to think about how we guide users to be aware of the limitations and possible dangers of certain models and datasets. The code of conduct can be a good reference as well for considering use cases and applications. This could be a great discussion for the full group! cc @sproutleaf

createCanvas(400, 400);
background(0);

lm = ml5.languageModel(onModelLoaded);
Member

Agreed it would be great to support preload().

Also, I wonder if we should require a "string" that references a specific model to load. Even though this is perhaps an extra, unnecessary step, it emphasizes to the user that this isn't magic, and requires an acknowledgement of the specific model they are using. Some options:

  lm = ml5.languageModel("TinyStories", onModelLoaded);
  lm = ml5.languageModel("TinyLlamas", onModelLoaded);

Referencing the model name is probably more important, but it's also important to emphasize the specific dataset that was used for training.
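
As a rough illustration of requiring an explicit model reference, here is a minimal sketch of how a name (or an object carrying a custom URL) might be resolved. Everything in it is an assumption made for illustration: the `resolveModelUrl` helper, the `modelUrl` property, and the placeholder `example.com` URLs are hypothetical and not taken from this PR.

```javascript
// Hypothetical registry: the model names come from the discussion above,
// the URLs are placeholders and do not point at real files.
const KNOWN_MODELS = {
  TinyStories: "https://example.com/models/tinystories.bin",
  TinyLlamas: "https://example.com/models/tinyllamas.bin",
};

// Hypothetical helper: turns an explicit model reference into a URL and
// refuses to guess when the user did not name a model.
function resolveModelUrl(spec) {
  if (typeof spec === "string") {
    if (!Object.prototype.hasOwnProperty.call(KNOWN_MODELS, spec)) {
      throw new Error(`Unknown model "${spec}"`);
    }
    return KNOWN_MODELS[spec];
  }
  // Assumed shape for custom models: { modelUrl: "..." }
  if (spec !== null && typeof spec === "object" && typeof spec.modelUrl === "string") {
    return spec.modelUrl;
  }
  throw new TypeError("Pass a model name or an object with a modelUrl string");
}
```

With this shape, omitting the model (for example passing only a callback) fails loudly instead of silently picking a default.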

let options = {
  temperature: 0.9
};
lm.generate(prompt, options, onToken);
Member

I love the idea of an onToken() event, but for beginners it would be incredibly useful to have an API that mirrors the HuggingFace inference API. In this case, you pass a prompt and a desired number of tokens and you get the entire result back. Something along the lines of:

  let options = {
    temperature: 0.9,
    maxTokens: 100
  };
  lm.generate(prompt, options, gotText);

I might also consider the form prompt, maxTokens, callback, and then a JS object with more properties if you want to set temperature and other options, maybe:

  lm.generate(prompt, 100, gotText);

and then:

  let options = {
    prompt: "How are you?",
    temperature: 0.9,
    maxTokens: 100
  };
  lm.generate(options, gotText);

Though maybe it's good to always have the prompt separate.
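
One way to support all three call shapes above is to normalize the arguments up front. This is only a sketch under assumptions: `normalizeGenerateArgs` is a hypothetical helper, not something in this PR, and the option names simply follow the examples in this comment.

```javascript
// Hypothetical helper: maps
//   generate(prompt, options, callback)
//   generate(prompt, maxTokens, callback)
//   generate(options, callback)
// onto a single { prompt, options, callback } record.
function normalizeGenerateArgs(...args) {
  // A trailing function is treated as the callback.
  const callback =
    typeof args[args.length - 1] === "function" ? args.pop() : undefined;

  let prompt;
  let options = {};

  if (typeof args[0] === "string") {
    prompt = args[0];
    if (typeof args[1] === "number") {
      options = { maxTokens: args[1] };
    } else if (args[1] !== null && typeof args[1] === "object") {
      options = { ...args[1] };
    }
  } else if (args[0] !== null && typeof args[0] === "object") {
    // The prompt travels inside the options object in this form.
    ({ prompt, ...options } = args[0]);
  }

  if (typeof prompt !== "string") {
    throw new TypeError("generate() needs a prompt string");
  }
  return { prompt, options, callback };
}
```

Keeping the prompt as a separate first argument makes the first two branches trivial; the object-only form is the one that needs the extra destructuring step.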

@gohai
Member Author

gohai commented Aug 4, 2023

@aatishb I so love this article, particularly how it raises the transparency/blackbox-ism issues with OpenAI to a general audience! (I believe that's also at the core of why we bother with toy models: not to suggest that a similar level of performance is achievable in the browser, but to unpack this technology, and how it got built/trained, as widely as possible.)

Thank you for taking the time to think through this, @shiffman!

  • The suggestion of requiring the user to spell out the model name makes so much sense! (implemented this in 4eb89d9) I feel similarly about what you wrote about the dangers of whatever the "out-of-the-box" experience is going to be. Looking forward to working with the larger group on this!
  • Following your suggestion, I changed the main callback to be on-completion rather than on-token and simplified the most basic example. (b544b24)
  • I'll look into training of custom models. Training from scratch on the TinyStories dataset seems to take a "couple of hours" on four A100s. This is probably a step too far for Colab 😄, but perhaps there will be improvements in the implementation. There are people working on implementing fine-tuning using LoRA; this might be doable!
  • Preload now works (hackishly, d962efe)

lm.generate(prompt, options, gotText);
}

function gotText(out, lm) {
Member

I like how simple it is to have just a single string that comes back, but it might match how other models in ml5.js work if we instead wrap everything inside an object, for example:

function gotText(results) {
  console.log(results.text);
  console.log(results.words);
}

{
  prompt: "How are you",
  text: "I am doing fine.",
  words: ["I", "am", "doing", "fine"]
}

I could also imagine returning a prompt property. Sometimes students want to include the prompt in what they display back and sometimes they don't. Keeping track of it in a global variable can be awkward (especially if there are multiple calls to generate()), so having it available in the results can be helpful!
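
A result object of the shape suggested above could be assembled roughly like this. It is a sketch only: `makeResult` is a hypothetical helper, and splitting the text into words with a regular expression is an assumption (the PR may derive `words` differently, for example from tokens).

```javascript
// Hypothetical helper: bundles the prompt, the generated text, and a word
// list into one plain object that the callback receives.
function makeResult(prompt, text) {
  // Assumption: a "word" is a run of letters, digits, or apostrophes.
  const words = text.match(/[\p{L}\p{N}']+/gu) ?? [];
  return { prompt, text, words };
}

// Usage, mirroring the example object above:
const results = makeResult("How are you", "I am doing fine.");
console.log(results.words); // [ 'I', 'am', 'doing', 'fine' ]
```

Because each call returns a fresh object that carries its own prompt, several overlapping generate() calls would not need a shared global variable.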

Member Author

@gohai gohai Aug 5, 2023

@shiffman words and out are currently properties of the instance. Would passing the instance as the result make sense, or would you feel a "result" object value is nicer, which the user is able to mutate as they like etc.? (text over out for the literal output?) I'll add prompt, although currently it's strictly the beginning of the response, more of a "prefix" really.
Thank you for your review! I'll try it out

@gohai gohai force-pushed the model-llama2 branch 4 times, most recently from 2e5365d to 347f5b4 Compare August 9, 2023 11:48
@shiffman shiffman mentioned this pull request Aug 19, 2023
gohai added 15 commits August 24, 2023 17:01
Besides providing a model name, the user can also pass an object containing the URL to a custom model. In both cases, they're explicit about the model they're exploring. As suggested by @shiffman
…-token

As suggested by @shiffman. This makes the most basic example easier to understand.
Previously this sometimes only passed the LanguageModel instance. Instead, always pass the best possible "value" as the first, and the instance as an (optional) second.
This drops all examples that use async/await.
Unsure if calling _{inc,de}crementPreload() manually is the best way to accomplish this. I first tried the p5PreloadHelper from the old repo, but this never made window._preloadCount go above zero for me. (Maybe @ziyuan-linn has some idea?)
…r top-p sampling (used by default)

From @karpathy's commit message: Quick note on sampling, the recommendation for good results is to use `-t 1.0 -p 0.9`, i.e. top-p sampling at 0.9 with temperature 1.0 (this is the default). To control the diversity of samples use either the temperature (i.e. vary `-t` between 0 and 1 and keep top-p off with `-p 0`) or the top-p value (i.e. vary `-p` between 0 and 1 and keep `-t 1`), but not both. Nice explainers on LLM sampling strategies include [this](https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/), [this](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p) or [this](https://huggingface.co/blog/how-to-generate).
… topp

This automatically picks a reasonable default value for the other parameter, if not explicitly specified, and prints a message if both are.
This matches upstream llama2.c, and prevents a confusing message with the basic example, which specifies a temperature (thus disabling the default top-p sampling).
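
The defaulting behavior described in the last commits can be sketched as follows. This is not the code from the PR: `resolveSampling` is a hypothetical helper, the option names `temperature` and `topp` are assumed from the commit titles, and the default values follow the llama2.c recommendation quoted above (temperature 1.0 with top-p 0.9, and top-p switched off with 0 when only a temperature is given).

```javascript
// Hypothetical helper: fills in the sampling parameter the user left out.
function resolveSampling(options = {}) {
  const hasTemperature = options.temperature !== undefined;
  const hasTopp = options.topp !== undefined;

  if (hasTemperature && hasTopp) {
    // Both were set explicitly: honor them, but point out the recommendation.
    console.warn(
      "Both temperature and topp are set; the recommendation is to vary only one of them."
    );
    return { temperature: options.temperature, topp: options.topp };
  }
  if (hasTemperature) {
    // Varying the temperature: keep top-p sampling off.
    return { temperature: options.temperature, topp: 0 };
  }
  if (hasTopp) {
    // Varying top-p: keep the temperature at 1.
    return { temperature: 1.0, topp: options.topp };
  }
  // Nothing specified: top-p sampling at 0.9 with temperature 1.0.
  return { temperature: 1.0, topp: 0.9 };
}
```

Under these assumptions, the basic example that passes only `temperature: 0.9` resolves to `{ temperature: 0.9, topp: 0 }` without printing anything.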

2 participants