
Removing batch dimension from default layout maps for Gemma and Llama #2035

Open
wants to merge 1 commit into master

Conversation

@martin-gorner (Contributor)

This is to align with Keras PR 20674, which fixes data sharding in the JAX trainer but does not support sharding the model on the "batch" dimension.

To be clear, before Keras PR 20674, using the "batch" dimension for sharding the model was not supported either (unless that dimension was of size 1). Keras PR 20674 fixes use cases where the "batch" dimension is not the first dimension in the device mesh, and where model and data parallelism are used at the same time. However, the data sharding expressions it uses assume that only the data, not the model, is sharded on the "batch" dimension. That is why this PR removes model sharding on the "batch" dimension from the default layout maps.

Sharding on the "batch" dimension was added in the Gemma default layout map by Keras-hub PR 1491. The reason why this was added is unclear.

github-actions bot added the Gemma label (Gemma model specific issues) on Jan 6, 2025
@SamanehSaadat (Member)

I don't think this change is needed. As I mentioned in keras-team/keras#20674 (comment), models should not be sharded on the batch dimension, and not sharding the model on the batch dimension is not a bug. Only data is sharded on the batch dimension. I believe that when we provide the batch dimension in the layout here, we are setting up how the data is sharded, so we should keep it as is. By the way, this is a good resource on data distribution: https://jax.readthedocs.io/en/latest/distributed_data_loading.html

@martin-gorner (Contributor, Author)

Could you be more explicit about what you mean when you say "we don't shard the model on the batch dimension"? The code of the Gemma default layout says:
layout_map["decoder_block.*attention.*(query|key|value).kernel"] = ('model', 'batch', None)
which will result in the attention weights being sharded on both the 'model' and 'batch' dims when the mesh is keras.distribution.DeviceMesh((len(devices)//2, 2), ["model", "batch"], devices), will it not?
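This claim can be checked directly against JAX semantics; the sketch below assumes 8 devices and uses a stand-in kernel shape (the layout entry above corresponds to the PartitionSpec used here):

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

devices = jax.devices()  # assume 8 accelerators
mesh = Mesh(np.array(devices).reshape(len(devices) // 2, 2), ("model", "batch"))

# Stand-in for a (query|key|value) kernel of shape (num_heads, hidden_dim, head_dim).
kernel = np.zeros((8, 2048, 256), dtype=np.float32)
sharded = jax.device_put(kernel, NamedSharding(mesh, PartitionSpec("model", "batch", None)))

# Each device holds a distinct (2, 1024, 256) slice: the kernel is split on both
# the "model" and "batch" axes, i.e. 8-way with no replication.
print(sharded.sharding.shard_shape(kernel.shape))  # (2, 1024, 256)
print(len(sharded.addressable_shards))             # 8
```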

@SamanehSaadat (Member)

> Could you be more explicit about what you mean when you say "we don't shard the model on the batch dimension"? The code of the Gemma default layout says layout_map["decoder_block.*attention.*(query|key|value).kernel"] = ('model', 'batch', None), which will result in the attention weights being sharded on both the 'model' and 'batch' dims when the mesh is keras.distribution.DeviceMesh((len(devices)//2, 2), ["model", "batch"], devices), will it not?

I don't think the attention weights will be sharded on the batch dimension if we set layout_map["decoder_block.*attention.*(query|key|value).kernel"] = ('model', 'batch', None). I think the attention weights will be sharded on the model dimension only, and the data input to this layer has been sharded on the batch dimension, so the attention layer doesn't get all the data, only a portion of it.

@martin-gorner (Contributor, Author) commented Jan 8, 2025

Would you have a pointer to the code where this behavior is implemented? I have checked the implementations of distribute_variable and distribute_tensor as well as the documentation, but I could find no reference to any special-casing of one dimension in the LayoutMap based on its name.

Quick test showing that there is no special-casing of the "batch" dimension for model weights: https://www.kaggle.com/code/martingorner/keras-model-sharding-test
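The notebook itself is not reproduced here, but its gist can be sketched with the public Keras distribution API (assuming the JAX backend and 8 accelerators; the Dense layer and the regex key are only illustrative):

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # this sketch assumes the JAX backend

import keras
from keras import distribution

devices = distribution.list_devices()  # assume 8 accelerators
mesh = distribution.DeviceMesh(shape=(2, 4), axis_names=("batch", "model"), devices=devices)

# Deliberately shard a weight on BOTH mesh axes, including "batch".
layout_map = distribution.LayoutMap(mesh)
layout_map["dense.*kernel"] = ("batch", "model")

distribution.set_distribution(
    distribution.ModelParallel(layout_map=layout_map, batch_dim_name="batch")
)

# A kernel created under this distribution ends up laid out as
# PartitionSpec("batch", "model"): split 2-way x 4-way = 8-way, no replication.
layer = keras.layers.Dense(16)
layer.build((None, 32))
print(layer.kernel.value.sharding)  # on the JAX backend, a NamedSharding
```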

@SamanehSaadat (Member)

num_model_replicas_total is equal to the batch dimension size. Because data is sharded on the batch dimension, we need to replicate the model along this dimension so that each data shard has access to a full replica of the model.

mesh_model_dim_size is the second dimension of the mesh and dictates how the model is sharded.

In JAX, we shard the data and then the computation follows the data. I recommend reading through distribute_data_input to see how the layout is used.
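A rough sketch of the arithmetic described above (the variable names mirror this comment, not necessarily the actual Keras source):

```python
# For a logical mesh of shape (batch_dim_size, model_dim_size), e.g. (2, 4):
batch_dim_size, model_dim_size = 2, 4

# Data is split along the "batch" axis of the mesh...
num_data_shards = batch_dim_size            # 2-way data parallelism

# ...while the model is split along the "model" axis and replicated along the
# "batch" axis, one full copy per data shard.
mesh_model_dim_size = model_dim_size        # 4-way model sharding
num_model_replicas_total = batch_dim_size   # 2 full replicas of the model
```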

PS: I'll look at your Kaggle notebook tomorrow.

@martin-gorner (Contributor, Author)

The notebook just shows that when a weights tensor has a layout map of ("batch", "model") on a mesh like ((2,4), ("batch", "model")), it does get split up into 8 pieces. So num_model_replicas in that case is 1 (full sharding, no replication), but the Keras code (even with my fix) computes 2.

@SamanehSaadat (Member)

> The notebook just shows that when a weights tensor has a layout map of ("batch", "model") on a mesh like ((2,4), ("batch", "model")), it does get split up into 8 pieces. So num_model_replicas in that case is 1 (full sharding, no replication), but the Keras code (even with my fix) computes 2.

The number of model replicas should be 2 in this case, as the batch dim is 2. But there is an issue if your notebook is showing the model being sharded 8-way in this case; we need to debug to see where the disconnect happens. ((2,4), ("batch", "model")) should shard the data 2-way and the model 4-way.

@martin-gorner (Contributor, Author)

This is a matter of opinion and API design. I actually prefer the current implementation, which is more direct and where a layout of ((2,4), ("a", "b")), applied to a specific weights tensor, shards that tensor 8 ways, no matter the names of the mesh axes. We can add an error or warning if the user asks for the model to be sharded on the batch_dim_name dimension, if we believe there is no use case for that.

@martin-gorner (Contributor, Author)

By the way, you can ping me on chat to discuss this more interactively.

@SamanehSaadat (Member)

> This is a matter of opinion and API design. I actually prefer the current implementation, which is more direct and where a layout of ((2,4), ("a", "b")), applied to a specific weights tensor, shards that tensor 8 ways, no matter the names of the mesh axes. We can add an error or warning if the user asks for the model to be sharded on the batch_dim_name dimension, if we believe there is no use case for that.

I see two points here:

  1. If we want to shard the model 8-way, why do it in a 2-dimensional logical mesh? Assume we have a (2, 4) or (4, 2) or (2, 2, 2) physical mesh, then we just pass ((1, 8), ("batch", "model")) and the model will be sharded 8-way on any of those physical meshes (see the sketch after this list). This way, we abstract away the complexities of the physical mesh from the user.
  2. I do believe allowing users to shard the data on the batch dimension should be supported, and I think allowing the user to provide their desired data and model parallelism through ("batch", "model") is a nice API design (given that the model doesn't need to be sharded on a 2-dimensional mesh).
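For point 1, here is what such a (1, 8) logical mesh looks like in plain JAX terms (a sketch assuming 8 devices; any 8-device physical topology can back this mesh):

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# A (1, 8) logical mesh: the "batch" axis has size 1, so nothing is split along it.
devices = np.array(jax.devices()).reshape(1, 8)
mesh = Mesh(devices, ("batch", "model"))

weights = np.zeros((2048, 16384), dtype=np.float32)  # stand-in weight matrix
sharded = jax.device_put(weights, NamedSharding(mesh, PartitionSpec("batch", "model")))

# 8-way sharding along "model"; the size-1 "batch" axis changes nothing.
print(sharded.sharding.shard_shape(weights.shape))  # (2048, 2048)
```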

@martin-gorner (Contributor, Author)

Of course sharding data on the batch dimension should be supported, and it is.

For model weight sharding though, the current implementation maps exactly to what the low-level JAX APIs do, which is a good thing IMHO. See my notebook:

  • A Keras layout spec of layout_map["dense/kernel"] = ("a", "b") translates exactly into JAX as:
  • jax.device_put(x, jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec("a", "b")))

But maybe what you have in mind are sharding hints for outputs? This is implemented in Keras 3 as layout_map["dense/output"], and yes, in this case it makes sense for the user to set it to something like ("batch", None).

@martin-gorner (Contributor, Author)

The Kaggle notebook: https://www.kaggle.com/code/martingorner/keras-model-sharding-test/ (the latest version was showing a runtime error, now fixed).

@SamanehSaadat (Member)

> 1. If we want to shard the model 8-way, why do it in a 2-dimensional logical mesh? Assume we have a (2, 4) or (4, 2) or (2, 2, 2) physical mesh, then we just pass ((1, 8), ("batch", "model")) and the model will be sharded 8-way on any of those physical meshes. This way, we abstract away the complexities of the physical mesh from the user.

How about this? Why do we need to specify the model sharding layout in a 2-dimensional way?

@martin-gorner (Contributor, Author)

  1. Not sure what the layout map for Gemma's attention weights is in your proposal. Assuming it is ('model', 'batch'), then yes, when the "batch" dimension is 1, using 'batch' in the layout map does not matter. But if people want to do data and model parallelism by setting a logical mesh of ((2,4), ('batch', 'model')), then, with a layout of ('model', 'batch'), the current implementation will shard the attention weights 8-way (which is what was specified according to JAX PartitionSpec and NamedSharding semantics, but obviously not a correct outcome).
  2. Also, mesh=((1, 8), ("batch", "model")) with an attention weights sharding spec of ('model', 'batch') is one way of sharding a model 8-way. In the JAX API, it is not the only way. mesh=((2, 4), ('a', 'b')) with a sharding spec of ('a', 'b') is another way of expressing 8-way weights sharding (both ways are sketched below). And since the Keras layout map implementation follows JAX PartitionSpec and NamedSharding exactly, these settings should also make sense and mean the same thing in Keras.

Or are you suggesting diverging from JAX semantics here? What would the new semantics be?
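To make point 2 above concrete, both mesh/spec pairs land on 8-way weight sharding under plain JAX NamedSharding semantics (a sketch, assuming 8 devices):

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

w = np.zeros((1024, 4096), dtype=np.float32)  # stand-in weight matrix
devs = np.array(jax.devices())                # assume 8 devices

# Way 1: a (1, 8) "batch"/"model" mesh with spec ("model", "batch").
m1 = Mesh(devs.reshape(1, 8), ("batch", "model"))
s1 = jax.device_put(w, NamedSharding(m1, PartitionSpec("model", "batch")))

# Way 2: a (2, 4) mesh with arbitrary axis names and spec ("a", "b").
m2 = Mesh(devs.reshape(2, 4), ("a", "b"))
s2 = jax.device_put(w, NamedSharding(m2, PartitionSpec("a", "b")))

# Both put the weights in 8 distinct pieces, one per device, with no replication.
print(len(s1.addressable_shards), len(s2.addressable_shards))  # 8 8
```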

@SamanehSaadat (Member)

Sounds good! Thanks, Martin!

Labels: Gemma (Gemma model specific issues)
2 participants