RFC-0018: torch.monitor - events and counters for PyTorch #30

Open (wants to merge 1 commit into master)
Conversation

@edward-io

Co-authored-by: Tristan Rice <[email protected]>

@edward-io changed the title from "RFC-0018: torch.monitor" to "RFC-0018: torch.monitor - events and counters for PyTorch" on Oct 15, 2021

### Events

Events have a generic metadata dict to store arbitrary information. These events are intended to have a relatively low QPS compared to the existing `RECORD_FUNCTION` profiler interface and should be emitted at roughly the frequency at which one might log to stderr.


Would something like "starting training loop iteration" be an event, or is that too fine-grained?

Member

That's a good question. From a performance standpoint that would be fine, though it could be fairly spammy in the logs, so it's probably not a great default. Similar to how you normally log metrics during training, you would probably only want a constant number of events per epoch, not per batch.

Some of this needs more experimentation to get a good estimate of what is acceptable and with which loggers.

Comment on lines +212 to +218
```py
# in background thread
while True:
    stats = collector.report()
    collector.reset()
    for k, v in stats.items():
        tensorboard_writer.add_scalar(k, v)
    time.sleep(60)
```


I would suggest against using a time-based interval (i.e. aggregate every 60 seconds) to log events. I think a count-based approach (e.g. aggregate every 20 items) is more common among users and would make a better official example.

When logging events such as the average loss or the average batch time, it is important for the user to know that they are aggregated from the same number of samples so that the logged numbers are comparable. Some concrete examples where aggregating every N seconds is problematic:

  • Considering the "STDDEV" aggregation: you really want to compute the stddev from the same number of samples, because otherwise the stddevs are not comparable at all.
  • Considering the "average batch time": if we log it over every N-second window, the logged values will be biased towards larger values, because larger batch times are averaged over fewer samples.

@d4l3k (Member) Oct 15, 2021

For a count-based approach, the author would specify it on the metric itself? I.e. the author would set that per-epoch data is logged every 1 item but distributed collectives are logged every, say, 1000 items. It does add an extra value the author has to configure, but it seems feasible, though potentially susceptible to performance issues depending on the log aggregator.

We likely need to support time-based aggregation sinks such as Prometheus/Grafana anyway. When logging to time-series systems, count-based aggregation doesn't match super well, since we may have stale values if the events stop happening. If we default to zero when there are no items between Prometheus polls, we're likely to get spiky outputs.

For the aggregation design, we can export the count as well to help the user handle that behavior.

Member

I think it's feasible to support both, though it does add some complexity to the aggregation layer. We can have an optional field as part of the stat definition with the number of samples required.

For conversion to time-based loggers we'll likely want to support a "no data" indication, so if there aren't enough data points we simply won't log anything and there will be a gap in the time series. I believe this should be possible with Prometheus and fb303, though it might be fairly hacky, with registering and unregistering series.

Comment on lines +166 to +167
```py
BATCH_LATENCY.add(time.time() - batch_start)
EPOCH_LATENCY.add(time.time() - epoch_start)
```


Thanks for the example!
One potential use case for torch.monitor would be logging real-time energy and carbon consumption. For example, one might want to monitor real-time CPU utilization and energy counters to ensure accurate accounting (something along the lines of this project: https://github.com/Breakend/experiment-impact-tracker). This sort of logging would ideally happen at a relatively short interval (e.g., every few seconds) instead of being tied to how the code is structured.
Ideally there would be a background thread that queries these metrics instead of having users manually add them to the application code. In fact, the best setup would require very little user intervention to monitor these metrics. This use case seems to be in line with the motivation of the RFC, but it is not clear to me from the examples that it is achievable with the current design. Could you share more details on whether this sort of interaction-less/background monitoring is covered by torch.monitor?

Member

Hi, yes, this is definitely within the scope of torch.monitor. We have some tentative plans to provide an out-of-the-box system metrics monitor (i.e. call a method and get system metrics logged via a background thread), and something like energy accounting would be right in line with that.


@d4l3k thanks for letting me know. Will that be included in this RFC then?

Author

@jieru-hu the RFC focuses on APIs we're introducing rather than the specifics of which metrics we plan on counting. I think it would be better suited for a PR after torch.monitor is introduced. I do think it's a compelling use case and a good reason why we should add this functionality.

@jieru-hu Oct 19, 2021

Thanks @edward-io. Sorry if my question was not clear. To clarify, I'm not asking about adding the energy metrics; I'm more curious about the tentative plan you mentioned for providing an out-of-the-box system metrics monitor. The example here isn't clear on how metrics that require no user interaction will be handled, since it focuses on monitoring epoch run time (which is a user-specified metric). It would be nice to have an example showing how system metrics with no user interaction will be handled (e.g., CPU utilization).


```py
@dataclass
class Event:
```


Should Event also contain a "severity" field (i.e. INFO, WARN, ERROR) similar to a conventional logger? I believe one can use the metadata attribute for that, but would it make sense to have first-class support for it?

Author

We debated including this field, but decided against it. For some events, severity might not make sense, and for others every instance would carry the same level: e.g. every DeprecationWarning event has the same severity (i.e. "this API will break if you don't migrate off of it soon"). Since severity can be added with helper functions, we decided to keep the Event interface as simple as possible.
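For illustration, such a helper could layer severity on top of the metadata dict; a minimal sketch using stand-ins for the RFC's proposed `Event`/`log` interfaces (the `log_with_severity` helper is hypothetical):

```py
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

# Stand-ins for the RFC's proposed Event/log interfaces.
@dataclass
class Event:
    type: str
    message: Optional[str] = None
    metadata: Dict[str, Union[str, float, int, bool]] = field(default_factory=dict)

def log(event: Event) -> None:
    print(event)  # a real implementation would fan out to registered handlers

def log_with_severity(event: Event, severity: str = "INFO") -> None:
    """Hypothetical helper: record severity in metadata rather than on Event."""
    event.metadata["severity"] = severity
    log(event)

log_with_severity(
    Event(type="torch.DeprecationWarning", message="this API will be removed soon"),
    severity="WARN",
)
```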

```py
from torch.monitor import Event, log

@dataclass
class RdzvEvent(Event):
```


Who is responsible for setting the timestamp, Event itself or the log() function? In my opinion an event "materializes" by the time it gets logged, so log() makes more sense. In that case, I'm not sure whether the timestamp should be part of the Event struct or something that is internally recorded along with it.

Member

It depends on the event. For something like a GPU event, the time you log it is not the same as the time it occurred, so an external time tracker is necessary. That said, it would be pretty easy to have Event auto-populate that field on creation and make providing the timestamp optional.
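Auto-populating on creation could be as simple as a dataclass default factory; a hedged sketch (field names follow the RFC excerpt, the second event type is made up):

```py
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    type: str
    message: Optional[str] = None
    # Auto-populated at construction time; callers that record an event after
    # the fact (e.g. a GPU event) can still pass an explicit timestamp.
    timestamp: float = field(default_factory=time.time)

e = Event(type="torch.elastic.RdzvEvent", message="rendezvous started")
late = Event(type="torch.cuda.KernelFinished", timestamp=time.time() - 0.5)
```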

```py
global event_handlers
event_handlers.append(handler)

def log(event: Event) -> None:
```


Nit: How about a convenience function def log(type: str, message: Optional[str] = None)? The syntax would be cleaner for simple use cases compared to log(Event(type=...))
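Such a wrapper would be a thin layer over `log(Event(...))`; a hypothetical sketch, assuming the RFC's proposed `Event` fields default to `None` when omitted:

```py
from typing import Optional

from torch.monitor import Event, log  # the RFC's proposed interface

def log_event(type: str, message: Optional[str] = None) -> None:
    """Hypothetical convenience wrapper around log(Event(...))."""
    log(Event(type=type, message=message))

log_event("torch.elastic.RdzvEvent", "rendezvous started")
```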

Member

How would you envision filtering/using that data from the aggregator? For text logs is it better to just use the standard python logging.Logger?

**Interfaces**

```py
class StatType(Enum):
```


With this API, my understanding is that we perform aggregations on the fly within each time window and send them to the sink (e.g. a time-series database)? Would it make sense to send raw values to the sink and have it do the aggregations offline? For instance, with this API, how can you retrieve the mean of a metric for an arbitrary time frame? I see no way of recording individual items; the closest is VALUE, but it only records the most recent item in the window.

Member

The current design allows a handler to either do aggregations or output raw values. The aggregation is intended to be similar to https://github.com/facebook/fb303 and to enable high-performance logging from C++.

The design does allow a handler to log the raw values, though from a performance standpoint there's a lot to consider; that would likely require some careful sampling or batching to be comparable in performance to the aggregations.

For time frames, that's up to the handler/user. The out-of-the-box implementation I'm planning to write will only support fixed time frames for now, since that seems sufficient for most users. From my own experience, fixed 60s windows are good enough for production use cases.

```py
class Event:
    # type is a globally unique type name
    # Ex: torch.elastic.RdzvEvent
    type: str
```


Assuming this is used by multiple torch libraries, does the event need to include a notion of namespace to reduce the chance of collisions around the type, or would that be handled by naming conventions for type?


Future work will add integrations into PyTorch core and distributed to log key events and metrics to users. It will also include out-of-the-box integrations for commonly logged stats and logging destinations.

### Events

The MLflow maintainers are very interested in this event abstraction for the purposes of automatically recording PyTorch model training sessions and their metadata (parameters, metrics, and files) as MLflow Runs. We currently have an autologging integration for PyTorch-Lightning (docs: https://mlflow.org/docs/latest/tracking.html#pytorch-experimental, examples: https://github.com/mlflow/mlflow/blob/master/examples/pytorch/MNIST/mnist_autolog_example.py). However, we've had difficulty implementing this functionality for general PyTorch models due to the lack of: 1. a clear interface for determining when training begins and ends, and 2. instrumentation providing a standardized, meaningful set of parameters / metrics for model training sessions.

In this vein, I have a couple questions:

  1. When events are integrated into PyTorch, is the vision to record an event when model training starts and when it ends? If so, for the "training start" event, will model parameters be recorded? Similarly, for the "training end" event, can the event payload refer to the trained model? This would allow us to record model files to MLflow.

  2. Is there a plan to define a standardized set of metrics for PyTorch models and emit them via the API, or is this up to the user?

Thank you in advance for your feedback, and please let us know if / how we can be involved in the design process here.

Member

This proposal is primarily intended to standardize performance/system metrics and events for now. How this is exposed to users for tracking modeling metrics hasn't been fully fleshed out, and there are lots of existing solutions for that already (i.e. Lightning, TensorBoard, etc.). That would be a good follow-up RFC to this, though it may be much more involved since it'll likely play a bigger role in how people use PyTorch.

  1. As you rightly pointed out, PyTorch doesn't have a concept of training start or end, and this RFC doesn't change that. There may be some ways to add heuristics to track it (i.e. when an optimizer is set up on a model's params), but that isn't covered by this RFC.
  2. There are plans to have standardized metrics for things like system utilization, memory, dataloaders, etc. The full set isn't complete, and the intention is that it'll be easy to add new ones as the need arises (PRs welcome).
    Things like dataloader latency will likely always be available, but anything that requires background processing, such as system utilization, will likely require the user to set it up.

If you have design ideas for what a good model logging interface would look like for users, feel free to post them as an RFC here, or you can set up a meeting with folks to discuss. Always happy to chat :)

```py
epoch_start = time.time()
for x, y in dataloader:
    batch_start = time.time()
    y_pred = model(x)
```


Could we log model-related metrics like loss and accuracy via the torch.monitor package?

key="training.batch.latency",
aggregations={StatType.MEAN, StatType.HISTOGRAM},
)
EPOCH_LATENCY = Stat(


Do we support resource metrics like GPU utilization or CPU/memory usage?

Comment on lines +212 to +217
```py
# in background thread
while True:
    stats = collector.report()
    collector.reset()
    for k, v in stats.items():
        tensorboard_writer.add_scalar(k, v)
```

Should users write this code in a background thread themselves, or is this part of torch.monitor's internal implementation?

d4l3k added a commit to d4l3k/pytorch that referenced this pull request Nov 18, 2021
Summary:
Pull Request resolved: pytorch#68074

This is the first step of many PRs towards implementing the `torch.monitor` RFC pytorch/rfcs#30

This defines the aggregation types, the `Stat` class and provides some simple collection of the stats.

This doesn't match the RFC exactly as it incorporates some of the comments on the RFC as well as a few changes for performance.

Changes:
* added window_size to the stats. If specified it will always compute the stat using the `window_size` number of values. If there aren't enough values within that window it reports the previous stats.
* This doesn't include the push metrics yet (will be coming).
  After more discussion it looks like the best way to handle this is to support a hybrid where the metric can set how frequently it'll be logged. For fixed window_size metrics it'll be logged each time it hits the window size. This will allow performant counters as well as lower frequency push counters (window_size=1).

Performance considerations:
* Updating the stats acquires a lock on that Stat object. This should be performant unless there's many-many threads writing to the same stat. Single thread will typically use futex so should be quite fast.
* Adding/removing/fetching all stats sets a global lock on the stat list -- this shouldn't be an issue since these events happen infrequently.
* Fetching stats accesses one stat at a time instead of a global lock. This means the exported values are linearizable but not serializable across multiple stats but I don't expect this to be an issue.

Next steps:
1. Add StatCollector interface for push style metrics
2. Add pybind interfaces to expose to Python
3. Add default metric providers
4. Integrate into Kineto trace view

Test Plan:
buck test //caffe2/test/cpp/monitor:monitor

CI

Reviewed By: kiukchung

Differential Revision: D32266032

fbshipit-source-id: bc84427554e5574d45bdecc37c6b9995e6c8e1e5
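To make the `window_size` behavior described in this commit concrete, here is a hypothetical usage sketch in the RFC's Python style; the constructor arguments follow the RFC examples above and may not match the final shipped API:

```py
from torch.monitor import Stat, StatType  # the RFC's proposed interface

# Hypothetical: aggregates are always computed over exactly 100 samples; until
# a full window of new values has arrived, the previous aggregates are reported.
BATCH_LATENCY = Stat(
    key="training.batch.latency",
    aggregations={StatType.MEAN, StatType.STDDEV},
    window_size=100,
)

BATCH_LATENCY.add(0.125)
```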
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Nov 19, 2021
@d4l3k (Member) commented Dec 3, 2021

FYI for anyone following this: I'm currently implementing this in PyTorch, following this RFC and the comments as closely as possible. Once I have a good working POC I'll update this RFC before landing it. There are a handful of differences to make the implementation easier/cleaner.

Here's the initial C++ events implementation: pytorch/pytorch#68783

```py
    # high level
    message: Optional[str]
    # Timestamp is the unix wall clock time of the event in seconds
    timestamp: Optional[float]
```

Wall clock time is an underspecified clock that is prone to issues such as drift and rollbacks.

This should use a monotonic clock with millisecond resolution to be more broadly useful.

```py
    timestamp: Optional[float]
    # Metadata can contain any event specific fields and is intended
    # for fields to later be aggregated or filtered on.
    metadata: Dict[str, Union[str, float, int, bool]]
```

I think we should model events closely on what the distributed tracing community has come up with. The OpenTelemetry Concepts page covers it pretty well.

In particular, we're missing the following:

Event ID; 8 bytes is more than enough. An event ID is important if we want to do any forensics on event streams, especially across ranks.

A notion of trace/span. These make it easy to understand the execution structure, in particular for events that cover a time range (an epoch, a forward pass, etc.). For the purpose of this RFC, this information can be encoded with explicit start/stop flags and a set of parent events.
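For illustration, extending the RFC's Event shape with the fields this comment asks for might look like the sketch below; every field name here is hypothetical and not part of the RFC:

```py
import os
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

@dataclass
class TracedEvent:
    type: str
    # 8-byte random event ID for forensics across ranks (hypothetical field).
    event_id: int = field(default_factory=lambda: int.from_bytes(os.urandom(8), "big"))
    # Optional span structure: link child events to a parent (hypothetical).
    parent_id: Optional[int] = None
    # Explicit start/stop flag for events that cover a time range (hypothetical).
    phase: Optional[str] = None  # "start", "stop", or None for point events
    metadata: Dict[str, Union[str, float, int, bool]] = field(default_factory=dict)

epoch = TracedEvent(type="torch.train.Epoch", phase="start")
fwd = TracedEvent(type="torch.train.Forward", parent_id=epoch.event_id, phase="start")
```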

```py
    message=message,
)

event = RdzvEvent(run_id="1234", rank=0, message="rendezvous started")
```

Event logging should never happen unconditionally.
The API must offer an efficient way to test whether an event should be emitted and make it easy for users to filter/sample event emission.
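One shape such a guard could take, purely as a hypothetical sketch (neither `should_log` nor the type registry exist in the RFC):

```py
import random

# Hypothetical registry of enabled event types; a real implementation would
# live inside torch.monitor and be checkable without constructing the Event.
_ENABLED_TYPES = {"torch.elastic.RdzvEvent"}

def should_log(event_type: str, sample_rate: float = 1.0) -> bool:
    """Cheap pre-check so call sites can skip building and logging the event."""
    return event_type in _ENABLED_TYPES and random.random() < sample_rate

if should_log("torch.elastic.RdzvEvent", sample_rate=0.01):
    # Only ~1% of rendezvous events would be constructed and emitted.
    ...  # log(RdzvEvent(run_id="1234", rank=0, message="rendezvous started"))
```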

```py
    # Aggregations is how this stat should be aggregated.
    aggregations: Set[StatType]

    def add(self, v: float) -> None:
```

A model where counters only accumulate data and collectors sample them is much more efficient and predictable. It significantly reduces the impact of badly implemented collectors and lowers the collection cost.

Another thing we should provide is passive counters that produce data on demand through a callback. An example would be a counter for CUDA memory usage, where we only want to measure it once per collector cycle.
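A passive counter along those lines could simply wrap a callback that the collector invokes once per cycle; a self-contained sketch with hypothetical names:

```py
from typing import Callable, Dict, List

class PassiveCounter:
    """Hypothetical counter whose value is produced on demand via a callback."""

    def __init__(self, key: str, read: Callable[[], float]):
        self.key = key
        self.read = read

    def collect(self) -> float:
        # Only sampled when the collector asks, e.g. once per collection cycle.
        return self.read()

def cuda_memory_allocated() -> float:
    import torch
    return float(torch.cuda.memory_allocated()) if torch.cuda.is_available() else 0.0

COUNTERS: List[PassiveCounter] = [
    PassiveCounter("cuda.memory_allocated", cuda_memory_allocated),
]

def collect_all(counters: List[PassiveCounter]) -> Dict[str, float]:
    return {c.key: c.collect() for c in counters}
```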

```py
batch_start = time.time()
y_pred = model(x)
...
BATCH_LATENCY.add(time.time() - batch_start)
```

This looks quite fragile and quite common.
Could we have a context manager that simplifies it?
Something like:

```py
with BATCH_LATENCY.add_time():
    y_pred = model(x)
```
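Such a context manager is straightforward to build on top of `add()`; a minimal sketch using a stand-in stat class (`add_time` is the name suggested here, not an RFC API):

```py
import time
from contextlib import contextmanager

class TimedStat:
    """Minimal stand-in for the RFC's Stat, plus the suggested add_time() helper."""

    def __init__(self, key: str):
        self.key = key
        self.values = []

    def add(self, v: float) -> None:
        self.values.append(v)

    @contextmanager
    def add_time(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the body raises, so failed batches are still timed.
            self.add(time.perf_counter() - start)

BATCH_LATENCY = TimedStat("training.batch.latency")
with BATCH_LATENCY.add_time():
    pass  # y_pred = model(x)
```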


```py
@dataclass
class Stat:
    # key is the name of the stat.
```

Isn't counter a better name for this?

Shouldn't it include help and unit fields as well to make them more discoverable?

```py
from torch.monitor import Stat, StatType

BATCH_LATENCY = Stat(
```

Stats are indirectly registered with collectors when they produce data. This model is problematic because it doesn't allow users to query and filter which counters they are interested in.

Another issue is that we can't record the fact that no data was collected for a specific Stat. An explicit count of zero is a lot easier to handle than a missing entry in the logs.

We should also decouple collection from aggregation so that users can override them on a per-counter basis.

Funnily enough, this example is a great case for that: means and histograms are not a great summary for operation latency, as they are not robust. Some users would rather have P95 and P99 collected.

