Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref(replays): Restrict data scrubbing to sentry payloads #1825

Merged
merged 6 commits into from
Feb 8, 2023

Conversation

jan-auer
Copy link
Member

@jan-auer jan-auer commented Feb 7, 2023

Unconditional data scrubbing can can break recording payloads because it scrubs strings that should be excluded from scrubbing. Eventually, the data scrubber will have to recurse through rrweb's DOM nodes and skip safe sub-trees or fields from scrubbing.

However, the client's default is to redact all strings from the DOM, which greatly reduces the chance of sensitive data ending up in the recording payloads. This leaves just Sentry's own payloads to be scrubbed, which are always top-level events marked with type: 5.

This PR updates the scrubber and introduces a visitor that streams through the top-level event list. When it encounters an item with type 5, it uses the existing transforming deserializer to scrub all strings in the event, otherwise skipping it. This solution allows us to continue scrubbing without a strict schema.

Alternatives

During implementation, there were two main alternatives considered:

  1. Implement a struct for the top-level events that contains the type, timestamp, and opaque data. Inspect the type, conditionally map it to another struct that wraps data in ScrubbedValue, and then serialize that struct.
  2. (implemented) Deserialize the top-level events into RawValue, parse it once with a helper struct to obtain the type. If the type matches, send the entire value through the transformer.

The second approach ultimately ended up in less code with equal to better performance, even though it requires to parse the raw value twice.

Safety

The raw_value feature in the serde_json deserializer is now enabled for all of Relay, since it is not possible to load the same crate another time with a separate set of features. There are no documented caveats on the raw_value feature in serde_json. The implementation uses a special token ("$serde_json::private::RawValue") in a way that makes it is safe for both serialize and deserialize:

  • Serialization: Goes through serialize_struct with the token as name. Triggering this behavior requires a manual implementation of Serialize using this token, which our code does not contain.
  • Deserialization: Goes through deserialize_newtype_struct with the token as name. Again, this requires a manual implementation of Deserialize to trigger accidentally.

Closes #1806

@jan-auer jan-auer self-assigned this Feb 7, 2023
relay-replays/src/recording.rs Outdated Show resolved Hide resolved
relay-replays/src/recording.rs Outdated Show resolved Hide resolved
@jan-auer jan-auer marked this pull request as ready for review February 8, 2023 08:29
@jan-auer jan-auer requested a review from a team February 8, 2023 08:29
Copy link
Contributor

@olksdr olksdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@jan-auer
Copy link
Member Author

jan-auer commented Feb 8, 2023

@jan-auer jan-auer merged commit f2944ff into master Feb 8, 2023
@jan-auer jan-auer deleted the ref/replays-skip-rrweb branch February 8, 2023 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PII scrubbing breaks a replay
2 participants