
[nexus] Re-implement disk attach/detach #1106

Merged: 32 commits merged into main from fix-attach on May 27, 2022

Conversation

@smklein (Collaborator) commented May 23, 2022

Fixes #1073

Summary of previous version

When a request to attach a disk arrived at Nexus, the following would be performed:

  1. Check the number of disks attached to an instance
  2. Check the disk state (valid for attach?)
  3. Check the instance state (valid for disk attach?)
  4. (If the instance is running) Send a request to the Sled agent to update the state
  5. Update the disk state in the DB

Problems with previous version

  • None of these steps is guarded against concurrent access - it's possible for another disk to be attached after the "max disk attach count" check but before the DB has been updated. Similarly, the disk/instance state can change after the check but before the update (see the sketch below).
  • The previous version allowed attaching disks to running instances (by sending requests to the sled agent), but this path (1) is not implemented, and (2) has some subtle races that need a more holistic solution. For example: what if concurrent requests tried to attach the same disk to two different instances? Although Nexus should prevent this case, if both requests pass the checks before either updates the DB, both operations would be issued.
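For illustration, here is a minimal sketch of that check-then-update shape. This is not the actual Nexus code: the types, names, and limit below are invented, and the real "database" is shared state rather than a local map, which is exactly why the window between the checks and the update is racy.

use std::collections::HashMap;

struct Disk {
    attached_to: Option<u64>,
}

// Illustrative limit, standing in for the real "max disks per instance" value.
const MAX_DISKS: usize = 8;

fn attach_disk(
    db: &mut HashMap<u64, Disk>,
    disk_id: u64,
    instance_id: u64,
) -> Result<(), &'static str> {
    // Steps 1-3: checks, issued as separate reads.
    let attached = db.values().filter(|d| d.attached_to == Some(instance_id)).count();
    if attached >= MAX_DISKS {
        return Err("too many disks attached");
    }
    if db.get(&disk_id).ok_or("no such disk")?.attached_to.is_some() {
        return Err("disk already attached");
    }

    // Race window: against a real database, a concurrent request can pass the
    // same checks here, before either caller has written anything back.

    // Steps 4-5: update the disk state.
    db.get_mut(&disk_id).unwrap().attached_to = Some(instance_id);
    Ok(())
}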

New implementation

  • Add CTEs for attach and detach that atomically, within a single statement: (1) check instance/disk state, (2) update instance/disk state, and (3) return any information the caller needs (e.g., whether the resources exist).

TODOs:

  • Q: Do we need the rcgen in instances? A: No! Removed.
  • Deal with the disk-attach-during-instance-destroy race condition. Done via "collection_detach_many"
  • Be more consistent about units for the "max attached resources" value
  • Would love to de-duplicate more trait bounds between the "collection_*" CTEs. However, I think I may be close to my limits here until more progress is made on rust-lang/rust#44491 (tracking issue for RFC 2089: implied bounds).

/// // /* Make a decision on whether or not to apply ANY updates */
/// // do_update AS (
/// // SELECT IF(
/// // EXISTS(SELECT id FROM collection_info) AND
@smklein (Collaborator, Author) May 23, 2022

Note: this may be a simpler implementation of the "dueling administrators" solution proposed in RFD 192.

Basically, to perform a conditional UPDATE and check for existence of an object, that proposal relies on a division-by-zero to do an early exit before performing the UPDATE itself.

However, it seems possible to:

  • Avoid the early exit
  • Make the UPDATE conditional on whatever subqueries we want
  • Return the results of multiple rows, letting the caller figure out what happened (did the object exist? did we perform the update? etc.)

This version is slightly more verbose, but the lack of "division by zero" makes it seem simpler to me.
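To make that concrete, the rough shape is something like the following. This is illustrative only - the table, column, and CTE names are simplified stand-ins, not the exact SQL this PR generates - shown here as a Rust constant holding the statement text:

// Illustrative shape only (simplified names, not the generated statement).
const CONDITIONAL_UPDATE_SKETCH: &str = r#"
    WITH
      /* Does the target collection row exist (and is it not deleted)? */
      collection_info AS (
          SELECT id FROM collection
          WHERE id = $1 AND time_deleted IS NULL
      ),
      /* Make a decision on whether or not to apply ANY updates */
      do_update AS (
          SELECT EXISTS(SELECT id FROM collection_info) AS proceed
      ),
      /* The UPDATE only fires if that decision was "true" - no early exit */
      updated AS (
          UPDATE resource SET collection_id = $1
          WHERE id = $2 AND (SELECT proceed FROM do_update)
          RETURNING id
      )
    /* Return the intermediate results so the caller can tell what happened */
    SELECT
      (SELECT COUNT(*) FROM collection_info) AS collection_exists,
      (SELECT COUNT(*) FROM updated) AS rows_updated;
"#;

The caller can then distinguish "the collection does not exist" from "the collection exists but the update did not apply" without the statement erroring out.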

smklein marked this pull request as ready for review May 25, 2022 04:14
smklein requested review from jmpesp and plotnick May 25, 2022 04:14
smklein commented May 25, 2022

This change should also resolve the race condition mentioned in this comment by @plotnick : https://github.com/oxidecomputer/omicron/pull/1068/files#r873063471

@@ -0,0 +1,1480 @@
// This Source Code Form is subject to the terms of the Mozilla Public
@smklein (Collaborator, Author)

Since there's a lot here, I'm going to go ahead and try to give a "tour" of the contents of this CTE.

collection_detach.rs and collection_detach_many.rs are similar in layout.

use diesel::sql_types::{BigInt, Nullable, SingleValue};
use std::fmt::Debug;

/// The table representing the collection. The resource references
@smklein (Collaborator, Author)

Right out of the gate, I'm using a ton of type aliases. The DatastoreAttachTarget<ResourceType> trait is implemented on a collection (after all, "collections" hold "resources"), so nearly all of these act on "resource type, collection type".

Most of these are just short-hands for "get me the key/column/table/etc. belonging to the resource/collection".
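For instance, one such alias might look roughly like this. This is an illustrative sketch only - the exact alias names and definitions live in this file - and it assumes diesel::Column and the trait defined below are in scope:

// "The table that the resource's ID column belongs to."
type ResourceTable<ResourceType, C> =
    <<C as DatastoreAttachTarget<ResourceType>>::ResourceIdColumn as Column>::Table;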

/// type ResourceTimeDeletedColumn = disk::dsl::time_deleted;
/// }
/// ```
pub trait DatastoreAttachTarget<ResourceType>: Selectable<Pg> + Sized {
@smklein (Collaborator, Author)

This is the trait I would expect clients to implement - just pointing to the types we actually need, by steering towards the right columns.

A real implementation exists in db/model/instance.rs.
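For orientation, a hedged sketch of what that client-side impl might look like. The associated type names come from the trait discussed here, but the specific column names are assumptions for illustration, not copied from db/model/instance.rs:

impl DatastoreAttachTarget<Disk> for Instance {
    type Id = Uuid;

    // Columns on the collection ("instance") table.
    type CollectionIdColumn = instance::dsl::id;
    type CollectionTimeDeletedColumn = instance::dsl::time_deleted;

    // Columns on the resource ("disk") table, including the foreign key
    // pointing back at the collection.
    type ResourceIdColumn = disk::dsl::id;
    type ResourceCollectionIdColumn = disk::dsl::attach_instance_id;
    type ResourceTimeDeletedColumn = disk::dsl::time_deleted;
}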

type CollectionIdColumn: Column;

/// The time deleted column in the CollectionTable
type CollectionTimeDeletedColumn: Column<Table = <Self::CollectionIdColumn as Column>::Table>
@smklein (Collaborator, Author)

By stating Column<Table = <Self::CollectionIdColumn as Column>::Table>, we're forcing this column to belong to the "Collection" table.

This prevents users of this API from accidentally pointing the "Collection" columns at the "Resource" table.
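The same trick works with plain traits, independent of Diesel. A tiny self-contained example of the pattern (toy trait names, for illustration only):

trait Col {
    type Table;
}

trait Target {
    type IdColumn: Col;
    // TimeDeletedColumn is forced to come from the same table as IdColumn;
    // an impl that points it at a column from a different table won't compile.
    type TimeDeletedColumn: Col<Table = <Self::IdColumn as Col>::Table>;
}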

+ Default
+ ExpressionMethods;

/// Creates a statement for attaching a resource to the given collection.
@smklein (Collaborator, Author)

If you wanna compare this CTE with the others, the signature of this function is something to keep an eye out for.

It's basically our statement of "how do we expect the API user to interact with this CTE"?

My goal was to ingest:

  1. UUIDs
  2. "Query statements"
  3. "Update statements"

Since we use them elsewhere in the datastore, I figured that language would be easy to understand.
A fair bit of the CTE trait bounds are devoted to preventing misuse of these APIs.
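A call site might then look roughly like this. This is a sketch only: the argument order, the constants, and the filter expressions are assumptions meant to show the "UUIDs + query statements + update statement" shape, not the exact signature:

let attach_query = Instance::attach_resource(
    instance_id,  // UUID of the collection (the instance)
    disk_id,      // UUID of the resource (the disk)
    // "Query statement": which instance states are acceptable for attach.
    instance::table
        .into_boxed()
        .filter(instance::dsl::state.eq_any(ok_to_attach_instance_states)),
    // "Query statement": which disk states are acceptable for attach.
    disk::table
        .into_boxed()
        .filter(disk::dsl::disk_state.eq_any(ok_to_attach_disk_states)),
    // The "max attached resources" threshold.
    MAX_DISKS_PER_INSTANCE,
    // "Update statement": applied only if every check passes.
    diesel::update(disk::dsl::disk).set((
        disk::dsl::disk_state.eq(attached_state_label),
        disk::dsl::attach_instance_id.eq(instance_id),
    )),
);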

Comment on lines +267 to +271
.filter(
Self::ResourceCollectionIdColumn::default()
.eq(collection_id),
)
.filter(Self::ResourceTimeDeletedColumn::default().is_null())
@smklein (Collaborator, Author)

We ensure the "count of attached resources" clause is always scoped to non-deleted resources that point back to the collection, based on the provided collection_id.

Comment on lines +286 to +287
.filter(collection_table().primary_key().eq(collection_id))
.filter(Self::CollectionTimeDeletedColumn::default().is_null()),
@smklein (Collaborator, Author)

We force the user-supplied queries to look at the same collection_id, and still be non-deleted.

This means someone can't accidentally call this to modify the wrong resource:

Instance::attach_resource(
  collection_id,
  resource_id,
  collection::table.into_boxed().filter(collection_id.eq(a_different_collection_id))
  ...
)

If they did, it would always result in "no update", since we'd filter by ID = 1 AND ID = 2

Comment on lines +291 to +293
.filter(resource_table().primary_key().eq(resource_id))
.filter(Self::ResourceTimeDeletedColumn::default().is_null())
.filter(Self::ResourceCollectionIdColumn::default().is_null()),
@smklein (Collaborator, Author)

We enforce a similar constraint on the resource query. We also enforce that the "foreign key" field is actually NULL, as it should be before we call "attach".

Comment on lines +352 to +354
/// Result of [`AttachToCollectionStatement`] when executed asynchronously
pub type AsyncAttachToCollectionResult<ResourceType, C> =
Result<(C, ResourceType), AttachError<ResourceType, C, PoolError>>;
@smklein (Collaborator, Author)

This is the result type I expect most users to interact with; a small usage sketch follows the list below.

Either:

  1. "Here is the collection + resource you updated", or
  2. "We did not do the update, and here's an AttachError explaining why"

Comment on lines +489 to +498
/// 1. (collection_by_id, resource_by_id): Identify if the collection and
/// resource objects exist at all.
/// 2. (resource_count): Identify if the number of resources already attached to
/// the collection exceeds a threshold.
/// 3. (collection_info, resource_info): Checks for arbitrary user-provided
/// constraints on the collection and resource objects.
/// 4. (do_update): IFF all previous checks succeeded, make a decision to perform
/// an update.
/// 5. (updated_resource): Apply user-provided updates on the resource -
/// presumably, setting the collection ID value.
@smklein (Collaborator, Author)

This hopefully describes the meat of the CTE.

The TL;DR: We want to check a few things, then do an update if all those checks pass, and we want it to be atomic, even though multiple tables are involved.

smklein requested a review from bnaecker May 25, 2022 16:04
@jmpesp (Contributor) left a comment

🚀 with some questions and suggestions

nexus/src/app/instance.rs (outdated review thread, resolved)
nexus/src/app/instance.rs (outdated review thread, resolved)
let resource_exists_query = Box::new(
resource_table()
.into_boxed()
.filter(resource_table().primary_key().eq(resource_id))
Contributor

can we guarantee that primary_key is always only id? why not the more specific ResourceIdColumn?

@smklein (Collaborator, Author)

To be clear, you're suggesting "can we guarantee this is the only call to filter which contains an ID?"

In this case, we're constructing the entire resource_exists_query from scratch - the only user input is the UUID - so I don't think this check is explicitly necessary. Re-phrasing: I don't think such a check would prevent any user behavior.

Also, unfortunately, it's a little tricky to parse the whole list of filters applied through Diesel (it's easy to ask whether there are any filters, but hard to find a specific one, especially since this query is boxed).

@smklein (Collaborator, Author)

FWIW, we do have the bound:

BoxedQuery<ResourceTable<ResourceType, Self>>: FilterBy<Eq<ResourcePrimaryKey<ResourceType, Self>, Self::Id>>

which states that the resource table should be filterable by comparing the ID type with the ResourcePrimaryKey.

However, the Diesel table type:

https://docs.diesel.rs/master/diesel/prelude/trait.Table.html

has an associated type PrimaryKey, but I'm not 100% sure how to connect it back to the Column type.

Contributor

As mentioned in the linked PrimaryKey documentation, I'm concerned about the case of

If the table has a composite primary key, this will be a tuple.

where ResourceIdColumn is not a tuple. Though, looking at that bound, I can squint and convince myself that if primary_key() returns a tuple then compilation would fail :)

@smklein (Collaborator, Author)

Yeah, I think the primary_key() and Self::Id types would need to match for this to work... since the CollectionIdColumn must implement Column, I think this CTE as currently implemented would not allow for composite PKs.

... but as you said, that would at least be a compile-time error, rather than a runtime error.

nexus/src/db/collection_attach.rs (review thread, resolved)
/// The unchanged resource and collection are returned as a part of this
/// error; it is the responsibility of the caller to determine which
/// condition was not met.
NoUpdate { attached_count: i64, resource: ResourceType, collection: C },
Contributor

there's a couple of match branches on this variant that state something like "this error didn't tell us enough", leading me to think we should attach more context or information here.

@smklein (Collaborator, Author)

What more information would you want? Is there a match branch in particular you're looking at?

@smklein (Collaborator, Author)

I added more context below - if you're talking about cases where we return 500 errors, those should be impossible to hit, so I'm not sure more context would help.

nexus/src/db/collection_detach.rs (review thread, resolved)
Comment on lines +1237 to +1242
// We can't attach, but the error hasn't
// helped us infer why.
return Err(Error::internal_error(
"cannot attach disk"
));
}
Contributor

(this is an example of the error not providing enough context, referenced in another review comment)

@smklein (Collaborator, Author)

Any case where I return "internal_error" implies a CTE bug to me - that's why we return internal_error (a 500 error) for these cases.

In this particular case:

  • We are trying to attach a disk to an instance
  • The disk and instance both exist, and "Time deleted" is NULL
  • We are not at the capacity of "number of attached disks"
  • Both instance and disk appeared in valid states
  • and yet the update did not occur

This should not happen - if I saw this case, I'd want to debug it, hence the 500 error - but I do not think any valid input should be able to trigger it.

Contributor

Right - I'd want to debug that too, but all I would see internally is "cannot attach disk". The error hasn't helped us infer why at all, like the comment says, and I don't see any other existing context in parse_result that can be attached to the NoUpdate errors which would point to additional investigation steps.

@smklein (Collaborator, Author)

I agree with this, but I'm still not sure what other info to attach. I think if this case happened, we'd need to perform manual inspection of the DB to infer why it happened.

I view this as akin to "attempting, but failing, to deserialize a value from the database". I don't think it's appropriate to dump raw strings all the way back to the client, but it's still worth flagging.

smklein merged commit 5a0d1e0 into main May 27, 2022
smklein deleted the fix-attach branch May 27, 2022 19:14
leftwo pushed a commit that referenced this pull request Jan 28, 2024
Crucible changes
Remove a superfluous copy during write serialization (#1087)
Update to progenitor v0.5.0, pull in required Omicron updates (#1115)
Update usdt to v0.5.0 (#1116)
Do not panic on reinitialize of a downstairs client. (#1114)
Bump (tracing-)opentelemetry(-jaeger) (#1113)
Make the Guest -> Upstairs queue fully async (#1086)
Switch to per-block ownership (#1107)
Handle timeout in the client IO task (#1109)
Enforce buffer alignment (#1106)
Block size buffers (#1105)
New dtrace probes and a counter struct in the Upstairs. (#1104)
Implement read decryption offloading (#1089)
Remove Arc + Mutex from Buffer (#1094)
Comment cleanup and rename of DsState::Repair -> Reconcile (#1102)
do not panic the dynamometer for OOB writes (#1101)
Allow dsc to start the downstairs in read-only mode. (#1098)
Use the omicron-zone-package methods for topo sorting (#1099)
Package with topological sorting (#1097)
Fix clippy lints in dsc (#1095)

Propolis changes:
PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626)
PHD: fix missing newlines in serial.log (#622)
PHD: fix run_shell_command with multiline commands (#621)
PHD: fix `--artifact-directory` not doing anything (#618)
Update h2 dependency
Update Crucible (and Omicron) dependencies
PHD: refactor guest serial console handling (#615)
phd: add basic "migration-from-base" tests + machinery (#609)
phd: Ensure min disk size fits read-only parents (#611)
phd: automatically fetch `crucible-downstairs` from Buildomat (#604)
Mitigate behavior from illumos#16183
PHD: add guest adapter for WS2022 (#607)
phd: include error cause chain in failure output (#606)
add QEMU pvpanic ISA device (#596)
Add crucible-mem backend
Make crucible opt parsing more terse in standalone
leftwo added a commit that referenced this pull request Jan 29, 2024
Crucible and Propolis changes: the same list as in the commit referenced above.

Co-authored-by: Alan Hanson <[email protected]>

Successfully merging this pull request may close these issues.

Disk Attach has some race conditions
3 participants