Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging SCEs: Use union of features #247

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
201 changes: 174 additions & 27 deletions R/merge_sce_list.R
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ merge_sce_list <- function(
cell_id_column = "cell_id",
include_altexp = TRUE) {

# Check `sce_list`----------------------
## Checks --------------------------
if (is.null(names(sce_list))) {
warning(
glue::glue(
Expand All @@ -79,15 +79,13 @@ merge_sce_list <- function(
return(sce_list)
}

# Check `retain_coldata_cols` ----------------
# Check `retain_coldata_cols`
if (length(retain_coldata_cols) == 0) {
warning("All pre-existing colData will be removed from the the merged SCE.
Please check that `retain_coldata_cols` was correctly specified.")
}

# Subset SCEs to shared features and ensure appropriate naming ------------------

# First, obtain intersection among all SCE objects
# Check for shared features
shared_features <- sce_list |>
purrr::map(rownames) |>
purrr::reduce(intersect)
Expand All @@ -97,6 +95,64 @@ merge_sce_list <- function(
They cannot be merged.")
}

# Check that library id and sample id are present in main SCE metadata
id_checks <- sce_list |>
purrr::map(\(sce){
all(c("library_id", "sample_id") %in% names(metadata(sce)))
}) |>
unlist()
Comment on lines +100 to +103
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
purrr::map(\(sce){
all(c("library_id", "sample_id") %in% names(metadata(sce)))
}) |>
unlist()
purrr::map_lgl(\(sce){
all(c("library_id", "sample_id") %in% names(metadata(sce)))
})


if (!all(id_checks)) {
stop("The metadata for each SCE object must contain `library_id` and `sample_id`.")
}

# Check altExp compatibility, if we are including them
if (include_altexp) {

# Find all altExp names present in the SCE objects.
altexp_names <- sce_list |>
purrr::map(
\(sce) altExpNames(sce)
) |>
purrr::reduce(union)

# For each in altexp_names (if present), do they have the same features?
# If not, error out
for (altexp_name in altexp_names) {

# all altExps for this name
altexp_list <- sce_list |>
purrr::keep(\(sce) altexp_name %in% altExpNames(sce)) |>
purrr::map(altExp, altexp_name)

# find their union of features
altexp_name_features <- altexp_list |>
purrr::map(rownames) |>
purrr::reduce(union) |>
sort()
Comment on lines +128 to +132
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are going to test that they are all equal, we don't need to get the union here: we can just test that all are equal to the first.


# create logical vector for presence of all features
features_present <- altexp_list |>
purrr::map_lgl(
\(alt_sce) identical(altexp_name_features, sort(rownames(alt_sce)))
)

if (!all(features_present)) {
stop(
glue::glue("The {altexp_name} alternative experiments do not share the same set of features.")
)
}
}
}

## Subset SCEs to shared features and ensure appropriate naming ------------------

# First, obtain the union of features for all (main) SCE objects
# this will also be the final order of features/rows in the final SCE
sce_full_features <- sce_list |>
purrr::map(rownames) |>
purrr::reduce(union)
Comment on lines +150 to +154
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going back to intersect here, because I think that is a better default. We could add this as an option to the function, but there may be unintended consequences to having NA values in the matrix for downstream steps, so I don't want to do this now.

Suggested change
# First, obtain the union of features for all (main) SCE objects
# this will also be the final order of features/rows in the final SCE
sce_full_features <- sce_list |>
purrr::map(rownames) |>
purrr::reduce(union)
# First, obtain the intersection of features for all (main) SCE objects
# this will also be the final order of features/rows in the final SCE
sce_shared_features <- sce_list |>
purrr::map(rownames) |>
purrr::reduce(intersect)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting this intersect code is already higher up in the script as a check to ensure there are intersecting features (I moved it there in this PR), so I'll just delete this and use what we've got above.


# Second, determine all the column names that are present in any SCE so it can
# be created in any missing SCEs with `NA` values
all_colnames <- sce_list |>
Expand All @@ -106,35 +162,22 @@ merge_sce_list <- function(
unlist() |>
unique()

# Check that the `retain_coldata_cols` are present in at least one SCE, and
# error if the column exists nowhere.
# Check that the `retain_coldata_cols` are present in at least one SCE
if (!(any(retain_coldata_cols %in% all_colnames))) {
warning("The provided `retain_coldata_cols` are not present in any SCEs.")
warning("The provided `retain_coldata_cols` are not present in any SCE colData.")
}

# check that library id and sample id are present in metadata
id_checks <- sce_list |>
purrr::map(\(sce){
all(c("library_id", "sample_id") %in% names(metadata(sce)))
}) |>
unlist()

if (!all(id_checks)) {
stop("The metadata for each SCE object must contain `library_id` and `sample_id`.")
}

# Prepare main experiment of SCEs for merging --------------------
## Prepare main experiment of SCEs for merging --------------------
sce_list <- sce_list |>
purrr::imap(
prepare_sce_for_merge,
batch_column = batch_column,
cell_id_column = cell_id_column,
shared_features = shared_features,
all_features = sce_full_features,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just call this merge_features? Nothing says we need it to be "all" or "shared", etc.

Also updating the variable to match above changes

Suggested change
all_features = sce_full_features,
merge_features = sce_shared_features,

retain_coldata_cols = retain_coldata_cols,
preserve_rowdata_cols = preserve_rowdata_cols
)


## Handle metadata ---------------------------------------------
# get a list of metadata from the list of sce objects
# each library becomes an element within the metadata components
Expand Down Expand Up @@ -240,8 +283,8 @@ merge_sce_list <- function(
#' colData slot
#' @param cell_id_column The name of the cell_id column which will be added to the
#' colData slot
#' @param shared_features A vector of features (genes) that all SCEs to be merged
#' have in common
#' @param all_features A vector of features that all SCEs are expected to have.
#' If any are missing, they will be added and filled with `NA` values.
Comment on lines +286 to +287
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming update

Suggested change
#' @param all_features A vector of features that all SCEs are expected to have.
#' If any are missing, they will be added and filled with `NA` values.
#' @param merge_features A vector of features that will be included in the output SCE for merging
#' If any are missing, they will be added and filled with `NA` values.

#' @param retain_coldata_cols A vector of columns to retain in the colData slot.
#' If columns are missing from the data, they will be filled with `NA` values.
#' The exceptions to this are `barcode_column` and `batch_column` which will be
Expand All @@ -255,12 +298,20 @@ prepare_sce_for_merge <- function(
sce_name,
batch_column,
cell_id_column,
shared_features,
all_features,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
all_features,
merge_features,

retain_coldata_cols,
preserve_rowdata_cols) {

# Subset to shared features
sce <- sce[shared_features, ]

#### assays #####
# If all features are present, order them to match `all_features` order
# If any features are missing, create a new SCE with those features and pop in
# existing assays and rowData slot
sce <- create_sce_with_all_features(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought that maybe this gets updated to be more general?

I also don't love "create" here if it isn't always creating a new SCE, but it is fine?

Suggested change
sce <- create_sce_with_all_features(
sce <- create_sce_with_features(

sce,
all_features
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
all_features
merge_features

)


##### rowData #####
# Add `sce_name` ID to rowData column names except for those
Expand Down Expand Up @@ -333,6 +384,102 @@ prepare_sce_for_merge <- function(
}


#' Create a new SCE that contains all provided features, or reorder a provided
#' SCE into the given feature order
#'
#' @param sce SCE from which a new SCE will be created, or which will be reordered
#' @param all_features A vector of all features that need to be in the final SCE,
#' in this order
Comment on lines +391 to +392
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a bit of simplification

Suggested change
#' @param all_features A vector of all features that need to be in the final SCE,
#' in this order
#' @param features A vector of the feature names for the final SCE, in output order

#'
#' @return SCE object with all features in the given order
create_sce_with_all_features <- function(
sce,
all_features) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
all_features) {
features) {


present_features <- rownames(sce)
feature_diff <- setdiff(all_features, present_features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth renaming here, since the diff is directional: if there are extra features in present_features, we don't care, and they will be discarded.

Suggested change
feature_diff <- setdiff(all_features, present_features)
missing_features <- setdiff(features, present_features)


if (length(feature_diff) == 0) {

# Simply reorder and return
return(sce[all_features,])
Comment on lines +402 to +405
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (length(feature_diff) == 0) {
# Simply reorder and return
return(sce[all_features,])
if (length(missing_features) == 0) {
# Simply reorder and/or filter and return
return(sce[features,])


} else {

# new matrix with all NAs
new_matrix <- matrix(
data = NA,
nrow = length(all_features),
ncol = ncol(sce),
dimnames = list(
all_features,
colnames(sce)
)
)
Comment on lines +410 to +418
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name updates, but also we can save some effort later if we make this a Matrix::Matrix to match the SCE object and then we can make it sparse, as we expect it to be. Note that I also use NA_real_ to get the right type info in...

Suggested change
new_matrix <- matrix(
data = NA,
nrow = length(all_features),
ncol = ncol(sce),
dimnames = list(
all_features,
colnames(sce)
)
)
new_matrix <- Matrix::Matrix(
data = NA_real_,
nrow = length(all_features),
ncol = ncol(sce),
dimnames = list(
all_features,
colnames(sce)
)
sparse = TRUE
)


# Create new matrix for each present assay
sce_assays <- assayNames(sce)
new_assays <- sce_assays |>
purrr::map(
\(assay_name) {


# fill in existing matrix values
new_matrix[present_features, colnames(sce)] <- as.matrix(
assay(sce, assay_name)[present_features, colnames(sce)]
)
Comment on lines +428 to +430
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use Matrix::Matrix above, you should not need to do the conversion:

Suggested change
new_matrix[present_features, colnames(sce)] <- as.matrix(
assay(sce, assay_name)[present_features, colnames(sce)]
)
new_matrix[present_features, colnames(sce)] <- (
assay(sce, assay_name)[present_features, colnames(sce)]
)

I think I would move the new matrix creation into this map function too. I'm not sure how I feel about reusing the existing one in the case of multiple assays. It might be fine, but if it is it means we are making a copy anyway, so there is not going to be a cost to creating it here.


return(new_matrix)
}
) |>
purrr::set_names(sce_assays)

# Create a new SCE
new_sce <- SingleCellExperiment(assays = new_assays)

# Establish new rowData, filling in NAs for missing features
rowdata_colnames <- colnames(rowData(sce))
new_rowData <- matrix(
Comment on lines +441 to +442
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't be a matrix, because data in rowData is of mixed types. It is probably fine to start with the matrix here to get the dimensions right, but we want the new rowData to be a data frame of some kind.

data = NA,
nrow = length(all_features),
ncol = ncol(rowData(sce)),
dimnames = list(
all_features,
rowdata_colnames
)
)

# Slot in existing rowData, ensuring correct order (as.matrix is needed)
new_rowData[present_features, rowdata_colnames] <- as.matrix(
rowData(sce)[present_features, rowdata_colnames]
)
Comment on lines +453 to +455
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to do any matrix conversion here, because that will not preserve types.

Suggested change
new_rowData[present_features, rowdata_colnames] <- as.matrix(
rowData(sce)[present_features, rowdata_colnames]
)
new_rowData[present_features, rowdata_colnames] <- (
rowData(sce)[present_features, rowdata_colnames]
)


# Add new rowData into new_sce
# TODO: We still need to handle columns like `gene_ids` which are now NAs
rowData(new_sce) <- new_rowData

# Add existing colData & metadata to the new SCE
colData(new_sce) <- colData(sce)
metadata(new_sce) <- metadata(sce)

# Return this new_sce
return(new_sce)
}
}















#' Prepare altExps for merge and create a list of merged altExps for each altExp name
#'
Expand Down
22 changes: 22 additions & 0 deletions man/create_sce_with_all_features.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions man/prepare_sce_for_merge.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading
Loading