Standard for expected v1.0 #217

Closed
nvictus opened this issue Feb 23, 2021 · 9 comments
@nvictus
Member

nvictus commented Feb 23, 2021

The goal will be to make all tools conformant to these formats and conventions.

Formats

The following formats are tab-separated when stored as text and must include a header line for column names. The diag columns are treated as "dense", meaning all submatrix diagonals, starting from 0, should be included.

Intra-chromosomal regional

  • indexable columns (required): region, diag
  • summary value columns: n_valid, count.sum, balanced.sum, etc.

Example:

region	diag	n_valid	count.sum	balanced.sum	count.avg	balanced.avg
chr1p	0	2339	nan	nan	nan	nan
chr1p	1	2323	nan	nan	nan	nan
chr1p	2	2316	3271349.0	50.62124450976966	1412.4995682210708	0.021857186748605206
chr1p	3	2310	2413365.0	37.436586233175476	1044.7467532467533	0.01620631438665605
chr1p	4	2305	1944357.0	30.24594826305476	843.5388286334056	0.013121886448179939
chr1p	5	2300	1629057.0	25.40835095474898	708.285652173913	0.011047109110760426
chr1p	6	2295	1389263.0	21.743878766308743	605.3433551198257	0.009474456978783767
chr1p	7	2290	1202681.0	18.92056641946033	525.1882096069869	0.008262256078366956
...
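The "dense" requirement above can be checked mechanically. The following is a minimal sketch, assuming the table is already loaded into a pandas DataFrame; the function name `diags_are_dense` and its parameters are hypothetical and not part of any existing API. It takes the statement "all submatrix diagonals, starting from 0" literally.

```python
import pandas as pd

def diags_are_dense(df, region_cols=("region",), diag_col="diag"):
    """Hypothetical check: within each region, diagonals must be exactly 0, 1, 2, ..."""
    for _, group in df.groupby(list(region_cols)):
        diags = sorted(group[diag_col].tolist())
        # A dense set of n diagonals starting at 0 is exactly range(n).
        if diags != list(range(len(diags))):
            return False
    return True

dense = pd.DataFrame({"region": ["chr1p"] * 3, "diag": [0, 1, 2]})
gappy = pd.DataFrame({"region": ["chr1p"] * 3, "diag": [0, 1, 3]})
dense_ok = diags_are_dense(dense)
gappy_ok = diags_are_dense(gappy)
```

For the bi-regional format, the same sketch could be called with `region_cols=("region1", "region2")`.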

Intra-chromosomal bi-regional

  • indexable columns (required): region1, region2, diag
  • summary value columns: n_valid, count.sum, balanced.sum, etc.

Example:

region1	region2	diag	n_valid	count.sum	balanced.sum	count.avg	balanced.avg
chr1p	chr1q	1	0	0.0	0.0	nan	nan
chr1p	chr1q	2	0	0.0	0.0	nan	nan
chr1p	chr1q	3	0	0.0	0.0	nan	nan
...
chr1p	chr1q	512	1	79.0	0.00016265066917626247	79.0	0.00016265066917626247
chr1p	chr1q	513	1	77.0	0.00018764562827233117	77.0	0.00018764562827233117
chr1p	chr1q	514	1	80.0	0.00025875086217912906	80.0	0.00025875086217912906
chr1p	chr1q	515	2	99.0	0.00034619239710538346	49.5	0.00017309619855269173
chr1p	chr1q	516	3	111.0	0.0005523207078083529	37.0	0.00018410690260278428
chr1p	chr1q	517	4	138.0	0.0011315473845712298	34.5	0.00028288684614280745
...

Inter-chromosomal bi-regional

  • indexable columns (required): region1, region2
  • summary value columns: n_valid, count.sum, balanced.sum, etc.

Example:

region1	region2	n_valid	count.sum	balanced.sum	count.avg	balanced.avg
chr1	chr2	19598120	8242429.0	129.8439506278517	0.4205724324578072	6.625326849098368e-06
chr1	chr3	16469440	7091424.0	112.10478081136287	0.43058076048730254	6.806836225843919e-06
chr1	chr4	15681920	6306526.0	103.98342811014709	0.4021526700812145	6.630784247729047e-06
chr1	chr5	14646160	5772584.0	93.46547681049861	0.3941363470015349	6.381568739553481e-06
chr1	chr6	13589000	5924095.0	94.7099193491221	0.43594782544705274	6.969601835979256e-06
chr1	chr7	12514720	5511574.0	89.59489226713862	0.4404072963677973	7.159160753667571e-06
chr1	chr8	11611640	6019429.0	83.05476606656579	0.5183961094212359	7.152716245643663e-06
chr1	chr9	9150640	4204874.0	66.85595583095115	0.4595169299633687	7.306150808134857e-06
chr1	chr10	10494560	5263222.0	75.59315171810027	0.5015190727386379	7.20307966394973e-06
...

Interpreting region columns

In general, region names are given as mnemonic names. Their actual coordinates are specified in a separate BED-like region table with schema [chrom, start, end, name]. Like BED files, these tables generally do not have header lines when saved as text.

Example:

chr1	0	123479591	chr1p
chr1	123479591	248956422	chr1q
chr2	0	93139351	chr2p
chr2	93139351	242193529	chr2q
chr3	0	92214016	chr3p
chr3	92214016	198295559	chr3q
chr4	0	50728006	chr4p
chr4	50728006	190214555	chr4q
chr5	0	48272853	chr5p
chr5	48272853	181538259	chr5q
...

In the simple case when using whole chromosomes as regions, the region table may be omitted. Application code should assume this is the case when a region table is not provided.
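As a rough illustration of the two cases above, here is a sketch that reads a headerless region table and, when none is provided, falls back to whole chromosomes. The helper names `read_region_table` and `whole_chromosome_regions`, and the `chromsizes` mapping of chromosome name to length, are assumptions made for this example only.

```python
import io
import pandas as pd

REGION_COLUMNS = ["chrom", "start", "end", "name"]

def read_region_table(path_or_buffer):
    """Hypothetical helper: read a headerless, tab-separated BED-like region table."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=REGION_COLUMNS,
        dtype={"chrom": str, "start": "int64", "end": "int64", "name": str},
    )

def whole_chromosome_regions(chromsizes):
    """Hypothetical fallback: one region per chromosome, named after the chromosome."""
    names = list(chromsizes.keys())
    return pd.DataFrame(
        {
            "chrom": names,
            "start": 0,  # scalar is broadcast to every row
            "end": [chromsizes[name] for name in names],
            "name": names,
        }
    )

bed_text = "chr1\t0\t123479591\tchr1p\nchr1\t123479591\t248956422\tchr1q\n"
regions = read_region_table(io.StringIO(bed_text))
fallback = whole_chromosome_regions({"chr1": 248956422, "chr2": 242193529})
```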

Amendment (2021-03-02) (amended by Ilya, diag -> sep)

Intra-chromosomal regional will be dropped in favor of the bi-regional format. Symmetric intra-chromosomal zones will use the same name for region1 and region2.
diag will be renamed to sep to accommodate potentially more exotic definitions of separation.

  • indexable columns (required): region1, region2, sep
  • summary value columns: n_valid, count.sum, balanced.sum, etc.
  • for symmetric zones, the region name must be repeated

Example:

region1	region2	sep	n_valid	count.sum	balanced.sum	count.avg	balanced.avg
chr1p	chr1q	1	0	0.0	0.0	nan	nan
chr1p	chr1q	2	0	0.0	0.0	nan	nan
chr1p	chr1q	3	0	0.0	0.0	nan	nan
...
chr1p	chr1q	512	1	79.0	0.00016265066917626247	79.0	0.00016265066917626247
chr1p	chr1q	513	1	77.0	0.00018764562827233117	77.0	0.00018764562827233117
chr1p	chr1q	514	1	80.0	0.00025875086217912906	80.0	0.00025875086217912906
chr1p	chr1q	515	2	99.0	0.00034619239710538346	49.5	0.00017309619855269173
chr1p	chr1q	516	3	111.0	0.0005523207078083529	37.0	0.00018410690260278428
chr1p	chr1q	517	4	138.0	0.0011315473845712298	34.5	0.00028288684614280745
...
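A table in the older regional layout can be brought in line with the amendment by renaming diag and repeating the region name. This is only a sketch under that reading of the amendment; `upgrade_expected` is a hypothetical name, not an existing function.

```python
import pandas as pd

def upgrade_expected(df):
    """Hypothetical converter from the legacy layout to (region1, region2, sep)."""
    out = df.rename(columns={"diag": "sep"})  # rename returns a new frame
    if "region" in out.columns:
        # Symmetric zone: the single region name is repeated in both columns.
        out.insert(0, "region1", out["region"])
        out.insert(1, "region2", out["region"])
        out = out.drop(columns="region")
    return out

legacy = pd.DataFrame(
    {"region": ["chr1p", "chr1p"], "diag": [0, 1], "n_valid": [2339, 2323]}
)
upgraded = upgrade_expected(legacy)
```

Tables that already have region1 and region2 columns only get the diag to sep rename.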
@sergpolly
Member

sergpolly commented Mar 1, 2021

We could "safely" register expected from a file using pd.read_table(..., usecols=expected_schema, ...) inside a try/except. Then, if region_table (the table defining the regions used) is not provided, we could try to generate it using parse_regions (works with UCSC-style and full chromosome names, but fails for region nicknames like chr1p). By the end of such a "sanitation" procedure we'd have a region_table and an expected that are compatible with each other.

*maybe check compatibility with cooler?

Here is a more detailed example, with several schemas for expected:
https://gist.github.com/sergpolly/c82b77ae12a82453b6a7aa98292f82cd

PS
Given how easy it is to generate a region_table from UCSC/chroms/tuples/whatever, isn't it reasonable to require it at the Python API level? Otherwise the same functionality (parsing/guessing of region_table) needs to be replicated inside the API functions (pileups, saddles, etc.), i.e. lots and lots of bioframe.parse_regions throughout the code?
What do you think?
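For readers who want a concrete picture of the "sanitation" idea in this comment, here is a rough sketch. It is not the code from the linked gist: `EXPECTED_SCHEMA`, `read_expected`, `infer_region_table`, and the `chromsizes` mapping are hypothetical names chosen for illustration, and the fallback only handles whole-chromosome names in plain pandas rather than calling parse_regions.

```python
import io
import pandas as pd

# Assumed schema for illustration; the thread does not fix the value columns.
EXPECTED_SCHEMA = ["region1", "region2", "sep", "n_valid", "count.sum", "balanced.sum"]

def read_expected(path_or_buffer, schema=EXPECTED_SCHEMA):
    """Hypothetical helper: read only the schema columns, failing loudly on a mismatch."""
    try:
        # pandas raises ValueError when a usecols name is absent from the header.
        return pd.read_table(path_or_buffer, usecols=schema)
    except ValueError as err:
        raise ValueError(f"could not read expected with schema {schema}") from err

def infer_region_table(expected, chromsizes):
    """Hypothetical fallback: build a region table when regions are whole chromosomes."""
    names = pd.concat([expected["region1"], expected["region2"]]).unique()
    unknown = [name for name in names if name not in chromsizes]
    if unknown:
        # Nicknames such as chr1p cannot be resolved without a region table.
        raise ValueError(f"region table required to resolve: {unknown}")
    return pd.DataFrame(
        {
            "chrom": list(names),
            "start": 0,
            "end": [chromsizes[name] for name in names],
            "name": list(names),
        }
    )

text = (
    "region1\tregion2\tsep\tn_valid\tcount.sum\tbalanced.sum\n"
    "chr1\tchr1\t0\t10\t5.0\t0.1\n"
    "chr2\tchr2\t0\t8\t4.0\t0.2\n"
)
expected = read_expected(io.StringIO(text))
region_table = infer_region_table(expected, {"chr1": 248956422, "chr2": 242193529})
```

Whether such a fallback lives in one reader function or is required from the caller is exactly the design question raised in the PS.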

@Phlya
Member

Phlya commented Mar 2, 2021

Nice, some good ideas there @sergpolly!

I think providing region_table should be encouraged as good practice, but if it's not provided in the case of whole chromosomes (or UCSC-style region names, but I like that less...), it can be generated easily every time. Perhaps we can make a function to read expected, based on your code there, which would just do that?

@gfudenberg gfudenberg mentioned this issue Sep 4, 2021
41 tasks
@gfudenberg
Member

Seems there are a few other properties in the above that weren't formalized and that would be useful for a check.is_expected function (similar to how we now have bedframes/viewframes formalized in bioframe):

  • the combination (region,diag) should be unique
  • indexable (region, diag) columns should not have NAs
  • dtypes (region=str, diag=int, count=int, balanced=float, ...) as per Sergey's suggestion
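The three properties listed above could be sketched as a single predicate. This is not the actual check.is_expected implementation; the signature is an assumption, the index columns default to the (region, diag) naming used in this comment, and value columns are only required to be numeric because the examples earlier in the thread store count.sum as a float.

```python
import pandas as pd
from pandas.api import types as pdt

def is_expected(df, index_cols=("region", "diag")):
    """Hypothetical validity check for an expected table (sketch)."""
    index_cols = list(index_cols)
    if not set(index_cols).issubset(df.columns):
        return False
    index = df[index_cols]
    if index.isna().any().any():  # indexable columns must not contain NAs
        return False
    if index.duplicated().any():  # each index combination must be unique
        return False
    *region_cols, sep_col = index_cols  # last index column is the diagonal/separation
    if not all(pdt.is_string_dtype(df[col]) for col in region_cols):
        return False
    if not pdt.is_integer_dtype(df[sep_col]):
        return False
    value_cols = df.columns.difference(index_cols)
    return all(pdt.is_numeric_dtype(df[col]) for col in value_cols)

good = pd.DataFrame(
    {
        "region": ["chr1p", "chr1p"],
        "diag": [0, 1],
        "n_valid": [2339, 2323],
        "balanced.avg": [float("nan"), 0.02],
    }
)
duplicated = pd.concat([good, good.iloc[[0]]], ignore_index=True)
good_ok = is_expected(good)
duplicated_ok = is_expected(duplicated)
```

For the amended layout, the same sketch would be called with `index_cols=("region1", "region2", "sep")`.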

@Phlya
Member

Phlya commented Sep 4, 2021

We renamed diag to sep in the standard btw! In case someone wants to use some more fancy expected.

@gfudenberg
Member

ah yeah, I'll add that update to the 0.5.0 roadmap

@sergpolly
Member

diag -> sep -> dist ...

@gfudenberg
Member

@sergpolly does the recent commit close this issue? Or should we convert it to a discussion? Or add some todo for adding these as schemas (cc @nvictus)

@sergpolly
Member

oh - a couple todos:

  • the combination (region,diag) should be unique

  • rename diag to dist everywhere

everything else was addressed by the recent PR

again will do as a PR with small fixes for expected

@sergpolly
Member

done in #296

4 participants