Access Data Release

`s3://lovelywater2/` : Serratus data-warehouse

Current Version: v230110

Versioned and structured data releases are freely hosted on AWS S3 in our data-warehouse: "lovelywater2".

Unstructured data and intermediate files are in the Working Data Directories.

Structured Data Types

- Search sequence references
- SRA Run Info Queries
- Summary-level data
- Alignment-level data (.bam or .pro, see notes below)
- Assembly-level data
- RdRP barcode sequences (PALMdb)

Folder organization

## Folder organization
                                                                               NEW/UPDATED
s3://lovelywater2/     # A Read-Only Archive of Serratus Data Releases
⦿ Common files
├── assembly/         # Viral assembly and annotation data                     
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses             
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...         
│   └─── annotation/  # CoV annotation and taxonomic assignments
├── seq/              # Reference sequences used in data-releases      
│   └─── cov3ma/      # Nucleotide viral pangenome
│   └─── protref5/    # Protein viral panproteome
│   └─── rdrp1/       # viral RNA dependent RNA polymerase collection 1
│   └─── rdrp5/       # dark  RNA dependent RNA polymerase collection 5        ***
├── sra/              # sraRunInfo.csv files and queries for data (per query)
│   └─── README.md    # see github.com/ababaian/serratus/wiki/SRA-queries      ***
│   └─── *query*      # (see below)                                            ***
⦿ Nucleotide search files
├── bam/              # .bam    : Aligned files
├── summary/          # .summary: Original alignment summaries (deprecated)  
├── summary2/         # .summary: Alignment summaries
⦿ Translated-nucleotide (protein) search files
├── pro/              # .pro.gz : Translated-nucleotide alignments (diamond)
├── psummary/         # .psummary: Protein alignment summaries
⦿ RdRP 1 translated-nucleotide search files
├── rpro/              # .pro.gz : Aligned files                              ***
├── rsummary/          # .psummary: Alignment summaries for rdrp-search       ***
⦿ Dark RdRP 5 translated-nucleotide search files
├── dpro/              # .pro.gz : Aligned files                              ***
├── dsummary/          # .psummary: Alignment summaries for rdrp-search       ***
⦿ Index Files
├ index.tsv           # Index file of completed SRA accessions
├ pindex.tsv          # Index file of completed protein SRA accessions
├ rindex.tsv          # Index file of completed rdrp SRA accessions           ***
├ dindex.tsv          # Index file of completed dark rdrp SRA accessions      ***
├ LICENSE.md          #
└ README.md           # This README.md                                        **

s3://lovelywater2/sra/
* QUERY SETS *
├ v201210/               # Query sets from major version v210225 and prior
├ v220113/               # Query sets from major version v210225
└ v230116_SraRunInfo.csv # master query CSV for v230116                          ***

Naming Convention

All folders are flat, with files named {sra_accession}.{ext}

For example, the SRA library SRA123456 processed in the 'viro' query will have the files:

s3://lovelywater2/bam/SRA123456.bam
s3://lovelywater2/summary/SRA123456.summary
s3://lovelywater2/assembly/contigs/SRA123456.coronaSPAdes.gene_clusters.fa
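
The convention above can be captured in a small shell helper. This is a minimal sketch: the function name `serratus_paths` is hypothetical (not part of Serratus), and it only prints the three example paths for a given accession. Whether each file actually exists depends on which searches and assemblies were run for that library.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Serratus): print the S3 paths implied by
# the {sra_accession}.{ext} naming convention for one SRA accession.
serratus_paths() {
  local acc="$1"
  local bucket="s3://lovelywater2"
  printf '%s\n' \
    "${bucket}/bam/${acc}.bam" \
    "${bucket}/summary/${acc}.summary" \
    "${bucket}/assembly/contigs/${acc}.coronaSPAdes.gene_clusters.fa"
}

# Example: serratus_paths SRA123456
```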

Accessing Data

The S3 bucket has public read-only permissions. All files can be downloaded via aws cli or wget/curl.

aws-cli : aws s3 cp s3://lovelywater2/<file_path> ./
wget/curl : wget https://lovelywater2.s3.amazonaws.com/<file_path>
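
The two access routes address the same objects, so an `s3://` path can be rewritten as its HTTPS equivalent. The sketch below assumes only the URL pattern shown above; the helper name `s3_to_https` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite an s3://lovelywater2/ URI as the matching
# public HTTPS URL, following the URL pattern given in this section.
s3_to_https() {
  local uri="$1"
  local prefix="s3://lovelywater2/"
  case "$uri" in
    "$prefix"*)
      # Strip the s3:// prefix and prepend the HTTPS endpoint.
      printf 'https://lovelywater2.s3.amazonaws.com/%s\n' "${uri#"$prefix"}"
      ;;
    *)
      echo "not a lovelywater2 URI: $uri" >&2
      return 1
      ;;
  esac
}

# Example (requires network access):
#   curl -O "$(s3_to_https s3://lovelywater2/index.tsv)"
```

On machines without AWS credentials configured, the AWS CLI's `--no-sign-request` option can usually be added to `aws s3 cp` to read a public bucket anonymously.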

To find or access a subset of the data, use the index file:

`aws s3 cp s3://lovelywater2/index.tsv ./`

`grep "SRR1234" index.tsv > matches`

`aws s3 cp --recursive --exclude "*" --include "SRR1234*" s3://lovelywater2/summary/ ./SRR1234/`
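
When the accessions of interest do not share a prefix, the matches from the index can be turned into one download URL per file instead. This is a sketch under a stated assumption: the layout of `index.tsv` is not documented here, so the code assumes (unverified) that the first tab-separated column holds the SRA accession. The helper name `summary_urls` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read index lines on stdin and print the HTTPS URL of
# each accession's .summary file.
# ASSUMPTION: column 1 (tab-separated) of each line is the SRA accession.
summary_urls() {
  cut -f 1 | sort -u | while IFS= read -r acc; do
    [ -n "$acc" ] || continue   # skip empty lines
    printf 'https://lovelywater2.s3.amazonaws.com/summary/%s.summary\n' "$acc"
  done
}

# Example (requires network access):
#   grep "SRR1234" index.tsv | summary_urls | xargs -n 1 curl -sSfO
```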

Access Alignment Data in IGV

As of version 20200821, all .bam files are sorted and have an associated .bai index file in the ~/bam/ directory. These alignment files can be visualized directly in a genome browser such as IGV, using cov3ma as the reference genome.

IGV Stream Alignment: File --> Load from URL --> https://lovelywater2.s3.amazonaws.com/bam/ERR2756788.bam

You can then navigate to a relevant accession such as "EU769558.1" and directly visualize read alignments.

IGV screenshot

`.pro` files

Translated-nucleotide alignment data are saved as `.pro` files, the tabular output of `diamond -f 6`, with the following fields in order:

qseqid  qstart qend qlen qstrand sseqid  sstart send slen pident evalue cigar qseq_translated full_qseq full_qseq_mate

(See also: Diamond Wiki)
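
Because the fields are positional, a `.pro` file can be filtered with `awk` by column number (`pident` is column 10, `evalue` is column 11 in the order listed above). The following is a minimal sketch, not part of Serratus: the helper name `pro_filter` and the default thresholds are illustrative assumptions, and it assumes the file is tab-separated, as `diamond -f 6` output normally is.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep hits with pident >= $1 and evalue <= $2, and
# print qseqid, sseqid, pident, evalue (columns 1, 6, 10, 11).
# The default thresholds (50 and 1e-5) are arbitrary illustrative values.
pro_filter() {
  local min_pident="${1:-50}"
  local max_evalue="${2:-1e-5}"
  awk -v min="$min_pident" -v max="$max_evalue" '
    BEGIN { FS = "\t"; OFS = "\t" }
    # "+ 0" forces numeric comparison of the fields and thresholds.
    ($10 + 0) >= (min + 0) && ($11 + 0) <= (max + 0) {
      print $1, $6, $10, $11
    }'
}

# Example (file name is a placeholder):
#   gzip -dc SRR1234.pro.gz | pro_filter 70 1e-10
```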

`.mfc` compressed files

FASTA assemblies are compressed using MFCompress.

# Quick install (linux 64bit)
wget http://sweet.ua.pt/ap/software/mfcompress/MFCompress-linux64-1.01.tgz
tar -xvf MFCompress-linux64-1.01.tgz
cp MFC*/MFC* ./; rm -rf MFCompress-linux64-1.01

# Decompress
MFCompressD SRR01234.fa.mfc

LICENSE

All data in s3://lovelywater2/ are released under the CC0 1.0 license, as defined in s3://lovelywater2/LICENSE.md.

Genomes and Contigs

RdRP barcode sequence database

PALMdb is a database of viral polymerase palmprint (barcode) sequences classified by (1) taxonomy and (2) species-like operational taxonomic units (OTUs) obtained by clustering at 90% sequence identity. PALMdb was created using the palmscan algorithm to mine public sequence databases and Serratus contigs. The 2021-03-14 update includes 250,799 novel Serratus palmprint sequences, representing 132,992 new OTUs.


