Organism Filter

The pipeline uses FastQ Screen to classify reads by organism and to filter non-human reads.

The QC pipeline runs FastQ Screen on each single-cell FASTQ pair. FastQ Screen takes FASTQ files as input and outputs FASTQ files with tags added to the read names. Each read in a pair is classified independently. We classify against the human, mouse and salmon genomes. Each read in the BAM files generated by the pipeline carries a FastQ Screen tag that specifies the species it belongs to.

| FastQ Screen flag | Explanation |
|---|---|
| 0 | Read does not map |
| 1 | Read maps uniquely |
| 2 | Read multi-maps |

Fastq format

Flag format: the flag information is appended to the read ID in the FASTQ file. The first read in the file also lists the genome names and has the following format:


@<Read-id>#FQST:grch37:mm10:salmon:100

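In this example, the digits of 100 follow the order of the genomes listed in the read name (grch37, mm10, salmon): the read maps uniquely to the human genome and does not align to the mouse or salmon genome at all.

All subsequent reads carry only the flags, in the same genome order: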

@<Read-id>#FQST:100
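
The read-name format above can be decoded with a few lines of Python. This is a minimal sketch, not the pipeline's own parser: the function names `parse_fqst_header` and `parse_fqst_flags` are hypothetical, and it assumes one single-digit flag per genome, in the order given by the first read.

```python
def parse_fqst_header(read_name):
    """Return the genome names listed in the first read's name.

    Assumes the layout '@<Read-id>#FQST:<genome>:...:<flags>'.
    """
    _, sep, suffix = read_name.rpartition("#FQST:")
    if not sep:
        raise ValueError(f"no FQST suffix in read name: {read_name!r}")
    fields = suffix.split(":")
    return fields[:-1]  # the last field is the flag string, not a genome


def parse_fqst_flags(read_name, genomes):
    """Map each genome to its flag (0, 1 or 2) for one read name."""
    _, sep, suffix = read_name.rpartition("#FQST:")
    if not sep:
        raise ValueError(f"no FQST suffix in read name: {read_name!r}")
    flag_string = suffix.split(":")[-1]
    if len(flag_string) != len(genomes):
        raise ValueError("flag count does not match genome count")
    # Assumption: one digit per genome, in header order.
    return {genome: int(flag) for genome, flag in zip(genomes, flag_string)}


first_read = "@read1#FQST:grch37:mm10:salmon:100"
genomes = parse_fqst_header(first_read)
print(genomes)
print(parse_fqst_flags(first_read, genomes))
print(parse_fqst_flags("@read2#FQST:020", genomes))
```

The genome list is read once from the first record and reused for every later record, which mirrors how the names appear only on the first read.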

Bam format

Each read in the BAM file carries an FS tag that lists every reference genome with its flag, in the form <genome>_<flag>:

FS:Z:mm10_0,salmon_0,grch37_1
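
The tag value can be split back into per-genome flags. The sketch below is illustrative only: `parse_fs_tag` is a hypothetical helper, and it assumes the value is a comma-separated list of `<genome>_<flag>` entries as in the example above.

```python
def parse_fs_tag(tag_value):
    """Convert an FS tag value such as 'mm10_0,salmon_0,grch37_1' to a dict."""
    flags = {}
    for entry in tag_value.split(","):
        # Split on the last underscore so genome names may contain underscores.
        genome, sep, flag = entry.rpartition("_")
        if not sep:
            raise ValueError(f"malformed FS entry: {entry!r}")
        flags[genome] = int(flag)
    return flags


print(parse_fs_tag("mm10_0,salmon_0,grch37_1"))
```

When reading BAM files with pysam, the string passed to this helper would typically come from `read.get_tag("FS")`.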

Pipeline features:

Metrics:

Detailed Metrics:

The pipeline generates a CSV file with detailed counts for every combination of flags. The counts are also split by read end (R1 or R2). The table columns depend on the references being checked. For instance, a run against the human, mouse and salmon genomes produces the following columns:

  • cell_id: id of the cell
  • read_end: end 1 or 2 of read pairs
  • Human: flag for the human genome, one of {0,1,2} (see the FastQ Screen flag table above)
  • Mouse: flag for the mouse genome, one of {0,1,2} (see the FastQ Screen flag table above)
  • Salmon: flag for the salmon genome, one of {0,1,2} (see the FastQ Screen flag table above)
  • count: number of reads
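
A table with these columns can be built by counting identical (cell, read end, flags) combinations. The following is a rough sketch under assumptions: the helper names and the in-memory record layout `(cell_id, read_end, flags)` are hypothetical and are not taken from the pipeline code.

```python
import csv
import io
from collections import Counter


def count_detailed(records, genomes):
    """Count reads per (cell_id, read_end, flag combination).

    records: iterable of (cell_id, read_end, flags), where flags maps
    each genome column name to 0, 1 or 2.
    """
    counts = Counter()
    for cell_id, read_end, flags in records:
        key = (cell_id, read_end) + tuple(flags[g] for g in genomes)
        counts[key] += 1
    return counts


def write_detailed_csv(counts, genomes, handle):
    """Write one row per observed combination, with a trailing count column."""
    writer = csv.writer(handle)
    writer.writerow(["cell_id", "read_end"] + list(genomes) + ["count"])
    for key in sorted(counts):
        writer.writerow(list(key) + [counts[key]])


genomes = ["Human", "Mouse", "Salmon"]
records = [
    ("cell_A", 1, {"Human": 1, "Mouse": 0, "Salmon": 0}),
    ("cell_A", 1, {"Human": 1, "Mouse": 0, "Salmon": 0}),
    ("cell_A", 2, {"Human": 0, "Mouse": 0, "Salmon": 0}),
]
buffer = io.StringIO()
write_detailed_csv(count_detailed(records, genomes), genomes, buffer)
print(buffer.getvalue())
```

Only combinations that actually occur are written; whether the real pipeline also emits zero-count rows is not stated here.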

Summary Metrics:

The pipeline also adds summary metrics to the main alignment metrics table. The column names depend on the references. For instance, a run against the human, mouse and salmon genomes adds the following columns:

  • human: count of reads that align to human genome (uniquely or multi-map)
  • human_multihit: count of reads that align to human genome (uniquely or multi-map) and also align to another genome at the same time (uniquely or multi-map)
  • mouse: count of reads that align to mouse genome (uniquely or multi-map)
  • mouse_multihit: count of reads that align to mouse genome (uniquely or multi-map) and also align to another genome at the same time (uniquely or multi-map)
  • salmon: count of reads that align to salmon genome (uniquely or multi-map)
  • salmon_multihit: count of reads that align to salmon genome (uniquely or multi-map) and also align to another genome at the same time (uniquely or multi-map)
  • nohit: count of reads that do not align to any genome
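
The summary columns above can be derived from the per-read flags. This is a sketch rather than the pipeline's implementation: `summarize` is a hypothetical function, and it assumes that a read counted in `<genome>_multihit` is also counted in `<genome>`, which the column descriptions suggest but do not state outright.

```python
def summarize(flag_dicts, genomes):
    """Aggregate per-read flags into the summary columns.

    flag_dicts: iterable of dicts mapping each genome to 0, 1 or 2.
    """
    summary = {genome: 0 for genome in genomes}
    summary.update({f"{genome}_multihit": 0 for genome in genomes})
    summary["nohit"] = 0
    for flags in flag_dicts:
        # A genome is "hit" when the read maps uniquely (1) or multi-maps (2).
        hits = [genome for genome in genomes if flags[genome] > 0]
        if not hits:
            summary["nohit"] += 1
        for genome in hits:
            summary[genome] += 1
            if len(hits) > 1:
                # The read also aligns to at least one other genome.
                summary[f"{genome}_multihit"] += 1
    return summary


reads = [
    {"human": 1, "mouse": 0, "salmon": 0},  # human only
    {"human": 2, "mouse": 1, "salmon": 0},  # human and mouse
    {"human": 0, "mouse": 0, "salmon": 0},  # no hit
]
print(summarize(reads, ["human", "mouse", "salmon"]))
```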

Options

Default functionality:

By default, the pipeline does not filter reads at all. The output BAM files carry the classification in their read tags.

Filter options:

  • filter_contaminated_reads flag in the config file. When set, only the following read pairs are kept:
      • Both R1 and R2 match human only (reads that match multiple references are removed).
      • One of the mates matches human only and the other one does not match anything.
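
The keep rule for read pairs can be expressed as a small predicate. This is a minimal sketch with hypothetical helper names. It assumes that "matches human only" means a non-zero flag for the human reference (unique or multi-mapped) and zero for every other reference, and it uses `grch37` as the human reference name, as in the tag examples.

```python
def matches_only(flags, target):
    """True if the read hits the target genome and no other genome."""
    return flags[target] > 0 and all(
        value == 0 for genome, value in flags.items() if genome != target
    )


def matches_nothing(flags):
    """True if the read does not map to any genome."""
    return all(value == 0 for value in flags.values())


def keep_pair(r1_flags, r2_flags, target="grch37"):
    """Apply the pair-level keep rule to the flags of R1 and R2."""
    r1_only = matches_only(r1_flags, target)
    r2_only = matches_only(r2_flags, target)
    if r1_only and r2_only:
        return True
    # One mate is target-only, the other mate is unmapped everywhere.
    return (r1_only and matches_nothing(r2_flags)) or (
        r2_only and matches_nothing(r1_flags)
    )


human = {"grch37": 1, "mm10": 0, "salmon": 0}
unmapped = {"grch37": 0, "mm10": 0, "salmon": 0}
mixed = {"grch37": 1, "mm10": 1, "salmon": 0}
print(keep_pair(human, human))
print(keep_pair(human, unmapped))
print(keep_pair(human, mixed))
```

Under this rule, pairs in which both mates are unmapped are dropped as well, since neither mate matches the human reference.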