Practical Microbial Bioinformatics

Yonas
January 30, 2023

This is a compilation of protocols and methods we use in analyzing the sequencing data of microbial genomes, microbiomes, and amplicons. It covers Linux basics and various bioinformatics pipelines. All protocols are tested in a local Bio-Linux server.

Share This Post

Remote Connection

Tmux usage

$ tmux ls # list

$ tmux a -t # attach to a session

$ tmux kill-session -t session-name # kill a session

Remote copy

$ scp zeng@login.genome.au.dk:/faststorage/home/zeng/201811GI-Greenland/megahit-assembly-trimmed/intermediate_contigs/k21.contigs.fa .

$ rsync –info=progress2 -auvz /DATA_3/than/result ~/data3/

Reorganize files

$ for i in *; do cd $i/clean_data && for file in *1.fq.gz; do mv $file $i_R1_001.fastq.gz; done && for file in *2.fq.gz; do mv $file $i_R2_001.fastq.gz; done && cd ~/data3/result; done

Upload files to server

$ rsync -e ssh -avz ~/mydata3/filename zeng@login.genome.au.dk:faststorage/file
#To copy a folder from your computer to the cluster:
$ rsync -e ssh -avz /path/to/data user@login.genome.au.dk:data
OR
$ scp . yonzen@computerome.cbs.dtu.dk:mydata/data/VRSice

Virtual Environment

Conda usage

$ conda create –name env-name pysam

$ source activate amazing-project

$ source deactivate amazing-project

$ conda env list

Reads

Read the first few lines of a gz file

$ gzip -cd CONN.20111109.0057.gz | head

Reads QC

$ fastqc *.gz -o stats-by-fastqc

Quality trimming

$ bash bbduk.sh in=reads.fq out=clean.fq qtrim=rl trimq=20

# rl, for both right and left; until Q20

$ bbduk.sh in=reads.fq out=clean.fq maq=20

#maq, average quality score

#final command

$ bash /home/zeng/bbmap/bbduk.sh in=reads.fq out=clean.fq qtrim=rl trimq=20 maq=20 minlen=50

$ ~/mydata3/bbmap/bbduk.sh in=VRS.ice_1.fq in2=VRS.ice_2.fq out=VRS.ice.trimmed.left10_1.fq out2=VRS.ice.trimmed.left10_2.fq ftl=10

# remove 10 most left bases

Compress reads

$ tar -zvcf VRS.ice.trimmed.fq.gz trimmed10/

Divide a big file into small ones

#install 7z
$ tar xjf file.tar.bz2 #uncompress
$ chmod uog+x 7za #executable
$ ./7za a -v1000m -mx0 filename
#divide into files of 10ooMb each without compression
$ ./7za x filename.001 # merge small files

Assembly

Megahit

$ megahit/megahit -1 SRR341725_1.fastq.gz -2 SRR341725_2.fastq.gz -o SRR341725.megahit_asm

Spades

$ spades.py –meta –only-assembler –12 SS10-joined-trimmed.fq -m 120 –threads 6 -o SS10-assembly-spades

Unicycler

$ for i in v*; do echo $i && cd $i/clean_data && ls && tmux new -d “unicycler -1 *1.fq.gz -2 *2.fq.gz -o ~/data3/assemblies/$i –min_fasta_length 500” && cd ~/data3/result && sleep 40m; done

# automatically assemble all genomes remotely

$ unicycler -1 *1.fq.gz -2 *2.fq.gz -o ~/data3/assemblies/vice304 –min_fasta_length 500 –no_pilon # cancel polish, only a few to tens of base correction.

$ for i in *; do echo $i && cd $i && ls && tmux new -d “unicycler -1 *1.fq.gz -2 *2.fq.gz -o ~/mydata3/assemblies/$i –min_fasta_length 500 –no_pilon –threads 16” && cd ~/mydata3/Clean && sleep 180m; done

# an improved pipeline

Combine Illumina and PacBio reads for assembly

# check reads quality
$ fastqc -t 64 -o fastqc *.fastq
# remove 20 bases at both ends for 150 base-long reads
$ ~/bbmap/bbduk.sh in=TET16_illumina_R1.fastq in2=TET16_illumina_R2.fastq out=TET16_illumina_R1.L15R40.fastq out2=TET16_illumina_R2.L15R40.fastq forcetrimleft=20 forcetrimright=130
# trim PacBio reads
# ~/bbmap/bbduk.sh in=TET16.pacbio.fastq out=TET16.pacbio.L10R25k.gt1k.fastq forcetrimleft=10 forcetrimright=25000 minlen=1000
# OR use the specific tool for treating PacBio reads
$ ./SequelTools.sh
# assembly Illumina and PacBio reads
$ unicycler -1 TET16_illumina_R1.L20R10.fastq -2 TET16_illumina_R2.L20R10.fastq -l ../pacbio.reads/TET16.pacbio.fastq -o ../hybrid.re.assembly.L20R10 -t 32
# re-try using different settings
# include all PacBio reads to assure the success, not be misguided by the uneven distribution of bases, which is caused by read length inconsistency. There are only a few very long reads.
# Try L15R40 illumina + pacbio gt500
$ ~/bbmap/bbduk.sh in=TET16.pacbio.fastq out=TET16.pacbio.gt500.fastq minlen=500
$ unicycler -1 TET16_illumina_R1.L15R40.fastq -2 TET16_illumina_R2.L15R40.fastq -l ../pacbio.reads/TET16.pacbio.gt500.fastq -o ../hybrid.re.assembly.il.L15R40.pb.gt500 -t 32
# The best strategy is Left removing 10 bases，Right removing adaptor，then Quality trimming to Q20, min length 50 bases

Minimum length of a assembly

$ bbtools/stats.sh in=contigs.fasta -m

Assembly stats

$ quast.py [options] <contig_file(s)> -m 200

$ seqkit stats -a -T filename -o stats-filename

# -a, all; -T, tab format

Remove short contigs

$ awk ‘BEGIN{RS=”>”;ORS=””}length($0)>300{print “>”$0}’ input > output

# remove contigs shorter than 300 bp; count everything between two “>”

Add filename into the contig's name

$ for i in *.gt300; do sed -i — ‘s/>/>$i/g’

$ for i in *; do echo $i && sed -i “s/>/>$i./g” $i; done

Bulk rename assemblies

$ for i in *; do echo $i && cd ~/data3/assemblies/$i && if [ -e assembly.fasta ]; then cp assembly.fasta ~/data3/finalcontigs/$i.fasta; else echo “assembly file not exist”; fi; done

Bulk remove extensions in filenames

$ for i in *; do rename ‘s/.fasta//g’ $i; done

Bulk add extensions to assemblies' filenames

$ for i in *; do rename “s/$i/$i.gt500.fa/g” $i; done

Clean & organize assembly files

$ for i in *; do sed -i “s/>/>$i./g” $i/assembly.fasta; done

$ for i in *; do sed -i “s/ /./g” $i/assembly.fasta; done

$ for i in *; do cp $i/assembly.fasta ../finalcontigs/$i.fasta; done

Estimate of genome completeness and contamination

$ checkm lineage_wf <bin folder> <output folder>

Compress all fastq to save storage space

for file in $(find ./mydata/ -name *.fastq -type f); do echo $file; gzip $file; donemafft-qinsi all.fasta > all.mafft.qinsi.fasta

Annotation

RAST server

$ svr_submit_RAST_job -user RAST-username -passwd RAST-password -fasta assembly.fasta -domain Bacteria -bioname “own_name” –genetic_code 11 -gene_caller rast

# a docker version of submitter

$ for i in *.fasta; do echo $i; svr_submit_RAST_job -user ******* -passwd ****** -fasta $i -domain Bacteria -bioname “***$i” –genetic_code 11 -gene_caller rast && rm $i -f; done

# automate the upload of the files in a folder

Remove special characters

$ sed -e “s/\^M//” filename > newfilename

#remove CTRL+M that appears in RAST annotation results

#Note: To enter ^M, type CTRL-V + M. That is, hold down the CTRL key then press V and M in succession.

Retrieve data in columns

$ while IFS=’ ‘ read -r col1 col2; do echo “$col2”; name=$(print $col2 | sed -e “s/\^M//”); echo $name; svr_retrieve_RAST_job oceanbug maomaochong $name genbank > $col1.gbk; done < jobid.txt

# Tab as seperator; to type Tab in, press CTRL+V, then press tab

GBK to FAA

$ python genbank_to_fasta.py -i ../../genome/gbk/*.gbk

Prokka

$ prokka –outdir prokka –prefix VRS.soil1.nanoAssembly –addgenes -kingdom Archaea –metagenome *.fasta

16S rapid identification

$ barrnap –kingdom bac -o output.fasta -t 8 input.fasta

Genome classification

$ pip install gtdbtk
$ wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/gtdbtk_r89_data.tar.gz
$ tar xvzf gtdbtk_r89_data.tar.gz
$ export GTDBTK_DATA_PATH=/DATA_3/databases/GTDB/release89
$ ~/.local/bin/gtdbtk classify_wf –cpus 12 –genome_dir ./ –extension fa –out_dir gtdbtk_output

Biuld a HMM model

$ conda install -c bioconda hmmer
$ hmmerbuild –amino -n rhodopsin rhodopsin.hmm rhodopsin.alighment
$ onda install -c bioconda prokka
# use customized HMM to annotate a genome
$ prokka –outdir prokka.fungenes.ice –prefix vrs.ice –addgenes –metagenome –cpus 196 –force –centre X –compliant ../vrs.ice.contigs.fa

NCBI PGAP annotation pipeline

# install docker
$ sudo snap install docker # version 19.03.11, or
$ sudo apt install docker.io
# install PGAP
$ wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
$ chmod +x pgap.py
$ ./pgap.py –update
# test installation environment, -n no report
$ time ./pgap.py -n -c 64 -o mg37_results.docker test_genomes/MG37/input.yaml
# test singularity
$ time ./pgap.py -n -D singularity -c 64 –no-internet –ignore-all-errors -o mg37_results.singularity test_genomes/MG37/input.yaml
# prepare yaml file and submol.yaml
# input.yaml\
fasta:\
class: File\
location: assembly.file.name.fasta\
submol:\
class: File\
location: submol.yaml\
# submol.yaml\ # a simplified version, genus name that can be searched in NCBI is a must, species name can be any
organism:
genus_species: ‘Mycoplasma genitalium’
strain: ‘strain-name’
# start annotation
$ sudo ./pgap.py -r -o output-folder-filename –no-internet input.yaml

Batch annotation using PGAP

# copy all fasta files into a subfolder
$ for f in */*.fasta; do cp “$f” “$(echo “$f” | sed s/’\/assembly’//)”; done
# remove non-standard characters in fasta filenames
$ for f in *.fasta; do cp “$f” “$(echo “$f” | sed s/.*-1a-//)”; mv “$f” backup/; done
# automatically create output folders and yaml files needed for PGAP annotation
$ for f in *.fasta; do echo “$f”; mkdir “$(echo “$f” | sed s/.fasta//)”; cp $f “$(echo “$f” | sed s/.fasta//)”/; sed s/assembly.fasta/”$f”/ input.yaml > “$(echo “$f” | sed s/.fasta//)”/input.yaml; sed s/strain.name/”$(echo “$f” | sed s/.fasta//)”/ submol.yaml > “$(echo “$f” | sed s/.fasta//)”/submol.yaml; done
# test environment
$ time ./pgap.py -r -D singularity -o mg37_results.singularity test_genomes/MG37/input.yaml
$ time ./pgap.py -r -o ~/mydata3/genomes/final/AK113 –no-internet ~/mydata3/genomes/final/AK113/input.yaml
# start annotation
$ for f in /home/zeng/mydata/PGAP/assembly/*.fasta; do echo $f; ./pgap.py -n –no-internet –cpus 64 -o $(echo “$f” | sed s/.fasta//) $(echo “$f” | sed s/.fasta//)/input.yaml; done
$ for f in /home/zeng/mydata/PGAP/assembly/*.fasta; do echo $f; echo $(echo “$f” | sed s/.fasta//); echo $(echo “$f” | sed s/.fasta//)/input.yaml; done
$ ./pgap.py -n –no-internet -c 64 -o /home/zeng/mydata/PGAP/assembly/AK230 /home/zeng/mydata/PGAP/yongqin/assembly/AK230/input.yaml
# data cleanup
$ for f in ../*/*/*/*/assembly.fasta; do echo $f; cp “$f” “$(echo “$f” | sed s/’\/’/./g | sed s/…// | sed s/.*-1a-// | sed s/.*result.//)”; done
$ for f in *result*.fasta; do cp “$f” “$(echo “$f” | sed s/.*result.//)”; mv “$f” ./backup; done
$ for f in *-1a-*.fasta; do cp “$f” “$(echo “$f” | sed s/.*-1a-//)”; mv “$f” ./backup; done
$ for f in *.fasta; do cp “$f” “$(echo “$f” | sed s/.assembly//)”; mv “$f” ./backup; done

Nanopore

Basecalling

$ /home/tkn/Programs/Guppy_basecaller/ont-guppy-cpu/bin/guppy_basecaller –flowcell FLO-MIN106 –kit SQK-RAD004 -t 10 -i fast5/ -r -s basecalled/

Adaptor removal

$ porechop -i all.3rd.fastq -o all.3rd.porechopped.fastq

Quality trimming

$ bbduk.sh in=reads.fq out=clean.fq qtrim=rl trimq=10 maq=10 minlen=2000
# recommend 7 (>80% accuracy)
$ ~/bbmap/bbduk.sh in= out= qtrim=rl trimq=7 maq=7 minlen=2000

Stats

$ NanoStat [-h] [-v] [-o OUTDIR] [-p PREFIX] [-t THREADS] (–fastq FASTQ | –summary SUMMARY | –bam BAM)

Assembly

$ flye –nano-raw nanopore_VRSmetagenome.fastq –out-dir /home/zeng/VRSsoil-nanopore-assembly-by-flye-on-raw –meta -g 10g –threads 4

Annotation

$ seqtk seq -a input.fastq > output.fasta
$ prokka –outdir prokka –prefix VRS.soil1.nanoAssembly –addgenes -kingdom Archaea –metagenome *.fasta

Search for certain species

$ samtools faidx test.fa name

Computational Genomics

ANI calculation

$ ./fastANI –ql [QUERY_LIST] –rl [REFERENCE_LIST] -o [OUTPUT_FILE] ../fastani-Linux64-v1.1/fastANI –ql list –rl list -o all.Tardiphaga.ANI –matrix -t 8

# 1 thread by default

AAI calculation

$ conda install -c bioconda comparem
$ comparem aai_wf –protein –cpus 16 –file_ext fasta ./aa ./aai_output

Compare two genomes with visualization

$ ../fastani-Linux64-v1.1/fastANI -q genome1.fasta -r genome2.fasta -o 1.vs.2.fastANI -t 8 –visualize

$ Rscript scripts/visualize.R genome1.fasta genome2.fasta 1.vs.2.fastANI.visual

Compare multiple genome organizations with EasyFig

$ sudo apt-get update
$ sudo apt install inkscape
$ inkscape -z -e out.png -w 5000 -h 2000 *.svg
$ inkscape t.svg –export-pdf=t.pdf
# install EasyFig
# EasyFig settings: 5k blast， 0.0001 cutoff, with outline, fileter small blast hits, 1200000 pixels, 50 pixels in genome lines, 150 pixles height, 3000 blast hit height

Pangenome with Roary

# install bioperl
$ cpanm Bio::DB::EUtilities
$ cpanm Bio::DB::GenBank
$ cpanm Bio::DB::GFF
$ bp_genbank2gff3.pl *.gbk –outdir gff
# convert gbk to gff3
$ roary *.gff -f all.genome.roary -e -n -i 80 -cd 100 -r -p 24
#-p, threads; -f, output folder;-e, create core gene alignment; -n, fast alignment with MAFFT;-i, identity from blastp;-cd 100, define core gene as presence in 100% species
# install ggplot2
# run R
# install.packages(“ggplot2”)

MetaWrap for metagenome binning

# create a conda environment
$ module load tools
$ module load anaconda2/4.4.0
$ conda create –name python=2.7 -y
$ source activate
() $ pip install pip –upgrade
() $ pip install <your_required_python_package(s)> –upgrade
$ conda env list
$ conda config –set safety_checks disabled
$ conda create -y -n metawrap-env python=2.7
$ source activate metawrap-env
# Note: ordering is important
$ conda config –add channels defaults
$ conda config –add channels conda-forge
$ conda config –add channels bioconda
$ conda config –add channels ursky
$ conda install -y -c ursky metawrap-mg
# reads classification
$ metawrap kraken -o kreken.result -t 64 -s 1000000 reads.fq
# consolidate binning results by using multiple programs
$ metawrap bin_refinement -o BIN_REFINEMENT -t 96 -A INITIAL_BINNING/metabat2_bins/ -B INITIAL_BINNING/maxbin2_bins/ -C INITIAL_BINNING/concoct_bins/ -c 50 -x 10
# refine a single bin
$ metawrap bin_refinement -o final.bins.metabat2.refined -t 64 -A final.bins.metabat2/ -c 50 -x 10
# count the number of high quality bins
$ cat BIN_REFINEMENT/metawrap_bins.stats | awk ‘$2>50 && $3<10’ | wc -l

Mark mobilome density in a genome

# define mobilome as an annotation with a keyword transposase OR recombinase OR integrase OR phage
# identify loci using keyword search in annotaion file
# use python package Matplotlib to plot
# overlay the plot onto a genome structure figure
# python

import matplotlib.pyplot as plt

points1 = [ (100,1), (200,1), (3000,1), (666,1) ] # (a1,b1), (a2,b2), … # do not use “-” in a variable’s name
points2 = [ (1000,1), (2000,1), (5000,1), (8888,1) ]

plt.xlim(0,10000) # set up the plot limits

for pt in points1:
# plot (x,y) pairs.
# vertical line: 2 x,y pairs: (a,0) and (a,b)
ax.plot( [pt[0],pt[0]], [0,pt[1]], color=’blue’, linewidth=0.5 ) # set line color as b – blue

for pt in points2:
# plot (x,y) pairs.
# vertical line: 2 x,y pairs: (a,0) and (a,b)
ax.plot( [pt[0],pt[0]], [0,pt[1]], color=’black’, linewidth=0.5 ) # set line color as k – black

plt.show()

Core genome phylogeny

$ conda activate env.name
$ conda install roary
$ conda install prokka
# update blast, find the location of blastp
$ whereis blastp
# prepare data, annotate using prokka
$ prokka –outdir prokka_bin283 –genus Tardiphaga –locustag bin283 –cpus 128 –force –centre X –compliant –prefix bin283 VRS.ice.bin.283.fasta
# collect all gff files
$ cp */*.gff ./gff/.
# generate core gene alignment using roary
$ roary -p 128 -o tardiphaga -f ../roary.results -e -n -v *.gff
# use various program to generate a phylogenetic tree from the core gene alignment file

Count gene occurence in a metagenome

# use “ln -s” to link a certain file somewhere else to a filename in the current folder
# indexing contigs
$ bowtie2-build vrs.ice.contigs.fa vrs.ice –threads 192 –large-index && bowtie2-build vrs.soil.contigs.fa vrs.soil –thread 64 –large-index
# mapping reads back to contigs
$ bowtie2 -x vrs.ice -1 vrs.ice.reads.1.fq -2 vrs.ice.reads.2.fq -S vrs.ice.map.sam –no-unal –threads 196 && bowtie2 -x vrs.soil -1 vrs.soil.reads.1.fq -2 vrs.soil.reads.2.fq -S vrs.soil.map.sam –no-unal –threads 128
# converted to BAM format and sort either by read name or by leftmost alignment coordinate
# using -m to increase temp file size, to avoid failure due to too many temp files. do not exceed 1020 files
$ samtools sort -o vrs.ice.map.sorted.bam -O BAM –threads 196 -m 2G vrs.ice.map.sam
# remove duplicates
# download picard.jar
$ java -jar picard.jar MarkDuplicates INPUT=vrs.ice.map.sorted.bam OUTPUT=vrs.ice.map.markdup.bam \
METRICS_FILE=vrs.ice.map.markdup.metrics AS=TRUE VALIDATION_STRINGENCY=LENIENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 REMOVE_DUPLICATES=TRUE
# define the regions to be calculated on coverage
# using blast results, only columns on gene id, hit start and hit end
# SAF format, add a cloumn before and after
$ conda install -c bioconda csvtk
$ csvtk del-header vrs.ice.gyrB.blast | csvtk mutate2 -t -H -e ‘$3 > $2 ? $2 : $3 ‘ -L 0 | csvtk mutate2 -t -H -e ‘$3 > $2 ? $3 : $2 ‘ -L 0 | csvtk mutate2 -t -H -e ‘$3 > $2 ? “+” : “-” ‘ | csvtk mutate2 -t -H -e ‘”gyrB”‘ | csvtk cut -f7,1,4,5,6 -t -H > vrs.ice.gyrB.saf
# count the number of reads mapped to each gene
# download the program at https://sourceforge.net/projects/subread/files/subread-2.0.0/subread-2.0.0-Linux-x86_64.tar.gz/download
$ ./subread-2.0.0-Linux-x86_64/bin/featureCounts -F SAF -T 64 -a vrs.ice.recA.saf -o vrs.ice.recA.counts ../vrs.ice.map.markdup.bam
# count total reads in a bam file
# get the total number of reads of a BAM file (may include unmapped and duplicated multi-aligned reads)
$ samtools view -c ../vrs.soil.map.markdup.bam
# counting only mapped (primary aligned) reads
$ samtools view -c -F 260 ../vrs.soil.map.markdup.bam

Calculate a MAG's abundance in a metagenome

$ conda create -y -n zeng_metawrap-env python=2.7
$ conda activate metawrap-env

# Note: ordering is important
$ conda config –add channels defaults
$ conda config –add channels conda-forge
$ conda config –add channels bioconda
$ conda config –add channels ursky
#vNO DATABASE needs to be installed for this function

# shorten path
ln -s /data/zeng/medusa.backup.2.5/all.raw.reads/11BGI-Greenland-metagenome/F18FTSEUHT1164_SOIxgtM/Reads/Clean/VRS-SOIL/vrs.soil.raw.reads.1.fq vrs.soil.reads_1.fastq;
\
ln -s C D
$ metawrap quant_bins -b ./VRS.HQ.bins/ice -o quantify.results.ice -a ../vrs.ice.contigs.fa -t 196 vrs.ice.reads_*.fastq

# Note that the abundances are expressed as “genome copies per million reads”, and are calculated with Salmon in a simmilar way like TPM (transcripts per million) is calculated in RNAseq analysis.
# make sure contig names inside a bin is consistent with those in the assembly
# rename the names inside a bin
# retrieve original names
$ grep “>” ../vrs.soil.contigs.fa > original.name
$ ln -s /data/zeng/computerome/computerome/data/VRSice/VRS-ice.contigs.fa vrs.ice.contigs.fa
$ sed -i ‘s/>VRS.ice.bin.*.fa.//g’ *.fa
$ for file in *.fa; do echo $file >> name.change.log; for i in $(grep ‘k141_’ $file); do date >> name.change.log; echo “$i” >> name.change.log; replace=$(grep -m 1 $i”_” original.name); echo “$replace” >> name.change.log; sed -i ‘s/'”$i”‘/'”$replace”‘/’ $file; done; done
$ grep -m 1 # only find the first match

Generate circos

$ wget http://circos.ca/distribution/circos-0.69-9.tgz
$ tar xvfz ../circos-0.69-9.tgz
$ cd circos/bin
$ ./circos -v or -h or -man
$ circa 100 usd, skin for using data on circos
# use Easyfig to generate DNA comparison results, take coordinate columns as input into circa.(DNA level)
# or do local blast to generate tblastn results, convert into circa compatible file. (protein level) \
chromosome file\
chromosome,size\
1.easyfig.fa,4716552\
2.easyfig.fa,5179092\
\
chromosome1 chromosome2 identity length mismatch start1 end1 start2 end2 e-value \
1.easyfig.fa 2.easyfig.fa 73.491 24056 5661 259 1940218 1964036 1636159 1612583 0 14110\
1.easyfig.fa 2.easyfig.fa 77.883 19876 3968 121 2427168 2446844 1096555 1076909 0 15866\
# make sure to remove duplicate regions in EasyFig results by sorting according to Start loci

Genome species tree

$ conda install -c bioconda phylophlan
# or install locally, need to install trimal as well, add into path PATH=$PATH:~/opt/bin

# download refs
$ phylophlan_get_reference -g p__Gemmatimonadetes -o Gemma.input –verbose

# Download reference genomes from close phyla
$ for i in Fibrobacteres Poribacteria Latescibacteria Cloacimonetes Zixibacteria Marinimicrobia Ignavibacteria Caldithrix Bacteroidetes Chlorobi; do phylophlan_get_reference -g p__${i} -n 1 -o Gemma.input –verbose; done

# Generate config file
$ phylophlan_write_config_file -d a -o gemma_config.cfg –db_aa diamond –map_dna diamond –map_aa diamond –msa mafft –trim trimal –tree1 fasttree –tree2 raxml –verbose

# download all NCBI Gemma genomes
$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt
for f in $(grep Gemmatimonadetes prokaryotes.txt | csvcut -t -c 21); do wget “$f/*_genomic.fna.gz” –retry-connrefused –waitretry=1 –read-timeout=10 –timeout=5 -t 0; done
$ for f in $(grep Gemmatimonadetes prokaryotes.txt | cut -f21); do wget –retry-connrefused –waitretry=1 –read-timeout=10 –timeout=5 -t 0 “$f/*_protein.faa.gz”; done

# Build the phylogenetic tree
$ phylophlan -i Gemma.input -d phylophlan –diversity medium –accurate -f gemma_config.cfg -o output_gemma –nproc 128 -t a –verbose

# bootstrapping
$ raxmlHPC -m PROTGAMMAAUTO -p 12345 -b 12345 -# 100 -s alignment.file -n output.name

# add bootstrap values onto the tree
$ raxmlHPC -m PROTGAMMAAUTO -p 12345 -f b -t best.tree.file -z bootstrap.output.file -n output.name

# use iTOL to display the tree
# rename taxa names by removing ASM*** in the name. the logic is to create line first, delete, then combine lines again.
$ sed ‘s/_ASM/\n_ASM/g’ final.tree.RAxML.best.renamed.tre | sed ‘s/_ASM.*v1:/:/g’ | sed ‘s/_ASM.*v2:/:/g’ | sed ‘:a;N;$!ba;s/\n//g’ > final.phylogenomic.tree

Public Databases

Download genomes from ENA

# install enaBrowserTools
# download search results
$ python2.7 ../enaBrowserTools-1.5.4/python/enaGroupGet.py -g assembly -f fasta -t -e -w 2 # 2 represents bacterial taxon id
$ python2.7 ../enaBrowserTools-1.5.4/python/enaDataGet.py -f fasta -t -e -w 2 # download both type
# convert downloaded txt to csv
$ sed -i ‘s/\t/,/g’ all.bac.arc.ena.txt
# find the column number
$ csvcut -n all.bac.arc.ena.txt
# count the number of genomes to be downloaded
$ csvcut -c fasta_file all.bac.arc.ena.txt | wc -l
# remove useless digits in filenames
$ while read line; do echo $line | tail -c 7 >> all.bac.list.clean; done < all.bac.list
# download all bacterial and archaeal genomes
$ while read line; do echo $line; A=$(echo $line | tr -d ‘\r’); echo $A; curl -O $A && ls -lh | wc -l; done < ../archaea.list
$ while read line; do A=$(echo $line | tr -d ‘\r’); echo $A; curl -O $A && ls -lh | wc -l; done < all.bac.arc.list
# unzip one by one
$ for i in *.fasta.gz; do echo $i; echo “total unzipped files: ‘ls *.fasta -lh | wc -l'”; gunzip -k $i; done
# identify pathogens and move into a separate folder
$ declare -i B=0; for i in *.fasta.gz; do echo $i; A=$(gunzip -c $i | head -n1); if [[ “$A” == *Salmonella* ]] || [[ “$A” == *Brucella* ]] || [[ “$A” == *Chlamydia* ]] || [[ “$A” == *Clostridioides* ]] || [[ “$A” == *Clostridium* ]] || [[ “$A” == *Corynebacterium* ]] || [[ “$A” == *Enterococcus* ]] || [[ “$A” == *Erysipelothrix* ]] || [[ “$A” == *Escherichia* ]] || [[ “$A” == *Haemophilus* ]] || [[ “$A” == *Helicobacter* ]] || [[ “$A” == *Listeria* ]] || [[ “$A” == *Mycobacterium* ]] || [[ “$A” == *Neisseria* ]] || [[ “$A” == *Shigella* ]] || [[ “$A” == *Staphylococcus* ]] || [[ “$A” == *Streptococcus* ]] || [[ “$A” == *Yersinia* ]]; then B=$B+1 && echo $B “file(s) moved to the folder of pathogens” && mv $i ./pathogens; fi; done
# add additional pathogens
$ declare -i B=0; for i in *.fasta.gz; do echo $i; A=$(gunzip -c $i | head -n1); if [[ “$A” == *Klebsiella* ]] || [[ “$A” == *”Acinetobacter baumannii”* ]] || [[ “$A” == *”Pseudomonas aeruginosa”* ]] || [[ “$A” == *”Mycobacteroides abscessus”* ]] || [[ “$A” == *Acinetobacter_baumannii* ]] || [[ “$A” == *Pseudomonas_aeruginosa* ]] || [[ “$A” == *Mycobacteroides_abscessus* ]]; then B=$B+1 && echo $B “file(s) moved to the folder of pathogens” && mv $i ./pathogens; fi; done
# check progress
$ ls -lh | wc -l && ls ../ -lh | wc -l && echo `ls -lh | tail -n1`

Download genomes from NCBI

$ csvcut -n ../prokaryotes.csv
$ csvcut -c 15 ../prokaryotes.csv > ftp.list
$ while read line; do A=$(echo $line | tr -d ‘\r’); echo $A; ncftpget ${A}/*.fna.gz && ls -lh | wc -l; done < ftp.list
# display the total size of all files
$ du -ach
# check all downloads
$ find . -name *.gz | wc -l
# remove duplicate genome with fastx toolkit
$ fastx_collapser < some.fasta > some.unique.fasta
OR
$ usearch -fastx_uniques input.fasta -fastaout uniques.fasta -sizeout -relabel Uniq
# only works with single fasta file
# remove fasta containing pathogens by looking for name in the first line
$ declare -i B=0; for i in *.fna; do A=$(head -n1 -q $i); if [[ “$A” == *cyanobacterium* ]]; then B=$B+1 && echo $B “file(s) removed” && mv $i ./cyanobacterium; fi; done
# define B as an integral, count how many files have been treated
$ bash ~/bbmap/dedupe.sh

Download all FunGen HMMs

# link to http://fungene.cme.msu.edu/hmm_download.spr?hmm_id=331
$ for i in $(seq 1 1000); do curl -O http://fungene.cme.msu.edu/hmm_download.spr?hmm_id=$i && ls -lh | wc -l; done
# replace filename with gene name, which is the 2nd field of the 2nd line of each file
$ for i in *; do echo $i; name=”$(awk ‘FNR == 2 {print $2}’ $i)”; echo $name; mv ./$i ./$name; done
# find a certain line with certain characters
$ grep -hnr “bin.280” gtdbtk.bac120.markers_summary.tsv

Identify a gene in public genomes

# create a local database of all downloaded genomes
$ makeblastdb -in all.ena*.fasta -title all.ena.archaea -out all.ena.archaea -max_file_sz 2GB -dbtype nucl
# look for certain genes using rhodopsin gene as the example
$ tblastn -db ../all.no.pathogen/fna/all.ncbi -query xr -num_threads 20 -out xr.blast.results -max_target_seqs 10000 -outfmt “6 qseqid qgi qacc sseqid sallseqid sgi sallgi sacc sallacc qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos sscinames scomnames qcovs qcovhsp qcovus”
# process results and keep needed columns only
$ csvcut pufm.blast.results -t -H -c 4,20,29 > pufm.sorted # t, tab as separator；H，no head
# remove certain lines
$ sed -i <start>,<end>d <filename>
# or keep original files
$ sed ‘1,2395d’ pufm.sorted.clean.identity.sorted > tmp; mv tmp pufm.identity.removed
# select column C, remove duplicates, count lines
$ csvcut -t -c a pufm.final | sort -u | wc -l
# remove blast results with low coverage or low identity
$ csvsort -c t xr.sorted | sed ‘1,1470d’ | csvsort -c 3 | sed ‘1,88d’ > xr.final
# use the first six letter to identify the same sequences
$ sort pufm.final.deRep.shortname xr.final.deRep.shortname | awk ‘dup[$0]++ == 1’
# find corresponding species names
$ while read line; do echo $line; grep “$line” ena.all.id >> dual.taxon; done < dual.list.final
# download taxonomy data from ENA, convert into csv
$ ssconvert taxonomy.xml taxonomy.csv
# process, sort, remove certain lines, resort
$ csvsort -c t pr.sorted | sed ‘1,711d’ | csvsort -c 3 | sed ‘1,55d’ > pr.sorted.gt30
# generate ID, remove duplicates
$ csvcut -c 1 pr.sorted.gt30 | sed ‘s/|/,/g’ | sed ‘s/01/0100/g’ | sed ‘s/00/,/g’ | csvcut -c 2 | sort -u > pr.final.list

Retrieve taxonomic info of a NCBI genome

# for WGS
$ while read line; do echo $line; grep ‘”‘$line'”‘ prokaryotes.csv >> pufM.list.final; done < pufM.list.name.txt
# for CP series
$ while read line; do echo $line; grep $line prokaryotes.csv >> pufM.list.final; done < pufM.list.name2.txt

Cloud

Transfer S3 file to Dropbox

Install the AWS CLI and configure it with your AWS credentials.
Install the Dropbox-Uploader script and configure it with your Dropbox credentials.
git clone https://github.com/andreafabrizi/Dropbox-Uploader.git
$chmod +x dropbox_uploader.sh
$./dropbox_uploader.sh
Use the AWS CLI to sync the S3 folder to a local folder on your computer:
aws s3 sync s3://your-bucket/your-folder /path/to/local/folder
Use the Dropbox-Uploader script to upload the local folder to Dropbox:
./dropbox_uploader.sh upload /path/to/local/folder /Dropbox/folder
Note: You may need to modify the above commands to fit your specific use case, such as specifying the correct bucket and folder names.

Upload files to Dropbox from Ubuntu Server

Create an APP with Dropbox and generate an access token.

ACCESS_TOKEN=”YOUR_ACCESS_TOKEN”
LOCAL_FILE_PATH=”/path/to/your/local/file.txt”
DROPBOX_PATH=”/path/to/dropbox/folder/file.txt”

curl -X POST https://content.dropboxapi.com/2/files/upload \
–header “Authorization: Bearer $ACCESS_TOKEN” \
–header “Dropbox-API-Arg: {\”path\”: \”$DROPBOX_PATH\”,\”mode\”: \”add\”,\”autorename\”: true,\”mute\”: false,\”strict_conflict\”: false}” \
–header “Content-Type: application/octet-stream” \
–data-binary @$LOCAL_FILE_PATH

Or, in short

curl -X POST https://content.dropboxapi.com/2/files/upload –header “Authorization: Bearer YOURACCESSTOKEN” –header “Dropbox-API-Arg: {\”path\”: \”/fastqc.zip\”,\”mode\”: \”add\”,\”autorename\”: true,\”mute\”: false,\”strict_conflict\”: false}” –header “Content-Type: application/octet-stream” –data-binary @fastqc.zip

Basic Linux Commands

Check a folder's size

$ du -sh file_path

Extend the command line into the next line

use \

Copy a certain folder only

$ rsync -av –progress sourcefolder /destinationfolder –exclude thefoldertoexclude

Count files

$ ls -l | wc -l

Compress a folder

$ tar -zcvf name.tar.gz /home/folder_to_be_compressed

Sort files by date and time

$ ls -t

Find a file

$ find / -name *.gz

Find a folder

$ find . -name * -type d

Display the content of a compressed file

$ tar -tf filename.tar.gz

Compress folders into indivisual files

$ for i in */; do zip -r “${i%/}.zip” “$i”; done

Uncompress a file while keeping the original

$ gunzip -k *.gz

Find a file within all folders

$ ls */*/*/filename

Create a link to a path

$ ln -s /DATA_3/zeng/zeng/ mydata3

Remove all but certain files

$ rm */* -r — !(*gz)

Display the whole path of a file

$ readlink -f file.txt

Replace a character within a file with the file's path

$ for i in *; do sed -i “s/>/>$i./g” $i; done

Delay a command

$ sleep 10m

$ sleep 10 # 10 sec

$ sleep 10h

Rename in a batch

$ for i in *.fasta; do rename ‘s/fromA/toB/g’ $i; done

$rename ‘s/^mywork_myfile_/mywork_/’ *.fasta

$ for i in */*/*/contigs.fasta; do rename ‘s/\//_/g’ $i; done

Transfer files to a folder in a batch

$ for i in */;do echo $i; for s in $i*gz; do mkdir new/$i; cp $s new/$i; done; done

Remove all empty folders

$ rmdir -p

Run a command in background

$ nohup command-with-options &

OR screen OR tmux

Alternative command for copy

$ gcp -rv ~/Music/* /data/music/

$ rsync –info=progress2 -auvz ~/Music/ /data/music/

Copy files to another folder and change the filename

$ for i in *; do echo $i && cd ~/data3/$i && if [ -e assembly.fasta ]; then cp assembly.fasta ~/data3/$i.fasta; else echo “assembly file not exist”; fi; done

# judge the existence of the filename first.

Report server status every six hrs to an email

$ for i in $(seq 1 100); do echo $i; squeue -u zeng | mail -s “server status” email@email.dk; sleep 6h; done

Combine the last lines of files into a new file

$ tail */*log | cat – test.out > temp && mv temp test.out

Kill all processes of a user

$ pkill -u zeng

Disk usage

$ df

Folder info & stats

$ du

Display the number of files

$ echo “Number of files in this directory: `ls | wc -l`” # not ´, be aware of the direction

List folders only

$ ls -d */

Make all subfolders readable and writable

$ chmod 777 ./ -R

Display results in the next screen

$ command_x | more

Show current jobs

$ jobs

Check system's current state

$ top OR $ htop OR $ glances

Basic editor

nano OR vim OR emacs

Compress a file and a folder into a output file

$ zip -r output_file.zip file1 folder1

Escape certain files

$ command_x !(“filename1″|”filename2”)

Add into path

$ PATH=$PATH:~/opt/bin

Create a shortcut to a long command

$ alias [-p] [name[=value] … ]

Keep the first 15 characters of a filename

for i in *.gz; do echo $i; mv $i “$(echo $i | head -c15)”.gz; done

Show previously executed commands

history

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Case Studies

Quarterly Campaign Featuring Sequencing Library Pooling

Our budget sequencing services are highly popular with our clients. We provide quarterly sequencing campaigns at unmatched discounts, significantly reducing costs through pooling various sequencing libraries.

Admin

Holidays 2024

This is a summary of official holidays of 2024 in Denmark, the UK, and China. We use this calendar to count the number of workdays spent on a project.

Technical documents

Microbial Genome Sequencing

Sequencing a microbial genome has never been as easy and affordable as now. We provide a complete sequencing package with the basic bioinformatics analysis included. Our professional services allow you to be hustle-free and focus on your critical tasks.

Technical documents

Microbial Community Profiling

Microbial life dominates our planet Earth in terms of quantity and biodiversity. The very first step to understanding them is to know who they are. This can be addressed via high-throughput amplicon sequencing of a few conservative biomarker genes.

Practical Microbial Bioinformatics

Share This Post

Table of Contents

Remote Connection

Virtual Environment

Reads

Assembly

Annotation

Nanopore

Computational Genomics

Public Databases

Cloud

Basic Linux Commands

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Quarterly Campaign Featuring Sequencing Library Pooling

Holidays 2024

Microbial Genome Sequencing

Microbial Community Profiling

live a life as light as microbes

Do You Want To share your project or recieve a quote?

mBioWorks ApS