Practical Microbial Bioinformatics

This is a compilation of protocols and methods we use in analyzing the sequencing data of microbial genomes, microbiomes, and amplicons. It covers Linux basics and various bioinformatics pipelines. All protocols are tested in a local Bio-Linux server.

Share This Post

Table of Contents

Remote Connection

$ tmux ls # list

$ tmux a -t # attach to a session

$ tmux kill-session -t session-name # kill a session

$ scp zeng@login.genome.au.dk:/faststorage/home/zeng/201811GI-Greenland/megahit-assembly-trimmed/intermediate_contigs/k21.contigs.fa .


$ rsync –info=progress2 -auvz /DATA_3/than/result ~/data3/

$ for i in *; do cd $i/clean_data && for file in *1.fq.gz; do mv $file $i_R1_001.fastq.gz; done && for file in *2.fq.gz; do mv $file $i_R2_001.fastq.gz; done && cd ~/data3/result; done

$ rsync -e ssh -avz ~/mydata3/filename zeng@login.genome.au.dk:faststorage/file
#To copy a folder from your computer to the cluster:
$ rsync -e ssh -avz /path/to/data user@login.genome.au.dk:data
$ scp . yonzen@computerome.cbs.dtu.dk:mydata/data/VRSice

Virtual Environment

$ conda create –name env-name pysam

$ source activate amazing-project

$ source deactivate amazing-project

$ conda env list


$ gzip -cd CONN.20111109.0057.gz | head

$ fastqc *.gz -o stats-by-fastqc

$ bash bbduk.sh in=reads.fq out=clean.fq qtrim=rl trimq=20 

# rl, for both right and left; until Q20

$ bbduk.sh in=reads.fq out=clean.fq maq=20

#maq, average quality score

#final command

$ bash /home/zeng/bbmap/bbduk.sh in=reads.fq out=clean.fq qtrim=rl trimq=20 maq=20 minlen=50

$ ~/mydata3/bbmap/bbduk.sh in=VRS.ice_1.fq in2=VRS.ice_2.fq out=VRS.ice.trimmed.left10_1.fq out2=VRS.ice.trimmed.left10_2.fq ftl=10

# remove 10 most left bases

$ tar -zvcf VRS.ice.trimmed.fq.gz trimmed10/

#install 7z
$ tar xjf file.tar.bz2 #uncompress
$ chmod uog+x 7za #executable
$ ./7za a -v1000m -mx0 filename
#divide into files of 10ooMb each without compression
$ ./7za x filename.001 # merge small files


$ megahit/megahit -1 SRR341725_1.fastq.gz -2 SRR341725_2.fastq.gz -o SRR341725.megahit_asm

$ spades.py –meta –only-assembler –12 SS10-joined-trimmed.fq -m 120 –threads 6 -o SS10-assembly-spades

$ for i in v*; do echo $i && cd $i/clean_data && ls && tmux new -d “unicycler -1 *1.fq.gz -2 *2.fq.gz -o ~/data3/assemblies/$i –min_fasta_length 500” && cd ~/data3/result && sleep 40m; done

# automatically assemble all genomes remotely

$ unicycler -1 *1.fq.gz -2 *2.fq.gz -o ~/data3/assemblies/vice304 –min_fasta_length 500 –no_pilon # cancel polish, only a few to tens of base correction.

$ for i in *; do echo $i && cd $i && ls && tmux new -d “unicycler -1 *1.fq.gz -2 *2.fq.gz -o ~/mydata3/assemblies/$i –min_fasta_length 500 –no_pilon –threads 16” && cd ~/mydata3/Clean && sleep 180m; done

# an improved pipeline

# check reads quality
$ fastqc -t 64 -o fastqc *.fastq
# remove 20 bases at both ends for 150 base-long reads
$ ~/bbmap/bbduk.sh in=TET16_illumina_R1.fastq in2=TET16_illumina_R2.fastq out=TET16_illumina_R1.L15R40.fastq out2=TET16_illumina_R2.L15R40.fastq forcetrimleft=20 forcetrimright=130
# trim PacBio reads
# ~/bbmap/bbduk.sh in=TET16.pacbio.fastq out=TET16.pacbio.L10R25k.gt1k.fastq forcetrimleft=10 forcetrimright=25000 minlen=1000
# OR use the specific tool for treating PacBio reads
$ ./SequelTools.sh
# assembly Illumina and PacBio reads
$ unicycler -1 TET16_illumina_R1.L20R10.fastq -2 TET16_illumina_R2.L20R10.fastq -l ../pacbio.reads/TET16.pacbio.fastq -o ../hybrid.re.assembly.L20R10 -t 32
# re-try using different settings
# include all PacBio reads to assure the success, not be misguided by the uneven distribution of bases, which is caused by read length inconsistency. There are only a few very long reads.
# Try L15R40 illumina + pacbio gt500
$ ~/bbmap/bbduk.sh in=TET16.pacbio.fastq out=TET16.pacbio.gt500.fastq minlen=500
$ unicycler -1 TET16_illumina_R1.L15R40.fastq -2 TET16_illumina_R2.L15R40.fastq -l ../pacbio.reads/TET16.pacbio.gt500.fastq -o ../hybrid.re.assembly.il.L15R40.pb.gt500 -t 32
# The best strategy is Left removing 10 bases,Right removing adaptor,then Quality trimming to Q20, min length 50 bases

$ bbtools/stats.sh in=contigs.fasta -m 

$ quast.py [options] <contig_file(s)> -m 200


$ seqkit stats -a -T filename -o stats-filename

# -a, all; -T, tab format

$ awk ‘BEGIN{RS=”>”;ORS=””}length($0)>300{print “>”$0}’ input > output

# remove contigs shorter than 300 bp; count everything between two “>”

$ for i in *.gt300; do sed -i — ‘s/>/>$i/g’


$ for i in *; do echo $i &&  sed -i “s/>/>$i./g” $i; done

$ for i in *; do echo $i && cd ~/data3/assemblies/$i && if [ -e assembly.fasta ]; then cp assembly.fasta ~/data3/finalcontigs/$i.fasta; else echo “assembly file not exist”; fi; done

$ for i in *; do rename ‘s/.fasta//g’ $i; done

$ for i in *; do rename “s/$i/$i.gt500.fa/g” $i; done

$ for i in *; do sed -i “s/>/>$i./g” $i/assembly.fasta; done

$ for i in *; do sed -i “s/ /./g” $i/assembly.fasta; done

$ for i in *; do cp $i/assembly.fasta ../finalcontigs/$i.fasta; done

$ checkm lineage_wf <bin folder> <output folder>

for file in $(find ./mydata/ -name *.fastq -type f); do echo $file; gzip $file; donemafft-qinsi all.fasta > all.mafft.qinsi.fasta


$ svr_submit_RAST_job -user RAST-username -passwd RAST-password -fasta assembly.fasta -domain Bacteria -bioname “own_name” –genetic_code 11 -gene_caller rast

# a docker version of submitter

$ for i in *.fasta; do echo $i; svr_submit_RAST_job -user ******* -passwd ****** -fasta $i -domain Bacteria -bioname “***$i” –genetic_code 11 -gene_caller rast && rm $i -f; done

# automate the upload of the files in a folder

$ sed -e “s/\^M//” filename > newfilename

#remove CTRL+M that appears in RAST annotation results

#Note: To enter ^M, type CTRL-V + M. That is, hold down the CTRL key then press V and M in succession.

$ while IFS=’        ‘ read -r col1 col2; do echo “$col2”; name=$(print $col2 | sed -e “s/\^M//”); echo $name; svr_retrieve_RAST_job oceanbug maomaochong $name genbank > $col1.gbk; done < jobid.txt

# Tab as seperator; to type Tab in, press CTRL+V, then press tab

$ python genbank_to_fasta.py -i ../../genome/gbk/*.gbk

$ prokka –outdir prokka –prefix VRS.soil1.nanoAssembly –addgenes -kingdom Archaea –metagenome  *.fasta

$ barrnap –kingdom bac -o output.fasta -t 8 input.fasta

$ pip install gtdbtk  
$ wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/gtdbtk_r89_data.tar.gz
$ tar xvzf gtdbtk_r89_data.tar.gz
$ export GTDBTK_DATA_PATH=/DATA_3/databases/GTDB/release89
$ ~/.local/bin/gtdbtk classify_wf –cpus 12 –genome_dir ./ –extension fa –out_dir gtdbtk_output

$ conda install -c bioconda hmmer
$ hmmerbuild –amino -n rhodopsin rhodopsin.hmm rhodopsin.alighment
$ onda install -c bioconda prokka
# use customized HMM to annotate a genome
$ prokka –outdir prokka.fungenes.ice –prefix vrs.ice –addgenes –metagenome –cpus 196 –force –centre X –compliant ../vrs.ice.contigs.fa

# install docker
$ sudo snap install docker # version 19.03.11, or
$ sudo apt install docker.io
# install PGAP
$ wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
$ chmod +x pgap.py
$ ./pgap.py –update
# test installation environment, -n no report
$ time ./pgap.py -n -c 64 -o mg37_results.docker test_genomes/MG37/input.yaml
# test singularity
$ time ./pgap.py -n -D singularity -c 64 –no-internet –ignore-all-errors -o mg37_results.singularity test_genomes/MG37/input.yaml
# prepare yaml file and submol.yaml
# input.yaml\
class: File\
location: assembly.file.name.fasta\
class: File\
location: submol.yaml\
# submol.yaml\ # a simplified version, genus name that can be searched in NCBI is a must, species name can be any
genus_species: ‘Mycoplasma genitalium’
strain: ‘strain-name’
# start annotation
$ sudo ./pgap.py -r -o output-folder-filename –no-internet input.yaml

# copy all fasta files into a subfolder
$ for f in */*.fasta; do cp “$f” “$(echo “$f” | sed s/’\/assembly’//)”; done
# remove non-standard characters in fasta filenames
$ for f in *.fasta; do cp “$f” “$(echo “$f” | sed s/.*-1a-//)”; mv “$f” backup/; done
# automatically create output folders and yaml files needed for PGAP annotation
$ for f in *.fasta; do echo “$f”; mkdir “$(echo “$f” | sed s/.fasta//)”; cp $f “$(echo “$f” | sed s/.fasta//)”/; sed s/assembly.fasta/”$f”/ input.yaml > “$(echo “$f” | sed s/.fasta//)”/input.yaml; sed s/strain.name/”$(echo “$f” | sed s/.fasta//)”/ submol.yaml > “$(echo “$f” | sed s/.fasta//)”/submol.yaml; done
# test environment
$ time ./pgap.py -r -D singularity -o mg37_results.singularity test_genomes/MG37/input.yaml
$ time ./pgap.py -r -o ~/mydata3/genomes/final/AK113 –no-internet ~/mydata3/genomes/final/AK113/input.yaml
# start annotation
$ for f in /home/zeng/mydata/PGAP/assembly/*.fasta; do echo $f; ./pgap.py -n –no-internet –cpus 64 -o $(echo “$f” | sed s/.fasta//) $(echo “$f” | sed s/.fasta//)/input.yaml; done
$ for f in /home/zeng/mydata/PGAP/assembly/*.fasta; do echo $f; echo $(echo “$f” | sed s/.fasta//); echo $(echo “$f” | sed s/.fasta//)/input.yaml; done
$ ./pgap.py -n –no-internet -c 64 -o /home/zeng/mydata/PGAP/assembly/AK230 /home/zeng/mydata/PGAP/yongqin/assembly/AK230/input.yaml
# data cleanup
$ for f in ../*/*/*/*/assembly.fasta; do echo $f; cp “$f” “$(echo “$f” | sed s/’\/’/./g | sed s/…// | sed s/.*-1a-// | sed s/.*result.//)”; done
$ for f in *result*.fasta; do cp “$f” “$(echo “$f” | sed s/.*result.//)”; mv “$f” ./backup; done
$ for f in *-1a-*.fasta; do cp “$f” “$(echo “$f” | sed s/.*-1a-//)”; mv “$f” ./backup; done
$ for f in *.fasta; do cp “$f” “$(echo “$f” | sed s/.assembly//)”; mv “$f” ./backup; done


$ /home/tkn/Programs/Guppy_basecaller/ont-guppy-cpu/bin/guppy_basecaller –flowcell FLO-MIN106 –kit SQK-RAD004 -t 10 -i fast5/ -r -s basecalled/

$ porechop -i all.3rd.fastq -o all.3rd.porechopped.fastq

$ bbduk.sh in=reads.fq out=clean.fq qtrim=rl trimq=10 maq=10 minlen=2000
# recommend 7 (>80% accuracy)
$ ~/bbmap/bbduk.sh in= out= qtrim=rl trimq=7 maq=7 minlen=2000

$ NanoStat [-h] [-v] [-o OUTDIR] [-p PREFIX] [-t THREADS] (–fastq FASTQ | –summary SUMMARY | –bam BAM)

$ flye –nano-raw nanopore_VRSmetagenome.fastq –out-dir /home/zeng/VRSsoil-nanopore-assembly-by-flye-on-raw –meta -g 10g –threads 4

$ seqtk seq -a input.fastq > output.fasta
$ prokka –outdir prokka –prefix VRS.soil1.nanoAssembly –addgenes -kingdom Archaea –metagenome  *.fasta

$ samtools faidx test.fa name

Computational Genomics

$ ./fastANI –ql [QUERY_LIST] –rl [REFERENCE_LIST] -o [OUTPUT_FILE] ../fastani-Linux64-v1.1/fastANI –ql list –rl list -o all.Tardiphaga.ANI –matrix -t 8

# 1 thread by default

$ conda install -c bioconda comparem
$ comparem aai_wf –protein –cpus 16 –file_ext fasta ./aa ./aai_output

$ ../fastani-Linux64-v1.1/fastANI -q genome1.fasta -r genome2.fasta -o 1.vs.2.fastANI -t 8 –visualize

$ Rscript scripts/visualize.R genome1.fasta genome2.fasta 1.vs.2.fastANI.visual

$ sudo apt-get update
$ sudo apt install inkscape
$ inkscape -z -e out.png -w 5000 -h 2000 *.svg
$ inkscape t.svg –export-pdf=t.pdf
# install EasyFig
# EasyFig settings: 5k blast, 0.0001 cutoff, with outline, fileter small blast hits, 1200000 pixels, 50 pixels in genome lines, 150 pixles height, 3000 blast hit height

# install bioperl
$ cpanm Bio::DB::EUtilities
$ cpanm Bio::DB::GenBank
$ cpanm Bio::DB::GFF
$ bp_genbank2gff3.pl *.gbk –outdir gff
# convert gbk to gff3
$ roary *.gff -f all.genome.roary -e -n -i 80 -cd 100 -r -p 24
#-p, threads; -f, output folder;-e, create core gene alignment; -n, fast alignment with MAFFT;-i, identity from blastp;-cd 100, define core gene as presence in 100% species
# install ggplot2
# run R
# install.packages(“ggplot2”)

# create a conda environment
$ module load tools
$ module load anaconda2/4.4.0
$ conda create –name python=2.7 -y
$ source activate
() $ pip install pip –upgrade
() $ pip install <your_required_python_package(s)> –upgrade
$ conda env list
$ conda config –set safety_checks disabled
$ conda create -y -n metawrap-env python=2.7
$ source activate metawrap-env
# Note: ordering is important
$ conda config –add channels defaults
$ conda config –add channels conda-forge
$ conda config –add channels bioconda
$ conda config –add channels ursky
$ conda install -y -c ursky metawrap-mg
# reads classification
$ metawrap kraken -o kreken.result -t 64 -s 1000000 reads.fq
# consolidate binning results by using multiple programs
$ metawrap bin_refinement -o BIN_REFINEMENT -t 96 -A INITIAL_BINNING/metabat2_bins/ -B INITIAL_BINNING/maxbin2_bins/ -C INITIAL_BINNING/concoct_bins/ -c 50 -x 10
# refine a single bin
$ metawrap bin_refinement -o final.bins.metabat2.refined -t 64 -A final.bins.metabat2/ -c 50 -x 10
# count the number of high quality bins
$ cat BIN_REFINEMENT/metawrap_bins.stats | awk ‘$2>50 && $3<10’ | wc -l

# define mobilome as an annotation with a keyword transposase OR recombinase OR integrase OR phage
# identify loci using keyword search in annotaion file
# use python package Matplotlib to plot
# overlay the plot onto a genome structure figure
# python

import matplotlib.pyplot as plt

points1 = [ (100,1), (200,1), (3000,1), (666,1) ] # (a1,b1), (a2,b2), … # do not use “-” in a variable’s name
points2 = [ (1000,1), (2000,1), (5000,1), (8888,1) ]

plt.xlim(0,10000) # set up the plot limits

for pt in points1:
# plot (x,y) pairs.
# vertical line: 2 x,y pairs: (a,0) and (a,b)
ax.plot( [pt[0],pt[0]], [0,pt[1]], color=’blue’, linewidth=0.5 ) # set line color as b – blue

for pt in points2:
# plot (x,y) pairs.
# vertical line: 2 x,y pairs: (a,0) and (a,b)
ax.plot( [pt[0],pt[0]], [0,pt[1]], color=’black’, linewidth=0.5 ) # set line color as k – black



$ conda activate env.name
$ conda install roary
$ conda install prokka
# update blast, find the location of blastp
$ whereis blastp
# prepare data, annotate using prokka
$ prokka –outdir prokka_bin283 –genus Tardiphaga –locustag bin283 –cpus 128 –force –centre X –compliant –prefix bin283 VRS.ice.bin.283.fasta
# collect all gff files
$ cp */*.gff ./gff/.
# generate core gene alignment using roary
$ roary -p 128 -o tardiphaga -f ../roary.results -e -n -v *.gff
# use various program to generate a phylogenetic tree from the core gene alignment file

# use “ln -s” to link a certain file somewhere else to a filename in the current folder
# indexing contigs
$ bowtie2-build vrs.ice.contigs.fa vrs.ice –threads 192 –large-index && bowtie2-build vrs.soil.contigs.fa vrs.soil –thread 64 –large-index
# mapping reads back to contigs
$ bowtie2 -x vrs.ice -1 vrs.ice.reads.1.fq -2 vrs.ice.reads.2.fq -S vrs.ice.map.sam –no-unal –threads 196 && bowtie2 -x vrs.soil -1 vrs.soil.reads.1.fq -2 vrs.soil.reads.2.fq -S vrs.soil.map.sam –no-unal –threads 128
# converted to BAM format and sort either by read name or by leftmost alignment coordinate
# using -m to increase temp file size, to avoid failure due to too many temp files. do not exceed 1020 files
$ samtools sort -o vrs.ice.map.sorted.bam -O BAM –threads 196 -m 2G vrs.ice.map.sam
# remove duplicates
# download picard.jar
$ java -jar picard.jar MarkDuplicates INPUT=vrs.ice.map.sorted.bam OUTPUT=vrs.ice.map.markdup.bam \
# define the regions to be calculated on coverage
# using blast results, only columns on gene id, hit start and hit end
# SAF format, add a cloumn before and after
$ conda install -c bioconda csvtk
$ csvtk del-header vrs.ice.gyrB.blast | csvtk mutate2 -t -H -e ‘$3 > $2 ? $2 : $3 ‘ -L 0 | csvtk mutate2 -t -H -e ‘$3 > $2 ? $3 : $2 ‘ -L 0 | csvtk mutate2 -t -H -e ‘$3 > $2 ? “+” : “-” ‘ | csvtk mutate2 -t -H -e ‘”gyrB”‘ | csvtk cut -f7,1,4,5,6 -t -H > vrs.ice.gyrB.saf
# count the number of reads mapped to each gene
# download the program at https://sourceforge.net/projects/subread/files/subread-2.0.0/subread-2.0.0-Linux-x86_64.tar.gz/download
$ ./subread-2.0.0-Linux-x86_64/bin/featureCounts -F SAF -T 64 -a vrs.ice.recA.saf -o vrs.ice.recA.counts ../vrs.ice.map.markdup.bam
# count total reads in a bam file
# get the total number of reads of a BAM file (may include unmapped and duplicated multi-aligned reads)
$ samtools view -c ../vrs.soil.map.markdup.bam
# counting only mapped (primary aligned) reads
$ samtools view -c -F 260 ../vrs.soil.map.markdup.bam

$ conda create -y -n zeng_metawrap-env python=2.7
$ conda activate metawrap-env

# Note: ordering is important
$ conda config –add channels defaults
$ conda config –add channels conda-forge
$ conda config –add channels bioconda
$ conda config –add channels ursky
#vNO DATABASE needs to be installed for this function

# shorten path
ln -s /data/zeng/medusa.backup.2.5/all.raw.reads/11BGI-Greenland-metagenome/F18FTSEUHT1164_SOIxgtM/Reads/Clean/VRS-SOIL/vrs.soil.raw.reads.1.fq vrs.soil.reads_1.fastq;
ln -s C D
$ metawrap quant_bins -b ./VRS.HQ.bins/ice -o quantify.results.ice -a ../vrs.ice.contigs.fa -t 196 vrs.ice.reads_*.fastq

# Note that the abundances are expressed as “genome copies per million reads”, and are calculated with Salmon in a simmilar way like TPM (transcripts per million) is calculated in RNAseq analysis.
# make sure contig names inside a bin is consistent with those in the assembly
# rename the names inside a bin
# retrieve original names
$ grep “>” ../vrs.soil.contigs.fa > original.name
$ ln -s /data/zeng/computerome/computerome/data/VRSice/VRS-ice.contigs.fa vrs.ice.contigs.fa
$ sed -i ‘s/>VRS.ice.bin.*.fa.//g’ *.fa
$ for file in *.fa; do echo $file >> name.change.log; for i in $(grep ‘k141_’ $file); do date >> name.change.log; echo “$i” >> name.change.log; replace=$(grep -m 1 $i”_” original.name); echo “$replace” >> name.change.log; sed -i ‘s/'”$i”‘/'”$replace”‘/’ $file; done; done
$ grep -m 1 # only find the first match

$ wget http://circos.ca/distribution/circos-0.69-9.tgz
$ tar xvfz ../circos-0.69-9.tgz
$ cd circos/bin
$ ./circos -v or -h or -man
$ circa 100 usd, skin for using data on circos
# use Easyfig to generate DNA comparison results, take coordinate columns as input into circa.(DNA level)
# or do local blast to generate tblastn results, convert into circa compatible file. (protein level) \
chromosome file\
chromosome1 chromosome2 identity length mismatch start1 end1 start2 end2 e-value \
1.easyfig.fa 2.easyfig.fa 73.491 24056 5661 259 1940218 1964036 1636159 1612583 0 14110\
1.easyfig.fa 2.easyfig.fa 77.883 19876 3968 121 2427168 2446844 1096555 1076909 0 15866\
# make sure to remove duplicate regions in EasyFig results by sorting according to Start loci

$ conda install -c bioconda phylophlan
# or install locally, need to install trimal as well, add into path PATH=$PATH:~/opt/bin

# download refs
$ phylophlan_get_reference -g p__Gemmatimonadetes -o Gemma.input –verbose

# Download reference genomes from close phyla
$ for i in Fibrobacteres Poribacteria Latescibacteria Cloacimonetes Zixibacteria Marinimicrobia Ignavibacteria Caldithrix Bacteroidetes Chlorobi; do phylophlan_get_reference -g p__${i} -n 1 -o Gemma.input –verbose; done

# Generate config file
$ phylophlan_write_config_file -d a -o gemma_config.cfg –db_aa diamond –map_dna diamond –map_aa diamond –msa mafft –trim trimal –tree1 fasttree –tree2 raxml –verbose

# download all NCBI Gemma genomes
$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt
for f in $(grep Gemmatimonadetes prokaryotes.txt | csvcut -t -c 21); do wget “$f/*_genomic.fna.gz” –retry-connrefused –waitretry=1 –read-timeout=10 –timeout=5 -t 0; done
$ for f in $(grep Gemmatimonadetes prokaryotes.txt | cut -f21); do wget –retry-connrefused –waitretry=1 –read-timeout=10 –timeout=5 -t 0 “$f/*_protein.faa.gz”; done

# Build the phylogenetic tree
$ phylophlan -i Gemma.input -d phylophlan –diversity medium –accurate -f gemma_config.cfg -o output_gemma –nproc 128 -t a –verbose

# bootstrapping
$ raxmlHPC -m PROTGAMMAAUTO -p 12345 -b 12345 -# 100 -s alignment.file -n output.name

# add bootstrap values onto the tree
$ raxmlHPC -m PROTGAMMAAUTO -p 12345 -f b -t best.tree.file -z bootstrap.output.file -n output.name

# use iTOL to display the tree
# rename taxa names by removing ASM*** in the name. the logic is to create line first, delete, then combine lines again.
$ sed ‘s/_ASM/\n_ASM/g’ final.tree.RAxML.best.renamed.tre | sed ‘s/_ASM.*v1:/:/g’ | sed ‘s/_ASM.*v2:/:/g’ | sed ‘:a;N;$!ba;s/\n//g’ > final.phylogenomic.tree

Public Databases

# install enaBrowserTools
# download search results
$ python2.7 ../enaBrowserTools-1.5.4/python/enaGroupGet.py -g assembly -f fasta -t -e -w 2 # 2 represents bacterial taxon id
$ python2.7 ../enaBrowserTools-1.5.4/python/enaDataGet.py -f fasta -t -e -w 2 # download both type
# convert downloaded txt to csv
$ sed -i ‘s/\t/,/g’ all.bac.arc.ena.txt
# find the column number
$ csvcut -n all.bac.arc.ena.txt
# count the number of genomes to be downloaded
$ csvcut -c fasta_file  all.bac.arc.ena.txt | wc -l
# remove useless digits in filenames
$ while read line; do echo $line | tail -c 7 >> all.bac.list.clean; done < all.bac.list
# download all bacterial and archaeal genomes
$ while read line; do echo $line; A=$(echo $line | tr -d ‘\r’); echo $A; curl -O $A && ls -lh | wc -l; done < ../archaea.list
$ while read line; do A=$(echo $line | tr -d ‘\r’); echo $A; curl -O $A && ls -lh | wc -l; done < all.bac.arc.list
# unzip one by one
$ for i in *.fasta.gz; do echo $i; echo “total unzipped files: ‘ls *.fasta -lh | wc -l'”; gunzip -k $i; done
# identify pathogens and move into a separate folder
$ declare -i B=0; for i in *.fasta.gz; do echo $i; A=$(gunzip -c $i | head -n1); if [[ “$A” == *Salmonella* ]] || [[ “$A” == *Brucella* ]] || [[ “$A” == *Chlamydia* ]] || [[ “$A” == *Clostridioides* ]] || [[ “$A” == *Clostridium* ]] || [[ “$A” == *Corynebacterium* ]] || [[ “$A” == *Enterococcus* ]] || [[ “$A” == *Erysipelothrix* ]] || [[ “$A” == *Escherichia* ]] || [[ “$A” == *Haemophilus* ]] || [[ “$A” == *Helicobacter* ]] || [[ “$A” == *Listeria* ]] || [[ “$A” == *Mycobacterium* ]] || [[ “$A” == *Neisseria* ]] || [[ “$A” == *Shigella* ]] || [[ “$A” == *Staphylococcus* ]] || [[ “$A” == *Streptococcus* ]] || [[ “$A” == *Yersinia* ]]; then B=$B+1 && echo $B “file(s) moved to the folder of pathogens” && mv $i ./pathogens; fi; done
# add additional pathogens
$ declare -i B=0; for i in *.fasta.gz; do echo $i; A=$(gunzip -c $i | head -n1); if [[ “$A” == *Klebsiella* ]] || [[ “$A” == *”Acinetobacter baumannii”* ]] || [[ “$A” == *”Pseudomonas aeruginosa”* ]] || [[ “$A” == *”Mycobacteroides abscessus”* ]] || [[ “$A” == *Acinetobacter_baumannii* ]] || [[ “$A” == *Pseudomonas_aeruginosa* ]] || [[ “$A” == *Mycobacteroides_abscessus* ]]; then B=$B+1 && echo $B “file(s) moved to the folder of pathogens” && mv $i ./pathogens; fi; done
# check progress
$ ls -lh | wc -l && ls ../ -lh | wc -l && echo `ls -lh | tail -n1`

$ csvcut -n ../prokaryotes.csv
$ csvcut -c 15 ../prokaryotes.csv > ftp.list
$ while read line; do A=$(echo $line | tr -d ‘\r’); echo $A; ncftpget ${A}/*.fna.gz && ls -lh | wc -l; done < ftp.list
# display the total size of all files
$ du -ach
# check all downloads
$ find . -name *.gz | wc -l
# remove duplicate genome with fastx toolkit
$ fastx_collapser < some.fasta > some.unique.fasta
$ usearch -fastx_uniques input.fasta -fastaout uniques.fasta -sizeout -relabel Uniq
# only works with single fasta file
# remove fasta containing pathogens by looking for name in the first line
$ declare -i B=0; for i in *.fna; do A=$(head -n1 -q $i); if [[ “$A” == *cyanobacterium* ]]; then B=$B+1 && echo $B “file(s) removed” && mv $i ./cyanobacterium; fi; done
# define B as an integral, count how many files have been treated
$ bash ~/bbmap/dedupe.sh

# link to http://fungene.cme.msu.edu/hmm_download.spr?hmm_id=331
$ for i in $(seq 1 1000); do curl -O http://fungene.cme.msu.edu/hmm_download.spr?hmm_id=$i && ls -lh | wc -l; done
# replace filename with gene name, which is the 2nd field of the 2nd line of each file
$ for i in *; do echo $i; name=”$(awk ‘FNR == 2 {print $2}’ $i)”; echo $name; mv ./$i ./$name; done
# find a certain line with certain characters
$ grep -hnr “bin.280” gtdbtk.bac120.markers_summary.tsv

# create a local database of all downloaded genomes
$ makeblastdb -in all.ena*.fasta -title all.ena.archaea -out all.ena.archaea -max_file_sz 2GB -dbtype nucl
# look for certain genes using rhodopsin gene as the example
$ tblastn -db ../all.no.pathogen/fna/all.ncbi -query xr -num_threads 20 -out xr.blast.results -max_target_seqs 10000 -outfmt “6 qseqid qgi qacc sseqid sallseqid sgi sallgi sacc sallacc qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos sscinames scomnames qcovs qcovhsp qcovus”
# process results and keep needed columns only
$ csvcut pufm.blast.results -t -H -c 4,20,29 > pufm.sorted # t, tab as separator;H,no head
# remove certain lines
$ sed -i <start>,<end>d <filename>
# or keep original files
$ sed ‘1,2395d’ pufm.sorted.clean.identity.sorted > tmp; mv tmp pufm.identity.removed
# select column C, remove duplicates, count lines
$ csvcut -t -c a pufm.final | sort -u | wc -l
# remove blast results with low coverage or low identity
$ csvsort -c t xr.sorted | sed ‘1,1470d’ | csvsort -c 3 | sed ‘1,88d’ > xr.final
# use the first six letter to identify the same sequences
$ sort pufm.final.deRep.shortname xr.final.deRep.shortname | awk ‘dup[$0]++ == 1’
# find corresponding species names
$ while read line; do echo $line; grep “$line” ena.all.id >> dual.taxon; done < dual.list.final
# download taxonomy data from ENA, convert into csv
$ ssconvert taxonomy.xml taxonomy.csv
# process, sort, remove certain lines, resort
$ csvsort -c t pr.sorted | sed ‘1,711d’ | csvsort -c 3 | sed ‘1,55d’ > pr.sorted.gt30
# generate ID, remove duplicates
$ csvcut -c 1 pr.sorted.gt30 | sed ‘s/|/,/g’ | sed ‘s/01/0100/g’ | sed ‘s/00/,/g’ | csvcut -c 2 | sort -u > pr.final.list

# for WGS
$ while read line; do echo $line; grep ‘”‘$line'”‘ prokaryotes.csv >> pufM.list.final; done < pufM.list.name.txt
# for CP series
$ while read line; do echo $line; grep $line prokaryotes.csv >> pufM.list.final; done < pufM.list.name2.txt

  • Install the AWS CLI and configure it with your AWS credentials.
  • Install the Dropbox-Uploader script and configure it with your Dropbox credentials.
  • git clone https://github.com/andreafabrizi/Dropbox-Uploader.git
    $chmod +x dropbox_uploader.sh
  • Use the AWS CLI to sync the S3 folder to a local folder on your computer:
  • aws s3 sync s3://your-bucket/your-folder /path/to/local/folder
  • Use the Dropbox-Uploader script to upload the local folder to Dropbox:
  • ./dropbox_uploader.sh upload /path/to/local/folder /Dropbox/folder
  • Note: You may need to modify the above commands to fit your specific use case, such as specifying the correct bucket and folder names.

Create an APP with Dropbox and generate an access token.


curl -X POST https://content.dropboxapi.com/2/files/upload \
–header “Authorization: Bearer $ACCESS_TOKEN” \
–header “Dropbox-API-Arg: {\”path\”: \”$DROPBOX_PATH\”,\”mode\”: \”add\”,\”autorename\”: true,\”mute\”: false,\”strict_conflict\”: false}” \
–header “Content-Type: application/octet-stream” \
–data-binary @$LOCAL_FILE_PATH

Or, in short

curl -X POST https://content.dropboxapi.com/2/files/upload –header “Authorization: Bearer YOURACCESSTOKEN” –header “Dropbox-API-Arg: {\”path\”: \”/fastqc.zip\”,\”mode\”: \”add\”,\”autorename\”: true,\”mute\”: false,\”strict_conflict\”: false}” –header “Content-Type: application/octet-stream” –data-binary @fastqc.zip

Basic Linux Commands

$ du -sh file_path

$ rsync -av –progress sourcefolder /destinationfolder –exclude thefoldertoexclude

$ ls -l | wc -l

$ tar -zcvf name.tar.gz /home/folder_to_be_compressed

$ find / -name *.gz

$ find . -name * -type d

$ tar -tf filename.tar.gz

$ for i in */; do zip -r “${i%/}.zip” “$i”; done

$ ls */*/*/filename

$ ln -s /DATA_3/zeng/zeng/ mydata3

$ rm */* -r — !(*gz)

$ readlink -f file.txt

$ for i in *; do sed -i “s/>/>$i./g” $i; done

$ sleep 10m

$ sleep 10 # 10 sec

$ sleep 10h

$ for i in *.fasta; do rename ‘s/fromA/toB/g’ $i; done


$rename ‘s/^mywork_myfile_/mywork_/’ *.fasta


$ for i in */*/*/contigs.fasta; do rename ‘s/\//_/g’ $i; done

$ for i in */;do echo $i; for s in $i*gz; do mkdir new/$i; cp $s new/$i; done; done

$ nohup command-with-options &

OR screen OR tmux

$ gcp -rv ~/Music/* /data/music/


$ rsync –info=progress2 -auvz ~/Music/ /data/music/

$ for i in *; do echo $i && cd ~/data3/$i && if [ -e assembly.fasta ]; then cp assembly.fasta ~/data3/$i.fasta; else echo “assembly file not exist”; fi; done


# judge the existence of the filename first.

$ for i in $(seq 1 100); do echo $i; squeue -u zeng | mail -s “server status” email@email.dk; sleep 6h; done

$ tail */*log | cat – test.out > temp && mv temp test.out

$ df

$ echo “Number of files in this directory: `ls | wc -l`”  # not ´, be aware of the direction

$ ls -d */ 

$ command_x | more 

$ top OR $ htop OR $ glances

nano OR vim OR emacs

$ zip -r output_file.zip file1 folder1 

$ command_x !(“filename1″|”filename2”) 

$ PATH=$PATH:~/opt/bin 

$ alias [-p] [name[=value] … ]

for i in *.gz; do echo $i; mv $i “$(echo $i | head -c15)”.gz; done

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore


Holidays 2024

This is a summary of official holidays of 2024 in Denmark, the UK, and China. We use this calendar to count the number of workdays spent on a project.

Read More »
Technical documents

Microbial Genome Sequencing

Sequencing a microbial genome has never been as easy and affordable as now. We provide a complete sequencing package with the basic bioinformatics analysis included. Our professional services allow you to be hustle-free and focus on your critical tasks.

Read More »
Technical documents

Microbial Community Profiling

Microbial life dominates our planet Earth in terms of quantity and biodiversity. The very first step to understanding them is to know who they are. This can be addressed via high-throughput amplicon sequencing of a few conservative biomarker genes.

Read More »

live a life as light as microbes

Do You Want To share your project or recieve a quote?