Selected TOOLS for data mining.

NINJA-OPS is a recursive acronym that stands for "NINJA Is Not Just Another - OTU Picking Solution".


Making use of BWT-enabled aligners, NINJA quickly generates a QIIME-compatible closed-reference OTU table from a given fasta file of sequence reads. NINJA also allows for convenient quality control on your data, such as fast reverse complementing, base pair trimming, and a specialized denoising transformation. Moreover, NINJA is entirely free and open source.
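To make the quality-control steps named above concrete, here is a minimal Python sketch of reverse complementing and base pair trimming on FASTA records. It is an illustration only and not NINJA's own code: the function names (`read_fasta`, `reverse_complement`, `trim`) and the example read IDs are hypothetical, and NINJA's denoising transformation is not reproduced here.

```python
# Illustrative sketch only; this is NOT NINJA's implementation.
# All function names and the example reads are hypothetical.

# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim(seq: str, max_len: int) -> str:
    """Keep at most the first max_len bases of a read."""
    return seq[:max_len]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Two made-up reads; the first is wrapped over two lines.
example = [">s1_1", "ACGTT", "GG", ">s2_1", "NNAC"]
records = list(read_fasta(example))

# Reverse complement each read, then trim it to 4 bases.
processed = [(name, trim(reverse_complement(seq), 4)) for name, seq in records]
```

In this sketch the trimming is applied after reverse complementing; whether a real pipeline trims before or after depends on which end of the read should be kept.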


NINJA is 5-10 times faster than USEARCH, and produces more accurate OTU assignments. NINJA can convert an entire MiSeq run of human microbiome amplicon data into a taxonomy-annotated OTU table in around 10 minutes on your laptop.


The latest NINJA version can always be found on GitHub, or at

BugBase is an analysis tool for 16S datasets. BugBase estimates high-level community-wide microbiome phenotypes including:


  • Gram Staining

  • Biofilm Formation

  • Pathogenicity

  • Mobile Element

  • Oxygen Tolerance


The latest BugBase version can always be found on GitHub.


An interactive web-based version is also available. Upload your closed-reference Greengenes 13.8 QIIME-compatible OTU table in BIOM format and get back a table of the estimated prevalence of these traits in each sample.

QIIME2 (canonically pronounced "chime") stands for Quantitative Insights Into Microbial Ecology. Over 3,000 citations!


QIIME2 is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data. QIIME2 is designed to take users from raw sequencing data generated on Illumina or other platforms through to publication-quality graphics and statistics. This includes demultiplexing and quality filtering, OTU picking, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations. QIIME2 has been applied to studies based on billions of sequences from tens of thousands of samples.


Find download links and more at


Citation: Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R. (2010). "QIIME allows analysis of high-throughput community sequencing data". Nature Methods 7(5):335-6.

SourceTracker uses Bayesian statistics to predict the source of microbial communities in a set of input samples (i.e., the sink samples). See some uses of SourceTracker for inspiration here.

Download the latest version on GitHub. Read a tutorial and more at


Citation: Knights D, Kuczynski J, Charlson E, Zaneveld J, Collman RG, Bushman FD, Knight R, Kelley ST. (2011). Bayesian community-wide microbial source tracking. Nature Methods. 2011 Jul 17.

Overview of tools by Gabe

"Big Guns"

  • aKronyMer: De novo phylogeny and database-free metagenomic sample comparison and diversity calculation

  • BURST: Optimal short-read alignment for metagenomic shotgun and amplicon data

  • NINJA-OPS: Heuristic short-read alignment for amplicon data

  • SHI7: Self-learning quality control for short-read (metagenomic) fastq sequence data

  • UTree: Heuristic short-read assignment for metagenomic shotgun and amplicon data


  • Kafan: Compression for biological sequence data faster and more efficient than gzip

  • EMBALMLETS: Miscellaneous tools for metagenomic short-read processing:

    • LLsim: short-read simulator with specific error control from multi-sequence fasta inputs

    • embalmulate: converter for burst/.b6 output into QIIME1 OTU table

    • bcov: coverage analyzer for burst/.b6 output

    • lingenome: prepares genomic database out of a folder of NCBI assemblies

    • t2gg/a2gg: taxonomy annotation toolkit for NCBI genome/gene assemblies

  • WRANGLr: microbial coalition creation using Nei-Saitou-guided tree of pairwise feature recombination

  • Gabe's DBs: pipeline to automate the creation of new amplicon, WGS, and gene databases from NCBI RefSeq representative sequences
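
As a rough illustration of the kind of conversion the embalmulate entry above describes (alignment hits in, OTU table out), the following Python sketch tallies hits per reference and per sample. It is not embalmulate's code. It assumes, hypothetically, that each .b6 line is tab-separated with the query ID in the first column and the reference ID in the second (as in BLAST-style tabular output), and that the sample name is the query ID up to its last underscore; the function names and example hits are made up.

```python
# Illustrative sketch only; this is NOT how embalmulate is implemented.
# Assumptions (hypothetical): tab-separated lines, column 1 = query (read) ID,
# column 2 = reference ID, and read IDs of the form "<sample>_<number>".
from collections import Counter, defaultdict


def b6_to_otu_counts(lines):
    """Count alignment hits per reference ID and per sample."""
    table = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query_id, ref_id = fields[0], fields[1]
        sample = query_id.rsplit("_", 1)[0]  # drop the trailing read number
        table[ref_id][sample] += 1
    return table


def format_table(table):
    """Render the counts as a tab-separated table, one row per reference."""
    samples = sorted({s for counts in table.values() for s in counts})
    rows = ["#OTU ID\t" + "\t".join(samples)]
    for ref_id in sorted(table):
        counts = "\t".join(str(table[ref_id][s]) for s in samples)
        rows.append(ref_id + "\t" + counts)
    return "\n".join(rows)


# Four made-up hits from two samples, A and B.
hits = [
    "A_1\totu1\t99.0",
    "A_2\totu1\t98.0",
    "B_1\totu2\t97.5",
    "B_2\totu1\t100.0",
]
otu_table = b6_to_otu_counts(hits)
otu_text = format_table(otu_table)
```

A real converter would also need to handle ties between references, taxonomy strings, and whatever extra columns the aligner emits; none of that is attempted here.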