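How to install and use Canu, an error-correction and genome assembly tool for third-generation sequencing data or long contigs
Introduction:
Canu is an error-correction and genome assembly tool for long reads. It is designed for the long-read DNA sequencing data produced by third-generation platforms such as PacBio and Oxford Nanopore.
Canu aims to produce high-quality genome assemblies from long reads using a self-correction approach: the reads are corrected against each other rather than against a separate data set. It works in three stages. It first computes overlaps between the long reads and uses them to correct sequencing errors, then trims the corrected reads to remove poorly supported regions, and finally computes overlaps among the trimmed reads and assembles them into contigs.
Whether Canu fits depends on the problem at hand. It is a good choice when you need a high-quality assembly from long-read data, and it is used across fields such as microbiology, plant science, and zoology. It also handles large genomes, especially those that cannot be assembled accurately from short reads alone. Canu can yield longer contigs and more complete genome coverage, which helps when identifying genes and other genetic elements.
As usual, start with the papers: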
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
De novo assembly of haplotype-resolved genomes with trio binning | Nature Biotechnology
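Then the GitHub repository: marbl/canu: A single molecule sequence assembler for genomes large and small. (github.com)
Error correction and genome assembly are core tasks in genomics that help researchers obtain high-quality genome sequences quickly. The rest of this post covers how to install and use Canu on third-generation sequencing data or long contigs:
Installation
Building from source
Clone the Canu source repository:
Note: the official README advises against downloading the zip file, so clone the repository directly.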
git clone https://github.com/marbl/canu.git
cd canu/src
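Install the dependencies (if not already installed)
Canu depends on several third-party tools and libraries, such as zlib, bzip2, perl, and a C++ compiler. Make sure they are installed on your system.
Troubleshooting the installation is not covered here.
Compile Canu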
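As a quick sanity check before compiling, the sketch below reports which build tools are missing from PATH. The tool list (git, make, perl, g++) is my assumption based on the dependencies named above, and the function name check_build_deps is only illustrative; swap in clang++ or another compiler as needed.

```shell
#!/bin/sh
# Report which of the given commands are missing from PATH.
# Returns 0 if all are present, 1 otherwise.
check_build_deps() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed tool list. zlib and bzip2 are libraries, so check those
# through your package manager instead.
check_build_deps git make perl g++ || echo "Install the missing tools before running make."
```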
make -j 8  # number of parallel build jobs; adjust to your CPU count
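Set the PATH environment variable. This step is optional; if you skip it, run Canu by its absolute path. Point PATH at the bin directory produced by the build (its location inside the source tree differs between Canu versions).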
export PATH="/your-path-to-canu/canu/bin:$PATH"
Installing with a package manager (e.g. conda)
mamba is recommended because it is faster.
mamba create -n canu
mamba activate canu
mamba install -c conda-forge -c bioconda -c defaults canu
For conda environment setup, see the CSDN blog post "Installing and configuring the lightweight Miniconda3 on Linux (CentOS 9 Stream, Miniconda3 Linux 64-bit)".
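Whichever installation route you took, it is worth confirming that the shell can actually find Canu. A minimal sketch follows; the helper name report_tool is hypothetical, and it relies only on the -version flag listed in Canu's own usage text.

```shell
#!/bin/sh
# Print the location and version of a tool, or a hint if it is not on PATH.
# The tool name is a parameter so the same check works for other programs.
report_tool() {
    tool=$1
    if path=$(command -v "$tool" 2>/dev/null); then
        echo "$tool found at: $path"
        "$tool" -version   # canu lists -version in its usage text
    else
        echo "$tool not found on PATH; activate the conda environment or export PATH first" >&2
        return 1
    fi
}

report_tool canu || true
```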
Assembling with Canu: usage and steps
Suppose you have a raw Oxford Nanopore data file named nanopore_reads.fastq.gz and want to assemble a genome. A basic Canu command line looks like this:
canu -p project_name \
-d output_directory \
genomeSize=genome_size_in_bp \
useGrid=false \
maxMemory=memory_limit_in_gb \
maxThreads=num_threads \
-nanopore-raw nanopore_reads.fastq.gz
# Official reference usage:
canu [-haplotype|-correct|-trim] \
[-s <assembly-specifications-file>] \
-p <assembly-prefix> \
-d <assembly-directory> \
genomeSize=<number>[g|m|k] \
[other-options] \
[-trimmed|-untrimmed|-raw|-corrected] \
[-pacbio|-nanopore|-pacbio-hifi] *fastq
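Parameter explanation:
-p project_name: prefix for the output file names.
-d output_directory: directory in which the assembly is computed and written.
genomeSize: estimated size of the target genome in base pairs; the suffixes g, m, and k are also accepted.
useGrid=false: run locally instead of submitting jobs to a grid (cluster scheduler).
-nanopore-raw: path to the raw long-read input file(s).
maxMemory: upper limit on memory use, in gigabytes. Like genomeSize, it is given as key=value (maxMemory=32), not as a dash option.
maxThreads: upper limit on the number of threads, also given as key=value (maxThreads=8).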
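Since Canu uses genomeSize mainly to estimate read coverage, it can be useful to estimate that coverage yourself before launching a long run. The sketch below converts a size with a g/m/k suffix into base pairs and divides the total number of bases in a FASTQ file by it. The function names are my own, and the code assumes a strict four-lines-per-record FASTQ; it is not part of Canu.

```shell
#!/bin/sh
# Convert a genome size such as 4.7m, 4700k or 4700000 to base pairs.
genome_size_to_bp() {
    echo "$1" | awk '{
        size = tolower($0)
        mult = 1
        if (size ~ /g$/) mult = 1000000000
        else if (size ~ /m$/) mult = 1000000
        else if (size ~ /k$/) mult = 1000
        sub(/[gmk]$/, "", size)
        printf "%.0f\n", size * mult
    }'
}

# Sum the sequence lengths in a FASTQ file (plain or .gz),
# assuming exactly 4 lines per record.
total_bases() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# Estimated coverage = total bases / genome size in bp.
estimate_coverage() {
    bp=$(genome_size_to_bp "$2")
    bases=$(total_bases "$1")
    awk -v b="$bases" -v g="$bp" 'BEGIN { printf "%.1f\n", b / g }'
}

# Example call (hypothetical file name and genome size):
# estimate_coverage nanopore_reads.fastq.gz 4.7m
```

If the estimate comes out very low, check the genomeSize guess and the input file before starting the assembly.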
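A note on useGrid: if a scheduler such as Slurm is configured on the system, Canu submits jobs to it by default. If you do not want to use the cluster, add useGrid=false so that everything runs on the local node.
In my run I fed contigs assembled from second-generation (short-read) data directly to Canu as input. Running the job in the background with nohup is recommended.
Full help output: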
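The nohup advice can be wrapped in a small launcher. This is only a sketch: the variable names, default values, and log file name are placeholders I made up, the read flag simply follows the example earlier in this post, and the command is built as a plain string, so it assumes no spaces in paths. By default it only prints the command (DRY_RUN=1) so you can inspect it first.

```shell
#!/bin/sh
# Hypothetical settings; override them through the environment.
PREFIX=${PREFIX:-project_name}
OUTDIR=${OUTDIR:-output_directory}
GENOME_SIZE=${GENOME_SIZE:-4.7m}
MAX_MEM_GB=${MAX_MEM_GB:-32}
MAX_THREADS=${MAX_THREADS:-8}
READS=${READS:-nanopore_reads.fastq.gz}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the command, anything else = run it

# Build the command line as one string (assumes paths without spaces).
build_canu_cmd() {
    echo "canu -p $PREFIX -d $OUTDIR genomeSize=$GENOME_SIZE useGrid=false" \
         "maxMemory=$MAX_MEM_GB maxThreads=$MAX_THREADS -nanopore-raw $READS"
}

cmd=$(build_canu_cmd)
if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
else
    # nohup keeps the job alive after logout; all output goes to a log file.
    nohup sh -c "$cmd" > "$PREFIX.canu.log" 2>&1 &
    echo "started canu in background, PID $!; follow it with: tail -f $PREFIX.canu.log"
fi
```

Saved as, say, run_canu.sh (a hypothetical name), it would be launched for real with DRY_RUN=0 sh run_canu.sh.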
canu --help
usage: canu [-version] [-citation] \
[-haplotype | -correct | -trim | -assemble | -trim-assemble] \
[-s <assembly-specifications-file>] \
-p <assembly-prefix> \
-d <assembly-directory> \
genomeSize=<number>[g|m|k] \
[other-options] \
[-haplotype{NAME} illumina.fastq.gz] \
[-corrected] \
[-trimmed] \
[-pacbio |
-nanopore |
-pacbio-hifi] file1 file2 ...
example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz
To restrict canu to only a specific stage, use:
-haplotype - generate haplotype-specific reads
-correct - generate corrected reads
-trim - generate trimmed reads
-assemble - generate an assembly
-trim-assemble - generate trimmed reads and then assemble them
The assembly is computed in the -d <assembly-directory>, with output files named
using the -p <assembly-prefix>. This directory is created if needed. It is not
possible to run multiple assemblies in the same directory.
The genome size should be your best guess of the haploid genome size of what is being
assembled. It is used primarily to estimate coverage in reads, NOT as the desired
assembly size. Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'
Some common options:
useGrid=string
- Run under grid control (true), locally (false), or set up for grid control
but don't submit any jobs (remote)
rawErrorRate=fraction-error
- The allowed difference in an overlap between two raw uncorrected reads. For lower
quality reads, use a higher number. The defaults are 0.300 for PacBio reads and
0.500 for Nanopore reads.
correctedErrorRate=fraction-error
- The allowed difference in an overlap between two corrected reads. Assemblies of
low coverage or data with biological differences will benefit from a slight increase
in this. Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
gridOptions=string
- Pass string to the command used to submit jobs to the grid. Can be used to set
maximum run time limits. Should NOT be used to set memory limits; Canu will do
that for you.
minReadLength=number
- Ignore reads shorter than 'number' bases long. Default: 1000.
minOverlapLength=number
- Ignore read-to-read overlaps shorter than 'number' bases long. Default: 500.
A full list of options can be printed with '-options'. All options can be supplied in
an optional sepc file with the -s option.
For TrioCanu, haplotypes are specified with the -haplotype{NAME} option, with any
number of haplotype-specific Illumina read files after. The {NAME} of each haplotype
is free text (but only letters and numbers, please). For example:
-haplotypeNANNY nanny/*gz
-haplotypeBILLY billy1.fasta.gz billy2.fasta.gz
Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.
Reads are specified by the technology they were generated with, and any processing performed.
[processing]
-corrected
-trimmed
[technology]
-pacbio <files>
-nanopore <files>
-pacbio-hifi <files>
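The help text lists -correct, -trim, and -assemble for running a single stage at a time. The sketch below prints one command per stage so the intermediate reads can be inspected or reused between steps. It only prints and executes nothing. The helper name is hypothetical, and the intermediate file names (<prefix>.correctedReads.fasta.gz and <prefix>.trimmedReads.fasta.gz) and the separate assembly directory follow the pattern I remember from Canu's quick-start guide, so check them against your own output directory.

```shell
#!/bin/sh
# Print the three per-stage commands for a prefix, directory,
# genome size and raw Nanopore read file. Nothing is executed.
print_staged_canu() {
    prefix=$1; dir=$2; size=$3; reads=$4
    # Stage 1: correct the raw reads.
    echo "canu -correct -p $prefix -d $dir genomeSize=$size -nanopore $reads"
    # Stage 2: trim the corrected reads (assumed output name from stage 1).
    echo "canu -trim -p $prefix -d $dir genomeSize=$size -corrected -nanopore $dir/$prefix.correctedReads.fasta.gz"
    # Stage 3: assemble the trimmed reads in a separate directory.
    echo "canu -assemble -p $prefix -d $dir-assembly genomeSize=$size -trimmed -corrected -nanopore $dir/$prefix.trimmedReads.fasta.gz"
}

print_staged_canu godzilla run1 1g reads.fastq.gz
```

Splitting the run this way makes it possible to retry the assembly stage with a different correctedErrorRate without repeating correction and trimming.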
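Other recommended analysis tools and pipelines:
EasyMetagenome: a simple, easy-to-use metagenome analysis pipeline from Yong-Xin Liu's team (CSDN blog)
Metagenomics: phosphorus cycle (Pcycle) functional gene analysis, a detailed step-by-step walkthrough from the analysis to code and results, using PCycDB (CSDN blog)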