KofamScan-KEGG官方推荐的使用系同源和隐马尔可夫模型进行KO注释

简介

KofamScan 是一款基于 KEGG 直系同源和隐马尔可夫模型(HMM)的基因功能注释工具,可通过同源搜索和预先计算的自适应分数阈值,将 KEGG 同源物(KOs)分配给蛋白质序列的隐马尔可夫模型(KOfam)数据库。在线版本可在 https://www.genome.jp/tools/kofamkoala/ 上获取。KofamKOALA 比现有的 KO 分配工具更快,其准确性可与性能最好的工具相媲美。KofamKOALA 的功能注释有助于将基因与 KEGG 通路图等 KEGG 资源联系起来,并促进分子网络的重建。

安装

# 软件
(base) [yut@io02 ~]$ mamba install  -c bioconda kofamscan
# 数据库
# filezila打开ftp.genome.jp看一下最新版数据库更新时间
mkdir -p KoFamDB20231025
cd KoFamDB20231025
#数据库:
wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
gunzip ko_list.gz
tar xvzf profiles.tar.gz

使用

输入蛋白序列

查询文件是包含一个或多个氨基酸序列的 FASTA 文件。不能使用核苷酸序列。每个序列必须有一个唯一的名称。序列名称是介于标题符号(“>”)和第一个空白字符(空格、制表符、换行符等)之间的字符串。请勿在">"之后使用空格。

(base) [yut@node01 TestData]$ grep -c ">" GCA_002343985.1_ASM234398v1_genomic.faa
2486 # 蛋白数量

输出detail-tsv格式

time exec_annotation -p ~/Database/KoFamDB20231025/profiles -k ~/Database/KoFamDB20231025/ko_list --cpu 10 --tmp-dir . -E 1e-5 -f detail-tsv -o GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail-tsv.tsv GCA_002343985.1_ASM234398v1_genomic.faa &>GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail.log

# -p与-k为上述数据库配置文件
# -E 为e-value
# -f为输出格式,包括:
-f, --format <format>      Format of the output [detail]
      detail:          Detail for each hits (including hits below threshold)
      detail-tsv:      Tab separeted values for detail format
      mapper:          KEGG Mapper compatible format
      mapper-one-line: Similar to mapper, but all hit KOs are listed in one line
# 3m18.871s

输出detail格式

(base) [yut@node01 TestData]$  time exec_annotation -p ~/Database/KoFamDB20231025/profiles -k ~/Database/KoFamDB20231025/ko_list --cpu 10 --tmp-dir . -E 1e-5 -f detail -o GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail_format.tsv  GCA_002343985.1_ASM234398v1_genomic.faa &>GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail.log&
# real    3m0.691s

输出mapper格式

(base) [yut@node01 TestData]$  time exec_annotation -p ~/Database/KoFamDB20231025/profiles -k ~/Database/KoFamDB20231025/ko_list --cpu 10 --tmp-dir . -E 1e-5 -f mapper -o GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_format.tsv  GCA_002343985.1_ASM234398v1_genomic.faa &>GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper.log&
# real    3m13.254s

输出结果

detail和detail-tsv格式

(base) [yut@node01 TestData]$ head GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail-tsv.tsv
#       gene name       KO      thrshld score   E-value "KO definition"
#       ---------       ------  ------- ------  ---------       -------------
*       GLGNCOCG_00001  K09946  28.63   105.8   3.2e-31 "uncharacterized protein"
*       GLGNCOCG_00002  K22441  90.53   175.7   2.4e-52 "diamine N-acetyltransferase [EC:2.3.1.57]"
        GLGNCOCG_00002  K03789  105.10  66.4    4.4e-19 "[ribosomal protein S18]-alanine N-acetyltransferase [EC:2.3.1.266]"
        GLGNCOCG_00002  K03828  117.57  66.1    4.1e-19 "putative acetyltransferase [EC:2.3.1.-]"
        GLGNCOCG_00002  K03825  113.37  57.6    1.7e-16 "L-phenylalanine/L-methionine N-acetyltransferase [EC:2.3.1.53 2.3.1.-]"
        GLGNCOCG_00002  K24217  175.13  55.7    6e-16   "L-amino acid N-acyltransferase [EC:2.3.1.-]"
        GLGNCOCG_00002  K09994  117.73  50.8    2.6e-14 "(aminoalkyl)phosphonate N-acetyltransferase [EC:2.3.1.280]"
        GLGNCOCG_00002  K15520  93.30   50.6    3.5e-14 "mycothiol synthase [EC:2.3.1.189]"
(base) [yut@node01 TestData]$ head GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail_format.tsv
# gene name           KO     thrshld  score   E-value KO definition
#-------------------- ------ ------- ------ --------- ---------------------
* GLGNCOCG_00001      K09946   28.63  105.8   3.2e-31 uncharacterized protein
* GLGNCOCG_00002      K22441   90.53  175.7   2.4e-52 diamine N-acetyltransferase [EC:2.3.1.57]
  GLGNCOCG_00002      K03789  105.10   66.4   4.4e-19 [ribosomal protein S18]-alanine N-acetyltransferase [EC:2.3.1.266]
  GLGNCOCG_00002      K03828  117.57   66.1   4.1e-19 putative acetyltransferase [EC:2.3.1.-]
  GLGNCOCG_00002      K03825  113.37   57.6   1.7e-16 L-phenylalanine/L-methionine N-acetyltransferase [EC:2.3.1.53 2.3.1.-]
  GLGNCOCG_00002      K24217  175.13   55.7     6e-16 L-amino acid N-acyltransferase [EC:2.3.1.-]
  GLGNCOCG_00002      K09994  117.73   50.8   2.6e-14 (aminoalkyl)phosphonate N-acetyltransferase [EC:2.3.1.280]
  GLGNCOCG_00002      K15520   93.30   50.6   3.5e-14 mycothiol synthase [EC:2.3.1.189]

(base) [yut@node01 TestData]$ sort -t $'\t' -k4 -n GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail-tsv.tsv|head
#       ---------       ------  ------- ------  ---------       -------------
#       gene name       KO      thrshld score   E-value "KO definition"
        GLGNCOCG_01738  K19152          0.5     1.9e+02 "small toxic protein IbsD"
        GLGNCOCG_02049  K21664          1.7     4.3     "KSHV latency-associated nuclear antigen"
        GLGNCOCG_02331  K21664          2.9     1.8     "KSHV latency-associated nuclear antigen"
        GLGNCOCG_00788  K21664          4.7     0.53    "KSHV latency-associated nuclear antigen"
        GLGNCOCG_01757  K19313          5.6     1.5     "aminocyclitol acetyltransferase [EC:2.3.-.-]"
        GLGNCOCG_01062  K16402          6.8     0.023   "soraphen polyketide synthase B"
        GLGNCOCG_00067  K16007          7.1     0.033   "8,8a-deoxyoleandolide synthase 1"
        GLGNCOCG_01945  K26748          7.2     0.27    "MFS transporter, AAHS family, 2,4-dichlorophenoxyacetate transporter"
 
# 有明确KO号的基因数量
(base) [yut@node01 TestData]$ awk '$1~/*/' GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail-tsv.tsv|wc -l
1216

(base) [yut@node01 TestData]$ echo|awk '{print 1216/2486*100}'
48.9139

可以看到detailj及detail-tsv格式是用制表符将所有比对的结果都输出,即除了满足阈值< 1e-5,同时也将大于该阈值的结果输出。总共7列,第一列是标记该基因是否分配到Knumber,带星号的表示有Knumber;第二列表示输入基因名称;第三列是KO号;第四列是阈值;第五列是得分,得分一般大于前面阈值列的就会有KO;第六列是E-value;第七列是KO definition。该基因组最终有48.91%的蛋白序列有KO注释。

mapper格式

(base) [yut@node01 TestData]$ head GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_format.tsv
GLGNCOCG_00001  K09946
GLGNCOCG_00002  K22441
GLGNCOCG_00003
GLGNCOCG_00004  K03286
GLGNCOCG_00005
GLGNCOCG_00006
GLGNCOCG_00007
GLGNCOCG_00008
GLGNCOCG_00009
GLGNCOCG_00010

# 一个基因注释到多个KO
(base) [yut@node01 TestData]$ grep GLGNCOCG_00297  GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_format.tsv
GLGNCOCG_00297  K13990
GLGNCOCG_00297  K00603
GLGNCOCG_00297  K01500
(base) [yut@node01 TestData]$ grep GLGNCOCG_00297  GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_detail_format.tsv
* GLGNCOCG_00297      K13990  507.67  749.9  4.5e-226 glutamate formiminotransferase / formiminotetrahydrofolate cyclodeaminase [EC:2.1.2.5 4.3.1.4]
* GLGNCOCG_00297      K00603  405.93  411.8  4.9e-124 glutamate formiminotransferase / 5-formyltetrahydrofolate cyclo-ligase [EC:2.1.2.5 6.3.3.2]
* GLGNCOCG_00297      K01500  201.63  215.6   7.4e-65 methenyltetrahydrofolate cyclohydrolase [EC:3.5.4.9]
  GLGNCOCG_00297      K01746  429.53   65.3   8.3e-19 formiminotetrahydrofolate cyclodeaminase [EC:4.3.1.4]
  GLGNCOCG_00297      K10740   25.77   10.7     0.042 replication factor A3

# 有KO的
(base) [yut@node01 TestData]$ awk -F"\t" '$2!=""' GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_format.tsv|wc -l
1216

mapper格式总共两列,第一列是基因ID,第二列是Knumber,没有注释到的为空,注意一个基因可能同时注释到多个KO。

  • mapper-one-line格式:类似上面的mapper格式,不一样的是一个基因如果有多个KO则以制表符分割放在同一行。
(base) [yut@node01 TestData]$ less GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper-one-line_format.tsv
GLGNCOCG_00294  K07304
GLGNCOCG_00295  K01924
GLGNCOCG_00296  K01468
GLGNCOCG_00297  K13990  K00603  K01500
GLGNCOCG_00298
GLGNCOCG_00299
GLGNCOCG_00300
GLGNCOCG_00301
GLGNCOCG_00302
(base) [yut@node01 TestData]$ awk 'NF>2' GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper-one-line_format.tsv|cat -A
GLGNCOCG_00069^IK13693^IK01079$
GLGNCOCG_00108^IK01417^IK20276$
GLGNCOCG_00297^IK13990^IK00603^IK01500$
GLGNCOCG_00486^IK03668^IK06079$
GLGNCOCG_00489^IK06079^IK03668$
GLGNCOCG_00776^IK00027^IK00029$
GLGNCOCG_00998^IK25153^IK25151$

常用命令

参数:evalue:1e-5(the ORFs were searched against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [48] using KofamScan version 1.2.0 (E-value < 10−5, score>predefined thresholds by KofamScan) [49]. 参考https://www.nature.com/articles/s41396-023-01546-2,https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-023-01621-y)
每行一个基因,一个基因多个KO则多行输出,不输出注释不到的基因,

(base) [yut@node01 TestData]$  time exec_annotation -p ~/Database/KoFamDB20231025/profiles -k ~/Database/KoFamDB20231025/ko_list --cpu 10  -E 1e-5 -f mapper --no-report-unannotated  -o GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_annot_format_1e-                 5.tsv  GCA_002343985.1_ASM234398v1_genomic.faa &

tmp目录

  • 不写 --tmp-dir则默认在当前目录创建tmp目录,其下生成tabular目录,里面是所有KO的HMM信息。
  • 注意:如果使用parallel同时并行多个文件时,一定要为每个指定一个不同名称的目录,否则则会互相影响,输出不完整。
(base) [yut@node01 TestData]$ ls tmp
tabular
(base) [yut@node01 TestData]$ ll tmp/tabular/|head
总用量 51M
-rw-r--r-- 1 yut lzdx 2.4K 1116 11:17 K00001
-rw-r--r-- 1 yut lzdx 1.2K 1116 11:17 K00002
-rw-r--r-- 1 yut lzdx 2.3K 1116 11:17 K00003
-rw-r--r-- 1 yut lzdx 2.9K 1116 11:17 K00004
-rw-r--r-- 1 yut lzdx 1.4K 1116 11:17 K00005
-rw-r--r-- 1 yut lzdx 1.2K 1116 11:17 K00006
-rw-r--r-- 1 yut lzdx  997 1116 11:17 K00007
-rw-r--r-- 1 yut lzdx 2.8K 1116 11:17 K00008
-rw-r--r-- 1 yut lzdx 1.2K 1116 11:17 K00009
(base) [yut@node01 TestData]$ head tmp/tabular/K00001
#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
GLGNCOCG_00164       -          K00001               -            1.3e-52  177.0   0.4   1.5e-52  176.8   0.4   1.0   1   0   0   1   1   1   1 NADP-dependent alcohol dehydrogenase C
GLGNCOCG_00550       -          K00001               -            8.8e-48  161.1   1.0   9.8e-48  160.9   1.0   1.0   1   0   0   1   1   1   1 Alcohol dehydrogenase
GLGNCOCG_02093       -          K00001               -            1.1e-05   22.2   3.7   1.3e-05   21.9   3.7   1.1   1   0   0   1   1   1   1 NAD(P) transhydrogenase subunit alpha part 1
GLGNCOCG_00799       -          K00001               -            0.00063   16.4   1.5      0.16    8.4   0.3   2.2   2   0   0   2   2   2   2 Glutamate synthase [NADPH] small chain
GLGNCOCG_01761       -          K00001               -             0.0011   15.6   0.9    0.0019   14.8   0.3   1.6   2   0   0   2   2   2   1 Alanine dehydrogenase 2
GLGNCOCG_01061       -          K00001               -             0.0035   13.9   0.0    0.0057   13.2   0.0   1.4   1   0   0   1   1   1   1 UDP-N-acetyl-D-glucosamine 6-dehydrogenase
GLGNCOCG_01051       -          K00001               -             0.0053   13.3   0.3    0.0077   12.8   0.3   1.2   1   0   0   1   1   1   1 Glutamate dehydrogenase

与emapper结果比较

## emapper-2.1.9,evalue:default, 0.001;eggnog_5.02_database/eggnog_proteins.dmnd 
## /apps/anaconda3/envs/eggnog-mapper-2.1.9/bin/emapper.py -i GCA_002343985.1_ASM234398v1_genomic.faa --cpu 20 -o emapper_out --override --temp_dir .
(base) [yut@node01 TestData]$ awk -F"\t"  'NR>1{print $1,$12}' emapper_out.emapper.annotations|grep ko|grep -v "#"|wc -l
1232

## kofamscan: evalue 1e-3
(base) [yut@node01 TestData]$  time exec_annotation -p ~/Database/KoFamDB20231025/profiles -k ~/Database/KoFamDB20231025/ko_list --cpu 10  -E 1e-3 -f mapper --no-report-unannotated  -o GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_annot_format_1e-3.tsv  GCA_002343985.1_ASM234398v1_genomic.faa &

(base) [yut@node01 TestData]$ wc -l GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_annot_format_1e-3.tsv
1218 GCA_002343985.1_ASM234398v1_genomic.faa_kofamscan_mapper_annot_format_1e-3.tsv

emapper在1e-3的阈值下要比kofamscan在1e-5多16个基因的KO注释,然而当降低kofamscan的evalue到1e-3也仅多了两个基因的KO。
在这里插入图片描述

  • 对另外多个细菌基因组注释结果统计
    在这里插入图片描述
    在这里插入图片描述

  • 85个基因组两种注释的基因数量(包含KO)差异

  • 85个基因组两种注释的KO数量差异

其他参数

Usage: exec_annotation [options] <query>
  <query>                    FASTA formatted query sequence file
  -o <file>                  File to output the result  [stdout]
  -p, --profile <path>       Profile HMM database
  -k, --ko-list <file>       KO information file
  --cpu <num>                Number of CPU to use  [1]
  -c, --config <file>        Config file
  --tmp-dir <dir>            Temporary directory  [./tmp] # 不写的话默认是当前目录下创建tmp
  -E, --e-value <e_value>    Largest E-value required of the hits
  -T, --threshold-scale <scale>
                             The score thresholds will be multiplied by this value
  -f, --format <format>      Format of the output [detail]
      detail:          Detail for each hits (including hits below threshold) # 按照hit展示,一个基因可能有多个hits,默认格式。显示基因名称、分配的 K 编号、KO 的阈值、hmmsearch 分数和 E 值以及 KO 的定义。此外,如果得分高于阈值,则在行首添加星号 "*"。
      detail-tsv:      Tab separeted values for detail format # 制表符分隔与上面一样
      mapper:          KEGG Mapper compatible format # 两列,基因与KO
      mapper-one-line: Similar to mapper, but all hit KOs are listed in one line # 一个基因的多个hits在一行
  --[no-]report-unannotated  Sequence name will be shown even if no KOs are assigned
                             Default is true when format=mapper or mapper-all,
                             false when format=detail
  --create-alignment         Create domain annotation files for each sequence
                             They will be located in the tmp directory
                             Incompatible with -r
  -r, --reannotate           Skip hmmsearch
                             Incompatible with --create-alignment
  --keep-tabular             Neither create tabular.txt nor delete K number files
                             By default, all K number files will be combined into
                             a tabular.txt and delete them
  --keep-output              Neither create output.txt nor delete K number files
                             By default, all K number files will be combined into
                             a output.txt and delete them
                             Must be with --create-alignment
  -h, --help                 Show this message and exit

参考

Kofamscan
KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold

    IF: 5.8 Q1 B3