Kagome v2

Kagome is an open source Japanese morphological analyzer written in pure Go. It can tokenize Japanese text into words and analyze parts of speech, with dictionaries embedded in the binary for easy deployment.

Key features (Improvements from v1):

- Self-contained binaries with embedded dictionaries (MeCab-IPADIC, UniDic)
- Multiple segmentation modes for different use cases
- RESTful API server mode for production use
- WebAssembly support for browser environments

Basic Usage

Command line

% kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   sentence - tiny sentence splitter
   version - show version

tokenize [-file input_file] [-dict dic_file] [-userdict user_dic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)] [-split] [-json]
  -dict string
    	dict
  -file string
    	input file
  -json
    	outputs in JSON format
  -mode string
    	tokenize mode (normal|search|extended) (default "normal")
  -simple
    	display abbreviated dictionary contents
  -split
    	use tiny sentence splitter
  -sysdict string
    	system dict type (ipa|uni) (default "ipa")
  -udict string
    	user dict

% # piped standard input
% echo "すもももももももものうち" | kagome
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

For more details, see the Commands section.

As a Go library

# Install Kagome module
go get github.com/ikawaha/kagome/v2

package main

import (
  "fmt"
  "strings"

  "github.com/ikawaha/kagome-dict/ipa"
  "github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
  t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
  if err != nil {
    panic(err)
  }
  // wakati (simple word splitting/segmentation)
  fmt.Println("---wakati---")
  seg := t.Wakati("すもももももももものうち")
  fmt.Println(seg)

  // tokenize w/ morphological analysis
  fmt.Println("---tokenize---")
  tokens := t.Tokenize("すもももももももものうち")
  for _, token := range tokens {
    features := strings.Join(token.Features(), ",")
    fmt.Printf("%s\t%v\n", token.Surface, features)
  }
}

output:

---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

For more examples, see:
- examples directory
- GoDoc

Install

To get the kagome command line tool, choose your preferred installation method below:

Go (recommended)

go install github.com/ikawaha/kagome/v2@latest

Homebrew

# macOS and Linux (for both AMD64 and Arm64)
brew install ikawaha/kagome/kagome

Manual Install

- For manual installation, download and extract the appropriate archive for your OS and architecture from the releases page.
- Place the extracted binary in an accessible directory and make sure it has execute permission.

Docker/Docker Compose

- See the Docker section below.

Commands

The major subcommands of the kagome command line tool are described below.

Tokenize command

% # interactive/REPL mode
% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

% # piped standard input
% echo "すもももももももものうち" | kagome
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

% # JSON output
% # (For jq command see https://jqlang.org/)
% echo "猫" | kagome -json | jq .
[
  {
    "id": 286994,
    "start": 0,
    "end": 1,
    "surface": "猫",
    "class": "KNOWN",
    "pos": [
      "名詞",
      "一般",
      "*",
      "*"
    ],
    "base_form": "猫",
    "reading": "ネコ",
    "pronunciation": "ネコ",
    "features": [
      "名詞",
      "一般",
      "*",
      "*",
      "*",
      "*",
      "猫",
      "ネコ",
      "ネコ"
    ]
  }
]

% # word splitting/segmentation only (equivalent to "wakati" functionality)
% echo "すもももももももものうち" | kagome -json | jq -r '[.[].surface] | join("/")'
すもも/も/もも/も/もも/の/うち

% # Extract only pronunciations using jq (for Text-to-Speech purposes, etc.)
% echo "私ははにわよわわわんわん" | kagome -json | jq -r '.[].pronunciation'
ワタシ
ワ
ハニワ
ヨ
ワ
ワ
ワンワン

Server command

For continuous usage, kagome provides a server mode, so the tokenizer's startup cost is paid once rather than on every invocation.

RESTful API

Start a server and try to access the "/tokenize" endpoint.

% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .

Web App

Start a server and access http://localhost:6060 in your browser.

% kagome server &

Important

The demo web application uses Graphviz to draw the lattice, so Graphviz must be installed on your system.

Tip

Kagome can be compiled to WebAssembly (wasm) and run locally in a web browser as well. For details, see the WebAssembly section.

Wasm Demo: https://ikawaha.github.io/kagome/

Lattice command

A debugging tool that outputs the lattice built during tokenization in Graphviz dot format.

% kagome lattice 私は鰻 | dot -Tpng -o lattice.png

Sentence command

Split long text into sentences:

% echo "吾輩は猫である。名前はまだ無い。" | kagome sentence
吾輩は猫である。
名前はまだ無い。

This command is useful if a single line of data is too lengthy, and you want to avoid errors such as bufio.Scanner: token too long.

% echo "吾輩は猫である。名前はまだ無い。" | kagome -json | jq -r '[.[].surface] | join("/")'
吾輩/は/猫/で/ある/。/名前/は/まだ/無い/。

% echo "吾輩は猫である。名前はまだ無い。" | kagome sentence | kagome -json | jq -r '[.[].surface] | join("/")'
吾輩/は/猫/で/ある/。
名前/は/まだ/無い/。

This command is equivalent to the -split option of the tokenize command.

% echo "吾輩は猫である。名前はまだ無い。" | kagome -split -json | jq -r '[.[].surface] | join("/")'
吾輩/は/猫/で/ある/。
名前/は/まだ/無い/。

Dictionaries

The following dictionaries are supported by default.

dict	source	package
MeCab IPADIC	mecab-ipadic-2.7.0-20070801	github.com/ikawaha/kagome-dict/ipa
UniDIC	unidic-mecab-2.1.2_src	github.com/ikawaha/kagome-dict/uni

Experimental Features

dict	source	package
mecab-ipadic-NEologd	mecab-ipadic-neologd	github.com/ikawaha/kagome-ipa-neologd
Korean MeCab	mecab-ko-dic-2.1.1-20180720	github.com/ikawaha/kagome-dict-ko

Note

For more details and differences between the dictionaries, see the wiki.

Segmentation modes

Similar to Kuromoji, Kagome also supports various segmentation modes (splitting strategies) to tokenize the input text.

- Normal: Regular segmentation
- Search: Uses a heuristic to perform additional segmentation that is useful for search purposes
- Extended: Similar to search mode, but also splits unknown words into uni-grams

Untokenized	Normal	Search	Extended
関西国際空港	関西国際空港	関西　国際　空港	関西　国際　空港
日本経済新聞	日本経済新聞	日本　経済　新聞	日本　経済　新聞
シニアソフトウェアエンジニア	シニアソフトウェアエンジニア	シニア　ソフトウェア　エンジニア	シニア　ソフトウェア　エンジニア
デジカメを買った	デジカメ　を　買っ　た	デジカメ　を　買っ　た	デ　ジ　カ　メ　を　買っ　た

Note

If you are tokenizing for search, try changing the mode before switching to another dictionary.

Docker

We provide scratch-based Docker images that run the kagome command line tool on several architectures: AMD64, Arm64, and Arm32 (Arm v5, v6, and v7).

Pull the image

docker pull ikawaha/kagome:latest

# Alternatively, you can pull from GitHub Container Registry
docker pull ghcr.io/ikawaha/kagome:latest

Run the command via Docker

# Interactive/REPL mode
docker run --rm -it ikawaha/kagome:latest

# If pulling from GitHub Container Registry
docker run --rm -it ghcr.io/ikawaha/kagome:latest

Run the server via Docker

# Server mode (http://localhost:6060)
docker run --rm -p 6060:6060 ikawaha/kagome:latest server

# If pulling from GitHub Container Registry
docker run --rm -p 6060:6060 ghcr.io/ikawaha/kagome:latest server

docker-compose.yml example

services:
  kagome:
    image: ikawaha/kagome:latest
    ports: ["6060:6060"]
    command: server
    restart: unless-stopped

Note: The base image doesn't include Graphviz. For lattice visualization, see the examples.

WebAssembly

Kagome compiles to WebAssembly for browser use.

Live demo: https://ikawaha.github.io/kagome/
Source code: ./_examples/wasm

Reference

Detailed Reference Manual in Japanese:
Community Wiki in English:
- https://github.com/ikawaha/kagome/wiki

License

MIT
