janome package

Submodules

janome.dic module

class janome.dic.CompiledUserDictionary(dic_dir, connections)[source]

Bases: RAMDictionary

User dictionary class (compiled)

__init__(dic_dir, connections)[source]
classmethod load_dict(dic_dir)[source]
lookup(s)[source]
class janome.dic.Dictionary[source]

Bases: ABC

Base dictionary class

abstract get_trans_cost(id1, id2)[source]
abstract lookup(s, matcher)[source]
abstract lookup_extra(num)[source]
exception janome.dic.LoadingDictionaryError[source]

Bases: Exception

__init__()[source]
class janome.dic.MMapDictionary(entries_compact, entries_extra, open_files, connections)[source]

Bases: Dictionary

MMap dictionary class

__init__(entries_compact, entries_extra, open_files, connections)[source]
get_trans_cost(id1, id2)[source]
lookup(s, matcher)[source]
lookup_extra(idx)[source]
class janome.dic.RAMDictionary(entries, connections)[source]

Bases: Dictionary

RAM dictionary class

__init__(entries, connections)[source]
get_trans_cost(id1, id2)[source]
lookup(s, matcher)[source]
lookup_extra(num)[source]
class janome.dic.UnknownsDictionary(chardefs, unknowns)[source]

Bases: object

Dictionary class for handling unknown words

__init__(chardefs, unknowns)[source]
get_char_categories(c)[source]
unknown_grouping(cate)[source]
unknown_invoked_always(cate)[source]
unknown_length(cate)[source]
class janome.dic.UserDictionary(user_dict, enc, type, connections, progress_handler=None)[source]

Bases: RAMDictionary

User dictionary class (on-the-fly)

__init__(user_dict, enc, type, connections, progress_handler=None)[source]

Initialize user defined dictionary object.

Parameters:
  • user_dict – user dictionary file (CSV format)

  • enc – character encoding

  • type – user dictionary type. supported types are ‘ipadic’ and ‘simpledic’

  • connections – connection cost matrix. expected value is SYS_DIC.connections

  • progress_handler – (Optional) handler mainly to indicate progress; an implementation of ProgressHandler
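
The sketch below constructs a user dictionary directly. It assumes the connection cost matrix can be taken from the system dictionary singleton (documented under janome.system_dic below) via a connections attribute; the parameter description above only refers to it as SYS_DIC.connections, so the attribute name is an assumption.

>>> from janome.dic import UserDictionary
>>> from janome.system_dic import SystemDictionary
>>> sys_dic = SystemDictionary.instance()
>>> # assumed: the loaded system dictionary exposes its connection cost matrix as `connections`
>>> user_dict = UserDictionary("user_ipadic.csv", "utf8", "ipadic", sys_dic.connections)

In everyday use the Tokenizer builds this object for you when given a CSV path (see the tokenizer module examples below).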

classmethod build_dic(user_dict, enc, dict_type, progress_handler)[source]
classmethod line_to_entry_ipadic(line)[source]

Convert an IPADIC-formatted string to a user dictionary entry

classmethod line_to_entry_simpledic(line)[source]

Convert a simpledic-formatted string to a user dictionary entry

lookup(s)[source]
save(to_dir, compressionlevel=9)[source]

Save compressed compiled dictionary data.

Parameters:
  • to_dir – directory to save dictionary data

  • compressionlevel – (Optional) gzip compression level. default is 9
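
Continuing the sketch above, a compiled user dictionary can be saved to disk and later handed to a Tokenizer as a directory path (the target directory here is arbitrary):

>>> from janome.tokenizer import Tokenizer
>>> user_dict.save('/tmp/userdic')
>>> t = Tokenizer('/tmp/userdic')   # a directory path loads compiled dictionary data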

janome.dic.end_save_entries(dir, bucket_idx)[source]
janome.dic.save_chardefs(chardefs, dir='.')[source]
janome.dic.save_connections(connections, dir='.')[source]
janome.dic.save_entry(dir, bucket_idx, morph_id, entry)[source]
janome.dic.save_entry_buckets(dir, buckets)[source]
janome.dic.save_fstdata(data, dir, part=0)[source]
janome.dic.save_unknowns(unknowns, dir='.')[source]
janome.dic.start_save_entries(dir, bucket_idx, morph_offset)[source]

janome.system_dic module

class janome.system_dic.MMapSystemDictionary(mmap_entries, connections, chardefs, unknowns)[source]

Bases: MMapDictionary, UnknownsDictionary

MMap System dictionary class

__init__(mmap_entries, connections, chardefs, unknowns)[source]
classmethod instance()[source]
class janome.system_dic.SystemDictionary(entries, connections, chardefs, unknowns)[source]

Bases: RAMDictionary, UnknownsDictionary

System dictionary class

__init__(entries, connections, chardefs, unknowns)[source]
classmethod instance()[source]
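
Both system dictionary variants are obtained through instance(). A small sketch follows; the concrete return values depend on the bundled dictionary data:

>>> from janome.system_dic import SystemDictionary
>>> sys_dic = SystemDictionary.instance()
>>> cost = sys_dic.get_trans_cost(0, 0)        # transition (connection) cost between two ids in the connection matrix
>>> cats = sys_dic.get_char_categories('ア')    # character categories, inherited from UnknownsDictionary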

janome.tokenizer module

The tokenizer module supplies Token and Tokenizer classes.

Usage:

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize('すもももももももものうち'):
...   print(token)
...
すもも     名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も       助詞,係助詞,*,*,*,*,も,モ,モ
もも      名詞,一般,*,*,*,*,もも,モモ,モモ
も       助詞,係助詞,*,*,*,*,も,モ,モ
もも      名詞,一般,*,*,*,*,もも,モモ,モモ
の       助詞,連体化,*,*,*,*,の,ノ,ノ
うち      名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

with wakati (‘分かち書き’) mode:

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize('すもももももももものうち', wakati=True):
...   print(token)
...
すもも
も
もも
も
もも
の
うち

with user dictionary (IPAdic format):

$ cat examples/user_ipadic.csv
東京スカイツリー,1288,1288,4569,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー
東武スカイツリーライン,1288,1288,4700,名詞,固有名詞,一般,*,*,*,東武スカイツリーライン,トウブスカイツリーライン,トウブスカイツリーライン
とうきょうスカイツリー駅,1288,1288,4143,名詞,固有名詞,一般,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,トウキョウスカイツリーエキ
>>> t = Tokenizer("user_ipadic.csv", udic_enc="utf8")
>>> for token in t.tokenize('東京スカイツリーへのお越しは、東武スカイツリーライン「とうきょうスカイツリー駅」が便利です。'):
...   print(token)
...
東京スカイツリー        名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー
へ       助詞,格助詞,一般,*,*,*,へ,ヘ,エ
の       助詞,連体化,*,*,*,*,の,ノ,ノ
お越し     名詞,一般,*,*,*,*,お越し,オコシ,オコシ
は       助詞,係助詞,*,*,*,*,は,ハ,ワ
、       記号,読点,*,*,*,*,、,、,、
東武スカイツリーライン     名詞,固有名詞,一般,*,*,*,東武スカイツリーライン,トウブスカイツリーライン,トウブスカイツリーライン
「       記号,括弧開,*,*,*,*,「,「,「
とうきょうスカイツリー駅    名詞,固有名詞,一般,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,トウキョウスカイツリーエキ
」       記号,括弧閉,*,*,*,*,」,」,」
が       助詞,格助詞,一般,*,*,*,が,ガ,ガ
便利      名詞,形容動詞語幹,*,*,*,*,便利,ベンリ,ベンリ
です      助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。       記号,句点,*,*,*,*,。,。,。

with user dictionary (simplified format):

$ cat examples/user_simpledic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
>>> t = Tokenizer("user_simpledic.csv", udic_type="simpledic", udic_enc="utf8")
>>> for token in t.tokenize('東京スカイツリーへのお越しは、東武スカイツリーライン「とうきょうスカイツリー駅」が便利です。'):
...   print(token)
class janome.tokenizer.Token(node: Node, extra: Tuple | None = None)[source]

Bases: object

A Token object contains all information for a token.

__init__(node: Node, extra: Tuple | None = None)[source]
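
A Token exposes its fields as attributes; the attribute names listed for ExtractAttributeFilter below (‘surface’, ‘part_of_speech’, ‘infl_type’, ‘infl_form’, ‘base_form’, ‘reading’, ‘phonetic’) apply here. A short sketch; the analysis of the standalone word follows the IPAdic output shown in the usage example above:

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> token = list(t.tokenize('すもも'))[0]
>>> token.surface
'すもも'
>>> token.part_of_speech
'名詞,一般,*,*'
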
class janome.tokenizer.Tokenizer(udic: str = '', *, udic_enc: str = 'utf8', udic_type: str = 'ipadic', max_unknown_length: int = 1024, wakati: bool = False, mmap: bool = True, dotfile: str = '')[source]

Bases: object

A Tokenizer tokenizes Japanese texts with the system dictionary and an optional user-defined dictionary.

CHUNK_SIZE = 500
MAX_CHUNK_SIZE = 1024
__init__(udic: str = '', *, udic_enc: str = 'utf8', udic_type: str = 'ipadic', max_unknown_length: int = 1024, wakati: bool = False, mmap: bool = True, dotfile: str = '')[source]

Initialize Tokenizer object with optional arguments.

Parameters:
  • udic – (Optional) user dictionary file (CSV format) or directory path to compiled dictionary data

  • udic_enc – (Optional) character encoding for user dictionary. default is ‘utf8’

  • udic_type – (Optional) user dictionary type. supported types are ‘ipadic’ and ‘simpledic’. default is ‘ipadic’

  • max_unknown_length – (Optional) max unknown word length. default is 1024.

  • wakati – (Optional) if given True, loads minimal system dictionary data for ‘wakati’ mode (a short sketch follows this list).

  • mmap – (Optional) if given False, memory-mapped file mode is disabled. Set this option to False on environments that do not support mmap. Default is True on 64-bit architectures; otherwise False.
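
A sketch of the memory-saving wakati mode mentioned above; wakati=True is also passed at call time so the output does not rely on whether the constructor flag alone switches modes (an assumption this page does not settle):

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer(wakati=True)
>>> list(t.tokenize('すもももももももものうち', wakati=True))
['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']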

tokenize(text: str, *, wakati: bool = False, baseform_unk: bool = True, dotfile: str = '') Iterator[Token | str][source]

Tokenize the input text.

Parameters:
  • text – unicode string to be tokenized

  • wakati – (Optional) if given True, returns surface forms only. default is False.

  • baseform_unk – (Optional) if given True sets base_form attribute for unknown tokens. default is True.

  • dotfile – (Optional) if specified, a graphviz dot file is output to the path for later visualization of the lattice graph. This option is ignored when the input length is larger than MAX_CHUNK_SIZE.

Returns:

generator yielding tokens (wakati=False) or generator yielding strings (wakati=True)
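
A short sketch of the dotfile option (the output path is arbitrary; rendering the file requires Graphviz, which is outside janome):

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> tokens = list(t.tokenize('すもももももももものうち', dotfile='lattice.dot'))
>>> # lattice.dot can now be rendered externally, e.g. with `dot -Tpng lattice.dot -o lattice.png`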

exception janome.tokenizer.WakatiModeOnlyException[source]

Bases: Exception

janome.analyzer module

The analyzer module supplies the Analyzer framework for pre-processing and post-processing of morphological analysis.

Added in version 0.3.4

NOTE: This is experimental. The class/method interfaces may change in future releases.

Usage:

>>> from janome.tokenizer import Tokenizer
>>> from janome.analyzer import Analyzer
>>> from janome.charfilter import *
>>> from janome.tokenfilter import *
>>> text = '蛇の目はPure Pythonな形態素解析器です。'
>>> char_filters = [UnicodeNormalizeCharFilter(), RegexReplaceCharFilter('蛇の目', 'janome')]
>>> tokenizer = Tokenizer()
>>> token_filters = [CompoundNounFilter(), POSStopFilter(['記号','助詞']), LowerCaseFilter()]
>>> a = Analyzer(char_filters=char_filters, tokenizer=tokenizer, token_filters=token_filters)
>>> for token in a.analyze(text):
...     print(token)
...
janome  名詞,固有名詞,組織,*,*,*,*,*,*
pure    名詞,固有名詞,組織,*,*,*,*,*,*
python  名詞,一般,*,*,*,*,*,*,*
な       助動詞,*,*,*,特殊・ダ,体言接続,だ,ナ,ナ
形態素解析器  名詞,複合,*,*,*,*,形態素解析器,ケイタイソカイセキキ,ケイタイソカイセキキ
です      助動詞,*,*,*,特殊・デス,基本形,です,デス,デス

Usage (word count with TokenCountFilter):

>>> from janome.tokenizer import Tokenizer
>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import *
>>> text = 'すもももももももものうち'
>>> token_filters = [POSKeepFilter(['名詞']), TokenCountFilter()]
>>> a = Analyzer(token_filters=token_filters)
>>> for k, v in a.analyze(text):
...   print('%s: %d' % (k, v))
...
もも: 2
すもも: 1
うち: 1
class janome.analyzer.Analyzer(*, char_filters: List[CharFilter] = [], tokenizer: Tokenizer | None = None, token_filters: List[TokenFilter] = [])[source]

Bases: object

An Analyzer analyzes Japanese texts with customized CharFilter chain, Tokenizer and TokenFilter chain.

Added in version 0.3.4

__init__(*, char_filters: List[CharFilter] = [], tokenizer: Tokenizer | None = None, token_filters: List[TokenFilter] = [])[source]

Initialize Analyzer object with CharFilters, a Tokenizer and TokenFilters.

Parameters:
  • char_filters – (Optional) CharFilters list. CharFilters are applied to the input text in the list order. default is the empty list.

  • tokenizer – (Optional) A Tokenizer object. Tokenizer tokenizes the text modified by char_filters. default is Tokenizer initialized with no extra options. WARNING: A Tokenizer initialized with wakati=True option is not accepted.

  • token_filters – (Optional) TokenFilters list. TokenFilters are applied to the Tokenizer’s output in the list order. default is the empty list.

analyze(text: str) Iterator[Any][source]

Analyze the input text with custom CharFilters, Tokenizer and TokenFilters.

Parameters:

text – unicode string to be tokenized

Returns:

token generator. emitted element type depends on the output of the last TokenFilter. (e.g., ExtractAttributeFilter emits strings.)

janome.charfilter module

class janome.charfilter.CharFilter[source]

Bases: ABC

Base CharFilter class.

A CharFilter modifies or transforms the input text according to the rule described in its apply() method. Subclasses must implement the apply() method.

Added in version 0.3.4

abstract apply(text: str) str[source]
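
Because CharFilter is an ABC with a single abstract method, a custom filter only needs to implement apply(). A hypothetical sketch (the class name and behavior are illustrative, not part of janome):

>>> from janome.charfilter import CharFilter
>>> class StripCharFilter(CharFilter):
...     """Hypothetical filter that trims surrounding whitespace."""
...     def apply(self, text: str) -> str:
...         return text.strip()
...
>>> StripCharFilter().apply('  蛇の目 ')
'蛇の目'
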
class janome.charfilter.RegexReplaceCharFilter(pat: str, repl: str)[source]

Bases: CharFilter

RegexReplaceCharFilter replaces substrings matching a regular expression pattern with the replacement string.

Added in version 0.3.4

__init__(pat: str, repl: str)[source]

Initialize RegexReplaceCharFilter with a regular expression pattern string and replacement.

Parameters:
  • pat – regular expression pattern string.

  • repl – replacement string.

apply(text: str) str[source]
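
For example, the filter used in the Analyzer usage above can also be applied to a string directly:

>>> from janome.charfilter import RegexReplaceCharFilter
>>> RegexReplaceCharFilter('蛇の目', 'janome').apply('蛇の目は形態素解析器です。')
'janomeは形態素解析器です。'
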
class janome.charfilter.UnicodeNormalizeCharFilter(form: str = 'NFKC')[source]

Bases: CharFilter

UnicodeNormalizeCharFilter normalizes Unicode string.

Added in version 0.3.4

__init__(form: str = 'NFKC')[source]

Initialize UnicodeNormalizeCharFilter with normalization form.

See also unicodedata.normalize

Parameters:

form – (Optional) normalization form. valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. default is ‘NFKC’

apply(text: str) str[source]
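
For example, the default NFKC form folds full-width characters into their ASCII equivalents:

>>> from janome.charfilter import UnicodeNormalizeCharFilter
>>> UnicodeNormalizeCharFilter().apply('Ｐｙｔｈｏｎ')
'Python'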

janome.tokenfilter module

class janome.tokenfilter.CompoundNounFilter[source]

Bases: TokenFilter

A CompoundNounFilter generates compound nouns.

This filter joins contiguous nouns. For example, ‘形態素解析器’ is split into three noun tokens ‘形態素/解析/器’ by the Tokenizer and then re-joined by this filter. Generated tokens are associated with the special part-of-speech tag ‘名詞,複合,*,*’.

Added in version 0.3.4

apply(tokens: Iterator[Token]) Iterator[Token][source]
class janome.tokenfilter.ExtractAttributeFilter(att: str)[source]

Bases: TokenFilter

An ExtractAttributeFilter extracts a specified attribute of Token.

NOTE: This filter must be placed at the end of the token filter chain because its return values are not tokens but strings.

Added in version 0.3.4

__init__(att: str)[source]

Initialize ExtractAttributeFilter object.

Parameters:

att – attribute name to be extracted from a token. valid values for att are ‘surface’, ‘part_of_speech’, ‘infl_type’, ‘infl_form’, ‘base_form’, ‘reading’ and ‘phonetic’.

apply(tokens: Iterator[Token]) Iterator[str][source]
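
Placed at the end of an Analyzer's filter chain, it turns the output into plain strings. A sketch reusing classes documented on this page:

>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import POSKeepFilter, ExtractAttributeFilter
>>> a = Analyzer(token_filters=[POSKeepFilter(['名詞']), ExtractAttributeFilter('surface')])
>>> list(a.analyze('すもももももももものうち'))
['すもも', 'もも', 'もも', 'うち']
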
class janome.tokenfilter.LowerCaseFilter[source]

Bases: TokenFilter

A LowerCaseFilter converts the surface and base_form of tokens to lowercase.

Added in version 0.3.4

apply(tokens: Iterator[Token]) Iterator[Token][source]
class janome.tokenfilter.POSKeepFilter(pos_list: List[str])[source]

Bases: TokenFilter

A POSKeepFilter keeps tokens associated with part-of-speech tags listed in the keep tags list and removes other tokens.

The tag matching rule is prefix matching; e.g., if ‘動詞’ is given as a keep tag, tokens tagged ‘動詞,自立,*,*’, ‘動詞,非自立,*,*’, and so on are kept.

Added in version 0.3.4

__init__(pos_list: List[str])[source]

Initialize POSKeepFilter object.

Parameters:

pos_list – keep part-of-speech tags list.

apply(tokens: Iterator[Token]) Iterator[Token][source]
class janome.tokenfilter.POSStopFilter(pos_list: List[str])[source]

Bases: TokenFilter

A POSStopFilter removes tokens associated with part-of-speech tags listed in the stop tags list and keeps other tokens.

The tag matching rule is prefix matching; e.g., if ‘動詞’ is given as a stop tag, tokens tagged ‘動詞,自立,*,*’, ‘動詞,非自立,*,*’, and so on are removed.

Added in version 0.3.4

__init__(pos_list: List[str])[source]

Initialize POSStopFilter object.

Parameters:

pos_list – stop part-of-speech tags list.

apply(tokens: Iterator[Token]) Iterator[Token][source]
class janome.tokenfilter.TokenCountFilter(att: str = 'surface', sorted: bool = False)[source]

Bases: TokenFilter

A TokenCountFilter counts word frequencies in the input text. Here, ‘word’ means an attribute of Token.

This filter generates word-frequency pairs. When the sorted option is set to True, pairs are sorted in descending order of frequency.

NOTE: This filter must be placed at the end of the token filter chain because its return values are not tokens but string-integer tuples.

Added in version 0.3.5

__init__(att: str = 'surface', sorted: bool = False)[source]

Initialize TokenCountFilter object.

Parameters:
  • att – attribute name to be extracted from a token. valid values for att are ‘surface’, ‘part_of_speech’, ‘infl_type’, ‘infl_form’, ‘base_form’, ‘reading’ and ‘phonetic’.

  • sorted – (Optional) if given True, sort items in descending order of term frequency. default is False.

apply(tokens: Iterator[Token]) Iterator[Tuple[str, int]][source]
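
With sorted=True the pairs are emitted most frequent first; the counts below mirror the word-count example in the analyzer module above (the relative order of equal counts is an implementation detail):

>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import POSKeepFilter, TokenCountFilter
>>> a = Analyzer(token_filters=[POSKeepFilter(['名詞']), TokenCountFilter(sorted=True)])
>>> list(a.analyze('すもももももももものうち'))
[('もも', 2), ('すもも', 1), ('うち', 1)]
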
class janome.tokenfilter.TokenFilter[source]

Bases: ABC

Base TokenFilter class.

A TokenFilter modifies or transforms the input token sequence according to the rule described in its apply() method. Subclasses must implement the apply() method.

Added in version 0.3.4

abstract apply(tokens: Iterator[Token]) Iterator[Any][source]
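
As with CharFilter, a custom TokenFilter only needs to implement apply(). A hypothetical sketch (the class name and the two-character rule are illustrative only):

>>> from typing import Iterator
>>> from janome.tokenizer import Token
>>> from janome.tokenfilter import TokenFilter
>>> class LongWordFilter(TokenFilter):
...     """Hypothetical filter keeping tokens whose surface has at least two characters."""
...     def apply(self, tokens: Iterator[Token]) -> Iterator[Token]:
...         return (t for t in tokens if len(t.surface) >= 2)
...
>>> from janome.analyzer import Analyzer
>>> a = Analyzer(token_filters=[LongWordFilter()])
>>> [t.surface for t in a.analyze('すもももももももものうち')]
['すもも', 'もも', 'もも', 'うち']
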
class janome.tokenfilter.UpperCaseFilter[source]

Bases: TokenFilter

An UpperCaseFilter converts the surface and base_form of tokens to uppercase.

Added in version 0.3.4

apply(tokens: Iterator[Token]) Iterator[Token][source]
class janome.tokenfilter.WordKeepFilter(keep_words: List[str])[source]

Bases: TokenFilter

A WordKeepFilter keeps tokens whose surface form is listed in the keep words list.

Added in version 0.5.0

__init__(keep_words: List[str]) None[source]

Initialize WordKeepFilter object.

Parameters:

keep_words – keep words list.

apply(tokens: Iterator[Token]) Iterator[Token][source]
class janome.tokenfilter.WordStopFilter(stop_words: List[str])[source]

Bases: TokenFilter

A WordStopFilter removes tokens whose surface form is listed in the stop words list.

Added in version 0.5.0

__init__(stop_words: List[str])[source]

Initialize WordStopFilter object.

Parameters:

stop_words – stop words list.

apply(tokens: Iterator[Token]) Iterator[Token][source]
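
A short sketch of the word-list filters above (the stop word list is arbitrary):

>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import WordStopFilter
>>> a = Analyzer(token_filters=[WordStopFilter(['も', 'の'])])
>>> [t.surface for t in a.analyze('すもももももももものうち')]
['すもも', 'もも', 'もも', 'うち']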