janome package

Submodules

janome.dic module

class janome.dic.CompiledUserDictionary(dic_dir, connections)[source]

Bases: janome.dic.Dictionary

User dictionary class (compiled)

__init__(dic_dir, connections)[source]
load_dict(dic_dir)[source]
class janome.dic.Dictionary(compiledFST, entries, connections)[source]

Bases: object

Base dictionary class

__init__(compiledFST, entries, connections)[source]
get_trans_cost(id1, id2)[source]
lookup(s)[source]
lookup_extra(num)[source]
exception janome.dic.LoadingDictionaryError[source]

Bases: Exception

__init__()[source]
class janome.dic.MMapDictionary(compiledFST, entries_compact, entries_extra, open_files, connections)[source]

Bases: object

Base MMap dictionary class

__init__(compiledFST, entries_compact, entries_extra, open_files, connections)[source]
get_trans_cost(id1, id2)[source]
lookup(s)[source]
lookup_extra(idx)[source]
class janome.dic.MMapSystemDictionary(mmap_entries, connections, chardefs, unknowns)[source]

Bases: janome.dic.MMapDictionary, janome.dic.UnknownsDictionary

MMap System dictionary class

__init__(mmap_entries, connections, chardefs, unknowns)[source]
class janome.dic.SystemDictionary(entries, connections, chardefs, unknowns)[source]

Bases: janome.dic.Dictionary, janome.dic.UnknownsDictionary

System dictionary class

__init__(entries, connections, chardefs, unknowns)[source]
class janome.dic.UnknownsDictionary(chardefs, unknowns)[source]

Bases: object

__init__(chardefs, unknowns)[source]
get_char_categories[source]
unknown_grouping(cate)[source]
unknown_invoked_always(cate)[source]
unknown_length(cate)[source]
class janome.dic.UserDictionary(user_dict, enc, type, connections)[source]

Bases: janome.dic.Dictionary

User dictionary class (uncompiled)

__init__(user_dict, enc, type, connections)[source]

Initialize user defined dictionary object.

Parameters:
  • user_dict – user dictionary file (CSV format)
  • enc – character encoding
  • type – user dictionary type. supported types are ‘ipadic’ and ‘simpledic’
  • connections – connection cost matrix. expected value is SYS_DIC.connections

See also

See http://mocobeta.github.io/janome/en/#use-with-user-defined-dictionary for details on the user dictionary.
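
A minimal construction sketch follows. Note the import path for the connection cost matrix is an assumption; the parameter description above refers to it as SYS_DIC.connections.

from janome.dic import UserDictionary
# assumed import path for the system connection cost matrix
# (referred to as SYS_DIC.connections in the parameter description above)
from janome.sysdic import connections

user_dict = UserDictionary("user_ipadic.csv", "utf8", "ipadic", connections)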

buildipadic(user_dict, enc)[source]
buildsimpledic(user_dict, enc)[source]
save(to_dir, compressionlevel=9)[source]

Save compressed compiled dictionary data.

Parameters:
  • to_dir – directory to save dictionary data
  • compressionlevel – (Optional) gzip compression level. default is 9
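
Continuing the sketch above, compiled data written by save() can be loaded back through a Tokenizer, whose udic parameter also accepts a compiled-dictionary directory:

user_dict.save("/tmp/userdic")  # writes compressed, compiled dictionary data

from janome.tokenizer import Tokenizer
t = Tokenizer("/tmp/userdic")   # loaded as a compiled user dictionary
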
janome.dic.end_save_entries(dir, bucket_num)[source]
janome.dic.load_all_fstdata()[source]
janome.dic.save_chardefs(chardefs, dir='.')[source]
janome.dic.save_connections(connections, dir='.')[source]
janome.dic.save_entry(dir, bucket_idx, morph_id, entry)[source]
janome.dic.save_entry_buckets(dir, buckets)[source]
janome.dic.save_fstdata(data, dir, suffix='')[source]
janome.dic.save_unknowns(unknowns, dir='.')[source]
janome.dic.start_save_entries(dir, bucket_num)[source]

janome.tokenizer module

The tokenizer module supplies Token and Tokenizer classes.

Usage:

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize(u'すもももももももものうち'):
...   print(token)
... 
すもも     名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も       助詞,係助詞,*,*,*,*,も,モ,モ
もも      名詞,一般,*,*,*,*,もも,モモ,モモ
も       助詞,係助詞,*,*,*,*,も,モ,モ
もも      名詞,一般,*,*,*,*,もも,モモ,モモ
の       助詞,連体化,*,*,*,*,の,ノ,ノ
うち      名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

with wakati (‘分かち書き’) mode:

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize(u'すもももももももものうち', wakati=True):
...   print(token)
...
すもも
も
もも
も
もも
の
うち

with user dictionary (IPAdic format):

$ cat examples/user_ipadic.csv 
東京スカイツリー,1288,1288,4569,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー
東武スカイツリーライン,1288,1288,4700,名詞,固有名詞,一般,*,*,*,東武スカイツリーライン,トウブスカイツリーライン,トウブスカイツリーライン
とうきょうスカイツリー駅,1288,1288,4143,名詞,固有名詞,一般,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,トウキョウスカイツリーエキ
>>> t = Tokenizer("user_ipadic.csv", udic_enc="utf8")
>>> for token in t.tokenize(u'東京スカイツリーへのお越しは、東武スカイツリーライン「とうきょうスカイツリー駅」が便利です。'):
...   print(token)
... 
東京スカイツリー        名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー
へ       助詞,格助詞,一般,*,*,*,へ,ヘ,エ
の       助詞,連体化,*,*,*,*,の,ノ,ノ
お越し     名詞,一般,*,*,*,*,お越し,オコシ,オコシ
は       助詞,係助詞,*,*,*,*,は,ハ,ワ
、       記号,読点,*,*,*,*,、,、,、
東武スカイツリーライン     名詞,固有名詞,一般,*,*,*,東武スカイツリーライン,トウブスカイツリーライン,トウブスカイツリーライン
「       記号,括弧開,*,*,*,*,「,「,「
とうきょうスカイツリー駅    名詞,固有名詞,一般,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,トウキョウスカイツリーエキ
」       記号,括弧閉,*,*,*,*,」,」,」
が       助詞,格助詞,一般,*,*,*,が,ガ,ガ
便利      名詞,形容動詞語幹,*,*,*,*,便利,ベンリ,ベンリ
です      助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。       記号,句点,*,*,*,*,。,。,。

with user dictionary (simplified format):

$ cat examples/user_simpledic.csv 
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
>>> t = Tokenizer("user_simpledic.csv", udic_type="simpledic", udic_enc="utf8")
>>> for token in t.tokenize(u'東京スカイツリーへのお越しは、東武スカイツリーライン「とうきょうスカイツリー駅」が便利です。'):
...   print(token)
class janome.tokenizer.Token(node, extra=None)[source]

Bases: object

A Token object contains all information for a token.

__init__(node, extra=None)[source]
base_form = None

base form (基本形)

infl_form = None

inflection form (活用形)

infl_type = None

inflection type (活用型)

part_of_speech = None

part of speech (品詞)

phonetic = None

pronunciation (発音)

reading = None

reading (読み)

surface = None

surface form (表層形)
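
The attributes above can be read directly from tokenize() results, for example:

from janome.tokenizer import Tokenizer

t = Tokenizer()
for token in t.tokenize(u'走った'):
    # each attribute mirrors one column of the default string output
    print(token.surface, token.part_of_speech, token.base_form, token.reading)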

class janome.tokenizer.Tokenizer(udic='', udic_enc='utf8', udic_type='ipadic', max_unknown_length=1024, wakati=False, mmap=False)[source]

Bases: object

A Tokenizer tokenizes Japanese text with the system dictionary and an optional user-defined dictionary. It is strongly recommended to re-use a Tokenizer object because the initialization cost is high.

CHUNK_SIZE = 500
MAX_CHUNK_SIZE = 1000
__init__(udic='', udic_enc='utf8', udic_type='ipadic', max_unknown_length=1024, wakati=False, mmap=False)[source]

Initialize Tokenizer object with optional arguments.

Parameters:
  • udic – (Optional) user dictionary file (CSV format) or directory path to compiled dictionary data
  • udic_enc – (Optional) character encoding for the user dictionary. default is ‘utf8’
  • udic_type – (Optional) user dictionary type. supported types are ‘ipadic’ and ‘simpledic’. default is ‘ipadic’
  • max_unknown_length – (Optional) max unknown word length. default is 1024.
  • wakati – (Optional) if given True, loads the minimal system dictionary data needed for ‘wakati’ mode.
  • mmap – (Optional) if given True, uses memory-mapped files for dictionary data.

See also

See http://mocobeta.github.io/janome/en/#use-with-user-defined-dictionary for details on the user dictionary.
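
For example, with the options documented above:

from janome.tokenizer import Tokenizer

t = Tokenizer()                                         # system dictionary only
t_user = Tokenizer("user_ipadic.csv", udic_enc="utf8")  # with a user dictionary
t_wakati = Tokenizer(wakati=True)                       # minimal data; wakati mode only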

tokenize(text, stream=False, wakati=False, baseform_unk=True)[source]

Tokenize the input text.

Parameters:
  • text – unicode string to be tokenized
  • stream – (Optional) if given True, uses stream mode. default is False.
  • wakati – (Optional) if given True, returns surface forms only. default is False.
  • baseform_unk – (Optional) if given True, sets the base_form attribute for unknown tokens. default is True.
Returns:

list of tokens (stream=False, wakati=False), token generator (stream=True, wakati=False), list of strings (stream=False, wakati=True), or string generator (stream=True, wakati=True)
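
A short sketch of these combinations:

from janome.tokenizer import Tokenizer

t = Tokenizer()
text = u'すもももももももものうち'
tokens = t.tokenize(text)                    # list of Token
surfaces = t.tokenize(text, wakati=True)     # list of strings
for token in t.tokenize(text, stream=True):  # token generator; lazy, for long texts
    print(token)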

exception janome.tokenizer.WakatiModeOnlyException[source]

Bases: Exception

janome.analyzer module

The analyzer module supplies the Analyzer framework for pre-processing and post-processing around morphological analysis.

Added in version 0.3.4

NOTE: This is experimental. The class/method interfaces may be modified in future releases.

Usage:

>>> from janome.tokenizer import Tokenizer
>>> from janome.analyzer import Analyzer
>>> from janome.charfilter import *
>>> from janome.tokenfilter import *
>>> text = u'蛇の目はPure Pythonな形態素解析器です。'
>>> char_filters = [UnicodeNormalizeCharFilter(), RegexReplaceCharFilter(u'蛇の目', u'janome')]
>>> tokenizer = Tokenizer()
>>> token_filters = [CompoundNounFilter(), POSStopFilter(['記号','助詞']), LowerCaseFilter()]
>>> a = Analyzer(char_filters, tokenizer, token_filters)
>>> for token in a.analyze(text):
...     print(token)
... 
janome  名詞,固有名詞,組織,*,*,*,*,*,*
pure    名詞,固有名詞,組織,*,*,*,*,*,*
python  名詞,一般,*,*,*,*,*,*,*
な       助動詞,*,*,*,特殊・ダ,体言接続,だ,ナ,ナ
形態素解析器  名詞,複合,*,*,*,*,形態素解析器,ケイタイソカイセキキ,ケイタイソカイセキキ
です      助動詞,*,*,*,特殊・デス,基本形,です,デス,デス

Usage (word count with TokenCountFilter):

>>> from janome.tokenizer import Tokenizer
>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import *
>>> text = u'すもももももももものうち'
>>> token_filters = [POSKeepFilter('名詞'), TokenCountFilter()]
>>> a = Analyzer(token_filters=token_filters)
>>> for k, v in a.analyze(text):
...   print('%s: %d' % (k, v))
...
もも: 2
すもも: 1
うち: 1
class janome.analyzer.Analyzer(char_filters=[], tokenizer=None, token_filters=[])[source]

Bases: object

An Analyzer analyzes Japanese text with a customized CharFilter chain, a Tokenizer, and a TokenFilter chain.

Added in version 0.3.4

__init__(char_filters=[], tokenizer=None, token_filters=[])[source]

Initialize Analyzer object with CharFilters, a Tokenizer and TokenFilters.

Parameters:
  • char_filters – (Optional) CharFilters list. CharFilters are applied to the input text in the list order. default is the empty list.
  • tokenizer – (Optional) A Tokenizer object. Tokenizer tokenizes the text modified by char_filters. default is Tokenizer initialized with no extra options. WARNING: A Tokenizer initialized with wakati=True option is not accepted.
  • token_filters – (Optional) TokenFilters list. TokenFilters are applied to the Tokenizer’s output in the list order. default is the empty list.
analyze(text)[source]

Analyze the input text with custom CharFilters, Tokenizer and TokenFilters.

Parameters:text – unicode string to be tokenized
Returns:token generator. The emitted element type depends on the output of the last TokenFilter (e.g., ExtractAttributeFilter emits strings).
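
For example, ending the chain with ExtractAttributeFilter makes analyze() emit strings:

from janome.analyzer import Analyzer
from janome.tokenfilter import POSKeepFilter, ExtractAttributeFilter

a = Analyzer(token_filters=[POSKeepFilter(['名詞']), ExtractAttributeFilter('surface')])
for surface in a.analyze(u'すもももももももものうち'):
    print(surface)  # すもも, もも, もも, うち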

janome.charfilter module

class janome.charfilter.CharFilter[source]

Bases: object

Base CharFilter class.

A CharFilter modifies or transforms the input text according to the rule described in its apply() method. Subclasses must implement apply().

Added in version 0.3.4

apply(text)[source]
filter(text)[source]
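
A minimal hypothetical subclass, following the contract above:

from janome.charfilter import CharFilter

# hypothetical example: strip leading/trailing whitespace from the input text
class StripCharFilter(CharFilter):
    def apply(self, text):
        return text.strip()
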
class janome.charfilter.RegexReplaceCharFilter(pat, repl)[source]

Bases: janome.charfilter.CharFilter

RegexReplaceCharFilter replaces substrings matched by a regular expression pattern with the given replacement string.

Added in version 0.3.4

__init__(pat, repl)[source]

Initialize RegexReplaceCharFilter with a regular expression pattern string and replacement.

Parameters:
  • pat – regular expression pattern string.
  • repl – replacement string.
apply(text)[source]
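
For example:

from janome.charfilter import RegexReplaceCharFilter

f = RegexReplaceCharFilter(u'蛇の目', u'janome')
print(f.apply(u'蛇の目は形態素解析器です。'))  # janomeは形態素解析器です。
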
class janome.charfilter.UnicodeNormalizeCharFilter(form='NFKC')[source]

Bases: janome.charfilter.CharFilter

UnicodeNormalizeCharFilter normalizes Unicode strings.

Added in version 0.3.4

__init__(form='NFKC')[source]

Initialize UnicodeNormalizeCharFilter with normalization form.

See also unicodedata.normalize for details.

Parameters:form – (Optional) normalization form. valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. default is ‘NFKC’
apply(text)[source]
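
For example, NFKC normalization folds fullwidth characters to their ASCII forms:

from janome.charfilter import UnicodeNormalizeCharFilter

f = UnicodeNormalizeCharFilter('NFKC')
print(f.apply(u'Ｐｙｔｈｏｎ'))  # Python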

janome.tokenfilter module

class janome.tokenfilter.CompoundNounFilter[source]

Bases: janome.tokenfilter.TokenFilter

A CompoundNounFilter generates compound nouns.

This filter joins contiguous nouns. For example, ‘形態素解析器’ is split into three noun tokens ‘形態素/解析/器’ by the Tokenizer and then re-joined by this filter. Generated tokens are tagged with the special part-of-speech ‘名詞,複合,*,*’.

Added in version 0.3.4

apply(tokens)[source]
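
For example:

from janome.analyzer import Analyzer
from janome.tokenfilter import CompoundNounFilter

a = Analyzer(token_filters=[CompoundNounFilter()])
for token in a.analyze(u'形態素解析器で遊ぶ'):
    print(token.surface, token.part_of_speech)
# 形態素解析器 is emitted as a single token tagged 名詞,複合,*,*
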
class janome.tokenfilter.ExtractAttributeFilter(att)[source]

Bases: janome.tokenfilter.TokenFilter

An ExtractAttributeFilter extracts a specified attribute of Token.

NOTE: This filter must be placed at the end of the token filter chain because its return values are not tokens but strings.

Added in version 0.3.4

__init__(att)[source]

Initialize ExtractAttributeFilter object.

Parameters:att – attribute name to be extracted from a token. valid values for att are ‘surface’, ‘part_of_speech’, ‘infl_type’, ‘infl_form’, ‘base_form’, ‘reading’ and ‘phonetic’.
apply(tokens)[source]
class janome.tokenfilter.LowerCaseFilter[source]

Bases: janome.tokenfilter.TokenFilter

A LowerCaseFilter converts the surface form of tokens to lowercase.

Added in version 0.3.4

apply(tokens)[source]
class janome.tokenfilter.POSKeepFilter(pos_list)[source]

Bases: janome.tokenfilter.TokenFilter

A POSKeepFilter keeps tokens associated with part-of-speech tags listed in the keep tags list and removes other tokens.

Tag matching is prefix matching. e.g., if ‘動詞’ is given as a keep tag, tokens tagged ‘動詞,自立,*,*’, ‘動詞,非自立,*,*’, and so on are kept. See the combined example after the POSStopFilter entry below.

Added in version 0.3.4

__init__(pos_list)[source]

Initialize POSKeepFilter object.

Parameters:pos_list – keep part-of-speech tags list.
apply(tokens)[source]
class janome.tokenfilter.POSStopFilter(pos_list)[source]

Bases: janome.tokenfilter.TokenFilter

A POSStopFilter removes tokens associated with part-of-speech tags listed in the stop tags list and keeps other tokens.

Tag matching is prefix matching. e.g., if ‘動詞’ is given as a stop tag, tokens tagged ‘動詞,自立,*,*’, ‘動詞,非自立,*,*’, and so on are removed.

Added in version 0.3.4

__init__(pos_list)[source]

Initialize POSStopFilter object.

Parameters:pos_list – stop part-of-speech tags list.
apply(tokens)[source]
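
A combined sketch of both filters. For this text, keeping nouns and stopping particles happen to select the same tokens:

from janome.analyzer import Analyzer
from janome.tokenfilter import POSKeepFilter, POSStopFilter

text = u'すもももももももものうち'
keep = Analyzer(token_filters=[POSKeepFilter(['名詞'])])  # keep nouns only
stop = Analyzer(token_filters=[POSStopFilter(['助詞'])])  # remove particles
print([t.surface for t in keep.analyze(text)])  # ['すもも', 'もも', 'もも', 'うち']
print([t.surface for t in stop.analyze(text)])  # ['すもも', 'もも', 'もも', 'うち']
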
class janome.tokenfilter.TokenCountFilter(att='surface')[source]

Bases: janome.tokenfilter.TokenFilter

A TokenCountFilter counts word frequencies in the input text. Here, ‘word’ means an attribute of Token.

This filter generates word-frequency pairs sorted in descending order of frequency.

NOTE: This filter must be placed at the end of the token filter chain because its return values are not tokens but string-integer tuples.

Added in version 0.3.5

__init__(att='surface')[source]

Initialize TokenCountFilter object.

Parameters:att – attribute name to be extracted from a token. valid values for att are ‘surface’, ‘part_of_speech’, ‘infl_type’, ‘infl_form’, ‘base_form’, ‘reading’ and ‘phonetic’.
apply(tokens)[source]
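
For example, counting base forms groups inflected variants of the same word. A sketch; the exact segmentation depends on the dictionary:

from janome.analyzer import Analyzer
from janome.tokenfilter import TokenCountFilter

a = Analyzer(token_filters=[TokenCountFilter(att='base_form')])
for word, count in a.analyze(u'走って走って、走った'):
    print('%s: %d' % (word, count))
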
class janome.tokenfilter.TokenFilter[source]

Bases: object

Base TokenFilter class.

A TokenFilter modifies or transforms the input token sequence according to the rule described in its apply() method. Subclasses must implement apply().

Added in version 0.3.4

apply(tokens)[source]
filter(tokens)[source]
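
A minimal hypothetical subclass, following the contract above (filters receive and yield a token stream):

from janome.tokenfilter import TokenFilter

# hypothetical example: drop tokens whose surface form is a single character
class MinLengthFilter(TokenFilter):
    def apply(self, tokens):
        for token in tokens:
            if len(token.surface) >= 2:
                yield token
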
class janome.tokenfilter.UpperCaseFilter[source]

Bases: janome.tokenfilter.TokenFilter

An UpperCaseFilter converts the surface form of tokens to uppercase.

Added in version 0.3.4

apply(tokens)[source]