janome package¶
Submodules¶
janome.dic module¶
- class janome.dic.CompiledUserDictionary(dic_dir, connections)[source]¶
Bases:
RAMDictionary
User dictionary class (compiled)
- class janome.dic.MMapDictionary(entries_compact, entries_extra, open_files, connections)[source]¶
Bases:
Dictionary
MMap dictionary class
- class janome.dic.RAMDictionary(entries, connections)[source]¶
Bases:
Dictionary
RAM dictionary class
- class janome.dic.UnknownsDictionary(chardefs, unknowns)[source]¶
Bases:
object
Dictionary class for handling unknown words
- class janome.dic.UserDictionary(user_dict, enc, type, connections, progress_handler=None)[source]¶
Bases:
RAMDictionary
User dictionary class (on-the-fly)
- __init__(user_dict, enc, type, connections, progress_handler=None)[source]¶
Initialize user defined dictionary object.
- Parameters:
user_dict – user dictionary file (CSV format)
enc – character encoding
type – user dictionary type. supported types are ‘ipadic’ and ‘simpledic’
connections – connection cost matrix. expected value is SYS_DIC.connections
progress_handler – handler mainly to indicate progress; an implementation of ProgressHandler
- classmethod line_to_entry_ipadic(line)[source]¶
Convert an IPADIC-formatted string to a user dictionary entry
janome.system_dic module¶
- class janome.system_dic.MMapSystemDictionary(mmap_entries, connections, chardefs, unknowns)[source]¶
Bases:
MMapDictionary, UnknownsDictionary
MMap System dictionary class
- class janome.system_dic.SystemDictionary(entries, connections, chardefs, unknowns)[source]¶
Bases:
RAMDictionary, UnknownsDictionary
System dictionary class
janome.tokenizer module¶
The tokenizer module supplies Token and Tokenizer classes.
Usage:
>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize('すもももももももものうち'):
... print(token)
...
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
with wakati (‘分かち書き’) mode:
>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize('すもももももももものうち', wakati=True):
... print(token)
...
すもも
も
もも
も
もも
の
うち
with user dictionary (IPAdic format):
$ cat examples/user_ipadic.csv
東京スカイツリー,1288,1288,4569,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー
東武スカイツリーライン,1288,1288,4700,名詞,固有名詞,一般,*,*,*,東武スカイツリーライン,トウブスカイツリーライン,トウブスカイツリーライン
とうきょうスカイツリー駅,1288,1288,4143,名詞,固有名詞,一般,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,トウキョウスカイツリーエキ
>>> t = Tokenizer("user_ipadic.csv", udic_enc="utf8")
>>> for token in t.tokenize('東京スカイツリーへのお越しは、東武スカイツリーライン「とうきょうスカイツリー駅」が便利です。'):
... print(token)
...
東京スカイツリー 名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー
へ 助詞,格助詞,一般,*,*,*,へ,ヘ,エ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
お越し 名詞,一般,*,*,*,*,お越し,オコシ,オコシ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
、 記号,読点,*,*,*,*,、,、,、
東武スカイツリーライン 名詞,固有名詞,一般,*,*,*,東武スカイツリーライン,トウブスカイツリーライン,トウブスカイツリーライン
「 記号,括弧開,*,*,*,*,「,「,「
とうきょうスカイツリー駅 名詞,固有名詞,一般,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,トウキョウスカイツリーエキ
」 記号,括弧閉,*,*,*,*,」,」,」
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
便利 名詞,形容動詞語幹,*,*,*,*,便利,ベンリ,ベンリ
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。 記号,句点,*,*,*,*,。,。,。
with user dictionary (simplified format):
$ cat examples/user_simpledic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
>>> t = Tokenizer("user_simpledic.csv", udic_type="simpledic", udic_enc="utf8")
>>> for token in t.tokenize('東京スカイツリーへのお越しは、東武スカイツリーライン「とうきょうスカイツリー駅」が便利です。'):
... print(token)
- class janome.tokenizer.Token(node: Node, extra: Tuple | None = None)[source]¶
Bases:
object
A Token object contains all information for a token.
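A brief sketch of reading common Token attributes; the attribute names used below (surface, part_of_speech, base_form) are the same names accepted by TokenCountFilter's att parameter described later in this document:
>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize('すもももももももものうち'):
...     print(token.surface, token.part_of_speech, token.base_form)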
- class janome.tokenizer.Tokenizer(udic: str = '', *, udic_enc: str = 'utf8', udic_type: str = 'ipadic', max_unknown_length: int = 1024, wakati: bool = False, mmap: bool = True, dotfile: str = '')[source]¶
Bases:
object
A Tokenizer tokenizes Japanese texts with the system dictionary and an optional user-defined dictionary.
- CHUNK_SIZE = 500¶
- MAX_CHUNK_SIZE = 1024¶
- __init__(udic: str = '', *, udic_enc: str = 'utf8', udic_type: str = 'ipadic', max_unknown_length: int = 1024, wakati: bool = False, mmap: bool = True, dotfile: str = '')[source]¶
Initialize Tokenizer object with optional arguments.
- Parameters:
udic – (Optional) user dictionary file (CSV format) or directory path to compiled dictionary data
udic_enc – (Optional) character encoding for user dictionary. default is ‘utf-8’
udic_type – (Optional) user dictionary type. supported types are ‘ipadic’ and ‘simpledic’. default is ‘ipadic’
max_unknown_length – (Optional) max unknown word length. default is 1024.
wakati – (Optional) if given True, loads minimum sysdic data for ‘wakati’ mode.
mmap – (Optional) if given False, memory-mapped file mode is disabled. Set this option to False in environments that do not support mmap. Default is True on 64-bit architectures; otherwise False.
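For example, a minimal sketch of disabling memory-mapped dictionary loading on a platform without mmap support:
>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer(mmap=False)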
- tokenize(text: str, *, wakati: bool = False, baseform_unk: bool = True, dotfile: str = '') Iterator[Token | str] [source]¶
Tokenize the input text.
- Parameters:
text – unicode string to be tokenized
wakati – (Optional) if given True, returns surface forms only. default is False.
baseform_unk – (Optional) if given True, sets the base_form attribute for unknown tokens. default is True.
dotfile – (Optional) if specified, a Graphviz dot file is written to the given path for later visualization of the lattice graph. This option is ignored when the input length is larger than MAX_CHUNK_SIZE.
- Returns:
generator yielding Token objects (wakati=False) or generator yielding strings (wakati=True)
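A sketch of the dotfile option (the output path below is illustrative): the lattice for a short input is dumped for later visualization with Graphviz.
>>> t = Tokenizer()
>>> tokens = list(t.tokenize('すもも', dotfile='/tmp/lattice.dot'))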
janome.analyzer module¶
The analyzer module supplies the Analyzer framework for pre-processing and post-processing of morphological analysis.
Added in version 0.3.4
NOTE: This is experimental. The class/method interfaces may be modified in future releases.
Usage:
>>> from janome.tokenizer import Tokenizer
>>> from janome.analyzer import Analyzer
>>> from janome.charfilter import *
>>> from janome.tokenfilter import *
>>> text = '蛇の目はPure Pythonな形態素解析器です。'
>>> char_filters = [UnicodeNormalizeCharFilter(), RegexReplaceCharFilter('蛇の目', 'janome')]
>>> tokenizer = Tokenizer()
>>> token_filters = [CompoundNounFilter(), POSStopFilter(['記号','助詞']), LowerCaseFilter()]
>>> a = Analyzer(char_filters=char_filters, tokenizer=tokenizer, token_filters=token_filters)
>>> for token in a.analyze(text):
... print(token)
...
janome 名詞,固有名詞,組織,*,*,*,*,*,*
pure 名詞,固有名詞,組織,*,*,*,*,*,*
python 名詞,一般,*,*,*,*,*,*,*
な 助動詞,*,*,*,特殊・ダ,体言接続,だ,ナ,ナ
形態素解析器 名詞,複合,*,*,*,*,形態素解析器,ケイタイソカイセキキ,ケイタイソカイセキキ
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
Usage (word count with TokenCountFilter):
>>> from janome.tokenizer import Tokenizer
>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import *
>>> text = 'すもももももももものうち'
>>> token_filters = [POSKeepFilter(['名詞']), TokenCountFilter()]
>>> a = Analyzer(token_filters=token_filters)
>>> for k, v in a.analyze(text):
... print('%s: %d' % (k, v))
...
もも: 2
すもも: 1
うち: 1
- class janome.analyzer.Analyzer(*, char_filters: List[CharFilter] = [], tokenizer: Tokenizer | None = None, token_filters: List[TokenFilter] = [])[source]¶
Bases:
object
An Analyzer analyzes Japanese texts with a customized CharFilter chain, Tokenizer and TokenFilter chain.
Added in version 0.3.4
- __init__(*, char_filters: List[CharFilter] = [], tokenizer: Tokenizer | None = None, token_filters: List[TokenFilter] = [])[source]¶
Initialize Analyzer object with CharFilters, a Tokenizer and TokenFilters.
- Parameters:
char_filters – (Optional) CharFilters list. CharFilters are applied to the input text in the list order. default is the empty list.
tokenizer – (Optional) A Tokenizer object. Tokenizer tokenizes the text modified by char_filters. default is Tokenizer initialized with no extra options. WARNING: A Tokenizer initialized with wakati=True option is not accepted.
token_filters – (Optional) TokenFilters list. TokenFilters are applied to the Tokenizer’s output in the list order. default is the empty list.
- analyze(text: str) Iterator[Any] [source]¶
Analyze the input text with custom CharFilters, Tokenizer and TokenFilters.
- Parameters:
text – unicode string to be tokenized
- Returns:
token generator. The emitted element type depends on the output of the last TokenFilter (e.g., ExtractAttributeFilter emits strings).
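For instance, a sketch where the last TokenFilter is an ExtractAttributeFilter, so analyze() yields plain strings (here the base_form of each kept noun) rather than Token objects:
>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import POSKeepFilter, ExtractAttributeFilter
>>> a = Analyzer(token_filters=[POSKeepFilter(['名詞']), ExtractAttributeFilter('base_form')])
>>> for word in a.analyze('すもももももももものうち'):
...     print(word)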
janome.charfilter module¶
- class janome.charfilter.CharFilter[source]¶
Bases:
ABC
Base CharFilter class.
A CharFilter modifies or transforms the input text according to the rule described in its apply() method. Subclasses must implement apply().
Added in version 0.3.4
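A minimal sketch of a custom CharFilter, assuming apply() receives the input text and returns the transformed text; DigitMaskCharFilter is a hypothetical name used only for illustration:
>>> import re
>>> from janome.charfilter import CharFilter
>>> class DigitMaskCharFilter(CharFilter):
...     def apply(self, text: str) -> str:
...         # replace every run of digits with a single '0' (hypothetical example filter)
...         return re.sub(r'\d+', '0', text)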
- class janome.charfilter.RegexReplaceCharFilter(pat: str, repl: str)[source]¶
Bases:
CharFilter
RegexReplaceCharFilter replaces substrings matching a regular expression pattern with a replacement string.
Added in version 0.3.4
- class janome.charfilter.UnicodeNormalizeCharFilter(form: str = 'NFKC')[source]¶
Bases:
CharFilter
UnicodeNormalizeCharFilter normalizes a Unicode string.
Added in version 0.3.4
- __init__(form: str = 'NFKC')[source]¶
Initialize UnicodeNormalizeCharFilter with normalization form.
See also unicodedata.normalize
- Parameters:
form – (Optional) normalization form. valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. default is ‘NFKC’
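For example, a sketch that applies NFKC normalization directly through apply(), assuming apply() takes the input text and returns the normalized text, so full-width Latin characters become their ASCII equivalents:
>>> from janome.charfilter import UnicodeNormalizeCharFilter
>>> f = UnicodeNormalizeCharFilter('NFKC')
>>> print(f.apply('Ｐｙｔｈｏｎ　３'))
Python 3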
janome.tokenfilter module¶
- class janome.tokenfilter.CompoundNounFilter[source]¶
Bases:
TokenFilter
A CompoundNounFilter generates compound nouns.
This filter joins contiguous nouns. For example, ‘形態素解析器’ is split into the three noun tokens ‘形態素/解析/器’ by the Tokenizer and then re-joined by this filter. Generated tokens are associated with the special part-of-speech tag ‘名詞,複合,*,*’
Added in version 0.3.4
- class janome.tokenfilter.ExtractAttributeFilter(att: str)[source]¶
Bases:
TokenFilter
An ExtractAttributeFilter extracts a specified attribute of Token.
NOTE: This filter must be placed at the end of the token filter chain because its return values are not tokens but strings.
Added in version 0.3.4
- class janome.tokenfilter.LowerCaseFilter[source]¶
Bases:
TokenFilter
A LowerCaseFilter converts the surface and base_form of tokens to lowercase.
Added in version 0.3.4
- class janome.tokenfilter.POSKeepFilter(pos_list: List[str])[source]¶
Bases:
TokenFilter
A POSKeepFilter keeps tokens associated with part-of-speech tags listed in the keep tags list and removes other tokens.
The tag matching rule is prefix matching; e.g., if ‘動詞’ is given as a keep tag, tokens tagged ‘動詞,自立,*,*’, ‘動詞,非自立,*,*’, and so on are kept.
Added in version 0.3.4
- class janome.tokenfilter.POSStopFilter(pos_list: List[str])[source]¶
Bases:
TokenFilter
A POSStopFilter removes tokens associated with part-of-speech tags listed in the stop tags list and keeps other tokens.
The tag matching rule is prefix matching; e.g., if ‘動詞’ is given as a stop tag, tokens tagged ‘動詞,自立,*,*’, ‘動詞,非自立,*,*’, and so on are removed.
Added in version 0.3.4
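A sketch of the prefix-matching behaviour: with ‘助詞’ as a stop tag, all particle sub-categories (‘助詞,係助詞,*,*’, ‘助詞,連体化,*,*’, and so on) are removed from the output.
>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import POSStopFilter
>>> a = Analyzer(token_filters=[POSStopFilter(['助詞'])])
>>> for token in a.analyze('すもももももももものうち'):
...     print(token.surface, token.part_of_speech)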
- class janome.tokenfilter.TokenCountFilter(att: str = 'surface', sorted: bool = False)[source]¶
Bases:
TokenFilter
A TokenCountFilter counts word frequencies in the input text. Here, ‘word’ means an attribute of Token.
This filter generates word-frequency pairs. When the sorted option is set to True, pairs are sorted in descending order of frequency.
NOTE: This filter must be placed at the end of the token filter chain because its return values are not tokens but string-integer tuples.
Added in version 0.3.5
- __init__(att: str = 'surface', sorted: bool = False)[source]¶
Initialize TokenCountFilter object.
- Parameters:
att – attribute name to be extracted from a token. valid values for att are ‘surface’, ‘part_of_speech’, ‘infl_type’, ‘infl_form’, ‘base_form’, ‘reading’ and ‘phonetic’.
sorted – sort items by term frequency
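For instance, a sketch that counts the base_form of nouns and emits the pairs sorted by descending frequency:
>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import POSKeepFilter, TokenCountFilter
>>> token_filters = [POSKeepFilter(['名詞']), TokenCountFilter(att='base_form', sorted=True)]
>>> a = Analyzer(token_filters=token_filters)
>>> for word, count in a.analyze('すもももももももものうち'):
...     print('%s: %d' % (word, count))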
- class janome.tokenfilter.TokenFilter[source]¶
Bases:
ABC
Base TokenFilter class.
A TokenFilter modifies or transforms the input token sequence according to the rule described in its apply() method. Subclasses must implement apply().
Added in version 0.3.4
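A minimal sketch of a custom TokenFilter, assuming apply() receives the token stream as an iterator and yields (possibly modified) tokens; ShortTokenStopFilter is a hypothetical name used only for illustration:
>>> from janome.tokenfilter import TokenFilter
>>> class ShortTokenStopFilter(TokenFilter):
...     def __init__(self, min_length: int = 2):
...         self.min_length = min_length
...     def apply(self, tokens):
...         # drop tokens whose surface form is shorter than min_length (hypothetical example)
...         for token in tokens:
...             if len(token.surface) >= self.min_length:
...                 yield token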
- class janome.tokenfilter.UpperCaseFilter[source]¶
Bases:
TokenFilter
An UpperCaseFilter converts the surface and base_form of tokens to uppercase.
Added in version 0.3.4
- class janome.tokenfilter.WordKeepFilter(keep_words: List[str])[source]¶
Bases:
TokenFilter
A WordKeepFilter keeps tokens whose surface form is listed in the keep words list.
Added in version 0.5.0
- class janome.tokenfilter.WordStopFilter(stop_words: List[str])[source]¶
Bases:
TokenFilter
A WordStopFilter removes tokens whose surface form is listed in the stop words list.
Added in version 0.5.0
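For example, a sketch that removes the particles ‘も’ and ‘の’ by their surface forms:
>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import WordStopFilter
>>> a = Analyzer(token_filters=[WordStopFilter(['も', 'の'])])
>>> for token in a.analyze('すもももももももものうち'):
...     print(token.surface)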