Text

Text objects contain text that can be processed using Chinese Text Analyser’s segmentation engine.

They have the following functions:

Text( text )

Creates a new Text object from a string. This can be used to process and segment generic text.

Example

local cta = require 'cta'
local text = cta.Text( '皮特满意地看了一眼洪钧：“显然陈总裁接受了你的建议。对了，我们的那两家老对手怎么样？”' )

for word in text:words() do
    -- ... process each word here
end

Text:words( includePunctuation )

Returns an iterator that iterates over every word in the text.

The parameter includePunctuation is optional and defaults to false.

When it is false, only words of type Chinese, Alpha and Number are returned.

Each iteration returns the following items:

  • word (string) - the word
  • wordType (int) - the type of the word as an integer constant; each word type listed below has its own distinct integer value.
  • wordType (string) - the type of the word as a string. Valid values are:
    • None - Indicative of an error
    • Invalid - Invalid utf8 text
    • Chinese - A word made up of Chinese text
    • Alpha - A word made up of letters from the English alphabet
    • Number - A word made up of Arabic numerals
    • Whitespace - A block of whitespace (spaces, tabs, newlines etc).
    • ChinesePunctuation - Chinese punctuation
    • AsciiPunctuation - Standard ascii (English) punctuation.

Usage note: You should use Text:words() over Text:wordList() if you want to iterate over all words in the text in the order they appear, and Text:wordList() over Text:words() if you just want to test if a word exists in a sentence.

Example

local cta = require 'cta'
local text = cta.Text( '我又不是故意的。再说咱又不是贼，干吗放着正道不走非走房梁啊？' )

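-- typeId is the integer word type, typeName is the same type as a string
for word, typeId, typeName in text:words() do
    print( word, typeId, typeName )
end

print( '---' )

-- passing true also returns punctuation
for word, typeId, typeName in text:words( true ) do
    print( word, typeId, typeName )
end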

Output

我   3       Chinese
又   3       Chinese
不是  3       Chinese
故意  3       Chinese
的   3       Chinese
再说  3       Chinese
咱   3       Chinese
又   3       Chinese
不是  3       Chinese
贼   3       Chinese
干吗  3       Chinese
放着  3       Chinese
正道  3       Chinese
不走  3       Chinese
非   3       Chinese
走   3       Chinese
房梁  3       Chinese
啊   3       Chinese
---
我   3       Chinese
又   3       Chinese
不是  3       Chinese
故意  3       Chinese
的   3       Chinese
。   7       ChinesePunctuation
再说  3       Chinese
咱   3       Chinese
又   3       Chinese
不是  3       Chinese
贼   3       Chinese
，   7       ChinesePunctuation
干吗  3       Chinese
放着  3       Chinese
正道  3       Chinese
不走  3       Chinese
非   3       Chinese
走   3       Chinese
房梁  3       Chinese
啊   3       Chinese
？   7       ChinesePunctuation
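
Because Text:words() yields every word in order, repeats included, it can be used to count word frequencies. The following is a minimal sketch rather than part of the cta module: wordFrequencies is a hypothetical helper, and it relies only on the first value returned by each iteration.

```lua
-- Hypothetical helper (not part of the cta module): tally how many times
-- each word occurs in a Text object, using the Text:words() iterator.
local function wordFrequencies( text )
    local counts = {}
    for word in text:words() do
        counts[word] = ( counts[word] or 0 ) + 1
    end
    return counts
end

-- Usage inside Chinese Text Analyser:
--   local cta = require 'cta'
--   local counts = wordFrequencies( cta.Text( '我又不是故意的。再说咱又不是贼' ) )
--   print( counts['不是'] )
```

Words that never occur have no entry in the returned table, so looking them up gives nil rather than 0.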

Text:wordList()

Returns a WordList object containing all the unique words in the Text object as determined by Chinese Text Analyser’s segmentation engine.

Usage note: You should use Text:wordList() over Text:words() when you don’t care about the order or frequency of words in a sentence and simply want to be able to test if a word exists in the text.

Internally, Text:wordList() calls the equivalent of Text:words() and then adds each word to a WordList optimised for a small number of words and returns that WordList. Text:wordList() always ignores punctuation.

Example

local cta = require 'cta'
local text = cta.Text( '我又不是故意的。再说咱又不是贼，干吗放着正道不走非走房梁啊？' )

for word in text:wordList():words() do
    print( word )
end

Output



放着
不是

故意





正道
干吗
房梁
不走
再说

Text:characters( includePunctuation )

Similar to Text:words() except that it iterates over characters instead of words.
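
As a rough illustration, the sketch below tallies characters by type name. It assumes that Text:characters() yields the same three values per iteration as Text:words() (the character, the integer type and the type name); characterTypeCounts is a hypothetical helper, not part of the cta module.

```lua
-- Hypothetical helper: count characters per type name.
-- Assumption: Text:characters() yields ( character, typeId, typeName ),
-- mirroring what Text:words() returns.
local function characterTypeCounts( text, includePunctuation )
    local counts = {}
    for _, _, typeName in text:characters( includePunctuation ) do
        counts[typeName] = ( counts[typeName] or 0 ) + 1
    end
    return counts
end

-- Usage inside Chinese Text Analyser:
--   local cta = require 'cta'
--   local counts = characterTypeCounts( cta.Text( '我又不是故意的。' ), true )
--   for typeName, count in pairs( counts ) do print( typeName, count ) end
```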

Text:characterList()

Similar to Text:wordList() except that the returned value is a WordList containing only the unique characters that appear in the text.
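
A possible use is collecting the distinct characters of a text into a plain Lua array. This sketch assumes the returned WordList can be iterated with words(), as in the Text:wordList() example; uniqueCharacters is a hypothetical helper name.

```lua
-- Hypothetical helper: return the unique characters of a Text object
-- as a sorted Lua array.
-- Assumption: the WordList returned by Text:characterList() supports
-- the words() iterator shown in the Text:wordList() example.
local function uniqueCharacters( text )
    local chars = {}
    for char in text:characterList():words() do
        chars[#chars + 1] = char
    end
    table.sort( chars )  -- byte-wise order, used only to make the result stable
    return chars
end

-- Usage inside Chinese Text Analyser:
--   local cta = require 'cta'
--   local chars = uniqueCharacters( cta.Text( '我又不是故意的。' ) )
--   print( #chars, table.concat( chars, ' ' ) )
```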

Text:sentences()

Returns an iterator that iterates over all the sentences in the text.

Each sentence of the iteration is itself a Text object.

Sentences are delimited by Chinese or English full stops, exclamation marks and question marks, as well as by newlines.

Consecutive sentence delimiters at the end of a sentence will be included as part of that sentence, as will any closing brackets or quotation marks.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

for line in document:lines() do
    --line is a Text object
    for sentence in line:sentences() do
        print( sentence )
    end
end

-- this will print two sentences:
--    『眼前的麻烦已够多了，还管日后呢？』
--    胡一刀见她累得辛苦，也劝她歇歇。
local text = cta.Text( '『眼前的麻烦已够多了，还管日后呢？』胡一刀见她累得辛苦，也劝她歇歇。' )
for sentence in text:sentences() do
    print( sentence )
end
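
Since each sentence is a Text object, it can be converted with tostring() and filtered like any other string. The sketch below collects the sentences that contain a given piece of text; sentencesContaining is a hypothetical helper, and it does a plain substring search rather than a segmentation-aware match.

```lua
-- Hypothetical helper: return the raw text of every sentence that
-- contains needle as a substring.
local function sentencesContaining( text, needle )
    local matches = {}
    for sentence in text:sentences() do
        local raw = tostring( sentence )
        -- third argument true = plain search, no Lua pattern matching
        if raw:find( needle, 1, true ) then
            matches[#matches + 1] = raw
        end
    end
    return matches
end

-- Usage inside Chinese Text Analyser:
--   local cta = require 'cta'
--   local text = cta.Text( '我又不是故意的。再说咱又不是贼。' )
--   for _, s in ipairs( sentencesContaining( text, '故意' ) ) do print( s ) end
```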

Text:clozeText( word, replacement )

Returns a string with every occurrence of word in the text replaced by replacement.

The replacement parameter is optional and defaults to [...] if not specified.

The word parameter can be one of three types:

  • string - a single word, e.g. ‘好像’. If the word exists in the text it will be replaced by the contents of the replacement parameter.

  • table (array) - an array of words, e.g. { ‘好像’, ‘自己’ }. If any of the words in the array exist in the text they will each be replaced by the contents of the replacement parameter.

  • table (associative) - a mapping of words to replacement text e.g.

     ...
    
     local word = {}
     word['自己'] = '{{c1::%w}}'
     word['好像'] = '{{c2::%w}}'
     local cloze = text:clozeText( word )
    

    If any of the words in the word table exist in the text then they will be replaced by the associated replacement text. The replacement parameter should not be specified if this option is used.

The replacement parameter (if specified) can be a string, or an associative table whose keys match with the words in the word parameter, and whose values contain the replacement text for that word.

If word is an array and replacement is a table, and a word from the word array exists in the text but has no entry in the replacement table, then the default replacement text will be used.

The default replacement text is [...].

If replacement contains the text %w it will be replaced by the word being clozed, e.g. if word equals ‘好像’ and replacement equals ‘{{c1::%w}}’ then all instances of ‘好像’ in the text would be replaced with ‘{{c1::好像}}’.

Likewise, if replacement contains the text %n then it will be replaced by an increasing count of the words replaced, e.g. if word equals { ‘好像’, ‘自己’ } and replacement equals ‘{{c%n::%w}}’ and the text was

洪钧感觉自己所有的内脏器官好像都坠了下去

then the clozed text would be

洪钧感觉{{c1::自己}}所有的内脏器官{{c2::好像}}都坠了下去

If for whatever reason you want the replacement to contain the actual text %w or %n, use a double percent sign to escape it, e.g. %%w and %%n will produce replacement text of %w and %n respectively.
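
The replacement rules above can be modelled in plain Lua. This is only a sketch of the documented behaviour, not Chinese Text Analyser's actual implementation: templateFor and expandTemplate are hypothetical names, the count n is supplied by the caller, and treating every %% as an escaped percent sign is an assumption.

```lua
local DEFAULT_REPLACEMENT = '[...]'

-- Pick the replacement template for a word. replacement may be nil,
-- a string, or a table keyed by word; missing entries fall back to the default.
local function templateFor( word, replacement )
    if type( replacement ) == 'string' then
        return replacement
    elseif type( replacement ) == 'table' and replacement[word] ~= nil then
        return replacement[word]
    end
    return DEFAULT_REPLACEMENT
end

-- Expand %w (the word), %n (the count) and %% (a literal percent sign).
local function expandTemplate( template, word, n )
    return ( template:gsub( '%%(.)', function( c )
        if c == 'w' then return word end
        if c == 'n' then return tostring( n ) end
        if c == '%' then return '%' end
        -- any other character after % is left untouched (assumption)
    end ) )
end

print( expandTemplate( '{{c%n::%w}}', '好像', 2 ) )  --> {{c2::好像}}
print( expandTemplate( '100%%w', '好像', 1 ) )        --> 100%w
print( expandTemplate( templateFor( '好像', { ['自己'] = '{{c1::%w}}' } ), '好像', 1 ) )  --> [...]
```

The last call shows the fallback: 好像 has no entry in the replacement table, so the default text is used.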

Example

local cta = require 'cta'
local text = cta.Text( '洪钧感觉自己所有的内脏器官好像都坠了下去' )

local cloze = text:clozeText( '自己' )
print( cloze )  -- 洪钧感觉[...]所有的内脏器官好像都坠了下去

cloze = text:clozeText( {'自己', '好像' } )
print( cloze )  -- 洪钧感觉[...]所有的内脏器官[...]都坠了下去

cloze = text:clozeText( '好像', '{{c1::%w}}' )
print( cloze )  -- 洪钧感觉自己所有的内脏器官{{c1::好像}}都坠了下去

local replacement = {}
replacement['自己'] = '{{c1::%w}}'
replacement['好像'] = '{{c2::%w}}'
cloze = text:clozeText( replacement )
print( cloze )  -- 洪钧感觉{{c1::自己}}所有的内脏器官{{c2::好像}}都坠了下去

cloze = text:clozeText( { '自己', '好像' }, '{{c%n::%w}}' )
print( cloze )  -- 洪钧感觉{{c1::自己}}所有的内脏器官{{c2::好像}}都坠了下去

Standard functions

The standard Lua functions .. (concat) and tostring() are also available for Text objects.

  • .. (concat) will join a Text object with anything that can be converted to a string, and return a string. Note: it does not return a new Text object.
  • tostring() returns the raw text of the Text object as a string.
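
A short sketch of both functions in use follows. bracketed and withSuffix are hypothetical helpers, not part of the cta module, and the example only places the Text object on the left-hand side of the concat operator.

```lua
-- Hypothetical helper: wrap the raw text of a Text object in corner brackets.
local function bracketed( text )
    return '「' .. tostring( text ) .. '」'
end

-- Hypothetical helper: append a suffix using the Text object's concat support.
local function withSuffix( text, suffix )
    local joined = text .. suffix
    -- per the note above, the result is a plain string, not a Text object
    assert( type( joined ) == 'string' )
    return joined
end

-- Usage inside Chinese Text Analyser:
--   local cta = require 'cta'
--   local text = cta.Text( '胡一刀见她累得辛苦' )
--   print( bracketed( text ) )
--   print( withSuffix( text, '。' ) )
```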