Document

A Document object represents a file that is processed using Chinese Text Analyser’s segmentation engine. Document objects have the following functions:

Document( filename, options )

Returns a Document object for the file specified by filename. The parameter options is an optional table argument that can contain the following keys:

  • process (boolean) - whether or not to process the file using Chinese Text Analyser before returning. Defaults to true.

Example

local cta = require 'cta'

-- Open a document and return after Chinese Text Analyser has completed processing
local document1 = cta.Document( 'file1.txt' )

-- Open a document and return without processing it.
-- Statistics and word lists will not be available unless you call
-- Document:process() or Document:startProcessing().
local document2 = cta.Document( 'file2.txt', { process = false } )

Normally you will want to access document statistics and word lists, so it is preferable to let Chinese Text Analyser process the document.

If you are sure that your script does not need this information (perhaps you only want to search a document for some text, or only want to print out document sentences), it is slightly faster to skip processing by setting process to false.

Document:hasFinishedProcessing()

Returns true if the document has finished processing and false otherwise.

If a document has finished processing, then statistics and word list information will be available for the document. See Document() for more information.

Example

local function hasFinished( document )
    if document:hasFinishedProcessing() then
        print( document:name() .. ' has finished processing' )
    else
        print( document:name() .. ' has not finished processing' )
    end
end

local cta = require 'cta'
local document1 = cta.Document( 'file1.txt' )
local document2 = cta.Document( 'file2.txt', { process = false } )

hasFinished( document1 )
hasFinished( document2 )

Output

file1.txt has finished processing
file2.txt has not finished processing

Document:process()

Starts processing a document and waits until processing has finished before returning. If the document has already been processed this function does nothing.

You don’t normally need to call this function unless you created the document with the process parameter set to false. See Document() for more information.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt', { process = false } )
document:process()

Document:startProcessing()

Starts processing a document and returns immediately even if the document hasn’t finished processing. If the document has already been processed this function does nothing.

This function is useful if you want to process a large file, but there is other work you can do in your script first before you need access to the document statistics or word lists. All document processing occurs in a background thread, so you can do that other work while waiting for the document to finish processing.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt', { process = false } )
document:startProcessing()
-- ...
-- do work --
-- ...

Document:waitUntilProcessed()

Waits until the document has finished processing before returning. Returns immediately if the document has already been processed.

This function is useful if you have called Document:startProcessing(), then performed some work, and then want to ensure that the document has finished processing before continuing with some other work.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt', { process = false } )
document:startProcessing()

-- do work --

document:waitUntilProcessed()

-- do more work --
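
The same start-then-wait pattern can be applied to several documents at once. The sketch below is only an illustration: the helper name processAll is hypothetical, and it assumes that background processing of several documents can overlap, which the text above does not confirm. It uses only Document:startProcessing() and Document:waitUntilProcessed().

```lua
-- Start processing every document, then block until all have finished.
-- 'documents' is an array of Document objects, typically created with
-- { process = false }. processAll is a hypothetical helper name.
local function processAll( documents )
    -- kick off processing for every document without waiting
    for _, document in ipairs( documents ) do
        document:startProcessing()
    end

    -- now wait for each one in turn; already-finished documents
    -- return immediately
    for _, document in ipairs( documents ) do
        document:waitUntilProcessed()
    end
end

-- Possible usage inside Chinese Text Analyser (file names are placeholders):
--   local cta = require 'cta'
--   local documents = {
--       cta.Document( 'file1.txt', { process = false } ),
--       cta.Document( 'file2.txt', { process = false } ),
--   }
--   processAll( documents )
```

After processAll returns, statistics and word lists should be available for every document in the array.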

Document:name()

Returns the filename of the document.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

print( document:name() )

Output

file.txt

Document:tostring()

Calls Document:name() and returns the result.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- print will call tostring() on document
print( document )

Document:lines( includeNewlines )

Returns an iterator that iterates over all lines in the document. Each element of the iteration will be a Text object.

The includeNewlines parameter is optional and defaults to false if not specified.

If includeNewlines is true then the newline character \n is included at the end of each line.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

for line in document:lines() do
    print( line )
end
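
As a further illustration of the iterator, the sketch below collects the non-blank lines of a document into a plain Lua array of strings. The helper name nonEmptyLines is hypothetical, and the code assumes that calling tostring() on a Text object yields the line's text, which the print( line ) example suggests but the text above does not state.

```lua
-- Collect every line that contains at least one non-whitespace
-- character. nonEmptyLines is a hypothetical helper name.
local function nonEmptyLines( document )
    local result = {}

    -- lines() is called without arguments, so newlines are not included
    for line in document:lines() do
        -- assumption: tostring() on a Text object returns its contents
        local text = tostring( line )

        -- '%S' matches any non-whitespace byte; note that a full-width
        -- space (U+3000) is not whitespace to Lua's pattern matcher
        if text:find( '%S' ) then
            result[#result + 1] = text
        end
    end

    return result
end

-- Possible usage inside Chinese Text Analyser:
--   local cta = require 'cta'
--   local document = cta.Document( 'file.txt', { process = false } )
--   print( #nonEmptyLines( document ) )
```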

Document:allWords()

Returns a WordList object containing all the unique words in the document.

An error will occur if this function is called before the document has finished processing.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )
local wordlist = document:allWords()

for word in wordlist:words() do
    print( word )
end

Document:knownWords( wordList )

Returns a WordList object containing all the unique words that exist in both the document and the wordList parameter. The wordList parameter is optional and defaults to cta.knownWords() if not specified.

An error will occur if this function is called before the document has finished processing.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

local known = document:knownWords()
-- ...

local customList = cta.WordList( 'words.txt' )
known = document:knownWords( customList )
-- ...

Document:unknownWords( wordList )

Returns a WordList object containing all the unique words that exist in the document and that don’t exist in the wordList parameter. The wordList parameter is optional and defaults to cta.knownWords() if not specified.

An error will occur if this function is called before the document has finished processing.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

local unknown = document:unknownWords()
-- ...

local customList = cta.WordList( 'words.txt' )
unknown = document:unknownWords( customList )
-- ...
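
One way these word lists might be combined is to estimate what fraction of a document's unique words are unknown. The sketch below is an illustration only: countWords and unknownRatio are hypothetical helper names, and the code relies solely on the WordList:words() iterator used in the Document:allWords() example. The ratio is over unique words, not over total word occurrences.

```lua
-- Count the entries in a WordList by walking its words() iterator.
-- countWords is a hypothetical helper name.
local function countWords( wordList )
    local count = 0
    for _ in wordList:words() do
        count = count + 1
    end
    return count
end

-- Fraction (0 to 1) of the document's unique words that are unknown.
-- unknownRatio is a hypothetical helper name.
local function unknownRatio( document )
    local total = countWords( document:allWords() )

    -- avoid dividing by zero for an empty document
    if total == 0 then
        return 0
    end

    return countWords( document:unknownWords() ) / total
end

-- Possible usage inside Chinese Text Analyser:
--   local cta = require 'cta'
--   local document = cta.Document( 'file.txt' )
--   print( unknownRatio( document ) )
```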

Document:allStatistics( options )

Returns a table containing frequency and other statistics for each unique word in the document.

Each element of the returned table will include a table with the following fields:

  • word - the word (note: only if options.keyByWord is false).
  • frequency - the number of times the word appeared in the document.
  • percentageFrequency - the number of times the word appeared in the document as a percentage of the total words in the document.
  • cumulativePercentageFrequency - the cumulative percentage frequency of the word.
  • firstOccurrence - the first occurrence of the word in the document specified as a byte offset from the beginning of the file.
  • hskLevel - the lowest HSK level that this word appears in. If the word does not appear in any HSK level, this value will be set to 999.
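
To illustrate how these fields might be consumed, the sketch below counts unique words per HSK level, treating 999 as "not in any HSK level". The helper name countByHskLevel and the constant NOT_IN_HSK are hypothetical, and the code assumes the array form of the returned table. It operates on plain Lua tables, so it can be tried with hand-made rows.

```lua
-- Value documented for words that do not appear in any HSK level.
local NOT_IN_HSK = 999

-- Build a table mapping hskLevel -> number of unique words at that
-- level. 'statistics' is the array form returned by allStatistics().
-- countByHskLevel is a hypothetical helper name.
local function countByHskLevel( statistics )
    local counts = {}
    for _, stats in ipairs( statistics ) do
        local level = stats.hskLevel
        counts[level] = ( counts[level] or 0 ) + 1
    end
    return counts
end

-- Possible usage inside Chinese Text Analyser:
--   local counts = countByHskLevel( document:allStatistics() )
--   print( 'HSK 1:', counts[1] or 0 )
--   print( 'Not in HSK:', counts[NOT_IN_HSK] or 0 )
```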

This function takes an optional table containing configuration parameters which can have the following keys:

options.keyByWord - (boolean)

If keyByWord is false the returned table is an array, sorted by the other parameters specified in options.

If keyByWord is true the first return value is a table keyed by the word. A second table is also returned containing a sorted array that can be used to process the first table in sorted order. If sorted is false (see below), no second table is returned.

Defaults to false.

It is useful to use keyByWord when you want to be able to easily access statistics for a specific word, e.g.

    local stats = document:allStatistics( { keyByWord = true } )
    local wordStats = stats['资料']

options.sortBy - (string)

The field to sort by. Valid values are:

  • frequency
  • firstOccurrence
  • word
  • hskLevel

Defaults to frequency

options.sorted - (boolean)

If true the returned table(s) will be sorted.

If false the returned table will not be sorted in any particular order. This is useful when you want to keyByWord and you are not interested in having sorted results.

Defaults to true.

Setting sorted to false is only useful if keyByWord is true. In that case the second table containing the sort order is not returned, and the values in the first table are not in any particular order.

It is marginally faster (and more memory efficient) to set sorted to false if you are keying by word and aren’t interested in any particular sort order.

options.ascending - (boolean)

Whether to sort in ascending or descending order.

The default value depends on the sortBy field as follows:

  • frequency - false
  • firstOccurrence - true
  • word - true
  • hskLevel - true

An error will occur if this function is called before the document has finished processing.

Example

local cta = require 'cta'

local function printRow( key, stats )
    cta.write( key )

    if stats.word ~= nil then
        cta.write( '', stats.word )
    end

    cta.print( '', stats.hskLevel,
                   stats.firstOccurrence,
                   stats.frequency,
                   stats.percentageFrequency,
                   stats.cumulativePercentageFrequency )
end

local function printKeyedByRow( statistics )
    for i, stats in ipairs( statistics ) do
        printRow( i, stats )
    end
end

local function printKeyedByWord( statistics, sorted )

    -- the order of associative arrays is not guaranteed, and so
    -- if we have a valid 'sorted' parameter, use it to iterate
    -- over the statistics in the correct order
    if sorted ~= nil then
        for _, word in ipairs( sorted ) do
            printRow( word, statistics[word] )
        end
    else
        -- print unsorted
        for word, values in pairs( statistics ) do
            printRow( word, values )
        end
    end
end

local document = cta.Document( 'file.txt' )

-- Using all defaults, statistics is sorted by
-- frequency descending
local statistics = document:allStatistics()
printKeyedByRow( statistics )

-- Sort statistics by hskLevel descending
statistics = document:allStatistics( { sortBy = 'hskLevel', ascending = false } )
printKeyedByRow( statistics )

-- Sort by frequency descending (the default), key by word.
local sortOrder
statistics, sortOrder = document:allStatistics( { keyByWord = true } )
printKeyedByWord( statistics, sortOrder )

-- Key by word, don't care about sort order.
statistics = document:allStatistics( { keyByWord = true, sorted = false } )
printKeyedByWord( statistics, nil )
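
The cumulativePercentageFrequency field can also be used to ask how many of the most frequent words are needed to reach a given coverage of the document. The sketch below is an illustration under stated assumptions: the helper name wordsNeededForCoverage is hypothetical, the input is assumed to be the default array form (sorted by frequency, descending), and target is assumed to be expressed in the same unit as cumulativePercentageFrequency, which the text above does not specify.

```lua
-- Return how many rows, counted from the top of the array, are needed
-- before cumulativePercentageFrequency reaches 'target'. Returns nil
-- if the target is never reached.
-- wordsNeededForCoverage is a hypothetical helper name.
local function wordsNeededForCoverage( statistics, target )
    for i, stats in ipairs( statistics ) do
        if stats.cumulativePercentageFrequency >= target then
            return i
        end
    end
    return nil
end

-- Possible usage inside Chinese Text Analyser:
--   local statistics = document:allStatistics()
--   print( wordsNeededForCoverage( statistics, 95 ) )
```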

Document:knownStatistics( options )

Returns a table containing frequency and other statistics for each unique known word in the document.

The options parameter is the same as for Document:allStatistics() except that it can also take one additional field:

options.wordList - (WordList)

Only words that exist in this WordList will be treated as known. If not specified this value defaults to cta.knownWords().

An error will occur if this function is called before the document has finished processing.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )
local wordList = cta.WordList( 'words.txt' )

-- get 'known' word statistics based on cta.knownWords()
local statistics = document:knownStatistics()
-- ...

-- get 'known' word statistics based on wordlist
statistics = document:knownStatistics( { wordList = wordList } )
-- ...

Document:unknownStatistics( options )

Returns a table containing frequency and other statistics for each unique unknown word in the document.

The options parameter is the same as for Document:allStatistics() except that it can also take one additional field:

options.wordList - (WordList)

Only words that do not exist in this WordList will be treated as unknown. If not specified this value defaults to cta.knownWords().

An error will occur if this function is called before the document has finished processing.

Example

local cta = require 'cta'
local document = cta.Document( 'file.txt' )
local wordList = cta.WordList( 'words.txt' )

-- get 'unknown' word statistics for all words *not* in cta.knownWords()
local statistics = document:unknownStatistics()
-- ...

-- get 'unknown' word statistics for all words *not* in wordList
statistics = document:unknownStatistics( { wordList = wordList } )
-- ...
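
A typical use of these statistics might be to pull out the most frequent unknown words as a study list. The sketch below is an illustration only: the helper name topWords is hypothetical, and it assumes the default options, so that the table is an array sorted by frequency (descending) and each row has a word field.

```lua
-- Return the words from the first 'n' rows of an array-form
-- statistics table (fewer if the table is shorter).
-- topWords is a hypothetical helper name.
local function topWords( statistics, n )
    local result = {}
    for i = 1, math.min( n, #statistics ) do
        result[#result + 1] = statistics[i].word
    end
    return result
end

-- Possible usage inside Chinese Text Analyser:
--   local statistics = document:unknownStatistics()
--   for _, word in ipairs( topWords( statistics, 20 ) ) do
--       print( word )
--   end
```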

Document:findWord( word )

Returns an iterator that finds all lines and sentences in the document containing the word (or words) specified by the word parameter.

The word parameter can be one of:

  • a string containing a single word
  • a table containing multiple words
  • a WordList object

Each iteration will return three values:

  • word (string) - the first word in the line/sentence that matched a word in the word parameter
  • sentence (Text) - the sentence containing the word
  • line (Text) - the line containing the word

Example

local function find( document, words )
    for word, sentence, line in document:findWord( words ) do
        print( word )
        print( '', sentence )
        print( '', line )
    end
end

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- find all instances of a single word
find( document, '搜索' )

-- find all instances of multiple words
find( document, { '第一', '第二', '第三' } )

-- find all instances of unknown words
find( document, document:unknownWords() )

Document:findLinesContaining( word )

Similar to Document:findWord() except each iteration only returns the word and the line.

Example

local function findLines( document, words )
    for word, line in document:findLinesContaining( words ) do
        print( word )
        print( '', line )
    end
end

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- find all lines containing a single word
findLines( document, '搜索' )

-- find all lines containing any one of multiple words
findLines( document, { '第一', '第二', '第三' } )

-- find all lines containing unknown words
findLines( document, document:unknownWords() )

Document:findSentencesContaining( word )

Similar to Document:findWord() except each iteration only returns the word and the sentence.

Example

local function findSentences( document, words )
    for word, sentence in document:findSentencesContaining( words ) do
        print( word )
        print( '', sentence )
    end
end

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- find all sentences containing a single word
findSentences( document, '搜索' )

-- find all sentences containing any one of multiple words
findSentences( document, { '第一', '第二', '第三' } )

-- find all sentences containing unknown words
findSentences( document, document:unknownWords() )
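
Rather than printing matches as they are found, the results can be gathered into a table for later use. The sketch below groups example sentences by the word that matched. It is an illustration only: the helper name sentencesByWord is hypothetical, and the code assumes that tostring() on a Text object yields the sentence text.

```lua
-- Build a table mapping each matched word to an array of the
-- sentences (as strings) in which it was found.
-- sentencesByWord is a hypothetical helper name.
local function sentencesByWord( document, words )
    local groups = {}

    for word, sentence in document:findSentencesContaining( words ) do
        -- create the per-word array the first time the word is seen
        groups[word] = groups[word] or {}

        -- assumption: tostring() on a Text object returns its contents
        table.insert( groups[word], tostring( sentence ) )
    end

    return groups
end

-- Possible usage inside Chinese Text Analyser:
--   local groups = sentencesByWord( document, { '第一', '第二' } )
--   for word, sentences in pairs( groups ) do
--       print( word, #sentences )
--   end
```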