Lua Script

Lua is a scripting language that can be used to extend and enhance the functionality of programs that provide support for it.

A tutorial on Lua is beyond the scope of Chinese Text Analyser’s documentation; however, the web has plenty of resources and reference material for learning how to program in Lua.

A good overview of the language can be found here.

Don’t be afraid to try out and play around with different scripts in Chinese Text Analyser. You won’t be able to break anything, and doing is the best way to learn.

Before you begin

If you want to create your own Lua scripts, you will need to use a text editor. This is any program that can save a file in plain text, without any sort of formatting beyond the text you write.

Something like Notepad (Windows) or TextEdit (macOS) is sufficient for the task; however, you will be better off using a text editor such as Notepad++ or Sublime Text, which have many useful features for writing code.

Once you have a suitable text editor, you can write your Lua scripts, save them with a ‘.lua’ extension, and then run them from within Chinese Text Analyser.

A first example

If you want to access any of Chinese Text Analyser’s functionality from your Lua scripts, the first thing you need to do is load the ‘cta’ module and store it in a variable. For example, the following code stores the module in a local variable called cta.

local cta = require 'cta'

Once loaded, you can then make use of the various functions of the ‘cta’ module.

For this first example, we are going to open a file containing Chinese text and print out each line from that file. Documents can be loaded using the cta.Document() function.

In its basic form, this function takes the name of a file you wish to open and returns a new Document object that references the file.

We can then store the Document object in a variable as follows:

local document = cta.Document( 'chinese.txt' )

You can see a full list of the features available to Document objects here. We are interested in the Document:lines() method, which returns an iterator to all of the lines in the document.

Using this iterator, we can then loop over all the lines in the document as follows:

for line in document:lines() do
    print( line )
end

In the example above, for each iteration of the loop, the variable line will contain a Text object referring to the next line in the document, which we then print out.

The full code for this example looks like this:

local cta = require 'cta'
local document = cta.Document( 'chinese.txt' )

for line in document:lines() do
    print( line )
end

Although this code doesn’t do anything particularly useful, it serves as a brief introduction to accessing Chinese Text Analyser from within Lua.

In the next few sections, we will build on this example to do more and more complex things.

Printing sentences

Now that we know how to iterate over all the lines in a document, it’s time to do something useful with that.

In a given piece of text, a single line will likely contain one or more sentences. Sentences are a useful unit for language learners to deal with, so now we’ll modify the previous example to print out each sentence from the document on a new line.

Text objects (which are returned by the Document:lines() iterator) contain a method Text:sentences() that returns an iterator over all sentences in the Text object.

Therefore we can loop over all sentences in the document by first looping over all lines and then for each line, looping over all sentences:

local cta = require 'cta'
local document = cta.Document( 'chinese.txt' )

for line in document:lines() do
    for sentence in line:sentences() do
        print( sentence )
    end
end

You can see from the above example that the way we iterate over each sentence is almost identical to the way we iterate over lines. Like lines, each sentence is a Text object.

The output from this script will be similar to the output from the first example, except that each sentence will be printed on its own line.

Although this is nice, printing out the sentences of a document still isn’t very useful. In the next section, we’ll build on this to only print sentences that contain unknown words.

Printing sentences with unknown words

Chinese Text Analyser keeps track of a user’s known vocabulary. You can access that list of known vocabulary through the cta.knownWords() function.

local known = cta.knownWords()

cta.knownWords() returns a WordList object containing a list of words. A WordList can be efficiently tested to see whether or not a given word exists in the list by calling the method WordList:contains(), which returns true if the word exists in the list and false otherwise.

We can consider any word which does not exist in the list of ‘known’ words as ‘unknown’, for example:

if known:contains( word ) then
    print( 'The word is known' )
else
    print( 'The word is unknown' )
end

We can find sentences that have unknown words by checking all the words in that sentence and seeing whether or not they are in the known list.

Building from the previous example, Text objects have a method called Text:words() that returns an iterator over each word of the text.

We can use this method to go over each word of a sentence and only print out the sentence if it contains a word that is not on the list of known words.

for word in sentence:words() do
    if not known:contains( word ) then
        print( sentence )
        break
    end
end

The break keyword tells the script to immediately exit from the current loop. We use it here because once we have found at least one unknown word we don’t need to process any more words in the sentence.
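As a pure-Lua illustration (nothing Chinese Text Analyser specific), you can see that break exits only the innermost loop; an outer loop carries on with its next iteration:

```lua
-- break exits only the innermost loop; the outer loop continues
local visited = {}
for i = 1, 3 do
    for j = 1, 3 do
        if j == 2 then
            break  -- stop the inner loop, move on to the next i
        end
        table.insert( visited, i .. ',' .. j )
    end
end

print( table.concat( visited, ' ' ) )  -- 1,1 2,1 3,1
```

This mirrors what happens in the sentence-printing example: the break ends the word loop for the current sentence, and the script moves on to the next sentence.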

The full code for this example looks like this:

local cta = require 'cta'
local known = cta.knownWords()
local document = cta.Document( 'chinese.txt' )

for line in document:lines() do
    for sentence in line:sentences() do
        for word in sentence:words() do
            if not known:contains( word ) then
                print( sentence )
                break
            end
        end
    end
end

Now it’s starting to get interesting. We can use this script to open a document containing Chinese text and print out all the sentences that contain unknown words.

The next section shows how you can do this using your own lists of words.

Using HSK and custom word lists

Although it’s nice to be able to use Chinese Text Analyser’s list of known words, sometimes you might have another list of words you’d like to test against, for example HSK word lists.

HSK lists can be accessed through the cta.hskLevel() function. This function has two versions:

  • cta.hskLevel( level ) - which returns a WordList object containing all the words for a single HSK level.

    local hsk6 = cta.hskLevel( 6 )
    
  • cta.hskLevel( lowerLevel, upperLevel ) - which returns a WordList object containing all the words for the range of HSK levels bounded by a lowerLevel and an upperLevel.

    local hsk1to6 = cta.hskLevel( 1, 6 )
    

We could now modify the previous example so that it prints out all HSK level 6 words from a document.

local cta = require 'cta'
local hsk6 = cta.hskLevel( 6 )
local document = cta.Document( 'chinese.txt' )

-- keep track of all the hsk6 words that are seen in the document
local seen = {}

for line in document:lines() do
    for sentence in line:sentences() do
        for word in sentence:words() do
            if hsk6:contains( word ) then
                -- add this word to the list of seen words
                seen[word] = true
            end
        end
    end
end

-- print out all the seen words
for word in pairs( seen ) do
    print( word )
end

In the example above, we create a table called seen, which we use to keep track of any words in the document that are also in the HSK level 6 word list.

Once we’ve gone through all the sentences in the document, we then loop over all items in the seen table and print them out.
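One thing worth knowing: pairs() visits keys in no particular order, so the words may print in a different order on each run. If you want deterministic output, you can copy the keys into an indexed array and sort it first, as in this small pure-Lua sketch:

```lua
-- pairs() iterates keys in an unspecified order; sort for stable output
local seen = { ['你好'] = true, ['谢谢'] = true, ['再见'] = true }

-- copy the keys into an indexed array
local words = {}
for word in pairs( seen ) do
    table.insert( words, word )
end

-- sort the array, then print in a predictable order
table.sort( words )
for _, word in ipairs( words ) do
    print( word )
end
```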

Custom word lists

Instead of HSK levels you can also load custom word lists from files using the cta.WordList() function as follows:

local wordList = cta.WordList( 'wordlist.txt' )

This will load the words from a file called ‘wordlist.txt’. Each line of the file should contain a single word, and the file should be saved using UTF-8, UTF-16, GB or BIG-5 encoding.
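cta.WordList() handles the file reading and encoding detection for you, but purely as an illustration of the file format (one word per line), here is how a UTF-8 word list could be read into a lookup table with standard Lua. This is a sketch, not how Chinese Text Analyser implements it:

```lua
-- Illustrative sketch only: read a one-word-per-line UTF-8 word list
-- into a set-style table (cta.WordList() does this, plus encoding
-- detection, for you)

-- create a small example word list file
local f = assert( io.open( 'example-wordlist.txt', 'w' ) )
f:write( '你好\n谢谢\n再见\n' )
f:close()

-- read it back, one word per line, into a lookup table
local words = {}
for word in io.lines( 'example-wordlist.txt' ) do
    words[word] = true
end

os.remove( 'example-wordlist.txt' )

print( words['你好'] )  -- true
print( words['苹果'] )  -- nil
```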

We could once again modify our earlier example so that instead of printing out sentences that contain unknown words, we print out sentences that contain words from our custom word list:

local cta = require 'cta'
local customWords = cta.WordList( 'wordlist.txt' )
local document = cta.Document( 'chinese.txt' )

for line in document:lines() do
    for sentence in line:sentences() do
        for word in sentence:words() do
            if customWords:contains( word ) then
                print( sentence )
                break
            end
        end
    end
end

The next section will explore how to print out cloze deleted sentences.

Printing cloze deleted sentences

Now that we can find sentences that match a set of criteria, we can start to do even more useful things with that, for example printing cloze deleted sentences containing unknown words.

Text objects have a method called Text:clozeText() that will generate cloze deleted text for a word (or group of words).

Like the HSK level example, we’re going to keep track of a list of ‘seen’ words and only print out sentences for unknown words that we haven’t already seen.

We can do this as follows:

if seen[word] == nil and not known:contains( word ) then

This line means: if word does not exist in the seen table and word is not found in the list of known words, then do something.

In our case, that ‘something’ is to print the word and the sentence with the word cloze deleted, separated by a tab.

In Lua, the print function automatically separates multiple arguments with a tab character, so we can achieve this like so:

print( word, sentence:clozeText( word ) )

A full version of this example can be found here:

local cta = require 'cta'
local known = cta.knownWords()
local document = cta.Document( 'chinese.txt' )
local seen = {}

for line in document:lines() do
    for sentence in line:sentences() do
        for word in sentence:words() do
            if seen[word] == nil and not known:contains( word ) then
                print( word, sentence:clozeText( word ) )
                seen[word] = true
                break
            end
        end
    end
end

The Text:clozeText() method can also do multiple substitutions. This comes in handy if for example you’d like to make multiple cloze deleted sentences for use with something like Anki, which we’ll do in the next section.

Cloze sentences for Anki

In Anki, you can create cloze deleted cards for a sentence using the format {{cN::<word>}} where N is a unique number for the card and <word> is the word to be clozed. For example if you had the sentence:

洪钧感觉自己所有的内脏器官好像都坠了下去

and you wanted to create two separate cards with cloze text, one for the word 自己 and one for the word 好像 you would create a card containing the following text:

洪钧感觉{{c1::自己}}所有的内脏器官{{c2::好像}}都坠了下去

We can use the Text:clozeText() method to automatically create sentences in this format.

Let’s say you wanted to create a cloze deleted card for each unknown word in a sentence. The first thing you need to do is build up a list of unknown words.

Based on our other examples, this is easy to do:

local unknown = {}
for word in sentence:words() do
    if seen[word] == nil and not known:contains( word ) then
        seen[word] = true
        table.insert( unknown, word )
    end
end

Here we create a new table and store it in a local variable called unknown. Unlike the seen table, we want the unknown table to be an indexed array rather than an associative array (see Lua tables), so we use table.insert rather than the square bracket notation to add the word to the table. We do this because the version of Text:clozeText() that we will call expects an indexed array.
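The difference between the two kinds of table is easy to see in plain Lua: the length operator # and ipairs only see the numerically indexed part of a table:

```lua
-- associative table (like 'seen'): the words themselves are the keys
local seen = {}
seen['你好'] = true
seen['谢谢'] = true

-- indexed array (like 'unknown'): words stored at positions 1, 2, ...
local unknown = {}
table.insert( unknown, '你好' )
table.insert( unknown, '谢谢' )

print( #seen )       -- 0: # counts only numeric indices
print( #unknown )    -- 2
print( unknown[1] )  -- 你好
```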

Note that we also don’t break out of the loop because we want the table of unknown words to contain all unknown words in the sentence.

In addition to just cloze deleting a single word, Text:clozeText() can also be used to cloze delete multiple words and can also use custom replacement text. We do this by passing the method an indexed table of words as well as the text we want to use to replace those words.

Each word in the table that also exists in the sentence will be replaced with that custom text.

We already know that Anki expects the format for cloze cards to be {{cN::<word>}} which we can replicate using the replacement text ‘{{c%n::%w}}’.

Chinese Text Analyser will automatically replace %n with a count of the number of substitutions made so far and replace %w with the word that is currently being substituted. When combined, this will give us exactly what Anki wants.
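To make the substitution concrete, here is a hypothetical pure-Lua sketch of the same idea using string.gsub. Chinese Text Analyser implements this internally; clozeSketch is just an illustrative name, not part of the ‘cta’ module:

```lua
-- Sketch of the %n/%w template substitution performed by clozeText():
-- each word in 'words' found in 'sentence' is replaced by 'template',
-- with %n becoming the substitution count and %w the matched word
local function clozeSketch( sentence, words, template )
    local n = 0
    for _, word in ipairs( words ) do
        sentence = sentence:gsub( word, function( w )
            n = n + 1
            local out = template:gsub( '%%n', tostring( n ) )
            out = out:gsub( '%%w', w )
            return out
        end )
    end
    return sentence
end

print( clozeSketch( '洪钧感觉自己所有的内脏器官好像都坠了下去',
                    { '自己', '好像' }, '{{c%n::%w}}' ) )
-- 洪钧感觉{{c1::自己}}所有的内脏器官{{c2::好像}}都坠了下去
```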

Once we have our list of unknown words from the sentence, we can check that unknown contains at least one word, and if so pass the list of words and the replacement text to the clozeText method to generate a sentence compatible with Anki’s cloze deletion.

if #unknown > 0 then
    print( sentence:clozeText( unknown, '{{c%n::%w}}' ) )
end

The full example is as follows:

local cta = require 'cta'
local known = cta.knownWords()
local document = cta.Document( 'chinese.txt' )
local seen = {}

for line in document:lines() do
    for sentence in line:sentences() do
        local unknown = {}
        for word in sentence:words() do
            if seen[word] == nil and not known:contains( word ) then
                seen[word] = true
                table.insert( unknown, word )
            end
        end

        if #unknown > 0 then
            print( sentence:clozeText( unknown, '{{c%n::%w}}' ) )
        end
    end
end

Note that unknown is created as a local variable of the sentence loop so that it is recreated fresh for each sentence.

This script will go through each sentence of the document and print out cloze deleted sentences for any sentence that contains an unknown word.

In the next section we’ll look at how to do this with sentences that are ‘mostly’ known, i.e. sentences that contain more than a certain percentage of known words.

Finding mostly known sentences

Printing sentences containing unknown words is useful, but sometimes a sentence contains a large number of unknown words and therefore isn’t really suitable for studying. It would be far more useful to find only sentences that contain a high percentage of known words.

For the next example we’ll write a function to test whether or not a sentence is ‘mostly known’.

This function will take three parameters: the sentence, a list of known words, and a threshold value. It will iterate over all the words in the sentence and calculate the ratio of known words to total words. If this ratio is greater than or equal to the threshold and also less than 1 (ensuring the sentence contains at least one unknown word), we’ll consider the sentence ‘mostly known’.

local function sentenceMostlyKnown( sentence, known, threshold )

    -- make threshold default to 97% if not specified
    if threshold == nil then
        threshold = 0.97
    end

    -- the total number of words in the sentence
    local total = 0

    -- the total number of known words in the sentence
    local totalKnown = 0

    -- loop over all words in the sentence and increase
    -- the total and totalKnown counts as appropriate
    for word in sentence:words() do
        if known:contains( word ) then
            totalKnown = totalKnown + 1
        end
        total = total + 1
    end

    -- calculate the ratio of known to total words
    local ratio = totalKnown / total

    -- return true if the ratio is greater than or equal to
    -- the threshold and less than 1
    return ratio >= threshold and ratio < 1
end

We can now call this function like this:

-- threshold defaults to 97% if not specified
if sentenceMostlyKnown( sentence, known ) then
    print( 'Sentence mostly known' )
else
    print( 'Sentence either completely known, or mostly unknown' )
end

If we like, we can also specify a custom threshold:

if sentenceMostlyKnown( sentence, known, 0.95 ) then
    print( 'Sentence more than 95% known' )
else
    print( 'Sentence either completely known, or less than 95% known' )
end

We can now combine this with our previous example to generate Anki cloze sentences for any sentences in a document that have a certain percentage of known words.

To avoid iterating over the sentence twice (once in the sentenceMostlyKnown function and again in the sentence loop to find the unknown words), we’ll modify the sentenceMostlyKnown function to also keep track of and return the unknown words.

We’ll also need to pass in another parameter seen to keep track of which words we have already seen.

local function sentenceMostlyKnown( sentence, known, seen, threshold )
    -- make threshold optional, defaulting to 97% if not specified
    if threshold == nil then
        threshold = 0.97
    end

    -- the total number of words in the sentence
    local total = 0

    -- the total number of known words in the sentence
    local totalKnown = 0

    local unknown = {}

    -- loop over all words in the sentence and increase
    -- the total and totalKnown counts as appropriate
    for word in sentence:words() do
        if known:contains( word ) then
            totalKnown = totalKnown + 1
        else
            if seen[word] == nil then
                table.insert( unknown, word )
                seen[word] = true
            end
        end
        total = total + 1
    end

    -- calculate the ratio of known to total words
    local ratio = totalKnown / total

    -- return true if the ratio is greater than or equal to the
    -- threshold and less than 1.  Also return the list of unknown words
    return ratio >= threshold and ratio < 1, unknown
end

Putting it all together, we get:

local function sentenceMostlyKnown( sentence, known, seen, threshold )
    -- make threshold optional, defaulting to 97% if not specified
    if threshold == nil then
        threshold = 0.97
    end

    -- the total number of words in the sentence
    local total = 0

    -- the total number of known words in the sentence
    local totalKnown = 0

    local unknown = {}

    -- loop over all words in the sentence and increase
    -- the total and totalKnown counts as appropriate
    for word in sentence:words() do
        if known:contains( word ) then
            totalKnown = totalKnown + 1
        else
            if seen[word] == nil then
                table.insert( unknown, word )
                seen[word] = true
            end
        end
        total = total + 1
    end

    -- calculate the ratio of known to total words
    local ratio = totalKnown / total

    -- return true if the ratio is greater than or equal to the
    -- threshold and less than 1.  Also return the list of unknown words
    return ratio >= threshold and ratio < 1, unknown
end

local cta = require 'cta'
local known = cta.knownWords()
local document = cta.Document( 'chinese.txt' )
local seen = {}

for line in document:lines() do
    for sentence in line:sentences() do
        local mostlyKnown, unknown = sentenceMostlyKnown( sentence, known, seen )
        if mostlyKnown and #unknown > 0 then
            print( sentence:clozeText( unknown, '{{c%n::%w}}' ) )
        end
    end
end

Working with multiple files

Now that we can go through an entire document and print out cloze deleted sentences for all sentences that contain more than a certain percentage of known words, it’s not much more work to extend this to go over multiple files.

The ‘cta’ module provides a function to ask the user for a file (or list of files) to open:

local files = cta.askUserForFileToOpen( { allowMultiple = true } )

When allowMultiple is true then this function returns an indexed array of strings containing the full path to each file. When allowMultiple is false then it returns a single string containing the full path of the file the user selected. The function returns nil if the user cancelled.
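Since the return type depends on allowMultiple, a script that wants to handle either configuration can normalize the result into an indexed array first. This helper is plain Lua; toFileList is just an illustrative name, not part of the ‘cta’ module:

```lua
-- Normalise the result of a file-open dialog into an indexed array,
-- whether it returned a single path (string), several (table), or nil
local function toFileList( result )
    if result == nil then
        return {}          -- user cancelled
    elseif type( result ) == 'string' then
        return { result }  -- single file selected
    else
        return result      -- already a list of paths
    end
end

print( #toFileList( nil ) )                             -- 0
print( #toFileList( '/tmp/a.txt' ) )                    -- 1
print( #toFileList( { '/tmp/a.txt', '/tmp/b.txt' } ) )  -- 2
```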

We can now integrate that with our previous example. To make the code a bit easier to read, we’ll split the processing of each document into its own function:

local function findSentencesInDocument( filename, known, seen )
    local document = cta.Document( filename )
    for line in document:lines() do
        for sentence in line:sentences() do
            local mostlyKnown, unknown = sentenceMostlyKnown( sentence, known, seen )
            if mostlyKnown and #unknown > 0 then
                print( sentence:clozeText( unknown, '{{c%n::%w}}' ) )
            end
        end
    end
end

We can then call that function once for each file the user chose:

local known = cta.knownWords()
local seen = {}

local files = cta.askUserForFileToOpen( { allowMultiple = true } )
if files ~= nil then
    for _, filename in ipairs( files ) do
        findSentencesInDocument( filename, known, seen )
    end
end

Putting it all together, we get:

48
local cta = require 'cta'

local function sentenceMostlyKnown( sentence, known, seen, threshold )
    if threshold == nil then
        threshold = 0.97
    end

    local total = 0
    local totalKnown = 0

    local unknown = {}
    for word in sentence:words() do
        if known:contains( word ) then
            totalKnown = totalKnown + 1
        else
            if seen[word] == nil then
                table.insert( unknown, word )
                seen[word] = true
            end
        end
        total = total + 1
    end

    local ratio = totalKnown / total
    return ratio >= threshold and ratio < 1, unknown
end

local function findSentencesInDocument( filename, known, seen )
    local document = cta.Document( filename )
    for line in document:lines() do
        for sentence in line:sentences() do
            local mostlyKnown, unknown = sentenceMostlyKnown( sentence, known, seen )
            if mostlyKnown and #unknown > 0 then
                print( sentence:clozeText( unknown, '{{c%n::%w}}' ) )
            end
        end
    end
end

local known = cta.knownWords()
local seen = {}

local files = cta.askUserForFileToOpen( { allowMultiple = true } )
if files ~= nil then
    for _, filename in ipairs( files ) do
        findSentencesInDocument( filename, known, seen )
    end
end

Working with directories of files

The ‘cta’ module also has a function to ask the user for a directory on your disk:

local directory = cta.askUserForDirectory()
if directory ~= nil then
    print( 'The user chose the directory: ' .. directory )
end

You can use that in conjunction with the ‘lfs’ module, which is included with Chinese Text Analyser, to process all the files (and sub-directories) of a directory.

You can load the ‘lfs’ module in your code like so:

local lfs = require 'lfs'

The ‘lfs’ module has a number of functions for manipulating the file system. The two we are interested in are dir, which provides an iterator over the contents of a directory on your hard drive, and attributes, which tells us whether a given path is a file or a directory.

We’ll write a new function that traverses a directory and all of its subdirectories, searching for filenames that end in ‘.txt’ and processing any matches as we did in the previous examples.

local function traverseDirectory( directory, known, seen )
    for file in lfs.dir( directory ) do
        -- ignore the . and .. directories
        if file ~= "." and file ~= ".." then
            -- get the full path of the file on the disk
            local fullPath = directory .. '/' .. file

            -- query the 'mode' attribute of this file
            local mode = lfs.attributes( fullPath, 'mode' )

            -- if it's a directory, then call the function again
            -- on this new directory
            if mode == "directory" then
                traverseDirectory( fullPath, known, seen )
            elseif mode == "file" then
                -- it's a file so check the filename ends in .txt
                if file:match( "%.txt$" ) then
                    findSentencesInDocument( fullPath, known, seen )
                end
            end
        end
    end
end

We can combine that then with our previous example as follows:

local cta = require 'cta'
local lfs = require 'lfs'

local function sentenceMostlyKnown( sentence, known, seen, threshold )
    if threshold == nil then
        threshold = 0.97
    end

    local total = 0
    local totalKnown = 0

    local unknown = {}
    for word in sentence:words() do
        if known:contains( word ) then
            totalKnown = totalKnown + 1
        else
            if seen[word] == nil then
                table.insert( unknown, word )
                seen[word] = true
            end
        end
        total = total + 1
    end

    local ratio = totalKnown / total
    return ratio >= threshold and ratio < 1, unknown
end

local function findSentencesInDocument( filename, known, seen )
    local document = cta.Document( filename )
    for line in document:lines() do
        for sentence in line:sentences() do
            local mostlyKnown, unknown = sentenceMostlyKnown( sentence, known, seen )
            if mostlyKnown and #unknown > 0 then
                print( sentence:clozeText( unknown, '{{c%n::%w}}' ) )
            end
        end
    end
end

local function traverseDirectory( directory, known, seen )
    for file in lfs.dir( directory ) do
        if file ~= "." and file ~= ".." then

            local fullPath = directory .. '/' .. file
            local mode = lfs.attributes( fullPath, 'mode' )

            if mode == "directory" then
                traverseDirectory( fullPath, known, seen )
            elseif mode == "file" then
                if file:match( "%.txt$" ) then
                    findSentencesInDocument( fullPath, known, seen )
                end
            end
        end
    end
end

local known = cta.knownWords()
local seen = {}

local directory = cta.askUserForDirectory()
if directory ~= nil then
    traverseDirectory( directory, known, seen )
end

If you’ve made it this far, you should be able to see that Lua scripting is a powerful and flexible tool that can be used to process documents containing Chinese text.

We started from a simple script that printed out each line in a file, and worked our way up to a 65-line script that will:

  • Find all ‘.txt’ files in a given directory (and sub-directories)
  • Extract all ‘mostly known’ sentences from those files, but only if the sentence contains an unknown word we don’t already have a sentence for.
  • Print cloze deleted versions of those sentences for importing into Anki.

Along the way, you’ve also gained a glimpse of some of the other features available through Lua scripting, although the above examples only scratch the surface of all the things that are possible.

You can find out more about all the features available to your Lua scripts in the Lua API appendix.

You can also participate in online discussion about Lua scripting for Chinese Text Analyser on Chinese-Forums.