Export Word Lists

Word List export allows you to export lists of words, sentences and statistics from the current document.

Numerous fields are available for export, and a tab-separated list of fields is output for each word being exported.

This makes it simple to import these word lists into other programs such as Pleco, Anki or a spreadsheet program.
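As a rough illustration of reading an exported file in another program, the sketch below parses tab-separated rows with Python's standard csv module. The sample rows, the English glosses and the field names passed to read_word_list (a hypothetical helper) are assumptions made for the example; the real columns depend on which fields were selected in the dialog.

```python
import csv
import io

# Hypothetical export with three selected fields:
# Word, Pinyin (Tones), English Definition. The glosses are illustrative only.
SAMPLE_EXPORT = "右边\tyòubian\tright side\n表头\tbiǎotóu\ttable header\n"

def read_word_list(stream, field_names):
    """Yield one dict per exported word from a tab-separated stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(field_names, row))

rows = list(read_word_list(io.StringIO(SAMPLE_EXPORT),
                           ["word", "pinyin", "definition"]))
```

Each element of rows is then a plain dictionary, which can be passed on to whatever the target program expects.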

[Screenshot: the Word List export dialog (_images/export-word-list.png)]

The Word List export dialog contains the following options:

Words

This option lets you choose which words to export. The available choices are:

  • All - All unique words from the document.
  • Known - All unique words from the document that are on the currently active word list.
  • Unknown - All unique words from the document that are not on the currently active word list.
  • Looked up - All unique words from the document that have recently been looked up, where ‘recently’ is defined as any word looked up since the current Chinese Text Analyser session was started and that has not already been exported and marked as known.

Sort by

This lets you sort the exported words by one of the following options:

Frequency (Ascending) - the frequency of the word in ascending order, i.e. least frequent words first.

Frequency (Descending) - the frequency of the word in descending order, i.e. most frequent words first.

First Occurrence (Ascending) - the order in which the word first appeared in the document, with earlier words appearing earlier in the list.

First Occurrence (Descending) - the order in which the word first appeared in the document, with earlier words appearing later in the list.

Word (Ascending) - sorts by Unicode code point, with characters that have a lower code point ordered before characters that have a higher code point.

Word (Descending) - sorts by Unicode code point, with characters that have a higher code point ordered before characters that have a lower code point.

HSK Level (Ascending) - the HSK level of the word from lowest to highest (e.g. ‘1’ comes before ‘6’, and ‘*’ comes last).

HSK Level (Descending) - the HSK level of the word from highest to lowest (e.g. ‘*’ comes first, then levels ‘6’ through to ‘1’).

TOCFL Level (Ascending) - the TOCFL level of the word from lowest to highest (e.g. ‘1’ comes before ‘5’, and ‘*’ comes last).

TOCFL Level (Descending) - the TOCFL level of the word from highest to lowest (e.g. ‘*’ comes first, then levels ‘5’ through to ‘1’).
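The ‘Word’ and level orderings above can be sketched in Python, whose built-in string comparison also works code point by code point. This is only an illustration, not the program's own implementation: level_key is a hypothetical helper, the levels attached to the placeholder words are invented for the example, and treating ‘*’ as a value that sorts after every numeric level is an assumption based on the ordering described above.

```python
# Word (Ascending / Descending): Python compares strings by Unicode code point.
words = ["表头", "中间", "右边", "一个"]
word_ascending = sorted(words)                 # ['一个', '中间', '右边', '表头']
word_descending = sorted(words, reverse=True)  # ['表头', '右边', '中间', '一个']

def level_key(level):
    """Hypothetical helper: numeric levels sort by value, '*' sorts after them all."""
    return float("inf") if level == "*" else int(level)

# Placeholder words with made-up levels, purely for illustration.
entries = [("word_a", "6"), ("word_b", "*"), ("word_c", "1")]
level_ascending = sorted(entries, key=lambda e: level_key(e[1]))
level_descending = sorted(entries, key=lambda e: level_key(e[1]), reverse=True)
```

Note that code point order is not the same as pinyin or stroke order; it simply follows the position of each character in Unicode.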

Rows

This lets you limit the number of exported words and, optionally, reorder the remaining words by a specific metric.

All - all the words in the word list.

First ‘X’ words - limits the number of exported words to X.

Ordered by - reorders the rows that remain after the word limit has been applied.
Because this reordering takes place after both ‘Sort by’ and the word limit, you could, for example, ‘Sort by’ frequency (descending), limit the total to the first 20 words, and then order by first occurrence, to get the 20 most frequent words in the order in which they first appear in the document.
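The three stages (sort, limit, reorder) can be sketched as follows. This is a minimal sketch under stated assumptions: export_rows is a hypothetical function, the dictionary keys are invented names, and the sample figures are made up so that the result is easy to follow.

```python
def export_rows(words, sort_key, descending=False, limit=None, order_key=None):
    """Apply 'Sort by', then the row limit, then 'Ordered by', in that order."""
    rows = sorted(words, key=lambda w: w[sort_key], reverse=descending)
    if limit is not None:
        rows = rows[:limit]  # First 'X' words
    if order_key is not None:
        rows = sorted(rows, key=lambda w: w[order_key])  # reorder what is left
    return rows

# Made-up sample data: four words with a frequency and a first-occurrence offset.
sample = [
    {"word": "A", "frequency": 5, "first_occurrence": 40},
    {"word": "B", "frequency": 9, "first_occurrence": 90},
    {"word": "C", "frequency": 2, "first_occurrence": 10},
    {"word": "D", "frequency": 7, "first_occurrence": 0},
]

# The two most frequent words, listed in the order they first appear.
top_two = export_rows(sample, "frequency", descending=True,
                      limit=2, order_key="first_occurrence")
```

Here the limit keeps B and D (the two most frequent words), and the final reordering puts D before B because D appears earlier in the document.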

Available/Selected Fields

The available and selected field boxes allow you to choose what information to export for each word.

Available Fields contains the total list of fields you can choose from, and Selected Fields contains the fields that you currently wish to export.

The following fields are supported:

Word - the word as it appears in the document.

If the document is in Simplified characters, then Word will be Simplified.

If the document is in Traditional characters, then Word will be Traditional.

Simplified - the Simplified version of the word, regardless of whether the original document uses Simplified or Traditional characters.

Traditional - the Traditional version of the word, regardless of whether the original document uses Simplified or Traditional characters.

Simplified[Traditional] - the Simplified and Traditional versions of the word combined as a single exported field, with the Traditional version in square brackets.

Pinyin (Tones) - the pinyin of the word, using tone marks to represent the tones, e.g. pīnyīn.

Pinyin (Numbers) - the pinyin of the word, using numbers to represent the tones, e.g. pin1yin1.

English Definition - the English definition of the word.

Sentence - the first sentence from the document that contains the word.

Cloze Sentence - the same as Sentence, except the word is removed from the sentence and replaced with […] markers.

Line - the first line from the document that contains the word. This is different from Sentence in that it uses line breaks rather than punctuation to determine what to include. This is useful if the source document has already been broken up into meaningful chunks of text, one per line, but each line also contains full stops or other punctuation, e.g. subtitles for a TV show or movie.

Cloze Line - the same as Line, except the word is removed from the line and replaced with […] markers.
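The cloze fields can be pictured as a simple text substitution. The sketch below is an assumption about the behaviour rather than the program's actual code: make_cloze is a hypothetical helper, and replacing every occurrence of the word (rather than only the first) is a guess.

```python
def make_cloze(text, word, marker="[…]"):
    """Return text with each occurrence of word replaced by the cloze marker."""
    return text.replace(word, marker)

cloze = make_cloze("右边中间部分有一个表头", "表头")
# cloze == "右边中间部分有一个[…]"
```

The same function applies to both Cloze Sentence and Cloze Line; only the input text differs.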

Frequency - the total number of times the word appears in the document. If you have a document with 1,000 words, and the word ‘的’ appears 30 times, then the frequency for ‘的’ will be 30.

% Frequency - the same as Frequency, but expressed as a percentage of the total words in the document. E.g. if the word appeared 30 times and there were 1,000 words in the document, it would have a % Frequency of 3%.
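The arithmetic behind Frequency and % Frequency can be sketched like this. The sketch assumes the document has already been segmented into a list of words (segmentation itself is not shown), and frequencies is a hypothetical helper name.

```python
from collections import Counter

def frequencies(tokens):
    """Map each word to (count, percentage of all words in the document)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (count, 100.0 * count / total)
            for word, count in counts.items()}

# A stand-in document of 1,000 words in which '的' appears 30 times.
tokens = ["的"] * 30 + ["其他"] * 970
stats = frequencies(tokens)
# stats["的"] == (30, 3.0)
```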

Cumulative Frequency - the running total of % Frequency, i.e. the % Frequency of this word plus that of every more frequent word, with words ordered from most frequent to least frequent.

For example, if the most frequent word has a % Frequency of 10%, the next most frequent word has 5%, and the one after that has 3%, then their Cumulative Frequencies are 10%, 15% and 18% (10 + 5 = 15, 15 + 3 = 18), and so on for every word in the document, up to 100%.

This figure tells you what percentage of the words in the document you will understand once you have learnt all the most frequent words up to a given point.
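The running total described above can be reproduced with itertools.accumulate. This is a sketch only: cumulative_frequencies is a hypothetical helper, and the placeholder words and percentages mirror the 10%, 5% and 3% example rather than real data.

```python
from itertools import accumulate

def cumulative_frequencies(percent_by_word):
    """Rank words by % Frequency (highest first) and attach the running total."""
    ranked = sorted(percent_by_word.items(),
                    key=lambda item: item[1], reverse=True)
    totals = accumulate(pct for _, pct in ranked)
    return [(word, total) for (word, _), total in zip(ranked, totals)]

# Placeholder words with the percentages from the example above.
result = cumulative_frequencies({"b": 5.0, "a": 10.0, "c": 3.0})
# result == [("a", 10.0), ("b", 15.0), ("c", 18.0)]
```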

First Occurrence - the position in the document at which the word first appears (based on its byte offset from the beginning of the file). The lower the number, the earlier the word appears in the document. The higher the number, the later it appears.

For example, in the sentence ‘右边中间部分有一个表头’, ‘右边’ would have the lowest First Occurrence (because it appears first) and ‘表头’ would have the highest (because it appears last).

This lets you prioritise words by the order in which they first appear.
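To make the byte-offset idea concrete, the sketch below computes an offset for the example sentence. It assumes the file is encoded as UTF-8 (the actual encoding may differ) and uses a plain substring search in place of real word segmentation; first_occurrence is a hypothetical helper.

```python
def first_occurrence(text, word, encoding="utf-8"):
    """Return the byte offset of the first match of word in text, or None."""
    index = text.find(word)
    if index < 0:
        return None
    # Bytes taken up by everything before the match.
    return len(text[:index].encode(encoding))

sentence = "右边中间部分有一个表头"
offset_first = first_occurrence(sentence, "右边")  # 0
offset_last = first_occurrence(sentence, "表头")   # 27 (nine 3-byte characters precede it)
```

The absolute numbers matter less than their relative order: a smaller offset means the word appears earlier in the document.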

HSK Level - the HSK level of the word.

TOCFL Level - the TOCFL level of the word.

Mark exported words as known

If you check the Mark exported words as known option, then any exported words will also be marked as ‘known’. This is useful if you are regularly exporting words to learn in a flashcard program, and don’t want to have to manually mark each word as known.