LAL: Linear Arrangement Library 21.07.01
A library focused on algorithms on linear arrangements of graphs.
A reader for a collection of treebanks.
#include <treebank_collection_reader.hpp>
Public Member Functions
treebank_error init (const std::string &main_file) noexcept
    Initialise the reader with a new collection.
bool end () const noexcept
void next_treebank () noexcept
    Opens the file of the next treebank in the main file.
treebank_reader & get_treebank_reader () noexcept
    Returns a treebank reader class instance for processing a treebank.
Private Member Functions
void step_line () noexcept
    Consumes one line of the main file m_main_file.
Private Attributes
std::string m_main_file = "none"
    File containing the list of languages and their treebanks.
std::string m_cur_treebank_name = "none"
    The name of the current treebank.
std::string m_cur_treebank_filename = "none"
    The name of the current treebank file.
std::ifstream m_list
    Handler for main file reading.
treebank_reader m_treebank_reader
    Object to process a language's treebank.
bool m_reached_end = false
    Did we reach the end of the main file?
bool m_no_more_treebanks = false
    Have all treebanks in the main file been consumed?
A reader for a collection of treebanks.
This class, whose objects will be referred to as "collection readers", is an interface that helps you process a set of treebanks in a custom way. A treebank collection is a set of files, each of which is a treebank. A treebank is a file containing one or more lines, each describing a syntactic dependency tree. If you only want to output features already calculated by the library, use class treebank_collection_processor instead; that class is much easier to use and can process treebanks in parallel.
Each tree in a treebank file is formatted as a list of non-negative integers, one per node of the tree. The number at a given position is the parent of the node at that position, with positions counted from 1, and the number 0 marks the root. For example, number 4 at position 9 means that node 9 has parent node 4, and number 0 at position 1 means that node 1 is the root of the tree. A complete example of such a representation is the following:
0 3 4 1 6 3
which should be interpreted as
predecessor:      0 3 4 1 6 3
node of the tree: 1 2 3 4 5 6
Note that lines like these are not valid:
(1) 0 2 2 2 2 2
(2) 2 0 0
Line (1) is not valid because of the self-reference in the second position (node 2 is listed as its own parent), and line (2) is not valid because it contains two '0' (i.e., two roots).
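The validity rules above can be checked mechanically. The following is a minimal sketch and not part of the library: the function names parse_head_vector and is_valid_head_vector are hypothetical, and the cycle check is an extra condition assumed here because a tree cannot contain cycles.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parses one treebank line into a head vector: heads[i] is the parent of
// node i+1, and 0 marks the root. Returns false if some token is not a
// non-negative integer. (Hypothetical helper, not a library function.)
bool parse_head_vector(const std::string& line, std::vector<std::size_t>& heads) {
    heads.clear();
    std::istringstream iss(line);
    std::string token;
    while (iss >> token) {
        if (token.find_first_not_of("0123456789") != std::string::npos) {
            return false;
        }
        heads.push_back(std::stoul(token));
    }
    return true;
}

// Checks that a head vector describes a rooted tree: exactly one root,
// parents within range, no self-references and no cycles.
bool is_valid_head_vector(const std::vector<std::size_t>& heads) {
    const std::size_t n = heads.size();
    if (n == 0) { return false; }

    std::size_t roots = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t node = i + 1;
        const std::size_t parent = heads[i];
        if (parent == 0) { ++roots; continue; }
        if (parent > n || parent == node) { return false; }
    }
    if (roots != 1) { return false; }

    // Walk from every node towards the root; taking more than n steps
    // means the walk is trapped in a cycle.
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t cur = i + 1;
        std::size_t steps = 0;
        while (heads[cur - 1] != 0) {
            cur = heads[cur - 1];
            if (++steps > n) { return false; }
        }
    }
    return true;
}
```

With these helpers, the line 0 3 4 1 6 3 is accepted, while lines (1) and (2) above are rejected for the reasons given.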
A treebank collection reader helps you navigate through a treebank collection. It takes care of initialising the treebank_reader class, which is what you need to process a treebank file in a custom way.
The treebank files are referenced within a "main file list", henceforth called the main file. Each line of the main file contains exactly two strings describing one treebank. The first string is a self-descriptive name of the treebank (e.g., the ISO code of a language), and the second is the relative path to the file containing the syntactic dependency trees (e.g., the syntactic dependency trees of a language in a collection). The path is relative to the directory that contains the main file.
For example, the main file could be called stanford.txt, representing the Stanford treebank collection, and could contain:
arb path/to/file/ar-all.heads2
eus path/to/file/eu-all.heads2
ben path/to/file/bn-all.heads2
...
where the first column contains a string referencing the language (in this case, an ISO code), and the second column contains the relative path to the file with the syntactic dependency trees.
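To make the two-column format concrete, here is a minimal sketch of how such a main file could be parsed by hand. It is not the library's implementation: treebank_entry, parent_directory and read_main_file are hypothetical names, paths are assumed to use '/' as separator, and skipping blank or incomplete lines is an assumption of this sketch.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One entry of the main file: a treebank name and the path to its file.
// (Hypothetical type used only for this illustration.)
struct treebank_entry {
    std::string name;
    std::string path;
};

// Returns the directory part of `file`, up to and including the last '/',
// or an empty string when `file` has no directory component.
std::string parent_directory(const std::string& file) {
    const std::string::size_type slash = file.find_last_of('/');
    return slash == std::string::npos ? std::string() : file.substr(0, slash + 1);
}

// Reads "name relative/path" pairs from `in` and resolves every path
// against the directory that contains the main file.
std::vector<treebank_entry> read_main_file(std::istream& in, const std::string& main_file) {
    const std::string dir = parent_directory(main_file);
    std::vector<treebank_entry> entries;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        treebank_entry entry;
        std::string relative_path;
        // Lines that do not hold two strings are skipped (assumption).
        if (fields >> entry.name >> relative_path) {
            entry.path = dir + relative_path;
            entries.push_back(entry);
        }
    }
    return entries;
}
```

Taking a std::istream rather than a file name keeps the sketch easy to try on an in-memory string; an std::ifstream opened on the main file can be passed in the same way.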
The user has to initialise a collection reader with the main file (the main file list). For example, to read the Stanford collection the reader has to be initialised with the main file stanford.txt which could contain the contents given above. Bear in mind that a collection reader only processes the main file: it iterates through the list of files within the main file using the method next_treebank. This method can be called as long as method end returns false. Each call to next_treebank builds an object of class treebank_reader which allows the user to iterate through the trees within the corresponding file. This object can be retrieved by calling method get_treebank_reader.
An example of usage of this class is given in the following piece of code.
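The loop can be illustrated without linking against the library. The sketch below is not the library's own listing: toy_treebank_reader and toy_collection_reader are hypothetical stand-ins that keep the treebanks in memory. Only the method names init, end, next_treebank and get_treebank_reader come from the description above; the methods of the toy treebank reader, and the assumption that init positions the reader on the first treebank, are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for treebank_reader: walks the lines of one treebank.
class toy_treebank_reader {
public:
    void init(const std::vector<std::string>& trees) { m_trees = trees; m_pos = 0; }
    bool end() const { return m_pos >= m_trees.size(); }
    const std::string& current_line() const { assert(!end()); return m_trees[m_pos]; }
    void next_tree() { ++m_pos; }
private:
    std::vector<std::string> m_trees;
    std::size_t m_pos = 0;
};

// Hypothetical stand-in for the collection reader. It holds the treebanks
// in memory instead of reading a main file from disk.
class toy_collection_reader {
public:
    typedef std::vector<std::string> treebank;

    // Loads the collection and moves to its first treebank (assumption).
    void init(const std::vector<treebank>& collection) {
        m_collection = collection;
        m_next = 0;
        m_no_more_treebanks = false;
        next_treebank();
    }
    bool end() const { return m_no_more_treebanks; }
    // Prepares the reader for the next treebank, or flags the end.
    void next_treebank() {
        if (m_next >= m_collection.size()) {
            m_no_more_treebanks = true;
            return;
        }
        m_reader.init(m_collection[m_next++]);
    }
    toy_treebank_reader& get_treebank_reader() { return m_reader; }
private:
    std::vector<treebank> m_collection;
    std::size_t m_next = 0;
    toy_treebank_reader m_reader;
    bool m_no_more_treebanks = false;
};

// The iteration pattern: visit every treebank, then every tree in it.
std::size_t count_trees(toy_collection_reader& collection) {
    std::size_t total = 0;
    while (!collection.end()) {
        toy_treebank_reader& reader = collection.get_treebank_reader();
        for (; !reader.end(); reader.next_tree()) {
            ++total;  // custom processing of reader.current_line() goes here
        }
        collection.next_treebank();
    }
    return total;
}
```

The outer loop mirrors the protocol of the collection reader (end, get_treebank_reader, next_treebank), and the inner loop is where the per-tree processing would go.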
bool end () const noexcept [inline]
Returns true if there are no more treebanks to be read, and false otherwise.
treebank_error init (const std::string &main_file) noexcept
Initialise the reader with a new collection.
Objects of this class can't be used to read a treebank until this method returns no error.
Parameters
    main_file  Main file of the collection.
void next_treebank () noexcept
Opens the file of the next treebank in the main file.
This method can be called even after it has returned an error.
std::string m_main_file = "none" [private]
File containing the list of languages and their treebanks.
Each line of this file contains two strings: the first is the language name (used mainly for debugging purposes), and the second is the name of the file containing the syntactic dependency trees of that language.