LAL: Linear Arrangement Library 21.07.01
A library focused on algorithms on linear arrangements of graphs.
lal::io::treebank_collection_reader Class Reference

A reader for a collection of treebanks. More...

#include <treebank_collection_reader.hpp>

Public Member Functions

treebank_error init (const std::string &main_file) noexcept
 Initialise the reader with a new collection.
 
bool end () const noexcept
 
void next_treebank () noexcept
 Opens the file of the next treebank in the main file.
 
treebank_reader & get_treebank_reader () noexcept
 Returns a treebank reader class instance for processing a treebank.
 

Private Member Functions

void step_line () noexcept
 Consumes one line of the main file m_main_file.
 

Private Attributes

std::string m_main_file = "none"
 File containing the list of languages and their treebanks.
 
std::string m_cur_treebank_name = "none"
 The name of the current treebank.
 
std::string m_cur_treebank_filename = "none"
 The name of the current treebank file.
 
std::ifstream m_list
 Handler for main file reading.
 
treebank_reader m_treebank_reader
 Object to process a language's treebank.
 
bool m_reached_end = false
 Did we reach the end of the file?
 
bool m_no_more_treebanks = false
 Have all trees in the file been consumed?
 

Detailed Description

A reader for a collection of treebanks.

This class, whose objects are referred to as "collection readers", is an interface for custom processing of a set of treebanks. A treebank collection is a set of files, each of which is a treebank. A treebank is a file containing one or more lines, each describing a syntactic dependency tree. If you only want to output features already calculated by the library, use class treebank_collection_processor instead; that class is much easier to use and can process treebanks in parallel.

Each tree in a treebank file is formatted as a list of non-negative integers, one per node of the tree. The number at a given position is the parent of the node with that position, and the number 0 marks the root of the tree. For example, when number 4 is at position 9 it means that node 9 has parent node 4. Likewise, if number 0 is at position 1 it means that node 1 is the root of the tree. A complete example of such a tree's representation is the following:

  0 3 4 1 6 3

which should be interpreted as

predecessor:       0 3 4 1 6 3
node of the tree:  1 2 3 4 5 6

Note that lines like these are not valid:

(1) 0 2 2 2 2 2
(2) 2 0 0

Line (1) is not valid because of the self-reference in the second position, and line (2) is not valid because it contains two '0' (i.e., two roots).
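
The rules above can be sketched as a small stand-alone check. This is only an illustration and not part of LAL: the function names parse_head_vector and is_valid_head_vector are hypothetical, treating an empty line as invalid is an assumption, and the library's own validation may differ. Besides the two errors shown above (self-reference and multiple roots), the sketch also rejects out-of-range parents and cycles, since such lines cannot describe a tree either.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not LAL API): parses a line such as "0 3 4 1 6 3"
// into a head vector, where heads[i] is the parent of node i+1 and 0
// marks the root. Returns std::nullopt if a token is not a non-negative
// integer of reasonable size.
std::optional<std::vector<std::size_t>> parse_head_vector(const std::string& line) {
    std::istringstream iss(line);
    std::vector<std::size_t> heads;
    std::string token;
    while (iss >> token) {
        const bool only_digits =
            token.find_first_not_of("0123456789") == std::string::npos;
        if (!only_digits || token.size() > 9) { return std::nullopt; }
        heads.push_back(std::stoul(token));
    }
    return heads;
}

// Hypothetical helper (not LAL API): checks that a head vector describes
// a single rooted tree.
bool is_valid_head_vector(const std::vector<std::size_t>& heads) {
    const std::size_t n = heads.size();
    if (n == 0) { return false; }  // assumption: an empty line is not a tree

    std::size_t roots = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t node = i + 1;
        if (heads[i] > n) { return false; }      // parent does not exist
        if (heads[i] == node) { return false; }  // self-reference, as in line (1)
        if (heads[i] == 0) { ++roots; }
    }
    if (roots != 1) { return false; }            // e.g. two roots, as in line (2)

    // Every node must reach the root by following parents; taking more
    // than n steps means the walk is trapped in a cycle.
    for (std::size_t node = 1; node <= n; ++node) {
        std::size_t cur = node;
        std::size_t steps = 0;
        while (heads[cur - 1] != 0) {
            cur = heads[cur - 1];
            if (++steps > n) { return false; }
        }
    }
    return true;
}
```

With these helpers, the line 0 3 4 1 6 3 is accepted, while lines (1) and (2) are rejected. The cycle check is quadratic in the worst case, which keeps the sketch short; a production check would likely mark visited nodes instead.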

A treebank collection reader helps you navigate through a treebank collection. It takes care of initialising the treebank_reader class, which is what you need for custom processing of a treebank file.

The treebank files are referenced within a "main file list", henceforth called the main file. Each line of the main file contains exactly two strings describing a treebank. The first string is a self-descriptive name of the treebank (e.g., the ISO code of a language), and the second is the relative path to the file containing the syntactic dependency trees (e.g., the syntactic dependency trees of a language in a collection). The path is relative to the directory that contains the main file.

For example, the main file could be called stanford.txt, representing the Stanford treebank collection, and could contain:

arb path/to/file/ar-all.heads2
eus path/to/file/eu-all.heads2
ben path/to/file/bn-all.heads2
...

where the first column contains a string referencing the language (in this case, an ISO code), and the second column contains the relative path to the file with the syntactic dependency trees.
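
To make the main file format concrete, the following stand-alone sketch reads such a list and resolves each path against the directory of the main file. It is not how LAL implements the reader: the names treebank_entry and read_main_file are hypothetical, and silently skipping lines that do not contain exactly two strings is an assumption made for this illustration. Paths containing whitespace are not handled.

```cpp
#include <filesystem>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type (not LAL API): one line of the main file.
struct treebank_entry {
    std::string name;            // first column, e.g. an ISO language code
    std::filesystem::path file;  // second column, resolved against the main file's directory
};

// Hypothetical helper (not LAL API): reads the contents of a main file
// from `in`. `main_file` is the path of the main file itself and is only
// used to resolve the relative paths in the second column.
std::vector<treebank_entry> read_main_file(
    std::istream& in,
    const std::filesystem::path& main_file
) {
    const std::filesystem::path base = main_file.parent_path();
    std::vector<treebank_entry> entries;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::string name, relative_path, extra;
        // Assumption: lines without exactly two strings are skipped.
        if (!(iss >> name >> relative_path) || (iss >> extra)) { continue; }
        entries.push_back({name, base / relative_path});
    }
    return entries;
}
```

For a main file located at collections/stanford.txt (a made-up location), the entry for arb above would resolve to collections/path/to/file/ar-all.heads2.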

The user has to initialise a collection reader with the main file (the main file list). For example, to read the Stanford collection the reader has to be initialised with the main file stanford.txt which could contain the contents given above. Bear in mind that a collection reader only processes the main file: it iterates through the list of files within the main file using the method next_treebank. This method can be called as long as method end returns false. Each call to next_treebank builds an object of class treebank_reader which allows the user to iterate through the trees within the corresponding file. This object can be retrieved by calling method get_treebank_reader.

An example of usage of this class is given in the following piece of code.

lal::io::treebank_collection_reader tbcolreader;
// it is advisable to check for errors
auto err = tbcolreader.init(mainf);
while (not tbcolreader.end()) {
    lal::io::treebank_reader& tbreader = tbcolreader.get_treebank_reader();
    if (tbreader.is_open()) {
        // here goes your custom processing of the treebank
        // ...
    }
    // always advance, otherwise the loop would never terminate
    tbcolreader.next_treebank();
}

Member Function Documentation

◆ end()

bool lal::io::treebank_collection_reader::end ( ) const
inline noexcept

Returns true if there are no more treebanks to be read, and false otherwise.

◆ init()

treebank_error lal::io::treebank_collection_reader::init ( const std::string & main_file)
noexcept

Initialise the reader with a new collection.

Objects of this class can't be used to read a treebank until this method returns no error.

Parameters
main_file  Main file of the collection.
Returns
The type of the error, if any. The list of errors that this method can return is:

◆ next_treebank()

void lal::io::treebank_collection_reader::next_treebank ( )
noexcept

Opens the file of the next treebank in the main file.

This method can be called even after it has returned an error.

Member Data Documentation

◆ m_main_file

std::string lal::io::treebank_collection_reader::m_main_file = "none"
private

File containing the list of languages and their treebanks.

Each line of this file contains two strings: the first is the language name (used mainly for debugging purposes), and the second is the name of the file containing the syntactic dependency trees of that language.

