LAL: Linear Arrangement Library 24.10.00
A library focused on algorithms on linear arrangements of graphs.
|
This library offers a wide variety of algorithms related to linear arrangements of graphs so as to provide researchers (mainly, but not exclusively) in the field of Quantitative Linguistics with a toolset with which they can perform statistical analyses on different corpora of treebanks efficiently and effectively. This library implements several state-of-the-art algorithms and offers a variety of functionalities. While most of the functions have been generalised to be applicable to general graphs, we also provide specialised functions for trees, which are more efficient than their more general counterparts. In order to understand the documentation of the functions and classes in this library, readers are urged to read through the Important concepts in LAL page to learn about the terminology and notation.
The main goal of this library is to provide algorithms with which the library's users can use to do statistical studies. One of the most attractive features offered in this library is that of treebank collection processing. This library offers a class that automatically processes a collection and computes several metrics based on the capabilities of the library. See class lal::io::treebank_collection_processor and lal::io::treebank_processor for details. We also provide classes for custom processing of treebanks (see lal::io::treebank_collection_reader and lal::io::treebank_reader).
All the features of syntactic dependency trees that can be calculated with the algorithms in this library are gathered in the namespaces lal::linarr and in lal::properties. These features include, but are not limited to,
As extra features, useful for experimentation, are the generation of different types of trees, all of which are available in the lal::generate namespace. We have implemented existing techniques (cited accordingly) or made or own to enumerate
and to generate uniformly at random
The documentation of each class includes usage examples.
This library implements several types of graphs that can be found in the lal::graphs namespace.
In order to be able to use the library comfortably, its users must take good note of the different data structures that are the library's core.
With LAL, most metrics can be calculated as exact rational numbers (see lal::numeric::rational), but also as floating point values of double precision. For the former, add the suffix '_rational' at the end of the function name. For example, the function lal::properties::var_num_crossings returns the variance of the number of crossings 'C' as a floating point value. By adding the suffix, i.e., lal::properties::var_num_crossings_rational, we obtain said variance as an exact rational value. To see what a rational value is in the context of this library, see the documentation of the namespace lal::numeric.
Most operations in this library are done using exact integer and rational arithmetic. Such arithmetic is powered by the GMP library (see [26]). We have wrapped their C structures into the classes lal::numeric::integer and lal::numeric::rational.
As it should be expected, this library offers a number of different graph abstractions: undirected graphs (see lal::graphs::undirected_graph), directed graphs (see lal::graphs::directed_graph), free trees (see lal::graphs::free_tree) and roted trees (see lal::graphs::rooted_tree), all of which can be found within the lal::graphs namespace.
Although all graphs should be regarded as unlabelled, each node carries an implicit labelling. Such labelling has a most trivial nature since each node is labelled with a number between 0 and the total number of vertices minus one.
Due to most graphs being sparse, the data structure of choice are adjacency lists where each vertex has a list of neighbouring nodes, or simply neighbors, associated to it. The user can affect the order of appearance of neighbors in multiple ways. One of them is, evidently, the order in which edges are added. Another way is via the lal::graphs::graph::normalize function, which sorts every list of neighbors increasingly by index. By default, the addition of edges is normalising, namely the following code
produces the same output as
which is
0: 1 2 3 1: 0 2: 0 3: 0
The output is easy to interpret: the first line indicates the nodes are incident to vertex 0, the second line indicates the nodes incident to vertex 1, and so on. Without normalisation (neither the call to lal::graphs::free_tree::normalize, and using 'false' in the optional parameters), the output is
0: 1 3 2 1: 0 2: 0 3: 0
where only the first line changes.
Such normalisation is required by some of the algorithms in this library. Without proper normalisation, the algorithms are not likely to compute correct values. The parameter that governs the graphs' normalisation is called the normalisation parameter.
The adjacency list structure has been extended to directed graphs in a way that the user can query them for in-degree (see lal::graphs::directed_graph::get_in_degree) and in-neighbors (see lal::graphs::directed_graph::get_in_neighbors).
NOTE Pages Notation in the library's documentation and Important concepts in LAL further explain everything there is to know about notation and concepts.
Users will note, after browsing through the library's documentation, that several concepts are quite ubiquitous. The lal::node type is simply a typedef of an unsigned integer type, and the lal::edge type is simply a STL's pair of nodes. Moreover, users need to have a deep understanding of what a head vector is (see lal::head_vector), which is what allows users to easily read trees from files and process a file containing a large collection of trees (see lal::io namespace).
A more advanced concept is that of linear arrangement (see lal::linear_arrangement). In this library, a linear arrangement is viewed as a function that relates each node to a position in a linear sequence. Due to the properties of such functions, a linear arrangement is implemented with the STL's vector. Note that the concept of linear arrangement has been detached from that of trees, and the pair of a linear arrangement and a tree forms, in the context of the library, a syntactic dependency tree (this is why this class is not implemented). The symbol of choice for representing a linear arrangement in the library is the greek letter for the number pi \(\pi\).
Now, many functions (see those fuctions within the lal::linarr namespace) admit a linear arrangement which can be empty. Whenever it is empty, i.e., the value of the parameter is an empty vector, the positions of the nodes of the graphs in question are given by their implicit label. Such empty arrangement is called, in the context of the library, the identity arrangement symbolised with \(\pi_I\). Therefore, the following measurement of the sum of the lengths of the edges are equivalent
The possibility of expliciting a linear arrangement increases the flexibility of the library. For example, for the purposes of illustration, one can calculate the expected sum of the length of the edges as follows
As a rule of the thumb, the user is encouraged not to change the default value of the parameters whenever they are given. However, certain operations can be less efficient than others, and sometimes it is even desirable to use values different from the default ones.
One the one hand, the wrong choice of operation can affect the library's performance gravely. For example, the addition/deletion of edges to/from graphs is slower when it is done edge by edge than when it is done in bulk. Users are highly encouraged to add/delete them in bulk using the appropriate functions (see, for example, lal::graphs::undirected_graph::add_edges and lal::graphs::undirected_graph::remove_edges). Although correct, the following code is discouraged
while the next piece of code is strongly encouraged whenever possible
A similar reasoning should be applied to the deletion of edges.
Furthermore, graphs are seldom required to be normalized. For example, when calculating the variance of \(C\) (see lal::properties::var_num_crossings), it is mandatory that the graph be normalized, namely, the function has a precondition that requires the graph to be normalized. If such a function is to be called eventually then add all edges in bulk and with normalisation, or read the graph from disk also with normalisation. However, if such functions will never be called then the users are encouraged to set the normalisation parameter to false. For example, if the variance of \(C\) is to be calculated,
but if not