# Multiset Discrimination

## Overview

*Multiset discrimination* generalizes equality or equivalence testing
from two to arbitrarily many arguments. It is a family of fundamental
algorithmic techniques for partitioning a set of (possibly)
structured input values into equivalence classes according to a given
equivalence relation. Such an equivalence relation may be name equality,
structural equivalence (structural equality), or structural equality modulo
associativity, commutativity and idempotence of one or more binary
operators. It extends to other equivalence relations by building on top of
these.

Multiset discrimination applies to acyclic data structures without sharing
(including "unboxed" data) and to data structures with sharing
("boxed" data with shared pointers), and executes in both cases in
worst-case linear time, without use of hashing or address arithmetic. It
also applies to cyclic data structures, where it executes in *O(m log n)*
time for data structures with *n* nodes and *m* edges, where neighbor
lists may be interpreted as sequences, bags or sets. Finally, these
algorithms can be defined polytypically, which means that multiset
discrimination can be efficiently extended to user-defined abstract data types
as long as the primitive data types used in their implementation themselves
provide a multiset discrimination method.
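The basic building block can be sketched as follows: assuming keys are small natural numbers, values with equal keys can be grouped in linear time by array-based bucketing, resetting only the buckets actually touched, so neither hashing nor comparisons are needed. The function name and interface below are our own illustration, not a formulation from the literature.

```python
def disc_nat(pairs, max_key):
    """Partition (key, value) pairs into groups of values with equal key.

    Keys are natural numbers below max_key.  Runs in O(len(pairs)) time
    plus O(1) per *occupied* bucket: we record which buckets we touch
    and reset only those, so the bucket table can be reused without an
    O(max_key) sweep.  No hashing, no comparisons on keys.
    """
    buckets = [None] * max_key   # reusable table; a library would keep it between calls
    touched = []                 # keys seen in this call, in first-seen order
    for k, v in pairs:
        if buckets[k] is None:
            buckets[k] = []
            touched.append(k)
        buckets[k].append(v)
    groups = [buckets[k] for k in touched]
    for k in touched:            # reset only the occupied buckets
        buckets[k] = None
    return groups

# Example: values 'a' and 'c' share key 3, so they end up in one group.
# disc_nat([(3, 'a'), (1, 'b'), (3, 'c')], 10)  ->  [['a', 'c'], ['b']]
```

Discriminators for richer key types (characters, strings, sequences, bags, sets) are built by reducing them, level by level, to this natural-number case.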

## History

Multiset discrimination was developed by Bob Paige and his collaborators over
a period of about 15 years (ca. 1985-1997). He coined the name "multiset
discrimination" to refer to the problem of *discriminating* the
elements of an input list (finding duplicates and counting them), where the
input list is to be understood as the presentation of a *multiset*, that is,
a sequence where the order of elements does not matter. Multiset
discrimination was first developed by Paige and Tarjan (1987) for discriminating
strings of characters, which is central to their left-to-right lexicographic
sorting algorithm. Cai and Paige (1991, 1995) demonstrated that numerous language
processing problems where hashing is used, ranging from symbol table
construction to iterated strength reduction, can be solved efficiently using
multiset discrimination of strings. Paige (1991) subsequently extended multiset
discrimination to discrimination of not only strings, but more generally
sequences of arbitrary (discriminated) elements, and also (list representations of)
bags and sets of such elements. The hard part here is dealing with
commutativity, that is the fact that the elements of two bags (or sets) may occur in
arbitrary order and yet the two bags (sets) may be equivalent. The key
ingredient here is what Paige termed *weak sorting*, a simple and efficient
(linear-time) method -- by using basic multiset discrimination twice -- for
sorting a collection of sequences of elements according to a sorting order of
the elements that is computed on the fly, without use of any comparison function
on the elements. He noted that this can be extended to multiset
discrimination of arbitrary tree-structured elements by processing them
bottom-up from the leaves. Although he did not note it explicitly, it is
implicit in the bottom-up nature of the processing that this also works in
linear time for dags, not just trees. Taken together, this work from
ca. 1985-1991 yields an algorithmic
toolbox for linear-time discrimination of structured data, whether shared or
not, as long as they are acyclic. Paige (1994) and Paige and Yang
(1997) developed techniques for transforming serialized
("external") data representation of structured data into type-driven
internal forms in linear time and with full dagification and elimination of
duplicates. Zibin, Gil and Considine (2003) solve an interesting type
isomorphism problem with distributivity and do so, curiously, by rediscovering
some basic multiset discrimination techniques, without referring to any of the
previous work on multiset discrimination.
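The weak-sorting idea described above can be sketched roughly as follows. This is a hedged illustration only: Python dicts stand in for the array-based bucketing of basic multiset discrimination (Paige's version uses no hashing), and all names are ours. Pass 1 assigns each distinct element a code in order of first occurrence (the ordering "computed on the fly"); pass 2 groups equal code sequences by most-significant-digit radix grouping, so equal input sequences land in the same group without any comparison function on the elements.

```python
def weak_sort(seqs):
    """Group equal sequences together, without comparing elements.

    Returns a list of groups, each group a list of indices into seqs;
    two indices share a group iff their sequences are equal.
    """
    # Pass 1: code elements by first occurrence.
    codes = {}
    coded = []
    for s in seqs:
        cs = []
        for x in s:
            if x not in codes:
                codes[x] = len(codes)
            cs.append(codes[x])
        coded.append(tuple(cs))

    # Pass 2: most-significant-digit radix grouping of the code
    # sequences (dicts stand in for Paige's array buckets).
    def group(items, pos):
        done, buckets, order = [], {}, []
        for cs, i in items:
            if len(cs) == pos:          # sequence exhausted: a finished group
                done.append(i)
            else:
                d = cs[pos]
                if d not in buckets:
                    buckets[d] = []
                    order.append(d)
                buckets[d].append((cs, i))
        out = [done] if done else []
        for d in order:
            out.extend(group(buckets[d], pos + 1))
        return out

    return group([(cs, i) for i, cs in enumerate(coded)], 0)
```

For example, `weak_sort([['b','a'], ['a'], ['b','a']])` groups indices 0 and 2 together and leaves index 1 in its own group.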

The discrimination of acyclic data using the linear-time methods
referred to above is what Bob Paige collectively referred to as
multiset discrimination. In parallel with this, he was
involved with partitioning problems for cyclic data structures.
Indeed there is a parallel track of algorithmic work dating back to
1980 that deals with what is essentially multiset discrimination of
*cyclic* data structures, even though it is not called that by
the respective authors. We shall extend the term multiset
discrimination to those problems, since they solve the same problem for
cyclic, not just acyclic, data structures, albeit not in
linear time. Downey, Sethi and Tarjan (1980) devised an
algorithm for partitioning the nodes in a directed graph with
*ordered* outedges (that is, the collection of neighbors for each
node constitute a *sequence*) in time *O(m log n)* for
graphs with *n* nodes and *m* edges such that nodes are in
the same block of the partition if and only if they are isomorphic;
that is, their neighbor lists are also elementwise
isomorphic. They referred to this problem as the *common
subexpression problem*. Cardon and Crochemore (1982) developed an
algorithm with the same time bound for the same problem, but where the
collection of neighbors of each node is viewed as denoting an
unordered list or *bag* of nodes. They referred to this
problem as finding the *coarsest regular congruence* refining an
initial partition. Finally, Paige and Tarjan (1987) solved the
corresponding problem, again with the same time bound, where the
collection of neighbors of each node is viewed as denoting a *set*,
referring to it as the *coarsest relational partition refinement*
problem. They also provided a simplified algorithm
for Cardon and Crochemore's problem, calling it the *size-stable
coarsest partition refinement* problem. In our terminology
the three algorithms solve multiset discrimination where the input
elements may be (pointers to) sequences, bags, or sets (of pointers),
respectively, and where they may be cyclic (which is why we write
'pointers', as cyclicity otherwise would be in violation of the
well-foundedness axiom of Zermelo-Fraenkel set theory).

Henglein (2003) develops multiset discrimination for acyclic data
in a general, polytypic framework and demonstrates that the
above-mentioned three algorithms can be combined into a single *O(m
log n)* multiset discrimination algorithm for cyclic data
structures that may contain any mixture of sequences, bags and sets.
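To illustrate *what* these partition refinement algorithms compute (though not their *O(m log n)* machinery), here is a naive fixpoint sketch for the sequence case: every node is repeatedly re-coded by its current class together with the classes of its ordered neighbors, until the partition stabilizes. This is our own illustration, not any of the cited algorithms, and its running time is worse than theirs.

```python
def discriminate_cyclic(neighbors):
    """Partition graph nodes so that two nodes share a class iff their
    (ordered) neighbor lists are elementwise in the same class.

    neighbors: dict mapping node -> list of successor nodes; cycles are
    allowed.  Returns dict mapping node -> class id.  Starting from the
    one-block partition and refining to a fixpoint yields the coarsest
    such partition; the cited algorithms compute the same partition in
    O(m log n) via careful refinement, which we do not attempt here.
    """
    cls = {v: 0 for v in neighbors}
    while True:
        # Signature = own class plus the sequence of neighbor classes.
        sig = {v: (cls[v], tuple(cls[w] for w in neighbors[v]))
               for v in neighbors}
        renum, new = {}, {}
        for v in neighbors:
            s = sig[v]
            if s not in renum:
                renum[s] = len(renum)
            new[v] = renum[s]
        if new == cls:      # partition stable: done
            return cls
        cls = new
```

On two isomorphic 2-cycles plus an isolated node, all four cycle nodes end up in one class and the isolated node in another.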

## Literature

- P. Downey, R. Sethi, and R. Tarjan, *Variations on the common subexpression problem*, Journal of the ACM, 27(4):758--771, October 1980.
- A. Cardon and M. Crochemore, *Partitioning a graph in O(|A| log₂ |V|)*, Theoretical Computer Science (TCS), 19:85--98, 1982.
- Robert Paige and Robert E. Tarjan, *Three partition refinement algorithms*, SIAM Journal of Computing, 16(6):973--989, December 1987.
- Jiazhen Cai and Robert Paige, *Look Ma, No Hashing, and No Arrays Neither*, Proc. 18th Annual ACM Symp. on Principles of Programming Languages (POPL), Orlando, Florida, Jan. 1991, pp. 143--154.
- Robert Paige, *Optimal Translation of User Input in Dynamically Typed Languages*, unpublished manuscript, July 1991.
- Robert Paige, *Efficient Translation of External Input in a Dynamically Typed Language*, Proc. 13th World Computer Congress, Vol. 1, Pehrson, B. and Simon, I. (editors), Elsevier Science B.V. (North Holland), Feb. 1994.
- Jiazhen Cai and Robert Paige, *Using multiset discrimination to solve language processing problems without hashing*, Theoretical Computer Science (TCS), 145(1-2):189--228, July 1995.
- Robert Paige and Zhe Yang, *High Level Reading and Data Structure Compilation*, Proc. 24th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages (POPL), Paris, France, Jan. 1997, ACM Press, pp. 456--469.
- Somesh Jha, Jens Palsberg, Tian Zhao, Fritz Henglein, *Efficient Type Matching*, Technical Report, DIKU, University of Copenhagen, TOPPS report D-474, 2002.
- Yoav Zibin, Joseph Gil, Jeffrey Considine, *Efficient Algorithms for Isomorphisms of Simple Types*, Proc. 2003 ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pp. 160--171, Jan. 2003, ACM Press; also appeared in ACM SIGPLAN Notices, Vol. 38, No. 1.
- Fritz Henglein, *Multiset discrimination*, 2003. Under preparation.

## Multiset discrimination and XML Store

### Sharing and garbage collection

Multiset discrimination is a method for identifying multiply stored elements in an XML Store and compressing them by contracting all copies of an element (value) to a single copy, freeing the remaining storage space. In contrast to (ordinary) hashing-based methods, multiset discrimination even works if the children of a node are interpreted as bags or sets, as in relational databases, not just sequences.

The use of a 'sharer' thread that performs sharing compaction (using multiset discrimination or any other technique) allows for a rapid, asynchronous implementation of the save operation in an XML Store: data are simply copied into the XML Store, without a (synchronous) check of whether they are (in part) already stored there. Instead of performing such a synchronous check to avoid storing multiple copies of an element, the sharer executes concurrently and eliminates multiple copies asynchronously.

It can be expected that synchronization problems known from concurrent garbage collection (though mitigated by the immutability of much of the stored data) will arise. Indeed, it can be expected that sharing processing could yield a (local) garbage collector since it basically operates like a tracing copying garbage collector driven by a (local) root set.

Little is known about the implementation and practical performance of multiset discrimination for in-memory data structures, and essentially nothing about the special characteristics (algorithmics, design and implementation) of multiset discrimination for on-disk data, even when restricted to local (single processor with local disk) processing. Beyond this, distributed multiset discrimination may turn out to be central to efficient management of peer-to-peer systems such as XML Store, and a crucial enabling technology for applications such as peer-to-peer-based backup.

### Sharing and compression

Multiset discrimination is a technique for compacting trees, dags and graphs
by contracting isomorphic nodes into a single node. Since it can keep
track of how many references there are to such a contracted node, it can be used
as the basis for *compression* of storage space using Huffman coding, where
frequently occurring references are coded using short bit strings and less
frequently occurring references using longer strings.

In general, storage compression of strings can be understood as the composition of the following phases:

- (parsing) parsing of the string into an append tree; that is, a tree whose internal nodes may have lists of 0, 1, 2 or more children and which are labeled with the associative append operator, and whose leaves are labeled with single characters;
- (sharing) multiset discrimination of the nodes to turn the tree into a dag;
- (coding) frequency-based coding of the dag nodes into a (bit) sequence.
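The sharing phase can be illustrated by a small bottom-up dagification sketch: each node is keyed by its label together with the identities of its already-shared children, and equal keys are contracted to one node. A Python dict stands in here for the linear-time multiset discrimination step, and the class and function names are ours; the reference counts gathered along the way are what frequency-based coding (the coding phase) would consume.

```python
class Node:
    """An append-tree node: a label and a tuple of children."""
    def __init__(self, label, children=()):
        self.label, self.children = label, tuple(children)

def dagify(root):
    """Share isomorphic subtrees bottom-up, turning a tree into a dag.

    Returns (shared_root, refs) where refs maps id(node) to the number
    of references to that contracted node.  A dict keyed on (label,
    child identities) stands in for the multiset discrimination step.
    """
    table, refs = {}, {}
    def share(n):
        kids = tuple(share(c) for c in n.children)   # children already shared
        key = (n.label, tuple(id(k) for k in kids))
        if key not in table:                         # first occurrence: keep it
            table[key] = Node(n.label, kids)
            refs[id(table[key])] = 0
        node = table[key]
        refs[id(node)] += 1                          # count this reference
        return node
    return share(root), refs

# Example: in the tree ('++' over two 'a' leaves), the two identical
# leaves are contracted to a single shared node with reference count 2.
```

(The recursion assumes tree depth within Python's recursion limit; an explicit-stack version would be used in practice.)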

Note that there may be many different parses of a given string, yielding different degrees of sharing. For compressing strings as much as possible, the main question is finding "good" parses in Phase 1, since the latter two phases have known efficient methods (multiset discrimination for sharing, Huffman coding for coding).

It would appear that multiset discrimination can play the role of Phase 2 and go beyond ordinary string compression in several respects:

- it provides even more sharing than conventional string compressors, as it can identify values up to permutation (bags) and repetition (sets);
- it can even be used for circular data structures, which seems to be far beyond the reach of conventional compression;
- current string compressors have a two-level architecture: one table (or more, as in XMill) of "let declarations" defining the code book, and then a string (the "body of a let expression") using those codes. Multilevel architectures are conceivable.

The parsing problem is only required for chardata in XML documents, where it is basically the same as in ordinary string compression (and it still yields the bulk of the compression, since most data are leaf data, in particular chardata, in almost all trees).

Phases 1 and 2 by themselves should provide the basis for a highly space-efficient implementation of semi-structured data that still supports navigation, that is, selective access to parts of the data without having to uncompress all of them; and Phases 1, 2 and 3 combined should yield an interesting compression/uncompression method. Interestingly, all phases should be developable as separate modules, whose combinations thus provide different, yet composable, functionality.

`henglein`
