Multiset discrimination is a generalization of equality or equivalence testing from 2 to arbitrarily many arguments. It is a fundamental collection of algorithmic techniques for partitioning a set of (possibly) structured input values into equivalence classes according to a given equivalence relation. Such an equivalence relation can be name equality, structural equivalence (structural equality), or structural equality modulo associativity, commutativity and idempotence of one or more binary operators. It extends to other equivalence relations by building on top of these.
Multiset discrimination applies to acyclic data structures without sharing (including "unboxed" data) and to data structures with sharing ("boxed" data with shared pointers), and executes in both cases in worst-case linear time, without use of hashing or address arithmetic. It also applies to cyclic data structures, where it executes in O(m log n) time for data structures with n nodes and m edges, and where neighbor lists may be interpreted as sequences, bags or sets. Finally, these algorithms can be defined polytypically, which means that multiset discrimination can be efficiently extended to user-defined abstract data types as long as the primitive data types used in their implementation themselves provide a multiset discrimination method.
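As a concrete, simplified illustration, the basic problem can be sketched in Python. The function name is an assumption of this sketch, and so is the use of a dict: the published algorithms achieve worst-case linear time by array indexing on integer or character keys, deliberately avoiding the hashing a dict relies on.

```python
def discriminate(xs):
    """Partition the positions of xs into groups holding equal values.

    The published algorithms run in worst-case linear time by array
    indexing on integer or character keys; a Python dict is used here
    purely for brevity (it relies on hashing, which the real technique
    deliberately avoids).
    """
    groups = {}
    for i, x in enumerate(xs):
        groups.setdefault(x, []).append(i)
    return list(groups.values())

# Discriminating the multiset {b, a, b, c, a} finds the duplicates:
discriminate(["b", "a", "b", "c", "a"])  # [[0, 2], [1, 4], [3]]
```

Each group lists the positions of one distinct value, so duplicates and their multiplicities can be read off directly.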
Multiset discrimination was developed by Bob Paige and his collaborators over a period of about 15 years (ca. 1985-1997). He coined the name "multiset discrimination" for the problem of discriminating the elements of an input list (finding duplicates and counting them), where the input list is to be understood as the presentation of a multiset, that is, a sequence in which the order of elements does not matter. Multiset discrimination was first developed by Paige and Tarjan (1987) for discriminating strings of characters, which is central to their left-to-right lexicographic sorting algorithm. Cai and Paige (1991, 1995) demonstrated that numerous language processing problems where hashing is used, ranging from symbol table construction to iterated strength reduction, can be solved efficiently using multiset discrimination of strings. Paige (1991) subsequently extended multiset discrimination from strings to sequences of arbitrary (discriminated) elements, and also to (list representations of) bags and sets of such elements. The hard part here is dealing with commutativity: the elements of two bags (or sets) may occur in arbitrary order and yet the two bags (sets) may be equivalent. The key ingredient is what Paige termed weak sorting, a simple and efficient (linear-time) method -- applying basic multiset discrimination twice -- for sorting a collection of sequences of elements according to an order on the elements that is computed on the fly, without use of any comparison function on the elements. He noted that this extends to multiset discrimination of arbitrary tree-structured elements by processing bottom-up from the leaves. Although he did not note it explicitly, the bottom-up nature of the processing implies that this also works for dags, not just trees, in linear time. Taken together, the work from ca. 1985-1991 yields an algorithmic toolbox for linear-time discrimination of structured data, shared or not, as long as it is acyclic. Paige (1994) and Paige and Yang (1997) developed techniques for transforming serialized ("external") representations of structured data into type-driven internal forms in linear time, with full dagification and elimination of duplicates. Zibin, Gil and Considine (2003) use multiset discrimination techniques to solve an interesting type isomorphism problem with distributivity, and do so by rediscovering some basic multiset discrimination techniques, curiously without referring to any of the previous work on multiset discrimination.
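Weak sorting, as described above, can be sketched as follows. This is a hypothetical rendering: assigning each distinct element an integer code in order of first occurrence approximates the discrimination passes of Paige's actual technique, and Python's built-in sort stands in for its linear-time bucket passes.

```python
def weakly_sort(seqs):
    """Order a collection of sequences consistently, using an order on
    the elements that is discovered on the fly -- no comparison
    function on the elements is ever called.

    Each distinct element is assigned a small integer code in order of
    first occurrence (a basic discrimination pass); the coded sequences
    are then ordered lexicographically.  Python's sorted() stands in
    for the linear-time bucket passes of Paige's actual technique.
    """
    codes = {}
    coded = [tuple(codes.setdefault(x, len(codes)) for x in s)
             for s in seqs]
    order = sorted(range(len(seqs)), key=lambda i: coded[i])
    return [seqs[i] for i in order]

# After weak sorting, equal sequences are adjacent, so one final
# linear scan discriminates them:
weakly_sort([("b", "a"), ("a",), ("b", "a")])
# -> [("b", "a"), ("b", "a"), ("a",)]
```

The point is that the order on the elements is arbitrary but consistent, which is all that grouping equal sequences requires.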
The discrimination of acyclic data using the linear-time methods referred to above is what Bob Paige collectively referred to as multiset discrimination. In parallel with this, he was involved with partitioning problems for cyclic data structures. Indeed, there is a parallel track of algorithmic work dating back to 1980 that deals with what is essentially multiset discrimination of cyclic data structures, even though the respective authors do not call it that. We shall extend the term multiset discrimination to those problems since they solve the same problem for cyclic data structures, not just acyclic data structures, albeit not in linear time. Downey, Sethi and Tarjan (1980) devised an algorithm for partitioning the nodes of a directed graph with ordered outedges (that is, the neighbors of each node constitute a sequence) in time O(m log n) for graphs with n nodes and m edges, such that nodes are in the same block of the partition if and only if they are isomorphic; that is, their neighbor lists are also elementwise isomorphic. They referred to this problem as the common subexpression problem. Cardon and Crochemore (1982) developed an algorithm with the same time bound for the same problem, but where the neighbors of each node are viewed as denoting an unordered list or bag of nodes. They referred to this problem as finding the coarsest regular congruence refining an initial partition. Finally, Paige and Tarjan (1987) solved the corresponding problem, again with the same time bound, where the neighbors of each node are viewed as denoting a set, referring to it as the coarsest relational partition refinement problem. They also provided a simplified algorithm for Cardon and Crochemore's problem, calling it the size-stable coarsest partition refinement problem.
In our terminology the three algorithms solve multiset discrimination where the input elements may be (pointers to) sequences, bags, or sets (of pointers), respectively, and where they may be cyclic (which is why we write 'pointers': cyclicity would otherwise violate the well-foundedness axiom of Zermelo-Fraenkel set theory).
Henglein (2003) develops multiset discrimination for acyclic data in a general, polytypic framework and demonstrates that the above-mentioned three algorithms can be combined into a single O(m log n) multiset discrimination algorithm for cyclic data structures that may contain any mixture of sequences, bags and sets.
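To make the cyclic case concrete, the problem the three algorithms solve can be defined by a deliberately naive fixed-point refinement; names and representation are assumptions of this sketch, and it does not attain the O(m log n) bound, which requires the "process the smaller half" strategy of the cited algorithms.

```python
def refine(nodes, neighbors, kind="sequence"):
    """Partition the nodes of a (possibly cyclic) directed graph so
    that two nodes end up in the same block iff their neighbor
    structures are equivalent, with each node's neighbor list
    interpreted as a sequence, bag, or set.

    Naive fixed-point iteration, meant only to define the problem;
    the published algorithms reach O(m log n) by processing the
    smaller half of each split block.
    """
    block = {v: 0 for v in nodes}  # start from a single block
    while True:
        # Signature of v: its current block plus its neighbors' blocks,
        # normalized according to the chosen interpretation.
        def sig(v):
            ns = [block[w] for w in neighbors[v]]
            if kind == "bag":
                ns = sorted(ns)            # order irrelevant
            elif kind == "set":
                ns = sorted(set(ns))       # order and repetition irrelevant
            return (block[v], tuple(ns))
        relabel = {}
        new = {v: relabel.setdefault(sig(v), len(relabel)) for v in nodes}
        if new == block:                   # partition is stable
            return block
        block = new

# A 2-cycle a<->b and a node d with a duplicated edge to a:
g = {"a": ["b"], "b": ["a"], "d": ["a", "a"]}
refine(["a", "b", "d"], g, kind="sequence")  # d gets its own block
refine(["a", "b", "d"], g, kind="set")       # all three nodes merge
```

The `kind` parameter mirrors the three problem variants: sequences (Downey-Sethi-Tarjan), bags (Cardon-Crochemore) and sets (Paige-Tarjan).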
- P. Downey, R. Sethi, and R. Tarjan, Variations on the common subexpression problem, Journal of the ACM, 27(4):758--771, October 1980.
- Cardon A. and Crochemore M., Partitioning a graph in O(|A| log2 |V|), Theoretical Computer Science (TCS), 19:85--98, 1982.
- Robert Paige and Robert E. Tarjan, Three partition refinement algorithms, SIAM Journal of Computing, 16(6):973--989, December 1987.
- Cai, J. and Paige, R., Look Ma, No Hashing, and No Arrays Neither, Proc. 18th Annual ACM Symp. on Principles of Programming Languages (POPL), Orlando, Florida, Jan. 1991, pp. 143-154.
- Robert Paige, Optimal Translation of User Input in Dynamically Typed Languages, unpublished manuscript, July 1991.
- Robert Paige, Efficient Translation of External Input in a Dynamically Typed Language, Proc. 13th World Computer Congress, Vol. 1, Pehrson, B. and Simon, I. (editors), Elsevier Science B.V. (North Holland), Feb. 1994.
- Jiazhen Cai and Robert Paige, Using multiset discrimination to solve language processing problems without hashing, Theoretical Computer Science (TCS), 145(1-2):189--228, July 1995.
- Robert Paige, Zhe Yang, High Level Reading and Data Structure Compilation, Proc. 24th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages (POPL), Paris, France, Jan. 1997, ACM Press, pp. 456-469.
- Somesh Jha, Jens Palsberg, Tian Zhao, Fritz Henglein, Efficient Type Matching, Technical Report, DIKU, University of Copenhagen, TOPPS report D-474, 2002.
- Yoav Zibin, Joseph Gil, Jeffrey Considine, Efficient Algorithms for Isomorphisms of Simple Types, Proc. 2003 ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pp. 160-171, Jan. 2003, ACM Press; also appeared in ACM SIGPLAN Notices, Vol. 38, No. 1.
- Fritz Henglein, Multiset discrimination, manuscript in preparation, 2003.
Sharing and garbage collection
Multiset discrimination is a method for identifying multiply stored elements in an XML Store and compressing them by contracting all copies of an element (value) to a single copy, freeing the remaining storage space. In contrast to (ordinary) hashing-based methods, multiset discrimination even works if the children of a node are interpreted as bags or sets, as in relational databases, not just as sequences. The use of a 'sharer' thread that performs sharing compaction (using multiset discrimination or any other technique) allows for a rapid, asynchronous implementation of the save operation in an XML Store: data are simply copied into the XML Store without a (synchronous) check of whether they are (in part) already stored there. Instead of performing such a synchronous check to avoid storing multiple copies of an element, the sharer executes concurrently and eliminates multiple copies asynchronously.
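A sharer's contraction step can be sketched on a toy (label, children) tree representation; the representation and the dict-based table are assumptions of this sketch, where multiset discrimination would compute the same identifications level by level without hashing, after which the duplicate copies can be freed.

```python
def dagify(node, table=None):
    """Contract all copies of equal subtrees into a single stored node.

    A node is a (label, children) pair; this toy representation and
    the dict-based table are illustrative stand-ins.  Multiset
    discrimination would build the same identifications bottom-up
    without hashing; duplicates found this way can then be freed.
    """
    if table is None:
        table = {}
    label, children = node
    shared = tuple(dagify(c, table) for c in children)
    # Two subtrees are equal iff they have the same label and the same
    # (already contracted) children, so child identity suffices as key.
    key = (label, tuple(id(c) for c in shared))
    return table.setdefault(key, (label, shared))

t = ("f", [("a", []), ("a", [])])
d = dagify(t)
d[1][0] is d[1][1]  # True: both copies of ("a", []) are now one node
```

Because children are contracted before their parents, equality of whole subtrees reduces to pointer identity of children, which is what makes a single bottom-up pass suffice.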
It can be expected that synchronization problems known from concurrent garbage collection will arise, though they are mitigated by the immutability of much of the stored data. Indeed, the sharing process could be expected to double as a (local) garbage collector, since it basically operates like a tracing copying garbage collector driven by a (local) root set.
Little is known about the implementation and practical performance of multiset discrimination for in-memory data structures, and basically nothing about the special characteristics (algorithmics, design and implementation) of multiset discrimination for on-disk data, even when restricted to local (single processor with local disk) processing. Beyond this, distributed multiset discrimination may turn out to be central to the efficient management of peer-to-peer systems such as an XML Store, and a crucial enabling technology for applications such as peer-to-peer backup.
Sharing and compression
Multiset discrimination is a technique for compacting trees, dags and graphs by contracting isomorphic nodes into a single node. Since it can keep track of how many references there are to such a contracted node, it can serve as the basis for compression of storage space using Huffman coding, where frequently occurring references are coded using short bit strings and less frequently occurring references using longer ones.
In general, storage compression of strings can be understood as the composition of the following phases:
- (parsing) parsing of the string into an append tree, that is, a tree whose internal nodes are labeled with the associative append operator and may have any number of children (0, 1, 2 or more), and whose leaves are labeled with single characters;
- (sharing) multiset discrimination of the nodes to turn the tree into a dag;
- (coding) frequency-based coding of the dag nodes into a (bit) sequence.
Note that there may be many different parses of a given string, yielding different degrees of sharing. For compressing strings as much as possible, the main question is therefore finding "good" parses in Phase 1, since sharing is determined by the parse and coding has a known optimal method in Huffman coding.
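The phases can be sketched end to end on plain strings (Huffman coding itself is omitted; the reference counts gathered in the sharing phase are exactly the frequencies a Huffman coder would consume). The fixed halving parse and all names are illustrative assumptions of this sketch.

```python
from collections import Counter

def parse(s):
    """Phase 1 (parsing): a deliberately naive parse of a string into
    an append tree by halving; real compressors search for 'good'
    parses that expose repeated subtrees."""
    if len(s) <= 1:
        return ("leaf", s)
    mid = len(s) // 2
    return ("append", parse(s[:mid]), parse(s[mid:]))

def share(node, table, freq):
    """Phase 2 (sharing): bottom-up contraction of equal subtrees; the
    reference count of each shared node is exactly the frequency a
    Huffman coder (phase 3) would use when assigning its code."""
    if node[0] == "append":
        node = ("append",
                share(node[1], table, freq),
                share(node[2], table, freq))
        key = ("append", id(node[1]), id(node[2]))
    else:
        key = node
    canon = table.setdefault(key, node)
    freq[id(canon)] += 1
    return canon

table, freq = {}, Counter()
dag = share(parse("abab"), table, freq)
# Both halves "ab" are contracted to the same node, referenced twice:
assert dag[1] is dag[2] and freq[id(dag[1])] == 2
```

A different parse of the same string would expose different repeated subtrees, which is precisely why finding good parses dominates the compression achieved.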
It would appear that multiset discrimination can play the role of Phase 2 and go beyond ordinary string compression in several respects:
- it provides even more sharing than conventional string compressors, as it can identify values up to permutation (bags) and additionally up to repetition (sets);
- it can even be used for circular data structures, which seems to be far beyond the reach of conventional compression;
- current string compressors have a two-level architecture: one table (or more, as in XMill) of "let declarations" defining the code book, and then a string (the "body of a let expression") using those codes. Multilevel architectures are conceivable.
The parsing problem is only required for chardata in XML documents, where it is basically the same as in ordinary string compression (and still yields the bulk of the compression, since in almost any tree most of the data is leaf data, in particular chardata).
Phases 1 and 2 by themselves should provide the basis for a highly space-efficient implementation of semi-structured data that still supports navigation, that is, selective access to parts of the data without having to uncompress all of them; and Phases 1, 2 and 3 combined should yield an interesting compression/decompression method. Interestingly, all phases should be developable as separate modules, whose combinations thus provide different, yet composable functionality.