Text preview for : CSL-94-10_Ropes_Are_Better_Than_Strings.pdf part of xerox CSL-94-10 Ropes Are Better Than Strings xerox parc techReports CSL-94-10_Ropes_Are_Better_Than_Strings.pdf



Back to : CSL-94-10_Ropes_Are_Bette | Home

Ropes Are Better Than Strings
Hans-). Boehm, Russ Atkinson and Michael Plass
Ropes Are Better Than Strings


Hans-J. Boehm, Russ Atkinson and Michael Plass


CSL-94-10 Septem ber 1994 [P94-00091 ]

@ Copyright 1994 Xerox Corporation. All rights reserved.




CR Categories and Subject Descriptors: E.2 [Data Storage Representations]:
Primitive Data Items




Additional Keywords and Phrases: Strings, Ropes, Trees, Concatenation, Data
Structures




Xerox Corporation
XEROX Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304
Ropes Are Better Than Strings

Hans-J. Boehm
Russ Atkinson
Michael Plass

XeroxPARC



Abstract

Programming languages generally provide a "string" or "text" type to allow manipulation of
sequences of characters. This type is usually of crucial importance, since it is normally mentioned
in most interfaces between system components. We claim that the traditional implementations of
strings, and often the supported functionality, are not well suited to such general-purpose use.
They should be confined to applications with specific, and unusual, performance requirements.
We present "ropes" or "heavyweight" strings as an alternative that, in our experience leads to
systems that are more robust, both in functionality and in performance.
Ropes have been in use in the Cedar environment almost since its inception, but this appears
to be neither well-known, nor discussed in the literature. The algorithms have been gradually
refined. We have also recently built a second similar, but somewhat lighter weight, C-Ianguage
implementation. We describe the algorithms used in both, and give usage examples and
performance comparisons for the C version.



What's Wrong With Strings?

Programming languages such as C and traditional Pascal provide a built-in notion of strings as
essentially fixed length arrays of characters. The language itself provides the array primitives for
accessing such strings, plus often a collection of library routines for higher level operations such as
string concatenation. Thus the implementation is essentially constrained to represent strings as
contiguous arrays of characters, with or without additional space for a length, expansion room, etc.
There is no question that such data structures are occasionally appropriate, and that an "array
of characters" data structure should be provided. On the other hand, since the character string
type will be used pervasively to communicate between modules of a large system, we desire the
following characteristics:
1. Immutable strings should be well supported. A procedure should be able to operate on a string
it was passed without accidentally modifying the caller's data structures. This becomes
particularly important in the presence of concurrency, where in-place updates to strings would
often have to be properly synchronized.
2. Commonly occurring operations on strings should be efficient. In particular (nondestructive)
concatenation of strings and nondestructive substring operations should be fast, and should not
require excessive amounts of space.
3. Common string operations should scale to long strings. There should be no practical bound on
the length of strings. Performance should remain acceptable for long strings. (Most users of
many standard UNIX(TM) will immediately recognize the motivation behind this requirement.


1
2




E.g. the vi editor is unusable on many text files due to a line length limit. An unchecked input
limit in fingerd supported the Morris internet wonn [Spafford 89]. A six month old child
randomly typing at a workstation would routinely crash some older UNIX kernels due to a
buffer size limitation, etc.)

4. It should be as easy as possible to treat some other representation of "sequence of character"
(e.g. a file) as a string. Functions on strings should be maximally reusable.

Standard string representations violate most of these. Immutable strings mayor may not be
supported at the language level. Concatenation of two immutable strings involves copying both,
and thus becomes intolerably inefficient, in both time and space, for long strings. The substring
operation usually (though not necessarily) exhibits similar problems. Since strings are stored
contiguously, any copying of strings results in the allocation of large chunks of storage, which may
also result in substantial memory fragmentation. (For long strings, we have observed this to be a
problem even for some compacting garbage collectors, since they are likely to avoid moving very
large objects.)

As mentioned above, it is very common for application programs not to scale to long string
inputs at all. When they do, they commonly use special purpose data structures to represent those
strings in order to obtain acceptable perfonnance. We are not aware of any standard UNIX text
editors that use a general purpose string representation for the file being edited. Doing so would
make character insertion in long files intolerably slow. The approach we propose makes that
practical.

In order to maximize reusability of string functions, it should be easy to coerce other
representations into standard strings. This is always possible by copying the characters and
building the appropriate string representation. But this is undesirable if, for example, the original
sequence is a long file, such that only the first few characters are likely to be examined. We would
like to be able to convert files into strings without first reading them.




An Alternative

In order to allow concatenation of strings to be efficient in both time and space, it must be
possible for the result to share much of the data structure with its arguments. This implies that
fully manual storage management (e.g. based on explicit malloe/free) is impractical. (It can be
argued that this is true even with conventional string representations. Manual storage management
typically results in much needless string copying.) Though an explicitly reference counted
implementation can be built, we will assume automatic garbage collection.

Since concatenation may not copy its arguments, the natural alternative is to represent such a
string as an ordered tree, with each internal node representing the concatenation of its children,
and the leaves consisting of flat strings, usually represented as contiguous arrays of characters.
Thus the string represented by a tree is the concatenation of its leaves in left-to-right order:
3




concat



concat II foxll



liThe quill lick brown ll



Figure 1. Rope representation of "The quick brown fox"



We refer to character strings rep resented as a tree of concatenation nodes as ropes. (This is a
little sloppy. A rope may contain shared subtrees, and is thus really a directed acyclic graph, where
the out-edges of each vertex are ordered. We will continue to be sloppy.)
Ropes can be viewed as search trees that are indexed by position. If each vertex contains the
length of the string represented by the subtree, then minimal modifications of the search tree
algorithms yield the following operations on ropes:

Fetch ith character: A simple search tree lookup. Rather than examining the subtree containing
the right key, we examine the tree containing the proper position, as determined by the length
fields.
Concatenate two ropes: Search tree concatenation as defined in [Knuth 73].
Substring: Two search tree split operations, as defined in [Knuth 73].
Iterate over each character: Left-to-right tree traversal.


The first three of the above operations can be performed in time logarithmic in the length of
the argument using, for example, B-trees or AVL trees [Knuth 73, Bent et al 82]. Note that since
strings are immutable, any nodes that would be modified in the standard version of the algorithm,
as well as their ancestors, are copied. Only logarithmically many nodes need be copied.
The last can be performed in linear time for essentially any search tree variant. Thus both
concatenation and substring operations (other than for very short substrings) are asymptotically
faster than for conventional flat strings. The last exhibits roughly the same performance. The first
is somewhat slower, but usually infrequent.
In practice, we modify this in two ways. Concatenation is often a sufficiently important
operation that it should run in unit, not logarithmic, time. Long output strings are typically built
by concatenating short ones. For example, compilers generating assembly language output may do
so by concatenating strings in memory [BoehmZwaenepoel 87]. Hence binary concatenation
4




normally simply adds a root node, and does not rebalance the tree. The rebalancing operation is
either performed selectively, or invoked explicitly. Effectively this trades speed of the substring
and fetch operations for better concatenation performance. Iteration over a rope is basically
unaffected.
(Theoretically it is possible to guarantee that the first three operations run in amortized
logarithmic time, the iteration operation runs in linear time, and individual concatenations are
performed in constant time by using a "splay tree" based representation [Tarjan 83, SleatorTarjan
83], in which each node represents the concatenation of a (possibly empty) rope, a (nonempty) flat
string, and another rope. This appears impractical in our situation, since the fetch operation
requires large amounts of allocation and data structure modification. Furthermore, in a
multithreaded environment, the fetch operation appears to require locking.)
It is useful to introduce additional kinds of tree nodes. At a minimum, we allow a second kind
ofleaf node containing at least a length and a user-defined function for computing the ith character
in the string. This allows other representations of character sequences (e.g. files) to be treated as
ropes without copying. It may further be useful to introduce substring nodes, so that long
su bstrings of flat ropes do not require copying.



Algorithms
We briefly discuss the implementation of the more common operations. We will largely
ignore nodes containing user defined functions, in that they do not affect the algorithms
significantly. We assume that all leaves are nonempty. We exclude empty ropes from
consideration: they are a trivial special case.
Note that all operations are completely nondestructive. They may be performed without
locking in a multithreaded environment.



Concatenation

In the general case, concatenation involves simply allocating a concatenation node containing
two pointers to the two arguments. For performance reasons, it is desirable to deal with the
common case in which the right argument is a short flat string specially. If both arguments are
short leaves, we produce a flat rope (leaO consisting of the concatenation. This greatly reduces
space consumption and traversal times. If the left argument is a concatenation node whose right
son is a short leaf, and the right argument is also a short leaf, then we concatenate the two leaves,
and then concatenate the result to the left son of the left argument. Together, these two special
cases guarantee that if a rope is built by repeatedly concatenating individual characters to its end,
we still obtain a rope with leaves of reasonable size. They also greatly reduce the imbalance of the
resulting trees.
Since the length of ropes is, in practice, bounded by the word size of the machine, we can
place a bound on the depth of balanced ropes. The concatenation operation checks whether the
resulting tree significantly exceeds this bound. If so, the rope is explicitly rebalanced (see below).
This has several benefits:
1. The balancing operation is only invoked implicitly for long ropes, and then rarely. Each
balancing operation will normally reduce the depth of the rope to considerably below the
5




threshold.
2. Recursive operations on ropes require a bounded amount of stack space.
3. Paths in the tree can be represented in a fixed amount of space. This is important for the C
implementation (see below).



Substring

The substring operation on structured ropes can be easily implemented. We assume the
substring operation on leaves simply copies the relevant section of the leaf, and deals with negative
start arguments and over-length arguments correctly.
substr( concat(ropel, rope2), start, len) =
let
left = if start < 0 and len ~ length(ropel) then
ropel
else
substr(ropel, start, len)
right = if start < length(ropel)
and start + len > length(ropel) + length(rope2) then
rope2
else
substr(rope2, start - length(ropel), len -length(left