public abstract class UnicodeEscaper extends java.lang.Object implements Escaper
An Escaper that converts literal text into a format safe for inclusion in a particular context (such as an XML document). Typically (but not always), the inverse process of "unescaping" the text is performed automatically by the relevant parser.
For example, an XML escaper would convert the literal string "Foo<Bar>" into "Foo&lt;Bar&gt;" to prevent "<Bar>" from being confused with an XML tag. When the resulting XML document is parsed, the parser API will return this text as the original literal string "Foo<Bar>".
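As an illustration of the XML example above, here is a minimal, self-contained sketch. It does not use this class; `XmlEscapeSketch`, `replacementFor` and `escapeXml` are hypothetical names chosen for the example, and the set of escaped characters is an assumption.

```java
class XmlEscapeSketch {
    // Hypothetical helper: returns the replacement text for a code point,
    // or null when the code point can be copied through unchanged.
    static String replacementFor(int cp) {
        switch (cp) {
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '&': return "&amp;";
            case '"': return "&quot;";
            case '\'': return "&apos;";
            default: return null;
        }
    }

    // Walks the string one code point at a time (not one char at a time),
    // so surrogate pairs are never split.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String replacement = replacementFor(cp);
            if (replacement == null) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("Foo<Bar>")); // prints Foo&lt;Bar&gt;
    }
}
```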
Note: This class is similar to CharEscaper but with one very important difference. A CharEscaper can only process Java UTF-16 characters in isolation and may not cope when it encounters surrogate pairs. This class facilitates the correct escaping of all Unicode characters.
Because there are important reasons, including potential security issues, to handle Unicode correctly, you should favor UnicodeEscaper wherever possible when implementing a new escaper.
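The difference between processing chars in isolation and processing code points can be seen with plain JDK calls. The sketch below uses no escaper class; `SurrogatePairSketch`, `perChar` and `perCodePoint` are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogatePairSketch {
    // Lists each UTF-16 code unit separately, as a char-based escaper sees them.
    static List<String> perChar(String s) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            units.add(String.format("U+%04X", (int) s.charAt(i)));
        }
        return units;
    }

    // Lists each Unicode code point, combining surrogate pairs.
    static List<String> perCodePoint(String s) {
        List<String> points = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            points.add(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return points;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so Java stores
        // it as a surrogate pair (two char values).
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println(perChar(s));      // [U+0061, U+D83D, U+DE00]
        System.out.println(perCodePoint(s)); // [U+0061, U+1F600]
    }
}
```

A char-based escaper sees U+D83D and U+DE00 as two unrelated characters, which is why it cannot escape the supplementary code point as a unit.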
A UnicodeEscaper instance is required to be stateless, and safe when used concurrently by multiple threads.
Several popular escapers are defined as constants in the class CharEscapers. To create your own escapers, extend this class and implement the escape(int) method.
Constructor and Description |
---|
UnicodeEscaper() |
Modifier and Type | Method and Description |
---|---|
protected static int | codePointAt(java.lang.CharSequence seq, int index, int end) Returns the Unicode code point of the character at the given index. |
java.lang.Appendable | escape(java.lang.Appendable out) Returns an Appendable instance which automatically escapes all text appended to it before passing the resulting text to an underlying Appendable. |
protected abstract char[] | escape(int cp) Returns the escaped form of the given Unicode code point, or null if this code point does not need to be escaped. |
java.lang.String | escape(java.lang.String string) Returns the escaped form of a given literal string. |
protected java.lang.String | escapeSlow(java.lang.String s, int index) Returns the escaped form of a given literal string, starting at the given index. |
protected int | nextEscapeIndex(java.lang.CharSequence csq, int start, int end) Scans a sub-sequence of characters from a given CharSequence, returning the index of the next character that requires escaping. |
protected abstract char[] escape(int cp)
Returns the escaped form of the given Unicode code point, or null if this code point does not need to be escaped. When called as part of an escaping operation, the given code point is guaranteed to be in the range 0 <= cp <= Character.MAX_CODE_POINT.
If an empty array is returned, this effectively strips the input character from the resulting text.
If the character does not need to be escaped, this method should return null, rather than an array containing the character representation of the code point. This enables the escaping algorithm to perform more efficiently.
If the implementation of this method cannot correctly handle a particular code point then it should either throw an appropriate runtime exception or return a suitable replacement character. It must never silently discard invalid input as this may constitute a security risk.
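The three possible outcomes described above (null, an empty array, or an exception) can be sketched as follows. `MiniEscaper` is a hypothetical, simplified stand-in written for this example, not this class: its driver loop assumes well-formed UTF-16, and the choice of which code points to strip or reject is an assumption.

```java
class EscapeContractSketch {
    // Hypothetical stand-in with the same escape(int) contract as described above.
    abstract static class MiniEscaper {
        protected abstract char[] escape(int cp);

        // Simplified driver loop; assumes the input is well-formed UTF-16.
        public String escape(String s) {
            StringBuilder out = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                char[] replacement = escape(cp);
                if (replacement == null) {
                    out.appendCodePoint(cp);   // null: copy unchanged
                } else {
                    out.append(replacement);   // empty array: nothing is appended
                }
                i += Character.charCount(cp);
            }
            return out.toString();
        }
    }

    static class DemoEscaper extends MiniEscaper {
        @Override
        protected char[] escape(int cp) {
            if (cp == 0) {
                // Cannot be handled: fail loudly instead of dropping it silently.
                throw new IllegalArgumentException("U+0000 cannot be represented");
            }
            if (cp == 0x00AD) {
                return new char[0];            // deliberately strip soft hyphens
            }
            if (cp == '&') {
                return "&amp;".toCharArray();  // escaped form
            }
            return null;                       // no escaping needed
        }
    }

    public static void main(String[] args) {
        System.out.println(new DemoEscaper().escape("a&b\u00ADc")); // prints a&amp;bc
    }
}
```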
Parameters:
cp - the Unicode code point to escape if necessary
Returns:
the replacement characters, or null if no escaping was needed

protected int nextEscapeIndex(java.lang.CharSequence csq, int start, int end)
Scans a sub-sequence of characters from a given CharSequence, returning the index of the next character that requires escaping.
Note: When implementing an escaper, it is a good idea to override this method for efficiency. The base class implementation determines successive Unicode code points and invokes escape(int) for each of them. If the semantics of your escaper are such that code points in the supplementary range are either all escaped or all unescaped, this method can be implemented more efficiently using CharSequence.charAt(int).
Note however that if your escaper does not escape characters in the supplementary range, you should either continue to validate the correctness of any surrogate characters encountered or provide a clear warning to users that your escaper does not validate its input.
See PercentEscaper for an example.
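A char-based scan of the kind described in the note might look like the sketch below. It is not the PercentEscaper implementation; `NextEscapeIndexSketch` and its safe-character table are assumptions. Here every non-ASCII character, and therefore every surrogate, is treated as needing escaping, so supplementary code points are "all escaped" and the scan always stops at the start of a code point. Returning `end` when nothing needs escaping is a choice made for this sketch.

```java
class NextEscapeIndexSketch {
    // Assumption for this sketch: only ASCII letters and digits are safe.
    private static final boolean[] SAFE = new boolean[128];
    static {
        for (char c = '0'; c <= '9'; c++) SAFE[c] = true;
        for (char c = 'a'; c <= 'z'; c++) SAFE[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) SAFE[c] = true;
    }

    // Scans csq[start, end) with charAt and returns the index of the first
    // char that is not in the safe table, or end if there is none.
    static int nextEscapeIndex(CharSequence csq, int start, int end) {
        for (int i = start; i < end; i++) {
            char c = csq.charAt(i);
            if (c >= SAFE.length || !SAFE[c]) {
                return i;
            }
        }
        return end;
    }

    public static void main(String[] args) {
        System.out.println(nextEscapeIndex("abc def", 0, 7)); // prints 3
        System.out.println(nextEscapeIndex("abcdef", 0, 6));  // prints 6
    }
}
```

Note that this scan does not validate surrogates; a caller would still need to check them when the escaping itself is performed.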
Parameters:
csq - a sequence of characters
start - the index of the first character to be scanned
end - the index immediately after the last character to be scanned
Throws:
java.lang.IllegalArgumentException - if the scanned sub-sequence of csq contains invalid surrogate pairs

public java.lang.String escape(java.lang.String string)
Returns the escaped form of a given literal string.
If you are escaping input in arbitrary successive chunks, then it is not generally safe to use this method. If an input string ends with an unmatched high surrogate character, then this method will throw IllegalArgumentException. You should either ensure your input is valid UTF-16 before calling this method or use an escaped Appendable (as returned by escape(Appendable)) which can cope with arbitrarily split input.
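One way to keep chunked input valid before calling a string-based escape method is to avoid splitting between the two halves of a surrogate pair. The helpers below are hypothetical and use only JDK calls; they are a sketch of that idea, not part of this class.

```java
class ChunkBoundarySketch {
    // True when the chunk ends with a high surrogate, meaning its matching
    // low surrogate (if any) would arrive in the next chunk.
    static boolean endsWithUnmatchedHighSurrogate(CharSequence chunk) {
        int n = chunk.length();
        return n > 0 && Character.isHighSurrogate(chunk.charAt(n - 1));
    }

    // Returns a split index no greater than 'preferred' that does not fall
    // between the high and low halves of a surrogate pair.
    static int safeSplitIndex(CharSequence text, int preferred) {
        if (preferred > 0 && preferred < text.length()
                && Character.isHighSurrogate(text.charAt(preferred - 1))
                && Character.isLowSurrogate(text.charAt(preferred))) {
            return preferred - 1;
        }
        return preferred;
    }

    public static void main(String[] args) {
        String s = "ab" + new String(Character.toChars(0x1F600)); // 4 chars
        System.out.println(endsWithUnmatchedHighSurrogate(s.substring(0, 3))); // prints true
        System.out.println(safeSplitIndex(s, 3)); // prints 2
    }
}
```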
Note: When implementing an escaper, it is a good idea to override this method for efficiency by inlining the implementation of nextEscapeIndex(CharSequence, int, int) directly. Doing this for PercentEscaper more than doubled the performance for unescaped strings (as measured by CharEscapersBenchmark).
protected final java.lang.String escapeSlow(java.lang.String s, int index)
Returns the escaped form of a given literal string, starting at the given index. This method is called by the escape(String) method when it discovers that escaping is required. It is protected to allow subclasses to override the fastpath escaping function to inline their escaping test. See CharEscaperBuilder for an example usage.
This method is not reentrant and may only be invoked by the top level escape(String) method.
Parameters:
s - the literal string to be escaped
index - the index to start escaping from
Returns:
the escaped form of string
Throws:
java.lang.NullPointerException - if string is null
java.lang.IllegalArgumentException - if invalid surrogate characters are encountered

public java.lang.Appendable escape(java.lang.Appendable out)
Returns an Appendable instance which automatically escapes all text appended to it before passing the resulting text to an underlying Appendable.
Unlike escape(String), it is permitted to append arbitrarily split input to this Appendable, including input that is split over a surrogate pair. In this case the pending high surrogate character will not be processed until the corresponding low surrogate is appended. This means that a trailing high surrogate character at the end of the input cannot be detected and will be silently ignored. This is unavoidable since the Appendable interface has no close() method, and it is impossible to determine when the last characters have been appended.
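The pending-high-surrogate behavior described above can be sketched with a small wrapper. This is a hypothetical illustration, not the implementation returned by escape(Appendable): the names `PendingSurrogateSketch`, `escaping` and `escapeChunks` are invented here, and the escaping rules ('<' becomes "&lt;", supplementary code points become numeric references) are assumptions chosen to make the effect visible.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

class PendingSurrogateSketch {
    // Wraps 'out' so that a high surrogate is held back until its low
    // surrogate arrives, even if the two are appended separately.
    static Appendable escaping(Appendable out) {
        return new Appendable() {
            private int pendingHigh = -1; // -1 means no pending high surrogate

            @Override public Appendable append(CharSequence csq) throws IOException {
                return append(csq, 0, csq.length());
            }

            @Override public Appendable append(CharSequence csq, int start, int end)
                    throws IOException {
                for (int i = start; i < end; i++) append(csq.charAt(i));
                return this;
            }

            @Override public Appendable append(char c) throws IOException {
                if (pendingHigh != -1) {
                    if (!Character.isLowSurrogate(c)) {
                        throw new IllegalArgumentException("Expected low surrogate");
                    }
                    emit(Character.toCodePoint((char) pendingHigh, c));
                    pendingHigh = -1;
                } else if (Character.isHighSurrogate(c)) {
                    pendingHigh = c; // wait for the low surrogate
                } else if (Character.isLowSurrogate(c)) {
                    throw new IllegalArgumentException("Unexpected low surrogate");
                } else {
                    emit(c);
                }
                return this;
            }

            private void emit(int cp) throws IOException {
                if (cp == '<') out.append("&lt;");
                else if (Character.isSupplementaryCodePoint(cp))
                    out.append(String.format("&#x%X;", cp));
                else out.append((char) cp);
            }
        };
    }

    // Convenience helper for the demo: appends each chunk in turn.
    static String escapeChunks(String... chunks) {
        StringBuilder sb = new StringBuilder();
        Appendable escaped = escaping(sb);
        try {
            for (String chunk : chunks) escaped.append(chunk);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The surrogate pair for U+1F600 is split across two chunks.
        System.out.println(escapeChunks("a<\uD83D", "\uDE00b")); // prints a&lt;&#x1F600;b
        // A trailing high surrogate is never emitted.
        System.out.println(escapeChunks("a\uD83D")); // prints a
    }
}
```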
The methods of the returned object will propagate any exceptions thrown by the underlying Appendable.
For well-formed UTF-16 the escaping behavior is identical to that of escape(String), and the following code is equivalent to (but much slower than) escaper.escape(string):
    StringBuilder sb = new StringBuilder();
    escaper.escape(sb).append(string);
    return sb.toString();
Specified by:
escape in interface Escaper
Parameters:
out - the underlying Appendable to append escaped output to
Returns:
an Appendable which passes text to out after escaping it
Throws:
java.lang.NullPointerException - if out is null
java.lang.IllegalArgumentException - if invalid surrogate characters are encountered

protected static final int codePointAt(java.lang.CharSequence seq, int index, int end)
Returns the Unicode code point of the character at the given index.
Unlike Character.codePointAt(CharSequence, int) or String.codePointAt(int), this method will never fail silently when encountering an invalid surrogate pair.
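The behaviour of this method is as follows:
- If index >= end, IndexOutOfBoundsException is thrown.
- If the character at index is a high surrogate and the next character is not a low surrogate, IllegalArgumentException is thrown.
- If the character at index is a low surrogate, IllegalArgumentException is thrown.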
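A strict decoder along these lines can be sketched with JDK calls only. `StrictCodePointSketch` is a hypothetical name, and this is not the library's implementation. In particular, treating a high surrogate at the very end of the range as an error is an assumption of this sketch; the text above does not say how that case is handled.

```java
class StrictCodePointSketch {
    // Decodes the code point starting at 'index', refusing invalid surrogates
    // instead of returning them as if they were ordinary characters.
    static int strictCodePointAt(CharSequence seq, int index, int end) {
        if (index >= end) {
            throw new IndexOutOfBoundsException("Index exceeds specified range");
        }
        char c1 = seq.charAt(index);
        if (!Character.isSurrogate(c1)) {
            return c1; // ordinary BMP character
        }
        if (Character.isLowSurrogate(c1)) {
            throw new IllegalArgumentException("Unexpected low surrogate at index " + index);
        }
        // c1 is a high surrogate, so a low surrogate must follow.
        if (index + 1 >= end) {
            // Assumption of this sketch: a trailing high surrogate is an error.
            throw new IllegalArgumentException("Unmatched high surrogate at end of range");
        }
        char c2 = seq.charAt(index + 1);
        if (!Character.isLowSurrogate(c2)) {
            throw new IllegalArgumentException("Expected low surrogate at index " + (index + 1));
        }
        return Character.toCodePoint(c1, c2);
    }

    // Demo helper: true when the first code point of seq is rejected.
    static boolean rejects(CharSequence seq) {
        try {
            strictCodePointAt(seq, 0, seq.length());
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // String.codePointAt returns the lone surrogate without complaint.
        System.out.println(Integer.toHexString("\uD83Dx".codePointAt(0))); // prints d83d
        System.out.println(rejects("\uD83Dx")); // prints true
    }
}
```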
Parameters:
seq - the sequence of characters from which to decode the code point
index - the index of the first character to decode
end - the index beyond the last valid character to decode

Copyright © 2008–2023. All rights reserved.