Method Description

static Pattern compile(String regex) Compile regex and return its Pattern object. This method throws java.util.regex.PatternSyntaxException when regex's syntax is invalid.

static Pattern compile(String regex, int flags) Compile regex according to the given flags (a bitset consisting of some combination of Pattern's CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, and UNIX_LINES constants) and return its Pattern object. This method throws PatternSyntaxException when regex's syntax is invalid, and IllegalArgumentException when bit values other than those corresponding to the defined match flags are set in flags.

int flags() Return this Pattern object's match flags. This method returns 0 for Pattern instances created via compile(String), and the bitset of flags for Pattern instances created via compile(String, int).

Matcher matcher(CharSequence input) Return a Matcher that will match input against this Pattern's compiled regex.

static boolean matches(String regex, CharSequence input) Compile regex and attempt to match input against the compiled regex. Return true if the entire input matches; otherwise, return false. This convenience method is equivalent to Pattern.compile(regex).matcher(input).matches(), and throws PatternSyntaxException when regex's syntax is invalid.

String pattern() Return this Pattern's uncompiled regex.

static String quote(String s) Quote s using "\Q" and "\E" so that all other metacharacters lose their special meaning. If the returned String is later compiled into a Pattern instance, it can only be matched literally.

String[] split(CharSequence input) Split input around matches of this Pattern's compiled regex and return an array containing the substrings that lie between the matches.

String[] split(CharSequence input, int limit) Split input around matches of this Pattern's compiled regex; limit controls the number of times the compiled regex is applied and thus affects the length of the resulting array.

String toString() Return this Pattern's uncompiled regex.

Table 9-4 reveals the java.lang.CharSequence interface, which describes a readable sequence of char values. Instances of any class that implements this interface (such as String, StringBuffer, and StringBuilder) can be passed to Pattern methods that take CharSequence arguments (such as split(CharSequence)).

NOTE: CharSequence declares methods char charAt(int index) (return the character at location index within this sequence), int length() (return the length of this sequence), CharSequence subSequence(int start, int end) (return a subsequence of this sequence ranging from location start, inclusive, to location end, exclusive), and String toString() (return a string containing this sequence's characters in the same order and having the same length as this sequence).
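The following minimal sketch exercises several of Table 9-4's methods and passes a StringBuilder where a CharSequence is expected. The PatternBasics class name, the splitOnCommas() helper, and the sample regexes are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternBasics {
   // Hypothetical helper: split comma-separated text held in any CharSequence.
   static String[] splitOnCommas(CharSequence input, int limit) {
      Pattern p = Pattern.compile("\\s*,\\s*");
      return p.split(input, limit);
   }

   public static void main(String[] args) {
      Pattern p = Pattern.compile("ox", Pattern.CASE_INSENSITIVE);
      System.out.println(p.pattern());                           // ox
      System.out.println(p.flags() == Pattern.CASE_INSENSITIVE); // true
      System.out.println(Pattern.matches("ox", "ox"));           // true
      System.out.println(Pattern.quote("a.b"));                  // \Qa.b\E

      // A StringBuilder is a CharSequence, so it can be split directly.
      StringBuilder sb = new StringBuilder("a, b ,c");
      System.out.println(Arrays.toString(splitOnCommas(sb, 0))); // [a, b, c]
      System.out.println(Arrays.toString(splitOnCommas(sb, 2))); // [a, b ,c]
   }
}
```

With a limit of 2, the regex is applied only once, so the second array element holds the remainder of the input.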

Table 9-4 also reveals that each of Pattern's compile() methods and its matches() method (which calls the compile(String) method) throws PatternSyntaxException when a syntax error is encountered while compiling the pattern argument. Table 9-5 describes PatternSyntaxException's methods.

Table 9-5. PatternSyntaxException Methods

Method Description

String getDescription() Return a description of the syntax error.

int getIndex() Return the approximate index of where the syntax error occurred in the pattern, or -1 if the index is not known.

String getMessage() Return a multiline string containing the description of the syntax error and its index, the erroneous pattern, and a visual indication of the error index within the pattern.

String getPattern() Return the erroneous pattern.
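As a small sketch of Table 9-5's methods in use, the following code compiles a deliberately malformed regex and reports what the resulting exception reveals. The SyntaxErrorDemo class and its check() helper are hypothetical names; the exact description and index text depend on the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
   // Hypothetical helper: return the exception thrown while compiling regex,
   // or null when regex compiles without error.
   static PatternSyntaxException check(String regex) {
      try {
         Pattern.compile(regex);
         return null;
      } catch (PatternSyntaxException pse) {
         return pse;
      }
   }

   public static void main(String[] args) {
      PatternSyntaxException pse = check("(ox"); // unbalanced parenthesis
      if (pse != null) {
         System.out.println("Description: " + pse.getDescription());
         System.out.println("Index: " + pse.getIndex());
         System.out.println("Pattern: " + pse.getPattern());
         System.out.println(pse.getMessage());   // multiline summary
      }
   }
}
```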

Finally, Table 9-4's Matcher matcher(CharSequence input) method reveals that the Regular Expressions API also provides the java.util.regex.Matcher class, whose matchers attempt to match compiled regexes against input text. Matcher declares the following methods to perform matching operations:

■ boolean matches() attempts to match the entire region against the pattern. If the match succeeds, more information can be obtained by calling Matcher's start(), end(), and group() methods. For example, int start() returns the start index of the previous match, int end() returns the offset of the first character following the previous match, and String group() returns the input subsequence matched by the previous match. Each method throws java.lang.IllegalStateException when a match has not yet been attempted or the previous match attempt failed.

■ boolean lookingAt() attempts to match the input sequence, starting at the beginning of the region, against the pattern. As with matches(), this method always starts at the beginning of the region. Unlike matches(), lookingAt() does not require that the entire region be matched. If the match succeeds, more information can be obtained by calling Matcher's start(), end(), and group() methods.

■ boolean find() attempts to find the next subsequence of the input sequence that matches the pattern. It starts at the beginning of this matcher's region, or, if a previous call to this method was successful and the matcher has not since been reset (by calling Matcher's Matcher reset() or Matcher reset(CharSequence input) method), at the first character not matched by the previous match. If the match succeeds, more information can be obtained by calling Matcher's start(), end(), and group() methods.

NOTE: A matcher finds matches in a subset of its input called the region. By default, the region contains all of the matcher's input. The region can be modified by calling Matcher's Matcher region(int start, int end) method (set the limits of this matcher's region), and queried by calling Matcher's int regionStart() and int regionEnd() methods.
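The differences between matches(), lookingAt(), and find(), and the effect of the region, can be sketched as follows. The MatchOps class, its helper methods, and the sample input are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOps {
   // true only when the entire input matches regex
   static boolean matchesWhole(String regex, CharSequence input) {
      return Pattern.compile(regex).matcher(input).matches();
   }

   // true when regex matches at the beginning of the input
   static boolean matchesPrefix(String regex, CharSequence input) {
      return Pattern.compile(regex).matcher(input).lookingAt();
   }

   // count find() successes within the region [start, end)
   static int countFinds(String regex, CharSequence input, int start, int end) {
      Matcher m = Pattern.compile(regex).matcher(input);
      m.region(start, end);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args) {
      String input = "oxen fox";
      System.out.println(matchesWhole("ox", input));     // false
      System.out.println(matchesPrefix("ox", input));    // true
      System.out.println(countFinds("ox", input, 0, 8)); // 2
      System.out.println(countFinds("ox", input, 5, 8)); // 1 (region is "fox")
   }
}
```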

I have created a simple application that demonstrates Pattern, PatternSyntaxException, and Matcher. Listing 9-30 presents this application's source code.

Listing 9-30. Playing with regular expressions

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegExDemo {
   public static void main(String[] args) {
      // Exactly two command-line arguments are required: the regex and the input.
      if (args.length != 2) {
         System.err.println("usage: java RegExDemo regex input");
         return;
      }
      try {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         // Report each match; end()-1 is the index of the match's last character.
         while (m.find())
            System.out.println("Located [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end()-1));
      } catch (PatternSyntaxException pse) {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}

After compiling this source code, execute java RegExDemo ox ox. You will discover the following output:

Located [ox] starting at 0 and ending at 1

find() searches for a match by comparing regex characters with the input characters in left-to-right order, and returns true because o equals o and x equals x.

Continuing, execute java RegExDemo box ox. This time, you will discover the following output:

regex = box
input = ox

find() begins by comparing regex character b with input character o. Because these characters are not equal, and because there are not enough characters in the input to continue the search, find() does not output a "Located" message to indicate a match. However, if you execute java RegExDemo ox box, you will discover a match:

Located [ox] starting at 1 and ending at 2

The ox regex consists of literal characters. More sophisticated regexes combine literal characters with metacharacters (such as the period [.]) and other regex constructs.

TIP: To specify a metacharacter as a literal character, precede the metacharacter with a backslash character (as in \.), or place the metacharacter between \Q and \E (as in \Q.\E). In either case, make sure to double the backslash character when the escaped metacharacter appears in a string literal; for example, "\\." or "\\Q.\\E".
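The following sketch shows both escaping styles inside Java string literals, along with Pattern.quote(), which wraps its argument in \Q and \E. The EscapeDemo class and the matchesLiterally() helper are hypothetical names.

```java
import java.util.regex.Pattern;

class EscapeDemo {
   // Hypothetical helper: match input against text treated purely as literal characters.
   static boolean matchesLiterally(String text, CharSequence input) {
      return Pattern.matches(Pattern.quote(text), input);
   }

   public static void main(String[] args) {
      System.out.println(Pattern.matches("a.c", "abc"));       // true: . matches b
      System.out.println(Pattern.matches("a\\.c", "abc"));     // false: \. matches only a period
      System.out.println(Pattern.matches("a\\.c", "a.c"));     // true
      System.out.println(Pattern.matches("a\\Q.\\Ec", "a.c")); // true
      System.out.println(matchesLiterally("a.c", "abc"));      // false
      System.out.println(matchesLiterally("a.c", "a.c"));      // true
   }
}
```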

The period metacharacter matches any character except a line terminator (a one- or two-character sequence designating the end of a line). For example, each of java RegExDemo .ox box and java RegExDemo .ox fox reports a match because the period matches the b in box and the f in fox.

NOTE: Pattern recognizes the following line terminators: carriage return (\r), newline (line feed) (\n), carriage return immediately followed by newline (\r\n), next line (\u0085), line separator (\u2028), and paragraph separator (\u2029). The period metacharacter can be made to also match these line terminators by specifying the Pattern.DOTALL flag when calling Pattern.compile(String, int).
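To see how flags change what the period matches, consider this sketch. The DotAllDemo class, the dotMatches() helper, and the a.b regex are assumptions for illustration; UNIX_LINES is included because it restricts the recognized line terminators to the newline character.

```java
import java.util.regex.Pattern;

class DotAllDemo {
   // Hypothetical helper: does the whole input match a.b under the given flags?
   static boolean dotMatches(CharSequence input, int flags) {
      return Pattern.compile("a.b", flags).matcher(input).matches();
   }

   public static void main(String[] args) {
      System.out.println(dotMatches("a-b", 0));                   // true
      System.out.println(dotMatches("a\nb", 0));                  // false: \n is a line terminator
      System.out.println(dotMatches("a\nb", Pattern.DOTALL));     // true
      System.out.println(dotMatches("a\rb", 0));                  // false: \r is a line terminator
      System.out.println(dotMatches("a\rb", Pattern.UNIX_LINES)); // true: only \n terminates lines
   }
}
```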

A character class is a set of characters appearing between [ and ]. There are six kinds of character classes:

■ A simple character class consists of literal characters placed side by side, and matches only these characters. For example, [abc] consists of characters a, b, and c. Also, java RegExDemo t[aiou]ck tack reports a match because a is a member of [aiou]. It also reports a match when the input is tick, tock, or tuck because i, o, and u are members.

■ A negation character class consists of a circumflex metacharacter (^), followed by literal characters placed side by side, and matches all characters except for the characters in the class. For example, [^abc] consists of all characters except for a, b, and c. Also, java RegExDemo "[^b]ox" box does not report a match because b is not a member of [^b], whereas java RegExDemo "[^b]ox" fox reports a match because f is a member. (The double quotes surrounding [^b]ox are necessary on my Windows XP platform because ^ is treated specially at the command line.)

■ A range character class consists of successive literal characters expressed as a starting literal character, followed by the hyphen metacharacter (-), followed by an ending literal character, and matches all characters in this range. For example, [a-z] consists of all characters from a through z. Also, java RegExDemo [h-l]ouse house reports a match because h is a member of the class, whereas java RegExDemo [h-l]ouse mouse does not report a match because m lies outside of the range and is therefore not part of the class. You can combine multiple ranges within the same range character class by placing them side by side; for example, [A-Za-z] consists of all uppercase and lowercase Latin letters.

■ A union character class consists of multiple nested character classes, and matches all characters that belong to the resulting union. For example, [abc[u-z]] consists of characters a, b, c, u, v, w, x, y, and z. Also, java RegExDemo [[0-9][A-F][a-f]] e reports a match because e is a hexadecimal character. (I could have alternatively expressed this character class as [0-9A-Fa-f] by combining multiple ranges.)

■ An intersection character class consists of multiple &&-separated nested character classes, and matches all characters that are common to these nested character classes. For example, [a-c&&[c-f]] consists of character c, which is the only character common to [a-c] and [c-f]. Also, java RegExDemo "[aeiouy&&[y]]" y reports a match because y is common to classes [aeiouy] and [y].

■ A subtraction character class consists of multiple &&-separated nested character classes, where at least one nested character class is a negation character class, and matches all characters except for those indicated by the negation character class/classes. For example, [a-z&&[^x-z]] consists of characters a through w. (The square brackets surrounding ^x-z are necessary; otherwise, ^ is ignored and the resulting class consists of only x, y, and z.) Also, java RegExDemo "[a-z&&[^aeiou]]" g reports a match because g is a consonant and only consonants belong to this class. (I am ignoring y, which is sometimes regarded as a consonant and sometimes regarded as a vowel.)
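The six kinds of character classes can be tried without the command line. The following sketch reuses the regexes from the list above; the CharClassDemo class and its show() and test() helpers are hypothetical. It uses Pattern.matches(), which requires the whole input to match, rather than RegExDemo's find() loop; for these inputs the outcome is the same.

```java
import java.util.regex.Pattern;

class CharClassDemo {
   // Hypothetical helper: does the whole input match regex?
   static boolean test(String regex, CharSequence input) {
      return Pattern.matches(regex, input);
   }

   static void show(String regex, String input) {
      System.out.println(regex + " vs " + input + ": " + test(regex, input));
   }

   public static void main(String[] args) {
      show("t[aiou]ck", "tick");         // simple: true
      show("[^b]ox", "box");             // negation: false
      show("[^b]ox", "fox");             // negation: true
      show("[h-l]ouse", "mouse");        // range: false
      show("[[0-9][A-F][a-f]]", "e");    // union: true
      show("[a-c&&[c-f]]", "c");         // intersection: true
      show("[a-z&&[^aeiou]]", "g");      // subtraction: true
      show("[a-z&&[^aeiou]]", "e");      // subtraction: false
   }
}
```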

A predefined character class is a regex construct for a commonly specified character class. Table 9-6 identifies Pattern's predefined character classes.

Table 9-6. Predefined Character Classes

Predefined Character Class Description

\d Match any digit character. \d is equivalent to [0-9].

\D Match any non-digit character. \D is equivalent to [^\d].

\s Match any whitespace character. \s is equivalent to [ \t\n\x0B\f\r].

\S Match any non-whitespace character. \S is equivalent to [^\s].

\w Match any word character. \w is equivalent to [a-zA-Z_0-9].

\W Match any non-word character. \W is equivalent to [^\w].

For example, java RegExDemo \wbc abc reports a match because \w matches the word character a in abc.
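As a further sketch, the following code counts how often several predefined character classes match within sample text. The PredefinedClassDemo class, the countMatches() helper, and the sample strings are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
   // Hypothetical helper: count successive find() successes of regex in input.
   static int countMatches(String regex, CharSequence input) {
      Matcher m = Pattern.compile(regex).matcher(input);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args) {
      String text = "Order 66, aisle 7";
      System.out.println(countMatches("\\d", text)); // 3 digits
      System.out.println(countMatches("\\s", text)); // 3 spaces
      System.out.println(countMatches("\\W", text)); // 4: three spaces and a comma
      // The underscore is a word character; the hyphen is not.
      System.out.println(countMatches("\\w+", "snake_case kebab-case")); // 3
   }
}
```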

A capturing group saves a match's characters for later recall during pattern matching, and is expressed as a character sequence surrounded by parentheses metacharacters ( and ). All characters within a capturing group are treated as a unit. For example, the (Android) capturing group combines A, n, d, r, o, i, and d into a unit. It matches the Android pattern against all occurrences of Android in the input. Each match replaces the previous match's saved Android characters with the next match's Android characters.

Capturing groups can appear inside other capturing groups. For example, capturing groups (A) and (B(C)) appear inside capturing group ((A)(B(C))), and capturing group (C) appears inside capturing group (B(C)). Each nested or nonnested capturing group receives its own number, numbering starts at 1, and capturing groups are numbered from left to right. For example, ((A)(B(C))) is assigned 1, (A) is assigned 2, (B(C)) is assigned 3, and (C) is assigned 4.

A capturing group saves its match for later recall via a back reference, which is a backslash character followed by a digit character denoting a capturing group number. The back reference causes the matcher to use the back reference's capturing group number to recall the capturing group's saved match, and then use that match's characters to attempt a further match. The following example uses a back reference to determine if the input consists of two consecutive Android patterns:

java RegExDemo "(Android) \1" "Android Android"

RegExDemo reports a match because the matcher detects Android, followed by a space, followed by Android in the input.
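Group numbering and back references can also be observed in code. This sketch relies on Matcher's int groupCount() and String group(int group) methods, which are part of the Matcher class but are not described above; the GroupDemo class and its groups() helper are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
   // Hypothetical helper: return groups 0 through groupCount() for a
   // whole-input match, or null when the input does not match.
   static String[] groups(String regex, CharSequence input) {
      Matcher m = Pattern.compile(regex).matcher(input);
      if (!m.matches())
         return null;
      String[] result = new String[m.groupCount() + 1];
      for (int i = 0; i < result.length; i++)
         result[i] = m.group(i); // group 0 is the entire match
      return result;
   }

   public static void main(String[] args) {
      String[] g = groups("((A)(B(C)))", "ABC");
      for (int i = 0; i < g.length; i++)
         System.out.println("group " + i + " = " + g[i]);
      // Prints ABC, ABC, A, BC, and C for groups 0 through 4.

      System.out.println(Pattern.matches("(Android) \\1", "Android Android")); // true
      System.out.println(Pattern.matches("(Android) \\1", "Android Google"));  // false
   }
}
```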

A boundary matcher is a regex construct for identifying the beginning of a line, a word boundary, the end of text, and other commonly occurring boundaries. See Table 9-7.

Table 9-7. Boundary Matchers

Boundary Matcher Description

^ Match beginning of line.

$ Match end of line.

\b Match word boundary.

\B Match non-word boundary.

\A Match beginning of text.

\G Match end of previous match.

\Z Match end of text except for line terminator (if present).

\z Match end of text.

For example, java RegExDemo \b\b "I think" reports several matches, as revealed in the following output:

Located [] starting at 0 and ending at -1

Located [] starting at 1 and ending at 0

Located [] starting at 2 and ending at 1

Located [] starting at 7 and ending at 6

This output reveals several zero-length matches. When a zero-length match occurs, the starting and ending indexes are equal, although the output shows the ending index to be one less than the starting index because I specified end()-1 in Listing 9-30 (so that a match's end index identifies a non-zero-length match's last character, not the character following the non-zero-length match's last character).

NOTE: A zero-length match occurs in empty input text, at the beginning of input text, after the last character of input text, or between any two characters of that text. Zero-length matches are easy to identify because they always start and end at the same index position.
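The following sketch collects the start index of every match, which makes zero-length boundary matches easy to inspect. The BoundaryDemo class, the matchStarts() helper, and the second sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
   // Hypothetical helper: return the start index of each match of regex in input.
   static List<Integer> matchStarts(String regex, CharSequence input) {
      List<Integer> starts = new ArrayList<Integer>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         starts.add(m.start()); // for a zero-length match, start() equals end()
      return starts;
   }

   public static void main(String[] args) {
      System.out.println(matchStarts("\\b\\b", "I think"));             // [0, 1, 2, 7]
      // \b on both sides rejects the think inside rethink.
      System.out.println(matchStarts("\\bthink\\b", "rethink think")); // [8]
   }
}
```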

The final regex construct I present is the quantifier, a numeric value implicitly or explicitly bound to a pattern. Quantifiers are categorized as greedy, reluctant, or possessive:

■ A greedy quantifier (?, *, or +) attempts to find the longest match. Specify X? to find one or no occurrences of X, X* to find zero or more occurrences of X, X+ to find one or more occurrences of X, X{n} to find n occurrences of X, X{n,} to find at least n (and possibly more) occurrences of X, and X{n,m} to find at least n but no more than m occurrences of X.

■ A reluctant quantifier (??, *?, or +?) attempts to find the shortest match. Specify X?? to find one or no occurrences of X, X*? to find zero or more occurrences of X, X+? to find one or more occurrences of X, X{n}? to find n occurrences of X, X{n,}? to find at least n (and possibly more) occurrences of X, and X{n,m}? to find at least n but no more than m occurrences of X.

■ A possessive quantifier (?+, *+, or ++) is similar to a greedy quantifier except that a possessive quantifier only makes one attempt to find the longest match, whereas a greedy quantifier can make multiple attempts. Specify X?+ to find one or no occurrences of X, X*+ to find zero or more occurrences of X, X++ to find one or more occurrences of X, X{n}+ to find n occurrences of X, X{n,}+ to find at least n (and possibly more) occurrences of X, and X{n,m}+ to find at least n but no more than m occurrences of X.

For an example of a greedy quantifier, execute java RegExDemo .*end "wend rend end". You will discover the following output:

Located [wend rend end] starting at 0 and ending at 12

The greedy quantifier (.*) matches the longest sequence of characters that terminates in end. It starts by consuming all of the input text, and then is forced to back off until it discovers that the input text terminates with these characters.

For an example of a reluctant quantifier, execute java RegExDemo .*?end "wend rend end". You will discover the following output:

Located [wend] starting at 0 and ending at 3
Located [ rend] starting at 4 and ending at 8
Located [ end] starting at 9 and ending at 12

The reluctant quantifier (.*?) matches the shortest sequence of characters that terminates in end. It begins by consuming nothing, and then slowly consumes characters until it finds a match. It then continues until it exhausts the input text.

For an example of a possessive quantifier, execute java RegExDemo .*+end "wend rend end". You will discover that no "Located" message is output.

The possessive quantifier (.*+) does not detect a match because it consumes the entire input text, leaving nothing left over to match end at the end of the regex. Unlike a greedy quantifier, a possessive quantifier does not back off.
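The three behaviors can be compared side by side with a small sketch that returns only the first match for each regex. The QuantifierDemo class and the firstMatch() helper are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
   // Hypothetical helper: return the first match of regex in input, or null.
   static String firstMatch(String regex, CharSequence input) {
      Matcher m = Pattern.compile(regex).matcher(input);
      return m.find() ? m.group() : null;
   }

   public static void main(String[] args) {
      String input = "wend rend end";
      System.out.println(firstMatch(".*end", input));  // wend rend end (greedy)
      System.out.println(firstMatch(".*?end", input)); // wend (reluctant)
      System.out.println(firstMatch(".*+end", input)); // null (possessive never backs off)
   }
}
```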

While working with quantifiers, you will probably encounter zero-length matches. For example, execute java RegExDemo 1? 101101:

Located [1] starting at 0 and ending at 0
Located [] starting at 1 and ending at 0
Located [1] starting at 2 and ending at 2
Located [1] starting at 3 and ending at 3
Located [] starting at 4 and ending at 3
Located [1] starting at 5 and ending at 5
Located [] starting at 6 and ending at 5

The result of this greedy quantifier is that 1 is detected at locations 0, 2, 3, and 5 in the input text, and that nothing is detected (a zero-length match) at locations 1, 4, and 6.

This time, execute java RegExDemo 1?? 101101:

Located [] starting at 0 and ending at -1
Located [] starting at 1 and ending at 0
Located [] starting at 2 and ending at 1
Located [] starting at 3 and ending at 2
Located [] starting at 4 and ending at 3
Located [] starting at 5 and ending at 4
Located [] starting at 6 and ending at 5

This output might look surprising, but remember that a reluctant quantifier looks for the shortest match, which (in this case) is a zero-length match at every position.

Finally, execute java RegExDemo 1+? 101101:

Located [1] starting at 0 and ending at 0
Located [1] starting at 2 and ending at 2
Located [1] starting at 3 and ending at 3
Located [1] starting at 5 and ending at 5

This reluctant quantifier matches only at the locations where 1 is detected in the input text. Because + requires at least one occurrence of 1, it does not perform zero-length matches.
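For comparison, the following sketch runs the reluctant 1+? alongside its greedy (1+) and possessive (1++) counterparts on the input 101101. The ZeroLengthDemo class and the findAll() helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {
   // Hypothetical helper: return the text of every match of regex in input.
   static List<String> findAll(String regex, CharSequence input) {
      List<String> found = new ArrayList<String>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         found.add(m.group());
      return found;
   }

   public static void main(String[] args) {
      String input = "101101";
      System.out.println(findAll("1+?", input)); // [1, 1, 1, 1]
      System.out.println(findAll("1+", input));  // [1, 11, 1]
      System.out.println(findAll("1++", input)); // [1, 11, 1]
      System.out.println(findAll("1?", input).size()); // 7, including three empty matches
   }
}
```

The reluctant form stops after a single 1, so the adjacent pair at locations 2 and 3 is reported as two matches; the greedy and possessive forms consume the pair as one match.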

NOTE: Refer to the JDK documentation on the Pattern class to learn about additional regex constructs.

Most of the previous regex examples have not been practical, except to help you grasp how to use the various regex constructs. In contrast, the following examples reveal a regex that matches phone numbers of the form (ddd) ddd-dddd or ddd-dddd. A single space appears between (ddd) and ddd; there is no space on either side of the hyphen.

java RegExDemo "(\(\d{3}\))?\s*\d{3}-\d{4}" "(800) 555-1212"

regex = (\(\d{3}\))?\s*\d{3}-\d{4}
input = (800) 555-1212
Located [(800) 555-1212] starting at 0 and ending at 13

java RegExDemo "(\(\d{3}\))?\s*\d{3}-\d{4}" 555-1212

regex = (\(\d{3}\))?\s*\d{3}-\d{4}
input = 555-1212
Located [555-1212] starting at 0 and ending at 7
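In application code, a regex such as this one is typically compiled once and reused. The following sketch validates entire strings with matches() instead of searching with find(); the PhoneDemo class, the PHONE constant, and the isPhone() helper are hypothetical names.

```java
import java.util.regex.Pattern;

class PhoneDemo {
   // Compile once and reuse; backslashes are doubled inside the string literal.
   private static final Pattern PHONE =
      Pattern.compile("(\\(\\d{3}\\))?\\s*\\d{3}-\\d{4}");

   // Hypothetical helper: true when the entire input is a phone number.
   static boolean isPhone(CharSequence input) {
      return PHONE.matcher(input).matches();
   }

   public static void main(String[] args) {
      System.out.println(isPhone("(800) 555-1212")); // true
      System.out.println(isPhone("555-1212"));       // true
      System.out.println(isPhone("800 555-1212"));   // false: area code lacks parentheses
      System.out.println(isPhone("55-1212"));        // false: too few digits
   }
}
```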

NOTE: To learn more about regular expressions, check out my JavaWorld article "Regular Expressions Simplify Pattern-Matching Code" (http://www.javaworld.com/javaworld/jw-02-2003/jw-0207-java101.html). Also, you should check out "Lesson: Regular Expressions" (http://download-llnw.oracle.com/javase/tutorial/essential/regex/index.html) in The Java Tutorials' Essential Classes trail.

EXERCISES

The following exercises are designed to test your understanding of this chapter's additional utility APIs:

1. Define task.

2. Define executor.

3. Identify the Executor interface's limitations.

4. How are Executor's limitations overcome?

5. What differences exist between Runnable's run() method and Callable's call() method?

6. True or false: You can throw checked and unchecked exceptions from Runnable's run() method but can only throw unchecked exceptions from Callable's call() method.

7. Define future.

8. Describe the Executors class's newFixedThreadPool() method.

9. Define synchronizer.

10. Identify and describe four commonly used synchronizers.

11. What concurrency-oriented extensions to the collections framework are provided by the concurrency utilities?

12. Define lock.

13. What is the biggest advantage that Lock objects hold over the implicit locks that are obtained when threads enter critical sections (controlled via the synchronized reserved word)?

14. Define atomic variable.

15. Define internationalization.

16. Define locale.

17. What are the components of a Locale object?

18. Define resource bundle.

19. True or false: If a property resource bundle and a list resource bundle have the same complete resource bundle name, the list resource bundle takes precedence over the property resource bundle.

20. Define break iterator.

21. What kinds of break iterators does the Break Iterator API support?

22. True or false: You can pass any Locale object to any of BreakIterator's factory methods that take Locale arguments.

23. What is a collator?

24. Define date, time zone, and calendar.

25. True or false: Date instances can represent dates prior to or after the Unix epoch.

26. How would you obtain a TimeZone object that represents Central Standard Time?

27. Assuming that cal identifies a Calendar instance and locale identifies a specific locale, how would you obtain a localized name for the month represented by cal?

28. Define formatter.

29. What kinds of formatters does NumberFormat return?

30. True or false: DateFormat's getInstance() factory method is a shortcut to obtaining a default date/time formatter that uses the MEDIUM style for both the date and the time.

31. What does a message formatter let you accomplish?

32. Define preference.

33. Why is the Properties API problematic for persisting preferences?

34. How does the Preferences API persist preferences?

35. What does the Random class accomplish?

36. Define regular expression.

37. What does the Pattern class accomplish?

38. What do Pattern's compile() methods do when they discover illegal syntax in their regular expression arguments?

39. What does the Matcher class accomplish?

40. What is the difference between Matcher's matches() and lookingAt() methods?

41. Define character class.

42. Identify the various kinds of character classes.

43. Define capturing group.

44. What is a zero-length match?

45. Define quantifier.

46. What is the difference between a greedy quantifier and a reluctant quantifier?

47. How do possessive and greedy quantifiers differ?

48. Create a SpanishCollation application that outputs Spanish words ñango (weak), llamado (called), lunes (Monday), champán (champagne), clamor (outcry), cerca (near), nombre (name), and chiste (joke) according to this language's current collation rules followed by its traditional collation rules. According to the current collation rules, the output order is as follows: cerca, champán, chiste, clamor, llamado, lunes, nombre, and ñango. According to the traditional collation rules, the output order is as follows: cerca, clamor, champán, chiste, lunes, llamado, nombre, and ñango. Use the RuleBasedCollator class to specify the rules for traditional collation. Also, construct your Locale object using only the es (Spanish) language code.

NOTE: The Spanish alphabet consists of 29 letters: a, b, c, ch, d, e, f, g, h, i, j, k, l, ll, m, n, ñ, o, p, q, r, s, t, u, v, w, x, y, z. (Vowels are often written with accents, as in tablón [plank or board], and u is sometimes topped with a dieresis or umlaut, as in vergüenza [bashfulness]. However, vowels with these diacritical marks are not considered separate letters.) Prior to April 1994's voting at the X Congress of the Association of Spanish Language Academies, ch was collated after c, and ll was collated after l. Because this congress adopted the standard Latin alphabet collation rules, ch is now considered a sequence of two distinct characters, and dictionaries now place words starting with ch between words starting with ce and ci. Similarly, ll is now considered a sequence of two characters.

49. Create a RearrangeText application that takes a single text argument of the form x, y and outputs y x. For example, java RearrangeText "Gosling, Dr. James" outputs "Dr. James Gosling".

50. Create a ReplaceText application that takes input text, a pattern that specifies text to replace, and replacement text command-line arguments, and uses Matcher's String replaceAll(String replacement) method to replace all matches of the pattern with the replacement text (passed to replacement). For example, java ReplaceText "too     many    embedded  spaces" "\s+" " " should output too many embedded spaces with only a single space character between successive words.
