Regular expressions in Java, Part 1: Pattern matching and the Pattern class

Java's character and assorted string classes offer low-level support for pattern matching, but that support typically leads to complex code. For simpler and more efficient coding, Java offers the Regex API. This two-part tutorial helps you get started with regular expressions and the Regex API. First we'll unpack the three powerful classes residing in the java.util.regex package, then we'll explore the Pattern class and its sophisticated pattern-matching constructs.

Download the source code for the example applications in this tutorial. Created by Jeff Friesen for JavaWorld.

What are regular expressions?

A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines which strings belong to the set. A pattern consists of literal characters and metacharacters, which are characters that have special meaning instead of a literal meaning.

Pattern matching is the process of searching text to identify matches, or strings that match a regex's pattern. Java supports pattern matching via its Regex API. The API consists of three classes (Pattern, Matcher, and PatternSyntaxException), all located in the java.util.regex package:

  • Pattern objects, also known as patterns, are compiled regexes.
  • Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
  • PatternSyntaxException objects describe illegal regex patterns.

Java also provides support for pattern matching via various methods in its java.lang.String class. For example, boolean matches(String regex) returns true only if the invoking string exactly matches the regular expression passed as regex.

Convenience methods

Behind the scenes, matches() and String's other regex-oriented convenience methods are implemented in terms of the Regex API.
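To make that relationship concrete, here is a minimal sketch that compares String's matches() with the equivalent Regex API calls. The class name MatchesDemo and the helper wholeMatch() are hypothetical names chosen for this illustration; they are not part of the tutorial's downloadable code.

```java
import java.util.regex.Pattern;

class MatchesDemo {
    // Hypothetical helper: true only if the entire input matches the regex,
    // which is the same contract as String.matches().
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The whole string equals the literal pattern.
        System.out.println("apple".matches("apple"));      // true
        // The trailing t prevents an exact match.
        System.out.println("applet".matches("apple"));     // false
        // Same question asked through the Regex API.
        System.out.println(wholeMatch("apple", "applet")); // false
    }
}
```

Note that matches() demands that the whole string match, which differs from the find() loop that RegexDemo uses to locate matches anywhere in the text.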

RegexDemo

I've created the RegexDemo application to demonstrate Java's regular expressions and the various methods located in the Pattern, Matcher, and PatternSyntaxException classes. Here's the source code for the demo:

Listing 1. Demonstrating regexes

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexDemo {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java RegexDemo regex input");
            return;
        }
        // Convert new-line (\n) character sequences to new-line characters.
        args[1] = args[1].replaceAll("\\\\n", "\n");
        try {
            System.out.println("regex = " + args[0]);
            System.out.println("input = " + args[1]);
            Pattern p = Pattern.compile(args[0]);
            Matcher m = p.matcher(args[1]);
            while (m.find())
                System.out.println("Found [" + m.group() + "] starting at "
                                   + m.start() + " and ending at "
                                   + (m.end() - 1));
        } catch (PatternSyntaxException pse) {
            System.err.println("Bad regex: " + pse.getMessage());
            System.err.println("Description: " + pse.getDescription());
            System.err.println("Index: " + pse.getIndex());
            System.err.println("Incorrect pattern: " + pse.getPattern());
        }
    }
}

The first thing RegexDemo's main() method does is validate its command line. It requires two arguments: the first argument is a regex, and the second argument is the input text to be matched against the regex.

You might want to specify a new-line (\n) character as part of the input text. The only way to accomplish this is to specify a \ character followed by an n character. main() converts this character sequence to Unicode value 10.
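The conversion is easy to get wrong because of the layers of escaping involved, so here is a small sketch that isolates it. The class NewlineDemo and its convert() helper are hypothetical names used only for this illustration; the replaceAll() call itself is the one from Listing 1.

```java
class NewlineDemo {
    // Replaces each two-character sequence (backslash followed by n)
    // with a single new-line character, as Listing 1 does for args[1].
    static String convert(String arg) {
        // Regex \\n (written "\\\\n" in Java source) matches a literal backslash plus n.
        return arg.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // What the shell would hand over: o, n, e, backslash, n, t, w, o.
        String raw = "one\\ntwo";
        String converted = convert(raw);
        System.out.println(raw.length());              // 8
        System.out.println(converted.length());        // 7
        System.out.println((int) converted.charAt(3)); // 10
    }
}
```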

The bulk of RegexDemo's code is located in the try-catch construct. The try block first outputs the specified regex and input text, and then creates a Pattern object that stores the compiled regex. (Regexes are compiled to improve performance during pattern matching.) A matcher is obtained from the Pattern object and used to repeatedly search for matches until none remain. The catch block invokes various PatternSyntaxException methods to extract useful information about the exception. This information is subsequently output.
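To see the catch block's role in isolation, consider this minimal sketch, which compiles a deliberately malformed regex (an unclosed character class). The class BadRegexDemo and the helper rejectedPattern() are hypothetical names for this illustration. I don't show the exact description text because it depends on the Java runtime's wording.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BadRegexDemo {
    // Hypothetical helper: returns null if the regex compiles; otherwise
    // reports the problem and returns the pattern stored in the exception.
    static String rejectedPattern(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException pse) {
            System.err.println("Description: " + pse.getDescription());
            System.err.println("Index: " + pse.getIndex());
            return pse.getPattern();
        }
    }

    public static void main(String[] args) {
        // "[" opens a character class that is never closed.
        System.out.println(rejectedPattern("["));
        // A well-formed regex compiles, so the helper returns null.
        System.out.println(rejectedPattern("[abc]"));
    }
}
```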

You don't need to know more about the source code's workings at this point; they will become clear when you explore the API in Part 2. You do need to compile Listing 1, however. Grab the code from Listing 1, then type the following into your command line to compile RegexDemo:

javac RegexDemo.java

Pattern and its constructs

Pattern, the first of the three classes comprising the Regex API, is a compiled representation of a regular expression. Pattern's SDK documentation describes various regex constructs, but unless you're already an avid regex user, you might be confused by parts of the documentation. What are quantifiers, and what's the difference between greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I'll answer these questions and more in the next sections.

Literal strings

The simplest regex construct is the literal string. Some portion of the input text must match this construct's pattern in order to have a successful pattern match. Consider the following example:

java RegexDemo apple applet

This example attempts to discover whether there is a match for the apple pattern in the applet input text. The following output reveals the match:

regex = apple
input = applet
Found [apple] starting at 0 and ending at 4

The output shows us the regex and input text, and then indicates a successful match of apple within applet. Additionally, it presents the starting and ending indexes of that match: 0 and 4, respectively. The starting index identifies the first text location where a pattern match occurs; the ending index identifies the last text location for the match.

Now suppose we specify the following command line:

java RegexDemo apple crabapple

This time, we get the following match with different starting and ending indexes:

regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8

The reverse scenario, in which applet is the regex and apple is the input text, reveals no match. The entire regex must match, and in this case the input text does not contain a t after apple.
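The three literal-string scenarios above can also be checked without the command line. The following is a minimal sketch; the class LiteralDemo and the helper firstMatch() are hypothetical names for this illustration, and the helper reports the inclusive ending index the same way RegexDemo does (end() minus 1).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralDemo {
    // Hypothetical helper: returns "start-end" (inclusive end) for the
    // first match of regex in input, or "none" when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + (m.end() - 1) : "none";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("apple", "applet"));    // 0-4
        System.out.println(firstMatch("apple", "crabapple")); // 4-8
        System.out.println(firstMatch("applet", "apple"));    // none
    }
}
```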

Metacharacters

More powerful regex constructs combine literal characters with metacharacters. For example, in a.b, the period metacharacter (.) represents any character that appears between a and b. Consider the following example:

java RegexDemo .ox "The quick brown fox jumps over the lazy ox."

This example specifies .ox as the regex and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that begin with any character and end with ox. It produces the following output:

regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41

The output reveals two matches: fox and ox (with the leading space character). The . metacharacter matches the f in the first match and the space character in the second match.

What happens when we replace .ox with the period metacharacter? That is, what output results from specifying the following command line:

java RegexDemo . "The quick brown fox jumps over the lazy ox."

Because the period metacharacter matches any character, RegexDemo outputs a match for each character (including the terminating period character) in the input text:

regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42
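Rather than reading through every line of that output, you can simply count the matches. Here is a minimal sketch that does so for both regexes used in this section; the class PeriodDemo and the helper countMatches() are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PeriodDemo {
    // Hypothetical helper: counts how many times find() succeeds.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy ox.";
        // Any character followed by ox: "fox" and " ox".
        System.out.println(countMatches(".ox", text)); // 2
        // One match per character, indexes 0 through 42.
        System.out.println(countMatches(".", text));   // 43
    }
}
```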

Quoting metacharacters

To specify . or any metacharacter as a literal character in a regex construct, quote the metacharacter in one of the following ways:

  • Precede the metacharacter with a backslash character.
  • Place the metacharacter between \Q and \E (e.g., \Q.\E).

Remember to double each backslash character (as in \\. or \\Q.\\E) that appears in a string literal such as String regex = "\\.";. Don't double the backslash character when it appears as part of a command-line argument.
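The following sketch shows both quoting styles written as Java string literals, so the backslashes are doubled. It also uses Pattern.quote(), a standard Pattern method that this section does not cover, as a third option; treat its inclusion as my addition. The class QuoteDemo and the helper countMatches() are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: counts how many times find() succeeds.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy ox.";
        // Unquoted period: a metacharacter that matches every character.
        System.out.println(countMatches(".", text));                // 43
        // Backslash-quoted period: matches only the literal period.
        System.out.println(countMatches("\\.", text));              // 1
        // \Q...\E quoting: also matches only the literal period.
        System.out.println(countMatches("\\Q.\\E", text));          // 1
        // Pattern.quote() builds a literal pattern string for us.
        System.out.println(countMatches(Pattern.quote("."), text)); // 1
    }
}
```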

Character classes

We sometimes need to limit characters that will produce matches to a specific character set. For example, we might search text for vowels a, e, i, o, and u, where any occurrence of a vowel indicates a match. A character class identifies a set of characters between square-bracket metacharacters ([ ]), helping us accomplish this task. Pattern supports simple, negation, range, union, intersection, and subtraction character classes. We'll look at all of these below.

Simple character class

The simple character class consists of characters placed side by side and matches only those characters. For example, [abc] matches characters a, b, and c.

Consider the following example:

java RegexDemo [csw] cave

This example matches only c with its counterpart in cave, as shown in the following output:

regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0

Negation character class

The negation character class begins with the ^ metacharacter and matches only those characters not located in that class. For example, [^abc] matches all characters except a, b, and c.

Consider this example:

java RegexDemo "[^csw]" cave

Note that the double quotes are necessary on my Windows platform, whose shell treats the ^ character as an escape character.

This example matches a, v, and e with their counterparts in cave, as shown here:

regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3
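Seen side by side, the simple class and its negation split the input text between them. This minimal sketch makes that visible by concatenating every matched character; the class SimpleNegationDemo and the helper matched() are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleNegationDemo {
    // Hypothetical helper: concatenates the text of every match, in order.
    static String matched(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simple class: only c, s, or w can match.
        System.out.println(matched("[csw]", "cave"));  // c
        // Negation class: everything except c, s, and w matches.
        System.out.println(matched("[^csw]", "cave")); // ave
    }
}
```

No shell quoting is needed here because the regex never passes through a command interpreter.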

Range character class

The range character class consists of two characters separated by a hyphen metacharacter (-). All characters beginning with the character on the left of the hyphen and ending with the character on the right of the hyphen belong to the range. For example, [a-z] matches all lowercase alphabetic characters. It's equivalent to specifying [abcdefghijklmnopqrstuvwxyz].

Consider the following example:

java RegexDemo [a-c] clown

This example matches only c with its counterpart in clown, as shown:

regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0

Merging multiple ranges

You can merge multiple ranges into the same range character class by placing them side by side. For example, [a-zA-Z] matches all lowercase and uppercase alphabetic characters.
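As a quick sketch of a single range versus merged ranges, the following code repeats the clown example and then applies [a-zA-Z] to a short string that I made up for this illustration. The class RangeDemo and the helper matched() are likewise hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeDemo {
    // Hypothetical helper: concatenates the text of every match, in order.
    static String matched(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Single range: of the letters in clown, only c lies between a and c.
        System.out.println(matched("[a-c]", "clown"));     // c
        // Merged ranges: letters of either case match; the space and digit do not.
        System.out.println(matched("[a-zA-Z]", "Java 8")); // Java
    }
}
```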

Union character class

The union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [a-d[m-p]] matches characters a through d and m through p.

Consider the following example:

java RegexDemo [ab[c-e]] abcdef

This example matches a, b, c, d, and e with their counterparts in abcdef:

regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4
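The same union can be exercised from code. This minimal sketch repeats the abcdef example and adds the [a-d[m-p]] class mentioned earlier, applied to an input string I chose for illustration. The class UnionDemo and the helper matched() are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnionDemo {
    // Hypothetical helper: concatenates the text of every match, in order.
    static String matched(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Union of {a, b} and the range c through e; f is left out.
        System.out.println(matched("[ab[c-e]]", "abcdef")); // abcde
        // Union of a through d and m through p; z is in neither range.
        System.out.println(matched("[a-d[m-p]]", "admpz")); // admp
    }
}
```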

Intersection character class

The intersection character class consists of characters common to all nested classes and matches only common characters. For example, [a-z&&[d-f]] matches characters d, e, and f.

Consider the following example:

java RegexDemo "[aeiouy&&[y]]" party

Note that the double quotes are necessary on my Windows platform, whose shell treats the & character as a command separator.

This example matches only y with its counterpart in party:

regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4
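Here is a minimal sketch of both intersection classes from this section. The input string for [a-z&&[d-f]] is my own choice for illustration, and the class IntersectionDemo and the helper matched() are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntersectionDemo {
    // Hypothetical helper: concatenates the text of every match, in order.
    static String matched(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Only y belongs to both [aeiouy] and [y].
        System.out.println(matched("[aeiouy&&[y]]", "party"));  // y
        // Only d, e, and f belong to both a-z and d-f.
        System.out.println(matched("[a-z&&[d-f]]", "abcdefg")); // def
    }
}
```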