This book is about a powerful tool called "regular expressions". It teaches you how to use regular expressions to solve problems and get the most out of tools and languages that provide them. Most documentation that mentions regular expressions doesn't even begin to hint at their power, but this book is about mastering regular expressions.

Regular expressions are available in many types of tools (editors, word processors, system tools, database engines, and such), but their power is most fully exposed when available as part of a programming language. Examples include Java and JScript, Visual Basic and VBScript, JavaScript and ECMAScript, C, C++, C#, elisp, Perl, Python, Tcl, Ruby, PHP, sed, and awk. In fact, regular expressions are the very heart of many programs written in some of these languages.

There's a good reason that regular expressions are found in so many diverse languages and applications: they are extremely powerful. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data. Control it. Put it to work for you. To master regular expressions is to master your data.

The Need for This Book

I finished the first edition of this book in late 1996, and wrote it simply because there was a need. Good documentation on regular expressions just wasn't available, so most of their power went untapped. Regular-expression documentation was available, but it centered on the "low-level view." It seemed to me that they were analogous to showing someone the alphabet and expecting them to learn to speak.

The five and a half years between the first and second editions of this book saw the popular rise of the Internet, and, perhaps more than just coincidentally, a considerable expansion in the world of regular expressions.
The regular expressions of almost every tool and language became more powerful and expressive. Perl, Python, Tcl, Java, and Visual Basic all got new regular-expression backends. New languages with regular expression support, like PHP, Ruby, and C#, were developed and became popular. During all this time, the basic core of the book (how to truly understand regular expressions and how to get the most from them) remained as important and relevant as ever.

Yet, the first edition gradually started to show its age. It needed updating to reflect the new languages and features, as well as the expanding role that regular expressions played in the Internet world. The second edition was published in 2002, a year that saw the landmark releases of java.util.regex, Microsoft's .NET Framework, and Perl 5.8. They were all covered fully in the second edition.

My one regret with the second edition was that it didn't give more attention to PHP. In the four years since the second edition was published, PHP has only grown in importance, so it became imperative to correct that deficiency. This third edition features enhanced PHP coverage in the early chapters, plus an all new, expansive chapter devoted entirely to PHP regular expressions and how to wield them effectively. Also new in this edition, the Java chapter has been rewritten and expanded considerably to reflect new features of Java 1.5 and Java 1.6.

Intended Audience

This book will interest anyone who has an opportunity to use regular expressions. If you don't yet understand the power that regular expressions can provide, you should benefit greatly as a whole new world is opened up to you. This book should expand your understanding, even if you consider yourself an accomplished regular-expression expert. After the first edition, it wasn't uncommon for me to receive an email that started "I thought I knew regular expressions until I read Mastering Regular Expressions. Now I do."
Programmers working on text-related tasks, such as web programming, will find an absolute gold mine of detail, hints, tips, and understanding that can be put to immediate use. The detail and thoroughness is simply not found anywhere else.

Regular expressions are an idea, one that is implemented in various ways by various utilities (many, many more than are specifically presented in this book). If you master the general concept of regular expressions, it's a short step to mastering a particular implementation. This book concentrates on that idea, so most of the knowledge presented here transcends the utilities and languages used to present the examples.

How to Read This Book

This book is part tutorial, part reference manual, and part story, depending on when you use it. Readers familiar with regular expressions might feel that they can immediately begin using this book as a detailed reference, flipping directly to the section on their favorite utility. I would like to discourage that.

You'll get the most out of this book by reading the first six chapters as a story. I have found that certain habits and ways of thinking help in achieving a full understanding, but are best absorbed over pages, not merely memorized from a list.

The story that is the first six chapters forms the basis for the last four, covering specifics of Perl, Java, .NET, and PHP. To help you get the most from each part, I've used cross references liberally, and I've worked hard to make the index as useful as possible. (Over 1,200 cross references are sprinkled throughout the book; they are often presented as "☞" followed by a page number.)

Until you read the full story, this book's use as a reference makes little sense. Before reading the story, you might look at one of the tables, such as the chart on page 92, and think it presents all the relevant information you need to know. But a great deal of background information does not appear in the charts themselves, but rather in the associated story.
Once you've read the story, you'll have an appreciation for the issues, what you can remember off the top of your head, and what is important to check up on.

Organization

The ten chapters of this book can be logically divided into roughly three parts. Here's a quick overview:
The introduction elevates the absolute novice to "issue-aware" novice. Readers with a fair amount of experience can feel free to skim the early chapters, but I particularly recommend Chapter 3 even for the grizzled expert.
The Details

Once you have the basics down, it's time to investigate the how and the why. Like the "teach a man to fish" parable, truly understanding the issues will allow you to apply that knowledge whenever and wherever regular expressions are found.
Tool-Specific Information

Once the lessons of Chapters 4, 5, and 6 are under your belt, there is usually little to say about specific implementations. However, I've devoted an entire chapter to each of four popular systems:
Typographical Conventions

When doing (or talking about) detailed and complex text processing, being precise is important. The mere addition or subtraction of a space can make a world of difference, so I've used the following special conventions in typesetting this book:
Exercises

Occasionally, and particularly in the early chapters, I'll pose a question to highlight the importance of the concept under discussion. They're not there just to take up space; I really do want you to try them before continuing. Please. So as not to dilute their importance, I've sprinkled only a few throughout the entire book. They also serve as checkpoints: if they take more than a few moments, it's probably best to go over the relevant section again before continuing on.

To help entice you to actually think about these questions as you read them, I've made checking the answers a breeze: just turn the page. Answers to marked questions are always found by turning just one page. This way, they're out of sight while you think about the answer, but are within easy reach.

Links, Code, Errata, and Contacts

I learned the hard way with the first edition that URLs change more quickly than a printed book can be updated, so rather than providing an appendix of URLs, I'll provide just one: http://regex.info/

There you can find regular-expression links, all code snippets from this book, a searchable index, and much more. In the unlikely event this book contains an error :-), the errata will be available as well.

If you find an error in this book, or just want to drop me a note, you can contact me at jfriedl@regex.info.

The publisher can be contacted at:
For more information about books, conferences, Resource Centers, and the O'Reilly Network, see the O'Reilly web site at: http://www.oreilly.com

SafariEnabled

When you see a SafariEnabled icon on the cover of your favorite technology book, that means the book is available online through the O'Reilly Network Safari Bookshelf. Safari offers a solution that's better than e-books. It's a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.

Personal Comments and Acknowledgments

Writing the first edition of this book was a grueling task that took two and a half years and the help of many people. After the toll it took on my health and sanity, I promised that I'd never put myself through such an experience again.

I have many people to thank for helping me break that promise. Foremost is my wife, Fumie. If you find this book useful, thank her; without her support and understanding, I'd have neither the strength nor sanity to undertake a task as arduous as the research, writing, and production of a book like this.

While researching and writing this book, many people helped educate me on languages or systems I didn't know, and more still reviewed and corrected drafts as the manuscripts developed. In particular, I'd like to thank my brother, Stephen Friedl, for his meticulous and detailed reviews along the way. (Besides being an excellent technical reviewer, he's also an accomplished writer, known for his well-researched "Tech Tips," available at http://www.unixwiz.net/.)

I'd also like to thank Zak Greant, Ian Morse, Philip Hazel, Stuart Gill, William F. Maton, and my editor, Andy Oram. Special thanks for providing an insider's look at Java go to Mike "madbot" McCloskey (formerly at Sun Microsystems, now at Google), and Mark Reinhold and Dr. Cliff Click, both of Sun Microsystems.
For .NET insight, I'd like to thank Microsoft's David Gutierrez, Kit George, and Ryan Byington. I thank Andrei Zmievski of Yahoo! for providing insights into PHP. I'd like to thank Dr. Ken Lunde of Adobe Systems, who created custom characters and fonts for a number of the typographical aspects of this book. The Japanese characters are from Adobe Systems' Heisei Mincho W3 typeface, while the Korean is from the Korean Ministry of Culture and Sports Munhwa typeface. It's also Ken who originally gave me the guiding principle that governs my writing: "you do the research so your readers don't have to." For help in setting up the server for http://regex.info, I'd like to thank Jeffrey Papen and Peak Web Hosting (http://www.PeakWebhosting.com/).