Regex For Html Title?

I'm trying to scrape an HTML page for its title using a regular expression. Here's what I'm trying: \\A\Z\ Any suggestions?

Solution 1:

<title>(.*?)</title>

The parentheses around .*? create a capture group that you can reference afterwards. Your regular expression library will probably have a way to return what each capture group matched. Group 0 is the whole match, so you should pick group 1. Groups are numbered by the order of their opening parenthesis, and there's only one pair here.
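As a rough illustration of picking group 1, here is a minimal Python sketch using the standard re module. The sample page and the IGNORECASE and DOTALL flags are my own assumptions, added so that <TITLE> and titles spanning several lines are also matched.

```python
import re

# Hypothetical sample page, only for illustration.
html = "<html><head><title>Example Page</title></head><body></body></html>"

# IGNORECASE accepts <TITLE>; DOTALL lets .*? cross line breaks inside the title.
pattern = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

match = pattern.search(html)
whole = match.group(0) if match else None  # group 0: the entire match, tags included
title = match.group(1) if match else None  # group 1: only the text between the tags

print(whole)
print(title)
```

search() scans for the pattern anywhere in the string, so no padding around the tags is needed here.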

In some libraries, you need:

.*?<title>(.*?)</title>.*

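because some matching functions require the pattern to match the entire string. In most flavors . does not match line breaks by default, so on a multi-line page this version also needs the dot-all (single-line) option.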
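To make the difference concrete, here is a small Python sketch. I'm using re.fullmatch as a stand-in for a library function that insists on matching the whole string; the sample page is an assumption for illustration.

```python
import re

# Hypothetical multi-line page, only for illustration.
html = "<html><head>\n<title>Example Page</title>\n</head></html>"

bare = re.compile(r"<title>(.*?)</title>", re.DOTALL)
padded = re.compile(r".*?<title>(.*?)</title>.*", re.DOTALL)

# fullmatch() succeeds only if the pattern consumes the entire string.
bare_result = bare.fullmatch(html)      # fails: the string does not start with <title>
padded_result = padded.fullmatch(html)  # succeeds: .*? and .* absorb the rest

padded_title = padded_result.group(1) if padded_result else None
print(bare_result)
print(padded_title)
```

Without DOTALL the padded pattern would fail here as well, because the leading .*? could not get past the first line break.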

  • \A matches the start of the string
  • \< matches the start of a word, in flavors that support it (such as GNU grep and Vim)
  • \> matches the end of a word, in the same flavors

Be aware that this is not foolproof. A page like the following will break your regular expression:

<html><head><script>
      // <title>HAHA YOU GOT THE WRONG TITLE</title></script><title>The Actual title</title></head><body></body></html>

You can try to rule this out by making your regex more complicated, so that it skips such content before matching the title. However, that doesn't really work, because the fake title could just as well be in an HTML comment <!-- <title></title> --> or in a /* JavaScript */ comment.
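A quick Python sketch of how the simple pattern gets fooled. The first page is modeled on the example above; the second page, with its "Old title" comment, is a made-up illustration of the HTML comment case.

```python
import re

pattern = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# Fake title inside a script block, modeled on the page shown earlier.
in_script = (
    "<html><head><script>\n"
    "      // <title>HAHA YOU GOT THE WRONG TITLE</title></script>"
    "<title>The Actual title</title></head><body></body></html>"
)

# Fake title inside an HTML comment (hypothetical page).
in_comment = (
    "<html><head><!-- <title>Old title</title> -->"
    "<title>The Actual title</title></head><body></body></html>"
)

# search() returns the first textual occurrence, whatever context it sits in.
script_result = pattern.search(in_script).group(1)
comment_result = pattern.search(in_comment).group(1)

print(script_result)
print(comment_result)
```

In both cases the regex returns the fake title, because it has no notion of scripts or comments.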

Thus, it is better to use an actual HTML parser. A quick Google search will turn up one for most languages.
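For instance, in Python the standard library's html.parser module is enough for this job. What follows is only a sketch: TitleParser and extract_title are names I made up, and the test page is modeled on the tricky example above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore any title after the first one.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Script bodies also arrive here, but _in_title is False for them.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip() if self._done else None


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


page = (
    "<html><head><script>\n"
    "      // <title>HAHA YOU GOT THE WRONG TITLE</title></script>"
    "<title>The Actual title</title></head><body></body></html>"
)
print(extract_title(page))
```

The parser treats the contents of <script> as plain text and hands comments to a separate callback, so neither kind of fake title is ever seen as a real <title> tag.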
