Monthly archive: June 2015

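PHP References Explained

Let us first look at how the official English documentation explains references.

References in PHP are a means of accessing the same variable content by different names. In PHP, a variable name and its variable content are different things, and a reference is a symbol table alias. References are not like C pointers: you cannot perform arithmetic on them. The closest analogy is the Unix file system, where variable names correspond to file names, variable content corresponds to the file itself, and a reference corresponds to a hard link.

So how do PHP references differ from C pointers? Suppose we declare an integer pointer in C, "int *p". We can increment it with "p++", after which it points to the next address in memory. We can also assign it a new address, "p = n", after which p points to whatever n points to and both hold the same address. In PHP we cannot add to or subtract from a reference. When we assign by reference, as in "$b = &$a", the new $b is just a variable name, exactly like $a; the & makes the name $b point to the variable content of $a. $b is still only a name, which the official documentation calls an alias of $a. Put plainly, using "&" does not allocate new storage for the value: the new variable points to storage that already exists.

What are references used for in PHP? They have three basic uses: assigning by reference, passing by reference, and returning by reference.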

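Assigning by reference

With "$a = &$b", $a and $b are completely equal: both point to the same content.

Passing by reference

By default, PHP passes function arguments by value: the value of the argument is copied into the function. To modify a value that lives outside the function from inside it, pass it by reference.

Returning by reference

A function can return a reference instead of a copy of the value.

For details on these uses, see http://php.net/manual/en/language.references.whatdo.php

Object variables in PHP 5 deserve special attention. In PHP 5 an object variable no longer holds the whole object; it holds only an identifier that is used to reach the actual object. When an object is passed as an argument, returned as a result, or assigned to another variable, the two variables are not references to each other. Each one holds a copy of the same identifier, and that identifier points to the same object. The code below illustrates the points made in this article.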

<?php

$a = 1;
$b = $a;
$b = 4;
echo $a . '<br>'; // 1

$b = &$a;
$b = 4;
echo $a . '<br>'; // 4
$c = $b;
$c = 5;
echo $b . '<br>'; // 4
echo $a . '<br><br>'; // 4

$a= 1;
function func (&$args){
    $args = 6;
}
func($a);
echo $a; // 6
?>

References and object variables

<?php
class A {
    public $foo = 'empty';
}
class B {
    public $foo = 'empty';
    public $bar = 'hello';
}

function normalAssignment($obj) {
    $obj->foo = 'changed';
    $obj = new B;
}

function referenceAssignment(&$obj) {
    $obj->foo = 'changed';
    $obj = new B;
}

$a = new A;
normalAssignment($a);
echo get_class($a), "\n";
echo "foo = {$a->foo}\n";

referenceAssignment($a);
echo get_class($a), "\n";
echo "foo = {$a->foo}\n";
echo "bar = {$a->bar}\n";

/*
prints:
A
foo = changed
B
foo = empty
bar = hello
*/
?>

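The code shows that the content of an ordinary variable is the actual value, whereas the content of an object variable is an identifier for the object, not the object itself. For example, assigning the $a above to $b with "$b = $a" creates a new variable $b whose content holds the same value as the content of $a, namely the object identifier, while the two variables keep separate storage for their content. Reference assignment, "$b = &$a", instead makes $b point to the variable content of $a, so both names share a single storage location. Variables produced by either "=" or "=&" can be used to modify the object, so a function that only modifies an object's properties does not need the "&" operator in its definition. Assigning an object in PHP 5 never creates a new object; to copy an object, use the clone keyword.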

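4. POSIX Basic Regular Expressions

POSIX basic regular expression (BRE) syntax gives tools such as grep and awk a consistent syntax. Traditional Unix tools did not implement extended regular expressions.

History

Traditional Unix regular expression syntax differed from tool to tool. The POSIX regular expression standard was developed by the IEEE, and it also defines an extended syntax. The standards were designed to stay as compatible as possible with the simple regular expression syntax while giving Unix tools one common syntax.

Syntax

In POSIX basic regular expression syntax, most characters are treated as literals: they match only themselves. The exceptions are the metacharacters, which are defined in the table below.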

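Metacharacter Description
. Matches any single character. Many applications exclude newlines, and exactly which characters count as newlines depends on the character encoding and platform, but it is safe to assume that the line feed character is one of them. Within a POSIX bracket expression "[]", "." always matches itself; for example, [a.c] matches "a", ".", or "c".
[ ] Matches a single character contained in the brackets. For example, [abc] matches "a", "b", or "c", and [a-z] matches any lowercase letter from a to z. The forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. A "-" that is the first or last character in the brackets is taken literally. A "]" can be included in a bracket expression if it is the first character, as in "[]abc]". Bracket expressions may also contain character classes, equivalence classes, and collating symbols.
[^ ] Matches a single character that is not contained in the brackets; the structure and usage are the same as for "[]".
^ When it is the first character of the regular expression, matches the starting position of the string.
$ When it is the last character of the regular expression, matches the ending position of the string.
* Matches the preceding element zero or more times. For example, "ab*c" matches "ac", "abc", "abbbc", and so on. "[xyz]*" matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on.
BRE: \{m\}
ERE: {m}
Matches the preceding element exactly m times. For example, "a\{3\}" matches only "aaa".
BRE: \{m,\}
ERE: {m,}
Matches the preceding element at least m times. For example, "a\{3,\}" matches "aaa", "aaaa", "aaaaa", and so on.
BRE: \{m,n\}
ERE: {m,n}
Matches the preceding element at least m and at most n times. For example, "a\{3,5\}" matches "aaa", "aaaa", and "aaaaa".
BRE: \( \)
ERE: ( )
Defines a subexpression that is treated as a single element. For example, "ab*" matches "a", "ab", "abb", and so on, while "\(ab\)*" matches "", "ab", "abab", "ababab", and so on. The matched string can be recalled later; see "\n".
BRE only: \n Matches what the nth subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular (an expression with this construct does not obey the mathematical definition of regular expression), and was not adopted in the POSIX ERE syntax.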
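
The difference between the BRE and ERE columns in the table is mostly a matter of which form carries the backslash. As a rough illustration, the sketch below converts a simple BRE into the syntax of Python's re module, which follows the ERE convention for these operators. The function name bre_to_python is made up for this example, and the sketch deliberately ignores bracket expressions, so it is only suitable for patterns like the ones in the table.

```python
import re

def bre_to_python(bre):
    """Convert a simple POSIX BRE to Python `re` syntax.

    Sketch only: bracket expressions such as [(] are not handled.
    """
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            if nxt in "(){}":
                out.append(nxt)        # \( \) \{ \} are operators in BRE
            else:
                out.append(ch + nxt)   # keep other escapes such as \1 or \.
            i += 2
        elif ch in "(){}+?|":
            out.append("\\" + ch)      # literal in BRE, special in Python
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# Interval and grouping examples from the table
three_a = bre_to_python(r"a\{3\}")      # becomes a{3}
ab_group = bre_to_python(r"\(ab\)*")    # becomes (ab)*
print(three_a, bool(re.fullmatch(three_a, "aaa")))
print(ab_group, bool(re.fullmatch(ab_group, "abab")))
```

Unescaped "+", "?", and "|" are escaped on the way out because they are ordinary characters in a BRE but operators in Python.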

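Examples:

  • .at matches any three-character string ending with "at"
  • [hc]at matches hat and cat
  • [^b]at matches any three-character string ending with "at", except bat
  • ^[hc]at matches hat or cat at the start of the string or line
  • [hc]at$ matches hat or cat at the end of the string
  • \[.\] matches any single character surrounded by [ and ], such as "[a]" and "[b]".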
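
The patterns in this list use only constructs that Python's re module shares with POSIX BRE, so they can be checked directly. The sketch below is an illustration rather than a POSIX implementation; the sample strings are made up, and re.search is used because the unanchored patterns may match anywhere in the text.

```python
import re

# (pattern, text, expected result) triples based on the examples above
CASES = [
    (r".at", "hat", True),
    (r"[hc]at", "cat", True),
    (r"[hc]at", "bat", False),
    (r"[^b]at", "bat", False),
    (r"[^b]at", "rat", True),
    (r"^[hc]at", "a hat", False),   # hat is not at the start
    (r"[hc]at$", "a hat", True),    # hat is at the end
    (r"\[.\]", "x[a]y", True),
]

def found(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

results = [found(p, t) == expected for p, t, expected in CASES]
print(all(results))
```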

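Character classes

The POSIX standard defines a set of character classes, which are used inside brackets.

POSIX class Similar to Meaning
[:upper:] [A-Z] uppercase letters
[:lower:] [a-z] lowercase letters
[:alpha:] [A-Za-z] upper- and lowercase letters
[:digit:] [0-9] digits
[:xdigit:] [0-9A-Fa-f] hexadecimal digits
[:alnum:] [A-Za-z0-9] digits, upper- and lowercase letters
[:punct:] all graphic characters except letters and digits
[:blank:] [ \t] space and tab
[:space:] [ \t\n\r\f\v] whitespace characters
[:cntrl:] control characters
[:graph:] [^ [:cntrl:]] graphic characters (all characters which have graphic representation)
[:print:] [[:graph:] ] graphic characters and space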

Examples:

  • a[[:digit:]]b matches "a0b", "a1b", …, "a9b".
  • a[:digit:]b is an error: character classes must be in brackets
  • [[:digit:]abc] matches any digit, “a”, “b”, and “c”.
  • [abc[:digit:]] is the same as above
  • [^ABZ[:lower:]] matches any character except lowercase letters, A, B, and Z.
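
Python's re module does not understand POSIX class names, so one way to try these examples there is to expand each name into the ASCII range listed in the table first. The sketch below does that by plain text substitution. The names POSIX_CLASSES and expand_posix_classes are invented for this example, only a subset of the classes is covered, and the ranges assume the C/POSIX locale.

```python
import re

# ASCII equivalents taken from the table above (C/POSIX locale assumed)
POSIX_CLASSES = {
    "[:upper:]": "A-Z",
    "[:lower:]": "a-z",
    "[:alpha:]": "A-Za-z",
    "[:digit:]": "0-9",
    "[:xdigit:]": "0-9A-Fa-f",
    "[:alnum:]": "A-Za-z0-9",
    "[:blank:]": r" \t",
    "[:space:]": r" \t\n\r\f\v",
}

def expand_posix_classes(pattern):
    """Replace POSIX class names with explicit ASCII ranges."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(name, chars)
    return pattern

digit_between = expand_posix_classes("a[[:digit:]]b")     # a[0-9]b
not_lower_abz = expand_posix_classes("[^ABZ[:lower:]]")   # [^ABZa-z]
print(digit_between, bool(re.fullmatch(digit_between, "a7b")))
print(not_lower_abz, bool(re.fullmatch(not_lower_abz, "C")))
```

Because the class name is replaced only by the range, it still has to sit inside an outer pair of brackets, which mirrors the rule that "a[:digit:]b" on its own is an error.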

Collating symbols

Collating symbols, like character classes, are used in brackets and have the form [.ch.]. Here ch is a digraph. Collating systems are defined by the locale.

Equivalence classes

Equivalence classes, like character classes and collating symbols, are used in brackets and have the form [=a=]. They stand for any character which is equivalent to the given one. According to the standard[1], if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". Equivalence classes, like collating symbols, are defined by the locale.

External links

IEEE Std 1003.1, 2004 Edition. Chapter 9, Regular Expressions.

Use in Tools

Tools and languages that utilize this regular expression syntax include:

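3. Perl-Compatible Regular Expressions

Perl has a richer and more predictable syntax than even the POSIX extended regular expression syntax. One example of its predictability is that \ always quotes a non-alphanumeric character. Perl also lets us say whether a part of the match should be greedy, which POSIX cannot. For instance, in the pattern /a.*b/ the .* matches as much as it can, while in the pattern /a.*?b/ the .*? matches as little as it can. So for the string "a bad dab", the first pattern matches the whole string and the second matches only "a b".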
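
Python's re module follows the Perl convention for greedy and non-greedy quantifiers, so the "a bad dab" example can be reproduced with it. The sketch below is only an illustration in Python, not Perl itself; the variable names are arbitrary.

```python
import re

text = "a bad dab"

# Greedy: .* takes as much as it can, then backs off to the last "b"
greedy = re.search(r"a.*b", text).group()

# Non-greedy: .*? takes as little as it can, stopping at the first "b"
lazy = re.search(r"a.*?b", text).group()

print(repr(greedy))  # 'a bad dab'
print(repr(lazy))    # 'a b'
```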

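For these reasons, many other tools and applications have adopted a Perl-like syntax. Java, Ruby, Python, PHP, exim, BBEdit, and even Microsoft's .NET all use a syntax similar to Perl's. Not all "Perl-compatible" implementations are identical, however, and many implement only a subset of Perl's features.

Conventions used in the examples: the character 'm' is not always required to specify a Perl match operation. For example, m/[^ ABC]/ can also be written /[^ ABC]/. The 'm' is only necessary when you want to specify a match operation without using a forward slash as the regular expression delimiter. Choosing an alternate delimiter is sometimes useful to avoid "delimiter collision". See perldoc perlre for more details.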

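Metacharacter Description Example
All the if statements return a TRUE value.
. Matches any single character except a newline. Inside square brackets it is literal and matches "." itself.
if ("Hello World\n" =~ m/...../) {
  print "Yep"; # Has length >= 5
}
( ) Put the pattern to match in parentheses; what each group matched can be accessed afterwards as $1, $2, and so on.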
if ("Hello World\n" =~ m/(H..).(o..)/) {
  print "We matched '$1' and '$2'\n";
}

Output:

We matched 'Hel' and 'o W'

+ Matches the preceding element one or more times
if ("Hello World\n" =~ m/l+/) {
  print "One or more \"l\"'s in the string\n";
}
? Matches the preceding element zero or one time
if ("Hello World\n" =~ m/H.?e/) {
  print "There is an 'H' and a 'e' separated by ";
  print "0-1 characters (Ex: He Hoe)\n";
}
? Makes a *, +, or {M,N} expression match as few times as possible
if ("Hello World\n" =~ m/(l.+?o)/) {
  print "Yep"; # The non-greedy match with 'l' followed
  # by one or more characters is 'llo' rather than 'llo Wo'.
}
* Matches the preceding pattern element zero or more times
if ("Hello World\n" =~ m/el*o/) {
  print "There is an 'e' followed by zero to many ";
  print "'l' followed by 'o' (eo, elo, ello, elllo)\n";
}
{M,N} Matches the preceding element at least M and at most N times
if ("Hello World\n" =~ m/l{1,2}/) {
 print "There is a substring with at least 1 ";
 print "and at most 2 l's in the string\n";
}
[…] Matches any one of a set of possible characters
if ("Hello World\n" =~ m/[aeiou]+/) {
  print "Yep"; # Contains one or more vowels
}
| Separates alternatives; matches either one (logical or)
if ("Hello World\n" =~ m/(Hello|Hi|Pogo)/) {
  print "At least one of Hello, Hi, or Pogo is ";
  print "contained in the string.\n";
}
\b Matches a word boundary
if ("Hello World\n" =~ m/llo\b/) {
  print "There is a word that ends with 'llo'\n";
}
\w Matches an alphanumeric character, including "_"
if ("Hello World\n" =~ m/\w/) {
  print "There is at least one alphanumeric ";
  print "character in the string (A-Z, a-z, 0-9, _)\n";
}
\W Matches a non-alphanumeric character; does not match "_"
if ("Hello World\n" =~ m/\W/) {
  print "The space between Hello and ";
  print "World is not alphanumeric\n";
}
\s Matches a whitespace character (space, tab, newline, form feed)
if ("Hello World\n" =~ m/\s.*\s/) {
  print "There are TWO whitespace characters, which may";
  print " be separated by other characters, in the string.";
}
\S Matches any character except whitespace
if ("Hello World\n" =~ m/\S.*\S/) {
  print "Contains two non-whitespace characters " .
        "separated by zero or more characters.";
}
\d Matches a digit; same as [0-9]
if ("99 bottles of beer on the wall." =~ m/(\d+)/) {
  print "$1 is the first number in the string\n";
}
\D Matches a non-digit
if ("Hello World\n" =~ m/\D/) {
  print "There is at least one character in the string";
  print " that is not a digit.\n";
}
^ Matches the beginning of a line
if ("Hello World\n" =~ m/^He/) {
  print "Starts with the characters 'He'\n";
}
$ Matches the end of a line
if ("Hello World\n" =~ m/rld$/) {
  print "Is a line or string ";
  print "that ends with 'rld'\n";
}
\A Matches the beginning of the string (but not an internal line).
if ("Hello\nWorld\n" =~ m/\AH/) {
  print "Yep"; # The string starts with 'H'.
}
\Z Matches the end of the string (but not an internal line).
if ("Hello\nWorld\n" =~ m/d\n\Z/) {
  print "Yep"; # Ends with 'd\n'
}
[^…] Matches any character except those inside the brackets
if ("Hello World\n" =~ m/[^abc]/) {
  print "Yep"; # Contains a character other than a, b, and c.
}

Use in tools

Tools that use Perl regular expression syntax include:

  • Java
  • leafnode
  • Perl
  • Python
  • PHP

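2. Basic Regular Expressions

Implementations of basic regular expressions differ in how they treat a backslash in front of some metacharacters. For example, egrep and Perl treat unbackslashed parentheses and the vertical bar (|) as metacharacters and reserve the backslashed versions for the literal characters. Old versions of grep did not support the alternation operator |.

Operators
Operator Effect
. The dot matches any single character
[ ] Matches a character list or character range
[^ ] Matches a character that is not in the list or range
* Matches the preceding element zero or more times
^ Matches the beginning of the line
$ Matches the end of the line
Examples
Example Matches
".at" any three-character string ending in "at", such as hat, cat, or bat
"[hc]at" hat and cat
"[^b]at" the same as ".at", except for "bat"
"^[hc]at" hat or cat at the beginning of a line
"[hc]at$" hat or cat at the end of a line

Tools that use basic regular expressions: TBD.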

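1. Simple Regular Expressions

For backwards compatibility, simple regular expressions are still widely used on Unix-like systems. Most commands that support regular expressions, such as grep and sed, use simple regular expressions by default and support extended regular expressions through a command line argument. The syntax is deprecated on POSIX-compliant systems, so new commands and tools should not use it.

In a simple regular expression, most characters other than the metacharacters are treated as literals and match only themselves: 'a' matches only 'a', and 'bc' matches only 'bc'.

Operators
Operator Effect
. The dot operator matches any single character
[ ] Matches a character list or character range
[^ ] Matches a character that is not in the range or list
^ Matches the beginning of the line (or any line, when applied in multiline mode)
$ Matches the end of the line (or any line, when applied in multiline mode)
\( \) Parentheses define a marked subexpression; the matched part can be recalled later
\n
Where n is a digit from 1 to 9; matches what the nth marked subexpression matched. This irregular construct was not adopted in the extended regular expression syntax
* A * after a single-character expression matches it zero or more times. For example, "ab*c" matches "ac", "abc", "abbbc", and so on. "[xyz]*" matches "", "x", "y", "zx", "zyx", and so on.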

  • \n*, where n is a digit from 1 to 9, matches zero or more iterations of what the nth marked subexpression matched. For example, "\(a.\)c\1*" matches "abcab" and "abcabab" but not "abcac".
  • An expression enclosed in “\(” and “\)” followed by “*” is deemed to be invalid. In some cases (e.g. /usr/bin/xpg4/grep of SunOS 5.8), it matches zero or more iterations of the string that the enclosed expression matches. In other cases (e.g. /usr/bin/grep of SunOS 5.8), it matches what the enclosed expression matches, followed by a literal “*”.
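
The back-reference example can be tried in Python, with one difference in notation: Python's re module writes the marked subexpression with unescaped parentheses. The sketch below is an illustration under that assumption, and the helper name matches_whole is made up. It uses fullmatch so that the entire string has to fit the pattern.

```python
import re

# Group 1 is "a" plus any character; \1* repeats exactly what group 1 matched
PATTERN = re.compile(r"(a.)c\1*")

def matches_whole(text):
    """Return True if the whole text fits the pattern."""
    return PATTERN.fullmatch(text) is not None

for sample in ["abcab", "abcabab", "abcac"]:
    print(sample, matches_whole(sample))
```

"abcac" is rejected because \1 must repeat the text that the group actually matched ("ab"), not merely something of the same shape.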

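Examples

  • "^[hc]at" matches hat or cat at the beginning of a line
  • "[hc]at$" matches hat or cat at the end of a line

Use in tools

Tools that use simple regular expression syntax include:

  • grep
  • sed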

5. POSIX Extended Regular Expressions

The more modern “extended” regular expressions can often be used with modern Unix utilities by including the command line flag “-E”.

POSIX extended regular expressions are similar in syntax to the traditional Unix regular expressions, with some exceptions. The following metacharacters are added:

Metacharacter Description
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor, character encoding, and platform specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".
[ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first (after the ^) character within the brackets: [abc-], [-abc]. Note that backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].
[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
BRE: \( \)
ERE: ( )
Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group.
\n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.
* Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. \(ab\)* matches "", "ab", "abab", "ababab", and so on.
BRE: \{m,n\}
ERE: {m,n}
Matches the preceding element at least m and not more than n times. For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regular expressions.
  • + — Match the last “block” one or more times – “ba+” matches “ba”, “baa”, “baaa” and so on
  • ? — Match the last “block” zero or one times – “ba?” matches “b” or “ba”
  • | — The choice (or set union) operator: match either the expression before or the expression after the operator – “abc|def” matches “abc” or “def”.

Also, backslashes are removed: \{…\} becomes {…} and \(…\) becomes (…). Examples:

  • “[hc]+at” matches with “hat”, “cat”, “hhat”, “chat”, “hcat”, “ccchat” etc.
  • “[hc]?at” matches “hat”, “cat” and “at”
  • “([cC]at)|([dD]og)” matches “cat”, “Cat”, “dog” and “Dog”
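
The operators +, ?, and | behave the same way in Python's re module as in the extended syntax described here, so the three examples above can be checked with it. This is a sketch in Python rather than a POSIX ERE engine; the helper name full is made up, and fullmatch is used so that each sample string has to match in its entirety.

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

one_or_more = r"[hc]+at"
zero_or_one = r"[hc]?at"
either = r"([cC]at)|([dD]og)"

print([w for w in ["hat", "chat", "ccchat", "at"] if full(one_or_more, w)])
print([w for w in ["hat", "cat", "at", "hhat"] if full(zero_or_one, w)])
print([w for w in ["cat", "Cat", "dog", "Dog", "cow"] if full(either, w)])
```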

The characters (,),[,],.,*,?,+,^, and $ are special symbols and have to be escaped with a backslash symbol in order to be treated as literal characters. For example:

"a\.(\(|\))" matches the string "a.)" or "a.("

Modern regular expression tools allow a quantifier to be specified as non-greedy, by putting a question mark after the quantifier: (\[\[.*?\]\]).

Character classes

The POSIX standard defines some classes or categories of characters as shown in the following table:

POSIX class similar to meaning
[:upper:] [A-Z] uppercase letters
[:lower:] [a-z] lowercase letters
[:alpha:] [A-Za-z] upper- and lowercase letters
[:alnum:] [A-Za-z0-9] digits, upper- and lowercase letters
[:digit:] [0-9] digits
[:xdigit:] [0-9A-Fa-f] hexadecimal digits
[:punct:] [.,!?:…] punctuation
[:blank:] [ \t] space and TAB characters only
[:space:] [ \t\n\r\f\v] blank (whitespace) characters
[:cntrl:] control characters
[:graph:] [^ \t\n\r\f\v] printed characters
[:print:] [^\t\n\r\f\v] printed characters and space

Use in Tools

Tools and languages that utilize this regular expression syntax include:

  • AWK – uses a superset of the extended regular expression syntax

8. Shell Regular Expressions

The Unix shell recognises a limited form of regular expressions used with filename substitution:

Operators
Operator Effect
? The hook operator specifies any single character.
[ ] Boxes enable a single character to be matched against a character list or character range.
[! ] A complement box enables a single character that is not within a character list or character range to be matched.
* An asterisk specifies zero or more characters to match.

Some operators behave differently in the shell: the asterisk and hook (?) operators do not need to follow a previous character in the shell, and they exhibit non-traditional regular expression behaviour.

Unsupported constructs: within the shell, a complement box is formed using the pling symbol (!). The shell does not support the use of a caret box for character list exclusion. In the shell, a caret symbol within a box is simply treated as one of the characters in the character list for matching.
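
Python's standard fnmatch module implements this shell-style matching, which makes it a convenient way to try the four operators without touching real files. The sketch below is an illustration only: the file names are invented, and glob_filter is a helper made up for this example.

```python
from fnmatch import fnmatchcase

# Hypothetical file names used only for this illustration
names = ["a.txt", "b.txt", "ab.txt", "c.log"]

def glob_filter(pattern):
    """Return the names that match a shell-style pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]

print(glob_filter("?.txt"))     # ? stands for exactly one character
print(glob_filter("*.txt"))     # * stands for zero or more characters
print(glob_filter("[ab].txt"))  # box: one character from the list
print(glob_filter("[!a]*"))     # complement box: first character is not "a"
```

Note that neither "?" nor "*" follows a previous element here, which is the difference from traditional regular expressions described above.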

Use in Tools

Tools and languages that utilize this regular expression syntax include:

  • Bourne compatible shells

7. Emacs Regular Expressions

Notes on regular expressions used in text editor Emacs:

  • For backslash escaping (magic vs literal), Emacs uses a mixture of BRE and ERE. Like in ERE, Emacs supports unescaped +, ?. Like in BRE, Emacs supports escaped \(, \), \|, \{, \}.
  • GNU extensions to regular expressions supported by Emacs include \w, \W, \b, \B, \<, \>, \`, \' (start and end of buffer)
  • No "\s" like in PCRE; whitespace is matched by "\s-".
  • No “\d” like in PCRE; use [0-9] or [[:digit:]]
  • No lookahead and no lookbehind like in PCRE
  • Emacs regexp can match characters by syntax using mode-specific syntax tables ("\sc", "\s-", "\s ") or by categories ("\cc", "\cg").

Use in Tools

Tools and languages that utilize this regular expression syntax include:

6. Non-POSIX Basic Regular Expressions

Non POSIX Basic Regular Expression Syntax: An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This form of regular expression is used to reflect the fact that in many programming languages these characters may be used in identifiers.

Operators
Operator Effect
. The dot operator matches any single character.
[ ] Boxes enable a single character to be matched against a character list or character range.
[^ ] A complement box enables a single character that is not within a character list or character range to be matched.
* An asterisk matches the preceding element zero or more times.
^ The caret anchor matches the beginning of the line
$ The dollar anchor matches the end of the line

The editor vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.
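
The distinction between a word class and a word-head class can be sketched as an identifier pattern: one character from the head class followed by any number of characters from the word class. The Python sketch below assumes plain ASCII ranges, with the head class taken to be letters plus underscore and the word class taken to be [:alnum:] plus underscore; the names WORD_HEAD, WORD, and is_identifier are made up for this example, and real tools may define the classes differently.

```python
import re

# Assumed ASCII definitions of the two classes
WORD_HEAD = "A-Za-z_"    # characters that may begin an identifier
WORD = "A-Za-z0-9_"      # [:alnum:] plus underscore

IDENTIFIER = re.compile(f"[{WORD_HEAD}][{WORD}]*")

def is_identifier(text):
    """Return True if the text is one head character followed by word characters."""
    return IDENTIFIER.fullmatch(text) is not None

for sample in ["foo_1", "_x", "1abc", "a-b"]:
    print(sample, is_identifier(sample))
```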

(For an ASCII chart color-coded to show the POSIX classes, see ASCII.)

Use in Tools

Tools and languages that utilize this regular expression syntax include: