Commit cff4105

bug #10983 [DomCrawler] Fixed charset detection in html5 meta charset tag (77web)
This PR was squashed before being merged into the 2.3 branch (closes #10983).

Discussion
----------

[DomCrawler] Fixed charset detection in html5 meta charset tag

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | N/A
| License       | MIT

It may be minor to folks with ascii-charactered language, but is critical for us Japanese.
Many Japanese websites with SJIS encoding have "Shift_JIS" as their encoding declaration.

Commits
-------

172e752 [DomCrawler] Fixed charset detection in html5 meta charset tag
2 parents 5d13be7 + 172e752 commit cff4105

2 files changed, 7 additions and 1 deletion

src/Symfony/Component/DomCrawler/Crawler.php

Lines changed: 3 additions & 1 deletion
@@ -108,8 +108,10 @@ public function addContent($content, $type = null)
             }
         }
 
+        // http://www.w3.org/TR/encoding/#encodings
+        // http://www.w3.org/TR/REC-xml/#NT-EncName
         if (null === $charset &&
-            preg_match('/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9]+)/i', $content, $matches)) {
+            preg_match('/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9_:.]+)/i', $content, $matches)) {
             $charset = $matches[1];
         }
 
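
To make the effect of the character-class change concrete, here is a minimal standalone sketch that runs both patterns from the hunk above against an HTML5 `<meta charset>` tag. The constant names and the `detectCharset()` helper are hypothetical and exist only for this illustration; only the two regular expressions are taken from the diff.

```php
<?php
// Illustration only: OLD_PATTERN / NEW_PATTERN and detectCharset() are
// hypothetical names. The two regexes are copied from the diff.
const OLD_PATTERN = '/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9]+)/i';
const NEW_PATTERN = '/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9_:.]+)/i';

/**
 * Returns the charset label captured by $pattern, or null if no meta
 * charset declaration is found in $html.
 */
function detectCharset(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $matches) ? $matches[1] : null;
}

$html = '<html><head><meta charset="Shift_JIS"></head><body></body></html>';

// The old class has no underscore, so the capture stops at "_".
var_dump(detectCharset(OLD_PATTERN, $html)); // string(5) "Shift"

// The new class also accepts "_", ":" and ".", so the full label is captured.
var_dump(detectCharset(NEW_PATTERN, $html)); // string(9) "Shift_JIS"
```

With the old pattern the match still succeeds, so `$charset` is set to the truncated label `Shift` rather than staying `null`. Labels made only of letters, digits, and hyphens, such as `utf-8`, are captured identically by both patterns.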

src/Symfony/Component/DomCrawler/Tests/CrawlerTest.php

Lines changed: 4 additions & 0 deletions
@@ -232,6 +232,10 @@ public function testAddContent()
         $crawler = new Crawler();
         $crawler->addContent('<html><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><span>中文</span></html>');
         $this->assertEquals('中文', $crawler->filterXPath('//span')->text(), '->addContent() guess wrong charset');
+
+        $crawler = new Crawler();
+        $crawler->addContent(mb_convert_encoding('<html><head><meta charset="Shift_JIS"></head><body>日本語</body></html>', 'SJIS', 'UTF-8'));
+        $this->assertEquals('日本語', $crawler->filterXPath('//body')->text(), '->addContent() can recognize "Shift_JIS" in html5 meta charset tag');
     }
 
     /**
