Goutte, a simple PHP Web Scraper
================================

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML
responses.

Requirements
------------

Goutte requires PHP 5.5 or later and Guzzle 6 or later.

.. tip::

    If you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v2.0.4/goutte-v2.0.4.phar>`__).

    If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v1.0.7/goutte-v1.0.7.phar>`__).

Installation
------------

Add ``fabpot/goutte`` as a dependency of your project with Composer, which
records it in the ``require`` section of your ``composer.json`` file:

.. code-block:: bash

    composer require fabpot/goutte

Usage
-----

Create a Goutte Client instance (which extends
``Symfony\Component\BrowserKit\Client``):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();

Make requests with the ``request()`` method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a ``Crawler`` object
(``Symfony\Component\DomCrawler\Crawler``).

To use your own Guzzle settings, create a Guzzle 6 client and pass it to
Goutte. For example, to add a 60-second request timeout:

.. code-block:: php

    use Goutte\Client;
    use GuzzleHttp\Client as GuzzleClient;

    $goutteClient = new Client();
    $guzzleClient = new GuzzleClient(array(
        'timeout' => 60,
    ));
    $goutteClient->setClient($guzzleClient);

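Other Guzzle options can be set the same way. The following is a minimal
sketch, not part of the original example: it assumes Composer's autoloader is
at ``vendor/autoload.php``, adds a ``connect_timeout`` option with an
illustrative value, and reads the settings back with Guzzle's ``getConfig()``
without sending any request:

.. code-block:: php

    require 'vendor/autoload.php'; // assumption: dependencies installed with Composer

    use Goutte\Client;
    use GuzzleHttp\Client as GuzzleClient;

    // Illustrative values: give up after 60 seconds overall, or after
    // 5 seconds if the connection cannot be established.
    $guzzleClient = new GuzzleClient(array(
        'timeout' => 60,
        'connect_timeout' => 5,
    ));

    $goutteClient = new Client();
    $goutteClient->setClient($guzzleClient);

    // Goutte hands back the client it was given, with its settings intact.
    $configured = $goutteClient->getClient();
    print $configured->getConfig('timeout')."\n";
    print $configured->getConfig('connect_timeout')."\n";
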
Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);

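To see what ``selectLink()`` and ``link()`` produce without sending a request,
a ``Crawler`` can also be built from an HTML string. This is a sketch under
assumptions: the markup and the ``https://example.com`` URI are made up for
illustration, and Composer's autoloader is assumed to be at
``vendor/autoload.php``:

.. code-block:: php

    require 'vendor/autoload.php'; // assumption: dependencies installed with Composer

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical page content and URI, for illustration only
    $html = '<html><body>'
        .'<a href="/blog/category/security-advisories">Security Advisories</a>'
        .'</body></html>';
    $crawler = new Crawler($html, 'https://example.com/blog/');

    // link() wraps the first matching <a> element in a Link object
    $link = $crawler->selectLink('Security Advisories')->link();

    // The relative href is resolved against the URI given to the Crawler
    print $link->getUri()."\n";

The resolved URI is what ``$client->click($link)`` would request.
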
Extract data:

.. code-block:: php

    // Get the latest posts in this category and display their titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });

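``each()`` also returns an array of the values returned by the callback, which
is handy for collecting data instead of printing it. A minimal sketch, using
made-up markup so that it runs without a network connection (the autoloader
path is an assumption as well):

.. code-block:: php

    require 'vendor/autoload.php'; // assumption: dependencies installed with Composer

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup, for illustration only
    $html = '<html><body>'
        .'<h2><a href="/blog/first-post">First post</a></h2>'
        .'<h2><a href="/blog/second-post">Second post</a></h2>'
        .'</body></html>';
    $crawler = new Crawler($html);

    // Collect the title and target of every matching link
    $posts = $crawler->filter('h2 > a')->each(function (Crawler $node) {
        return array(
            'title' => $node->text(),
            'href' => $node->attr('href'),
        );
    });

    foreach ($posts as $post) {
        print $post['title'].' -> '.$post['href']."\n";
    }
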
Submit forms:

.. code-block:: php

    // Open the GitHub home page and follow the "Sign in" link
    $crawler = $client->request('GET', 'https://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());

    // Fill in the sign-in form and submit it
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));

    // Print any error message displayed on the resulting page
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });

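Before submitting, it can help to check what a form would send. The sketch
below builds a ``Form`` from made-up markup and inspects it without making a
request. The field names mirror the example above, but the HTML, the
``/session`` action, the ``https://example.com`` URI and the autoloader path
are all assumptions:

.. code-block:: php

    require 'vendor/autoload.php'; // assumption: dependencies installed with Composer

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical sign-in form and URI, for illustration only
    $html = '<html><body>'
        .'<form action="/session" method="post">'
        .'<input type="text" name="login" />'
        .'<input type="password" name="password" />'
        .'<input type="submit" value="Sign in" />'
        .'</form>'
        .'</body></html>';
    $crawler = new Crawler($html, 'https://example.com/login');

    $form = $crawler->selectButton('Sign in')->form();
    $form->setValues(array('login' => 'fabpot', 'password' => 'xxxxxx'));

    // What would be sent on submit: HTTP method, target URI and field values
    print $form->getMethod().' '.$form->getUri()."\n";
    print_r($form->getValues());
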
More Information
----------------

Read the documentation of the `BrowserKit`_ and `DomCrawler`_ Symfony
Components for more information about what you can do with Goutte.

Pronunciation
-------------

Goutte is pronounced ``goot``, i.e. it rhymes with ``boot`` and not ``out``.

Technical Information
---------------------

Goutte is a thin wrapper around the following fine PHP libraries:

* Symfony Components: `BrowserKit`_, `CssSelector`_ and `DomCrawler`_;

* `Guzzle`_ HTTP Component.

License
-------

Goutte is licensed under the MIT license.

.. _`Composer`: https://getcomposer.org
.. _`Guzzle`: http://docs.guzzlephp.org
.. _`BrowserKit`: https://symfony.com/components/BrowserKit
.. _`DomCrawler`: https://symfony.com/doc/current/components/dom_crawler.html
.. _`CssSelector`: https://symfony.com/doc/current/components/css_selector.html