Goutte, a simple PHP Web Scraper
================================

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML
responses.

Requirements
------------

Goutte depends on PHP 5.5+ and Guzzle 6+.

.. tip::

    If you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x (latest phar).

    If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x (latest phar).

Installation
------------

Add ``fabpot/goutte`` as a required dependency in your ``composer.json`` file:

.. code-block:: bash

    composer require fabpot/goutte

Usage
-----

Create a Goutte Client instance (which extends
``Symfony\Component\BrowserKit\Client``):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();

Make requests with the ``request()`` method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a ``Crawler`` object
(``Symfony\Component\DomCrawler\Crawler``).

To use your own Guzzle settings, you may create a new Guzzle 6 instance and
pass it to Goutte. For example, to add a 60-second request timeout:

.. code-block:: php

    use Goutte\Client;
    use GuzzleHttp\Client as GuzzleClient;

    $goutteClient = new Client();
    $guzzleClient = new GuzzleClient(array(
        'timeout' => 60,
    ));
    $goutteClient->setClient($guzzleClient);

Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);

Extract data:

.. code-block:: php

    // Get the latest posts in this category and display their titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });

Submit forms:

.. code-block:: php

    $crawler = $client->request('GET', 'https://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });

More Information
----------------

Read the documentation of the `BrowserKit`_ and `DomCrawler`_ Symfony
Components for more information about what you can do with Goutte.

Pronunciation
-------------

Goutte is pronounced ``goot``, i.e. it rhymes with ``boot`` and not ``out``.

Technical Information
---------------------

Goutte is a thin wrapper around the following fine PHP libraries:

* Symfony Components: `BrowserKit`_, `CssSelector`_ and `DomCrawler`_;

* `Guzzle`_ HTTP Component.

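
Because extraction is delegated to DomCrawler, selectors can be tried against
a static HTML string without sending any HTTP request. The following is a
minimal sketch, not part of Goutte itself: it assumes the DomCrawler and
CssSelector components are installed through Composer with the default
``vendor/autoload.php`` layout, and the markup in ``$html`` is made up for
illustration.

.. code-block:: php

    <?php

    // Assumes a standard Composer layout; adjust the path if yours differs
    require __DIR__.'/vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup standing in for a fetched page
    $html = <<<'HTML'
    <html>
        <body>
            <h2><a href="/blog/first">First post</a></h2>
            <h2><a href="/blog/second">Second post</a></h2>
        </body>
    </html>
    HTML;

    $crawler = new Crawler($html);

    // each() collects the closure's return values into an array
    $titles = $crawler->filter('h2 > a')->each(function (Crawler $node) {
        return $node->text();
    });

    print implode("\n", $titles)."\n";

Returning values from the closure, rather than printing inside it, keeps the
extracted data available for further processing.
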
License
-------

Goutte is licensed under the MIT license.

.. _`Composer`: https://getcomposer.org
.. _`Guzzle`: http://docs.guzzlephp.org
.. _`BrowserKit`: https://symfony.com/components/BrowserKit
.. _`DomCrawler`: https://symfony.com/doc/current/components/dom_crawler.html
.. _`CssSelector`: https://symfony.com/doc/current/components/css_selector.html