How to install and use Headless Chrome on OSX

EDIT: Headless Chrome is shipping in Chrome 59 so the need to use the full Canary path will eventually go away. You can check your Chrome version in the menu under Help > About Google Chrome.

This walkthrough shows you how to get headless Chrome up and running on OSX and explains in detail how to use the code examples provided by the Chrome team.

What problem does Headless Chrome solve?

Headless mode in Chrome is a new way to interact with websites without having a window up on the screen. This might seem like a trivial improvement, but it is actually a huge step forward for scraping data from the web. Currently there are a number of stable but informal solutions for scraping, such as PhantomJS or NightmareJS (which is built on Electron). Neither of these tools is going away (edit: the sole PhantomJS maintainer has resigned) and they’re still great solutions for scraping. If you have existing systems that work with these tools, you can keep using them.

With that said, some users have run into trouble working with PhantomJS and Nightmare. Both have caveats when running on a shell-only system (one without an actual screen or window manager). For example, with Nightmare (and any Electron app), you need to install a virtual display manager in order to run the application. Additionally, since Nightmare is Electron-based, it has a different security model than Chrome and may fail to catch certain security issues during testing that would occur in production.

Which versions of Chrome support headless browsing?

Headless Chrome has been released in Chrome 59. As of April 13, 2017, Chrome Canary is the only channel that contains Chrome 59. This means that right now you need to install Chrome Canary if you want to use headless browsing. This will change in the future, and eventually the Chrome dev team will bring Chrome 59 into the main Chrome build.

To install Chrome Canary, you can download it or install it with Homebrew:

brew install Caskroom/versions/google-chrome-canary

How do I find headless Chrome so that I can start it?

Many of the examples of using headless Chrome just show a simple chrome command. This is great for Linux but does not work on OSX, since that command does not get installed on your PATH (yet).

So to find Chrome’s path, let’s fire up our terminal to find where Chrome Canary was installed on our system.

sudo find / -type d -name "*Chrome Canary.app"

You’ll probably get some permissions errors but you’ll also get a path that looks something like this:

/Applications/Google Chrome Canary.app

Since we’ve found the path to Chrome Canary, we can use this to start Chrome in headless mode.

How do I start headless Chrome?

Once we have the path to Canary we need to run a single command to start Chrome as a headless server.

/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --headless --remote-debugging-port=9222 --disable-gpu https://chromium.org

Notice that we escaped the spaces in the file name and reached deep into the Mac .app bundle to the actual Chrome binary itself. We then passed it the flags needed to start the headless browser and directed it to an initial URL of https://chromium.org. The browser is now waiting for us to connect on port 9222 to give it further instructions. Keep this terminal tab open and the server running, and open another tab where we’ll connect to the browser and give it some instructions.
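If you would rather build the launch command from Node than type the escaped path by hand, here is a minimal sketch. The helper names chromeBinaryFromApp and headlessArgs are made up for this example, and it assumes the binary inside the bundle has the same name as the .app itself, as it does for Chrome Canary.

```javascript
// Use POSIX path rules so the result matches macOS paths on any host.
const path = require("path").posix;

// Hypothetical helper: derive the executable inside a macOS .app bundle.
// Assumes the binary is named after the bundle (true for Chrome Canary above).
function chromeBinaryFromApp(appPath) {
  const name = path.basename(appPath, ".app");
  return path.join(appPath, "Contents", "MacOS", name);
}

// Hypothetical helper: the same flags used in the command above.
function headlessArgs(url, port = 9222) {
  return ["--headless", `--remote-debugging-port=${port}`, "--disable-gpu", url];
}

const binary = chromeBinaryFromApp("/Applications/Google Chrome Canary.app");
const args = headlessArgs("https://chromium.org");

console.log(binary);
console.log(args.join(" "));

// To actually launch Chrome (left commented out here):
// require("child_process").spawn(binary, args, { stdio: "inherit" });
```

Because spawn receives the binary path and the flags as separate strings rather than one shell command line, the spaces in the path should not need escaping when launched this way.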

How do I scrape data with headless Chrome?

I am going to use Node.js to connect to our running Chrome Canary instance. You’ll need Node installed for this part of the walkthrough.

Let’s generate a generic Node project with a single dependency, the chrome-remote-interface package, which will help us communicate with Chrome. We’ll also create a blank index.js file:

mkdir my-headless-chrome && cd my-headless-chrome
npm init --yes
npm install --save chrome-remote-interface 
touch index.js

Now we’re going to put some code into our index.js. This is the boilerplate example provided by the Chrome team. It instructs the browser to navigate to github.com and logs every network request made by the page by listening for requestWillBeSent events on the client’s Network domain.

const CDP = require("chrome-remote-interface");
 
CDP(client => {
  // extract domains
  const { Network, Page } = client;
  // setup handlers
  Network.requestWillBeSent(params => {
    console.log(params.request.url);
  });
  Page.loadEventFired(() => {
    client.close();
  });
  // enable events then start!
  Promise.all([Network.enable(), Page.enable()])
    .then(() => {
      return Page.navigate({ url: "https://github.com" });
    })
    .catch(err => {
      console.error(err);
      client.close();
    });
}).on("error", err => {
  // cannot connect to the remote endpoint
  console.error(err);
});

Finally, start our Node application.

node index.js

And we’ll see all of the network requests made by Chrome, all without even having an actual browser window!

https://github.com/
https://assets-cdn.github.com/assets/frameworks-12d63ce1986bd7fdb5a3f4d944c920cfb75982c70bc7f75672f75dc7b0a5d7c3.css
https://assets-cdn.github.com/assets/github-2826bd4c6eb7572d3a3e9774d7efe010d8de09ea7e2a559fa4019baeacf43f83.css
https://assets-cdn.github.com/assets/site-f4fa6ace91e5f0fabb47e8405e5ecf6a9815949cd3958338f6578e626cd443d7.css
https://assets-cdn.github.com/images/modules/site/home-illo-conversation.svg
https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg
https://assets-cdn.github.com/images/modules/site/home-illo-business.svg
https://assets-cdn.github.com/images/modules/site/integrators/slackhq.png
https://assets-cdn.github.com/images/modules/site/integrators/zenhubio.png
https://assets-cdn.github.com/images/modules/site/integrators/travis-ci.png
https://assets-cdn.github.com/images/modules/site/integrators/atom.png
https://assets-cdn.github.com/images/modules/site/integrators/circleci.png
https://assets-cdn.github.com/images/modules/site/integrators/codeship.png
https://assets-cdn.github.com/images/modules/site/integrators/codeclimate.png
https://assets-cdn.github.com/images/modules/site/integrators/gitterhq.png
https://assets-cdn.github.com/images/modules/site/integrators/waffleio.png
https://assets-cdn.github.com/images/modules/site/integrators/heroku.png
https://assets-cdn.github.com/images/modules/site/logos/airbnb-logo.png
https://assets-cdn.github.com/images/modules/site/logos/sap-logo.png
https://assets-cdn.github.com/images/modules/site/logos/ibm-logo.png
https://assets-cdn.github.com/images/modules/site/logos/google-logo.png
https://assets-cdn.github.com/images/modules/site/logos/paypal-logo.png
https://assets-cdn.github.com/images/modules/site/logos/bloomberg-logo.png
https://assets-cdn.github.com/images/modules/site/logos/spotify-logo.png
https://assets-cdn.github.com/images/modules/site/logos/swift-logo.png
https://assets-cdn.github.com/images/modules/site/logos/facebook-logo.png
https://assets-cdn.github.com/images/modules/site/logos/node-logo.png
https://assets-cdn.github.com/images/modules/site/logos/nasa-logo.png
https://assets-cdn.github.com/images/modules/site/logos/walmart-logo.png
https://assets-cdn.github.com/assets/compat-8a4318ffea09a0cdb8214b76cf2926b9f6a0ced318a317bed419db19214c690d.js
https://assets-cdn.github.com/assets/frameworks-6d109e75ad8471ba415082726c00c35fb929ceab975082492835f11eca8c07d9.js
https://assets-cdn.github.com/assets/github-5d29649478f4a2b05588bbd0d25cd56ff5445b21df31b4cccca942ad8687e1e8.js
https://assets-cdn.github.com/images/modules/site/heroes/home-code-bg-alt-01.svg
https://assets-cdn.github.com/static/fonts/roboto/roboto-light.woff
https://assets-cdn.github.com/static/fonts/roboto/roboto-regular.woff
https://assets-cdn.github.com/static/fonts/roboto/roboto-medium.woff
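A long list of URLs like this is easier to digest once it is summarized. As a rough sketch, the function below counts captured URLs by file extension. The name countByExtension and the sample array are made up for this example; in practice you would collect the URLs inside the requestWillBeSent handler instead of logging them.

```javascript
// Hypothetical helper: count request URLs by the file extension of their path.
function countByExtension(urls) {
  const counts = {};
  for (const raw of urls) {
    // URL is a global in current Node versions
    const { pathname } = new URL(raw);
    const last = pathname.split("/").pop();
    const dot = last.lastIndexOf(".");
    // paths without an extension (such as "/") are grouped under "(none)"
    const ext = dot > 0 ? last.slice(dot + 1) : "(none)";
    counts[ext] = (counts[ext] || 0) + 1;
  }
  return counts;
}

// A few URLs taken from the output above
const sample = [
  "https://github.com/",
  "https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg",
  "https://assets-cdn.github.com/images/modules/site/integrators/atom.png",
  "https://assets-cdn.github.com/static/fonts/roboto/roboto-light.woff",
  "https://assets-cdn.github.com/images/modules/site/logos/node-logo.png"
];

console.log(countByExtension(sample));
```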

This is great for seeing the assets that might be loaded, but what if we want to walk the DOM for elements that exist in the page? We could use a script like this, which pulls out all of the image tags from github.com:

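const CDP = require("chrome-remote-interface");

CDP(chrome => {
  // enable Page events, then navigate to the target page
  chrome.Page
    .enable()
    .then(() => {
      return chrome.Page.navigate({ url: "https://github.com" });
    })
    .then(() => {
      // fetch the root document node; its nodeId anchors the query below
      // (if results look incomplete, waiting for Page.loadEventFired first may help)
      chrome.DOM.getDocument((error, params) => {
        if (error) {
          // on failure, params carries the error details
          console.error(params);
          return;
        }
        const options = {
          nodeId: params.root.nodeId,
          selector: "img"
        };
        // find every <img> element under the document root
        chrome.DOM.querySelectorAll(options, (error, params) => {
          if (error) {
            console.error(params);
            return;
          }
          // look up the attributes of each matched node
          params.nodeIds.forEach(nodeId => {
            const options = {
              nodeId: nodeId
            };
            chrome.DOM.getAttributes(options, (error, params) => {
              if (error) {
                console.error(params);
                return;
              }
              console.log(params.attributes);
            });
          });
        });
      });
    });
}).on("error", err => {
  // cannot connect to the remote endpoint
  console.error(err);
});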

You’ll see the following data structure representing the image tags in the page, including the URLs of the images. Each array is a flat list that alternates attribute names and values.

  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/home-illo-conversation.svg',
    'alt',
    '',
    'width',
    '360',
    'class',
    'd-block width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/home-illo-chaos.svg',
    'alt',
    '',
    'class',
    'd-block width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/home-illo-business.svg',
    'alt',
    '',
    'class',
    'd-block width-fit mx-auto mb-4' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/slackhq.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/zenhubio.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/travis-ci.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/atom.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/circleci.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/codeship.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/codeclimate.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/gitterhq.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/waffleio.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/integrators/heroku.png',
    'alt',
    '',
    'class',
    'd-block integrations-collage-img width-fit mx-auto' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/airbnb-logo.png',
    'alt',
    'Airbnb',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/sap-logo.png',
    'alt',
    'SAP',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/ibm-logo.png',
    'alt',
    'IBM',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/google-logo.png',
    'alt',
    'Google',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/paypal-logo.png',
    'alt',
    'PayPal',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/bloomberg-logo.png',
    'alt',
    'Bloomberg',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/spotify-logo.png',
    'alt',
    'Spotify',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/swift-logo.png',
    'alt',
    'Swift',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/facebook-logo.png',
    'alt',
    'Rails',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/node-logo.png',
    'alt',
    'Node',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/nasa-logo.png',
    'alt',
    'Nasa',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
  [ 'src',
    'https://assets-cdn.github.com/images/modules/site/logos/walmart-logo.png',
    'alt',
    'Walmart',
    'class',
    'logo-img px-2 px-sm-4 px-md-5 px-lg-0' ]
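The attribute arrays above alternate names and values, which is awkward to work with directly. A small helper can turn each one into a plain object. This is only a sketch; attributesToObject is a name made up for this example and is not part of chrome-remote-interface.

```javascript
// Hypothetical helper: convert a flat [name, value, name, value, ...] array
// into an object keyed by attribute name.
function attributesToObject(attributes) {
  const result = {};
  for (let i = 0; i + 1 < attributes.length; i += 2) {
    result[attributes[i]] = attributes[i + 1];
  }
  return result;
}

// The first array from the output above
const first = [
  "src",
  "https://assets-cdn.github.com/images/modules/site/home-illo-conversation.svg",
  "alt",
  "",
  "width",
  "360",
  "class",
  "d-block width-fit mx-auto"
];

const img = attributesToObject(first);
console.log(img.src);
console.log(img.width);
```

In the script above, you could call this helper on params.attributes before logging if you only care about, say, the src of each image.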

Happy scraping!

One thought on “How to install and use Headless Chrome on OSX”

  1. Paul Rohorzka says:

    Hi Jim, thanks for your write-up. As a mac osx newbie, it greatly helped me over the canary hurdle.
    Just a side note: With the boilerplate code you might get “Error: Invalid tab index” if you have developer-tools attached to the headless chrome instance while running the code.

    1. Jim Cummins says:

      Hi Paul,

      I think you’re right. This might be temporary since this is an initial release in Canary. There were notes somewhere about the dev-tools and how they are still working on functionality associated with them. I will try to find that source.

      Jim

  2. Jonathan says:

    Thanks man! This makes sense!

  3. finch says:

    I’m wondering what the difference is between webdriver and chrome headless. Why was webdriver not enough?

    1. Jim Cummins says:

      Hey there! The way to think of it is in layers. Chrome, Firefox, Safari are the bottom most layer. They are close to the HTML, CSS and Javascript since they render the page. Webdriver is one level higher, further from the code. It is like a robot human that drives your browser for you, but it is not actually a browser itself.

      The distinction is an important one because Webdriver must spin up an actual browser window to perform its duties. On your Windows, Linux or OSX machine this works just fine because they all support showing actual windows on the screen. The problem is, say you want to run Chrome on a server that doesn’t have a window manager. Well, now you are a bit stuck. Webdriver can still work, but Chrome doesn’t open because it can’t render the webpage to a screen since, as far as the operating system is concerned, there is no screen. Up until now there have been (completely valid) workarounds like installing a virtual window manager. The virtual window manager tricks the operating system into thinking it has a screen. The problem is that a lot of resources are wasted rendering to a screen that doesn’t actually exist. It can also be a configuration headache, since the virtual window manager must be installed on the system before Chrome and the webdriver are run. When you’re running your tests, this all adds up to a much slower experience.

      So, this brings us to headless Chrome. Headless Chrome doesn’t need the virtual window manager and also has performance wins.
      The example I’ve shown above doesn’t show using webdriver but this doesn’t mean webdriver is going anywhere. In fact, webdriver is still super useful because we can run webdriver to control a bunch of browsers whereas the api I’ve shown above is specific to headless Chrome. Chances are that people who need to run tests in multiple browsers will still use webdriver. With that said, webdriver comes with its own overhead, so if you just need a fast browser, the above example might be for you. It really comes down to using the right tool for the job.

      Good question. Hope this helps!

      1. Eric McKinley Brown says:

        I am attempting to use Chrome v60.x (Canary) as a headless implementation within Protractor/Selenium. I am getting an error saying that the chromedriver.exe that I am using does not support this browser. It seems that the latest version of Selenium I am using has chromedriver v2.26.x. Is there another available version of the chromedriver that I can obtain?

        1. Jim Cummins says:

          It looks like 2.29 is the latest https://sites.google.com/a/chromium.org/chromedriver/downloads

          With that said some folks have had success but they are building chromedriver from source http://blog.faraday.io/headless-chromium/

  4. ngryman says:

    Nice write-up! I used it to build a performance runner: https://github.com/ngryman/speedracer.

    1. Jim Cummins says:

      Glad to help. Speedracer looks pretty cool!

  5. Ricardo says:

    I followed all the steps and got the error:
