SEO with AngularJS

AngularJS[1] works with templates and modifies the contents of views, which makes the pre-rendered HTML invalid[2] for search engines. But there is a way to properly fix this problem and have efficient SEO support for any AngularJS application.

[1] I use AngularJS in this article, but the solution described below can be used with any AJAX application framework.

[2] Invalid here refers not to the structure of the HTML but to its contents: the search engine will see and index <h1>{{myTitle}}</h1> instead of <h1>My AngularJS Application</h1>.

How do major search engines work with AJAX applications?

By major, I mean Google and Bing, which are the most prevalent search engines.

There is a convention between web servers and search engine crawlers that allows dynamically created content to be visible to crawlers. Google and Bing currently support this convention.

The convention relies on hash fragments. Traditionally, they have been used to point to a portion of a static HTML document. In an AJAX application, these fragments are used to indicate the application's state.

Search engines will map each pretty url containing a hash fragment into an ugly url with the corresponding escaped fragment:

  • pretty url: http://redpelicans.com/#!/page/with/ajax/content
  • ugly url: http://redpelicans.com/?_escaped_fragment_=/page/with/ajax/content
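
For illustration, here is a hypothetical helper that reproduces this mapping (for simplicity, it ignores the extra percent-encoding of special characters such as % and & that the convention requires):

function prettyToUgly(prettyUrl) {
    // move the hash fragment into the _escaped_fragment_ query parameter
    var parts = prettyUrl.split('#!');
    return parts[0] + '?_escaped_fragment_=' + parts[1];
}

// prettyToUgly('http://redpelicans.com/#!/page/with/ajax/content')
// => 'http://redpelicans.com/?_escaped_fragment_=/page/with/ajax/content'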

When a user accesses our AJAX application, the server receives a pretty url request; when a crawler wants to index it, the server receives an ugly url request and expects the content at that url to be the final, fully generated content for that page.

What we need to do is handle the ugly url and send back the pre-generated HTML of our AJAX application.

If you want to go further into the technical details of the hash fragments convention, take a look at Google's full specification.

Configure AngularJS

There are several ways to inform search engines that your website contains AJAX, but the simplest one is to use hashbang urls:

myApp.config(['$locationProvider', function($locationProvider) {  
    $locationProvider.hashPrefix('!');
}]);
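
With this prefix, every route of the application is served under a hashbang url. As a minimal sketch, assuming ngRoute and hypothetical template and controller names, the route behind the pretty url above could be declared like this:

myApp.config(['$routeProvider', function($routeProvider) {
    // reachable at http://redpelicans.com/#!/page/with/ajax/content
    $routeProvider.when('/page/with/ajax/content', {
        templateUrl: 'partials/content.html',
        controller: 'ContentCtrl'
    });
}]);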

Generating the pre-rendered HTML

To generate the HTML, I use PhantomJS, a headless browser that allows us to load the page for a given ugly url, run its JavaScript, and produce the expected HTML.

The ready flag

First, we need to set a flag in the DOM of our webpage to tell PhantomJS that the page is in its final state (all asynchronous content is fully loaded).
If you only wait for all JavaScript to be executed, you may miss content, since asynchronous requests may not have completed yet.

Be careful to put the attribute inside your app directive but outside any view directives:

<body data-status="{{status}}">  

Set the attribute value in your controller (if you have asynchronous content, set the status in the callback of your requests):

$scope.status = 'ready';
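
As a sketch, a controller loading its data over $http could look like this (the /api/content endpoint and the content variable are hypothetical):

myApp.controller('ContentCtrl', ['$scope', '$http', function($scope, $http) {
    $http.get('/api/content').then(function(response) {
        $scope.content = response.data;
        // the page reaches its final state only once the request has completed
        $scope.status = 'ready';
    });
}]);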

Configure the server

Then, we need to handle ugly urls. Redpelicans is a fullstack company, so I naturally chose NodeJS and Express to route them.

I made an Express middleware to handle ugly urls:

app.use(escapedFragment(__dirname + '/snapshots'));  

The ugly url middleware

This middleware checks if the url contains an escaped fragment (which tells us that the request came from a crawler) and generates the HTML accordingly:

var fs = require('fs');

var escapedFragment = function(snapshotsDir) {
    return function(req, res, next) {
        // only crawler requests carry the _escaped_fragment_ query parameter
        var fragment = req.query._escaped_fragment_;
        if (fragment === undefined) return next();

        // normalize the fragment into a snapshot file name
        if (fragment === '' || fragment === '/') fragment = '/home.html';
        if (fragment.charAt(0) !== '/') fragment = '/' + fragment;
        if (fragment.indexOf('.html') === -1) fragment += '.html';

        var snapshotPath = snapshotsDir + fragment;
        if (!fs.existsSync(snapshotPath)) {
            var url = req.protocol + '://' + req.get('Host') + req.originalUrl;
            generateSnapshot(url, snapshotPath, function(err) {
                if (err) return res.sendStatus(404);
                res.sendFile(snapshotPath);
            });
        } else {
            res.sendFile(snapshotPath, function(err) {
                if (err) res.sendStatus(404);
            });
        }
    };
};

The generated HTML is saved in a snapshots folder, and we check whether a given snapshot already exists before generating it, in order to speed up our response time.
This caching is relevant for us because we want to index our website with static content. If you have an AJAX application with very dynamic content, you will have to generate a fresh snapshot for each crawler call.
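
If your content only changes from time to time, a middle ground is to regenerate snapshots once they become stale. A minimal sketch, assuming a hypothetical maximum age of one day:

var MAX_AGE = 24 * 60 * 60 * 1000; // one day, in milliseconds

var isStale = function(snapshotPath) {
    // a snapshot is stale when its last modification is older than MAX_AGE
    var mtime = fs.statSync(snapshotPath).mtime.getTime();
    return Date.now() - mtime > MAX_AGE;
};

In the middleware above, you would then regenerate the snapshot when !fs.existsSync(snapshotPath) || isStale(snapshotPath).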

Generate the HTML with PhantomJS

The workflow is quite simple: we request the page through PhantomJS and wait for the ready flag. Then, we store the HTML (you could also send it directly).

var phantom = require('phantom');
var async = require('async');

var generateSnapshot = function(url, snapshotPath, cb) {
    if (!url || !url.length) return cb(new Error('Missing url'));
    // turn the ugly url back into the pretty one PhantomJS must load
    url = url.replace('?_escaped_fragment_=', '#!');

    phantom.create(function (ph) {
        ph.createPage(function (page) {
            page.open(url, function (status) {
                if (status != 'success') {
                    ph.exit();
                    return cb(new Error('Unable to open ' + url));
                }

                async.retry(3, getHTML, function(err, html) {
                    if (err) {
                        ph.exit();
                        return cb(new Error('Unable to get html of page at ' + url));
                    }

                    fs.writeFile(snapshotPath, html, function(err) {
                        ph.exit();
                        if (err) return cb(err);
                        cb();
                    });
                });

                function getHTML(cb, result) {
                    page.evaluate(function() {
                        var body = document.getElementsByTagName('body')[0],
                            ready = body.getAttribute('data-status');
                        return ready == 'ready'
                            ? document.getElementsByTagName('html')[0].outerHTML
                            : false;
                    }, function(result) {
                        if (result === false) {
                            // delay the next try so asynchronous requests get time to complete
                            return setTimeout(function() {
                                cb(new Error('Page not ready'));
                            }, 200);
                        }
                        cb(null, result);
                    });
                }
            });
        });
    });
};
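
You can test the whole chain by requesting an ugly url directly, for instance http://localhost:3000/?_escaped_fragment_=/page/with/ajax/content (assuming your Express app listens on port 3000): the response should contain the fully rendered HTML instead of the raw AngularJS templates.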

This is all you need to index your AngularJS application.
I also recommend adding a sitemap to let crawlers know what they have to index.
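
As a minimal sketch, such a sitemap could be served by the same Express app; the convention is to list the pretty urls, and the entries shown here are hypothetical:

app.get('/sitemap.xml', function(req, res) {
    var urls = ['/#!/page/with/ajax/content']; // list your pretty urls here
    var xml = '<?xml version="1.0" encoding="UTF-8"?>'
        + '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        + urls.map(function(u) {
            return '<url><loc>http://redpelicans.com' + u + '</loc></url>';
        }).join('')
        + '</urlset>';
    res.type('application/xml').send(xml);
});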

In this article, I described a solution for indexing an AJAX application in the context of AngularJS and NodeJS, but the same concept can be applied with other technologies. I hope you enjoyed reading it.