javascript - Stop search engines from indexing specific parts of the page

I have a PHP page that renders a book of, let's say, 100 pages. Each page has a specific URL (e.g. /my-book/page-one, /my-book/page-two, etc.).

When flipping the pages, I change the URL with the History API, using url.js.

Since all the book content is rendered on the server side, the content is indexed by search engines (I'm referring to Google in particular), but the URLs are wrong (e.g. a snippet from page-two is listed under the URL of page-one).

How can I stop search engines (at least Google) from indexing all the content on the page, so that only the visible book page is indexed?

Would it work if I rendered the content in a different way, for example <div data-page-number="1" data-content="Lorem ipsum..."></div>, and then converted it into the needed format on the JavaScript side? That would make the page slower, and in fact I'm not sure whether Google would still index the content changed by JavaScript.

The code looks like this:

<div data-page="1">Page 1</div>
<div data-page="2">Page 2</div>
<div data-page="3" class="current-page">Page 3</div>
<div data-page="4">Page 4</div>
<div data-page="5">Page 5</div>

The only visible div is the .current-page one. The same content is served on multiple URLs because that's needed so the user can flip between pages.

For example, /book/page/3 will render this piece of HTML while /book/page/4 renders the same thing, the only difference being the current-page class which is added to the 4th element.

Google did index the different URLs, but it did so incorrectly: for example, the snippet Page 5 links to /book/page/2, which shows the user Page 2 (not Page 5).

How can I tell Google (and other search engines) that I only want the content in .current-page to be indexed?

asked May 6, 2016 at 9:46 by Ionică Bizău, edited Jul 16, 2017 at 6:26 by Cœur

1 You can use robots.txt to tell Google. AFAIK Google respects it. Most probably it would be better to build a sitemap.xml and tell Google what to index and what not. You can also use Google's Webmaster Tools to push the changes and see how Google is crawling your site. – Praveen Kumar Purushothaman Commented May 6, 2016 at 9:48
The question is how? I'm not sure if any of these would work. In short, I serve the same HTML on different urls, but I show only a specific part of it depending on the url. – Ionică Bizău Commented May 6, 2016 at 10:00
Can you give an example of a URL that is indexed wrongly? Or do you make the change on click of the element? – OBender Commented May 8, 2016 at 9:29
@OBender Let's suppose I have Hello World on page 42 (under the url /my-book/page/42). It's very possible that Google indexes this content on another url (and obviously another page), for example, /my-book/page/7. That happens because I serve the same content on multiple urls. I have no idea how this can be fixed... – Ionică Bizău Commented May 8, 2016 at 10:50
Do you mean that /my-book/page/42 and /my-book/page/7 have the same content? – OBender Commented May 8, 2016 at 12:40


4 Answers

As I understand it, the issue is that you serve the same content on many URLs, for example:

www.my-awesome-domain.com/my-book/page/42

www.my-awesome-domain.com/my-book/page/7

And the visible content of the page is adjusted by JavaScript that runs when the user clicks certain elements on your site.

In this case you need to do two things:

Mark your URLs as canonical pages in any of the ways described in this Google document: https://support.google.com/webmasters/answer/139066?hl=en
Add a feature so that each page loads into the same state after a full page refresh. For example, you can use a hash parameter when navigating, as described in the article here: or here is the overview of the technique

Today Googlebot executes JavaScript, as announced on their official blog: https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html

So if you achieve proper page behavior when hitting Refresh (F5) and specify the canonical page for each URL, the pages will be crawled correctly, and following a link will take you to the linked page.
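As a rough illustration of those two steps, here is a minimal sketch. The helper names `canonicalTag` and `pageFromPath` are hypothetical, and the `/book/page/<n>` path pattern is only assumed from the examples in the question. The first helper builds a self-referencing canonical tag for one book page, and the second recovers the page number from the path so that a full refresh can restore the same page.

```javascript
// Hypothetical helper: builds the canonical <link> tag for one book page.
// The /book/page/<n> pattern is an assumption based on the question's examples.
function canonicalTag(baseUrl, pageNumber) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer');
  }
  // Drop trailing slashes so the joined URL never contains "//".
  const root = baseUrl.replace(/\/+$/, '');
  return `<link rel="canonical" href="${root}/book/page/${pageNumber}">`;
}

// Hypothetical helper: extracts the page number from a path such as
// "/book/page/4". Returns null when the path is not a book page.
function pageFromPath(pathname) {
  const match = /^\/book\/page\/(\d+)\/?$/.exec(pathname);
  return match ? Number(match[1]) : null;
}
```

In the browser, `pageFromPath(window.location.pathname)` could be called on load to decide which page to show. Note that a canonical tag only tells the crawler which URL is preferred; it does not change which text is present in the HTML.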

If you need more guidance on how to do this with url.js, please post another question (so that it is properly documented for others) and I will be glad to help.

The answer is really simple: you can't do it. There is no technical way to keep the same content under different URLs and ask search engines to index only part of it.

If you are OK with having only one page indexed, you can use canonical URLs, as suggested before: on every sub-page, place a canonical URL that links to the main page.

You may find a "hack" that uses the special tags of the Google Search Appliance: googleon and googleoff.

https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/preparing.html

The only issue is that this will most likely not work with Googlebot (at least no one will guarantee that it will) or with any other search engine.
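For reference, this is roughly what that hack would look like, as a sketch only. The function name `wrapHiddenPages` is hypothetical; the `googleoff: index` / `googleon: index` comment tags are the Google Search Appliance ones mentioned above, and, as said, there is no guarantee that Googlebot or any other crawler honors them.

```javascript
// Hypothetical helper: renders every book page as a <div>, wrapping all pages
// except the current one in Google Search Appliance googleoff/googleon comments.
// `pages` is an array of page texts; page numbers start at 1.
// The texts are inserted as-is, so they are assumed to be HTML-safe already.
function wrapHiddenPages(pages, currentPage) {
  return pages
    .map((text, index) => {
      const n = index + 1;
      if (n === currentPage) {
        return `<div data-page="${n}" class="current-page">${text}</div>`;
      }
      return `<!--googleoff: index--><div data-page="${n}">${text}</div><!--googleon: index-->`;
    })
    .join('\n');
}
```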

I don't think you will be able to achieve what you are looking for.

I can't see how robots.txt would have any effect. Canonical tags don't work on divs.

Google has spoken about sites like these in the past and made some suggestions for indexing. Here are a couple of links that may help:

https://www.seroundtable.com/seo-single-page-12964.html

https://www.seroundtable.com/google-on-crawling-javascript-sites-progressive-web-apps-21737.html

Save the content in a JSON file which you do not render in the HTML. From the server, serve only the correct page: the content which is visible to the user.

When the user clicks the buttons (prev/next page links, etc.), use JavaScript to render the content you have in the JSON file and change the URL like you're already doing.

That way you know you always serve the right content from the server, and Googlebot will index the pages correctly.
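A minimal sketch of the client side of this idea follows. Several things here are assumptions and not part of the original setup: the `/book/pages.json` endpoint (assumed to return an array of page texts), the `#book` container element, and the direct use of `history.pushState` where the question uses url.js.

```javascript
// Pure helper: returns the markup for a single page.
// `pages` is an array of page texts; page numbers start at 1.
function renderPage(pages, pageNumber) {
  const text = pages[pageNumber - 1];
  if (text === undefined) {
    throw new RangeError(`No such page: ${pageNumber}`);
  }
  return `<div data-page="${pageNumber}" class="current-page">${text}</div>`;
}

// Browser-only: fetches the book content and shows one page.
// '/book/pages.json' and '#book' are hypothetical names for illustration.
async function showPage(pageNumber) {
  const response = await fetch('/book/pages.json');
  const pages = await response.json();
  document.querySelector('#book').innerHTML = renderPage(pages, pageNumber);
  // Update the address bar without reloading, as the question already does.
  history.pushState({ page: pageNumber }, '', `/book/page/${pageNumber}`);
}
```

A prev/next button handler would then call something like `showPage(currentPage + 1)`. Since the server only ever sends the page that matches the URL, the HTML a crawler receives for /book/page/4 contains only Page 4.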
