javascript - PhantomJS not mimicking browser behavior when looking at YouTube videos - Stack Overflow

I posted this question to the PhantomJS mailing list a week ago, but have gotten no response. Hoping for better luck here...

I've been trying to use PhantomJS to scrape information from YouTube, but haven't been able to get it working.

Consider a YouTube video embedded into a web page via an iframe element. If you load the URL referenced by the src attribute directly into a browser, you get a full-page version of the video, where the video is encapsulated in an embed element. The embed element is not present in the initial page content; rather, script tags on the page run JavaScript that eventually adds the embed element to the DOM. I want to be able to access this embed element when it appears, but it never appears when I load the page in PhantomJS.

Here's the code I'm using:

var page = require("webpage").create();

page.settings.userAgent = "Mozilla/5.0 (X11; rv:24.0) Gecko/20130909 Firefox/24.0";

page.open("https://www.youtube.com/embed/dQw4w9WgXcQ", function (status) {
  if (status !== "success") {
    console.log("Failed to load page");
    phantom.exit();
  } else {
    setTimeout(function () {
      var size = page.evaluate(function () {
        return document.getElementsByTagName("EMBED").length;
      });
      console.log(size);
      phantom.exit();
    }, 15000);
  }
});

I only ever see "0" printed to the console, no matter how long I set the timeout. If I look for "DIV" elements I get "3", and if I look for "SCRIPT" elements I get "5", so the code seems to be sound. I just never find any "EMBED" tags, even though if I load the URL above in my browser I do find one soon after page-load.

Does anyone have any idea what the problem might be? Thanks in advance for any help.

Asked May 10, 2015 at 0:07 by Sean
  • Have you tried just dumping the full HTML to the console? It may be that YouTube is responding with something different than what you see in your browser, perhaps based on user-agent filtering. – elixenide Commented May 10, 2015 at 0:16
  • That's why I set the User-Agent in my code above, to the same string that my actual browser uses. – Sean Commented May 10, 2015 at 0:28
  • Hmmm. So you do. Sorry; posted that comment from a mobile device. Nonetheless: have you dumped out the full HTML to see what you're getting? – elixenide Commented May 10, 2015 at 0:52
  • I can't recall if I tried this during my experiments a week ago. But if I'm sending the same user-agent as my browser's, is there any reason to expect different HTML? I've written programs that do a fair bit of automated web access, and I can't think offhand of any sites that I couldn't make behave just by setting the user agent appropriately. – Sean Commented May 10, 2015 at 4:41
  • Well, I agree that the user agent is probably not it. But, since the <embed> tag isn't being found by your script, the question is why not. The full HTML may answer that. – elixenide Commented May 10, 2015 at 4:55

3 Answers

Patrick's answer got me on the right track, but the full story is as follows.

YouTube's JavaScript probes the browser's capabilities before deciding whether to create some kind of video element. After trawling through the minified code, I was eventually able to fool YouTube into thinking PhantomJS supported HTML5 video by wrapping document.createElement in the page's onInitialized callback.

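// onInitialized fires after the page object is created but before the URL loads,
// so the wrapper is in place before any of the page's own scripts run.
page.onInitialized = function () {
  page.evaluate(function () {
    var create = document.createElement;
    document.createElement = function (tag) {
      var elem = create.call(document, tag);
      // Make every <video> element claim it can play whatever it is asked about.
      if (tag === "video") {
        elem.canPlayType = function () { return "probably"; };
      }
      return elem;
    };
  });
};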
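
To check the wrapper outside PhantomJS, the same idea can be exercised against a stand-in document object. This is only a sketch: installVideoShim, looksLikeHtml5VideoSupported and fakeDocument are hypothetical names, and the probe is a guess at the general shape of such a check, not YouTube's actual code.

```javascript
// Hypothetical helper: wraps doc.createElement so that every <video> element
// claims it can play any media type. Same technique as the onInitialized
// callback above, but parameterised on the document so it can be tested.
function installVideoShim(doc) {
  var create = doc.createElement;
  doc.createElement = function (tag) {
    var elem = create.call(doc, tag);
    // Callers may pass "video" or "VIDEO", so normalise before comparing.
    if (String(tag).toLowerCase() === "video") {
      elem.canPlayType = function () { return "probably"; };
    }
    return elem;
  };
}

// Assumed shape of a capability probe (not taken from YouTube's code).
function looksLikeHtml5VideoSupported(doc) {
  var video = doc.createElement("video");
  return typeof video.canPlayType === "function" &&
    video.canPlayType('video/mp4; codecs="avc1.42E01E"') !== "";
}

// Stand-in document whose elements have no canPlayType method.
var fakeDocument = {
  createElement: function (tag) {
    return { tagName: String(tag).toUpperCase() };
  }
};

var before = looksLikeHtml5VideoSupported(fakeDocument);
installVideoShim(fakeDocument);
var after = looksLikeHtml5VideoSupported(fakeDocument);
console.log(before, after);
```

Unlike the original wrapper, this version lower-cases the tag name before comparing, so a probe that asks for "VIDEO" is covered as well.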

However, this was a misstep; to get the <embed> tag I was originally after, I needed to make YouTube's code think PhantomJS supports Flash, not HTML5 video. That's also doable:

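page.onInitialized = function () {
  page.evaluate(function () {
    // Replace navigator with a stub that advertises a Flash plugin.
    // Note that this drops every other navigator property (userAgent, etc.).
    window.navigator = {
      plugins: { "Shockwave Flash": { description: "Shockwave Flash 11.2 e202" } },
      mimeTypes: { "application/x-shockwave-flash": { enabledPlugin: true } }
    };
  });
};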
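
Because the snippet above replaces navigator wholesale, properties such as userAgent are no longer readable from the stub. A variant that copies selected properties across might look like the sketch below. makeFlashNavigator and looksLikeFlashSupported are hypothetical names, and the probe is an assumption about how Flash detection is commonly written, not YouTube's actual code.

```javascript
// Hypothetical helper: builds a navigator-like object that advertises Flash
// while copying the listed properties from the real navigator.
function makeFlashNavigator(realNavigator, keep) {
  var fake = {};
  for (var i = 0; i < keep.length; i++) {
    fake[keep[i]] = realNavigator[keep[i]];
  }
  var flash = {
    name: "Shockwave Flash",
    description: "Shockwave Flash 11.2 e202"
  };
  fake.plugins = { "Shockwave Flash": flash };
  fake.mimeTypes = { "application/x-shockwave-flash": { enabledPlugin: flash } };
  return fake;
}

// Assumed shape of a Flash probe (not taken from YouTube's code).
function looksLikeFlashSupported(nav) {
  var mime = nav.mimeTypes && nav.mimeTypes["application/x-shockwave-flash"];
  var plugin = nav.plugins && nav.plugins["Shockwave Flash"];
  return Boolean(mime && mime.enabledPlugin && plugin && plugin.description);
}

// Stand-in for a navigator with no plugins registered.
var realLike = {
  userAgent: "Mozilla/5.0 (X11; rv:24.0) Gecko/20130909 Firefox/24.0",
  plugins: {},
  mimeTypes: {}
};
var spoofed = makeFlashNavigator(realLike, ["userAgent"]);
console.log(looksLikeFlashSupported(realLike), looksLikeFlashSupported(spoofed));
```

To use this in PhantomJS, the helper would have to be defined inside the function passed to page.evaluate, since that function runs in the page's context and cannot see variables from the PhantomJS script's scope.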

So that's how it's done.
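
Since the element is added some time after the page loads, polling for it may be more convenient than a single fixed 15-second timeout. A minimal sketch follows; waitFor is a hypothetical helper written for this example, not part of the PhantomJS API, and the demo uses a stand-in condition so it runs anywhere.

```javascript
// Hypothetical helper: calls check() right away and then every intervalMs
// until it returns true (onReady) or timeoutMs has elapsed (onTimeout).
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
  if (check()) {
    onReady();
    return;
  }
  var start = Date.now();
  var timer = setInterval(function () {
    if (check()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
}

// Demo with a stand-in condition that becomes true on the third check.
var calls = 0;
waitFor(
  function () { calls += 1; return calls >= 3; },
  function () { console.log("ready after", calls, "checks"); },
  function () { console.log("timed out"); },
  1000,
  10
);
```

In the PhantomJS script, check would wrap a page.evaluate call that returns document.getElementsByTagName("EMBED").length > 0, and both callbacks would end with phantom.exit().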

PhantomJS does not support Flash or the HTML5 video element.

As an option, try building PhantomJS with video/audio support yourself.

Original answer link: https://github.com/ariya/phantomjs/issues/10839#issuecomment-331457673
