
How is this Python script retrieving YouTube URLs from Discogs?

The purpose of the following code, taken from here, is to extract YouTube URLs from the Discogs API.

It requests a JSON version of a list of releases on Discogs matching a particular search query; the HTML equivalent of such a page would be:

https://www.discogs.com/search/?sort=title%2Casc&format_exact=Vinyl&decade=1990&style_exact=Dub+Techno&type=release

It then looks at each release and writes its YouTube URLs to a text file.

How does the following code extract YouTube URLs, given that data does not contain the key videos?

# Copy in your search URL, replace the parameters after /search?
SEARCH_URL = 'https://api.discogs.com/database/search?sort=title&sortorder=desc&style="Dub+Techno"&format=Vinyl&type=release&year=1999&per_page=50'

# Generate an access token at https://www.discogs.com/settings/developers
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'  # insert your own access token here

# Some queries have a lot of results, limit the number by setting this
PAGE_LIMIT = 25

import requests
import time
import sys
from urllib.parse import parse_qs, urlparse
from queue import Queue

INITIAL_URL = SEARCH_URL + '&token=' + ACCESS_TOKEN

q = Queue()
q.put(INITIAL_URL)

videos = []
file = open('video_urls.txt', 'w')


def discogs_request(req_url):
    response = requests.get(
        req_url, headers={'User-agent': 'SearchToVideoList/1.0'})

    if response.status_code == 429:
        time.sleep(60)
        return discogs_request(req_url)
    elif response.status_code == 200:
        return response.json()


while not q.empty():
    url = q.get()
    parsed_url = parse_qs(urlparse(url).query, keep_blank_values=True)

    data = discogs_request(url)

    if data is None:
        continue

    if 'page' in parsed_url and int(parsed_url['page'][0]) > PAGE_LIMIT:
        print('Page limit reached, exiting')
        continue

    if url is INITIAL_URL:
        print('Crawling %s releases in %s pages' % (
            data['pagination']['items'], data['pagination']['pages']))

    if 'results' in data:
        for release in data['results']:
            q.put(release['resource_url'] + '?token=' + ACCESS_TOKEN)

    if 'videos' in data:
        print(data['videos'])
        for video in data['videos']:
            file.write("%s\n" % video['uri'])
            print("Writing the following video to text: {}".format(video['uri']))
        file.flush()

    if 'pagination' in data:
        print('Current page: %s' % data['pagination']['page'])

        if 'next' in data['pagination']['urls']:
            q.put(data['pagination']['urls']['next'])

    q.task_done()


As I have only recently become acquainted with the Python modules involved, I tried to understand the code by abstracting away the queue, so I began by running the following instead:

# Copy in your search URL, replace the parameters after /search?
SEARCH_URL = 'https://api.discogs.com/database/search?sort=title&sortorder=desc&style="Dub+Techno"&format=Vinyl&type=release&year=1999&per_page=50'

# Generate an access token at https://www.discogs.com/settings/developers
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'  # insert your own access token here

# Some queries have a lot of results, limit the number by setting this
PAGE_LIMIT = 25

import requests
import time
import sys
from urllib.parse import parse_qs, urlparse
from queue import Queue

INITIAL_URL = SEARCH_URL + '&token=' + ACCESS_TOKEN

# This function is a wrapper around the requests module, which performs the HTTP request.
# requests.get() returns a response object indicating whether the HTTP resource is available.
def discogs_request(req_url):
    response = requests.get(
        req_url, headers={'User-agent': 'SearchToVideoList/1.0'})

    if response.status_code == 429:
        time.sleep(60)
        return discogs_request(req_url)
    elif response.status_code == 200:
        return response.json()

# The first URL in the queue will be the initial Discogs search for releases that satisfy our criteria/query.
url = INITIAL_URL
print("The URL we are working on is: {}".format(url))

# This line uses urllib.parse to split the given URL into 6 components.
# The general structure of a URL is `scheme://netloc/path;parameters?query#fragment`
print("The parsed URL is: {}".format(urlparse(url)))

# We now only want the 'query' component of the URL, so we access the `query` attribute of the parse result.
print("The query components of the URL are: {}".format(urlparse(url).query))

# We now convert the query string to a dictionary mapping each parameter name to a list of values.
parsed_url = parse_qs(urlparse(url).query, keep_blank_values=True)
print("The query components of the URL as a dictionary are: {}".format(parsed_url))

# Now store the retrieved JSON data.
data = discogs_request(url)
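
For reference, parse_qs maps every query parameter to a list of values, which is why the original script indexes parsed_url['page'][0]. A minimal sketch with a made-up query string:

from urllib.parse import parse_qs

# Illustrative only; this query string is not from the Discogs API
sample = parse_qs('sort=title&year=1999&page=2', keep_blank_values=True)
print(sample)             # {'sort': ['title'], 'year': ['1999'], 'page': ['2']}
print(sample['page'][0])  # '2' -- each value is a list, hence the [0]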

However, the following expression returns False when I run it:

'videos' in data

Here is a sample of data from my script:

{'pagination': {'page': 1,
  'pages': 3,
  'per_page': 50,
  'items': 113,
  'urls': {'last': 'https://api.discogs.com/database/search?sort=title&sortorder=desc&style=%22Dub+Techno%22&format=Vinyl&type=release&year=1999&per_page=50&token=###&page=3',
   'next': 'https://api.discogs.com/database/search?sort=title&sortorder=desc&style=%22Dub+Techno%22&format=Vinyl&type=release&year=1999&per_page=50&token=###&page=2'}},
 'results': [{'country': 'Germany',
   'year': '1999',
   'format': ['Vinyl', '12"', 'White Label'],
   'label': ['Force Inc. Music Works', 'SST Brüggemann GmbH', 'MPO'],
   'type': 'release',
   'genre': ['Electronic'],
   'style': ['Dub Techno', 'Minimal'],
   'id': 1136089,
   'barcode': [],
   'user_data': {'in_wantlist': False, 'in_collection': False},
   'master_id': 83020,
   'master_url': 'https://api.discogs.com/masters/83020',
   'uri': '/Exos-Yellow-Yard/release/1136089',
   'catno': 'FIM 174',
   'title': 'Exos - Yellow Yard',
   'thumb': 'https://i.discogs.com/-Jz0cMoRi-g0vgwmUZOs3P6i_HyFva_fobHjY303-II/rs:fit/g:sm/q:40/h:150/w:150/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTExMzYw/ODktMTI0NDY3MjIz/My5qcGVn.jpeg',
   'cover_image': 'https://i.discogs.com/s5-uqKQaR-CXrpj8dSwWsl3Z0K9ZJs1G9w_ZwgUrsVQ/rs:fit/g:sm/q:90/h:600/w:596/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTExMzYw/ODktMTI0NDY3MjIz/My5qcGVn.jpeg',
   'resource_url': 'https://api.discogs.com/releases/1136089',
   'community': {'want': 270, 'have': 28},
   'format_quantity': 1,
   'formats': [{'name': 'Vinyl',
     'qty': '1',
     'descriptions': ['12"', 'White Label']}]},
  {'country': 'Germany',
   'year': '1999',
   'format': ['Vinyl', '12"', '33 ⅓ RPM', 'White Label'],
   'label': ['Profan', 'SST Brüggemann GmbH', 'MPO'],
   'type': 'release',
   'genre': ['Electronic'],
   'style': ['Dub Techno', 'Minimal Techno', 'Tech House'],
   'id': 9133409,
   'barcode': ['MPO PRO 028 A 33 RPM  K SST', 'MPO  PRO 028 B  33 RPM  K SST'],
   'user_data': {'in_wantlist': False, 'in_collection': False},
   'master_id': 327526,
   'master_url': 'https://api.discogs.com/masters/327526',
   'uri': '/Wassermann-W-I-R-Das-Original-Sven-V%C3%A4th-Mix-Thomas-Mayer-Mix/release/9133409',
   'catno': 'PROFAN 028',
   'title': 'Wassermann - W. I. R. (Das Original + Sven Väth Mix Thomas / Mayer Mix)',
   'thumb': '',
   'cover_image': 'https://st.discogs.com/1504bf7e69cad5ced79c9e7b6cf62bda18dce7eb/images/spacer.gif',
   'resource_url': 'https://api.discogs.com/releases/9133409',
   'community': {'want': 129, 'have': 23},
   'format_quantity': 1,
   'formats': [{'name': 'Vinyl',
     'qty': '1',
     'descriptions': ['12"', '33 ⅓ RPM', 'White Label']}]},

But when I add print statements to see what is being written to the .txt file in the original (first) code block, it is clear that video URLs are being written. So what is going on?

asked Mar 31 at 19:50 by microhaus, edited Apr 1 at 1:02
  • Maybe first use print() (and print(type(...)), print(len(...)), etc.) to see which part of the code is executed and what you really have in the variables. This is called "print debugging" and it helps to see what the code is really doing. – furas Mar 31 at 20:06
  • Maybe first you should check print(data) in both scripts to see if you get the same data. The original code may get data['videos'] while your code gets something different, for example an error message. – furas Mar 31 at 20:07
  • @furas Ah yes, you're right. I haven't programmed in a while; I will port it to PyCharm and have a look at what's going on. – microhaus Mar 31 at 20:09
  • @furas Printing data in the original script and in my simplified one results in two different JSON entries altogether. Thank you for your suggestions. – microhaus Mar 31 at 20:13
  • The original code works in a while loop and checks 'results' in data, which may load another URL, and maybe that URL returns the videos. – furas Mar 31 at 20:17

1 Answer


It looks like the original script does a two-level crawl of the API data. When it reads the initial URL, it gets JSON data like you show, which contains results and pagination as keys, but not videos. With that data it iterates over the results list, reads the resource_url sub-key of each result, and adds those URLs to the queue to fetch later. It also reads the pagination block for a 'next' URL pointing to the next page of results. Each 'next' URL is then processed exactly like the top-level one.

When it gets one of the resource URLs from the queue, it fetches it and receives a JSON document that does not contain results but may instead contain a list of videos. It reads the uri field of each of those video entries and writes it to the file.
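
To make that concrete, a release resource such as https://api.discogs.com/releases/1136089 returns a different JSON shape than the search endpoint. The values below are illustrative placeholders rather than real API output, but per the Discogs API documentation a release may carry a videos list along these lines:

{'id': 1136089,
 'title': 'Yellow Yard',
 'artists': [...],
 'videos': [{'uri': 'https://www.youtube.com/watch?v=XXXXXXXXXXX',
             'title': 'Exos - Yellow Yard - A1',
             'description': '...',
             'duration': 390,
             'embed': True}]}

Since 'videos' in data is True for such a response, the original loop writes each video['uri'] to the file.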

You could change the logic to do both levels of processing in a set of nested loops if you wanted to, without using a queue. It would look something like this (without any error checking logic):

file = open('video_urls.txt', 'w')
outer_data = discogs_request(INITIAL_URL)

while int(outer_data['pagination']['page']) <= PAGE_LIMIT:
    for result in outer_data['results']:
        # The release resource needs the token appended, as in the queue version
        inner_data = discogs_request(result['resource_url'] + '?token=' + ACCESS_TOKEN)

        # Not every release has videos, so default to an empty list
        for video in inner_data.get('videos', []):
            file.write(video['uri'] + '\n')

    if 'next' in outer_data['pagination']['urls']:
        outer_data = discogs_request(outer_data['pagination']['urls']['next'])
    else:
        break

file.close()
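
As an aside, hand-concatenating '?token=' + ACCESS_TOKEN works here only because resource_url carries no query string of its own. A variant of the request wrapper (my sketch, not part of the original answer) lets requests encode the token instead, assuming the URL passed in does not already embed a token:

def discogs_request(req_url, token=ACCESS_TOKEN):
    # requests merges `params` into the query string, so the caller no longer
    # needs to append the token manually (assumes req_url lacks a token already)
    response = requests.get(
        req_url,
        params={'token': token},
        headers={'User-agent': 'SearchToVideoList/1.0'})

    if response.status_code == 429:  # rate limited: back off and retry
        time.sleep(60)
        return discogs_request(req_url, token)
    elif response.status_code == 200:
        return response.json()

With that change, the nested loops above could call discogs_request(result['resource_url']) directly.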