
How I Scraped YouTube Comments with Bright Data to Understand Customer Sentiment

by Ayinketh, December 1st, 2024

Too Long; Didn't Read

Overcome the challenges of traditional web scraping with Bright Data, an all-in-one tool designed to handle the CAPTCHAs, honeypots, rate limiting, request blocking, and IP blocks you run into during data retrieval. Combine it with Python libraries like Playwright to scrape data from YouTube.

In today’s world, where gaining valuable, data-driven insights is crucial for business growth and improving services, social media platforms like Facebook, YouTube, Instagram, and Twitter/X have become central to our everyday lives. People freely share their thoughts, opinions, and experiences, turning social media into a goldmine of public data. By scraping this data, we can unlock new opportunities to understand market trends and consumer behavior, helping to improve products and services in meaningful ways.



AI-generated image of a YouTube bot

Imagine your company just launched an exciting new product or service, and posted videos on YouTube showcasing its features and benefits. By scraping YouTube comments, you can gather valuable insights into how users are perceiving and reviewing your product—helping you understand what’s working and where there might be room for improvement.


In this article, we’ll explore:

  • The challenges of data scraping using traditional methods


  • How I overcame these hurdles to scrape data from YouTube effectively using Bright Data and Python

By the end of this article, you will understand how to scrape insight-rich social media data with Bright Data’s tools.

Challenges with the Data Scraping Process When Using the Traditional Approach

The traditional approach to scraping data from any platform usually starts by figuring out the platform's HTML structure. Next, you pinpoint where the information you need is located on the page. Then, you write scripts in Python using popular frameworks like Selenium, Beautiful Soup, or Playwright. But it does not end there. Every social media platform has specific measures that prevent data misuse or scraping. Some of the measures include:




  • Blocking IP Address: Some platforms block your IP address when you quickly make multiple automated requests from the same IP address. The website may flag the IP address as harmful if it detects unusual traffic patterns.



  • Rate Limiting: When the number of requests exceeds a certain threshold, the website rate limits the requests to prevent the abuse of their servers.



  • Header-based Request Blocking: Websites can block requests from specific sources based on headers like User-Agent and Referer. If these headers look suspicious or illegitimate, the website may block the request.



  • CAPTCHAs: Another common way to stop abnormal activity is to ask the user to solve a CAPTCHA before reaching the website content. This step checks that the visitor is an actual human, not an automated bot.

  • Usage of Honeypots: Certain online platforms incorporate a sneaky component into their website’s source code which is not visible to users, but web scrapers can interact with it. If your script comes across this trap, the website will become suspicious of the activity and impose restrictions on the web scraper.
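
To give a sense of the extra code the traditional approach tends to need, here is a minimal sketch of one common workaround for rate limiting: retrying with exponential backoff. It is an illustration only and not part of the script used later in this article. The names fetch_with_backoff, backoff_delays, and RateLimited are hypothetical, and fetch stands in for whatever request function your own scraper uses.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a fetch function raises when the site throttles it."""


def backoff_delays(retries, base=1.0, cap=60.0):
    """Return the wait time (in seconds) before each retry: base * 2**n, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_backoff(fetch, url, retries=4, base=1.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after every rate-limited attempt."""
    for delay in backoff_delays(retries, base):
        try:
            return fetch(url)
        except RateLimited:
            # A little random jitter keeps parallel scrapers from retrying in lockstep
            sleep(delay + random.uniform(0, delay / 10))
    # Final attempt: let the error propagate if the site is still throttling
    return fetch(url)
```

Backoff only addresses rate limiting. IP blocks, CAPTCHAs, and honeypots each need their own handling, which is where the maintenance burden of this approach comes from.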


How I Scraped YouTube Data with Bright Data and Python

Bright Data is an efficient all-in-one proxy and AI-powered web scraping tool that simplifies data scraping projects with a headful GUI browser that is fully compatible with Puppeteer/Playwright/Selenium APIs. Bright Data's powerful unlocker infrastructure and premium proxy network allow you to bypass the previously mentioned challenges right out of the box.


Bright Data expertly tackles challenges like website blocks, CAPTCHAs, and fingerprints by using advanced AI to mimic real user behavior and avoid detection. Plus, its Scraping Browser comes packed with features that make web scraping more reliable while saving you time, money, and effort.


How to use Bright Data’s Scraping Browser


1. Sign up for a free trial on the Bright Data website. You can do so by clicking on “Start free trial” or “Start free with Google”. You can proceed to the next step if you have an existing account.


2. From the dashboard, navigate to the “Proxy and Scraping Infrastructure” section and click on the “Add” button, then select “Scraping Browser” from the dropdown menu.




3. Enter a name of your choice in the form to create a new Scraping Browser.



4. After creating a new Scraping Browser instance, click on its name, and navigate to “Access Parameters” to access the hostname, username, and password information.


5. You can use these parameters in the following Python script to access the Scraping Browser instance.
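
As a sketch of how those access parameters fit together, the helper below assembles the WebSocket URL in the same format the script later in this article uses. Reading the values from environment variables is my own suggestion for keeping credentials out of source code; the variable names BRIGHTDATA_USERNAME, BRIGHTDATA_PASSWORD, and BRIGHTDATA_HOST are hypothetical, as are both function names.

```python
import os


def build_browser_url(username, password, host):
    """Combine the access parameters into the wss:// URL format used by the script."""
    return f"wss://{username}:{password}@{host}"


def browser_url_from_env(env=os.environ):
    """Read the (hypothetical) environment variables and build the URL."""
    return build_browser_url(
        env["BRIGHTDATA_USERNAME"],
        env["BRIGHTDATA_PASSWORD"],
        env["BRIGHTDATA_HOST"],
    )
```

A missing variable raises a KeyError immediately, which is usually easier to diagnose than a failed browser connection later on.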


Analyzing YouTube Comments to Gain Product Insights

Suppose you work for a company and want to know how people perceive your product. You scrape the comments of a YouTube video that specifically reviews your product and analyze them to arrive at some metrics.


We will use a review of the iPhone 16 to learn what people think.


Prerequisites

  • Please ensure that your computer already has Python installed.


  • Install the necessary packages in your project folder. You’ll use the Playwright Python library to drive the browser and Pandas to get insights from the data. The script makes asynchronous requests with asyncio, which ships with Python’s standard library and needs no separate installation. You will use the NLTK and WordCloud libraries to analyze the retrieved comments, and Matplotlib to plot the results.
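
Before running the scripts, it can help to confirm that the packages are importable. The following is a small sketch using only the standard library; the helper name missing_packages is hypothetical. The list holds the import names of the libraries used in this article, including Matplotlib, which the plotting code relies on.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the libraries used in this article
REQUIRED = ["playwright", "pandas", "nltk", "wordcloud", "matplotlib"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Install these first:", ", ".join(missing))
    else:
        print("All required packages are available.")
```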

Extracting the YouTube Video Comments

  1. Start by extracting the comments from the iPhone 16 review video.




2. Import the necessary Python libraries in your Python script and create a get_comments() method to get the comments from the video page.



import asyncio
import csv

import pandas as pd
from playwright.async_api import async_playwright


async def get_comments():
    async with async_playwright() as playwright:

        auth = '<provide username here>:<provide password here>'
        host = '<provide host name here>'
        browser_url = f'wss://{auth}@{host}'

        # Connect to the Scraping Browser
        browser = await playwright.chromium.connect_over_cdp(browser_url)
        page = await browser.new_page()
        page.set_default_timeout(3 * 60 * 1000)

        # Open the YouTube video page in the browser
        await page.goto('https://www.youtube.com/watch?v=v94jRN2FhGo&ab_channel=MarquesBrownlee')

        # Scroll down so that the comment section starts loading
        for i in range(2):
            await page.evaluate("window.scrollBy(0, 500)")
            await page.wait_for_timeout(2000)

        await page.wait_for_selector("ytd-comment-renderer")

        # Parse the HTML tags to get the comments and likes
        data = await page.query_selector_all('ytd-comment-renderer#comment')
        comments = []
        for item in data:
            comment_div = await item.query_selector('yt-formatted-string#content-text')
            comment_likes = await item.query_selector('span#vote-count-middle')
            if comment_div is None or comment_likes is None:
                continue  # Skip entries that lack the expected elements
            comments.append({
                "Comments": await comment_div.inner_text(),
                "Likes": await comment_likes.inner_text(),
            })

        # Store the comments in a CSV file
        with open("youtube_comments.csv", 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=["Comments", "Likes"])
            writer.writeheader()
            writer.writerows(comments)

        # Convert the CSV to a data frame for further processing
        df = pd.read_csv("youtube_comments.csv")

        await browser.close()
        return df


3. The get_comments() method then works as follows:


  • Start by connecting to Bright Data’s Scraping Browser by using the credentials.
  • Create a new page pointing to the video from which you want to retrieve the comments.
  • Wait for the page to load and identify the HTML element that encloses each comment of the video (ytd-comment-renderer#comment).
  • Iterate through each comment, extracting the content and the corresponding number of likes.
  • Store these details in the file “youtube_comments.csv” in your working folder.
  • Transform that CSV file contents into Pandas Dataframe for further processing.
  • This method generates the CSV file that contains the comment data.
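
One detail to keep in mind: the Likes column holds the text exactly as it is displayed on the page. Assuming YouTube abbreviates large counts (for example 1.2K) and leaves the field empty when a comment has no likes, a small helper such as the hypothetical parse_likes below can turn that text into integers before analysis. This is a sketch under those assumptions and not part of the original script.

```python
def parse_likes(text):
    """Convert a displayed like count such as '87', '1.2K' or '3M' to an int."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    cleaned = str(text).strip().upper().replace(",", "")
    if not cleaned or cleaned == "NAN":
        return 0  # Assumption: an empty (or missing) field means no likes
    if cleaned[-1] in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(float(cleaned))
```

Since get_comments() is a coroutine, you can run it with df = asyncio.run(get_comments()) and then convert the column with df["Likes"] = df["Likes"].apply(parse_likes).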


What do people think about the product?

We’re almost through! You have extracted the data from the YouTube Video. Next, let’s dig into the insights provided by the data.


First, we need to gauge how many people have a positive outlook on the product. To do that, you’ll run a sentiment analysis on the comments with NLTK, a widely used natural language processing library.


import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("stopwords", quiet=True)
nltk.download("vader_lexicon", quiet=True)

# Create the analyzer once instead of once per comment
sentiment_analyzer = SentimentIntensityAnalyzer()


def transform_comments(df):
    # Clean the comments: trim, lowercase, drop special characters and newlines
    df["Cleaned Comments"] = (
        df["Comments"].astype(str).str.strip().str.lower()
        .str.replace(r"[^\w\s]+", "", regex=True)
        .str.replace("\n", " ")
    )

    # Remove common English stopwords
    stop_words = set(stopwords.words("english"))
    df["Cleaned Comments"] = df["Cleaned Comments"].apply(
        lambda comment: " ".join(word for word in comment.split() if word not in stop_words)
    )

    # Analyse the sentiment of each comment and classify it
    df["Sentiment"] = df["Cleaned Comments"].apply(analyze_sentiment)

    # Create a bar graph to understand the sentiments of people
    sentiment_counts = df.groupby("Sentiment").size().reset_index(name="Count")
    colors = {"Negative": "red", "Neutral": "blue", "Positive": "green"}
    plt.bar(
        sentiment_counts["Sentiment"],
        sentiment_counts["Count"],
        color=[colors[label] for label in sentiment_counts["Sentiment"]],
    )
    plt.grid(axis="y", linestyle="--", alpha=0.7)
    plt.show()


def analyze_sentiment(text):
    scores = sentiment_analyzer.polarity_scores(text)
    sentiment_score = scores["compound"]

    if sentiment_score <= -0.5:
        sentiment = "Negative"
    elif sentiment_score <= 0.5:
        sentiment = "Neutral"
    else:
        sentiment = "Positive"
    return sentiment


In the above code, you clean the comments by converting them to lowercase and removing leading and trailing whitespace, special characters, and newlines.


Then, you remove the common English stopwords, which don’t contribute much to the sentiment analysis.


After that, the sentiment of each comment is calculated and added as a new column in the data frame.


Finally, you create a bar graph which visually classifies the comments as “Positive”, “Negative” and “Neutral.”

According to the sentiment analysis results, many individuals hold a positive view of the product.
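
To put a number on the bar graph, you could also compute the share of comments in each class. Here is a minimal sketch using only the standard library; the helper name sentiment_shares is hypothetical, and the labels are assumed to be the “Positive”, “Neutral”, and “Negative” strings produced by analyze_sentiment().

```python
from collections import Counter


def sentiment_shares(labels):
    """Return the percentage of comments per sentiment label, rounded to 1 decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * count / total, 1) for label, count in counts.items()}
```

After transform_comments(df) has run, calling sentiment_shares(df["Sentiment"]) returns a dictionary mapping each label to its percentage.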


Which features of the product do people find most appealing?


Next, you’ll want to discover which aspects of the product people talk about. A helpful way to do this is to create a word cloud from the comments, in which the size of each word reflects how often it appears.


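from wordcloud import WordCloud

def generate_word_cloud(df):
    # Join all comments into one text, skipping empty entries
    comments = "\n".join(df["Comments"].dropna().astype(str).tolist())
    return WordCloud().generate(comments)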


This code creates a word cloud from the YouTube comments.


Looking at the word cloud, you can find the features people talk about, apart from common words like iPhone, phone, and Apple. People also mentioned the display, model, camera, battery, and screen.


If you want to focus on more specific insights, you can utilize filters in the Pandas data frame based on exact keywords such as “Camera” or “Battery.” By conducting a sentiment analysis and creating a word cloud from this data, you can uncover insights explicitly tailored to those features.
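
A keyword filter of this kind can be sketched as follows. The helper name filter_by_keyword is hypothetical, and the sample data frame is made up purely to demonstrate the call; with real data you would pass the data frame returned by get_comments().

```python
import pandas as pd


def filter_by_keyword(df, keyword, column="Comments"):
    """Return the rows whose comment text mentions `keyword`, ignoring case."""
    mask = df[column].str.contains(keyword, case=False, na=False, regex=False)
    return df[mask]


# Tiny made-up data frame, used only to demonstrate the filter
sample = pd.DataFrame({
    "Comments": ["The camera is great", "Battery drains fast", "Love the CAMERA bump"],
    "Likes": ["12", "3", "7"],
})
camera_df = filter_by_keyword(sample, "camera")
```

The filtered data frame can then be passed to transform_comments() and generate_word_cloud() to get feature-specific results.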


How Bright Data's Scraping Browser Solves Traditional Web Scraping Challenges

As you may have observed, I would normally have needed additional techniques to overcome the challenges mentioned earlier. Instead, I used Bright Data’s Scraping Browser as my web browser, and it took on all the problematic aspects of the job for me. It has several built-in features that remove the usual obstacles on websites. Here are some of those benefits.


  • Unlimited Browser Sessions: You can launch as many browser sessions as you need on the Bright Data network without any concerns about blocked requests. Furthermore, you have the flexibility to scale the extraction process by running multiple browser sessions simultaneously. This powerful tool empowers you to access the data you need hassle-free, without restrictions or interruptions.


  • Leave Network Infrastructure Worries Behind: You can rely entirely on Bright Data’s network infrastructure for your data retrieval needs. That way, you can focus on the web scraping process without worrying about server allocation and maintenance.


  • Proxy Management: Built-in proxy management draws on four different types of IPs (including residential IPs) and rotates the IP address automatically. This helps web scraping run smoothly by handling websites’ bot detection measures and avoiding geolocation restrictions and rate limiting.


  • Robust Unlocker Mechanism: The Scraping Browser makes use of Bright Data’s powerful unlocker infrastructure to bypass even the most complex bot detection measures; from handling CAPTCHAs to device fingerprint emulation to managing header information and cookies, it takes care of it all.


  • Integration with Existing Libraries: Bright Data provides excellent integration support for existing Python libraries. With the Scraping Browser, configuring the browser connection is all you need to do without any modifications to the rest of your script.
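
To illustrate the point about running multiple browser sessions simultaneously, here is a minimal asyncio sketch. It assumes a coroutine scrape_one(url) that you would write yourself, for example a variant of get_comments() that accepts the video URL as a parameter; both scrape_many and scrape_one are hypothetical names, and the default limit of three sessions is arbitrary.

```python
import asyncio


async def scrape_many(urls, scrape_one, max_sessions=3):
    """Run scrape_one(url) for every URL, with at most max_sessions at a time."""
    semaphore = asyncio.Semaphore(max_sessions)

    async def limited(url):
        async with semaphore:
            return await scrape_one(url)

    # gather() returns the results in the same order as the input URLs
    return await asyncio.gather(*(limited(url) for url in urls))
```

The semaphore caps how many sessions are open at once, so you can tune the limit to whatever your plan and machine comfortably handle.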


Conclusion

By utilizing Bright Data’s Scraping Browser in combination with Python, you can gather valuable information about customers, products, and the market, allowing your business to use data-driven strategies and informed decision-making in a scalable and cost-effective manner.