
How to Build a No-Limits Stock Market Scraper with Python

by Sushma Kukkadapu, February 20th, 2025

Too Long; Didn't Read

Traditional stock market APIs come with rate limits and high costs, so I built my own web scraper using Python. By extracting data from Yahoo Finance and CNN Money, I bypassed restrictions while maintaining flexibility. This guide covers setup, handling challenges like rate limiting and data validation, and future plans for AI-powered stock analysis.



As a software engineer who works extensively with financial data, I recently hit a wall with traditional stock market APIs. After getting frustrated with rate limits and expensive subscriptions, I decided to build my own solution using web scraping. Here's how I did it, and what I learned along the way.

Introduction: Why I Needed a Different Approach

My breaking point came during a personal project where I was trying to analyze market trends. I kept hitting Yahoo Finance's API rate limits, and Bloomberg Terminal's pricing made me laugh out loud - there was no way I could justify that cost for a side project. I needed something that would let me:


  • Fetch data without arbitrary limits
  • Get real-time prices and trading volumes
  • Access historical data without paying premium fees
  • Scale up my analysis as needed

The Web Scraping Solution

After some research and experimentation, I settled on scraping data from two main sources: CNN Money for trending stocks and Yahoo Finance for detailed metrics. Here's how I built it:

Setting Up the Basic Infrastructure

First, I installed the essential tools:

pip install requests beautifulsoup4


Then I created a basic scraper that could handle network issues gracefully:

import requests
from bs4 import BeautifulSoup
import time
import logging

def make_request(url, max_retries=3):
    # A browser-like User-Agent avoids the default python-requests signature,
    # which many sites block outright
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Treat HTTP errors (403, 500, ...) as failures
            return response
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                raise
            logging.warning("Request to %s failed (%s), retrying", url, e)
            time.sleep(attempt + 1)  # Simple linear backoff: 1s, then 2s
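
For reference, a quick smoke test of the helper looks like this (it reuses the imports above; the URL is the same page scraped in the next step):

# Configure logging so retry warnings are visible, then fetch a page
logging.basicConfig(level=logging.INFO)

response = make_request("https://money.cnn.com/data/hotstocks/index.html")
print(response.status_code, len(response.text))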


Scraping CNN Money's Hot Stocks

I started with CNN Money's hot stocks list, which gives me three categories of stocks to track:

def get_trending_stocks():
    url = 'https://money.cnn.com/data/hotstocks/index.html'
    response = make_request(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # One table per category, in page order
    tables = soup.find_all("table", {"class": "wsod_dataTable wsod_dataTableBigAlt"})
    categories = ["Most Actives", "Gainers", "Losers"]

    stocks = []
    for category, table in zip(categories, tables):
        for row in table.find_all("tr")[1:]:  # Skip the header row
            cells = row.find_all("td")
            if not cells:
                continue
            symbol = cells[0].find(string=True)  # First text node is the ticker
            stocks.append({
                'category': category,
                'symbol': symbol.strip() if symbol else '',
                'company': cells[0].span.text.strip() if cells[0].span else ''
            })

    return stocks
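
Calling it returns a list of dicts, which makes the results easy to eyeball (the tickers will of course vary with the market):

# Print the first few trending stocks as a quick sanity check
for stock in get_trending_stocks()[:5]:
    print(f"{stock['category']:>12}  {stock['symbol']:<6} {stock['company']}")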


Getting the Financial Details

For each trending stock, I fetch additional data from Yahoo Finance:

def get_stock_details(symbol):
    url = f"https://finance.yahoo.com/quote/{symbol}"
    response = make_request(url)
    soup = BeautifulSoup(response.text, "html.parser")

    data = {}

    # The main quote table (Yahoo uses atomic CSS class names like "W(100%)")
    table = soup.find("table", {"class": "W(100%)"})
    if table:
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if len(cells) > 1:
                # Each row is a label/value pair, e.g. "Volume" -> "12,345,678"
                key = cells[0].text.strip()
                value = cells[1].text.strip()
                data[key] = value

    return data
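
To sanity-check the output, I print a few fields for a single ticker. The label names here ('Previous Close', 'Open', 'Volume') are simply what Yahoo's table showed at the time, so treat them as assumptions:

# Peek at one quote; which labels appear depends on Yahoo's current markup
details = get_stock_details("AAPL")
for label in ("Previous Close", "Open", "Volume"):
    print(label, "->", details.get(label, "n/a"))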


The Gotchas I Encountered

Building this wasn't all smooth sailing. Here are some real issues I hit and how I solved them:

  1. Rate Limiting: Yahoo Finance started blocking me after too many rapid requests. I added random delays between requests:
import random

time.sleep(random.uniform(1, 3))  # Random delay between 1 and 3 seconds


  2. Data Inconsistencies: Sometimes the scraped data would be malformed. I added validation:
def validate_price(price_str):
    try:
        return float(price_str.replace('$', '').replace(',', ''))
    except (ValueError, AttributeError):  # Handles non-numeric strings and None
        return None


  3. Website Changes: The sites occasionally update their HTML structure. I made my selectors more robust:
# Instead of exact class matches, use partial matches
table = soup.find("table", class_=lambda x: x and 'dataTable' in x)
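
Putting those three fixes together, the enrichment loop I ended up with looks roughly like this - a sketch built on the functions above (enrich_stocks is just an illustrative name, and the 'Previous Close'/'Volume' labels are again assumptions about Yahoo's wording):

def enrich_stocks(stocks):
    # Attach price and volume from Yahoo Finance to each trending stock
    for stock in stocks:
        try:
            details = get_stock_details(stock['symbol'])
        except requests.RequestException:
            logging.warning("Skipping %s after repeated failures", stock['symbol'])
            continue
        stock['price'] = validate_price(details.get('Previous Close', ''))
        stock['volume'] = details.get('Volume')
        time.sleep(random.uniform(1, 3))  # Random delay between requests
    return stocks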


Storing and Using the Data

I keep things simple with CSV storage - it's easy to work with and perfect for my needs:

import csv
from datetime import datetime

def save_stock_data(stocks):
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    # Append mode, so each run adds new rows; each stock dict is expected to
    # carry 'symbol', 'price', and 'volume' (merged in from the detail scrape)
    with open('stock_data.csv', 'a', newline='') as file:
        writer = csv.writer(file)
        for stock in stocks:
            writer.writerow([timestamp, stock['symbol'],
                             stock['price'], stock['volume']])
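
Reading the data back out is just as simple. Here's a sketch using pandas - my choice for exploration, not something the scraper itself needs - assuming the four-column layout written above:

import pandas as pd

# The CSV has no header row, so name the columns explicitly
df = pd.read_csv('stock_data.csv',
                 names=['timestamp', 'symbol', 'price', 'volume'],
                 parse_dates=['timestamp'])
print(df.groupby('symbol')['price'].describe())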


What I Learned

After running this scraper for several weeks, here are my key takeaways:

  1. Web scraping isn't just a hack - it's a viable alternative to expensive APIs when done right.
  2. Building in error handling and logging from the start saves huge headaches later.
  3. Stock data is messy - always validate what you scrape.
  4. Starting simple and iterating works better than trying to build everything at once!


What's Next?

I'm currently working on adding:

  • News sentiment analysis
  • Basic pattern recognition
  • A simple dashboard for visualization


Would you like to see this scraper integrated with machine learning models to predict stock trends? Let me know in the comments!