Aherobo Ovie Victor 80ee708348 feat: Implement stock listing extraction and database population

- Added `extract_listings.py` for extracting stock listings from TSX, TSXV, CSE, and CBOE using Playwright.
- Created `main.py` to orchestrate the entire stock intelligence system, including extraction, database import, financial scraping, news scraping, and report generation.
- Developed `populate_database.py` to populate the database with existing JSON data.
- Introduced `scrape_nasdaq_tsx_only.py` for focused scraping of NASDAQ and TSX stocks.
- Added `setup.py` for initial setup and testing of the system.
- Created `watchlist.txt` template for user-defined stock tracking.
- Generated `final_test_output.txt` to log the results of the test run.

2025-11-06 12:34:01 +01:00


Stock Intelligence System - Setup & Usage Guide

📋 Overview

This system automatically collects comprehensive data on publicly traded stocks listed on the TSX, TSXV, CSE, and CBOE exchanges. It requires no API keys; all data is collected by web scraping.

🎯 What This System Does

For each stock on the target exchanges, it:

Extracts company listings with ticker, name, sector, industry
Scrapes financial data from Yahoo Finance (3 years + TTM)
Collects news articles from Google News (last 12 months)
Gathers press releases from GlobeNewswire, Newswire.ca, etc.
Stores everything in a SQLite database
Generates text reports for each stock

🚀 Getting Started

Step 1: Install Dependencies

cd /Users/macbook/Desktop/Victor

# Install Python packages
pip install -r requirements.txt

# Install Playwright browser
python3 -m playwright install chromium

Step 2: Test the Setup

# Run a quick test on a few stocks
python test_extraction.py

This will:

Extract CSE listings
Show you what data was captured
Save HTML for debugging if needed

Step 3: Run the Full Pipeline

# Test mode (processes only 5 stocks)
python main.py

# Full pipeline (all stocks - takes hours!)
python main.py --full

📊 Output Structure

After running, you'll have:

data/
├── listings/
│   ├── tsx_tsxv_listings.json      # TSX/TSXV stocks
│   ├── cse_listings.json           # CSE stocks
│   ├── cboe_listings.json          # CBOE stocks
│   └── all_listings_combined.json  # All stocks combined
│
├── financials/
│   ├── TICKER1_yahoo.json          # Financial data per stock
│   ├── TICKER2_yahoo.json
│   └── ...
│
├── news/
│   ├── TICKER1_news_pr.json        # News & press releases
│   ├── TICKER2_news_pr.json
│   └── ...
│
├── reports/
│   ├── TICKER1_report.txt          # Human-readable report
│   ├── TICKER2_report.txt
│   └── ...
│
└── stocks.db                        # SQLite database with all data
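
A quick way to confirm that a run produced the layout above is to count the files in each folder. The following is a minimal sketch, not part of the project; the helper name `summarize_output` is hypothetical, and the glob patterns are assumptions taken from the file names shown in the tree.

```python
from pathlib import Path


def summarize_output(data_dir="data"):
    """Count output files per subdirectory of the data folder.

    The patterns mirror the file names in the tree above (assumed, not
    read from the project's code).
    """
    root = Path(data_dir)
    patterns = {
        "listings": "*.json",
        "financials": "*_yahoo.json",
        "news": "*_news_pr.json",
        "reports": "*_report.txt",
    }
    counts = {}
    for subdir, pattern in patterns.items():
        folder = root / subdir
        # A missing folder simply counts as zero files.
        counts[subdir] = len(list(folder.glob(pattern))) if folder.is_dir() else 0
    counts["database_exists"] = (root / "stocks.db").is_file()
    return counts


if __name__ == "__main__":
    for name, value in summarize_output().items():
        print(f"{name}: {value}")
```

If `financials` or `news` shows far fewer files than `listings` contains stocks, the corresponding scraping step probably stopped early or skipped stocks.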

🔧 Individual Components

You can run each step separately:

1. Extract Listings Only

python extract_listings.py

This will:

Open a browser window (non-headless for debugging)
Navigate to each exchange
Extract all stock listings
Save to data/listings/

2. Import to Database

python database.py

This will:

Create the SQLite database schema
Import listings from JSON files
Show statistics

3. Scrape Financials

python scrape_yahoo_finance.py

This will:

Load stocks from listings
Scrape Yahoo Finance for each
Save financial data to JSON
Takes ~2-3 seconds per stock

4. Scrape News & Press Releases

python scrape_news_pr.py

This will:

Search Google News for each stock
Search press release sites
Save all articles/releases
Takes ~10-15 seconds per stock
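
The per-stock timings for steps 3 and 4 give a rough idea of how long a full run takes. The sketch below only does that arithmetic; the function name is hypothetical, and it ignores listing extraction, database import, and report generation.

```python
def estimate_runtime_hours(n_stocks, financials_s=(2, 3), news_s=(10, 15)):
    """Return a (low, high) estimate in hours for scraping n_stocks.

    Defaults use the per-stock ranges quoted above: 2-3 s for financials
    and 10-15 s for news and press releases.
    """
    low = n_stocks * (financials_s[0] + news_s[0]) / 3600
    high = n_stocks * (financials_s[1] + news_s[1]) / 3600
    return low, high


if __name__ == "__main__":
    low, high = estimate_runtime_hours(1000)
    print(f"1000 stocks: about {low:.1f} to {high:.1f} hours")
```

For 1,000 stocks this comes out to roughly 3.3 to 5 hours, which is why the full pipeline is best left running unattended.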

📝 Understanding the Data

Stock Listings Format

{
  "symbol": "ABC",
  "name": "ABC Company Inc.",
  "exchange": "TSX",
  "sector": "Technology",
  "industry": "Software",
  "extracted_at": "2025-11-06T10:30:00"
}
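
As an illustration of working with records in this format, the sketch below counts listings per exchange. It assumes that `all_listings_combined.json` holds a JSON array of such records, which this guide does not confirm; both function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_listings(path="data/listings/all_listings_combined.json"):
    """Load listing records (assumed to be stored as a JSON array)."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def count_by_exchange(listings):
    """Count records per exchange; records without the key count as UNKNOWN."""
    return Counter(item.get("exchange", "UNKNOWN") for item in listings)


if __name__ == "__main__":
    for exchange, count in count_by_exchange(load_listings()).most_common():
        print(f"{exchange}: {count}")
```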

Financial Data Format

{
  "ticker": "ABC",
  "yahoo_ticker": "ABC.TO",
  "profile": {
    "current_price": 25.50
  },
  "statistics": {
    "market_cap": "500M",
    "pe_ratio": "15.2",
    "forward_eps": "1.68"
  },
  "financials": {
    "total_revenue": ["1.2B", "1.1B", "1.0B"],
    "net_income": ["120M", "110M", "100M"]
  }
}
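
The monetary values above are stored as abbreviated strings such as "500M" and "1.2B", so they need converting before any arithmetic. The following is a minimal sketch with hypothetical helper names. It assumes the suffixes K, M, B, and T, and it assumes each list is ordered newest first, which the example suggests but does not state.

```python
_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_abbreviated(value):
    """Convert strings such as '500M' or '1.2B' to a float."""
    text = str(value).strip().replace(",", "")
    if not text:
        raise ValueError("empty value")
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS:
        return float(text[:-1]) * _MULTIPLIERS[suffix]
    return float(text)


def yoy_growth(series):
    """Growth between consecutive periods, assuming the newest value is first.

    Returns None for a period whose earlier value is zero.
    """
    nums = [parse_abbreviated(v) for v in series]
    return [
        (newer - older) / older if older else None
        for newer, older in zip(nums, nums[1:])
    ]


if __name__ == "__main__":
    for rate in yoy_growth(["1.2B", "1.1B", "1.0B"]):
        print(f"{rate:.1%}")
```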

News Data Format

{
  "ticker": "ABC",
  "news_articles": [
    {
      "title": "ABC Company Reports Strong Earnings",
      "source": "Financial Post",
      "date": "2 days ago",
      "url": "https://...",
      "snippet": "..."
    }
  ],
  "press_releases": [
    {
      "title": "ABC Announces New Product Line",
      "source": "GlobeNewswire",
      "date": "Nov 1, 2025",
      "url": "https://..."
    }
  ]
}
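
The `date` field mixes relative strings ("2 days ago") and absolute ones ("Nov 1, 2025"), so sorting or filtering by date needs a normalization step. The sketch below handles only those two shapes and returns `None` for anything else. The function name is hypothetical, and the set of formats the scrapers actually emit is an assumption.

```python
import re
from datetime import datetime, timedelta

# Matches strings like "2 days ago" or "1 hour ago".
_RELATIVE = re.compile(r"^(\d+)\s+(minute|hour|day|week)s?\s+ago$", re.IGNORECASE)


def parse_article_date(text, now=None):
    """Return a datetime for a scraped date string, or None if unrecognized."""
    now = now or datetime.now()
    text = text.strip()
    match = _RELATIVE.match(text)
    if match:
        amount, unit = int(match.group(1)), match.group(2).lower()
        # timedelta accepts minutes=, hours=, days=, and weeks=.
        return now - timedelta(**{unit + "s": amount})
    try:
        # Absolute form such as "Nov 1, 2025" (English month abbreviations).
        return datetime.strptime(text, "%b %d, %Y")
    except ValueError:
        return None
```

Relative dates are resolved against the time of parsing rather than the time of scraping, so results are most accurate when the conversion runs shortly after the data is collected.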

🗄️ Database Schema

The SQLite database (data/stocks.db) contains:

Main Tables

stocks_master - All stock information
financial_statements - Income statement, balance sheet, cash flow
financial_metrics - Calculated ratios (P/E, ROE, etc.)
news_articles - News from various sources
press_releases - Official company releases
filings - Regulatory filings (SEDAR+, SEC)
agm_info - Annual general meeting details
tax_disclosures - Tax-related information
coverage_report - Tracks data completeness per stock
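
Column names are not listed in this guide, so it helps to inspect the schema directly before writing queries. The sketch below uses only SQLite's own catalog (`sqlite_master` and `pragma_table_info`); the helper names are hypothetical.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def table_columns(conn, table):
    """Return the column names of one table, in definition order."""
    rows = conn.execute("SELECT name FROM pragma_table_info(?)", (table,)).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = sqlite3.connect("data/stocks.db")
    for table in list_tables(conn):
        print(table, table_columns(conn, table))
    conn.close()
```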

Query Examples

import sqlite3

# Connect to the database
conn = sqlite3.connect('data/stocks.db')
cursor = conn.cursor()

# Get all TSX stocks (the exchange is passed as a bound parameter)
cursor.execute(
    "SELECT symbol, company_name FROM stocks_master WHERE exchange = ?",
    ("TSX",),
)
stocks = cursor.fetchall()

# Get stocks that have both financials and news
cursor.execute("""
    SELECT ticker, exchange
    FROM coverage_report
    WHERE has_financials = 1 AND has_news = 1
""")
complete_stocks = cursor.fetchall()

# Get all news for a specific stock
cursor.execute("""
    SELECT title, source, published_date
    FROM news_articles
    WHERE stock_id = (SELECT id FROM stocks_master WHERE symbol = ?)
""", ("ABC",))
news = cursor.fetchall()

# Close the connection when finished
conn.close()

⚠️ Important Considerations

Rate Limiting

Scripts pause 2-5 seconds between requests
The delays keep the load on the source websites low
A full run over all exchanges may take several hours
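
A delay of this kind can be sketched as a randomized pause within the 2-5 second range. This is an illustration only, not the project's implementation; the name `polite_sleep` and the injectable `sleep` argument are assumptions added to make the helper easy to test.

```python
import random
import time


def polite_sleep(min_s=2.0, max_s=5.0, sleep=time.sleep):
    """Pause for a random duration between min_s and max_s seconds.

    Randomizing the delay avoids hitting a site at a perfectly regular
    rhythm. Returns the delay that was used.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

A scraper would call `polite_sleep()` after each request; widening the range is the simplest adjustment if a site starts refusing requests.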

Data Quality

Not all stocks have Yahoo Finance data (especially microcaps)
News availability varies by stock
Some press releases may be behind paywalls

Troubleshooting

Problem: No listings extracted

Check data/listings/*_page.html files
Websites may have changed their structure
May need to update selectors in extract_listings.py

Problem: Yahoo Finance returns errors

Stock may not be listed on Yahoo
Try different ticker suffix (.TO vs .V for Canadian stocks)
Check data/financials/*_yahoo.json for error messages
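
Trying alternative suffixes can be automated by generating candidate Yahoo tickers for a symbol. The sketch below uses the three suffixes named in this guide (.TO, .V, .CN). The exchange-to-suffix mapping and the function name are assumptions, and no suffix is assumed for CBOE-listed stocks.

```python
# Assumed mapping from exchange to Yahoo Finance suffix.
SUFFIX_BY_EXCHANGE = {"TSX": ".TO", "TSXV": ".V", "CSE": ".CN"}


def yahoo_candidates(symbol, exchange):
    """Return candidate Yahoo tickers, most likely suffix first."""
    base = symbol.strip().upper()
    primary = SUFFIX_BY_EXCHANGE.get(exchange.strip().upper())
    # Put the suffix for the stock's own exchange first, then the others.
    suffixes = ([primary] if primary else []) + [
        s for s in SUFFIX_BY_EXCHANGE.values() if s != primary
    ]
    return [base + s for s in suffixes]


if __name__ == "__main__":
    print(yahoo_candidates("abc", "TSXV"))
```

Each candidate can then be tried in order until Yahoo Finance returns data for one of them.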

Problem: News scraping fails

Google may rate-limit searches
Increase delays in scrape_news_pr.py
Consider running in smaller batches
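
Smaller batches and longer delays can be combined: process a few tickers at a time, and wait progressively longer after each failed attempt. The helpers below are a hypothetical sketch of that idea and do not reflect how `scrape_news_pr.py` is written; the default delay values are illustrative.

```python
def chunked(items, size):
    """Yield consecutive batches of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def backoff_delays(base=10.0, factor=2.0, retries=4):
    """Return increasing wait times (in seconds) for successive retries."""
    return [base * factor ** attempt for attempt in range(retries)]


if __name__ == "__main__":
    tickers = ["AAA", "BBB", "CCC", "DDD", "EEE"]
    for batch in chunked(tickers, 2):
        print("batch:", batch)
    print("retry waits:", backoff_delays())
```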

🔄 Automation & Scheduling

To run this automatically on a schedule:

Using cron (macOS/Linux)

# Edit crontab
crontab -e

# Run every week on Sunday at 2 AM
0 2 * * 0 cd /Users/macbook/Desktop/Victor && /usr/local/bin/python3 main.py --full >> /tmp/stock_scraper.log 2>&1

Manual Scheduling

# Create a shell script
cat > run_scraper.sh << 'EOF'
#!/bin/bash
cd /Users/macbook/Desktop/Victor
python3 main.py --full
EOF

chmod +x run_scraper.sh

# Run it whenever needed
./run_scraper.sh

📈 Next Steps

After collecting the data, you can:

Analyze - Use pandas/numpy to analyze financial trends
Visualize - Create charts with matplotlib/plotly
Screen - Filter stocks by criteria (P/E ratio, revenue growth, etc.)
Monitor - Track specific stocks over time
Export - Generate Excel reports, CSV exports, etc.
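
As one example of the export step, the sketch below writes the result of any query to a CSV file using only the standard library. The function name is hypothetical, and the table and column names in the usage example are taken from the query examples earlier in this guide.

```python
import csv
import sqlite3


def export_query_to_csv(conn, query, out_path, params=()):
    """Run a query and write its rows, with a header line, to a CSV file.

    Returns the number of data rows written.
    """
    cursor = conn.execute(query, params)
    headers = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect("data/stocks.db")
    count = export_query_to_csv(
        conn,
        "SELECT symbol, company_name FROM stocks_master WHERE exchange = ?",
        "tsx_stocks.csv",
        ("TSX",),
    )
    print(f"Wrote {count} rows")
    conn.close()
```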

🐛 Known Issues

Dynamic content - Some exchange sites may change their layouts
Ticker formats - Canadian stocks have different suffixes (.TO, .V, .CN)
Rate limits - Google/Yahoo may temporarily block if too aggressive
JavaScript rendering - Sites may block headless browsers

💡 Tips

Start with test mode to verify everything works
Run during off-peak hours to avoid rate limits
Check coverage_report table to see data completeness
Save HTML files for debugging when extraction fails
Use database queries for efficient data analysis
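
To check data completeness in one step, the `coverage_report` table can be summarized with a single query. This is a minimal sketch that assumes the `has_financials` and `has_news` columns used in the query examples; the function name is hypothetical.

```python
import sqlite3


def coverage_summary(conn):
    """Return total stocks, stocks with financials and news, and their share."""
    total, complete = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(has_financials = 1 AND has_news = 1), 0)
        FROM coverage_report
        """
    ).fetchone()
    share = complete / total if total else 0.0
    return {"total": total, "complete": complete, "share": share}


if __name__ == "__main__":
    conn = sqlite3.connect("data/stocks.db")
    print(coverage_summary(conn))
    conn.close()
```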

📞 Support

Check these files for more info:

PROGRESS.md - Current implementation status
README.md - Original technical plan
Individual script files have detailed comments

Last Updated: November 6, 2025
