Aherobo Ovie Victor 80ee708348 feat: Implement stock listing extraction and database population

- Added `extract_listings.py` for extracting stock listings from TSX, TSXV, CSE, and CBOE using Playwright.
- Created `main.py` to orchestrate the entire stock intelligence system, including extraction, database import, financial scraping, news scraping, and report generation.
- Developed `populate_database.py` to populate the database with existing JSON data.
- Introduced `scrape_nasdaq_tsx_only.py` for focused scraping of NASDAQ and TSX stocks.
- Added `setup.py` for initial setup and testing of the system.
- Created `watchlist.txt` template for user-defined stock tracking.
- Generated `final_test_output.txt` to log the results of the test run.

2025-11-06 12:34:01 +01:00

6.6 KiB


🎯 WHAT I'VE DONE - Summary

Current Status

✅ Your scraping project has been upgraded and is ready to run!

What Was Wrong Before

Your initial Scrapy spider scraped the static HTML from exchange websites, but:

The actual stock listing data loads via JavaScript after page load
Your cleaned text files only contained navigation menus, not the actual stock data
You needed a way to extract the dynamic content and structure it

What I've Built For You

5 New Python Scripts

extract_listings.py (281 lines)
- Uses Playwright to wait for JavaScript to load
- Extracts actual stock data from TSX/TSXV, CSE, CBOE
- Saves structured JSON files with all tickers
database.py (279 lines)
- Complete SQLite database schema
- Tables for stocks, financials, metrics, news, filings, etc.
- Import/export functions
- Coverage tracking
scrape_yahoo_finance.py (259 lines)
- Scrapes Yahoo Finance for each stock
- Gets price, market cap, financials, statistics
- No API key needed!
- Handles Canadian ticker formats (.TO, .V)
scrape_news_pr.py (346 lines)
- Scrapes Google News for articles
- Scrapes GlobeNewswire and Newswire.ca for press releases
- Last 12 months of coverage
- No API keys needed
main.py (309 lines)
- Orchestrates the entire pipeline
- Runs all steps in sequence
- Generates final text reports
- Tracks progress and errors
- Test mode for quick validation

Supporting Files

test_extraction.py - Quick test script
requirements.txt - All Python dependencies
GUIDE.md - Complete usage guide (you're reading part of it!)
PROGRESS.md - Project progress tracker
Updated README.md - With implementation status

How To Use It

Quick Start (3 commands)

# 1. Install dependencies
pip install -r requirements.txt
python3 -m playwright install chromium

# 2. Test it
python test_extraction.py

# 3. Run full pipeline (test mode)
python main.py

What Happens When You Run It

Step 1: Extract Listings (2-3 minutes)

Opens browser windows
Navigates to each exchange
Waits for JavaScript to load
Extracts all stock data
Saves to data/listings/*.json
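
The extraction step above can be sketched roughly as follows. This is a minimal illustration, not the contents of extract_listings.py: the function names, the row selector you pass in, the assumption that the first two cells of a row hold symbol and name, and the headless launch are all assumptions made for the example.

```python
import json
from pathlib import Path


def rows_to_listings(rows, exchange):
    """Turn raw table rows (lists of cell text) into listing dicts."""
    listings = []
    for cells in rows:
        cells = [c.strip() for c in cells]
        if len(cells) < 2 or not cells[0]:
            continue  # skip header rows (no <td> cells) and empty rows
        # Assumption: column 0 is the symbol and column 1 is the company name.
        listings.append({"symbol": cells[0], "name": cells[1], "exchange": exchange})
    return listings


def fetch_rendered_rows(url, row_selector, timeout_ms=30_000):
    """Load a page in Chromium and return the cell text of every matching row."""
    # Imported lazily so the pure helpers work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Block until the JavaScript-rendered rows exist in the DOM.
            page.wait_for_selector(row_selector, timeout=timeout_ms)
            return page.eval_on_selector_all(
                row_selector,
                "rows => rows.map(r => Array.from(r.querySelectorAll('td'))"
                ".map(td => td.innerText))",
            )
        finally:
            browser.close()


def save_listings(listings, path):
    """Write the listings as pretty-printed JSON, creating folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(listings, indent=2), encoding="utf-8")
```

Keeping the parsing in `rows_to_listings` separate from the browser code means the row handling can be checked without opening a browser at all.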

Step 2: Import to Database (< 1 minute)

Creates SQLite database
Imports all stocks
Sets up tracking tables
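
As a rough sketch of what this import step involves, the example below creates two tables and loads listing records into them. The table and column names are assumptions chosen to match the example JSON later in this document; the actual schema in database.py has more tables and may differ.

```python
import sqlite3

# Hypothetical, simplified schema: one table of stocks, one for coverage tracking.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stocks (
    symbol   TEXT NOT NULL,
    exchange TEXT NOT NULL,
    name     TEXT,
    sector   TEXT,
    industry TEXT,
    PRIMARY KEY (symbol, exchange)
);
CREATE TABLE IF NOT EXISTS coverage (
    symbol         TEXT NOT NULL,
    exchange       TEXT NOT NULL,
    has_financials INTEGER NOT NULL DEFAULT 0,
    has_news       INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (symbol, exchange)
);
"""


def init_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def import_listings(conn, listings):
    """Insert or update each listing; return the total number of stocks stored."""
    with conn:  # commits on success, rolls back on error
        for item in listings:
            conn.execute(
                "INSERT OR REPLACE INTO stocks "
                "(symbol, exchange, name, sector, industry) VALUES (?, ?, ?, ?, ?)",
                (item["symbol"], item["exchange"], item.get("name"),
                 item.get("sector"), item.get("industry")),
            )
            # Keep any existing coverage flags if the stock was imported before.
            conn.execute(
                "INSERT OR IGNORE INTO coverage (symbol, exchange) VALUES (?, ?)",
                (item["symbol"], item["exchange"]),
            )
    return conn.execute("SELECT COUNT(*) FROM stocks").fetchone()[0]
```

Using the (symbol, exchange) pair as the key lets the import be re-run without creating duplicates.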

Step 3: Scrape Financials (varies by # of stocks)

For each stock: visits Yahoo Finance
Extracts price, market cap, financials
Saves to data/financials/*.json
Updates database
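
Yahoo Finance looks up Canadian stocks by a suffixed ticker, so each symbol has to be converted before the visit. A minimal sketch of that conversion is shown below. Only the `.TO` and `.V` suffixes come from this document; the `.CN` and `.NE` entries, the dot-to-hyphen replacement for share classes, and the function name are assumptions that should be checked against the real pages.

```python
# Exchange code -> Yahoo Finance suffix.
YAHOO_SUFFIX = {
    "TSX": ".TO",   # from this guide
    "TSXV": ".V",   # from this guide
    "CSE": ".CN",   # assumption
    "CBOE": ".NE",  # assumption
}


def to_yahoo_ticker(symbol, exchange):
    """Build the ticker string used to look a stock up on Yahoo Finance."""
    suffix = YAHOO_SUFFIX.get(exchange.upper())
    if suffix is None:
        raise ValueError(f"No Yahoo suffix known for exchange {exchange!r}")
    # Assumption: share-class dots such as "BBD.B" are written with a hyphen.
    return symbol.strip().upper().replace(".", "-") + suffix
```

For example, `to_yahoo_ticker("CVV", "TSXV")` gives `"CVV.V"`.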

Step 4: Scrape News & PR (varies by # of stocks)

Searches Google News for each stock
Searches press release sites
Saves to data/news/*.json
Updates database
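
Limiting coverage to the last 12 months comes down to a date filter on the scraped articles. The sketch below assumes each article is a dict with a `date` string in the "Oct 15, 2025" style used in the example report; the function name and the 365-day cutoff are illustrative choices, and scrape_news_pr.py may handle dates differently.

```python
from datetime import datetime, timedelta


def within_last_year(articles, now=None, date_format="%b %d, %Y"):
    """Keep only articles whose date falls in the 365 days before `now`."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=365)
    kept = []
    for article in articles:
        try:
            published = datetime.strptime(article["date"], date_format)
        except (KeyError, ValueError):
            continue  # missing or unparseable date: leave the article out
        if cutoff <= published <= now:
            kept.append(article)
    return kept
```

Articles with a missing or unreadable date are dropped here; keeping them and flagging them instead would be an equally reasonable choice.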

Step 5: Generate Reports (< 1 minute)

Creates text file for each stock
Combines all data sources
Saves to data/reports/*.txt
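
A report of this kind can be assembled with plain string formatting. The sketch below follows the layout of the example report shown later in this document; the function name, the input shapes, and the 70-character rule width are assumptions rather than the code in main.py.

```python
from datetime import datetime

RULE = "=" * 70
THIN = "-" * 70


def build_report(stock, financials, articles, generated=None):
    """Combine listing info, financial sections and news into one text report."""
    generated = generated or datetime.now()
    lines = [
        RULE,
        f"STOCK INTELLIGENCE REPORT: {stock['symbol']}",
        RULE,
        f"Company: {stock['name']}",
        f"Exchange: {stock['exchange']}",
        f"Generated: {generated:%Y-%m-%d %H:%M:%S}",
        RULE,
        "",
        "[FINANCIAL DATA]",
        THIN,
    ]
    for section, values in financials.items():
        if not isinstance(values, dict):
            continue  # skip scalar fields such as "ticker"
        lines.append(f"{section.capitalize()}:")
        for key, value in values.items():
            lines.append(f"  {key}: {value}")
    lines += ["", "[NEWS ARTICLES - Last 12 Months]", THIN]
    for article in articles:
        lines += [
            f"Title: {article['title']}",
            f"Source: {article['source']}",
            f"Date: {article['date']}",
            "",
        ]
    return "\n".join(lines)
```

Writing the returned string to a file under data/reports/ would complete the step.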

File Structure After Running

Victor/
├── 📄 main.py                 ← Run this!
├── 📄 test_extraction.py      ← Or test with this
├── 📄 requirements.txt
├── 📄 GUIDE.md               ← Full documentation
├── 📄 PROGRESS.md
├── 📄 README.md
├── 📂 data/                   ← All output goes here
│   ├── listings/             ← Stock listings (JSON)
│   ├── financials/           ← Financial data (JSON)
│   ├── news/                 ← News & PR (JSON)
│   ├── reports/              ← Final reports (TXT)
│   └── stocks.db             ← SQLite database
├── 📂 scrap/                  ← Your original Scrapy project
└── 📂 cleaned_text/           ← Your original cleaned HTML

Key Features

✅ No API Keys - Pure web scraping
✅ Canadian Exchanges - TSX, TSXV, CSE, CBOE
✅ Comprehensive Data - Financials, news, press releases
✅ SQLite Database - Structured storage
✅ Text Reports - Human-readable output
✅ Progress Tracking - Know what's covered
✅ Error Handling - Continues even if some stocks fail
✅ Rate Limiting - Respectful of servers
✅ Test Mode - Verify before full run
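
The error handling and rate limiting features can be pictured as a loop like the one below. This is only a sketch of the pattern: the function name, the two-second default delay, and the `scrape_one` callback are hypothetical, and the real scripts may pace and log their requests differently.

```python
import time


def run_for_all(tickers, scrape_one, delay_seconds=2.0, sleep=time.sleep):
    """Call scrape_one for each ticker, pausing between calls and
    collecting failures instead of stopping at the first one."""
    results, errors = {}, {}
    for index, ticker in enumerate(tickers):
        if index > 0:
            sleep(delay_seconds)  # pause between requests to go easy on the server
        try:
            results[ticker] = scrape_one(ticker)
        except Exception as exc:  # one bad stock should not end the whole run
            errors[ticker] = str(exc)
    return results, errors
```

Passing `sleep` as a parameter makes the pacing easy to replace in a quick test run.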

Example Output

After running, you'll have files like:

data/listings/all_listings_combined.json

[
  {
    "symbol": "CVV",
    "name": "CanAlaska Uranium Ltd.",
    "exchange": "TSXV",
    "sector": "Materials",
    "industry": "Mining"
  },
  ...
]

data/financials/CVV_yahoo.json

{
  "ticker": "CVV",
  "profile": {
    "current_price": 0.85
  },
  "statistics": {
    "market_cap": "25M",
    "pe_ratio": "N/A"
  }
}

data/reports/CVV_report.txt

======================================================================
STOCK INTELLIGENCE REPORT: CVV
======================================================================
Company: CanAlaska Uranium Ltd.
Exchange: TSXV
Generated: 2025-11-06 10:30:00
======================================================================

[FINANCIAL DATA]
----------------------------------------------------------------------
Profile:
  current_price: 0.85

[NEWS ARTICLES - Last 12 Months]
----------------------------------------------------------------------
Title: CanAlaska Announces Drilling Results
Source: GlobeNewswire
Date: Oct 15, 2025
...

Next Steps For You

Run the test: python test_extraction.py
Check if it works: Look in data/listings/ for JSON files
If successful: Run python main.py for full pipeline
If issues: Check the HTML files saved in data/listings/ to debug

Troubleshooting

Problem: "No listings extracted"

The exchange websites may have changed
Check data/listings/*_page.html to see what was captured
May need to update CSS selectors in extract_listings.py
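
One quick way to inspect a captured page is to count the HTML tags it contains: a page with almost no table rows probably did not finish loading, or the site has switched to a different layout. The helper below is a hypothetical debugging aid built on the standard library and is not part of the project's scripts.

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path


class TagCounter(HTMLParser):
    """Counts how often each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts


if __name__ == "__main__":
    # Summarize every captured page in data/listings/.
    for page in sorted(Path("data/listings").glob("*_page.html")):
        counts = count_tags(page.read_text(encoding="utf-8", errors="replace"))
        print(page.name, "tr:", counts["tr"], "td:", counts["td"])
```

If the row counts are healthy but nothing was extracted, the selectors in extract_listings.py are the more likely culprit.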

Problem: "playwright not found"

python3 -m playwright install chromium

Problem: "Module not found"

pip install -r requirements.txt

Time Estimates

Setup: 5 minutes
Test run (5 stocks): 2-3 minutes
Full pipeline (all stocks): Several hours depending on # of stocks

What This Gives You

For each stock, you'll have:

✅ Company info (name, ticker, sector, industry)
✅ Financial data (price, market cap, ratios)
✅ News articles (last 12 months)
✅ Press releases (official announcements)
✅ Structured database
✅ Text report

The Difference From Before

Before (your original code):

Scraped static HTML
Got navigation menus
No actual stock data

After (new code):

Waits for JavaScript
Extracts actual stock listings
Gets financial data
Gets news & press releases
Generates comprehensive reports

YOU'RE ALL SET! Run python test_extraction.py to get started! 🚀
