feat: Implement stock listing extraction and database population

- Added `extract_listings.py` for extracting stock listings from TSX, TSXV, CSE, and CBOE using Playwright.
- Created `main.py` to orchestrate the entire stock intelligence system, including extraction, database import, financial scraping, news scraping, and report generation.
- Developed `populate_database.py` to populate the database with existing JSON data.
- Introduced `scrape_nasdaq_tsx_only.py` for focused scraping of NASDAQ and TSX stocks.
- Added `setup.py` for initial setup and testing of the system.
- Created `watchlist.txt` template for user-defined stock tracking.
- Generated `final_test_output.txt` to log the results of the test run.
2025-11-06 12:34:01 +01:00
parent 389a01cb0a
commit 80ee708348
39 changed files with 8513 additions and 0 deletions
@@ -0,0 +1,334 @@
+# Stock Intelligence System - Setup & Usage Guide
+
+## 📋 Overview
+
+This system automatically collects comprehensive data on publicly traded stocks from TSX, TSXV, CSE, and CBOE exchanges **without requiring any API keys**. All data is collected via web scraping.
+
+## 🎯 What This System Does
+
+For each stock on the target exchanges, it:
+
+1. **Extracts company listings** with ticker, name, sector, industry
+2. **Scrapes financial data** from Yahoo Finance (3 years + TTM)
+3. **Collects news articles** from Google News (last 12 months)
+4. **Gathers press releases** from GlobeNewswire, Newswire.ca, etc.
+5. **Stores everything** in a SQLite database
+6. **Generates text reports** for each stock
+
+## 🚀 Getting Started
+
+### Step 1: Install Dependencies
+
+```bash
+cd /Users/macbook/Desktop/Victor  # adjust to your local project directory
+
+# Install Python packages
+pip install -r requirements.txt
+
+# Install Playwright browser
+python3 -m playwright install chromium
+```
+
+### Step 2: Test the Setup
+
+```bash
+# Run a quick test on a few stocks
+python test_extraction.py
+```
+
+This will:
+- Extract CSE listings
+- Show you what data was captured
+- Save HTML for debugging if needed
+
+### Step 3: Run the Full Pipeline
+
+```bash
+# Test mode (processes 5 stocks)
+python main.py
+
+# Full pipeline (all stocks - takes hours!)
+python main.py --full
+```
+
+## 📊 Output Structure
+
+After running, you'll have:
+
+```
+data/
+├── listings/
+│   ├── tsx_tsxv_listings.json      # TSX/TSXV stocks
+│   ├── cse_listings.json           # CSE stocks
+│   ├── cboe_listings.json          # CBOE stocks
+│   └── all_listings_combined.json  # All stocks combined
+│
+├── financials/
+│   ├── TICKER1_yahoo.json          # Financial data per stock
+│   ├── TICKER2_yahoo.json
+│   └── ...
+│
+├── news/
+│   ├── TICKER1_news_pr.json        # News & press releases
+│   ├── TICKER2_news_pr.json
+│   └── ...
+│
+├── reports/
+│   ├── TICKER1_report.txt          # Human-readable report
+│   ├── TICKER2_report.txt
+│   └── ...
+│
+└── stocks.db                        # SQLite database with all data
+```
+
+## 🔧 Individual Components
+
+You can run each step separately:
+
+### 1. Extract Listings Only
+
+```bash
+python extract_listings.py
+```
+
+This will:
+- Open a browser window (non-headless for debugging)
+- Navigate to each exchange
+- Extract all stock listings
+- Save to `data/listings/`
+
+### 2. Import to Database
+
+```bash
+python database.py
+```
+
+This will:
+- Create the SQLite database schema
+- Import listings from JSON files
+- Show statistics
+
+### 3. Scrape Financials
+
+```bash
+python scrape_yahoo_finance.py
+```
+
+This will:
+- Load stocks from listings
+- Scrape Yahoo Finance for each
+- Save financial data to JSON
+- Take ~2-3 seconds per stock
+
+### 4. Scrape News & Press Releases
+
+```bash
+python scrape_news_pr.py
+```
+
+This will:
+- Search Google News for each stock
+- Search press release sites
+- Save all articles/releases
+- Take ~10-15 seconds per stock
+
+## 📝 Understanding the Data
+
+### Stock Listings Format
+
+```json
+{
+  "symbol": "ABC",
+  "name": "ABC Company Inc.",
+  "exchange": "TSX",
+  "sector": "Technology",
+  "industry": "Software",
+  "extracted_at": "2025-11-06T10:30:00"
+}
+```
+
+### Financial Data Format
+
+```json
+{
+  "ticker": "ABC",
+  "yahoo_ticker": "ABC.TO",
+  "profile": {
+    "current_price": 25.50
+  },
+  "statistics": {
+    "market_cap": "500M",
+    "pe_ratio": "15.2",
+    "forward_eps": "1.68"
+  },
+  "financials": {
+    "total_revenue": ["1.2B", "1.1B", "1.0B"],
+    "net_income": ["120M", "110M", "100M"]
+  }
+}
+```
+
+### News Data Format
+
+```json
+{
+  "ticker": "ABC",
+  "news_articles": [
+    {
+      "title": "ABC Company Reports Strong Earnings",
+      "source": "Financial Post",
+      "date": "2 days ago",
+      "url": "https://...",
+      "snippet": "..."
+    }
+  ],
+  "press_releases": [
+    {
+      "title": "ABC Announces New Product Line",
+      "source": "GlobeNewswire",
+      "date": "Nov 1, 2025",
+      "url": "https://..."
+    }
+  ]
+}
+```
+
+## 🗄️ Database Schema
+
+The SQLite database (`data/stocks.db`) contains:
+
+### Main Tables
+- **stocks_master** - All stock information
+- **financial_statements** - Income statement, balance sheet, cash flow
+- **financial_metrics** - Calculated ratios (P/E, ROE, etc.)
+- **news_articles** - News from various sources
+- **press_releases** - Official company releases
+- **filings** - Regulatory filings (SEDAR+, SEC)
+- **agm_info** - Annual general meeting details
+- **tax_disclosures** - Tax-related information
+- **coverage_report** - Tracks data completeness per stock
+
+### Query Examples
+
+```python
+import sqlite3
+
+# Connect to database
+conn = sqlite3.connect('data/stocks.db')
+cursor = conn.cursor()
+
+# Get all TSX stocks
+cursor.execute("SELECT symbol, company_name FROM stocks_master WHERE exchange='TSX'")
+stocks = cursor.fetchall()
+
+# Get stocks with complete data
+cursor.execute("""
+    SELECT ticker, exchange 
+    FROM coverage_report 
+    WHERE has_financials=1 AND has_news=1
+""")
+complete_stocks = cursor.fetchall()
+
+# Get all news for a specific stock
+cursor.execute("""
+    SELECT title, source, published_date 
+    FROM news_articles 
+    WHERE stock_id = (SELECT id FROM stocks_master WHERE symbol='ABC')
+""")
+news = cursor.fetchall()
+```
+
+## ⚠️ Important Considerations
+
+### Rate Limiting
+- Scripts include delays between requests (2-5 seconds)
+- The delays keep the load on the source sites low
+- As a result, a full run may take several hours
+
+### Data Quality
+- Not all stocks have Yahoo Finance data (especially microcaps)
+- News availability varies by stock
+- Some press releases may be behind paywalls
+
+### Troubleshooting
+
+**Problem: No listings extracted**
+- Check `data/listings/*_page.html` files
+- Websites may have changed their structure
+- May need to update selectors in `extract_listings.py`
+
+**Problem: Yahoo Finance returns errors**
+- Stock may not be listed on Yahoo
+- Try a different ticker suffix (.TO for TSX-listed stocks, .V for TSXV-listed stocks)
+- Check `data/financials/*_yahoo.json` for error messages
+
+**Problem: News scraping fails**
+- Google may rate-limit searches
+- Increase delays in `scrape_news_pr.py`
+- Consider running in smaller batches
+
+## 🔄 Automation & Scheduling
+
+To run this automatically on a schedule:
+
+### Using cron (macOS/Linux)
+
+```bash
+# Edit crontab
+crontab -e
+
+# Run every week on Sunday at 2 AM
+0 2 * * 0 cd /Users/macbook/Desktop/Victor && /usr/local/bin/python3 main.py --full >> /tmp/stock_scraper.log 2>&1
+```
+
+### Manual Runs via a Shell Script
+
+```bash
+# Create a shell script
+cat > run_scraper.sh << 'EOF'
+#!/bin/bash
+cd /Users/macbook/Desktop/Victor || exit 1
+python3 main.py --full
+EOF
+
+chmod +x run_scraper.sh
+
+# Run it whenever needed
+./run_scraper.sh
+```
+
+## 📈 Next Steps
+
+After collecting the data, you can:
+
+1. **Analyze** - Use pandas/numpy to analyze financial trends
+2. **Visualize** - Create charts with matplotlib/plotly
+3. **Screen** - Filter stocks by criteria (P/E ratio, revenue growth, etc.)
+4. **Monitor** - Track specific stocks over time
+5. **Export** - Generate Excel reports, CSV exports, etc.
+
+## 🐛 Known Issues
+
+1. **Dynamic content** - Some exchange sites may change their layouts
+2. **Ticker formats** - Yahoo Finance uses a different suffix per Canadian exchange (.TO for TSX, .V for TSXV, .CN for CSE)
+3. **Rate limits** - Google/Yahoo may temporarily block requests if scraping is too aggressive
+4. **JavaScript rendering** - Sites may block headless browsers
+
+## 💡 Tips
+
+- Start with test mode to verify everything works
+- Run during off-peak hours to avoid rate limits
+- Check the `coverage_report` table to see data completeness
+- Save HTML files for debugging when extraction fails
+- Use database queries for efficient data analysis
+
+## 📞 Support
+
+Check these files for more info:
+- `PROGRESS.md` - Current implementation status
+- `README.md` - Original technical plan
+- Individual script files have detailed comments
+
+---
+
+**Last Updated:** November 6, 2025
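
The guide's "Financial Data Format" and "Ticker formats" notes imply two small normalization steps: mapping an exchange symbol to a Yahoo-style ticker (`ABC` on TSX becomes `ABC.TO`) and turning abbreviated strings such as `"500M"` or `"1.2B"` into numbers before analysis. The sketch below is one possible way to do both; it is not taken from the committed scripts. The helper names `yahoo_ticker` and `parse_abbreviated` are hypothetical, the `.NE` suffix for CBOE listings is an assumption, and so is the dot-to-dash rewrite for share classes.

```python
import re

# Yahoo Finance suffix per exchange. ".TO", ".V" and ".CN" are named in the guide;
# ".NE" for CBOE (Cboe Canada) listings is an assumption.
YAHOO_SUFFIXES = {"TSX": ".TO", "TSXV": ".V", "CSE": ".CN", "CBOE": ".NE"}

# Multipliers for the abbreviated values shown in the financial data format.
_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
_ABBREVIATED = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([KMBT]?)\s*$", re.IGNORECASE)


def yahoo_ticker(symbol: str, exchange: str) -> str:
    """Build a Yahoo-style ticker such as 'ABC.TO' from a listing record."""
    suffix = YAHOO_SUFFIXES.get(exchange.upper(), "")
    # Assumption: share-class dots become dashes on Yahoo ('BBD.B' -> 'BBD-B').
    return symbol.strip().upper().replace(".", "-") + suffix


def parse_abbreviated(value):
    """Convert values like '500M' or '1.2B' to float; return None if unparseable."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _ABBREVIATED.match(str(value).replace(",", ""))
    if match is None:
        return None
    number, unit = match.groups()
    return float(number) * _MULTIPLIERS.get(unit.upper(), 1.0)


if __name__ == "__main__":
    print(yahoo_ticker("ABC", "TSX"))   # ABC.TO
    print(parse_abbreviated("500M"))    # 500000000.0
    print(parse_abbreviated("N/A"))     # None
```

Returning `None` for unparseable values, rather than raising, fits the guide's observation that not every stock has Yahoo Finance data; callers can then skip or flag the missing fields.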