feat: Implement stock listing extraction and database population

- Added `extract_listings.py` for extracting stock listings from TSX, TSXV, CSE, and CBOE using Playwright.
- Created `main.py` to orchestrate the entire stock intelligence system, including extraction, database import, financial scraping, news scraping, and report generation.
- Developed `populate_database.py` to populate the database with existing JSON data.
- Introduced `scrape_nasdaq_tsx_only.py` for focused scraping of NASDAQ and TSX stocks.
- Added `setup.py` for initial setup and testing of the system.
- Created `watchlist.txt` template for user-defined stock tracking.
- Generated `final_test_output.txt` to log the results of the test run.
2025-11-06 12:34:01 +01:00
parent 389a01cb0a
commit 80ee708348
39 changed files with 8513 additions and 0 deletions
@@ -0,0 +1,243 @@
+# 🎯 WHAT I'VE DONE - Summary
+
+## Current Status
+✅ **Your scraping project has been upgraded and is ready to run!**
+
+## What Was Wrong Before
+Your initial Scrapy spider scraped the **static HTML** from exchange websites, but:
+- The actual stock listing data loads via **JavaScript** after page load
+- Your cleaned text files only contained navigation menus, not the actual stock data
+- You needed a way to extract the **dynamic content** and structure it
+
+## What I've Built For You
+
+### 5 New Python Scripts
+
+1. **`extract_listings.py`** (281 lines)
+   - Uses Playwright to wait for JavaScript to load
+   - Extracts actual stock data from TSX/TSXV, CSE, CBOE
+   - Saves structured JSON files with all tickers
+
+2. **`database.py`** (279 lines)
+   - Complete SQLite database schema
+   - Tables for stocks, financials, metrics, news, filings, etc.
+   - Import/export functions
+   - Coverage tracking
+
+3. **`scrape_yahoo_finance.py`** (259 lines)
+   - Scrapes Yahoo Finance for each stock
+   - Gets price, market cap, financials, statistics
+   - No API key needed!
+   - Handles Canadian ticker formats (.TO, .V)
+
+4. **`scrape_news_pr.py`** (346 lines)
+   - Scrapes Google News for articles
+   - Scrapes GlobeNewswire and Newswire.ca for press releases
+   - Last 12 months of coverage
+   - No API keys needed
+
+5. **`main.py`** (309 lines)
+   - Orchestrates the entire pipeline
+   - Runs all steps in sequence
+   - Generates final text reports
+   - Tracks progress and errors
+   - Test mode for quick validation
+
+### Supporting Files
+
+6. **`test_extraction.py`** - Quick test script
+7. **`requirements.txt`** - All Python dependencies
+8. **`GUIDE.md`** - Complete usage guide (you're reading part of it!)
+9. **`PROGRESS.md`** - Project progress tracker
+10. **Updated README.md** - With implementation status
+
+## How To Use It
+
+### Quick Start (3 steps)
+```bash
+# 1. Install dependencies
+pip install -r requirements.txt
+python3 -m playwright install chromium
+
+# 2. Test it
+python test_extraction.py
+
+# 3. Run full pipeline (test mode)
+python main.py
+```
+
+### What Happens When You Run It
+
+**Step 1: Extract Listings** (2-3 minutes)
+- Opens browser windows
+- Navigates to each exchange
+- Waits for JavaScript to load
+- Extracts all stock data
+- Saves to `data/listings/*.json`
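A rough sketch of what Step 1 might look like. This is not the contents of `extract_listings.py`: the `table tbody tr` selector, the 30-second timeout, and the helper names `fetch_rows`, `rows_to_listings`, and `save_listings` are assumptions for illustration. Playwright is imported lazily so the parsing helpers can be used without a browser installed.

```python
import json
from pathlib import Path


def fetch_rows(url: str, row_selector: str = "table tbody tr") -> list[list[str]]:
    """Render a JavaScript-driven page and return the cell text of each table row."""
    # Imported here so the pure helpers below work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Block until the dynamically loaded rows exist in the DOM.
        page.wait_for_selector(row_selector, timeout=30_000)
        rows = [
            [cell.inner_text().strip() for cell in row.query_selector_all("td")]
            for row in page.query_selector_all(row_selector)
        ]
        browser.close()
    return rows


def rows_to_listings(rows, exchange):
    """Turn raw [symbol, name, ...] rows into listing dicts, skipping blank rows."""
    listings = []
    for cells in rows:
        if len(cells) < 2 or not cells[0]:
            continue
        listings.append({"symbol": cells[0], "name": cells[1], "exchange": exchange})
    return listings


def save_listings(listings, path):
    """Write listings as pretty-printed JSON, creating parent folders as needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(listings, indent=2), encoding="utf-8")
```

Keeping the browser step separate from the row parsing means the selector can be adjusted when an exchange changes its page layout without touching the rest of the logic.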
+
+**Step 2: Import to Database** (< 1 minute)
+- Creates SQLite database
+- Imports all stocks
+- Sets up tracking tables
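A minimal sketch of the import step, assuming a single `stocks` table whose columns mirror the `symbol`, `name`, `exchange`, `sector`, and `industry` fields of the listings JSON. The table layout and the `import_listings` helper are hypothetical; the real schema in `database.py` has more tables and may differ.

```python
import sqlite3

# Assumed schema: one row per (symbol, exchange) pair.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stocks (
    symbol   TEXT NOT NULL,
    exchange TEXT NOT NULL,
    name     TEXT,
    sector   TEXT,
    industry TEXT,
    PRIMARY KEY (symbol, exchange)
);
"""


def import_listings(conn, listings):
    """Create the table if needed and upsert listings; returns the row count."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO stocks (symbol, exchange, name, sector, industry) "
        "VALUES (:symbol, :exchange, :name, :sector, :industry)",
        [
            {
                "symbol": item["symbol"],
                "exchange": item["exchange"],
                "name": item.get("name"),
                "sector": item.get("sector"),
                "industry": item.get("industry"),
            }
            for item in listings
        ],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM stocks").fetchone()[0]
```

Calling it with `sqlite3.connect(":memory:")` is a quick way to try the import without creating a database file. `INSERT OR REPLACE` makes re-running the import safe, since an existing stock is overwritten rather than duplicated.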
+
+**Step 3: Scrape Financials** (varies by # of stocks)
+- For each stock: visits Yahoo Finance
+- Extracts price, market cap, financials
+- Saves to `data/financials/*.json`
+- Updates database
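The Canadian ticker handling could be sketched as a small mapping from exchange to Yahoo suffix. Only `.TO` and `.V` are named in this summary; the `.CN` and `.NE` suffixes, the dot-to-hyphen rule for share classes, the quote URL pattern, and the function names are assumptions, not taken from `scrape_yahoo_finance.py`.

```python
# Exchange code -> Yahoo Finance ticker suffix. ".TO" and ".V" come from the
# summary; ".CN" (CSE) and ".NE" (CBOE Canada) are assumptions.
YAHOO_SUFFIX = {"TSX": ".TO", "TSXV": ".V", "CSE": ".CN", "CBOE": ".NE"}


def yahoo_ticker(symbol, exchange):
    """Build a Yahoo-style ticker for a listing on the given exchange."""
    suffix = YAHOO_SUFFIX.get(exchange.upper(), "")
    # Assumed convention: share-class dots become hyphens ("BBD.B" -> "BBD-B").
    return symbol.strip().upper().replace(".", "-") + suffix


def yahoo_quote_url(symbol, exchange):
    """Assumed quote page URL pattern for a ticker."""
    return f"https://finance.yahoo.com/quote/{yahoo_ticker(symbol, exchange)}"
```

Exchanges missing from the mapping fall through with no suffix, which suits US-style tickers but would need checking for any other market.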
+
+**Step 4: Scrape News & PR** (varies by # of stocks)
+- Searches Google News for each stock
+- Searches press release sites
+- Saves to `data/news/*.json`
+- Updates database
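One way the "last 12 months" window might be applied once articles are collected. This is a sketch under assumptions: each article is a dict with a `date` string in the `Oct 15, 2025` style, the window is treated as 365 days, and `within_last_12_months` is a hypothetical helper rather than a function from `scrape_news_pr.py`.

```python
from datetime import datetime, timedelta


def within_last_12_months(articles, now=None, date_format="%b %d, %Y"):
    """Keep articles whose 'date' parses and falls within the last 365 days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=365)
    kept = []
    for article in articles:
        try:
            published = datetime.strptime(article["date"], date_format)
        except (KeyError, TypeError, ValueError):
            continue  # Skip undated or unparseable entries rather than guess.
        if cutoff <= published <= now:
            kept.append(article)
    return kept
```

Month abbreviations are parsed according to the current locale, so this assumes an English locale. Dropping unparseable dates is a design choice; a real scraper might log them instead.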
+
+**Step 5: Generate Reports** (< 1 minute)
+- Creates text file for each stock
+- Combines all data sources
+- Saves to `data/reports/*.txt`
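The report step could be sketched as a pure formatting function that follows the layout of the sample report in this summary. The input shapes, the 70-character rules, and the `render_report` name are assumptions, not the code in `main.py`.

```python
def render_report(stock, financials, articles, generated):
    """Combine listing, financial, and news data into one plain-text report."""
    rule, thin = "=" * 70, "-" * 70
    lines = [
        rule,
        f"STOCK INTELLIGENCE REPORT: {stock['symbol']}",
        rule,
        f"Company: {stock['name']}",
        f"Exchange: {stock['exchange']}",
        f"Generated: {generated}",
        rule,
        "",
        "[FINANCIAL DATA]",
        thin,
    ]
    # One heading per financial section, with indented key/value pairs beneath.
    for section, values in financials.items():
        lines.append(f"{section.capitalize()}:")
        lines.extend(f"  {key}: {value}" for key, value in values.items())
    lines += ["", "[NEWS ARTICLES - Last 12 Months]", thin]
    for article in articles:
        lines += [
            f"Title: {article['title']}",
            f"Source: {article['source']}",
            f"Date: {article['date']}",
            "",
        ]
    return "\n".join(lines).rstrip() + "\n"
```

Returning a string instead of writing the file directly keeps the formatting easy to check; the caller decides where under `data/reports/` it is saved.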
+
+## File Structure After Running
+
+```
+Victor/
+├── 📄 main.py                 ← Run this!
+├── 📄 test_extraction.py      ← Or test with this
+├── 📄 requirements.txt
+├── 📄 GUIDE.md               ← Full documentation
+├── 📄 PROGRESS.md
+├── 📄 README.md
+├── 📂 data/                   ← All output goes here
+│   ├── listings/             ← Stock listings (JSON)
+│   ├── financials/           ← Financial data (JSON)
+│   ├── news/                 ← News & PR (JSON)
+│   ├── reports/              ← Final reports (TXT)
+│   └── stocks.db             ← SQLite database
+├── 📂 scrap/                  ← Your original Scrapy project
+└── 📂 cleaned_text/           ← Your original cleaned HTML
+```
+
+## Key Features
+
+✅ **No API Keys** - Pure web scraping
+✅ **Canadian Exchanges** - TSX, TSXV, CSE, CBOE
+✅ **Comprehensive Data** - Financials, news, press releases
+✅ **SQLite Database** - Structured storage
+✅ **Text Reports** - Human-readable output
+✅ **Progress Tracking** - Know what's covered
+✅ **Error Handling** - Continues even if some stocks fail
+✅ **Rate Limiting** - Respectful of servers
+✅ **Test Mode** - Verify before full run
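The error-handling and rate-limiting features could be combined in a loop like the one below. This is a sketch only: the fixed two-second delay, the `run_for_each` name, and the injectable `sleep` parameter are assumptions, not the behaviour of the scripts in this commit.

```python
import time


def run_for_each(tickers, task, delay_seconds=2.0, sleep=time.sleep):
    """Run task(ticker) for every ticker, pausing between calls and collecting failures."""
    results, errors = {}, {}
    for index, ticker in enumerate(tickers):
        if index:
            sleep(delay_seconds)  # Fixed pause between requests.
        try:
            results[ticker] = task(ticker)
        except Exception as exc:  # One bad stock should not stop the run.
            errors[ticker] = str(exc)
    return results, errors
```

Passing `sleep` as a parameter lets a test run replace the real delay with a no-op, and the returned `errors` dict gives a simple record of which stocks need a retry.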
+
+## Example Output
+
+After running, you'll have files like:
+
+**`data/listings/all_listings_combined.json`**
+```json
+[
+  {
+    "symbol": "CVV",
+    "name": "CanAlaska Uranium Ltd.",
+    "exchange": "TSXV",
+    "sector": "Materials",
+    "industry": "Mining"
+  },
+  ...
+]
+```
+
+**`data/financials/CVV_yahoo.json`**
+```json
+{
+  "ticker": "CVV",
+  "profile": {
+    "current_price": 0.85
+  },
+  "statistics": {
+    "market_cap": "25M",
+    "pe_ratio": "N/A"
+  }
+}
+```
+
+**`data/reports/CVV_report.txt`**
+```
+======================================================================
+STOCK INTELLIGENCE REPORT: CVV
+======================================================================
+Company: CanAlaska Uranium Ltd.
+Exchange: TSXV
+Generated: 2025-11-06 10:30:00
+======================================================================
+
+[FINANCIAL DATA]
+----------------------------------------------------------------------
+Profile:
+  current_price: 0.85
+
+[NEWS ARTICLES - Last 12 Months]
+----------------------------------------------------------------------
+Title: CanAlaska Announces Drilling Results
+Source: GlobeNewswire
+Date: Oct 15, 2025
+...
+```
+
+## Next Steps For You
+
+1. **Run the test**: `python test_extraction.py`
+2. **Check if it works**: Look in `data/listings/` for JSON files
+3. **If successful**: Run `python main.py` for full pipeline
+4. **If issues**: Check the HTML files saved in `data/listings/` to debug
+
+## Troubleshooting
+
+**Problem: "No listings extracted"**
+- The exchange websites may have changed
+- Check `data/listings/*_page.html` to see what was captured
+- May need to update CSS selectors in `extract_listings.py`
+
+**Problem: "playwright not found"**
+```bash
+# Install the Python package first if it is missing, then the browser
+pip install playwright
+python3 -m playwright install chromium
+```
+
+**Problem: "Module not found"**
+```bash
+pip install -r requirements.txt
+```
+
+## Time Estimates
+
+- **Setup**: 5 minutes
+- **Test run** (5 stocks): 2-3 minutes
+- **Full pipeline** (all stocks): Several hours depending on # of stocks
+
+## What This Gives You
+
+For each stock, you'll have:
+- ✅ Company info (name, ticker, sector, industry)
+- ✅ Financial data (price, market cap, ratios)
+- ✅ News articles (last 12 months)
+- ✅ Press releases (official announcements)
+- ✅ Structured database
+- ✅ Text report
+
+## The Difference From Before
+
+**Before (your original code):**
+- Scraped static HTML
+- Got navigation menus
+- No actual stock data
+
+**After (new code):**
+- Waits for JavaScript
+- Extracts actual stock listings
+- Gets financial data
+- Gets news & press releases
+- Generates comprehensive reports
+
+---
+
+**YOU'RE ALL SET! Run `python test_extraction.py` to get started!** 🚀