feat: Implement stock listing extraction and database population

- Added `extract_listings.py` for extracting stock listings from TSX, TSXV, CSE, and CBOE using Playwright.
- Created `main.py` to orchestrate the entire stock intelligence system, including extraction, database import, financial scraping, news scraping, and report generation.
- Developed `populate_database.py` to populate the database with existing JSON data.
- Introduced `scrape_nasdaq_tsx_only.py` for focused scraping of NASDAQ and TSX stocks.
- Added `setup.py` for initial setup and testing of the system.
- Created `watchlist.txt` template for user-defined stock tracking.
- Generated `final_test_output.txt` to log the results of the test run.
@@ -0,0 +1,243 @@
# 🎯 WHAT I'VE DONE - Summary

## Current Status
✅ **Your scraping project has been upgraded and is ready to run!**

## What Was Wrong Before
Your initial Scrapy spider scraped the **static HTML** from exchange websites, but:
- The actual stock listing data loads via **JavaScript** after page load
- Your cleaned text files only contained navigation menus, not the actual stock data
- You needed a way to extract the **dynamic content** and structure it

## What I've Built For You

### 5 New Python Scripts

1. **`extract_listings.py`** (281 lines)
   - Uses Playwright to wait for JavaScript to load
   - Extracts actual stock data from TSX/TSXV, CSE, CBOE
   - Saves structured JSON files with all tickers

2. **`database.py`** (279 lines)
   - Complete SQLite database schema
   - Tables for stocks, financials, metrics, news, filings, etc.
   - Import/export functions
   - Coverage tracking

3. **`scrape_yahoo_finance.py`** (259 lines)
   - Scrapes Yahoo Finance for each stock
   - Gets price, market cap, financials, statistics
   - No API key needed!
   - Handles Canadian ticker formats (.TO, .V)

4. **`scrape_news_pr.py`** (346 lines)
   - Scrapes Google News for articles
   - Scrapes GlobeNewswire and Newswire.ca for press releases
   - Last 12 months of coverage
   - No API keys needed

5. **`main.py`** (309 lines)
   - Orchestrates the entire pipeline
   - Runs all steps in sequence
   - Generates final text reports
   - Tracks progress and errors
   - Test mode for quick validation

### Supporting Files

6. **`test_extraction.py`** - Quick test script
7. **`requirements.txt`** - All Python dependencies
8. **`GUIDE.md`** - Complete usage guide (you're reading part of it!)
9. **`PROGRESS.md`** - Project progress tracker
10. **Updated README.md** - With implementation status

## How To Use It

### Quick Start (3 commands)
```bash
# 1. Install dependencies
pip install -r requirements.txt
python3 -m playwright install chromium

# 2. Test it
python test_extraction.py

# 3. Run full pipeline (test mode)
python main.py
```

### What Happens When You Run It

**Step 1: Extract Listings** (2-3 minutes)
- Opens browser windows
- Navigates to each exchange
- Waits for JavaScript to load
- Extracts all stock data
- Saves to `data/listings/*.json`
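
The flow of Step 1 can be sketched roughly as follows. This is a minimal illustration and not the actual contents of `extract_listings.py`: the function names, the `table tbody tr` selector, and the two-column row layout (symbol, then name) are all assumptions, since each exchange page has its own markup. It also runs headless, whereas the real script may open visible browser windows.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None


def parse_listings(html, exchange):
    """Turn rendered table HTML into listing dicts (assumed column order: symbol, name)."""
    parser = TableRowParser()
    parser.feed(html)
    return [
        {"symbol": row[0], "name": row[1], "exchange": exchange}
        for row in parser.rows
        if len(row) >= 2
    ]


def fetch_rendered_html(url, selector="table tbody tr", timeout_ms=30_000):
    """Load a page in Chromium and return its HTML once `selector` has appeared."""
    # Imported lazily so parse_listings() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Block until the JavaScript-rendered rows exist in the DOM.
        page.wait_for_selector(selector, timeout=timeout_ms)
        html = page.content()
        browser.close()
    return html


def save_listings(listings, path):
    """Write listings as pretty-printed JSON, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(listings, indent=2), encoding="utf-8")
```

Keeping the parsing separate from the browser call means the parser can be exercised against a saved page without launching Chromium.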

**Step 2: Import to Database** (< 1 minute)
- Creates SQLite database
- Imports all stocks
- Sets up tracking tables
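
As a rough illustration of Step 2, the sketch below creates a stocks table plus a coverage-tracking table and imports listing records into them. The table and column names are assumptions made for this example; the real schema in `database.py` has more tables (financials, metrics, news, filings) and may be laid out differently.

```python
import sqlite3

# Hypothetical, simplified schema: one row per listing plus one coverage row
# that records which later pipeline steps have completed for that stock.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stocks (
    symbol   TEXT NOT NULL,
    exchange TEXT NOT NULL,
    name     TEXT,
    sector   TEXT,
    industry TEXT,
    PRIMARY KEY (symbol, exchange)
);
CREATE TABLE IF NOT EXISTS coverage (
    symbol          TEXT NOT NULL,
    exchange        TEXT NOT NULL,
    financials_done INTEGER NOT NULL DEFAULT 0,
    news_done       INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (symbol, exchange)
);
"""

OPTIONAL_FIELDS = {"name": None, "sector": None, "industry": None}


def import_listings(conn, listings):
    """Create the tables if needed, upsert the listings, and return the stock count."""
    conn.executescript(SCHEMA)
    # Fill in missing optional fields so every named placeholder has a value.
    rows = [{**OPTIONAL_FIELDS, **listing} for listing in listings]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO stocks (symbol, exchange, name, sector, industry) "
            "VALUES (:symbol, :exchange, :name, :sector, :industry)",
            rows,
        )
        # Existing coverage rows are kept so re-imports do not reset progress.
        conn.executemany(
            "INSERT OR IGNORE INTO coverage (symbol, exchange) "
            "VALUES (:symbol, :exchange)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM stocks").fetchone()[0]
```

Using `INSERT OR REPLACE` for stocks and `INSERT OR IGNORE` for coverage makes the import safe to repeat after a fresh extraction.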

**Step 3: Scrape Financials** (varies with the number of stocks)
- For each stock: visits Yahoo Finance
- Extracts price, market cap, financials
- Saves to `data/financials/*.json`
- Updates database
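
Before a stock can be looked up on Yahoo Finance, its exchange symbol has to be converted to Yahoo's ticker format. The sketch below shows one way to do that. The `.TO` and `.V` suffixes come from the script description above; the `.CN` and `.NE` entries for CSE and CBOE, the function names, and the hyphen rule for share classes are assumptions about Yahoo's conventions, so verify them against a few known tickers before relying on them.

```python
# Exchange code -> Yahoo Finance ticker suffix.
YAHOO_SUFFIX = {
    "TSX": ".TO",
    "TSXV": ".V",
    "CSE": ".CN",   # assumption, not stated in this document
    "CBOE": ".NE",  # assumption, not stated in this document
}


def to_yahoo_ticker(symbol, exchange):
    """Convert an exchange symbol such as 'CVV' on 'TSXV' to 'CVV.V'."""
    suffix = YAHOO_SUFFIX.get(exchange.strip().upper())
    if suffix is None:
        raise ValueError(f"No Yahoo suffix known for exchange {exchange!r}")
    # Assumed convention: share classes and units use a hyphen on Yahoo,
    # e.g. 'BBD.B' on TSX becomes 'BBD-B.TO'.
    return symbol.strip().upper().replace(".", "-") + suffix


def quote_url(symbol, exchange):
    """Build the Yahoo Finance quote page URL for a listing."""
    return f"https://finance.yahoo.com/quote/{to_yahoo_ticker(symbol, exchange)}"
```

For example, `quote_url("CVV", "TSXV")` yields a URL ending in `/quote/CVV.V`.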

**Step 4: Scrape News & PR** (varies with the number of stocks)
- Searches Google News for each stock
- Searches press release sites
- Saves to `data/news/*.json`
- Updates database
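
One possible shape for the news step is sketched below: build a Google News search URL for a stock, then keep only items published within the last 12 months. This assumes the RSS search endpoint is used, which may not be how `scrape_news_pr.py` actually works. The function names, the query format, and the Canadian locale parameters are likewise illustrative choices.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlencode


def google_news_rss_url(company_name, symbol):
    """Build a Google News RSS search URL for a company (assumed query format)."""
    query = f'"{company_name}" OR {symbol}'
    params = {"q": query, "hl": "en-CA", "gl": "CA", "ceid": "CA:en"}
    return "https://news.google.com/rss/search?" + urlencode(params)


def parse_recent_items(rss_xml, now=None, max_age_days=365):
    """Return RSS items whose pubDate falls within the last `max_age_days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    items = []
    for item in ET.fromstring(rss_xml).iter("item"):
        pub_date = item.findtext("pubDate")
        if not pub_date:
            continue  # undated items cannot be checked against the window
        published = parsedate_to_datetime(pub_date)
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        if published >= cutoff:
            items.append(
                {
                    "title": item.findtext("title", default=""),
                    "link": item.findtext("link", default=""),
                    "published": published.isoformat(),
                }
            )
    return items
```

Passing `now` explicitly keeps the 12-month window reproducible when re-running against saved feeds.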

**Step 5: Generate Reports** (< 1 minute)
- Creates text file for each stock
- Combines all data sources
- Saves to `data/reports/*.txt`

## File Structure After Running

```
Victor/
├── 📄 main.py                  ← Run this!
├── 📄 test_extraction.py       ← Or test with this
├── 📄 requirements.txt
├── 📄 GUIDE.md                 ← Full documentation
├── 📄 PROGRESS.md
├── 📄 README.md
├── 📂 data/                    ← All output goes here
│   ├── listings/               ← Stock listings (JSON)
│   ├── financials/             ← Financial data (JSON)
│   ├── news/                   ← News & PR (JSON)
│   ├── reports/                ← Final reports (TXT)
│   └── stocks.db               ← SQLite database
├── 📂 scrap/                   ← Your original Scrapy project
└── 📂 cleaned_text/            ← Your original cleaned HTML
```

## Key Features

✅ **No API Keys** - Pure web scraping
✅ **Canadian Exchanges** - TSX, TSXV, CSE, CBOE
✅ **Comprehensive Data** - Financials, news, press releases
✅ **SQLite Database** - Structured storage
✅ **Text Reports** - Human-readable output
✅ **Progress Tracking** - Know what's covered
✅ **Error Handling** - Continues even if some stocks fail
✅ **Rate Limiting** - Respectful of servers
✅ **Test Mode** - Verify before full run
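
The error-handling and rate-limiting features can be pictured as a loop like the one below. It is a sketch under assumptions, not the code in `main.py`: `process_all`, the `handler` callback, and the 2-second default delay are hypothetical names and values chosen for illustration.

```python
import time


def process_all(symbols, handler, delay_seconds=2.0, sleep=time.sleep):
    """Run `handler` for each symbol, pausing between requests.

    A failure for one stock is recorded and the loop moves on to the next one.
    Returns (succeeded, failed), where `failed` maps symbol -> error description.
    """
    succeeded, failed = [], {}
    for index, symbol in enumerate(symbols):
        if index:
            sleep(delay_seconds)  # rate limit: wait before every request after the first
        try:
            handler(symbol)
        except Exception as exc:  # deliberately broad: one bad stock must not stop the run
            failed[symbol] = repr(exc)
        else:
            succeeded.append(symbol)
    return succeeded, failed
```

Injecting `sleep` as a parameter lets a test run replace the real delay with a no-op, which keeps a quick validation pass fast.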

## Example Output

After running, you'll have files like:

**`data/listings/all_listings_combined.json`**
```json
[
  {
    "symbol": "CVV",
    "name": "CanAlaska Uranium Ltd.",
    "exchange": "TSXV",
    "sector": "Materials",
    "industry": "Mining"
  },
  ...
]
```

**`data/financials/CVV_yahoo.json`**
```json
{
  "ticker": "CVV",
  "profile": {
    "current_price": 0.85
  },
  "statistics": {
    "market_cap": "25M",
    "pe_ratio": "N/A"
  }
}
```
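
Values such as `"25M"` and `"N/A"` in the statistics block are stored as display strings. If you want to sort or filter on them, a small helper along these lines can convert them to numbers. This is an illustrative sketch: `parse_abbreviated` is a hypothetical name, and the set of suffixes and "missing" markers is an assumption about what the scraped strings look like.

```python
# Assumed magnitude suffixes used in abbreviated figures.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

# Assumed placeholders that mean "no value available".
MISSING = {"", "N/A", "-", "--"}


def parse_abbreviated(value):
    """Convert strings like '25M' or '1.5B' to floats; return None if unavailable."""
    if value is None:
        return None
    text = str(value).strip().replace(",", "").upper()
    if text in MISSING:
        return None
    factor = 1.0
    if text[-1] in MULTIPLIERS:
        factor = MULTIPLIERS[text[-1]]
        text = text[:-1]
    try:
        return float(text) * factor
    except ValueError:
        return None  # unrecognized format: treat as missing instead of failing
```

With this helper, the example file's `market_cap` becomes a number and its `pe_ratio` becomes `None`.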

**`data/reports/CVV_report.txt`**
```
======================================================================
STOCK INTELLIGENCE REPORT: CVV
======================================================================
Company: CanAlaska Uranium Ltd.
Exchange: TSXV
Generated: 2025-11-06 10:30:00
======================================================================

[FINANCIAL DATA]
----------------------------------------------------------------------
Profile:
current_price: 0.85

[NEWS ARTICLES - Last 12 Months]
----------------------------------------------------------------------
Title: CanAlaska Announces Drilling Results
Source: GlobeNewswire
Date: Oct 15, 2025
...
```
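
A report with the layout shown above could be assembled roughly like this. The sketch assumes the inputs are shaped like the example JSON files: a listing dict, a financials dict with nested sections, and a list of news dicts with `title`, `source`, and `date` keys. The function name, the news key names, and the 70-character rule width are assumptions, and the real generator in `main.py` may differ.

```python
from datetime import datetime

RULE = "=" * 70       # assumed width of the heavy separator
THIN_RULE = "-" * 70  # assumed width of the light separator


def build_report(stock, financials, news, generated_at):
    """Combine listing, financial, and news data into one plain-text report."""
    lines = [
        RULE,
        f"STOCK INTELLIGENCE REPORT: {stock['symbol']}",
        RULE,
        f"Company: {stock['name']}",
        f"Exchange: {stock['exchange']}",
        f"Generated: {generated_at:%Y-%m-%d %H:%M:%S}",
        RULE,
        "",
        "[FINANCIAL DATA]",
        THIN_RULE,
    ]
    for section, values in financials.items():
        if not isinstance(values, dict):
            continue  # skip scalar entries such as "ticker"
        lines.append(f"{section.capitalize()}:")
        lines.extend(f"{key}: {value}" for key, value in values.items())
    lines += ["", "[NEWS ARTICLES - Last 12 Months]", THIN_RULE]
    for article in news:
        lines += [
            f"Title: {article['title']}",
            f"Source: {article['source']}",
            f"Date: {article['date']}",
        ]
    return "\n".join(lines) + "\n"
```

Returning a string instead of writing the file directly leaves the caller free to save it under `data/reports/` or print it for inspection.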

## Next Steps For You

1. **Run the test**: `python test_extraction.py`
2. **Check if it works**: Look in `data/listings/` for JSON files
3. **If successful**: Run `python main.py` for full pipeline
4. **If issues**: Check the HTML files saved in `data/listings/` to debug

## Troubleshooting

**Problem: "No listings extracted"**
- The exchange websites may have changed
- Check `data/listings/*_page.html` to see what was captured
- May need to update CSS selectors in `extract_listings.py`

**Problem: "playwright not found"**
```bash
python3 -m playwright install chromium
```

**Problem: "Module not found"**
```bash
pip install -r requirements.txt
```

## Time Estimates

- **Setup**: 5 minutes
- **Test run** (5 stocks): 2-3 minutes
- **Full pipeline** (all stocks): several hours, depending on the number of stocks

## What This Gives You

For each stock, you'll have:
- ✅ Company info (name, ticker, sector, industry)
- ✅ Financial data (price, market cap, ratios)
- ✅ News articles (last 12 months)
- ✅ Press releases (official announcements)
- ✅ Structured database
- ✅ Text report

## The Difference From Before

**Before (your original code):**
- Scraped static HTML
- Got navigation menus
- No actual stock data

**After (new code):**
- Waits for JavaScript
- Extracts actual stock listings
- Gets financial data
- Gets news & press releases
- Generates comprehensive reports

---

**YOU'RE ALL SET! Run `python test_extraction.py` to get started!** 🚀