# NewsIQ - AI News Intelligence System

## Project Overview

**NewsIQ** is an AI-powered news intelligence platform that ingests RSS feeds, performs semantic analysis, and delivers intelligent recommendations. The system demonstrates your ability to design clean backend architectures and integrate AI capabilities with modern tooling.

## 🛠️ System Workflow

The platform consists of a 4-stage pipeline:

1. **RSS Ingestion**: Pulls articles from RSS feeds with deduplication and feed tracking
2. **AI Processing**: Performs embedding generation, summarization, sentiment analysis, entity extraction, and category classification
3. **Vector Storage**: Stores embeddings in a vector database for semantic search
4. **Intelligent Retrieval**: Enables semantic search and AI-driven recommendations

---

## What the System Should Do

### Core Features

* **Smart RSS Ingestion**

  * Parse articles from multiple feeds
  * Track last updated time per feed
  * Avoid reprocessing old or duplicate articles

* **AI-Based Content Analysis**

  * Generate embeddings using **Cohere**
  * Extract named entities, sentiment, summaries, and categories via **Groq**

* **Semantic Vector Search**

  * Store and query article embeddings using vector similarity
  * Filter results using metadata (source, sentiment, date)

* **Intelligent Recommendations**

  * Recommend articles based on similarity, recency, and category preferences

* **Category Filtering by User Preference**

  * Let users set categories of interest (e.g., sports, music)
  * Only process articles that match user-defined categories

* **Robust API**

  * Well-structured endpoints for updates, search, recommendations, and analytics
  * OpenAPI documentation for interactive testing

---

## Tech Stack

| Component    | Tool                                    |
| ------------ | --------------------------------------- |
| Backend      | FastAPI                                 |
| ORM          | SQLAlchemy                              |
| DB           | SQLite (switchable to a production DB)  |
| Embeddings   | [Cohere API](https://cohere.ai)         |
| LLM Analysis | [Groq API](https://groq.com)            |
| Vector DB    | Choose one: FAISS / Weaviate / Pinecone |
| Feeds        | `feedparser`, `requests`                |
| Env Mgmt     | `python-dotenv`                         |
| Migrations   | Alembic                                 |
| Testing      | `pytest`                                |

---

## RSS Feed Sources

```
https://www.nytimes.com/rss
https://www.cnbc.com/id/100727362/device/rss/rss.html
https://www.bbc.co.uk/sport/football/57000000
https://www.aljazeera.com/xml/rss/all.xml
https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml
https://globalnews.ca/world/feed/
https://feeds.skynews.com/feeds/rss/world.xml
https://www.e-ir.info/feed/
https://www.thecipherbrief.com/feeds/feed.rss
https://warontherocks.com/feed/
```

---

## User Settings (Category Preferences)

Let users define which types of news they care about, e.g., *sports*, *technology*, *music*.

### Settings API

```http
# Update user preferences
PUT /api/settings/categories

Request Body:
{
    "user_id": "user_123",
    "preferred_categories": ["sports", "technology", "music"]
}

Response:
{
    "status": "success",
    "message": "Preferences updated"
}

# Get user preferences
GET /api/settings/categories?user_id=user_123

Response:
{
    "user_id": "user_123",
    "preferred_categories": ["sports", "technology", "music"]
}
```

### Filtering Logic

* Only process and store articles that **match the user’s categories**
* AI should classify articles into categories before deciding whether to continue processing
* Defaults to processing all categories if no preferences are set
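
The filtering rules above can be sketched as a single gate that runs right after classification. This is a minimal sketch, not a prescribed implementation: the function name `should_process` and the case-insensitive comparison are assumptions made for illustration.

```python
from typing import Iterable, Optional


def should_process(article_category: str,
                   preferred_categories: Optional[Iterable[str]]) -> bool:
    """Return True if the article should continue through the pipeline.

    Hypothetical helper: the name and the case-insensitive matching
    are illustrative assumptions, not part of the specification.
    """
    preferred = list(preferred_categories or [])
    # No preferences set (None or empty): process every category.
    if not preferred:
        return True
    wanted = {category.strip().lower() for category in preferred}
    return article_category.strip().lower() in wanted
```

Running this check before embedding and storage means articles outside the user's categories cost one classification call and nothing more.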

---

## API Endpoints Overview

### Article Updates

* `POST /api/updates/fetch-latest`
  Fetch and process new articles from RSS feeds
* `GET /api/updates/status`
  Get current ingestion status for all RSS feeds

### Article Management

* `GET /api/articles/{article_id}`
  Retrieve full article details
* `GET /api/articles/`
  Paginated, filterable list of articles
* `GET /api/articles/{article_id}/analysis`
  Get AI-generated metadata (summary, entities, sentiment, etc.)

### Semantic Search & Discovery

* `POST /api/search/semantic`
  Perform embedding-based semantic search
* `GET /api/search/similar/{article_id}`
  Find articles similar to the given one
* `POST /api/recommendations/`
  Generate personalized article recommendations
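
One way the recommendation endpoint could rank candidates is a weighted blend of embedding similarity, recency, and category match. The sketch below uses only the standard library. The function names, the 24-hour half-life, and the `(0.6, 0.3, 0.1)` weights are illustrative assumptions and would need tuning against real data.

```python
import math
from datetime import datetime, timezone
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recency_score(published: datetime, now: datetime,
                  half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for a brand-new article, 0.5 after one half-life."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


def recommendation_score(similarity: float, recency: float, category_match: bool,
                         weights: Tuple[float, float, float] = (0.6, 0.3, 0.1)) -> float:
    """Blend the three signals; the default weights are assumed, not prescribed."""
    w_sim, w_rec, w_cat = weights
    return w_sim * similarity + w_rec * recency + w_cat * (1.0 if category_match else 0.0)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sim = cosine_similarity([0.2, 0.9], [0.1, 0.8])
    print(recommendation_score(sim, recency_score(now, now), category_match=True))
```

In a real deployment the similarity would come from the vector database query rather than being computed in Python, and only the final blending step would live in application code.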

### Analytics

* `GET /api/analytics/trends`
  Get trending topics from recent articles
* `GET /api/analytics/sentiment`
  Analyze sentiment distribution over time or source
* `GET /api/analytics/sources`
  View performance/coverage of individual RSS sources
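
As an illustration of what the sentiment endpoint might compute, the sketch below counts sentiment labels per source. The dictionary keys `source` and `sentiment` and the function name are assumptions. In the real system this would more likely be a SQL `GROUP BY` through SQLAlchemy.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping


def sentiment_by_source(articles: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count sentiment labels per source.

    Assumes each article mapping has 'source' and 'sentiment' keys
    (hypothetical field names chosen for this example).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for article in articles:
        counts[article["source"]][article["sentiment"]] += 1
    return {source: dict(counter) for source, counter in counts.items()}
```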

---

## Key Technical Challenges

### 1. Incremental Updates

* Avoid reprocessing the same articles
* Track and sync timestamps per feed
* Detect updated articles by comparing URL + content fingerprint
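
A minimal sketch of the URL + content fingerprint comparison follows. The in-memory `seen` dictionary stands in for a database table keyed by URL. The function names and the whitespace and case normalization are assumptions made for illustration.

```python
import hashlib
from typing import Dict


def content_fingerprint(title: str, body: str) -> str:
    """SHA-256 of whitespace-normalized, lowercased text.

    Normalizing first means trivial reformatting by the publisher
    does not register as an update.
    """
    normalized = " ".join(f"{title}\n{body}".split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_entry(url: str, title: str, body: str, seen: Dict[str, str]) -> str:
    """Return 'new', 'updated', or 'unchanged', and record the latest fingerprint.

    `seen` maps URL -> last fingerprint; a dict is used here only for
    illustration in place of a persistent store.
    """
    fingerprint = content_fingerprint(title, body)
    previous = seen.get(url)
    seen[url] = fingerprint
    if previous is None:
        return "new"
    return "unchanged" if previous == fingerprint else "updated"
```

Only entries classified as `new` or `updated` need to go through AI processing again. Combined with a per-feed "last updated" timestamp, this lets most polling cycles skip most entries.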

### 2. Duplicate Detection

* Use URL match + fuzzy title match + content similarity
* Handle near-duplicate and updated versions of the same story
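
The URL and fuzzy-title checks can be sketched with the standard library alone. The helper names and the `0.9` threshold are assumptions. Dropping the query string is also an assumption that breaks for sites that identify articles by query parameter. The content-similarity check is not shown, since it would typically reuse the article embeddings.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host; drop query string, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_duplicate(url_a: str, title_a: str, url_b: str, title_b: str,
                 threshold: float = 0.9) -> bool:
    """Duplicate if the normalized URLs match or the titles are near-identical.

    The 0.9 threshold is an illustrative default, not a tuned value.
    """
    if normalize_url(url_a) == normalize_url(url_b):
        return True
    return title_similarity(title_a, title_b) >= threshold
```

Pairwise title comparison is quadratic in the number of articles, so in practice it would be limited to a recent time window or to candidates already surfaced by vector similarity.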

### 3. Category Filtering

* Classify articles during AI processing
* Skip irrelevant categories based on user preferences
* Ensure high accuracy in categorization

### 4. AI & Vector Sync

* Ensure metadata and vector store are always in sync
* Handle failures in vector DB gracefully
* Implement cleanup of orphaned or outdated vectors
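
A periodic reconciliation pass is one way to detect drift between the relational database and the vector store. The sketch below compares the two sets of IDs. How the IDs are listed depends on the chosen vector database and is not shown, and the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple


def diff_stores(db_ids: Iterable[str],
                vector_ids: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (orphaned_vectors, missing_vectors).

    orphaned_vectors: IDs in the vector store with no matching article row.
    missing_vectors:  article IDs that have no vector, e.g. after a failed upsert.
    """
    db, vec = set(db_ids), set(vector_ids)
    orphaned = vec - db
    missing = db - vec
    return orphaned, missing
```

Orphaned IDs can then be deleted from the vector store, and missing IDs re-queued for embedding, which also covers recovery after a vector DB outage.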

### 5. Performance

* Index frequently filtered fields
* Use batch processing and async operations
* Optimize semantic search 
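
Batch processing with bounded concurrency can be sketched with `asyncio` alone. Here `worker` is a placeholder for whatever coroutine calls the embedding or LLM API. The helper names, batch size, and concurrency limit are assumptions chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def process_in_batches(items: Sequence[T],
                             worker: Callable[[Sequence[T]], Awaitable[List[R]]],
                             batch_size: int = 16,
                             max_concurrency: int = 4) -> List[R]:
    """Run `worker` on each batch, with at most `max_concurrency` batches in flight.

    Results are returned in input order because asyncio.gather preserves
    the order of the awaitables it is given.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(batch: Sequence[T]) -> List[R]:
        async with semaphore:
            return await worker(batch)

    results = await asyncio.gather(*(run(b) for b in batched(items, batch_size)))
    return [item for batch_result in results for item in batch_result]
```

The semaphore keeps the number of concurrent API calls under the provider's rate limit while still overlapping network waits.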

---

## Bonus Features (Optional)

* Background job scheduler for automated fetches
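
A scheduler does not need an extra dependency to start with. The sketch below is a plain `asyncio` loop that runs a job at a fixed interval and keeps going if one run fails. The name `run_periodically` and the `max_runs` parameter (useful mainly for tests) are assumptions. A dedicated scheduling library would be a reasonable alternative.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_periodically(job: Callable[[], Awaitable[None]],
                           interval_seconds: float,
                           max_runs: Optional[int] = None) -> int:
    """Call `job` repeatedly, sleeping `interval_seconds` between runs.

    Runs forever when `max_runs` is None. A failing run is reported and
    does not stop the loop. Returns the number of runs attempted.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await job()
        except Exception as exc:  # keep the scheduler alive on job errors
            print(f"scheduled fetch failed: {exc!r}")
        runs += 1
        if max_runs is None or runs < max_runs:
            await asyncio.sleep(interval_seconds)
    return runs
```

In a FastAPI application this coroutine could be started as a background task at startup, with `job` wrapping the same logic that backs `POST /api/updates/fetch-latest`.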

---

## ✅ Success Criteria

### Functional Requirements

* RSS articles are fetched and processed
* Incremental updates work without duplication
* Semantic search returns relevant results
* User preferences respected in every processing cycle

### Technical Requirements

* Clean, modular, testable codebase
* Proper use of SQLAlchemy, Pydantic, and environment configs
* Documented README
* Unit test coverage for core logic
* Alembic-based migrations for schema changes


## 📁 Deliverables

* ✅ Working FastAPI backend with REST endpoints
* ✅ SQLAlchemy ORM models with Alembic migrations
* ✅ AI integration with Cohere and Groq
* ✅ Vector similarity search with metadata filters
* ✅ Smart RSS ingestion with category filtering
* ✅ API documentation via OpenAPI
* ✅ Clean README with setup instructions and an architecture overview, including an architecture diagram

---

## Documentation Expectations

* Describe high-level architecture and design decisions
* Explain how article processing, filtering, and recommendations work
* Document how category filtering is enforced
* Include instructions for deployment, testing, and extending the system