---
license: mit
language:
- en
tags:
- text-classification
- distilbert
- advertisement-detection
- rss
- news
- binary-classification
pipeline_tag: text-classification
---

# DistilBERT RSS Advertisement Detection

A DistilBERT-based model for classifying RSS article titles as advertisements or legitimate news content.

## Model Description

This model is fine-tuned from `distilbert-base-uncased` for binary text classification. It distinguishes between:

- **Advertisement**: promotional content, deals, sales, sponsored content
- **News**: legitimate news articles, editorial content, research findings

## Intended Use

- **Primary**: filtering RSS feeds to separate advertisements from news
- **Secondary**: content moderation, spam detection, content categorization
- **Research**: text classification and advertisement-detection studies

## Performance

- **Accuracy**: ~95%
- **F1 Score**: ~94%
- **Precision**: ~93%
- **Recall**: ~94%

## Training Data

- **Source**: 75+ RSS feeds from major tech news outlets
- **Articles**: 1,600+ RSS articles
- **Labeled**: 1,000+ manually labeled examples
- **Sources**: TechCrunch, WIRED, The Verge, Ars Technica, OpenAI, Google AI, etc.

## Usage

```python
from transformers import pipeline

# Load the model
classifier = pipeline("text-classification", model="SoroushXYZ/distilbert-rss-ad-detection")

# Classify examples
examples = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!"
]

for text in examples:
    result = classifier(text)
    print(f"{text} -> {result[0]['label']} ({result[0]['score']:.3f})")
```

## Model Architecture

- **Base Model**: `distilbert-base-uncased`
- **Task**: binary text classification
- **Input**: text (max 128 tokens)
- **Output**: class probabilities (news, advertisement)

## Training Details

- **Epochs**: 3
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **Framework**: PyTorch + Transformers

## Limitations

- Trained primarily on tech news content
- May not generalize well to other domains
- Performance depends on title quality and clarity
- Limited to English-language content

## Citation

If you use this model, please cite:

```bibtex
@misc{distilbert-rss-ad-detection,
  title={DistilBERT RSS Advertisement Detection},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/SoroushXYZ/distilbert-rss-ad-detection}
}
```
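The pipeline in the usage example returns only the top label and its score. Under the hood, the model emits one logit per class, and a softmax turns those logits into the class probabilities mentioned under Model Architecture. A minimal sketch of that conversion in plain Python (the logit values here are invented for illustration, not taken from the model):

```python
import math

def softmax(logits):
    """Convert raw classifier logits into class probabilities."""
    # Subtract the max logit for numerical stability before exponentiating.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (news, advertisement) on an ad-like title
logits = [-1.2, 2.3]
probs = softmax(logits)
print({"news": round(probs[0], 3), "advertisement": round(probs[1], 3)})
# → {'news': 0.029, 'advertisement': 0.971}
```

The probabilities always sum to 1, so a single threshold on the advertisement probability is enough when filtering a feed.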
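The hyperparameters in Training Details map onto the Transformers `Trainer` API roughly as follows. This is a reconstruction from the numbers reported in this card, not the original training script; the output directory is a placeholder and the dataset wiring is left elided:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # binary: news vs. advertisement
)

# Hyperparameters as reported under Training Details; AdamW is the Trainer default optimizer.
args = TrainingArguments(
    output_dir="distilbert-rss-ad-detection",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer)
# trainer.train()
```

Titles would be tokenized with `max_length=128` and truncation enabled, matching the input limit listed under Model Architecture.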