The Reputation Analysis project aims to build a utility that analyzes news articles about a specific company and presents any reputation-threatening information on a concise dashboard. The solution combines web scraping, text processing, natural language processing (NLP), and data analysis to gather, analyze, and visualize information from media articles.

GitHub: https://github.com/HighnessAtharva/CRIF-Hackathon-2023

Project Tech Stack

The project utilizes the following technologies and tools:

  • Python
  • Selenium
  • News API
  • spaCy NLP library
  • BeautifulSoup
  • Regex
  • Trafilatura
  • TensorFlow
  • Tableau or Power BI (dashboard creation)
  • CSV processing with pandas

Learnings

The project taught several practical lessons:

  • Web scraping techniques using Selenium and web search engines (Google, Yahoo, DuckDuckGo).
  • Extracting relevant information from webpages using BeautifulSoup and regex.
  • Text processing and cleaning of news articles to remove irrelevant data.
  • Sentiment analysis using TensorFlow to compute a sentiment score for each article.
  • Named Entity Recognition (NER) using the pre-trained RoBERTa-based model from spaCy to identify organization and risk entities.
  • Establishing relationships between risk elements and company names through dependency parsing or predicate classifiers.
  • Creating interactive dashboards using Tableau or Power BI to present the analyzed data.
  • Efficient file handling and processing of CSV files for data storage and manipulation.
  • Optimizing NLP processing and reducing scraping time for improved performance.
  • Handling exceptions, errors, and API throttling during web scraping.
  • Extending the spaCy model's NER to recognize custom risk entities.
  • Learning and applying Tableau for data visualization and storytelling.
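
Two of the points above, cleaning article text with regex and coping with API throttling, can be sketched with the standard library alone. This is a minimal illustration and not the project's actual code: the boilerplate patterns, the function names clean_article and with_backoff, and the retry settings are all assumptions made for the example.

```python
import re
import time

# Hypothetical patterns: the project's real cleaning rules are not shown here.
TAG = re.compile(r"<[^>]+>")
BOILERPLATE = re.compile(
    r"^\s*(advertisement|read more.*|subscribe.*)\s*$",
    flags=re.IGNORECASE | re.MULTILINE,
)
WHITESPACE = re.compile(r"\s+")


def clean_article(raw: str) -> str:
    """Strip leftover HTML tags, boilerplate lines, and extra whitespace."""
    text = TAG.sub(" ", raw)            # drop tags, keep their inner text
    text = BOILERPLATE.sub(" ", text)   # drop whole lines that are boilerplate
    return WHITESPACE.sub(" ", text).strip()


def with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff when it raises.

    Catching Exception keeps the sketch short; real code would catch only
    the throttling or network errors raised by the client in use.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise                   # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

A scraper could wrap each request as with_backoff(lambda: fetch_page(url)), where fetch_page stands in for whatever function performs the request, and pass the returned text through clean_article before any NLP step. The sleep parameter exists only so the delay can be replaced in tests.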

The project provided hands-on experience with web scraping, text processing, NLP, data analysis, and data visualization. It also built practical skill with the Python libraries used to scrape, process, and analyze textual data, and working with Tableau showed how to build clear, informative dashboards.

Overall, the project led to a deeper understanding of reputation analysis and of the technical skills needed to implement such a solution.