An ETL pipeline that collects Reddit posts, applies sentiment analysis, and publishes results on Slack
In this project, I built an ETL (Extract, Transform, Load) pipeline with multiple steps:
- Collect posts from Reddit: A script fetches posts from the Reddit API and inserts them into a MongoDB database. (see directory reddit_collector)
- Transform Reddit posts: An ETL job extracts the posts from MongoDB, transforms them (including sentiment analysis), and loads them into a PostgreSQL database. (see directory etl_job)
- Publish selected posts in Slack: In the last step, the posts and their sentiment analysis results are read from PostgreSQL and sent as Slack messages. (see directory slack_bot)
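
To make the data flow between the steps more concrete, here is a minimal sketch of the transform and publish stages as plain functions. It is not the code from `etl_job` or `slack_bot`: the function names, the document fields (`id`, `title`), and the row columns are hypothetical, and the sentiment scorer is passed in as a callable so the sketch runs without any database or API connection. The payload shape assumes that the Slack endpoint accepts a JSON body with a `text` field.

```python
from typing import Callable


def transform_post(doc: dict, score: Callable[[str], float]) -> dict:
    """Turn one MongoDB-style document into a flat row for PostgreSQL.

    The field names 'id' and 'title' are assumptions for illustration.
    """
    # Collapse runs of whitespace so the stored text is clean.
    text = " ".join(doc.get("title", "").split())
    return {"reddit_id": doc["id"], "text": text, "sentiment": score(text)}


def format_slack_message(row: dict) -> dict:
    """Build a JSON-serialisable message payload from a transformed row."""
    return {"text": f"{row['text']} (sentiment: {row['sentiment']:+.2f})"}


if __name__ == "__main__":
    # A stand-in scorer; the real pipeline would plug in a sentiment model here.
    row = transform_post({"id": "abc", "title": "  Hello   world "}, lambda t: 0.5)
    print(format_slack_message(row))
```

Keeping the scorer as a parameter means the transform logic can be checked on its own, independent of MongoDB, PostgreSQL, and the sentiment library.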
The whole pipeline runs with Docker and Docker Compose. For sentiment analysis, I used the SentimentIntensityAnalyzer from the vaderSentiment library.
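
As a rough illustration of how the analyzer can be used, the sketch below scores a text with `SentimentIntensityAnalyzer.polarity_scores` and maps the resulting `compound` value to a label. The helper names `score_text` and `label_sentiment` are hypothetical and not taken from this repository; the ±0.05 cut-offs are the thresholds suggested in the vaderSentiment README, and the project itself may use different ones or store only the raw score.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _analyzer():
    # Imported lazily so label_sentiment works even where the
    # vaderSentiment package is not installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer()


def score_text(text: str) -> float:
    """Return VADER's compound score, a value between -1 and 1."""
    return _analyzer().polarity_scores(text)["compound"]


def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

With the package installed, `label_sentiment(score_text(post_text))` gives a label for a single post, where `post_text` is whatever text field the pipeline stores.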