An ETL pipeline that collects Reddit posts, applies sentiment analysis, and publishes results on Slack
![Screenshot of a part of the code](/files/styles/large_16_9/public/2023-09/reddit-etl-header.jpg?itok=UGFuM9rt)
In this project, I built an ETL (Extract, Transform, Load) pipeline with three steps:
- Collect posts from Reddit: A script fetches posts from the Reddit API and inserts them into a MongoDB database (see the `reddit_collector` directory).
- Transform Reddit posts: An ETL job extracts the posts from MongoDB, transforms them (including sentiment analysis), and loads the results into a PostgreSQL database (see the `etl_job` directory).
- Publish selected posts in Slack: In the last step, the posts and their sentiment analysis results are read back and sent as Slack messages (see the `slack_bot` directory).
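To make the data flow between the steps more concrete, here is a minimal sketch of the transform and publish logic as plain Python functions. It is not the code from `etl_job` or `slack_bot`: the field names (`_id`, `title`), the row layout, and the message format are assumptions for illustration, and the sentiment scorer is passed in as a function so the sketch runs without any database or API connection.

```python
import re
from typing import Callable


def clean_text(text: str) -> str:
    """Remove URLs and collapse repeated whitespace before scoring."""
    without_urls = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", without_urls).strip()


def transform(doc: dict, score: Callable[[str], float]) -> dict:
    """Turn one raw post document into a flat row for a relational table.

    `doc` mimics a MongoDB document; the keys `_id` and `title` are
    hypothetical and may differ from what the collector actually stores.
    """
    text = clean_text(doc.get("title", ""))
    return {
        "post_id": str(doc["_id"]),
        "text": text,
        "sentiment": score(text),
    }


def format_slack_message(row: dict) -> str:
    """Build the message text for one transformed row (format is an assumption)."""
    return f"{row['text']} (sentiment: {row['sentiment']:+.2f})"
```

In a real setup, `transform` would be called for each document read from MongoDB, the resulting rows written to PostgreSQL, and `format_slack_message` applied to the rows the Slack bot selects. Keeping these functions free of I/O makes them easy to test on their own.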
The whole pipeline runs in Docker containers orchestrated with Docker Compose. For sentiment analysis, I used the `SentimentIntensityAnalyzer` from the vaderSentiment library.
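As a rough illustration of how the analyzer can be used, the sketch below wraps it in a scoring function and maps the compound score to a label. The helper names `make_vader_scorer` and `label_from_compound` are hypothetical and not taken from the project. The ±0.05 thresholds follow the convention suggested in the vaderSentiment README; the project itself may use different cut-offs or store only the raw score.

```python
from typing import Callable


def make_vader_scorer() -> Callable[[str], float]:
    """Return a function that maps a text to VADER's compound score in [-1, 1]."""
    # Imported lazily so label_from_compound can be used without the package.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    def score(text: str) -> float:
        # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'.
        return analyzer.polarity_scores(text)["compound"]

    return score


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a coarse label (threshold is an assumption)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

With vaderSentiment installed, `score = make_vader_scorer()` followed by `label_from_compound(score("some post title"))` yields a label for a single post. Creating the analyzer once and reusing it avoids reloading the lexicon for every text.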