Exploring the Data Behind IMDB's Top 1000 Films
I had been planning to work on this project for a while because I love watching movies and try to keep up with the latest trends in Hollywood. With that in mind, I searched for a dataset that would align with my interests and let me apply my data analytics and Python skills. I came across the "IMDB Top 1000 Movies" dataset, which contains movies with an IMDB rating of 7.0 or higher. I chose this dataset because it includes many attributes, such as runtime, genre and actors, which allowed me to explore correlations and build insightful visuals.

Explore the code on my GitHub: https://github.com/SteelWarrior123/My_Projects
Here is the dataset I used: https://www.kaggle.com/datasets/arthurchongg/imdb-top-1000-movies/data?select=imdb_raw.csv

The line graph illustrates the number of hits released each year since 1921. For most of that period the count trends upward, indicating gradual growth in the production of highly rated films.
This trend reflects the ongoing expansion of the film industry and filmmakers' ability to adapt to changing viewer tastes over the decades. However, the last 15 years have seen a steep decline in the number of hits released. One possible explanation is movie studios' reliance on franchise films and remakes, which audiences have often received poorly.
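As a rough illustration of how the per-year counts behind such a line graph can be tallied, here is a minimal sketch using only the Python standard library. The list of years is placeholder data and `films_per_year` is a hypothetical helper, not the code from my repository. In pandas, `Series.value_counts()` on the release-year column would produce a similar tally.

```python
from collections import Counter

# Placeholder release years (NOT real dataset values); in practice these
# would come from the dataset's release-year column.
release_years = [1921, 1939, 1939, 1994, 1994, 1994, 2008, 2015]


def films_per_year(years):
    """Return (year, count) pairs sorted by year, ready to plot as a line."""
    counts = Counter(years)
    return sorted(counts.items())


per_year = films_per_year(release_years)
for year, count in per_year:
    print(year, count)
```

The sorted pairs can be passed straight to a plotting library as the x and y values of the line.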

Across the top 1000 hits, a few directors and actors stood out from the rest. Among directors, it was no surprise to see Steven Spielberg (13) and Martin Scorsese (10) as the ones with the most highly rated movies, including "Jurassic Park" (Spielberg) and "The Irishman" and "Shutter Island" (Scorsese). However, not seeing James Cameron in the top 5 was a shock given his high-grossing movies such as "Titanic", "Avatar" and "The Terminator".

As for the actors, Robert De Niro and Tom Hanks (my personal favourite) were the names that appeared most frequently across the dataset (11 each). De Niro is known for his roles in "Goodfellas" and "Taxi Driver", and Hanks for "Cast Away" and "The Green Mile".
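Counts like these can be produced with a simple frequency tally. The sketch below is an assumption about the approach, not the exact code I ran: the rows, the column names `director` and `stars`, and the comma-separated layout of the cast field are all hypothetical stand-ins for whatever the real file uses.

```python
from collections import Counter

# Placeholder rows; the names and the comma-separated "stars" layout
# are assumptions for illustration only.
films = [
    {"director": "Director A", "stars": "Actor X, Actor Y"},
    {"director": "Director A", "stars": "Actor X, Actor Z"},
    {"director": "Director B", "stars": "Actor Y, Actor X"},
]


def top_names(rows, column, n=5):
    """Count how often each name appears in `column` and return the top n."""
    counter = Counter()
    for row in rows:
        # A cell may hold several comma-separated names (e.g. a cast list).
        for name in row[column].split(","):
            name = name.strip()
            if name:
                counter[name] += 1
    return counter.most_common(n)


print(top_names(films, "director"))
print(top_names(films, "stars"))
```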

The pie chart offers insight into the genre distribution among the top hits. Notably, drama is the most popular genre, making up nearly half (46.40%) of the dataset. Action is the second most prevalent genre, but at 11% it trails far behind.
This prompted me to delve deeper into why we love drama movies. Typically characterized by seriousness and intensity, drama films often revolve around a central conflict that is resolved by the story's conclusion. While conflict is a common theme across genres, dramas stand out for exploring the characters' emotions during these conflicts and the relationships they form.
Directors' focus on individual characters intensifies the narrative and enriches the storyline. Following the characters' journeys and experiencing their emotions alongside them enhances engagement, tapping into our innate curiosity and our inclination to empathize with the experiences portrayed on screen.
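The genre percentages behind a pie chart like this one can be derived from a frequency count. This is a minimal sketch with placeholder values. It assumes the genre field may list several genres separated by commas and treats the first one as the film's primary genre, which is my simplification and may not match how the dataset or my notebook handles it.

```python
from collections import Counter

# Placeholder genre strings (NOT real dataset values).
genres = ["Drama", "Drama, Romance", "Action, Adventure", "Comedy"]


def genre_shares(genre_column):
    """Return {primary genre: share of films in percent}."""
    # Assumption: the first listed genre is treated as the primary one.
    primary = [g.split(",")[0].strip() for g in genre_column if g.strip()]
    counts = Counter(primary)
    total = len(primary)
    return {genre: round(100 * n / total, 2) for genre, n in counts.most_common()}


print(genre_shares(genres))
```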
Correlation Analysis

The following scatter plot visualizes the relationship between a top hit's gross revenue (solely in the US and Canada) and its runtime. By analysing the r-value of this distribution, we can see that there is a weak positive correlation between the two attributes.
The dataset does include a few outliers, such as the 1939 romance hit "Gone with the Wind", which had a runtime of almost 240 minutes but generated only around $200 million adjusted for inflation. At the other extreme, "Star Wars: Episode VII - The Force Awakens" topped the gross revenue chart at around $930 million with a runtime of only 138 minutes.
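For readers unfamiliar with the r-value, it is the Pearson correlation coefficient: the covariance of the two variables divided by the product of their spreads. Below is a minimal sketch of that calculation written out by hand. The numbers are placeholders rather than values from the dataset, and `pearson_r` is a hypothetical helper. Libraries such as pandas (`DataFrame.corr`) compute the same quantity.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values: runtime in minutes, gross in millions of dollars.
runtimes = [90, 120, 150, 180]
gross = [50.0, 40.0, 90.0, 70.0]
print(round(pearson_r(runtimes, gross), 3))
```

An r close to 0 means little linear relationship, while values near +1 or -1 mean a strong one. Because r is sensitive to extreme points, outliers like the two films above can shift it noticeably.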

This scatter plot depicts the relationship between a top hit's IMDB rating and its runtime. The r-value indicates a positive correlation between these attributes: as a movie's length increases, its IMDB rating tends to be higher on average. One possible explanation is that longer films in this list are backed by exemplary directorial skill in maintaining an engaging storyline and holding the audience's interest throughout the extended duration, although correlation alone cannot confirm the cause.
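One way to check the "higher on average" pattern directly is to group films into runtime buckets and compare the mean rating of each bucket. The sketch below assumes 30-minute buckets and uses placeholder numbers. Both the bucket width and the helper name `mean_rating_by_runtime` are my own choices for illustration.

```python
from collections import defaultdict


def mean_rating_by_runtime(runtimes, ratings, bucket_minutes=30):
    """Return {bucket start (minutes): mean rating of films in that bucket}."""
    buckets = defaultdict(list)
    for runtime, rating in zip(runtimes, ratings):
        # e.g. a 95-minute film falls in the bucket starting at 90 minutes.
        start = (runtime // bucket_minutes) * bucket_minutes
        buckets[start].append(rating)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Placeholder values (NOT real dataset rows).
sample_runtimes = [95, 100, 130, 170, 185]
sample_ratings = [7.0, 8.0, 9.0, 8.0, 8.5]
for start, mean in mean_rating_by_runtime(sample_runtimes, sample_ratings).items():
    print(f"{start}-{start + 29} min: {mean:.2f}")
```

If the means rise from one bucket to the next, that supports the positive correlation. A flat or uneven pattern would suggest the relationship is driven by only a few films.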
Conclusion
I had a lot of fun working on this project, as it gave me the opportunity to perform correlation analysis on a diverse range of attributes. For example, I explored the relationship between a movie's runtime and both its gross revenue and its IMDB rating. The knowledge from my statistics courses helped me interpret the output of these analyses. This project served as a practical application of the concepts I learned in class and gave me deeper insight into how analytical methods apply to real-world data.