Spotting the Blockbusters: A Guide for Film Studios Looking for the Next Hit

Shreyoshi Das
May 2, 2021

Introduction

The film industry used to be one of the most lucrative businesses in the world. However, industry growth has stagnated in recent years, most notably due to the COVID-19 pandemic. In 2020, the global entertainment market was valued at $80.8 billion, a decline of 18% from 2019 and the lowest since 2016. As the economy recovers, film studios have been searching for new projects to fund to bounce back from the losses they have suffered.

This analysis uses an IMDB dataset of 5,000 movies to examine which factors are most strongly associated with movies being commercially successful and well-received by audiences. Film studios can use these insights to better judge which types of movies are likely to succeed and to allocate funding to projects more strategically.

Methodology

Gross amount earned ($) was used to measure commercial success and IMDB score was used to measure audience reception. Other variables of interest were analyzed with a focus on determining their impact on gross earnings and IMDB score.

The key variables examined in this analysis include the following:

  1. Budget
  2. Country of Origin
  3. Plot Keywords
  4. Genre Keywords
  5. Content Rating
  6. Language
  7. Twitter Sentiment Score
  8. Gross Amount Earned ($)
  9. IMDB Score

R was used to create visualizations and word clouds, conduct linear regressions, and scrape tweets for sentiment analysis using the Twitter API and twitteR package.

The Data at a Glance: Illustrative Figures

Distributions of Key Variables

Commercial Success

Preliminary analysis reveals some interesting trends in the distributions of the variables analyzed.

Figure 1 compares summary statistics of gross amount earned with movie budget. In aggregate, the average gross amount earned is larger than the average budget, but not by a large margin. However, movie budgets vary more widely than gross earnings do.

Figure 1
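
As a rough illustration of the comparison behind Figure 1, the sketch below computes the mean, median, and standard deviation for budget and gross earnings. It is written in Python rather than the R used for this analysis, and the five (budget, gross) pairs and the summarize helper are hypothetical, chosen only to mirror the pattern described above.

```python
import statistics

# Hypothetical (budget, gross) pairs in US dollars; not taken from the dataset.
movies = [
    (20_000_000, 35_000_000),
    (150_000_000, 130_000_000),
    (5_000_000, 12_000_000),
    (60_000_000, 80_000_000),
    (250_000_000, 235_000_000),
]

def summarize(values):
    """Return the mean, median, and sample standard deviation of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

budget_summary = summarize([budget for budget, _ in movies])
gross_summary = summarize([gross for _, gross in movies])

for name, summary in (("budget", budget_summary), ("gross", gross_summary)):
    print(name, {key: round(value) for key, value in summary.items()})
```

In this made-up sample the mean gross is slightly above the mean budget while the budget figures have the larger standard deviation, which is the same shape of result that Figure 1 reports.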

Movies within the dataset also earned less in gross revenue than one would expect. Figure 2 shows that about 40% of movies made between $0–50 million at the box office and another 25% of movies made between $50–100 million. However, there were a few movies in the tail end of the distribution that were very commercially successful.

Figure 2
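
The shares quoted for Figure 2 come from grouping movies into $50 million buckets. A minimal sketch of that grouping is shown here in Python (the analysis itself was done in R). The bin_shares helper and the sample values are hypothetical, and each bucket is assumed to include its lower bound and exclude its upper bound.

```python
from collections import Counter

BIN_WIDTH = 50_000_000  # $50 million buckets, matching the ranges quoted above

def bin_shares(gross_values, bin_width=BIN_WIDTH):
    """Map each bucket's lower bound to the share of movies falling in it."""
    counts = Counter((gross // bin_width) * bin_width for gross in gross_values)
    total = len(gross_values)
    return {lower: counts[lower] / total for lower in sorted(counts)}

# Hypothetical gross figures in US dollars.
sample_gross = [12_000_000, 48_000_000, 75_000_000, 99_000_000, 310_000_000]

for lower, share in bin_shares(sample_gross).items():
    upper = lower + BIN_WIDTH
    print(f"${lower // 1_000_000}-{upper // 1_000_000}M: {share:.0%}")
```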

Audience Reception

The general public’s reaction to a film was measured using IMDB scores. Figure 3 shows that IMDB users rated movies in this dataset relatively high, with scores clustering between 6.5–7.5 out of 10.

Figure 3

Country of Origin

Although the film industry is global in nature, most of the movies in this dataset came from the Western world. Among the 10 countries that produced the most movies, the United States was the frontrunner, with nearly 6x more movies than the second-place UK. The large gap between the US and other countries is surprising considering that countries like China and India also have large film industries, which may indicate a Western bias in the data.

Figure 4

Exploring Movie Elements

To determine what makes a movie successful, I wanted to take a closer look at the elements that make up a movie. To do so, I examined plot and genre keywords associated with movies and used natural language processing to determine which words appeared most frequently. This provided a basis for later analysis that helped me understand whether certain plot and genre keywords were associated with commercially successful movies or movies that were well-received by audiences.

Figure 5. Plot keywords (left) and genre keywords (right)

Figure 5 shows which plot and genre keywords appeared most frequently in movie descriptions. The most common movie plots tend to center on a small number of topics such as sex, friendship, and family. The genres represented are more varied, with thriller, action, comedy, and romance movies appearing most frequently in the dataset.

However, just because a plot or genre is common does not necessarily mean it is associated with a film’s success. Thus, I conducted additional analysis to determine which factors were most strongly associated with movie success.
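
The keyword frequencies behind a word cloud like Figure 5 can be approximated with a simple count. The sketch below uses Python rather than the R used for this analysis, and it assumes each movie's keywords arrive as one pipe-separated string; that layout, the keyword_counts helper, and the sample strings are all assumptions made for illustration.

```python
from collections import Counter

def keyword_counts(keyword_fields, separator="|"):
    """Count how many movies mention each keyword."""
    counts = Counter()
    for field in keyword_fields:
        # Lowercase and strip so "Friend " and "friend" count as one keyword.
        words = {w.strip().lower() for w in field.split(separator) if w.strip()}
        counts.update(words)  # a set, so each movie counts a keyword once
    return counts

# Hypothetical keyword fields, one per movie.
plots = ["friend|family|road trip", "alien|family", "Friend|revenge"]

print(keyword_counts(plots).most_common(2))
```

Counting each keyword at most once per movie keeps a single movie with repeated tags from inflating a keyword's apparent popularity.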

Analysis

Correlational Data

Are commercially successful movies also highly rated by audiences? Figure 6 shows there is a positive correlation between gross amount earned and IMDB score.

Figure 6
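
For readers who want to reproduce this kind of check, the strength of a linear relationship is usually summarized with the Pearson correlation coefficient. The following is a minimal Python sketch, not the R code used for this analysis; the pearson_r helper and the sample values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: gross in millions of dollars and IMDB score.
gross = [10, 40, 90, 200, 650]
score = [5.9, 6.4, 6.1, 7.2, 7.9]

r = pearson_r(gross, score)
print(f"r = {r:.2f}")
```

A value of r near 1 indicates a strong positive linear relationship, a value near 0 indicates little linear relationship, and a value near -1 indicates a strong negative one.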

Although causality cannot be determined, there are a few possible explanations for this trend. First, “early bird” reviews and ratings of films can be made public before new movies are released to the masses, so measures of audience reception such as IMDB scores may affect whether consumers spend money to watch the movie in the first place. The reverse could also be the case. If a movie is commercially successful, it is likely to attract more attention from later moviegoers who then boost its IMDB score.

Figure 7

The top 10 highest-grossing movies in the dataset all received an IMDB score of at least 6.5. However, only two were both well-received by audiences and very commercially successful: Avatar and The Avengers (top right quadrant of Figure 7). This implies that it is difficult for a movie to be both a box office hit and consistently highly rated by audiences.

Moreover, all of the films except Star Wars: Episode IV - A New Hope had relatively large budgets. Movies with larger budgets were not necessarily more commercially successful or well-received, however. For example, The Dark Knight Rises was rated higher than Avengers: Age of Ultron despite having a similarly sized budget. The Dark Knight was also the highest-rated movie among the top 10 highest-grossing movies, but had a mid-sized budget relative to the other movies included. Thus, inflating a movie’s budget past a certain point is unlikely to be effective in increasing its quality or appeal to audiences.

Regression Analysis

The following analyses use linear regressions to determine which factors are significantly associated with gross amount earned and IMDB score for movies in the dataset.

Gross Amount Earned

Plot & Genre

Figure 8: Output from regression of gross earnings on plot and genre keywords

Figure 8 shows the results from a regression of gross amount earned on plot and genre keywords.

The following plot and genre keywords had a statistically significant positive relationship with gross earnings: “action”, “adventure”, “family”, “fantasy”, and “alien”. “Adventure” and “fantasy” movies were associated with the highest and second highest increases in gross amount earned, respectively. This may be because these types of movies tend to showcase themes that appeal to a wider audience spanning different ages and genders.

On the other hand, “horror” and “drama” movies had a statistically significant negative relationship with gross amount earned. A potential explanation for this is that horror and drama movies may be less appealing to moviegoers due to the more serious and emotional content of these movies. Thus, a smaller, more niche group of consumers may be watching these films, which lowers the gross amount earned for films in these categories.
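
To make the mechanics of a keyword regression concrete, here is a small sketch that turns keywords into 0/1 indicator columns and fits ordinary least squares through the normal equations. It is written in Python rather than the R used for this analysis, it does not compute significance tests, and the keyword list, the movies, and the helper names (solve, fit_ols, design_row) are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: indicators may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

KEYWORDS = ["adventure", "horror"]  # hypothetical subset of keywords

def design_row(movie_keywords):
    """Intercept followed by one 0/1 indicator per keyword."""
    return [1.0] + [1.0 if k in movie_keywords else 0.0 for k in KEYWORDS]

# Hypothetical movies: (keywords, gross in millions of dollars).
movies = [
    ({"adventure"}, 120.0),
    ({"adventure"}, 100.0),
    ({"horror"}, 20.0),
    ({"horror"}, 30.0),
    (set(), 50.0),
    (set(), 60.0),
]

X = [design_row(keywords) for keywords, _ in movies]
y = [gross for _, gross in movies]
intercept, adventure_coef, horror_coef = fit_ols(X, y)
print(intercept, adventure_coef, horror_coef)
```

In this toy sample no movie carries both keywords, so each coefficient is simply the difference between that group's average gross and the average for movies with neither keyword. With overlapping keywords, as in the real data, the coefficients instead measure each keyword's association while holding the others fixed.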

Other Factors

I also wanted to determine if other variables apart from plot and genre had a significant effect on gross amount earned. These variables were: whether the movie was in color or not, movie duration, movie Facebook likes, whether the movie was in English or not, and content rating (G, PG, M, R).

Figure 9: Output from regression of gross earnings on other variables

All variables except for a content rating of M had a statistically significant relationship with gross amount earned. Having a content rating of G was associated with the largest increase in gross revenue, whereas a content rating of R was associated with the largest decrease in gross revenue. One explanation for this is that G-rated movies are accessible to a larger audience than R-rated movies and thus sell more tickets. These findings also support the more general conclusion that universally appealing movies (e.g. movies featuring more relatable themes and/or being accessible to a larger group of people) are more likely to be commercially successful.

IMDB Score

Plot & Genre

Figure 10: Output from regression of IMDB score on plot and genre keywords

“Horror”, “action”, “family”, “thriller”, and “female” had a statistically significant negative relationship with IMDB score. Horror and action films in particular were associated with the largest decreases in IMDB rating. I find this interesting because horror movies were associated with both low box office earnings and lower IMDB scores. This implies that horror movies may not be a lucrative genre for film studios to invest in. Moreover, female-driven films were given lower IMDB scores, although being a female-driven film did not have a statistically significant effect on gross earnings. More data is needed to draw reliable conclusions about how featuring more women in film affects a movie’s success, particularly as cinema with women at the center is becoming more common.

Surprisingly, action movies were associated with one of the highest increases in gross amount earned despite having a significant negative relationship with IMDB ratings. Moviegoers may have high expectations when they go to see action movies but are disappointed after watching. It may also be the case that action movies market themselves more effectively prior to release, which encourages more consumers to initially flock to theaters. However, moviegoers may decide that the quality of films in this category is subpar after watching.

“Drama”, “history”, and “adventure” had a statistically significant positive relationship with IMDB score. Adventure movies were associated with increases in both box office revenue and IMDB scores, implying these types of movies are particularly good investments for film studios. However, dramas tended to do worse at the box office despite being associated with higher IMDB scores. This further hints at a disconnect between the audience reception of movies and commercial success.

Some possible explanations are (1) certain genres/plots may only appeal to a niche group of moviegoers who are already likely to rate those types of movies highly and (2) marketing and promotion for certain types of movies (e.g. dramas) may be weaker compared to other films, so fewer people buy tickets to go see these movies. However, those that do watch the films enjoy them and rate them highly.

Other Factors

Figure 11: Output from regression of IMDB score on other variables

Of the other variables analyzed in a separate regression, all but content rating of M were statistically significant. Notably, content ratings of G and R were both significantly positively associated with an increase in IMDB score, although movies that were rated G were associated with a larger increase in IMDB score than R-rated movies.

Surprisingly, a film’s language being English was associated with a statistically significant decrease in IMDB score. However, this is likely due to the overrepresentation of Western, English-language movies in the dataset and may not be a true indicator of audience reception to a film.

Twitter Sentiment

I also ran two separate regressions, one of gross amount earned on Twitter sentiment and one of IMDB score on Twitter sentiment, to understand whether tweets about movies were associated with film success. For the tweets sampled, Twitter sentiment did not have a statistically significant relationship with either gross amount earned or IMDB score.
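
For readers unfamiliar with sentiment scoring, one common and very simple approach is to count positive and negative words in each tweet. The sketch below illustrates that idea in Python; it is not the scoring method or the R code used for this analysis, and the word lists, helper names, and tweets are hypothetical.

```python
import re

# Tiny hypothetical lexicon; a real analysis would use a much larger word list.
POSITIVE = {"great", "love", "amazing", "fun"}
NEGATIVE = {"boring", "bad", "awful", "hate"}

def sentiment_score(tweet):
    """Positive-word count minus negative-word count for one tweet."""
    # No stemming is applied, so "loved" does not match "love".
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def mean_sentiment(tweets):
    """Average tweet score for one movie; 0.0 when there are no tweets."""
    if not tweets:
        return 0.0
    return sum(sentiment_score(t) for t in tweets) / len(tweets)

# Hypothetical tweets about a single movie.
tweets = ["Loved it, amazing visuals!", "So boring. Bad pacing.", "Great fun"]
print(mean_sentiment(tweets))
```

Averaging per movie gives a single sentiment value that can then serve as the explanatory variable in a regression on gross earnings or IMDB score.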

It may be the case that both positive and negative tweets about a movie contribute to its popularity, resulting in more consumers purchasing tickets to watch the movie. This would be consistent with the notion that “any publicity is good publicity”. If this is true, future analysis could examine whether the number of tweets, rather than their sentiment, has a statistically significant relationship with a movie’s commercial success.

Another explanation for these results is that tweets may not accurately reflect the public’s true perceptions of the quality of a movie. This can lead to a disconnect between the sentiment of tweets posted about a movie and the movie’s IMDB score.

Conclusions

  1. Box office success and higher IMDB scores are generally positively correlated. A positive feedback loop may exist between box office success and higher IMDB scores as moviegoers watch a movie, rate it, and then spread the word to their friends and family about the movie. Thus, having a strong opening weekend is likely to boost IMDB scores and improve the long-term success of a film. However, this correlation is relatively weak, so some commercially successful movies may still be rated lower than other films that earn the same amount or higher due to other factors.
  2. Variables such as content rating may be more important to determining a film’s success than plot, genre, and Twitter sentiment. Associations between variables such as content rating and gross amount earned/IMDB score explained more variation in the data than plot, genre and Twitter sentiment regressions did. I had expected that regressions including variables intrinsic to a movie, such as plot and genre, would have had more of an impact on movie success than other factors. However, my analysis supports the conclusion that a movie’s success may be more dependent on how accessible it is to a wider audience (e.g. by having a content rating of G instead of R) rather than on the type of story (e.g. a specific plot or genre) the movie tells.
  3. Movies with large budgets do not necessarily end up being more commercially successful or well-received by audiences. Most recent movies tend to have similarly sized, large budgets, which may be a result of rising production costs within the film industry more broadly.
  4. Female-led films draw mixed reviews from audiences. More female-driven films must be produced and included in datasets to draw more robust conclusions about whether stories about women are associated with greater commercial success or a positive reception from moviegoers.
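
The comparison in the second conclusion, about which regressions explained more variation, is typically made with R-squared, the share of variance in the outcome that a model's fitted values account for. Here is a minimal Python sketch of that calculation (the analysis itself was done in R); the r_squared helper, the scores, and both sets of fitted values are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variance in actual that is explained by predicted."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical IMDB scores and fitted values from two competing models.
actual = [6.0, 7.0, 8.0, 5.0]
model_a = [6.2, 6.8, 7.6, 5.4]  # tracks the scores closely
model_b = [6.5, 6.5, 6.5, 6.5]  # predicts the mean score for every movie

print(round(r_squared(actual, model_a), 2), r_squared(actual, model_b))
```

A model that only ever predicts the average score explains none of the variation, so its R-squared is zero; the closer the fitted values track the actual scores, the closer R-squared gets to one.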

Based on these findings, a film studio should prioritize funding movies with universal themes, relatable plots, and minimal content rating restrictions. This should make the movie appealing and accessible to the largest possible audience without requiring an excessively large budget.

I hope my analysis can be used to fund a wider variety of films that bring consumers back to movie theaters and help us all enjoy the wonder of cinema once again.
