Estimating R&D output with a negative-binomial regression model


Data on firm level R&D investments and patent applications allow us studying the question whether higher investments yield higher R&D outputs approximated by the number of patents. For this study, we use the “PatentsRD” dataset which comes with the “Ecdat” package. It includes seven firm level variables collected for the years between 1983 and 1991. Thus, we have a panel dataset which provides some possibilities to cope with the problem of omitted variables (one aspect of the endogeneity problem in econometric models). We can control for the influence of omitted variables without actually observing them. The idea is that if an omitted variable is constant over time, it cannot be responsible for changes in the dependent variable. Standard OLS regression does not take heterogeneity across groups or time into account.

We have n firms that we observe in T = 9 (1983-1881) points in time.

Basic model:

That is, is constant over time. The effect of can be eliminated if we have data of at least two points in time.


By differencing we get:

We can extent this by not only including fixed effects for firms but also for the time, that is variables that change over time but are the same for all firms such as legislation.

Going back to our dataset (“PatentsRD”), the data frame contains the following variables:

year : year

fi: firm's id

sector: firm's main industry sector, one of aero (aerospace), chem (chemistry), comput (computer), drugs, elec (electricity), food, fuel (fuel and mining), glass, instr (instruments), machin (machinery), metals, other, paper, soft (software), motor (motor vehicules)

geo: geographic area, one of eu (European Union), japan, usa, rotw (rest of the world)

patent: numbers of European patent applications

rdexp: log of R\&D expenditures

spil: log of spillovers

Economic theory tells us that each one of the six independent variables can have an influence on the patent output.

We apply a negative binomial specification since with patent data (which are count data) we are often confronted with the situation of over dispersion which gets expressed in the fact that we have a lot of cases for which we have zero counts. Overdispersion means that (conditional) variance exceeds the (conditional) mean which is at odds with the Poisson distribution. Thus, the negative binomial specification is a generalization of the Poisson regression.


# Load the "Ecdat" library that includes the "PatentsRD" dataset
library ("Ecdat")
library ("pglm")
library ("car")
library ("gplots")

# Load the "PatentsRD" dataset

# Show information about the dataset

# Get a summary of the dataset
summary (PatentsRD)

# The dataset was originally collected for and used
# in a paper called:
# Cincer, Michele (1997) “Patents, R \& D and technological
# spillovers at # the firm level:
# some evidence from econometric count models for panel data”,
# Journal of Applied Econometrics, 12(3), may–june, 265–280.

# By drawing a histogramm, we see that there are a lot of
# firms that have # not patents in a certain year
# This confronts us with the problem of overdispersion and
# suggests applying the negavtive binomial specifcation
hist (PatentsRD$patent)

# We get a second indicator vor overdispersion if we compare
# the mean with the standard deviation
# The unconditional mean of our outcome variable is much
# lower than its variance. 

# Have a look at the data

# Theory suggest that there is a (causal) relationship
# between the R&D output of a firm and its R&D input
# To inspect this, we plot the R&D expenditure against
# the patent output
# This looks as if there is inded a postive relationship
# between R&D expenditures and patents

# We can also see that we have to control for diffrent
# firm locations
scatterplot(patent~year|geo, boxplots  =FALSE, smooth=TRUE,
reg.line  =FALSE, data=PatentsRD)

# And we see that the patent output is quite heterogenous
# across firms and years
# plotmeans draw a 95% confidence interval around the means
plotmeans(patent ~ fi, main="Heterogeineity across firms",
plotmeans(patent ~ year, main="Heterogeineity across years", data=PatentsRD)

# To further analyse this relationship, we can first do
# a standard (pooled) OLS regression
ols <- lm (patent ~ rdexp, data = PatentsRD)
# The "summary" command provides the regression results:
# We see that rdexp is positve and has a very small p-value
summary (ols)
# We see that the R&D expenditures hat a posiitve and
# significant effect on the patent output

# Now wen can plot the regression line

# In the next step, we use a negative binomial fixed effects
# (model = "within") regression model and include the control
# variables
# We also include lage from 0 to 4 years since we expects that
# the effect of R&D is not direclty visible in patent counts.
# The same holds for spillovers
# Also, we include her time and individual fixed effects
fixed <- plm (patent ~ lag(rdexp, k = 0:4) + factor(geo) +
factor(sector) + lag(spil, k = 0:4), data = PatentsRD,
index = c('year', 'fi'), model = "within", family = negbin,
effect = c("individual", "time", "twoways") )

# Then we have a look at the results
summary (fixed)
# Different from the pooled OLS case, we see now that
# rdexp is not significant for all years. We see only
# a positive significant influence for the third and fourth
# year and the case without lag.
# Since we have a log transformed R&D expenditure variable,
# we can make the following interpretation:
# The R&D varibale for the k = 3 case is about 5.8.
# For a 10 % increase in this variable (keeping everythin
# else constant) the (mean) patent output will increase by
# beta_x * log(1.1) = 5.8 * log (1.1) = 0.24

# To check if our decision of including a time effect was
# correct we can run a Lagrange Multiplier Test
plmtest(fixed, c("time"), type=("bp"))
# If this number is < 0.05 then use time-fixed effects

Retrieving OECD data via SDMX

The OECD website offer numerous datasets that we can access directly from R via the SDMX standard. Once we have loaded the data we are free to analyzing and visualizing them. It all starts with the data we are interested in. Innovation economist like to study patent data that reflect the outcome of an inventive process. For instance, we might be interestd in studying patent outputs of OECD countries in the biotech industry. The respective data can be accessed from the following OECD website: There we go to Export and select “SDMC (XML)”. Here we copy the “SDMX DATA URL“. This URL includes not only biotech patents but also nanotech etc. and applicants  as well as invetor data. According to my experience so far the query does not work if we use the full URL that we copied. That is why we have to manally change (reduce) it to get what we are interested in, for instance in the following way to focus on biotech and inventors: ""

Now we can work with this URL in R. Here is an example:

# First we have to load the required libraries

library ("rsdmx")
library ("ggplot2")
library ("dplyr")

# Then we have to identify the url to access the dataset
# by the SDMX standard

url <- ";endTime=2014"

# We read the data and transform the RSDMX object into
# a data frame:

dat <- readSDMX(url)
dat.f <-

# Then we can filter and order the data, for instance
# to visualize the top seven countries which had the
# largest number of patents in the year 2010

sub <- filter(dat.f,obsTime == "2010" ) <- sub [order (-sub$obsValue),] [1:7,3] <- filter(dat.f,LOCATION %in%
# Finally, we plot the data into a line graph

ggplot(, aes(x=obsTime, y=obsValue, group=LOCATION, colour=LOCATION)) +
 xlab("Years") + ylab("Number of patents") +
 ggtitle("Biotech patents") +
 theme_bw() +
 theme(legend.position=c(.7, .4))

Projection of a two-mode innovation network

If you want to analyze the structure of an innovation network you are often faced with the problem that you have two-mode data, e.g. you have data of firms that collaborate in a common R&D project. That is, each firm is connected to the project (Figure 1).Rplot_netWe assume that if firms participate in the same project, then they interact with each other. So we need a projection of a two-mode network into a one-mode network where the actors are linked to each other (Figure 2).Rplot_net.pThis can be done by the igraph package. A small example dataset can be downloaded here: example1.

# load the igraph package
library ("igraph")
# load the example dataset
data <- read.csv2("example1.csv", header = T)
# create an igraph object
net <- (data, directed = F)
# plot the two-mode network, firms are connected
# to the projects
plot (net)
# add a "type" variable to indicate that we have
# two-mode network
V(net)$type<- V(net)$name %in% data [,1]
# project the two-mode into a one-mode network
net.p <-  bipartite.projection(net)$proj2
# plot the projected network
plot (net.p)