# Estimating R&D output with a negative-binomial regression model

Data on firm-level R&D investments and patent applications allow us to study whether higher investments yield higher R&D output, approximated by the number of patents. For this study, we use the “PatentsRD” dataset, which comes with the “Ecdat” package. It includes seven firm-level variables collected for the years 1983 to 1991. Thus, we have a panel dataset, which offers a way to cope with the problem of omitted variables (one aspect of the endogeneity problem in econometric models): we can control for the influence of omitted variables without actually observing them. The idea is that if an omitted variable is constant over time, it cannot be responsible for changes in the dependent variable. Standard (pooled) OLS regression does not take such heterogeneity across firms or time into account.

We have n firms that we observe at T = 9 points in time (1983-1991).

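Basic model, with $y_{it}$ the outcome of firm $i$ in year $t$, $x_{it}$ an observed regressor and $z_i$ an unobserved firm-specific variable:

$$y_{it} = \beta_0 + \beta_1 x_{it} + \beta_2 z_i + u_{it}$$

That is, $z_i$ is constant over time. The effect of $z_i$ can be eliminated if we have data for at least two points in time.

E.g.

$$y_{i1} = \beta_0 + \beta_1 x_{i1} + \beta_2 z_i + u_{i1}$$

$$y_{i2} = \beta_0 + \beta_1 x_{i2} + \beta_2 z_i + u_{i2}$$

By differencing we get:

$$y_{i2} - y_{i1} = \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1})$$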
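The differencing argument can be checked on simulated data. The following is a minimal sketch in base R, not part of the original analysis: the two-period panel, the variable names (`z`, `x1`, `y1`, ...) and all parameter values are assumptions made for illustration. It compares pooled OLS, which ignores the time-constant variable, with a regression in first differences.

```r
# Simulated two-period panel; all names and values are illustrative assumptions
set.seed(42)
n    <- 2000          # number of firms
beta <- 2             # true effect of x on y
z    <- rnorm(n)      # unobserved firm characteristic, constant over time

# x is correlated with z in both periods, so omitting z biases pooled OLS
x1 <- 0.8 * z + rnorm(n)
x2 <- 0.8 * z + rnorm(n)
y1 <- beta * x1 + 3 * z + rnorm(n)
y2 <- beta * x2 + 3 * z + rnorm(n)

# Pooled OLS on the stacked data, leaving z out
y <- c(y1, y2)
x <- c(x1, x2)
b_pooled <- unname(coef(lm(y ~ x))["x"])

# First differences: z drops out of the equation
dy <- y2 - y1
dx <- x2 - x1
b_fd <- unname(coef(lm(dy ~ dx))["dx"])

print(round(c(true = beta, pooled = b_pooled, first_difference = b_fd), 3))
```

In this setup the pooled estimate is pushed away from the true value because `x` picks up part of the effect of `z`, while the first-difference estimate should land close to `beta`.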

We can extend this by including fixed effects not only for firms but also for time, that is, for variables that change over time but are the same for all firms, such as legislation.
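To illustrate firm and time effects together, here is a small sketch on simulated data, again in base R. It uses dummy variables in `lm()` rather than a dedicated panel estimator, and every name and number in it (`firm_effect`, `year_effect`, the coefficient 1.5, and so on) is a hypothetical choice for the example.

```r
# Simulated panel with firm-specific and year-specific unobserved effects
set.seed(1)
n_firms <- 100
years   <- 1983:1991
panel   <- expand.grid(firm = seq_len(n_firms), year = years)

firm_effect <- rnorm(n_firms)                 # constant over time, differs by firm
year_effect <- rnorm(length(years), sd = 2)   # same for all firms, differs by year
fe <- firm_effect[panel$firm]
te <- year_effect[panel$year - min(years) + 1]

# x is correlated with both effects; the true coefficient of x is 1.5
panel$x <- 0.7 * fe + 0.7 * te + rnorm(nrow(panel))
panel$y <- 1.5 * panel$x + 2 * fe + 2 * te + rnorm(nrow(panel))

# Three specifications: no effects, firm dummies only, firm and year dummies
b_pooled <- coef(lm(y ~ x, data = panel))[["x"]]
b_firm   <- coef(lm(y ~ x + factor(firm), data = panel))[["x"]]
b_two    <- coef(lm(y ~ x + factor(firm) + factor(year), data = panel))[["x"]]

print(round(c(pooled = b_pooled, firm_only = b_firm, two_way = b_two), 3))
```

Only the specification with both sets of dummies removes both sources of omitted-variable bias in this simulation; the firm-only version still absorbs the common year shocks into the coefficient of `x`.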

Going back to our dataset (“PatentsRD”), the data frame contains the following variables:

year: year

fi: firm's id

sector: firm's main industry sector, one of aero (aerospace), chem (chemistry), comput (computer), drugs, elec (electricity), food, fuel (fuel and mining), glass, instr (instruments), machin (machinery), metals, other, paper, soft (software), motor (motor vehicles)

geo: geographic area, one of eu (European Union), japan, usa, rotw (rest of the world)

patent: number of European patent applications

rdexp: log of R&D expenditures

spil: log of spillovers

Economic theory tells us that each one of the six independent variables can have an influence on the patent output.

We apply a negative binomial specification because patent data are count data and are often overdispersed, which typically goes along with a large number of zero counts. Overdispersion means that the (conditional) variance exceeds the (conditional) mean, which is at odds with the Poisson distribution, where the two are equal. The negative binomial specification generalizes the Poisson regression by adding a dispersion parameter that allows the variance to exceed the mean.
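The difference between the two distributions can be seen in a short simulation. This is a sketch with made-up parameters (`mu = 3`, `size = 0.5`), not an analysis of the patent data: it draws Poisson and negative binomial counts with the same mean and compares mean, variance, the share of zeros, and a Pearson dispersion statistic from an intercept-only Poisson fit.

```r
# Poisson vs. negative binomial counts with the same mean (illustrative values)
set.seed(123)
n  <- 5000
mu <- 3

y_pois <- rpois(n, lambda = mu)
# For rnbinom(mu, size) the variance is mu + mu^2 / size, here 3 + 9 / 0.5 = 21
y_nb   <- rnbinom(n, mu = mu, size = 0.5)

stats <- rbind(
  poisson = c(mean = mean(y_pois), variance = var(y_pois),
              share_zero = mean(y_pois == 0)),
  negbin  = c(mean = mean(y_nb), variance = var(y_nb),
              share_zero = mean(y_nb == 0))
)
print(round(stats, 2))

# Pearson dispersion statistic: values near 1 are consistent with a Poisson
# model, values well above 1 point to overdispersion
dispersion <- function(y) {
  fit <- glm(y ~ 1, family = poisson)
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}
disp_pois <- dispersion(y_pois)
disp_nb   <- dispersion(y_nb)
print(round(c(poisson = disp_pois, negbin = disp_nb), 2))
```

The same dispersion statistic could be computed for a Poisson regression of the real data to judge whether the negative binomial model is warranted.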


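# Load the "Ecdat" package, which includes the "PatentsRD" dataset,
# and the packages used below: plm and pglm (panel models),
# car (scatterplot) and gplots (plotmeans)
library("Ecdat")
library("plm")
library("pglm")
library("car")
library("gplots")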

data(PatentsRD)

# Show information about the dataset
?PatentsRD

# Get a summary of the dataset
summary (PatentsRD)

# The dataset was originally collected for and used
# in a paper called:
# Cincera, Michele (1997) “Patents, R&D and technological
# spillovers at the firm level:
# some evidence from econometric count models for panel data”,
# Journal of Applied Econometrics, 12(3), May–June, 265–280.

# By drawing a histogram, we see that there are a lot of
# firms that have no patents in a given year.
# This points to overdispersion and suggests applying
# the negative binomial specification
hist(PatentsRD$patent)

# We get a second indicator for overdispersion if we compare
# the mean with the standard deviation and the variance
mean(PatentsRD$patent)
sd(PatentsRD$patent)
var(PatentsRD$patent)
# The unconditional mean of our outcome variable is much
# lower than its variance.

# Have a look at the data
head(PatentsRD)

# Theory suggests that there is a (causal) relationship
# between the R&D output of a firm and its R&D input.
# To inspect this, we plot the R&D expenditure against
# the patent output
plot(patent ~ rdexp, data = PatentsRD)
# This looks as if there is indeed a positive relationship
# between R&D expenditures and patents

# We can also see that we have to control for different
# firm locations
# (car >= 3.0 calls this argument regLine; older versions used reg.line)
scatterplot(patent ~ year | geo, boxplots = FALSE, smooth = TRUE,
            regLine = FALSE, data = PatentsRD)

# And we see that the patent output is quite heterogeneous
# across firms and years.
# plotmeans draws a 95% confidence interval around the means
plotmeans(patent ~ fi, main = "Heterogeneity across firms", data = PatentsRD)
plotmeans(patent ~ year, main = "Heterogeneity across years", data = PatentsRD)

# To further analyse this relationship, we can first do
# a standard (pooled) OLS regression
ols <- lm(patent ~ rdexp, data = PatentsRD)

# The "summary" command provides the regression results:
# we see that rdexp is positive and has a very small p-value
summary(ols)
# We see that the R&D expenditures have a positive and
# significant effect on the patent output

# Now we can plot the regression line
plot(patent ~ rdexp, data = PatentsRD)
abline(ols)

# In the next step, we use a negative binomial fixed effects
# (model = "within") regression model and include the control
# variables.
# We also include lags from 0 to 4 years since we expect that
# the effect of R&D is not directly visible in patent counts.
# The same holds for spillovers.
# The firm fixed effects come from model = "within"; the time
# effects enter as year dummies, factor(year).
# Note: pglm() (not plm()) accepts family = negbin, and the index
# lists the firm id first and the time variable second.
fixed <- pglm(patent ~ lag(rdexp, k = 0:4) + factor(geo) +
                factor(sector) + lag(spil, k = 0:4) + factor(year),
              data = PatentsRD, index = c("fi", "year"),
              model = "within", family = negbin, effect = "individual")

# Then we have a look at the results
summary(fixed)
# Different from the pooled OLS case, we now see that
# rdexp is not significant for all lags. We only see
# a positive significant influence for the third and fourth
# lag and for the case without lag.
# Since the R&D expenditure variable is log-transformed and
# the negative binomial model has a log link, we can make
# the following interpretation:
# The coefficient of the R&D variable for the k = 3 case is about 5.8.
# For a 10 % increase in R&D expenditures (keeping everything
# else constant), the log of the expected patent count changes by
# beta_x * log(1.1) = 5.8 * log(1.1) = 0.55 (natural log),
# i.e. the expected count is multiplied by 1.1^5.8, about 1.74.
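The arithmetic behind this interpretation can be checked directly. The sketch below assumes that `rdexp` is a natural logarithm and that the model has a log link, so the coefficient acts on the log of the expected count; the value 5.8 is the coefficient quoted above, and the variable names are illustrative.

```r
# Effect of a 10 % increase in R&D expenditures for a coefficient of 5.8
beta     <- 5.8    # coefficient quoted above for the k = 3 lag
increase <- 0.10   # 10 % rise in R&D expenditures

# Change in the log of the expected patent count; R's log() is the natural log
delta_log <- beta * log(1 + increase)

# Implied multiplicative change in the expected count, equal to 1.1^beta
ratio <- exp(delta_log)

# For comparison: the same product with a base-10 logarithm
delta_log10 <- beta * log10(1 + increase)

print(round(c(delta_log = delta_log, ratio = ratio,
              delta_log10 = delta_log10), 3))
```

The base of the logarithm matters here: `log()` and `log10()` give clearly different numbers, and only the natural-log version matches a model with a log link.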

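# To check whether our decision to include a time effect was
# correct, we can run a Lagrange multiplier (Breusch-Pagan) test.
# plmtest() expects a linear plm model rather than a pglm fit, so
# we apply it to a pooled linear model as an approximate check
pooled <- plm(patent ~ lag(rdexp, k = 0:4) + factor(geo) +
                factor(sector) + lag(spil, k = 0:4),
              data = PatentsRD, index = c("fi", "year"), model = "pooling")
plmtest(pooled, effect = "time", type = "bp")
# If the p-value is < 0.05, time effects are significant
# and should be included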
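As a complementary check that needs only base R, the year dummies can be tested jointly with an F-test. Note that this is a different test from the Lagrange multiplier test above, and it is shown here on simulated data with hypothetical names and values (`year_shock`, 80 firms), not on the patent data.

```r
# Joint F-test for year dummies in a linear model with firm dummies
set.seed(7)
years <- 1983:1991
panel <- expand.grid(firm = 1:80, year = years)

year_shock <- rnorm(length(years))   # common shock per year (illustrative)
panel$x <- rnorm(nrow(panel))
panel$y <- 1 + 0.5 * panel$x +
  year_shock[panel$year - min(years) + 1] + rnorm(nrow(panel))

# Nested models: with and without the year dummies
without_time <- lm(y ~ x + factor(firm), data = panel)
with_time    <- lm(y ~ x + factor(firm) + factor(year), data = panel)

f_test  <- anova(without_time, with_time)
p_value <- f_test[["Pr(>F)"]][2]
print(f_test)
```

A small `p_value` indicates that the year dummies are jointly significant, which supports keeping time effects in the model.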