
Quantitative Trading with R: Understanding Mathematical and Computational Tools from a Quant's Perspective

Harry Georgakopoulos
Quantitative Trading with R offers readers a glimpse into the daily activities of quants/traders who deal with financial data analysis and the formulation of model-driven trading strategies.
Based on the author's own experience as a quant, lecturer, and high-frequency trader, this book illuminates many of the problems that these professionals encounter on a daily basis. Answers to some of the more relevant questions are provided, and the easy-to-follow examples show the reader how to build functional R computer code in the process.
Georgakopoulos has written an invaluable introductory work for students, researchers, and practitioners alike. Anyone interested in applying programming, mathematical, and financial concepts to the creation and analysis of simple trading strategies will benefit from the lessons provided in this book. Accessible yet comprehensive, Quantitative Trading with R focuses on helping readers achieve practical competency in utilizing the popular R language for data exploration and strategy development.
Engaging and straightforward in his explanations, Georgakopoulos outlines basic trading concepts and walks the reader through the necessary math, data analysis, finance, and programming that quants/traders rely on. To increase retention and impact, individual case studies are split up into smaller modules. Chapters contain a balanced mix of mathematics, finance, and programming theory, and cover diverse topics such as statistics, data analysis, time series manipulation, back-testing, and R programming.
In Quantitative Trading with R, Georgakopoulos offers up a highly readable yet in-depth guidebook. Readers will emerge better acquainted with the R language and the relevant packages that are used by academics and practitioners in the quantitative trading realm.
Publisher:
Palgrave Macmillan


Quantitative Trading with R
Understanding Mathematical and Computational Tools from a Quant's Perspective

Harry Georgakopoulos

Copyright © Folk Creations, Inc., 2015. All rights reserved.

First published in 2015 by PALGRAVE MACMILLAN® in the United States, a division of St. Martin's Press LLC, 175 Fifth Avenue, New York, NY 10010.

Where this book is distributed in the UK, Europe and the rest of the world, this is by Palgrave Macmillan, a division of Macmillan Publishers Limited, registered in England, company number 785998, of Houndmills, Basingstoke, Hampshire RG21 6XS.

Palgrave Macmillan is the global academic imprint of the above companies and has companies and representatives throughout the world. Palgrave® and Macmillan® are registered trademarks in the United States, the United Kingdom, Europe and other countries.

ISBN: 978–1–137–35407–5

Library of Congress Cataloging-in-Publication Data
Georgakopoulos, Harry.
Quantitative trading with R : understanding mathematical and computational tools from a quant's perspective / Harry Georgakopoulos.
pages cm
ISBN 978–1–137–35407–5 (hardback)
ISBN 1–137–35407–0 ()
1. Stocks--Mathematical models. 2. Investment analysis--Mathematical models. 3. Corporations--Finance--Computer programs. 4. Commodity exchanges. I. Title.
HG4661.G46 2015
332.640285'5133–dc23
2014028408

A catalogue record of the book is available from the British Library.
Design by Newgen Knowledge Works (P) Ltd., Chennai, India.
First edition: January 2015
10 9 8 7 6 5 4 3 2 1
Printed in the United States of America.

To Pinelopi, Maria, and Anastasia

Contents

List of Figures
List of Tables
Acknowledgments

1  An Overview
     The mission statement
     Financial markets and instruments
     Trading strategies
     High-frequency trading
     About the orderbook
     Trading automation
     Where to get data from
     Summary

2  Tools of the Trade
     The R language
     Getting started with R
     The c() object
     The matrix() object
     The data.frame() object
     The list() object
     The new.env() object
     Using the plot() function
     Functional programming
     Writing functions in R
     Branching and looping
     A recommended style guide
     A pairwise correlation example
     Summary

3  Working with Data
     Getting data into R
     Installing packages in R
     Storing and transmitting data
     Extracting data from a spreadsheet
     Accessing a database
     The dplyr package
     Using the xts package
     Using the quantmod package
     Charting with quantmod
     Graphing with ggplot2
     Summary

4  Basic Statistics and Probability
     What is a statistic?
     Population versus sample
     Central Limit Theorem in R
     Unbiasedness and efficiency
     Probability basics
     Random variables
     Probabilities
     Probability distributions
     Bayes versus frequentist approach
     Simulations of coins
     On the use of RStan
     Summary

5  Intermediate Statistics and Probability
     Random process
     Stock price distributions
     Stationarity
     Determining stationarity with urca
     Assumptions of normality
     Correlation
     Filtering data
     R formulas
     The linear in linear regression
     Volatility
     Summary

6  Spreads, Betas and Risk
     Defining the stock spread
     Ordinary Least Squares versus Total Least Squares
     Constructing the spread
     Signal generation and validation
     Trading the spread
     Considering the risk
     More on the equity curve
     Strategy attributes
     Summary

7  Backtesting with Quantstrat
     Backtesting methodology
     About blotter and PerformanceAnalytics
     Initial setup
     The first strategy: A simple trend follower
     Backtesting the first strategy
     Evaluating the performance
     The second strategy: Cumulative Connors RSI
     Evaluating the mean-reverting strategy
     Summary

8  High-Frequency Data
     High-frequency quotes
     Inter-quote arrival times
     Identifying liquidity regimes
     The micro-price
     Distributions and autocorrelations
     The highfrequency package
     Summary

9  Options
     Option theoretical value
     A history of options
     Valuation of options
     Exploring options trade data
     Implied volatility
     Summary

10 Optimization
     The motivating parabola
     Newton's method
     The brute-force approach
     R optimization routines
     A curve-fitting exercise
     Portfolio optimization
     Summary

11 Speed, Testing, and Reporting
     Runtime execution improvements
     Benchmarking R code
     The Rcpp solution
     Calling R from C++ with RInside
     Writing unit tests with testthat
     Using knitr for documentation
     Summary

Notes
References
Index

Figures

1.1   Topic graph.
1.2   An orderbook.
2.1   R console.
2.2   Default plot of a vector.
2.3   Vector line plot.
2.4   A plot with extra attributes.
2.5   Four plots on a canvas.
2.6   Graph with legend and text.
2.7   If statement flow chart.
2.8   Sample stock price file.
2.9   Correlation matrix output.
2.10  Pairwise scatter plot.
3.1   Time series data of AAPL price.
3.2   Elementary stock line chart.
3.3   Sample signals sheet.
3.4   Sample strength sheet.
3.5   First xts plot.
3.6   An xts candle plot.
3.7   Subsetting an xts plot.
3.8   An xts plot with ablines.
3.9   Plots of time differences between trades.
3.10  AAPL data from quantmod.
3.11  Chartseries of stock prices with indicators.
3.12  Chartseries of stock prices with indicators recharted.
3.13  Chartseries of stock prices with custom indicator.
3.14  Cuts applied to a vector.
3.15  Volume profile with ggplot2.
4.1   Population versus sample.
4.2   Gaussian population.
4.3   Convergence with large N.
4.4   Convergence to a normal.
4.5   Nonnormal population.
4.6   Nonnormal sample distribution.
4.7   Bias versus efficiency.
4.8   Estimating variance.
4.9   Estimator consistency.
4.10  Random variable as a mapping.
4.11  Coin toss mass function.
4.12  The normal distribution.
4.13  Examples of continuous distributions.
4.14  Coin flip density.
4.15  Bayesian prior to posterior.
5.1   Random process.
5.2   Price distribution.
5.3   Four price distributions.
5.4   Stationary inputs to black box.
5.5   Stationary return distributions.
5.6   Stock prices and returns.
5.7   Normal histogram with density.
5.8   Leptokurtic and platykurtic.
5.9   Quantile-quantile plots.
5.10  Erroneous value.
5.11  Typical scatter plot of returns.
5.12  Outliers in VXX.
5.13  Removal of outliers.
5.14  Regression output on plot.
5.15  Diagnostics plots.
5.16  Scatter plot with lag.
5.17  Cross correlation of returns.
5.18  Line versus parabola.
5.19  Square root transformation.
5.20  Autocorrelation of returns.
5.21  Autocorrelation of theoretical squared returns.
5.22  Autocorrelation of actual squared returns.
6.1   Pepsi versus Coke price changes.
6.2   Ordinary least squares distance minimization.
6.3   Total least squares distance minimization.
6.4   Total least squares regression between SPY and AAPL.
6.5   The AAPL versus SPY spread.
6.6   Out of sample AAPL versus SPY spread.
6.7   Rolling beta spread.
6.8   Price difference spread for SPY versus AAPL.
6.9   Out of sample spread with bands.
6.10  Buy and sell signals superimposed.
6.11  The SPY versus AAPL equity curve.
6.12  Two equity curves with same ending value.
6.13  Two equity curves with different ending values.
6.14  Two equity curves with a different number of trades.
6.15  Drawdown curves.
6.16  Histogram of trade duration.
7.1   Clenow equity curve comparison.
7.2   Clenow XLB equity curve.
7.3   Connors equity curve comparison.
7.4   Connors XLB equity curve and indicators.
8.1   SPY bid price with outliers.
8.2   SPY bid price filtered.
8.3   SPY intraday bid ask spread.
8.4   SPY intraday spread histograms.
8.5   Micro-price between bid and ask.
8.6   Micro-price distribution.
8.7   Micro-price distribution with normal superimposed.
8.8   Micro-price autocorrelation of returns.
8.9   Trade price autocorrelation of returns.
8.10  Trade price distribution of returns.
8.11  Histogram of price levels.
8.12  Autocorrelation of trade prices.
9.1   Black Scholes inputs.
9.2   Option sensitivities.
9.3   Short dated option sensitivities.
9.4   Option traded volumes.
9.5   Spread of option quotes.
9.6   Call and put premiums.
9.7   Implied volatility skew.
10.1  Parabolic function and its derivative.
10.2  Rates and maturities graph.
10.3  Rates graph with polynomial fit.
10.4  Rates graph with 2 polynomial fits.
10.5  Rates graph with piecewise polynomial fits.
10.6  Rates graph with loess fits.
10.7  Equal and optimized weights.
11.1  Setting up knitr and pdflatex.
11.2  Sample pdf document output.
11.3  A pdf with a ggplot2 plot.

Tables

1.1   Globex volume estimates by year
1.2   Popular stock exchanges
1.3   Popular futures exchanges
1.4   Active contract volumes
3.1   Popular relational databases
3.2   Sample data row from dplyr
3.3   Output of summarise for dplyr

Acknowledgments

You know that saying about standing on the shoulders of giants? Well, this book is dedicated to all those giants who, in one way or another, inspired and guided my work throughout the years.
This book would not have been possible without the contribution of people like Jeff Ryan, Dirk Eddelbuettel, Ilya Kipnis, Hadley Wickham, Joshua Ulrich, Romain Francois, Guy Yollin, Bernhard Pfaff, Eric Zivot, Paul Teetor, Yihui Xie, Peter Carl, Jan Humme, Brian G. Peterson, Thomas Hutchinson, Steven Todd, Dimitrios Liakakos, Ed Zarek, and many others.

First and foremost, I would like to thank Ilya Kipnis for contributing excellent content on the backtesting of trading strategies via the use of the quantstrat package. Ilya maintains an insightful blog here: http://quantstrattrader.wordpress.com/. He is also a prolific R developer, and his projects can be found on GitHub here: www.github.com/IlyaKipnis.

My gratitude and appreciation also go out to Tick Data, Inc. for graciously providing historical intraday data for use throughout this book. Tick Data, Inc. provides research-quality historical market data solutions to practitioners and academics. Their website is www.TickData.com. Readers of this book get a special 20 percent discount on all data purchases from Tick Data's online store by using promo-code: 13FC20.

Dirk Eddelbuettel (of Rcpp, RProtoBuf and RQuantLib fame) was gracious enough to provide guidance and insight during the beginning stages of this publication. I would like to thank him for this contribution of his, among those of many others within the R community.

A big thank you goes out to the graduate students of FINC 621 (Financial Mathematics and Modeling II) at Loyola University in Chicago for inspiring a lot of the content in this book.

Last, but not least, I am grateful to the R-core team, as well as the numerous third-party contributors for maintaining and improving the R language, which has become such an integral part of my daily work routine.

1  An Overview

My primary intent in writing this book is to provide the reader with basic programming, financial, and mathematical tools that can be successfully leveraged both in industry and academia.
I cover the use of the R programming language, as well as the R environment as a means for manipulating financial market data and for solving a subset of problems that quants and traders typically encounter in their day-to-day activities. The chapters that follow should be treated as a tutorial on a recommended set of tools that I have personally found useful and that have served me well during the last few years of my career as a quant trader/developer. I am writing this book from the vantage point of a quant practitioner and not that of an academic. A significant portion of the content is based on my lecture notes from a graduate level class in quantitative finance that I teach on a part-time basis at Loyola University in Chicago.

This is an introductory-level book. No prior programming experience or advanced mathematical knowledge is assumed. Having said this, some chapters will tend to flow easier if you have had some prior exposure to the following topics. On the math side, I recommend a review of basic calculus, linear algebra, statistics, and probability [1]. On the programming side, familiarity with VBA, Python, and SQL [2] is helpful. This book is also aimed at practitioners and seasoned traders who want to learn more about how to conduct data analysis on financial data and how to write useful R scripts to automate some of their workflow.

Trading and programming are vast topics in their own right, and by no means will I attempt to give a thorough explanation of each concept. You will not become an expert programmer by reading this book, nor will you make a ton of money in the markets by following my advice. This book will, however, provide tools and ideas that can assist in the analysis, implementation, and presentation of trading strategies and other related quantitative topics. Figure 1.1 provides an illustration of the items I will address in subsequent chapters.
[Figure 1.1 Topic graph. Nodes: Strategy, Risk, Reward, Data, Efficiency, Probability, R, Optimization, Statistics, Visualization, Reporting.]

The mission statement

I will attempt to take a somewhat fuzzy concept, that of creating a trading strategy, and provide plausible answers to some questions that will naturally arise. Questions like the following:

• How can I automate some of my trading ideas?
• What programming language should I use and why?
• What are the mathematical, financial, and programming tools needed to evaluate my strategy?
• Where do I get the data to test a trading strategy?
• How do I know that my strategy is any good?
• How do I present my results to others?

Most books on programming can be used as references. You go to the index and find the topic that interests you, and then you simply go to that particular page for further information. To get the most out of this book, I recommend that you do not follow this approach. Rather, start from the beginning and read all the chapters in a linear fashion. There is a method behind this madness. I intend to expose you to a methodology of thinking about quantitative finance and to give you the confidence to tackle some of the real-world problems that naturally arise in this context. And you will accomplish all this while utilizing R to automate the required tasks.

It is prudent to form a mental map of where we are headed and what obstacles lie in our path. One of our end goals will be to obtain the necessary programming skills so as to tackle some very specific problems that quants and traders typically care about. The other end goal will be to manipulate financial data and to use mathematical techniques to evaluate trading strategies. For the purpose of making these goals more concrete, I will bake them directly into a mission statement.
Here is a first attempt at such a statement:

    We will come up with an automated trading strategy that will trade a portfolio of liquid instruments in the market. The strategy will be efficient, robust, and scalable. Furthermore, the strategy will be profitable and have low risk.

Here are some questions that might arise after reading the mission statement:

1. What is a market?
2. What is meant by instruments, and furthermore, what is meant by liquid instruments?
3. What is a trading strategy, and how does one go about formulating such a thing?
4. How is profitability of a trading strategy defined?
5. What is risk? Specifically, how can one quantify risk in the context of a trading strategy?
6. How can a trading strategy be automated?
7. What is meant by efficiency?

Financial markets and instruments

A market is either a physical or a virtual place where transactions occur between participants. In ancient Greece, the Athenian citizens would gather in the agora3 and trade honey, olive oil, other agricultural products, and works of art in exchange for similar items. Transactions would be carried out with in-kind merchandise and currency. Similar marketplaces existed all over the ancient world. In those physical markets, as in today's physical markets, participants would have to physically meet and agree both on the price and the terms of delivery before a transaction was confirmed.

Today, many of the physical marketplaces of the world are giving way to virtual ones. Take amazon.com, ebay.com, and alibaba.com as examples of this trend. These are vast online marketplaces where buyers and sellers interact and transact entirely via computer. Similar trends have been occurring in the financial markets over the last few years. The old floor pits of the futures, stocks, and options exchanges are giving way to electronic platforms. Table 1.1 lists approximate electronic versus floor trading volume percentages on the CME exchange.
Globex refers to the Chicago Mercantile Exchange (CME) electronic platform.

Table 1.1 Globex volume estimates by year (CME volume profile)

    Year    Globex (%)    Other (%)
    1992    0.2           99.8
    1993    0.4           99.6
    1994    0.5           99.5
    ...     ...           ...
    2007    78            22
    2008    82            18
    2009    86            14
    ...     ...           ...
    2014    93            7

An organized market's primary objective is to bring together participants who are willing to trade their goods and services at an agreed-upon price. A secondary objective of a successful marketplace is to facilitate the orderly conduct of such transactions. Electronic financial markets certainly fit this description. Over the years, literally hundreds of financial exchanges and alternate electronic venues have popped up all over the globe. Some of the more notable stock trading venues are outlined in Table 1.2. Some of the more notable futures exchanges are listed in Table 1.3. Such exchanges enable the efficient and orderly transaction of standardized financial contracts.

Table 1.2 Popular stock exchanges

    Country    Name             Website
    USA        NYSE             www.nyse.nyx.com
    USA        NASDAQ           www.nasdaqomx.com
    USA        BATS             www.batstrading.com
    USA        Direct Edge      www.directedge.com
    Japan      TSE              www.tse.or.jp/english
    UK         LSE              www.nasdaqomx.com
    UK         NYSE Euronext    www.euronext.com

Table 1.3 Popular futures exchanges

    Country    Name        Website
    USA        CME         www.cmegroup.com
    USA        ICE         www.theice.com
    UK         Euronext    www.euronext.com
    Germany    Eurex       www.eurexchange.com

Financial instruments are stocks, futures, bonds, currencies, vanilla options, exotic options, swaps, swaptions, and so forth. Some of these instruments have become more popular than others [35]. The E-mini financial futures and eurodollar contracts traded on the CME, for example, are some of the most liquid contracts in the world. Investors and traders rely on these to manage their market and interest rate risks on a daily basis.
The following table lists the average daily volumes of a few of these CME futures contracts [36].

Table 1.4 Active contract volumes (most active contracts)

    Contract                 Ticker    Volume
    Eurodollar               GE        2,482,899
    E-mini S&P 500           ES        1,804,469
    10-Year treasury note    ZN        1,322,388
    5-Year treasury note     ZF        773,183

Liquidity is really a measure of how easy it is to buy or sell a certain quantity of an instrument at a favorable price. Liquidity can be loosely thought of as a proxy for transacted volume. Investors and traders love liquid products because they can potentially trade a lot of size in those products without adversely affecting the price.

Trading strategies

Financial institutions are some of the most heavily regulated entities on the planet.4 This is both a good and a bad thing. It is bad, because this tends to stifle competition and innovation. It is good, because this enables market participants to have more confidence in the validity and fairness of their transactions. Having confidence in the validity of one's trades is a very important concept. This is what enables investors and traders to invest large amounts of time and money in the development of electronic trading platforms that actively participate in these electronic marketplaces.

Trading strategies arise when an investor or a trader spots an opportunity that can be legally exploited by placing buy or sell orders for certain financial instruments. These transactions have to be offset at a later point in time in order for the trader to realize a profit on the trades. Trading strategies have holding periods that can range anywhere from microseconds to years. Various flavors of trading strategies exist: long term, short term, opportunistic, high-frequency, low latency, and so forth. Whatever the nomenclature used, a trading strategy is just a set of well-defined rules that traders apply with the end goal of making a profit in the financial markets.
These strategies exploit either structural effects in the marketplace, statistical relationships between financial instruments, or some kind of external information flow.

High-frequency trading

To most people, the term high-frequency trading refers to trading strategies in which trades execute within a very small amount of time. In reality, high-frequency trading is a broader term than that. The holding periods of trades do not necessarily have to be in the subsecond range. It is often the case that trades are entered into the market within a small time span, but kept on for a much longer amount of time. The term also refers to algorithms that continuously scour the market for opportunities, but only place a limited number of trades when such an opportunity arises. These algorithms still process market data at a high frequency, but they do not execute trades at such frequencies. To make matters even more confusing, high-frequency trading is also sometimes referred to as low-latency trading. Since these trading approaches always involve the use of fast computers and network connections, it is natural that such naming conflicts exist. Most of these strategies can, furthermore, be broken down into market-making, market-taking, and opportunistic strategies.

Figure 1.2 An orderbook. Each entry is shown as quantity, number of orders.

    Bids        Price    Offers
                12       150, 5
                11       230, 1
                10       202, 3
                9        100, 4
    125, 1      8
    200, 4      7
    300, 5      6
    210, 8      5

A high-frequency, market-making strategy is one in which the trading algorithm constantly provides liquidity to other market participants. There is inherent risk that the active investor (taking the other side of the trade) has more information than the passive (making) algorithm. Some exchanges offer monetary compensation to market-making traders due to this information asymmetry. The profit that market-making strategies capture is relatively small compared to that of their taking counterparts.
What they lack in profitability per trade, though, they make up in volume. Such a market-making strategy might be willing to capture only 1 tick of edge for every trade. A tick refers to the minimum price increment allowed by the exchange for any instrument. In actuality, due to adverse price moves, such strategies might only end up capturing a small fraction of a tick when all is said and done.

Market-taking strategies will, on average, cross the bid-ask spread whenever a trading opportunity manifests. Such a trading opportunity might be the result of a buy or sell signal from an algorithm that is analyzing market data, or from a bullish/bearish news event. Immediacy of execution is usually favored over price in these types of strategies. Market-taking strategies require more edge per trade than their making counterparts.

About the orderbook

An orderbook can be conceptualized as the aggregation of buy and sell limit orders5 from all market participants. The orderbook is a convenient data structure employed by traders and exchanges for the purposes of click-trading and/or matching orders between buyers and sellers. An orderbook can be depicted as having all the resting bids on the left vertical side and all the offers on the right vertical side. Figure 1.2 lists the prices, volume and trade-count per level in such a vertical arrangement [29].6

The best-bid price is the highest price at which a passive trader would be willing to buy. The best-ask, or best-offer, price is the lowest price at which a passive trader would be willing to sell. In our example, the best-bid price is 8, and the best-offer price is 9. The quantities associated with the best-bid and best-ask prices are 125 and 100 respectively. We also observe that there is 1 distinct order on the best-bid and 4 distinct orders on the best-offer. Anytime there is a change in any one of these numbers, we say a market data event has occurred.
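As a rough sketch, the orderbook of Figure 1.2 can be held in a small data frame and queried for the best bid and the best offer. The column names (side, price, qty, orders) are illustrative assumptions rather than a standard layout, and the deeper price levels are read off the figure as described above.

```r
# Hypothetical representation of the orderbook in Figure 1.2.
# Column names are assumptions made for this illustration.
book <- data.frame(
  side   = c(rep("ask", 4), rep("bid", 4)),
  price  = c(12, 11, 10, 9, 8, 7, 6, 5),
  qty    = c(150, 230, 202, 100, 125, 200, 300, 210),
  orders = c(5, 1, 3, 4, 1, 4, 5, 8),
  stringsAsFactors = FALSE
)

bids <- book[book$side == "bid", ]
asks <- book[book$side == "ask", ]

# Best bid: highest resting buy price. Best ask: lowest resting sell price.
best_bid <- bids[which.max(bids$price), ]
best_ask <- asks[which.min(asks$price), ]

best_bid$price   # 8
best_ask$price   # 9

# Bid-ask spread, in price units
spread <- best_ask$price - best_bid$price
spread           # 1
```

Any change to a row of book (a new order, a cancellation, a fill) would correspond to one market data event in the sense used above.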
A market data event might be the cancellation of a 1-lot on the 7 price level or even the arrival of a new order on the second best-ask price. In this case, the new quantity reflected at the second best-bid would become 199, and the second best-ask quantity would also increase. Given that multiple products trade on any one exchange, it is easy to see that the amount of information transmitted and stored by the exchanges is enormous. If we couple this with the fact that exchanges also have to relay order-status information to traders on a real-time basis, then one really starts to appreciate the complexity of the infrastructure involved in processing and storing all of this data.

Trading automation

Not too long ago, the vast majority of trades occurred in the pits. The typical order flow went as follows: investors would inform their brokers of their trading intentions over the phone or even in person. The brokers would then relay that information to their guys in the pits. If a counterparty wanted to take the other side of the trade, a transaction would occur and a confirmation would be sent back to the broker. In turn, the broker would convey the fill price and quantity to the customer.

Today, technology allows us to enter and exit trades at a much faster pace. Automated trading platforms such as those offered by Interactive Brokers, Etrade, and Ameritrade, just to mention a few, can be used to enter orders into the market at a rapid pace and at low cost. The matching between a buyer and a seller happens almost instantaneously within centralized matching engines that are hosted by the exchanges themselves. Matching engines are computer systems that keep track of all the order flow and make sure the correct number of contracts is allocated to the buyers and sellers in a timely fashion. Once a transaction is established by the matching engine, an order acknowledgment report is sent back to the trader who initiated the trades.
The exchanges also disseminate types of information other than the order reports. These include market data, as well as exchange-related status messages. Many proprietary trading firms, banks, and hedge funds interact with the exchanges via direct connections to their matching engines and market data feeds. The exchanges take advantage of the "need for speed" that these institutions have and offer collocation facilities to these traders at a hefty premium. Collocation simply means that the traders can store their servers as close to the exchange as physically possible. If a particular trading strategy is very sensitive to latency, then this collocation service is worth paying for.

The automated trading industry has evolved into one in which coding, math, and critical thinking skills are valued just as much as financial skills, sometimes even more so. Swarms of coders from fields such as engineering, computer science, physics and applied mathematics can be found working at various hedge funds and trading firms worldwide. For the most part, these technologists deal with issues in network infrastructure, data storage, system integration, data processing, visualization, and algorithm development.

In today's electronic markets, the boundaries between the roles of trader, technologist, and quant are indeed becoming blurred. The acumen to process, analyze, and present information from disparate sources, as well as the ability to develop trading strategies based on the results, are widely sought-after skill sets in the industry. It is my prediction that in the not too distant future, the word trader will mean something entirely different from what it means today. A trader will likely be a savvy technologist with a keen sense of market dynamics who also utilizes a data-driven approach to build, monitor, and exploit profitable trading strategies.
For a comprehensive discussion of market connectivity, execution, algorithmic trading and automation in general, Barry Johnson's book titled Algorithmic Trading and DMA: An Introduction to Direct Access Trading Strategies is highly recommended. The website for the book is located here: http://www.algo-dma.com.

Where to get data from

Financial data is the lifeblood of any successful trading business. Prior to trading, it is important to get a feel for how a strategy can behave in production. We accomplish this by analyzing historical data and by monitoring real-time data of proposed strategies, market events, and economic announcements. But where can we find such data? Specifically, where can we find time series data for stock prices, option prices, futures prices, and other economic indicator series? For daily granularity, a few of the popular and free choices are

• Yahoo Finance: http://finance.yahoo.com/
• Quandl: http://www.quandl.com/
• Google Finance: http://www.google.com/finance
• Federal Reserve Bank Of St. Louis: http://research.stlouisfed.org/fred2/

We will utilize the first option heavily in this book. Most of the data will be daily open, high, low, and close prices for single stock names and Exchange Traded Funds (ETFs). This data is available for free, and thus comes with the usual caveats. One needs to be vigilant about the quality of such data and put processes in place to clean, filter, and potentially delete erroneous values.

The best alternative is to actually capture the data on your own. This requires a lot of infrastructure and can be a costly and time-consuming endeavor. This is the approach that a lot of institutional traders, banks, and proprietary trading firms take. They accomplish this by being directly connected to the various exchanges around the world. Typically, servers are co-located right at the exchanges, and these machines record the data locally onto disks.
The raw data is subsequently processed, cleaned, and stored in a database for future retrieval.

The intermediate solution is to obtain the data from a third-party vendor. Many such vendors exist, and the quality of the data can range from poor to excellent. A few recommended data vendors are

• Tick Data, Inc: http://www.tickdata.com/
• Bloomberg: http://www.bloomberg.com/enterprise/data/
• Thomson Reuters: http://thomsonreuters.com/financial/market-data/
• CME Group: http://www.cmegroup.com/market-data/datamine-historical-data/
• NYSE Market Data: http://www.nyxdata.com/Data-Products/NYSE
• Hanweck Associates: http://www.hanweckassoc.com/
• Activ Financial: http://www.activfinancial.com/
• Markit: http://www.markit.com/

Summary

The first few sections of this chapter elaborate on the purpose and potential audience of this book. Some math and programming resources are recommended as useful guides for understanding the technical discussions that will be presented in upcoming chapters. For the purpose of motivating the subsequent analysis, a mission statement is presented that outlines the end goals of a desirable trading strategy. A brief explanation of financial markets, financial instruments and trading strategies is provided. The chapter ends with a discussion of high-frequency trading, the automation of such strategies, and the issue of obtaining daily and intra-day financial data.

2  Tools of the Trade

The primary tools that quants and traders rely on to perform their daily activities include intuition, data, computer hardware, computer software, mathematics, and finance. They utilize these tools in ingenious ways as a means to an end. The end, of course, is the generation of consistent profits in the financial markets. Many traders have done well for themselves by relying on intuition alone. But intuition alone, on average, will not yield superior results.
A tool chest of sorts is required in order to maximize the quant/trader's chances of producing consistent and favorable outcomes. A programming language is one such tool. In this book, we will learn how to wield the R programming language for the purposes of manipulating data, performing math operations, automating workflows, displaying informative visualizations, creating reproducible results, and doing lots of other cool stuff.

The R language

R [88] is an open-source scripting language that has become very popular among statisticians, data science practitioners, and academics over the years. It is a functional programming language by nature, but it also supports the object-oriented and imperative programming paradigms.1 In some sense, R is both a programming language and a development framework. The framework has support for some advanced graphing capabilities and provides access to multiple state-of-the-art statistical packages. The language itself supports conditional statements, loops, functions, classes, and most of the other constructs with which VBA and C++ users are familiar. The plethora of contributed packages by third parties, a solid user-base, and a strong open-source community are some other key strengths of R.

The R system can be divided into two conceptual parts:

1. the base installation downloadable from CRAN
2. everything else

The base R installation contains, among other things, the necessary code to run R. Many useful functions and libraries are also part of this base package. Some of these include: utils, stats, datasets, graphics, grDevices, and methods.

R is a dialect of the S language. The S language was developed by John Chambers, Rick Becker, and Allan Wilks at Bell Laboratories in the late 1970s. It started off as a collection of Fortran libraries. The language was used internally at Bell Labs for statistical analysis.
In 1988, the bulk of the system was rewritten in C, and in 1998, version 4 of the language was released [93]. Insightful Corporation was granted an exclusive license in 1993 to commercially develop and market software products related to the S language.

R was created by Ross Ihaka and Robert Gentleman in 1991 at the University of Auckland, New Zealand. Legend has it that the name R is a play on the name S, as well as the fact that the names Ross and Robert both start with an R. R became public in 1993, and in 1995 the GNU General Public License (GPL) was used to effectively make R free software [96]. The R Core Group was formed in 1997. This group currently controls the R source code and is responsible for all the major software releases. The first version of R (R 1.0.0) was released in 2000. R version 3.0 was released in April 2013 and included support for long vectors, among other major improvements.

R is free, has thousands of add-on statistical and analytical libraries available for download, is very popular in the quantitative finance and academic fields, allows for the rapid prototyping of ideas, and is the programming language with which the author has the most experience. Theoretically speaking, we can use any Turing-complete2 programming language to accomplish our goals. Some, of course, are easier than others to learn and apply.

The following examples illustrate how the same "Hello World!" output can be obtained in different programming languages. These examples are referenced from Wikipedia [124]:

In Assembly x86-64 Linux:

    .section .rodata
    string:
        .ascii "Hello, world!\n"
    length:
        .quad . -string
    .section .text
    .globl _start
    _start:
        movq $4, %rax
        movq $1, %rbx
        movq $string, %rcx
        movq length, %rdx
        int $0x80
        movq %rax, %rbx
        movq $1, %rax
        int $0x80

In Brainfuck:

    ++++++++++[>+++++++>
    ++++++++++>+++>+>+++
    +<<<<<-]> ++.>+.+++++++
    ..+++.>>>++++.<<++.<+++
    +++++.--------.+++.------.-------.>+.>.
In C++:

    #include <iostream>
    int main()
    {
        std::cout << "Hello, world!" << std::endl;
        return 0;
    }

In Julia:

    println("Hello, world!")

In Swift:

    println("Hello, world!")

In Python 2:

    print "Hello, world!"

In R:

    cat("Hello, world!\n")

Ideally, we need a programming language that allows for the rapid prototyping of an idea, provides instant feedback to the programmer (interpreted versus compiled execution),3 enables the creation of charts and more complex graphics in an efficient manner, exposes additional functionality through the inclusion of third-party libraries, is platform independent (can be used on Windows, Mac, and Linux), has methods for easily accessing and manipulating data, allows for extensibility (being able to write custom functions not only in R itself, but also C, Fortran, or even C++), is fast, and is free.

Quite a few computational frameworks satisfy most of the above requirements. No programming languages exist, however, that satisfy all of these requirements. The most commonly used languages in the quantitative-finance realm include (in no particular order): Python, R, Matlab, Julia,4 C++, C#, Java, and VBA. As a general rule, Python, Julia, Matlab, VBA and R are used as prototyping/research languages, whereas C++, C#, and Java tend to be used for more application/infrastructure development.

Getting started with R

Here are the instructions for installing R on either a Windows or a Mac machine.

1. Navigate to the following website: http://www.r-project.org/. Click on the download R link in the "Getting Started" section, or alternatively, click on the "CRAN" link in the left-hand menu.
2. Select the appropriate mirror site that is closest to where the target machine resides.
3. Select the appropriate base package for Linux, Mac, or Windows.
4. Select all the default options for the installation of choice.

After the installation is complete, click on the resulting R-icon to bring up the console.

[Figure 2.1 R console.]
Figure 2.1 illustrates what the R console looks like on a Mac computer. There are at least three ways to enter commands into R:

1. by typing expressions directly into the console
2. by copying code from an open text file and by pasting that code into the R console
3. by sourcing (via the source() command) the code from an external file

The next few examples demonstrate some of this basic R functionality. The recommended workflow for following along is to enter the commands in a text file and then copy them into the R console. This will make it easier to repeat or amend statements. The history() function will assist with the recollection of previously entered commands. Typing the following into R will return the last four commands entered: history(4). To get the entire history of a session, we can use the history(Inf) statement.

First and foremost, R can be used as a scientific calculator. All the usual mathematical operations can directly be entered and evaluated in the console. Operations such as addition, subtraction, multiplication, division, and exponentiation are referenced by the known symbols (+, -, *, /, ^). The basic building blocks in R are numbers, strings, and booleans.

Numbers are the familiar symbols 1, 2, 3, 1e+06, 3.1415, and so forth.

Strings are character sequences of 0 or more symbols encapsulated by either single or double quotes. Examples are

• "1"
• ""
• ""
• "this is a string"
• 'Hello'
• '2 + 4'

Boolean variables evaluate to either a TRUE or a FALSE statement. In R, the following evaluates to TRUE: 1 + 2 == 3. Notice the use of the double equal sign. The single equal sign is reserved for assignment operations. The following R expression would yield an error: 1 + 2 = 3.

A few other symbols that show up often in R programs are the open and closed parentheses (), the open and close curly braces {}, the open and closed square brackets [], and the assignment operator <-.
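The three building blocks described above can be inspected with a short sketch that uses only base R. The variable names are arbitrary and chosen just for this illustration.

```r
# Numbers, strings, and booleans are the basic building blocks
a_number  <- 1e+06
a_string  <- "this is a string"
a_boolean <- 1 + 2 == 3        # comparison uses the double equal sign

class(a_number)    # "numeric"
class(a_string)    # "character"
class(a_boolean)   # "logical"

# An empty string is still a valid string; it simply has zero characters
nchar("")          # 0

# Single and double quotes produce the same string
identical('Hello', "Hello")    # TRUE
```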
Here are some example expressions that will return numerical answers:

    1 + 1
    sqrt(2)
    20 + (26.8 * 23.4) / 2 + exp(1.34) * cos(1)
    sin(1)
    5^4
    sqrt(-1 + 0i)

Advanced math operations are also possible in R. The following code-snippet creates a function called integrand() and then calls the R-defined function integrate(). Writing custom functions in R will be covered in detail in a subsequent section. The purpose of this example is to show that advanced numerical algorithms are readily accessible by calling pre-canned functionality:

    integrand <- function(x) 1 / ((x + 1) * sqrt(x))
    integrate(integrand, lower = 0, upper = Inf)

The above code evaluates the integral:

$$\int_{0}^{\infty} \frac{1}{(x + 1)\sqrt{x}} \, dx \tag{2.1}$$

The assignment of a value to a variable is accomplished via the <- operator:

    x <- 3
    x <- x + 1
    z <- x ^ 2
    z <- "hello quants"
    y <- "a"
    Z <- sqrt(2)
    new.X <- 2.3

A few things to notice from the previous example are the following:

• In R, expressions are case sensitive. The variable z is not the same as Z.
• Spaces or special characters are not allowed within variable names. The dot (.) and the underscore (_) are exceptions. It is perfectly valid to have a variable name start with a dot (e.g., .myVar). Variable names are not allowed to start with numeric characters.
• Variables in R do not have to be declared as int, double, or string as in other languages. R dynamically figures out what the type of the variable is during run-time.
• Contents of variables can be copied into other variables. The example z <- x ^ 2 does not actually modify the value of x. Rather, x is squared, and the result is assigned to the new variable z.
• Other languages use the = operator in place of the <- operator to denote assignment. R is capable of supporting both conventions. The <- operator will be used throughout the rest of this book for consistency purposes.

Entering 5+4 should return the value 9. The [1] before the 9 means that this is the first element of a vector.
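The points in the list above can be checked directly in the console. The snippet below is a small sketch in base R; the variable names a, b, and w are made up for the illustration.

```r
# Dynamic typing: the same name can hold a number and later a string
x <- 3
class(x)      # "numeric"
x <- "three"
class(x)      # "character"

# Copying: changing the copy leaves the original untouched
a <- 10
b <- a
b <- b + 1
a             # 10
b             # 11

# Case sensitivity: z and Z are two different variables
z <- 1
Z <- 2
z == Z        # FALSE

# The = operator also performs assignment
w = 5

# Even a single number is a vector of length one,
# which is why the console prefixes the result with [1]
length(5 + 4) # 1
```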
We will use two number signs in a row (##) to denote the output of a piece of code. The single number sign will be used for comments within the code. This is simply a convention we will adopt for this book. They are both equally valid comment initiators.

    5 + 4
    ## [1] 9

White spaces between the numbers and the + sign do not affect the output. Neither do they produce any warnings. The expression 5 + 4 still yields 9. Warnings and errors usually occur when R cannot successfully parse a command into meaningful code. For example, if one forgets to type in the + sign, 5 4 will issue the following error: Error: unexpected numeric constant in "5 4".

Before any meaningful work with data can be conducted, that data has to be stored inside a suitable container. The important data containers in R are

• vector
• matrix
• data frame
• list
• environment

Once the data is placed inside such a data structure, it can be manipulated in various ways.

The c() object

A vector can be thought of as a 1-dimensional array. Vectors only hold data of the same type. That is to say, only numbers, only characters or only booleans can be placed inside a vector. The following example creates three vectors of type numeric and character:

    first_vector <- c(1, 2, 3, 4, 5, 6)
    second_vector <- c("a", "b", "hello")
    third_vector <- c("a", 2, 23)

The concatenation operator c() is used to create a vector of numbers, characters, or booleans. The third example mixes numbers with characters. R will convert the type of any numeric value into characters. Typing the variable name into the R console reveals the contents of our newly created vectors:

    first_vector
    ## [1] 1 2 3 4 5 6

    third_vector
    ## [1] "a"  "2"  "23"

The concatenation operator c() can also be used to combine existing vectors into larger ones:

    new_vector <- c(first_vector, 7, 8, 9)
    new_vector
    ## [1] 1 2 3 4 5 6 7 8 9

The extraction of elements from within a vector can be accomplished via a call to the [] operator.
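To see the type conversion performed by c() in action, the class() function reports the type that the resulting vector ended up with. This is a minimal sketch in base R; the names mixed_1 and mixed_2 are made up for the illustration.

```r
first_vector <- c(1, 2, 3, 4, 5, 6)
third_vector <- c("a", 2, 23)

class(first_vector)   # "numeric"
class(third_vector)   # "character"

# Booleans mixed with numbers become numbers (TRUE -> 1, FALSE -> 0)
mixed_1 <- c(TRUE, FALSE, 3)
mixed_1               # 1 0 3

# Anything mixed with a string becomes a string
mixed_2 <- c(TRUE, 2, "b")
mixed_2               # "TRUE" "2" "b"
```

The general pattern is that c() settles on the most flexible type present, with character being more flexible than numeric, and numeric more flexible than boolean.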
The following examples illustrate various operations that can be performed on vectors. The idea of an index becomes important when we start talking about extracting elements from containers. R uses a 1-based indexing scheme (i.e., the first element of a vector has an index of 1). This is in contrast to other languages (e.g., C++, Java, Python) in which the first element has an index of 0.

The first example specifies a single index to use for extracting the data. The second example specifies two indexes. Notice the c() operator that is used to create a vector of indexes. These indexes are subsequently used to extract the elements from the initial vector. This method of extracting data elements from containers is very important, and we will use it over and over again:

    # Extract the 4th element
    example_1 <- new_vector[4]

    # Extract the 5th and the 8th elements
    example_2 <- new_vector[c(5, 8)]
    example_2
    ## [1] 5 8

The following examples address another important concept, that of vectorization. Instead of performing operations on one element at a time, vectorization allows us to perform the same operation on all the elements at the same time. Conceptually, we can treat a vector of numbers as a single number. Here are some rudimentary examples:

    x <- c(1, 5, 10, 15, 20)
    ## [1]  1  5 10 15 20

    x2 <- 2 * x
    ## [1]  2 10 20 30 40

    x3 <- x ^ 2
    ## [1]   1  25 100 225 400

    x4 <- x / x2
    ## [1] 0.5 0.5 0.5 0.5 0.5

    x5 <- round(x * (x / x2) ^ 3.5 + sqrt(x4), 3)
    ## [1] 0.795 1.149 1.591 2.033 2.475

    x6 <- round(c(c(x2[2:4], x3[1:2]), x5[4]), 2)
    ## [1] 10.00 20.00 30.00  1.00 25.00  2.03

• Vectorization allows us to avoid looping through all the elements of the vector. Rather, the operation of interest is performed on all the elements at once.
• If we only wanted to perform an operation on the second and fourth elements of our vector, we would have to "index" into the vector and extract the elements of interest (y <- x[2] + x[4]).
The last example, x6, combines some of the operations discussed earlier in this tutorial. We are extracting specific elements from vectors x2, x3, and x5, and then concatenating them into a single vector. The result of the operation is then rounded to 2 decimal places.

The matrix() object

A matrix can be thought of as a two-dimensional vector. Matrices also hold data of the same type. The following code defines a matrix with two rows and three columns. In R, matrices are stored in columnar format.

    my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
    my_matrix
    ##      [,1] [,2] [,3]
    ## [1,]    1    3    5
    ## [2,]    2    4    6

The default matrix() command assumes that the input data will be arranged in columnar format. In order to arrange the data in row format, we need to modify our previous example slightly:

    my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
    my_matrix
    ##      [,1] [,2] [,3]
    ## [1,]    1    2    3
    ## [2,]    4    5    6

A matrix is an object. Objects have attributes. Attributes are extra pieces of information that adorn objects. For matrices, a useful attribute is a character vector of names for the rows and another character vector of names for the columns. These can both be entered into a list object and assigned to the matrix via the dimnames() function:

    dimnames(my_matrix) <- list(c("one", "hello"), c("column1", "column2", "c3"))
    my_matrix
    ##       column1 column2 c3
    ## one         1       2  3
    ## hello       4       5  6

We can query the object for its attributes:

    attributes(my_matrix)
    ## $dim
    ## [1] 2 3
    ##
    ## $dimnames
    ## $dimnames[[1]]
    ## [1] "one"   "hello"
    ##
    ## $dimnames[[2]]
    ## [1] "column1" "column2" "c3"

This output tells us that the matrix has two rows and three columns. It also tells us what the row and column names are. The extraction of elements from a matrix can be accomplished via the use of the [,] operator.
To extract the element located in row 1 and column 3, we need to issue the following command:

    ans <- my_matrix[1, 3]
    ans
    ## [1] 3

Operations on matrices can also be vectorized. Because my_matrix carries row and column names at this point, the results carry them as well:

    new_matrix_1 <- my_matrix * my_matrix
    new_matrix_1
    ##       column1 column2 c3
    ## one         1       4  9
    ## hello      16      25 36

    new_matrix_2 <- sqrt(my_matrix)
    new_matrix_2
    ##       column1  column2       c3
    ## one         1 1.414214 1.732051
    ## hello       2 2.236068 2.449490

Here are some examples that utilize vectorization and single element operations. The values are random draws, so the output will differ from run to run:

    mat1 <- matrix(rnorm(1000), nrow = 100)
    round(mat1[1:5, 2:6], 3)
    ##        [,1]   [,2]   [,3]   [,4]   [,5]
    ## [1,] -1.544  1.281  1.397  0.407 -0.459
    ## [2,]  0.483  0.046 -1.817 -0.289  0.597
    ## [3,]  0.405  1.045 -0.726 -0.163  0.258
    ## [4,]  0.141 -0.294 -1.225 -0.217 -0.771
    ## [5,] -0.537  0.226  0.126 -1.584 -1.237

    mat2 <- mat1[1:25, ] ^ 2
    head(round(mat2, 0), 9)[, 1:7]
    ##       [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    ##  [1,]    1    2    2    2    0    0    7
    ##  [2,]    0    0    0    3    0    0    0
    ##  [3,]    0    0    1    1    0    0    1
    ##  [4,]    0    0    0    2    0    1    4
    ##  [5,]    1    0    0    0    3    2    1
    ##  [6,]    2    1    3    1    1    1    1
    ##  [7,]    0    0    0    0    0    1    0
    ##  [8,]    1    2    0    0    1    2    0
    ##  [9,]    0    0    3    0    2    2    0

The data.frame() object

It often helps to think of a data.frame() object as a single spreadsheet. A data frame is a hybrid, two-dimensional container that can include numeric, character, boolean, and factor types. Whenever data is read into R from an external environment, it is likely that the resulting object will end up being a data frame. The following code creates such a structure:

    df <- data.frame(price = c(89.2, 23.2, 21.2),
      symbol = c("MOT", "AAPL", "IBM"),
      action = c("Buy", "Sell", "Buy"))
    df
    ##   price symbol action
    ## 1  89.2    MOT    Buy
    ## 2  23.2   AAPL   Sell
    ## 3  21.2    IBM    Buy

A data frame accepts columns of data as input. Different names can be assigned to each column of data. In a data frame, as in a matrix, it is important to ensure that the number of rows is the same for all columns. The data need to be in rectangular format. If this is not the case, R will issue an error message.
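The hybrid nature of a data frame, and the rectangular requirement, can be checked directly. The following is a minimal sketch; the trades object, its is_buy column, and the variable names are assumptions made for illustration and do not appear elsewhere in this chapter.

```r
# A small data frame with one numeric, one character, and one boolean column
# (illustrative data; the names are assumptions)
trades <- data.frame(
  price  = c(89.2, 23.2, 21.2),
  symbol = c("MOT", "AAPL", "IBM"),
  is_buy = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column keeps its own type
column_types <- sapply(trades, class)

# dim() returns the number of rows followed by the number of columns
dimensions <- dim(trades)

# Columns of unequal length violate the rectangular requirement.
# tryCatch() captures the error message instead of halting the script.
mismatch_message <- tryCatch(
  data.frame(a = 1:3, b = 1:4),
  error = function(e) conditionMessage(e)
)
```

Wrapping the failing call in tryCatch() is only a convenience for scripts; typed directly into the console, the same call simply prints the error.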
Factors are a convenient data type that can assist in the categorization and analysis of data. For our subsequent analysis we will not be needing these constructs. In order to disable the conversion of any character vector into a factor, we can use the stringsAsFactors = FALSE argument within the data.frame() call. (Since R 4.0.0, stringsAsFactors defaults to FALSE, so the factor output shown below for df appears only in earlier versions or when stringsAsFactors = TRUE is passed explicitly.)

    df3 <- data.frame(price = c(89.2, 23.2, 21.2),
      symbol = c("MOT", "AAPL", "IBM"),
      action = c("Buy", "Sell", "Buy"),
      stringsAsFactors = FALSE)
    class(df3$symbol)
    ## [1] "character"

Some takeaways from the previous examples are the following:

• Functions can take multiple input arguments. To figure out what arguments are available for standard R functions, use the ? operator in front of the function name, e.g., ?data.frame.
• Objects can be passed directly into other functions. Functions are objects. In fact, everything in R is an object!

Data frames can also be indexed via the [,] operator:

    price <- df[1, 1]
    price
    ## [1] 89.2

Columns of unequal length produce an error:

    df2 <- data.frame(col1 = c(1, 2, 3), col2 = c(1, 2, 3, 4))
    ## Error in data.frame(col1 = c(1,2,3),
    ## col2 = c(1,2,3,4)) : arguments imply
    ## differing number of rows: 3, 4

The $ operator extracts data columns by name:

    symbols <- df$symbol
    symbols
    ## [1] MOT AAPL IBM
    ## Levels: AAPL IBM MOT

The "Levels" descriptor for the symbols column implies that the type of variable is a "factor":

    class(symbols)
    ## [1] "factor"

The symbols column from the df3 data frame, however, yields a character vector instead:

    symbols <- df3$symbol
    symbols
    ## [1] "MOT"  "AAPL" "IBM"

The list() object

A list object is one of those data structures that is very useful to R programmers. It is one of the most general containers in the sense that it can store objects of different types and sizes.
The following code creates a list and populates it with three separate objects:

    my_list <- list(a = c(1, 2, 3, 4, 5),
      b = matrix(1:10, nrow = 2, ncol = 5),
      c = data.frame(price = c(89.3, 98.2, 21.2),
        stock = c("MOT", "IBM", "CSCO")))
    my_list
    ## $a
    ## [1] 1 2 3 4 5
    ##
    ## $b
    ##      [,1] [,2] [,3] [,4] [,5]
    ## [1,]    1    3    5    7    9
    ## [2,]    2    4    6    8   10
    ##
    ## $c
    ##   price stock
    ## 1  89.3   MOT
    ## 2  98.2   IBM
    ## 3  21.2  CSCO

The first element of the list my_list is named a, and it holds a numeric vector of length 5. The second component is a matrix, and the third one, a data frame. Many functions in R use this list structure as a general container to hold the results of computations. Lists can be indexed by passing a number (the index of the list element) or by passing the element name into the double bracket operator [[]]:

    first_element <- my_list[[1]]
    first_element
    ## [1] 1 2 3 4 5

    class(first_element)
    ## [1] "numeric"

An alternate extraction method is the following:

    second_element <- my_list[["b"]]
    second_element
    ##      [,1] [,2] [,3] [,4] [,5]
    ## [1,]    1    3    5    7    9
    ## [2,]    2    4    6    8   10

    class(second_element)
    ## [1] "matrix"

(In R 4.0.0 and later, class() on a matrix reports "matrix" "array".)

The single bracket operator [] is used to extract a section of a list. This is a source of confusion for many novice R programmers. As a reminder, double brackets [[]] return list elements, whereas single brackets return lists. Here's an example:

    part_of_list <- my_list[c(1, 3)]
    part_of_list
    ## $a
    ## [1] 1 2 3 4 5
    ##
    ## $c
    ##   price stock
    ## 1  89.3   MOT
    ## 2  98.2   IBM
    ## 3  21.2  CSCO

    class(part_of_list)
    ## [1] "list"

The size of the list can be determined by calling the length() function.

    size_of_list <- length(my_list)
    size_of_list
    ## [1] 3

The new.env() object

An environment is a powerful data structure that R leverages quite a bit under the hood for performance reasons.
It differs from the other structures in that it has reference semantics. It is most similar to the list object, with an added reference to a parent environment. Environments are often used to emulate hash maps with O(1) lookup performance. The link http://adv-r.had.co.nz/Environments.html contains more in-depth information on the semantics and use of this construct. A new environment can be created with the new.env() command:

    env <- new.env()
    env[["first"]] <- 5
    env[["second"]] <- 6
    env$third <- 7

Just as in a list object, the assignment of name-value pairs can be accomplished via the $ or [[]] operators. Here is where the differences begin. Typing the name env into the console reveals neither the names nor the associated data we are accustomed to seeing:

    env
    ## <environment: 0x101ef2f18>

Instead, we get back a cryptic hexadecimal code. To obtain the names, we have to use the ls() command:

    ls(env)
    ## [1] "first"  "second" "third"

To obtain the values associated with those names, we can use the get() command:

    get("first", envir = env)
    ## [1] 5

Removing elements from an environment is accomplished via the rm() command:

    rm("second", envir = env)
    ls(env)
    ## [1] "first" "third"

The copy and modify rules we have covered thus far for lists, data frames, matrices, and vectors do not apply to environments. Due to the reference semantics, when we create a copy of an environment and then proceed to modify one of its elements, the elements of the original object will also be modified. Here is an example:

    env_2 <- env
    env_2$third <- 42
    get("third", envir = env)
    ## [1] 42

Figure 2.2 Default plot of a vector.

Using the plot() function

One of the highlights of the R programming framework is the rich graphing functionality that even the base installation exposes to end users. Most of the advanced graphing functionality, however, is provided by external packages such as ggplot2, ggvis, rCharts, and rgl.
The graphics CRAN task view has a nice list of relevant packages. For the bulk of our subsequent work, we will stick to the basic plot() command, which is more than adequate in satisfying our graphical needs. Here is what a default plot() output produces:

    # Create a vector of numbers x and plot them
    x <- c(1, 2, 3.2, 4, 3, 2.1, 9, 19)
    plot(x)

The type argument can be used to modify the graph from a points-plot to a line-plot:

    # Convert the graph into a line plot
    plot(x, type = "l")

Figure 2.3 Vector line plot.

A call to ?plot reveals a few more useful plot types:

• "p" for points
• "l" for lines
• "b" for both
• "c" for the lines part alone of "b"
• "o" for both overplotted
• "h" for histogram like (or high-density) vertical lines
• "s" for stair steps
• "S" for other steps
• "n" for no plotting

It helps to think of a plot as a painting canvas. A blank canvas is created by calling the plot() command with some default data and arguments. Lines and text are then drawn on the canvas by issuing calls to the respective functions. The following example demonstrates the creation of a plot with a main title, axis-labels, and a basic grid. A vertical and a horizontal line are also placed on the graph after the initial points have been rendered by plot():

    # Set up the canvas
    plot(rnorm(1000), main = "Some returns", cex.main = 0.9,
      xlab = "Time", ylab = "Returns")

    # Superimpose a basic grid
    grid()

    # Create a few vertical and horizontal lines
    abline(v = 400, lwd = 2, lty = 1)
    abline(h = 2, lwd = 3, lty = 3)

Further information on v, h, lwd, lty, and other arguments of abline() can be found by calling ?abline. The lwd argument defines the line-width, and the lty argument defines the line-type.

Figure 2.4 A plot with extra attributes.
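The canvas idea can be taken one step further by layering points and text on top of an existing plot. The following is a minimal sketch under stated assumptions: the simulated returns, the seed, and the variable names are made up for illustration, and the pdf(NULL) call opens a null device only so that the example also runs in a non-interactive session.

```r
# Open a null device so that no window or file is required.
# In an interactive session, pdf(NULL) and dev.off() can be omitted.
pdf(NULL)

# Simulated returns, used purely for illustration
set.seed(1)
simulated_returns <- rnorm(250, mean = 0, sd = 0.01)

# Set up the canvas with one vertical line per observation
plot(simulated_returns, type = "h", main = "Simulated returns",
  xlab = "Time", ylab = "Returns")

# Layer a dashed horizontal line at the sample mean
abline(h = mean(simulated_returns), lty = 2)

# Mark and label the largest observation
peak_index <- which.max(simulated_returns)
points(peak_index, simulated_returns[peak_index], pch = 19)
text(peak_index, simulated_returns[peak_index], labels = "max", pos = 1)

# Close the device
invisible(dev.off())
```

Each call after plot() draws on top of what is already there, which is why the order of the calls matters.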
The par() command is used to query or set up global graphical parameters that can be used by all subsequent calls to plot(). The following code splits the viewing window into a rectangular format with two rows and two columns. A plot() command can then be issued for each one of the child windows. Lines and text can subsequently be added to each unique child plot:

    # Create a 2-row, 2-column format
    par(mfrow = c(2, 2))

    # First plot (points).
    plot(rnorm(100), main = "Graph 1")

    # Second plot (lines).
    plot(rnorm(100), main = "Graph 2", type = "l")

    # Third plot (steps) with a vertical line
    plot(rnorm(100), main = "Graph 3", type = "s")
    abline(v = 50, lwd = 4)

    # Fourth plot
    plot(rnorm(100), type = "h", main = "Graph 4")

    # Reset the plot window
    par(mfrow = c(1, 1))

It becomes evident that the order in which arguments are passed into functions does not matter, as long as they are given proper names. Behind the scenes, R uses either named matching or positional matching to figure out the correct assignment. The call to plot(x, main = "Hello", type = "h") is identical to plot(x, type = "h", main = "Hello").

Figure 2.5 Four plots on a canvas.

Here's how to add some text and a legend to the plot:

    plot(rnorm(100), main = "A line plot", cex.main = 0.8,
      xlab = "x-axis", ylab = "y-axis", type = "l")

    # Extra text
    mtext("Some text at the top", side = 3)

    # At x = 40 and y = -1 coordinates
    legend(40, -1, "A legend")

There are many settings that can be enabled within plot(). Entering ?plot.default in the console will list what these are.
Alternatively, the formals() function can be used to extract the arguments of the function:

    formals(plot.default)
    ## $x
    ##
    ## $y
    ## NULL
    ##
    ## $type
    ## [1] "p"
    ## ...

Figure 2.6 Graph with legend and text.

Functional programming

Functional programming is a programming style that is often used by R programmers. The main actor in the functional programming arena is the function itself. The idea that differentiates this type of programming from others (e.g., imperative programming) is the concept of nonmutability of state [85]. The resulting answer is solely a function of the input data. Nothing outside of the function is affected. The answers obtained from a program are exactly the same given the same input data. There exists a solid mathematical theory underlying the ideas inherent in functional programming. R, per se, does not impose a strict functional programming framework. Imperative, object-oriented, and functional concepts can be, and often are, combined to yield R programs.

How do we add up all the numbers between 1 and 100? Here is a vectorized, functional approach:

    ans <- sum(1:100)
    ans
    ## [1] 5050

Functional programming helps abstract away many unnecessary details. For example, if all we want to do is sum a vector of numbers, then sum(x), where x is a vector, seems like the most intuitive thing to do. There is no need to loop through the vector, assign the value of each element to a temporary variable, keep track of the sum, and manage the mechanics of the branching logic, for or while loops. Internally, of course, there are for loops and imperative programming concepts at work. This is how the abstraction is accomplished in the first place. In R, this is typically implemented in a low-level language such as C, C++ or Fortran.
The following is an imperative example in R:

    answer <- 0
    for(i in 1:100) {
      answer <- answer + i
    }
    answer
    ## [1] 5050

Most operations in R can be vectorized and implemented using a functional programming paradigm. Not only will the resulting code be faster (in most cases), it will also be more succinct and less error prone.

Writing functions in R

Programming is like baking a cake. Given the raw ingredients (data and parameters), the application of a specific recipe will yield a cake as the end product. The data typically reaches the program in real time, through a network connection in almost real time, or through a database query of some kind. The recipe (program) provides the necessary instructions that transform and combine the data into the desired end result. At a basic level, a program can be thought of as one long sequence of instructions. Using the cake analogy, the high-level pseudocode of the program might look something like this:

1. Take all the eggs out of the refrigerator and put them on the counter.
2. Consciously make an effort to move your right hand over the eggs.
3. Zero-in on one of the eggs and spread your fingers apart.
4. Lower your hand and clamp your fingers.
5. Lift your arm and rotate your body clockwise by 60 degrees.
6. Move your right arm up and down in short bursts over the edge of the bowl.

Etc.

And we have only cracked one egg at this point! One can imagine how tedious this becomes when the same instructions have to be repeated over and over again for all the eggs, let alone all the other ingredients. One way to alleviate this pain is to abstract away or to encapsulate some of the repetitious tasks. This is what functions do. Functions can be thought of as little black boxes that take some, or no inputs, and return one, many, or no outputs.
We can conceptually create a function that takes as an input parameter the number of eggs at our disposal and simply returns TRUE when all the eggs have been cracked successfully and FALSE otherwise. A function declaration helps define the interface between the user and the underlying function logic. An example of such a declaration in C++ would be: bool CrackEggs(int). This means that the function CrackEggs() takes as an input an integer and returns a boolean. R does not force us to declare what the type of the inputs or the output needs to be up front. R can infer what the inputs are at run time. The statements CrackEggs(4) and CrackEggs("wow") will both work in R. In the first case, R realizes that the input to the function is a number, whereas in the second case, the input to the function is a character vector. Depending on the internal implementation of the function, the second call might issue an error message. We need to tell R that CrackEggs() is a function that takes in one input and returns a boolean. The function keyword is used for this purpose:

    crack_eggs <- function(number_of_eggs) {
      # Code that determines whether eggs have been cracked.
      # If they have, set have_all_eggs_been_cracked <- TRUE,
      # otherwise, set to FALSE
      have_all_eggs_been_cracked <- TRUE  # placeholder so the sketch runs
      return(have_all_eggs_been_cracked)
    }

Now, our recipe can look something like this:

1. gather_ingredients()
2. crack_eggs()
3. add_ingredients()
4. bake_cake()

And this, in a nutshell, is how abstraction works. Throughout this book, we will attempt to write small functions with the aim of accomplishing a certain task and then group these functions into larger programs to get our end result. The base R installation includes many predefined functions. Unfortunately, it is difficult for newcomers to find detailed information on how to use these functions. This is also one of the few reasons why R is deemed to have a steep learning curve at first.
The documentation included with the base version of R lacks rigor, and is short on examples and recommended use cases. Having said this, there are many great tutorials on most aspects of the R language available online and in print. Some of the references included at the end of this book should steer the reader in the right direction. The following examples contain a few functions that are worth memorizing:

    # Create 100 standard normals
    x <- rnorm(100, mean = 0, sd = 1)

    # Find the length of the vector x.
    length(x)

    # Compute the mean of x
    mean(x)

    # Compute the standard deviation of x
    sd(x)

    # Compute the median value of the vector x
    median(x)

    # Compute the range (min, max) of a variable
    range(x)

    # Find the sum of all the numbers in x
    sum(x)

    # Do a cumulative sum of the values in x
    cumsum(x)

    # Display the first 3 elements of x
    head(x, 3)

    # Display summary statistics on x
    summary(x)

    # Sort x from largest to smallest.
    sort(x, decreasing = TRUE)

    # Compute the successive difference in x
    diff(x)

    # Create an integer sequence from 1 to 10
    1:10

    # A sequence from 1 to 10 in steps of 0.1
    seq(1, 10, 0.1)

    # Print the string hello to the screen
    print("hello")

Branching and looping

Before we move on to the fun stuff, here is a cursory look at the if() and for() commands. These constructs work in the same way as they do in most of the other programming languages. The if() command acts as a branching mechanism that redirects the flow of the program based on the evaluation of the boolean expression that is passed in as an argument.

Figure 2.7 If statement flow chart. (If the condition is TRUE, do Task A; if it is FALSE, do Task B.)

The following example will always print "XYZ" to the screen since the number 1 is most certainly not equal to 2:

    # Define a boolean variable
    my_boolean <- 1 == 2

    if (my_boolean) {
      print("not correct")
    } else {
      print("XYZ")
    }

The commands ifelse() and switch() are also used for controlling the flow of execution.
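Since ifelse() and switch() are only mentioned in passing, a short sketch may help. The return values, labels, and the describe_action() helper below are hypothetical and exist only for illustration. ifelse() is the vectorized counterpart of if(): it evaluates a condition for every element of a vector. switch() selects one of several alternatives based on a single value, with an optional unnamed default at the end.

```r
# Illustrative daily returns (made-up numbers)
daily_returns <- c(0.012, -0.004, 0.000, 0.021, -0.015)

# ifelse() applies the test to each element and returns a vector
direction <- ifelse(daily_returns > 0, "up", "down_or_flat")

# switch() picks a branch by name; the last unnamed entry is the default
describe_action <- function(action) {
  switch(action,
    buy  = "open a long position",
    sell = "close the position",
    "do nothing")
}

buy_description <- describe_action("buy")
default_description <- describe_action("hold")
```

A plain if() expects a single TRUE or FALSE, which is why ifelse() is the natural choice whenever the condition is computed on a whole vector.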
For the repetitive execution of code, the for, while, and repeat constructs should be used. The for() loop is used to execute certain functionality multiple times in a row. According to the help("for") documentation, the syntax of the for loop is of the form: for(var in seq) expr, where var is a variable, in is a reserved keyword, and seq is an expression evaluating to a vector. Here are two examples:

    for(i in 1:5) {
      cat(i, "\n")
    }
    ## 1
    ## 2
    ## 3
    ## 4
    ## 5

    some_list <- list()
    for(z in c("hello", "goodbye")) {
      some_list[[z]] <- z
    }
    some_list
    ## $hello
    ## [1] "hello"
    ##
    ## $goodbye
    ## [1] "goodbye"

At this point, we have sufficient information to start combining some of these commands into useful programs. As a motivating example, we will write a function to compute the pairwise correlations between 6 different stocks passed in by the user. Functionality already exists in R, which facilitates such computations in a straightforward manner. For pedagogical reasons, we will do it the hard way. This problem can be loosely decomposed into the following tasks:

1. Obtain the names of the six stocks.
2. Make sure they are valid stock names.
3. Connect to a database that has the prices for these stocks and pull that information into memory.
4. Clean up the data by identifying missing values.
5. Place all the filtered data into a suitable container like a matrix or a data frame.
6. Compute the pairwise correlations and return the result to the user.
7. Create a visualization of these correlations.

But before we write our first custom function, we need to talk about style.

A recommended style guide

Different people have different ways of writing code. This is good from a creative standpoint, but bad from a stylistic viewpoint. Various coding style guides have been proposed for almost all of the programming languages out there. A style guide is a set of rules on the visual appearance and layout of code.
These guidelines specify the number of spaces to use, the naming convention of variable and function names, the proper use of comments, and so forth. Why is it important to adhere to a style guide? It is important because it allows for better readability of code, as well as portability of code between developers. Consider the following two examples. Functionally, they both produce the same output. One, however, is better than the other:

    #sum numbers
    x<-0;for(i in 1:10){x=x+1};x

or

    # Sum numbers
    x <- 0
    for(i in 1:10) {
      x <- x + 1
    }
    x

The second variant is visually more appealing as it does a better job of separating out the functionality. Notice that the assignment in the first case is x = x + 1, whereas in the second case, x <- x + 1. In the cleaner version, a consistent assignment operator was used for both the initialization (x <- 0) and the incrementing (x <- x + 1) stages. Again, the functionality does not change. The quality of the code, however, feels better in the second variant. Hadley Wickham has drafted a style guide that will be adopted throughout this book. The rules can be found here: http://r-pkgs.had.co.nz/style.html. The following list contains a subset of these rules:

• File names should be meaningful and end in .r. For example, file-name.r is recommended, whereas file-name.R is not.
• Variable and function names should be lowercase, and the underscore should be used to separate out the words. The camel case variable name firstFunction should be written as first_function instead.
• For code indentation, use two spaces. Don't mix tabs and spaces.
• Use <- rather than =, for assignment.

A pairwise correlation example

Now, back to writing our pairwise correlation function. In an attempt to break down the task, we will write a few helper functions. The first helper function will validate whether a vector of stock symbols contains valid names, and will return only the valid ones for further processing.
For simplicity, we will assume that a valid symbol is any sequence of 2 to 4 letters of the English alphabet. Numbers will not be allowed as part of the symbol identifier. We will make use of regular expressions to accomplish this initial filtering task.

    filter_and_sort_symbols <- function(symbols) {
      # Name: filter_and_sort_symbols
      # Purpose: Convert to upper case if not
      #   and remove any non valid symbols
      # Input: symbols = vector of stock tickers
      # Output: filtered_symbols = filtered symbols

      # Convert symbols to uppercase
      symbols <- toupper(symbols)

      # Validate the symbol names
      valid <- regexpr("^[A-Z]{2,4}$", symbols)

      # Return only the valid ones
      return(sort(symbols[valid == 1]))
    }

Regular expressions are a powerful string filtering mechanism. Even though their syntax can appear quite daunting at first, it is worth spending the time to learn how to apply them. They provide a very concise and efficient way to perform text manipulations. The regular expression pattern used in the previous example (^[A-Z]{2,4}$) specifies that the string to be matched should start with an uppercase letter and end with an uppercase letter. It also requires that there be exactly two, three, or four letters present. Anything else will not be considered as a valid stock symbol. The regexpr() function returns a vector of equal length to that of the symbols vector. An entry of 1 is used to denote the valid names, and an entry of −1, the invalid ones. The toupper() function is used to convert all the letters into uppercase prior to applying the regular expression. Here is a test of the function we just wrote. Because the function also sorts its result, the valid symbols come back in alphabetical order:

    filter_and_sort_symbols(c("MOT", "cvx", "123", "Gog2", "XLe"))
    ## [1] "CVX" "MOT" "XLE"

The next step requires us to pass the filtered vector of symbols into a function that will read in a .csv file and extract only the relevant data for those symbols. This function can later be augmented to read in price data from multiple sources, including external databases.
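To see what the regular expression step returns before any filtering takes place, the intermediate values can be inspected directly. This is a minimal sketch; the candidate symbols and variable names are assumptions chosen for illustration, and grepl() is shown as an alternative to regexpr() that returns booleans rather than match positions.

```r
# Candidate symbols, converted to uppercase first (illustrative input)
candidates <- toupper(c("MOT", "cvx", "123", "Gog2", "XLe", "a", "GOOGL"))
pattern <- "^[A-Z]{2,4}$"

# regexpr() gives the starting position of the match, or -1 for no match.
# as.integer() drops the extra attributes that regexpr() attaches.
match_start <- as.integer(regexpr(pattern, candidates))

# grepl() answers the same question with TRUE or FALSE
is_valid <- grepl(pattern, candidates)

# Keep and sort the valid symbols
valid_symbols <- sort(candidates[is_valid])
```

Because the pattern is anchored with ^ and $, a match can only ever start at position 1, which is why testing for valid == 1 and testing with grepl() select the same elements.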
For the purposes of this exercise, we will use a .csv file that contains 1856 trading days of prices for the following nine stocks: AAPL, CVX, IBM, XOM, GS, BA, MON, TEVA and CME. The time range of the data is from January 3, 2007 to May 16, 2014. We can consider this file as our database. These prices were obtained from Yahoo and the format looks as follows:

                 AAPL    CVX    IBM    XOM     GS     BA    MON  TEVA   CME
    2014-05-09 585.54 123.97 190.08 101.95 157.20 131.10 115.66 48.91 69.59
    2014-05-12 592.83 124.18 192.57 102.23 159.55 132.60 115.97 49.64 70.91
    2014-05-13 593.76 124.78 192.19 102.36 160.28 133.45 116.91 50.85 70.59
    2014-05-14 593.87 125.35 188.72 102.29 159.45 132.99 117.00 50.18 69.81
    2014-05-15 588.82 123.81 186.46 100.78 156.64 131.21 115.40 49.72 69.48
    2014-05-16 597.51 123.18 187.06 100.74 156.43 130.81 116.04 49.81 68.68

Figure 2.8 Sample stock price file.

    extract_prices <- function(filtered_symbols, file_path) {
      # Name: extract_prices
      # Purpose: Read price data from specified file
      # Inputs: filtered_symbols = vector of symbols,
      #   file_path = location of price data
      # Output: prices = data.frame of prices per symbol

      # Read in the .csv price file
      all_prices <- read.csv(file = file_path, header = TRUE,
        stringsAsFactors = FALSE)

      # Make the dates row names
      rownames(all_prices) <- all_prices$Date

      # Remove the original Date column
      all_prices$Date <- NULL

      # Extract only the relevant data columns
      valid_columns <- colnames(all_prices) %in% filtered_symbols

      return(all_prices[, valid_columns])
    }

A few new concepts were introduced in the extract_prices() function that need further clarification: the use of NULL and the use of the %in% command. By assigning NULL to a column of a data frame, we effectively remove that column from the data frame. This operation can also be used to remove elements from a list. The %in% command asks the following question: which elements of vector A are also in vector B?
    A <- c(1, 2, 5, 6, 9)
    B <- c(0, 3, 6, 9, 10)
    A %in% B
    ## [1] FALSE FALSE FALSE  TRUE  TRUE

Now that we have the prices of the filtered stocks in a data frame, we can perform some basic filtering. For now, we will take a look at the data and identify the rows with missing values. At this stage, we will not use this information to filter the data. We just care about the mechanics of identifying bad entries.

    filter_prices <- function(prices) {
      # Name: filter_prices
      # Purpose: Identify the rows with missing values
      # Inputs: prices = data.frame of prices
      # Output: missing_rows = vector of indexes where
      #   data is missing in any of the columns

      # Returns a boolean vector of good or bad rows
      valid_rows <- complete.cases(prices)

      # Identify the index of the missing rows
      missing_rows <- which(valid_rows == FALSE)

      return(missing_rows)
    }

The next step in our list requires us to compute pairwise correlations between all the stocks. The mathematical formula for the Pearson sample correlation coefficient $r_{xy}$ between two vectors of numbers is

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2.2}$$

There is no need to delve into the details of the above formulation. We will implement it directly by calling the cor() function.
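For readers who would like to see that cor() really does compute the quantity in Equation (2.2), the formula can be written out by hand and compared against the built-in function. This is a sketch only: the pearson_correlation() helper, the simulated data, and the seed are assumptions introduced for illustration.

```r
# Hand-written version of the Pearson sample correlation
# (hypothetical helper, not part of base R)
pearson_correlation <- function(x, y) {
  x_dev <- x - mean(x)
  y_dev <- y - mean(y)
  sum(x_dev * y_dev) / sqrt(sum(x_dev ^ 2) * sum(y_dev ^ 2))
}

# Two related series of simulated numbers
set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

manual_result <- pearson_correlation(x, y)
builtin_result <- cor(x, y)

# The two results agree up to floating point error
results_match <- isTRUE(all.equal(manual_result, builtin_result))
```

In practice cor() is the better choice, since it also handles matrices, data frames, and missing values through its use argument.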
    compute_pairwise_correlations <- function(prices) {
      # Name: compute_pairwise_correlations
      # Purpose: Calculates pairwise correlations of returns
      #   and plots the pairwise relationships
      # Inputs: prices = data.frame of prices
      # Output: correlation_matrix = A correlation matrix

      # Convert prices to returns
      returns <- apply(prices, 2, function(x) diff(log(x)))

      # Plot all the pairwise relationships
      pairs(returns, main = "Pairwise return scatter plot")

      # Compute the pairwise correlations
      correlation_matrix <- cor(returns, use = "complete.obs")

      return(correlation_matrix)
    }

Now that our helper functions have been defined, it is time to tie everything together:

    # Stock tickers entered by user
    symbols <- c("IBM", "XOM", "2SG", "TEva",
      "G0og", "CVX", "AAPL", "BA")

    # Location of our database of prices
    file_path <- "path/prices.csv"

    # Filter and sort the symbols
    filtered_symbols <- filter_and_sort_symbols(symbols)
    filtered_symbols
    ## [1] "AAPL" "BA" "CVX" "IBM" "TEVA" "XOM"

    # Extract prices
    prices <- extract_prices(filtered_symbols, file_path)

    # Filter prices
    missing_rows <- filter_prices(prices)
    missing_rows
    ## integer(0)

    # Compute correlations
    correlation_matrix <- compute_pairwise_correlations(prices)
    correlation_matrix

Here is the correlation matrix:

              AAPL       CVX       IBM       XOM        BA      TEVA
    AAPL 1.0000000 0.4555762 0.4974812 0.4152326 0.4221255 0.2793489
    CVX  0.4555762 1.0000000 0.5789544 0.8912227 0.6004590 0.4228898
    IBM  0.4974812 0.5789544 1.0000000 0.5668389 0.5214248 0.3214548
    XOM  0.4152326 0.8912227 0.5668389 1.0000000 0.5955963 0.4112595
    BA   0.4221255 0.6004590 0.5214248 0.5955963 1.0000000 0.3479621
    TEVA 0.2793489 0.4228898 0.3214548 0.4112595 0.3479621 1.0000000

Figure 2.9 Correlation matrix output.

As a side effect, the same function also generates a scatter plot of all the returns in a nice rectangular layout. This is courtesy of the pairs() function.
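The conversion from prices to returns inside compute_pairwise_correlations() is compact enough to be worth trying on its own. The sketch below uses simulated prices rather than the price file; the tickers AAA and BBB, the volatilities, and the seed are made up for illustration.

```r
# Simulated price paths for two hypothetical tickers
set.seed(123)
n_days <- 100
synthetic_prices <- data.frame(
  AAA = 100 * exp(cumsum(rnorm(n_days, sd = 0.01))),
  BBB = 50 * exp(cumsum(rnorm(n_days, sd = 0.02)))
)

# diff(log(x)) computes log returns column by column.
# apply() over margin 2 returns a matrix with one row fewer than the input.
log_returns <- apply(synthetic_prices, 2, function(x) diff(log(x)))

# Pairwise correlations of the returns
correlations <- cor(log_returns, use = "complete.obs")
```

The resulting matrix is symmetric with ones on the diagonal, since every series is perfectly correlated with itself. Correlating returns rather than raw prices is the design choice made in the function above.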
Figure 2.10 Pairwise scatter plot. (Panels: AAPL, CVX, IBM, XOM, BA, TEVA.)

Summary

This chapter addresses the basics of the R programming language. A brief history of the language is outlined, and the main reasons to use the R programming language are presented. Some of these include the ease of data manipulation and processing, the open-source nature of the framework, the plethora of add-on packages, and a vibrant online community. The important data containers (c(), matrix(), data.frame(), list()), as well as the most frequently used operations on these containers are showcased, and examples are provided. Some of the base graphical capabilities of the R language are also explored. The chapter ends with an introduction to creating customized functions within R. After briefly discussing the use of code-styling conventions, a practical use-case on calculating pairwise correlations between stocks is provided.

3 Working with Data

Financial data comes in many forms and sizes. In this book, we will mostly concern ourselves with a particular class of financial data, namely, time series data. Time series data contain a time component that is primarily used as an index (key) to identify other meaningful values. Another way to think about time series data is as key-value pairs in which the key is the time, and the value is either a single entry or a vector of entries. The following table gives an idea of what a typical daily time series for the stock AAPL might look like. This data was downloaded from Yahoo in the form of a .csv file. The first column contains the date, and the other columns contain price and volume information. Now that we have the downloaded file available on our computer, it is time to load the data into R for further manipulation.
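The key-value view of a time series can be made concrete with a few lines of base R. This is only a sketch: the three closing prices are copied from Figure 3.1, while the daily_close object and its column names are assumptions chosen for illustration.

```r
# Dates act as keys, closing prices as values
daily_close <- data.frame(
  date  = as.Date(c("2014-04-10", "2014-04-09", "2014-04-08")),
  close = c(523.48, 530.32, 523.44)
)

# Sort by the time key so the series runs in chronological order
daily_close <- daily_close[order(daily_close$date), ]

# Look up the value stored under a given key
close_on_apr_9 <- daily_close$close[daily_close$date == as.Date("2014-04-09")]
```

Storing the dates as Date objects rather than as character strings is what makes the ordering and the lookup behave chronologically.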
Getting data into R

As a first step, we might want to take a look at what the closing price for AAPL looks like on a daily basis. There is a quick way to get data from a spreadsheet directly into the R environment. Simply copy the closing price column from the .csv file and then issue the following command inside the R console:

    # In Windows
    aapl <- read.table("clipboard")

    # On macOS (pbpaste is a macOS utility; Linux needs another clipboard tool)
    aapl <- read.table(pipe("pbpaste"))

This will, in effect, read the contents of the clipboard into a variable named aapl. We can look at the first few entries of this object by using the head() command.

    head(aapl)
    ##       V1
    ## 1 523.48
    ## 2 530.32
    ## 3 523.44
    ## 4 523.47
    ## 5 531.82
    ## 6 538.79

    Prices
    Date            Open    High     Low   Close      Volume  Adj Close*
    Apr 10, 2014  530.68  532.24  523.17  523.48   8,530,600      523.48
    Apr 9, 2014   522.64  530.49  522.02  530.32   7,337,800      530.32
    Apr 8, 2014   525.19  526.12  518.70  523.44   8,697,800      523.44
    Apr 7, 2014   528.02  530.90  521.89  523.47  10,309,400      523.47
    Apr 4, 2014   539.81  540.00  530.58  531.82   9,830,400      531.82
    Apr 3, 2014   541.39  542.50  537.64  538.79   5,798,000      538.79
    Apr 2, 2014   542.38  543.48  540.26  542.55   6,443,600      542.55
    Apr 1, 2014   537.76  541.87  536.77  541.65   7,170,000      541.65
    Mar 31, 2014  539.23  540.81  535.93  536.74   6,023,900      536.74

Figure 3.1 Time series data of AAPL price.

What kind of object is aapl anyway? Is it a matrix? A list? A data.frame?

    class(aapl)
    ## [1] "data.frame"

It seems that the aapl object is indeed a data frame. We also notice that R has provided its own name for the extracted column of data (V1). Upon closer inspection, it appears that the data is printed out in reverse chronological order onto the screen. This can be remedied by either sorting the original spreadsheet in chronological order or by reversing the contents of the aapl object we just created.
Here is the latter approach:

    aapl <- aapl[rev(rownames(aapl)), , drop = FALSE]

The rev() command reverses the entries as expected, although the indexing syntax is somewhat confusing. We are effectively telling R to reverse the row entries of the data frame aapl, and to keep the original structure of the object. The drop = FALSE argument prevents the single-column data frame from degrading into a vector. In order to extract and visualize the raw vector of prices, we can do the following:

    prices <- aapl$V1
    plot(prices, main = "AAPL plot", type = 'l')

[Figure 3.2 Elementary stock line chart. y-axis: Prices; x-axis: Index.]

The functions read.table() and read.csv() have many options that make reading input data into R a breeze. Instead of copying the closing-price column and passing "clipboard" to read.table(), we can specify the location of the file that includes the data in which we are interested. The ?read.csv command lists the available options at our disposal.

    # Load the .csv file
    aapl_2 <- read.csv(file = "path/aapl.csv", header = TRUE,
      stringsAsFactors = FALSE)

    # Reverse the entries
    aapl_2 <- aapl_2[rev(rownames(aapl_2)), ]

This time around, we specified that headers are present in the input file. R knows to call the columns by their correct names, and if we want to extract the closing price, the following command suffices:

    aapl_close <- aapl_2[, "Close"]

To get some quick summary statistics in tabular format, we can utilize the summary() function.

    summary(aapl_close)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##   11.00   25.50   40.50   96.29   77.00  702.10

The summary() function outputs the minimum, first quartile, median, mean, third quartile, and maximum, along with any missing value (NA) information. Summarizing and visualizing a particular data set should be one of the first steps conducted prior to performing any subsequent analysis. Most of the time, such summary statistics help uncover structural issues with the data.
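The load, reverse, extract, and summarize steps above can be tried without a file on disk. The sketch below is a stand-in under stated assumptions: the .csv content is supplied inline through the text argument of read.csv(), it holds only three rows and three columns from Figure 3.1 with the dates rewritten by hand in ISO format, and the names csv_text, aapl_demo, close_demo, and one_col are hypothetical.

```r
# Inline stand-in for a downloaded .csv file (newest row first)
csv_text <- "Date,Open,Close
2014-04-10,530.68,523.48
2014-04-09,522.64,530.32
2014-04-08,525.19,523.44"

aapl_demo <- read.csv(text = csv_text, header = TRUE,
                      stringsAsFactors = FALSE)

# Put the rows in chronological order
aapl_demo <- aapl_demo[rev(rownames(aapl_demo)), ]

# The Date column is read as character; convert it to Date
aapl_demo$Date <- as.Date(aapl_demo$Date)

# Extract the closing prices and summarize them
close_demo <- aapl_demo[, "Close"]
summary(close_demo)

# Single-column case: without drop = FALSE the result is a plain vector
one_col <- aapl_demo[, "Close", drop = FALSE]
class(one_col[rev(rownames(one_col)), ])                # "numeric"
class(one_col[rev(rownames(one_col)), , drop = FALSE])  # "data.frame"
```

The last two lines illustrate why drop = FALSE was needed for the one-column aapl object but not for the multi-column aapl_2 object.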
Installing packages in R

Most of the functionality that we care about has probably already been created by another R user. Leveraging the work of others is something that is highly encouraged within the R community. There is no need to reinvent the wheel if it already exists and rolls just fine. Packages, or libraries, are bundles of functions (usually written in R, C++, C, or Fortran), data, and supporting documentation on a specific topic. A list of such specific topics can be found on the CRAN Task Views website: cran.r-project.org/web/views/. Some organizations also create their own packages in-house without making them publicly available. For a thorough discussion on how to create such a packaged bundle, the following reference serves as a good starting point.²

A package we will leverage later in this section is called quantmod. It is the brainchild of Jeffrey Ryan,³ and it allows one to work with financial time series data in an intuitive manner. To install quantmod, open up an R console and enter the following command: install.packages("quantmod"). Notice the quotes around the package name. Here is a full list of what the install.packages() function arguments are:

    install.packages(pkgs, lib, repos = getOption("repos"),
      contriburl = contrib.url(repos, type),
      method, available = NULL, destdir = NULL,
      dependencies = NA, type = getOption("pkgType"),
      configure.args = getOption("configure.args"),
      configure.vars = getOption("configure.vars"),
      clean = FALSE, Ncpus = getOption("Ncpus", 1L),
      verbose = getOption("verbose"),
      libs_only = FALSE, INSTALL_opts, quiet = FALSE,
      keep_outputs = FALSE, ...)

During the installation process, R automatically installs all the dependencies that quantmod needs. If the local machine is connected to the Internet, R will use a default folder as the repository for the installed packages. This folder location can be changed by specifying a diff