It's crazy how poor the financial data provider offerings out there are. Most financial data is riddled with inconsistencies, wildly overpriced, and in esoteric formats. Simply ingesting financial data in a reliable manner requires significant engineering.
For something so important to the economy, it's amazing that there isn't a better solution, or that an open standard hasn't been mandated.
I feel this; I email Quandl regularly to fix data errors that the simplest of automated checks should catch ("why is this price 1200% higher than the previous one?").
But, they do have a mostly-decent API (tables; timeseries is pretty bad).
Something that always bugs me is properly adjusting prices when backtesting. The "right" way seems to be how Quantopian now handles it [1], in a just-in-time fashion, but that code isn't in their public libraries, and over email they declined to tell me where they get the data.
[1] https://www.quantopian.com/quantopian2/adjustments
Keep an updated corp action table with date, corp action type, and adjustment factor.
Corp action type is important because divs adjust prices but not volumes, for example. Splits adjust both.
When you're ready to use an adjusted time-series: select the corp actions you care about and calculate a running product of 1+the adjustment factor. As-of join the adjustment factors, multiply and you're done.
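Since the recipe above is terse, here is a minimal pandas sketch of it. The corp-action table, column names, and factor values are all invented, and the convention assumed is that (1 + factor) is the multiplier applied to prices dated before the action (e.g. -0.5 for a 2:1 split, -dividend/close for a cash dividend).

```python
import numpy as np
import pandas as pd

# Invented corp-action table; real vendors will have their own schemas and sign conventions.
actions = pd.DataFrame({
    "date":   pd.to_datetime(["2017-06-15", "2018-03-01"]),
    "type":   ["dividend", "split"],
    "factor": [-0.01, -0.5],   # 1% cash dividend, then a 2:1 split
}).sort_values("date")

prices = pd.DataFrame({"date": pd.bdate_range("2017-01-02", "2018-06-29")})
prices["close"] = 100.0    # placeholder unadjusted closes
prices["volume"] = 1_000   # placeholder unadjusted volumes

# Running product of (1 + factor), accumulated from the latest action backwards,
# so each action row carries the combined effect of itself and everything after it.
actions["cum_factor"] = np.cumprod((1.0 + actions["factor"]).to_numpy()[::-1])[::-1]

# As-of join: each price date picks up the cumulative factor of the next action
# strictly after it, then multiply. Dates after the last action need no adjustment.
adj = pd.merge_asof(prices, actions[["date", "type", "cum_factor"]],
                    on="date", direction="forward", allow_exact_matches=False)
adj["adj_close"] = adj["close"] * adj["cum_factor"].fillna(1.0)

# Volumes: only splits apply (dividends leave share counts alone), so build the
# running product over the split rows alone and divide instead of multiply.
splits = actions[actions["type"] == "split"].copy()
splits["vol_factor"] = np.cumprod((1.0 + splits["factor"]).to_numpy()[::-1])[::-1]
adj = pd.merge_asof(adj, splits[["date", "vol_factor"]],
                    on="date", direction="forward", allow_exact_matches=False)
adj["adj_volume"] = adj["volume"] / adj["vol_factor"].fillna(1.0)
```

The direction="forward" plus allow_exact_matches=False pair is what makes it an as-of join against the next action strictly after each bar; whether the ex-date itself should be included depends on your price source's convention.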
I stand corrected; the code for on-the-fly adjusting _is_ in Zipline, but you have to know that stock splits are treated like dividends, which wasn't obvious to me.
At least some of the data they use comes from a vendor they aren't able to name publicly.
For my current job, we wanted to get a mapping of stock tickers and exchanges to CUSIPs. Every provider we looked at — and this is fundamental trade data — was full of errors and missing values. Couple that with the extortion that is CUSIP (you can't use CUSIP values without a license from them, and licenses start at $xx,xxx+). It's criminally inept. And when you do fix it up, you don't want to publish it, because you spent all your time and resources fixing it… and it becomes a trade secret.
This is why finance is lucrative, similar to esoteric codes in various types of law. Nothing to do with math models or superior prediction, just paying for someone else to fight through identifier hell, exchange protocol hell, etc., and be able to do some mickey mouse math at the end of it.
Honestly, this stuff is so bad that the headache of it alone might justify the huge compensation in finance. I've had colleagues turn down huge bonuses and raises in order to leave finance companies, solely to avoid this kind of thing and pursue a career where the headaches bother them less, even though they're paid less.
Did you look at Factset's datafeed? I've found its reference data and symbology to be pretty reliable. CUSIPs will cost a lot with redistribution charges though. You're better off avoiding them if possible.
I agree; CUSIP is also a problem for the privateer (meaning all data needs to be free to use). While I have found a way to get a mapping online, I have no idea of its accuracy and have to trust that the (unaware) provider QAs the data.
I really would like to see something like Bloomberg's OpenFIGI take the place of CUSIPs, but it's not nearly as widely used. https://www.openfigi.com/ The API does allow you to convert from CUSIP, though.
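That conversion is a single POST of mapping jobs; roughly the following, though check the OpenFIGI docs for the current request/response shape (the CUSIP below is just a placeholder):

```python
import requests

# One mapping job per identifier; ID_CUSIP is one of the supported idTypes.
jobs = [{"idType": "ID_CUSIP", "idValue": "037833100"}]  # placeholder CUSIP

headers = {"Content-Type": "application/json"}
# headers["X-OPENFIGI-APIKEY"] = "..."  # optional API key for higher rate limits

resp = requests.post("https://api.openfigi.com/v3/mapping", json=jobs, headers=headers)
resp.raise_for_status()

# Each job comes back with either a "data" list of matches or an "error" string.
for job, result in zip(jobs, resp.json()):
    figis = [d.get("figi") for d in result.get("data", [])]
    print(job["idValue"], "->", figis or result.get("error"))
```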
For anyone curious about esoteric formats, check out some of the documentation for financial data providers.
CRSP[1] is pretty much regarded as the highest quality pricing data in the US, with stock prices going back to 1925. The database API is written for C and FORTRAN-95.
Data providers also have a habit of providing their own proprietary security IDs, or just mapping to tickers. So if you're trying to build a database with several providers, you have to wrangle together 15 different security identifiers, taking care of mergers/acquisitions, delistings, ticker recycling, etc. It is a fun exercise.
[1] http://www.crsp.com/files/programmers-guide.pdf
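One way people cope with that wrangling is a point-in-time symbology table keyed by an internal security ID, with validity windows so recycled tickers and vendor IDs resolve differently depending on the as-of date. A sketch, with every ID, date, and column name invented:

```python
import pandas as pd

# Invented point-in-time symbology table: every external identifier maps to an
# internal security ID only within a validity window.
symbology = pd.DataFrame(
    [
        (1, "ticker",      "ABC",      "2001-01-02", "2010-05-01"),
        (2, "ticker",      "ABC",      "2012-09-10", None),        # recycled ticker
        (1, "vendor_a_id", "A-998877", "2001-01-02", None),
        (2, "vendor_b_id", "B-123456", "2012-09-10", None),
    ],
    columns=["internal_id", "id_type", "id_value", "valid_from", "valid_to"],
)
symbology["valid_from"] = pd.to_datetime(symbology["valid_from"])
symbology["valid_to"] = pd.to_datetime(symbology["valid_to"])

def resolve(id_type: str, id_value: str, as_of: str) -> int:
    """Map an external identifier to the internal security ID as of a given date."""
    ts = pd.Timestamp(as_of)
    hits = symbology[
        (symbology["id_type"] == id_type)
        & (symbology["id_value"] == id_value)
        & (symbology["valid_from"] <= ts)
        & (symbology["valid_to"].isna() | (symbology["valid_to"] >= ts))
    ]
    if len(hits) != 1:
        raise LookupError(f"{id_type}={id_value} unknown or ambiguous as of {as_of}")
    return int(hits["internal_id"].iloc[0])

print(resolve("ticker", "ABC", "2005-06-30"))  # -> 1
print(resolve("ticker", "ABC", "2015-06-30"))  # -> 2 (same ticker, different company)
```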
Any advice on where an individual could purchase (even limited) access to CRSP data?
I'm working on a data-driven financial analysis blog and can't seem to find decent time-series fundamentals data now that Yahoo and Google have taken down their APIs. Everything I find seems to be a $1000+ yearly subscription.
Spoke to these guys a while back. Asked for examples of real alternative data they had... one interesting one was flight data for private jets, labeled by which company owned them. The theory being that if the CEO of company X keeps visiting a place near company Y, there may be an acquisition or merger in play.
I wonder if anyone did/could use it to buy real estate before HQ2 was announced. I don't know if the person in charge of finding the real estate was high enough to fly private.
Also I remember hearing about some Amazon employees buying real estate once the terms were being finalized...
Speaking from experience: it's not illegal insider trading unless you violate a confidentiality agreement or fiduciary duty.
I specifically say "illegal insider trading" because insider trading is not intrinsically illegal. The SEC distinguishes between insider trading and illegal insider trading (and by extension, so does the compliance department of every investment firm, bank and hedge fund). If, through your own research, you discover information which is both material and nonpublic, and you proceed to trade on that information, you are insider trading. However it is not illegal unless you have thereby broken an agreement or duty (namely with the company itself, its affiliates or your own clients) at any part of the process.
In practice this usually means the information is tainted if any of the following is true:
1) you have a fiduciary duty to the shareholders of the company in question,
2) you are employed by, or contract for, the company in question,
3) you are employed by, or contract for, an affiliate of the company (such as a vendor),
4) you violated the terms and conditions of service of the product through which you found the information.
Obviously the standard disclaimers apply: I am not a lawyer, don't take potential legal advice from a random HN commenter, etc.
Source: I used to work in financial forecasting using alternative data.
In America, this kind of research is explicitly encouraged, and very much NOT insider trading according to the SEC. Insider trading has to involve _theft_, not just insider knowledge.
If you overhear someone talking about an impending acquisition in a coffee shop, and you trade on that information, you're quite safe in the US. European countries can and do consider that insider trading, though.
Is Spying on Corporate Jets Insider Trading? https://www.cnbc.com/id/100272132
We're in the midst of a data gold rush. People who have data are struggling to monetize it. If you're a data buyer, you're probably swamped with the quantity and breadth of data providers out there. AI/ML techniques to make sense of this data are still only scratching the surface. I think this is where there is a lot of low-hanging fruit: creating services or tools that allow non-CS/non-Quant people to extract insights from TBs of data...
On the exchange side: these guys are always on the prowl for hot new properties to scoop up. The traditional business model of simply earning fees on exchange trading has been slowly eroding away for the last 10 years. So they need to branch out into services and other data plays...
Alternative take: there isn't that much low hanging fruit there.
Hear me out.
"To the person who only has a hammer, everything looks like a nail."
The data in front of you is the data you want to analyze, but it doesn't follow that that is the data you ought to analyze. I predict that most of the data you look at will result in nothing. The null hypothesis will not be rejected in the vast majority of cases.
I think we -- machine learning learners -- have a fantasy that the signal is lurking and if we just employ that one very clever technique it will emerge. Sure random forests failed, and neural nets failed and the SVR failed but if I reduce the step size, plug the output of the SVR into the net and change the kernel...
Let me give an example: suppose you want to analyze the movement of the stock market using the movement of the stars. Adding more information on the stars, and more techniques, may feel like you're making progress, but you aren't.
Conversely, even a simple piece of information that requires minimal analysis (this company's sales are way up and no one but you knows it) would be very useful in making that prediction.
The first data set is rich, but simply doesn't have the required signal. The second is simple, but has the required signal. The data that is widely available is unlikely to have unextracted signal left in it.
I've been selling good data in a particular industry for three years. In this industry at least, the so-called "low-hanging fruit" only seems low-hanging until you realize that the people who could benefit most from the data are the ones who are mentally lazy and least likely to adopt it. Data has the same problems as any other product, and may even be harder, because you need to 1) acquire the data and 2) build tools that reliably solve difficult problems using huge amounts of noisy information...
Isn't there utility in accepting the null hypothesis? It's almost as valuable to know that there is no signal in the data as to know that there is, i.e., knowing where not to look for information.
I think your example is really justifying a "machine learner" that has some domain expertise and doesn't blindly apply algorithms to some array of numbers.
Bingo. You nailed it. I work in finance. Developed markets have efficient stock markets. They are highly liquid. The reality is that there are a lot of people competing for the same profits. When there are that many players, if there's a profit to be had from a dataset you can buy from a vendor, chances are one of your many competitors already bought it and found it. This is why we now say don't try to beat the market; you likely can't, and mostly just need to get lucky by having the right holding when an unforeseen event occurs. Too many variables at play that we just don't understand. Most firms are buying these datasets to stay relevant, but they really make no difference in their actual investing strategies.
This is where you might use a genetic algorithm (or similar) to learn which data to use for a particular prediction. Good AI won't use all the data, just trim it down to the signal.
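As a toy illustration of that idea, here is a minimal sketch of genetic-algorithm feature selection. Everything in it is made up (the synthetic data, population size, mutation rate); the fitness function is just out-of-sample squared error of a least-squares fit on the selected columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20
X = rng.normal(size=(500, n_features))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=500)  # only features 0 and 3 matter

def score(mask: np.ndarray) -> float:
    """Negative validation error of a least-squares fit on the selected columns."""
    if not mask.any():
        return -np.inf
    cols = X[:, mask]
    train, val = slice(0, 400), slice(400, 500)
    coef, *_ = np.linalg.lstsq(cols[train], y[train], rcond=None)
    resid = y[val] - cols[val] @ coef
    return -float(np.mean(resid ** 2))

pop = rng.random((30, n_features)) < 0.5            # population of random feature subsets
for generation in range(40):
    fitness = np.array([score(m) for m in pop])
    parents = pop[np.argsort(fitness)[-10:]]         # keep the 10 best masks
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, n_features)            # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(n_features) < 0.02       # occasional bit-flip mutation
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([score(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```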
For finance in particular, I'd say we're drowning in a massive volume of shitty data.
A client of mine purchases several fundamental feeds from Quandl, and I email them regularly to point out errors. Not weird, hard, tricky errors, but stuff like "why are all these volumes missing" or "there's a 1-day closing price increase of 1200%" or "you divided when you should have multiplied". This tells me neither Quandl nor the original provider (e.g. Zacks) do any serious data validation, despite claiming to.
If the companies people have been paying for decades for this data get it wrong this often, how can I trust any weirder data they're trying to sell me? I thought the point of buying these feeds was to let the seller worry about quality assurance.
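To illustrate how cheap the kind of checks being described are, here is a rough pandas sketch; the column names and thresholds are arbitrary, not anything a particular vendor uses:

```python
import pandas as pd

def sanity_report(df: pd.DataFrame) -> pd.DataFrame:
    """Flag obviously suspect rows in a long-format price frame (ticker, date, close, volume)."""
    issues = []
    for ticker, g in df.sort_values("date").groupby("ticker"):
        ret = g["close"].pct_change()
        issues.append(pd.DataFrame({
            "ticker": ticker,
            "date": g["date"],
            "missing_volume": g["volume"].isna(),
            "nonpositive_price": g["close"] <= 0,
            "extreme_move": ret.abs() > 5.0,   # e.g. a 500%+ one-day change
        }))
    report = pd.concat(issues, ignore_index=True)
    flags = ["missing_volume", "nonpositive_price", "extreme_move"]
    return report[report[flags].any(axis=1)]
```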
This doesn't matter; any sophisticated user will have their own software to clean the data anyway. Their concern is getting the data; they know how to clean it once they have it.
You are right, extracting insights from data is low-hanging fruit. From what I observe, there is a huge lack of proper services and tools that can automatically produce insights. There are of course automated machine learning solutions, but they focus more on machine learning model tuning (in the Kaggle style) rather than giving users understanding and awareness of their data.
I think data scientists need to produce more actionable insights as opposed to living in their own world. I suspect there will be a rising group of people who can understand data science techniques and communicate them effectively to drive business decisions. These people will be the ones who clinch the top posts.
Nasdaq already makes more money on data licensing than on trading fees or IPOs. Each time a professional in the financial services industry wants real-time display data, for example, they have to pay Nasdaq a monthly fee. Nasdaq and NYSE compete for listings not so much for the trading fees now, but because listings make their data licensing packages more valuable.
How do these services make it easier to evaluate data? The Medium article starts with a disclaimer about DLT... Talking with investors buying data, one shouldn't be surprised to hear them request uploads to their FTP. Their data teams are overcommitted when it comes to the evaluation side of consuming data. They aren't (yet) resourced like a tech startup.
How should they prioritize learning about ingesting data from a DLT? They have data brokers (like Quandl) coming to them with assurances of frictionless integration, with data they can understand and use, today!
There is ChainLink (https://chain.link/), which lets you sell your data via an API service through decentralized oracle nodes.
https://blog.goodaudience.com/the-four-biggest-use-cases-for...
Monetization is coming soon... in a big way.
Shameless plug: https://KloudTrader.com/narwhal
I'm calling it here: the most useful data is private, or can't be sold due to confidentiality. The fact that data is confidential is great evidence that we know it is useful, but also that we hope others won't use that signal.
I've been researching this topic, alternative data, for some time now, and I'm not surprised, since Nasdaq is a large provider of software (e.g. market-making software, amongst dozens of other products):
QUANDL SPECIFIC:
-Quandl has a pretty decent blog that I would check out; you never know what newly enacted corporate policy might get rid of it:
https://blog.quandl.com/
GENERAL NOTES:
-More and more asset managers are using it, and there is some worry that everyone is drawing the same conclusions from the same data set, and thus there is no money to be made. Though most practitioners say this is a non-issue: there are more and more alt. data sets out there to choose from, cleaning the data is tricky, and testing the veracity of the data provided and knowing how to combine it with other sets is a key competitive advantage that not every asset manager is good at.
-The ROI is something that is top of mind but not always easily attributable throughout the year, e.g. one large insight very late in the financial year can bring +100x returns on what was paid for a data provider's software.
-Hugely successful funds like Renaissance's Medallion have likely been doing this for a long, long time, coupled with top PhDs looking for a lot of statistical correlation with traditional data as well.
-More and more data sets are being created and thrown into self-learning financial models (aka AI), which has a lot of people excited, and certainly there are a lot of small funds being created, though it seems to be mostly by young people or not-so-great hedge fund managers. Getting large investors to lay down significant capital has a huge trust component to it, aka they want to bet only on successful, grey-haired, largely male folks.
-A lot of alternative data can be found directly from the Bloomberg terminal, e.g. the MAPS <Go> function. However, my understanding is that it's not that deep, quality is an issue, and everyone has access to it (no real competitive advantage).
That's probably fair. Quandl started by offering the "everyday" investor API access. I know the typical VC approach is to first get users and then scale, but often in investing/financial data products, it seems better to price high and then move down market. If you study the companies with the most success in the past (Bloomberg, CapIQ, MSCI, Eze, Advent, Factset, Morningstar, etc.), none of them started by trying to cater to the DIY investor.
> 'The company offers a global database of alternative, financial and public data, including information on capital markets, energy, shipping, healthcare, education, demography, economics and society.'
which doesn't really answer the question.
The way I think of alternative data is data about a business or industry that is obtained, collated, and analyzed through non-traditional communication channels, and helps to provide a better picture of how a company or industry is doing than just relying on trade data and financial statements. The best example of this I can think of is companies scraping AliBaba at certain frequencies, trying to ascertain the movement of certain products or raw materials. This data is then sold to investment firms and hedge funds, because they feel it gives them an edge.
One company that operates in this space is YipitData. From what I've been told, they started as something similar to GroupOn but then pivoted to this space after scraping for their own competitive intelligence reasons.
One example could be combining credit card data and location data to try and infer if bad weather affected same-store sales. Another use could be determining if a company was emailing loss-leading discount promos at the end of the quarter to juice its sales growth. Another could be collecting Tesla VINs to see if it is hitting its production targets. In the last case, Bloomberg has made this available for free:
https://www.bloomberg.com/graphics/2018-tesla-tracker/
Alternative data is non-financial data which can be tied to various securities.
Financial data, for example, would be EUR USD spot prices. Non-financial data (i.e. "alternative data") could be healthcare reports which you could theoretically couple to e.g. pharma stocks.
- Real-time weather data from major ports and across the main shipping lines
- Telemetry from crop and soil report systems
- Up-to-date satellite imagery of basically anything large under construction (solar farms, factories, ...)
Provide information like that in a machine-readable, consistent format and you have a business.
Btw... Using satellite images to track car manufacturers' inventory levels is an old idea, used for more than a decade.
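On the "machine-readable, consistent format" point above, here is a sketch of what one record of such a feed might look like; the schema, field names, units, and values are entirely invented, the point is just having a fixed schema at all:

```python
from dataclasses import dataclass, asdict
from datetime import datetime
import json

@dataclass
class PortWeatherObservation:
    port_code: str          # e.g. a UN/LOCODE
    observed_at: datetime   # always UTC
    wind_speed_ms: float    # metres per second, never km/h in some rows
    wave_height_m: float
    visibility_km: float
    source: str             # which sensor or provider produced the row

# Illustrative record; serialize datetimes explicitly so every row looks the same.
obs = PortWeatherObservation("NLRTM", datetime(2019, 1, 7, 6, 0), 12.4, 2.1, 8.0, "demo")
print(json.dumps({**asdict(obs), "observed_at": obs.observed_at.isoformat() + "Z"}))
```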