This project report describes our work in detail, including data preprocessing, exploratory data analysis, predictive model construction, and result analysis. The report is organized into four sections. Section 1 gives a brief introduction to the project. Section 2 covers data description and data preprocessing. The data mining methodologies we employed are detailed in Section 3. Section 4 presents the results of this data mining project.
Problem Description and Objective
Venture capital is financial capital provided to young, high-potential, high-risk companies. A venture capital fund makes money by owning equity in the companies it invests in, which usually have a novel technology or business model in high-technology industries such as biotechnology, IT and software. In addition, venture capital is attractive for new companies with limited operating history that are too small to raise capital in the public markets and are hardly qualified to secure a bank loan. The typical venture capital investment happens when venture capitalists show strong interest in a targeted start-up and expect high returns at the time they exit, which is usually when the company goes IPO or is acquired.
The initial idea we had seemed irrational at first. Suppose we are the investment managers of a venture capital fund with 10 million dollars in hand. Generally, if we do nothing with this money, it loses value to inflation. In effect, the existence of a huge amount of historical data suggests that data mining can provide a competitive advantage over human inspection of these data. Admittedly, the economic theory known as the Efficient Market Hypothesis suggests that markets adapt so rapidly in terms of price adjustments that there is no room to gain profits in a consistent way. However, this theory does not always align with reality in the financial markets, which leaves investors some room for speculation.
The general goal of venture capital is to maintain a portfolio of equities in early-stage, high-potential companies.
How to Cite this Page
"Data Mining." 123HelpMe.com. 22 Jan 2020
In our project study, we collected the data from the famous venture capital website Crunchbase.com. The dataset includes about 30k observations, each representing one historical investment record for one company. There are three tables in this dataset: “Companies”, “Investments” and “Rounds”. Details of the information contained in each table are presented in Table 1. As shown, “Companies” contains mostly company information in terms of location, founded date, first/last funding information and current status. “Investments” introduces additional information about investors’ names, locations and the type of funding they provided. The last one, “Rounds”, supplements the information with the exact funding amount in each round.
Each record contains information on 36 variables, and these variables can be grouped into three parts. The table below shows the details.
Table 1: Information contained in each table
Companies: Name, Category; Location (Country, State, City); Founded (Date, Location); Funding (First & Last)
Investments: Name, Category; Location (Country, State, City); Investors (Name, Location); Funding (Round Type, Date)
Rounds: Name, Category; Location (Country, State, City); Funding (Round Type, Date); Funding Amount
To make the dataset more tangible and interpretable, we provide the investment information for one company as an example, shown in Table 2.
Table 2: Example investment information for one company
Companies: 100Plus, analytics; San Francisco, CA, USA; Founded 16/09/2011; Funding 02/11/2011, 30/11/2011
Investments: 100Plus, analytics; San Francisco, CA, USA; Band of Angels – Menlo Park, CA, USA; Angel, 30/11/2011; … (4 more)
Rounds: 100Plus, analytics; San Francisco, CA, USA; Angel, 30/11/2011; 750K; … (4 more)
Before doing any analysis on the raw dataset, we merged the three tables into one, which contains almost all the information shown here. This produced one big table with 14 attributes and around 30k instances. To discover the underlying information, we performed the data exploration illustrated in the next subsection.
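As a sketch of this merging step, the following joins three miniature tables on the company name using pandas. The tables and column names here are invented stand-ins, not the exact Crunchbase schema:

```python
import pandas as pd

# Hypothetical miniature versions of the three tables; the column
# names are assumptions for illustration, not the real schema.
companies = pd.DataFrame({
    "company_name": ["100Plus", "Acme"],
    "company_city": ["San Francisco", "New York"],
    "status": ["acquired", "operating"],
})
investments = pd.DataFrame({
    "company_name": ["100Plus", "100Plus", "Acme"],
    "investor_name": ["Band of Angels", "Other Fund", "Big VC"],
})
rounds = pd.DataFrame({
    "company_name": ["100Plus", "Acme"],
    "funding_round_type": ["angel", "series-a"],
    "raised_amount_usd": [750_000, 5_000_000],
})

# Inner-join the three tables on the company name, mirroring the
# "one big table" described above.
merged = (companies
          .merge(investments, on="company_name")
          .merge(rounds, on="company_name"))
print(merged.shape)
```

With these toy tables, each investment row is enriched with its company and round information, giving one wide table ready for exploration.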
As mentioned above, the main goal of this project is to build a predictive model with high accuracy that can predict the future status of companies. In such a predictive task, model building is based mainly on historical data and produces a prediction for a variable of interest.
Given the lack of further information on the problem domain, it is better to investigate some of the statistical properties of the data, so as to get a better understanding.
Companies distribution and funding amount
The first idea we had was to draw two maps that mark all the companies' locations and funding amounts. This helps us understand where these companies are located geographically and how much funding they received. We used the software Tableau to do this. The results are shown in Figure 1 and Figure 2.
From these two graphs, we get a clear view that there are two main areas with the most companies and the largest funding amounts: one is California and the other is New York. It is easy to understand why these two places are the centers of investment.
In New York, Wall Street is the heart of the financial system. The New York Stock Exchange, the world's largest stock exchange by market capitalization of its listed companies, and several other major exchanges have or had headquarters here, including NASDAQ, the New York Mercantile Exchange, and the New York Board of Trade. Anchored by Wall Street, New York City is one of the world's principal financial centers.
California, where Silicon Valley is located, has an economy comparable to that of the largest countries. In 2011, California was responsible for 13.1 percent of the United States' $14.96 trillion gross domestic product. Silicon Valley is home to many of the world's largest technology corporations as well as thousands of small startups. Despite the development of other high-tech economic centers throughout the United States and the world, Silicon Valley continues to be a leading hub for high-tech innovation and development, accounting for one-third of all venture capital investment in the United States.
Figure 1: The map of companies distribution
Figure 2: Funding amount
Timeline of investment data
Second, we built a figure summarizing the timeline of when companies got funding. Figure 3 shows an increasing trend: most companies got funding in the late 2000s. The number of companies that got funding before 2000 is very small, which indicates a limitation of the data collection.
Considering the speculative internet bubble of that time, the dot-com bubble, venture capitalists saw record-setting growth as dot-com companies experienced high growth rates in their stock prices, and therefore moved faster and with less caution than usual, which finally created a huge bubble. The collapse of the dot-com bubble took place during 1999–2001. The stock market crash caused the loss of $5 trillion in the market value of companies from March 2000 to October 2002, and the 9/11 terrorist attacks accelerated the stock market drop.
After that, venture investors were scared and struggled to climb out of the hole generated by the significant early losses. From 2004 to 2007, fundraising increased steadily from the 2003 bottom. Venture investors approached new investments armed with the experience and lessons learned from the tech market crash. Although hit by the global financial crisis in 2008, investors responded rapidly at the onset of the crisis by insulating themselves as much as possible from the downturn. New venture investment fell sharply in late 2008 and began growing again after 2009.
The number of records in 2013 is extremely small because the data for that year is incomplete.
Figure 3: Timeline of venture capital
Companies Industry Distribution
Third, we built a graph showing the industry distribution of the companies. The top three industries are software, biotech and web.
Figure 4: Companies Industry Distribution
The attributes are listed in detail below, with some example values.
company_region: SF Bay
company_city: Mountain view
investor_region: SF Bay
investor_city: Mountain view
In the attribute status_score, we define operating and closed as -1, and IPO and acquired as 1.
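A minimal sketch of this encoding. The status strings follow the description above; the lookup table and function name are ours:

```python
# Map each raw status to the -1/1 target label defined above:
# companies that are operating or closed get -1, companies that
# reached an exit (IPO or acquisition) get 1.
STATUS_SCORE = {
    "operating": -1,
    "closed": -1,
    "ipo": 1,
    "acquired": 1,
}

def status_score(status: str) -> int:
    """Return the status_score label for a raw status string."""
    return STATUS_SCORE[status.lower()]

print(status_score("IPO"))      # 1
print(status_score("closed"))   # -1
```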
Remove some attributes
In the dataset, attributes like “company_name” are meaningless for prediction since they are basically IDs, and ID-like attributes make little sense when generating a predictive model. We removed them because we consider that investors rationally prefer to concentrate on a company's current and past financial situation, potential market performance, and other more tangible factors.
On the other hand, some attributes, like company region and investor region, are redundant because they describe location more vaguely than other attributes we have in hand, such as state and city.
Moreover, it is reasonable to remove attributes in which a huge proportion of observations are missing; too much missing information means we have little to make use of. In our case, the information in “investor_category_code” is extremely incomplete. Since we could not fill the gaps with the most frequent or predicted values, we decided to simply remove it.
Dividing train and test dataset
Since the goal of this project is to build a predictive model of the future status of companies, accuracy is the main criterion for judging our model. So in the analysis we are going to carry out, we assume that we do not know the ground-truth status of companies founded after 2012 and treat them as our test cases. The training set includes the rest of the data, corresponding to companies founded before 2012. After dividing the whole dataset in two, the training set has 26124 observations with known status, and the test set has 1525 observations with unknown status. After this data manipulation, we moved to the next step of our job.
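The split can be sketched as follows, on toy records with invented names and dates (the exact handling of the 2012 boundary here is an assumption):

```python
from datetime import date

# Hypothetical records: (company_name, founded_date, status).
records = [
    ("OldCo", date(2005, 3, 1),  "acquired"),
    ("MidCo", date(2011, 7, 15), "operating"),
    ("NewCo", date(2012, 6, 1),  "operating"),
]

# Companies founded before 2012 form the training set (status
# treated as known); later companies are held out as test cases.
train = [r for r in records if r[1].year < 2012]
test = [r for r in records if r[1].year >= 2012]

print(len(train), len(test))  # 2 1
```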
In this part, we briefly introduce the algorithms we applied in this predictive task: NaiveBayes, BayesNet, lazy.IBk (instance-based KNN), trees.J48, trees.RandomForest, and rules.JRip (RIPPER).
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. It assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. An advantage of Naive Bayes is that it only requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
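To make the independence assumption concrete, here is a toy categorical naive Bayes with crude Laplace smoothing. The data and features are invented, not drawn from our dataset:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    # Count class frequencies and, per (feature index, class),
    # the frequency of each feature value.
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(i, y)][v] += 1
    return class_counts, feat_counts

def predict_nb(model, row):
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for y, cy in class_counts.items():
        p = cy / total  # class prior
        for i, v in enumerate(row):
            # Multiply per-feature likelihoods independently (the
            # "naive" assumption); +1/+2 is crude Laplace smoothing.
            p *= (feat_counts[(i, y)][v] + 1) / (cy + 2)
        if p > best_p:
            best, best_p = y, p
    return best

rows = [("software", "CA"), ("software", "CA"), ("biotech", "NY")]
labels = [1, 1, -1]
model = train_nb(rows, labels)
print(predict_nb(model, ("software", "CA")))  # 1
```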
IBk denotes the instance-based k-nearest-neighbor algorithm. One advantage that instance-based learning has over other machine learning methods is its ability to adapt its model to previously unseen data: where other methods generally require the entire set of training data to be re-examined when one instance is changed, instance-based learners may simply store a new instance or throw an old instance away. KNN rules in effect implicitly compute the decision boundary. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity.
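A minimal k-NN sketch over numeric feature vectors (toy data; Weka's IBk adds distance weighting and other options not shown here):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    # "Training" is just storing (vector, label) pairs; prediction
    # sorts them by squared Euclidean distance to the query.
    dist = lambda a: sum((x - q) ** 2 for x, q in zip(a, query))
    nearest = sorted(train, key=lambda xy: dist(xy[0]))[:k]
    # Majority vote among the k nearest labels.
    return Counter(y for _, y in nearest).most_common(1)[0][0]

train = [((0.0, 0.0), -1), ((0.1, 0.2), -1),
         ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
print(knn_predict(train, (1.0, 0.9), k=3))  # 1
```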
A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). Formally, Bayesian networks are directed acyclic graphs whose nodes represent random variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes as input a particular set of values for the node's parent variables and gives the probability of the variable represented by the node.
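A tiny two-node network illustrates the idea. The variables and all probabilities below are made up for illustration only:

```python
# Toy Bayesian network with one edge: Funding -> Exit.
# P(funding) and P(exit | funding) are invented numbers.
p_funding = {"high": 0.3, "low": 0.7}
p_exit_given_funding = {
    ("yes", "high"): 0.4, ("no", "high"): 0.6,
    ("yes", "low"): 0.1, ("no", "low"): 0.9,
}

def joint(funding, exit_):
    # Chain rule over the DAG: P(F, E) = P(F) * P(E | F).
    return p_funding[funding] * p_exit_given_funding[(exit_, funding)]

# Marginal probability of an exit, summing over funding levels.
p_exit = sum(joint(f, "yes") for f in p_funding)
print(round(p_exit, 2))  # 0.19
```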
C4.5 is one of the decision tree family of techniques; it constructs a tree and can produce both a decision tree and rule sets. Besides that, C4.5 models are easy to understand, as the rules derived from the technique have a very straightforward interpretation. The J48 classifier is among the most popular and powerful decision tree classifiers. C5.0 and J48 are improved versions of the C4.5 algorithm; in short, J48 is an optimized implementation of C4.5.
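The split criterion at the heart of this family can be illustrated with a short information-gain computation on toy labels (C4.5 itself refines plain information gain into the gain ratio, which this sketch omits):

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label list, in bits.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_groups):
    # Entropy reduction achieved by splitting `labels` into groups.
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in split_groups)
    return entropy(labels) - remainder

labels = [1, 1, -1, -1]
# A perfect split separates the classes entirely -> gain = 1 bit.
print(information_gain(labels, [[1, 1], [-1, -1]]))  # 1.0
```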
Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. The individual decision trees are generated using a random selection of attributes at each node to determine the split. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned. Random forests are comparable in accuracy to AdaBoost, yet are more robust to errors and outliers. The generalization error for a forest converges as long as the number of trees in the forest is large; thus, overfitting is not a problem. The accuracy of a random forest depends on the strength of the individual classifiers and a measure of the dependence between them. The ideal is to maintain the strength of individual classifiers without increasing their correlation.
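The bootstrap-and-vote mechanics can be sketched with trivial one-feature "stumps" standing in for full trees. All data points here are invented, and real random forests grow deep trees with random feature subsets at every node:

```python
import random
from collections import Counter

random.seed(0)

def train_stump(sample):
    # Pick a random feature and threshold it at the sample mean,
    # predicting the majority label on each side of the split.
    i = random.randrange(len(sample[0][0]))
    thr = sum(x[i] for x, _ in sample) / len(sample)
    maj = lambda ys: Counter(ys).most_common(1)[0][0]
    overall = maj([y for _, y in sample])
    left = maj([y for x, y in sample if x[i] <= thr] or [overall])
    right = maj([y for x, y in sample if x[i] > thr] or [overall])
    return lambda x: left if x[i] <= thr else right

def train_forest(data, n_trees=25):
    # Each tree sees a bootstrap sample (drawn with replacement).
    return [train_stump([random.choice(data) for _ in data])
            for _ in range(n_trees)]

def forest_predict(forest, x):
    # Every tree votes; the most popular class wins.
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]

data = [((0.0, 0.1), -1), ((0.2, 0.0), -1),
        ((1.0, 0.9), 1), ((0.8, 1.0), 1)]
forest = train_forest(data)
print(forest_predict(forest, (0.9, 1.0)))
```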
JRip (RIPPER) is one of the basic and most popular rule-learning algorithms. Classes are examined in order of increasing size, and an initial set of rules for each class is generated using incremental reduced-error pruning. JRip (RIPPER) proceeds by treating all the examples of a particular judgment in the training data as a class and finding a set of rules that cover all the members of that class. Thereafter it proceeds to the next class and does the same, repeating this until all classes have been covered.
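The sequential-covering idea behind RIPPER can be sketched as follows, on toy data. Real RIPPER grows multi-condition rules and prunes them with reduced-error pruning, which this crude single-condition version omits:

```python
# Toy examples: (attribute dict, class label). All values invented.
data = [
    ({"industry": "software", "region": "CA"}, 1),
    ({"industry": "software", "region": "NY"}, 1),
    ({"industry": "biotech", "region": "NY"}, -1),
    ({"industry": "web", "region": "CA"}, -1),
]

def best_single_condition(examples, target):
    # Choose the (attribute, value) test covering the most target
    # examples while covering no counter-examples (a crude proxy
    # for RIPPER's rule-growing heuristic).
    candidates = {(a, v) for x, _ in examples for a, v in x.items()}
    best, best_cov = None, 0
    for a, v in candidates:
        pos = sum(1 for x, y in examples if x[a] == v and y == target)
        neg = sum(1 for x, y in examples if x[a] == v and y != target)
        if neg == 0 and pos > best_cov:
            best, best_cov = (a, v), pos
    return best

def learn_rules(examples, target):
    # Sequential covering: grow one rule, remove covered examples,
    # repeat until no target examples remain uncovered.
    rules, remaining = [], list(examples)
    while any(y == target for _, y in remaining):
        rule = best_single_condition(remaining, target)
        if rule is None:
            break
        rules.append(rule)
        a, v = rule
        remaining = [(x, y) for x, y in remaining if x[a] != v]
    return rules

print(learn_rules(data, 1))  # [('industry', 'software')]
```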
Data Mining Tool Selection: Weka
After explaining the data and methodology, we move on to choosing our data mining tools. Selection of appropriate data mining tools and techniques depends on the main task of the data mining process. In this project, we decided to use the Weka software as our data mining tool. Weka provides all the data mining algorithms we briefly explained above. Even though Weka expects its input data to be in ARFF format, we used the CSV converter to load our CSV data into Weka and apply the algorithms.
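The CSV-to-ARFF step can be sketched in a few lines on a toy snippet with invented attributes. Weka's own converter additionally handles numeric types, missing values and quoting, which this sketch ignores:

```python
import csv
import io

# A toy CSV snippet standing in for our real data.
csv_text = """industry,region,status_score
software,CA,1
biotech,NY,-1
"""

rows = list(csv.reader(io.StringIO(csv_text)))
header, body = rows[0], rows[1:]

lines = ["@relation crunchbase"]
for i, name in enumerate(header):
    # Declare each attribute as nominal with its observed values.
    values = sorted({r[i] for r in body})
    lines.append(f"@attribute {name} {{{','.join(values)}}}")
lines.append("@data")
lines += [",".join(r) for r in body]

arff = "\n".join(lines)
print(arff.splitlines()[0])  # @relation crunchbase
```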
We also took some screenshots of Weka, which are shown below.
algorithm accuracy precision recall f-measure
bayes.NaiveBayes 0.762476 0.977 0.762 0.851
bayes.BayesNet 0.784777 0.968 0.785 0.863
lazy.IBk 0.968504 0.956 0.969 0.962
trees.J48 0.976378 0.954 0.976 0.965
trees.RandomForest 0.976378 0.953 0.976 0.965
rules.JRip 0.976378 0.953 0.976 0.965
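For a binary problem, the metrics reported above are computed as follows (toy predictions for illustration, not the project's actual output; Weka additionally reports class-weighted averages):

```python
# Toy ground-truth labels and predictions, using the -1/1 encoding.
y_true = [1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1]

# Confusion-matrix counts for the positive class (1).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F-measure is the harmonic mean of precision and recall.
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, round(f_measure, 3))
```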
As the table shows, the instance-based, tree-based and rule-based classifiers (IBk, J48, RandomForest and JRip) clearly outperform the two Bayesian classifiers, reaching accuracies around 97% versus 76–78% for NaiveBayes and BayesNet, with correspondingly higher F-measures.
References
1. Ben Christensen, Sven Weber, John Otterson, "Venture Investing After the Bubble: A Decade of Evolution", SVB Financial Group.
2. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd edition.
3. Pang-Ning Tan, Vipin Kumar, Michael Steinbach, "Introduction to Data Mining".
4. Data source: www.crunchbase.com