Data
The first step in our project was to scrape data from eBay. Writing the scraping script was not difficult, but eBay's security measures, which block the IP addresses of web scrapers, made the process very difficult. We ended up implementing a system that uses Tor to get around those measures and successfully scrape the data.
Detailed Scraping Process
We looked at the eBay Motors section of the website to get bidding information for the motor vehicles auctioned on eBay. We designed a two-step scraping process to collect all the information:
eBay only shows 10k items in a search, so to maximize our dataset we scraped each car model separately. In this step, we scraped all of the search result pages for each model and collected the item id, car mileage, car year, and model data.
Then we tried to scrape the bid history of each item by using its item id. However, we soon found that eBay's security system blocks our IP and asks us to solve a captcha after we scrape a few hundred pages (most likely because of a limit on the number of requests per day or within a short timeframe). To get around that, we designed a distributed scraping system using Tor and the Polipo proxy. The Tor client listens on port 9050, and we configured Polipo to forward its traffic to that port. Our scraper connects to port 8123 (the Polipo port), which channels the traffic through Tor. As a result, we get a new IP every once in a while. To force IP renewal, we included a network restart bash script that switches the IP each time we get an error (presumably when eBay asks us to solve a captcha).
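The retry-and-renew loop described above can be sketched roughly as follows. This is a minimal illustration and not our actual scraper: the function names (build_opener, fetch_with_renewal), the retry limit, and the choice to treat every exception as a block are assumptions made for the example. Only the proxy port (8123) comes from the setup described above.

```python
import time
import urllib.request
from typing import Callable

# Polipo's HTTP proxy port, as described above; Polipo forwards to Tor.
POLIPO_PROXY = "http://127.0.0.1:8123"


def build_opener(proxy: str = POLIPO_PROXY) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def fetch_with_renewal(
    url: str,
    fetch: Callable[[str], str],
    renew_ip: Callable[[], None],
    max_attempts: int = 5,
    wait_seconds: float = 0.0,
) -> str:
    """Call fetch(url); on any error, ask for a new IP and try again."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:  # assumption: any failure means we were blocked
            last_error = exc
            renew_ip()  # hypothetical hook, e.g. run the network restart script
            time.sleep(wait_seconds)  # give the new circuit time to come up
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts") from last_error
```

In this sketch, fetch would typically wrap build_opener().open(url).read(), and renew_ip would invoke the restart script (for example through subprocess.run). Passing both in as callables keeps the retry logic testable without touching the network.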
After we had the data, we cleaned the dataset and ran some exploratory analysis to check the results.
Exploratory Analysis
Most Popular Car Model:
First we found which car models were the most popular (in fact, the word cloud above is the result!). We got the following pie chart from the data.
Popular bidding time:
The following histogram shows when most of the bids were made. It seems that 7pm-8pm is the most popular time for bidding.
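A histogram like this one can be computed by counting bids per hour of day. The snippet below is only a sketch: the timestamp format, the function names, and the sample values are assumptions for illustration and are not taken from our dataset. It also ignores time zones.

```python
from collections import Counter
from datetime import datetime


def bids_per_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count bids by hour of day (0-23) from timestamp strings."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [hours.get(h, 0) for h in range(24)]


def busiest_hour(counts):
    """Return the hour with the most bids (the earliest hour wins ties)."""
    return max(range(24), key=lambda h: counts[h])


# Made-up sample timestamps, for illustration only.
sample = [
    "2016-03-01 19:05:12",
    "2016-03-01 19:48:30",
    "2016-03-02 08:15:00",
]
counts = bids_per_hour(sample)
peak = busiest_hour(counts)
```

The 24-element list in counts can be passed directly to a plotting library as bar heights, one bar per hour.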
More analysis:
More analysis has been done in the notebook. Please check our Git repository for more.