diff --git a/README.md b/README.md index fb2a3fb..2035d0a 100644 --- a/README.md +++ b/README.md @@ -1,640 +1,9 @@ # Ethereum ETL +Convert blockchain data into convenient formats like CSVs and relational databases. + +[Read documentation here](https://ethereum-etl.readthedocs.io). + [![Join the chat at https://gitter.im/ethereum-eth](https://badges.gitter.im/ethereum-etl.svg)](https://gitter.im/ethereum-etl/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) [![Build Status](https://travis-ci.org/blockchain-etl/ethereum-etl.png)](https://travis-ci.org/blockchain-etl/ethereum-etl) -[Join Telegram Group](https://t.me/joinchat/GsMpbA3mv1OJ6YMp3T5ORQ) - -Install Ethereum ETL: - -```bash -pip3 install ethereum-etl -``` - -Export blocks and transactions ([Schema](#blockscsv), [Reference](#export_blocks_and_transactions)): - -```bash -> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \ ---provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv -``` - -Export ERC20 and ERC721 transfers ([Schema](#token_transferscsv), [Reference](#export_token_transfers)): - -```bash -> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv -``` - -Export traces ([Schema](#tracescsv), [Reference](#export_traces)): - -```bash -> ethereumetl export_traces --start-block 0 --end-block 500000 \ ---provider-uri file://$HOME/Library/Ethereum/parity.ipc --output traces.csv -``` - ---- - -Stream blocks, transactions, logs, token_transfers continually to console ([Reference](#stream)): - -```bash -> pip3 install ethereum-etl[streaming] -> ethereumetl stream --start-block 500000 -e block,transaction,log,token_transfer --log-file log.txt -``` - -Find other commands [here](#command-reference). - -For the latest version, check out the repo and call -```bash -> pip3 install -e . 
-> python3 ethereumetl.py -``` - -[LIMITATIONS](#limitations) - -## Table of Contents - -- [Schema](#schema) - - [blocks.csv](#blockscsv) - - [transactions.csv](#transactionscsv) - - [token_transfers.csv](#token_transferscsv) - - [receipts.csv](#receiptscsv) - - [logs.csv](#logscsv) - - [contracts.csv](#contractscsv) - - [tokens.csv](#tokenscsv) - - [traces.csv](#tracescsv) -- [Exporting the Blockchain](#exporting-the-blockchain) - - [Export in 2 Hours](#export-in-2-hours) - - [Command Reference](#command-reference) -- [Ethereum Classic Support](#ethereum-classic-support) -- [Querying in Amazon Athena](#querying-in-amazon-athena) -- [Querying in Google BigQuery](#querying-in-google-bigquery) - - [Public Dataset](#public-dataset) - - [How to Query Balances for all Ethereum Addresses](#how-to-query-balances-for-all-ethereum-addresses) - - [Building Token Recommender in Google Cloud Platform](#building-token-recommender-in-google-cloud-platform) -- [Blockchain ETL in Media](#blockchain-etl-in-media) - - -## Schema - -### blocks.csv - -Column | Type | -------------------|--------------------| -number | bigint | -hash | hex_string | -parent_hash | hex_string | -nonce | hex_string | -sha3_uncles | hex_string | -logs_bloom | hex_string | -transactions_root | hex_string | -state_root | hex_string | -receipts_root | hex_string | -miner | address | -difficulty | numeric | -total_difficulty | numeric | -size | bigint | -extra_data | hex_string | -gas_limit | bigint | -gas_used | bigint | -timestamp | bigint | -transaction_count | bigint | - -### transactions.csv - -Column | Type | ------------------|-------------| -hash | hex_string | -nonce | bigint | -block_hash | hex_string | -block_number | bigint | -transaction_index| bigint | -from_address | address | -to_address | address | -value | numeric | -gas | bigint | -gas_price | bigint | -input | hex_string | -block_timestamp | bigint | - -### token_transfers.csv - -Column | Type | ---------------------|-------------| 
-token_address | address | -from_address | address | -to_address | address | -value | numeric | -transaction_hash | hex_string | -log_index | bigint | -block_number | bigint | - -### receipts.csv - -Column | Type | ------------------------------|-------------| -transaction_hash | hex_string | -transaction_index | bigint | -block_hash | hex_string | -block_number | bigint | -cumulative_gas_used | bigint | -gas_used | bigint | -contract_address | address | -root | hex_string | -status | bigint | - -### logs.csv - -Column | Type | --------------------------|-------------| -log_index | bigint | -transaction_hash | hex_string | -transaction_index | bigint | -block_hash | hex_string | -block_number | bigint | -address | address | -data | hex_string | -topics | string | - -### contracts.csv - -Column | Type | ------------------------------|-------------| -address | address | -bytecode | hex_string | -function_sighashes | string | -is_erc20 | boolean | -is_erc721 | boolean | -block_number | bigint | - -### tokens.csv - -Column | Type | ------------------------------|-------------| -address | address | -symbol | string | -name | string | -decimals | bigint | -total_supply | numeric | - -### traces.csv - -Column | Type | ------------------------------|-------------| -block_number | bigint | -transaction_hash | hex_string | -transaction_index | bigint | -from_address | address | -to_address | address | -value | numeric | -input | hex_string | -output | hex_string | -trace_type | string | -call_type | string | -reward_type | string | -gas | bigint | -gas_used | bigint | -subtraces | bigint | -trace_address | string | -error | string | -status | bigint | - -You can find column descriptions in [https://github.com/medvedev1088/ethereum-etl-airflow](https://github.com/medvedev1088/ethereum-etl-airflow/tree/master/dags/resources/stages/raw/schemas) - -Note: for the `address` type all hex characters are lower-cased. -`boolean` type can have 2 values: `True` or `False`. 
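
The CSV columns above map naturally onto Python types. Below is a minimal sketch of reading `token_transfers.csv` with the standard library; the sample row is made up for illustration. Note that `value` must stay a Python `int`: a 32-byte unsigned integer overflows 64-bit integer and float types.

```python
import csv
import io

# A sample row shaped like the token_transfers.csv schema above.
# Addresses and values here are made up for illustration.
SAMPLE = """token_address,from_address,to_address,value,transaction_hash,log_index,block_number
0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0,0x0000000000000000000000000000000000000001,0x0000000000000000000000000000000000000002,115792089237316195423570985008687907853269984665640564039457584007913129639935,0xabc,0,500000
"""

def read_token_transfers(fileobj):
    """Yield token transfer rows with numeric columns parsed.

    `value` can be a full 32-byte unsigned integer (up to 2**256 - 1),
    so keep it as an arbitrary-precision Python int rather than a float.
    """
    for row in csv.DictReader(fileobj):
        row["value"] = int(row["value"])
        row["log_index"] = int(row["log_index"])
        row["block_number"] = int(row["block_number"])
        yield row

transfers = list(read_token_transfers(io.StringIO(SAMPLE)))
```

The sample `value` is the maximum `uint256`, which is why the BigQuery tables (see LIMITATIONS below) store these columns as strings.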
- -## LIMITATIONS - -- In case the contract is a proxy, which forwards all calls to a delegate, interface detection doesn’t work, -which means `is_erc20` and `is_erc721` will always be false for proxy contracts and they will be missing in the `tokens` -table. -- The metadata methods (`symbol`, `name`, `decimals`, `total_supply`) for ERC20 are optional, so around 10% of the -contracts are missing this data. Also some contracts (EOS) implement these methods but with wrong return type, -so the metadata columns are missing in this case as well. -- `token_transfers.value`, `tokens.decimals` and `tokens.total_supply` have type `STRING` in BigQuery tables, -because numeric types there can't handle 32-byte integers. You should use -`cast(value as FLOAT64)` (possible loss of precision) or -`safe_cast(value as NUMERIC)` (possible overflow) to convert to numbers. -- The contracts that don't implement `decimals()` function but have the -[fallback function](https://solidity.readthedocs.io/en/v0.4.21/contracts.html#fallback-function) that returns a `boolean` -will have `0` or `1` in the `decimals` column in the CSVs. - -## Exporting the Blockchain - -If you'd like to have the blockchain data platform -set up and hosted for you in AWS or GCP, get in touch with us -[here](https://d5ai.typeform.com/to/cmOoLe). - -1. Install python 3.5.3+ https://www.python.org/downloads/ - -1. You can use Infura if you don't need ERC20 transfers (Infura doesn't support eth_getFilterLogs JSON RPC method). -For that use `-p https://mainnet.infura.io` option for the commands below. If you need ERC20 transfers or want to -export the data ~40 times faster, you will need to set up a local Ethereum node: - -1. Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth - -1. Start geth. -Make sure it downloaded the blocks that you need by executing `eth.syncing` in the JS console. 
-You can export blocks below `currentBlock`, -there is no need to wait until the full sync as the state is not needed (unless you also need contracts bytecode -and token details; for those you need to wait until the full sync). - -1. Install Ethereum ETL: - - ```bash - > pip3 install ethereum-etl - ``` - -1. Export all: - - ```bash - > ethereumetl export_all --help - > ethereumetl export_all -s 0 -e 5999999 -b 100000 -p file://$HOME/Library/Ethereum/geth.ipc -o output - ``` - - In case `ethereumetl` command is not available in PATH, use `python3 -m ethereumetl` instead. - - The result will be in the `output` subdirectory, partitioned in Hive style: - - ```bash - output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv - output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv - ... - output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv - ... - output/token_transfers/start_block=00000000/end_block=00099999/token_transfers_00000000_00099999.csv - ... - ``` - -Should work with geth and parity, on Linux, Mac, Windows. -If you use Parity you should disable warp mode with `--no-warp` option because warp mode -does not place all of the block or receipt data into the database https://wiki.parity.io/Getting-Synced - -If you see weird behavior, e.g. wrong number of rows in the CSV files or corrupted files, -check out this issue: https://github.com/medvedev1088/ethereum-etl/issues/28 - -### Export in 2 Hours - -You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours. -Read this article for details https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2 - -### Running in Docker - -1. Install Docker https://docs.docker.com/install/ - -1. Build a docker image - ```bash - > docker build -t ethereum-etl:latest . - > docker image ls - ``` - -1. 
Run a container out of the image - ```bash - > docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 0 -e 5499999 -b 100000 -p https://mainnet.infura.io - > docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 2018-01-01 -e 2018-01-01 -p https://mainnet.infura.io - ``` - -1. Run streaming to console or Pub/Sub - ```bash - > docker build -t ethereum-etl:latest-streaming -f Dockerfile_with_streaming . - > echo "Stream to console" - > docker run ethereum-etl:latest-streaming stream --start-block 500000 --log-file log.txt - > echo "Stream to Pub/Sub" - > docker run -v /path_to_credentials_file/:/ethereum-etl/ --env GOOGLE_APPLICATION_CREDENTIALS=/ethereum-etl/credentials_file.json ethereum-etl:latest-streaming stream --start-block 500000 --output projects//topics/crypto_ethereum - ``` - -### Command Reference - -- [export_blocks_and_transactions](#export_blocks_and_transactions) -- [export_token_transfers](#export_token_transfers) -- [extract_token_transfers](#extract_token_transfers) -- [export_receipts_and_logs](#export_receipts_and_logs) -- [export_contracts](#export_contracts) -- [export_tokens](#export_tokens) -- [export_traces](#export_traces) -- [export_geth_traces](#export_geth_traces) -- [extract_geth_traces](#extract_geth_traces) -- [get_block_range_for_date](#get_block_range_for_date) -- [get_keccak_hash](#get_keccak_hash) -- [stream](#stream) - -All the commands accept `-h` parameter for help, e.g.: - -```bash -> ethereumetl export_blocks_and_transactions -h - -Usage: ethereumetl export_blocks_and_transactions [OPTIONS] - - Export blocks and transactions. - -Options: - -s, --start-block INTEGER Start block - -e, --end-block INTEGER End block [required] - -b, --batch-size INTEGER The number of blocks to export at a time. - -p, --provider-uri TEXT The URI of the web3 provider e.g. 
- file://$HOME/Library/Ethereum/geth.ipc or - https://mainnet.infura.io - -w, --max-workers INTEGER The maximum number of workers. - --blocks-output TEXT The output file for blocks. If not provided - blocks will not be exported. Use "-" for stdout - --transactions-output TEXT The output file for transactions. If not - provided transactions will not be exported. Use - "-" for stdout - -h, --help Show this message and exit. -``` - -For the `--output` parameters the supported types are csv and json. The format type is inferred from the output file name. - -#### export_blocks_and_transactions - -```bash -> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc \ ---blocks-output blocks.csv --transactions-output transactions.csv -``` - -Omit `--blocks-output` or `--transactions-output` options if you want to export only transactions/blocks. - -You can tune `--batch-size`, `--max-workers` for performance. - -[Blocks and transactions schema](#blockscsv). - -#### export_token_transfers - -The API used in this command is not supported by Infura, so you will need a local node. -If you want to use Infura for exporting ERC20 transfers refer to [extract_token_transfers](#extract_token_transfers) - -```bash -> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output token_transfers.csv -``` - -Include `--tokens --tokens ` to filter only certain tokens, e.g. - -```bash -> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv \ ---tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 --tokens 0x06012c8cf97bead5deae237070f9587f8e7a266d -``` - -You can tune `--batch-size`, `--max-workers` for performance. - -[Token transfers schema](#token_transferscsv). 
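
Internally, ERC20 and ERC721 transfers are identified by the well-known topic of the `Transfer(address,address,uint256)` event. The following is a rough sketch of how one receipt log maps onto the token transfer columns above; the `extract_transfer` helper and the sample log are illustrative, not the tool's actual code:

```python
# keccak256("Transfer(address,address,uint256)") - the ERC20/ERC721 Transfer event topic.
TRANSFER_EVENT_TOPIC = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'

def extract_transfer(log):
    """Turn a raw receipt log dict into a token transfer dict, or None."""
    topics = log['topics']
    if not topics or topics[0] != TRANSFER_EVENT_TOPIC:
        return None  # not a Transfer event
    return {
        'token_address': log['address'],
        # indexed from/to addresses are 32-byte topics; the address is the last 20 bytes
        'from_address': '0x' + topics[1][-40:],
        'to_address': '0x' + topics[2][-40:],
        'value': int(log['data'], 16),
        'transaction_hash': log['transaction_hash'],
        'log_index': log['log_index'],
        'block_number': log['block_number'],
    }

# Hypothetical log entry for illustration.
sample_log = {
    'address': '0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0',
    'topics': [
        TRANSFER_EVENT_TOPIC,
        '0x' + '00' * 12 + '11' * 20,
        '0x' + '00' * 12 + '22' * 20,
    ],
    'data': '0x' + format(10**18, '064x'),
    'transaction_hash': '0x' + 'ab' * 32,
    'log_index': 0,
    'block_number': 500000,
}

transfer = extract_transfer(sample_log)
```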
- -#### export_receipts_and_logs - -First extract transaction hashes from `transactions.csv` -(Exported with [export_blocks_and_transactions](#export_blocks_and_transactions)): - -```bash -> ethereumetl extract_csv_column --input transactions.csv --column hash --output transaction_hashes.txt -``` - -Then export receipts and logs: - -```bash -> ethereumetl export_receipts_and_logs --transaction-hashes transaction_hashes.txt \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv -``` - -Omit `--receipts-output` or `--logs-output` options if you want to export only logs/receipts. - -You can tune `--batch-size`, `--max-workers` for performance. - -Upvote this feature request https://github.com/paritytech/parity/issues/9075, -it will make receipts and logs export much faster. - -[Receipts and logs schema](#receiptscsv). - -#### extract_token_transfers - -First export receipt logs with [export_receipts_and_logs](#export_receipts_and_logs). - -Then extract transfers from the logs.csv file: - -```bash -> ethereumetl extract_token_transfers --logs logs.csv --output token_transfers.csv -``` - -You can tune `--batch-size`, `--max-workers` for performance. - -[Token transfers schema](#token_transferscsv). - -#### export_contracts - -First extract contract addresses from `receipts.csv` -(Exported with [export_receipts_and_logs](#export_receipts_and_logs)): - -```bash -> ethereumetl extract_csv_column --input receipts.csv --column contract_address --output contract_addresses.txt -``` - -Then export contracts: - -```bash -> ethereumetl export_contracts --contract-addresses contract_addresses.txt \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc --output contracts.csv -``` - -You can tune `--batch-size`, `--max-workers` for performance. - -[Contracts schema](#contractscsv). 
- -#### export_tokens - -First extract token addresses from `contracts.json` -(Exported with [export_contracts](#export_contracts)): - -```bash -> ethereumetl filter_items -i contracts.json -p "item['is_erc20'] or item['is_erc721']" | \ -ethereumetl extract_field -f address -o token_addresses.txt -``` - -Then export ERC20 / ERC721 tokens: - -```bash -> ethereumetl export_tokens --token-addresses token_addresses.txt \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc --output tokens.csv -``` - -You can tune `--max-workers` for performance. - -[Tokens schema](#tokenscsv). - -#### export_traces - -Also called internal transactions. -The API used in this command is not supported by Infura, -so you will need a local Parity archive node (`parity --tracing on`). -Make sure your node has at least 8GB of memory, or else you will face timeout errors. -See [this issue](https://github.com/blockchain-etl/ethereum-etl/issues/137) - -```bash -> ethereumetl export_traces --start-block 0 --end-block 500000 \ ---provider-uri file://$HOME/Library/Ethereum/parity.ipc --batch-size 100 --output traces.csv -``` - -You can tune `--batch-size`, `--max-workers` for performance. - -[Traces schema](#tracescsv). - -#### export_geth_traces - -Read [Differences between geth and parity traces.csv](#differences-between-geth-and-parity-tracescsv) - -The API used in this command is not supported by Infura, -so you will need a local Geth archive node (`geth --gcmode archive --syncmode full --ipcapi debug`). -When using rpc, add `--rpc --rpcapi debug` options. - -```bash -> ethereumetl export_geth_traces --start-block 0 --end-block 500000 \ ---provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output geth_traces.json -``` - -You can tune `--batch-size`, `--max-workers` for performance. - -#### extract_geth_traces - -```bash -> ethereumetl extract_geth_traces --input geth_traces.json --output traces.csv -``` - -You can tune `--batch-size`, `--max-workers` for performance. 
- -#### get_block_range_for_date - -```bash -> ethereumetl get_block_range_for_date --provider-uri=https://mainnet.infura.io --date 2018-01-01 -4832686,4838611 -``` - -#### get_keccak_hash - -```bash -> ethereumetl get_keccak_hash -i "transfer(address,uint256)" -0xa9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b -``` - -#### stream - -```bash -> pip3 install ethereum-etl[streaming] -> ethereumetl stream --provider-uri https://mainnet.infura.io --start-block 500000 -``` - -- This command outputs blocks, transactions, logs, token_transfers to the console by default. -- Entity types can be specified with the `-e` option, -e.g. `-e block,transaction,log,token_transfer,trace,contract,token`. -- Use `--output` option to specify the Google Pub/Sub topic where to publish blockchain data, -e.g. `projects//topics/bitcoin_blockchain`. Data will be pushed to -`projects//topics/bitcoin_blockchain.blocks`, `projects//topics/bitcoin_blockchain.transactions` -etc. topics. -- The command saves its state to `last_synced_block.txt` file where the last synced block number is saved periodically. -- Specify either `--start-block` or `--last-synced-block-file` option. `--last-synced-block-file` should point to the -file where the block number, from which to start streaming the blockchain data, is saved. -- Use the `--lag` option to specify how many blocks to lag behind the head of the blockchain. It's the simplest way to -handle chain reorganizations - they are less likely the further a block from the head. -- You can tune `--period-seconds`, `--batch-size`, `--block-batch-size`, `--max-workers` for performance. -- Refer to [blockchain-etl-streaming](https://github.com/blockchain-etl/blockchain-etl-streaming) for -instructions on deploying it to Kubernetes. 
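
The `--lag` and `--last-synced-block-file` mechanics described above can be sketched as a single polling iteration. `stream_once` and its callbacks are illustrative stand-ins under stated assumptions, not the actual ethereum-etl implementation:

```python
import os
import tempfile

def stream_once(get_head_block, process_batch,
                last_synced_block_file='last_synced_block.txt',
                start_block=0, lag=0, block_batch_size=10):
    """One polling iteration of a stream-style loop: export new blocks up to
    head - lag, then persist progress so a restart resumes where it left off."""
    if os.path.isfile(last_synced_block_file):
        with open(last_synced_block_file) as f:
            last_synced = int(f.read().strip())
    else:
        last_synced = start_block - 1
    target = get_head_block() - lag  # stay behind the head to dodge most reorgs
    while last_synced < target:
        batch_end = min(last_synced + block_batch_size, target)
        process_batch(last_synced + 1, batch_end)
        last_synced = batch_end
        with open(last_synced_block_file, 'w') as f:
            f.write(str(last_synced))  # periodically persisted state
    return last_synced

# Demo with a fake chain head and a recording callback, using a temp state file.
state_file = os.path.join(tempfile.mkdtemp(), 'last_synced_block.txt')
batches = []
synced = stream_once(lambda: 110, lambda s, e: batches.append((s, e)),
                     last_synced_block_file=state_file,
                     start_block=100, lag=5, block_batch_size=3)
```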
- -Stream blockchain data continually to Google Pub/Sub: - -```bash -> export GOOGLE_APPLICATION_CREDENTIALS=/path_to_credentials_file.json -> ethereumetl stream --start-block 500000 --output projects//topics/crypto_ethereum -``` - -### Running Tests - -```bash -> pip3 install -e .[dev,streaming] -> export ETHEREUM_ETL_RUN_SLOW_TESTS=True -> pytest -vv -``` - -### Running Tox Tests - -```bash -> pip3 install tox -> tox -``` - -### Ethereum Classic Support - -For getting ETC csv files, make sure you pass in the `--chain classic` param where it's required for the scripts you want to export. -ETC won't run if your `--provider-uri` is Infura. It will provide a warning and change the provider-uri to `https://ethereumclassic.network` instead. For faster performance, run a client instead locally for classic such as `parity chain=classic` and Geth-classic. - -### Differences between geth and parity traces.csv - -- `to_address` field differs for `callcode` trace (geth seems to return correct value, as parity value of `to_address` is same as `to_address` of parent call); -- geth output doesn't have `reward` traces; -- geth output doesn't have `to_address`, `from_address`, `value` for `suicide` traces; -- `error` field contains human readable error message, which might differ in geth/parity output; -- geth output doesn't have `transaction_hash`; -- `gas_used` is 0 on traces with error in geth, empty in parity; -- zero output of subcalls is `0x000...` in geth, `0x` in parity; - -## Querying in Amazon Athena - -- Upload the files to S3: - -```bash -> cd output -> aws s3 sync . 
s3:///ethereumetl/export --region ap-southeast-1 -``` - -- Sign in to Athena https://console.aws.amazon.com/athena/home - -- Create a database: - -```sql -CREATE DATABASE ethereumetl; -``` - -- Create the tables: - - blocks: [schemas/aws/blocks.sql](schemas/aws/blocks.sql) - - transactions: [schemas/aws/transactions.sql](schemas/aws/transactions.sql) - - token_transfers: [schemas/aws/token_transfers.sql](schemas/aws/token_transfers.sql) - - contracts: [schemas/aws/contracts.sql](schemas/aws/contracts.sql) - - receipts: [schemas/aws/receipts.sql](schemas/aws/receipts.sql) - - logs: [schemas/aws/logs.sql](schemas/aws/logs.sql) - - tokens: [schemas/aws/tokens.sql](schemas/aws/tokens.sql) - -### Airflow DAGs - -Refer to https://github.com/medvedev1088/ethereum-etl-airflow for the instructions. - -### Tables for Parquet Files - -Read this article on how to convert CSVs to Parquet https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30 - -- Create the tables: - - parquet_blocks: [schemas/aws/parquet/parquet_blocks.sql](schemas/aws/parquet/parquet_blocks.sql) - - parquet_transactions: [schemas/aws/parquet/parquet_transactions.sql](schemas/aws/parquet/parquet_transactions.sql) - - parquet_token_transfers: [schemas/aws/parquet/parquet_token_transfers.sql](schemas/aws/parquet/parquet_token_transfers.sql) - -Note that DECIMAL type is limited to 38 digits in Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal -so values greater than 38 decimals will be null. 
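
The 38-digit limit matters because token values are 32-byte unsigned integers: the largest `uint256` has 78 decimal digits, roughly double what Hive's DECIMAL can hold. A quick arithmetic check:

```python
# The largest value a uint256 column (e.g. token_transfers.value) can hold.
max_uint256 = 2**256 - 1
digits = len(str(max_uint256))

# Far beyond DECIMAL(38), which is why such values come out as null.
fits_in_decimal38 = digits <= 38
```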
- -## Querying in Google BigQuery - -### Public Dataset - -You can query the data that's updated daily in the public BigQuery dataset -https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579 - -### How to Query Balances for all Ethereum Addresses - -Read this article -https://medium.com/google-cloud/how-to-query-balances-for-all-ethereum-addresses-in-bigquery-fb594e4034a7 - -### Building Token Recommender in Google Cloud Platform - -Read this article -https://medium.com/google-cloud/building-token-recommender-in-google-cloud-platform-1be5a54698eb - -## Blockchain ETL in Media - -- A Technical Breakdown Of Google's New Blockchain Search Tools: https://www.forbes.com/sites/michaeldelcastillo/2019/02/05/google-launches-search-for-bitcoin-ethereum-bitcoin-cash-dash-dogecoin-ethereum-classic-litecoin-and-zcash/#394fc868c789 -- Navigating Bitcoin, Ethereum, XRP: How Google Is Quietly Making Blockchains Searchable: https://www.forbes.com/sites/michaeldelcastillo/2019/02/04/navigating-bitcoin-ethereum-xrp-how-google-is-quietly-making-blockchains-searchable/?ss=crypto-blockchain#49e111da4248 diff --git a/docs/amazon-athena.md b/docs/amazon-athena.md new file mode 100644 index 0000000..374edb6 --- /dev/null +++ b/docs/amazon-athena.md @@ -0,0 +1,42 @@ +# Amazon Athena + +## Querying in Amazon Athena + +- Upload the files to S3: + +```bash +> cd output +> aws s3 sync . 
s3:///ethereumetl/export --region ap-southeast-1 +``` + +- Sign in to Athena https://console.aws.amazon.com/athena/home + +- Create a database: + +```sql +CREATE DATABASE ethereumetl; +``` + +- Create the tables: + - blocks: [schemas/aws/blocks.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/blocks.sql) + - transactions: [schemas/aws/transactions.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/transactions.sql) + - token_transfers: [schemas/aws/token_transfers.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/token_transfers.sql) + - contracts: [schemas/aws/contracts.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/contracts.sql) + - receipts: [schemas/aws/receipts.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/receipts.sql) + - logs: [schemas/aws/logs.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/logs.sql) + - tokens: [schemas/aws/tokens.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/tokens.sql) + +## Airflow DAGs + +Refer to https://github.com/medvedev1088/ethereum-etl-airflow for the instructions. + +## Tables for Parquet Files + +Read [this article](https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30) on how to convert CSVs to Parquet. 
+ +- Create the tables: + - parquet_blocks: [schemas/aws/parquet/parquet_blocks.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/parquet/parquet_blocks.sql) + - parquet_transactions: [schemas/aws/parquet/parquet_transactions.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/parquet/parquet_transactions.sql) + - parquet_token_transfers: [schemas/aws/parquet/parquet_token_transfers.sql](https://github.com/blockchain-etl/ethereum-etl/blob/master/schemas/aws/parquet/parquet_token_transfers.sql) + +Note that [DECIMAL type is limited to 38 digits in Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal) so values greater than 38 decimals will be null. \ No newline at end of file diff --git a/docs/command-reference.md b/docs/command-reference.md new file mode 100644 index 0000000..2847664 --- /dev/null +++ b/docs/command-reference.md @@ -0,0 +1,241 @@ +# Command Reference + +- [export_blocks_and_transactions](#export_blocks_and_transactions) +- [export_token_transfers](#export_token_transfers) +- [extract_token_transfers](#extract_token_transfers) +- [export_receipts_and_logs](#export_receipts_and_logs) +- [export_contracts](#export_contracts) +- [export_tokens](#export_tokens) +- [export_traces](#export_traces) +- [export_geth_traces](#export_geth_traces) +- [extract_geth_traces](#extract_geth_traces) +- [get_block_range_for_date](#get_block_range_for_date) +- [get_keccak_hash](#get_keccak_hash) +- [stream](#stream) + +All the commands accept `-h` parameter for help, e.g.: + +```bash +> ethereumetl export_blocks_and_transactions -h + +Usage: ethereumetl export_blocks_and_transactions [OPTIONS] + + Export blocks and transactions. + +Options: + -s, --start-block INTEGER Start block + -e, --end-block INTEGER End block [required] + -b, --batch-size INTEGER The number of blocks to export at a time. + -p, --provider-uri TEXT The URI of the web3 provider e.g. 
+ file://$HOME/Library/Ethereum/geth.ipc or + https://mainnet.infura.io + -w, --max-workers INTEGER The maximum number of workers. + --blocks-output TEXT The output file for blocks. If not provided + blocks will not be exported. Use "-" for stdout + --transactions-output TEXT The output file for transactions. If not + provided transactions will not be exported. Use + "-" for stdout + -h, --help Show this message and exit. +``` + +For the `--output` parameters the supported types are csv and json. The format type is inferred from the output file name. + +#### export_blocks_and_transactions + +```bash +> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc \ +--blocks-output blocks.csv --transactions-output transactions.csv +``` + +Omit `--blocks-output` or `--transactions-output` options if you want to export only transactions/blocks. + +You can tune `--batch-size`, `--max-workers` for performance. + +[Blocks and transactions schema](#blockscsv). + +#### export_token_transfers + +The API used in this command is not supported by Infura, so you will need a local node. +If you want to use Infura for exporting ERC20 transfers refer to [extract_token_transfers](#extract_token_transfers) + +```bash +> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output token_transfers.csv +``` + +Include `--tokens --tokens ` to filter only certain tokens, e.g. + +```bash +> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv \ +--tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 --tokens 0x06012c8cf97bead5deae237070f9587f8e7a266d +``` + +You can tune `--batch-size`, `--max-workers` for performance. + +[Token transfers schema](#token_transferscsv). 
+ +#### export_receipts_and_logs + +First extract transaction hashes from `transactions.csv` +(Exported with [export_blocks_and_transactions](#export_blocks_and_transactions)): + +```bash +> ethereumetl extract_csv_column --input transactions.csv --column hash --output transaction_hashes.txt +``` + +Then export receipts and logs: + +```bash +> ethereumetl export_receipts_and_logs --transaction-hashes transaction_hashes.txt \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv +``` + +Omit `--receipts-output` or `--logs-output` options if you want to export only logs/receipts. + +You can tune `--batch-size`, `--max-workers` for performance. + +Upvote this feature request https://github.com/paritytech/parity/issues/9075, +it will make receipts and logs export much faster. + +[Receipts and logs schema](#receiptscsv). + +#### extract_token_transfers + +First export receipt logs with [export_receipts_and_logs](#export_receipts_and_logs). + +Then extract transfers from the logs.csv file: + +```bash +> ethereumetl extract_token_transfers --logs logs.csv --output token_transfers.csv +``` + +You can tune `--batch-size`, `--max-workers` for performance. + +[Token transfers schema](#token_transferscsv). + +#### export_contracts + +First extract contract addresses from `receipts.csv` +(Exported with [export_receipts_and_logs](#export_receipts_and_logs)): + +```bash +> ethereumetl extract_csv_column --input receipts.csv --column contract_address --output contract_addresses.txt +``` + +Then export contracts: + +```bash +> ethereumetl export_contracts --contract-addresses contract_addresses.txt \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output contracts.csv +``` + +You can tune `--batch-size`, `--max-workers` for performance. + +[Contracts schema](#contractscsv). 
+ +#### export_tokens + +First extract token addresses from `contracts.json` +(Exported with [export_contracts](#export_contracts)): + +```bash +> ethereumetl filter_items -i contracts.json -p "item['is_erc20'] or item['is_erc721']" | \ +ethereumetl extract_field -f address -o token_addresses.txt +``` + +Then export ERC20 / ERC721 tokens: + +```bash +> ethereumetl export_tokens --token-addresses token_addresses.txt \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output tokens.csv +``` + +You can tune `--max-workers` for performance. + +[Tokens schema](#tokenscsv). + +#### export_traces + +Also called internal transactions. +The API used in this command is not supported by Infura, +so you will need a local Parity archive node (`parity --tracing on`). +Make sure your node has at least 8GB of memory, or else you will face timeout errors. +See [this issue](https://github.com/blockchain-etl/ethereum-etl/issues/137) + +```bash +> ethereumetl export_traces --start-block 0 --end-block 500000 \ +--provider-uri file://$HOME/Library/Ethereum/parity.ipc --batch-size 100 --output traces.csv +``` + +You can tune `--batch-size`, `--max-workers` for performance. + +[Traces schema](#tracescsv). + +#### export_geth_traces + +Read [Differences between geth and parity traces.csv](#differences-between-geth-and-parity-tracescsv) + +The API used in this command is not supported by Infura, +so you will need a local Geth archive node (`geth --gcmode archive --syncmode full --ipcapi debug`). +When using rpc, add `--rpc --rpcapi debug` options. + +```bash +> ethereumetl export_geth_traces --start-block 0 --end-block 500000 \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output geth_traces.json +``` + +You can tune `--batch-size`, `--max-workers` for performance. + +#### extract_geth_traces + +```bash +> ethereumetl extract_geth_traces --input geth_traces.json --output traces.csv +``` + +You can tune `--batch-size`, `--max-workers` for performance. 
+
+#### get_block_range_for_date
+
+```bash
+> ethereumetl get_block_range_for_date --provider-uri=https://mainnet.infura.io --date 2018-01-01
+4832686,4838611
+```
+
+#### get_keccak_hash
+
+```bash
+> ethereumetl get_keccak_hash -i "transfer(address,uint256)"
+0xa9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b
+```
+
+#### stream
+
+```bash
+> pip3 install ethereum-etl[streaming]
+> ethereumetl stream --provider-uri https://mainnet.infura.io --start-block 500000
+```
+
+- This command outputs blocks, transactions, logs, and token_transfers to the console by default.
+- Entity types can be specified with the `-e` option,
+e.g. `-e block,transaction,log,token_transfer,trace,contract,token`.
+- Use the `--output` option to specify the Google Pub/Sub topic where blockchain data should be published,
+e.g. `projects//topics/bitcoin_blockchain`. Data will be pushed to the
+`projects//topics/bitcoin_blockchain.blocks`, `projects//topics/bitcoin_blockchain.transactions`,
+etc. topics.
+- The command saves its state to the `last_synced_block.txt` file, where the last synced block number is saved periodically.
+- Specify either the `--start-block` or the `--last-synced-block-file` option. `--last-synced-block-file` should point to the
+file where the block number from which to start streaming the blockchain data is saved.
+- Use the `--lag` option to specify how many blocks to lag behind the head of the blockchain. This is the simplest way to
+handle chain reorganizations: they are less likely the further a block is from the head.
+- You can tune `--period-seconds`, `--batch-size`, `--block-batch-size`, `--max-workers` for performance.
+- Refer to [blockchain-etl-streaming](https://github.com/blockchain-etl/blockchain-etl-streaming) for
+instructions on deploying it to Kubernetes.
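Since block timestamps never decrease with block number, get_block_range_for_date can be implemented as a binary search over timestamps. A minimal sketch of that idea (the `get_block_timestamp` callback is a hypothetical stand-in for an `eth_getBlockByNumber` RPC call, not the library's actual code):

```python
import bisect

def block_range_for_timestamps(get_block_timestamp, latest_block, start_ts, end_ts):
    """Return (first, last) block numbers whose timestamps fall in
    [start_ts, end_ts]. get_block_timestamp(n) must be non-decreasing in n."""
    class LazyTimestamps:
        # A lazy sequence so bisect never materializes the whole chain;
        # each lookup costs one RPC call.
        def __getitem__(self, n):
            return get_block_timestamp(n)
        def __len__(self):
            return latest_block + 1

    timestamps = LazyTimestamps()
    first = bisect.bisect_left(timestamps, start_ts)    # first block with timestamp >= start_ts
    last = bisect.bisect_right(timestamps, end_ts) - 1  # last block with timestamp <= end_ts
    return first, last
```

Only O(log n) blocks are fetched, which is why the command is fast even against a remote provider.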
+
+Stream blockchain data continually to Google Pub/Sub:
+
+```bash
+> export GOOGLE_APPLICATION_CREDENTIALS=/path_to_credentials_file.json
+> ethereumetl stream --start-block 500000 --output projects//topics/crypto_ethereum
+```
diff --git a/docs/contact.md b/docs/contact.md
new file mode 100644
index 0000000..8e81f7c
--- /dev/null
+++ b/docs/contact.md
@@ -0,0 +1,4 @@
+# Contact
+
+- [D5 Discord Server](https://discord.gg/wukrezR)
+- [Telegram Group](https://t.me/joinchat/GsMpbA3mv1OJ6YMp3T5ORQ)
diff --git a/docs/ethereum-classic.md b/docs/ethereum-classic.md
new file mode 100644
index 0000000..873fb22
--- /dev/null
+++ b/docs/ethereum-classic.md
@@ -0,0 +1,4 @@
+# Ethereum Classic
+
+To export ETC CSV files, pass the `--chain classic` option to the commands that require it.
+Infura is not supported for ETC: if your `--provider-uri` points to Infura, the tool prints a warning and switches the provider URI to `https://ethereumclassic.network` instead. For faster performance, run a local Classic client instead, such as Parity (`parity --chain classic`) or Geth Classic.
\ No newline at end of file
diff --git a/docs/exporting-the-blockchain.md b/docs/exporting-the-blockchain.md
new file mode 100644
index 0000000..e1fed82
--- /dev/null
+++ b/docs/exporting-the-blockchain.md
@@ -0,0 +1,89 @@
+## Exporting the Blockchain
+
+If you'd like to have blockchain data set up and hosted for you, [get in touch with us at D5](https://d5.ai/?ref=ethereumetl).
+
+1. Install Python 3.5.3+ https://www.python.org/downloads/
+
+1. You can use Infura if you don't need ERC20 transfers (Infura doesn't support the eth_getFilterLogs JSON RPC method).
+For that, use the `-p https://mainnet.infura.io` option for the commands below. If you need ERC20 transfers or want to
+export the data ~40 times faster, you will need to set up a local Ethereum node:
+
+1. Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth
+
+1. Start geth.
+Make sure it downloaded the blocks that you need by executing `eth.syncing` in the JS console.
+You can export blocks below `currentBlock`;
+there is no need to wait for the full sync, as the state is not needed (unless you also need contract bytecode
+and token details; for those you do need to wait for the full sync).
+
+1. Install Ethereum ETL: `> pip3 install ethereum-etl`
+
+1. Export all:
+
+```bash
+> ethereumetl export_all --help
+> ethereumetl export_all -s 0 -e 5999999 -b 100000 -p file://$HOME/Library/Ethereum/geth.ipc -o output
+```
+
+If the `ethereumetl` command is not available in PATH, use `python3 -m ethereumetl` instead.
+
+The result will be in the `output` subdirectory, partitioned in Hive style:
+```bash
+output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
+output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv
+...
+output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv
+...
+output/token_transfers/start_block=00000000/end_block=00099999/token_transfers_00000000_00099999.csv
+...
+```
+
+This works with geth and parity, on Linux, Mac, and Windows.
+If you use Parity, you should disable warp mode with the `--no-warp` option, because warp mode
+does not place all of the block or receipt data into the database https://wiki.parity.io/Getting-Synced
+
+If you see weird behavior, e.g. a wrong number of rows in the CSV files or corrupted files,
+check out this issue: https://github.com/medvedev1088/ethereum-etl/issues/28
+
+### Export in 2 Hours
+
+You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours.
+Read [this article](https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2) for details.
+
+### Running in Docker
+
+1. Install Docker https://docs.docker.com/install/
+
+2. Build a docker image
+
+    > docker build -t ethereum-etl:latest .
+    > docker image ls
+
+3. Run a container out of the image
+
+    > docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 0 -e 5499999 -b 100000 -p https://mainnet.infura.io
+    > docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 2018-01-01 -e 2018-01-01 -p https://mainnet.infura.io
+
+4. Run streaming to console or Pub/Sub
+
+    > docker build -t ethereum-etl:latest-streaming -f Dockerfile_with_streaming .
+    > echo "Stream to console"
+    > docker run ethereum-etl:latest-streaming stream --start-block 500000 --log-file log.txt
+    > echo "Stream to Pub/Sub"
+    > docker run -v /path_to_credentials_file/:/ethereum-etl/ --env GOOGLE_APPLICATION_CREDENTIALS=/ethereum-etl/credentials_file.json ethereum-etl:latest-streaming stream --start-block 500000 --output projects//topics/crypto_ethereum
+
+### Running Tests
+
+```bash
+> pip3 install -e .[dev,streaming]
+> export ETHEREUM_ETL_RUN_SLOW_TESTS=True
+> pytest -vv
+```
+
+### Running Tox Tests
+
+```bash
+> pip3 install tox
+> tox
+```
diff --git a/docs/google-bigquery.md b/docs/google-bigquery.md
new file mode 100644
index 0000000..9de2184
--- /dev/null
+++ b/docs/google-bigquery.md
@@ -0,0 +1,16 @@
+# Google BigQuery
+
+## Querying in BigQuery
+
+If you'd rather not export the blockchain data yourself, we publish all tables as a public dataset in [BigQuery](https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579).
+
+Data is updated in near real time (~4-minute delay to account for block finality).
+
+### How to Query Balances for all Ethereum Addresses
+
+Read [this article](https://medium.com/google-cloud/how-to-query-balances-for-all-ethereum-addresses-in-bigquery-fb594e4034a7).
+
+### Building Token Recommender in Google Cloud Platform
+
+Read [this article](
+https://medium.com/google-cloud/building-token-recommender-in-google-cloud-platform-1be5a54698eb).
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..c691235
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,22 @@
+# Overview
+
+Convert blockchain data into convenient formats like CSVs and relational databases.
+
+Ethereum ETL is the most popular open source project for Ethereum data, with 700+ stars on GitHub.
+
+## Features
+
+Easily export:
+
+* Blocks
+* Transactions
+* ERC20 / ERC721 tokens
+* Token transfers
+* Receipts
+* Logs
+* Contracts
+* Internal transactions
+
+## Projects using Ethereum ETL
+* Google - Public BigQuery Ethereum datasets
+* Nansen by D5 - Analytics platform for Ethereum
diff --git a/docs/limitations.md b/docs/limitations.md
new file mode 100644
index 0000000..dfd8766
--- /dev/null
+++ b/docs/limitations.md
@@ -0,0 +1,15 @@
+# Limitations
+
+- If a contract is a proxy that forwards all calls to a delegate, interface detection doesn't work,
+which means `is_erc20` and `is_erc721` will always be false for proxy contracts, and they will be missing from the `tokens`
+table.
+- The metadata methods (`symbol`, `name`, `decimals`, `total_supply`) for ERC20 are optional, so around 10% of the
+contracts are missing this data. Some contracts (e.g. EOS) implement these methods but with the wrong return type,
+so the metadata columns are missing in those cases as well.
+- `token_transfers.value`, `tokens.decimals` and `tokens.total_supply` have type `STRING` in the BigQuery tables,
+because numeric types there can't handle 32-byte integers. Use
+`cast(value as FLOAT64)` (possible loss of precision) or
+`safe_cast(value as NUMERIC)` (possible overflow) to convert them to numbers.
+- Contracts that don't implement the `decimals()` function but have a
+[fallback function](https://solidity.readthedocs.io/en/v0.4.21/contracts.html#fallback-function) that returns a `boolean`
+will have `0` or `1` in the `decimals` column in the CSVs.
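To make the precision point concrete: when the exported CSVs are processed in Python, parsing the string values as Python ints (which are arbitrary precision) avoids both the overflow and the precision loss mentioned above. A small illustrative sketch, not part of the library:

```python
def parse_uint256(raw):
    """Parse a token value serialized as a decimal string; Python ints are
    arbitrary precision, so 32-byte unsigned integers round-trip exactly."""
    return int(raw) if raw else None

max_uint256 = 2**256 - 1
assert parse_uint256(str(max_uint256)) == max_uint256
# float64 cannot represent max_uint256 exactly, which is why
# cast(value as FLOAT64) can lose precision:
assert float(max_uint256) != max_uint256
```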
\ No newline at end of file diff --git a/docs/media.md b/docs/media.md new file mode 100644 index 0000000..9bbf9dc --- /dev/null +++ b/docs/media.md @@ -0,0 +1,4 @@ +## Ethereum ETL in the Media + +- [A Technical Breakdown Of Google's New Blockchain Search Tools](https://www.forbes.com/sites/michaeldelcastillo/2019/02/05/google-launches-search-for-bitcoin-ethereum-bitcoin-cash-dash-dogecoin-ethereum-classic-litecoin-and-zcash/#394fc868c789) +- [Navigating Bitcoin, Ethereum, XRP: How Google Is Quietly Making Blockchains Searchable](https://www.forbes.com/sites/michaeldelcastillo/2019/02/04/navigating-bitcoin-ethereum-xrp-how-google-is-quietly-making-blockchains-searchable/?ss=crypto-blockchain#49e111da4248) diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 0000000..881c76c --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,43 @@ +# Quickstart + +Install Ethereum ETL: + +```bash +pip3 install ethereum-etl +``` + +Export blocks and transactions ([Schema](#blockscsv), [Reference](#export_blocks_and_transactions)): + +```bash +> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \ +--provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv +``` + +Export ERC20 and ERC721 transfers ([Schema](#token_transferscsv), [Reference](#export_token_transfers)): + +```bash +> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \ +--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv +``` + +Export traces ([Schema](#tracescsv), [Reference](#export_traces)): + +```bash +> ethereumetl export_traces --start-block 0 --end-block 500000 \ +--provider-uri file://$HOME/Library/Ethereum/parity.ipc --output traces.csv +``` + +Stream blocks, transactions, logs, token_transfers continually to console ([Reference](#stream)): + +```bash +> pip3 install ethereum-etl[streaming] +> ethereumetl stream --start-block 500000 -e 
block,transaction,log,token_transfer --log-file log.txt +``` + +Find other commands [here](#command-reference). + +For the latest version, check out the repo and call +```bash +> pip3 install -e . +> python3 ethereumetl.py +``` \ No newline at end of file diff --git a/docs/schema.md b/docs/schema.md new file mode 100644 index 0000000..0fce745 --- /dev/null +++ b/docs/schema.md @@ -0,0 +1,138 @@ +# Schema + +## blocks.csv + +Column | Type | +------------------|--------------------| +number | bigint | +hash | hex_string | +parent_hash | hex_string | +nonce | hex_string | +sha3_uncles | hex_string | +logs_bloom | hex_string | +transactions_root | hex_string | +state_root | hex_string | +receipts_root | hex_string | +miner | address | +difficulty | numeric | +total_difficulty | numeric | +size | bigint | +extra_data | hex_string | +gas_limit | bigint | +gas_used | bigint | +timestamp | bigint | +transaction_count | bigint | + +## transactions.csv + +Column | Type | +-----------------|-------------| +hash | hex_string | +nonce | bigint | +block_hash | hex_string | +block_number | bigint | +transaction_index| bigint | +from_address | address | +to_address | address | +value | numeric | +gas | bigint | +gas_price | bigint | +input | hex_string | +block_timestamp | bigint | + +## token_transfers.csv + +Column | Type | +--------------------|-------------| +token_address | address | +from_address | address | +to_address | address | +value | numeric | +transaction_hash | hex_string | +log_index | bigint | +block_number | bigint | + +## receipts.csv + +Column | Type | +-----------------------------|-------------| +transaction_hash | hex_string | +transaction_index | bigint | +block_hash | hex_string | +block_number | bigint | +cumulative_gas_used | bigint | +gas_used | bigint | +contract_address | address | +root | hex_string | +status | bigint | + +## logs.csv + +Column | Type | +-------------------------|-------------| +log_index | bigint | +transaction_hash | hex_string | 
+transaction_index | bigint | +block_hash | hex_string | +block_number | bigint | +address | address | +data | hex_string | +topics | string | + +## contracts.csv + +Column | Type | +-----------------------------|-------------| +address | address | +bytecode | hex_string | +function_sighashes | string | +is_erc20 | boolean | +is_erc721 | boolean | +block_number | bigint | + +## tokens.csv + +Column | Type | +-----------------------------|-------------| +address | address | +symbol | string | +name | string | +decimals | bigint | +total_supply | numeric | + +## traces.csv + +Column | Type | +-----------------------------|-------------| +block_number | bigint | +transaction_hash | hex_string | +transaction_index | bigint | +from_address | address | +to_address | address | +value | numeric | +input | hex_string | +output | hex_string | +trace_type | string | +call_type | string | +reward_type | string | +gas | bigint | +gas_used | bigint | +subtraces | bigint | +trace_address | string | +error | string | +status | bigint | + +### Differences between geth and parity traces.csv + +- `to_address` field differs for `callcode` trace (geth seems to return correct value, as parity value of `to_address` is same as `to_address` of parent call); +- geth output doesn't have `reward` traces; +- geth output doesn't have `to_address`, `from_address`, `value` for `suicide` traces; +- `error` field contains human readable error message, which might differ in geth/parity output; +- geth output doesn't have `transaction_hash`; +- `gas_used` is 0 on traces with error in geth, empty in parity; +- zero output of subcalls is `0x000...` in geth, `0x` in parity; + +You can find column descriptions in [https://github.com/medvedev1088/ethereum-etl-airflow](https://github.com/medvedev1088/ethereum-etl-airflow/tree/master/dags/resources/stages/raw/schemas) + +Note: for the `address` type all hex characters are lower-cased. +`boolean` type can have 2 values: `True` or `False`. 
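As an illustration of how these logical types map onto Python values, here is a small reader that applies a column-to-type mapping from the tables above to an exported CSV (a sketch, not part of the library; the converter choices are assumptions of this example):

```python
import csv

# Converters for the schema's logical types: hex_string and address stay as
# lower-case strings, numeric becomes an arbitrary-precision int.
CONVERTERS = {
    "bigint": int,
    "numeric": int,
    "hex_string": str.lower,
    "address": str.lower,
    "string": str,
    "boolean": lambda v: v == "True",
}

def read_typed_rows(path, column_types):
    """Yield rows from an exported CSV with columns converted per schema type;
    empty fields become None (e.g. contract_address on ordinary transactions)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {
                col: (CONVERTERS[typ](row[col]) if row.get(col) else None)
                for col, typ in column_types.items()
            }
```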
\ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..13e2f47 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,17 @@ +site_name: Ethereum ETL +nav: + - Overview: index.md + - Quickstart: quickstart.md + - Exporting the Blockchain: exporting-the-blockchain.md + - Google BigQuery: google-bigquery.md + - Amazon Athena: amazon-athena.md + - Ethereum Classic: ethereum-classic.md + - References: + - Commands: command-reference.md + - Schema: schema.md + - Limitations: limitations.md + - Project: + - Contact Us: contact.md + - Media: media.md +theme: readthedocs +repo_url: https://github.com/blockchain-etl/ethereum-etl/