Data mining

Built on TensorBay Action, this example combines four steps (data crawling, format conversion, parsing, and statistical analysis) into a single workflow, giving you a quick overview of the Graviti Data platform.

1. Create a dataset

a. Open TensorBay in your Private workspace or Team workspace, and click Create a New Dataset.
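
If you prefer to work from code rather than the web UI, the dataset can also be created with the TensorBay Python SDK. The snippet below is a minimal sketch; the AccessKey placeholder and the dataset name paper-mining are illustrative assumptions, not values required by this example.

# Create a dataset with the TensorBay Python SDK instead of the web UI.
from tensorbay import GAS

# Authorize the client with the AccessKey copied from Developer Tools (see step 2).
gas = GAS("<YOUR_ACCESSKEY>")

# Create an empty dataset; "paper-mining" is a placeholder name.
gas.create_dataset("paper-mining")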

2. Create an AccessKey and a secret

a. Go to the Developer Tools page, click Create AccessKey, and copy the key.

b. Go to the page of the dataset you created, then click Action Configuration and Create Secret on the Settings page.

c. Name the secret accesskey, and paste the AccessKey value you copied in step a as the secret value.
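
How the secret is exposed to the task containers depends on your workflow configuration; purely as an illustration, the sketch below assumes the accesskey secret is injected into the container as an environment variable named ACCESSKEY (a hypothetical name) and uses it to authorize the TensorBay SDK from inside a task.

# Illustrative only: read the workflow secret from an environment variable
# (the variable name ACCESSKEY is an assumption, not a platform guarantee)
# and use it to authorize the TensorBay SDK inside a task container.
import os

from tensorbay import GAS

gas = GAS(os.environ["ACCESSKEY"])
dataset_client = gas.get_dataset("paper-mining")  # placeholder dataset name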

3. Create a workflow

a. Click Create Workflow on the Action page.

b. Fill in the workflow name. (Note: workflow names may only contain lowercase letters, numbers, and hyphens, must be at least 2 characters long, and must not begin with a hyphen.)

c. Choose the workflow trigger mode. (Default: manual trigger.)

d. Configure the workflow parameter. (Note: in this example the parameter is passed to the image as a command-line argument and controls which month of papers is crawled; the default is 1.) A sketch of how a task script might read this argument appears after step g below.

e. Configure the workflow instance.

f. Use the following code to create the YAML file.

# A Workflow consists of multiple tasks that can be run serially or in parallel.
tasks:
  # This workflow includes four tasks: scraper, pdf2txt, parser, and statistics
  scraper:
    container:
      # The Docker image this task depends on is specified below (images from both public and private repositories can be used).
      image: hub.graviti.cn/miner/scraper:2.3

      # The command `./archive/run.py {{workflow.parameters.month}}` will be executed after the image starts.
      command: [python3]
      args: ["./archive/run.py", "{{workflow.parameters.month}}"]
  pdf2txt:
    # pdf2txt depends on scraper, i.e. it only starts running after scraper has finished running
    dependencies:
      - scraper
    container:
      image: hub.graviti.cn/miner/pdf2txt:2.0
      command: [python3]
      args: ["pdf2txt.py"]
  parser:
    # parser depends on pdf2txt, i.e. it will only start running after pdf2txt has finished running
    dependencies:
      - pdf2txt
    container:
      image: hub.graviti.cn/miner/parser:2.0
      command: [python3]
      args: ["parser.py"]
  statistics:
    # statistics depends on parser, i.e. it only starts running after parser has finished running
    dependencies:
      - parser
    container:
      image: hub.graviti.cn/miner/statistics:2.0
      command: [python3]
      args: ["statistics.py"]

g. Publish the workflow.
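
For reference, the scraper task receives the month parameter as a command-line argument (see the args line in the YAML above). The real contents of the hub.graviti.cn/miner/scraper image are not shown in this example; the snippet below is only a sketch of how an entry script like ./archive/run.py might read that argument.

# Sketch of an entry script such as ./archive/run.py: it reads the month
# passed by the workflow ({{workflow.parameters.month}}) from the command line.
import argparse

def main():
    parser = argparse.ArgumentParser(description="Crawl papers for a given month")
    parser.add_argument("month", type=int, help="month of papers to crawl")
    args = parser.parse_args()
    print(f"Crawling papers for month {args.month} ...")
    # ... crawling logic would go here ...

if __name__ == "__main__":
    main()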

4. Run the workflow

a. On the Action page, click the workflow you have created and run it.

b. Adjust the parameter as needed, for example, change the value to 10 (i.e. month 10), and click Run.

5. View the results

a. Click the workflow run to open the workflow detail page, then click User Logs to view the details of the workflow log.

b. On the dataset detail page, click General -> Dataset Preview to view the statistics, which are the output of this workflow.
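
The output can also be checked programmatically with the TensorBay SDK. The snippet below is a minimal sketch that assumes the dataset created in step 1 is named paper-mining; it simply lists the dataset's segments to confirm the workflow wrote its results.

# List the segments of the dataset to confirm the workflow produced output.
# "paper-mining" is a placeholder; replace it with your dataset's name.
from tensorbay import GAS

gas = GAS("<YOUR_ACCESSKEY>")
dataset_client = gas.get_dataset("paper-mining")
for segment_name in dataset_client.list_segment_names():
    print(segment_name)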
