# Data mining

#### 1. Create a dataset

a. In TensorBay, open your Private workspace or Team workspace, and click *Create a New Dataset*.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FYljblzfV2G8TQGd1BxIu%2FCreateDataset.png?alt=media\&token=b1faf395-e4f3-4e3a-8b97-30f674de0d63)

#### 2. Configure the AccessKey

a. Open the Developer Tools page, click *Create AccessKey*, and copy the key.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FZwe0JJUxCae3QViCQ9Z0%2Faccesskey.png?alt=media\&token=1acecb27-d873-49c3-a2c9-586e6d21132b)

b. Open the page of the dataset you created, then click *Action Configuration* and *Create Secret* on the Settings page.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FwGuOWA74lj79ZryzaMNr%2Fconfig.png?alt=media\&token=d7d887e8-5e9d-4da3-af6a-6b9cedb459a5)

c. Name the secret `accesskey`, and paste the AccessKey value that you copied in step a.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FDF0WZDU1AQOMWq2rDefn%2Faccesskey2.png?alt=media\&token=6ad83c68-8d9e-4e5e-9c21-6229491396a8)

#### 3. Create a workflow

a. Click *Create Workflow* on the Action page.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FxGbxt1SSIc9sYFwVQiax%2Fworkflow.png?alt=media\&token=ee6be809-f632-4171-84af-5f755e5cb4c7)

b. Fill in the *workflow name*. (Note: Workflow names can only contain lowercase letters, numbers, and minus signs. They must be at least 2 characters long and must not begin with a minus sign.)

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FHoJD3wmZ7yHymunJDMiA%2Fworkflow_name.png?alt=media\&token=2d0002ca-fb86-4080-a869-c39b316f2cc3)
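
The naming rules in step b can be checked before you type a name into the form. The following is a minimal sketch, reading the rules as: lowercase letters, digits, and minus signs only, at least 2 characters, and no leading minus sign. The function name `is_valid_workflow_name` is hypothetical, and the platform's own validation may differ in details not stated here.

```python
import re

# Pattern for the rules as read from the note above (an interpretation,
# not the platform's own validator): the first character is a lowercase
# letter or digit, followed by at least one more lowercase letter, digit,
# or minus sign.
_WORKFLOW_NAME = re.compile(r"[a-z0-9][a-z0-9-]+")


def is_valid_workflow_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules described in step b."""
    return _WORKFLOW_NAME.fullmatch(name) is not None


for candidate in ["paper-miner", "a", "-miner", "Paper_Miner"]:
    print(candidate, is_valid_workflow_name(candidate))
```

Only `paper-miner` passes here: `a` is too short, `-miner` starts with a minus sign, and `Paper_Miner` contains uppercase letters and an underscore.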

c. Choose the workflow *trigger* mode. (Default: on manual)

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FHXm0NuVBvgsJ3eqebXeF%2Fworkflow_trigger.png?alt=media\&token=dfd824f5-51e7-4bf5-918f-96bbac4cff2e)

d. Configure the workflow *parameter*. (Note: In this example, the parameter is passed to the image as a command-line argument and sets the month of the papers to crawl. The default is 1.)

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FPY9MBIuNRbwVQtG23FHw%2Fworkflow_parameter.png?alt=media\&token=880d9765-baeb-4961-a07c-071ba788e8bd)
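
To make the role of the parameter concrete, here is a sketch of how a script inside the image could read the month from its command line. The real `./archive/run.py` is not shown in this guide, so the function `parse_month` and its argument handling are assumptions for illustration only.

```python
import argparse
from typing import Optional, Sequence


def parse_month(argv: Optional[Sequence[str]] = None) -> int:
    """Parse the optional positional `month` argument, defaulting to 1.

    Hypothetical stand-in for the argument handling of the scraper script.
    """
    parser = argparse.ArgumentParser(
        description="Hypothetical entry point for the scraper task"
    )
    parser.add_argument(
        "month",
        nargs="?",   # the argument may be omitted
        type=int,
        default=1,   # matches the default stated in step d
        help="month value passed in from the workflow parameter",
    )
    return parser.parse_args(argv).month
```

Under these assumptions, `parse_month(["10"])` returns `10`, and `parse_month([])` falls back to the default of `1`.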

e. Configure the workflow *instance*.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FGFc749ED0bistaDQMcIt%2Fworkflow_instance.png?alt=media\&token=3e6c6019-aae1-4f0f-9dfb-00cfe722261d)

f. Use the following code to create the *YAML* file.

```
# A Workflow consists of multiple tasks that can be run serially or in parallel.
tasks:
  # This workflow has four tasks: scraper, pdf2txt, parser, and statistics
  scraper:
    container:
      # The Docker image this task runs in (images from both public and private repositories can be used)
      image: hub.graviti.cn/miner/scraper:2.3

      # The command `python3 ./archive/run.py {{workflow.parameters.month}}` is executed once the image is running.
      command: [python3]
      args: ["./archive/run.py", "{{workflow.parameters.month}}"]
  pdf2txt:
    # pdf2txt depends on scraper, i.e. it only starts running after scraper has finished running
    dependencies:
      - scraper
    container:
      image: hub.graviti.cn/miner/pdf2txt:2.0
      command: [python3]
      args: ["pdf2txt.py"]
  parser:
    # parser depends on pdf2txt, i.e. it will only start running after pdf2txt has finished running
    dependencies:
      - pdf2txt
    container:
      image: hub.graviti.cn/miner/parser:2.0
      command: [python3]
      args: ["parser.py"]
  statistics:
    # statistics depends on parser, i.e. it only starts running after parser has finished running
    dependencies:
      - parser
    container:
      image: hub.graviti.cn/miner/statistics:2.0
      command: [python3]
      args: ["statistics.py"]
```

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FcrNsw67LzIw867NQNqHz%2Fworkflow_yaml.png?alt=media\&token=7c6bb8cb-e2ff-4ee6-9c55-6a8c7d3bed18)
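
The `dependencies` entries in the YAML above determine the order in which the tasks run. As an illustration only (the platform's scheduler is not described in this guide), the same dependency graph can be written as a Python dictionary and sorted with the standard-library `graphlib` module. The helper name `execution_order` is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task mapped to the tasks it depends on, copied from the YAML above.
dependencies = {
    "scraper": [],
    "pdf2txt": ["scraper"],
    "parser": ["pdf2txt"],
    "statistics": ["parser"],
}


def execution_order(deps):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(deps).static_order())


print(execution_order(dependencies))
# ['scraper', 'pdf2txt', 'parser', 'statistics']
```

Because every task depends on the previous one, this workflow runs strictly serially. Tasks that do not depend on each other could run in parallel.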

g. Publish the workflow.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FOppOwGLs9HiM4PP4h2Kz%2Fworkflow_public.png?alt=media\&token=ebccb428-7915-4885-b586-b8e094343dd5)

#### 4. Run the workflow

a. On the Action page, click the workflow you have created and run it.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FU1Rr567fuEFlD6FA5MDw%2Fworkflow_run.png?alt=media\&token=b1be1916-04c7-492c-9c1d-3aefc7b93630)

b. Adjust the parameter (for example, set the month to 10), and click *Run*.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FIyzGxN6My55M0Gf3kP2c%2Fworkflow_run_para.png?alt=media\&token=083f29bd-d842-4b7a-9103-b08d283d266c)
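
With the value set to 10, the `{{workflow.parameters.month}}` placeholder in the scraper task's `args` is replaced before the command runs. The sketch below imitates that substitution with a regular expression. It is not the platform's templating engine, and `render_args` is a hypothetical helper.

```python
import re

# Arguments of the scraper task, copied from the workflow YAML.
SCRAPER_ARGS = ["./archive/run.py", "{{workflow.parameters.month}}"]

# Matches placeholders of the form {{workflow.parameters.<name>}}.
_PLACEHOLDER = re.compile(r"\{\{\s*workflow\.parameters\.(\w+)\s*\}\}")


def render_args(args, parameters):
    """Replace each {{workflow.parameters.<name>}} placeholder with its value."""
    def substitute(match):
        return str(parameters[match.group(1)])

    return [_PLACEHOLDER.sub(substitute, arg) for arg in args]


command = ["python3"] + render_args(SCRAPER_ARGS, {"month": 10})
print(" ".join(command))
# python3 ./archive/run.py 10
```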

#### 5. View the results

a. Click the run result to open the workflow detail page, then click *User Logs* to view the detailed workflow log.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2F80Wu8JNzXljc49xIbt6X%2Fworkflow_log.png?alt=media\&token=9be54f29-bc9f-47b1-8f08-3af63a8a2a27)

b. On the dataset detail page, click *General* -> *Dataset Preview* to view the statistics produced by this workflow.

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2F3KPODmd7aL2a6H5SLmo8%2Foutcome1.png?alt=media\&token=a3576670-d7b8-420c-ab54-b9677480b084)

![](https://2993186011-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MGbJTODB-ncDvFhokcx%2Fuploads%2FiGOIisqVyJQSI9KVZ488%2Fimage.png?alt=media\&token=68eba8aa-3092-4011-96e4-172dae4bf343)
