# waxy
A crawler for Rust.

This is a work in progress.

The wax worker is the crawler. It is being built out to generate, or "press", different kinds of docs, such as `HtmlDocument`.
Each specific doc type will implement methods to parse itself.
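
To illustrate the "press a doc, then let the doc parse itself" idea, here is a minimal sketch that uses only the standard library. The names `Document`, `HtmlDoc`, and `Presser` are hypothetical and are not waxy's actual types, and the naive string search stands in for a real HTML parser (it only matches tags written without attributes).

```rust
/// Hypothetical trait: a document type that knows how to parse itself.
trait Document {
    /// Text between the first `<tag>` and `</tag>`, if both are present.
    fn tag_text(&self, tag: &str) -> Option<String>;
}

/// Hypothetical stand-in for a pressed HTML document; everything is a String.
struct HtmlDoc {
    url: String,
    body: String,
}

/// Hypothetical presser: builds a document from a URL and an already-fetched body.
struct Presser;

impl Presser {
    fn press(url: &str, body: &str) -> HtmlDoc {
        HtmlDoc {
            url: url.to_string(),
            body: body.to_string(),
        }
    }
}

impl Document for HtmlDoc {
    fn tag_text(&self, tag: &str) -> Option<String> {
        // Naive search: only matches tags written without attributes.
        let open = format!("<{}>", tag);
        let close = format!("</{}>", tag);
        let start = self.body.find(&open)? + open.len();
        let end = self.body[start..].find(&close)? + start;
        Some(self.body[start..end].trim().to_string())
    }
}

fn main() {
    let doc = Presser::press(
        "https://example.com",
        "<html><title> Example Domain </title></html>",
    );
    println!("{} -> {:?}", doc.url, doc.tag_text("title"));
}
```

Keeping the parsing methods on the doc type means the presser only has to fetch and construct, and new doc types can be added without touching it.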

Crawling is a slow process, and crawling blind even more so. The last thing anyone wants from a crawler is for it to be unable to crawl.

Scraping functionality is coming. However, other parser options may not be available.

### main dependencies of waxy:

1. reqwest https://docs.rs/reqwest/latest/reqwest/
2. scraper https://crates.io/crates/scraper
3. tokio-test https://crates.io/crates/tokio-test

### Notes:
_01/03/22_ 
- I am looking at writing my own parser; it would live in a separate crate.
- I am building test server functionality so that all pages can be served locally.


Waxy uses strings for pretty much everything.

How to use:
```toml
[dependencies]
waxy = "0.1.1"
tokio = { version = "1", features = ["full"] }
```

```rust
use waxy::pressers::HtmlPresser;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    

    //Wax worker

    /*
    
    create a single record from url

    */
    match HtmlPresser::press_record("https://example.com").await{
        Ok(res)=>{
            println!("{:?}", res);
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    
    crawl a vector of urls for a vector of documents

    */

    match HtmlPresser::press_records(vec!["https://example.com"]).await{
        Ok(res)=>{
            println!("{:?}", res.len());
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    
    blind crawl a domain; the "1" is the maximum number of pages to crawl

    */

    match HtmlPresser::press_records_blind("https://funnyjunk.com",1).await{
        Ok(res)=>{
            println!("{:?}", res.len());
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    /*
    blind crawl a domain for links
    inputs:
    url to site
    link limit, the maximum number of links to collect
    page limit, the maximum number of pages to crawl for links
    */

    match HtmlPresser::press_urls("https://example.com",1,1).await{
        Ok(res)=>{
            println!("{:?}", res.len());
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    blind crawl a domain for links that match a pattern
    inputs:
    url to site
    pattern the url should match
    link limit, the maximum number of links to collect
    page limit, the maximum number of pages to crawl for links
    */
    match HtmlPresser::press_curated_urls("https://example.com", ".", 1,1).await{
        Ok(res)=>{
            println!("{:?}", res);
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    blind crawl a domain for documents whose urls match a pattern
    inputs:
    url to site
    pattern the url should match
    page limit, the maximum number of pages to crawl for documents
    */
    match HtmlPresser::press_curated_records("https://example.com", ".", 1).await{
        Ok(res)=>{
            println!("{:?}", res);
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();
    
    //get doc
    let record = HtmlPresser::press_record("https://funnyjunk.com").await.unwrap();

    //get anchors
    println!("{:?}",record.anchors());
    println!();
    println!("{:?}",record.anchors_curate("."));
    println!();
    println!("{:?}",record.domain_anchors());
    println!();
    //call headers
    println!("{:?}",record.headers);
    println!();
    //call meta data
    println!("{:?}",record.meta_data());
    println!();
    //tag text and html
    println!("{:?}",record.tag_html("title"));
    println!();
    println!("{:?}",record.tag_text("title"));
    println!();
    println!();

    Ok(())


}


```
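
To make the link limit and page limit used by the blind crawl calls above more concrete, here is a rough sketch of a limited breadth-first crawl. It is not waxy's implementation: `crawl_links` is a hypothetical function, the "site" is an in-memory map instead of real HTTP requests, and the pattern is treated as a plain substring, which is an assumption about how matching could work.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first crawl over an in-memory "site" (page URL -> links on that page).
/// Stops once `page_limit` pages were visited or `link_limit` links were collected.
/// Only links containing `pattern` are kept (plain substring match, an assumption).
fn crawl_links(
    site: &HashMap<&str, Vec<&str>>,
    start: &str,
    pattern: &str,
    link_limit: usize,
    page_limit: usize,
) -> Vec<String> {
    let mut found: Vec<String> = Vec::new();
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = VecDeque::new();
    let mut pages_crawled = 0;

    seen.insert(start.to_string());
    queue.push_back(start.to_string());

    while let Some(page) = queue.pop_front() {
        if pages_crawled >= page_limit || found.len() >= link_limit {
            break;
        }
        pages_crawled += 1;

        // A real crawler would fetch and parse the page here.
        let links = match site.get(page.as_str()) {
            Some(links) => links,
            None => continue,
        };

        for link in links {
            // First time this link is seen: queue it for a later visit.
            if seen.insert(link.to_string()) {
                queue.push_back(link.to_string());
                if link.contains(pattern) && found.len() < link_limit {
                    found.push(link.to_string());
                }
            }
        }
    }
    found
}

fn main() {
    let mut site = HashMap::new();
    site.insert(
        "https://example.com",
        vec!["https://example.com/a", "https://example.com/b"],
    );
    site.insert("https://example.com/a", vec!["https://example.com/blog/1"]);
    site.insert("https://example.com/b", vec!["https://example.com/blog/2"]);

    // Collect at most 10 links from at most 2 pages.
    let links = crawl_links(&site, "https://example.com", "example.com", 10, 2);
    println!("{:?}", links);
}
```

The `seen` set keeps the crawl from revisiting a page, and checking both limits at the top of the loop guarantees the crawl stops even on sites whose pages link to each other in cycles.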
