Example: Analyzing the news live with RAW

This example shows how to create an endpoint that analyzes the news by combining data from multiple external web APIs.

info

If you are not familiar with RAW, we recommend checking out our Getting started guide first. To use RAW, you need an account, which you can create and use for free here.

How to analyze the news? There are three steps involved:

obtain a machine-readable list of news articles;
process the news articles to obtain additional metadata;
use a natural language API to detect well-identified entities.

info

If you want to try this example, you can deploy the following endpoint:

Analyzing the news

See how to turn an RSS feed into a detailed textual analysis of its articles.


Usage:

/examples/news

description(url: string) =
    let
        encoded = Http.UrlEncode(url),
        key = Environment.Secret("opengraphKey"),
        opengraphReq = Http.Get("https://opengraph.io/api/1.1/site/" + encoded, args = [{"app_id", key}]),
        metadata = Json.Read(opengraphReq, type record(hybridGraph: record(title: string, description: string)))
    in
        metadata.hybridGraph.description

analyze(text: string) =
    let
        outputType = type record(
            entities: collection(
                record(
                    name: string,
                    `type`: string,
                    metadata: record(),
                    salience: double,
                    mentions: collection(record(text: record(content: string, beginOffset: int), `type`: string))
                )
            ),
            language: string
        ),
        query = Json.Print({document: {`type`: "PLAIN_TEXT", content: text}, encodingType: "UTF8"}),
        key = Environment.Secret("language@google"),
        httpPost = Http.Post(
            "https://language.googleapis.com/v1/documents:analyzeEntities",
            args = [{"key", key}],
            headers = [{"x-raw-output-format", "json"}, {"Content-Type", "application/json; charset=utf-8"}],
            bodyString = query
        ),
        r = Json.Read(httpPost, outputType)
    in
        r

let
    feed = Xml.InferAndRead("http://rss.cnn.com/rss/edition_us.rss"),
    items = Collection.Take(feed.channel.item, 10),
    withMetadata = Collection.Transform(items, (i) -> {title: i.title, link: i.link, description: description(i.link)}),
    withAnalysis = Collection.Transform(withMetadata, (r) -> Record.AddField(r, analysis = analyze(r.description))),
    explodeEntities = Collection.Explode(withAnalysis, (row) -> row.analysis.entities),
    interestingEntities = Collection.Filter(
        explodeEntities,
        (row) ->
            List.Contains(["PERSON", "LOCATION", "ORGANIZATION", "EVENT", "WORK_OF_ART", "CONSUMER_GOOD"], row.`type`)
    ),
    grouped = Collection.GroupBy(
        interestingEntities,
        (row) -> {name: row.name, `type`: row.`type`, metadata: row.metadata}
    ),
    report = Collection.Transform(
        grouped,
        (g) ->
            {
                key: g.key,
                total_salience: Collection.Sum(g.group.salience),
                story_count: Collection.Count(g.group),
                stories: Collection.Distinct(g.group.link),
                mention_count: Collection.Count(Collection.Explode(g.group, (g) -> g.mentions))
            }
    )
in
    Collection.OrderBy(report, (row) -> row.story_count, "DESC")

To start, we use an RSS feed, a well-known XML standard for presenting website updates in a computer-readable format. For our example, we use the CNN US channel, which has a list of news feeds on different subjects.

Then, for each news article link in the file, we issue a call to OpenGraph.io, which exposes an API that extracts OpenGraph metadata from the content of a URL.

Finally, the text summaries are sent to Google's Natural Language API, which returns the well-identified entities (people, companies, locations) it recognized in them.

Now that we have associated each article with its entities, we perform an aggregation to shape this information to our needs.

Step 1: Read an RSS feed

The RSS format is based on XML, which Snapi reads with Xml.InferAndRead. Here's how to read the CNN US news feed:

let
    feed = Xml.InferAndRead("http://rss.cnn.com/rss/edition_us.rss")
in
    // Links, titles and other metadata reside inside the `item` nodes, within `channel`.
    feed.channel.item.title

The results look like:

[
  "Suspect in Dallas Zoo animal thefts allegedly admitted to the crime and says he would do it again, affidavits claim",
  "School and food vendor apologize for insensitive lunch served on first day of Black History Month",
  "An off-duty New York police officer who was shot while trying to buy an SUV has died",
  "Labor Secretary Marty Walsh expected to leave Biden administration | CNN Politics",
  ...
  "HS football players gain perspective helping vets",
  "Milwaukee Dancing Grannies planning return",
  "Fire crews respond to fire at boarded up building",
  ...
]
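
For readers who want to see the same traversal outside RAW, here is a minimal Python sketch, not part of the RAW endpoint. It parses a small hand-written RSS snippet (a hypothetical stand-in for the live CNN feed) and collects the title and link of each `item` under `channel`, which is the path that `feed.channel.item` follows above. The function name `read_items` and the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written RSS snippet standing in for the live CNN feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return one {title, link} dict per <item> found under <channel>."""
    root = ET.fromstring(rss_text)  # root is the <rss> element
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.findall("./channel/item")
    ]


items = read_items(SAMPLE_RSS)
titles = [i["title"] for i in items]
print(titles)
```

Unlike Xml.InferAndRead, this sketch does not infer a schema; it only picks out the two fields used later in this example.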

Step 2: Extracting OpenGraph metadata

RSS data contains some metadata about each article it refers to (e.g. its title), but more metadata can be found in the articles themselves. To get it, we have to follow each link and process the article it points to.

OpenGraph specifies a set of <meta/> HTML tags that help include generic web pages in Facebook's social graph. For news articles, these tags include the title of the article, its type (e.g. article, opinion), links to illustrations and a description that contains a summary of the article.

Here are the tags found in one of the CNN articles.

<meta property="og:title" content="Suspect in Dallas Zoo animal thefts allegedly admitted to the crime and says he would do it again, affidavits claim">
<meta property="og:site_name" content="CNN">
<meta property="og:type" content="article">
<meta property="og:url" content="https://www.cnn.com/2023/02/08/us/dallas-zoo-suspect-arrest-affidavits/index.html">
<meta property="og:image" content="https://cdn.cnn.com/cnnnext/dam/assets/230207231552-01-dallas-zoo-020323-file-super-tease.jpg">
<meta property="og:description" content="The man who faces charges stemming from a string of suspicious activities at the Dallas Zoo allegedly admitted to stealing two tamarin monkeys and trying to steal the clouded snow leopard last month, according to arrest warrant affidavits.">
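
To make concrete what "extracting OpenGraph metadata" involves, here is a rough Python sketch that collects `og:` tags from a page head using only the standard library. It is an illustration under stated assumptions, not how OpenGraph.io works internally: the class name `OpenGraphParser` and the shortened sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            # Strip the "og:" prefix so keys read "title", "description", ...
            self.tags[prop[3:]] = attributes["content"]


# Shortened, illustrative page head (not the full CNN markup).
SAMPLE_HTML = """
<head>
  <meta property="og:title" content="Sample title">
  <meta property="og:type" content="article">
  <meta property="og:description" content="A short summary of the article.">
  <meta name="viewport" content="width=device-width">
</head>
"""

parser = OpenGraphParser()
parser.feed(SAMPLE_HTML)
og = parser.tags
print(og["description"])
```

The `viewport` tag is ignored because it carries no `og:` property, which leaves only the three OpenGraph fields in `og`.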

The OpenGraph.io website exposes an API that extracts OpenGraph metadata from the content of a URL. This includes the description field. We'd like to isolate that description field in order to perform textual analysis later. As we're processing the collection of links found in the RSS file, the content of their description tag can be obtained by passing that link to OpenGraph.io's API.

Let's define a function that performs the HTTP call to OpenGraph.io.

description(url: string) =
    let
        encoded = Http.UrlEncode(url),
        key = "####",
        opengraphReq = Http.Get(
            "https://opengraph.io/api/1.1/site/" + encoded,
            args = [{"app_id", key}]
        ),
        metadata = Json.Read(
            opengraphReq,
            type record(
                hybridGraph: record(title: string, description: string)
            )
        )
    in
        metadata.hybridGraph.description

Here's what is obtained with the article used as an example:

{
  "title": "Suspect in Dallas Zoo animal thefts allegedly admitted to the crime and says he would do it again, affidavits claim | CNN",
  "description": "The man who faces charges stemming from a string of suspicious activities at the Dallas Zoo allegedly admitted to stealing two tamarin monkeys and trying to steal the clouded snow leopard last month, according to arrest warrant affidavits.",
  "type": "article",
  "image": {
    "url": "https://media.cnn.com/api/v1/images/stellar/prod/230207231552-01-dallas-zoo-020323-file.jpg?c=16x9&q=w_800,c_fill"
  },
  "url": "https://www.cnn.com/2023/02/08/us/dallas-zoo-suspect-arrest-affidavits/index.html",
  "site_name": "CNN",
  "articlePublishedTime": "2023-02-08T07:33:17Z",
  "articleModifiedTime": "2023-02-08T08:14:39Z"
}
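
The `description` function above URL-encodes the article link before appending it to the OpenGraph.io path. The following Python sketch shows what such a request URL could look like. It assumes that Http.UrlEncode percent-encodes every reserved character (including `/` and `:`), which is not confirmed here; `opengraph_request_url` and the `YOUR_APP_ID` placeholder are hypothetical names.

```python
from urllib.parse import quote, urlencode

# Base path and the app_id parameter name are taken from the Snapi code above.
OPENGRAPH_BASE = "https://opengraph.io/api/1.1/site/"


def opengraph_request_url(article_url, app_id):
    """Build the request URL: encoded article link in the path, key in the query."""
    # safe="" forces "/" and ":" to be percent-encoded too, so the whole
    # article link fits into a single path segment.
    encoded = quote(article_url, safe="")
    return OPENGRAPH_BASE + encoded + "?" + urlencode({"app_id": app_id})


url = opengraph_request_url(
    "https://www.cnn.com/2023/02/08/us/dallas-zoo-suspect-arrest-affidavits/index.html",
    "YOUR_APP_ID",  # placeholder, not a real key
)
print(url)
```

No request is sent; the sketch only builds the string, so it can be run without an OpenGraph.io account.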

Step 3: Perform the textual analysis

A second function called analyze (its code is not repeated here; see the full endpoint code above) sends the content of the description field to Google's Natural Language API, also over HTTP. The function returns the set of entities identified by the service. Here's the entity matching Joe Biden.

{
  "name": "Joe Biden",
  "type": "PERSON",
  "metadata": {
    "mid": "/m/012gx2",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden"
  },
  "salience": 0.2149425,
  "mentions": [
    { "text": { "content": "Biden", "beginOffset": 54 }, "type": "PROPER" },
    { "text": { "content": "Joe Biden", "beginOffset": 190 }, "type": "PROPER" },
    { "text": { "content": "President", "beginOffset": 180 }, "type": "COMMON" }
  ]
}
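
As a rough illustration of the data exchanged in this step, the Python sketch below builds the same JSON body that `analyze` posts (a `PLAIN_TEXT` document with `UTF8` encoding) and filters entity names out of a response. It makes no network call: the response is an abbreviated, hand-reduced sample, and the helper names `build_entity_request` and `entity_names` are hypothetical.

```python
import json


def build_entity_request(text):
    """Serialize the JSON body posted to documents:analyzeEntities."""
    return json.dumps(
        {
            "document": {"type": "PLAIN_TEXT", "content": text},
            "encodingType": "UTF8",
        }
    )


def entity_names(response_text, wanted_types):
    """Return the names of entities whose type is wanted, in response order."""
    response = json.loads(response_text)
    return [e["name"] for e in response["entities"] if e["type"] in wanted_types]


# Abbreviated response: only name, type and salience are kept per entity.
SAMPLE_RESPONSE = json.dumps(
    {
        "entities": [
            {"name": "Marty Walsh", "type": "PERSON", "salience": 0.50773776},
            {"name": "Joe Biden", "type": "PERSON", "salience": 0.2149425},
            {"name": "secretary departure", "type": "EVENT", "salience": 0.045937307},
        ],
        "language": "en",
    }
)

body = build_entity_request("Labor Secretary Marty Walsh is expected to depart soon.")
people = entity_names(SAMPLE_RESPONSE, {"PERSON"})
print(people)
```

Note that salience is a fractional score, so any typed reader of this response needs a floating-point field for it.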

Step 4: Our data product

The two functions are chained to augment the initial RSS data with textual analysis:

let
    feed = Xml.InferAndRead("http://rss.cnn.com/rss/edition_us.rss"),
    items = feed.channel.item,
    withMetadata = Collection.Transform(
        items,
        (i) ->
            {title: i.title, link: i.link, description: description(i.link)}
    ),
    withAnalysis = Collection.Transform(
        withMetadata,
        (r) -> Record.AddField(r, analysis = analyze(r.description))
    ),
    ....

Here is an example of a row that has been augmented with both the description and its entities:

{
    "title": "Labor Secretary Marty Walsh expected to leave Biden administration | CNN Politics",
    "link": "https://www.cnn.com/2023/02/07/politics/marty-walsh-leaving/index.html",
    "description": "Labor Secretary Marty Walsh is expected to depart the Biden administration soon, according to two people familiar with the matter, marking the first Cabinet secretary departure of President Joe Biden's presidency.",
    "analysis": {
      "entities": [
        {
          "name": "Marty Walsh",
          "type": "PERSON",
          "metadata": {
            "wikipedia_url": "https://en.wikipedia.org/wiki/Marty_Walsh",
            "mid": "/m/0swn343"
          },
          "salience": 0.50773776,
          "mentions": [
            { "text": { "content": "Marty Walsh", "beginOffset": 16 }, "type": "PROPER" },
            { "text": { "content": "Labor Secretary", "beginOffset": 0 }, "type": "COMMON" }
          ]
        },
        {
          "name": "Joe Biden",
          "type": "PERSON",
          "metadata": {
            "mid": "/m/012gx2",
            "wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden"
          },
          "salience": 0.2149425,
          "mentions": [
            { "text": { "content": "Biden", "beginOffset": 54 }, "type": "PROPER" },
            { "text": { "content": "Joe Biden", "beginOffset": 190 }, "type": "PROPER" },
            { "text": { "content": "President", "beginOffset": 180 }, "type": "COMMON" }
          ]
        },
        ...
        {
          "name": "secretary departure",
          "type": "EVENT",
          "metadata": {},
          "salience": 0.045937307,
          "mentions": [
            { "text": { "content": "secretary departure", "beginOffset": 157 }, "type": "COMMON" }
          ]
        },
        ...
      ],
      "language": "en"
    }
  }

Present aggregated results

The results now combine the output of two external APIs with our input RSS feed items.

Depending on what question we are asking, the final query could return different structures.

We show here a query that returns aggregated Entity and Type information across all the pages in the RSS feed, in descending order of "hits", to see what’s "most reported".

let
    // ...
    explodeEntities = Collection.Explode(
        withAnalysis,
        (row) -> row.analysis.entities
    ),
    interestingEntities = Collection.Filter(
        explodeEntities,
        (row) ->
            List.Contains(
                [
                    "PERSON",
                    "LOCATION",
                    "ORGANIZATION",
                    "EVENT",
                    "WORK_OF_ART",
                    "CONSUMER_GOOD"
                ],
                row.`type`
            )
    ),
    grouped = Collection.GroupBy(
        interestingEntities,
        (row) ->
            {name: row.name, `type`: row.`type`, metadata: row.metadata}
    ),
    report = Collection.Transform(
        grouped,
        (g) ->
            {
                key: g.key,
                total_salience: Collection.Sum(g.group.salience),
                story_count: Collection.Count(g.group),
                stories: Collection.Distinct(g.group.link),
                mention_count: Collection.Count(Collection.Explode(g.group, (g) -> g.mentions))
            }
    )
in
    Collection.OrderBy(report, (row) -> row.story_count, "DESC")

The results are:

[
  {
    "key": {
      "name": "police",
      "type": "PERSON",
      "metadata": {
        "value": null,
        "wikipedia_url": null,
        "mid": null,
        "currency": null,
        "year": null
      }
    },
    "total_salience": 0.56725444,
    "story_count": 3,
    "stories": [
      "https://abc7ny.com/police-involved-shooting-grand-concourse-section-suspect-shot-in-head-and-leg-bronx/12524318",
      "https://www.atlantanewsfirst.com/2022/12/04/police-2-ford-mustangs-totaling-nearly-200k-stolen-upson-county-dealership/",
      "https://www.cbs58.com/news/horizon-west-condo-owners-in-waukesha-remember-building-fire-one-year-later"
    ],
    "mention_count": 3
  },
  {
    "key": {
      "name": "students",
      "type": "PERSON",
      "metadata": {
        "value": null,
        "wikipedia_url": null,
        "mid": null,
        "currency": null,
        "year": null
      }
    },
    "total_salience": 0.42099381599999997,
    "story_count": 2,
    "stories": [
      "https://www.cnn.com/2023/02/06/us/aramark-black-history-month-menu-school-reaj/index.html",
      "https://www.wptv.com/news/education/200-000-worth-of-supplies-distributed-for-palm-beach-county-schools-during-giveaway-event"
    ],
    "mention_count": 2
  },
  {
    "key": {
      "name": "Amazon",
      "type": "ORGANIZATION",
      "metadata": {
        "value": null,
        "wikipedia_url": "https://en.wikipedia.org/wiki/Amazon_(company)",
        "mid": "/m/0mgkg",
        "currency": null,
        "year": null
      }
    },
    "total_salience": 0.006539275,
    "story_count": 1,
    "stories": [
      "https://www.tmj4.com/news/local-news/10-year-old-upset-over-vr-headset-fatally-shoots-mother-charged-as-an-adult"
    ],
    "mention_count": 1
  },
  ...
]
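
The explode, filter, group and aggregate sequence can be hard to follow in one pass, so here is a small Python sketch of the same idea on hand-made data. It is an approximation rather than a translation of the Snapi query: it groups by name and type only (the query also includes metadata in the key), and the function name `build_report` and all sample rows are hypothetical.

```python
from collections import defaultdict

INTERESTING = {"PERSON", "LOCATION", "ORGANIZATION", "EVENT", "WORK_OF_ART", "CONSUMER_GOOD"}

# Hypothetical rows shaped like the augmented feed items (title omitted).
SAMPLE_ROWS = [
    {
        "link": "https://example.com/a",
        "analysis": {"entities": [
            {"name": "police", "type": "PERSON", "salience": 0.5, "mentions": ["police", "officers"]},
            {"name": "city", "type": "OTHER", "salience": 0.125, "mentions": ["city"]},
        ]},
    },
    {
        "link": "https://example.com/b",
        "analysis": {"entities": [
            {"name": "police", "type": "PERSON", "salience": 0.25, "mentions": ["police"]},
            {"name": "Amazon", "type": "ORGANIZATION", "salience": 0.125, "mentions": ["Amazon"]},
        ]},
    },
]


def build_report(rows):
    groups = defaultdict(list)
    for row in rows:
        for entity in row["analysis"]["entities"]:  # explode: one row per entity
            if entity["type"] in INTERESTING:       # filter on entity type
                key = (entity["name"], entity["type"])
                groups[key].append({"link": row["link"], **entity})
    result = [
        {
            "key": {"name": name, "type": etype},
            "total_salience": sum(e["salience"] for e in entities),
            "story_count": len(entities),
            "stories": sorted({e["link"] for e in entities}),
            "mention_count": sum(len(e["mentions"]) for e in entities),
        }
        for (name, etype), entities in groups.items()
    ]
    # Most frequently reported entities first.
    return sorted(result, key=lambda r: r["story_count"], reverse=True)


rows = build_report(SAMPLE_ROWS)
print(rows[0]["key"], rows[0]["story_count"])
```

As in the query, `story_count` counts exploded rows per entity, so an entity listed twice for the same story would be counted twice.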

Ready to try it out?

Otherwise, if you have questions/comments, join us on Discord!
