Example: Analyzing the news live with RAW
This example shows how to create an endpoint that analyzes the news by combining data from multiple external web APIs.
If you are not familiar with RAW, we recommend checking out our Getting started guide first. To use RAW, you need an account, which you can create and use for free.
How to analyze the news? There are three steps involved:
- obtain a machine-readable list of news articles;
- process the news articles to obtain additional metadata;
- use a natural language API to detect well-identified entities.
If you want to try this example, you can deploy the following endpoint:
Analyzing the news
Usage:
/examples/news
description(url: string) =
let
encoded = Http.UrlEncode(url),
key = Environment.Secret("opengraphKey"),
opengraphReq = Http.Get("https://opengraph.io/api/1.1/site/" + encoded, args = [{"app_id", key}]),
metadata = Json.Read(opengraphReq, type record(hybridGraph: record(title: string, description: string)))
in
metadata.hybridGraph.description
analyze(text: string) =
let
outputType = type record(
entities: collection(
record(
name: string,
`type`: string,
metadata: record(),
salience: double,
mentions: collection(record(text: record(content: string, beginOffset: int), `type`: string))
)
),
language: string
),
query = Json.Print({document: {`type`: "PLAIN_TEXT", content: text}, encodingType: "UTF8"}),
key = Environment.Secret("language@google"),
httpPost = Http.Post(
"https://language.googleapis.com/v1/documents:analyzeEntities",
args = [{"key", key}],
headers = [{"x-raw-output-format", "json"}, {"Content-Type", "application/json; charset=utf-8"}],
bodyString = query
),
r = Json.Read(httpPost, outputType)
in
r
let
feed = Xml.InferAndRead("http://rss.cnn.com/rss/edition_us.rss"),
items = Collection.Take(feed.channel.item, 10),
withMetadata = Collection.Transform(items, (i) -> {title: i.title, link: i.link, description: description(i.link)}),
withAnalysis = Collection.Transform(withMetadata, (r) -> Record.AddField(r, analysis = analyze(r.description))),
explodeEntities = Collection.Explode(withAnalysis, (row) -> row.analysis.entities),
interestingEntities = Collection.Filter(
explodeEntities,
(row) ->
List.Contains(["PERSON", "LOCATION", "ORGANIZATION", "EVENT", "WORK_OF_ART", "CONSUMER_GOOD"], row.`type`)
),
grouped = Collection.GroupBy(
interestingEntities,
(row) -> {name: row.name, `type`: row.`type`, metadata: row.metadata}
),
report = Collection.Transform(
grouped,
(g) ->
{
key: g.key,
total_salience: Collection.Sum(g.group.salience),
story_count: Collection.Count(g.group),
stories: Collection.Distinct(g.group.link),
mention_count: Collection.Count(Collection.Explode(g.group, (g) -> g.mentions))
}
)
in
Collection.OrderBy(report, (row) -> row.story_count, "DESC")
To start, we use an RSS feed. RSS is a well-known XML format that presents website updates in a machine-readable form. For our example, we use the CNN US channel, which lists news stories on different subjects.
Then, for each news article link in the feed, we issue a call to OpenGraph.io, which exposes an API that extracts OpenGraph metadata from the content of a URL.
Finally, the text summaries are sent to Google's Natural Language API, which returns the well-identified entities (people, companies, locations) it recognized in them.
Now that we have associated each article with its entities, we perform an aggregation to shape this information to our needs.
Step 1: Read an RSS feed
The RSS format is based on XML.
Snapi supports XML with Xml.InferAndRead. Here's how to read the CNN US news feed:
let
feed = Xml.InferAndRead("http://rss.cnn.com/rss/edition_us.rss")
in
// links, titles and other metadata which reside inside the `item` node, within `channel`.
feed.channel.item.title
The results look like:
[
"Suspect in Dallas Zoo animal thefts allegedly admitted to the crime and says he would do it again, affidavits claim",
"School and food vendor apologize for insensitive lunch served on first day of Black History Mont",
"An off-duty New York police officer who was shot while trying to buy an SUV has died",
"Labor Secretary Marty Walsh expected to leave Biden administration | CNN Politics",
...
"HS football players gain perspective helping vets",
"Milwaukee Dancing Grannies planning return",
"Fire crews respond to fire at boarded up building",
...
]
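For readers who think more easily in Python, the same read can be sketched with the standard library. This is only an illustration of the logic, not Snapi code: the inline feed is a hypothetical stand-in for the CNN feed, and `read_items` is a name invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened RSS document standing in for the CNN feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return one {title, link} dict per <item> found under <channel>."""
    root = ET.fromstring(rss_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.findall("./channel/item")
    ]


items = read_items(SAMPLE_RSS)
# Equivalent of projecting feed.channel.item.title in the query above.
titles = [item["title"] for item in items]
print(titles)
```

As in the Snapi query, the interesting data sits in the `item` nodes nested inside `channel`; everything else in the feed is ignored.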
Step 2: Extracting OpenGraph metadata
RSS data contains some metadata about each article it refers to (e.g. its title), but more metadata can be found in the articles themselves, so we have to follow each link and process the article it points to.
OpenGraph specifies a set of <meta/> HTML tags that help include generic web pages in Facebook's social graph.
For news articles, these tags include the title of the article, its type (e.g. article, opinion), links to illustrations and a description that contains a summary of the article.
Here are the tags found in one of the CNN articles:
<meta property="og:title" content="Suspect in Dallas Zoo animal thefts allegedly admitted to the crime and says he would do it again, affidavits claim">
<meta property="og:site_name" content="CNN">
<meta property="og:type" content="article">
<meta property="og:url" content="https://www.cnn.com/2023/02/08/us/dallas-zoo-suspect-arrest-affidavits/index.html">
<meta property="og:image" content="https://cdn.cnn.com/cnnnext/dam/assets/230207231552-01-dallas-zoo-020323-file-super-tease.jpg">
<meta property="og:description" content="The man who faces charges stemming from a string of suspicious activities at the Dallas Zoo allegedly admitted to stealing two tamarin monkeys and trying to steal the clouded snow leopard last month, according to arrest warrant affidavits.">
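To make the role of these tags concrete, here is a minimal Python sketch of how `og:` properties could be collected from a page. It is an illustration only and says nothing about how OpenGraph.io works internally; the class name `OpenGraphParser` and the sample page are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            # Strip the "og:" prefix: "og:title" becomes "title".
            self.tags[prop[3:]] = attributes["content"]


# Hypothetical page fragment, shortened from the tags listed above.
SAMPLE_HTML = """
<html><head>
<meta property="og:site_name" content="CNN">
<meta property="og:type" content="article">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(SAMPLE_HTML)
og = parser.tags
print(og)
```

Only `meta` tags whose `property` starts with `og:` are kept, so unrelated tags such as `viewport` are skipped.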
The OpenGraph.io website exposes an API that extracts OpenGraph metadata from the content of a URL. This includes the description field. We'd like to isolate that description field in order to perform textual analysis later. As we process the collection of links found in the RSS file, the content of each article's description tag can be obtained by passing its link to OpenGraph.io's API.
Let's define a function that performs the HTTP call to OpenGraph.io.
description(url: string) =
let
encoded = Http.UrlEncode(url),
key = "####",
opengraphReq = Http.Get(
"https://opengraph.io/api/1.1/site/" + encoded,
args = [{"app_id", key}]
),
metadata = Json.Read(
opengraphReq,
type record(
hybridGraph: record(title: string, description: string)
)
)
in
metadata.hybridGraph.description
Here's the metadata obtained for the example article; the function above keeps only its description field:
{
"title": "Suspect in Dallas Zoo animal thefts allegedly admitted to the crime and says he would do it again, affidavits claim | CNN",
"description": "The man who faces charges stemming from a string of suspicious activities at the Dallas Zoo allegedly admitted to stealing two tamarin monkeys and trying to steal the clouded snow leopard last month, according to arrest warrant affidavits.",
"type": "article",
"image": {
"url": "https://media.cnn.com/api/v1/images/stellar/prod/230207231552-01-dallas-zoo-020323-file.jpg?c=16x9&q=w_800,c_fill"
},
"url": "https://www.cnn.com/2023/02/08/us/dallas-zoo-suspect-arrest-affidavits/index.html",
"site_name": "CNN",
"articlePublishedTime": "2023-02-08T07:33:17Z",
"articleModifiedTime": "2023-02-08T08:14:39Z"
}
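The request and the field selection can also be sketched in Python without touching the network. This is an illustration, not the Snapi implementation: it assumes that `Http.UrlEncode` percent-encodes every reserved character and that `args` are sent as query-string parameters. The helper names, the `MY_KEY` placeholder and the sample response are hypothetical.

```python
import json
from urllib.parse import quote, urlencode

API_BASE = "https://opengraph.io/api/1.1/site/"


def build_request_url(article_url, app_id):
    # safe="" also escapes "/" and ":", so the article URL fits in one path segment.
    encoded = quote(article_url, safe="")
    return API_BASE + encoded + "?" + urlencode({"app_id": app_id})


def extract_description(response_text):
    # Keep only hybridGraph.description, as the Snapi function does.
    return json.loads(response_text)["hybridGraph"]["description"]


# Placeholder key and a hypothetical, shortened response body.
request_url = build_request_url("https://www.cnn.com/a/index.html", "MY_KEY")
sample_response = json.dumps(
    {"hybridGraph": {"title": "A title", "description": "A summary."}}
)
description = extract_description(sample_response)
print(request_url)
print(description)
```

Encoding the article URL matters because it is embedded in the path of the API URL; left unescaped, its own slashes would be read as path separators.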
Step 3: Perform the textual analysis
A second function called analyze (its code appears in the full endpoint listing above) sends the content of the description field to Google's Natural Language API, also over HTTP. The function returns the set of entities identified by the service. Here's the entity matching Joe Biden:
{
"name": "Joe Biden",
"type": "PERSON",
"metadata": {
"mid": "/m/012gx2",
"wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden"
},
"salience": 0.2149425,
"mentions": [
{ "text": { "content": "Biden", "beginOffset": 54 }, "type": "PROPER" },
{ "text": { "content": "Joe Biden", "beginOffset": 190 }, "type": "PROPER" },
{ "text": { "content": "President", "beginOffset": 180 }, "type": "COMMON" }
]
}
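As a rough Python illustration of what analyze exchanges with the service, the sketch below builds the same JSON request body as the endpoint code and picks entities out of a response of the shape shown above. No HTTP call is made; the helper names and the shortened sample response are assumptions made for this sketch.

```python
import json


def build_analyze_body(text):
    # Same document structure as the Json.Print call in the endpoint code.
    return json.dumps(
        {
            "document": {"type": "PLAIN_TEXT", "content": text},
            "encodingType": "UTF8",
        }
    )


def entity_names(response_text, wanted_type):
    # Return the names of all entities of the requested type.
    response = json.loads(response_text)
    return [e["name"] for e in response["entities"] if e["type"] == wanted_type]


# Hypothetical, shortened response with the same fields as the entity above.
sample_response = json.dumps(
    {
        "entities": [
            {"name": "Joe Biden", "type": "PERSON", "salience": 0.2149425, "mentions": []},
            {"name": "secretary departure", "type": "EVENT", "salience": 0.045937307, "mentions": []},
        ],
        "language": "en",
    }
)

body = build_analyze_body("Some article summary.")
people = entity_names(sample_response, "PERSON")
print(body)
print(people)
```

Note that salience is a fractional score, so any typed description of the response needs a floating-point field for it.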
Step 4: Our data product
The two functions are chained in order to augment the initial RSS data with textual analysis:
let
feed = Xml.InferAndRead("http://rss.cnn.com/rss/edition_us.rss"),
items = feed.channel.item,
withMetadata = Collection.Transform(
items,
(i) ->
{title: i.title, link: i.link, description: description(i.link)}
),
withAnalysis = Collection.Transform(
withMetadata,
(r) -> Record.AddField(r, analysis = analyze(r.description))
),
....
Here is an example of a row that has been augmented with both the description and its entities:
{
"title": "Labor Secretary Marty Walsh expected to leave Biden administration | CNN Politics",
"link": "https://www.cnn.com/2023/02/07/politics/marty-walsh-leaving/index.html",
"description": "Labor Secretary Marty Walsh is expected to depart the Biden administration soon, according to two people familiar with the matter, marking the first Cabinet secretary departure of President Joe Biden's presidency.",
"analysis": {
"entities": [
{
"name": "Marty Walsh",
"type": "PERSON",
"metadata": {
"wikipedia_url": "https://en.wikipedia.org/wiki/Marty_Walsh",
"mid": "/m/0swn343"
},
"salience": 0.50773776,
"mentions": [
{ "text": { "content": "Marty Walsh", "beginOffset": 16 }, "type": "PROPER" },
{ "text": { "content": "Labor Secretary", "beginOffset": 0 }, "type": "COMMON" }
]
},
{
"name": "Joe Biden",
"type": "PERSON",
"metadata": {
"mid": "/m/012gx2",
"wikipedia_url": "https://en.wikipedia.org/wiki/Joe_Biden"
},
"salience": 0.2149425,
"mentions": [
{ "text": { "content": "Biden", "beginOffset": 54 }, "type": "PROPER" },
{ "text": { "content": "Joe Biden", "beginOffset": 190 }, "type": "PROPER" },
{ "text": { "content": "President", "beginOffset": 180 }, "type": "COMMON" }
]
},
...
{
"name": "secretary departure",
"type": "EVENT",
"metadata": {},
"salience": 0.045937307,
"mentions": [
{ "text": { "content": "secretary departure", "beginOffset": 157 }, "type": "COMMON" }
]
},
...
],
"language": "en"
}
}
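The two-stage enrichment can be pictured in Python as two successive mappings over the items. This is a sketch of the data flow only: `describe` and `analyze` are stubs that stand in for the two API calls, and the sample items (including the extra `guid` field) are hypothetical.

```python
def describe(link):
    # Stub standing in for the OpenGraph.io call.
    return "Summary of " + link


def analyze(text):
    # Stub standing in for the Natural Language API call.
    return {"entities": [], "language": "en"}


items = [
    {"title": "First story", "link": "https://example.com/first", "guid": "1"},
    {"title": "Second story", "link": "https://example.com/second", "guid": "2"},
]

# Stage 1: keep title and link, add the description fetched from the link.
with_metadata = [
    {"title": i["title"], "link": i["link"], "description": describe(i["link"])}
    for i in items
]

# Stage 2: add an "analysis" field computed from the description.
with_analysis = [{**r, "analysis": analyze(r["description"])} for r in with_metadata]
print(with_analysis[0])
```

The first stage projects each item down to the fields of interest, while the second only appends a field, which mirrors the roles of the record constructor and Record.AddField in the query.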
Present aggregated results
Each RSS feed item now carries the output of the two external APIs.
Depending on the question we are asking, the final query could return different structures.
Here we show a query that aggregates entity and type information across all the pages in the RSS feed and returns it in descending order of "hits", to see what’s "most reported".
let //
// ...
//
explodeEntities = Collection.Explode(
withAnalysis,
(row) -> row.analysis.entities
),
interestingEntities = Collection.Filter(
explodeEntities,
(row) ->
List.Contains(
[
"PERSON",
"LOCATION",
"ORGANIZATION",
"EVENT",
"WORK_OF_ART",
"CONSUMER_GOOD"
],
row.`type`
)
),
grouped = Collection.GroupBy(
interestingEntities,
(row) ->
{name: row.name, `type`: row.`type`, metadata: row.metadata}
),
report = Collection.Transform(
grouped,
(g) ->
{
key: g.key,
total_salience: Collection.Sum(g.group.salience),
story_count: Collection.Count(g.group),
stories: Collection.Distinct(g.group.link),
mention_count: Collection.Count(Collection.Explode(g.group, (g) -> g.mentions))
}
)
in
Collection.OrderBy(report, (row) -> row.story_count, "DESC")
The results are:
[
{
"key": {
"name": "police",
"type": "PERSON",
"metadata": {
"value": null,
"wikipedia_url": null,
"mid": null,
"currency": null,
"year": null
}
},
"total_salience": 0.56725444,
"story_count": 3,
"stories": [
"https://abc7ny.com/police-involved-shooting-grand-concourse-section-suspect-shot-in-head-and-leg-bronx/12524318",
"https://www.atlantanewsfirst.com/2022/12/04/police-2-ford-mustangs-totaling-nearly-200k-stolen-upson-county-dealership/",
"https://www.cbs58.com/news/horizon-west-condo-owners-in-waukesha-remember-building-fire-one-year-later"
],
"mention_count": 3
},
{
"key": {
"name": "students",
"type": "PERSON",
"metadata": {
"value": null,
"wikipedia_url": null,
"mid": null,
"currency": null,
"year": null
}
},
"total_salience": 0.42099381599999997,
"story_count": 2,
"stories": [
"https://www.cnn.com/2023/02/06/us/aramark-black-history-month-menu-school-reaj/index.html",
"https://www.wptv.com/news/education/200-000-worth-of-supplies-distributed-for-palm-beach-county-schools-during-giveaway-event"
],
"mention_count": 2
},
{
"key": {
"name": "Amazon",
"type": "ORGANIZATION",
"metadata": {
"value": null,
"wikipedia_url": "https://en.wikipedia.org/wiki/Amazon_(company)",
"mid": "/m/0mgkg",
"currency": null,
"year": null
}
},
"total_salience": 0.006539275,
"story_count": 1,
"stories": [
"https://www.tmj4.com/news/local-news/10-year-old-upset-over-vr-headset-fatally-shoots-mother-charged-as-an-adult"
],
"mention_count": 1
},
...
...
...
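The aggregation logic (explode, filter, group, summarize, sort) can be checked on a small scale with a Python sketch. It is an illustration rather than a translation of the Snapi query: the sample rows are hypothetical, the grouping key leaves out metadata for brevity, and it assumes that exploding a row keeps the parent's link next to each entity's fields.

```python
from collections import defaultdict

INTERESTING = {
    "PERSON", "LOCATION", "ORGANIZATION", "EVENT", "WORK_OF_ART", "CONSUMER_GOOD",
}

# Hypothetical analyzed rows; each mention is an empty placeholder dict.
rows = [
    {"link": "https://example.com/a", "analysis": {"entities": [
        {"name": "police", "type": "PERSON", "salience": 0.5, "mentions": [{}, {}]},
        {"name": "2023", "type": "DATE", "salience": 0.25, "mentions": [{}]},
    ]}},
    {"link": "https://example.com/b", "analysis": {"entities": [
        {"name": "police", "type": "PERSON", "salience": 0.25, "mentions": [{}]},
        {"name": "Amazon", "type": "ORGANIZATION", "salience": 0.125, "mentions": [{}]},
    ]}},
]

# Explode: one output row per (story, entity) pair, keeping the story link.
exploded = [
    {"link": row["link"], **entity}
    for row in rows
    for entity in row["analysis"]["entities"]
]

# Filter: keep only the entity types of interest.
interesting = [e for e in exploded if e["type"] in INTERESTING]

# Group by (name, type).
groups = defaultdict(list)
for e in interesting:
    groups[(e["name"], e["type"])].append(e)

# Summarize each group, then sort by story count, highest first.
report = [
    {
        "key": {"name": name, "type": type_},
        "total_salience": sum(e["salience"] for e in group),
        "story_count": len(group),
        "stories": sorted({e["link"] for e in group}),
        "mention_count": sum(len(e["mentions"]) for e in group),
    }
    for (name, type_), group in groups.items()
]
report.sort(key=lambda r: r["story_count"], reverse=True)
print(report)
```

As in the query, story_count counts the rows in each group, so it equals the number of stories only when an entity appears at most once per story.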