Distributional semantics models can provide probabilistic information about word meanings based on contextual clues. An agent can use distributional evidence to update its probabilistic information state about unknown words. In experiments, distributional similarity evidence increased the probability that properties of a known word (like "crocodile") also apply to an unknown word ("alligator"). Higher similarities led to more confident inferences. Combining multiple pieces of evidence further increased probabilities, allowing an agent to infer an unknown word refers to an animal based on its similarities to both "crocodile" and "trout".
Katrin Erk - 2017 - What do you know about an alligator when you know the company it keeps?
1. What do you know about
an alligator when you know
the company it keeps?
Katrin Erk
University of Texas at Austin
STARSEM 2017
2. Distributional semantics and
you
• Distributional models/Embeddings: An incredible
success story in computational linguistics
• Do you make use of distributional information, too?
• Landauer & Dumais, 1997: “A solution to Plato’s problem”
• How do humans acquire such a gigantic vocabulary in such a
short time?
• Much debate in psychology,
experimental support: McDonald&Ramscar, 2001,
Lazaridou et al, 2016
• But how about the linguistic side of the story?
3. “A solution to Plato’s problem”
“Many well-read adults know that Buddha sat long
under a banyan tree (whatever that is) and Tahitian
natives lived idyllically on breadfruit and poi (whatever
those are). More or less correct usage often precedes
referential knowledge” (Landauer&Dumais, 1997)
4. “A solution to Plato’s problem”
“Many well-read adults know that Buddha sat long
under a banyan tree (whatever that is) and Tahitian
natives lived idyllically on breadfruit and poi (whatever
those are). More or less correct usage often precedes
referential knowledge” (Landauer&Dumais, 1997)
But wait: How can you use the word “banyan” more or
less correctly when you are not aware of its reference?
When you couldn’t point out a banyan in a yard?
5. Learning about word meaning
from textual context
• Main aim: insight
• What information is present in distributional
representations, and why?
• Assuming a learner with grounded concepts:
How can distributional information contribute?
6. Learning about meaning from
textual context
Suppose you do not know what an alligator is. What
do these sentences tell you about alligators?
• On our last evening, the boatman killed an alligator as
it crawled past our camp-fire to go hunting in the reeds
beyond.
• A study done by Edwin Colbert and his colleagues
showed that a tiny 50 gramme (1.76 oz) alligator heated
up 1°C every minute and a half from the Sun[…]
• The throne was occupied by a pipe-smoking alligator.
7. Learning about word meaning
from textual context
• Setting: adult learner
• What kind of information can you get from text?
• How does it enable you to use “alligator” more or less
correctly?
• Why can you learn anything from text?
• Textual clues are rarely 100% reliable
• “An alligator was lying at the bottom of a pool”
• Could be an animal, a pool-cleaning implement…
8. The story in a nutshell
• How can I successfully use the word “alligator”
when I don’t know what it refers to?
• I know some properties of alligators: they are
animals, dangerous, …
• So then I use “alligator” in animal-like textual
contexts
9. The story in a nutshell
• How does distributional information help?
• It lets me infer properties of words:
• Suppose I don’t know what an alligator is
• But it appears in similar contexts as “crocodile”
• So it must be something like a crocodile:
• That is, it must share properties with a crocodile
• So it may be an animal, it may be dangerous…
10. The story in a nutshell
• But distributional information can never yield
certain knowledge
• Instead uncertain, probabilistic information
• Formal semantics framework
• Probabilistic semantics:
• Probability distribution over worlds that could be the
current one
• Probability of a world influenced by distributional
information
11. Plan
• What can an agent learn from distributional context?
• A probabilistic information state
• Influencing a probabilistic information state with
distributional information
• A toy experiment
12. What is in an embedding?
• What information can be encoded in an embedding
computed from text data?
• Lots of things, given the right objective function
• But:
• What objective function can we assume a human agent
to use?
• What individual linguistic phenomena have been
shown to be encoded?
• So, restrict ourselves to simple model
13. What is in an embedding?
• Count-based models of textual context
• (and neural models like word2vec,
see Levy&Goldberg 2015)
• Long-time criticism in psychology, e.g. Murphy (2002):
only a vague notion of “similarity”
• But in fact distributional models can distinguish between
semantic relations
• by choice of what “context” means
• through relation-specific classifiers (Fu et al, 2014; Levy et al,
2015; Shwartz et al, 2016; Roller & Erk, 2016, …)
14. The effect of context window size
• Peirsman 2008 (Dutch):
• Narrow context window: high ratings to “similar” words
• Particularly to co-hyponyms
• Syntactic context even more so
• Wide context window: high ratings to “related” words
• Baroni/Lenci 2011 (English):
• Narrow context window: highest ratings to co-hyponyms
• Wide context window: ratings equal across many relations
15. What is narrow-window
similarity?
• High ratings for co-hyponyms, also synonyms, some
hypernyms, antonyms (well-known bug)
• What semantic relation is that?
• Co-hyponymy is an odd relation
• dictionary-specific
• can be incompatible (cat/dog) or compatible
(hotel/restaurant)
• Proposal: property overlap
• Alligator, crocodile have many properties in common:
animal, reptile, scaly, dangerous, …
16. Why does narrow-window
similarity do this?
• Focus on noun targets
• Narrow window, syntactic context contain:
• Modifiers
• Verbs that take target as argument
• Selectional constraints
• Traditionally formulated in terms of taxonomic
properties
• subject of “crawl”: animate
17. But wait, where do the
probabilities come from?
• Frequency in text is not frequency in real life
• Reporting bias: Almost no one says “Bananas are
yellow” (Bruni et al, 2012)
• Genre bias: “captive” and “westerner” are each other’s
nearest neighbors in Lin 1998
• Then how can counts in text lead us to probabilities
relevant to grounded concepts?
18. But wait, where do the
probabilities come from?
• Two tricks in this study
1. Only consider properties that apply to all members of
a category (like “being an animal”)
2. Use distributional context only indirectly: Learn
correlation between distributional context and real-
world properties
• More recent work: trick 2 without trick 1
• I think we can use distributional context directly
and properly to get probabilities – more later
19. Learning properties from
distributional data
• Concrete noun concepts
• To learn: properties of a concept
• Focus on properties applying to all members of a
category (like taxonomic properties)
• Broad definition of a property: can be expressed as an
adjective, can be a hypernym, …
20. Property overlap
• Percentage of properties that are joint
• Jaccard coefficient on sets
• A, B, sets of properties:
• Degrees of property overlap
• Idea: The more properties in common, the higher the
distributional similarity
Jac(A, B) = |A ∩ B| / |A ∪ B|
Example: Jac = 2 / 6 = 0.33
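The overlap measure can be sketched directly; the property sets below are made-up illustrations, chosen to reproduce the 2/6 example:

```python
def jaccard(a, b):
    """Property overlap as the Jaccard coefficient of two property sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Made-up property sets: 2 shared properties out of 6 in total
a_props = {"animal", "reptile", "scaly", "green"}
b_props = {"animal", "scaly", "edible", "aquatic"}
print(jaccard(a_props, b_props))  # 2 / 6 ≈ 0.33
```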
21. Plan
• What can an agent learn from distributional context?
• A probabilistic information state
• Influencing a probabilistic information state with
distributional information
• A toy experiment
22. Information states
• Information state of Agent: set of worlds that the agent
considers possibilities
• Agent not omniscient
• As far as Agent is concerned, any of these worlds could be
the actual world
• Update semantics: Information state updated through
communication (Veltman 1996)
• Probabilistic information state: probability distribution
over worlds (van Benthem et al. 2009, Zeevat 2013)
• Not all worlds equally likely to be the actual world
23. Probabilistic logics
• Uncertainty about the world we are in
• Probability distribution over worlds
• Nilsson 1986
• Probability that a sentence is true depends on the
probabilities of the worlds in which it is true
P(φ) = Σ_{w : ‖φ‖_w = t} P(w)
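A minimal sketch of this sentence-probability computation, over a made-up three-world information state:

```python
# A made-up information state: three worlds and their probabilities
worlds = [
    {"alligator_is_animal": True,  "p": 0.5},
    {"alligator_is_animal": True,  "p": 0.3},
    {"alligator_is_animal": False, "p": 0.2},
]

def prob(sentence, worlds):
    """P(sentence) = sum of P(w) over the worlds in which it is true."""
    return sum(w["p"] for w in worlds if sentence(w))

print(prob(lambda w: w["alligator_is_animal"], worlds))  # 0.8
```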
24. Generating a probability
distribution over worlds
• Text understanding as a generative process
• Agent mentally simulates (i.e., probabilistically
generates) the situation described in the text
• Goodman et al, 2015; Goodman and Lassiter, 2016
• To generate a person:
• draw gender: flip a fair coin
• draw height from the normal distribution of heights for
that gender.
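The generative story for a person can be sketched as follows; the height means and standard deviations are illustrative assumptions, not values from the slides:

```python
import random

def generate_person():
    """Probabilistically generate a person, as in the slide's example."""
    # Draw gender: flip a fair coin
    gender = random.choice(["female", "male"])
    # Draw height (cm) from a gender-specific normal distribution;
    # the means and standard deviations here are made up for illustration
    mean, sd = (162, 7) if gender == "female" else (176, 7)
    return gender, random.gauss(mean, sd)

print(generate_person())
```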
25. Properties in a probabilistic
information state
• Property applies in a particular world: extension of
predicate included in extension of property in that
world
• Focus here: Properties that the agent is certain
about: apply in all worlds that have non-zero
probability
26. Plan
• What can an agent learn from distributional context?
• A probabilistic information state
• Influencing a probabilistic information state with
distributional information
• A toy experiment
27. Bayesian update on the probability
distribution over worlds
• Prior distribution over worlds P0
• Then we see distributional evidence Edist
• e.g.: Distributional similarity of “crocodile” and
“alligator” is 0.93
• Posterior distribution P1 given Edist
• How do we determine the likelihood?
P1(w) = P(w | Edist) = P(Edist | w) · P0(w) / P(Edist)
28. Interpreting distributional data
• Speaker observes words with known properties,
and their distributional similarity
Property overlap from McRae feature norms (McRae et al 2005).
Similarities from a narrow-context model computed on UKWaC+
Wikipedia+BNC
word 1 word 2 ovl sim
peacock raven 0.29 0.70
mixer toaster 0.19 0.72
crocodile frog 0.17 0.86
bagpipe banjo 0.10 0.72
scissors typewriter 0.04 0.62
crocodile lime 0.03 0.33
coconut porcupine 0.03 0.42
29. Observing regularities: high property overlap
goes with high distributional similarity
word 1 word 2 ovl sim
peacock raven 0.29 0.70
mixer toaster 0.19 0.72
crocodile frog 0.17 0.86
bagpipe banjo 0.10 0.72
scissors typewriter 0.04 0.62
crocodile lime 0.03 0.33
coconut porcupine 0.03 0.42
[Plot: property overlap versus similarity (artificial data);
x-axis: property overlap, y-axis: dist. sim.]
In the simplest case:
linear regression.
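A minimal sketch of that regression, fit to the (overlap, similarity) pairs from the table above, treated as toy data (NumPy assumed available):

```python
import numpy as np

# (property overlap, distributional similarity) pairs from the table above
ovl = np.array([0.29, 0.19, 0.17, 0.10, 0.04, 0.03, 0.03])
sim = np.array([0.70, 0.72, 0.86, 0.72, 0.62, 0.33, 0.42])

# Least-squares fit: sim ≈ b0 + b1 * ovl
b1, b0 = np.polyfit(ovl, sim, deg=1)
residuals = sim - (b0 + b1 * ovl)
sigma = residuals.std(ddof=2)  # error sd for the normal-error view below

print(f"intercept {b0:.2f}, slope {b1:.2f}, error sd {sigma:.2f}")
```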
30. Given the regularities I observed, and the
distributional evidence, what do I now
think of world w?
• World w:
• property overlap of crocodile and alligator is o = 0.1
• Predicted similarity: β₀ + β₁·o = 0.53
• Distributional evidence: sim(crocodile, alligator) = 0.93
• How likely are we to observe a distributional
similarity of 0.93 if the predicted similarity is 0.53?
• Standard move in hypothesis testing: How likely are we to
see an observed value this high or higher,
given the predicted distribution?
31. Likelihood of the distributional
evidence in this world
• What distribution?
• Equivalent view of linear regression:
Observed similarity = predicted similarity + normally
distributed error
• Normal distribution with mean f(o) = β₀ + β₁·o
[Plot: normal probability density over dist. rating, centered at f(o)]
32. Likelihood of the distributional
evidence in this world
• Distributional similarity s = sim(crocodile, alligator)
• Hypothesis testing: How likely to see similarity value
as high as s or higher given property overlap o?
[Plot: normal density with mean f(o); observed similarity s marked in the upper tail]
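This upper-tail likelihood can be sketched with the normal survival function. The coefficients b0, b1 and error sd sigma below are made-up values, chosen so that the predicted similarity at overlap 0.1 is 0.53, as on the slide:

```python
from math import erf, sqrt

def likelihood(s, o, b0, b1, sigma):
    """P(similarity >= s | overlap o): upper tail of a normal distribution
    with mean f(o) = b0 + b1*o and standard deviation sigma."""
    z = (s - (b0 + b1 * o)) / sigma
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))  # 1 - standard normal CDF at z

# Made-up parameters chosen so that f(0.1) = 0.53
print(likelihood(0.93, o=0.1, b0=0.43, b1=1.0, sigma=0.15))
```

A similarity of 0.93 is far above the predicted 0.53, so the likelihood of this world is small.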
33. Computing posterior probabilities in
a probabilistic generative framework
• Probabilistically generate worlds:
• “To generate a person, flip a fair coin to determine their
gender…”
• Approximately determine probability distribution
over worlds: Sample n probabilistically generated
worlds
• Sample from posterior:
• Rejection sampling
• Formulate likelihood as a sampling condition
34. Computing posterior probabilities in
a probabilistic generative framework
• Property overlap o between crocodiles and alligators
in world w
• Distributional similarity s = sim(crocodile, alligator)
• Keep w if similarity as high as s or higher is likely
given o
• Sample s’ from the normal
distribution with mean f(o)
• Keep world w if s’ >= s
[Plot: normal density with mean f(o); observed similarity s marked in the upper tail]
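The sampling condition can be sketched as follows, with made-up regression parameters (b0, b1, sigma are illustrative, chosen so that overlap 0.5 predicts the observed similarity exactly):

```python
import random

def keep_world(o, s, b0, b1, sigma):
    """Rejection-sampling condition: draw a similarity s' from the normal
    with mean f(o) = b0 + b1*o; keep the world iff s' >= s."""
    s_prime = random.gauss(b0 + b1 * o, sigma)
    return s_prime >= s

# With overlap o = 0.5 the predicted similarity equals the observed
# s = 0.93, so about half the sampled worlds are kept; with the low
# overlap o = 0.1, almost none survive.
n = 10000
rate_high = sum(keep_world(0.5, 0.93, 0.43, 1.0, 0.15) for _ in range(n)) / n
rate_low = sum(keep_world(0.1, 0.93, 0.43, 1.0, 0.15) for _ in range(n)) / n
print(rate_high, rate_low)
```

High-overlap worlds survive the condition far more often, which is how they end up with more posterior mass.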
35. Plan
• What can an agent learn from distributional context?
• A probabilistic information state
• Influencing a probabilistic information state with
distributional information
• A toy experiment
36. Toy experiments
• Property collection: McRae et al., 2005
• Human-generated definitional features for concrete noun
properties
• Distributional model: narrow context, UKWaC + Wikipedia +
BNC
• Hold out alligator as unknown word
• Given distributional evidence, how likely are we to believe…
1. All alligators are dangerous
2. All alligators are edible
3. All alligators are animals
37. Toy experiments
• All alligators are dangerous:
• Known word: crocodile. sim(alligator, crocodile) = 0.93
• Crocodiles are animals, dangerous, scaly, and crocodiles
• All alligators are edible:
• Known word: trout. sim(alligator, trout) = 0.68
• Trout are animals, aquatic, edible, and trout
• Probability should be lower because similarity is lower
• All alligators are animals:
• Known words: crocodile, trout.
• Can evidence accumulate with multiple similarity ratings?
38. Generative story for the
prior probability
• Fix domain size to 10
• For each entity in the domain:
• Flip a fair coin to determine if it is a crocodile. Likewise for
alligator.
• For each entity in the domain:
• If it is a crocodile, it is also an animal, dangerous, and scaly.
• Otherwise, flip a fair coin to see if it is an animal (dangerous,
scaly).
Implemented in Church.
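The slides' Church program is not shown; a Python sketch of the same generative story might look like this:

```python
import random

def sample_world(domain_size=10):
    """One probabilistically generated world, following the slide's story."""
    world = []
    for _ in range(domain_size):
        # Flip a fair coin for crocodile-hood; likewise for alligator-hood
        entity = {"crocodile": random.random() < 0.5,
                  "alligator": random.random() < 0.5}
        if entity["crocodile"]:
            # Crocodiles are certainly animals, dangerous, and scaly
            entity.update(animal=True, dangerous=True, scaly=True)
        else:
            # Otherwise, flip a fair coin for each property
            for prop in ("animal", "dangerous", "scaly"):
                entity[prop] = random.random() < 0.5
        world.append(entity)
    return world

def all_alligators_dangerous(world):
    return all(e["dangerous"] for e in world if e["alligator"])

worlds = [sample_world() for _ in range(20000)]
prior = sum(map(all_alligators_dangerous, worlds)) / len(worlds)
print(prior)  # close to 0.26, the prior in the results table
```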
39. Results: All alligators are…
Sentence           words                   sim    prior   posterior
. . . dangerous    alligator, crocodile    0.93   0.26    0.47
. . . edible       alligator, trout        0.68   0.26    0.38
• Aim: Significant increase in probability
• Absolute probabilities depend on domain size,
problem formulation
• Higher similarities lead to significantly more confident inferences
• “Crocodile” much more similar to “alligator” than “trout”:
Agent more confidently ascribes crocodile properties to alligators
40. Probability of property
overlap: prior versus posterior
[Histograms: number of worlds per property-overlap value of 'alligator'
with 'crocodile' (left) and with 'trout' (right),
prior (no dist. evidence) versus posterior (with dist. evidence)]
41. Accumulating evidence:
“All alligators are animals”
sim of alligator to . . .         prior   posterior
crocodile: 0.93                   0.53    0.68
trout: 0.68                       0.53    0.63
crocodile: 0.93, trout: 0.68      0.53    0.80
• Does distributional evidence accumulate?
• Both crocodiles and trout are known to be animals
• Posterior significantly higher
when two pieces of evidence present
42. Summary
• How can people use a word whose reference they don’t
know?
• Suppose we don’t know what an alligator is, can we still
infer from context clues that it’s an animal?
• Proposal:
• (Narrow-window) distributional evidence is property overlap
evidence
• Distributional evidence affects probabilistic information state
• Can be described in probabilistic generative framework
43. Next questions
• Learning from a single sentence only
• On our last evening, the boatman killed an alligator as it
crawled past our camp-fire to go hunting in the reeds beyond.
• Distributional one-shot learning
• Doable: same setup, learn McRae et al. definitional features
using selectional constraints of neighboring predicates
• Properties that do not apply to all members of a category
• Some but not all crocodiles are dangerous
• Learn probability of generating a property for “alligator”
44. Next questions
• Here: Learn from context only indirectly,
from correlation with grounded properties
• Can we learn from what is said in the text?
• On our last evening, the boatman killed an alligator as it
crawled past our camp-fire to go hunting in the reeds beyond.
• Alligators are entities that generally crawl, hunt, and are
found in reeds
• P(q is a generic property of alligators that would be
mentioned by people)
• Relevant to “human experience of alligators”
(Thill/Padó/Ziemke 2014)
45. Thanks
Gemma Boleda, Louise McNally, Judith Tonhauser
(best editor on earth!), Nicholas Asher, Marco Baroni,
David Beaver, John Beavers, Ann Copestake, Ido
Dagan, Aurélie Herbelot, Hans Kamp, Alexander
Koller, Alessandro Lenci, Sebastian Löbner, Julian
Michael, Ray Mooney, Sebastian Padó, Manfred
Pinkal, Stephen Roller, Hinrich Schütze, Jan van Eijck,
Leah Velleman, Steve Wechsler, Roberto Zamparelli,
and the Foundations of Semantic Spaces reading group