Demo of ALC embeddings - Discussion in Italian parliament about immigration in 2015

Purpose

We rely on speeches in the Italian parliament (2013-2020) taken from ParlaMint to illustrate a possible use case for our ALC resources. By showing how government and opposition parties differentially adjusted their speeches on immigration following the 2015 refugee crisis in Europe, we illustrate how these resources can be used to make inferences about semantic differences across time and groups. Our resources are fully integrated with the conText R package. For more information on how to get started with conText, see its Quick Start Guide: https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md

Embedding resources

# transformation matrix
local_fasttext = readRDS("../../replication/data/raw/embeddings/it/fastText/fasttext_transform_itwiki_25.rds")
dim(local_fasttext)

## [1] 300 300

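# packages used throughout this demo (presumably loaded in a setup chunk not shown here)
library(data.table)  # setDT()
library(readr)       # read_delim(), cols()
library(dplyr)       # %>%, select(), where(), glimpse(), mutate(), arrange()
library(quanteda)    # corpus(), tokens(), dfm(), fcm(), stopwords()
library(conText)     # tokens_context(), get_nns(), conText(), get_nns_ratio(), ...
library(ggplot2)     # ggplot()
library(latex2exp)   # TeX()

# pretrained embeddings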
not_all_na <- function(x) any(!is.na(x))
fasttext <-  setDT(read_delim("../../replication/data/raw/embeddings/it/fastText/fasttext_vectors_itwiki.vec",
                              delim = " ",
                              quote = "",
                              skip = 1,
                              col_names = F,
                              col_types = cols())) %>%
  dplyr::select(where(not_all_na)) # remove last column which is all NA
word_vectors <-  as.matrix(fasttext, rownames = 1)
colnames(word_vectors) <-  NULL
rm(fasttext)
dim(word_vectors)

## [1] 309561    300
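
To see how the two resources fit together, the following base R sketch computes a single ALC embedding by hand: the pretrained vectors of the words in a context window are averaged, and the transformation matrix is then applied to that average. The vocabulary, the vectors and the matrix `toy_transform` are made-up stand-ins for `word_vectors` and `local_fasttext`; this illustrates the idea and is not what conText runs internally.

```r
# Toy sketch of an ALC embedding (all numbers are made up for illustration).
set.seed(1)
d <- 4                                    # embedding dimension (300 in the demo)
vocab <- c("immigrazione", "richiedenti", "asilo", "legge", "reato")
toy_vectors <- matrix(rnorm(length(vocab) * d), nrow = length(vocab),
                      dimnames = list(vocab, NULL))  # stands in for word_vectors
toy_transform <- diag(d) + 0.1            # stands in for local_fasttext (d x d)

# words observed in one context window around the target term
context <- c("richiedenti", "asilo", "legge")

# 1) average the pretrained vectors of the context words
context_mean <- colMeans(toy_vectors[context, , drop = FALSE])

# 2) apply the transformation matrix to obtain the ALC embedding
alc_embedding <- as.numeric(toy_transform %*% context_mean)
alc_embedding
```

The result has the same dimension as the pretrained vectors, so it can be compared directly with them, for example through cosine similarity.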

Corpus

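We use ParlaMint data on Italian parliamentary debates from a single chamber: the Senate, which the data label as the upper house (see the House and Title columns in the output below).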

Preprocessing should generally stay close to what was done when training the embeddings:

remove punctuation between tokens (default in quanteda)
convert to lower case
remove rare terms
use a window size of 5

# restricted to one chamber (House is "Upper house" in the output below)
data_lim <- readRDS("../../replication/data/analysis/examples/ParlaMint/parlamint_it.rds") 
glimpse(data_lim)

## Rows: 21,654
## Columns: 37
## $ doc_id             <chr> "ParlaMint-IT_2014-01-02-LEG17-Sed-159.u2", "ParlaM…
## $ text               <chr> "  Signor Presidente, chiedo la votazione del proce…
## $ Title              <chr> "Report of the session of the Senate of the Italian…
## $ From               <date> 2014-01-02, 2014-01-02, 2014-01-02, 2014-01-02, 20…
## $ To                 <date> 2014-01-02, 2014-01-02, 2014-01-02, 2014-01-02, 20…
## $ House              <chr> "Upper house", "Upper house", "Upper house", "Upper…
## $ Term               <chr> "17-upper", "17-upper", "17-upper", "17-upper", "17…
## $ Session            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Meeting            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Sitting            <chr> "159-upper", "159-upper", "159-upper", "159-upper",…
## $ Agenda             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Subcorpus          <chr> "Reference", "Reference", "Reference", "Reference",…
## $ Speaker_role       <chr> "Regular", "Regular", "Regular", "Regular", "Regula…
## $ Speaker_type       <chr> "MP", "MP", "MP", "MP", "MP", "MP", "MP", "MP", "MP…
## $ Speaker_party      <chr> "M5S.1", "M5S.1", "M5S.1", "M5S.1", "PD", "LN-Aut",…
## $ Speaker_party_name <chr> "Movimento 5 Stelle", "Movimento 5 Stelle", "Movime…
## $ Party_status       <chr> "Opposition", "Opposition", "Opposition", "Oppositi…
## $ Speaker_name       <chr> "Ciampolillo, Lello", "Ciampolillo, Lello", "Ciampo…
## $ Speaker_gender     <chr> "M", "M", "M", "M", "M", "M", "M", "F", "F", "M", "…
## $ Speaker_birth      <dbl> 1972, 1972, 1972, 1962, 1953, 1955, 1960, 1964, 197…
## $ Year               <chr> "2014", "2014", "2014", "2014", "2014", "2014", "20…
## $ Date               <date> 2014-01-02, 2014-01-02, 2014-01-02, 2014-01-02, 20…
## $ Moy                <chr> "2014-01", "2014-01", "2014-01", "2014-01", "2014-0…
## $ postimmig          <chr> "beforeImmig", "beforeImmig", "beforeImmig", "befor…
## $ postimmig_num      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ populist           <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ government         <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, …
## $ periodparty        <fct> 1.beforeImmig, 1.beforeImmig, 1.beforeImmig, 1.befo…
## $ partyperiod        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Moyparty           <fct> "Movimento 5 Stelle.2014-01", "Movimento 5 Stelle.2…
## $ Yearparty          <fct> "Movimento 5 Stelle.2014", "Movimento 5 Stelle.2014…
## $ governmentperiod   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ periodgovernment   <fct> 0.beforeImmig, 0.beforeImmig, 0.beforeImmig, 0.befo…
## $ moy                <fct> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ Moygovernment      <fct> 0, 0, 0, 0, 10, 0, 0, 0, 0, 10, 10, 10, 0, 0, 0, 10…
## $ months             <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ monthsgovernment   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

table(data_lim$Moy)

## 
## 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 
##     801     938     416     841     402     373     821     752     376     486 
## 2014-11 2014-12 2015-01 2015-02 2015-03 2015-04 2015-05 2015-06 2015-07 2015-08 
##     444     253     529     482     494     658     413     478     844     125 
## 2015-09 2015-10 2015-11 2015-12 2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 
##     419    1128     369     313     306     294     587     350     482     404 
## 2016-07 2016-08 2016-09 2016-10 2016-11 2016-12 2017-01 2017-02 2017-03 2017-04 
##     506     299     281     497     512      76     324     384     375     224 
## 2017-05 2017-06 2017-07 2017-08 2017-09 2017-10 2017-11 2017-12 
##     421     280     536      95     321     398     113     434

corp <- quanteda::corpus(data_lim)


#########################
## Corpus prep

# remove short speeches
corpus <- corp %>% 
  corpus_trim(what = "documents",
              min_ntoken = 10) 

# some pre-processing
toks <- tokens(corpus, remove_punct=T, remove_symbols=T) %>% 
  tokens_tolower()

# remove stopwords (the approach also works if they are kept)
toks_nostop <- tokens_select(toks, pattern = stopwords("it"), selection = "remove")

# only use features that appear at least 10 times in the corpus
feats <- dfm(toks_nostop, tolower=T, verbose = FALSE) %>%
  dfm_trim(min_termfreq = 10) %>% featnames()
head(feats, n = 50)

##  [1] "signor"           "presidente"       "chiedo"           "votazione"       
##  [5] "processo"         "verbale"          "previa"           "verifica"        
##  [9] "numero"           "legale"           "tratta"           "altro"           
## [13] "precisazione"     "dell'emendamento" "c'è"              "stato"           
## [17] "voto"             "senatrice"        "siede"            "posto"           
## [21] "accanto"          "senatore"         "gasparri"         "votato"          
## [25] "quel"             "momento"          "presente"         "vorrei"          
## [29] "venisse"          "messo"            "l'ennesima"       "ripetuta"        
## [33] "violazione"       "regolamento"      "norme"            "modalità"        
## [37] "viene"            "perpetrata"       "quest'aula"       "quindi"          
## [41] "venga"            "comunque"         "prima"            "ogni"            
## [45] "nave"             "crociera"         "venezia"          "inquina"         
## [49] "14.000"           "vecchie"

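# keep only the selected features; padding = TRUE leaves empty placeholders
# so that token positions (and hence window distances) are preserved
toks_nostop <- tokens_select(toks_nostop, feats, padding = TRUE)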
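
The role of padding in the last step can be illustrated with a small base R sketch. The sentence, the list of removed tokens and the helper `window_words()` are hypothetical. The point is that, without placeholders, removing tokens pulls distant words into the context window of the target.

```r
# Toy sketch: why padding matters when tokens are removed before windowing.
toks_toy <- c("la", "legge", "sulla", "immigrazione", "e", "la", "sicurezza")
removed <- c("la", "sulla", "e")          # hypothetical tokens to remove

# without padding: removed tokens disappear and distant words move closer
no_pad <- toks_toy[!toks_toy %in% removed]
# with padding: removed tokens become empty placeholders, positions are kept
pad <- ifelse(toks_toy %in% removed, "", toks_toy)

# non-empty words within `w` positions of the first occurrence of `target`
window_words <- function(x, target, w) {
  i <- which(x == target)[1]
  idx <- setdiff(max(1, i - w):min(length(x), i + w), i)
  x[idx][x[idx] != ""]
}

window_words(no_pad, "immigrazione", w = 2)  # "legge" "sicurezza"
window_words(pad, "immigrazione", w = 2)     # "legge"
```

With padding, "sicurezza" stays three positions away from the target and is correctly excluded from a window of size 2.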

Nearest neighbors by groups

A good first exploratory step is to look at the nearest neighbors of the ALC embeddings by group, i.e. the features with the highest cosine similarity to each group embedding, using conText::get_nns() (a wrapper function for conText::nns()). In our example, we are interested in the nearest neighbors of the ALC embedding of the word stem immigr* across government and opposition parties and across time. We use the candidates argument to limit the set of features that get_nns() considers as candidate nearest neighbors. Here we limit the candidates to features that appear in the context window around the target term (we could also let this set cover the entire corpus or all features in the pretrained embeddings).
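
As a minimal illustration of what "nearest neighbors by cosine similarity" means, the base R sketch below ranks three candidate features against one group embedding. The vectors and the helper `cosine_sim()` are made up for this example; get_nns() does the equivalent ranking on the real embeddings.

```r
# cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# hypothetical pretrained vectors for three candidate features
toy_vectors <- rbind(
  richiedenti  = c(0.9, 0.1, 0.0),
  emergenziale = c(0.5, 0.5, 0.1),
  verbale      = c(0.0, 0.2, 0.9)
)
group_embedding <- c(1.0, 0.2, 0.0)  # hypothetical ALC embedding of one group
candidates <- c("richiedenti", "emergenziale", "verbale")

# similarity of every candidate to the group embedding, highest first
sims <- apply(toy_vectors[candidates, , drop = FALSE], 1, cosine_sim,
              b = group_embedding)
ranked <- sort(sims, decreasing = TRUE)
names(ranked)[1:2]  # the two nearest neighbors
```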

target_toks <- tokens_context(x = toks_nostop, pattern = "immigr*", window = 5L)

## 378 instances of "immigrati" found.
## 19 instances of "immigrato" found.
## 280 instances of "immigrazione" found.
## 15 instances of "immigrazioni" found.

feats <- featnames(dfm(target_toks))

# nearest neighbors: features with the highest cosine-similarity with each group embedding

# ---------------------------------
# by government vs. opposition 

target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'government'),
                      candidates = feats, # restrict to candidates in context window
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = local_fasttext,
                      bootstrap = F) %>% 
  lapply(., "[[",2) %>% 
  do.call(rbind, .) %>% 
  as.data.frame()
target_nns[, 1:5]

##                  V1               V2               V3           V4
## 1 dell'immigrazione all'immigrazione      richiedenti immigrazione
## 0 dell'immigrazione      richiedenti all'immigrazione immigrazione
##               V5
## 1   emergenziale
## 0 l'immigrazione

# ---------------------------------
# across months

target_nns <- get_nns(x = target_toks, N = 10,
                      groups = docvars(target_toks, 'Moy'),
                      candidates = feats,
                      pre_trained = word_vectors,
                      transform = TRUE,
                      transform_matrix = local_fasttext,
                      bootstrap = F) %>% 
  lapply(., "[[",2) %>% 
  do.call(rbind, .) %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(var = "Moy") %>%
  arrange(lubridate::ym(Moy))
target_nns[19:25, 1:5]

##        Moy                V1            V2               V3            V4
## 19 2015-07       richiedenti  emergenziale      richiedente     chiediamo
## 20 2015-08       richiedenti   richiedente       lavoratori      migranti
## 21 2015-09       richiedenti   richiedente     emergenziale pregiudiziale
## 22 2015-10 dell'immigrazione  immigrazione all'immigrazione      migranti
## 23 2015-11       ventimiglia       francia         invadere    respingere
## 24 2015-12         incentiva previdenziali     emergenziale   sostenibile
## 25 2016-01       richiedenti   richiedente     immigrazione      migranti

Result:

Government and opposition parties differ little in their connotation of “immigration” across the entire sample period.
Right after the refugee shock (November and December 2015), speakers in the Italian parliament were more prone to speak of “immigration” in the context of “invasion”, “sustainability” or “social security”.
In other months, they were more likely to speak of general immigration issues (“applicants”, “workers”, “immigration”).

Regression framework

We evaluate the trend in semantic differences between government and opposition parties around the 2015 refugee crisis using embedding regression. conText::conText() uses ALC embeddings within a regression-style framework, i.e. it allows us to examine covariate effects on embeddings beyond discrete group variables, or while controlling for other covariates. Here we fit one regression per period and compare the norm of the government coefficient across periods.
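
The logic of the reported quantity can be sketched in base R on simulated data. The instance-level embeddings `Y` and the size of the shift between groups are made up. The sketch regresses each embedding dimension on the covariate, takes the Euclidean norm of the resulting coefficient vector, and obtains a permutation p-value by shuffling the group labels. It is meant to convey the idea only and is not conText's implementation.

```r
set.seed(2021)
n <- 60; d <- 5
government <- rep(c(0, 1), each = n / 2)
# hypothetical instance-level embeddings: government rows are shifted in dimension 1
Y <- matrix(rnorm(n * d), nrow = n)
Y[government == 1, 1] <- Y[government == 1, 1] + 1

fit <- lm(Y ~ government)            # one OLS regression per embedding dimension
beta <- coef(fit)["government", ]    # d-dimensional coefficient vector
norm_beta <- sqrt(sum(beta^2))       # Euclidean norm of the coefficient

# permutation p-value: share of label shuffles with a norm at least as large
perm_norms <- replicate(200, {
  g <- sample(government)
  b <- coef(lm(Y ~ g))["g", ]
  sqrt(sum(b^2))
})
p_value <- mean(perm_norms >= norm_beta)
c(norm = norm_beta, p = p_value)
```

With a single binary covariate, the coefficient vector is simply the difference between the two group means, which is why its norm can be read as the distance between the government and opposition embeddings.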

set.seed(2021L)
models <- lapply(unique(docvars(target_toks, 'months')), function(j){
  conText(formula =  . ~ government, 
          data = tokens_subset(target_toks, months == j), 
          pre_trained = word_vectors,
          transform = TRUE,
          transform_matrix = local_fasttext,
          stratify = T,
          jackknife = T,
          # bootstrap = T,
          permute = TRUE,
          num_permutations = 100,
          hard_cut = F,
          window = 5,
          case_insensitive = TRUE,
          verbose = T)
})

## total observations included in regression: 175 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error  lower.ci upper.ci p.value
## 1  government       0.8911084 0.1999593 0.4964504 1.285766    0.05
## total observations included in regression: 66 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error  lower.ci upper.ci p.value
## 1  government         1.34187 0.3037269 0.7352853 1.948454    0.06
## total observations included in regression: 130 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error  lower.ci upper.ci p.value
## 1  government       0.8568327  0.170718 0.5190629 1.194602    0.02
## total observations included in regression: 34 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error lower.ci upper.ci p.value
## 1  government        2.061913 0.5126719 1.018875 3.104952       0
## total observations included in regression: 77 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error  lower.ci upper.ci p.value
## 1  government        1.352791 0.2647936 0.8254088 1.880173    0.01
## total observations included in regression: 44 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error  lower.ci upper.ci p.value
## 1  government        1.350581 0.3374869 0.6699739 2.031188    0.05
## total observations included in regression: 103 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error  lower.ci upper.ci p.value
## 1  government        1.386223 0.2107045 0.9682915 1.804154       0
## total observations included in regression: 63 
## starting permutations 
## done with permutations 
## Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
## https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
## 
##   coefficient normed.estimate std.error  lower.ci upper.ci p.value
## 1  government        1.378633 0.3288502 0.7212713 2.035996    0.09

plot_tibble <- lapply(models, function(i) i@normed_coefficients) %>% 
  do.call(rbind, .) %>% 
  mutate(period = factor(seq(1, 8), labels = c("2014-01/06", "2014-07/12", 
                                               "2015-01/08", "2015-09/12", 
                                               "2016-01/06", "2016-07/12", 
                                               "2017-01/06", "2017-07/12")))

ggplot(data = plot_tibble,
       aes(x = period, 
           y = normed.estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower.ci,
                    ymax = upper.ci),
                width = 0.5) +
  geom_vline(xintercept = 3.5, linetype = "dashed") + 
  labs(x = "", 
       title = "Norm of Difference between Government and Opposition ALC embeddings of 'immigr*'",
       y = TeX("Norm of $\\hat{\\beta}$"))+
  theme_bw()

Cosine similarity ratios

Another exploratory exercise is to compute the cosine similarity ratio between group embeddings and features using conText::get_nns_ratio() (a wrapper function for conText::nns_ratio()). Given ALC embeddings for two groups, get_nns_ratio() computes, for each feature, the cosine similarity between that feature and each group embedding, and then takes the ratio of the two similarities.

This ratio captures how “discriminant” a feature is of a given group. Values larger than 1 mean the feature is more discriminant of the group in the numerator; values smaller than 1 mean it is more discriminant of the group in the denominator. Use the numerator argument to define which group represents the numerator in this ratio. If N is defined, this ratio is computed for the union of the top N nearest neighbors.
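
For a single feature, the ratio reduces to a few lines of base R. The vectors below are made up so that the arithmetic is easy to follow.

```r
# cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

feature_vec   <- c(1, 0)  # hypothetical pretrained vector of one feature
gov_embedding <- c(1, 1)  # hypothetical ALC embedding, government (numerator)
opp_embedding <- c(1, 2)  # hypothetical ALC embedding, opposition (denominator)

ratio <- cosine_sim(feature_vec, gov_embedding) /
  cosine_sim(feature_vec, opp_embedding)
ratio  # sqrt(5 / 2), about 1.58: closer to the government embedding
```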

plotfun <- function(period){
  temp <- tokens_subset(target_toks, months==period)
  feats <- featnames(dfm(target_toks))
  docvars(temp)$Government = ifelse(docvars(temp)$government==1, "Government", "Opposition")
  set.seed(111)
  target_nns_ratio <- get_nns_ratio(x = temp,
                                    N = 10,
                                    groups = docvars(temp, 'Government'),
                                    numerator = "Government",
                                    candidates = feats,
                                    pre_trained = word_vectors,
                                    transform = TRUE,
                                    transform_matrix = local_fasttext,
                                    bootstrap = T,
                                    num_bootstraps = 100,
                                    permute = TRUE,
                                    num_permutations = 100,
                                    verbose = T)
  
  return(target_nns_ratio)
}

out_before <- plotfun(period = 2) # Jan - Aug 2015

## starting bootstraps 
## done with bootstraps 
## starting permutations 
## done with permutations 
## NOTE: values refer to the ratio Government/Opposition.

out_before

## # A tibble: 14 × 7
##    feature           value std.error lower.ci upper.ci p.value group     
##    <chr>             <dbl>     <dbl>    <dbl>    <dbl>   <dbl> <chr>     
##  1 consapevoli       1.10     0.0727    0.968     1.22    0.17 Government
##  2 dell'attualità    1.04     0.0700    0.914     1.15    0.64 Government
##  3 richiedente       1.03     0.0945    0.895     1.19    0.75 Government
##  4 legalità          0.997    0.0792    0.888     1.11    0.95 Government
##  5 riteniamo         0.972    0.0771    0.816     1.09    0.78 shared    
##  6 richiedenti       0.967    0.0821    0.843     1.11    0.73 shared    
##  7 chiediamo         0.961    0.0874    0.790     1.10    0.61 shared    
##  8 emergenziale      0.953    0.0821    0.833     1.08    0.59 shared    
##  9 all'immigrazione  0.937    0.0857    0.801     1.08    0.45 shared    
## 10 umanitari         0.902    0.0718    0.789     1.04    0.2  Opposition
## 11 dell'immigrazione 0.901    0.0836    0.768     1.05    0.27 shared    
## 12 immigrazione      0.894    0.0899    0.756     1.04    0.34 Opposition
## 13 migranti          0.866    0.0870    0.729     1.02    0.16 Opposition
## 14 extracomunitari   0.853    0.0903    0.693     1.01    0.17 Opposition

plot_nns_ratio(x = out_before, alpha = 0.05, horizontal = F)

out_after <- plotfun(period = 3) # Sep - Dec 2015

## starting bootstraps 
## done with bootstraps 
## starting permutations 
## done with permutations 
## NOTE: values refer to the ratio Government/Opposition.

out_after

## # A tibble: 19 × 7
##    feature           value std.error lower.ci upper.ci p.value group     
##    <chr>             <dbl>     <dbl>    <dbl>    <dbl>   <dbl> <chr>     
##  1 destabilizzazione 1.43     0.232     1.12     1.83     0.07 Government
##  2 integrazione      1.32     0.209     1.00     1.69     0.13 Government
##  3 d'integrazione    1.27     0.201     1.01     1.65     0.19 Government
##  4 regolamentazione  1.23     0.210     0.910    1.52     0.35 Government
##  5 normativa         1.22     0.202     0.916    1.53     0.39 Government
##  6 emergenziale      1.19     0.191     0.874    1.50     0.41 Government
##  7 bossi-fini        1.19     0.193     0.948    1.52     0.42 Government
##  8 previdenziali     1.11     0.165     0.846    1.41     0.7  Government
##  9 schengen          0.993    0.187     0.674    1.27     1    Government
## 10 dell'immigrazione 0.875    0.135     0.648    1.08     0.56 shared    
## 11 all'immigrazione  0.857    0.114     0.665    1.02     0.42 Opposition
## 12 l'immigrazione    0.854    0.128     0.644    1.02     0.5  Opposition
## 13 immigrazione      0.842    0.126     0.638    1.04     0.45 Opposition
## 14 immigrazioni      0.751    0.112     0.590    0.951    0.27 Opposition
## 15 richiedenti       0.663    0.0788    0.533    0.792    0.03 Opposition
## 16 extracomunitari   0.660    0.0957    0.466    0.802    0.03 Opposition
## 17 migranti          0.645    0.0882    0.479    0.773    0.01 Opposition
## 18 chiediamo         0.555    0.107     0.374    0.711    0.01 Opposition
## 19 cittadini         0.529    0.0809    0.404    0.643    0    Opposition

plot_nns_ratio(x = out_after, alpha = 0.05, horizontal = F)

Results:

Both parliamentary camps discussed immigration in similar ways in early 2015, often sharing nearest neighbors such as emergency (emergenziale) or applicants (richiedenti); none of the ratios differs significantly from 1 at the 5% level.
In the later months of 2015, in contrast, the vocabularies of government and opposition parties diverge: five features (richiedenti, extracomunitari, migranti, chiediamo, cittadini) are significantly more characteristic of the opposition.
While opposition parties still seem to talk about immigration in more general terms (e.g. invoking terms lexically related to immigrazione), the nearest neighbors specific to government parties now point to normative challenges of immigration as well as legal constraints, e.g. the Schengen area or the “Bossi-Fini law”.

Train A locally

We now validate the performance of our pretrained ALC resources by comparing them against locally trained quantities.
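
To give an intuition for what "training A" involves, the base R sketch below recovers a known transformation matrix by weighted least squares: each word's pretrained vector is regressed on the average vector of its contexts, with weights that grow with the logarithm of the word's frequency. All data are simulated, and the setup is an assumption about the general idea behind compute_transform() with weighting = 'log', not a reproduction of its code.

```r
set.seed(42)
d <- 3; n_words <- 50
U <- matrix(rnorm(n_words * d), nrow = n_words)  # average context vector per word
A_true <- matrix(c(2, 0, 0,
                   0, 1, 0.5,
                   0, 0, 1), nrow = d, byrow = TRUE)
V <- U %*% t(A_true)                 # pretrained vector per word (noise-free toy case)
counts <- sample(10:1000, n_words, replace = TRUE)
w <- log(counts)                     # log-frequency weights, one per word

# weighted least squares: minimise sum over words of w * || v - A u ||^2
A_hat <- t(solve(t(U) %*% (w * U), t(U) %*% (w * V)))
round(A_hat, 3)
```

In this noise-free toy case the estimate matches `A_true` exactly. With a small corpus, the average context vectors are noisy and the estimate degrades.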

# ---------------------------------
# pretrained embeddings + local A
toks_fcm <- fcm(toks_nostop, context = "window", window = 5, count = "frequency")
localA <- conText::compute_transform(x = toks_fcm, pre_trained = word_vectors, weighting = 'log')

# ---------------------------------
# local embeddings + local A

# now estimate both the GloVe embeddings and A locally
# library(text2vec)
# estimate glove model using text2vec
# glovelocal <- GlobalVectors$new(rank = 300, 
#                                 x_max = 100,
#                                 learning_rate = 0.05)
# wv_main <- glovelocal$fit_transform(toks_fcm, n_iter = 10,
#                                     convergence_tol = 1e-3, 
#                                     n_threads = parallel::detectCores()) # set to 'parallel::detectCores()' to use all available cores
# 
# wv_context <- glovelocal$components
# locallocal_glove <- wv_main + t(wv_context) # word vectors
# saveRDS(locallocal_glove, "localglove_nostops_italianparliament.rds")

# read in the locally trained GloVe embeddings (saved by the commented-out code above)
locallocal_glove <- readRDS("localglove_nostops_italianparliament.rds")

locallocalA <- compute_transform(x = toks_fcm, pre_trained = locallocal_glove, weighting = 'log')


#------------------------------------------------------------------------------#
#                                 Nearest Neighbors
#------------------------------------------------------------------------------#

immig_toks <- tokens_context(x = toks_nostop, pattern = "immigrazione", window = 5L)

## 280 instances of "immigrazione" found.

feats <- featnames(dfm(immig_toks))
immig_dfm <- dfm(immig_toks)

###########################
## with local GloVe

# GloVE
nns_localglove <- find_nns(locallocal_glove['immigrazione',], 
                           pre_trained = locallocal_glove, 
                           N = 10,
                           candidates = feats) 
nns_localglove

##  [1] "immigrazione" "reato"        "clandestina"  "terrorismo"   "parlando"    
##  [6] "unico"        "riferimento"  "l'altro"      "previsto"     "tortura"

# GloVe ALC
immig_dem_local <- dem(x = immig_dfm, 
                         pre_trained = locallocal_glove, 
                         transform = TRUE, 
                         transform_matrix = locallocalA, 
                         verbose = TRUE)

# take the column average to get a single "corpus-wide" embedding
immig_wv_local <- colMeans(immig_dem_local)

# find nearest neighbors for overall ALC embedding
nns_localglove_alc <- find_nns(immig_wv_local, 
                               pre_trained = locallocal_glove, 
                               N = 10, 
                               candidates = feats)
nns_localglove_alc

##  [1] "quindi"  "fatto"   "infatti" "però"    "solo"    "così"    "proprio"
##  [8] "poi"     "invece"  "parte"

##################################
# with our FT quantities

nns_ft <- find_nns(word_vectors['immigrazione',], 
                   pre_trained = word_vectors, 
                   N = 10, 
                   candidates = feats[feats %in% rownames(word_vectors)]) 
nns_ft

##  [1] "immigrazione"      "dell'immigrazione" "emigrazione"      
##  [4] "all'immigrazione"  "immigrati"         "migranti"         
##  [7] "emigrati"          "criminalità"       "rifugiati"        
## [10] "esodo"

# fastText ALC (note: this overwrites the GloVe-based objects of the same name above)
immig_dem_local <- dem(x = immig_dfm, 
                         pre_trained = word_vectors, 
                         transform = TRUE, 
                         transform_matrix = local_fasttext, 
                         verbose = TRUE)

# take the column average to get a single "corpus-wide" embedding
immig_wv_local <- colMeans(immig_dem_local)

# find nearest neighbors for overall embedding
nns_ftalc <- nns(x = immig_wv_local, 
                 N = 10, 
                 candidates = feats, 
                 pre_trained = word_vectors, 
                 stem = F, 
                 as_list = FALSE, 
                 show_language = FALSE)

## Warning in nns(x = immig_wv_local, N = 10, candidates = feats, pre_trained =
## word_vectors, : the following canidates do not appear to have an embedding in
## the set of pre-trained embeddings provided: vergogniamo, soffermarmi,
## cardiello, ricordiamoci, credetemi, nell'emendamento, subemendamento, segnalo,
## assumete, piangiamo, rinviabili, esodati, recuperiamo, confrontarci,
## quest'aula, diciamolo, lasciatemelo, risolviamo, rimandiamo, serissimo, citavo,
## sottoscriviamo, quest'assemblea, concludo

nns_ftalc

## # A tibble: 10 × 4
##    target feature            rank value
##    <lgl>  <chr>             <int> <dbl>
##  1 NA     depenalizzazione      1 0.640
##  2 NA     reato                 2 0.616
##  3 NA     bossi-fini            3 0.616
##  4 NA     pregiudiziale         4 0.602
##  5 NA     dell'immigrazione     5 0.597
##  6 NA     criminalizzare        6 0.590
##  7 NA     depenalizzare         7 0.581
##  8 NA     terrorismo            8 0.579
##  9 NA     penale                9 0.578
## 10 NA     dell'illecito        10 0.575

knitr::kable(data.frame(
  nns_ft, 
  nns_ftalc$feature,
  nns_localglove, 
  nns_localglove_alc),
  format = "simple",
  booktabs = T,
  linesep = "",
  col.names = c("our fT", "our fT-ALC", "local GloVe", "local GloVe-ALC")
)

our fT	our fT-ALC	local GloVe	local GloVe-ALC
immigrazione	depenalizzazione	immigrazione	quindi
dell’immigrazione	reato	reato	fatto
emigrazione	bossi-fini	clandestina	infatti
all’immigrazione	pregiudiziale	terrorismo	però
immigrati	dell’immigrazione	parlando	solo
migranti	criminalizzare	unico	così
emigrati	depenalizzare	riferimento	proprio
criminalità	terrorismo	l’altro	poi
rifugiati	penale	previsto	invece
esodo	dell’illecito	tortura	parte

Translation

our fT	our fT-ALC	local GloVe	local GloVe-ALC
immigration	decriminalization	immigration	therefore
of immigration	crime	crime	fact
emigration	Bossi-Fini	illegal (undocumented)	in fact
to immigration	prejudicial	terrorism	however
immigrants	of immigration	speaking	only
migrants	to criminalize	unique	so
emigrants	to decriminalize	reference	own
crime	terrorism	the other	then
refugees	criminal	foreseen	instead
exodus	of the illegal	torture	part

Result:

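Our parliamentary corpus is too small to train high-quality embeddings and the corresponding ALC transformation matrix locally: the nearest neighbors of the local GloVe-ALC embedding are generic filler words (e.g. quindi, infatti, però). Our pretrained quantities, in contrast, seem to get the job done, returning substantively meaningful neighbors such as depenalizzazione, reato and bossi-fini.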

Elisa Wirsching

2024-11-11