# Generating scientific paper titles from arXiv data with GPT-2

<br>
<br>
<br>



Authors: *Krzysztof Wodecki, Hubert Tacik, Karol Waligóra*



#### Kraków, 2022

## GPT-2
#### Generative Pre-trained Transformer 2

An open-source project released by OpenAI in 2019.
GPT-2 is a general-purpose natural language processing model. It is built on a deep neural network and can be used for:
* translating text
* answering questions
* summarizing text
* generating output text from an input prompt

## Downloading the data
We use the ready-made *arxivscraper* library to pull records from arXiv, a repository that hosts articles from many scientific disciplines. The downloaded articles can be filtered by category and by date range. The titles are then saved to a CSV file for use during training.

In [2]:
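import arxivscraper as ax
import numpy as np

# Harvest computer-science records and keep those whose categories match 'cl';
# t is the wait in seconds before retrying after a 503 response.
scraper = ax.Scraper(category='cs', date_from='2022-01-01',
                     date_until='2022-07-01', t=10,
                     filters={'categories':['cl']})

output = scraper.scrape()

# Collapse line breaks and repeated whitespace inside each title.
titles = [' '.join(i['title'].split()) for i in output]
# Write one title per line.
np.savetxt('titles_ref.csv', np.array(titles), fmt='%s')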

http://export.arxiv.org/oai2?verb=ListRecords&from=2022-01-01&until=2022-07-01&metadataPrefix=arXiv&set=cs
fetching up to  1000 records...
fetching up to  2000 records...
fetching up to  3000 records...
fetching up to  4000 records...
fetching up to  5000 records...
fetching up to  6000 records...
fetching up to  7000 records...
fetching up to  8000 records...
fetching up to  9000 records...
fetching up to  10000 records...
fetching up to  11000 records...
Got 503. Retrying after 10 seconds.
fetching up to  11000 records...
fetching up to  12000 records...
Got 503. Retrying after 10 seconds.
fetching up to  12000 records...
fetching up to  13000 records...
Got 503. Retrying after 10 seconds.
fetching up to  13000 records...
fetching up to  14000 records...
fetching up to  15000 records...
fetching up to  16000 records...
fetching up to  17000 records...
Got 503. Retrying after 10 seconds.
fetching up to  17000 records...
fetching up to  18000 records...
fetching up to  19000 records...
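
The export step above writes one title per line without any quoting. If the file is later parsed as real CSV, a title that contains a comma could be split across columns. Below is a minimal sketch of a quoting-aware export that uses only the standard library. The function names and the `title` header are hypothetical, and whether the downstream loader expects a header row and reads only the first column is an assumption that should be checked against the installed tooling.

```python
import csv


def normalize_title(raw):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(raw.split())


def write_titles_csv(titles, path):
    """Write one title per row under a header row.

    csv.writer quotes any title that contains a comma or a quote character.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])  # hypothetical header name
        for title in titles:
            writer.writerow([normalize_title(title)])


def read_titles_csv(path):
    """Read the titles back, skipping the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [row[0] for row in reader]
```

Reading the file back with `read_titles_csv` is a quick way to confirm that every title survived the round trip unchanged.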

## Displaying the downloaded data
The downloaded records are displayed with the *pandas* library.

In [3]:
import pandas as pd
cols = ('id', 'title', 'categories', 'authors')
df = pd.DataFrame(output,columns=cols)

df

| | id | title | categories | authors |
|---|---|---|---|---|
| 0 | 1403.1773 | finding eyewitness tweets during crises | cs.cl cs.cy | [morstatter, lubold, pon-barry, pfeffer, liu] |
| 1 | 1606.06361 | a probabilistic generative grammar for semanti... | cs.cl cs.lg stat.ml | [saparov] |
| 2 | 1609.03528 | the microsoft 2016 conversational speech recog... | cs.cl eess.as | [xiong, droppo, huang, seide, seltzer, stolcke... |
| 3 | 1609.05935 | advances in all-neural speech recognition | cs.cl | [zweig, yu, droppo, stolcke] |
| 4 | 1703.08748 | lepor: an augmented machine translation evalua... | cs.cl | [han] |
| ... | ... | ... | ... | ... |
| 4691 | cs/0006023 | dialogue act modeling for automatic tagging an... | cs.cl | [stolcke, ries, coccaro, shriberg, bates, jura... |
| 4692 | cs/0006036 | prosody-based automatic segmentation of speech... | cs.cl | [shriberg, stolcke, hakkani-tur, tur] |
| 4693 | cs/0010012 | finding consensus in speech recognition: word ... | cs.cl | [mangu, brill, stolcke] |
| 4694 | cs/0105037 | integrating prosodic and lexical cues for auto... | cs.cl | [tur, hakkani-tur, stolcke, shriberg] |
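
The `categories` column above holds space-separated category codes, so one record can belong to several categories at once. The sketch below shows one way to count categories and filter records locally in plain Python. The helper names are hypothetical, and the sample records reuse only the `id` and `categories` values visible in the first rows of the output.

```python
from collections import Counter

# The first five records from the output above (id and categories only).
records = [
    {"id": "1403.1773", "categories": "cs.cl cs.cy"},
    {"id": "1606.06361", "categories": "cs.cl cs.lg stat.ml"},
    {"id": "1609.03528", "categories": "cs.cl eess.as"},
    {"id": "1609.05935", "categories": "cs.cl"},
    {"id": "1703.08748", "categories": "cs.cl"},
]


def has_category(record, category):
    """True if the record lists exactly this category code."""
    return category in record["categories"].split()


def category_counts(records):
    """Count how many records mention each category code."""
    counts = Counter()
    for record in records:
        counts.update(record["categories"].split())
    return counts


print(category_counts(records).most_common())
print([r["id"] for r in records if has_category(r, "cs.lg")])
```

Splitting on whitespace before comparing avoids accidental substring matches between similar category codes.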


## Downloading the 117M GPT-2 model
The 117M model is one of several pretrained GPT-2 models that can be downloaded through the *gpt_2_simple* library. It has 12 transformer layers, a hidden size of 768, 12 attention heads, and roughly 117 million parameters. It is the smallest of the available pretrained models.

In [4]:
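import gpt_2_simple as gpt2
import os.path

model_name = "117M"

# Check for this model's own folder; checking only ./models would skip the
# download whenever any other model had already been fetched.
if not os.path.isdir(os.path.join("models", model_name)):
    gpt2.download_gpt2(model_name=model_name)
    # the model is saved in the current directory under models/117M/
else:
    print("Model already exists.")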

Fetching checkpoint: 1.05Mit [00:00, 349Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:01, 914kit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 526Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [09:11, 903kit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 1.05Git/s]                                               
Fetching model.ckpt.meta: 1.05Mit [00:00, 1.26Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 1.25Mit/s]                                                      
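
The size of the smallest GPT-2 model can be checked with a short calculation. The sketch below counts weights and biases for a GPT-2 style decoder with 12 layers and a hidden size of 768. The vocabulary size (50257) and context length (1024) are not stated in this notebook; they are assumed here from the published GPT-2 configuration, and the function name is hypothetical.

```python
def gpt2_parameter_count(n_layer=12, n_embd=768, n_vocab=50257, n_ctx=1024):
    """Count weights and biases of a GPT-2 style decoder.

    Assumes the output layer reuses the token embedding matrix,
    so it adds no parameters of its own.
    """
    token_embedding = n_vocab * n_embd
    position_embedding = n_ctx * n_embd

    # Attention: fused query/key/value projection plus the output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward block: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, each with a gain and a bias vector.
    layer_norms = 2 * 2 * n_embd

    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * n_embd
    return token_embedding + position_embedding + n_layer * per_layer + final_layer_norm


print(gpt2_parameter_count())  # → 124439808
```

Under these assumptions the total is about 124 million, which is why the same model is often labelled 124M elsewhere, while this notebook uses the 117M name.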


## Training the model
The model is then fine-tuned on the titles downloaded from *arXiv*. Training is set to 1000 steps; every 10 steps a checkpoint of the model is saved and a sample of generated titles is written out. Training can be interrupted at any time, and interrupting it triggers a save of the model. On later runs the saved model can simply be loaded, so there is no need to retrain it each time the script is started.

In [5]:
import gpt_2_simple as gpt2

model_name = "117M"

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'titles_ref.csv',
              model_name=model_name,
              steps=1000,
              #reuse=True,
              save_every=10, # save a checkpoint of the fine-tuned model every 10 steps
              sample_every=10) # print a sample of generated text every 10 steps

Loading checkpoint models\117M\model.ckpt
INFO:tensorflow:Restoring parameters from models\117M\model.ckpt


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.45it/s]

Loading dataset...





dataset has 135462 tokens
Training...
[1 | 39.58] loss=2.66 avg=2.66
[2 | 75.13] loss=2.64 avg=2.65
[3 | 108.83] loss=2.59 avg=2.63
[4 | 141.93] loss=2.44 avg=2.58
[5 | 176.29] loss=2.50 avg=2.57
[6 | 209.11] loss=2.47 avg=2.55
[7 | 242.01] loss=2.43 avg=2.53
[8 | 275.01] loss=2.34 avg=2.51
[9 | 308.01] loss=2.33 avg=2.49
[10 | 341.03] loss=2.29 avg=2.47
Saving checkpoint\run1\model-10
 knowledge. I have a great feeling that it'll be the next great movie and the only one you'll ever see.
For example:
The fact that the last few scenes of the show never end at a certain point seems like a longshot, right? But the truth of my mind is the audience's view of the first two scenes doesn't change over time and is very different from what the audience always seems to view. So we only see four of the first seven scenes with the audience. That's actually more of a problem as it is for all of the first fourteen scenes. We've never seen the audience just looking around and making the decisions for 

[29 | 1141.65] loss=2.10 avg=2.25
[30 | 1174.98] loss=2.05 avg=2.24
Saving checkpoint\run1\model-30
<<|startoftext|>trans-labeling for text-based dialog generation<|endoftext|>
<|startoftext|>a unified method for evaluating the impact of style-related knowledge transfer and text captioning on self-aware bert<|endoftext|>
<|startoftext|>a preface for conversational pre-trained languages: a review<|endoftext|>
<|startoftext|>a framework for text classification based on sentence ordering<|endoftext|>
<|startoftext|>sparsely: a robust corpus of low-resource text-to-speech models<|endoftext|>
<|startoftext|>improved language understanding training for learning from natural language<|endoftext|>
<|startoftext|>text-to-speech-to-text transformer based text-based bert training<|endoftext|>
<|startoftext|>a multi-modal approach to graph pre-distributed network for text classification<|endoftext|>
<|startoftext|>a model-agnostic approach for text-centric text classification with multi-modal tran

[46 | 1865.03] loss=2.00 avg=2.14
[47 | 1899.10] loss=1.80 avg=2.13
[48 | 1932.90] loss=1.94 avg=2.13
[49 | 1968.34] loss=1.95 avg=2.12
[50 | 2003.77] loss=1.93 avg=2.12
Saving checkpoint\run1\model-50
texttext|>subunit: a new toolkit for studying machine translation<|endoftext|>
<|startoftext|>possible explanations of model errors in english language models<|endoftext|>
<|startoftext|>concentric learning towards scalable learning towards fine-tuning of latent-variance network<|endoftext|>
<|startoftext|>districts and pretrained representations for neural language analysis<|endoftext|>
<|startoftext|>a case study on the importance of covid-19 statistics in studying hate speech in the quran forums<|endoftext|>
<|startoftext|>extracting language-aware subtitles with low-cost and high-cost transformer<|endoftext|>
<|startoftext|>learning to express human emotions with an effective language model<|endoftext|>
<|startoftext|>a few-shot question answering machine learning algorithm for natur
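
Each log line above reports the loss of the current step and a smoothed `avg` value. The sketch below shows one way such a running average can be computed: a bias-corrected exponential moving average. The decay of 0.99 and the function name are assumptions rather than something stated in this notebook. The formula reproduces the first logged values; later values may differ in the last digit, because the logged losses are themselves rounded.

```python
def running_average(losses, decay=0.99):
    """Bias-corrected exponential moving average, one value per step."""
    weighted_sum = 0.0
    weight = 0.0
    averages = []
    for loss in losses:
        # Older losses are discounted by `decay` at every step.
        weighted_sum = weighted_sum * decay + loss
        # Dividing by the accumulated weight removes the start-up bias.
        weight = weight * decay + 1.0
        averages.append(weighted_sum / weight)
    return averages


# Loss values of the first four steps in the training log above.
losses = [2.66, 2.64, 2.59, 2.44]
print([f"{a:.2f}" for a in running_average(losses)])
# → ['2.66', '2.65', '2.63', '2.58']
```

Because of the smoothing, `avg` falls more slowly than `loss`, which makes it a steadier signal for deciding when to stop training.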

## Displaying the titles generated during training
The titles generated during training are read from the sample file, and the special-token strings are stripped out. The resulting titles are shown below.

In [6]:
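sample_file = 'samples/run1/samples-51'
with open(sample_file, 'r') as f:  # close the file once it has been read
    t = f.read()

# strip the special tokens that wrap each title
for s in ['endoftext', 'startoftext', '<|', '|>']:
    t = t.replace(s, '')
# title-case the text and skip its first line
for title in t.title().split('\n')[1:]:
    if title:  # ignore empty lines
        print('- ' + title)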

- Texttextsubunit: A New Toolkit For Studying Machine Translation
- Possible Explanations Of Model Errors In English Language Models
- Concentric Learning Towards Scalable Learning Towards Fine-Tuning Of Latent-Variance Network
- Districts And Pretrained Representations For Neural Language Analysis
- A Case Study On The Importance Of Covid-19 Statistics In Studying Hate Speech In The Quran Forums
- Extracting Language-Aware Subtitles With Low-Cost And High-Cost Transformer
- Learning To Express Human Emotions With An Effective Language Model
- A Few-Shot Question Answering Machine Learning Algorithm For Natural Language Identification
- A Generalisation Of Bert Models
- On The Importance Of Covid–19 Data In Hate Speech Detection
- Conforming A Simple To Model Neural Network To Generate Large Scale Models
- Tendency Estimation For Text-To-Audio Conversion
- A Benchmark For Evaluating The Feasibility Of Pretrained And Distanced Models For Document Consistency Checking
- Constraining Pret
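
The clean-up above removes the token strings by plain replacement, so fragments cut off at the start or end of a sample (such as `Texttextsubunit...` or `Constraining Pret`) are still printed. One possible alternative, sketched below, keeps only titles wrapped in both a start and an end token. The function name is hypothetical, and the sample text is copied from the training output at step 50.

```python
import re

# A complete title sits between a start token and an end token on one line.
TITLE_PATTERN = re.compile(r"<\|startoftext\|>(.*?)<\|endoftext\|>")


def extract_complete_titles(sample_text):
    """Return only the titles that have both a start and an end token."""
    matches = TITLE_PATTERN.findall(sample_text)
    return [m.strip() for m in matches if m.strip()]


sample = (
    "texttext|>subunit: a new toolkit for studying machine translation<|endoftext|>\n"
    "<|startoftext|>possible explanations of model errors in english language models<|endoftext|>\n"
    "<|startoftext|>a few-shot question answering machine learning algorithm for natur"
)

for title in extract_complete_titles(sample):
    print("- " + title)
# → - possible explanations of model errors in english language models
```

The first and last lines of the sample are dropped because each of them is missing one of the two tokens.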

## Generating a single title
Once the model has been trained, it can be loaded to generate and display new titles.

In [None]:
# start a gpt2 session and load the already fine-tuned model for generation
import gpt_2_simple as gpt2
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

In [2]:
text = gpt2.generate(sess,
              length=400,
              temperature=0.7,
              nsamples=1,
              batch_size=1,
              return_as_list=True)


t = text[0].title()
t = t.replace('<|Startoftext|>', '').replace('\n', '') # remove extraneous stuff
t = t.split('<|Endoftext|>')[0] # only get one title; also works if no end token was generated
t = t.replace('|', '').replace('>', '').replace('<','')
print(t)

The Cognitive And Emotional Aspect Of Social Media Use: A Meta-Analysis
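
The `temperature=0.7` argument in the call above controls how adventurous the sampling is. As a generic illustration, and not a copy of the internal code of *gpt_2_simple*, the sketch below shows how dividing the model's raw scores by a temperature below 1 concentrates probability on the most likely token. The scores are made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.7):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Values closer to 0 make the output more repetitive and predictable, while values above 1 make it more varied and more error-prone.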


# Thank you for your attention.

## Sources

1. https://github.com/csinva/gpt-paper-title-generator
2. https://github.com/Mahdisadjadi/arxivscraper
3. https://arxiv.org/category_taxonomy
4. https://en.wikipedia.org/wiki/GPT-2
